
#### The Math of Intelligence #1: Intro

Hello World! It’s Siraj.

And welcome to “The Math of Intelligence”
[Writer/Director: Siraj]

For the next 3 months, we’re going to take a journey

through the most important math concepts that underlie machine learning.

That means all the concepts you need from the great disciplines

of calculus, linear algebra, probability theory, and statistics.

The prerequisites are knowing basic python syntax and algebra.

Every single algorithm we code

will be done without using any popular machine learning library,

because the point of this course is to help you

build a solid mathematical intuition around building algorithms

that can learn from data.

I mean let’s face it,

you could just use a black box API for all this stuff,

but if you have the intuition

you’ll know exactly which algorithm to use for the job.

Or even how to make your own from scratch.

As humans, we are constantly receiving data through our five senses

and somehow we’ve got to make sense of all this chaotic input

so that we can survive.

Thanks to the evolutionary process

we’ve developed brains capable of doing this.

We’ve got the most precious resource in the universe: intelligence,

the ability to learn and apply knowledge.

One way to measure our intelligence

against the rest of the animal kingdom is using a ladder.

Ours is indeed the most generalized type of intelligence,

capable of being applied to the widest variety of tasks.

But that doesn’t mean that we are necessarily the best kind of intelligence.

In the 1960s, a primate researcher named

Dr. Jane Goodall concluded that

chimpanzees had been living in the forest for hundreds of thousands of years

without overpopulating or destroying their environment at all.

Orcas have the ability to sleep with one hemisphere of their brain at a time,

which allows them to recuperate,

while being aware of their surroundings.

In some ways animals are more intelligent than us.

Intelligence consists of many dimensions.

Think of it like a multi-dimensional space of possibility.

When building an AI,

the human brain is a great road map,

after all, neural networks have achieved

state of the art performance in countless tasks,

but it’s not the only road map,

there are many possible types of intelligence out there that we can and will create.

Some will seem familiar to us, and some very alien.

Thinking in a way we’ve never done before.

Like when AlphaGo played move 37.

Even the best Go players in the world were stunned at the move.

It went against everything we’ve learned about the game from millennia of practice,

but it turned out to be an objectively better strategy that led to its win.

The many different types of intelligence are like symphonies,

each comprising different instruments, and

these instruments vary, not just in their dynamics

but in their pitch and tempo and color and melody.

The amount of data that we’re generating is growing really fast.

No I mean really, REALLY fast!

In the time since you started watching this video

enough data was generated for you to spend an entire lifetime analyzing.

And only 0.5% of all data ever generated is analyzed.

Creating intelligence isn’t just a nice to have, it’s a necessity.

Put in the right hands it will help us solve problems

we never dreamed could be possible to solve.

So where do we start?

At its core, machine learning is all about mathematical optimization.

This is a way of thinking.

Every single problem can be broken down into an optimization problem.

Once we have some data set that acts as our input,

we’ll build a model that uses that data to optimize for an objective,

a goal that we want to reach.

And the way it does this is by minimizing some error value that we define.

One example problem could be, “what should I wear today?”

I could frame this as optimizing for stylishness, instead of say, comfort,

then define an error that I want to minimize

as the number of negative ratings a group of people give me.

Or even what’s the best design for my iOS app’s homepage.

Rather than hardcoding in some elements,

I could find a data set of app designs and their ratings from users.

If I want to optimize for a design that would be the highest rated

I would learn the mapping between design styles and ratings.

This is the way that every single layer of the stack will be built in the future.

Sometimes our data is labeled,

sometimes it isn’t,

there are different techniques we can use to find patterns in this data.

And sometimes optimizing for an objective can happen

not through the frame of pattern recognition but

through the exploration of many possibilities and seeing what works and what doesn’t.

There are many ways that we can frame the learning process,

but the easiest way to learn is when we use labeled data.

Mathematically speaking we have some input.

There‘s a domain, X, where every point of X has features that we observe.

Then we have a label set Y.

So the data consists of a set of labeled examples that we can denote this way.
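
The on-screen notation isn’t reproduced in this transcript, but one standard way to write a labeled training set of n examples (n here is just the example count, not a symbol from the video) is:

```latex
S = \bigl((x_1, y_1), \ldots, (x_n, y_n)\bigr) \in (X \times Y)^n
```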

The output, then, would be a prediction rule.

So given a new X value, what’s its associated Y value?

We’ve gotta learn this mapping, given data drawn from some unknown distribution over X,

to be able to answer this.

So we have to measure some error function that acts as a performance metric.

So what we’d do is choose from a number of possible models

to represent this function.

We’ll initially set some parameter values to represent the mapping,

then we’d evaluate the initial result,

measure the error, update the parameters,

and repeat this process optimizing the model again and again

until it fully learns the mapping.
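
That loop can be sketched in a few lines of plain Python. The one-parameter model y = w * x, the made-up data, the squared-error metric, and the numeric slope estimate are all illustrative choices, not details from the video:

```python
# Minimal concrete version of the loop: set parameters, evaluate,
# measure the error, update, repeat.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]  # made-up (input, label) pairs

def error(w):
    # Sum of squared differences between labels and predictions w * x.
    return sum((y - w * x) ** 2 for x, y in data)

w = 0.0                       # step 1: set an initial parameter value
for _ in range(200):          # repeat: evaluate, measure, update
    eps = 1e-6
    slope = (error(w + eps) - error(w)) / eps  # how the error changes with w
    w -= 0.01 * slope         # nudge w in the direction that lowers the error
```

After the loop, w has settled near the value that minimizes the summed squared error for this toy data.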

Was it convex or concave functions that were easier to optimize? I think convex.

I really hope my lab partner is epic at optimization.

I guess I should be thankful,

not many data scientists get a grant from CERN to detect the Higgs boson.

What was her name again? Eloise, I think.

Yup, she did win an award at ICML. I wonder if she’s cute?

No, that doesn’t matter. I am not going to mix business and pleasure, not this time.

Suppose I’ve got a bunch of data points.

These are just toy data points,

like what Apple probably trained Siri on.

They’re all x-y value pairs where x represents the distance a person bikes,

and y represents the amount of calories they lost.

We can just plot them on a graph like so.

We want to be able to predict the calories lost for a new person

given their biking distance.

How should we do this?

Well we could try to draw a line that fits through all the data points

but it seems like our points are too spaced out for

a straight line to pass through all of them.

So we can settle for drawing the line of best fit,

a line that comes as close to all the data points as possible.

Algebra tells us that the equation for a straight line is of the form y = mx + b.

Where m represents the slope or steepness of the line

and b represents its y-axis intercept point.

We want to find the optimal values for b and m such that

the line fits the points as closely as possible, so given any new x value,

we can plug it into our equation and it’ll output the most likely y value.

Our error metric can be a measure of closeness, which we can define like this.

So let’s start off with a random b and m value and plot this line.

For every single data point we have,

let’s calculate the y value our line predicts for it.

Then we’ll subtract the actual y value from it to measure the distance between the two.

We’ll want to square this error to make our next steps easier.

Once we sum all these values we get a single value

that represents our error given that line we just drew.
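
That error calculation can be written out directly. The toy biking data below is made up for illustration (it’s not the video’s dataset), and note the video sums the squared errors rather than averaging them:

```python
def sum_squared_error(points, b, m):
    """Sum of squared vertical distances between each point
    and the line y = m * x + b."""
    total = 0.0
    for x, y in points:
        predicted = m * x + b      # y value our line predicts
        total += (y - predicted) ** 2
    return total

# Made-up (distance biked, calories lost) pairs.
points = [(1.0, 52.0), (2.0, 110.0), (3.0, 145.0), (4.0, 210.0)]
err = sum_squared_error(points, b=0.0, m=50.0)  # error for one candidate line
```

Each candidate (b, m) pair we try gives one such number, and those numbers are exactly the heights of the 3D error surface described next.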

Now if we did this process repeatedly, say 666 times,

for a bunch of different randomly drawn lines,

we could create a 3D graph

that shows the error value for every associated b and m value.

Notice how there is a valley in this graph.

At the bottom of this valley, the error is at its smallest.

And so the associated b and m values would be the line of best fit,

where the distance between all our data points and our line would be the smallest!
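
One crude way to see that valley floor is to brute-force the error surface on a coarse grid of (b, m) values and keep the lowest point found. The toy data and grid ranges here are illustrative choices, not from the video:

```python
# Evaluate the summed squared error at every grid point and keep the minimum.
points = [(1.0, 52.0), (2.0, 110.0), (3.0, 145.0), (4.0, 210.0)]

def sse(b, m):
    return sum((y - (m * x + b)) ** 2 for x, y in points)

best_err, best_b, best_m = min(
    (sse(b / 10, m / 10), b / 10, m / 10)
    for b in range(-100, 101)   # b from -10.0 to 10.0 in steps of 0.1
    for m in range(0, 1001)     # m from 0.0 to 100.0 in steps of 0.1
)
# (best_b, best_m) is the bottom of the valley on this grid.
```

This works, but it evaluates over 200,000 candidate lines with no sense of direction, which is exactly the inefficiency the next section fixes.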

But how do we find it?

Well we’ll need to try out a bunch of different lines to create this 3D graph.

But rather than just randomly drawing lines over and over again with no signal,

what if we could do it in a more efficient way,

such that each successive line we draw brings us closer and closer

to the bottom of this valley.

We need a direction, a way to descend this valley.

What if, for a given function, we could find its slope at a given point?

Then that slope would point in a certain direction, towards the minima of the graph.

And when we re-draw our line over and over again

we could do so using the slope as our compass,

as our guide on how best to redraw as we

(walk through the valley of the shadow of death)

towards the minima until our slope approaches 0.

In calculus, we call this slope the derivative of a function.

Since we are updating 2 values, b and m,

we want to calculate the derivative with respect to both of them: the partial derivative.

The partial derivative with respect to a variable

means that we calculate the derivative for that variable

while treating the other variables as constants.

So we’ll compute the partial derivative with respect to b.

Then the partial derivative with respect to m.

To do this we use the power rule.

We multiply the exponent by the coefficient and subtract 1 from the exponent.
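
Applying that (together with the chain rule for the inner expression, which the video folds into the same step) to the summed squared error E(b, m) = Σᵢ (yᵢ − (m xᵢ + b))² gives the two partial derivatives:

```latex
\frac{\partial E}{\partial b} = \sum_i -2\,\bigl(y_i - (m x_i + b)\bigr),
\qquad
\frac{\partial E}{\partial m} = \sum_i -2\,x_i\,\bigl(y_i - (m x_i + b)\bigr)
```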

Once we have these 2 values we can update both of these parameters

from our function by subtracting them from our existing b and m values.

And we just keep doing that for a set number of iterations that we pre-define.
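
Those repeated updates can be sketched as a short Python routine. The toy data, the learning rate, and the iteration count below are illustrative choices, not values from the video:

```python
def gradient_step(points, b, m, learning_rate):
    """One update: compute both partial derivatives of the summed
    squared error, then move b and m a small step downhill."""
    db = 0.0
    dm = 0.0
    for x, y in points:
        residual = y - (m * x + b)
        db += -2 * residual        # partial derivative w.r.t. b
        dm += -2 * x * residual    # partial derivative w.r.t. m
    return b - learning_rate * db, m - learning_rate * dm

def gradient_descent(points, b=0.0, m=0.0, learning_rate=0.01, iterations=1000):
    # Repeat the update for a pre-defined number of iterations.
    for _ in range(iterations):
        b, m = gradient_step(points, b, m, learning_rate)
    return b, m

# Made-up points lying exactly on y = 2x + 1, so the learned
# b and m should approach 1 and 2.
b, m = gradient_descent([(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)])
```

Note that the learning rate matters: too large and the steps overshoot the valley and diverge; too small and descent crawls.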

So this optimization technique that we just performed is called gradient descent

and it’s the most popular one in machine learning.

So what do you need to remember from this video? 3 points.

The derivative is the slope of a function at a given point,

the partial derivative is the slope with respect to one variable in that function.

We can use them to compose a gradient

which, followed downhill, leads us toward a local minimum of a function.

And gradient descent is a very popular optimization strategy in machine learning

that uses the gradient to do this.
[Announcing the coding challenge winner]

Now it’s your turn. I’ve got a coding challenge for you.

Implement gradient descent on your own on a different dataset that I’ll provide.

Check out the GitHub link for details,

the winner will be announced in a week.

Please subscribe for more programming videos

and for now I’ve gotta go memorize the power rule

so thanks for watching 🙂