
#### Second Order Optimization - The Math of Intelligence #2

[Music Playing]

Hello World! It’s Siraj.

[Music Playing]

There are thousands of languages spoken across the world,

each one unique in its ability to represent concepts and convey ideas.

But there is one language

that is shared by all humans,

regardless of where you come from:

Mathematics.

You possess the ability to understand this language of numbers

that connects us all, across continents and time.

Like all languages, fluency requires practice.

But unlike any other language,

the more fluent you become in math,

the more unstoppable you’ll be in anything you want to do in life.

Math is happening all around us,

to a degree most people don’t realize.

We can think of everything as a set of variables, as metrics.

And there exists relations between all of these variables.

In math, we call these relations functions.

It’s our way of representing a set of patterns,

a mapping, a relationship between many variables.

No matter what machine learning model we use,

no matter what dataset we use,

the goal of machine learning is to optimize for an objective.

And by doing so, we are approximating a function.

The process of optimization

helps us iteratively discover the functions hidden in the depth of data.

Last week, we talked about a popular optimization technique: gradient descent.

This can be broken down into a 5-step process.

First, we define some machine learning model

with a set of initial weight values.

These act as the coefficients of the function that the model represents.

The mapping between input data and output predictions.

These values are naive,

we have no idea what they should actually be,

but we’re trying to discover the optimal ones.

We’ll define an error function,

and when we plot the graph of the relationship between all the possible error values,

and all the possible weight values for our function,

we’ll see that there exists a valley, the minima.

We’ll use our error to help us compute

the partial derivative with respect to each weight value we have

and this gives us our gradient.

The gradient represents the change in the error

when the weights are changed by a very small value from their original value.

We use the gradient to update the values of our weights in a direction

such that the error is minimized,

iteratively coming closer and closer to the minima of the function.

We step our solution in the negative direction of the gradient repeatedly.

When we reach it,

we have learned the optimal weight values for our model,

where our gradient is equal to zero.

Our model will then be able to make predictions for input data it’s never seen before.
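As a rough sketch of that 5-step loop (a toy linear model with made-up data and learning rate, not the notebook from the video), it could look something like this in Python:

```python
import numpy as np

# Step 1: define a model y = w * x with a naive initial weight value.
w = 0.0

# Toy data sampled from y = 2x plus noise (hypothetical, for illustration only).
x = np.linspace(0, 1, 50)
y = 2.0 * x + np.random.normal(0, 0.1, size=x.shape)

learning_rate = 0.1
for step in range(100):
    predictions = w * x
    # Step 2: error function (mean squared error).
    error = np.mean((predictions - y) ** 2)
    # Step 3: partial derivative of the error with respect to the weight (the gradient).
    gradient = np.mean(2 * (predictions - y) * x)
    # Step 4: step the weight in the negative direction of the gradient.
    w -= learning_rate * gradient

# Step 5: the learned weight can now be used to predict unseen inputs.
print(w, error)
```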

Most optimization problems can be solved using gradient descent and its variants.

They all fall into a category called first order optimization methods.

We call them first order

because they only require us to compute the first derivative.

But there’s another class of techniques that aren’t as widely used

called second order optimization methods

that require us to compute the second derivative.

The first derivative tells us

if the function is increasing or decreasing at a certain point,

and the second derivative tells us

if the first derivative is increasing or decreasing,

which hints at its curvature.

First order methods provide us with a line

that is tangential to a point on an error surface,

and second order methods provide us with a quadratic surface

that kisses the curvature of the error surface.

Haha. Get a room, you two.

The advantage then of second order methods

is that they don’t ignore the curvature of the error surface.

And in terms of step-wise performance, they are better.
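In symbols (my notation, not shown on screen), the tangent line and the quadratic are just the first and second order Taylor approximations of the error function around the current point x_n:

```latex
% First order: the tangent line at x_n
f(x) \approx f(x_n) + f'(x_n)\,(x - x_n)

% Second order: a quadratic that also matches the curvature at x_n
f(x) \approx f(x_n) + f'(x_n)\,(x - x_n) + \tfrac{1}{2}\,f''(x_n)\,(x - x_n)^2
```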

Let’s look at a popular second order optimization technique called Newton’s method,

named after the dude who invented calculus.

Whose name was…

There are actually two versions of Newton’s method.

The first version is for finding the roots of a polynomial,

all those points where it intersects the x-axis.

So if you threw a ball and recorded its trajectory,

finding the root of the equation would tell you exactly what time it hits the ground.

The second version is for optimization, and it’s what we use in machine learning.

But let’s code the root-finding version first

to develop some basic intuition.

You must be Siraj, the famous data scientist.
– Nice to meet you. – Nice to meet you too.
– Any clues on how to predict with our neural network?
– Not yet, this is gonna be quite a challenge.
– We gotta perform anomaly detection on a 500 TB dataset.
– I was thinking, just use Newton’s method to perform logistic regression first,
– to see if we can predict part of it.
– Newton’s method?
– Yeah, it’s a form of optimization,

I already built a prototype, which has been training for a few hours.
– As a machine learner, I’m impressed. – You should be.

Let’s say we have a function f of x

and some initial guessed solution.

Newton’s method says that we first find the slope of the tangent line at our guess point,

then find the point at which the tangent line intersects the x axis.

We’ll use that point to find its projection in the original function.

Then we iterate again from our first step,

this time replacing our first point with this one.

We keep iterating and eventually we’ll stop
when the change in our current value of x is less than or equal to a threshold.
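Here’s a minimal sketch of that loop (my own toy example of finding the square root of 2, not the code from the video):

```python
# Root-finding Newton's method: iterate x -> x - f(x)/f'(x)
# until the change in x falls below a threshold.

def newtons_method_root(f, f_prime, x, tolerance=1e-7, max_iterations=100):
    for _ in range(max_iterations):
        step = f(x) / f_prime(x)      # where the tangent line hits the x axis
        x = x - step                  # project back onto the function and repeat
        if abs(step) <= tolerance:    # stop once the update is tiny
            break
    return x

# Example: solve x^2 - 2 = 0, i.e. find the square root of 2.
root = newtons_method_root(lambda x: x**2 - 2, lambda x: 2 * x, x=1.0)
print(root)  # ~1.41421356
```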

So that’s the root finding version of Newton’s Method,

where we’re trying to find where the function equals zero.

But in the optimization version,

we’re trying to find where the derivative of the function equals zero, its minima.

At a high level, given a random starting location,

we construct a quadratic approximation to the objective function

that matches the first and second derivative values at that point.

And then we minimize that quadratic function.

The minimizer of the quadratic function is used as the starting point in the next step

and we repeat this process iteratively.

OK, so let’s go over two cases of Newton’s method for optimization to learn more.

A 1D case and a 2D case.

In the first case we’ve got a 1 dimensional function.

We can obtain a quadratic approximation at a given point of the function

using what’s called a Taylor series expansion,

neglecting terms of order three or higher.

A Taylor series is a representation of a function

as an infinite sum of terms.

They are calculated from the values of the function’s derivatives at a single point.

It was invented by an English mathematician named Brook Taylor.

Swift. Just kidding.

So we’d take the second order Taylor series for our initial point x,

and minimize it by taking its derivative, which involves the first and second derivatives of our function,

and setting it equal to zero.

In order to find the minimum x value,

we iterate this process.
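Written out (my notation, not shown in the video), minimizing the second order Taylor approximation around the current point x_n by setting its derivative to zero gives the update rule we iterate:

```latex
f(x) \approx f(x_n) + f'(x_n)\,(x - x_n) + \tfrac{1}{2}\,f''(x_n)\,(x - x_n)^2

f'(x_n) + f''(x_n)\,(x_{n+1} - x_n) = 0
\quad\Longrightarrow\quad
x_{n+1} = x_n - \frac{f'(x_n)}{f''(x_n)}
```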

In the second case,

let’s say we’ve got a function of multiple dimensions.

We can find the minimum of it, using the same approach,

except for 2 changes,

we replace the first derivatives with a gradient,

and the second derivatives with a Hessian.

A Hessian is a matrix of the second order partial derivatives of a scalar-valued function,

and it describes the local curvature of a multivariable function.
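As a rough sketch (a toy quadratic bowl I made up, not code from the video), a multivariate Newton step solves a linear system with the Hessian instead of dividing by a single second derivative:

```python
import numpy as np

# Toy objective: f(w) = 0.5 * w^T A w - b^T w, a quadratic bowl.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])

def gradient(w):
    return A @ w - b          # vector of first partial derivatives

def hessian(w):
    return A                  # matrix of second partial derivatives (constant here)

w = np.zeros(2)               # starting location
for _ in range(10):
    # Newton step: w <- w - H^{-1} * gradient, computed by solving H * step = gradient.
    step = np.linalg.solve(hessian(w), gradient(w))
    w = w - step

print(w)  # for a quadratic bowl this lands on the minimizer A^{-1} b in a single step
```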

Check this out: derivatives help us compute gradients

which we can represent using a Jacobian matrix for first order optimization.

And we can use the Hessian for second order optimization.

These are 4 of the 5 derivative operators used in all of calculus,

they’re the ways that we organize and represent change numerically.

So when should you use a second order method?

First order methods are usually less computationally expensive

and take less time per iteration,

converging pretty fast on large datasets.

Second order methods are faster

when the second derivative is known and easy to compute.

But the second derivative is often intractable to compute,

requiring lots of computation.

For certain problems,

gradient descent can get stuck along paths of slow convergence around saddle points,

whereas second order methods won’t.

Trying out different optimization techniques for your specific problem

is the best way to see what works best.

Here are the key points to remember:

First order optimization techniques use the first derivative of a function to minimize it,

second order optimization techniques use the second derivative.

The Jacobian is a matrix of first partial derivatives

and the Hessian is a matrix of second partial derivatives.

And Newton’s Method is a popular second order optimization technique

that can sometimes outperform gradient descent.

Last week’s coding challenge winner is Alberto Garces.

Alberto used gradient descent to find the line of best fit.

His Jupyter notebook is insanely detailed,

very well thought out.

That was dope, Alberto. Wizard of the Week.

And the runner up is Ivan Gusev

who implemented gradient descent from scratch for polynomials of any order.

This week’s challenge is to implement Newton’s method for optimization from scratch,

and winners will be announced next week.

Please subscribe for more programming videos,

and for now I’ve gotta invent the 6th derivative,

so thanks for watching.