《机器学习之数学》#3 二阶优化

Second Order Optimization - The Math of Intelligence #2

[背景音乐]
[Music Playing]
大家好 我是西拉杰
Hello World! It’s Siraj.
我们来谈谈优化吧
And let’s talk about optimization.
[背景音乐]
[Music Playing]
全世界有成千上万种语言
There are thousands of languages spoken across the world,
每一种都有它独特的能力来表示概念和传递信息
each one unique in its ability to represent concepts and convey ideas.
但是有一种语言
But there is one language
被所有人共享
that is shared by all humans,
无论你来自哪里
regardless of where you come from:
这种语言就是数学
Mathematics.
无论你的文化背景和年纪
No matter your culture or your age,
你都有理解这门数字语言的能力
you possess the ability to understand this language of numbers
这种能力把不同时空的我们连接到一起
that connects us all, across continents and time.
和其它语言类似 熟练掌握数学需要练习
Like all languages, fluency requires practice.
但是和其它语言不同的是
But unlike any other language,
在数学上越熟练
the more fluent you become in math,
在生活中面对想要做的事就越势不可挡
the more unstoppable you’ll be in anything you want to do in life.
数学无处不在
Math is happening all around us,
以至于大部分人没有意识到这一点
to a degree most people don’t realize.
我们可以把任何事物都看做一组变量或指标
We can think of everything as a set of variables, as metrics.
而且这些变量之间存在一些关系
And there exists relations between all of these variables.
在数学中 我们将这种关系称为函数
In math, we call these relations functions.
我们用这种方式来表示一组模式
It’s our way of representing a set of patterns,
一个映射或多个变量之间关联的方式
a mapping, a relationship between many variables.
不管我们用什么样的机器学习模型
No matter what machine learning model we use,
不管我们用什么数据集
no matter what dataset we use,
机器学习的目的是优化一个目标
the goal of machine learning is to optimize for an objective
这样做 我们实际上在近似一个函数
And by doing so, we are approximating a function.
优化的过程
The process of optimization
帮助我们逐步发现隐藏在数据深处的函数
helps us iteratively discover the functions hidden in the depth of data.
上周我们讨论了一种流行的优化技巧
Last week, we talked about a popular optimization technique
叫做梯度下降
called gradient descent.
梯度下降可以被拆解成五步
This can be broken down into a 5 step process.
第一步 我们定义某个机器学习模型
First, we define some machine learning model
附带一组初始权重值
with a set of initial weight values.
它们是模型所代表的函数中的系数
These act as the coefficients of the function that the model represents.
即输入数据和输出数据之间的映射关系
The mapping between input data and output predictions.
这些初始值是随意设定的
These values are naive,
我们并不知道他们实际应该是多少
we have no idea what they should actually be,
但是我们正在努力去找到最合适的值
but we’re trying to discover the optimal ones.
我们将定义一个误差函数
We’ll define an error function,
当我们画出所有可能的误差值
and when we plot the graph of the relationship between all the possible error values,
与权重值之间的关系的图像
and all the possible weight values for our function,
我们可以看到一个波谷 即最小值
we’ll see that there exists a valley, the minima.
我们将用误差函数来帮助我们计算
We’ll use our error to help us compute
每个权重值的偏导数
the partial derivative with respect to each weight value we have
这样我们就得到了梯度
and this gives us our gradient.
梯度表示权重发生非常小的变化时
The gradient represents the change in the error
误差值随其发生的变化
when the weights are changed by a very small value from their original value.
我们用梯度来在某个方向上更新权重的值
We use the gradient to update the values of our weights in a direction
以使误差变得最小
such that the error is minimized,
通过反复迭代使误差接近最小值
iteratively coming closer and closer to the minima of the function.
我们朝负梯度方向不断移动
We step our solution in the negative direction of the gradient repeatedly.
当我们到达这个值时
When we reach it,
我们就得到了模型权重的最优解
we have learned the optimal weight values for our model,
此时梯度为零
where our gradient is equal to zero.
然后模型就能对未见过的输入值进行预测
Our model will then be able to make predictions for input data it’s never seen before.
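As a quick refresher before moving on, here is a minimal Python sketch of that five-step loop, assuming a one-weight linear model with a mean squared error cost; the toy data, learning rate, and variable names are illustrative and not taken from the video's notebook.

import numpy as np

# Toy data: y is roughly 3 * x, so the optimal weight is near 3.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 5.9, 9.2, 11.8])

w = 0.0                # step 1: a naive initial weight
learning_rate = 0.01   # how far we step along the negative gradient

for step in range(1000):
    predictions = w * x                 # step 2: the model's output
    error = predictions - y             # step 3: how wrong we are
    gradient = 2 * np.mean(error * x)   # step 4: d(MSE)/dw
    w -= learning_rate * gradient       # step 5: step against the gradient

print(w)  # ends up close to 3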
大部分的优化问题可以用梯度下降法或其变体解决
Most optimization problems can be solved using gradient descent and its variants.
这些方法都是同一类 叫做一阶最优化方法
They all fall into a category called first order optimization methods.
我们称它们为一阶
We call them first order
因为他们只需要计算一阶导数
because they only require us to compute the first derivative.
但是有一类方法并没有像一阶优化那样被广泛使用
But there’s another class of techniques that aren’t as widely used
这类方法是二阶优化
called second order optimization methods
它们需要计算二阶导数
that require us to compute the second derivative.
一阶导数告诉我们
The first derivative tells us
一个函数在特定点上是增还是减
if the function is increasing or decreasing at a certain point,
二阶导数告诉我们
and the second derivative tells us
一阶导数在特定点上是增是减
if the first derivative is increasing or decreasing,
这表示了函数的曲率
which hints at its curvature.
使用一阶优化方法会得到一条线
First order methods provide us with a line
这条线是误差函数曲面上某一点的切线
that is tangential to a point on an error surface,
使用二阶优化会得到一个二次曲面
and second order methods provide us with a quadratic surface
与误差函数的曲率吻合
that kisses the curvature of the error surface.
哈哈 你们两个去开房吧
Haha Get a room you two
二阶优化的优势是
The advantage then of second order methods
没有忽略误差函数的曲率
is that they don’t ignore the curvature of the error surface.
在逐步进行的过程中 他们的表现更好
And in terms of step-wise performance, they are better.
让我们来看一个流行的二阶优化方法 牛顿法
Let’s look at a popular second order optimization technique called Newton’s method,
牛顿法是以微积分发明者的名字命名的
named after the dude who invented calculus.
他的名字是……
Whose name was…
实际上 牛顿法有两个版本
There are actually two versions of Newton’s method.
第一个版本是用来解多项式的根
The first version is for finding the roots of a polynomial,
这些点都是多项式曲线与x轴的交点
all those points where it intersects the x-axis.
所以 假设你扔出一个球并记录它的轨迹
So if you threw a ball and recorded its trajectory,
方程的根会告诉你 球落地的准确时间
finding the root of the equation would tell you exactly what time it hits the ground.
第二个版本用于机器学习的优化
The second version is for optimization, and its what we use in machine learning.
让我们先来编写求根版本的代码
But let’s code the root-finding version first
来掌握一些基本概念吧
to develop some basic intuition.
你一定是著名的数据科学家西拉杰吧
You must be Siraj, the famous data scientist.
– 见到你很高兴 – 我也是
– Nice to meet you. – Nice to meet you too.
– 对于如何预测神经网络有线索了么
– Any ideas yet on how to predict with our neural network?
– 还没有 这会是个很大的挑战
– Not yet, this is gonna be quite a challenge.
– 我们要对一个500TB的数据集进行异常检测
– We gotta perform an anomaly detection on a 500 TB dataset.
– 我认为可以先用牛顿方法进行逻辑回归
– I was thinking, just use Newton’s method to perform logistic regression first,
– 看是否能预测一部分
– to see if we could predict part of it.
– 牛顿方法
– Newton’s method?
– 是的 它是优化的一种方式
– Yeah, it’s a form of optimization,
我已经建好了模型 训练了几个小时了
I already built a prototype, which has been training for a few hours.
– 作为一个机器学习者 我很佩服你 – 应该的
– I am an impressed machine learner. – You should be.
假设我们有函数f(x)
Let’s say we have a function f of x
以及一个初步猜想的根
and some initial guessed solution.
牛顿法需要我们先找到该点切线的斜率
Newton’s method says that we first find the slope of the tangent line at our guess point,
接着找到切线与x轴的交点
then find the point at which the tangent line intersects the x axis.
再用这个点找到它在原始函数上的投影
We’ll use that point to find its projection in the original function.
然后我们重复第一步
Then we iterate again from our first step,
这一次用这个点代替第一个点
this time replacing our first point with this one.
不断重复这个步骤 直到
We keep iterating and eventually we’ll stop
x的变化量小于或等于某个阈值
when the change in our value of x is less than or equal to a threshold.
这就是如何运用牛顿法来找到方程的解
So that’s the root finding version of Newton’s Method,
通过它找到使函数值为0的点
where we’re trying to find where the function equals zero.
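Here is a minimal Python sketch of that root-finding loop, assuming the derivative is available in closed form; the example function, tolerance, and names are illustrative.

def newton_root(f, f_prime, x, tolerance=1e-7, max_steps=100):
    # Repeat x <- x - f(x) / f'(x) until f(x) is close enough to zero.
    for _ in range(max_steps):
        fx = f(x)
        if abs(fx) <= tolerance:
            break
        x = x - fx / f_prime(x)   # where the tangent line crosses the x-axis
    return x

# Example: a root of f(x) = x**2 - 612, i.e. the square root of 612.
print(newton_root(lambda x: x**2 - 612, lambda x: 2 * x, x=10.0))  # about 24.74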
但是在用于优化的版本中
But in the optimization version,
我们找到函数的导数为零的点 即最小值
we’re trying to find where the derivative of the function equals zero, its minima.
从更高的层面来看 对任意给定的起始位置
At a high level, given a random starting location,
我们构造一个目标函数的二次近似
we construct a quadratic approximation to the objective function
匹配该点的一阶和二阶导数
that matches the first and second derivative values at that point.
然后最小化该二次函数
And then we minimize that quadratic function
而不是原始函数
instead of the original function.
二次函数取最小值的点作为下一次的起点
The minimizer of the quadratic function is used as the starting point in the next step
然后不断迭代这个过程
and we repeat this process iteratively.
我们通过两个牛顿法优化的例子来进一步学习
OK, so let’s go over two cases of Newton’s method for optimization to learn more.
这里有一维和二维两个例子
A 1D case and a 2D case.
在一维例子中存在一个一维函数
In the first case we’ve got a 1 dimensional function.
我们通过泰勒级数展开法
We can obtain a quadratic approximation at a given point of the function
在定点得到函数的二次近似
using what’s called a Taylor series expansion,
忽略泰勒展开后三次及以上的项
neglecting terms of order three or higher.
泰勒级数使用无限项多项式的和
A Taylor series is a representation of a function
来表示一个函数
as an infinite sum of terms.
这些项通过求函数在某一点的导数得到
They are calculated from the values of the function’s derivatives at a single point.
它的发明者是英国数学家布鲁克·泰勒
It was invented by an English mathematician named Brook Taylor.
泰勒·斯威夫特 开玩笑的
Swift. Just kidding.
我们对初始点x取二阶泰勒展开
So we’d take the second order Taylor series at our initial point x,
通过令它的导数为零来对它进行最小化
and minimize it by setting its derivative equal to zero,
这一步会用到f在该点的一阶和二阶导数
which involves both the first and second derivatives of f at that point.
为了找到x的最小值
In order to find the minimum x value,
我们重复这个过程
we iterate this process.
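In symbols, a standard way to write that step (added here for reference, not shown on screen in the video): the second order Taylor approximation around the current guess x_n is

f(x) \approx f(x_n) + f'(x_n)\,(x - x_n) + \tfrac{1}{2} f''(x_n)\,(x - x_n)^2

and setting its derivative with respect to x equal to zero gives the Newton update

f'(x_n) + f''(x_n)\,(x - x_n) = 0 \;\;\Longrightarrow\;\; x_{n+1} = x_n - \frac{f'(x_n)}{f''(x_n)}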
第二个的例子中
In the second case,
假设我们有一个多维函数
let’s say we’ve got a function of multiple dimensions.
我们可以用同样的方法来得到最小值
We can find the minimum of it, using the same approach,
除了以下两个改动
except for 2 changes,
我们用梯度来替换一阶导数
we replace the first derivatives with a gradient,
用海森矩阵来替换二阶导数
and the second derivatives with a Hessian.
海森矩阵是一个由二阶偏导数构成的矩阵
A Hessian is a matrix of the second order partial derivatives of a scalar-valued function,
它描述了多变量函数的局域曲率
and it describes the local curvature of a multivariable function.
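In standard notation (added here for reference), the Hessian collects every second partial derivative, and the multivariate Newton update replaces division by f'' with multiplication by the inverse Hessian:

H(x)_{ij} = \frac{\partial^2 f}{\partial x_i \, \partial x_j}(x), \qquad x_{k+1} = x_k - H(x_k)^{-1} \nabla f(x_k)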
总结一下 导数会帮助我们计算梯度
Check this out: derivatives help us compute gradients
在一阶优化下 我们可以用雅可比矩阵来表示
which we can represent using a Jacobian matrix for first order optimization.
我们可以用海森矩阵来进行二阶优化
And we can use the Hessian for second order optimization.
这些是微积分用到的所有五个运算符中的四个
These are 4 of the 5 derivative operators used in all of calculus,
他们是我们组织和表示数值变化的方式
they’re the ways that we organize and represent change numerically.
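As a small illustration of the multivariate update, here is a hedged Python sketch of a single Newton step using NumPy, with the gradient and Hessian written out by hand for a simple quadratic bowl; the function and names are made up for this example rather than taken from the video.

import numpy as np

# f(x) = x0**2 + 3 * x1**2, a bowl whose minimum sits at the origin.
def gradient(x):
    return np.array([2 * x[0], 6 * x[1]])

def hessian(x):
    return np.array([[2.0, 0.0],
                     [0.0, 6.0]])

x = np.array([4.0, -2.0])                        # arbitrary starting point
step = np.linalg.solve(hessian(x), gradient(x))  # solve H * step = gradient
x = x - step                                     # Newton update: x - H^-1 * gradient
print(x)  # lands exactly at [0, 0], since f is itself quadratic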
那么 我们什么时候需要使用二阶优化呢
So when should you use a second order method?
一阶优化通常计算成本较低
First order methods are usually less computationally expensive
耗时较少
and take less time per step,
并且在处理大数据集时收敛较快
converging pretty fast on large datasets.
二阶优化方法
Second order methods are faster
在二阶导数已知并且容易计算时很快
when the second derivative is known and easy to compute.
但二阶导数通常难以计算
But the second derivative is often intractable to compute,
需要消耗很多计算成本
requiring lots of computation.
在某些特定问题上
For certain problems,
梯度下降可能在鞍点附近收敛缓慢甚至卡住
gradient descent can get stuck along paths of slow convergence around saddle points,
但是二阶优化不会
whereas second order methods won’t.
对于不同的问题采取不同的优化方法
Trying out different optimization techniques for your specific problem
是检验哪种方法更好的最好的办法
is the best way to see what works best.
以下是这节课需要记住的内容
Here are the key points to remember:
一阶优化使用一阶导数来最小化误差函数
First order optimization techniques use the first derivative of a function to minimize it,
二阶优化使用二阶偏导数做同样的事
second order optimization techniques use the second derivative.
雅可比矩阵是一阶偏导数组成的矩阵
The Jacobian is a matrix of first partial derivatives
海森矩阵是二阶偏导数组成的矩阵
and the Hessian is a matrix of second partial derivatives.
牛顿法是广泛应用的二阶优化方法
And Newton’s Method is a popular second order optimization technique
在某些情况下会比梯度下降更有效
that can sometimes outperform gradient descent.
上周编程挑战的优胜者是Alberto Garces
Last week’s coding challenge winner is Alberto Garces.
Alberto使用了梯度下降找到了最优解
Alberto used gradient descent to find the line of best fit.
他的Jupyter笔记本记录得非常详尽
His Jupyter notebook is insanely detailed,
通过阅读它你可以学习梯度下降
you could learn gradient descent just from reading it alone.
思路非常清晰
Very well thought out.
这就是Alberto 我们本周的编程之星
That was dope Alberto, Wizard of the Week.
第二名是Ivan Gusev
And the runner up is Ivan Gusev
他从头实现了对任意阶的多项式的梯度下降
who implemented gradient descent from scratch for polynomials of any order.
这周的挑战是从头开始写一个牛顿优化方法
This week’s challenge is to implement Newton’s method for optimization from scratch.
细节在README文档中
Details in the README,
把你的Github链接附在评论里
post your GitHub link in the comments,
我们会在下周宣布谁是冠军
and winners announced next week.
请订阅我的频道来获取更多的编程视频
Please subscribe for more programming videos,
现在我要去发明六阶导数了
and for now I’ve gotta invent the 6th derivative,
多谢观看
so thanks for watching.


Translation credits

Video overview

An introduction to using Newton’s method for optimization in machine learning

Transcriber

Collected from the web

Translator

鹿琳

Reviewer

审核员X

Video source

https://www.youtube.com/watch?v=UIFMLK2nj_w
