

Backpropagation calculus | Appendix to deep learning chapter 3

The hard assumption here is that you've watched part 3, giving an intuitive walkthrough of the backpropagation algorithm. Here, we get a bit more formal and dive into the relevant calculus. It's normal for this to be a little confusing, so the mantra to regularly pause and ponder certainly applies as much here as anywhere else.
Our main goal is to show how people in machine learning commonly think about the chain rule from calculus in the context of networks, which has a different feel from how most introductory calculus courses approach the subject. For those of you uncomfortable with the relevant calculus, I do have a whole series on the topic.
Let's just start off with an extremely simple network, one where each layer has a single neuron in it. So this particular network is determined by 3 weights and 3 biases, and our goal is to understand how sensitive the cost function is to these variables. That way we know which adjustments to these terms will cause the most efficient decrease to the cost function. And we're just going to focus on the connection between the last two neurons.
Let's label the activation of that last neuron a with a superscript L, indicating which layer it's in, so the activation of the previous neuron is a^(L-1). These are not exponents, they're just a way of indexing which layer we're talking about, since I want to save subscripts for different indices later on. Let's say that the value we want this last activation to be for a given training example is y. For example, y might be 0 or 1. So the cost of this simple network for a single training example is (a^(L) - y)^2. We'll denote the cost of this one training example as C_0.
As a reminder, this last activation is determined by a weight, which I'm going to call w^(L), times the previous neuron's activation, plus some bias, which I'll call b^(L); then you pump that through some special nonlinear function like a sigmoid or a ReLU. It's actually going to make things easier for us if we give a special name to this weighted sum, like z, with the same superscript as the relevant activations.
So there are a lot of terms. The way you might conceptualize this is that the weight, the previous activation, and the bias all together are used to compute z, which in turn lets us compute a, which finally, along with the constant y, lets us compute the cost. And of course a^(L-1) is influenced by its own weight and bias, and so on, but we're not going to focus on that right now.
All of these are just numbers, right? And it can be nice to think of each one as having its own little number line. Our first goal is to understand how sensitive the cost function is to small changes in our weight w^(L). Or, phrased differently, what is the derivative of C with respect to w^(L)? When you see this "∂w" term, think of it as meaning "some tiny nudge to w", like a change by 0.01. And think of this "∂C" term as meaning "whatever the resulting nudge to the cost is". What we want is their ratio.
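As a purely illustrative, made-up example (these numbers are not from the video): if a nudge of 0.01 to w^(L) happened to bump the cost up by about 0.02, that ratio, and hence the derivative, would be roughly 2:

```latex
\frac{\partial C_0}{\partial w^{(L)}}
\approx \frac{\Delta C_0}{\Delta w^{(L)}}
= \frac{0.02}{0.01}
= 2
```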
Conceptually, this tiny nudge to w^(L) causes some nudge to z^(L), which in turn causes some change to a^(L), which directly influences the cost. So we break this up by first looking at the ratio of a tiny change in z^(L) to the tiny change in w^(L); that is, the derivative of z^(L) with respect to w^(L). Likewise, you then consider the ratio of a change in a^(L) to the tiny change in z^(L) that caused it, as well as the ratio between the final nudge to C and this intermediate nudge to a^(L). This right here is the chain rule, where multiplying together these three ratios gives us the sensitivity of C to small changes in w^(L).
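Written out, that product of three ratios is:

```latex
\frac{\partial C_0}{\partial w^{(L)}}
= \frac{\partial z^{(L)}}{\partial w^{(L)}}
  \, \frac{\partial a^{(L)}}{\partial z^{(L)}}
  \, \frac{\partial C_0}{\partial a^{(L)}}
```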
So on screen right now there are kind of a lot of symbols, so take a moment to make sure it's clear what they all are, because now we're going to compute the relevant derivatives.
The derivative of C with respect to a^(L) works out to be 2(a^(L) - y). Notice, this means its size is proportional to the difference between the network's output and the thing we want it to be. So if that output was very different, even slight changes stand to have a big impact on the cost function. The derivative of a^(L) with respect to z^(L) is just the derivative of our sigmoid function, or of whatever nonlinearity you choose to use. And the derivative of z^(L) with respect to w^(L) in this case comes out to just be a^(L-1).
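Collecting those three pieces, with sigma again standing in for whichever nonlinearity you chose, the whole product becomes:

```latex
\frac{\partial C_0}{\partial a^{(L)}} = 2\left(a^{(L)} - y\right), \quad
\frac{\partial a^{(L)}}{\partial z^{(L)}} = \sigma'\!\left(z^{(L)}\right), \quad
\frac{\partial z^{(L)}}{\partial w^{(L)}} = a^{(L-1)}
\;\Longrightarrow\;
\frac{\partial C_0}{\partial w^{(L)}}
= a^{(L-1)} \, \sigma'\!\left(z^{(L)}\right) \, 2\left(a^{(L)} - y\right)
```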
Now, I don't know about you, but I think it's easy to get stuck head-down in these formulas without taking a moment to sit back and remind yourself what they all actually mean. In the case of this last derivative, the amount that a small nudge to this weight influences the last layer depends on how strong the previous neuron is. Remember, this is where that "neurons that fire together wire together" idea comes in.
And all of this is the derivative with respect to w^(L) only of the cost for a specific training example. Since the full cost function involves averaging together all those costs across many training examples, its derivative requires averaging this expression that we found over all training examples. And of course, that is just one component of the gradient vector, which itself is built up from the partial derivatives of the cost function with respect to all those weights and biases. But even though it was just one of those partial derivatives we need, it's more than 50% of the work.
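In symbols, with n training examples, the full derivative is the average of the per-example derivatives, and the gradient stacks one such partial derivative for every weight and bias in the network; this is just a sketch of the notation, following the conventions used in this series:

```latex
\frac{\partial C}{\partial w^{(L)}}
= \frac{1}{n} \sum_{k=0}^{n-1} \frac{\partial C_k}{\partial w^{(L)}},
\qquad
\nabla C =
\begin{bmatrix}
\partial C / \partial w^{(1)} \\
\partial C / \partial b^{(1)} \\
\vdots \\
\partial C / \partial w^{(L)} \\
\partial C / \partial b^{(L)}
\end{bmatrix}
```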
The sensitivity to the bias, for example, is almost identical. We just need to swap out this ∂z/∂w term for a ∂z/∂b, and if you look at the relevant formula, that derivative comes out to be 1.
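Making that one substitution, and using the fact that ∂z^(L)/∂b^(L) = 1, the bias version of the same chain rule product reads:

```latex
\frac{\partial C_0}{\partial b^{(L)}}
= \frac{\partial z^{(L)}}{\partial b^{(L)}}
  \, \frac{\partial a^{(L)}}{\partial z^{(L)}}
  \, \frac{\partial C_0}{\partial a^{(L)}}
= 1 \cdot \sigma'\!\left(z^{(L)}\right) \, 2\left(a^{(L)} - y\right)
```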
Also, and this is where the idea of propagating backwards comes in, you can see how sensitive this cost function is to the activation of the previous layer; namely, this initial derivative in the chain rule expansion, the sensitivity of z to the previous activation, comes out to be the weight w^(L). And again, even though we won't be able to directly influence that activation, it's helpful to keep track of, because now we can just keep iterating this chain rule idea backwards to see how sensitive the cost function is to previous weights and previous biases.
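To make that backward iteration concrete, here is a minimal Python sketch of the one-neuron-per-layer case, using a sigmoid as the nonlinearity. This is not code from the video; the function names (forward, backward, cost) and the numbers are made up for illustration. It runs the forward pass, walks the chain rule backwards through each layer to get ∂C/∂w and ∂C/∂b for one training example, and sanity-checks one result against a finite-difference nudge of 0.01.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(weights, biases, a0):
    """Forward pass through a chain of single-neuron layers.
    Returns the weighted sums z and the activations a (a[0] is the input)."""
    zs, activations = [], [a0]
    for w, b in zip(weights, biases):
        z = w * activations[-1] + b          # z^(l) = w^(l) a^(l-1) + b^(l)
        zs.append(z)
        activations.append(sigmoid(z))       # a^(l) = sigmoid(z^(l))
    return zs, activations

def backward(weights, zs, activations, y):
    """Iterate the chain rule backwards, layer by layer.
    Returns dC/dw and dC/db for every layer, for one training example."""
    dws, dbs = [0.0] * len(weights), [0.0] * len(weights)
    dC_da = 2 * (activations[-1] - y)                   # dC/da^(L) = 2(a^(L) - y)
    for l in reversed(range(len(weights))):
        da_dz = sigmoid(zs[l]) * (1 - sigmoid(zs[l]))   # sigmoid'(z^(l))
        dC_dz = dC_da * da_dz
        dws[l] = dC_dz * activations[l]                 # dz/dw = a^(l-1)
        dbs[l] = dC_dz * 1.0                            # dz/db = 1
        dC_da = dC_dz * weights[l]                      # dz/da^(l-1) = w^(l), keep going back
    return dws, dbs

# Tiny made-up example: 3 weights, 3 biases, one training example with target y = 1.
weights, biases, a0, y = [0.5, -0.8, 1.2], [0.1, 0.0, -0.3], 0.7, 1.0
zs, activations = forward(weights, biases, a0)
dws, dbs = backward(weights, zs, activations, y)

# Sanity check: compare dC/dw^(1) with a finite-difference nudge of 0.01.
def cost(ws):
    _, acts = forward(ws, biases, a0)
    return (acts[-1] - y) ** 2

eps = 0.01
nudged = list(weights)
nudged[0] += eps
print(dws[0], (cost(nudged) - cost(weights)) / eps)  # the two numbers should be close
```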
And you might think this is an overly simple example, since all the layers have just 1 neuron, and things are going to get exponentially more complicated in a real network. But honestly, not that much changes when we give the layers multiple neurons. Really, it's just a few more indices to keep track of. Rather than the activation of a given layer simply being a^(L), it's also going to have a subscript indicating which neuron of that layer it is.
Let's go ahead and use the letter k to index the layer (L-1), and j to index the layer L. For the cost, again we look at what the desired output is. But this time we add up the squares of the differences between these last-layer activations and the desired output; that is, you take a sum over (a_j^(L) - y_j)^2. Since there are a lot more weights, each one has to have a couple more indices to keep track of where it is. So let's call the weight of the edge connecting this k-th neuron to the j-th neuron w_{jk}^(L).
Those indices might feel a little backwards at first, but it lines up with how you'd index the weight matrix that I talked about in the part 1 video. Just as before, it's still nice to give a name to the relevant weighted sum, like z, so that the activation of the last layer is just your special function, like the sigmoid, applied to z. You can kind of see what I mean, right? These are all essentially the same equations we had before in the one-neuron-per-layer case; it just looks a little more complicated.
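In indexed form, with sigma again standing in for your chosen nonlinearity, those equations are:

```latex
C_0 = \sum_{j} \left(a_j^{(L)} - y_j\right)^{2}, \qquad
z_j^{(L)} = \sum_{k} w_{jk}^{(L)} a_k^{(L-1)} + b_j^{(L)}, \qquad
a_j^{(L)} = \sigma\!\left(z_j^{(L)}\right)
```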
And indeed, the chain-rule derivative expression describing how sensitive the cost is to a specific weight looks essentially the same. I'll leave it to you to pause and think about each of these terms if you want.
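For reference, the per-weight expression now just carries the extra indices:

```latex
\frac{\partial C_0}{\partial w_{jk}^{(L)}}
= \frac{\partial z_j^{(L)}}{\partial w_{jk}^{(L)}}
  \, \frac{\partial a_j^{(L)}}{\partial z_j^{(L)}}
  \, \frac{\partial C_0}{\partial a_j^{(L)}}
```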
What does change here, though, is the derivative of the cost with respect to one of the activations in the layer (L-1). In this case, the difference is that this neuron influences the cost function through multiple paths. That is, on the one hand it influences a_0^(L), which plays a role in the cost function, but it also has an influence on a_1^(L), which also plays a role in the cost function, and you have to add those up.
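Summing over those paths, one per neuron j in layer L, and reusing the derivatives already worked out above, gives:

```latex
\frac{\partial C_0}{\partial a_k^{(L-1)}}
= \sum_{j}
  \frac{\partial z_j^{(L)}}{\partial a_k^{(L-1)}}
  \, \frac{\partial a_j^{(L)}}{\partial z_j^{(L)}}
  \, \frac{\partial C_0}{\partial a_j^{(L)}}
= \sum_{j} w_{jk}^{(L)} \, \sigma'\!\left(z_j^{(L)}\right) \, 2\left(a_j^{(L)} - y_j\right)
```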
And that... well, that is pretty much it. Once you know how sensitive the cost function is to the activations in this second-to-last layer, you can just repeat the process for all the weights and biases feeding into that layer. So pat yourself on the back! If all of this makes sense, you have now looked deep into the heart of backpropagation, the workhorse behind how neural networks learn.
These chain rule expressions give you the derivatives that determine each component of the gradient that helps minimize the cost of the network by repeatedly stepping downhill. Phew. If you sit back and think about all of that, that's a lot of layers of complexity to wrap your mind around, so don't worry if it takes time for your mind to digest it all.


Translation credits
Transcription: collected from the web
Translation: collected from the web
Review: auto-approved
Video source: https://www.youtube.com/watch?v=tIeHLnjs5U8