#### Backpropagation Algorithm: Deep Learning Series 3+

Backpropagation calculus | Appendix to deep learning chapter 3

The hard assumption here is that you’ve watched part 3,

giving an intuitive walkthrough of the backpropagation algorithm.

Here, we get a bit more formal and dive into the relevant calculus.

It’s normal for this to be a little confusing,

so the mantra to regularly pause and ponder certainly applies

as much here as anywhere else.

Our main goal is to show how people in machine learning

commonly think about the chain rule

from calculus in the context of networks,

which has a different feel

from how most introductory calculus courses approach the subject.

For those of you uncomfortable with the relevant calculus,

I do have a whole series on the topic.

Let’s just start off with an extremely simple network,

one where each layer has a single neuron in it.

So this particular network is determined

by 3 weights and 3 biases,

and our goal is to understand how sensitive

the cost function is to these variables.

That way we know which adjustments to these terms

are going to cause the most efficient decrease to the cost function.

And we’ll just focus on the connection between the last two neurons.

Let’s label the activation of that last neuron a with a superscript L,

indicating which layer it’s in,

so the activation of this previous neuron is a^(L-1).

These are not exponents,

they’re just a way of indexing what we’re talking about,

since I want to save subscripts for different indices later on.

Let’s say that the value we want this last activation to be

for a given training example is y.

For example, y might be 0 or 1.

So the cost of this simple network for a single training example is (a^(L) – y)^2.

We’ll denote the cost of this one training example as C_0.

As a reminder, this last activation is determined by a weight, which I’m going to call w^(L),

times the previous neuron’s activation,

plus some bias, which I’ll call b^(L),

then you pump that through some special nonlinear function

like a sigmoid or a ReLU.

It’s actually going to make things easier

for us if we give a special name

to this weighted sum, like z,

with the same superscript as the relevant activations.

So there are a lot of terms.

And a way you might conceptualize this is that the weight,

the previous activation, and the bias

altogether are used to compute z,

which in turn lets us compute a,

which finally, along with the constant y, lets us compute the cost.
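
To make that chain of computations concrete, here is a minimal Python sketch of the last neuron. It assumes the sigmoid as the nonlinearity, and the function name `forward` and the specific numbers are made up purely for illustration.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, one common choice of nonlinearity."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, a_prev, b, y):
    """Compute z, a, and the single-example cost C_0 for the last neuron."""
    z = w * a_prev + b      # weighted sum z^(L)
    a = sigmoid(z)          # activation a^(L)
    cost = (a - y) ** 2     # C_0 = (a^(L) - y)^2
    return z, a, cost

# Hypothetical values, chosen only so the arithmetic is easy to follow.
z, a, cost = forward(w=0.5, a_prev=0.8, b=-0.4, y=1.0)
print(z, a, cost)
```

With these particular numbers the weighted sum cancels to zero, so the sigmoid sits at its midpoint and the cost is the squared distance from there to the target.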

And of course, a^(L-1) is influenced by its own weight and bias, and such.

But we are not gonna focus on that right now.

All of these are just numbers, right?

And it can be nice to think of each one

as having its own little number line.

Our first goal is to understand

how sensitive the cost function is to small changes in our weight w^(L).

Or phrased differently, what’s the derivative of C with respect to w^(L)?

When you see this “∂w” term,

think of it as meaning “some tiny nudge to w”,

like a change by 0.01.

And think of this “∂C” term

as meaning “whatever the resulting nudge to the cost is”.

What we want is their ratio.

Conceptually, this tiny nudge to w^(L) causes some nudge to z^(L)

which in turn causes some change to a^(L), which directly influences the cost.

So we break this up by first looking at the ratio of a tiny change to z^(L) to the tiny change in w^(L).

That is, the derivative of z^(L) with respect to w^(L).

Likewise, you then consider the ratio of a change to a^(L) to the tiny change in z^(L) that caused it,

as well as the ratio between the final nudge to C and this intermediate nudge to a^(L).

This right here is the chain rule,

where multiplying together these three ratios gives us the sensitivity of C to small changes in w^(L).
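
Written out in symbols, using C_0 for the cost of the single training example as defined earlier, the three ratios described above multiply together like this:

```latex
\frac{\partial C_0}{\partial w^{(L)}}
  = \frac{\partial z^{(L)}}{\partial w^{(L)}}
    \cdot \frac{\partial a^{(L)}}{\partial z^{(L)}}
    \cdot \frac{\partial C_0}{\partial a^{(L)}}
```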

So on screen right now, there’s kind of a lot of symbols,

so take a moment to make sure it’s clear what they all are,

because now we are gonna compute the relevant derivatives.

The derivative of C with respect to a^(L) works out to be 2(a^(L) – y).

Notice, this means that its size is proportional to

the difference between the network’s output

and the thing we want it to be.

So if that output was very different,

even slight changes stand to have a big impact on the cost function.
The derivative of a^(L) with respect to z^(L) is just the derivative of our sigmoid function,

or whatever nonlinearity you choose to use.

And the derivative of z^(L) with respect to w^(L),

in this case comes out just to be a^(L-1).
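
As a sanity check on those three derivatives, here is a small Python sketch that multiplies them together and compares the result against a numerical "tiny nudge" estimate. It assumes the sigmoid nonlinearity, whose derivative can be written as σ(z)(1 − σ(z)); the function names and input values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(w, a_prev, b, y):
    """C_0 = (a^(L) - y)^2 with a^(L) = sigmoid(w * a^(L-1) + b)."""
    return (sigmoid(w * a_prev + b) - y) ** 2

def dcost_dw(w, a_prev, b, y):
    """Chain rule: dz/dw * da/dz * dC_0/da."""
    a = sigmoid(w * a_prev + b)
    dC_da = 2.0 * (a - y)      # derivative of (a - y)^2
    da_dz = a * (1.0 - a)      # sigmoid'(z), assuming a sigmoid nonlinearity
    dz_dw = a_prev             # derivative of w * a_prev + b with respect to w
    return dz_dw * da_dz * dC_da

# Hypothetical values for illustration.
w, a_prev, b, y = 0.5, 0.8, -0.4, 1.0
analytic = dcost_dw(w, a_prev, b, y)

# Numerical estimate: nudge w a little in both directions and take the ratio.
eps = 1e-6
numeric = (cost(w + eps, a_prev, b, y) - cost(w - eps, a_prev, b, y)) / (2 * eps)
print(analytic, numeric)
```

If the formulas are right, the two printed numbers should agree to several decimal places.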

Now I don’t know about you,

but I think it’s easy to get stuck head-down in these formulas

without taking a moment to sit back and

remind yourself what they all actually mean.

In the case of this last derivative,

the amount that a small nudge

to this weight influences the last layer

depends on how strong the previous neuron is.

Remember, this is where that “neurons that fire together

wire together” idea comes in.

And all of this is the derivative with respect to w^(L) only of the cost for a specific training example.

Since the full cost function involves averaging together

all those costs across many training examples,

its derivative requires averaging this expression that we found over all training examples.
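
In symbols, the averaging step might be written as follows. Here n (the number of training examples) and the index k on C_k (the cost of the k-th example) are notation assumed for this sketch, extending the C_0 used above.

```latex
\frac{\partial C}{\partial w^{(L)}}
  = \frac{1}{n} \sum_{k=0}^{n-1} \frac{\partial C_k}{\partial w^{(L)}}
```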

And of course, that is just one component of the gradient vector,

which itself is built up from

the partial derivatives of the cost function

with respect to all those weights and biases.

But even though it was just one

of those partial derivatives we need,

it’s more than 50% of the work.

The sensitivity to the bias, for example, is almost identical.

We just need to change out

this ∂z/∂w term for a ∂z/∂b,

and if you look at the relevant formula,

that derivative comes out to be 1.

Also, and this is where the idea

of propagating backwards comes in,

you can see how sensitive this cost function

is to the activation of the previous layer.

Namely, this initial derivative in the chain rule expansion,

the sensitivity of z to the previous activation,

comes out to be the weight w^(L).

And again, even though we won’t be able to directly influence that activation,

it’s helpful to keep track of,

because now we can just keep

iterating this chain rule idea backwards

to see how sensitive the cost function is

to previous weights and to previous biases.
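
Here is a rough Python sketch of that backwards iteration for the one-neuron-per-layer network with 3 weights and 3 biases. It assumes sigmoid activations throughout; the function names and the particular numbers are invented for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(a0, weights, biases):
    """Return all activations [a0, a1, ..., aL] of a one-neuron-per-layer chain."""
    activations = [a0]
    for w, b in zip(weights, biases):
        activations.append(sigmoid(w * activations[-1] + b))
    return activations

def cost(a0, weights, biases, y):
    """Single-example cost C_0 = (a^(L) - y)^2."""
    return (forward(a0, weights, biases)[-1] - y) ** 2

def backward(a0, weights, biases, y):
    """Walk the chain rule from the last layer back; return (dC/dw, dC/db) lists."""
    acts = forward(a0, weights, biases)
    dC_da = 2.0 * (acts[-1] - y)           # sensitivity to the last activation
    grad_w = [0.0] * len(weights)
    grad_b = [0.0] * len(biases)
    for layer in reversed(range(len(weights))):
        a, a_prev = acts[layer + 1], acts[layer]
        dC_dz = dC_da * a * (1.0 - a)      # multiply by da/dz = sigmoid'(z)
        grad_w[layer] = dC_dz * a_prev     # dz/dw is the previous activation
        grad_b[layer] = dC_dz              # dz/db is 1
        dC_da = dC_dz * weights[layer]     # dz/da_prev is the weight; pass it back
    return grad_w, grad_b

# Hypothetical parameters and training example.
weights = [0.7, -1.2, 0.5]
biases = [0.1, 0.3, -0.4]
a0, y = 0.9, 1.0
grad_w, grad_b = backward(a0, weights, biases, y)
print(grad_w, grad_b)
```

Notice that the only quantity carried from one layer to the one before it is the sensitivity of the cost to that layer's activation, which is exactly the term described above as helpful to keep track of.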

And you might think this is an overly simple example,

since all layers just have 1 neuron,

and things are just going to get exponentially more complicated in a real network.

But honestly, not that much changes when we give the layers multiple neurons.

Really it’s just a few more indices to keep track of.

Rather than the activation of a given layer simply being a^(L),

it’s also going to have a subscript indicating which neuron

of that layer it is.

Let’s go ahead and use the letter k to index the layer (L-1), and j to index the layer (L).

For the cost, again we look at what the desired output is.

But this time

we add up the squares of the differences

between these last layer activations and the desired output.

That is, you take a sum over (a_j^(L) – y_j)^2.

Since there are a lot more weights,

each one has to have a couple more

indices to keep track of where it is.

So let’s call the weight of the edge connecting this k-th neuron to the j-th neuron w_{jk}^(L).

Those indices might feel a little backwards at first,

but it lines up with how you’d index the weight matrix

that I talked about in the Part 1 video.

Just as before, it’s still nice to give a name to the relevant weighted sum,

like z,

so that the activation of the last layer is just your special function,

like the sigmoid, applied to z.
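
Collected in one place, the multi-neuron versions of the equations would look something like this. The bias symbol b_j^{(L)} and the use of σ for the nonlinearity are notation assumed here by analogy with the single-neuron case.

```latex
\begin{aligned}
C_0 &= \sum_{j} \bigl(a_j^{(L)} - y_j\bigr)^2 \\
z_j^{(L)} &= \sum_{k} w_{jk}^{(L)} \, a_k^{(L-1)} + b_j^{(L)} \\
a_j^{(L)} &= \sigma\bigl(z_j^{(L)}\bigr)
\end{aligned}
```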

You can kinda see what I mean, right?

These are all essentially the same equations we had before in the one-neuron-per-layer case;

it just looks a little more complicated.

And indeed, the chain-rule derivative expression

describing how sensitive the cost is to a specific weight

looks essentially the same.

I’ll leave it to you to pause and think

about each of these terms if you want.
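
For reference while you pause, the chain-rule expression for a single weight in the multi-neuron case can be written with the same three factors as before, now carrying the indices j and k:

```latex
\frac{\partial C_0}{\partial w_{jk}^{(L)}}
  = \frac{\partial z_j^{(L)}}{\partial w_{jk}^{(L)}}
    \cdot \frac{\partial a_j^{(L)}}{\partial z_j^{(L)}}
    \cdot \frac{\partial C_0}{\partial a_j^{(L)}}
```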

What does change here, though,

is the derivative of the cost with respect to one of the activations in the layer (L-1).

In this case,

the difference is that the neuron influences the cost function through multiple paths.

That is, on the one hand, it influences a_0^(L), which plays a role in the cost function,

but it also has an influence on a_1^(L), which also plays a role in the cost function.

And you have to add those up.
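
Adding up those paths amounts to a sum over the index j of the neurons in layer L, with one chain-rule product per path:

```latex
\frac{\partial C_0}{\partial a_k^{(L-1)}}
  = \sum_{j} \frac{\partial z_j^{(L)}}{\partial a_k^{(L-1)}}
    \cdot \frac{\partial a_j^{(L)}}{\partial z_j^{(L)}}
    \cdot \frac{\partial C_0}{\partial a_j^{(L)}}
```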

And that… well that is pretty much it.

Once you know how sensitive the cost function

is to the activations in this second to last layer,

you can just repeat the process

for all the weights and biases feeding into that layer.
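
To see how the multi-neuron case plays out, here is a plain-Python sketch of one layer's backward step. It assumes sigmoid activations and uses nested lists rather than a matrix library; the names `layer_forward` and `layer_backward`, along with all the numbers, are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(W, b, a_prev):
    """a_j = sigmoid(sum_k W[j][k] * a_prev[k] + b[j]) for each neuron j."""
    return [sigmoid(sum(w_jk * a_k for w_jk, a_k in zip(row, a_prev)) + b_j)
            for row, b_j in zip(W, b)]

def layer_backward(W, a_prev, a, dC_da):
    """Given dC/da_j for this layer, return (dC/dW, dC/db, dC/da_prev)."""
    # dC/dz_j = dC/da_j * sigmoid'(z_j), using sigmoid'(z) = a(1 - a)
    dC_dz = [g * a_j * (1.0 - a_j) for g, a_j in zip(dC_da, a)]
    # dz_j/dw_jk is the previous activation a_k
    dC_dW = [[dz_j * a_k for a_k in a_prev] for dz_j in dC_dz]
    # dz_j/db_j is 1
    dC_db = list(dC_dz)
    # Neuron k reaches the cost through every neuron j, so sum over those paths.
    dC_da_prev = [sum(W[j][k] * dC_dz[j] for j in range(len(W)))
                  for k in range(len(a_prev))]
    return dC_dW, dC_db, dC_da_prev

# Hypothetical layer: 3 neurons in layer L-1 feeding 2 neurons in layer L.
W = [[0.2, -0.5, 0.1],
     [0.7, 0.3, -0.4]]
b = [0.1, -0.2]
a_prev = [0.9, 0.4, 0.6]
y = [1.0, 0.0]

a = layer_forward(W, b, a_prev)
dC_da = [2.0 * (a_j - y_j) for a_j, y_j in zip(a, y)]
dC_dW, dC_db, dC_da_prev = layer_backward(W, a_prev, a, dC_da)
print(dC_da_prev)
```

The returned `dC_da_prev` is what you would hand to the same function for the layer before, which is the "repeat the process" step.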

So pat yourself on the back!

If all of this makes sense,

you have now looked deep into the heart of backpropagation,

the workhorse behind how neural networks learn.

These chain rule expressions give you

the derivatives that determine each component in the gradient

that helps minimize the cost of the network by repeatedly stepping downhill.

Phew!

If you sit back and think about all that,

that’s a lot of layers of complexity to wrap your mind around.

So don’t worry if it takes time

for your mind to digest it all.