Hello everybody and welcome to part 10 of our machine learning tutorial series.
In this part we’re going to continue with linear regression. We’ve gotten to the point where we can calculate the best fit line in our Python code.
But now the question is: how good of a fit is our best fit line? How do we determine the accuracy, right?
So the way that we determine accuracy is through R squared, or the coefficient of determination. And the coefficient of determination is calculated using what’s known as squared error. So first we need to figure out what squared error is, and then we can calculate R squared, the coefficient of determination.
So to exemplify this, imagine you’ve got two graphs, each with some points plotted on them, and on each one you draw the best fit line. Okay, so something like this on one, and something like this on the other, right? If I asked you which one is a better fit, you would say the one on the right.
And if I asked why that’s the better fit, you might think for a moment, but you would probably come up with: well, the dots are closer to the line on the one on the right than they are on the one on the left.
Now, of course, we don’t have any ticks on our axes here, so I might say that the one on the left is actually zoomed in really far and the points are actually much closer, so you don’t really know.
But really the question is: how good of a fit is the best fit line? How well does your model fit your data set? It’s very relative to your data set, and you’ll see why in just a moment.
So we know it has to do with distance. How do we actually calculate that? Well, we use squared error. Say we’ve got a graph here with some data points, some beautiful data points, and we have our best fit line. The error is the distance between each point and the best fit line.
But we don’t use just the error; we square that value, so it’s squared error, e squared. Okay, so you might ask why we’re squaring it, right? Well, in one case the distance might be positive, and in another case the distance might be negative, so one reason we square it is so that we’re only dealing with positive values.
You might then ask why it’s e squared and not, say, the absolute value of e. Well, suppose you had a point way out here. That would be an outlier, and your data set shouldn’t have outliers, because we only want to do linear regression on linear data. Okay, that only makes sense. So we square the error because we want to penalize for outliers. Then you might ask, well, why not use the power of 4, or 6, or 18 for that matter? You could use a bigger power like that if you want to penalize even more heavily for outliers.
It just so happens that the standard is squared error. And if you’re not using squared error, and you’re publishing something publicly, whether in a paper or in some module in Python or something, you’d want to alert people to the fact that you’re not doing it the way most people do it. Okay, so that is squared error.
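As a quick sketch of what we just described (the function and variable names here are just for illustration, not necessarily what we’ll use in the series’ code):

```python
# Minimal sketch of squared error: the sum of squared distances between
# each actual y value and the line's y value at the same x.
def squared_error(ys_orig, ys_line):
    return sum((y_line - y) ** 2 for y, y_line in zip(ys_orig, ys_line))

# Points at y = 1, 2, 3 against a flat line at y = 2:
print(squared_error([1, 2, 3], [2, 2, 2]))  # 1 + 0 + 1 = 2
```

Notice that squaring makes both the point above the line and the point below it contribute the same positive amount.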
Now, how do we calculate the coefficient of determination, or R squared? R squared equals one minus the squared error of the y hat line, divided by the squared error of the mean of the ys. Generally you’re going to see squared error denoted as SE. What the heck is the y hat line? Remember: y hat, best fit, regression line, all the same thing. And the mean of the ys is just the mean of the y values of your data set.
So what might that actually look like? The mean of your ys is just a simple horizontal line. And what we’re trying to do is compare the accuracy of that line to the accuracy of the best fit line. Honestly, the best fit line is almost certainly going to be better than the mean of the ys, but we want it to be way better. Okay.
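Putting that formula into code, a minimal sketch might look like this (the names here are illustrative, not the series’ final code):

```python
# Squared error: sum of squared distances between actual ys and a line's ys.
def squared_error(ys_orig, ys_line):
    return sum((y_line - y) ** 2 for y, y_line in zip(ys_orig, ys_line))

# R squared = 1 - SE(y hat line) / SE(flat line at the mean of the ys).
def coefficient_of_determination(ys_orig, ys_hat):
    y_mean = sum(ys_orig) / len(ys_orig)
    se_y_hat = squared_error(ys_orig, ys_hat)                     # regression line's error
    se_y_mean = squared_error(ys_orig, [y_mean] * len(ys_orig))   # mean line's error
    return 1 - se_y_hat / se_y_mean
```

A perfect fit gives 1 (the y hat line has zero error), and a fit that does no better than the mean line gives 0.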
So, looking at R squared and the calculation of R squared, what’s a good value, right? What do we think might be a good value versus a bad value?
So let’s consider a value, say an R squared of 0.8. How would we arrive at 0.8? Well, for R squared to be 0.8, the squared error of the y hat line divided by the squared error of the mean of the ys would have to be 0.2, since that’s the only way to get 1 minus something equals 0.8.
So what would be an example of that ratio being 0.2? Well, say the squared error of the y hat line is 2 and the squared error of the mean of the ys is 10. If that were the case, we’d be saying the squared error of the y hat line is significantly lower than the squared error of the mean of the ys. Is that a good thing or a bad thing?
Well, that’s a pretty good thing. We would prefer it to be even lower than that, but you know, that’s pretty good. It means this data is probably pretty linear, right? So an R squared of 0.8 is pretty good.
What if your R squared was, like, 0.3, for example? How would we arrive at that? Well, we would need the squared error of the y hat line divided by the squared error of the mean of the ys to be 0.7, right? And we could get that with, you know, 7 over 10.
Now the squared error of the y hat line is a lot closer to the squared error of the mean of the ys, and that’s obviously not so good. We want the R squared value to be high.
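The two worked examples above reduce to simple arithmetic:

```python
# R squared = 1 - SE(y hat) / SE(mean of ys) for the two examples above.
print(1 - 2 / 10)   # 0.8: the line's error is far below the mean line's
print(1 - 7 / 10)   # roughly 0.3: the line barely beats the mean line
```

(Floating point makes the second value print as 0.30000000000000004 rather than exactly 0.3, which is expected.)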
How high is kind of determined by you.
So the accuracy of our model in this case, let’s call it 0.8. That is the R squared value. It’s not a percent accuracy; it’s R squared, the coefficient of determination. So now we know what the calculation for R squared is, and we know what squared error is.
And we know how to calculate the y hat line; we’ve already done that. We also know how to calculate the mean of the ys; actually, we’ve done that too, because it was part of our best fit line calculation. So we know how to do everything here, and we can definitely calculate this in Python.
So that is what we are going to be doing in the next video.
If you have questions, comments, or concerns up to this point, please feel free to leave them below.
Otherwise as always thanks for watching. Thanks for all the support and subscriptions and until next time.