
#### Practical Machine Learning with Python #10: R Squared Theory

R Squared Theory - Practical Machine Learning Tutorial with Python p.10

Hello everybody and welcome to part 10 of our machine learning tutorial series.

In this part what we’re going to be talking about

is we’ve been talking about linear regression

and we’ve got to the point where we could calculate the best fit line in our Python code.

But now the question is how good of a fit

is our best fit line. How do we determine the accuracy, right?

So the way that we determine accuracy is through R squared or

the coefficient of determination.

And the coefficient of determination is calculated using what’s known as squared error.

So first we need to figure out what the heck squared error is.

And then we can calculate R squared or the coefficient of determination.

So to exemplify this consider you know you’ve got two graphs.

And then you’ve got some plots on those graphs. And then what you want to do is

draw the best fit line. Okay. So something like this

and then…I have no idea something like this, right? And

if I asked you which one is a better fit?

You would say the one on the right.

And then if I asked why that is the better fit, you might think for a moment, but

you would probably come up with: well, the dots are closer to the line

on the one on the right than they are on the one on the left, anyway.

Now of course we don’t have any ticks on our axis here.

And so I might say that on the one on the left we're actually zoomed in really far.

And they’re actually much closer so you don’t really know.

But really, the question is how good of a fit it is.

How good of a fit is the best fit line?

So it’s very relative to your data set.

And you’ll see more why in just a moment.

So we know it's about the distance. So how do we actually calculate this?

Well, we use squared error so we’ve got a graph here.

And then we got some data points, some beautiful data points. And we have our best fit line.

And the way that we calculate squared error is we say the error

is the distance between the point and the line…and the best fit line.

And then what we say is it's not just error.

We want to square that value. So what we want is squared error, you know,

e squared. Okay. So you might ask why are we squaring it, right? Well

In one case, the distance might be positive.

And in another case the distance might be negative.

So one reason why we square it is so that we’re only dealing with positive values.

You might then ask why is it e squared and not like absolute value of e.

Well, we want to square it because what if you had a point that was like

way out here. That would be an outlier and

your linear data set should not have an outlier

because we only want to do linear regression on linear data.

Okay. That only makes sense. So we square the error because

we want to…

We want to penalize for outliers. So then you might ask, well, why not use a power of 4,

or 6 or 18 for that matter.

You could use these other ones if you want to penalize for outliers.

You can use a bigger value there if you want to penalize even heavier for outliers.
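
To make the outlier-penalty idea concrete, here's a toy comparison (the numbers are purely illustrative, not from the video): take three errors where one is an outlier, and look at how much of the total penalty the outlier accounts for under |e|, e², and e⁴.

```python
errors = [1, 1, 10]  # two small errors and one outlier

for power in (1, 2, 4):
    penalties = [abs(e) ** power for e in errors]
    outlier_share = penalties[-1] / sum(penalties)
    print(f"power {power}: outlier is {outlier_share:.1%} of the total penalty")
```

With |e| the outlier is about 83% of the total penalty; with e² it's about 98%; with e⁴ it's essentially all of it. The higher the power, the more a single outlier dominates.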

It just so happens the standard is going to be squared error.

And if you're not using squared error,

and maybe you're publishing something publicly, either in a paper, or maybe you've got

some data, some sort of module in Python or something, you'd want to

alert people to the fact that you are not doing it the way that

most people do it. Okay. So that is squared error.
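
As a sketch of what this looks like in code (assuming `ys_orig` holds the original y values and `ys_line` holds the y values of the line at the same x positions; the names are just illustrative):

```python
def squared_error(ys_orig, ys_line):
    # Sum of the squared vertical distances between each data point
    # and the corresponding point on the line.
    return sum((y_line - y_orig) ** 2 for y_orig, y_line in zip(ys_orig, ys_line))
```

For example, `squared_error([1, 2, 3], [1, 2, 5])` is `0 + 0 + 4 = 4`.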

Now how do we calculate the coefficient of determination, or R squared?

So R squared is calculated as follows: R squared equals

one minus the squared error, and generally you're going to see squared error denoted as SE.

So it's the squared error of the y hat line. What the heck is the y hat line? Remember,

y hat, best fit, regression line: all the same thing.

Divided by the squared error

of the mean of the ys.

That's the mean of the ys of your dataset.

So what might that actually look like. The mean of your ys might be that.

So it’s just a simple straight line. And what we’re trying to do

is compare the accuracy of that line to the accuracy of like the best fit line.

And honestly the best fit line is almost certainly going to be better

than the mean of the y’s. But we want it to be like way better. Okay.
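
Putting the two squared errors together gives the formula just described; here is a self-contained sketch (again, the names are just illustrative):

```python
from statistics import mean

def coefficient_of_determination(ys_orig, ys_line):
    # r^2 = 1 - SE(y hat line) / SE(mean-of-the-ys line)
    se_y_hat = sum((yl - yo) ** 2 for yo, yl in zip(ys_orig, ys_line))
    y_mean = mean(ys_orig)
    se_y_mean = sum((y_mean - yo) ** 2 for yo in ys_orig)
    return 1 - se_y_hat / se_y_mean
```

A perfect fit gives r² = 1, and a line that does no better than the mean of the ys gives r² = 0.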

So looking at R squared and the calculation of R squared.

What’s like a…what’s a good value, right?

What do we think might be a good value versus what do we think might be a bad value.

So let's consider a value. Let's say R squared

equals 0.8.

How would we arrive at 0.8? Well, we would know that

in order for R squared to be 0.8,

the squared error of the y hat line divided by

the squared error of the mean of the ys

would have to be equal to 0.2.

That's the only way we could get 1 minus something to be 0.8.

So what would be an example of this equation here

being 0.2? Well, let's say

the squared error of the y hat line is

2, and the squared error of the mean of the ys is 10.
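
Checking that arithmetic (using the hypothetical squared-error values from the example above):

```python
se_y_hat = 2    # squared error of the y hat (best fit) line
se_y_mean = 10  # squared error of the mean-of-the-ys line
r_squared = 1 - se_y_hat / se_y_mean
print(r_squared)  # 0.8
```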

So if that was the case,

we were saying, you know, the squared error of the y hat line is actually significantly lower

than the squared error of the mean of the ys. Is that a good thing or a bad thing?

Well that’s a pretty good thing. We would prefer it to be even lower than that. But you know that’s pretty good.

So that means this data is probably pretty linear, right?

So an R squared of 0.8 is pretty good.

What if your R squared was like 0.3 for example.

So if R squared, for example, was 0.3,

how might we arrive at that?

Well, we would need

the squared error of the y hat divided by the squared error of the mean of the ys.

We would need that to be 0.7, right?

And we could get that by you know 7 over 10.

And now the squared error of the y hat is

a lot closer to the squared error of the mean of the ys.

So obviously this is worse. So we want the R squared value

to be high.

How high is kind of determined by you.

But the accuracy in this case of our model

is…Let’s say we call it 0.8.

That is the R squared value. So it's not a percent accuracy.

It is the R squared. It’s the coefficient of determination.

That is the value. So now that we know what the calculation

for R squared is. And we know what squared error is.

And we know how to calculate the y hat line. We’ve already done that.

We know how to calculate the mean of the ys. We haven’t done that necessarily.

Actually we have done that, because that was part of our best fit line calculation. So we've done that.

We know how to do everything here. So we can definitely calculate this in Python.
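
As a preview of how the pieces might come together in code, here is a sketch using the least-squares slope/intercept formula from the earlier parts of this series; the sample data is just illustrative.

```python
from statistics import mean

def best_fit_slope_and_intercept(xs, ys):
    # Least-squares slope and intercept, as derived earlier in the series:
    # m = (mean(x)*mean(y) - mean(x*y)) / (mean(x)^2 - mean(x^2))
    m = ((mean(xs) * mean(ys) - mean(x * y for x, y in zip(xs, ys))) /
         (mean(xs) ** 2 - mean(x * x for x in xs)))
    b = mean(ys) - m * mean(xs)
    return m, b

def squared_error(ys_orig, ys_line):
    return sum((yl - yo) ** 2 for yo, yl in zip(ys_orig, ys_line))

def coefficient_of_determination(ys_orig, ys_line):
    # Compare the best fit line's squared error to the mean-of-ys line's.
    y_mean_line = [mean(ys_orig)] * len(ys_orig)
    return 1 - squared_error(ys_orig, ys_line) / squared_error(ys_orig, y_mean_line)

xs = [1, 2, 3, 4, 5, 6]
ys = [5, 4, 6, 5, 6, 7]

m, b = best_fit_slope_and_intercept(xs, ys)
regression_line = [m * x + b for x in xs]
print(coefficient_of_determination(ys, regression_line))  # about 0.584
```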

So that is what we are going to be doing in the next video.

If you have questions, comments, or concerns up to this point, please feel free to leave them below.

Otherwise as always thanks for watching. Thanks for all the support and subscriptions and until next time.
