How to program the Best Fit Slope - Practical Machine Learning Tutorial with Python p.8

What is going on, everybody, and welcome to part eight of our machine learning tutorial series.
In this part we're going to start working on creating a simple linear regression algorithm from scratch in Python.
To start, we know that the definition of a line is y = mx + b.
We already know x, simply because that's the value on the x-axis.
But we still need to know m and b.
m is going to be our best fit slope,
and b is the y-intercept.
So first we're going to calculate m, the slope.
I'll pull up the equation again just as a reminder:
m equals the mean of the x values times the mean of the y values, minus the mean of the x's times the y's.
All of that is divided by the square of the mean of the x's, minus the mean of the squared x's. Okay.
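Written out as a formula, the slope described above looks like the following. The overbar notation for a mean is an assumption of this note; the video only reads the equation aloud.

```latex
m = \frac{\bar{x}\,\bar{y} - \overline{xy}}{(\bar{x})^{2} - \overline{x^{2}}}
```

Here the first term of the denominator is the square of the mean of the x's, and the second is the mean of the squared x's.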
Easy enough. So now we're going to translate that into Python code.
The first thing we're going to do is, from statistics, import mean.
And then we're going to import numpy as np.
You should be able to guess why we're bringing in mean:
there are quite a few uses of the mean in that equation.
Also, just for the record, a regression line is just a straight line.
For example, let me pull up an image here.
These are some data points, and then you've got this straight red line. That's your regression line.
That's also your best fit line.
You might even hear people call it the y-hat line if you're talking to a statistician. Anyway.
Now we're going to define some simple values.
You can get to the point where you're using real data,
but I think the easiest thing to do is to just define some simple data.
So we're going to say the xs are 1, 2, 3, 4, 5 and 6.
And then the ys, we'll say, are 5, 4, 6, 5, 6 and 7.
You don't have to do this part, but we're going to visualize this data real quick.
So import matplotlib.pyplot as plt.
We're not going to make it pretty or anything. I'm just going to say plt.plot(xs, ys),
then plt.show(), just so we can see the data we're working with here. Okay.
So this is the data, but of course we just drew it as a line. So let's make it a scatter plot. Okay.
So that's the data that we're working with.
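A minimal sketch of the plotting step just described, assuming matplotlib is installed. The helper name `show_data` is hypothetical and not from the video; `plt.scatter` is used here because the speaker switches from a line plot to a scatter plot.

```python
import matplotlib.pyplot as plt

# The sample data defined in the video
xs = [1, 2, 3, 4, 5, 6]
ys = [5, 4, 6, 5, 6, 7]


def show_data(xs, ys):
    """Draw the points as a scatter plot (hypothetical helper, not from the video)."""
    plt.scatter(xs, ys)  # plt.plot(xs, ys) would join the points with a line instead
    plt.show()
```

Calling `show_data(xs, ys)` opens the plot window; the call is left out of the block so the sketch can be imported without a display.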
It's just a simple scatter plot, but you can probably see already
that this is positively correlated data.
And you could probably picture a best fit line that runs up something like this. Anyway,
that's our data, and I'm going to move this over. Okay.
Now I'm going to get rid of showing the graph.
Now, if you recall the previous examples, our data was not actually a Python list.
It wasn't a Python array either, because that doesn't exist; instead it was a numpy array.
So we're going to change this to np.array.
You can just wrap the list in parentheses, like that. That basically converts it to a numpy array.
And then we're also going to set the data type.
So we're going to say dtype=np.float64. Left alone, numpy would infer an integer type for a list of whole numbers like this, so this makes the values floats.
We're mainly doing this because we're probably going to be revisiting linear regression,
and it'll be at a time where the data type actually matters. For now,
you don't have to put the data type there.
We're just being very explicit with the data type.
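A minimal sketch of the conversion described above. The comparison with an array built without `dtype` is an addition of this note, included only to show what the explicit argument changes.

```python
import numpy as np

# Wrap the plain lists in np.array and state the data type explicitly
xs = np.array([1, 2, 3, 4, 5, 6], dtype=np.float64)
ys = np.array([5, 4, 6, 5, 6, 7], dtype=np.float64)

print(xs.dtype)  # float64

# Without dtype, numpy infers an integer type from a list of whole numbers
inferred = np.array([1, 2, 3, 4, 5, 6])
print(np.issubdtype(inferred.dtype, np.integer))  # True
```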
So now we want to get the best fit slope. Let's define a function
that generates the best fit slope.
Okay. So best_fit_slope.
We know we pass the xs and ys through,
and eventually we want to get to the point where we return m.
And then we would just say m = best_fit_slope(xs, ys), right?
We're done! Right?
Anyway, that's a nice skeleton function there.
The first order of business is the mean of the xs times the mean of the ys, right?
So how do we calculate that?
Well, we can start off by saying m equals... now this is not complete, of course, but we're going to say
m equals, at first, the mean of the xs
multiplied by the mean of the ys.
So far so good. What was the next step?
The mean of the xs times the mean of the ys,
and then it's minus the mean of the xs times the ys.
So now we need to add that.
The mean of the xs times the mean of the ys is one of the, I hate to say variables,
one of the parts. So I'm going to wrap it in more parentheses and put a space inside them.
You don't have to add the space there. That's definitely not PEP 8.
It just makes this easier to read, because this is going to be kind of a long one.
Anyway, minus, and that was the mean of the xs times the ys, right?
Okay. So now, going back to our formula here,
we have basically done the entire top of the fraction.
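The top of the fraction can be checked on its own before the denominator is written. This is a sketch using the sample data from the video; the variable name `numerator` is an assumption of this note.

```python
from statistics import mean

import numpy as np

xs = np.array([1, 2, 3, 4, 5, 6], dtype=np.float64)
ys = np.array([5, 4, 6, 5, 6, 7], dtype=np.float64)

# mean(xs) * mean(ys) is 3.5 * 5.5 = 19.25
# xs * ys multiplies element by element, and its mean is 123 / 6 = 20.5
numerator = (mean(xs) * mean(ys)) - mean(xs * ys)
print(numerator)  # -1.25
```

Note that `xs * ys` only works element by element because both are numpy arrays; two plain Python lists cannot be multiplied this way.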
So now we need to do the next layer. Coming back to the code,
we're going to add a third parenthesis here.
So I'm going to add another parenthesis and a space. You've got parenthesis, parenthesis here.
Then I'm going to add a space
and a slash. That's our division sign.
And then I'm going to hit enter.
The reason why I'm able to hit enter
is because of this open parenthesis here.
Okay. So that's why we're encasing all of this in parentheses.
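The reason the line break is allowed is that Python keeps reading an expression across lines for as long as a parenthesis is still open. A small sketch with placeholder numbers (not values from the video):

```python
# The outer parenthesis stays open at the end of the first line,
# so Python treats the second line as part of the same expression.
ratio = ( (6 - 2) /
          (3 - 1) )
print(ratio)  # 2.0
```

Without the outer parentheses, ending a line on `/` would be a syntax error.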
So now, going back to the bottom of the fraction.
We need the mean of the xs to the power of 2.
Okay. So how might we do that?
Well, first of all, let's consider what the power of 2 actually is.
The mean of the xs to the power of 2
is the mean of the xs times the mean of the xs. Okay.
In Python there are a few different things we can do.
A lot of times you might try mean(xs)^2, like that.
But let's run that really quick, and we'll see that
we get an unsupported operand error for the data type we're using. Okay.
Another option is the double asterisk, mean(xs)**2,
and it looks like that one is acceptable.
Another option is literally the mean of the xs times the mean of the xs.
Okay. Both of those will give you what you're looking for.
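The three attempts above can be compared side by side. This is a sketch; the names `x_bar` and `caret_works` are assumptions of this note. In Python `^` is the bitwise XOR operator rather than exponentiation, which is why it fails on a float.

```python
from statistics import mean

import numpy as np

xs = np.array([1, 2, 3, 4, 5, 6], dtype=np.float64)
x_bar = mean(xs)  # 3.5

# Attempt 1: the caret is bitwise XOR, which is not defined for floats
try:
    x_bar ^ 2
    caret_works = True
except TypeError:
    caret_works = False

# Attempt 2: the exponent operator
squared_pow = x_bar ** 2

# Attempt 3: plain multiplication
squared_mul = x_bar * x_bar

print(caret_works, squared_pow, squared_mul)
```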
Now finally, it was the square of the mean of the xs, minus the mean of the squared xs.
And that was minus, right?
So this part is all one piece. At this point we're going to go minus
the mean of the squared xs.
A lot of times you can get away with writing xs^2, but I don't think that will work with our data type.
Actually, that was a typo; it should be 2. What an amateur.
Anyway, all right. So that's not going to work out for us.
Another option is xs**2. Okay, that's acceptable.
And then there's of course xs times xs, like this. Okay.
So you can do whichever one makes you feel better and sleep better at night.
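The two terms of the denominator sound alike when read aloud but are different quantities. A sketch with the sample data, showing that `xs ** 2` and `xs * xs` agree with each other while differing from the square of the mean (variable names are assumptions of this note):

```python
from statistics import mean

import numpy as np

xs = np.array([1, 2, 3, 4, 5, 6], dtype=np.float64)

# Square of the mean: 3.5 squared
square_of_mean = mean(xs) ** 2

# Mean of the squares: (1 + 4 + 9 + 16 + 25 + 36) / 6 = 91 / 6
mean_of_squares_pow = mean(xs ** 2)
mean_of_squares_mul = mean(xs * xs)

print(square_of_mean)       # 12.25
print(mean_of_squares_pow)  # roughly 15.1667
```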
Regardless, there is our m, and we can even go and print out m.
Okay, so m is negative fifteen point two six, is what we're getting here.
Let me see here... I'm not content with that.
Let's see... that does close all of that off. Slash...
I'm getting myself confused now.
Let me make sure we closed this off right. I don't think that should be what we're getting there.
That part is about right, and then we wanted this part.
Because this whole thing needs to be divided by this whole thing,
so the mean of the xs goes inside like this.
Let's run that one more time.
That looks to be more along the lines of what I'm looking for. Okay.
So welcome to the world of PEMDAS, all right?
This is the order of operations, right? Parentheses, exponents, multiplication, division, addition, subtraction.
If you want to get around the order of operations, you need to use parentheses correctly.
My issue was that what we needed to do was divide by this minus this, both of those things together.
But instead, division was occurring before subtraction.
So we were actually doing this divided by this, and then subtracting this from the final answer,
which was why we were getting such a large negative number there.
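Putting the pieces together, here is a sketch of the finished function next to a version with the denominator left ungrouped. The second function, `slope_missing_parentheses`, is a hypothetical reconstruction of the mistake described above, not code shown verbatim in the video.

```python
from statistics import mean

import numpy as np

xs = np.array([1, 2, 3, 4, 5, 6], dtype=np.float64)
ys = np.array([5, 4, 6, 5, 6, 7], dtype=np.float64)


def best_fit_slope(xs, ys):
    """Slope with the whole denominator grouped in parentheses."""
    m = ( ((mean(xs) * mean(ys)) - mean(xs * ys)) /
          ((mean(xs) ** 2) - mean(xs ** 2)) )
    return m


def slope_missing_parentheses(xs, ys):
    """Hypothetical buggy version: the division happens before the final subtraction."""
    m = ( ((mean(xs) * mean(ys)) - mean(xs * ys)) /
          (mean(xs) ** 2) - mean(xs ** 2) )
    return m


print(best_fit_slope(xs, ys))             # roughly 0.4286
print(slope_missing_parentheses(xs, ys))  # roughly -15.268
```

The only difference between the two is one pair of parentheses around the denominator, which is enough to turn a small positive slope into a large negative one.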
Okay. So again, this is the slope of a line,
and first of all, a slope of negative 15 is kind of weird.
Oh, I was printing out PEMDAS.
It's weird to have a negative slope for the line
when you clearly have positively correlated data.
But anyway, we have our m.
Now we need just one more thing, and that is our b.
So that's what we're going to be working on in the next tutorial: calculating b.
And once we have that, we can do linear regression.
So stay tuned for the next video. If you've got questions, comments, concerns, whatever, up to this point,
please feel free to leave them below. Otherwise, as always, thanks for watching, thanks for all the support and subscriptions, and until next time.



This episode uses Python code to calculate the slope m of the regression line and, along the way, reviews the order of operations.