Regression forecasting and predicting - Practical Machine Learning Tutorial with Python p.5

What’s going on, everybody!
Welcome to the fifth Machine Learning and fourth regression tutorial.
In this tutorial we’ll be building on the last one, where we created this linear regression algorithm.
We found that it got great accuracy and all that.
And now we’re ready to actually predict out into the unknown, all right?
So it turns out that we actually already do have some unknown data,
simply because we’re forecasting out with the shift, which is about 30 days.
So we can actually work with that.
So what we’re gonna do is, where we define our X…
Let’s do the following.
Let’s actually cut this, come down here, and paste.
And we’re gonna take this dropna, cut, and paste.
And now I am reminded why I was doing that negative forecast_out, X[:-forecast_out].
So what we’re gonna do is X = X[:-forecast_out], so up to the point of negative forecast_out.
And then we’re gonna do X_lately = X[-forecast_out:].
And then we’re gonna drop the missing data when we go to create the labels.
In this way we have both our X and our X_lately defined.
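The split described here can be sketched with NumPy on made-up data. This is only an illustration, not the tutorial's own file: `X_all`, the tiny `forecast_out = 3`, and the feature values are assumptions. The sketch takes `X_lately` before trimming, because slicing the last rows from an already trimmed `X` would pick up the wrong rows.

```python
import numpy as np

forecast_out = 3  # the tutorial forecasts about 30 days; 3 keeps the demo small

# Ten rows of two made-up features, standing in for the real feature matrix.
X_all = np.arange(20, dtype=float).reshape(10, 2)

# Take the most recent rows first, then trim them off the training features.
X_lately = X_all[-forecast_out:]  # rows with no label yet: predict on these
X = X_all[:-forecast_out]         # rows that do have labels: train/test on these

print(X.shape, X_lately.shape)
```

Keeping the full matrix under a separate name makes the order of the two slices harmless.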
So X_lately is basically the stuff we’re actually gonna predict against.
So we have the Xs, and we just need to figure out what m and b are, right? For y = mx + b.
Then we get the answer for y. We’ve done the linear regression.
So we’re gonna predict against this X_lately, which we actually don’t have a y value for.
Which is why we were not training or testing on that data.
So now we have X_lately.
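To make the y = mx + b step concrete, here is a minimal sketch that is not part of the tutorial's code. It fits a straight line to a few made-up points with `np.polyfit` and then uses the resulting m and b to compute y for an unseen x. The points and the names `xs`, `ys`, and `x_new` are illustrative assumptions.

```python
import numpy as np

# Made-up points that lie exactly on y = 2x + 1 (illustrative data only).
xs = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
ys = 2.0 * xs + 1.0

# A degree-1 polynomial fit returns the slope m and the intercept b.
m, b = np.polyfit(xs, ys, 1)

# With m and b known, predicting y for an unseen x is just y = mx + b.
x_new = 10.0
y_new = m * x_new + b
print(m, b, y_new)
```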
So the next thing we’re gonna do is come down here and actually run this really quick,
just to make sure we’re still getting the accuracy and we don’t have an incorrect number of values.
OK, no, we don’t. So we’re good.
So we’ve got 96% accuracy. Awesome!
So we come down and comment this out. We don’t really need that anymore.
And now, to predict stuff, we also need to make sure we scale.
So what we need to do is take this… almost made a mistake there… and this.
Now let’s run that one more time. I wanna make sure I don’t screw things up.
Good. OK.
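The transcript does not show which lines were moved, so the following is only a sketch of one consistent ordering: scale the whole feature matrix once, then slice it, so that the rows used for forecasting are scaled with the same statistics as the training rows. It assumes `preprocessing.scale` from scikit-learn as the scaler and uses made-up feature values.

```python
import numpy as np
from sklearn import preprocessing

forecast_out = 3

# Made-up feature rows on very different scales (illustrative data only).
X_all = np.array([[1.0, 200.0], [2.0, 220.0], [3.0, 210.0], [4.0, 250.0],
                  [5.0, 240.0], [6.0, 260.0], [7.0, 300.0], [8.0, 280.0]])

# Scale every row together: each column ends up with mean 0 and unit variance.
X_scaled = preprocessing.scale(X_all)

# Only now split into the rows to forecast on and the rows to train on.
X_lately = X_scaled[-forecast_out:]
X = X_scaled[:-forecast_out]

print(X.shape, X_lately.shape)
```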
So now what we’re gonna do is come down here.
We need to predict based on the X data.
Once you have a classifier, doing a prediction is super easy.
We’re gonna say forecast_set = clf.predict().
In here you can pass a single value, or you can pass an array of values and make a prediction per value in that array.
And that’s what we’re gonna do, right?
We’ve got these 30 days of data right here, in X_lately. So the last 30 days.
We want to predict with X_lately.
So then we have forecast_set. Now we can print forecast_set, confidence, and forecast_out,
just so we know how many days we’re forecasting out here.
Uh-oh, confidence… I’m sorry. I changed that; I usually use confidence.
So accuracy. Try again.
Pull this up… and yeah, there we go.
So we got our predicted values.
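The prediction step can be sketched end to end on synthetic data. This is a hedged illustration rather than the tutorial's script: the random features, the noise-free linear label, and the fixed 40/7 train/test split are assumptions. The calls `LinearRegression().fit`, `.score`, and `.predict` are the scikit-learn methods the transcript refers to through `clf`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

forecast_out = 3
rng = np.random.default_rng(0)

# Synthetic features and a noise-free linear label (illustrative data only).
X_all = rng.normal(size=(50, 2))
y_all = 3.0 * X_all[:, 0] - 2.0 * X_all[:, 1] + 5.0

# Pretend the labels for the last forecast_out rows are still unknown.
X_lately = X_all[-forecast_out:]
X = X_all[:-forecast_out]
y = y_all[:-forecast_out]

# Simple fixed split into training and testing rows.
X_train, X_test = X[:40], X[40:]
y_train, y_test = y[:40], y[40:]

clf = LinearRegression()
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)  # R^2 on the held-out rows

# One prediction per row of X_lately, returned in the same order.
forecast_set = clf.predict(X_lately)
print(forecast_set, accuracy, forecast_out)
```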
So these are basically the next 30 days of unknown values for us.
These are just straight-up the stock prices, which is pretty cool,
because that whole scaling part is also playing a major role here, and it’s still outputting stock prices that are of decent value to us.
Anyway, I think it’s cool. So these are the next 30 days’ prices.
So then let’s say you want to graph that.
What are we gonna do? We’re gonna come to the top.
And again, we’re gonna just blast through graphing this.
If you’re confused or whatever, I have Matplotlib tutorials,
so you can check those out if you want to learn more about graphing.
But otherwise we’re gonna import, and in fact we’ll just add datetime here.
And then we’re gonna import matplotlib.pyplot as plt, and from matplotlib import style.
And we’re gonna say style.use('ggplot').
pyplot is just to plot stuff, style is how we make it look decent,
and style.use() is how you specify which decent-looking style you want.
So now we’ll come down here, and we’re gonna say df['Forecast'] = np.nan.
This just specifies that the entire column is full of not-a-number data, and you’ll see why in a moment.
But we’ll actually put some information there shortly.
Now we need to find out what the last day was.
This may not be the best way to do something like this, but this is how we’re gonna actually plot this on the graph.
So we say last_date = df.iloc[-1].name. This is the very last day, and we get the name of that.
And we’re gonna say last_unix = last_date.timestamp().
And then one_day. This is how many seconds are in a day.
You can just do the math there if you want, but it’s 86400.
And then next_unix would be the next day, right?
We know these are daily prices, so we’re just gonna kind of hard-code this part of it, just so we can create the graph.
So next_unix = last_unix + one_day.
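The date arithmetic can be checked on a tiny data frame. This is a sketch on made-up prices, not the tutorial's data. One deliberate difference is labeled in the code: the conversion back to a date passes `tz=datetime.timezone.utc`, which the transcript does not do, so that the result does not depend on the local time zone of the machine.

```python
import datetime
import pandas as pd

# Three made-up daily prices indexed by date (illustrative data only).
df = pd.DataFrame(
    {"Adj. Close": [10.0, 10.5, 11.0]},
    index=pd.to_datetime(["2016-01-04", "2016-01-05", "2016-01-06"]),
)

last_date = df.iloc[-1].name        # index label of the last row
last_unix = last_date.timestamp()   # seconds since the Unix epoch
one_day = 24 * 60 * 60              # 86400 seconds in a day
next_unix = last_unix + one_day

# Assumption: convert in UTC; a plain fromtimestamp(next_unix) uses local time.
next_date = datetime.datetime.fromtimestamp(next_unix, tz=datetime.timezone.utc)
print(last_date, next_date)
```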
So when you do a prediction, the prediction has no idea what date that is for, right?
So remember, when you’re doing machine learning, X and y do not necessarily correspond to the axes on the graph.
In this case, they don’t. X is the features and y is the label. It just so happens the label is the price, so y is correct.
But is the X correct? No, because the date is not a feature.
So that’s why we have to do this workaround here, because we actually don’t have the date values.
I have lost my mouse… there we go.
Anyway, that’s Unix time. OK.
So now we get the dates.
And now we actually populate the data frame with the new dates and the forecast values.
The way we’re gonna do that is we’re gonna say: for i in forecast_set,
next_date = datetime.datetime.fromtimestamp(next_unix).
And now we’re just gonna say next_unix += one_day.
And then df.loc[next_date] equals, and then we’re gonna do a one-liner for loop here.
So we’re gonna say [np.nan for _ in range(len(df.columns) - 1)], using an underscore for something we don’t care about, and then + [i].
So what we’ll do is iterate through the forecast_set, taking each forecast and day,
and then setting those as the values in the data frame, basically making the future features not a number. OK.
And then the last line just takes all of the first columns, sets them to not-a-numbers, and makes the final column whatever i is, which is the forecast in this case.
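The loop can be run on a three-row data frame to see what it produces. This is a sketch under stated assumptions: the prices and the hard-coded `forecast_set` are made up, and `pd.Timestamp(next_unix, unit="s")` is swapped in for `datetime.datetime.fromtimestamp(next_unix)` so the dates do not shift with the local time zone.

```python
import numpy as np
import pandas as pd

# Made-up prices and labels indexed by date (illustrative data only).
df = pd.DataFrame(
    {"Adj. Close": [10.0, 10.5, 11.0], "label": [10.5, 11.0, np.nan]},
    index=pd.to_datetime(["2016-01-04", "2016-01-05", "2016-01-06"]),
)
forecast_set = np.array([11.2, 11.4, 11.1])  # stand-in for clf.predict(X_lately)

df["Forecast"] = np.nan  # new last column, empty for all known days
one_day = 86400
next_unix = df.iloc[-1].name.timestamp() + one_day

for i in forecast_set:
    # Assumption: build the date in UTC instead of using fromtimestamp().
    next_date = pd.Timestamp(next_unix, unit="s")
    next_unix += one_day
    # NaN for every column except the last, which receives the forecast.
    df.loc[next_date] = [np.nan for _ in range(len(df.columns) - 1)] + [i]

print(df.tail(4))
```

Each pass adds one new row dated a day later, which is why `Forecast` has to be the last column when the row list is built.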
So now what we’re gonna do is say df['Adj. Close'].plot().
And then we’re gonna say df['Forecast'].plot().
And then we’re just gonna do plt.legend(loc=4); we’ll put that in the fourth location.
That’s like the bottom right.
And then we’re gonna say plt.xlabel('Date'), plt.ylabel('Price'),
and finally plt.show().
OK. So we zoom to that. Hopefully that for loop is gonna work out. We’ll find out shortly.
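The plotting calls can be tried on a small made-up frame. This sketch makes two assumptions that are not in the transcript: it selects the non-interactive `Agg` backend so it runs without a display, and it leaves `plt.show()` commented out for the same reason. The data values are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so this runs without a display
import matplotlib.pyplot as plt
from matplotlib import style
import numpy as np
import pandas as pd

style.use("ggplot")

# Known prices for three days, forecasts for the next three (illustrative data only).
index = pd.date_range("2016-01-04", periods=6, freq="D")
df = pd.DataFrame(
    {
        "Adj. Close": [10.0, 10.5, 11.0, np.nan, np.nan, np.nan],
        "Forecast": [np.nan, np.nan, np.nan, 11.2, 11.4, 11.1],
    },
    index=index,
)

ax = df["Adj. Close"].plot()   # known data
df["Forecast"].plot()          # drawn on the same axes
plt.legend(loc=4)              # location 4 is the lower right corner
plt.xlabel("Date")
plt.ylabel("Price")
# plt.show()  # uncomment in an interactive session to open the window
```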
See… and the graph here… OK.
So this is our actual graph of the data here. Pull this up…
As you can see, this is the known data here, and then over here is our predicted data.
So let me zoom in to that spot.
So this is the future prediction here, the forecast. OK.
So it’s just a really quick way to visualize the data.
And really, the complex part, the reason why we had all this nasty crap in here, was simply so you can have dates on the x-axis.
Because that’s how I am. I want to have the dates there.
Anyway, so that’s how you can actually forecast out the data and actually do a prediction.
But the crux of doing prediction with scikit-learn is right here.
And just remember, you can pass a single value, or you can pass an array of values, and it will just output predictions in the same order as the array of values.
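A small sketch of both ways of calling `predict`, on made-up data. One detail worth knowing is that scikit-learn estimators expect a 2D array of shape (n_samples, n_features), so a single sample is reshaped to one row before it is passed in. The feature values and the names `batch` and `single` are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data where y = x0 + 2 * x1 exactly (illustrative only).
X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 3.0]])
y = X[:, 0] + 2.0 * X[:, 1]

clf = LinearRegression().fit(X, y)

# An array of samples: one prediction per row, in the same order as the rows.
batch = np.array([[4.0, 1.0], [5.0, 2.0], [6.0, 0.0]])
batch_pred = clf.predict(batch)

# A single sample: reshape it to one row of features first.
single = np.array([4.0, 1.0])
single_pred = clf.predict(single.reshape(1, -1))

print(batch_pred, single_pred)
```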
And then from there we just use logic to know the dates, because each increment is a day, right? Each price reported was for one day.
So that just means that each forecast was one day later, right?
So we just kind of use our brains for that one.
So, anyway…
And I guess the other thing to think about is df.loc, just in case.
I’m not sure we actually covered that in the Pandas tutorials.
But what happens there is basically that .loc references the index of the data frame.
So with df.loc[next_date], basically what that’s saying is that next_date is a datestamp, right? And that next_date is the index of the data frame.
Maybe it’ll help to print df.head().
If you’re not confused at this point, feel free to carry on to the next video. We’ll be talking about pickling.
But if you’re confused about that for loop, I just want to explain it so no one is like, “What the hell?”
So here, right, the date is the index. So when we say df.loc[next_date], we’re referring to the index.
And if that index doesn’t exist, it’s gonna be created. And if it did exist, we’re just gonna replace it.
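The create-or-replace behavior of `.loc` assignment can be seen on a two-row frame. This is a minimal sketch with made-up values; `new_day` and `existing_day` are illustrative names.

```python
import pandas as pd

nan = float("nan")

# Two made-up rows indexed by date (illustrative data only).
df = pd.DataFrame(
    {"Adj. Close": [10.0, 11.0], "Forecast": [nan, nan]},
    index=pd.to_datetime(["2016-01-04", "2016-01-05"]),
)

# The label is not in the index yet, so assignment appends a new row.
new_day = pd.Timestamp("2016-01-06")
df.loc[new_day] = [nan, 12.5]

# The label already exists, so assignment overwrites that row.
existing_day = pd.Timestamp("2016-01-05")
df.loc[existing_day] = [11.5, nan]

print(df)
```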
OK. Then we’re saying [np.nan for _ in range(len(df.columns) - 1)].
What the heck is that?!
Well, that is just a list of values that are np.nan.
So basically we’re saying it’s np.nan for Adj. Close, HL_PCT, all this stuff is just not a number, right?
Because this is in the future. We don’t have information on that data.
OK. Then, coming back down here, we say + [i].
Remember, i is the forecast, right? i in forecast_set.
So basically it’s just the list plus one value. It’s this huge list…
well, not that huge, it’s just this many columns, right? And we just add the forecast at the very end.
So that’s just our super hacky way of doing the following. I said head there, but it’s probably more useful to use tail,
so you can see the end of this data frame.
These are all those np.nans, and then finally there’s just the forecast, OK?
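The row expression can be evaluated on its own, without a data frame. In this sketch the column list is partly hypothetical: `Adj. Close`, `HL_PCT`, and `Forecast` appear in the transcript, while the other names and the forecast value are stand-ins for illustration.

```python
import numpy as np

# Column names: the middle ones are hypothetical placeholders.
columns = ["Adj. Close", "HL_PCT", "PCT_change", "Adj. Volume", "label", "Forecast"]
forecast_value = 123.45  # stand-in for one i from forecast_set

# One NaN per column except the last, then the forecast appended at the end.
row = [np.nan for _ in range(len(columns) - 1)] + [forecast_value]

print(row)
```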
So that’s all that is.
Sorry if that was a little confusing. Hopefully the explanation worked. If not, feel free to ask questions and I’ll be happy to clarify.
Now, there is actually one more thing I want to show you all before we dive into regression and actually write a regression algorithm all on our own, and that’s pickling.
The reason why you want to pickle is this: imagine that rather than training a classifier on this relatively small data set…
we just had daily values for the last few years, and if you saved this to a file, it’s probably like 500 kilobytes or something, who knows…
let’s say you have intraday data. You’ve got like two gigabytes’ worth of data.
That’s gonna take a while to train the classifier on that data.
So wouldn’t it be nice if, every time you want to make a prediction…
So consider making predictions on future data: every time you want to make a prediction, you’d have to train the classifier.
Does that not just sound crazy? So yes, that sounds crazy.
So in the next tutorial we’re gonna be talking about pickling, which will let you save your classifier and then just quickly load it in without any training time.
So it’s definitely very useful with machine learning classifiers.
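The general idea of pickling can be previewed with the standard library alone. This is not the next tutorial's code, which is not shown here: a plain dict of made-up coefficients stands in for a trained classifier, and `predict` is a hypothetical helper written just for this sketch.

```python
import pickle

# Stand-in for a trained model: made-up coefficients and intercept.
model = {"coef": [3.0, -2.0], "intercept": 5.0}

# Serialize to bytes (this could be written to a file) and load it back.
blob = pickle.dumps(model)
restored = pickle.loads(blob)

def predict(params, features):
    """Hypothetical helper: apply the stored linear parameters to one sample."""
    return sum(c * x for c, x in zip(params["coef"], features)) + params["intercept"]

# The restored object predicts without any retraining step.
prediction = predict(restored, [1.0, 2.0])
print(prediction)
```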
So anyway, that’s what we’ll talk about in the next video.
Questions, comments, leave them below. Otherwise, as always, thanks for watching, thanks for the support and subscriptions, and until next time.


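Translation credits
Video overview

This lesson makes predictions with the linear regression model from the previous lesson, successfully forecasting the stock prices for the final 30 days missing from the data set. It also briefly discusses how to quickly swap learning algorithms and how to visualize the data with the matplotlib module.

Transcription

[B]刀子

Translation

[B]刀子

Reviewer

审核团1024

Video source

https://www.youtube.com/watch?v=QLVMqwpOLPk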
