Regression Intro - Practical Machine Learning Tutorial with Python p.2

Alright, so now we are going to get started with setting up a simple linear regression example.
The first thing we need to make sure we have is scikit-learn, pandas, and Quandl. So open up a terminal, command prompt, whatever,
and pip install sklearn, pip install quandl, and pip install pandas.
Once you have all of those, you are good to go. If you still need to install them, go ahead and pause the video and pick back up once you have them.
OK, so once you have those, let's go ahead and get started with a simple example. We are starting with regression,
and the idea of regression is to take continuous data and figure out a best-fit line for that data.
Basically, what that boils down to is that we are trying to model your data,
and the way we do that, at least with simple linear regression, is with a straight line. We will talk more about the equation of a line later on,
but as you might remember from school, it is y = mx + b,
so if you have x, and you also have m and b, you can figure out what y is.
So basically the whole point of regression is to find out what m and b are.
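
To make the idea of "finding m and b" concrete, here is a minimal sketch of a least-squares fit in plain Python. The function name `fit_line` and the sample points are assumptions made for illustration; they do not come from the video, which uses scikit-learn later in the series.

```python
from statistics import mean

def fit_line(xs, ys):
    """Return (m, b) for the least-squares line y = m*x + b."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Slope: co-variation of x and y divided by the variation of x.
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    m = numerator / denominator
    # Intercept: the best-fit line passes through the point of means.
    b = y_bar - m * x_bar
    return m, b

# Made-up points that lie exactly on y = 2x + 1.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
m, b = fit_line(xs, ys)

# Once m and b are known, any x gives a y.
predicted = m * 10 + b
```

With real data the points will not sit exactly on a line, and the same formula then gives the line with the smallest total squared error.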
So for example, a lot of people use regression with stock prices, so that's what we are going to do, at least in this one.
The idea is that this is continuous data: you've got months and months of stock prices, and each price belongs to its own unique day.
But all the data is one dataset together, as opposed to classification, where each group of data has its own unique label.
With machine learning, at least with supervised machine learning, everything boils down to features and labels.
Features are like your attributes, or in this case, the continuous data.
So let's go ahead and get started, and we'll talk a little bit more about features.
So first of all, let's go ahead and import pandas as pd.
And then we are going to import Quandl with a capital Q.
Then we are going to say df, for dataframe, equals Quandl.get, and we'll put in the ticker.
You can get this from Quandl. If you go to quandl.com, you can use the search to find things; if I search for Google stock, we can probably find it.
Let's see, I am trying to find it. We are using the WIKI dataset, so let's just filter by free. You can find all kinds of different datasets here, but we are looking simply for the WIKI one.
Here it is.
Once you pick a dataset, you can either just download it here or, more importantly, here is the Quandl code, and you can click on Python, and this is the exact statement to get it.
If you have an account, you can make basically unlimited requests for free data. We are not going to use an account here, so we are not going to use an auth token.
If you don't have an account, I think it's limited to something like 50 calls a day.
We are only using Quandl briefly here, and then maybe later on, so you really don't need to create an account, but if you like Quandl, you might as well make one at some point.
So anyway, it's Quandl.get, and then WIKI/GOOGL was the ticker there.
Then we can simply print df.head, just so we can see what it is we are working with.
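
The steps just described can be sketched as follows. Because the download needs network access, the Quandl call is shown only as a comment, and a small hand-built dataframe stands in for the result. The prices are invented, and the column names are an assumption about how the WIKI dataset is laid out (raw columns followed by `Adj.` columns).

```python
import pandas as pd

# With network access and the Quandl client installed, the call from the
# video would look roughly like this (assumption: the dataset code is
# "WIKI/GOOGL", and the module is imported as `Quandl` as in the video):
#   df = Quandl.get("WIKI/GOOGL")

# Offline stand-in with invented values, indexed by trading day.
df = pd.DataFrame(
    {
        "Open": [1000.0, 1010.0, 505.0],
        "High": [1015.0, 1020.0, 512.0],
        "Low": [995.0, 1002.0, 500.0],
        "Close": [1010.0, 1008.0, 510.0],
        "Volume": [1200.0, 1500.0, 3100.0],
        "Adj. Open": [500.0, 505.0, 505.0],
        "Adj. High": [507.5, 510.0, 512.0],
        "Adj. Low": [497.5, 501.0, 500.0],
        "Adj. Close": [505.0, 504.0, 510.0],
        "Adj. Volume": [2400.0, 3000.0, 3100.0],
    },
    index=pd.to_datetime(["2014-03-25", "2014-03-26", "2014-03-27"]),
)
df.index.name = "Date"

# head() shows the first rows (up to five) so the columns can be inspected.
print(df.head())
```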
We'll see that basically each column here is a feature.
So the open, high, low, and close are all features.
In machine learning you can have all the features you want, but you want meaningful features, features that actually have something to do with your data.
Some people are pretty avid believers in ideas like pattern recognition with stock prices, and that might be you, but do you need every single one of these open, high, low, and close columns to do pattern recognition? No.
Also, you'll notice we've got open, high, low, close, and volume, and then the adjusted versions.
Adjusted means adjusted for things like stock splits.
For a stock split, maybe your company has 10 shares and each share is $1000.
You decide you want people to be able to buy shares of your company for less than $1000,
so you might say, OK, bam, every share is now two shares.
Now we have 20 total shares and the share price is $500.
Adjusted prices account for that, so it doesn't look like the stock price went from $1000 to $500.
So that's what adjusted is.
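
As a rough illustration of what "adjusted" means, the sketch below back-adjusts a made-up price series for a 2-for-1 split by dividing every pre-split price by the split ratio. The helper `back_adjust` is hypothetical, and real data providers may also account for dividends and other events.

```python
def back_adjust(prices, split_index, ratio):
    """Divide every price before the split by the split ratio."""
    return [
        price / ratio if i < split_index else price
        for i, price in enumerate(prices)
    ]

# Invented closing prices; the 2-for-1 split takes effect at index 2.
raw_close = [1000.0, 1010.0, 505.0, 510.0]
adjusted_close = back_adjust(raw_close, split_index=2, ratio=2)

# The raw series appears to halve overnight; the adjusted series does not.
print(raw_close)
print(adjusted_close)
```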
So we are going to be using those.
But again, each of these columns is closely related to the others; the correlation between any two of them is super high.
So would you use every one of these columns? Does the next one really bring that much meaningful data? No.
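
One way to check the claim that the price columns are highly correlated is to compute a Pearson correlation coefficient. This is a plain-Python sketch with invented open and close prices; the `pearson` helper is an assumption for illustration and is not part of the video.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    var_x = sum((x - x_bar) ** 2 for x in xs)
    var_y = sum((y - y_bar) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Invented adjusted open and close prices for five days.
adj_open = [100.0, 102.0, 101.0, 105.0, 107.0]
adj_close = [101.0, 101.5, 104.0, 106.0, 108.0]

r = pearson(adj_open, adj_close)
print(r)  # values close to 1 mean the two columns carry similar information
```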
But one thing to always think about when you have features and labels is the relationship between those columns.
When we get into something like deep learning, and some of the other algorithms, the algorithm can start to discover relationships between attributes,
but with regression, simply no.
What you want to do is simplify your data as much as possible.
You want as many meaningful features as you can get, but useless features, as we'll show throughout this series, can cause a lot of trouble for your machine learning classifiers, especially the simpler ones in supervised learning.
Anyway, let's close out of this and go ahead and grab some features.
First we are going to pare this down: we are going to say df equals df,
and then we pass in a list of all the columns that we want to keep.
So we take adjusted open, and I'm just going to copy and paste this,
so that's adjusted open, and then we take high, low, close, and volume.
OK, so now we have just these columns; we have recreated our dataframe to contain only the adjusted open, high, low, close, and volume.
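
Paring the dataframe down to the adjusted columns can be sketched like this. The dataframe is a tiny invented stand-in, and the column names (`Adj. Open` and so on) are an assumption about the dataset's layout.

```python
import pandas as pd

# Invented stand-in with both raw and adjusted columns.
df = pd.DataFrame({
    "Open": [1000.0, 1010.0],
    "Close": [1010.0, 1008.0],
    "Adj. Open": [500.0, 505.0],
    "Adj. High": [507.5, 510.0],
    "Adj. Low": [497.5, 501.0],
    "Adj. Close": [505.0, 504.0],
    "Adj. Volume": [2400.0, 3000.0],
})

# Indexing with a list of names returns a dataframe with only those columns.
keep = ["Adj. Open", "Adj. High", "Adj. Low", "Adj. Close", "Adj. Volume"]
df = df[keep]
print(df.head())
```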
So then, like I was saying, some of these columns are relatively worthless on their own, but they do have some relationships.
For example, what is interesting about high and low is that the margin between them tells us a little bit about volatility for the day.
Also, the open price is the starting price for the day, and its relationship to the close price tells us whether the price went up, and if so by how much, or whether it went down, and if so by how much.
So the relationship there is very valuable.
But a simple linear regression is not going to seek out that relationship. It is just going to work with whatever features you feed it.
So what we need to do is define those special relationships and then use those as our features, rather than the almost redundant prices that are not really going to give us anything else very useful.
First let's do the high minus low percent, which is almost like the percent volatility.
We are going to define a new column and call it HL_percent.
I'm having a hard time here.
Percent change in this case would be the high minus the low, divided by the low, times 100.
So for us it would be df Adj. High minus df Adj. Close,
and this is happening on a per-row basis: it is just this column minus that column.
Then that is divided by df Adj. Close, times 100.
You can multiply by 100 or not; the classifier is really not going to care. We are just doing that for ourselves.
So that's the high minus low percent.
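
The narration first states the formula as (high - low) / low and then spells out the columns as (Adj. High - Adj. Close) / Adj. Close. The sketch below computes both on invented prices so the difference is visible; the second column name, `HL_percent_low_based`, is made up for this comparison.

```python
import pandas as pd

# Invented adjusted prices for three trading days.
df = pd.DataFrame({
    "Adj. High": [110.0, 105.0, 52.0],
    "Adj. Low": [100.0, 100.0, 48.0],
    "Adj. Close": [100.0, 105.0, 50.0],
})

# Version spelled out column by column: (high - close) / close * 100.
# The arithmetic is applied row by row across the whole column.
df["HL_percent"] = (df["Adj. High"] - df["Adj. Close"]) / df["Adj. Close"] * 100.0

# Version stated in words first: (high - low) / low * 100.
df["HL_percent_low_based"] = (df["Adj. High"] - df["Adj. Low"]) / df["Adj. Low"] * 100.0

print(df)
```

The two versions agree only on days when the close equals the low, so it is worth deciding which one you intend before using it as a feature.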
Then we also want the daily percent change, the daily move.
So I'm just going to copy that whole line, paste it, and call this one percent_change.
It is pretty much the same thing, only we need to change some values.
Normally percent change is the new minus the old, divided by the old, times 100,
so that would be adjusted close minus adjusted open, the new minus the old,
divided by the old, times 100.
Oh, I'm sorry, we did it the wrong way: it is divided by the old, so this should be the open, times 100.
So that's percent change.
Actually, you could pass close here again; the classifier doesn't really care as long as everything is kind of normalized, so either way would have been fine.
But this is the way you should actually do it anyway.
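
The percent change formula, new minus old divided by old times 100, can be sketched in plain Python as below. The open is treated as the old value and the close as the new value, as in the corrected version above; the prices and the helper name are invented for illustration.

```python
def percent_change(old, new):
    """Percent change from old to new: (new - old) / old * 100."""
    return (new - old) / old * 100.0

# Invented adjusted open and close prices for three days.
adj_open = [100.0, 200.0, 50.0]
adj_close = [105.0, 190.0, 50.0]

# One value per day: positive when the price rose, negative when it fell.
daily_move = [percent_change(o, c) for o, c in zip(adj_open, adj_close)]
print(daily_move)
```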
Once we have that data, we are going to define a new dataframe:
we say df equals df[ ], and now we list the only columns that we actually care about.
In our case those are going to be adjusted close, the high low percent, the percent change,
and then volume, which is also somewhat useful to have.
Volume is basically how many trades occurred that day, so volume is also kind of related to volatility.
You could make more features with some sort of relationship there as well, but we'll try to keep this pretty simple.
So for now we'll just print df.head and wait, just to make sure everything worked out,
and sure enough it did.
So we have all the numbers we are interested in.
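
Putting the pieces together, a minimal end-to-end sketch of the final dataframe might look like this. The prices are invented, the column names are assumptions about the dataset, and `Adj. Volume` is assumed to be the volume column meant in the narration.

```python
import pandas as pd

# Invented adjusted prices and volume for two trading days.
df = pd.DataFrame({
    "Adj. Open": [100.0, 104.0],
    "Adj. High": [106.0, 108.0],
    "Adj. Low": [99.0, 101.0],
    "Adj. Close": [105.0, 102.0],
    "Adj. Volume": [1500000.0, 1800000.0],
})

# Derived features, computed row by row.
df["HL_percent"] = (df["Adj. High"] - df["Adj. Close"]) / df["Adj. Close"] * 100.0
df["percent_change"] = (df["Adj. Close"] - df["Adj. Open"]) / df["Adj. Open"] * 100.0

# Keep only the columns we actually care about.
df = df[["Adj. Close", "HL_percent", "percent_change", "Adj. Volume"]]
print(df.head())
```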
So we have our features, and eventually this one will possibly wind up being our label.
But here is something to think about between now and the next tutorial:
features are kind of the attributes that make up the label, and the label is, hopefully, some sort of prediction into the future.
So will the adjusted close, this column, actually be a feature? Or will it be a label, as it stands right now?
Think about that, and in the next tutorial we'll pick it up and start getting closer to actually making real predictions with this data.
If you have any questions, comments, whatever, leave them below.
Otherwise, as always, thanks for watching, thanks for all the support and subscriptions, and until next time.


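Translation credits
Video summary

Introduces regression through a concrete Google stock example, with live coding

Transcription

Collected from the web

Translator

[B]倔强

Reviewer

知易行难

Video source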

https://www.youtube.com/watch?v=JcI5Vnw0b2c
