
Machine Learning with Python in Practice #4 – Regression Training and Testing – 译学馆


Regression Training and Testing - Practical Machine Learning Tutorial with Python p.4

What is going on everybody? Welcome to the fourth machine learning and third regression tutorial. Where we left off, we had defined, or at least figured out, what our features and our labels were. We haven't quite defined them yet, but in this one we're going to define them. We're going to actually pass them through to a classifier, and train and test that classifier to see how we do.
So before we get started, let's go ahead and make some imports. I'm just going to put Quandl and math on the same line. Then we'll go ahead and import numpy as np. NumPy is just a nice computing library; it's going to let us use arrays. Python doesn't actually have arrays, but NumPy will let us do that. Also, from sklearn we're going to import preprocessing. That module gives us quite a few things, but we're actually only going to be using scaling.
Scaling your data is usually done on the features, and the goal is often to get your features to fall somewhere between negative 1 and positive 1. This can help with accuracy, as well as with processing speed, that is, how long it takes to actually do the calculations. When we get there, I'll explain why in practice you may either choose not to do preprocessing, or find it too tedious to incorporate. But anyway, I'll show it because it exists and it can be useful.
Next is cross_validation; we're going to use it to create our training and testing samples. It's just a really nice way to split up your data: it shuffles the data for you, which helps statistically, so you don't end up with a biased sample, and it also separates your data for you. It's basically a nice time-saver. We're also going to bring in svm. We're not at the support vector machine part yet, and I'm not going to explain support vector machines just yet; we'll get there. But you can use SVM to do regression, and since we're probably not going to come back to regression, we'll show it as an example here. It's also useful because I'll show you how simple it is to change the algorithm you're using.
So anyway, svm. Next we're actually going to bring in the regression: from sklearn.linear_model import LinearRegression. OK, now we're ready to rumble.
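Collected together, the imports described above look roughly like this. One assumption to flag: in current scikit-learn the old cross_validation module has been replaced by model_selection, so that's where the sketch imports train_test_split from.

```python
import math
# import quandl  # used in earlier parts of the series to pull the stock data

import numpy as np
from sklearn import preprocessing, svm
from sklearn.model_selection import train_test_split  # was sklearn.cross_validation in older versions
from sklearn.linear_model import LinearRegression
```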
The first thing we're going to do is define our X and y. Generally, features and labels are defined with the features as a capital X and the labels as a lowercase y. X is going to be equal to a numpy array of df.drop, and we're going to drop the 'label' column: X = np.array(df.drop(['label'], 1)). So your features are basically everything except the label column. We can do this because df.drop returns a new data frame, which gets converted to a numpy array and saved as the value of X. The value of y is our labels, so you might surmise that we're going to say y = np.array(df['label']). Easy enough.
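As a minimal sketch of those two definitions, using a stand-in DataFrame (the real df is the Quandl stock frame built up in the earlier parts, so the column names here are just placeholders):

```python
import numpy as np
import pandas as pd

# Stand-in for the DataFrame from earlier parts; 'label' holds the shifted future price.
df = pd.DataFrame({
    'Adj. Close': [10.0, 11.0, 12.0, 13.0],
    'HL_PCT':     [0.5, 0.4, 0.6, 0.3],
    'PCT_change': [1.0, -0.5, 0.2, 0.1],
    'label':      [11.0, 12.0, 13.0, 14.0],
})

# Features: everything except the label column. df.drop returns a new
# DataFrame, which np.array converts to a plain 2-D array.
X = np.array(df.drop(['label'], axis=1))
# Labels: just the label column.
y = np.array(df['label'])

print(X.shape, y.shape)  # (4, 3) (4,)
```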
OK, so now we're going to scale X: X = preprocessing.scale(X). Now think about it: here we're scaling X before we feed it through the classifier. But let's say we've trained a classifier and we're using it in real time on real data. You read in that new data and feed it through your classifier, but before you do that, you really have to scale it, and data is scaled together, normalized against all the other data points. So to properly scale a new value, you'd have to include it alongside your training data. Keep that in mind if you ever take this into production: you need to scale the new values, and not just scale them, but scale them alongside all your other values. So while scaling can help with training and testing, it can add processing time, especially for something like the stock prices we're using here. If you're doing high-frequency trading, you would almost certainly skip this step.
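A quick sketch of what preprocessing.scale does on toy data. One clarification: despite the "between -1 and 1" framing above, scale actually standardizes each feature column to zero mean and unit variance:

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Each column is centered on its mean and divided by its standard deviation.
X_scaled = preprocessing.scale(X)

print(X_scaled.mean(axis=0))  # approximately [0. 0.]
print(X_scaled.std(axis=0))   # approximately [1. 1.]
```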
But anyway, there's that. Now we're going to redefine X as X = X[:-forecast_out+1], up to the point where we were able to forecast out. Remember, we shifted by 0.01 of the data, so that's, what, 1% basically. We made that shift, so we just want to make sure we only have Xs where we have values for y. Then we're going to say df.dropna(inplace=True), and then we'll define y = np.array(df['label']). Let's go ahead and print len(X) and len(y), just to make sure we have the correct lengths here.

Right, so we don't have the correct lengths. Let's close this; we may not actually need that slice, so let me rerun real quick. I'm thinking... yeah, OK. The reason I was doing that slicing initially was that we wouldn't have labels for the most recent rows, but we already dropped those rows with dropna, so we didn't need to do it. OK, we've got our Xs and our ys, and we don't need that line, hopefully.
OK, so now we're ready to create our training and testing sets: X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.2). What you pass through here is the Xs, the ys, and then how big a test size you want; we're doing 0.2, so 20% of the data will actually be used as testing data. What this does is take all our features and labels and shuffle them up, right? While keeping the Xs and ys connected, so it doesn't shuffle them to the point where you lose accuracy. It shuffles them, and then it outputs X training data, X testing data, y training data, and y testing data. X_train and y_train are what we use to fit our classifier.
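A sketch of that split on toy arrays (using the newer model_selection home of train_test_split):

```python
import numpy as np
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older versions

X = np.arange(20).reshape(10, 2)  # row i is [2*i, 2*i + 1]
y = np.arange(10)                 # label for row i is i

# Shuffle and split, holding out 20% of the rows for testing;
# corresponding rows of X and y stay paired through the shuffle.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

print(len(X_train), len(X_test))  # 8 2
```

Each X_train row still lines up with its y_train label after the shuffle.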
Let's do that next. First we need to define a classifier: clf = LinearRegression(); we'll use linear regression to start. Then, to fit, or train, the classifier, you just call fit with features and labels. Which ones should we use? Well, we'd use X_train and y_train: clf.fit(X_train, y_train). Now we've got our classifier; we could use it to predict into the future and do all kinds of crazy stuff. But first we probably should test it, right? See what the accuracy is. So now what we'd say is clf.score. Fit is synonymous with train, and score is synonymous with test, so we'll use X_test and y_test.
So, real quick: why might you want to train and test on separate data? Well, if you test a classifier against the same data you trained it on, it's going to be like, "I've already seen this information, so I know exactly what the answer is", right? That's not good; you don't want to do that. It's no different than school: if the questions you were asked in class were the exact identical questions on that night's test and you still missed them, you just weren't paying attention.
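The fit/score pattern above in one runnable sketch, on made-up, roughly linear data standing in for the stock features:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Toy data: y is a linear function of X plus a little noise.
rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = X @ np.array([1.5, -2.0, 0.5]) + 0.01 * rng.randn(100)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = LinearRegression()
clf.fit(X_train, y_train)               # "fit" is synonymous with train
confidence = clf.score(X_test, y_test)  # "score" is synonymous with test
print(confidence)                       # close to 1.0 on this easy data
```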
So here we have our score: confidence = clf.score(X_test, y_test). You could also maybe replace "confidence" with "accuracy"; accuracy is probably a better choice, because confidence is really a different value, even if not in this case, and as we go further you'll see cases where you compute both accuracy and confidence as two separate values. So we'll go ahead and keep "accuracy" there, and print it. Save, run, wait a moment... and the accuracy we got out of this is 0.96. So, ninety-six percent accuracy at predicting what the price would be, shifted out one percent of the days. Let's also print(forecast_out) so we can see what that value actually is.
So with linear regression, just so you know, that score is based on squared error (it's R², the coefficient of determination), and we'll talk about that coming up next as we break down how linear regression actually works. This is actually still about 30 days in advance, so it's pretty interesting that you'd still be that accurate. OK, a couple of things though: first of all, it's squared error, so this percentage doesn't mean you'd get rich off this algorithm; it's more like directionally accurate. But that's still pretty darn accurate.
So let’s say though remember. I what we also brought in Svm
假设我们想要使用支持向量机 而不是简单的线性回归
Let’s say we wanted to use the support vector regression, which is not simple linear regression.
当然我们并不是要细究算法 只是想用一个不同的算法
So we’re not going to actually break down, but what if we wanted to use [a] [different] algorithm?
那这里我们用的是线性回归吧 看这就是换个算法有多简单
So we’re using linear regression. Here’s how easy it is to switch our algorithm.
svm.svr() 搞定
“svm.svr()” done.
So now we’re testing a new algorithm.
哇 这个表现可是差多了 让我们再试一次
And this one does actually [a] lot worse wow let’s run that more time.
看看是不是还是这么不准确 好 哇 有意思
Let’s see if it still is inaccurate…yeah, wow…that’s interesting.
That’s huge difference.
Anyway, you'll find that that happens. For example, with support vector regression you have these things called kernels, so you might say kernel=... I want to say the default is linear (in current scikit-learn it's actually 'rbf'). Let's try the polynomial kernel. So you can change these kernels and fiddle with them to see if you can get better values... oh my gosh. 51% accuracy is barely better than average, or barely better than a coin toss. Let's see if we can actually get under 50. Whoa, that's a significant variance. My goodness.
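That kernel fiddling can be sketched like this, reusing the same kind of toy data; which kernel wins depends entirely on the data, which is rather the point:

```python
import numpy as np
from sklearn import svm
from sklearn.model_selection import train_test_split

# Toy, roughly linear data; a linear kernel should do well here.
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = X @ np.array([1.5, -2.0, 0.5]) + 0.05 * rng.randn(200)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Try a few kernels and compare held-out scores.
for k in ('linear', 'poly', 'rbf', 'sigmoid'):
    clf = svm.SVR(kernel=k)
    clf.fit(X_train, y_train)
    print(k, clf.score(X_test, y_test))
```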
Anyway, as we can see, support vector regression is not what we're going to be using in this case. But here's another decent example of why you'd want to follow this tutorial series: you might be wondering, "what's a kernel?", right? We'll explain what a kernel is when we get to support vector machines. What these examples were meant to show is how easy it is to switch between algorithms, and that's the case whether you're doing regression, classification, or clustering: you can switch algorithms really quickly, so you definitely want to test different algorithms. Oops, here we should just go back to LinearRegression. OK, so anyway, there's that.
One more thing to talk about before I let you go: with the various algorithms, you'll want to check the documentation. Let me pull up the documentation for linear regression, for example. What you're looking for is whether the algorithm can be threaded; the question is how many jobs we can perform at any given time. Right now it may not be totally obvious to you why we can thread the heck out of linear regression, as opposed to a support vector machine. With a support vector machine there are ways you could do it, say with batches or something, but it's just not inherently as easy to thread massively and run in huge parallel operations as something like regression is. With regression it totally is, and you'll see why later on.
But anyway. If you were following along and skipping the true breakdowns of the algorithms, shame on you. But to find out quickly whether an algorithm can be threaded, you just go to its documentation page, right? Search for linear regression or support vector machine on Google and you'll find yourself on that page, and what you're looking for is n_jobs: how many jobs, how many threads, are we willing to run at any given time. The default for LinearRegression is actually 1, which means it's running... I hate to use the word "linear" here, but it is running linearly, right? Whereas we can run it in parallel by setting n_jobs. You could say, "OK, I want to run at least 10 jobs at a time", and it would run 10 jobs at a time, and in theory the training part (not the testing, sorry, just the training part) would be significantly faster. Or you can use -1, and it will run as many jobs as your processor possibly can.
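The n_jobs settings described above, as constructor arguments (note: in practice n_jobs only speeds LinearRegression up in certain cases, such as fitting multiple targets, but the knob itself looks like this):

```python
from sklearn.linear_model import LinearRegression

clf_serial = LinearRegression(n_jobs=1)   # one job at a time (serial)
clf_ten    = LinearRegression(n_jobs=10)  # up to 10 jobs at a time
clf_all    = LinearRegression(n_jobs=-1)  # as many jobs as the processor allows
```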
So anyway, speaking of which, processor-wise: as you follow along with this series, if you get to a point where I'm going really fast and you're going really slow, like you're following along on an older computer or laptop or something, some of these things may take you a little longer to run. This one should be pretty quick, but conceivably, especially when we get into things like deep learning, you might want to think about spinning up a server or something. That's obviously way down the line; we'll talk about it when we get there, I suppose. But keep that in mind.
So anyway, just one more plug for why you want to dig in deep, which we're about to do: it's to understand which algorithms you can do this with. Deep learning is one example, but really it's all the algorithms: a lot of regression, a lot of linear algebra goes into these, because a lot of them can be super-threaded so you can run many operations at once. As we grow in processing power, it becomes very useful to have calculations and methodologies that can scale like that.
Anyway, that's it. In the next tutorial we're going to be talking about predicting into the future using scikit-learn, and after that we'll actually break down linear regression and do it ourselves. So stay tuned for that. If you have questions, comments, concerns, or whatever up to this point, feel free to leave them below. Otherwise, as always, thanks for watching, and thanks for the support and subscriptions. Until next time.