Machine Learning with Python Practice #4 – Regression Training and Testing – 译学馆


#### Regression Training and Testing - Practical Machine Learning Tutorial with Python p.4

What is going on everybody, welcome to the fourth machine learning and third regression tutorial. Where we left off, we had figured out what our features and our labels were. We haven't quite yet defined them, but in this one we're going to define them.

We're going to actually pass them through to a classifier, then train and test that classifier to see how we do.

So before we get started, let's go ahead and make some imports. Also, I'm just going to put Quandl and math on the same line.
Now we'll go ahead and import numpy as np. Numpy is just a nice computing library; it's going to allow us to use arrays. Python doesn't actually have arrays, but numpy will let us do that.

Also from sklearn we’re going to import

pre-processing. This gives us quite a few things, but we’re actually going to be using the scaling.

Scaling your data is usually done on the features.

And the goal is often to get your features to be somewhere between negative 1 and positive 1.

This can help us with accuracy as well as with processing speed: how long it might take to actually do the calculations.

When we get there, I’ll explain why you may actually just either

choose not to do pre-processing or maybe it’s just…you know…too tedious to incorporate it in reality.

But anyway, I will show it because it exists and it can be useful.

Next is cross-validation. We're gonna use cross-validation to create our training and testing samples. It's just a really nice way to split up your data: it'll shuffle your data for you, which helps statistically so you don't have a biased sample, and then it also helps separate your data.

It's just a nice time-saver, basically. We are also going to bring in svm. Now, we are not to the support vector machine part yet, and I am not going to explain support vector machines just yet. We will get there.

But you can use SVM to do regression

and this is probably…we are probably not going to come back to regression

so we’ll just show it as an example using it.

Also it's useful because I'll show you how frigging simple it is to change the algorithm that you are using.

So anyways SVM. Next we’re actually gonna bring in regressions
So, from sklearn.linear_model import LinearRegression. OK.

Now we are ready to rumble.
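Assuming a reasonably recent scikit-learn, the imports described above look roughly like this. Note that the `sklearn.cross_validation` module used in the video has since been removed; `train_test_split` now lives in `sklearn.model_selection`. Quandl is omitted here since it needs a separate install and API key:

```python
import math

import numpy as np
from sklearn import preprocessing, svm
from sklearn.linear_model import LinearRegression
# In sklearn >= 0.20, cross_validation was removed;
# train_test_split now lives in model_selection.
from sklearn.model_selection import train_test_split
```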

So the first thing we’re going to go ahead and do is define our X and y

So generally features and labels are defined. Features will be a capital X. Labels will be a lowercase y.
X = np.array(df.drop(['label'], 1))

X is going to be equal to a numpy array of df.drop.

And we are going to drop the label column, right?

So your features are basically everything except for the label column

and we can do this because df.drop returns a new data frame.

So it's returning a new data frame, which is being converted to a numpy array and saved as the value of X.

Now the value of y is our labels.

So you might be able to surmise that we're gonna say np.array(df['label']). Easy enough.

Ok. So now we're gonna scale X: X = preprocessing.scale(X). OK. Now think about it: here we are scaling X before we feed it through the classifier.

But let’s say we feed it through a classifier. We have a classifier, and then we’re using it real time on real data.

Well, let’s say you’re reading in that data and you feed it through your classifier.

But before you do that, you really have to scale it. And when you scale it, it's all scaled together, so it's normalized with all the other data points.

So in order to properly scale it you would have to include it with your training data.

So keep that in mind if you ever go into the future and you’re actually using this you need to scale the new values.

But not just scale them, but scale them alongside all your other values.

So while it can help with the training and testing, it can actually add processing time, especially if you're doing, say, with the stock prices we're using here, something like high-frequency trading. In that case you would almost certainly skip this step, but anyway, there's that.

Now we're going to redefine X as being equal to X[:forecast_out+1], up to the point where we were able to forecast out.

So this includes all the points, because remember we shifted by that 0.01, so that's, what, 1% basically.

So we made that shift, so we just want to make sure that we only have Xs where we have values for y. And so we do that, and then we're going to say df.dropna(inplace=True), and then we're going to define y as equal to np.array(df['label']).

and let’s go ahead and let’s print len of X and then len of y.

So just make sure we have the correct lengths here.
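The feature/label setup above can be sketched like this. The DataFrame here is a hypothetical stand-in for the Quandl stock data used in earlier parts, and `forecast_out = 5` is an arbitrary choice for illustration:

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

# Hypothetical stand-in for the Quandl stock DataFrame from earlier parts
df = pd.DataFrame({'Adj. Close': np.linspace(10.0, 20.0, 100)})
forecast_out = 5  # the video uses roughly 1% of the dataset's length
df['label'] = df['Adj. Close'].shift(-forecast_out)  # price forecast_out rows ahead

X = np.array(df.drop(['label'], axis=1))  # features: everything but the label
X = preprocessing.scale(X)                # zero mean, unit variance per column
X = X[:-forecast_out]                     # keep only rows that have a label
df.dropna(inplace=True)                   # drop the rows whose label is NaN
y = np.array(df['label'])

print(len(X), len(y))  # the two lengths should match
```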

Right, so we don't have the correct lengths, so let's stop here for a second. We actually may not need to have this redefinition of X, so let me rerun that real quick.

So I'm thinking, yeah, OK. The reason why I was doing this shift initially was because we wouldn't actually have labels for those last rows, but we dropped those rows with dropna, so we didn't need to do what we were doing there. So okay, we've got our Xs and our ys, and we don't need this, hopefully.

Okay, so now what we’re ready to do is create our training and testing sets.

So we’re going to say X_train, X_test, y_train, y_test equals
cross_validation.train_test_split.

And what you’re going to pass through here is the Xs the ys and then how big of a test size do you want.

And we’re going to do 0.2. So 20% of the data we want to actually use as testing data.

So again what this is going to do is going to take all our features and our labels. Remember kind of the order

it’s going to shuffle them up, right? keeping Xs and ys connected right?

So it’s not going to shuffle them up to the point where you lose accuracy.

So it shuffles them up, and then it outputs X training data, X testing data, y training data and y testing data.

So X_train, y_train we use to fit our classifiers.

So let's do that next. First we're going to need to define a classifier. So we're going to say clf equals LinearRegression() to start, and then, to fit or train that classifier:

Just clf.fit and you fit features and labels.

So which ones should we use? Well, we would use the X_train and y_train

Now we’ve got our classifier. We can actually use this classifier to predict into the future, do all kinds of crazy stuff.

But first we probably should test it, right? See what the accuracy is. So now what we would say is clf.score.

So fit is synonymous with train. Score is synonymous with test.

So we’ll use X_test, y_test. So real quick.

Why might you want to train and test on separate data?

Well if you train a classifier to predict based on the same data that you test against

when you go to test it, it’s going to be like I’ve already seen this information, so I know exactly what the answer is, right?

So that’s not good. You don’t want to do that.

It's no different than if you were in school, or whatever, and the questions you were asked in class were the exact identical questions on the test. If you miss those, then you just didn't pay attention to something.

So here we have our score, and what we're going to say is confidence = clf.score. You could also maybe replace confidence with accuracy. Accuracy is probably a better choice, because confidence can end up meaning a different thing; as we go into the future, you can actually compute both accuracy and confidence as two different values. So we'll go ahead and keep accuracy there, and let's print accuracy.

So let's save and run that, wait a minute, and the accuracy that we got out of this is 0.96,
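The whole split/fit/score flow above, sketched on synthetic near-linear data (the feature matrix and coefficients here are made up for illustration, and `train_test_split` is imported from `model_selection` in current scikit-learn):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic near-linear data standing in for the scaled stock features;
# the coefficients below are made up purely for illustration
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = X @ np.array([1.5, -2.0, 0.7]) + 0.01 * rng.randn(200)

# Hold out 20% of the (shuffled) rows for testing; X/y pairs stay aligned
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = LinearRegression()
clf.fit(X_train, y_train)             # "fit" is synonymous with "train"
accuracy = clf.score(X_test, y_test)  # "score" is synonymous with "test"; R^2 for regressors
print(accuracy)
```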

so ninety-six percent accuracy on predicting what the price would be

shifted one percent of the days. So let's go ahead and print forecast_out

so we can see what that value actually is

So with linear regression, just so you know, the score here is based on the squared error (it's the coefficient of determination, R²), and we'll talk about that coming up next as we break down how linear regression actually works. So this is actually still about 30 days in advance, so it's pretty interesting that you'd still be that accurate. OK.

So, OK. Now, a couple of things: first of all, it's squared error, so this actual percentage of accuracy is not necessarily something you would get rich off of with this algorithm. It's almost like maybe directionally accurate, but still, that's pretty darn accurate.

Let's say, though... remember, we also brought in svm.

Let's say we wanted to use support vector regression, which is not simple linear regression. We're not going to actually break it down, but what if we wanted to use a different algorithm? We're using linear regression; here's how easy it is to switch our algorithm: svm.SVR(). Done.

So now we’re testing a new algorithm.

And this one actually does a lot worse. Wow, let's run that one more time. Let's see if it still is inaccurate... yeah, wow, that's interesting. That's a huge difference.

Anyway, you'll actually find that that happens. Now, for example, with machine learning, like with support vector regression, you have these things called kernels. So you might say kernel
kernel equals... I want to say the default is linear (actually, SVR's default kernel is 'rbf').

So let’s try polynomial kernel.

So you can change these kernels in there. You can kind of

fiddle with it and see if you can get better values…Oh my gosh
So 51% accuracy is like

just barely better than average

or barely better than like a coin toss. Let’s see if we can actually get under 50.

Now whoa…That’s a significant variance. My goodness.
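The algorithm swap and the kernel experiment might look like this on the same kind of synthetic data (made up for illustration); note that in scikit-learn, SVR's default kernel is actually 'rbf', not linear:

```python
import numpy as np
from sklearn import svm
from sklearn.model_selection import train_test_split

# Synthetic near-linear data; the coefficients are made up for illustration
rng = np.random.RandomState(1)
X = rng.rand(200, 3)
y = X @ np.array([1.5, -2.0, 0.7]) + 0.01 * rng.randn(200)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Swapping algorithms is one line: clf = svm.SVR() instead of LinearRegression().
# SVR's default kernel is 'rbf'; try a few kernels and compare the scores.
scores = {}
for k in ('linear', 'poly', 'rbf'):
    clf = svm.SVR(kernel=k)
    clf.fit(X_train, y_train)
    scores[k] = clf.score(X_test, y_test)
print(scores)
```

On near-linear data like this, the linear kernel tends to score best, which mirrors the big swings seen in the video when kernels change.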

Anyway, as we can see, support vector regression is not what we're gonna be using in this case anyway. But here is another decent example of why you want to follow this tutorial series: "What's the kernel?", right? We'll be explaining what a kernel is when we get to support vector machines. But anyway,

what these were meant to show is

how easy it is to switch between algorithms,

and this is the case whether you’re doing regression or whether you’re doing classification or clustering

you can switch algorithms really quick. So you definitely want to like test

the algorithms. Oops, did we do svm.linearregression? OK... so anyways, there's that.

one more thing to talk about

before I let you guys go

is that, with the various algorithms, you'll want to check the documentation. So let me pull up the documentation for linear regression, for example. And what you're looking for is whether the algorithm can be threaded.

So in this case

we’re looking for
n_jobs

The question is: how many jobs can we perform at any given time?

So right now it may not be totally obvious to you.

why, with, let's say, linear regression, we can thread the heck out of it, as opposed to a support vector machine, where there are ways you could do it; you could let it do batches or something. There are ways you could do it, but it's just not inherently as easy to thread massively and run in huge parallel operations as something like regression is.

But with regression it totally is, and you'll see why later on. But anyway. So, if you were following along and, let's say, you're skipping the true breakdowns: shame on you. But to find out really quickly whether the algorithm can be threaded, you would just go to its page, right? You can just search linear regression or support vector machine on Google, you'd find yourself on this page, and you're looking for n_jobs.

So this just means how many jobs, how many threads are we willing to run

at any given time.

So the default, let me think here, the default for linear regression is actually one. So this means it's running, and I hate to use the word linear here, but it is running linearly, right?

Whereas we can run it in parallel by doing n_jobs equals... and you could say, "OK, I want to run at least 10 jobs at a time," so it would run 10 jobs at a time. And in theory the training and testing part... actually, I'm sorry, just the training part would be significantly faster. Or you can use -1, and this will just run as many jobs as your processor will allow.
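A minimal sketch of the n_jobs knob. For LinearRegression the parallelism mainly helps on problems with multiple targets, per the scikit-learn docs, and recent versions default n_jobs to None (i.e. one job) rather than 1:

```python
from sklearn.linear_model import LinearRegression

# n_jobs controls how many parallel jobs sklearn may use;
# -1 means "use every core available".
clf_serial = LinearRegression(n_jobs=1)
clf_parallel = LinearRegression(n_jobs=-1)
print(clf_serial.n_jobs, clf_parallel.n_jobs)
```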

So anyway, speaking of which, processor-wise: as you follow along with this series, if you get to a point where I'm going really fast and you're going really slow, like you're following along on an older computer or laptop or something like that, it may take you a little longer to run some of these things. This one should be pretty quick, but it's conceivable, especially when we get into deep learning.

You might want to think about

spinning up a server or something like that but

that’s obviously way down the line. We’ll talk about that when we get there I suppose. But just

keep that in mind. So anyways

just remember this. But then again, just one more plug about why you want to dig in deep, which we are about to be doing: it's to understand which algorithms you can do this with. For example, with deep learning, and not just deep learning, like all the algorithms really, there's a lot of regression, a lot of linear algebra that goes in, because a lot of these things can be parallelized and you can run many operations at once. As we grow in processing power, it becomes very useful to have calculations and methodologies that can scale like that.

Anyway that’s it

here in the next tutorial we’re gonna be talking about

predictions into the future

using scikit-learn and then

after that we’ll actually be breaking down linear regression

and doing it ourselves.

So stay tuned for that. If you have questions, comments, concerns or whatever up to this point, feel free to leave them below. Otherwise, as always, thanks for watching, and thanks for the support and subscriptions. Until next time.
