
#### Testing Assumptions - Practical Machine Learning Tutorial with Python p.12

What is going on subscribers and others.

Welcome to part 12 of our machine learning tutorial series.

In this tutorial what we’re gonna be talking about is

testing our assumptions. So

up until this point it’s been…I would say rather hand-wavy

in the sense that I have just said hey these are the algorithms

and whatever the output. These are the answers to those algorithms and

we have done linear regression and R squared all this.

And so the question is we need to actually kind of test

all of these assumptions. So we’ve got actually

two major algorithms. One is the equation for the best fit line

and the other one is the R squared or coefficient of determination.

So we’ve got these two major algorithms that are also comprised of many other algorithms

as we even saw just a few videos ago.

The misplacement of a single parenthesis changes everything and completely ruins the entire thing.

So we need to be able to test to make sure things are working as intended.

So in the world of programming this is…

There’s a similar kind of field and structure called

unit testing, where we, you know, test each little small unit that we can

in a program, and this kind of helps keep us from getting into trouble.

Now this is not going to be unit testing but the idea is fairly similar.

We’ve got a lot of ideas. We’ve got a lot of inner working parts.

and we want to at least test them to make sure.

The easiest way we can do that is by working with sample data.

And by sample data, I mean data that we have the power to change.

So that we can create a data set that is a more linear data set.

Or at least where the relationship is more linear.

And then we can test to make sure: is R squared higher, right?

And then also just test our best fit line.

But for the most part we’re actually going to be testing R squared.

And if we make the data less linear, more spread apart,

R squared should be lower, and so on. So anyways.

Let’s go ahead and do that.

And we can also confirm visually that the best fit line is indeed working just by looking at it

and seeing whether or not it is indeed what looks to be a best fit line.

So first what we’re going to go ahead and do is
import random. Because we’re going to be using random numbers.

Everybody, the obligatory ‘pseudo-random’.

If you don’t say it’s pseudo-random, someone absolutely feels the desire and urge to comment and say:

‘But it’s not real random.’ So anyways, pseudo-random, there you go.

You nitpickers. Okay. So

what we’re going to do is just right under here.

Let’s…We’re going to say define create_dataset().

Then here we’re going to have…we’re going to pass some parameters.

First is hm, as in how many data points do we actually want to create here.

And then we’re going to say we’ll pass variance.

And this will be how variable do we want this data set to be.

Then we’re gonna pass step.

And step will just be how far on average
to step up the y value per point.

And we’ll assign a default value there.

And then finally we’re gonna do correlation.

And this is where we can just pass a value and say we want correlation to be positive

negative or none and

What we’re gonna do here is correlation or… hold on. So correlation will either be true or false.

And then if it is true, to get a positive correlation, step

will just be some positive number, right? Because that’s changing y.

And to be a negative correlation you would just change this to a negative number, right?

So…And in fact another way we could do it is we could actually say correlation is positive

or negative, and if it’s negative you negate the step.

That’s actually probably a better way to do it. Either way would work but we’ll do that way actually.

So the first thing that we’re going to do is…

Well, we would want to be able to… At the end of this… I always like to build the skeleton function first.

So at the end of it, what is the objective? And that would be to return the numpy array

of the xs, and for now, again, we will specify the data type.

So we don’t forget this later on. Because it’s probably going to be useful later on.

So we’ll say float64.

So that returns the x’s and then we also need to return
the y values. So ys, and then dtype equals

np.float64. Okay.

So that’s the objective that we want to do and then now what we want to do

is create some…start creating at least some random values.

So the first thing we’re gonna say is we’re going to start with val equals 1.

So that’s just going to be the first value for y basically.

And then we’re just going to say ys is this: an empty list.

And then we’re going to…we could say something like for i in range of…

And how many… what should this range be? Well, it should be hm, for how many, right?

So for range hm what are we going to do. Well we’re going to say y
equals the val plus random.randrange().

And it should be random.randrange from the negative variance to the positive variance.

So some range in there is what we want to do first.

And then we’re going to say ys.append that y.

So here we would just be iterating through the range using that how much variable.

And then we’re just appending that current value plus a random one.

So this would give us data, but really no correlation, if that’s the data we actually wanted.

So then what we would ask is…So keep in mind that
y is literally the val. So it’s just that starting value.

And then our variance from that starting value.

So this would be pretty worthless at the moment.

It would just be somewhat varied but not by much.

Well it depends on what you said the variance was. Anyway.

So then what you could say now is
if correlation and correlation equals

positive. What we could do is val plus equals step
which in this case would default to 2.

And then elif correlation and correlation equals

negative. What do we want to do? Well val minus equals step.

Finally at the end of the day what all we’re going to do is now we’ve got the y’s so we just need some x’s.

So you could say something like x’s equals

And we’ll just do a one line for loop. i for i in range of what
the len of ys.

That’s good enough, or you could use hm there for that matter. Anyway.

So now we’ve got what we need and we’re returning some x’s and y’s.
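Putting the steps above together, here's a minimal sketch of the create_dataset function as described in the narration (the 'pos'/'neg' strings and the default step of 2 follow what was said; everything else is a straightforward reconstruction):

```python
import random
import numpy as np

def create_dataset(hm, variance, step=2, correlation=False):
    """Build hm sample points: y starts at val=1 and wanders within
    +/- variance; if correlation is 'pos'/'neg', val trends up/down
    by step per point."""
    val = 1
    ys = []
    for _ in range(hm):
        # current baseline plus some (pseudo-)random noise
        y = val + random.randrange(-variance, variance)
        ys.append(y)
        if correlation and correlation == 'pos':
            val += step
        elif correlation and correlation == 'neg':
            val -= step
    # xs are just the indices 0..hm-1
    xs = [i for i in range(len(ys))]
    return np.array(xs, dtype=np.float64), np.array(ys, dtype=np.float64)
```

With a small variance and a positive correlation, the ys should trend clearly upward from start to finish.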

So to create a sample data set we could do something

like…and for example let’s…

We can leave this here for now but

I’m gonna comment it out just so we know that we’re working with our new data instead.

So underneath this you could create a new data set, but I guess we’ll create it

down here, underneath all these other functions.

So you could say something like xs, ys equals create_dataset.

And then, let’s see… recall that the parameters are hm, the variance,

the step, and the correlation.

So let’s say we said we want 40 data points

with variance of 40. The step will be 2.

And correlation we’ll make that positive.

So now we have x’s, y’s. We can print R squared and all that fun stuff.

And let’s go ahead and run that real quick. And in fact

are we still…We’re still graphing that prediction. So…

Let’s… we’ll get rid of the prediction. Actually, we could leave the prediction; that might be kind of interesting.

For now…

we might run into trouble. I’m not really sure if

we’re gonna get in trouble for that or not. But we’ll just do that.

And let’s run it and see.

We might have to change something else but I think that would be everything we would change.

Awesome.

So here’s our data set and sure enough there’s a nice best fit line for us.

And we see that

We would kind of agree with that visually.

Let’s go and graph that other plot though, that one.

And this will be the prediction, in green.

I don’t even see it. It was for x equals 8.

I guess it would be right on the line.

And then we’re plotting the regression line. So

I’m guessing the line is just going right over it probably. It’s just being drawn over it.

Still not seeing it however.

It was x equals 8, right? So it should be that…It’s probably this little plot right here. I’ll zoom in.

It’s there. I don’t know if you’ll be able to see that on the video.

But there is indeed a plot there, and in fact we could do something like

I think with scatter it’ll be s equals. And then let’s try 100.

So this is like for the size.

And indeed there is a huge green dot there. So okay.

Anyway. So there’s our prediction as you should expect it’s perfectly on the line.

So we’re going to close this out and…

So now how would we test our assumption?

Well recall that we’ve got how much and then variance.

So if I said…If I took variance which is currently 40.

And we saw that it was like 0.5, I think, for R squared. Let’s look at it again.

Well, since it’s random data, this time it was 0.6. Okay. So in theory

if we decrease the variance. What should happen?

Well what should happen is that number should go down

pretty significantly so long as we decrease variance significantly. So let’s do it.

Let’s do 10. We can save and run that.

And as you can see it’s much tighter. Everything’s there and sure enough

the coefficient of determination is very very strong. It’s 0.92 much better than before.

What if we change this to an 80 now. It should be less than 0.6.

And sure enough it is less than 0.6. And so

what you can begin to do is automatically

write a program that simply calculates the coefficient of determination

for just a sample dataset.

And you would just make sure for example that you’d start with 40.

Save that number and then you would change that to 10.

And hopefully the coefficient of determination was less than this initial number.

And then if you went greater it should be greater and so on.

That would be a way to test just that. We’ll call it a unit.

In theory you could build a unit test out of this. But this isn’t quite yet a unit test. But anyway.
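As a sketch of that idea: using the create_dataset function built above, plus the best-fit-line and R squared functions from the earlier parts of the series (reconstructed here so the snippet stands alone), we can assert that tighter data produces a higher coefficient of determination than looser data. The specific hm, step, and variance values are just illustrative choices:

```python
import random
import numpy as np

def create_dataset(hm, variance, step=2, correlation=False):
    val, ys = 1, []
    for _ in range(hm):
        ys.append(val + random.randrange(-variance, variance))
        if correlation == 'pos':
            val += step
        elif correlation == 'neg':
            val -= step
    xs = np.arange(len(ys), dtype=np.float64)
    return xs, np.array(ys, dtype=np.float64)

def best_fit_slope_and_intercept(xs, ys):
    # least-squares slope and intercept, in the form derived earlier
    m = (((np.mean(xs) * np.mean(ys)) - np.mean(xs * ys)) /
         ((np.mean(xs) ** 2) - np.mean(xs ** 2)))
    b = np.mean(ys) - m * np.mean(xs)
    return m, b

def coefficient_of_determination(ys, ys_line):
    # r^2 = 1 - SE(regression line) / SE(mean of ys)
    se_line = np.sum((ys - ys_line) ** 2)
    se_mean = np.sum((ys - np.mean(ys)) ** 2)
    return 1 - se_line / se_mean

def r_squared_for(variance):
    xs, ys = create_dataset(100, variance, step=5, correlation='pos')
    m, b = best_fit_slope_and_intercept(xs, ys)
    return coefficient_of_determination(ys, m * xs + b)

random.seed(0)  # pseudo-random, so the comparison is repeatable
r2_tight = r_squared_for(10)
r2_loose = r_squared_for(200)
# tight data should fit its best fit line far better than loose data
assert r2_tight > r2_loose
```

This is the assumption test in miniature: lower variance should mean higher R squared, and the assert fails if it ever doesn't.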

So you can test that and then sure enough the other thing you could do is…

Well, we had a positive correlation. If we change this to false.

We should get quite an ugly data set. Sure enough we do.

And the coefficient of determination is almost zero.

which is absolutely not surprising.

Because that almost looks like a completely flat line.

And sure enough this data is completely non-linear.

So if you did have a data set and you were trying to run linear regression on this data set.

And you came back with an R squared that was this number that’s like 0.0007.

You would probably be smart enough to decide: ‘hey my data is actually not linear’.

We can’t quite do linear regression with this data.

That said you can do other forms of classification with the data

or not just classification.

But you know, other forms of machine learning. I’m thinking classification,

where your data doesn’t necessarily have to be linear.

And in fact a lot of classification is…should be linear in some way. But we’ll get there.

Anyway that’s enough for now I think.

But just kind of keep in mind that when you create

big scripts like we have here, and big programs that are kind of based on a lot of things,

you want to make sure that everything is about right.

We could check the best fit line ourselves kind of visually.

But R squared we could not really totally test that.

But you could definitely program something that would go through.

like I was saying

check to make sure R squared was acting

according to our assumption or our knowledge of how it ought to act.

So we’re basically done with regression.

But I want to make a quick edit to this video to cover two pretty important things.

One is a fundamental aspect of machine learning

that might be getting overlooked in the really simple example that we’ve used here.

And then two I made an error that I think is bad enough that we want to cover it

plus I think you can learn a little bit from the mistake that I made.

So let’s pop over to the code and address these two things.

Hopefully pretty quickly.

So first of all, looking at the data.

I’m going to change this from 1% basically to 10% now.

We’re going to run that.

And we’re going to see that it’s basically an exact copy of

the data leading up to it, just shifted in price a bit, right?

So coming over here.

It’s basically the same.

This version is squished up a little bit.

And that’s just because the blue line is the prediction line

that plots even on the weekends and holidays.

Whereas over here the stock price only occurs during Monday to Friday and not on holidays as well.

So anyways basically an exact match.

Just higher in price.

And the reason is kind of twofold.

One we’ve created a linear model that is going to attempt to do this.

But then also we’ve made a mistake.

So we’ll address kind of both. But anyway.

The first thing is the biggest mistake.

Actually there were two mistakes.

One I noticed in the video just going back over it.

I’m pretty sure it was here. There was also a colon at the end of the X slice.

I don’t know why that was there. No one actually pointed that one out.

I just happened to see it right before filming this one anyway.

That basically is X equals X, right?

All that says is X up to forecast_out and then finish the whole thing, right?

That doesn’t do anything.

So that was just a typo.

But then you get to this point.

And we’re still kind of in a world of hurt.

Because X…What we were intending to do is say X is

the first…Let’s say in this case it’s 10%.

Yeah. So we’re saying X is the first 90% of the data.

This is the stuff we’re going to train against.

And then we’re saying X_lately and our objective here was to say
X_lately is the last 10%.

But instead what we’ve done is we’ve sliced X and redefined X here.

And then sliced X after it’s already been redefined.

So this is actually minus forecast_out of the 90%.

So obviously simplifying things a little bit.

This is basically up to 90%. And this is the last

10% of that 90%. It’s a little bit more, but anyway.

So that was just…that’s a failure in logic.

Okay. So really the fix is you just cut that and paste it up there.

And there you have it. Now this is still going to create a model that’s relatively

akin and very similar to what we’ve already seen.
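The slicing bug and its fix can be sketched like this; the array and the forecast_out value here are stand-ins for the tutorial's real variables:

```python
import numpy as np

X = np.arange(20.0)   # stand-in for the real feature array
forecast_out = 5

# Buggy order: X is truncated first, so X_lately ends up being the
# last forecast_out rows of the *already truncated* array.
X_bad = X[:-forecast_out]
X_lately_bad = X_bad[-forecast_out:]   # rows 10..14, not 15..19

# Fixed order: slice X_lately from the full array before truncating X.
X_lately = X[-forecast_out:]           # the last chunk we predict on
X = X[:-forecast_out]                  # the earlier chunk we train against
```

The fix is purely about ordering: take the "lately" slice from the full array before redefining X.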

And again this is because we’re using linear regression.

It’s going to create a linear model that resembles what we’ve already seen.

So again you’ve got some jaggedness, then you’ve got the jump up, and then the price.

It’s a little different.

But it’s very very similar. Okay. So anyway. That’s just

given what we’ve done and how we’ve trained it. That’s going to happen.

So now let’s talk about the last thing

which is the fundamentals of…

you know what kind of features should you train against.

So what was the objective here?

First of all, let me just say the reason why we did it this way is

just for simplicity’s sake.

We’re just trying to do a really simple regression example.

But let’s say, you know, regardless of whether or not you’re interested in stock investing,

this problem, like every machine learning problem, is likely going to be a somewhat complex problem.

So you have to think pretty logically about the features that you choose to use.

So looking at this.

Which of these things hinges directly on price or will directly impact price, right?
Obviously Adj. Close does. What about HL_PCT? Does it have anything to do with price?

No. It’s a percent, right? It’s a normalized value.

So that doesn’t have anything to do with price.

This is maybe volatility, right?

Maybe magnitude; same thing with the high-low percent: volatility, and like direction maybe.

But not price. What about Volume?

No, not price, right?

This is just magnitude. Kind of fluctuation maybe. Stuff like that volatility.

So the only thing that really hinges on price is just Adj. Close.

To illustrate that

despite training on a future value that is indeed price.

What we can do is we can actually drop Adj. Close.

from the features.
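As a sketch, with a hypothetical DataFrame shaped like the tutorial's feature set (the column names follow the earlier parts of the series; the numbers are made up), dropping Adj. Close from the features looks something like this:

```python
import pandas as pd

# Hypothetical frame mirroring the tutorial's features; values are made up
df = pd.DataFrame({
    'Adj. Close':  [50.0, 51.0, 52.5],
    'HL_PCT':      [1.2, 0.8, 1.5],
    'PCT_change':  [0.3, -0.1, 0.6],
    'Adj. Volume': [1e6, 1.2e6, 9e5],
})

# Keep every feature except the one that directly carries price
features = df.drop(['Adj. Close'], axis=1)
```

Training on `features` then tests how much predictive power the remaining, price-free columns actually carry.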

What do you think when we drop this? What do you think is going to happen before we graph it?

Is that going to create a similar line that follows price?

Is it going to be a falling price, upward price, flat line? What’s it going to create for the prediction?

So think about that we run it really quick. And we’ll get our answer.

And the answer is not going to be probably what we were hoping for, right?

It’s just more of a flatline and why do we get this, right? Well.

The HL_PCT.

the high minus low percent was about the same back when the price was \$400, \$600, \$800, right?

Not big differences.

The only thing that might be sort of impactful is the Adj. Volume.

Since probably less people

are quickly flipping an \$800 stock as opposed to a \$50 stock or something like that.

But regardless

These just aren’t the greatest features.

So thinking about your problem. In this case it was stock investing.

What is it…What is a stock price indicative of?

It’s indicative of the entire company’s value.

Let’s think of Google for example. Like 500 billion dollars I think.

Why is Google worth 500 billion dollars?

No! Come on! Be logical about it. You know that’s not the case.

There are people who believe in pattern recognition and stuff like this. But…

Or at least you know chart patterns in stocks.

Sorry. But it’s been tested. There’s plenty of research done. That doesn’t work. But anyway.

Some people still believe it.

But fundamentally why is Google worth 500 billion dollars? It’s not because of this stuff.

Fundamentally Google’s worth 500 billion dollars

Because of things like its quarterly earnings,

its price to earnings, its price-to-earnings growth, its book value, and so on.

These are the things that value the company.

So if you wanted to predict stock price

You would use features that attempt to predict the company’s overall value.

Then from there you can divide that by outstanding shares and get a specific share price for the company.

But anyway. This was just meant to be a very simple example.

If you want to see a more complex example of

doing investing with features and fundamental features of companies.

I do have a tutorial series out for that.

It’s like 30-something videos if I recall or maybe 20 or something.

But it’s kind of tedious.

Because you got quarterly earnings which is every quarter.

Then you’ve got things like price to earnings to growth

which you could measure all the time.

Book value, price to book. You can measure all the time and so on.

So a lot of these things, and also just the entire company’s value,

you know, change throughout the day. So anyway.

It can get really complex really quick.

So we just wanted to use a really simple example. But

if you are looking for a more complex version. I do have one.

But anyways. That’s it with regression.

Hopefully you can learn from my mistakes down here.

I’ll probably continue making mistakes and you’ll probably make mistakes too.

And that’s just like part of it honestly.

So luckily we could visualize this and we could catch it.

But a lot of times you’re not going to be able to visually catch it.

But still you’re going to make mistakes. So…

Unless you’re a robot or something. So anyways. Hopefully you can learn from my mistakes.

Otherwise we’re going to be leaving regression behind now.

And traversing into classification.

So stay tuned for that. As always thanks for watching.
