What is going on subscribers and others.
Welcome to part 12 of our machine learning tutorial series.
In this tutorial, what we're gonna be talking about is testing our assumptions. Up until this point it's been, I would say, rather hand-wavy, in the sense that I've just said, hey, here are the algorithms, and here are the outputs, the answers those algorithms give, and we've done linear regression and R squared and all that.
So now we need to actually test all of these assumptions. We've got two major algorithms: one is the equation for the best fit line, and the other is R squared, the coefficient of determination. These two major algorithms are themselves composed of many smaller ones, as we saw just a few videos ago.
The misplacement of a single parenthesis changes everything and completely ruins the result, so we need to be able to test that things are working as intended.
In the world of programming, there's a similar kind of practice called unit testing, where we test each little unit of a program that we can, and this helps keep us from getting into trouble. Now, this is not going to be unit testing, but the idea is fairly similar.
We've got a lot of ideas, a lot of inner working parts, and we want to at least test them to make sure they work.
The easiest way we can do that is by working with sample data, sample data that we have the power to change. That way we can create a data set where the relationship is more linear.
Then we can test: is R squared higher, as it should be? We'll also test our best fit line, but for the most part we're actually going to be testing R squared. And if we make the data less linear, more spread apart, R squared should be lower, and so on. So anyway.
Let’s go ahead and do that.
And we can also confirm visually that the best fit line is working, just by looking at it and seeing whether it looks like a best fit line.
So first what we’re going to go ahead and do is
First, we're going to import random, because we're going to be using random numbers. And yes, everybody, the obligatory 'pseudo-random': if you don't say pseudo-random, someone absolutely feels the desire and urge to comment, 'But it is not real random.' So, pseudo-random, there you go, you nitpickers. Okay.
What we're going to do, just right under here, is say: define create_dataset().
Then we're going to pass some parameters. First is hm, for how many data points we actually want to create. Then we'll pass variance, which is how variable we want this data set to be.
Then we're gonna pass step, which will just be how far, on average, to step up the y value per point; we'll assign it a default value there.
And then finally, we're gonna pass correlation, where we can say we want the correlation to be positive, negative, or none. At first I was thinking correlation would just be True or False: if True, a positive step gives a positive correlation, since the step is what changes y, and to get a negative correlation you'd just change the step to a negative number. But another way to do it is to say correlation is 'pos' or 'neg', and if it's 'neg', multiply the step by -1. That's actually probably a better way; either would work, but we'll do it that way.
So the first thing we're going to do... well, I always like to build the skeleton of the function first. What's the objective at the end? To return the numpy array of the xs, and for now we'll also specify the data type, so we don't forget it later on, because it's probably going to be useful later on. So we'll say np.float64. That returns the xs, and then we also need to return the y values, so ys, with dtype equal to np.float64 as well.
So that's the objective. Now what we want to do is start creating some random values.
The first thing we're gonna say is val = 1. That's basically going to be the first value for y. Then we'll say ys is an empty list.
Then we're going to do a for loop: for i in range of... what should this range be? It should be hm, for how many, right? So for each i in range(hm), we'll say y = val + random.randrange(-variance, variance). Some value in that range is what we want first. Then we'll say ys.append(y). So here we're just iterating through the range using that how-many variable and appending the current value plus a random one.
So this would give us data, but really no correlation. Keep in mind that y is literally just val, the starting value, plus our variance around that starting value. So at the moment this would be pretty worthless: just somewhat varied data, and not by much, though it depends on what you set the variance to. Anyway.
So then, inside the loop, you could say: if correlation and correlation == 'pos', then val += step, and step in this case defaults to 2. And then: elif correlation and correlation == 'neg', then val -= step.
Finally, now that we've got the ys, we just need some xs. You could say xs equals a one-line for loop: [i for i in range(len(ys))]. That's good enough, though you could use hm there, for that matter. Anyway.
So now we've got what we need, and we're returning some xs and ys.
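Putting all of those pieces together, here's a sketch of the whole function as I understand it from the narration (a reconstruction, not a verbatim copy of the video's code):

```python
import random
import numpy as np

def create_dataset(hm, variance, step=2, correlation=False):
    """Build hm sample points with roughly the given variance.
    correlation can be 'pos', 'neg', or False for none."""
    val = 1  # the starting value for y
    ys = []
    for i in range(hm):
        # current baseline value plus some noise in [-variance, variance)
        y = val + random.randrange(-variance, variance)
        ys.append(y)
        # step the baseline up or down to create a correlation
        if correlation and correlation == 'pos':
            val += step
        elif correlation and correlation == 'neg':
            val -= step
    xs = [i for i in range(len(ys))]
    return np.array(xs, dtype=np.float64), np.array(ys, dtype=np.float64)
```

Calling create_dataset(40, 10, 2, correlation='pos') should give 40 points trending upward; correlation='neg' trends downward; False gives an uncorrelated cloud.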
So, to create a sample data set... for example, we can leave the old data here for now, but I'm gonna comment it out, just so we know we're working with our new data instead. We'll create the new data set down here, underneath all these other functions.
So you could say something like xs, ys = create_dataset(), and recall that the parameters are hm, variance, step, and correlation. Let's say we want 40 data points with a variance of 40, a step of 2, and a positive correlation.
So now we have xs and ys, and we can print R squared and all that fun stuff.
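To actually print R squared, you also need the best-fit-line and coefficient-of-determination functions from the earlier parts of this series. Here's a condensed, self-contained sketch of those two helpers (the names and the tiny toy data set are my own shorthand for illustration):

```python
import numpy as np

def best_fit_slope_and_intercept(xs, ys):
    # m = (mean(x)*mean(y) - mean(x*y)) / (mean(x)^2 - mean(x^2))
    m = ((xs.mean() * ys.mean() - (xs * ys).mean()) /
         (xs.mean() ** 2 - (xs ** 2).mean()))
    b = ys.mean() - m * xs.mean()
    return m, b

def coefficient_of_determination(ys_orig, ys_line):
    # r^2 = 1 - SE(regression line) / SE(mean of y)
    se_regr = ((ys_line - ys_orig) ** 2).sum()
    se_y_mean = ((ys_orig.mean() - ys_orig) ** 2).sum()
    return 1 - se_regr / se_y_mean

# a tiny toy data set, loosely linear
xs = np.array([1, 2, 3, 4, 5, 6], dtype=np.float64)
ys = np.array([5, 4, 6, 5, 6, 7], dtype=np.float64)
m, b = best_fit_slope_and_intercept(xs, ys)
regression_line = m * xs + b
r_squared = coefficient_of_determination(ys, regression_line)  # about 0.58 here
```

Swap the toy arrays for the output of create_dataset to test against the sample data instead.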
Let's go ahead and run that real quick. Actually, are we still graphing that prediction? We could get rid of it, but we could also leave the prediction in; that might be kind of interesting. We might run into trouble for that, I'm not really sure, but we'll just leave it.
And let’s run it and see.
We might have to change something else but I think that would be everything we would change.
So here’s our data set and sure enough there’s a nice best fit line for us.
And we would kind of agree with that visually. Now let's also graph that other point, the prediction, and this will be a green ('g') dot.
I don't even see it. It was for x = 8, so I guess it would be right on the line, and since we're plotting the regression line too, the line is probably just being drawn over it. Still not seeing it, though. It was x = 8, right? So it should be... it's probably this little dot right here. I'll zoom in.
It's there; I don't know if you'll be able to see it on the video. We don't really need a dot that small, though. With scatter, we can pass s=100, which sets the size, and indeed, now there's a huge green dot. Okay.
Anyway, there's our prediction, and as you should expect, it's perfectly on the line.
So we’re going to close this out and…
So now how would we test our assumption?
Well, recall that we've got hm, how many points, and then variance.
So if I took variance, which is currently 40... we saw R squared was about 0.5, I think. Let's look again. Since it's random data, this time it's 0.6. Okay. So in theory, if we decrease the variance, what should happen? That number should go up, pretty significantly, so long as we decrease the variance significantly. So let's do it. Let's do 10. We can save and run that.
And as you can see, the data is much tighter, and sure enough, the coefficient of determination is very, very strong: 0.92, much better than before. What if we change the variance to 80 now? R squared should be less than 0.6, and sure enough, it is.
So what you can begin to do is write a program that automatically calculates the coefficient of determination for a sample data set. For example, you'd start with a variance of 40 and save that number; then you'd change the variance to 10 and check that the coefficient of determination came out greater than the initial number. If you instead went to a higher variance, it should come out lower, and so on. That would be a way to test just that; we'll call it a unit. In theory you could build a unit test out of this, but this isn't quite a unit test yet. But anyway.
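Here's a sketch of that automated check, with the helpers condensed inline so the snippet stands alone (averaging over several trials is my own addition, to smooth out the randomness):

```python
import random
import numpy as np

def create_dataset(hm, variance, step=2, correlation=False):
    # same idea as above: a noisy, optionally trending walk
    val, ys = 1, []
    for _ in range(hm):
        ys.append(val + random.randrange(-variance, variance))
        if correlation == 'pos':
            val += step
        elif correlation == 'neg':
            val -= step
    xs = np.arange(len(ys), dtype=np.float64)
    return xs, np.array(ys, dtype=np.float64)

def r_squared(xs, ys):
    # best-fit line via the mean formulas, then coefficient of determination
    m = ((xs.mean() * ys.mean() - (xs * ys).mean()) /
         (xs.mean() ** 2 - (xs ** 2).mean()))
    b = ys.mean() - m * xs.mean()
    line = m * xs + b
    return 1 - ((line - ys) ** 2).sum() / ((ys.mean() - ys) ** 2).sum()

def avg_r2(variance, trials=20):
    # average R^2 over several random data sets at this variance
    return sum(r_squared(*create_dataset(100, variance, 2, 'pos'))
               for _ in range(trials)) / trials

# the assumption under test: tighter data (lower variance) -> higher R^2
assert avg_r2(5) > avg_r2(40) > avg_r2(200)
```

If the assertion ever fails, either the dataset generator or the R squared code is misbehaving, which is exactly the kind of broken-parenthesis bug we want to catch.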
So you can test that. The other thing you could do: we had a positive correlation, so if we change correlation to False, we should get quite an ugly data set. Sure enough we do, and the coefficient of determination is almost zero, which is absolutely not surprising, because the best fit line looks almost completely flat and the data is completely non-linear.
So if you had a real data set, you were trying to run linear regression on it, and you came back with an R squared of something like 0.0007, you would hopefully be smart enough to decide: 'hey, my data is actually not linear; we can't really do linear regression with this data.'
That said, you can do other forms of machine learning with that data. I'm thinking of classification, where your data doesn't necessarily have to be linear. In fact, a lot of classification should be linear in some way, but we'll get there.
Anyway that’s enough for now I think.
But just keep in mind that when you create big scripts like we have here, big programs that depend on a lot of moving parts, you want to make sure the output is about right. We could check the best fit line ourselves, kind of visually, but R squared we couldn't really verify that way. Still, you could definitely program something that would go through and, like I was saying, check that R squared was acting according to our assumptions, our knowledge of how it ought to act.
So we’re basically done with regression.
But I want to make a quick edit to this video to cover two pretty important things.
One is a fundamental aspect of machine learning that can get overlooked with the really simple example we've used here. And two, I made an error that I think is bad enough that we should cover it; plus, I think you can learn a little bit from the mistake I made. So let's pop over to the code and address these two things, hopefully pretty quickly.
So first of all, looking at the data.
I’m going to change this to from 1% basically to 10% now.
We’re going to run that.
And we're going to see that the prediction is basically an exact copy of the data leading up to it, just shifted in price a bit. Coming over here, it's basically the same. This version is squished up a little, and that's just because the blue prediction line plots points even on weekends and holidays, whereas the stock price only occurs Monday through Friday, excluding holidays. So it's basically an exact match, just higher in price.
And the reason is kind of twofold.
One we’ve created a linear model that is going to attempt to do this.
But then also we’ve made a mistake.
So we’ll address kind of both. But anyway.
So, the first and biggest mistake... actually, there were two mistakes. One I noticed just going back over the video; I'm pretty sure it was here. There was a stray colon at the end of the X slice. I don't know why that was there. No one actually pointed that one out; I just happened to see it right before filming this. Anyway.
That slice basically says X = X, right? All it says is take X up to forecast_out and then the whole thing. It doesn't do anything, so that was just a typo.
But then you get to this point, and we're still in a world of hurt. What we were intending to do is say X is the first 90% of the data (in this case forecast_out is 10%); that's the stuff we're going to train against. Then with X_lately, our objective was for it to be the last 10%. But instead, what we've done is slice X, redefine X, and then slice X again after it's already been redefined. So X_lately is actually the last forecast_out rows of the 90%. Simplifying a little: X is everything up to 90%, and X_lately is the last 10% of that 90%, or a bit more, but anyway. That was just a failure in logic. So really the fix is: you just cut that X_lately line and paste it above the redefinition.
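In other words: grab X_lately from the full X before truncating X. A standalone illustration with a dummy feature array (the real code slices the preprocessed features from the earlier videos):

```python
import numpy as np

forecast_out = 3                       # pretend we forecast 3 rows out
X = np.arange(10).reshape(10, 1)       # dummy feature column, rows 0..9

# buggy order: X is truncated first, so X_lately ends up being
# the last forecast_out rows of the already-truncated portion
X_bug = X[:-forecast_out]              # rows 0..6
X_lately_bug = X_bug[-forecast_out:]   # rows 4, 5, 6 -- not the most recent data

# correct order: take the most recent rows first, then truncate
X_lately = X[-forecast_out:]           # rows 7, 8, 9 -- what we actually wanted
X_train_part = X[:-forecast_out]       # rows 0..6, the part we train against
```

Same slices, just reordered so the most recent rows are captured before they're cut off.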
There you have it. Now, this is still going to create a model that's very similar to what we've already seen, and again, that's because we're using linear regression: it's going to create a linear model that resembles what we've already seen.
So again, you've got some jagged movement, then a jump up, and then the price prediction. It's a little different, but very, very similar. Okay, so anyway, given what we've done and how we've trained it, that's what's going to happen.
So now let's talk about the last thing, which is fundamental: what kind of features should you train against? What was the objective here? First of all, let me just say that the reason we did it this way was for simplicity's sake; we're just trying to do a really simple regression example. But regardless of whether you're interested in stock investing, every machine learning problem is likely to be a somewhat complex problem, so you have to think pretty logically about the features you choose to use. So, looking at this data: which of these things hinges directly on price, or will directly impact price?
Adj. Close obviously does. What about HL_PCT? Does it depend on what the price is? No: it's a percent, a normalized value, so it doesn't have anything to do with price. How about PCT_change? No, right? These maybe capture volatility, or magnitude; same with the high-low percent, volatility and maybe direction. But not price. What about Volume? No, not price either. It's just magnitude, maybe fluctuation, volatility, stuff like that. So the only feature that really hinges on price is Adj. Close.
To illustrate that, despite training on a future value that is indeed price, we can actually drop Adj. Close from the features. What do you think will happen when we drop it, before we graph it? Will it still create a similar line that follows price? A falling price, a rising price, a flat line? What will the prediction look like? Think about that while we run it real quick... and the answer is probably not what we were hoping for: it's just more of a flatline. And why do we get this?
Well, you probably had very similar high minus low percents back when the price was $400, $600, $800, right? No big differences. The only thing that might be somewhat impactful is Adj. Volume, since probably fewer people are quickly flipping an $800 stock than, say, a $50 stock. These just aren't the greatest features.
So think about your problem. In this case it was stock investing: what is a stock price indicative of? It's indicative of the entire company's value. Think of Google, for example, worth something like 500 billion dollars, I think. Why is Google worth 500 billion dollars? Is it because of Adj. Close, HL_PCT, PCT_change, and Adj. Volume? No! Come on, be logical about it. You know that's not the case.
There are people who believe in pattern recognition and things like that, or at least in chart patterns in stocks. Sorry, but it's been tested; there's plenty of research showing that doesn't work, though some people still believe it. But fundamentally, why is Google worth 500 billion dollars? It's not because of this stuff.
Fundamentally, Google is worth 500 billion dollars because of things like its quarterly earnings, its price to earnings, its price-to-earnings growth, its book value, and so on. These are the things that value the company. So if you wanted to predict stock price, you would use features that attempt to predict the company's overall value, and from there you could divide by outstanding shares and get a specific share price for the company.
But anyway. This was just meant to be a very simple example.
If you want to see a more complex example of
doing investing with features and fundamental features of companies.
I do have a tutorial series out for that.
It’s like 30-something videos if I recall or maybe 20 or something.
But it's kind of tedious, because you've got quarterly earnings, which come every quarter, and then things like price-to-earnings growth, book value, and price to book, which you could measure all the time, and so on. A lot of these things, and the entire company's value itself, change throughout the day. So it can get really complex really quickly.
So we just wanted to use a really simple example. But
if you are looking for a more complex version. I do have one.
But anyways. That’s it with regression.
Hopefully you can learn from my mistakes here. I'll probably continue making mistakes, and you'll probably make mistakes too; honestly, that's just part of it. Luckily we could visualize this one and catch it, but a lot of times you won't be able to catch it visually, so you want to read and reread your code. Still, you're going to make mistakes, unless you're a robot or something. So anyway, hopefully you can learn from my mistakes.
Otherwise we’re going to be leaving regression behind now.
And traversing into classification.
So stay tuned for that. As always thanks for watching.