Testing Assumptions - Practical Machine Learning Tutorial with Python p.12

What is going on, subscribers and others.
Welcome to part 12 of our machine learning tutorial series.
In this tutorial, what we’re going to be talking about is
testing our assumptions.
Up until this point it’s been, I would say, rather hand-wavy,
in the sense that I have just said: hey, these are the algorithms,
and whatever the output is, those are the answers to those algorithms.
We have done linear regression, R squared, all of this.
So now we need to actually test
all of these assumptions. We’ve got
two major algorithms. One is the equation for the best fit line,
and the other is R squared, or the coefficient of determination.
These two major algorithms are themselves composed of many other algorithms,
and as we saw just a few videos ago,
the misplacement of a single parenthesis changes everything and completely ruins the entire thing.
So we need to be able to test to make sure things are working as intended.
In the world of programming
there’s a similar kind of practice called
unit testing, where you test each small unit that you can
in a program, and this helps keep you out of trouble.
Now this is not going to be unit testing, but the idea is fairly similar.
We’ve got a lot of ideas. We’ve got a lot of inner working parts,
and we want to at least test them to make sure they work.
The easiest way we can do that is by working with sample data,
and by that I mean sample data that we have the power to change.
So we can create a data set that is more linear,
or at least one where the relationship is more linear,
and then test to make sure R squared is higher, right?
And then also just test our best fit line.
But for the most part we’re actually going to be testing R squared.
And if we want the data to be less linear, we can make it more spread apart,
and R squared should be lower, and so on.
Let’s go ahead and do that.
We can also confirm visually that the best fit line is working just by looking at it
and seeing whether or not it looks to be a best fit line.
So first what we’re going to go ahead and do is
import random, because we’re going to be using random numbers.
And, everybody, the obligatory pseudo-random.
If you don’t say it’s pseudo-random, someone absolutely feels the urge to comment and say:
‘But it is not real random.’ So anyway, pseudo-random, there you go,
you nitpickers. Okay. So
what we’re going to do, right under here,
is define create_dataset().
We’re going to pass some parameters.
First is hm: how many data points do we actually want to create.
Then we’ll pass variance,
and this will be how variable we want this data set to be.
Then we’re going to pass step,
and step will be how far on average
to step up the y value per point.
We’ll assign a default value there.
And then finally we’re going to do correlation.
This is where we can pass a value and say we want the correlation to be positive,
negative, or none.
My first thought was that correlation would either be true or false,
and if it is true, to get a positive correlation, step
would just be some positive number, right? Because that’s what changes y.
And to get a negative correlation you would just change step to a negative number, right?
But another way we could do it is to say correlation is positive
or negative, and if it’s negative you flip the sign of the step.
That’s probably a better way to do it. Either way would work, but we’ll do it that way.
Now, I always like to build the skeleton of the function first.
So at the end of it, what is the objective? That would be to return the numpy array
of the xs, and again we will specify the data type
so we don’t forget it later on, because it’s probably going to be useful.
So we’ll say float64.
That returns the xs, and then we also need to return
the y values: so ys, and then dtype equals
np.float64. Okay.
So that’s the objective, and now what we want to do
is start creating some random values.
The first thing we’re going to say is val equals 1.
That’s just going to be the first value for y, basically.
Then we’re going to say ys is an empty list.
Then we can say something like for i in range of...
what should this range be? Well, it should be hm, for how many, right?
So for i in range(hm), what are we going to do? We’re going to say y
equals val plus random.randrange,
and it should be random.randrange from the negative variance to the positive variance.
So some value in that range is what we want first.
Then we’re going to say ys.append(y).
So here we’re iterating through the range using that how-many variable,
and we’re appending the current value plus a random one.
This would give us data, but with no correlation.
Keep in mind that
val is the base for every y: it’s that starting value,
plus our variance around that starting value.
So this would be pretty worthless at the moment.
It would just be somewhat varied, but not by much.
Well, it depends on what you said the variance was. Anyway.
So then what you can say now is
if correlation and correlation equals
'pos', then val plus equals step,
which in this case defaults to 2.
And then elif correlation and correlation equals
'neg', what do we want to do? Well, val minus equals step.
Finally, now that we’ve got the ys, we just need some xs.
So you could say xs equals
a one-line for loop: i for i in range of
the len of ys.
That’s good enough, or you could use hm for that matter. Anyway.
So now we’ve got what we need, and we’re returning some xs and ys.
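Put together, the function described so far might look like the sketch below. The 'pos' and 'neg' strings and the default of correlation=False are assumptions based on how the parameters are described here; only hm, variance, step and the default step of 2 are stated outright.

```python
import random

import numpy as np


def create_dataset(hm, variance, step=2, correlation=False):
    """Build hm sample points; correlation is 'pos', 'neg', or False (assumed values)."""
    val = 1  # starting value for y
    ys = []
    for _ in range(hm):
        # Current base value plus pseudo-random noise in [-variance, variance).
        y = val + random.randrange(-variance, variance)
        ys.append(y)
        # Move the base value up or down to create a trend, if one was requested.
        if correlation and correlation == 'pos':
            val += step
        elif correlation and correlation == 'neg':
            val -= step
    xs = [i for i in range(len(ys))]
    return np.array(xs, dtype=np.float64), np.array(ys, dtype=np.float64)


# Example call: 40 points, variance of 40, step of 2, positive correlation.
xs, ys = create_dataset(40, 40, 2, correlation='pos')
```

With correlation left at False, val never changes, so every y is just noise around the starting value of 1.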
So to create a sample data set we could do something
like this. We could leave the old data here for now, but
I’m going to comment it out just so we know that we’re working with our new data instead.
We’ll create the new data set down here, underneath all these other functions.
So you could say xs, ys equals create_dataset,
and recall that the parameters are hm, variance,
step, and correlation.
So let’s say we want 40 data points
with a variance of 40. The step will be 2,
and correlation we’ll make positive.
So now we have xs and ys. We can print R squared and all that fun stuff.
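As a rough sketch of what printing R squared involves at this point: the helper names below (best_fit_slope_and_intercept, squared_error, coefficient_of_determination) are assumed stand-ins for the functions built in earlier parts of the series, and the data is generated inline the same way a positively correlated sample with 40 points, variance 40 and step 2 would be.

```python
import random

import numpy as np


def best_fit_slope_and_intercept(xs, ys):
    # Least-squares slope and intercept written in terms of means.
    m = ((np.mean(xs) * np.mean(ys)) - np.mean(xs * ys)) / (
        (np.mean(xs) ** 2) - np.mean(xs * xs))
    b = np.mean(ys) - m * np.mean(xs)
    return m, b


def squared_error(ys_orig, ys_line):
    return np.sum((ys_line - ys_orig) ** 2)


def coefficient_of_determination(ys_orig, ys_line):
    # Compare the error of the regression line with the error of a flat mean line.
    y_mean_line = np.full_like(ys_orig, np.mean(ys_orig))
    return 1 - squared_error(ys_orig, ys_line) / squared_error(ys_orig, y_mean_line)


random.seed(0)  # fixed seed only so the sketch is repeatable
xs = np.arange(40, dtype=np.float64)
ys = np.array([1 + 2 * i + random.randrange(-40, 40) for i in range(40)],
              dtype=np.float64)

m, b = best_fit_slope_and_intercept(xs, ys)
regression_line = m * xs + b
r_squared = coefficient_of_determination(ys, regression_line)
print(r_squared)
```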
Let’s go ahead and run that real quick. In fact,
we’re still graphing that prediction.
We could get rid of the prediction, but leaving it in might be kind of interesting.
We might run into trouble with that, I’m not really sure,
but we’ll just do that.
Let’s run it and see.
We might have to change something else, but I think that’s everything we’d need to change.
So here’s our data set, and sure enough there’s a nice best fit line for us.
And we would kind of agree with that visually.
Let’s go and graph that other point, though,
the green prediction.
I don’t even see it. It was for x equals 8.
I guess it would be right on the line.
And then we’re plotting the regression line, so
I’m guessing the line is just being drawn right over it.
Still not seeing it, however.
It was x equals 8, right? So it’s probably this little dot right here. I’ll zoom in.
It’s there. I don’t know if you’ll be able to see that
on the video, but there is indeed a point there. In fact we could do something like...
I think with scatter it’ll be s equals, and then let’s try 100.
So this is for the size.
And indeed there is a huge green dot there. Okay.
Anyway, so there’s our prediction, and as you should expect it’s perfectly on the line.
So we’re going to close this out.
So now how would we test our assumption?
Well, recall that we’ve got hm and then variance.
Variance is currently 40,
and we saw R squared was something like 0.5, I think. Let’s look again.
Well, since it’s random data, this time it was 0.6. Okay. So in theory,
if we decrease the variance, what should happen?
What should happen is that number should go up
pretty significantly, so long as we decrease variance significantly. So let’s do it.
Let’s do 10. We can save and run that.
As you can see it’s much tighter, and sure enough
the coefficient of determination is very, very strong. It’s 0.92, much better than before.
What if we change this to 80 now? It should be less than 0.6.
And sure enough it is less than 0.6. And so
what you can begin to do is
write a program that automatically calculates the coefficient of determination
for a sample dataset.
For example, you would start with a variance of 40
and save that number. Then you would change the variance to 10,
and the coefficient of determination should be greater than that initial number.
And then if you made the variance greater, it should be lower, and so on.
That would be a way to test just that piece. We’ll call it a unit.
In theory you could build a unit test out of this, but this isn’t quite a unit test yet.
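One way such an automated check might be sketched: generate positively correlated samples at variances of 10, 40 and 80, and assert that R squared falls as the variance rises. Averaging over several trials, the fixed seed, and the use of np.polyfit for the fit are choices made for this sketch, not something from the video; the data is generated inline in the same way as described for create_dataset with a positive correlation.

```python
import random

import numpy as np


def r_squared_for(variance, hm=40, step=2):
    """R squared of a least-squares line on one noisy, positively correlated sample."""
    xs = np.arange(hm, dtype=np.float64)
    ys = np.array(
        [1 + step * i + random.randrange(-variance, variance) for i in range(hm)],
        dtype=np.float64,
    )
    m, b = np.polyfit(xs, ys, 1)  # degree-1 fit returns slope, then intercept
    ss_res = np.sum((ys - (m * xs + b)) ** 2)
    ss_tot = np.sum((ys - ys.mean()) ** 2)
    return 1 - ss_res / ss_tot


def mean_r_squared(variance, trials=30):
    # Average over several samples so one unlucky draw does not flip the comparison.
    return float(np.mean([r_squared_for(variance) for _ in range(trials)]))


random.seed(12)
tight = mean_r_squared(10)
medium = mean_r_squared(40)
loose = mean_r_squared(80)

# The assumption under test: less variance means a higher R squared.
assert tight > medium > loose
print(tight, medium, loose)
```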
So you can test that, and the other thing you could do is this:
we had a positive correlation. If we change this to False,
we should get quite an ugly data set. Sure enough, we do.
And the coefficient of determination is almost zero,
which is absolutely not surprising,
because the best fit line looks almost completely flat.
And sure enough, this data has no linear relationship at all.
So if you had a data set like this and you were trying to run linear regression on it,
and you came back with an R squared like this, something like 0.0007,
you would probably be smart enough to decide: ‘hey, my data is actually not linear’.
We can’t really do linear regression with this data.
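The no-correlation case can be checked the same way. The sketch below leans on the fact that, for a straight-line fit with a single feature, R squared equals the squared Pearson correlation, so np.corrcoef is enough; the seed, the number of trials and the variable names are assumptions made for illustration.

```python
import random

import numpy as np

random.seed(3)
hm, variance = 40, 40
xs = np.arange(hm, dtype=np.float64)


def flat_r_squared():
    # No trend at all: every y is the starting value 1 plus noise.
    ys = np.array([1 + random.randrange(-variance, variance) for _ in range(hm)],
                  dtype=np.float64)
    r = np.corrcoef(xs, ys)[0, 1]  # Pearson correlation between xs and ys
    return r ** 2


average = float(np.mean([flat_r_squared() for _ in range(50)]))
print(average)  # expected to be close to zero
```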
That said, you can do other forms of machine learning with the data,
classification for example.
With classification your data doesn’t necessarily have to be linear.
In fact a lot of classification is, or should be, linear in some way. But we’ll get there.
Anyway, that’s enough for now, I think.
Just keep in mind that when you create
big scripts and big programs like we have here, which are built on a lot of pieces,
you want to make sure that they’re right.
We could check the best fit line ourselves, kind of visually.
But R squared we could not really test that way.
You could definitely program something that would go through,
like I was saying,
and check to make sure R squared was acting
according to our assumptions, or our knowledge of how it ought to act.
So we’re basically done with regression.
But I want to make a quick edit to this video to cover two pretty important things.
One is a fundamental aspect of machine learning
that might be getting overlooked with the really simple example that we’ve used here.
And two, I made an error that I think is bad enough that we want to cover it,
plus I think you can learn a little bit from the mistake that I made.
So let’s pop over to the code and address these two things,
hopefully pretty quickly.
So first of all, looking at the data.
I’m going to change this from 1%, basically, to 10% now.
We’re going to run that.
And we’re going to see that it’s basically an exact copy of
the data leading up to it, just shifted in price a bit, right?
So coming over here,
it’s basically the same.
This version is squished up a little bit,
and that’s just because the blue line is the prediction line,
which plots even on the weekends and holidays,
whereas over here the stock price only occurs Monday to Friday, and not on holidays.
So anyway, basically an exact match,
just higher in price.
And the reason is kind of twofold.
One, we’ve created a linear model that is going to attempt to do this.
But then also we’ve made a mistake.
So we’ll address both.
The first thing is the biggest mistake.
Actually there were two mistakes.
One I noticed in the video just going back over it.
I’m pretty sure it was here: there was also a colon at the end of the X slice.
I don’t know why that was there. No one actually brought that one up.
I just happened to see it right before filming this one.
That basically is X equals X, right?
All that says is X up to forecast_out and then finish the whole thing, right?
That doesn’t do anything.
So that was just a typo.
But then you get to this point,
and we’re still kind of in a world of hurt.
Because what we were intending to do is say X is
the first... let’s say in this case it’s 10%.
Yeah, so we’re saying X is the first 90% of the data.
This is the stuff we’re going to train against.
And then with X_lately, our objective was to say
X_lately is the last 10%.
But instead what we’ve done is we’ve sliced X and redefined X here,
and then sliced X again after it’s already been redefined.
So this is actually minus forecast_out of the 90%.
Simplifying things a little bit,
this is basically up to 90%, and this is the last
10% of that 90%. It’s a little bit more, but anyway.
So that’s a failure in logic.
Okay. So really the fix is that you just cut that and paste it there.
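The ordering problem is easy to reproduce on a toy array. In this sketch, data stands in for the real feature array and forecast_out is set to 10 out of 100 rows; both are assumptions chosen so that the row numbers make the mistake obvious.

```python
import numpy as np

forecast_out = 10
data = np.arange(100)  # toy stand-in for the feature rows, numbered 0..99

# Buggy order: X is cut down first, so the later slice is taken from the 90% that remains.
X = data[:-forecast_out]
X_lately_wrong = X[-forecast_out:]
print(X_lately_wrong)  # rows 80..89: the last 10 rows of the training data

# Fixed order: take the most recent rows first, then cut X down.
X_lately = data[-forecast_out:]
X = data[:-forecast_out]
print(X_lately)  # rows 90..99: the rows that actually need a forecast
```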
And there you have it. Now this is still going to create a model that’s relatively
akin to what we’ve already seen.
And again this is because we’re using linear regression.
It’s going to create a linear model that resembles what we’ve already seen.
So again you’ve got some jaggedness, then you’ve got the jump up, and then the price.
It’s a little different,
but it’s very, very similar. Okay. So anyway, that’s just
given what we’ve done and how we’ve trained it. That’s going to happen.
So now let’s talk about the last thing,
which is the fundamentals of
what kind of features you should train against.
So what was the objective here?
First of all, let me just say the reason why we did it this way is
just for simplicity’s sake.
We’re just trying to do a really simple regression example.
But regardless of whether or not you’re interested in stock investing,
every machine learning problem is likely going to be a somewhat complex problem,
so you have to think pretty logically about the features that you choose to use.
So looking at this:
which of these things hinges directly on price, or will directly impact price, right?
Obviously Adj. Close does. What about HL_PCT? Does it matter what the price is?
No. It’s a percent, right? It’s a normalized value.
So that doesn’t have anything to do with price.
How about PCT_change? No, right?
These may be volatility, right?
Maybe magnitude, same thing with high-low percent: volatility, direction maybe,
but not price. What about Volume?
No, not price, right?
This is just magnitude, fluctuation maybe, volatility, stuff like that.
So the only thing that really hinges on price is just Adj. Close.
To illustrate that,
despite training on a future value that is indeed price,
what we can do is actually drop Adj. Close
from the features.
What do you think is going to happen when we drop this, before we graph it?
Is that going to create a similar line that follows price?
Is it going to be a falling price, an upward price, a flat line? What’s it going to create for the prediction?
So think about that. We’ll run it really quick and get our answer.
And the answer is probably not going to be what we were hoping for, right?
It’s more of a flat line. And why do we get this, right? Well,
you probably had very similar
high-low percents back when the price was $400, $600, $800, right?
Not big differences.
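A quick piece of arithmetic shows why a percentage feature cannot tell the model what the price level is. The hl_pct formula below is an assumed definition (high minus low, as a percentage of the low), and the prices are made up; the definition used earlier in the series may differ in detail.

```python
def hl_pct(high, low):
    """High-low range as a percentage of the low (assumed definition)."""
    return (high - low) * 100.0 / low


# Two hypothetical trading days with the same relative range at very different prices.
cheap = hl_pct(high=408.0, low=400.0)
pricey = hl_pct(high=816.0, low=800.0)

print(cheap, pricey)  # both come out to the same percentage
```

Since both days produce the same feature value, a model trained only on features like this has nothing that separates a $400 stock from an $800 one.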
The only thing that might be sort of impactful is the Adj. Volume,
since probably fewer people
are quickly flipping an $800 stock as opposed to a $50 stock or something like that.
But regardless,
these just aren’t the greatest features.
So think about your problem. In this case it was stock investing.
What is a stock price indicative of?
It’s indicative of the entire company’s value.
Let’s think of Google for example. Like 500 billion dollars, I think.
Why is Google worth 500 billion dollars?
Is it because of the Adj. Close, HL_PCT, PCT_change, Adj. Volume?
No! Come on! Be logical about it. You know that’s not the case.
There are people who believe in pattern recognition and stuff like this,
or at least, you know, chart patterns in stocks.
Sorry, but it’s been tested. There’s plenty of research done. That doesn’t work. But anyway,
some people still believe it.
But fundamentally, why is Google worth 500 billion dollars? It’s not because of this stuff.
Fundamentally Google’s worth 500 billion dollars
because of things like its quarterly earnings,
its price to earnings, its price to earnings to growth, its book value and so on.
These are the things that value the company.
So if you wanted to predict stock price,
you would use features that attempt to predict the company’s overall value.
Then from there you can divide that by outstanding shares and get a specific share price for the company.
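That last step is a single division. In the sketch below the 500 billion dollar valuation is the rough figure mentioned above, while the share count is a made-up number picked only to keep the arithmetic clean; implied_share_price is a hypothetical helper name.

```python
def implied_share_price(company_value, shares_outstanding):
    """Share price implied by a company valuation and its outstanding share count."""
    if shares_outstanding <= 0:
        raise ValueError("shares_outstanding must be positive")
    return company_value / shares_outstanding


company_value = 500_000_000_000   # rough valuation quoted in the video
shares_outstanding = 625_000_000  # hypothetical share count, for illustration only

price = implied_share_price(company_value, shares_outstanding)
print(price)
```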
But anyway, this was just meant to be a very simple example.
If you want to see a more complex example of
doing investing with fundamental features of companies,
I do have a tutorial series out for that.
It’s like 30-something videos if I recall, or maybe 20 or something.
But it’s kind of tedious,
because you’ve got quarterly earnings, which is every quarter.
Then you’ve got things like price to earnings to growth,
which you could measure all the time.
Book value, price to book: you can measure those all the time, and so on.
So there are a lot of these things, and also the entire company’s value,
which changes throughout the day. So anyway,
it can get really complex really quick.
So we just wanted to use a really simple example. But
if you are looking for a more complex version, I do have one.
But anyway, that’s it with regression.
Hopefully you can learn from my mistakes down here.
I’ll probably continue making mistakes, and you’ll probably make mistakes too.
And that’s just part of it, honestly.
So luckily we could visualize this and we could catch it.
But a lot of times you’re not going to be able to visually catch it.
So you want to read and reread your code.
But still you’re going to make mistakes,
unless you’re a robot or something. So anyway, hopefully you can learn from my mistakes.
Otherwise we’re going to be leaving regression behind now
and traversing into classification.
So stay tuned for that. As always, thanks for watching.