
Practical Machine Learning with Python #3: Regression Features and Labels – 译学馆

Regression Features and Labels - Practical Machine Learning Tutorial with Python p.3

好的 大家好 欢迎来到第三次机器学习课程
Alright, hello everybody, welcome to the third machine learning
and second regression tutorial video. Where we left off,
我问adjusted close这一列
I was asking whether or not the adjusted close column
would be a feature or a label,
答案是它是个特征 也有可能特征标签都不是
and the answer is that it’s really a feature, or possibly none of the above.
额 如果我们还没有决定去
Um, it could be a label if we hadn’t already kind of decided
使用high减去low的百分比或者percent change它就是标签
that we will use the high minus low percent or the percent change.
例如 你可以把adjusted close当作标签 也就是说
So, for example, you could use the adjusted close as the label if, say,
at the beginning of the day you were trying to predict what the close might be that day.
But in this case with the given features that we have chosen,
you really wouldn’t even know this value.
额 在close的值出现之前 你无法知道high 减去 low的值
um, you wouldn’t know the high minus low
也无法知道percent change的值
and you wouldn’t know the percent change until the close had already occurred.
因此如果你去训练这个分类器预测adjusted close这个值
So, if you trained the classifier to predict this value,
额 那将会是非常偏颇的分类器
um, that would be an incredibly biased classifier.
So, just kind of ask yourself, as you are thinking of these things:
is this even possible in the real world?
Cause you can kind of find yourself doing things
that seem like a great idea at the time but then are actually not even possible to do.
所以在我们的例子中 adjusted close 是一个特征
So, in our case, adjusted close will either be a feature,
实际上我们将要做的就是拿到 adjusted close 的过去的十个值 那就是一个特征
or actually, what we will do is take, like, the last 10 values of the adjusted close, and that’s a feature.
And, that’s most representative of when we actually go and dig in and write the algorithm ourselves,
额 你可以拿过去的10个值
um, you would take maybe the last 10 values,
and try to predict the future value
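The "last 10 values" idea can be sketched roughly as follows. This is only an illustration: the `make_windows` helper and the price list are made up here, and the hand-written algorithm later in the series may be organised differently.

```python
def make_windows(prices, window=10):
    """Pair each run of `window` past prices with the value that follows it."""
    samples = []
    for i in range(len(prices) - window):
        features = prices[i:i + window]   # the last `window` known values
        target = prices[i + window]       # the next value, which we try to predict
        samples.append((features, target))
    return samples


# 15 made-up closing prices, purely for illustration.
prices = [float(p) for p in range(100, 115)]
samples = make_windows(prices, window=10)

print(len(samples))   # number of (features, target) pairs
print(samples[0])     # first window of 10 prices and the price that follows
```

Each sample only uses values that were already known before the target, which is the "is this even possible in the real world?" check from above.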
好的 后面我们会详细说
Anyway, more on that later.
So, in the last tutorial we did features,
现在 这一节课里我们将定义标签
and now in this tutorial we are gonna define a label.
So, since I just got done telling you that this is not gonna be a label,
what actually would be a label?
好 标签就是未来某个时刻的价格
Well, it would be at some point in the future, the price.
好的 我们所仅有的和价格有关的列就是adjusted close
Okay. And the only price column we have anymore is adjusted close.
我们实际上想得到的就是未来的adjusted close
Um, but what we wanna do is actually get the adjusted close in the future,
或许明天 或许5天后的
maybe the next day, maybe the next 5 days, something like that.
那么我们还需要引进一些新的信息 一些未来的信息
So, we need to bring in some new information, basically to get that information from the future.
让我们把这个关了 开始工作
So, let’s go ahead and close this out and begin working on that.
首先 额 我们想
So, first of all, um we want to,
we are gonna take,
we are not gonna print the head anymore.
首先 我们将定义 forecast_column 或者forecast_col
And, first of all, we are gonna say forecast_column, or forecast_col,
把adjusted close赋值给它
is just gonna be equal to adjusted close.
我随后将解释为什么我们这么做 但是总的说来它就是个变量
I’ll explain why we are gonna do that in a second but basically it’s just a variable.
and later on, you could change this variable to be a different forecast column.
那么 也许你处理的不是股票价格
So, you might not be working with stock prices,
当然 线性回归 机器学习还可以应到不仅仅是股票价格的领域
there are other things that you can use linear regression on, and of course machine learning, other than stock prices.
因此 今后如果你处理的不是股票价格 也将会用到相似的代码
So, in the future, if you aren’t, you’ll be able to use very similar code.
显然 你需要修改这个点之前的代码
You’ll obviously change the code leading up to this point,
把forecast column改成未来你想改的预测列
but you just change forecast_col to be whatever you want it to be in the future.
我会向你展示什么时候用到这个代码 为啥这个代码有用
And I’ll show you why when we get to the code, why that’s gonna be useful.
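A small sketch of why the forecast column lives in a variable. The frame and the column names (`Adj. Close`, `Adj. Volume`) are made up for illustration and may not match the dataframe built in the previous video.

```python
import pandas as pd

# Tiny made-up frame; the column names are assumptions for illustration.
df = pd.DataFrame({
    "Adj. Close": [50.0, 51.5, 53.0],
    "Adj. Volume": [1000.0, 1200.0, 900.0],
})

forecast_col = "Adj. Close"      # the one place that names the column to forecast
target = df[forecast_col]        # everything downstream refers to the variable

# Forecasting a different column later only means changing this one assignment.
forecast_col = "Adj. Volume"
other_target = df[forecast_col]

print(target.tolist())
print(other_target.tolist())
```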
现在 为了不 不缺少数据 那么 df.fillna
Um, now, what we’re gonna do, just in case there is missing data, is df.fillna.
fillna 是 fill(填满)
So fillna is just “fill NA”.
na的意思是不可用(not available) 在pandas里它代表NaN(Not a Number)(不是数字)
NA is for “not available”, or in pandas terms it’s actually gonna be a NaN in most cases, and that’s “not a number”.
现在我们要“fillna”一个具体的数字 将它赋值为-99999
So now we are gonna ‘fillna’ with a specific value. We’re gonna do negative ninety-nine thousand nine hundred and ninety-nine (-99999).
And we’re gonna say inplace equals true.
在机器学习领域 你不能处理NaN值
So with machine learning, you can’t work with nan data.
So you actually have to replace the nan data with something.
另外一个选择是你可以删除整列 但是在现实世界里你并不想在机器学习里删除数据
Or you can get rid of that entire column, but you don’t want to get rid of data. In machine learning in the real world,
you will actually find that you are missing a lot of data.
你或许缺一列数据 但你已经有了其他列数据 那么不必要的话就不要删除那一列数据
You are lacking maybe one column, but you have got the other columns and you don’t wanna sacrifice data if you don’t have to.
你可以这样做 它将会被视为数据集里的异常值
So you can do this and it will be treated as an outlier in your dataset.
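Here is a minimal sketch of the fill step on a tiny made-up frame (the column names and values are assumptions for illustration only). Every NaN is overwritten with -99999, so those cells end up far away from the real values, like an outlier.

```python
import numpy as np
import pandas as pd

# Made-up frame with two missing values.
df = pd.DataFrame({
    "Adj. Close": [50.0, np.nan, 53.0],
    "Adj. Volume": [1000.0, 1200.0, np.nan],
})

# Replace every NaN with the sentinel value, modifying df itself.
df.fillna(-99999, inplace=True)

print(df)
```

No rows or columns are lost this way; the trade-off is that the sentinel becomes an extreme value the algorithm has to cope with.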
And again this is just one more reason why going through and doing the algorithm by hand
will help you understand so much better what kind of effect that is gonna have on the algorithm.
你会感谢我的 一起过了一遍
So, you’ll be thankful that we go through it.
And then basically you’ll learn, through each algorithm, what doing something like that will do.
无论如何 不删数据是一个选择 我认为最好的选择
So anyways, that’s the choice, that’s the best choice in my opinion rather than getting rid of data.
Now, we are gonna forecast out.
这是一个回归算法 使用回归进行预测
This is a regression algorithm; generally you use regression to forecast out.
这样做不是必须的 但我们就这样做吧
You don’t have to but generally that’s what you are doing.
So I am gonna define forecast_out as being equal to the int value of math.ceil,
呃 ceil里的值等于0.1乘df的长度
um, and what goes inside ceil will be 0.1 times the length of the df.
So, first of all, we also need to import math.
But first, what are we doing there? math.ceil will take any number and round it up to the ceiling.
So let’s say 0.1 times the length of the dataframe was gonna return a number with a decimal point,
返回值应该是0.2 对吗
that was gonna be like point 2, right?
Let’s say that was gonna happen.
Math dot ceil will round that up to 1.
So, math.ceil rounds everything up to the nearest whole number.
额 然后我们把它强制类型转换为整数
So, um, and then we are making it an integer value,
额 就是这样的 因为math.ceil将会返回一个浮点数
um, just so, cause I think math dot ceil returns a float
and we don’t really want it to be a float either.
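A quick sketch of the rounding step, using a hypothetical row count in place of `len(df)`. Note that in Python 3 `math.ceil` already returns an `int` (it returned a float in Python 2), so the `int(...)` wrapper is harmless there and keeps the code safe either way.

```python
import math

n_rows = 42                                   # hypothetical stand-in for len(df)
forecast_out = int(math.ceil(0.1 * n_rows))   # 0.1 * 42 is about 4.2, rounded up

print(forecast_out)
print(type(forecast_out).__name__)

# A small fraction such as 0.2 is rounded up to the next whole number as well.
print(math.ceil(0.2))
```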
但是不管怎么样 这将是预测的天数
But anyway, uh, this will be the number of days out,
基本上我们将要做的就是 预测10%的数据
so basically what we are gonna do here is we are gonna try to predict out 10 percent of the dataframe
and you’ll see that actually when we go out and do this,
it’s not like you’ll just get 1 point 10 percent out,
你会得到明天的价格 后天的价格等等
you can get tomorrow’s price and the next days price and so on.
Um, you’re just using data that came 10 days ago to predict today.
好 额 你可以随便改变这个值 对吗
Ok. So, um, feel free to change that, right?
或许你想要把它改成0.01 是吗?
Maybe you want point 01, right?
或许你只想预测明天的价格 或者其他的
Maybe you want to just predict like tomorrow’s price or something.
You can play around with that if you want.
We are just making stuff up basically as we go.
如果你想改变数据 尽管去改吧
So if you wanna change that, by all means change it.
在我们忘记之前 先在代码前面导入math这个包
So let’s go ahead and go to the top and import math before we forget.
好的 现在 我们需要一个 真正的 我们已经有标签啦
Okay, so now, we need a, the actual, so we’ve got labels,
抱歉 是我们已经有特征了 对吗
oh I am sorry we have got features, right?
这些是我们的特征 或者这些是我们的特征 我们现在需要个标签
These are our features, or these are our features and now we need that label,
现在我们也有forecast_out这个变量了 可以创建标签啦
so now that we have forecast out we can create that label.
所以我们定义df 然后是label这一列
So we’re gonna say df, and then the label column.
label 等于 df的forcast_col这一列
The label will be equal to df of forecast_col;
so that’s why we used forecast column.
That way if later on you decide to change something
you’ll be able to just change this variable rather than all the feature variables.
因此 它等于df 的forecast_col这一列 然后.shift(-forecast_out)
So it’ll be equal to the df forecast column and then we are gonna do a dot shift minus forecast out.
那就是我们需要整数值的原因 因为要在这一列用到shift
That’s why we needed it to be an int cause we are basically shifting the columns.
So, what we’ve done is we are shifting the columns negatively.
继续 如果你有一列
So it’ll go, basically if you have a column here
它将会被转换 几乎所有的数据
it’ll get shifted up, almost like in a spreadsheet.
用这种方式 每一行 标签这一列每一行的值
This way, each row, the label column for each row
将会是adjusted close未来10天的价格 对吗
will be the adjusted close price 10 days into the future. Okay?
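To see what the negative shift does, here is a small sketch on made-up prices with a horizon of 2 rows (the values, the `Adj. Close` column name, and the horizon are all assumptions chosen to keep the output short).

```python
import pandas as pd

# Five made-up closing prices.
df = pd.DataFrame({"Adj. Close": [10.0, 11.0, 12.0, 13.0, 14.0]})

forecast_col = "Adj. Close"
forecast_out = 2   # small illustrative horizon, in rows

# A negative shift moves the column up: each row's label is the value
# found forecast_out rows later in the same column.
df["label"] = df[forecast_col].shift(-forecast_out)

print(df)
```

The last `forecast_out` rows have no future value to pull up, so their label is NaN.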
那就是标签 所以特征就是这些属性
So that’s our label, so our features are these attributes of
that in our mind may cause the adjusted close price in 10 days to change, or rather 10 percent out.
So actually this will be much greater than 10 days,
because we didn’t even specify the timeframe.
我们以后可以再修改这个数 那真的不太重要
So, we can tinker with this number later, it’s really not that important.
额 回归 我向你承诺你不会因为这个算法而变得富有
Um, regression, you aren’t gonna get rich on just this algorithm, I promise you.
But it’s actually good,
you’ll find this is actually not a bad model of stock price.
如果你加一些更有用的特征 这个模型将会更加完善
And as you add more useful features, it can get, it can get pretty good.
无论如何 我们现在有了label这一列
But, anyway, um, so now we have our label column
and let’s go ahead and print df dot head again.
So this just prints like the first 5 rows of the dataframe.
Again if there’s anything we are doing with pandas that you are like “What’s going on?”,
你可以问我 我会给你提供一些教程
ask and I can point you to a tutorial,
because I’ve done tutorials on everything that I am gonna be doing.
好 这些是 我们的特征列
Um, ok, so each of these columns is one of our features,
and then we finally have a label column,
which is timed into the future, um, for our data.
So, now what we are gonna go ahead and do is
in fact let me do a df dot,
定义df.tail 定义df.dropna
let’s do a df dot tail and also let’s just do a df dot drop na
and then inplace equals true.
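A sketch of what the dropna call does after a shift, again on made-up data with an assumed `Adj. Close` column and a horizon of 2 rows.

```python
import pandas as pd

df = pd.DataFrame({"Adj. Close": [10.0, 11.0, 12.0, 13.0, 14.0]})
df["label"] = df["Adj. Close"].shift(-2)   # the last 2 labels become NaN

print(df.tail(3))          # the final rows show the NaN labels

df.dropna(inplace=True)    # remove every row that still contains a NaN
print(len(df))
print(df.tail(3))
```

Only rows that have both features and a label remain, at the cost of losing the last few rows of the frame.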
因为这些数据有点大 大概10%的数据
Cause those are some awfully high numbers, for 10 percent out.
真有趣 我猜这些价格可能因为shift而改变了很多
Interesting. So I guess prices changed that much by that shift.
So let’s try a smaller shift.
非常好 数据的10%已经出来了
Um, fascinating, that that would be 10 percent out.
That’s a little better.
Maybe, maybe we’ll use that point 01.
让我们用0.01 因为0.1太大了
Let’s use that one, cause the other ones were just so huge.
所以让我们打印df.head 看看这些数字
So let’s go back to head and see that number.
So if, if you are not following,
我正在把forecast的价格和adjusted close的价格做对比
I am just comparing the forecast price to the adjusted close price.
So of course when the stock price opens,
实际上percent change的值很大 从50到66
this is actually a significant percent change, right, from 50 to 66,
但是股票就是这样 当然谷歌的表现也很好
but the stock had just come out, and of course Google does very well over time.
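The size of that move can be checked with a one-line calculation. The `percent_change` helper below is made up for this sketch; 50 and 66 are the rounded prices mentioned above.

```python
def percent_change(old, new):
    """Relative change from old to new, expressed in percent."""
    return (new - old) / old * 100.0


change = percent_change(50.0, 66.0)
print(round(change, 2))
```

Going from 50 to 66 is a rise of roughly 32 percent between the close and the shifted label.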
当然 无论如何 现在我认为0.01比较好
So, so yeah, but anyway, yeah I think I’ll go with point 01 for now.
当我们进行预测的时候 可以把两者混合使用
Or we’ll mess with both whenever we get to predicting stuff.
这就是这节课的内容 我们有了特征
Anyway, um, that’s it for this one, so we’ve done features,
我们有了标签 现在我们准备
we’ve got our label and now we are actually ready to
训练 测试 预测并且在真实的数据上运行这个算法
train, test, predict and actually run this algorithm on some realistic data.
请继续关注 如果你有任何的问题 评论 无论什么
So stay tuned for that. If you have any questions, comments, concerns, whatever, up to this point,
欢迎在视频下方留言 谢谢观看
feel free to leave them below. Otherwise, as always, thanks for watching,
谢谢大家的支持和订阅 下节课见
thanks for all the support and subscriptions, and until next time.