Practical Machine Learning with Python #3: Regression Features and Labels – 译学馆

Regression Features and Labels - Practical Machine Learning Tutorial with Python p.3

All right, hello everybody, and welcome to the third machine learning tutorial video, the second one on regression.
Where we left off, I was asking whether the adjusted close column would be a feature or a label.
The answer is that it is really a feature, or possibly none of the above.
It could be a label if we hadn't already decided to use the high-minus-low percent and the percent change.
For example, you could use the adjusted close as the label if, at the beginning of the day, you were trying to predict what the close would be that day.
But with the features we have chosen, you wouldn't have those values in time: you can't know the high-minus-low percent or the percent change until the close has already occurred.
So if you trained the classifier to predict this value, it would be an incredibly biased classifier, because its inputs only become available after the close.
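To make the bias concrete, here is a minimal sketch, assuming the percent change feature is defined as the open-to-close change in percent (the transcript does not spell out the formula, so that definition and both function names are assumptions). Under that assumption, the feature can simply be inverted to recover the same-day close, which is why such a model would look far better than it really is.

```python
# Assumed (hypothetical) definition: open-to-close change, in percent.
def percent_change(open_price, close_price):
    return (close_price - open_price) / open_price * 100.0


def close_from_features(open_price, pct_change):
    # Inverting the feature recovers the would-be label: this is the leak.
    return open_price * (1.0 + pct_change / 100.0)


open_price, close_price = 50.0, 52.5          # made-up prices
pct = percent_change(open_price, close_price)
recovered = close_from_features(open_price, pct)
print(pct, recovered)
```

Any feature that is only known after the close has this problem when the label is that same close.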
So keep this kind of question in mind: is this even possible in the real world?
You can find yourself doing things that seem like a great idea at the time but turn out not to be possible in practice.
So in our case, adjusted close will be a feature.
What we would actually do is take something like the last 10 values of the adjusted close, and that is a feature.
That is most representative of what we will do when we dig in and write the algorithm ourselves: take maybe the last 10 values and try to predict the future value.
Anyway, more on that later.
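The "last 10 values" idea can be sketched in plain Python as follows. This is only an illustration of the windowing idea, not the code the series later writes; `make_windows` and the price list are made up for the example.

```python
def make_windows(prices, window=10):
    """Pair each run of `window` consecutive prices with the price that follows it."""
    samples = []
    for i in range(len(prices) - window):
        features = prices[i:i + window]   # the last `window` values
        target = prices[i + window]       # the value that comes next
        samples.append((features, target))
    return samples


prices = [float(p) for p in range(100, 115)]  # 15 made-up prices
samples = make_windows(prices, window=10)
print(len(samples), samples[0])
```

Each sample uses only values that were already known before the target, so nothing from the future leaks into the features.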
In the last tutorial we did features, and in this tutorial we are going to define a label.
Since I have just told you that adjusted close itself is not going to be the label, what actually would be?
Well, it would be the price at some point in the future.
The only price column we have left is adjusted close, so what we want is the adjusted close in the future: maybe the next day, maybe five days out, something like that.
To get that, we need to bring some information about the future into each row.
So let's close this out and begin working on that.
First of all, we are not going to print the head anymore.
We are going to say that forecast_col is equal to the adjusted close column.
I'll explain why in a second, but basically it's just a variable, and later on you could change it to a different forecast column.
You might not be working with stock prices; there are of course other things you can use linear regression, and machine learning in general, on.
In that case you'll be able to use very similar code.
You'll obviously change the code leading up to this point, but then you just change forecast_col to whatever you want to forecast.
I'll show you why that is useful when we get to that code.
Now, just in case there is missing data, we call df.fillna.
fillna just means "fill NA". NA stands for "not available"; in pandas terms it will in most cases be a NaN, which is "not a number".
We are going to fill NA with a specific value, -99999, and pass inplace=True.
With machine learning, you can't work with NaN data, so you have to replace it with something.
Alternatively, you can get rid of that entire column, but you don't want to throw data away. In the real world you will find that a lot of data is missing.
You may be lacking one column but still have the others, and you don't want to sacrifice data if you don't have to.
So you can do this, and the filled value will be treated as an outlier in your dataset.
Again, this is one more reason why working through the algorithm by hand helps: you will understand much better what kind of effect a choice like this has on the algorithm, so you'll be thankful that we go through it.
As we go through each algorithm, you'll learn what doing something like this does.
Anyway, in my opinion that is a better choice than getting rid of data.
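A minimal sketch of the fillna step described above, on a tiny made-up frame. The column names and values here are placeholders, not the tutorial's actual data.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value in each column.
df = pd.DataFrame({
    "adj_close": [50.0, np.nan, 52.0],
    "volume": [1000.0, 1200.0, np.nan],
})

# Replace every NaN with a sentinel value far outside the normal range.
df.fillna(-99999, inplace=True)
print(df)
```

The sentinel keeps every row in the dataset. How much it distorts the fit depends on the algorithm, which is the point the transcript makes about understanding each algorithm by hand.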
Now we define forecast_out.
This is a regression algorithm, and generally you use regression to forecast out. You don't have to, but generally that's what you are doing.
I am going to define forecast_out as the int value of math.ceil of 0.1 times the length of df. (We also need to import math.)
So what is that doing? math.ceil takes a number and rounds it up to the ceiling.
Say the expression was going to produce a fractional value like 0.2: math.ceil would round that up to 1.
math.ceil rounds everything up to the nearest whole number.
Then we make it an integer, because in Python 2 math.ceil returns a float, and we don't want a float. (In Python 3 it already returns an int, so the cast is harmless.)
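The rounding behaviour can be checked with the standard library alone. This is a small sketch; `forecast_horizon` is a hypothetical helper name wrapping the expression from the transcript.

```python
import math


def forecast_horizon(n_rows, fraction=0.1):
    """Number of rows to forecast out: `fraction` of the data, rounded up."""
    return int(math.ceil(fraction * n_rows))


print(forecast_horizon(2))                     # 0.1 * 2 = 0.2, rounded up
print(forecast_horizon(25))                    # 0.1 * 25 = 2.5, rounded up
print(forecast_horizon(1005, fraction=0.01))   # about 10.05, rounded up
```

Rounding up rather than down guarantees a horizon of at least one row for any non-empty dataframe.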
Anyway, this will be the number of days out. Basically we are going to try to predict out 10 percent of the dataframe.
You'll see when we actually do this that you don't just get one point 10 percent out; you can get tomorrow's price, the next day's price, and so on.
You're just using data from that many days earlier (10 days, say) to predict today.
So feel free to change that. Maybe you want 0.01; maybe you just want to predict tomorrow's price. You can play around with it if you want.
We are basically making things up as we go, so if you want to change it, by all means change it.
Let's go to the top and import math before we forget.
OK, so now we've got our features, and we need that label. Now that we have forecast_out, we can create it.
We are going to say that df['label'] is equal to df[forecast_col], and that's why we used forecast_col: if you later decide to change something, you only change this one variable rather than all the feature variables.
Then we add .shift(-forecast_out), so the label is df[forecast_col].shift(-forecast_out).
That's why we needed it to be an int: we are shifting the column by a whole number of rows.
What we've done is shift the column negatively, so each value moves up, almost like shifting a column up in a spreadsheet.
This way, the label for each row is the adjusted close price 10 days into the future. OK?
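A small sketch of the label shift on made-up prices. The column name `adj_close` and the 0.2 fraction are assumptions chosen to keep the demo tiny; the transcript uses 0.1 and later 0.01 on the real data.

```python
import math

import pandas as pd

# Ten made-up prices: 10.0, 11.0, ..., 19.0.
df = pd.DataFrame({"adj_close": [float(p) for p in range(10, 20)]})

forecast_col = "adj_close"                     # hypothetical column name
forecast_out = int(math.ceil(0.2 * len(df)))   # 2 rows for this 10-row frame

# A negative shift moves values up, so each row's label is the price
# `forecast_out` rows later.
df["label"] = df[forecast_col].shift(-forecast_out)
print(df)
```

The last `forecast_out` rows end up with NaN labels, because no future price exists for them yet.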
So that's our label, and our features are the attributes that, in our minds, may cause the adjusted close price to change 10 days out, or rather 10 percent out.
In fact this will be much more than 10 days, because we never specified the timeframe in days.
We can tinker with this number later; it's really not that important.
With regression, you aren't going to get rich on just this algorithm, I promise you.
But it's actually decent: you'll find it is not a bad model of stock price, and as you add more useful features it can get pretty good.
Anyway, now we have our label column, so let's print df.head() again. This just prints the first 5 rows of the dataframe.
Again, if there's anything we are doing with pandas that makes you ask "What's going on?", ask, and I can point you to a tutorial, because I've done tutorials on everything I am going to be doing.
OK, so each of these columns is a feature, and then we finally have a label column, which is shifted into the future relative to our data.
In fact, let me look at df.tail() instead, and also call df.dropna with inplace=True.
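A minimal sketch of why dropna follows the shift, again on made-up data with an assumed column name: the rows at the end of the frame have no future price, so their labels are NaN and get dropped.

```python
import pandas as pd

df = pd.DataFrame({"adj_close": [10.0, 11.0, 12.0, 13.0, 14.0]})  # made-up prices
forecast_out = 2                                                 # arbitrary small horizon

df["label"] = df["adj_close"].shift(-forecast_out)
rows_before = len(df)

# Drop the trailing rows whose label is NaN after the shift.
df.dropna(inplace=True)
print(rows_before, len(df))
print(df.tail())
```

Note the order of operations: the earlier fillna call runs before the shift, so the NaNs created by the shift are still present for dropna to remove.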
Those are some awfully high numbers for 10 percent out. Interesting. I guess prices changed that much over that shift.
Let's try a smaller shift.
Fascinating, that that's what 10 percent out looked like. That's a little better.
Maybe we'll use 0.01. Let's use that one, because the other numbers were just so huge.
Let's go back to head and look at that number.
If you are not following, I am just comparing the forecast price, the label, to the adjusted close price.
Of course, right after the stock opened this is a significant percent change, from 50 to 66, but the stock had just come out, and of course Google does very well over time.
Anyway, I think I'll go with 0.01 for now, or we'll experiment with both when we get to predicting.
Anyway, that's it for this one. We've done features, we've got our label, and now we are ready to train, test, predict, and actually run this algorithm on some realistic data.
So stay tuned for that. If you have any questions, comments, or concerns up to this point, feel free to leave them below. Otherwise, as always, thanks for watching, thanks for all the support and subscriptions, and until next time.

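Translation credits
Video summary

Regression in machine learning: defining labels and forecasting data. Worth watching.

Transcriber

[B]倔强

Translator

[B]倔强

Reviewer

知易行难

Video source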

https://www.youtube.com/watch?v=lN5jesocJjk
