
What Makes a Good Feature? - Machine Learning Recipes #3

JOSH GORDON: Classifiers are only as good as the features you provide. That means coming up with good features is one of your most important jobs in machine learning. But what makes a good feature, and how can you tell? If you're doing binary classification, then a good feature makes it easy to decide between two different things. For example, imagine we wanted to write a classifier to tell the difference between two types of dogs: greyhounds and Labradors.
Here we'll use two features: the dog's height in inches and their eye color. Just for this toy example, let's make a couple of assumptions about dogs to keep things simple. First, we'll say that greyhounds are usually taller than Labradors. Next, we'll pretend that dogs have only two eye colors, blue and brown, and that the color of their eyes doesn't depend on the breed of dog. This means that one of these features is useful and the other tells us nothing.
To understand why, we'll visualize them using a toy dataset I'll create. Let's begin with height. How useful do you think this feature is? Well, on average, greyhounds tend to be a couple of inches taller than Labradors, but not always. There's a lot of variation in the world. So when we think of a feature, we have to consider how it looks for different values in a population.

Let's head into Python for a programmatic example. I'm creating a population of 1,000 dogs, 50-50 greyhound and Labrador. I'll give each of them a height. For this example, we'll say that greyhounds are on average 28 inches tall and Labradors are 24. Now, all dogs are a bit different. Let's say that height is normally distributed, so we'll make both of these plus or minus 4 inches. This will give us two arrays of numbers, and we can visualize them in a histogram. I'll add a parameter so greyhounds are in red and Labradors are in blue. Now we can run our script.
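The script described here might be sketched as follows. This mirrors the toy setup from the transcript (1,000 dogs, heights of 28 and 24 inches with a spread of 4); the exact variable names and the use of NumPy and Matplotlib are assumptions on my part:

```python
import numpy as np
import matplotlib.pyplot as plt

# A population of 1,000 dogs: 50-50 greyhound / Labrador.
greyhounds = 500
labs = 500

# Heights are normally distributed: mean 28 or 24 inches, +/- 4 inches.
grey_height = 28 + 4 * np.random.randn(greyhounds)
lab_height = 24 + 4 * np.random.randn(labs)

# Histogram of both populations: greyhounds in red, Labradors in blue.
plt.hist([grey_height, lab_height], stacked=True, color=['r', 'b'])
plt.show()
```

Running this shows the two overlapping height distributions described below.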
This shows how many dogs in our population have a given height. There's a lot of data on the screen, so let's simplify it and look at it piece by piece. We'll start with dogs on the far left of the distribution, say, those who are about 20 inches tall. Imagine I asked you to predict whether a dog with this height was a Lab or a greyhound. What would you do? Well, you could figure out the probability of each type of dog given their height. Here, it's more likely the dog is a Lab. On the other hand, if we go all the way to the right of the histogram and look at a dog who is 35 inches tall, we can be pretty confident they're a greyhound. Now, what about a dog in the middle? You can see the graph gives us less information here, because the probability of each type of dog is close. So height is a useful feature, but it's not perfect.
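The idea of "figuring out the probability of each type of dog given their height" can be made concrete by counting, within a narrow height band, how many dogs of each breed fall there. This is a sketch under the same toy assumptions; the band width, random seed, and function name are my own choices, not from the video:

```python
import numpy as np

rng = np.random.default_rng(0)
grey_height = 28 + 4 * rng.standard_normal(500)  # greyhounds: 28 +/- 4 in
lab_height = 24 + 4 * rng.standard_normal(500)   # Labradors: 24 +/- 4 in

def p_greyhound(height, band=1.0):
    """Empirical P(greyhound | height), counting dogs within +/- band inches."""
    greys = np.sum(np.abs(grey_height - height) < band)
    labs = np.sum(np.abs(lab_height - height) < band)
    total = greys + labs
    return greys / total if total else 0.5

print(p_greyhound(20))  # far left: much more likely a Lab
print(p_greyhound(35))  # far right: almost certainly a greyhound
print(p_greyhound(26))  # middle: ambiguous, near 0.5
```

The middle of the distribution is exactly where this estimate becomes uninformative, which is why a single feature is rarely enough.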
That's why in machine learning, you almost always need multiple features. Otherwise, you could just write an if statement instead of bothering with the classifier. To figure out what types of features you should use, do a thought experiment. Pretend you're the classifier. If you were trying to figure out if this dog is a Lab or a greyhound, what other things would you want to know? You might ask about their hair length, or how fast they can run, or how much they weigh. Exactly how many features you should use is more of an art than a science, but as a rule of thumb, think about how many you'd need to solve the problem.
Now let's look at another feature, like eye color. Just for this toy example, let's imagine dogs have only two eye colors, blue and brown, and let's say the color of their eyes doesn't depend on the breed of dog. Here's what a histogram might look like for this example. For most values, the distribution is about 50/50. So this feature tells us nothing, because it doesn't correlate with the type of dog. Including a useless feature like this in your training data can hurt your classifier's accuracy. That's because there's a chance it might appear useful purely by accident, especially if you have only a small amount of training data.
You also want your features to be independent. Independent features give you different types of information. Imagine we already have a feature, height in inches, in our dataset. Ask yourself, would it be helpful if we added another feature, like height in centimeters? No, because it's perfectly correlated with one we already have. It's good practice to remove highly correlated features from your training data. That's because a lot of classifiers aren't smart enough to realize that height in inches and height in centimeters are the same thing, so they might double-count how important this feature is.
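A quick way to see what "perfectly correlated" means here: express the same heights in centimeters and measure the correlation coefficient. A minimal sketch, with made-up sample heights:

```python
import numpy as np

height_in = np.array([24.0, 26.5, 28.0, 23.2, 29.1])  # hypothetical dog heights
height_cm = height_in * 2.54                          # same information, new units

# Pearson correlation between the two "features".
r = np.corrcoef(height_in, height_cm)[0, 1]
print(r)  # ~1.0: one of the two columns is redundant
```

Since one column is just a linear rescaling of the other, the correlation is exactly 1, and dropping one of them loses nothing.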
Last, you want your features to be easy to understand. For a new example, imagine you want to predict how many days it will take to mail a letter between two different cities. The farther apart the cities are, the longer it will take. A great feature to use would be the distance between the cities in miles. A much worse pair of features would be the cities' locations given by their latitude and longitude. And here's why: I can look at the distance and make a good guess of how long it will take the letter to arrive. But learning the relationship between latitude, longitude, and time is much harder, and would require many more examples in your training data.
Now, there are techniques you can use to figure out exactly how useful your features are, and even what combinations of them are best, so you never have to leave it to chance. We'll get to those in a future episode. Coming up next time, we'll continue building our intuition for supervised learning. We'll show how different types of classifiers can be used to solve the same problem and dive a little bit deeper into how they work. Thanks very much for watching, and I'll see you then.


Translation info
Transcribed by: collected from the web
Translated by: 知易行难
Reviewed by: auto-approved
Video source: https://www.youtube.com/watch?v=N9fDIAflCMY
