
Visualizing a Decision Tree - Machine Learning Recipes #2

[MUSIC PLAYING]
Last episode, we used a decision tree as our classifier.
Today we’ll add code to visualize it so we can see how it works under the hood.
There are many types of classifiers you may have heard of before, like neural nets or support vector machines.
So why did we use a decision tree to start?
Well, they have a unique property: they’re easy to read and understand.
In fact, they’re one of the few models that are interpretable, where you can understand exactly why the classifier makes a decision.
That’s amazingly useful in practice.
To get started, I’ll introduce you to a real data set we’ll work with today.
It’s called Iris.
Iris is a classic machine learning problem.
In it, you want to identify what type of flower you have based on different measurements, like the length and width of the petal.
The data set includes three different types of flowers.
They’re all species of iris: setosa, versicolor, and virginica.
Scrolling down, you can see we’re given 50 examples of each type, so 150 examples total.
Notice there are four features that are used to describe each example.
These are the length and width of the sepal and petal.
And just like in our apples and oranges problem, the first four columns give the features and the last column gives the labels, which is the type of flower in each row.
Our goal is to use this data set to train a classifier.
Then we can use that classifier to predict what species of flower we have if we’re given a new flower that we’ve never seen before.
Knowing how to work with an existing data set is a good skill, so let’s import Iris into scikit-learn and see what it looks like in code.
Conveniently, the friendly folks at scikit provided a bunch of sample data sets, including Iris, as well as utilities to make them easy to import.
We can import Iris into our code like this.
The data set includes both the table from Wikipedia as well as some metadata.
The metadata tells you the names of the features and the names of different types of flowers.
The features and examples themselves are contained in the data variable.
For example, if I print out the first entry, you can see the measurements for this flower.
These index to the feature names, so the first value refers to the sepal length, the second to the sepal width, and so on.
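The on-screen code is not part of this transcript, so the following is a minimal sketch of the import step, assuming scikit-learn's `load_iris` loader. The variable name `iris` is an illustrative choice.

```python
from sklearn.datasets import load_iris

# Load the bundled Iris data set (features, labels and metadata).
iris = load_iris()

# Metadata: names of the four features and of the three flower types.
print(iris.feature_names)
print(iris.target_names)

# The measurements for the first flower, in the same order as feature_names.
print(iris.data[0])
```

Each value in `iris.data[0]` lines up with the entry at the same position in `iris.feature_names`.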
The target variable contains the labels.
Likewise, these index to the target names.
Let’s print out the first one.
A label of 0 means it’s a setosa.
If you look at the table from Wikipedia, you’ll notice that we just printed out the first row.
Now both the data and target variables have 150 entries.
If you want, you can iterate over them to print out the entire data set like this.
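A short sketch of printing the first label and then looping over every example, again assuming `load_iris`. The output format string is an illustrative choice, not taken from the video.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# The first label, and the flower name it indexes to in target_names.
first_label = iris.target[0]
print(first_label, iris.target_names[first_label])

# Walk over all entries, printing each label with its features.
for i in range(len(iris.target)):
    print("Example %d: label %s, features %s" % (i, iris.target[i], iris.data[i]))
```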
Now that we know how to work with the data set, we’re ready to train a classifier.
But before we do that, first we need to split up the data.
I’m going to remove several of the examples and put them aside for later.
We’ll call the examples I’m putting aside our testing data.
We’ll keep these separate from our training data, and later on we’ll use our testing examples to test how accurate the classifier is on data it’s never seen before.
Testing is actually a really important part of doing machine learning well in practice, and we’ll cover it in more detail in a future episode.
Just for this exercise, I’ll remove one example of each type of flower.
And as it happens, the data set is ordered so the first setosa is at index 0, the first versicolor is at 50, and so on.
The syntax looks a little bit complicated, but all I’m doing is removing three entries from the data and target variables.
Then I’ll create two new sets of variables: one for training and one for testing.
Training will have the majority of our data, and testing will have just the examples I removed.
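One way to express this split is with NumPy's `np.delete`, sketched below. The names `test_idx`, `train_data`, `train_target`, `test_data` and `test_target` are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Index of the first setosa, first versicolor and first virginica.
test_idx = [0, 50, 100]

# Training data: everything except the three held-out rows.
train_target = np.delete(iris.target, test_idx)
train_data = np.delete(iris.data, test_idx, axis=0)

# Testing data: only the three rows that were set aside.
test_target = iris.target[test_idx]
test_data = iris.data[test_idx]

print(train_data.shape, test_data.shape)
```

`np.delete` returns new arrays, so `iris.data` and `iris.target` themselves are left unchanged.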
Now, just as before, we can create a decision tree classifier and train it on our training data.
Before we visualize it, let’s use the tree to classify our testing data.
We know we have one flower of each type, and we can print out the labels we expect.
Now let’s see what the tree predicts.
We’ll give it the features for our testing data, and we’ll get back labels.
You can see the predicted labels match our testing data.
That means it got them all right.
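A sketch of training and checking the classifier, assuming the same three-row split described earlier. `random_state=0` is an added assumption that makes repeated runs build the same tree; it is not mentioned in the video.

```python
import numpy as np
from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()
test_idx = [0, 50, 100]  # one example of each flower type

train_data = np.delete(iris.data, test_idx, axis=0)
train_target = np.delete(iris.target, test_idx)
test_data = iris.data[test_idx]
test_target = iris.target[test_idx]

# Train a decision tree on the 147 remaining examples.
clf = tree.DecisionTreeClassifier(random_state=0)  # random_state: assumption
clf.fit(train_data, train_target)

# Labels we expect, followed by the labels the tree predicts.
print(test_target)
predictions = clf.predict(test_data)
print(predictions)
```

If the two printed arrays are equal, the tree classified all three held-out flowers correctly.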
Now, keep in mind, this was a very simple test, and we’ll go into more detail down the road.
Now let’s visualize the tree so we can see how the classifier works.
To do that, I’m going to copy-paste some code in from scikit’s tutorials, and because this code is for visualization and not machine-learning concepts, I won’t cover the details here.
Note that I’m combining the code from these two examples to create an easy-to-read PDF.
I can run our script and open up the PDF, and we can see the tree.
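The tutorial code pasted in the video is not reproduced in this transcript. As a hedged stand-in, the sketch below uses scikit-learn's `export_graphviz` to write the tree as Graphviz DOT text. The file name `iris.dot` is an assumption, and turning the file into a PDF requires the separate Graphviz tools.

```python
import numpy as np
from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()
test_idx = [0, 50, 100]
train_data = np.delete(iris.data, test_idx, axis=0)
train_target = np.delete(iris.target, test_idx)

clf = tree.DecisionTreeClassifier(random_state=0)  # random_state: assumption
clf.fit(train_data, train_target)

# With out_file=None, export_graphviz returns the DOT source as a string.
dot_data = tree.export_graphviz(
    clf,
    out_file=None,
    feature_names=iris.feature_names,
    class_names=list(iris.target_names),
    filled=True,
    rounded=True,
    impurity=False,
)

# Save the DOT source; the file name is an illustrative choice.
with open("iris.dot", "w") as f:
    f.write(dot_data)
```

With Graphviz installed, `dot -Tpdf iris.dot -o iris.pdf` renders the file as a PDF.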
To use it to classify data, you start by reading from the top.
Each node asks a yes or no question about one of the features.
For example, this node asks if the petal width is less than 0.8 centimeters.
If it’s true for the example you’re classifying, go left.
Otherwise, go right.
Now let’s use this tree to classify an example from our testing data.
Here are the features and label for our first testing flower.
Remember, you can find the feature names by looking at the metadata.
We know this flower is a setosa, so let’s see what the tree predicts.
I’ll resize the windows to make this easier to see.
And the first question the tree asks is whether the petal width is less than 0.8 centimeters.
That’s the fourth feature.
The answer is true, so we proceed left.
At this point, we’re already at a leaf node.
There are no other questions to ask, so the tree gives us a prediction, setosa, and it’s right.
Notice the label is 0, which indexes to that type of flower.
Now let’s try our second testing example.
This one is a versicolor.
Let’s see what the tree predicts.
Again we read from the top, and this time the petal width is greater than 0.8 centimeters.
The answer to the tree’s question is false, so we go right.
The next question the tree asks is whether the petal width is less than 1.75.
It’s trying to narrow it down.
That’s true, so we go left.
Now it asks if the petal length is less than 4.95.
That’s true, so we go left again.
And finally, the tree asks if the petal width is less than 1.65.
That’s true, so left it is.
And now we have our prediction: it’s a versicolor, and that’s right again.
You can try the last one on your own as an exercise.
And remember, the way we’re using the tree is the same way it works in code.
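To make the walk-through concrete, here is a sketch that follows one example from the root to a leaf by reading the fitted tree's `tree_` arrays. The helper `trace` is hypothetical, written for illustration. scikit-learn tests `value <= threshold` at each node, and the exact questions depend on the tree that was trained, so they may differ from the ones shown in the video.

```python
import numpy as np
from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()
test_idx = [0, 50, 100]
train_data = np.delete(iris.data, test_idx, axis=0)
train_target = np.delete(iris.target, test_idx)
test_data = iris.data[test_idx]

clf = tree.DecisionTreeClassifier(random_state=0)  # random_state: assumption
clf.fit(train_data, train_target)


def trace(clf, sample, feature_names):
    """Hypothetical helper: print each question asked, return the label."""
    t = clf.tree_
    # scikit-learn compares float32 feature values against each threshold.
    sample = np.asarray(sample, dtype=np.float32)
    node = 0  # start at the root
    # A leaf has no children: both child ids are -1, hence equal.
    while t.children_left[node] != t.children_right[node]:
        feature, threshold = t.feature[node], t.threshold[node]
        go_left = sample[feature] <= threshold
        print("%s <= %.2f? %s" % (feature_names[feature], threshold, go_left))
        node = t.children_left[node] if go_left else t.children_right[node]
    # The leaf predicts the class with the largest share of training examples.
    return clf.classes_[np.argmax(t.value[node][0])]


label = trace(clf, test_data[1], iris.feature_names)
print("Predicted:", iris.target_names[label])
```

The label returned by `trace` should agree with `clf.predict` for the same example, since both follow the same path through the tree.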
So that’s how you quickly visualize and read a decision tree.
There’s a lot more to learn here, especially how they’re built automatically from examples.
We’ll get to that in a future episode.
But for now, let’s close with an essential point.
Every question the tree asks must be about one of your features.
That means the better your features are, the better a tree you can build.
And the next episode will start looking at what makes a good feature.
Thanks very much for watching, and I’ll see you next time.
[MUSIC PLAYING]

Translation credits
Transcription: collected from the web
Translator: 知易行难
Reviewer: approved automatically
Video source: https://www.youtube.com/watch?v=tNa99PG8hR8
