Let's Write a Decision Tree Classifier from Scratch - Machine Learning Recipes #8

JOSH GORDON: Hey, everyone. Welcome back. In this episode, we'll write a decision tree classifier from scratch in pure Python. Here's an outline of what we'll cover. I'll start by introducing the data set we'll work with. Next, we'll preview the completed tree. And then, we'll build it. Along the way, we'll cover concepts like decision tree learning, Gini impurity, and information gain. You can find the code for this episode in the description; it's available in two formats, both as a Jupyter notebook and as a regular Python file. OK, let's get started.
For this episode, I've written a toy data set that includes both numeric and categorical attributes. Our goal will be to predict the type of fruit, like an apple or a grape, based on features like color and size. At the end of the episode, I encourage you to swap out this data set for one of your own and build a tree for a problem you care about.

Let's look at the format. I've redrawn it here for clarity. Each row is an example. The first two columns provide features, or attributes, that describe the data. The last column gives the label, or class, we want to predict. If you like, you can modify this data set by adding additional features or more examples, and our program will work in exactly the same way.

Now, this data set is pretty straightforward, except for one thing: I've written it so it's not perfectly separable. By that I mean there's no way to tell apart the second and fifth examples. They have the same features, but different labels. This is so we can see how our tree handles this case. Towards the end of the notebook, you'll find testing data in the same format.
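To make the format concrete, here is a minimal sketch of what such a data set could look like in Python. The exact rows and the names training_data and header are assumptions for illustration; the notebook linked in the description has the version used in the video.

```python
# A toy data set: each row is [color, diameter, label].
# The second and fifth rows share the same features but have different
# labels, so the data is deliberately not perfectly separable.
training_data = [
    ['Green', 3, 'Apple'],
    ['Yellow', 3, 'Apple'],
    ['Red', 1, 'Grape'],
    ['Red', 1, 'Grape'],
    ['Yellow', 3, 'Lemon'],
]

# Column names, used only for printing questions in a readable way.
header = ['color', 'diameter', 'label']
```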
Now, I've written a few utility functions that make it easier to work with this data. Below each function, I've written a small demo to show how it works, and I've repeated this pattern for every block of code in the notebook.
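As a rough sketch of what such utilities might look like (the names unique_vals and class_counts are assumptions here, not necessarily the ones used in the notebook):

```python
def unique_vals(rows, col):
    """Find the unique values for a given column in a data set."""
    return set(row[col] for row in rows)

def class_counts(rows):
    """Count how many examples of each label appear in a data set."""
    counts = {}
    for row in rows:
        label = row[-1]  # the label is always the last column
        counts[label] = counts.get(label, 0) + 1
    return counts

# Demo: class_counts(training_data) -> {'Apple': 2, 'Grape': 2, 'Lemon': 1}
```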
Decision Trees
Now, to build the tree, we use the decision tree learning algorithm called CART. As it happens, there's a whole family of algorithms used to build trees from data. At their core, they give you a procedure to decide which questions to ask and when. CART stands for Classification and Regression Trees, and here's a preview of how it works.

To begin, we'll add a root node for the tree. All nodes receive a list of rows as input, and the root will receive the entire training set. Each node will ask a true/false question about one of the features, and in response to this question, we split, or partition, the data into two subsets. These subsets then become the input to two child nodes we add to the tree. The goal of the question is to unmix the labels as we proceed down, or in other words, to produce the purest possible distribution of the labels at each node. For example, the input to this node contains only a single type of label, so we'd say it's perfectly unmixed: there's no uncertainty about the type of label. On the other hand, the labels in this node are still mixed up, so we'd ask another question to further narrow it down.

The trick to building an effective tree is to understand which questions to ask and when. To do that, we need to quantify how much a question helps to unmix the labels. We can quantify the amount of uncertainty at a single node using a metric called Gini impurity, and we can quantify how much a question reduces that uncertainty using a concept called information gain. We'll use these to select the best question to ask at each point. Given that question, we'll recursively build the tree on each of the new nodes. We'll continue dividing the data until there are no further questions to ask, at which point we'll add a leaf.

To implement this, first we need to understand what types of questions we can ask about the data. And second, we need to understand how to decide which question to ask when.
What questions can we ask?
Now, each node takes a list of rows as input, and to generate a list of questions, we'll iterate over every value for every feature that appears in those rows. Each of these becomes a candidate for a threshold we can use to partition the data, and there will often be many possibilities. In code, we represent a question by storing a column number and a column value, or the threshold we'll use to partition the data. For example, here's how we'd write a question to test if the color is green. And here's an example for a numeric attribute, to test if the diameter is greater than or equal to 3.
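A minimal sketch of such a question object, assuming a small Question class (the class name, and the header list defined in the earlier sketch, are illustrative assumptions):

```python
class Question:
    """A question is a column number plus a value, used to partition a data set."""

    def __init__(self, column, value):
        self.column = column
        self.value = value

    def match(self, example):
        # Compare the feature value in the example to the stored threshold.
        val = example[self.column]
        if isinstance(val, (int, float)):
            return val >= self.value   # numeric attribute: >= test
        return val == self.value       # categorical attribute: equality test

    def __repr__(self):
        condition = '>=' if isinstance(self.value, (int, float)) else '=='
        return 'Is %s %s %s?' % (header[self.column], condition, str(self.value))

# Question(0, 'Green') -> Is color == Green?
# Question(1, 3)       -> Is diameter >= 3?
```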
In response to a question, we divide, or partition, the data into two subsets. The first contains all the rows for which the question is true, and the second contains everything else. In code, our partition function takes a question and a list of rows as input. For example, here's how we partition the rows based on whether the color is red. Here, the true rows contain all the red examples, and the false rows contain everything else.
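A sketch of what that partition function might look like, using the Question class sketched above:

```python
def partition(rows, question):
    """Split rows into two lists: those that match the question and those that don't."""
    true_rows, false_rows = [], []
    for row in rows:
        if question.match(row):
            true_rows.append(row)
        else:
            false_rows.append(row)
    return true_rows, false_rows

# true_rows, false_rows = partition(training_data, Question(0, 'Red'))
# true_rows now holds the red examples; false_rows holds everything else.
```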
Quantifying Uncertainty
The best question is the one that reduces our uncertainty the most. Gini impurity lets us quantify how much uncertainty there is at a node, and information gain will let us quantify how much a question reduces that. Let's work on impurity first.

This is a metric that ranges between 0 and 1, where lower values indicate less uncertainty, or mixing, at a node. It quantifies our chance of being incorrect if we randomly assign a label from a set to an example in that set. Here's an example to make that clear. Imagine we have two bowls: one contains the examples, and the other contains labels. First, we'll randomly draw an example from the first bowl. Then we'll randomly draw a label from the second. Now, we'll classify the example as having that label, and Gini impurity gives us our chance of being incorrect.

In this example, we have only apples in each bowl, so there's no way to make a mistake, and we say the impurity is zero. On the other hand, given a bowl with five different types of fruit in equal proportion, we'd say it has an impurity of 0.8. That's because we have only a one-in-five chance of being right if we randomly assign a label to an example.

In code, this method calculates the impurity of a data set, and I've written a couple of examples below that demonstrate how it works. You can see the impurity for the first set is zero, because there's no mixing. And here, you can see the impurity is 0.8.
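A sketch of an impurity function along these lines, using the standard Gini formula (one minus the sum of squared class probabilities); the function name gini is an assumption:

```python
def gini(rows):
    """Gini impurity: the chance of mislabeling a random example
    if we label it by drawing randomly from the set's own labels."""
    counts = class_counts(rows)
    impurity = 1.0
    for label in counts:
        prob_of_label = counts[label] / float(len(rows))
        impurity -= prob_of_label ** 2
    return impurity

# gini([['Apple'], ['Apple']]) -> 0.0    (no mixing at all)
# Five classes in equal proportion:  1 - 5 * (1/5)**2 = 0.8
```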
Information Gain
Now, information gain will let us find the question that reduces our uncertainty the most. It's just a number that describes how much a question helps to unmix the labels at a node. Here's the idea. We begin by calculating the uncertainty of our starting set. Then, for each question we can ask, we'll try partitioning the data and calculating the uncertainty of the child nodes that result. We'll take a weighted average of their uncertainty, because we care more about a large set with low uncertainty than a small set with high uncertainty. Then, we'll subtract this from our starting uncertainty, and that's our information gain. As we go, we'll keep track of the question that produces the most gain, and that will be the best one to ask at this node.
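In other words: gain = impurity(parent) - [p * impurity(true branch) + (1 - p) * impurity(false branch)], where p is the fraction of rows sent to the true branch. A short sketch in Python, assuming the gini helper above (the name info_gain is an assumption):

```python
def info_gain(true_rows, false_rows, current_uncertainty):
    """Information gain: the parent's impurity minus the weighted
    average impurity of the two child nodes."""
    p = float(len(true_rows)) / (len(true_rows) + len(false_rows))
    return current_uncertainty - p * gini(true_rows) - (1 - p) * gini(false_rows)
```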
Let's see how this looks in code. Here, we'll iterate over every value for the features. We'll generate a question for that feature, then partition the data on it. Notice we discard any questions that fail to produce a split. Then, we'll calculate our information gain. Inside this function, you can see we take the impurity of each child set and then a weighted average, and we see how much this reduces the uncertainty from our starting set. We keep track of the best value. I've written a couple of demos below as well.
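A sketch of that search over candidate questions, tying together the pieces above (the name find_best_split is an assumption):

```python
def find_best_split(rows):
    """Find the question with the highest information gain by trying
    every feature/value pair that appears in the rows."""
    best_gain = 0            # best information gain seen so far
    best_question = None     # the question that produced it
    current_uncertainty = gini(rows)
    n_features = len(rows[0]) - 1  # the last column is the label

    for col in range(n_features):
        for val in unique_vals(rows, col):
            question = Question(col, val)
            true_rows, false_rows = partition(rows, question)

            # Discard questions that fail to produce a split.
            if len(true_rows) == 0 or len(false_rows) == 0:
                continue

            gain = info_gain(true_rows, false_rows, current_uncertainty)
            if gain > best_gain:
                best_gain, best_question = gain, question

    return best_gain, best_question
```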
OK, with these concepts in hand, we're ready to build the tree. To put this all together, I think the most useful thing I can do is walk you through the algorithm as it builds a tree for our training data. This uses recursion, so seeing it in action can be helpful. You can find the code for this inside the build_tree function.

When we call build_tree for the first time, it receives the entire training set as input, and as output it will return a reference to the root node of our tree. I'll draw a placeholder for the root here in gray, and here are the rows we're considering at this node. To start, that's the entire training set. Now we find the best question to ask at this node, and we do that by iterating over each of these values. We'll split the data and calculate the information gain for each one, and as we go, we'll keep track of the question that produces the most gain.

In this case, there's a useful question to ask, so the gain will be greater than zero, and we'll split the data using that question. Next, we'll use recursion by calling build_tree again to add a node for the true branch. The rows we're considering now are the first half of the split, and again, we'll find the best question to ask for this data. Once more we split and call the build_tree function to add the child node.

Now, for this data there are no further questions to ask, so the information gain will be zero, and this node becomes a leaf. It will predict that an example is either an apple or a lemon with 50% confidence, because that's the ratio of the labels in the data. We'll continue by building the false branch, and here, this will also become a leaf: we'll predict apple with 100% confidence. Now the previous call returns, and this node becomes a decision node. In code, that just means it holds a reference to the question we asked and the two child nodes that result.

We're nearly done. Now we return to the root node and build the false branch. There are no further questions to ask, so this becomes a leaf, and it predicts grape with 100% confidence. Finally, the root node also becomes a decision node, and our call to build_tree returns a reference to it.
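A condensed sketch of that recursion, assuming small Leaf and Decision_Node containers (those class names are assumptions for illustration; the notebook in the description has the full version):

```python
class Leaf:
    """A leaf predicts label counts, e.g. {'Apple': 1, 'Lemon': 1},
    which corresponds to 50% confidence for each of the two labels."""
    def __init__(self, rows):
        self.predictions = class_counts(rows)

class Decision_Node:
    """A decision node holds a question and the two child subtrees."""
    def __init__(self, question, true_branch, false_branch):
        self.question = question
        self.true_branch = true_branch
        self.false_branch = false_branch

def build_tree(rows):
    """Recursively build the tree; stop when no question adds information."""
    gain, question = find_best_split(rows)
    if gain == 0:
        return Leaf(rows)  # no further questions to ask

    true_rows, false_rows = partition(rows, question)
    true_branch = build_tree(true_rows)
    false_branch = build_tree(false_rows)
    return Decision_Node(question, true_branch, false_branch)

# my_tree = build_tree(training_data)  # returns a reference to the root node
```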
If you scroll down in the code, you'll see that I've added functions to classify data and print the tree. These start with a reference to the root node, so you can see how it works.
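Classification then just walks the tree's questions from the root down to a leaf; a minimal sketch, again an assumption about the exact shape of the code:

```python
def classify(row, node):
    """Follow the tree's questions down to a leaf and return its predictions."""
    if isinstance(node, Leaf):
        return node.predictions
    if node.question.match(row):
        return classify(row, node.true_branch)
    return classify(row, node.false_branch)

# classify(['Green', 3, 'Apple'], my_tree)  ->  e.g. {'Apple': 1}
```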
Next Steps
OK, I hope that was helpful, and you can check out the code for more details. There's a lot more I could say about decision trees, but there's only so much we can fit into a short episode. Here are a couple of topics that are good to be aware of, and you can check out the books in the description to learn more. As a next step, I'd recommend modifying the tree to work with your own data set. This can be a fun way to build a simple and interpretable classifier for use in your projects. Thanks for watching, everyone, and I'll see you next time.

Video source: https://www.youtube.com/watch?v=LDRbO9a6XPU
