Let’s Write a Decision Tree Classifier from Scratch - Machine Learning Recipes #8

JOSH GORDON: Hey, everyone. Welcome back. In this episode, we'll write a decision tree classifier from scratch in pure Python. Here's an outline of what we'll cover. I'll start by introducing the data set we'll work with. Next, we'll preview the completed tree. And then, we'll build it. On the way, we'll cover concepts like decision tree learning, Gini impurity, and information gain. You can find the code for this episode in the description, and it's available in two formats, both as a Jupyter notebook and as a regular Python file. OK, let's get started.
For this episode, I've written a toy data set that includes both numeric and categorical attributes. Here, our goal will be to predict the type of fruit, like an apple or a grape, based on features like color and size. At the end of the episode, I encourage you to swap out this data set for one of your own and build a tree for a problem you care about.

Let's look at the format. I've re-drawn it here for clarity. Each row is an example. The first two columns provide features, or attributes, that describe the data, and the last column gives the label, or the class, we want to predict. If you like, you can modify this data set by adding additional features or more examples, and our program will work in exactly the same way.

Now, this data set is pretty straightforward, except for one thing: I've written it so it's not perfectly separable. By that I mean there's no way to tell apart the second and fifth examples. They have the same features but different labels, and this is so we can see how our tree handles this case.
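For reference, here is a minimal sketch of what a training set in this format could look like. The exact rows, column names, and values are assumptions based on the description above; the real data set is in the notebook linked in the description.

# Toy training data: each row is an example, the first two columns are
# features (color, diameter), and the last column is the label.
training_data = [
    ["Green", 3, "Apple"],
    ["Yellow", 3, "Apple"],
    ["Red", 1, "Grape"],
    ["Red", 1, "Grape"],
    ["Yellow", 3, "Lemon"],  # same features as the second row, different label
]

# Column names, used only to print questions in a readable way.
header = ["color", "diameter", "label"]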
Towards the end of the notebook, you'll find testing data in the same format. Now, I've written a few utility functions that make it easier to work with this data, and below each function, I've written a small demo to show how it works. I've repeated this pattern for every block of code in the notebook.
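As one example of the kind of helper involved, a function that counts how many examples of each class appear in a list of rows might look like the sketch below. The name class_counts is an assumption; the actual helpers are in the notebook.

def class_counts(rows):
    """Count how many examples of each label appear in a list of rows.
    Assumes the label is in the last column."""
    counts = {}
    for row in rows:
        label = row[-1]
        counts[label] = counts.get(label, 0) + 1
    return counts

# Demo: for the toy data above this returns {'Apple': 2, 'Grape': 2, 'Lemon': 1}.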
Decision Trees
Now, to build the tree we use the decision tree learning algorithm called CART. As it happens, there's a whole family of algorithms used to build trees from data. At their core, they give you a procedure to decide which questions to ask and when. CART stands for Classification and Regression Trees, and here's a preview of how it works.

To begin, we'll add a root node for the tree. All nodes receive a list of rows as input, and the root will receive the entire training set. Each node will ask a true/false question about one of the features, and in response to this question we split, or partition, the data into two subsets. These subsets then become the input to two child nodes we add to the tree. The goal of the question is to unmix the labels as we proceed down, or in other words, to produce the purest possible distribution of the labels at each node. For example, the input to this node contains only a single type of label, so we'd say it's perfectly unmixed; there's no uncertainty about the type of label. On the other hand, the labels in this node are still mixed up, so we'd ask another question to further narrow it down.

The trick to building an effective tree is to understand which questions to ask and when, and to do that, we need to quantify how much a question helps to unmix the labels. We can quantify the amount of uncertainty at a single node using a metric called Gini impurity, and we can quantify how much a question reduces that uncertainty using a concept called information gain. We'll use these to select the best question to ask at each point. Given that question, we'll recursively build the tree on each of the new nodes, and we'll continue dividing the data until there are no further questions to ask, at which point we'll add a leaf. To implement this, first we need to understand what type of questions we can ask about the data, and second, we need to understand how to decide which question to ask when.
What Questions Can We Ask?
Now, each node takes a list of rows as input, and to generate a list of questions we'll iterate over every value for every feature that appears in those rows. Each of these becomes a candidate for a threshold we can use to partition the data, and there will often be many possibilities. In code, we represent a question by storing a column number and a column value, or the threshold we'll use to partition the data. For example, here's how we'd write a question to test if the color is green, and here's an example for a numeric attribute, to test if the diameter is greater than or equal to 3.
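A minimal sketch of such a Question class is shown below. The class name, the match method, and the use of the header list sketched earlier are assumptions; the actual implementation is in the notebook.

class Question:
    """A question stores a column number (e.g., 0 for color) and a column
    value (e.g., "Green"). The match method compares the feature value in
    an example to the value stored in the question."""

    def __init__(self, column, value):
        self.column = column
        self.value = value

    def match(self, example):
        val = example[self.column]
        if isinstance(val, (int, float)):  # numeric attribute
            return val >= self.value
        return val == self.value           # categorical attribute

    def __repr__(self):
        condition = ">=" if isinstance(self.value, (int, float)) else "=="
        return "Is %s %s %s?" % (header[self.column], condition, str(self.value))

# For example:
Question(0, "Green")  # Is color == Green?
Question(1, 3)        # Is diameter >= 3?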
In response to a question, we divide, or partition, the data into two subsets. The first contains all the rows for which the question is true, and the second contains everything else. In code, our partition function takes a question and a list of rows as input. For example, here's how we partition the rows based on whether the color is red: true_rows contains all the red examples, and false_rows contains everything else.
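Under the same assumptions, a sketch of the partition function might look like this:

def partition(rows, question):
    """Split rows into the ones that match the question and the ones that don't."""
    true_rows, false_rows = [], []
    for row in rows:
        if question.match(row):
            true_rows.append(row)
        else:
            false_rows.append(row)
    return true_rows, false_rows

# Demo: split the toy data on "Is color == Red?"
true_rows, false_rows = partition(training_data, Question(0, "Red"))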
Quantifying Uncertainty
The best question is the one that reduces our uncertainty the most. Gini impurity lets us quantify how much uncertainty there is at a node, and information gain will let us quantify how much a question reduces that. Let's work on impurity first. This is a metric that ranges between 0 and 1, where lower values indicate less uncertainty, or mixing, at a node. It quantifies our chance of being incorrect if we randomly assign a label from a set to an example in that set.

Here's an example to make that clear. Imagine we have two bowls: one contains the examples and the other contains labels. First, we'll randomly draw an example from the first bowl, then we'll randomly draw a label from the second, and we'll classify the example as having that label. Gini impurity gives us our chance of being incorrect. In this example, we have only apples in each bowl, so there's no way to make a mistake, and we say the impurity is zero. On the other hand, given a bowl with five different types of fruit in equal proportion, we'd say it has an impurity of 0.8. That's because we have only a one-in-five chance of being right if we randomly assign a label to an example.

In code, this method calculates the impurity of a data set, and I've written a couple of examples below that demonstrate how it works. You can see the impurity for the first set is zero because there's no mixing, and here, you can see the impurity is 0.8.
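As a sketch under the same assumptions, the impurity function can be written using the class_counts helper from earlier. For a set with class proportions p_i, the Gini impurity is 1 minus the sum of p_i squared, which is exactly the chance of mislabeling a randomly drawn example with a randomly drawn label.

def gini(rows):
    """Gini impurity of a list of rows: 1 - sum over labels of p(label)^2."""
    counts = class_counts(rows)
    impurity = 1.0
    for label in counts:
        prob_of_label = counts[label] / float(len(rows))
        impurity -= prob_of_label ** 2
    return impurity

# Demos:
gini([["Apple"], ["Apple"]])  # 0.0 -- no mixing
gini([["Apple"], ["Orange"], ["Grape"], ["Grapefruit"], ["Blueberry"]])  # 0.8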
Information Gain
Now, information gain will let us find the question that reduces our uncertainty the most. It's just a number that describes how much a question helps to unmix the labels at a node. Here's the idea. We begin by calculating the uncertainty of our starting set. Then, for each question we can ask, we'll try partitioning the data and calculating the uncertainty of the child nodes that result. We'll take a weighted average of their uncertainty, because we care more about a large set with low uncertainty than a small set with high uncertainty. Then, we'll subtract this from our starting uncertainty, and that's our information gain. As we go, we'll keep track of the question that produces the most gain, and that will be the best one to ask at this node.
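A sketch of that calculation, assuming the gini function above:

def info_gain(left, right, current_uncertainty):
    """Starting uncertainty minus the weighted average impurity of the two child nodes."""
    p = float(len(left)) / (len(left) + len(right))
    return current_uncertainty - p * gini(left) - (1 - p) * gini(right)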
Let's see how this looks in code. Here, we'll iterate over every value for the features, generate a question for that feature, and then partition the data on it. Notice we discard any questions that fail to produce a split. Then, we'll calculate our information gain; inside this function, you can see we take a weighted average and the impurity of each set, and we see how much this reduces the uncertainty from our starting set. We keep track of the best value. I've written a couple of demos below as well.
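Putting those pieces together, a sketch of the search for the best split might look like this (find_best_split is an assumed name):

def find_best_split(rows):
    """Find the best question by iterating over every feature and value
    and computing the information gain."""
    best_gain = 0              # best information gain found so far
    best_question = None       # the question that produced it
    current_uncertainty = gini(rows)
    n_features = len(rows[0]) - 1  # the last column is the label

    for col in range(n_features):
        values = set(row[col] for row in rows)  # unique values in the column
        for val in values:
            question = Question(col, val)
            true_rows, false_rows = partition(rows, question)

            # Skip questions that fail to split the data.
            if len(true_rows) == 0 or len(false_rows) == 0:
                continue

            gain = info_gain(true_rows, false_rows, current_uncertainty)
            if gain > best_gain:
                best_gain, best_question = gain, question

    return best_gain, best_question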
OK, with these concepts in hand, we're ready to build the tree. To put this all together, I think the most useful thing I can do is walk you through the algorithm as it builds a tree for our training data. This uses recursion, so seeing it in action can be helpful. You can find the code for this inside the build_tree function.
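Here's a rough sketch of what that recursive function and its node types could look like, continuing the helpers sketched above (the Leaf and DecisionNode class names are assumptions):

class Leaf:
    """A leaf node: holds the class counts for the rows that reach it."""
    def __init__(self, rows):
        self.predictions = class_counts(rows)


class DecisionNode:
    """A decision node: holds a question and the two child subtrees."""
    def __init__(self, question, true_branch, false_branch):
        self.question = question
        self.true_branch = true_branch
        self.false_branch = false_branch


def build_tree(rows):
    """Recursively build the tree: find the best question, partition on it,
    and recurse on the two subsets; stop when no question gains anything."""
    gain, question = find_best_split(rows)

    # Base case: no further information gain, so this node becomes a leaf.
    if gain == 0:
        return Leaf(rows)

    # Otherwise, partition on the best question and build the two branches.
    true_rows, false_rows = partition(rows, question)
    true_branch = build_tree(true_rows)
    false_branch = build_tree(false_rows)

    return DecisionNode(question, true_branch, false_branch)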
When we call build_tree for the first time, it receives the entire training set as input, and as output it will return a reference to the root node of our tree. I'll draw a placeholder for the root here in gray, and here are the rows we're considering at this node; to start, that's the entire training set. Now we find the best question to ask at this node, and we do that by iterating over each of these values. We'll split the data and calculate the information gain for each one, and as we go, we'll keep track of the question that produces the most gain.

In this case, there's a useful question to ask, so the gain will be greater than zero, and we'll split the data using that question. Now, we'll use recursion by calling build_tree again to add a node for the true branch. The rows we're considering now are the first half of the split, and again, we'll find the best question to ask for this data. Once more we split and call the build_tree function to add the child node. Now, for this data there are no further questions to ask, so the information gain will be zero, and this node becomes a leaf. It will predict that an example is either an apple or a lemon with 50% confidence, because that's the ratio of the labels in the data.

We'll continue by building the false branch, and here, this will also become a leaf; it will predict apple with 100% confidence. The previous call returns, and this node becomes a decision node. In code, that just means it holds a reference to the question we asked and the two child nodes that result. And we're nearly done. Now we return to the root node and build the false branch. There are no further questions to ask, so this becomes a leaf, and it predicts grape with 100% confidence. Finally, the root node also becomes a decision node, and our call to build_tree returns a reference to it.
If you scroll down in the code, you'll see that I've added functions to classify data and print the tree. These start with a reference to the root node, so you can see how it works.
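For completeness, here is a sketch of a classification function, again under the same assumptions as the code above:

def classify(row, node):
    """Follow the tree from a node down to a leaf and return its predictions."""
    if isinstance(node, Leaf):
        return node.predictions
    if node.question.match(row):
        return classify(row, node.true_branch)
    return classify(row, node.false_branch)

# Usage: build the tree on the toy data, then classify a new example.
my_tree = build_tree(training_data)
print(classify(["Red", 2, None], my_tree))  # a dict of label counts, e.g. {'Grape': 2}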
Next Steps
OK, hope that was helpful, and you can check out the code for more details. There's a lot more I have to say about decision trees, but there's only so much we can fit into a short time. Here are a couple of topics that are good to be aware of, and you can check out the books in the description to learn more. As a next step, I'd recommend modifying the tree to work with your own data set; this can be a fun way to build a simple and interpretable classifier for use in your projects. Thanks for watching, everyone, and I'll see you next time.

Translation Info

Video Overview

This video explains the concept of decision trees, along with some metrics for quantifying how effective a question is, preparing you to build a decision tree of your own.

Transcriber

Collected from the web

Translator

知易行难

Reviewer

审核员O

Video Source

https://www.youtube.com/watch?v=LDRbO9a6XPU
