
#### Visualizing a Decision Tree - Machine Learning Recipes #2

[MUSIC PLAYING]

Last episode, we used a decision tree as our classifier. Today we’ll add code to visualize it so we can see how it works under the hood.

There are many types of classifiers you may have heard of before, things like neural nets or support vector machines. So why did we use a decision tree to start? Well, they have a unique property: they’re easy to read and understand. In fact, they’re one of the few models that are interpretable, where you can understand exactly why the classifier makes a decision. That’s amazingly useful in practice.

To get started, I’ll introduce you to a real data set we’ll work with today. It’s called Iris. Iris is a classic machine learning problem. In it, you want to identify what type of flower you have based on different measurements, like the length and width of the petal.

The data set includes three different types of flowers. They’re all species of iris: setosa, versicolor, and virginica. Scrolling down, you can see we’re given 50 examples of each type, so 150 examples total. Notice there are four features used to describe each example. These are the length and width of the sepal and petal. And just like in our apples and oranges problem, the first four columns give the features and the last column gives the labels, which is the type of flower in each row.

Our goal is to use this data set to train a classifier. Then we can use that classifier to predict what species of flower we have if we’re given a new flower we’ve never seen before. Knowing how to work with an existing data set is a good skill, so let’s import Iris into scikit-learn and see what it looks like in code.

Conveniently, the friendly folks at scikit provided a bunch of sample data sets, including Iris, as well as utilities to make them easy to import. We can import Iris into our code like this. The data set includes both the table from Wikipedia as well as some metadata. The metadata tells you the names of the features and the names of the different types of flowers.

The features and examples themselves are contained in the data variable. For example, if I print out the first entry, you can see the measurements for this flower. These index to the feature names, so the first value refers to the sepal length, the second to the sepal width, and so on.

The target variable contains the labels. Likewise, these index to the target names. Let’s print out the first one. A label of 0 means it’s a setosa. If you look at the table from Wikipedia, you’ll notice that we just printed out the first row. Both the data and target variables have 150 entries, and if you want, you can iterate over them to print out the entire data set like this.
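The import-and-inspect steps above can be sketched as follows (a minimal sketch using scikit-learn’s built-in loader; the exact print statements are illustrative, not the video’s verbatim code):

```python
# Load the built-in Iris data set and inspect it.
from sklearn.datasets import load_iris

iris = load_iris()

# Metadata: the names of the features and of the three species.
print(iris.feature_names)  # ['sepal length (cm)', 'sepal width (cm)', ...]
print(iris.target_names)   # ['setosa' 'versicolor' 'virginica']

# The first example: its four measurements and its label.
print(iris.data[0])    # [5.1 3.5 1.4 0.2]
print(iris.target[0])  # 0, which indexes to 'setosa'

# Iterate over all 150 entries to print the entire data set.
for i in range(len(iris.target)):
    print("Example %d: label %s, features %s" % (i, iris.target[i], iris.data[i]))
```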

Now that we know how to work with the data set, we’re ready to train a classifier. But before we do that, first we need to split up the data. I’m going to remove several of the examples and put them aside for later. We’ll call the examples I’m putting aside our testing data. We’ll keep these separate from our training data, and later on we’ll use our testing examples to test how accurate the classifier is on data it’s never seen before. Testing is a really important part of doing machine learning well in practice, and we’ll cover it in more detail in a future episode.

Just for this exercise, I’ll remove one example of each type of flower. As it happens, the data set is ordered so the first setosa is at index 0, the first versicolor is at 50, and so on. The syntax looks a little complicated, but all I’m doing is removing three entries from the data and target variables. Then I’ll create two new sets of variables, one for training and one for testing. Training will have the majority of our data, and testing will have just the examples I removed.
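The split described above can be written with numpy’s delete (a sketch; variable names like test_idx are my own, not necessarily the video’s):

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
test_idx = [0, 50, 100]  # the first example of each species

# Training data: everything except the three removed rows.
train_target = np.delete(iris.target, test_idx)
train_data = np.delete(iris.data, test_idx, axis=0)

# Testing data: just the three rows we removed.
test_target = iris.target[test_idx]
test_data = iris.data[test_idx]

print(train_data.shape)  # (147, 4)
print(test_target)       # [0 1 2]
```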

Now, just as before, we can create a decision tree classifier and train it on our training data. Before we visualize it, let’s use the tree to classify our testing data. We know we have one flower of each type, and we can print out the labels we expect. Now let’s see what the tree predicts. We’ll give it the features for our testing data, and we’ll get back labels. You can see the predicted labels match our testing data. That means it got them all right. Keep in mind, this was a very simple test, and we’ll go into more detail down the road.
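Training on the 147 remaining examples and predicting the three held-out flowers might look like this (a sketch; on this easy split the tree typically gets all three right, as in the video, though tree construction can vary slightly between runs):

```python
import numpy as np
from sklearn import tree
from sklearn.datasets import load_iris

# Repeat the split from above: hold out the first example of each species.
iris = load_iris()
test_idx = [0, 50, 100]
train_target = np.delete(iris.target, test_idx)
train_data = np.delete(iris.data, test_idx, axis=0)
test_target = iris.target[test_idx]
test_data = iris.data[test_idx]

# Create a decision tree classifier and train it on the training data.
clf = tree.DecisionTreeClassifier()
clf.fit(train_data, train_target)

print(test_target)             # the labels we expect: [0 1 2]
print(clf.predict(test_data))  # the labels the tree predicts
```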

Now let’s visualize the tree so we can see how the classifier works. To do that, I’m going to copy and paste some code in from scikit’s tutorials, and because this code is for visualization rather than machine learning concepts, I won’t cover the details here. Note that I’m combining the code from these two examples. Now I can run our script and open up the PDF, and we can see the tree.
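The video pastes graphviz/pydot code from scikit’s tutorials to render the tree as a PDF. As a lighter-weight substitute that needs no extra dependencies, recent versions of scikit-learn can print the same tree structure as plain text with tree.export_text (my substitution, not the video’s code):

```python
import numpy as np
from sklearn import tree
from sklearn.datasets import load_iris

# Repeat the split and training from above.
iris = load_iris()
test_idx = [0, 50, 100]
train_target = np.delete(iris.target, test_idx)
train_data = np.delete(iris.data, test_idx, axis=0)

clf = tree.DecisionTreeClassifier()
clf.fit(train_data, train_target)

# Print the tree's questions as indented text; the root asks about the petal.
print(tree.export_text(clf, feature_names=iris.feature_names))
```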

To use it to classify data, you start by reading from the top. Each node asks a yes-or-no question. For example, this node asks if the petal width is less than 0.8 centimeters. If that’s true for the example you’re classifying, go left. Otherwise, go right.

Now let’s use this tree to classify an example from our testing data. Here are the features and label for our first testing flower. Remember, you can find the feature names in the metadata. We know this flower is a setosa, so let’s see what the tree predicts. I’ll resize the windows to make this easier to see. The first question the tree asks is whether the petal width is less than 0.8 centimeters. That’s the fourth feature. The answer is true, so we proceed left. At this point, we’re already at a leaf node. There are no other questions to ask, so the tree gives us a prediction, setosa, and it’s right. Notice the label is 0, which indexes to that type of flower.
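The walk-through above amounts to a single comparison in code (a sketch; the 0.8 cm threshold is the one shown in the video’s rendered tree):

```python
from sklearn.datasets import load_iris

iris = load_iris()
first_test_flower = iris.data[0]  # the setosa we held out

# The root node's question: is the petal width (the fourth feature)
# less than 0.8 centimeters?
petal_width = first_test_flower[3]
print(petal_width < 0.8)  # True, so we go left and land on the setosa leaf
```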

Now let’s try our second testing example. This one is a versicolor. Let’s see what the tree predicts. Again we read from the top, and this time the petal width is greater than 0.8 centimeters. The answer to the tree’s question is false, so we go right.

The next question the tree asks is whether the petal width is less than 1.75. It’s trying to narrow it down. That’s true, so we go left. Now it asks if the petal length is less than 4.95. That’s true, so we go left again. And finally, the tree asks if the petal width is less than 1.65. That’s true, so left it is. And now we have our prediction: it’s a versicolor, and that’s right again.

You can try the last one on your own as an exercise. And remember, the way we’re using the tree is the same way it works in code.

So that’s how you quickly visualize and read a decision tree. There’s a lot more to learn here, especially how trees are built automatically from examples. We’ll get to that in a future episode. But for now, let’s close with an essential point: every question the tree asks must be about one of your features. That means the better your features are, the better a tree you can build. The next episode will start looking at what makes a good feature. Thanks very much for watching, and I’ll see you next time.
[MUSIC PLAYING]