
#### Writing Our First Classifier - Machine Learning Recipes #5

[MUSIC PLAYING]

Hey, everyone.

Welcome back.

In this episode, we’re going to do something special,

and that’s write our own classifier from scratch.

If you’re new to machine learning,

this is a big milestone.

Because if you can follow along and do this on your own,

it means you understand an important piece of the puzzle.

The classifier we’re going to write today

is a scrappy version of k-Nearest Neighbors.

That’s one of the simplest classifiers around.

First, here’s a quick outline of what we’ll do in this episode.

We’ll start with our code from Episode 4, Let’s Write a Pipeline.

Recall in that episode we did a simple experiment.

We imported a data set and split it into train and test.

We used train to train a classifier,

and test to see how accurate it was.

Writing the classifier is the part

we’re going to focus on today.

Previously we imported the classifier

from a library using these two lines.

Here we’ll comment them out and write our own.

The rest of the pipeline will stay exactly the same.
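That pipeline can be sketched like this (a minimal version, not the exact code from the screencast; the module paths follow current scikit-learn, which may differ from the video's older imports):

```python
# A minimal sketch of the Episode 4 pipeline, using the library
# classifier that the rest of this episode replaces.
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.5)

clf = KNeighborsClassifier()
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
print(accuracy_score(y_test, predictions))
```

Running this on iris typically prints an accuracy over 90%, which matches what we see in the screencast.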

I’ll pop in and out of the screencast to explain things

as we go along.

To start, let’s run our pipeline to remind ourselves

what the accuracy was.

As you can see, it’s over 90%.

And that’s the goal for the classifier

we’ll write ourselves.

Now let’s comment out that import.

Right off the bat, this breaks our code.

So the first thing we need to do is fix our pipeline.

And to do that, we’ll implement a class for our classifier.

I’ll call it ScrappyKNN.
And by scrappy, I mean bare bones.

Just enough to get it working.

Next, I’ll change our pipeline to use it.

Now let’s see what methods we need to implement.

Looking at the interface for a classifier,

we see there are two we care about: fit,

which does the training, and predict,

which does the prediction.

First we’ll declare our fit method.

Remember this takes the features and labels for the training set

as input, so we’ll add parameters for those.

Now let’s move on to our predict method.

As input, this receives the features for our testing data.

And as output, it returns predictions for the labels.

Our first goal is to get the pipeline working,

and to understand what these methods do.

So before we write our real classifier,

we’ll write a random classifier.

And by random, I mean we’ll just guess the label.

To start, we’ll add some code to the fit and predict methods.

In fit, I’ll store the training data in this class.

You can think of this as just memorizing it.

And you’ll see why we do that later on.
Inside the predict method, remember

that we’ll need to return a list of predictions.

That’s because the parameter, X_test, is actually

a 2D array, or list of lists.

Each row contains the features for one testing example.

To make a prediction for each row,

I’ll just randomly pick a label from the training data

and append that to our predictions.

At this point, our pipeline is working again.

So let’s run it and see how well it does.

Recall there are three different types of flowers

in the iris dataset, so accuracy should be about 33%.
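The random baseline just described might look like this (a minimal sketch, not the exact code from the screencast):

```python
# A scrappy stand-in classifier: fit just memorizes the training
# data, and predict guesses a random training label per test row.
import random

class ScrappyKNN:
    def fit(self, X_train, y_train):
        # Memorize the training data for later use.
        self.X_train = X_train
        self.y_train = y_train

    def predict(self, X_test):
        predictions = []
        for row in X_test:
            # Random guess: any label seen during training.
            label = random.choice(self.y_train)
            predictions.append(label)
        return predictions
```

With three balanced classes in iris, random guessing lands near the 33% mentioned above.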

Now we know the interface for a classifier.

But when we started this exercise,

our accuracy was above 90%.

So let’s see if we can do better.

To do that, we’ll implement our classifier,

which is based on k-Nearest Neighbors.

Here’s the intuition for how that algorithm works,

recapping from the last episode.

Imagine the dots we see on the screen

are the training data we memorized in the fit method,

say for a toy dataset.

Now imagine we’re asked to make a prediction for this testing

point that I’ll draw here in gray.

How can we do that?

Well in a nearest neighbor classifier,

it works exactly like it sounds.

We’ll find the training point that’s

closest to the testing point.

This point is the nearest neighbor.

Then we’ll predict that the testing

point has the same label.

For example, we’ll guess that this testing dot is green,

because that’s the color of its nearest neighbor.

As another example, if we had a testing dot over here,

we’d guess that it’s red.

Imagine that this dot is equidistant to the nearest

green dot and the nearest red one.

There’s a tie, so how could we classify it?

Well one way is we could randomly break the tie.

But there’s another way, and that’s where k comes in.
K is the number of neighbors we consider

when making our prediction.

If k was 1, we’d just look at the closest training point.

But if k was 3, we’d look at the three closest.

In this case, two of those are green and one is red.

To predict, we could vote and predict the majority class.

Now there’s more detail to this algorithm,

but that’s enough to get us started.
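The screencast goes on to hard-code k to 1, but the majority vote just described could be sketched like this (a hypothetical extension, not code from the video):

```python
# Majority vote over the labels of the k nearest neighbors.
from collections import Counter

def majority_vote(neighbor_labels):
    # Counter.most_common(1) returns the (label, count) pair
    # with the highest count; we keep just the label.
    return Counter(neighbor_labels).most_common(1)[0][0]
```

For the example above, voting over two green neighbors and one red one predicts green.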

To code this up, first we’ll need a way

to find the nearest neighbor.

And to do that, we’ll measure the straight line

distance between two points, just like you do with a ruler.

There’s a formula for that called the Euclidean Distance,

and here’s what the formula looks like.

It measures the distance between two points,

and it works a bit like the Pythagorean Theorem.
A squared plus B squared equals C squared.

You can think of this term as A, or the difference

between the first two features.

Likewise, you can think of this term as B,

or the difference between the second pair of features.

And the distance we compute is the length of the hypotenuse.

Now here’s something cool.

Right now we’re computing distance

in two-dimensional space, because we have just two

features in our toy dataset.

But what if we had three features or three dimensions?

Well then we’d be in a cube.

We can still visualize how to measure distance

in the space with a ruler.

But what if we had four features or four dimensions,

like we do in iris?

Well, now we’re in a hypercube, and we

can’t visualize this very easily.

The good news is the Euclidean Distance

works the same way regardless of the number of dimensions.

With more features, we can just add more terms to the equation.

Now let’s code up Euclidean distance.

There are plenty of ways to do that,

but we’ll use a library called scipy that’s

already installed with Anaconda.

Here, A and B are lists of numeric features.

Say A is a point from our training data,
and B is a point from our testing data.

This function returns the distance between them.
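The helper just described might look like this (a sketch following the screencast; scipy ships with Anaconda):

```python
# Straight-line (Euclidean) distance between two points,
# for any number of features per point.
from scipy.spatial import distance

def euc(a, b):
    return distance.euclidean(a, b)
```

The same function works unchanged in two, three, or the four dimensions of iris.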

That’s all the math we need, so now

let’s take a look at the algorithm for a classifier.

To make a prediction for a test point,

we’ll calculate the distance to all the training points.

Then we’ll predict the testing point has the same label

as the closest one.

I’ll delete the random prediction we made,

and replace it with a method that finds the closest training

point to the test point.

For this video, I’ll hard-code k to 1,

so we’re writing a nearest neighbor classifier.

The k variable won’t appear in our code,

since we’ll always just find the closest point.

Inside this method, we’ll loop over all the training points

and keep track of the closest one so far.

Remember that we memorized the training data in our fit
function, and X_train contains the features.

To start, I’ll calculate the distance from the test point

to the first training point.

I’ll use this variable to keep track of the shortest

distance we’ve found so far.

And I’ll use this variable to keep

track of the index of the training point that’s closest.

We’ll need this later to retrieve its label.

Now we’ll iterate over all the other training points.

And every time we find a closer one,

we’ll update our variables.

Finally, we’ll use the index to return

the label for the closest training example.
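Putting the steps above together, the nearest-neighbor version might look like this (a sketch with illustrative names, not the exact code from the screencast):

```python
# Nearest neighbor classifier with k hard-coded to 1.
from scipy.spatial import distance

class ScrappyKNN:
    def fit(self, X_train, y_train):
        # Memorize the training data.
        self.X_train = X_train
        self.y_train = y_train

    def predict(self, X_test):
        predictions = []
        for row in X_test:
            label = self.closest(row)
            predictions.append(label)
        return predictions

    def closest(self, row):
        # Start with the distance to the first training point.
        best_dist = distance.euclidean(row, self.X_train[0])
        best_index = 0
        # Update whenever we find a closer training point.
        for i in range(1, len(self.X_train)):
            dist = distance.euclidean(row, self.X_train[i])
            if dist < best_dist:
                best_dist = dist
                best_index = i
        # Return the label of the closest training example.
        return self.y_train[best_index]
```

Swapping this class into the pipeline in place of the library classifier is all that's needed; fit and predict have the same interface.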

At this point, we have a working nearest neighbor classifier,

so let’s run it and see what the accuracy is.

As you can see, it’s over 90%.

And we did it.

When you run this on your own, the accuracy

might be a bit different because of randomness in the train test

split.

Now if you can code this up and understand it,

that’s a big accomplishment because it

means you can write a simple classifier from scratch.

Now, there are a number of pros and cons

to this algorithm, many of which you can find online.

The basic pro is that it’s relatively easy to understand,

and works reasonably well for some problems.

And the basic cons are that it’s slow,

because it has to iterate over every training point

to make a prediction.

And importantly, as we saw in Episode 3,

some features are more informative than others.

But there’s not an easy way to represent

that in k-Nearest Neighbors.

In the long run, we want a classifier

that learns more complex relationships between features

and the label we’re trying to predict.

A decision tree is a good example of that.

And a neural network like we saw in TensorFlow Playground

is even better.
OK, hope this was helpful.

Thanks as always for watching.

And I’ll see you guys next time.
[MUSIC PLAYING]

【MR】拾月