大家好！

What is going on everybody!

欢迎来到 Python机器学习系列视频的新一节

Welcome to a new section on the machine learning with Python tutorial series.

这一节我们来讲讲分类问题

This section we’re going to be talking about classification

以及一些处理分类问题的方法

and a handful of methods for classification.

那我们第一个要深入讲解的分类算法

So as we dive in, the first classification algorithm

就是 K近邻 算法

that we’re going to be covering is K nearest neighbors.

不过所有分类算法 最后都会归结到同一件事情上

But really all classification algorithms boil down to the same thing.

回忆一下我们讲过的线性回归

So if you recall with linear regression

它的目标是

the objective was to

找到一个最能拟合我们数据的模型

create a model that best fits our data and

而分类问题的目标则是找到一个模型

with classification the general purpose is to create a model that best

这个模型能将数据有效分隔开来

divides or separates our data.

接下来就让我们来先从一个简单的例子开始

So let’s go ahead and show a quick example.

比如说有这么一张图

So let’s say you’ve got a graph

图上有这么几个点

and then on that graph you’ve got some data points like these.

我们的目标就是找出

And the objective is to figure out

怎样将这些点分成不同的部分

how to separate these into obvious groups.

你只要打眼一看

And even just looking at this intuitively

就可以看出来这些点可以分成2个部分

you could see that there are two groups here.

一部分是这些 一部分是这些 对吧？

One group is this group and one group is this group, right?

非常明显

You just know that’s the case.

我们刚刚做的其实就是聚类 是吧？

So what we just did just now is actually clustering, right?

当然只是脑子里过了一遍 看到这些点后

Like with our mind there…when we were just looking at this

我们发现可以分成2个部分

and we decided that these were two groups.

我们确实就是做了次聚类

We actually did clustering.

不过分类比我们刚才做的事情还要简单一些

Classification is actually even more simple than what we just did here.

分类要做的就是这么几点

So what classification is going to do is the following.

首先你有一个数据集 看起来像这样

So with classification you’re going to have a data set that looks more like this

一个分类用＋表示 一个分类用－表示

where you’ve got a group that you know are pluses and a group that you know are minuses.

目标就是

and the objective is

找到一个模型

to create some sort of model that

可以同时拟合这两个分类 对吧？

fits both of these groups, right?

可以恰当地分隔这些点

that properly divides them.

差不多就是一个能识别出＋数据

So almost like some sort of model that defines the pluses

同时也能识别出－数据的模型

and some sort of model that defines the minuses.

那如果要是有某个点你不知道分类会怎样？

So what if you had an unknown dot somewhere, right?

比如这里有这么一个点

Like what if you have a data point that’s like here.

直接观察这个点

Looking at that just visually

你觉得应该把它分给哪个分类？

which group would you assign that to?

你会分给蓝色的－还是绿色的＋？

Would you put it with the blue minuses or the green pluses?

很可能会分给绿色的＋

Most likely you would put it with the green pluses.

如果我要问问为什么

And then I ask you why

你会这么分呢？

why would you have done that, right?

你为什么觉得它属于这个分类？

What made you think that was the case?

思考一下 那如果我把点放在这里呢？

So think about that and then what if we had a point over here.

这里你会怎么分？

Where would you assign that point?

这里大概会分到蓝色－分类

Well in this case most likely the blue minuses.

那还是要问为什么要这么分？

And again think about why might you choose that?

最后如果点

And then finally what if we had a point maybe

在几乎中间的这里

right here in the middle almost.

现在你会怎么分呢？

Now how would you classify that?

实际上分类的结果

It turns out the way that you would classify that

和你用的分类算法密切相关

might actually vary depending on the algorithm that you’re using.

大多数情况下如果

But in most cases I think that if you have a dot like

有这么一个点

this one,

那我们应该基于它和其他点的接近程度

You’re going to classify that based on proximity

来决定它的分类

to the other points.

我觉得大多数人

I think most people

看到这么一张图都会基于点之间的距离

in looking at a graph like this would go based on proximity

去分类 所以

more than anything else. So

你们先自己想想

you’re thinking to yourself.

这个点距离这个点最近 还有这个和这个

Well this point is closest to this point for sure, this point and this point.

想到这你会怎么做

And what you’re doing when you think of that.

因为这三个点比所有蓝色－点都更加接近这个新点

Because those three points are much closer than the closest blue minus

最近的蓝色－点得到这里 对吧？有点远

which is all the way here, right? That’s pretty far.

那这其实是在做什么呢？

So what are you doing when you do that?

其实这就是在做最近邻分类

Well, it turns out what you’ve just done is nearest neighbors.

通过最近邻分类

So with nearest neighbors

我们基本上是在找

You are just checking to see basically

离这个新点最近的点是哪些

who are the closest points to this new point on the data.

这个例子中我们有一个2维数据集

In this case we’ve got two dimensional data

但数据集也可以是3维 10维甚至更高

but you can have 3 dimensional, 10 dimensional and so on.

这里看起来比较简单

So obviously visually for you looking at this is super simple.

但如果有比如10维的数据或者1000维的数据会怎样

but what if you had like 10 dimensions or a thousand dimensions.

那你可不能用眼睛看出答案了

Suddenly you can’t do this by eye anymore.

但这就是机器擅长的地方了

that’s where the machine begins to shine.

好的这就是最近邻分类

So that’s nearest neighbors.
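
As a rough illustration of the nearest-neighbor idea described above, here is a minimal Python sketch. The data points, the labels, and the function name `nearest_neighbor` are made up for this example and do not come from the video; it simply assigns a new point the label of its single closest training point.

```python
import math

# Hypothetical 2-D training points (not from the video): (features, label).
training = [
    ((1.0, 2.0), "+"),
    ((2.0, 3.0), "+"),
    ((3.0, 1.0), "+"),
    ((6.0, 5.0), "-"),
    ((7.0, 7.0), "-"),
    ((8.0, 6.0), "-"),
]


def nearest_neighbor(training, new_point):
    """Return the label of the single training point closest to new_point."""
    closest = min(training, key=lambda item: math.dist(item[0], new_point))
    return closest[1]


print(nearest_neighbor(training, (2.0, 2.0)))  # "+": it sits among the pluses
print(nearest_neighbor(training, (7.0, 6.0)))  # "-": it sits among the minuses
```

`math.dist` (Python 3.8+) works for any number of dimensions as long as both points have the same length, so the same sketch would run on 10-dimensional or 1000-dimensional points.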

但人们最常用的最近邻算法是

But what most people actually use is

K近邻算法

K nearest neighbors

那到底什么是 K近邻 算法呢？

So what the heck is K nearest neighbors?

好吧你可以先开始想想

Well it turns out that if you just try to start thinking about

这个算法到底是怎么起作用的呢？

‘Okay, how is this process actually going to work?’

真的需要去对比每一个点

Do you actually need to compare it to every single point

来得出答案吗？

in a data set to get your answer.

大多数情况下你并不需要这么干

And most likely you don’t need to do that

对于 K近邻

but so with K nearest neighbors

我就在在这里加个 K字

we’ll just add a K, I suppose, here.

合起来写作 K近邻

But that’s all together right K nearest neighbors.

你得先确定 K 的数量是多少

You decide what the number of K is going to be.

比如我们假设 K 等于2

So let’s say K was equal to 2.

接下来要做的事就是找到离新点最近的两个点

What you would do is you would find the two closest neighbors to the new point.

通过观察我觉得就是这个点

And I’m going to say visually that is this one.

不过我不太确定

And honestly I’m not really sure

剩下两个点哪个离得更近 我猜是

which one is closer of these two. I would probably guess

可能是这一个

maybe this one.

它的这条橘色的线更短一些

My orange line is definitely shorter

但这些线也没有画出全部距离

but it doesn’t quite go the whole distance.

比如说它就是离得第二近的点吧

But let’s just say it was closest to that second one there.

所以 K2 就是你得找离得最近的

So with K equal to 2 you’ve got two points

这么2个点 所以

that are the closest. So we’ve got

通过这2个点基本上就可以确定它属于＋了

basically two points saying: “Yep, this is a plus.”

不过如果有这么一个点出现在这里

But what if you had a point that was maybe here.

因为是 K2 所以还是得找出离得最近的2个点

You might have a case where what are the two closest points by K2.

那找出来可能就是这个点还有这个点 对吧？

Well you would have probably this point here and this point here, right?

这两个点是离得最近的

Those are the two closest points.

当 K…因为最近邻的点

And when K…you know, when the nearest neighbors

就是通过数量来确定新点的分类的

go to basically place a vote on what the identity of this point is,

这里我们就有了平局的现象

We have a split vote, okay?

所以一般使用 K近邻算法时

So in general when you do K nearest neighbors.

最好不要让 K 等于2或者任何偶数

You’re probably not going to want to have K equals 2 or any other even number.

可以把 K 设置为奇数 这里我们就设置为3

You’re going to want K equal to some odd number; in this case we’ll do 3.

如果是3会怎样呢？假如我们要多加一个点进行判断

So what if we did 3? What if we said ok we need one more point

比如我们就决定是这个点了 那最后的判定结果

and we would say: “Okay, it’s this one.” So then basically the vote would be

就会是 －,－和＋

negative negative and positive.

3票中的2票 也就是

That’s a two out of three. So we would say it’s

这个点最终的分类为－

the class is actually a negative class. That’s what we would end up going with here.

这大概就是 K近邻算法的原理了

And so that’s basically how K nearest neighbors works.
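
The voting process just described can be sketched in a few lines of Python. This is only an illustration under assumptions: the data set and the function name `k_nearest_neighbors` are hypothetical, and the new point is chosen so that the three nearest neighbors vote minus, plus, minus, as in the example above.

```python
import math
from collections import Counter

# Hypothetical training data: (features, label).
training = [
    ((1.0, 1.0), "+"),
    ((2.0, 2.0), "+"),
    ((3.0, 3.0), "+"),
    ((5.0, 4.0), "-"),
    ((6.0, 5.0), "-"),
    ((7.0, 7.0), "-"),
]


def k_nearest_neighbors(training, new_point, k=3):
    """Let the k closest training points vote on the label of new_point."""
    # Sort every training point by its distance to the new point, keep the k closest.
    neighbors = sorted(training, key=lambda item: math.dist(item[0], new_point))[:k]
    votes = Counter(label for _, label in neighbors)
    winner = votes.most_common(1)[0][0]
    return winner, votes


label, votes = k_nearest_neighbors(training, (4.5, 3.5), k=3)
print(label)        # "-"
print(dict(votes))  # two votes for "-", one vote for "+"
```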

非常简单的一个算法

It’s a super simple algorithm and

要注意的是这里我们只有两个分类

the other thing you have to think about too is in this case we had only two groups.

如果有三个分类怎么办？

But what if you had three groups?

K设置为3还合适吗？

Is K3 going to be a good idea?

不合适了 因为可能会出现三点属于不同分类的情况 那设置为4呢？

Turns out no. Because you could have a total split amongst all the groups. What about four?

也不行 因为会有平票情况

No. Because you could have a totally even vote.

所以如果有3个分类那就至少得让

So if you had 3 groups you need at least 5

K 等于5

total you know K equals 5

以尽量避免平票事件发生（2-2-1 的平票仍有可能）

to make a split vote less likely (a 2-2-1 tie can still happen).

当然你也可以写段代码 在出现平票结果时随机选一个分类

You can also code something in to just randomly pick if there is a tie.
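
To make the split-vote cases concrete, here is a small sketch of a vote counter that falls back to a random pick when there is a tie. The helper name `vote` and the group names are hypothetical. The last tie case shows that with three groups even K = 5 can split 2-2-1, so a tie-breaking rule is still worth having.

```python
import random
from collections import Counter


def vote(labels, rng=random):
    """Return (winner, tied_labels); ties are broken by a random pick."""
    counts = Counter(labels)
    top = max(counts.values())
    tied = sorted(label for label, count in counts.items() if count == top)
    return rng.choice(tied), tied


# The second element lists the labels that share the highest vote count.
print(vote(["+", "-"])[1])                 # K = 2, two groups: ['+', '-']
print(vote(["a", "b", "c"])[1])            # K = 3, three groups: ['a', 'b', 'c']
print(vote(["a", "a", "b", "b"])[1])       # K = 4, three groups: ['a', 'b']
print(vote(["a", "a", "b", "b", "c"])[1])  # K = 5, three groups: ['a', 'b']
print(vote(["-", "-", "+"]))               # clear winner: ('-', ['-'])
```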

K近邻算法的一个优点就是

What’s neat about K nearest neighbors though

除了能得到你所需要的

is not only can you get an actual classification

数据点的分类结果

for the data point that you pick.

你还能得到我们之前所说过的

You can get what we were talking about before both

模型的准确度

accuracy in the model

这样你就可以通过训练和测试模型来改进整体准确度

so that you can train and test the model for the model’s overall accuracy.

同时每个点都会有一个置信度

But each point can also have a degree of confidence.

比如设 K 等于3

So for example what if you get…you’re using K equals 3.

然后你得到了一个分配结果是－,－和＋

And you get a vote that is like a negative, a negative and a positive.

就是3票中的2票，对吧？

Well that’s a two out of three, right?

也就是对于分类结果有

So that’s, you know, a 66% confidence

66%的几率是可靠的

in the score or in the classification of that data.

除了算出这一点的置信度是66%外

But not only is the confidence 66%,

对于训练的 K近邻模型整体

but you can also have the entire K nearest neighbors model that you’ve trained.

你可以算出其准确度 不过这里更应该叫做置信度

You can have that accuracy. So this would actually be more like confidence.

这也就是为什么

That’s why I wanted to change

当我们在讲线性回归的时候 我不用置信度而用准确度这个词的原因

when we were doing linear regression why I didn’t want to call it confidence

我改成准确度是因为

I wanted to call it accuracy because

K近邻算法中的置信度

confidence with K nearest neighbors is something you can actually value

和准确度是完全不同的概念

it can indeed be very different from the entire model’s actual accuracy.
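
The difference between per-point confidence and whole-model accuracy can be sketched as follows. Everything here is an assumption made for illustration: the data, the test labels, and the function names `predict_with_confidence` and `accuracy`. Confidence is taken as the winning share of the K votes, and accuracy as the fraction of held-out test points that are predicted correctly.

```python
import math
from collections import Counter

# Hypothetical training data: (features, label).
training = [
    ((1.0, 1.0), "+"),
    ((2.0, 2.0), "+"),
    ((3.0, 3.0), "+"),
    ((5.0, 4.0), "-"),
    ((6.0, 5.0), "-"),
    ((7.0, 7.0), "-"),
]

# Hypothetical held-out test points with their true labels.
test = [
    ((1.5, 1.5), "+"),
    ((6.5, 6.0), "-"),
    ((4.0, 3.0), "+"),
    ((4.5, 3.5), "+"),  # the model will get this one wrong
]


def predict_with_confidence(training, new_point, k=3):
    """Return (label, confidence), where confidence is the winning vote share."""
    neighbors = sorted(training, key=lambda item: math.dist(item[0], new_point))[:k]
    votes = Counter(label for _, label in neighbors)
    label, count = votes.most_common(1)[0]
    return label, count / k


def accuracy(training, test, k=3):
    """Fraction of test points whose predicted label matches the true label."""
    correct = sum(
        predict_with_confidence(training, point, k)[0] == true_label
        for point, true_label in test
    )
    return correct / len(test)


print(predict_with_confidence(training, (4.5, 3.5)))  # "-" with 2 of 3 votes
print(accuracy(training, test))                       # 0.75 (3 of 4 correct)
```

Here a single prediction carries a confidence of two out of three, while the model as a whole scores 75% on the test points, so the two numbers answer different questions.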

所以这大概就是 K近邻 算法了

So that’s kind of cool with K nearest neighbors.

那再来说说K近邻的缺点

Now what are some downfalls of K nearest neighbors? Well as we’re going to see

在我们要找出最近邻的时候

in order to find out who are the closest neighbors

我们用来衡量距离用的是

what we’re using to measure that distance is just simple

欧几里得距离

Euclidean distance is what we’re going to be using here.

为了找出欧几里得距离

And to do that, to find the Euclidean distance

最简单的方法就是算出任意一点和

well, the simplest method is actually to measure the distance between any given point

其他点间的距离 接下来你就可以找出离得最近的3个点

and all of the other points. And then you just say: “Okay, what are the closest 3?”

或者 K个点

or whatever K is.

如果你处理过大数据的话

And as you might guess on a huge data set

就会知道计算这些非常得繁琐

that’s a very very long and tedious operation.

有几个方法可以帮你提高速度

There are a few things that you can do to kind of speed it up

不过即便提速再多

but no matter what you do to speed this up

数据一大 算法还是会运行很慢

you’re going to find that the larger the data set the worse this algorithm runs.

因为 K近邻 就是没有其它算法有效率

Because it’s just not as efficient as other algorithms.
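
A brute-force version of this search might look like the sketch below. The function names and the synthetic data are assumptions made for this example. It writes out the Euclidean distance, the square root of the summed squared differences between coordinates, and counts how many distances have to be computed for a single prediction, which grows in step with the size of the data set.

```python
import math


def euclidean_distance(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def brute_force_neighbors(points, new_point, k):
    """Return the k closest points and how many distances were computed."""
    distances = [(euclidean_distance(p, new_point), p) for p in points]
    distances.sort()
    return [p for _, p in distances[:k]], len(distances)


print(euclidean_distance((1, 3), (2, 5)))  # sqrt(1 + 4), about 2.236

# One prediction needs one distance per stored point.
for n in (10, 1000, 100000):
    points = [(float(i), float(i % 7)) for i in range(n)]
    _, computed = brute_force_neighbors(points, (3.0, 3.0), k=3)
    print(n, "points ->", computed, "distance computations")
```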

所以我们讲完这个算法后也许会讲讲支持向量机

And so once we cover this and get into maybe the support vector machine,

你会发现在处理实际分类问题时

You’ll see that the support vector machine is much more efficient

支持向量机可有效率多了

when it comes to actual classification.

对于 K近邻算法

Also with K nearest neighbors

基本上…实际上并没有训练完成的时候

you’re basically…There’s never really a point where you’re totally training anything.

训练和测试基本上是同时进行的

Like, the training and testing are basically in the same spot,

可以看做是一回事 因为测试的时候

basically the same thing, because when you go to actually test

你还是得去和所有点进行对比 所以训练 K近邻算法

You’re comparing it to all the points. There’s really no good way

基本没有其他捷径可走

to train a simple K nearest neighbors algorithm.

还有一些事儿我们可以做

There are also some things that we can do down the line

不过应该不会继续往下了

but we probably won’t be getting into that ourselves.

只要记住大数据用 K近邻效果并不好

But anyway just keep in mind that the scaling is not so good

在我们之后讲到

and we’ll point out exactly why

支持向量机算法的时候就会知道为什么是这样

and then when we get into support vector machines you’ll see

同时也会知道为什么支持向量机比 K近邻有效率得多

why support vector machines scale so much better than K nearest neighbors.

不过我也不想把 K近邻算法贬得太低

That said I don’t mean to rag on K nearest neighbors too much.

它确实对于很多分类问题来说是个好算法

It’s actually a more than fine algorithm for many classification tasks.

即便要处理几个G的数据

So if you’re…even if you’re working up to maybe a gigabyte worth of data.

K近邻算法仍然可以运行得很快

K nearest neighbors can still be calculated quite fast.

而且它可以很容易地进行并发运算

And it can also be easily calculated in parallel

因为任何你想要预测的点都可以随时进行计算

since any point you’re trying to predict can be calculated.

不管其它点有没有被计算过

Regardless of the other points that you’re trying to calculate.
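
Because each prediction only reads the stored points and never depends on another prediction, the queries can be handed to a pool of workers. The sketch below uses `concurrent.futures.ThreadPoolExecutor` from the Python standard library; the data, the query points, and the function name `predict` are hypothetical. One caveat: in standard CPython, threads running pure-Python arithmetic are limited by the global interpreter lock, so for a real speedup you would likely swap in `ProcessPoolExecutor` or a vectorized library. The structure stays the same.

```python
import math
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import partial

# Hypothetical training data: (features, label).
training = [
    ((1.0, 1.0), "+"),
    ((2.0, 2.0), "+"),
    ((3.0, 3.0), "+"),
    ((5.0, 4.0), "-"),
    ((6.0, 5.0), "-"),
    ((7.0, 7.0), "-"),
]


def predict(training, new_point, k=3):
    """Classify one point by majority vote of its k nearest neighbors."""
    neighbors = sorted(training, key=lambda item: math.dist(item[0], new_point))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Each query is independent of the others, so they can be mapped over a pool.
queries = [(1.5, 1.5), (6.5, 6.0), (4.5, 3.5)]

with ThreadPoolExecutor(max_workers=4) as pool:
    labels = list(pool.map(partial(predict, training), queries))

print(labels)  # ['+', '-', '-']
```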

也就是说你可以用多线程来运行它

So you can actually thread it

而且效果相对来说也还不错 不过

and still scale relatively well. But if you’re working with

要是处理十几亿的数据的话 结果就不太好了

you know billions of data points. It’s not going to do very well. So anyways

好的 这就是 K近邻算法的原理和演示

that is the theory and intuition behind K nearest neighbors

接下来我们要对 K近邻算法进行实战

and now we’re going to actually be diving into a real world example of K nearest neighbors.

然后我们会自己写一个 K近邻算法出来 敬请期待

And then after that we’ll actually write our own K nearest neighbors algorithm. So stay tuned for that.

如果有任何问题就在下方评论吧

If you have any questions or comments leave them below.

感谢各位的收看 支持和订阅 我们下次见

Otherwise as always thanks for watching, thanks for all the support and subscriptions and until next time.

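##### Translation credits

Video overview

This section introduces the K nearest neighbors algorithm, one of the classification algorithms in machine learning, and walks through its principle and process.

Transcription

[B]刀子

Translation

[B]刀子

Reviewer

审核团1024

Video source

https://www.youtube.com/watch?v=44jq6ano5n0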
