Classification w/ K Nearest Neighbors Intro - Practical Machine Learning Tutorial with Python p.13

What is going on everybody! Welcome to a new section of the Machine Learning with Python tutorial series. In this section we're going to be talking about classification and a handful of methods for classification. The first classification algorithm we're going to cover is K nearest neighbors, but really, all classification algorithms boil down to the same thing.
If you recall, with linear regression the objective was to create a model that best fits our data. With classification, the general purpose is to create a model that best divides or separates our data.
So let's show a quick example. Say you've got a graph, and on that graph you've got some data points like these. The objective is to figure out how to separate them into obvious groups. Even just looking at this intuitively, you can see that there are two groups here: one group is this one and one group is this one, right? You just know that's the case. What we just did there is actually clustering: with our minds, we looked at the points and decided that they formed two groups.
Classification is actually even simpler than what we just did. With classification you start with a data set that looks more like this, where you've got a group that you know are pluses and a group that you know are minuses. The objective is to create some sort of model that fits both of these groups and properly divides them: almost like a model that defines the pluses and a model that defines the minuses.
So what if you had an unknown dot somewhere? Say you have a data point like this one here. Looking at it visually, which group would you assign it to: the blue minuses or the green pluses? Most likely you would put it with the green pluses. Now ask yourself why. What made you think that was the case? Think about that, and then what if we had a point over here? Where would you assign that one? In this case, most likely with the blue minuses. Again, think about why you would choose that. And finally, what if we had a point almost right in the middle? Now how would you classify that?
It turns out the way you would classify that point might vary depending on the algorithm you're using, but in most cases, with a dot like this one, you're going to classify it based on proximity to the other points. I think most people looking at a graph like this would go by proximity more than anything else. You're thinking to yourself: well, this point is closest to this point for sure, and to this point and this point, and those three points are much closer than the closest blue minus, which is all the way over here. That's pretty far. So what are you doing when you do that? It turns out what you've just done is nearest neighbors.
With nearest neighbors, you're basically just checking which points in the data set are closest to the new point. In this case we've got two-dimensional data, but you can have 3-dimensional data, 10-dimensional data and so on. Doing this visually is obviously super simple for you, but what if you had 10 dimensions, or a thousand? Suddenly you can't do it by eye anymore, and that's where the machine begins to shine. So that's nearest neighbors.
But what most people actually use is K nearest neighbors. So what the heck is K nearest neighbors? Well, if you start thinking about how this process is actually going to work, do you really need to compare the new point to every single point in the data set to get your answer? Most likely you don't. With K nearest neighbors (we'll just add a K here, so all together it's K nearest neighbors), you decide what the number K is going to be. Let's say K was equal to 2. What you would do is find the two closest neighbors to the new point. Visually, I'd say one of them is this one. Honestly I'm not really sure which of the other two is closer; I'd probably guess maybe this one. My orange line is definitely shorter, but it doesn't quite go the whole distance. Let's just say that second one is the other closest point. So with K = 2 you've got the two closest points, and both of them are basically saying: "Yep, this is a plus."
But what if you had a point that was maybe here? With K = 2, the two closest points might be this point here and this point here, right? Those are the two closest points. And when the nearest neighbors go to basically place a vote on the identity of this point, we have a split vote. So in general, when you do K nearest neighbors with two classes, you're not going to want K = 2 or any other even number; you'll want K equal to some odd number. In this case we'll do 3. So what if K was 3? We'd need one more point, and let's say it's this one. Then the vote would basically be negative, negative and positive. That's two out of three, so we would say the class is actually the negative class. That's what we would end up going with here. And so that's basically how K nearest neighbors works.
It's a super simple algorithm. The other thing you have to think about is that in this case we had only two groups. But what if you had three groups? Is K = 3 going to be a good idea? It turns out no, because you could have a total split among all the groups. What about four? No, because you could have a totally even vote. So if you had 3 groups, you need at least K = 5 to avoid any sort of split vote. You can also code something in to just randomly pick a class if there is a tie; see the sketch below.
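
The video doesn't say how that random pick would work, so this is just one reasonable way to do it:

```python
import random
from collections import Counter

def vote_with_tiebreak(votes):
    counts = Counter(votes)
    top = max(counts.values())
    # Every class tied for the most votes.
    winners = [group for group, count in counts.items() if count == top]
    # A split vote gets settled by a random pick among the tied classes.
    return random.choice(winners)

print(vote_with_tiebreak(['-', '-', '+']))   # always '-'
print(vote_with_tiebreak(['-', '+', 'o']))   # random: three-way split
```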
What's neat about K nearest neighbors is that not only can you get an actual classification for the point you pick, you can also get what we were talking about before: accuracy, so you can train and test the model for its overall accuracy. But each point can also have a degree of confidence. For example, say you're using K = 3 and you get a vote that is negative, negative and positive. That's two out of three, right? So that's 66% confidence in the classification of that data point. And not only is the confidence of that one point 66%, you also have the accuracy of the entire K nearest neighbors model that you've trained. This is why, back when we were doing linear regression, I didn't want to call it confidence; I wanted to call it accuracy. With K nearest neighbors, confidence is something you can actually measure per point, and it can indeed be very different from the entire model's accuracy.
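
Per-point confidence falls straight out of the vote. A minimal sketch, using the K = 3 vote from the example above:

```python
from collections import Counter

votes = ['-', '-', '+']                       # the K = 3 vote from the example
winner, count = Counter(votes).most_common(1)[0]
confidence = count / len(votes)
print(winner, confidence)                     # -> '-' 0.666...
```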
So that's kind of cool about K nearest neighbors. Now, what are some downfalls? Well, as we're going to see, in order to find out which neighbors are closest, what we're using to measure that distance is simple Euclidean distance. And the most simple method to find it is to measure the distance between the new point and all of the other points, and then just ask: "Okay, what are the closest 3?", or whatever K is. As you might guess, on a huge data set that's a very long and tedious operation. There are a few things you can do to speed it up, but no matter what you do, you're going to find that the larger the data set, the worse this algorithm runs.
K nearest neighbors just isn't as efficient as other algorithms. Once we cover this and get into, say, the support vector machine, you'll see that the support vector machine is much more efficient when it comes to actual classification. Also, with K nearest neighbors there's never really a point where you're done training anything. The training and testing are basically the same step, because when you go to actually test, you're still comparing the new point against all the points. There's really no good way to pre-train a simple K nearest neighbors algorithm.
There are also some things we can do down the line, but we probably won't get into them ourselves. Anyway, just keep in mind that the scaling is not so good; we'll point out exactly why, and when we get into support vector machines you'll see why they scale so much better than K nearest neighbors. That said, I don't mean to bash K nearest neighbors too much. It's actually a more than fine algorithm for many classification tasks. Even if you're working with up to maybe a gigabyte of data, K nearest neighbors can still be calculated quite fast. It can also easily be calculated in parallel, since any point you're trying to predict can be computed regardless of the other points you're trying to predict, as sketched below. So you can thread it and still scale relatively well. But if you're working with billions of data points, it's not going to do very well.
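
That independence is easy to sketch with Python's standard library, reusing the hypothetical k_nearest_neighbors function and data from the earlier sketch; the series itself doesn't show this:

```python
from concurrent.futures import ThreadPoolExecutor

# Each unknown point is classified independently of the others, so a
# worker pool can handle many predictions at once. (For CPU-bound pure
# Python you'd reach for processes instead; threads just show the idea.)
unknowns = [(6, 6), (2, 2), (5, 4)]
with ThreadPoolExecutor() as pool:
    predictions = list(pool.map(lambda p: k_nearest_neighbors(data, p), unknowns))
print(predictions)
```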
So anyways, that's the theory and intuition behind K nearest neighbors. Next we're going to dive into a real-world example of K nearest neighbors, and after that we'll actually write our own K nearest neighbors algorithm, so stay tuned for that.
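
As a preview of the off-the-shelf route, here is a minimal sketch of the typical scikit-learn usage with KNeighborsClassifier. The feature matrix and labels are placeholders; whether the upcoming example looks exactly like this is my assumption.

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Placeholder feature matrix X and labels y; the real data set comes later.
X = [[1, 2], [2, 1], [3, 3], [6, 5], [7, 7], [8, 6]]
y = ['-', '-', '-', '+', '+', '+']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))   # overall model accuracy on held-out data
```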
If you have any questions or comments, leave them below. Otherwise, as always, thanks for watching, thanks for all the support and subscriptions, and until next time.

Translation Info

Video overview: This episode introduces the K nearest neighbors classification algorithm in machine learning, explaining and demonstrating how it works.

Transcription: [B]刀子

Translation: [B]刀子

Review: 审核团1024

Video source: https://www.youtube.com/watch?v=44jq6ano5n0
