What is going on everybody!
Welcome to a new section on the machine learning with Python tutorial series.
This section we’re going to be talking about classification
and a handful of methods for classification.
So as we dive in, the first classification algorithm
that we’re going to be covering is K nearest neighbors.
But really all classification algorithms boil down to the same thing.
So if you recall, with linear regression
the objective was to create a model that best fits our data,
and with classification the general purpose is to create a model that best
divides or separates our data.
So let’s go ahead and show a quick example.
So let’s say you’ve got a graph
and then on that graph you’ve got some data points like these.
And the objective is to figure out
how to separate these into obvious groups.
And even just looking at this intuitively
you could see that there are two groups here.
One group is this group and one group is this group, right?
You just know that’s the case.
So what we just did there is actually clustering, right?
In our minds, when we were looking at this
and decided that these were two groups,
we actually did clustering.
Classification is actually even simpler than what we just did here.
So what classification is going to do is the following.
So with classification you’re going to have a data set that looks more like this,
where you’ve got a group that you know are pluses and a group that you know are minuses.
And the objective is to create some sort of model that
fits both of these groups, right? One that properly divides them.
So almost like some sort of model that defines the pluses,
and some sort of model that defines the minuses.
So what if you had an unknown dot somewhere, right?
Like what if you have a data point that’s like here.
Looking at that just visually
which group would you assign that to?
Would you put it with the blue minuses or the green pluses?
Most likely you would put it with the green pluses.
And then if I asked you why,
why would you have done that, right?
What made you think that was the case?
So think about that and then what if we had a point over here.
Where would you assign that point?
Well in this case most likely the blue minuses.
And again think about why might you choose that?
And then finally what if we had a point maybe
right here in the middle almost.
Now how would you classify that?
It turns out the way you would classify that
might actually vary depending on the algorithm that you’re using.
But in most cases, I think that if you have a dot like this one,
you’re going to classify it based on proximity
to the other points.
I think most people, looking at a graph like this,
would go based on proximity more than anything else.
So you’re thinking to yourself:
well, this point is closest to this point for sure, and to this point and this point.
Because those three points are much closer than the closest blue minus,
which is all the way over here, right? That’s pretty far.
So what are you doing when you do that?
Well, it turns out what you’ve just done is nearest neighbors.
With nearest neighbors,
you’re just checking to see, basically,
which are the closest points to this new point in the data.
In this case we’ve got two dimensional data
but you can have 3 dimensional, 10 dimensional and so on.
So obviously, looking at this visually is super simple for you.
But what if you had, like, 10 dimensions, or a thousand dimensions?
Suddenly you can’t do this by eye anymore,
and that’s where the machine begins to shine.
So that’s nearest neighbors.
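Just to make that concrete, here’s a minimal sketch in Python of what “finding the closest points” means. The data points and the new point are made up for illustration, and I’m using Euclidean distance, which is what we’ll be using later on anyway:

```python
# A minimal sketch of the "nearest neighbors" idea we just did by eye.
# The points here are made up for illustration.
import numpy as np

known_points = np.array([[1, 2], [2, 3], [3, 1],   # one group
                         [6, 5], [7, 7], [8, 6]])  # another group
new_point = np.array([5, 7])

# Euclidean distance from the new point to every known point.
# This works the same whether the data is 2D, 10D, or 1000D.
distances = np.linalg.norm(known_points - new_point, axis=1)
print(distances)

# Indices of the closest known points, nearest first.
print(np.argsort(distances))
```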
But what most people actually use is
K nearest neighbors.
So what the heck is K nearest neighbors?
Well, it turns out that if you start thinking about
‘Okay, how is this process actually going to work?’,
do you actually need to compare against every single point
in the data set to get your answer?
Most likely you don’t.
So with K nearest neighbors,
we just add a K.
Put that all together, right: K nearest neighbors.
You decide what the number K is going to be.
So let’s say K was equal to 2.
What you would do is find the two closest points to your new point.
And I’m going to say visually that is this one.
And honestly, I’m not really sure
which of these two is closer. I would probably guess
maybe this one.
My orange line is definitely shorter,
but it doesn’t quite go the whole distance.
But let’s just say it was closest to that second one there.
So with K equals 2 you’ve got the two points
that are the closest. So we’ve got,
basically, two points saying: “Yep, this is a plus.”
But what if you had a point that was maybe here.
You might have a case where, with K equals 2,
the two closest points are probably this point here and this point here, right?
Those are the two closest points.
And when the nearest neighbors
go to basically place a vote on what the identity of this point is,
we have a split vote, okay?
So in general, when you do K nearest neighbors,
you’re probably not going to want K equal to 2 or any other even number.
You’re going to want K equal to some odd number; in this case we’ll do 3.
So what if we did 3? What if we said, okay, we need one more point?
We would say: “Okay, it’s this one.” So then basically the vote would be
negative, negative, and positive.
That’s two out of three, so we would say
the class is actually the negative class. That’s what we would end up going with here.
And so that’s basically how K nearest neighbors works.
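Here’s a minimal sketch of that voting step in Python. The neighbor labels are made up to match the example, two negatives and one positive among the three closest points:

```python
# A minimal sketch of the K nearest neighbors voting step with K = 3.
# The neighbor labels are made up for this example.
from collections import Counter

k = 3
nearest_labels = ['-', '-', '+']   # classes of the 3 closest points

votes = Counter(nearest_labels)
winner, count = votes.most_common(1)[0]
print(winner)   # '-' wins the vote 2 to 1
```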
It’s a super simple algorithm, and
the other thing you have to think about too is that in this case we had only two groups.
But what if you had three groups?
Is K equals 3 going to be a good idea?
Turns out no, because you could have a total split amongst all the groups. What about four?
No, because you could have a totally even vote.
So if you had three groups, you’d want at least K equals 5
to avoid a total split,
though with three groups a tie like 2-2-1 is still possible.
You can also code something in to just randomly pick if there is a division, as sketched below.
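For example, here’s one hypothetical way you might code that random tie-break; the vote function and the labels are just made up for illustration:

```python
# A hypothetical sketch of a random tie-break for split votes.
import random
from collections import Counter

def vote(nearest_labels):
    # Count the votes, then gather every class tied for first place.
    votes = Counter(nearest_labels).most_common()
    top_count = votes[0][1]
    tied = [label for label, count in votes if count == top_count]
    # With a clear majority there is one winner; on a tie, pick randomly.
    return random.choice(tied)

print(vote(['-', '-', '+']))   # always '-': a clear 2-to-1 majority
print(vote(['-', '+']))        # split vote: '-' or '+' at random
```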
What’s neat about K nearest neighbors, though,
is that not only can you get an actual classification
for the data point that you pick,
you can also get what we were talking about before:
accuracy for the model,
so that you can train and test the model for its overall accuracy.
But each point can also have a degree of confidence.
So for example, say you’re using K equals 3,
and you get a vote that is a negative, a negative, and a positive.
Well, that’s two out of three, right?
So that’s, you know, 66% confidence
in the classification of that data point.
But not only is the confidence 66%:
you also have the entire K nearest neighbors model that you’ve trained,
and that model has an accuracy. So the vote share would actually be more like confidence.
That’s why, when we were doing linear regression,
I didn’t want to call it confidence;
I wanted to call it accuracy.
Confidence with K nearest neighbors is an actual value you can get,
and it can indeed be very different from the entire model’s actual accuracy.
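To make that distinction concrete, here’s a small sketch; all the labels and numbers in it are made up:

```python
# Sketch contrasting per-point confidence with overall model accuracy.
# All predictions and labels here are made up for illustration.

# Accuracy: fraction of correct predictions over a whole test set.
predictions = ['-', '+', '-', '-']
true_labels = ['-', '+', '+', '-']
accuracy = sum(p == t for p, t in zip(predictions, true_labels)) / len(true_labels)
print(accuracy)     # 0.75 -- the model's overall accuracy

# Confidence: the winning vote share for one single point.
confidence = 2 / 3  # e.g. a 2-out-of-3 vote with K = 3
print(confidence)   # ~0.67 -- confidence in that one classification
```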
So that’s kind of cool with K nearest neighbors.
Now what are some downfalls of K nearest neighbors? Well as we’re going to see
in order to find out who the closest neighbors are,
what we’re using to measure that distance is just simple
Euclidean distance.
And to find the Euclidean distance,
the simplest method is actually to measure the distance between any given point
and all of the other points. Then you just say: “Okay, what are the closest 3?”,
or whatever K is.
And as you might guess on a huge data set
that’s a very very long and tedious operation.
There are a few things that you can do to kind of speed it up
but no matter what you do to speed this up
you’re going to find that the larger the data set the worse this algorithm runs.
Because it’s just not as efficient as other algorithms.
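Here’s a rough, hypothetical sketch of that scaling problem: every prediction has to measure the distance to all n stored points, so the cost per query grows with the size of the data. The sizes are made up for demonstration and the timings will vary by machine:

```python
# Rough illustration of why brute-force K nearest neighbors slows down:
# each query computes a distance to every one of the n stored points.
import time
import numpy as np

for n in (10_000, 100_000, 1_000_000):
    data = np.random.rand(n, 10)           # n made-up points in 10 dimensions
    query = np.random.rand(10)             # one made-up point to classify
    start = time.perf_counter()
    nearest = np.argsort(np.linalg.norm(data - query, axis=1))[:3]
    print(n, time.perf_counter() - start)  # time grows roughly linearly in n
```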
And so once we cover this and get into, say, the support vector machine,
you’ll see that the support vector machine is much more efficient
when it comes to actual classification.
Also, with K nearest neighbors,
there’s never really a point where you’re actually training anything.
The training and the testing are basically the same thing,
because when you go to actually test,
you’re still comparing against all the points. There’s really no good way
to train a simple K nearest neighbors algorithm.
There are also some things we can do down the line,
but we probably won’t be getting into that ourselves.
Anyway, just keep in mind that the scaling is not so good,
and we’ll point out exactly why,
and then when we get into support vector machines you’ll see
why support vector machines scale so much better than K nearest neighbors.
That said, I don’t mean to rag on K nearest neighbors too much.
It’s actually a more than fine algorithm for many classification tasks.
So even if you’re working with up to maybe a gigabyte’s worth of data,
K nearest neighbors can still be calculated quite fast.
And it can also be easily calculated in parallel,
since any point you’re trying to predict can be calculated
regardless of the other points you’re trying to calculate.
So you can actually thread it
and still scale relatively well. But if you’re working with,
you know, billions of data points, it’s not going to do very well.
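As a quick sketch of that parallelism, scikit-learn’s KNeighborsClassifier takes an n_jobs parameter that spreads the neighbor searches across cores; the toy data here is made up:

```python
# A sketch of parallel K nearest neighbors using scikit-learn.
# The data is random and made up; the n_jobs usage is the point.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.random.rand(10_000, 10)             # made-up feature data
y = np.random.randint(0, 2, size=10_000)   # made-up binary labels

clf = KNeighborsClassifier(n_neighbors=5, n_jobs=-1)  # -1 = use all cores
clf.fit(X, y)                              # "training" just stores the points
print(clf.predict(np.random.rand(3, 10)))  # classify 3 new points
```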
So anyways,
that is the theory and intuition behind K nearest neighbors,
and now we’re going to actually be diving into a real world example of K nearest neighbors.
And then after that we’ll actually write our own K nearest neighbors algorithm. So stay tuned for that.
If you have any questions or comments leave them below.
Otherwise as always thanks for watching, thanks for all the support and subscriptions and until next time.