
#### Practical Machine Learning with Python #13: Classification w/ K Nearest Neighbors

Classification w/ K Nearest Neighbors Intro - Practical Machine Learning Tutorial with Python p.13

What is going on everybody!

Welcome to a new section on the machine learning with Python tutorial series.

In this section we’re going to be talking about classification

and a handful of classification methods.

So as we dive in, the first classification algorithm

that we’re going to be covering is K nearest neighbors.

But really all classification algorithms boil down to the same thing.

So if you recall with linear regression

the objective was to

create a model that best fits our data and

with classification the general purpose is to create a model that best

divides or separates our data.

So let’s go ahead and show a quick example.

So let’s say you’ve got a graph

and then on that graph you’ve got some data points like these.

And the objective is to figure out

how to separate these into obvious groups.

And even just looking at this intuitively

you could see that there are two groups here.

One group is this group and one group is this group, right?

You just know that’s the case.

So what we just did just now is actually clustering, right?

Like with our mind there…when we were just looking at this

and we decided that these were two groups.

We actually did clustering.

Classification is actually even simpler than what we just did here.

So what classification is going to do is the following.

So with classification you’re going to have a data set that looks more like this

where you’ve got a group that you know are pluses and a group that you know are minuses.

and the objective is

to create some sort of model that

fits both of these groups, right?

One that properly divides them.

So almost like some sort of model that defines the pluses

and some sort of model that defines the minuses.

So what if you had an unknown dot somewhere, right?

Like what if you have a data point that’s like here.

Looking at that just visually

which group would you assign that to?

Would you put it with the blue minuses or the green pluses?

Most likely you would put it with the green pluses.

And then I ask you why

why would you have done that, right?

What made you think that was the case?

So think about that and then what if we had a point over here.

Where would you assign that point?

Well in this case most likely the blue minuses.

And again think about why might you choose that?

And then finally what if we had a point maybe

right here in the middle almost.

Now how would you classify that?

It turns out the way that you would classify that

might actually vary depending on the algorithm that you’re using.

But in most cases, I think that if you have a dot

like this one,

You’re going to classify that based on proximity

to the other points.

I think most people

in looking at a graph like this would go based on proximity

more than anything else. So

you’re thinking to yourself.

Well, this point is closest to this point for sure, and this point, and this point.

And that’s what you’re doing when you think of that,

because those three points are much closer than the closest blue minus,

which is all the way over here, right? That’s pretty far.

So what are you doing when you do that?

Well, it turns out what you’ve just done is nearest neighbors.

So with nearest neighbors

you’re just checking to see, basically,

which are the closest points to this new point in the data.

In this case we’ve got two dimensional data

but you can have 3 dimensional, 10 dimensional and so on.

So obviously, visually, looking at this is super simple for you.

But what if you had like 10 dimensions, or a thousand dimensions?

Suddenly you can’t do this by eye anymore.

That’s where the machine begins to shine.

So that’s nearest neighbors.
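The proximity idea can be sketched in a few lines of Python. This is just an illustration with made-up coordinates, not the code from the series: we label some toy pluses and minuses and find the single closest labeled point to a new one by Euclidean distance.

```python
import math

# Toy 2D data with made-up coordinates: known pluses and known minuses.
pluses = [(6, 5), (7, 7), (8, 6)]
minuses = [(1, 2), (2, 1), (3, 3)]

# The unknown point we want to classify.
new_point = (7, 6)

# Pair every point with its class label.
labeled = [(p, '+') for p in pluses] + [(m, '-') for m in minuses]

# Nearest neighbor: the labeled point with the smallest Euclidean distance.
nearest_point, nearest_label = min(
    labeled, key=lambda item: math.dist(item[0], new_point))

print(nearest_label)  # the new point sits among the pluses
```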

But actually, what most people use is

K nearest neighbors.

So what the heck is K nearest neighbors?

Well, it turns out that if you just start thinking about

‘Okay, how is this process actually going to work?’

do you actually need to compare it to every single point?

Most likely you don’t need to do that.

So with K nearest neighbors

we just add a K, I suppose.

All together, that’s K nearest neighbors.

You decide what the number of K is going to be.

So let’s say K was equal to 2.

What you would do is find the two closest neighbors to the new point.

And I’m going to say visually that is this one.

And honestly I’m not really sure

which one is closer of these two. I would probably guess

maybe this one.

My orange line is definitely shorter

but it doesn’t quite go the whole distance.

But let’s just say it was closest to that second one there.

So with K equals 2 you’ve got the two points

that are the closest. So we’ve got

basically two points saying: “Yep, this is a plus.”

But what if you had a point that was maybe here.

You might have a case where… what are the two closest points with K equals 2?

Well you would have probably this point here and this point here, right?

Those are the two closest points.

And when the nearest neighbors

go to basically place a vote on what the identity of this point is,

we have a split vote, okay?

So in general, when you do K nearest neighbors,

you’re probably not going to want K equals 2, or any other even number.

You’re going to want K equal to some odd number; in this case we’ll do 3.

So what if we did 3? What if we said, okay, we need one more point?

We would say: “Okay, it’s this one.” So then basically the vote would be

negative, negative, and positive.

That’s a two out of three. So we would say

the class is actually the negative class. That’s what we would end up going with here.

And so that’s basically how K nearest neighbors works.
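Putting the voting idea together, a bare-bones K nearest neighbors classifier might look like this. It's a sketch with hypothetical data, not the version we'll build later in the series:

```python
import math
from collections import Counter

def knn_predict(data, new_point, k=3):
    """Classify new_point by a majority vote of its k nearest neighbors.

    data maps a class label to the list of points in that class."""
    # Distance from the new point to every labeled point.
    distances = [(math.dist(p, new_point), label)
                 for label, points in data.items()
                 for p in points]
    # Labels of the k closest points.
    k_nearest = [label for _, label in sorted(distances)[:k]]
    # Majority vote decides the class.
    return Counter(k_nearest).most_common(1)[0][0]

# Hypothetical 2D data: minuses clustered low, pluses clustered high.
data = {'-': [(1, 2), (2, 3), (3, 1)],
        '+': [(6, 5), (7, 7), (8, 6)]}

print(knn_predict(data, (5, 7)))   # lands closer to the pluses
print(knn_predict(data, (2, 2)))   # lands closer to the minuses
```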

It’s a super simple algorithm and

the other thing you have to think about too is, in this case we had only two groups.

But what if you had three groups?
Is K equals 3 going to be a good idea?

Turns out no. Because you could have a total split amongst all the groups. What about four?

No. Because you could have a totally even vote.

So if you had 3 groups you need at least 5

total, you know, K equals 5,

to help avoid any sort of split vote.

You can also code something in to just randomly pick if there is a division.
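That random tie-break could be coded like this; a small sketch, assuming the neighbors' votes come back as a plain list of labels:

```python
import random
from collections import Counter

# Hypothetical split vote from an even K: one minus, one plus.
votes = ['-', '+']

counts = Counter(votes).most_common()
top_count = counts[0][1]

# Every label tied for the most votes.
tied = [label for label, count in counts if count == top_count]

# If there is a division, just randomly pick among the tied labels.
winner = tied[0] if len(tied) == 1 else random.choice(tied)
print(winner)
```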
What’s neat about K nearest neighbors though

is not only can you get an actual classification

for the data point that you pick.

You can get what we were talking about before: both

accuracy in the model,

so that you can train and test the model for the model’s overall accuracy,

But each point can also have a degree of confidence.

So for example, what if you’re using K equals 3,

And you get a vote that is like a negative, a negative and a positive.

Well that’s a two out of three, right?

So that’s, you know, 66% confidence

in the score, or in the classification of that data.

But not only is the confidence 66%;

you also have the entire K nearest neighbors model that you’ve trained,

and that has an accuracy. So this per-point number would actually be more like confidence.

That’s why, when we were doing linear regression,

I didn’t want to call it confidence.

I wanted to call it accuracy, because

confidence with K nearest neighbors is something you can actually put a value on,

and it can indeed be very different from the entire model’s actual accuracy.

So that’s kind of cool with K nearest neighbors.
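That per-point confidence is just the winning share of the vote. A minimal sketch with a hypothetical K equals 3 vote:

```python
from collections import Counter

# Hypothetical vote from the 3 nearest neighbors: two minuses, one plus.
votes = ['-', '-', '+']

# The winning label and how many of the k votes it got.
label, count = Counter(votes).most_common(1)[0]
confidence = count / len(votes)  # 2 out of 3, about 66%

print(label, confidence)
```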

Now, what are some downfalls of K nearest neighbors? Well, as we’re going to see,

in order to find out which are the closest neighbors,

what we’re using to measure that distance is just simple

Euclidean distance.

And to find the Euclidean distance,

the simplest method is actually to measure the distance between any given point

and all of the other points. And then you just say: “Okay, what are the closest 3?”

or whatever K is.
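That brute-force search, measuring the distance from the query point to every point and keeping the K smallest, can at least be vectorized. A sketch with made-up data, assuming NumPy is available:

```python
import numpy as np

# Hypothetical dataset: six 2D points, plus one query point.
points = np.array([[1, 2], [2, 3], [3, 1],
                   [6, 5], [7, 7], [8, 6]], dtype=float)
query = np.array([5, 7], dtype=float)

# Euclidean distance from the query to every point in one shot.
dists = np.linalg.norm(points - query, axis=1)

# Indices of the K closest points.
k = 3
closest = np.argsort(dists)[:k]
print(closest)
```

Even vectorized, this is still work proportional to the whole dataset per query, which is why the algorithm gets slow on huge data sets.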

And as you might guess on a huge data set

that’s a very very long and tedious operation.

There are a few things that you can do to kind of speed it up

but no matter what you do to speed this up

you’re going to find that the larger the data set the worse this algorithm runs.

Because it’s just not as efficient as other algorithms.

And so once we cover this and get into the support vector machine,

You’ll see that the support vector machine is much more efficient

when it comes to actual classification.

Also, with K nearest neighbors,

there’s never really a point where you’re totally training anything.

The training and testing are basically the same thing,

because when you go to actually test,

you’re comparing the point to all the other points. There’s really no good way

to train a simple K nearest neighbors algorithm.

There are also some things that we can do down the line

but we probably won’t be getting into that ourselves.

But anyway just keep in mind that the scaling is not so good

and we’ll point out exactly why,

and then when we get into support vector machines you’ll see

why support vector machines scale so much better than K nearest neighbors.

That said, I don’t mean to bag on K nearest neighbors too much.

It’s actually a more than fine algorithm for many classification tasks.

So even if you’re working with up to maybe a gigabyte worth of data,

K nearest neighbors can still be calculated quite fast.

And it can also be easily calculated in parallel,

since any point you’re trying to predict can be calculated

regardless of the other points that you’re trying to calculate.

So you can actually thread it

and still scale relatively well. But if you’re working with,

you know, billions of data points, it’s not going to do very well. So anyways,

that is the theory and intuition behind K nearest neighbors

and now we’re going to actually be diving into a real world example of K nearest neighbors.

And then after that we’ll actually write our own K nearest neighbors algorithm. So stay tuned for that.

If you have any questions or comments leave them below.

Otherwise as always thanks for watching, thanks for all the support and subscriptions and until next time.
