#### Machine Learning with Python in Practice #17: Writing Our Own K Nearest Neighbors Code

Writing our own K Nearest Neighbors in Code - Practical Machine Learning Tutorial with Python p.17

What is going on everybody and welcome to part 17 of our machine learning tutorial series.

We’ve been working on K nearest neighbors and let’s get into it.

Don’t want to waste any more time here. So

we’ve started creating this K nearest neighbors function.

All we’ve done so far though is just warn the user when they’re trying to do something stupid.

So now we actually need to

create the K nearest neighbors part to calculate

who the K nearest neighbors are, what their classes are, and what the class of whatever we’re predicting is

based on the comparison to the data. OK.

So what do we need to do?

How could we do this?

How can we compare a data point to the other data points

to find out who is the closest data point

and therein lies the problem with K nearest neighbors.

To do this we have to compare the prediction point

to all other points in the data set.

That’s the problem with K nearest neighbors.

Now there are actually some additional things

we can do. You can use what’s called a radius.

So you can look within a certain radius of a point.

And then you can kind of like forget the outliers.

And that can save you a little bit on the Euclidean distance calculation.

It’s much easier to find out whether a point is inside or outside a radius

than it is to calculate the Euclidean distance. Okay.

So anyways, we’re not going to get into that but just keep in mind

that this is the problem with K nearest neighbors.
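As an aside, the radius idea mentioned above can be sketched roughly like this. This is a hypothetical helper, not code from this tutorial: comparing squared distances against a squared radius skips the square root entirely.

```python
# Sketch of the radius idea: keep only points within radius r of the
# prediction point. Comparing squared values means no sqrt is needed.
def within_radius(points, predict, r):
    r_sq = r ** 2
    kept = []
    for p in points:
        dist_sq = sum((pi - qi) ** 2 for pi, qi in zip(p, predict))
        if dist_sq <= r_sq:  # inside (or on) the radius
            kept.append(p)
    return kept

print(within_radius([[1, 2], [6, 5], [7, 7]], [5, 7], r=3))  # [[6, 5], [7, 7]]
```

The point [1, 2] is dropped because its squared distance to [5, 7] is 41, well outside radius 3.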

So what we’re going to do is we’re just going to build, basically, a list of lists

of the distances.

So what we’re going to say is

you might say something like this, like you might say distances equals a list.

And then you’re going to say for group in data.

And so that would be like for each, keep in mind, data

is whatever you pass through here but in this case it would be this data set.

So for each group…So for each class basically.

So for group in data. For features in group…

or rather for features in data[group].

So this was now iterating through the features here.

What do we want to do? Well, we need to calculate Euclidean distance, right?

And I got rid of the Euclidean distance. So

in order to save your time I’m not going to write it out. I’m just going to copy and paste here.

Because we’re not going to actually use this version

but going with the version that I wrote previously we would say that, right?
euclidean_distance = sqrt( (features[0]-predict[0])**2 + (features[1]-predict[1])**2 )

And then take the whole square root of that, right?

That is the equivalent of what we wrote earlier.
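Written out as runnable code with a couple of sample points, that hard-coded two-dimensional version looks like this (the sample points here are made up for illustration):

```python
from math import sqrt

features = [2, 3]
predict = [4, 7]

# Hard-coded 2D Euclidean distance, exactly as the formula reads:
# the square root of the sum of the squared per-dimension differences.
euclidean_distance = sqrt((features[0] - predict[0]) ** 2
                          + (features[1] - predict[1]) ** 2)
print(euclidean_distance)  # sqrt(20) ≈ 4.472
```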

Unfortunately that is not…that’s the algorithm

but it’s not fast. It’s not a very fast calculation.

So I mean on our data set it would be lightning-fast

and probably on the breast cancer set

it would be pretty fast but in this case no.

So we definitely want to change it.

So what we’re going to change it to is, for example, this would be a faster version.

Also, the other problem here is:

this works with 2 dimensions, but what if you had 3 dimensions, right?

This will not change. This is hard-coded to only do K nearest neighbors on 2 feature dimensions.

Bummer, alright?

There are a few things that we could do to counteract this.

Specifically, a one-line for loop could figure it out,

but to account for a dynamic number of features you can do something like this.

So first of all we’re using the numpy square root. We’re using the numpy version of sum.

And then we’re also using numpy arrays

where we can actually just subtract numpy arrays, square those numpy arrays.

And perfect we get the same value for euclidean distance here as well.
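The numpy version described above, using np.sqrt, np.sum, and array subtraction, would look something like this (same made-up sample points as before):

```python
import numpy as np

features = [2, 3]
predict = [4, 7]

# Subtract the arrays elementwise, square, sum, then take the root.
# Works for any number of feature dimensions, not just two.
euclidean_distance = np.sqrt(np.sum((np.array(features) - np.array(predict)) ** 2))
print(euclidean_distance)  # sqrt(20) ≈ 4.472
```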

But we’re going to go ahead and not use that one either

because turns out that numpy even has a more simple version

and that would be the following.

I guess this one I’ll write out

because you’re gonna…this is one you’re going to need to write out.

So it’s euclidean_distance = np.linalg

It’s basically short for linear algebra, right?

np.linalg.norm, okay?

It’s just another name basically for what we’re going to do here which is Euclidean distance.

So this is the norm of

np.array and we’re going to say features, right? This is features corresponding with features

minus…wait, do we…yeah, minus np.array(predict). Okay.

And this obviously no longer looks like the Euclidean distance formula.

And that’s why I did not want to use this initially

because you’re kind of cheating by using this; it’s much higher level.

So anyways, but we’re going to use it because it’s faster. But I did want to show you

the true way to write it if you needed to.

Okay. So we’ve done that.

And now what we’re going to say is

let’s do…whoops…hit my mouse. Let’s do over here.

So we’ve got…Actually what we need to do is now we need to do distances.append.

We’re going to append the following. We’re going to append the list.

So we’re going to say whatever the euclidean_distance happens to be.

And then we’re going to append the group.

So it’s the distances and then a group.

So it’s going to be a list of lists

where the first item in the list is the actual distance. Second one’s the group.

This way we can sort that list, take the top three items in that list,

and then get the first element of each, and those are the groups. That’s my thinking.
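As a small sketch of that list-of-lists idea (the distances here are made up), sorting puts the smallest distance first, and slicing gives the k closest:

```python
# Each entry is [distance, group]; sorting a list of lists sorts by
# the first element, so the closest points come first.
distances = [[4.1, 'k'], [1.2, 'r'], [2.5, 'r'], [6.0, 'k'], [2.0, 'r']]

k = 3
closest = sorted(distances)[:k]
print(closest)   # [[1.2, 'r'], [2.0, 'r'], [2.5, 'r']]

groups = [i[1] for i in closest]
print(groups)    # ['r', 'r', 'r']
```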

So now we’re going to come down here and we’re going to say
the votes equals…So I guess…Let me explain how I’m going to do this first.

So again I’m going to use a one-liner for loop but I’ll explain

I guess I’ll explain…I’ll write this one out and then I’ll explain it.

So basically it is going to be
the votes is equal to i[1].

So again i[1] is the group.

Once we know what the top 3 distances are we don’t actually care what the distance is.

We just care about the distance because we want to be able to rank the distances.

After we’ve done that and got the top three closest distances,

we don’t care what the distance was. So I want: for i in sorted(distances)[:k], right? Because after we’ve sorted the distances

we only care about the first k distances.

So it would almost be: the votes are equal to, for i in distances,

i[1]. But this just allows us to populate votes as a list

that is just simply the groups, right?

Right, the categories, labels, classes, whatever you want to call them.

So there you have votes, and then we’re going to say the vote result is equal to Counter(votes).most_common.

And how many of the most common do we care about?

We just care about the first one, the one that is the most common.

And then we’re going to take [0][0].

So most_common gives you back a…First of all it’s…

It comes as like an array of a list.

So you take 0 first. You get the list and then

you take the 0 again, because that element tells you the most common group

and then how many there were. So there’s two things in it.

I think it’s actually a tuple of an array.

I can’t remember. Anyway, it’s like that so we’re going to take those.
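You can see the exact shape most_common returns with a tiny example: it's a list of (element, count) tuples, so [0][0] pulls out the winning group.

```python
from collections import Counter

votes = ['r', 'r', 'k']

# most_common(1) returns a list containing one (element, count) tuple.
print(Counter(votes).most_common(1))   # [('r', 2)]

# [0] gets the tuple, the second [0] gets the group itself.
vote_result = Counter(votes).most_common(1)[0][0]
print(vote_result)                     # r
```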

So in fact we could print it, just so you can see it

and don’t have to take my word for it, okay?

So then you’ve got the vote result. We return vote_result.
And we’ve done K nearest neighbors. So now

Let’s say we actually want to run this darn thing

we’re going to say results=

And let’s just copy this. Copy. Paste. We’re going to pass through data set

We want to predict on the new_features and we’re going to say k=3. No problem. Okay.

And now let me print results.

So we’ll also get this other print out but that’ll be okay.

And then maybe we’ll get an error, if we’re unlucky. Nope. No error. Cool.

So it’s actually a list of a tuple. Okay. Good thing I printed it out. I guess I was making it up.

So what we get here is that

the return is…it’s the most voted thing, ‘r’.

It turns out, with k=3, all three were r.

And of course the vote result is r.

So we get the class back. This data point right here is an r type, which

as we already saw like we kind of expected that to be the case.
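Putting all of the pieces from this video together, the whole function looks roughly like this. The toy dataset here follows the shape used earlier in the series, two classes 'k' and 'r' with three 2D points each, so treat the specific numbers as an assumption:

```python
import warnings
from collections import Counter
import numpy as np

def k_nearest_neighbors(data, predict, k=3):
    # Warn the user when they're trying to do something stupid:
    # k should exceed the number of voting groups.
    if len(data) >= k:
        warnings.warn('K is set to a value less than total voting groups!')

    # Compare the prediction point to every point in the data set.
    distances = []
    for group in data:
        for features in data[group]:
            euclidean_distance = np.linalg.norm(
                np.array(features) - np.array(predict))
            distances.append([euclidean_distance, group])

    # Sort by distance, keep the k closest, and vote on their groups.
    votes = [i[1] for i in sorted(distances)[:k]]
    vote_result = Counter(votes).most_common(1)[0][0]
    return vote_result

dataset = {'k': [[1, 2], [2, 3], [3, 1]], 'r': [[6, 5], [7, 7], [8, 6]]}
new_features = [5, 7]

result = k_nearest_neighbors(dataset, new_features, k=3)
print(result)  # r
```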

So good. So now

there’s a couple of things that we want to do.

First of all we might want to compare it to scikit-learn’s version.

The other thing we could do…I mean like we could print or like we could graph this just

for example. So let me do…We could take this cut. Copy. Come down here. Paste.

And I’ll take this away and
new_features. So instead…yeah we can pull that out and then…Is it color? Yeah, color.
=result

I think we’ll get away with that. So if we do that.

Right, we just need to change the size but that was the new point, right?

So indeed it classified it in the correct group. So now

what we want to do is we want to compare that to scikit-learn’s version of K nearest neighbors.

Okay. So we’re going to use this K nearest neighbors algorithm against that breast cancer data.

And we’re going to see how well we compare to scikit-learn.
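As a quick sketch of what that comparison involves, scikit-learn's built-in classifier does the same job in a few lines. The toy dataset here is the same assumed one from above, flattened into feature and label arrays:

```python
from sklearn.neighbors import KNeighborsClassifier

dataset = {'k': [[1, 2], [2, 3], [3, 1]], 'r': [[6, 5], [7, 7], [8, 6]]}

# Flatten the dict-of-lists into X (points) and y (group labels).
X = [pt for group in dataset for pt in dataset[group]]
y = [group for group in dataset for _ in dataset[group]]

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X, y)
print(clf.predict([[5, 7]]))  # ['r']
```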

So that’s what we’re going to be doing in the next tutorial.

So if you have any questions, comments, concerns, or whatever, leave them below.

Otherwise as always thanks for watching. Thanks for all the support and subscriptions and until next time.

[B]刀子