
#### Final thoughts on K Nearest Neighbors - Practical Machine Learning Tutorial with Python p.19

What’s going on everybody and welcome to part 19

of our machine learning with Python tutorial series.

In this video we’re going to be talking about K nearest neighbors

one last time before we jettison off

into the support vector machine.

What we're talking about here is K, accuracy, and predictions.

The first question we might have is... well,

if we increased k,

would accuracy necessarily go up?

So just for fun. Let’s say k is equal to 25.

And we run that. The previous accuracy was about 95.

Now we got 88, or 98 rather.

Sorry I can’t read for crap.

Run it again. Okay, 97. Run it again. 94. Run it again.

Okay, you get the point.

So we're still doing pretty good. What if we raise it to 75?

95%. Okay, still doing pretty good.

97%. Still doing pretty good.
Okay, 93 is one of our lower numbers.

Anyway, we don't really seem to be improving.

I mean, the average around here is still about 97.

So, you know, we can keep going.

I forget how many data points we actually have in this set, let's see,

which is about 600. So what if we said k is equal to 200?

So we're voting among the closest 200 points.

We see accuracy actually appears to be worse.

So looking at more points, probably what's happening there is... recall that

we actually have skewed classes: only about 30% of the samples were malignant

and the other 70% were benign.

So, you know, adding to k doesn't necessarily do you any favors.

5 is probably a pretty good guess.

But depending on your data set and all that you might want to

fiddle with K just to see if it makes a huge difference.

And just test on whatever data set you might have. So anyway, that’s K.
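The k sweep above could be sketched like this. The video uses the UCI breast cancer data and the hand-written classifier from earlier parts of the series; as a rough sketch, two synthetic 2-D clusters stand in for the benign/malignant classes, and the helper name is illustrative:

```python
import random
from math import sqrt
from collections import Counter

def k_nearest_neighbors(data, predict, k=5):
    """Vote among the k training points closest to `predict`.
    `data` maps class label -> list of feature vectors."""
    distances = []
    for group, points in data.items():
        for features in points:
            d = sqrt(sum((f - p) ** 2 for f, p in zip(features, predict)))
            distances.append((d, group))
    votes = [group for _, group in sorted(distances)[:k]]
    return Counter(votes).most_common(1)[0][0]

random.seed(0)
# Two synthetic clusters standing in for the two classes (skewed roughly 70/30).
train = {
    2: [[random.gauss(2, 1), random.gauss(2, 1)] for _ in range(70)],
    4: [[random.gauss(7, 1), random.gauss(7, 1)] for _ in range(30)],
}
test = {
    2: [[random.gauss(2, 1), random.gauss(2, 1)] for _ in range(20)],
    4: [[random.gauss(7, 1), random.gauss(7, 1)] for _ in range(10)],
}

# Accuracy for a few different k values; raising k does not necessarily help.
for k in (5, 25, 75):
    correct = total = 0
    for group, points in test.items():
        for features in points:
            if k_nearest_neighbors(train, features, k=k) == group:
                correct += 1
            total += 1
    print(k, correct / total)
```

On well-separated data like this, all three k values score similarly, which mirrors the point in the video: past a reasonable k, more neighbors mostly just add noise from the majority class.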

Next we’re going to be talking about confidence versus accuracy.

And K nearest neighbors can give us confidence.
So accuracy is: did we get the classification right?

But confidence actually comes from the classifier. The classifier can say: "Hey,

we have 100% of the votes in favor of this class being such-and-such."

Or conversely it can come back and say: "Hey,

we only have 60% of the votes. This is what the vote was, but

our confidence is only 60%," right?

So, for example, you could create confidence,

and confidence would basically be that same Counter expression,

except instead of the 0th element of most_common you'd take the 1st, which is how many votes the winner got.

And then it would be divided by whatever k was, right?

So if you say k is five,

you're hoping that this number is 5.

So anyways, there's that, and we'll just leave it like this, simply because

everything else, like accuracy, is in decimal form. So we're not actually multiplying by 100.

So you can have confidence there and then you literally could just do confidence. Okay.
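As a sketch of that confidence idea, assuming the Counter-based voting classifier from the earlier parts of the series (the exact variable names here are illustrative):

```python
from math import sqrt
from collections import Counter

def k_nearest_neighbors(data, predict, k=5):
    """Return (vote_result, confidence): the winner of the vote among the
    k closest training points, and the winner's votes divided by k."""
    distances = []
    for group, points in data.items():
        for features in points:
            d = sqrt(sum((f - p) ** 2 for f, p in zip(features, predict)))
            distances.append((d, group))
    votes = [group for _, group in sorted(distances)[:k]]
    vote_result, count = Counter(votes).most_common(1)[0]
    confidence = count / k  # kept in decimal form, like accuracy
    return vote_result, confidence

data = {'a': [[1, 1], [1, 2], [2, 1]], 'b': [[6, 6], [7, 7], [8, 8]]}
print(k_nearest_neighbors(data, [1.5, 1.5], k=3))  # unanimous vote: ('a', 1.0)
print(k_nearest_neighbors(data, [4, 4], k=3))      # split 2-to-1: ('a', 2/3)
```

The only change from a plain vote is grabbing `most_common(1)[0][1]` (the winning vote count) alongside `[0][0]` (the winning label).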

And we can... let's see, vote.

And then confidence would go here. So let me make sure we get away with that.

Right. So there you have that, and then basically what we could do is...

You wouldn't want to do this all the time, but you could print, you know, vote_result...

no, not vote_result: confidence.

So there we go. So almost all of these were a, you know, 1.0 confidence.

This one was eighty percent but this one was sixty percent.

And the interesting thing would maybe be to print the confidence only when the vote was wrong:

if the prediction matches, do nothing; else: print(confidence).

And we're going to comment out the unconditional print(confidence), so it's not printed on every prediction.
So all the numbers we’re about to see

are the confidence score of the votes that we got incorrect.

So we run this.

Okay. So actually this is a pretty good split here.

These were 100%-confidence votes that were incorrect.

Let’s go up here and let’s maybe change the test size to

0.4. We’re going to sacrifice a lot of data.

So accuracy might actually go down, but the question in my mind was:

would you see fewer 100%-confidence votes

among the incorrect evaluations? And sure enough, it appears so, at least in

this test. This was a pretty even split, but the ones it was unconfident about were at 60%.

So you might even get to the point, though, where,

let's say, you're telling someone whether or not they have cancer.

Okay. You're telling them whether or not they have cancer,

and the confidence is only 60%.

You might, you know, not say anything definitive, right?

You might say: "Well, the test was not accurate," or whatever.

Or the test was not confident, right?

So, you know, that might be a reason why you might want to take a little more care.

So that's K and confidence; now, the comparison.

So what I want to do is take this block here,

and say for i in range(25).

Yeah, I'll just be sloppy about it, whatever.

Tab that over.

And we're gonna say accuracies equals an empty list.

And then here we're gonna say

accuracies.append(accuracy).

And then when we're all done, we're going to print

sum(accuracies)

divided by

len(accuracies).
Let's just do 10. Well, we'll do five first just to make sure the logic works.

We have to get rid of these printouts, right?

Okay. So we’ll get rid of the printouts.

And as we can see, over those 5 tests we averaged 96.2% accuracy.

The other thing we need to make sure of: k is indeed 5.

We'll stop printing accuracy and we will stop...

Where's the other thing we were printing? Oh, it's here:

else: print(confidence). Let's just delete that.
Okay. So now.

Now we’ve got that and we are getting different accuracies that just

to show the whole process is being repeated over again.
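The accuracy-averaging loop could be sketched like this. Synthetic points stand in for the shuffled breast cancer rows, and the helper names are assumptions rather than the exact code from the video:

```python
import random
from math import sqrt
from collections import Counter

def k_nearest_neighbors(data, predict, k=5):
    """Vote among the k closest training points."""
    distances = []
    for group, points in data.items():
        for features in points:
            d = sqrt(sum((f - p) ** 2 for f, p in zip(features, predict)))
            distances.append((d, group))
    votes = [group for _, group in sorted(distances)[:k]]
    return Counter(votes).most_common(1)[0][0]

random.seed(1)
# Labeled (features, class) pairs standing in for the dataset rows.
full_data = ([([random.gauss(2, 1), random.gauss(2, 1)], 2) for _ in range(100)] +
             [([random.gauss(7, 1), random.gauss(7, 1)], 4) for _ in range(50)])

accuracies = []
for i in range(25):
    random.shuffle(full_data)            # fresh train/test split every iteration
    split = int(0.8 * len(full_data))
    train_set = {2: [], 4: []}
    for features, group in full_data[:split]:
        train_set[group].append(features)
    correct = total = 0
    for features, group in full_data[split:]:
        if k_nearest_neighbors(train_set, features, k=5) == group:
            correct += 1
        total += 1
    accuracies.append(correct / total)

print(sum(accuracies) / len(accuracies))  # average accuracy over 25 runs
```

Each pass re-shuffles and re-splits the data, so the per-run accuracies differ slightly and the final print is their mean.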

So let’s run this 25 times.

May not be enough for you

but we’ll run that 25 times. And then let’s go to

the other K nearest neighbors.

Let’s say…You don’t have to follow along if you don’t want to by the way.

You can just see what the result is. I believe that this is the other K nearest neighbors from part 14.

What we could do is do basically the exact same thing here, right?

Let’s take this here. Tab over.

Paste the for i in range loop there.

I’m going to tab. I’m going to stop this accuracy stuff.

And in fact we don’t need to do predictions either.

I'm going to leave them there just in case, because this is the old code actually.

I usually upload this to GitHub.

So anyway. We’ll save that and then

accuracies.append(accuracy)

And then we print sum accuracies.

Divided by len(accuracies).

Okay. So out of 25

runs we averaged 96.4% accuracy on the version we wrote.

Now let's test sklearn's K nearest neighbors.

Okay. So this is the average accuracy for sklearn's K nearest neighbors: 96.8%.

The only thing I will stress though is the following.

Keep in mind. On both of these we’re basically doing the exact same thing.

We’re loading in the data frame. We’re passing the data. We’re training. We’re testing.

And we’re doing 25 iterations.

So again let’s run this one.

And it's still running, and I'm going to go... it's still running. I'll run this one, too.

And there's your answer there, and then we're still waiting on the version we wrote.

And we’re still waiting.

Oh man. Anyway. So…

So what’s the difference? Okay. There’s a couple of differences here.

One is that K nearest neighbors has a default parameter, n_jobs.

Actually I think the default n_jobs is equal to one,

but a couple of things. One: K nearest neighbors can be threaded.

Okay. So

you do not have to test each prediction point sequentially.

In testing you had 20% of the sample, right? So

when you go to test, each set of features is its own unique snowflake.

So you don't actually need to test linearly.

You can test each one on its own, basically. So you can heavily thread

K nearest neighbors on a bulk of predictions.

So first of all, you can do that, but I actually

think that the default is that they do not.

Let me pull up sklearn's KNeighbors documentation for you guys. So here we go.

And n_neighbors defaults to 5, sure enough. Radius: this is most likely where they're...

they're winning. They're beating us because they're using that idea of a radius

to ignore points that are outside the radius, most likely.

n_jobs. This is what we’re talking about. So how many parallel jobs do you want to run

for the neighbors search?

The default is one, but if you set that to negative one...

where are we? Here we are...

then it will run as many jobs as possible.

So that’s another way to actually speed it up but as you can see it was actually already pretty fast.
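A minimal sketch of the sklearn side, using KNeighborsClassifier with the n_jobs parameter discussed above (the toy training points here are made up for illustration):

```python
from sklearn.neighbors import KNeighborsClassifier

# n_jobs=-1 parallelizes the neighbor search across all available CPU cores;
# the single-threaded default is what the video refers to as n_jobs=1.
clf = KNeighborsClassifier(n_neighbors=3, n_jobs=-1)

# Two tiny toy clusters, classes 0 and 1.
X_train = [[0, 0], [1, 1], [1, 0], [5, 5], [6, 6], [6, 5]]
y_train = [0, 0, 0, 1, 1, 1]
clf.fit(X_train, y_train)

# Each test point is independent, which is exactly why bulk
# predictions thread so well.
print(clf.predict([[0.5, 0.5], [5.5, 5.5]]))  # → [0 1]
```

Since every test point's neighbor search is independent of the others, handing the whole test batch to `predict` at once lets sklearn spread the work across jobs.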

But this is not a high performance tutorial on Python. This is a machine learning tutorial.

So if you want high performance you’ll most likely be using someone else’s algorithm.

But our accuracy was actually very similar to theirs.

I would wager that if we ran like a million samples.

We would find that accuracy was identical.

It's just that 25 is not actually a decent sample size.
So anyways, that's all for K nearest neighbors,

a very valid algorithm to use.
You can use it on pretty much... a lot of people say it doesn't scale well, but

it scales to quite large data sets.

It just doesn't scale to, like, terabytes of data;

it's just not going to run very well on terabytes of data.

Below that, it actually still runs pretty well.

The accuracy is pretty good.

And one of the other upsides of the K nearest neighbors classifier

is that it can work on both linear and nonlinear data.

So you’ll see that makes a big difference

especially in something like the support vector machine which is what we’re covering next.

But with algorithms that we've already covered, you actually can use

regression to do classification,

so long as you're using linear data. So, for example, I've got a graph here.

So if I was to draw the best fit line, the y-hat line,

the regression line, for these two data sets, I would probably do something maybe

like this for the blue dots, and then I might do something

like this for the orange dots.

And then if I had an unknown data point, let's say maybe this point here,

without using K nearest neighbors, what I could do instead is measure the squared error

between that point and each of the two regression lines.

Whichever line had the lesser squared error,

that would be the class the new point belonged to, which in this case would be the

orange dot class, right?

So with regression if the data is indeed linear

you can still do classification, right? You don’t just have to forecast out.
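That regression-based classification idea could be sketched as follows. `best_fit` and `classify_by_line` are hypothetical helper names, and the two toy groups are assumed to lie roughly on y = x and y = x + 5:

```python
def best_fit(points):
    """Least-squares slope and intercept for a list of (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    m = (sum(x * y for x, y in points) / n - mean_x * mean_y) / \
        (sum(x * x for x, _ in points) / n - mean_x ** 2)
    b = mean_y - m * mean_x
    return m, b

def classify_by_line(groups, point):
    """Assign `point` to whichever group's regression line gives the
    smaller squared error at the point's x value."""
    x, y = point
    errors = {}
    for label, points in groups.items():
        m, b = best_fit(points)
        errors[label] = (y - (m * x + b)) ** 2
    return min(errors, key=errors.get)

groups = {
    'blue':   [(1, 1), (2, 2), (3, 3), (4, 4)],  # roughly y = x
    'orange': [(1, 6), (2, 7), (3, 8), (4, 9)],  # roughly y = x + 5
}
print(classify_by_line(groups, (2, 6.5)))  # closer to the upper line → orange
```

This only works because each class actually follows a line; as the transcript goes on to say, once the data is nonlinear the fits become so poor that the comparison is meaningless.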

Now what about a dataset like this, though?

In this dataset, both the orange dots and the blue dots

do have a best fit line. I have no idea what those lines are,

but they do have a best fit line.

But even if we did draw the best fit lines,

the squared error would be so large, and the coefficient of determination so poor,

that the actual, you know, confidence of this approach would be worthless.

So you can't really do classification

on a data set like this. This is nonlinear data.

But you can do K nearest neighbors on nonlinear data.

So, you know, we can still classify this point

using K nearest neighbors, because then we're just

measuring the distance between this point and, probably, this point and this point.

Those are our closest three points.

If K was equal to three,

we'd have a two-to-one vote, so we would vote that it's in the orange dot class, right?
So anyways that’s one of the other upsides to the K nearest neighbors classifier.

That’s it for K nearest neighbors. The next topic we’re gonna be talking about is the support vector machine.

So that’s what we’re gonna be getting into. If you have any questions comments concerns whatever

feel free to leave them below.

Otherwise, as always, thanks for watching, thanks for all the support and subscriptions, and until next time.
