
#### Applying our K Nearest Neighbors Algorithm - Practical Machine Learning Tutorial with Python p.18

What is going on everybody?

Welcome to part 18 of Machine Learning with Python tutorial series.

In this tutorial,

we’re gonna take the K nearest neighbors algorithm that we wrote,

which appears to be working,

and then we’re gonna be testing it on some real-world data.

We’re gonna use that exact same breast cancer data set as before.

And then when we get our accuracy back,

we’re gonna compare our accuracy to the scikit-learn accuracy to see,

if we did about the same.

What I want you to think about is:

should we or should we not get identical, or almost identical, results,

or will the scikit-learn classifier do much better than us

under the same parameter, let’s say k=5?

So think about that as we go.

So the first thing we need to do is clean up some stuff.

We’re gonna get rid of this information here,

we’re gonna get rid of the Matplotlib stuff.

We are not going to be graphing,

there are way too many dimensions for that one.

Also, we know we still have numpy.

We’re gonna add after collections, we’re gonna bring in “import pandas as pd”.

And we’re also gonna import random,

pandas, so we can load in that data set,

and random, so we can shuffle that data set.
Because we’re not using scikit-learn at all here.

We’re doing this ourselves from scratch. Okay.

Well, except for the pandas part.

That’s because doing that part from scratch would take way too long,

but the algorithm is from scratch.

Okay. Anyway. No one is amused.

Anyway, we’ll get rid of that too.

So it’s just the function and the imports.

So here, the first thing we’re gonna do is “df = pd.read_csv( )”, oops, csv.

And don’t forget that “csv”, let me just copy and paste.

It’s that “breast-cancer-wisconsin.data”,

and don’t forget the “.txt”, like I did that one time.

Now we’re gonna do “df.replace”,

of course, just like before we get rid of the question marks,

and we’ll replace that with -99999.

Now that you understand K nearest neighbors,

hopefully you understand what I was explaining before

about that being a significant outlier, since that distance is quite large.

So chances are, under these circumstances,

the only time something would compare to something like that

is if they shared a missing data point.

Anyway, we’ll keep it there anyways.

Oh, and we need “inplace=True”,

so “df.replace('?', -99999, inplace=True)”.

Now we’re gonna do “df.drop”, and we’re dropping the “['id']” column.

Same reason as before: that’s a worthless column.

If you recall, accuracy went down to like 56 percent, or was it 51?

I can’t remember.

It was very close to, you know, a coin toss.

So, a big deal there.
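The cleanup steps described so far can be sketched as follows; since the actual breast-cancer-wisconsin.data.txt file may not be on disk, this sketch builds a tiny stand-in frame instead (the column names and sample values here are assumptions, not the real data):

```python
import pandas as pd

# Stand-in for pd.read_csv('breast-cancer-wisconsin.data.txt'):
# a few rows shaped like the real data set, with a '?' marking missing data.
df = pd.DataFrame({
    'id': [1000025, 1002945, 1015425],
    'clump_thickness': [5, 5, 3],
    'bare_nuclei': [1, '?', 2],
    'class': [2, 2, 4],
})

# Replace the '?' placeholders with a huge outlier value...
df.replace('?', -99999, inplace=True)

# ...and drop the worthless id column, which otherwise tanks accuracy.
df.drop(['id'], axis=1, inplace=True)

print(df.columns.tolist())
print(df['bare_nuclei'].tolist())
```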

“full_data”, we’re gonna say, is “df.astype(float).values.tolist( )”,

and the reason I’m doing this is, for some reason,

this dataframe... like if I go “print”, I will say “print(df.head( ))”.

And I just comment this out for now.

Hopefully we will get what I’m trying to show you.

I’m not seeing it, but it exists.

For some reason, some of these were coming through as quotes,

maybe because I’ve updated, maybe it won’t happen this time.

But I’m pretty sure it will.

So we just wanna make sure that we’ve converted to float.

Everything in this dataframe ought to be an int or a float.

As it happens, everything here will be an int.

But if you want to reuse this code,

it would need to be float, most likely.

So anyway, we’re gonna convert it to a float.

And then “.values.tolist( )”.

So now, we’ve got the data.

Now we’re gonna shuffle the data,

and now keep in mind, in this case, we can shuffle the data,

because what we’ve done is convert this to a list of lists.

So for example, let me just “print(full_data)”,

I will do the first 10.

Let me just hit run.

Here we go. Right, ok.

So as you can see, there’s the first elements, and keep in mind.

The 2 is, if I recall right, benign, and a 4 would be malignant,

but I don’t see a 4 at the moment.

And, just let me do this, real quick.

You don’t have to follow this, I just want to check, because I knew this was the case.

Yes. So converting it to the list here,

you can see like this one is in quotes.

It’s been treated as a string for some reason,

so this column, for whatever reason, is treated as a string.

Probably because it had a question mark in it?

But then again, I don’t know because it’s been replaced.

I really don’t know why it’s doing that.

But anyway, that’s why we’re saying “astype(float).values.tolist( )”.
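To see why the conversion matters, here is a minimal sketch with an assumed string column standing in for the one that came through in quotes:

```python
import pandas as pd

# Column 'b' was read in as strings, like the bare_nuclei column was.
df = pd.DataFrame({'a': [5.0, 3.0], 'b': ['1', '2'], 'class': [2.0, 4.0]})

# astype(float) coerces the string column to numbers, .values gives a
# numpy array, and .tolist() turns it into a plain list of lists.
full_data = df.astype(float).values.tolist()

print(full_data)  # every element is now a float
```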

So anyways, there’s our data.

So, at this point, we can shuffle this data,

and we are not losing the relationship of the features to label.

It’s all part of the same list, right?

So we can shuffle this and not lose anything.

So now we’re gonna say “random.shuffle(full_data)”,

And just to show, “print”, let’s do “print(full_data)”.

We’ll do it to 5,

and then we’ll print full_data to 5 again, after 20 pound signs.

Just to exemplify something.

So, I just wanted to show that the shuffle applies,

and you don’t have to redefine the variable.

So the first one starts with 5,1,1,1,2,

and this one is 5,2,3 and so on,

so the shuffle works.

That was something that always confused me initially,

I would always try to do the following,

I would try to redefine the variable, like “full_data = random.shuffle(full_data)”.

That’s not how it works, anyway.

So, we’ve shuffled the data now.
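The point about not redefining the variable can be shown in a couple of lines; the sample rows here are made up:

```python
import random

# Made-up rows standing in for the real data; the relationship between
# features and label survives shuffling because each row stays intact.
full_data = [[5, 1, 1, 2], [5, 2, 3, 4], [3, 1, 1, 2], [8, 7, 5, 4]]

# random.shuffle works in place and returns None, so don't rebind:
# "full_data = random.shuffle(full_data)" would leave you with None.
random.shuffle(full_data)

print(full_data[:2])  # same rows, (very likely) a new order
```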

And this is gonna be our version of train_test_split,

in really high-quality code.

So we’re gonna say “test_size = 0.2”,
and then we’re gonna say the “train_set = {2 : [ ], 4 : [ ]}”.

And then “test_set = {2 : [ ], 4 : [ ]}”,

we can just copy this “4 : [ ]”, the 4-colon-empty-list part.

Anyway, train_set, test_set,
and then we’re gonna say “train_data = full_data”

Oops, not parentheses, brackets,

“[:-int(test_size * len(full_data))]”.
So we’re just multiplying the full length by the test size, 0.2.

We’re using that to create an index value,

and we’re just slicing it based on that index value.

We’ve converted it to an int.

So it’s a whole number and all that fun stuff.

So we’ve done that. And let’s just copy this, paste.

And now, rather than colon minus,

it would just be minus int, with the colon moved to the end, like “[-int(test_size * len(full_data)):]”.

So this would be everything up to the last 20% of data.

And then this one will be test, so we need to rename this to “test_data”.

test_data would be the last 20% of the data.
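The slicing described above, sketched with stand-in rows:

```python
# A 20% holdout: slice the shuffled list of lists by an integer index.
test_size = 0.2
full_data = [[i, i % 2] for i in range(10)]  # stand-in for the real rows

# Everything up to the last 20%...
train_data = full_data[:-int(test_size * len(full_data))]
# ...and the last 20% itself.
test_data = full_data[-int(test_size * len(full_data)):]

print(len(train_data), len(test_data))  # 8 2
```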

Okay? So now we’ve shuffled the data, and we’ve sliced the data.

And now what we need to do is populate the dictionaries,

because we built this algorithm to expect a dictionary.

So now we’re gonna populate these dictionaries,

and populating them is super quick and easy,

because all we have to do is the following.

So we’re gonna say “for i in train_data”,
we could make a one-line for loop here,

we really ought to, but I’m not gonna.

“train_set”, and the key, basically, will be “i[-1]”.
And what are we doing here?

So we’re saying “train_set[i[-1]]”, which is the negative-first element in those lists.

Remember the last column is the class column.

That’s why we’re using negative one, that’s the last value.

So that is either a 2 or a 4, right?

And recall, 2 is benign, 4 is malignant.

So that’s how we’re identifying which one of these in the dictionary we want to be a part of.

So “train_set[i[-1]].append(i[:-1])”.
So now, we’re appending lists into this list,

and each of those lists is the elements up to the last element.

So again, you wouldn’t want to have one of the attributes being the class,

because you will get it right every time most likely.

K nearest neighbors actually might not.
But yeah, you don’t wanna do that.

So now, we’ve done that.

Now what we need to do is basically the exact same thing, only for the test data.

so let’s take this, copy, paste, change “train” to “test”, “train_set” to “test_set”,

And you’re good.

Now, and again, you could make this one line,

but I didn’t want to do that

simply because of the “i[-1]” stuff, which was probably kind of confusing.

So anyways, we’re done with that.
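Putting the two population loops together on a couple of made-up rows:

```python
train_set = {2: [], 4: []}
test_set = {2: [], 4: []}

# Tiny stand-ins for the sliced data; the last element of each row is
# the class (2 or 4), everything before it is the features.
train_data = [[5.0, 1.0, 1.0, 2.0], [8.0, 10.0, 10.0, 4.0]]
test_data = [[3.0, 1.0, 1.0, 2.0]]

# i[-1] picks the dictionary key, i[:-1] is the feature list.
for i in train_data:
    train_set[i[-1]].append(i[:-1])

for i in test_data:
    test_set[i[-1]].append(i[:-1])

print(train_set)
print(test_set)
```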

Oops, what happened? Come down here.

So we’ve populated our dictionaries.

So what’s left? Really nothing.

We just need to pass the information through the K nearest neighbors.

So basically what we’re gonna say is, let’s measure.

We’ll say “correct = 0” and “total = 0”,
and we’re gonna create a simple counter here.

We’re gonna say “for group in test_set”.

What do we want to do?

We’re gonna say “for data in test_set[group]”.
So for each group in the test set, so this is “test_set”,

so for each of these 2 and 4, we’re testing these.

And then we’re going to say “for data in test_set[group]”.

So just that list of features, right?
So that’s what we’re about to feed through as the “predict” data.

So “predict” is these lists from the test set, right?

And then, as you might be able to guess, what we’re going to pass through as the data,

which goes here,

where we iterate over every single point and calculate the distance,

is going to be the dictionary from the train_set, okay?

So “for data in test_set[group]”,
we’re gonna say “vote = k_nearest_neighbors( )”,

and we pass “train_set”.

then data, which is the features, and we’re gonna say “k=5”,

Simply because if you look at the scikit-learn documentation for K nearest neighbors,

they’re using the default value 5,

so we’re gonna copy that.

Then we are good.

All we have to ask at this point is whether we were right or wrong.

That’s “if group == vote”, right?

If the group that they came from in the test_set...

because with the test set, we know what the answer is.

So if that group is equal to the vote that we got from our K Nearest Neighbors classifier.

Congratulations! Plus equals one for you.

And either way, we also need to do “total += 1”.

Okay. So now we’re basically done.

So now we would just “print”, maybe we would say “‘Accuracy:’,”

and then accuracy is just “correct / total”.
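Putting the whole evaluation loop together, with the k_nearest_neighbors function roughly as we wrote it in the previous part, and a made-up toy data set in place of the real one; treat this as a sketch, not the exact session code:

```python
from collections import Counter

import numpy as np

def k_nearest_neighbors(data, predict, k=3):
    # Distance from the prediction point to every point in every group.
    distances = []
    for group in data:
        for features in data[group]:
            euclidean_distance = np.linalg.norm(
                np.array(features) - np.array(predict))
            distances.append([euclidean_distance, group])
    # Vote among the k closest points.
    votes = [i[1] for i in sorted(distances)[:k]]
    return Counter(votes).most_common(1)[0][0]

# Toy data: two well-separated clusters standing in for classes 2 and 4.
train_set = {2: [[1, 2], [2, 3], [3, 1]], 4: [[6, 5], [7, 7], [8, 6]]}
test_set = {2: [[2, 2]], 4: [[7, 6]]}

correct = 0
total = 0
for group in test_set:
    for data in test_set[group]:
        vote = k_nearest_neighbors(train_set, data, k=5)
        if group == vote:
            correct += 1
        total += 1

print('Accuracy:', correct / total)
```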

So let’s save and run that, and see if we get any errors.

Oh, we shouldn’t be printing this out.

Oh, this is disgusting.

Ok, it went pretty quick, anyway.

“Accuracy: 0.978”, so 97.8% accuracy.
Boom, look at us. Ok.

I’m gonna, I’m gonna calm down.

OK, so that’s it, we’ve applied it,

and now what we want to do is compare that.

Let’s run it one more time, without the nasty output.

We’re going to compare that. So we ran it again: 95.6% accuracy.

OK, so now what I want to do is have us compare this to scikit-learn.

So we’re gonna do that.

And then also we’re going to calculate confidence,

and we’re going to do that in the next tutorial.

So if you have any questions, comments, concerns, whatever up to this point,

feel free to leave them below.

Otherwise, that’s what we’re gonna do in the next tutorial.

Also, thanks for watching.

Thanks for all the support and subscriptions. Until next time.
