

Applying our K Nearest Neighbors Algorithm - Practical Machine Learning Tutorial with Python p.18

What is going on, everybody? Welcome to part 18 of the Machine Learning with Python tutorial series.
In this tutorial, we're going to take the K nearest neighbors algorithm that we wrote, which appears to be working, and test it on some real-world data. We're going to use the exact same breast cancer data set as before. Then, when we get our accuracy back, we'll compare it to the scikit-learn accuracy to see whether we did about the same. What I want you to think about is: should we, or should we not, get identical or almost identical results? Or will the scikit-learn classifier do much better than us under the same parameters, say K=5? So think about that as we go.
The first thing we need to do is clean up some stuff. We're going to get rid of this information here, and get rid of the Matplotlib code. We won't be graphing; this data set has far too many dimensions for that. We also already know we have NumPy. After the collections import, we're going to add `import pandas as pd`, and also `import random`: pandas so we can load the data set, and random so we can shuffle it. We're not using scikit-learn at all here; we're doing this ourselves from scratch. Except for the pandas part, which is good, or this would take way too long. But the algorithm is ours. Okay, anyway, no one is amused. We'll get rid of that too, so it's just the function and the imports.
So here, the first thing we're going to do is `df = pd.read_csv()`. Oops, csv. And don't forget the file name; let me just copy and paste it: `breast-cancer-wisconsin.data`. And don't forget the `.txt`, like I did that one time. Now we're going to do `df.replace`; of course, just like before, we get rid of the question marks and replace them with -99999. Now that you understand K nearest neighbors, hopefully you understand what I was explaining before: that value becomes a significant outlier, so the distance to it is quite large. Under these circumstances, the only time a point would end up close to something like that is if they shared a missing data point. Anyway, we'll keep it there. Oh, and we need `inplace=True`, so the replacement happens on `df` itself: `df.replace('?', -99999, inplace=True)`.
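Here is that replace step in miniature, on a made-up two-row frame (not the real data set), just to show what the `'?'` placeholders turn into:

```python
import pandas as pd

# Hypothetical miniature frame standing in for the breast cancer data;
# missing values in the real file show up as '?' strings.
df = pd.DataFrame({'clump_thickness': [5, '?'], 'class': [2, 4]})

# Replace the '?' placeholders with a huge outlier value, in place.
df.replace('?', -99999, inplace=True)

print(df['clump_thickness'].tolist())  # the '?' is now -99999
```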
Now we're going to `df.drop` the `['id']` column, for the same reason as before: it's a worthless column. If you recall, with it left in, accuracy went down to something like 56 percent, or was it 51? I can't remember exactly, but it was very close to a coin toss. So that's a big deal.
Next, `full_data = df.astype(float).values.tolist()`. The reason I'm doing this is that, for some reason, some of the values in this dataframe come through as quoted strings. If I do `print(df.head())` and comment this line out for now, hopefully we'll see what I'm trying to show you. I'm not seeing it right now, but it exists; maybe it won't happen because I've updated things, but I'm pretty sure it will. So we just want to make sure we've converted everything to float. Everything in this dataframe ought to be an int or a float; here it happens to all be ints. But if you want to reuse this code, it would most likely need to be float. So anyway, we convert it to float, and then `.values.tolist()`. So now we've got the data.
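The string-column problem looks like this in miniature (toy frame, not the real file): a column read as strings survives `values.tolist()` as strings unless we cast first.

```python
import pandas as pd

# One column came through as strings, as the real file sometimes does.
df = pd.DataFrame({'a': ['1', '3'], 'b': [2, 4]})

without_cast = df.values.tolist()             # strings leak through
full_data = df.astype(float).values.tolist()  # everything is a float now
print(without_cast)
print(full_data)
```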
Now we're going to shuffle the data. Keep in mind that in this case we can shuffle it, because we've converted it to a list of lists. For example, let me just `print(full_data[:10])` for the first ten rows, and run. Here we go. As you can see, there are the first elements. Keep in mind: a 2 is, if I recall right, benign, and a 4 would be malignant, though I don't see a 4 at the moment. Let me check something real quick; you don't have to follow this, I just want to see, because I suspected this. Yes, so converting it to a list here, you can see this one is in quotes. It's been treated as a string for some reason; this whole column, for whatever reason, is treated as a string. Probably because it had a question mark in it? But then again, I don't know, because the question marks have been replaced. I really don't know why it's doing that, but that's why we're saying `astype(float).values.tolist()`. So anyway, there's our data.

At this point, we can shuffle this data without losing the relationship of the features to the label, since each row is all part of the same inner list, right? So we can shuffle and not lose anything. Now we say `random.shuffle(full_data)`, and just to show that it worked, we `print(full_data[:5])`, then print it again after a line of twenty hash marks, just to exemplify something. I wanted to show that the shuffle applies in place, and you do not have to redefine the variable. The first run starts with 5, 1, 1, 1, 2, and this one starts with 5, 2, 3, and so on, so the shuffle works. That was something that always confused me initially: I would try to redefine the variable, like `full_data = random.shuffle(full_data)`. That's not how it works, anyway.
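The in-place point is worth pinning down; `random.shuffle` returns `None`, so reassigning throws the data away (dummy rows here, not the real data set):

```python
import random

full_data = [[5, 1, 1, 2], [5, 2, 3, 4], [1, 1, 1, 2]]

# Wrong pattern: shuffle mutates its argument and returns None.
result = random.shuffle(full_data)
print(result)  # None -- but full_data itself has been shuffled

# Each row still carries its own label in the last column, so shuffling
# rows never breaks the feature/label pairing.
print(sorted(row[-1] for row in full_data))
```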
So we've shuffled the data now. This is going to be our version of train_test_split, in really high-quality code. We say `test_size = 0.2`, then `train_set = {2: [], 4: []}`, and then `test_set = {2: [], 4: []}`; we can just copy that `4: []` part over. So, train_set, test_set, and then `train_data = full_data` with, oops, not parentheses, brackets: `[:-int(test_size * len(full_data))]`. We're multiplying the length of the data by the test size, 0.2, using that to create an index value, and slicing based on that index. We've converted it to an int so it's a whole number and all that fun stuff. So we've done that. Let's just copy this and paste, and rather than colon-minus, it's just minus-int up front: `[-int(test_size * len(full_data)):]`. So `train_data` is everything up to the last 20% of the data, and this one, which we need to rename to `test_data`, is the last 20% of the data. Okay? So now we've shuffled the data, and we've sliced the data.
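The slicing arithmetic in one place, on a dummy list of ten rows with `test_size = 0.2` (hypothetical rows, same indexing as in the video):

```python
test_size = 0.2
full_data = [[i, 2] for i in range(10)]  # ten dummy rows

split = int(test_size * len(full_data))  # 2 rows for testing
train_data = full_data[:-split]          # everything up to the last 20%
test_data = full_data[-split:]           # the last 20%

print(len(train_data), len(test_data))   # 8 2
```

One caveat with this pattern: if the data were so small that `split` came out 0, `full_data[:-0]` would be an empty list, so it only behaves with a nonzero split.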
And now what we need to do is populate the dictionaries, because we built the algorithm to want a dictionary. Populating them is super quick and easy, because all we have to do is the following. We say `for i in train_data:`. We could make this a one-line for loop here, and we really ought to, but I'm not going to. Inside: `train_set[i[-1]]`. And what are we doing here? `i[-1]` is the last element of each of those lists, and remember, the last column is the class column. That's why we're using negative one: that's the last value, so it's either a 2 or a 4, right? And recall, 2 is benign, 4 is malignant. That's how we're identifying which list in the dictionary we want the row to be a part of. So: `train_set[i[-1]].append(i[:-1])`. We're appending a list into that dictionary's list, and that list is the elements up to, but not including, the last element. Again, you wouldn't want the class to be one of the attributes, because then you would most likely get it right every time. K nearest neighbors actually might not, but yeah, you don't want to do that.

So now we've done that. All we need to do is basically the exact same thing for the test data: take this, copy, paste, change `train_data` to `test_data` and `train_set` to `test_set`, and you're good. Again, you could make this one line, but I didn't want to, simply because the `i[-1]` stuff is probably confusing enough already. So anyway, we're done with that. Oops, what happened? Come down here. So we've populated our dictionaries.
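The dictionary-populating step, spelled out on two hypothetical rows:

```python
train_set = {2: [], 4: []}
train_data = [[5, 1, 1, 2], [8, 7, 8, 4]]  # last column is the class

for i in train_data:
    # i[-1] is the class (2 or 4); i[:-1] is the feature list without it.
    train_set[i[-1]].append(i[:-1])

print(train_set)  # {2: [[5, 1, 1]], 4: [[8, 7, 8]]}
```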
So what's left? Really nothing. We just need to pass the information through the K nearest neighbors function. Basically, we're going to measure: we say `correct = 0` and `total = 0`, and create a simple counter here. Then `for group in test_set:`. So for each group in the test set, each of those 2 and 4 keys, we're testing these. And then `for data in test_set[group]:`, so just the lists of features, right? That's what we're about to feed through as `predict`: the lists from the test set. And as you might be able to guess, what we're going to pass as `data`, which goes here, where we iterate over every single point and calculate the distance, is the dictionary from the train_set, okay?

So, `for data in test_set[group]:`, we say `vote = k_nearest_neighbors(train_set, data, k=5)`: we pass the train_set, then data, which is the features, and `k=5`, simply because if you look at the scikit-learn documentation for K nearest neighbors, they use 5 as the default value, so we're going to copy that. Then we're good. All we have to ask at this point is whether we were right or wrong: `if group == vote:`, right? The group the features came from in the test set is the known answer, because in the test set we know what the answer is. So if that group is equal to the vote we got from our K nearest neighbors classifier, congratulations: `correct += 1` for you. Either way, we also need `total += 1`. Okay, so now we're basically done. We would just `print('Accuracy:', correct / total)`, since accuracy is just correct divided by total.
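Putting the scoring loop together end to end: the `k_nearest_neighbors` function below is only a compressed sketch of the one built in the earlier parts of this series (Euclidean distance plus a `Counter` vote), not a drop-in for your file, and the tiny train/test sets are made up so the whole thing runs standalone.

```python
from math import sqrt
from collections import Counter

def k_nearest_neighbors(data, predict, k=3):
    # Sketch of the classifier from the earlier parts: rank every training
    # point by Euclidean distance and vote among the k closest.
    distances = []
    for group in data:
        for features in data[group]:
            d = sqrt(sum((f - p) ** 2 for f, p in zip(features, predict)))
            distances.append((d, group))
    votes = [group for _, group in sorted(distances)[:k]]
    return Counter(votes).most_common(1)[0][0]

# Made-up, well-separated clusters standing in for the real data.
train_set = {2: [[1, 2], [2, 3], [3, 1]], 4: [[6, 5], [7, 7], [8, 6]]}
test_set = {2: [[2, 2]], 4: [[7, 6]]}

correct = 0
total = 0
for group in test_set:
    for data in test_set[group]:
        vote = k_nearest_neighbors(train_set, data, k=3)
        if group == vote:
            correct += 1
        total += 1

print('Accuracy:', correct / total)
```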
So let's save and run that, and see if we get any errors. Oh, we shouldn't be printing all of that out. Oh, this is disgusting. Okay, it went pretty quick anyway: `Accuracy: 0.978`, so 97.8% accuracy. Boom, look at us. Okay, so that's it, we've applied it, and now what we want to do is compare that. Let's run it one more time, without the nasty output. We ran it again: 95.6% accuracy. So now what I want to do is have us compare this to scikit-learn, and then also calculate confidence, and we're going to do that in the next tutorial. So if you have any questions, comments, concerns, whatever, up to this point, feel free to leave them below. Otherwise, that's what we'll do in the next tutorial. Thanks for watching, and thanks for all the support and subscriptions, until next time.


Translation info

Video overview

This video shows how to write a KNN algorithm from scratch and compare it against scikit-learn's KNN.

Transcribed by: Leben

Translated by: midorishen

Reviewed by: 审核员1024

Video source:

https://www.youtube.com/watch?v=3XPhmnf96s0
