Getting Started with Weka - Machine Learning Recipes #10

[MUSIC PLAYING]
JOSH GORDON: Hey, everyone.
Today I’d like to make a quick video that I hope will be concrete and useful for you.
It’s about the very first machine learning library I ever tried, and it’s called Weka.
What’s great is that Weka comes with a GUI that makes it easy to visualize your datasets and train and compare different classifiers.
This is a really handy tool to have while you’re learning ML.
I’ll give you a quick walkthrough of how to use Weka, from installation all the way to running experiments, and show you some of what it can do.
I’ll demo training models on two different datasets.
First, we’ll predict if a patient has diabetes based on attributes like their glucose levels.
Next, we’ll predict if a congressperson is a Democrat or Republican based on how they voted on different bills.
I’ll also show you how to evaluate the results of these experiments and how to do things like feature selection to discover which attributes are important.
OK, let’s dive right in.
The first thing we’ll do is download and install Weka.
What’s neat is that it comes as a nicely packaged application you can run on Mac, Windows, or Linux.
There’s also a Java API.
Here, I’ll download and install Weka.
And now I’ll start it up.
There are different interfaces, and we’ll use the Explorer.
There’s a lot on this screen, but don’t worry about it.
You’ll get a feel for how this works in a moment.
The first thing to do is open a dataset, so we’ll hit Open.
Now would be a good time to download one.
You can find a bunch of prepackaged datasets on this page, and we’ll start with the UCI repository.
It contains about 37 problems.
When you download it, you’ll get a JAR.
Now, you might be familiar with these if you’re a Java developer.
But if not, don’t worry. You can treat them as a ZIP.
Here I’ll unzip it, and now we can see a directory of datasets.
Let’s return to Weka and open one of these up.
We’ll start with diabetes.
All right, what do we see here? Let me walk you through it.
First, let’s learn about the dataset.
At the top, you can see there are 768 examples, or instances, and nine attributes, or features.
The best attribute to start with is class, or the label we want to predict.
Usually in Weka, that’s the last attribute in a dataset.
Clicking on that, we can see a histogram.
The blue column on the left shows the number of people who tested negative for diabetes.
The red column on the right shows those who tested positive.
Now let’s look at the attributes we’ll use to predict if a patient has the disease.
The descriptions here are pretty short.
But we can open up the dataset in Sublime or your favorite text editor to learn more about what they mean, as well as how the dataset was collected.
Now, Weka datasets come in the ARFF format.
This is just a CSV with some metadata included at the top.
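As a rough illustration of the "CSV with metadata at the top" idea, here is a minimal Python sketch that splits an ARFF-style text into attribute names and data rows. The sample text is a made-up three-row fragment, not the real diabetes file, and the parser is an assumption-laden simplification: it ignores quoted values, sparse rows, and missing-value markers.

```python
import csv

# Hypothetical ARFF-style fragment for illustration only; the attribute
# names and values are made up, not copied from the real diabetes file.
SAMPLE_ARFF = """\
% Comment lines start with a percent sign.
@relation toy_diabetes
@attribute preg numeric
@attribute plas numeric
@attribute class {tested_negative,tested_positive}
@data
6,148,tested_positive
1,85,tested_negative
8,183,tested_positive
"""


def parse_simple_arff(text):
    """Return (attribute_names, rows) for a simple, dense ARFF text.

    Assumes unquoted attribute names and comma-separated data rows.
    """
    attributes = []
    data_lines = []
    in_data = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        if in_data:
            data_lines.append(line)
        elif line.lower().startswith("@attribute"):
            # "@attribute <name> <type>": keep only the name
            attributes.append(line.split()[1])
        elif line.lower().startswith("@data"):
            in_data = True  # everything after this marker is plain CSV
    rows = list(csv.reader(data_lines))
    return attributes, rows


attributes, rows = parse_simple_arff(SAMPLE_ARFF)
print(attributes)
print(len(rows), "rows; last column is the class:", rows[0][-1])
```

Everything above the `@data` marker is the metadata header; everything below it can be handed to an ordinary CSV reader.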
Scrolling down a bit, we can see a description of the attributes.
The first tells us the number of times a patient was pregnant, and the second tells us their plasma glucose.
For diabetes, I imagine one of these is more predictive than the other.
Let’s see if Weka can tell us that, too.
Back in the GUI, let’s click on plasma.
What’s cool is you can see a histogram of how different values correlate to the class we want to predict.
Recall that blue is negative, and red is positive.
Right off the bat, we can see this is a useful attribute, meaning that if plasma is low, say below about 100, then it’s unlikely the patient has diabetes.
Most of these values are blue, whereas as the value increases, it’s increasingly likely that a patient has the disease.
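To make the colored histogram concrete, here is a small Python sketch that counts, for each value bin, how many examples fall into each class. The (plasma, label) pairs are made-up toy values rather than rows from the diabetes dataset, and the bin width of 50 is an arbitrary choice for this illustration.

```python
from collections import Counter, defaultdict

# Made-up (plasma glucose, label) pairs for illustration only.
toy_examples = [
    (85, "negative"), (92, "negative"), (78, "negative"), (99, "negative"),
    (110, "negative"), (125, "positive"), (140, "negative"), (148, "positive"),
    (160, "positive"), (183, "positive"), (95, "positive"), (170, "positive"),
]

BIN_WIDTH = 50  # arbitrary bin width for this sketch


def class_histogram(examples, bin_width):
    """Map each bin's lower edge to a Counter of class labels in that bin."""
    bins = defaultdict(Counter)
    for value, label in examples:
        lower_edge = (value // bin_width) * bin_width
        bins[lower_edge][label] += 1
    return dict(bins)


histogram = class_histogram(toy_examples, BIN_WIDTH)
for lower_edge in sorted(histogram):
    counts = histogram[lower_edge]
    print(f"{lower_edge}-{lower_edge + BIN_WIDTH - 1}: "
          f"negative={counts['negative']} positive={counts['positive']}")
```

In this toy data the low bins are mostly negative and the high bins mostly positive, which is the kind of pattern that marks an attribute as useful.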
Now let’s look at pregnancy.
To me, this doesn’t look like a strong correlation.
There may be one here, but it’s less obvious.
Now here’s where things get really interesting.
I’ve so much to show you, but I want to jump in and classify this data.
So let’s head over to the Classify tab.
There’s a whole bunch of built-in classifiers.
Let’s start with the decision tree.
J48 is one type of tree that does pruning.
We’ll hit Start, and bam. We’re done.
We just trained a decision tree on the diabetes dataset.
Let’s say we also wanted to train a linear classifier, like logistic regression.
To do that, we can go into Functions, Logistic, and hit Start.
And there we go.
This is great, because we can flip back and forth between the two and compare the results.
There are many types of classifiers in Weka if you’re interested, everything from naive Bayes to basic neural networks.
Now let’s head back to the tree and see the results.
There’s a lot of information on this screen, so let me walk you through it.
Let’s scroll to the top and start there.
First, you can see the trained tree.
As always, you read it from the top down.
It’s telling us to start by looking at the value of plasma.
This happens to be the most predictive attribute in the dataset, but we’ll return to that later.
Scrolling down, we can see the accuracy was about 73%.
But what exactly was the accuracy evaluated on?
Well, here you can see Weka gives you three options.
The first would be to compute the accuracy on the training set.
If we do that, of course it will be higher.
It goes up to 84%, because we’ve tested the tree on data it’s already seen.
Of course, this isn’t useful in the real world.
As always in machine learning, our goal is to generalize from the training data.
Ideally, we want a model that performs well on data it’s never seen before.
So how can we know if it does?
Well, one way to simulate that is to have a separate test set.
Basically, you can divide the diabetes ARFF file into two separate files, one for training and one for testing.
Use the testing file only rarely to see how well your algorithm performs.
Another thing we can do is use cross-validation.
This sounds fancy, but all it does is iteratively divide the dataset into two chunks.
The larger chunk is used for training, and the smaller one is used for testing.
We train a model, evaluate it, and repeat this process a number of times, then average the results.
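The loop described here can be sketched in a few lines of Python. This is not Weka’s implementation; the function names are hypothetical, the labels are toy data, and the "model" is a stand-in that just predicts the most common training label so the example stays self-contained. A real classifier would also use the features.

```python
import random
from collections import Counter


def k_fold_indices(n_examples, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k roughly equal folds."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def majority_class_accuracy(train_labels, test_labels):
    """Stand-in 'model': predict the most common training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    correct = sum(1 for label in test_labels if label == majority)
    return correct / len(test_labels)


def cross_validate(labels, k, score_fn, seed=0):
    """Average score_fn over k train/test splits of the labels."""
    folds = k_fold_indices(len(labels), k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        # The smaller chunk is held out; the rest is used for training.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        train = [labels[j] for j in train_idx]
        test = [labels[j] for j in test_idx]
        scores.append(score_fn(train, test))
    return sum(scores) / len(scores)


# Toy labels (made up): 70 negative, 30 positive.
toy_labels = ["negative"] * 70 + ["positive"] * 30
print(cross_validate(toy_labels, k=10, score_fn=majority_class_accuracy))
```

Each example is used for testing exactly once, so the averaged score reflects performance on data the model did not see during training.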
And Weka automates this process for you.
Now, let’s look more closely at the evaluation.
You’ll see stats like these for any classifier you train.
Importantly, notice that there are metrics like precision and recall in addition to accuracy.
Why?
Well, although accuracy is the first thing we think of when evaluating a classifier, it doesn’t always tell us the whole story, especially in datasets where one class is rare.
For example, let me show you how to write a 99% accurate classifier that doesn’t really do anything at all.
Imagine you’re writing a program to assist a doctor, like we’re doing with this diabetes dataset.
Now imagine that the disease we want to predict is very rare.
Say only one person in 100 is sick.
So how can you train a 99% accurate classifier without using any ML at all?
Well, it’s simple.
It turns out you can write one line of Python.
def diagnose(): return "healthy"
Because most people are healthy, just by predicting that everyone is, or the majority class, we’re 99% accurate, but not useful.
Our model will always be wrong when the patient is sick.
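Here is that argument worked through as a short Python sketch. The population of 100 patients is hypothetical, and the `patient` parameter is an assumption added so the function looks like a real predictor even though it ignores its input.

```python
def diagnose(patient):
    """Always predict the majority class, ignoring the patient entirely."""
    return "healthy"


# Hypothetical population: 1 sick patient in every 100.
true_labels = ["sick"] + ["healthy"] * 99
predictions = [diagnose(patient=None) for _ in true_labels]

correct = sum(1 for t, p in zip(true_labels, predictions) if t == p)
accuracy = correct / len(true_labels)

# Of the patients who are actually sick, how many did we catch?
sick_total = sum(1 for t in true_labels if t == "sick")
sick_caught = sum(
    1 for t, p in zip(true_labels, predictions) if t == "sick" and p == "sick"
)
recall_sick = sick_caught / sick_total

print(f"accuracy: {accuracy:.2f}")           # accuracy: 0.99
print(f"recall on sick: {recall_sick:.2f}")  # recall on sick: 0.00
```

The accuracy looks excellent, yet the recall on the sick class is zero, which is exactly the case accuracy alone hides.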
That’s why when we evaluate classifiers, we have to look at accuracy on both the positive and negative cases.
There are different ways to do this.
A confusion matrix like we see below is one of my favorites.
You can find a link in the description to learn more about it.
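For readers without the video on screen, a confusion matrix can be built by counting (actual, predicted) pairs. The sketch below uses ten made-up labels and predictions, not output from Weka, and derives precision and recall from the four cells.

```python
from collections import Counter

# Made-up true labels and predictions for ten patients.
true_labels = ["pos", "pos", "pos", "pos", "neg", "neg", "neg", "neg", "neg", "neg"]
predictions = ["pos", "pos", "neg", "neg", "neg", "neg", "neg", "neg", "pos", "neg"]

# Count each (actual, predicted) pair; these are the matrix cells.
cells = Counter(zip(true_labels, predictions))
tp = cells[("pos", "pos")]  # actual positive, predicted positive
fn = cells[("pos", "neg")]  # actual positive, missed
fp = cells[("neg", "pos")]  # actual negative, wrongly flagged
tn = cells[("neg", "neg")]  # actual negative, predicted negative

print("            predicted pos  predicted neg")
print(f"actual pos  {tp:13d}  {fn:13d}")
print(f"actual neg  {fp:13d}  {tn:13d}")

accuracy = (tp + tn) / len(true_labels)
precision = tp / (tp + fp)  # of those flagged positive, how many were right
recall = tp / (tp + fn)     # of the actual positives, how many were found
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Reading across a row shows how one actual class was split between correct and incorrect predictions, which is the per-class view that a single accuracy number cannot give.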
Now onto another topic.
Imagine we had asked the question, which attributes in the dataset are important?
Here we don’t want to train a model. We just want to explore the data.
There’s a technique we can use called feature selection which can help.
The first thing we can do is rank the attributes by their information gain.
Let’s head back to the diabetes dataset for an example.
We can hit Filters, Supervised, Attribute, AttributeSelection, then select InfoGain as the method and Ranker as the search.
When we run this, the attributes will be sorted by how useful they are to predict the label.
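Information gain is the drop in label entropy after splitting the data on an attribute. The following Python sketch computes it for nominal attributes and ranks them. The attribute names and values are made up for illustration, and a numeric attribute such as plasma would need to be discretized first, which this sketch leaves out.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(values, labels):
    """Entropy of the labels minus the weighted entropy after splitting
    on a nominal attribute's values."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(
        (len(group) / total) * entropy(group) for group in groups.values()
    )
    return entropy(labels) - remainder


# Made-up toy data: two nominal attributes and a class label.
labels = ["pos", "pos", "pos", "pos", "neg", "neg", "neg", "neg"]
toy_attributes = {
    "glucose_high": ["y", "y", "y", "y", "n", "n", "n", "n"],  # separates perfectly
    "coin_flip":    ["y", "n", "y", "n", "y", "n", "y", "n"],  # tells us nothing
}

# Rank attributes from most to least informative.
ranking = sorted(
    toy_attributes,
    key=lambda name: information_gain(toy_attributes[name], labels),
    reverse=True,
)
for name in ranking:
    print(name, round(information_gain(toy_attributes[name], labels), 3))
```

Note that each attribute is scored on its own, so a ranking like this says nothing about how well two attributes work together.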
If you could know just one feature from the dataset, you’d probably want to know plasma.
If you could know two things, you’d probably also want to know mass.
But keep in mind we haven’t done a search.
It’s possible that these two attributes are not the best combination.
There are other methods of selecting attributes, like this, if you want to find the best subset to use.
An exhaustive search can be computationally expensive, though.
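The cost comes from the number of candidate subsets, which doubles with every attribute added. A short Python sketch of the arithmetic, using hypothetical attribute names for the enumeration:

```python
from itertools import combinations


def count_nonempty_subsets(n_attributes):
    """Number of non-empty attribute subsets: 2**n - 1."""
    return 2 ** n_attributes - 1


# Hypothetical attribute names used only to show the enumeration.
names = ["a", "b", "c"]
subsets = [
    combo for size in range(1, len(names) + 1)
    for combo in combinations(names, size)
]
print(len(subsets), subsets)

for n in (8, 16, 32):
    print(n, "attributes ->", count_nonempty_subsets(n), "candidate subsets")
```

With the eight non-class attributes of the diabetes data that is only 255 subsets, but each one requires training and evaluating a model, and the count grows past four billion by 32 attributes.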
Now let’s look briefly at the vote data.
I’ll move faster this time, because training and evaluating a classifier uses the same pattern.
This dataset is from the US Congress, and the goal is to predict if a representative is a Democrat or a Republican based on how they voted on different bills.
As before, let’s start with class.
Here, blue are Democrats and red are Republicans.
This dataset was collected back in the 1980s, so these ratios are different than they are today.
Each attribute describes how a congressperson voted on different bills, and many are predictive.
If you flip through them, you’ll see that many votes are divided along party lines.
As before, you can read details about the bills in the ARFF file if you’re interested.
If we train a tree, these are the rules you can use to predict the political affiliation of a member of Congress based on their voting history.
It’s still amazing to me how easy this tool is to use, and I find it helpful on a regular basis.
I usually begin my experiments with a decision tree to learn more about the data and as a sanity check for a baseline classifier before I move on to more complex models like neural nets.
OK. I hope this was helpful and that Weka makes it easier for you to learn ML.
Thanks, everyone, and I’ll see you next time.
[MUSIC PLAYING]

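Translation credits
Video overview

This video introduces Weka, a visual tool for machine learning and data analysis that is very useful for learners.

Transcription

Collected from the web

Translator

知易行难

Reviewer

审核员

Video source

https://www.youtube.com/watch?v=TF1yh5PKaqI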