
Getting Started with Weka - Machine Learning Recipes #10

[MUSIC PLAYING]
JOSH GORDON: Hey, everyone.
Today I’d like to make a quick video that I hope will be concrete and useful for you.
It’s about the very first machine learning library I ever tried, and it’s called Weka.
What’s great is that Weka comes with a GUI that makes it easy to visualize your datasets and to train and compare different classifiers.
This is a really handy tool to have while you’re learning ML.
I’ll give you a quick walkthrough of how to use Weka, from installation all the way to running experiments, and show you some of what it can do.
I’ll demo training models on two different datasets.
First, we’ll predict if a patient has diabetes based on attributes like their glucose levels.
Next, we’ll predict if a congressperson is a Democrat or Republican based on how they voted on different bills.
I’ll also show you how to evaluate the results of these experiments and how to do things like feature selection to discover which attributes are important.
OK, let’s dive right in.
The first thing we’ll do is download and install Weka.
What’s neat is that it comes as a nicely packaged application you can run on Mac, Windows, or Linux.
There’s also a Java API.
Here, I’ll download and install Weka, and now I’ll start it up.
There are different interfaces, and we’ll use the Explorer.
There’s a lot on this screen, but don’t worry about it.
You’ll get a feel for how this works in a moment.
The first thing to do is open a dataset, so we’ll hit Open.
And now would be a good time to download one.
You can find a bunch of prepackaged datasets on this page, and we’ll start with the UCI repository.
It contains about 37 problems.
When you download it, you’ll get a JAR.
You might be familiar with these if you’re a Java developer, but if not, don’t worry: you can treat them as a ZIP.
Here I’ll unzip it, and now we can see a directory of datasets.
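The remark that a JAR can be treated as a ZIP can be checked with Python's standard zipfile module. The sketch below is only an illustration: it builds a tiny in-memory archive as a stand-in for the real download, and the entry names (UCI/diabetes.arff, UCI/vote.arff) and the helper names are assumptions, not the actual layout of the Weka dataset JAR.

```python
import io
import zipfile


def list_arff_entries(jar_bytes):
    """Return the sorted names of all .arff entries in a JAR given as raw bytes."""
    # A JAR uses the ZIP container format, so zipfile can read it directly.
    with zipfile.ZipFile(io.BytesIO(jar_bytes)) as jar:
        return sorted(n for n in jar.namelist() if n.lower().endswith(".arff"))


def make_demo_jar():
    """Build a tiny in-memory archive standing in for the downloaded JAR."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as jar:
        jar.writestr("META-INF/MANIFEST.MF", "Manifest-Version: 1.0\n")
        jar.writestr("UCI/diabetes.arff", "@relation demo\n")  # hypothetical path
        jar.writestr("UCI/vote.arff", "@relation demo\n")  # hypothetical path
    return buf.getvalue()


demo_entries = list_arff_entries(make_demo_jar())
print(demo_entries)
```

For a real download, reading the file's bytes from disk and passing them to list_arff_entries should work the same way.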
Let’s return to Weka and open one of these up, starting with diabetes.
All right, what do we see here?
Let me walk you through it.
First, let’s learn about the dataset.
At the top, you can see there are 768 examples, or instances, and nine attributes, or features.
The best attribute to start with is class, the label we want to predict.
Usually in Weka, that’s the last attribute in a dataset.
Clicking on that, we can see a histogram.
The blue column on the left shows the number of people who tested negative for diabetes, and the red column on the right shows those who tested positive.
Now let’s look at the attributes we’ll use to predict if a patient has the disease.
The descriptions here are pretty short, but we can open up the dataset in Sublime or your favorite text editor to learn more about what they mean, as well as how the dataset was collected.
Now, Weka datasets come in the ARFF format.
This is essentially a CSV with some metadata included at the top.
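To make the "CSV with metadata on top" description concrete, here is a minimal sketch of a reader for simple ARFF text. It is not Weka's parser: it assumes dense data, unquoted attribute names without spaces, and no sparse or date features. The sample content and attribute names are toy values chosen for illustration.

```python
import csv

SAMPLE_ARFF = """\
% Comment lines start with a percent sign.
@relation toy_diabetes
@attribute preg numeric
@attribute plas numeric
@attribute class {tested_negative,tested_positive}
@data
6,148,tested_positive
1,85,tested_negative
"""


def parse_arff(text):
    """Parse a simple dense ARFF string into (attribute_names, rows)."""
    names, data_lines, in_data = [], [], False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        if in_data:
            data_lines.append(line)  # everything after @data is plain CSV
        elif line.lower().startswith("@attribute"):
            names.append(line.split()[1])  # second token is the attribute name
        elif line.lower().startswith("@data"):
            in_data = True
    rows = list(csv.reader(data_lines))
    return names, rows


names, rows = parse_arff(SAMPLE_ARFF)
print(names)
print(rows[0])
```

The header lines carry the attribute names and types, and the body after @data is read with the ordinary csv module.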
Scrolling down a bit, we can see a description of the attributes.
The first tells us the number of times a patient was pregnant, and the second tells us their plasma glucose.
For diabetes, I imagine one of these is more predictive than the other.
Let’s see if Weka can tell us that, too.
Back in the GUI, let’s click on plasma.
What’s cool is you can see a histogram of how different values correlate to the class we want to predict.
Recall that blue is negative, and red is positive.
Right off the bat, we can see this is a useful attribute: if plasma is low, say below about 100, then it’s unlikely the patient has diabetes.
Most of these values are blue, whereas as the value increases, it’s increasingly likely that a patient has the disease.
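The class-colored histogram described above amounts to counting labels inside value ranges. A rough sketch of that idea follows; the plasma values and labels are made up for illustration and are not taken from the diabetes dataset, and the bin width of 50 is an arbitrary choice.

```python
from collections import Counter


def class_counts_by_bin(values, labels, bin_width):
    """Count class labels inside fixed-width bins of a numeric attribute."""
    bins = {}
    for value, label in zip(values, labels):
        start = int(value // bin_width) * bin_width  # left edge of the bin
        bins.setdefault(start, Counter())[label] += 1
    return dict(sorted(bins.items()))


# Made-up (plasma, label) pairs, for illustration only.
plasma = [85, 89, 95, 110, 120, 137, 148, 168, 183, 197]
label = ["neg", "neg", "neg", "neg", "pos", "neg", "pos", "pos", "pos", "pos"]

hist = class_counts_by_bin(plasma, label, bin_width=50)
for start, counts in hist.items():
    print(f"[{start}, {start + 50}): neg={counts['neg']} pos={counts['pos']}")
```

In this toy data the lowest bin is entirely negative and the highest entirely positive, which is the pattern the speaker points out in the GUI.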
Now let’s look at pregnancy.
To me, this doesn’t look like a strong correlation.
There may be one here, but it’s less obvious.
Now here’s where things get really interesting.
I have so much to show you, but I want to jump in and classify this data.
So let’s head over to the Classify tab.
There’s a whole bunch of built-in classifiers.
Let’s start with the decision tree.
J48 is one type of tree that does pruning.
We’ll hit Start, and bam, we’re done.
We just trained a decision tree on the diabetes dataset.
Let’s say we also wanted to train a linear classifier, like logistic regression.
To do that, we can go into Functions, Logistic, and hit Start.
And there we go.
This is great, because we can flip back and forth between the two and compare the results.
There are many types of classifiers in Weka if you’re interested, everything from naive Bayes to basic neural networks.
Now let’s head back to the tree and see the results.
There’s a lot of information on this screen, so let me walk you through it.
Let’s scroll to the top and start there.
First, you can see the trained tree.
As always, you read it from the top down.
It’s telling us to start by looking at the value of plasma.
This happens to be the most predictive attribute in the dataset, but we’ll return to that later.
Scrolling down, we can see the accuracy was about 73%.
But what exactly was the accuracy evaluated on?
Well, here you can see Weka gives you three options.
The first would be to compute the accuracy on the training set.
If we do that, of course it will be higher.
It goes up to 84%, because we’ve tested the tree on data it’s already seen.
Of course, this isn’t useful in the real world.
As always in machine learning, our goal is to generalize from the training data.
Ideally, we want a model that performs well on data it’s never seen before.
So how can we know if it does?
Well, one way to simulate that is to have a separate test set.
Basically, you can divide the diabetes ARFF file into two separate files, one for training and one for testing.
Use the testing file only rarely to see how well your algorithm performs.
Another thing we can do is use cross-validation.
This sounds fancy, but all it does is iteratively divide the dataset into two chunks.
The larger chunk is used for training, and the smaller one is used for testing.
We train a model, evaluate it, repeat this process a number of times, and then average the results.
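The loop just described (split, train, evaluate, repeat, average) can be sketched in plain Python. This is an illustration of the general idea, not a reproduction of Weka's own procedure: the folds are formed by simple interleaving, the "model" is a stand-in that only memorizes the most common training label, and the labels are toy values.

```python
from collections import Counter


def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs; each example is tested exactly once."""
    for fold in range(k):
        test = [i for i in range(n) if i % k == fold]  # the smaller chunk
        train = [i for i in range(n) if i % k != fold]  # the larger chunk
        yield train, test


def cross_validate(labels, k):
    """Average accuracy of a majority-class stand-in model over k folds."""
    scores = []
    for train, test in k_fold_indices(len(labels), k):
        # "Training" here just memorizes the most common training label.
        majority = Counter(labels[i] for i in train).most_common(1)[0][0]
        correct = sum(labels[i] == majority for i in test)
        scores.append(correct / len(test))
    return sum(scores) / len(scores)


# Toy labels: 7 negatives and 3 positives, purely illustrative.
labels = ["neg", "neg", "pos", "neg", "neg", "neg", "pos", "neg", "pos", "neg"]
folds = list(k_fold_indices(len(labels), k=5))
cv_accuracy = cross_validate(labels, k=5)
print(f"{len(folds)} folds, mean accuracy {cv_accuracy:.2f}")
```

Swapping the majority-label step for a real learner would keep the same structure: fit on the training indices, score on the test indices, then average.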
And Weka automates this process for you.
Now, let’s look more closely at the evaluation.
You’ll see stats like these for any classifier you train.
Importantly, notice that there are metrics like precision and recall in addition to accuracy.
Why?
Well, although accuracy is the first thing we think of when evaluating a classifier, it doesn’t always tell us the whole story, especially in datasets where one class is rare.
For example, let me show you how to write a 99% accurate classifier that doesn’t really do anything at all.
Imagine you’re writing a program to assist a doctor, like we’re doing with this diabetes dataset.
Now imagine that the disease we want to predict is very rare.
Say only one person in 100 is sick.
So how can you train a 99% accurate classifier without using any ML at all?
Well, it’s simple.
It turns out you can write one line of Python: def diagnose(): return "healthy".
Because most people are healthy, just by predicting that everyone is, that is, by always predicting the majority class, we’re 99% accurate, but not useful.
Our model will always be wrong when the patient is sick.
That’s why when we evaluate classifiers, we have to look at accuracy on both the positive and negative cases.
There are different ways to do this.
A confusion matrix like we see below is one of my favorites.
You can find a link in the description to learn more about it.
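The 99% example can be worked through numerically. The sketch below assumes a hypothetical population of 100 patients with exactly one sick, applies the always-"healthy" function from the transcript (given an unused parameter here so it can be called per patient), and tallies a confusion matrix to show that accuracy and recall tell different stories.

```python
def diagnose(patient):
    """The 'classifier' from the transcript: ignore the input entirely."""
    return "healthy"


def confusion_matrix(actual, predicted, positive="sick"):
    """Return (tp, fp, fn, tn) counts, treating `positive` as the rare class."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


# Hypothetical population: 1 sick patient out of 100.
actual = ["sick"] + ["healthy"] * 99
predicted = [diagnose(patient_id) for patient_id in range(len(actual))]

tp, fp, fn, tn = confusion_matrix(actual, predicted)
accuracy = (tp + tn) / len(actual)
recall = tp / (tp + fn) if (tp + fn) else 0.0
print(f"accuracy={accuracy:.2f} recall={recall:.2f} tp={tp} fn={fn}")
```

Accuracy comes out at 0.99 while recall on the sick class is 0, because the single sick patient is the one case the function gets wrong.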
Now onto another topic.
Imagine we had asked the question, which attributes in the dataset are important?
Here we don’t want to train a model.
We just want to explore the data.
There’s a technique called feature selection that can help.
The first thing we can do is rank the attributes by their information gain.
Let’s head back to the diabetes dataset for an example.
We can hit Filters, Supervised, Attribute, AttributeSelection, then select InfoGain as the method and Ranker as the search.
When we run this, the attributes will be sorted by how useful they are for predicting the label.
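Ranking attributes by information gain can be sketched as follows. This is the textbook definition (label entropy minus the weighted entropy after splitting on an attribute) applied to hand-discretized toy data; it is not Weka's implementation, and the feature values are invented for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(values, labels):
    """Label entropy minus the weighted entropy after splitting on `values`."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Toy data: attributes are discretized by hand and the values are made up.
labels = ["neg", "neg", "neg", "pos", "pos", "pos"]
features = {
    "plasma": ["low", "low", "low", "high", "high", "high"],
    "preg": ["few", "many", "few", "many", "few", "many"],
}

gains = {name: information_gain(vals, labels) for name, vals in features.items()}
ranking = sorted(gains, key=gains.get, reverse=True)  # most informative first
for name in ranking:
    print(name, round(gains[name], 3))
```

In this toy example the attribute that separates the classes perfectly ranks first, which mirrors how a ranked list of attributes is meant to be read.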
If you could know just one feature from the dataset, you’d probably want to know plasma.
If you could know two things, you’d probably also want to know mass.
But keep in mind we haven’t done a search.
It’s possible that these two attributes are not the best combination.
There are other methods of selecting attributes, like this one, if you want to find the best subset to use.
An exhaustive search can be computationally expensive, though.
Now let’s look briefly at the vote data.
I’ll move faster this time, because training and evaluating a classifier uses the same pattern.
This dataset is from the US Congress, and the goal is to predict if a representative is a Democrat or a Republican based on how they voted on different bills.
As before, let’s start with class.
Here, blue are Democrats and red are Republicans.
This dataset was collected back in the 1980s, so these ratios are different than they are today.
Each attribute describes how a congressperson voted on a particular bill, and many are predictive.
If you flip through them, you’ll see that many votes are divided along party lines.
As before, you can read details about the bills in the ARFF file if you’re interested.
If we train a tree, these are the rules you can use to predict the political affiliation of a member of Congress based on their voting history.
It’s still amazing to me how easy this tool is to use, and I find it helpful on a regular basis.
I usually begin my experiments with a decision tree to learn more about the data and as a sanity check for a baseline classifier before I move on to more complex models like neural nets.
OK.
I hope this was helpful and that Weka makes it easier for you to learn ML.
Thanks, everyone, and I’ll see you next time.
[MUSIC PLAYING]


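Translation credits
Video summary

This video introduces Weka, a visual tool for machine learning data and algorithm analysis that is very useful for learners.

Transcription

Collected from the web

Translator

知易行难

Reviewer

审核员

Video source

https://www.youtube.com/watch?v=TF1yh5PKaqI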
