
#0 什么是数据分析? – 译学馆


Data Analysis 0: Introduction to Data Analysis - Computerphile

Okay, so artificial intelligence, machine learning, data mining, data analysis,
人工智能 机器学习 数据挖掘 数据分析
clustering classification, data pre-processing,
聚类分析 数据预处理
big data.
00 – 数据分析入门 电脑狂热
It’s hard to go anywhere now without hearing about AI and machine learning and data,
现在到处都在谈论AI 机器学习和数据
data particularly, it’s everywhere.
尤其是数据 到处都是
Researchers have suggested that every two years we generate more data than ever existed before
研究表明 我们每两年就会制造出比以往更多的数据
So the amount of data is doubling every two years.
That is actually, you know, an astronomical amount of data,
but the thing is, of course, that this data doesn’t necessarily mean anything.
但当然了 这些数据不一定有意义
In fact, you can create tables of data
but unless you understand what’s in them and what they mean,
you haven’t got any knowledge, right?
你就没学到什么知识 对吧?
So there’s a distinction between having data and having knowledge.
所以有数据和有知识 二者是存在区别的
So it’s all very well saying, yes, as a species, we’re producing a huge amount of data,
没错 我们人类制造出了大量数据
but actually a lot of it doesn’t get used.
a lot of it sits there on a hard disk, waiting for someone to look at it.
And that’s kind of what we’re talking about here.
If we want to extract knowledge from data,
we’re going to need some tools and processes to do this in a formal way,
and that’s what data science is, right?
And things like machine learning and AI have a place within it
So perhaps if you do this for your job,
then data analysis is going to be useful for you.
Maybe your company’s generating data and you want to analyze this data?
But on the other hand, perhaps you’re just a consumer, and companies are using data on you.
也有可能你只是一名消费者 而公司在对你使用数据
They’re generating data on you, and actually they’re profiting from data on you.
他们拿你去生成数据 而且还拿你的数据去盈利
These are sometimes life-changing decisions that are being made on your data.
And so it’s empowering to know how this process works.
And I have a very simple example which you might even do yourself.
我举一个非常简单的例子 你甚至可以自己动手试试
Suppose you go online to book some flights for a holiday,
and then you decide that actually two flights via an intermediate airport
你发现 订两个中转航班
is cheaper than a single flight, right?
You’re doing data analysis. You’re taking lots of different data sources
你这就是在做数据分析 你收集了大量不同数据源
and working out the optimal route.
And this of course happens automatically as well,
depending on the flight website that you’re using.
All right, so this kind of stuff you’re already doing it.
It’s just a case of trying to formalize this process.
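One way to formalize the flight comparison is sketched below in R. The fares and airport codes are made-up numbers for illustration, not anything from the video; the point is only that "is the two-leg route cheaper?" becomes a small table and a minimum.

```r
# Hypothetical fares (illustrative assumptions only).
direct_fare <- 420

# Each row is one way of flying via an intermediate airport.
leg_fares <- data.frame(
  via    = c("AMS", "CDG", "FRA"),
  first  = c(150, 180, 210),   # fare for the first leg
  second = c(230, 260, 190),   # fare for the second leg
  stringsAsFactors = FALSE
)
leg_fares$total <- leg_fares$first + leg_fares$second

# Pick the cheapest two-leg route, then compare it with the direct flight.
best <- leg_fares[which.min(leg_fares$total), ]
if (best$total < direct_fare) {
  choice <- paste("via", best$via)
  price <- best$total
} else {
  choice <- "direct"
  price <- direct_fare
}
cat("Cheapest option:", choice, "at", price, "\n")
```

With these assumed fares the route via AMS wins, because 150 + 230 is less than the direct fare of 420.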
So what do any of the things I listed at the beginning mean?
那么 我在视频开头列出那些名词是什么意思呢?
Well, one problem is that everyone’s definitions differ slightly,
but also I think that a lot of these terms are used completely interchangeably.
但我认为 大量这种术语是完全可以通用的
AI is the classic example.
So AI is everywhere, right? You can’t buy a product without it having had AI added to it.
AI到处都是 购买产品都要用到AI
A lot of the time you see AI,
we’re actually talking about machine learning
So machine learning is the idea that we’re training a machine to perform a task
机器学习指的是 在没有显式编程的前提下
without explicitly programming it to do so.
A good example of AI that isn’t machine learning would be, let’s say a mouse in a maze,
迷宫中的老鼠 是说明AI并非机器学习的一个好例子
where all you’re doing is telling it to turn left or right at random.
It’s not learning anything, it doesn’t understand what the maze is,
这只老鼠没有学到任何东西 也也不明白什么是迷宫
but it will eventually get to the end, right?
That’s a kind of rudimentary artificial intelligence that doesn’t involve learning anything.
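The random mouse can be sketched in a few lines of R. This is a simplification under stated assumptions: the "maze" here is just a straight corridor of 10 cells (a hypothetical stand-in), and the mouse steps left or right at random until it reaches the far end. Nothing is learned along the way.

```r
set.seed(1)                 # fixed seed so the run is repeatable

corridor_length <- 10       # assumed size of the corridor
position <- 1               # the mouse starts at the first cell
steps <- 0

while (position < corridor_length) {
  move <- sample(c(-1, 1), 1)            # turn left or right at random
  position <- max(1, position + move)    # the wall at cell 1 blocks further left moves
  steps <- steps + 1
}

cat("Reached the end after", steps, "random steps\n")
```

The mouse gets there eventually, but it would take just as long on the next run, because no knowledge about the corridor is kept.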
Machine learning is about not giving it conditions,
not saying “if you’re here, turn left; if you’re here, turn right”.
不是“如果你到这里就左转 到这里就右转”
It’s just giving it examples and hoping it will learn to perform most tasks itself, right?
而只是给出案例 希望机器能学会自己去执行大多数任务
So machine learning is a subset of AI, but they shouldn’t be used interchangeably.
所以 机器学习是AI的一部分 二者不应通用
If we’re using machine learning, often what we’ll do
如果要用机器学习的话 我们往往要
is we train it based on samples of data.
So we’ll have some existing data set that we’re trying to train on,
and we’re trying to use machine learning to either
tease out information or make predictions on these data.
The problem is that not all data is sort of made equal.
但问题在于 数据质量不一
Some of it’s noisy and messy, maybe we don’t know what it is
有的数据非常混乱 存在很多噪声 我们可能看不出它是什么
and don’t know whether we can apply a certain technique to it, right?
And so we need to clean this data up.
We need to take this data, understand what it is and extract some knowledge,
我们要获取数据 了解数据 并且从中提取出一些知识
so that we can then apply these AI or machine learning techniques to it.
So this combination of things that can take data and prepare it
获取数据 以及为使用和理解它们
in a way that we can then use it or understand it, that’s data science.
而做准备的整个过程 就是数据科学
There are quite a few ways we could do this data analysis right throughout this course.
We could use R, we could use Python, we could use MATLAB. They all have their pros and cons
包括R Python和MATLAB 这些工具各有利弊
We’re gonna use R because it’s free and it’s really good for statistical analysis
我们将选择R 因为它免费 而且非常适合统计分析
It’s got loads of great libraries.
If you’re really familiar with Python, then maybe that’s what you want to start with for this kind of stuff.
如果你很熟悉Python的话 也可以用它来入门
But we know we’re going to be working with R
但在本课中 我们要用R
We have our script area here where we can write scripts and run scripts.
这里是脚本区域 可以写脚本和运行脚本
You can save them and then come back to them later.
你可以保存脚本 回头再来编辑
Then the console, where we’re going to be putting in, you know, specific commands.
这里是控制台 用于运行一些命令
We have our environment, which is where all our variables and our data is held
and we can look at them there.
And then we have plots, any plots, which you can do quite a lot of different plots in R, very versatile.
还有图像 你可以用R画很多种类的图像 非常万能
That’s going to appear down here.
Okay, so you’ve probably got everything you need to get started with data analysis.
有了这些 你就可以开始数据分析了
In my opinion, the best way to get into R is just to kind of have a go.
在我看来 学习R的最佳方法就是上手试试
So we’re going to look at a few of the most obvious things that it does.
It has a little bit of a learning curve, only because its syntax is slightly unusual.
R学起来有点费工夫 只因它的语法有些与众不同
If you can program you’ll be fine, but even if not, you should get there pretty quickly.
如果你有编程基础 就没什么问题 但即使没有 你也能很快上手
Most of the time in R we’ll be using either matrices or vectors
在R中 我们大部分使用的是矩阵 向量
which are kind of a special case of matrices, or maybe data frames.
或矩阵的一种特别形式 或数据框
Data frames are a really nice aspect of R,
which you can kind of think of like a table that you might have in Excel,
except you’ve also got headings for your columns.
So let’s have a look at some of these things, and just a few of the things we can do with them
before we perhaps go into a little bit more detail in other videos.
So for example, we might look at our variable X which I’ve created
举个例子 看我创建的这个变量X
and X is a sequence going from 0 all the way up to a few multiples of Pi,
which I used to create this plot.
That was only one line of code that produced that
and I’ve used that to create my plot by essentially saying y equals sin(X),
and then just simply plotting that.
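A minimal sketch of that sine plot follows. The exact range and step size are assumptions, since the video only says "a few multiples of Pi".

```r
# A sequence from 0 up to a few multiples of pi (4 * pi and the 0.1 step are assumed).
x <- seq(0, 4 * pi, by = 0.1)

# y equals sin(x), computed element by element over the whole vector.
y <- sin(x)

# Draw the curve as a line.
plot(x, y, type = "l")
```

Note that `sin` is applied to the whole vector at once, which is why no loop is needed.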
If you wanna get a little bit more complicated, we can start looking at matrix data.
如果你想更复杂点 我们可以考虑一下矩阵数据
So I created a CSV file with a Gaussian function in it.
So essentially a two-dimensional array of values
that get bigger in the center. Very straightforward.
越靠近中心的值越大 很好懂吧
The CSV file is essentially a text file with commas separating those values,
very easy to read and write these out of Excel and other packages
and so you’ll often find data is passed around in this way,
at least moderately sized data, if it isn’t, you know, too huge.
I can load this in using my “read.csv” function.
So I can say “namedata”.
Now the arrow operator is essentially R’s equivalent
of the assignment operator, or equals.
Equals will often work, but I tend to try and use this one. So “namedata”…
等号通常也能行 但我喜欢用箭头 输入“namedata”
I’m going to assign “read.csv” and the file is going to be “norm.csv”
我要把”read.csv”这个函数赋给它 文件是”norm.csv”
And I’ve got no header for this file,
so I don’t want it to use the top row for the labels
So I’m going to say “header = FALSE”.
And that’s loaded in “namedata”. And we can have a look,
然后数据就存到“namedata”里了 我们可以看看
so I’m gonna click on “namedata” here.
And if we click on it, you can see we’ve got
点击它 你就能看到
the rows and the columns of our data in here.
数据有多少行 多少列
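The loading step might look like the sketch below. Because the original norm.csv is not available, the snippet first writes a small stand-in file to a temporary folder; the 7 by 7 size and the Gaussian formula are assumptions, and only `read.csv(..., header = FALSE)` mirrors what is described in the video.

```r
# Build a small two-dimensional Gaussian: values get bigger towards the center.
coords <- seq(-3, 3, by = 1)
gauss <- outer(coords, coords, function(a, b) exp(-(a^2 + b^2) / 2))

# Save it as a plain comma-separated file with no header row (stand-in for norm.csv).
csv_path <- file.path(tempdir(), "norm.csv")
write.table(gauss, csv_path, sep = ",", row.names = FALSE, col.names = FALSE)

# Load it with the arrow (assignment) operator; header = FALSE stops R
# from treating the first row of numbers as column labels.
namedata <- read.csv(csv_path, header = FALSE)

dim(namedata)   # number of rows and columns
```

Without a header, R names the columns V1, V2 and so on.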
We can look at individual elements in this array.
So we can say data at position three four,
比如想看坐标为[3, 4]的数据
and that’s going to be the third row down and the fourth value across.
We can also leave one empty and just have an entire row,
or conversely, an entire column, like this.
And so it’s very easy to take ranges of values.
If you’ve got a huge table of data, you can be selecting certain columns,
looking at certain columns, plotting certain columns.
This is one of the reasons why R is very popular.
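The indexing described above can be tried on any small table. The sketch below uses a made-up 4 by 5 table of the numbers 1 to 20 (an assumption, chosen so the results are easy to check by eye) rather than the Gaussian data from the video.

```r
# A small stand-in table: 4 rows, 5 columns, filled column by column with 1..20.
namedata <- as.data.frame(matrix(1:20, nrow = 4, ncol = 5))

namedata[3, 4]           # third row down, fourth value across: 15
namedata[3, ]            # leave the column empty to get the entire third row
namedata[, 4]            # leave the row empty to get the entire fourth column
namedata[1:2, c(2, 5)]   # a range of rows and a selection of columns
```

Rows come first and columns second inside the square brackets, and R counts from 1 rather than 0.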
Quite often when you’re looking at data,
we’ll actually be looking at something called a data frame.
Now a data frame – I’ve got to load one up –
is simply… in essence, a table of values, but they don’t all have to be the same type.
它其实就是一个数据表 但数据类型无需一致
So in an array, normally they’ll all be floats or they’ll all be integers.
In a data frame, there can be different things,
so you could have first and last name next to age, for example.
So I’ve just created a tiny little CSV file
with some random people in it. So let’s load this up.
里面有一些随机的人员信息 我们来载入看看
So I’m going to say “namedata”
assign “read.csv("names.csv")”
And if I look at “namedata”, you can see that it’s got three columns,
查看“namedata”数据集 你可以看到它有三列
it’s got firstname, surname and age,
分别是“名” “姓”和“年龄”
and five rows, and there’s five people in this dataset.
有五行 表示数据集中有五个人
And then you can do just like I did before,
but now we can also index by the names of these columns.
So I could say I want all of the first names for example,
举个例子 如果我想知道所有的名
so I can say “namedata$firstname”
and I can see all the different first names.
So you can start to look at this data set in more and more detail.
Obviously, this is an absolutely tiny data set, but you get the idea.
显然这不是一个绝对小的数据集 但你应该明白我的意思
You could also look at individual instances, so we could say “namedata”.
你也可以查看单个实例 先输入“namedata”
And I want just the second row, for example, “namedata[2,]”.
如果我只想看第二行 就输入“namedata[2,]”
There we go, Bill Jones and he’s 18 years old.
结果出来了 这个人叫Bill Jones 18岁
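A sketch of the data frame example follows. Only the column names and the second row (Bill Jones, 18) come from the video; the other four people are invented placeholders, and the file is written to a temporary folder so the snippet runs on its own.

```r
# Write a tiny stand-in for names.csv (all rows except Bill Jones are hypothetical).
csv_path <- file.path(tempdir(), "names.csv")
writeLines(c(
  "firstname,surname,age",
  "Alice,Smith,34",
  "Bill,Jones,18",
  "Carol,Taylor,52",
  "Dan,Brown,27",
  "Eve,Wilson,41"
), csv_path)

# This file has a header row, so the default header = TRUE is what we want.
namedata <- read.csv(csv_path)

namedata$firstname   # index a column by its name
namedata[2, ]        # just the second row: Bill Jones, 18
str(namedata)        # columns can hold different types (text and numbers)
```

The `$` form only works because the file has column headings, which is the difference from the header-less Gaussian file earlier.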
As we move through these videos, it’s going to be very common for us
随着这些视频的学习 我们将学会
to load in datasets like this in this format.
and then start to process them based on these data frames.
So perhaps an example, right? So let’s imagine you’re an online retailer,
我举个例子吧 假设你是一名网络零售商
and someone comes into your shop and buys something.
And maybe they… you’re trying to understand what it is that they do, so that you can,
你试着去了解他们的购买行为 这样才能
let’s say, send them emails to try and get them to buy more products,
举个例子 才能给他们发邮件 吸引他们购买更多商品
or show them recommended products and things like this.
或者给他们推荐商品 等等
So you want to try and build up a pattern of their behavior, right?
And all you’ve got is what they click on, what they add to their basket,
而你掌握的信息是 他们点击了什么 添加了什么到购物车
and what they buy, right?
So you’ve learned that they’re looking at these kinds of items and they look at these ones regularly.
你知道他们浏览了这几种商品 以及经常浏览这些商品
And then sometimes they just buy something completely random seemingly,
and that goes in their basket and gets bought straight away.
Maybe it’s a present right? So maybe it’s not tied to them as a person.
So you’re taking all of this data all of these purchases, all of these… products that they’re looking at,
你把这些购买记录 和他们浏览过的所有商品记录了下来
and you’re turning this into a kind of picture of this person,
and you’re clustering that person in with other consumers that bought similar things,
and trying to predict what they want to buy next, right?
And that’s when you send them an email saying “you should look at this one
这时候你就可以给他们发邮件说 “你应该看看这个商品
because this one’s really good and you didn’t buy it last time, but you’ll definitely want to buy it this time”.
上次你没有买它 但是它真的很好 这次你一定会想买的”
So we’ve got some data and we want to extract some knowledge.
我们掌握了一些数据 想从中提取一些知识
What’s the first thing we do?
We have to start to look at it
and try and tease out some kind of information or analyze this data.
The data analysis is the idea of using statistical measures to try and work out what’s going on.
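As a rough illustration of "statistical measures", the sketch below summarizes a made-up table of shop visits. The column names and numbers are assumptions for the example, not data from the video.

```r
# Hypothetical shop data: one row per customer visit.
orders <- data.frame(
  items_viewed = c(12, 5, 30, 8, 15),
  basket_value = c(40, 15, 95, 20, 55)
)

summary(orders)                          # min, quartiles, mean and max per column

mean_value <- mean(orders$basket_value)  # average basket value
sd_value <- sd(orders$basket_value)      # how spread out the values are

# Do people who view more items also spend more?
view_value_cor <- cor(orders$items_viewed, orders$basket_value)
```

A correlation close to 1 here would suggest that viewing and spending rise together, though five rows are far too few to conclude anything real.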
This is kind of a cycle. We’re going to analyze the data so we’re going to do a data analysis,
这是一种循环 我们要分析数据 所以要进行数据分析
and perhaps sometimes just using statistics to analyze the data isn’t enough.
You can’t really learn everything about it.
Yes, you can learn, you know, mathematically how it works,
的确 你能了解到它的数学原理
but you might not understand about what it all means
So visualizing the data can be really helpful.
So what we’ll also do is we’ll visualize the data – visualization.
So that’s going to be charting it, plotting it,
数据可视化指的是对数据做表 画图
trying to work out trends and links between different variables and things like this.
找出趋势和不同变量之间的联系 等等
And these kind of go back and forth, right,
you could do both of these things numerous times and work out what we’ve got, right?
So you’re gonna do something like this.
And then what we’re going to do is we’re going to preprocess the data.
Often you’ll find you’re recording much more data than you actually need, right?
有时候你会发现 你记录的数据比你所需要的多得多
This is certainly true of an online shop.
I’m going to be looking at a lot of products
that I don’t end up buying and was never really going to buy.
但最终并未购买 而且其实我本就并无购买的意愿
You know, maybe a pipe dream.
And they’ve got to sort of weed out this information
to work out what it is that they might actually better convince me to buy, right?
So you’re going to preprocess the data and remove the nonsense,
这就是预处理数据 删除无意义数据
and drill right down to the stuff that’s really useful.
So this is preprocessing.
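A small sketch of this kind of weeding out is shown below. The click log and the filtering rule (drop rows with a missing product, and drop products viewed only once and never added to the basket) are assumptions made up for the example.

```r
# Hypothetical click log for one customer.
clicks <- data.frame(
  product = c("saw", "drill", "yacht", "hammer", NA),
  views = c(6, 3, 1, 4, 2),
  added_to_basket = c(TRUE, FALSE, FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Keep a row only if the product is known AND the customer showed
# some real interest (more than one view, or it went in the basket).
keep <- !is.na(clicks$product) &
  (clicks$views > 1 | clicks$added_to_basket)
clean <- clicks[keep, ]

clean
```

Here the one-off look at a yacht and the row with a missing product name are removed, leaving the saw, the drill and the hammer.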
And this is going to be a kind of cycle of analysis and visualization and preprocessing,
数据分析 数据可视化和数据预处理可以构成一个循环
and we can repeat these things and then we can really drill down and whittle down our data
然后我们就能深挖 尽可能将数据压缩到
into the most usable sort of core of knowledge that we can.
And get the most out of it.
Now it may be that just analysing the data is enough, right?
You’ve now sort of obtained some knowledge.
You kind of understand what the trends are.
and maybe that was all you wanted to do. That’s sometimes the case.
觉得到此为止就行 有时候的确可以这样
Maybe actually what we want to do is take things a little bit further
We’re going to use machine learning or modeling
to try and model this system and predict what’s going to happen next.
来模拟该系统 预测下一步
So for example in the case of an online shop,
we might want to start predicting what people are going to buy next
and if we can do that, that’s when we can send out these emails
如果我们能成功 就可以给他们发送邮件
or flag things in their recommended items and get many more sales.
或标出给他们推荐的商品 增加销售量
As an example, let’s imagine that someone has spent a lot of time looking at DIY tools.
举个例子 假设一个人花了很多时间浏览DIY工具
I’ve, you know, recently moved house I spent a lot of time doing DIY,
我最近刚搬了家 也是花了很多时间去DIY
and I’m always trying to buy new tools because it just seems like a good idea.
我向来喜欢买新工具 因为觉得这样很好
So, you know, maybe I buy a certain kind of saw, and then you know a few months later,
可能我会买一款锯子 几个月后
they’re starting to recommend me a slightly different kind of saw that serves a slightly different purpose
店家就开始推荐我另一款略有不同的锯子 它的功能也略有不同
that suddenly I definitely need to be doing and I think, uh yeah, maybe I will buy that
而且我应该会用得到 我想 我可以买
and then in the end I have 10 saws and I don’t know how to use any of the saws.
最后我就会有10把锯子了 可是我一把都不会用
But you know, the retailer’s job is done.
So if we want to extract this data, we’re going to use machine learning or modeling
如果要提取这些数据 我们就要用机器学习或建模
to model this system and make predictions.
来模拟系统 作出预测
Now so for example, we could cluster the data together.
We could link my purchase history with similar people.
What are they buying? Can I be tempted to buy those things as well, right?
Maybe I’m very different from someone else,
and so it’s not a good idea to recommend me certain products
because I’m unlikely to buy those things.
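Grouping similar shoppers can be sketched with k-means, which is one common clustering method; the video does not say which algorithm is meant, so this choice, the two spending columns and all the numbers are assumptions.

```r
set.seed(42)   # k-means starts from random centers, so fix the seed

# Hypothetical spend per customer in two product categories.
customers <- data.frame(
  diy_spend  = c(200, 220, 180, 10, 5, 15),
  book_spend = c(10, 5, 20, 150, 170, 160)
)

# Ask for two groups; nstart = 10 retries from several random starts
# and keeps the best result.
fit <- kmeans(customers, centers = 2, nstart = 10)

# Attach the group label to each customer.
customers$cluster <- fit$cluster
customers
```

The first three customers (heavy DIY spenders) end up in one group and the last three in the other, so a saw recommendation would go to the first group only.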
Perhaps use a different example. In the medical domain,
举个别的例子 医学界往往会
it’s quite common to classify people into kind of risk categories,
so that we can maybe use preventative treatments.
So every time I go to a doctor, they’re going to collect data on me, on…
每次我去看病时 大夫都会收集我的数据
What’s currently wrong with me? And what was wrong with me before? And…
combine that with, you know, standard data
like how much exercise someone does, and you know their family history,
比如锻炼量 家庭病史
and what their stress levels are and things like this.
压力水平 等等
We can combine all these things to make a prediction as to what they are at risk of in the future,
将这些数据结合 就能预测对方将来是否存在健康风险
so you know, heart disease or something else like this.
比如心脏病等 这能挽救一个人的生命
It could save someone’s life if you spot
that they’re at risk of a certain thing
and you can really advise that person to, you know, increase their level of exercise or alter their diet.
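A toy version of this kind of risk prediction is sketched below using logistic regression, which is one standard modeling choice and not something named in the video. The ten patients, the single exercise variable and the outcomes are all invented for illustration.

```r
# Hypothetical records: weekly hours of exercise and whether heart disease developed.
patients <- data.frame(
  exercise = c(0, 1, 1, 2, 3, 3, 4, 5, 6, 7),
  disease  = c(1, 1, 0, 1, 1, 0, 0, 1, 0, 0)
)

# Fit a logistic regression: probability of disease as a function of exercise.
model <- glm(disease ~ exercise, data = patients, family = binomial)

# Predict the risk for two new people: no exercise vs. seven hours a week.
new_people <- data.frame(exercise = c(0, 7))
risk <- predict(model, newdata = new_people, type = "response")
risk
```

On this made-up data the predicted risk is higher for the person who does no exercise, which is the kind of signal a doctor could use to advise a change in habits. A real model would need far more data and many more variables.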
There are two other terms that we come across, you know a lot, right?
我们还要学习另外两个知识点 你应该很清楚
So there’s data mining and big data.
Now, I’m not really sure what data mining is, because I don’t think anyone is.
我不是很清楚数据挖掘是什么 因为我觉得没人清楚
It’s a bit… it’s a bit of a buzzword.
Really, what data mining is, is a combination of preprocessing your data
and maybe using clustering to extract some knowledge from it.
So that’s our sort of… it’s a word that’s come to be used in place of those things.
If someone says they’re doing data mining, that’s what they’re doing.
如果有人说自己在做数据挖掘 那么他做的就是上述的事情
They’re preprocessing and extracting some knowledge from their data
It’s a cool sounding word. You’re not actually “mining” anything, right?
这个词听起来很酷 但你其实不是在真的“挖”东西
You’re just doing what everyone else does on data.
Big data is the idea that maybe we collect a lot of examples of something, you know, a huge number,
大数据指的是我们收集了某事的大量样本 海量样本
or each of our examples is quite complicated and it has a lot of variables.
或每个样本都很复杂 包含大量的变量
In that case, the amount of data we’ve got is sort of unwieldy.
这么说来 我们获取到的数据量就很难处理了
So I would argue, perhaps that big data is not data that you can run on your laptop.
所以我认为 大数据不是你能在笔记本电脑上运行的数据
Like, you might be using cloud compute infrastructure or certainly parallel processing
而是要用云计算 基础设施或是并行处理
in some way to preprocess and analyze this data.
So exactly where’s the line? How big is “big”?
I don’t know, but exactly where we draw the line in some ways is not really important,
我也不知道 但是究竟有多“大” 这个问题并不重要
the idea is just that, given the amount of data we as a species are now producing,
more and more of our data is becoming big data.
越来越多的数据 逐渐构成了“大数据”
But you know exactly where the cutoff is doesn’t really matter.
但你也清楚 这个边界并不重要
What is data? I’m pretty sure that’s data.
Is this data, this picture? Or that data?
Is this data? What is data?