#1 What Exactly Is Data? – 译学馆

Data Analysis 1: What is Data? - Computerphile

What is data? Right.
数据是什么?
I’m pretty sure that’s data, right?
我非常确定这就是数据
is this data?
这是数据吗
This picture? Or is that data?
这张照片 或者这个 是数据吗?
Is this data? What is data?
这是数据吗?到底什么是数据?
什么是数据? 《电脑狂热》
So we talked a lot about data in the last video:
上期我们谈了很多关于数据的内容
Why is it important that we can analyze and understand data?
为什么学会分析和理解数据如此重要?
but what is data?
但什么是数据?
Everybody has data; everybody’s generating it.
人人有数据 人人都生产数据
Companies are generating it on us.
公司生产关于我们的数据
We’re generating it ourselves,
我们自己也生产数据
you know, when we use social media and so on.
比如使用社交媒体时 等等
But what is it? And
但是 数据是什么呢
understanding what it is is a prerequisite for being able to use it properly.
摸清数据概念是能合理使用数据的前提
Perhaps the most important thing as far as we’re concerned,
对我们这些想要科学地分析数据的人来说
as people who are trying to analyze data sort of scientifically, is that
也许最至关重要的
the data has to be measurable, right?
就是数据本身须能度量 对吧
So the idea is, you know, if you’re going to do a survey on what people like,
因此 如果你要调查人们的喜好
everyone’s got to be using the same scale and the same rating system.
则每个人都应使用同样的度量衡和评估体系
Otherwise, it doesn’t make any sense.
否则没有什么意义
Well, we can’t have someone rating things from one to five
我们不能让一人用12345进行评分
and someone else saying I thought it was good, right?
而另一人评价说“好” 对吧?
Because which one of one to five is “good”?
因为不知道12345几分算好
We don’t, you know, we don’t know. All right.
我们根本不知道
So everyone has got to be doing the same thing;
所以每个人行为一致
your data’s got to be in a consistent format,
搜集的数据格式也会一致
and once that’s achieved, at least
一旦两者一致
we’re a little bit closer to being able to make some sense of it.
至少收集到的数据会更有意义一点
Broadly speaking, when we talk about data, we kind of have four different types,
广义而言 我们所说的数据包含四种类型
and we summarize this with this nice word, NOIR.
我们可以把它总结为一个单词 noir
So N, O, I, R: NOIR.
n o i r noir(黑色)
And with each of these different types of data we can do different things, right?
不同类型的数据有不同的处理方式
So N, that’s the first type: this is nominal data.
“n”是第一种 即称名数据
Nominal data is where we have no distance between the values that we can measure,
称名数据下的测量值无法进行衡量比较
right? Because they’re not really quantities, and we can’t order them.
对吧? 它们并非数量 因此无法排序
So a good example would be colors.
颜色就是个很好的例子
So maybe your favorite color is red and my favorite color is blue.
或许你最喜欢红色 而我最喜欢蓝色
I don’t know which is better than the other.
我不知道哪个更好
There is no measurement between them, right?
它们根本无法比较 对吧
Is blue closer to green? Does it matter?
蓝色更像绿色吗? 这有什么关系吗?
You know, that doesn’t make any sense, right?
这种比较没什么意义吧?
We’re not talking about wavelengths.
我们不是在说波长
We’re just talking about the colors, right?
我们只聊颜色 是吧?
Another good example would be, let’s say, in football:
再举个好点的例子 比如 足球
player numbers on your back. Right, now,
足球队员背后的号码
symbolically, sometimes certain player numbers have a meaning,
现在有时特定足球号码有着某种象征意义
but you can’t compare and contrast them
但它们无法用来对比比较
You can’t say that 8 is 2 times better than 4.
你不能说8号比4号好一倍
All right, that doesn’t make any sense, right?
这没什么意义 对吧
You also can’t really order them in general, right?
你也不能按大小给它们排序 对吧?
Player 16 doesn’t go before or after player 13 in the list;
队伍中16号与13号球员没有先后之分
you know, that doesn’t make any sense, right?
这种排序并没有实际意义 是吧?
So nominal data is useful, right?
所以称名数据很实用
It could be really important,
有时十分重要
but it’s data where we kind of have labels
称名数据有标签
but no way of ordering these labels.
却无法按标签排序
So you can still analyze it,
但你仍然可以对它进行分析
but you can’t, for example, calculate the average, that is, the mean, right?
却不能 比如 计算平均值 是吧?
That wouldn’t make any sense.
这样做完全没有意义
What you can do is calculate the mode.
你只能计算众数
so you can calculate the most common one.
就是计算出现频率最多的数
You could say that more people prefer red to blue.
你可以说 比起蓝色 更多的人喜欢红色
but you couldn’t say, you know, the average color that people like is a sort of muddy brown, right?
但你不能说人们喜欢的平均颜色是土褐色
That doesn’t make any sense at all, right?
这根本毫无意义 对吧?
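The point about nominal data can be sketched in a few lines. This is only an illustration, written in Python rather than the R used later in the video, and the list of favorite colors is made up for the example.

```python
from collections import Counter
from statistics import mode

# Hypothetical survey answers. The colors are labels only:
# they have no order and no measurable distance between them.
favourite_colours = ["red", "blue", "red", "green", "red", "blue"]

# The mode (the most common label) is well defined for nominal data.
most_common = mode(favourite_colours)

# Counting how often each label occurs is also meaningful.
counts = Counter(favourite_colours)

print(most_common)                    # red
print(counts["red"], counts["blue"])  # 3 2
```

There is deliberately no mean here: averaging labels such as "red" and "blue" has no sensible interpretation.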
So as we go down this list,
顺着四种数据类型往下走
we get types of data that are, in some sense, more and more informative.
我们慢慢了解到信息量越来越大的数据类型
So the next one is ordinal.
下一种数据类型是有序数据
so in ordinal data,
在有序数据中
we have an order but we can’t measure distances between things.
有序数据可排序 但数值间的差距无法度量
so a good example would be something like
打个好点的比方
the positions people finished in a race.
跑步比赛的名次
So, you know, maybe I finished first.
可能我跑第一
I’m super quick, right?
我很快吧?
You didn’t; you finished third.
你不快 你排第三
But how far apart we are isn’t included in that kind of data.
但这名次无法体现我们之间相差的距离
You’d have to have a separate value for that
所以你还得再测一个数值
Another example that we’re all familiar with
再举一个我们都熟悉的例子
is rating systems, right?
评分系统 熟吧?
So perhaps I rate a film from one to five stars
我打一到五星给电影评分
and you rate the film from one to five stars
你也打一到五星给电影评分
but you can’t really say that
但你不能说
a film that’s got four stars is two times better than one that scored two,
一部四星的电影比一部二星的电影好两倍
because that’s very subjective
由于评分非常主观
and there’s no real sort of measurable distance between these stars.
星级评分间的差异无法具体衡量
If you have ordinal data, you can calculate the mode again:
有了有序数据 你仍然可以计算众数
you can calculate the most common value of all the values that were returned,
也能计算所有统计数据中出现次数最多的数值
or you can calculate the median, the one that sits in the middle, right?
或是计算这组数据的中位数
So maybe, you know, with fifty runners in a race,
如果竞跑中有50个跑步选手
the 25th position, roughly speaking, is going to be, you know, around the median.
大致来说第25个就是中位数
So it’s still not hugely useful, right?
这种数据使用价值依然不大
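A minimal sketch of what ordinal data allows, again in Python rather than R. The star ratings are invented; the 50 finishing positions follow the race example above. `median_low` is used so that the result is always one of the observed values, which suits data where distances between values are not meaningful.

```python
from statistics import median_low, mode

# Hypothetical star ratings from seven viewers (1 to 5 stars).
star_ratings = [4, 2, 5, 4, 3, 4, 1]

print(mode(star_ratings))        # 4, the most common rating
print(median_low(star_ratings))  # 4, the rating in the middle once sorted

# Finishing positions for fifty runners: 1, 2, ..., 50.
positions = list(range(1, 51))
middle_position = median_low(positions)
print(middle_position)           # 25
```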
Next up we have interval data.
接下来是区间数据
With interval data, we have an order and we have a distance,
区间数据既能排序 也能衡量
but we have no sort of absolute zero for this scale.
但没有绝对零点这一说法
So a good example would be something like degrees Celsius or degrees Fahrenheit.
华氏度与摄氏度是一个特别好的例子
Zero degrees Celsius isn’t no temperature.
0摄氏度并非没有温度
It’s a specific temperature, right?
而是一个具体的温度 对吧?
So we can’t say that fifty degrees is half of a hundred degrees.
所以我们不能说50度是100度的一半
The numbers are, but it doesn’t really make sense, right?
数虽如此 但这样做并无意义 对吧?
We can say that a hundred degrees is hotter than 50,
我们可以说100度比50度热
which is hotter than zero, right?
50度又比0度热 对吧
So this is interval data
这就是区间数据
Now interval data lets us do a few more things than we could with ordinal:
比起有序数据 除了求众数与中位数
as well as being able to calculate the mode and median,
区间数据的应用范围更广
we can now calculate the mean temperature. That’s okay.
现在我们可以求平均温度了 完全可以
And we could also calculate things like the range,
还可以计算极差之类的量
the minimum and maximum temperatures for a certain window, right?
或某一窗口的最高与最低温度之类的 对吧?
So that’s pretty useful
还是挺有用的
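As a small sketch of the extra statistics that interval data allows, the following Python snippet computes the mean, minimum, maximum and range of a handful of temperatures. The readings are hypothetical and stand in for one "window" of measurements in degrees Celsius.

```python
from statistics import mean, median

# Hypothetical temperature readings in degrees Celsius for one window.
temps_c = [5, 12, 18, 21, 9]

mean_temp = mean(temps_c)              # differences are meaningful, so a mean is too
median_temp = median(temps_c)          # still available, as for ordinal data
coldest, hottest = min(temps_c), max(temps_c)
temp_range = hottest - coldest         # the range: distance between the extremes

print(mean_temp, median_temp)          # 13 12
print(coldest, hottest, temp_range)    # 5 21 16
```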
Another good example of interval data would be pH level.
再举个好点的例子 pH值
Right, again, a pH of zero means very acidic.
pH值为0时意味着酸性很强
It doesn’t mean there is no acidity at all or no pH at all.
而不是说没有酸性或没有pH值
We can say that pH 13 is higher than pH 7, which is higher than pH 3,
我们可以说pH 13比pH 7高 pH 7又比pH 3高
and we know how far apart these numbers are,
我们也知道这些数值的差是多少
but we can’t necessarily say that one is double another.
但我们未必能说这个数是那个数的两倍
So the final kind of data we’re going to look at is ratio data
最后一种数据类型是比值数据
So this is exactly like interval, except that we now have a sort of true 0 value
它与区间数据几乎一样 只是加入了真零值
So a good example of this would be Kelvin, right?
绝对温度是一个典例
So Kelvin has an absolute zero, which is
绝对温度有绝对零度
the absolute absence of any kind of heat, right?
这个温度意味着没有任何热量
and then it goes upwards
数字越高温度越高
so we can say that, in terms of Kelvin,
所以以开为单位计算的话
a hundred is half of 200,
可以说100开是200开的一半
and so on like this
诸如此类
and we can get to 0
并且我们可以得到0
another example would be number of children, right?
再举个例子 孩子的数量 对吧?
Zero children means the absence of any children
0意味着没有孩子
and you can also say that, let’s say, four children is double the amount of two children,
你也可以说四个孩子是两个孩子的两倍
and too many to look after, in my opinion.
在我看来 也多得照看不过来
So that is an example of ratio data
这是比值数据的一个例子
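The difference between interval and ratio scales can be checked numerically. The sketch below (Python, with a helper name chosen for this example) converts the 50 and 100 degree Celsius values mentioned earlier to Kelvin and compares the ratios. The offset 273.15 is the standard Celsius to Kelvin conversion.

```python
def celsius_to_kelvin(celsius: float) -> float:
    """Convert degrees Celsius to Kelvin (hypothetical helper for this example)."""
    return celsius + 273.15

a_c, b_c = 50.0, 100.0

# On the Celsius (interval) scale the numbers have a ratio of 2,
# but that ratio says nothing physical because 0 °C is an arbitrary point.
ratio_celsius = b_c / a_c

# On the Kelvin (ratio) scale zero is a true zero, so the ratio is meaningful.
ratio_kelvin = celsius_to_kelvin(b_c) / celsius_to_kelvin(a_c)

print(ratio_celsius)           # 2.0
print(round(ratio_kelvin, 3))  # 1.155
```

So 100 °C is only about 15% "more temperature" than 50 °C, not double, while 200 K really is twice 100 K.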
Right. Now, ratio data is quite similar to interval in terms of what you can calculate,
就计算而言 比值数据与区间数据类似
but it allows some more complicated statistical measures, such as a t-test.
但可以进行较为复杂的计算 如t检验
So these are the types of data
这些就是数据类型
now actually, it’s quite important how you structure your data in general
实际上 如何整体架构数据十分重要
We can’t just have it sitting in some massive spreadsheet
我们不能仅仅把数据塞进繁多的电子表格中
with no thought given to where everything is, right?
而完全不知道如何查找 对吧?
There’s actually a pretty standard way of doing this
实际上 查找数据有一种十分标准的方法
that we’re going to look at
让我们来看一下
Data comes in lots of forms, right
数据来源方式多种多样 对吧?
different types of measurements,different experiments,
不同的测量类型 不同的实验
people are going to collect it in different ways
人们收集数据的方式也不同
But actually there’s a very standard way that we use
但是实际上只要数据在电脑上
to represent data once it’s actually on a computer
就有一种非常标准的方式来摆放数据
so we can have some kind of table of our data
我们可以根据数据做一个表格
We almost always represent our data in a matrix like this, a two-dimensional table,
我们总是用这种二维表格的矩阵来查看数据
because it’s much easier to do
因为做起来更容易
and so along the top we’re going to have our attributes, right,
我们会在表格上方标明属性 对吧?
which are the things we’ve been measuring.
也就是我们想测量的事情
So an example would be maybe we’re collecting data on people,
比如 可能我们搜集个人信息时
so we could have name, which would be some nominal data,
会写上姓名 这会是一个称名数据
and then, you know, age, height.
然后有年龄 身高等
So the columns are attributes or the things we’ve been measuring
因此这些列是各种属性 也就是度量类别
the rows those are the instances or the samples we’ve got
这些行则是我们搜集的所有案例与样本
so that’s all the individual people
所以这包括所有个人
So here’s person 1, person 2 and person 3,
1号 2号 3号
and person 3 is called John,
3号的名字是约翰
and they’re, you know, 54 and, you know, 5 foot 11 or whatever, and so on,
年龄54岁 身高5尺11寸 随便填
and you can have as many rows as you want.
你想填多少行都可以
so when we talk about attributes.
所以我们谈到属性时
We’re talking about the number of columns
就是在谈这些列的数字
people use lots of different terms for these.
但人们对其的称谓各不相同
I like to think of them as features
我就较喜欢把他们称作特征
attributes is another one
属性是另一种称呼
and we have instances or samples down the rows
在行上 我们列出得到的案例和样本
Now quite often in the very last column of your data, sometimes separated out,
通常表格的最后一列有时会被单独列出
but that’s not really important,
但不太重要
we’ll have our output.
这一列放的是我们的输出
Maybe we’re trying to make a decision based on these people
我们也许可以试试基于这些人进行决策
Maybe these are candidates for a football team and we’re saying
假设这组人均为足球队的候选人
so,you know, are they gonna be on the team or not?
那么哪些人可以入队 哪些不能呢?
So this is “yes”,
这里填“通过”
“no”; John’s made it, “yes”;
“淘汰 ” 约翰可以 “通过”
“no”, “no”, and so on.
“淘汰”“淘汰”诸如此类
and that way we could perhaps analyze our decision-making process and decide, you know,
也许我们可用该方法分析决策过程并做出决定
is there any aspect of these things
以该表为例
that informs our decision-making process, as an example, right?
这些数据是否有体现决策过程的方面?
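The table layout described above (attributes across the top, one instance per row, the output in the last column) might be sketched like this in Python. Only John's age and height (54, 5 foot 11, i.e. 71 inches) and his "yes" come from the talk; the other two people and all column names are made up for illustration.

```python
# Attributes (features) run along the top; the last one is the output.
columns = ["name", "age", "height_in", "on_team"]

# Each row is one instance (sample). Every row has one value per column.
rows = [
    ["Alice", 31, 66, "yes"],  # hypothetical person
    ["Bob",   27, 70, "no"],   # hypothetical person
    ["John",  54, 71, "yes"],  # from the talk: 54 years old, 5 ft 11 in
]

n_instances = len(rows)        # number of rows
n_attributes = len(columns)    # number of columns

# A consistent structure means every row has exactly n_attributes values.
is_rectangular = all(len(row) == n_attributes for row in rows)

# Look up one instance by position, and one attribute across all instances.
john = dict(zip(columns, rows[2]))
ages = [row[columns.index("age")] for row in rows]

print(n_instances, n_attributes, is_rectangular)  # 3 4 True
print(john["age"], ages)                          # 54 [31, 27, 54]
```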
Now we always structure data in this way
我们常常以这种方式组织数据
and if we don’t it becomes a huge problem
不然就会出现很大问题
because you end up spending all this time
因为最终你要花费所有的时间
formatting and trying to work out what’s what
来摆放和试图解析数据的含义
and, you know, why is John listed down
并且为什么约翰列表是
the table and not across the table?
从上往下而不是从左往右呢?
And you know, nothing makes any sense anymore
这样一切就都乱了套
So let’s look at an actual data set
接下来让我们看看真实的资料集
and we’ll see all this in action
我们会看到操作中的所有步骤
So we have here a data set of whether someone
这里是决定一个人是否去
goes to play tennis, right? And
打网球的资料集 是吧?而且
whether or not we go is gonna depend
我们去不去 取决于
a little bit on what weather conditions are.
那里的天气情况怎么样
So we don’t like to play for example when it’s too hot
比如 天气太热的时候我们都不想打球
The tennis data set is
网球资料集与
just the same structure as a data set we looked at already
其他我们已经看过的资料集组织方式相同
We’re going to load it into R; it’s held in a CSV file.
我们将把CSV文件中的数据加载到R中
So: tennis <- read.csv(...) on the tennis CSV file.
输入 tennis <- read.csv(...)
Now, we’re using R for this because it’s free,
我们使用R 因为它免费
and it has a load of decent functions for
而且它在分析检验和
analyzing,examining,visualizing data.
查看数据上都非常好用
So we’re going to be using it
所以在所有的视频中
throughout these videos
我们都会用这个软件来进行教学
obviously you could use MATLAB or Python
你也可以用Matlab Python
or some other library if you wanted to
或其他编程语言 只要你愿意
I think that you should use whatever you’re most comfortable with
我觉得一定要用自己觉得趁手的
Looking at these rows and tables
来看看这些列表
I mean, it looks a lot like something like Microsoft Excel
它看起来很像Excel表格
You could do this data analysis in Excel
这份数据分析也可以用Excel来做
Some people would disagree.
或许有人不认同
No, Excel is perfectly good for what it does;
但实际上Excel非常适合数据分析
you could do data analysis in it.
你完全可以用它来做数据分析
I think that
我觉得
Excel doesn’t enforce anything to do with
Excel没有强制执行任何
observations versus variables and things like that.
与观察值和变量相关的东西
These are distinctions that are not really made in Excel.
这些区分在Excel里并没有真正体现
Obviously, if you enforce those rules yourself, that’s going to work,
很明显 如果你自己遵守这些规则 那也行得通
but you have to be a little bit more, you know, regimented and rule-based about it.
但你必须严格遵循其规则
I think the consensus would be that if you really want to get into data analysis
如果你真的很想进行数据分析
and start doing things like principal component analysis or more
并且开始做一些主成分分析之类的
advanced statistical measures,
高级数据测量
something like R or Python is going to help a lot more.
R或者Python帮助会大一点
OK. So I’ve loaded the data set,
好的 现在我已经加载好了数据集
and if we look at the data set,
如果我们来浏览
so we look at the top few rows of the data, you’ll see that
我们看一下数据的前面几行
there are 6 different variables, or 6 attributes,
有六个不同的变量 也就是六个属性
and this data set has 14 instances or observations;
这个数据集合有十四个案例或观察项
R calls them observations.
R称之为观察项
So what we’re saying is we have six columns and
所以这个数据集合有六列
fourteen rows,right,of our data set and this data set is
14行 而且
structured exactly like
组织结构上跟我
this people data set that I was looking at a minute ago
前一分钟看的那个人物数据集一模一样
So we can examine a single instance,
所以我们可以检验其中一个案例
we can say what is it about day three?
我们可以说第三天怎么样?
So let’s have a look at day three, so we can say tennis[3,]
让我们来看看第三天 输入tennis[3,]
And we can see that on day three it was overcast, the temperature was only five degrees,
原来第三天天气阴沉 温度只有五度
the humidity was high, there wasn’t any wind,
空气湿度高 无风
so they decided to play tennis, right?
所以我们决定打网球 对吧?
So it’s a bit chilly, but I guess they gave it a go
天气有点冷 但我猜他们还是去了
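For readers following along in Python instead of R, loading a CSV and pulling out one row might look like the sketch below. The real tennis file is not reproduced in this transcript, so the CSV text here is a made-up six-row stand-in: only day 3 (overcast, 5 degrees, high humidity, no wind, played) and the order of the first six outlooks come from the talk. The column names are assumptions too.

```python
import csv
import io

# Hypothetical stand-in for the tennis CSV file (the real one has 14 rows).
CSV_TEXT = """day,outlook,temp,humidity,wind,play
1,sunny,25,high,10,no
2,sunny,28,high,20,no
3,overcast,5,high,0,yes
4,rainy,12,normal,5,yes
5,rainy,10,normal,15,no
6,rainy,14,normal,0,yes
"""

# Each row becomes a dict keyed by the header line; all values are strings.
tennis = list(csv.DictReader(io.StringIO(CSV_TEXT)))

n_rows = len(tennis)      # instances / observations
n_cols = len(tennis[0])   # attributes / variables

# R's tennis[3,] is row 3 counting from 1; Python counts from 0.
day3 = tennis[2]

print(n_rows, n_cols)                                     # 6 6
print(day3["outlook"], int(day3["temp"]), day3["play"])   # overcast 5 yes
```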
And so on. We could also look
所以我们也可以看看
at all the different temperatures, or,
所有不同的温度
for example, all the different forecasts:
比如所有不同的天气预报
tennis$outlook
输入tennis$outlook
All right. And we can look at
好的 我们可以看到
all the outlooks in the data set, so we can say
资料集中显示的所有天气状况
we’ve got sunny, sunny, overcast, rainy, rainy, rainy and so on,
有 晴 晴 阴 雨 雨 雨 等等
and we can get a feel for what kind of weather we’re looking at here as well
我们也可以看到那里的天气怎么样
Using something like R,
只要使用R
you can examine the instances,
你可以检测一下案例
You can examine the individual attributes
你可以检测单个属性
you can group them together or not as you see fit
你可以将他们分组 只要你觉得合适就行
and then you can start to drill into what this data set means
然后你就可以开始研究这些资料集是什么意思
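Examining a single attribute across all instances, then grouping its values, can be sketched as follows in Python. The six outlook values are the ones read out above; everything else about the data set is omitted, and the `outlook` key is an assumed column name.

```python
from collections import Counter

# The first six outlooks read out in the talk, one per instance.
tennis = [{"outlook": o} for o in
          ["sunny", "sunny", "overcast", "rainy", "rainy", "rainy"]]

# Pull out one attribute (column) across every instance (row).
outlooks = [row["outlook"] for row in tennis]

# Group identical values together and count them.
outlook_counts = Counter(outlooks)
most_common_outlook, n_days = outlook_counts.most_common(1)[0]

print(outlooks[:3])                 # ['sunny', 'sunny', 'overcast']
print(most_common_outlook, n_days)  # rainy 3
```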
Now this dataset has in it the final column
这些资料集的最后一栏
which is whether they actually played
显示了他们到底有没有去打球
so you could use something like machine learning
所以你可以使用机器学习之类的
to predict that final column based on the other columns.
基于其它列的数据来预测最后一列的数据
That’s something you could do.
就这样做
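To make "predict the final column from the other columns" concrete, here is a deliberately crude stand-in: a rule that looks only at the outlook and predicts whatever happened most often for that outlook. This is not the method used in the video, just the simplest possible illustration, and the (outlook, played) pairs are hypothetical apart from the overcast day on which they played.

```python
from collections import Counter, defaultdict

# Hypothetical (outlook, played) pairs standing in for the real data set.
observations = [
    ("sunny", "no"), ("sunny", "no"), ("overcast", "yes"),
    ("rainy", "yes"), ("rainy", "no"), ("rainy", "yes"),
]

# Tally the outcomes seen for each outlook value.
votes = defaultdict(Counter)
for outlook, played in observations:
    votes[outlook][played] += 1

# The "model" is just the majority outcome per outlook.
rule = {outlook: tally.most_common(1)[0][0] for outlook, tally in votes.items()}

def predict(outlook: str) -> str:
    """Predict the final column from the outlook alone (hypothetical helper)."""
    return rule[outlook]

print(predict("sunny"), predict("overcast"), predict("rainy"))  # no yes yes
```

A rule based on a single column will get some days wrong (one rainy day above was a "no"), which is why the other attributes matter.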
One other thing about this dataset that’s quite interesting is that
有趣的是
it has a few examples of the different kinds
它有几个我们之前找到的
of data we were looking at earlier
不同种类数据的一些样例
So remember we have nominal
还记得吧 称名数据
ordinal interval and ratio
有序数据 区间数据 比值数据
So for example,
所以 比如
outlook is really a nominal field
天气是一个称名数据场
Right? It’s a nominal data type.
对吧?属于称名数据类型
You could perhaps suggest that you could order it from rainy through to sunny,
你可能可以按照雨 晴 阴的
but then cloudy, overcast, you know,
顺序进行排列
it doesn’t really make any sense, so this is kind of nominal.
但并没有意义 所以这是称名数据
you could calculate for example the mode and say
你可以计算出 比如 众数 你可以说
that most of the days were rainy or something like this
出现最多的天气是雨天或者晴天什么的
Temperature as we discussed before
温度 正如之前讨论过的
this is in Celsius. So this is going to be interval
单位是是摄氏度 所以它是区间数据
we can order the data and we can say
我们可以将它们进行排序
one of them is 15 away from another one
也可以说这个数和那个数之间相差15
But we can’t say how much of a difference that is as a ratio.
但他们的差异程度我们却无法说明
Is that double the temperature or half the temperature?
它的温度是它的一倍 还是一半儿呢?
We can’t really say.
说不清楚
So humidity is ordinal,
所以湿度是有序数据
so we can say high is more humid
所以我们可以说“高”比
than normal, right?
“正常”更潮湿 对吗?
But we can’t really say how much,
但是我们无法说明
that’s going to depend on who was measuring it
这要取决于测量的人员
and where their differences lie
和数据的差异点在哪
and finally wind in kilometers per hour.
最后 风速的单位是每小时几千米
Well, zero is no wind.
0就是无风
Yeah, you can’t have negative wind.
毕竟不会出现负值的风
So this is a ratio, right?
所以这是一个比值数据 对吧?
You can say that a 20 mile an hour wind, or a 20 kilometer an hour wind,
你可以说每小时20公里(km)
is two times as much as ten.
是每小时10公里的一倍
That’s something you can say,
这样说是没错的
this little dataset contains all the kinds of data
这些小小的资料集中包含了所有种类的数据
so the different statistics and measures you can calculate using these
所以不同的统计方法和测量方法
are going to depend on what kind of data they are.
计算的方式也不同 这取决于数据的类型
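One way to picture "the statistics depend on the kind of data" is a lookup from scale type to the summaries discussed in this talk. The sketch below is an illustration in Python, not a standard library feature; the `ALLOWED` table and `summarise` function are names invented for this example, and the sample values are hypothetical.

```python
from statistics import mean, median_low, mode

# Which summaries make sense for each NOIR scale, as described in the talk.
ALLOWED = {
    "nominal":  ("mode",),
    "ordinal":  ("mode", "median"),
    "interval": ("mode", "median", "mean", "range"),
    "ratio":    ("mode", "median", "mean", "range"),  # plus meaningful ratios
}

def summarise(values, scale):
    """Return only the summaries that are meaningful for the given scale."""
    allowed = ALLOWED[scale]
    summary = {"mode": mode(values)}
    if "median" in allowed:
        summary["median"] = median_low(values)
    if "mean" in allowed:
        summary["mean"] = mean(values)
        summary["range"] = max(values) - min(values)
    return summary

# Outlook is nominal: only the mode is reported.
print(summarise(["sunny", "sunny", "rainy"], "nominal"))
# {'mode': 'sunny'}

# Wind speed in km/h is ratio: every summary is available.
print(summarise([10, 20, 0, 10], "ratio"))
# {'mode': 10, 'median': 10, 'mean': 10, 'range': 20}
```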
So we can see that even a very simple data set like this
所以即使是像这样非常简单的资料集
has loads of different kinds of data and different ways we could interpret this data
都有许多不同的数据类型和解读方法
Right, if you make a decision to play
对吧?如果你仅仅看天气
based only on whether the outlook is good,
是否良好来决定是否出去玩
You’re maybe not going to solve the whole problem, right?
你可能无法解决所有的问题 对吧?
So these are the kind of things we’ll be looking at as we go forward
所以这都是我们做决策前要考量的因素
And one thing we might do next is to visualize this data,
然后接下来就把数据可视化
to start to try and understand some patterns or extract some kind of knowledge.
开始尝试理解一些规律或者提取一些知识
It’s a very important tool, but you’ve got to use it properly.
这个工具非常重要 但要适当使用
You can’t just plot anything and everything.
不能什么都拿来画图
Every chart you use has got to support your hypothesis
你用的每个图表都会用来支持自己的假说
or it’s got to try and show the story you’re trying to tell right?
或者试图展示你要诉说的故事 对吧?
You don’t just plot something
你不能仅仅因为某样东西能画成图
because it could be plotted, right?
就把它画出来 对吧?
There’s got to be a point.
肯定有原因
There are a lot of problems with using inappropriate graphs
使用的图表不当 采纳数据时断章取义
and only picking subsets of your data.
都会造成许多问题
That’s a huge problem
而且问题不小

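Translation information
Video summary

The word “data” comes up all the time in modern society, but what exactly is data? Let’s learn together!

Transcription

Collected from the web

Translator

梦的翅膀

Reviewer

审核员BA

Video source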

https://www.youtube.com/watch?v=SEeQgNdJ6AQ
