Okay, so artificial intelligence, machine learning, data mining, data analysis,
clustering, classification, data pre-processing,
00 – Introduction to Data Analysis, Computerphile
It’s hard to go anywhere now without hearing about AI and machine learning and data,
data particularly, it’s everywhere.
Researchers have suggested that every two years we generate more data than ever existed before.
So the amount of data is doubling every two years.
That is actually, you know, an astronomical amount of data,
but the thing is of course that this data doesn’t necessarily mean anything.
In fact, you can create tables of data
but unless you understand what’s in them and what they mean,
you haven’t got any knowledge, right?
So there’s a distinction between having data and having knowledge.
So it’s all very well saying, yes, as a species, we’re producing a huge amount of data,
but actually a lot of it doesn’t get used.
A lot of it sits there on a hard disk, waiting for someone to look at it.
And that’s kind of what we’re talking about here.
If we want to extract knowledge from data,
we’re going to need some tools and processes to do this in a formal way,
and that’s what data science is, right?
And things like machine learning and AI have a place within it.
So perhaps if you do this for your job,
then data analysis is going to be useful for you.
Maybe your company’s generating data and you want to analyze this data?
But on the other hand, perhaps you’re just a consumer, and companies are using data on you.
They’re generating data on you, and actually they’re profiting from data on you.
These are sometimes life-changing decisions that are being made on your data.
And so it’s empowering to know how this process works.
And I have a very simple example which you might even do yourself.
Suppose you go online to book some flights for a holiday,
and then you decide that actually two flights via an intermediate airport
is cheaper than a single flight, right?
You’re doing data analysis. You’re taking lots of different data sources
and working out the optimal route.
And this of course happens automatically as well,
depending on the flight website that you’re using.
All right, so this kind of stuff, you’re already doing.
It’s just a case of trying to formalize this process.
So what do any of the things I listed at the beginning mean?
Well, one problem is that everyone’s definitions differ slightly,
but also I think that a lot of these terms are used completely interchangeably.
AI is the classic example.
So AI is everywhere, right? You can’t buy a product without it having had AI added to it.
A lot of the time when you see AI,
we’re actually talking about machine learning.
So machine learning is the idea that we’re training a machine to perform a task
without explicitly programming it to do so.
A good example of AI that isn’t machine learning would be, let’s say a mouse in a maze,
where all you’re doing is telling it to turn left or right at random.
It’s not learning anything, it doesn’t understand what the maze is,
but it will eventually get to the end, right?
That’s a kind of rudimentary artificial intelligence that doesn’t involve learning anything.
Machine learning is about not giving it conditions,
not saying “if you’re here, turn left; if you’re here, turn right”.
It’s just giving it examples and hoping it will learn to perform those tasks itself, right?
So machine learning is a subset of AI, but they shouldn’t be used interchangeably.
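The random mouse can be sketched in a few lines of R. This is only a toy, under assumptions that are not in the video: the maze is reduced to a straight corridor of ten cells, and the names `random_mouse`, `n_cells` and `max_steps` are made up for illustration. The point is that nothing is learned; the mouse just keeps moving at random until it happens to reach the exit.

```r
# Toy "mouse in a maze": a corridor of cells 1..n_cells, exit at the last cell.
# (Hypothetical simplification; a real maze would be a 2D grid.)
set.seed(1)  # make the random moves repeatable

random_mouse <- function(n_cells = 10, max_steps = 10000) {
  position <- 1
  steps <- 0
  while (position < n_cells && steps < max_steps) {
    move <- sample(c(-1, 1), 1)          # step back or forward at random
    position <- max(1, position + move)  # the wall at cell 1 stops it leaving
    steps <- steps + 1
  }
  list(position = position, steps = steps)
}

result <- random_mouse()
print(result)
```

Running it again with a different seed gives a different number of steps, because there is no memory of previous runs. That is the contrast with machine learning, where past examples change future behaviour.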
If we’re using machine learning, often what we’ll do
is we train it based on samples of data.
So we’ll have some existing data set that we’re trying to train on,
and we’re trying to use machine learning to either
tease out information or make predictions on these data.
The problem is that not all data is created equal.
Some of it’s noisy and messy, maybe we don’t know what it is
and don’t know whether we can apply a certain technique to it, right?
And so we need to clean this data up.
We need to take this data, understand what it is and extract some knowledge,
so that we can then apply these AI or machine learning techniques to it.
So this combination of things that can take data and prepare it
in a way that we can then use it or understand it, that’s data science.
There are quite a few ways we could do this data analysis, right, throughout this course.
We could use R, we could use Python, we could use MATLAB. They all have their pros and cons.
We’re gonna use R because it’s free and it’s really good for statistical analysis.
It’s got loads of great libraries.
If you’re really familiar with Python, then maybe that’s what you want to start with for this kind of stuff.
But we know we’re going to be working with R
We have our script area here where we can write scripts and run scripts.
You can save them and then come back to them later.
Then there’s the console, where we’re going to be putting in, you know, specific commands.
We have our environment, which is where all our variables and our data are held
and we can look at them there.
And then we have plots. You can do quite a lot of different plots in R, it’s very versatile.
Any plots are going to appear down here.
Okay, so you’ve probably got everything you need to get started with data analysis.
In my opinion, the best way to get into R is just to kind of have a go.
So let’s look at a few of the most obvious things that it does.
It has a little bit of a learning curve only because its syntax is slightly unusual.
If you can program you’ll be fine, but even if not, you should get there pretty quickly.
Most of the time in R we’ll be using either matrices, or vectors,
which are kind of a special case of matrices, or maybe data frames.
Data frames are a really nice aspect of R,
which you can kind of think of like a table that you might have in Excel,
except you’ve also got headings for your columns.
So let’s have a look at some of these things, and just a few of the things we can do with them
before we perhaps go into a little bit more detail in other videos.
So for example, we might look at our variable X which I’ve created
and X is a sequence going from 0 all the way up to a few multiples of Pi,
which I used to create this plot.
That was only one line of code that produced that
and I’ve used that to create my plot by essentially saying y equals sin(X),
and then just simply plotting that.
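As a rough sketch of what that looks like in R: the video only says "a few multiples of pi", so the upper limit of `4 * pi` and the 200 sample points here are assumptions.

```r
# A sequence from 0 up to 4*pi (assumed limit), sampled at 200 points
x <- seq(0, 4 * pi, length.out = 200)

# R functions work on whole vectors, so this takes the sine of every element
y <- sin(x)

# Draw the result as a line rather than as separate points
plot(x, y, type = "l")
```

Because `sin` is applied to the whole vector at once, there is no loop to write, which is a large part of why short R scripts can do quite a lot.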
If you wanna get a little bit more complicated, we can start looking at matrix data.
So I created a CSV file with a Gaussian function in it.
So essentially a two-dimensional array of values
that get bigger in the center. Very straightforward.
The CSV file is essentially a text file with commas separating those values,
very easy to read and write these out of Excel and other packages
and so you’ll often find data is passed around in this way,
at least moderately sized data, if it isn’t, you know, too huge.
I can load this in using my “read.csv” function.
So I can say “namedata”.
Now the arrow operator is essentially the assignment operator in R,
equivalent to equals.
Equals will often work, but I tend to try and use this one. So “namedata”…
I’m going to assign “read.csv” and the file is going to be “norm.csv”
And I’ve got no header for this file,
so I don’t want it to use the top row for the labels
So I’m going to say “header” equals “false”.
And that’s loaded in “namedata”. And we can have a look,
so I’m gonna click on “namedata” here.
And if we click on it, you can see we’ve got
the rows and the columns of our data in here.
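Here is a minimal sketch of that load step. The real “norm.csv” from the video isn’t available, so this first writes a small stand-in file to a temporary folder; the 9 by 9 grid, the Gaussian formula and the names `grid`, `gauss`, `csv_path` and `gauss_data` are all assumptions for illustration.

```r
# Build a small 2D Gaussian: values get bigger towards the centre
grid <- seq(-2, 2, length.out = 9)
gauss <- outer(grid, grid, function(a, b) exp(-(a^2 + b^2) / 2))

# Write it out as comma-separated text with no header row and no row names
csv_path <- file.path(tempdir(), "norm.csv")
write.table(gauss, csv_path, sep = ",", row.names = FALSE, col.names = FALSE)

# "<-" is the arrow assignment operator.
# header = FALSE tells read.csv that the first row is data, not column labels.
gauss_data <- read.csv(csv_path, header = FALSE)

dim(gauss_data)   # number of rows and columns
```

Note that `read.csv` hands back a data frame, and with no header R names the columns V1, V2 and so on by itself.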
We can look at individual elements in this array.
So we can say data at position three four,
and that’s going to be the third row down and the fourth value across.
We can also leave one empty and just have an entire row,
or conversely, an entire column, like this.
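The indexing described here can be tried on any small matrix. The matrix `m` below is made up for the example; it is not the data from the video.

```r
# A 4 x 5 matrix holding 1..20, filled column by column (R's default)
m <- matrix(1:20, nrow = 4, ncol = 5)

m[3, 4]          # third row down, fourth value across
m[3, ]           # leave the column empty: the entire third row
m[, 4]           # leave the row empty: the entire fourth column
m[1:2, c(1, 5)]  # ranges work too: rows 1 to 2, columns 1 and 5
```

Indices in R start at 1 rather than 0, which is worth remembering if you are coming from Python or C.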
And so it’s very easy to take ranges of values.
If you’ve got a huge table of data, you can be selecting certain columns,
looking at certain columns, plotting certain columns.
This is one of the reasons why R is very popular.
Quite often when you’re looking at data,
we’ll actually be looking at something called a data frame.
Now a data frame – I’ve got to load one up –
is simply, in essence, a table of values, but they don’t all have to be the same type.
So in an array, normally they’ll all be floats or they’ll all be integers.
In a data frame, there can be different things,
so you could have first and last name next to age, for example.
So I’ve just created a tiny little CSV file
with some random people in it. So let’s load this up.
So I’m going to say “namedata”
And if I look at “namedata”, you can see that it’s got three columns,
it’s got firstname, surname and age,
and five rows, and there’s five people in this dataset.
And then you can do just like I did before,
but now we can also index by the names of these columns.
So I could say I want all of the first names for example,
so I can say “namedata$firstname”
and I can see all the different first names.
So you can start to look at this data set in more and more detail.
Obviously, this is an absolutely tiny data set, but you get the idea.
You could also look at individual instances, so we could say “namedata”.
And I want just the second row, for example, “namedata[2,]”.
There we go, Bill Jones and he’s 18 years old.
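A small sketch of the same idea, built inline rather than loaded from a CSV file. Only the second row (Bill Jones, 18) comes from the video; the other four people are invented so that the example has five rows.

```r
# A data frame mixes types: two text columns and one numeric column
namedata <- data.frame(
  firstname = c("Alice", "Bill", "Carol", "Dev", "Eve"),   # made-up names
  surname   = c("Smith", "Jones", "Patel", "Khan", "Lee"),
  age       = c(34, 18, 52, 27, 41),
  stringsAsFactors = FALSE
)

namedata$firstname   # index a column by its name
namedata[2, ]        # the whole second row
namedata[namedata$age > 30, c("firstname", "age")]  # filter rows, pick columns
```

The last line is the kind of thing you end up doing constantly: a condition on one column selects the rows, and a vector of names selects the columns.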
As we move through these videos, it’s going to be very common for us
to load in datasets like this, in this format,
and then start to process them based on these data frames.
So perhaps an example, right? So let’s imagine you’re an online retailer,
and someone comes into your shop and buys something.
And maybe you’re trying to understand what it is that they do, so that you can,
let’s say, send them emails to try and get them to buy more products,
or show them recommended products and things like this.
So you want to try and build up a pattern of their behavior, right?
And all you’ve got is what they click on, what they add to their basket,
and what they buy, right?
So you’ve learned that they’re looking at these kinds of items and they look at these ones regularly.
And then sometimes they just buy something completely random seemingly,
and that goes in their basket and gets bought straight away.
Maybe it’s a present, right? So maybe it’s not tied to them as a person.
So you’re taking all of this data, all of these purchases, all of these products that they’re looking at,
and you’re turning this into a kind of picture of this person,
and you’re clustering that person in with other consumers that bought similar things,
and trying to predict what they want to buy next, right?
And that’s when you send them an email saying “you should look at this one
because this one’s really good and you didn’t buy it last time, but you’ll definitely want to buy it this time”.
So we’ve got some data and we want to extract some knowledge.
What’s the first thing we do?
We have to start to look at it
and try and tease out some kind of information or analyze this data.
Data analysis is the idea of using statistical measures to try and work out what’s going on.
This is kind of a cycle. We’re going to analyze the data, so we’re going to do a data analysis,
and perhaps sometimes just using statistics to analyze the data isn’t enough.
You can’t really learn everything about it.
Yes, you can learn, you know, mathematically how it works,
but you might not understand about what it all means
So visualizing the data can be really helpful.
So what we’ll also do is we’ll visualize the data – visualization.
So that’s going to be charting it, plotting it,
trying to work out trends and links between different variables and things like this.
And these kind of go back and forth, right,
you could do both of these things numerous times and work out what we’ve got, right?
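To make the back and forth between statistics and plots concrete, here is a hedged sketch on invented shop data. The columns `pages_viewed` and `basket_value`, and the relationship between them, are assumptions generated at random; they are not real figures.

```r
set.seed(42)  # repeatable random numbers

# Invented data: 100 visits, pages viewed and the value of the basket
visits <- data.frame(pages_viewed = rpois(100, lambda = 8))
visits$basket_value <- 5 * visits$pages_viewed + rnorm(100, sd = 10)

# Analysis: summary statistics and the correlation between the two variables
summary(visits)
r <- cor(visits$pages_viewed, visits$basket_value)
print(r)

# Visualization: a scatter plot shows the trend the numbers only hint at
plot(visits$pages_viewed, visits$basket_value,
     xlab = "Pages viewed", ylab = "Basket value")
```

A single correlation number tells you the two variables move together; the plot tells you whether that is a clean trend, a few outliers, or something odder, and that usually sends you back for more statistics.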
So you’re gonna do something like this.
And then what we’re going to do is we’re going to preprocess the data.
Often you’ll find you’re recording much more data than you actually need, right?
This is certainly true of an online shop.
I’m going to be looking at a lot of products
that I don’t end up buying and I was never really going to buy.
I don’t know, maybe a pipe dream.
And they’ve got to sort of weed out this information
to work out what it is that they might actually be able to convince me to buy, right?
So you’re going to preprocess the data and remove the nonsense,
and drill right down to the stuff that’s really useful.
So this is preprocessing.
And this is going to be a kind of cycle of analysis and visualization and preprocessing,
and we can repeat these things and then we can really drill down and whittle down our data
into the most usable sort of core of knowledge that we can.
And get the most out of it.
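A minimal sketch of that kind of weeding out, on an invented click log. The columns, the products and the 30 second cutoff are all assumptions chosen for the example; a real shop would pick its own rules.

```r
# Invented click log with some missing values (NA) and some idle browsing
events <- data.frame(
  product         = c("saw", "drill", "yacht", "saw", NA, "hammer"),
  action          = c("view", "buy", "view", "buy", "view", "view"),
  seconds_on_page = c(40, 120, 3, 95, 10, NA),
  stringsAsFactors = FALSE
)

# Step 1: drop rows where anything is missing
complete <- events[complete.cases(events), ]

# Step 2: drop very short visits (assumed cutoff: under 30 seconds),
# such as the pipe-dream yacht that was glanced at for 3 seconds
engaged <- complete[complete$seconds_on_page >= 30, ]

print(engaged)
```

After each step like this you would normally go back and analyze and plot again, which is the cycle being described.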
Now it may be that just analysing the data is enough, right?
You’ve now sort of obtained some knowledge.
You kind of understand what the trends are,
and maybe that was all you wanted to do. That’s sometimes the case.
Maybe actually what we want to do is take things a little bit further
We’re going to use machine learning or modeling
to try and model this system and predict what’s going to happen next.
So for example in the case of an online shop,
we might want to start predicting what people are going to buy next
and if we can do that, that’s when we can send out these emails
or flag things in their recommended items and get many more sales.
As an example, let’s imagine that someone has spent a lot of time looking at DIY tools.
I’ve, you know, recently moved house and I’ve spent a lot of time doing DIY,
and I’m always trying to buy new tools because it just seems like a good idea.
So, you know, maybe I buy a certain kind of saw, and then, you know, a few months later,
they’re starting to recommend me a slightly different kind of saw that serves a slightly different purpose
that suddenly I definitely need to be doing, and I think, uh, yeah, maybe I will buy that,
and then in the end I have 10 saws and I don’t know how to use any of the saws.
But you know, the retailer’s job is done.
So if we want to extract knowledge from this data, we’re going to use machine learning or modeling
to model this system and make predictions.
So for example, we could cluster the data together.
We could link my purchase history with similar people.
What are they buying? Can I be tempted to buy those things as well, right?
Maybe I’m very different from someone else,
and so it’s not a good idea to recommend me certain products
because I’m unlikely to buy those things.
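Clustering shoppers like this can be sketched with k-means, which is one common clustering method; the video doesn’t say which method a retailer would actually use. The data is invented: two groups of 20 shoppers, one spending mostly on tools and one mostly on toys.

```r
set.seed(7)  # repeatable random numbers

# Invented spending profiles: each row is a shopper, each column a category
shoppers <- rbind(
  cbind(tools = rnorm(20, mean = 8, sd = 1), toys = rnorm(20, mean = 1, sd = 1)),
  cbind(tools = rnorm(20, mean = 1, sd = 1), toys = rnorm(20, mean = 8, sd = 1))
)

# Ask k-means for 2 groups; nstart = 10 retries from 10 random starting points
fit <- kmeans(shoppers, centers = 2, nstart = 10)

table(fit$cluster)  # how many shoppers ended up in each cluster
fit$centers         # the "typical" shopper for each cluster
```

Once each shopper has a cluster label, recommending what others in the same cluster bought is a simple lookup. Someone who sits far from every cluster centre is the "very different" customer for whom those recommendations are a poor bet.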
Perhaps use a different example. In the medical domain,
it’s quite common to classify people into kind of risk categories,
so that we can maybe use preventative treatments.
So every time I go to a doctor, they’re going to collect data on me:
what’s currently wrong with me, and what was wrong with me before,
and combine that with, you know, standard data
like how much exercise someone does, and you know their family history,
and what their stress levels are and things like this.
We can combine all these things to make a prediction as to what they are at risk of in the future,
so you know, heart disease or something else like this.
It could save someone’s life if you spot
that they’re at risk of a certain thing
and you can really advise that person to, you know, increase their level of exercise or alter their diet.
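As a hedged illustration of classifying into risk categories, here is logistic regression on entirely synthetic data. Nothing here is medical fact: the variables `exercise_hours` and `stress`, the coefficients used to generate the data and the 0.5 cutoff are all made up, and logistic regression is just one common choice of classifier, not necessarily what any clinic uses.

```r
set.seed(3)  # repeatable random numbers
n <- 200

# Synthetic patients: weekly exercise hours and a 0-10 stress score
exercise_hours <- runif(n, 0, 10)
stress <- runif(n, 0, 10)

# Made-up rule for generating outcomes: less exercise and more stress raise risk
true_risk <- plogis(-0.6 * exercise_hours + 0.5 * stress)
at_risk <- rbinom(n, 1, true_risk)
patients <- data.frame(exercise_hours, stress, at_risk)

# Fit a logistic regression: predicts the probability of being at risk
model <- glm(at_risk ~ exercise_hours + stress, data = patients, family = binomial)

# Classify a new (hypothetical) patient using an assumed 0.5 cutoff
new_patient <- data.frame(exercise_hours = 1, stress = 9)
risk <- unname(predict(model, newdata = new_patient, type = "response"))
category <- ifelse(risk > 0.5, "higher risk", "lower risk")
print(c(risk = risk))
print(category)
```

The model outputs a probability and the category comes from where you put the cutoff, so choosing that cutoff is a decision about how cautious you want to be, not something the data settles for you.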
There are two other terms that we come across, you know a lot, right?
So there’s data mining and big data.
Now, I’m not really sure what data mining is, because I don’t think anyone is.
It’s a bit of a buzzword.
Really, what data mining is is a combination of preprocessing your data
and maybe using clustering to extract some knowledge from it.
So it’s sort of a word that’s come to be used in place of those things.
If someone says they’re doing data mining, that’s what they’re doing.
They’re preprocessing and extracting some knowledge from their data
It’s a cool sounding word. You’re not actually “mining” anything, right?
You’re just doing what everyone else does on data.
Big data is the idea that maybe we collect a lot of examples of something, you know, a huge number,
or each of our examples is quite complicated and it has a lot of variables.
In that case, the amount of data we’ve got is sort of unwieldy.
So I would argue, perhaps, that big data is data that you can’t run on your laptop.
Like, you might be using cloud compute infrastructure or certainly parallel processing
in some way to preprocess and analyze this data.
So exactly where’s the line? How big is “big”?
I don’t know, but exactly where we draw the line is in some ways not really important.
The idea is just that, with the amount of data we as a species are now producing,
more and more of our data is becoming big data.
But, you know, exactly where the cutoff is doesn’t really matter.
What is data? I’m pretty sure that’s data.
Is this data, this picture? Or that data?
Is this data? What is data?