Let’s imagine that you work for
a major streaming media provider, right?
So you have, say, some 100 million subscribers,
and you’ve got maybe ten thousand videos on your site,
or many more audio files, right?
So for each user you’re going to have collected information
on what they’ve watched,
when they watched it, how long they watched it for,
whether they went from this one to this one.
Did that work? Was that good for them?
And so maybe you’ve got 30,000 data points per user
We’re now talking about trillions of data points
and your job
is to try and predict what someone wants to watch or listen to next.
Best of luck.
<05 - Data Reduction>
So we’ve cleaned the data, we’ve transformed our data
everything’s on the same scale
we’ve joined datasets together
The problem is that, because we’ve joined datasets together,
our dataset has perhaps got quite large,
or maybe we just work for a company that has a lot of data.
Certainly the general consensus these days
is to collect as much data as you can;
this isn’t always a good idea.
What we want, remember,
is the smallest, most compact and useful dataset we can get,
otherwise you’re just going to be wasting CPU or GPU hours
training on it, wasting time.
We want to get to the knowledge as quickly as possible
and if you can do that with a small amount of data
that’s going to be great.
So we’ve got quite an interesting dataset to look at today based on music.
It’s quite common these days when you’re building something like a streaming service
for example Spotify
You might want to have a recommender system
This is an idea where you’ve maybe clustered people
who are similar in their tastes,
you know what kind of music they’re listening to
and you know the attributes of that music
and if you know that
you can say well this person likes high tempo music
So maybe they’d like this track as well.
And this is how playlists are generated.
One of the problems is that you’re going to have to
produce descriptions of the audio,
things like tempo and how upbeat a track is,
in order to do machine learning on this kind of system, alright.
And that’s what this dataset is about.
So we’ve collected a dataset here today.
There’s lots and lots of metadata on music tracks, right.
Now these are freely available tracks and freely available data,
we’ll put a link in the description if you want to have a look at it yourself
I’ve cleaned it up a bit already
because obviously I’ve been through the process of cleaning and transforming my data.
So we’re going to load this now; it takes quite a long time,
because there are quite a lot of attributes and quite a lot of instances.
It’s loaded, right?
How much data is this?
Well, we’ve got 13,500 observations
and we’ve got 762 attributes, right?
Another way of putting this, in machine learning parlance,
is that we’ve got 13,500 instances and 762 features.
Now these features are a combination of things.
So let’s have a quick look at the columns we’re looking at
so we can see what this dataset’s about.
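In R, the load-and-inspect step might look something like this; the file name here is hypothetical, since the exact file depends on where you download the data from:

```r
# Hypothetical file name; use whatever you downloaded (link in the description).
music_all <- read.csv("music_features.csv")

dim(music_all)    # number of observations and attributes
names(music_all)  # peek at the column names
```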
So we run names(music_all), right,
and we’ve got some 762 features or attributes,
and you can see there’s a lot of slightly meaningless text here.
But if we look at the top you’ll see
some actual things that may be familiar to us.
So we’ve got the track ID, the album ID, the genre, right?
Genre is an interesting one,
because maybe we can start to use
some of these audio descriptions to predict what genre the music is, or something like that.
Then there are things like the track number and the track duration,
and then we get onto the actual audio description features.
Now, these have been generated by two different libraries.
The first is called Librosa,
which is a publicly available library
for taking an MP3 and calculating musical attributes of it.
What we’re trying to do here
is represent our data in terms of attributes.
An MP3 file is not an attribute; it’s a lot of data.
So can we summarize it in some way?
Can we calculate, by looking at the MP3,
what the tempo is,
what the amplitude is, how loud the track is, these kinds of things?
This is the kind of thing we’re measuring.
And a lot of these are going to go into a lot of detail
down at kind of a waveform level.
So we have the Librosa features first,
and then if we scroll down
after a while we get to some Echo Nest features.
Echo Nest is a company
that produces very interesting features on music.
Actually, these are the features that
power Spotify’s recommender system, and numerous others.
We’ve got things like acousticness.
How acoustic does it sound?
We’ve got instrumentalness.
I’m not convinced that’s a word.
That is, to what extent is it speech or not speech, alright.
And then things like tempo, how fast it is,
and how happy it sounds, right.
A track scoring zero would be
quite sad, I guess,
and a track scoring one would be really happy and upbeat.
And then of course we’ve got a load of features
I’ve labelled temporal here,
and these are going to be based on the actual music data themselves.
Often when we talk about data reduction,
what we’re actually doing is dimensionality reduction, alright.
One way of thinking about it is this:
so far we’ve been looking at things like attributes,
and we’ve been asking what the mean or standard deviation
of some attribute is in our data.
Right. But actually when we start to talk about clustering
and machine learning
we’re going to talk a little bit more about dimensions.
Now, in many ways,
the number of attributes is the number of dimensions.
It’s just another term for the same thing.
But certainly from a machine learning background,
we refer to a lot of these things as dimensions.
So you can imagine you’ve got some data here.
You’ve got your instances down here
and you’ve got your attributes across here.
So in this case, our music data, we’ve got each song.
So this is song one, this is song two, song three,
and then all the attributes:
the Echo Nest attributes, the tempo and things like this.
These are all dimensions in which this data can vary.
So two tracks can be different in the first dimension, which is the track ID,
but they can also be different down here in this dimension, which is the tempo.
When we say some data is seven hundred dimensional
what that actually means is it has seven hundred different ways
or different attributes in which it can vary.
And you can imagine that, first of all,
this is going to get quite big quite quickly;
seven hundred attributes seems like a lot to me.
And depending on what algorithm you’re running,
it can get quite slow
when you’re working with data of this size.
And you can imagine this is a relatively small dataset
compared to what Spotify might deal with on a daily basis.
But another way to think about this data is actually
as points in a space.
So we have some 700 different attributes
in which a track can vary,
and when we take a specific track,
it sits somewhere in this space
So if we were looking at it in just two dimensions, you know
track one might be over here,
and track two over here and track three over here
And then in three dimensions,
track four might be back here at the back.
You can imagine that the more dimensions we add,
the more spread out these things are going to get.
But we can still do all the same things
in 700 dimensions that we can in three;
it just takes a little bit longer.
So one of the problems is that
some machine learning methods don’t like having too many dimensions.
So things like linear regression can get quite slow
if you have tens of thousands of attributes or dimensions
So remember that the default response of anyone collecting data
is just to collect it all and worry about it later.
Right, well, this is the time when you have to worry about it.
What we’re trying to do is
remove any redundant variables.
If you’ve got two attributes of your music
like tempo and valence,
that turn out to be exactly the same,
why use both? We’re just making our problem a little bit harder, right.
Now, in actual fact, Echo Nest features are pretty good;
they don’t tend to correlate that strongly.
But you might find, where you’ve collected some data at a big scale,
that a lot of the variables are very, very similar all the time,
and you can just remove some of them
or combine some of them together
and just make your problem a little bit easier.
So let’s look at this on the music dataset and see what we can do.
So the first thing we can do is remove duplicates, right.
It sounds like an obvious one,
and perhaps one that we could also do during cleaning,
but exactly when you do it doesn’t really matter, as long as you’re paying attention.
What we’re going to say is music_all equals unique(music_all),
and what that’s going to do is find any duplicate rows
and remove them.
The number of rows we’ve got will drop by some amount. Let’s see.
This is where you need a timer.
Actually, this is quite a slow process;
you’ve got to consider that we’re going to look through every single row
and try to find any other rows that match.
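As a one-liner in R, the deduplication step is just:

```r
nrow(music_all)                 # row count before
music_all <- unique(music_all)  # keep only rows that are unique across every column
nrow(music_all)                 # row count after; it should drop slightly
```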
Okay, so this has removed about 40 rows,
so that means we had some duplicate tracks.
You can imagine that
things might get accidentally added to the database twice,
or maybe two tracks are actually identical
because they were released multiple times or something like this.
Now, what the unique function is actually doing
is finding rows that are exactly the same
for every single attribute, or every single dimension. Of course, in practice,
you might find that you have two versions of the same track
which differ by one second;
they might have slightly different attributes.
Hopefully they’ll be very, very similar.
So what we could also do is have a threshold where we said
these are too similar,
they’re the same thing.
The name is the same
the artist is the same
and the audio descriptors are very very similar,
maybe we should just remove one of them, right
This is the other thing you could do.
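As a rough sketch of that idea (this isn’t the demo’s code), you could compare tracks on their scaled numeric features and flag pairs that fall within some distance threshold; the 0.5 cutoff below is arbitrary:

```r
# A sketch only: flag near-duplicate pairs by the distance between
# their (scaled) numeric audio features. Assumes no missing values.
num <- scale(music_all[, sapply(music_all, is.numeric)])
d <- as.matrix(dist(num))               # beware: memory grows with the square of the row count
diag(d) <- Inf                          # ignore each row's match with itself
near <- which(d < 0.5, arr.ind = TRUE)  # candidate near-duplicates to review
```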
Just for demonstration, what we’re going to do is focus on
just a few of the genres in this dataset, right,
just to make things a little bit clearer for visualization.
We’re going to select just the classical jazz pop
and spoken-word genres, right,
because these have a good distribution of different amounts in the dataset.
So we’re going to run that.
We’re creating a list of genres.
We’re going to say music is music_all
wherever the genre is in
that list of genres we just produced, right,
and that’s going to produce a much smaller dataset
of 1,600 observations
with the same number of attributes or dimensions.
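The selection itself is only a couple of lines; the column name genre and the exact genre labels are assumptions about how this particular file is coded:

```r
genres <- c("Classical", "Jazz", "Pop", "Spoken")   # labels assumed
music  <- music_all[music_all$genre %in% genres, ]  # keep matching rows
dim(music)                                          # ~1,600 rows, same columns
```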
Now normally you would also keep most of your data in,
this is just for a demonstration.
But removing genres that aren’t useful to you for your experiment
is a perfectly reasonable way
of reducing your data size if that’s a problem.
Assuming they’ve been labeled right in the first place.
Right, that’s on someone else. That’s someone else’s job.
Let’s imagine that 1,600 rows is still too many.
Now actually computers are getting pretty quick.
Maybe 1,600 observations is fine,
but perhaps we want to remove some more.
The first thing we could do is just chop the data off halfway
and keep about half of it.
So let’s try that first of all,
So we’re going to say that our first chunk of music,
that’s the first few rows of our music,
is rows 1 to 835
and all the columns.
So we’re going to run that.
And that’s even smaller.
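That slice is just row indexing; music_first is an assumed variable name:

```r
music_first <- music[1:835, ]  # rows 1 to 835, all columns
nrow(music_first)
```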
Right so we can start to whittle down our data.
This is not necessarily a good idea.
We’re assuming here
that our genres are,
you know, randomly distributed through our dataset.
That might not be true.
You might have all the rock first and then all the pop or something like that.
If you take just the first rows,
you’re just going to get all the rock,
and depending on what you like, that might not be for you.
So let’s plot the genres in the normal dataset,
and you can see that we’ve got very little spoken word,
but it is there.
We have some classical, international, jazz and pop
in sort of roughly the same amount.
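A minimal way to eyeball those counts in base R, assuming the genre column name:

```r
barplot(table(music$genre), las = 2)        # genre counts in the filtered data
barplot(table(music_first$genre), las = 2)  # same plot for the chopped version
```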
If we plot after we’ve selected the first 50%,
you can see we’ve lost two of the genres, right;
we only have classical, international and jazz,
and there’s hardly any jazz.
That’s not a good idea.
So don’t do that unless you know
that your data is randomized.
So this is not giving us a good representation of genres.
If we wanted to predict genre,
for example, based on the musical features,
cutting out half the genres seems like an unwise decision.
So a better thing to do would be
to sample randomly from the dataset.
So what we’re going to do
is we’re going to use the sample function
to give us 835 random indices into this data
and then we’re going to use that
to index our music data frame instead.
Alright, that’s this line here.
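That line, roughly, using base R’s sample function (music_rand is an assumed name):

```r
idx <- sample(nrow(music), 835)  # 835 random row numbers, without replacement
music_rand <- music[idx, ]       # index the data frame with them
```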
And hopefully this will give us a better distribution
if we plot the original again,
it looks like this
and you can see we’ve got a broad distribution
and then if we plot the randomized version
You can see we’ve still got some spoken word.
It’s actually gone up slightly,
but the distributions are broadly the same.
So this has worked exactly how we want.
So how you select your data
if you’re trying to make it a little bit smaller is very very important.
And consider: obviously we only had 1,600 rows here,
and even the whole dataset is only 13,500 rows,
but you could imagine that you might have tens of millions of rows,
and you’ve got to think about this before you just start getting rid of them completely.
Randomized sampling is a perfectly good way of selecting your data.
Obviously, it has a risk:
if the distributions of your genres are a little bit off,
and maybe you haven’t got very much of a certain genre,
you can’t guarantee
that the distributions are going to be the same on the way out.
And if you’re trying to predict genre,
that’s going to be a problem.
So perhaps the best approach is stratified sampling.
This is where we try and maintain
the distribution of our classes.
So for example in this case genre.
So say we had 50% rock,
30% pop and 20% spoken word,
and we want to maintain that kind of distribution
on the way out, even if we only sample 50% of the rows, right?
This is a little bit more complicated in R, but it can be done.
And it’s a good approach if you want to make absolutely sure the distributions
of your sample data are the same as your original data.
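As a sketch, one way to do stratified sampling in R is with the dplyr package’s group_by and slice_sample; the 50% proportion and the genre column are assumptions for illustration:

```r
library(dplyr)

music_strat <- music %>%
  group_by(genre) %>%           # stratify on the class label
  slice_sample(prop = 0.5) %>%  # keep 50% of each genre
  ungroup()

table(music_strat$genre)        # proportions should mirror the original data
```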
We’ve just looked at some ways
we can reduce the size of our dataset
in terms of the number of instances, or the number of rows.
Can we also make the number of dimensions,
or the number of attributes, smaller?
Because that’s often one of the problems.
And the answer is yes.
There are lots of different ways we can do this,
some more powerful and useful than others.
One of the ways we can do this is something called correlation analysis.
So a correlation between two attributes basically tells us that
when one of them increases
the other one either increases or decreases in general in relation to it.
So you might have some data like this with attribute one
and we might have attribute two,
and they sort of look like this.
These are the data points for all of our different tracks.
Obviously we’ve got a lot of data points
and you can see that roughly speaking
they kind of increase in this, sort of direction here like this.
Now it might be that, if this correlation is very, very strong,
attribute two is more or less a copy of attribute one.
Maybe it doesn’t make sense to have attribute two in our dataset.
Maybe we can remove it without too much of a problem.
Alright. What we can do is something called correlation analysis, where we
pit all of the attributes against all of the other attributes,
look for high correlations, and decide,
ourselves, whether to remove them.
Now sometimes it’s useful just to keep everything in
and try not to remove them too early
But on the other hand, if you’ve got a huge amount of data
and your correlations are very high,
this could be one way of doing it.
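One hedged way to automate that pruning in R is the caret package’s findCorrelation, which suggests columns to drop above a cutoff you choose yourself:

```r
library(caret)

num    <- music[, sapply(music, is.numeric)]       # numeric attributes only
cormat <- cor(num, use = "pairwise.complete.obs")  # all-vs-all correlations
drop   <- findCorrelation(cormat, cutoff = 0.9)    # the cutoff is a judgment call
if (length(drop) > 0) num <- num[, -drop]          # guard: -integer(0) would keep nothing
```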
Another option is something called forward or backward attribute selection.
Now this is the idea that
maybe we have a machine learning model or clustering algorithm in mind
we can measure the performance of that,
and then we can remove features,
and see if the performance remains the same.
Because if it does
maybe we didn’t need those features.
So what we could do is train our model on, let’s say, a 720-dimensional dataset,
and get a certain level of accuracy and record that.
Then we could try it again, removing one of the dimensions
and training on seven hundred and nineteen,
and maybe the accuracy is exactly the same
in which case we can say,
well, we didn’t really need that dimension at all,
and we can start to whittle down our data this way.
Another option is forwards attribute selection.
This is where we literally train our machine learning model on just one of the attributes,
and then we see what our accuracy is,
and we keep adding attributes in and retraining
until our performance plateaus,
and we can say you know what?
We’re not gaining anything now by adding more attributes.
Obviously, there’s the question of which order you try this in.
So for backwards attribute selection, for example,
what you would do is train on all the data.
You take one out at random,
and if your performance stays the same, you can leave it out.
If your performance gets much worse,
you put it back in and you don’t try that one again.
And you try a different one.
And you slowly start to take dimensions away
and hopefully whittle down your data.
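In code, a greedy backward pass might look like the sketch below; evaluate() is a hypothetical helper that trains your model on the given columns (say, with cross-validation) and returns an accuracy score:

```r
# A sketch of greedy backward attribute selection, not a library routine.
# evaluate() is hypothetical: train on these columns, return accuracy.
backward_select <- function(data, features, evaluate, tol = 0.005) {
  baseline <- evaluate(data[, features, drop = FALSE])
  for (f in sample(features)) {                 # try candidates in random order
    remaining <- setdiff(features, f)
    score <- evaluate(data[, remaining, drop = FALSE])
    if (score >= baseline - tol) {              # performance held up: drop f
      features <- remaining
      baseline <- max(baseline, score)
    }                                           # otherwise keep f; don't retry it
  }
  features
}
```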
Let’s have a quick look at correlation analysis on this dataset.
You might imagine that
if we’re calculating features based on the mp3
from Librosa or Echo Nest,
maybe they’re quite similar a lot of the time.
And maybe we can remove them.
Let’s have a quick look.
So we’re just going to focus on
one set of the Librosa features, just for simplicity.
So we’re going to select only the attributes that contain
this chroma kurtosis field,
which is one of the attributes that you can calculate using Librosa.
So I’m going to run that.
We’re going to rename them, just for simplicity,
to kurt1, kurt2, kurt3 and so on.
And then we’re going to calculate a correlation matrix
of each of these different features versus each other,
Okay, finally, we’re going to plot this
and see what it looks like.
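Those steps look roughly like this; the column-name pattern is an assumption about how the features are labelled in the file:

```r
kurt <- music[, grepl("kurtosis", names(music))]  # columns with 'kurtosis' in the name
names(kurt) <- paste0("kurt", seq_along(kurt))    # rename to kurt1, kurt2, ...

cormat <- cor(kurt)                                    # each feature versus each other
round(cormat, 2)                                       # scan for the strong pairs
heatmap(cormat, Rowv = NA, Colv = NA, scale = "none")  # quick visual; corrplot is nicer
```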
Hopefully we can find some strong correlations,
which would give us candidates
for removing a few of these dimensions, if they’re redundant.
And it’s not too bad. You can see that we’ve got, for example, kurt7 here.
So index 7 is fairly similar to index 8;
that’s a correlation of 0.65.
Maybe that means we could remove one of those two.
This one here is 0.59.
We’ve got a 0.48 over here.
These are fairly high correlations.
If you’re really stretched for CPU time,
or you’re worried about the size of your dataset,
this is the kind of thing you could do to remove them.
Of course, whether 0.65 is a strong enough correlation
that you want to completely delete one of these dimensions
is really up to you, and it’s going to depend on your situation.
One of the reasons the correlations aren’t as high as you might think
is that these libraries have been designed with this in mind.
If Echo Nest just produced 200 features that were exactly the same,
it wouldn’t be very useful for picking playlists.
So they’ve produced 200 features that are widely different,
so they’re not necessarily going to correlate all the time, right?
That’s the whole point, and that’s a really useful property of this data.
We’ve looked at some ways we can try and make our dataset a little bit smaller.
Remember, our ultimate goal
is the smallest, most useful dataset we can get our hands on, right.
Then we can put that into machine learning or clustering
and really extract some knowledge.
The problem is that
what we might do, based on correlation analysis
or forward and backward attribute selection,
is just delete data.
And maybe the correlation wasn’t one;
maybe it wasn’t completely redundant.
Do we actually want to completely remove this data?
Is there another way we can transform our data
so we can make more informed, and more effective, decisions
about what we remove?
There is: it’s PCA, or principal component analysis.
At the moment, we’re just fitting one line through our two-dimensional data;
there are going to be more principal components later, right?
But what we want to do is pick the direction through this data,
however many attributes it has, that has the most spread.
So how do we measure this? Well quite simply…
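As a preview, the usual starting point in R is prcomp; a minimal sketch, assuming the numeric columns have no missing or constant values:

```r
num <- music[, sapply(music, is.numeric)]
pca <- prcomp(num, center = TRUE, scale. = TRUE)  # scale: attributes use different units
summary(pca)  # proportion of variance captured by each principal component
```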