Dear Fellow Scholars, this is Two Minute Papers with Károly Zsolnai-Fehér.
In our earlier episodes, when it came to learning techniques,
we almost always talked about supervised learning.
This means that we give the algorithm a bunch of images, and some additional information,
for instance, that these images depict dogs or cats.
Then, the learning algorithm is exposed to new images that it had never seen before and
has to be able to classify them correctly.
It is kind of like a teacher sitting next to a student, providing supervision.
Then, the exam comes with new questions.
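The teacher-and-exam idea can be sketched in a few lines. This is a toy illustration only, assuming made-up feature vectors and a simple nearest-centroid classifier; real systems use far richer models, but the supervised train-then-test loop is the same.

```python
# A toy sketch of supervised learning: we "supervise" a nearest-centroid
# classifier with labeled feature vectors, then test it on examples it
# has never seen. Features and labels are invented for illustration.

def train(examples):
    """Average the feature vectors of each class into a centroid."""
    sums, counts = {}, {}
    for features, label in examples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, x in enumerate(features):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [x / counts[label] for x in acc]
            for label, acc in sums.items()}

def classify(centroids, features):
    """The 'exam' phase: assign the label of the nearest centroid."""
    def dist(label):
        return sum((a - b) ** 2 for a, b in zip(centroids[label], features))
    return min(centroids, key=dist)

# Labeled training set: (features, label) pairs, e.g. [ear_size, snout_size].
labeled = [([0.9, 0.8], "dog"), ([1.0, 0.7], "dog"),
           ([0.2, 0.3], "cat"), ([0.3, 0.2], "cat")]
centroids = train(labeled)

# A new, never-before-seen example has to be classified correctly.
print(classify(centroids, [0.85, 0.75]))  # prints "dog"
```

Every training example needs a human-provided label, which is exactly the costly annotation step discussed next.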
This is supervised learning, and as you have seen from more than 180 episodes of Two Minute Papers,
there is no doubt that it is an enormously successful field of research.
However, this means that we have to label our datasets,
so we have to add some additional information to every image we have.
This is a very laborious task, which is typically performed by researchers or through crowdsourcing,
both of which take a lot of funding and hundreds of work hours.
But if we think about it, we have a ton of videos on the internet,
you always hear these mind-melting new statistics
on how many hours of video footage are uploaded to YouTube every day.
Of course, we could hire all the employees in the world
to annotate these videos frame by frame to tell the algorithm
that this is a guitar, this is an accordion, or a keyboard,
and we would still not be able to learn on most of what’s uploaded.
But it would be so great to have an algorithm that can learn on unlabeled data.
Fortunately, there are learning techniques in the field of unsupervised learning,
which means that the algorithm is given a bunch of images, or any media,
and is instructed to learn on it without any additional information.
There is no teacher to supervise the learning.
The algorithm learns by itself.
And in this work, the objective is to learn both visual
and audio-related tasks in an unsupervised manner.
So for instance, if we look at this layer of the visual subnetwork,
we’ll find neurons that get very excited when they see,
for instance, someone playing an accordion.
And each of the neurons in this layer belongs to a different object class.
I surely have something like this for papers.
And here comes the Károly goes crazy part one:
this technique not only classifies the frames of the videos,
but it also creates semantic heatmaps,
which show us which part of the image is responsible for the sounds that we hear.
This is insanity!
To accomplish this, they ran a vision subnetwork on the video part,
and a separate audio subnetwork to learn about the sounds,
and at the last step, all this information is fused together.
Which brings us to Károly goes crazy part two:
this makes the network able to guess
whether the audio and the video stream correspond to each other.
It looks at a man with a fiddle, listens to a sound clip,
and says whether the two correspond to each other.
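The fusion step can be sketched with a toy example. To be clear, this is not DeepMind's actual network: here the two "subnetworks" are stand-ins that hand us made-up embedding vectors, and the fusion is a simple cosine-similarity check, just to show what "do these two streams correspond?" means computationally.

```python
# A toy sketch of audio-visual correspondence (not the paper's actual
# architecture): each subnetwork maps its input to an embedding, and a
# fused head scores whether a video frame and an audio clip match.
# The embedding vectors below are invented for illustration.

import math

def cosine(u, v):
    """Similarity of two embeddings; high when the streams agree."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def corresponds(video_emb, audio_emb, threshold=0.5):
    """Fusion step: declare a match when the embeddings agree enough."""
    return cosine(video_emb, audio_emb) > threshold

# Pretend subnetwork outputs: the vision side saw a fiddle, the audio
# side heard either fiddle music or running water.
fiddle_frame = [0.9, 0.1, 0.0]
fiddle_sound = [0.8, 0.2, 0.1]
water_sound  = [0.0, 0.1, 0.9]

print(corresponds(fiddle_frame, fiddle_sound))  # prints True
print(corresponds(fiddle_frame, water_sound))   # prints False
```

The key point is that matched and mismatched pairs can be generated from the videos themselves, which is why no human labels are needed.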
The audio subnetwork also learned the concept of human voices,
the sound of water, wind, music, live concerts and much, much more.
So, does this work? The answer is yes: it is remarkably close to human-level performance on sound classification.
And all this comes from two networks that were trained from scratch,
and no supervision was required.
We don’t need to annotate these videos.
And please don’t get this wrong,
it’s not like DeepMind has suddenly invented unsupervised learning, not at all.
This is a field that has been actively researched for decades,
it’s just that we rarely see really punchy results like these ones here.
Truly incredible work.
If you enjoyed this episode, and you feel that 8 of these videos a month is worth a dollar,
please consider supporting us on Patreon.
Details are available in the video description.
Thanks for watching and for your generous support, and I’ll see you next time!