

DeepMind's AI Learns Imagination-Based Planning | Two Minute Papers #178

Dear Fellow Scholars, this is Two Minute Papers with Károly Zsolnai-Fehér.
A bit more than two years ago, the DeepMind guys implemented an algorithm that could play
Atari Breakout on a superhuman level by looking at the video feed that you see here.
And the news immediately took the world by storm.
This original paper is a bit more than 2 years old and has already been referenced in well
over a thousand other research papers.
That is one powerful paper!
This algorithm was based on a combination of a neural network and reinforcement learning.
The neural network was used to understand the video feed, and reinforcement learning
is there to come up with the appropriate actions.
This is the part that plays the game.
Reinforcement learning is very suitable for tasks where we are in a changing environment
and we need to choose an appropriate action based on our surroundings to maximize some
sort of score.
This score can be for instance, how far we’ve gotten in a labyrinth, or how many collisions
we have avoided with a helicopter, or any sort of score that reflects how well we’re
currently doing.
And this algorithm works similarly to how an animal learns new things.
It observes the environment, tries different actions and sees if they worked well.
If yes, it will keep doing that, if not, well, let’s try something else.
Pavlov’s dog with the bell is an excellent example of that.
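The observe-try-keep loop described above can be sketched as a tiny tabular Q-learner. This is a minimal illustration on a made-up five-state corridor world, far simpler than the neural-network agent the paper trains; all of the names and numbers here are invented for the example:

```python
import random

# Toy corridor world: states 0..4; reaching state 4 pays a reward of 1.0.
N_STATES, GOAL = 5, 4
ACTIONS = [+1, -1]  # step right / step left

def step(state, action):
    next_state = max(0, min(GOAL, state + action))
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL

# Tabular Q-learning: observe, try actions, and keep doing what works.
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, epsilon = 0.5, 0.9, 0.1  # learning rate, discount, exploration

random.seed(0)
for episode in range(200):
    s = 0
    done = False
    while not done:
        # Occasionally explore; otherwise exploit the best known action.
        if random.random() < epsilon:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s2, r, done = step(s, a)
        # Nudge the estimate toward the reward plus discounted future value.
        best_next = max(Q[(s2, act)] for act in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s2

# The learned greedy policy: always step right, toward the reward.
policy = [max(ACTIONS, key=lambda act: Q[(s, act)]) for s in range(GOAL)]
print(policy)
```

After a couple hundred episodes, stepping right has the highest estimated value in every state, so the greedy policy heads straight for the reward, exactly the "keep doing what worked" behavior described above.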
There are many existing works in this area, and reinforcement learning performs remarkably well for a number
of problems and computer games, but only if the reward comes relatively quickly after
the action.
For instance, in Breakout, if we miss the ball, we lose a life immediately, but if we
hit it, we’ll almost immediately break some bricks and increase our score.
This is more than suitable for a well-built reinforcement learner algorithm.
However, this earlier work didn’t perform well on any other games that required long-term planning.
If Pavlov gave his dog a treat for something that it did two days ago, the animal would
have no clue as to which action led to this tasty reward.
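One standard way to quantify this credit-assignment problem (the discount factor is a textbook reinforcement-learning concept, not something spelled out in the episode) is that a reward arriving k steps after the action is weighted by gamma to the power k, which decays toward zero quickly:

```python
gamma = 0.99  # a typical discount factor

# Weight given to a reward that arrives k steps after the action that caused it.
for k in (1, 10, 100, 1000):
    print(f"{k:>5} steps later -> weight {gamma ** k:.6f}")
```

By a thousand steps the weight is essentially zero, which is the dog's "two days later" problem in numbers.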
And this work’s subject is a game where we control this green character and our goal
is to push the boxes onto the red dots.
This game is particularly difficult, not only for algorithms but even for humans, because of
two important reasons: one, it requires long-term planning, which, as we know, is a huge issue
for reinforcement learning algorithms.
Just because a box is next to a dot doesn’t mean that it is the one that belongs there.
This is a particularly nasty property of the game.
And two, some mistakes we make are irreversible; for instance, pushing a box into a corner can
make it impossible to complete the level.
If we have an algorithm that tries a bunch of actions and sees if they stick, well, that’s
not going to work here!
It is now hopefully easy to see that this is an obscenely difficult problem, and the
DeepMind guys just came up with Imagination-Augmented Agents as a solution for it.
So what is behind this really cool name?
The interesting part about this novel architecture is that it uses imagination, which is a routine
to cook up not only one action, but entire plans consisting of several steps, and finally, chooses the one
that has the greatest expected reward over the long term.
It takes information about the present and imagines possible futures, and chooses the
one with the most handsome reward.
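That imagine-then-choose loop can be sketched in a few lines. Note that the paper learns its environment model with neural networks; here a hand-written model stands in for it, and the track, rewards, and plan length are all made up for illustration:

```python
import itertools

# Made-up 1-D track: the agent starts at 0, the goal sits at +3, and
# stepping past +/-4 falls off the track -- an irreversible mistake.
ACTIONS = [-1, +1]
GOAL, EDGE = 3, 4

def imagine_step(pos, action):
    """Internal model of the world: predicts the next state and reward.
    (The paper learns this model; here it is hand-written.)"""
    nxt = pos + action
    if abs(nxt) > EDGE:
        return nxt, -10.0, True   # fell off: big penalty, rollout over
    if nxt == GOAL:
        return nxt, 1.0, True     # reached the goal
    return nxt, 0.0, False

def imagined_return(plan, pos=0):
    """Mentally roll out an entire plan and add up the imagined rewards."""
    total = 0.0
    for action in plan:
        pos, reward, done = imagine_step(pos, action)
        total += reward
        if done:
            break
    return total

# Imagine every possible 5-step plan, then act on the most promising one.
plans = list(itertools.product(ACTIONS, repeat=5))
best = max(plans, key=imagined_return)
print(best, imagined_return(best))
```

Because the rollouts happen in imagination rather than in the real environment, plans that fall off the track are rejected without the agent ever making the irreversible mistake for real, which is exactly what the box-pushing game punishes.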
And as you can see, this is only the first paper on this new architecture and it can
already solve a problem with seven boxes.
This is just unreal.
Absolutely amazing work.
And please note that this is a fairly general algorithm that can be used for a number of
different problems.
This particular game was just one way of demonstrating the attractive properties of this new technique.
The paper contains more results and is a great read, make sure to have a look.
Also, if you’ve enjoyed this episode, please consider supporting Two Minute Papers on Patreon.
Details are available in the video description, have a look!
Thanks for watching and for your generous support, and I’ll see you next time!