
Hindsight Experience Replay | Two Minute Papers #192

Dear Fellow Scholars, this is Two Minute Papers with Károly Zsolnai-Fehér.

Reinforcement learning is an awesome algorithm that is able to play computer games, navigate helicopters, hit a baseball, or even defeat Go champions when combined with a neural network and Monte Carlo tree search. It is a quite general algorithm that is able to take on a variety of difficult problems that involve observing an environment and coming up with a series of actions to maximize a score.
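
Not from the video, but to make that observe-act-score loop concrete, here is a minimal Python sketch. It assumes the classic Gym-style `env.reset()` / `env.step()` interface (newer Gym versions return extra values), and the environment choice and the random placeholder policy are purely illustrative:

```python
# A minimal sketch of the reinforcement learning loop described above,
# using the classic (pre-0.26) Gym API. The random policy is a
# placeholder where a real learning agent would go.
import gym

env = gym.make("CartPole-v1")           # illustrative environment choice
observation = env.reset()
total_score = 0.0

for _ in range(500):
    action = env.action_space.sample()  # a trained agent picks actions to maximize the score
    observation, reward, done, info = env.step(action)
    total_score += reward               # the running score the learner tries to maximize
    if done:
        break

print("episode score:", total_score)
```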
In a previous episode, we had a look at DeepMind's algorithm where a set of movement actions had to be chosen to navigate a difficult 3D environment efficiently. The score to be maximized was the distance measured from the starting point: the further our character went, the higher the score it was given, and it successfully learned the concept of locomotion. Really cool!
A prerequisite for a reinforcement learner to work properly is that it has to be given informative reward signals. For instance, when we take a written exam, we would like to get back a detailed breakdown of the number of points we got for each problem. This way, we know where we did well and which kinds of problems need some more work.

However, imagine having a really careless teacher who never tells us the points, but would only tell us whether we failed or passed. No explanation, no points for individual tasks, no telling whether we failed by a lot or just by a tiny bit. Nothing.
First attempt: we failed. Next time, we failed again. And again and again and again. Now, this would be a dreadful learning experience because we would have absolutely no idea what to improve. Clearly, this teacher would have to be fired.
However, when formulating a reinforcement learning problem, it is much easier to just tell whether the algorithm was successful or not than to provide more informative scores. It is very convenient for us to be this careless teacher. Otherwise, what score would make sense for a helicopter control problem when we almost crash into a tree?

This part is called reward engineering, and the main issue is that we have to adapt the problem to the algorithm, when ideally the algorithm would adapt to the problem. This has been a long-standing problem in reinforcement learning research, and a potential solution would open up the possibility of solving even harder and more interesting problems with learning algorithms.
And this is exactly what researchers at OpenAI set out to solve by introducing Hindsight Experience Replay, HER, or "her" for short. Very apt.

This algorithm takes on problems where the scores are binary, which means that the agent either passed or failed the prescribed task. A classic careless teacher scenario. And these rewards are not only binary, but very sparse as well, which further exacerbates the difficulty of the problem.
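
To make "binary and sparse" concrete, a reward of this kind is often written as a bare success indicator and nothing else. A sketch, where the tolerance `eps` is an illustrative choice rather than a value from the paper:

```python
import numpy as np

def sparse_reward(achieved_goal, desired_goal, eps=0.05):
    """Binary, sparse feedback: -1 on failure, 0 on success.

    There is no partial credit, so the agent never learns *how close*
    it was -- exactly the careless-teacher situation described above.
    """
    failed = np.linalg.norm(np.asarray(achieved_goal) - np.asarray(desired_goal)) > eps
    return -1.0 if failed else 0.0
```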
In the video, you can see a comparison with a previous algorithm, with and without the HER extension. The higher the number of epochs you see above, the longer the algorithm was able to train. The incredible thing here is that it is able to achieve a goal even if it had never been able to reach it during training.

The key idea is that we can learn just as much from undesirable outcomes as from desirable ones. Let me quote the authors:
"Imagine that you are learning how to play hockey and are trying to shoot a puck into a net. You hit the puck but it misses the net on the right side. The conclusion drawn by a standard reinforcement learning algorithm in such a situation would be that the performed sequence of actions does not lead to a successful shot, and little (if anything) would be learned. It is however possible to draw another conclusion, namely that this sequence of actions would be successful if the net had been placed further to the right."

They have achieved this by storing and replaying previous experiences with different potential goals.
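
As a sketch of that idea (not OpenAI's actual implementation; the buffer and episode layout here are made up for illustration), hindsight relabeling can look like this, where `compute_reward` could be the `sparse_reward` sketch above:

```python
def store_with_hindsight(replay_buffer, episode, compute_reward):
    """Store each transition twice: once with the original goal, and once
    relabeled with the goal we actually achieved by the end of the episode,
    so a miss becomes a success for that substituted goal.

    episode: list of (state, action, next_state, achieved_goal, desired_goal)
    """
    hindsight_goal = episode[-1][3]  # what we actually reached, e.g. where the puck ended up
    for state, action, next_state, achieved, desired in episode:
        # Original transition: usually reward -1, since the real goal was missed.
        replay_buffer.append((state, desired, action,
                              compute_reward(achieved, desired), next_state))
        # Hindsight transition: pretend the achieved outcome was the goal all along.
        replay_buffer.append((state, hindsight_goal, action,
                              compute_reward(achieved, hindsight_goal), next_state))
```

Substituting the final achieved state, as above, is the simplest relabeling strategy; the paper also explores sampling substitute goals from other states visited in the same episode.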
As always, the details are available in the paper; make sure to have a look. Now, it is always good to test whether the whole system works well in software; however, its usefulness has also been demonstrated by deploying it on a real robot arm. You can see the goal written on the screen alongside the results.

A really cool piece of work that can potentially open up new ways of thinking about reinforcement learning. After all, it's great to have learning algorithms that are so good, they can solve problems that we formulate in such a lazy way that we'd have to be fired.

And here's a quick question: do you think 8 of these videos a month is worth a dollar? If you have enjoyed this episode and your answer is yes, please consider supporting us on Patreon. Details are available in the video description. Thanks for watching and for your generous support, and I'll see you next time!


Translation info
Video overview: An introduction to the principles of reinforcement learning, and how this algorithm lets a robot find solutions to problems from its previous experiences.
Transcription: collected from the web
Translation: 吾家黄姑娘
Reviewer: 审核员1024
Video source: https://www.youtube.com/watch?v=Dvd1jQe3pq0
