Dear Fellow Scholars, this is Two Minute Papers with Károly Zsolnai-Fehér.
Reinforcement learning is an awesome algorithm that is able to play computer games, navigate
helicopters, hit a baseball, or even defeat Go champions when combined with a
neural network and Monte Carlo tree search.
It is quite a general algorithm that is able to take on a variety of difficult problems
that involve observing an environment and coming up with a series of actions to maximize a score.
In a previous episode, we had a look at DeepMind’s algorithm where a set of movement actions
had to be chosen to navigate in a difficult 3D environment efficiently.
The score to be maximized was the distance measured from the starting point:
the further our character went, the higher the score it was given,
and it successfully learned the concept of locomotion.
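This distance-based score can be sketched in a few lines. The function below is a hypothetical illustration of such a reward signal, not DeepMind's actual implementation: it simply measures how far the agent has moved from where it started.

```python
import numpy as np

def locomotion_reward(start_pos, current_pos):
    """Hypothetical dense reward: Euclidean distance travelled from the
    starting point, so moving further always yields a higher score."""
    start = np.asarray(start_pos, dtype=float)
    current = np.asarray(current_pos, dtype=float)
    return float(np.linalg.norm(current - start))
```

Because this reward grows smoothly with progress, every small improvement in locomotion is immediately reflected in the score, which is exactly what makes it informative.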
A prerequisite for a reinforcement learner to work properly is that it has to be given
informative reward signals.
For instance, if we go to a written exam, as an output,
we would like to get a detailed breakdown of the number of points we got for each problem.
This way, we know where we did well and which kinds of problems need some more work.
However, imagine having a really careless teacher who never tells us the points, but
would only tell us whether we have failed or passed.
No explanation, no points for individual tasks, no telling whether we failed by a lot or just
by a tiny bit.
On our first attempt, we failed.
Next time, we failed again.
And again and again and again.
Now this would be a dreadful learning experience
because we would have absolutely no idea what to improve.
Clearly, this teacher would have to be fired.
However, when formulating a reinforcement learning problem,
instead of using more informative scores
it is much easier to just tell whether the algorithm was successful or not.
It is very convenient for us to be this careless teacher.
Otherwise, what score would make sense for a helicopter control problem
when we almost crash into a tree?
This part is called reward engineering
and the main issue is that we have to adapt the problem to the algorithm,
when ideally, the algorithm should adapt to the problem.
This has been a long-standing problem in reinforcement learning research,
and a potential solution would open up the possibility
of solving even harder and more interesting problems with learning algorithms.
And this is exactly what researchers at OpenAI set out to solve
by introducing Hindsight Experience Replay, HER, or her in short.
This algorithm takes on problems where the scores are binary,
which means that the agent either passed or failed the prescribed task.
A classic careless teacher scenario.
And these rewards are not only binary, but very sparse as well,
which further exacerbates the difficulty of the problem.
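Such a binary, sparse signal can be written down directly. The sketch below is an assumed formulation in the common goal-conditioned convention (reward 0 on success, -1 otherwise); the `tolerance` threshold is a hypothetical parameter, not taken from the paper.

```python
import numpy as np

def binary_reward(achieved_goal, desired_goal, tolerance=0.05):
    """Sparse binary reward: 0 if the achieved state is within `tolerance`
    of the desired goal, -1 otherwise. No partial credit of any kind."""
    distance = np.linalg.norm(np.asarray(achieved_goal, dtype=float)
                              - np.asarray(desired_goal, dtype=float))
    return 0.0 if distance < tolerance else -1.0
```

Note that a near miss and a wild miss receive exactly the same -1, which is precisely the careless-teacher problem described above.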
In the video, you can see a comparison with a previous algorithm
with and without the HER extension.
The higher the number of epochs you see above,
the longer the algorithm was able to train.
The incredible thing here is that it is able to achieve a goal
even if it never reached it during training.
The key idea is that we can learn just as much from undesirable outcomes
as from desirable ones.
Let me quote the authors.
Imagine that you are learning how to play hockey
and are trying to shoot a puck into a net.
You hit the puck but it misses the net on the right side.
The conclusion drawn by a standard reinforcement learning algorithm in such a situation would
be that the performed sequence of actions does not lead to a successful shot,
and little (if anything) would be learned.
It is however possible to draw another conclusion, namely that this sequence of actions would
be successful if the net had been placed further to the right.
They have achieved this by storing and replaying previous experiences with different potential goals.
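This goal-relabeling trick can be sketched compactly. The function below is a minimal, assumed illustration of HER's "future" replay strategy, not the authors' code: each transition is stored once with its real goal, plus `k` extra copies in which the goal is replaced by a state that was actually achieved later in the same episode, so some of those copies carry a success signal.

```python
import numpy as np

def her_relabel(episode, reward_fn, k=4, rng=None):
    """Hindsight relabeling, 'future' strategy (illustrative sketch).
    `episode` is a list of (state, action, achieved_goal, desired_goal).
    Returns (state, action, goal, reward) tuples: one per real transition,
    plus k hindsight copies whose goal is a later achieved state."""
    rng = rng or np.random.default_rng()
    relabeled = []
    for t, (state, action, achieved, goal) in enumerate(episode):
        # The original transition, scored against the real goal.
        relabeled.append((state, action, goal, reward_fn(achieved, goal)))
        # k hindsight copies: pretend a future achieved state was the goal.
        for ft in rng.integers(t, len(episode), size=k):
            new_goal = episode[ft][2]  # achieved goal at future time ft
            relabeled.append((state, action, new_goal,
                              reward_fn(achieved, new_goal)))
    return relabeled
```

A transition relabeled with its own achieved state always earns the success reward, which is how the agent extracts a learning signal even from episodes that missed the real goal entirely.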
As always, the details are available in the paper, make sure to have a look.
Now, it is always good to test whether the whole system works well in software;
however, its usefulness has also been demonstrated by deploying it on a real robot arm.
You can see the goal written on the screen alongside the results.
A really cool piece of work
that can potentially open up new ways of thinking about reinforcement learning.
After all, it’s great to have learning algorithms that are so good,
they can solve problems that we formulate in such a lazy way that we’d have to be fired.
And here’s a quick question: do you think 8 of these videos a month is worth a dollar?
If you have enjoyed this episode and your answer is yes, please consider supporting
us on Patreon.
Details are available in the video description.
Thanks for watching and for your generous support, and I’ll see you next time!