Dear Fellow Scholars, this is Two Minute Papers with Károly Zsolnai-Fehér.
This work doesn’t have a ton of viewable footage,
but I think it is an absolutely amazing piece of craftsmanship,
so in the first half of this video, we’ll roll some footage from earlier episodes,
and in the second half, you’ll see the new stuff.
In this series, we often talk about reinforcement learning,
which is a learning technique where an agent chooses an optimal series of actions
in an environment to maximize a score.
Playing computer games is a good example of a clearly defined score that is to be maximized.
As long as we can say that the higher the score, the better the learning,
the concept will work for helicopter control,
choosing the best spot for Wi-Fi connectivity, or a large variety of other tasks.
However, what about environments where multiple agents or players are present?
Not all games are single player focused, and not all helicopters have to fly alone.
So what about that?
To deal with cases like this, scientists at OpenAI and the University of Oxford came up with
a work by the name “Learning with Opponent-Learning Awareness”, or LOLA in short.
I have to say that the naming game at OpenAI has been quite strong lately.
This is about multiplayer reinforcement learning, if you will.
This new agent not only cares about maximizing its own score but also inserts a new term
into the equation which is about anticipating the actions of other players in the environment.
It is not only possible to do this, but they also show that it can be done in an effective way,
and, the best part is that it also gives rise to classical strategies
that game theory practitioners will immediately recognize.
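Roughly speaking, the opponent-anticipation idea means the agent assumes its opponent will take one naive learning step on its own objective, and then optimizes its own score through that anticipated step. Here is a first-order toy sketch of that update on a made-up differentiable two-player game (the payoffs `V1` and `V2`, the step sizes, and the finite-difference gradients are all illustrative assumptions, not the paper's exact algorithm):

```python
def grad(f, x, h=1e-5):
    """Central finite-difference derivative of a scalar function."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

def lola_update(theta1, theta2, V1, V2, alpha=0.1, eta=0.1):
    """One LOLA-style step for player 1.

    Player 1 anticipates that player 2 will take a naive gradient
    step of size eta on its own payoff V2, then ascends its own
    payoff V1 evaluated at that anticipated opponent parameter.
    """
    def shaped_payoff(t1):
        # opponent's anticipated one-step learning update
        theta2_look = theta2 + eta * grad(lambda t2: V2(t1, t2), theta2)
        return V1(t1, theta2_look)
    return theta1 + alpha * grad(shaped_payoff, theta1)

# Hypothetical smooth two-player game, for illustration only.
V1 = lambda t1, t2: -(t1 - 1.0) ** 2 + t1 * t2
V2 = lambda t1, t2: -(t2 + 1.0) ** 2 + t1 * t2

new_theta1 = lola_update(0.0, 0.0, V1, V2)
```

The key difference from a naive learner is that `shaped_payoff` depends on the opponent's learning rule, so the gradient includes a term that shapes how the opponent will learn.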
For instance, it can learn tit for tat,
which is a strategy that mirrors the other player’s actions.
This means that if the other player is cooperative, it will remain cooperative,
but if it gets screwed over, it will also try to screw others over.
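As a quick illustrative sketch (my own, not from the paper), tit for tat fits in a few lines:

```python
def tit_for_tat(opponent_history):
    """Cooperate on the first move, then mirror the opponent's last move.

    Moves are "C" (cooperate / stay silent) or "D" (defect / snitch).
    """
    if not opponent_history:
        return "C"               # open with cooperation
    return opponent_history[-1]  # then copy whatever they just did
```

Against a cooperator it stays cooperative forever; the moment the opponent defects, it retaliates on the very next move.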
You’ll see in a moment why this is a big deal.
The prisoner’s dilemma is a game where two criminals are caught and are independently
interrogated, and have to choose whether they snitch on the other one or not.
If one of them snitches, there will be hell to pay for the other one.
If they both defect, they both serve a fair amount of time in prison.
The score to be minimized is therefore the time spent in prison,
and this mutual-defection strategy is what we call the Nash equilibrium.
In other words, this is the best set of actions if we consider the options of the other actor
as well and expect that they do the same for us.
The optimal solution of this game is when both criminals remain silent.
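With hypothetical (but standard) sentence lengths, a few lines of code can verify both claims: mutual defection is the Nash equilibrium, even though mutual silence would give both criminals shorter sentences. The exact prison terms below are illustrative assumptions, not numbers from the paper:

```python
# Years in prison as (player1, player2); lower is better.
# "C" = stay silent (cooperate), "D" = snitch (defect).
YEARS = {
    ("C", "C"): (1, 1),    # both silent: short sentences
    ("C", "D"): (10, 0),   # you stay silent, they snitch
    ("D", "C"): (0, 10),   # you snitch, they stay silent
    ("D", "D"): (5, 5),    # both snitch: both serve real time
}

def is_nash(a1, a2):
    """True if neither player can shorten their own sentence
    by unilaterally switching their action."""
    for alt in "CD":
        if YEARS[(alt, a2)][0] < YEARS[(a1, a2)][0]:
            return False
        if YEARS[(a1, alt)][1] < YEARS[(a1, a2)][1]:
            return False
    return True
```

Only the pair (D, D) passes the check, even though (C, C) would leave both players better off.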
And now, the first cool result is that if we run the prisoner’s dilemma with two of
these new LOLA agents, they quickly find the Nash equilibrium.
This is great.
But wait, we have talked about this tit for tat thing, so what’s the big deal with that?
There is an iterated version of the prisoner’s dilemma game,
where this snitching or cooperating game is replayed many many times.
It is an ideal benchmark because an advanced agent would know that we cooperated the last
time, so it is likely that we can partner up this time around too!
And now comes the even cooler thing!
This is where the tit for tat strategy emerges – these LOLA agents know that if they
cooperated the previous time, they will immediately give each other another chance,
and again get away with the least amount of prison time.
As you can see here, the results vastly outperform those of naive agents, and from the scores it
seems that previous techniques enter a snitching revenge war against each other
and both will serve plenty of time in prison.
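That revenge-war dynamic is easy to reproduce in a toy simulation of the iterated game: two tit-for-tat players keep cooperating, while a pair of always-defect players lock into mutual snitching. The sentence lengths are again my illustrative assumptions:

```python
# Years in prison as (player1, player2); lower is better.
YEARS = {("C", "C"): (1, 1), ("C", "D"): (10, 0),
         ("D", "C"): (0, 10), ("D", "D"): (5, 5)}

def play(strategy1, strategy2, rounds=20):
    """Iterated prisoner's dilemma; returns total years served by each player."""
    h1, h2, total1, total2 = [], [], 0, 0
    for _ in range(rounds):
        a1, a2 = strategy1(h2), strategy2(h1)  # each sees the other's history
        y1, y2 = YEARS[(a1, a2)]
        total1, total2 = total1 + y1, total2 + y2
        h1.append(a1)
        h2.append(a2)
    return total1, total2

tft = lambda opp: "C" if not opp else opp[-1]  # tit for tat
defector = lambda opp: "D"                     # always snitch

tft_score = play(tft, tft)             # steady cooperation: (20, 20)
feud_score = play(defector, defector)  # endless revenge: (100, 100)
```

Over twenty rounds the cooperating pair serves a fifth of the prison time of the feuding pair, which is exactly the gap the scores in the paper illustrate.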
The LOLA agents are also benchmarked on other games against naive, uncooperative agents,
and vastly outperform them.
This is a fantastic paper,
make sure to check it out in the video description for more details.
I found it to be very readable, so do not despair if your math kung fu is not that strong.
Just dive into it!
Videos like this tend to get fewer views because they have fewer visual fireworks than most
other works we’re discussing in the series.
Fortunately, we are super lucky because we have your support on Patreon
and can tell these important stories without worrying about going viral.
And, if you have enjoyed this episode and you feel that 8 of these videos a month is
worth a dollar, please consider supporting us on Patreon.
One buck is almost nothing, but it keeps the papers coming.
Details are available in the video description.
Thanks for watching and for your generous support,
and I’ll see you next time!