

Learning to Model Other Minds (OpenAI) | Two Minute Papers #199

Dear Fellow Scholars, this is Two Minute Papers with Károly Zsolnai-Fehér.
This work doesn’t have a ton of viewable footage, but I think it is an absolutely amazing piece of craftsmanship, so in the first half of this video, we’ll roll some footage from earlier episodes, and in the second half, you’ll see the new stuff.
In this series, we often talk about reinforcement learning, which is a learning technique where an agent chooses an optimal series of actions in an environment to maximize a score.
Playing computer games is a good example, with a clearly defined score that is to be maximized. As long as we can say that the higher the score, the better the learning went, the concept will work for helicopter control, choosing the best spot for wifi connectivity, or a large variety of other tasks.
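As a toy illustration of "higher score, better learning", here is a minimal score-maximizing agent. The environment (a three-armed bandit with made-up payouts), the epsilon-greedy rule, and all hyperparameters are invented for this sketch and are not from the video:

```python
import random

def train_bandit(payouts, steps=5000, eps=0.1, seed=0):
    """Toy reinforcement learning: an epsilon-greedy agent learns
    which action yields the highest average score."""
    rng = random.Random(seed)
    values = [0.0] * len(payouts)   # running value estimate per action
    counts = [0] * len(payouts)
    for _ in range(steps):
        if rng.random() < eps:                     # explore a random action
            a = rng.randrange(len(payouts))
        else:                                      # exploit the best estimate so far
            a = max(range(len(payouts)), key=lambda i: values[i])
        reward = rng.gauss(payouts[a], 1.0)        # noisy score signal
        counts[a] += 1
        values[a] += (reward - values[a]) / counts[a]  # incremental mean update
    return max(range(len(payouts)), key=lambda i: values[i])

best = train_bandit([1.0, 2.0, 5.0])   # action 2 has the highest mean payout
```

After a few thousand steps, the value estimates converge and the agent settles on the action with the highest expected score.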
However, what about environments where multiple agents or players are present? Not all games are single-player, and not all helicopters have to fly alone. So what about those cases?
To deal with them, scientists at OpenAI and the University of Oxford came up with a technique by the name “Learning with Opponent-Learning Awareness”, or LOLA for short. I have to say that the naming game at OpenAI has been quite strong lately. This is about multiplayer reinforcement learning, if you will.
This new agent does not only care about maximizing its own score, but also inserts a new term into the equation that anticipates the actions of the other players in the environment. Not only is this possible, but the authors also show that it can be done efficiently, and, best of all, it gives rise to classical strategies that game theory practitioners will immediately recognize.
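The extra term can be sketched in miniature. Below, a LOLA-style player does not climb its own payoff V1 directly; it climbs V1 evaluated at the opponent's *anticipated* next parameters, i.e. the opponent's current parameters plus one naive learning step on the opponent's payoff V2. This is only a finite-difference sketch of the idea on smooth toy payoffs, not the paper's actual policy-gradient formulation; the function names, step sizes, and payoffs are all made up:

```python
def grad(f, x, y, wrt, h=1e-4):
    """Central-difference partial derivative of a smooth payoff f(x, y)."""
    if wrt == 0:
        return (f(x + h, y) - f(x - h, y)) / (2 * h)
    return (f(x, y + h) - f(x, y - h)) / (2 * h)

def lola_step(V1, V2, th1, th2, lr=0.1, eta=0.1):
    """One LOLA-style update for player 1: instead of ascending
    V1(th1, th2) directly, ascend V1 evaluated at the opponent's
    anticipated parameters th2 + eta * dV2/dth2."""
    def shaped_V1(a, b):
        b_next = b + eta * grad(V2, a, b, wrt=1)   # opponent's predicted learning step
        return V1(a, b_next)
    return th1 + lr * grad(shaped_V1, th1, th2, wrt=0)

# Toy payoffs where both players like the product of their parameters:
V1 = lambda a, b: a * b
V2 = lambda a, b: a * b
# A naive learner at (1, 1) would step to 1 + 0.1 * dV1/da = 1.1;
# the LOLA-style step also credits its own influence on the opponent's learning.
th1_next = lola_step(V1, V2, 1.0, 1.0)
```

With these payoffs the anticipation term pushes player 1 slightly further than a naive gradient step would, because raising `th1` also raises the opponent's gradient and hence the opponent's next parameters.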
For instance, it can learn tit for tat, a strategy that mirrors the other player’s actions. This means that if the other player is cooperative, it will remain cooperative, but if it gets screwed over, it will also try to screw others over. You’ll see in a moment why this is a big deal.
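The mirroring rule takes only a couple of lines; the "C"/"D" move encoding here is my own convention, not from the paper:

```python
def tit_for_tat(opponent_history):
    """Cooperate on the first move, then mirror the opponent's
    previous move. Moves are "C" (cooperate) or "D" (defect)."""
    return "C" if not opponent_history else opponent_history[-1]

tit_for_tat([])          # first move: cooperate
tit_for_tat(["C", "D"])  # opponent defected last round: defect back
```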
The prisoner’s dilemma is a game where two criminals are caught and independently interrogated, and each has to choose whether to snitch on the other or not. If one of them snitches, there will be hell to pay for the other. If they both defect, they both serve a fair amount of time in prison. The score to be minimized is therefore the time spent in prison, and mutual defection is the strategy that we call the Nash equilibrium. In other words, it is the best set of actions if we consider the options of the other actor as well and expect that they do the same for us. The socially optimal solution of this game, however, is when both criminals remain silent.
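The one-shot game can be written down and checked by brute force. The sentence lengths below are illustrative numbers of my own, not from the paper; the check confirms that mutual defection is the only pair of moves where neither prisoner can shorten his own sentence by unilaterally switching, even though mutual silence minimizes the total time served:

```python
from itertools import product

# Years in prison (to be minimized) for (my_move, their_move);
# "C" = stay silent (cooperate), "D" = snitch (defect).
# The numbers are illustrative, not from the paper.
YEARS = {
    ("C", "C"): 1,   # both stay silent: short sentences
    ("C", "D"): 10,  # I stay silent, they snitch: hell to pay for me
    ("D", "C"): 0,   # I snitch, they stay silent: I walk free
    ("D", "D"): 5,   # both snitch: both serve a fair amount of time
}

def is_nash(a, b):
    """Neither player can reduce their own prison time by
    unilaterally switching their move."""
    best_a = min("CD", key=lambda m: YEARS[(m, b)])
    best_b = min("CD", key=lambda m: YEARS[(m, a)])
    return (YEARS[(a, b)] <= YEARS[(best_a, b)]
            and YEARS[(b, a)] <= YEARS[(best_b, a)])

nash = [(a, b) for a, b in product("CD", repeat=2) if is_nash(a, b)]
# nash == [("D", "D")], although ("C", "C") minimizes total prison time
```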
And now, the first cool result: if we run the prisoner’s dilemma with two of these new LOLA agents, they quickly find the Nash equilibrium. This is great. But wait, we talked about this tit for tat thing, so what’s the big deal with that?
There is an iterated version of the prisoner’s dilemma, where this snitching-or-cooperating game is replayed many, many times. It is an ideal benchmark, because an advanced agent would know that we cooperated the last time, so it is likely that we can partner up this time around too!
And now comes the even cooler part! This is where the tit for tat strategy emerges: these LOLA agents know that if they cooperated the previous time, they will immediately give each other another chance, and again get away with the least amount of prison time.
As you can see here, the results vastly outperform other naive agents, and from the scores it seems that previous techniques enter a snitching revenge war against each other, with both serving plenty of time in prison.
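The gap between cooperation and a revenge war can be reproduced with a toy simulation. The sentence lengths, the 20-round horizon, and the two fixed strategies below are illustrative stand-ins of mine for the learned agents:

```python
def play_iterated(strat_a, strat_b, rounds=20):
    """Replay the snitch-or-cooperate game many times and add up both
    players' prison years (lower is better). Sentence lengths are
    illustrative, not from the paper."""
    years = {("C", "C"): (1, 1), ("C", "D"): (10, 0),
             ("D", "C"): (0, 10), ("D", "D"): (5, 5)}
    hist_a, hist_b, total_a, total_b = [], [], 0, 0
    for _ in range(rounds):
        a, b = strat_a(hist_b), strat_b(hist_a)   # each sees the other's past moves
        ya, yb = years[(a, b)]
        total_a, total_b = total_a + ya, total_b + yb
        hist_a.append(a)
        hist_b.append(b)
    return total_a + total_b

tit_for_tat = lambda h: "C" if not h else h[-1]
always_snitch = lambda h: "D"

cooperative = play_iterated(tit_for_tat, tit_for_tat)      # 2 years/round * 20 = 40
revenge_war = play_iterated(always_snitch, always_snitch)  # 10 years/round * 20 = 200
```

Two mirroring players lock into cooperation from the first round, while two defectors rack up five times the combined prison time.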
Other games are also benchmarked against naive, uncooperative agents, with LOLA vastly outperforming them.
This is a fantastic paper; make sure to check out the link in the video description for more details. I found it to be very readable, so do not despair if your math kung fu is not that strong. Just dive into it!
Videos like this tend to get fewer views, because they have less visual fireworks than most other works we discuss in the series. Fortunately, we are super lucky to have your support on Patreon, so we can tell these important stories without worrying about going viral. And, if you have enjoyed this episode and feel that eight of these videos a month are worth a dollar, please consider supporting us on Patreon. One buck is almost nothing, but it keeps the papers coming. Details are available in the video description. Thanks for watching and for your generous support, and I’ll see you next time!


Translation information

Video summary: This episode covers multi-agent reinforcement learning: scientists from the University of Oxford and OpenAI propose “Learning with Opponent-Learning Awareness” (LOLA for short), an agent that models the learning of its opponents, from which strategies such as tit for tat emerge.

Transcriber: collected from the web
Translator: 吾家黄姑娘
Reviewer: 审核员1024
Video source: https://www.youtube.com/watch?v=kfJMUeQO0S0
