Google DeepMind: Learning Agile Soccer Skills for a Bipedal Robot with Deep Reinforcement Learning (Movies)
Original article: OP3 Soccer
Take a look at the OP3 Powered by DYNAMIXEL
We investigate whether Deep Reinforcement Learning (Deep RL) is able to synthesize sophisticated and safe movement skills for a low-cost, miniature humanoid robot that can be composed into complex behavioral strategies in dynamic environments. We used Deep RL to train a humanoid robot with 20 actuated joints to play a simplified one-versus-one (1v1) soccer game. We first trained individual skills in isolation and then composed those skills end-to-end in a self-play setting. The resulting policy exhibits robust and dynamic movement skills such as rapid fall recovery, walking, turning, kicking and more; and transitions between them in a smooth, stable, and efficient manner—well beyond what is intuitively expected from the robot. The agents also developed a basic strategic understanding of the game, and learned, for instance, to anticipate ball movements and to block opponent shots. The full range of behaviors emerged from a small set of simple rewards. Our agents were trained in simulation and transferred to real robots zero-shot. We found that a combination of sufficiently high-frequency control, targeted dynamics randomization, and perturbations during training in simulation enabled good-quality transfer, despite significant unmodeled effects and variations across robot instances. Although the robots are inherently fragile, minor hardware modifications together with basic regularization of the behavior during training led the robots to learn safe and effective movements while still performing in a dynamic and agile way. Indeed, even though the agents were optimized for scoring, in experiments they walked 156% faster, took 63% less time to get up, and kicked 24% faster than a scripted baseline, while efficiently combining the skills to achieve the longer term objectives. Examples of the emergent behaviors and full 1v1 matches are available on the supplementary website: OP3 Soccer
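The abstract notes that the full range of behaviors emerged from a small set of simple rewards. As a rough illustration of what composing such terms can look like, the sketch below sums a few shaped components into one scalar; the term names, weights, and structure are hypothetical assumptions for illustration, not the authors' actual reward function.

```python
# Hypothetical sketch of a "small set of simple rewards" summed into a
# single scalar. All weights and terms here are illustrative assumptions.

def soccer_reward(ball_vel_to_goal, scored, upright, joint_torque_sq):
    """Combine a few simple shaped terms into one scalar reward."""
    r = 0.0
    r += 1.0 * ball_vel_to_goal      # reward moving the ball toward the goal
    r += 1000.0 if scored else 0.0   # sparse bonus for actually scoring
    r += 0.1 * upright               # stay on two feet (0..1 uprightness)
    r -= 1e-4 * joint_torque_sq      # mild penalty to regularize harsh motion
    return r
```

Summing cheap dense terms (ball velocity, uprightness) with one sparse objective (scoring) is a common way to let complex behavior emerge without hand-scripting each skill.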
Soccer players can tackle, get up, kick and chase a ball in one seamless motion. How could robots master these agile motor skills?
Movie S1: Training in simulation
We first trained individual skills in isolation, in simulation, and then composed those skills end-to-end in a self-play setting. We found that a combination of sufficiently high-frequency control, targeted dynamics randomization, and perturbations during training in simulation enabled good-quality transfer to the robot.
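To make the randomization-and-perturbation idea concrete, here is a minimal sketch of what resampling physics parameters per episode and occasionally pushing the robot might look like in a simulator loop. The parameter names, ranges, and push magnitudes are assumptions for illustration, not the values used in the paper.

```python
import random

# Illustrative sketch of targeted dynamics randomization and random
# perturbations during simulated training. The sampled quantities and
# their ranges below are assumptions, not the paper's exact settings.

def randomize_dynamics(rng):
    """Sample a fresh set of physics parameters at episode start."""
    return {
        "floor_friction": rng.uniform(0.5, 1.0),
        "joint_damping_scale": rng.uniform(0.8, 1.2),
        "mass_scale": rng.uniform(0.9, 1.1),       # per-robot mass variation
        "control_latency_s": rng.uniform(0.0, 0.02),
    }

def maybe_perturb(rng, apply_force, prob=0.05):
    """With small probability, push the torso with a random horizontal force."""
    if rng.random() < prob:
        fx = rng.uniform(-20.0, 20.0)              # newtons, illustrative
        fy = rng.uniform(-20.0, 20.0)
        apply_force(fx, fy)
```

A policy trained across many such sampled worlds cannot overfit one exact dynamics model, which is what enables zero-shot transfer despite unmodeled effects and variation across robot instances.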
Movie S2: 1v1 matches
Five one-versus-one matches. These matches are representative of the typical behavior and gameplay of the fully trained soccer agent.
Movie S3: Set pieces in simulation and in the real environment
We analysed the agent's performance in two set pieces to gauge the reliability of the getting-up and shooting behaviors and to measure the performance gap between simulation and the real environment. We also compared the learned behaviors with scripted baseline skills: in experiments, the learned agents walked 156% faster, took 63% less time to get up, and kicked 24% faster than the scripted baseline.
Movie S4: Robustness and recovery from pushes
Although the robots are inherently fragile, minor hardware modifications together with basic regularization of the behavior during training lead to safe and effective movements while still being able to perform in a dynamic and agile way.
Preliminary Results: Learning from vision
We conducted a preliminary investigation of whether deep RL agents can learn directly from raw egocentric vision. In this context the agent must learn to control its camera and integrate information over a window of egocentric viewpoints to predict various game aspects. Our initial analysis indicates that deep RL is a promising approach to this challenging problem. We conducted a simpler set-piece using fixed walker and ball positions and found our agent scored 10 goals in simulation and 6 goals on the real robot over 10 trials.
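One simple way to "integrate information over a window of egocentric viewpoints" is to hand the policy a sliding window of the most recent camera frames. The sketch below shows that mechanism in miniature; the window size and padding scheme are illustrative assumptions, and the paper does not state that this exact mechanism was used.

```python
from collections import deque

# Minimal sketch of integrating information over a window of egocentric
# viewpoints: keep the last k camera frames and give the policy the whole
# window as its observation. Window size k is an illustrative assumption.

class FrameStack:
    """Sliding window over the most recent egocentric camera frames."""

    def __init__(self, k=4):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_frame):
        # Pad the window with the first frame so the observation has a
        # constant size from the very first step of the episode.
        self.frames.clear()
        for _ in range(self.k):
            self.frames.append(first_frame)
        return self.observation()

    def step(self, frame):
        self.frames.append(frame)   # the oldest frame drops out automatically
        return self.observation()

    def observation(self):
        return list(self.frames)    # length-k window, oldest first
```

A recurrent network is the other common choice for this kind of temporal integration; a fixed frame window is simply the most transparent to sketch.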
We hope the challenge of integrating the get-up skill and learning vision-guided exploration and multi-agent strategies will be tackled by future work.
Movie S5: Preliminary vision-based agents