The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games (MAPPO)
2103.01955
Applies multi-agent PPO in the CTDE (centralized training, decentralized execution) paradigm. Experiments are run in the MPE, SMAC, and Hanabi environments. MAPPO turns out to perform very well, with competitive sample efficiency. The paper focuses on a number of tricks in the code-level implementation.
Experimental environments: MPE, SMAC, and Hanabi.

Algorithm setup:

1, each agent has its own local observation
2, a shared (common) reward
3, homogeneous agents with a shared policy
4, separate policy and value networks (see the sketch after this list)
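A minimal sketch of this CTDE actor-critic structure (my own illustration, not the authors' code): a shared actor conditions on each agent's local observation, while a centralized critic sees a global state during training only. Layer sizes and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SharedActor(nn.Module):
    """One policy network shared by all (homogeneous) agents; input is the local obs."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs: torch.Tensor) -> torch.distributions.Categorical:
        return torch.distributions.Categorical(logits=self.net(obs))

class CentralCritic(nn.Module):
    """Centralized value network; input is a global state, used only during training."""
    def __init__(self, state_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state).squeeze(-1)

# Decentralized execution: each agent samples from the shared actor using only its own obs.
actor, critic = SharedActor(obs_dim=18, act_dim=5), CentralCritic(state_dim=48)
obs = torch.randn(3, 18)            # 3 agents, local observations
actions = actor(obs).sample()       # per-agent actions, no global information needed
value = critic(torch.randn(1, 48))  # centralized value estimate for training
```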
Implementation tricks:
1, Generalized Advantage Estimation (GAE) [28] with advantage normalization (see the sketch after this list)
2, observation normalization
3, gradient clipping
4, value clipping
5, layer normalization, ReLU activation with orthogonal initialization
6, a large batch size under our 1-GPU constraint
7, a limited grid-search over certain hyper-parameters, including network architecture (i.e., MLP or RNN), learning rate, entropy bonus coefficient, and the initialization scale of the final layer in the policy network.
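A minimal NumPy sketch of trick 1, GAE with advantage normalization (my own illustration, not the paper's code). `rewards`, `values`, and `dones` are assumed to be per-step arrays from one rollout; `gamma` and `lam` are standard defaults.

```python
import numpy as np

def gae_advantages(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Compute GAE advantages, then normalize them to zero mean / unit std."""
    T = len(rewards)
    adv = np.zeros(T, dtype=np.float64)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_value * nonterminal - values[t]
        gae = delta + gamma * lam * nonterminal * gae
        adv[t] = gae
        next_value = values[t]
    returns = adv + values
    adv = (adv - adv.mean()) / (adv.std() + 1e-8)  # advantage normalization
    return adv, returns

# Example with dummy rollout data.
T = 5
adv, ret = gae_advantages(
    rewards=np.ones(T), values=np.zeros(T), dones=np.zeros(T), last_value=0.0)
```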
Key implementation details:
1, value normalization
Suggestion 1: Utilize value normalization to stabilize value learning.
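A minimal sketch of Suggestion 1 (my own illustration): keep running statistics of the return targets, train the critic against normalized targets, and denormalize its outputs whenever raw values are needed (e.g., for GAE). The paper uses a PopArt-style normalizer; this simplified version omits the output-layer correction.

```python
import numpy as np

class ValueNormalizer:
    def __init__(self, eps: float = 1e-5, beta: float = 0.999):
        self.mean, self.mean_sq, self.beta, self.eps = 0.0, 1.0, beta, eps

    def update(self, returns: np.ndarray) -> None:
        # Exponential moving averages of the first and second moments of the targets.
        self.mean = self.beta * self.mean + (1 - self.beta) * returns.mean()
        self.mean_sq = self.beta * self.mean_sq + (1 - self.beta) * (returns ** 2).mean()

    @property
    def std(self) -> float:
        return float(np.sqrt(max(self.mean_sq - self.mean ** 2, self.eps)))

    def normalize(self, x: np.ndarray) -> np.ndarray:    # targets fed to the value loss
        return (x - self.mean) / self.std

    def denormalize(self, x: np.ndarray) -> np.ndarray:  # critic outputs used for GAE
        return x * self.std + self.mean

norm = ValueNormalizer()
returns = np.random.randn(64) * 10 + 5
norm.update(returns)
value_targets = norm.normalize(returns)   # train the critic against these
```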
2, value function inputs
Suggestion 2: Include agent-specific features in the global state and check that these features do not make the state dimension substantially higher.
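A minimal sketch of Suggestion 2 (my own illustration): build the critic input from a compact global state plus the features of the agent being evaluated, instead of concatenating every agent's full observation, which blows up the input dimension. The feature layout here is a hypothetical example.

```python
import numpy as np

def critic_input(global_state: np.ndarray, agent_obs: np.ndarray) -> np.ndarray:
    """Agent-specific global state: shared global features + this agent's own features."""
    return np.concatenate([global_state, agent_obs])

n_agents, state_dim, obs_dim = 3, 32, 18
global_state = np.random.randn(state_dim)
agent_obs = [np.random.randn(obs_dim) for _ in range(n_agents)]

# One value input per agent; dimension stays at state_dim + obs_dim, not n_agents * obs_dim.
value_inputs = np.stack([critic_input(global_state, o) for o in agent_obs])
print(value_inputs.shape)  # (3, 50)
```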
3, training data usage
Suggestion 3: Avoid using too many training epochs and do not split data into mini-batches.
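A minimal sketch of Suggestion 3 (my own illustration): reuse the collected batch for only a small number of epochs and feed it through as one full batch rather than splitting it into mini-batches. `ppo_update` is a hypothetical stand-in for the clipped policy/value update sketched in the next point.

```python
def train_on_batch(batch, ppo_update, num_epochs: int = 15):
    """Few training epochs, one full-batch pass per epoch (no mini-batch splitting)."""
    for _ in range(num_epochs):   # keep the epoch count modest; harder tasks may need even fewer
        ppo_update(batch)         # the entire batch at once, no shuffle/split into mini-batches
```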
4, policy and value clipping
Suggestion 4: For the best PPO performance, tune the clipping ratio as a trade-off between training stability and fast convergence.
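A minimal PyTorch sketch of Suggestion 4 (my own illustration): the standard PPO clipped surrogate for the policy, plus the analogous clipped value loss, both controlled by the clipping ratio `eps` that the suggestion says to tune.

```python
import torch

def ppo_losses(log_probs, old_log_probs, advantages,
               values, old_values, returns, eps: float = 0.2):
    # Policy clipping: bound the probability ratio to [1 - eps, 1 + eps].
    ratio = torch.exp(log_probs - old_log_probs)
    policy_loss = -torch.min(ratio * advantages,
                             torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages).mean()

    # Value clipping: keep the new value prediction close to the old one.
    values_clipped = old_values + torch.clamp(values - old_values, -eps, eps)
    value_loss = torch.max((values - returns) ** 2,
                           (values_clipped - returns) ** 2).mean()
    return policy_loss, value_loss

# A smaller eps trades faster convergence for more stable updates; tune per task.
```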
5, death masking
Suggestion 5: Use zero states with agent ID as the value input for dead agents.
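A minimal sketch of Suggestion 5 (my own illustration): once an agent dies, replace its value-function input with an all-zero state that still carries the agent's ID (here a one-hot appended to the state), so the critic can distinguish dead agents without seeing stale environment features.

```python
import numpy as np

def value_input(global_state: np.ndarray, agent_id: int, n_agents: int, alive: bool) -> np.ndarray:
    one_hot_id = np.eye(n_agents)[agent_id]
    if not alive:
        # Death masking: zero state, but keep the agent ID.
        return np.concatenate([np.zeros_like(global_state), one_hot_id])
    return np.concatenate([global_state, one_hot_id])

state = np.random.randn(32)
x_alive = value_input(state, agent_id=1, n_agents=3, alive=True)
x_dead = value_input(state, agent_id=1, n_agents=3, alive=False)  # zeros + agent ID
```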
Main results: MAPPO achieves strong performance and competitive sample efficiency across all three environments.
