The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games (MAPPO)
2103.01955
Applies multi-agent PPO in the CTDE (centralized training, decentralized execution) paradigm. Experiments are run in the MPE, SMAC, and Hanabi environments. MAPPO turns out to perform very well, with competitive sample efficiency. The paper focuses on a number of tricks in the code-level implementation.
Experimental environments: MPE, SMAC, and Hanabi.

Algorithm setup:

1, each agent has its own local observation
2, a shared (common) reward
3, homogeneous agents with a shared policy
4, separate policy and value networks (see the sketch after this list)
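A minimal sketch of this CTDE actor-critic structure (my own illustration, not the authors' code): a shared actor conditions on each agent's local observation, while a centralized critic sees a global state during training only. Layer sizes and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SharedActor(nn.Module):
    """One policy network shared by all (homogeneous) agents; input is the local obs."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs: torch.Tensor) -> torch.distributions.Categorical:
        return torch.distributions.Categorical(logits=self.net(obs))

class CentralCritic(nn.Module):
    """Centralized value network; input is a global state, used only during training."""
    def __init__(self, state_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state).squeeze(-1)

# Decentralized execution: each agent samples from the shared actor using only its own obs.
actor, critic = SharedActor(obs_dim=18, act_dim=5), CentralCritic(state_dim=48)
obs = torch.randn(3, 18)            # 3 agents, local observations
actions = actor(obs).sample()       # per-agent actions, no global information needed
value = critic(torch.randn(1, 48))  # centralized value estimate for training
```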
Implementation tricks:
1, Generalized Advantage Estimation (GAE) [28] with advantage normalization (see the sketch after this list)
2, observation normalization
3, gradient clipping
4, value clipping
5, layer normalization, ReLU activation with orthogonal initialization
6, a large batch size under our 1-GPU constraint
7, a limited grid-search over certain hyper-parameters, including network architecture (i.e., MLP or RNN), learning rate, entropy bonus coefficient, and the initialization scale of the final layer in the policy network.
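A minimal NumPy sketch of trick 1, GAE with advantage normalization (my own illustration, not the paper's code). `rewards`, `values`, and `dones` are assumed to be per-step arrays from one rollout; `gamma` and `lam` are standard defaults.

```python
import numpy as np

def gae_advantages(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Compute GAE advantages, then normalize them to zero mean / unit std."""
    T = len(rewards)
    adv = np.zeros(T, dtype=np.float64)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_value * nonterminal - values[t]
        gae = delta + gamma * lam * nonterminal * gae
        adv[t] = gae
        next_value = values[t]
    returns = adv + values
    adv = (adv - adv.mean()) / (adv.std() + 1e-8)  # advantage normalization
    return adv, returns

# Example with dummy rollout data.
T = 5
adv, ret = gae_advantages(
    rewards=np.ones(T), values=np.zeros(T), dones=np.zeros(T), last_value=0.0)
```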
Key implementation details:
1, value normalization
Suggestion 1: Utilize value normalization to stabilize value learning.
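A minimal sketch of Suggestion 1 (my own illustration): keep running statistics of the return targets, train the critic against normalized targets, and denormalize its outputs whenever raw values are needed (e.g., for GAE). The paper uses a PopArt-style normalizer; this simplified version omits the output-layer correction.

```python
import numpy as np

class ValueNormalizer:
    def __init__(self, eps: float = 1e-5, beta: float = 0.999):
        self.mean, self.mean_sq, self.beta, self.eps = 0.0, 1.0, beta, eps

    def update(self, returns: np.ndarray) -> None:
        # Exponential moving averages of the first and second moments of the targets.
        self.mean = self.beta * self.mean + (1 - self.beta) * returns.mean()
        self.mean_sq = self.beta * self.mean_sq + (1 - self.beta) * (returns ** 2).mean()

    @property
    def std(self) -> float:
        return float(np.sqrt(max(self.mean_sq - self.mean ** 2, self.eps)))

    def normalize(self, x: np.ndarray) -> np.ndarray:    # targets fed to the value loss
        return (x - self.mean) / self.std

    def denormalize(self, x: np.ndarray) -> np.ndarray:  # critic outputs used for GAE
        return x * self.std + self.mean

norm = ValueNormalizer()
returns = np.random.randn(64) * 10 + 5
norm.update(returns)
value_targets = norm.normalize(returns)   # train the critic against these
```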
2, value function inputs
Suggestion 2: Include agent-specific features in the global state and check that these features do not make the state dimension substantially higher.
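A minimal sketch of Suggestion 2 (my own illustration): build the critic input from a compact global state plus the features of the agent being evaluated, instead of concatenating every agent's full observation, which blows up the input dimension. The feature layout here is a hypothetical example.

```python
import numpy as np

def critic_input(global_state: np.ndarray, agent_obs: np.ndarray) -> np.ndarray:
    """Agent-specific global state: shared global features + this agent's own features."""
    return np.concatenate([global_state, agent_obs])

n_agents, state_dim, obs_dim = 3, 32, 18
global_state = np.random.randn(state_dim)
agent_obs = [np.random.randn(obs_dim) for _ in range(n_agents)]

# One value input per agent; dimension stays at state_dim + obs_dim, not n_agents * obs_dim.
value_inputs = np.stack([critic_input(global_state, o) for o in agent_obs])
print(value_inputs.shape)  # (3, 50)
```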
3, training data usage
Suggestion 3: Avoid using too many training epochs and do not split data into mini-batches.
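A minimal sketch of Suggestion 3 (my own illustration): reuse the collected batch for only a small number of epochs and feed it through as one full batch rather than splitting it into mini-batches. `ppo_update` is a hypothetical stand-in for the clipped policy/value update sketched in the next point.

```python
def train_on_batch(batch, ppo_update, num_epochs: int = 15):
    """Few training epochs, one full-batch pass per epoch (no mini-batch splitting)."""
    for _ in range(num_epochs):   # keep the epoch count modest; harder tasks may need even fewer
        ppo_update(batch)         # the entire batch at once, no shuffle/split into mini-batches
```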
4, policy and value clipping
Suggestion 4: For the best PPO performance, tune the clipping ratio as a trade-off between training stability and fast convergence.
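A minimal PyTorch sketch of Suggestion 4 (my own illustration): the standard PPO clipped surrogate for the policy, plus the analogous clipped value loss, both controlled by the clipping ratio `eps` that the suggestion says to tune.

```python
import torch

def ppo_losses(log_probs, old_log_probs, advantages,
               values, old_values, returns, eps: float = 0.2):
    # Policy clipping: bound the probability ratio to [1 - eps, 1 + eps].
    ratio = torch.exp(log_probs - old_log_probs)
    policy_loss = -torch.min(ratio * advantages,
                             torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages).mean()

    # Value clipping: keep the new value prediction close to the old one.
    values_clipped = old_values + torch.clamp(values - old_values, -eps, eps)
    value_loss = torch.max((values - returns) ** 2,
                           (values_clipped - returns) ** 2).mean()
    return policy_loss, value_loss

# A smaller eps trades faster convergence for more stable updates; tune per task.
```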
5, death masking
Suggestion 5: Use zero states with agent ID as the value input for dead agents.
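A minimal sketch of Suggestion 5 (my own illustration): once an agent dies, replace its value-function input with an all-zero state that still carries the agent's ID (here a one-hot appended to the state), so the critic can distinguish dead agents without seeing stale environment features.

```python
import numpy as np

def value_input(global_state: np.ndarray, agent_id: int, n_agents: int, alive: bool) -> np.ndarray:
    one_hot_id = np.eye(n_agents)[agent_id]
    if not alive:
        # Death masking: zero state, but keep the agent ID.
        return np.concatenate([np.zeros_like(global_state), one_hot_id])
    return np.concatenate([global_state, one_hot_id])

state = np.random.randn(32)
x_alive = value_input(state, agent_id=1, n_agents=3, alive=True)
x_dead = value_input(state, agent_id=1, n_agents=3, alive=False)  # zeros + agent ID
```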
Main results: MAPPO achieves strong performance and competitive sample efficiency across all three environments.
