The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games (MAPPO)

arXiv: 2103.01955

 

        The paper applies multi-agent PPO in the CTDE (centralized training, decentralized execution) style. Experiments are run in the MPE, SMAC, and Hanabi environments. MAPPO turns out to work very well and to be quite sample-efficient. The paper focuses on a number of implementation tricks.

Experimental environments: MPE, SMAC, and Hanabi.

Algorithm:

1, each agent has its own local observations (obs)

2, all agents share a common reward

3, homogeneous policies across agents

4, each agent has its own policy and value function
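
A minimal PyTorch sketch of this setup (class names, hidden sizes, and dimensions are illustrative, not the paper's code): homogeneous agents share one actor that acts on local observations, while a single centralized critic sees a global state that is only available during training.

    import torch
    import torch.nn as nn

    class Actor(nn.Module):
        """Decentralized actor: local observation -> action distribution (shared by all agents)."""
        def __init__(self, obs_dim, act_dim, hidden=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(obs_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, act_dim),
            )

        def forward(self, obs):
            return torch.distributions.Categorical(logits=self.net(obs))

    class CentralCritic(nn.Module):
        """Centralized critic: global state (training-time only) -> scalar value."""
        def __init__(self, state_dim, hidden=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),
            )

        def forward(self, state):
            return self.net(state).squeeze(-1)

    # One shared actor for the homogeneous agents, one centralized critic.
    actor = Actor(obs_dim=18, act_dim=5)
    critic = CentralCritic(state_dim=48)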

Implementation tricks:

1, Generalized Advantage Estimation (GAE) [28] with advantage normalization (see the sketch after this list)

2, observation normalization

3, gradient clipping

4, value clipping

5, layer normalization, ReLU activation with orthogonal initialization

6, a large batch size under our 1-GPU constraint

7, a limited grid-search over certain hyper-parameters, including network architecture (i.e., MLP or RNN), learning rate, entropy bonus coefficient, and the initialization scale of the final layer in the policy network.
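
As a concrete illustration of trick 1, a minimal NumPy sketch of GAE with advantage normalization over a single rollout; gamma and lambda are the usual defaults, not necessarily the paper's values.

    import numpy as np

    def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
        """GAE over one rollout of length T.

        rewards, dones: arrays of shape (T,); values: array of shape (T + 1,)
        that includes the bootstrap value for the state after the last step.
        """
        T = len(rewards)
        adv = np.zeros(T, dtype=np.float32)
        gae = 0.0
        for t in reversed(range(T)):
            nonterminal = 1.0 - dones[t]
            delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
            gae = delta + gamma * lam * nonterminal * gae
            adv[t] = gae
        returns = adv + values[:-1]                    # targets for the value loss
        adv = (adv - adv.mean()) / (adv.std() + 1e-8)  # advantage normalization
        return adv, returns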

Important implementation details:

1, value normalization

    Suggestion 1: Utilize value normalization to stabilize value learning.
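
A simplified sketch of what value normalization can look like: running mean/variance statistics as a stand-in for the PopArt-style normalizer, with illustrative class and method names. The critic is trained on normalized return targets, and its outputs are de-normalized before being used in GAE.

    import numpy as np

    class RunningValueNorm:
        """Running mean/variance of value targets, used to normalize the critic's targets."""
        def __init__(self, eps=1e-5):
            self.mean, self.var, self.count = 0.0, 1.0, eps

        def update(self, targets):
            batch_mean, batch_var, n = targets.mean(), targets.var(), len(targets)
            delta = batch_mean - self.mean
            total = self.count + n
            self.mean += delta * n / total
            m_a = self.var * self.count
            m_b = batch_var * n
            self.var = (m_a + m_b + delta ** 2 * self.count * n / total) / total
            self.count = total

        def normalize(self, x):
            return (x - self.mean) / np.sqrt(self.var + 1e-8)

        def denormalize(self, x):
            return x * np.sqrt(self.var + 1e-8) + self.mean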

2, value function inputs

    Suggestion 2: Include agent-specific features in the global state and check that these features do not make the state dimension substantially higher.
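
One way to read Suggestion 2 as code (hypothetical helper with an illustrative feature layout): the critic input concatenates the global state with the agent's own features and a one-hot agent ID, while keeping an eye on the total dimension.

    import numpy as np

    def build_value_input(global_state, agent_obs, agent_id, num_agents):
        """Agent-specific global state: global info + this agent's features + one-hot agent ID."""
        one_hot_id = np.eye(num_agents)[agent_id]
        value_input = np.concatenate([global_state, agent_obs, one_hot_id])
        # If this grows much larger than the original global state,
        # value learning can suffer (the dimension check in Suggestion 2).
        return value_input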

3, training data usage

    Suggestion 3: Avoid using too many training epochs and do not split data into mini-batches.
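
Suggestion 3 mostly comes down to two hyper-parameters; the values below are the ballpark ranges reported in the paper (roughly 15 epochs for easy tasks, 5 or 10 for hard ones, and a single mini-batch), and the key names are illustrative.

    # Training-data usage in the spirit of Suggestion 3.
    data_usage_config = {
        "ppo_epochs": 10,      # ~15 for easy tasks, 5 or 10 for hard tasks; avoid large values
        "num_mini_batch": 1,   # feed the whole rollout buffer as one batch per epoch
    }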

4, policy and value clipping

    Suggestion 4: For the best PPO performance, tune the clipping ratio as a trade-off between training stability and fast convergence.
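
A minimal PyTorch sketch of the clipped policy and value losses behind Suggestion 4; the same epsilon clips both the probability ratio and the value prediction, and 0.2 is a common default rather than the tuned value for any particular task.

    import torch

    def ppo_losses(new_logp, old_logp, adv, new_value, old_value, returns, clip_eps=0.2):
        """Clipped surrogate policy loss and clipped value loss, sharing one epsilon."""
        ratio = torch.exp(new_logp - old_logp)
        policy_loss = -torch.min(ratio * adv,
                                 torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv).mean()

        # Value clipping: keep the new prediction within clip_eps of the old one.
        value_clipped = old_value + (new_value - old_value).clamp(-clip_eps, clip_eps)
        value_loss = torch.max((new_value - returns) ** 2,
                               (value_clipped - returns) ** 2).mean()
        return policy_loss, value_loss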

5, death masking

    Suggestion 5: Use zero states with agent ID as the value input for dead agents.
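
A small sketch of death masking; the function name and the assumption that the one-hot agent ID occupies the last slots of the critic input are illustrative, not taken from the paper's code.

    import numpy as np

    def death_mask(value_input, agent_id, num_agents, is_dead):
        """Critic input for one agent; dead agents get a zero state that keeps only the agent ID."""
        if not is_dead:
            return value_input
        masked = np.zeros_like(value_input)
        masked[-num_agents:] = np.eye(num_agents)[agent_id]  # assumes the ID one-hot sits at the end
        return masked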

Main results:

 
