How AD works: 1) the training histories of a source RL algorithm on many individual tasks are saved to form a large multi-task dataset; 2) a transformer causally models actions, using the preceding learning history as its context. Because the source RL algorithm's policy improves over the course of training, AD is forced to learn the improvement operator in order to model the actions accurately. Notably, the transformer's context must be long enough to span learning updates (e.g., across-episodic) in order to capture the improvement present in the training dataset.
AD models state-action-reward tokens and does not condition on returns.
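A minimal training sketch of this setup, assuming discrete observations/actions, PyTorch, and toy shapes; `CausalADModel`, the layer sizes, and the placeholder batch are illustrative assumptions, not the original implementation:

```python
import torch
import torch.nn as nn

N_OBS, N_ACT, D = 16, 4, 64      # toy vocab sizes for observations/actions, embedding dim
CTX = 256                        # across-episodic context length, in (o, a, r) steps

class CausalADModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.obs_emb = nn.Embedding(N_OBS, D)
        self.act_emb = nn.Embedding(N_ACT, D)
        self.rew_emb = nn.Linear(1, D)
        layer = nn.TransformerEncoderLayer(D, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D, N_ACT)          # predicts the source algorithm's next action

    def forward(self, obs, act, rew):
        # Interleave (o_t, a_t, r_t) into one token sequence; note: no return conditioning.
        x = torch.stack([self.obs_emb(obs),
                         self.act_emb(act),
                         self.rew_emb(rew.unsqueeze(-1))], dim=2).flatten(1, 2)
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.encoder(x, mask=mask)
        return self.head(h[:, 0::3])             # action logits read at the observation positions

model = CausalADModel()
opt = torch.optim.Adam(model.parameters(), lr=3e-4)

# Placeholder batch standing in for a window of CTX consecutive steps cut from one
# task's full training history (spanning several episodes, so the source policy's
# improvement is visible inside the context).
obs = torch.randint(0, N_OBS, (8, CTX))
act = torch.randint(0, N_ACT, (8, CTX))
rew = torch.rand(8, CTX)

logits = model(obs, act, rew)                    # predict a_t from the history up to o_t
loss = nn.functional.cross_entropy(logits.reshape(-1, N_ACT), act.reshape(-1))
loss.backward()
opt.step()
```

Predicting the action at each observation position, with a window long enough to span several episodes of the source run, is what forces the model to capture the improvement operator rather than a single fixed policy.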
1、We found that in-context TD learning emerges only when the transformer is trained across a diverse range of environments. When restricted to a single-task setup, the transformer can "cheat" by learning heuristics specific to that environment rather than a generalizable value-estimation algorithm. By exposing the transformer to multiple tasks, we force its weights to generalize, leading to the emergence of a robust algorithm like TD. This aligns with previous studies (including in-context supervised learning studies) showing that task diversity during training plays a crucial role in encouraging generalizable in-context learning capabilities.
2、The context can span multiple episodes. As the context length increases, action quality improves, suggesting that the improvement is not due to memorized policies encoded in the fixed parameters. Instead, it indicates that a reinforcement learning process takes place during the forward pass as the agent processes the context, a phenomenon termed in-context reinforcement learning (ICRL): RL occurring at inference time rather than through weight updates.
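To make the "RL in the forward pass" point concrete, here is a sketch of across-episodic evaluation on an unseen task with frozen weights, assuming the `CausalADModel` and constants from the sketch above; `ToyEnv` and the greedy action rule are illustrative assumptions:

```python
import torch

class ToyEnv:
    # Hypothetical stand-in for a gym-like environment with discrete obs/actions.
    def reset(self):
        self.t = 0
        return 0
    def step(self, action):
        self.t += 1
        return self.t % N_OBS, float(action == 3), self.t >= 10   # obs, reward, done

env = ToyEnv()
model.eval()                                   # frozen weights: no updates at evaluation time
context = {"obs": [], "act": [], "rew": []}    # grows across episodes

def act_greedily(obs_t, ctx_len=CTX):
    # Condition on the last ctx_len completed steps plus the current observation;
    # the action/reward slots for the current step are zero-padded, and the causal
    # mask keeps them from influencing the prediction at the final obs position.
    o = context["obs"][-ctx_len:] + [obs_t]
    a = context["act"][-ctx_len:] + [0]
    r = context["rew"][-ctx_len:] + [0.0]
    with torch.no_grad():
        logits = model(torch.tensor([o]), torch.tensor([a]), torch.tensor([r]))
    return int(logits[0, -1].argmax())

for episode in range(20):                      # any improvement must come from the growing context
    obs_t, done = env.reset(), False
    while not done:
        a = act_greedily(obs_t)
        next_obs, r, done = env.step(a)
        context["obs"].append(obs_t)
        context["act"].append(a)
        context["rew"].append(r)
        obs_t = next_obs
```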
3、In supervised pretraining, the agent is explicitly tasked with imitating the behavior of an existing RL algorithm demonstrated in an offline dataset; in reinforcement pretraining, the agent is only tasked with maximizing return, and there is no constraint on how the agent network should achieve this in the forward pass.
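A sketch of the difference between the two objectives, with placeholder tensors standing in for the transformer's action logits, the offline dataset's actions, and the agent's own rollouts; the shapes and the plain policy-gradient surrogate are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

B, T, N_ACTIONS = 8, 64, 4
logits = torch.randn(B, T, N_ACTIONS, requires_grad=True)   # stand-in for the transformer's action logits

# Supervised pretraining (e.g. AD): cross-entropy against the source RL
# algorithm's actions stored in the offline dataset.
dataset_actions = torch.randint(0, N_ACTIONS, (B, T))
supervised_loss = F.cross_entropy(logits.reshape(-1, N_ACTIONS), dataset_actions.reshape(-1))

# Reinforcement pretraining: a policy-gradient surrogate that only maximizes
# return; nothing constrains how the forward pass achieves this.
dist = torch.distributions.Categorical(logits=logits)
sampled_actions = dist.sample()
returns = torch.randn(B, T)                                  # placeholder (discounted) returns
reinforcement_loss = -(dist.log_prob(sampled_actions) * returns).mean()
```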
4、Open problems: When a transformer estimates the value function, its internal operations align with TD learning. Most transformers, however, serve as policies rather than value function estimators. It would be interesting to know the authors' view on whether, if the transformer acted as a policy instead of estimating a value function, its internal operations would still resemble TD learning, or whether another reinforcement learning algorithm would emerge internally. (This work only studies the transformer's role as a value estimator in TD-based algorithms within ICRL; it does not examine policy-type algorithms.)
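For reference, the tabular TD(0) value update is the rule the paper argues is implicitly implemented in the transformer's forward pass when it estimates values; the transitions, alpha, and gamma below are toy values for illustration:

```python
import numpy as np

n_states, alpha, gamma = 5, 0.1, 0.9
V = np.zeros(n_states)                          # tabular value estimates

# (state, reward, next_state) transitions from some behavior policy (toy data).
transitions = [(0, 0.0, 1), (1, 0.0, 2), (2, 1.0, 3), (3, 0.0, 4)]
for s, r, s_next in transitions:
    td_error = r + gamma * V[s_next] - V[s]     # delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    V[s] += alpha * td_error                    # V(s_t) <- V(s_t) + alpha * delta_t
print(V)
```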
Transformers learn to implement preconditioned gradient descent for in-context learning
Source: NeurIPS 2023
How do Transformers perform In-Context Autoregressive Learning?