Paper Skim Notes | March 2025

Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks (2017)

  • arxiv:https://arxiv.org/abs/1703.03400
  • Source: MoonOut
  • 作者:Chelsea Finn, Pieter Abbeel, Sergey Levine
  • Main content: proposes the MAML meta-gradient update, which differentiates through a gradient step (gradient-by-gradient). The algorithm finds model parameters that are sensitive to task changes, so that a small change in the parameters along the gradient of the loss yields a large improvement on the loss of any task drawn from the task distribution.
  • The first gradient update exists only to serve the second: compute one gradient step without applying it to the original model, then compute a second gradient at the adapted parameters, and only that one updates the original model (see the sketch below).
  • A very classic meta-learning algorithm, plain and simple.
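
A minimal sketch of MAML's two-level update on a toy linear-regression task family (the task sampler, loss, and learning rates below are illustrative assumptions, not the paper's code):

```python
import torch

# Hypothetical toy setup: linear-regression tasks with shared meta-parameters theta.
theta = torch.randn(5, requires_grad=True)       # meta-parameters
meta_opt = torch.optim.SGD([theta], lr=1e-2)     # outer-loop optimizer
inner_lr = 0.1

def loss_fn(params, X, y):
    return ((X @ params - y) ** 2).mean()

def sample_task():
    w = torch.randn(5)                            # task-specific ground truth
    X_sup, X_qry = torch.randn(16, 5), torch.randn(16, 5)
    return (X_sup, X_sup @ w), (X_qry, X_qry @ w)

for step in range(100):
    meta_loss = 0.0
    for _ in range(4):                            # a batch of tasks
        (Xs, ys), (Xq, yq) = sample_task()
        # First gradient: computed but NOT applied to theta itself.
        inner_loss = loss_fn(theta, Xs, ys)
        grad = torch.autograd.grad(inner_loss, theta, create_graph=True)[0]
        theta_prime = theta - inner_lr * grad     # adapted parameters theta'
        # Outer loss is evaluated at theta', so backprop goes through the
        # inner gradient step, i.e. the "gradient by gradient".
        meta_loss = meta_loss + loss_fn(theta_prime, Xq, yq)
    meta_opt.zero_grad()
    meta_loss.backward()
    meta_opt.step()                               # only now is theta actually updated
```

The key detail is `create_graph=True`: the outer backward pass differentiates through the inner gradient step, which is the gradient-by-gradient computation described above.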

Reinforcement Learning Upside Down: Don't Predict Rewards -- Just Map Them to Actions

  • arxiv:https://arxiv.org/abs/1912.02875
  • Hard to read........ Roughly in the same spirit as Decision Transformer: a supervised-learning paradigm that takes the state plus a command (desired return, desired horizon) as input and predicts the action as the target, except the architecture here is an RNN (a rough sketch below).
  • A companion paper in the same series: Training Agents using Upside-Down Reinforcement Learning.
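
A rough sketch of that supervised objective, assuming a command-conditioned GRU policy over discrete actions (the shapes and the command layout are illustrative, not the paper's implementation):

```python
import torch
import torch.nn as nn

# Hypothetical shapes: batch B, sequence length T, state dim S, discrete actions A.
B, T, S, A = 32, 50, 8, 4

class UDRLPolicy(nn.Module):
    def __init__(self):
        super().__init__()
        # Input per step = state + command (desired return, desired horizon).
        self.rnn = nn.GRU(S + 2, 64, batch_first=True)
        self.head = nn.Linear(64, A)

    def forward(self, states, commands):
        h, _ = self.rnn(torch.cat([states, commands], dim=-1))
        return self.head(h)                        # action logits per step

policy = UDRLPolicy()
states = torch.randn(B, T, S)
commands = torch.randn(B, T, 2)                    # (desired return, desired horizon)
actions = torch.randint(0, A, (B, T))              # logged actions are the targets

logits = policy(states, commands)
loss = nn.functional.cross_entropy(logits.reshape(-1, A), actions.reshape(-1))
loss.backward()                                    # pure supervised learning, no TD target
```

At test time the command is set to what you want (e.g., a high desired return) and the policy is asked to act accordingly.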

Plan Your Target and Learn Your Skills: Transferable State-Only Imitation Learning via Decoupled Policy Optimization

Q-Transformer: Scalable Offline Reinforcement Learning via Autoregressive Q-Functions

When does return-conditioned supervised learning work for offline reinforcement learning?

Diverse Transformer Decoding for Offline Reinforcement Learning Using Financial Algorithmic Approaches

  • arxiv:https://arxiv.org/abs/2502.10473
  • Source: browsing arXiv
  • Main content:
    • This paper follows Trajectory Transformer (TT) and is an improvement over TT. The model-training part is identical to TT; the difference lies in the decoding stage, i.e., PBS is an inference-time algorithm.
    • Main contribution: replaces TT's planning algorithm, Beam Search, with Portfolio Beam Search (PBS), bringing portfolio-optimization theory from economics into the BS algorithm. Honestly, I did not fully understand it.
    • The authors stress that PBS accounts for uncertainty in expectations and distribution shift when deciding which candidates to keep, promoting diversity; it takes means, covariances, and similar quantities into account.
  • BS is a decoding algorithm commonly used with LLMs to pick the best sequence/trajectory, i.e., planning here, and many variants exist by now (a plain BS sketch below). Still, I feel the amount of work in this paper is a bit thin.
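
For reference, a minimal sketch of plain beam search over trajectory tokens as used for TT-style planning. The `model.score_next` interface is a hypothetical stand-in (TT scores tokens with log-likelihood plus reward terms); PBS additionally scores candidate sets with portfolio-style mean/covariance criteria, which is not shown here:

```python
def beam_search(model, prefix, vocab_size, beam_width=8, horizon=15):
    """Keep the beam_width highest-scoring token sequences at every step."""
    beams = [(0.0, list(prefix))]                  # (cumulative score, tokens)
    for _ in range(horizon):
        candidates = []
        for score, toks in beams:
            # Hypothetical interface: a score (e.g. log-probability, possibly
            # reward-augmented as in Trajectory Transformer) for each next token.
            next_scores = model.score_next(toks)   # sequence of length vocab_size
            for tok in range(vocab_size):
                candidates.append((score + next_scores[tok], toks + [tok]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]            # greedy, similarity-blind selection
    return beams[0][1]                             # best trajectory found
```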

Efficient Exploration via State Marginal Matching

  • Source: OpenReview; found while searching for "state-action marginal" (2019)

Off-Policy Deep Reinforcement Learning without Exploration

Future-conditioned Unsupervised Pretraining for Decision Transformer

Transformers as Decision Makers: Provable In-Context Reinforcement Learning via Supervised Pretraining (theory-heavy)

Supervised Pretraining Can Learn In-Context Reinforcement Learning

Prompting Decision Transformer for Few-Shot Policy Generalization

  • Source: cited by Decision-Pretrained Transformer / OpenReview
  • Main content: DT + few-shot prompting to achieve meta-learning / generality.

Human-Timescale Adaptation in an Open-Ended Task Space(Ada)

  • Source: cited by Decision-Pretrained Transformer / OpenReview
  • Authors: DeepMind Adaptive Agents Team

In-context Reinforcement Learning with Algorithm Distillation (AD)(ICLR2023)

  • Source: Zhihu / DPT citations
  • Authors: DeepMind
  • Main content
    • Motivation: transformer-based decision models have evolved from single-task DT, to same-domain multi-task MGDT, to cross-domain multi-task GATO. All of these learn a policy from an offline dataset, i.e., Policy Distillation (PD). But PD does not reflect the trial-and-error learning process; it learns a policy from offline RL data by imitation. In other words, PD learns a policy but is not an RL algorithm, and cannot improve itself through further interaction with the environment (training from a dataset does not capture the learning process).
    • Algorithm Distillation (AD): the transformer represents not only a fixed policy, but a policy improvement operator.
    • AD's goal is to model the RL algorithm itself, realizing in-context RL by modelling offline data with an imitation loss, whereas PD's goal is to learn a policy that solves one specific task.
    • How AD works (a sketch follows after this list): 1) record the entire training run of an RL algorithm on many individual tasks to form a large multi-task dataset; 2) a transformer causally models actions using the preceding learning history as its context. Because the source RL algorithm's policy improves throughout training, AD is forced to learn the improvement operator in order to model actions accurately. Note that the transformer's context must be long enough to span learning updates (e.g., across-episodic) to capture improvement in the training data.
    • AD models state-action-reward tokens and does not condition on return.
    • \(L(\Theta):=-\sum_{n=1}^{N}\sum_{t=1}^{T-1}\log P_{\Theta}\big(A=a_t^{(n)}\mid h_{t-1}^{(n)},o_t^{(n)}\big)\)
  • Experiments
    • The AD model can use any sequence model such as a transformer, RNN, or LSTM (see the appendix); the paper mainly uses a transformer because it can be trained in parallel and works better.
    • Environments: 1) Adversarial Bandit; 2) Dark Room; 3) Dark Key-to-Door; 4) DMLab Watermaze
  • AD is incremental in-context learning rather than plain in-context learning; it can be grouped under meta-RL.
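
A minimal sketch of the imitation loss \(L(\Theta)\) above, with a small causal transformer over across-episodic (observation, previous action, previous reward) tokens; the model class, shapes, and the omission of positional encodings are all simplifying assumptions, not the paper's code:

```python
import torch
import torch.nn as nn

# Hypothetical shapes: B learning histories, each T steps long, spanning several
# episodes of the source RL algorithm's training run.
B, T, obs_dim, n_actions, d = 16, 256, 10, 5, 64

class ADModel(nn.Module):
    def __init__(self):
        super().__init__()
        # One token per (observation, previous action, previous reward) step;
        # positional encodings omitted for brevity.
        self.embed = nn.Linear(obs_dim + n_actions + 1, d)
        layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d, n_actions)

    def forward(self, obs, prev_act_onehot, prev_rew):
        x = self.embed(torch.cat([obs, prev_act_onehot, prev_rew], dim=-1))
        causal = torch.triu(torch.ones(obs.size(1), obs.size(1), dtype=torch.bool), 1)
        h = self.backbone(x, mask=causal)            # step t only attends to steps <= t
        return self.head(h)                          # logits for a_t

model = ADModel()
obs = torch.randn(B, T, obs_dim)
prev_act = nn.functional.one_hot(torch.randint(0, n_actions, (B, T)), n_actions).float()
prev_rew = torch.randn(B, T, 1)
target_act = torch.randint(0, n_actions, (B, T))     # actions of the improving source policy

logits = model(obs, prev_act, prev_rew)
loss = nn.functional.cross_entropy(logits.reshape(-1, n_actions), target_act.reshape(-1))
loss.backward()                                      # the negative log-likelihood L(Θ)
```

Because each history spans several episodes of a policy that keeps improving, minimizing this loss pushes the model to predict better actions later in the context, i.e., to implement the improvement operator.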

Latent Reward: LLM-Empowered Credit Assignment in Episodic Reinforcement Learning(AAAI2025)

MoM: Linear Sequence Modeling with Mixture-of-Memories(2025)

  • Source: Synced (机器之心)
  • arxiv:https://arxiv.org/abs/2502.13685
  • Authors: Shanghai AI Lab
  • Main content: a new direction for LLM architecture: strong memory-scaling ability combined with complexity that stays low in the sequence length.

RvS: What is essential for offline RL via supervised learning? (ICLR2022)

Goal-conditioned reinforcement learning: Problems and solutions

  • Source: ICRL survey
  • Author: Weinan Zhang

RL+Transformer=A General-Purpose Problem Solver

  • Source: arXiv
  • Authors: Micah Rentschler, Jesse Roberts
  • Main content: this work falls into the ICRL category; the line of work runs from Decision Transformer, to Algorithm Distillation, to Decision-Pretrained Transformer, all trained with a supervised-learning loss (maximum log-likelihood). To address the problems that pretrained knowledge goes unused and that DPT needs another model to label optimal actions, this work fine-tunes a Llama 3.1 8B transformer with Deep Q-Learning, replacing the deep network that estimates the value function in DQN with the transformer (see the sketch below).
  • Strengths: trained with the DQN loss; the transformer extracts in-context information, which strengthens generalization.
  • Weakness: the test environment is Frozen Lake, simple and single, so true general-purpose ability is not really demonstrated.
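
A minimal sketch of the DQN-style objective used for the fine-tuning, with a generic `q_net` standing in for the transformer that maps an in-context representation to one Q-value per action (nothing Llama-specific is shown, and the replay buffer and target-network refresh are omitted):

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins: any network mapping a context representation to
# per-action Q-values can play the transformer's role here.
n_actions, ctx_dim, gamma = 4, 32, 0.99
q_net = nn.Sequential(nn.Linear(ctx_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(ctx_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())       # periodically re-synced in practice

def dqn_loss(ctx, act, rew, next_ctx, done):
    q_sa = q_net(ctx).gather(1, act.unsqueeze(1)).squeeze(1)          # Q(s_t, a_t)
    with torch.no_grad():
        target = rew + gamma * (1 - done) * target_net(next_ctx).max(dim=1).values
    return nn.functional.mse_loss(q_sa, target)       # TD error, not an imitation loss

# One illustrative batch of transitions.
B = 8
loss = dqn_loss(torch.randn(B, ctx_dim),
                torch.randint(0, n_actions, (B,)),
                torch.randn(B),
                torch.randn(B, ctx_dim),
                torch.zeros(B))
loss.backward()
```

Unlike AD/DPT, the training signal here is a TD target rather than labeled (optimal) actions, which is why no extra labeling model is needed.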

RL\(^2\): Fast Reinforcement Learning via Slow Reinforcement Learning

  • Source: ICLR2017 (rejected)
  • Openreview
  • Authors: Yan Duan, John Schulman, Xi Chen, Peter L. Bartlett, Ilya Sutskever, Pieter Abbeel
  • Main content: a pioneering work in meta-RL. Each trial consists of several episodes of the same sampled MDP; within a trial an RNN (GRU) consumes the sequence of inputs and outputs actions, and at the trial level an RL algorithm such as TRPO trains the whole system, with the objective of maximizing the return of each trial (a rough sketch below).
  • At the macro level it is black-box prediction modelled by an RNN.
  • Experiment environments: multi-armed bandits, tabular MDPs, and a visual maze.
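
A rough sketch of the RL\(^2\) policy interface: the GRU hidden state is carried across episodes within a trial and only reset between trials; the per-step input layout follows the paper (observation, previous action, previous reward, done flag), while the environment loop and the TRPO update are omitted:

```python
import torch
import torch.nn as nn

obs_dim, n_actions, hid = 6, 3, 32

class RL2Policy(nn.Module):
    def __init__(self):
        super().__init__()
        # Per-step input: (observation, previous action one-hot, previous reward, done flag).
        self.cell = nn.GRUCell(obs_dim + n_actions + 2, hid)
        self.pi = nn.Linear(hid, n_actions)

    def step(self, obs, prev_a, prev_r, prev_d, h):
        h = self.cell(torch.cat([obs, prev_a, prev_r, prev_d], dim=-1), h)
        return torch.distributions.Categorical(logits=self.pi(h)), h

policy = RL2Policy()
h = torch.zeros(1, hid)                  # reset ONLY at the start of a trial
prev_a = torch.zeros(1, n_actions)
prev_r = torch.zeros(1, 1)
prev_d = torch.zeros(1, 1)
for t in range(10):                      # steps that may span several episodes of one MDP
    obs = torch.randn(1, obs_dim)        # stand-in for an environment observation
    dist, h = policy.step(obs, prev_a, prev_r, prev_d, h)
    a = dist.sample()                    # TRPO maximizes the return summed over the trial
    prev_a = nn.functional.one_hot(a, n_actions).float()
```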

On Designing Effective RL Reward at Training Time for LLM Reasoning

  • Source: Zhihu discussion of process reward modeling (ICLR2025 Reject)
  • Openreview

Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review

  • Source: exploring the probabilistic-inference formulation of RL (2018)
  • Author: Sergey Levine
  • Main content: unlike earlier RL methods that simply take a reward function provided or designed for the problem, this paper interprets the reward as the log-probability of a binary optimality variable \(\mathcal{O}_t\), builds a probabilistic-inference framework around it, and derives maximum entropy RL. Fairly theoretical.
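  • The core construction, as I understand the tutorial: introduce a binary optimality variable with \(p(\mathcal{O}_t=1\mid s_t,a_t)=\exp\big(r(s_t,a_t)\big)\); conditioning the trajectory distribution on \(\mathcal{O}_{1:T}=1\) and doing variational inference over the policy recovers the maximum-entropy objective \(J(\pi)=\sum_t\mathbb{E}_{(s_t,a_t)\sim\rho_\pi}\big[r(s_t,a_t)+\mathcal{H}\big(\pi(\cdot\mid s_t)\big)\big]\) (up to a temperature on the entropy term).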

Masked Visual-Tactile Pre-training for Robot Manipulation

  • Source: another student's work presented at the big group meeting
  • Authors: 刘庆涛, 叶琦
  • Main content: to tackle the difficulty of training robot-arm manipulation, the paper proposes a pre-trained-representation paradigm based on human demonstrations, and then uses PPO on top of this encoder to train downstream tasks. Note that the core of the work is visual and tactile demonstrations of humans grasping objects: an encoder-decoder architecture is trained to learn feature extraction and alignment, and the encoder part is then used as the pre-trained encoder; the purpose of the pre-training here is to lighten the load on the main network. Downstream tasks are then trained with PPO in a simulated environment.
  • The whole paper stays within human demonstrations and simulated environments, so it never actually controls a real robot arm...... (it took me a while to figure out which setting each part of the method lives in; that said, a policy trained on the simulation platform should also transfer to a real arm)
  • The experiments also compare the pre-trained encoder with and without feature fusion.
  • I had two initial misunderstandings about this work: 1) I assumed the RL training happened on a real robot arm, but PPO, being on-policy, has to be trained on a simulation platform given sample-efficiency considerations; 2) I mistook the pre-training here for something like Decision-Pretrained Transformer, whereas it only pre-trains a feature-representation encoder and does not involve training the main policy network.
  • Also: 1) I did not expect the MuJoCo platform to offer such a rich set of tasks, including a dexterous hand picking up a bottle; 2) the simulation platform apparently can also provide pressure-sensor data.
  • I was initially puzzled about how a real robot arm could be trained with RL; later, reading the code in the post 腿足机器人之十三-强化学习PPO算法 made the pipeline clear. I had been somewhat armchair about it before (laughs)

Can Wikipedia Help Offline Reinforcement Learning?

  • Source: RL with Transformer (ICLR2023 Reject)
  • [Openreview](https://openreview.net/forum?id=eHrqmewX1B-)
  • 作者:Machel Reid, Yutaro Yamada, Shixiang Shane Gu
  • Main content: uses a pre-trained LLM as the backbone and fine-tunes it on RL (Atari) tasks. Compared with DT, which trains a transformer from scratch on the RL (Atari) tasks, this greatly shortens training time (3~6x), with a slight gain in performance.
  • The baselines used for comparison also have to be sequential: CLIP's text encoder is autoregressive and is kept, but CLIP's image encoder is not autoregressive and is dropped. ImageGPT uses the same transformer architecture as GPT-2, but it is trained not on text but on image pixels unrolled into long sequences.
  • The fine-tuning loss has several parts: \(L=L_{MSE}+\lambda_1 L_{cos}+\lambda_2 L_{LM}\), where \(L_{MSE}\) is the mean-squared-error loss for the main trajectory-modelling objective, \(L_{LM}\) is the negative-log-likelihood language-modelling objective, and \(L_{cos}=-\sum_{i=0}^{3N}\max_{j}\cos(I_i,E_j)\) measures how close the input embeddings are to the (language) embedding vectors (a sketch of the combination below).
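
A minimal sketch of how the three terms could be combined, following the formula above (the tensors and the fixed \(\lambda\) values are placeholders, not the paper's code):

```python
import torch
import torch.nn as nn

# Hypothetical shapes: B trajectories, 3N tokens each (state/action/return), embedding dim d.
B, three_N, d, vocab = 4, 30, 16, 100

input_emb = torch.randn(B, three_N, d)      # I_i: trajectory input embeddings
lm_emb = torch.randn(vocab, d)              # E_j: language-token embeddings
pred = torch.randn(B, three_N, d)           # continuous trajectory predictions
target = torch.randn(B, three_N, d)
lm_logits = torch.randn(B, three_N, vocab)  # auxiliary language-modelling head
lm_targets = torch.randint(0, vocab, (B, three_N))

# L_MSE: main trajectory-modelling objective.
l_mse = nn.functional.mse_loss(pred, target)

# L_cos = -sum_i max_j cos(I_i, E_j): pull each input embedding toward
# its nearest language-token embedding.
sims = nn.functional.normalize(input_emb, dim=-1) @ nn.functional.normalize(lm_emb, dim=-1).T
l_cos = -sims.max(dim=-1).values.sum(dim=-1).mean()

# L_LM: negative log-likelihood language-modelling objective.
l_lm = nn.functional.cross_entropy(lm_logits.reshape(-1, vocab), lm_targets.reshape(-1))

lam1, lam2 = 0.1, 1.0                        # λ1, λ2: illustrative weights
loss = l_mse + lam1 * l_cos + lam2 * l_lm
```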

POPGym: Benchmarking Partially Observable Reinforcement Learning

  • Source: a partially observable RL benchmark mentioned in other papers (ICLR2023 poster)
  • openreview

Hindsight experience replay

  • Source: mentioned in other papers (NIPS 2017)

Benchmarking the spectrum of agent capabilities (Dreamerv2)

  • Source: mentioned in other papers (ICLR2022)

Meta-World: A Benchmark and Evaluation for Multi-Task and Meta Reinforcement Learning

  • Source: a meta-RL benchmark of distinct robot-arm manipulation tasks (PMLR2020)
  • Analysis blog post

AMAGO-2: Breaking the Multi-Task Barrier in Meta-Reinforcement Learning with Transformers

  • Source: follow-up improvement of AMAGO (NIPS2024)
  • Openreview

VariBAD: variational Bayes-adaptive deep RL via meta-learning

  • Source: meta-RL without transformer (from AMAGO2 openreview)
  • arxiv

VariBAD: A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning, 2020

  • Source: meta-RL without transformer (from AMAGO2 openreview), ICLR2020
  • openreview

Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables

  • Source: meta-RL without transformer (from AMAGO2 openreview), ICLR2019 workshop LLD
  • openreview

Improving Context-Based Meta-Reinforcement Learning with Self-Supervised Trajectory Contrastive Learning

  • Source: meta-RL without transformer (from AMAGO2 openreview)
  • arxiv

Soft actor-critic for discrete action settings

  • Source: CS285 (a classic and strong RL algorithm)
  • arxiv

Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor

A distributional perspective on reinforcement learning

  • Source: mentioned by AMAGO2 as recasting the critic loss from regression to classification (PMLR2017)
  • arxiv
  • blog

Transformers Can Learn Temporal Difference Methods for In-Context Reinforcement Learning

  • Source: ICLR2025
  • openreview
  • Key notes:
  • 1. We found that in-context TD learning emerges only when the transformer is trained across a diverse range of environments. When restricted to a single-task setup, the transformer can "cheat" by learning heuristics specific to that environment rather than a generalizable value-estimation algorithm. By exposing the transformer to multiple tasks, we force its weights to generalize, leading to the emergence of a robust algorithm like TD. This aligns with previous studies (including in-context supervised learning studies) which demonstrate that task diversity during training plays a crucial role in encouraging the development of generalizable in-context learning capabilities.
  • 2. The context can span multiple episodes. As the context length increases, action quality improves, suggesting that this improvement is not due to memorized policies encoded in the fixed parameters. Instead, it indicates that a reinforcement learning process occurs during the forward pass as the agent processes the context, a phenomenon termed in-context reinforcement learning (ICRL), where RL occurs at inference time within the forward pass.
  • 3. In supervised pretraining, the agent is explicitly tasked with imitating the behavior of some existing RL algorithm demonstrated in an offline dataset; in reinforcement pretraining, the agent is only tasked with maximizing the return, and there is no constraint on how the agent network should achieve this in the forward pass.
  • 4. Open problems: when a transformer estimates the value function, its internal operations align with TD learning. Most transformers, however, serve as policies rather than value-function estimators. It would be interesting to know the authors' perspective on whether, if the transformer were to act as a policy instead of estimating a value function, the internal operations would still resemble TD learning, or whether another reinforcement learning algorithm would take place internally. (This work only studies the value-estimator role a transformer plays in TD-style algorithms within ICRL, and does not investigate policy-type algorithms; a tabular TD(0) reminder is sketched below.)
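
For reference, the tabular TD(0) rule that the paper argues emerges inside the transformer's forward pass, written out on a hypothetical 1-D random walk just to fix notation:

```python
import random

# Tabular TD(0) value estimation.
n_states, alpha, gamma = 5, 0.1, 0.9
V = [0.0] * n_states

def td0_update(s, r, s_next, done):
    # TD error: delta = r + gamma * V(s') - V(s), with no bootstrap at terminal states.
    target = r + (0.0 if done else gamma * V[s_next])
    V[s] += alpha * (target - V[s])

# Hypothetical 1-D random walk providing stand-in experience.
s = 2
for _ in range(2000):
    s_next = s + random.choice([-1, 1])
    done = s_next in (0, n_states - 1)
    r = 1.0 if s_next == n_states - 1 else 0.0
    td0_update(s, r, s_next, done)
    s = 2 if done else s_next
```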

Transformers learn to implement preconditioned gradient descent for in-context learning

  • Source: NIPS2023

How do Transformers perform In-Context Autoregressive Learning?

posted @ 2025-03-02 21:01 霜尘FrostDust