Paper Quick Reads | September 2025

What Can RL Bring to VLA Generalization? An Empirical Study

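arXiv
Attaches an MLP head to the last layer of the VLA model to produce a value estimate (Q-value in my notes; PPO's critic normally estimates the state value V(s)), so that RL algorithms such as PPO can be used for fine-tuning.
PPO outperforms DPO, GRPO, and others.
RL fine-tuning improves the VLA's generalization.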
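To make the first two notes concrete, here is a minimal sketch in plain Python of a small MLP value head on top of a last-layer feature vector, together with PPO's clipped surrogate for a single sample. The function names, the toy weights, and the clip range of 0.2 are my own assumptions for illustration; this is not the paper's code.

```python
import math

def mlp_value_head(hidden, w1, b1, w2, b2):
    """Two-layer MLP (ReLU) mapping a hidden-state vector to one scalar value."""
    h = [max(0.0, sum(w * x for w, x in zip(row, hidden)) + b)
         for row, b in zip(w1, b1)]
    return sum(w * x for w, x in zip(w2, h)) + b2

def ppo_clip_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO clipped surrogate for one (state, action) sample (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)              # pi_new(a|s) / pi_old(a|s)
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped * advantage)  # pessimistic bound

# Toy usage: `hidden` stands in for the VLA's last-layer feature (hypothetical).
hidden = [1.0, 2.0]
value = mlp_value_head(hidden, [[1.0, 0.0], [0.0, -1.0]], [0.0, 0.0], [1.0, 1.0], 0.5)
advantage = 2.0 - value  # e.g. an observed return of 2.0 minus the value baseline
objective = ppo_clip_objective(math.log(2.0), 0.0, advantage)
```

The clipping is what keeps a single update from moving the fine-tuned policy too far from the policy that collected the data.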

Shortcut Learning in Generalist Robot Policies: The Role of Dataset Diversity and Fragmentation

CoRL 2025
[https://arxiv.org/2508.06426](https://arxiv.org/2508.06426)
Presentation video
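Investigates why VLA models generalize poorly:
- Task-irrelevant factors (such as the background) influence the policy's decisions and weaken the causally relevant correlations
- The training datasets severely lack diversity (in viewpoints and instructions)
- Subtasks are only weakly related to one another
Uses the regenerate feature of the LIBERO dataset to change the environment settings and thereby increase diversity
- Adjust the ratio of different viewpoints
Builds a real-world dataset
- Vary the viewpoint; increase similarity across datasets during training (grasping tasks on the same object)
Modifies existing datasets (because re-collecting data is too costly)
- Data augmentation
  - Use existing 3D viewpoint-augmentation models, e.g. VISTA
Increasing instruction diversity helps more than increasing viewpoint diversity.
For mobile robots the viewpoint is fixed, so only instruction diversity needs to be increased.
If the robot's proprioceptive state is given as input, the policy may rely on that state alone and neglect vision.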

Denoising Diffusion Implicit Models

DDIM: speeds up the DDPM sampling process.
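The speed-up comes from a deterministic update that can jump between non-adjacent noise levels. Below is a minimal sketch of the deterministic (eta = 0) DDIM step in plain Python. The function name and the toy numbers are my own; `eps_pred` stands in for a trained noise-prediction network, which is not shown.

```python
import math

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update from noise level t to an earlier level.

    alpha_bar_* are cumulative products of (1 - beta), as in DDPM.
    Returns (x_prev, x0_pred).
    """
    # Estimate the clean sample implied by the predicted noise.
    x0_pred = [(x - math.sqrt(1.0 - alpha_bar_t) * e) / math.sqrt(alpha_bar_t)
               for x, e in zip(x_t, eps_pred)]
    # Re-noise that estimate to the (lower) target noise level.
    x_prev = [math.sqrt(alpha_bar_prev) * x0 + math.sqrt(1.0 - alpha_bar_prev) * e
              for x0, e in zip(x0_pred, eps_pred)]
    return x_prev, x0_pred

# Toy check with an oracle noise prediction: one big jump recovers x0.
x0 = [1.0, -2.0]
eps = [0.5, 0.3]
alpha_bar_t = 0.1
x_t = [math.sqrt(alpha_bar_t) * a + math.sqrt(1.0 - alpha_bar_t) * e
       for a, e in zip(x0, eps)]
x_prev, x0_pred = ddim_step(x_t, eps, alpha_bar_t, alpha_bar_prev=1.0)
```

Because `alpha_bar_prev` can be any earlier level, a sampler can visit only a short subsequence of the original timesteps.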

FiLM: Visual Reasoning with a General Conditioning Layer

AAAI 2018
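Feature-wise Linear Modulation (FiLM) lets one source of information (such as a text instruction) dynamically influence and adjust another (such as the image features being processed). Rather than simply concatenating text and image information, FiLM uses the text to generate a set of "scale" and "shift" parameters that finely and dynamically steer how the image features are processed inside the network.
OpenVLA-OFT applies this technique so that the whole vision pipeline is task-oriented from the very start.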
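The scale-and-shift idea fits in a few lines. This is a minimal sketch in plain Python, assuming the conditioning vector (for example a text embedding) is mapped to per-channel `gamma` and `beta` by two linear layers. All names and toy weights here are hypothetical.

```python
def linear(x, weight, bias):
    """Plain dense layer: weight is a list of rows, one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def film(features, cond, w_gamma, b_gamma, w_beta, b_beta):
    """Apply FiLM: out[c][i] = gamma[c] * features[c][i] + beta[c].

    features: list of C channels, each a flat list of spatial activations.
    cond: conditioning vector that generates one (gamma, beta) pair per channel.
    """
    gamma = linear(cond, w_gamma, b_gamma)
    beta = linear(cond, w_beta, b_beta)
    return [[g * v + b for v in channel]
            for channel, g, b in zip(features, gamma, beta)]

# Toy usage: 2 channels with 2 spatial positions each, 2-dim conditioning vector.
features = [[1.0, 2.0], [3.0, 4.0]]
cond = [1.0, 0.0]
out = film(features, cond,
           w_gamma=[[2.0, 0.0], [0.0, 1.0]], b_gamma=[0.0, 0.0],
           w_beta=[[0.0, 0.0], [1.0, 0.0]], b_beta=[0.5, 0.0])
```

In this toy case the second channel gets `gamma = 0`, so the conditioning vector switches that channel off entirely, which concatenation alone cannot do.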

Diffusion Policy: Visuomotor Policy Learning via Action Diffusion

RSS 2023
Training: learn to predict the noise added to real action samples.
Inference: start from random noise and denoise it step by step.
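Both phases can be sketched with a standard DDPM-style schedule in plain Python. Everything here is an assumption for illustration: the linear beta schedule, the function names, and the `predict_noise` callable. In the real method that callable is a network conditioned on observations, which is omitted.

```python
import math
import random

def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule plus cumulative alpha_bar (needs num_steps >= 2)."""
    betas = [beta_start + (beta_end - beta_start) * i / (num_steps - 1)
             for i in range(num_steps)]
    alpha_bars, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        alpha_bars.append(prod)
    return betas, alpha_bars

def training_example(action, alpha_bar_t, rng):
    """Noise a real action; the sampled noise is the regression target."""
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    noisy = [math.sqrt(alpha_bar_t) * a + math.sqrt(1.0 - alpha_bar_t) * n
             for a, n in zip(action, noise)]
    return noisy, noise

def noise_prediction_loss(pred, target):
    """Mean squared error between predicted and true noise."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

def sample_action(predict_noise, action_dim, betas, alpha_bars, rng):
    """Start from Gaussian noise and apply DDPM reverse steps down to t = 0."""
    x = [rng.gauss(0.0, 1.0) for _ in range(action_dim)]
    for t in reversed(range(len(betas))):
        eps = predict_noise(x, t)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        x = [(xi - coef * e) / math.sqrt(1.0 - betas[t]) for xi, e in zip(x, eps)]
        if t > 0:  # no fresh noise is added on the final step
            x = [xi + math.sqrt(betas[t]) * rng.gauss(0.0, 1.0) for xi in x]
    return x

rng = random.Random(0)
betas, alpha_bars = make_schedule(10)
noisy, noise = training_example([0.5, -0.5], alpha_bars[5], rng)
# Placeholder predictor that always outputs zero noise (hypothetical stand-in).
action = sample_action(lambda x, t: [0.0] * len(x), 2, betas, alpha_bars, rng)
```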

Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

RSS 2025
OpenVLA-OFT
Key design choices:
- Parallel decoding (and action chunking)
- Continuous actions
- L1 regression learning objective
- FiLM (only for ALOHA)
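As a small illustration of the continuous-action and L1 items above, here is how an L1 regression loss over a whole predicted action chunk could be computed in plain Python. The chunk layout (a list of K actions, each a list of D floats) and the function name are my own assumptions, not the paper's code.

```python
def l1_chunk_loss(pred_chunk, target_chunk):
    """Mean absolute error over every dimension of every action in the chunk."""
    total, count = 0.0, 0
    for pred, target in zip(pred_chunk, target_chunk):
        for p, t in zip(pred, target):
            total += abs(p - t)
            count += 1
    return total / count

# Toy usage: a chunk of K=2 actions with D=2 dimensions each.
pred_chunk = [[0.0, 0.0], [1.0, 1.0]]
target_chunk = [[1.0, 0.0], [1.0, 3.0]]
loss = l1_chunk_loss(pred_chunk, target_chunk)
```

With parallel decoding the whole chunk comes out of one forward pass, so the loss is taken over all K actions at once rather than token by token.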

Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

ACT
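Tackles embodied-intelligence tasks with imitation learning.
Uses action chunking: each inference predicts a chunk of future actions, which substantially raises the effective action output rate.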
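A minimal sketch of open-loop action chunking in plain Python shows why the policy is queried less often. The `policy` callable and its signature are hypothetical, and ACT's temporal ensembling over overlapping chunks is deliberately left out.

```python
def run_with_chunking(policy, num_steps, chunk_size):
    """Execute actions for num_steps, querying the policy only when the
    current chunk is used up. Returns (executed_actions, num_policy_calls)."""
    executed, calls, queue = [], 0, []
    for t in range(num_steps):
        if not queue:
            queue = list(policy(t, chunk_size))  # one forward pass -> chunk_size actions
            calls += 1
        executed.append(queue.pop(0))
    return executed, calls

# Toy policy (stand-in): the "action" is just the timestep it is meant for.
def toy_policy(t, k):
    return [t + i for i in range(k)]

actions, calls = run_with_chunking(toy_policy, num_steps=10, chunk_size=5)
_, calls_no_chunk = run_with_chunking(toy_policy, num_steps=10, chunk_size=1)
```

With a chunk size of 5 the policy runs once every five control steps, so a slow model can still drive a fast control loop.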

OpenVLA: An Open-Source Vision-Language-Action Model

A classic VLA work.

VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model

Currently reading.

OpenHelix: A Short Survey and Empirical Analysis and Open-Source Dual-System VLA Model for Robotic Manipulation.

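Dual-system design: an intermediate latent token links the VLM and the policy. This asynchronous mechanism improves coordination and mitigates latency.
The fast/slow system work is updated continuously.
Project page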

LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning

A robot evaluation benchmark built on MuJoCo and robosuite.
Documentation

Towards Generalist Robot Policies: What Matters in Building Vision-Language-Action Models

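arXiv
RoboVLMs
Conclusions:
- Continuous actions matter: with discrete actions, accumulated error in long-horizon tasks severely hurts performance.
- History observations matter: the longer the observation history, the better the performance, at the cost of more computation.
  - On how history is used: a policy head (as in RoboFlamingo) is more effective than the interleaved approach (as in Gato). The authors attribute this to the policy head preserving the VLM's original vision-language fusion ability while still integrating history effectively. The interleaved approach also requires more memory and compute.
...I found what look like quite a few errors in this paper? See this Issue for details.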

HiRT: Enhancing Robotic Control with Hierarchical Robot Transformers

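CoRL 2024
The VLM extracts latent information that serves the action head.
The action head also directly takes the visual input processed by a ViT that is conditioned, via FiLM and cross-attention, on the latent information output by the VLM. In addition, the latent-conditioned action is fixed as a_t = MLP(MAP[X_tk, X^v_:t]).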

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

NVIDIA
Fast/slow dual system: VLM + diffusion transformer.
The action head uses latent information from the intermediate 12th layer.

Flow Matching for Generative Modeling

The action generation method used by GR00T.
It belongs to the diffusion family of methods; I don't fully understand it yet...
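A toy version helped me see the core idea. The sketch below, in plain Python, assumes the common straight-line path x_t = (1 - t) * noise + t * data, for which the regression target is the constant velocity data - noise. Sampling then integrates the learned velocity field from t = 0 to t = 1 with Euler steps. The function names and the oracle velocity are my own stand-ins, and GR00T's actual time convention and network may differ.

```python
def flow_matching_example(noise, data, t):
    """Build one training pair: the point on the straight path at time t
    and the velocity the network should predict there."""
    x_t = [(1.0 - t) * n + t * d for n, d in zip(noise, data)]
    target_velocity = [d - n for n, d in zip(noise, data)]
    return x_t, target_velocity

def euler_sample(velocity_fn, x0, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

noise = [0.0, 2.0]   # starting sample (would be Gaussian noise in practice)
data = [1.0, -1.0]   # target sample, e.g. an action vector
x_t, v_target = flow_matching_example(noise, data, t=0.25)

# Oracle velocity field for this single pair (stand-in for a trained network).
sample = euler_sample(lambda x, t: v_target, noise, num_steps=4)
```

Unlike the diffusion notes above, there is no noise schedule here: training is plain regression onto a velocity, and inference is a few ODE steps.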

Posted 2025-09-03 21:52 by 霜尘FrostDust
