StarAI: Reproducing the ACT Algorithm on the LeRobot Arm

Theoretical Background (optional)

To keep this post short, see the companion article for the theory:

ACT algorithm model theory and the corresponding code analysis

Practice

1.0 System Requirements

Operating system: Linux (Ubuntu 20.04+ recommended) or macOS
Python version: 3.8+
GPU: NVIDIA GPU (RTX 3070 or better recommended) with at least 6 GB of VRAM
Memory: at least 16 GB RAM
Storage: at least 30 GB free space

1.1 Recording a Dataset

This step was already covered in the previous post of this series (the StarAI LeRobot arm tutorial), so it is not repeated here.

Data quality requirements

At least 50 episodes for basic training
200+ episodes recommended for best results
Each episode should contain a complete task execution
Multi-view images (at least 2 cameras)
High-quality action annotations

1.2 Modifying the Dataset-Recording Code (optional)

This part follows an improvement to the ACT algorithm proposed by an engineer on Bilibili; here I adapt it to the StarAI branch.

First, open src/lerobot/datasets/lerobot_dataset.py:

1.2.1 Add a new method to the LeRobotDataset class

    # aug: add mean filtering
    def actions_mean_filtering(self, raw_actions: list[list[float]], mean_num: int = 5) -> list[list[float]]:
        """
        Apply 1-D mean filtering to an action sequence.
        raw_actions: [action_dim][T]
        Returns a smoothed result of the same shape.
        """
        action_dim = len(raw_actions)
        T = len(raw_actions[0])
        filter_actions = [[0.0] * T for _ in range(action_dim)]

        for i in range(T):
            for d in range(action_dim):
                if i < mean_num or i > T - mean_num - 1:
                    # Keep the head and tail unfiltered
                    filter_actions[d][i] = raw_actions[d][i]
                    continue

                total = 0.0
                # mean_num points after the current one
                for j in range(i + 1, i + 1 + mean_num):
                    total += raw_actions[d][j]
                # mean_num points before the current one
                for j in range(1, 1 + mean_num):
                    total += raw_actions[d][i - j]

                filter_actions[d][i] = total / (mean_num * 2.0)

        return filter_actions
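To sanity-check the filter, here is a self-contained sketch: a standalone copy of the same logic (the function name `mean_filter` and the toy trajectory are made up for illustration). Because the current point is excluded from its own average, a lone interior spike is removed entirely, while the head and tail stay untouched:

```python
def mean_filter(raw_actions, mean_num=5):
    """Standalone copy of actions_mean_filtering: average the mean_num
    neighbours on each side of every interior point; the point itself is
    excluded, matching the original code."""
    action_dim = len(raw_actions)
    T = len(raw_actions[0])
    out = [[0.0] * T for _ in range(action_dim)]
    for i in range(T):
        for d in range(action_dim):
            if i < mean_num or i > T - mean_num - 1:
                out[d][i] = raw_actions[d][i]  # head and tail kept as-is
                continue
            total = sum(raw_actions[d][i + 1 : i + 1 + mean_num])  # points after
            total += sum(raw_actions[d][i - mean_num : i])         # points before
            out[d][i] = total / (mean_num * 2.0)
    return out

# One joint: a flat trajectory with a single spike in the middle.
traj = [[1.0] * 21]
traj[0][10] = 5.0
smoothed = mean_filter(traj, mean_num=5)
print(traj[0][10], "->", smoothed[0][10])  # prints: 5.0 -> 1.0
```

Note that the neighbours of the spike (e.g. index 9) are pulled up slightly, since the spike participates in their windows; that is the expected behavior of a moving average.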

1.2.2 Modify the save_episode method of the LeRobotDataset class:

    def save_episode(
        self,
        episode_data: dict | None = None,
        parallel_encoding: bool = True,
    ) -> None:
        """
        This will save to disk the current episode in self.episode_buffer.

        Video encoding is handled automatically based on batch_encoding_size:
        - If batch_encoding_size == 1: Videos are encoded immediately after each episode
        - If batch_encoding_size > 1: Videos are encoded in batches.

        Args:
            episode_data (dict | None, optional): Dict containing the episode data to save. If None, this will
                save the current episode in self.episode_buffer, which is filled with 'add_frame'. Defaults to
                None.
            parallel_encoding (bool, optional): If True, encode videos in parallel using ProcessPoolExecutor.
                Defaults to True on Linux, False on macOS as it tends to use all the CPU available already.
        """
        episode_buffer = episode_data if episode_data is not None else self.episode_buffer

        validate_episode_buffer(episode_buffer, self.meta.total_episodes, self.features)

        # size and task are special cases that won't be added to hf_dataset
        episode_length = episode_buffer.pop("size")
        tasks = episode_buffer.pop("task")
        episode_tasks = list(set(tasks))
        episode_index = episode_buffer["episode_index"]

        # aug: smooth the actions (if the 'action' key exists)
        if "action" in episode_buffer:
            # episode_buffer['action'] is assumed to be a list of length T,
            # each element a list/np.array of length action_dim
            T = len(episode_buffer["action"])
            if T > 0:
                action_dim = len(episode_buffer["action"][0])
                # Transpose to [action_dim][T]
                raw_actions = [[episode_buffer["action"][t][d] for t in range(T)] for d in range(action_dim)]
                # Call the mean filter defined above
                filtered = self.actions_mean_filtering(raw_actions, mean_num=5)
                # Write back to episode_buffer['action'] as [T][action_dim]
                for t in range(T):
                    for d in range(action_dim):
                        episode_buffer["action"][t][d] = filtered[d][t]


        episode_buffer["index"] = np.arange(self.meta.total_frames, self.meta.total_frames + episode_length)
        episode_buffer["episode_index"] = np.full((episode_length,), episode_index)

        # Update tasks and task indices with new tasks if any
        self.meta.save_episode_tasks(episode_tasks)

        # Given tasks in natural language, find their corresponding task indices
        episode_buffer["task_index"] = np.array([self.meta.get_task_index(task) for task in tasks])

Effect of the change:

  • Every time an episode is saved, the action time series is mean-filtered once.

Testing

  • Test plan: run the same pick-and-place task and compare how smooth the actions are before and after the change. Because the record pipeline only filters the trajectory when the episode is saved, the trajectory shown in the rerun window is the unfiltered one, while the saved trajectory is the filtered one. A small program that reads the saved episode back is therefore enough to observe the filtering effect.

  • Test method: modify the lerobot-replay code so that replay can play back the recorded arm trajectory on the robot:

The modified lerobot_replay.py:
import logging
import time
from dataclasses import asdict, dataclass
from pathlib import Path
from pprint import pformat

import rerun as rr

from lerobot.configs import parser
from lerobot.datasets.lerobot_dataset import LeRobotDataset
from lerobot.processor import (
    make_default_robot_action_processor,
)
from lerobot.robots import (  # noqa: F401
    Robot,
    RobotConfig,
    bi_openarm_follower,
    bi_so_follower,
    earthrover_mini_plus,
    hope_jr,
    koch_follower,
    make_robot_from_config,
    omx_follower,
    openarm_follower,
    reachy2,
    so_follower,
    unitree_g1,
)
from lerobot.utils.constants import ACTION
from lerobot.utils.import_utils import register_third_party_plugins
from lerobot.utils.robot_utils import precise_sleep
from lerobot.utils.utils import (
    init_logging,
    log_say,
)
from lerobot.utils.visualization_utils import init_rerun, log_rerun_data


@dataclass
class DatasetReplayConfig:
    # Episode to replay.
    episode: int
    # Dataset identifier. By convention it should match '{hf_username}/{dataset_name}' (e.g. `lerobot/test').
    # If using a local dataset, this can be None and root should be provided.
    repo_id: str | None = None
    # Root directory where the dataset will be stored (e.g. 'dataset/path').
    root: str | Path | None = None
    # Limit the frames per second. By default, uses the policy fps.
    fps: int = 30


@dataclass
class ReplayConfig:
    robot: RobotConfig
    dataset: DatasetReplayConfig
    # Use vocal synthesis to read events.
    play_sounds: bool = True
    # Display data in Rerun
    display_data: bool = False
    # Display data on a remote Rerun server
    display_ip: str | None = None
    # Port of the remote Rerun server
    display_port: int | None = None
    # Whether to display compressed images in Rerun
    display_compressed_images: bool = False


@parser.wrap()
def replay(cfg: ReplayConfig):
    init_logging()
    logging.info(pformat(asdict(cfg)))
    
    # Initialize Rerun if enabled (same as record)
    display_compressed_images = False
    if cfg.display_data:
        init_rerun(session_name="replay", ip=cfg.display_ip, port=cfg.display_port)
        display_compressed_images = (
            True
            if (cfg.display_data and cfg.display_ip is not None and cfg.display_port is not None)
            else cfg.display_compressed_images
        )

    robot_action_processor = make_default_robot_action_processor()

    robot = make_robot_from_config(cfg.robot)
    dataset = LeRobotDataset(cfg.dataset.repo_id, root=cfg.dataset.root, episodes=[cfg.dataset.episode])

    # Filter dataset to only include frames from the specified episode since episodes are chunked in dataset V3.0
    episode_frames = dataset.hf_dataset.filter(lambda x: x["episode_index"] == cfg.dataset.episode)
    actions = episode_frames.select_columns(ACTION)

    robot.connect()

    try:
        log_say("Replaying episode", cfg.play_sounds, blocking=True)
        
        for idx in range(len(episode_frames)):
            start_episode_t = time.perf_counter()

            action_array = actions[idx][ACTION]
            action = {}
            for i, name in enumerate(dataset.features[ACTION]["names"]):
                action[name] = action_array[i]

            robot_obs = robot.get_observation()

            processed_action = robot_action_processor((action, robot_obs))

            _ = robot.send_action(processed_action)
            
            # Log to Rerun if enabled (same pattern as record)
            if cfg.display_data:
                log_rerun_data(
                    observation=robot_obs, action=action, compress_images=display_compressed_images
                )

            dt_s = time.perf_counter() - start_episode_t
            precise_sleep(max(1 / dataset.fps - dt_s, 0.0))
            
    finally:
        robot.disconnect()


def main():
    register_third_party_plugins()
    replay()


if __name__ == "__main__":
    main()
  • Test output:
    Before the change:
    (image: trajectory before filtering)
    After the change:
    (image: trajectory after filtering)

  • Test conclusion:
    Mean filtering does have some effect: it suppresses spike-like noise to a degree. Across several test runs the improvement was often not obvious; the pair shown above is the clearest example I found.
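Beyond eyeballing the replayed trajectories, the smoothing can be quantified. Below is a hedged, self-contained numpy sketch (synthetic single-joint data, not the real dataset; the total-variation metric is my choice, not from the original post) comparing roughness before and after the same style of filter:

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 200)
raw = np.sin(t) + rng.normal(0.0, 0.05, t.size)  # noisy single-joint trajectory

# Same style of filter as in save_episode: average mean_num neighbours on
# each side of every interior point, excluding the point itself.
mean_num = 5
smoothed = raw.copy()
for i in range(mean_num, raw.size - mean_num):
    window = np.concatenate([raw[i - mean_num:i], raw[i + 1:i + 1 + mean_num]])
    smoothed[i] = window.mean()

def total_variation(x):
    """Sum of absolute step-to-step changes; lower means smoother."""
    return np.abs(np.diff(x)).sum()

print(f"raw TV: {total_variation(raw):.3f}  smoothed TV: {total_variation(smoothed):.3f}")
```

Running something like this on the action columns of a saved episode (instead of the synthetic `raw` above) would give a number to back up the visual comparison.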

2. Improving the ACT Algorithm (optional)

First, open src/lerobot/policies/act/modeling_act.py:

2.1 Modify the reset method
    def reset(self):
        """This should be called whenever the environment is reset."""
        if self.config.temporal_ensemble_coeff is not None:
            self.temporal_ensembler.reset()
        else:
            # maxlen is doubled so the queue can also hold the interpolated
            # points inserted by begin_mutation_filter
            self._action_queue = deque([], maxlen=2 * self.config.n_action_steps)
            self.last_action_list = []
            self.last_action = None

2.2 Add new methods:

    # aug: new method
    def begin_mutation_filter(self, actions):
        """Detect a jump between action chunks and bridge it with linear interpolation."""
        if self.last_action is None:
            return

        first_action = actions[0][0].cpu().tolist()

        # Signed difference from the last executed action to the first action of
        # the new chunk. (Using abs() here would make every joint interpolate in
        # the positive direction, which is wrong for joints that need to decrease.)
        delta = [a - b for a, b in zip(first_action, self.last_action)]

        max_increment = 0.06
        add_point_num = int(max(abs(d) for d in delta) / max_increment)
        if add_point_num > 0:
            add_point_increment = [d / add_point_num for d in delta]

            add_point = self.last_action
            for _ in range(add_point_num):
                add_point = [a + b for a, b in zip(add_point, add_point_increment)]
                # Shape (1, 1, action_dim): extend() iterates over dim 0, so one
                # (1, action_dim) element is appended to the queue.
                tensor = torch.tensor([[add_point]], device=actions.device, dtype=actions.dtype)
                self._action_queue.extend(tensor)
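The interpolation logic can be checked in isolation. Here is a hedged sketch using plain lists instead of tensors (the helper name and example values are made up; max_increment matches the 0.06 used above), showing that a jump gets subdivided into small steps for every joint, in the correct direction:

```python
def interpolate_jump(last_action, first_action, max_increment=0.06):
    """Return the intermediate points inserted between the last executed
    action and the first action of the new chunk, stepping each joint by
    at most roughly max_increment per point."""
    delta = [a - b for a, b in zip(first_action, last_action)]
    num = int(max(abs(d) for d in delta) / max_increment)
    if num == 0:
        return []  # jump is small enough; no bridging points needed
    step = [d / num for d in delta]
    points, point = [], list(last_action)
    for _ in range(num):
        point = [a + b for a, b in zip(point, step)]
        points.append(list(point))
    return points

last = [0.0, 0.5]
first = [0.3, 0.2]  # joint 0 jumps up, joint 1 jumps down
pts = interpolate_jump(last, first)
print(len(pts), pts[-1])
```

The last generated point coincides (up to float rounding) with the first action of the new chunk, so the queue transitions smoothly into it.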
                
    # aug: new method
    def actions_mean_filtering(self):
        """Mean-filter the action trajectory currently held in the queue."""
        mean_actions = []  # the filtered trajectory
        mean_num = 8  # number of neighbouring points taken on each side

        action_step_list = []
        action_num = len(self._action_queue)
        # Convert the queued tensors to plain lists
        for i in range(action_num):
            action_step_list.append(self._action_queue[i].cpu().tolist())

        for i in range(action_num):
            # Keep the last mean_num points unfiltered
            if i > action_num - mean_num - 1:
                mean_actions.append(action_step_list[i][0])
                continue

            action_total = [0.0] * len(action_step_list[i][0])

            # mean_num points after the current one
            for j in range(i + 1, i + 1 + mean_num):
                for k in range(len(action_total)):
                    action_total[k] += action_step_list[j][0][k]

            # mean_num points before the current one
            if i < mean_num + 1:
                if len(self.last_action_list) == 0:
                    mean_actions.append(action_step_list[i][0])
                    continue
                else:
                    # Not enough preceding points in this chunk: take the
                    # missing ones from the previously planned trajectory
                    for j in range(i):
                        for k in range(len(action_total)):
                            action_total[k] += action_step_list[j][0][k]
                    for j in range(1, mean_num + 1 - i):
                        for k in range(len(action_total)):
                            action_total[k] += self.last_action_list[-j][0][k]
            else:
                for j in range(1, 1 + mean_num):
                    for k in range(len(action_total)):
                        action_total[k] += action_step_list[i - j][0][k]

            action_mean = [action_total[k] / (mean_num * 2.0) for k in range(len(action_total))]
            mean_actions.append(action_mean)

        # Write the filtered trajectory back into the queue and remember this
        # chunk's raw trajectory for the next call. (The snippet this is based
        # on computed mean_actions but never used it; the write-back below is
        # an added completion, and storing the raw rather than the filtered
        # points in last_action_list is an assumption.)
        if action_num > 0:
            device = self._action_queue[0].device
            dtype = self._action_queue[0].dtype
            for i in range(action_num):
                self._action_queue[i] = torch.tensor([mean_actions[i]], device=device, dtype=dtype)
        self.last_action_list = action_step_list

2.3 Modify the select_action method

    @torch.no_grad()
    def select_action(self, batch: dict[str, Tensor]) -> Tensor:
        """Select a single action given environment observations.

        This method wraps `select_actions` in order to return one action at a time for execution in the
        environment. It works by managing the actions in a queue and only calling `select_actions` when the
        queue is empty.
        """
        self.eval()  # keeping the policy in eval mode as it could be set to train mode while queue is consumed

        if self.config.temporal_ensemble_coeff is not None:
            actions = self.predict_action_chunk(batch)
            action = self.temporal_ensembler.update(actions)
            return action
        
        # aug: remember the last executed action for jump detection
        if len(self._action_queue) == 1:
            self.last_action = self._action_queue[0].cpu().tolist()[0]

        # Action queue logic for n_action_steps > 1. When the action_queue is depleted, populate it by
        # querying the policy.
        if len(self._action_queue) == 0:
            actions = self.predict_action_chunk(batch)[:, : self.config.n_action_steps]
            # aug: jump detection and linear interpolation toward the new chunk
            self.begin_mutation_filter(actions)
            # `self.model.forward` returns a (batch_size, n_action_steps, action_dim) tensor, but the queue
            # effectively has shape (n_action_steps, batch_size, *), hence the transpose.
            self._action_queue.extend(actions.transpose(0, 1))
            # aug: mean-filter the whole queued sequence
            self.actions_mean_filtering()
        return self._action_queue.popleft()
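The queue bookkeeping in select_action can be illustrated without any model. In this hedged sketch the policy is a stub that emits a fixed chunk of scalar "actions"; only the refill-when-empty and save-last-action logic mirrors the method above:

```python
from collections import deque

class QueueDemo:
    """Stub reproducing only the queue logic of select_action: remember the
    last queued action just before the queue empties, and refill from a fake
    chunk when it is empty."""
    def __init__(self, n_action_steps=4):
        self.n_action_steps = n_action_steps
        self.queue = deque([], maxlen=2 * n_action_steps)
        self.last_action = None
        self.chunk_id = 0

    def predict_chunk(self):
        # Fake policy output: n_action_steps scalar actions per chunk.
        self.chunk_id += 1
        return [float(self.chunk_id * 10 + i) for i in range(self.n_action_steps)]

    def select_action(self):
        if len(self.queue) == 1:
            self.last_action = self.queue[0]  # saved for jump detection
        if len(self.queue) == 0:
            self.queue.extend(self.predict_chunk())
        return self.queue.popleft()

demo = QueueDemo()
first_chunk = [demo.select_action() for _ in range(4)]
print(first_chunk, demo.last_action)   # prints: [10.0, 11.0, 12.0, 13.0] 13.0
next_action = demo.select_action()
print(next_action, demo.last_action)   # prints: 20.0 13.0
```

When the second chunk is requested, `last_action` still holds the final action of the first chunk (13.0), which is exactly what begin_mutation_filter needs to measure the jump to 20.0.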

2.4 Modify the forward method:

    def forward(self, batch: dict[str, Tensor]) -> tuple[Tensor, dict]:
        """Run the batch through the model and compute the loss for training or validation."""
        if self.config.image_features:
            batch = dict(batch)  # shallow copy so that adding a key doesn't modify the original
            batch[OBS_IMAGES] = [batch[key] for key in self.config.image_features]

        actions_hat, (mu_hat, log_sigma_x2_hat) = self.model(batch)

        l1_loss = (
            F.l1_loss(batch[ACTION], actions_hat, reduction="none") * ~batch["action_is_pad"].unsqueeze(-1)
        ).mean()

        loss_dict = {"l1_loss": l1_loss.item()}
        if self.config.use_vae:
            # Calculate Dₖₗ(latent_pdf || standard_normal). Note: After computing the KL-divergence for
            # each dimension independently, we sum over the latent dimension to get the total
            # KL-divergence per batch element, then take the mean over the batch.
            # (See App. B of https://huggingface.co/papers/1312.6114 for more details).
            mean_kld = (
                (-0.5 * (1 + log_sigma_x2_hat - mu_hat.pow(2) - (log_sigma_x2_hat).exp())).sum(-1).mean()
            )
            loss_dict["kld_loss"] = mean_kld.item()
            loss = l1_loss + mean_kld * self.config.kl_weight
        else:
            loss = l1_loss
        # aug: smoothness loss: penalize the distance between the predicted
        # actions and their own moving average (a kernel_size-point mean filter
        # applied per action dimension via a grouped conv1d)
        kernel_size = 11
        padding = kernel_size // 2
        x = actions_hat.transpose(1, 2)  # (B, action_dim, T)
        weight = torch.ones(actions_hat.size(-1), 1, kernel_size, device=actions_hat.device) / kernel_size
        filtered_x = F.conv1d(x, weight, padding=padding, groups=actions_hat.size(-1))
        filtered_tensor = filtered_x.transpose(1, 2)
        mean_loss = torch.abs(actions_hat - filtered_tensor).mean()
        loss += mean_loss
        loss_dict["mean_loss"] = mean_loss.item()
        return loss, loss_dict
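The added term can be reproduced outside of torch to see what it actually measures. Here is a hedged numpy sketch (synthetic trajectories; `np.convolve` with "same" mode stands in for the zero-padded grouped conv1d): a jerky prediction is farther from its own moving average than a smooth one, so it incurs a larger penalty.

```python
import numpy as np

def smoothness_loss(actions_hat, kernel_size=11):
    """Mean absolute distance between each action dimension and its own
    moving average, mirroring the conv1d-based term added to forward()."""
    T, action_dim = actions_hat.shape
    kernel = np.ones(kernel_size) / kernel_size
    total = 0.0
    for d in range(action_dim):
        # "same" mode zero-pads at the edges, like conv1d(padding=kernel_size // 2)
        filtered = np.convolve(actions_hat[:, d], kernel, mode="same")
        total += np.abs(actions_hat[:, d] - filtered).mean()
    return total / action_dim

t = np.linspace(0, 2 * np.pi, 100)
smooth_traj = np.stack([np.sin(t), np.cos(t)], axis=1)   # (T, action_dim)
rng = np.random.default_rng(0)
jerky_traj = smooth_traj + rng.normal(0.0, 0.2, smooth_traj.shape)

print(smoothness_loss(smooth_traj), smoothness_loss(jerky_traj))
```

One caveat of the zero padding (in both this sketch and the conv1d version): the filtered signal is biased toward zero near the chunk boundaries, so even a perfectly smooth trajectory pays a small edge penalty.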

Purpose of these changes:

  • At training time, a smoothness term is added to the loss
  • At inference time, jump points between action chunks are bridged by linear interpolation, and the whole queued sequence is then smoothed

Note: every modification is marked with an "aug:" comment.

3. Training

accelerate launch --num_processes=1 $(which lerobot-train) \
  --dataset.repo_id=yourdatasetdir \
  --policy.type=act \
  --policy.device=cuda \
  --policy.chunk_size=100 \
  --policy.n_action_steps=50 \
  --policy.use_amp=true \
  --policy.repo_id=starai/my_policy \
  --batch_size=4 \
  --optimizer.lr=2e-05 \
  --num_workers=4 \
  --output_dir=outputs/train/act_viola_test11 \
  --job_name=act_viola_test \
  --wandb.enable=False \
  --steps=20000 \
  --save_checkpoint=True \
  --save_freq=5000

3.1 Parameter Explanation

3.1.1 Core parameters

(image: table of core parameters)

3.1.2 ACT-specific parameters

(image: table of ACT-specific parameters)

3.1.3 Training parameters

(image: table of training parameters)

4. Evaluation

lerobot-record  \
  --robot.type=lerobot_robot_viola \
  --robot.port=/dev/ttyUSB1 \
  --robot.cameras="{ up: {type: opencv, index_or_path: /dev/video6, width: 640, height: 480, fps: 30, fourcc: MJPG}, front: {type: opencv, index_or_path: /dev/video8, width: 640, height: 480, fps: 30, fourcc: MJPG}}" \
  --robot.id=my_awesome_staraiviola_arm \
  --display_data=false \
  --dataset.repo_id=starai/eval_record-test \
  --dataset.single_task="Pick up the yellow cube to the white box" \
  --policy.path=outputs/train/act_viola_test1/checkpoints/pretrained_model

Frequently Asked Questions (FAQ)

Q: What are ACT's advantages over other imitation learning methods?

A: The main advantages of ACT are:

Reduced compounding error: predicting action chunks limits error accumulation
Higher success rates: strong performance on fine manipulation tasks
End-to-end training: no hand-crafted features required
Multimodal fusion: effectively fuses visual and state information

Q: How do I choose a suitable chunk_size?

A: The choice depends on the task:

Fast tasks: chunk_size = 10-30
Medium tasks: chunk_size = 50-100
Slow tasks: chunk_size = 100-200
As a rule of thumb, start from 50

Q: How long does training take?

A: Training time depends on several factors:

Dataset size: 100 episodes take roughly 4-8 hours (RTX 3070)
Model complexity: larger models take longer
Hardware: a better GPU shortens training significantly
Convergence: typically 50000-100000 steps are needed

Q: How do I handle multi-camera data?

A: Suggestions for multi-camera setups:

Camera selection: choose viewpoints with complementary information
Feature fusion: fuse at the feature level
Attention: let the model learn which views matter
Compute: extra cameras increase the computational load

Q: How can I improve the model's generalization?

A: Ways to improve generalization:

Data diversity: collect data under varied conditions
Data augmentation: apply image and action augmentation
Regularization: appropriate weight decay and dropout
Domain randomization: use domain randomization in simulation
Multi-task learning: train jointly on several related tasks
posted @ 2026-03-14 21:23  zzzking778