VLM 3D Spatial Understanding

CoT

Thinking in Space

Notably, prevailing linguistic reasoning techniques (e.g., chain-of-thought, self-consistency, tree-of-thoughts) fail to improve performance, whereas explicitly generating cognitive maps during question-answering enhances MLLMs’ spatial distance ability.

《Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces》

Linguistic prompting methods (a sketch of the first two follows this list):

  1. Zero-shot chain-of-thought: append "Let's think step by step." after the question to elicit step-by-step reasoning, then run the model again to extract an explicit answer from the previous response (more accurate than fuzzy matching).
  2. Self-Consistency w/ CoT: sample multiple answers and take a majority vote;
  3. Tree-of-Thought: first ask the model to generate 3 plans for answering the question, run the model several more times to vote for the best plan, then have the model execute that plan and answer the question.
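
Below is a minimal sketch of the first two baselines; the `call_llm` function and the exact answer-extraction prompt are placeholders, not the paper's implementation:

```python
from collections import Counter

def call_llm(prompt: str) -> str:
    """Placeholder for a real (M)LLM API call."""
    raise NotImplementedError

def zero_shot_cot(question: str) -> str:
    # Pass 1: elicit step-by-step reasoning.
    reasoning = call_llm(f"{question}\nLet's think step by step.")
    # Pass 2: run the model again to pull an explicit answer out of the reasoning
    # (more reliable than fuzzy-matching the free-form text).
    return call_llm(f"{question}\n{reasoning}\nTherefore, the final answer is:")

def self_consistency(question: str, n: int = 5) -> str:
    # Sample n CoT answers (temperature > 0) and take a majority vote.
    answers = [zero_shot_cot(question) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```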

Cognitive map

We prompt Gemini-1.5 Pro to first generate a cognitive map based on the given video and question, and then to use the predicted map to answer the question.
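
A hedged sketch of such a two-step cognitive-map prompt; the 10x10 grid format, the wording, and `call_mllm` are my assumptions, not the paper's exact template:

```python
def call_mllm(video_frames, prompt: str) -> str:
    """Placeholder for a video-capable MLLM call (e.g. Gemini-1.5 Pro)."""
    raise NotImplementedError

def answer_with_cognitive_map(video_frames, question: str) -> str:
    # Step 1: have the model externalize the scene layout as a coarse map.
    cog_map = call_mllm(
        video_frames,
        "Build a 10x10 top-down cognitive map of the scene shown in the video: "
        "list every visible object with its (row, col) grid cell.",
    )
    # Step 2: answer the question conditioned on the predicted map.
    return call_mllm(
        video_frames,
        f"Cognitive map of the scene:\n{cog_map}\nUsing this map, answer: {question}",
    )
```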

Visual CoT

《Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning》

they often lack interpretability and struggle with complex visual inputs, especially when the resolution of the input image is high or when the interested region that could provide key information for answering the question is small. To address these challenges, we collect and introduce the large-scale Visual CoT dataset comprising 438k question-answer pairs, annotated with intermediate bounding boxes highlighting key regions essential for answering the questions.

They build a 438K-sample dataset, where each sample contains:

  1. question
  2. answer
  3. CoT bounding box

Of these, 98K samples additionally include detailed reasoning-step annotations.

For training samples that contain a CoT bbox, the prompt "Please provide the bounding box coordinate of the region that can help you answer the question better." is appended to the question during training; at the same time, the ground-truth region for that question is cropped out of the image, and its encoded features are fed into the model together with the encoded features of the full image (see the sketch below).
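
A rough sketch of how such a training sample could be assembled; the `encode_image` hook and the way the two feature streams are combined are assumptions based on the description above:

```python
from PIL import Image

COT_PROMPT = (
    " Please provide the bounding box coordinate of the region "
    "that can help you answer the question better."
)

def build_visual_cot_sample(image_path, question, gt_bbox, encode_image):
    """gt_bbox: (x1, y1, x2, y2) ground-truth key region in pixels.
    encode_image: maps a PIL image to a sequence of visual tokens
    (vision encoder followed by the projector)."""
    image = Image.open(image_path).convert("RGB")
    full_tokens = encode_image(image)                   # tokens for the whole image
    region_tokens = encode_image(image.crop(gt_bbox))   # tokens for the cropped key region
    # The two token streams are concatenated along the sequence dimension
    # and fed to the LLM together with the augmented question.
    return question + COT_PROMPT, (full_tokens, region_tokens)
```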

Training has two stages (a minimal freezing sketch follows the list):

  1. Train only the projector on image-caption data;
  2. Fine-tune the full model on the Visual CoT dataset.
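
A minimal PyTorch-style sketch of this freezing schedule; the module name `projector` is a generic placeholder:

```python
def set_stage(model, stage: int) -> None:
    if stage == 1:
        # Stage 1: only the projector learns the image-text alignment.
        for p in model.parameters():
            p.requires_grad = False
        for p in model.projector.parameters():
            p.requires_grad = True
    else:
        # Stage 2: fine-tune the full model on the Visual CoT data.
        for p in model.parameters():
            p.requires_grad = True
```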

Visual Encoder

Video-3D LLM

《Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding》

By treating 3D scenes as dynamic videos and incorporating 3D position encoding into these representations, our Video-3D LLM aligns video representations with real-world spatial contexts more accurately.

Each pixel of the depth map is converted from depth to global coordinates (using the camera intrinsics and extrinsics); these coordinates are then encoded as position embeddings and fused with the image features (a back-projection sketch follows).
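
A sketch of the per-pixel back-projection implied here (NumPy, pinhole model; variable names are mine): each depth value is lifted to camera coordinates with the intrinsics `K`, then mapped to world coordinates with the camera-to-world extrinsics `T_c2w`; the resulting 3D coordinates are what get encoded as position embeddings.

```python
import numpy as np

def depth_to_world(depth: np.ndarray, K: np.ndarray, T_c2w: np.ndarray) -> np.ndarray:
    """depth: (H, W) metric depth; K: (3, 3) intrinsics; T_c2w: (4, 4) camera-to-world.
    Returns (H, W, 3) global coordinates for every pixel."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))             # pixel grid
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T                            # normalized camera rays
    pts_cam = rays * depth[..., None]                          # scale rays by depth
    pts_hom = np.concatenate([pts_cam, np.ones((H, W, 1))], axis=-1)
    return (pts_hom @ T_c2w.T)[..., :3]                        # apply extrinsics
```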

Chat-Scene

《Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers》

we decompose the input 3D scene into a set of object proposals, each assigned a unique identifier token.

  1. Apply a 2D image detector (DINO) to the images to obtain a 2D embedding for each object, then pass it through the 2D projector.
  2. Apply a 3D point-cloud detector to the point cloud to obtain a 3D embedding for each object, then pass it through the 3D projector.
  3. Add object identifier tokens such as <OBJ032>, <OBJ034> to the vocabulary; the tokenizer converts them into embeddings.
  4. Combine the three object-level embeddings above into one complete embedding per object (a small sketch follows the example below), which fills the object slots in prompts like:
    System: A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user’s questions. The conversation centers around an indoor scene: [<OBJ001> <object> <OBJ002> <object> ... <OBJn> <object>]. 
    User: Find the closest trash bin to <OBJ013>. 
    Assistant: There are two trash bins, <OBJ023> and <OBJ032>, both located near the chair.
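
A hedged sketch of step 4: combining the identifier-token embedding with the projected 2D and 3D features. Whether the three are summed or kept as separate consecutive tokens is my assumption; the projector modules and detector features are placeholders.

```python
import torch

def build_object_embeddings(feat_2d, feat_3d, id_embeds, proj_2d, proj_3d):
    """feat_2d: (N, D2) per-object 2D features (DINO); feat_3d: (N, D3) per-object
    3D features (point-cloud detector); id_embeds: (N, D) embeddings of the added
    <OBJ00i> identifier tokens. Returns (N, D) combined embeddings that fill the
    '<OBJ00i> <object>' slots of the scene prompt."""
    return id_embeds + proj_2d(feat_2d) + proj_3d(feat_3d)
```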
    

Decoder

VisionLLM

《VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks》

we propose a new information transmission mechanism termed “super link”, as a medium to connect MLLM with task-specific decoders. It not only allows flexible transmission of task information and gradient feedback between the MLLM and multiple downstream decoders but also effectively resolves training conflicts in multi-tasking scenarios.

Training: first check whether the answer contains routing tokens such as [SEG] or [DET]; if so, insert the corresponding learnable task-specific embeddings right after them, and feed these embeddings into the LLM so they interact with the other tokens.

Implementation details (a routing sketch follows the example):

'According to the provided front_right_image:, front_image:, front_left_image:, back_right_image:, back_image:, back_left_image:, please identify lane markings in the scenes.'

'The lane markings are [DET][EMB][EMB2][EMB3][EMB4][EMB5][EMB6][EMB7][EMB8][EMB9][EMB10][EMB11][EMB12][EMB13][EMB14][EMB15][EMB16][EMB17][EMB18][EMB19][EMB20][EMB21][EMB22][EMB23][EMB24][EMB25][EMB26][EMB27][EMB28][EMB29][EMB30][EMB31][EMB32][EMB33][EMB34][EMB35][EMB36][EMB37][EMB38][EMB39][EMB40][EMB41][EMB42][EMB43][EMB44][EMB45][EMB46][EMB47][EMB48][EMB49][EMB50].'
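
A sketch of this routing mechanism (class and attribute names are mine, not the VisionLLM v2 code): when a [DET] routing token appears in the answer, the learnable [EMB1..N] query embeddings are spliced in right after it; their output hidden states from the LLM are then handed to the task-specific decoder, so gradients flow back through the LLM.

```python
import torch
import torch.nn as nn

class SuperLinkQueries(nn.Module):
    def __init__(self, hidden_dim: int, num_queries: int = 50):
        super().__init__()
        # Learnable task-specific embeddings ([EMB1]...[EMBn]) for the [DET] route.
        self.det_queries = nn.Parameter(torch.randn(num_queries, hidden_dim))

    def splice(self, answer_embeds: torch.Tensor, det_positions: list) -> torch.Tensor:
        """answer_embeds: (T, D) embeddings of the answer tokens;
        det_positions: indices of the [DET] tokens found in the answer."""
        chunks, prev = [], 0
        for pos in det_positions:
            chunks.append(answer_embeds[prev:pos + 1])  # up to and including [DET]
            chunks.append(self.det_queries)             # splice in the learnable queries
            prev = pos + 1
        chunks.append(answer_embeds[prev:])
        return torch.cat(chunks, dim=0)                 # sequence fed to the LLM
```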

Auxiliary Tasks

OmniDrive

Training pipeline:

1. 2D-Pretraining: pretrain the carrier queries and the Q-Former to align image features with the language model; both the image-text pair data and the instruction-tuning data come from LLaVA v1.5.

  1. Remove the perception queries and train the Q-Former on 2D image-text pairs;
  2. Freeze only the image encoder and fine-tune the model on the instruction-tuning data to improve instruction understanding and execution.
2. 3D-Finetuning: add 3D localization ability while preserving the 2D understanding (an optimizer-group sketch follows this list).

  1. Add a 3D temporal module to the Q-Former;
  2. Fine-tune the visual encoder and the large language model (via LoRA) with a small learning rate;
  3. Train the Q-Former3D with a large learning rate.
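
A minimal sketch of the stage-2 optimizer grouping implied by steps 2 and 3; the module names and the actual learning-rate values are placeholders, not the paper's settings:

```python
import torch

def build_stage2_optimizer(model, small_lr: float = 2e-5, large_lr: float = 2e-4):
    groups = [
        # Large learning rate for the newly added Q-Former3D.
        {"params": model.qformer3d.parameters(), "lr": large_lr},
        # Small learning rate for the vision encoder and the LLM's LoRA adapters.
        {"params": model.vision_encoder.parameters(), "lr": small_lr},
        {"params": [p for n, p in model.llm.named_parameters() if "lora" in n],
         "lr": small_lr},
    ]
    return torch.optim.AdamW(groups)
```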

Training data; the QA for each frame consists of:

1. VQA

  1. 1 scene-action question
  2. 1 "shortly describe action" question - action from keyword ???
  3. Several questions about the driving scene (generated from the images and the scene-action)
  4. 4 questions covering obstacles that affect driving, ego decisions, and counterfactual reasoning (generated by GPT given lane and obstacle coordinates, etc.; answered in VCS coordinates), visualized via BEV
2. Online VQA (answered in VCS coordinates; a toy template sketch follows this list):

  1. n=2: given a camera name and pixel coordinates, answer that obstacle's category, its position relative to the ego vehicle, its VCS coordinates, its length/width/height and heading, and its velocity (if its speed exceeds 0.2); each question asks about only one obstacle;
  2. n=2: given a VCS point, answer which obstacles lie within 10 m of that point, along with their 3D information (same attributes as in item 1);
  3. n=2: given the control points of a lane line's cubic Bézier curve, answer which obstacles are on that lane (same attributes as in item 1);
3. Ego trajectory: 1 ego-trajectory generation question without accompanying reasoning.
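
A toy sketch of how the first Online VQA template could be instantiated from ground-truth labels; all field names and the phrasing are assumptions for illustration only:

```python
def make_pixel_query(cam_name: str, u: int, v: int, obj: dict):
    question = (
        f"In {cam_name}, what is the object at pixel ({u}, {v})? Give its category, "
        "its position relative to the ego vehicle, its VCS coordinates, its size and "
        "heading, and its velocity."
    )
    answer = (
        f"It is a {obj['category']} located {obj['relation']} the ego vehicle at VCS "
        f"({obj['x']:.1f}, {obj['y']:.1f}, {obj['z']:.1f}), "
        f"size {obj['l']:.1f} x {obj['w']:.1f} x {obj['h']:.1f} m, "
        f"heading {obj['yaw']:.2f} rad"
    )
    # Velocity is only reported for moving obstacles (speed > 0.2).
    answer += f", moving at {obj['speed']:.1f} m/s." if obj["speed"] > 0.2 else "."
    return question, answer
```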

Loss

Regression-like Loss

《Regress, Don’t Guess – A Regression-like Loss on Number Tokens for Language Models》

Two losses are proposed:

1. The first is based on an Lp loss between the ground truth token value and the weighted sum of the predicted class probabilities.
2. The second loss minimizes the Wasserstein-1 distance between the distribution of the predicted output probabilities and the ground truth distribution.

The number loss is used as an auxiliary loss and applied only to number tokens; experiments show that the Wasserstein-1 variant works better. A compact sketch of both losses follows.
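
A compact sketch of the two losses, applied only at number-token positions; restricting the softmax to the number-token logits and the exact reductions are my assumptions:

```python
import torch

def number_token_losses(logits, target_ids, number_token_ids, number_values):
    """logits: (T, V) LM logits; target_ids: (T,) ground-truth token ids;
    number_token_ids: (K,) vocab ids of the number tokens;
    number_values: (K,) their numeric values, sorted ascending.
    Both auxiliary losses are computed only where the target is a number token."""
    number_values = number_values.to(logits)
    is_num = torch.isin(target_ids, number_token_ids)
    if not is_num.any():
        return logits.new_zeros(()), logits.new_zeros(())
    # Predicted distribution restricted to the number tokens.
    probs = logits[is_num][:, number_token_ids].softmax(dim=-1)                   # (M, K)
    gt_onehot = (target_ids[is_num].unsqueeze(1) == number_token_ids).to(probs)   # (M, K)
    gt_vals = gt_onehot @ number_values                                           # (M,)
    # Loss 1: squared error between the probability-weighted value and the true value.
    loss_lp = ((probs @ number_values - gt_vals) ** 2).mean()
    # Loss 2: Wasserstein-1 = |CDF difference| weighted by the spacing between values.
    cdf_diff = torch.cumsum(probs - gt_onehot, dim=-1)
    gaps = number_values[1:] - number_values[:-1]
    loss_was = (cdf_diff[:, :-1].abs() * gaps).sum(dim=-1).mean()
    return loss_lp, loss_was
```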
