Real-time Scalable Dense Surfel Mapping: Paper Reading Notes
English title | Real-time Scalable Dense Surfel Mapping |
---|---|
Chinese title | 实时可扩展密集表面建图 |
Published | September 10, 2019 |
Venue | ICRA 2019 |
Authors | Kaixuan Wang, Fei Gao and Shaojie Shen |
Email | {kwangap, fgaoaa, eeshaojie}@ust.hk |
Source | HKUST Aerial Robotics Group |
Keywords | real-time dense mapping |
paper && code && video | paper code video |
Abstract: In this paper, we propose a novel dense surfel mapping system that scales well in different environments with only CPU computation. Using a sparse SLAM system to estimate camera poses, the proposed mapping system can fuse intensity images and depth images into a globally consistent model. The system is carefully designed so that it can build from room-scale environments to urban-scale environments using depth images from RGB-D cameras, stereo cameras or even a monocular camera. First, superpixels extracted from both intensity and depth images are used to model surfels in the system. Superpixel-based surfels make our method both runtime efficient and memory efficient. Second, surfels are further organized according to the pose graph of the SLAM system to achieve \(O\left( 1\right)\) fusion time regardless of the scale of reconstructed models. Third, a fast map deformation using the optimized pose graph enables the map to achieve global consistency in real-time. The proposed surfel mapping system is compared with other state-of-the-art methods on synthetic datasets. The performances of urban-scale and room-scale reconstruction are demonstrated using the KITTI dataset [1] and autonomous aggressive flights, respectively. The code is available for the benefit of the community \({}^{1}\).
I. INTRODUCTION
Estimating the surrounding 3D environment is one of the fundamental abilities for robots to navigate safely or operate high-level tasks. To be usable in mobile robot applications, the mapping system needs to fulfill the following four requirements. First, the 3D reconstruction has to densely cover the environment in order to provide sufficient information for navigation. Second, the mapping system should have good scalability and efficiency so that it can be deployed in different environments using limited onboard computation resources. From room-scale (several meters) to urban-scale (several kilometers) environments, the mapping system should maintain both run-time efficiency and memory efficiency. Third, global consistency is required in the mapping systems. If loops are detected, the system should be able to deform the map in real-time to maintain consistency between different visits. Fourth, to be usable in different robot applications, the system should be able to fuse depth maps of different qualities from RGB-D cameras, stereo cameras or even monocular cameras.
In recent years, many methods have been proposed to reconstruct the environment using RGB-D cameras focusing on several requirements mentioned above. KinectFusion [3] is a pioneering work that uses the truncated signed distance field (TSDF) [4] to represent 3D environments. Many following works improve the scalability (e.g. Kintinuous [5]), the efficiency (e.g. CHISEL [6]), and the global consistency (e.g. BundleFusion [7]) of TSDF-based methods. Surfel-based methods model the environment as a collection of surfels. For example, ElasticFusion [8] uses surfels to reconstruct the scene and achieves global consistency. Although all these methods achieve impressive results using RGB-D cameras, extending them to fulfill all four requirements and to be usable in different robot applications is non-trivial.
Fig. 1. Our proposed dense mapping method can fuse low-quality depth maps to reconstruct large-scale globally-consistent environments in real-time without GPU acceleration. The top row shows the reconstruction of KITTI odometry 00 using stereo cameras and the detail of a looped corner. The bottom row shows the reconstruction using only one monocular camera with depth prediction [2].
In this paper, we propose a mapping method that fulfills all four requirements and can be applied to a range of mobile robotic systems. Our system uses state-of-the-art sparse visual SLAM systems to track camera poses and fuses intensity images and depth images into a globally consistent model. Unlike ElasticFusion [8], which treats each pixel as a surfel, we use superpixels to represent surfels. Pixels are clustered into superpixels if they share similar intensity, depth, and spatial locations. Modeling superpixels as surfels greatly reduces the memory requirement of our system and enables the system to fuse noisy depth maps from stereo cameras or a monocular camera. Surfels are organized according to the keyframes they are last observed in. Using the pose graph of the SLAM system, we further find locally consistent keyframes and surfels, i.e. those whose relative drift with respect to each other is negligible. Only locally consistent surfels are fused with input images, achieving \(O\left( 1\right)\) fusion time and local accuracy. Global consistency is achieved by deforming surfels according to the optimized pose graph. Thanks to the careful design, our system can be used to reconstruct globally consistent urban-scale environments in real-time without GPU acceleration.
In summary, the main contributions of our mapping method are the following.
- We use superpixels extracted from both intensity and depth images to model surfels in the system. Superpixels enable our method to fuse low-quality depth maps. Run-time efficiency and memory efficiency are also gained by using superpixel-based surfels.
- We further organize surfels according to the pose graph of the sparse SLAM system. Using this organization, locally consistent maps are extracted for fusion, and the fusion time remains \(O\left( 1\right)\) regardless of the reconstruction scale. Fast map deformation is also proposed based on the optimized pose graph so that the system can achieve global consistency in real-time.
- We implement the proposed dense mapping system using only CPU computation. We evaluate the method using public datasets and demonstrate its usability using autonomous aggressive flights. To the best of our knowledge, the proposed method is the first online depth fusion approach that achieves global consistency at urban scale using only CPU computation.
All authors are with the Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Hong Kong, SAR China. {kwangap, fgaoaa, eeshaojie}@ust.hk
\({}^{1}\) https://github.com/HKUST-Aerial-Robotics/DenseSurfelMapping
II. RELATED WORK
Most online dense reconstruction methods take depth maps from RGB-D cameras as input. In this section, we introduce different methods to extend the scalability, global consistency and run-time efficiency of these mapping systems.
Kintinuous [5] extends the scalability of mapping systems by using a cyclical buffer. The TSDF volume is virtually transformed according to the movement of the camera. Voxel hashing proposed by Nießner et al. [9] is another solution to improve the scalability. Due to the sparsity of the surfaces in the space, only valid voxels are stored using hashing functions. DynSLAM [10] reconstructs urban-scale models using hashed voxels and a high-end GPU for acceleration. Surfel-based methods are relatively scalable compared with voxel-based methods because only surfaces are stored in the system. Without explicit data optimization, ElasticFusion [8] can build room-scale environments in detail. Fu et al. [11] further increase the scalability of surfel-based methods by maintaining a local surfel set.
To remove the drift from camera tracking and maintain global consistency, mapping systems should be able to deform the model quickly when loops are detected. Whelan et al. [12] improved Kintinuous [5] with point cloud deformation. A deformation graph is constructed incrementally as the camera moves. When loops are detected, the deformation graph is optimized and applied to the point clouds. Surfel-based methods usually deform the map using similar methods. BundleFusion [7] introduces another solution to achieve global consistency using de-integration and reintegration of RGB-D frames. When the camera poses are updated due to the pose graph optimization, RGB-D frames are first de-integrated from the TSDF volume and then reintegrated using the updated camera poses. Submaps are used by many TSDF-based methods, such as InfiniTAM [13], to generate globally consistent results. These methods divide the space into multiple low-drift submaps and merge them into a global model using updated poses.
Different methods have been proposed to accelerate the fusion process. Steinbrücker et al. [14] use an octree as the data structure to represent the environment. Voxblox [15] is designed for planning, so both the TSDF and the Euclidean signed distance field are calculated. It proposes grouped raycasting to speed up the integration, and a novel weighting strategy to deal with the distortion caused by large voxel sizes. FlashFusion [16] uses valid chunk selection to speed up the fusion step and achieves global consistency based on the reintegration method. Most of the surfel-based mapping systems require GPUs to render index maps for data association. MREMap [17] defines octree-organized voxels as surfels so that it does not need GPUs. However, the reconstructed model of MREMap consists of voxels instead of meshes.
III. SYSTEM OVERVIEW
Fig. 2. The system architecture of the proposed method. Sensors are determined by the robot application. Our system can fuse depth maps from active RGB-D cameras, stereo cameras, or monocular cameras. The localization system is used to track the camera pose, detect loops, and provide optimized pose graph of keyframes. Each surfel in the system is attached to one keyframe, and all the surfels are stored in the map database.
The system architecture is shown in Fig. 2. Our system fuses intensity and depth image pairs into a globally consistent model. We use a state-of-the-art sparse visual SLAM system (e.g. ORB-SLAM2 [18] or VINS-MONO [19]) as the localization system to track the motion of the camera, detect loop closures, and optimize the pose graph. The keys to our mapping system are (1) superpixel-based surfels, (2) pose graph-based surfel fusion, and (3) fast map deformation. For each intensity and depth image input, the localization system generates camera tracking results and provides an updated pose graph. If the pose graph is optimized, our system first deforms all the surfels in the map database to ensure global consistency. After the deformation, the mapping system initializes surfels based on the extracted superpixels from the intensity and depth images. Then, local surfels are extracted from the map database according to the pose graph and fused with the initialized surfels. Finally, both the fused surfels and newly observed surfels are added back into the map database. Fig. 3 illustrates the pipeline of the system to process two frames when loops are detected.
Fig. 3. Use KITTI odometry 05 as an example to show map deformation and the reuse of previous maps when the camera revisits a street. Surfels are visualized as point clouds for simplicity, and local maps are highlighted in color. During the revisit, a loop is detected on Frame 1327. (a): local map extracted to fuse Frame 1326. (b): the result of Frame 1326. As highlighted in red circles in (a) and (b), the map is misaligned due to the drift before the loop closure. (c): A loop is detected by Frame 1327 and the map is deformed accordingly to remove the drift. Due to the updated pose graph, more locally consistent poses and surfels can be found. (d): Local map extracted to fuse Frame 1327. As shown, previous maps are reused to fuse the current frame. (e) is the result after the fusion of Frame 1327.
IV. Surfel Mapping System
A. Notation
Surfels are used to represent the environment. Each surfel \(S = {\left\lbrack {S}_{\mathbf{p}},{S}_{\mathbf{n}},{S}_{c},{S}_{w},{S}_{r},{S}_{t},{S}_{i}\right\rbrack }^{T}\) has the following attributes: position \({S}_{\mathbf{p}} \in {\mathbb{R}}^{3}\), normal \({S}_{\mathbf{n}} \in {\mathbb{R}}^{3}\), intensity \({S}_{c} \in \mathbb{R}\), weight \({S}_{w} \in {\mathbb{R}}^{ + }\), radius \({S}_{r} \in {\mathbb{R}}^{ + }\), update times \({S}_{t} \in \mathbb{N}\), and the index of the attached keyframe \({S}_{i} \in \mathbb{N}\). Update times \({S}_{t}\) is used to detect temporary outliers or dynamic objects, and \({S}_{i}\) indicates the last keyframe the surfel is observed in.
Inputs of our system are intensity images, depth images, the ego-motion of the camera, and the pose graph from the SLAM system. The \(i\) -th intensity image is \({I}_{i} : \Omega \subset {\mathbb{R}}^{2} \mapsto \mathbb{R}\) and the \(i\) -th depth image is \({D}_{i} : \Omega \subset {\mathbb{R}}^{2} \mapsto \mathbb{R}\) . A 3D point \(\mathbf{p} = {\left\lbrack x, y, z\right\rbrack }^{\mathrm{T}}\) in the camera frame can be projected into the image as a pixel \(\mathbf{u} \mathrel{\text{:=}} {\left\lbrack u, v\right\rbrack }^{\mathrm{T}} \in \Omega\) using the camera projection function: \(\mathbf{u} = \pi \left( \mathbf{p}\right)\) . A pixel can be back-projected into the camera frame as a point: \(\mathbf{p} = {\pi }^{-1}\left( {\mathbf{u}, d}\right)\) where \(d\) is the depth of the pixel.
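The projection \(\pi\) and back-projection \({\pi}^{-1}\) above can be sketched as follows. This is a minimal illustration that assumes an undistorted pinhole camera; the parameter names `fx`, `fy`, `cx`, `cy` are illustrative and not taken from the paper.

```python
def project(p, fx, fy, cx, cy):
    """pi(p): map a 3D point in the camera frame to a pixel (assumed pinhole model)."""
    x, y, z = p
    return (fx * x / z + cx, fy * y / z + cy)


def back_project(u, d, fx, fy, cx, cy):
    """pi^{-1}(u, d): lift a pixel with depth d back to a 3D point in the camera frame."""
    px, py = u
    return ((px - cx) * d / fx, (py - cy) * d / fy, d)


# Round trip: back-projecting the projected pixel with the same depth recovers the point.
point = (0.5, -0.2, 2.0)
pixel = project(point, 500.0, 500.0, 320.0, 240.0)
recovered = back_project(pixel, point[2], 500.0, 500.0, 320.0, 240.0)
```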
B. Localization System and Pose Graph
We use a sparse visual SLAM method as the localization system to track the camera motion and optimize the pose graph when there are loop closures. For each frame, the localization system estimates the camera pose \({\mathbf{T}}_{w, i} \in \mathbb{SE}\left( 3\right)\) and gives out the reference keyframe \({F}_{\text{ref}}\) that shares the most features with \({I}_{i}\). \({\mathbf{T}}_{w, i}\) includes a rotation matrix \({\mathbf{R}}_{w, i} \in \mathbb{SO}\left( 3\right)\) and a translation vector \({\mathbf{t}}_{w, i} \in {\mathbb{R}}^{3}\). Using \({\mathbf{T}}_{w, i}\), a point \({\mathbf{p}}_{c}\) in the camera frame of \({I}_{i}\) can be transformed into the global frame: \({\mathbf{p}}_{w} = {\mathbf{R}}_{w, i}{\mathbf{p}}_{c} + {\mathbf{t}}_{w, i}\). A vector \({\mathbf{n}}_{c}\) in the camera frame (such as a surfel normal) can be transformed into the global frame: \({\mathbf{n}}_{w} = {\mathbf{R}}_{w, i}{\mathbf{n}}_{c}\). Similarly, \({\mathbf{p}}_{w}\) and \({\mathbf{n}}_{w}\) can be transformed back into the camera frame of \({I}_{i}\) using \({\mathbf{T}}_{i, w} = {\mathbf{T}}_{w, i}^{-1}\).
The pose graph used in our system is an undirected graph similar to the covisibility graph in ORB-SLAM2. Vertices in the graph are the keyframes maintained in the SLAM system, and edges indicate keyframes share common features. Since the relative poses of frames are constrained by common features in the sparse SLAM systems by bundle adjustments, we assume keyframes are locally consistent if the minimum number of edges between each other is less than \({G}_{\delta }\) .
C. Fast Map Deformation
If the pose graph of the localization system is updated, our method deforms all the surfels to keep the global consistency before the surfel initialization and fusion. Unlike previous methods that use a deformation graph embedded in the global map, we deform the surfels so that the relative pose between each surfel and its attached keyframe remains unchanged. Although surfels that are attached to the same keyframe are deformed rigidly, the overall deformation of the map is nonrigid.
For a surfel \(S\) that is attached to keyframe \(F\), the position and normal of the surfel are transformed using \({\mathbf{T}}_{w,\widehat{F}}{\mathbf{T}}_{w, F}^{-1}\), where \({\mathbf{T}}_{w, F}\) and \({\mathbf{T}}_{w,\widehat{F}}\) are the poses of keyframe \(F\) before and after the optimization, respectively. After the deformation, the stored pose \({\mathbf{T}}_{w, F}\) is replaced by the optimized pose so that it can serve as the reference for the next deformation.
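The deformation of a single surfel can be sketched as below. This is an illustrative sketch, not the authors' code: poses are assumed to be stored as a rotation matrix (nested lists) plus a translation tuple, and the helper names are hypothetical. Applying \({\mathbf{T}}_{w,\widehat{F}}{\mathbf{T}}_{w, F}^{-1}\) amounts to expressing the surfel in the keyframe's frame with the old pose and mapping it back with the new one.

```python
def mat_vec(R, v):
    """Multiply a 3x3 matrix (nested lists) by a 3-vector."""
    return tuple(sum(R[i][j] * v[j] for j in range(3)) for i in range(3))


def transpose(R):
    return [[R[j][i] for j in range(3)] for i in range(3)]


def deform_surfel(p, n, old_pose, new_pose):
    """Apply T_new * T_old^{-1} to a surfel position p and normal n (global frame)."""
    R_old, t_old = old_pose
    R_new, t_new = new_pose
    # Express the surfel in the keyframe's own frame using the pre-optimization pose.
    p_local = mat_vec(transpose(R_old), tuple(p[i] - t_old[i] for i in range(3)))
    n_local = mat_vec(transpose(R_old), n)
    # Map it back to the global frame with the optimized pose.
    p_new = tuple(a + b for a, b in zip(mat_vec(R_new, p_local), t_new))
    n_new = mat_vec(R_new, n_local)  # normals are rotated only, never translated
    return p_new, n_new


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
rot_z_90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
p_new, n_new = deform_surfel((1.0, 0.0, 0.0), (1.0, 0.0, 0.0),
                             (identity, (0.0, 0.0, 0.0)),
                             (rot_z_90, (1.0, 0.0, 0.0)))
```

Because every surfel attached to the same keyframe receives the same transform, the relative pose between a surfel and its keyframe is preserved, while surfels attached to different keyframes can move differently.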
D. Superpixel Extraction
Unlike other surfel-based methods that model per-pixel surfels, we extract surfels based on extracted superpixels from intensity and depth images. Using superpixels greatly reduces the memory burden of our system when applied to large-scale missions. More importantly, outliers and noises from low-quality depth maps can be reduced based on extracted superpixels. This novel representation enables us to reconstruct the environment using stereo-cameras, or even monocular cameras.
Superpixels are extracted by a \(k\) -means approach adapted from SLIC [20]. The original SLIC operates on RGB images and we extend it to segment both intensity and depth images. Pixels are clustered according to their intensity, depth and spatial location by firstly initializing the cluster centers and then alternating between the assignment step and the update step. A major improvement compared with SLIC is that our superpixel segmentation operates on images where not all pixels have valid depth measurements.
The cluster center \({C}_{i} = {\left\lbrack {x}_{i},{y}_{i},{d}_{i},{c}_{i},{r}_{i}\right\rbrack }^{T}\) is initialized on a regular grid on the image. \({\left\lbrack {x}_{i},{y}_{i}\right\rbrack }^{T}\) is the average location of the clustered pixels, \({d}_{i}\) is the average depth, \({c}_{i}\) is the average intensity value, and \({r}_{i}\) is the radius of the superpixel, defined as the largest distance from an assigned pixel to \({\left\lbrack {x}_{i},{y}_{i}\right\rbrack }^{T}\). \({\left\lbrack {x}_{i},{y}_{i}\right\rbrack }^{T}\) is initialized as the location of the center. \({d}_{i}\) and \({c}_{i}\) are initialized as the depth and intensity value of pixel \({\left\lbrack {x}_{i},{y}_{i}\right\rbrack }^{T}\). For cluster centers that are initialized on pixels with no valid depth estimate, the depth \({d}_{i}\) is initialized as \(\mathrm{NaN}\).
In the assignment step, the per-cluster scan from SLIC is replaced by a per-pixel update so that invalid depth can be handled while the complexity remains unchanged. We define two distances between a pixel \(\mathbf{u}\) and a candidate cluster center \({C}_{i}\) as
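One form of these two distances that agrees with the symbol definitions in the next paragraph is sketched below. The use of inverse depth in the depth term is an assumption (it is suggested by the small value of \(N_d\) quoted in Section V, but a plain depth difference would fit the definitions equally well):

```latex
\begin{align}
D   &= \frac{(x_i - \mathbf{u}_x)^2 + (y_i - \mathbf{u}_y)^2}{N_s^2}
     + \frac{(c_i - \mathbf{u}_i)^2}{N_c^2}, \tag{1} \\
D_d &= D + \frac{\left(1/d_i - 1/\mathbf{u}_d\right)^2}{N_d^2}. \tag{2}
\end{align}
```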
where \({D}_{d}\) and \(D\) are the distances with and without depth information, respectively. \(\left\lbrack {{\mathbf{u}}_{x},{\mathbf{u}}_{y}}\right\rbrack ,{\mathbf{u}}_{d}\) and \({\mathbf{u}}_{i}\) are the location, depth and intensity of pixel \(\mathbf{u}\) , respectively. \({N}_{s}^{2},{N}_{c}^{2}\) and \({N}_{d}^{2}\) are used to normalize the distance, color and depth proximity, respectively, before the summation. Each pixel scans the four neighbor candidate cluster centers. If pixel \(\mathbf{u}\) and all the centers have valid depth values, then the assignment is done by comparing \({D}_{d}\) . Otherwise, \(D\) is used for the assignment.
Once all pixels have been assigned, the cluster centers are updated. \({x}_{i}\), \({y}_{i}\), and \({c}_{i}\) are updated to the averages over all the assigned pixels. The mean depth \({d}_{i}\), on the other hand, is updated by minimizing a Huber loss \(L_{\delta}\) with radius \(\delta\):
\[ E_d = \sum_{\mathbf{u}} L_{\delta}\left( \mathbf{u}_d - d_i \right), \tag{3} \]
where the sum runs over the assigned pixels \(\mathbf{u}\) that have a valid depth value and \({\mathbf{u}}_{d}\) is the depth of \(\mathbf{u}\). \({d}_{i}\) can be estimated by Gauss-Newton iterations. This outlier-robust mean depth not only enables the system to process low-quality depth maps but also preserves depth discontinuities.
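A robust mean depth of this kind can be sketched with iteratively reweighted Gauss-Newton steps. This is a minimal sketch under stated assumptions: the function name, the mean initialization, and the fixed iteration count are illustrative choices, not details from the paper.

```python
def huber_mean(depths, delta, iters=20):
    """Estimate d minimizing sum_u L_delta(u_d - d) over valid depths."""
    d = sum(depths) / len(depths)  # assumed initialization: plain mean
    for _ in range(iters):
        num = den = 0.0
        for u_d in depths:
            r = u_d - d
            # Huber weight: 1 inside the radius, delta/|r| outside it.
            w = 1.0 if abs(r) <= delta else delta / abs(r)
            num += w * u_d
            den += w
        d = num / den  # weighted Gauss-Newton update for a scalar unknown
    return d


# Five consistent depths plus one gross outlier (e.g. a pixel across a depth edge).
samples = [2.0, 2.02, 1.98, 2.01, 1.99, 10.0]
robust = huber_mean(samples, delta=0.1)
plain = sum(samples) / len(samples)
```

In this example the plain mean is pulled above 3 m by the single outlier, while the Huber estimate stays close to the 2 m surface.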
E. Surfel Initialization
For a superpixel cluster center \({C}_{i} = {\left\lbrack {x}_{i},{y}_{i},{d}_{i},{c}_{i},{r}_{i}\right\rbrack }^{T}\) that has enough assigned pixels, we initialize one surfel \(S = {\left\lbrack {S}_{\mathbf{p}},{S}_{\mathbf{n}},{S}_{c},{S}_{w},{S}_{r},{S}_{t},{S}_{i}\right\rbrack }^{T}\) in an outlier-robust way. The intensity \({S}_{c}\) is initialized as the mean intensity of the cluster \({c}_{i}\). \({S}_{i}\) is initialized as the index of the reference keyframe \({F}_{\text{ref}}\) given by the sparse SLAM system. \({S}_{t}\) is initialized as 0, meaning that the surfel has not yet been fused with other frames.
The position \({S}_{\mathbf{p}}\) and normal \({S}_{\mathbf{n}}\) are initialized using the information from all pixels of the superpixel. \({S}_{\mathbf{n}}\) is initialized as the average normal of these pixels and then fine-tuned by minimizing a fitting error defined as:
\[ E_S = \sum_{\mathbf{u}} L_{\delta}\left( S_{\mathbf{n}} \cdot \left( \mathbf{p}_{\mathbf{u}} - \overline{\mathbf{p}} \right) + b \right), \tag{4} \]
where \({\mathbf{p}}_{\mathbf{u}} = {\pi }^{-1}\left( {\mathbf{u},{\mathbf{u}}_{d}}\right)\), \(\overline{\mathbf{p}}\) is the mean of the 3D points \({\mathbf{p}}_{\mathbf{u}}\), and \(b\) estimates the bias. \({S}_{\mathbf{p}}\) is defined as the point on the surfel that is observed by the camera as pixel \({\left\lbrack {x}_{i},{y}_{i}\right\rbrack }^{T}\):
\[ S_{\mathbf{n}} \cdot \left( S_{\mathbf{p}} - \overline{\mathbf{p}} \right) + b = 0, \quad \pi\left( S_{\mathbf{p}} \right) = \left[ x_i, y_i \right]^T, \tag{5} \]
and can be solved in closed form as:
\[ S_{\mathbf{p}} = \frac{S_{\mathbf{n}} \cdot \overline{\mathbf{p}} - b}{S_{\mathbf{n}} \cdot K^{-1}\left[ x_i, y_i, 1 \right]^T} \, K^{-1}\left[ x_i, y_i, 1 \right]^T, \tag{6} \]
where \(K\) is the camera intrinsic matrix.
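The closed-form position can be sketched as a ray-plane intersection. This sketch assumes the fitted plane has the form \(S_{\mathbf{n}} \cdot (\mathbf{p} - \overline{\mathbf{p}}) + b = 0\) and a pinhole intrinsic matrix \(K\) with entries `fx`, `fy`, `cx`, `cy`; the function and parameter names are hypothetical.

```python
def surfel_position(n, p_mean, b, pixel, fx, fy, cx, cy):
    """Intersect the viewing ray of `pixel` with the plane n . (p - p_mean) + b = 0."""
    x, y = pixel
    ray = ((x - cx) / fx, (y - cy) / fy, 1.0)  # K^{-1} [x, y, 1]^T for a pinhole K

    def dot(a, c):
        return sum(i * j for i, j in zip(a, c))

    scale = (dot(n, p_mean) - b) / dot(n, ray)  # depth along the ray
    return tuple(scale * r for r in ray)


# A fronto-parallel plane at z = 2 m, observed at a pixel right of the principal point.
s_p = surfel_position(n=(0.0, 0.0, 1.0), p_mean=(0.1, 0.2, 2.0), b=0.0,
                      pixel=(570.0, 240.0), fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```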
The surfel radius \({S}_{r}\) is initialized so that the projection of it can cover the extracted superpixel in the input intensity image:
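For intuition, the coverage condition can be worked out in the simplest case. Assuming a surfel that faces the camera head-on (an assumption made here for illustration; a slanted surfel would need a larger radius), a disc of radius \(S_r\) at depth \(S_{\mathbf{p}}(z)\) projects to roughly \(f S_r / S_{\mathbf{p}}(z)\) pixels, so covering a superpixel of radius \(r_i\) requires:

```latex
\[
\frac{f \, S_r}{S_{\mathbf{p}}(z)} \ge r_i
\quad\Longrightarrow\quad
S_r = \frac{S_{\mathbf{p}}(z) \, r_i}{f}
\quad \text{(fronto-parallel case)}.
\]
```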
where \({S}_{\mathbf{p}}\left( z\right)\) is the depth of the surfel, and \(f\) is the camera focal length.
Most depth estimation methods, such as stereo matching or active stereo (e.g. Ultrastereo [21]), work by first estimating the pixel disparity \({d}_{dis}\) and then converting it into a depth value \(d = {bf}/{d}_{dis}\), where \(b\) is the baseline of the sensors. Assuming the variance of the disparity estimate is \({\sigma }^{2}\), first-order propagation gives a depth standard deviation of \(d^{2}\sigma /\left( bf\right)\), and \({S}_{w}\) is initialized as the inverse variance of the estimated surfel depth:
\[ S_w = \frac{b^2 f^2}{S_{\mathbf{p}}(z)^4 \, \sigma^2}. \tag{8} \]
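The weight initialization can be sketched numerically as below. The sketch assumes first-order propagation of the disparity error through \(d = bf/d_{dis}\); the function name and the example baseline and focal length are illustrative values, not parameters from the paper.

```python
def initial_weight(z, baseline, focal, sigma):
    """Inverse variance of depth z when disparity has standard deviation sigma."""
    sigma_z = z * z / (baseline * focal) * sigma  # first-order depth uncertainty
    return 1.0 / (sigma_z * sigma_z)


# Hypothetical stereo rig: 0.5 m baseline, 700 px focal length, 1 px disparity noise.
w_near = initial_weight(10.0, 0.5, 700.0, 1.0)
w_far = initial_weight(20.0, 0.5, 700.0, 1.0)
```

Since the depth uncertainty grows with the square of the depth, doubling the distance reduces the weight by a factor of 16, so distant surfels contribute little when fused with closer observations.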
F. Local Map Extraction
Reconstructing large-scale environments may generate millions of surfels. However, only a subset of surfels are extracted based on the pose graph to fuse with initialized surfels due to the following reasons. Firstly, the local map fusion ensures \(O\left( 1\right)\) update time regardless of the reconstruction scale, and secondly, due to the accumulated tracking error of the sparse SLAM system, fusing surfels that have large drift ruins the system so that it cannot achieve global consistency even if loops are detected afterward.
Here, we introduce a novel approach that uses the pose graph from the localization system to identify local maps. With the assumption in Section IV-B that keyframes with the number of minimum edges to the current keyframe \({F}_{\text{ref }}\) below \({G}_{\delta }\) are locally consistent, we extract surfels attached to these keyframes as the local map. Locally consistent keyframes can be found by a breadth-first search on the pose graph. When loops are detected and edges between these keyframes are added, previous surfels can be reused so that the map growth is reduced. As shown in (d) of Fig. 3, previous maps are reused due to the loop closure.
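The breadth-first search for locally consistent keyframes can be sketched as follows. The adjacency-dictionary representation of the pose graph and the function name are assumptions made for illustration.

```python
from collections import deque


def local_keyframes(pose_graph, ref, g_delta):
    """Return keyframes whose minimum edge count to `ref` is below g_delta."""
    dist = {ref: 0}
    queue = deque([ref])
    while queue:
        k = queue.popleft()
        if dist[k] + 1 >= g_delta:
            continue  # neighbors of k would be too far away to count as local
        for nb in pose_graph.get(k, ()):
            if nb not in dist:
                dist[nb] = dist[k] + 1
                queue.append(nb)
    return set(dist)


# A chain of five keyframes; the camera is currently at keyframe 4.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
before_loop = local_keyframes(chain, ref=4, g_delta=3)

# A loop closure adds an edge between keyframes 0 and 4.
looped = {0: [1, 4], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 0]}
after_loop = local_keyframes(looped, ref=4, g_delta=3)
```

Before the loop closure only keyframes 2, 3 and 4 are local. Once the loop edge is added, keyframes 0 and 1 fall within the search depth, so their surfels can be reused for fusion. The search cost is bounded by the number of keyframes within \({G}_{\delta}\) edges, not by the total map size.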
G. Surfel Fusion
In this section, the local surfels extracted in Section IV-F are fused with the surfels newly initialized in Section IV-E. Given the current camera pose estimate \(\mathbf{T}_{w,c}\), the positions and normals of local surfels are first transformed into the current camera frame using \(\mathbf{T}_{w,c}^{-1}\). Each local surfel \(S^{l}\) is then projected into the input frame as a pixel: \(\mathbf{u} = \pi\left(S_{\mathbf{p}}^{l}\right)\). If a surfel \(S^{n}\) is initialized based on the superpixel containing \(\mathbf{u}\), the two are considered to correspond if they have similar depths and normals: \(\left|S_{\mathbf{p}}^{n}(z) - S_{\mathbf{p}}^{l}(z)\right| < S_{\mathbf{p}}^{l}(z)^{2}/(bf) \cdot 2\sigma\) and \(S_{\mathbf{n}}^{n} \cdot S_{\mathbf{n}}^{l} > 0.8\). \(S^{l}\) is then fused with the corresponding surfel \(S^{n}\).
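The correspondence test can be illustrated with a short sketch. Several things here are assumptions rather than details taken from the text: the pinhole intrinsics `fx, fy, cx, cy`, the reading of \(b\) and \(f\) as stereo baseline and focal length, the dict layout of a surfel, and in particular the weighted-average update in `fuse`, which is a common choice for surfel fusion and stands in for the paper's own update rule.

```python
import numpy as np


def project(p_cam, fx, fy, cx, cy):
    """Pinhole projection pi(.) of a camera-frame point to integer pixel coordinates."""
    x, y, z = p_cam
    return int(round(fx * x / z + cx)), int(round(fy * y / z + cy))


def corresponds(local, new, b, f, sigma):
    """Depth and normal similarity test from the text (thresholds as stated there)."""
    z_l = local["p"][2]
    z_n = new["p"][2]
    depth_ok = abs(z_n - z_l) < z_l ** 2 / (b * f) * 2.0 * sigma
    normal_ok = np.dot(new["n"], local["n"]) > 0.8
    return bool(depth_ok and normal_ok)


def fuse(local, new):
    """Assumed weighted-average fusion; not necessarily the paper's update rule."""
    w = local["w"] + new["w"]
    p = (local["w"] * local["p"] + new["w"] * new["p"]) / w
    n = local["w"] * local["n"] + new["w"] * new["n"]
    n = n / np.linalg.norm(n)
    return {"p": p, "n": n, "w": w, "updates": local["updates"] + 1}


# Hypothetical values: baseline 0.1 m, focal length 500 px, disparity error 1 px.
b, f, sigma = 0.1, 500.0, 1.0
local = {"p": np.array([0.0, 0.0, 2.0]), "n": np.array([0.0, 0.0, -1.0]),
         "w": 1.0, "updates": 3}
new = {"p": np.array([0.0, 0.0, 2.02]), "n": np.array([0.0, 0.0, -1.0]), "w": 1.0}
far = {"p": np.array([0.0, 0.0, 3.0]), "n": np.array([0.0, 0.0, -1.0]), "w": 1.0}

pixel = project(local["p"], 500.0, 500.0, 320.0, 240.0)
if corresponds(local, new, b, f, sigma):
    local = fuse(local, new)
far_matches = corresponds(local, far, b, f, sigma)
```

The depth threshold grows quadratically with depth, which mirrors how a fixed disparity error translates into depth uncertainty for a stereo-like sensor.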
After the fusion, all local surfels are transformed into the global frame using \(\mathbf{T}_{w,c}\) and moved into the global map. Surfels that are initialized in this frame but have not been fused with the local map are also transformed and added to the global map. To handle outliers, surfels with \(\left|S_{i} - F_{ref}\right| > 10\) that have been updated fewer than 5 times are removed.
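The outlier rule can be written as a simple filter. This sketch assumes that \(S_{i}\) is the index of the keyframe a surfel is attached to and that \(F_{ref}\) is the index of the current reference keyframe; the field names `keyframe_index` and `updates` are hypothetical.

```python
def prune_outliers(surfels, ref_index, max_index_gap=10, min_updates=5):
    """Drop surfels that are far from the reference keyframe and rarely updated."""
    kept = []
    for s in surfels:
        stale = abs(s["keyframe_index"] - ref_index) > max_index_gap
        unstable = s["updates"] < min_updates
        if not (stale and unstable):
            kept.append(s)
    return kept


surfels = [
    {"id": "a", "keyframe_index": 2, "updates": 1},   # stale and unstable: removed
    {"id": "b", "keyframe_index": 2, "updates": 8},   # stale but well observed: kept
    {"id": "c", "keyframe_index": 18, "updates": 1},  # recent, may still be updated: kept
]
kept_ids = [s["id"] for s in prune_outliers(surfels, ref_index=20)]
```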
V. IMPLEMENTATION DETAILS
The surfel mapping system is implemented using only CPU computing and achieves real-time performance even when it reconstructs urban-scale environments. Superpixels are initialized on the regular grid spaced 8 pixels apart. The small-sized superpixels give the system a balance between efficiency and reconstruction accuracy. \({N}_{s} = 4,{N}_{c} = {10}\) and \({N}_{d} = {0.05}\) are used during the pixel assignment in Equation 1 and Equation 2. During the surfel initialization and fusion, superpixels with more than 16 assigned pixels are used to initialize surfels. \(\delta\) used in the Huber loss and the disparity error \(\sigma\) are determined by the depth sensors or depth estimation methods.
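To give a feel for the numbers, the following sketch places superpixel seeds on an 8-pixel grid and applies the 16-pixel size threshold. Placing seeds at the centres of 8×8 cells and the 640×480 image size are assumptions for illustration; the text only states the grid spacing and the threshold.

```python
def seed_grid(width, height, spacing=8):
    """Seed positions on a regular grid, assumed to sit at cell centres."""
    half = spacing // 2
    return [(x, y)
            for y in range(half, height, spacing)
            for x in range(half, width, spacing)]


def usable_superpixels(pixel_counts, min_pixels=16):
    """Indices of superpixels with more than `min_pixels` assigned pixels."""
    return [i for i, count in enumerate(pixel_counts) if count > min_pixels]


seeds = seed_grid(640, 480)
usable = usable_superpixels([64, 16, 17, 3])
```

For a 640×480 image this yields 4800 seeds, so each frame produces at most a few thousand surfel candidates instead of one per pixel.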
VI. EXPERIMENTS
In this section, we first compare the proposed mapping system with other state-of-the-art methods using the ICL-NUIM dataset [22]. The performance of the proposed system in large-scale environments is also analyzed using the KITTI dataset [1]. The evaluation platform is a workstation with an Intel i7-7700 CPU. Finally, we use the reconstructed map to support autonomous aggressive UAV flights to demonstrate the usability of the system. In the experiments, we show that the proposed method can fuse depth maps from stereo matching, depth prediction, and monocular depth estimation.
A. Reconstruction Accuracy
We evaluate the accuracy of the reconstructed models using the ICL-NUIM dataset [22] and compare it with that of other mapping methods. The dataset provides rendered RGB images and the corresponding depth maps from a synthetic room. To simulate real-world data, the dataset adds noise to both the RGB images and the depth images. \(\delta = 0.05\) and \(\sigma = 1.0\) are used for surfel initialization. We use ORB-SLAM2 in RGB-D mode to track the camera motion. \(G_{\delta} = 20\) is used to extract the local map for fusion.
The reconstruction accuracy is defined as the mean difference between the reconstructed model and the ground truth model. Here, we compare the proposed mapping method with BundleFusion [7], ElasticFusion [8], InfiniTAM [13] and the recently published FlashFusion [16]. To evaluate the ability to maintain the global consistency, we also evaluate Ours w/o loop in which the loop closure in ORB-SLAM2 is disabled.
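As a rough illustration of the accuracy metric, the mean difference can be approximated by the mean distance from each reconstructed point to its nearest ground-truth point. This is a simplification under stated assumptions: the ground truth is represented here by sample points rather than a surface model, the brute-force search only suits small point sets, and the function name is hypothetical.

```python
import numpy as np


def mean_reconstruction_error(reconstructed, ground_truth):
    """Mean nearest-neighbour distance from reconstructed points to ground-truth points."""
    diff = reconstructed[:, None, :] - ground_truth[None, :, :]
    dist = np.linalg.norm(diff, axis=2)  # pairwise distances, shape (N, M)
    return float(dist.min(axis=1).mean())


# Toy data in metres: two surfel centres near a three-point ground-truth sample.
ground_truth = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
reconstructed = np.array([[0.0, 0.0, 0.01], [1.0, 0.0, -0.03]])
error_cm = 100.0 * mean_reconstruction_error(reconstructed, ground_truth)
```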
The results are shown in Table I and Fig. 4. Please note that only FlashFusion [16] and our proposed system do not need GPU acceleration. BundleFusion [7], on the other hand, uses two high-end desktop GPUs for frame reintegration and stores all the fused RGB-D frames. Although our method is designed for large-scale efficient reconstruction, it achieves results similar to those of FlashFusion. Only \(kt3\) has global loops, and our method reduces the reconstruction error on it from 1.7 cm to 0.8 cm by removing the drift accumulated during motion tracking.
TABLE I
RECONSTRUCTION ACCURACY ON ICL-NUIM DATASET (CM)
Method | kt0 | kt1 | kt2 | kt3 |
---|---|---|---|---|
BundleFusion | 0.5 | 0.6 | 0.7 | 0.8 |
ElasticFusion | 0.7 | 0.7 | 0.8 | 2.8 |
InfiniTAM | 1.3 | 1.1 | 0.1 | 2.8 |
FlashFusion | 0.8 | 0.8 | 1.0 | 1.3 |
Ours | 0.7 | 0.9 | 1.1 | 0.8 |
Ours w/o loop | 0.7 | 0.9 | 1.1 | 1.7 |
Fig. 4. The reconstruction result of our system on the \(kt3\) sequence of the ICL-NUIM dataset. Left: the reconstructed meshes. Right: the error map of the surfel locations. Red represents \(4\mathrm{\;cm}\) error and blue represents \(0\mathrm{\;cm}\) error. As visualized in the images, our method produces a surfel reconstruction that is dense and covers fine structures (such as the poles).
B. Reconstruction Efficiency
Most previous online dense reconstruction methods focus on room-scale environments using RGB-D cameras. Here, thanks to its memory and computation efficiency, we show that our method can reconstruct much larger environments, such as the streets in the KITTI dataset. Both the fusion update time and the memory usage are studied as the reconstruction scale grows. We use PSMNet [23] to generate depth maps from stereo images and use ORB-SLAM2 in stereo mode to track the moving camera. \(\delta = 0.5\) and \(\sigma = 2.0\) are set according to the environment and the stereo method. Here, we use KITTI odometry sequence 00 for the evaluation.
The first row of Fig. 1 shows the reconstruction result and the detail of one looped corner. Fig. 5 shows the map before and after the map deformation. The time efficiency of our method during the reconstruction of KITTI sequence 00 is shown in Fig. 6. As shown in the figure, the average fusion time is around \(80\mathrm{\;ms}\) per frame, so our method runs in real time at more than \(10\mathrm{\;Hz}\) using only CPU computation. Unlike other dense mapping methods, such as TSDF-based methods, our method spends most of the time extracting superpixels and initializing surfels. The outlier-robust superpixel extraction and surfel initialization enable our system to use low-quality stereo depth maps. On the other hand, the surfel fusion consumes less than \(6\mathrm{\;ms}\) regardless of the environment scale. Because ORB-SLAM2 optimizes the whole pose graph frequently, our system deforms the map accordingly to maintain global consistency. The memory usage of the system during runtime is shown in Fig. 7. Between frames 3000 and 4000, the vehicle revisits one street and ORB-SLAM2 detects loop closures between the keyframes. Based on the updated pose graph, our system reuses previous surfels so that the memory grows according to the environment scale instead of the runtime.
Fig. 5. Details of one street corner before and after the map deformation. Before the loop closure, the road and the car are misaligned due to large drift (shown in red boxes). After the loop closure and the map deformation, the drift is removed.
Fig. 6. Time efficiency of our method reconstructing KITTI odometry sequence 00. As shown in the figure, our system achieves \({10}\mathrm{\;{Hz}}\) real-time performance during the reconstruction of the KITTI sequence.
Fig. 7. Memory efficiency of our method reconstructing KITTI odometry sequence 00. As shown, the memory usage grows when more frames are fused into the model. Between frame 3000 and 4000, the memory stays almost unchanged because the vehicle revisits one street and the surfels are reused.
C. Using A Monocular Camera
One of the advantages of the proposed method is that it can fuse depth maps from different kinds of sensors. In the previous sections, we showed dense mapping using rendered RGB-D images and stereo cameras. In this section, the proposed dense mapping system is used to reconstruct the KITTI sequence using only one monocular camera. Only the left images from the dataset are used to predict the depth maps [2], and the camera poses are tracked using ORB-SLAM2 in RGB-D mode (with the left image and the predicted depth map). The reconstruction result is shown in the bottom row of Fig. 1. During the fusion, \(\sigma\) is set to 4.0 according to the variance of the monocular depth estimation. Our method is the first one that reconstructs KITTI sequences with scale using predicted depth maps.
D. Supporting Autonomous Aggressive Flights
To demonstrate the usability of the proposed dense mapping, we apply the system to support autonomous aggressive flights. A dense model of the environment is first built with a handheld monocular camera. Then, a flight path is generated so that the quadrotor can navigate safely and aggressively in the environment. MVDepthNet [24] is used to estimate monocular depth maps and VINS-Mono is used to track the camera motion. During the scene reconstruction, the proposed mapping approach corrects map drift according to the detected loops so that obstacles are consistent between different visits. We also compare the reconstruction results with CHISEL [6] \({}^{2}\) using the same input images and camera poses. Since CHISEL cannot deform the map to eliminate the detected drift, fine obstacles cannot be reconstructed correctly when they are revisited. The results are shown in Fig. 8. Aggressive flights using the reconstructed maps can be found in the supplementary video. Please note that all indoor obstacles and outdoor trees are reconstructed accurately using our method. On the other hand, CHISEL [6] cannot reconstruct fine obstacles due to the drift between different visits, and its maps are not usable for autonomous flights.
Fig. 8. Using the proposed method to reconstruct the environment for autonomous flights. Left: overview of the environment. Middle: the reconstruction of our method. Right: the result of the widely used CHISEL.
VII. CONCLUSION
In this paper, we propose a novel surfel mapping method that can fuse sequential depth maps into a globally consistent model in real time without GPU acceleration. The system is carefully designed so that it can handle low-quality depth maps and maintain runtime efficiency. Surfels used in our system are initialized using extracted outlier-robust superpixels. Surfels are further organized according to the pose graph of the localization system so that the system maintains \(O(1)\) fusion time and can deform the map to achieve global consistency in real time. These characteristics make the proposed mapping system suitable for robot applications.
VIII. ACKNOWLEDGEMENT
This work was supported by the Hong Kong PhD Fellowship Scheme.
\({}^{2}\) https://github.com/personalrobotics/OpenChisel
REFERENCES
[1] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
[2] C. Godard, O. M. Aodha, and G. J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In Proc. of the IEEE Int. Conf. on Pattern Recognition, July 2017.
[3] S. Izadi, D. Kim, O. Hilliges, D. Molyneaux, R. Newcombe, P. Kohli, J. Shotton, S. Hodges, D. Freeman, A. Davison, and A. Fitzgibbon. KinectFusion: Real-time 3D reconstruction and interaction using a moving depth camera. In Proceedings of the 24th annual ACM symposium on User interface software and technology, Santa Barbara, CA, USA, October 2011. ACM.
[4] B. Curless and M. Levoy. A volumetric method for building complex models from range images. In Proceedings of the 23rd annual conference on Computer graphics and interactive techniques. ACM, 1996.
[5] T. Whelan, M. Kaess, M. Fallon, H. Johannsson, J. Leonard, and J. McDonald. Kintinuous: Spatially extended KinectFusion. In RSS Workshop on RGB-D: Advanced Reasoning with Depth Cameras, 2014.
[6] M. Klingensmith, I. Dryanovski, S. S. Srinivasa, and J. Xiao. Chisel: Real time large scale 3D reconstruction onboard a mobile device using spatially hashed signed distance fields. In Proc. of Robot.: Sci. and Syst., Rome, Italy, July 2015.
[7] A. Dai, M. Niessner, M. Zollhöfer, S. Izadi, and C. Theobalt. BundleFusion: Real-time globally consistent 3D reconstruction using on-the-fly surface reintegration. ACM Transactions on Graphics (TOG), 36(3), 2017.
[8] T. Whelan, S. Leutenegger, R. F. Salas-Moreno, B. Glocker, and A. J. Davison. ElasticFusion: Dense SLAM without a pose graph. In Proc. of Robot.: Sci. and Syst., Rome, Italy, July 2015.
[9] M. Nießner, M. Zollhöfer, S. Izadi, and M. Stamminger. Real-time 3D reconstruction at scale using voxel hashing. ACM Transactions on Graphics (TOG), 32(6):169, 2013.
[10] I. A. Bârsan, P. Liu, M. Pollefeys, and A. Geiger. Robust dense mapping for large-scale dynamic environments. In Proc. of the IEEE Int. Conf. on Robot. and Autom., 2018.
[11] X. Fu, F. Zhu, Q. Wu, Y. Sun, R. Lu, and R. Yang. Real-time large-scale dense mapping with surfels. Sensors, 18(5), 2018.
[12] T. Whelan, M. Kaess, J. J. Leonard, and J. McDonald. Deformation-based loop closure for large scale dense RGB-D SLAM. In Proc. of the IEEE/RSJ Int. Conf. on Intell. Robots and Syst., Tokyo, Japan, November 2013.
[13] O. Kähler, V. A. Prisacariu, and D. W. Murray. Real-time large-scale dense 3D reconstruction with loop closure. In European Conference on Computer Vision, pages 500-516, 2016.
[14] F. Steinbrücker, J. Sturm, and D. Cremers. Volumetric 3D mapping in real-time on a CPU. In Proc. of the IEEE Int. Conf. on Robot. and Autom., Hong Kong, China, May 2014.
[15] H. Oleynikova, Z. Taylor, Ma. Fehr, R. Siegwart, and J. Nieto. Voxblox: Incremental 3D Euclidean signed distance fields for on-board MAV planning. In Proc. of the IEEE/RSJ Int. Conf. on Intell. Robots and Syst., 2017.
[16] L. Han and L. Fang. FlashFusion: Real-time globally consistent dense 3D reconstruction using CPU computing. In Proc. of Robot.: Sci. and Syst., Pittsburgh, USA, 2018.
[17] J. Stückler and S. Behnke. Multi-resolution surfel maps for efficient dense 3D modeling and tracking. Journal of Visual Communication and Image Representation, 25(1):137-147, 2014.
[18] R. Mur-Artal and J. D. Tardós. ORB-SLAM2: An open-source SLAM system for monocular, stereo and RGB-D cameras. IEEE Transactions on Robotics, 33(5):1255-1262, 2017.
[19] T. Qin, P. Li, and S. Shen. VINS-Mono: A robust and versatile monocular visual-inertial state estimator. IEEE Transactions on Robotics, 34(4):1004-1020, 2018.
[20] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11):2274-2282, 2012.
[21] S. R. Fanello, J. Valentin, C. Rhemann, A. Kowdle, V. Tankovich, P. Davidson, and S. Izadi. UltraStereo: Efficient learning-based matching for active stereo systems. In Proc. of the IEEE Int. Conf. on Pattern Recognition, pages 1063-6919. IEEE, 2017.
[22] A. Handa, T. Whelan, J. MacDonald, and A. J. Davison. A benchmark for RGB-D visual odometry, 3D reconstruction and SLAM. In Proc. of the IEEE Int. Conf. on Robot. and Autom., Hong Kong, May 2014.
[23] J. Chang and Y. Chen. Pyramid stereo matching network. In Proc. of the IEEE Int. Conf. on Pattern Recognition, pages 5410-5418, 2018.
[24] K. Wang and S. Shen. MVDepthNet: Real-time multiview depth estimation neural network. In Proc. of the Int. Conf. on 3D Vis., Sep. 2018.