SOS-Match: Segmentation for Open-Set Robust Correspondence Search and Robot Localization in Unstructured Environments
Annika Thomas\(^{1*}\), Jouko Kinnari\(^{2*}\), Parker C. Lusk\(^{1}\), Kota Kondo\(^{1}\), and Jonathan P. How\(^{1}\)
Abstract-We present SOS-Match, a novel framework for detecting and matching objects in unstructured environments. Our system consists of 1) a front-end mapping pipeline that uses a zero-shot segmentation model to extract object masks from images and track them across frames, and 2) a frame alignment pipeline that uses the geometric consistency of object relationships to localize efficiently across a variety of conditions. We evaluate SOS-Match on the Båtvik seasonal dataset, which includes drone flights collected over a coastal plot of southern Finland during different seasons and lighting conditions. Results show that our approach is more robust to changes in lighting and appearance than classical image feature-based approaches or global descriptor methods, and it provides more viewpoint invariance than learning-based feature detection and description approaches. SOS-Match localizes within a reference map up to 46x faster than other feature-based approaches and produces maps less than 0.5% the size of the most compact benchmark maps. SOS-Match is a promising new approach for landmark detection and correspondence search in unstructured environments that is robust to changes in lighting and appearance and is more computationally efficient than other approaches, suggesting that the geometric arrangement of segments is a valuable localization cue in unstructured environments. We release our datasets at https://acl.mit.edu/SOS-Match/.
I. INTRODUCTION
The capability of a robot to localize itself with respect to an environment is a fundamental requirement in mobile robotics. Various approaches exist for achieving this, including infrastructure-based methods, map-based methods, and Simultaneous Localization and Mapping (SLAM).
Infrastructure-based methods such as Global Navigation Satellite System (GNSS) directly provide estimates of location in a known coordinate system but are subject to interference by malicious actors [1] and limited in availability (e.g., only work outdoors). Map-based methods such as [2] allow localization but only in cases where a global map can be acquired of the environment prior to operation. SLAM-based approaches do not depend on the availability of localization infrastructure or a pre-acquired map of the operating environment, and are able to provide a notion of pose with respect to a robot's initial starting position and orientation [3]. In multi-agent SLAM cases such as [4], there is an additional need to find the alignment between the reference frames of different agents operating in the same environment.
* Equal Contribution
\(^{1}\) A. Thomas, K. Kondo and J. How are with the Department of Aeronautics and Astronautics, Massachusetts Institute of Technology. {annikat, plusk, kkondo, jhow}@mit.edu
\(^{2}\) J. Kinnari is with Saab Finland Oy, Salomonkatu 17B, 00100 Helsinki, Finland. jouko.kinnari@saabgroup.com
Fig. 1: SOS-Match tracks object masks produced with no pre-training or fine-tuning across sequential posed camera frames to build sparse object-based maps. It robustly associates object masks using their geometric relationship with each other, enabling correspondence detections between traverses over highly ambiguous natural terrains.
A fundamental question to address in SLAM, as well as in map-based localization approaches, is how to relate the current environment measurements, i.e., sensor inputs in the vicinity of the robot, efficiently and accurately to a reference map. We propose four requirements for robust correspondence search in the localization problem in unstructured environments. To resolve the ambiguities in the correspondence search between current observations and past observations, or across observations by different robots, the description should (1) provide high precision and recall. The method of description should (2) enable operation in an environment which lacks prominent landmarks, operating in a zero-shot manner, i.e., without requiring significant engineering effort if the application domain changes. To allow operation over extended periods, it should (3) be robust to variation in the appearance of the environment, e.g., due to seasonal appearance change over the year. The description of the environment should (4) be modest in terms of memory use, computation time, and the communication bandwidth required in multi-agent scenarios.
To meet these requirements, we present SOS-Match, an open-set mapping and correspondence search pipeline that makes no prior assumptions about the content of the environment to extract and map objects from visually ambiguous unstructured settings, and uses only the geometric structure of the environment as a cue for localization. Utilizing the Segment Anything Model (SAM) [5] for front-end segment detection, our pipeline tracks detected segments across frames and prunes spurious detections to construct a map of consistent object masks, without requiring additional training per usage environment. We use a robust graph-theoretic data association method [6] to associate object locations within object maps, leveraging the geometric arrangement of landmarks and their relative positions as cues for localization. Since many localization algorithms are expensive to run on platforms with limited computing resources, we formulate a windowed correspondence search that can trade off accuracy for computational cost. This is an especially suitable approach for drone localization over unstructured terrain, as we demonstrate through experiments with localization and loop closure detection in drone flights collected across different seasons.
In summary, the contributions of this work include:
- A front end capable of reconstructing vehicle maps made of segmented object masks that are less than 0.5% the size of other benchmark maps, relying on no prior assumptions about the operating environment.
- A method for relating vehicle maps using a geometric correspondence search with a windowed approach that localizes up to 46x faster than feature-based data association approaches.
- SOS-Match achieves higher recall than classical and learned feature-based methods and a state-of-the-art visual place recognition approach evaluated on real-world flights across varied seasonal and illumination conditions, and provides increased robustness to viewpoint variation in comparison to learned feature-based methods.
- We release the Båtvik seasonal dataset containing long traverses with an Unmanned Aerial Vehicle (UAV) across diverse lighting conditions and seasonal appearance change to promote novel contributions towards localization in unstructured environments.
II. Related work
Several approaches have been proposed for correspondence search in SLAM and map-based localization. We discuss these approaches by considering various environment representations, segmentation-based approaches, and review deep learning in visual navigation. Additionally, we consider challenges from operation in unstructured environments.
A. Environment Representation
Descriptive and efficient environment representation lays the framework for robust localization. In visual SLAM, common environment representations include feature-based or object-based approaches. In feature-based SLAM, the environment is described by consistently detectable features within images that are distinct from one another like ORB [7], SIFT [8], SURF [9] or learned [10], [11] features. Several SLAM systems utilize these features for mapping, as shown in [12]. While feature-based SLAM is widely used, this approach poses a significant data handling challenge due to the substantial data volume it entails. In object-based SLAM methods, object detectors like YOLOv3 [13] can be used to extract a set vocabulary of objects in urban environments. Using object-based methods coupled with semantic labels can be advantageous in gathering contextual information about the environment while maintaining a compact map, as demonstrated in [14]. While the sparse nature of object-based mapping is promising for computational efficiency in large-scale operations, object classifiers are restricted to the types of objects that can be detected, limiting their use in unstructured environments.
B. Segmentation
Segmentation partitions an image into meaningful regions or objects. In computer vision, segmentation is widely used for object detection and classification with classifiers such as YOLOv3 [13], or for semantic classification with CLIP [15]. Segmentation can also be used as a step in methods for path planning and object detection [16]. Segment Anything [5] uses a trained model to perform segmentation in any environment without assumptions or fine-tuning, and a modified version [17] has recently been used for coordinate frame alignment in multi-agent trajectory deconfliction [18].
C. Deep Learning in Visual Navigation
Some recent works in deep learning for visual navigation leverage foundation models [19], which are models trained in a self-supervised way that can accomplish many tasks without fine-tuning or additional training. Deep learning has been used to learn features in methods such as SuperPoint [11] and D2-Net [20]. In visual place recognition, learned global descriptors can facilitate robust scene recognition across viewpoints and differing illuminations. AnyLoc [21] is a technique for visual place recognition that uses DINOv2 [22], a self-supervised vision transformer model, in combination with unsupervised feature aggregation using VLAD [23]; it surpasses other VPR approaches in open-set environments, but degrades in cases where contents are similar across frames. Segment Anything uses deep learning to segment environments without prior assumptions or additional training [5].
D. Unstructured Environments
Unstructured environments are a challenging setting for many visual SLAM systems that assume urban-centric information such as the presence of lane markings or buildings. In unstructured environments, it is important to adequately extract meaningful features or recognize objects to aid in place recognition. Some approaches address road roughness and limited distinctive features in these settings by integrating a range of sensor modalities and strategies like using wheel odometry with visual tracking [24], integrating topological maps [25], and utilizing lidar point clouds [26]. These methods exhibit limitations tied to external hardware requirements, point cloud size constraints, susceptibility to structural changes, and reliance on the assumption of well-defined off-road trails.
Fig. 2: SOS-Match incorporates two novel components. The front end mapping pipeline utilizes the vehicle odometry sensor along with camera images to perform SLAM and generate vehicle maps. The frame alignment pipeline offsets windows and uses our data association algorithm to filter the most likely correspondences.
E. Placement of This Work
SOS-Match utilizes a pre-trained foundation model for segmentation to construct object-based maps without any prior assumptions about the environment such as the presence of objects of specific classes. The generality of this framework allows it to localize successfully in unstructured environments with extreme illumination and structural changes while keeping map sizes compact enough to be shared between multiple agents.
III. Method
Our method consists of two main parts: mapping and frame alignment. A block diagram of our method is illustrated in Figure 2, where two vehicles generate object-based maps and then perform global data association.
A. Mapping
Our mapping approach consists of running camera images through a pre-trained image segmentation model such as [5] or [17], identifying tracks (i.e., finding correspondence between object masks across a sequence of images), and reconstructing the positions of the centroids of the segmented areas with a Structure-from-Motion (SFM)-style approach using camera poses estimated by visual-inertial odometry (VIO).
We perform segment detection only after a movement of \(\theta_T\) meters since the most recent keyframe, estimated with VIO. This enables the user of our algorithm to adjust the performance of the system to match the computational resources available on the robot.
Given an image \(\mathcal{I}(t)\), acquired at time \(t\), we use the segmentation model to extract binary object masks \(I_k(t)\), indexed by \(k \in \{0, 1, \ldots, K(t)\}\). For each \(I_k(t)\) larger than 2 pixels, we compute the centroid \(m_k(t)\) and covariance \(\Sigma_k(t)\) of the pixel coordinates of the mask. Further, we extract SIFT features [27] from the region of the mask and store them as the set \(A_k(t)\). We emphasize that SIFT features are used solely for inter-frame tracking, and we do not keep a record of them in the map. We also compute a size descriptor \(h_k(t) = \sqrt{\lambda_k(t)}\), where \(\lambda_k(t)\) is the largest eigenvalue of \(\Sigma_k(t)\).
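As a concrete illustration of this step, a minimal sketch of the per-mask statistics is given below; the function name and return convention are our own and not part of the released pipeline.

```python
import numpy as np

def mask_statistics(mask):
    """Centroid, covariance, and size descriptor of one binary mask (illustrative sketch)."""
    ys, xs = np.nonzero(mask)                # pixel coordinates covered by the mask
    if xs.size <= 2:                         # masks of 2 pixels or fewer are discarded
        return None
    pts = np.stack([xs, ys], axis=1).astype(float)
    m_k = pts.mean(axis=0)                   # centroid m_k(t)
    sigma_k = np.cov(pts, rowvar=False)      # covariance Sigma_k(t) of pixel coordinates
    h_k = np.sqrt(np.linalg.eigvalsh(sigma_k).max())  # size descriptor h_k(t)
    return m_k, sigma_k, h_k
```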
For inter-frame tracking, we assume a generic VIO implementation is available for tracking camera poses \(T(t) \in SE(3)\) as well as for tracking the movement of visual feature points between images \(\mathcal{I}(t_i)\) and \(\mathcal{I}(t_j)\). We compute the amount of movement, in pixels, of the visual feature points tracked by VIO, and compute the mean \(\mu_p\) and standard deviation \(\sigma_p\) of the movement.
We evaluate putative correspondences for each track whose latest observation is at most \({\theta }_{t}\) keyframes old; this provides some robustness against intermittent temporal inconsistencies in detection of segments by SAM.
We associate the detections of segments across consecutive image frames using three techniques. First, we require that an epipolar constraint on the segment centroids is satisfied. Second, we require that the apparent shift in pixel coordinates of the segment centroid is in correspondence with the movement of feature points tracked by the VIO algorithm. Third, we match segments that are similar in size and appearance.
For the epipolar constraint (see e.g., [28]), we only allow associations with a margin of less than \(\theta_a\) pixels. The comparison of the apparent shift to the VIO-tracked points is based on only allowing movement less than a specified limit \(\theta_v\).
After excluding infeasible matches based on the epipolar constraint and the requirement of similar movement as the VIO detections, there may still exist more than one possible association between segments observed in the latest keyframe and existing tracks. To this end, we compute the similarity of segment association hypotheses based on their appearance and size. For comparing appearance, we define the feature scoring function \(q_f\) as the fraction of SIFT features in the sets \(A_i(t-1)\) and \(A_j(t)\) that are not eliminated by Lowe's ratio test [27]. For comparing sizes, we use a scoring function \(q_s(h_i, h_j)\) that weighs each putative association of areas with size descriptors \(h_i\) and \(h_j\) according to the relative size difference \(r(h_i, h_j)\) of masks \(i\) and \(j\). Finally, we compute a similarity score as the geometric mean of the size scoring function \(q_s\) and the feature scoring function \(q_f\). To provide an unambiguous mapping between the latest object masks and the history of masks, we use an implementation of the Hungarian algorithm [29] with these similarity scores as weights. We initialize new tracks for observations that cannot be matched to previous tracks.
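A minimal sketch of this assignment step is shown below; `sift_score` and `size_score` are hypothetical helpers standing in for \(q_f\) and \(q_s\), and SciPy's `linear_sum_assignment` plays the role of the Hungarian algorithm.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate_masks(tracks, detections, sift_score, size_score):
    """Match new detections to existing tracks via the Hungarian algorithm (sketch).
    sift_score and size_score are assumed to return similarities in [0, 1]."""
    cost = np.ones((len(tracks), len(detections)))
    for i, trk in enumerate(tracks):
        for j, det in enumerate(detections):
            q_f = sift_score(trk, det)       # fraction of SIFT matches surviving Lowe's ratio test
            q_s = size_score(trk, det)       # similarity of size descriptors h_i, h_j
            cost[i, j] = 1.0 - np.sqrt(q_f * q_s)   # geometric-mean similarity as assignment weight
    rows, cols = linear_sum_assignment(cost)
    matched = [(i, j) for i, j in zip(rows, cols) if cost[i, j] < 1.0]
    return matched                           # unmatched detections start new tracks
```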
Finally, for tracks with more than \(\theta_n\) observations, we build a small SFM-style factor graph [30] for each track separately. We specify poses based on odometry and projection factors from the centroid pixel coordinates of segments, and use GTSAM [31] to find a minimal-cost solution to the factor graph. We record the mean position of each segment, indexed by \(n\) and expressed in the odometry frame of robot \(i\), as \(l_{i,n}\). We discard tracks that do not converge to a solution, as they are often a result of tracking errors. Furthermore, to describe the size of the segment in a way that is invariant to the distance at which the segment is observed in each image frame, we compute a size descriptor \(h_{S,n}\) scaled approximately to meters, based on the observed pixel size descriptors and the distance from the camera to the estimated segment position.
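The per-track reconstruction could be sketched roughly as follows with GTSAM's Python bindings; the near-fixed pose priors, the initial landmark guess, and the function signature are illustrative assumptions rather than the exact implementation.

```python
import gtsam
from gtsam.symbol_shorthand import X, L

def triangulate_track(poses, centroids_px, K, point_guess, pixel_sigma=3.0):
    """Small SFM-style factor graph for one track: odometry poses plus projection factors (sketch)."""
    graph = gtsam.NonlinearFactorGraph()
    initial = gtsam.Values()
    meas_noise = gtsam.noiseModel.Isotropic.Sigma(2, pixel_sigma)
    pose_noise = gtsam.noiseModel.Isotropic.Sigma(6, 1e-3)      # poses treated as nearly fixed
    for i, (pose, uv) in enumerate(zip(poses, centroids_px)):
        graph.add(gtsam.PriorFactorPose3(X(i), pose, pose_noise))
        graph.add(gtsam.GenericProjectionFactorCal3_S2(
            gtsam.Point2(uv[0], uv[1]), meas_noise, X(i), L(0), K))
        initial.insert(X(i), pose)
    initial.insert(L(0), gtsam.Point3(point_guess[0], point_guess[1], point_guess[2]))
    result = gtsam.LevenbergMarquardtOptimizer(graph, initial).optimize()
    return result.atPoint3(L(0))             # estimated object position l_{i,n} in the odometry frame
```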
The end result of the mapping pipeline is a vehicle map \(\mathcal{M}_{v,i}\), which contains estimated positions of objects corresponding to segmented masks, expressed in the odometry frame of robot \(i\), and size descriptors for the object masks.
B. Finding correspondences between vehicle maps
With perfect knowledge of the correspondences between objects, any alignment errors could be reduced to the level determined by the errors of the environment measurements. For robot \(i\), which has observed \(n\) successfully tracked object masks in \(\mathcal{M}_{v,i}\), we thus focus on finding the correspondences between objects within its own map \(\mathcal{M}_{v,i}\) and another map \(\mathcal{M}_{v,j}\), communicated by a peer or collected at an earlier time instant.
Assuming no further prior information on the correspondences, the number of possible associations grows quadratically as the number of objects increases, leading to an infeasible search time for any reasonably large map. To exploit the notion that objects spatially close to each other in \(\mathcal{M}_{v,i}\) should also be spatially close to each other in \(\mathcal{M}_{v,j}\), we implement a windowed search approach, in which we define a window length of \(\theta_{WL}\) objects, search for correspondences between the frames, and move the window forward by a stride of \(\theta_{SL}\) objects after each comparison.
Fig. 3: Example images from the Båtvik seasonal dataset, including variation in snow coverage, deciduous tree foliage, and sharpness of shadows across different seasons.
We denote \(\mathcal{M}_{v,i} = \{l_{i,n}\}\) for robot \(i\), where \(n \in \{0, 1, \ldots, N_i\}\). The windowed search thus attempts to find correspondences between the subsets \(\mathcal{M}_{v,i}[a_i \cdot \theta_{SL}, \ldots, a_i \cdot \theta_{SL} + \theta_{WL}]\) and \(\mathcal{M}_{v,j}[a_j \cdot \theta_{SL}, \ldots, a_j \cdot \theta_{SL} + \theta_{WL}]\) across all values of \(a_i\) and \(a_j\), where the notation \(G[\cdot]\) corresponds to taking a subset of \(G\) using items with indices \([\cdot]\).
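A schematic version of the windowed search is sketched below; `try_associate` is a hypothetical stand-in for the robust geometric data association and verification steps described next.

```python
def windowed_search(map_i, map_j, try_associate, theta_WL=50, theta_SL=10):
    """Slide fixed-length windows over both object maps and test every window pair (sketch)."""
    hypotheses = []
    for a_i in range(0, max(1, len(map_i) - theta_WL + 1), theta_SL):
        win_i = map_i[a_i : a_i + theta_WL]
        for a_j in range(0, max(1, len(map_j) - theta_WL + 1), theta_SL):
            win_j = map_j[a_j : a_j + theta_WL]
            S = try_associate(win_i, win_j)   # accepted object pairs for this window pair
            hypotheses.append((a_i, a_j, S))
    return hypotheses
```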
In finding correspondences, we first exclude hypothetical pairs of objects for which the pairwise difference in distance between the objects in each map is \(\varepsilon > \theta_\varepsilon\), or for which the size difference of the objects is significant, i.e., \(r(h_{S,i}, h_{S,j}) > \theta_r\). We weigh putative associations using the size scoring function \(q_s\). We use a robust geometric data association framework [6] to approximate a set \(S\) of associations (object pairs). As a final step, we estimate with [32] the relative translation and rotation between \(\mathcal{M}_{v,i}\) and \(\mathcal{M}_{v,j}\), assuming the correspondences defined by \(S\), and discard hypotheses that would result in more than \(\theta_\alpha\) angular difference in roll or pitch. This is motivated by the fact that the roll and pitch of the odometry frames can be estimated with an IMU due to excitation from gravity. We use the number of associations returned by the framework, \(|S|\), as the criterion for accepting or rejecting the hypothesis. We accept the hypothesis if \(|S| > \theta_S\). By varying the acceptance threshold \(\theta_S\), we can balance precision and recall of our solution.
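The transform estimation and the roll/pitch gate could be sketched as follows, using the standard SVD-based least-squares solution of [32]; the Euler-angle convention and function names are our own assumptions.

```python
import numpy as np

def estimate_rigid_transform(P, Q):
    """Least-squares rotation R and translation t mapping points P onto Q (Arun et al. [32])."""
    p0, q0 = P.mean(axis=0), Q.mean(axis=0)
    H = (P - p0).T @ (Q - q0)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])   # guard against reflections
    R = Vt.T @ D @ U.T
    t = q0 - R @ p0
    return R, t

def roll_pitch_within_limit(R, theta_alpha_deg=22.5):
    """Reject hypotheses whose implied roll or pitch exceeds theta_alpha (ZYX convention assumed)."""
    pitch = np.degrees(np.arcsin(np.clip(-R[2, 0], -1.0, 1.0)))
    roll = np.degrees(np.arctan2(R[2, 1], R[2, 2]))
    return abs(roll) < theta_alpha_deg and abs(pitch) < theta_alpha_deg
```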
IV. EXPERIMENTS
We evaluate the performance of SOS-Match at varying levels of appearance change in the environment, comparing precision, recall, average F1 score, search time and map size with respect to five reference methods. In each experiment, we use a dataset collected for this task.
A. Dataset
We introduce the Båtvik seasonal dataset, which includes six drone flights that each travel a distance of approximately 3.5 km over a coastal plot in southern Finland at an altitude of approximately 100 m above ground, each following the same trajectory plan as shown in Figure 1. We release this dataset for public use. The flights consist of drone images collected with a nadir-pointing camera as well as Inertial Measurement Unit (IMU) measurements, and we record autopilot output along with other telemetry data from an Ardupilot-based [33] drone flight controller. The flights take place over an area that contains only a few buildings, and a large part of the trajectory passes over a forest region as well as above the sea. This dataset represents flight of a UAV over terrain that has naturally high ambiguity. We recorded this flight six times across varying seasonal conditions, as illustrated in Figure 3 and outlined in Table I.
TABLE I: Description of flight trajectories in Båtvik dataset.
| Name | Time of flight | Description of appearance |
|---|---|---|
| Winter 1 | 2022-03-30 12:51 | Snow coverage |
| Winter 2 | 2022-03-31 11:39 | Snow coverage |
| Early Spring | 2022-05-05 14:10 | Some leaves |
| Late Spring | 2022-05-25 12:33 | Leaves in deciduous plants |
| Summer 1 | 2022-06-09 12:05 | Full leaves, hard shadows |
| Summer 2 | 2022-06-09 12:28 | Full leaves, hard shadows |
B. Baseline approaches
Several visual SLAM approaches [12] use image features such as ORB or SIFT as a front end for detecting and describing feature points, and use random sample consensus (RANSAC) [34] to prune outliers. We implement two baseline methods that detect and describe image features with each approach, using keyframes taken every 2 m of travel along the flight sequence. Next, we use RANSAC to find correspondences that are consistent with respect to the fundamental matrix, and set a limit on the number of required associations to consider the keyframes a match. We extract 500 SIFT or ORB features, select the best 20% of matches in terms of descriptor distance, and use a reprojection threshold of 5.0 pixels in fundamental matrix filtering. We run RANSAC for a maximum of 2000 iterations with confidence level 0.995.
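A rough reconstruction of the SIFT variant of this baseline, under the parameters listed above, might look like the OpenCV sketch below (the 2000-iteration cap is noted in a comment because its argument position varies across OpenCV versions):

```python
import cv2
import numpy as np

def sift_ransac_match_count(img1, img2, n_feats=500, keep_frac=0.20, reproj_thresh=5.0):
    """SIFT + fundamental-matrix RANSAC matching between two keyframes (illustrative sketch)."""
    sift = cv2.SIFT_create(nfeatures=n_feats)
    k1, d1 = sift.detectAndCompute(img1, None)
    k2, d2 = sift.detectAndCompute(img2, None)
    if d1 is None or d2 is None:
        return 0
    matches = sorted(cv2.BFMatcher(cv2.NORM_L2).match(d1, d2), key=lambda m: m.distance)
    matches = matches[: int(keep_frac * len(matches))]          # keep best 20% by descriptor distance
    if len(matches) < 8:
        return 0
    pts1 = np.float32([k1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([k2[m.trainIdx].pt for m in matches])
    # RANSAC with 5.0 px reprojection threshold and 0.995 confidence (the text caps it at 2000 iterations).
    F, inlier_mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, reproj_thresh, 0.995)
    return 0 if inlier_mask is None else int(inlier_mask.sum()) # associations consistent with F
```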
To compare against state-of-the-art learned detector and descriptor methods, we evaluate against LoFTR [35] with pretrained outdoor weights and the SuperPoint detector [11] using SuperGlue with pretrained outdoor weights from [10] for correspondence search. For each method, we retain only keypoint correspondences with a confidence of at least 0.7, and sum all keypoint match values as a metric for the overall match confidence of each image pair.
In addition to feature-based approaches, we compare our method against a modern Visual Place Recognition (VPR) approach that uses global descriptors for images, AnyLoc [21]. AnyLoc outperforms the universal place recognition pipelines NetVLAD [36], CosPlace [37] and MixVPR [38] in almost every evaluation, making it an appropriate benchmark that its authors claim works across very different environmental and lighting conditions. In our implementation, we define a DINOv2 extractor following AnyLoc's parameters, using layer 31 with the value facet and 32 clusters. We train a VLAD vocabulary of 32 cluster centers on database images, generate global descriptors for each image in the query set, then compute the cosine similarity of the global descriptors of each image pair in each sequence.
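The final comparison step reduces to cosine similarities between L2-normalized global descriptors; a minimal sketch, assuming descriptors are stored as row vectors, is:

```python
import numpy as np

def cosine_similarity_matrix(query_desc, db_desc):
    """Pairwise cosine similarity between query and database global descriptors (sketch)."""
    q = query_desc / np.linalg.norm(query_desc, axis=1, keepdims=True)
    d = db_desc / np.linalg.norm(db_desc, axis=1, keepdims=True)
    return q @ d.T        # entry (i, j): similarity of query image i and database image j
```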
C. Performance measures
We compare correspondence search results by first computing the region of the ground that would be visible from each keyframe camera pose if the ground under the image acquisition position were flat. For this, we use ground truth camera poses recorded from the extended Kalman filter (EKF) output of the flight controller and a terrain elevation map of the area. By comparing the area of intersection to the area of union for each keyframe pair, we compute the intersection over union (IoU) of every pair of keyframes. In evaluating recall, we assume that for each keyframe pair with IoU greater than 0.333, the matching algorithm should return a match indication. In evaluating precision, we assume that an algorithm may provide a correspondence between frames if the IoU from ground truth is more than 0.01; for smaller IoUs, we assume a returned correspondence is a false positive. In mapping, for purposes of evaluation, we use the ground truth poses of the flight controller EKF in SFM. We use 3.0 pixels as the measurement noise standard deviation. By examining the runtime as a function of window length \(\theta_{WL}\) and stride \(\theta_{SL}\), we experimentally choose the parameters \(\theta_{WL} = 50\) and \(\theta_{SL} = 10\). For the other parameters, we choose \(\theta_T = 2.0\) m, \(\theta_v = 4.0\), \(\theta_q = 0.2\), \(\theta_h = 0.2\), \(\theta_n = 5\), \(\theta_\alpha = 22.5^\circ\), \(\theta_\varepsilon = 2.0\), and \(\theta_r = 0.2\).
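A sketch of the ground-truth overlap computation is given below, assuming each keyframe footprint has already been projected to a ground polygon; Shapely is used purely for illustration.

```python
from shapely.geometry import Polygon

def footprint_iou(corners_a, corners_b):
    """IoU of two camera ground footprints given their corner coordinates (sketch)."""
    a, b = Polygon(corners_a), Polygon(corners_b)
    union = a.union(b).area
    return a.intersection(b).area / union if union > 0 else 0.0

# A keyframe pair is a ground-truth match if IoU > 0.333; a returned correspondence
# with ground-truth IoU below 0.01 is counted as a false positive.
```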
We produce precision-recall results by varying the acceptance limit (the threshold on the number of detected correspondences) for our approach and the ORB- and SIFT-based methods, the image match confidence threshold for SuperPoint+SuperGlue and LoFTR, and the required level of cosine similarity for AnyLoc.
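A minimal sketch of this threshold sweep, with hypothetical boolean arrays marking ground-truth matches (IoU > 0.333) and allowable matches (IoU > 0.01), is:

```python
import numpy as np

def precision_recall_curve(scores, is_gt_match, is_allowed):
    """Sweep the acceptance threshold over match scores to trace precision and recall (sketch)."""
    curve = []
    n_positives = max(1, int(is_gt_match.sum()))
    for thresh in np.unique(scores):
        accepted = scores >= thresh
        tp = int(np.sum(accepted & is_gt_match))
        fp = int(np.sum(accepted & ~is_allowed))       # accepted pairs with IoU below 0.01
        precision = tp / (tp + fp) if (tp + fp) > 0 else 1.0
        recall = tp / n_positives
        curve.append((float(thresh), precision, recall))
    return curve
```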
On an NVIDIA Quadro RTX 3000 with 6 GB VRAM, the mean detection and description time is 5.78 s. A recent branch of research suggests improvements to the runtime of SAM (e.g., [17]). We forgo detailed discussion of minimizing the front-end runtime of our method to focus on the correspondence search runtime characteristics. Our runtime evaluations in all experiments tabulated in Table II measure the time consumed in correspondence search, reflecting the time required for localization in real-time settings. The computational time evaluations in Table II are made on a 2 × 8-core Intel Xeon 6134 @ 3.2 GHz cluster computer, from which we reserve 16 GB of RAM. For SuperPoint+SuperGlue and LoFTR, which require a GPU for correspondence search, computational time evaluations are made on an NVIDIA RTX 3090.
Fig. 4: Precision-recall curves with different approaches with increasing visual discrepancy between flights.
TABLE II: Mean search time and map size across flights with increasing visual discrepancy between flights. Best results are highlighted first and second, and worst is shown in red.
| Case | Implementation | Mean search time [std] (s) | Map size (MB) |
|---|---|---|---|
| $\Delta_T = 23$ min | Ours | 5.34 [0.13] | 0.05 |
| | ORB+RANSAC | 49.11 [0.25] | 25.43 |
| | SIFT+RANSAC | 76.82 [0.52] | 406.95 |
| | AnyLoc | 0.11 [0.01] | 310.05 |
| | SuperPoint+SuperGlue | 166.67 [1.17] | 2593.72 |
| | LoFTR | 133.72 [0.23] | 322.9 |
| $\Delta_T = 1$ day | Ours | 4.35 [0.04] | 0.07 |
| | ORB+RANSAC | 27.23 [0.48] | 21.94 |
| | SIFT+RANSAC | 40.28 [0.67] | 345.89 |
| | AnyLoc | 0.12 [0.01] | 310.05 |
| | SuperPoint+SuperGlue | 201.17 [5.88] | 2843.62 |
| | LoFTR | 129.66 [0.31] | 454.9 |
| $\Delta_T = 20$ days | Ours | 9.29 [0.12] | 0.10 |
| | ORB+RANSAC | 38.35 [0.52] | 22.07 |
| | SIFT+RANSAC | 49.75 [0.67] | 337.47 |
| | AnyLoc | 0.12 [0.01] | 310.05 |
| | SuperPoint+SuperGlue | 166.76 [1.77] | 2843.62 |
| | LoFTR | 132.47 [0.19] | 534.0 |
| $\Delta_T = 15$ days | Ours | 8.06 [0.14] | 0.05 |
| | ORB+RANSAC | 40.38 [0.44] | 25.43 |
| | SIFT+RANSAC | 64.75 [0.81] | 407.20 |
| | AnyLoc | 0.12 [0.01] | 310.05 |
| | SuperPoint+SuperGlue | 220.87 [4.52] | 2907.95 |
| | LoFTR | 154.05 [0.53] | 179.6 |
D. Precision and Recall
First, we evaluate the performance of SOS-Match when an agent localizes within a previously collected map from another agent after time has passed. We include the sections of the flight in Figure 1 that do not involve flying over water, as visual navigation-based systems do not perform well in environments with no distinctive features. To evaluate the performance of our pipeline over unstructured terrain with a variety of visual features, we consider the flight as a whole in addition to splitting it into same-viewpoint and different-viewpoint cases; differentiating into these test cases allows us to evaluate how our method performs when localizing from the same viewpoint and from different viewpoints.
In Figure 4, we show the precision and recall of each comparison method in the cross-seasonal localization case, with the visual discrepancy between flights increasing from left to right.
Our method provides better localization performance than the reference methods in the Summer 1 vs. Summer 2, Early Spring vs. Late Spring, and Late Spring vs. Summer 2 cases. Based on Figure 4, in the Winter 1 vs. Winter 2 case, SuperPoint+SuperGlue appears to outperform our method. While there is a performance benefit for SuperPoint+SuperGlue in this case, in all cases our method outperforms all reference methods by a significant margin in terms of map size, and it outperforms all reference methods aside from AnyLoc in terms of search time, as seen in Table II. A low search time is particularly critical for the suitability of a localization approach for online implementation. Furthermore, a small map size is a key enabler of collaborative localization, where robots must share their local maps over limited network bandwidth.
E. F1 Score: Performance Analysis by Viewpoint
To further analyze the performance of our method by viewpoint, we consider the case in which an agent localizes within the map of another agent from the same viewpoint and the case in which an agent passes over a place previously seen by another agent from a separate viewpoint. We calculate the average F1 score of different flights separated into same-viewpoint and different-viewpoint cases, as shown in Figure 5.
In this evaluation, we take the average F1 score, calculated as the average of the F1 score when precision is tuned to at least 0.99 and the F1 score when recall is tuned to at least 0.99. If recall cannot be tuned to at least 0.99, we start from the highest achievable value. Thus, these results demonstrate performance in cases that benefit from trading off between precision and recall.
In Figure 5, we see that all methods have a lower average F1 score in the different viewpoint setting than in the same viewpoint setting. We note that our method does not perform as well as ORB- and SIFT-based methods in the Summer 1 vs. Summer 2 case from multiple viewpoints due to our map representation being sparse and thus limited by field of view, but it maintains performance while others degrade when there are more visual variations between the flights.
In the same viewpoint case, SuperPoint+SuperGlue surpasses our method in average F1 score. However, in the different viewpoint case, SuperPoint+SuperGlue fares unfavorably against our method and most comparison methods, suggesting that the approach is very sensitive to variation in viewpoint.
V. Discussion
SOS-Match demonstrates the value of incorporating foundation models into front-end object detection and map construction in unstructured environments. Using segmentation in open-set unstructured settings such as dense forested regions provides sufficient geometric cues that are highly suitable for localization and loop closure detection. Our method also offers a significant reduction in search time and map size in comparison to the reference methods. We consider these major improvements towards satisfying the requirements of a robust description of environment measurements for use in the localization problem, which we briefly listed in Sec. I.
Fig. 5: Average F1 scores of different cross-season cases. Bars indicate the performance from the same viewpoint (left) and from different viewpoints (right).
We share the Båtvik seasonal dataset, which represents a challenging real-world scenario for visual navigation in unstructured environments, with significant ambiguities in the appearance of the environment. The data includes typical quality issues that occur in drones with hardware constraints such as image compression artifacts, which are useful for real-world evaluation. Our work reveals that most baseline methods are affected by even short time gaps between traverses, highlighting the need for robust visual approaches in these environments. The release of this dataset enables evaluation of robustness to changing seasons and visual conditions.
Our method does not fully account for uncertainty, and we plan to address cases with less favorable (non-bird's eye view) triangulation geometry that may impact depth accuracy, as well as scenarios with significant odometry drift in future work.
The comparison of the localization performance against SuperPoint+SuperGlue suggests that it may be possible to find a balance between map size and localization capability by combining information about the environment's structure with a learning-based approach to description. We therefore plan to incorporate additional information about the objects, including semantic information in environments containing variable discernible objects, anisotropic object location covariance, and robust descriptors for geometry to enable faster correspondence search over large hypothesis spaces. Our proposed method uses size descriptors as a means for excluding putative matches where the size difference of objects is significant, and a further reduction in search time may be achievable if richer descriptors can be extracted for the segments.
VI. CONCLUSION
We present SOS-Match, a framework for compact mapping and fast localization that is able to operate in open-set unstructured environments containing segmentable objects. The compact nature of the maps makes the pipeline well suited to multi-agent scenarios, allowing for efficient communication between agents. Experiments with the Båtvik seasonal dataset demonstrate the pipeline's ability to align frames in a challenging unstructured environment across different seasonal conditions and from various viewpoints. The reference map generated by each agent is efficient and concise enough to be communicated between agents with limited communication bandwidth. SOS-Match shows that segmentation provides sufficient geometric cues for localization and robust correspondence search in unstructured environments.
ACKNOWLEDGEMENT
This work was supported by Saab Finland Oy, Boeing Research & Technology, and the National Science Foundation Graduate Research Fellowship under Grant No. 2141064. The dataset was collected as part of Business Finland project Multico (6575/31/2019). We acknowledge the computational resources provided by the Aalto Science-IT project.
REFERENCES
[1] R. Morales-Ferre, P. Richter, E. Falletti, A. de la Fuente, and E. S. Lohan, "A survey on coping with intentional interference in satellite navigation for manned and unmanned aircraft," IEEE Communications Surveys & Tutorials, vol. 22, no. 1, pp. 249-291, 2020.
[2] J. Kinnari, R. Renzulli, F. Verdoja, and V. Kyrki, "Lsvl: Large-scale season-invariant visual localization for uavs," Robotics and Autonomous Systems, vol. 168, p. 104497, 2023.
[3] C. Cadena, L. Carlone, H. Carrillo, Y. Latif, D. Scaramuzza, J. Neira, I. Reid, and J. J. Leonard, "Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age," IEEE Transactions on robotics, vol. 32, no. 6, pp. 1309-1332, 2016.
[4] Y. Tian, Y. Chang, F. Herrera Arias, C. Nieto-Granda, J. P. How, and L. Carlone, "Kimera-multi: Robust, distributed, dense metric-semantic slam for multi-robot systems," IEEE Transactions on Robotics, vol. 38, no. 4, pp. 2022-2038, 2022.
[5] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo et al., "Segment anything," arXiv preprint arXiv:2304.02643, 2023.
[6] P. C. Lusk, K. Fathian, and J. P. How, "CLIPPER: A graph-theoretic framework for robust data association," in 2021 IEEE International Conference on Robotics and Automation (ICRA), 2021, pp. 13828-13834.
[7] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, "ORB: An efficient alternative to SIFT or SURF," in 2011 International conference on computer vision. Ieee, 2011, pp. 2564-2571.
[8] D. G. Lowe, "Distinctive image features from scale-invariant key-points," International journal of computer vision, vol. 60, pp. 91-110, 2004.
[9] H. Bay, T. Tuytelaars, and L. Van Gool, "SURF: Speeded up robust features," in Computer Vision-ECCV 2006: 9th European Conference on Computer Vision, Graz, Austria, May 7-13, 2006. Proceedings, Part I 9. Springer, 2006, pp. 404-417.
[10] P.-E. Sarlin, D. DeTone, T. Malisiewicz, and A. Rabinovich, "SuperGlue: Learning feature matching with graph neural networks," in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 4938-4947.
[11] D. DeTone, T. Malisiewicz, and A. Rabinovich, "Superpoint: Self-supervised interest point detection and description," in Proceedings of the IEEE conference on computer vision and pattern recognition workshops, 2018, pp. 224-236.
[12] I. Abaspur Kazerouni, L. Fitzgerald, G. Dooly, and D. Toal, "A survey of state-of-the-art on visual SLAM," Expert Systems with Applications, vol. 205, p. 117734, 2022.
[13] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," arXiv preprint arXiv:1804.02767, 2018.
[14] J. Ankenbauer, P. C. Lusk, and J. P. How, "Global localization in unstructured environments using semantic object maps built from various viewpoints," arXiv preprint arXiv:2303.04658, 2023.
[15] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., "Learning transferable visual models from natural language supervision," in International conference on machine learning. PMLR, 2021, pp. 8748-8763.
[16] B. Douillard, J. Underwood, N. Kuntz, V. Vlaskine, A. Quadros, P. Morton, and A. Frenkel, "On the segmentation of 3D LIDAR point clouds," in 2011 IEEE International Conference on Robotics and Automation. IEEE, 2011, pp. 2798-2805.
[17] X. Zhao, W. Ding, Y. An, Y. Du, T. Yu, M. Li, M. Tang, and J. Wang, "Fast segment anything," arXiv preprint arXiv:2306.12156, 2023.
[18] K. Kondo, C. T. Tewari, M. B. Peterson, A. Thomas, J. Kinnari, A. Tagliabue, and J. P. How, "PUMA: Fully decentralized uncertainty-aware multiagent trajectory planner with real-time image segmentation-based frame alignment," arXiv preprint arXiv:2311.03655, 2023.
[19] R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill et al., "On the opportunities and risks of foundation models," arXiv preprint arXiv:2108.07258, 2021.
[20] M. Dusmanu, I. Rocco, T. Pajdla, M. Pollefeys, J. Sivic, A. Torii, and T. Sattler, "D2-net: A trainable CNN for joint description and detection of local features," in Proceedings of the ieee/cvf conference on computer vision and pattern recognition, 2019, pp. 8092-8101.
[21] N. Keetha, A. Mishra, J. Karhade, K. M. Jatavallabhula, S. Scherer, M. Krishna, and S. Garg, "AnyLoc: Towards universal visual place recognition," arXiv preprint arXiv:2308.00688, 2023.
[22] M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby et al., "DINOv2: Learning robust visual features without supervision," arXiv preprint arXiv:2304.07193, 2023.
[23] R. Arandjelovic and A. Zisserman, "All about VLAD," in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2013, pp. 1578-1585.
[24] M. Grimes and Y. LeCun, "Efficient off-road localization using visually corrected odometry," in 2009 IEEE International Conference on Robotics and Automation. IEEE, 2009, pp. 2649-2654.
[25] T. Ort, L. Paull, and D. Rus, "Autonomous vehicle navigation in rural environments without detailed prior maps," in 2018 IEEE international conference on robotics and automation (ICRA). IEEE, 2018, pp. 2040-2047.
[26] R. Ren, H. Fu, H. Xue, X. Li, X. Hu, and M. Wu, "Lidar-based robust localization for field autonomous vehicles in off-road environments," Journal of Field Robotics, vol. 38, no. 8, pp. 1059-1077, 2021.
[27] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.
[28] R. I. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, 2nd ed. Cambridge University Press, ISBN: 0521540518, 2004.
[29] H. W. Kuhn, "The Hungarian method for the assignment problem," Naval Research Logistics Quarterly, vol. 2, no. 1-2, pp. 83-97, 1955.
[30] F. Dellaert, M. Kaess et al., "Factor graphs for robot perception," Foundations and Trends® in Robotics, vol. 6, no. 1-2, pp. 1-139, 2017.
[31] F. Dellaert, "Factor graphs and GTSAM: A hands-on introduction," Georgia Institute of Technology, Tech. Rep, vol. 2, p. 4, 2012.
[32] K. S. Arun, T. S. Huang, and S. D. Blostein, "Least-squares fitting of two 3-D point sets," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-9, no. 5, pp. 698-700, 1987.
[33] ArduPilot Community. (2022) Ardupilot - open source autopilot. Accessed: Dec 21, 2023. [Online]. Available: https://ardupilot.org
[34] M. A. Fischler and R. C. Bolles, "Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography," Communications of the ACM, vol. 24, no. 6, pp. 381-395, 1981.
[35] J. Sun, Z. Shen, Y. Wang, H. Bao, and X. Zhou, "LoFTR: Detector-free local feature matching with transformers," CVPR, 2021.
[36] R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic, "NetVLAD: CNN architecture for weakly supervised place recognition," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 5297-5307.
[37] G. Berton, C. Masone, and B. Caputo, "Rethinking visual geo-localization for large-scale applications," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 4878-4888.
[38] A. Ali-Bey, B. Chaib-Draa, and P. Giguere, "MixVPR: Feature mixing for visual place recognition," in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023, pp. 2998-3007.






