Deeper Depth Prediction with Fully Convolutional Residual Networks: Translation of the Abstract and Introduction
Deeper Depth Prediction with Fully Convolutional Residual Networks
Abstract
This paper addresses the problem of estimating the depth map of a scene given a single RGB image. We propose a fully convolutional architecture, encompassing residual learning, to model the ambiguous mapping between monocular images and depth maps. In order to improve the output resolution, we present a novel way to efficiently learn feature map up-sampling within the network. For optimization, we introduce the reverse Huber loss that is particularly suited for the task at hand and driven by the value distributions commonly present in depth maps. Our model is composed of a single architecture that is trained end-to-end and does not rely on post-processing techniques, such as CRFs or other additional refinement steps. As a result, it runs in real-time on images or videos. In the evaluation, we show that the proposed model contains fewer parameters and requires fewer training data than the current state of the art, while outperforming all approaches on depth estimation. Code and models are publicly available[1].
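The reverse Huber (berHu) loss mentioned in the abstract can be sketched as follows. The batch-adaptive threshold c = 0.2 · max|x| follows the published paper; the function and variable names are illustrative, not the authors' code:

```python
import numpy as np

def berhu_loss(pred, target, c=None):
    """Reverse Huber (berHu) loss on per-pixel residuals x = pred - target:
    |x| when |x| <= c, and (x^2 + c^2) / (2c) otherwise, so small errors
    are penalized like L1 and large ones like a scaled L2.
    """
    x = np.abs(np.asarray(pred) - np.asarray(target))
    if c is None:
        # Batch-adaptive threshold used in the paper: 20% of the max residual.
        c = 0.2 * x.max()
    if c == 0:  # perfect prediction: loss is zero
        return 0.0
    return np.where(x <= c, x, (x ** 2 + c ** 2) / (2 * c)).mean()
```

Note that whenever c is at least the largest residual, the loss reduces to the plain mean absolute error.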
- 1. Introduction
Depth estimation from a single view is a discipline as old as computer vision and encompasses several techniques that have been developed throughout the years. One of the most successful among these techniques is Structure-from-Motion (SfM) [34]; it leverages camera motion to estimate camera poses through different temporal intervals and, in turn, estimate depth via triangulation from pairs of consecutive views. Alternatively to motion, other working assumptions can be used to estimate depth, such as variations in illumination [39] or focus.
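In the rectified two-view case, the triangulation step that SfM relies on reduces to a one-line textbook relation. The symbols below (focal length f in pixels, baseline B in meters, disparity d in pixels) are illustrative:

```python
def depth_from_disparity(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Depth of a point seen in two rectified views: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_px * baseline_m / disparity_px
```

The inverse relation between disparity and depth is why triangulation degrades for distant points: a small disparity error translates into a large depth error.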
In absence of such environmental assumptions, depth estimation from a single image of a generic scene is an ill-posed problem, due to the inherent ambiguity of mapping an intensity or color measurement into a depth value. While this also is a human brain limitation, depth perception can nevertheless emerge from monocular vision. Hence, it is not only a challenging task to develop a computer vision system capable of estimating depth maps by exploiting monocular cues, but also a necessary one in scenarios where direct depth sensing is not available or not possible. Moreover, the availability of reasonably accurate depth information is well-known to improve many computer vision tasks with respect to the RGB-only counterpart, for example in reconstruction [23], recognition [26], semantic segmentation [5] or human pose estimation [35].
For this reason, several works tackle the problem of monocular depth estimation. One of the first approaches assumed superpixels as planar and inferred depth through plane coefficients via Markov Random Fields (MRFs) [30]. Superpixels have also been considered in [16, 20, 37], where Conditional Random Fields (CRFs) are deployed for the regularization of depth maps. Data-driven approaches, such as [10, 13], have proposed to carry out image matching based on hand-crafted features to retrieve the most similar candidates of the training set to a given query image. The corresponding depth candidates are then warped and merged in order to produce the final outcome.
Recently, Convolutional Neural Networks (CNNs) have been employed to learn an implicit relation between color pixels and depth [5, 6, 16, 19, 37]. CNN approaches have often been combined with CRF-based regularization, either as a post-processing step [16, 37] or via structured deep learning [19], as well as with random forests [27]. These methods encompass a higher complexity due to either the high number of parameters involved in a deep network [5, 6, 19] or the joint use of a CNN and a CRF [16, 37]. Nevertheless, deep learning boosted the accuracy on standard benchmark datasets considerably, ranking these methods first in the state of the art.
In this work, we propose to learn the mapping between a single RGB image and its corresponding depth map using a CNN. The contribution of our work is as follows. First, we introduce a fully convolutional architecture to depth prediction, endowed with novel up-sampling blocks, that allows for dense output maps of higher resolution and at the same time requires fewer parameters and trains on one order of magnitude fewer data than the state of the art, while outperforming all existing methods on standard benchmark datasets [23, 29]. We further propose a more efficient scheme for up-convolutions and combine it with the concept of residual learning [7] to create up-projection blocks for the effective up-sampling of feature maps. Last, we train the network by optimizing a loss based on the reverse Huber function (berHu) [40] and demonstrate, both theoretically and experimentally, why it is beneficial and better suited for the task at hand. We thoroughly evaluate the influence of the network's depth, the loss function and the specific layers employed for up-sampling in order to analyze their benefits. Finally, to further assess the accuracy of our method, we employ the trained model within a 3D reconstruction scenario, in which we use a sequence of RGB frames and their predicted depth maps for Simultaneous Localization and Mapping (SLAM).
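The first stage of the up-sampling blocks described above is an "empty" 2x unpooling of the feature map. A minimal NumPy sketch of that stage for a single-channel map follows; the subsequent 5x5 convolution and the residual branch of the full up-projection block are omitted, and the names are illustrative:

```python
import numpy as np

def unpool2x(x: np.ndarray) -> np.ndarray:
    """Double each spatial dimension by placing every input value at the
    top-left corner of a 2x2 block and zero-filling the remaining cells."""
    h, w = x.shape
    out = np.zeros((2 * h, 2 * w), dtype=x.dtype)
    out[::2, ::2] = x
    return out
```

A following convolution then spreads each value into its zero-filled neighborhood, which is what makes the combined unpool-plus-convolution behave like a learned up-sampling.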
