Depth Map Prediction from a Single Image using a Multi-Scale Deep Network: Abstract Translation

Depth Map Prediction from a Single Image using a Multi-Scale Deep Network

Abstract

       Predicting depth is an essential component in understanding the 3D geometry of a scene. While for stereo images local correspondence suffices for estimation, finding depth relations from a single image is less straightforward, requiring integration of both global and local information from various cues. Moreover, the task is inherently ambiguous, with a large source of uncertainty coming from the overall scale. In this paper, we present a new method that addresses this task by employing two deep network stacks: one that makes a coarse global prediction based on the entire image, and another that refines this prediction locally. We also apply a scale-invariant error to help measure depth relations rather than scale. By leveraging the raw datasets as large sources of training data, our method achieves state-of-the-art results on both NYU Depth and KITTI, and matches detailed depth boundaries without the need for superpixelation.

       1 Introduction

       Estimating depth is an important component of understanding geometric relations within a scene. In turn, such relations help provide richer representations of objects and their environment, often leading to improvements in existing recognition tasks [18], as well as enabling many further applications such as 3D modeling [16, 6], physics and support models [18], robotics [4, 14], and potentially reasoning about occlusions.

       While there is much prior work on estimating depth based on stereo images or motion [17], there has been relatively little on estimating depth from a single image. Yet the monocular case often arises in practice: potential applications include better understanding of the many images distributed on the web and social media outlets, real estate listings, and shopping sites. These include many examples of both indoor and outdoor scenes.

       There are likely several reasons why the monocular case has not yet been tackled to the same degree as the stereo one. Provided accurate image correspondences, depth can be recovered deterministically in the stereo case [5]. Thus, stereo depth estimation can be reduced to developing robust image point correspondences, which can often be found using local appearance features. By contrast, estimating depth from a single image requires the use of monocular depth cues such as line angles and perspective, object sizes, image position, and atmospheric effects. Furthermore, a global view of the scene may be needed to relate these effectively, whereas local disparity is sufficient for stereo.
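A side note on why the stereo case is deterministic (standard epipolar geometry, not something introduced in this paper): for a rectified stereo pair, the depth of a matched point follows directly from its disparity via Z = f * B / d. A minimal sketch, with illustrative camera parameters:

```python
def depth_from_disparity(disparity_px: float, focal_px: float, baseline_m: float) -> float:
    """Rectified-stereo triangulation: Z = f * B / d.

    disparity_px : horizontal offset (in pixels) of a matched point between the two views
    focal_px     : focal length expressed in pixels
    baseline_m   : distance between the two camera centers, in meters
    """
    return focal_px * baseline_m / disparity_px

# Roughly KITTI-like values, assumed here only for illustration:
# a 721 px focal length and 0.54 m baseline turn a 20 px disparity into about 19.5 m.
print(depth_from_disparity(20.0, 721.0, 0.54))
```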

       Moreover, the task is inherently ambiguous, and a technically ill-posed problem: Given an image, an infinite number of possible world scenes may have produced it. Of course, most of these are physically implausible for real-world spaces, and thus the depth may still be predicted with considerable accuracy. At least one major ambiguity remains, though: the global scale. Although extreme cases (such as a normal room versus a dollhouse) do not exist in the data, moderate variations in room and furniture sizes are present. We address this using a scale-invariant error in addition to more common scale-dependent errors. This focuses attention on the spatial relations within a scene rather than general scale, and is particularly apt for applications such as 3D modeling, where the model is often rescaled during postprocessing.
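The scale-invariant error itself is only defined later in the paper, but the idea can be sketched as follows: measure errors in log-depth and subtract the mean log-difference, so that multiplying the whole prediction by a constant leaves the error unchanged and only relative depths are penalized. A minimal NumPy sketch of an error of this form (the function name and exact weighting are my own, not necessarily the paper's):

```python
import numpy as np

def scale_invariant_error(pred, target):
    """Scale-invariant squared error in log space.

    d is the per-pixel log difference; subtracting its mean (the second term)
    cancels any global rescaling of the prediction, since pred * c only shifts
    every entry of d by log(c). What remains measures relative depth relations.
    """
    d = np.log(pred) - np.log(target)
    return np.mean(d ** 2) - np.mean(d) ** 2

# A prediction that is exactly twice the true depth everywhere scores 0:
pred = np.array([2.0, 4.0, 8.0])
target = np.array([1.0, 2.0, 4.0])
print(scale_invariant_error(pred, target))  # 0.0 (up to floating point)
```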

       In this paper we present a new approach for estimating depth from a single image. We directly regress on the depth using a neural network with two components: one that first estimates the global structure of the scene, and a second that refines it using local information, trained with a loss that accounts for depth relations between pixel locations in addition to pointwise error. Our system achieves state-of-the-art estimation rates on NYU Depth and KITTI, as well as improved qualitative outputs.
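To make the two-component structure concrete, here is a schematic PyTorch sketch of a coarse global stack followed by a local refinement stack. The layer counts, kernel sizes, and resolutions below are illustrative placeholders, not the configuration specified in the paper; only the overall shape (a full-image coarse prediction fed into a convolutional refinement network alongside the image) follows the description above.

```python
import torch
import torch.nn as nn

class CoarseNet(nn.Module):
    """Global stack: sees the entire image and predicts a low-resolution depth map.
    Fully connected layers give every output unit a full-image receptive field."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d((7, 7)),   # convenience pooling, not from the paper
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 7 * 7, 1024), nn.ReLU(),
            nn.Linear(1024, 28 * 28),       # coarse 28x28 depth map (placeholder size)
        )

    def forward(self, x):
        return self.fc(self.features(x)).view(-1, 1, 28, 28)

class FineNet(nn.Module):
    """Local stack: refines the coarse prediction using higher-resolution image features."""
    def __init__(self):
        super().__init__()
        self.pre = nn.Sequential(
            nn.Conv2d(3, 63, kernel_size=9, stride=2, padding=4), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.post = nn.Sequential(
            nn.Conv2d(64, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv2d(64, 1, kernel_size=5, padding=2),
        )

    def forward(self, image, coarse_depth):
        feats = self.pre(image)                                  # 63 local feature maps
        coarse = nn.functional.interpolate(coarse_depth, size=feats.shape[-2:])
        return self.post(torch.cat([feats, coarse], dim=1))      # coarse map as an extra channel

# Forward pass with a dummy 224x224 image (the input size is also a placeholder):
x = torch.randn(1, 3, 224, 224)
refined = FineNet()(x, CoarseNet()(x))
print(refined.shape)  # torch.Size([1, 1, 56, 56])
```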

 

Reference: Eigen D, Puhrsch C, Fergus R. Depth map prediction from a single image using a multi-scale deep network. arXiv preprint arXiv:1406.2283, 2014. (2258 citations)

posted @ 2021-06-18 11:35  ProfSnail