Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-Scale Convolutional Architecture: Abstract Translation

Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-Scale Convolutional Architecture




       In this paper we address three different computer vision tasks using a single multiscale convolutional network architecture: depth prediction, surface normal estimation, and semantic labeling. The network that we develop is able to adapt naturally to each task using only small modifications, regressing from the input image to the output map directly. Our method progressively refines predictions using a sequence of scales, and captures many image details without any superpixels or low-level segmentation. We achieve state-of-the-art performance on benchmarks for all three tasks.


1.     Introduction


        Scene understanding is a central problem in vision that has many different aspects. These include semantic labels describing the identity of different scene portions; surface normal or depth estimates describing the physical geometry; instance labels of the extent of individual objects; and affordances capturing possible interactions of people with the environment. Many of these are often represented with a pixel-map containing a value or label for each pixel, e.g. a map containing the semantic label of the object visible at each pixel, or the vector coordinates of the surface normal orientation.
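
To make the pixel-map idea concrete, the following is a minimal sketch of how the three kinds of output could be laid out for an image of a given size. The function names `make_pixel_maps` and `normalize`, the nested-list layout, and the placeholder values are assumptions made for illustration; they do not come from the paper.

```python
import math


def make_pixel_maps(height, width):
    """Build toy per-pixel outputs for an image of the given size.

    All values are placeholders; a real system would predict them.
    """
    # Depth: one scalar per pixel.
    depth = [[1.0 for _ in range(width)] for _ in range(height)]
    # Surface normals: one unit-length 3-vector (x, y, z) per pixel.
    normals = [[(0.0, 0.0, 1.0) for _ in range(width)] for _ in range(height)]
    # Semantic labels: one integer class index per pixel.
    labels = [[0 for _ in range(width)] for _ in range(height)]
    return depth, normals, labels


def normalize(v):
    """Scale a 3-vector to unit length, as a surface normal requires."""
    length = math.sqrt(sum(c * c for c in v))
    if length == 0.0:
        raise ValueError("zero vector has no direction")
    return tuple(c / length for c in v)
```

All three maps share the same height and width and differ only in what is stored at each pixel, which is what makes a single pixel-map regression architecture plausible for all of them.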


        In this paper, we address three of these tasks: depth prediction, surface normal estimation and semantic segmentation, all using a single common architecture. Our multi-scale approach generates pixel-maps directly from an input image, without the need for low-level superpixels or contours, and is able to align to many image details using a series of convolutional network stacks applied at increasing resolution. At test time, all three outputs can be generated in real time (~30Hz). We achieve state-of-the-art results on all three tasks we investigate, demonstrating our model’s versatility.
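
The coarse-to-fine flow described above can be sketched as a toy example. This only illustrates the control flow of refining a prediction at increasing resolution. In the paper each scale is a learned convolutional stack, whereas here `refine` is a hypothetical stand-in that simply adds finer-scale detail to an upsampled coarse prediction. The names `upsample_nearest`, `refine`, and `coarse_to_fine`, and the upsampling factor of 2, are assumptions.

```python
def upsample_nearest(grid, factor):
    """Nearest-neighbour upsampling of a 2-D grid by an integer factor."""
    out = []
    for row in grid:
        # Repeat each value horizontally, then repeat the row vertically.
        wide = [value for value in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def refine(coarse, detail):
    """Hypothetical stand-in for a learned stack.

    Combines an upsampled coarse prediction with same-sized
    finer-scale detail by element-wise addition.
    """
    return [[c + d for c, d in zip(c_row, d_row)]
            for c_row, d_row in zip(coarse, detail)]


def coarse_to_fine(coarse, details, factor=2):
    """Pass a prediction through successive scales of increasing resolution."""
    prediction = coarse
    for detail in details:
        prediction = refine(upsample_nearest(prediction, factor), detail)
    return prediction
```

Each entry of `details` must have the resolution of the prediction after upsampling, so a 1x2 coarse map followed by two refinement steps yields a 4x8 output.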


        There are several advantages in developing a general model for pixel-map regression. First, applications to new tasks may be quickly developed, with much of the new work lying in defining an appropriate training set and loss function; in this light, our work is a step towards building off-the-shelf regressor models that can be used for many applications. In addition, use of a single architecture helps simplify the implementation of systems that require multiple modalities, e.g. robotics or augmented reality. Lastly, in the case of depth and normals, much of the computation can be shared between modalities, making the system more efficient.



Reference: 07-Eigen D, Fergus R. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture[C]//Proceedings of the IEEE international conference on computer vision. 2015: 2650-2658. Cited 1982 times.


posted @ 2021-06-18 11:32  ProfSnail