A Point Set Generation Network for 3D Object Reconstruction from a Single Image: Abstract and Introduction (Translated)

A Point Set Generation Network for 3D Object Reconstruction from a Single Image

Abstract

       Generation of 3D data by deep neural networks has been attracting increasing attention in the research community. The majority of extant works resort to regular representations such as volumetric grids or collections of images; however, these representations obscure the natural invariance of 3D shapes under geometric transformations, and also suffer from a number of other issues. In this paper we address the problem of 3D reconstruction from a single image, generating a straight-forward form of output: point cloud coordinates. Along with this problem arises a unique and interesting issue, namely that the groundtruth shape for an input image may be ambiguous. Driven by this unorthodox output form and the inherent ambiguity in the groundtruth, we design a novel and effective architecture, loss function, and learning paradigm. Our final solution is a conditional shape sampler, capable of predicting multiple plausible 3D point clouds from an input image. In experiments, not only can our system outperform state-of-the-art methods on single-image-based 3D reconstruction benchmarks, but it also shows strong performance for 3D shape completion and a promising ability to make multiple plausible predictions.

1.     Introduction

        As we try to duplicate the success of current deep convolutional architectures in the 3D domain, we face a fundamental representational issue. Extant deep net architectures for both discriminative and generative learning in the signal domain are well suited to data that is regularly sampled, such as images, audio, or video. However, most common 3D geometry representations, such as 2D meshes or point clouds, are not regular structures and do not easily fit into architectures that exploit such regularity for weight sharing, etc. That is why the majority of extant works on using deep nets for 3D data resort to either volumetric grids or collections of images (2D views of the geometry). Such representations, however, lead to difficult trade-offs between sampling resolution and net efficiency. Furthermore, they enshrine quantization artifacts that obscure natural invariances of the data under rigid motions, etc.
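
The resolution/efficiency trade-off mentioned above can be made concrete with a little arithmetic (an illustrative sketch, not a calculation from the paper): a dense occupancy grid grows cubically with resolution, while a point cloud grows only linearly with the number of points.

```python
# Illustrative sketch: storage cost of a dense voxel grid vs. a point cloud.
def voxel_cells(resolution):
    """Number of cells in a dense resolution^3 occupancy grid."""
    return resolution ** 3

def point_cloud_floats(n_points):
    """Number of floats needed to store n_points (x, y, z) coordinates."""
    return 3 * n_points

print(voxel_cells(32))           # 32768 cells
print(voxel_cells(128))          # 2097152 cells: 64x more for 4x resolution
print(point_cloud_floats(1024))  # 3072 floats for a 1024-point cloud
```

Doubling the grid resolution multiplies the cost by eight, which is why voxel-based nets are typically limited to coarse grids such as 32^3.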

        In this paper we address the problem of generating the 3D geometry of an object based on a single image of that object. We explore generative networks for 3D geometry based on a point cloud representation. A point cloud representation may not be as efficient in representing the underlying continuous 3D geometry as a CAD model using geometric primitives, or even a simple mesh, but for our purposes it has many advantages. A point cloud is a simple, uniform structure that is easier to learn, as it does not have to encode multiple primitives or combinatorial connectivity patterns. In addition, a point cloud allows simple manipulation when it comes to geometric transformations and deformations, as connectivity does not have to be updated. Our pipeline infers the point positions in a 3D frame determined by the input image and the inferred viewpoint position.
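
The ease of manipulation noted above can be sketched in a few lines (a toy example, not the paper's code): a point cloud is just an (N, 3) array, so a rigid motion is a single matrix multiply, with no mesh connectivity to update.

```python
import numpy as np

def rotate_z(points, theta):
    """Rotate an (N, 3) point cloud about the z axis by angle theta."""
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s, 0.0],
                  [s,  c, 0.0],
                  [0.0, 0.0, 1.0]])
    return points @ R.T

cloud = np.random.rand(1024, 3)          # a toy 1024-point cloud
rotated = rotate_z(cloud, np.pi / 4)

# Rigid motion preserves distances from the origin; nothing else changes.
print(np.allclose(np.linalg.norm(rotated, axis=1),
                  np.linalg.norm(cloud, axis=1)))  # True
```

The same one-liner applied to a mesh would additionally require carrying the face/edge lists along unchanged; for deformations that split or merge geometry, the point cloud needs no bookkeeping at all.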

        Given this unorthodox network output, one of our challenges is how to measure loss during training, as the same geometry may admit different point cloud representations at the same degree of approximation. Unlike the usual L2-type losses, we use the solution of a transportation problem based on the Earth Mover's distance (EMD), effectively solving an assignment problem. We exploit an approximation to the EMD to provide speed as well as to ensure differentiability for end-to-end training.
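
For intuition, the exact EMD between two equal-size point sets can be computed by solving the assignment problem directly (a sketch using SciPy's Hungarian solver; the paper instead uses a fast, differentiable approximation, since an exact solver is too slow and non-differentiable for training):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def emd(set_a, set_b):
    """Exact EMD between two equal-size point sets via optimal matching."""
    cost = cdist(set_a, set_b)                # pairwise Euclidean distances
    rows, cols = linear_sum_assignment(cost)  # optimal one-to-one matching
    return cost[rows, cols].sum()

a = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
b = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
print(emd(a, b))  # 0.0: the same set in a different order costs nothing
```

This is exactly the property the paper needs: unlike an L2 loss on raw coordinates, the matching makes the loss invariant to the ordering of points within each set.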

        Our approach effectively attempts to solve the ill-posed problem of 3D structure recovery from a single projection using certain learned priors. The network has to estimate depth for the visible parts of the image and hallucinate the rest of the object geometry, assessing the plausibility of several different completions. From a statistical perspective, it would be ideal if we could fully characterize the landscape of the ground truth space, or be able to sample plausible candidates accordingly. If we view this as a regression problem, then it has a rather unique and interesting feature arising from inherent object ambiguities in certain views. These are situations where there are multiple, equally good 3D reconstructions of a 2D image, making our problem very different from classical regression/classification settings, where each training sample has a unique ground truth annotation. In such settings the proper loss definition can be crucial to getting the most meaningful result.

       Our final algorithm is a conditional sampler, which samples plausible 3D point clouds from the estimated ground truth space given an input image. Experiments on both synthetic and real-world data verify the effectiveness of our method. Our contributions can be summarized as follows:
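
The shape of a conditional sampler can be illustrated with a toy stand-in (all names and sizes here are invented, and the single linear layer is nothing like the paper's network): it maps an image feature vector plus a random vector to an (N, 3) point cloud, so different random draws yield different plausible shapes for the same input image.

```python
import numpy as np

rng = np.random.default_rng(0)
N_POINTS, FEAT, NOISE = 256, 128, 8

# One linear layer standing in for a trained generator (weights are random).
W = rng.normal(0.0, 0.1, size=(FEAT + NOISE, N_POINTS * 3))

def sample_shape(image_feature, rng):
    """Draw one plausible point cloud for the given image feature."""
    z = rng.normal(size=NOISE)           # random code selecting "which shape"
    h = np.concatenate([image_feature, z])
    return (h @ W).reshape(N_POINTS, 3)  # predicted point coordinates

feat = rng.normal(size=FEAT)             # stands in for a CNN image encoding
s1 = sample_shape(feat, rng)
s2 = sample_shape(feat, rng)
print(s1.shape, np.allclose(s1, s2))     # two distinct samples for one image
```

The point of the construction is the random vector: conditioning the output on both the image and a noise code is what lets one network represent several equally good reconstructions of an ambiguous view.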

- We use deep learning techniques to study the point set generation problem;

- On the task of 3D reconstruction from a single image, we apply our point set generation network and significantly outperform the state of the art;

- We systematically explore issues in the architecture and loss function design for point generation networks;

- We discuss and address the ground-truth ambiguity issue for the task of 3D reconstruction from a single image.

Source code demonstrating our system can be obtained from https://github.com/fanhqme/PointSetGeneration.

posted @ 2021-06-18 11:37  ProfSnail