Automatic Scene Inference for 3D Object Compositing: Abstract and Introduction


       We present a user-friendly image editing system that supports drag-and-drop object insertion (where the user merely drags objects into the image, and the system automatically places them in 3D and relights them appropriately), post-process illumination editing, and depth-of-field manipulation. Underlying our system is a fully automatic technique for recovering a comprehensive 3D scene model (geometry, illumination, diffuse albedo and camera parameters) from a single, low dynamic range[1] photograph. This is made possible by two novel contributions: an illumination inference algorithm that recovers a full lighting model of the scene (including light sources that are not directly visible in the photograph), and a depth estimation algorithm that combines data-driven depth transfer with geometric reasoning about the scene layout. A user study shows that our system produces perceptually convincing results, and achieves the same level of realism as techniques that require significant user interaction.


       Categories and Subject Descriptors: I.2.10 [Artificial Intelligence]: Vision and Scene Understanding; I.3.6 [Computer Graphics]: Methodology and Techniques.


       Additional Key Words and Phrases: Illumination inference, depth estimation, scene reconstruction, physically grounded, image-based rendering, image editing


1. INTRODUCTION

Many applications require a user to insert 3D characters, props, or other synthetic objects into images. In many existing photo editors, it is the artist’s job to create photorealistic effects by recognizing the physical space present in an image. For example, to add a new object into an image, the artist must determine how the object will be lit, where shadows will be cast, and the perspective at which the object will be viewed. In this paper, we demonstrate a new kind of image editor – one that computes the physical space of the photograph automatically, allowing an artist (or, in fact, anyone) to make physically grounded edits with only a few mouse clicks. 


Our system works by inferring the physical scene (geometry, illumination, etc.) that corresponds to a single LDR photograph. This process is fully automatic, requires no special hardware, and works for legacy images. We show that our inferred scene models can be used to facilitate a variety of physically-based image editing operations. For example, objects can be seamlessly inserted into the photograph, light source intensity can be modified, and the picture can be refocused on the fly. Achieving these edits with existing software is a painstaking process that takes a great deal of artistry and expertise.


In order to facilitate realistic object insertion and rendering, we need to hypothesize camera parameters, scene geometry, surface materials, and sources of illumination. To address this, we develop a new method for both single-image depth and illumination inference. We are able to build a full 3D scene model without any user interaction, including camera parameters and reflectance estimates.


Contributions. Our primary contribution is a completely automatic algorithm for estimating a full 3D scene model from a single LDR photograph. Our system contains two technical contributions: illumination inference and depth estimation. We have developed a novel, data-driven illumination estimation procedure that automatically estimates a physical lighting model for the entire scene (including out-of-view light sources). This estimation is aided by our single-image light classifier to detect emitting pixels, which we believe is the first of its kind. We also demonstrate state-of-the-art depth estimates by combining data-driven depth inference with geometric reasoning.


We have created an intuitive interface for inserting 3D models seamlessly into photographs, using our scene approximation method to relight the object and facilitate drag-and-drop insertion. Our interface also supports other physically grounded image editing operations. In a user study, we show that our system is capable of making photorealistic edits: in side-by-side comparisons of ground truth photos with photos edited by our software, subjects had a difficult time choosing the ground truth.


Limitations. Our method works best when scene lighting is diffuse, and therefore generally works better indoors than out (see our user studies and results in Sec 7). Our scene models are clearly not canonical representations of the imaged scene and often differ significantly from the true scene components. These coarse scene reconstructions suffice in many cases to produce realistically edited images. However, in some cases, errors in geometry, illumination, or materials may be stark enough to manifest themselves in unappealing ways while editing. For example, inaccurate geometry could cause odd-looking shadows for inserted objects, and inserting light sources can exacerbate geometric errors. Also, our editing software does not handle object insertion behind existing scene elements automatically, and cannot be used to deblur an image taken with a wide aperture. A Manhattan World is assumed in our camera pose and depth estimation stages, but our method is still applicable in scenes where this assumption does not hold (see Fig 10).


Fig. 2. Our system allows for physically grounded image editing (e.g., the inserted dragon and chair on the right), facilitated by our automatic scene estimation procedure. To compute a scene from a single image, we automatically estimate dense depth and diffuse reflectance (the geometry and materials of our scene). Sources of illumination are then inferred without any user input to form a complete 3D scene, conditioned on the estimated scene geometry. Using a simple, drag-and-drop interface, objects are quickly inserted and composited into the input image with realistic lighting, shadowing, and perspective. Photo credits: ©Salvadonica Borgo.


3. METHOD OVERVIEW


       Our method consists of three primary steps, outlined in Fig 2. First, we estimate the physical space of the scene (camera parameters and geometry), as well as the per-pixel diffuse reflectance (Sec 4). Next, we estimate scene illumination (Sec 5) which is guided by our previous estimates (camera, geometry, reflectance). Finally, our interface is used to composite objects, improve illumination estimates, or change the depth-of-field (Sec 6). We have evaluated our method with a large-scale user study (Sec 7), and additional details and results can be found in the corresponding supplemental document. Figure 2 illustrates the pipeline of our system.

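To make the three-stage structure concrete, the following is a minimal sketch of the pipeline in Python. Every function name, the SceneModel type, and the trivial stub bodies are hypothetical placeholders standing in for the paper's actual estimators, not the authors' implementation.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SceneModel:
    intrinsics: np.ndarray  # 3x3 pinhole camera matrix (Sec 4)
    depth: np.ndarray       # HxW dense depth map (Sec 4)
    albedo: np.ndarray      # HxWx3 per-pixel diffuse reflectance (Sec 4)
    lights: list            # polygonal area sources + spherical IBLs (Sec 5)

def estimate_camera(image):
    h, w = image.shape[:2]
    f = 0.5 * max(h, w)  # placeholder focal-length guess
    return np.array([[f, 0.0, w / 2], [0.0, f, h / 2], [0.0, 0.0, 1.0]])

def estimate_depth(image, K):
    return np.ones(image.shape[:2])  # stand-in: constant unit depth

def estimate_reflectance(image):
    return image.astype(np.float64)  # stand-in: albedo = input image

def infer_illumination(image, K, depth, albedo):
    return []  # stand-in: no light sources recovered

def build_scene(image):
    # Step 1 (Sec 4): physical space (camera, geometry) and materials.
    K = estimate_camera(image)
    depth = estimate_depth(image, K)
    albedo = estimate_reflectance(image)
    # Step 2 (Sec 5): illumination, conditioned on the estimates above.
    lights = infer_illumination(image, K, depth, albedo)
    # Step 3 (Sec 6): the resulting model is consumed by the editing interface.
    return SceneModel(K, depth, albedo, lights)

scene = build_scene(np.zeros((480, 640, 3)))
```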

       Scene parameterization. Our geometry is in the form of a depth map, which is triangulated to form a polygonal mesh (depth is unprojected according to our estimated pinhole camera). Our illumination model contains polygonal area sources, as well as one or more spherical image-based lights.

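A minimal sketch of this parameterization, assuming a known 3x3 pinhole intrinsic matrix K: each pixel is unprojected along its camera ray by its depth value, and the pixel grid is connected into two triangles per cell. This is one plausible construction of the depth-map mesh, not the authors' exact meshing code.

```python
import numpy as np

def depth_to_mesh(depth, K):
    """Unproject an HxW depth map through pinhole intrinsics K and
    triangulate the pixel grid into a polygonal mesh."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pixels = np.stack([u, v, np.ones_like(u)], -1).reshape(-1, 3).T  # 3xN homogeneous
    rays = np.linalg.inv(K) @ pixels            # back-projected rays with unit z
    verts = (rays * depth.reshape(1, -1)).T     # Nx3 points: scale each ray by its depth

    # Split every grid cell into two triangles (row-major vertex indices).
    idx = np.arange(h * w).reshape(h, w)
    a, b = idx[:-1, :-1].ravel(), idx[:-1, 1:].ravel()
    c, d = idx[1:, :-1].ravel(), idx[1:, 1:].ravel()
    faces = np.concatenate([np.stack([a, b, c], 1), np.stack([b, d, c], 1)])
    return verts, faces

K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
verts, faces = depth_to_mesh(np.full((480, 640), 2.0), K)  # toy constant-depth map
```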

       While unconventional, our models are suitable for most off-the-shelf rendering software, and we have found our models to produce better looking estimates than simpler models (e.g. planar geometry with infinitely distant lighting).


       Automatic indoor/outdoor scene classification. As a preprocessing step, we automatically detect whether the input image is indoors or outdoors. We use a simple method: k-nearest-neighbor matching of GIST features [Oliva and Torralba 2001] between the input image and all images from the indoor NYUv2 dataset and the outdoor Make3D dataset. We choose k = 7, and use majority voting to determine if the image is indoors or outdoors (e.g. if 4 of the nearest neighbors are from the Make3D dataset, we consider it to be outdoors). More sophisticated methods could also work.

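The voting step can be sketched as follows, assuming GIST descriptors for the query image and for both training sets have already been computed (the descriptor extraction itself is not shown); the function and variable names are illustrative.

```python
import numpy as np

def classify_outdoor(query_gist, indoor_gists, outdoor_gists, k=7):
    """Majority vote over the k nearest GIST neighbors:
    label 0 = indoor (NYUv2), label 1 = outdoor (Make3D)."""
    feats = np.vstack([indoor_gists, outdoor_gists])
    labels = np.r_[np.zeros(len(indoor_gists)), np.ones(len(outdoor_gists))]
    dists = np.linalg.norm(feats - query_gist, axis=1)  # L2 distance in GIST space
    votes = labels[np.argsort(dists)[:k]]
    return votes.sum() > k / 2  # True -> treat the image as outdoors

# Toy usage with random 512-D descriptors standing in for real GIST features.
rng = np.random.default_rng(0)
outdoor = classify_outdoor(rng.normal(size=512),
                           rng.normal(size=(100, 512)),
                           rng.normal(size=(100, 512)))
```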

       Our method uses different training images and classifiers depending on whether the input image is classified as an indoor or outdoor scene.




[1] Low Dynamic Range (LDR): the image can represent only a limited range of tonal values.
