Massive Model Rendering Techniques
Andreas Dietrich Enrico Gobbetti Sung-Eui Yoon
Abstract
We present an overview of current real-time massive model visualization
technology, with the goal of providing readers with a high-level understanding
of the domain, as well as with pointers to the literature.
I. INTRODUCTION
Interactive visualization and exploration of massive 3D models is a
crucial component of many scientific and engineering disciplines and is
becoming increasingly important for simulations, education, and entertainment
applications such as movies and games. In all those fields, we are observing a
data explosion, i.e., the quantity of information is increasing exponentially. Typical
sources of rapidly increasing massive data include the following:
• Large-scale engineering projects. Today, complete aircraft, ships,
cars, etc. are designed purely digitally. Usually, many geographically dispersed
teams are involved in such a complex process, creating thousands of different
parts that are modeled at the highest possible accuracy. For example, the
Boeing 777 airplane seen in Figure
• Scientific simulations. Numerical simulations of natural real world
effects can produce vast amounts of data that need to be visualized to be
scientifically interpreted. Examples include nuclear reactions, jet engine
combustion, and fluid dynamics, to mention a few. Increased numerical accuracy
as well as faster computation can lead to datasets of gigabyte or even terabyte
size (Figure 1b).
• Acquisition and measuring of real-world objects. Apart from modeling and
computing geometry, scanning of real-world objects is a common way of acquiring
model data. Improvements in measuring equipment allow scanning in sub-mm
accuracy range, which can result in millions to billions of samples per object
(Figure
• Modeling natural environments. Natural landscapes contain an incredible
amount of visual detail. Even for a limited field of view, hundreds of
thousands of individual plants might be visible. Moreover, plants are made of
highly complex structures themselves, e.g., countless leaves, complicated
branchings, wrinkled bark, etc. Even modeling only some of these effects can
produce excessive quantities of data. For example, the landscape model depicted
in Figure 1d measures “only” a square area of
Handling such massive models presents important challenges to developers.
This is particularly true for highly interactive 3D programs, such as visual
simulations and virtual environments, with their inherent focus on interactive,
low latency, and real-time processing.
In the last decade, the graphics community has witnessed tremendous
improvements in the performance and capabilities of computing and graphics
hardware. The question therefore naturally arises whether such a performance
boost does not turn rendering performance problems into memories of the
past. A single standard dual-core 3 GHz Opteron processor has roughly 20
GFlops, a PlayStation 3 has 180 GFlops, and a GPU that supports general
programmability can deliver up to 340 GFlops. With the increasing use of
hardware parallelism, performance gains even exceed the exponential growth
predicted by Moore's law. Despite these improvements in computing and graphics
processing power, graphics applications cannot, for the foreseeable future,
rely on hardware developments alone to cope with growing dataset sizes. This
is not only because increased computing power allows users to create ever
more complex datasets, but also because memory bandwidth grows significantly
more slowly than processing power and thus becomes the major bottleneck when
handling massive datasets.
As a result, massive datasets cannot be interactively rendered by brute
force methods. To overcome this limitation, researchers have proposed a wide
variety of output-sensitive rendering algorithms, i.e., rendering techniques
whose runtime and memory footprint are proportional to the number of image
pixels, not to the total model complexity. In addition to requiring out-of-core
data management, for handling datasets larger than main memory or for providing
applications the ability to explore data stored on remote servers, these
methods require the integration of techniques for filtering out as efficiently
as possible the data that is not contributing to a particular image.
This article provides an overview of current massive model rendering
technology, with the goal of providing readers with a high-level understanding
of the domain, as well as with pointers to the literature. The main focus will
be on rendering of large static polygonal models, which are by far the current
main test case for massive model visualization. We will first discuss the two
main rendering techniques (Section II) employed in rendering massive models:
rasterization and ray tracing. We will then illustrate how rendering complexity
can be reduced by employing appropriate data structures and algorithms for
visibility or detail culling, as well as by choosing alternate graphics
primitive representations (Section III). We will further focus on data
management (Section IV) and parallel processing issues (Section V), which are
increasingly important on current architectures. The article concludes with an
overview of how the various techniques are integrated into representative
state-of-the-art systems, and a discussion of the benefits and limitations of
the various approaches (Section VII).
II. TWO MAIN RENDERING TECHNIQUES
Rendering – the image generation process that takes place during
visualization of geometric models – requires calculating each primitive’s
contribution to each pixel. It involves three main aspects: projection,
visible-surface determination, and shading.
Projection transforms 3D objects into 2D for viewing. This is typically a
planar perspective projection, where 3D points are projected onto a 2D image
plane based on a center of projection (see Figure 3). Visible surface
determination is necessary to compute which parts of a scene are actually
visible to an observer, and which parts are occluded by other surfaces.
Finally, shading means the computation of the appearance of visible surface
fragments. For example, in case of photo-realistic image generation, shading
can result in complicated light transport simulations, which in turn make it
necessary to also determine visibility between surfaces.
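The planar perspective projection described above can be sketched in a few lines. This is only an illustration under stated assumptions: the center of projection is placed at the origin, the viewer looks down the positive z axis, and the image plane sits at z = focal_length. The function name `project_point` and its parameters are hypothetical and do not come from the article.

```python
def project_point(point, focal_length=1.0):
    """Perspective-project a 3D point onto the image plane z = focal_length.

    Assumption: the center of projection is the origin and the viewer
    looks along +z, so points with z <= 0 cannot be projected.
    """
    x, y, z = point
    if z <= 0.0:
        raise ValueError("point must lie in front of the center of projection")
    scale = focal_length / z  # similar triangles: image size shrinks with depth
    return (x * scale, y * scale)
```

Doubling the depth of a point halves its projected coordinates, which is the foreshortening effect of a perspective projection.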
As of today, two rendering algorithms are applied almost exclusively:
rasterization and ray tracing, mainly because of their robustness and
simplicity.
A. Rasterization
Rasterization algorithms combined with the Z-buffer are widely used in
interactive rendering, and are implemented in virtually all modern graphics
boards in the form of highly parallel graphics processing units (GPUs).
Rasterization is an example of an object-order rendering approach (also called
forward-mapping): Objects to be rendered are sequentially projected onto the
image plane, where they are rasterized into pixels and shaded. Visibility is
resolved with the help of the Z-buffer, which stores for each pixel the
distance of the respective visible object fragment to the observer.
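As a minimal sketch of how the Z-buffer resolves visibility, the following code keeps, for each pixel, the fragment closest to the observer. The fragment tuples `(x, y, depth, color)` are a hypothetical stand-in for the output of triangle rasterization; real hardware performs this test per fragment in parallel.

```python
import math

def resolve_visibility(fragments, width, height):
    """Keep the nearest fragment per pixel using a depth buffer.

    `fragments` is an iterable of (x, y, depth, color) tuples
    (an assumed format chosen for this illustration).
    """
    depth = [[math.inf] * width for _ in range(height)]
    color = [[None] * width for _ in range(height)]
    for x, y, z, c in fragments:
        inside = 0 <= x < width and 0 <= y < height
        if inside and z < depth[y][x]:  # depth test: nearer fragment wins
            depth[y][x] = z
            color[y][x] = c
    return color, depth
```

Because the depth test is order independent, objects can be streamed through in any order, which is what makes the object-order approach practical.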
This process can be efficiently realized in a pipeline setup, commonly
known as the graphics pipeline. Figure 2 illustrates a generic graphics pipeline
architecture. Early graphics hardware was based on a hardwired implementation
of this architecture, with multiple geometry and rasterization units made to
work in parallel to further increase performance. In recent years, graphics
hardware has started to feature extensions to the fixed-function pipeline, in a
way that parts of the geometry stage and rasterizer stage can be programmed
using vertex- and fragment-shaders. GPU designs have further evolved, and,
nowadays, instead of being based on separate custom processors for vertex
shaders, geometry shaders, and pixel shaders, the pipeline is realized on top
of a large grid of data-parallel floating-point processors general enough to
implement shader functionalities. Vertices, triangles, and pixels thus
recirculate through the grid rather than flowing through a pipeline with stages
of fixed width, and allocation of the pool of processors to each shader type
can dynamically vary to respond to varying graphics load.
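The rasterization pipeline allows an arbitrary number of primitives to be
processed in a stream-like manner, which is particularly useful when the
geometry does not fit into graphics memory. In its basic form, however,
rasterization is limited to a time complexity that is linear in the number of
scene primitives, a direct consequence of its object-order nature. To reach
logarithmic rendering time, spatial index structures must be used to build a
render queue that limits the number of polygons sent to the pipeline. Moreover,
because of the widening gap between GPU performance and the bandwidth of the
memory hierarchy, appropriate techniques must be employed to carefully manage
the working set and to ensure coherent access.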
B. Ray Tracing
In contrast to rasterization, ray casting and its recursive extension ray
tracing are image order rendering (backward mapping) approaches. Ray tracing
closely models physical light transport as straight lines. In their classical
form, ray tracing algorithms start by shooting imaginary (primary) rays from
the observer (i.e., the projection center) through the pixel grid into a 3D
scene (see Figure 3). For each ray the closest intersection with the model’s
surfaces is determined.
To find out if a surface hit point is lit, so-called shadow rays are
spawned into the direction of each light source. If such a ray is blocked, the
hitpoint lies in shadow. Light propagation between surfaces can be computed by
recursively tracing secondary rays, which emanate from previously found
hitpoints. In Figure 3, e.g., a reflective surface has been hit, thus, the
incoming ray is mirrored and fired again to find surfaces that are visible in
the reflection.
A basic ray tracing implementation can be very simple, and can be
realized with much less effort than a (software) rasterizer. For example, all
parts of the rasterizer geometry stage (see Figure 2) are handled implicitly as
a result of the backward projection property.
In a naive
implementation, the closest intersection would be calculated by sequentially
testing each primitive of the scene for each ray, which is obviously only
practical for very small models. To limit the number of primitives for which
the actual ray–primitive intersection test is performed, again spatial index
structures are necessary. In a modern ray tracer such acceleration structures
are typically considered to be an integral part of the algorithm, and allow for
an average logarithmic time complexity with respect to the number of
primitives.
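The naive closest-hit search and the shadow-ray test can be sketched as follows. Spheres are used as primitives purely for brevity; this choice, like all function names here, is an assumption of the example and not something the article prescribes. Ray directions are assumed to be normalized.

```python
import math

def intersect_sphere(origin, direction, center, radius):
    """Return the smallest positive ray parameter t, or None on a miss."""
    oc = [o - c for o, c in zip(origin, center)]
    b = sum(d * v for d, v in zip(direction, oc))
    c = sum(v * v for v in oc) - radius * radius
    disc = b * b - c
    if disc < 0.0:
        return None
    root = math.sqrt(disc)
    for t in (-b - root, -b + root):
        if t > 1e-6:  # small epsilon avoids re-hitting the ray's own origin
            return t
    return None

def closest_hit(origin, direction, spheres):
    """Naive linear search: test every primitive, keep the nearest hit."""
    best_t, best_index = math.inf, None
    for i, (center, radius) in enumerate(spheres):
        t = intersect_sphere(origin, direction, center, radius)
        if t is not None and t < best_t:
            best_t, best_index = t, i
    return best_t, best_index

def in_shadow(point, light, spheres):
    """Fire a shadow ray toward the light; a hit before the light blocks it."""
    to_light = [l - p for l, p in zip(light, point)]
    dist = math.sqrt(sum(v * v for v in to_light))
    direction = [v / dist for v in to_light]
    t, _ = closest_hit(point, direction, spheres)
    return t < dist
```

The loop in `closest_hit` is exactly the per-ray linear cost that acceleration structures are meant to avoid.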
Probably, the most well-known property of ray tracing is the ability to
generate highly photo-realistic imagery (an example can be seen in Figure 4).
While in classical ray tracing rays are only shot into the most significant
directions (e.g., towards lights), accurately computing global lighting
(including glossiness and indirect light propagation) requires the solution of
high-dimensional integrals over the path-space of
light transport in a scene. Today, numeric Monte-Carlo (MC) integration
techniques in combination with ray tracing are used almost exclusively for this
purpose [7].
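To give a feel for Monte-Carlo integration in this setting, the sketch below estimates a much simpler integral than full light transport: the cosine-weighted integral of constant unit radiance over a hemisphere, whose exact value is π. It assumes uniform hemisphere sampling, for which cos θ is uniformly distributed in [0, 1) and the sampling density is 1/(2π). The function name and the seed parameter are illustrative only.

```python
import math
import random

def estimate_cosine_integral(num_samples, seed=0):
    """Monte-Carlo estimate of the hemisphere integral of cos(theta).

    Each sample contributes f / pdf, and the estimate is the sample mean.
    """
    rng = random.Random(seed)
    pdf = 1.0 / (2.0 * math.pi)  # uniform density over the hemisphere
    total = 0.0
    for _ in range(num_samples):
        cos_theta = rng.random()  # uniform directions give uniform cos(theta)
        total += cos_theta / pdf
    return total / num_samples
```

In a ray tracer, each sample would correspond to tracing one secondary ray and evaluating the incoming radiance along it, rather than using a constant.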
Another advantage resulting from the recursive nature of ray tracing
algorithms is that shaders can be deployed in a plug-and-play manner. The
characteristics of different surfaces can be described independently, and all
optical effects are automatically combined in a physically correct way. The
most eminent disadvantage of ray tracing is its high computational complexity, especially
for higher order lighting effects. Because of this, it has been employed in a
real-time context only in recent years [24]. While prototype hardware
implementations exist [25], only software ray tracing has so far been applied to
massive models.
An important ingredient that is widely applied in state-of-the-art
real-time ray tracing systems is to simultaneously trace bundles of rays called
packets. First, working on packets allows for using SIMD (single instruction
multiple data) vector operations of modern CPUs to perform parallel traversal
and intersection of multiple rays. Second, packets enable deferred shading,
i.e., it is not necessary to switch between intersection and shading routines
for every single ray, thus amortizing memory accesses, function calls, etc.
Third, frustum traversal methods [18] avoid traversal steps and intersection
calculations based on the bounds of ray packets, therefore making better use of
object as well as scanline coherence.
Both rasterization and ray tracing algorithms can perform rendering – and
most importantly, visible-surface determination – in logarithmic time
complexity with respect to scene size. However, this is only possible if
spatial index structures and hierarchical rendering are involved (see below).
The main advantage of rasterization algorithms is the ability to
efficiently exploit scan line coherence. Only the corners of triangles need to
be explicitly projected, while the geometric properties of the pixels enclosed
by the triangle edges can be easily interpolated during the actual
rasterization step. Consequently, such techniques work best in cases where
large screen space areas are covered by a few triangles. Conversely, ray tracing
and ray casting perform better if visibility needs to be evaluated point-wise.
Interestingly, hierarchical front-to-back rasterizing in combination with
occlusion culling techniques can be interpreted as a form of beam- or frustum
tracing.
When it comes to advanced shading effects, ray tracing algorithms are
usually the premier choice, mainly because of their physical accuracy and
implementation flexibility. Especially in highly complex models, the ability to
perform shading and light transport adaptively proves to be helpful.
Implementing advanced global shading algorithms in rasterization architectures is
more complex, because during the shading stage no global access to the scene
database is possible, and visibility between surface points cannot explicitly
be determined. The problem can in principle be overcome by computing maps, and applying
them in successive rendering passes. However, this does not easily allow for
adaptive rendering, e.g., controlling the reflection recursion depth on a per
pixel basis.
As of today, most massive model rendering systems use exclusively one of
the two presented rendering techniques. It is, however, likely that future
systems will incorporate hybrid approaches, in particular as graphics hardware
is becoming more and more general purpose oriented, and will allow for executing
rasterization and ray tracing side-by-side.
III. COMPLEXITY REDUCTION TECHNIQUES
In order to meet timing constraints, massive model visualization
applications have to employ methods for filtering out as efficiently as possible
the data that is not contributing to a particular image.
One straightforward approach is to simplify complex models until they
become manageable by the application: if models are too complex, make them
simpler! This static “throwaway-input-data approach” might seem simplistic, but
can be considered beneficial in a number of practical use cases. A common
application of static simplification is reducing the complexity of very densely
over-sampled models.
For instance, models generated by scanning devices and isosurfaces
extracted by algorithms such as marching cubes are often uniformly
over-tessellated because of the nature of most reconstruction algorithms. By
adaptively simplifying meshes so that local triangle density adapts to local
curvature it is often possible to radically reduce triangles without noticeable
error. More generally, users may want to produce an approximation which is
tailored for a specific use, e.g., viewing from a distance. In the more general
case, however, the quality loss incurred when using off-line simplification
techniques is not acceptable, and the application must resort to more general
adaptive techniques able to filter data at run-time. These methods can be
broadly classified into view-dependent level-of-detail algorithms and
visibility culling methods.
Level-of-detail (LOD) techniques reduce memory access and rendering complexity
by exploiting multi-resolution data structures for dynamically adapting the
resolution of the dataset (the number of required model samples per pixel),
while visibility culling techniques achieve the same goal by avoiding the
processing of parts that can be proved invisible because they are out of view (in the
case of view-frustum culling) or masked (in the case of occlusion culling).
A. Geometric Simplification
Geometric simplification is a well-studied subject, and a number of
high-quality automatic simplification techniques have been developed [17].
The vast majority of methods iteratively simplify an input mesh by
sequences of vertex removals or edge contractions (see Figure 5). In the first
case, originally introduced by Schroeder [19], a vertex is removed from the
mesh at each simplification step and the resulting hole
is triangulated. In the second case, popularized by Hoppe [12], the two
endpoints of an edge are contracted to a single point and the triangles that
degenerate in the process are removed.
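The edge contraction operation can be sketched on an indexed triangle list. The representation (triangles as index triples) and the function name `contract_edge` are assumptions made for this example. For simplicity, the merged vertex keeps the position of the first endpoint, whereas practical systems usually compute an optimized position.

```python
def contract_edge(triangles, u, v):
    """Contract edge (u, v) by merging vertex v into vertex u.

    Triangles that referenced both endpoints degenerate (two of their
    corners coincide) and are removed from the result.
    """
    result = []
    for tri in triangles:
        merged = tuple(u if index == v else index for index in tri)
        if len(set(merged)) == 3:  # keep only non-degenerate triangles
            result.append(merged)
    return result
```

For an interior edge of a manifold mesh, one contraction removes one vertex and the two triangles adjacent to the edge.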
The control of the approximation accuracy is critical in the process, if
only to guide the order in which to perform operations. The error evaluation
method most frequently used in current applications is the quadric error
metrics (QEM), originally proposed by
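In this method, each vertex is associated with a quadric that encodes the sum
of the squared distances from the vertex to its adjacent faces. The benefits of
quadrics are the small amount of space needed, since only a symmetric
matrix has to be stored, and an extremely fast error evaluation that consists
of simple vector-matrix operations.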
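A minimal sketch of the quadric error computation follows, assuming each adjacent face is given by its plane equation ax + by + cz + d = 0 with a unit normal (a, b, c). The helper names are hypothetical. The quadric of a plane is the outer product of its coefficient vector with itself, quadrics are summed per vertex, and the error of a position is a single quadratic form.

```python
def plane_quadric(a, b, c, d):
    """4x4 quadric p * p^T of the plane p = (a, b, c, d), unit normal assumed."""
    p = (a, b, c, d)
    return [[pi * pj for pj in p] for pi in p]

def add_quadrics(q1, q2):
    """Quadrics accumulate by plain matrix addition."""
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(q1, q2)]

def quadric_error(q, position):
    """Sum of squared plane distances: v^T Q v with v = (x, y, z, 1)."""
    v = (position[0], position[1], position[2], 1.0)
    return sum(v[i] * q[i][j] * v[j] for i in range(4) for j in range(4))
```

For example, with the planes z = 0 and x = 0, the point (3, 0, 2) has squared distances 4 and 9, so its quadric error is 13.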
In most current systems, simplification is performed in an iterative
greedy fashion, which maintains a sorted list of candidate operations and
applies at each step the operation associated with the minimal simplification
error. Many variations of this approach have been proposed, especially for
dealing with extremely large meshes that do not fit in main memory.
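The greedy loop can be illustrated on a deliberately reduced problem. The sketch below simplifies a 2D polyline instead of a triangle mesh, which is an assumption made to keep the example short: each candidate operation removes one interior vertex, its error is twice the area of the triangle formed with its two neighbors, and a binary heap plays the role of the sorted candidate list. Stale heap entries are skipped using per-vertex version counters.

```python
import heapq

def removal_error(points, prev_i, cur_i, next_i):
    """Twice the area of triangle (prev, cur, next): cost of dropping cur."""
    (x0, y0), (x1, y1), (x2, y2) = points[prev_i], points[cur_i], points[next_i]
    return abs((x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0))

def greedy_simplify(points, target_count):
    """Repeatedly apply the cheapest vertex removal until target_count remain."""
    n = len(points)
    prev_of = list(range(-1, n - 1))
    next_of = list(range(1, n + 1))
    alive = [True] * n
    version = [0] * n
    heap = [(removal_error(points, i - 1, i, i + 1), i, 0) for i in range(1, n - 1)]
    heapq.heapify(heap)
    remaining = n
    while heap and remaining > target_count:
        _, i, ver = heapq.heappop(heap)
        if not alive[i] or ver != version[i]:
            continue  # outdated candidate, its neighborhood has changed
        alive[i] = False
        remaining -= 1
        p, q = prev_of[i], next_of[i]
        next_of[p], prev_of[q] = q, p
        for j in (p, q):  # re-evaluate the affected interior neighbors
            if 0 < j < n - 1:
                version[j] += 1
                err = removal_error(points, prev_of[j], j, next_of[j])
                heapq.heappush(heap, (err, j, version[j]))
    return [pt for pt, keep in zip(points, alive) if keep]
```

The same structure carries over to meshes, with edge contractions as candidates and a mesh error measure in place of the triangle area.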
An efficient approach recently introduced in this context is the idea of
“streaming simplification” [15]. The key insight is to keep input and output
data in streams and to document, by means of “finalization tags”, when, for
example, all triangles around a vertex or all points in a particular spatial
region have arrived. This representation allows for streaming very large meshes through main
memory while maintaining information about the visiting status of edges and
vertices.
At any time, only a small portion of the mesh is kept in-core, with the
bulk of the mesh data residing on disk. Mesh access is restricted to a fixed
traversal order, but full connectivity and geometry information is available
for the active elements of the traversal. For simplification, an in-core buffer
is filled and simplified and output is generated as soon as enough data is
available.
Geometric simplification can be considered a mature field for which
industrial-quality solutions exist. However, these methods, which repeatedly
merge nearby surface points or mesh vertices based on error minimization
considerations, perform best for highly tessellated surfaces that are otherwise
relatively smooth and topologically simple. In other cases, it becomes
difficult to derive good “average” merged properties. Geometric
simplification is thus hard to apply when the visual appearance of an object
depends on resolving the ordering and mutual occlusion of even very close-by
surfaces, potentially with different shading properties.
B. Level-of-Detail
A level-of-detail (LOD) model is a compact description of multiple
representations of a single shape and is the key element for providing the
necessary degrees of freedom to achieve run-time adaptivity. LOD models can be
classified as discrete, progressive, and continuous LOD models.
Discrete LOD models simply consist of ordered sequences of representations
of a shape, representing an entity at increasing resolution and accuracy. The
expressive power of discrete LODs is limited to the different models contained
in the sequence: these are a small number and their accuracy/resolution is
predefined (in general, it is uniform in space). Thus, the possibility of
adapting to the needs of user applications is scarce. The extraction of a mesh
at a given accuracy reduces to selecting the corresponding mesh in the
sequence, whose characteristics are the closest to application needs. Such
models are standard technology in graphics languages and packages, such as VRML
or X3D and are used to improve efficiency of rendering: depending on the
distance from the observer, or a similar measure, one of the available models
is selected.
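Distance-based selection among discrete LODs can be sketched as a lookup into a sorted list of switch distances. The thresholds and the function name `select_lod` are hypothetical; index 0 is assumed to be the most detailed model.

```python
import bisect

def select_lod(distance, switch_distances):
    """Return the index of the discrete LOD to render.

    `switch_distances` must be sorted in ascending order; a model with
    k switch distances has k + 1 representations.
    """
    return bisect.bisect_right(switch_distances, distance)
```

With switch distances `[10, 50, 200]`, an object 75 units away is drawn with representation 2. Since a single index is chosen per object, a large object that spans many distances cannot be refined selectively, which is the limitation noted above.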
The approach works well for small or distant isolated objects, which can
be found in CAD models [8]. However, it is not efficient for large objects
spanning a range of different distances from the observer. Since there is no
relation among the different LODs, there are no constraints on how the various
detail models are constructed.
Progressive models consist of a coarse shape representation and of a
sequence of small modifications which, when applied to the coarse
representation, produce representations at intermediate levels of detail. Such
models lead to very compact data structures, based on the fact that all modifications
in the sequence belong to a predefined type, and thus can be described with a
few parameters. The most notable example is the Progressive Mesh representation
[12], included in the DirectX library since version 8. In this case, the
coarsening/refinement operations are edge collapse/edge split. A mesh at
uniform accuracy can be extracted by starting from the initial one, scanning
the list of modifications and performing modifications from the sequence until
the desired accuracy is obtained. As for discrete LODs, the approach works well
for small or distant isolated objects, but it is not efficient for large
objects spanning a range of different distances from the observer.
Continuous LOD models improve over progressive models by fully supporting
selective refinement, i.e., the extraction of representations with an LOD that
can be variable in different parts of the representation, and can be changed on
a virtually continuous scale. Continuous LODs are typically created using a refinement/coarsening
process similar to the one employed in progressive models. However, rather than
just storing a totally ordered sequence of local modifications, continuous LOD
models link each local modification to the set of modifications that block it. Thus,
contrary to progressive models, local updates can be performed without
complicated procedures to find out dependency between modifications.
A general framework for managing continuous LOD models is the
multi-triangulation [6], which is based on the idea of encoding the partial
order describing mutual dependencies between updates as a directed acyclic
graph (DAG), where nodes represent mesh updates (removals and insertions of
triangles), and arcs represent relations among updates. An arc a = (n1, n2)
exists if a non-empty subset of the triangles introduced by n1 are removed by
n2. Selectively refined meshes can thus be obtained from cuts of this graph,
and by sweeping the cut forward/backward through the DAG the resolution
increases/decreases. Figure 6 illustrates the concept with an example.
Most of the continuous LOD models can be expressed in this framework. Many
variations have been proposed. Up until recently, however, the vast majority of
view-dependent level-of-detail methods were all based on multi-resolution
structures taking decisions at the triangle/vertex primitive level. This kind
of approach involves a constant CPU workload for each triangle that makes
detail selection the bottleneck of the whole rendering process. This problem is
particularly stringent in rasterization approaches, because of the increasing
CPU/GPU performance gap. To overcome this bottleneck and to fully exploit the
capabilities of current hardware it is therefore necessary to select and send
batches of geometric primitives to be rendered with just a few CPU
instructions. To this end, various GPU oriented multi-resolution structures
have been recently proposed, based on the idea of moving the granularity of
the representation from triangles to triangle patches [4], [29]. Thus, instead
of working directly at the triangle level, the models are first partitioned
into blocks containing many triangles, and, then, a multi-resolution structure
is constructed among partitions. By carefully choosing appropriate subdivision
structures for the partitioning and managing boundary constraints, hole-free
adaptive models can be constructed.
The benefit of these approaches is that the per-triangle workload needed to
extract a multi-resolution model is reduced by orders of magnitude. The small
patches can be preprocessed and optimized off-line for more efficient
rendering, and highly efficient retained-mode graphics calls can be exploited
for caching the current adaptive model in video memory. Recent work has shown
that the vast performance increase in CPU/GPU communication results in greatly
improved frame rates [4], [29].
C. Visibility Culling
Often, massive scenes are densely occluded or are too large to be viewed
in their entirety from a single viewpoint, which means that in most viewing
situations only a fraction of the model geometry is actually visible. For
instance, in the Boeing 777 model in Figure
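A straightforward strategy is therefore to determine the visible set of the
data. The main intent is to reject large parts of the scene before the actual
visible-surface determination takes place, thereby reducing the rendering
complexity to the complexity of the visible subset of the scene. The process
of computing the visible subset of a scene is called visibility culling. Like
LOD techniques, it is a fundamental component of output-sensitive methods.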
Three typical culling examples are back-face culling, view-frustum
culling, and occlusion culling (see Figure 7). Back-face and view-frustum
culling are trivial to implement, as they are local per-primitive operations.
Occlusion culling is a far more effective technique since it removes primitives
that are blocked by groups of other objects, but it is unfortunately not as
trivial to handle as the first two culling techniques because of its global
nature. Often preprocessing needs to be involved, usually to generate some sort
of scene hierarchy that allows occlusion tests to be performed in top-down
order.
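The two local tests can be sketched as follows. The conventions are assumptions of this example: triangle normals point outward, and frustum planes are given as (a, b, c, d) with unit normals pointing into the frustum, so that a negative signed distance means "outside". Bounding spheres are used for the frustum test for brevity.

```python
def is_back_facing(normal, vertex, eye):
    """A face is back-facing if its normal points away from the viewer."""
    view = [v - e for v, e in zip(vertex, eye)]  # from the eye to the face
    return sum(n * d for n, d in zip(normal, view)) >= 0.0

def sphere_outside_frustum(center, radius, planes):
    """Conservative test: True only if the sphere lies fully behind a plane."""
    for a, b, c, d in planes:
        signed_distance = a * center[0] + b * center[1] + c * center[2] + d
        if signed_distance < -radius:
            return True
    return False
```

Both tests look at a single primitive or bounding volume in isolation, which is why they are easy to implement. Occlusion culling, in contrast, needs to know what else lies in front.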
Quite a lot of different occlusion culling strategies have been proposed so far.
Occlusion culling approaches are broadly classified into from-point and
from-region visibility algorithms [5]. From-region algorithms compute a
potentially visible set (PVS) for each cell of a fixed subdivision of the scene;
this computation is carried out offline in a preprocessing phase. During
rendering, only the primitives in the PVS of the cell in which the observer is
currently located are
rendered. From-region methods are mainly used for specialized applications,
e.g., the visualization of urban scenarios or building interiors. From-point
algorithms, on the other hand, are applied online for each particular
viewpoint, and are usually better suited for general scenes, since for general
environments accurate PVSs are hard to compute.
Visibility culling methods are typically realized with the help of a so-called
spatial index, a spatial data structure that organizes geometry in 3D space.
There are two major approaches, bounding volume hierarchies (BVHs) and spatial
partitioning.
Bounding volume hierarchies organize geometry bottom-up. Groups of objects
are encapsulated by a larger volume, which encloses them completely. Then
bounding volumes can again be grouped and put into even larger volumes. The
resulting tree can be rendered in a top-down order, starting with the scene
bounding box. If a bounding volume, i.e., its boundary, is found to be fully or
partially visible, rendering continues with its child-volumes. If a volume is
completely invisible, traversal of the respective sub-tree can be discontinued,
since none of its children can be visible either.
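The top-down traversal described above can be sketched as follows. This is only a minimal illustration: the names BVHNode, collect_visible, and overlaps are hypothetical, and the visibility test is passed in as a callback (here a simple box overlap standing in for a real frustum or occlusion test).

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Axis-aligned box: (xmin, ymin, zmin, xmax, ymax, zmax)
Box = Tuple[float, float, float, float, float, float]


@dataclass
class BVHNode:
    box: Box
    children: List["BVHNode"] = field(default_factory=list)
    primitives: List[str] = field(default_factory=list)  # payload of leaf nodes


def collect_visible(node: BVHNode, is_visible: Callable[[Box], bool], out=None):
    """Top-down traversal that skips every sub-tree whose volume is invisible."""
    if out is None:
        out = []
    if not is_visible(node.box):
        # The volume encloses all descendants, so none of them can be visible.
        return out
    out.extend(node.primitives)
    for child in node.children:
        collect_visible(child, is_visible, out)
    return out


def overlaps(a: Box, b: Box) -> bool:
    """Stand-in visibility test: do two axis-aligned boxes overlap?"""
    return all(a[i] <= b[i + 3] and b[i] <= a[i + 3] for i in range(3))


left = BVHNode((0, 0, 0, 1, 1, 1), primitives=["a", "b"])
right = BVHNode((5, 0, 0, 6, 1, 1), primitives=["c"])
root = BVHNode((0, 0, 0, 6, 1, 1), children=[left, right])
view = (-1, -1, -1, 2, 2, 2)  # assumed view region
visible = collect_visible(root, lambda box: overlaps(box, view))
```

In this toy scene the right sub-tree is rejected with a single box test, so its primitives are never touched.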
In contrast, spatial partitioning schemes subdivide the scene top-down.
The scene bounding box is split into disjoint partitions, which may further be
subdivided in the same fashion. Each atomic partition holds a list of
primitives it contains in whole or in part, and which will be rasterized if the
partition can be classified as visible. In case of ray tracing, all primitives
of such a partition are intersected sequentially with the current ray.
A number of spatial partitioning schemes have been proposed in the
past. The most popular are hierarchical grids, octrees, and kd-trees. More details
can be found, e.g., in [1]. An example of a kd-tree scene partition is illustrated
in Figure 8. Kd-trees are axis-aligned binary space partitioning (BSP) trees.
Construction of a kd-tree starts with the bounding box of the model and a list
of contained primitives. The scene bounding box is then subdivided into two
sub-boxes along one of the three primary coordinate axes, and the list of
primitives is sorted into the two halves, creating two primitive lists, one for
each half. Polygons that overlap both halves are simply replicated. The process
is recursively continued for both sub-boxes and their respective primitive
lists. This way a binary tree is constructed, where each node corresponds to a
spatial region (called voxel), and its children to a binary space partition of
their parent region. If splitting positions are chosen to tightly enclose the
scene geometry, kd-trees typically exhibit superior culling efficiency over
other acceleration structures.
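The recursive construction described above can be sketched as follows. This is a simplified illustration under stated assumptions: primitives are represented by their axis-aligned bounds, the split plane is placed at the spatial median of the longest axis (a real builder would choose positions that enclose the geometry more tightly), and recursion stops at a hypothetical leaf size or depth limit. KdNode and build_kdtree are illustrative names.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]
Box = Tuple[Vec3, Vec3]  # (min corner, max corner)


@dataclass
class KdNode:
    box: Box
    axis: int = -1                       # split axis, -1 for leaves
    split: float = 0.0                   # split position along the axis
    left: Optional["KdNode"] = None
    right: Optional["KdNode"] = None
    prims: Optional[List[int]] = None    # primitive indices, leaves only


def build_kdtree(box, prim_boxes, prims, max_leaf=2, depth=0, max_depth=16):
    """Recursively split `box`; primitives overlapping both halves are replicated."""
    if len(prims) <= max_leaf or depth >= max_depth:
        return KdNode(box=box, prims=list(prims))
    lo, hi = box
    axis = max(range(3), key=lambda a: hi[a] - lo[a])  # longest axis
    split = 0.5 * (lo[axis] + hi[axis])                # assumed: spatial median
    left_prims = [p for p in prims if prim_boxes[p][0][axis] < split]
    right_prims = [p for p in prims if prim_boxes[p][1][axis] >= split]
    left_hi = tuple(split if a == axis else hi[a] for a in range(3))
    right_lo = tuple(split if a == axis else lo[a] for a in range(3))
    return KdNode(
        box, axis, split,
        build_kdtree((lo, left_hi), prim_boxes, left_prims, max_leaf, depth + 1, max_depth),
        build_kdtree((right_lo, hi), prim_boxes, right_prims, max_leaf, depth + 1, max_depth),
    )


scene_box = ((0.0, 0.0, 0.0), (4.0, 1.0, 1.0))
prim_boxes = [
    ((0.2, 0.0, 0.0), (0.8, 1.0, 1.0)),
    ((1.8, 0.0, 0.0), (2.2, 1.0, 1.0)),  # straddles the first split plane
    ((3.2, 0.0, 0.0), (3.8, 1.0, 1.0)),
]
tree = build_kdtree(scene_box, prim_boxes, [0, 1, 2])
```

The depth limit matters because replication can otherwise prevent the primitive lists from shrinking.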
When combined with ray tracing, each ray traverses the kd-tree top-down,
resulting in a linear sequence of cells (leaf voxels) the ray passes. In the
example in Figure 8 the ray visits cells 3, 2, and 4. Primitives contained in
these cells are tested sequentially, and traversal can be stopped if a hitpoint
is found. Using a rasterizer, kd-tree traversal is performed similarly.
However, here all cells are visited that intersect the viewing frustum. In our
example this would be cells 3, 2, 4, and 5. Only the primitives of the
respective cells are sent to the graphics pipeline.
In order to achieve a sub-linear time complexity, employing acceleration
structures alone is not sufficient. It is also necessary to include an early
traversal termination. For a ray tracer this is trivial since visibility is
evaluated independently for each ray, and once a hitpoint has been found, it is
certain that geometry behind is not visible.
Using rasterization, the decision whether traversal of the spatial index
can be stopped can also be made in image space, by exploiting the Z-buffer, and
the most recent algorithms exploit graphics hardware for this purpose. During
rendering – when the spatial index is traversed hierarchically in a front-to-back
order – the bounding box of each visited node is tested against the Z-buffer
and traversal is aborted as soon as occlusion can be proved, i.e., when all
Z-values of a box are behind the corresponding stored Z-buffer’s values.
An efficient implementation of this method requires the availability of rapid
Z-queries for screen regions. A classic solution is the hierarchical Z-buffer,
which extends the traditional Z-buffer to a hierarchical Z-pyramid that stores,
for each coarser block, the farthest Z-value among the corresponding finer-level
blocks. This makes it possible to quickly determine whether a piece of geometry
is visible by a top-down visit of the Z-pyramid. A pure software implementation of this method
is impractical, but to some extent this idea is exploited in the current
generation of graphics hardware by applying early Z-tests of fragments in the
graphics pipeline (e.g., ATI’s Hyper-Z technology or NVIDIA’s Z-cull), and
providing users with so-called occlusion queries. These queries define a
mechanism whereby an application can query the number of pixels (or, more
precisely, samples) drawn by a primitive or group of primitives.
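To make the hierarchical Z-buffer idea concrete, here is a small software sketch. It assumes a square depth buffer whose side is a power of two, and that larger Z-values are farther away. The function names are illustrative, and this is not how the hardware mechanisms mentioned above are implemented.

```python
def build_z_pyramid(depth):
    """Build coarser levels; each block stores the farthest Z of its 2x2 children."""
    levels = [depth]
    while len(levels[-1]) > 1:
        fine = levels[-1]
        n = len(fine) // 2
        coarse = [[max(fine[2 * y][2 * x], fine[2 * y][2 * x + 1],
                       fine[2 * y + 1][2 * x], fine[2 * y + 1][2 * x + 1])
                   for x in range(n)] for y in range(n)]
        levels.append(coarse)
    return levels


def is_occluded(levels, x0, y0, x1, y1, z_near, level=None, bx=0, by=0):
    """True if the pixel rectangle [x0, x1] x [y0, y1] at nearest depth z_near is hidden."""
    if level is None:
        level = len(levels) - 1          # start at the coarsest level
    size = 1 << level                    # pixels per block side at this level
    px0, py0 = bx * size, by * size
    px1, py1 = px0 + size - 1, py0 + size - 1
    if px1 < x0 or px0 > x1 or py1 < y0 or py0 > y1:
        return True                      # block lies outside the rectangle
    if z_near > levels[level][by][bx]:
        return True                      # behind the farthest stored depth
    if level == 0:
        return False                     # a single pixel is not occluded
    return all(is_occluded(levels, x0, y0, x1, y1, z_near,
                           level - 1, 2 * bx + dx, 2 * by + dy)
               for dy in (0, 1) for dx in (0, 1))


# 4x4 depth buffer: a near occluder (0.2) on the left, background (1.0) on the right.
depth = [[0.2, 0.2, 1.0, 1.0] for _ in range(4)]
pyramid = build_z_pyramid(depth)
hidden = is_occluded(pyramid, 0, 0, 1, 3, z_near=0.5)    # fully behind the occluder
partial = is_occluded(pyramid, 1, 0, 2, 3, z_near=0.5)   # reaches into the background
```

Large hidden rectangles are usually rejected at a coarse level, which is what makes the query fast.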
For occlusion culling during scene traversal the faces of bounding boxes
are simply tested for visibility against the current Z-buffer using the
occlusion query functionality to determine whether to continue traversal. It
should be noted that, although the query itself is processed quickly using the
raw power of the GPU, its result is not available immediately due to the delay between
issuing the query and its actual processing by the graphics pipeline. A naive
application of occlusion queries can thus even decrease the overall application
performance due to the associated CPU stalls and GPU starvation. For this reason, modern
methods exploit spatial and temporal coherence to schedule the issuing of
queries [3], [29], [11]. The central idea of these methods is to issue multiple
queries for independent scene parts and to avoid repeated visibility tests of
interior nodes by exploiting the coherence of visibility classification.
D. Summary
Level-of-detail and visibility culling techniques are fundamental
ingredients for massive model rendering applications. It is important to note
that, in general, the lack of one of these techniques limits the theoretical scalability of an application. However,
massive models arise from a number of different domains, and the relative
importance of the LOD management and visibility culling components depends on
the widely varying geometry, appearance, and depth complexity characteristics
of the models. For instance, typical 3D scanned models and terrain models tend
to be extremely dense meshes with very low depth complexity, favoring pure LOD
techniques, while architectural and engineering CAD models tend to combine
complicated geometry and appearance with a large depth complexity, requiring
applications to deal with visibility problems.
Few approaches exist that integrate LODs with occlusion culling both in
the construction and rendering phases. Moreover, and most importantly, the
off-line simplification process that generates the multi-resolution hierarchy
from which view-dependent levels of detail are extracted at rendering time is
essentially unaware of visibility. When approximating very complex models,
however, resolving the ordering and mutual occlusion of even very close-by
surfaces, potentially with different shading properties, is of primary
importance (see Figure 9). Providing good multi-scale visual approximations of
general environments remains an open research problem, and the few solutions
proposed so far involve the introduction of other primitives than triangle
meshes for visibility-aware complexity reduction.
Representations other than polygons offer significant potential for
massive model visualization. In conventional polygon-based computer graphics,
models have become so complex that for most views polygons are smaller than one
pixel in the final image. The benefits of polygons for interpolation and
multi-scale surface representation thus become questionable. For these reasons,
researchers have started investigating alternative approaches that represent
complex 3D environments with sets of points, voxels, or images. At present,
however, no single best representation exists in terms of storage,
computational and implementation costs. For more information, see the sidebars
“Alternative Geometric Representations” and “Image-Based Methods”.
IV. DATA MANAGEMENT TECHNIQUES
In the previous section, we have discussed complexity reduction techniques
in order to improve the performance of rendering massive models. However, some
of those techniques, e.g., LOD methods, can even increase the size of external data
since we have to maintain different versions of a model. Moreover, we may
still have a huge amount of in-core data, especially for creating high-resolution
images, even after applying all those complexity reduction techniques.
Unfortunately, the computational trend during the last several decades has been
a widening gap between data access speed and processing
speed [28]. Moreover, this trend is likely to persist
in the near future. Therefore, system architectures have been employing various
caches and memory hierarchies to reduce expensive data access time and memory
latency. Typically, the access times of different levels of a memory hierarchy
vary by orders of magnitude (e.g., 10^-8 s for L1/L2 caches, 10^-7 s for main memory,
and 10^-2 s for disks). Also, as ubiquitous computing is more widely accepted,
data is now accessed through the network in many cases, where the data access
time is very expensive.
As a result, it is critical to reduce the number of cache misses in order
to maximally utilize the exponential growth rate of computational processing
power and improve the performance of various applications including
rasterization and ray tracing. In this section we will discuss three data
management techniques: out-of-core techniques, layout methods, and compression
methods, to improve the performance of applications by reducing data access
time.
A. Out-of-Core Techniques
Out-of-core or external memory techniques store the major part of the
scene database on disk, and only keep the fraction of the data that is
currently processed in main memory. Such methods aim to reduce the number of
disk accesses, which are several orders of magnitude more expensive than
main memory and L1/L2 cache accesses. In general, out-of-core techniques require two
cache parameters: the size of available main memory and the disk block size.
Since out-of-core techniques require these explicit cache parameters, they are also known as cache-aware techniques.
Given the known sizes of main memory and disk blocks, out-of-core
techniques keep the working set size of rendering (or other applications) less
than the size of main memory. They achieve this property typically by using an
explicit data page system. Therefore, they avoid any I/O thrashing, where a
huge number of disk cache misses occurs, and the performance of applications is
severely degraded. Also, most out-of-core techniques construct compact external
representations optimized towards disk block size to effectively reduce the
number of disk cache misses. In addition, these techniques are used together with
pre-fetching methods to further reduce data access time.
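A minimal sketch of such an explicit page system is given below. It assumes a least-recently-used eviction policy and a hypothetical load_block callback that reads one disk block. The capacity would be derived from the two cache parameters mentioned above, e.g., the main memory budget divided by the disk block size.

```python
from collections import OrderedDict


class PageCache:
    """Explicit page cache that keeps at most `capacity` disk blocks in memory."""

    def __init__(self, load_block, capacity):
        self.load_block = load_block   # hypothetical callback: block id -> data
        self.capacity = capacity       # e.g. memory budget // disk block size
        self.pages = OrderedDict()     # ordered from least to most recently used
        self.misses = 0

    def get(self, block_id):
        if block_id in self.pages:
            self.pages.move_to_end(block_id)   # mark as most recently used
            return self.pages[block_id]
        self.misses += 1                       # miss: the block is read from disk
        data = self.load_block(block_id)
        self.pages[block_id] = data
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)     # evict the least recently used block
        return data


cache = PageCache(load_block=lambda b: f"block-{b}", capacity=2)
for block in [0, 1, 0, 2, 1]:
    cache.get(block)
```

Because the working set never exceeds the capacity, memory use stays bounded no matter how large the model on disk is.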
We refer readers interested in further details on out-of-core techniques to the
survey by Silva et al. [20].
B. Layout Techniques
Most triangular meshes used in games and movies are typically embedded in
two or three dimensional geometric space. However, these triangle meshes are
stored on disk or in main memory as one dimensional linear data
representations. Then, we access such data stored in the one dimensional data sequence
using a virtual address or some other index format. An example of this mapping
is shown in Figure 10. One implication of this mapping process of 3D or 2D
triangular meshes into 1D linear layout representations is that two vertices
(or triangles) close in the original 2D or 3D geometric space can be stored
very far away in the 1D linear layout representation. This is mainly because a
1D layout cannot adequately represent the relationships of vertices and
triangles embedded in 2D or 3D geometric space.
Many rendering methods including rasterization and ray tracing access
vertices and triangles in a coherent manner. For example, suppose that a
triangle and its three associated vertices are accessed for rendering. Then, it
is most likely that neighboring triangles or vertices will be accessed for
rasterizing the next triangles or performing ray–triangle intersection tests
for subsequent rays. Therefore, two vertices (or triangles) close in the original
geometric space are most likely to be accessed sequentially; it is not the case
that every triangle or vertex is equally likely to be accessed next.
Although rendering applications access data in a coherent manner in the
geometric space where triangle meshes are embedded, there is no guarantee that
we will have coherent data access patterns in terms of 1D data layout
representations due to the intrinsic lower dimensionality of 1D layouts
compared to those of meshes. Particularly, this phenomenon can have significant
impact on the performance of many applications running on modern computer
architectures. This is mainly due to I/O mechanisms of most modern computer
architectures.
Most I/O architectures use hierarchies of memory levels, where each level
of memory serves as a cache for the next level. Memory hierarchies have two
major characteristics. First, lower levels are larger in size and farther from
the processor and, thus, have slower access times. Second, data is moved in
large blocks between different memory levels. Data is initially stored in the lowest
memory level, typically the disk, or may even be accessed through the network. A
transfer is performed whenever there is a cache miss between two adjacent
levels of the memory hierarchy.
This block fetching mechanism assumes that runtime applications will
access data in a coherent manner. As mentioned earlier, most rendering methods
including rasterization and ray tracing access data in a coherent manner, only
in the geometric space where meshes are embedded. However, data that is accessed
together may be stored far apart in the 1D layout and, thus, can require a lot of block
fetching, i.e., cache misses, at runtime.
Intuitively speaking, we get fewer cache misses if vertices that are
close in the mesh are also stored close together in the 1D layout. Cache-coherent layouts are
layouts constructed in such a way that they minimize the number of cache
misses. One notable layout method to improve the performance of rasterization
is the rendering sequence, which is a sequential list of triangles optimized to
reduce the number of GPU vertex cache misses. By computing a good rendering
sequence of a mesh, we can expect up to a sixfold improvement in rendering
performance. This technique requires the GPU vertex cache size as an input and
optimizes the rendering sequence according to this value [13]; therefore, the
computed rendering sequence is considered as a cache-aware layout.
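The effect of triangle ordering on vertex cache misses can be illustrated with a small simulation. The sketch below assumes a FIFO replacement policy, which is a common simplification and not necessarily what a particular GPU implements. The function name and the tiny example mesh are made up for illustration.

```python
from collections import deque


def vertex_cache_misses(triangles, cache_size):
    """Count misses of a FIFO vertex cache for a given triangle order."""
    cache = deque(maxlen=cache_size)   # appending to a full deque drops the oldest entry
    misses = 0
    for tri in triangles:
        for v in tri:
            if v not in cache:
                misses += 1
                cache.append(v)
    return misses


# The same four triangles in a strip-like (coherent) order and in a scattered order.
coherent = [(0, 1, 2), (1, 3, 2), (2, 3, 4), (3, 5, 4)]
scattered = [(0, 1, 2), (3, 5, 4), (1, 3, 2), (2, 3, 4)]
coherent_misses = vertex_cache_misses(coherent, cache_size=4)
scattered_misses = vertex_cache_misses(scattered, cache_size=4)
```

A rendering sequence optimizer searches for an order that minimizes this count for the given cache size, which is why the cache size is needed as an input.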
Recently, cache-oblivious mesh layouts have been proposed [28]. These
cache-oblivious layouts do not require any specific cache parameters such as
block sizes. Instead, these layouts are constructed in such a way that they
minimize the expected number of cache misses during accessing meshes for
block-based caches with various block sizes. Their main advantage compared to
cache-aware layouts is that the user does not need to specify cache
parameters such as block size and can benefit from various levels of
memory hierarchies including disk, main memory, and L1/L2 caches. Moreover, the
implicit paging system provided by the OS can effectively be coupled with
cache-oblivious layouts. Therefore, users can observe the performance
improvements without significant modifications on the underlying code and
runtime applications. Cache-oblivious layouts have been developed for meshes
and bounding volume hierarchies for ray tracing, rasterization, and other
geometric applications.
C. Compression Techniques
Mesh compression techniques compute compact representations by reducing
redundant data of the original meshes. Mesh compression has been widely
researched and good surveys [2] are available. Particularly, triangle strips
have been widely used to compactly encode rendering sequences. Triangle strips
consist of a list of vertices, which implicitly encodes a list of triangles.
Also, decoding triangle strips for rasterizing triangles can be very
efficiently implemented in graphics hardware.
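As an illustration of how a vertex list implicitly encodes triangles, the following sketch decodes a triangle strip in software. It follows the usual convention of flipping every second triangle to keep a consistent winding order and skips degenerate triangles; the function name is illustrative.

```python
def decode_triangle_strip(strip):
    """Expand a triangle strip (list of vertex indices) into explicit triangles."""
    triangles = []
    for i in range(len(strip) - 2):
        a, b, c = strip[i], strip[i + 1], strip[i + 2]
        if a == b or b == c or a == c:
            continue  # degenerate triangle, often used to stitch strips together
        # Every second triangle is flipped to keep a consistent winding order.
        triangles.append((a, b, c) if i % 2 == 0 else (b, a, c))
    return triangles


triangles = decode_triangle_strip([0, 1, 2, 3, 4])
```

Five indices encode three triangles here, whereas an explicit triangle list would need nine.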
However, these representations cannot be directly used for ray tracing,
which accesses the underlying meshes in an arbitrary order, unlike the sequential
order in rasterization. There have been efforts to design compressed meshes
that also support random accesses for various applications including ray
tracing [27], [16]. One of the concepts leading to such techniques is to
decompose a mesh into different chunks and compress each chunk independently.
Therefore, when a mesh element is required for an application, a chunk
containing the required mesh element is decompressed, and the decompressed mesh
element can be returned to the application.
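The chunk-based idea can be sketched as follows. Note the assumptions: zlib is used here only as a generic stand-in for a dedicated mesh encoder, triangles are stored as three 32-bit indices, and the class name and chunk size are hypothetical. The methods cited above use their own, more specialized encodings.

```python
import struct
import zlib


class ChunkedTriangles:
    """Triangle list compressed in independent chunks to allow random access."""

    def __init__(self, triangles, chunk_size=1024):
        self.chunk_size = chunk_size
        self.chunks = []
        for start in range(0, len(triangles), chunk_size):
            block = triangles[start:start + chunk_size]
            raw = b"".join(struct.pack("<3I", *tri) for tri in block)
            self.chunks.append(zlib.compress(raw))  # each chunk is self-contained

    def triangle(self, index):
        """Decompress only the chunk that contains the requested triangle."""
        chunk_id, offset = divmod(index, self.chunk_size)
        raw = zlib.decompress(self.chunks[chunk_id])
        return struct.unpack_from("<3I", raw, offset * 12)  # 12 bytes per triangle


mesh = [(i, i + 1, i + 2) for i in range(5000)]
store = ChunkedTriangles(mesh, chunk_size=1000)
tri = store.triangle(4321)
```

A practical system would also cache recently decompressed chunks, since coherent rays tend to hit the same chunk repeatedly.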
Recently, the Ray-Strips representation [16] has been proposed for ray
tracing. Ray-Strips, like triangle strips, consist of a list of vertices.
Moreover, this vertex list implicitly encodes a list of triangles and a
bounding volume hierarchy, which accelerates the performance of ray–triangle
intersection tests.
D. Summary
We have discussed three data management techniques, out-of-core
techniques, cache-coherent layout methods, and compression techniques, to
reduce expensive data access time and, thus, improve the performance of
rasterization and ray tracing. Out-of-core techniques mainly focus on reducing
disk access time and require specific memory and disk block sizes. Therefore,
they can achieve better disk access time compared to cache-oblivious algorithms
and layouts. However, out-of-core techniques are usually coupled with an
explicit external paging system; therefore, achieving their best
performance may require more implementation effort. On the other hand,
cache-oblivious layouts do not require any cache-parameters and work quite well
with various GPUs and CPUs having different cache-parameters. Moreover, a user
can achieve reasonably high performance without modifying underlying code and
applications. Also, by compressing the external data representations and
layouts, storage requirements can be drastically reduced and performance of
applications is improved.
V. PARALLEL PROCESSING TECHNIQUES
Even when applying the various complexity reduction methods we have
visited in the previous sections, rendering of massively complex scenes can
still be extremely computationally expensive. Especially in the case of
advanced shading, a single CPU/GPU often cannot deliver the performance
required for interactive image generation. It thus becomes necessary to combine
the computational resources of multiple processing units to achieve a sufficient
level of computing power.
Today, parallel computing capabilities are offered at a variety of
different hardware and system levels. Examples are SIMD (single instruction
multiple data) instructions that can perform vector operations, multiple
pipelines in CPUs and GPUs, multi-core architectures, and shared-memory and
loosely coupled cluster systems that can contain multiple processors and/or
graphics cards.
When focusing on rendering systems using distinct CPUs and/or GPUs, we can
distinguish between two main straightforward methods: sort-first and sort-last
parallelization strategies. Sort-first rendering is based on subdividing screen
space into disjoint regions that are rendered independently in parallel by
multiple processing units. In a sort-last setting, the scene dataset is split
into several parts, and is typically distributed amongst separate computing
systems individually containing RAM, CPUs, and GPUs. The sub-scenes can be
rendered independently, and the results are composed afterwards into the final
image. For rasterization this can be accomplished by collecting the content of
the individual color and Z-buffers, and choosing the final pixel color based on
the nearest depth value for a given pixel. A popular OpenGL oriented system for
clusters of workstations that incorporates sort-first, sort-last, and hybrid
rendering is Chromium [14]. For a ray tracer a sort-last approach can be
handled in a similar way by simply ray tracing the sub-scene images instead of
rasterizing them. A different approach for ray tracing is to forward rays from one
rendering system to the next if no surface intersection can be found.
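The sort-last composition step for rasterization can be sketched as a per-pixel depth comparison. The sketch assumes that each rendering node returns a flat color buffer and a matching Z-buffer of the same size, and that smaller Z-values are nearer. The function name and the sample buffers are illustrative.

```python
def composite(layers):
    """Merge per-node (color buffer, Z-buffer) pairs, keeping the nearest sample."""
    num_pixels = len(layers[0][0])
    color = [None] * num_pixels
    depth = [float("inf")] * num_pixels
    for node_color, node_depth in layers:
        for i in range(num_pixels):
            if node_depth[i] < depth[i]:   # this node's sample is nearer
                depth[i] = node_depth[i]
                color[i] = node_color[i]
    return color, depth


INF = float("inf")
node_a = (["red", "red", None], [0.3, 0.9, INF])       # sub-scene A
node_b = ([None, "green", "green"], [INF, 0.5, 0.7])   # sub-scene B
final_color, final_depth = composite([node_a, node_b])
```

The result does not depend on the order in which the nodes are merged, which is what allows the sub-scenes to be rendered independently.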
A. Data Parallel Rendering
Parallel rendering of a distributed scene database is generally termed
data parallel rendering (the term sort-last is more related to the composition
strategy of the final image). Apart from reducing the complexity of visibility
calculations, splitting and distributing a scene between computing subsystems has
another advantage; once the massive scene is decomposed into small chunks, each
of which can fit into the available main memory of each sub-system, the overall
system can handle highly complex scenes, which would otherwise not fit into the
main memory of a single system. A big disadvantage of such a setup is the
difficulty of dealing with advanced shading, as this potentially requires access
to all parts of the 3D model. In a rasterization-based system, data parallel
rendering is typically performed using a sort-last image composing mechanism.
In case of multipass rasterization, various maps (e.g., for shadow
calculations) have to be rendered, assembled, and distributed to all hosts.
Using ray tracing, mainly sort-first approaches are applied, i.e., an
initial primary ray is sent to the sub-system that hosts the region of the
scene the ray enters first. Rays thereafter have to be propagated between the
individual sub-systems, which usually results in a high communication overhead,
especially for secondary rays. In addition, using pure data parallel scheduling
of rendering tasks typically does not allow for handling load-imbalances caused
by changing viewpoints.
B. Demand Driven Rendering
When pursuing a screen space subdivision (sort-first) approach, a
straightforward way is to use a static partitioning scheme, where the image
space is broken into as many fixed size regions as there are rendering client
machines. Another alternative is to split the image into many small, often
quadrangular regions called tiles, which are assigned to the available rendering
clients (i.e., processing units). Depending on the part of a scene that is
visible in a tile, computation time for each tile can vary strongly across the
image plane. This can lead to poor client utilization, and thus to poor
scalability, if tiles are statically assigned to the rendering clients.
Therefore, a better alternative is to employ a demand driven scheduling
strategy, by letting the clients themselves ask for work. As soon as a client
has finished a tile, it sends its results back to the master process (which is
responsible for assembling and displaying the image), and requests the next unassigned
tile. This leads to efficient load balancing, and for most scenes to
almost linear scalability in the number of rendering clients.
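A demand driven scheduler can be sketched with a shared work queue. In this illustration, threads within one process stand in for rendering clients, and render_tile is a hypothetical callback. A distributed system would exchange the same requests and results over the network.

```python
import queue
import threading


def render_demand_driven(tiles, render_tile, num_clients):
    """Clients repeatedly ask for the next unassigned tile until none are left."""
    work = queue.Queue()
    for tile in tiles:
        work.put(tile)
    results = {}                        # assembled by the "master"
    tiles_per_client = [0] * num_clients
    lock = threading.Lock()

    def client(client_id):
        while True:
            try:
                tile = work.get_nowait()   # request the next unassigned tile
            except queue.Empty:
                return                     # no work left
            pixels = render_tile(tile)
            with lock:                     # send the result back
                results[tile] = pixels
                tiles_per_client[client_id] += 1

    threads = [threading.Thread(target=client, args=(c,)) for c in range(num_clients)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, tiles_per_client


tiles = [(x, y) for y in range(4) for x in range(4)]
image, load = render_demand_driven(tiles, lambda t: t[0] + t[1], num_clients=3)
```

Clients that receive cheap tiles simply come back for more work sooner, so expensive tiles do not stall the others.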
C. Distributed Rendering
In contrast to shared-memory environments, where a number of CPUs can
simultaneously access a virtually single contiguous chunk of main memory,
distributed systems contain a number of physically separated computing systems communicating
over an interconnection network. In such an environment, typically one
dedicated machine serves as host for a master process. The master is
responsible for distributing the rendering workload among the remaining client
machines, and for assembling and displaying the generated image. Particularly
in situations where clients render massive datasets in an out-of-core manner,
it is important to consider spatio-temporal coherence. In order to make best
use of caches on client machines, image tiles should be assigned to the same clients
in subsequent frames whenever possible. To hide latencies, rendering, network
transfer, image display, and updating scene settings should be performed
asynchronously. For example, while image data for frame N is transferred, the
clients already render frame N + 1, whereas the application can specify and
send updated scene settings for frame N + 2.
D. Summary
Modern CPUs and GPUs increasingly feature parallel computing capabilities.
One of the most important computational trends is the growing
number of CPU cores and GPU processing units, as physical limitations
in stepping up clock rates and reducing sizes of integrated circuit structures
become more and more evident. Therefore, future hardware will make it possible
to render today’s massive models on standard computing systems, but scene
complexity is also expected to keep rising for quite some time to come. For such
extremely complex scenes, it is necessary to combine the computational power of multiple
computing systems to enable interactive rendering and sophisticated shading.
In this section we only dealt with parallel rendering. However, parallel
installations described in this section can equally be applied to speed up
precomputation tasks, e.g., building spatial index structures or computing
level-of-detail representations.
VI. SYSTEM ISSUES
Rendering high-quality representations of complex models at interactive
rates requires not only carefully crafted algorithms and
data structures, but also the combination of the different components in an
efficient way. This means that the different solutions illustrated in the
previous sections must be carefully mixed and matched in a single coherent
system able to balance the competing requirements of realism and frame rates.
No single standard approach presently exists, and the different solutions
developed to date all have their advantages and drawbacks. In the following, we
briefly illustrate how a few representative state-of-the-art systems work.
A. Visibility Driven Rasterization
One of the important uses for massive model rendering is the exploration
of models with a high depth complexity. These application test cases include
architectural walkthroughs and explorations of large CAD assemblies. In these
situations, occlusion culling is often the most
effective visible data reduction technique. A number of systems have
thus been developed around efficient visibility queries.
A system that efficiently makes use of the occlusion query capabilities of
modern graphics hardware is the Visibility Guided Rendering [11] (VGR) system.
VGR organizes the scene in a hierarchy of axis-aligned bounding boxes. For each
internal node of the resulting tree a splitting plane along one of the three
primary axes (like in a regular kd-tree) is stored, which is used to traverse
the scene in a front-to-back order. In a preprocessing step the tree is
generated top-down, while trying to keep the edges of boxes of equal size, to
optimize culling efficiency for different viewing angles. Recursive subdivision of
the scene is terminated once a node contains about 2000 to 4000 triangles.
The faces of boxes associated with nodes are directly used as test
geometry for hardware occlusion queries. Ideally, traversal of the bounding box
hierarchy would be performed depth-first, selecting children of a node in a
front-to-back order. However, with occlusion queries being done on the GPU and
tree traversal on the CPU, this would result in a stop-and-go behavior. As
explained in section III-C, the solution is to carefully schedule queries by
exploiting spatial and temporal coherence. Thus, to allow for running occlusion
queries in parallel to tree traversal, VGR maintains a queue of query requests,
which can be asynchronously processed by the GPU. Rather than carrying out
visibility checks in a depth-first order, the queue is filled based on a more breadth-first
traversal order (see Figure
VGR also maintains a list of leaf nodes that were visible in the
previously rendered frame.
In successive frames new visible nodes are added to the list while others
become invisible. The nodes of this list are rendered first in the current
frame. The intention is to fill the Z-buffer before the first occlusion query
takes place, thus exploiting frame-to-frame coherence. Visibility information
for leaf nodes from the previous frame is propagated up the tree, which makes
it then possible to exclude sub-trees from traversal and visibility testing
(Figure 11b). Not every leaf is tested for visibility in each frame. Occlusion
testing is performed every n frames for the leaf nodes contained in the list.
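The upward propagation of last frame's leaf visibility can be sketched as below. The tree is given as a hypothetical child-to-parent dictionary. The staggered schedule in `due_for_query` is an assumption: the text only states that listed leaves are re-tested every n frames.

```python
def propagate_visibility(parent, visible_leaves):
    """Mark every ancestor of a visible leaf as visible.

    parent maps each node id to its parent id (the root maps to None).
    Nodes missing from the result had no visible leaf below them last frame.
    """
    visible = set()
    for leaf in visible_leaves:
        node = leaf
        # Stop early once we reach an ancestor that is already marked.
        while node is not None and node not in visible:
            visible.add(node)
            node = parent[node]
    return visible


def due_for_query(leaf_index, frame, n):
    """Assumed schedule: spread re-tests so each listed leaf is queried every n frames."""
    return (frame + leaf_index) % n == 0
```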
For handling complex scenes, VGR can render in an out-of-core mode. To this end, it maintains a two-level caching architecture,
where the graphics card memory (VRAM) serves as first-level cache, and the RAM
of the host machine as second-level cache. Memory is managed using a
least recently used (LRU) cache eviction strategy. The VRAM is subdivided into
a small number of large slices containing OpenGL vertex and index buffers.
Slices are filled from the front to the back with data from visible leaf nodes.
Once no more free slices are left, the slice least recently used is completely
emptied and refilled.
VGR系统使用一个外核的模式进行处理,保持着一个两级的缓存结构,图形卡内存作为第一级缓存,而主机的RAM作为第二级缓存。使用LRU作为缓存的清除策略。VRAM缓存被分割成几个大块用来保存OpenGL的顶点和索引缓存。使用可见的叶子节点由后向前进行填充。
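A slice-granular LRU cache of this kind can be sketched as follows. The class name `SliceCache` and the fixed per-slice node capacity are illustrative assumptions. A real implementation would track bytes in OpenGL buffers rather than node counts.

```python
from collections import OrderedDict


class SliceCache:
    """Toy model of VRAM split into a few large slices with LRU eviction."""

    def __init__(self, num_slices, nodes_per_slice):
        self.nodes_per_slice = nodes_per_slice
        self.free = list(range(num_slices))
        self.used = OrderedDict()   # slice id -> node ids, least recently used first
        self.where = {}             # node id -> slice id
        self.current = None         # slice currently being filled

    def request(self, node):
        """Return True on a cache hit, False if the node had to be uploaded."""
        if node in self.where:
            self.used.move_to_end(self.where[node])   # mark slice as recently used
            return True
        if self.current is None or len(self.used[self.current]) >= self.nodes_per_slice:
            self.current = self._open_slice()
        self.used[self.current].append(node)
        self.where[node] = self.current
        self.used.move_to_end(self.current)
        return False

    def _open_slice(self):
        if self.free:
            sid = self.free.pop(0)
        else:
            # No free slice left: empty the least recently used one completely.
            sid, evicted = self.used.popitem(last=False)
            for n in evicted:
                del self.where[n]
        self.used[sid] = []
        return sid
```

Evicting whole slices rather than individual nodes keeps the bookkeeping small and avoids fragmenting the vertex and index buffers.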
The VGR system also incorporates a simple LOD mechanism. Before rendering a visible node, its screen space projection
area is determined. If the area is smaller than a pixel, the system resorts to
point rendering for this node. In addition, it is further possible to randomly
thin out the point cloud for very distant nodes.
While the VGR system employs online visibility queries, other systems make use of
visibility information computed in a preprocessing phase and stored with the
model. A representative system of this category is iWalk, which is constructed
around a multi-threaded out-of-core rasterization method.
绘制前,先计算可见节点的投影面积,若小于一个象素,则按点绘制方法进行处理。
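The sub-pixel test used by VGR's LOD mechanism can be approximated as follows. This sketch assumes a bounding sphere and a symmetric perspective projection. VGR itself works with bounding boxes, and the function names are hypothetical.

```python
import math


def projected_area_px(radius, distance, fov_y_deg, viewport_height_px):
    """Approximate screen-space area (in pixels) of a bounding sphere."""
    if distance <= radius:
        return float("inf")  # viewer is inside or touching the sphere
    # Focal length in pixels for a vertical field of view fov_y_deg.
    focal_px = (viewport_height_px / 2.0) / math.tan(math.radians(fov_y_deg) / 2.0)
    radius_px = radius * focal_px / distance
    return math.pi * radius_px * radius_px


def choose_primitive(radius, distance, fov_y_deg, viewport_height_px):
    """Fall back to point rendering when the node covers less than one pixel."""
    area = projected_area_px(radius, distance, fov_y_deg, viewport_height_px)
    return "points" if area < 1.0 else "triangles"
```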
iWalk can support high-resolution (4096 × 3072) and multitiled displays by
employing sort-first parallel out-of-core rendering. iWalk decomposes an input
model with an octree. For construction, since an input model typically does not
fit into the available memory, iWalk breaks the model into small sections, each
of which can fit into main memory. Then, iWalk incrementally constructs an
octree by processing each section in a separate pass, and merges the final
result into a single octree. iWalk can be integrated with approximate or
conservative visibility culling and employs speculative
prefetching that takes into account visibility events, which are otherwise very hard to deal
with. To do that, a visibility coefficient for
each octree node is computed. A visibility coefficient
measures how much geometry in an octree node is likely to be occluded by other
geometry. At runtime, iWalk predicts visibility events based on
visibility coefficients stored in the octree nodes. This
feature allows the system to pre-fetch the geometry which is likely to be
accessed in a next frame, and thus reduces expensive loading time of newly
visible geometry.
参处理过程中,计算了一个可见性的协因子。
iWalk also uses multi-threading to concurrently perform visibility
prefetching, rendering, and out-of-core management. Since
disk operations are very expensive and have high latency, multi-threading of
different tasks achieves higher CPU utilization and thus improves the overall
performance of rendering massive models.
多线程可以充分使用CPU的使用以提高整个性能?How?
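One way to see how multi-threading hides disk latency is a background fetcher that loads geometry while the main thread keeps rendering. The sketch below is a generic producer/consumer pattern, not iWalk's implementation. `load_from_disk` is a stand-in that simply sleeps, and all names are hypothetical.

```python
import queue
import threading
import time


def load_from_disk(node_id):
    time.sleep(0.01)  # stand-in for a slow, high-latency disk read
    return f"geometry-{node_id}"


class Prefetcher:
    """Loads requested nodes on a background thread so rendering never blocks on I/O."""

    def __init__(self):
        self.requests = queue.Queue()
        self.cache = {}
        self.lock = threading.Lock()
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def _run(self):
        while True:
            node_id = self.requests.get()
            if node_id is None:          # sentinel: shut down
                self.requests.task_done()
                break
            data = load_from_disk(node_id)
            with self.lock:
                self.cache[node_id] = data
            self.requests.task_done()

    def prefetch(self, node_id):
        self.requests.put(node_id)       # returns immediately

    def get(self, node_id):
        with self.lock:
            return self.cache.get(node_id)   # None if not loaded yet

    def wait_idle(self):
        self.requests.join()

    def stop(self):
        self.requests.put(None)
        self.thread.join()
```

The render loop calls `prefetch` for nodes that are predicted to become visible and `get` for nodes it needs now. While the fetcher thread waits on the disk, the CPU remains available for traversal and rendering.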
B. Real-Time Ray Tracing
Another class of systems heavily relying on efficient visibility culling
is that of systems built around real-time ray tracing kernels.
A very advanced system of this kind is the OpenRT real-time ray tracer
[24], [22], a renderer originally conceived to deliver interactive ray tracing
performance on low-cost clusters of commodity PCs. It can, however, also be run
in a shared-memory environment. OpenRT uses a two-level
kd-tree hierarchy as spatial index. A scene can consist of several independent
objects composed of geometric primitives, where each object has its own local
kd-tree. The bounding boxes of such objects are organized in a top-level
kd-tree.
OpenRT使用两级的kd树层次作为空间索引。一个场景可以包括几个独立的对象,每个对象由多个几何图元组成,而每一个对象有其局部的kd树。对象的包围盒组织成一个上层的kd树。
On the one hand, this allows for a limited form of rigid-body motion, since
only the top-level tree needs to be rebuilt once objects move. On the other
hand, it enables efficient instancing of objects as the top-level tree can
contain multiple references and corresponding transformation matrices to a
single object.
Kd-tree construction makes use of cost prediction functions to estimate
optimal splitting plane positions. As long as the scene description can fit
completely into main memory, even visually highly complex scenes can be handled
due to the logarithmic time complexity of ray tracing. An example can be seen
in Figure 1d. Using tile-based demand driven rendering on multiple CPUs, the
depicted landscape scene can be interactively explored without having to
incorporate explicit occlusion culling or level-of-detail methods.
一方面,这允许一定形式或规则体的运动,因为只有顶层树需要进行重建,当对象移动时。另一方面,它允许实例化顶层树的对象可以包含多个参考并对一个对象有多个变换矩阵。Kd树的构建过程中使用代价预测函数来估计优化分割平面的位置。只要场景的描述可以完全的加入内存,即使是可见性高度复杂的场景也可以被处理由于射线追踪方法的对数时间复杂度。
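The text does not say which cost prediction function OpenRT uses. A common choice for kd-tree construction is the surface area heuristic (SAH), shown here purely as an illustration. The traversal and intersection cost constants are arbitrary assumptions.

```python
def surface_area(lo, hi):
    """Surface area of an axis-aligned box given its min and max corners."""
    dx, dy, dz = (h - l for l, h in zip(lo, hi))
    return 2.0 * (dx * dy + dy * dz + dz * dx)


def sah_cost(lo, hi, axis, pos, n_left, n_right, c_trav=1.0, c_isect=1.5):
    """Expected cost of splitting box (lo, hi) at `pos` along `axis`.

    A child's hit probability is estimated by its surface area relative to
    the parent; n_left and n_right are the primitive counts on each side.
    """
    left_hi = list(hi)
    left_hi[axis] = pos
    right_lo = list(lo)
    right_lo[axis] = pos
    parent_area = surface_area(lo, hi)
    p_left = surface_area(lo, left_hi) / parent_area
    p_right = surface_area(right_lo, hi) / parent_area
    return c_trav + c_isect * (p_left * n_left + p_right * n_right)
```

A builder would evaluate `sah_cost` for a set of candidate plane positions and keep the cheapest one, or create a leaf when no split is cheaper than intersecting all primitives directly.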
OpenRT incorporates a custom memory management subsystem that can deal
with scenes larger than physical memory in an out-of-core fashion. The whole
dataset, including acceleration structures, etc. is mapped from disk into
virtual address space using the operating system’s memory mapping facilities
(e.g., Linux mmap()). Making use of the OS memory mapping system provides an
automatic demand paging scheme, taking care that data is loaded into main
memory as soon as it is accessed. However, in order to avoid stalling due to
page faults during ray traversal and intersection, it is necessary to detect
whether or not memory referenced by a pointer is actually in core memory. To
this end, a hash-table is maintained that records which pages of the model have
already been loaded. In case of a potential page fault, tracing a ray is canceled,
while the missing memory page is scheduled to be loaded by an asynchronously
running fetcher thread.
OpenRT使用一个惯用的内存管理子系统,可以以一个外核管理的方法来进行大于数据场景的管理。整个数据集,包括加速结构,从磁盘中映射到虚拟地址中,使用操作系统的映射函数。使用操作系统管理映射,可以提供一个自动的按需的分页架构,在需要进行访问时,将数据加载到内存中。然而,要达到在页面失效时的延迟的避免,则需要探测是否以指针指向的内存的参考是否有效。在这种情况下,哈希表用来维护表达场景中哪些页面已经加载到内存。在潜在的页面失效时,追踪一个射线被取消,而失效的页面则即由异步读取线程进行加载。
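The residency check can be sketched with a hash set of loaded pages. This is a simplified single-threaded model: the page size, the names, and the synchronous `fetch_pending` (standing in for the asynchronous fetcher thread) are assumptions.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


class PageTable:
    """Tracks which pages of the memory-mapped model are already in core."""

    def __init__(self):
        self.resident = set()   # hash-based set of loaded page numbers
        self.pending = []       # pages queued for the fetcher

    def try_access(self, offset):
        """Return True if the byte at `offset` can be read without a page fault."""
        page = offset // PAGE_SIZE
        if page in self.resident:
            return True
        if page not in self.pending:
            self.pending.append(page)   # schedule the load, do not block
        return False

    def fetch_pending(self):
        """Stand-in for the fetcher thread: load everything that was requested."""
        self.resident.update(self.pending)
        self.pending.clear()


def trace(offsets_touched, table):
    """Cancel the ray at the first access that would fault."""
    for offset in offsets_touched:
        if not table.try_access(offset):
            return "canceled"
    return "done"
```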
In case of canceled rays, several strategies can be applied. For smaller
models (in the range of a few dozen million triangles), where missing data can
be loaded during a single frame, rays can be suspended and later resumed once
the data becomes available in memory. For larger models, simplified representations
that can fully fit into memory are used as a substitute. It should be noted
that this is only necessary to bridge loading time, but not to reduce the
visual complexity. Simplified data is only used while fully-detailed data is
being loaded.
当光线被取消时,可以采用多个策略。对于小的模型(只有几千万三角形),失中的数据可以在单帧中加载,射线可以被挂起,在数据加载后重新恢复。而对于巨大模型而言,粗层次的表达可以替代。需要指出的是,只需要经过加载时间而不会降低可见复杂性。完全的数据已经加载后,则不需使用简化数据。
OpenRT can use different types of surface shaders in a plug-and-play
manner, which makes it possible to include different types of shading and
lighting effects, e.g. soft shadows or transparency (see Figure 12) that can
help to better visualize the model structure.
In contrast to OpenRT, which was originally
conceived for distributed PC cluster rendering, the Manta Open Source Interactive
Ray Tracer [21] has been designed from scratch for shared-memory multi-processor
systems. Manta’s architecture consists of two fundamental components, a
parallel pipeline executed by all threads synchronously, and a set of modular
components organized in a rendering stack, which is executed by each thread
asynchronously. The pipeline is organized by dividing rendering tasks based on
their parallel load characteristics into inherently balanced, imbalanced, and
dynamically balanced categories. All rendering threads perform each task on
each frame in the pipeline during a stage.
相比于OpenRT(其主旨是进行分布式集群渲染),Manta开源交互射线追踪器设计用于共享内存的多处理器系统。它的架构包括两个主要部分, 一个平行的由所有线程异步执行的管道,以及一个以渲染堆组织的模块集合。管线的组织通过将渲染任务基于它们平行加载特点分割成内丰在平衡的、不平衡的、动态平衡的几类。在一个阶段,所有渲染线程在每一帧中执行每一个任务。
Dynamically load balanced tasks are executed last so any imbalance
introduced earlier can be smoothed out and processor stalls between stages are avoided.
A basic Manta rendering pipeline consists of ray tracing and image display.
Since image display is usually only performed by one thread, this task is
imbalanced. In the Manta control loop, the previous frame is displayed first
and then the rendering stack is invoked to perform ray tracing of the current frame.
After the responsible thread completes display, it joins the rendering threads.
动态加载平衡的任务最后执行,因此任何一个早期不平衡的引入可以平衡的移出而避免处理器的延迟。
Thread safe state changes in Manta are performed using callbacks to simple
application defined functions called transactions. These transactions are
stored in a queue and processed between pipeline stages when all threads are
synchronized. Ideally, each transaction performs a very small state change. Given
the high performance (relative to rendering rates) of modern barriers,
transaction performance is higher than individually locking shared state.
Additionally, Manta supports a variety of scheduling techniques that allow
callbacks to be invoked at specific times or frame numbers. The transaction
facility allows Manta to be easily embedded in other applications. Thread safe
state changes are performed by defining callback functions and then passing the
functions to Manta using transactions. Changes can be entirely application specific
from simple camera movements to scene graph or material property changes.
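The transaction idea can be illustrated with a barrier whose action drains a queue of callbacks while all rendering threads are synchronized. This is a sketch of the pattern only. It does not reproduce Manta's API, and `TransactionQueue` and `run_frames` are hypothetical names.

```python
import threading


class TransactionQueue:
    """Collects state-changing callbacks to be applied between pipeline stages."""

    def __init__(self):
        self._pending = []
        self._lock = threading.Lock()

    def add(self, fn):
        with self._lock:
            self._pending.append(fn)

    def apply_all(self):
        with self._lock:
            pending, self._pending = self._pending, []
        for fn in pending:
            fn()


def run_frames(num_threads, num_frames, transactions, render):
    """Render frames on several threads and apply transactions at each barrier."""
    # The barrier action runs in exactly one thread, after all threads have
    # arrived and before any of them is released into the next frame.
    barrier = threading.Barrier(num_threads, action=transactions.apply_all)

    def worker(tid):
        for frame in range(num_frames):
            render(tid, frame)
            barrier.wait()

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Since the shared state only changes while every worker is parked at the barrier, the render callback can read it without taking a lock.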
C. LOD Based Mesh Rasterization
Relying on efficient visibility determination alone is not sufficient to
ensure interactive performance for highly complex scenes with a lot of very
fine details, since, in order to bound the amount of data required for a given
frame, a prefiltered representation of details must also be available. When
dealing with very large detailed meshes, such as those generated by laser
scanners, some of the highest performance systems to date are based on the
rasterization of multi-resolution point- or vertex-hierarchies constructed
off-line through a geometric simplification process.
仅仅采用有效的可见性剔除并不能完全确保对于带有很多细节的复杂场景的交互性能,因而,为了使一帧中的数据的总量在一定的范围内,预过滤的细节表达必须存在。当对十分巨大的细节网格进行处理时。当前最高性能的系统是基于光栅化的多分辨率并通过一个几何简化阶段离线的构建顶点的层次。
For instance, the Quick-VDR [29] system is constructed around a
dynamic LOD representation, and achieves interactive performance by combining
various techniques mentioned in earlier sections. To efficiently provide
view-dependent rendering of massive models based on dynamic LODs, Quick-VDR
proposed a clustered hierarchy of progressive meshes (CHPM). The CHPM consists
of two parts: a cluster hierarchy and progressive meshes. Quick-VDR represents
the entire dataset as a hierarchy of clusters, which are spatially localized
mesh regions. Each cluster consists of a few thousand triangles. The clusters
provide the capability to perform coarse-grained view-dependent (or selective)
refinement of the model. They are also used for visibility computations and
out-of-core rendering. Then, Quick-VDR precomputes a simplification of each
cluster and represents a linear sequence of edge collapses as a progressive
mesh (PM). The PMs are used for fine-grained local refinement and to compute an
error-bounded simplification of each cluster at runtime. Also, explicit
dependencies between clusters are maintained in order to guarantee crack-free
simplifications on the mesh. The major benefit of the CHPM representation is
its ability to provide efficient yet effective dynamic LODs for massive models
by combining coarse-grained refinement based on clusters and fine-grained local
refinement providing smooth transitions between different LODs.
Quick-VDR系统围绕一个动态的LOD表达进行构建,组合上述的几个技术来实现交互帧率的渲染效果。为了有效的以基于动态LOD的方法提供视点相关的渲染大规模模型,系统提出了一个聚类层次的渐进网格。包括两个部分,一个聚类的层次,一个是渐进网格。系统对于整个数据集以聚类的层次进行表达,以网格区域的空间坐落为依据。每一个聚类都包含有几千个三角形。
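Fine-grained refinement within one cluster can be sketched as choosing how many edge collapses of its progressive mesh to keep applied for a given error budget. The sketch assumes that the error recorded per collapse is non-decreasing along the sequence and uses a simple perspective error metric. Both are illustrative assumptions, not Quick-VDR's exact formulation.

```python
import bisect
import math


def allowed_error(distance, fov_y_deg, viewport_height_px, pixel_tolerance):
    """Object-space error that projects to `pixel_tolerance` pixels at `distance`."""
    focal_px = (viewport_height_px / 2.0) / math.tan(math.radians(fov_y_deg) / 2.0)
    return pixel_tolerance * distance / focal_px


def collapses_to_apply(collapse_errors, max_error):
    """Number of leading edge collapses whose error stays within max_error.

    collapse_errors[i] is the error after applying collapses 0..i and must be
    sorted in non-decreasing order.
    """
    return bisect.bisect_right(collapse_errors, max_error)
```

A cluster close to the viewer gets a small error budget and keeps few collapses (more triangles), whereas a distant cluster can apply most of its sequence.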
Quick-VDR can render massive models without a significant loss of image
quality, even when not all the data structures fit into the available main
memory. Also, conservative visibility culling implemented with
hardware-accelerated occlusion queries is integrated with rendering of the
CHPM representation.
A major downside of this method is a relatively low GPU vertex cache
utilization during rendering dynamic LODs compared to rendering static LODs.
However, this low cache utilization was addressed by employing cache-oblivious mesh
layouts for ordering of triangles and vertices of dynamic LODs.
不足是对于GPU顶点缓存的使用率低。
The Adaptive
TetraPuzzles [4] (ATP) system also introduced a solution based on a patch-based multi-resolution data structure, from
which view-dependent conforming mesh representations can be efficiently
extracted by combining precomputed patches. In the ATP case, however, the
system does not need to maintain explicit dependencies, since the method uses a
conformal hierarchy of tetrahedra generated by recursive longest edge bisection
to spatially partition the input mesh. In this case, each tetrahedral cell
contains a precomputed simplified version of the original model, which is
constructed off-line during a fine-to-coarse parallel out-of-core simplification
of the surface contained in diamonds (sets of tetrahedral cells sharing their
longest edge). Appropriate boundary constraints are introduced in the
simplification process to ensure that all conforming selective subdivisions of the
tetrahedron hierarchy lead to correctly matching surface patches. At run-time,
selective refinement queries based on projected error estimation are performed
on the external memory tetrahedron hierarchy to rapidly produce view-dependent continuous
mesh representations by combining precomputed patches.
基于批的多分辨率数据结构,以四面体最长对角边二分的方法组织空间
In the Quick-VDR and ATP systems, the coarse-grained LODs also serve for
out-of-core management, which is done by explicitly maintaining LRU caches of
mesh patches.
D. Switching to Alternate Rendering Primitives
交替的图元渲染
Adaptive meshing systems such as those discussed above tend to perform
best for highly tessellated surfaces that are otherwise relatively smooth and
topologically simple, since it becomes difficult, in other cases, to derive
good “average” merged properties. Since performing iterative mesh
simplification does not provide visually adequate simplifications when dealing
with complicated topology, geometry and appearance, systems have started to
appear that use alternate rendering primitives for data prefiltering.
自适应网格系统比较适应于高度网格化的表面以及相对光滑与拓扑简单的对象。当处理复杂拓扑关系的、几何与表面属性的对象时,执行迭代的网格简化并不能提供视觉上满足的简化效果。系统开始采用变化的图元来进行数据的过滤。
The Far
Voxels [10] system, for instance, exploits the programmability and
batched rendering performance of current GPUs, and is based on the idea of
moving the grain of the multi-resolution surface model up from points or
triangles to small volumetric clusters, which represent spatially localized dataset
regions using groups of (procedural) graphics primitives.
Far系统,充分利用了当前GPU的可编程能力与批处理能力,基于的思想是将表面模型的多分辨率的粒度从点或三角形转移到小的体簇,这些体元是使用空间区域的一组图元进行表达的。
The clusters provide the capability of performing coarse grained view-dependent
refinement of the model and are also used for on-line visibility culling and
out-of-core rendering. Figure 13 provides an overview of the approach. To
generate the clusters, the model is hierarchically partitioned with an axis-aligned
BSP tree. Leaf nodes partition full resolution data into fixed triangle count
chunks, while inner nodes are discretized into a fixed number of cubical voxels
arranged in a regular grid.
簇的使用提供了粗的视点相关的简化并可以使用在线的可见性剔除和外核的绘制。产生簇,模型层次的采用BSP树进行分割。叶子节点剖分成完全的分辨率数据,并且限定在一定的三角形数量中。而内部的节点被离散成一定数量的立体体元用规则的格网进行组织。
Finding a suitable voxel representation is challenging, since a voxel
region can contain arbitrarily complex geometry. To simplify the problem, the
method assumes that each inner node is always viewed from the outside, and at a
distance sufficient to project each voxel to a very small screen area (say,
below one image pixel). This constraint can be met with a suitable view-dependent
refinement method, that refines the structure until a leaf is encountered or
the image of each voxel is small enough. Under this condition, a voxel always
subtends a very small viewing angle, and a purely direction dependent
representation of shading information is thus sufficient to produce accurate
visual approximations of its projection.
如果找到一个合适的体元是困难的,因为一个体元区域可能包括有任意数量的复杂几何。为了对问题进行简化,本方法假设每一个内部节点是要从外部进行观察的,并且在一定的距离上足够将一个体元投影到一个十分小的屏幕区域。这个约束可以使用一些合适的视点相关的简化方法来实现,对这个结构的简化只到一个叶子节点的投影的影像足够小。在这种情况下,一个体素经常包含在一个十分小的可见角下,一个完全方向相关的表达用于着色可以产生它的投影的近似的精确效果。
To construct a view-dependent voxel representation, the method employs a visibility
aware sampling and reconstruction technique. First, a set of shading
information samples is acquired by ray casting the original model from a large
number of appropriately chosen viewing positions. Each sample associates a
reflectance and a normal to a particular voxel observation direction. Then, these
samples are compressed to an analytical form that can be compactly encoded and
rapidly evaluated at run-time on the GPU to compute voxel shading given a view
direction and light parameters.
为了创建一个视点相关的体元表达,这个方法采用了一个可见性已知的采样与重建方法。首先,一系列的着色采样信息通过从大量合适的选择视点位置进行原始模型的射线投射。每一个采样与一个反射以及指向一个特定体元观察方向的法线相关。然后,采样采用解析模式进行压缩,它可以被简洁的编码并快速的在运行时使用GPU求值以计算体元的在给定观察方向和光线参数下的着色。
At rendering time, the volumetric structure, maintained off-core, is
refined and rendered in front-to-back order, exploiting vertex shaders for GPU
evaluation of view-dependent voxel representations rendered as point primitives,
hardware occlusion queries for culling occluded sub-trees, and asynchronous I/O
for avoiding out-of-core data access latencies. Since the granularity of the
multi-resolution structure is coarse, data management, traversal and visibility
culling costs are amortized over many graphics primitives, and disk/CPU/GPU
communication can be optimized to fully exploit the complex memory hierarchy of
modern graphics PCs.
在渲染时,体元结构保持在核外,以由前到后的顺序进行精化与渲染,采用顶点着色器用于GPU评估视点相关的体元表达以点图元的方法进行渲染,硬件遮挡查询用于遮挡剔除子树,一个异步的IO用于避免外核数据访问的延迟。由于多分辨率结构的粒度是比较粗的,数据的管理、遍历以及可见性剔除的代价是通过许多图元共同承担的。而磁盘CPU与GPU的通讯可以进行优化以充分的利用复杂的现代图形PC内存层次。
The resulting technique has proven to be fully adaptive and applicable to
a wide range of model classes, that include very detailed colored objects
composed of many loosely connected interweaving detailed parts of complex
topological structure (see Figure 14). Its major drawbacks are the large
preprocessing costs and the aliasing and transparency handling problems due to
the point splatting approach.
问题是预处理的代价以及走样和透明处理的问题
Even if it is often neglected, finding good LOD representations is also of
primary importance for ray tracing systems. Even if visibility determination in
ray tracing has logarithmic growth-rate, due to the use of acceleration hierarchies,
the runtime access pattern of massive model ray tracing can be very incoherent.
For instance, most portions of the hierarchies and meshes can be accessed and
traversed during ray–triangle intersection tests when generating zoomed out
views of very large models. Therefore, in the absence of a suitable LOD representation,
the working set size of ray tracing can be very high, and when it becomes
bigger than the available main memory, the performance of ray tracing is
significantly degraded. To address this issue, the R-LOD
[26] system has introduced an LOD representation for ray tracing tightly
integrated with kd-trees. Specifically, an R-LOD
consists of a plane with material attributes (e.g., color), which is a drastic
simplification of the descendant triangles contained in an inner node of the
kd-tree, as shown in Figure 15, and is similar to one of the shaders employed
by the Far Voxels system. Each R-LOD is also associated with a surface
deviation error, which is used to quantify the projected screen space error at
runtime.
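An R-LOD as described above amounts to a plane plus an error value, so intersecting it is a ray-plane test. The following is a minimal sketch under assumptions: the plane is treated as unbounded (a real implementation would clip it to the kd-node's box), and the error metric simply scales the deviation by a hypothetical `pixels_per_unit` factor at unit distance.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def intersect_rlod(origin, direction, plane_point, plane_normal, eps=1e-9):
    """Return the ray parameter t of the hit with the R-LOD plane, or None."""
    denom = dot(direction, plane_normal)
    if abs(denom) < eps:
        return None  # ray is parallel to the plane
    to_plane = tuple(p - o for p, o in zip(plane_point, origin))
    t = dot(to_plane, plane_normal) / denom
    return t if t > 0.0 else None  # ignore hits behind the ray origin


def rlod_is_sufficient(deviation_error, distance, pixels_per_unit, tolerance_px=1.0):
    """Assumed LOD metric: projected surface deviation must stay within tolerance."""
    return deviation_error * pixels_per_unit / distance <= tolerance_px
```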
If an R-LOD representation of a kd-node has enough resolution for a ray
according to an LOD metric, hierarchy traversal for that ray stops and a
ray-LOD intersection test is performed instead of ray-triangle intersection tests. As a major
benefit, this method can reduce the working set size by traversing
a smaller portion of the hierarchy and providing LOD representations for the input
model. As a result, it can drastically improve the performance of ray tracing
massive models. Moreover, the R-LOD representation can improve the SIMD
utilization of ray-coherence techniques. This is mainly possible because an LOD
representation is likely to be chosen for a kd-node if hierarchy traversals of
rays in a ray packet start to show low coherence.
As a downside, this method does not provide a full LOD
solution for arbitrary rays, especially under the non-linear transformations caused by
refractions and reflections off curved surfaces. Moreover, in some cases,
viewers can observe visual artifacts, which is a very serious problem for ray
tracing.
E. Summary
Today, a broad variety of massive model rendering systems exists that
allow for the interactive display of very large models. While this rapid survey
is by no means exhaustive, the analyzed systems provide a good sampling of
today’s available options for implementing a state-of-the-art rendering system. From
this overview it can be seen that rendering systems are typically non-trivial
frameworks that need to incorporate many techniques, usually from all of the
categories presented in the previous sections. Today, no universal system exists that can handle all massive
model application scenarios. As should be clear from this brief
analysis, many options exist, but at the same time, many more similarities
among systems exist than may appear at first glance, since in the end, all
systems pick from the same bag of techniques.
目前没一一个通用的系统可以处理所有的大规模模型的应用场景。
VII. CONCLUSION
In this article, we have examined various techniques of improving rendering
performance for massive 3D synthetic environments. Such massive models are
increasingly common as a result of the phenomenon known as information
explosion in more general contexts. This trend is bound to continue: for instance,
just consider that today’s massive model scenes have a really small complexity
compared to real-life environments. It can be argued that, while a number of
applicable solutions exist, efficient processing of large scale datasets for interactive
visualization is a challenging open-ended problem that must be continuously
addressed. By learning from the last decade of research in this field, and
taking into account the current hardware evolution, it is possible to find some
common guidelines, and draw some conclusions on how to realize current systems
and plan for the future.
本文中我们讨论了多种技术用于提高大规模三维综合场景的渲染性能。这种大规模模型的增加是我们称之为信息爆炸的结果之一。这种趋势仍将继续。例如,只考虑今天的大规模模型场景相比于我们真实生活的环境,复杂性小到可以不计。可以说明的是,当一定的可行的解决方案出现时,如何有效的处理大范围数据集用于交互可视化是一个可修整的挑战,需要进一步的讨论。通过对本领域近十年研究的学习,并考虑到当代硬件的技术进步,要发现一些公用的指导方针是可能的,并可以下一些结论用于将来如何当前的系统来实现。
We have seen that, at the broad level, current massive model rendering
approaches can be classified into rasterization or ray tracing methods. We
argue that, when it comes to dealing with massive datasets, the underlying issues
are somewhat similar. All the methods have to deal with the same data
management and filtering problems and are converging towards proposing similar
solutions, based on spatial indexing, data reduction techniques, and data
management methods. Even though in the past the ray tracing and rasterization
fields have independently developed their own approaches, it is likely that
future systems will incorporate hybrid approaches, in particular as graphics hardware
is becoming more and more programmable and will allow for executing rasterization
and ray tracing side-by-side.
我们已经发现,广泛的来说,当前的大规模模型的渲染方法可以被分成光栅化与射线追踪的方法两类。我们认为,当处理大规模模型时,底层的思想是相似的。所有的方法都必须处理相同的数据管理、数据过滤问题,并且将会趋向于提出想似的方法,都将基于空间索引、数据的减少技术和数据的管理技术。即使在以前,射线追踪与光栅化的方法独立的发展他们不同的方法,在将来,系统将可能集成两类方法进行混合,特别是将图形硬件变得越来越通用且可编程能力越来越强允许同步执行光栅化与射线追踪。
One point that is now clear in massive model visualization is that, while
large-scale rendering problems cannot just be solved by waiting for more
powerful hardware, hardware trends dictate which methods are successful and which
are doomed to be practically inefficient. The challenge
is thus in designing methods able to capture as much of the performance growth
as possible. Current multi-core CPU systems and GPUs excel at massively
parallel tasks with good memory locality, since the gap between computation
performance and bandwidth throughout the memory hierarchy is growing.
For this reason, we expect that methods for carefully managing working set size, ensuring
coherent access patterns, as well as data parallel techniques will increasingly
gain importance. Up until very recently, the various problems
handled by a renderer were independently solved. It is now increasingly clear
that there are important couplings between the different components of a
massive model renderer. For instance, generating good levels of detail for
very complex models requires visibility information. At the same time, the
availability of a multi-resolution model typically increases data size, but is essential to
increase memory coherence and reduce working set size. At present,
good solutions are available for a restricted number of situations and
restricted classes of models.
大规模模型可视化中可以明确提出的一点是,当进行大范围渲染时,不能只通过等待更强的硬件来解决,硬件趋向于控制哪些方法是成功的哪些实际是效率不高的。这个困难因此要求设计方法能可获得尽可能多的增加。当前的多核CPU与GPU擅长于大量并行的有好的内存局部性的任务,由于计算性能与多层次内存带宽之间的差距在增加,因此,我们期望仔细管理工作集大小、确保一致性访问模式的方法跟数据并行技术一样获得增长。直到目前,很多的渲染器只是独立的解决一些问题。可以明确指出的是,不同的大规模模型的渲染器的不同部件的合作是十分重要的。例如,产生好的细节层次需要可见性信息。同时,多分辨率模型的可见性增加了数据的大小,但是增加内存一致性和减少数据集是必须的。目前,一些好的解决方法是针对于一定的情况一定类型的模型。
Although there have been many advances in massive model rendering
techniques for static models, there has been relatively little research effort
on dealing with time-varying, dynamic, or animated models. Since these dynamic
models are getting easier to model and capture, it is expected that there will
be higher demand for efficient dynamic model management techniques. Currently,
there is a new trend in investigating how to rapidly build and update
acceleration structures, and how to best trade culling efficiency against construction
time. While this research is so far mainly focused on much smaller dynamic
models (see, e.g., [23]), the results are equally important when dealing with
massively complex scenes, where such methods are not only applicable for animation,
but also for fast preprocessing. These incremental methods also need to be
extended to tasks other than spatial indexing, e.g., generating LODs.
虽然在静态模型中,大规模模型的渲染取得了一些进展,但少有研究围绕时间变化的、动态的、动画的模型进行。由于动态的模型的获得越来越容易,因此将对动态模型的管理技术有一个大的需求。当前,有一个新的趋势是研究如何快速的构建与更新加速的结构以及如何最好的平衡剔除效率与构建时间。然而这些研究都只是围绕着小的动态模型,当处理大规模复杂场景时,这些研究结果相当重要,这些方法不仅保适应于动画也要适应于快速的预处理。这些方法也需要扩展如空间索引之外的方法。
Finally, very little research has been done on how to adapt advanced
shading and light transport simulation techniques to massively complex scenes,
especially in a real-time setting. Although a tremendous amount of research
targeting photorealistic image synthesis has been carried out in the last decades,
such techniques cannot easily be applied to massive environments. While
handling pure visibility in arbitrarily sized models still remains challenging,
it is likely that additional efforts will be made to also include more
sophisticated illumination effects.
最后,几乎没有方法来处理如果自适应的高级着色和光线传输的模拟技术用于大规模的复杂场景,特别是在实时条件下。虽然有大量的方法集中于研究相片质感的合成,但这些技术不容易直接应用到大规模的场景。就算是单纯的任意大小的模型的可见性问题仍是个难题。