We might as well use these first words in our look at NVIDIA's G80 graphics processor, the first GPU from any consumer graphics vendor to support Windows Vista's Direct3D 10 (D3D10) component, to let you know that this piece marks the beginning of Beyond3D doing things a little differently when it comes to brand new GPU analysis. From now on we'll be splitting things up three ways, focussing separately on the architecture, its image quality and lastly its performance, possibly with other orbiting satellite articles depending on whatever other Cool Stuff™ the chip can do, enables or tickles us with.

That means there'll start to be three separate articles for you to digest, possibly not all coming on the same day depending on what it's all about, but always following each other in short order. That lets us focus on what Beyond3D's famous for first of all, before we go nuts on the other good stuff, allowing time for things to settle down and discussion to form, before the next piece of the puzzle slides into view. Of course each piece will be a complement of all the others, with the right references back and forth, so Arch will reference IQ will reference Perf will reference Arch, and so on and so forth. So even if we just rock out with Arch on launch day, as we are today, you'll still get an idea of the other two. Hope that's cool, let us know what you think since we think it offers more advantages than not to all concerned, authors, downtrodden publisher and readers alike.

We should also drop the bomb that we'll be directly comparing cross-IHV in the future (and for G80), where appropriate, when constructing any piece here at Beyond3D. You'll see AMD and NVIDIA GPUs go toe-to-toe in the same piece, be it for arch, IQ, perf (or whatever) reasons, whenever it makes sense to do so. So that's a couple of shifts in how you'll see us do what we do, and definitely for the better. The pieces as a whole will represent a stronger final product than Beyond3D's been able to produce in the past, and it's that which has driven the change. Gotta keep it fresh and focussed lest the other guys catch up, right? Thus without further ado, we happily present our first stab at the new way of doing things: our NVIDIA G80 Architecture and GPU Analysis.

NVIDIA G80: Architecture and GPU Analysis

Four years and 400 million dollars in the making, NVIDIA G80 represents the company's first truly brand new architecture, with arguably no strong ties to anything they've ever built before. Almost entirely new as far as 3D functions are concerned, and designed as the flagship of their 8-series GeForce product line, the architecture is squarely a D3D10 part, but with serious attention paid to D3D9 performance and image quality. One doesn't beget the other in the world of programmable shading, and NVIDIA seem to want to hit the ground running. Arguably the masters of the compromise (and all modern 3D rendering is one big compromise anyway), the Cali-based graphics company has no problem loving some parts of the chip less than others in the pursuit of the best product for the market they're addressing.

D3D10 makes that a bit more difficult, though. Any D3D10 accelerator must support all base API features without exception, developers requiring a level playing field in order to further the PC as a platform and a fruitful playground for graphics application development, be it games, GPGPU, the next generation of DCC tools, or whatever else the world feels like building. Performance is the differentiator with D3D10, not the feature checklist. There's no caps bit minefield to navigate, nor any base feature that can be waived. So the arrival of such a big new inflection point in the API probably should have been a big hint of what was coming.

So what have NVIDIA come up with for D3D10? We know the API dictates such features as support for geometry shading, constant buffers, a unified shading instruction set, FP32 calculations throughout the entire pipeline, support for non-linear colour spaces in hardware and much, much more, but how did they build it? Well, who really knows as far as D3D10 goes in any depth, because NVIDIA, ever the comedians, decided now is not the time for a driver for such endeavours! But never mind. Being Beyond3D we dug deep, pushing the hardware around as is our wont, making it do naughty things under D3D9 in order to better understand the silicon under the big-ass hood.

We made the trip to the recent Editor's Day, got the skinny, grabbed the board and fled back to base. Now's the time to let you see what G80, mostly in the form of GeForce 8800 GTX, is all about from an architecture perspective. Leaked slides from Chinese websites will only get you so far. For the rest you need this. We look at the chip itself in physical form, then we examine the architecture with a walk across the GPU from front to back (mostly), so onwards we go.

The Chip

NVIDIA carry the GeForce brand into its 8th iteration with G80, the first SKUs to be released being called GeForce 8800 GTX and GeForce 8800 GTS, with NVIDIA resurrecting the GTS moniker for the first time since GeForce 2. G80 itself is probably the biggest and most complex piece of mass-market silicon ever created.

Chip Name: G80
Silicon Process: 90nm (TSMC)
Transistors: 681M
Die Size: 484mm² [21.5mm (w) x 22.5mm (h)]
Packaging: Flipchip + HS (heatspreader)
Pipeline Configuration: 32 / 24 / 192 (textures / pixels / Z samples per clock)
Memory Interface: 384-bit (6 x 64-bit crossbar), GDDR to GDDR4
DirectX Capability: DX10.0 (VS 4.0, GS 4.0, PS 4.0)
Display: None (NVIO)
Host Interface: PCI Express x16

 

 

Built on TSMC's 90HS process, G80 is some 681M transistors big with a die area of roughly 484mm², supporting Direct3D 10 (Shader Model 4.0) and implementing a heavily threaded, unified shader architecture. NVIDIA disguise the actual die with a package that includes a heatspreader module, to more effectively get the heat output from the GPU to the cooling solution. Natively PCI Express, NVIDIA have announced no AGP variant, and this time around we honestly don't expect one for any high-end G80 configuration either, despite the option being there with NVIDIA's own BR-series of bridge ICs designed to glue AGP to PCIe and vice versa.

Support for Shader Model 4.0 means a hefty change in specification, one which we condense down for those looking for a checkbox-style overview that scratches the surface.

  • NVIDIA G80 Architecture
    • Full Direct3D 10 Support
    • DirectX 10 Shader Model 4.0 Support
      • Vertex Shader 4.0
      • Geometry Shader 4.0
      • Pixel Shader 4.0
      • Internal 128-bit Floating Point (FP32) Precision
    • Unlimited Shader Lengths
    • Up to 128 textures per pass
    • Support for FP32 texture formats with filtering
    • Non-Power of two texture support
    • 8 Multiple Render Targets (MRTs)
  • NVIDIA Lumenex Technology
    • Full FP32 floating point support throughout the entire pipeline
    • FP32 floating point frame buffer support
    • Up to 8x gamma-adjusted native multisampling FSAA with jittered or rotated grids
    • Up to 16x coverage sample antialiasing (CSAA)
    • Transparent multisampling and supersampling
    • Lossless color, texture, Z and stencil data compression
    • Fast Z clear
    • Up to 16x anisotropic filtering
  • NVIDIA SLI Support
  • NVIDIA Pure Video HD Technology
    • Adaptable programmable video processor with GPU shader core assist and post processing
    • High Definition video decode acceleration (H.264, VC-1, WMV-HD, MPEG2-HD)
    • Spatial-temporal de-interlacing
    • Inverse 2:2 and 3:2 pull-down (inverse telecine)

 

Time for a quick look at the reference board and our test setup before we dive in to the architecture analysis.

NVIDIA are announcing GeForce 8800 GTX and GeForce 8800 GTS on launch day, and it's the GeForce 8800 GTX we focus on as a means to get you introduced to what the boards are like. GTX is definitely the full-fat version, GTS scaling down not just in G80 configuration but also board form factor.

Board Name: GeForce 8800 GTX
Core Clock Rate: 575MHz
Pixel Pipelines: 24
Pixel Fill-rate: 13800M pixels/s
Texture Fill-rate: 9200M texels/s
Geometry Processor: 128 VS
Geometry Rate: 575M triangles/s
Memory Speed: 900MHz
Memory Bandwidth: 86.4GB/s
Frame Buffer Size: 768MB
Host Interface: PCI Express x16

 

We need a slight reengineering of our 3D tables to account for the fact that G80 doesn't run the entire chip at the same clock, NVIDIA clocking the shader core at more than twice the rate of the rest of the chip. Official clock rates for GeForce 8800 GTX are a 575MHz base clock (bits of the chip front end and the ROPs, mostly) and a 1350MHz (yes, 1.35GHz) shader clock for the majority of the processing hardware, while the back end of the chip communicates with an external memory pool 768MiB in size via a 384-bit, 6-channel memory bus running its GDDR3 at 900MHz.
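As a quick sanity check, the 86.4GB/s figure in the board table falls straight out of the bus width and DRAM clock, assuming the usual two data transfers per clock for GDDR3:

$$ \frac{384\ \text{bits}}{8} \times 900\ \text{MHz} \times 2 = 48\ \text{bytes} \times 1800\ \text{MT/s} = 86.4\ \text{GB/s} $$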

GeForce 8800 GTX products are some 267mm long from backplane to PCB end, carrying dual-slot cooling and dual power inputs (75W each). Weighing 740g, GeForce 8800 GTX is the largest consumer graphics board ever created, the PCB definitely overhanging the edge of any standard 9-hole ATX-spec mainboard by some distance. The side-exit power input connectors are a definite concession to board length, NVIDIA realising that rear-exit connectors would stop GeForce 8800 GTX fitting in even more cases than it already struggles with in the shipping configuration.

You can also see dual SLI connectors that feed into NVIO (pictured below), NVIDIA's brand new I/O processor for handling signal input and output for their next generation of GPUs. NVIO marshals, and is responsible for, all data that enters the GPU over an interface that isn't PCI Express, and everything leaving the chip that isn't going back to the host. In short, it's not only responsible for SLI, but also the dual-link DVI outputs (HDCP-protected), all analogue output (component HDTV, VGA, etc) and input from external video sources.

Samsung 512Mib DRAMs are used for the memory pool, the DRAM giant's K4J52324QC-BJ11 (900MHz) variant chosen for GTX and run at its maximum rated clock. All the DRAM devices are actively cooled by a one-piece, heatpiped heatsink and fan assembly that also actively cools NVIO, as well as the GPU package and some of the power regulation components at the rear end of the board. The cooling assembly is efficient, a large rotary blower-type fan pushing air across the sizable heatsink mass and out of your chassis via the backplane.

Thermally and acoustically the cooler assembly seems to do a sterling job. The heat output from the DRAM pool and the GPU is substantial at times, but the cooler exhausts exchanged heat without any hint of overheating whatsoever, and more impressively at noise levels that are barely higher than the outgoing GeForce 7900 GTX. That product's cooler, one which first made its appearance back on the GeForce 7800 GTX 512, is widely lauded as being the best dual-slot cooler ever used as original equipment. We think that the reference cooler strapped to the GeForce 8800 GTX configuration, which will ship with all initial GeForce 8800 GTX boards from all add-in board partners, deserves similar accolades.

Therefore it's the size of the board and the resulting form factor that will cause most concern. 10.5 inches is a lot of PCB to accommodate in a PC chassis, and some potential customers buying GeForce 8800 GTX as an upgrade or as part of a self-build will undoubtedly need a chassis change to get it to fit comfortably alongside the rest of the PC's configuration. While NVIDIA have made very effective use of the available PCB space, such that the PCB is realistically as short as it can be, the requirements for power dictate that board length is an issue. It's the only thing we dislike about the configuration of GeForce 8800 GTX.

Power consumption wise, NVIDIA ask you to connect two 6-pin power connectors simply because the clock rates of the GTX mean that power consumption topping 150W, the maximum possible via one of the standard connectors plus the PCI Express slot, can be seen depending on what the board is being asked to do. It's not, however, a hint that power consumption will get anywhere near 225W. Far from it, in fact. Our testing indicates a peak single-board consumption of just over 165W, and average power consumption over a range of game tests seems to be no more than any other single board on the market today. The board will shout at you furiously if you don't connect two, however, should you decide you think you can get away with just the one.

Lastly, the G80 on our reference board is A2 silicon, manufactured in week 38 of 2006, inside a square chip package 45mm on a side. NVIO is NVIDIA's NVIO-1 revision, A3 silicon and produced in week 39 of 2006. NVIO-1 dice are currently 7x7mm and are produced on TSMC's 110nm value silicon process, packaged as a flipchip with an apparent 553 pins (we haven't counted them, so don't take it as gospel!). Let's document the test setup used for our G80 architecture investigation work, before diving straight into the goodies.

Test Setup and Methods, and nForce 680i SLI

We tested G80 using the latest high-end enthusiast core logic from NVIDIA, nForce 680i SLI. The top Intel version of its nForce 6-series chipset family, 680i debuts today at the same time as G80 and the G80 board SKUs. We've not had the chance to formally evaluate the board used to house 680i, but first impressions are absolutely excellent. The board is extremely performant, matching and often besting everything we've come across to date for Intel LGA775 microprocessors, and the effort put into its engineering at the board level and in the BIOS has it standing, in our eyes at least, as the enthusiast core logic of choice.

NVIDIA have engineered what effectively amounts to a production-quality reference board, entirely in-house, and will be selling it branded via traditional NVIDIA-only graphics board partners such as eVGA (vendor of the board we have for testing), XFX and BFG. Traditional NVIDIA mainboard partners such as ASUS and MSI will likely engineer their own 680i SLI efforts using their own layout choices and peripheral ICs. The reference board sports 6 SATA2 ports, 10 USB2.0 ports, 2 hardware-level GigE network ports, three PCI Express x16 slots (16-16-8 configuration) and a layout that makes concessions for GeForce 8800 GTX in terms of the availability of ports and connectors when two 8800 GTX boards are installed. In short, we like it, and it was a stable platform for all of our testing.

System Specification

Graphics Card: NVIDIA GeForce 8800 GTX
Processor: Intel Core 2 Extreme X6800, LGA775 (2.93GHz, Core µArch, 4MiB L2, dual-core)
Mainboard: eVGA nForce 680i SLI
Memory: Corsair PC2-6400 DDR2, 2 x 1GiB (4-4-4-12, 1T)
Power Supply: Tagan TG420-U22, 420W EATX
Hard Disk: Seagate ST3160812AS 160GB SATA

Software Specification

Core Logic Driver: NVIDIA nForce 590 9.53
Display Driver: NVIDIA ForceWare 96.94
Operating System: Windows XP Professional, SP2, x86

Testing Methods

For our architecture analysis we use a mix of software, largely biased towards our own in-house tools but with a healthy selection of freely available others. Our own self-written shader programs are run using in-house frameworks based mostly on Cg or MDX, along with raw ASM shader input to D3D9. You'll see us refer to what we used and how we tested as the architecture analysis progresses, and you can freely ask us anything about how we test if there's a detail you need to know via the Beyond3D forums.

It's a unique and highly fruitful way to see what a chip is capable of, and the first departure from how we've done it in the past. The architecture analysis revolves around instruction rate testing using our own issuers, to calculate throughputs and investigate execution limits of the shader core, whatever they may be. We've also written AA sample pattern detectors, floating point filter testers and fillrate benchmarks to help analyse G80 in GeForce 8800 GTX configuration.

Arun Demeure and I are the chief architects of the vast majority of the tests used for this analysis. We also use contributed and freely available tests from George Kolling, Ralf Kornmann, Victor Moya, Mike Houston and pals at Stanford, and also the guys at iXBT/Digit Life, in order to poke and prod, with the big help of everyone in Team Beyond3D and silent friends to make sure we're doing The Right Thing™.

We forge on then with an overview of the architecture before diving in at the deep end.

NVIDIA G80 Overview

A product of a mental technique used to gather my thoughts on previous architecture analysis endeavours, the following diagram represents a somewhat high-level look at how G80 is architected and what some of the functional units are capable of. After the guts of a chip analysis is complete, I find it helps to draw it out (usually on a huge sheet of paper) to organise data flows, note down specs and rates and get a picture of it mentally to refer to when describing the thing in text. This time it turns out that the drawing is decent enough for public consumption and as a basis for what we'll talk about. Click for the full-on version, and be prepared for an altogether scarier one later on.

If it's not clear from the above diagram, G80 is a fully-unified, heavily-threaded, self load-balancing (full time, and agnostic of API) shading architecture. It has decoupled and threaded data processing, allowing the hardware to fully realise the goal of hiding sampler latency by scheduling sampler threads independently of, and asynchronously with, shading threads. Its primary ALU structure is fully FP32 and completely meets IEEE 754 standards for computation and rounding of single precision floats, and supports most '754 specials other than denorm signalling and propagation.

So let's start with the front end of the chip, following data flow pretty much from 1 to 16 on the diagram. We suggest you keep the diagram open as a reference as you progress through the article, to help make some of the text make a bit more sense.

Front End

 

The front end of the chip concerns itself with keeping the shader core and ROP hardware as busy as possible with meaningful, non-wasteful work, sorting, organising, generating and presenting data to be worked on, and implementing pre-shading optimisations to save burning processing power on work that'll never contribute to the final result being presented. G80's front end therefore begins with logic dedicated to data assembly and preparation, organising what's being sent from the PC host to the board in order to perform the beginnings of the rendering cycle as data cascades through the chip. A thread controller then takes over, determining the work to be done in any given cycle (per thread type) before dispatching control of that to the shader clusters and their own individual schedulers that manage the data and sampler threads running there.

In terms of triangle setup, the hardware will run as high as 1 triangle/clock (we measure around 0.75x ourselves, with other tests confirming a rough 0.9x rate), with the setup hardware in the main clock domain, feeding into the raster hardware. That works on front-facing triangles only, after screenspace culling and before early-Z reject, which can throw away pixels based on a depth test apparently up to 4x as fast as any other hardware they've built, and is likely based on testing against a coarse representation of the depth buffer. Given 64 pixels per clock of Z-reject in G71, 256 pixels per clock (and therefore the peak raster rate) is the likely figure for G80, which tallies with measurements taken with Archmark.

Our own tests rendering triangle-heavy scenes in depth-sorted order (and in reverse) don't give away much in that respect, so we defer to what NVIDIA claim and our educated guess. NVIDIA are light on details of their pre-shading optimisations, and we're unable to programmatically test pre-shading reject rates in the detail we'd like, at least for the time being and because of time limitations. We'll get there. However, we're happy to speculate with decent probability that the early-Z scheme is perfectly aggressive, never shading pixels that fail the early-Z test unless pixel depth is adjusted in the pixel shader, since such adjustment would kill any early-Z scheme before it got off the ground, at least for that pixel and its immediate neighbours.

The raster pattern is observed to be completely different when compared to any programmable shading hardware NVIDIA have built before, at least as far as our tests are able to measure. We engineered a pixel shader that branches depending on pixel colour, sampling a full-screen textured quad. Adjusting the texture pattern for blocks of colour and measuring branching performance let us gather data on how the hardware is rasterising, and also let us test branch performance at the same time. Initially working with square tiles, it's apparent that pixel blocks of 16x2 are as performant as any other, so we surmise that a full G80 attempts to rasterise screen tiles 16x2 pixels in size (8 2x2 quads) before passing them down to the shader hardware.
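To give a flavour of the approach (an illustrative sketch only, not our actual in-house test; names, thresholds and loop counts are made up), a branching probe shader of this kind looks something like the following, compiled via Cg and run over a full-screen quad while the bound texture's block pattern is varied:

    // Illustrative Cg fragment program, not Beyond3D's test code.
    // 'blockTex' holds the coloured block pattern; the two branch paths are
    // deliberately unbalanced so that divergence within a rasterised tile
    // shows up as a measurable drop in throughput.
    float4 main(float2 uv : TEXCOORD0,
                uniform sampler2D blockTex) : COLOR
    {
        float4 c = tex2D(blockTex, uv);
        float4 acc = c;

        if (c.r > 0.5)
        {
            // 'expensive' path: a long dependent ALU chain
            for (int i = 0; i < 64; i++)
                acc = acc * acc + c;
        }
        else
        {
            // 'cheap' path: next to no work
            acc = c * 2.0;
        }

        return acc;
    }

Shrinking and reshaping the coloured blocks until the cheap path starts to pay the cost of the expensive one tells you how big, and what shape, the tiles handed to the shader core are.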

We'll go out on a limb and speculate wildly that the early-Z depth buffer representation is based on screen tiles, which the hardware then walks in 16x16 or 32x32 blocks using a defined Morton order or a Hilbert curve (fractal space filling, which I'm happy to admit Arun threw out there for me to ponder for inclusion, such is the way his mind works!), as its way of walking the coarse representation to decide what to reject.

As far as thread setup and issue goes, the hardware will schedule and load balance for three thread types when issuing to the local schedulers, as mentioned previously, and we presume that all thread scheduling runs in the shader clock domain (as would seem most obvious at least). We mentioned the hardware scaling back on thread count depending on register pressure, and we assert again that that's the case. Scaling back like that makes the hardware less able to hide sampler latency, since there are fewer available threads around to run while that's happening, but it's the right thing for the hardware to do when faced with diminishing register resources. So performance can drop off more than expected when doing heavy texturing (dependent texture reads obviously spring to mind) with a low thread count. Threads execute per cluster, but output data from a finished thread can be made available to any other cluster for further work.

Attribute interpolation for thread data and special function instructions are performed using combined logic in the shader core. Our testing using an issue shader tailored towards measuring the performance of dependent special function ops -- and also varying counts of interpolated attributes and attribute widths -- shows what we believe to be 128 FP32 scalar interpolators, assigned 16 to a cluster (just as SPs are). It makes decent sense that attribute interpolation simply scales as a function of that unit count, so for example if 4-channel FP32 attributes are required by the shader core, the interpolators will provide them at a rate of 32 per cycle across the whole chip. We'll talk about special function rates later in the article.

Returning to thread setup, it's likely that the hardware here is just a pool of FIFOs, one per thread type, that buffers incoming per-thread data ready to be sent down the chip for processing. Threads are then dispatched from the front-end to individual per-cluster schedulers that we describe later on. Let's move on to talk about that and branching.

 

Threading and Branching

We talked a bit about how threaded the hardware is previously, so a little more detail about that and its branching performance is prudent. NVIDIA give the technology the marketing banner of GigaThread: threads can jump onto and off of their cluster for free, with a new one ready for processing every cycle if need be, and with 'free' branching via dedicated branch units (relieving the SPs of performing the calculation). Assuming, say, 10 stages each for the MADD and MUL in the SP, and for interpolation, let's call that 30 batches of 16 objects being executed per cluster, in flight and per cycle. And just in case you didn't catch the D3D10 o'clock news, three thread types exist for D3D10, and just two for D3D9 (vertex and pixel).

There's a global scheduler for managing the core, but also a local scheduler per cluster that manages the individual threads being executed there, with thousands maintained 'in flight' by the hardware at any one time. Given ALU pipelining and other considerations, 4K as a count for active SP threads would be a good guess, we think.
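Back-of-the-envelope, and resting entirely on the pipelining assumption above, the cluster count gets you to the same ballpark:

$$ 8\ \text{clusters} \times 30\ \text{batches} \times 16\ \text{objects} = 3840 \approx 4\text{K threads} $$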

That threaded execution core is paired with branching hardware that works at a granularity of either 16 (for vertex data) or 32 (for pixel) objects, with it actually measured at 32 for pixels in our tests as mentioned earlier. Branches happen in one cycle for all thread types, which means branching penalties are minimised, at up to 32 objects per clock across the entire shader core. Contrast that with prior Shader Model 3.0 NVIDIA hardware and its minimum branching granularity of 1024 pixels, and NVIDIA catch up with ATI in the branching performance stakes on a modern GPU (16 pixels for R520, 48 for R580).

We assume a register file per cluster, but no register file for the global scheduler (just FIFOs). Each file is likely big enough to maintain data for several times the number of actively executing threads, hiding sampler latency by allowing the cluster to grab a thread's data, process it and put it back 'for free' as the core executes sampler threads in parallel. One assumes again that if the register file gets full, the hardware just reduces the in-flight thread count until register pressure is relieved and the thread count can creep back up over a number of cycles, the heuristics for which should be fairly simple, we imagine.

We'll cover just what the hardware runs on its shading core on the next page.

Unified shading core

 

At the core of G80 is a homogeneous collection of floating point processors, be they for math ops or data 'address' (it's way more than just calculation of a memory address in reality) and fetch. Each math processor is a scalar (1-channel) ALU with exposed FP32 precision (full time, no partial precision calculations are possible), with rounding properties that conform to IEEE754 standards and that can theoretically (supposedly) dual-issue a MADD and MUL instruction in one cycle. More on that later. There are 128 such processors (called SPs by NVIDIA) in a full G80, grouped in clusters of 16, giving the outward appearance of an 8-way MIMD setup of 16-way SIMD SP clusters.
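For reference, taking the dual-issue claim at face value for the moment (the MADD counting as two floating point ops per cycle and the MUL as one), the peak programmable shading rate of a full G80 at GeForce 8800 GTX clocks works out as below; more on that elusive MUL later.

$$ 128\ \text{SPs} \times 1.35\ \text{GHz} \times 2\ \text{(MADD)} = 345.6\ \text{GFLOPS}, \qquad \times 3\ \text{(MADD+MUL)} = 518.4\ \text{GFLOPS} $$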

Each cluster can process threads of three different types (vertex, geometry and pixel), independently and in the same clock cycle (and in fact it's likely more granular than that), and that's the crux of how the shading core works. With a scheduler per cluster and processing of all thread types, you have yourself a unified shading architecture: no cluster is specifically tied to vertex, geometry or pixel data for the duration of the chip's operation, and each has the ability to perform whatever work is needed whenever it's needed. That almost self-scheduling, dynamically balanced, independent shading is what defines G80 from a data processing perspective. It means that in any given cycle all 128 SPs may be working on the same thread type, but equally, as a function of the thread schedulers running each cluster, they may be working on vertex, geometry and pixel data in the same cycle.

 

The threaded nature extends to data fetch and filtering too, the chip running fetch threads asynchronously from the threads running on the clusters, allowing the hardware to hide fetch and filter latency as much as possible and keep the chip working as close to maximum efficiency as it can. Hopefully by now the fully scalar and superthreaded (horrible word, but apt) nature of G80 is becoming clear. The chip is a radical departure from G7x, where the architecture was fixed by thread type (8 vertex processors, 6 quads of fragment processors) and data fetch was coupled to the processor ALUs, so you could frequently stall an ALU waiting for data to become available. The old ALUs were also fixed-width vector processors, 5 or 4 channels wide with somewhat limited co- and dual-issue opportunity.

G80 sports none of those architectural traits.

Threads can jump on and off the SPs for free, cycle-by-cycle, they are dynamically allocated to do whatever work is currently needed, and since the chip is entirely scalar you get what amounts to auto-vectorisation (effectively filling up unused channels in the ALU that might otherwise have sat idle) for free, the goal being to keep as many of the 128 slots doing work as possible. You might see folks attempting to divide the chip up nice and neatly in order to compare it to previous chips gone by, but we've already hinted how disingenuous that is, and it does a disservice to the way the architecture works.

Each cluster has its own data cache and associated register file (we guess the hardware is optimised for two or three FP32 temporary registers per object), and data sampling ability by means of data address and fetch units. We mentioned that data fetch is threaded asynchronously to the shader core, which is true, but fetch and address is still tied to a cluster, such that threads on those data samplers can only feed the cluster they're tied to, at least initially. Clusters can pass data to each other, but not without a memory traversal penalty of some kind, and we're unable to measure that explicitly for the time being.

Each cluster also has 4 pixels per clock of address ability, but 8 INT8 (FP16 half speed) bilerps per clock of fetch and filtering, in order to get data out of VRAM or cache and into the core (for any thread type, of course) for working on. Therefore the hardware will give you two bilerps per pixel address for 'free' (proven with in-house codes up to and including FP32 per channel), rather than the usual one. So that's 32 pixels per clock (ppc) of INT8 bilinear with 2xAF effectively free, or 32ppc of FP16 bilinear, or 16ppc of FP16 2xAF, or 16ppc of FP32 bilinear, out of the combined texture hardware across the chip, giving you a few rates to ponder as we talk about the rest. We'll get back to fetch and filter performance later in the article.
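Summing the per-cluster figures across the chip, and remembering the sampler hardware runs at the base clock (covered shortly), the raw chip-wide peaks for a GeForce 8800 GTX work out roughly as:

$$ 8 \times 4 = 32\ \text{addresses/clock}, \qquad 8 \times 8 = 64\ \text{INT8 bilerps/clock} \times 575\ \text{MHz} \approx 36.8\ \text{G bilerps/s} $$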

Special function ops (sin, cos, rcp, log, pow, etc) all seem to take 4 cycles (4 1-cycle loops, we bet) to execute and retire, performed outside of what you'd reasonably call the 'main' shading ALUs for the first time in a programmable NVIDIA graphics processor. Special function processing consumes available attribute interpolation horsepower, given the shared logic for processing in each case, NVIDIA seemingly happy to make the tradeoff between special function processing and interpolation rates. We covered that in a bit of detail earlier. Each cluster then feeds into a level 2 cache (probably 128KiB in size), and from there data is either sent back round for further processing, stored off somewhere in an intermediary surface, or sent to the ROP for final pixel processing, depending on the application.

That's effectively the shader core in a (fairly big) nutshell, making the sampler hardware next on our topic list.

Data fetch and filtering

In our diagram, we showed you how each cluster has what are effectively dedicated paths for data fetch and filtering (let's call that sampling to save some keystrokes and time) logic to service them. Rather than a global sampler array, each cluster gets its own, reducing overall texturing performance per thread (one SP thread can't use all of the sampler hardware, even if the other samplers are idle) but making the chip easier to build.

The sampler hardware per cluster runs in a separate clock domain to the SPs (a slower one), and with the chip supporting D3D10 and thus constant buffers as a data pool to fetch from, we presume each sampler section has a bus to and from L1, and likely to and from dedicated constant buffer storage nearby on the chip. With measured L1 size seemingly 8KiB, other buffers for constant storage are practically confirmed, especially given their focus in D3D10. We measure cache sizes by fetching multi-size and multi-component texture surfaces into our shader, measuring performance and making what's essentially a guess about size, but one we're confident in, based on where we see performance decreasing as cache misses start to occur.
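To give a flavour of that approach (an illustrative sketch only, not our actual tool; the tap count and offsets are made up), a probe shader takes a handful of taps spread across the bound surface, and between runs you swap in progressively larger or wider-format surfaces, watching for the knee in frame time as the working set spills out of L1:

    // Illustrative Cg cache-probe fragment program, not Beyond3D's in-house tool.
    // Between runs the bound surface's dimensions and component width are varied;
    // throughput falls away once the working set no longer fits in the sampler's L1.
    float4 main(float2 uv : TEXCOORD0,
                uniform sampler2D surf) : COLOR
    {
        float4 acc = float4(0.0, 0.0, 0.0, 0.0);

        // Taps spread across the whole surface, so the per-frame working set
        // approaches the full texture rather than a tiny local tile.
        for (int i = 0; i < 16; i++)
        {
            float2 offset = float2(i * 0.0625, i * 0.0625);
            acc += tex2D(surf, frac(uv + offset));
        }

        return acc * (1.0 / 16.0);
    }

Repeat across surface sizes and component counts and the performance cliff gives away the approximate cache size, which is exactly the educated guesswork described above.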

Data fetch and filtering is tied implicitly to whatever memory controller and caching schemes a GPU might employ, as well as to the DRAMs the chip is connected to and their performance. We'd be very surprised, therefore, if the chip didn't have dedicated storage for constant buffers on the chip (and the ability to reuse it under D3D9 as well), or at least a tuned ability to pull constants out of non-specific cache efficiently. Saying that, we question whether any dedicated sampler hardware exists to match constant buffer fetching being unfiltered, and with different access patterns across its memory space, compared to the other data accesses the chip is already being asked to perform. Therefore under D3D10 we suggest that base filter rates will decrease as constant buffers are accessed using the same logic, but we also suggest that constant fetch is sufficiently optimal that the performance hit stays acceptable.

Filtering wise, the sampler hardware provides bilinear as the base filtering mode with trilinear and anisotropic (non-box, perspectively correct) filtering as options for all surface types the sampler hardware can access. Up to 16x anisotropic filtering is again available, and the out-of-the-box setting with all shipping drivers will be a high level of angle invariance. Put simply, more surfaces in the scene will receive high levels of filtering as requested by the application or the user via the control panel, as a default matter of course, raising the minimum level of image quality with surface filtering enabled by a significant amount.

We've been banging the image quality drum for some time now, with a view to it being raised as a default by the main IHVs when the hardware has a surfeit of performance to do so. It happens with G80, big time, and should not be underestimated or glossed over. While other base filtering optimisations remain in the driver's default settings as it controls the hardware, the return to almost invariant filtering with G80 and GeForce 8-series products is most welcome. The hardware can also filter somewhat orthogonally, which deserves a mention. The chip can filter any and all integer and floating point surfaces it can access in the sampler hardware, and that includes surfaces with non-linear colour spaces. Filtering rates essentially just become a product of consumption of the available per-cycle bilerps.

The sampler hardware runs at base clock, as do the on-chip memories and the back end of the chip, which is conveniently next on the list.

ROP, Display Pipe and PureVideo

 

Ah, the ROP hardware. In terms of base pixel quality, it's able to perform 8x multisampling using rotated or jittered subsamples laid over a 4-bit subpixel grid, looping through the ROP taking 4 multisamples per cycle. It can multisample from all backbuffer formats too, NVIDIA providing full orthogonality, including sampling from pixels maintained in a non-linear colour space or in floating point surface formats. Thus the misunderstood holy grail of "HDR+AA" is achieved by the hardware with no real developer effort. Further, it can natively blend pixels in integer and floating point formats, including FP32, at rates that diminish somewhat with the bandwidth available through the ROP (INT8 and FP16 full speed (measured), FP32 half speed). Each pair of ROPs shares a single blender (so 12 blends per cycle), going by our empirical testing.

Sample rates in the ROP are hugely increased over previous generation NVIDIA GPUs, a G80 ROP empirically able to sample Z 8 times per clock (4x higher than G7x ever could), with that value scaling for every discrete subsample position, per pixel, bandwidth permitting of course. Concluding 'free' 4xMSAA from the ROP, given enough bandwidth, is therefore not much of a stretch of the imagination, and the advantages to certain rendering algorithms become very clear. The 24 ROPs in a full G80 are divided into partitions of 4 each, each partition connecting to a 64-bit memory channel out to the DRAM pool for intermediary or final storage.
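Those per-ROP figures divide out neatly against the chip-wide specification quoted earlier, and give a theoretical peak that comfortably covers the Z-only fill we measure later on (nearer 70Gpixels/s, presumably with bandwidth the limiter):

$$ 24\ \text{ROPs} \times 8\ \text{Z samples/clock} = 192\ \text{Z samples/clock} \times 575\ \text{MHz} \approx 110\ \text{G Z samples/s} $$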

Further to the traditional multisampling implementation, the ROP also assists in a new type of exposed AA method called coverage sampling. Combined with multisampling, CSAA will take extra binary coverage samples in the subpixel, to determine if triangles intersect the subpixels at those sample points. If they do, the sample locations are used to influence the sample resolve and thus the colour blended into the framebuffer (or so I understand at least, NVIDIA are somewhat vague). The sample data consumes much less storage space per pixel than multisamples, and the hardware falls back to the basic multisample mode should coverage sampling fail for the pixel (I presume a threshold of samples show no intersections at all, or possibly if rendering is done in a way other than front-to-back). The ROPs share access to an L2 data cache that's likely 128KiB with access likely optimised for reading data aligned on 8-byte boundaries (a full FP16 pixel).

Stencil update rates only scale with ROP count compared to G71, NVIDIA not spending any area improving that facet of performance. We measure 96 stencil tests per clock with G80 and 64 with G71, proving that's the case. Depth tests needed for CSAA are 2 for 2x, 4 for 4x, 8 for 8x, 8 for 8xQ, 12 for 16x, and ~20 for 16xQ, but we'll discuss that in the IQ piece. Finally, as far as this somewhat quick look at the new ROP goes -- it's easier to talk about a ROP that sees little restriction on what it can process, after all -- it'll also do the traditional depth check for occlusion, per pixel, and forgo the write to memory if it fails, saving memory bandwidth (but not saving on shading resources or ROP bandwidth any more than it has done in prior architectures). We leave D3D10 (and D3D10.1) AA considerations for a chip like G80 until we have that pesky D3D10 driver.

Display Pipe and PureVideo

While G80 itself contains no display logic, it still has to present final pixel data to NVIO, so the internal workings of the chip in a colour precision sense are definitely worth a mention. G80 has the ability to work within the bounds of 10 bits of resolution per colour component from data input to data output, supporting 10bpc framebuffers without issue and allowing direct transmission and scanout from those via NVIO.

As far as video processing goes, that's where you'll find some of the only carried-over functional blocks from previous GPUs in G80. While nearly 100% of the transistors for 3D processing are new according to NVIDIA (and we're inclined to believe them), what NVIDIA call PureVideo -- their on-GPU logic primarily for the decode of motion video -- is (mostly) carried over from G7x, so we're told. As fixed function logic operating in the chip's lower clock domain and with frequencies in the same range as older NVIDIA GPUs, it makes some sense to do so. Like with G7x, the shading core augments what largely amounts to post-decode processing in G80, improving video quality (maybe just using one cluster to provide baseline expectations, but there's no honest reason for NVIDIA to hardcode it that way, especially with high resolution video support).

We've got room before we finish to talk about some theoretical rate work we did, G80's application to GPGPU and image quality before we wrap up for the time being.

Measured Performance Discussion

While we have an entire performance piece in the works, there's some architecture specific performance data that we'd like to share before we release that part of the G80 article series.

Rys and Uttar In The Case Of The Missing MUL

NVIDIA's documentation for G80 states that each SP is able to dual-issue a scalar MADD and MUL instruction per cycle, and retire the results from each once per cycle. The thing is, we can't find the MUL, and we know another Belgian graphics analyst who's having the same problem. No matter what we tried, be it Arun working with Cg or myself pushing instructions through the hardware using MDX, and no matter the dependent instruction window in the shader, the peak MUL issue rate -- publicly quoted by NVIDIA at Editor's Day -- never appears.

We can push almost every other instruction through the hardware at close to peak rates, with minor bubbles or inefficiencies here and there, but dual issuing that MUL is proving difficult. That is unless we ask the hardware to issue it for perspective correction, in which case it says hello. NV40 did something similar, where the 1st MUL unit would be used for that, or the 1st MADD unit on G70. Indeed, perspective correction is why ALU1 was partially or fully used when issuing a TEX operation on these architectures, but it was also available for general shading. So there's a possibility it's also doing setup for the special function hardware, and that it'll be made available in future drivers, or not, and we'll keep you posted.

Since RCP was a free op on G7x, you got 1/w for nothing on that architecture to help set up texture coordinates while shading. It's not free any more, so it's calculated at the beginning of every shader on G80 and stored in a register instead.

General Shading Performance

Bar that current mystery, getting (close to) peak base ALU rates was fairly elementary with our own shaders, and different instruction mixes testing dependent and non-dependent instruction groups all seem to run as you'd expect. Significant time spent with special function rates prompted the hallelujah-esque discovery one evening that special function and interpolation are tied together, something NVIDIA's specification doesn't tell you but which we're supremely confident of.

It seems the shader compiler/assembler is in a decent state, then. Getting 150-170G scalar instructions/sec through the hardware is easy, hiding texturing latency works as expected (where you texture for free with enough non-dependent math) and seeing special functions execute at quarter speed (they take 4 cycles to execute, remember, as a function of quadratic approximation) isn't difficult.
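That range sits right where you'd expect against the theoretical scalar issue peak:

$$ 128\ \text{SPs} \times 1.35\ \text{GHz} = 172.8\ \text{G scalar instructions/s} $$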

 

Once into our stride after instruction issue testing was mostly put to bed, we've spent (and continue to spend) good time verifying the performance of the ROP hardware, including the very real verification of near 70Gpixels/sec of Z-only fill, performant 4xMSAA as expected (and also 8x as it turns out, but you need to choose your test wisely since the mix of CSAA can throw you off) and blend rates roughly equal to what NVIDIA quote. We're close to completing blend rate testing, with just FP32 to go.

Texturing performance was up next (or was it before ROP performance? too many late nights testing....), and verifying the sometimes astounding sample rates with filtering is something we've been working on recently. The hardware can and will filter FP32 textures, and we're seeing close to expected performance. Along with all that we've also measured performance in a number of modern games, so we don't go completely overboard on the theory. G80 is damn fast with most modern games, and it seems tuned for 2560x1600, sustaining high performance at that resolution and largely doubling (or more) what any other current single board is capable of. You can see some numbers from that testing over at HEXUS. In short, we've worked hard to make sure the hardware does what it says it can, and bar the missing MUL, which we eventually found, we're happy that G80 doesn't cut major corners. We'll sum it all up in the Performance piece in the near future.

Image Quality

Anisotropic Filtering

We've discussed the angle invariance G80 sports when it comes to texture filtering quality, and the following image comparing 16xAF on G71 and G80 (on the right) says a lot. G, you're welcome!

 

Total invariance eludes modern consumer hardware, but that's pretty close, and the most important thing is that it's the out-of-the-box quality, rather than anything you have to select yourself. We've measured filtering performance on G80 under Half-Life 2 to be very high, and anisotropic filtering is effectively free, depending on your game and target resolution of course. The hardware is able to perform a huge number of bilerps for every screen pixel, though, so don't be scared to ask for AF in your games, if you own a G80-based board.

Antialiasing

We mentioned the hardware can apply true 8xMSAA to a scene, no matter the surface format being used by a game or application. We quickly snapped the sample grid positions and took a shot of it at work in Oblivion with HDR rendering enabled in the game options, and 16xQ AA enabled in the driver control panel.

 

 

We mentioned CSAA before, and using a tool we've written in house we're able to detect the primary and secondary samples that CSAA takes when determining subpixel coverage. We'll reveal all in the Image Quality analysis piece but as a teaser here are the subpixel samples from the 16xQ mode.

 

We'll cover AA performance in the Performance part of our G80 analysis.

Architecture Summary

Our new way of approaching new GPU releases is partly designed to let us go deep on the hardware without producing a single article that's dozens of pages long, letting us stagger our content delivery and take our time on each piece. And if there was ever a GPU to time that for, it'd be G80. A huge deviation from anything they've had the balls to build in the past, its unified and heavily threaded approach to programmable shading -- paired with all the bandwidth, sampling ability and image output quality that you were hopefully expecting regardless of the base architecture -- means that NVIDIA have put a very hefty marker in the sand, and one that took us plenty of time to analyse.

With Vista yet to debut and no publicly usable driver for the Release Candidates either, we're left having it push D3D9-class pixels for the time being, but it does that without any major hiccups or issues that we can see. We found the MUL, rounding off our opinion that the chip does what NVIDIA set out at Editor's Day and are delivering today. As a generational leap from old to new, it's not hard to make comparisons with similar generational standouts like ATI R300. And I think both ATI and NVIDIA will smile at that pretty knowingly.

If there's a major performance flaw in what it's capable of, we're still looking for it, and we've been able to realise theoretical rates or thereabouts in each of the main processing stages the hardware provides, under D3D9 at least. Image quality looks excellent too, NVIDIA providing a big leap in sampling power and putting it to good use in G80. We'll cover performance and image quality in the next two pieces, but it's worth saying now for those wanting a taster on launch day that the thing goes really really fast at times, compared to the outgoing generation, and looks as good or better than anything else that's come before it.

For a look at games performance today we point you to HEXUS and also to Hardware.fr (thanks Damien for all your help!), who'll echo the sentiments here, we're sure. We're pretty sure that G80 isn't what a lot of people were expecting, with Dave Kirk largely responsible given public comments akin to "unified? not needed!" and "orthogonality in the ROP? let the developer do it in their code instead!", but it is what it is, and it's deeply impressive in places with a lot left to discover. Direct3D 10 still awaits.

It's also not hard to see how G80 could sweetly apply to GPGPU, and in that respect NVIDIA have something special for G80 just for that community. Called Compute Unified Device Architecture, or CUDA for short, it's what effectively amounts to a C compiler and related tools for the GPU, leveraging G80's computing horsepower and the architectural traits that make it suitable for such endeavours. We have an in-depth CUDA piece in the works, so look out for that too. Feel free to speculate, though, in the meantime!

So there's still a boatload of analysis yet to do, mostly under Direct3D, but today's first looks at G80, be they architecture-biased or otherwise, should have most keen to find out more or experience the performance for themselves. You'll find reports all round the web that talk about realised performance in games, so go read! Find our discussion thread on G80 now it's officially public knowledge (and what a complete joke that's been), here.

Thanks (the sniffly bit)

Arun for all his help, especially when poking the sampler hardware and the ROPs. G, Tim and the rest of Team B3D for the editing help, idea floating and everything else. Damien @ Hardware.fr for late night work on theoretical shading performance with me on more than one occasion. Adam and Nick at NVIDIA. Scott Wasson at The Tech Report. Chris Ray! Team HEXUS.

Finally, before we go, Uttar drew the chip too like I did. Uttar is crazy. But good crazy. Enjoy.

