How Quad SLI splits up the work
That single SLI bridge connector between the 7950 GX2s leads to a Quad SLI topology that looks like this:
Although each G71 GPU has room for two SLI links to other GPUs, only two of the four GPUs in a GX2 quad config actually use both links. Internally, each GX2 card has an SLI link between its two GPUs. (The card also has a 48-lane PCI Express switch that links both GPUs to the PCIe interface, noted as "X48 PCI-E" in the diagram.) Externally, the single SLI connector bridges between the "primary" GPUs on each card. This arrangement dispenses with the ring topology found in early Quad SLI configs, an arrangement that always seemed unnecessary to me. At the end of the day, pixels rendered by all of the GPUs have to make it to the lone GPU driving the display, anyhow.
With four GPUs in this topology, Nvidia uses several techniques to split up the work between the graphics processors. These are variations of the methods used in dual-GPU SLI.
As with dual-GPU SLI, the preferred method of GPU load balancing is known as alternate-frame rendering (AFR). In a two-GPU config, that means GPU 0 renders the odd frames and GPU 1 renders the even frames, for an every-other-frame arrangement. With Quad SLI, the "alternate" tag isn't quite accurate. Frames are split up sequentially between the four GPUs.
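To make the sequential split concrete, here is a minimal sketch of how frames might be dealt out round-robin. The function name and the zero-based frame and GPU numbering are assumptions for illustration, not anything taken from Nvidia's driver.

```python
def afr_gpu_for_frame(frame_index, num_gpus):
    """Return the GPU that renders a given frame under round-robin AFR.

    frame_index and GPU ids are zero-based (an assumption of this sketch).
    """
    return frame_index % num_gpus


# Two GPUs: frames simply alternate between GPU 0 and GPU 1.
print([afr_gpu_for_frame(f, 2) for f in range(8)])

# Four GPUs: frames are handed out in sequence, so each GPU
# renders every fourth frame.
print([afr_gpu_for_frame(f, 4) for f in range(8)])
```

With four GPUs, each one gets roughly four frame intervals to finish its frame, which is where both the scaling and the added latency come from.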
This method tends to scale best in performance because it's a very logical way to divvy up the workload. Not only does it scale up in terms of fill rate and pixel shading power, but it also divides the vertex-processing burden between the GPUs. Both Nvidia and ATI tend to employ this method when possible in their respective multi-GPU schemes.
AFR isn't without its drawbacks, though. It's not compatible with all applications, for one thing, so it can't always be used. More notably, for Quad SLI, AFR requires the use of four frame buffers in order to work. That's a major, show-stopping problem, because DirectX 9 currently allows a maximum of three such buffers. For DirectX games, which comprise the vast majority of PC game titles, Quad SLI's best load-balancing method isn't an option. Nvidia does use four-way AFR in some OpenGL titles like Doom 3 and Quake 4, but it has to resort to other methods for most games.
Another potential problem with four-way AFR is simply the amount of latency involved in a four-buffer rendering scheme. The lag between user input and when that change is reflected onscreen could be fairly long: tens of milliseconds, or long enough to be perceptible (and probably annoying) to a gamer. This problem would be most acute when the graphics subsystem is really stressed and the rate at which the GPUs are pumping out frames is relatively low.
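A back-of-the-envelope model shows why low frame rates make this worse. The sketch below assumes, as a simplification, that each GPU spends as many frame intervals on its frame as there are GPUs in the rotation, and it ignores driver queues and display lag entirely. The function name and the sample frame rates are hypothetical.

```python
def afr_latency_ms(output_fps, num_gpus):
    """Rough time a frame spends in flight under AFR, in milliseconds.

    Assumption: frames complete at output_fps overall, so each of the
    num_gpus GPUs has num_gpus frame intervals to render its own frame.
    """
    frame_interval_ms = 1000.0 / output_fps
    return num_gpus * frame_interval_ms


# Compare two-way and four-way AFR at a few overall frame rates.
for fps in (120, 60, 30):
    two_way = afr_latency_ms(fps, 2)
    four_way = afr_latency_ms(fps, 4)
    print(f"{fps} fps: 2-way ~{two_way:.1f} ms, 4-way ~{four_way:.1f} ms")
```

Under these assumptions the in-flight time doubles going from two GPUs to four, and it grows as the overall frame rate drops.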
I played a fair amount of Quake 4 with our Quad SLI test rig using an AFR graphics mode, and I didn't detect any noticeable input lag. Now, I'm no professional gamer, and Quake 4 isn't exactly the fastest-twitch action game around. But I have noticed input lag playing Quake 4 on an LCD display. Our test rig's fast CRT may have helped here. I wouldn't be shocked to hear of folks who found four-way AFR too slow for their tastes, especially when combined with a middling-speed LCD monitor or when running a graphically intensive game at high quality settings.
An alternative to AFR that offers broader compatibility and doesn't suffer from the three-buffer limit is split-frame rendering, where each GPU renders a portion of the frame, subdivided horizontally. The screen can be split into four segments of the same size, or the area apportioned to each GPU can be modified dynamically in response to demand. SFR's big downside is so-so performance scaling. Some applications work better than others with SFR, but SFR always requires each GPU to process vertex data for the entire frame.
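The screen split itself is easy to picture in code. This is only a sketch: the `weights` parameter is a hypothetical stand-in for whatever load metric a driver might use to resize the bands, and the real balancing logic isn't documented here.

```python
def split_rows(height, weights):
    """Divide `height` scanlines into horizontal bands, one per GPU.

    weights: relative share of the screen for each GPU (hypothetical
    load-balancing input). Returns a list of (start_row, end_row)
    pairs with end_row exclusive.
    """
    total = sum(weights)
    bands = []
    start = 0
    cumulative = 0
    for w in weights:
        cumulative += w
        # Integer boundaries so the bands tile the screen with no gaps.
        end = height * cumulative // total
        bands.append((start, end))
        start = end
    return bands


# Four equal segments of a 1200-line frame.
print(split_rows(1200, [1, 1, 1, 1]))

# A dynamic split that gives GPUs 0 and 3 larger bands, for instance
# because the middle of the screen is more expensive to render.
print(split_rows(1200, [3, 2, 2, 3]))
```

Note that this only divides the pixel work. Every GPU would still have to process the whole frame's geometry to know what lands in its band, which is the scaling limit described above.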
Nvidia can also circumvent DX9's three-buffer limit in AFR-compatible apps by employing a hybrid of AFR and SFR, as shown in the diagram above. The two GPUs on each GeForce 7950 GX2 card use SFR to distribute the load, and the frames rendered are interleaved between the two cards via AFR. In theory, this "AFR of SFR" should scale better than four-way SFR. The current driver profile for Quad SLI uses this method effectively in the game F.E.A.R., as we will see in our performance results shortly.
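Here is a rough sketch of how the hybrid assignment could work. The GPU numbering (card 0 holds GPUs 0 and 1, card 1 holds GPUs 2 and 3), the even top/bottom split, and the function name are all assumptions made for illustration.

```python
GPUS_PER_CARD = 2
NUM_CARDS = 2


def afr_of_sfr(frame_index, height):
    """Return [(gpu_id, start_row, end_row), ...] for one frame.

    AFR picks the card: frames alternate between the two cards.
    SFR splits that frame between the card's two GPUs (here, evenly).
    """
    card = frame_index % NUM_CARDS
    half = height // 2
    first_gpu = card * GPUS_PER_CARD
    return [
        (first_gpu, 0, half),            # top half of the frame
        (first_gpu + 1, half, height),   # bottom half of the frame
    ]


for frame in range(4):
    print(frame, afr_of_sfr(frame, 1200))
```

In this sketch only two frames are ever in progress at once, one per card, rather than the four that four-way AFR would need.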
Another means of distributing the load in a multi-GPU system is to deliver high levels of antialiasing by combining the antialiased frames from multiple GPUs into one. The AA sample pattern is varied from one GPU to the next so that the final frame effectively has a larger sample size and a more dispersed sample pattern. The various SLI antialiasing modes haven't traditionally achieved strong performance scaling, although performance improved somewhat when Nvidia incorporated the ability to pass sample data over the SLI bridge in the G71 and G73 GPUs. (The G70 had to pass this data via PCI Express.) Still, SLI AA has remained a means of making less graphically intense games look prettier rather than a load-balancing technique aimed primarily at performance.
That trajectory continues in Quad SLI with the addition of a new SLI AA mode with a 32X sample size. As the diagram above indicates, each GPU renders the frame at a slight offset using the G71's highest quality 8xS AA mode, and the four frames are composited to produce the final result. Because the 8xS mode is a mix of edge-oriented multisampling and full-scene supersampling, the resulting frame has elements of both, as a quick look at the sample patterns should illustrate.
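The arithmetic behind the 32X figure, and the compositing step, can be sketched as follows. The equal-weight average is an assumption; the actual blend the hardware uses isn't described here. Frames are represented as flat lists of pixel intensities purely for illustration.

```python
def effective_samples(num_gpus, samples_per_gpu):
    """Total AA samples per pixel when each GPU uses an offset pattern."""
    return num_gpus * samples_per_gpu


def composite(frames):
    """Blend per-GPU frames into one by averaging each pixel.

    frames: list of equally sized lists of pixel intensities, one list
    per GPU. Equal weighting is an assumption of this sketch.
    """
    count = len(frames)
    return [sum(pixel_values) / count for pixel_values in zip(*frames)]


# Four GPUs, each rendering with an 8-sample mode.
print(effective_samples(4, 8))

# An edge pixel that two GPUs see as covered (1.0) and two as empty (0.0)
# lands at an intermediate shade, while a fully covered pixel is unchanged.
print(composite([[0.0, 1.0], [1.0, 1.0], [0.0, 1.0], [1.0, 1.0]]))
```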