When a company called Lucid unveiled a web site promising a revolutionary new technology that could deliver near-perfect performance scaling for multiple GPUs, independent of GPU type, we were initially skeptical. Their claims sounded odd and perhaps too good to be true. But not only were they present on the show floor at IDF, they were showing a demo of working silicon. Remarkably enough, it appears they may just be on to something big.
To understand what they’re doing, you’ll first want to recall that, despite their growing popularity, schemes like SLI and CrossFire that combine multiple graphics cards to achieve higher performance often face serious challenges for performance scaling. Dropping in a second video card may get you nearly double the performance if all goes well, but multi-GPU schemes are fragile, and frequently, performance doesn’t scale nearly that well—particularly in games that use advanced but potentially problematic rendering methods. Adding a third or fourth GPU to the mix may not help and can even harm performance.
Part of the problem is the way GPUs are architected; unlike CPUs, they’re not capable of sharing a common pool of memory, so graphics firms end up managing inter-GPU coordination manually in their drivers, profiling games and making tweaks on a case-by-case basis.
On top of that, SLI and CrossFire both use relatively simple load-balancing algorithms, the most popular of which is alternate frame rendering (AFR), in which GPU 0 renders frame A, GPU 1 renders frame B, GPU 0 renders frame C, and so on. AFR sometimes works well, but isn’t compatible with every application. A common alternative is split-frame rendering (SFR), in which GPU 0 draws the top half of the screen while GPU 1 draws the bottom half. SFR is more broadly compatible, but doesn’t redistribute the work required in the earlier stages of the graphics pipeline, which harms performance scaling. There are a few variations on these schemes out there, but they don’t get much more sophisticated than that.
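To make the two schemes concrete, here is a minimal Python sketch of how each one hands out work. It is an illustration only, not code from any SLI or CrossFire driver; the function names `afr_assign` and `sfr_bands` are made up for this example, and the SFR version assumes an even split into horizontal bands.

```python
def afr_assign(frame_index: int, gpu_count: int) -> int:
    """Alternate frame rendering: whole frames are dealt out round-robin."""
    return frame_index % gpu_count


def sfr_bands(screen_height: int, gpu_count: int) -> list:
    """Split-frame rendering: each GPU draws one horizontal band of rows.

    Returns (first_row, end_row) pairs with end_row exclusive.
    The even split is an assumption made for this sketch.
    """
    edges = [i * screen_height // gpu_count for i in range(gpu_count + 1)]
    return list(zip(edges[:-1], edges[1:]))


if __name__ == "__main__":
    # GPU index chosen for frames A, B, C, D with two GPUs.
    print([afr_assign(frame, 2) for frame in range(4)])
    # Row ranges for a 1200-row screen split across two GPUs.
    print(sfr_bands(1200, 2))
```

Note that in the SFR case both GPUs still have to process the scene's geometry before they know which triangles land in their band, which is the scaling problem described above.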
By contrast, Lucid’s approach is much more complex (though still a bit mysterious at the most basic level) and involves its own custom hardware created for graphics load balancing: the Hydra 100 chip. This chip has several key components, including a RISC processing core that Lucid licensed from a third party, Lucid’s own proprietary 48-lane PCI Express switch fabric, and an image compositing engine. In a typical implementation, the Hydra 100 would be connected to a system’s north bridge chip via a 16-lane PCIe connection. Two GPUs would then sit behind it, each connected to it via a PCIe x16 link. (The Hydra 100 can also partition its PCIe lanes into a 4×8 config for quad-GPU setups.)
The Hydra 100 then appears to the host OS as a PCIe device, with its own driver. It intercepts calls made to the most common graphics APIs—OpenGL, DirectX 9/10/10.1—and reads in all of the calls required to draw an entire frame of imagery. Lucid’s driver and the Hydra 100’s RISC logic then collaborate on breaking down all of the work required to produce that frame, dividing the work required into tasks, determining where the bottlenecks will likely be for this particular frame, and assigning the tasks to the available rendering resources (two or more GPUs) in real time—for graphics, that’s within the span of milliseconds. The GPUs then complete the work assigned to them and return the results to the Hydra 100 via PCI Express. The Hydra streams in the images from the GPUs, combines them as appropriate via its compositing engine, and streams the results back to the GPU connected to the monitor for display.
As I understand it, because data is streamed from the GPUs into the compositing engine pixel by pixel, and because the compositing engine immediately begins streaming back out the combined result, the effective latency for the compositing step is very low.
Once a frame has been completed, Lucid analyzes the relative performance of its client GPUs for that frame and dynamically adjusts its expectations for the next one. As a result, Lucid President and co-founder Offir Remez told us, the Hydra 100 is capable of effectively load-balancing for asymmetrical GPU configurations, such as a GeForce 8600 GTS and a 9800 GTX. Or, in another potential real-world scenario, Lucid demonstrated its real-time load-balancing running Crysis fluidly while one of the two GPUs involved spent a portion of its power displaying a streaming video.
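Lucid hasn't disclosed how this feedback step works, so the following Python sketch is only our guess at the general idea: measure how long each GPU took on the last frame, estimate its throughput, and shift the next frame's work shares toward the faster card. The function name `rebalance`, the `smoothing` parameter, and the proportional rule are all assumptions made for illustration.

```python
def rebalance(shares, frame_times, smoothing=0.5):
    """Return new per-GPU work shares based on the previous frame.

    shares:      fraction of the frame's work each GPU was given (sums to 1).
    frame_times: seconds each GPU needed to finish its share (all > 0).
    smoothing:   0 keeps the old shares, 1 jumps straight to the new estimate.
    """
    # Estimated throughput: share of work finished per second last frame.
    rates = [share / time for share, time in zip(shares, frame_times)]
    total_rate = sum(rates)
    # Ideal shares are proportional to each GPU's estimated throughput.
    targets = [rate / total_rate for rate in rates]
    # Blend old and new shares so one odd frame doesn't cause a wild swing.
    blended = [(1 - smoothing) * old + smoothing * new
               for old, new in zip(shares, targets)]
    norm = sum(blended)
    return [value / norm for value in blended]


if __name__ == "__main__":
    # Two GPUs split a frame evenly; the second takes three times as long.
    print(rebalance([0.5, 0.5], [0.010, 0.030], smoothing=1.0))
```

In this toy model, a GPU that is also busy decoding video simply looks slower for a few frames and receives a smaller share until it frees up.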
Because Lucid is simply intercepting and then making OpenGL or DirectX calls, the Hydra 100 is purportedly GPU-agnostic, unconcerned and unaware whether it’s working with a Radeon, a GeForce or anything else. (In fact, one of the firm’s primary financial backers is Intel’s capital investment arm.) One limitation is that the GPUs involved must all use the same graphics driver, so mixing a GeForce with a Radeon won’t work.
The most intriguing aspect of this scheme is how Lucid actually breaks down a scene and apportions work to the individual GPUs. Remez said the firm has applied for over 50 patents, many of them for its load-balancing algorithms, which are much more fine-grained than AFR, SFR, or the like.
It’s difficult to express verbally, but Lucid’s demo on the show floor offered a good sense of what’s happening. The demo system had two GeForce GTX 260 cards connected to a Hydra 100 and enclosed in a box. A PCI Express cable then attached this test mule to the PCIe x16 slot in an enthusiast-class system (based on an Intel chipset). The whole setup was running Unreal Tournament 3. On one screen, we could see the output from a single GPU, while the other showed the output from either the second GPU or, via a hotkey switch, the final and composited frame. GPU 0 was rendering the entire screen space, but only portions of the screen showed fully textured and shaded surfaces—a patch of the floor, a wall, a column, a sky box—while other bits of the screen were black. GPU 1, meanwhile, rendered the inverse of the image produced by GPU 0. Wiggle the mouse around, and the mix of surfaces handled by each GPU changed frame by frame, creating an odd flickering sensation that left us briefly transfixed. The composited images, however, appeared to be a pixel-perfect rendition of UT3. Looking at the final output, you’d never suspect what’s going on beneath the covers.
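As a rough mental model of what we saw on the two screens, the Python sketch below merges two partial frames in which each GPU has left the other's regions black. This is a simplification we are assuming for illustration; Lucid did not describe how its compositing engine decides which pixel wins, and real hardware would need a mask or depth information to handle surfaces that are genuinely black. The name `composite` and the frame layout are hypothetical.

```python
BLACK = (0, 0, 0)


def composite(frame_a, frame_b):
    """Merge two equally sized partial frames pixel by pixel.

    Each frame is a list of rows; each row is a list of (r, g, b) tuples.
    A pixel left BLACK in frame_a is filled from frame_b.
    """
    if len(frame_a) != len(frame_b):
        raise ValueError("frames must have the same height")
    merged = []
    for row_a, row_b in zip(frame_a, frame_b):
        if len(row_a) != len(row_b):
            raise ValueError("rows must have the same width")
        merged.append([pixel_a if pixel_a != BLACK else pixel_b
                       for pixel_a, pixel_b in zip(row_a, row_b)])
    return merged


if __name__ == "__main__":
    floor, sky = (90, 60, 30), (120, 170, 255)
    gpu0 = [[sky, BLACK], [BLACK, floor]]   # GPU 0 drew two of the pixels
    gpu1 = [[BLACK, sky], [floor, BLACK]]   # GPU 1 drew the inverse
    print(composite(gpu0, gpu1))
```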
Remez told us Lucid uses a mix of load-balancing algorithms, and he wouldn’t reveal too many specifics about how the various algorithms might work. He claims the end result is near-linear performance scaling. Even with mismatched GPUs, if the slower of the two is only 30% of the speed of the faster one, the total system could produce nearly 1.3X the performance of the faster card alone.
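The arithmetic behind that mismatched-GPU figure is simple if you assume the work divides perfectly and the balancing itself costs nothing: the combined throughput is just the sum of the individual throughputs. The short Python sketch below works through it; `ideal_speedup` is a name invented for this example, and the result is a best-case ceiling rather than a measured number.

```python
def ideal_speedup(relative_speeds):
    """Best-case speedup over the fastest card, assuming a perfect split.

    relative_speeds: each GPU's speed in any common unit, such as
    frames per second when running alone.
    """
    fastest = max(relative_speeds)
    return sum(relative_speeds) / fastest


if __name__ == "__main__":
    # Faster card = 1.0, slower card = 30% of that.
    print(ideal_speedup([1.0, 0.3]))
    # Two identical cards give the familiar best case of doubling.
    print(ideal_speedup([1.0, 1.0]))
```

Any overhead in distributing work or compositing the results would pull real-world numbers below this ceiling, which is presumably why Remez said "nearly."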
I asked Remez about potential snags or incompatibilities in the scheme, things that might cause problems, whether it be multisampled antialiasing of edges or some of the cases that cause SLI and CrossFire to stumble. He asserted that MSAA would work properly and emphasized that the API-level, GPU-agnostic approach Lucid takes tends to shield them from application- or hardware-specific compatibility issues.
Lucid has identified a few places where its technology could likely be deployed at first. The most obvious, perhaps, is in place of a simple PCI Express switch chip on a dual-GPU video card like the Radeon HD 4870 X2. Lucid is already talking with board makers about the possibilities there. Another obvious possibility is for the Hydra 100 to find its way onto motherboards, where it could enable peak performance from high-end multi-GPU teams and offer upgraders the possibility of pairing an older, slower video card with a newer, quicker one for better overall performance. The presence of a Hydra 100 could also provide an easy workaround for chipset-specific multi-GPU lockouts. For instance, a Hydra-equipped motherboard based on an Intel chipset would be able to run multiple GeForce GPUs together, even though Nvidia doesn’t allow SLI on Intel chipsets. The third place where the Hydra might be deployed is in “pods” or external multi-GPU enclosures for the professional visualization market, similar to the Quadro enclosures Nvidia sells.
Lucid’s IDF demos were running on alpha silicon, but the company has just gotten final silicon back and says it’s on track to deliver products during the first half of 2009. The first chips support only PCI Express Gen 1, but Lucid claims that’s sufficient for now given the way its scheme works. The A0 silicon demoed at IDF was capable of running without a heatsink, and my finger survived a quick touch test. Andrew Schmied, VP of Marketing for Lucid, pegged the chip’s power draw at under 5W.
Assuming the Hydra 100 does work as advertised, the big questions now are “How does it really perform?” and “Who will make use of it?” As for the first question, we got a demo of Crysis running at 1920×1200 at the highest quality levels available in DirectX 9. The test system was using a pair of GeForce 9800 GTX cards, and performance ranged between 40 and 60 FPS on the game’s built-in frame rate counter. The game played very, very smoothly, and I didn’t perceive any latency between mouse inputs and on-screen responses. That seemed very promising, but we’ll have to get one of these things into Damage Labs for a true test of Lucid’s scaling claims before we can draw any real conclusions about performance.
We don’t yet know exactly who Lucid’s first customers might be, but we know that at least one major Taiwanese mobo and video card maker is working with them. Interest in the firm’s technology at IDF seemed to be considerable.
We’re also curious to see what AMD and Nvidia make of this upstart firm with apparently superior technology to their own load-balancing methods. We haven’t yet spoken with either company about Lucid, but we plan to soon. Stay tuned.