The multi-GPU scaling challenge
AMD claims development on CrossFire X drivers has taken a year, and that the total effort amounts to twice that of its initial dual-GPU CrossFire development effort. In order to understand why that is, I spoke briefly with Dave Gotwalt, a 3D Architect at AMD responsible for CrossFire X driver development. Gotwalt identified several specific challenges that complicated CrossFire X development.
One of the biggest challenges, of course, is avoiding CPU bottlenecks, long the bane of multi-GPU solutions. Gotwalt offered a basic reminder that it's easier to run into CPU limitations with a multi-GPU setup simply because multi-GPU solutions are faster overall. On top of that, he noted, multi-GPU schemes impose some CPU overhead. As a result, removing CPU bottlenecks sometimes helps more with multi-GPU performance than with one GPU.
In this context, I asked about the opportunities for multithreading the driver in order to take advantage of multiple CPU cores. Surprisingly, Gotwalt said that although AMD's DirectX 9 driver is multithreaded, its DX10 driver is not, neither for a single GPU nor for multiples. Gotwalt explained that multithreading the driver isn't possible in DX10 because the driver must make callbacks through the DX10 runtime to the OS kernel, and those calls must be made through the main thread. Microsoft, he said, apparently felt most DX10 applications would be multithreaded, and they didn't want to create another thread. (What we're finding now, however, noted Gotwalt, is that applications aren't as multithreaded as Microsoft had anticipated.)
With that avenue unavailable to them, AMD had to focus on other areas of potential improvement for mitigating CPU bottlenecks. One of the keys Gotwalt identified is having the driver queue up several command buffers and several frames of data, in order to determine ahead of time what needs to be rendered for the next frame.
Even with such provisions in place, Windows Vista puts limitations on video drivers that sometimes prevent CrossFire X from scaling well. The OS, Gotwalt explained, controls the "flip queue" that holds upcoming frames to be displayed, and by default, the driver can only render as far as three frames ahead of the frame being displayed. Under Vista, both DX9 and DX10 allow the application to adjust this value, so that the driver could get as many as ten frames ahead if the application allowed it. The driver itself, however, has no control over this value. (Gotwalt said Microsoft built this limitation into the OS, interestingly enough, because "a certain graphics vendor, not us" was queuing up many more frames than the apps were accounting for, leading to serious mouse lag. Game developers were complaining, so Microsoft built in a limit.)
For CrossFire X, AMD currently relies solely on a method of GPU load balancing known as alternate frame rendering (AFR), in which each GPU is responsible for rendering a whole frame and frames are distributed to GPUs sequentially. Frame 0 will go to GPU 0, frame 1 to GPU 1, frame 2 to GPU 2, and so on. Because of the three-frame limit on rendering ahead, explained Gotwalt, the fourth GPU in a CrossFire X setup will have no effect in some applications. Gotwalt confirmed that AMD is working on combining split-frame rendering with AFR in order to improve scaling in such applications. He even alluded to another possible technique, but he wasn't willing to talk about it just yet. Those methods will have to wait for a future Catalyst release.
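To see why the fourth GPU can sit idle, here is a minimal sketch of AFR under a render-ahead limit. It is a toy model, not AMD's driver logic: the function name `simulate_afr`, the fixed per-frame render time, and the rule that a frame may start only once the frame `queue_limit` positions earlier has been displayed are all simplifying assumptions made for illustration.

```python
def simulate_afr(num_frames, num_gpus, queue_limit, render_time=1.0):
    """Return the time at which the last frame is displayed.

    Toy model (assumptions, not driver behavior):
    - frame i is assigned to GPU i % num_gpus (alternate frame rendering)
    - every frame takes the same render_time on any GPU
    - frame i may not start until frame i - queue_limit has been displayed
    - frames are displayed in order, as soon as they are finished
    """
    gpu_free_at = [0.0] * num_gpus   # when each GPU finishes its current frame
    displayed_at = []                # display time of each frame, in order

    for frame in range(num_frames):
        gpu = frame % num_gpus
        start = gpu_free_at[gpu]
        if frame >= queue_limit:
            # Render-ahead limit: wait for an older frame to leave the queue.
            start = max(start, displayed_at[frame - queue_limit])
        finish = start + render_time
        gpu_free_at[gpu] = finish
        # Frames must be shown in order.
        shown = finish if not displayed_at else max(finish, displayed_at[-1])
        displayed_at.append(shown)

    return displayed_at[-1]


three_gpus = simulate_afr(12, num_gpus=3, queue_limit=3)
four_gpus = simulate_afr(12, num_gpus=4, queue_limit=3)
four_gpus_deeper_queue = simulate_afr(12, num_gpus=4, queue_limit=4)
```

In this model, four GPUs with a three-frame limit finish twelve frames no sooner than three GPUs do, while raising the limit to four lets the fourth GPU contribute. Real workloads have variable frame times and CPU costs that this sketch ignores.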
Another performance challenge Gotwalt pointed to is one of Vista's resource management practices. In order for an application to access a resource (such as a buffer), the application must "lock" this resource. The fastest type of lock, he said, is a lock-discard, which is useful when one doesn't care about modifying the current contents of the resource, since a lock-discard simply allocates a new chunk of memory. This sort of lock makes sense for certain types of resources, like vertex buffers. The problem, according to Gotwalt, is that the OS's implementation of lock-discard is expensive for small buffers. A kernel transition is involved, and the memory manager will only allow a given buffer to be renamed 64 times. After that, the DirectX runtime will require the driver to flush its command buffer, invoking a severe performance penalty. As Gotwalt put it, "We have now just serialized the whole system." This limitation exists for both DX9 and DX10, but Gotwalt said it isn't as evident in DX9. DirectX 10 presents more of a problem because its constant buffers are different in nature; they are smaller and can have a higher update frequency than vertex buffers.
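The cost of the 64-rename limit is easy to picture with a small counter. The sketch below is purely illustrative: the class `RenamedBuffer` and its methods are hypothetical names, not part of DirectX or any driver, and it assumes the rename count starts over after each flush, which the article does not state.

```python
RENAME_LIMIT = 64  # figure quoted by Gotwalt in the article


class RenamedBuffer:
    """Hypothetical model of one buffer that is updated via lock-discard."""

    def __init__(self, rename_limit=RENAME_LIMIT):
        self.rename_limit = rename_limit
        self.renames_since_flush = 0
        self.flushes = 0

    def lock_discard(self):
        # Once the buffer has used up its renames, a command buffer flush
        # is forced before another rename can be handed out.
        # Assumption: the count resets after the flush.
        if self.renames_since_flush == self.rename_limit:
            self.flushes += 1
            self.renames_since_flush = 0
        self.renames_since_flush += 1


def flushes_for(updates, rename_limit=RENAME_LIMIT):
    """Count forced flushes for a given number of lock-discard updates."""
    buf = RenamedBuffer(rename_limit)
    for _ in range(updates):
        buf.lock_discard()
    return buf.flushes
```

Under these assumptions, a buffer updated 64 times between flushes never stalls, but a small constant buffer updated 200 times would force three flushes, each one serializing the pipeline in the way Gotwalt described.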
As a result, AMD has taken over management of renaming in its drivers. Doing so isn't a trivial task, Gotwalt pointed out, because one must avoid over-allocating memory. At present, AMD has a constant buffer renaming mechanism in place in Catalyst 8.3, but it involves some amount of manual tweaking, and new applications could potentially cause problems by exhibiting unexpected behavior. However, Gotwalt said AMD has a new, more robust solution coming soon that won't involve so much tweaking, won't easily be broken by new applications, and will apply to any resource that is renamed: not just constant buffers, but vertex buffers, textures, and the like.
The final issue Gotwalt described may be the thorniest one for multi-GPU rendering: the problem of persistent resources. In some cases, an application may produce a result that remains valid across several succeeding frames. Gotwalt's example of such a resource was a shadow map. The GPU renders this map and then uses it as a reference in rendering the final frame. This sort of resource presents a problem because multiple GPUs in CrossFire X don't share memory. As a result, he said, the driver will have to track when the map was rendered and synchronize its contents between different GPUs. Dependencies must be tracked, as well, and the driver may have to replicate both a resource and anything used to create it from one GPU to the next (and the next). This, Gotwalt said, is one reason why profiled AFR ends up being superior to non-profiled AFR: the driver can turn off some of its resource tracking once the application has been profiled.
Gotwalt pointed out that "AFR-friendly" applications will simply re-render the necessary data multiple frames in a row. However, he said, the drivers must then be careful not to sync data unnecessarily when the contents of a texture have been re-rendered but haven't changed.
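The tracking problem described above can be sketched with a simple version counter per GPU. This is a hypothetical illustration of the general idea, not AMD's implementation: the class `PersistentResource` and its methods are invented names, and the model ignores dependency chains and content comparison entirely.

```python
class PersistentResource:
    """Hypothetical tracker for one resource (say, a shadow map) across GPUs."""

    def __init__(self, num_gpus):
        self.latest_version = 0
        self.gpu_version = [0] * num_gpus  # version each GPU currently holds
        self.copies = 0                    # GPU-to-GPU transfers performed

    def render(self, gpu):
        # The application (re)renders the resource on this GPU.
        self.latest_version += 1
        self.gpu_version[gpu] = self.latest_version

    def use(self, gpu):
        # Before a GPU reads the resource, a stale copy must be synchronized.
        if self.gpu_version[gpu] != self.latest_version:
            self.copies += 1
            self.gpu_version[gpu] = self.latest_version


def run_afr(num_frames, num_gpus, rerender_every_frame):
    """Count the copies needed when frames alternate across GPUs."""
    resource = PersistentResource(num_gpus)
    for frame in range(num_frames):
        gpu = frame % num_gpus
        if rerender_every_frame or frame == 0:
            resource.render(gpu)
        resource.use(gpu)
    return resource.copies
```

In this toy model, a resource rendered once and reused for eight frames on four GPUs needs three copies (one for each GPU that did not render it), whereas an "AFR-friendly" application that re-renders it every frame needs none, at the price of repeating the rendering work.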
Curious, I asked Gotwalt whether re-rendering was typically faster than transferring a texture from one GPU to the next. He said yes, in some applications it is, but one must be careful about it. If you're re-rendering too many resources, you're not really sharing the workload, and performance won't scale. In those cases, it's faster to copy the data from GPU to GPU. Gotwalt claimed they'd found this to be the case in DirectX 10 games, whereas DX9 games were generally better off re-rendering.
Gotwalt attributed this difference more to changes in the usage model in newer games than to the API itself. (Think about the recent proliferation of post-processing effects and motion blur.) DX10 games make more passes on the data and render to textures more, creating a "cascading of resources." DX10's ability to render to a buffer via stream out also allows more room for the creation of persistent resources. Obviously, this is a big problem to manage case by case, and Gotwalt admitted as much. He qualified that admission, though, by noting that AMD learns from every game it profiles and tries to incorporate what it learns into its general "compatible AFR" implementation when possible.
Clearly, AMD has put a tremendous amount of sweat and smarts into making CrossFire X work properly and into achieving reasonably good performance scaling with multiple GPUs. The obstacles Gotwalt outlined are by no means trivial, and the AMD driver team's ability to navigate those obstacles with some success is impressive. Still, some of the challenges they face aren't going to go away. In fact, the persistent resources problem is only growing thornier and more complex with time. This is one of the major reasons multi-GPU solutions (based on today's GPU architectures, at least) will probably always be somewhat fragile and very much reliant on driver updates in order to deliver strong performance scaling. There's reason for optimism here based on the good work that folks at AMD and elsewhere are putting into these problems, but also reason for caution.