The graphics engine
AMD refers to the front-end of Cypress as the graphics engine, encompassing as it does the traditional setup engine, the command processor, and the thread dispatch processor. Notable new additions here include a second rasterizer and a next-generation tessellation unit.
Keeping with the theme of doubling resources, AMD added a second rasterizer to make sure the GPU can convert polygon meshes into pixels at a rate sufficient to keep up with the rest of the chip. There are two separate units here, and I wondered at first whether taking full advantage of them might require the use of DirectX 11 and its multithreaded command processing. But AMD says the geometry assembly and thread dispatch units have been modified to perform the necessary load balancing in hardware transparently.
The tessellator is capable of turning lower-polygon models into higher-poly ones by using mathematical hints, such as higher-order surfaces. Radeons have had hardware tessellation units for several generations, as does the Xbox 360 GPU, but they've not been widely used because prior versions of DirectX haven't exposed their capabilities. That all changes with DirectX 11, which exposes the tessellator for programming via two new shader types: hull shaders and domain shaders. Not only that, but Cypress' tessellator is improved from prior iterations, so it can handle popular (as these things go) algorithms like Catmull-Clark in a single pass. The tessellator can adjust the level of geometric detail in real time, too. We should see vastly more geometric detail in terrain, characters, and the like once hardware tessellation goes into widespread use.
Notable by their absence are the interpolation units traditionally found in the setup engine. These fixed-function interpolators have fallen victim to a long-term trend in graphics processors: they've been replaced by the shader processors. AMD has added interpolation instructions to its shader cores as a means of implementing a new DirectX 11 feature called pull-model interpolation, which gives developers more direct control over interpolation (and thus over texture and shader filtering). The shader core offers higher mathematical precision than the old fixed-function hardware, and it has many times the compute power for linear interpolation, as well. AMD CTO Eric Demers pointed out in his introduction to the Cypress architecture that the RV770's interpolation hardware had become a performance-limiting step in some texture filtering tests, and using the SIMDs for interpolation should bypass that bottleneck.
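For anyone who hasn't seen attribute interpolation spelled out, here's a rough sketch of the arithmetic that moves into the shader core, written in Python rather than shader code. It uses plain barycentric weights and ignores perspective correction. The function name and the sample values are my own illustrative assumptions, not anything taken from AMD or the DirectX 11 spec.

```python
def interpolate(weights, values):
    """Blend three per-vertex attribute values with barycentric weights."""
    w0, w1, w2 = weights
    v0, v1, v2 = values
    return w0 * v0 + w1 * v1 + w2 * v2

# A hypothetical texture coordinate stored at a triangle's three vertices.
u_at_vertices = (0.0, 1.0, 0.5)

# Evaluate the attribute at the centroid, then nearer the second vertex.
# Choosing where to evaluate is the kind of control the shader gains.
centroid = interpolate((1 / 3, 1 / 3, 1 / 3), u_at_vertices)
near_v1 = interpolate((0.25, 0.5, 0.25), u_at_vertices)

print(centroid, near_v1)
```

Each evaluation is just three multiplies and two adds per attribute, which is exactly the sort of work the ALUs already handle in bulk.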
Not only has Cypress doubled the amount of computing power available on a single GPU, but AMD has also added refinements to improve the per-clock performance, mathematical precision, and fundamental capabilities of its stream processors.
Here's another look at the basic layout of the chip. Cypress has 20 SIMDs, each of which has 16 of what AMD calls thread processors inside of it. Each of those thread processors has five arithmetic logic units, or ALUs. Multiply it out, and you get a grand total of 1600 ALUs across the entire chip, or 1600 "stream processors" or "stream cores," depending on which version of AMD nomenclature you pick. "Stream cores" is the latest, and it seems to be a bit inflationary. My friend David Kanter argues that what makes a core in computer architecture is the ability to fetch instructions. By that measure, Cypress would have 20 cores, since the thread processors inside of each SIMD march together according to one large instruction word.
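The multiplication is easy enough to do in your head, but a few lines of Python make the hierarchy explicit. The constant names are my own labels, not AMD's terminology.

```python
# Cypress shader hierarchy, using the figures quoted above.
SIMDS = 20                  # SIMD engines on the chip
THREAD_PROCS_PER_SIMD = 16  # thread processors inside each SIMD
ALUS_PER_THREAD_PROC = 5    # ALUs inside each thread processor

total_alus = SIMDS * THREAD_PROCS_PER_SIMD * ALUS_PER_THREAD_PROC

# By the instruction-fetch definition of a core, each SIMD counts once.
cores_by_fetch = SIMDS

print(total_alus, cores_by_fetch)
```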
The organization of the thread processors is essentially unchanged from the RV770 and traces its roots pretty directly back to the R600. The primary execution unit is superscalar and five ALUs wide. That fifth ALU is a superset of the others, capable of handling more advanced math like transcendentals. The execution units are pipelined with eight cycles of latency, but the SIMDs can execute two hardware thread groups, or "wavefronts" in AMD parlance, in interleaved fashion, so the effective wavefront latency is four cycles. Multiply that latency by the width of the SIMD, and you have 64 pixels or threads of branch granularity, just as in R600.
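Here's that branch-granularity arithmetic as a small Python sketch. The SIMD width of 16 is the per-SIMD thread processor count given earlier, and the variable names are mine.

```python
PIPELINE_LATENCY_CYCLES = 8  # latency of the pipelined execution units
INTERLEAVED_WAVEFRONTS = 2   # wavefronts each SIMD alternates between
SIMD_WIDTH = 16              # thread processors per SIMD

# Interleaving two wavefronts halves the latency each one sees.
effective_latency = PIPELINE_LATENCY_CYCLES // INTERLEAVED_WAVEFRONTS

# Threads that must take the same side of a branch together.
branch_granularity = effective_latency * SIMD_WIDTH

print(effective_latency, branch_granularity)
```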
Despite this similarity to past architectures, AMD has made a host of improvements to Cypress, some of which are helpful for graphics, others for GPU compute, and some for both. Demers told us DirectX 11, DirectCompute 11, and OpenCL are fully implemented in hardware, with no need for performance-robbing software emulation of features. Demers stopped just short of asserting that Cypress would support the next version of OpenCL fully in hardware, as well, but gave the distinct impression that this chip would likely be able to do so.
Cypress adds a number of instructions to support DirectX 11, DirectCompute, and other missions this chip may have, including video encoding. One general performance improvement is the ability to co-issue a MUL and a dependent ADD instruction in a single clock, sidestepping a pitfall of its superscalar execution units.
On the dedicated compute front, Cypress continues to execute double-precision FP math at one-fifth its peak rate for single-precision, but AMD has upped the ante on precision in several ways. Demers claims the GPU is compliant with the IEEE 754-2008 standard, with precision-enhancing denorms handled "at speed." The chip now supports a fused multiply-add instruction, which takes the result of a multiply operation and feeds it directly into the adder without rounding in between. Demers describes FMA as a way to achieve DP-like results with single-precision datatypes. (This FMA capability is present in some CPU architectures, but isn't yet built into x86 microprocessors, believe it or not, though Intel and AMD have both announced plans to add it.) The lone potential snag for full IEEE compliance, Demers told us, is the case of "a few numerical exceptions." The chip will report that such exceptions have occurred, but won't execute user code to handle them.
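To see why skipping that intermediate rounding matters, consider this small Python sketch. It is an emulation under stated assumptions, not a model of Cypress itself: Python floats are doubles rather than the single-precision values Demers was talking about, and the helper function is my own stand-in that uses exact rational arithmetic to mimic a single final rounding. The operands are chosen so the separately rounded product loses exactly the bits that matter.

```python
from fractions import Fraction

def fused_multiply_add(a, b, c):
    """Emulate FMA: compute a*b + c exactly, then round once."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)

a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

# The true product is 1 - 2**-60, which rounds to 1.0 as a double,
# so the separate multiply-then-add loses the answer entirely.
separate = a * b + c

# With one rounding at the end, the tiny result survives.
fused = fused_multiply_add(a, b, c)

print(separate, fused)
```

The separate path returns zero, while the fused path keeps the small negative remainder. The same effect shows up at a coarser scale in single precision.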
|Graphics card||Peak single-precision GFLOPS||Peak single-precision GFLOPS with co-issue|
|GeForce 9800 GT||339||508|
|GeForce GTS 250||484||726|
|GeForce GTX 285||744||1116|
|GeForce GTX 295||1192||1788|
|Radeon HD 4850||1088||-|
|Radeon HD 4870||1200||-|
|Radeon HD 4890 OC||1440||-|
|Radeon HD 4870 X2||2400||-|
|Radeon HD 5850||2088||-|
|Radeon HD 5870||2720||-|
AMD continues to devote more transistors to compute-specific logic. The local data stores on each SIMD, used for inter-process communication, have doubled in size to 32KB, and AMD's distinctive global data share has quadrupled from 16 to 64KB. The memory export buffer can now scatter up to 64 32-bit values per clock, twice the rate of RV770. Cypress supports 32-bit atomic operations, as well; hardware semaphores enable global synchronization in "a few cycles," according to Demers. However, Demers wouldn't reveal whether or not Cypress's memory controller is capable of supporting ECC memory, a capability that could be crucial in the burgeoning markets for GPU computing.
Demers made no bones about the fact that the primary market for this chip is graphics and gaming, but he was quick to point out that Cypress is also the most advanced GPU compute engine in the world. Given the current state of things, that claim seems to be credible, at least for the time being. The Radeon HD 5870's peak processing power is formidable at 2.7 TFLOPS for single-precision math and 544 GFLOPS for double-precision. That's more than twice the peak theoretical capacity of the GT200b's fastest graphics card variant, GeForce GTX 285, even if we generously include Nvidia's co-issue feature in our FLOPS count.
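Those peak numbers fall out of a quick calculation, sketched here in Python. Note two assumptions: the 850MHz core clock for the Radeon HD 5870 isn't stated in this section, and I'm counting a multiply-add as two floating-point operations per ALU per clock. The GTX 285 figure is the co-issue number from the table above.

```python
ALUS = 1600              # stream processors on Cypress
FLOPS_PER_ALU_CLOCK = 2  # assumed: a multiply-add counts as two operations
CORE_CLOCK_MHZ = 850     # assumed Radeon HD 5870 core clock
DP_DIVISOR = 5           # double precision runs at one-fifth the SP rate

sp_gflops = ALUS * FLOPS_PER_ALU_CLOCK * CORE_CLOCK_MHZ / 1000
dp_gflops = sp_gflops / DP_DIVISOR

GTX_285_COISSUE_GFLOPS = 1116  # peak with co-issue, from the table
advantage = sp_gflops / GTX_285_COISSUE_GFLOPS

print(sp_gflops, dp_gflops, round(advantage, 2))
```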
Of course, as with almost any processor, peak throughput is only part of the story. We don't yet have much in the way of standard GPU compute benchmarks or applications we can run, but we can look at the directed tests for shader performance in 3DMark.
These results range from disappointing (slightly slower than the GTX 285 in the GPU cloth test) to astounding (considerably faster than two Radeon HD 4870s in the parallax occlusion mapping and Perlin noise tests).