
Nvidia's GeForce GTX 750 Ti 'Maxwell' graphics processor


...takes on the Radeon R7 265 and friends
— 10:20 PM on February 19, 2014

So this is different. I don't recall the last time a new GPU architecture made its worldwide debut in a lower-end graphics card—or if I do, I'm not about to admit I've been around that long. In my book, then, Nvidia's "Maxwell" architecture is breaking new ground by hitting the market first in a relatively affordable graphics card, the GeForce GTX 750 Ti, and its slightly gimpy twin, the GeForce GTX 750.

Don't let the "750" in those names confuse you. Maxwell is the honest-to-goodness successor to the Kepler architecture that's been the basis of other GeForce GTX 600 and 700 series graphics cards, and it's a noteworthy evolutionary step. Nvidia claims Maxwell achieves twice the performance per watt of Kepler, without the help of a new chip fabrication process. Given how efficient Kepler-based GPUs have been so far, that's a bold claim.

I was intrigued enough by Maxwell technology that I've hogged the spotlight from Cyril, who usually reviews video cards of this class for us. Having spent some time with them, I've gotta say something: regardless of the geeky architectural details, these products are interesting in their own right. If your display resolution is 1920x1080 or less—in other words, if you're like the vast majority of PC gamers—then dropping $150 or less on a graphics card will get you a very capable GPU. Most of the cards we've tested here are at least the equal of an Xbone or PlayStation 4, and they'll run the majority of PC games quite smoothly without compromising image quality much at all.

Initially, I figured I'd try testing these GPUs with some popular games that aren't quite as demanding as our usual fare. However, I quickly learned these cards are fast enough that Brothers: A Tale of Two Sons and Lego Lord of the Rings don't present any sort of challenge, even with all of the image quality options cranked. Discerning any differences between the GPUs running these games would be difficult at best, so I was soon back to testing Battlefield 4 and Crysis 3.

Why should that rambling anecdote matter to you? Because if you're an average dude looking for a graphics card for your average computer so it can run the latest games, this price range is probably where you ought to be looking. I'm about to unleash a whole torrent of technical gobbledygook about GPU architectures and the like, but if you can slog through it, we'll have some practical recommendations to make at the end of this little exercise, too.

The first Maxwell: GM107

| GPU | ROP pixels/clock | Texels filtered/clock (int/fp16) | Stream processors | Rasterized triangles/clock | Memory interface width (bits) | Transistor count (millions) | Die size (mm²) | Fab process |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Cape Verde | 16 | 40/20 | 640 | 1 | 128 | 1500 | 123 | 28 nm |
| Bonaire | 16 | 56/28 | 896 | 2 | 128 | 2080 | 160 | 28 nm |
| Pitcairn | 32 | 80/40 | 1280 | 2 | 256 | 2800 | 212 | 28 nm |
| GK107 | 16 | 32/32 | 384 | 1 | 128 | 1300 | 118 | 28 nm |
| GK106 | 24 | 80/80 | 960 | 3 | 192 | 2540 | 214 | 28 nm |
| GM107 | 16 | 40/40 | 640 | 1 | 128 | 1870 | 148 | 28 nm |

The first chip based on the Maxwell architecture is code-named GM107. As you can see from the picture and table above, it's a modestly sized piece of silicon roughly halfway between the GK107 and GK106. Like its predecessors and competition, the GM107 is manufactured at TSMC on a 28-nm process.

Purely on a chip level, the closest competition for the GM107 is the Bonaire chip from AMD. Bonaire powers the Radeon R7 260X and, just like the big Hawaii chip aboard the Radeon R9 290X, packs the latest revision of AMD's GCN architecture. The GM107 and Bonaire are roughly the same size, and they both have a 128-bit memory interface. Notice, though, that Bonaire has more stream processors and texture filtering units than the GM107. We'll address that matchup properly once we've established clock speeds for the actual products, but the GM107 will have to make more efficient use of its resources in order to outperform the R7 260X. Something to keep in mind.
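For a rough sense of how much clock speed will matter to that matchup, here's a minimal sketch that turns the per-clock resources from the table above into theoretical peak rates. The clock speeds are placeholder assumptions for illustration, not the actual product specs.

```python
# Rough peak-rate calculator for the GM107-vs-Bonaire comparison.
# Clock speeds here are illustrative placeholders, not official product specs.

def peak_rates(name, sps, texels_per_clock, clock_mhz):
    clock_ghz = clock_mhz / 1000.0
    gflops = sps * 2 * clock_ghz            # one fused multiply-add per SP per clock = 2 flops
    gtexels = texels_per_clock * clock_ghz  # int8 texel filtering rate from the table above
    print(f"{name}: {gflops:.0f} GFLOPS, {gtexels:.0f} Gtexels/s at {clock_mhz} MHz")

peak_rates("GM107 (GTX 750 Ti)", sps=640, texels_per_clock=40, clock_mhz=1050)  # assumed clock
peak_rates("Bonaire (R7 260X)",  sps=896, texels_per_clock=56, clock_mhz=1100)  # assumed clock
```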

The Maxwell GPU architecture


A functional block diagram of the GM107. Source: Nvidia.

Above is a not-terribly-fine-grained representation of the GM107's basic graphics units. From this altitude, Maxwell doesn't look terribly different from Kepler, with the same division of the chip into graphics processing clusters (GPCs, of which the GM107 has only one) and, below that, into SMs or streaming multiprocessors. If you're familiar with these diagrams, you can probably map the other units on the diagram to the unit counts in the table above. The two ROP partitions are just above the L2 cache, for instance, and each one is associated with a slice of the L2 cache and a 64-bit memory controller. Although these structures will seem familiar from Nvidia's prior GPUs, the company says "all the units and crossbar structures have been redesigned, data flows optimized, power management significantly improved, and so on." So Maxwell isn't just the result of copy-paste in the chip design tools, even if the block diagram looks familiar. Maxwell's engineering team didn't achieve a claimed doubling of power efficiency without substantial changes throughout the GPU.

In fact, Nvidia has been especially guarded about what exactly has gone into Maxwell, more so than in the past. These are especially interesting times for GPU development, since the competitive landscape is changing. Nvidia introduced the first mobile SoC with a cutting-edge GPU, the Tegra K1, early this year, and it faces competition not just from AMD but also from formidable mobile SoC firms like Qualcomm. The company has had to adapt its GPU design philosophy to focus on power efficiency in order to play in the mobile space. Kepler was the first product of that shift, and Maxwell continues that trajectory, evidently with some success. Nvidia seems to be a little skittish about divulging too much of the Maxwell recipe, for fear that it could inspire competitors to take a similar path.

With that said, we still know about the basics that distinguish Maxwell from Kepler. The most important ones are in the shader multiprocessor block, or SM. Let's put on our extra-powerful glasses and zoom in on a single SM to see what's inside.


A functional block diagram of the Maxwell SM. Source: Nvidia.

You may recall that the Kepler SMX is a big and complex beast. The SMX has four warp schedulers, eight instruction dispatch units, four 32-wide vector arithmetic logic units (ALUs), and another four 16-wide ALUs. ("Warps" is an Nvidia term that refers to a group of 32 threads that execute together. These groupings are common in streaming architectures like this one. AMD calls its thread groups "wavefronts.") That gives the SMX a total of 192, uhh, math units—thanks to four vec32 ALUs and four vec16 ALUs. Nvidia says the Kepler SM has 192 "CUDA cores," but that's a marketing term intended to incite serious nerd rage. We'll call them stream processors, which is somewhat less horrible.
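As a quick sanity check on that arithmetic, here's a trivial tally of the Kepler SMX's execution resources as described above:

```python
# Tally of Kepler SMX execution resources as described above.
vec32_alus = 4   # 32 lanes each
vec16_alus = 4   # 16 lanes each
stream_processors = vec32_alus * 32 + vec16_alus * 16
print(stream_processors)  # 192 -- Nvidia's "192 CUDA cores" figure
```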

Anyhow, Maxwell divvies things up inside the SM a little differently. One might even say this so-called SMM is a quad-core design, if one were determined to use the word "core" more properly. At any rate, the Maxwell SM is divided into quads. Each quad has a warp scheduler, two dispatch units, a dedicated register file, and a single vec32 ALU. The quads have their own banks of load/store units, and they also have their own special-function units that handle tricky things like interpolation and transcendentals.
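Laid out the same way, here's the quad organization of the Maxwell SM as just described, with the per-chip total from the table above thrown in as a check:

```python
# Maxwell SM ("SMM") organization as described above: four quads, each with one vec32 ALU.
quads_per_sm = 4
lanes_per_quad = 32                        # one vec32 ALU per quad
sps_per_sm = quads_per_sm * lanes_per_quad
print(sps_per_sm)        # 128 stream processors per SM
print(sps_per_sm * 5)    # 640 -- five SMs' worth, matching the GM107 total in the table above
```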

Nvidia's architects have rejiggered the SM's memory subsystem, too. For instance, the texture cache has been merged with the L1 compute cache. (Formerly, a partitioned chunk of the SM's 64KB shared memory block served as the L1 compute cache.) Naturally, each L1/texture cache is attached to a texture management unit. Each pair of quads shares one of these texture cache/filtering complexes. Separately, the 64KB block of shared memory remains, and as before, it services the entire SM.

Maxwell's control logic and execution resources are more directly associated with one another than in Kepler, and the scale of the SM itself is somewhat smaller. One Maxwell SM has 128 stream processors and eight texels per clock of texture filtering, down by one third and one half, respectively, from Kepler. The number of load/store and special-function units apparently remains the same. Nvidia says the Maxwell SM achieves about 90% of the performance of the Kepler SM in substantially less area. To give you some sense of the scale, the GM107 occupies about 24% more area than the GK107, yet the Maxwell-based chip has 66% more stream processors. Due to more efficient execution, the firm claims the GM107 manages about 2.3X the shader performance of the GK107.
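Those scaling figures follow directly from the table above:

```python
# Area and shader-count growth from GK107 to GM107, per the table above.
gk107 = {"die_mm2": 118, "stream_processors": 384}
gm107 = {"die_mm2": 148, "stream_processors": 640}

area_growth = gm107["die_mm2"] / gk107["die_mm2"] - 1
sp_growth = gm107["stream_processors"] / gk107["stream_processors"] - 1
print(f"{area_growth:.0%} more area, {sp_growth:.0%} more stream processors")
# -> roughly 25% more area, 67% more stream processors (the ~24% and 66% quoted above)
```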

How does Maxwell manage those gains? Well, the higher ratio of compute to texturing doesn't hurt—the SM has shifted from a rate of 12 flops for every texel filtered to 16. Meanwhile, Nvidia contends that much of the improvement comes from smarter, simpler scheduling that keeps the execution resources more fully occupied. Kepler moved some of the scheduling burden from the GPU into the compiler, and Maxwell reputedly continues down that path. Thanks to its mix of vec16 and vec32 units, the Kepler SM is surely somewhat complicated to manage, with higher execution latencies for thread groups that run on those half-width ALUs. A Maxwell quad outputs one warp per clock consistently, with lower latency. That fact should simplify scheduling and reduce the amount of overhead required to track thread states. I think. The methods GPUs use to keep themselves as busy—and efficient—as possible are still very much secret sauce.
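The compute-to-texture shift works out like so, counting one flop per stream processor per clock as the figures above do:

```python
# Ratio of shader throughput to texture filtering, Kepler SMX vs. Maxwell SM.
def flops_per_texel(stream_processors, texels_per_clock):
    # One flop per stream processor per clock, matching the figures quoted above.
    return stream_processors / texels_per_clock

print(flops_per_texel(192, 16))  # Kepler SMX: 12.0
print(flops_per_texel(128, 8))   # Maxwell SM: 16.0
```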

One change in the new SM will be especially consequential for certain customers—and possibly for the entire GPU market. Maxwell restores a key execution resource that was left out of Kepler: the barrel shifter. The absence of this hardware doesn't seem to have negative consequences for graphics, but it means Kepler isn't well-suited to the make-work algorithms used by Litecoin and other digital currencies. AMD's GCN architecture handles this work quite well, and Radeons are currently quite scarce in North America since coin miners have bought up all of the graphics cards. The barrel shifter returns in Maxwell, and Nvidia claims the GM107 can mine digital currencies quite nicely, especially given its focus on power efficiency.
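For context, the scrypt hash behind Litecoin leans heavily on Salsa20's mixing function, which is little more than 32-bit adds, XORs, and rotates; a rotate by an arbitrary amount is exactly what a barrel shifter performs in a single step. A rough illustration of the kind of operation at issue, not Nvidia's implementation:

```python
# 32-bit rotate, the operation at the heart of Salsa20 (and thus scrypt, Litecoin's hash).
# A barrel shifter handles this in one step; without one, the rotate has to be built
# from multiple slower operations.
def rotl32(x, n):
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF

# One of the four operations in a Salsa20 quarter-round, for illustration:
def mix(target, a, b, amount):
    return target ^ rotl32((a + b) & 0xFFFFFFFF, amount)

print(hex(rotl32(0x80000001, 7)))  # 0xc0 -- the high bit wraps around to the low end
```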

Beyond the SM, the other big architectural change in Maxwell is the growth of the L2 cache. The GM107's L2 cache is 2MB, up from just 256KB in the GK107. This larger cache should provide two related benefits: bandwidth amplification for the GPU's external memory and a reduction in the power consumed by doing expensive off-chip I/O. Caches keep growing in importance (and size) for graphics hardware for exactly these reasons. I'm curious to see whether the upcoming larger chips based on Maxwell follow the GM107's lead by including L2 caches eight times the size of their predecessors. That may not happen. Nvidia GPU architect Jonah Alben tells us the L2 cache size in Maxwell is independent of the number of SMs or flops on tap.
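To illustrate what bandwidth amplification means in this context, here's a toy model; both the hit rates and the DRAM bandwidth figure are made-up assumptions, not measurements of the GM107.

```python
# Toy model of L2 bandwidth amplification. All numbers are illustrative assumptions.
def serviceable_bandwidth(dram_gbps, l2_hit_rate):
    # Requests that hit in L2 never touch DRAM, so only the misses consume off-chip
    # bandwidth. The GPU can therefore service more total request traffic than the
    # DRAM interface alone could supply.
    return dram_gbps / (1.0 - l2_hit_rate)

dram_gbps = 86.4  # hypothetical 128-bit GDDR5 interface
print(serviceable_bandwidth(dram_gbps, l2_hit_rate=0.3))  # ~123 GB/s with a small cache
print(serviceable_bandwidth(dram_gbps, l2_hit_rate=0.6))  # ~216 GB/s with a bigger, hotter cache
```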

Along with everything else, the dedicated video processing hardware in Maxwell has received some upgrades. The video encoder can compress video (presumably 1080p) to H.264 at six to eight times the speed of real-time. That's up from 4X real-time in Kepler. Meanwhile, video decoding is 8-10X faster than Kepler due in part to the addition of a local cache for the decoder hardware. This big performance boost probably isn't needed by itself, but again, the goal here is to save power. Along those lines, Nvidia's engineers have added a low-power sleep state, called GC5, to the chip for video playback and other light workloads.