A brief look at Nvidia’s GK110 graphics chip

Last week at its GPU Technology Conference, Nvidia unveiled the first details of its upcoming GK110 GPU, the “real” Kepler chip and bigger brother to the GK104 silicon powering the GeForce GTX 600 series. Although the GK110 won’t be hitting the market for some time yet, Nvidia’s increasing focus on GPU-computing applications has changed the rules, causing the GPU firm to show its cards well ahead of the product’s release. As a result, we now know quite a bit about the GK110’s architecture and the mix of resources it offers for GPU-computing work. With a little conjecture, we can probably paint a fairly accurate picture of its graphics capabilities, too.

Let’s start with the GK110’s basic specifications. Since we’ve known the GK104’s layout for a while now, the exact dimensions of its bigger brother have been the subject of some speculation. Turns out most of our guesses weren’t too far from the mark, although there are a few surprises. We don’t have its exact dimensions yet, but the chip itself is likely to be enormous; it packs in 7.1 billion transistors, roughly double the count of the GK104. The die shot released by Nvidia offers some clear hints about how those transistors have been allocated, as you can see below.

A shot of the GK110 die. Source: Nvidia.

The GK110 is divided into five of the deep green structures above, which are almost certainly GPCs, or graphics processing clusters, nearly complete GPUs unto themselves. Each of those GPCs houses three SMX cores, and Nvidia has confirmed the chip hosts a total of 15 of those. By contrast, the GK104 has four GPCs with two SMX cores each, so the GK110 nearly doubles its per-clock processing power.

Ringing three sides of the chip are its six 64-bit memory controllers, giving it an aggregate 384-bit path to memory, 50% more than the GK104. That’s not an increase in interface width from the big Kepler’s true predecessor, the Fermi-based GF110, but GDDR5 data rates are up by roughly 50% in the Kepler generation, so there’s a bandwidth increase on tap, regardless. Looks like the PCI Express interface is on the upper edge of the chip; it has been upgraded to Gen3, with twice the peak data rates of Gen2 devices.

Logical block diagram of the GK110’s compute mode. Source: Nvidia.

Because it has a dual mission, serving both the GPU computing and video card markets, the GK110 has a somewhat different character than the GK104. As you’ve likely noted, in some cases it has twice the capacity of the GK104, while other increases are closer to 50% or so. More notably, the GK110 has some compute-oriented features that the GK104 lacks, including ECC support (for both on-chip storage and off-chip memory) and the ability to process double-precision floating-point math at much higher rates. (The GK104 has token double-precision support at 1/24th the single-precision rate, only to maintain compatibility. Single-precision datatypes tend to be entirely sufficient for real-time graphics and most consumer applications involving GPU computing.)

The GK110-based Tesla K20. Source: Nvidia.

Nvidia said repeatedly at the show that increasing double-precision performance was a major objective for the big Kepler chip, and it appears the firm is on track to deliver. The GF110-based Tesla M2090 card is rated for a peak of 666 DP gigaflops, and Nvidia claims the GK110-based Tesla K20 will exceed one teraflops. If we assume a relatively conservative clock rate of 700MHz for the Tesla product, we’d expect the K20 to double the M2090’s throughput, to 1.3 teraflops.

The ceiling may be even higher than that. Nvidia’s press release about the K20 cryptically says the GK110 “delivers three times more double precision performance compared to Fermi architecture-based Tesla products,” and Huang said something similar in his keynote. In other presentations, though, the 3X claims were tied to power efficiency, as in three times the DP flops per watt, which seems like a more plausible outcome—and a very good one, since power constraints are paramount in virtually any computing environment these days. In order to deliver full-on three times the DP flops of Fermi-based Tesla cards, the K20 would have to run at nearly 1GHz. It’s possible the K20 could reach that speed temporarily thanks to Nvidia’s new driver-based dynamic voltage and frequency scaling mechanism (dubbed GPU Boost in the GeForce products), but it seems unlikely the K20 will achieve sustained operation at that frequency.
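The arithmetic behind those estimates can be sketched in a few lines of Python. The sketch assumes what this article describes: 15 SMX units with 64 double-precision units each, and each unit retiring one fused multiply-add (counted as two flops) per clock. The function names are ours, and the 700MHz clock is our guess rather than a confirmed spec.

```python
# Back-of-the-envelope peak double-precision throughput for the GK110.
# Assumptions (from the article, not confirmed specs): 15 SMX units,
# 64 DP units per SMX, one fused multiply-add (2 flops) per unit per clock.

SMX_COUNT = 15
DP_UNITS_PER_SMX = 64
FLOPS_PER_FMA = 2
M2090_DP_GFLOPS = 666.0  # Nvidia's rating for the Fermi-based Tesla M2090


def peak_dp_gflops(clock_mhz: float) -> float:
    """Peak DP gigaflops at the given core clock in MHz."""
    flops_per_clock = SMX_COUNT * DP_UNITS_PER_SMX * FLOPS_PER_FMA
    return flops_per_clock * clock_mhz / 1000.0


def clock_mhz_for(target_gflops: float) -> float:
    """Core clock in MHz needed to reach a target DP rate."""
    flops_per_clock = SMX_COUNT * DP_UNITS_PER_SMX * FLOPS_PER_FMA
    return target_gflops * 1000.0 / flops_per_clock


if __name__ == "__main__":
    k20 = peak_dp_gflops(700)
    ratio = k20 / M2090_DP_GFLOPS
    print(f"K20 at 700MHz: {k20:.0f} DP gigaflops ({ratio:.1f}x the M2090)")
    needed = clock_mhz_for(3 * M2090_DP_GFLOPS)
    print(f"Clock needed for 3x the M2090: {needed:.0f}MHz")
```

Under these assumptions, 700MHz yields roughly 1.3 teraflops, and the 3X figure works out to a little over 1GHz when measured against the M2090's rating.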

The SMX core
The single biggest change in the Kepler architecture is the redesigned shader multiprocessor core, nicknamed the SMX.

The GK110 SMX. Source: Nvidia.

From a block diagram standpoint, the GK110’s SMX looks very much like the GK104’s, with the same basic set of resources, from the 192 single-precision shader ALUs right down to the 16 texels per clock of texture filtering. That’s a departure from the Fermi generation, where the GF104’s SM mixed things up a bit. The only major change from the GK104 is the addition of 64 double-precision math units. At least, that’s what the official block diagram tells us, but I’m having a hard time believing the DP execution units are entirely separate from the single-precision ones. Odds are that the GK110 breaks up those 64-bit numbers into two pieces and uses a pair of ALUs to process them together, or something of that nature.

Our understanding is that the SMX has eight basic execution units, four units with 32 ALUs each and another four with 16 ALUs each. We suspect double-precision math is handled on the four 32-wide execution units, with the 16-wide units left idle. The numbers work out if that’s the case, at least. The GK110 can process 64 double-precision ops per clock, one third of its single-precision rate.
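Here is a quick check of how those numbers work out. The pairing of two single-precision lanes per double-precision op is our guess, as noted above, and the variable names are just for illustration.

```python
# Sanity check of the SMX lane counts described above. Pairing two
# single-precision lanes per double-precision op is the article's guess,
# not something Nvidia has confirmed.

WIDE_UNITS, WIDE_LANES = 4, 32      # four 32-wide execution units
NARROW_UNITS, NARROW_LANES = 4, 16  # four 16-wide execution units

# Total single-precision ALUs per SMX.
sp_lanes = WIDE_UNITS * WIDE_LANES + NARROW_UNITS * NARROW_LANES

# DP ops per clock if only the 32-wide units are used, two lanes per op.
dp_ops_per_clock = (WIDE_UNITS * WIDE_LANES) // 2

print("SP ALUs per SMX:", sp_lanes)
print("DP ops per clock per SMX:", dp_ops_per_clock)
print("SP:DP ratio:", sp_lanes // dp_ops_per_clock)
```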

All this talk of rates brings up another issue with the Kepler generation. As David Kanter has pointed out, the SMX’s big increases in shader flops have been accompanied by proportionately smaller increases in local storage area and bandwidth. As a result, key architectural ratios like bandwidth per flop have declined, even though the chip’s overall power has increased. The GK110 has a new trick that should help offset this change in ratios somewhat: the SMX’s 48KB L1 texture cache can now be used as a read-only cache for compute, bypassing the texture unit. Apparently some clever CUDA coders were already making use of this cache in older GPUs, but with GK110, they won’t have to contend with texture filtering and the like.

Along the same lines, the GK110’s shared L2 cache has doubled in size from Fermi, to 1.5MB, and it has twice the bandwidth per clock, as well. Yes, the ALU count has more than doubled, but the increases in cache size and bandwidth should mean improvement, even with the shifting ratios.

Built for compute
The GK110 includes some other compute-oriented provisions that the GK104 lacks, and those are intended to deal with the growing problem of keeping a massively parallel GPU fully occupied with work.

Fermi and prior chips have only a single work queue, so incoming commands from the CPU are serialized, and work can only be submitted by, effectively, a single CPU core. As a result, even though Fermi supports multiple concurrent kernels, Nvidia claims the GPU often isn’t fully occupied when running complex programs. To remedy this situation, the GK110 has 32 work queues, managed in hardware, so it can be fed by multiple CPU threads running on multiple CPU cores. Nvidia has oh-so-cleverly named this new capability “Hyper-Q”.

The other big hitter is a feature called Dynamic Parallelism. In a nutshell, the big Kepler gives programs running on the GPU the ability to spawn new programs without going back to the CPU for help. Among other things, this feature allows a common logic structure, the nested loop, to work properly and efficiently on a GPU.

Dynamic Parallelism zooms in on a Mandelbrot set. Source: Nvidia.

Perhaps the best illustration of this capability is the classic computing case of evaluating a fractal image like a Mandelbrot set. On the GK110, a Mandelbrot routine could evaluate the entire image area by breaking it into a coarse grid and checking to see which portions of that grid contain an edge. The blocks that do not contain an edge wouldn’t need to be further evaluated, and the program could “zoom in” on the edge areas to compute their shape in more detail. The program could repeat this process multiple times, each time ignoring non-edge blocks and focusing closer on blocks with edges in them, in order to achieve a very high resolution result without performing unnecessary work—and without constantly returning to the CPU for guidance.
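The refinement loop described above can be sketched on the CPU in plain Python. This is only an analogy: on the GK110, each refinement step would be a GPU program launching child programs, whereas here ordinary recursion stands in for that. The function names, the sampling-based edge test, and the iteration limit are all illustrative assumptions.

```python
# CPU-side Python analogy of the adaptive Mandelbrot refinement described
# above. Ordinary recursion stands in for GPU programs spawning new ones.
# All names and the sampling-based edge test are illustrative assumptions.


def in_set(c: complex, max_iter: int = 50) -> bool:
    """True if c has not escaped after max_iter iterations of z = z*z + c."""
    z = 0j
    for _ in range(max_iter):
        z = z * z + c
        if abs(z) > 2.0:
            return False
    return True


def has_edge(x0, y0, x1, y1, samples: int = 4) -> bool:
    """Sample a small lattice over the block; disagreement implies an edge."""
    seen = set()
    for i in range(samples):
        for j in range(samples):
            x = x0 + (x1 - x0) * i / (samples - 1)
            y = y0 + (y1 - y0) * j / (samples - 1)
            seen.add(in_set(complex(x, y)))
            if len(seen) == 2:
                return True
    return False


def refine(x0, y0, x1, y1, depth, max_depth, leaves):
    """Subdivide blocks that contain an edge; record finished blocks."""
    if depth == max_depth or not has_edge(x0, y0, x1, y1):
        leaves.append((x0, y0, x1, y1, depth))
        return
    xm, ym = (x0 + x1) / 2, (y0 + y1) / 2
    for bx0, by0, bx1, by1 in ((x0, y0, xm, ym), (xm, y0, x1, ym),
                               (x0, ym, xm, y1), (xm, ym, x1, y1)):
        refine(bx0, by0, bx1, by1, depth + 1, max_depth, leaves)


if __name__ == "__main__":
    leaves = []
    refine(-2.0, -1.5, 1.0, 1.5, 0, 5, leaves)
    print(f"{len(leaves)} finished blocks versus {4 ** 5} on a uniform grid")
```

A coarse lattice test like this one can miss very thin features, so a real implementation would want a more careful edge check. The point of the sketch is only that blocks without an edge drop out early and never get subdivided.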

Since, as we understand it, pretty much any data-parallel computing problem requires a data set that can be mapped to a grid, the usefulness of Dynamic Parallelism ought to be pretty wide-ranging. Also, Nvidia claims it simplifies the programming task just by allowing the presence of nested loop logic. Obviously, these benefits won’t show up in a peak flops count, but they should improve the GPU’s real-world effectiveness, regardless.

Nvidia has tweaked the programming model for Kepler in several more ways. A new “shuffle” instruction allows for data to be passed between threads without going through local storage. Atomic operations have been beefed up, with int64 versions of some operations joining their int32 counterparts. Kepler’s combination of a shorter pipeline and more atomic units should increase performance, too. Nvidia claims the atomic ops that were slowest on Fermi will be as much as ten times faster on Kepler, and even the fastest atomics on Fermi will be twice as fast on the GK110. Also, Kepler’s ISA encoding allows up to 255 registers to be associated with each thread, up from 63 in Fermi.
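To give a feel for what a shuffle-style exchange buys you, here is a toy Python model of a 32-lane group of threads summing its values by passing registers directly between lanes. It is not CUDA code: the list index plays the role of a thread ID, `shfl_down` is a hypothetical stand-in for the hardware instruction, and the out-of-range behavior is an assumption of the model.

```python
# Toy model of a 32-lane warp doing a sum reduction with a "shuffle down"
# exchange. Plain Python, not CUDA: the list index plays the role of a lane
# (thread) ID, and shfl_down is a hypothetical stand-in for the instruction.

WARP_SIZE = 32


def shfl_down(values, delta):
    """Each lane reads the register of the lane `delta` positions above it.

    Lanes whose source would fall outside the warp keep their own value
    (an assumption of this model).
    """
    return [values[i + delta] if i + delta < WARP_SIZE else values[i]
            for i in range(WARP_SIZE)]


def warp_sum(values):
    """Tree reduction: after five exchanges, lane 0 holds the total."""
    vals = list(values)
    offset = WARP_SIZE // 2
    while offset > 0:
        incoming = shfl_down(vals, offset)
        vals = [a + b for a, b in zip(vals, incoming)]
        offset //= 2
    return vals[0]


if __name__ == "__main__":
    print(warp_sum(range(WARP_SIZE)))
```

Each step halves the number of lanes holding useful partial sums, and no intermediate value is written to shared storage along the way.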

A GK110-based GeForce?
Nvidia has done a tremendous amount of work, from the hardware to software to promotion and more, to cultivate a market for its graphics chips as data-parallel processors for use in supercomputing, HPC, and academia. GTC 2012 featured a total of 340 different sessions presented by folks from a broad range of disciplines, and virtually all of the presenters were using GPUs for something other than real-time graphics.

If your interest in GPUs was, like mine, first sparked by graphics and gaming, you might be wondering about the prospects for a GeForce card based on the GK110. Trouble is, those prospects have been dampened somewhat by Nvidia’s success in other areas. The GK110 won’t reach the market until the fourth quarter of 2012, and multiple folks from Nvidia forthrightly admitted to us that those chips are already sold out through the end of 2012. All of those sales are to supercomputing clusters and the like, where each chip commands a higher price than it would aboard a video card. One gentleman seated in front of us at the GK110 deep-dive session mentioned in passing that he had 15,000 of the chips on order, which was his reason for attending.

The Nvidia executives we talked with raised the possibility of a GK110-based GeForce being released this year only if necessary to counter some move by rival AMD. That almost certainly means that any GK110-based GeForce to hit the market in 2012 would come in extremely limited quantities.

Nevertheless, with the information Nvidia has revealed about the GK110 and a dash of speculation, we can paint a picture of how a GeForce card based on the big Kepler might look. Note that we’re assuming a higher clock frequency for the consumer graphics card than we have for the Tesla K20. Beyond the clock speeds, which affect all of the rates, we’re only guessing about a couple of graphics capabilities. Nvidia hasn’t officially confirmed that the GK110 has five GPCs, although we do have the die shot. Similarly, we’d expect 48 pixels per clock of ROP throughput to accompany its six memory channels, if the GK110 retains the same arrangement as the mid-sized Kepler.

| | GF110 | GK104 | GK110 | Tahiti |
|---|---|---|---|---|
| Transistor count | 3.0B | 3.5B | 7.1B | 4.31B |
| Process node | 40 nm @ TSMC | 28 nm @ TSMC | 28 nm @ TSMC | 28 nm @ TSMC |
| Core clock | 772 MHz | 1006 MHz | 900 MHz | 925 MHz |
| Hot clock | 1544 MHz | - | - | - |
| Setup rate | 3088 Mtris/s | 4024 Mtris/s | 4500 Mtris/s | 1850 Mtris/s |
| ALUs | 512 | 1536 | 2880 | 2048 |
| SP FMA rate | 1.6 Tflops | 3.1 Tflops | 5.2 Tflops | 3.8 Tflops |
| Bilinear texels/clock | 64/64 | 128/128 | 240/240 | 128/64 |
| Bilinear texel rate | 49/49 Gtexels/s | 129/129 Gtexels/s | 216/216 Gtexels/s | 118/59 Gtexels/s |
| ROPs | 48 | 32 | 48 | 32 |
| ROP rate | 37 Gpixels/s | 32 Gpixels/s | 43 Gpixels/s | 30 Gpixels/s |
| Memory clock | 4000 MT/s | 6000 MT/s | 6000 MT/s | 5700 MT/s |
| Memory bus width | 384 bits | 256 bits | 384 bits | 384 bits |
| Memory bandwidth | 192 GB/s | 192 GB/s | 288 GB/s | 274 GB/s |
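
For anyone who wants to tweak our guesses, the peak rates in the table follow from unit counts and clocks. The Python sketch below reproduces the GK104 and GK110 columns. The 900MHz core clock, five GPCs, and 48 ROPs for the GK110 are our assumptions, the one-triangle-per-clock-per-GPC setup rate is inferred from the table's figures, and the class and field names are made up for illustration.

```python
# Derive theoretical peak rates from unit counts and clocks. The GK110's
# 900MHz core clock, five GPCs, and 48 ROPs are the article's guesses, not
# confirmed specs. One triangle per clock per GPC is inferred from the table.

from dataclasses import dataclass


@dataclass
class GpuConfig:
    core_mhz: float
    gpcs: int
    alus: int
    texels_per_clock: int
    rops: int
    mem_mtps: float  # memory data rate in MT/s
    bus_bits: int

    def rates(self) -> dict:
        ghz = self.core_mhz / 1000.0
        return {
            "setup_mtris": self.gpcs * self.core_mhz,
            "sp_tflops": self.alus * 2 * ghz / 1000.0,  # FMA = 2 flops
            "gtexels": self.texels_per_clock * ghz,
            "gpixels": self.rops * ghz,
            "mem_gbps": self.bus_bits / 8 * self.mem_mtps / 1000.0,
        }


gk104 = GpuConfig(core_mhz=1006, gpcs=4, alus=1536, texels_per_clock=128,
                  rops=32, mem_mtps=6000, bus_bits=256)
gk110 = GpuConfig(core_mhz=900, gpcs=5, alus=2880, texels_per_clock=240,
                  rops=48, mem_mtps=6000, bus_bits=384)

for name, gpu in (("GK104", gk104), ("GK110", gk110)):
    print(name, {k: round(v, 1) for k, v in gpu.rates().items()})
```

Changing `core_mhz` for the GK110 scales every rate except memory bandwidth, which is why the clock speed guess matters so much to the comparison.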

We think the GK104 has a more suitable mix of resources for real-time graphics, especially for current games that have been cross-developed for antiquated console hardware. The GK110 may be twice the size, but it’s not likely to be twice as fast for gaming. Still, our theoretical GK110-based GeForce increases shader flops and texture filtering capacity by two-thirds, along with respectable improvements in ROP rate and memory bandwidth. Since the GK104 is already a match for AMD’s Tahiti, we reckon the GK110 would be substantially faster still, if and when it makes its way into a consumer graphics card.

Responses to “A brief look at Nvidia’s GK110 graphics chip”

  1. Nope. That picture of a die and spec sheet can’t run Crysis.

    GK110 might, but it doesn’t exist yet.

  2. This was a great article on an interesting bit of tech. Sad to hear Nvidia basically admit this will not make it out this gen though.

  3. I agree.

    It’s possible that Nvidia is able to do some magic with the GPGPU stuff (probably hidden) that will allow higher than expected game performance, but I’m not holding my breath for that. Still, it would be in line with their efforts concerning the efficiency of how the GTX600’s use memory. That has continued to amaze me.

  4. IDK but don’t expect near double performance of the GTX680, more like 50% increase in gaming performance I bet.
    Seems like Nvidia is dedicating a lot of transistors to GPGPU stuff…

  5. That’s what I was aiming for… more of ‘can it blend’ though. 😀

  6. I could see us getting the ones that go in the “poor” bin and paying $700 for them.

  7. That is EXACTLY how it works. The reason for going open source is to allow ANYBODY in the public to work on the code if they so wish without being encumbered by having to pay royalties or fees. It does not mean that because it is freely available that an entity has to contribute to it. If that was the case, open source would be even worse than proprietary as it would mean nobody would be able to release a product until every bit of code out there supported it.

  8. Just 1%? That’s awful. I’ve been under the impression that when overclocked to the limit, AMDs cards reach a higher level of performance, but I guess I’m mistaken.

    To add to that, Nvidia’s card will draw less power, emit less heat, and produce less noise at the same time.

    Win? I think so.

  9. Yup, and I give Deanjo the win here. Actual work experience is more important.

  10. TL;DR: Deanjo actually works on this code, Shank has a stiffy for open-source-anything.

  11. the gtx580 was most certainly not 35% faster “on average” in games than the 6970. at 1920×1200 it was about 12-15% faster and at 2560 it was only about 10% faster on average.

  12. “if Nvidia doesn't want to optimize for LuxMark then thats Nvidia's business”

    Why is it nvidia's responsibility to optimize someone else's code? If luxmark wants to optimize for the families that is up to them. If they are satisfied with optimizing for AMD then that's their decision. If nvidia had to do the optimization of the code then it really wouldn't matter whether the code is open or not.

  13. That goes for every architecture under the sun, if Nvidia doesn’t want to optimize for LuxMark then thats Nvidia’s business. Sounds like an excuse more than anything else. I don’t have to show you anything, you’re making the point that Nvidia’s card is slow because the software isn’t optimized, and I’m saying so what, the end result is it’s slower, you don’t get points for potential performance.

  14. So instead of Katherine Willows saying ‘Enhance the video!’ she’ll now just say ‘Nvidia(c)!’ ??

  15. More transistors on a single chip than people on the planet, wow.

    What’s the largest x86 consumer chip have, 1 billion?

    Kinda puts things in perspective. If only I had a silicon mine……

  16. As a side note, I find it very, very interesting that in Switzerland I find the following prices:
    670: 429-470 francs
    680: 554-679 francs
    690: 1111 francs (in stock btw!)
    7970: 435-557 francs

    So, I think it’s fair to say that the production/stock issues (although almost every model can be found somewhere in stock, including the 690) translate to very important price differences. People do pay a high premium for stuff that is less available. MSRP is not the end of the story…

  17. “The program could repeat this process multiple times, each time ignoring non-edge blocks and focusing closer on blocks with edges in them, in order to achieve a very high resolution result without performing unnecessary work, and without constantly returning to the CPU for guidance.”

    They're aiming for a CSI product placement.

  18. They might bother because some people would buy it. Not many, but I bet a “limited edition” of, say, 10,000 GTX 695 cards would sell out at any price.

  19. With twice as many transistors as the GTX 680 and at the same process node, I am sure it is going to use at least 50% more power. That is best case scenario for Nvidia…

  20. “Financially, that’s good news for both Nvidia and AMD (more so for AMD), but, well, sucks to be us.”

    There’s no bearing on high prices being a guaranteed financially good thing for a producer. It can just as easily kill a market when a competitor releases something cheaper and your product is seen as expensive. Look at netbook/tablet markets versus ultraportable markets.

  21. We absolutely DO know what happened. Don’t try to speak for us and present your side as the truth. 6xx series IS delayed due to manufacturing issues. NOT because they were going to blow AMD out of the water that they could release them later.

    BTW, check out the performance summary of the latest slight factory overclocked 7970 card over at

    The 680 beats the 7970 by a whopping 1%!

  22. Show me the optimizations. Seriously. It’s opensource, so show them. Running on a product is not the same as optimized for a product. Read the article yourself. Right in there:

    “Nvidia tells us LuxMark isn't a target for driver optimization and may never be.”

    If anyone sounds like a fanboi it is you, as you cannot face the fact that code optimization for a specific architecture plays a HUGE role in performance output. You also seem to be under the illusion that because something is opensource it is optimized, which is 100% complete and utter BS. It has the potential to be optimized if someone wishes for it to be. That doesn't mean that it will be. Closed source can be optimized for an architecture as well; whether someone chooses to do that is again a choice, if they have an interest in doing so.

  23. No, my point is that it’s easier to optimize an open source project because the code is available to anybody who wants to make it faster on any given architecture. All you’re trying to do is dismiss an empirical example where AMD outdoes Nvidia. I’m not cherry picking, it’s one of the few OpenCL renderers out there that runs on both AMD and Nvidia hardware. AMD also outdoes Nvidia in other GPGPU benchmarks by large margins. I’m citing TR’s reviews and you’re saying that it’s all a toss up because none of these GPGPU benchmarks are Nvidia optimized. You sound like a fanboi, not an objective user to me.

  24. I think it’s a funny twist on the “can it fold” question. +1 for you.

  25. Yeah, while the HD 7970 is faster than Fermi or GTX 680 in LuxMark and Civ 5 direct compute:

    There are still a number of shader performance benchmarks where the HD 7970 is slower, so it is not hard to conclude that optimization would make a huge difference.

  26. Your point? Just because something is open source doesn’t mean that someone will optimize it for an architecture, just that the possibility is there if someone wishes to do so. Case in point, has everybody added bulldozer optimizations to their opensource projects?

    Please do link these opensource projects that carry optimizations for the g92, GTX 200, fermi and AMD 4xxx/5xxxx/6xxxx/7xxxx families of processors. Since they are open source they should be easy for you to point to the code examples.

  27. No, your use of small lux as an example doesn’t back up your point. That someone has not bothered optimizing the code for an architecture is not proof that architecture A is faster than B. I have openCL code that will run on an AMD card but hasn’t been optimized for their architecture. As a result, even an 8800GT will perform better than an AMD 6970. That doesn’t mean that the 8800GT is a faster computing card, just that my code and algorithms have been optimized with the G92’s architecture in mind.

  28. I honestly posted it as more as sarcasm with a tad of rhetoric. Totally not where this chip is aimed, but that’s what makes it interesting. 😀

  29. Yeah, I think what you mean is that SLI still has the frame latency of that of one of its GPU’s. Because the 2 GPU’s are doing whole, alternating frames, each, it still takes the latency of each GPU’s rendering speed to bring each frame to the screen. Basically, its up to twice as fast for things like real-time movies, but not for gaming input lag. The gaming input lag will be the same as 1 of the GPU’s, then SLI has to deal with issues like frame send timing from the engine, which it will probably never get right unless the engine is coded for it. It can get rid of the stutter, but not the incorrect engine frame times.

  30. I believe the NVS is just the name they put on their slow cards to maintain the Quadro name as the “powerful” and “premium” end of their range.

    Basically, the NVS seems to be the professional driver-support version of the crap cards based on G98, GT218, GF108, GF119 etc.

    If it’s slower than an IGP for 3D rendering, they don’t want to muddy their Quadro image with it.

  31. All PC products have always been on a price/performance curve. Regardless of whether it’s a good value or not, if you want that top tier of performance you’re going to have to spend a disproportional amount more. That’s the cost of cutting edge, and believe it or not there are actually applications that will warrant this such as using 3D vision surround on a mobo without all the PCIe x16 slots.

    An improvement is still an improvement, and this card represents a better value for a dual GPU card than has ever been available previously. The reality of it is the card actually costs more than twice as much to make and has higher binned chips as well. If you don’t like it, don’t buy it. Personally I’d take 2 of them over 4-way SLI any day, but I don’t have that kind of money.

    EDIT: I’d like to point out that the last time we had a similarly designed product (no-compromise SLI on a stick) it was the Asus Mars II, it cost $1500, and multi-gpu solutions were even worse.

  32. “The difference was a tiny bit bigger maybe, but not by much. The 580 was the "halo" card in the market, but the 6970 was very close in performance and a better value.”

    The performance delta was still larger than what we are seeing today. Here is the 580 vs 6970. Here is the 480 vs 5870. Here is the 680 vs 7970.

    “Never say never.”

    This was also during the time where there were huge performance driver releases. The 5870 started out very competitive with the 480. However, by the time the 580 released and the dust settled the 480 was consistently faster than it and the performance delta grew. I will agree on the 4 series vs the 2xx however.

  33. Huge demand?! Even things that are in “huge demand” get restocked. The 680 and now the 690 in particular aren’t being restocked. 1 SKU returning once a week for two hours is not huge demand. That’s a supply problem. The 680 was released almost two months ago.

    “For most applications, they're superior to AMDs products (unlike in the last generation), and many people are jumping ship.”

    The 580 outperformed the 6970 in every metric/game (aside from power) at a sizable and easily noticeable performance delta. On average it was 35% faster in games and with GPGPU applications it was 2x as fast. The performance delta between a 680 and the 7970 ranges from -6% to +15% in games, and the 680 sits below a 570 in GPGPU, which is also below the 7870 and 7850. How someone can conclude that this generation is so much better than last is beyond me.

    “AMDs products also had stock issues, but their demand was dampened by a lack of clear information on Nvidias upcoming part. Those that waited are glad they did.”

    Those that waited are still waiting because they aren't in stock.

  34. I agree it’s pretty obvious it just wasn’t ready, but..

    “GTX 580 is considerably faster than 6970. This isn't the case this round.”

    The difference was a tiny bit bigger maybe, but not by much. The 580 was the "halo" card in the market, but the 6970 was very close in performance and a better value.

    “since when has Nvidia's top SKU traded blows with AMD's top SKU? That's never happened.”

    5870 vs 480, 4870 vs 280. Never say never.

  35. Sorry bro, I run cards from both companies, recommend cards from both companies, and switch back and forth for my primary system regularly.

    Last generation AMD had the edge in many circumstances, and this generation Nvidia wipes them; where they meet AMD in performance, they beat them at everything else.

    I’m reconsidering the launch window perspective a little. We still don’t know everything about it all, but one thing that did come up was a comment (rumors, like everything else we’re discussing) where they stated that they were disappointed with the HD7000 series’ performance. Around that time it was outed that Nvidia would be launching their mid-range SKU as a high-end GPU, because AMD’s new stuff was just so slow.

  36. Keep in mind that the GTX670/680/690 stock issue is also a product of huge demand.

    For most applications, they’re superior to AMDs products (unlike in the last generation), and many people are jumping ship.

    AMDs products also had stock issues, but their demand was dampened by a lack of clear information on Nvidias upcoming part. Those that waited are glad they did.

  37. I’ll leave this up to an ‘agree to disagree’.

    We don’t know the truth, and we’re both speculating-

    What we DO know is that Nvidia’s relatively tiny and cheap GK104 is beating handily upon AMD’s best. The 560Ti wasn’t in the same league as the HD6970, but its successor can outrun the HD7970.

    Why would Nvidia spend time manufacturing GK110 for desktop use again?

  38. Thanks for your detailed technical explanation. It must be great to have unreleased prototype hardware in your hands, and even benchmarked already so you can give us your expert advice on its performance-per-watt figures.

  39. There are few projects that have optimized code paths for multiple architectures, but in an open source project the option is wide open for optimization.

  40. I don’t care what you develop, what you just posted doesn’t back up your point. Small Lux is an open source renderer that could be optimized for any architecture.

  41. You think so?

    I think 28nm will become a lot more cost-effective fairly soon as parts finally begin to flood the market given nvidia has priority with tsmc and amd parts are readily available. Also, TSMC should have a lot more production capability up by the end of the year.

    Factor in that AMD will probably launch a new series that fills the cracks between their current parts, that nvidia will probably rebrand gk104/’gk114′ with 7gbps ram to compete, driving down ‘older’ parts, and hopefully eventually launch gk106, and I think we’ll be well on our way.

    As for GK110, while Scott’s thinking is sound, I think it is much more likely we will see skus with SMX units disabled, even at the high-end. The most logical choice is one complete GPC, amounting to essentially 1.5x gk104. This is not just because of yield practicality, but because perf/watt is best up to 1.05v, and for yield reasons performance/w scales to 1.175v with essentially similar corresponding clocks. Hence, that range makes the most sense for business reasons.

    I think we’ll see 5TF SPFP, but more likely 12-13 SMX units at more balanced corresponding clocks. This is also a better mix of pixel fillrate/memory bandwidth etc for gaming than more units at a lower clock which is better suited for the HPC market. I do not think nvidia will make the 7970 mistake, but rather essentially make a high-clocked 7950 (similar to gk104 in inherent computational units) their top consumer option which should be more efficient.

  42. That is delusion at its best. Nvidia could not design and manufacture GK100. It needed to wait till GK104 was in production to tape out GK110. GK110 seems to have taped out in late Jan. For such a complex chip 9 months from tape out to production is expected.

    Nvidia will have a tough time filling the piling orders from HPC customers who are definitely waiting eagerly for Tesla K20. GK110 for consumer graphics aka GTX 780 will be at best a late Q4 2012 event. Worst case Q1 2013. Also we have to wait and see what specs and clocks Nvidia achieves. I doubt we will have a fully enabled chip. Those will go to the Tesla crowd. For a chip that's double the transistor count of GTX 680 I don't think it can achieve 900 MHz if it wants to stay at around 250W. But it's going to be a beastly chip if Nvidia can get good yields and good clocks.

  43. While we’re at it, what are the winning lottery numbers for each week this coming November, and what are the USD/EUR and USD/CNY exchange rates for 3 PM EST on December 10 of this year?

  44. You seem to be filtering everything through heavily green-tinted glasses.

    My post was not an attack on your precious beloved company, it was a comment about the inevitability of moving to a more realistic business strategy. NV is certainly not blindly copycatting AMD by making this move, and AMD is not copycatting NV by improving their compute capabilities. It’s just that both companies are responding to the same market pressures, so their paths will often converge.

    If as late as Christmas (when the 7xxx were launched) NV was within just a few months of launching the GK110, there’s no way on earth they would have chosen to delay it until late Q4 just because of AMD’s product lineup. That’s absurd. That would have been a ridiculously poor business decision.

    Beyond being economically absurd, your scenario was also physically impossible. A bunch of sites reported that GK104 taped out late last summer while GK110 didn’t tape out until Q1 of this year. It takes a good amount of time, usually 8-12 months, from tape-out to shipping products. Though some unanticipated delays might have contributed to GK110’s later release, it’s got to be mostly a deliberate decision made over a year ago.

  45. Quadro and Geforce cards are largely the same when it comes to the chip itself. the PCB, specifically the memory and display interfaces are adjusted more for the high end though, offering up larger Memory spaces, and also support more than 2 “live” display outputs at once (most Geforce cards only support pushing display to up to 2 monitors per card, even if they physically have 3 or more connectors).

    Quadro cards also come with drivers that are adjusted for “industrial” uses. I’m frankly not too clear on the nuances, but much like how the GeForce drivers receive periodic updates improving performance in some games and fixing various bugs, the Quadro drivers receive much the same treatment for their specific uses (CAD, etc).

    Also, warranty and support are typically much better on Quadro cards; this accounts for the lion’s share of the increased cost of a Quadro card compared to its GeForce equivalent.

    Tesla cards are a step up from Quadro and are meant almost solely for GPGPU use rather than for driving displays. As such, they have further refinements to their circuitry, drivers, and support, and to my knowledge, all Tesla cards lack display output connectors.

  46. Yeah… The GTX 690 has between 4 and 50% better 99th percentile frame times, though the straight FPS shows a bigger lead. Costs twice as much, average 30% better frame times… totally worth it.

  47. I would say it’s pretty apparent now that GK110 isn’t out because it wasn’t ready. If GK104 was available in any kind of volume I could buy the PR. Considering even the 670 struggles to stay in stock I just don’t buy it. In addition since when has Nvidia’s top SKU traded blows with AMD’s top SKU? That’s never happened. GTX 580 is considerably faster than 6970. This isn’t the case this round.

  48. I get OpenCL very well, thank you very much. I develop GPGPU code for a living nowadays. It’s not so much the libraries that make the difference. It is the OpenCL code implementation and optimization that make the difference. OpenCL, although vendor-neutral, still requires code optimization for each of the various architectures to get the most from each one.
    [quote<]9. Conclusions In this paper, we evaluated various aspects of using OpenCL as a performance-portable method for GPGPU application development. Profiling results show that environment setup overhead is large and should be minimized. Performance results for both the Tesla C2050 and Radeon 5870 show that OpenCL has good potential to be used to implement high performance kernels[b<] so long as architectural specifics are taken into account[/b<] in the algorithm design. Even though good performance should not be expected from blindly running algorithms on a new platform, auto-tuning heuristics can help improving performance on a single platform. Putting these factors together, we conclude that OpenCL is a good choice for delivering a performance-portable application for multiple GPGPU platforms.[/quote<]

    [quote<]Nevertheless, to make the most of the resources available on the underlying hardware probably the most important advice is to use OpenCL* language features and[b<] the hardware features and optimize the OpenCL* code to the target device.[/b<][/quote<]

  49. You don’t get OpenCL then. Nvidia could contribute OpenCL-optimized code to small lux if they wanted to. AMD’s OpenCL libraries even run faster on Intel CPUs.

  50. Optimization for each respective architecture yields huge swings in performance. Take Fermi-optimized OpenCL code and it pukes on AMD; take AMD-optimized code and it pukes on Nvidia. I have yet to see an OpenCL project that optimizes for both architectures.
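
    One small, concrete piece of that per-architecture tuning is the launch geometry. The sketch below is only an illustration, not code from any project mentioned here: `pick_launch_config` and `groups_per_workgroup` are made-up names, and it assumes the usual native SIMD widths (32-thread warps on Nvidia, 64-wide wavefronts on AMD). Real tuning also covers memory access patterns, vector widths, and local memory use, none of which is shown.

```python
# Assumed native SIMD group widths: 32-thread warps on Nvidia GPUs,
# 64-wide wavefronts on AMD GPUs.
PREFERRED_MULTIPLE = {"nvidia": 32, "amd": 64}


def pick_launch_config(n_items, vendor, groups_per_workgroup=4):
    """Return (local_size, global_size) for a 1-D kernel launch.

    local_size is a multiple of the vendor's SIMD width, and global_size
    is n_items rounded up to a whole number of work-groups, as OpenCL 1.x
    requires. The kernel itself must ignore the padding work-items.
    """
    local_size = PREFERRED_MULTIPLE[vendor] * groups_per_workgroup
    workgroups = (n_items + local_size - 1) // local_size  # ceiling division
    return local_size, workgroups * local_size


for vendor in ("nvidia", "amd"):
    local_size, global_size = pick_launch_config(1000, vendor)
    print(vendor, "local:", local_size, "global:", global_size)
```

    The same kernel source ends up with different work-group sizes on each vendor, which is the simplest form of the architecture-specific adjustment described above.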

  51. “Apparently some clever CUDA coders were already making use of this cache in older GPUs”

    It’s pretty easy to do, and there’s a tidy speedup to be had, also on AMD GPUs 🙂

  52. What’s so impressive about it? It’s obscenely big. Is that all? They can’t even fab the smaller GK104 well enough to compete with AMD in terms of yields/stock. GK110 will be 1 year behind to allow for TSMC’s 28nm to have matured, and even now it looks like yields are too low and costs are too high for this card to even exist outside the HPC market.

  53. AMD’s 7970 is more than twice as fast as Fermi on comparable OpenCL code. Very disappointing indeed.

  54. Nvidia held off on the GK110 because it didn’t have it ready to launch, and it still won’t have it ready until the end of 2012. Nvidia has taken paper launches to a new height, declaring products 6-8 months before their release, so please stop the fanboyism and come back to reality.

  55. You do realize that Nvidia held off on GK110 because AMD’s 7000’s were slower than they expected? And that this is primarily because AMD baked in more GPU-Compute this time around, burning transistors on things that aren’t useful for gaming?

    It’s more like AMD is starting to follow Nvidia with respect to GPU compute. If you compare GK104 to GF114 (GTX 560 Ti), you’ll notice that they’re set up identically, with a smaller bus width and lower HPC capability than their bigger siblings.

  56. Have you seen the results from the GTX 670/680 SLI and GTX 690 benchmarks? What you’re talking about hardly exists for Kepler. It does for AMD cards and, to a lesser extent, Fermi.

  57. SLI sucks for actual return on investment in frame time charts. Sure, there may be 50-90% more frames produced, but half of them are on the high end of the frame stutter, so the actual decrease in frame latency is small. Then there are game and driver issues that are always around, more glitches, etc. You also need twice the memory, and power consumption goes up considerably faster than actual performance.

    You get near double GPGPU performance from dual GPUs (assuming they don’t severely downclock them … GTX 590 …), but for games it’s not worth the extra investment if it costs twice as much.
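
    The gap between frame count and frame latency can be shown with a little arithmetic. This is a minimal sketch using made-up frame-time traces, not measurements from any card: a steady single-GPU trace and a dual-GPU trace that alternates short and long frames. The helper names are hypothetical.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(frame_times_ms):
    """Return (average FPS, 99th-percentile frame time in ms)."""
    avg_fps = 1000 * len(frame_times_ms) / sum(frame_times_ms)
    return avg_fps, percentile(frame_times_ms, 99)


# Made-up traces covering roughly the same span of wall-clock time.
single_gpu = [33.0] * 100      # steady 33 ms per frame
dual_gpu = [5.0, 30.0] * 100   # alternating short and long frames

for name, trace in (("single", single_gpu), ("dual", dual_gpu)):
    fps, p99 = summarize(trace)
    print(f"{name}: {fps:.1f} FPS average, {p99:.1f} ms 99th percentile")
```

    In this contrived case the dual-GPU trace delivers nearly 90% more frames per second, while the 99th-percentile frame time improves by only about 9%, because every second frame still takes almost as long as before.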

  58. A GeForce card based on this chip would have lower gaming performance than the GTX 690 but cost more to make. Why would nvidia bother?

  59. Thought so. That does mean that if the 64-bit elements are actually 2x 32-bit elements, they have to be organized in a way that makes FMA usable — i.e. AMD has done the 2x 32-bit elements thing, but they haven’t been able to double throughput with FMA (tahiti is 940 DP GFlops instead of 1.8 TFlops). Interesting . . .

  60. Great overview Scott! Quick math question, though:

    If there are 64 DP floating point elements per SMX and 15 SMXs per chip @ 700 MHz, that comes to 672 GFlops (64 × 15 × 0.7), not 1.3 TFlops. Does GK110 use FMA or have some form of dual-issue FP units?
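
    The arithmetic in this question can be written out as a short sketch. The 700 MHz clock is the commenter's assumption, not a confirmed spec, and `peak_gflops` is a hypothetical helper. The only outside fact used is the standard convention that a fused multiply-add counts as two floating-point operations.

```python
def peak_gflops(units_per_smx, smx_count, clock_ghz, flops_per_unit_per_clock=1):
    """Theoretical peak throughput in GFLOPS for identical execution units."""
    return units_per_smx * smx_count * clock_ghz * flops_per_unit_per_clock


DP_UNITS_PER_SMX = 64
SMX_COUNT = 15
CLOCK_GHZ = 0.7  # assumed clock from the comment, not a confirmed figure

# One operation per unit per clock: the 672 GFLOPS figure in the question.
plain = peak_gflops(DP_UNITS_PER_SMX, SMX_COUNT, CLOCK_GHZ)

# One FMA per unit per clock, counted as two operations (multiply + add).
with_fma = peak_gflops(DP_UNITS_PER_SMX, SMX_COUNT, CLOCK_GHZ,
                       flops_per_unit_per_clock=2)

print(f"without FMA: {plain:.0f} GFLOPS")
print(f"with FMA:    {with_fma:.0f} GFLOPS")
```

    Under those assumptions, counting each FMA as two operations doubles the 672 GFlops figure to roughly 1.3 TFlops.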

  61. People estimate 550mm^2, bigger than anything since the GTX 280 back in June 2008. Should do well in the compute market.

    [url=<]Back in 2008 around the launch of the 4xxx[/url<], AMD talked about their new strategy of launching the "performance mainstream" parts first and adapting things both upwards and downwards from there rather than starting with a huge chip and cutting it down afterwards. Having the first chip on a new process be a >400mm^2 behemoth is costly and painful (yield problems and many other complications), and >$300 cards don't get all that much consumer marketshare anyways. Of course, they ended up not just delaying the largest chips but skipping them and doing some market repositioning, e.g. with Barts. With this generation, it looks like Nvidia finally decided to make similar moves: waiting on the huge chip and doing some market repositioning. It had to happen sometime. The main difference is that their CUDA customers give them some compelling reasons to keep making the behemoths, though they'll come later in the cycle.

  62. Over the last few years, Nvidia has raised GeForce to a high level.
    But I would like to know: what’s the difference between the GeForce and the Quadro graphics cards?

    I don’t have the time to look through the various technical reports, but did you ever compare both cards?

    Thanks for letting me know.

  63. Impressively huge chip is impressive.

    [quote<]The Nvidia executives we talked with raised the possibility of a GK110-based GeForce being released this year only if necessary to counter some move by rival AMD[/quote<] In a closely-related context, I think this means we won't see a return to graphics card price warfare anytime soon. Financially, that's good news for both Nvidia and AMD (more so for AMD), but, well, sucks to be us.