Nvidia’s ‘Fermi’ GPU architecture revealed

Graphics processors, as you may know, have been at the center of an ongoing conversation about the future of computing. GPUs have shown tremendous promise not just for producing high-impact visuals, but also for tackling data-parallel problems of various types, including some of the more difficult challenges computing now faces. Hence, GPUs and CPUs have been on an apparent collision course of sorts for some time now, and that realization has spurred a realignment in the processor business. AMD bought ATI. Intel signaled its intention to enter the graphics business in earnest with its Larrabee project. Nvidia, for its part, has devoted a tremendous amount of time and effort to cultivating the nascent market for GPU computing, running a full-court press everywhere from education to government, the enterprise, and consumer applications.

Heck, the firm has spent so much time talking up its GPU-compute environment, dubbed CUDA, and the applications written for it, including the PhysX API for games, that we’ve joked about Nvidia losing its relish for graphics. That’s surely not the case, but the company is dead serious about growing its GPU-computing business.

Nowhere is that commitment more apparent than when it gets etched into a silicon wafer in the form of a new chip. Nvidia has been working on its next-generation GPU architecture for years now, and the firm has chosen to reveal the first information about that architecture today, at the opening of its GPU Technology Conference in San Francisco. That architecture, code-named Fermi, is no doubt intended to excel at graphics, but this first wave of details focuses on its GPU-compute capabilities. Fermi has a number of computing features never before seen in a GPU, features that should enable new applications for GPU computing and, Nvidia hopes, open up new markets for its GeForce and Tesla products.

An aerial view
We’ll begin our tour of Fermi with the time-honored tradition of looking at a logical block diagram of the GPU architecture. Images like the one below may not mean much divorced from context, but they can tell you an awful lot if you know how to interpret them. Here’s how Nvidia represents the Fermi architecture when focused on GPU computing, with the graphics-specific bits largely omitted.

A functional overview of the Fermi architecture. Source: Nvidia.

Let’s see if we can decode things. The tall, rectangular structures flanked by blue are SMs, or streaming multiprocessors, in Nvidia’s terminology. Fermi has 16 of them.

The small, green squares inside of each SM are what Nvidia calls “CUDA cores.” These are the most fundamental execution resources on the chip. Calling these “cores” is apparently the fashion these days, but attaching that name probably overstates their abilities. Nonetheless, those execution resources do help determine the chip’s total power; the GT200 had 240 of them, and Fermi has 512, just more than twice as many.

Six of the darker blue blocks on the sides of the diagram are memory interfaces, per their labels. Those are 64-bit interfaces, which means Fermi has a total path to memory that is 384 bits wide. That’s down from 512 bits on the GT200, but Fermi more than makes up for it by delivering nearly twice the bandwidth per pin via support for GDDR5 memory.

Needless speculation and conjecture for $400, Alex
Those are the basic outlines of the architecture, and if you’re like me, you’re immediately wondering how Fermi might compare to its most direct competitor in both the graphics and GPU-compute markets, the chip code-named Cypress that powers the Radeon HD 5870. We don’t yet have enough specifics about Fermi to make that determination, even on paper. We lack key information on its graphics resources, for one thing, and we don’t know what clock speeds Nvidia will settle on, either. But we might as well indulge in a little bit of speculation, just for fun. Below is a table showing the peak theoretical computational power and memory bandwidth of the fastest graphics cards based on recent GPUs from AMD and Nvidia. I’ve chosen to focus on graphics cards rather than dedicated GPU compute products because AMD hasn’t yet announced a FireStream card based on Cypress, but the compute products shouldn’t differ too much in these categories from the high-end graphics cards.

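| Graphics card | Peak single-precision arithmetic, single-issue (GFLOPS) | Peak single-precision arithmetic, dual-issue (GFLOPS) | Peak double-precision arithmetic (GFLOPS) | Peak memory bandwidth (GB/s) |
| GeForce GTX 280 | 622 | 933 | 78 | 141.7 |
| Radeon HD 4870 | 1200 | n/a | 240 | 115.2 |
| Radeon HD 5870 | 2720 | n/a | 544 | 153.6 |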

Those numbers set the stage. We’re guessing from here, but let’s say 1500MHz is a reasonable frequency target for Fermi’s stream processing core. That’s right in the neighborhood of the current GeForce GTX 285. If we assume Fermi reaches that speed, its peak throughput for single-precision math would be 1536 GFLOPS, or about half of the peak for the Radeon HD 5870. That’s quite a gap, but it’s not much different than the gulf between the GeForce GTX 280’s single-issue (and most realistic) peak and the Radeon HD 4870’s—yet the GTX 280 was faster overall in graphics applications and performed quite competitively in directed shader tests, as well.

Double-precision floating-point math is more crucial for GPU computing, and here Fermi has the advantage: its peak DP throughput should be close to 768 GFLOPS, if our clock speed estimates are anything like accurate. That’s roughly 40% higher than the Radeon HD 5870’s 544 GFLOPS, and it’s almost a ten-fold leap from the GT200, as represented by the GeForce GTX 280.
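For anyone who wants to check our math, the arithmetic behind those throughput guesses can be sketched in a few lines of Python. The 1500MHz clock is our own guess from above, not an Nvidia figure, and the per-clock rates are assumptions: a multiply-add counted as two floating-point operations per CUDA core per clock for single precision, and half that rate for double precision. The function name `peak_gflops` is ours, purely for illustration.

```python
# Back-of-the-envelope peak throughput, under the assumptions stated above.

FERMI_CUDA_CORES = 512        # from Nvidia's disclosure
ASSUMED_CLOCK_MHZ = 1500      # our speculative shader clock, NOT an announced spec
GTX_280_DP_GFLOPS = 78        # from the table above


def peak_gflops(cores: int, clock_mhz: float, flops_per_core_per_clock: float) -> float:
    """Peak theoretical GFLOPS = cores x clock (MHz) x flops per core per clock / 1000."""
    return cores * clock_mhz * flops_per_core_per_clock / 1000.0


# Assumption: one multiply-add per clock counts as 2 flops in single precision.
sp_gflops = peak_gflops(FERMI_CUDA_CORES, ASSUMED_CLOCK_MHZ, 2)

# Assumption: double precision runs at half the single-precision rate.
dp_gflops = peak_gflops(FERMI_CUDA_CORES, ASSUMED_CLOCK_MHZ, 1)

print(f"Single precision: {sp_gflops:.0f} GFLOPS")
print(f"Double precision: {dp_gflops:.0f} GFLOPS")
print(f"DP gain over GTX 280: {dp_gflops / GTX_280_DP_GFLOPS:.1f}x")
```

Change `ASSUMED_CLOCK_MHZ` and every number moves in lockstep, which is exactly why these estimates deserve a grain of salt.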

That’s not all. Assuming Nvidia employs the same 4.8 Gbps data rate for GDDR5 memory that AMD has for Cypress, Fermi’s peak memory bandwidth should be about 230 GB/s, a full 50% higher than the Radeon HD 5870, which has a total memory bus width of 256 bits.
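The bandwidth estimate is simple enough to sketch as well. The 4.8 Gbps per-pin data rate is our assumption for Fermi (borrowed from Cypress), and `peak_bandwidth_gb_s` is a hypothetical helper name, not anything from a vendor toolkit.

```python
# Peak memory bandwidth = bus width (bits) x per-pin data rate (Gbps) / 8 bits per byte.

ASSUMED_DATA_RATE_GBPS = 4.8   # Cypress's GDDR5 rate; only an assumption for Fermi


def peak_bandwidth_gb_s(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Return peak theoretical bandwidth in GB/s."""
    return bus_width_bits * data_rate_gbps / 8


fermi_bus_bits = 6 * 64        # six 64-bit memory interfaces
cypress_bus_bits = 256

fermi_bw = peak_bandwidth_gb_s(fermi_bus_bits, ASSUMED_DATA_RATE_GBPS)
cypress_bw = peak_bandwidth_gb_s(cypress_bus_bits, ASSUMED_DATA_RATE_GBPS)

print(f"Fermi:   {fermi_bus_bits}-bit bus, {fermi_bw:.1f} GB/s")
print(f"Cypress: {cypress_bus_bits}-bit bus, {cypress_bw:.1f} GB/s")
print(f"Ratio:   {fermi_bw / cypress_bw:.2f}x")
```

At an equal data rate, the ratio is just the ratio of the bus widths, 384 to 256.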

All of this speculation, of course, is a total flight of fancy, and I’ve probably given some folks at Nvidia minor heart palpitations by opening with such madness. A bump up or down here or there in clock speed could have major consequences in a chip that involves this much parallelism. Not only that, but peak theoretical gigaFLOPS numbers are increasingly less useful as a predictor of performance for a variety of reasons, including scheduling complexities and differences in chip capabilities. Indeed, as we’ll soon see, the Fermi architecture is aimed at computing more precisely and efficiently, not just delivering raw FLOPS.

So you’ll want to stow your tray tables and put your seat backs in an upright and locked position as this flight of fancy comes in to land. We would also like to know, of course, how large a chip Fermi might turn out to be, because that will also tell us something about how expensive it might be to produce. Nvidia doesn’t like to talk about die sizes, but it says straightforwardly that Fermi is comprised of an estimated 3 billion transistors. By contrast, AMD estimates Cypress at about 2.15 billion transistors, with a die area of 334 mm². We’ve long suspected that the methods of counting transistors at AMD and Nvidia aren’t the same, but set that aside for a moment, along with your basic faculties for logic and reason and any other reservations you may have. If Fermi is made using the same 40-nm fab process as Cypress, and assuming the transistor density is more or less similar—and maybe we’ll throw in an estimate from the Congressional Budget Office, just to make it sound official—then a Fermi chip should be close to 467 mm².

That’s considerably larger than Cypress, by roughly 40%, but is in keeping with its advantages in DP compute performance and memory bandwidth. That also seems like a sensible estimate in light of Fermi’s two additional memory interfaces, which will help dictate the size of the chip. Somewhat surprisingly, that also means Fermi may turn out to be a little bit smaller than the 55-nm GT200b, since the best estimates place the GT200b at just under 500 mm². Nvidia would appear to have continued down the path of building relatively large high-end chips compared to the competition’s slimmed-down approach, but Fermi seems unlikely to push the envelope on size quite like the original 65-nm GT200 did.
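The die-size guess is nothing more than a linear scaling of Cypress, which can be sketched as follows. It leans on two shaky assumptions already flagged above: that both firms count transistors the same way, and that transistor density on the 40-nm process is the same for both chips. The helper name `estimate_die_area_mm2` is ours.

```python
# Scale a known chip's die area by transistor count, assuming equal density.

CYPRESS_TRANSISTORS = 2.15e9   # AMD's estimate
CYPRESS_DIE_MM2 = 334.0        # AMD's stated die area
FERMI_TRANSISTORS = 3.0e9      # Nvidia's estimate


def estimate_die_area_mm2(transistors: float,
                          ref_transistors: float,
                          ref_area_mm2: float) -> float:
    """Assumption: area grows linearly with transistor count at fixed density."""
    return ref_area_mm2 * transistors / ref_transistors


fermi_area = estimate_die_area_mm2(FERMI_TRANSISTORS,
                                   CYPRESS_TRANSISTORS,
                                   CYPRESS_DIE_MM2)
print(f"Estimated Fermi die area: {fermi_area:.0f} mm^2")
```

Run as written, this lands a hair above 466 mm², within a square millimeter or so of the figure quoted above. Real chips do not scale this neatly, since memory interfaces and I/O pads take up area that has little to do with transistor count.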

Then again, I could be totally wrong on this. We should have more precise answers to these questions soon enough. For now, let’s move on to what we do know about Nvidia’s new architecture.

Better scheduling, faster switching
Like most major PC processors these days, Fermi hasn’t been entirely re-architected fresh from a clean sheet of paper; it is an incremental enhancement of prior Nvidia GPU architectures that traces its roots two major generations back, to the G80. Yet in the context of this continuity, Fermi brings radical change on a number of fronts, thanks to revisions to nearly every functional unit in the chip.

Many of the changes, especially the ones Nvidia is talking about at present, are directed toward improving the GPU’s suitability and performance for non-graphics applications. Indeed, Nvidia has invested tremendous amounts in building a software infrastructure for CUDA and in engaging with its customers, and it claims quite a few of the tweaks in this architecture were inspired by that experience. There’s much to cover here, and I’ve tried to organize it in a logical manner, but that means some key parts of the architecture won’t be addressed immediately.

We’ll start with an important, mysterious, and sometimes overlooked portion of a modern GPU: the primary scheduler, which Nvidia has too-cleverly named the “GigaThread” scheduler in this chip. Threads are bunched into groups, called “warps” in Nvidia’s lexicon, and are managed hierarchically in Fermi. This main scheduler hands off blocks of threads to the streaming multiprocessors, which then handle finer-grained scheduling for themselves. Fermi has two key improvements in its scheduling capabilities.

Serial versus concurrent kernel execution. Source: Nvidia.

One is the ability to run multiple, independent “kernels” or small programs on different thread groups simultaneously. Although graphics tends to involve very large batches of things like pixels, other applications may not happen on such a grand scale. Indeed, Nvidia admits that some kernels may operate on data grids smaller than a GPU like Fermi, as illustrated in the diagram above. Some of the jobs are smaller than the GPU’s width, so a portion of the chip sits idle as the rest processes each kernel. Fermi avoids this inefficiency by executing up to 16 different kernels concurrently, including multiple kernels on the same SM. The limitation here is that the different kernels must come from the same CUDA context—so the GPU could process, say, multiple PhysX solvers at once, if needed, but it could not intermix PhysX with OpenCL.
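To get a feel for why concurrent kernels matter, here is a toy model in Python. It is emphatically not Nvidia's GigaThread algorithm, whose details have not been disclosed. The kernel sizes and durations are invented, each kernel is assumed to occupy whole SMs (the real chip can also share an SM between kernels), and kernels launch in order as soon as enough SMs are free.

```python
import heapq
from dataclasses import dataclass

NUM_SMS = 16           # SM count, from the article
MAX_CONCURRENT = 16    # concurrent-kernel limit, from the article


@dataclass
class Kernel:
    name: str
    sms_needed: int    # hypothetical: SMs this kernel's grid can keep busy
    duration: int      # hypothetical: run time in arbitrary units


def serial_time(kernels):
    """One kernel at a time: total time is the sum of the durations."""
    return sum(k.duration for k in kernels)


def concurrent_time(kernels, num_sms=NUM_SMS, max_concurrent=MAX_CONCURRENT):
    """Launch kernels in order, each as soon as enough SMs are free."""
    running = []               # min-heap of (finish_time, sms_held)
    now, free = 0, num_sms
    for k in kernels:
        if not 1 <= k.sms_needed <= num_sms:
            raise ValueError(f"{k.name} cannot fit on {num_sms} SMs")
        # Wait for earlier kernels to retire until this one fits.
        while free < k.sms_needed or len(running) >= max_concurrent:
            finish, sms = heapq.heappop(running)
            now = max(now, finish)
            free += sms
        heapq.heappush(running, (now + k.duration, k.sms_needed))
        free -= k.sms_needed
    # The batch is done when the last running kernel finishes.
    return max([now] + [finish for finish, _ in running])


# Invented workload: three small kernels, one full-width kernel, one medium one.
kernels = [
    Kernel("A", sms_needed=8, duration=4),
    Kernel("B", sms_needed=4, duration=4),
    Kernel("C", sms_needed=4, duration=4),
    Kernel("D", sms_needed=16, duration=2),
    Kernel("E", sms_needed=8, duration=3),
]

print("serial:    ", serial_time(kernels))
print("concurrent:", concurrent_time(kernels))
```

With this made-up workload, kernels A, B, and C share the chip instead of running back to back, so the batch finishes in roughly half the time. The gain depends entirely on how many kernels are narrower than the GPU.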

To tackle that latter sort of problem, Fermi has much faster context switching, as well. Nvidia claims context switching is ten times the speed it was on GT200, as low as 10 to 20 microseconds. Among other things, intermingling GPU computing with graphics ought to be much faster as a result.

(Incidentally, AMD tells us its Cypress chip can also run multiple kernels concurrently on its different SIMDs. In fact, different kernels can be interleaved on one SIMD.)

Inside the new, wider SM
In many ways, the SM is the heart of Fermi. The SMs are capable of fetching instructions, so they are arguably the real “processing cores” on the GPU. Fermi has 16 of them, and they have quite a bit more internal parallelism than the processing cores on a CPU.

A block diagram of a single SM. Source: Nvidia.

That concept we mentioned of thread groups or warps is fundamental to the GPU’s operation. Warps are groups of threads handled in parallel by the GPU’s execution units. Nvidia has retained the same 32-thread width for warps in Fermi, but the SM now has two warp schedulers and instruction dispatch units.

The SM then has four main execution units. Two of them are 16-wide groups of scalar “CUDA cores,” in Nvidia’s parlance, and they’re helpfully labeled “Core” in the diagram on the right, mainly because I wasn’t given sufficient time with a paint program to blot out the labels. There’s also a 16-element-wide load/store unit and a four-wide group of special function units. The SFUs handle special types of math like transcendentals, and the number here is doubled from GT200, which had two per SM.

Fermi’s SM has a full crossbar between the two scheduler/dispatch blocks and these four execution units. Each scheduler/dispatch block can send a warp to any one of the four execution units in a given clock cycle, which makes Fermi a true dual-issue design, unlike GT200’s pseudo-dual-issue. The only exception here is when double-precision math is involved, as we’ll see.

The local data share in Fermi’s SM is larger, as well, up from 16KB in GT200 to 64KB here. This data share is also considerably smarter, for reasons we’ll explain shortly.

A single “CUDA core.” Source: Nvidia.

First, though, let’s take a quick detour into the so-called “CUDA core.” Each of these scalar execution resources has separate floating-point and integer data paths. The integer unit stands alone, no longer merged with the MAD unit as it was on prior designs. And each floating-point unit is now capable of producing IEEE 754-2008-compliant double-precision FP results in two clock cycles, or half the performance of single-precision math. That’s a huge step up from the GT200’s lone DP unit per SM—hence our estimate of a ten-fold increase in DP performance. Again, incorporating double-precision capability on this scale is quite a commitment from Nvidia, since such precision is generally superfluous for real-time graphics and really only useful for other forms of GPU computing.

I’d love to tell you the depth of these pipelines, but Nvidia refuses to disclose it. We could speculate, but we’ve probably done enough of that for one day already.

Fermi maintains Nvidia’s underlying computational paradigm, which the firm has labeled SIMT, for single instruction, multiple thread. Each thread in a warp executes in sequential fashion on a “CUDA core,” while 15 others do the same in parallel. For graphics, as I understand it, each pixel is treated as a thread, and pixel color components are processed serially: red, green, blue, and alpha. Since warps are 32 threads wide, warp operations will take a minimum of two clock cycles on Fermi.

Thanks to the dual scheduler/issue blocks, Fermi can occupy both 16-wide groups of CUDA cores with separate warps via dual issue. What’s more, each SM can track a total of 48 warps simultaneously and schedule them pretty freely in intermixed fashion, switching between warps at will from one cycle to the next. Obviously, this should be a very effective means of keeping the execution units busy, even if some of the warps must wait on memory accesses, because many other warps are available to run. To give you a sense of the scale involved, consider that 32 threads times 48 warps across 16 SMs adds up to 24,576 concurrent threads in flight at once on a single chip.
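The numbers in the last two paragraphs fall out of a few constants, sketched here in Python. All four values come from Nvidia's disclosure as described above; the variable names are ours.

```python
# Warp and thread bookkeeping for a single Fermi chip.

WARP_SIZE = 32          # threads per warp
CORE_GROUP_WIDTH = 16   # CUDA cores in each of the SM's two groups
WARPS_PER_SM = 48       # warps each SM can track at once
NUM_SMS = 16

# A 32-thread warp issued to a 16-wide group needs two passes.
cycles_per_warp_instruction = WARP_SIZE // CORE_GROUP_WIDTH

threads_per_sm = WARP_SIZE * WARPS_PER_SM
threads_in_flight = threads_per_sm * NUM_SMS

print(f"Clock cycles per warp instruction: {cycles_per_warp_instruction}")
print(f"Threads tracked per SM:            {threads_per_sm}")
print(f"Threads in flight per chip:        {threads_in_flight}")
```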

Enhanced precision and programmability
Fermi incorporates a number of provisions for higher mathematical precision, including support for a fused multiply-add (FMA) operation with both single- and double-precision math. FMA improves precision by avoiding rounding between the multiply and add operations, while storing a much higher precision intermediate result. Fermi is like AMD’s Cypress chip in this regard, and both claim compliance with the IEEE 754-2008 standard. Also like Cypress is Fermi’s ability to support denorms at full speed, with gradual underflow for accurate representation of numbers approaching zero.
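The effect of skipping that intermediate rounding step is easy to demonstrate, even without a GPU. The sketch below emulates a fused multiply-add on the CPU in double precision by computing the exact result with Python's `fractions` module and rounding once at the end. It illustrates the idea only; it says nothing about how Fermi's hardware implements the operation, and the input values are chosen by us to make the difference visible.

```python
from fractions import Fraction


def multiply_then_add(a: float, b: float, c: float) -> float:
    """Separate multiply and add: the product is rounded, then the sum is rounded."""
    return a * b + c


def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Emulated FMA: compute a*b + c exactly, then round once to a double."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


# Inputs chosen so the exact product, 1 - 2**-60, is not representable
# as a double and rounds to 1.0 when the multiply stands alone.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

unfused = multiply_then_add(a, b, c)
fused = fused_multiply_add(a, b, c)

print("multiply, round, add:", unfused)
print("fused multiply-add:  ", fused)
```

The unfused version loses the answer entirely and returns zero, while the fused version returns the exact result, -2^-60. Cancellation cases like this one are where FMA earns its keep.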

Fermi’s native instruction set has been extended in a number of other ways, as well, with hardware support for both OpenCL and DirectCompute. These changes have prompted an update to PTX, the ISA Nvidia has created for CUDA compute apps. PTX is a low-level ISA, but it’s not quite machine level; there’s still a level of driver translation beneath that. CUDA applications can be compiled to PTX, though, and it’s sufficiently close to the metal to require an update in this case.

Nvidia hasn’t stopped at taking care of OpenCL and DirectCompute, either. Among the changes in PTX 2.0 is a 40-bit, 1TB unified address space. This single address space encompasses the per-thread, per-SM (or per block), and global memory spaces built into the CUDA programming model, with a single set of load and store instructions. These instructions support 64-bit addressing, offering headroom for the future. These changes, Nvidia contends, should allow C++ pointers to be handled correctly, and PTX 2.0 adds a number of other odds and ends to make C++ support feasible.

The memory hierarchy
As we’ve noted, each SM has 64KB of local SRAM associated with it. Interestingly, Fermi partitions this local storage between the traditional local data store and L1 cache, either as 16KB of shared memory and 48KB of cache or vice-versa, in a 48KB/16KB share/cache split. This mode can be set across the chip, and the chip must be idled to switch. The portion of local storage configured as cache functions as a real L1 cache, coherent per SM but not globally, befitting the CUDA programming model.

Backing up the L1 caches in Fermi is a 768KB L2 cache. This cache is fully coherent across the chip and connected to all of the SMs. All memory accesses go through this cache, and the chip will go to DRAM in the event of a cache miss. Thus, this cache serves as a high-performance global data share. Both the L1 and L2 caches support multiple write policies, including write-back and write-through.

The L2 cache could prove particularly helpful when threads from multiple SMs happen to be accessing the same data, in which case the cache can serve to amplify the tremendous bandwidth available in a streaming compute architecture like this one. Nvidia cites several examples of algorithms that should benefit from caching due to their irregular and unpredictable memory access patterns, and they span the range from consumer applications to high-performance computing. Among them: ray tracing, physics kernels, and sparse matrix multiply. Atomic operations should also be faster on Fermi—Nvidia estimates between five and 20 times better than GT200—in part thanks to the presence of the L2 cache. (Fermi has more hardware atomic units, as well.)

Additionally, the entire memory hierarchy, from the register file to the L1 and L2 caches to the six 64-bit memory controllers, is ECC protected. Robust ECC support is an obvious nod to the needs of large computing clusters like those used in the HPC market, and it’s another example of Nvidia dedicating transistors to compute-specific features. In fact, the chip’s architects allow that ECC support probably doesn’t make sense for the smaller GPUs that will no doubt be derived from Fermi and targeted at the consumer graphics market.

Fermi supports single-error correct, double-error detect ECC for both GDDR5 and DDR3 memory types. We don’t yet know what sort of error-correction scheme Nvidia has used, though. The firm refused to reveal whether the memory interfaces were 72 bits wide to support parity, noting only that the memory interfaces are “functionally 64 bits.” Fermi has true protection for soft errors in memory, though, so this is more than just the CRC-based error correction built into the GDDR5 transfer protocol.

We’ve already noted that Fermi’s virtual and physical address spaces are 40 bits, but the true physical limits for memory size with this chip will be dictated by the number of memory devices that can be attached. The practical limit will be 6GB with 2Gb memories and 12GB with 4Gb devices.
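Here is the capacity arithmetic as a short Python sketch. The 40-bit address width and the two capacity figures come from Nvidia. The device count of 24 is not something Nvidia has stated; it is simply the number that makes both of the capacities above work out, and we label it as an inference in the code.

```python
# Address-space size versus practical memory capacity.

ADDRESS_BITS = 40
address_space_bytes = 2 ** ADDRESS_BITS
address_space_tb = address_space_bytes // 1024 ** 4   # binary terabytes

# Inference, not an Nvidia figure: 6GB from 2Gb parts implies 6 * 8 / 2 devices.
INFERRED_DEVICE_COUNT = 24


def capacity_gb(devices: int, density_gbit: int) -> float:
    """Total capacity in gigabytes for a number of memory devices of a given density."""
    return devices * density_gbit / 8   # 8 bits per byte


print(f"Address space:         {address_space_tb} TB")
print(f"Capacity with 2Gb DRAM: {capacity_gb(INFERRED_DEVICE_COUNT, 2):.0f} GB")
print(f"Capacity with 4Gb DRAM: {capacity_gb(INFERRED_DEVICE_COUNT, 4):.0f} GB")
```

Either way, the practical limit sits far below what 40 bits can address, so the address space leaves plenty of headroom.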

Of course, GPUs must also communicate with the rest of the system. Fermi acknowledges that fact with a revamped interface to the host system that packs dedicated, independent engines for data transfer to and from the GPU. These allow for concurrent GPU-host and host-GPU data transfers, fully overlapped with CPU and GPU processing time.

What’s next?
Nvidia’s build-out of tools for CUDA software development continues, as well. This week at the GPU Technology Conference, Nvidia will unveil its Nexus development platform, with a Microsoft Visual Studio plug-in for CUDA pictured below. Fermi has full exception handling, which should make debugging with tools like these easier.

Nvidia’s investment in software tools for GPU computing clearly outclasses AMD’s, and it’s not really even close. Although this fact has prompted some talk of standards battles, I get the impression Nvidia’s primary interest is making sure every available avenue for programming its GPUs is well supported, whether it be PhysX and C for CUDA or OpenCL and DirectCompute.

That’s all part of a very intentional strategy of cultivating new markets in GPU computing, and the company expects imminent success on this front. In fact, the firm showed us its own estimates that place the total addressable market for GPU computing at just north of $1.1 billion, across traditional HPC markets, education, and defense. That is, I believe, for next year—2010. Those projections may be controversial in their optimism, but they reveal much about Nvidia’s motivations behind the Fermi architecture.

There are many things we still don’t know about Nvidia’s next GPU, including crucial information about its graphics features and likely performance. When we visited Nvidia earlier this month to talk about the GPU-compute aspects of the architecture, the first chips were going through bring-up. Depending on how that process goes, we could see shipping products some time later this year or not until well into next year, as I understand it.

We now have a sense that when Fermi arrives, it should at least match AMD’s Cypress in its support for the OpenCL and DirectCompute APIs, along with IEEE 754-2008-compliant mathematical precision. For many corners of the GPU computing world, though, Fermi may be well worth the wait, thanks to its likely superiority in terms of double-precision compute performance, memory bandwidth, caching, and ECC support—along with a combination of hardware hooks and software tools that should give Fermi unprecedented programmability for a GPU.

Let me suggest reading David Kanter’s piece on Fermi if you’d like more detail on the architecture.

Responses to “Nvidia’s ‘Fermi’ GPU architecture revealed”

  1. “#46, I don’t think it says much? What do you want to say? Care to elaborate?

    This only confirms “Fermi” was built for GPU-compute and lots of effort went into getting higher double precision ops faster and ECC (none of which helps with games/video/what consumers use GPU for). We already have lots of teasers on this front but still nothing on gaming front… ”

    stop the press, it will play games……..

    alas NOT DURING THE 2009 CHRISTMAS SEASON when sales really matter.

    you CAN’T buy it for your Christmas tree, as they don’t sell it, crazy Nvidia (and ATI/AMD for that matter) accountants not having something NEW ready and waiting for the most busy part of the holiday season…

    come the end of January/February we are all trying to pay off the overspend and thinking about the expensive warm and sunny seaside getaway….

    you lost your most critical time for NEW product availability, Christmas holiday season, shame…what were You thinking crazy executive.

  2. “Hopefully in a couple more generations OpenCL becomes the de-facto standard and APIs and IDEs exists for both architectures”

    and hopefully, AMD/ATI will finally have their OpenCL code actually using and taking advantage of their GPUs rather than just running the OpenCL on your CPU ONLY….

    hopefully too you will be able to get some OSS apps that use their ATI UVD chips in all your gfx cards sitting idle…

    hopefully the ATI (and Nvidia Encoding it must be said too) transcoding app AVIVO will be able to produce a REAL [email protected] Something even near as good a quality AVC/H.264 1080P Encode as your generic x264, and at least the same speed, not very likely though is it…

  3. You only came up with one application where Nvidia’s cards are faster, and then you say my statement is not true…

  4. Incorrect. Having more “cores” means nothing, if you can’t put them to good use and that’s inherent to the architecture itself. Just look at Folding@home performance, where a single GTX 280 computes more points per day, than a HD 4870 X2. There are several advantages in NVIDIA’s architectures for GPGPU applications, over ATI’s.

  5. Good point.

    AMD still poses a threat to Nvidia in the HPC market as AMD’s GPUs are more power efficient. Not to mention AMD’s GPUs feature more cores, which allow for greater parallelism.

  6. Well they already raised quite a few eyebrows. 4 of their GPUs, in a ~4000 euros system, was already faster than a multi-million dollar super computer (look for FASTRA), while consuming a lot less power and a lot less space. Cheaper and faster ? The combination of those words means $$$ for NVIDIA and cost reduction for every company that buys their products.

    And of course others are in the HPC market. It exists precisely because others had already ventured into it. The only real question is, will they be able to keep their current clients away from NVIDIA’s offerings and keep their business in these areas, intact ?

  7. And this would be time 13. It is a never ending circle of hate. What I do enjoy are these people claiming nvidia is finished….really? That is like saying Intel will go bankrupt if they put out one bad product now. Nvidia has so much money stored up (FAR more than AMD), nvidia can weather out one lousy season or worse, a full generation of products.

    The tables again will turn. They always do. In say five or eight years, AMD will be outperforming Intel, again.

    It is the circle of -[

  8. You’re right, the HPC market is very lucrative, but too bad Nvidia isn’t the only company after that market. Same goes for the mobile market.

    If Nvidia manages to knock out the other major companies in the HPC market or mobile market, then yes, the billions will come flying in.

  9. The problem with that one is that they did indeed see demos/benchmarks of Fermi’s computational power.

  10. Nor did I say it was. But the focus of NVIDIA’s GTC was indeed the HPC market and developers as a whole.
    As for where the big money is, I don’t think you know how profitable the HPC market is or can be.
    Obviously this doesn’t mean that NVIDIA is not giving importance to the consumer graphics market. Far from it. It’s still their biggest business. However, NVIDIA, for a while now actually, is widening their business area. They have done so with the handhelds (Tegra) and now they want a bigger presence in the HPC market, with Fermi and all its new features.

  11. Talk about a SERIOUS dose of epic ignorance! Have you even read the article?

    C++, OpenCL, and DirectCompute are all supported by the GT300…None of these force anyone to rely on a single manufacturer or source for hardware technology.

    Effectively, your 3dfx and Glide argument just got butt-f**ked three ways from Sunday.

  12. 4 years ago, Most PCs still had IGPs, I’m talking actual facts, not some trend I’ve seen in one PC shop.

  13. Today, yes, and I said as much (‘or’) but 4 years ago it was more common to see low-end cards which is what you disputed. *Now I guess whether you consider consumer or business PCs it skews it one way or another but I do recall seeing lots of low-end cards in consumer PCs 4-5 years ago.

  14. The vast majority of the PC market is IGPs, not discrete GPUs, and laptops make up a significant percentage of that.

  15. It’s not a fail. The vast vast majority of systems in the entire PC market, not enthusiast market, are sold with low-end graphics whether it be integrated or discrete.

  16. Now this is funny, I should definitely read the comments more often.

    Laptops outsell desktops and most laptops that Joe public see in the shops feature integrated graphics, just as the handful of tower, SFF and all-in-one’s mostly do.

    I was in an electronics shop just yesterday: 62 computers on display, only 2 towers with discrete graphics, 1 ati/amd the other nvidia.
    4 years ago nearly all PCs featured low end discrete cards; today integrated are enough for the majority, hell pick a number in the 90% plus area and you’re probably right.

    Neither nvidia or ati/amd are going to survive on just discrete sales so it doesn’t surprise me that nvidia is focusing on the HPC market strongly as well as the mobile computing market.

  17. Well yea I know that, but the HD2900 was on paper a great card, and the thing that f’d it up was its ROPs. Granted, AMD is using tech that was researched by ATI, but it’s also been there during the refining and tuning process, and it’s not out of the question to think that they could have had something to do with the turn around of their product lines.

  18. Where the hell did you get HPC from?
    GPGPU is not tied to HPC.

    Do you think Nvidia is pushing GT300 as a cGPU solely for the HPC market?
    You think CUDA was meant solely for the HPC market?
    No, they’re aiming at the desktop segment, which is where the money is at.

  19. lol.
    Windows 7 has nothing special to do with GPGPU apps, it can be done on XP or vista, or w/e OS.
    Like I said, they’re far from becoming*[

  20. Hehe, love the hate without even a coherent reason for it 🙂

    First, the HPC market is not “mainstream” nor was it ever intended to be. Second, the HPC market yields very high profits. Third, Fermi was created with lots of input from people in this market, so it will be used by them, especially when it consumes far less power than typical super computers, takes up less space, costs less and performs better. There’s really nothing against it in terms of appeal for those in this market.

    And what does this all mean for gamers ? Nothing, because these features are not being aimed at gamers. GTC was a conference for developers. Fermi based consumer cards, will most likely not have many of these features (ECC is a sure candidate to be removed from consumer oriented chips), among other things, which will make the chip smaller and cheaper.

    When graphics performance numbers arrive and we know exactly what the consumer chip will be, you can then make your conclusions of what it will be against the HD 5800s. Right now, all we know for sure, is that Fermi is a computation “monster”.

  21. This is HD4000s VS GTX200s all over again.

    I believe there is no way in hell GTX380 will compete with HD5870 in terms of price/performance.

    Nvidia can’t beat AMD right now, so they’re taking some shots at Intel with this GPGPU/cGPU crap.
    I call it crap because GPGPU applications won’t become mainstream anytime soon.
    That won’t stop Nvidia from shoving it down our throats with press releases, tech demos, etc…

  22. now THIS is a paper launch. A desperate attention grab basically saying “hey don’t forget us!”

  23. OK, if we skip all the “moral” implications ( i.e. whether he “should” have told us it was a mock-up or not), here is what I think is the better reason to mention these things: when you do a product demo you want EVERYTHING to go according to plan. Having all kinds of talk on the internet and even articles being published about whether it was fake and if anything was real is NOT what I think Nvidia would call “according to plan”. I’m satisfied with Nvidia’s explanation; I don’t have a personal stake in whether it’s part fake, all fake, or all real. I think that when you hold something up that looks darn real, people are going to, as you say, ‘take it at face value’ that it is real unless you tell them otherwise. If that was not the case, then nobody would be talking about it. The fact that there was such a stir proves the point, IMO.

  24. I’m not implying anything other than take things at face value and don’t ASSume anything so the fallacious ‘your statement implies the inverse’ argument doesn’t work. Would he have somehow held it differently if it wasn’t a mock-up? :p And what difference does it make whether it’s a mock-up or not? It’s not like it would change the existing production timeframe.

    I mean, great, people figured out it’s a mock-up. Whether he specifically called it a ‘mock-up’ or not isn’t a big deal, and it’s silly for all the geeks to get all worked up over it. But like I said, if he did clearly state it was a working sample then there’s a reason to get a little worked up.

  25. So you’re implying that he made it clear or that it was obvious from the start that it was a mock-up? No, he held that thing up like it was the real deal. Mock-ups are fine when people know they’re mock-ups.

  26. It is a non-working prototype used as a demo prop. Everybody in the industry does it with products that are in the pre-production stage. The most famous example would be the earlier Voodoo 5 6000s. 😉

    I don’t get why people are making a big deal of it. Nvidia is just playing damage control against AMD’s new offerings, because they know that this race’s outcome is a lot less clear this time around. Fermi is more than likely going to be slightly faster than the 5870 at gaming performance, but it will lose on performance per watt.

    However, Fermi trumps the 5870 at GPGPU-related tasks and embarrasses Larrabee. Nvidia is clearly playing to this by showing their prop as a “Tesla” rather than as a GeForce.

  27. EXACTLY, nobody freaks out when car companies display concept cars with no parts, or cars made of clay 😉

    Now, if nvidia sets a production date, and misses it or has more TSMC problems… then start the rumor wagon up and wave the flags..


  28. R&D for a card doesn’t start mere weeks or even months before its introduction. The HD3000 series was likely already well on its way and even the HD4000 series was at least started.

  29. It’s really hard to say actually. The HD2900 was the last thing the old ATI R&D’d, so if AMD hadn’t bought them up, who knows, they very well could have been hurt badly enough to go bankrupt.

  30. It’s called a mock-up. I guess people love getting all worked up over this stuff but unless he said ‘In my hand I hold an operating production card’ or some such…meh

  31. Did you look at the pictures on that link that has been very poorly translated by google translate? I translated the text manually myself:

    NVIDIA’s CEO Jen-Hsun Huang stood yesterday in front of the press, investors and analysts holding in his hand a Tesla graphics card that is based on a Fermi GPU built from 3 billion transistors. According to Huang the card is targeted primarily to GPGPU computing and GeForce cards targeted to consumers will be introduced later.

    A closer look at pictures of the graphics card published on the internet shows, however, that the Tesla graphics card he presented is very probably a fake. Listed below are the faults that have been spotted in the pictures and mentioned on various forums:

    1. The PCB has been sawed off from the back. On the backside you can see that the cutting line goes through the PCI Express powerconnector. Even all the stickers have been left on the cutting line, of which you can now only see half.
    2. The eight pin powerconnector has no soldering points on the PCB or fastening(?)* clips that go through the PCB. The PCB has been cut so, that all that’s left of the original PCI Express power connector is two soldering points.
    3. The six pin powerconnector has no soldering points on the PCB or fastening(?)* clips that go through the PCB.
    4. The second, further-back SLI connector is left under the cooling plate, so an SLI bridge doesn’t fit into its place.
    5. Just an observation, but there’s a point for another DVI connector, which has not been used.
    6. The cooler has been secured only through the backplate and not through holes in the PCB like it is normally.
    7. There is an empty cooler mounting hole on the PCB behind the six pin powerconnector
    8. There’s only plastic behind the top warm air exhaust holes and no channel through which air could come out.

    * These may be incorrect words but I did my best translating the text, it should still be fairly accurate.

    You can see different pictures of the card on the PC Watch site, at least, here: §[<<]§ There are most likely more pictures of it on other sites, but I'm too busy to dig for more.

    In my opinion the card that Jen-Hsun Huang introduced is indeed fake. It's a shame they have to go to such extremes and try to fool the press, their own investors and analysts by doing so. The pictures have also led people to question whether the demos that were supposedly run on Fermi were actually run on something else.

    In any case I hope Nvidia gets Fermi out as soon as possible, because lack of competition is never good, even though I'm glad if AMD can make some money with the HD5000 series.

    EDIT: fixed some chapter spacing.

  32. Oh my Gord, I think you’re right. Even more, I think CHARLIE is right! Look at that picture:
    §[<<]§ It's quite clear that the backplate partially blocks an SLI connector. No big deal. I could write this off, but the hits keep coming. The six and eight pin power connectors clearly don't even sort-of line up with the solder points, and he's right, there are two bar code stickers cut off part way down! I think that's a surgically altered GTX285!!!

  33. You speak like a true non-developer…

    How does the fact that this architecture is very programmable make it “proprietary”? Far from it. NVIDIA supports all standards and the developer will choose which ones he/she wants to use. It’s EXTREMELY flexible and any developer appreciates that. Not to mention being able to use it for C++. Now that’s awesome.

  34. you know what? no matter how great it is, no matter how great the tools are, developers do not want to be stuck to a single manufacturer. this is 3dfx all over again. 3dfx didn’t want to get into a hardware war so they thought if they designed an excellent platform with great tools that everybody would go with ease of programming, etc. Glide was great. but then all the major developers went agnostic and 3dfx and glide died.

  35. Hi, snakeoil. Welcome back. What name are you going to use after this one gets banned?

  36. i don’t give a rat’s ass about science, scientists should buy their own computers.
    they are probably making a bomb, and you are helping them.

  37. This dx11 gen of hardware is where it starts to be feasible/compelling…if they want to spend the dev money. I’d guess good threading for new CPUs is much cheaper/safer and more free RAM.

    I’d look to an upstart with a new niche and fresh new code base to use GPUs well…like Z-Brush and Mudbox did. (I don’t know CAD.) There were some small vendors with real-time GPU anim/render workflows at siggraph. But, they are stuck with essentially game tricks/limitations on dx9/10 hw.

  38. It used to be that the GPU supplemented the CPU and enhanced a computer’s capabilities. It’s starting to seem that the CPU might be becoming just a way to support the GPU.

    Can’t wait to see how this baby folds!

  39. Let me quote the easily-accessible Wikipedia for ya.

    “An investment is a choice by an individual or an organization such as a pension fund, after at least some careful analysis or thought, to place or lend money in a vehicle (e.g. property, stock securities, bonds) that has sufficiently low risk and provides the possibility of generating returns over a period of time.[3] ”

    There are a few situations in which a video card is an investment. I once bought a $130 X800XT PE and sold it for $200 used lol. Before that I did the same thing with an old Quantum3D Obsidian that I found mis-labeled on eBay. A Voodoo5 6000 is another nice one because it’s worth a lot of money today.

    Usually a vid card needs to be rare and build value with a group of people or you need to figure out how to get something way below the typical price and then resell it. Otherwise video cards are definitely not a good investment regardless of how much you enjoy the result. They have a *[

  40. I think there’s arguably a monetary logic to this, too. One $300 card or 3 $150 cards?

    OK, so with the stagnation of PC games at the moment, that may not be entirely fair, but it would apply to things like the aforementioned knives (have 3 $50 knives that break in the time it takes one $100 knife to break, or what have you).

  41. I’m still waiting on CAD programs to stop using the CPU to generate display output (I’m talking about your software, Autodesk).

  42. Wanna throw any more mud in the water? The conversation was gross margin, but we could throw all kinds of other topics in there that have nothing to do with the original topic: total revenue, profitability, the price of rice in China, who’s on first, the color of Obama’s underwear….

  43. I see you’ve drunk the koolaid too and yes I understand what he meant but that doesn’t change my position. I agree that consumer goods can be thought of as good or even smart *purchases* for the reasons you stated but they are not *investments* – it just irks me when I read people say depreciating value goods are investments. Some might say it’s just semantics but I say it’s proper use of vocabulary without trying to put a spin on things, it’s also sort of the Rich Dad Poor Dad way of thinking of liabilities versus assets. To use the knives example for whatever reason that came up, sure, a great set of knives that lasts a lifetime is a good *purchase* but it’s not an investment. You guys could try looking up various meanings of investment for starters 🙂

  44. There go the price wars. That sucks.

    Gonna be a looooong wait for that 5850 to drop below $200.

    AMD puts out a new complete line of cards and rakes in the dough as their prices are v-e-r-y slowly pushed down by Nvidia.

    On the other hand if it helps their cpu business survive till Bulldozer, that’s all to the good.

    Nothing worse for us than no real competition in the cpu or gpu markets.

    On the third hand, the coming 28/32nm cpu’s and gpu’s will surely be my last upgrades. At some point it’s going to become silly to upgrade a desktop machine any further.

  45. Investments don’t necessarily provide monetary gain. An investment can provide a gain of any kind to a person. In this case, Meadows is saying that a good graphics card provides him with great joy over a long period of time.

    No, he probably won’t make money from it, and yes, it is a consumer good, but if a $300 graphics card gives him more than $300 of pleasure over the time he uses it, well then it’s a good investment. He’s not tricking himself, he’s just determining his best option.

  46. I’m not sure what you mean by stress-relief gear. But it’s great to see that marketers have convinced consumers that purchases which provide no return can be considered investments by twisting the meaning of the word. We need more of you to prop up the economy 😀 so please keep ‘investing’ in consumer goods!

  47. I love it when consumers trick themselves into calling things like this ‘investments.’

  48. I think you’ll be happy if you search for ‘CUBLAS’ (stands for Cuda-BLAS). It’s been around since CUDA 1.0, but they seem to add more and more functions to it with each update. It still doesn’t have all the BLAS functions, but most of the important ones.

  49. I’ve only done it for one real project, so don’t take my opinion too seriously. It’s not that bad if you can approach your data as a /[

  50. On the software end, it would be nice if:
    1. A BLAS and MKL were available.
    2. There isn’t extensive monkeying around with driver versions/ incompatibilities/ limited functionality. AMD’s MKL-gpu doesn’t accelerate much, and it’s difficult to get to work at all.

    My point was just that if nvidia is committed to exposing the gpu as a computational device, I don’t even care about it being faster than an AMD offering. It working nicely out of the box is worth everything to me.

  51. Pray tell us about AMD’s new architecture?

    RV870 is in my opinion either the last, or second last, in the R600 style generation from ATI/AMD. They can get easy improvements by shrinking and cutting and pasting cores, with tweaks and minor improvements generation to generation. RV970 may be the same, but I think they’ll be working on revamping the architecture at some point to improve double precision from 1/5 to 1/2 like NVIDIA (which may come at the cost of dropping theoretical SP performance, but probably attain higher real-world results via the design revamping). RV970 might improve it to 2/5’s performance if they keep the R600-style architecture for that generation, it’s a relatively cheap design option.

    Fermi is NVIDIA’s R600 equivalent. I think it’s the right thing for NVIDIA to do, but in its first incarnation it has significantly less SP power on paper (although the DP power is impressive). Of course it will be released opposite a HD5890 that does 3 TFLOPS in SP (and hence 600 GFLOPS in DP). I think that HD5000 series GPUs for consumers are the obvious choice for the next six to nine months as a result of ATI’s design being less oriented to GPGPU.

    I think NVIDIA will sell loads at massive mark-ups for GPGPU (with ECC, etc) for supercomputers. 750+ DP GFLOPS in a single chip means 3 TFLOPS in 1U, over 100 TFLOPS in a single rack, and tens of PFLOPS in a small datacentre.
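The density figures in the comment above can be checked with a little arithmetic. The sketch below is only a back-of-the-envelope model: the 750 DP GFLOPS per chip, the 3 TFLOPS SP figure and the 1/5 DP-to-SP ratio come from the comment, while four GPUs per 1U and a 42U rack are assumptions chosen because they reproduce the quoted "3 TFLOPS in 1U" and "over 100 TFLOPS in a single rack".

```python
import math

# Figure quoted in the comment.
DP_GFLOPS_PER_CHIP = 750

# Assumptions (not stated in the comment): packaging density.
GPUS_PER_1U = 4
UNITS_PER_RACK = 42


def dp_from_sp(sp_gflops, numerator, denominator):
    """Double-precision rate implied by an SP rate and a DP:SP ratio."""
    return sp_gflops * numerator / denominator


def tflops_per_1u(gflops_per_chip, gpus_per_1u):
    """Aggregate DP TFLOPS of one 1U chassis."""
    return gflops_per_chip * gpus_per_1u / 1000


def tflops_per_rack(gflops_per_chip, gpus_per_1u, units_per_rack):
    """Aggregate DP TFLOPS of a rack filled with such chassis."""
    return tflops_per_1u(gflops_per_chip, gpus_per_1u) * units_per_rack


def racks_for(target_pflops, rack_tflops):
    """Whole racks needed to reach a target measured in PFLOPS."""
    return math.ceil(target_pflops * 1000 / rack_tflops)


if __name__ == "__main__":
    rack = tflops_per_rack(DP_GFLOPS_PER_CHIP, GPUS_PER_1U, UNITS_PER_RACK)
    print("HD5890-style DP estimate:", dp_from_sp(3000, 1, 5), "GFLOPS")
    print("Per 1U:", tflops_per_1u(DP_GFLOPS_PER_CHIP, GPUS_PER_1U), "TFLOPS")
    print("Per rack:", rack, "TFLOPS")
    print("Racks for 10 PFLOPS:", racks_for(10, rack))
```

These are peak, on-paper numbers; they say nothing about power, cooling or sustained throughput.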

  52. MadManOriginal don’t say stuff like that …
    Prime1 can’t be and never will be wrong … even though i never talked about profits but margins.
    But now that we’re talking about profits, nVidia made a loss in Q1 & Q2 (my first point stays valid).

  53. You program GPUs? What’s that like? More difficult than, say, parallel CPU programming? What tools do you use?

    I’d ask more, but I don’t want to scare you off! 😀

  54. The same happened when nVidia first released the G80, and ATI was left without a competent answer for many months. Did that drive them bankrupt? *[

  55. But you also want an investment like this one to last a while (unless you’re a company manager picking thousand dollar bills from your garden bushes daily).

  56. I’ve updated the article with a few clarifications and tweaks to the wording regarding the available cache policies and multiple kernel support. FYI.

  57. That’s how it was done with Cuda 1.0 and 2.0 (or at least that was one of the ways). But it looks like VS will get more goodies now.

  58. AMD’s new architecture is not designed with double precision floating point performance in mind. You might be able to encode 15 movies on it but that is not what HPC is about. Look it up if you don’t know what I or the article are talking about.

  59. You’re missing the point. They’re saying nVidia may not really care about the high end graphics anymore, especially since the average consumer is satisfied with what’s on consoles.

    By focusing on HPC, they can grow a MUCH higher margin market. The question is though, will that market end up bigger than the enthusiast video card market, or at the very least enough to make up for their losses to AMD in that market?

    I think that publishing info on ‘Fermi’ this long before it’s available is because they’re worried their HPC customers might look at Cypress as good enough, and not bother waiting. After all, apps written for OpenCL, or even directx compute, should be portable between vendors.

  60. I don’t think it says much? What do you want to say? Care to elaborate?

    This only confirms “Fermi” was built for GPU-compute and lots of effort went into getting higher double precision ops faster and ECC (none of which helps with games/video/what consumers use GPU for). We already have lots of teasers on this front but still nothing on gaming front…

  61. When ATI introduced tessellation engines and tried to get them into the DX 10 spec, Nvidia balked. They called DX 10.1 useless, yet they will soon introduce a GPU that supports it. Nvidia is just building on its GPGPU work; I don’t see the great visionary that you see. G200 was an inferior GPGPU chip to RV770; their latest will be marginally superior. Leapfrogging is part of this industry. Hopefully in a couple more generations OpenCL becomes the de facto standard and APIs and IDEs exist for both architectures.

  62. You’re right that the stack is a very powerful programming paradigm. However, you can _always_ emulate any behavior on a register machine architecture, you’d just need to manage the stack yourself (which would be unfeasible unless you could do it automatically, and in a tuned fashion).
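To illustrate the point about managing the stack yourself, here is a minimal sketch in Python rather than GPU code. It sums the values in a small binary tree twice: once with ordinary recursion, and once with a preallocated array plus a stack pointer, which is roughly what a compiler or programmer would have to do on hardware without a call stack. The tree layout, function names and the `capacity` parameter are all hypothetical choices made for the example.

```python
def sum_tree_recursive(node):
    """Reference version: relies on the language's call stack."""
    if node is None:
        return 0
    value, left, right = node
    return value + sum_tree_recursive(left) + sum_tree_recursive(right)


def sum_tree_explicit_stack(root, capacity=64):
    """Same traversal with a hand-managed, fixed-size stack."""
    stack = [None] * capacity  # preallocated storage, like a scratch buffer
    sp = 0                     # stack pointer: index of the next free slot
    total = 0

    if root is not None:
        stack[sp] = root
        sp += 1

    while sp > 0:
        sp -= 1                          # pop
        value, left, right = stack[sp]
        total += value
        for child in (left, right):
            if child is not None:
                if sp == capacity:
                    raise OverflowError("explicit stack is full")
                stack[sp] = child        # push
                sp += 1
    return total


# Nodes are (value, left, right) tuples; None marks a missing child.
TREE = (1, (2, None, None), (3, (4, None, None), None))

if __name__ == "__main__":
    print(sum_tree_recursive(TREE), sum_tree_explicit_stack(TREE))
```

The fixed `capacity` is where the "tuned fashion" part bites: it has to be large enough for the deepest case, yet small enough to fit in whatever fast memory is available.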

  63. Being able to use one of the more popular IDEs in Visual Studio is going to be a great boon to getting more developers onboard, IMHO.

  64. Yet nVidia will be designed into the fastest supercomputer and ATI will not. That alone tells you where the industry sees nVidia vs ATI.

    Oak Ridge National Laboratory Looks to NVIDIA ‘Fermi’ Architecture for New Supercomputer

    Oak Ridge Supercomputer Targets NVIDIA GPU Computing Technology to Achieve Order of Magnitude Performance Over Today’s Fastest Supercomputer


  65. You’re right it’s exciting in a broader sense but that doesn’t excite me personally when there are only a few applications I’d think of running that even *could* benefit greatly from this. At some point there may be a killer app that puts this parallel power to use for large numbers of individuals, but usually you hear about such applications even while they’re still in the works.

  66. As long as you leave this out…

    “In the second quarter of 2009, AMD Product Company reported a non-GAAP net loss of $244 million and a non-GAAP operating loss of $205 million. ”

    Let me know when they actually make a profit.

  67. Prime1 i have!

    GAAP gross margin
    AMD: Q1 43% – Q2 37%
    nVidia: Q1 29% – Q2 20%

    Seems to me that you should not dance at all.

  68. Innovation doesn’t happen overnight, and it’s certainly an expensive endeavor. You usually don’t get things right the first time. If you remember, the only reason we have the 4850/4870 and 5850/5870 right now is because AMD started with their new design philosophy well before that. Their first attempt, R600, failed miserably. Then they kept refining it, and here we are today with something great.

    I wouldn’t be so quick to put down nvidia for what they’re doing. They are serious about expanding into other markets and growing their business. I doubt you’ll see ATI going in this direction as it would potentially poach business away from the AMD side of things.

  69. Driving force, yes, but I think his point was that this somehow sets this particular GPU apart.

    It’s really just the way things are going to be across the board. Nvidia is just talking it up like it has some sort of special application to them, just because their name is attached to those things all over the news.

  70. Seems to me that the 50% performance improvement ATI made over the 4890 won’t be enough … IMO far from it. But i guess it will be the same story all over again. Close enough in performance to hurt nVidia and if you have seen nVidia’s gross margin lately, you know they’re hurting badly.

  71. this is a catastrophe.

    really, nvidia is finished.

    this means four or five months of amd ruling.

    seriously nvidia has nothing, nothing,zero, nada,

    kaput nvidia is over.

    oh bankruptcy…. why?..

  72. It reminds me very much of the R580 and the R600, I was thinking that on the way home tonight.

    This says a *[

  73. I’m imagining a phone call in NV hq along the lines.

    “So what did AMD/ATI release?”
    “uh huh”
    “So their new GPU was pretty fast?”
    “Priced at… WHAT!?”
    “Quick, give something to the press to keep us on the map!”

  74. GLSL for OpenGL basically evolved from NVIDIA’s Cg IIRC. And I think they have put a lot of their CUDA stuff into the OpenCL specs. So they’re still a huge driving force for all that stuff.

  75. My bad. Apparently some kind of recursion will be supported but i can’t find any details. My guess is …it’ll be too slow to use it.

  76. Except CUDA is Nvidia’s, while there are other open formats for such things that aren’t even limited to just GPUs, much less one company. CUDA is destined to go just as far as PhysX, which is basically nowhere. You may see realistic examples of it in action here and there, but by the time it’s actually relevant, it will be steamrolled by something everyone can use, and that everyone actually wants to use.

    This is lipstick on a pig, if you ask me.

  77. You are completely ignoring the fact that rv870 is also a computing monster. I’ve seen some tests where it can transcode 3 times faster than gtx295. Maybe an article about this and other applications that use the GPU?
    TR? Thank you 🙂

  78. It looks like Nvidia is pulling another NV30 or at least an R600: a very ambitious design plagued by delays. They are trying to play damage control here by leaking out juicy specs that look good on paper.

    I suspect “Fermi” will not be much faster than the current Evergreens at gaming performance. It will shine far more at “GPGPU” tasks. It would be ironic if the “high-end” parts of “Fermi” sold better under the “Tesla” and “Quadro FX” lines than as “Geforce GTX 380s”.

  79. I could imagine nvidia ceding the high performance gpu crown to amd just to gain better performance in the HPC market. All they need to be is competitive with the HD 5870. The GPGPU market is a considerably higher margin, and possibly higher volume (in the future if not now) market than the high end gpu market. Intel’s move into the gpu/ massively parallel cpu arena has definitely put pressure on nvidia, especially since they lack a cpu division. This is a good move for nvidia.

  80. Nope. Not even in theory. You can’t do recursive functions in CUDA on any GPU (AMD or NVIDIA…Larrabee is a different story). Also most of the code running in an OS is sequential so it will run on one core (a CUDA core, if you like their naming schemes). The branch prediction algorithms present on this chip (or any other GPU) are almost like they’re not there (compared to a modern x86 CPU).
    So basically no, no OS running on a GPU anytime soon.

  81. It’s rather disappointing to see nVidia release this sort of PR long before they’ve got working silicon that’s going to yield acceptably for them. The 58xx series release must have them buzzed a great deal to release this speculative sort of “Here’s what’s coming from us, sort of and maybe, sometime” PR, especially right on the heels of the 58xx-series launch.

    But I think it clearly tells us just where nVidia is in this horse race, which I imagine is an unintended consequence. I mean, I don’t think this info will derail a single 58xx sale, but I do think it tells us that nVidia is likely further behind here than we might have thought otherwise. It’s all very R600-ish, isn’t it? This PR release appears to have been rushed without a lot of thought.

  82. Wow, they’re “sharing” info and hyping up their GT300 chip? They’ve been so secretive in the past years. Good timing too. Since 5xxx series just kicked nVidia’s ass, they might as well try to get some attention and keep people from buying AMD until their GT300 arrives.

  83. Do you use a version of BLAS? If so, I think it will be exposed as cleanly as you want. It can be a little tricky to minimize the ratio of memory transfers-to-computations.
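The ratio mentioned above can be estimated before writing any GPU code. The sketch below is a simple counting model, not a measurement: it assumes double-precision data, square n x n operands, and that every operand crosses the host-to-device bus exactly once. The function names are made up for the example. It compares a matrix-matrix multiply (GEMM) with a matrix-vector multiply (GEMV).

```python
BYTES_PER_DOUBLE = 8


def gemm_intensity(n):
    """FLOPs per byte transferred for C = A * B with n x n matrices."""
    flops = 2 * n ** 3                            # n^3 multiply-add pairs
    bytes_moved = 3 * n * n * BYTES_PER_DOUBLE    # A and B in, C out
    return flops / bytes_moved


def gemv_intensity(n):
    """FLOPs per byte transferred for y = A * x with an n x n matrix."""
    flops = 2 * n * n                             # n^2 multiply-add pairs
    bytes_moved = (n * n + 2 * n) * BYTES_PER_DOUBLE  # A and x in, y out
    return flops / bytes_moved


if __name__ == "__main__":
    for n in (1200, 2400):
        print(n, gemm_intensity(n), round(gemv_intensity(n), 3))
```

Under this model GEMM's work per transferred byte grows linearly with n, while GEMV stays below 0.25 FLOPs per byte regardless of size. That is why offloading a single matrix-vector product rarely pays off unless the matrix already lives on the card.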

  84. Nvidia is doing an awesome thing here; they’re doing the right thing, the good thing, the exciting thing: they are trying as hard as they possibly can to grow the market, and I think we all need to distinguish this from trying to grow market share. The latter is trying to take away from other players, the former is trying to go places other players are not playing.

    GT300 looks almost guaranteed to be the most exciting non-CPU yet. I’m excited. I’m very much looking forward to seeing what this thing will be able to do and what sort of applications will be able to be written for it.

    It sounds like, in theory, you could build a fully functional computer without a “CPU”, just Nvidia’s piece of silicon.

  85. I’ll tell you, if this is really going to give me a ton more double precision computing power than any CPU I can afford, and various nvidia software trickeries expose this power cleanly in matlab (GPUmat?) and c++/fortran, I’d happily drop as much as I’d normally drop on a computer on a graphics card.

  86. There will be some waiting..

    “Then timing is just as valid, because while Fermi currently exists on paper, it’s not a product yet. Fermi is late. Clock speeds, configurations and price points have yet to be finalized. NVIDIA just recently got working chips back and it’s going to be at least two months before I see the first samples. Widespread availability won’t be until at least Q1 2010.
    I asked two people at NVIDIA why Fermi is late; NVIDIA’s VP of Product Marketing, Ujesh Desai and NVIDIA’s VP of GPU Engineering, Jonah Alben. Ujesh responded: because designing GPUs this big is “fucking hard”.”


  87. Interesting that they’d announce this today, what with the HD 5850 threatening to take all the headlines, that is…. 😉

  88. Mmm yes, quite.

    it would turn out to be funny if nvidia drops their 3xx series for this. Or if the fab problems they are having are because of all the effort going into this “Fermi”.