Exploring Nvidia’s Pascal architecture

Rys Sommefeldt works for Imagination Technologies and runs Beyond3D. He took us inside Nvidia’s Fermi architecture back in 2009, and now he’s back with a deep dive into what we know about Nvidia’s Pascal GPU so far.

Telling you that Nvidia recently announced its first Pascal GPU, the GP100, is probably a bit redundant. It’s been the talk of the PC technology space since Nvidia CEO Jen-Hsun Huang announced the GP100-powered Tesla P100 in his inimitable, best-keynote-giver-in-Silicon-Valley style during the first keynote of the company’s recent GPU Technology Conference (GTC) in sunny San Jose.

It feels like Pascal’s been a long time coming. In reality, we haven’t deviated too far from Nvidia’s typical modus operandi of a roadmap announcement at a GTC, followed by products a couple of years down the line. The company couldn’t have released the first Pascal chip much sooner, though, because of how big the die needed to be to make a generational leap in performance over prior Tesla products. Now that volume production of 16-nm transistors is possible, it’s finally time for the first big Pascal chip to arrive.

28-nm manufacturing has lasted a long time in discrete GPU land. AMD and Nvidia both skipped the 20-nm node at the various foundries because of its unsuitability for the needs of high-power semiconductor devices. Because of the long pause at 28 nanometers, people have been clamoring for the first products on newer production technologies to see what advancements they’d bring to the table. Volume manufacturing for TSMC’s 28-nm high-performance process started back in late 2011, remember!

Now that Pascal is here, at least in announcement form, I jumped at the chance to reprise my 2009 analysis of Nvidia’s Fermi architecture. Fermi was announced at GTC in September of that year, but the company mostly talked about it from the standpoint of its GPU compute potential. I took a look at that chip then, and made some guesses about what its features might mean for consumer graphics products. I’ll be performing a similar analysis this time around.

My task is a little different this time, though, because we were also told the basic graphics-focused makeup of GP100 at GTC. Thanks to those details, I don’t have to do too much speculation about the chip’s graphics features and risk getting some of them wrong, like I did with Fermi. However, reading the Pascal tea leaves leaves me wondering whether GP100 will actually ever be used in GeForce products.

Let’s start with a brief recap of the last generation to see where today’s chips ended up on 28nm before we jump into the new stuff. Be warned: if you’re not interested in the bigger building blocks of GPU design and lots of talk about how many of them are present, here be dragons. Still with me? Great, because some context and background always helps set the scene. Join us now on this weird journey through Blaise’s semiconductor namesake.

 

A recap of the Maxwell architecture

We were actually going to take you all the way back to Fermi here, but after collating all of the research to take that seven-year trip down memory lane, we realised that a backdrop of Maxwell and Maxwell 2 is enough. You see, Maxwell never really showed up in true Tesla form like GP100 has for Pascal. Even the biggest manifestation of the Maxwell 2 microarchitecture, GM200, made some design choices that were definitely focused on satisfying consumer GeForce customers, rather than the folks that might have wanted to buy it in Tesla form for HPC applications.

Key for those HPC customers is support for double-precision arithmetic, or FP64. FP64 has no real place in what you might call a true GPU, because of the nature of graphics rendering itself. That capability is needed for certain HPC applications and algorithms, though—especially those where a highly-parallel machine that looks a lot like a GPU is a good fit, and where the ratio of FP64 to lesser-precision computation strongly favours having a lot of FP64 performance baked into the design.

You’d expect an HPC-focused Maxwell to have at least a 1/3 FP64-to-FP32 throughput ratio like that of the big Kepler chip, GK110, that came before it. Instead, GM200 had almost the bare minimum of FP64 performance—1/32 of the FP32 rate—without cutting it out of the design altogether. We’ll circle back to that thought later. The rest of the Maxwell microarchitecture, especially in Maxwell 2, was typical of a graphics-focused design. It’s also typical of the way Nvidia has scaled out its designs in recent generations: from the building block of a streaming multiprocessor, or SM, upwards.

The Maxwell SM. Source: Nvidia

Nvidia groups a number of SMs in a structure that could stand on its own as a full GPU, and it calls those structures graphics processing clusters, or GPCs. Indeed, they do operate independently. A GPC has everything needed to go about the business of graphics rendering, including a full front-end with a rasterizer, the SMs that provide all of the GPC’s compute and texturing ability, the required fixed-function bits like schedulers and shared memory, and a connection to the outside world and memory through the company’s now-standard L2 cache hierarchy.

Maxwell GPCs contain four SMs. Each Maxwell SM is a collection of four 32-wide main scalar SIMD ALUs, each with its own scheduler. Each of the 32 lanes in the SIMD operates in unison, as you’d expect of a modern scalar SIMD design. Texturing hardware also comes along for the ride in the SM to let the GPU get nicer access to spatially coherent (and usually filtered) data. Normally, that data is used to render your games, but it can also do useful things for compute algorithms. Fusing off the texture hardware for HPC-focused designs doesn’t make too much sense—unless you’re trying to hide that the chip used to be a GPU, of course. Each Maxwell SM offers eight samples per clock of texturing ability.

The GM200 GPU. Source: Nvidia

GM200 uses six GPCs, so it has six front-ends, six rasterisers, six sets of back-ends and connections to the shared 3MB of L2 cache in its memory hierarchy, and a total of 24 SMs (and thus 24 times 4 times vec32 SIMDs, and 24 times 8 samples per clock of texturing capability) across the whole chip. With clock speeds of 1GHz or more in all of its shipping configurations, and speeds that are often even greater in its GeForce GTX 980 Ti form—especially the overclocked partner boards—it’s the most powerful single GPU that’s shipped to date.
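
Those totals are easy enough to sanity-check. Here’s a quick back-of-the-envelope sketch in Python, assuming a nominal 1GHz clock (shipping cards often boost higher) and counting a fused multiply-add as two FLOPs per lane per clock:

```python
# Rough peak figures for a fully enabled GM200, from the unit counts above.
gpcs, sms_per_gpc, simds_per_sm, lanes_per_simd = 6, 4, 4, 32
clock_hz = 1.0e9                                   # nominal 1GHz

sms = gpcs * sms_per_gpc                           # 24 SMs
fp32_lanes = sms * simds_per_sm * lanes_per_simd   # 3072 shader lanes
peak_fp32 = fp32_lanes * 2 * clock_hz              # FMA = 2 FLOPs per lane per clock
texels_per_clock = sms * 8                         # 8 texture samples per SM per clock

print(fp32_lanes, f"{peak_fp32 / 1e12:.1f} TFLOPS", texels_per_clock)
# 3072 6.1 TFLOPS 192
```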

If GM200 sounds big, that’s because it absolutely is. At just over 600mm², fabricated by TSMC on its 28-nm high-performance process technology, it’s pretty much the biggest GPU Nvidia could have made before tipping over the edge of the yield curve. Big GPUs lend themselves to decent yields because it’s easy to sell them in cut-down form. Even so, the yield still needs to be good enough that the configurations you can ship—with however many of the chip’s bits you’re able to turn on—turn a profit against the competitive landscape of the day.

So that’s our GP100 backdrop in a nutshell. What I’m trying to get at by painting yet another picture of the big Maxwell is that it’s mostly just a big consumer GPU, not an HPC part. Maxwell’s lack of FP64 performance hurts its usefulness in HPC applications, and Nvidia can’t ignore that forever. Intel is shipping its new Knights Landing (KNL) Xeon Phi now. That product is an FP64 beast. It’s also capable of tricks that other GPU-like designs can’t pull off, like booting an OS by itself. That’s because each of its SIMD vector units is managed by a set of decently-capable x86 cores.

Our Maxwell and GM200 recap highlights the fact that GP100 has its work cut out in a particular field: HPC. Let’s take a 10,000-foot view of how it’s been designed to tackle that market as an overall product before we dive into some of the details.

 

The GP100 GPU

At a high level, GP100 is still an “SMs in collections of GPCs” design, so we don’t have to develop a new understanding of how it works at the microarchitecture level—at least as far as the basics go. Nvidia has resurrected the concept of a texture-processing cluster, or TPC, as a way of grouping a pair of SMs, but we can mostly ignore that name for our purposes.

The GP100 SM. Source: Nvidia

A full, unfettered GP100 is a six-GPC design, and each of those GPCs contains 10 SMs. Nvidia announced that the first shipping product with the GP100, the Tesla P100, would have 56 of its SMs enabled. Nvidia is most likely disabling two TPCs in different GPCs to achieve that cut-down state, and it’s almost certainly doing so to improve yields.

A block diagram of the GP100 GPU. Source: Nvidia

That’s because GP100 is a whopping 610mm², and it’s produced by TSMC on its 16-nm FinFET Plus (16FF+) node. 16FF+ is definitely mature, but GP100 is easily the biggest and most complex design yet manufactured using that technology. Given the potential customers for the Tesla P100, you can bet that Nvidia would absolutely turn on all 60 SMs in GP100 if it could. I’m guessing that power usage isn’t a concern for GP100, really, so the reason behind the deactivated SMs has to be yield-related.

The Pascal SM in GP1xx is actually much smaller than the GM2xx SM for the main hardware. It’s just two 32-wide main SIMD ALUs this time, rather than four. There are also big changes afoot in this main ALU, but let’s hold that thought for the time being. Also along for the ride is a separate 16-wide FP64 ALU, giving the design “half-rate” FP64 throughput. Multiply out all of the numbers that describe the GP100 design, and you’ll see exactly what that rate ends up as: 5.3 TFLOPS of peak FP64 throughput. Good googly moogly. Most of the GPUs I work on for my day job at Imagination Technologies have around 1/10th of that throughput for FP32, and literally zero FP64 ability at all. If you’re an HPC person and your code needs FP64 performance to go fast, GP100 is your very best friend.
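
For the curious, here’s that multiplication written out as a quick Python sketch. The 1480MHz figure is the boost clock Nvidia quotes for the Tesla P100; treat it as an input assumption rather than something derived here:

```python
# Peak FP64 throughput for the Tesla P100 configuration of GP100.
enabled_sms = 56                # 60 SMs on the die, 56 turned on
fp64_lanes_per_sm = 2 * 16      # one 16-wide FP64 ALU per 32-wide FP32 ALU
boost_clock_hz = 1.48e9         # Nvidia's quoted P100 boost clock (assumed input)

peak_fp64 = enabled_sms * fp64_lanes_per_sm * 2 * boost_clock_hz  # FMA = 2 FLOPs
print(f"{peak_fp64 / 1e12:.1f} TFLOPS")  # ~5.3 TFLOPS
```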

Pascal has a familiar L1-and-shared-memory-into-L2 cache hierarchy, as we’ve seen on Kepler and Maxwell, and the L2 is 4MB in size on GP100. That changes the “L2-size-per-SM” ratio significantly compared to GM200 and Maxwell, and not in the bigger-is-better direction: GP100’s 56 enabled SMs share 4MB of L2, compared to the 24 SMs that share 3MB of L2 in GM200.

While there are half as many 32-wide ALUs per SM in GP100 as in GM200, there’s no reduction in the size of the register file (RF) that the SMs have access to. That gives GP100 twice the RF space per ALU compared to GM200. For certain classes of data-dense code, like the kind you tend to find in HPC applications, that’s a very welcome change in the new chip. As an aside, if you think the 4MB of L2 cache GP100 has is a lot of on-chip memory, there’s actually more than three times that amount in total RF space if you add it all up.
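
As a quick sanity check on that aside, here’s the arithmetic, using the 256KiB-per-SM register file figure that comes up in the recap below:

```python
# Total register file on GP100: 56 enabled SMs x 256KiB of RF per SM.
enabled_sms, rf_per_sm_kib = 56, 256
total_rf_kib = enabled_sms * rf_per_sm_kib
print(total_rf_kib, "KiB")   # 14336 KiB, i.e. ~14MB -- over three times the 4MB of L2
```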

Six GPCs made up of 10 SMs—and each of those GPCs with lots of welcome FP64 ALU performance—plus a large per-SM register file all want to be fed with a beefy memory subsystem to give a nice “bytes-per-FLOP” ratio: the metric that really matters for devices like this. To get there, Nvidia is using the second version of High Bandwidth Memory (called HBM2) for GP100. I’ll leave the gory details of that memory for later, but there’s a huge increase in external memory bandwidth for GP100 compared to what was possible in GM200 and other GPUs that relied on GDDR5. That’s even the case with the conservative clocks for the HBM2 configuration Nvidia has chosen to go with in GP100.

From an HPC standpoint, at least, we’re pretty much done with our high-level view of how GP100 is constructed. For a recap, the chip has six GPCs, with 10 SMs in each, and each SM gets 256KiB of register file to play with. Nvidia has turned off four SMs across the chip (two TPCs of two SMs each—one TPC for each of two unlucky GPCs). All of the enabled SMs share a 4MB L2 cache, which connects in turn to a very wide, high-throughput HBM2 memory system.

Let’s take a closer look at the SM to see what’s changed in the ALUs and how they interact with the register file. The changes help both HPC and graphics applications, so they’re particularly interesting.

 

GP100 and FP16 performance

The biggest change in the Pascal microarchitecture at the SM level is support for native FP16 (or half-precision) arithmetic. Rather than dedicate a separate ALU structure to FP16 like it does with FP64 hardware, Pascal runs FP16 arithmetic by cleverly reusing its FP32 hardware. It won’t be completely apparent how Pascal does this until the chip’s ISA is released, but we can take a guess.

Nvidia has disclosed that the hardware supports data packing and unpacking from the regular 32-bit wide registers, along with the required sub-addressing. Along with the huge RF we discussed earlier, it’s highly likely that GP100 splits each FP32 SIMD lane in the ALU into a “vec2” type of arrangement, and those vec2 FP16 instructions then address two halves of a single register in the ISA. This method is probably identical to how Nvidia supported FP16 in the Maxwell Tegra X1. If that’s the case, Pascal isn’t actually the first Nvidia design of the modern era to support native FP16, but it is the first design destined for a discrete GPU.
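
To make that packing idea concrete, here’s a minimal host-side sketch in Python (using NumPy) of two FP16 values living in the two halves of a single 32-bit, register-sized word. It only illustrates the bit-level layout we’re guessing at; how the SM actually splits its lanes and sub-addresses its registers is Nvidia’s to disclose:

```python
import numpy as np

# Two FP16 values packed into one 32-bit word, vec2-style (little-endian host).
pair = np.array([1.5, -2.25], dtype=np.float16)   # two 16-bit floats
packed = pair.view(np.uint32)[0]                  # one 32-bit "register"

# Pull the low and high halves back out and reinterpret them as FP16 again.
halves = np.array([packed & 0xFFFF, packed >> 16], dtype=np.uint16)
print(hex(int(packed)), halves.view(np.float16))  # same two values come back
```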

Because the FP16 capability is part of the same ALU that GP100 already needs to support FP32, it’s reasonably cheap to design in terms of on-die area. Including FP16 support offers benefits to a couple of big classes of programs that might be run on a GP100 in its useful lifetime. Because GP100 only powers Tesla products right now (and may always do so), Nvidia’s messaging around FP16 support focuses on how it helps deep learning algorithms. This capability makes for a big performance jump when running those algorithms, and it also reduces the storage and movement required for the data that feeds them. Those savings mainly come in the form of memory bandwidth, although we’ll soon see that GP100 has plenty of that, too.

The second obvious big winner for native FP16 support is graphics. The throughput of the FP16 hardware is up to twice that of FP32 math, and lots of modern shader programs can be run at reduced precision if the shader language and graphics API support it. In turn, those programs can take advantage of native FP16 support in hardware. That “up-to” caveat is important, though, because it highlights the fact that there’s a vectorization aspect to FP16; it’s not just “free.” FP16 support is part of many major graphics APIs these days, so a GeForce Pascal could see big performance benefits in gaming applications, as well.

Wide and fast: GP100’s HBM2 memory subsystem

We’re in the home stretch of describing what’s new in Pascal compared to Maxwell, at least in the context of GP100. AMD was first to market with HBM, putting it to critically-acclaimed use with its Fiji GPU in a range of Radeon consumer products. HBM brings two big benefits to the table, and AMD took advantage of both of these: lots and lots of dedicated bandwidth, and a much smaller package size.

In short, HBM individually connects the memory channels of a number of DRAM devices directly to the GPU, by way of a clever physical packaging method and a new wiring technology. The DRAM devices are stacked on top of each other, and the parallel channels connect to the GPU using an interposer. That means the GPU sits on top of a big piece of passive silicon with wires etched into it, and the DRAM devices sit right next to the GPU on that same big piece of silicon. As you may have guessed, the interposer lets all of those parts sit together on one package.

Nvidia’s pictures of the GP100 package (and the cool NVLink physical interconnect) show you what I mean. Each of the four individual stacks of DRAM devices talks to the GPU using a 1024-bit memory interface. High-end GPUs had bounced between 256-bit and 512-bit bus widths for some time before the rise of HBM. Now, with HBM, we get 1024-bit memory interfaces per stack. Each stack has a maximum memory capacity defined by the JEDEC standards body, so aggregate memory bandwidth and memory capacity are intrinsically linked in designs that use HBM.

GP100 connects to four 1024-bit stacks of HBM2, each made up of four 8Gb DRAM layers. In total, GP100 has 16GB of memory. The peak clock of HBM2 in the JEDEC specification is 1000MHz—an effective 2000 MT/s, thanks to HBM2’s double data rate—giving a per-stack bandwidth of 256GB/sec, or 1TB/sec across a four-stack setup. Nvidia has chosen to clock GP100’s HBM2 at 700MHz, or an effective 1400 MT/s. GP100 therefore has just a touch less than 720GB/sec of memory bandwidth, or around double that of the fastest possible GDDR5-equipped GPU on a 384-bit bus (like GM200).
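
Here’s that bandwidth arithmetic as a quick Python sketch, using the stack count, bus width, and effective transfer rate described above:

```python
# Peak DRAM bandwidth for GP100's HBM2 configuration.
stacks = 4
bus_bits_per_stack = 1024
effective_transfers_per_sec = 1.4e9     # 700MHz clock, double data rate

bandwidth_bytes = stacks * (bus_bits_per_stack / 8) * effective_transfers_per_sec
print(f"{bandwidth_bytes / 1e9:.0f} GB/s")   # ~717 GB/s, which Nvidia rounds to 720GB/s
```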

The downside of all of that bandwidth is its cost. The interposer silicon has to be big enough to hold the GPU and four stacks of HBM, and we already noted that the GP100 die is a faintly ridiculous 610 mm² on a modern 16-nm process. Given that information, I’m guessing the GP100 interposer is probably on the order of 1000 mm². We could work it out together, you and I, but my eyeballing of the package in Nvidia’s whitepaper tells me that I’m close, so let’s keep our digital calipers in our drawers.

1000-mm² pieces of silicon—with etched features, remember, so there’s lithography involved—are expensive, even if those features are regular and reasonably straightforward to image and manufacture. They’re cut from the same 300-mm silicon wafers as normal processors, too, so chipmakers only get a relatively small handful of them per wafer. The long sides of the interposer will result in quite a lot of wasted space on the circular wafer, too. We wouldn’t be surprised if making the interposer alone results in a per-unit cost of around two of Nvidia’s low-end discrete graphics cards in their entirety: GPU, memories, PCB, display connectors, SMT components, and so on.
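
To put a rough number on that “relatively small handful,” here’s a sketch using the common dies-per-wafer approximation. The 32 × 31 mm interposer dimensions are a hypothetical stand-in for a ~1000-mm² part, since the real dimensions aren’t public:

```python
import math

# Rough count of ~1000 mm^2 interposers per 300-mm wafer, using the usual
# dies-per-wafer approximation (gross-area term minus an edge-loss term).
wafer_diameter_mm = 300.0
die_w_mm, die_h_mm = 32.0, 31.0          # hypothetical interposer dimensions
die_area = die_w_mm * die_h_mm           # ~992 mm^2

dies = (math.pi * (wafer_diameter_mm / 2) ** 2) / die_area \
       - (math.pi * wafer_diameter_mm) / math.sqrt(2 * die_area)
print(int(dies))                         # roughly 50 candidates per wafer, before yield
```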

Now that we have a good picture of the changes wrought in Pascal’s microarchitecture and memory system in the compute-oriented GP100, we can have a go at puzzling over what the first GeForce products that contain Pascal might look like.

 

GeForce Pascals: some wild guesses

When we start thinking about what Pascal might look like in consumer GeForces, I have a couple of guesses. The major changes that Nvidia is likely to make in these parts boil down to two things: FP64 compute and the use of HBM2.

To repeat what we concluded earlier, FP64 is completely useless for graphics, and it takes up a lot of die area. That’s especially true for the dedicated SIMDs needed to run FP64 alongside the main FP32-and-FP16 pipeline, as with the GP100 design. To keep costs down for consumers, I’m expecting Nvidia to effectively remove FP64 in the chips that arrive to power GeForce models. It’ll still be there because it can’t disappear completely, but it’ll probably just be 1/32-rate like we got in GM200.

Then there’s HBM2. I’d have argued for its inclusion in GeForce Pascals a few months ago, but GDDR5X is on the way. This memory doubles the prefetch length and also should come with a fairly large increase in effective clock speed. It’ll be cheaper to use than HBM2 at similar aggregate bandwidths, and it’s cheaper to implement at the on-chip PHY level—not to mention the savings from the lack of an interposer and stack packaging. GDDR5X also doesn’t have strict rules tying bandwidth to capacity. That lets Nvidia use memory sizes other than 4GB, 8GB, 12GB, or 16GB on its GeForce products, compared to the limitations of HBM2.

Given those guesses, I think there’s at least one consumer chip that’s still really big, but quite a bit smaller than 610 mm². It probably has similar overall throughput to GP100 in the metrics we care about for graphics, and it’ll probably come with less memory capacity. Even so, it should still have plenty of overall bandwidth. Some rumours say this chip is called GP102. I think it’ll have 56 to 60 SMs, 1/32nd FP64 throughput, and more than 8GB of 384-bit GDDR5X. If it exists, then it’s likely destined for a Titan-class card first, and maybe an enthusiast’s favourite “Ti” product later on.

Nvidia is also likely working on a GM204 replacement for the pair of high-end GeForce non-Tis that make up the meat of the enthusiast market these days. It’s probably called GP104. That chip will likely also have token FP64 throughput—remember that these are GPUs, not HPC cards. I also bet it’ll have 8GB of 256-bit GDDR5X, 40 SMs or thereabouts, and all the associated machinery in terms of texturing and backend throughput that implies, in a die of around 300 mm².
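
To put the same kind of numbers on that guess, here’s a heavily caveated sketch. Everything in it is speculative: the 40-SM count comes from the paragraph above, while the ~1.5GHz clock and 10 Gbps GDDR5X are assumptions of mine, not anything Nvidia has announced:

```python
# Purely speculative peak figures for a hypothetical 40-SM "GP104".
sms = 40                                  # guessed above
fp32_lanes_per_sm = 2 * 32                # two 32-wide main ALUs per Pascal SM
clock_hz = 1.5e9                          # assumed clock, not confirmed
gddr5x_bus_bits, gddr5x_gbps_per_pin = 256, 10e9   # assumed memory configuration

peak_fp32 = sms * fp32_lanes_per_sm * 2 * clock_hz          # FMA = 2 FLOPs
mem_bandwidth = (gddr5x_bus_bits * gddr5x_gbps_per_pin) / 8
print(f"{peak_fp32 / 1e12:.1f} TFLOPS FP32, {mem_bandwidth / 1e9:.0f} GB/s")
# ~7.7 TFLOPS and 320 GB/s under these assumptions
```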

After that, I don’t really want to put a flag in the ground. Expect something else for the “GTX 1060” part of the product line, and something else again for the “GTX 1050” and below, probably at a die size of around 100mm². By then, we’re probably onto GP2xx parts and some other small changes to the design.

Conclusions

We left out discussion of some other really interesting bits of GP100, should you want to go read about them yourself. Nvidia’s own architecture whitepaper is a good resource, so I’d recommend reading it and focusing on two things. The first is the details of the NVLink interconnect, which Nvidia uses extensively in the construction of its DGX-1 rackable supercomputer. The other point of interest is the fact that GP100 can now service its own page faults without host intervention. That feature has some really exciting applications for graphics, but it’s too big a topic to cover here. We don’t know whether it will make it into GeForce Pascals, but it’s definitely worth keeping an eye out for.

Anyway, we hope that trip through a Maxwell refresher, an overview of Pascal and GP100, a look at HBM2 and its associated costs, and some guesses about the other Pascals has whetted your appetite for the upcoming pitched battle between Nvidia’s Pascal and AMD’s Polaris. I, for one, need something to drive an Oculus Rift. Maybe two of those somethings for each eye. Jeff’s also jonesin’ for a fix of big, powerful GPUs, given that the Oculus Rift and HTC Vive have shown up at TR HQ recently. It’s about time new GPUs made possible by new manufacturing showed up, and the groundswell of VR adoption is probably going to be a really good kicker for whatever hits the market from both Nvidia and AMD.

Comments closed
    • Ninjitsu
    • 3 years ago

    OH SHIT I made a weirdly sort-of correct [url=https://techreport.com/news/26300/rumor-points-to-bigger-maxwell-gpus-with-integrated-arm-cores?post=814702#814702]prediction[/url]!

    [quote] When Maxwell originally launched in Feb, I had done some rough calculations and came to the conclusion that GM210* could end up with 3824 stream processors on the same process node, so 3200 for GM204 (and even more for GM210) doesn't look too far-fetched on 20nm. [/quote]

    Of course Maxwell 2 never got 20nm or 16nm, but it's eerily correct for GP100. Probably won't be for GP204. I don't know why I changed it later, probably SM considerations.

    • Ninjitsu
    • 3 years ago

    So I was comparing FLOPS between my GTX 560 and the speculated 40 SM GTX 1080, just for fun (a 7x difference, only considering shaders).

    I realised that something was odd; GF114 could put out about 1.1 TFLOPS, about 1/7th of what this 40SM GP104 would be capable of, that too with faster clock speeds. I thought it strange that a 7x increase in shaders led to a 7x increase in FLOPS between two architectures 5 years apart – suggested a similar performance ratio per shader.

    So I looked up the 580, saw that GF110 produced 1581 GFLOPS with 512 shaders. If you scale this up to 3584 shaders as enabled on the Tesla P100’s incarnation of GP100, you’d get 11067 GFLOPS, which is already 4.4% faster than Pascal. [s]Throw in another er... [i]91%[/i] increase due to clocks and you're looking at a 21216 GFLOP part. Let that sink in, Fermi could have done 21.2 TFLOPS with the same shader count and clocks. (EDIT: suggests that clock for clock, shader for shader, Fermi was 2x as fast as Pascal).[/s]

    EDIT: As AnotherReader points out, Fermi shaders were double clocked. That means Fermi was the one with 4.3% faster clocks instead, resulting in an almost identical 10608 GFLOPS. Pascal is Fermi! 😛

    Of course, GF110 was a 3B transistor part. Naturally, a 3584 shader Fermi chip would need some 21B transistors! (EDIT: assuming 15.3B isn't counting disabled stuff). [s]While this may be too big on 16nm, I can see this as something they'll return to later on. As it is, a Pascal SM looks a bit like two Fermi SMs glued together. We've also seen the shader/SM number go from 32>192>128>64, it feels like we'll go back to 32 next time.[/s] (EDIT: No, now I don't see the point anymore).

    I know I'm looking at this simplistically, but it was an unexpected and fun to look at outcome of my bored FLOPS comparison!

      • AnotherReader
      • 3 years ago

      You are forgetting the double clocked shaders of Fermi; the shader clock for a reference GTX 560 is 1620 MHz. These shaders required [url=http://images.anandtech.com/doci/5699/PowerClock.jpg]more area, but were still more efficient on an area basis[/url]. However, [url=http://www.anandtech.com/show/5699/nvidia-geforce-gtx-680-review/3]power was saved[/url] by giving up the doubled clock.

        • Ninjitsu
        • 3 years ago

        Oh yeah, completely forgot that. Will be about equal then, will update.

        EDIT: Yup, Pascal is Fermi-II confirmed :p

      • Orwell
      • 3 years ago

      Just for fun, I compared a couple of previous big NVIDIA chips using these metrics:
      [url]http://pastebin.com/SpqYD5ct[/url]

      This shows that separate shaders tend to become slower and slower while the GPU as a whole (and per mm2) becomes faster over the years.

        • Ninjitsu
        • 3 years ago

        Seems to become faster per transistor too. I’m too lazy at the moment, but should be cool to see what GP100’s numbers are.

    • Laykun
    • 3 years ago

    And then, 4 years down when TSMC is struggling to bring out the next process node and we’re all still stuck on 16nm FinFET, nvidia can cut out all the FP64 units again to give us a performance “boost”. The consumer markets must be a pretty small fraction of their income or projected income if they keep having to make concessions for FP64 ALUs. Mind you, that does mean that broken Pascal Tesla cards can trickle down into the consumer level, but I’d just love to see a monolithic GPU design primarily focused on gaming. I feel like if AMD went down the road of de-prioritising FP64 and focusing on FP16/FP32 performance they could become a considerable threat to nvidia in the consumer market.

    I’m very excited to see the benefit of the Vec2 16FP units though. There are so many places in modern shader code where I could see this applied without game developers having to do much at all, but instead some smart nvidia shader compiler optimisations.

      • psuedonymous
      • 3 years ago

      “I feel like if AMD went down the road of de-prioritising FP64 and focusing on FP16/FP32 performance they could become a considerable threat to nvidia in the consumer market.”

      Fiji already de-prioritised FP64 to 1/16 of FP32.

        • Laykun
        • 3 years ago

        So did Maxwell, that’s how both cards were able to squeeze so much more out of 28nm. But as you can see with pascal, nvidia is going back to having a relatively large FP64 fraction and one can only assume AMD will follow suit to try and capture the HPC market.

    • CthulhuBill
    • 3 years ago

    I ran (sound for) an event about the future of current major A.I. systems; in attendance were experts from Facebook, Google, Microsoft, and Nvidia. Jen-Hsun Huang was Nvidia’s attending panelist. (Very happy and enthusiastic person.) It was obvious by the event’s end that his ideas for Nvidia are now lasering in on powerful scientific usage for their cards. Much was talked about how they are still constrained by hardware, and what a ridiculous revolution it was for A.I.’s when they could program for arrays of video cards instead of CPU’s.

    This card’s build and design offers more credence to that belief. I wouldn’t be surprised if he changed the company’s major direction in the next few years, or created a spin-off business focusing on scientific software acceleration or somesuch.

    They’ve been tuning these things for video games for decades, they are just now realizing that building them with a focus for certain supercomputing functions can reap ridiculous results, and it’s not been touched yet.

    • Wild Thing
    • 3 years ago

    You “need something to drive that Rift”?
    It’s available now….Radeon Pro Duo.

      • MathMan
      • 3 years ago

      For only $1600! With 4GB per GPU! Only weeks before both companies will roll out their next big thing!

    • sparkman
    • 3 years ago

    How much ass will medium- and high-end Pascal GPU’s kick in comparison to GTX 9xx and AMD R9?

    I expected the last page, speculating on the specs of future consumer Pascal parts, to tie it all back to real-world performance, but it didn’t. Good article, nonetheless.

      • Ninjitsu
      • 3 years ago

      ~40 SMs in GP104 would mean about 7.5 TFLOPs of FP32. That’s about 10% more than GM200 (6.8 TFLOPS), without accounting for clock rates, mem bandwidth, arch improvements, and potential gains from FP16.

      That’s also 64% faster than GM104 in the GTX 980 (4.6 TFLOPS).

      EDIT: New gen Gx104 being 10-20% faster than previous gen Gx100 seems to be tradition at this point.

        • sparkman
        • 3 years ago

        Disappointing. I imagined the new process would allow more a dramatic speedup.

        I’ll take the extra power efficiency, though.

        EDIT: Never mind, “64% faster than GM104” just sunk in as being a much bigger deal than “Gx104 being 10-20% faster”.

        Yes, I want GTX 1080 that is 64%+ faster than GTX 980.

          • Ninjitsu
          • 3 years ago

          Worth noting that 64% is a coarse estimate and depends on how many SMs GP104 will end up having: might even be 37 SMs and ~7 TFLOPs (using the same ratios as GM104:GM200*), which is ~52% faster than GM104.

          Real world perf may be way better (or not that great) because these are peak theoretical numbers (depend on the “Boost” clock rate).

          All I’m saying is, wait and watch, this is still very much speculation (and probably why Rys avoided a comparison). Maxwell had a lot of driver magic going, Pascal seems more “brute force” to me.

          I still think it’s safe to say that a 1080 would be at least 50% and up to 70% faster than a 980, and 0-30% over the 980 Ti.

          *Worth noting that GM200 was the full chip, current GP100 is not. Same ratios applied to a full chip would produce the 40 SM config Rys mentions.

          EDIT: I had meant 10-20% faster than the big current gen chip in my previous reply, sorry if I put it confusingly!

    • djayjp
    • 3 years ago

    Even the manufacturer of gddr5x says the bandwidth increase over gddr5 will be minimal for 2016 (there was a chart by the manufacturer predicting its bandwidth trajectory over several years).

    GP104 is confirmed to be using gddr5x (check die shot):

    [url]http://www.eurogamer.net/articles/digitalfoundry-2016-in-theory-can-pascal-offer-titan-x-performance-at-970-money[/url]

      • mczak
      • 3 years ago

      It is true that the bandwidth increase isn’t that large initially (the projections said 10-12 gbps initially, later 16 gbps, whereas gddr5 tops out at 8 gbps), however I wouldn’t call that “minimal”.
      That GP104 (only the GTX 1080 variant) is indeed using “only” 10 gbps gddr5x – but this still represents a 25% increase over what would be possible with gddr5, and that’s nothing to sneeze at (and over the GTX 980, it represents an even larger increase of 43%, because those 8 gbps gddr5 chips aren’t really used anywhere, as they are too new). Meaning it has nearly the same bandwidth as last year’s flagship GTX 980Ti (95% of it), despite only using a 256bit instead of a 384bit memory interface.

        • djayjp
        • 3 years ago

        I’m just expecting something much more revolutionary. It’s a full TWO process nodes after all and AMD will have been shipping HBM for a year or more? If they’re going large (apparently not) then it should be up to 4x the theoretical performance (like the initially suspected HBM2 at 1TB/s).

    • AnotherReader
    • 3 years ago

    It is great to see another article by Rys. I anticipate your articles as much as I do ones written by David Kanter. Will we get a deep-dive of Polaris in the near future as well?

    • NTMBK
    • 3 years ago

    [quote]my eyeballing of the package in Nvidia’s whitepaper[/quote]

    Hands off Jen-Hsun's package!

      • Ninjitsu
      • 3 years ago

      It’s the biggest ever fabricated

        • Spunjji
        • 3 years ago

        -actually snorted-

        • w76
        • 3 years ago

        Pushes hard and fast, so it can race to sleep.

        • derFunkenstein
        • 3 years ago

        That was really fantastic.

    • Unknown-Error
    • 3 years ago

    Nice in-depth article. Thanx Rys Sommefeldt

    • DeadOfKnight
    • 3 years ago

    Kind of curious about this GP102 rumor. This is the first time I’ve ever heard of it. Source?

      • NoOne ButMe
      • 3 years ago

      I don’t have a source, but basically the theory is that a card with 60SMs minus all the compute features on GP100 could save 100-150mm^2 of die area. And not using HBM saves quite a bit on the cost to make it.

      If GM200 ever costs under $300 to produce with 8GB of HBM2 (silicon, interposer, HBM and product costs), I would be surprised. GM200 silicon and GDDR5 is near or under $100 right now.

      • Val Paladin
      • 3 years ago

      3DC among others have theorized that the GP102 could be a separate and reworked chip. Removing most of the 1,920 FP64 units, adding further FP32 ALUs, ditching the four NVLink interfaces (unneeded for consumer products unless a single link is desirable for GPU-GPU connection on dual GPU cards replacing the PLX bridge chip), swapping out the I/O interfaces, PHY, and controllers if the GPU is being interfaced with GDDR5X rather than HBM2.
      [url]http://www.3dcenter.org/news/der-release-fahrplan-zu-den-1416nm-grafikchips-von-amd-nvidia[/url]

      I'm still unsure about some aspects of the GP100 even after reading this synopsis. What is the overall ROP count? Does it follow consumer GPU lines, or have they been culled, since many HPC workloads don't really leverage them?

        • NoOne ButMe
        • 3 years ago

        Ryan Smith posted on Beyond3D that GP100 does have ROPs

          • Val Paladin
          • 3 years ago

          I didn’t mean it to sound like the chip had NO ROPs (some workloads would require them), just whether they are as prevalent as a consumer orientated GPU. Plenty of sources state that GP100 has ROPs but I have yet to see any information as to their number.

          As for the GP102 rumours. If anyone thinks that all that work I listed makes a GP102 non viable, I would mention that as per Maxwell, Nvidia could effectively double the size of GP104 (as per GM200 -> GM204 -> GM206) and just delete the redundant doubling up of uncore components ( command processor/transcode engine/PCI-E interface/display out) and a 50% increase in GDDR5X I/O, PHY, controllers ( 256-bit to 384-bit) rather than 100% (512-bit).

            • NoOne ButMe
            • 3 years ago

            But that drives up cost. More likely NVidia makes a GP100 minus compute and saves the die area for being able to sell Volta as an improvement.

        • MathMan
        • 3 years ago

        3DCenter got the name from a CUDA dll:
        [url]http://www.3dcenter.org/news/reihenweise-pascal-und-volta-codenamen-aufgetaucht-gp100-gp102-gp104-gp106-gp107-gp10b-gv100[/url]

    • Krogoth
    • 3 years ago

    Pascal is going to be very interesting. It is the biggest change in Nvidia camp since Fermi.

      • chuckula
      • 3 years ago

      Who are you and what have you done with Krogoth?!?!?!

        • morphine
        • 3 years ago

        My thoughts exactly.

        • BurntMyBacon
        • 3 years ago

        He said “interesting”, not “impressive”.

        There are a whole slew of things on youtube that some may find interesting, but certainly aren’t impressive. (Not that this should reflect on Pascal, but this is Krogoth we’re talking about)

      • Ninjitsu
      • 3 years ago

      So you’re possibly mildly impressed? 😮

      They must be very pleased at Nvidia HQ lol

    • chuckula
    • 3 years ago

    Hi Rys,

    Excellent article, here’s a quote and question:
    [quote]The Pascal SM in GP1xx is actually much smaller than the GM2xx SM for the main hardware. It’s just two 32-wide main SIMD ALUs this time, rather than four. There are also big changes afoot in this main ALU, but let’s hold that thought for the time being. Also along for the ride is a separate 16-wide FP64 ALU, giving the design "half-rate" FP64 throughput.[/quote]

    I take it from your reiteration of the "separate" 64-bit ALU description down in the FP16 section that there literally is a completely different 64-bit ALU sitting next to the regular pair of 32-bit ALUs, and obviously only one set gets activated in each SM at a time based on workload. In your expert opinion, what would the pros and cons be of having the two 32-bit ALUs combined with some extra logic (additional carry-lookahead levels etc.) to make the 64-bit ALU, in a similar manner to how CPUs operate? Nvidia already appears to be doing something similar in reverse with the 16-bit FP operations, so would there be some benefit to avoiding using all that silicon for a separate 64-bit ALU?

    Here's another question about granularity: given that Nvidia has made each SM smaller with only two ALUs, do you think that Pascal will have noticeable improvements to fine-grained scheduling for shaders? The "asynchronous compute" feature that's been touted in GCN appears to rely on a scheduler that can save the state of a stalled shader and context switch to a different shader very quickly, while Nvidia's simpler scheduler approach requires much coarser-grained and slower context switches. Does Pascal's approach of using smaller SMs help to mitigate the issues since the compute resources can be addressed with finer granularity, or is this still a major advantage for AMD?

      • MathMan
      • 3 years ago

      The benefit of having a separate FP64 unit is power consumption. When a large part of the workload is expected to be FP32 or FP16, the extra logic to make FP32 do FP64 in multiple cycles would sit in the FP32 datapath. When you see Nvidia putting out slides with energy usage for compute in pJ/FLOP, I can imagine this is something they don't want to waste.

        • BurntMyBacon
        • 3 years ago

        I wonder how much silicon space power-gating the unused portion of a unified 64/32/16 ALU (per Chuckula’s post) would take. Is it comparable to the spaced used by an extra ALU? Doesn’t seem like it would be to me. Perhaps nVidia doesn’t have the ability to shut off / turn on logic very quickly, but it seems to me that the largely FP32 or FP16 workload you describe wouldn’t need to. At worst, just leave a small number in FP64 mode to handle the odd FP64 calculation. One would think that with data packing (1 FP64 -> 2 FP32 -> 4 FP16) you would see an increase in compute performance without incurring much in the way of wasted power consumption.

          • NoOne ButMe
          • 3 years ago

          likely either all the FP32 are power gated or all the FP64 are. I believe reports said in some cases some professional (IE not Titan) GK110 cards downclocked when doing heavy FP64.

          If the goal is to be able to reach both maximum FP32 OR maximum FP64 performance in a 300W envelope, then you always want dedicated units: when you power gate the opposite unit, the power used overall for X work should be less than with a shared unit, as dedicated units typically get (much) better performance per watt.

          Look at Fermi and Hawaii for architectures/cards that can do a 1/2 FP64 rate without dedicated FP64 units.

          edit: love the downvotes for. Uh. I dunno. This is all technical, and I don’t think I’ve gotten anything wrong. I don’t mind downvotes on my opinions, but this is completely factual as far as I know.

            • auxy
            • 3 years ago

            Fermi was 1/4 DP and Hawaii is 1/16. Tahiti was 1/4. Kepler Titan was 1/3. Nothing does 1/2 except Pascal which isn’t out yet. (´・ω・`)

            • AnotherReader
            • 3 years ago

            Fermi was 1/2 DP too. Look at the Tesla M2090.

            The consumer version of Hawaii’s ratio of DP to SP rate is 1/8, but the [url=https://www.amd.com/en-us/products/graphics/workstation/firepro-3d/9100]FirePro variant[/url] increases it to 1/2.

            Edit: Added Fermi rate too

            • NoOne ButMe
            • 3 years ago

            Nope. Fermi (at least GF100/110) was fundamentally a 1/2 rate architecture and so is Hawaii.

            • Ninjitsu
            • 3 years ago

            Consumer Hawaii was 1/8, pro was 1/2 and Fermi was 1/2…

            • auxy
            • 3 years ago

            Yah, yah, I was wrong. ┐( ̄ヘ ̄)┌

            • BurntMyBacon
            • 3 years ago

            [quote]If the goal is to be able to reach both maximum FP32 OR maximum FP64 performance in a 300W envelope, then you always want dedicated units: when you power gate the opposite unit, the power used overall for X work should be less than with a shared unit, as dedicated units typically get (much) better performance per watt.[/quote]

            Unless, of course, you are at the practical size limit of your silicon die. Hence, my question about silicon space requirements of the power gating circuitry (which I would still like an answer to if anybody knows).

            The 300W power limit is your design limit, not mine. PCIe with 2x8-pin power connectors is 375W (150W per 8-pin, 75W from the slot). As far as Pascal is concerned, this chip became available in Nvidia's custom form factor long before standard PCIe models will become available anyways. Finally, there are architectures that don't max out the power limit (Maxwell) and manufacturers that will ignore the limit for special chips. The GTX 980 Ti and Titan X are only rated at 250W, for example.

            Let's take a theoretical architecture and process that could fit 480 32-bit ALUs and 240 64-bit ALUs within the maximum silicon die area. Like many of today's architectures, these are mutually exclusive: you can't fire up all the 32-bit and 64-bit ALUs at once. Let's say you could fit 360 unified ALUs into that same area. That would net you the performance of 360 64-bit ALUs, 720 32-bit ALUs, or 1440 16-bit ALUs. On the surface, this is a clear win. However, this presupposes that the power gating circuitry doesn't eat all of the space savings (hence my question) and that the ALUs maintain similar performance (possible, but not guaranteed).

      • AnotherReader
      • 3 years ago

      [url=http://www.lirmm.fr/arith18/papers/libo-multipleprecisionmaf.pdf]This paper by Huang et al[/url] from ISCA 2008 indicates that the cost of a multiple-precision FPU is an increase in delay of 9% over a separate DP FPU. The area cost is 118% of a single DP FPU, but less than that of 2 SP + 1 DP FPUs.

        • chuckula
        • 3 years ago

        Interestly. Thank you for the info.

          • MOSFET
          • 3 years ago

          Were you drunk or just refusing to use language properly?

        • BurntMyBacon
        • 3 years ago

        Good stuff.

        For architectures where clock speed is limited by the FPU and not other logic paths, you could only achieve 91.7% of the clock rate (1/1.09, from the added delay). Putting aside transport and power gating circuitry for the moment, you would need 8.3% more ALUs (in an ideal workflow) to make up for the throughput loss. This, according to the paper, would require 127.8% of the area of the DP FPUs. As long as the SP FPUs are more than 13.9% the size of DP FPUs, then the multiple-precision FPU results in a more efficient use of space.

        That said, adding in the power gating circuitry is going to hurt our space savings and we’d need to know how much space this would take on multiple precision units before we can really come to a conclusion. Working for us, though, is the fact that we would only be running half as many transport lines to a single multiple precision FPU (64bits) as we would to a 64bit DP FPU + 2x(32bits)SP FPUs.

          • AnotherReader
          • 3 years ago

          The paper’s SP FPU requires 32% of the die area of a regular DP FPU. I don’t think the added delay would be a problem for GPUs as they aren’t known for their stellar latencies.

      • Ninjitsu
      • 3 years ago

      From the whitepaper (don’t know if this is the same thing you’re looking for):
      [quote][b]Compute Preemption[/b] is another important new hardware and software feature added to GP100 that allows compute tasks to be preempted at instruction-level granularity, rather than thread block granularity as in prior Maxwell and Kepler GPU architectures. Compute Preemption prevents long-running applications from either monopolizing the system (preventing other applications from running) or timing out. Programmers no longer need to modify their long-running applications to play nicely with other GPU applications. With Compute Preemption in GP100, applications can run as long as needed to process large datasets or wait for various conditions to occur, while scheduled alongside other tasks. For example, both interactive graphics tasks and interactive debuggers can run in concert with long-running compute tasks.[/quote]

        • chuckula
        • 3 years ago

        That’s also very interestly. It sounds an awful lot like what AMD calls “asynchronous shaders” so we’ll see how it works in practice.

          • Andrew Lauritzen
          • 3 years ago

          Yes, this is supported today on both AMD and Intel (Skylake). Both can truly preempt compute tasks with predictable latency, whereas no architecture can predictably preempt graphics. There are differing granularities of how quickly they can stop it, but ultimately no GPU that I know of can actually pause, de-schedule, and resume 3D work on a machine – you have to wait for some amount of it (triangle, draw call, whatever) to finish, which has variable latency.

          Little bit tangential but thought you might be curious.

            • chuckula
            • 3 years ago

            Thanks!

      • Ryszard
      • 3 years ago

      The main reason for it being separate is just one of balance for the architecture in its target markets. It’s a GPU first, so asking all of the consumer derivatives to pay the tax of complexity, power, and area to run a more complex block of logic that’ll never do any native FP64 work is a hard sell. Simpler datapaths are also much easier to validate. So keeping FP64 separate lets Nvidia scale it down for consumer versions and save on area most of all.

      As for the granularity of operation, each half of the SM runs its own instructions on an independent warp already, with a dedicated scheduler. Maxwell had the same granularity of one scheduler per 32-wide main ALU. That’s outside of any scheduler changes Pascal has that might affect that, which it might well have.

        • chuckula
        • 3 years ago

        Thanks Rys! I always appreciate your insights.

    • anotherengineer
    • 3 years ago

    Wow interesting article.

    1 question, on page 4 “GP100 connects to four 1024-bit stacks of HBM2, each made up of four 8GB DRAM layers. In total, GP100 has 16GB of memory. ”

    I got a bit lost and require clarification.

    ??four 8GB DRAM layers, and 16GB of memory??

    Thanks

      • morphine
      • 3 years ago

      Fixed. Thanks for the heads-up!

        • anotherengineer
        • 3 years ago

        No Prob, thanks for fix.

      • Mr Bill
      • 3 years ago

      I was thrown off by that too. It might be a good idea to write out gigabit instead of Gb so the less technical reader gets a better clue.

        • anotherengineer
        • 3 years ago

        Gb is fine, it was written GB though originally.

        So 32Gb or 4GB per chip and 4 chips for 16GB, got it!

          • Mr Bill
          • 3 years ago

          I know, I saw it when it was GB also. Just saying the average non techi might not realize b is bit and B is byte.

            • derFunkenstein
            • 3 years ago

            If Gb/gigabit vs. GB/gigabyte is confusing for the reader, I imagine that the rest of the article is impenetrably dense.

            • ImSpartacus
            • 3 years ago

            You’d be surprised. As a layman, I learned a tremendous amount from reading articles like this on Anandtech. You have to start somewhere.

            • derFunkenstein
            • 3 years ago

            I agree that we all started from nowhere, but this is not the place to start. 😆

      • Ryszard
      • 3 years ago

      I think I wrote that differently in the original draft, so I’m going to plead the 5th and handwave that it was Jeff during editing!

        • anotherengineer
        • 3 years ago

        😀

        That works for me 😉

          • chuckula
          • 3 years ago

          Apparently the next iteration of HBM 2.0 will allow for 8 GB stacks to hit 32 GB, but that’s a 2017 thing.

            • BurntMyBacon
            • 3 years ago

            Would that be HBM 2.1, HBM 3.0 or is this just a more dense set of memory compatible with HBM 2.0?

            • chuckula
            • 3 years ago

            It’s still HBM 2.0, just denser RAM (or higher stacks) that has more capacity. Similar to how DDR3/DDR4 can come with different capacities based on the sizes of chips and the number of chips on a DIMM.

            • BurntMyBacon
            • 3 years ago

            So not actually another iteration of HBM. I thought I had missed something and was getting excited for no reason. Still, an iteration of the memory capacity is good too.

    • Tirk
    • 3 years ago

    You mentioned using GDDR5X in AMD and Nvidia this year; do you have any further news on volume production and release of GDDR5X, or is this solely based on a very quick turnaround of Micron’s volume production expected this summer?

    It’d be great to see but I’m still leaning towards it not being a reality, so any updated news you might have on that would be great.

    • Ninjitsu
    • 3 years ago

    That whitepaper has been lying open and unread for a while in my browser, but I read this article instead because I always enjoy bonus commentary from people in the industry.

    Really nice read, thanks Rys, TR!

      • Ryszard
      • 3 years ago

      Thanks! I didn’t really add that much over the white paper (and skipped over some things they disclosed, too), but it was fun to walk through why I thought they made certain decisions, and what might happen for GeForce. Plus Jeff gets to absolve himself of any of my mistakes 😉

    • tsk
    • 3 years ago

    Good to see I came to the same conclusion as Rys after GTC; GP102 will be the Titan card.

    Everyone who downvoted #48, I accept your apologies.
    [url]https://techreport.com/news/29999/rumor-nvidia-kills-some-maxwell-chips-ahead-of-june-pascal-launch#metal[/url]

    • Mr Bill
    • 3 years ago

    Nice write up. The concept of the rackable computer design with interconnectable compute modules looks really cool. Maybe in the not too distant future, the house computer will come to resemble that owned by Mr Slippery in ‘True Names’ by Vernor Vinge.

    • chuckula
    • 3 years ago

    Thanks in advance Rys, I’ll bloviate more after I’ve had a chance to give it a thorough read.

      • phileasfogg
      • 3 years ago

      ” I’ll bloviate more after I’ve had a chance to give it a thorough read.”

      wow, that’s the second time in 24 hours I’ve come across that word 😉
      Prof Robert Reich of UC Berkeley used that word in a Facebook post yesterday.
      Although, he used it to describe Donald Trump… who obviously doesn’t know what a GPU is 😉
      or… maybe he’ll pull a Justin Trudeau on us and explain how GPU computing works.

      • anotherengineer
      • 3 years ago

      lol

      But when you bloviate, you seem so pugnacious when you do so. Maybe you’re a pugnacious bloviator or maybe you bloviate with pugnaciousness. 😉

      hmmmm

      I think it’s time another TR vote!!

    • UberGerbil
    • 3 years ago

    Has anyone ever disclosed the wire pitch on the interposers? The AMD ones have been out in the wild for some time now so I assume it’s known, just not by me. But they look like they could be etched on a much older, larger (and cheaper) process: given that they’re just carrying signals from various discrete components, the circuitry can’t be very dense.

      • Ryszard
      • 3 years ago

      Good question. I think I have access to what I need to figure it out. Will investigate.

      • muxr
      • 3 years ago

      They are. Interposers are made on “cheap silicon” as they don’t require high density at all (and the yield is 100%). Article got it wrong. The expense in HBM mostly comes from packaging.

      edit: seriously why am I being downvoted? This is an absolute fact. AMD themselves talked about it in one of the HBM releases, do your own research from now on, I am done commenting on this site. This community blows.

      • mesyn191
      • 3 years ago

      This article has some of the best public information I’ve seen yet on the cost of interposers and HBM:
      [url]http://semiengineering.com/time-to-revisit-2-5d-and-3d/[/url]

      Synopsys released some great information too:
      [url]http://synopsys.com/Company/Publications/SynopsysInsight/Pages/Art3-3ddesign-flow-IssQ1-12.aspx[/url]

      They're using an interposer done on a 65nm process there, but others have used larger processes before. Current interposers are "passive" only and don't really have much in the way of circuitry on them. They're just connecting various chips together and acting as a go-between for the rest of the physical packaging for power and data. "Active" interposers with actual powered circuits doing actual processing work are coming, but they're going to be real hard to pull off. Maybe 2018 with Navi? That is probably the earliest you can expect them in a consumer product, though that is probably optimistic.

    • DancinJack
    • 3 years ago

    LOVE THIS ARTICLE. MORE OF THIS PLEASE.

      • ImSpartacus
      • 3 years ago

      I know, right? Good shit.

        • DancinJack
        • 3 years ago

        Seriously. <3 the deep dives and more technical stuff with Rys and Kanter etc. I also think the people here as a whole would love to see more of it.

          • derFunkenstein
          • 3 years ago

          I agree. I would love if they had more to write about, too.

    • Convert
    • 3 years ago

    Can’t wait to read it!

    • maxxcool
    • 3 years ago

    Hey Rys ? Any article edits upcoming that might address the whole “Asynchronous” elephant in regards to Nvidia?

    I am super curious if I am buying an AMD card to replace my 4gb-760 ..

      • sotti
      • 3 years ago

      It’ll be interesting to see how important Async actually is.

      Unless you actually write code on a render engine, I’m not sure we actually know.

      We know it can be tweaked to show good performance gains on AMD hardware, but then again nvidia hardware typically is already quite a bit faster.

        • maxxcool
        • 3 years ago

        Yeah I just don’t know either… could be mantle 2.0 all over again, or could be very easy to code for and see it everywhere. /shrug/ … hence my curiosity.

          • Pettytheft
          • 3 years ago

          I thought the whole purpose of DX12 and Vulcan was to get rid of the driver side optimizations. If Epic, Crytek or Unity put it in their engine then games will support it.

            • travbrad
            • 3 years ago

            Sounds great in theory but I’m not sure why game developers and engine developers will be any better at optimizing than driver developers were. We’ve only seen a couple DX12 benchmarks so far, but in one of them DX11 was faster on both AMD and Nvidia…

            • muxr
            • 3 years ago

            Because optimizations “after the fact” in the drivers is the wrong way to do it. There is only so much you can do in the drivers. A driver sometimes has to heuristically guess what the code is trying to accomplish in order to apply the right optimization path, whereas a game engine developer knows exactly what needs to happen and how something should be executed efficiently.

            If a game engine developer has access to these features he can get much more performance out of them, than relying on the drivers to do the guesswork.

            Part of the reason SLI/Crossfire for instance has been difficult to do in the recent years with a lot of games not supported is because some effects can’t be made to render on multiple GPUs in the drivers. Like volumetric lighting, that effect has to be designed for multiple GPUs from the ground up. Most optimizations come from the ground up, which is why relying on the driver for them was always the wrong way to do it.

            It’s one of the reasons why consoles have consistently delivered more performance per hardware than PC in the past. And why so often the console PC ports run so poorly on PCs.

            • Spunjji
            • 3 years ago

            Really confused by the downvotes here. Anyone care to elaborate, that all looked factually accurate to me 😐

            • Klimax
            • 3 years ago

            It is wrong. Low-level APIs are never the answer on a platform which can change. See Mantle and its failure when the R7 285 got released in Battlefield 4. Reviewers HAD to use the DirectX 11 path to get good performance. (IIRC the delta was 30%)

            The same is true for all low-level APIs. NO exceptions. Mantle, DX12, and Vulcan all share the same problem, and there is not even a theoretical solution to it out there.

            • Klimax
            • 3 years ago

            And yet driver-side optimization is absolutely required, and there is so far no hint at all that it will be otherwise in the future. Only the driver has the necessary information about the HW and its state. Game and engine developers will always be at a disadvantage, and their code will always be a poor substitute for a properly written driver and its code optimization. (And memory management.)

            Also, it is the only way to ensure future compatibility. No code can, and very likely won't, scale magically on new hardware without any prior knowledge. Without extra work, old code will have severe performance regressions. We have seen that already, we will see another batch of it very soon, and there is so far no way to prevent it in the future. (And if unlucky, it will fail completely – broken assumptions, bugs,…)

            TL;dr: The only relevant performance metric is and will always be the DirectX 11 (and OpenGL) renderer. Any (theoretical) performance gains are virtual and never permanent. What was gained fast will be lost as fast.

      • Ryszard
      • 3 years ago

      I’m not really sure what value I’d add on that bit of the microarchitecture debate, because to really do it justice I’d need information Nvidia probably don’t want to disclose. There’s a good thread at Beyond3D which helps though, although it’s got its fair share of noise to signal and it’s not until the later bits of the thread where the discussion goes in the right direction (and it’s a loooooong thread):

      [url<]https://forum.beyond3d.com/threads/dx12-performance-discussion-and-analysis-thread.57188/[/url<]

        • maxxcool
        • 3 years ago

        Your awesome! Keep up the good work.

          • dodozoid
          • 3 years ago

          *You’re
          #JustGrammarNaziThing

            • maxxcool
            • 3 years ago

            #meh

            • dodozoid
            • 3 years ago

            BTW are you american?
            (I mean no offence, just curious if it is an american issue or worldwide (empirewide)… I am apparently not a native english speaker, but this whole your/you’re, there/their/they’re just makes me subconsciously ignore the point of any message and despise the author a bit)

            Grammar: difference between knowing your shit and knowing you’re shit.

            • maxxcool
            • 3 years ago

            Indeed I am. I would not call it a cultural thing per say however. Credient examples will abound for bad to good to excellent grammatical conveyance.

            In my case.. it’s more a issue of ‘its only a forum post, and I’m in a hurry because of my profession…’ thing 😉

            cheers..

            edit if = of 😛

            • libradude
            • 3 years ago

            I don’t understand the idea that using correct grammar takes any more time than just doing it right the first damn time. You’re clearly not illiterate; why write like you are?

            Sorry, just a pet peeve, like several others ’round here it seems.

            • LoneWolf15
            • 3 years ago

            That’s “per se”, not “per say”; its origin is Latin.

            • derFunkenstein
            • 3 years ago

            America, as a whole, does not value literacy on the internet. Or math, for that matter. Anything that isn’t making memes out of TV show screencaps isn’t valued by Americans on the internet.

            There are exceptions, of course, but that seems to be the general trend.

            • dodozoid
            • 3 years ago

            Kinda sad for a nation that used to pioneer many technological and scientific milestones during the last century.

            • w76
            • 3 years ago

            But, we have cat videos now! Lots of cat videos.

            • derFunkenstein
            • 3 years ago

            Yeah, I know. I feel like an old person when I rage on and on about how people should make things and do stuff. It’s completely disheartening that “geek culture” nowadays is about consuming media and not about making stuff.

            • libradude
            • 3 years ago

            Your downvotes show that you might have hit a nerve, and folks can’t handle the truth 🙁 It bothers me too.

        • Flapdrol
        • 3 years ago

        Thanks for the link. Some good stuff there.

    • Redocbew
    • 3 years ago

    Woohoo! Rys is writing for TR now?

    Let all my fellow gerbils rejoice.

      • Jeff Kampman
      • 3 years ago

      Rys has done a number of guest articles like this one for us in the past, and I’m glad to have him back to dive into Pascal 🙂
