Pascal makes its debut on Nvidia’s Tesla P100 HPC card

During his GTC keynote today, Nvidia CEO Jen-Hsun Huang introduced the company's Pascal GP100 graphics-processing unit on board the Tesla P100 high-performance computing (HPC) accelerator. The 610 mm2 GP100 is built on TSMC's 16-nm FinFET process. It uses 15 billion transistors paired with 16GB of HBM2 RAM to deliver 5.3 teraflops of FP64 performance, 10.6 TFLOPS for FP32, and 21.2 TFLOPS for FP16.

In its GP100 form, Pascal includes 3584 stream processors (or CUDA cores) spread across 56 streaming multiprocessor (or SM) units, out of a potential total of 60. The chip uses a 1328MHz base clock and a 1480MHz boost clock. Its 4096-bit path to memory offers a claimed 720 GB/s peak bandwidth, and it slots into a 300W TDP.
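
Those throughput figures follow directly from the unit counts and clocks: 3584 FP32 cores times two operations per fused multiply-add at 1.48 GHz works out to roughly 10.6 TFLOPS. GP100's FP64 units run at half that rate thanks to the chip's 1:2 ratio of double- to single-precision hardware, and packed FP16 math doubles the FP32 figure to 21.2 TFLOPS.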

A block diagram of the GP100 GPU. Source: Nvidia

Each GP100 SM comprises 64 single-precision stream processors, down from 128 SPs in Maxwell and 192 in Kepler. Those 64 SPs are further partitioned into two processing blocks of 32 SPs. Each processing block, in turn, has its own instruction buffer, warp scheduler, and a pair of dispatch units. Nvidia says each GP100 SM has the same register file size and the same number of registers as a Maxwell SM. GP100 has more SMs and more registers in total than Maxwell does, though, so Nvidia says the chip can keep more threads, warps, and thread blocks in flight compared to its older GPUs.

A GP100 SM. Source: Nvidia

GP100 includes full pre-emption support for compute tasks. It also uses an improved unified memory architecture to simplify its programming model. GP100's 49-bit virtual address space allows programs to address the full address spaces of both the CPU and the GPU. Older Tesla accelerators could only have a shared memory address space as large as the memory on board the GPU.
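
As a rough illustration of what the unified model buys developers (a minimal CUDA sketch, not code from Nvidia's materials), a single managed allocation can be touched through the same pointer on both the CPU and the GPU:

[code<]
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;  // the device dereferences the same pointer the CPU uses
}

int main()
{
    const int n = 1 << 20;
    float *data;
    cudaMallocManaged(&data, n * sizeof(float));  // one allocation, one address, visible to CPU and GPU

    for (int i = 0; i < n; ++i)  // the CPU initializes the buffer in place
        data[i] = 1.0f;

    scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f);
    cudaDeviceSynchronize();  // wait before the CPU reads the results back

    printf("data[0] = %f\n", data[0]);
    cudaFree(data);
    return 0;
}
[/code<]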

GP100 also adds memory page faulting support, meaning it can launch kernels without synchronizing all of its managed memory allocations to the GPU first. Instead, if the kernel tries to access a page of memory that isn't resident, it will fault, and the page will then be synchronized with the GPU on-demand. Faulting pages can also be mapped for access over a PCIe or NVLink interconnect in systems with multiple GPUs. Nvidia also says page faulting support guarantees global data coherency across the new unified memory model, allowing CPUs and GPUs to access shared memory locations at the same time.
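
Here's a hedged sketch of what on-demand paging allows, with the allocation size chosen purely for illustration (it assumes the host has enough RAM to back it): the kernel launch below copies nothing up front, and only the pages the kernel actually touches migrate to the GPU, even though the allocation is larger than the card's 16GB of HBM2.

[code<]
#include <cuda_runtime.h>

__global__ void touch_pages(char *buf, size_t bytes)
{
    const size_t page = 4096;
    // Touch one byte per page; each first touch faults and migrates that page on demand.
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         i * page < bytes;
         i += (size_t)gridDim.x * blockDim.x)
        buf[i * page] = 1;
}

int main()
{
    size_t bytes = 32ULL << 30;  // 32GB managed allocation on a 16GB card (illustrative)
    char *buf;
    cudaMallocManaged(&buf, bytes);

    touch_pages<<<1024, 256>>>(buf, bytes);  // no bulk copy of the allocation at launch
    cudaDeviceSynchronize();

    cudaFree(buf);
    return 0;
}
[/code<]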

Tesla P100 cards also support the NVLink interconnect for high-speed inter-GPU communication in multi-GPU HPC systems. NVLink supports the GP100 GPU's ISA, meaning code running on one GPU can execute instructions directly on data residing in the memory of another GPU in an NVLink mesh. The topologies of these meshes vary depending on the CPUs in the host system. A dual-socket Intel server might talk to graphics cards in meshes using PCIe switches, while NVLink-compatible IBM Power CPUs can communicate with the mesh directly.
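
From CUDA, that remote access looks much like ordinary peer-to-peer addressing, with NVLink (or PCIe) as the transport. A simplified sketch, with error checking and the cudaDeviceCanAccessPeer query omitted:

[code<]
#include <cuda_runtime.h>

__global__ void increment(float *remote, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        remote[i] += 1.0f;  // the load/store goes over NVLink or PCIe to GPU 1's memory
}

int main()
{
    const int n = 1 << 20;
    float *on_gpu1;

    cudaSetDevice(1);
    cudaMalloc(&on_gpu1, n * sizeof(float));  // memory physically resident on GPU 1

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);         // let GPU 0 map GPU 1's memory
    increment<<<(n + 255) / 256, 256>>>(on_gpu1, n);
    cudaDeviceSynchronize();

    cudaSetDevice(1);
    cudaFree(on_gpu1);
    return 0;
}
[/code<]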

Nvidia says it's producing Tesla P100s in volume today. The company says it's devoting its entire production of Tesla P100 cards (and presumably the GP100 GPU) to its DGX-1 high-density HPC node systems and HPC servers from IBM, Dell, and Cray. DGX-1 nodes will be available in June for $129,000, while servers from other manufacturers are expected to become available in the first quarter of 2017.

Comments closed
    • chuckula
    • 7 years ago

    Feel free to play with the numbers:
    [url<]http://anysilicon.com/die-per-wafer-formula-free-calculators/[/url<] Given the area, you can get different results based on the assumed aspect-ratio of the chip and other factors like the wafer edge exclusion size. I'm getting a number of around 84, but of course that's only an estimate.

    • NTMBK
    • 7 years ago

    I wonder how many wafers NVidia gets through per good GP100 die.

    • Thbbft
    • 7 years ago

    Wonder how many good GP100 dies Nvidia gets per wafer.

    • ImSpartacus
    • 7 years ago

    fukn savage

    • reever
    • 7 years ago

    Nope, Xilinx has a 20B transistor FPGA

    • BryanC
    • 7 years ago

    They did actually show a running demo – the TensorFlow demo in the keynote was running on GP100. GP100 is real and works.

    (Charlie is just plain wrong and people need to hold him accountable).

    • ronch
    • 7 years ago

    OK, I know two different circuit designs built on two different nodes from maybe two different foundries (or maybe both were/are made by TSMC) aren’t exactly comparable, but the GP100, with 15 billion transistors built on 16nm and occupying 615mm2, is maybe 2x the density of the 28nm used by my HD7770, which contains 1.5B transistors and is 123mm2. 1.5B x 10 = 15B and 123 x 10 = 1,230mm2. Compare that to two GP100 dies: spanning the same 1,230mm2, they would contain 30B transistors, twice as much as the ten HD7770 dies with just 15B transistors. Is this right?

    Then, if we want to roughly see how much denser a 16nm node would be compared to a 28nm node with perfect area scaling:

    (16 x 16) ÷ (28 x 28) = 0.326

    But we see that TSMC’s 16nm only offers 2x (or 0.5) the density of 28nm, which puts it in 20nm territory. So is it true that the foundries’ 16nm nodes are really more like 20nm but with FinFETs?
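
    Working the density out directly: 15B ÷ 615mm2 is about 24.4M transistors per mm2, while 1.5B ÷ 123mm2 is about 12.2M per mm2, so right around 2x, versus the roughly 3x that perfect (28 x 28) ÷ (16 x 16) scaling would suggest.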

    • ronch
    • 7 years ago

    Doesn’t the article say this is now in production?

    • ronch
    • 7 years ago

    Suddenly, that Mercedes looks cheap.

    • Krogoth
    • 7 years ago

    DING! You got 500 points in “Graphical Vendors”. 😉

    • AnotherReader
    • 7 years ago

    Thanks for the upvote in triplicate 🙂

    • chuckula
    • 7 years ago

    Charlie is too busy working on his article [url=http://www.theinquirer.net/inquirer/news/1026032/intel-80-core-chip-revealed<]all about Polaris, including a die shot.[/url<]

    • NeelyCam
    • 7 years ago

    It sucks that [url=http://semiaccurate.com/<]The most trustworthy tech website[/url<] has not reported this news yet.

    • libradude
    • 7 years ago

    What is (or was) 3Dfx?

    • Krogoth
    • 7 years ago

    It depends on the resolution and games in question.

    For ultra-high resolutions and the majority of games, it will not be the case.

    If you go really low resolutions or play CPU-intensive games then the CPU will be the bottleneck.

    • nanoflower
    • 7 years ago

    AMD does have something that might compete with this underway. That’s what Vega seems to be but it won’t be shipping till the end of the year.

    • Waco
    • 7 years ago

    You can say that, but I’ve yet to see any evidence to back it up.

    EDIT: To be clear, I think you’re just wrong about scaling on x86 versus GPUs. Things like Linpack *really* don’t matter in real world workloads.

    • DavidC1
    • 7 years ago

    The *regular* Xeons do well, but not Xeon Phi. In fact even GPGPUs do better in average/peak when compared to Xeon Phi. The latter has a ways to go, contra-revenue schemes or not.

    • UnfriendlyFire
    • 7 years ago

    Because if you can get Skynet to work for you, you can sic the Terminators on everyone else.

    “Oh, you want to sue us? Come here, Terminators, BURN EVERYTHING DOWN.”

    “Oh, you want to regulate us? Good luck when even your military is wary of going after us.”

    • kuttan
    • 7 years ago

    You were talking about the Tesla P100, which is currently only available to some high-priority HPC and supercomputing firms. OEM availability is in Q1 2017, meaning we’re unlikely to see GP100-based GeForces before Q1 2017.
    [url<]http://wccftech.com/nvidia-tesla-p100-gp100-june-2016/[/url<]

    • chuckula
    • 7 years ago

    And you just got the second +3 of my renewed membership.
    Keep up the good work people!

    • chuckula
    • 7 years ago

    Congratulations. You just got the very first +3 of my renewed membership.

    • Waco
    • 7 years ago

    *From the perspective of a guy working in HPC*

    There is a massive difference between CUDA and anything x86. Stop reading about peak flops (really, nobody cares) and dig into why GPU-like things are nearly impossible to get peak rates out of when [i<]doing real work[/i<].

    • Srsly_Bro
    • 7 years ago

    3.5 years

    • jihadjoe
    • 7 years ago

    So while coding…
    #include three_laws_of_robotics.h

    And then during compile
    skynet.c:3: three_laws_of_robotics.h: No such file or directory

    But somehow the error isn’t treated as fatal and [url=https://xkcd.com/303/<]Compiling[/url<] continues, the fateful warning message buried in the depths of the console log.

    • UberGerbil
    • 7 years ago

    Thing is, you’re in no way an average jerk, for any reasonable definition of “average”: the vast, vast, [i<]vast[/i<] majority of users just don’t need that kind of range and precision. Meanwhile, the vast majority of the few that do are using it for the kind of jobs that pay for the hardware. Clearly there is some overlap in the Venn diagram, because you evidently* exist; but the overlap is such a tiny sliver that from a commercial viewpoint it doesn’t matter. (That said, the previous generation does get surplussed when the new ones are introduced, so if it matters to you enough you may be able to hunt something down at pennies on the dollar; obsolete cast-offs kind of exist for the poor schlubs of the world who can substitute time, effort, and ingenuity for cash.)

    * Of course, you could be an AI commentbot or [url=http://imgc-cn.artprintimages.com/images/P-473-488-90/81/8173/72KC300Z/posters/peter-steiner-on-the-internet-nobody-knows-you-re-a-dog-new-yorker-cartoon.jpg<]the like[/url<]. But in that case, you’re probably already running on fancy hardware with high fp64 horsepower, so the whole question gets a bit existentially strange.

    • nanoflower
    • 7 years ago

    Could be but pre-emption makes it sound like it’s directing which tasks to work on instead of being able to do them asynchronously as AMD can purportedly do with the latest GCN architecture.

    • nanoflower
    • 7 years ago

    Yes, it looks like everyone is taking some time to get production ramped up. Still not clear if AMD or Nvidia will be the first to actually ship products. It seems most likely it will be AMD since they’ve actually shown a working sample of Polaris 10 and 11 while Nvidia hasn’t shown any working consumer grade Pascal GPUs and, I think, doesn’t have any planned big events till Computex where they might announce their consumer grade Pascal cards.

    • DavidC1
    • 7 years ago

    NTMBK: The point made by that guy working in HPC makes real sense.

    It’s that *most* vendors don’t really bother rewriting code completely to fit other architectures like GPGPU or Xeon Phi. If you read HPC sites (and I have) trying to find real performance data for Phi, it’s often noted that the work required to take it above regular Xeons takes a lot of effort – no different than starting from scratch with GPGPU. If you really do as Intel claims and just add a few lines of code, at best you end up performing just like regular Xeons. What’s the point?

    Nvidia has put a lot of effort into porting code to CUDA. That means those applications already have an advantage over Xeon Phi. And if you look at Xeon Phi data, it claims 2-3x with the most optimal code. CUDA GPUs can go much higher than that, like 5-20x.

    Intel was selling Knights Corner Xeon Phi chips discounted to $100-200! Now what does that mean? How can they sell something at a 95% discount when it’s as great as they claim? I also remember an article working out the numbers for a supercomputer with massive Xeon + Xeon Phi elements, and how, without basically giving away Xeon Phi, the cost would go well over what they were claiming. It’s basically contra-revenue without officially calling it that.

    The real issue is that ultimately the regular Xeons are Intel’s bread-and-butter server chips. That means no matter what they claim, they won’t break from that mindset, and Xeon Phi will end up being a subset. That means it’s yet another Itanium, or mobile Atom.

    • quaz0r
    • 7 years ago

    well, personally im just a poor schlub who salivates at the thought of double-precision horsepower to fuel my double-precision compute endeavors. i could just never afford any of these hpc sorts of cards which they can basically charge anything they want to for them and the market will happily pay it.

    it would be nice if average jerks like me could buy a compute-oriented card that is as good for compute as gamer-oriented cards are for games, at similar prices.

    • Forge
    • 7 years ago

    At least none.

    • ImSpartacus
    • 7 years ago

    There’s been speculation that GP102 might fill that role.

    Nvidia could release a Titan with gp100 for ~$1500 and then use a gp102 to fill out the remaining high end.

    Think 50 SMs (3200 SPs) at P100 clocks with a 1/32 dp rate and gddr5x on a 384-bit bus (480-576 GB/s @ 10-12 Gb/s). That oughta be good enough for 2017’s <$800 parts.
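
    (That bandwidth range is just bus width times per-pin data rate: 384 bits × 10-12 Gb/s is 3840-4608 Gb/s, or 480-576 GB/s.)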

    And you’d probably get a much smaller die and a cheaper memory subsystem.

    • AnotherReader
    • 7 years ago

    I agree that the i386 is the appropriate comparison, especially considering the origin of GP100’s competitor in the HPC space. Moreover, it is good to be reminded of probably the most influential microprocessor of all time.

    • chuckula
    • 7 years ago

    OK, on December 31st you point out the newegg link where you can buy one GPU card that I can slot into any normal PC and that is setup to actually drive a display like a regular GPU and that has the “P100” chip and I’ll buy you… an online New Year’s greeting card.

    • Leader952
    • 7 years ago

    Yes I did. You on the other hand are very slick in glossing over the obvious typo he had in the GP100 when he clearly meant the P100.

    [quote<]This 610mm2 16nm GP100 GPU unlikely to see day of light this year[/quote<] Note the word [b<]THIS[/b<] in that statement. This meaning: referring to a specific thing or situation just mentioned. Since his response is to this article about the Pascal P100, that is what [b<]This[/b<] is about.

    • chuckula
    • 7 years ago

    He did. He said the GP100 *GPU* wouldn’t be out for sale this year.

    You said that a P100 HPC accelerator chip would be out this year.

    You never presented any facts contradicting his statement though.

    • Leader952
    • 7 years ago

    [quote<]This 610mm2 16nm GP100 GPU unlikely to see day of light this year[/quote<] Then explain the 4500 P100's that will be delivered this year to CSCS? [quote<]CSCS plans to upgrade the system later this year with 4,500 Pascal-based GPUs.[/quote<] [url<]http://finance.yahoo.com/news/nvidia-pascal-gpus-double-speed-070000862.html[/url<]

    • Leader952
    • 7 years ago

    [quote<]Launch dates: Q3 2016 for KNL, Q1 2017 for Pascal[/quote<] Launch date(s) wrong for Pascal it's 2016 not 2017. Pascal is launching in June 2016 (in the DGX-1) and now Nvidia will be shipping 4500 P100's this year (2016) for the upgrade to the Piz Daint system at the Swiss National Supercomputing Center (CSCS) in Lugano. [quote<]Piz Daint, named after a mountain in the Swiss Alps, currently delivers 7.8 petaflops of compute performance, or 7.8 quadrillion mathematical calculations per second. That puts it at No. 7 in the latest TOP500 list of the world's fastest supercomputers. CSCS plans to upgrade the system later this year with 4,500 Pascal-based GPUs[/quote<] [url<]http://finance.yahoo.com/news/nvidia-pascal-gpus-double-speed-070000862.html[/url<]

    • AnotherReader
    • 7 years ago

    Could it be that Nvidia has gone back to a Fermi style DP unit? Fermi didn’t have separate DP and SP units; instead, it had a unified DP unit that could process SP operands at double the DP rate. The disadvantage of a multiple precision FPU is [url=http://www.lirmm.fr/arith18/papers/libo-multipleprecisionmaf.pdf<]18% greater area and 9% greater delay[/url<]. I'll end on a fun factoid: A GCN compute unit and a Pascal SM are nearly identical at a high-level; each has a 256 kiB register file, 64 FP32 lanes, and 64KB LDS.

    • jokinin
    • 7 years ago

    I will be waiting for mid-range Pascal GPUs (~ 200 to 400€), and hopefully by that time AMD will have something competitive in that price range, too.
    If this happens, maybe it will be time to upgrade my almost 4 year old Radeon HD 7870.

    • chuckula
    • 7 years ago

    Yeah, obviously the i386 didn’t invent demand paging, but it was an easy reference for this forum. I hadn’t actually heard about Atlas before so thanks for the link!

    • AnotherReader
    • 7 years ago

    [url=http://en.wikipedia.org/wiki/Atlas_Computer<]Atlas[/url<] is sad

    • Chrispy_
    • 7 years ago

    True, perhaps, but I’d wager that the entire mobility market is worth significantly less than the compute market.

    Sure, revenue may be higher, but margins on the compute stuff are quite simply [i<]staggering[/i<]. Someone did some back-of-a-napkin maths a couple of months ago and worked out that a Tesla chip has similar manufacturing costs to a 980Ti chip, but nets Nvidia 1000% more money.

    • Waco
    • 7 years ago

    72 tiny cores with 4 threads each are easy, especially compared to thousands of GPU “cores”.

    The codes that will be running on KNL are codes that have been around for years running on many many thousands of cores. Even moving to a new architecture (on x86) costs a lot of time and money…moving to a CUDA implementation would take even longer.

    We’ll see how nvlink works in practice soon, and *maybe* then, there’ll be some traction in the big-iron HPC space for it.

    Also…nobody really cares about Linpack for real work.

    • kuttan
    • 7 years ago

    Well said words 🙂

    • Waco
    • 7 years ago

    This. Single port EDR (100 gbit) cards go for nearly $2k, so figure somewhere between $2k and $8k for that quad port implementation.

    • geniekid
    • 7 years ago

    You don’t want to underpay your software people. That was Hammond’s mistake.

    • AnotherReader
    • 7 years ago

    I second you. Moreover, there are algorithms which would be better suited to KNL. KNL has 9 times more globally visible cache (L2) than GP100.

    On a side note, TSMC’s FinFET process is looking great: over 30% clock speed increase!

    • NTMBK
    • 7 years ago

    Superfast networking fabric.

    • Srsly_Bro
    • 7 years ago

    MSI 7950 tf3. It’s a decent card except for artifacts and occasional driver crashes.

    • Prestige Worldwide
    • 7 years ago

    If you have Sandy Bridge or newer, very unlikely in a single-GPU scenario.

    • Milo Burke
    • 7 years ago

    I honestly don’t know what that is. Is it similar to a flux capacitor?

    • Prestige Worldwide
    • 7 years ago

    Big Pascal Q1 2017, GP104 June.

    • Spunjji
    • 7 years ago

    I find it interesting that AMD are talking about (and demonstrating) their smaller products and Nvidia are not, while the reverse is true for the larger ones; although in point of fact nVidia have still not shown an actual working Pascal to the world. It doesn’t take a huge leap of imagination to figure out whose products are in better shape at this stage.

    • Spunjji
    • 7 years ago

    Agreed. Unless this is the first node launch ever to ramp up yields that fast then this product will not be affordable for consumers any time soon.

    • Spunjji
    • 7 years ago

    This sounds most likely – smaller Pascal with GDDR5 for this year, maybe Big Pascal with HBM2 and FP64 functional units disabled as a new Titan towards the end of next year.

    • Spunjji
    • 7 years ago

    That approach risks them getting absolutely gutted in the mobility market for a couple of quarters, though. AMD appear to be explicitly targeting that area with Polaris and, unlike the low/mid-end gaming market, it’s one place where simply cutting the price of a 28nm product will not pass muster.

    • ForceEdge
    • 7 years ago

    i think you forgot about the quad infiniband…..

    • pranav0091
    • 7 years ago

    This is primarily an fp64/DL product (1/2 fp64 and 2x fp16) – to the best of my knowledge there is nothing out there that comes close. There are people and workloads that can’t do without fp64, and this targets that market. Increasingly, DL seems to be interested in fp16, so this will help such workloads too. Also note that certain problems have massive working sets that run into GBs – they need a lot of VRAM.

    Fiji is an fp32 product – one could say that it’s a “gaming card,” with the fp64 bits running at 1/16. That, and its use of 4GB of HBM1, limits its involvement in workloads that are huge or need fp64.

    What’s often missed in tech talk is the software side of things. It’s not enough to have a great number on the spec sheet. What’s – sometimes even more – important is how easy it is to tap into all those flops, and how consistently you can utilise all of them. Machine time is often much cheaper than engineer time, and that’s when usability becomes a key, but often overlooked, factor.

    <I work at Nvidia, but the opinions here are purely personal>

    • ronch
    • 7 years ago

    Just curious: which brand of card is your 7950?

    • ronch
    • 7 years ago

    I’ve been on my ‘trash’ HD7770 too for 3 years and 4 months now. It stinks! /s

    • ronch
    • 7 years ago

    IIRC a 6850 is about as fast as a 7770.

    • kuttan
    • 7 years ago

    This is not the place for your Trash comments.

    • chuckula
    • 7 years ago

    I have a very hard time believing that it’s somehow impossible to optimize for KNL when the last 20 years of HPC software development all center around massively parallel processing, which is exactly what KNL is tuned for to a tee.

    • NTMBK
    • 7 years ago

    Optimizing for 72 tiny cores- to be honest, the techniques for optimizing vectorized code for dual socket Xeon with 22 cores per socket should put you in good stead. Tuning will be required, certainly, but starting with a fully working codebase from Xeon and optimizing from there is going to be a lot quicker than rewriting for CUDA.

    I’m not a big Phi booster, don’t worry, but I feel it does have its niche. For most applications it’s probably not going to be the right choice, but I’m curious about it.

    • Srsly_Bro
    • 7 years ago

    Cost allocation is not in your future.

    • Srsly_Bro
    • 7 years ago

    -trash collector

    • Srsly_Bro
    • 7 years ago

    One of the memory chips is defective, causing random artifacts with certain games. Fallout 4 is a light show when I play it. I’ve been considering donating it after I upgrade.

    • w76
    • 7 years ago

    Intel won’t die, not quickly, but overall if the situation is as you describe this is fantastic for consumers, from gamers and office workers that don’t know any better up to professional HPC users. It’s the first signs of competition, and that would benefit all.

    • Anomymous Gerbil
    • 7 years ago

    Thanks gents!

    • the
    • 7 years ago

    It would depend upon when the consumer part launches since that is clearly coming after the HPC parts hit the market. If yields are poor, then there may not even be a consumer part. If yields improve and the consumer part arrives late in 2017, then it would be feasible to have more SM blocks enabled than the Tesla P100. nVidia would have had time to stockpile chips that are fully functional or with 2 SM blocks disabled. An early 2017 launch would require a bit of guessing on how well yields will be long term. I can see 6 to 4 SM blocks disabled in the consumer part to ensure supply even if yields turn out to be good.

    • Chrispy_
    • 7 years ago

    It’s probably because yields on 16nm are good enough to make large chips.

    Nvidia has the option to sell large compute chips at vast, ridonculous profit to enterprise, so why would they squander that option and waste their 16nm capability on the low-profit consumer products first?

    16nm consumer products will come once they’ve met demand for the enterprise compute market and have spare capacity to throw at the (relatively) low-profit consumer market.

    • kuttan
    • 7 years ago

    Waiting to see that

    Vega 10 vs GP100 review here …

    • DavidC1
    • 7 years ago

    Okay, so if Skylake-EP comes out, we fix one thing. What about optimizing for 72 tiny cores and 4 threads each? As I remember, the benchmarks for the first Xeon Phi were very wild. Also, the top gains were pretty minimal, like 2-3x, while with GPUs you can get 10-20x. At least with GPUs you can secure a niche. Dare I say this is another Itanium?

    Then you have Pascal, which has equal perf/watt and can do a much better peak. You think 610mm2 is large? Try ~680mm2. KNL lost when it got a 12-month delay from its original date. Actually, that’s the case with nearly all of Intel’s 14nm parts. The screwup might cost them dearly.

    Various competition is bringing the Intel era to an end, albeit slowly. S|A put it correctly when they said it’ll die from a thousand cuts.

    About Pascal: people also seem to be disappointed that P100 is “only” 10TFlops. Uhh, mind you, the Fury X-beating 980Ti is “only” 5.6TFlops, meaning this workstation card will be nearly 2x the performance of the 980Ti. And offer you 5.3TFlops of FP64. If they reduced FP64 to 1/16 they would be able to get 13-15TFlops. That means 2.5-3x a 980 Ti. Nvidia might have a top gaming card again.

    • NTMBK
    • 7 years ago

    [quote<]There was a post by someone(B3D forums) who actually works in the area of HPC that said KNL's supposed advantage is actually useless, and being you still need titanic efforts to get maximum out of parallelizing those 72 cores and SIMD units anyway, Nvidia is far better with better support in CUDA. [/quote<] SKX is finally getting AVX-512, so code which is optimized for that should hopefully be pretty optimal for KNL.

    • DavidC1
    • 7 years ago

    I think Nvidia has a big advantage here.

    Their Linpack rates approach 80% efficiency while Knights Corner was at 60%. Knights Landing isn’t better, with measured DP rates only at ~2.3TFlops or so (Intel data).

    Perf/watt is formidable for Pascal too. 5.3TFlops @ 300W is 17.7GFlops/watt, while 3TFlops @ 200W is 15GFlops/watt. The fact that the highest config is that efficient means they can pull out a 225W part possibly approaching 20GFlops/watt.

    There was a post by someone(B3D forums) who actually works in the area of HPC that said KNL’s supposed advantage is actually useless, and being you still need titanic efforts to get maximum out of parallelizing those 72 cores and SIMD units anyway, Nvidia is far better with better support in CUDA.

    Programming efforts: Equal, well, better with much more CUDA supported applications
    Pure DP compute: Pascal by far
    Perf/watt: About on par best case scenario for KNL
    SP compute: No comparison, Pascal’s raw figures are 3x+ better than supposed KNL
    Launch dates: Q3 2016 for KNL, Q1 2017 for Pascal

    Intel’s 14nm efforts are, so far, mediocre. KNL would have done better if the intro had been LAST YEAR. Now it’s slated for Q3 of this year.

    • bfar
    • 7 years ago

    You could be right. Mind you, we didn’t think big Kepler would ever be a gaming part either. If they disable enough SM units they might be able to work around the yield and power issues.

    • bfar
    • 7 years ago

    That’s telling. I wonder how many more they’d have to disable on a consumer part?

    • kuttan
    • 7 years ago

    AMD’s reply to GP100 is their 14nm Vega 10, which is rumored to be a 15-18 billion transistor GPU with HBM2 memory, slated to launch in 2017.

    • Ninjitsu
    • 7 years ago

    I dunno, you’re comparing two products on a mature 28nm process to two products on a fresh 16nm FinFET process…when they do put out a consumer version, it’ll likely cost the same as always: ~$1000.

    • Ninjitsu
    • 7 years ago

    It’s GP100; it could find its way into a Titan. GP104 will likely be around 7 TFLOPS FP32 as well.

    • Firestarter
    • 7 years ago

    I’m just surprised that they did this on a new node right out of the gate. 600mm[super<]2[/super<] 28nm GPUs have been with us a while but 28nm GPUs certainly did not start off that big. It's as if Nvidia had released the first Tesla right around the time that the HD7970 hit the market, instead of a year later

    • Ninjitsu
    • 7 years ago

    While I don’t know how many will use FP64 outside of the prosumer space, I think that as long as the FP64 goodness is inherent to the architecture you’ll be able to buy a consumer card with that capability. It may be limited to the Titan series via drivers, worst case.

    Remember, Maxwell architecturally had nerfed FP64, and so did Kepler (as compared to Fermi and Pascal).

    • kuttan
    • 7 years ago

    This 610mm2 16nm GP100 GPU unlikely to see day of light this year and AMD had similarly big GPU at 14nm named Vega 10 which is rumored to contain 15-18 Billion transistor GPU with HBM 2 slated to launch in 2017. This year we are more likely to see smaller GPUs from both vendors AMD with Polaris 10 and 11 whereas Nvidia with their GTX 1080 and 1070 GPUs

    • brucethemoose
    • 7 years ago

    This isn’t something one just scrambles to create. If they don’t already have a compute monster baking in the oven, they aren’t gonna compete with GP100 anytime soon.

    • balanarahul
    • 7 years ago

    [quote<]maybe its time they start making separate cards for gaming and HPC im not talking about the GP100 where it has a some dedicated fp64 cores im talking about completely removing all gaming related circuitry and have nothing but HPC related cores in a card for HPC usage[/quote<]

    Not that I don’t want it to happen, but if they had any intention of doing that they’d have done it already.

    • balanarahul
    • 7 years ago

    How is FP64 good for consumers? If anything it’s bad since 99% of buyers are never going to run FP64 instructions on their GPU (afaik) and putting FP64 shaders + stuff necessary to pass on instructions to those shaders wastes die space.

    • MathMan
    • 7 years ago

    I don’t think P100 is much a threat to AMD: their market share in that segment is very low to begin with.

    • MathMan
    • 7 years ago

    How is this different than Fiji?

    I totally understand that Nvidia can ask huge prices because they own that market segment. But give it a year, and Titan class GP100 should be very doable.

    • balanarahul
    • 7 years ago

    Might as well be the biggest chip ever built at 15.3 billion transistors.

    • ronch
    • 7 years ago

    Knowing AMD, they’ll scramble to come up with something like this like they always do when their competition does something amazing and trailblazing. Funny thing is, when AMD does something, most of the time the competition just shrugs and laughs and comes up with their own version instead of licensing AMD’s tech even if AMD is practically giving it away for free — ‘Open’. Turbo Boost, Freesync, SSE5/XOP, 3DNow!, FMA4, etc. Either AMD scrambled to copy or tried to do something but got ignored.

    • Bensam123
    • 7 years ago

    I’ve heard June / July as well… Which seems to fit inline, but thought someone would come up with something based on past release schedules.

    • Krogoth
    • 7 years ago

    Nvidia made its business on OEM contracts like every major graphical vendor that still exists today.

    Guess who was the last graphical vendor that catered solely to the gaming demographic? Here’s a hint: Nvidia owns their IP and other assets. There’s a reason why they are gone.

    • f0d
    • 7 years ago

    personally i think the fp64 stuff is useless and id rather that space be all standard shader cores, its like how the intel integrated graphics is just taking up space that id rather be dedicated to more cpu cores (and yes i have a 2011 cpu just for that reason)

    maybe its time they start making separate cards for gaming and HPC
    im not talking about the GP100 where it has a some dedicated fp64 cores im talking about completely removing all gaming related circuitry and have nothing but HPC related cores in a card for HPC usage

    • markhahn
    • 7 years ago

    it’s a yield thing: if the FP64 stuff took up no significant space, it wouldn’t affect yields and they’d just fuse it off for market segmentation purposes. but since it does take a lot of space…

    • markhahn
    • 7 years ago

    really? I’d be surprised if they didn’t ship a competitive halo product by the end of this year.

    shader count is just a matter of die size and yield. clock is subject to design optimization, but is also mainly a power thing – and in-package ram saves quite a bit of power. 720 GB/s is going to be routine for 4x HBM2 modules. so: nothing particularly impressive, just half a cycle out of phase with AMD’s cycle. as usual. as planned.
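
    (Four 1024-bit HBM2 stacks give the 4096-bit interface, and 512 bytes per transfer at the roughly 1.4 Gb/s per pin the P100 runs at works out to about 720 GB/s, with headroom toward 1 TB/s at HBM2’s rated 2 Gb/s.)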

    • markhahn
    • 7 years ago

    $12k seems about right, since the current high-end is about half that.

    • Laykun
    • 7 years ago

    They can, it just depends on the type of application. In general, through CUDA, you can access the texture units and exploit the texture memory caches if you’re threads read from generally the same area in the texture, giving better memory bandwidth. But it all depends on the type of application and how well it’s written.

    • ronch
    • 7 years ago

    I wonder if AMD has a similar behemoth in the works or they’ll scramble to create one after this announcement.

    • the
    • 7 years ago

    They still could, since the DGX-1s aren’t expected to start shipping until July, in ultra-low volumes, and could easily be pushed back. The midrange and low-end Pascals could easily ship in volume before then.

    A GP100 card for consumers is still possible but I wouldn’t expect it until GP100 production has reached ‘volume’ levels. Considering that the Tesla card is shipping with 4 of 60 SM blocks disabled, it’ll likely be 2017 before it arrives.

    • the
    • 7 years ago

    The other difference in that comparison is that Pascal still needs a host CPU to function (and it looks like one with plenty of PCIe lanes, too). That’s additional power consumption that won’t be contributing much, relatively, to the total peak throughput. Depending on the workload, it could very well be wiser to use a single Tesla P100 PCIe card with a power-efficient host CPU than a big multi-socket Xeon setup to enable a quad- or eight-GPU configuration. The type of workloads here will also matter.

    Ironically, Knights Landing has plenty of PCIe lanes, so it’s possible to get the best of both worlds here: numerous HPC-focused x86 CPU cores with several Tesla P100 cards.

    • the
    • 7 years ago

    The way to think of these GPUs is more along the lines of massive multi-socket Xeon E7 chips due to the multi-GPU fabric. Not directly comparable but nvLink is about increasing GPU scalability. This is the core of the DGX-1 motherboard design. For comparison, the Xeon E7’s top out at [url=http://ark.intel.com/products/84685/Intel-Xeon-Processor-E7-8890-v3-45M-Cache-2_50-GHz<]$7,174[/url<] per chip. Though one difference is that the Xeon E7's don't need a host system as they can boot themselves.

    • the
    • 7 years ago

    That is one WOPR of a question.

    • mesyn191
    • 7 years ago

    I suppose TSMC could always fudge the geometry even more by tweaking their 16nm process a bit and then calling it 10nm to get something out by then, but yeah, otherwise they’re going to have to push back their shrink.

    But then it’s TSMC, so that would be par for the course for them.

    • synthtel2
    • 7 years ago

    Most GPGPU stuff doesn’t hit TMUs much, but every once in a while there’s something that can abuse them for massive benefit. It is a graphics example, but [url=http://iryoku.com/aacourse/downloads/04-Jimenez’s-MLAA-&-SMAA-%28Subpixel-Morphological-Anti-Aliasing%29.pdf<]SMAA[/url<] is my favorite example of something you wouldn't expect that gets massive gains from TMUs. It's basically a hardware implementation of [code<]result = (end2 - end1) * location + end1[/code<] with some extra stuff for trilinear and anisotropic filtering that GPGPU doesn't usually care about. Pascal appears to keep the same shader:TMU capability ratios, FWIW. ROPs and geometry hardware, on the other hand, have basically no use (to my knowledge) in GPGPU. Edit log: too much. Wrote incrementally, then just now cleaned up so it's actually readable.
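
    To make that “free lerp” concrete, here’s a minimal CUDA sketch (purely illustrative, not from any particular project): sampling a 1D float texture with linear filtering returns the interpolated value in a single fetch, i.e. the formula above done in TMU hardware.

    [code<]
    #include <cstdio>
    #include <cuda_runtime.h>

    // Manual lerp: the operation the TMU performs in fixed-function hardware.
    // Shown only for comparison; the kernel below never calls it.
    __device__ float lerp_manual(float end1, float end2, float location)
    {
        return (end2 - end1) * location + end1;
    }

    __global__ void sample(cudaTextureObject_t tex, float *out)
    {
        // With cudaFilterModeLinear, coordinate i + 1.0 lands halfway between
        // texels i and i+1, so one fetch returns their average; no ALU lerp needed.
        out[threadIdx.x] = tex1D<float>(tex, threadIdx.x + 1.0f);
    }

    int main()
    {
        float host[4] = {0.0f, 1.0f, 4.0f, 9.0f};

        cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
        cudaArray_t arr;
        cudaMallocArray(&arr, &desc, 4);
        cudaMemcpyToArray(arr, 0, 0, host, sizeof(host), cudaMemcpyHostToDevice);

        cudaResourceDesc res = {};
        res.resType = cudaResourceTypeArray;
        res.res.array.array = arr;

        cudaTextureDesc td = {};
        td.filterMode = cudaFilterModeLinear;   // ask the TMU to interpolate
        td.readMode = cudaReadModeElementType;
        td.addressMode[0] = cudaAddressModeClamp;

        cudaTextureObject_t tex;
        cudaCreateTextureObject(&tex, &res, &td, nullptr);

        float *out;
        cudaMallocManaged(&out, 4 * sizeof(float));
        sample<<<1, 4>>>(tex, out);
        cudaDeviceSynchronize();
        printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);  // expect 0.5 2.5 6.5 9

        cudaDestroyTextureObject(tex);
        cudaFreeArray(arr);
        cudaFree(out);
        return 0;
    }
    [/code<]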

    • DeadOfKnight
    • 7 years ago

    So much for the gigantic stream of rumors that consumer products would be unveiled today. Personally I don’t give a damn about GP100, I want to see what GP104 can do. In fact, I’d even rather see what the mobile chips can do on the new process if efficiency is so high. It’d be funny if new recommended min requirements for VR headsets were GTX 1080M or whatever. We’ll probably still have to wait a couple generations before 4K becomes reasonable though.

    • MathMan
    • 7 years ago

    Some image processing applications do use them. And in rare cases, they are abused for free bilinear interpolation.

    • chuckula
    • 7 years ago

    They aren’t really of much use and honestly Nvidia didn’t go into a bunch of detail on the texture units and other parts of the P100 that you would actually use to generate graphics.

    Additionally, those cards in that DGX1 server using the NVlink interconnect don’t even bother with vestigial graphics connectors in the first place.

    This may be [yet another] reason why you aren’t going to get a copy of the P100 in a consumer graphics card, although there will certainly be some type of “big” Pascal graphics card somewhere down the road.

    • torquer
    • 7 years ago

    They should apply “deep learning” and figure out why their drivers have suddenly become garbage. It’s almost like they hired some AMD driver engineers.

    • Anomymous Gerbil
    • 7 years ago

    Today’s dumb question:

    Do programs running on GPGPUs make much use of the texture-processing blocks?

    • jihadjoe
    • 7 years ago

    We should increase pipeline depth on those engineers so they can keep more threads in-flight.

    • Beahmont
    • 7 years ago

    Maybe not even then if TSMC is to be believed. By 2017 TSMC says they are supposed to be on 10nm. And even the half node jump is going to decrease the area needed to get similar performance to something much more reasonable. This particular part may never make it to the PC. That is of course if TSMC isn’t blowing smoke about 10nm.

    If TSMC’s 10nm takes too long or turns out to be a flop for high performance, then it may make it to a PC card.

    • the
    • 7 years ago

    Well they’re already disabling 4 out of the 60 SM clusters so that should tell you something.

    • synthtel2
    • 7 years ago

    On package, but not on the 16nm die. The die that has to include space for HBM is on an old and cheap process, and is much bigger than 610mm[super<]2[/super<].

    • Srsly_Bro
    • 7 years ago

    This is the 22 Core Xeon we all want and can’t afford. 🙁

    • mesyn191
    • 7 years ago

    Fiji is super low volume though. It’s also more of a prestige product. They don’t mind since demand for it is low, but it’s clearly not all that practical of a product.

    Maybe in 2017 or 2018, when 16/14nm matures, the story will be different, but right now it’s unrealistic to expect a 600mm2+ die-size GPU as a mass-launch PC product.

    • mesyn191
    • 7 years ago

    It’s an HPC part priced at a stupidly high level.

    You shouldn’t use this as an indicator of what they’ll be offering exactly for the PC market.

    • mesyn191
    • 7 years ago

    It’s been hinted at for a month or two now, but yeah, it’s still a surprise. Prior to that, everyone assumed they were going to launch 16nm with a midrange part.

    This will likely be a very low-volume part though, so yields are probably as crappy as you’d expect. Given the price ($120K+) I think nV is fine with that. That means you probably won’t be seeing this card available for the PC until closer to year end at best (2017 seems more realistic), since they’re launching this as an HPC product.

    • flip-mode
    • 7 years ago

    HBM on die adds some heft.

    • liquidsquid
    • 7 years ago

    There must come a day soon when processors like this are a core part of an engineering workstation for simulations, and not out of the price reach of mere mortals. Sims like thermal analysis in real time, full circuit simulations rather than just sub-circuits, fast signal-integrity sims, etc. would be valuable. The price has to come down to a point where it costs less than just purchasing and building boards to test them.

    Gaming has sort of dwindled in my life as I get older. I would rather tools make my difficult job easier.

    • derFunkenstein
    • 7 years ago

    yeah, my bad. Per “blade”, I guess. Still, it’s going to be a high-margin product whatever the term.

    • Andrew Lauritzen
    • 7 years ago

    Yeah, it should be clear from the specs alone that there is no way they will ever sell such a piece of silicon at consumer pricing levels.

    Like it or not the rift between data-center and client gear – and pricing – is going to continue to increase for a while at least.

    • brucethemoose
    • 7 years ago

    610mm^2? I wonder what yields will be like.

    • shaq_mobile
    • 7 years ago

    thats a lot of money to pay someone to engineer the end of mankind. youd think wed have the foresight to at least underpay skynet engineers. 🙁

    • shaq_mobile
    • 7 years ago

    but can it run crysis?

    • quaz0r
    • 7 years ago

    does anyone foresee a time in the semi near future where they wont strip out all that FP64 goodness from consumer cards? or will that always be the primary means of differentiating / justifying the super expensive “hpc” cards?

    • Milo Burke
    • 7 years ago

    It’d sure be nice if some GP100 could make it to consumers. But I imagine demand is more than a little high from the HPC world.

    Ready for some more back-of-the-envelope math?

    – The GTX Titan X sells for $1,100
    – The Quadro M6000 sells for $5,000 (if I’m reading this right, they’re equivalent)

    That’s a 4.5x markup for the workstation version of the same chip. Or, you could say the consumer version costs 22% of the price of the enterprise version.

    If we removed an equivalent 78% of the cost of the enterprise GP100, would the next Titan be priced somewhere around $2,700?

    (Assuming enterprise demand ever slows enough to create such a product.)

    • Milo Burke
    • 7 years ago

    Bah, and when I had predicted March-April 2016, I had drawn from research of their previous release-cycle lengths, aiming for the longer side. Looks like this transition period is a doozy. But worth it.

    • Milo Burke
    • 7 years ago

    I opted for the beefiest PSU options they had. You can mentally adjust down from $12,400 per card to include PSU support for them.

    • MathMan
    • 7 years ago

    It will sell like hotcakes.

    • MathMan
    • 7 years ago

    The head of the Stanford AI lab said that some of their PhD students were offered $1M employment offers…

    • NTMBK
    • 7 years ago

    You forgot the extra 2400W (no, really) power supply required to power those GPUs.

    • Tirk
    • 7 years ago

    I’ll go one step further and estimate that they’ll release Pascal with plain old GDDR5, non-X, if they’re going to be selling them at all this year for the desktop.

    • Tirk
    • 7 years ago

    Nvidia’s best-kept secret in performance: wood screws. It’s the secret sauce that allows them to dissipate all that heat.

    If Intel debuted their KNL with wood screws I bet it would push out 10 TFlops of DP 😉

    • Leader952
    • 7 years ago

    Time is money.

    Work done on previous system 25 hours, new system 2 hours.

    What is the salary of a Deep Learning engineer?
    Having one idle for 2+ days is costly.

    • Tirk
    • 7 years ago

    If it lasted you 3.5 years I wouldn’t really consider it trash.

    How long do you expect a Pascal GPU to last you?

    • Leader952
    • 7 years ago

    4th quarter 2017

    • Tirk
    • 7 years ago

    Q1 2017

    • Milo Burke
    • 7 years ago

    Who’s ready for some back-of-the-envelope math??

    I specced out a server on Dell’s website:
    – PowerEdge R730
    – Dual Xeon E5-2698 processors
    – 512 GB 2133 LRDIMM memory
    – 4x 1.92 TB SSDs
    – 2x Intel 10Gb network cards
    – 2x 1100w PSUs
    – No GPU

    Dell price? $29,900.
    Nvidia DGX-1 node price? $129,000.

    That’s a $99,100 price difference for eight Pascal GPUs.

    Giving a ballpark price of $12,400 per Nvidia Tesla GP100.

    • nanoflower
    • 7 years ago

    That’s only if Nvidia plans on using HBM2 for their consumer cards. That seems unlikely for the low end and midrange. It’s certainly possible for a Titan replacement, which likely won’t see the light of day till the end of the year, but I really doubt Nvidia plans to hold off releasing their non-flagship cards that long, so they won’t be using HBM2 for those cards. They may want to use GDDR5X for the high-end cards, which would likely mean they won’t be available till fall 2016 since the spec was just officially adopted not long ago.

    • MathMan
    • 7 years ago

    “Card” is a bit of a misnomer. It’s 8 P100 cards, 2 Xeon CPUs, 7TB SSDs etc.

    • derFunkenstein
    • 7 years ago

    I’m conflicted – gamers absolutely got Nvidia where they are, but man. $129k per card? I’d sell to HPC buyers first, too.

    • cygnus1
    • 7 years ago

    It has four 4GB HBM2 chips on the interposer. It’s not one 16GB chip and it’s not shipping now either.

    • bfar
    • 7 years ago

    So basically, for consumer gaming parts this year, we’ll be looking at GP104 or Polaris. If they arrive around the same time and perform similarly there could be some bargains 🙂

    • DPete27
    • 7 years ago

    I’ve seen possibly at Computex, so June 1-ish

    • dodozoid
    • 7 years ago

    What the actual fuck? First 16nm GPU and it’s straight up the biggest GPU ever built. Didn’t see that coming…

    edit: They didn’t actually “show” the chip, not to mention a running demo… so it might be just slides…

    • flip-mode
    • 7 years ago

    That would be rad!

    • Tumbleweed
    • 7 years ago

    Hopefully NV will make an S100 bus version of it for me to put in my IMSAI 8080, and we’ll find out!

    • DrCR
    • 7 years ago

    He’s not [i<]that[/i<] bitter. Just sour apple green.

    • Firestarter
    • 7 years ago

    trash? It’s quite possibly the best purchase I ever made, second only maybe to the i5-2500K

    • NTMBK
    • 7 years ago

    It won’t reach widespread server availability until Q1 2017, and they still didn’t have a card to show on stage today. It’s a 300W chip which spends a ton of area on DP. I honestly don’t think this thing will ever be a gaming product.

    • BaronMatrix
    • 7 years ago

    Wait, last I heard 4GB HBM2 wouldn’t be ready until after the summer… How are they making these now…?

    • NTMBK
    • 7 years ago

    It’s 300W, and not even shipping properly until 2017. I don’t think this is ever going to be a consumer GPU.

    • flip-mode
    • 7 years ago

    But can it play…. Global Thermonuclear War against Joshua (and make the cool computering sound effects while doing it)?

    • chuckula
    • 7 years ago

    For some perspective, Knight’s Landing has a noticeably lower peak throughput (listed as 3+ Tflops) although Knight’s Landing is officially listed as a 200 watt part while Nvidia is calling the P100 a 300 watt part, so there is a power difference at play too.

    Overall it’s not surprising that Nvidia has a higher raw compute number since KNL is not designed to maximize peak compute performance. It is somewhat impressive that NVidia cracked 5 TFlops in double precision though.

    • Deanjo
    • 7 years ago

    Na it will be out before that if the manufacturing process is as mature as it is claimed to be.

    • ultima_trev
    • 7 years ago

    3584 shaders at 1.5 GHz. 610mm^2 die. 720 GB/s bandwidth. I’m sure the ROP and TMU count is equally monstrous. Mother of Jupiter…

    As much as I love AMDATIRTG, I highly doubt they’ll have anything that matches this before 2018.

    • maxxcool
    • 7 years ago

    If you are gaming on this card you should sell it and buy an iPhone.

    • maxxcool
    • 7 years ago

    I imagine they have stuck to their two tier sauce, juiced it up maybe. But we will have to see.

    • TwoEars
    • 7 years ago

    And when it does it will be in the Titan Über, priced at $1999 or something silly like that. I might get one later when it comes in the 1080 Ti version. But that’s going to take a while.

    • f0d
    • 7 years ago

    my guesstimate is that GP100 wont be in a gaming card until 2018 at the earliest
    🙁

    • chuckula
    • 7 years ago

    It’s more likely 600 mm^2 just for the die considering the performance and price points they are targeting. It is a monster of a chip that obviously isn’t going into a consumer PC anytime in the near future.

    • Firestarter
    • 7 years ago

    AMD’s Fiji GPU is about 600mm[super<]2[/super<] as well, excluding HBM

    • anotherengineer
    • 7 years ago

    Keep them both, I will stick to my old HD 6850 with its aftermarket Accelero S1 cooler and dual 120mm low-profile Scythe fans.

    49C in furmark and an inaudible 800 rpm.

    • anotherengineer
    • 7 years ago

    Get a radeon 😉

    • NTMBK
    • 7 years ago

    Fall 2017, maybe. This isn’t even hitting regular workstation and server until next year, only low volume $129,000 Nvidia proprietary servers. (And probably a few high priority HPC customers.)

    • drfish
    • 7 years ago

    I know – but I can still be bitter. 😉

    • NTMBK
    • 7 years ago

    It’s GTC, it’s aimed at CUDA developers.

    • NTMBK
    • 7 years ago

    Trash?! I’ll happily swap my 7770 for your “trash”.

    • drfish
    • 7 years ago

    Gee, thanks nVidia… <- all the gamers you built your business on

    • NTMBK
    • 7 years ago

    I suspect that number is including the area of the HBM2 and interposer.

    Edit: I take it back, it really is that big. Daaaamn.

    • chuckula
    • 7 years ago

    [quote<]GP100 includes full pre-emption support for compute tasks. [/quote<] Not 100% sure yet but this might be similar to AMD's "asynchronous shaders" that have generated some buzz recently.

    • Srsly_Bro
    • 7 years ago

    I don’t game much with working so much… But I want a Pascal Titan. After being on this trash 7950 for 3.5 years, I deserve a prize.

    • TwoEars
    • 7 years ago

    I wonder how many games will become cpu limited when that baby hits.

    • Deanjo
    • 7 years ago

    Big Pascal, probably in the fall. Little Pascal, maybe July.

    • Bensam123
    • 7 years ago

    Any predictions on when we’ll see desktop Pascal based on this announcement?

    • nanoflower
    • 7 years ago

    Samsung said that they started volume production earlier this year but all of the production is allocated to special clients like this one for the next few months. Samsung likely won’t be producing HBM2 for consumer cards until late summer early fall.

    • chuckula
    • 7 years ago

    HEY! JEN HSUN WENT ON FOR HOURS ABOUT AI SO DON’T GO AROUND KNOCKING OUR NEW ROBOTIC OVERLORDS!!

    • derFunkenstein
    • 7 years ago

    According to Chuckula, HBM2 is being produced in volume by other HBM2 chips, somehow. [url<]https://techreport.com/forums/viewtopic.php?f=3&t=117639#p1300662[/url<]

    • Helmore
    • 7 years ago

    In volume production today? I was under the impression that HBM2 wasn’t being produced in volume yet. Have I missed something?

    • anotherengineer
    • 7 years ago

    “Tesla P100 cards (and presumably the GP100 GPU) to its DGX-1 high-density HPC node system and HPC servers from IBM, Dell, and Cray. DGX-1 nodes will be available in June for $129,000, while servers from other manufacturers are expected to become available in the first quarter of 2017.”

    Hit the high-margin products.

    Then workstations…………………………..then gamers 😉

    • Firestarter
    • 7 years ago

    their first 16nm GPU will be a 610mm[super<]2[/super<] part? Sounds like we won't have to wait very long for some huge and fast consumer GPUs

    • chuckula
    • 7 years ago

    [quote<]GP100 also adds memory page faulting support, meaning it can launch kernels without synchronizing all of its managed memory allocations to the GPU first. [/quote<] You're welcome. -- i386.
