
Topic Author
Posts: 80
Joined: Wed Jul 09, 2014 4:15 pm
Location: England

Pascal FP16 rate question

Wed Aug 16, 2017 5:05 am


I have a question for people who are knowledgeable on the subject: Pascal's FP16 rate on consumer GPUs. I know that GP100 can run two FP16 ops in the same time as one FP32 op, but as far as I'm aware GP102, GP104, GP106 and GP108 cannot. On those chips FP16 runs at 1/64 of the FP32 rate.

My question is: is this a hardware limitation or a software one? I was talking to someone about it and they seem to think that consumer Pascal GPUs can do a 2:1 FP16:FP32 rate, but that Nvidia just hasn't enabled it in the drivers. I note that even the Tesla P40 cannot do 2x FP16.

Ryzen 7 1800X @ 4.025 GHz | 16GB @ 3200 MHz C14 | MSI B350 PC MATE| Asus Turbo GTX 1070 Ti | Samsung Polaris-thingy M.2 SSD | Samsung 850 PRO 128GB | 3x 1TB HDDs |650W Seasonic G-Series Gold |NZXT H440 |Creative Soundblaster Z
Maximum Gerbil
Posts: 4425
Joined: Fri Apr 09, 2004 3:49 pm
Location: Europe, most frequently London.

Re: Pascal FP16 rate question

Wed Aug 16, 2017 6:56 am

Look at whether the Quadros can do double-rate FP16. If they can, then it's just drivers.

P6000 = 1080Ti
P5000 = 1080
P4000 = 1070
P2000 = cut-down 1060
P1000 = 1050

At first glance, their double-precision specs are approximately 1/32 of the FP32 rate, if they're listed at all, and the hardware is not listed as having separate FP16 and FP32 speeds, which makes me think that FP32 is their maximum rate. Only GP100 and Vega can run FP16 faster than FP32; Polaris has native FP16 but runs it at the same rate as FP32.

The thing about the Quadro GP100 is that it's a completely different design with a very different core config and memory architecture, built specifically for compute, so you can't really compare it to the rest of the Pascal GeForce or Quadro lines.
Last edited by Chrispy_ on Wed Aug 16, 2017 7:09 am, edited 2 times in total.
Congratulations, you've noticed that this year's signature is based on outdated internet memes; CLICK HERE NOW to experience this unforgettable phenomenon. This sentence is just filler and as irrelevant as my signature.
Gerbil XP
Posts: 304
Joined: Sat Dec 21, 2013 11:21 am

Re: Pascal FP16 rate question

Wed Aug 16, 2017 7:05 am

It's hardware.

As it turns out, when it comes to FP16 NVIDIA has made another significant divergence between the HPC-focused GP100, and the consumer-focused GP104. On GP100, these FP16x2 cores are used throughout the GPU as both the GPU’s primary FP32 core and primary FP16 core. However on GP104, NVIDIA has retained the old FP32 cores. The FP32 core count as we know it is for these pure FP32 cores. What isn’t seen in NVIDIA’s published core counts is that the company has built in the FP16x2 cores separately.

To get right to the point then, each SM on GP104 only contains a single FP16x2 core. This core is in turn only used for executing native FP16 code (i.e. CUDA code). It’s not used for FP32, and it’s not used for FP16 on APIs that can’t access the FP16x2 cores (and as such promote FP16 ops to FP32). The lack of a significant number of FP16x2 cores is why GP104’s FP16 CUDA performance is so low as listed above. There is only 1 FP16x2 core for every 128 FP32 cores. ... n-review/5
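
As a rough sanity check, the 1/64 figure from the first post falls straight out of the core counts quoted above. The sketch below is only back-of-the-envelope arithmetic: the function name is made up for illustration, and it assumes each FP16x2 core retires two FP16 ops per clock while each FP32 core retires one FP32 op per clock.

```python
from fractions import Fraction


def fp16_to_fp32_ratio(fp32_cores_per_sm: int, fp16x2_cores_per_sm: int) -> Fraction:
    """Peak FP16 rate relative to peak FP32 rate for one SM.

    Assumption (not from NVIDIA documentation): one FP16x2 core issues
    two FP16 ops per clock, one FP32 core issues one FP32 op per clock.
    """
    fp16_ops_per_clock = 2 * fp16x2_cores_per_sm
    fp32_ops_per_clock = fp32_cores_per_sm
    return Fraction(fp16_ops_per_clock, fp32_ops_per_clock)


# GP104 per the quoted article: 1 FP16x2 core for every 128 FP32 cores.
gp104 = fp16_to_fp32_ratio(fp32_cores_per_sm=128, fp16x2_cores_per_sm=1)

# GP100: every FP32 core is itself an FP16x2 core, so the counts are equal
# (64 per SM is used here; the ratio does not depend on the exact count).
gp100 = fp16_to_fp32_ratio(fp32_cores_per_sm=64, fp16x2_cores_per_sm=64)

print(f"GP104 FP16:FP32 = {gp104}")
print(f"GP100 FP16:FP32 = {gp100}")
```

So under those assumptions, one FP16x2 core per 128 FP32 cores is exactly where 1/64 comes from, and no driver setting can conjure up cores that aren't on the die.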
Gold subscriber
Gerbil Jedi
Posts: 1870
Joined: Wed Jan 23, 2008 9:18 pm
Location: Probably where I don't belong.

Re: Pascal FP16 rate question

Wed Aug 16, 2017 7:07 am

I'm going to go with it being at least partially hardware, based on your linked article. It's true that Nvidia turns off some features in consumer-grade parts via drivers, but the Teslas are going to turn on everything that's in the silicon.

One thing that really surprised me when the P100 was launched is that its 64-bit ALUs are completely separate silicon that does not include the 32-bit ALUs. That's unlike how you'd expect an x86 CPU to behave, where the 32-bit (and 16-bit) hardware is just a subset of a single core that also includes the additional transistors needed for 64-bit instructions. Given that level of distinct hardware, it wouldn't shock me if GP102 just doesn't implement the extra logic needed for double-rate FP16 instructions. It's not necessarily something that can be enabled via a driver either: there are subtle but important changes needed in hardware to make it work. Just one example (and I'm sure there are plenty more) is that the carry look-ahead logic for adding two numbers together has to differ between a single 32-bit ALU adding 32-bit values and a 2x 16-bit ALU adding two pairs of 16-bit numbers.
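
To make the carry point concrete, here's a small integer-only sketch (not how an FP16 unit actually works internally, and the function names are mine). It emulates in software what a split adder does in hardware: in a plain 32-bit add, a carry out of bit 15 ripples into bit 16, while a 2x 16-bit add has to cut the carry chain at that boundary.

```python
LANE_LOW_BITS = 0x7FFF7FFF  # low 15 bits of each 16-bit lane
LANE_TOP_BITS = 0x80008000  # top bit of each 16-bit lane
MASK32 = 0xFFFFFFFF


def add32(a: int, b: int) -> int:
    """Ordinary 32-bit add: the carry chain runs across all 32 bits."""
    return (a + b) & MASK32


def add16x2(a: int, b: int) -> int:
    """Two independent 16-bit adds packed in one 32-bit word.

    Add the low 15 bits of each lane first, so no carry can leave a lane,
    then fold in each lane's top bit with XOR (an add without carry-out).
    """
    partial = (a & LANE_LOW_BITS) + (b & LANE_LOW_BITS)
    return (partial ^ ((a ^ b) & LANE_TOP_BITS)) & MASK32


# High lane = 0x0001, low lane = 0xFFFF; add 1 to the low lane.
a, b = 0x0001FFFF, 0x00000001
print(f"add32   -> {add32(a, b):#010x}")    # carry leaks into the high lane
print(f"add16x2 -> {add16x2(a, b):#010x}")  # low lane wraps, high lane untouched
```

Real hardware gates the carry at the lane boundary instead of masking, but either way it's extra logic that has to exist in the ALU. Same bits in, different answer out, and a driver can't change which adder the silicon has.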
4770K @ 4.7 GHz; 32GB DDR3-2133; GTX-1080 sold and back to hipster IGP!; 512GB 840 Pro (2x); Fractal Define XL-R2; NZXT Kraken-X60
--Many thanks to the TR Forum for advice in getting it built.
