Pascal makes its debut on Nvidia's Tesla P100 HPC card

— 1:39 PM on April 5, 2016

During his GTC keynote today, Nvidia CEO Jen-Hsun Huang introduced the company's Pascal GP100 graphics-processing unit on board the Tesla P100 high-performance computing (HPC) accelerator. The 610 mm² GP100 is built on TSMC's 16-nm FinFET process. It uses 15.3 billion transistors paired with 16GB of HBM2 RAM to deliver 5.3 teraflops of FP64 performance, 10.6 TFLOPS for FP32, and 21.2 TFLOPS for FP16.

In its GP100 form, Pascal includes 3584 stream processors (or CUDA cores) spread across 56 streaming multiprocessor (or SM) units, out of a potential total of 60. The chip uses a 1328MHz base clock and a 1480MHz boost clock. Its 4096-bit path to memory offers a claimed 720 GB/s peak bandwidth, and it slots into a 300W TDP.

A block diagram of the GP100 GPU. Source: Nvidia
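
Those throughput figures follow directly from the shader count and boost clock quoted above. As a quick sanity check (a host-only sketch; the factor of two counts a fused multiply-add as two operations, and GP100's 1:2 FP64 and 2:1 FP16 rates relative to FP32 are taken from Nvidia's numbers):

```cuda
// Back-of-the-envelope check of GP100's quoted peak-throughput figures.
// Each CUDA core can retire one fused multiply-add (FMA) per clock,
// which conventionally counts as two floating-point operations.
#include <cstdio>

int main() {
    const double cuda_cores = 3584;
    const double boost_ghz  = 1.480;  // boost clock, GHz
    const double fp32_tflops = cuda_cores * 2 * boost_ghz / 1000.0;
    std::printf("FP32: %.1f TFLOPS\n", fp32_tflops);      // ~10.6
    std::printf("FP64: %.1f TFLOPS\n", fp32_tflops / 2);  // 1:2 rate, ~5.3
    std::printf("FP16: %.1f TFLOPS\n", fp32_tflops * 2);  // 2:1 rate, ~21.2
    return 0;
}
```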

Each GP100 SM comprises 64 single-precision stream processors, down from 128 SPs in Maxwell and 192 in Kepler. Those 64 SPs are further partitioned into two processing blocks of 32 SPs. Each processing block, in turn, has its own instruction buffer, warp scheduler, and pair of dispatch units. Nvidia says each GP100 SM has the same register file size as a Maxwell SM. GP100 has more SMs, and therefore more registers in total, than Maxwell does, though, so Nvidia says the chip can keep more threads, warps, and thread blocks in flight than its older GPUs could.

A GP100 SM. Source: Nvidia
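
For readers who want to poke at these limits themselves, the per-SM resources Nvidia is describing are visible through the CUDA runtime's device-query API. A minimal sketch, assuming a single P100 installed as device 0 and with error handling omitted:

```cuda
// Query the per-SM resources that bound how many warps and thread
// blocks a GPU can keep in flight (compile with nvcc).
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    std::printf("SMs:                %d\n", prop.multiProcessorCount);    // 56 on Tesla P100
    std::printf("Warp size:          %d\n", prop.warpSize);               // 32
    std::printf("Registers per SM:   %d\n", prop.regsPerMultiprocessor);  // same as Maxwell
    std::printf("Max threads per SM: %d\n", prop.maxThreadsPerMultiProcessor);
    return 0;
}
```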

GP100 includes full pre-emption support for compute tasks. It also uses an improved unified memory architecture to simplify its programming model. GP100's 49-bit virtual address space lets programs address the full address spaces of both the CPU and the GPU. Older Tesla accelerators could only offer a shared address space as large as the memory on board the GPU.
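
In CUDA terms, this is the managed-memory model: a single pointer valid in both host and device code, with the runtime handling placement. A minimal sketch, assuming CUDA 8's cudaMallocManaged; the array size here is arbitrary:

```cuda
// Unified memory sketch: one allocation addressed by both CPU and GPU.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *data;
    cudaMallocManaged(&data, n * sizeof(float));  // one pointer, both address spaces
    for (int i = 0; i < n; i++) data[i] = 1.0f;   // CPU writes through it directly
    scale<<<(n + 255) / 256, 256>>>(data, n);     // GPU reads and writes the same pointer
    cudaDeviceSynchronize();
    std::printf("data[0] = %.1f\n", data[0]);     // 2.0
    cudaFree(data);
    return 0;
}
```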

GP100 also adds support for memory page faulting, meaning it can launch kernels without first synchronizing all of its managed memory allocations to the GPU. Instead, if a kernel tries to access a page of memory that isn't resident, it will fault, and the page will be synchronized with the GPU on demand. Faulting pages can also be mapped for access over a PCIe or NVLink interconnect in systems with multiple GPUs. Nvidia also says page faulting support guarantees global data coherency across the new unified memory model, allowing CPUs and GPUs to access shared memory locations at the same time.
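
A sketch of what that buys in practice: the kernel below touches only a small slice of a large managed buffer, so on Pascal only the faulted pages would migrate, rather than the whole allocation being synchronized at launch as on older GPUs (the buffer and slice sizes here are arbitrary):

```cuda
// On-demand paging sketch: touch a 64KB slice of a 1GB managed buffer.
#include <cuda_runtime.h>

__global__ void touch_slice(char *buf, size_t offset, size_t len) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < len) buf[offset + i] = 1;  // first access to each page faults it in
}

int main() {
    char *buf;
    cudaMallocManaged(&buf, 1ull << 30);           // 1GB managed buffer
    touch_slice<<<256, 256>>>(buf, 0, 64 * 1024);  // kernel launches immediately
    cudaDeviceSynchronize();
    cudaFree(buf);
    return 0;
}
```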

Tesla P100 cards also support the NVLink interconnect for high-speed inter-GPU communication in multi-GPU HPC systems. NVLink supports the GP100 GPU's ISA, meaning one GPU can execute instructions directly on data residing in the memory of another GPU in an NVLink mesh. The topologies of these meshes vary depending on the CPUs in the host system. A dual-socket Intel server might talk to graphics cards in meshes using PCIe switches, while NVLink-compatible IBM Power CPUs can communicate with the mesh directly.
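
At the API level, this shows up as ordinary CUDA peer access; whether the resulting loads and stores travel over PCIe or NVLink is transparent to the code. A minimal sketch, assuming two peer-capable GPUs at device IDs 0 and 1:

```cuda
// Peer-to-peer sketch: GPU 0 dereferences memory that lives on GPU 1.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void increment(int *remote) { *remote += 1; }

int main() {
    int can = 0;
    cudaDeviceCanAccessPeer(&can, 0, 1);  // can GPU 0 reach GPU 1's memory?
    if (!can) { std::printf("no peer path between GPUs 0 and 1\n"); return 1; }

    cudaSetDevice(1);
    int *on_gpu1;
    cudaMalloc(&on_gpu1, sizeof(int));
    cudaMemset(on_gpu1, 0, sizeof(int));

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);  // map GPU 1's memory into GPU 0's space
    increment<<<1, 1>>>(on_gpu1);      // GPU 0 writes directly into GPU 1's memory
    cudaDeviceSynchronize();

    int result;
    cudaMemcpy(&result, on_gpu1, sizeof(int), cudaMemcpyDefault);
    std::printf("value on GPU 1: %d\n", result);  // 1
    cudaFree(on_gpu1);
    return 0;
}
```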

Nvidia says it's producing Tesla P100s in volume today. The company says it's devoting its entire production of Tesla P100 cards (and presumably the GP100 GPU) to its DGX-1 high-density HPC node systems and HPC servers from IBM, Dell, and Cray. DGX-1 nodes will be available in June for $129,000, while servers from other manufacturers are expected to become available in the first quarter of 2017.
