TR Forums

Waco · Tue May 07, 2019 9:59 am

https://www.hpcwire.com/2019/05/07/cray ... oak-ridge/

I'd heard rumors, but damn, that's a departure from the expected. It's still a stunt machine...but it's a $600M+ stunt with all AMD everything.

Tue May 07, 2019 10:10 am

Good for AMD. It is also really cool that Cray seems to be making a comeback.

One of my co-workers recently left to take a job at Cray. The reaction from the rest of my co-workers was very clearly divided along generational lines:

Most people in my age bracket: Cray, huh? Didn't they go out of business years ago? I didn't realize they were even still around!

Millennials: Who the f*ck is Cray?

Waco · Tue May 07, 2019 10:44 am

The coherent connection between the CPU, GPUs, and interconnect is interesting as well.

Cray certainly has lost the mindshare of the general computing crowd, but in their niche, they're very good.

chuckula · Tue May 07, 2019 11:50 am

With a 4 to 1 GPU to CPU ratio it's clearly an attempt to outdo Nvidia. We'll see how good the interconnect is and how AMD's programming model does vs. CUDA.

I'm at least happy that, like Aurora, this machine will have a relatively open programming model since most accelerated supercomputers outside of Xeon Phis are stuck in Nvidia land.

Mr Bill · Tue May 07, 2019 11:57 am

In my early college days we spoke in awed tones about the big Cray systems. This is a big opportunity for AMD and maybe will somewhat affect future developments in architecture.

chuckula · Tue May 07, 2019 12:01 pm

Mr Bill wrote:
In my early college days we spoke in awed tones about the big Cray systems. This is a big opportunity for AMD and maybe will somewhat affect future developments in architecture.

Seymour's corpse could power it if you strapped on some magnets and put him in a copper coil (he hated the concept of compute clusters)

MileageMayVary · Tue May 07, 2019 12:25 pm

Wonder how much they're going to be selling the old one for?

Anyone wanna go halfsies on a used supercomputer?

Topinio · Tue May 07, 2019 12:32 pm

chuckula wrote:
Seymour's corpse could power it if you strapped on some magnets and put him in a copper coil (he hated the concept of compute clusters)

Nope, don't think so. He was wrong to resist the revolution, so wrong he burned $300M pursuing the GaAs-based Cray-3, causing a spin-out and then bankruptcy after years of delay.

He gave up the resistance and tried to get with the picture in the end: SRC was not about Cray-style vector supercomputers and was working on a parallel cluster design when he died.

Waco · Tue May 07, 2019 1:00 pm

chuckula wrote:
With a 4 to 1 GPU to CPU ratio it's clearly an attempt to outdo Nvidia. We'll see how good the interconnect is and how AMD's programming model does vs. CUDA.

The interconnect isn't a one-off, so I wouldn't expect too much trouble there. Cray has been building interconnects for world-class machines for a few decades.

The programming model / environment, that's a real risk here. I'm sure it'll look great on AMD's balance sheets regardless.

chuckula · Tue May 07, 2019 2:05 pm

Waco wrote:
chuckula wrote:
With a 4 to 1 GPU to CPU ratio it's clearly an attempt to outdo Nvidia. We'll see how good the interconnect is and how AMD's programming model does vs. CUDA.

The interconnect isn't a one-off, so I wouldn't expect too much trouble there. Cray has been building interconnects for world-class machines for a few decades.

The programming model / environment, that's a real risk here. I'm sure it'll look great on AMD's balance sheets regardless.

Ok, the overuse of "infinity fabric" in the article was confusing since you are saying the internode connects are still Aries or whatever Cray is using in 2021. That makes more sense.

As for how much money AMD actually sees... That remains to be seen, especially when the contract was awarded to Cray and AMD is the sub. These HPC deals are great ways to show off for other customers more so than for directly making huge amounts of money.

Waco · Tue May 07, 2019 2:08 pm

chuckula wrote:
Ok, the overuse of "infinity fabric" in the article was confusing since you are saying the internode connects are still Aries or whatever Cray is using in 2021. That makes more sense.

As for how much money AMD actually sees... That remains to be seen, especially when the contract was awarded to Cray and AMD is the sub. These HPC deals are great ways to show off for other customers more so than for directly making huge amounts of money.

Slingshot is called out as the interconnect - I guess my point was that it's not being built just for this machine.

As for AMD - mindshare is great from wins like this; the income is a bonus.

dragontamer5788 · Tue May 07, 2019 3:42 pm

chuckula wrote:
With a 4 to 1 GPU to CPU ratio it's clearly an attempt to outdo Nvidia. We'll see how good the interconnect is and how AMD's programming model does vs. CUDA.

I'm at least happy that, like Aurora, this machine will have a relatively open programming model since most accelerated supercomputers outside of Xeon Phis are stuck in Nvidia land.

Summit seems to have a 3-to-1 GPU-to-CPU ratio, in case anyone is looking it up.

I'm not sure how happy you can be: ROCm/HCC is going HIP-only, which is their NVidia compatibility layer. So it's more like NVidia's CUDA model is going to be the way forward. I guess it's "open" in that AMD now has a CUDA-like interface to compete.

freebird · Tue May 07, 2019 3:56 pm

Waco wrote:
chuckula wrote:
Ok, the overuse of "infinity fabric" in the article was confusing since you are saying the internode connects are still Aries or whatever Cray is using in 2021. That makes more sense.

As for how much money AMD actually sees... That remains to be seen, especially when the contract was awarded to Cray and AMD is the sub. These HPC deals are great ways to show off for other customers more so than for directly making huge amounts of money.

Slingshot is called out as the interconnect - I guess my point was that it's not being built just for this machine.

As for AMD - mindshare is great from wins like this; the income is a bonus.

Cray's Shasta & Slingshot were announced last year and are the basis for the system being ordered by the National Energy Research Scientific Computing Center at Lawrence Berkeley National Laboratory. That system, though, was only going to cost $146 million and was leveraging Nvidia GPUs.
https://www.nextplatform.com/2018/10/30 ... computers/

I read Cray's publication on how Slingshot works last year with the announcement above and it sounds pretty impressive. I was wondering if Nvidia made the Mellanox purchase to try to "build their own" solution; maybe they and Intel plan on going their own route. I don't really think so, because from what I read the Cray Shasta is built modularly to allow Intel/AMD/Nvidia/FPGAs etc. to all work in one system/cluster.

"With Shasta, Cray can incorporate any processor choice — or a heterogeneous mix — with a single management and application development infrastructure. Customers can flex from single to multi-socket processor nodes, GPUs, FPGAs and other processing options that will emerge, such as AI specialized accelerators. Customers can make late-binding decisions on compute technology and not sacrifice capability, because Shasta’s design allows tailoring of system density and injection bandwidth to optimize price and performance."

Redocbew · Tue May 07, 2019 4:06 pm

dragontamer5788 wrote:
I'm not sure how happy you can be: ROCm/HCC is going HIP-only, which is their NVidia compatibility layer. So it's more like NVidia's CUDA model is going to be the way forward. I guess it's "open" in that AMD now has a CUDA-like interface to compete.

If it were me (and if I worked in HPC) I'd want to use HIP instead of spending an indeterminate amount of time and resources learning a stack that's essentially useless almost anywhere else. I guess they'll see how good the compatibility layer is, and figure it out one way or another.

chuckula · Tue May 07, 2019 9:13 pm

dragontamer5788 wrote:
chuckula wrote:
With a 4 to 1 GPU to CPU ratio it's clearly an attempt to outdo Nvidia. We'll see how good the interconnect is and how AMD's programming model does vs. CUDA.

I'm at least happy that, like Aurora, this machine will have a relatively open programming model since most accelerated supercomputers outside of Xeon Phis are stuck in Nvidia land.

Summit seems to have a 3-to-1 GPU-to-CPU ratio, in case anyone is looking it up.

I'm not sure how happy you can be: ROCm/HCC is going HIP-only, which is their NVidia compatibility layer. So it's more like NVidia's CUDA model is going to be the way forward. I guess it's "open" in that AMD now has a CUDA-like interface to compete.

Summit... and I'll just call this one the next Summit, appears to be a very GPU-focused supercomputer. So the trend of going from 3 to 4 GPUs per node isn't much of a surprise.

As for the programming model, I have heard of ROCm although I have no idea how popular it is. However, that "HIP" translation layer is somewhat new to me, and if this thing is going to be a success in 2021, then AMD better get a real software ecosystem whipped into shape in a hurry. Translating CUDA code might be a starting point, but it isn't going to be a winner in what is supposed to be a premier HPC system.

Waco · Tue May 07, 2019 9:20 pm

chuckula wrote:
...in what is supposed to be a premier HPC system.

Unfortunately it will likely spend its life running stunt codes and thousands of simultaneous small jobs that don't make much use of its interconnect horsepower or system scale.

It'll be a cool machine from a technology point of view, but for hard problems, it's going to be insanely inefficient. Modern machines are lucky to hit 1% on physics simulation jobs. GPU-based machines tend towards the 0.5% range. They will absolutely rip on dense structured computation, but few real-world problems devolve into that.

I'll admit my perspective is pretty slanted on this one since my site's focus is on getting to 2-3% efficiency versus chasing big numbers. I design/build/support the storage systems underlying machines like these and have a hand in the selection process, so it's tough to see design choices that are very clearly pushing towards the exhibition side of things versus getting things done.

/rant over
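
To put the percent-of-peak figures in the post above into absolute terms, here is a minimal sketch in Python. The peak rate is an assumed round number (1 exaFLOP/s) chosen for illustration, not a spec taken from the article, and `sustained_flops` is a hypothetical helper name.

```python
# Fraction-of-peak arithmetic for the percentages quoted above.
# PEAK_FLOPS is an assumed round number (1 exaFLOP/s), not a published spec.
PEAK_FLOPS = 1.0e18

def sustained_flops(peak_flops: float, efficiency_percent: float) -> float:
    """Return the sustained rate implied by a percent-of-peak efficiency."""
    return peak_flops * efficiency_percent / 100.0

for pct in (0.5, 1.0, 3.0):
    rate = sustained_flops(PEAK_FLOPS, pct)
    print(f"{pct:>4}% of peak -> {rate / 1e15:.0f} petaFLOP/s sustained")
```

Under that assumption, the gap between 0.5% and 3% is the gap between roughly 5 and 30 petaFLOP/s of useful work on the same hardware.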

just brew it! · Tue May 07, 2019 9:48 pm

Waco wrote:
chuckula wrote:
...in what is supposed to be a premier HPC system.

Unfortunately it will likely spend its life running stunt codes and thousands of simultaneous small jobs that don't make much use of its interconnect horsepower or system scale.

It'll be a cool machine from a technology point of view, but for hard problems, it's going to be insanely inefficient. Modern machines are lucky to hit 1% on physics simulation jobs. GPU-based machines tend towards the 0.5% range. They will absolutely rip on dense structured computation, but few real-world problems devolve into that.

I'll admit my perspective is pretty slanted on this one since my site's focus is on getting to 2-3% efficiency versus chasing big numbers. I design/build/support the storage systems underlying machines like these and have a hand in the selection process, so it's tough to see design choices that are very clearly pushing towards the exhibition side of things versus getting things done.

/rant over

Some things never change. I was involved in a similar stunt way back in the early '90s, when a few 10s of GigaFLOPs was considered world-class... and here we are nearly 3 decades later, talking about ExaFLOPs! We got our "fastest" crown (using LINPACK IIRC, though I may be mistaken about that, it has been a long time). We managed to hang on to that distinction for only a couple of months. Then we spent the next couple of years optimizing everything to enable our actual users to get useful work out of the system.

Was a fascinating project regardless. Over 600 Intel i860 RISC CPUs, grouped into 36 clusters with full crossbar connectivity within each cluster, and a hypercube-ish interconnect topology between the clusters. The thing was basically a bespoke Quantum Chromodynamics (Lattice QCD) engine for high-energy physics simulations, though I believe they tried running other workloads (e.g. CFD) on it as well.

Captain Ned · Tue May 07, 2019 10:23 pm

just brew it! wrote:
The thing was basically a bespoke Quantum Chromodynamics (Lattice QCD) engine for high-energy physics simulations, though I believe they tried running other workloads (e.g. CFD) on it as well.

I was under the impression that for certain "high-energy physics simulations", both domains are of "interest".

just brew it! · Tue May 07, 2019 10:27 pm

Captain Ned wrote:
just brew it! wrote:
The thing was basically a bespoke Quantum Chromodynamics (Lattice QCD) engine for high-energy physics simulations, though I believe they tried running other workloads (e.g. CFD) on it as well.

I was under the impression that for certain "high-energy physics simulations", both domains are of "interest".

The project I was involved with was explicitly a collabo with a group of QCD theorists, so that was the primary focus.

Kind of a "theory vs. practice" thing I guess. The ivory tower folks do QCD; the ones building the real-world gear are more interested in the CFD. :wink:

(Disclaimer: I'm not a physicist, theoretical or practical. I don't even play one on TV. I just worked with a bunch of 'em, for a few years early in my career! :lol: )

Waco · Tue May 07, 2019 11:03 pm

just brew it! wrote:
Some things never change. I was involved in a similar stunt way back in the early '90s, when a few 10s of GigaFLOPs was considered world-class... and here we are nearly 3 decades later, talking about ExaFLOPs! We got our "fastest" crown (using LINPACK IIRC, though I may be mistaken about that, it has been a long time). We managed to hang on to that distinction for only a couple of months. Then we spent the next couple of years optimizing everything to enable our actual users to get useful work out of the system.

Was a fascinating project regardless. Over 600 Intel i860 RISC CPUs, grouped into 36 clusters with full crossbar connectivity within each cluster, and a hypercube-ish interconnect topology between the clusters. The thing was basically a bespoke Quantum Chromodynamics (Lattice QCD) engine for high-energy physics simulations, though I believe they tried running other workloads (e.g. CFD) on it as well.

You'd be surprised (maybe not) to hear that the workload hasn't changed much. The scale has, and it's all 3D now, but the underlying compute patterns aren't terribly dissimilar.

dragontamer5788 · Tue May 07, 2019 11:13 pm

chuckula wrote:
As for the programming model, I have heard of ROCm although I have no idea how popular it is. However, that "HIP" translation layer is somewhat new to me, and if this thing is going to be a success in 2021, then AMD better get a real software ecosystem whipped into shape in a hurry. Translating CUDA code might be a starting point, but it isn't going to be a winner in what is supposed to be a premier HPC system.

"HIP" is a CUDA-like environment but with full access to AMD GCN Assembly if you need it.

AMD is clearly pushing for Tensorflow acceleration, and they've also been pushing BLAS libraries and other stuff (although they still seem behind NVidia. I dunno if it's a hardware or a software optimization problem). HIP seems like it'd work, even if it is a scrappy partial clone of CUDA with slightly different syntax.

OpenMP might be the other environment that will get decent support, but that's mostly LLVM. LLVM C++ OpenMP is getting better and better, and LLVM supports AMD GCN because of all of the ROCm patches AMD has contributed. OpenMP is rather high-level, but maybe that's a good thing.

just brew it! · Tue May 07, 2019 11:31 pm

Waco wrote:
You'd be surprised (maybe not) to hear that the workload hasn't changed much.

Yeah, not massively surprised. Simulations that can be represented as a grid/mesh of points aren't going away any time soon.

Waco wrote:
The scale has,

Larger scale == finer grid, and finer grid == more precision. Who doesn't want more precision in their simulations? :wink:

Waco wrote:
and it's all 3D now,

That trend had already started on the tail end of my stint in the HPC field (mid-'90s). One of the last projects I did at that job was a port of the in-house software tools to Cray's T3D architecture.

Waco wrote:
but the underlying compute patterns aren't terribly dissimilar.

This is actually somewhat reassuring from an employment security standpoint. If push came to shove, I could probably jump back to the HPC industry without too much trouble! :lol:
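
As a side note on the "larger scale == finer grid" remark above, the cost of that extra precision in 3D can be sketched in a few lines of Python. The starting resolution is an assumed, purely illustrative value, and `grid_points` is a hypothetical helper name.

```python
# Cost of refining a uniform 3D grid: halving the spacing doubles the
# point count along each axis, so the total grows by 2**3 = 8.
def grid_points(points_per_axis: int, dims: int = 3) -> int:
    """Total points in a uniform grid with the same resolution on every axis."""
    return points_per_axis ** dims

base = 256  # assumed starting resolution, purely illustrative
for refinements in range(4):
    n = base * 2 ** refinements
    total = grid_points(n)
    print(f"{n:>5} points/axis -> {total:,} points "
          f"({total // grid_points(base)}x the base grid)")
```

Each halving of the grid spacing multiplies the point count (and, roughly, the memory footprint) by 8, before accounting for any smaller time step the finer grid may require.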

dragontamer5788 · Wed May 08, 2019 12:54 pm

Waco wrote:
chuckula wrote:
...in what is supposed to be a premier HPC system.

Unfortunately it will likely spend its life running stunt codes and thousands of simultaneous small jobs that don't make much use of its interconnect horsepower or system scale.

It'll be a cool machine from a technology point of view, but for hard problems, it's going to be insanely inefficient. Modern machines are lucky to hit 1% on physics simulation jobs. GPU-based machines tend towards the 0.5% range. They will absolutely rip on dense structured computation, but few real-world problems devolve into that.

I'll admit my perspective is pretty slanted on this one since my site's focus is on getting to 2-3% efficiency versus chasing big numbers. I design/build/support the storage systems underlying machines like these and have a hand in the selection process, so it's tough to see design choices that are very clearly pushing towards the exhibition side of things versus getting things done.

It seems like there are a variety of problems that have high arithmetic intensity, which would benefit from GPUs. The other problems would have high memory intensity, which would benefit from HBM2... AMD's GPU has both high RAM bandwidth AND support for high arithmetic intensity. NVidia's stuff is perhaps better balanced, but it's hard for me to tell if it's due to superior code optimizations or superior low-level decisions by NVidia.

In either case, your two major bottlenecks (raw FLOPs and raw memory speeds) are basically solved by the GPU platform. I know there are some Japanese supercomputers (e.g. Fujitsu) which are using HBM2 and would probably be better for memory-hard problems... but I don't think GPUs are necessarily "bad" at that kind of problem.

With all of the hype going into even higher arithmetic intensity problems (FP16 convolutional neural networks), it only makes sense to advertise more FLOPs, even if most other workloads don't benefit from those advancements.

-------

With regards to cross-thread communication, GPUs have "Shared Memory", and therefore support better cross-thread communication. NVidia GPUs have 32 threads per warp, AMD GPUs have 64 threads per wavefront, and thread communication on AMD GPUs within a wavefront is as easy as reading or writing to an SGPR. All 108 SGPR registers are "shared" between all 64 SIMD threads; the assembly language itself supports inter-thread communication at the register level.

AMD GPU Shared memory supports up to 1024 SIMD-threads to communicate in an area of RAM roughly as fast as L1 cache.

So basically, GPUs have superior cross-thread communication capabilities, they have superior memory bandwidth, and finally they achieve superior arithmetic intensity. Three very big advantages.

------

Not saying CPUs are going to lose all fights. CPUs win in latency and clock speed. CPUs also access more RAM per thread: orders of magnitude more RAM at much lower latencies. If a thread needs ~2MB of "hot" data, it will be faster (lower latency) on a CPU than on a GPU. But if you need a huge amount of raw calculations, done on very fast RAM, with high cross-thread communications... you really can't beat a GPU.

The biggest issue is memory size. GPUs can only dedicate 150 bytes per thread before they run out of register space / L1 cache, and maybe 2MB per thread before they run out of main memory (!!). In contrast, CPUs have ~200 reorder-buffer registers (effectively 1600 bytes of register space), 64kB of L1 data + 64kB of L1 code, somewhere on the order of 2MB+ of L3, and many GBs of main-memory space.

Yeah, there are only ~16 or ~32 registers in the CPU architecture. But all of those reorder buffers kinda count for something: speeding up code automatically and finding parallelism at the assembly level. So I'll count those reorder-buffer registers as high-speed CPU storage space.
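
The per-thread storage figures above can be reproduced with some back-of-the-envelope arithmetic. This is only a sketch in Python: the 8-byte register width, the 4096-lane GPU, and the 8 GiB memory pool are assumptions chosen because they land on the numbers quoted, and whether they are the inputs the post had in mind is a guess. `bytes_per_thread` is a hypothetical helper name.

```python
# Back-of-the-envelope per-thread storage, following the figures in the post.
# Assumptions (not from the post): 64-bit (8-byte) registers on the CPU side,
# and a GPU with 4096 SIMD lanes sharing 8 GiB of memory, one thread per lane.
ROB_REGISTERS = 200        # "~200 reorder-buffer registers" from the post
BYTES_PER_REGISTER = 8     # assumed 64-bit registers

def bytes_per_thread(total_bytes: int, threads: int) -> int:
    """Evenly split a memory pool across threads (integer bytes)."""
    return total_bytes // threads

cpu_register_bytes = ROB_REGISTERS * BYTES_PER_REGISTER
gpu_main_memory_share = bytes_per_thread(8 * 1024**3, 4096)

print(f"CPU reorder-buffer space: {cpu_register_bytes} bytes")
print(f"GPU main memory per thread: {gpu_main_memory_share // 1024**2} MiB")
```

Running more than one thread per lane to hide latency, as GPUs normally do, would shrink the per-thread share further.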

Waco · Wed May 08, 2019 1:20 pm

dragontamer5788 wrote:
It seems like there are a variety of proble.......sniiiiip.......ething: speeding up code automatically and finding parallelism at the assembly level. So I'll count those reorder-buffer registers as high-speed CPU storage space.

Excellent summary! There are classes of problems that don't fit on GPUs, though, and that's the hard set for GPU compute.

GPU compute is great for dense problems that fit on their local RAM stacks. For sparse problems that are latency sensitive and have sections of dense compute, GPUs are a pretty difficult fit. The FLOP/memory bandwidth ratio is pretty poor on GPUs (the raw bandwidth is assuredly better, but the compute density outweighs that advantage pretty easily), and the latency/bandwidth hit for reaching beyond the local memory to the CPU memory (or remote memory on another node) is pretty constraining.

I'm not saying GPUs are bad, I'm saying GPUs are bad for particular workloads that happen to be the same ones that I support in my day job.
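
The FLOP/memory-bandwidth argument in this exchange is often illustrated with a roofline-style estimate. The sketch below is in Python and uses made-up hardware numbers (both "devices" are hypothetical, as are the function names); it also ignores latency entirely, which the post above calls out as a separate constraint.

```python
# Roofline-style estimate: attainable FLOP/s is capped by either peak compute
# or by memory bandwidth times arithmetic intensity (FLOPs per byte moved).
# All hardware numbers below are hypothetical, chosen only to show the shape.
def attainable_flops(peak_flops: float, bandwidth_bytes: float,
                     intensity: float) -> float:
    """Upper bound on sustained FLOP/s at a given arithmetic intensity."""
    return min(peak_flops, bandwidth_bytes * intensity)

def machine_balance(peak_flops: float, bandwidth_bytes: float) -> float:
    """FLOPs per byte needed before the device becomes compute-bound."""
    return peak_flops / bandwidth_bytes

devices = {
    "hypothetical GPU": (10.0e12, 1.0e12),   # 10 TFLOP/s, 1 TB/s
    "hypothetical CPU": (1.0e12, 0.2e12),    # 1 TFLOP/s, 200 GB/s
}

for name, (peak, bw) in devices.items():
    print(f"{name}: balance = {machine_balance(peak, bw):.1f} FLOP/byte")
    for intensity in (0.25, 4.0, 64.0):      # sparse-ish to dense
        frac = attainable_flops(peak, bw, intensity) / peak
        print(f"  intensity {intensity:>5}: {frac:.1%} of peak")
```

With these made-up numbers the GPU still delivers more absolute FLOP/s at low intensity thanks to its higher bandwidth, but it reaches a smaller fraction of its peak than the CPU does, which is one way to read the low efficiency percentages discussed earlier in the thread.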

dragontamer5788 · Wed May 08, 2019 1:46 pm

Waco wrote:
dragontamer5788 wrote:
It seems like there are a variety of proble.......sniiiiip.......ething: speeding up code automatically and finding parallelism at the assembly level. So I'll count those reorder-buffer registers as high-speed CPU storage space.

Excellent summary! There are classes of problems that don't fit on GPUs, though, and that's the hard set for GPU compute.

GPU compute is great for dense problems that fit on their local RAM stacks. For sparse problems that are latency sensitive and have sections of dense compute, GPUs are a pretty difficult fit. The FLOP/memory bandwidth ratio is pretty poor on GPUs (the raw bandwidth is assuredly better, but the compute density outweighs that advantage pretty easily), and the latency/bandwidth hit for reaching beyond the local memory to the CPU memory (or remote memory on another node) is pretty constraining.

I'm not saying GPUs are bad, I'm saying GPUs are bad for particular workloads that happen to be the same ones that I support in my day job.

Out of curiosity, has your group ever worked with alternative accelerators?

Xeon Phi and PEZY are two "CPU-like" accelerators that seemed to be going for a sort of "CPU-like architecture", for different workloads compared to GPUs. I never worked with them (heck, I only really do CPU / GPU programming for fun hobby projects). But it would seem that on paper at least, Xeon Phi should have done better in "sparse compute", so to speak.

EDIT: Xeon Phi is interesting to me, but it seems like they're dying off and that just buying a powerful standard CPU would be more useful in practice (aka: Threadripper / EPYC). Maybe rent out some cloud-compute when I need it instead of buying hardware.

Waco · Wed May 08, 2019 3:21 pm

Yup.

Half of our current flagship machine is ~9000 Xeon Phi (Knights Landing) nodes. They're much closer to a CPU than a GPU, and we did get a few landmark runs on them that weren't possible on prior machines. They have other drawbacks, but porting to them is incredibly easy for many things compared to porting to GPUs.

Before that, we ran on the IBM Cell (similar to the PS3 Cell chip) on Roadrunner.

Topinio · Wed May 08, 2019 4:30 pm

dragontamer5788 wrote:
EDIT: Xeon Phi is interesting to me, but it seems like they're dying off and that just buying a powerful standard CPU would be more useful in practice

Lol, my only Phi (7120P, Knights Corner) is sitting on my desk like the beautiful blue paperweight it was born to be

Captain Ned · Wed May 08, 2019 4:45 pm

Waco wrote:
Yup. Half of our current flagship machine is ~9000 Xeon Phi (Knights Landing) nodes.

Cray XC40??

Waco · Wed May 08, 2019 5:19 pm

Captain Ned wrote:
Waco wrote:
Yup. Half of our current flagship machine is ~9000 Xeon Phi (Knights Landing) nodes.

Cray XC40??

Indeed. https://www.top500.org/system/178610

cegras · Fri May 10, 2019 1:22 pm

Can you elaborate on what kind of software you run? I'm just a lowly electronic structure guy, so most of my jobs are single node.


AMD wins big for DOE -
