I'd heard rumors, but damn, that's a departure from the expected. It's still a stunt machine...but it's a $600M+ stunt with all AMD everything.

Mr Bill wrote:In my early college days we spoke in awed tones about the big Cray systems. This is a big opportunity for AMD, and it may somewhat affect future architectural developments.
chuckula wrote:Seymour's corpse could power it if you strapped on some magnets and put him in a copper coil (he hated the concept of compute clusters)
chuckula wrote:With a 4 to 1 GPU to CPU ratio it's clearly an attempt to outdo Nvidia. We'll see how good the interconnect is and how AMD's programming model does vs. CUDA.
Waco wrote:chuckula wrote:With a 4 to 1 GPU to CPU ratio it's clearly an attempt to outdo Nvidia. We'll see how good the interconnect is and how AMD's programming model does vs. CUDA.
The interconnect isn't a one-off, so I wouldn't expect too much trouble there. Cray has been building interconnects for world-class machines for a few decades.
The programming model / environment, that's a real risk here. I'm sure it'll look great on AMD's balance sheet regardless.
chuckula wrote:Ok, the overuse of "Infinity Fabric" in the article was confusing since you are saying the internode connects are still Aries or whatever Cray is using in 2021. That makes more sense.
As for how much money AMD actually sees... That remains to be seen, especially when the contract was awarded to Cray and AMD is the sub. These HPC deals are great ways to show off for other customers more so than for directly making huge amounts of money.
chuckula wrote:With a 4 to 1 GPU to CPU ratio it's clearly an attempt to outdo Nvidia. We'll see how good the interconnect is and how AMD's programming model does vs. CUDA.
I'm at least happy that, like Aurora, this machine will have a relatively open programming model since most accelerated supercomputers outside of Xeon Phis are stuck in Nvidia land.
Waco wrote:chuckula wrote:Ok, the overuse of "Infinity Fabric" in the article was confusing since you are saying the internode connects are still Aries or whatever Cray is using in 2021. That makes more sense.
As for how much money AMD actually sees... That remains to be seen, especially when the contract was awarded to Cray and AMD is the sub. These HPC deals are great ways to show off for other customers more so than for directly making huge amounts of money.
Slingshot is called out as the interconnect - I guess my point was that it's not being built just for this machine.
As for AMD - mindshare is great from wins like this; the income is a bonus.
dragontamer5788 wrote:chuckula wrote:With a 4 to 1 GPU to CPU ratio it's clearly an attempt to outdo Nvidia. We'll see how good the interconnect is and how AMD's programming model does vs. CUDA.
I'm at least happy that, like Aurora, this machine will have a relatively open programming model since most accelerated supercomputers outside of Xeon Phis are stuck in Nvidia land.
Summit seems to be a 3-to-1 GPU-to-CPU ratio, in case anyone is looking it up.
I'm not sure how happy you can be: ROCm/HCC is going HIP-only, which is their NVidia compatibility layer. So it's more like NVidia's CUDA model is going to be the way forward. I guess it's "open" in that AMD now has a CUDA-like interface to compete.
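For anyone wondering what the HIP model actually looks like in practice, here's a minimal vector-add sketch (all names and sizes illustrative): HIP mirrors the CUDA runtime API almost call-for-call, which is what makes mechanical CUDA-to-HIP translation plausible at all.

```cpp
// Minimal HIP vector-add sketch -- illustrative only.
// Note how close this is to CUDA: hipMalloc vs cudaMalloc, hipMemcpy vs
// cudaMemcpy, the same __global__/threadIdx builtins in the kernel.
#include <hip/hip_runtime.h>
#include <cstdio>

__global__ void vec_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // same builtins as CUDA
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *ha = new float[n], *hb = new float[n], *hc = new float[n];
    for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

    float *da, *db, *dc;
    hipMalloc((void**)&da, bytes);
    hipMalloc((void**)&db, bytes);
    hipMalloc((void**)&dc, bytes);
    hipMemcpy(da, ha, bytes, hipMemcpyHostToDevice);
    hipMemcpy(db, hb, bytes, hipMemcpyHostToDevice);

    dim3 block(256), grid((n + 255) / 256);
    hipLaunchKernelGGL(vec_add, grid, block, 0, 0, da, db, dc, n);

    hipMemcpy(hc, dc, bytes, hipMemcpyDeviceToHost);
    printf("c[0] = %f\n", hc[0]);  // expect 3.0

    hipFree(da); hipFree(db); hipFree(dc);
    delete[] ha; delete[] hb; delete[] hc;
    return 0;
}
```

ROCm's hipify tools do roughly this renaming automatically on existing CUDA sources, which is the "compatibility layer" being discussed above.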
Waco wrote:chuckula wrote:...in what is supposed to be a premier HPC system.
Unfortunately it will likely spend its life running stunt codes and thousands of simultaneous small jobs that don't make much use of its interconnect horsepower or system scale.
It'll be a cool machine from a technology point of view, but for hard problems, it's going to be insanely inefficient. Modern machines are lucky to hit 1% of peak on physics simulation jobs, and GPU-based machines tend toward the 0.5% range. They will absolutely rip on dense structured computation, but few real-world problems devolve into that.
I'll admit my perspective is pretty slanted on this one since my site's focus is on getting to 2-3% efficiency versus chasing big numbers. I design/build/support the storage systems underlying machines like these and have a hand in the selection process, so it's tough to see design choices that are very clearly pushing toward the exhibition side of things versus getting things done.
/rant over
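To put rough numbers behind those efficiency figures, here's a back-of-the-envelope roofline sketch. All of the inputs are hypothetical, not specs of the machine in the article; the point is just that a low-arithmetic-intensity code is memory-bound long before it's compute-bound, which is how sub-1%-of-peak happens.

```cpp
// Back-of-the-envelope roofline estimate -- all numbers illustrative.
// Sustained FLOPS is bounded by min(peak_flops, intensity * mem_bandwidth);
// codes with little arithmetic per byte moved land at a tiny fraction of
// peak, which is where the ~1% / ~0.5% anecdotes above come from.
#include <algorithm>
#include <cstdio>

int main() {
    double peak_tflops = 20.0;  // hypothetical GPU peak, TFLOPS
    double bw_tbs      = 1.0;   // hypothetical HBM bandwidth, TB/s
    double intensity   = 0.25;  // FLOPs per byte for a sparse/stencil-ish code

    double sustained = std::min(peak_tflops, intensity * bw_tbs);
    printf("sustained: %.2f TFLOPS (%.2f%% of peak)\n",
           sustained, 100.0 * sustained / peak_tflops);
    // -> 0.25 TFLOPS, ~1.25% of peak: memory-bound long before compute-bound
    return 0;
}
```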
Captain Ned wrote:just brew it! wrote:The thing was basically a bespoke Quantum Chromodynamics (Lattice QCD) engine for high-energy physics simulations, though I believe they tried running other workloads (e.g. CFD) on it as well.
I was under the impression that for certain "high-energy physics simulations", both domains are of "interest".
just brew it! wrote:Some things never change. I was involved in a similar stunt way back in the early '90s, when a few tens of gigaFLOPS was considered world-class... and here we are nearly 3 decades later, talking about exaFLOPS! We got our "fastest" crown (using LINPACK, IIRC, though I may be mistaken about that; it has been a long time). We managed to hang on to that distinction for only a couple of months. Then we spent the next couple of years optimizing everything to enable our actual users to get useful work out of the system.
Was a fascinating project regardless. Over 600 Intel i860 RISC CPUs, grouped into 36 clusters with full crossbar connectivity within each cluster, and a hypercube-ish interconnect topology between the clusters. The thing was basically a bespoke Quantum Chromodynamics (Lattice QCD) engine for high-energy physics simulations, though I believe they tried running other workloads (e.g. CFD) on it as well.
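For anyone who hasn't bumped into hypercube interconnects before, the addressing trick is neat enough to sketch: in a d-dimensional hypercube, each node links to the d nodes whose IDs differ from its own in exactly one bit, so routing just walks the differing bits. The sketch below uses a made-up 32-node example, not the actual topology of the machine described above.

```cpp
// Hypercube neighbor addressing -- a minimal sketch with illustrative
// sizes. Node i is linked to the d nodes whose IDs differ from i in
// exactly one bit; a message routes by fixing one differing bit per hop.
#include <cstdio>

int main() {
    const int d = 5;                 // 2^5 = 32 nodes (illustrative size)
    const unsigned node = 0b10110;

    printf("neighbors of node %u:\n", node);
    for (int bit = 0; bit < d; ++bit)
        printf("  %u\n", node ^ (1u << bit));  // flip one bit per link

    // Distance between two nodes is their Hamming distance, so the
    // network diameter is just d hops.
    const unsigned dst = 0b00011;
    printf("hops from %u to %u: %d\n", node, dst,
           __builtin_popcount(node ^ dst));
    return 0;
}
```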
chuckula wrote:As for the programming model, I have heard of ROCm although I have no idea how popular it is. However, that "HIP" translation layer is somewhat new to me, and if this thing is going to be a success in 2021, then AMD better get a real software ecosystem whipped into shape in a hurry. Translating CUDA code might be a starting point, but it isn't going to be a winner in what is supposed to be a premier HPC system.
Waco wrote:You'd be surprised (maybe not) to hear that the workload hasn't changed much. The scale has, and it's all 3D now, but the underlying compute patterns aren't terribly dissimilar.
Waco wrote:dragontamer5788 wrote:It seems like there are a variety of proble.......sniiiiip.......ething: speeding up code automatically and finding parallelism at the assembly level. So I'll count those reorder-buffer registers as high-speed CPU storage space.
Excellent summary! There are classes of problems that don't fit on GPUs, though, and that's the hard set of problems for GPU compute.
GPU compute is great for dense problems that fit in their local RAM stacks. For sparse problems that are latency-sensitive and have sections of dense compute, GPUs are a pretty difficult fit. The memory bandwidth per FLOP is pretty poor on GPUs (the raw bandwidth is assuredly better, but the compute density outweighs that advantage pretty easily), and the latency/bandwidth hit for reaching beyond local memory to CPU memory (or remote memory on another node) is pretty constraining.
I'm not saying GPUs are bad, I'm saying GPUs are bad for particular workloads that happen to be the same ones that I support in my day job.
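To make the dense-vs-sparse distinction concrete, here's a minimal CSR sparse matrix-vector multiply (a toy example, not any particular production code). The x[col[j]] loads are indirect and data-dependent: almost no arithmetic per byte moved, and no guarantee of the coalesced access patterns GPUs are built around.

```cpp
// Sparse matrix-vector multiply in CSR layout -- a minimal sketch of why
// sparse codes map poorly to GPUs: the x[col[j]] read is indirect and
// data-dependent, so arithmetic intensity is low and access is scattered.
#include <cstdio>
#include <vector>

void spmv_csr(const std::vector<int>& row_ptr, const std::vector<int>& col,
              const std::vector<double>& val, const std::vector<double>& x,
              std::vector<double>& y) {
    for (size_t i = 0; i + 1 < row_ptr.size(); ++i) {
        double sum = 0.0;
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; ++j)
            sum += val[j] * x[col[j]];  // indirect, scattered read of x
        y[i] = sum;
    }
}

int main() {
    // 3x3 example matrix: [[2,0,1],[0,3,0],[4,0,5]]
    std::vector<int>    row_ptr = {0, 2, 3, 5};
    std::vector<int>    col     = {0, 2, 1, 0, 2};
    std::vector<double> val     = {2, 1, 3, 4, 5};
    std::vector<double> x       = {1, 1, 1};
    std::vector<double> y(3);

    spmv_csr(row_ptr, col, val, x, y);
    for (double v : y) printf("%g ", v);  // expect: 3 3 9
    printf("\n");
    return 0;
}
```

A dense kernel streams val and x contiguously and keeps every lane busy; this one chases indices, which is exactly the "difficult fit" described above.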
dragontamer5788 wrote:EDIT: Xeon Phi is interesting to me, but it seems like they're dying off, and just buying a powerful standard CPU would be more useful in practice
Waco wrote:Yup. Half of our current flagship machine is ~9000 Xeon Phi (Knights Landing) nodes.
Captain Ned wrote:Waco wrote:Yup. Half of our current flagship machine is ~9000 Xeon Phi (Knights Landing) nodes.
Cray XC40??