Anomymous Gerbil wrote: I see quite a lot of posts talking about the problems with failure rates of such massive systems. But why is that a problem?
Surely the systems and apps are designed such that the failures are essentially invisible to the apps - or is that not true?
Surely any such computers are built with easily-replaceable modules - or is that not true?
Is it a financial problem, i.e. the sheer cost of staff/materials/etc of finding and replacing all those modules?
Or is it that failures can occur which aren't detected, thereby spoiling the computations?
Or...?
(Just curious, I have no knowledge of these sorts of systems.)
Sorry, but I need to use "internet-style" quotes to attempt an answer.
>Surely the systems and apps are designed such that the failures are essentially invisible to the apps - or is that not true?
Good luck with that. Some apps try to do near-continuous checkpointing, so that little time is lost if they have to roll back to a last-known-good state after a fault/crash. Most don't, and many can't even tell that they've been corrupted until it's way too late. For most it's hard: there's just too much state, or too little filesystem bandwidth to take such checkpoints. The "state" of an HPC app on a modern supercomputer (even just counting memory state, not counting filesystem state, and definitely not counting network state (stuff sent, but not fully received)) could be many terabytes.
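To give a flavor of what "checkpoint and roll back" means, here's a toy Python sketch (nothing like a real HPC checkpoint library, which has to stream terabytes to a parallel filesystem; the file name and checkpoint interval are just made up for illustration):

```python
import os
import pickle
import tempfile

def write_checkpoint(state, path="checkpoint.pkl"):
    """Dump the app's state atomically, so a crash mid-write
    can't leave a half-written (corrupted) checkpoint behind."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)  # atomic rename: old checkpoint stays valid until the new one is complete

def restore_checkpoint(path="checkpoint.pkl"):
    """Roll back to the last-known-good state, or start from scratch."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return {"step": 0, "result": 0}

# Toy main loop: checkpoint every 100 steps, resume wherever we left off.
state = restore_checkpoint()
for step in range(state["step"], 1000):
    state["result"] += step          # stand-in for the real computation
    state["step"] = step + 1
    if state["step"] % 100 == 0:
        write_checkpoint(state)
```

The painful part isn't the logic above; it's that the "state" for a real app is terabytes, and writing that out every few minutes eats your filesystem bandwidth alive.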
>Surely any such computers are built with easily-replaceable modules - or is that not true?
There's a difference between "stuck faults", where some component is permanently broken, and "transient faults", where an otherwise "good" component just gets a wrong answer once in a trillion times, or crashes out of the blue. Some marginal components will always pass diagnostics, but still occasionally (once in a gazillion times) get the wrong answer or otherwise crash. If you have enough "spares", the marginal components should be replaced. But declaring a component marginal is a "policy" decision, where you set some threshold error rate (failures over time). And then you've got a gazillion components to carefully check and track, and most times you can't tell exactly what was at fault.
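The "policy" part really is just bookkeeping plus a threshold you picked. A toy sketch in Python (the threshold, window, and component name below are invented purely for illustration; every site tunes its own):

```python
import time
from collections import defaultdict

# Invented numbers for illustration: N errors within the window -> declare the part marginal.
ERROR_THRESHOLD = 5
WINDOW_HOURS = 24 * 7

error_times = defaultdict(list)  # component id -> timestamps of observed errors

def record_error(component_id, now=None):
    """Log one transient error and decide whether the part crossed the 'replace it' line."""
    now = time.time() if now is None else now
    window_start = now - WINDOW_HOURS * 3600
    recent = [t for t in error_times[component_id] if t >= window_start]
    recent.append(now)
    error_times[component_id] = recent
    if len(recent) >= ERROR_THRESHOLD:
        print(f"{component_id}: {len(recent)} errors in the last {WINDOW_HOURS} h, flag for replacement")

# e.g. call record_error("node0731.dimm3") every time a node logs a corrected error
```

And even that assumes your telemetry can actually pin the error on the right component, which (as above) it often can't.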
>Is it a financial problem, i.e. the sheer cost of staff/materials/etc of finding and replacing all those modules?
Imagine that you have 1 million of a thing designed to a 1-million-hour MTBF. That means that, on average, 1 will fail per hour somewhere in the machine. The real problem is that stuck-faults are easy to find/fix, but transient-faults can be nasty to find and harder to fix. I.e., answer this question: is this chip getting the wrong answer because the chip is bad, or because its motherboard is bad, or because the solder joints holding the chip to the motherboard are bad, or because the power supply powering the motherboard is bad? Of the things you mention, "staff" is probably the worst, because figuring out exactly what's wrong can be hard when you are dealing with transient failures.
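The arithmetic behind that is worth seeing once, because it's what makes the scale scary (trivial Python, using the numbers from the example above):

```python
# Back-of-envelope: N identical parts, each with a given MTBF (mean time between failures).
parts = 1_000_000
mtbf_hours = 1_000_000

failures_per_hour = parts / mtbf_hours   # expected failures across the whole machine
print(failures_per_hour)                 # 1.0 -> roughly one failure somewhere every hour
print(failures_per_hour * 24)            # ~24 failures somewhere over a 24-hour run
```

So even when every individual part is spectacularly reliable, the machine as a whole sees failures constantly.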
>Or is it that failures can occur which aren't detected, thereby spoiling the computations?
Define "detected". Say, for example, that my CPU issues a load from memory to a register. If a cosmic or alpha hits that target register and flips a bit (that I'm about to overwrite) before the memory contents get there, I just don't care. When designing a supercomputer, you think about how long a bit in some register is likely to stay there (over a range of apps), and what its probability of being killed/flipped by an alpha or cosmic would be before it's overwritten. Memory errors occur all the time (because there's usually/hopefully lots of memory, but ECC usually corrects them to return the right contents, and Servers use "scrubbing" that runs around in the background and reads/corrects/rewrites memory to continually fix single-bit errors (hopefully before they become uncorrectable multi-bit errors). (That's mostly why the computers I build have ECC.)
One of the more important benchmarks for supercomputers is Linpack, which can run for about a day. It produces an "answer" and a "residual". The answer has to be in the "ballpark", which allows some mixing/reordering of floating-point operations. But if the residual is too large, it means your supercomputer made an error in some calculation, and God-only-knows-where-or-when it happened in that day-long run.
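For the curious, the residual check is roughly the following (a toy NumPy version of the idea; the real HPL benchmark uses its own factorization and its own scaled-residual formula, with roughly 16 as its pass/fail threshold, so treat the exact numbers here as illustrative):

```python
import numpy as np

# Toy stand-in for the Linpack/HPL verification: solve A x = b, then check the residual.
n = 1000
rng = np.random.default_rng(0)
A = rng.standard_normal((n, n))
x_true = rng.standard_normal(n)
b = A @ x_true

x = np.linalg.solve(A, b)

def scaled_residual(A, x, b):
    """One common scaled-residual formula: should come out O(1) if nothing went wrong."""
    eps = np.finfo(float).eps
    n = A.shape[0]
    return np.linalg.norm(A @ x - b, ord=np.inf) / (
        np.linalg.norm(A, ord=np.inf) * np.linalg.norm(x, ord=np.inf) * n * eps
    )

print("clean run:", scaled_residual(A, x, b))       # small, passes

# Now pretend one calculation went wrong somewhere mid-run (a tiny perturbation as a stand-in):
x_bad = x.copy()
x_bad[123] += 1e-6
print("faulty run:", scaled_residual(A, x_bad, b))  # blows up well past any reasonable threshold
```

Notice that the check only tells you *that* something went wrong, not where or when, which is exactly the problem.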
That all probably raised more questions than it answered, but I hope it helped.