Personal computing discussed


 
chuckula
Minister of Gerbil Affairs
Topic Author
Posts: 2109
Joined: Wed Jan 23, 2008 9:18 pm
Location: Probably where I don't belong.

Kampman wants 32 Core benchies? Here are a few!

Tue Jul 17, 2018 11:30 am

OK, on the TR Twitter we have this request:
Give me ideas for applications that scale past 16C/32T. Doesn't need to be Windows but does need to be freely available


Here are a few applications that are freely available, that can be benchmarked at least semi-reasonably, and that at least have some relevance to modern systems *cough*not-just-Cinebench-again*cough*:

1. x265 HEVC video encoding: This one may not scale perfectly to 32 cores in a single instance, but you can run tests with 1/4/8 instances to measure scaling. http://x265.org/

2. y-cruncher: TR has briefly used this one in the past. It will stress any CPU setup you can think of and really likes memory bandwidth too (hopefully the new 0.7.6 version will be out): http://www.numberworld.org/y-cruncher/

3. GROMACS: A good HPC benchmark. http://www.gromacs.org/

4. You want rendering? There's more to life than the BMW render in Blender. Try ray-tracing with Embree: https://github.com/embree/embree-benchmark-protoray

5. Databases: How about PostgreSQL's pgbench: https://www.postgresql.org/docs/10/static/pgbench.html

6. And of course, machine learning is the new hotness. Remember that machine learning includes both training & inference. Try Stanford's DAWNBench: https://dawn.cs.stanford.edu/benchmark/

7. Bonus: This one is complex but interesting: Open Porous Media: https://opm-project.org/
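
For the multi-instance scaling idea in #1, a minimal harness might look like the sketch below. The x265 command line in the comment is a placeholder assumption, not a tested invocation; substitute whatever input and preset you actually benchmark with.

```python
import shlex
import subprocess
import time

def run_instances(cmd, n):
    """Launch n copies of cmd concurrently and return total wall time."""
    start = time.perf_counter()
    procs = [subprocess.Popen(shlex.split(cmd)) for _ in range(n)]
    for p in procs:
        p.wait()
    return time.perf_counter() - start

def scaling_efficiency(t1, tn, n):
    """Throughput of n concurrent instances relative to perfect scaling.

    1.0 means n instances finish in the same wall time as one instance;
    0.5 means you only got half the ideal aggregate throughput.
    """
    speedup = (n / tn) / (1 / t1)  # ratio of jobs-per-second
    return speedup / n

# Hypothetical x265 invocation -- adjust input/preset to your test:
# CMD = "x265 --preset medium input.y4m -o /dev/null"
# t1 = run_instances(CMD, 1)
# t4 = run_instances(CMD, 4)
# print(f"4-instance efficiency: {scaling_efficiency(t1, t4, 4):.2f}")
```

The same harness works for any of the single-instance-limited benchmarks above, not just x265.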
Last edited by chuckula on Tue Jul 17, 2018 1:57 pm, edited 1 time in total.
4770K @ 4.7 GHz; 32GB DDR3-2133; Officially RX-560... that's right AMD you shills!; 512GB 840 Pro (2x); Fractal Define XL-R2; NZXT Kraken-X60
--Many thanks to the TR Forum for advice in getting it built.
 
Kretschmer
Gerbil XP
Posts: 462
Joined: Sun Oct 19, 2008 10:36 am

Re: Kampman wants 32 Core benchies? Here are a few!

Tue Jul 17, 2018 12:09 pm

Honestly, I think that review aggregates should be split into single-threaded and multi-threaded performance-per-dollar. TR tends to over-represent niche multithreaded applications in its reviews, weighting results towards cramming as many threads into a die as possible. I'd rather see bifurcated results to better inform users whose usage tends towards one scenario or the other. I don't understand why something like WebXPRT 3 (which reflects an activity that 100% of users are interested in) counts as much as something like IndigoBench, which caters to a niche of a niche.
 
caconym
Gerbil
Posts: 28
Joined: Tue May 17, 2016 3:28 pm
Location: Reno, NV
Contact:

Re: Kampman wants 32 Core benchies? Here are a few!

Tue Jul 17, 2018 12:20 pm

https://www.chaosgroup.com/vray/benchmark

Would love to see V-Ray Benchmark used. V-Ray has become one of the most-used professional CPU renderers, so the results would be relevant for a lot of people.
3D art desktop: [email protected] - 32gb@2400 - 500gb m.2 850 EVO - GTX 1070 + 750ti (for PhysX) - Noctua NH-U9B - Phanteks P400
2D art laptop: Thinkpad Yoga - i7 4500u - 8gb RAM - 128gb SSD - way too many different Wacom EMR styluses
 
Topinio
Gerbil Jedi
Posts: 1839
Joined: Mon Jan 12, 2015 9:28 am
Location: London

Re: Kampman wants 32 Core benchies? Here are a few!

Tue Jul 17, 2018 12:47 pm

NWChem, QUANTUM ESPRESSO, CP2K ? HPL ?
Desktop: 750W Snow Silent, X11SAT-F, E3-1270 v5, 32GB ECC, RX 5700 XT, 500GB P1 + 250GB BX100 + 250GB BX100 + 4TB 7E8, XL2730Z + L22e-20
HTPC: X-650, DH67GD, i5-2500K, 4GB, GT 1030, 250GB MX500 + 1.5TB ST1500DL003, KD-43XH9196 + KA220HQ
Laptop: MBP15,2
 
chuckula
Minister of Gerbil Affairs
Topic Author
Posts: 2109
Joined: Wed Jan 23, 2008 9:18 pm
Location: Probably where I don't belong.

Re: Kampman wants 32 Core benchies? Here are a few!

Tue Jul 17, 2018 12:49 pm

4770K @ 4.7 GHz; 32GB DDR3-2133; Officially RX-560... that's right AMD you shills!; 512GB 840 Pro (2x); Fractal Define XL-R2; NZXT Kraken-X60
--Many thanks to the TR Forum for advice in getting it built.
 
Waco
Maximum Gerbil
Posts: 4850
Joined: Tue Jan 20, 2009 4:14 pm
Location: Los Alamos, NM

Re: Kampman wants 32 Core benchies? Here are a few!

Tue Jul 17, 2018 8:52 pm

I'd suggest HPCG, but it seems like I need to yell at someone since the damn website is down. :P
Victory requires no explanation. Defeat allows none.
 
chuckula
Minister of Gerbil Affairs
Topic Author
Posts: 2109
Joined: Wed Jan 23, 2008 9:18 pm
Location: Probably where I don't belong.

Re: Kampman wants 32 Core benchies? Here are a few!

Tue Jul 17, 2018 9:18 pm

Waco wrote:
I'd suggest HPCG, but it seems like I need to yell at someone since the damn website is down. :P


How useful is HPCG when run on only a single node (even if it has a bunch of cores)? Most of the uses I see for that benchmark are to test the interconnect efficiency of large HPC clusters.
4770K @ 4.7 GHz; 32GB DDR3-2133; Officially RX-560... that's right AMD you shills!; 512GB 840 Pro (2x); Fractal Define XL-R2; NZXT Kraken-X60
--Many thanks to the TR Forum for advice in getting it built.
 
Waco
Maximum Gerbil
Posts: 4850
Joined: Tue Jan 20, 2009 4:14 pm
Location: Los Alamos, NM

Re: Kampman wants 32 Core benchies? Here are a few!

Tue Jul 17, 2018 10:12 pm

chuckula wrote:
Waco wrote:
I'd suggest HPCG, but it seems like I need to yell at someone since the damn website is down. :P


How useful is HPCG when run on only a single node (even if it has a bunch of cores)? Most of the uses I see for that benchmark are to test the interconnect efficiency of large HPC clusters.

Not very...but it's better than Linpack/HPL.

It does stress memory bandwidth pretty heavily, though, so the difference between it and an equivalent Epyc would be quite interesting.
Victory requires no explanation. Defeat allows none.
 
chuckula
Minister of Gerbil Affairs
Topic Author
Posts: 2109
Joined: Wed Jan 23, 2008 9:18 pm
Location: Probably where I don't belong.

Re: Kampman wants 32 Core benchies? Here are a few!

Wed Jul 18, 2018 7:51 am

Here's an interesting paper on HPC benchmarking from my old alma mater: https://engineering.purdue.edu/paramnt/ ... SBA+08.pdf

In a nutshell: It gets more complicated the more hardware becomes available. At least TR has the advantage of only reviewing a single system, even if it has a bunch of cores and potential NUMA characteristics.

They list some interesting computational workloads in that paper too.
4770K @ 4.7 GHz; 32GB DDR3-2133; Officially RX-560... that's right AMD you shills!; 512GB 840 Pro (2x); Fractal Define XL-R2; NZXT Kraken-X60
--Many thanks to the TR Forum for advice in getting it built.
 
dragontamer5788
Gerbil Elite
Posts: 715
Joined: Mon May 06, 2013 8:39 am

Re: Kampman wants 32 Core benchies? Here are a few!

Wed Jul 18, 2018 9:16 am

chuckula wrote:
Waco wrote:
I'd suggest HPCG, but it seems like I need to yell at someone since the damn website is down. :P


How useful is HPCG when run on only a single node (even if it has a bunch of cores)? Most of the uses I see for that benchmark are to test the interconnect efficiency of large HPC clusters.


Infinity Fabric is a mesh. An incredibly fast, low-latency mesh that's partially on-die... but still a mesh that can be tested. I would expect Intel's i9 (which has a unified L3 cache across all cores) to perform better on something like HPCG. The overall idea of HPCG is to stress RAM far more than Linpack does. Consider the Apple iPhone, which has good CPU cores but lacks the equivalent of a big L3 cache (there's only a 32kB L1 cache + 8MB L2 cache on iPhones). The Apple iPhone would perform very well in Linpack, but worse on HPCG.
 
Waco
Maximum Gerbil
Posts: 4850
Joined: Tue Jan 20, 2009 4:14 pm
Location: Los Alamos, NM

Re: Kampman wants 32 Core benchies? Here are a few!

Wed Jul 18, 2018 11:41 am

dragontamer5788 wrote:
chuckula wrote:
Waco wrote:
I'd suggest HPCG, but it seems like I need to yell at someone since the damn website is down. :P


How useful is HPCG when run on only a single node (even if it has a bunch of cores)? Most of the uses I see for that benchmark are to test the interconnect efficiency of large HPC clusters.


Infinity Fabric is a mesh. An incredibly fast, low-latency mesh that's partially on-die... but still a mesh that can be tested. I would expect Intel's i9 (which has a unified L3 cache across all cores) to perform better on something like HPCG. The overall idea of HPCG is to stress RAM far more than Linpack does. Consider the Apple iPhone, which has good CPU cores but lacks the equivalent of a big L3 cache (there's only a 32kB L1 cache + 8MB L2 cache on iPhones). The Apple iPhone would perform very well in Linpack, but worse on HPCG.

This.

HPCG essentially gives you the maximum throughput you can get through a particular chip while actually accessing data in memory. Linpack/HPL tends to stress the hell out of the execution units and cache, but relies very little on memory bandwidth.
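
The compute-bound vs. bandwidth-bound distinction here is the classic roofline argument: a kernel's attainable FLOP rate is capped by min(peak FLOPs, bandwidth × arithmetic intensity). A toy calculation makes the gap concrete (the peak and bandwidth figures below are made-up round numbers, not any real chip's):

```python
def attainable_gflops(peak_gflops, bw_gbs, intensity):
    """Roofline model: achievable GFLOP/s for a kernel with the given
    arithmetic intensity (FLOPs performed per byte moved from DRAM)."""
    return min(peak_gflops, bw_gbs * intensity)

# Hypothetical chip: 1000 GFLOP/s peak, 100 GB/s DRAM bandwidth.
PEAK, BW = 1000.0, 100.0

# HPL-like dense matrix math reuses cached data heavily:
# high intensity, so it runs compute-bound at full peak.
hpl_like = attainable_gflops(PEAK, BW, intensity=50.0)   # -> 1000.0

# HPCG-like sparse matrix-vector work streams data with little reuse:
# roughly 0.25 FLOP/byte, so it runs bandwidth-bound.
hpcg_like = attainable_gflops(PEAK, BW, intensity=0.25)  # -> 25.0

# Fraction of peak actually delivered on the HPCG-like kernel:
print(f"{hpcg_like / PEAK:.1%}")  # prints "2.5%"
```

Same silicon, 40x difference in delivered FLOPs, which is why the two benchmarks rank machines so differently.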

Real problems need both heavy duty vector units and a lot of memory bandwidth / low latency. Cheap HBM/HMC cannot come fast enough!

EDIT: I hate HPL when run on clusters. It's a useless metric based on toy problems that aren't good for anything other than bragging rights. It has poisoned the HPC space and given rise to stupid machines that have ridiculous compute capacity without any means of actually feeding a real problem through them (see: ORNL's Summit, DOE CORAL machines, etc.). I'm happy to let them play the chest-beating game, but it's a massive waste of resources.
Last edited by Waco on Wed Jul 18, 2018 11:44 am, edited 1 time in total.
Victory requires no explanation. Defeat allows none.
 
chuckula
Minister of Gerbil Affairs
Topic Author
Posts: 2109
Joined: Wed Jan 23, 2008 9:18 pm
Location: Probably where I don't belong.

Re: Kampman wants 32 Core benchies? Here are a few!

Wed Jul 18, 2018 11:43 am

Waco wrote:
HPCG essentially gives you the maximum throughput you can get through a particular chip while actually accessing data in memory. Linpack/HPL tends to stress the hell out of the execution units and cache, but relies very little on memory bandwidth.

Real problems need both heavy duty vector units and a lot of memory bandwidth / low latency. Cheap HBM/HMC cannot come fast enough!


You should check the link to y-cruncher in my first post. While it's just calculating pi, the program hits both memory and CPU hard, and memory turns into the bottleneck for high-core-count chips.
4770K @ 4.7 GHz; 32GB DDR3-2133; Officially RX-560... that's right AMD you shills!; 512GB 840 Pro (2x); Fractal Define XL-R2; NZXT Kraken-X60
--Many thanks to the TR Forum for advice in getting it built.
 
ptsant
Gerbil XP
Posts: 397
Joined: Mon Oct 05, 2009 12:45 pm

Re: Kampman wants 32 Core benchies? Here are a few!

Wed Jul 18, 2018 1:19 pm

De novo genome assembly is ridiculously resource-hungry. Maybe even too much for a benchmarking session: runtime for a real dataset can be in the hundreds of hours, with peak memory use at 300+ GB.

You could try the introduction from the following page (steps 1-3 only), which uses toy data. I can't say how much memory/time it needs, but it should be reasonable.
https://github.com/voutcn/megahit/wiki/ ... l-assembly

Another thing to consider is that many bioinformatics algorithms are embarrassingly parallel and will saturate disk, RAM, or both on a system with 32 cores.
 
chuckula
Minister of Gerbil Affairs
Topic Author
Posts: 2109
Joined: Wed Jan 23, 2008 9:18 pm
Location: Probably where I don't belong.

Re: Kampman wants 32 Core benchies? Here are a few!

Wed Jul 18, 2018 1:21 pm

ptsant wrote:
De novo genome assembly is ridiculously power hungry. Maybe even too much for a benchmarking session. Time for a real application can be in the hundreds of hours and peak mem use at 300+ GB.

You could try the introduction from the following page (only steps 1-3) which uses toy data. Can't say how much mem/time it uses, but should be reasonable.
https://github.com/voutcn/megahit/wiki/ ... l-assembly

Another thing to consider is that many bioinformatics algorithms are embarrassingly parallel and will saturate HD, RAM or both in a system with 32 cores.


You have a good point about time requirement tradeoffs. You probably want a benchmark that doesn't just run in a few seconds like Cinebench does these days, but you don't want a single run to last 3 hours either if you are trying to publish on a deadline (although something like that might be nice for an after-launch followup article).
4770K @ 4.7 GHz; 32GB DDR3-2133; Officially RX-560... that's right AMD you shills!; 512GB 840 Pro (2x); Fractal Define XL-R2; NZXT Kraken-X60
--Many thanks to the TR Forum for advice in getting it built.
 
ptsant
Gerbil XP
Posts: 397
Joined: Mon Oct 05, 2009 12:45 pm

Re: Kampman wants 32 Core benchies? Here are a few!

Wed Jul 18, 2018 3:20 pm

chuckula wrote:
You have a good point about time requirement tradeoffs. You probably want a benchmark that doesn't just run in a few seconds like Cinebench does these days, but you don't want a single run to last 3 hours either if you are trying to publish on a deadline (although something like that might be nice for an after-launch followup article).


I checked the test run that I linked and it shouldn't take more than 30min, in contrast with full production runs. Still, memory usage might be a problem.
 
dragontamer5788
Gerbil Elite
Posts: 715
Joined: Mon May 06, 2013 8:39 am

Re: Kampman wants 32 Core benchies? Here are a few!

Wed Jul 18, 2018 3:35 pm

Waco wrote:
EDIT: I hate HPL when run on clusters. It's a useless metric for toy problems that aren't good for anything other than bragging rights. It has poisoned the HPC space and given rise to stupid machines that have ridiculous compute capacity without any means of actually feeding a real problem through it (see: ORNL's Summit, DOE CORAL machines, etc). I'm happy to let them play the beating the chest game, but it's a massive waste of resources.


You know that Summit is actually at the top of the HPCG benchmark charts, right? Summit has the biggest and widest interconnects ever built, for the highest-bandwidth, lowest-latency communications.

https://www.top500.org/hpcg/lists/2018/06/

The Power9 + Nvidia Volta combo is deadly in a bandwidth/memory test like HPCG. Power9 + Volta holds the #1 and #2 slots on HPCG by far: Summit scores 2925.75 TFlop/s, #2 is also a Power9 + Volta system, and #3 is a Japanese SPARC supercomputer at only 602.74.

In short: Power9 + Volta is something like 4x stronger in HPCG's memory-intensive test than the next non-Power9 / Volta system.
 
chuckula
Minister of Gerbil Affairs
Topic Author
Posts: 2109
Joined: Wed Jan 23, 2008 9:18 pm
Location: Probably where I don't belong.

Re: Kampman wants 32 Core benchies? Here are a few!

Wed Jul 18, 2018 3:47 pm

dragontamer5788 wrote:
Waco wrote:
EDIT: I hate HPL when run on clusters. It's a useless metric for toy problems that aren't good for anything other than bragging rights. It has poisoned the HPC space and given rise to stupid machines that have ridiculous compute capacity without any means of actually feeding a real problem through it (see: ORNL's Summit, DOE CORAL machines, etc). I'm happy to let them play the beating the chest game, but it's a massive waste of resources.


You know that Summit is actually on the top of the HPCG benchmark charts, right? Summit has the biggest and widest interconnects ever built for highest bandwidth and lowest latency communications.

https://www.top500.org/hpcg/lists/2018/06/

The Power9 + NVidia Volta combo is deadly in a bandwidth / memory test like HPCG. Power9 + Volta is #1 and #2 slots on HPCG by far. Summit scores 2925.75 on HPCG, #2 is still a Power9+Volta system, with #3 being a Japanese SPARC supercomputer with only 602.74.

In short: Power9 + Volta is something like 4x stronger in HPCG's memory-intensive test than the next non-Power9 / Volta system.


You're probably not looking at the results that Waco cares about.

Try this page: http://www.hpcg-benchmark.info/custom/i ... 5&slid=295

Look at the "Fraction of Peak" column. That's probably Waco's biggest interest, and Summit is OK but not that spectacular in that metric.
4770K @ 4.7 GHz; 32GB DDR3-2133; Officially RX-560... that's right AMD you shills!; 512GB 840 Pro (2x); Fractal Define XL-R2; NZXT Kraken-X60
--Many thanks to the TR Forum for advice in getting it built.
 
Waco
Maximum Gerbil
Posts: 4850
Joined: Tue Jan 20, 2009 4:14 pm
Location: Los Alamos, NM

Re: Kampman wants 32 Core benchies? Here are a few!

Wed Jul 18, 2018 4:02 pm

dragontamer5788 wrote:
In short: Power9 + Volta is something like 4x stronger in HPCG's memory-intensive test than the next non-Power9 / Volta system.

...and it's 1.5% efficient at doing that. 1.5%. Repeat that a few times. That means a similar exascale machine (in FLOP count) would only run 15 PF on that test...

chuckula wrote:
You're probably not looking at the results that Waco cares about.

Try this page: http://www.hpcg-benchmark.info/custom/i ... 5&slid=295

Look at the "Fraction of Peak" column. That's probably Waco's biggest interest, and Summit is OK but not that spectacular in that metric.

Exactly. 1.5% of peak is laughable. It's a huge step in the wrong direction in terms of efficiency of compute, especially as power density and total power becomes more and more of a challenge.
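
For anyone following along, that "Fraction of Peak" column is just the HPCG score divided by the machine's theoretical Rpeak. Using Summit's HPCG number from earlier in the thread and its approximate June 2018 Rpeak (roughly 187,659 TFlop/s; treat that figure as a ballpark):

```python
hpcg_tflops = 2925.75     # Summit's HPCG score, June 2018 list
rpeak_tflops = 187_659.0  # approximate theoretical peak, same list

fraction = hpcg_tflops / rpeak_tflops
print(f"{fraction:.2%}")  # prints "1.56%"
```

Which rounds to the 1.5% quoted above.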
Victory requires no explanation. Defeat allows none.
 
dragontamer5788
Gerbil Elite
Posts: 715
Joined: Mon May 06, 2013 8:39 am

Re: Kampman wants 32 Core benchies? Here are a few!

Wed Jul 18, 2018 4:09 pm

Waco wrote:
dragontamer5788 wrote:
In short: Power9 + Volta is something like 4x stronger in HPCG's memory-intensive test than the next non-Power9 / Volta system.

...and it's 1.5% efficient at doing that. 1.5%. Repeat that a few times. That means a similar exascale machine (in FLOP count) would only run 15 PF on that test...

chuckula wrote:
You're probably not looking at the results that Waco cares about.

Try this page: http://www.hpcg-benchmark.info/custom/i ... 5&slid=295

Look at the "Fraction of Peak" column. That's probably Waco's biggest interest, and Summit is OK but not that spectacular in that metric.

Exactly. 1.5% of peak is laughable. It's a huge step in the wrong direction in terms of efficiency of compute, especially as power density and total power becomes more and more of a challenge.


Wait, the Japanese K-Computer (5+% peak) uses 12.6 MW of power to achieve 0.603 PFlops in HPCG. In contrast, the Power9 + Volta Summit uses 15 MW to achieve 2.926 PFlops in HPCG.

No matter how you look at it, Summit is the most powerful supercomputer ever built. Sure, it has a ridiculously huge (and impractical) Linpack score. But its "practical" HPCG score beats the competitors in efficiency by like 300%. Summit is straight up the best and most practical supercomputer ever built, with great efficiency numbers to boot.
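
The power-efficiency claim checks out against the thread's own figures (note the 15 MW and 12.6 MW power numbers are as quoted above, not independently verified):

```python
# (machine, HPCG PFlop/s, power in MW) -- figures quoted in this thread
summit = ("Summit", 2.926, 15.0)
k_comp = ("K computer", 0.603, 12.6)

def pflops_per_mw(machine):
    """HPCG throughput per megawatt of system power."""
    _, pflops, megawatts = machine
    return pflops / megawatts

ratio = pflops_per_mw(summit) / pflops_per_mw(k_comp)
print(f"Summit does {ratio:.1f}x the HPCG work per MW")  # prints "4.1x"
```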
 
Waco
Maximum Gerbil
Posts: 4850
Joined: Tue Jan 20, 2009 4:14 pm
Location: Los Alamos, NM

Re: Kampman wants 32 Core benchies? Here are a few!

Wed Jul 18, 2018 4:10 pm

dragontamer5788 wrote:
Wait, the Japanese K-Computer (5+% peak) uses 12.6 MW of power to achieve 0.603 PFlops in HPCG. In contrast, the Power9 + Volta Summit uses 15 MW to achieve 2.926 PFlops in HPCG.

K is nearly a decade old.
Victory requires no explanation. Defeat allows none.
 
techguy
Gerbil XP
Posts: 404
Joined: Tue Aug 10, 2010 9:12 am

Re: Kampman wants 32 Core benchies? Here are a few!

Wed Jul 18, 2018 4:22 pm

An AVX workload or two would be nice. A HandBrake 2-pass 4K transcode would be awesome. Barring that, Asus' ROG RealBench has a quick H.264 video encoding benchmark which should give us *some* idea of encoding performance.
 
Topinio
Gerbil Jedi
Posts: 1839
Joined: Mon Jan 12, 2015 9:28 am
Location: London

Re: Kampman wants 32 Core benchies? Here are a few!

Thu Jul 19, 2018 7:56 am

Waco wrote:
Exactly. 1.5% of peak is laughable. It's a huge step in the wrong direction in terms of efficiency of compute, especially as power density and total power becomes more and more of a challenge.

If so, almost everything Intel-based on that list of 130 machines is laughable. Most of the rest could be said to be, too; there aren't many that even hit 3% of peak.

Only Earth Simulator and its 6 SX-ACE relatives get over 10%; otherwise only SPARC64 machines like K are over 5%.

There are only 3 systems (1 Intel and 2 AMD) over 4%. 7 more Intel ones get 3.x% ... the Xeons are mostly in the 0-1.x% range ...

7 @ 10-12%, all NEC SX-ACE (like the Earth Simulator)
2 @ 5-6%, both SPARC64 (like K)
3 @ 4-5%, 2 AMD and 1 Ivy Bridge
8 @ 3-4%, 3 Haswell, 3 Ivy, 1 Sandy, 1 SPARC64
17 @ 2-3%
71 @ 1-2%
15 @ 0-1%
Desktop: 750W Snow Silent, X11SAT-F, E3-1270 v5, 32GB ECC, RX 5700 XT, 500GB P1 + 250GB BX100 + 250GB BX100 + 4TB 7E8, XL2730Z + L22e-20
HTPC: X-650, DH67GD, i5-2500K, 4GB, GT 1030, 250GB MX500 + 1.5TB ST1500DL003, KD-43XH9196 + KA220HQ
Laptop: MBP15,2
 
Waco
Maximum Gerbil
Posts: 4850
Joined: Tue Jan 20, 2009 4:14 pm
Location: Los Alamos, NM

Re: Kampman wants 32 Core benchies? Here are a few!

Thu Jul 19, 2018 8:07 am

Agreed. Efficiency has been sliding downwards for almost all machines. GPUs in general make it worse.
Victory requires no explanation. Defeat allows none.
 
chuckula
Minister of Gerbil Affairs
Topic Author
Posts: 2109
Joined: Wed Jan 23, 2008 9:18 pm
Location: Probably where I don't belong.

Re: Kampman wants 32 Core benchies? Here are a few!

Thu Jul 19, 2018 8:14 am

Topinio wrote:
Waco wrote:
Exactly. 1.5% of peak is laughable. It's a huge step in the wrong direction in terms of efficiency of compute, especially as power density and total power becomes more and more of a challenge.

If so, almost everything Intel-based on that list of 130 machines is laughable. Most of the rest could be said to be, too, there aren't many that are even 3% of peak.

Only Earth Simulator and its 6 SX-ACE relatives get over 10%, otherwise only SPARC64 like K is over 5%.

There are only 3 (1 Intel and 2 AMD) systems over 4%. 7 more Intel ones get 3.x% ... the Xeons are mostly in the 0-1.x% range ...

7 @ 10-12%, all NEC SX-ACE (like the Earth Simulator)
2 @ 5-6%, both SPARC64 (like K)
3 @ 4-5%, 2 AMD and 1 Ivy Bridge
8 @ 3-4%, 3 Haswell, 3 Ivy, 1 Sandy, 1 SPARC64
17 @ 2-3%
71 @ 1-2%
15 @ 0-1%


Little of that has to do with CPU architecture and a lot of it has to do with interconnects. The Tofu interconnect topology used by the Japanese supercomputers is really cool in many ways. The individual links don't even have a massive amount of bandwidth, but the topology is done extremely well. I suspect it's pretty complex to set up, though, which is probably why everybody else doesn't just copy it.
4770K @ 4.7 GHz; 32GB DDR3-2133; Officially RX-560... that's right AMD you shills!; 512GB 840 Pro (2x); Fractal Define XL-R2; NZXT Kraken-X60
--Many thanks to the TR Forum for advice in getting it built.
 
Topinio
Gerbil Jedi
Posts: 1839
Joined: Mon Jan 12, 2015 9:28 am
Location: London

Re: Kampman wants 32 Core benchies? Here are a few!

Thu Jul 19, 2018 10:24 am

So, comparing efficiency epeens isn't productive...
Desktop: 750W Snow Silent, X11SAT-F, E3-1270 v5, 32GB ECC, RX 5700 XT, 500GB P1 + 250GB BX100 + 250GB BX100 + 4TB 7E8, XL2730Z + L22e-20
HTPC: X-650, DH67GD, i5-2500K, 4GB, GT 1030, 250GB MX500 + 1.5TB ST1500DL003, KD-43XH9196 + KA220HQ
Laptop: MBP15,2
 
dragontamer5788
Gerbil Elite
Posts: 715
Joined: Mon May 06, 2013 8:39 am

Re: Kampman wants 32 Core benchies? Here are a few!

Thu Jul 19, 2018 10:27 am

Topinio wrote:
So, comparing efficiency epeens isn't productive...


Calling it "efficiency" is a complete misnomer, and I disagree with the use of that word. Efficiency usually means power efficiency: using fewer watts to get the same result. As I demonstrated earlier, Summit does 400% of the HPCG PFLOPs with only 25% more power. Summit is more "efficient" by every measurement.

I'm not even sure why we should care about this % of peak number at all.
 
Waco
Maximum Gerbil
Posts: 4850
Joined: Tue Jan 20, 2009 4:14 pm
Location: Los Alamos, NM

Re: Kampman wants 32 Core benchies? Here are a few!

Thu Jul 19, 2018 1:53 pm

dragontamer5788 wrote:
I'm not even sure why we should care about this % of peak number at all.

Paying for dark silicon isn't a good thing.

I care because the primary purpose of the HPC clusters I help design and run is to run problems very similar to HPCG in terms of workload. Going from 1.5% to 3% means the system costs half as much.

Anyway - I think we're sufficiently off the rails here! I would love to see HPCG included, though, since it shows how well balanced an architecture is for compute versus memory bandwidth/latency (and interconnect on clusters).
Victory requires no explanation. Defeat allows none.
 
techguy
Gerbil XP
Posts: 404
Joined: Tue Aug 10, 2010 9:12 am

Re: Kampman wants 32 Core benchies? Here are a few!

Thu Jul 19, 2018 2:10 pm

Waco wrote:
dragontamer5788 wrote:
I'm not even sure why we should care about this % of peak number at all.

Paying for dark silicon isn't a good thing.

I care because the primary purpose of the HPC clusters I help design and run is to run problems very similar to HPCG in terms of workload. Going from 1.5% to 3% means the system costs half as much.

Anyway - I think we're sufficiently off the rails here! I would love to see HPCG included, though, since it shows how well balanced an architecture is for compute versus memory bandwidth/latency (and interconnect on clusters).


Utilization, as a metric of efficiency, is largely irrelevant to this discussion, IMHO. Customers of this size, with these types of workloads, don't care whether their clusters run at 3% utilization or 99%. Actual, measurable performance is what matters most. If one solution outperforms another and costs a similar amount, efficiency becomes a mere intellectual curiosity.

To use a car analogy, it's like the old HP/L argument. Honda fanboys used to love this one.

"B...b...but, my 1.6L engine makes 160HP. That's 100HP/L! Way higher than your stupid V8 with its 300HP"

Raw performance is what matters in the end.
 
Topinio
Gerbil Jedi
Posts: 1839
Joined: Mon Jan 12, 2015 9:28 am
Location: London

Re: Kampman wants 32 Core benchies? Here are a few!

Thu Jul 19, 2018 5:15 pm

techguy wrote:
Raw performance is what matters in the end.

Kinda. To users, what matters is ease of use, then support, then latency from job submission to job start.

Raw performance only matters to the extent that it affects this last (third most important) factor, which it does -- but only when the competition from the rest of the user community can be successfully constrained despite the generational increases in raw performance.

Edit:

techguy wrote:
Customers of this size, with these type of workloads, do not care if their clusters have 3% efficiency or 99%. Actual, measurable performance is what matters most. If one solution outperforms another and costs a similar amount, efficiency becomes a mere intellectual curiosity.

The people at customers who are buying, building out, and configuring the systems are not the same people who run jobs on them. To those making purchasing decisions, raw performance on HPL or whatever code is what matters, sure, but it's easy for eyes to drift off the ball.
Desktop: 750W Snow Silent, X11SAT-F, E3-1270 v5, 32GB ECC, RX 5700 XT, 500GB P1 + 250GB BX100 + 250GB BX100 + 4TB 7E8, XL2730Z + L22e-20
HTPC: X-650, DH67GD, i5-2500K, 4GB, GT 1030, 250GB MX500 + 1.5TB ST1500DL003, KD-43XH9196 + KA220HQ
Laptop: MBP15,2
 
Waco
Maximum Gerbil
Posts: 4850
Joined: Tue Jan 20, 2009 4:14 pm
Location: Los Alamos, NM

Re: Kampman wants 32 Core benchies? Here are a few!

Thu Jul 19, 2018 5:24 pm

techguy wrote:
Utilization, as a metric of efficiency, is largely irrelevant to this discussion, IMHO. Customers of this size, with these type of workloads, do not care if their clusters have 3% efficiency or 99%. Actual, measurable performance is what matters most. If one solution outperforms another and costs a similar amount, efficiency becomes a mere intellectual curiosity.

To use a car analogy, it's like the old HP/L argument. Honda fanboys used to love this one.

"B...b...but, my 1.6L engine makes 160HP. That's 100HP/L! Way higher than your stupid V8 with its 300HP"

Raw performance is what matters in the end.

Yes, raw performance matters. However, raw performance is limited by MTBSI. A system that brute-forces a performance metric by being 2x as big may end up being less useful once the failure rate is taken into account. Certainly, if a machine is of similar size, similar cost, and similar power, the higher-performing machine wins. :)


/a customer of that size :)

Topinio wrote:
People at customers who are buying, building out and configuring the systems are not the same people who are running on the system. To those making purchasing decisions, HPL or whatever code's raw performance is what matters, sure, but it's easy for eyes to drift off the ball.

Generally, yes. Efficiency of execution and resources is waaaay up there on the list of considerations, especially since we've all but exhausted the gains we'll get from process shrinks.
Victory requires no explanation. Defeat allows none.
