Personal computing discussed


 
Waco
Gold subscriber
Grand Gerbil Poohbah
Posts: 3250
Joined: Tue Jan 20, 2009 4:14 pm
Location: Los Alamos, NM

Re: Intel declares bankruptcy (RIP 1968-2019)

Fri Jan 25, 2019 1:21 pm

dragontamer5788 wrote:
IBM Power9 has some interesting benchmarks. The 2x8 Power9 system was performing above and beyond the 32-core EPYCs in 64-bit tasks in Phoronix benches.

It's hard to directly compare them due to the fundamental design differences - those dual Power9s have the same number of threads as that single 32-core EPYC. They are cool chips for specific types of workload, though.
Desktop: X570 Gaming X | 3900X | 32 GB | Alphacool Eisblock Radeon VII | Heatkiller R3 | Samsung 4K 40" | 1 TB SX8200 Pro + 2 TB 660p + 2 TB SATA SSD
NAS: 1950X | Designare EX | 32 GB ECC | 7x8 TB RAIDZ2 | 8x2 TB RAID10 | FreeNAS | ZFS | LSI SAS
 
leor
Gold subscriber
Maximum Gerbil
Posts: 4878
Joined: Wed Dec 11, 2002 6:34 pm
Location: NYC
Contact:

Re: Intel declares bankruptcy (RIP 1968-2019)

Fri Jan 25, 2019 1:54 pm

derFunkenstein wrote:
K-L-Waster wrote:
Youth is fleeting, but immaturity can last forever.

Oh, I have tons of immaturity. Just not the energy to do it day-in, day-out like this. :lol:

For real, Chuck, can you drop the frequency of the nonsense to 10 times a year or so?
 
dragontamer5788
Gerbil Elite
Posts: 529
Joined: Mon May 06, 2013 8:39 am

Re: Intel declares bankruptcy (RIP 1968-2019)

Fri Jan 25, 2019 3:19 pm

Waco wrote:
dragontamer5788 wrote:
IBM Power9 has some interesting benchmarks. The 2x8 Power9 system was performing above and beyond the 32-core EPYCs in 64-bit tasks in Phoronix benches.

It's hard to directly compare them due to the fundamental design differences - those dual Power9s have the same number of threads as that single 32-core EPYC. They are cool chips for specific types of workload, though.


I agree it's a hard comparison to make.

Nonetheless, the 8-core Power9 is only $595, while a 32-core EPYC 7551 is well over $2000. Furthermore, the Power9 lineup has an 18-core offered for only $1425. That's 18 cores / 72 threads, since Power9 is SMT4.

Which means your $1425 Power9 chip can perform like a $2000+ EPYC 7551 chip in some workloads. And if you can benefit from NVLink, PCIe 4.0, or any other special feature of the Power9 platform, you get huge benefits on top of that. Note that the 18-core Power9 has a whopping 90 MB of L3 cache, for example, which would be essential to any database application.

It's not like the EPYC 7551 is a bad chip either. Indeed, the EPYC 7551 is probably one of the best price/performance chips on the market. That's what makes the comparison so stunning IMO. Now true, Power9 motherboards cost more, but EPYC motherboards are still pretty expensive. So it's still not an apples-to-apples comparison, but Power9 doesn't look like a bad deal to my eyes.
 
Waco
Gold subscriber
Grand Gerbil Poohbah
Posts: 3250
Joined: Tue Jan 20, 2009 4:14 pm
Location: Los Alamos, NM

Re: Intel declares bankruptcy (RIP 1968-2019)

Fri Jan 25, 2019 3:58 pm

dragontamer5788 wrote:
Which means your $1425 Power9 chip can perform like a $2000+ EPYC 7551 chip in some workloads.

*assuming the additional speed wasn't from the dual-socket nature and double the DRAM bandwidth of the dual 8-core Power9 system. :) Some Power9s are SMT8 as well, though I don't know the model breakdown.

I haven't seen a recent comparison on power efficiency either, but older Power chips were pretty notoriously bad at idle / light load power consumption.
Desktop: X570 Gaming X | 3900X | 32 GB | Alphacool Eisblock Radeon VII | Heatkiller R3 | Samsung 4K 40" | 1 TB SX8200 Pro + 2 TB 660p + 2 TB SATA SSD
NAS: 1950X | Designare EX | 32 GB ECC | 7x8 TB RAIDZ2 | 8x2 TB RAID10 | FreeNAS | ZFS | LSI SAS
 
dragontamer5788
Gerbil Elite
Posts: 529
Joined: Mon May 06, 2013 8:39 am

Re: Intel declares bankruptcy (RIP 1968-2019)

Fri Jan 25, 2019 4:18 pm

Waco wrote:
dragontamer5788 wrote:
Which means your $1425 Power9 chip can perform like a $2000+ EPYC 7551 chip in some workloads.

*assuming the additional speed wasn't from the dual-socket nature and double the DRAM bandwidth of the dual 8-core Power9 system. :) Some Power9s are SMT8 as well, though I don't know the model breakdown.

I haven't seen a recent comparison on power efficiency either, but older Power chips were pretty notoriously bad at idle / light load power consumption.


Hmmm... that's a good point. Stockfish (the chess engine) seems to be compute-bound... but I know that the hash table of chess positions is shared between all threads, and it could be bandwidth-bound. I generally assume that Stockfish, and chess engines in general, are compute-bound, however. I doubt that 7-Zip compression is memory-bound, mostly because that test always scales well with cores / threads.

A single chip also has benefits for communication. A shared 90 MB L3 cache is better than 2x 40 MB caches spread across two chips. There's a good chance that Stockfish's shared hash table of positions is better served by the shared 90 MB L3 cache of one bigger processor than by hopping between sockets on a 2-socket system.

-----------

SMT8 is only on IBM's really high-end systems. SMT8 is basically two cores bolted together and called one core. There is a 12-wide decoder on SMT8 cores (compared to a 6-wide decoder on SMT4 cores). Skylake and Zen, by the way, have 4-wide decoders + 6-wide uop caches (6-wide if a loop fits in the uop cache, 4-wide if it only fits in L1 cache). SMT8 cores have 8 somewhat independent pipelines, each of which runs its own thread. (Pairs of pipelines can share resources with each other, so it's superior to fully independent pipelines.)

SMT8 (and really SMT4, which is just half of that design) is kinda like AMD's Bulldozer, except IBM calls the entire 8-thread behemoth a "single core". A 12-wide decoder is roughly equivalent to three Skylake or Zen cores with regards to L1 decode speed. I know it isn't an apples-to-apples comparison, but that's probably the closest analog between the two designs that makes sense.
 
Waco
Gold subscriber
Grand Gerbil Poohbah
Posts: 3250
Joined: Tue Jan 20, 2009 4:14 pm
Location: Los Alamos, NM

Re: Intel declares bankruptcy (RIP 1968-2019)

Fri Jan 25, 2019 4:38 pm

Yep! It's always really fun to compare wildly different architectures like this. I have to admit, I'm playing devil's advocate (going off memory) here more than I'm digging deep into each design. I do that too much in my day job. :P
Desktop: X570 Gaming X | 3900X | 32 GB | Alphacool Eisblock Radeon VII | Heatkiller R3 | Samsung 4K 40" | 1 TB SX8200 Pro + 2 TB 660p + 2 TB SATA SSD
NAS: 1950X | Designare EX | 32 GB ECC | 7x8 TB RAIDZ2 | 8x2 TB RAID10 | FreeNAS | ZFS | LSI SAS
 
chuckula
Gold subscriber
Minister of Gerbil Affairs
Topic Author
Posts: 2109
Joined: Wed Jan 23, 2008 9:18 pm
Location: Probably where I don't belong.

Re: Intel declares bankruptcy (RIP 1968-2019)

Fri Jan 25, 2019 8:48 pm

dragontamer5788 wrote:
[snip]
It's not like the EPYC 7551 is a bad chip either. Indeed, the EPYC 7551 is probably one of the best price/performance chips on the market. That's what makes the comparison so stunning IMO. Now true, Power9 motherboards cost more, but EPYC motherboards are still pretty expensive. So it's still not an apples-to-apples comparison, but Power9 doesn't look like a bad deal to my eyes.


The whole point of the POWER architecture is to push transactions as quickly and with as much fault-tolerance as possible. It's an incredibly... POWERful... microarchitecture for that purpose, and its design revolves around that purpose far more than around pure number crunching.

Irony: the "R" in POWER originally stood for RISC (Performance Optimization With Enhanced RISC), but modern POWER chips are light-years away from anything that should properly be called RISC, because they've been enhanced with all the execution baggage of IBM's big-iron transaction-processing world. That includes dedicated binary-coded-decimal instructions so they can perform banking transactions that spit out the exact same results as a 40-year-old System/370 would.
4770K @ 4.7 GHz; 32GB DDR3-2133; Officially RX-560... that's right AMD you shills!; 512GB 840 Pro (2x); Fractal Define XL-R2; NZXT Kraken-X60
--Many thanks to the TR Forum for advice in getting it built.
 
anotherengineer
Gerbil Jedi
Posts: 1677
Joined: Fri Sep 25, 2009 1:53 pm
Location: Northern, ON Canada, Yes I know, Up in the sticks

Re: Intel declares bankruptcy (RIP 1968-2019)

Sun Jan 27, 2019 10:11 am

Chrispy_ wrote:
I see the thread.
I think to myself, "Wow, that's super low even for Chucky, but I bet it's Chucky."
I click on the thread; it's Chucky.
I'm not reading anything in this thread, I just came here to type this.


Agreed

I visit TR for the well-written articles, but the first thing one sees when the site loads is the hot forum threads. At a glance you can instantly tell who and what it is, and then you just close the browser and walk away, because seeing unprofessional stuff like that ruins the site.
Life doesn't change after marriage, it changes after children!
 
DragonDaddyBear
Silver subscriber
Gerbil Elite
Posts: 985
Joined: Fri Jan 30, 2009 8:01 am

Re: Intel declares bankruptcy (RIP 1968-2019)

Sun Jan 27, 2019 12:54 pm

I entertain threads like this because even with the troll-tastic headline there is usually a topic of interest, and clickbait is such an internet thing that I have learned to live with it. As annoying as Chuckula is, he appears to be able to contribute intelligent thoughts, as demonstrated above. I do wish the trolling would come down a notch, though. There are more meaningful headlines that could have been chosen that indicate the topics at hand while still poking fun at fanboys.
 
cegras
Gerbil First Class
Posts: 187
Joined: Mon Nov 05, 2007 3:12 pm

Re: Intel declares bankruptcy (RIP 1968-2019)

Sun Jan 27, 2019 8:21 pm

That's an intelligent thought? That's all publicly available information. Dragontamer and Waco were having an interesting conversation, don't let Chuckula leech credit off of it.
 
chuckula
Gold subscriber
Minister of Gerbil Affairs
Topic Author
Posts: 2109
Joined: Wed Jan 23, 2008 9:18 pm
Location: Probably where I don't belong.

Re: Intel declares bankruptcy (RIP 1968-2019)

Sun Jan 27, 2019 8:33 pm

cegras wrote:
That's an intelligent thought? That's all publicly available information. Dragontamer and Waco were having an interesting conversation, don't let Chuckula leech credit off of it.


What have you ever added to anything... including "public information"?

But while you're here stroking your own ego, why don't you take the time to tell us again why Haswell sucks so bad when AMD is still copying Haswell's AVX units into its leading-edge 2019 product line.

Or we can just look here:
[image: benchmark chart with AVX disabled]
Yay AMD wins and Intel sucks because a quad-core 65W beats up on lower-range laptop chips RIGHT?!?!?

Try again with AVX turned on... a move that even helps AMD if you look at the numbers:
[image: the same benchmark with AVX-512 enabled]

Oh wait... a dual-core "failed" Cannon Lake winning? Well that's OK. Zen 2 will be literally twice as fast, so a 4-core Zen 2 will only lose by a factor of 2 to a "failed" Cannon Lake part.

What's neat about that particular test? It's a benchmark that was actually written by the author of the article at Anandtech back when he was writing academic code. A few lines of AVX-512 tweaking and we get that result. This isn't some commercial test that was built by a huge team of coders, it's an amateur software project that shows just how powerful actually using the modern facilities of a modern CPU can be.

And that's from Cannon Lake, you know the "bad" part. Imagine what happens when Intel starts to make parts that are just "not so bad" instead.

Hell, I'd love to have an intelligent conversation around here about the best optimizations for using AVX-512 VL to adjust the register width for various instruction mixes, ranging from the huge FMA workloads where the full register width tends to slow clocks, to the much faster POPCNT and BW sets where you can run the full-width registers without too much trouble. But every time I talk about something that goes over your "CINEBENCH DUURRRRR" level of understanding I get a childish response that "AVX-512 is just a niche". Well guess what, in Q4 2018 it was a $6.1 billion "niche" that makes everything AMD has ever done look like an inconsequential joke.



Frankly, the reason I put up this thread title is because it plays exactly into the emotions and prejudices of people like you. Imagine if I had said instead: Intel's Q4 2018 profit makes the next-closest competitor's revenue look like a rounding error. Or how about: Intel's non-volatile memory group that makes those "failed" Optane parts would be the largest division in the company, easily outpacing 100% of all Epyc sales, if it were dropped onto AMD [you know, the company that won't tell us Epyc's revenue separate from video game console sales... because those product lines are clearly one and the same].

I'm getting tired of having to talk down to you, so I'm just coming down to your level. The next fact-free insult had better have a little more behind it than "Chuckula accurately pointed out my own bigotry and I'm uncomfortable with it, waahh!"
4770K @ 4.7 GHz; 32GB DDR3-2133; Officially RX-560... that's right AMD you shills!; 512GB 840 Pro (2x); Fractal Define XL-R2; NZXT Kraken-X60
--Many thanks to the TR Forum for advice in getting it built.
 
Waco
Gold subscriber
Grand Gerbil Poohbah
Posts: 3250
Joined: Tue Jan 20, 2009 4:14 pm
Location: Los Alamos, NM

Re: Intel declares bankruptcy (RIP 1968-2019)

Sun Jan 27, 2019 8:47 pm

cegras wrote:
That's all publicly available information.

Anyone willing to break NDAs for some forum cred won't be doing it for long...

EDIT: I do have to laugh at the i3-8121 beating the i3-8130 by a huge margin. Intel's naming schemes are just silly sometimes.
Last edited by Waco on Sun Jan 27, 2019 8:50 pm, edited 1 time in total.
Desktop: X570 Gaming X | 3900X | 32 GB | Alphacool Eisblock Radeon VII | Heatkiller R3 | Samsung 4K 40" | 1 TB SX8200 Pro + 2 TB 660p + 2 TB SATA SSD
NAS: 1950X | Designare EX | 32 GB ECC | 7x8 TB RAIDZ2 | 8x2 TB RAID10 | FreeNAS | ZFS | LSI SAS
 
derFunkenstein
Gerbil God
Posts: 25254
Joined: Fri Feb 21, 2003 9:13 pm
Location: Comin' to you directly from the Mothership

Re: Intel declares bankruptcy (RIP 1968-2019)

Sun Jan 27, 2019 8:48 pm

Waco wrote:
cegras wrote:
That's all publicly available information.

Anyone willing to break NDAs for some forum cred won't be doing it for long...

Based on your location I'm expecting that AMD would hire the aliens to abduct you if you break NDA. Mysterious disappearance? Mmmmmmaybeeeee....
I do not understand what I do. For what I want to do I do not do, but what I hate I do.
Twittering away the day at @TVsBen
 
Waco
Gold subscriber
Grand Gerbil Poohbah
Posts: 3250
Joined: Tue Jan 20, 2009 4:14 pm
Location: Los Alamos, NM

Re: Intel declares bankruptcy (RIP 1968-2019)

Sun Jan 27, 2019 8:51 pm

derFunkenstein wrote:
Based on your location I'm expecting that AMD would hire the aliens to abduct you if you break NDA. Mysterious disappearance? Mmmmmmaybeeeee....

[speaks in alien]No comment[/speaks in alien]
Desktop: X570 Gaming X | 3900X | 32 GB | Alphacool Eisblock Radeon VII | Heatkiller R3 | Samsung 4K 40" | 1 TB SX8200 Pro + 2 TB 660p + 2 TB SATA SSD
NAS: 1950X | Designare EX | 32 GB ECC | 7x8 TB RAIDZ2 | 8x2 TB RAID10 | FreeNAS | ZFS | LSI SAS
 
synthtel2
Gerbil Elite
Posts: 956
Joined: Mon Nov 16, 2015 10:30 am

Re: Intel declares bankruptcy (RIP 1968-2019)

Sun Jan 27, 2019 9:09 pm

AMD's FPUs aren't all that similar to Intel's beyond supporting the same instructions, and that includes Zen 2 versus Haswell.
 
NTMBK
Gerbil XP
Posts: 371
Joined: Sat Dec 21, 2013 11:21 am

Re: Intel declares bankruptcy (RIP 1968-2019)

Mon Jan 28, 2019 5:06 am

chuckula wrote:
[snip]

What's neat about that particular test? It's a benchmark that was actually written by the author of the article at Anandtech back when he was writing academic code. A few lines of AVX-512 tweaking and we get that result. [snip]

And that's from Cannon Lake, you know the "bad" part. Imagine what happens when Intel starts to make parts that are just "not so bad" instead. [snip]


Except, as even Ian admits, this sort of code would run much better on a GPU... and any PhD doing the same today would probably write it in CUDA instead of screwing around with the Intel compiler trying to generate AVX-512 binaries. Even the on-die GPU would probably run it better if you fed it OpenCL - if it were actually functional. Speaking of which...

The damn thing doesn't have a functional GPU, and Intel dumped it into the tiniest niche Chinese market they could find, just to tell investors that 10nm was "launched". Intel refuses to even sell it to the rest of the world. They didn't acknowledge its existence on Ark until journalists actually managed to get their hands on one. It's a failed chip.
 
Topinio
Gerbil Jedi
Posts: 1758
Joined: Mon Jan 12, 2015 9:28 am
Location: London

Re: Intel declares bankruptcy (RIP 1968-2019)

Mon Jan 28, 2019 5:32 am

Those numbers are crazy for that CPU on that specific AVX benchmark.

It's a 2C4T chip at 15 W, and it sits between the Ryzen 7 2700X and the Core i9-9900K?

Those are 8C16T, 105 W / 95 W, 3.7-4.3 / 3.6-5.0 GHz, $330 / $490 top-drawer desktop CPUs.

This reminds me of when Sun showed off the UltraSPARC III Cu with those SPECfp 2000 scores and didn't expect it to be looked at further.
Desktop: E3-1270 v5, X11SAT-F, 32GB, RX 580, 500GB Crucial P1, 250GB Crucial MX500, 4TB 7E8, Xonar DGX, XL2730Z + L22e-20
HTPC: i5-2500K, DH67GD, 6GB, GT 1030, 250GB MX500, 1.5TB Barracuda, Xonar DX, G2420HDB
Laptop: MacBook6,1
 
NTMBK
Gerbil XP
Posts: 371
Joined: Sat Dec 21, 2013 11:21 am

Re: Intel declares bankruptcy (RIP 1968-2019)

Mon Jan 28, 2019 6:53 am

Topinio wrote:
Those numbers are crazy for that CPU on that specific AVX benchmark.

It's a 2C4T chip at 15 W, and it sits between the Ryzen 7 2700X and the Core i9-9900K?

Those are 8C16T, 105 W / 95 W, 3.7-4.3 / 3.6-5.0 GHz, $330 / $490 top-drawer desktop CPUs.

This reminds me of when Sun showed off the UltraSPARC III Cu with those SPECfp 2000 scores and didn't expect it to be looked at further.


It looks like the code has jumped from "basically unvectorized" to "fully vectorized" under AVX-512. The fact that Ryzen got so close to Skylake, even with half the SIMD throughput, indicates that AVX2 was not being properly utilized.

It's certainly conceivable - AVX-512 adds masking to all vector instructions, as well as vector scatter, making it a much better target for auto-vectorization. (It's heavily inspired by the Larrabee instruction set, which was designed to vectorize GPU shader code.)
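
To make that concrete, here's a minimal sketch (my own illustration, not anything from the article; the function and data are made up) of what masking buys the vectorizer. The scalar branch below is the kind of loop compilers historically refused to vectorize; with AVX-512 the "if" turns into an op mask and the loop stays fully vectorized:

#include <immintrin.h>

// Scalar version: the conditional store makes pre-AVX-512 compilers bail.
void add_where_positive(float *out, const float *a, const float *b, int n) {
    for (int i = 0; i < n; i++)
        if (a[i] > 0.0f)
            out[i] = a[i] + b[i];
}

// Roughly what an AVX-512 compiler can emit: compare -> mask -> masked store.
void add_where_positive_avx512(float *out, const float *a, const float *b, int n) {
    int i = 0;
    for (; i + 16 <= n; i += 16) {
        __m512 va = _mm512_loadu_ps(a + i);
        __m512 vb = _mm512_loadu_ps(b + i);
        __mmask16 m = _mm512_cmp_ps_mask(va, _mm512_setzero_ps(), _CMP_GT_OQ);
        _mm512_mask_storeu_ps(out + i, m, _mm512_add_ps(va, vb)); /* lanes with m=0 stay untouched */
    }
    for (; i < n; i++)  /* scalar tail */
        if (a[i] > 0.0f)
            out[i] = a[i] + b[i];
}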
 
FireGryphon
Gold subscriber
Darth Gerbil
Posts: 7724
Joined: Sat Apr 24, 2004 7:53 pm
Location: the abyss into which you gaze

Re: Intel declares bankruptcy (RIP 1968-2019)

Mon Jan 28, 2019 8:50 am

Threads like this remind me just how silly it is to be a fundamentalist, no matter your belief and no matter your cause.
Sheep Rustlers in the sky! <S> Slapt | <S> FUI | Air Warrior II/III
 
cegras
Gerbil First Class
Posts: 187
Joined: Mon Nov 05, 2007 3:12 pm

Re: Intel declares bankruptcy (RIP 1968-2019)

Mon Jan 28, 2019 8:57 am

chuckula wrote:
cegras wrote:
That's an intelligent thought? That's all publicly available information. Dragontamer and Waco were having an interesting conversation, don't let Chuckula leach credit off of it.


What have you ever added to anything... including "public information."


1) A neurotic, chaotic, vitriolic take on publicly available information

2) The collapse of your strawman, because I own an i5-9600K and have publicly advocated for its supremacy as a 144 Hz CPU

3) I've never had a conversation about AVX-512 with you, but you act like we have

4) You've put up this thread title because you've lost touch with reality

You need help, or a 1-month gag.
 
cegras
Gerbil First Class
Posts: 187
Joined: Mon Nov 05, 2007 3:12 pm

Re: Intel declares bankruptcy (RIP 1968-2019)

Mon Jan 28, 2019 9:31 am

NTMBK wrote:
chuckula wrote: [snip]

Except, as even Ian admits, this sort of code would run much better on a GPU... and any PhD doing the same today would probably write it in CUDA instead of screwing around with the Intel compiler trying to generate AVX-512 binaries. Even the on-die GPU would probably run it better if you fed it OpenCL - if it were actually functional. Speaking of which...

The damn thing doesn't have a functional GPU, and Intel dumped it into the tiniest niche Chinese market they could find, just to tell investors that 10nm was "launched". Intel refuses to even sell it to the rest of the world. They didn't acknowledge its existence on Ark until journalists actually managed to get their hands on one. It's a failed chip.


To be fair, AVX512 looks amazing. I do a lot of computational linear algebra for work, and for now there are still many more CPU cores available to me than GPUs.
 
Concupiscence
Silver subscriber
Gerbil Elite
Posts: 702
Joined: Tue Sep 25, 2012 7:58 am
Location: Dallas area, Texas, USA
Contact:

Re: Intel declares bankruptcy (RIP 1968-2019)

Mon Jan 28, 2019 2:47 pm

cegras wrote:
NTMBK wrote: [snip]

To be fair, AVX512 looks amazing. I do a lot of computational linear algebra for work, and for now there are still many more CPU cores available to me than GPUs.


It does great heavy lifting for video encodes on my 7940X. Ice Lake's AVX-512 implementation is supposed to have a meaningfully expanded feature repertoire over what Skylake-X already enjoys, so for the right workloads it could be very impressive. I'm skeptical about speed improvements in the rest of the architecture, given Intel's post-Haswell IPC track record outside of AVX, but time's gonna tell soon enough.
Media: Core i9 7940x, 32 gigs RAM, RX Vega 56, Win10 Pro
Science: Ryzen 7 1700, 16 gigs RAM, [...], Xubuntu 18.04 [offline]
Server: Xeon E5-4640, 32 gigs ECC RAM, GTX Titan Xm, Win10 Pro

Read my words at https://www.wallabyjones.com/
 
dragontamer5788
Gerbil Elite
Posts: 529
Joined: Mon May 06, 2013 8:39 am

Re: Intel declares bankruptcy (RIP 1968-2019)

Mon Jan 28, 2019 4:37 pm

AVX512 does indeed look great, but I do wonder if it's too little, too late. The main issue is the programming model: Intel is still pushing auto-vectorization hard, but it's clearly inferior to SIMD models like CUDA, OpenCL, or AMD's "HCC". Most programmers have to dip down into assembly language or intrinsics to use AVX512 effectively. Consider this toy problem: how would you write a sorting network (like bitonic sort, or maybe even-odd sort) using Intel's AVX512?

https://gist.github.com/mre/1392067

__global__ void bitonic_sort_step(float *dev_values, int j, int k)
{
  unsigned int i, ixj; /* Sorting partners: i and ixj */
  i = threadIdx.x + blockDim.x * blockIdx.x;
  ixj = i^j;

  /* The threads with the lowest ids sort the array. */
  if ((ixj)>i) {
    if ((i&k)==0) {
      /* Sort ascending */
      if (dev_values[i]>dev_values[ixj]) {
        /* exchange(i,ixj); */
        float temp = dev_values[i];
        dev_values[i] = dev_values[ixj];
        dev_values[ixj] = temp;
      }
    }
    if ((i&k)!=0) {
      /* Sort descending */
      if (dev_values[i]<dev_values[ixj]) {
        /* exchange(i,ixj); */
        float temp = dev_values[i];
        dev_values[i] = dev_values[ixj];
        dev_values[ixj] = temp;
      }
    }
  }
}
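
For context, the host-side driver in that same gist looks roughly like this (paraphrased, not copied verbatim; NUM_VALS, BLOCKS, and THREADS are its compile-time constants). The point is that the nested j/k loop stays on the host, with one kernel launch per step:

void bitonic_sort(float *values)
{
  float *dev_values;
  size_t size = NUM_VALS * sizeof(float);

  cudaMalloc((void**)&dev_values, size);
  cudaMemcpy(dev_values, values, size, cudaMemcpyHostToDevice);

  dim3 blocks(BLOCKS, 1);   /* number of blocks  */
  dim3 threads(THREADS, 1); /* threads per block */

  /* Major step k doubles each phase; minor step j halves within it. */
  for (int k = 2; k <= NUM_VALS; k <<= 1) {
    for (int j = k >> 1; j > 0; j >>= 1) {
      bitonic_sort_step<<<blocks, threads>>>(dev_values, j, k);
    }
  }

  cudaMemcpy(values, dev_values, size, cudaMemcpyDeviceToHost);
  cudaFree(dev_values);
}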


Chances are, your bitonic sort in AVX512 is going to be a lot harder to write than that. Granted, Intel has ISPC: https://ispc.github.io/ . But ISPC isn't as popular, and it isn't as clear how to mix-and-match code the way you can in CUDA or AMD's HCC.

-----------

The wildcard is OpenMP. If SIMD programming becomes easy with OpenMP 4.5, then that's great for Intel. But OpenMP doesn't seem as flexible as CUDA yet. Furthermore, OpenMP is making it easier to do device offload, so in the future maybe OpenMP will be all that is needed for heterogeneous compute. In that case OpenMP doesn't necessarily benefit AVX512's programming model, since it makes GPU programming that much easier too.
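
For reference, here's roughly what the OpenMP 4.5 version of "easy" looks like (a sketch under my own assumptions; saxpy is just a stand-in loop). The same loop body either vectorizes on the CPU or offloads to a device, depending on one pragma:

#include <cstddef>

// CPU SIMD: ask the compiler to vectorize (AVX2 / AVX-512 / whatever it has).
void saxpy_simd(float a, const float *x, float *y, std::size_t n) {
    #pragma omp simd
    for (std::size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

// Device offload: same loop body, different pragma (OpenMP 4.x target directives).
void saxpy_offload(float a, const float *x, float *y, std::size_t n) {
    #pragma omp target teams distribute parallel for map(to: x[0:n]) map(tofrom: y[0:n])
    for (std::size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

Which cuts both ways: the day that pragma model matures, it helps GPU offload at least as much as it helps AVX512.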
 
cegras
Gerbil First Class
Posts: 187
Joined: Mon Nov 05, 2007 3:12 pm

Re: Intel declares bankruptcy (RIP 1968-2019)

Mon Jan 28, 2019 4:59 pm

I guess the question is this: I have access to the following compute nodes:

28 x Intel E5-2680v4 @ 2.4 GHz, 64 GB, 4 x Nvidia K80 GPU

I have a program that diagonalizes a large matrix and does a lot of other linear algebra. If I have to run this program 100 times, would I get the job done faster on the CPUs or on the GPUs - assuming I've made all the correct optimizations, especially in batching jobs across the CPU cores?

Unfortunately, given university upgrade cycles, I highly doubt we'll see an upgrade to these soon... I've been running code on Sandy Bridge since I arrived (1.5 years ago).
 
dragontamer5788
Gerbil Elite
Posts: 529
Joined: Mon May 06, 2013 8:39 am

Re: Intel declares bankruptcy (RIP 1968-2019)

Mon Jan 28, 2019 5:05 pm

cegras wrote:
I guess the question is this: I have access to the following compute nodes:

28 x Intel E5-2680v4 @ 2.4 GHz, 64 GB, 4 x Nvidia K80 GPU

I have a program that diagonalizes a large matrix and does a lot of other linear algebra. If I have to run this program 100 times, would I get the job done faster on the CPUs or on the GPUs - assuming I've made all the correct optimizations, especially in batching jobs across the CPU cores?

Unfortunately, given university upgrade cycles, I highly doubt we'll see an upgrade to these soon... I've been running code on Sandy Bridge since I arrived (1.5 years ago).


GPUs were explicitly designed to run matrix multiplications. Pretty much anything involving dense linear algebra is better done on a GPU.
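
Concretely, the usual route is a vendor BLAS. A minimal sketch (my toy wrapper, assuming the matrices are already on the device; error handling and data movement omitted) of a double-precision multiply through cuBLAS:

#include <cublas_v2.h>

// C = A * B for n x n column-major matrices resident in GPU memory.
void gpu_dgemm(cublasHandle_t handle, int n,
               const double *dA, const double *dB, double *dC)
{
    const double alpha = 1.0, beta = 0.0;
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n,
                &alpha, dA, n,
                dB, n,
                &beta, dC, n);
}

Diagonalization goes through cuSOLVER's symmetric eigensolvers the same way; the hard part is keeping the card fed with data, not the math.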
 
cegras
Gerbil First Class
Posts: 187
Joined: Mon Nov 05, 2007 3:12 pm

Re: Intel declares bankruptcy (RIP 1968-2019)

Mon Jan 28, 2019 6:11 pm

I get that, but will the speed of running on only 4x K80s offset access to 28 cores that I can partition across my jobs? I can see a scenario where access to lots of AVX512 cores is more cost-effective than an equivalent number of GPUs. Granted, this question would be moot if I could access XSEDE or any big government cluster, which are almost all GPUs.

Also, sometimes I need 50+ GB of RAM, which unfortunately means that I need to use the big-memory node:

28 x Intel E5-2680v4 @ 2.4 GHz, 512 GB
 
dragontamer5788
Gerbil Elite
Posts: 529
Joined: Mon May 06, 2013 8:39 am

Re: Intel declares bankruptcy (RIP 1968-2019)

Mon Jan 28, 2019 6:56 pm

cegras wrote:
I get that, but will the speed of running on only 4x K80s offset access to 28 cores that I can partition across my jobs? I can see a scenario where access to lots of AVX512 cores is more cost-effective than an equivalent number of GPUs. Granted, this question would be moot if I could access XSEDE or any big government cluster, which are almost all GPUs.


The K80 has 4992 CUDA cores per board. If you really have 4x K80s (woah!!), that's 19,968 CUDA cores on your machine.

A "CUDA core" isn't the same as a CPU core. A CUDA core is roughly equivalent to one lane of AVX512. So an apples-to-apples comparison is that your 28-core Intel has ~448 SIMD lanes at its disposal (each AVX512 register is effectively 16 SIMD lanes: 28 * 16 == 448). Another "advantage" for Intel is that it offers multiple vector pipelines per CPU core, so call it ~1344 SIMD lanes if we assume 3x pipelines. Finally, the Intel runs at ~3 GHz while the NVidia K80 is older tech at only 560 MHz. That's roughly a 6x clock advantage, so I'll rate the Intel machine at around 8064 CUDA-core-equivalents or so.

Just some napkin math. But in any case, you can see that the K80 is an extremely wide, parallel system. AVX512 was invented by Intel to address NVidia's dominance in SIMD compute capabilities. In practice, the NVidia has far faster RAM and shared-memory capabilities, which are extremely beneficial to matrix multiplication. Honestly, I'd expect even just one K80 to beat the Intel, even with AVX512.

Also, sometimes I need 50+ GB of RAM, which unfortunately means that I need to use the big-memory node.


That's the big issue. 50 GB is more than a K80 can hold. You'll need to split the memory up and stream it in appropriately; otherwise the K80 will sit idle while waiting for all of that data to arrive.
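
What "stream it appropriately" usually means in CUDA is chunking plus overlapping copies with compute. A sketch with made-up sizes and a stand-in kernel (the host buffer should be pinned via cudaHostAlloc for the async copies to truly overlap):

#include <cuda_runtime.h>

__global__ void process(float *data, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;   /* stand-in compute */
}

void stream_through_gpu(const float *host, size_t total) {
    const size_t CHUNK = 1 << 26;  /* 64M floats = 256 MB per chunk */
    cudaStream_t s[2];
    float *dev[2];
    for (int i = 0; i < 2; i++) {
        cudaStreamCreate(&s[i]);
        cudaMalloc(&dev[i], CHUNK * sizeof(float));
    }
    /* Ping-pong between two buffers so chunk i+1 uploads while chunk i computes. */
    for (size_t off = 0, i = 0; off < total; off += CHUNK, i++) {
        size_t n = (total - off < CHUNK) ? (total - off) : CHUNK;
        int b = (int)(i & 1);
        cudaMemcpyAsync(dev[b], host + off, n * sizeof(float),
                        cudaMemcpyHostToDevice, s[b]);
        process<<<(unsigned)((n + 255) / 256), 256, 0, s[b]>>>(dev[b], n);
    }
    for (int i = 0; i < 2; i++) {
        cudaStreamSynchronize(s[i]);
        cudaFree(dev[i]);
        cudaStreamDestroy(s[i]);
    }
}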
 
synthtel2
Gerbil Elite
Posts: 956
Joined: Mon Nov 16, 2015 10:30 am

Re: Intel declares bankruptcy (RIP 1968-2019)

Mon Jan 28, 2019 8:38 pm

There are a lot of other factors that could affect it, but to sanity-check / ballpark that: an E5-2680v4 is about 1 TFLOPS and a K80 is 7 or 8, and the E5 has 50-75 GB/s of memory bandwidth while the K80 has 480. 28 of the E5s are roughly similar to 4 K80s in raw FPU power.
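
(Back-of-the-envelope for that ~1 TFLOPS figure, under my own assumptions of 14 cores at the 2.4 GHz base clock, each sustaining two 8-wide single-precision AVX2 FMAs per cycle:)

14 cores x 2.4 GHz x 2 FMA ports x 8 SP lanes x 2 FLOPs per FMA ~= 1.08 TFLOPS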

A 28C SKL-X doubles both the core count and the SIMD width of the E5s you've got now, coming in at about half a K80 (less in memory bandwidth), but if you have choices in modern hardware, there are plenty of more recent GPUs that'll blow a K80 or any CPU out of the water.
 
chuckula
Gold subscriber
Minister of Gerbil Affairs
Topic Author
Posts: 2109
Joined: Wed Jan 23, 2008 9:18 pm
Location: Probably where I don't belong.

Re: Intel declares bankruptcy (RIP 1968-2019)

Mon Jan 28, 2019 8:47 pm

From the discussion above, I can appreciate that many uses of AVX-512, especially the early ones, are implementing highly parallelizable linear algebra. However, I personally find the standard assumption that AVX-512 is "just like old AVX but with MOAR bits/registers" to be the least interesting thing about it, given what AVX-512 is truly capable of doing. That's partly Intel's fault due to marketing, and partly history, because the first commercial implementations of AVX-512 were in GPU-competitor products that were mostly advertised as doing the same basic things GPUs do (minus actually making graphics).

Frankly, if you give me a few terabytes of data representing vectors and you want me to barf up their normals with perfect parallelism because they have no data dependencies - or perform other large-scale linear algebra without data dependencies - I'd recommend a GPU over a CPU. Sure, you can get the job done with AVX-512 or older flavors of AVX [see how a small change in Anand's code produced big results without radically altering the entire code base], but that's not what makes these instructions interesting to me personally. In some ways a standardized CPU instruction set is probably better for a quick-n-dirty implementation than going the CUDA or OpenCL route, but for the right type of job the CPU isn't going to win, assuming you code the GPU properly, just as a GPU isn't going to beat a CPU in the CPU's own domain. In other words, the whole "BUT MUH GPU CAN DO A TRILLION CROSS-PRODUCTS FASTER" line is probably true, but also just as irrelevant as claiming that doing cross-products faster means you want your GPU parsing XHTML documents or running your database.

Here's what makes AVX-512 interesting: despite the fact that it has its roots in highly parallelized math (floating-point linear algebra in particular), the modern AVX-512 implementations are an *extremely* powerful general-purpose instruction set that really unleashes CPU power in a wide range of tasks that GPUs are intentionally designed *not* to handle well. In other words, to me the instructions like bit permutation, ternary logic, and blending, and the use of the op mask registers to handle conditionals, are far more interesting than just doing an FMA over vectors of floating-point values (which is still useful of course, just not the point of this post).

Based on one of the questions about bitonic sorting, here's a trivial case-in-point from a recent academic paper: let's do quicksort massively faster using AVX-512! Anybody who's made it through an undergrad data structures & algorithms class will remember that bad boy with the partitioning & pivoting (bonus points if you remember why it's actually pretty lousy on almost-sorted data).

The paper goes into quite a bit of interesting detail about how they implemented their particular version of the algorithm with AVX-512, including the op mask registers that I'll address in further detail below, but the rather unexpected upshot is this:
In this paper, we introduced new Bitonic sort and a new partition algorithm that have been designed for the AVX-512 instruction set. These two functions are used in our Quicksort variant which makes it possible to have a fully vectorized implementation (at the exception of partitioning tiny arrays). Our approach shows superior performance on Intel SKL in all configurations against two reference libraries: the GNU++ STL, and the Intel IPP. It provides a speedup of 8 to sort small arrays (less than 16 SIMD-vectors), and a speedup of 4 and 1.4 for large arrays, against the C++ STL and the Intel IPP, respectively. These results should also motivate the community to revisit common problems, because some algorithms may become competitive by being vectorizable, or improved, thanks to AVX-512’s novelties. Our source code is publicly available and ready to be used and compared. In the future, we intend to design a parallel implementation of our AVX-512-QS, and we expect the recursive partitioning to be naturally parallelized with a task-based scheme on top of OpenMP.


As you'll notice above, the biggest speedup, a factor of 8, comes on comparatively small arrays. You know, the types of arrays that CPUs need to churn through millions or billions of times a day in boring, non-exotic, non-GPU-accelerated workloads that many people think have hit a hard performance wall on standard CPUs. A 40% performance boost on huge arrays is nothing to sneeze at either, but it's the small stuff that tends to bog down everyday tasks more than the huge jobs. Even if your GPU has an amazing parallelized sort algorithm (and given how comparison-heavy sorting is, that's unlikely), it's probably fine for sorting a huge array of a billion+ elements, but the overhead of just initializing the GPU to sort a small array of, say, 128 elements is going to completely swamp the actual compute time and render the GPU useless.

You might also say: it doesn't take that long to sort a few hundred elements on a regular CPU now! And you'd be right, but take that algorithm and add it to all the other plain, ordinary tasks that can be sped up considerably using AVX-512, and all of a sudden boring things like web browsers, XML parsers, database index engines, etc. can gain major performance boosts just by taking advantage of the resources that have become available in modern x86.

A few other things make AVX-512 a whole lot more interesting than most people think. For starters, the "512" in AVX-512 never has to result in slower code. Oh, I know you saw some Cloudflare article (from the same guys who promised us ARM servers ruling the cloud by now) saying that AVX-512 is SLOW because your CPU downclocks to like... 100 MHz whenever you use a single instruction! WAAH! Well, like any tool you can misuse AVX-512, but the Cloudflare guys didn't mention two major features:

1. AVX-512 on Skylake-X does downclock for full-bore FMA-heavy code (basically GPU-esque stuff), but a wide range of useful AVX-512 instructions don't require big downclocks, even running full-width.
2. AVX-512 never requires you to use the big 512-bit registers if you are scared of them. What? Yeah, that's right. AVX-512 can be a set of 32 general-purpose vector registers operating on 128-bit or 256-bit vectors if you feel like it, with full support for all of the instructions that AVX-512 provides. That's not just "oh, I can drop back to SSE"; that's "I can do every crazy thing AVX-512 allows, but if I'm freaked out by low clocks on 512-bit vectors I can run full-turbo with 128-bit vectors if that performs better". That's the AVX-512VL feature set, which is standard on Skylake-X and, probably more importantly, on Ice Lake. The VL features aren't new instructions per se; they just let you pick & choose the vector length that works best for whatever data you want to process.
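
A minimal sketch of what VL buys you (my example, nothing official): full op-mask semantics on plain 256-bit YMM registers, so none of the 512-bit frequency-license concerns apply:

#include <immintrin.h>

// AVX-512VL: masked add at YMM width. Lanes whose mask bit is 0 keep the
// value from 'src' - the full AVX-512 masking model on 256-bit vectors.
__m256i masked_add_vl(__m256i src, __m256i a, __m256i b) {
    __mmask8 m = _mm256_cmpgt_epi32_mask(a, _mm256_setzero_si256());
    return _mm256_mask_add_epi32(src, m, a, b);
}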

The next thing that REALLY sets AVX-512 apart: traditionally, your SIMD code and your comparison-heavy, branchy code didn't get along very well. SIMD was for performing a bunch of math on vectors without focusing too much on what's happening inside the vectors. Well, AVX-512 introduces op mask registers and corresponding mask instructions that let you bake conditionals into your vector processing. In a very incomplete nutshell, you can use the mask registers and corresponding instructions for fine-grained control over which vector operations are performed on which elements of each input vector. The following article provides some practical examples of accelerating database operations, including a cool discussion of vectorized hash tables and B+ tree searches: https://medium.com/@vaclav.loffelmann/v ... 59ce59abd3
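
To illustrate (my sketch, not from the linked article): a per-element if/else becomes one compare that produces a mask, plus a blend, with no branches in the hot loop:

#include <immintrin.h>

// out[i] = (a[i] < b[i]) ? a[i] + b[i] : a[i] - b[i], 16 ints at a time.
void add_or_sub(int *out, const int *a, const int *b, int n) {
    int i = 0;
    for (; i + 16 <= n; i += 16) {
        __m512i va  = _mm512_loadu_si512(a + i);
        __m512i vb  = _mm512_loadu_si512(b + i);
        __mmask16 m = _mm512_cmplt_epi32_mask(va, vb);  /* the "if" */
        __m512i sum = _mm512_add_epi32(va, vb);
        __m512i dif = _mm512_sub_epi32(va, vb);
        /* take 'sum' where the mask bit is set, 'dif' elsewhere */
        _mm512_storeu_si512(out + i, _mm512_mask_blend_epi32(m, dif, sum));
    }
    for (; i < n; i++)  /* scalar tail */
        out[i] = (a[i] < b[i]) ? a[i] + b[i] : a[i] - b[i];
}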

There are more features than I've mentioned here, including conflict-detection instructions that make it easier for compilers to autovectorize, not to mention the specialty instructions for AI (VNNI) and encryption (GFNI for Galois fields in S-box operations, plus additional AES instructions), but I've touched on the big hitters. The point I hope I'm making is that AVX-512 is *not* some awesome new paradigm because now I can do 64-bit FMAs eight at a time instead of four at a time. That's cute, but it's really not worth getting too excited about. I'm excited because AVX-512 is finally taking SIMD from being the weird step-child that sits in the corner for a few (albeit important) tasks, and turning it into a core feature that gets used all the time to push performance forward.

I'll leave you with an entire website showing some fascinating real-world algorithms, including many that take advantage of AVX2 & AVX-512: http://0x80.pl/articles/ A few of my personal favorites are using AVX-512 for base64 encode/decode (hello, AVX-512 suddenly becoming extremely useful in every web and mail server that handles MIME) and even the basic task of substring searching.
Another great resource is the famous Agner Fog (https://www.agner.org/ ). He even has a C++ vector class library with a wide range of vectorized operations baked in. Even if you don't want to use it, the manual is a nice read: https://www.agner.org/optimize/vectorclass.pdf
Of course, Intel has its own documentation, but sometimes watching third parties wrestle with the technology is more educational.
4770K @ 4.7 GHz; 32GB DDR3-2133; Officially RX-560... that's right AMD you shills!; 512GB 840 Pro (2x); Fractal Define XL-R2; NZXT Kraken-X60
--Many thanks to the TR Forum for advice in getting it built.
 
dragontamer5788
Gerbil Elite
Posts: 529
Joined: Mon May 06, 2013 8:39 am

Re: Intel declares bankruptcy (RIP 1968-2019)

Tue Jan 29, 2019 3:01 am

chuckula wrote:
Here's what makes AVX-512 interesting: despite the fact that it has its roots in highly parallelized math (floating-point linear algebra in particular), the modern AVX-512 implementations are an *extremely* powerful general-purpose instruction set that really unleashes CPU power in a wide range of tasks that GPUs are intentionally designed *not* to handle well. In other words, to me the instructions like bit permutation, ternary logic, and blending, and the use of the op mask registers to handle conditionals, are far more interesting than just doing an FMA over vectors of floating-point values (which is still useful of course, just not the point of this post).


First off: bit permutation is probably the greatest instruction Intel ever created. It's awesome, and I wish more computers implemented pdep and pext.
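
(For anyone who hasn't met them - a toy example of mine, not part of the argument: pext gathers the bits selected by a mask down into the low bits, and pdep scatters them back out. One instruction each on Haswell and later:)

#include <immintrin.h>
#include <cstdint>
#include <cstdio>

int main() {
    uint64_t v    = 0x0A5F;  /* nibbles: A, 5, F           */
    uint64_t mask = 0x0F0F;  /* select the A and F nibbles */
    uint64_t packed   = _pext_u64(v, mask);       /* -> 0xAF  */
    uint64_t restored = _pdep_u64(packed, mask);  /* -> 0xA0F */
    std::printf("%llx %llx\n", (unsigned long long)packed,
                               (unsigned long long)restored);
    return 0;
}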

But outside of that, NVidia GPUs have been doing these things for years. AMD GPUs could do them too, except you had to, erm... write GCN assembly. Sooo... no one really did it on AMD GPUs. The "op mask" is, like, 2008-era GPU stuff. It's Intel finally catching up to the last decade.

AMD and NVidia GPUs also still have a "shared memory" segment which supports something like 32x 32-bit load/store operations across any vector register. AMD even has "DPP" instructions that let you transfer and permute values between GPU lanes (great for implementing reductions, scans, and sorting networks). The overall capability of these memory operations on NVidia and AMD GPUs is roughly equivalent to gather/scatter, except it's implemented extremely quickly at the register level, with absolutely ridiculous bandwidth.

Even Intel Ice Lake only has 2 load / 2 store units per clock tick, and therefore cannot perform gather/scatter as quickly. The closest equivalent Intel has in AVX512, in my experience, is VPSHUFB (effectively a "gather" applied to the 64 bytes of a ZMM register), but VPSHUFB is far weaker than the register-movement operators of NVidia PTX (aka shfl.sync) or AMD GCN (DPP, ds_permute, ds_bpermute, ds_swizzle).

And once again: note that every vector unit of an NVidia or AMD GPU has a load/store unit of its own into CUDA shared memory or AMD LDS. Arbitrary movement between "lanes" is extremely efficient (as long as "bank conflicts" do not occur).

The fact of the matter is: even with AVX512, Intel is still far behind the capabilities of both NVidia and AMD. The op masks of AVX512 are a great step forward, but they're not as flexible as what GPUs offer. Note that NVidia GPUs can diverge on an individual lane-by-lane basis starting with Volta.

[image: NVidia's diagram of Volta's independent thread scheduling]

This means that GPU lanes can now implement mutexes and semaphores and run independently of other lanes (if necessary). NVidia's SIMD cores are incredibly advanced, and are getting damn close to a traditional CPU core (albeit in-order... but NVidia has a kick-ass architecture for sure).

But even AMD's GCN, which is behind NVidia's Volta / Turing architecture, has a superior design to Intel's AVX512 here. AMD GCN's LDS (which is functionally equivalent to NVidia's shared memory) is an arbitrary crossbar that supports any communication pattern across all running wavefronts. It is functionally equivalent to VPSHUFB, except across all 64 lanes of an AMD GPU (aka 2048 bits). Oh, and it works in both directions: it can "scatter" too (VPSHUFB is "only" a gather).

Based on one of the questions about bitonic sorting, here's a trivial case-in-point from a recent academic paper: let's do quicksort massively faster using AVX-512!


I hate to burst your bubble, but there's nothing in there that GPUs can't do. I'm firmly of the opinion that AVX512 is a great step forward for Intel, but it seriously is only "catching up" to GPGPU technology. Intel has great engineers, but they don't understand SIMD architecture the way the GPU community does.

I'll leave you with an entire website showing some fascinating real-world algorithms including many that take advantage of AVX2 & AVX-512: http://0x80.pl/articles/


Indeed, the man is a great assembly programmer. But I'm glad you pointed that website out first, because it goes to show just how far behind AVX512 is.

Consider: http://0x80.pl/notesen/2019-01-05-avx51 ... paces.html

This is the "common XML application" of removing spaces from text. A very common lexing step that you would assume a CPU does better than a GPU. But not so fast. Lets look at how GPUs do it instead: http://www.cse.chalmers.se/~uffe/streamcompaction.pdf

In particular, look at these visualized steps:

[image: the paper's visualization of the stream-compaction steps]

GPU Pseudocode:

[image: the paper's GPU pseudocode]

The GPU implementation is simply far cleaner than anything I've seen from AVX512 programmers so far. This is because the gather/scatter step is implemented in LDS memory (which, on GPUs, has the unique ability to perform a load/store from every individual SIMD lane). The equivalent would be if AVX512 had 16 load/store units that could each operate once per clock cycle.

Bonus points: this paper is from 2009 and was implemented on a GTX 280. This is the kind of stuff GPUs were doing literally 10 years ago, and I still have issues writing the equivalent code with AVX512.

EDIT: VPSCATTERDD is a correct answer, but it unfortunately runs very slowly on Skylake-X. A VPSCATTERDD off of a ZMM register becomes 44 uops and only achieves a throughput of once every 17 clock cycles. See Agner Fog's instruction tables for details. In contrast, GPU LDS memory runs at full speed as long as no bank conflicts arise (and there will be no bank conflicts in the above code). I tried to build an equivalent using vpshufb to avoid the L1 memory write, but I couldn't get any vpshufb version as efficient as the GPU code. The 0x80 webpage managed to find a methodology using pdep and pext, but you leave the vectorized world to use those instructions.
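
For what it's worth, AVX512F does have a dedicated left-packing primitive for 32-bit lanes, VPCOMPRESSD; the byte-granularity version (VPCOMPRESSB) only arrives with the VBMI2 extension, which Skylake-X lacks, so for 8-bit text you're stuck with the workarounds above. A minimal dword sketch (my example, function name made up):

#include <immintrin.h>

// Compact one vector of 16 ints: keep only the nonzero elements,
// append them to *out, and return the advanced output cursor.
int *compact_nonzero(int *out, const int *in) {
    __m512i v   = _mm512_loadu_si512(in);
    __mmask16 m = _mm512_test_epi32_mask(v, v);   /* lanes where v != 0 */
    _mm512_mask_compressstoreu_epi32(out, m, v);  /* left-pack + store  */
    return out + _mm_popcnt_u32((unsigned)m);     /* advance by count   */
}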

----------

I mean, "op masks" are cute and all. But I don't think AVX512 supports fully divergent SIMD code the way NVidia and AMD GPUs do.

[image: diagram of divergent SIMD control flow on a GPU]

It's like, yeah, op masks are cool and all, but that's, like... soooo 2005. GPUs have been handling far more complicated cases for the past decade. Yeah, this stuff can be emulated on Intel's AVX512, but the important operations are hardware-accelerated on GPUs. So IMO, AVX512 is still a bit behind a GPU when it comes to SIMD-based control flow.
