Here's what makes AVX-512 interesting: despite its roots in highly parallel math (floating-point linear algebra in particular), modern AVX-512 implementations form an *extremely* powerful general-purpose instruction set that unlocks the CPU for a wide range of tasks that GPUs are intentionally designed *not* to handle well. In other words, to me, instructions like bit permutation, ternary logic, and blending, along with the opmask registers for handling conditionals, are far more interesting than just doing an FMA over vectors of floats (which is still useful of course, just not the point of this post).
First off: bit permutation is probably the greatest instruction family Intel ever created. It's awesome, and I wish more computers implemented pdep and pext.
But outside of that, NVidia GPUs have been doing these things for years. AMD GPUs could do it too, except you had to, erm... write GCN assembly. Sooo... no one really did it on AMD GPUs. The "opmask" is 2008-era GPU tech, really. It's Intel finally catching up to last decade.
AMD and NVidia GPUs still have a "shared memory" segment which supports something like 32x32-bit load/store instructions across any vector register. AMD even has "DPP" instructions that let you transfer and permute values between GPU lanes (great for implementing reductions, scans, and sorting networks). The overall capabilities of these memory operations on NVidia and AMD GPUs are roughly equivalent to gather/scatter, except implemented extremely quickly at the register level, with absolutely ridiculous bandwidth.
Even Intel Ice Lake only has 2 load / 2 store units per clock tick, and therefore cannot perform gather/scatter as quickly. The closest equivalent Intel has in AVX512, in my experience, is VPSHUFB (effectively a "gather" applied to the 64 bytes of a ZMM register), but VPSHUFB is far weaker than the register-movement operators of NVidia PTX (aka shfl.sync) or AMD GCN (DPP, ds_permute, ds_bpermute, ds_swizzle).
And once again: note that every vector unit of NVidia or AMD has a load/store unit of its own into CUDA Shared or AMD LDS memory. Arbitrary movement between "lanes" is extremely efficient (as long as "bank conflicts" do not occur).
The fact of the matter is: even with AVX512, Intel is still far behind the capabilities of both NVidia and AMD. The opmasks of AVX512 are a great step forward, but they're not as flexible as what GPUs offer. Note that NVidia GPUs can now diverge on an individual lane-by-lane basis starting with Volta.
This means that GPU lanes can now implement mutexes and semaphores and run independently of other lanes (if necessary). NVidia's SIMD cores are incredibly advanced, and are getting damn close to a traditional CPU core (albeit in-order... but NVidia has a kick-ass architecture for sure).
But even AMD's GCN... which is behind NVidia's Volta / Turing architectures... has a superior design to Intel's AVX512. AMD GCN's LDS (which is functionally equivalent to NVidia's Shared Memory) is an arbitrary crossbar that supports any communication across all running wavefronts. It is functionally equivalent to VPSHUFB, except across all 64 lanes of an AMD GPU (aka 2048 bits). Oh, and it can also "scatter": it works in both directions (VPSHUFB is "only" a gather).
Based on one of the questions about bitonic sorting, here's a trivial case-in-point from a recent academic paper you can access here: "Let's do quicksort massively faster using AVX-512!"
I hate to burst your bubble, but there's nothing in there that GPUs can't do. I'm firmly of the opinion that AVX512 is a great step forward for Intel, but it seriously is only "catching up" to GPGPU technology. Intel has great engineers, but they don't understand SIMD architecture like the GPU community does.
I'll leave you with an entire website showing some fascinating real-world algorithms including many that take advantage of AVX2 & AVX-512: http://0x80.pl/articles/
Indeed, the man is a great assembly programmer. But I'm glad you pointed that website out first, because it goes to show just how far behind AVX512 is.
Consider: http://0x80.pl/notesen/2019-01-05-avx51 ... paces.html
This is the "common XML application" of removing spaces from text: a very common lexing step that you would assume a CPU does better than a GPU. But not so fast. Let's look at how GPUs do it instead: http://www.cse.chalmers.se/~uffe/streamcompaction.pdf
In particular, look at these visualized steps:
The GPU implementation is simply far cleaner than anything I've seen from AVX512 programmers so far. This is because the "gather/scatter" step is implemented in LDS memory (which, in GPUs, has the unique ability to perform a load/store from all individual SIMD lanes). The equivalent in AVX512 would be if AVX512 had 16 load/store units that could each operate once per clock cycle.
Bonus points: This paper is from 2009 and was implemented on a GTX 280. This is the kind of stuff GPUs were doing literally 10 years ago, and I still have issues writing the equivalent code on AVX512.
EDIT: VPSCATTERDD is a correct answer, but unfortunately runs very slowly on Skylake-X. A VPSCATTERDD off of a ZMM register becomes 44 uops and only achieves a throughput of once every 17 clock cycles. See Agner Fog's instruction tables for more details. In contrast, GPU LDS memory runs at full speed as long as no bank conflicts arise (and there will be no bank conflicts in the above code). I tried to build an equivalent using vpshufb to avoid the L1 memory write, but I couldn't get any kind of vpshufb as efficient as the GPU code. The 0x80 webpage managed to find a methodology using pdep and pext, but you leave the vectorized world to use those instructions.
I mean, opmasks are cute and all. But I don't think AVX512 supports fully divergent SIMD code like NVidia or AMD GPUs do. Opmask-style predication is... soooo 2005. GPUs have been handling far more complicated cases for the past decade. Yeah, this stuff can be emulated on Intel AVX512, but the important operations are hardware-accelerated on GPUs. So IMO, AVX512 is still a bit behind when it comes to SIMD-based control flow compared to a GPU.