Buub wrote:
dragontamer5788 wrote:
VLIW seems like a good idea, but in practice, even in the highly parallel GPU world, it's not really practical for the compiler to be making that decision. Modern GPUs are slightly "smarter" than VLIW and are kinda designed to be 10x SMT / hyperthreaded. So sure, there's a VLIW-like thing going on, with VPUs being allocated by the compiler. But the GPU is smart enough to realize when these cores aren't being fully utilized, and can quickly switch between threads to find more work.
"Seemed like a good idea" is the key here. It is impossible for the compiler to predict run-time behavior without a lot of additional information (for example PGO, which is itself imperfect).
VLIW failed because it was the wrong solution to the problem. IMHO, it was the opposite of the right solution.
Modern compilers are fantastic at optimizing code. People often don't realize just how much amazing "rocket science" there is in a modern compiler's optimizer. But, that said, they can only infer the programmer's intent and turn it into better code; they cannot predict runtime behavior, because there are so many external factors that affect the runtime path.
I would argue that VLIW failed because SIMD turned out to be a more practical way of reaching parallelism, and traditional registers a more practical way of describing it.
I'm sure that scalar x86 assembly is slower than Itanium. But SIMD / AVX2 code will knock your socks off. Skylake has 3 AVX2 pipelines, each of which does an 8x32-bit operation per clock cycle: that's 24 FLOPs per core per clock tick. GPUs go one step further and only implement the SIMD instruction set, to a fully ridiculous degree: most GPU code executes at a 32-wide SIMD level, 32 FLOPs per core per clock tick. And since these SIMD cores are very simplified, GPUs manage to fit more instruction pointers (Streaming Multiprocessors, in CUDA terms) than CPUs typically get cores.
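To make the AVX2 numbers concrete, here's a minimal sketch in C (my own example, not from the original discussion; assumes gcc or clang with -mavx2):

    #include <immintrin.h>

    /* One vaddps adds 8 packed 32-bit floats in a single instruction.
     * If the core can issue an operation like this on 3 vector ports
     * per cycle, you get the 3 x 8 = 24 ops/clock figure above. */
    void add8(const float *a, const float *b, float *out)
    {
        __m256 va  = _mm256_loadu_ps(a);     /* load 8 floats (unaligned is fine) */
        __m256 vb  = _mm256_loadu_ps(b);
        __m256 sum = _mm256_add_ps(va, vb);  /* 8 adds, one instruction */
        _mm256_storeu_ps(out, sum);
    }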
With numbers like that, Itanium's VLIW of 3 instructions per bundle just can't keep up. Even when later Itanium processors issue 4 bundles per clock tick (12 instructions per clock!!), that still trails SIMD. And one of the best SIMD cores is... well... x86 (after GPUs, of course. GPUs win at SIMD, but x86 seems like the fastest "normal" instruction set with SIMD implemented).
EDIT: And it turns out you can "cut dependencies" in traditional scalar assembly simply by using instructions like "xor eax, eax", which on modern x86 effectively plays the same role as the "stop" marker does on Itanium. So modern compilers basically learned how to put VLIW-like parallelism into normal code. I think a future VLIW instruction set could be written to take advantage of the lessons learned over the last 20 years, but it's more likely that SIMD GPUs will just keep getting bigger and better instead...
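For illustration, a minimal C sketch of that dependency-cutting idiom (the function and names are mine; it assumes a typical x86-64 compiler, which lowers each "= 0" to the xor zeroing idiom):

    #include <stdint.h>

    /* Two back-to-back reductions. A typical x86-64 compiler lowers
     * each "= 0" below to "xor reg, reg", which the register renamer
     * recognizes as having NO input dependency -- so the second loop's
     * dependency chain starts fresh instead of waiting on the first,
     * even if the compiler happens to reuse the same register. */
    uint32_t sum_both(const uint32_t *a, const uint32_t *b, int n)
    {
        uint32_t s = 0;                  /* typically: xor eax, eax */
        for (int i = 0; i < n; i++)
            s += a[i];

        uint32_t t = 0;                  /* xor again: cuts the chain */
        for (int i = 0; i < n; i++)
            t += b[i];

        return s + t;
    }

The point isn't the C itself: the renamer treats the zeroing idiom as a brand-new value, which is the same dependency-boundary job Itanium made explicit with its stop bits.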