Converged applications: LuxMark
One of AMD's goals for APUs going forward is to use the parallel computing power of the integrated graphics processor to assist the CPU cores where possible. Although GPU computing has taken off in specialized sectors like scientific computing and HPC, we are still in the early days of GPU computing for consumer applications. AMD has been making strides in persuading developers to use OpenCL to accelerate certain classes of applications, though, and it has supplied reviewers with a handful of programs to demonstrate the potential there.
These "accelerated" programs fall into several groups. Some of them are just video transcoders that make use of the dedicated encoding hardware built into new CPUs, features like Intel's QuickSync and AMD's HD Media Accelerator. We've recently taken a look at the hardware video encoding options on the PC, so you can read about them if you wish. However, the more interesting programs in our book don't just use dedicated custom logic; they employ real GPU computing, likely through the OpenCL API, to handle tasks previously reserved for the CPU cores.
We tried out accelerated versions of The GIMP image processor and WinZip compression in our review of Trinity's mobile variant, but the program we find most interesting to date is LuxMark, which uses OpenCL to tackle ray-traced rendering. Ray-tracing is a classic "embarrassingly parallel" application, so it's a good test case to demonstrate the potential of data-parallel compute hardware. Also, we've already incorporated LuxMark into our wider CPU suite, which includes a huge selection of chips, so we have ample context for the performance numbers it spits out.
LuxMark should do a nice job of harnessing the capabilities of new CPUs. Since OpenCL code is by nature parallelized and is compiled at run time by the driver for whatever device it finds, it adapts easily to new instructions. For instance, Intel and AMD both offer installable client drivers (ICDs) for OpenCL on x86 processors, and they both claim to support AVX. The AMD APP driver even supports Bulldozer's distinctive instructions, FMA4 and XOP.
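To make the run-time compilation point concrete, here's a small analogy in plain Python. It is not OpenCL and uses no OpenCL calls. The function name `build_kernel` and the notion of a device-reported `vector_width` are our own illustrative assumptions. The idea is that the program ships its kernel as source, and the code is generated and compiled on the target machine for whatever width that machine reports.

```python
def build_kernel(vector_width):
    """Generate and compile a scaled-add kernel unrolled to vector_width lanes.

    vector_width stands in for a capability a driver might query from the
    device (hypothetical here), e.g. 4 lanes for SSE or 8 lanes for AVX.
    Input lengths must be a multiple of vector_width.
    """
    lanes = ", ".join(
        f"a * x[i + {k}] + y[i + {k}]" for k in range(vector_width)
    )
    source = (
        "def saxpy(a, x, y):\n"
        "    out = []\n"
        f"    for i in range(0, len(x), {vector_width}):\n"
        f"        out.extend(({lanes},))\n"
        "    return out\n"
    )
    namespace = {}
    # The source is compiled here, on the machine that will run it.
    exec(compile(source, "<kernel>", "exec"), namespace)
    return namespace["saxpy"]


x = [float(i) for i in range(8)]
y = [1.0] * 8
narrow = build_kernel(4)  # as if the device reported 4-wide vectors
wide = build_kernel(8)    # as if the device reported 8-wide vectors
result_narrow = narrow(2.0, x, y)
result_wide = wide(2.0, x, y)
```

Both builds produce the same answer from the same source; only the generated code differs. That is roughly why an application written against OpenCL can pick up wider vector units without being rebuilt by its developer.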
We'll start with CPU-only results from a broad swath of processors. These results come from the AMD APP driver for OpenCL, since it tends to be faster on both Intel and AMD CPUs, funnily enough.
Using their CPU cores alone, the new Trinity APUs are only a smidgen faster than the chip they replace, the Llano-based A8-3850. Why? One reason is that the two "Piledriver" modules in Trinity have only one shared FPU each. Each of Llano's four cores has its own dedicated FPU, so although Trinity benefits from the extra-wide vector math enabled by its support for AVX instructions, it's not much faster than Llano.
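A back-of-the-envelope sketch shows how the FPU count can cancel out per-FPU gains. The pipe counts, the 128-bit pipe width, and the FLOPs-per-lane figures below are our assumptions based on commonly cited descriptions of the two architectures, not numbers measured in this review, and the sketch ignores clock speed and memory effects entirely.

```python
def peak_sp_flops_per_clock(fpus, pipes_per_fpu, pipe_bits, flops_per_lane):
    """Theoretical single-precision FLOPs per clock for a chip's FPUs.

    All inputs are assumptions supplied by the caller; this is simple
    multiplication, not a performance model.
    """
    lanes = pipe_bits // 32  # 32-bit floats per vector pipe
    return fpus * pipes_per_fpu * lanes * flops_per_lane


# Llano (assumed): four FPUs, one add pipe plus one multiply pipe each,
# 128 bits wide, one FLOP per lane per clock.
llano = peak_sp_flops_per_clock(
    fpus=4, pipes_per_fpu=2, pipe_bits=128, flops_per_lane=1
)

# Trinity (assumed): two shared FPUs, two fused multiply-add pipes each,
# 128 bits wide, two FLOPs per lane per clock when FMA is used.
trinity = peak_sp_flops_per_clock(
    fpus=2, pipes_per_fpu=2, pipe_bits=128, flops_per_lane=2
)
```

Under those assumptions the two chips land on the same theoretical peak per clock, and Trinity only reaches its figure when the code is dominated by fused multiply-adds. Real ray-tracing code won't be, so treat this as a ceiling rather than a prediction.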
Intel's Core i3-3225 is only a dual-core processor, but it has two FPUs and can track and execute four threads via Hyper-Threading, so the architectural similarities to Trinity are closer than you might think. The Core i3's FPUs support AVX, as well, and they achieve higher throughput than Trinity's, even though they don't use the fused multiply-add instruction. (FMA support is slated for Intel's next-gen Haswell chip.)
Without AVX or Hyper-Threading, the Pentium G2120 finishes dead last, well behind the A8-5600K.
Moving the workload over to the IGPs uniformly produces lower performance than the same processors achieve with only their CPU cores. The IGP in AMD's Trinity is substantially faster than Intel's HD 4000 graphics, but neither CPU's IGP can match its x86 cores.
If we invoke both the CPU cores and the IGPs at the same time, we see higher overall performance than with just one type of computing unit engaged—and the A10's combined throughput is ever so slightly higher than the Core i3-3225's. There's a hint of potential here; combined performance is roughly equal to the AMD FX-6200's, a chip with three Bulldozer modules.
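One way to picture mixed-mode rendering is as two devices pulling image tiles from a shared queue. The sketch below is a toy simulation under that assumption; it is not how LuxMark necessarily schedules its work, and the tile rates are made-up numbers rather than results from our testing. The names `simulate_render` and `rates` are ours.

```python
import heapq


def simulate_render(tile_count, device_rates):
    """Simulate devices pulling tiles from one shared queue.

    device_rates maps a device name to tiles per second (hypothetical).
    Returns (elapsed_seconds, tiles_completed_per_device).
    """
    # Each heap entry is (time the device becomes free, device name).
    free_at = [(0.0, name) for name in sorted(device_rates)]
    heapq.heapify(free_at)
    done = {name: 0 for name in device_rates}
    finish = 0.0
    for _ in range(tile_count):
        # The next tile goes to whichever device frees up first.
        now, name = heapq.heappop(free_at)
        end = now + 1.0 / device_rates[name]
        done[name] += 1
        finish = max(finish, end)
        heapq.heappush(free_at, (end, name))
    return finish, done


rates = {"cpu": 3.0, "igp": 2.0}  # made-up tiles/second, not measured
cpu_only, _ = simulate_render(600, {"cpu": rates["cpu"]})
igp_only, _ = simulate_render(600, {"igp": rates["igp"]})
both, split = simulate_render(600, rates)
```

In this idealized model the combined run finishes sooner than either device alone, and the faster device ends up with the larger share of tiles. The model has no scheduling overhead and no contention for memory bandwidth or power, so it represents a best case that real hardware may not reach.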
To give you a better sense of the prospects for mixed-mode computing, let's have a look at a much more capable GPU, the Radeon HD 7950, when driven by the various processors we've tested.
Now that's more like it. Moving some workloads over to a fast enough GPU can really pay off. The Radeon HD 7950 achieves more than twice the throughput of the Core i7-3770K's quad CPU cores, regardless of which processor is driving it. (The 7950 is somewhat faster when combined with Intel processors, likely because of their higher single-threaded performance.)
Of course, this GPU has its own fast, dedicated memory subsystem, so we're not just adding a whole truckload of FLOPS; we're adding bandwidth in support of those FLOPS. The discrete card also has its own rather substantial power envelope. Extracting additional performance out of the beefier IGPs of the future may run up against the power and memory-bandwidth limits of a CPU socket, limits that a discrete card doesn't face. That's especially true for applications that map well to GPUs and IGPs, since they tend to be very bandwidth- and power-intensive.
Here's what happens when we invoke the CPU cores and the Radeon HD 7950 together. Somewhat surprisingly, performance drops for most configurations, except for the recent Intel processors that can track eight threads or more. Apparently, the lower-end CPUs would be better off spending their time just acting in support of the discrete Radeon.