The Mali-T760 GPU in the Exynos 5433 is a full-featured graphics processor based on ARM's Midgard architecture. The Note 4 has given us our first chance to spend significant time with a Mali-equipped device, and our impressions are generally quite positive.
| | Samsung Galaxy Note 4 |
|---|---|
| SoC | Samsung Exynos 5433 |
| GPU | ARM Mali-T760 MP6 |
| GPU die area | 30.9 mm² |
| Est. clock speed | 700 MHz |
| Texture filtering | 6 texels/clock |
| Pixel fill | 6 pixels/clock |
| System memory | 3GB LPDDR3 |
Midgard is a bit of an unconventional GPU architecture in terms of its execution model, but it has a robust feature set, especially for the mobile space. The Mali-T760 supports 64-bit addressing and can handle IEEE-compliant floating-point datatypes, including 64-bit double precision. In fact, the Mali-T760 has a nearly desktop-class feature set, with support for DirectX 11.1 (feature level 11_1, or the real thing), OpenGL ES 3.1, and OpenCL 1.2.
Thanks to this combination of high mathematical precision and standards compliance, the T760 is perhaps better suited for GPU computing than most of its competition in the mobile GPU market. ARM was one of the first companies to join the HSA consortium, along with AMD, and has publicly supported that effort.
I'm tempted to call the Mali-T760 a "tiler," but I expect there are folks in the PowerVR camp who would take umbrage at that wording. ARM's Midgard architecture doesn't use fully deferred tile-based rendering like Imagination Tech's PowerVR GPUs. Midgard uses early Z detection, like conventional immediate-mode renderers, to avoid drawing some pixels that would be occluded by other polygons in the final scene, but it doesn't reorder the graphics pipeline to eliminate overdraw entirely. Instead, Midgard renders all pixels into on-chip buffers representing 16 x 16-pixel tiles, so blending and overdraw happen on the chip. Using a tile buffer in this fashion lets Midgard conserve bandwidth and save energy, since DRAM transactions tend to burn a lot of power.
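To put rough numbers on the tile-buffer idea, here's a toy sketch in Python. The 16-pixel tile edge comes from the description above. Everything else is an assumption for illustration: the 2560 x 1440 framebuffer (the Note 4's panel resolution, which this section doesn't state), the 2.5x overdraw factor, and the deliberately crude model in which a GPU without a tile buffer sends every shaded fragment out to DRAM. The function names are hypothetical.

```python
TILE_SIZE = 16  # tile edge in pixels, per the text


def tile_grid(width, height, tile=TILE_SIZE):
    """Return (columns, rows) of tiles needed to cover a framebuffer."""
    cols = -(-width // tile)   # ceiling division
    rows = -(-height // tile)
    return cols, rows


def framebuffer_pixel_writes(width, height, overdraw, on_chip_tiles):
    """Toy estimate of pixel writes that reach DRAM for one frame.

    With an on-chip tile buffer, overdraw and blending are resolved on
    the chip and each pixel is written out once. Without one, this crude
    model assumes every shaded fragment is written to memory.
    """
    pixels = width * height
    if on_chip_tiles:
        return pixels
    return round(pixels * overdraw)


# Assumed values, not figures from the review.
WIDTH, HEIGHT, OVERDRAW = 2560, 1440, 2.5

cols, rows = tile_grid(WIDTH, HEIGHT)
print("tiles:", cols, "x", rows, "=", cols * rows)
print("writes with tile buffer:   ", framebuffer_pixel_writes(WIDTH, HEIGHT, OVERDRAW, True))
print("writes without tile buffer:", framebuffer_pixel_writes(WIDTH, HEIGHT, OVERDRAW, False))
```

Real immediate-mode GPUs have caches and early Z rejection that soften this gap, so treat the ratio as a way to see the principle rather than a measurement.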
Divining the structure of Midgard's "shader cores" can be a little confusing. ARM's public documentation appears to divvy things up according to the number of flops the hardware can process rather than its likely underlying organization. Each Midgard "shader core" has two arithmetic pipelines. My sense is that those pipelines break down into two stages: the first stage includes a 128-bit-wide vector unit plus a scalar ALU. The second stage has a special function unit capable of a four-wide vector dot product, and there's a scalar ALU in this stage, too.
Midgard's shader pipelines are unusually flexible in their support for different datatypes. That 128-bit-wide vector unit can be subdivided in various ways. A single vector ALU can process two 64-bit operations, four 32-bit ops, or eight 16-bit operations in a clock cycle.
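The lane counts fall straight out of dividing the vector width by the element size. A minimal Python sketch, with a hypothetical helper name, just to make the arithmetic explicit:

```python
VECTOR_WIDTH_BITS = 128  # width of the stage-one vector unit, per the text


def lanes(element_bits, vector_bits=VECTOR_WIDTH_BITS):
    """Number of operations one vector ALU handles per clock at a given element width."""
    if element_bits <= 0 or vector_bits % element_bits != 0:
        raise ValueError("element width must divide the vector width evenly")
    return vector_bits // element_bits


for bits in (64, 32, 16):
    print(f"{bits}-bit elements: {lanes(bits)} ops per clock")
```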
If you're counting flops at home, the most relevant tally involves 32-bit operations. That big vector unit in the first stage can process four multiply + add operations (eight flops) and the scalar ALU can contribute a multiply (one flop). Stage two's SFU then contributes seven more flops, and the scalar ALU adds another, for a total of 17 from the pipeline in each clock cycle. Thanks to the dual pipes, that's a total of 34 flops per cycle from each "shader core."
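For those who'd rather check the tally in code, here's the same count as a short Python sketch. The constants mirror the figures above. Reading the SFU's seven flops as a four-wide dot product (four multiplies plus three adds) is my interpretation rather than something ARM has spelled out, and the names are made up for this example.

```python
# Per-pipeline, per-clock fp32 flops, following the tally in the text.
STAGE_ONE_VECTOR = 4 * 2   # four multiply-adds at two flops each
STAGE_ONE_SCALAR = 1       # one scalar multiply
STAGE_TWO_SFU = 4 + 3      # assumed: 4-wide dot product = 4 multiplies + 3 adds
STAGE_TWO_SCALAR = 1       # one more scalar op
PIPELINES_PER_CORE = 2


def flops_per_pipeline():
    """fp32 flops one arithmetic pipeline can issue per clock."""
    return STAGE_ONE_VECTOR + STAGE_ONE_SCALAR + STAGE_TWO_SFU + STAGE_TWO_SCALAR


def flops_per_core():
    """fp32 flops one "shader core" can issue per clock."""
    return flops_per_pipeline() * PIPELINES_PER_CORE


print("per pipeline:", flops_per_pipeline())
print("per shader core:", flops_per_core())
```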
ARM offers versions of the T760 that scale up to 16 shader cores, as depicted in the diagram above. The Exynos 5433 hosts a six-core version of the GPU known as the Mali-T760 MP6. Reports have suggested this GPU runs at 700 MHz in the Note 4.
If that clock speed is correct, then on paper, the Note 4's key graphics rates should look awfully similar to those of the PowerVR GX6450 in the iPhone 6 Plus. The two would have roughly similar pixel fill (~4 gigapixels/second) and bilinear texture filtering (~4 gigatexels/second) rates, and both devices' fp32 arithmetic throughput should peak at about 140 gigaflops. I hesitate to put those numbers into a table, since they're both theoretical and speculative in nature, but the story here should be rough parity.
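Here's how the Mali side of those estimates works out, as a small Python sketch. It takes the per-clock rates from the spec table and the 34 flops per core counted earlier, and it leans entirely on the reported (unconfirmed) 700 MHz clock. The function name is hypothetical.

```python
# Inputs from the spec table and the flop count in the text.
CLOCK_HZ = 700e6                 # reported clock speed, not confirmed
SHADER_CORES = 6                 # Mali-T760 MP6
FLOPS_PER_CORE_PER_CLOCK = 34
PIXELS_PER_CLOCK = 6             # whole GPU
TEXELS_PER_CLOCK = 6             # whole GPU


def peak_rates(clock_hz=CLOCK_HZ):
    """Theoretical peak rates in giga-units per second at the given clock."""
    return {
        "gigapixels_per_s": PIXELS_PER_CLOCK * clock_hz / 1e9,
        "gigatexels_per_s": TEXELS_PER_CLOCK * clock_hz / 1e9,
        "gigaflops_fp32": SHADER_CORES * FLOPS_PER_CORE_PER_CLOCK * clock_hz / 1e9,
    }


for name, value in peak_rates().items():
    print(f"{name}: {value:.1f}")
```

If the real clock turns out to be different, passing another value to `peak_rates` scales all three figures linearly.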
Of course, theoretical peaks are one thing, and delivered performance is another. We know the PowerVR GPUs tend to be very efficient in real applications with their given resources. The Midgard architecture also contains some provisions meant to boost efficiency, including a nifty transaction elimination feature that only updates the contents of framebuffer pixel blocks when they have changed from the prior frame.
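The basic idea behind transaction elimination can be sketched in a few lines of Python. This is a toy model, not ARM's implementation: the per-tile CRC signature, the `TileWriter` class, and its method names are assumptions made for illustration. The only thing taken from the description above is the behavior, which is to skip the framebuffer write when a block's contents match the prior frame.

```python
import zlib


class TileWriter:
    """Toy model: skip DRAM writes for tiles unchanged since the last frame."""

    def __init__(self):
        self.signatures = {}   # tile index -> signature of last written contents
        self.dram_writes = 0

    def submit(self, tile_index, pixels):
        """Write a rendered tile out unless its signature matches the prior frame.

        Returns True if the tile was written, False if the write was skipped.
        """
        signature = zlib.crc32(pixels)  # assumed signature function
        if self.signatures.get(tile_index) == signature:
            return False
        self.signatures[tile_index] = signature
        self.dram_writes += 1
        return True


writer = TileWriter()

# Frame 1: four tiles, all new, so all four get written.
frame1 = [bytes([i]) * 16 * 16 * 4 for i in range(4)]
for index, tile in enumerate(frame1):
    writer.submit(index, tile)
writes_after_frame1 = writer.dram_writes

# Frame 2: only tile 2 changes, so only one more write goes out.
frame2 = list(frame1)
frame2[2] = bytes([99]) * 16 * 16 * 4
for index, tile in enumerate(frame2):
    writer.submit(index, tile)
writes_after_frame2 = writer.dram_writes

print(writes_after_frame1, writes_after_frame2)
```

A scheme like this pays off most when large parts of the screen are static, as with UI elements or a paused video.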
The story of rough parity we outlined above plays out as expected in the fill rate test, but the Note 4 trails the 6 Plus by quite a bit in GFXBench's ALU test. I'm not sure exactly what this directed ALU test does or whether it takes advantage of the dot-product unit in the Midgard shader core, so I wouldn't read too much into these results.
The Note 4 also trails the 6 Plus in this graphics-specific test of alpha blending capacity. Then again, so does the Tegra K1 chip in the Shield Tablet, and we know it's one of the faster mobile GPUs on the planet.
As I understand it, this benchmark attempts to measure driver overhead by issuing a draw call, changing state, and doing it again, over and over. Performance in this test may end up being gated by CPU throughput as much as anything else.