The warm-up: Memory subsystem performance
Since the TLB erratum workaround is likely to affect the performance of the memory subsystem most directly, we'll start by looking at some synthetic memory benchmark results. Note that performance results for these tests don't translate directly into real-world application performance. These more focused benchmarks are instead designed to stress the memory subsystem, including CPU caches and main memory.
This first test lets us whip out a fancy-looking graph. Bear with us as we show off briefly.
I hesitated to include these results because, well, the graph is really hard to read. The problem? All three of the Phenom CPU configs running at 2.3 GHz overlap almost entirely. I do think that's enlightening, though, because it shows us that L2 and L3 cache bandwidth don't appear to be affected by the TLB workaround. We can isolate the larger 1 GB block size in this test, though, and see a more tangible difference.
Here, the TLB patch's impact begins to be apparent. It's even more apparent in Sandra's Stream-like test of main memory bandwidth. Have a look.
Ow. Memory bandwidth drops from roughly 5.4 GB/s without the TLB patch to 3.65 GB/s with it.
Of course, bandwidth is only part of the story. Let's look at memory access latencies, as well.
The TLB patch exacts an access latency penalty of 40 nanoseconds at our sample block size. We can look at the latency picture more fully with a terrifying-but-useful 3D graph.
In these graphs, yellow represents L1 cache, light orange is L2 cache, and dark orange is main memory. What you're seeing here is memory access latencies at various block and step sizes, in a way that exposes latency for the various stages in the memory hierarchy.
These latency graphs demonstrate the same basic principle that the bandwidth graphs did. The TLB erratum workaround exacts its performance penalty by slowing main memory performance. The actual impact on memory performance is much greater than the 10% figure we've seen floating around, but the patch's effect on application performance will depend on whether the app's working data set can fit into the CPU's on-chip caches. The more often an application must reach into main memory, the greater the impact of the erratum workaround will be.