The big thing it tests that I haven't seen elsewhere is random-access throughput, as opposed to random-access latency. There are also some more common latency and bandwidth tests, some bandwidth scaling tests, and mixed-workload tests.
Source and binaries are here. It's good for at least Linux x64 and Windows x64, and probably quite a bit beyond that.
==== ==== ==== ==== ==== ==== ==== ====
On Linux, AVX2 version, R7 1700 stock-clocked, DDR4-2666 14-14-14-14-34-48, tRRD/tFAW 4-5-16, bank group swap disabled:
Code:
[ using 16 threads ]
==============================================================================
testing various working set sizes...
==============================================================================
  WS/thread      MB/s   Mops/s     ns
-------------------------------------
 1048576kB:                       116
  524288kB:                       109
  262144kB:                       110
  131072kB:     37794      361    102
   65536kB:     37570      378     98
   32768kB:     37732      387     83
   16384kB:     37720      396     63
    8192kB:     37925      417     31
    4096kB:     37770      466     17
    2048kB:     40819      661     16
    1024kB:    333885     1967     14
     512kB:    349250     4085     10
     256kB:    686363     5114      8
     128kB:    706377     5418      8
      64kB:    703693     5430      7
      32kB:    703293     5426      5
      16kB:   1102090     5554      5
       8kB:   1106521     5562      5
==============================================================================
testing random-access throughput scaling (Mops/s)...
==============================================================================
      pipelined   unpipelined
------------------------------
16T:        362           115
 8T:        433            71
 4T:        297            37
 2T:        174            19
 1T:         90             9
==============================================================================
testing bandwidth scaling (MB/s)...
==============================================================================
        vector     scalar
--------------------------
16T:     37797      37966
 8T:     37750      38070
 4T:     38501      37643
 2T:     36885      27366
 1T:     28127      13684
==============================================================================
testing heterogeneous thread combinations...
==============================================================================
----------------- 1 + 1 threads -----------------
unpipelined random throughput: \/ (96%) 9 Mops/s
pipelined random throughput: /\ (100%) 90 Mops/s
unpipelined random throughput: \/ (95%) 9 Mops/s
vector bandwidth: /\ (94%) 27047 MB/s
pipelined random throughput: \/ (94%) 84 Mops/s
scalar bandwidth: /\ (91%) 13076 MB/s
pipelined random throughput: \/ (77%) 70 Mops/s
vector bandwidth: /\ (88%) 24947 MB/s
----------------- 8 + 8 threads -----------------
unpipelined random throughput: \/ (85%) 61 Mops/s
pipelined random throughput: /\ (26%) 113 Mops/s
unpipelined random throughput: \/ (65%) 46 Mops/s
vector bandwidth: /\ (77%) 29541 MB/s
pipelined random throughput: \/ (27%) 119 Mops/s
scalar bandwidth: /\ (69%) 26269 MB/s
pipelined random throughput: \/ (17%) 74 Mops/s
vector bandwidth: /\ (78%) 29910 MB/s
==== ==== ==== ==== ==== ==== ==== ====
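For what it's worth, the vector/scalar split in the bandwidth table is presumably a pair of read kernels along these lines (my guess at the shape, not the benchmark's actual source): the AVX2 path pulls 32 bytes per load where the scalar path pulls 8.

```c
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

/* Scalar read kernel: one 8-byte load per iteration. */
uint64_t sum_scalar(const uint64_t *p, size_t n) {
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++) s += p[i];
    return s;
}

/* AVX2 read kernel: one 32-byte load per iteration (tail beyond a multiple
 * of 4 is ignored here for brevity). The target attribute lets this compile
 * without -mavx2 on gcc/clang; check for AVX2 at runtime before calling. */
__attribute__((target("avx2")))
uint64_t sum_vector(const uint64_t *p, size_t n) {
    __m256i acc = _mm256_setzero_si256();
    for (size_t i = 0; i + 4 <= n; i += 4)
        acc = _mm256_add_epi64(acc, _mm256_loadu_si256((const __m256i *)(p + i)));
    uint64_t lanes[4];
    _mm256_storeu_si256((__m256i *)lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

That would fit the table: at 8T/16T either kernel saturates the memory controller, so the columns match, and the scalar one only falls behind when one or two cores have to generate all the traffic.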
If memory clock and the commonly-quoted primary timings are the big factors in bandwidth and latency respectively, tFAW is the one for random read throughput. On CPUs with fewer threads it probably doesn't matter much, but setting tRRD/tFAW to the XMP values of 6-8-33 drops 90~100 Mops/s off my pipelined 8T/16T figures. Planetside 2 is the real-world workload where I notice this. The way average latency climbs sharply at 4T and beyond could be related, and might make low tFAW more useful in some cases than the raw thread-scaling numbers would suggest.
Bank group swap doesn't seem to do anything other than adding a few hundred MB/s to bandwidth figures, as expected.
The latency figures I'm seeing fit all the expected patterns, but they're higher than the ones usually quoted. I don't know why; maybe something about recovery times after a request?
Most of it is very slightly slower on Windows, and as expected gets slower still when a Windows binary is run on Linux via Wine (Wine's SRWLock is probably a lot less efficient than the native primitives on either OS), but latency doesn't behave right at all: Windows-on-Windows is 10~15 ns slower than Linux-on-Linux, yet Windows-on-Linux via Wine is exactly as fast as Linux-on-Linux. That's with nil reported background CPU use on both sides, but I've only tried it on one system, so my Windows install could be messed up somehow. If anyone has theories on that (and/or finds it handy to try the Linux version on Windows), I'd love to hear them.
Some other 1GB latency figures so far (all on Windows):
R7 1700 DDR4-2666 CL14 ... 128 ns
i5-6600K DDR4-2133 CL15 .... 96 ns
i5-4590 DDR3-2133 CL9 ........ 69 ns
i5-7200U (unknown 1ch) ...... 123 ns at best (some runs 160+, clearly something not right here. Power-saving features?)
i5-540M (unknown) .............. 198 ns
Apparently it is possible to make one thread use a heavy majority of memory bandwidth. Nobody wants to write code like that, though. The closest realistic and easy case would be a simple blend of multiple input streams into one output stream.
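Something like this (my sketch, not anything from the benchmark): one thread driving four sequential streams at once, which is the pattern hardware prefetchers handle best.

```c
#include <stddef.h>

/* Hypothetical single-thread "blend" kernel: three sequential input streams
 * and one sequential output stream. Each stream is prefetcher-friendly, and
 * together they can keep most of the memory pipeline busy from one core. */
void blend3(float *dst, const float *a, const float *b,
            const float *c, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = 0.5f * a[i] + 0.3f * b[i] + 0.2f * c[i];
}
```

The weights are arbitrary; the point is the four concurrent sequential access patterns, not the arithmetic.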
In mixed bandwidth-heavy and random-heavy workloads, it looks like the random side gets the short end of the stick. There also seems to be some sensible prioritization between threads, so threads running a lot of random accesses don't slow things down much for threads running fewer. Intel looks maybe a bit less aggressive with both of these behaviors than AMD, but it's hard to say given the systems involved and the small sample size.