The big thing it tests that I haven't seen elsewhere is random-access throughput, as opposed to random-access latency. There are also some more common latency and bandwidth tests, some bandwidth scaling tests, and mixed-workload tests.
Source and binaries are here. It's good for at least Linux x64 and Windows x64, and probably quite a bit beyond that.
==== ==== ==== ==== ==== ==== ==== ====
On Linux, AVX2 version, R7 1700 stock-clocked, DDR4-2666 14-14-14-14-34-48, tRRD/tFAW 4-5-16, bank group swap disabled:
Code:
[ using 16 threads ]
==============================================================================
testing various working set sizes...
==============================================================================
  WS/thread      MB/s   Mops/s     ns
-------------------------------------
 1048576kB:                       116
  524288kB:                       109
  262144kB:                       110
  131072kB:     37794      361    102
   65536kB:     37570      378     98
   32768kB:     37732      387     83
   16384kB:     37720      396     63
    8192kB:     37925      417     31
    4096kB:     37770      466     17
    2048kB:     40819      661     16
    1024kB:    333885     1967     14
     512kB:    349250     4085     10
     256kB:    686363     5114      8
     128kB:    706377     5418      8
      64kB:    703693     5430      7
      32kB:    703293     5426      5
      16kB:   1102090     5554      5
       8kB:   1106521     5562      5
==============================================================================
testing random-access throughput scaling (Mops/s)...
==============================================================================
      pipelined   unpipelined
------------------------------
16T:        362           115
 8T:        433            71
 4T:        297            37
 2T:        174            19
 1T:         90             9
==============================================================================
testing bandwidth scaling (MB/s)...
==============================================================================
        vector     scalar
--------------------------
16T:     37797      37966
 8T:     37750      38070
 4T:     38501      37643
 2T:     36885      27366
 1T:     28127      13684
==============================================================================
testing heterogeneous thread combinations...
==============================================================================
----------------- 1 + 1 threads -----------------
unpipelined random throughput: \/ (96%) 9 Mops/s
pipelined random throughput: /\ (100%) 90 Mops/s
unpipelined random throughput: \/ (95%) 9 Mops/s
vector bandwidth: /\ (94%) 27047 MB/s
pipelined random throughput: \/ (94%) 84 Mops/s
scalar bandwidth: /\ (91%) 13076 MB/s
pipelined random throughput: \/ (77%) 70 Mops/s
vector bandwidth: /\ (88%) 24947 MB/s
----------------- 8 + 8 threads -----------------
unpipelined random throughput: \/ (85%) 61 Mops/s
pipelined random throughput: /\ (26%) 113 Mops/s
unpipelined random throughput: \/ (65%) 46 Mops/s
vector bandwidth: /\ (77%) 29541 MB/s
pipelined random throughput: \/ (27%) 119 Mops/s
scalar bandwidth: /\ (69%) 26269 MB/s
pipelined random throughput: \/ (17%) 74 Mops/s
vector bandwidth: /\ (78%) 29910 MB/s
==== ==== ==== ==== ==== ==== ==== ====
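For what it's worth, the vector/scalar split in the bandwidth table is presumably a pair of read kernels along these lines (my guess at the shape, not the benchmark's actual source): the AVX2 path pulls 32 bytes per load where the scalar path pulls 8.

```c
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

/* Scalar read kernel: one 8-byte load per iteration. */
uint64_t sum_scalar(const uint64_t *p, size_t n) {
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++) s += p[i];
    return s;
}

/* AVX2 read kernel: one 32-byte load per iteration (tail beyond a multiple
 * of 4 is ignored here for brevity). The target attribute lets this compile
 * without -mavx2 on gcc/clang; check for AVX2 at runtime before calling. */
__attribute__((target("avx2")))
uint64_t sum_vector(const uint64_t *p, size_t n) {
    __m256i acc = _mm256_setzero_si256();
    for (size_t i = 0; i + 4 <= n; i += 4)
        acc = _mm256_add_epi64(acc, _mm256_loadu_si256((const __m256i *)(p + i)));
    uint64_t lanes[4];
    _mm256_storeu_si256((__m256i *)lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

That would fit the table: at 8T/16T either kernel saturates the memory controller, so the columns match, and the scalar one only falls behind when one or two cores have to generate all the traffic.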
If memory clock and the commonly-quoted primary timings are the big factors in bandwidth and latency respectively, tFAW is the one for random read throughput. On CPUs with fewer threads it probably doesn't matter much, but setting tRRD/tFAW to the XMP values of 6-8-33 drops 90~100 Mops/s off my pipelined 8T/16T figures. Planetside 2 is the real-world workload where I notice this. The way average latency climbs sharply at 4T and beyond could be related, and might make low tFAW more useful in some cases than the raw thread-scaling numbers would suggest.
Bank group swap doesn't seem to do anything other than adding a few hundred MB/s to bandwidth figures, as expected.
The latency figures I'm seeing fit all the expected patterns, but they're higher than the ones usually quoted. I don't know why; maybe something about recovery times after a request?
Most of it is very slightly slower on Windows, and as expected gets slower still when a Windows binary is run on Linux via Wine (Wine's SRWLock is probably a lot less efficient than the native primitives on either OS), but latency doesn't behave right at all: Windows-on-Windows is 10~15 ns slower than Linux-on-Linux, yet Windows-on-Linux via Wine is exactly as fast as Linux-on-Linux. That's with nil reported background CPU use on both sides, but I've only tried it on one system, so my Windows install could be messed up somehow. If anyone has theories on that (and/or finds it handy to try the Linux version on Windows), I'd love to hear them.
Some other 1GB latency figures so far (all on Windows):
R7 1700 DDR4-2666 CL14 ... 128 ns
i5-6600K DDR4-2133 CL15 .... 96 ns
i5-4590 DDR3-2133 CL9 ........ 69 ns
i5-7200U (unknown 1ch) ...... 123 ns at best (some runs 160+, clearly something not right here. Power-saving features?)
i5-540M (unknown) .............. 198 ns
Apparently it is possible to make one thread use a heavy majority of memory bandwidth. Nobody wants to write code like that, though. The closest realistic and easy case would be a simple blend of multiple input streams into one output stream.
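Something like this (my sketch, not anything from the benchmark): one thread driving four sequential streams at once, which is the pattern hardware prefetchers handle best.

```c
#include <stddef.h>

/* Hypothetical single-thread "blend" kernel: three sequential input streams
 * and one sequential output stream. Each stream is prefetcher-friendly, and
 * together they can keep most of the memory pipeline busy from one core. */
void blend3(float *dst, const float *a, const float *b,
            const float *c, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = 0.5f * a[i] + 0.3f * b[i] + 0.2f * c[i];
}
```

The weights are arbitrary; the point is the four concurrent sequential access patterns, not the arithmetic.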
In mixed bandwidth-heavy and random-heavy workloads, it looks like the random side gets the short end of the stick. There also seems to be some sensible prioritization between threads, so threads running a lot of random accesses don't slow things down much for threads running fewer. Intel looks maybe a bit less aggressive with both of these behaviors than AMD, but it's hard to say given the systems involved and the small sample size.