R7 1700 stock-clocked, ASRock AB350 Gaming-ITX/ac, 16 GB of RAM (still being dialled in but 2666CL16 / 2800CL18 or thereabouts).
G3258 at 4.1-4.3 (recently 3.2-4.0 due to wear), ASRock Z97E-ITX/ac, 8GB RAM at 2133CL9.
Sandisk X300 512GB, 4GB GTX 960, a 1080p60 monitor, Arch Linux.
==== ==== ==== ====
thread? That effect is definitely present here. It stalls in different places than it did before, but there are some patterns. In browsing and general OS/background stuff, most non-I/O waits that would take 200+ msec before have been sped up somewhere between noticeably and massively, and most of the new hitches are under 50 msec. The improvements hint that the world may be more multithreaded than we generally give it credit for (a lot of that being Firefox in this case). A lot of those new shorter hitches went away or got better when I got the RAM clocked above 2133CL15, so maybe absolute latency is important for that. As for the rest, I'm suspicious of clock scaling. On the G3258, the intel_pstate governor (despite powersave mode) tended to keep clocks maxed and rely on C-states for power saving, while this 1700 both spends a lot more time at its idle P-state (1.55 GHz) and probably takes a lot longer to spin up from it.
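To put a number on "absolute latency": the usual back-of-envelope figure is CAS latency converted from cycles to nanoseconds. This is only a rough sketch (the function name is mine, and CAS is just one component of real-world memory latency), but it shows why a higher CL at a higher clock can still be faster in absolute terms.

```python
def cas_latency_ns(data_rate_mts: float, cl: int) -> float:
    """Approximate CAS latency in nanoseconds for DDR memory.

    DDR transfers twice per clock, so the memory clock in MHz is
    data_rate / 2 and one clock cycle lasts 2000 / data_rate ns.
    """
    return cl * 2000.0 / data_rate_mts


if __name__ == "__main__":
    # Settings mentioned in this post (nominal data rates).
    for label, rate, cl in [
        ("2133CL15", 2133, 15),
        ("2666CL16", 2666, 16),
        ("2800CL18", 2800, 18),
        ("2133CL9", 2133, 9),
    ]:
        print(f"{label}: {cas_latency_ns(rate, cl):.2f} ns")
```

By this measure 2133CL15 is about 14.06 ns, 2666CL16 about 12.00 ns, 2800CL18 about 12.86 ns, and the G3258's 2133CL9 about 8.44 ns.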
(Speaking of that thread, I haven't forgotten about it. I think implementing such tests as profiling on a naturally-big project is going to work better than a dedicated test suite, though, so that's on an opportunistic kind of schedule.)
I'm still at a low-ish sample size of games that are heavy enough to matter (this internet connection is garbage and downloading them is annoying), but the pattern I'm getting is that average framerates are unaffected or moderately improved and frametime / worst case / hitchiness problems are massively improved. Average versus 99% frametime spreads are now in ranges most gamers would call normal, where on the G3258 they could be anywhere between normal and unplayable. Shadow of Mordor is the worst I ran into on the G3258, where it went to <5 fps for maybe a half second whenever I tried to look a different direction (with an associated heavy multithreading attempt). SoM runs pretty well on the 1700, though regardless of available CPU power, it clearly wasn't designed for the kind of fast view changes a mouse can produce. This paragraph does not account for gaming via Wine, which I'll get to in a bit.
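For anyone who wants to compare the same thing on their own logs, here is a minimal sketch of what I mean by the average versus 99% frametime spread. It assumes you already have a list of per-frame times in milliseconds from whatever capture tool you use; the function name and the nearest-rank percentile method are my own choices, not anything a particular tool prescribes.

```python
def frametime_summary(frametimes_ms):
    """Summarize frametimes: average, 99th percentile, and their ratio."""
    if not frametimes_ms:
        raise ValueError("need at least one frametime")
    ordered = sorted(frametimes_ms)
    n = len(ordered)
    avg_ms = sum(ordered) / n
    # Nearest-rank 99th percentile: smallest value with at least 99% of
    # frames at or below it. Integer ceiling avoids float rounding issues.
    rank = -(-99 * n // 100)
    p99_ms = ordered[rank - 1]
    return {
        "avg_ms": avg_ms,
        "avg_fps": 1000.0 / avg_ms,
        "p99_ms": p99_ms,
        "spread": p99_ms / avg_ms,  # 1.0 would be perfectly even pacing
    }


if __name__ == "__main__":
    # Made-up data: mostly smooth 16 ms frames with two 200 ms hitches.
    sample = [16.0] * 98 + [200.0] * 2
    print(frametime_summary(sample))
```

The made-up sample shows the pattern from the G3258: the average still looks fine (19.68 ms, about 51 fps) while the 99% frametime is 200 ms.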
Game load times are often greatly improved, even for relatively simple games like Broforce.
Y'all already know what this upgrade was like for compiling and other known-multithreaded loads.
==== ==== ==== ====
I've got a couple of Linux-specific tricks to put these threads to use:
Instead of swapping to disk, I swap to an LZ4-compressed RAMdisk. With more cores at work on the LZ4, I can act like I've got 20 or 24 GB of RAM and for many purposes hardly even notice a performance hit.
Wine-staging with CSMT is awesome. The idea is that instead of doing all the DX-to-GL black magic on the game's render thread, Wine runs its own thread to handle it in parallel (including the actual OGL calls, I guess). The stable version of Wine apparently has something like this implemented, but it's implemented with emphasis on correctness and it isn't actually faster than doing the work inline. Wine-staging's version doesn't mind being a bit unsafe, it's very rarely an issue, and it brings Real Serious Performance. It's purportedly sometimes even faster (average fps at least) than running the game on Windows, which sounds plausible enough if the base game is doing a lot of non-render work on the render thread (since calling Wine is likely cheaper than calling the actual graphics driver). So far (again with a small sample size), this is all working for me just as described, though Wine still does result in notably less consistent frametimes than native.
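Back on the compressed-swap trick: a rough sketch of the arithmetic behind "act like I've got 20 or 24 GB". It assumes the RAMdisk is a zram device (I haven't named the mechanism above, so treat that as an assumption), in which case /sys/block/zram0/mm_stat reports the original and compressed data sizes in bytes as its first two fields. The function names and the sample numbers are illustrative only.

```python
def zram_compression_ratio(mm_stat_text: str) -> float:
    """Compression ratio from the contents of a zram mm_stat file.

    The first two whitespace-separated fields are orig_data_size and
    compr_data_size, both in bytes.
    """
    fields = mm_stat_text.split()
    orig_data_size = int(fields[0])
    compr_data_size = int(fields[1])
    if compr_data_size == 0:
        return 0.0  # nothing swapped yet
    return orig_data_size / compr_data_size


def effective_capacity_gb(physical_gb: float, compressed_gb: float,
                          ratio: float) -> float:
    """RAM you can 'act like' you have when compressed_gb of physical RAM
    holds swapped pages compressed at the given ratio."""
    return physical_gb - compressed_gb + compressed_gb * ratio


if __name__ == "__main__":
    # Assumed device name; on a real system, read the file instead:
    # text = open("/sys/block/zram0/mm_stat").read()
    text = "3221225472 1073741824 1100000000 0 1100000000 0 0 0"
    ratio = zram_compression_ratio(text)
    print(ratio, effective_capacity_gb(16, 4, ratio))
```

With a 3:1 ratio, giving up 4 GB of the 16 GB to compressed pages works out to 24 GB of apparent memory. The real ratio depends entirely on what is being swapped.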
==== ==== ==== ====
I haven't (knowingly) run into that compiling bug, but the compiles I'm doing aren't the most likely to trigger it. This is no Gentoo.
Rumors of Ryzen's memory controller weirdness seem well-founded. The kit I'm using is G.Skill's F4-3000C15D-16GTZB, which according to this page is dual-rank Samsung E-die. I don't know what it actually is (Thaiphoon doesn't work in Wine, unsurprisingly), but it looks single-rank to me. It wouldn't boot over 2133 without manual tweaking. Lots of manual tweaking later, I had it seemingly stable (spoilers: it wasn't) at 3066 16-16-16-36 1.22V.
I'd like to take a moment here to discuss Ryzen's boot-time memory training, which is both wonderful and terribly sketchy. It's wonderful because in finding the limits on a ton of separate timings and timing groups, CMOS clear was only required once, and only once did obviously scary levels of glitchiness make it through to OS boot. Usually, if something is pushed too far, it'll just fail training and dump you back in the UEFI interface using JEDEC settings (but with the last custom ones conveniently saved). It's sketchy because it isn't deterministic. When pushed to its limits, it gives slightly different speeds on each boot (presumably due to variance in some subtimings or other). What if it messes up once in a blue moon and boots with timings that aren't stable despite any stress-testing you already threw at it? I don't like that thought, and I don't have any feel for how much margin it needs to eliminate that possibility.
Anyway, the plan was to dial in something more aggressive than I'd really want to run, throw all the stress testing at it, find the edge, run at it for a bit to confirm stability, then back off by 266 or so to get something I can count on. Well, stress testing doesn't work. It passed an overnight mixed stressapptest run (which I had kind of been assuming was decent because it's what Google uses, right?), I went on to testing through cautious use, and next thing I know a pile of random 775 permissions over in /var changed to 755s and I have to tell pacman to reinstall all packages to fix some kind of wifi packet loss problem.
I'm really not trying to run any extreme settings here, but it is a big deal for performance, and it being impossible to tell where the limit is without accepting software damage makes this difficult.
Since I did give these some individual attention, here are the timings I arrived at in case it helps out anyone else:
tCL-tRCD-tRP-tRAS-tRC-cmdrate = 16-16-16-36-54-2T (GDM off)
tCWL = 14, tRTP = 11, tWR = 22, tRRD_S = 8, tRRD_L = 10, tFAW = 42
tWTR_S = 6, tWTR_L = 9, TrdrdScL = 6, TwrwrScL = 6, Trdwr = 10, Twrrd = 4, tCKE = 8
TwrwrSc = 1, TwrwrSd = 6, TwrwrDd = 6, TrdrdSc = 1, TrdrdSd = 6, TrdrdDd = 6
tRFC/tRFC2/tRFC4 follow the JEDEC values multiplied by the new clock divided by 2133
The tight timings there are on the tCL etc line, and the rest mostly have a bit of margin. Bump tFAW to 50+ while you're working on the rest of it for stability (especially at low voltage like this), and don't touch TwrwrSc/TrdrdSc because they have a huge effect on performance. tWR should be double tRTP, tCWL should probably be a bit less than tCL, and tFAW needs to be 4x tRRD_S at minimum. All of those rdrd/wrwr 6s got tended to in pairs rather than individually.
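The rules of thumb above are easy to check mechanically, so here is a small sketch that does so. The function names are mine, the rules are just the ones stated in this post (not a full JEDEC validity check), and rounding the scaled tRFC up to a whole cycle is my assumption. The JEDEC base value is whatever your board shows at 2133; the one in the example is a placeholder.

```python
def check_timing_rules(t: dict) -> list:
    """Return a list of rule-of-thumb violations for a set of timings."""
    problems = []
    if t["tWR"] != 2 * t["tRTP"]:
        problems.append("tWR should be double tRTP")
    if t["tCWL"] >= t["tCL"]:
        problems.append("tCWL should be a bit less than tCL")
    if t["tFAW"] < 4 * t["tRRD_S"]:
        problems.append("tFAW needs to be at least 4x tRRD_S")
    return problems


def scale_trfc(jedec_cycles_at_2133: int, new_rate: int) -> int:
    """Scale a tRFC value (in cycles at 2133) to a new data rate,
    rounding up so the time in nanoseconds never gets shorter."""
    return -(-jedec_cycles_at_2133 * new_rate // 2133)


if __name__ == "__main__":
    # The timings listed above.
    mine = {"tCL": 16, "tCWL": 14, "tRTP": 11, "tWR": 22,
            "tRRD_S": 8, "tFAW": 42}
    print(check_timing_rules(mine))  # no violations expected

    # Placeholder base value, not read from any real board.
    print(scale_trfc(400, 3066))
```

The same check with tFAW bumped to 50+ still passes, which fits the advice to start loose there and tighten later.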
==== ==== ==== ====
Using lower clocks and more cores to whatever extent workable is about reliability, not power use, but this thing's lack of power use is impressive. If heatsink exhaust is anything to go by, 65W seems like a high estimate, even under full prime95. As for reliability, multicore load sees it running a nicely low 1060mV and 3.15-3.2 GHz.
Were I speccing it out again, I'd drop to an R5 1600 and spend that money on ECC RAM. Even if that RAM isn't as theoretically performant and the overclocking is real overclocking, at least it'd be clear how far is too far. As for the 6C/8C choice, I may appreciate every one of these cores in 2021, but for now 8C is comfortably overkill.