blastdoor wrote:Also, I'm running this command:
watch -n1 "lscpu | grep MHz | awk '{print $1}'";
to see what the clock speed is doing. I don't know much about what this command is truly reporting, but it's showing a pretty steady clock speed of 3400MHz.
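A note on that command: inside the double quotes, the shell expands $1 before awk ever sees it, so awk ends up printing each whole matching line rather than a single field. Escaping the dollar sign, and matching only the current-frequency line, reports the number directly. A sketch, assuming a version of lscpu that still prints a "CPU MHz" line:

  # aggregate current frequency as reported by lscpu
  watch -n1 "lscpu | awk '/^CPU MHz/ {print \$3}'"
  # or per-core, which is more informative on a 64-thread box
  watch -n1 "awk '/cpu MHz/ {print \$4}' /proc/cpuinfo"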
blastdoor wrote:Hmm....some initial benchmarking suggests that, at least for what I'm doing right now, it's best to stop at 32 threads -- going beyond that actually slows things down a bit.
More specifically, for the Monte Carlo runs I'm doing now, here are the performance gains:
8 to 16 threads ---> 1.7 times gain
16 to 32 threads ---> 1.4 times gain
32 to 64 threads ---> 0.9 times "gain" (aka, 10% loss)
I also did a little checking around the edges of 32 (24 and 40 threads), and 32 seems to be the sweet spot.
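For anyone wanting to reproduce that kind of sweep, a minimal harness along these lines would do it, assuming a hypothetical script mc.R that takes the number of worker processes as its argument:

  # time the same workload at each thread count (GNU time, not the bash builtin)
  for n in 8 16 24 32 40 64; do
    /usr/bin/time -f "$n threads: %e s elapsed" Rscript mc.R "$n"
  done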
chuckula wrote:60% of the time it works every time.
blastdoor wrote:Right now, the system is running 64 R (https://cran.r-project.org) threads. The CPU utilization for each thread (reported by top) ranges from 80% to 100%. I have not yet had a chance to fully investigate why it's not 100% across the board, but I'm wondering if it might be due to the uneven access to RAM -- would the CPU utilization on a process (reported by top) drop if it's waiting for access to RAM?
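For what it's worth, a stall on RAM usually doesn't show up that way: a thread waiting on DRAM is still scheduled on the CPU, so top keeps counting it as busy. Sub-100% in top generally means the process is actually off-CPU -- sleeping on a lock, doing I/O, or waiting for a time slice. One way to see the per-thread split, as a sketch using sysstat's pidstat (the pgrep pattern is a guess at the process name):

  # %usr/%system are on-CPU; %wait is runnable-but-waiting (recent sysstat versions)
  pidstat -u -t -p "$(pgrep -x R | head -n1)" 1 5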
blastdoor wrote:
chuckula wrote:60% of the time it works every time.
I love that line!
But in fairness, we don't know for sure if it's contention from the goofy RAM access layout or a lame SMT implementation.
chuckula wrote:
blastdoor wrote:But in fairness, we don't know for sure if it's contention from the goofy RAM access layout or a lame SMT implementation.
Hyperthreading, I mean "AMD SMT": 60% of the time, it works every time.
Redocbew wrote:What kind of performance do you usually see with SMT? Isn't it normal not to expect 100% scaling with virtual cores?
synthtel2 wrote:
chuckula wrote:Hyperthreading, I mean "AMD SMT": 60% of the time, it works every time.
blastdoor wrote:But in fairness, we don't know for sure if it's contention from the goofy RAM access layout or a lame SMT implementation.
Since when does AMD have any kind of rep for lame SMT? There are plenty of theoretical reasons Zen's should actually be worth more than SKL's (as a % gain over non-SMT), and last I knew practical testing was agreeing with that theory.
My money's on something memory-related.
Redocbew wrote:What kind of performance do you usually see with SMT? Isn't it normal not to expect 100% scaling with virtual cores?
I think 20-30% is the usual figure people quote for general use these days, but 0-50% is common. Negative scaling is fairly rare now, but far from unheard of. If you're bound by memory bandwidth and/or random throughput and are throwing far more requests at the memory controller than it can fulfill in a timely manner, that's definitely a good way to get to negative scaling.
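One way to separate the SMT question from the memory question is to pin the job to one logical CPU per physical core and compare against the same thread count spread across SMT siblings. A sketch, reusing the hypothetical mc.R runner; the logical-CPU numbering varies by machine, so check the topology first:

  # show which logical CPUs share a physical core
  lscpu -e=CPU,CORE,SOCKET,NODE
  # if CPUs 0-31 map to 32 distinct cores on this box, pin 32 workers there
  taskset -c 0-31 Rscript mc.R 32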
Concupiscence wrote:It sounds like the Ryzen memory controller's generally quite capable. Could it be issues with high latency while fetching from the L3 cache across CCXes? Some of that's bound to improve over time with scheduler modifications.
dragontamer5788 wrote:
Concupiscence wrote:It sounds like the Ryzen memory controller's generally quite capable. Could it be issues with high latency while fetching from the L3 cache across CCXes? Some of that's bound to improve over time with scheduler modifications.
Unlikely. If he's running 32 to 64 independent Monte Carlo simulations, there wouldn't be much cross-CCX communication at all. My bet is on memory-bound. Eventually, the DDR4 itself just becomes the bottleneck.
But... first... he has an 80% utilization problem. So immediately, something isn't right. Going from 80% to 100% utilization would increase overall computational speed by +25%.
blastdoor wrote:The CPU utilization for each thread (reported by top) ranges from 80% to 100%
This is a problem that should be figured out first, before we talk about micro-optimizations involving memory controllers or cache. This right here is potentially +25% speed if it can be figured out.
Duct Tape Dude wrote:
dragontamer5788 wrote:Unlikely. If he's running 32 to 64 independent Monte Carlo simulations, there wouldn't be much cross-CCX communication at all. My bet is on memory-bound. Eventually, the DDR4 itself just becomes the bottleneck.
Yeah, could be... this is a great place to be though. It's rare being memory-bound to the point where speed is worth getting over capacity.
Concupiscence wrote:
dragontamer5788 wrote:This is a problem that should be figured out first, before we talk about micro-optimizations involving memory controllers or cache. This right here is potentially +25% speed if it can be figured out.
Yeah, that's non-trivial. If it's hovering around 3.4 GHz reliably, it's probably not thermal throttling being reported by a CPU monitor as underutilization. But what gives, then? I'd like to see a kernel-time chart from a performance profiler while all those Monte Carlos grind along.
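perf can produce roughly that chart. A sketch, sampling one of the running R workers for ten seconds (the pgrep pattern is a guess at the process name):

  # record call stacks, kernel and user, for one worker
  perf record -g -p "$(pgrep -x R | head -n1)" -- sleep 10
  # then browse where the time went, kernel symbols included
  perf report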
synthtel2 wrote:
Duct Tape Dude wrote:It's rare being memory-bound to the point where speed is worth getting over capacity.
I think fast memory is badly underrated, even for general-purpose desktop stuff. Going from JEDEC 2133 to 2666 CL14 early in my current rig's life made a noticeable difference in just about everything you'd think might be CPU-bound, from boot times on up to gaming.
dragontamer5788 wrote:This is a problem that should be figured out first, before we talk about micro-optimizations involving memory controllers or cache. This right here is potentially +25% speed if it can be figured out.
dragontamer5788 wrote:But this only matters **after** you get your utilization up to 100%.
Utilization problems do deserve priority because they're a lot easier to figure out, but some of those so-called micro-optimizations for memory access can be worth a whole lot more than 25%, depending on the problem. Looking at IPC using perf isn't too difficult, and it can be a useful metric right off the bat.
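As a sketch of that first look: perf stat attached to a running worker prints instructions, cycles, and insn-per-cycle directly, and a sustained IPC far below the core's peak is consistent with being memory-bound. The pgrep pattern is, again, a guess at the process name:

  # counter summary for one worker over ~10 seconds; look at "insn per cycle"
  perf stat -p "$(pgrep -x R | head -n1)" -- sleep 10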
Bauxite wrote:FWIW you can run 128GB 8x16 at 2933 on TR and have your ECC cake too. 4x8 will do 3200 as well, maybe 4x16 but haven't tried much.
Boosting the fabric on TR is a good idea if you can, pretty much everything benefits.
ptsant wrote:
Bauxite wrote:FWIW you can run 128GB 8x16 at 2933 on TR and have your ECC cake too. 4x8 will do 3200 as well, maybe 4x16 but haven't tried much.
Boosting the fabric on TR is a good idea if you can, pretty much everything benefits.
I don't know where you can find 2933 ECC modules, but I suspect they must be obscenely expensive. Most of the local stores (not in US) have 2400 ECC modules. Are there specific brands/models you recommend? I'm considering the switch from DDR4 3000 to ECC for my 1700X when I can shift the DDR4 modules to another system.
Bauxite wrote:AMD is not Intel; TR doesn't lock RAM speed the way Xeons do. You actually cannot find any ECC modules rated above 2666 at all, but Samsung is already making 3200 chips. Nobody is binning ECC modules, for whatever reason.
As for memory: Samsung, Samsung, Samsung. B-die specifically, M391A1K43BB1 and M391A2K43BB1. Rated timings don't mean much; some of my stuff is listed as 2133 but was made this year.
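To check what the installed modules report, and what speed they're actually configured to run at, dmidecode reads it straight out of the SMBIOS tables:

  # rated speed, configured speed, and part numbers per DIMM
  sudo dmidecode -t memory | grep -E 'Speed|Part Number'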
Krogoth wrote:Getting fast memory is hit or miss for non-specialized workloads and applications. It also helps more on platforms with high-core-count chips that don't have quad-channel memory or better (Ryzen 7s, and most likely the 8-core Coffee Lake refresh).
ptsant wrote:This was particularly true of Ryzen v1, where the on-chip data fabric clock syncs with the memory clock, so very significant gains can be observed going from 2133 to 2933. I can't say how this works with Zen+, but I would generally advise DDR4 3000 if you're not going for ECC. The price premium from 2133 to 3000 is generally quite modest; above 3200 it becomes "l33t gamer" territory, so maybe not very cost-effective.
blastdoor wrote:Hi all--
Thanks for the many comments and suggestions! And sorry for my delay in replying.
Based on the suggestions here and further contemplation, I think the problem probably is memory bandwidth, but I'm hoping it's addressable (so to speak). Each R instance isn't using a ton of RAM -- just a couple hundred megabytes. My impression is that the Linux task scheduler is smart enough to keep these processes from hopping around too much from core to core, but I wonder if I need to step in, set some processor affinities, and do some manual load balancing.
Right now, N Monte Carlo replications are evenly split across 64 processes, and I let the task scheduler do its thing. Perhaps what I should do instead is set affinity for each process and then divide the N replications unevenly, such that SMT siblings get fewer replications and cores that aren't directly attached to memory get fewer replications. Does that sound like a reasonable approach?
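A rough sketch of that plan, with loud caveats: mc.R and the replication counts are hypothetical, and the logical-CPU-to-die mapping must be verified with lscpu -e or numactl --hardware first, since the enumeration varies from machine to machine. On the 2990WX only two of the four dies have directly attached DRAM, so those get the heavier slices; weighting SMT siblings down further would be a refinement along the same lines:

  for cpu in $(seq 0 63); do
    node=$(( cpu / 16 ))              # hypothetical mapping -- verify first!
    case $node in
      0|2) reps=40 ;;                 # die with local memory: more work
      *)   reps=24 ;;                 # memory-less die: lighter load
    esac
    taskset -c "$cpu" Rscript mc.R "$reps" &   # one pinned worker per logical CPU
  done
  wait                                # block until all 64 workers finish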
Concupiscence wrote:I'm not sure how you're loading the cores, but I'd guess somewhere between 32 and 64 threads you're hitting the limits of Threadripper's quad-channel memory bandwidth. It's interesting that it falls off so dramatically after 32; based on this data point alone, the 2990WX looks like a fundamentally flawed product. I'd love to read more about it, though.