 
blastdoor
Gerbil Elite
Topic Author
Posts: 846
Joined: Fri Apr 08, 2005 4:12 pm
Location: Real America

Threadripper 2990wx experiences

Tue Sep 25, 2018 12:29 pm

Hi folks,

As I mentioned over in the Linux forum, I recently bought a Threadripper 2990wx system running Ubuntu (my first non-Mac computer in over 10 years).

I'll save questions/comments on the Linux aspect for the other forum -- for here I'll share some experiences more focussed on the 'ripper (on the off chance that anybody other than me might find this interesting).

Right now, the system is running 64 R (https://cran.r-project.org) threads. The CPU utilization for each thread (reported by top) ranges from 80% to 100%. I have not yet had a chance to fully investigate why it's not 100% across the board, but I'm wondering if it might be due to the uneven access to RAM -- would the CPU utilization on a process (reported by top) drop if it's waiting for access to RAM?

Also, I'm running this command:

watch -n1 "lscpu | grep MHz | awk '{print $1}'";

to see what the clock speed is doing. I don't know much about what this command is truly reporting, but it's showing a pretty steady clock speed of 3400MHz. My system is not overclocked, but it is liquid cooled and sitting in a room with an ambient temperature of around 70°F.
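For anyone who would rather script this than watch it, here is a rough Python sketch of the same idea. It assumes the x86 Linux layout of /proc/cpuinfo, where each logical CPU has a line starting with "cpu MHz"; the function name and the sample text are made up for illustration.

```python
def parse_cpu_mhz(cpuinfo_text):
    """Return the 'cpu MHz' value of every logical CPU found in the text."""
    speeds = []
    for line in cpuinfo_text.splitlines():
        # Lines of interest look like: "cpu MHz\t\t: 3400.000"
        if line.startswith("cpu MHz"):
            _, _, value = line.partition(":")
            speeds.append(float(value))
    return speeds


# Made-up sample standing in for the contents of /proc/cpuinfo.
SAMPLE = (
    "processor\t: 0\n"
    "cpu MHz\t\t: 3400.000\n"
    "processor\t: 1\n"
    "cpu MHz\t\t: 3392.514\n"
)

if __name__ == "__main__":
    print(parse_cpu_mhz(SAMPLE))
```

On a real Linux box the text would come from reading /proc/cpuinfo instead of SAMPLE. Having one value per logical CPU makes it easier to spot a single core clocking differently from the rest.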

I'm eventually going to see how performance scales with threads -- I'm very curious to see that.
1. iMac 27" (2020) i7 10700k; AMD Radeon Pro 5500 XT 8 GB; 64 GB RAM; 500 GB internal SSD + external box of SSDs
2. ThreadRipper 2990wx; Ubuntu; Headless; 64GB RAM
3. MacBook Pro (2017); Core i7-7820HQ; 16 GB RAM
 
chuckula
Minister of Gerbil Affairs
Posts: 2109
Joined: Wed Jan 23, 2008 9:18 pm
Location: Probably where I don't belong.

Re: Threadripper 2990wx experiences

Tue Sep 25, 2018 12:33 pm

I don't know if AMD implements the right support mechanism, but the i7z application is the best low-level clockspeed monitor for Linux that I've used:
https://delightlylinux.wordpress.com/20 ... cpu-doing/

It reports the real information on a per-core basis and correctly measures Turbo Boost levels both at stock and in overclocked scenarios.
4770K @ 4.7 GHz; 32GB DDR3-2133; Officially RX-560... that's right AMD you shills!; 512GB 840 Pro (2x); Fractal Define XL-R2; NZXT Kraken-X60
--Many thanks to the TR Forum for advice in getting it built.
 
Redocbew
Minister of Gerbil Affairs
Posts: 2495
Joined: Sat Mar 15, 2014 11:44 am

Re: Threadripper 2990wx experiences

Tue Sep 25, 2018 1:06 pm

blastdoor wrote:
Also, I'm running this command:

watch -n1 "lscpu | grep MHz | awk '{print $1}'";

to see what the clock speed is doing. I don't know much about what this command is truly reporting, but it's showing a pretty steady clock speed of 3400MHz.


This might be better for the other thread, but here's a breakdown of what that shell command is doing.

"watch" is just a wrapper that runs whatever command you give it continuously on a default interval of two seconds. By using the "-n1" option you're setting the interval down to one second.

"lscpu" collects information about the CPU from sysfs and /proc/cpuinfo. Linux exposes information about the devices in your system through various virtual files. In fact, it lets you treat some of those devices as if they were files on disk, but I digress.

"grep" filters out any lines that don't contain 'MHz'. Awk is an old school scripting language that's also often used for manipulating the results of shell commands. It's not strictly necessary here: because the whole pipeline sits inside double quotes, the shell expands $1 (normally to an empty string) before awk ever runs, so awk just prints whatever it's given, and grep would do that anyway. The pipes in between each command take the output from the previous command and feed it to the next one. You'll see those a lot when working with shell commands.
Do not meddle in the affairs of archers, for they are subtle and you won't hear them coming.
 
blastdoor
Gerbil Elite
Topic Author
Posts: 846
Joined: Fri Apr 08, 2005 4:12 pm
Location: Real America

Re: Threadripper 2990wx experiences

Tue Sep 25, 2018 2:42 pm

Hmm....some initial benchmarking suggests that, at least for what I'm doing right now, it's best to stop at 32 threads -- going beyond that actually slows things down a bit.

More specifically, for the Monte Carlo runs I'm doing now, here are the performance gains:

8 to 16 threads ---> 1.7 times gain
16 to 32 threads ---> 1.4 times gain
32 to 64 threads ---> 0.9 times "gain" (aka, 10% loss)

I also did a little checking around the edges of 32 (24 and 40 threads), and 32 seems to be the sweet spot.
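To put those step-to-step numbers on one scale, here is a small Python sketch that chains the three reported gains into a cumulative speedup relative to the 8-thread run and compares it with perfect linear scaling. Only the three gains come from the post; the function name and the choice of 8 threads as the baseline are assumptions for illustration.

```python
# Step-to-step gains as reported in the post: (from_threads, to_threads, gain).
STEPS = [(8, 16, 1.7), (16, 32, 1.4), (32, 64, 0.9)]


def cumulative_scaling(steps):
    """Chain step gains into (threads, speedup vs. first run, efficiency vs. ideal)."""
    base_threads = steps[0][0]
    speedup = 1.0
    rows = []
    for _, to_threads, gain in steps:
        speedup *= gain
        ideal = to_threads / base_threads  # perfect linear scaling
        rows.append((to_threads, speedup, speedup / ideal))
    return rows


if __name__ == "__main__":
    for threads, speedup, efficiency in cumulative_scaling(STEPS):
        print(f"{threads:2d} threads: {speedup:.2f}x the 8-thread run, "
              f"{efficiency:.1%} of linear scaling")
```

Read this way, the efficiency is already dropping well before the 32-to-64 step turns negative, which fits a shared resource saturating gradually.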
 
chuckula
Minister of Gerbil Affairs
Posts: 2109
Joined: Wed Jan 23, 2008 9:18 pm
Location: Probably where I don't belong.

Re: Threadripper 2990wx experiences

Tue Sep 25, 2018 2:48 pm

blastdoor wrote:
Hmm....some initial benchmarking suggests that, at least for what I'm doing right now, it's best to stop at 32 threads -- going beyond that actually slows things down a bit.

More specifically, for the Monte Carlo runs I'm doing now, here are the performance gains:

8 to 16 threads ---> 1.7 times gain
16 to 32 threads ---> 1.4 times gain
32 to 64 threads ---> 0.9 times "gain" (aka, 10% loss)

I also did a little checking around the edges of 32 (24 and 40 threads), and 32 seems to be the sweet spot.


Hyperthreading I mean "AMD SMT": 60% of the time it works every time.
 
blastdoor
Gerbil Elite
Topic Author
Posts: 846
Joined: Fri Apr 08, 2005 4:12 pm
Location: Real America

Re: Threadripper 2990wx experiences

Tue Sep 25, 2018 2:57 pm

chuckula wrote:
60% of the time it works every time.


I love that line!

But in fairness, we don't know for sure if it's contention from the goofy RAM access layout or if it's lame SMT implementation.
 
dragontamer5788
Gerbil Elite
Posts: 715
Joined: Mon May 06, 2013 8:39 am

Re: Threadripper 2990wx experiences

Tue Sep 25, 2018 3:06 pm

blastdoor wrote:
Right now, the system is running 64 R (https://cran.r-project.org) threads. The CPU utilization for each thread (reported by top) ranges from 80% to 100%. I have not yet had a chance to fully investigate why it's not 100% across the board, but I'm wondering if it might be due to the uneven access to RAM -- would the CPU utilization on a process (reported by top) drop if it's waiting for access to RAM?


No. RAM access is CPU-time. It's actually really, really hard to split out RAM bottlenecks vs CPU bottlenecks. There are tools for that, but I haven't played with them personally yet.

CPU-utilization is an OS-level functionality. Whenever the "idle" thread runs, the OS drops utilization. If 10 milliseconds (per second) is idle-thread, then you have 99% utilization. If 500 milliseconds per-second is idle-thread, then you have 50% utilization. That's it. Again, see "utilization" as the (inverted) amount of time that the idle threads are running.
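As a rough illustration of that definition, here is a Python sketch that computes utilization from two samples of a "cpu" line in /proc/stat. It assumes the standard field order (user, nice, system, idle, iowait, irq, softirq, steal, ...) and treats idle plus iowait as idle time; the sample lines are invented, and with the common 100 ticks per second one tick is 10 milliseconds.

```python
def cpu_utilization(before, after):
    """Utilization between two 'cpu ...' lines sampled from /proc/stat.

    The numbers after the label are cumulative ticks. Guest time is already
    included in user/nice, so only the first 8 fields are summed.
    """
    a = [int(x) for x in before.split()[1:9]]
    b = [int(x) for x in after.split()[1:9]]
    delta = [y - x for x, y in zip(a, b)]
    total = sum(delta)
    idle = delta[3] + delta[4]  # idle + iowait
    return 1.0 - idle / total


if __name__ == "__main__":
    # Invented samples one second apart: 1 idle tick (10 ms) out of 100.
    t0 = "cpu0 1000 0 200 50 0 0 0 0 0 0"
    t1 = "cpu0 1090 0 209 51 0 0 0 0 0 0"
    print(f"{cpu_utilization(t0, t1):.0%}")
```

Note that time spent stalled on a RAM access is counted as busy ticks here, which is why memory waits do not show up as lower utilization.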

"Top" is a confusing tool IMO. Why don't you run "mpstat" and print out what you get here? I prefer "mpstat", personally, for CPU-related reports.

blastdoor wrote:
chuckula wrote:
60% of the time it works every time.


I love that line!

But in fairness, we don't know for sure if it's contention from the goofy RAM access layout or if it's lame SMT implementation.


Not necessarily "goofy" RAM access either. But any RAM-heavy problem will naturally hit a bottleneck. You only have 4-sticks of RAM I presume? So you have 32 to 64-threads reading from those 4-sticks of RAM. Naturally, the 4-sticks of RAM would be too slow to keep up.
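A back-of-the-envelope Python sketch of that argument: theoretical peak bandwidth for a quad-channel setup, divided evenly among the running threads. The DDR4-2933 speed is an assumption (the thread never states the actual RAM speed), and real sustained bandwidth is lower than this peak.

```python
def peak_bandwidth_gb_s(mega_transfers_per_s, channels, bytes_per_transfer=8):
    """Theoretical peak DRAM bandwidth in GB/s (a 64-bit channel moves 8 bytes per transfer)."""
    return mega_transfers_per_s * 1e6 * bytes_per_transfer * channels / 1e9


if __name__ == "__main__":
    # Assumption: quad-channel DDR4-2933; substitute the real module speed.
    peak = peak_bandwidth_gb_s(2933, channels=4)
    print(f"peak: {peak:.1f} GB/s")
    for threads in (8, 16, 32, 64):
        print(f"{threads:2d} threads: {peak / threads:5.2f} GB/s per thread at best")
```

The per-thread share halves every time the thread count doubles, so a workload that streams from RAM can go from comfortable to starved somewhere between 16 and 64 threads.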

The main benefit of the 2990wx is its cache: each CCX (a group of 4 cores) shares 8MB of L3 cache, and each core has its own 512kB of L2 cache. So if you can manage to fit your entire problem inside of the L3 cache, then you should be able to see good scaling. But if all threads are "waiting on RAM" the whole time, you will naturally see a bottleneck.
Last edited by dragontamer5788 on Tue Sep 25, 2018 3:12 pm, edited 1 time in total.
 
Redocbew
Minister of Gerbil Affairs
Posts: 2495
Joined: Sat Mar 15, 2014 11:44 am

Re: Threadripper 2990wx experiences

Tue Sep 25, 2018 3:11 pm

What kind of performance do you usually see with SMT? Isn't it normal not to expect 100% scaling with virtual cores?
 
Concupiscence
Gerbil Elite
Posts: 709
Joined: Tue Sep 25, 2012 7:58 am
Location: Dallas area, Texas, USA

Re: Threadripper 2990wx experiences

Tue Sep 25, 2018 3:15 pm

blastdoor wrote:
Hmm....some initial benchmarking suggests that, at least for what I'm doing right now, it's best to stop at 32 threads -- going beyond that actually slows things down a bit.

More specifically, for the Monte Carlo runs I'm doing now, here are the performance gains:

8 to 16 threads ---> 1.7 times gain
16 to 32 threads ---> 1.4 times gain
32 to 64 threads ---> 0.9 times "gain" (aka, 10% loss)

I also did a little checking around the edges of 32 (24 and 40 threads), and 32 seems to be the sweet spot.


I'm not sure how you're loading the cores, but I'd guess somewhere between 32 and 64 threads you're hitting the limits of Threadripper's quad-channel memory bandwidth. It's interesting that it falls off so dramatically after 32, but a single data point isn't enough to say much about the 2990WX as a whole. I'd love to read more about it.
Science: Core i9 7940x, 64 gigs RAM, Vega FE, Xubuntu 20.04
Work: Ryzen 5 3600, 32 gigs RAM, Radeon RX 580, Win10 Pro
Tinker: Core i5 2400, 8 gigs RAM, Radeon R9 280x, Xubuntu 20.04 + MS-DOS 7.10

Read me at https://www.wallabyjones.com/
 
synthtel2
Gerbil Elite
Posts: 956
Joined: Mon Nov 16, 2015 10:30 am

Re: Threadripper 2990wx experiences

Tue Sep 25, 2018 3:50 pm

chuckula wrote:
Hyperthreading I mean "AMD SMT": 60% of the time it works every time.
blastdoor wrote:
But in fairness, we don't know for sure if it's contention from the goofy RAM access layout or if it's lame SMT implementation.

Since when does AMD have any kind of rep for lame SMT? There are plenty of theoretical reasons Zen's should actually be worth more than SKL's (as a % gain over non-SMT), and last I knew practical testing was agreeing with that theory.

My money's on something memory-related.

Redocbew wrote:
What kind of performance do you usually see with SMT? Isn't it normal not to expect 100% scaling with virtual cores?

I think 20-30% is the usual figure people quote for general use these days, but 0-50% is common. Negative scaling is fairly rare now, but far from unheard of. If you're bound by memory bandwidth and/or random throughput and are throwing a lot more requests at the memory controller than it can fulfill in a timely manner, that's definitely a good way to get to negative scaling.
 
Concupiscence
Gerbil Elite
Posts: 709
Joined: Tue Sep 25, 2012 7:58 am
Location: Dallas area, Texas, USA

Re: Threadripper 2990wx experiences

Tue Sep 25, 2018 3:53 pm

synthtel2 wrote:
chuckula wrote:
Hyperthreading I mean "AMD SMT": 60% of the time it works every time.
blastdoor wrote:
But in fairness, we don't know for sure if it's contention from the goofy RAM access layout or if it's lame SMT implementation.

Since when does AMD have any kind of rep for lame SMT? There are plenty of theoretical reasons Zen's should actually be worth more than SKL's (as a % gain over non-SMT), and last I knew practical testing was agreeing with that theory.

My money's on something memory-related.

Redocbew wrote:
What kind of performance do you usually see with SMT? Isn't it normal not to expect 100% scaling with virtual cores?

I think 20-30% is the usual figure people quote for general use these days, but 0-50% is common. Negative scaling is fairly rare now, but far from unheard of. If you're bound by memory bandwidth and/or random throughput and are throwing a lot more requests at the memory controller than it can fulfill in a timely manner, that's definitely a good way to get to negative scaling.


It sounds like the Ryzen memory controller's generally quite capable. Could it be issues with high latency while fetching from the L3 cache across CCXes? Some of that's bound to improve over time with scheduler modifications.
 
dragontamer5788
Gerbil Elite
Posts: 715
Joined: Mon May 06, 2013 8:39 am

Re: Threadripper 2990wx experiences

Tue Sep 25, 2018 3:57 pm

Concupiscence wrote:
It sounds like the Ryzen memory controller's generally quite capable. Could it be issues with high latency while fetching from the L3 cache across CCXes? Some of that's bound to improve over time with scheduler modifications.


Unlikely. If he's running 32 to 64 independent monte-carlo simulations, there wouldn't be much cross-CCX communication at all. My bet is on memory-bound. Eventually, the DDR4 itself just becomes the bottleneck.

But... first... he has an 80% utilization problem. So immediately, something isn't right. Going from 80% to 100% utilization would increase overall computational speed by up to 25%.

The CPU utilization for each thread (reported by top) ranges from 80% to 100%


This is a problem that should be figured out first. Before we talk about micro-optimizations involving memory controllers or cache. This right here is potentially +25% speeds if it can be figured out.
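The 25% figure is just the ratio 100/80. A tiny Python sketch of that arithmetic, assuming (optimistically) that all of the recovered CPU time turns into useful work:

```python
def potential_speedup(utilization):
    """Best-case speedup from raising utilization to 100%, all else being equal."""
    if utilization <= 0.0 or utilization > 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return 1.0 / utilization


if __name__ == "__main__":
    for u in (0.80, 0.90, 1.00):
        gain = potential_speedup(u) - 1.0
        print(f"{u:.0%} utilized -> up to {gain:.1%} faster at 100%")
```

This is an upper bound: if the missing 20% is an artifact of how top reports things, there is nothing to recover.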
 
Concupiscence
Gerbil Elite
Posts: 709
Joined: Tue Sep 25, 2012 7:58 am
Location: Dallas area, Texas, USA

Re: Threadripper 2990wx experiences

Tue Sep 25, 2018 4:07 pm

dragontamer5788 wrote:
Concupiscence wrote:
It sounds like the Ryzen memory controller's generally quite capable. Could it be issues with high latency while fetching from the L3 cache across CCXes? Some of that's bound to improve over time with scheduler modifications.


Unlikely. If he's running 32 to 64 independent monte-carlo simulations, there wouldn't be much cross-CCX communication at all. My bet is on memory-bound. Eventually, the DDR4 itself just becomes the bottleneck.

But... first... he has an 80% utilization problem. So immediately, something isn't right. Going from 80% to 100% utilization would increase overall computational speed by up to 25%.

The CPU utilization for each thread (reported by top) ranges from 80% to 100%


This is a problem that should be figured out first. Before we talk about micro-optimizations involving memory controllers or cache. This right here is potentially +25% speeds if it can be figured out.


Yeah, that's non-trivial. If it's hovering around 3.4 GHz reliably it's probably not thermal downthrottling being expressed by a CPU monitor as underutilization. But what gives then? I'd like to see a kernel time chart in a performance profiler while all those Monte Carlos grind along.
 
Duct Tape Dude
Gerbil Elite
Posts: 721
Joined: Thu May 02, 2013 12:37 pm

Re: Threadripper 2990wx experiences

Tue Sep 25, 2018 4:09 pm

What's your system load average like (in top or similar)? And what about mpstat (sysstat package) output?

dragontamer5788 wrote:
Unlikely. If he's running 32 to 64 independent monte-carlo simulations, there wouldn't be much cross-CCX communication at all. My bet is on memory-bound. Eventually, the DDR4 itself just becomes the bottleneck.
Yeah could be... this is a great place to be though. It's rare being memory-bound to the point where speed is worth getting over capacity.
 
dragontamer5788
Gerbil Elite
Posts: 715
Joined: Mon May 06, 2013 8:39 am

Re: Threadripper 2990wx experiences

Tue Sep 25, 2018 5:13 pm

Concupiscence wrote:
dragontamer5788 wrote:
The CPU utilization for each thread (reported by top) ranges from 80% to 100%


This is a problem that should be figured out first. Before we talk about micro-optimizations involving memory controllers or cache. This right here is potentially +25% speeds if it can be figured out.


Yeah, that's non-trivial. If it's hovering around 3.4 GHz reliably it's probably not thermal downthrottling being expressed by a CPU monitor as underutilization. But what gives then? I'd like to see a kernel time chart in a performance profiler while all those Monte Carlos grind along.


Honestly, my #1 bet is that he's misunderstanding the utilization metrics and is reporting something else. Lol. "top" isn't very intuitive with its naming IMO. It's better to use "mpstat", where the names make it a bit clearer where CPU time is going. My #1 bet is that he's got 100% CPU utilization, but doesn't realize it.

Failing that, if he's reading top correctly and utilization really is in the 80% region, then there's some sort of file-read / file-write (or other I/O) issue that's causing the threads to stall and wait for data. This might be fixed by running more threads, so that other threads can keep the CPU busy while the disk is working (if you've got a nice, fancy NVMe SSD, then higher queue depths usually lead to better performance).

-----------------

Once utilization hits 100%, then we can start to investigate the (potential) DDR4 memory bottleneck using perf-tools. Not an easy subject to say the least. But this only matters **after** you get your utilization up to 100%. Perf-tools is innately a C / C++ tool however, so a high-level language like R might not have enough control over memory to get things working perfectly.

Still, I'd assume that some use of numactl may improve performance by a few %.
 
synthtel2
Gerbil Elite
Posts: 956
Joined: Mon Nov 16, 2015 10:30 am

Re: Threadripper 2990wx experiences

Tue Sep 25, 2018 5:40 pm

Duct Tape Dude wrote:
It's rare being memory-bound to the point where speed is worth getting over capacity.

I think fast memory is badly underrated, even for general-purpose desktop stuff. Going from JEDEC 2133 to 2666 CL14 early in my current rig's life made a noticeable difference in just about everything you'd think might be CPU-bound from boot times on up to gaming.

dragontamer5788 wrote:
This is a problem that should be figured out first. Before we talk about micro-optimizations involving memory controllers or cache. This right here is potentially +25% speeds if it can be figured out.
dragontamer5788 wrote:
But this only matters **after** you get your utilization up to 100%.

Utilization problems do deserve priority because they're a lot easier to figure out, but some of those so-called micro-optimizations for memory access can be worth a whole lot more than 25%, depending on the problem. Looking at IPC using perf isn't too difficult, and can be a useful metric right off the bat.
 
Krogoth
Emperor Gerbilius I
Posts: 6049
Joined: Tue Apr 15, 2003 3:20 pm
Location: somewhere on Core Prime

Re: Threadripper 2990wx experiences

Wed Sep 26, 2018 8:36 am

synthtel2 wrote:
Duct Tape Dude wrote:
It's rare being memory-bound to the point where speed is worth getting over capacity.

I think fast memory is badly underrated, even for general-purpose desktop stuff. Going from JEDEC 2133 to 2666 CL14 early in my current rig's life made a noticeable difference in just about everything you'd think might be CPU-bound from boot times on up to gaming.

dragontamer5788 wrote:
This is a problem that should be figured out first. Before we talk about micro-optimizations involving memory controllers or cache. This right here is potentially +25% speeds if it can be figured out.
dragontamer5788 wrote:
But this only matters **after** you get your utilization up to 100%.

Utilization problems do deserve priority because they're a lot easier to figure out, but some of those so-called micro-optimizations for memory access can be worth a whole lot more than 25%, depending on the problem. Looking at IPC using perf isn't too difficult, and can be a useful metric right off the bat.


Getting fast memory is either a hit or miss on non-specialized workloads and applications. It also helps more on platforms with high-core count chips that don't have quad-channel or more (Ryzen 7s and most likely 8-core Coffee Lake refresh).
Gigabyte X670 AORUS-ELITE AX, Raphael 7950X, 2x16GiB of G.Skill TRIDENT DDR5-5600, Sapphire RX 6900XT, Seasonic GX-850 and Fractal Define 7 (W)
Ivy Bridge 3570K, 2x4GiB of G.Skill RIPSAW DDR3-1600, Gigabyte Z77X-UD3H, Corsair CX-750M V2, and PC-7B
 
ptsant
Gerbil XP
Posts: 397
Joined: Mon Oct 05, 2009 12:45 pm

Re: Threadripper 2990wx experiences

Wed Sep 26, 2018 9:48 am

blastdoor wrote:
Hi folks,
Right now, the system is running 64 R (https://cran.r-project.org) threads. The CPU utilization for each thread (reported by top) ranges from 80% to 100%. I have not yet had a chance to fully investigate why it's not 100% across the board, but I'm wondering if it might be due to the uneven access to RAM -- would the CPU utilization on a process (reported by top) drop if it's waiting for access to RAM?

I'm eventually going to see how performance scales with threads -- I'm very curious to see that.


I wouldn't worry about fluctuations in top, it's not reliable for that purpose. You are almost certainly running at 100%. You can try some other process monitoring software or simply measure the time to completion.

I would also note that R is not really optimized for multithreading, and you may be hitting a wall that is not strictly CPU-related but rather software-related, if for example the software stresses inter-process communication or some particular aspect of the scheduler.

Generally, Linux is much faster than Windows with the massive 32-core Threadripper, but I'm not aware of specific tuning or kernel options that are helpful. For example, maybe you need to choose a specific scheduler in the kernel, or enable NUMA, or choose a specific kernel/distribution to get 100% out of what is definitely NOT a typical desktop build.

Also, consider compiling R with different optimizations (see for example https://www.phoronix.com/scan.php?page= ... ver1&num=1 ) there is at times up to 10% extra performance to be had. This is especially important if you can find/use Zen-specific math libraries (see here for EPYC-specific BLAS: https://developer.amd.com/amd-cpu-libraries/ ). You can instruct R at compile time to link to these libraries and may observe some gain. In my experience, AMD hand-tuned libraries often outperformed the native R BLAS library. This is only relevant if you use linear algebra, which a lot of R software does.

My experience with the 1700X (8c/16t) has been that I got positive scaling up to approx 12 threads then no significant gains up to 16 threads, maybe even very minor loss, but this certainly depends on the kind of software. I now use 12 threads as a default and tune up (or down) slightly. Your mileage may vary, experimentation is necessary.
 
ptsant
Gerbil XP
Posts: 397
Joined: Mon Oct 05, 2009 12:45 pm

Re: Threadripper 2990wx experiences

Wed Sep 26, 2018 9:52 am

synthtel2 wrote:
Duct Tape Dude wrote:
It's rare being memory-bound to the point where speed is worth getting over capacity.

I think fast memory is badly underrated, even for general-purpose desktop stuff. Going from JEDEC 2133 to 2666 CL14 early in my current rig's life made a noticeable difference in just about everything you'd think might be CPU-bound from boot times on up to gaming.


This was particularly true in Ryzen v1, where the onboard data fabric syncs with memory and therefore very significant gains can be observed from 2133 to 2933. Can't say how this works with Zen+, but I would generally advise DDR4 3000, if you're not going for ECC. Generally the price premium from 2133 to 3000 is quite modest. Above 3200 it becomes "l33t gamer" territory, so maybe not very cost-effective.
 
Bauxite
Gerbil Elite
Posts: 788
Joined: Sat Jan 28, 2006 12:10 pm
Location: electrolytic redox smelting plant

Re: Threadripper 2990wx experiences

Wed Sep 26, 2018 10:27 am

FWIW you can run 128GB 8x16 at 2933 on TR and have your ECC cake too. 4x8 will do 3200 as well, maybe 4x16 but haven't tried much.

Boosting the fabric on TR is a good idea if you can, pretty much everything benefits.
TR RIP 7/7/2019
 
ptsant
Gerbil XP
Posts: 397
Joined: Mon Oct 05, 2009 12:45 pm

Re: Threadripper 2990wx experiences

Wed Sep 26, 2018 11:15 am

Bauxite wrote:
FWIW you can run 128GB 8x16 at 2933 on TR and have your ECC cake too. 4x8 will do 3200 as well, maybe 4x16 but haven't tried much.

Boosting the fabric on TR is a good idea if you can, pretty much everything benefits.


I don't know where you can find 2933 ECC modules, but I suspect they must be obscenely expensive. Most of the local stores (not in US) have 2400 ECC modules. Are there specific brands/models you recommend? I'm considering the switch from DDR4 3000 to ECC for my 1700X when I can shift the DDR4 modules to another system.
 
Bauxite
Gerbil Elite
Posts: 788
Joined: Sat Jan 28, 2006 12:10 pm
Location: electrolytic redox smelting plant

Re: Threadripper 2990wx experiences

Wed Sep 26, 2018 12:26 pm

ptsant wrote:
Bauxite wrote:
FWIW you can run 128GB 8x16 at 2933 on TR and have your ECC cake too. 4x8 will do 3200 as well, maybe 4x16 but haven't tried much.

Boosting the fabric on TR is a good idea if you can, pretty much everything benefits.


I don't know where you can find 2933 ECC modules, but I suspect they must be obscenely expensive. Most of the local stores (not in US) have 2400 ECC modules. Are there specific brands/models you recommend? I'm considering the switch from DDR4 3000 to ECC for my 1700X when I can shift the DDR4 modules to another system.


AMD is not Intel, and TR doesn't lock RAM speeds the way Xeons do. You actually cannot find any ECC modules rated above 2666 at all, but Samsung is already making 3200 chips. Nobody is binning modules because ???.

As for memory, samsung samsung samsung. B-die specifically, M391A1K43BB1 and M391A2K43BB1. Rating timings don't mean much, some of my stuff is listed as 2133 but made this year.
 
ptsant
Gerbil XP
Posts: 397
Joined: Mon Oct 05, 2009 12:45 pm

Re: Threadripper 2990wx experiences

Wed Sep 26, 2018 1:46 pm

Bauxite wrote:
AMD is not Intel, TR is not locked ram like Xeons. You actually cannot find any modules above 2666 at all, but samsung is making 3200 chips already. Nobody is binning modules because ???.

As for memory, samsung samsung samsung. B-die specifically, M391A1K43BB1 and M391A2K43BB1. Rating timings don't mean much, some of my stuff is listed as 2133 but made this year.


Great, thanks for the info. I found the second one at $200 local price, which is a bit expensive. I'll get it when it drops a little bit.
 
Bauxite
Gerbil Elite
Posts: 788
Joined: Sat Jan 28, 2006 12:10 pm
Location: electrolytic redox smelting plant

samsung ram is just plain superior right now

Wed Sep 26, 2018 3:47 pm

That is the going rate for any samsung b-die unfortunately, ECC or not. Ram on AMD is pretty much you get what you pay for, and check specs twice on all the consumer stuff. Don't shop by brand, shop by timings and make sure all 3 are the same: avoid stuff like 14-16-16 or 16-18-18. (some brands have 0 good kits though)

If you filter down to all 3200C14-14-14 kits on newegg (timings you can only get from real b-die) it's ~$200 for 2x8 and ~$400 for 2x16. 3600C15-15-15 is going to be slightly better but will cost accordingly (2x8 @ $250) and after that the insane tweaker kits don't do anything better when you leave the intel platform. Non-samsung ram seems to go on sale often enough, but $180 for 2x8 is the best deal on good samsung I've seen in the last couple months.

I'm now up to around a half TB of b-die in a bunch of systems I built, ECC on TR and blingstuff on AM4 though I did confirm the ECC on X370 awhile back.

Maybe the next Zen core will be more forgiving of crappier ram and get a higher typical ceiling with the good stuff, but that huge infinity fabric is doing a lot of work so I would only expect moderate gains over time.
 
synthtel2
Gerbil Elite
Posts: 956
Joined: Mon Nov 16, 2015 10:30 am

Re: Threadripper 2990wx experiences

Wed Sep 26, 2018 7:19 pm

Krogoth wrote:
Getting fast memory is either a hit or miss on non-specialized workloads and applications. It also helps more on platforms with high-core count chips that don't have quad-channel or more (Ryzen 7s and most likely 8-core Coffee Lake refresh).

This is an R7 1700, but I didn't see much correlation between the threadedness of the work and how much that RAM speed boost helped.

ptsant wrote:
This was particularly true in Ryzen v1, where the onboard data fabric syncs with memory and therefore very significant gains can be observed from 2133 to 2933. Can't say how this works with Zen+, but I would generally advise DDR4 3000, if you're not going for ECC. Generally the price premium from 2133 to 3000 is quite modest. Above 3200 it becomes "l33t gamer" territory, so maybe not very cost-effective.

Zen+ is the same. The RAM I've got is good for much more, I'm just sticking with 2666 because I'm really tired of my CPUs dying and want to go easy on this one (2666 = 912mV VSoC).
 
blastdoor
Gerbil Elite
Topic Author
Posts: 846
Joined: Fri Apr 08, 2005 4:12 pm
Location: Real America

Re: Threadripper 2990wx experiences

Wed Sep 26, 2018 7:55 pm

Hi all--

Thanks for the many comments and suggestions! And sorry for my delay in replying.

Based on suggestions here and further contemplation, I think the problem here probably is memory bandwidth, but I'm hoping it's addressable (so to speak). Each R instance isn't using a ton of RAM -- just a couple hundred megabytes. My impression is that the Linux task scheduler is smart enough to keep these processes from hopping around too much from core to core, but I wonder if perhaps I need to step in and set some processor affinities and also do some manual load balancing.

Right now, N monte carlo replications are evenly split across 64 processes, and I let the task scheduler do its thing. Perhaps what I should do instead is set affinity for each process, and then divide the N replications unevenly, such that SMT cores get fewer replications and cores that aren't directly attached to memory get fewer replications. Does that sound like a reasonable approach?
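A minimal sketch of that idea, written in Python for illustration only (the actual workload is in R). The weights are hypothetical placeholders, not measured values; they would have to come from benchmarking each kind of core. The pinning helper uses os.sched_setaffinity, which is only available on Linux.

```python
import os


def split_replications(total, weights):
    """Split `total` replications across workers in proportion to `weights`.

    Uses largest-remainder rounding so the counts always sum to `total`.
    """
    weight_sum = sum(weights)
    exact = [total * w / weight_sum for w in weights]
    counts = [int(x) for x in exact]
    leftover = total - sum(counts)
    # Hand the leftover replications to the workers with the largest fractional parts.
    by_fraction = sorted(range(len(weights)),
                         key=lambda i: exact[i] - counts[i], reverse=True)
    for i in by_fraction[:leftover]:
        counts[i] += 1
    return counts


def pin_current_process(cpu):
    """Pin the calling process to one logical CPU (no-op where unsupported)."""
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, {cpu})


if __name__ == "__main__":
    # Hypothetical weights: 32 "fast" workers and 32 workers assumed half as fast.
    weights = [1.0] * 32 + [0.5] * 32
    counts = split_replications(10_000, weights)
    print(counts[0], counts[-1], sum(counts))
```

Each worker would call pin_current_process with its assigned CPU number before starting its share. Whether uneven splitting helps at all is something only a before/after timing can answer.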
 
dragontamer5788
Gerbil Elite
Posts: 715
Joined: Mon May 06, 2013 8:39 am

Re: Threadripper 2990wx experiences

Thu Sep 27, 2018 12:48 am

blastdoor wrote:
Hi all--

Thanks for the many comments and suggestions! And sorry for my delay in replying.

Based on suggestions here and further contemplation, I think the problem here probably is memory bandwidth, but I'm hoping it's addressable (so to speak). Each R instance isn't using a ton of RAM -- just a couple hundred megabytes. My impression is that the linux task scheduler is smart enough to keep these processes from hopping around too much from core to core, but I wonder if perhaps I need to step in and set some processor affinities and also do some manual load balancing.

Right now, N monte carlo replications are evenly split across 64 processes, and I let the task scheduler do its thing. Perhaps what I should do instead is set affinity for each process, and then divide the N replications unevenly, such that SMT cores get fewer replications and cores that aren't directly attached to memory get fewer replications. Does that sound like a reasonable approach?


* Step 1: Measure
* Step 2: Measure some more
* Step 3: Make a minor change
* Step 4: Measure again

What does "mpstat" say?

I think the problem here probably is memory bandwidth


What is your plan to measure this?

Hint: memory-bandwidth problems are NOT easy to measure. However, if you forcibly drop your memory bandwidth by setting your RAM speed lower and then take a measurement (i.e., a simple timer), you can determine whether memory bandwidth is the bottleneck.

To actually check your memory bandwidth, you'll need to run perf. In particular, "LLC-load-misses" (last-level cache misses) will tell you whether or not your program is waiting for memory. perf is a difficult tool to use, and I barely understand it myself, but that is how you measure whether or not your code is waiting for RAM.

EDIT: Hmmm.... you can use perf to count the number of context-switches or cpu-migrations. That's probably a way to check if you have a scheduler problem. Then you can fix it with affinity, which should drop the number of cpu-migrations.
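A rough sketch of the "measure first" idea, for anyone who wants a quick number before learning perf. This is not perf and it does not count cpu-migrations; it only reads the context-switch counters the kernel keeps for the current process, via Python's standard `resource` module (Unix only). The `busy_loop` workload is a made-up stand-in for a real replication.

```python
import resource
import time


def count_context_switches(workload):
    """Run workload() once and return (elapsed_seconds, voluntary, involuntary).

    voluntary   = the process gave up the CPU itself (e.g. blocked on I/O)
    involuntary = the scheduler preempted the process
    """
    before = resource.getrusage(resource.RUSAGE_SELF)
    start = time.perf_counter()
    workload()
    elapsed = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_SELF)
    return (
        elapsed,
        after.ru_nvcsw - before.ru_nvcsw,
        after.ru_nivcsw - before.ru_nivcsw,
    )


def busy_loop(n=200_000):
    """Hypothetical CPU-bound stand-in for one Monte Carlo replication."""
    total = 0
    for i in range(n):
        total += i * i
    return total


if __name__ == "__main__":
    elapsed, voluntary, involuntary = count_context_switches(busy_loop)
    print(f"elapsed={elapsed:.4f}s voluntary={voluntary} involuntary={involuntary}")
```

Run it once with the default scheduling and once after setting affinity, and compare the numbers. A high involuntary count on a CPU-bound job suggests the process is being preempted a lot.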

Right now, N monte carlo replications are evenly split across 64 processes, and I let the task scheduler do its thing. Perhaps what I should do instead is set affinity for each process, and then divide the N replications unevenly, such that SMT cores get fewer replications and cores that aren't directly attached to memory get fewer replications. Does that sound like a reasonable approach?


With regard to SMT vs no-SMT: you'll want to pay attention to cache, which tends to be a major limitation. When you have 64 threads running on 32 cores, each thread only gets 1/2 the cache, compared to 32 threads on 32 cores.

In either case, you'll want 100% CPU utilization from "mpstat". If you want to turn off SMT, turn it off from the BIOS / UEFI startup screen. That way, all of the core's resources can be dedicated to a thread.
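The "set affinity, then split the replications unevenly" idea from the quoted post could be sketched roughly as below. Everything specific here is an assumption: the integer weights are invented, and which logical CPU ids sit on a die with local memory has to be read off your own system (for example from lscpu). `os.sched_setaffinity` is Linux-only, and for R processes the same pinning could be done from outside with taskset.

```python
import os


def split_replications(total, weights):
    """Split `total` replications across workers in proportion to integer weights.

    Each worker first gets the floor of its proportional share; any leftover
    replications go to the highest-weight workers, one each.
    """
    weight_sum = sum(weights)
    shares = [total * w // weight_sum for w in weights]
    remainder = total - sum(shares)
    # Highest weight first; ties keep their original order.
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    for i in order[:remainder]:
        shares[i] += 1
    return shares


def pin_current_process(cpu_id):
    """Pin the calling process to one logical CPU. Returns False if unsupported."""
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, {cpu_id})
        return True
    return False


if __name__ == "__main__":
    # Hypothetical weights: 10 for a core with local memory, 7 for a core
    # without it. Real values should come from measurement, not guesswork.
    weights = [10, 10, 7, 7]
    print(split_replications(1000, weights))
```

Start with equal weights, measure how long each worker takes, and only then skew the weights. That keeps the change minor and measurable, in line with the steps above.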
 
yeeeeman
Gerbil In Training
Posts: 8
Joined: Sun Apr 07, 2013 2:00 pm

Re: Threadripper 2990wx experiences

Thu Sep 27, 2018 1:44 am

Concupiscence wrote:
I'm not sure how you're loading the cores, but I'd guess somewhere between 32 and 64 threads you're hitting the limits of Threadripper's quad-channel memory bandwidth. It's interesting it falls off so dramatically after 32, but based on this data point it appears the 2990WX is a fundamentally flawed product. I'd love to read more about it, though.

Let's get things straight: the 2990WX is not a flawed product. It just has some design features that require rethinking an app's memory management and/or thread management for that specific architecture.
The following phoronix article shows that the change from 32 to 64 threads can make a big difference in performance, as expected: https://www.phoronix.com/scan.php?page= ... ling&num=4
Sure, not all benchmarks scale well and some even regress (as yours does), but that just means bad usage of resources. If I were AMD I would get in talks with most of the major app companies and try to optimize the apps for Zen arch variants. They really need to get involved in the SW part, since HW is nothing without software and vice versa.

Bottom line, I would say that the 2990WX is a great CPU, especially for the price they are asking. Great power consumption too. AMD really outdid themselves and surprised all their fans, and probably Intel as well. And that is no small feat.
 
Chrispy_
Maximum Gerbil
Posts: 4670
Joined: Fri Apr 09, 2004 3:49 pm
Location: Europe, most frequently London.

Re: Threadripper 2990wx experiences

Thu Sep 27, 2018 6:32 am

Our testing with the 2990WX ES was as impressive as expected for core density per socket, but the cooling requirements (relevant when we're racking several of them in tightly) and premium cost (3x a 1950X) meant that we ended up ordering a whole truckload of 1950Xs instead.

V-Ray raytracing render times for us on a large-scale test scene (which is what they're for):

1950X 3.4GHz 8C/16T = 4H 17M 51S
1950X 3.4GHz 16C/32T = 2H 24M 28S
2990WX 3.4GHz 32C/64T = 1H 29M 30S

Converting to job time normalised to a single core

8C 1950X = 15,471s*8 = 123,768s
16C 1950X = 8,668s*16 = 138,688s
32C 2990WX = 5,370s*32 = 171,840s

We can see that going from 8C to 16C has a small 12% scaling loss, whilst going from 16C to 32C with the 2990WX's weird memory arrangement has a larger 24% scaling loss.
The reason we didn't test a 2990WX in 16C mode was because a 1950X is so much cheaper for the same thing. The 2990WX is utterly pointless if you're not going to use all 32 cores.
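For anyone who wants to check the arithmetic or plug in their own render times, here is a small sketch of the normalisation used above. The function names are made up for illustration; the times and core counts are the ones posted.

```python
def to_seconds(hours, minutes, seconds):
    """Convert an H/M/S wall-clock time to seconds."""
    return hours * 3600 + minutes * 60 + seconds


def core_seconds(cores, wall_seconds):
    """Job time normalised to a single core: wall time multiplied by core count."""
    return cores * wall_seconds


def scaling_loss(previous, current):
    """Fractional increase in core-seconds when moving to more cores."""
    return current / previous - 1


# (label, cores, wall-clock seconds) taken from the render times above
runs = [
    ("8C 1950X", 8, to_seconds(4, 17, 51)),
    ("16C 1950X", 16, to_seconds(2, 24, 28)),
    ("32C 2990WX", 32, to_seconds(1, 29, 30)),
]

if __name__ == "__main__":
    previous = None
    for label, cores, wall in runs:
        total = core_seconds(cores, wall)
        line = f"{label}: {wall:,}s * {cores} = {total:,}s"
        if previous is not None:
            line += f" (scaling loss {scaling_loss(previous, total):.0%})"
        print(line)
        previous = total
```

Perfect scaling would give the same core-seconds figure on every row, so any growth from one row to the next is the scaling loss.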

FYI, this result is specific to V-Ray raytracing on our models (glass-covered skyscrapers in a city of skyscrapers) using around 80GB RAM. Please don't take it as a general indication of 2990WX's performance in other instances, but I thought I'd post it as a stark difference between simple Cinebench benchmarks and real-world performance with huge workloads that don't ever get used in reviews.
 
DragonDaddyBear
Gerbil Elite
Posts: 985
Joined: Fri Jan 30, 2009 8:01 am

Re: Threadripper 2990wx experiences

Thu Sep 27, 2018 7:15 am

That's good information. Maybe TechReport can use that in their reviews?
