chuckula
Gold subscriber
Gerbil Jedi
Topic Author
Posts: 1839
Joined: Wed Jan 23, 2008 9:18 pm
Location: Probably where I don't belong.

Der8Auer ripped[2] Cinebench's threads out!

Mon Jul 16, 2018 11:18 am

Behold the POWAR of this fully armed and operational 32 core Threadripper running Cinebench!

OK, maybe not fully armed and operational. Instead it's an Epyc 7601 that Der8auer overclocked to be in the same general frequency range as Ripper 2: Electric Boogaloo.

What's really interesting is how dependent Cinebench is on having all 8 memory channels turned on, which I would not have expected, since Cinebench never seemed particularly interested in memory bandwidth. He notes that 4 channels of faster memory can somewhat make up for the deficit.
4770K @ 4.7 GHz; 32GB DDR3-2133; GTX-1080 sold and back to hipster IGP!; 512GB 840 Pro (2x); Fractal Define XL-R2; NZXT Kraken-X60
--Many thanks to the TR Forum for advice in getting it built.
 
Waco
Gold subscriber
Minister of Gerbil Affairs
Posts: 2391
Joined: Tue Jan 20, 2009 4:14 pm
Location: Los Alamos, NM

Re: Der8Auer ripped[2] Cinebench's threads out!

Mon Jul 16, 2018 11:33 am

He's going to be retesting with a 1+1+1+1 memory channel config in a few weeks too. I know I'm interested in that test versus the 2+0+2+0 config he tested in the video.

Has AMD ruled officially on how TR2 memory will be wired up? I kinda stopped looking when the buzz died down.
Z170A Gaming Pro Carbon | 6700K @ 4.4 | 16 GB | GTX Titan Xm | Seasonic Gold 850 | XSPC RX360 | Heatkiller R3 | D5 + RP-452X2 | Cosmos II | Samsung 4K 40" | 2048 + 240 + LSI 9207-8i (128x8) SSDs
 
chuckula
Gold subscriber
Gerbil Jedi
Topic Author
Posts: 1839
Joined: Wed Jan 23, 2008 9:18 pm
Location: Probably where I don't belong.

Re: Der8Auer ripped[2] Cinebench's threads out!

Mon Jul 16, 2018 11:35 am

Waco wrote:
He's going to be retesting with a 1+1+1+1 memory channel config in a few weeks too. I know I'm interested in that test versus the 2+0+2+0 config he tested in the video.

Has AMD ruled officially on how TR2 memory will be wired up? I kinda stopped looking when the buzz died down.


We'll find out if it's 1 channel all-around or a 2-2-0-0 setup for sure when the NDA lifts or the leakers spill the beans in a verifiable manner.

I'm sure you want the 1-1-1-1 setup since you want to push 32 cores more often than not. However, in an HEDT system AMD might prefer 2-2-0-0 since (scheduler permitting) it lets the first 16 cores act pretty much like the old Threadripper with minimal negative performance impact while sacrificing scalability to the second set of 16 cores.
 
Waco
Gold subscriber
Minister of Gerbil Affairs
Posts: 2391
Joined: Tue Jan 20, 2009 4:14 pm
Location: Los Alamos, NM

Re: Der8Auer ripped[2] Cinebench's threads out!

Mon Jul 16, 2018 1:57 pm

I'd be happy with either, but I know many who'd rather have 16 fast cores along with an extra 16 somewhat slower cores. Most apps aren't NUMA aware anyway, but OSes are getting pretty good at placing tasks.

I'm not really in the market for one though - my next desktop is likely to be Zen2 or [something]-Lake a few years down the road. For server duty the full-fat Epyc chips are pretty awesome given the price and memory bandwidth they're capable of.
 
Chrispy_
Maximum Gerbil
Posts: 4382
Joined: Fri Apr 09, 2004 3:49 pm
Location: Europe, most frequently London.

Re: Der8Auer ripped[2] Cinebench's threads out!

Mon Jul 16, 2018 2:45 pm

I dunno. If you're in the market for 32 cores, you probably want to use all of them, right?

I have €15K approved to buy 8x 32C Threadrippers as soon as they're launched to add to the renderfarm and 1+1+1+1 would be the preference, for sure. It's possible the TR4 socket isn't wired up for that though :(

Certainly if you were wanting a few fast cores the 2700X is your answer? 16 threads is a lot more than most software uses but doesn't sacrifice top-end clockspeed and doesn't run into the UMA limitations of >20 threads.

Threadripper (1st gen) already has legacy mode (halves the number of cores) as well as an UMA/NUMA memory toggle. Since 32C is very clearly into the all-core/NUMA customer territory I seriously hope they've managed to implement 1+1+1+1....
Congratulations, you've noticed that this year's signature is based on outdated internet memes; CLICK HERE NOW to experience this unforgettable phenomenon. This sentence is just filler and as irrelevant as my signature.
 
dragontamer5788
Gerbil
Posts: 66
Joined: Mon May 06, 2013 8:39 am

Re: Der8Auer ripped[2] Cinebench's threads out!

Tue Jul 17, 2018 6:16 pm

Chrispy_ wrote:
I dunno. If you're in the market for 32 cores, you probably want to use all of them, right?

I have €15K approved to buy 8x 32C Threadrippers as soon as they're launched to add to the renderfarm and 1+1+1+1 would be the preference, for sure. It's possible the TR4 socket isn't wired up for that though :(

Certainly if you were wanting a few fast cores the 2700X is your answer? 16 threads is a lot more than most software uses but doesn't sacrifice top-end clockspeed and doesn't run into the UMA limitations of >20 threads.

Threadripper (1st gen) already has legacy mode (halves the number of cores) as well as an UMA/NUMA memory toggle. Since 32C is very clearly into the all-core/NUMA customer territory I seriously hope they've managed to implement 1+1+1+1....


1+1+1+1 would be a significant degradation in single-threaded memory bandwidth; in fact, it would halve your bandwidth. I'd almost expect 2+2+0+0 to be faster for most people who'd buy Threadripper. In the case of 2666 MT/s RAM (aka 21GB/s per channel), 2+2+0+0 would have 42GB/s of local bandwidth, while 1+1+1+1 would only have 21GB/s. Remember that Infinity Fabric may have higher latency, but it's incredibly thick. EPYC's layout means that every die has full bandwidth to all RAM. With the 32B/cycle connection, all EPYC dies have 2x RAM transfers per clock (or, in the case of 2666 MT/s, 42GB/s per Infinity Fabric link).
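As a rough check on those figures, here is a minimal sketch of the arithmetic. It assumes a 64-bit (8-byte) DDR4 channel and a fabric link that moves 32 bytes per cycle at the memory clock, which is half the transfer rate. The function names are made up for illustration, and these are peak numbers, not measurements.

```python
# Back-of-the-envelope peak bandwidth for DDR4-2666.
# Assumptions: 8 bytes per transfer per channel, fabric clock equal to the
# memory clock (transfer rate / 2), 32 bytes moved per fabric cycle.

def channel_bandwidth_gbs(mt_per_s: float, bytes_per_transfer: int = 8) -> float:
    """Peak bandwidth of one DDR channel in GB/s (1 GB = 1e9 bytes)."""
    return mt_per_s * 1e6 * bytes_per_transfer / 1e9


def fabric_link_bandwidth_gbs(mt_per_s: float, bytes_per_cycle: int = 32) -> float:
    """Peak bandwidth of one fabric link in GB/s under the assumptions above."""
    memclk_hz = mt_per_s * 1e6 / 2  # DDR: two transfers per clock
    return memclk_hz * bytes_per_cycle / 1e9


one_channel = channel_bandwidth_gbs(2666)
two_channels = 2 * one_channel
fabric_link = fabric_link_bandwidth_gbs(2666)

print(f"one channel : {one_channel:.1f} GB/s")
print(f"two channels: {two_channels:.1f} GB/s")
print(f"fabric link : {fabric_link:.1f} GB/s")
```

Under these assumptions a single fabric link matches two local channels, which is the basis for the "every die has full bandwidth" argument. The unrounded values come out slightly above the 21 and 42 quoted in the post.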

In the case of highly-threaded work, you'd want bandwidth, as each core has a good L3 cache to work on in the short term. So latency would be less of a concern. Furthermore, highly-scalable programs don't communicate as much, so latency is less of a concern on highly-scalable problem sets (like rendering).

------------

My computer (X399 Taichi + Threadripper) has options for memory interleaving. I've never tested them, but they are "None", "Channel", "Die", and "Socket". When you use Ryzen Master to go into "Distributed" or "Local" mode, it switches between None and Die. This would suggest that 1+1+1+1 is possible, but I'd seriously expect 2+2 with full-die interleaving to be the fastest for rendering tasks.
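To make the difference between those interleave settings concrete, here is a toy model of how a physical address might be mapped to a (die, channel) pair. Everything in it is an assumption for illustration: the 256-byte granule, the 1 GiB per channel, the two-die layout, and the function name. The real mapping is decided by firmware and is not documented in this thread. "Socket" is left out because there is only one socket.

```python
# Toy address-to-channel mapping for three interleave modes.
# All sizes and the layout are hypothetical; this is not AMD's actual scheme.

def channel_for_address(addr, mode, dies=2, channels_per_die=2,
                        granule=256, bytes_per_channel=1 << 30):
    """Return (die, channel) serving physical address `addr` in this toy model."""
    total_channels = dies * channels_per_die
    if mode == "none":
        # Each channel owns one contiguous region of the address space.
        flat = addr // bytes_per_channel
        if flat >= total_channels:
            raise ValueError("address beyond installed memory")
    elif mode == "channel":
        # Each die owns a contiguous region; granules alternate between
        # the channels that belong to that die.
        die = addr // (bytes_per_channel * channels_per_die)
        if die >= dies:
            raise ValueError("address beyond installed memory")
        return die, (addr // granule) % channels_per_die
    elif mode == "die":
        # Granules rotate across every channel on every die.
        flat = (addr // granule) % total_channels
    else:
        raise ValueError(f"unknown mode: {mode}")
    return divmod(flat, channels_per_die)


for mode in ("none", "channel", "die"):
    print(mode, [channel_for_address(a, mode) for a in (0, 256, 512, 768)])
```

In this model a sequential read touches one channel with no interleaving, two with channel interleaving, and all four with die interleaving, which is why the die setting tends to favor bandwidth over latency.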
 
synthtel2
Gerbil Elite
Posts: 740
Joined: Mon Nov 16, 2015 10:30 am

Re: Der8Auer ripped[2] Cinebench's threads out!

Tue Jul 17, 2018 8:03 pm

dragontamer5788 wrote:
1+1+1+1 would be a significant degradation in single-threaded memory bandwidth; in fact, it would halve your bandwidth. I'd almost expect 2+2+0+0 to be faster for most people who'd buy Threadripper. In the case of 2666 MT/s RAM (aka 21GB/s per channel), 2+2+0+0 would have 42GB/s of local bandwidth, while 1+1+1+1 would only have 21GB/s. Remember that Infinity Fabric may have higher latency, but it's incredibly thick. EPYC's layout means that every die has full bandwidth to all RAM. With the 32B/cycle connection, all EPYC dies have 2x RAM transfers per clock (or, in the case of 2666 MT/s, 42GB/s per Infinity Fabric link).

In the case of highly-threaded work, you'd want bandwidth, as each core has a good L3 cache to work on in the short term. So latency would be less of a concern. Furthermore, highly-scalable programs don't communicate as much, so latency is less of a concern on highly-scalable problem sets (like rendering).

Single-threaded bandwidth isn't usually much of a concern when running 16+ threads, though, and it's still the same aggregate bandwidth either way (assuming the OS is handling NUMA competently). 2/2/0/0 makes it easier to use all the bandwidth, but it doesn't seem likely there'll be a problem there in the first place if you're putting that many cores to use. Latency does still matter in that kind of case; per-core, the likelihood of and penalty from cache misses shouldn't be much different from parts with fewer cores given there's no unified last-level cache here.
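The "same aggregate bandwidth either way" point can be sketched with a few lines of arithmetic. This assumes four dies, DDR4-2666 at roughly 21.3GB/s per channel, and peak rather than measured numbers; the function name is hypothetical.

```python
# Local-per-die versus aggregate peak bandwidth for two channel layouts.
# Assumes DDR4-2666 (about 21.3 GB/s per channel); purely illustrative.

CHANNEL_GBS = 21.3


def bandwidth_summary(channels_per_die):
    """Return per-die local bandwidth and the package-wide total, in GB/s."""
    local = [n * CHANNEL_GBS for n in channels_per_die]
    return {"local_per_die": local, "aggregate": sum(local)}


for name, layout in (("2/2/0/0", (2, 2, 0, 0)), ("1/1/1/1", (1, 1, 1, 1))):
    summary = bandwidth_summary(layout)
    print(name, summary["local_per_die"], f"total {summary['aggregate']:.1f} GB/s")
```

Both layouts add up to the same four channels. What changes is how much of that total a given die can reach without crossing the fabric.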
 
Waco
Gold subscriber
Minister of Gerbil Affairs
Posts: 2391
Joined: Tue Jan 20, 2009 4:14 pm
Location: Los Alamos, NM

Re: Der8Auer ripped[2] Cinebench's threads out!

Tue Jul 17, 2018 8:07 pm

dragontamer5788 wrote:
1+1+1+1 would be a significant degradation in single-threaded memory bandwidth; in fact, it would halve your bandwidth.

Sure, but anything that scales well doesn't really depend on a single thread's memory bandwidth very heavily. Spreading the bandwidth evenly between dies and pinning processes properly will *always* be preferable to competing over remote resources in my book (simulation HPC and parallel storage systems). Heck, your example doesn't really even hold true for anything reasonably scalable (like Cinebench) since many of them spent an awful lot of time reducing any pinch point on performance...which is typically single threads hogging execution time and holding up the rest of the program.
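For anyone wondering what "pinning processes properly" can look like in practice, here is a minimal Linux-only sketch. It reads a NUMA node's CPU list from sysfs and restricts the calling process to those CPUs. The helper names are made up, and it only sets CPU affinity; it relies on the kernel's usual local-allocation behavior to keep memory on the same node rather than binding memory explicitly.

```python
import os


def parse_cpulist(text):
    """Parse a Linux cpulist string such as '0-7,32-39' into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        # A single id like '3' has no upper bound, so reuse the lower one.
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus


def pin_to_node(node, sysfs_root="/sys/devices/system/node"):
    """Pin the calling process to the CPUs of one NUMA node (Linux only)."""
    path = os.path.join(sysfs_root, f"node{node}", "cpulist")
    with open(path) as f:
        cpus = parse_cpulist(f.read())
    os.sched_setaffinity(0, cpus)  # pid 0 means the calling process
    return cpus


print(sorted(parse_cpulist("0-7,32-39")))
```

A worker pool would call something like `pin_to_node(rank % node_count)` once per worker at startup, so that each worker's threads and its first-touch allocations stay on one die.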
 
dragontamer5788
Gerbil
Posts: 66
Joined: Mon May 06, 2013 8:39 am

Re: Der8Auer ripped[2] Cinebench's threads out!

Tue Jul 17, 2018 10:45 pm

Waco wrote:
dragontamer5788 wrote:
1+1+1+1 would be a significant degradation in single-threaded memory bandwidth; in fact, it would halve your bandwidth.

Sure, but anything that scales well doesn't really depend on a single thread's memory bandwidth very heavily. Spreading the bandwidth evenly between dies and pinning processes properly will *always* be preferable to competing over remote resources in my book (simulation HPC and parallel storage systems). Heck, your example doesn't really even hold true for anything reasonably scalable (like Cinebench) since many of them spent an awful lot of time reducing any pinch point on performance...which is typically single threads hogging execution time and holding up the rest of the program.


If you have a problem which scales poorly, I'd expect 42GB/s of low-latency RAM access over 16 cores to be superior to 21GB/s of low-latency access over 32 cores. It's not like your poorly-scaling problem was going to use 16 (or more!!) cores anyway.

Another note: Windows doesn't do NUMA interleaving. Linux does, but Windows has no mechanism for it. Windows relies on the BIOS / motherboard to do it on its behalf (aka: "enter Creative Mode"). So in the typical scenario for Windows, a 1+1+1+1 configuration would become 21GB/s. All VirtualAllocs would go to the nearest RAM and none of them would be distributed to other NUMA nodes.

In the case of Linux, only specially written, NUMA-aware software can interleave memory accesses across NUMA nodes. Fortunately, there are shell programs which can "pass it along" on a per-program basis. Still, it wouldn't be the default way of running code... not that I'm aware of anyway.
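On Linux, the usual tool for "passing it along" is `numactl`, which sets a memory policy and then launches the program under it. Here is a small sketch that only builds the command lines, assuming `numactl` is installed. The helper names and the `scene.blend` file are placeholders.

```python
import shlex


def numactl_interleave(cmd, nodes="all"):
    """Command line that runs `cmd` with pages interleaved across NUMA nodes."""
    return ["numactl", f"--interleave={nodes}", *cmd]


def numactl_local(cmd, node):
    """Command line that keeps both CPUs and memory of `cmd` on a single node."""
    return ["numactl", f"--cpunodebind={node}", f"--membind={node}", *cmd]


# Placeholder workload: a background Blender render of a hypothetical scene.
render = ["blender", "-b", "scene.blend", "-a"]
print(shlex.join(numactl_interleave(render)))
print(shlex.join(numactl_local(render, 0)))
```

Passing either list to `subprocess.run` would launch the program under that policy. Interleaving suits bandwidth-hungry jobs that touch memory from every die, while the single-node form suits latency-sensitive ones.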
 
synthtel2
Gerbil Elite
Posts: 740
Joined: Mon Nov 16, 2015 10:30 am

Re: Der8Auer ripped[2] Cinebench's threads out!

Tue Jul 17, 2018 11:34 pm

That still sounds like a problem that will only actually be a problem if they're marketing this to people who want 32 cores because it sounds cool instead of for a real workload (which they probably are, to be fair).
 
dragontamer5788
Gerbil
Posts: 66
Joined: Mon May 06, 2013 8:39 am

Re: Der8Auer ripped[2] Cinebench's threads out!

Wed Jul 18, 2018 9:24 am

synthtel2 wrote:
That still sounds like a problem that will only actually be a problem if they're marketing this to people who want 32 cores because it sounds cool instead of for a real workload (which they probably are, to be fair).


It will be a problem for anyone who plays video games after they're done with their Blender render. For example: me. All of my video games will get 21GB/s on a 1+1+1+1 setup (when a far cheaper Ryzen 2700 or Intel i7-8700K will get 42GB/s on a 2-channel setup), because games are typically latency-sensitive and bottleneck on a few cores.

I'm not buying another computer dedicated to playing games. I get one desktop for serious work and serious gaming, and one laptop for portability. I still care about keeping my game performance decent. If Threadripper2 has poor gaming performance, I'd personally switch to the Intel i9 / Intel Extreme series. (Well, not really. I already bought Threadripper, but hypothetically if I were in the market a few months from now...)

It'd be somewhat acceptable to have a configurable BIOS flag, i.e. "Creative Mode" vs "Local Mode". But that's still annoying, because you'd have to reset your computer every time you wanted to switch from Blender / Vegas Video Editing / LTSpice simulations (highly threaded / high-bandwidth problems) -> Video Games (low-threaded / latency-sensitive problems). Ideally, Windows should support NUMA interleaving AND the chip should be designed for the absolute lowest latency possible.

-----------

Another example: Blender's physics simulator is single-threaded. Cloth simulations, rigid body physics, soft-body deform, fluids... these are all single-threaded problems. So even when I'm modeling in Blender, I'm constantly switching from single-threaded problems (baking, physics, etc.) to multi-threaded problems (rendering, etc.). Maximizing single-threaded performance when you're in physics baking mode is still incredibly important.

Cinema4D (from the creators of Cinebench) is, ironically, single-threaded, for example. So most of that community already prefers Intel i9 processors over Threadripper. Multi-threading is nice, but it's impractical in the case of Cinema4D. Today, it's better to get an i9-7900X for Cinema4D plus an NVIDIA GPU for rendering. Even then, the main reason isn't the cores but having more PCIe lanes than an i7-8700K. Threadripper happens to be better for my tool (Blender), but it's certainly not a winner across the board.
 
techguy
Gerbil XP
Posts: 311
Joined: Tue Aug 10, 2010 9:12 am

Re: Der8Auer ripped[2] Cinebench's threads out!

Wed Jul 18, 2018 10:00 am

chuckula wrote:
Waco wrote:
He's going to be retesting with a 1+1+1+1 memory channel config in a few weeks too. I know I'm interested in that test versus the 2+0+2+0 config he tested in the video.

Has AMD ruled officially on how TR2 memory will be wired up? I kinda stopped looking when the buzz died down.


We'll find out if it's 1 channel all-around or a 2-2-0-0 setup for sure when the NDA lifts or the leakers spill the beans in a verifiable manner.

I'm sure you want the 1-1-1-1 setup since you want to push 32 cores more often than not. However, in an HEDT system AMD might prefer 2-2-0-0 since (scheduler permitting) it lets the first 16 cores act pretty much like the old Threadripper with minimal negative performance impact while sacrificing scalability to the second set of 16 cores.


I think the real question boils down to whether the memory config will be 2-2-0-0 or 2-0-2-0, as rumors first indicated. If AMD does 2-2-0-0 as you suggest and the first 16 cores are able to operate without the latency penalty of having to go over infinity fabric for all memory requests, then I think this would be far preferable for the vast majority of use cases.

I am interested in TR2, but if there is a massive latency penalty for workloads beyond the first 16 threads I think I'll stick with my meager 10 core 7900x. If that stretches to 32 threads though I think the majority of workloads I have will run great, and for those situations where more threads are needed, I already cannot run them on my 7900x so "slower" is still better than "not at all".
 
dragontamer5788
Gerbil
Posts: 66
Joined: Mon May 06, 2013 8:39 am

Re: Der8Auer ripped[2] Cinebench's threads out!

Wed Jul 18, 2018 10:05 am

techguy wrote:
I think the real question boils down to whether the memory config will be 2-2-0-0 or 2-0-2-0, as rumors first indicated.


2-2-0-0 is equivalent to 2-0-2-0. All dies on EPYC are directly connected to other dies.

https://en.wikichip.org/w/images/3/39/A ... es_SoC.svg
https://en.wikichip.org/w/images/d/d5/z ... res%29.svg
 
techguy
Gerbil XP
Posts: 311
Joined: Tue Aug 10, 2010 9:12 am

Re: Der8Auer ripped[2] Cinebench's threads out!

Wed Jul 18, 2018 10:32 am

dragontamer5788 wrote:
techguy wrote:
I think the real question boils down to whether the memory config will be 2-2-0-0 or 2-0-2-0, as rumors first indicated.


2-2-0-0 is equivalent to 2-0-2-0. All dies on EPYC are directly connected to other dies.

https://en.wikichip.org/w/images/3/39/A ... es_SoC.svg
https://en.wikichip.org/w/images/d/d5/z ... res%29.svg


The problem is one of logical organization of cores. 2-0-2-0 implies that cores 0-7 (or 1-8 if you prefer) will have access to a full 2 channels of memory bandwidth without the need to jump over infinity fabric to send/receive data in memory, and the next set of 8 cores will have to make that hop over infinity fabric which will incur a latency penalty.

A 2-2-0-0 config implies the first 16 cores will each have those dual memory channels. That's a far more compelling solution for workloads that scale beyond 8 cores/16 threads.
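The core-ordering argument can be spelled out with a small model. It assumes 8 cores per die and that logical cores are numbered die by die in order. That numbering is an assumption for illustration, since the real enumeration depends on firmware and the OS, and the function name is hypothetical.

```python
# Which logical cores sit on a die with its own memory channels?
# Assumes 8 cores per die and die-major core numbering (an assumption).

CORES_PER_DIE = 8


def cores_with_local_memory(channels_per_die):
    """Return the logical core ids whose die has at least one memory channel."""
    local = []
    for die, channels in enumerate(channels_per_die):
        if channels > 0:
            start = die * CORES_PER_DIE
            local.extend(range(start, start + CORES_PER_DIE))
    return local


print("2-0-2-0:", cores_with_local_memory((2, 0, 2, 0)))
print("2-2-0-0:", cores_with_local_memory((2, 2, 0, 0)))
```

Under that numbering, 2-0-2-0 leaves a gap after core 7 while 2-2-0-0 keeps the first 16 cores local. A scheduler that knows the NUMA topology can pick the local cores in either case, so the difference only bites when threads are placed by core index alone.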
 
Waco
Gold subscriber
Minister of Gerbil Affairs
Posts: 2391
Joined: Tue Jan 20, 2009 4:14 pm
Location: Los Alamos, NM

Re: Der8Auer ripped[2] Cinebench's threads out!

Wed Jul 18, 2018 11:34 am

dragontamer5788 wrote:
Waco wrote:
dragontamer5788 wrote:
1+1+1+1 would be a significant degradation in single-threaded memory bandwidth; in fact, it would halve your bandwidth.

Sure, but anything that scales well doesn't really depend on a single thread's memory bandwidth very heavily. Spreading the bandwidth evenly between dies and pinning processes properly will *always* be preferable to competing over remote resources in my book (simulation HPC and parallel storage systems). Heck, your example doesn't really even hold true for anything reasonably scalable (like Cinebench) since many of them spent an awful lot of time reducing any pinch point on performance...which is typically single threads hogging execution time and holding up the rest of the program.


If you have a problem which scales poorly, I'd expect 42GB/s of low-latency RAM access over 16 cores to be superior to 21GB/s of low-latency access over 32 cores. It's not like your poorly-scaling problem was going to use 16 (or more!!) cores anyway.

I think my point was that if you have a poorly-scaling program...Threadripper probably isn't the CPU you want anyway.

For those who can put things like this to work, I would bet on the 1+1+1+1 configuration being more consistent and efficient.
 
Waco
Gold subscriber
Minister of Gerbil Affairs
Posts: 2391
Joined: Tue Jan 20, 2009 4:14 pm
Location: Los Alamos, NM

Re: Der8Auer ripped[2] Cinebench's threads out!

Wed Jul 18, 2018 11:36 am

techguy wrote:
The problem is one of logical organization of cores. 2-0-2-0 implies that cores 0-7 (or 1-8 if you prefer) will have access to a full 2 channels of memory bandwidth without the need to jump over infinity fabric to send/receive data in memory, and the next set of 8 cores will have to make that hop over infinity fabric which will incur a latency penalty.

A 2-2-0-0 config implies the first 16 cores will each have those dual memory channels. That's a far more compelling solution for workloads that scale beyond 8 cores/16 threads.

I don't think that matters at all with a modern OS - it'll schedule processes on NUMA nodes with local memory before scheduling on those only with remote memory.
 
techguy
Gerbil XP
Posts: 311
Joined: Tue Aug 10, 2010 9:12 am

Re: Der8Auer ripped[2] Cinebench's threads out!

Wed Jul 18, 2018 11:45 am

Waco wrote:
dragontamer5788 wrote:
Waco wrote:
Sure, but anything that scales well doesn't really depend on a single thread's memory bandwidth very heavily. Spreading the bandwidth evenly between dies and pinning processes properly will *always* be preferable to competing over remote resources in my book (simulation HPC and parallel storage systems). Heck, your example doesn't really even hold true for anything reasonably scalable (like Cinebench) since many of them spent an awful lot of time reducing any pinch point on performance...which is typically single threads hogging execution time and holding up the rest of the program.


If you have a problem which scales poorly, I'd expect 42GB/s of low-latency RAM access over 16 cores to be superior to 21GB/s of low-latency access over 32 cores. It's not like your poorly-scaling problem was going to use 16 (or more!!) cores anyway.

I think my point was that if you have a poorly-scaling program...Threadripper probably isn't the CPU you want anyway.

For those who can put things like this to work, I would bet on the 1+1+1+1 configuration being more consistent and efficient.


More consistent - sure. I think we'll find that performance of single-threaded workloads is capped in such a configuration, though. Now obviously you don't buy a 32-core/64-thread CPU to run a single single-threaded workload, but it might not be a bad platform for a mixed environment with many workloads of varying levels of threadedness.
 
techguy
Gerbil XP
Posts: 311
Joined: Tue Aug 10, 2010 9:12 am

Re: Der8Auer ripped[2] Cinebench's threads out!

Wed Jul 18, 2018 11:47 am

Waco wrote:
techguy wrote:
The problem is one of logical organization of cores. 2-0-2-0 implies that cores 0-7 (or 1-8 if you prefer) will have access to a full 2 channels of memory bandwidth without the need to jump over infinity fabric to send/receive data in memory, and the next set of 8 cores will have to make that hop over infinity fabric which will incur a latency penalty.

A 2-2-0-0 config implies the first 16 cores will each have those dual memory channels. That's a far more compelling solution for workloads that scale beyond 8 cores/16 threads.

I don't think that matters at all with a modern OS - it'll schedule processes on NUMA nodes with local memory before scheduling on those only with remote memory.


In theory, sure. In practice though? Windows has demonstrated in the past that it does not always know how best to schedule a given workload, particularly on AMD platforms. This is why Threadripper 1 has a game mode.
 
dragontamer5788
Gerbil
Posts: 66
Joined: Mon May 06, 2013 8:39 am

Re: Der8Auer ripped[2] Cinebench's threads out!

Wed Jul 18, 2018 12:02 pm

Waco wrote:
dragontamer5788 wrote:
Waco wrote:
Sure, but anything that scales well doesn't really depend on a single thread's memory bandwidth very heavily. Spreading the bandwidth evenly between dies and pinning processes properly will *always* be preferable to competing over remote resources in my book (simulation HPC and parallel storage systems). Heck, your example doesn't really even hold true for anything reasonably scalable (like Cinebench) since many of them spent an awful lot of time reducing any pinch point on performance...which is typically single threads hogging execution time and holding up the rest of the program.


If you have a problem which scales poorly, I'd expect 42GB/s of low-latency RAM access over 16 cores to be superior to 21GB/s of low-latency access over 32 cores. It's not like your poorly-scaling problem was going to use 16 (or more!!) cores anyway.

I think my point was that if you have a poorly-scaling program...Threadripper probably isn't the CPU you want anyway.

For those who can put things like this to work, I would bet on the 1+1+1+1 configuration being more consistent and efficient.


All programs alternate between single-threaded modes and multi-threaded modes. The only question is the degree to which it happens. Blender is a great example: single-threaded when running physics simulations, multi-threaded when running renders.

Blender is already on the edge of "multi-threaded friendliness", but even then... Threadripper's weaker single-threaded performance shows up on any practical use of cloth-physics or fluid-physics baking.

If multi-threaded were all I cared about, then Bulldozer would have been a superior chip. In practice, all programs (even 3d renderers, video editors and more) have a major single-threaded component that cannot be ignored.
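The standard way to put a number on this is Amdahl's law, which the post does not name but which captures the same idea: the serial part of a job limits the speedup no matter how many cores are added. A minimal sketch follows; the 90% parallel fraction is an arbitrary example, not a measurement of Blender.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Ideal speedup over one core when only part of the work is parallel."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Example: a job that is 90% parallel (an arbitrary illustrative figure).
for cores in (8, 16, 32):
    print(cores, "cores:", round(amdahl_speedup(0.9, cores), 2))
```

With that example fraction, going from 16 to 32 cores raises the ideal speedup only from about 6.4x to about 7.8x, which is why the single-threaded portion still matters on a 32-core part.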
 
Waco
Gold subscriber
Minister of Gerbil Affairs
Posts: 2391
Joined: Tue Jan 20, 2009 4:14 pm
Location: Los Alamos, NM

Re: Der8Auer ripped[2] Cinebench's threads out!

Wed Jul 18, 2018 12:46 pm

techguy wrote:
Waco wrote:
techguy wrote:
The problem is one of logical organization of cores. 2-0-2-0 implies that cores 0-7 (or 1-8 if you prefer) will have access to a full 2 channels of memory bandwidth without the need to jump over infinity fabric to send/receive data in memory, and the next set of 8 cores will have to make that hop over infinity fabric which will incur a latency penalty.

A 2-2-0-0 config implies the first 16 cores will each have those dual memory channels. That's a far more compelling solution for workloads that scale beyond 8 cores/16 threads.

I don't think that matters at all with a modern OS - it'll schedule processes on NUMA nodes with local memory before scheduling on those only with remote memory.


In theory, sure. In practice though? Windows has demonstrated in the past that it does not always know how best to schedule a given workload, particularly on AMD platforms. This is why Threadripper 1 has a game mode.

<- does not care about Windows much. :)

Game mode isn't due to Windows scheduling, though.

Anyway - it's clear which way AMD is likely to go. It's still not the right approach IMO.
 
synthtel2
Gerbil Elite
Posts: 740
Joined: Mon Nov 16, 2015 10:30 am

Re: Der8Auer ripped[2] Cinebench's threads out!

Wed Jul 18, 2018 1:32 pm

dragontamer5788 wrote:
It will be a problem for anyone who plays video games after they're done with their Blender render. For example: me. All of my video games will get 21GB/s on a 1+1+1+1 setup (when a far cheaper Ryzen 2700 or Intel i7-8700K will get 42GB/s on a 2-channel setup), because games are typically latency-sensitive and bottleneck on a few cores.

No argument, 2/2/0/0 is definitely better for mixed workloads. A 32C CPU is just enough of a high-end tool already that it seems better overall to optimize it for the workloads it's more naturally good at. Either way, this sounds like something that people like Chrispy_ will (in aggregate) be buying a lot of, but a 32C all-purpose workstation sounds pretty niche.
