TR Forums

Waco · Fri Jun 08, 2018 10:41 am

Cloudy workloads typically are compute bound. AMD is smart to target them.

chuckula · Fri Jun 08, 2018 10:50 am

Ok, to simplify Topinio's wall-o-text that doesn't get the very easy to understand premise:

Some workloads which are CPU-bound and not memory-bound will perform better when CPUs with more performance come along; some workloads which are memory-bound and not CPU-bound won't. This should be obvious. It should be equally obvious that in workloads which are memory-bound on the socket, and whose memory requirements scale with the number of threads used, trying to use more threads won't ever help.

True but irrelevant because you are forgetting to factor in scaling to your arguments. Right now there are x% of workloads that are "compute bound" on, let's say, a 32-core CPU with 4 channels of RAM where the RAM doesn't really matter. So we can just scale that up right! Well, your next CPU is now only running (x- Δ)% of workloads in your "compute bound" scenario and a brand new set of Δ workloads suddenly aren't feeling the love from all those extra cores. And Δ is not an insignificant number.

I'm sure we'll see this to different degrees when Threadripper2 and Skylake X2 launch later this year with reduced memory I/O levels compared to their server equivalents. Sure it won't affect Cinebench but there will be repercussions.

Are you arguing that companies (AMD??) should not be designing and producing newer CPUs which will help the first set, because that doesn't help the last set of use cases?

Of course not and frankly it makes it hard to take the rest of your argument seriously, especially when I could link to a dozen posts screaming about how TR is unfair to a desktop RyZen chip by not running it with bleeding-edge RAM timings, and a lower core-count desktop part is the least impacted by all of this.

synthtel2 · Fri Jun 08, 2018 11:29 am

If it's cheap, which it is, it doesn't have to scale well in very many workloads to be a good part.

Topinio · Fri Jun 08, 2018 1:50 pm

chuckula wrote:
True but irrelevant because you are forgetting to factor in scaling to your arguments.

I don't think I am. I think I'm explicitly saying that scaling is the reason that newer CPUs keep being made with more and more cores.

That, and scaling is the basis on which I argued earlier that per-core memory bandwidth isn't the problem.

chuckula wrote:
Right now there are x% of workloads that are "compute bound" on, let's say, a 32-core CPU with 4 channels of RAM where the RAM doesn't really matter. So we can just scale that up right!

We all know that in some cases, yes; in others, no.

chuckula wrote:
Well, your next CPU is now only running (x- Δ)% of workloads in your "compute bound" scenario and a brand new set of Δ workloads suddenly aren't feeling the love from all those extra cores. And Δ is not an insignificant number.

Let x=50 and Δ=20. What you wrote says only that if I am compute-bound on 50% of my workloads on my existing CPU, then after buying a better one I can be compute-bound on only 30% of my workloads. Because the new CPU is better.

Please clarify why that 20% of my workloads "aren't feeling the love" by virtue of their no longer being compute-bound.

chuckula wrote:
I'm sure we'll see this to different degrees when Threadripper2 and Skylake X2 launch later this year with reduced memory I/O levels compared to their server equivalents. Sure it won't affect Cinebench but there will be repercussions.

Will we? Is anyone benchmarking the "repercussions" of the reduced memory bandwidth you get when going from EPYC 7351P to Threadripper 1950X, or from Xeon Gold 6154 to Core i9-7980XE?

Mr Bill · Fri Jun 08, 2018 7:11 pm

chuckula wrote:
Mikael33 wrote:
this topic doesn't seem very serious either

Oh, it's deadly serious assuming Lisa Su didn't flat-out lie at the end of AMD's webcast yesterday. Which, unlike the AMD fansquad around here, I actually watched live.

People tend to forget that I take technology, but not myself, seriously.

The story comments sections are littered with idiots who, quite curiously, tend to hold themselves in great esteem while still not being able to think through the ramifications of their object of worship.

On a slightly more serious note. Can't a particular core timeshare more than one channel of DDR4 simultaneously?

just brew it! · Fri Jun 08, 2018 7:34 pm

Mr Bill wrote:
On a slightly more serious note. Can't a particular core timeshare more than one channel of DDR4 simultaneously?

Not sure what you mean by this.

Mr Bill · Fri Jun 08, 2018 8:51 pm

just brew it! wrote:
Mr Bill wrote:
On a slightly more serious note. Can't a particular core timeshare more than one channel of DDR4 simultaneously?

Not sure what you mean by this.

My life in metaphor.
I mean that if only one core was active, surely it could talk to more than one memory channel? If only two cores were active they could swap having those same channels. Slice the time enough and any one core can see those same memory channels... No? Or do I mean memory bank? I'm neither a hardware nor software engineer.

Redocbew · Fri Jun 08, 2018 9:18 pm

When there's an instruction that says "load this chunk of data" I don't believe the execution core of the CPU has any idea about memory channels, or even if the data is in main memory rather than being stored locally in a cache. My understanding is it's the memory controller (now usually also a part of the CPU) which would be responsible for the details on how requests get divided up into channels. I don't know the specifics about how that happens, but I'd be surprised to learn that it was a simple, static assignment of n cores across m channels. It's probably much more flexible than that.

Fri Jun 08, 2018 10:47 pm

Mr Bill wrote:
just brew it! wrote:
Mr Bill wrote:
On a slightly more serious note. Can't a particular core timeshare more than one channel of DDR4 simultaneously?

Not sure what you mean by this.

My life in metaphor.
I mean that if only one core was active, surely it could talk to more than one memory channel? If only two cores were active they could swap having those same channels. Slice the time enough and any one core can see those same memory channels... No? Or do I mean memory bank? I'm neither a hardware nor software engineer.

Well yeah, that's basically what happens. The problem is that if you don't add memory channels in proportion to the number of cores, memory can become a bottleneck since there isn't enough shared bandwidth to keep all the cores fed.
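
A rough way to put numbers on that: attainable throughput is the lower of a compute ceiling (which grows with cores) and a memory ceiling (which grows with channels). The sketch below is a back-of-the-envelope model, not a benchmark. The function name and every constant in it are made-up illustrative assumptions, not figures for any real CPU.

```python
def throughput_gops(cores, channels, gops_per_core, gbs_per_channel, bytes_per_op):
    """Attainable throughput in Gop/s: the lower of the compute and memory ceilings."""
    compute_ceiling = cores * gops_per_core                     # scales with core count
    memory_ceiling = channels * gbs_per_channel / bytes_per_op  # scales with channel count
    return min(compute_ceiling, memory_ceiling)


# Illustrative numbers only (assumptions, not measurements).
GOPS_PER_CORE = 10.0    # hypothetical per-core compute rate
GBS_PER_CHANNEL = 20.0  # hypothetical sustained bandwidth per memory channel

for bytes_per_op in (0.1, 1.0):      # light vs heavy memory traffic per operation
    for cores in (8, 16, 32):
        t = throughput_gops(cores, 4, GOPS_PER_CORE, GBS_PER_CHANNEL, bytes_per_op)
        print(f"{bytes_per_op} B/op, {cores} cores, 4 channels: {t:.0f} Gop/s")
```

With these made-up figures, the 0.1 B/op workload keeps scaling up to 32 cores, while the 1 B/op workload is already pinned at the memory ceiling with 8 cores and gains nothing from more.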

dragontamer5788 · Fri Jun 22, 2018 1:33 pm

Mr Bill wrote:
chuckula wrote:
Mikael33 wrote:
this topic doesn't seem very serious either

Oh, it's deadly serious assuming Lisa Su didn't flat-out lie at the end of AMD's webcast yesterday. Which, unlike the AMD fansquad around here, I actually watched live.

People tend to forget that I take technology, but not myself, seriously.

The story comments sections are littered with idiots who, quite curiously, tend to hold themselves in great esteem while still not being able to think through the ramifications of their object of worship.
On a slightly more serious note. Can't a particular core timeshare more than one channel of DDR4 simultaneously?

That's not how computers work.

Infinity Fabric connects L3 caches together. The L3 cache's job is to ensure that the cores THINK that they all see the same data, but there's an entire architecture and process going on (infinity fabric + cache coherency).

CPU Cores talk to the L3 cache. The L3 cache fetches data from a "home controller", which is the memory controller. The memory controller checks to see if any other L3 cache out there holds the data, and then forwards the request as appropriate. If no one else has the data, the memory controller fetches it from memory.

This is a bit unusual, but it's how AMD Threadripper / EPYC seems to work (this doesn't seem to be documented anywhere, btw). As a result, fetching data from another L3 cache is slower than fetching it from memory, at least from a latency perspective. However, it is faster from a bandwidth perspective (if another L3 cache holds the data, then the L3 -> L3 cache copy won't use up any memory-controller bandwidth to DDR4).
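
The request flow described above can be sketched as a toy model. This is only an illustration of the steps as described in this post, not AMD's actual protocol (which, as noted, doesn't seem to be documented); the `fetch` function, the `ccx0`/`ccx1` names, and the dict-based caches are all hypothetical.

```python
def fetch(line, requester, l3_caches, dram):
    """Return (value, source) for a cache-line request, following the flow in the post."""
    local = l3_caches[requester]
    if line in local:                     # hit in the requester's own L3
        return local[line], "local L3"
    # Miss: the request goes to the "home" memory controller, which first
    # checks whether any other L3 already holds the line.
    for owner, cache in l3_caches.items():
        if owner != requester and line in cache:
            local[line] = cache[line]     # L3 -> L3 copy, no DRAM bandwidth used
            return local[line], f"remote L3 ({owner})"
    local[line] = dram[line]              # nobody has it: read it from DRAM
    return local[line], "DRAM"


# Hypothetical two-cache system: ccx1 holds a newer copy of line 0x40 than DRAM.
l3 = {"ccx0": {}, "ccx1": {0x40: "B"}}
dram = {0x00: "A", 0x40: "stale B"}

print(fetch(0x00, "ccx0", l3, dram))  # served from DRAM
print(fetch(0x00, "ccx0", l3, dram))  # now a local L3 hit
print(fetch(0x40, "ccx0", l3, dram))  # forwarded from the other L3
```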

These transactions are known to be 32 bytes on AMD Threadripper. So whenever data is copied from L3 -> L3, or from DDR4 -> L3, at least 32 bytes are transferred at a time. The unit the caches actually track is the cache line (64 bytes on x86), and sharing at that granularity is the cause of "false sharing" (https://en.wikipedia.org/wiki/False_sharing), if you've ever heard of it.
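
For anyone who hasn't run into false sharing: two unrelated variables contend whenever they land in the same cache line. A minimal sketch of the address arithmetic, assuming the 64-byte line size used by current x86 parts; the addresses and helper names are hypothetical.

```python
CACHE_LINE_BYTES = 64  # line size on current x86 CPUs; an assumption elsewhere


def cache_line(addr):
    """Index of the cache line that a byte address falls into."""
    return addr // CACHE_LINE_BYTES


def falsely_shared(addr_a, addr_b):
    """Two distinct variables false-share when they occupy the same cache line."""
    return addr_a != addr_b and cache_line(addr_a) == cache_line(addr_b)


# Hypothetical layout: two 8-byte per-thread counters.
base = 0x1000
packed = (base, base + 8)                 # adjacent: both live in one line
padded = (base, base + CACHE_LINE_BYTES)  # padded out to separate lines

print(falsely_shared(*packed))
print(falsely_shared(*padded))
```

The usual fix is the second layout: pad or align per-thread data so each thread's hot variables sit in their own line.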

------------------

Intel's mesh is more unified. The entire block of L3 on Intel systems sees all of the data in the other L3 slices, so these L3 -> L3 cache copies will never happen on an Intel system (unless you configure "Sub-NUMA Clustering" or other such obscure options; just in case your application actually does benefit from multiple copies somehow, Intel provides that BIOS option on its higher-end systems).

------------------

With that being said: DDR4 has 16 banks per rank (on the x4/x8 chips typically used on DIMMs), which means DDR4 can have roughly 16 requests in progress at the same time. Even DDR3 had 8 banks. So DDR4 advancements in bandwidth and latency will improve things in the future.

DDR4 is roughly 2x faster than DDR3 in bandwidth-related metrics. And with out-of-order scheduling, deep L2 / L3 caches, large reorder buffers and hyperthreading... individual cores know how to "stay busy" even while they wait for the memory controller, which turns many latency-sensitive tasks into bandwidth-sensitive tasks. Future bandwidth improvements will dramatically help future work. Indeed, the big advancement in bandwidth from DDR3 -> DDR4 is probably why something like the 32-core Threadripper is somewhat reasonable.
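
To put rough numbers on this: the peak bandwidth of one 64-bit channel is the transfer rate times 8 bytes, and Little's law gives the number of requests that must be in flight to sustain a given bandwidth at a given latency. A small sketch; DDR3-1600 and DDR4-3200 are example speed grades picked for illustration, and the 50 ns latency and 64-byte line are assumptions too.

```python
def peak_gbs(mega_transfers_per_s, bus_bytes=8):
    """Peak bandwidth of one channel in GB/s (64-bit bus = 8 bytes per transfer)."""
    return mega_transfers_per_s * bus_bytes / 1000


def lines_in_flight(bandwidth_gbs, latency_ns, line_bytes=64):
    """Little's law: outstanding cache-line requests needed to sustain a bandwidth."""
    # GB/s * ns = bytes in flight; divide by the line size to get requests.
    return bandwidth_gbs * latency_ns / line_bytes


ddr3 = peak_gbs(1600)  # example DDR3 speed grade (assumption)
ddr4 = peak_gbs(3200)  # example DDR4 speed grade (assumption)
print(f"DDR3-1600: {ddr3:.1f} GB/s per channel")
print(f"DDR4-3200: {ddr4:.1f} GB/s per channel")
print(f"4 channels shared by 32 cores: {4 * ddr4 / 32:.1f} GB/s per core")
print(f"requests in flight to saturate one channel: {lines_in_flight(ddr4, 50):.0f}")
```

Under those assumptions a DDR4-3200 channel peaks at twice a DDR3-1600 channel, and saturating it takes about 20 cache-line requests in flight, which is why cores need all that machinery for overlapping memory accesses.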

Redocbew wrote:
When there's an instruction that says "load this chunk of data" I don't believe the execution core of the CPU has any idea about memory channels, or even if the data is in main memory rather than being stored locally in a cache. My understanding is it's the memory controller (now usually also a part of the CPU) which would be responsible for the details on how requests get divided up into channels. I don't know the specifics about how that happens, but I'd be surprised to learn that it was a simple, static assignment of n cores across m channels. It's probably much more flexible than that.

It's more complicated than that, because when different threads running on different cores see the data, you need to pretend that the data is all coherent, even though there's a caching layer.

If thread #1 says "store 10 into A. Then, store 25 into A" in that order, how will the other cores see this??

Chances are: other cores, for sake of efficiency, will only see "store 25 into A". The store 10 into A "collapses", because it happens in the L1 cache, or maybe in the CPU registers. By the time the L1 cache, L2 cache, and L3 cache pass the information around, the only message that gets passed to the rest of the system is the final "store 25 into A" message.

To make things even more complicated, modern processors can reorder loads and stores. Imagine if you will, that a computer has "store 10 into A; load B into CPU". But "store 10 into A" is going to take 50 nanoseconds (that's 150 clock cycles in a 3GHz computer !!!). A modern x86 processor will do "Load B into CPU" before the store, because the CPU doesn't want to wait the 150+ cycles for main memory. Hell, the CPU doesn't even want to wait the 4 clock cycles to talk to L1 cache!! Modern x86 processors are out-of-order, and will seek to do any work possible to keep themselves fed.

Load/load, load/store, and store/store ordering is respected on x86; the only reordering x86 allows is a later load passing an earlier store (the store/load boundary). Other processors don't necessarily respect those boundaries.

In effect, the CPU continues to execute independently of memory. Even if the CPU says "load A", it isn't going to stall waiting for the data! It continues to execute whatever instructions don't depend on A.
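
The store/load case is the classic "store buffering" litmus test. Python can't provoke real hardware reordering, so the sketch below is a toy model that enumerates every interleaving by hand: each store is split into "issued" and "visible to other cores", and with a store buffer a thread's later load may run before its own store becomes visible. The `outcomes` function and the event labels are made up for illustration.

```python
from itertools import permutations


def outcomes(store_buffer):
    """All (r0, r1) results of the store-buffering litmus test.

    Thread 0: x = 1; r0 = y        Thread 1: y = 1; r1 = x
    """
    # Per thread: S = store issued, F = store becomes globally visible, L = load.
    events = [(t, e) for t in (0, 1) for e in "SFL"]
    results = set()
    for order in permutations(events):
        pos = {ev: i for i, ev in enumerate(order)}
        ok = True
        for t in (0, 1):
            # Program order: the store is issued before it drains and before the load.
            if not (pos[(t, "S")] < pos[(t, "F")] and pos[(t, "S")] < pos[(t, "L")]):
                ok = False
            # Without a store buffer, the store must be visible before the later load.
            if not store_buffer and pos[(t, "F")] > pos[(t, "L")]:
                ok = False
        if not ok:
            continue
        memory = {"x": 0, "y": 0}
        regs = {}
        for t, e in order:
            mine, other = ("x", "y") if t == 0 else ("y", "x")
            if e == "F":
                memory[mine] = 1          # store reaches shared memory
            elif e == "L":
                regs[t] = memory[other]   # load reads the other thread's variable
        results.add((regs[0], regs[1]))
    return results


print(sorted(outcomes(store_buffer=False)))  # strictly in-order memory
print(sorted(outcomes(store_buffer=True)))   # store/load reordering allowed
```

In this model the only difference between the two runs is the (0, 0) result: both threads read the old value, which can only happen if each load ran ahead of the thread's own earlier store.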

16 core CPUs with a single RAM channel are AWESOME