TR Forums

Redocbew · Wed May 03, 2017 9:58 pm

Ryu Connor wrote:
This distinctly reminds me of Bob Colwell and his lecture at Stanford. He regaled the class with a tale from when the Itanium team cheated at SPEC by hand optimizing an instruction loop. Bob called the Itanium team out and told them that if they could show him a compiler reaching the same conclusion about the series of instructions he would rescind his complaint. The Itanium team couldn't meet Bob's request though, it was too early in the project for such a thing.

That was a simulation run in an attempt to improve Itanium's standing as an x86 replacement, not a modification to the chip itself, wasn't it? He was right to call them out about it, but I'm not sure the comparison works here. Fudging your numbers for the boss is quite a different game than making changes to a shipping product which may have negative consequences on other software, and I still haven't seen much to indicate that's what AMD did here.

Wed May 03, 2017 10:16 pm

Redocbew wrote:
That was a simulation run in an attempt to improve Itanium's standing as an x86 replacement, not a modification to the chip itself, wasn't it? He was right to call them out about it, but I'm not sure the comparison works here. Fudging your numbers for the boss is quite a different game than making changes to a shipping product which may have negative consequences on other software, and I still haven't seen much to indicate that's what AMD did here.

There's a strong likelihood it wasn't. In the very same lecture Mr. Colwell talked about just how complex the interaction is between the various daemons used to make an out-of-order processor work. He discussed how it creates complex bugs and that often in the fixing of those complex bugs you end up creating more bugs.

That being said, it could be. And saying there's "no incentive" when you had an entire company fixated on hitting a major performance milestone to save the company?

Yes, I imagine there was a lot of pressure to make sure Zen performed well for the suits.

I.S.T. · Wed May 03, 2017 10:42 pm

chuckula wrote:
DancinJack wrote:
Come on chuck. This one is a bit too thin to call it "caught cheating."

I'm just applying the same standards to AMD that regularly get applied to AMD's competitors around here. If AMD is back to being some type of top-dog in the CPU/GPU world then they'll have to deal with the fallout of stuff like this just like anybody else will.

It just really seems like you got an axe to grind and you're letting that cloud your judgment.

ptsant · Thu May 04, 2017 2:32 am

When I got my Ryzen 1700x, I ran my own assembly hand-coded benchmark from the 90s. I have scores from K5, Pentium, K7, Phenom II, Vishera etc and the code still runs as it is under linux. So I thought it would be nice to see how Ryzen does. The benchmark itself is not particularly interesting, because it looks at some very specific low-level sequences and today almost all of these are treated almost optimally and scale with the number of ALUs and CPU frequency.

Anyway, I was struck by the fact that some memory writes scaled faster than any theoretical boundary (faster than DDR4, faster than L3/L2/L1 and even faster than what the processor could theoretically read/write from registers). If I remember correctly, in a specific scenario the score was >400GB/s, which works out to >100B/cycle, which would require an 800-bit bus running at 4GHz. So, not possible.
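The arithmetic behind that estimate is easy to check. A quick sketch in Python, using only the round numbers from the post (400 GB/s, 4 GHz); the function name is made up for illustration:

```python
def bytes_per_cycle(bandwidth_gb_s: float, clock_ghz: float) -> float:
    """Bytes that must move per clock cycle to sustain the given bandwidth."""
    # GB/s divided by Gcycles/s leaves bytes per cycle (the giga prefixes cancel).
    return bandwidth_gb_s / clock_ghz


per_cycle = bytes_per_cycle(400.0, 4.0)  # figures quoted in the post
bus_bits = per_cycle * 8                 # bus width needed to move that each cycle
print(per_cycle, bus_bits)
```

So a sustained 400 GB/s at 4 GHz means 100 bytes, or 800 bits, every single cycle, which is why the number looked impossible for real stores.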

Comparing with Vishera, and especially the Phenom, I realized that the CPU was probably not even executing these instructions. It correctly deduced that I was rewriting the same thing and simply didn't do it. Now, I don't know how often this occurs in practice, because the code was, as I said, very artificial, but it seemed like a particularly nifty optimization. Not cheating.

The way I see it, the CPU obtained the necessary result (the output). Executing all the instructions is not, in itself, a requirement.

ptsant · Thu May 04, 2017 2:35 am

chuckula wrote:
TR says it all right here.

Now admittedly CPU-Z was never a proper benchmarking tool, which is why TR wisely avoided using it, but that's still no excuse for "optimizing" your way to success in microbenchmarks by intentionally ignoring the program code. Especially when those optimizations could come back to bite you in subtle and unpredictable ways when running a real-world program that doesn't work with your "optimizations".

It all comes down to (a) whether the final state of the system was the same (including memory, registers and flags) or not for having skipped the instructions and (b) whether the behavior is documented. You really can't say it's cheating if the Ryzen system achieves the same output as an Intel system, no matter how it got there. I'd rather say the benchmark needs to be modified to properly account for a new type of optimization. Happens all the time.

shodanshok · Thu May 04, 2017 4:02 am

chuckula wrote:
TR says it all right here.

Now admittedly CPU-Z was never a proper benchmarking tool, which is why TR wisely avoided using it, but that's still no excuse for "optimizing" your way to success in microbenchmarks by intentionally ignoring the program code. Especially when those optimizations could come back to bite you in subtle and unpredictable ways when running a real-world program that doesn't work with your "optimizations".

An OoO processor is an extremely complex beast. The chances of altering the hardware design to win an obscure benchmark are basically zero.

Rather, what is likely to happen is that the OoO machinery detects some instructions whose results are immediately discarded/overwritten, and skips them completely without even executing them. The detection and discarding of these instructions probably happens in the reservation station, where instructions are analyzed and scheduled to available compute units.
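To make the idea concrete, here is a toy model in Python. It is only a sketch of the principle (skip a write that is overwritten before anything can observe it), not a description of what Ryzen's hardware actually does; the `run` function and the `mem[...]` names are invented for the example, and the toy program contains writes only, no reads.

```python
def run(program, skip_dead_writes=False):
    """Execute (dest, value) writes; return (final_state, writes_performed)."""
    state = {}
    performed = 0
    for i, (dest, value) in enumerate(program):
        if skip_dead_writes:
            # A write is "dead" if the very next operation overwrites the same
            # destination, so nothing could have observed it in between.
            nxt = program[i + 1] if i + 1 < len(program) else None
            if nxt is not None and nxt[0] == dest:
                continue
        state[dest] = value
        performed += 1
    return state, performed


# Toy "benchmark": hammer the same location over and over, then one other write.
program = [("mem[0]", 42)] * 1000 + [("mem[1]", 7)]

full_state, full_count = run(program)
fast_state, fast_count = run(program, skip_dead_writes=True)
print(full_state == fast_state, full_count, fast_count)
```

Both runs end in the same state, but the second performs 2 writes instead of 1001. A benchmark that divides "bytes written" by elapsed time would report absurd bandwidth for the second run even though the visible result is identical.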

Intel processors already do that on MOV instructions (it's called move elimination in the literature), and, you know what, this affects expected latency/throughput: https://software.intel.com/en-us/forums ... pic/392752

Let me stress that these optimizations are a good thing, as they operate on any code which shows similar behavior. As compilers tend to generate repetitive code, this can bring significant speedup to non-optimal code. If anything, eliminating some redundant/useless uops will make more space for useful instructions (remember, on-chip buffers are relatively scarce).

If anything, this seems to confirm that Ryzen has extremely advanced OoO machinery.

shodanshok · Thu May 04, 2017 4:06 am

Ryu Connor wrote:
Ordinarily, that kind of automatic optimization would be welcome, but upon further investigation, the CPU-Z team failed to replicate that behavior with Ryzen CPUs in real-world situations. Furthermore, the team says that due to the extreme unlikelihood of that specific sequence of instructions showing up in non-benchmark software, it felt it would be best to revise CPU-Z to reflect real-world results more accurately.

This is the problematic line.

This distinctly reminds me of Bob Colwell and his lecture at Stanford. He regaled the class with a tale from when the Itanium team cheated at SPEC by hand optimizing an instruction loop. Bob called the Itanium team out and told them that if they could show him a compiler reaching the same conclusion about the series of instructions he would rescind his complaint. The Itanium team couldn't meet Bob's request though, it was too early in the project for such a thing.

Unfortunately the non-engineering management in the room didn't understand what was going on and ultimately the situation didn't raise a red flag like it should have.

What Bob's story demonstrates is that we can't just pawn this off as Hanlon's razor. While there's still a possibility that this happens to be a complete coincidence, there's also a possibility that some engineer did in fact make this tweak.

This is not the same thing. SPEC is cheated by the various compilers which greatly alter the output code. These are benchmark-specific optimizations that need to die, as they have zero benefit in real-world programs. For example, libquantum (part of the SPEC suite) is completely broken by Intel's ICC compiler. Some say that the only valid SPEC score is the gcc one, and Itanium always had a very bad showing at it.

Anyway SPEC, being an industry standard benchmark, is extremely attractive, and many CPU/compiler vendors cheat at it...

just brew it! · Thu May 04, 2017 7:18 am

I.S.T. wrote:
chuckula wrote:
I'm just applying the same standards to AMD that regularly get applied to AMD's competitors around here. If AMD is back to being some type of top-dog in the CPU/GPU world then they'll have to deal with the fallout of stuff like this just like anybody else will.

It just really seems like you got an axe to grind and you're letting that cloud your judgment.

Kinda feels like AMD ran over his dog or something...

Unless there's a shred of evidence that this was anything more than an optimization to eliminate execution of redundant instructions, which tripped up a sloppily coded benchmark, there shouldn't be any "fallout" to deal with.

ptsant wrote:
When I got my Ryzen 1700x, I ran my own assembly hand-coded benchmark from the 90s. <snip>

Comparing with Vishera, and especially the Phenom, I realized that the CPU was probably not even executing these instructions. It correctly deduced that I was rewriting the same thing and simply didn't do it. Now, I don't know how often this occurs in practice, because the code was, as I said, very artificial, but it seemed like a particularly nifty optimization. Not cheating.

It seems fairly likely that you encountered either the same, or a very similar scenario to what tripped up CPU-Z. I'd say this is pretty good evidence that RyZen is simply very good at spotting certain types of superfluous instruction sequences, and optimizing them out.

ptsant wrote:
The way I see it, the CPU obtained the necessary result (the output). Executing all the instructions is not, in itself, a requirement.

Bingo.

The only downsides I can think of are:

1. Code which depends on timing loops may misbehave. In general timing loops are a bad idea anyway. Execution time of specific instruction sequences is already unpredictable enough on modern architectures (clock speeds vary widely due to power saving modes, and HT tosses another variable into the mix...) to be problematic. It's also inefficient, since that core could be off doing something useful on another thread instead of just looping.

2. Complex optimizations open up the possibility of introducing complex bugs. If the CPU gets confused into skipping instructions that actually matter, that's a problem.

3. Benchmark writers need to be more careful to avoid clearly redundant sequences of operations, and to ensure that results of calculations get "used" in a way that convinces any optimization logic that those calculations affect system state in a meaningful way.
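As a rough illustration of point 3, here is the usual pattern: make every iteration feed into a value the benchmark actually reports, so no step is redundant. This is a minimal Python sketch with made-up names (`timed_checksum`); interpreter overhead dominates here, and a real microbenchmark would be written in C or assembly, so treat it as showing the shape of the idea only.

```python
import time


def timed_checksum(n: int):
    """Time a loop whose every iteration feeds into the returned checksum."""
    checksum = 0
    start = time.perf_counter()
    for i in range(n):
        # Each result depends on the previous one and on i, so no iteration
        # is redundant and none can be dropped without changing the output.
        checksum = (checksum * 31 + i) & 0xFFFFFFFF
    elapsed = time.perf_counter() - start
    return checksum, elapsed


checksum, elapsed = timed_checksum(100_000)
print(f"checksum={checksum:#010x} elapsed={elapsed:.6f}s")
```

Printing (or otherwise consuming) the checksum matters as much as computing it: a result nobody reads is exactly the kind of work an optimizer is entitled to drop.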

ptsant wrote:
It all comes down to (a) whether the final state of the system was the same (including memory, registers and flags) or not for having skipped the instructions and (b) whether the behavior is documented. You really can't say it's cheating if the Ryzen system achieves the same output as an Intel system, no matter how it got there. I'd rather say the benchmark needs to be modified to properly account for a new type of optimization. Happens all the time.

I'd even argue that it doesn't need to be documented, other than a blanket statement to the effect that "execution of instruction sequences which have no net effect on state may be skipped".

Aranarth · Thu May 04, 2017 7:50 am

TL;DR Ryzen saw that a benchmark was just make-work and skipped to the end. In the real world this is called READING ALL OF THE DIRECTIONS before starting a project.

My take: perfect! Unoptimized code will get executed much faster than expected in the real world.

The programmer's take: Processor operating as it should though in an unexpected way. It figured out we were giving it make-work so now we have to give it something that looks like REAL WORK.

Chuckula's take: that's cheating!

I suppose that CPU L1, L2, L3 caches are also cheating depending on your point of view.

Waco · Thu May 04, 2017 8:22 am

Ryu Connor wrote:
Waco wrote:
There's literally zero incentive for them to do so, and they run the risk of making particular code paths unstable if it really is a "cheat".

I'm not sure why you're saying that. The Itanium engineers were purposefully cheating to present super rosy results to their bosses.

Yet you say there's zero incentive? Clearly there is an incentive. Clearly it's actually happened in the real world in the past.

You think this is a coincidence? Fine. But I wouldn't use that line of reasoning. This is a terrible hill to choose to die on.

That's a very different scenario than we're talking about here. Hand optimized assembly that no compiler could create versus a benchmark that happened to do nothing with a CPU that correctly optimized part of it out.

Thu May 04, 2017 9:24 am

I cannot for the life of me find the source again, but I'm pretty sure that the CPU-Z devs were aware of their program producing spurious results at the Ryzen launch and said to avoid making any conclusions based on its performance with the benchmark at the time. I would have a hard time believing it was any kind of intentional gaming of the system.

CPU-Z was not the only benchmark tool that got caught flat-footed by this. Finalwire (AIDA64) and SiSoftware both had to produce Ryzen-optimized versions of their benchmarks after the CPUs launched.

Thu May 04, 2017 9:26 am

I'm pretty sure there's only one poster here who believes it to be deliberate.

ludi · Thu May 04, 2017 10:45 am

On the plus side, that one poster inadvertently started a really interesting and informative discussion. Good show, all!

MileageMayVary · Thu May 04, 2017 12:16 pm

I didn't know CPUZ even had a benchmark!

Concupiscence · Thu May 04, 2017 12:33 pm

MileageMayVary wrote:
I didn't know CPUZ even had a benchmark!

That makes two of us. After reading the article I found it sitting alone in its Bench tab at the far right, and though I've been using CPU-Z for years I have no idea how long the performance test has been there. The fact that it poops out a dimensionless coefficient representing some quantity and scaling to the number of threads is also pretty uninformative. This is a back of the envelope estimation of performance by design.

In any case, efficiently sussing out CPU busywork, only executing relevant code, and returning correct output isn't cheating. It'd be like complaining when a compiler determines that a simple computation always results in the same output and returns that output as a constant instead of going to the trouble to calculate it.
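The compiler analogy can be sketched with a tiny constant folder. This is an illustrative Python toy built on the standard `ast` module, not how any particular production compiler is implemented; the `ConstantFolder` class and `fold` helper are names invented for the example.

```python
import ast
import operator

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}


class ConstantFolder(ast.NodeTransformer):
    """Replace arithmetic on two literal numbers with the computed literal."""

    def visit_BinOp(self, node):
        self.generic_visit(node)  # fold the children first
        fn = _OPS.get(type(node.op))
        if (fn is not None
                and isinstance(node.left, ast.Constant)
                and isinstance(node.right, ast.Constant)
                and isinstance(node.left.value, (int, float))
                and isinstance(node.right.value, (int, float))):
            folded = ast.Constant(value=fn(node.left.value, node.right.value))
            return ast.copy_location(folded, node)
        return node


def fold(expr: str) -> ast.expr:
    """Parse an expression and return its tree after constant folding."""
    tree = ast.parse(expr, mode="eval")
    return ConstantFolder().visit(tree).body


folded = fold("2 * 3 + 4")
print(type(folded).__name__, folded.value)
```

The expression `2 * 3 + 4` collapses to the single constant 10 before anything runs, while something like `x + 1` is left alone because its value is not known ahead of time. Nobody calls that cheating either.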

morphine · Thu May 04, 2017 1:09 pm

If Ryzen's OoO unit was optimizing a piece of code from a popular game and made it run 10FPS faster, everyone would be cheering quite loudly.

I.S.T. · Thu May 04, 2017 2:08 pm

shodanshok wrote:
Ryu Connor wrote:
Ordinarily, that kind of automatic optimization would be welcome, but upon further investigation, the CPU-Z team failed to replicate that behavior with Ryzen CPUs in real-world situations. Furthermore, the team says that due to the extreme unlikelihood of that specific sequence of instructions showing up in non-benchmark software, it felt it would be best to revise CPU-Z to reflect real-world results more accurately.

This is the problematic line.

This distinctly reminds me of Bob Colwell and his lecture at Stanford. He regaled the class with a tale from when the Itanium team cheated at SPEC by hand optimizing an instruction loop. Bob called the Itanium team out and told them that if they could show him a compiler reaching the same conclusion about the series of instructions he would rescind his complaint. The Itanium team couldn't meet Bob's request though, it was too early in the project for such a thing.

Unfortunately the non-engineering management in the room didn't understand what was going on and ultimately the situation didn't raise a red flag like it should have.

What Bob's story demonstrates is that we can't just pawn this off as Hanlon's razor. While there's still a possibility that this happens to be a complete coincidence, there's also a possibility that some engineer did in fact make this tweak.

This is not the same thing. SPEC is cheated by the various compilers which greatly alter the output code. These are benchmark-specific optimizations that need to die, as they have zero benefit in real-world programs. For example, libquantum (part of the SPEC suite) is completely broken by Intel's ICC compiler. Some say that the only valid SPEC score is the gcc one, and Itanium always had a very bad showing at it.

Anyway SPEC, being an industry standard benchmark, is extremely attractive, and many CPU/compiler vendors cheat at it...

Yeah, I think IBM is one of the vendors who does that. They own basically the entire stack top to bottom, so they have a gigantic incentive to cheat all to hell and back.

I'm not really sure how one can think a CPU arch can specifically optimize for one set of code. An ASIC or something, yeah, but CPUs have to be as general as possible. If they want to make X faster, their alterations will benefit beyond X.

bhtooefr · Thu May 04, 2017 8:47 pm

just brew it! wrote:
The only downsides I can think of are:

1. Code which depends on timing loops may misbehave. In general timing loops are a bad idea anyway. Execution time of specific instruction sequences is already unpredictable enough on modern architectures (clock speeds vary widely due to power saving modes, and HT tosses another variable into the mix...) to be problematic. It's also inefficient, since that core could be off doing something useful on another thread instead of just looping.

Although, interestingly, in crypto code, AFAIK it's seen as vital that every code path take the same amount of time, to avoid leaking key data through timing attacks. Apparently even subtle timing differences (when attempting to forge key material, for instance) can even be detected over the internet in some cases, and a weak implementation of an algorithm can leak quite a lot of information about the key under attack that way.

Then again, there'll be CPU instructions specifically to ensure that the pipeline is behaving consistently in those cases, at the expense of performance. Benchmark code isn't using those, because the point is to run as fast as possible.
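At the software level, the classic example of this kind of leak is an early-exit comparison. A minimal Python sketch, with `leaky_equal` and `constant_time_equal` as illustrative names; `hmac.compare_digest` is the standard library helper intended for this purpose:

```python
import hmac


def leaky_equal(a: bytes, b: bytes) -> bool:
    """Early-exit compare: running time depends on where the first mismatch is."""
    if len(a) != len(b):
        return False
    for x, y in zip(a, b):
        if x != y:
            return False  # bails out early, which is what leaks timing
    return True


def constant_time_equal(a: bytes, b: bytes) -> bool:
    """Compare using the standard library's timing-attack-resistant helper."""
    return hmac.compare_digest(a, b)


secret = b"correct horse battery staple"
print(leaky_equal(secret, secret), constant_time_equal(secret, b"wrong guess"))
```

Both functions return the same answers; the difference is only in how long they take on a wrong guess, which is the information an attacker is after.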

Thu May 04, 2017 10:56 pm

Seems to me the correct mitigation for stuff like that would be to have the critical crypto code intentionally add delays to its own execution to smooth out (or completely randomize) the execution time. Intentionally slowing down the CPU just because executing certain code paths too fast might leak information to an attacker is a poor solution, since it also affects code for which all we care about is fastest possible execution.

bhtooefr · Thu May 04, 2017 11:43 pm

Well, they are intentionally adding delays and taking measures to ensure that those delays don't get optimized out. (As I understand, a lot of crypto code at that level ends up getting tested on every new CPU design to ensure that each code branch is actually executing at the same speed, and hand-deoptimized assembler for each CPU released.)

Thu May 04, 2017 11:48 pm

At least on systems with a high-resolution timer, you could use that to regulate execution time. Time the slowest code path, then use the timer to delay execution of all code paths to match.
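Roughly what that could look like, as a sketch only: the 5 ms budget, the `run_padded` name and the two stand-in code paths are all assumptions for illustration, and the padding loop itself is not hardened against other side channels.

```python
import time


def run_padded(fn, budget_s: float):
    """Run fn(), then wait until at least budget_s seconds have passed."""
    start = time.perf_counter()
    result = fn()
    # Spin on the high-resolution clock until the deadline; sleep() alone is
    # often too coarse for sub-millisecond budgets.
    while time.perf_counter() - start < budget_s:
        pass
    return result, time.perf_counter() - start


def fast_path():
    return sum(range(10))


def slow_path():
    return sum(range(10_000))


for path in (fast_path, slow_path):
    _, elapsed = run_padded(path, budget_s=0.005)  # hypothetical budget
    print(f"{path.__name__}: {elapsed:.4f}s")
```

The budget has to be at least as long as the slowest path on the slowest supported CPU; if any path overruns it, the timing difference shows through again.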

jensend · Fri May 05, 2017 1:21 am

Ryu Connor wrote:
Waco wrote:
There's literally zero incentive for them to do so, and they run the risk of making particular code paths unstable if it really is a "cheat".

I'm not sure why you're saying that. The Itanium engineers were purposefully cheating to present super rosy results to their bosses.

Yet you say there's zero incentive? Clearly there is an incentive. Clearly it's actually happened in the real world in the past.

You think this is a coincidence? Fine. But I wouldn't use that line of reasoning. This is a terrible hill to choose to die on.

I'm sorry, I know other people have already pointed out some of why this is wrong, but I'm going to throw a red flag and say I don't think you even think that, you're just engaging in BS to get a rise out of people.

The difference between

1. "impatient with early compilers for your new architecture, you translated the source code of the most famous benchmark ever, which affects billions of dollars of purchasing decisions, into tidy machine code for your platform-- in a way no compiler will"

and

2. "you baked paths into your silicon to munge the out of order instruction dispatch to try to game an obscure microbenchmark which as far as anyone here can tell has never once been cited in any trade publication or any reputable review site"

is as wide as the Atlantic Ocean. There's no way you genuinely believe those are comparable. It's even weirder than accusing nVidia of cheating on glxgears.

It's always been true that the correlation between a microbenchmark and real-world performance is fragile; compiler/JIT updates or any number of types of hardware changes may break the correlation without any malice aforethought.

yeeeeman · Fri May 05, 2017 1:39 am

This is not cheating, it is optimization. Since the OoO window is very big inside the CPU, you can check for redundant paths and just skip them.
So really, the only fault here is on the guys that wrote the benchmark, because they created a scenario which can be optimized by a CPU and doesn't really reflect raw performance.
On the other hand, in general usage, you might find this type of situation where the CPU could skip some parts because they are marked as redundant. This means more resources available for the real stuff.
Talking about Ryzen performance now, we don't need CPU-z to tell us that it is a good CPU. It is really a very good CPU and I don't really see any reason why you would buy (at any price point) any Intel CPU over AMD. And don't bring again the 10FPS difference at 100FPS between 7700K and R7 1700 because no one cares. Only the ones with small ... brains

raddude9 · Fri May 05, 2017 2:56 am

Kinda feels like AMD ran over his dog or something...

Chuck's issue is not with AMD, it's with Intel. His love for Intel is so great that he goes to great lengths to sing its virtues and disparage its competitors. I suspect that he has a connection of some sort to Intel (financial or otherwise), but every time he is asked that, or a similar question, he avoids the issue entirely.

Concupiscence · Fri May 05, 2017 1:05 pm

ptsant wrote:
When I got my Ryzen 1700x, I ran my own assembly hand-coded benchmark from the 90s. I have scores from K5, Pentium, K7, Phenom II, Vishera etc and the code still runs as it is under linux. So I thought it would be nice to see how Ryzen does. The benchmark itself is not particularly interesting, because it looks at some very specific low-level sequences and today almost all of these are treated almost optimally and scale with the number of ALUs and CPU frequency.

Anyway, I was struck by the fact that some memory writes scaled faster than any theoretical boundary (faster than DDR4, faster than L3/L2/L1 and even faster than what the processor could theoretically read/write from registers). If I remember correctly, in a specific scenario the score was >400GB/s, which works out to >100B/cycle, which would require an 800-bit bus running at 4GHz. So, not possible.

Comparing with Vishera, and especially the Phenom, I realized that the CPU was probably not even executing these instructions. It correctly deduced that I was rewriting the same thing and simply didn't do it. Now, I don't know how often this occurs in practice, because the code was, as I said, very artificial, but it seemed like a particularly nifty optimization. Not cheating.

The way I see it, the CPU obtained the necessary result (the output). Executing all the instructions is not, in itself, a requirement.

I forgot to ask earlier: is there any way you could post those benchmarks? I'd love to see a historic comparison, even with the broken Ryzen number thrown in there.

Bauxite · Fri May 05, 2017 2:01 pm

The cpu literally runs those instructions faster. If a program uses similar collections of instructions, it will be faster. It does not matter how it does it on the back end if all input/output states for an instruction or series of instructions are valid results, you know, basic computing concepts. Newsflash: x86 has been sub-optimizing for decades, an i7 is not just some really fast 386 cores on smaller circuits, it is very very very different on the inside. x86 is also objectively the worst instruction set mankind has ever made, which often leaves a lot of room to optimize.

Cpu-z programmers chose a limited set of instructions to base their score on, granted they were also used to a very limited selection of cpus where that happened to be a valid approximation. This is no longer the case, welcome to cpu diversity and don't let the logic bus run you over.

Smarter people have been analyzing ryzen and *gasp* it does some things better than current intel arch! UNPOSSIBLE!

There is no cheat, you are just an idiot troll.

Redocbew · Fri May 05, 2017 9:34 pm

bhtooefr wrote:
Well, they are intentionally adding delays and taking measures to ensure that those delays don't get optimized out. (As I understand, a lot of crypto code at that level ends up getting tested on every new CPU design to ensure that each code branch is actually executing at the same speed, and hand-deoptimized assembler for each CPU released.)

Any idea why that's better than randomizing the timing for each path? Seems like that would add even a little more noise, but maybe they don't want to rely on the behavior of the particular RNG in use.

Kougar · Sat May 06, 2017 7:41 am

It's just another case of an arbitrarily written synthetic benchmark breaking when they should've used a real bit of crunching from some program.

Redocbew wrote:
bhtooefr wrote:
Well, they are intentionally adding delays and taking measures to ensure that those delays don't get optimized out. (As I understand, a lot of crypto code at that level ends up getting tested on every new CPU design to ensure that each code branch is actually executing at the same speed, and hand-deoptimized assembler for each CPU released.)

Any idea why that's better than randomizing the timing for each path? Seems like that would add even a little more noise, but maybe they don't want to rely on the behavior of the particular RNG in use.

Too much variance in the results, especially across 16 or more spawned threads?? It's supposed to be a fine-grained benchmark and RNG can be pretty brutal.

Sat May 06, 2017 7:57 am

Redocbew wrote:
bhtooefr wrote:
Well, they are intentionally adding delays and taking measures to ensure that those delays don't get optimized out. (As I understand, a lot of crypto code at that level ends up getting tested on every new CPU design to ensure that each code branch is actually executing at the same speed, and hand-deoptimized assembler for each CPU released.)

Any idea why that's better than randomizing the timing for each path? Seems like that would add even a little more noise, but maybe they don't want to rely on the behavior of the particular RNG in use.

Even if you add a random delay, patterns can still be detected with enough data points, because the random delays will average out. Since this type of crypto attack is statistical to begin with, the attacker probably doesn't even need to change what they're doing; they just need to wait a little longer.
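A small simulation shows the averaging effect. Everything here is made up for illustration (path costs of 1.00 ms and 1.05 ms, a uniform random delay of up to 1 ms, 20,000 samples per path); it is a sketch of the statistics, not of any real attack.

```python
import random


def observe(base_ms: float, n: int, rng: random.Random):
    """Simulated timings: fixed path cost plus a uniform random delay of 0-1 ms."""
    return [base_ms + rng.uniform(0.0, 1.0) for _ in range(n)]


rng = random.Random(1234)
n = 20_000
path_a = observe(1.00, n, rng)  # hypothetical path cost in ms
path_b = observe(1.05, n, rng)  # 0.05 ms slower, far below the 1 ms jitter

# Averaging many samples cancels the random part and leaves the fixed gap.
estimated_gap = sum(path_b) / n - sum(path_a) / n
print(f"estimated gap: {estimated_gap:.3f} ms")
```

Any single measurement is swamped by the jitter, but the difference of the means lands close to the true 0.05 ms gap, and collecting more samples only tightens the estimate.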

derFunkenstein · Mon May 15, 2017 8:14 am

morphine wrote:
If Ryzen's OoO unit was optimizing a piece of code from a popular game and made it run 10FPS faster, everyone would be cheering quite loudly.

Great, you just gave AMD an idea. Now we're going to start getting 400MB CPU driver installers with optimization code paths. :lol:

Looks like RyZen got caught cheating on benchmarks