AMD confirms Linux “performance marginality problem” on Ryzen

The abundance of CPU cores and threads in AMD's Ryzen processors, combined with Linux's ability to use them efficiently, seems like a match made in heaven. Even the strongest marriages have their ups and downs, though, and Michael Larabel at Phoronix uncovered a consistent sore point between Ryzen and Linux. The site's compilation test consistently triggers segmentation faults (commonly called "segfaults," which crash the affected application) on Ryzen CPUs. Larabel contacted AMD, and the company's engineers confirmed the issue, describing it as "a performance marginality problem exclusive to certain workloads on Linux." AMD's engineers went on to explain that the problem is unique to Ryzen desktop chips and is not present in the upcoming Ryzen Threadripper high-end desktop CPU line or in the Epyc server processors.

The scenarios in the torture test suite seem unlikely to be reproduced in most real-world situations, and isolating the crashes apparently wasn't straightforward. Larabel's initial testing showed Ryzen CPUs segfaulting during the compilation portion of the script, approximately 85 seconds after it started. Phoronix readers then reported segfaults on non-AMD CPUs in a different portion of the script, leading to the hypothesis that the problem lay with the test itself. Larabel updated the script and reported that the Clang crashes are indeed unique to Ryzen systems. He then reached out to AMD, whose engineers confirmed the issue and said that the problem is not confined to any particular motherboard vendor.
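For readers who want a feel for how a test like this detects crashes, here is a minimal sketch in Python. It is not the Phoronix or Gentoo script: the function names, the job counts, and the placeholder workload are all assumptions made for illustration. It runs many copies of a command in parallel and counts how many were killed by SIGSEGV, which Python's `subprocess` module reports on POSIX systems as a negative return code.

```python
import signal
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def died_from_segfault(returncode):
    """On POSIX, a child killed by signal N is reported as return code -N."""
    return returncode == -signal.SIGSEGV


def run_once(cmd):
    """Run one job, discard its output, and return its exit status."""
    result = subprocess.run(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return result.returncode


def count_segfaults(cmd, jobs, rounds):
    """Run jobs * rounds copies of cmd, at most `jobs` at a time,
    and count how many of them died from a segmentation fault."""
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        codes = list(pool.map(run_once, [cmd] * (jobs * rounds)))
    return sum(1 for code in codes if died_from_segfault(code))


if __name__ == "__main__":
    # Hypothetical placeholder workload: swap in a real compile command,
    # and set `jobs` to the number of hardware threads to load every core.
    workload = [sys.executable, "-c", "pass"]
    print("segfaults:", count_segfaults(workload, jobs=4, rounds=2))
```

A real stress test would substitute a heavy compile job for the placeholder and keep it running for minutes rather than seconds, since the reported crashes only appeared under sustained load.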

Research into the problem seems to be at a preliminary stage, and it remains unclear whether the issue extends beyond Linux to other Unix-like operating systems such as FreeBSD. Ryzen CPU owners whose work has been affected by the problem can contact AMD Customer Care. According to Phoronix, AMD will add more rigorous testing and QA under Linux when developing future products. The silicon manufacturer says its upcoming Threadripper and Epyc CPUs aren't affected and that Ryzen desktop chips do not exhibit the same problem in a Windows environment.

Responses to “AMD confirms Linux ‘performance marginality problem’ on Ryzen”

  1. Proof of correctness comes from running a few short, targeted tests rather than tens of thousands of applications. If AMD's verification engineers failed to create such tests, it may be that the design is messy and unpredictable.

  2. The DragonflyBSD main dev filed a bug report to AMD in April. The FreeBSD community have developed their own test binary to trigger the bug.

    Haven’t seen a thing about this from the NetBSD, OpenBSD or OSX86 (Hackintosh) ppl.

  3. When I used FreeBSD and OpenBSD, they were source based too. Has there been much feedback there to triangulate a cause?

  4. AMD doesn’t care about Windows gaming buyers so that’s why they have a seg fault when compiling in Linux… yeah, sure.

  5. And…I spoke too soon. I was building clang this morning and encountered a gcc segfault. Time to re-open my ticket with AMD.

  6. I hope AMD will learn from the past fglrx driver issues and this current Ryzen issue, and start treating Linux users as a major group. Almost 99% of the top 500 supercomputers run Linux, and of course Linux support was a must for TR and Epyc. Ryzen is marketed as if it were only for gaming. What about the “game programmers” who need to use dozens of threads to compile?

  7. I can see this becoming the Tech Report version of a snipe hunt. Gerbils unaware ask AMDisDEC “what? that makes no sense” only to find themselves in a big pile of crazy while everyone else just plays along until they figure it out.

  8. Sorry grasshopper.
    I would think you’d know about HPC, their love for Linux and their long testing of TR and Epyc.
    Can’t wait until AMD announces the 32C.

  9. I have been compiling a huge amount of code without any difficulties under linux. It never occurred to me to use all 16 threads for compilation. I get best results (measured empirically) in most workloads by using 12-14 threads, so I just run compiles with 12 threads, which does not trigger the bug.

    It seems this is, again, tied to the “IRET” instruction. It is almost certainly fixable via microcode.

  10. Say what?

    If that were true TR and epyc would also be affected and they are not.

    Do you always make shit up to support your weak arguments?

  11. One thing I’m curious about is the kernel upgrades between W7 and W10.

    In your experience, is there any difference in performance w/Ryzen on 7 vs 10?

    Also, “crunching renders and simulations”? Are your preferred software packages not available for Linux (e.g. RHEL/SLES/Ubuntu LTS)?

  12. That’s probably reading too far into their description of the problem. Realistically, it’d be crazy for AMD to not attempt to downplay the significance of the bug, and you can be sure that whatever the original wording was from the engineers it’s since been poked and prodded by marketing and legal until it was deemed fit for public consumption. Highly detailed information on problems such as this usually doesn’t survive that process very well.

  13. Bottom line is, AMD cares less about full testing for the tiny Windows gaming buyer than the large and more demanding server market.
    No surprise.

  14. Not to mention that I’d give him another -3 if possible for his butchering of the word ‘effected’.

  15. Yawn. Every new architecture has errata. That’s what microcode updates are for. Some are severe or at least widely reported (Pentium math bug, TLB, etc), but EVERY new processor has some issues. Most happen in the corneriest of corner cases (like this one). It’ll get fixed with a BIOS update and a new stepping I’m sure.

  16. What about LTSB? If you’re licensed for Enterprise, you’re likely licensed for LTSB as well, and that’s the version you’re supposed to use if you need a stable system image.

  17. And there I go, thinking the phrase “performance marginality problem” would appear nowhere in my life beyond my annual performance review…

  18. Epyc is not effected because you simply cannot sell a buggy server chip that has problems with Linux.
    AMD did the tests but ignored them for desktop chips that will end up in OEM systems bundled to force users to use Windows.
    Obviously, they use different microcoding.
    No big problem though. AMD will fix the problem and move on.
    I expect share prices to increase another 60% by Xmas.

  19. I had this problem for a while, though it was fixed with the Aegis update. I run Gentoo and had to knock the make job limit back to 8 for most to complete and sometimes down to 2 for things like web-gtk or mesa to make it through, and sometimes I’d have to retry a couple times even then. I recompiled everything after the Aegis update and didn’t have one segfault. I wanted to confirm it was fixed before I closed the support ticket with AMD. I haven’t tried the synthetic test case yet, so it will be interesting if it still has segfaults under those conditions.

  20. Hmm actually maybe what they mean by “performance marginality problem” is that it’s an issue with the performance margins on some (many?) transistors/lines/whatever in the design being too small, resulting in flipped bits, timing/noise issues or whatever.

    That does kinda match some of the symptoms.
    Although iirc at least some people underclocked and overvolted various components to no avail, meaning that it was still necessary to disable the uop-cache and/or smt and/or aslr (linux security feature) to make the cpu somewhat usable.

  21. Even if this magically works with Threadripper and Epyc due to timing differences, at this stage I still don’t trust the Zeppelin core as far as I can throw it. Especially as compile jobs (including building GCC and Clang) are things I quite often do.

  22. EDIT: Okay, it’s a different issue. “[…]is not related to the recently talked about FreeBSD guard page issue attributed to Ryzen”

    I’m confused as to whether this is a new issue, or just the linux version of the issue that was found in March, i.e [url<][/url<] and [url<][/url<]

  23. I think if it were a clockspeed issue it should be easy to determine that, because there are Ryzen models at various speeds (and core counts). Also, Threadripper models can be clocked as high as Ryzen.

    Very strange that it (apparently) does not affect Threadripper.

  24. So far I’ve been exceptionally impressed with Ryzen. We’re crunching renders and simulations with them 24/7 and the Ryzen 1700 nodes at 65W are slightly (but noticeably) quicker than 140W hexa-core i7-6850K nodes costing twice as much, too. I’d like to think we’ll get a stack of threadrippers too, but I suspect we’re done with farm upgrades for 2017.

    My biggest complaint is not with AMD, but with Microsoft and their ridiculous update lockout for Windows 7 and 8.1, both of which are still in support.

    [i<]In case anyone is wondering why we're still using old W7 on a render farm, I can say with confidence that even carefully locked-down W10 Enterprise using 3rd party tools and multiple AD Group Policy locks is [b<]far from ready.[/b<] We've tried and failed. Then the software vendors tried and failed. W10 is just not there yet in terms of fully-controllable behaviour. When you install it, Microsoft control the PC and keep retaking control of the PC if you think you've temporarily gained control through all the patches, scripts, registry and policy edits.[/i<]

  25. Fully agreed. AMD is ridiculous.
    The Linux community is very large, and many do long C++ compilation sessions for professional reasons. I have some doubts that Epyc is without fault because the die is the same; my suspicion is that this issue shows up when the CPU is pushed to the max at high clock speeds.
    The Epyc line has low clock settings and the die is not stressed at all.

    The problem is that many pro users purchase a Ryzen to do the job at home, loading their code in server boxes at work. Too bad for AMD they will lose these customers. Good for Intel :), their 8 core Skylake X is cheap enough and fast too……still reliable.

  26. Epyc is not afflicted by the issue because it was not tested enough. The die is the same, same process, same stepping.
    Right now some users are mad at AMD because their long compilation sessions under Linux end in this failure or in data corruption.

  27. The potential for silent bugs and flawed results is certainly the biggest problem, IMHO.

  28. It’s entirely possible there is a race condition that’s simply impossible to hit with multiple dies as well.

    At least, I hope it’s something that simple. Hard to track down but generally easy to fix…

  29. Just to play devil’s advocate…

    The NUMA nature of the TR/Epyc architecture might make the DRAM subsystem more of a bottleneck. This could mitigate the segfault issue by putting less stress on the CPU cores, since they spend a bit more time waiting on DRAM access (when the memory location in question is managed by the other die).

  30. The test involves running a workload which is quite common for Linux developers, and which runs without incident on Intel and earlier AMD processors. This points to a potential design or QA issue with the CPU or platform.

    Threadripper/Epyc is a different platform, and there will also be timing differences since it is a NUMA architecture (with half of the memory channels hanging off of each die).

    It would be interesting to take a Threadripper, and use only one die — disable the cores on the 2nd die, and populate only the memory channels connected to the die with active cores. Will we see this issue or not?

  31. Thank God for review sites that run Linux tests.
    AMD probably missed it for Ryzen, focusing on Windows, but ensured server class would definitely run Linux stably.
    If you test Threadripper without running Linux then you are so wrong.

    AMD stock price is up 130%, and climbing!

  32. I’m no expert but.. So if you’re using just one Zeppelin die as you do with desktop chips you encounter the problem, but if you gang two of those dies as you do with Threadripper, the issue goes away. Curious. I have to agree the problem lies with the test itself.

  33. But it’s not just that it runs for 85 seconds, or Handbrake would be useless, Adobe sw would do the same thing and any 3D software renderer would show the same issues. So yes, it certainly is bad for Linux users: they are stuck with systems that pretty much require them to use an OS they went out of their way not to use. That sucks. But it’s not as simple as “AMD never tested all the threads running for 85 seconds.”
    If they restrict compiler use to 6 of the 8 cores does it still happen?

  34. Although there are minor changes to the manufacturing process/quality all the time, most of them not qualifying for a stepping change.
    E.g. overclockers often look for chips before or after a particular week of production.

  35. All that would be entirely reasonable and easily messaged by including a sentence to that effect in their PR response. The fact that they instead were talking about future QA and other product lines is what raised my blood pressure. The correct PR response would be much briefer and centered on “we will make this right.”

  36. I don’t know that their attitude is that bad. At this point they don’t know what the exact problem is. They do know that it takes a somewhat heavy load to cause it to happen so not everyone will run into it. They are probably hoping to have an answer quickly that results in them providing a fix that works for existing chips. If they can’t fix existing chips then they will need to step up and offer to replace chips that have the bug. No one wants to treat this as a “world is on fire bug” when there may be a fix coming shortly that works for all chips.

  37. I agree I am not finding this article reassuring. For a casual reader what I’m taking away is “our chips are fine as long as you don’t use all the cores for 85 seconds or more”, with that type of workload of course being exactly why I’d want these chips in the first place.

    Of course bugs happen and that’s nothing new to me. So where I’m really not reassured is the lack of “you will be receiving a fix or new chip shortly.” It sounds more like their attitude is “you will learn to like it”, which does not strike me as remotely acceptable and unfortunately lends credence to what I had heretofore considered BS Intel statements about “we’re worth the hundreds more because we’re mature.”

    And to whoever in their PR department specifically advised that “QA would be increased on future chips – and other product lines do not have this issue” – how is that supposed to be anything other than a middle finger to the purchaser of the chips with the issue?

  38. The Gentoo people noticed it because they are a source only distro. Basically, you compile *everything* that runs on your system from source code. So, their users run the compiler a lot. They are prime candidates for finding a bug like this.

    The down side is that people who like to build all of their programs from scratch are considered a little nutty, so they aren’t always taken seriously when they report problems. Lots of time it’s problems with their hardware (bad PSU, CPU pins bent, bad memory, wrong BIOS settings, etc.) and not a true fault in the processor. You have to go through a lot of steps to rule out pilot error before you can solidly point the finger at the processor having a bug.

    The Skylake bug that the Prime95 community found was a lot quicker to escalate because it has a well populated forum of very knowledgeable people who have been beating on CPUs for decades and know how to quickly diagnose hardware issues. Heck, Prime95 is commonly used as a burn in tool. Want to see if your PSU is good? Cooling system? Memory config? Run Prime95 for a day.

    So, it comes down to experience of the community and credibility.

  39. Michael doesn’t have a bunch of Ryzen, Threadripper, or Epyc chips to test on–he only has what he has bought. But, now that AMD has picked up on the case from him, they’ve promised some of all three. He’s going to be swimming in Zen chips soon.

  40. I was an early adopter of Sandy Bridge and the defective Intel Cougar Point (P67) chipset. Eventually, Intel footed the bill for a motherboard recall. Asus handled the motherboard replacement program unusually well, given their usual indifference to customer issues.

  41. It looks like all ryzen and threadripper chips are b1 stepping, while epyc is actually b2.

    Somebody in the original amd-forum-thread pointed these out:
    [url<][/url<] [url<][/url<] [url<][/url<] [url<][/url<] amd forum thread: [url<][/url<]

  42. Woah worst PR speak I’ve seen in ages… they’re basically trying to make you think:
    1. performance – it’s just about performance, doesn’t affect the stability or safety of your data, nor the security of your system, or only happens under extreme performance requirements
    2. marginality – very rare and insignificant problem, wouldn’t ever happen to you, pinky promise
    3. problem – not really a nasty bug in the chip they knew about early on, and decided to ignore because of all the chips they already started mass-producing and/or sold (rushed launch)

    Except reality seems to be quite contrary to that, based on my impressions from:
    [url<][/url<] It mostly hit normal gentoo users. So sure you need to compile a lot of stuff using all of the threads available... except if you're a dev it's what you do all day long. It's kinda weird the way they overreacted with the tlb-bug, and how now they're completely under-reacting and trying to shovel this under a carpet. If you're a developer Ryzen is useless, the compilation crashing isn't really even that bad, the unknown chance that it doesn't crash and produces an invalid binary is much more terrifying. But they provided no meaningful info about what the actual underlying issue is so who the hell knows what could or couldn't happen...

  43. This is how I see it:

    Assuming that AMD is doing the exact same type of stress tests on EPYC/ThreadRipper that fail on Ryzen but do not fail on EPYC/ThreadRipper, they can state with confidence that “the issue isn’t present in TRs and EPYCs”.

    This does not necessarily imply that they have found/know the root cause for the Ryzen failures.

    Note that I’m only pointing out that there is no confirmation from AMD that the root cause is known; for all I know, your speculation may be spot on.

  44. Hm that’s actually a good remark. Thanks for the heads-up, we cleared the lede up a bit.

  45. I imagine they were getting pretty pissed off haha. Good work on their part bringing it to the attention of AMD. It really needed to be publicly acknowledged.

  46. [quote<]"Though one area being explored now as well is the Clang segmentation faults shown in the original article, not originating from conftest as well as Clang being able to yield the system hanging hard where the system is unresponsive and SSH is not working." source: [url<][/url<] [/quote<] However, I can't seem to find any actual references to the Clang tests in the "original article"?

  47. Once AMD fix their Usability “marginality problem” I am looking at getting one of these puppies. Glad to see the problem has been acknowledged, hopefully leading to a speedy fix. I have held off being an early adopter as buying a new PC is always a big investment. Looking forward to building with AMD again.

  48. I’m thinking the lower-end, lower core-count SKUs are less likely to be affected, though, if that’s the case? Interesting thought, though.

  49. …or it is at least partially related to power delivery, and the bigger sockets mitigate this contributing factor.

  50. Yet AMD says the issue isn’t present in the Threadrippers or EPYCs. So, either, the full memory hierarchy being enabled eliminates the issue, they have fixed it in a new stepping, or the full memory hierarchy being enabled merely masks the issue or reduces its frequency to the point that AMD can’t reproduce it any more.

  51. No, but I remember the 1.13GHz Pentium 3 being far worse off.

    [url<],219-3.html[/url<] [url<],221.html[/url<]

  52. You’d think that if it were just microcode, that microcode update would already exist for desktop SKUs, though. So, if microcode is involved, then it’s only part of the fix.

    Although, wait. The memory hierarchy is going to be fused differently, did they accidentally fuse off too much hardware in the Ryzen desktop parts?

  53. Yeah, I was confused by this also. The problem is with a single die, but when you glue 2-4 dies into a package the problem goes away?

  54. Stepping, microcode, some combination of the above?

    I did notice in one of the threadripper SKU tables earlier (the one about the 140w model leaked by mobo vendors) that all of the threadripper SKUs were based on a B stepping core.

  55. The stress test is GCC related but there have been reports about CLANG being capable of doing the same thing.

    It’s not that compilers are magically able to cause the bug per-se but that they tend to have the “correct” types of instruction + memory access patterns to trigger the bug in a reliable way.

  56. Nit-pick: The stress test was using gcc for the compilations, not clang. AFAIK running large parallel compilations with gcc was also the first use case that exposed this issue, some months ago.

  57. Remember when Tom’s Hardware found that the first release of the 1GHz Pentium 3 could not compile a Linux kernel? Good times….

  58. This issue has been floating around since RyZen’s launch and there’s a thread about the latest developments in TR’s forums [url=<]too[/url<]. While this issue has been around for about 5 months now, the real key to getting it recognized was that some Gentoo guys put together a GCC test script that automated the process so you can start throwing segfaults pretty quickly. That automation and the additional publicity finally got the attention of AMD.

  59. Always hard to ferret out the final few high priority bugs when rolling out a new architecture. It’s simply impossible for a single company to accomplish the same breadth and depth of torture testing as the thousands of software developers and early adopters will do as part of their own testing.

  60. Wait a sec, don’t Ryzen desktop chips, Threadripper, and Epyc all use the same die?

    So is it a stepping issue, then?