AMD readies a fix for Ryzen FMA3 bug

The first major bug affecting AMD's new Ryzen processors was found a couple of weeks ago by folks using the open-source processor benchmark Flops. In short, certain sequences of code using instructions from the FMA3 extension to the x86-64 ISA can cause Ryzen machines to hang irrevocably. That means that even software running inside a VM could force a hang-and-restart on a Ryzen host. Fortunately, just days after the discovery was made, AMD confirmed to Digital Trends that it has already identified the issue and that it's preparing firmware updates to distribute to motherboard vendors.

Originally created by Alexander "Mystical" Yee for his own purposes, Flops first appeared on the web as a response to a question at Stack Overflow. In the creator's own words, the app seeks to "get as many FLOPS as possible from an x64 processor." The software's first version was targeted at Intel Sandy Bridge processors. Mystical eventually updated the app with additional builds targeting Core 2, Bulldozer, Piledriver, and Haswell CPUs.

It's the Haswell build that gives Ryzen trouble, as it makes heavy use of FMA3 instructions to extract maximum parallelism from the CPU cores. Loading up the Haswell build of Flops v2 on a Ryzen machine will quickly make it hang. Folks on the HWBOT forums leaped into action testing Flops in other scenarios to confirm that the issue is related to the Zen core. As it turns out, all current Ryzen processors and motherboards are affected.
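
For context on what an FMA instruction actually computes: a fused multiply-add evaluates a*b + c with a single rounding at the end, rather than rounding once after the multiply and again after the add. The short Python sketch below emulates that behavior with exact rational arithmetic. It is only an illustration of the arithmetic, not code from Flops, and it executes no FMA3 instructions, so it has nothing to do with triggering the hang. The function names are made up for this example.

```python
from fractions import Fraction


def unfused_multiply_add(a, b, c):
    """Round after the multiply, then again after the add (two roundings)."""
    return a * b + c


def fused_multiply_add(a, b, c):
    """Emulate FMA: compute a*b + c exactly, then round once to a double."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


# Inputs chosen so the exact product, 1 - 2**-60, is too close to 1.0
# to survive being rounded to a double on its own.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

print(unfused_multiply_add(a, b, c))  # the tiny term is lost when a*b is rounded
print(fused_multiply_add(a, b, c))    # the tiny term survives the single rounding
```

Besides the extra accuracy, hardware FMA does the multiply and add in one instruction, which is why a benchmark chasing peak FLOPS leans on it so heavily.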

It's good news that AMD already has a fix coming, although it's worth noting that the likelihood of encountering this bug in the wild is vanishingly small. Not only is there little software using FMA instructions (after all, they're only supported on Haswell and later Intel processors and on AMD chips since Piledriver), but the hang is also triggered only by a specific series of instructions. Other applications using FMA instructions, including Prime95 and Y-Cruncher, aren't affected by the bug. Still, if you have a Ryzen machine, best go ahead and install those BIOS updates when they arrive.
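
For readers on Linux who want to see whether their CPU advertises FMA support and which microcode revision is currently loaded, here is a minimal sketch. It assumes an x86 Linux system, where /proc/cpuinfo lists `fma` for FMA3 and `fma4` for FMA4 in its `flags` line and reports the loaded revision on a `microcode` line. The helper names `parse_cpuinfo` and `summarize` are hypothetical, and the sketch makes no claim about which revision contains AMD's fix.

```python
from pathlib import Path


def parse_cpuinfo(text):
    """Return (flags, microcode) for the first logical CPU listed in the text."""
    flags = set()
    microcode = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between CPUs
        key = key.strip()
        if key == "flags" and not flags:
            flags = set(value.split())
        elif key == "microcode" and microcode is None:
            microcode = value.strip()
    return flags, microcode


def summarize(text):
    """Report FMA3/FMA4 support and the loaded microcode revision, if listed."""
    flags, microcode = parse_cpuinfo(text)
    return {
        "fma3": "fma" in flags,   # the kernel names the FMA3 feature plain "fma"
        "fma4": "fma4" in flags,
        "microcode": microcode,
    }


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        print(summarize(cpuinfo.read_text()))
    else:
        print("No /proc/cpuinfo found; this sketch only applies to Linux.")
```

Comparing the reported revision before and after a BIOS update is one way to confirm that new microcode was actually applied.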

Comments closed
    • slaimus
    • 3 years ago

    The HWBOT forum reports it is already fixed in latest BIOS:
    http://forum.hwbot.org/showthread.php?t=167605&page=5

    • ronch
    • 3 years ago

    This is exactly why I almost never am an early adopter. Let them iron out all the kinks before you buy. Let them improve what else can be improved and tweaked. You’ll get a (much?) better product and it’ll probably be cheaper too.

    Looking forward to a very solid, very reliable, extensively tweaked Zen 3.0 or 4.0 purchase when I’m finally ready to give up my current setup!

    • just brew it!
    • 3 years ago

    It’s really quite amusing the sort of bugs that make it into production silicon. As an off-and-on embedded developer for most of my career, I’ve hit a few of them over the years. When you’re working that close to the metal it can be very difficult to tell whether you’re seeing a bug in your code or a bug in the hardware.

    The “best” one I’ve seen was probably in the Intel i860, which had the interesting “feature” that once in a blue moon the first instruction of an exception handler would get skipped… or executed twice. Kicker is that (unless you knew about the bug) the natural thing to do was to push a register on the stack as the first instruction of the exception handler; so the consequence of hitting the bug was to f**k up the stack (resulting in a crash). Random, non-deterministic crashes FTW!

      • kuttan
      • 3 years ago

      As complexity increases, the chance of bugs increases as well.

      • Shouefref
      • 3 years ago

      Those things have become so complicated … On the other hand: it has always been like that, right from the beginning. Underinvestment. We pay the price as early adopters.

    • Tristan
    • 3 years ago

    First bug corrected 🙂

    • Rza79
    • 3 years ago

    Let’s not forget that software that has been correctly compiled for AMD will use FMA4 instructions on AMD processors and not FMA3.

      • ronch
      • 3 years ago

      Thing is though, Ryzen dropped support for FMA4.

        • Rza79
        • 3 years ago

        You’re right. Didn’t know that.

    • synthtel2
    • 3 years ago

    Wait, doesn’t this sound like a microcode fix, not a mobo firmware fix? Does Ryzen load microcode from the mobo instead of further along?

      • Geonerd
      • 3 years ago

      AFAIK, most modern CPUs will happily accept patches from BIOS, the OS, drivers, or an executable run by the operator. I suspect that this microcode stays resident until specifically overwritten or disabled. (Or maybe the patch is volatile, and needs to be re-loaded at each boot event?)

        • Rikki-Tikki-Tavi
        • 3 years ago

        For security purposes, I would assume that the microcode in the chip stays the same, and needs to be reloaded from motherboard each reboot for the fix to stay active. Otherwise it would be possible to place malware in CPUs.

          • synthtel2
          • 3 years ago

          The way I’m used to is telling the bootloader to handle it (it happens after starting the kernel but before anything else). It does have to be done on every boot. See the Arch docs on it (https://wiki.archlinux.org/index.php/Microcode) for more detail.

            • Concupiscence
            • 3 years ago

            Yep, and different distributions handle it differently. Slackware contains updated firmware/microcode and applies it automatically unless you tell it not to; Ubuntu considers it optional code that has to be manually enabled with the Additional Drivers tool.

    • Meadows
    • 3 years ago

    “leaped”?

      • RAGEPRO
      • 3 years ago

      http://www.writersdigest.com/online-editor/leaped-or-leapt

      • derFunkenstein
      • 3 years ago

      Yep. http://grammarist.com/spelling/leaped-leapt/ edit: lulz, Zak beat me to it.

        • UberGerbil
        • 3 years ago

        Your link is better, though, noting the UK/NA split. You could say it… dealed with it.

      • Redocbew
      • 3 years ago

      I’m not sure if someone can “flop” into action.

      • flip-mode
      • 3 years ago

      The reaper reaped and heaped bodies upon the heap and when the reaping was done the reaper leaped.

    • UberGerbil
    • 3 years ago

    I still have a 90MHz Pentium with the FDIV bug. It’s a collector’s item now. Right? Right?

      • srg86
      • 3 years ago

      I think the 80486 had a Floating Point bug in 1989 as well.

        • swaaye
        • 3 years ago

        As did some P6 processors.

        Some bugs just don’t get popular media coverage though.

      • Takeshi7
      • 3 years ago

      Think of how many billions Intel could have saved if bugs in CPUs could be fixed in firmware back then.

        • bhtooefr
        • 3 years ago

        Nowadays, they often *can* be. First off, if it's a feature implemented in microcode, you can trivially update the microcode to fix it in modern CPUs (which support updating the microcode at runtime), either in the BIOS or in Windows. As far as I'm aware, the Pentium bug was an issue with a lookup table not being programmed correctly, so it might be possible that a similar modern CPU would have that table implemented in microcode (or loaded from microcode)? Second, "chicken bits" are another way to go - if a designer is worried that a feature might not work properly, they can add hardware to allow microcode to disable that feature. The instruction can then be reimplemented in microcode in a safer, but slower way. https://media.ccc.de/v/32c3-7171-when_hardware_must_just_work is an interesting watch, by the way.

      • chuckula
      • 3 years ago

      I’ll give you $19.98888888893818513515 for it.

      • willmore
      • 3 years ago

      I’ll raise you a non-double sigma 386DX.

      • dpaus
      • 3 years ago

      Amateurs. A Via 386DX running DR DOS with an MDA card and a 14″ amber monitor that mostly-sorta-kinda works OK. Mostly.

      • _ppi
      • 3 years ago

      It was 60 and 66MHz Pentiums with this bug.

        • Concupiscence
        • 3 years ago

        Good ol’ Socket 4, yes indeed. Socket 5 and 7 chips had their own quirks (like F00F), but FDIV was unique to the first run of Pentiums.

        edit: Per bhtooefr, my optimism was not warranted. Whoops!

          • bhtooefr
          • 3 years ago

          Not so – the FDIV bug affected steppings B1 and B3 of the P54C, too, which were 75, 90, and 100 MHz.

            • UberGerbil
            • 3 years ago

            Yep. My 90MHz (https://en.wikipedia.org/wiki/Pentium_FDIV_bug#Affected_models) definitely had it. And still does. I remember doing the Excel calculation check, and then doing it again after the software patches "fixed" it.

            • Concupiscence
            • 3 years ago

            Eww… Pretty sure my first real desktop PC was affected, then. Happy to be told I’m wrong when I am!

        • UberGerbil
        • 3 years ago

        Nope, not just those (https://en.wikipedia.org/wiki/Pentium_FDIV_bug#Affected_models).

    • Leader952
    • 3 years ago

    So I take it that the fix is a microcode patch that gets put into the BIOS and loaded into the CPU on boot.

    Any word if the fix impacts performance?

    We all remember the TLB bug in the Phenom.

    https://en.wikipedia.org/wiki/AMD_Phenom
    "Before Phenom's original release a flaw was discovered in the translation lookaside buffer (TLB) that could cause a system lock-up in rare circumstances; Phenom processors up to and including stepping 'B2' and 'BA' are affected by this bug. BIOS and software workarounds disable the TLB, and typically incur a performance penalty of at least 10%"

      • UberGerbil
      • 3 years ago

      Well, presumably this fix is only going to affect code using FMA (and possibly some other wide AVX ops if there’s a more general problem that needs addressing), whereas the TLB is used all the time by essentially all memory operations.

    • Krogoth
    • 3 years ago

    FYI, Skylake CPUs had a similar bug with AVX under certain workloads too.

    https://techreport.com/news/29585/prime95-can-cause-intel-skylake-cpus-to-freeze

      • chuckula
      • 3 years ago

      Interestingly the Skylake issue is almost the diametric opposite of this issue.
      Skylake encountered a bug when *not* using FMA3 in Prime95 but instead falling back to older AVX1 codepaths under certain specific conditions.

    • Concupiscence
    • 3 years ago

    I keep telling people, if you can afford to wait three to six months for a new platform’s quirks to be identified and remedied, you’ll have a much better time.

      • swaaye
      • 3 years ago

      Yeah for some reason it seems like people tend to think the newest thing can only be flawless. Every time.

      At least there has yet to be any performance-impacting erratum found.

      • blastdoor
      • 3 years ago

      Agreed. I’m excited about Ryzen, but waiting for the real world experience to roll in.

        • dodozoid
        • 3 years ago

        exactly

      • Thrashdog
      • 3 years ago

      As true in computing as anywhere else. I have a Miata that was built within the first few months of that particular version’s introduction, and last year I had to spend several hours hunched over the engine, cutting off the old coil pack connectors and soldering on new ones because the first wiring harnesses they made were slightly too short, and thus prone to breaking wires over time. They fixed that issue in literally the following month of production.

      • ronch
      • 3 years ago

      Totally agree.
