Prime95 can cause Intel Skylake CPUs to freeze

'Tis a peaceful night after Christmas, and the ghosts of FDIV and TSX have come once again to haunt Intel. The folks at the Great Internet Mersenne Prime Search (who make Prime95) have come across a bug in Intel Skylake CPUs that causes affected systems to freeze.

The bug manifests when a certain calculation is performed. Prime95 uses the fast Fourier transform to multiply very large numbers, and at least one particular exponent—14,942,209—causes Skylake CPUs to choke. Over- or underclocking the CPU doesn't have any effect on whether the bug occurs.
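
As a rough illustration of what FFT-based multiplication means, here is a minimal Python sketch. It is not Prime95's code, which is heavily optimized and differs in detail. The names `fft` and `fft_multiply` are made up for this example, and it works on base-10 digits purely for readability.

```python
import cmath


def fft(values, invert=False):
    """Recursive radix-2 FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], invert)
    odd = fft(values[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def fft_multiply(a, b):
    """Multiply two non-negative integers by convolving their decimal digits."""
    digits_a = [int(d) for d in str(a)][::-1]  # least significant digit first
    digits_b = [int(d) for d in str(b)][::-1]
    size = 1
    while size < len(digits_a) + len(digits_b):
        size *= 2
    fa = fft(digits_a + [0] * (size - len(digits_a)))
    fb = fft(digits_b + [0] * (size - len(digits_b)))
    # Pointwise product in the frequency domain equals convolution of the digits.
    product = fft([x * y for x, y in zip(fa, fb)], invert=True)
    # The inverse transform needs a 1/size scale; rounding removes float noise.
    coeffs = [round(c.real / size) for c in product]
    # Python integers are arbitrary precision, so the carries resolve here.
    return sum(c * 10**i for i, c in enumerate(coeffs))
```

The rounding step is where floating-point trouble would show up: if the hardware returns a slightly wrong value, the rounded coefficient is wrong and so is the product.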

Intel was quick to track down and identify the issue, and says it's working with motherboard vendors to issue BIOS updates as a workaround:

"Intel has identified an issue that potentially affects the 6th Gen Intel® Core™ family of products. This issue only occurs under certain complex workload conditions, like those that may be encountered when running applications like Prime95. In those cases, the processor may hang or cause unpredictable system behavior. Intel has identified and released a fix and is working with external business partners to get the fix deployed through BIOS."

The firm recommends that users contact system manufacturers or motherboard makers for updates.

Comments closed
    • anotherengineer
    • 4 years ago

    Well at least it wasn’t another chipset bug.
    http://www.anandtech.com/show/4143/the-source-of-intels-cougar-point-sata-bug

    • sparkman
    • 4 years ago

    I’m curious about how Intel can fix this kind of bug after the fact.

    Presumably it is a bug in the silicon, correct? Silicon can’t be changed.

      • chuckula
      • 4 years ago

      https://en.wikipedia.org/wiki/Microcode

        • sparkman
        • 4 years ago

        Right, I do know about microcode, but I’m imagining the majority of the ALU isn’t implemented as microcode, but instead is unchangeable silicon.

        How does Intel approach this kind of bug fix? Does their microcode allow them to filter on x86 opcodes, maybe overriding a problem opcode to run a different opcode, instead?

          • willmore
          • 4 years ago

          Short of going to work for Intel and getting a job in just the right place, you will never get an answer to that question. Sorry. That’s super trade secret kind of stuff.

          • bhtooefr
          • 4 years ago

          There are also approaches like disabling HyperThreading during the problematic workloads, or if some of the scheduling is in microcode, fixing it there…

      • TruthSerum
      • 4 years ago

      A bug in the silicon with embedded flash based microcontrollers…

      • jihadjoe
      • 4 years ago

      Every modern CPU has a ton of errata that is only fixed with microcode. More often than not it requires a very specific set of conditions to trigger a bug, so they put checks in place to make sure those conditions aren’t met. It could be as simple as making sure a certain value doesn’t get stored in a certain register (negligible performance hit), or as bad as falling back to a different code path (like that nasty Phenom I bug).

      Check out these pretty recent spec documents from both AMD and Intel if you want to see some of the nasty stuff that’s being covered with microcode-level band aid:

      http://support.amd.com/TechDocs/48063_15h_Mod_00h-0Fh_Rev_Guide.pdf
      http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-e5-family-spec-update.pdf

    • fyo
    • 4 years ago

    The article should be updated to reflect that this problem only occurs with Hyper-Threading and doesn’t actually cause the system to freeze when running Prime95. The program catches the error, retries the calculation, and after the calculation fails again, exits with an error message.

    The CPU bug could certainly cause a freeze with other software, but this has not been demonstrated and does not occur with Prime95.

    No current Skylake i5 chips use Hyper-Threading, so this bug appears to only be exposed on i7 chips and, perhaps, i3 chips. The latter has not been confirmed, AFAIK.

    • ronch
    • 4 years ago

    Oh jeez!! So that’s why I keep getting those errors!

    Oh wait, I’m using an AMD processor.

    • ronch
    • 4 years ago

    I think an Intel Oompa Loompa won’t get his chocolate this week.

    • kuttan
    • 4 years ago

    If the new BIOS patch compromises performance, then this is a bad thing from Intel.

      • TruthSerum
      • 4 years ago

      Well it’s not a good thing either way, lol. I wonder what else might be found?

      ~Intel Management Engine has intercepted this message and alerted the proper authorities.

      • eofpi
      • 4 years ago

      You’d rather get the wrong answer faster than the correct answer a bit slower?

        • TruthSerum
        • 4 years ago

        Nope. He definitely did not say that.

    • willmore
    • 4 years ago

    Original forum thread:
    http://www.mersenneforum.org/showthread.php?t=20714

    It was later posted to the Intel forums once it was determined that it was likely a chip issue.

    To summarize the thread: When running an older version of Prime95 (which used the AVX codepath for all CPUs that could support it), some German overclockers were getting errors on one size of FFT: 768K. They brought it to the attention of the MersenneForum people, where the author of the code and several other community members worked to narrow down what was happening. Once it was determined not to be a motherboard, memory, power supply, thermal, OS, or overclocking issue, they moved on to making Intel aware of the problem. Several personal contacts at Intel were attempted. Only one made any headway: the security group was able to reproduce it on an older revision of the Skylake microcode patch. Meanwhile, a few MersenneForum community members joined the Intel forums to report the bug via that means. That method later brought the bug to wider attention at Intel. And that pretty much gets us to today, when the news has been breaking at bunches of tech sites.

    So, no crashing, no freezing, etc. The code that detects the problem is there because the Mersenne primality calculations on large numbers can take *months*. Any mistake along the way can ruin the whole calculation. So, checks were put in the code to try to detect them. If, while doing a real calculation, an error is detected, the code will roll back to a known good state and try again. Normally, this fixes the problem, if it was a temporary thing. If not, the program gives up and reports a severe error to the user.

    Prime95, the program in question, became popular with overclockers because it's a great stress testing tool. If your machine can run Prime95's stress test for a day or two, chances are nothing else will stress it harder and your overclock settings are good. That's why the problem was detected by the German overclockers. They were using the stress testing mode in Prime95 to prove the stability of their overclocks. On the affected Skylake chips, the Prime95 stress test fails on the 768K FFT AVX test and displays an error. There is no freezing or crashing. That was just language used to get Intel's attention on the Intel forums.

    But the problem is that other code *could* be affected by whatever is causing this error. Prime95 has checks in it because of the nature of the calculations it does. Most code isn't like that, and any faulty calculations could go unnoticed. That's way more scary than crashing or freezing. Errors that gum up the works are annoying. Errors that silently corrupt things are way scarier.

    Edited: corrected typos
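
A rough sketch of the detect, roll back, and retry pattern described in this comment, written in Python around a Lucas-Lehmer primality loop. This is not Prime95's code: the names `lucas_lehmer` and `FlakyStep`, the checkpoint interval, and the retry limit are all assumptions made for illustration. The error check here simply recomputes each step and compares, standing in for whatever checks the real program uses.

```python
def lucas_lehmer(p, step=None, checkpoint_every=16, max_retries=1):
    """Return True if 2**p - 1 is prime (p must be an odd prime).

    `step` is the possibly faulty squaring routine; each result is checked,
    and on a mismatch the loop rolls back to the last checkpoint.
    """
    modulus = (1 << p) - 1

    def reference_step(s, m):
        return (s * s - 2) % m

    step = step or reference_step
    s, done = 4, 0
    checkpoint = (s, done)  # last known good state
    retries = 0
    while done < p - 2:
        candidate = step(s, modulus)
        # Stand-in check (assumption): recompute the step and compare.
        if candidate != reference_step(s, modulus):
            if retries >= max_retries:
                raise RuntimeError(f"persistent error near iteration {done}")
            retries += 1
            s, done = checkpoint  # roll back and try again
            continue
        s, done = candidate, done + 1
        if done % checkpoint_every == 0:
            checkpoint = (s, done)
            retries = 0
    return s == 0


class FlakyStep:
    """Hypothetical faulty unit: returns a wrong result on the listed calls."""

    def __init__(self, fail_on_calls):
        self.calls = 0
        self.fail_on_calls = set(fail_on_calls)

    def __call__(self, s, m):
        self.calls += 1
        good = (s * s - 2) % m
        return good ^ 1 if self.calls in self.fail_on_calls else good
```

With a transient fault, such as `FlakyStep({5})`, the run rolls back once and still finishes with the right answer. A step that is wrong every time exhausts the retry limit and raises an error, which mirrors the "gives up and reports a severe error" behavior described above.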

    • credible
    • 4 years ago

    Damn power virus.

    • MetricT
    • 4 years ago

    Are they going to be able to fix this before Kaby Lake? Or will we be waiting for Cannonlake (or Cannonlake+1)? And what about Skylake-E? I’m in HPC, so any time I hear CPU errata like this I get a tad nervous.

      • Waco
      • 4 years ago

      Considering they’re already talking about fixing it via a microcode update, I’d bet it’s not something to worry about much.

      It remains to be seen if there’s an associated performance drop.

    • Dygear
    • 4 years ago

    Any word on whether this also affects GNURadio? FFTs are used there as well to visualize the spectrum that the device is monitoring.

      • lycium
      • 4 years ago

      FFT is used in just about everything; apparently it takes a very specific AVX workload to trigger it.

    • TopHatKiller
    • 4 years ago

    WHAT! Another one?!

      • TopHatKiller
      • 4 years ago

      no one bothers with intel bugs. they are, after all, the god of cpus. how wonderful it would be if amd presented problems like these…i’m sure no one would mind that either…

        • Zizy
        • 4 years ago

        Nobody cares about Intel bugs because they are loved gods that cannot do anything wrong.
        Nobody cares about AMD bugs because nobody uses that.

    • blastdoor
    • 4 years ago

    In other news, AMD revises all marketing materials to focus on Prime95 performance.

    • brucethemoose
    • 4 years ago

    Certain P95 FFT sizes (not exponents) have always hit certain Intel CPU generations hard. I forget the exact numbers, but there were 1 or 2 numbers that always crashed my Haswell CPU, even when other stress tests would run for days without any errors. But I'm surprised this happens at stock speeds.

    • ronch
    • 4 years ago

    Every time a bug is discovered in Intel processors, I always think about AMD. The last AMD bug that drew attention was the TLB bug in Barcelona. Have AMD chips simply been more robust since then?

      • just brew it!
      • 4 years ago

      Nobody finds bugs in new AMD chips any more because hardly anyone is using them, outside of current-gen game consoles. 😉

        • ronch
        • 4 years ago

        Also, if something goes wrong, like the system hangs up, it’s Windows that’s at fault. It’s always Windows. Or a virus.

        My FX CPU is practically rock-solid. 😀

      • NoOne ButMe
      • 4 years ago

      There are problems in all chips. AMD has had no major problems in a part that has shipped to consumers SINCE Barcelona that wasn’t fixed or blocked in microcode.

      I believe this is Intel’s first major issue since the Pentium or Pentium 2, where the FPU had something something that made it not run a certain math operation.

      Given how complex this stuff is, it is quite amazing there are so few problems that are not found or fixed.

      For more information you may wish to watch this very intriguing video about catching and fixing problems like this as well as other things in design. https://www.youtube.com/watch?v=eDmv0sDB1Ak

        • just brew it!
        • 4 years ago

        "There are problems in all chips. AMD has had no major problems in a part that has shipped to consumers that wasn't fixed or blocked in microcode."

        Only if you consider a 10% performance hit to be an acceptable fix. The original Phenom would've been more competitive if they had been able to leave the TLB enabled.

          • NoOne ButMe
          • 4 years ago

          Sorry! I meant to say after that bug!

        • BobbinThreadbare
        • 4 years ago

        If you want some fun, track down the errata for any modern x86 chip.

          • just brew it!
          • 4 years ago

          Or *any* complex chip, for that matter.

          At pretty much every job where I’ve done embedded or device driver work, I’ve encountered at least one issue that was eventually traced back to a bug in the silicon. Application developers (and end users) are typically unaware that these kinds of bugs even exist, because they get worked around at the microcode, firmware, or device driver level.

          Some examples off the top of my head:

          * Certain PIC32 microcontrollers had a nasty bug where reading a device register had a certain (low) probability of reading the register twice instead of once. This is harmless in most cases. It’s *not* harmless if reading the register has side effects. I spent several days chasing the cause of dropped bytes on a serial port, before realizing that I was being “had” by a chip errata – since reading from the UART’s FIFO register automatically advances the internal FIFO pointer, whenever you got hit by the “double read” bug one of the bytes in the data stream simply vanished without a trace. Workaround was to disable all hardware interrupts whenever reading from a register with side effects (apparently the race condition involved the CPU taking an interrupt while reading from the register).

          * Intel’s first attempt at a RISC workstation CPU — the i860 — had all kinds of fun chip bugs. Two that immediately come to mind: Every once in a blue moon, the CPU would decide to execute the first instruction of the exception handler twice. Or skip it entirely. What’s the first thing an exception handler normally does? Save the contents of internal machine registers on the stack! So whenever you got hit by this bug, you blew your stack because the stack pointer wasn’t pointing where it was supposed to. The workaround? The exception handler must start with a NOP. The other i860 bug was a bit of nasty behavior in the TLB logic, which would cause TLB thrashing in tight loops that executed in a multiple of (IIRC) 4 clock cycles, while accessing memory. This slowed performance to a crawl. Workaround was to insert a NOP in loops which exhibited this bad behavior to change the number of clock cycles required to execute the loop!

          * Some Motorola microcontroller back in the ’90s… race condition in the Ethernet interface’s DMA engine. If you tried to do DMA transfers of a burst of multiple minimum-length Ethernet frames, the Ethernet controller would sometimes wedge. Workaround was to detect the wedged Ethernet controller (watchdog timer in software), and reset the controller when the problem occurred. Yes, this dropped any in-flight frames on the floor; but you were already dropping packets anyway since the Ethernet port was wedged, so it was the lesser of two evils! (I was the first person to get bit by this and run it to ground; my workaround made it into Motorola’s official errata list…)
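
To make the first example above concrete, here is a toy Python simulation of a read-with-side-effects register and the "mask interrupts around the read" workaround. It does not model real PIC32 hardware. The class `SimulatedUart`, the helper `drain`, and the choice of which read gets interrupted are all hypothetical.

```python
from collections import deque
from contextlib import contextmanager


class SimulatedUart:
    """Toy model: reading the RX register pops the FIFO. If an interrupt
    lands during the read, the register is read twice and a byte is lost."""

    def __init__(self, data, interrupt_on_reads):
        self.fifo = deque(data)
        self.interrupts_enabled = True
        self.interrupt_on_reads = set(interrupt_on_reads)
        self.reads = 0

    def read_rx(self):
        self.reads += 1
        byte = self.fifo.popleft()
        glitch = self.interrupts_enabled and self.reads in self.interrupt_on_reads
        if glitch and self.fifo:
            byte = self.fifo.popleft()  # second read: the first byte vanishes
        return byte

    @contextmanager
    def interrupts_masked(self):
        previous = self.interrupts_enabled
        self.interrupts_enabled = False
        try:
            yield
        finally:
            self.interrupts_enabled = previous


def drain(uart, safe):
    """Read everything from the FIFO, optionally with interrupts masked."""
    received = []
    while uart.fifo:
        if safe:
            with uart.interrupts_masked():
                received.append(uart.read_rx())
        else:
            received.append(uart.read_rx())
    return bytes(received)
```

Draining `SimulatedUart(b"HELLO", {2})` with `safe=False` silently drops a byte, while `safe=True` returns the full stream.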

        • chuckula
        • 4 years ago

        http://www.bit-tech.net/news/hardware/2012/03/06/coder-finds-amd-chip-bug/1

          • ronch
          • 4 years ago

          Interesting article. Given how the bug affects Barcelona, Shanghai, and Istanbul chips, I’m inclined to think the bug resides in the K10 core itself.

          Good to know the Bulldozer lineage doesn’t have this bug, although I’m sure it’s also got its fair share of them. Having used my FX-8350 for more than 3 years now, however, I think it’s no more or less reliable than any other prior x86 CPU I’ve used. I can play my games just fine and the CPU keeps perfect track of the bullets I expend, so I guess it has passed validation.

    • DancinJack
    • 4 years ago

    Wow. What a bug.

    – In the menu go to ‘Advanced | Test’ and fill in the number 14942209 in the box labeled ‘Exponent to test’
    – Let the program run for some time and at some point, minutes or hours, the system will freeze.

    I can see how this was something that might have been missed during validation.

      • NTMBK
      • 4 years ago

      There are presumably other use cases which will cause the same error. This is just one way to reproduce it.

        • DancinJack
        • 4 years ago

        Possibly, but they haven’t been found. Until evidence is found I don’t think I could assume you could hit it other ways.

          • derFunkenstein
          • 4 years ago

          Probably irresponsible to assume you’re immune just because something hasn’t been found, but I’m not exactly rushing to storm Intel’s gates just because my i5-6600K is probably affected (I’m not going to test it out).

            • chuckula
            • 4 years ago

            Actually your 6600K is not affected. Hyperthreading appears to be a requirement to trigger the bug, and the 6600K lacks Hyperthreading.

            • derFunkenstein
            • 4 years ago

            nice. I guess I should have clicked the link. 😳

            • DancinJack
            • 4 years ago

            I didn’t say immune. I just don’t think it’s logical to worry about something like this that affects an extremely small population and doesn’t really have any long-lasting, negative consequences.

    • Krogoth
    • 4 years ago

    Sounds like a bug in Skylake’s ALU.

    I wouldn’t be surprised though. Modern chips are incredibly complex, and it isn’t too far-fetched for a specific calculation to cause a major problem.

    It is a rather obscure problem too, so it isn’t that shocking that it wasn’t caught during QA on pre-production samples.

      • ermo
      • 4 years ago

      Summary: Krogoth is not impressed. NEXT!

        • ronch
        • 4 years ago

        I’d be impressed if someone is actually impressed by a chip bug.

      • willmore
      • 4 years ago

      I would guess it’s a bug in register renaming. Intel has gotten very careful about their ALUs after that whole FDIV thing. 🙂

      It only happens when running a 4C8T processor with eight threads working on the same problem. So, I guess it could be a cache coherency thing, but Intel has been using the same kind of ringbus system for a while.

      On the other hand, the register renaming and instruction retirement logic gets reworked every generation of processor.

    • puppetworx
    • 4 years ago

    How does this get past pre-mass-production testing? Prime95 is exactly the type of software you’d think they’d test with.

      • chuckula
      • 4 years ago

      If you read the details there are more complexities present. If you take a recent version of Prime95 and run it straight on a Skylake processor with this input you will NOT trigger the bug. That’s because Skylake defaults to using the FMA3 codepaths that aren’t affected by the bug.

      To trigger it, you need to run the specific input exponent, have hyperthreading turned on, AND fall back to using an AVX1 code path instead of using the newer FMA3 codepaths. Even then, it’s not a simple insta-hang, but it occurs after an indeterminate period of time, which likely indicates a subtle race condition buried somewhere in the execution units.

      They do lots and lots of validation on these chips, but it’s hard to test every possible combination.

        • spugm1r3
        • 4 years ago

        I love being reminded that some of the more clever smart@$$es on the site are also, actually, clever.

          • morphine
          • 4 years ago

          Yes, but you must choose smartasses wisely.

            • jihadjoe
            • 4 years ago

            https://www.youtube.com/watch?v=jgBNi17Nxcc

        • willmore
        • 4 years ago

        It’s not a specific exponent, it’s a particular FFT size–768K.

          • DancinJack
          • 4 years ago

          So, they are using the wrong language here?

          https://communities.intel.com/mobile/mobile-access.jspa#jive-content?content=%2Fapi%2Fcore%2Fv3%2Fcontents%2F524553

          "- In the menu go to 'Advanced | Test' and fill in the number 14942209 in the box labeled 'Exponent to test'"

            • willmore
            • 4 years ago

            That Mersenne number will use the 768K FFT size. It’s a short way of getting someone not familiar with Prime95 to reproduce the bug.

        • puppetworx
        • 4 years ago

        Thanks, that makes more sense as to how it snuck by.

      • Laykun
      • 4 years ago

      Yeah, and how come software has bugs? What’s up with that?!

    • chuckula
    • 4 years ago

    "Prime95 can cause Intel Skylake CPUs to freeze"

    Awesome! Who needs a crazy cooling solution when you can get the CPU down to freezing with software! It's a brave new world indeed.

      • torquer
      • 4 years ago

      Interesting that the number that trips up Skylake is exactly the amount of money AMD has left…

        • DPete27
        • 4 years ago

        Does not compute

          • torquer
          • 4 years ago

          That’s why they had to drop the price on the R9 Nano. The original price was computed using a Skylake CPU.

      • RedBearArmy
      • 4 years ago

      You know it’s just the good old DIV at 0°C bug.

      • dpaus
      • 4 years ago

      Apparently it freezes up when you try to calculate the value-for-money quotient. Can’t say I blame it….

      • jihadjoe
      • 4 years ago

      You wouldn’t download a heatsink…
