A corrected hardware error has occurred.

Re: A corrected hardware error has occurred.

Posted on Sun Mar 23, 2014 11:20 pm

I have no direct experience with WHEA errors, but I'd be skeptical that an unclean OS install has anything to do with them. Your increasing error rate also says 'hardware problem' to me. I suppose driver corruption or somesuch remains a possibility, though.

If only to satisfy my curiosity, I'd go with your original plan to do the OS reinstall (or maybe you have another hard drive handy?) and processor swap. That would hopefully narrow it down.
E3-1230v2 | GA-6UASL3 | 16 GB ECC | Seagate 600 240GB | WD Black + Red | GTX 660 | Silverstone TJ08-E | Seasonic X650 | Win Server 2012 | Dell U2713HM || Zotac CI320
mako
Gerbil
Gold subscriber
 
 
Posts: 94
Joined: Sat Oct 30, 2004 6:09 pm
Location: North Bay

Re: A corrected hardware error has occurred.

Posted on Mon Mar 24, 2014 12:06 am

A couple quick points:
1) Thermal and voltage issues can cause "weak" bits to throw errors. So, if the CPU is getting hot, or the power supply is weak, marginal (but serviceable) parts can show fails early. So, check your temps and power. Even if your power supply is fine, power comes in through the socket pins, so check that the chip is well seated, thermal paste is evenly applied, etc.
2) Intel CPU cores negotiate their APIC IDs on boot. Basically, they all say "I think I'm zero", and a winner is atomically selected. Then every core but 0 tries 1, and so on. So, their IDs can change from boot to boot. So, don't be surprised if an error seems to move around from core to core.
3) Intel cores run microcode. So, check that your software (BIOS and O/S) is fully updated. Windows hides this. Linux packages up microcode with kernel builds. Microcode can fix many kinds of things, including replacing/adding/removing instructions, or affecting internal configuration settings. Intel learned their lesson from the infamous Pentium FDIV bug long ago.

If it were me, I'd play around a bit checking power, temps and microcode, but still RMA the thing.
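
For point 3, a quick way to see which microcode revision each core is running on Linux is to read /proc/cpuinfo. This is a minimal sketch, assuming an x86 Linux kernel that exposes a `microcode` field there; the function name `microcode_revisions` is made up for illustration and is not part of any tool.

```python
from pathlib import Path


def microcode_revisions(cpuinfo_text):
    """Map logical CPU number -> microcode revision parsed from /proc/cpuinfo text.

    Assumes the x86 Linux layout, where each CPU block has a
    'processor : N' line and a 'microcode : 0x..' line.
    """
    revisions = {}
    current_cpu = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between CPU blocks
        key, value = key.strip(), value.strip()
        if key == "processor":
            current_cpu = int(value)
        elif key == "microcode" and current_cpu is not None:
            revisions[current_cpu] = int(value, 16)
    return revisions


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        for cpu, rev in sorted(microcode_revisions(cpuinfo.read_text()).items()):
            print(f"cpu{cpu}: microcode 0x{rev:x}")
```

If the cores report different revisions, or the revision is older than the one your BIOS vendor lists, that is worth chasing before blaming the silicon.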
MarkG509
Gerbil First Class
Gold subscriber
 
 
Posts: 145
Joined: Thu Feb 21, 2013 6:51 pm

Re: A corrected hardware error has occurred.

Posted on Mon Mar 24, 2014 8:12 pm

The 4770K was already doing the WHEA error and occasional BSoD thing, which is why I wrongly concluded the chip had degraded from moderate overclocking. Now that I'm seeing similar issues with the 4771, I'm sure one (if not both) of the processors is actually fine. I'm running a triple 140mm radiator setup and even with a 480 in the loop, combined load temps are around 60°C.

Noticed something odd about GB's Easytune software, by default it was screwing around with mosfet/phase settings, current protection thresholds, and even loadline calibration. Changed it all to standard, call it a hunch but I'm going to see if that stops the errors before I do anything else. Words can't describe my opinion of Gigabyte's software right now though... no reason at all it should be modifying UEFI settings by default.
Kougar
Gerbil XP
 
Posts: 406
Joined: Tue Dec 02, 2008 2:12 am
Location: Texas

Re: A corrected hardware error has occurred.

Posted on Tue Mar 25, 2014 8:17 am

Kougar wrote:The 4770K was already doing the WHEA error and occasional BSoD thing, which is why I wrongly concluded the chip had degraded from moderate overclocking. Now that I'm seeing similar issues with the 4771, I'm sure one (if not both) of the processors is actually fine. I'm running a triple 140mm radiator setup and even with a 480 in the loop, combined load temps are around 60°C.

Noticed something odd about GB's Easytune software, by default it was screwing around with mosfet/phase settings, current protection thresholds, and even loadline calibration. Changed it all to standard, call it a hunch but I'm going to see if that stops the errors before I do anything else. Words can't describe my opinion of Gigabyte's software right now though... no reason at all it should be modifying UEFI settings by default.

Ugh.....I hate all that automatic crap. One of the first things I do on a new system is double and triple check that all of that stuff is turned off. I don't want anything being done automatically (and potentially causing errors and instability) without my knowledge.
i5 2500k - P67 - GTX660 - 840 Pro 256GB - Xonar Essence STX - Senn HD595's
The Egg
Gerbil Elite
Silver subscriber
 
 
Posts: 565
Joined: Sun Apr 06, 2008 4:46 pm

Re: A corrected hardware error has occurred.

Posted on Tue Mar 25, 2014 9:44 am

The Egg wrote:
Kougar wrote:The 4770K was already doing the WHEA error and occasional BSoD thing, which is why I wrongly concluded the chip had degraded from moderate overclocking. Now that I'm seeing similar issues with the 4771, I'm sure one (if not both) of the processors is actually fine. I'm running a triple 140mm radiator setup and even with a 480 in the loop, combined load temps are around 60°C.

Noticed something odd about GB's Easytune software, by default it was screwing around with mosfet/phase settings, current protection thresholds, and even loadline calibration. Changed it all to standard, call it a hunch but I'm going to see if that stops the errors before I do anything else. Words can't describe my opinion of Gigabyte's software right now though... no reason at all it should be modifying UEFI settings by default.

Ugh.....I hate all that automatic crap. One of the first things I do on a new system is double and triple check that all of that stuff is turned off. I don't want anything being done automatically (and potentially causing errors and instability) without my knowledge.

I also dislike this trend of BIOS settings to "auto" meaning "Hey, screw around with that for me". When I got my current 2500K and the Z77 board it's on, I had trouble running the same clocks I had used on my Z68X setup previously. After a bit of hunting, found that still more of the BIOS now defaults to so-called "auto". After manually setting everything plus dog to "normal" instead of "auto", my settings stayed where they were supposed to be, my CPU stopped cooking under 1.6V, and I was able to resume 4.8GHz (up from a very hot 4.1 with "auto").

Ninja: Oh, and it's not just Gigabyte doing it. Had an Asus board briefly which did the same thing. I've heard tell it's the coming thing on many mobo brands.
Siglessness is boring.
Forge
Lord High Gerbil
Silver subscriber
 
 
Posts: 8059
Joined: Wed Dec 26, 2001 7:00 pm
Location: SouthEast PA

Re: A corrected hardware error has occurred.

Posted on Tue Mar 25, 2014 9:57 am

Agreed. The default settings should run everything at stock. Period. Any tweaks -- whether manual or automatic -- should be something you need to explicitly enable.
(this space intentionally left blank)
just brew it!
Administrator
Gold subscriber
 
 
Posts: 37991
Joined: Tue Aug 20, 2002 10:51 pm
Location: Somewhere, having a beer

Re: A corrected hardware error has occurred.

Posted on Tue Mar 25, 2014 2:42 pm

I couldn't agree more. Especially with "auto" in the UEFI, as Forge said. I don't trust it because OEMs have proven time and time again that it can't actually be trusted...

I have never used the 3D Power functionality built into EasyTune because I do all my OCing straight from inside the UEFI, so I never paid any attention to the settings in the 3D Power area, which were apparently auto-applied whenever EasyTune started. The only reason EasyTune is installed at all is for fan control (even though, out of the 5 fan headers, it isn't actually able to modify the CPU fan speed, go figure that one out :roll: ). Too early to say, but so far I've not seen any WHEA events... Already wondering if EasyTune was why I had so much trouble keeping my previous 4770K stable at any clock speed.
Kougar
Gerbil XP
 
Posts: 406
Joined: Tue Dec 02, 2008 2:12 am
Location: Texas

Re: A corrected hardware error has occurred.

Posted on Tue Apr 08, 2014 12:10 pm

Looks like EasyTune mucking up the power profile wasn't the cause after all. After two weeks of no errors showing up I suddenly received 12 WHEA corrected errors within a single hour. Guess it's time to try a new OS.
Kougar
Gerbil XP
 
Posts: 406
Joined: Tue Dec 02, 2008 2:12 am
Location: Texas

Re: A corrected hardware error has occurred.

Posted on Wed May 28, 2014 4:24 pm

New OS achievement unlocked. Been two weeks and so far while Metro is still highly irritating, I've not seen a single BSoD or WHEA notification. Too early to say with 100% certainty but sure looks promising.

So in total, looks like a dirty Win 7 OS install was somehow causing WHEA CPU Cache Parity Errors, both correctable (event logs) and uncorrectable (blue screens). More recently it was also throwing general 0x00000124 BSoDs, which notably are supposed to be hardware-caused. :o
Kougar
Gerbil XP
 
Posts: 406
Joined: Tue Dec 02, 2008 2:12 am
Location: Texas

Re: A corrected hardware error has occurred.

Posted on Thu May 29, 2014 10:00 am

Kougar wrote:After two weeks of no errors showing up I suddenly received 12 WHEA corrected errors within a single hour. Guess it's time to try a new OS.

Kougar wrote:Been two weeks and so far while Metro is still highly irritating, I've not seen a single BSoD or WHEA notification. Too early to say with 100% certainty but sure looks promising.

Ehh... given that you previously went two weeks on the Win7 install without errors, I'd say it is still 100% uncertain.

It is also possible that there really is some sort of hardware issue, but Win8 doesn't trip over it for reasons unknown.
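
The point about the earlier two-week quiet spell can be put in numbers: a clean streak only starts to mean something once it is longer than the longest quiet gap the machine already managed while it was known to be faulty. A rough sketch of that comparison follows. The helper names are made up, and this is a rule of thumb rather than a proper statistical test.

```python
from collections import Counter
from datetime import datetime, timedelta


def errors_per_hour(timestamps):
    """Count logged errors in each clock hour (keyed by the start of the hour)."""
    return dict(Counter(ts.replace(minute=0, second=0, microsecond=0)
                        for ts in timestamps))


def longest_quiet_gap(timestamps):
    """Longest stretch between two consecutive logged errors."""
    ordered = sorted(timestamps)
    return max((later - earlier for earlier, later in zip(ordered, ordered[1:])),
               default=timedelta(0))


def clean_streak_exceeds_history(timestamps, now):
    """True only if the time since the last error beats every earlier quiet gap."""
    if not timestamps:
        return False  # no history to compare against
    return (now - max(timestamps)) > longest_quiet_gap(timestamps)
```

Feed it the timestamps from the old install's event log: if the errors once stayed away for two weeks and then came back, two clean weeks on the new install do not beat that record yet.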
(this space intentionally left blank)
just brew it!
Administrator
Gold subscriber
 
 
Posts: 37991
Joined: Tue Aug 20, 2002 10:51 pm
Location: Somewhere, having a beer

Re: A corrected hardware error has occurred.

Posted on Fri May 30, 2014 12:50 am

Brew, yer totally ruining all the good feels :P You made me think of something... is a guest OS still capable of receiving WHEA notifications from the CPU??

When a user installs Hyper-V it turns the host OS into a higher-tier guest OS, which is important to know for a variety of reasons. One of them is that things like VT-d, VT-x, or SLAT are no longer detectable to a guest OS. Along those lines, if it also breaks the WHEA notifications from the CPU, that could explain why I'm not seeing them. I added Hyper-V to Windows immediately after install and have been learning and playing around with it rather extensively since.

Aye, it's true I went two weeks without anything before, but that was also before the 0x124 BSoDs. Those were what finally forced me to dump the old OS and were a new thing that began occurring around the start of May. I'm feeling pretty confident for the first time in ages (knock on particle board) that I may finally have sorted out the cause behind a lot of past craziness.
Kougar
Gerbil XP
 
Posts: 406
Joined: Tue Dec 02, 2008 2:12 am
Location: Texas

Re: A corrected hardware error has occurred.

Posted on Fri May 30, 2014 5:45 am

Kougar wrote:You made me think of something... is a guest OS still capable of receiving WHEA notifications from the CPU??


I don't see how it could, the errors you were seeing are coming from MSRs. There is no reason why the virtual machine would duplicate the values from them, and depending on how they present the virtualized hardware to guest OSes, they might not implement those specific registers at all.

As you noted, they clearly virtualize some MSRs already, namely, the ones that report virtualization support are clearly set to "off". :wink:
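
For reference, the machine-check status registers in question are the per-bank IA32_MCi_STATUS MSRs (bank 0's status register is at MSR 0x401, with each bank taking four consecutive MSRs). Below is a small sketch of decoding one raw 64-bit value, using the architectural bit positions from Intel's manuals as I understand them. The function name is made up, and the value in the usage note is hypothetical rather than taken from this thread.

```python
def decode_mci_status(status):
    """Split a raw IA32_MCi_STATUS value into its architectural flag fields."""
    def bit(n):
        return bool((status >> n) & 1)

    return {
        "valid": bit(63),            # VAL: register holds a logged error
        "overflow": bit(62),         # OVER: another error arrived before this one was cleared
        "uncorrected": bit(61),      # UC: hardware could not correct the error
        "enabled": bit(60),          # EN: reporting was enabled for this error
        "misc_valid": bit(59),       # MISCV: the bank's MISC register has extra info
        "addr_valid": bit(58),       # ADDRV: the bank's ADDR register has the address
        "context_corrupt": bit(57),  # PCC: processor context may be corrupt
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_error_code": status & 0xFFFF,
    }
```

A value with `valid` set and `uncorrected` clear is the "corrected hardware error" case that lands in the event log. With `uncorrected` and `context_corrupt` set, you are in blue-screen territory. Whether a hypervisor passes these registers through to a guest is a separate question that this sketch does not answer.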
Glorious
Darth Gerbil
Gold subscriber
 
 
Posts: 7884
Joined: Tue Aug 27, 2002 6:35 pm

Re: A corrected hardware error has occurred.

Posted on Fri May 30, 2014 6:44 am

Glorious wrote:
Kougar wrote:You made me think of something... is a guest OS still capable of receiving WHEA notifications from the CPU??

I don't see how it could, the errors you were seeing are coming from MSRs. There is no reason why the virtual machine would duplicate the values from them, and depending on how they present the virtualized hardware to guest OSes, they might not implement those specific registers at all.

As you noted, they clearly virtualize some MSRs already, namely, the ones that report virtualization support are clearly set to "off". :wink:

...but we're talking about WHEA notifications in the *host* OS here, right? Or did I miss something, and Win8 running as a guest?
(this space intentionally left blank)
just brew it!
Administrator
Gold subscriber
 
 
Posts: 37991
Joined: Tue Aug 20, 2002 10:51 pm
Location: Somewhere, having a beer

Re: A corrected hardware error has occurred.

Posted on Fri May 30, 2014 6:51 am

JBI wrote:...but we're talking about WHEA notifications in the *host* OS here, right? Or did I miss something, and Win8 running as a guest?


He's describing Hyper-V as an actual hypervisor that's separate from the Windows OS itself, so that his Windows install is now, in effect, only a "higher-tier" guest OS, as he put it himself.

I don't really know anything about Hyper-V, but that sounds plausible.

Of course, there could be multiple different variations of Hyper-V or something; maybe someone with further experience could chime in.
Glorious
Darth Gerbil
Gold subscriber
 
 
Posts: 7884
Joined: Tue Aug 27, 2002 6:35 pm

Re: A corrected hardware error has occurred.

Posted on Fri May 30, 2014 8:25 am

Ahh, OK. Got it. Even if that's the case, I would think there would be *some* indication if the errors are still occurring.
(this space intentionally left blank)
just brew it!
Administrator
Gold subscriber
 
 
Posts: 37991
Joined: Tue Aug 20, 2002 10:51 pm
Location: Somewhere, having a beer

Re: A corrected hardware error has occurred.

Posted on Fri May 30, 2014 10:36 am

Yes. As far as I'm aware, the base Hyper-V client that one installs via the Add Roles/Software wizard is the same; only a handful of features differ between the Win 8 Pro and Windows Server 2012 R2 versions, but the core remains the same. I'm actually using Server 2012 R2, just to be clear here, but I am fairly sure the same applies to Win 8.

After the user installs Hyper-V, the original host OS is no longer that; it's just another (if priority) guest OS running on the invisible hypervisor. Things like measuring total system load aren't possible via Task Manager, since the "host" OS is just another oblivious guest OS. It's a true hypervisor because if I restart my OS all the VMs continue running in the background... it's a lot of fun, but it's also very different from my former perspective as a VBox & VMware Workstation user. When it comes to Hyper-V, the concept of a host OS basically gets thrown out the window.

Glorious wrote:
Kougar wrote:You made me think of something... is a guest OS still capable of receiving WHEA notifications from the CPU??


I don't see how it could, the errors you were seeing are coming from MSRs. There is no reason why the virtual machine would duplicate the values from them, and depending on how they present the virtualized hardware to guest OSes, they might not implement those specific registers at all.

As you noted, they clearly virtualize some MSRs already, namely, the ones that report virtualization support are clearly set to "off". :wink:


I was so afraid of that. So there's literally no way to check then either, because with the Hyper-V hypervisor in place I'd have to boot to a different disk & OS entirely right?
Kougar
Gerbil XP
 
Posts: 406
Joined: Tue Dec 02, 2008 2:12 am
Location: Texas

Re: A corrected hardware error has occurred.

Posted on Fri May 30, 2014 10:39 am

If you're still getting hardware errors there should be a way to see those through some sort of Hyper-V management interface, I would think.
(this space intentionally left blank)
just brew it!
Administrator
Gold subscriber
 
 
Posts: 37991
Joined: Tue Aug 20, 2002 10:51 pm
Location: Somewhere, having a beer

Re: A corrected hardware error has occurred.

Posted on Fri May 30, 2014 12:28 pm

just brew it! wrote:If you're still getting hardware errors there should be a way to see those through some sort of Hyper-V management interface, I would think.


The Hyper-V Manager interface is surprisingly barebones for how powerful it is, but I don't see any place for that. As best I can tell my best bet is still to monitor Administrative Events and Hyper-V's own event log section.
Kougar
Gerbil XP
 
Posts: 406
Joined: Tue Dec 02, 2008 2:12 am
Location: Texas

Re: A corrected hardware error has occurred.

Posted on Fri May 30, 2014 12:51 pm

Kougar wrote:
just brew it! wrote:Sounds like an error in one of the internal caches to me. It can't be a DRAM error since according to the specs for that CPU, it does not support ECC RAM; so it would have no way of even knowing that a DRAM error occurred, let alone correcting it.


Quit being so logical!

But that still puzzles me, because I've seen this with two processors. I'd just figured the first proc had OC issues. Could the motherboard have defective power regulation on the uncore power plane? I find it extremely unlikely that two Haswell procs would both be "bad" with the SAME cache parity WHEA errors.

Sorry, but if Intel went to the trouble of implementing error detection and correction in the cache, why would you think it extremely unlikely that there would be errors in the cache? They didn't implement it for no reason. They know that the cache has to operate close to the edge of non-functionality in order to do its job.

Be thankful the EDAC is there. Consider this analogous to a "soft" (correctable) error off a hard drive, which happens often enough.
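
To show what "detection and correction" means in practice, here is a toy Hamming(7,4) code: four data bits plus three check bits, enough to locate and flip back any single bad bit. It is purely illustrative and says nothing about which scheme Haswell's caches actually use; the function names are made up.

```python
def hamming74_encode(data):
    """Encode 4 data bits as a 7-bit codeword (positions 1..7, parity at 1, 2, 4)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # checks positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # checks positions 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # checks positions 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data bits, syndrome). A nonzero syndrome is the position that was fixed."""
    c = list(codeword)  # work on a copy so the caller's list is untouched
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome - 1] ^= 1  # flip the single bit the syndrome points at
    return [c[2], c[4], c[5], c[6]], syndrome
```

Flip any one bit of an encoded word and the decoder hands back the original data plus the position it repaired, which is the "corrected" case. This plain code cannot tell a double-bit flip from a single one, so two simultaneous flips come back silently wrong.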
This problem was caused by Windows, which was created by Microsoft Corporation.
sluggo
Gerbil Jedi
Gold subscriber
 
 
Posts: 1546
Joined: Wed Feb 16, 2005 8:44 pm
Location: under the table and dreaming

Re: A corrected hardware error has occurred.

Posted on Fri May 30, 2014 1:02 pm

Kougar wrote:I was so afraid of that. So there's literally no way to check then either, because with the Hyper-V hypervisor in place I'd have to boot to a different disk & OS entirely right?


Yeah, that's almost certainly the situation.

VMware has an option to enable/pass on machine check exceptions, but by default it doesn't. If you can't find anything in the configuration for Hyper-V about this, I'm going to guess you can't see them from within any guest OS.

As JBI says though, you'd think there'd be some way the hypervisor would catch/report them....
Glorious
Darth Gerbil
Gold subscriber
 
 
Posts: 7884
Joined: Tue Aug 27, 2002 6:35 pm

Re: A corrected hardware error has occurred.

Posted on Fri May 30, 2014 5:48 pm

sluggo wrote:Sorry, but if Intel went to the trouble of implementing error detection and correction in the cache, why would you think it extremely unlikely that there would be errors in the cache? They didn't implement it for no reason. They know that the cache has to operate close to the edge of non-functionality in order to do its job.

Be thankful the EDAC is there. Consider this analogous to a "soft" (correctable) error off a hard drive, which happens often enough.


Because it isn't normal to be getting WHEA errors multiple times a week. Especially when some of the WHEA cache errors are NOT correctable. I know my threads always get muddled, so let me recap a bit:

The WHEA CPU Cache Parity Errors were not all soft errors; occasionally one would be uncorrectable, and that blue screened the system. After some Q&A, Intel and Gigabyte reps both indicated to me they believed the OS was the most likely culprit. You can bet I asked, because I was on the verge of RMA'ing the motherboard after both CPUs had the same WHEA error. CPU cache parity errors are extremely easy to cause via OCing or improper voltages.

Around the start of May I also began receiving 0x124 blue screens, which were something else entirely that I'd not seen before. So despite my intense dislike of Metro and the hassles involved in figuring out what to install the OS onto and then reinstalling a litany of programs, I shelved my doubts and did a clean install of Windows Server 2012 R2. Given the special licensing involved for handling Win 8 / 2012 Server VMs, it made more sense than using Win 8, and I liked that most of the bundled Metro apps weren't baked in too.

Glorious wrote:As JBI says though, you'd think there'd be some way the hypervisor would catch/report them....


There could be; I won't pretend to be an expert on Hyper-V. Maybe there's another server role add-on I have to enable to get it. I'll do some more Google fishing.
Kougar
Gerbil XP
 
Posts: 406
Joined: Tue Dec 02, 2008 2:12 am
Location: Texas

Re: A corrected hardware error has occurred.

Posted on Fri May 30, 2014 8:16 pm

The Parent Partition still has direct hardware access. Event Viewer should still pass along WHEA errors.

http://en.wikipedia.org/wiki/File:Hyper-V.png
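
If WHEA events do still reach the parent partition, they can be pulled from the System log without clicking through Event Viewer. Here is a hedged sketch that shells out to `wevtutil`. It assumes a Windows machine with `wevtutil` on the PATH and that the events are logged under the Microsoft-Windows-WHEA-Logger provider; the helper names are made up.

```python
import subprocess

WHEA_PROVIDER = "Microsoft-Windows-WHEA-Logger"


def build_whea_query(max_events=20):
    """Build a wevtutil command line listing the newest WHEA-Logger events."""
    xpath = f"*[System[Provider[@Name='{WHEA_PROVIDER}']]]"
    return [
        "wevtutil", "qe", "System",  # query events from the System log
        f"/q:{xpath}",               # XPath filter on the provider name
        f"/c:{max_events}",          # cap the number of events returned
        "/rd:true",                  # newest first
        "/f:text",                   # human-readable output
    ]


def run_whea_query(max_events=20):
    """Run the query and return wevtutil's text output (Windows only)."""
    result = subprocess.run(build_whea_query(max_events),
                            capture_output=True, text=True, check=True)
    return result.stdout
```

An empty result only shows that nothing was logged under that provider. It does not prove the hardware is clean, especially if the hypervisor is swallowing the events.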
"Welcome back my friends to the show that never ends. We're so glad you could attend. Come inside! Come inside!"
Ryu Connor
Global Moderator
Gold subscriber
 
 
Posts: 3591
Joined: Thu Dec 27, 2001 7:00 pm
Location: Marietta, GA

Re: A corrected hardware error has occurred.

Posted on Fri May 30, 2014 9:08 pm

Hey Kougar,

First, let me convey my sympathies - debugging WHEA errors is the worst. And I mean that literally. Over the last 15 years working as lead tech in a shop that worked on 300-500 systems a month, WHEA errors were my least favorite thing to troubleshoot. And most of the time, I encountered them in 1 or 2P workstations or servers with gobs of RAM - you can imagine how much fun that would be.

Anywho, not to be a buzz-kill here, but I never once encountered a situation where an OS was the culprit. Though I have been in the situation several times where either Intel or the board vendor claimed that it was an OS issue, that never actually turned out to be the case. I sincerely hope your OS reload proves to be a solid fix, and I'll refrain from any additional WHEA horror stories in the meantime. :)

Cheers!
db0
divide_by_zero
Gerbil
Gold subscriber
 
 
Posts: 38
Joined: Fri May 30, 2014 8:57 pm

Re: A corrected hardware error has occurred.

Posted on Sat May 31, 2014 11:34 am

Thanks for the post db0, and welcome to the forums! :) If you have tales to tell I don't mind hearing about 'em.

So far neither the main OS nor the guest VMs have thrown a single blue screen, and I've yet to see anything in the event logs. Windows Update did break a few things; I guess some things never change when it comes to Microsoft :-? Will give it a month and post back, or sooner if the WHEA errors or BSoDs return.
Kougar
Gerbil XP
 
Posts: 406
Joined: Tue Dec 02, 2008 2:12 am
Location: Texas

Re: A corrected hardware error has occurred.

Posted on Tue Jun 10, 2014 3:29 am

Bah, I hate it when you guys are right! :cry:

Encountered a system freeze two days ago, no errors or event log notifications. Today the system spontaneously rebooted without a BSoD or event log notification as to why.

Presuming the Gigabyte Z87 motherboard is defective and going to RMA it after I figure out what I'm going to use in place of this system....
Kougar
Gerbil XP
 
Posts: 406
Joined: Tue Dec 02, 2008 2:12 am
Location: Texas
