 
Kougar
Minister of Gerbil Affairs
Topic Author
Posts: 2306
Joined: Tue Dec 02, 2008 2:12 am
Location: Texas

A corrected hardware error has occurred.

Sat Feb 15, 2014 3:02 pm

Those familiar with Haswell overclocking should recognize the title: it's a common event log error indicating that the processor detected and corrected an error that would otherwise have led to a BSoD. I saw this frequently during my trials and tribulations when OCing my 4770K... but now I have a real puzzler. I'm getting this error on a Core i7 4771 processor. Hence there is zero OCing going on.

A corrected hardware error has occurred.

Reported by component: Processor Core
Error Source: Corrected Machine Check
Error Type: Internal parity error
Processor ID: 4

The details view of this entry contains further information.


Is anyone an expert with these errors who knows what's going on? Could this possibly indicate an error detected in the RAM, or is it a CPU cache error? Memtest & Prime don't find any issues, as this error is fairly rare, but I've noted a few other instances in the event logs.
 
notfred
Maximum Gerbil
Posts: 4610
Joined: Tue Aug 10, 2004 10:10 am
Location: Ottawa, Canada

Re: A corrected hardware error has occurred.

Mon Feb 17, 2014 2:54 pm

If you grab the details of the entry then there are a bunch of websites that have Machine Check Exception decoder details around. That should point at processor or memory. Also check your PSU is feeding everything with enough juice.
 
Kougar
Minister of Gerbil Affairs
Topic Author
Posts: 2306
Joined: Tue Dec 02, 2008 2:12 am
Location: Texas

Re: A corrected hardware error has occurred.

Mon Feb 17, 2014 6:35 pm

Thanks for the reply. Just what would count as the Machine Check Exception code in the details pane? Keep in mind this is NOT a BSoD, so there's no MCE code that I could find.

The process ID for the last one points to svchost.exe, but the other doesn't match any running processes so I don't know what it was. Event ID of 19 if that matters.
 
just brew it!
Administrator
Posts: 54500
Joined: Tue Aug 20, 2002 10:51 pm
Location: Somewhere, having a beer

Re: A corrected hardware error has occurred.

Mon Feb 17, 2014 8:44 pm

Sounds like an error in one of the internal caches to me. It can't be a DRAM error since according to the specs for that CPU, it does not support ECC RAM; so it would have no way of even knowing that a DRAM error occurred, let alone correcting it.
Nostalgia isn't what it used to be.
 
Kougar
Minister of Gerbil Affairs
Topic Author
Posts: 2306
Joined: Tue Dec 02, 2008 2:12 am
Location: Texas

Re: A corrected hardware error has occurred.

Tue Feb 18, 2014 1:19 am

just brew it! wrote:
Sounds like an error in one of the internal caches to me. It can't be a DRAM error since according to the specs for that CPU, it does not support ECC RAM; so it would have no way of even knowing that a DRAM error occurred, let alone correcting it.


Quit being so logical!

But that still puzzles me, because I've seen this with two processors. I'd just figured the first proc had OC issues. Could the motherboard have defective power regulation on the uncore power plane? I find it extremely unlikely that two Haswell procs would both be "bad" with the SAME cache parity WHEA errors. I'd open a support ticket with Gigabyte if I thought there was a chance they would help figure out what the problem was.

Using the most recent non-beta BIOS. Reset everything to optimized settings after I installed the 4771, and the RAM is also at stock 1600MHz settings. The PSU is a holdover Corsair AX1200 from when I was running a 980X and pulling >800W from the wall. :roll:
 
just brew it!
Administrator
Posts: 54500
Joined: Tue Aug 20, 2002 10:51 pm
Location: Somewhere, having a beer

Re: A corrected hardware error has occurred.

Tue Feb 18, 2014 1:43 am

Well, I suppose it is possible that a problem with the motherboard is causing this particular error to be reported spuriously. I have no idea how it is signaled by the CPU, or if the motherboard is involved in reporting it.
Nostalgia isn't what it used to be.
 
Glorious
Gerbilus Supremus
Posts: 12343
Joined: Tue Aug 27, 2002 6:35 pm

Re: A corrected hardware error has occurred.

Tue Feb 18, 2014 6:45 am

JBI wrote:
Well, I suppose it is possible that a problem with the motherboard is causing this particular error to be reported spuriously. I have no idea how it is signaled by the CPU, or if the motherboard is involved in reporting it.


It could be something wrong with QPI (Intel's new Point-to-Point version of an FSB). That could still potentially point to the motherboard (damaged/weakened traces/socket?). The mechanism isn't parity based, but it does a CRC on messages and has the capability to resend. If that were to happen, it'd probably be reported via APEI, MCE or some other method.
 
ronch
Graphmaster Gerbil
Posts: 1142
Joined: Mon Apr 06, 2009 7:55 am

Re: A corrected hardware error has occurred.

Tue Feb 18, 2014 7:09 am

Is this an isolated incident? I remember years ago I encountered HyperTransport-related issues more than once, and with more than one motherboard. That was back in the Athlon 64 X2 era. So this can, as someone has said, be QPI related. Either that, the motherboard is flaky, or .... could it be that you've stumbled upon a Haswell bug?
NEC V20 > AMD Am386DX-40 > AMD Am486DX2-66 > Intel Pentium-200 > Cyrix 6x86MX-PR233 > AMD K6-2/450 > AMD Athlon 800 > Intel Pentium 4 2.8C > AMD Athlon 64 X2 4800 > AMD Phenom II X3 720 > AMD FX-8350 > RYZEN?
 
just brew it!
Administrator
Posts: 54500
Joined: Tue Aug 20, 2002 10:51 pm
Location: Somewhere, having a beer

Re: A corrected hardware error has occurred.

Tue Feb 18, 2014 8:58 am

Yeah, I suppose it could be an interconnect error. But the fact that it is reported as an "internal" parity error makes me lean towards an on-die cache or some other data path inside the CPU. The reporting of an associated Processor ID also supports this theory (though in the case of an external bus it is possible that the CPU knows which core the transfer was being done on behalf of, and reports that core).

@Kougar - Just out of curiosity, how rare is "fairly rare"?
Nostalgia isn't what it used to be.
 
Glorious
Gerbilus Supremus
Posts: 12343
Joined: Tue Aug 27, 2002 6:35 pm

Re: A corrected hardware error has occurred.

Tue Feb 18, 2014 9:53 am

JBI wrote:
Yeah, I suppose it could be an interconnect error. But the fact that it is reported as an "internal" parity error makes me lean towards an on-die cache or some other data path inside the CPU. The reporting of an associated Processor ID also supports this theory (though in the case of an external bus it is possible that the CPU knows which core the transfer was being done on behalf of, and reports that core).


I was just running with the theory because he said the previous CPU had the same issues, but then again, as he said, he was overclocking that one. So, yeah... :-?

---

Kougar, the message says there are further details. Can you supply that information?
 
Kougar
Minister of Gerbil Affairs
Topic Author
Posts: 2306
Joined: Tue Dec 02, 2008 2:12 am
Location: Texas

Re: A corrected hardware error has occurred.

Tue Feb 18, 2014 10:45 am

It looks like Just Brew's logic is correct, it's probably internal to the CPU's caches. In my googling I found someone else that had Event ID 19 WHEA "Memory Controller Error" messages, which is what I'd originally been concerned about. So that effectively rules out the RAM in any form or fashion.

just brew it! wrote:
@Kougar - Just out of curiosity, how rare is "fairly rare"?


Two so far this month. Keep in mind I only JUST swapped in the 4771 processor this month! Furthermore, both WHEA errors reference different core IDs. :-?
 
Glorious
Gerbilus Supremus
Posts: 12343
Joined: Tue Aug 27, 2002 6:35 pm

Re: A corrected hardware error has occurred.

Tue Feb 18, 2014 11:01 am

Kougar wrote:
It looks like Just Brew's logic is correct, it's probably internal to the CPU's caches. In my googling I found someone else that had Event ID 19 WHEA "Memory Controller Error" messages, which is what I'd originally been concerned about. So that effectively rules out the RAM in any form or fashion.


But there's also a "Cache Hierarchy Error", so I'm not sure how much we can read into such things. I'd agree, of course, that it's most likely a cache problem, but it would suck, really suck, if you were to replace/RMA the CPU and then experience this exact problem again.

So... do you have the ability to provide more details as your initial post of the event log suggests? :P

Ultimately, though, these things are so complicated/interrelated and the tools & reporting so limited it's pretty much trial-and-error troubleshooting anyway :(
 
SuperSpy
Minister of Gerbil Affairs
Posts: 2403
Joined: Thu Sep 12, 2002 9:34 pm
Location: TR Forums

Re: A corrected hardware error has occurred.

Tue Feb 18, 2014 12:41 pm

AFAIK the chipset even in a single socket machine still has to notify the CPU of cache coherency issues (from DMA I'd imagine) so it's probably still possible a faulty chipset/motherboard could pollute the CPU caches causing an ECC fault.
Desktop: i7-4790K @4.8 GHz | 32 GB | EVGA GeForce 1060 | Windows 10 x64
Laptop: MacBook Pro 2017 2.9GHz | 16 GB | Radeon Pro 560
 
Kougar
Minister of Gerbil Affairs
Topic Author
Posts: 2306
Joined: Tue Dec 02, 2008 2:12 am
Location: Texas

Re: A corrected hardware error has occurred.

Tue Feb 18, 2014 12:57 pm

Glorious wrote:
So... do you have the ability to provide more details as your initial post of the event log suggests? :P


Sure, but I'm not sure it's of any use. Here's the details view for the error that I linked to svchost.exe:

- System 

  - Provider

   [ Name]  Microsoft-Windows-WHEA-Logger
   [ Guid]  {C26C4F3C-3F66-4E99-8F8A-39405CFED220}
 
   EventID 19
 
   Version 0
 
   Level 3
 
   Task 0
 
   Opcode 0
 
   Keywords 0x8000000000000000
 
  - TimeCreated

   [ SystemTime]  2014-02-14T04:45:18.196555700Z
 
   EventRecordID 316009
 
  - Correlation

   [ ActivityID]  {BDC4FC88-C53F-42FF-9FED-3FBF98565FBD}
 
  - Execution

   [ ProcessID]  1676
   [ ThreadID]  13900
 
   Channel System
 
   Computer Kougar-PC
 
  - Security

   [ UserID]  S-1-5-19
 

- EventData

  ErrorSource 1
  ApicId 4
  MCABank 0
  MciStat 0x90000040000f0005
  MciAddr 0x0
  MciMisc 0x0
  ErrorType 12
  TransactionType 256
  Participation 256
  RequestType 256
  MemorIO 256
  MemHierarchyLvl 256
  Timeout 256
  OperationType 256
  Channel 256
  Length 864
  RawData 435045521002FFFFFFFF0300020000000200000060030000112D04000E020E140000000000000000000000000000000000000000000000000000000000000000BDC407CF89B7184EB3C41F732CB57131B18BCE2DD7BD0E45B9AD9CF4EBD4F8908835C324CA26CF0100000000000000000000000000000000000000000000000058010000C00000000102000001000000ADCC7698B447DB4BB65E16F193C4F3DB0000000000000000000000000000000002000000000000000000000000000000000000000000000018020000400000000102000000000000B0A03EDC44A19747B95B53FA242B6E1D0000000000000000000000000000000002000000000000000000000000000000000000000000000058020000080100000102000000000000011D1E8AF94257459C33565E5CC3F7E80000000000000000000000000000000002000000000000000000000000000000000000000000000057010000000000000002080000000000C30603000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000400000000000000000000000000000000000000000000000000000000000000000000000000000003000000000000000400000000000000C306030000081004FFFBFA7FFFFBEBBF00000000000000000000000000000000000000000000000000000000000000000100000001000000D89C148F3F29CF0104000000000000000000000000000000000000000000000005000F0040000090000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
 
GodsMadClown
Gerbil First Class
Posts: 123
Joined: Thu Feb 27, 2003 10:02 am

Re: A corrected hardware error has occurred.

Tue Feb 18, 2014 1:20 pm

My god, it's full of zeros!
 
shizuka
Gerbil
Posts: 19
Joined: Sun Sep 09, 2012 4:41 pm

Re: A corrected hardware error has occurred.

Tue Feb 18, 2014 1:42 pm

status 0x90000040000f0005, bank 0
IFU (Instruction Fetch Unit - Instruction Cache):
[15:0] 0000000000000101 = 0b101 Internal Parity Error
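
For anyone who wants to reproduce the decode without a vendor tool, here is a minimal sketch that splits the MciStat value into the architectural IA32_MCi_STATUS fields. The function name and the tiny lookup table are illustrative only, and the corrected-error count in bits 52:38 is only meaningful on CPUs that report it, so treat that field as an assumption. Mapping the bank number to a functional unit is model-specific and is not attempted here.

```python
# Sketch: decode the architectural fields of an IA32_MCi_STATUS value.
# Only a few "simple" MCA error codes are listed; this is not a full table.
MCACOD_SIMPLE = {
    0x0000: "No error",
    0x0005: "Internal parity error",
    0x0400: "Internal timer error",
}


def decode_mci_status(status: int) -> dict:
    """Return the main status fields as a dict (hypothetical helper)."""

    def bit(n: int) -> bool:
        return bool((status >> n) & 1)

    mca_code = status & 0xFFFF
    return {
        "valid": bit(63),            # VAL: register holds a logged error
        "overflow": bit(62),         # OVER: an earlier error was overwritten
        "uncorrected": bit(61),      # UC: clear means the error was corrected
        "enabled": bit(60),          # EN: error reporting was enabled
        "misc_valid": bit(59),       # MISCV: MCi_MISC holds extra data
        "addr_valid": bit(58),       # ADDRV: MCi_ADDR holds an address
        "context_corrupt": bit(57),  # PCC: processor context corrupted
        # Bits 52:38, assumed to be the corrected-error count on this CPU.
        "corrected_count": (status >> 38) & 0x7FFF,
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_code": mca_code,
        "mca_meaning": MCACOD_SIMPLE.get(mca_code, "not in this small table"),
    }


if __name__ == "__main__":
    for name, value in decode_mci_status(0x90000040000F0005).items():
        print(f"{name}: {value}")
```

With the value from the event log, the valid and enabled bits are set, the uncorrected bit is clear, and the low 16 bits decode to the internal parity error code, which lines up with what the event summary says. The address-valid and misc-valid bits are clear, matching the zero MciAddr and MciMisc fields.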
 
Glorious
Gerbilus Supremus
Posts: 12343
Joined: Tue Aug 27, 2002 6:35 pm

Re: A corrected hardware error has occurred.

Wed Feb 19, 2014 6:45 am

shizuka wrote:
IFU (Instruction Fetch Unit - Instruction Cache):


Looks like further confirmation that JBI is right: something is likely wrong with one of its caches.

Bum CPU, I guess. Then again, if the errors are always correctible and it passes normal CPU stability testing... is there any reason to do anything? It's unsettling, but reason enough to get another CPU?

Then again, how many other haswell users occasionally get this message?
 
just brew it!
Administrator
Posts: 54500
Joined: Tue Aug 20, 2002 10:51 pm
Location: Somewhere, having a beer

Re: A corrected hardware error has occurred.

Wed Feb 19, 2014 7:36 am

Glorious wrote:
shizuka wrote:
IFU (Instruction Fetch Unit - Instruction Cache):

Looks like further confirmation that JBI is right: something is likely wrong with one of its caches.

Bum CPU, I guess. Then again, if the errors are always correctible and it passes normal CPU stability testing... is there any reason to do anything? It's unsettling, but reason enough to get another CPU?

Then again, how many other haswell users occasionally get this message?

@shizuka - How'd you determine that from the info Kougar posted?

@Glorious - In general I agree. It is happening only twice a month, and it is being corrected, so if it continues at this level it will probably never result in actual system misbehavior. My areas of concern would be:

1) Is it getting worse?

2) If this really is an I-cache error, does the CPU use simple parity or full ECC on the I-cache? I-cache errors can be corrected merely by re-reading the affected instruction from DRAM, so the fact that it is being caught and corrected doesn't necessarily imply full ECC on the cache. The reason this might matter is that full ECC can detect double bit errors, whereas simple parity can only detect single bit errors; so if the I-cache is going bad, simple parity might not always catch the error before it causes the system to malfunction.
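
The parity-versus-ECC point in item 2 can be illustrated with a toy example. This is only a sketch of a single even-parity bit over one 8-bit word; the word value and helper names are made up, and it says nothing about how Haswell actually protects its I-cache.

```python
# Sketch: a single parity bit catches one flipped bit but misses two.
def parity_bit(word: int) -> int:
    """Even-parity check bit: 1 if the word has an odd number of 1 bits."""
    return bin(word).count("1") & 1


def flip(word: int, *bits: int) -> int:
    """Return the word with the given bit positions inverted."""
    for b in bits:
        word ^= 1 << b
    return word


stored = 0b10110010          # arbitrary example data
check = parity_bit(stored)   # parity stored alongside the data

single = flip(stored, 3)     # one bit goes bad
double = flip(stored, 3, 5)  # two bits go bad

detected_single = parity_bit(single) != check
detected_double = parity_bit(double) != check

print("single-bit error detected:", detected_single)
print("double-bit error detected:", detected_double)
```

The single flip changes the parity and is caught; the double flip leaves the parity unchanged and slips through, which is the failure mode described above.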
Nostalgia isn't what it used to be.
 
The Egg
Minister of Gerbil Affairs
Posts: 2938
Joined: Sun Apr 06, 2008 4:46 pm

Re: A corrected hardware error has occurred.

Wed Feb 19, 2014 8:06 am

I would RMA the CPU. You paid for an error-free product.
 
Glorious
Gerbilus Supremus
Posts: 12343
Joined: Tue Aug 27, 2002 6:35 pm

Re: A corrected hardware error has occurred.

Wed Feb 19, 2014 9:18 am

JBI wrote:
2) If this really is an I-cache error, does the CPU use simple parity or full ECC on the I-cache? I-cache errors can be corrected merely by re-reading the affected instruction from DRAM, so the fact that it is being caught and corrected doesn't necessarily imply full ECC on the cache.


That's exactly what the K8 architecture does. I don't know about Haswell, but given that parity takes less space I'd be willing to bet they made a similar design choice because of the same reason you mentioned.
 
just brew it!
Administrator
Posts: 54500
Joined: Tue Aug 20, 2002 10:51 pm
Location: Somewhere, having a beer

Re: A corrected hardware error has occurred.

Wed Feb 19, 2014 11:15 am

The Egg wrote:
I would RMA the CPU. You paid for an error-free product.

It *is* an error-free product from the user's perspective. The error is getting caught and corrected before it affects operation of the system. You don't return ECC DIMMs just because you are getting occasional corrected DRAM errors; that's what ECC is supposed to do -- detect and correct errors (which are *expected* to happen at some low rate).

I am very curious to know what Intel considers to be a "normal" background rate for corrected errors on the I-cache.
Nostalgia isn't what it used to be.
 
The Egg
Minister of Gerbil Affairs
Posts: 2938
Joined: Sun Apr 06, 2008 4:46 pm

Re: A corrected hardware error has occurred.

Wed Feb 19, 2014 11:43 am

just brew it! wrote:
The Egg wrote:
I would RMA the CPU. You paid for an error-free product.

It *is* an error-free product from the user's perspective. The error is getting caught and corrected before it affects operation of the system. You don't return ECC DIMMs just because you are getting occasional corrected DRAM errors; that's what ECC is supposed to do -- detect and correct errors (which are *expected* to happen at some low rate).

I still feel that the error shouldn't be occurring in the first place, even if it's being corrected and isn't apparent to the user. If there's a defect in the CPU, the extent of it might not become apparent until later, or it may get progressively worse. If it were my new CPU, I'd RMA it.
 
shizuka
Gerbil
Posts: 19
Joined: Sun Sep 09, 2012 4:41 pm

Re: A corrected hardware error has occurred.

Wed Feb 19, 2014 11:47 am

just brew it! wrote:
@shizuka - How'd you determine that from the info Kougar posted?

The event log has everything needed to decode it:
status is the large, 64 bit register that describes the fault
mcibank is a value that determines which fault bucket it belongs to... it was 0, which implicates the instruction fetch / l1i cache

I ran the values through the Intel MCE decoder. Unfortunately, it's under nda.

recommendation is to rma the cpu. you should not be getting _any_ WHEA errors at spec.

please note that if you are getting it on different cores, then it could be a platform (ie. voltage delivery) issue. if it always happens on core 4 (apicid 4) then you have a bum chip.
 
notfred
Maximum Gerbil
Posts: 4610
Joined: Tue Aug 10, 2004 10:10 am
Location: Ottawa, Canada

Re: A corrected hardware error has occurred.

Wed Feb 19, 2014 12:48 pm

You could do it by hand from the references or use one of the programs listed at http://en.wikipedia.org/wiki/Machine-ch ... oding_MCEs
 
Glorious
Gerbilus Supremus
Posts: 12343
Joined: Tue Aug 27, 2002 6:35 pm

Re: A corrected hardware error has occurred.

Wed Feb 19, 2014 3:19 pm

shizuka wrote:
mcibank is a value that determines which fault bucket it belongs to... it was 0, which implicates the instruction fetch / l1i cache


How do you know that bank 0 means "instruction fetch / L1 i-cache"? I keep reading that the banks are architecturally dependent, but I can't find any listing of what those banks mean. I can see how mcistat corresponds to IA32_MCi_STATUS where i is the bank. So, we are looking for what IA32_MC0_STATUS means on Haswell.

But how do I find that?

I'm trying to do what notfred suggested:

notfred wrote:
You could do it by hand


So...

http://www.intel.com/content/dam/www/pu ... manual.pdf

Section 15, particularly 15.3.2.2 tells me how to determine the "Internal Parity Error" part, but they never seem to list what each i means.

Annoying, and I can find it for AMD. For instance, in Bulldozer architecture bank 1 is the instruction fetch unit, instruction cache.

http://amd-dev.wpengine.netdna-cdn.com/ ... _BKDG1.pdf

section 2.13.1.1

But how do I find this for Intel?

Argh!

In fact, the only thing I *CAN* find indicates that bank 0 corresponds to QPI: section 16.3.1 on the Intel document.
 
shizuka
Gerbil
Posts: 19
Joined: Sun Sep 09, 2012 4:41 pm

Re: A corrected hardware error has occurred.

Wed Feb 19, 2014 4:09 pm

no QPI on client chips. client platforms use DMI to connect the PCH and the CPU.

banks are cpu specific, so yeah. i don't know where the list mapping mcbank to functional unit is sadly. I'm relying on the internal MCE tool
 
Kougar
Minister of Gerbil Affairs
Topic Author
Posts: 2306
Joined: Tue Dec 02, 2008 2:12 am
Location: Texas

Re: A corrected hardware error has occurred.

Wed Feb 19, 2014 4:31 pm

shizuka wrote:
just brew it! wrote:
@shizuka - How'd you determine that from the info Kougar posted?

The event log has everything needed to decode it:
status is the large, 64 bit register that describes the fault
mcibank is a value that determines which fault bucket it belongs to... it was 0, which implicates the instruction fetch / l1i cache

I ran the values through the Intel MCE decoder. Unfortunately, it's under nda.

recommendation is to rma the cpu. you should not be getting _any_ WHEA errors at spec.


Thank you for crunching the info!! I sincerely do appreciate it :)

shizuka wrote:
please note that if you are getting it on different cores, then it could be a platform (ie. voltage delivery) issue. if it always happens on core 4 (apicid 4) then you have a bum chip.


This is the crux of my problem. The WHEA parity errors were indeed on different core IDs, so it only increases my suspicion about the motherboard. Gigabyte's US support page wasn't accessible for most of yesterday, but I was finally able to create a support ticket last night. Will see what they recommend.

just brew it! wrote:
It *is* an error-free product from the user's perspective. The error is getting caught and corrected before it affects operation of the system. You don't return ECC DIMMs just because you are getting occasional corrected DRAM errors; that's what ECC is supposed to do -- detect and correct errors (which are *expected* to happen at some low rate).


Except that DRAM naturally does occasionally have errors. It's natural and expected, and why ECC RAM exists. The same can't be said for a processor so I don't think it's a valid comparison! Given I plan to keep the same processor for 5+ years I want to be sure it isn't going to deteriorate or have the random BSoD.

Glorious wrote:
In fact, the only thing I *CAN* find indicates that bank 0 corresponds to QPI: section 16.3.1 on the Intel document.


Dunno, most of the debugging is completely over my head. But I did find this in the Intel MCE PDF: "Most MCE registers are core-specific, that is, each core has its own set of control, status, and address registers. However, in newer processor families such as Nehalem, new banks of registers have been added to the architecture to address package-level error information. For example, in Nehalem processor families, bank 0, 1, 6, 7 are per-package and introduced to address QPI, integrated memory and graphics." Not entirely sure that's applicable to your comment but figured I'd toss it out there in case it was. :P

For anyone who wants to play with it, here is the other WHEA error. Same info as the first one, but with a Processor ID of 3.

- System 

  - Provider

   [ Name]  Microsoft-Windows-WHEA-Logger
   [ Guid]  {C26C4F3C-3F66-4E99-8F8A-39405CFED220}
 
   EventID 19
 
   Version 0
 
   Level 3
 
   Task 0
 
   Opcode 0
 
   Keywords 0x8000000000000000
 
  - TimeCreated

   [ SystemTime]  2014-02-02T06:54:09.206630500Z
 
   EventRecordID 314144
 
  - Correlation

   [ ActivityID]  {4C75CD32-FD52-43A8-A065-BCAF2D72DC39}
 
  - Execution

   [ ProcessID]  1732
   [ ThreadID]  3140
 
   Channel System
 
   Computer Kougar-PC
 
  - Security

   [ UserID]  S-1-5-19
 

- EventData

  ErrorSource 1
  ApicId 3
  MCABank 0
  MciStat 0x90000040000f0005
  MciAddr 0x0
  MciMisc 0x0
  ErrorType 12
  TransactionType 256
  Participation 256
  RequestType 256
  MemorIO 256
  MemHierarchyLvl 256
  Timeout 256
  OperationType 256
  Channel 256
  Length 864
  RawData 435045521002FFFFFFFF03000200000002000000600300000836060002020E140000000000000000000000000000000000000000000000000000000000000000BDC407CF89B7184EB3C41F732CB57131B18BCE2DD7BD0E45B9AD9CF4EBD4F89008C0F2ED7F18CF0100000000000000000000000000000000000000000000000058010000C00000000102000001000000ADCC7698B447DB4BB65E16F193C4F3DB0000000000000000000000000000000002000000000000000000000000000000000000000000000018020000400000000102000000000000B0A03EDC44A19747B95B53FA242B6E1D0000000000000000000000000000000002000000000000000000000000000000000000000000000058020000080100000102000000000000011D1E8AF94257459C33565E5CC3F7E80000000000000000000000000000000002000000000000000000000000000000000000000000000057010000000000000002080000000000C30603000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000300000000000000000000000000000000000000000000000000000000000000000000000000000003000000000000000300000000000000C306030000081003FFFBFA7FFFFBEBBF00000000000000000000000000000000000000000000000000000000000000000100000001000000C8621492E31FCF0103000000000000000000000000000000000000000000000005000F0040000090000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
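
A note on the RawData blob: it opens with the bytes 43 50 45 52, which is ASCII "CPER", the signature of a UEFI Common Platform Error Record, and the MciStat value shows up inside it in little-endian byte order. The sketch below checks for both. The helper name is made up, and the sample string is a shortened stand-in built for illustration, not the full record from the log; the offset in a real record depends on its section layout.

```python
# Sketch: confirm the CPER signature and locate MciStat inside RawData.
import struct


def find_status_in_raw(raw_hex: str, status: int) -> int:
    """Return the byte offset of the 64-bit status value, or -1 if absent."""
    blob = bytes.fromhex(raw_hex)
    if blob[:4] != b"CPER":
        raise ValueError("RawData does not start with the CPER signature")
    needle = struct.pack("<Q", status)  # 64-bit little-endian encoding
    return blob.find(needle)


# Shortened stand-in: real header bytes, zero padding, then the status bytes.
sample = "435045521002FFFFFFFF0300" + "00" * 4 + "05000F0040000090"

if __name__ == "__main__":
    print(find_status_in_raw(sample, 0x90000040000F0005))
```

Pasting the full RawData string from either event in place of the sample should find the same eight bytes (05 00 0F 00 40 00 00 90) near the end of the record.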
 
shizuka
Gerbil
Posts: 19
Joined: Sun Sep 09, 2012 4:41 pm

Re: A corrected hardware error has occurred.

Wed Feb 19, 2014 4:52 pm

I'm not sure if APICID counts up from 0 or 1, but maybe that is the same core, different thread:
1,2
3,4 <-- fault
5,6
7,8
 
Kougar
Minister of Gerbil Affairs
Topic Author
Posts: 2306
Joined: Tue Dec 02, 2008 2:12 am
Location: Texas

Re: A corrected hardware error has occurred.

Wed Feb 19, 2014 7:24 pm

I was able to chat with an Intel representative about the WHEA errors. I asked if it could be normal behavior as Brew suggested but he didn't directly answer that. He instead indicated that it is indeed possible they can be incorrectly caused by the OS if the processor/board was changed without a clean OS install and recommended I reinstall the OS. I should do that anyway given the abuse I've done to this OS install, so I'll do that and see if the errors return.

shizuka wrote:
I'm not sure if APICID counts up from 0 or 1, but maybe that is the same core, different thread:
1,2
3,4 <-- fault
5,6
7,8


That's a good point. To answer it, I went through the entire history of WHEA errors for both processors (it's the same OS install), and some of the 4770K parity errors were logged with Processor ID 0, so the numbering starts at 0. Given that, IDs 3 and 4 would actually be two different cores on the 4771... at this point I just need time. I'll reinstall the OS and wait. If the errors occur again, I'll install the 4770K and see if they continue. If they show up on the 4770K, then I'll know to RMA the board. If they stop, then I'll know to RMA the 4771. And if they stop after the OS reinstall, then I'll know it was a software issue all along. :P
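
As a rough sketch of the core/thread reasoning: if APIC IDs count up from 0 and the two Hyper-Threading siblings of each core get adjacent IDs, then the core index is simply the ID divided by two. Both of those are assumptions about this particular 4-core/8-thread chip; the authoritative mapping comes from the CPU's own topology enumeration, not from this arithmetic.

```python
# Sketch: map a zero-based APIC ID to (core, thread), ASSUMING adjacent
# IDs are Hyper-Threading siblings of the same physical core.
def apic_to_core_thread(apic_id: int, threads_per_core: int = 2) -> tuple:
    """Return (core index, thread index within that core)."""
    return divmod(apic_id, threads_per_core)


if __name__ == "__main__":
    for apic_id in (3, 4):
        core, thread = apic_to_core_thread(apic_id)
        print(f"APIC ID {apic_id}: core {core}, thread {thread}")
```

Under those assumptions, IDs 3 and 4 land on cores 1 and 2 respectively, so the two logged errors would come from different physical cores.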

Side bit of info. The Intel rep pointed me to this Intel Proc Diagnostic Tool that does various tests including the option for a lengthy 2-hour burn-in test. It passed the default tests but I'll let it run the long burn-in test later just for fun.

I'd like to thank everyone for all the replies, you guys are great! :D
 
Kougar
Minister of Gerbil Affairs
Topic Author
Posts: 2306
Joined: Tue Dec 02, 2008 2:12 am
Location: Texas

Re: A corrected hardware error has occurred.

Sun Mar 23, 2014 8:36 pm

Given I'm in the middle of a move I've not gotten around to reinstalling the OS. The WHEA Errors have not only continued but increased in frequency. The kicker being this evening I received the first actual BSoD.

A fatal hardware error has occurred.

Reported by component: Processor Core
Error Source: Machine Check Exception
Error Type: Internal Timer Error
Processor ID: 4

The details view of this entry contains further information.


A timer error makes me think of overclocking all over again, but again this is a stock 4771 chip.

Given the BSoD, is there any remote chance at all that it could still be an OS issue? Given it's unlikely two different processors would be bad, I am left to conclude this is a motherboard issue?? :o Any input or thoughts before I commit to a week or two without a desktop during the RMA process?
