 
Topinio
Gerbil Jedi
Posts: 1839
Joined: Mon Jan 12, 2015 9:28 am
Location: London

Re: A corrected hardware error has occurred.

Thu Jun 23, 2016 4:24 am

Kougar wrote:
WHEA Corrected Hardware Errors on three Haswell processors [...] using 32GB (4 x 8GB) RAM configurations between 2133-2400Mhz, and spent months/years in 24/7 100% load use conditions.

[...] At this point the only method I've found for making them go away is to reduce RAM clocks or increase latency timings. I attempted on two chips to raise both the uncore / RAM voltages first without changing anything and that had no effect. [...]

The implication would seem to be that Haswell has a long-term degradation issue with memory controllers and 4x8GB RAM configurations, and a year of hot temps and 24/7 loads brings it out?

I don't think that's a fair inference. A better one would be that degradation should be unsurprising when chips are abused, particularly if the abuse is long term and sustained.

An implication of having a specification is that exceeding the specification might lead to degradation.

Who said these chips could drive 4 DIMMs at 1.5x the specified maximum frequency? Why be surprised it's broken when it's been thrashed 24x7 way out of spec?

Intel wrote:
Warning: Altering PC clock or memory frequency and/or voltage may (i) reduce system stability and use life of the system, memory and processor; (ii) cause the processor and other system components to fail; (iii) cause reductions in system performance; (iv) cause additional heat or other damage; and (v) affect system data integrity. Intel assumes no responsibility that the memory, included if used with altered clock frequencies and/or voltages, will be fit for any particular purpose.
Desktop: 750W Snow Silent, X11SAT-F, E3-1270 v5, 32GB ECC, RX 5700 XT, 500GB P1 + 250GB BX100 + 250GB BX100 + 4TB 7E8, XL2730Z + L22e-20
HTPC: X-650, DH67GD, i5-2500K, 4GB, GT 1030, 250GB MX500 + 1.5TB ST1500DL003, KD-43XH9196 + KA220HQ
Laptop: MBP15,2
 
Forge
Lord High Gerbil
Posts: 8253
Joined: Wed Dec 26, 2001 7:00 pm
Location: Gone

Re: A corrected hardware error has occurred.

Thu Jun 23, 2016 9:11 am

Just chiming in with an additional anecdote: I've got an i7-4790K that's been paired with 4*8GB of Kingston ValueRAM for nearly two years now, and it's been rock solid and remains so. On the other hand, my stuff is 1.35V DDR3-1600, so I'm sure there's a lot less stress.
Please don't edit my signature for me. Thanks.
 
jackbomb
Gerbil XP
Posts: 363
Joined: Tue Aug 12, 2008 10:25 pm

Re: A corrected hardware error has occurred.

Thu Jun 23, 2016 8:11 pm

Ruh roh. I've been running 32GB of DDR3-2400 (undervolted slightly to 1.55v) for 2 years on an Ivy-E CPU.

There haven't been any problems, but I didn't realize high clocked DIMMs were so hard on the CPU.
Like a good neighbor jackbomb is there.
 
Krogoth
Emperor Gerbilius I
Posts: 6049
Joined: Tue Apr 15, 2003 3:20 pm
Location: somewhere on Core Prime
Contact:

Re: A corrected hardware error has occurred.

Thu Jun 23, 2016 11:13 pm

Running memory out of spec is always hard on the motherboard and memory controller. They are usually the first things that start to fail on a system overclock.
Gigabyte X670 AORUS-ELITE AX, Raphael 7950X, 2x16GiB of G.Skill TRIDENT DDR5-5600, Sapphire RX 6900XT, Seasonic GX-850 and Fractal Define 7 (W)
Ivy Bridge 3570K, 2x4GiB of G.Skill RIPSAW DDR3-1600, Gigabyte Z77X-UD3H, Corsair CX-750M V2, and PC-7B
 
Ninjitsu
Gerbil Team Leader
Posts: 219
Joined: Thu Feb 20, 2014 3:46 am

Re: A corrected hardware error has occurred.

Fri Jun 24, 2016 2:45 am

This has been an extremely interesting thread to read. I'm glad you isolated the problem, Kougar!

jackbomb wrote:
Ruh roh. I've been running 32GB of DDR3-2400 (undervolted slightly to 1.55v) for 2 years on an Ivy-E CPU.

There haven't been any problems, but I didn't realize high clocked DIMMs were so hard on the CPU.

IIRC SNB/IVB memory controllers were happy to run with 1.5v and 1.65v, so probably less of a problem in your case.
 
Topinio
Gerbil Jedi
Posts: 1839
Joined: Mon Jan 12, 2015 9:28 am
Location: London

Re: A corrected hardware error has occurred.

Fri Jun 24, 2016 2:51 am

Krogoth wrote:
Running memory out of spec is always hard on the motherboard and memory controller. They are usually the first things that start to fail on a system overclock.

Yes, as it's not the memory that you're running at too high a load but the memory controller (now in the CPU).

@jackbomb: 2 years is usually okay unless load is constantly high, but the chances of problems only increase as time goes on, because integrated circuits simply degrade with use. That degradation is not going to cause problems for the typical Intel CPU within its typical lifetime if it's run within spec, but the more voltage and/or heat you put in, the faster things break down...

I want to assume you meant overvolted slightly to 1.55 V, but in case you actually meant undervolted, are you aware that the DDR3 voltage spec is 1.5 V and that all the >1.5 V DIMMs are out of spec for the memory controllers?

Edit:
Ninjitsu wrote:
IIRC SNB/IVB memory controllers were happy to run with 1.5v and 1.65v, so probably less of a problem in your case.

Were they? The spec has been 1.5 V nominal, and 1.5 ± 0.075 V in operation, since the introduction of DDR3.
 
Kougar
Minister of Gerbil Affairs
Topic Author
Posts: 2306
Joined: Tue Dec 02, 2008 2:12 am
Location: Texas

Re: A corrected hardware error has occurred.

Fri Jun 24, 2016 11:26 am

Okay, I need to mention this again as I forgot to do so:

Kougar wrote:
Using the most recent non-beta BIOS. Reset everything to optimized settings after I installed the 4771, and the RAM is also at stock 1600Mhz settings.


These WHEA errors have occurred with the following RAM configurations:

Kingston 2400 MHz 1.65 V 32GB
Kingston 2133 MHz 1.60 V 32GB
Crucial 1600 MHz 1.35 V 32GB

I cannot honestly remember if I still got WHEA errors when running the Crucial kit at 1.50v so I will not add it to the list. It has two XMP profiles and the only difference is the voltage applied. The Crucial system had a fresh install of Win 10 a few months ago so there's nothing in its event logs since.

But my point is I HAVE seen these errors on a 1600Mhz low voltage Crucial kit when I first started this thread. After reading everyone's posts I guess I had the wrong impression about 2133Mhz being a reliable speed, but be that as it may there is literally zero excuse for Intel chips to fail at their rated 1600Mhz unless Haswell isn't rated for 1.35v either? These 32GB kits were specifically rated to be 32GB kits, I avoid mix & matching or doubling up on identical kits just to avoid having to troubleshoot RAM issues (for all the good that's done on Haswell :lol: ).
 
Topinio
Gerbil Jedi
Posts: 1839
Joined: Mon Jan 12, 2015 9:28 am
Location: London

Re: A corrected hardware error has occurred.

Fri Jun 24, 2016 12:15 pm

Kougar wrote:
But my point is I HAVE seen these errors on a 1600Mhz low voltage Crucial kit when I first started this thread. After reading everyone's posts I guess I had the wrong impression about 2133Mhz being a reliable speed, but be that as it may there is literally zero excuse for Intel chips to fail at their rated 1600Mhz unless Haswell isn't rated for 1.35v either?

If that CPU's memory controller has driven DIMMs overvolted (>1.575 V) and/or over-spec (>800 MHz / 1600 MT/s) for sustained periods over a long time, particularly at high temperatures, its integrated circuits will more likely than not have degraded much more than if it had been run as designed.

It could well just be broken now, fast living takes its toll and often reduces life expectancy...
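
As a rough illustration of the two limits mentioned above, here is a minimal Python sketch that checks the three kits Kougar listed against them. The thresholds are the figures quoted in this thread (1.575 V and 1600 MT/s), not values read from an Intel datasheet, and the names `Kit` and `out_of_spec_reasons` are hypothetical, made up for this example.

```python
from dataclasses import dataclass

# Upper limits as quoted in this thread (assumption: not checked against a datasheet).
MAX_DATA_RATE_MT_S = 1600   # "over-spec (>800 MHz / 1600 MT/s)"
MAX_DIMM_VOLTAGE_V = 1.575  # "overvolted (>1.575 V)", i.e. 1.5 V + 0.075 V


@dataclass(frozen=True)
class Kit:
    """Hypothetical record for one memory kit."""
    name: str
    data_rate_mt_s: int
    voltage_v: float


def out_of_spec_reasons(kit: Kit) -> list[str]:
    """Return the ways a kit exceeds the quoted upper limits (empty if none)."""
    reasons = []
    if kit.data_rate_mt_s > MAX_DATA_RATE_MT_S:
        ratio = kit.data_rate_mt_s / MAX_DATA_RATE_MT_S
        reasons.append(f"data rate {ratio:.2f}x the limit")
    if kit.voltage_v > MAX_DIMM_VOLTAGE_V:
        reasons.append(f"voltage {kit.voltage_v:.2f} V above {MAX_DIMM_VOLTAGE_V} V")
    return reasons


# The three configurations listed earlier in the thread.
KITS = [
    Kit("Kingston 2400", 2400, 1.65),
    Kit("Kingston 2133", 2133, 1.60),
    Kit("Crucial 1600", 1600, 1.35),
]

if __name__ == "__main__":
    for kit in KITS:
        reasons = out_of_spec_reasons(kit)
        print(kit.name, "->", "; ".join(reasons) if reasons else "within the quoted limits")
```

This only tests the upper bounds discussed here; it says nothing about whether a 1.35 V kit is supported by a given memory controller, which is the open question in this thread.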
 
Kougar
Minister of Gerbil Affairs
Topic Author
Posts: 2306
Joined: Tue Dec 02, 2008 2:12 am
Location: Texas

Re: A corrected hardware error has occurred.

Thu Jun 30, 2016 5:19 am

The 4771 processor was paired with 1600Mhz RAM. Nothing was ever overvolted or overclocked, the 4771 is a locked chip after all. So fobbing WHEA errors off due to overclocking doesn't make sense.

To recap: This all started with a 4770K, and I assumed my overclocking had caused the issue even though it was paired with 1600Mhz RAM. So I replaced it with the 4771, same RAM. It didn't even take a year to see the WHEA errors begin. New kit of RAM and a 4790K later, the errors are back. But the 4771 proves it wasn't due to overclocking... :roll:

In fact, it gets even better. The 4771 was the ONLY chip to actually throw a BSoD for this very problem, the result of an uncorrectable version of the WHEA parity error. So overclocking has absolutely nothing to do with it, and this isn't some event log error I can just ignore because it leads to very real BSoDs.

I don't think it is too much to ask for a stable, working PC at stock, regardless of whether it spends its idle moments running F@H or not. It makes me wonder if the motherboards are degrading the chips by using unhealthy levels of uncore voltage, since ASUS in its brilliant auto-OC mindset never designed a "stock" setting for its voltage options. I can only leave it on "AUTO" and hope it's using default VIDs.
 
Topinio
Gerbil Jedi
Posts: 1839
Joined: Mon Jan 12, 2015 9:28 am
Location: London

Re: A corrected hardware error has occurred.

Thu Jun 30, 2016 5:38 am

Kougar wrote:
The 4771 processor was paired with 1600Mhz RAM. Nothing was ever overvolted or overclocked, the 4771 is a locked chip after all. So fobbing WHEA errors off due to overclocking doesn't make sense.

To recap: This all started with a 4770K, and I assumed my overclocking had caused the issue even though it was paired with 1600Mhz RAM. So I replaced it with the 4771, same RAM. It didn't even take a year to see the WHEA errors begin. New kit of RAM and a 4790K later, the errors are back. But the 4771 proves it wasn't due to overclocking... :roll:

In fact, it gets even better. The 4771 was the ONLY chip to actually throw a BSoD for this very problem, the result of an uncorrectable version of the WHEA parity error. So overclocking has absolutely nothing to do with it, and this isn't some event log error I can just ignore because it leads to very real BSoDs.

I don't think it is too much to ask for a stable, working PC at stock, regardless of whether it spends its idle moments running F@H or not. It makes me wonder if the motherboards are degrading the chips by using unhealthy levels of uncore voltage, since ASUS in its brilliant auto-OC mindset never designed a "stock" setting for its voltage options. I can only leave it on "AUTO" and hope it's using default VIDs.

These statements seem to contradict the earlier ones:

Kougar wrote:
As of this month I have seen WHEA Corrected Hardware Errors on three Haswell processors across three different motherboards and three operating systems. Only point of commonality is that all were Haswell chips using 32GB (4 x 8GB) RAM configurations between 2133-2400Mhz, and spent months/years in 24/7 100% load use conditions.

That pretty much rules out a lot of things right there. 4770K, 4771, and 4790K, GB + ASUS boards, Win 7, Windows Server 2012 R2, and Win 10. At this point the only method I've found for making them go away is to reduce RAM clocks or increase latency timings. I attempted on two chips to raise both the uncore / RAM voltages first without changing anything and that had no effect. The 4790K had been running Windows 10 since last year and only just now began exhibiting WHEA errors... When first built it ran the 2400Mhz XMP profile tested stable across Memtest and Prime95, fast forward some length of time and after some usual program crashes I stress-tested the system again and found it was no longer stable. I reduced it to the 2133 XMP profile, verified it was stable again and thought nothing more of it until I began seeing WHEA errors this past week.
 
Kougar
Minister of Gerbil Affairs
Topic Author
Posts: 2306
Joined: Tue Dec 02, 2008 2:12 am
Location: Texas

Re: A corrected hardware error has occurred.

Thu Jun 30, 2016 9:15 pm

Do I have to quote myself from 2014 again? Okay.

Kougar wrote:
Using the most recent non-beta BIOS. Reset everything to optimized settings after I installed the 4771, and the RAM is also at stock 1600Mhz settings. The PSU is a holdover Corsair AX1200 from when I was running a 980X and pulling >800w from the wall. :roll:


I have better things to do than to make this **** up. The 4790K threw three more WHEA errors at 1866Mhz three days ago, and I'm getting really frustrated with this entire mess. The 1600Mhz kit was the original memory I built my first Haswell rig with the day Haswell launched. http://www.newegg.com/Product/Product.a ... 6820148664
 
Ryu Connor
Global Moderator
Posts: 4369
Joined: Thu Dec 27, 2001 7:00 pm
Location: Marietta, GA
Contact:

Re: A corrected hardware error has occurred.

Fri Jul 01, 2016 3:13 am

1866 is still an overclock, and presumably that unit is also using a fully loaded memory controller (4x8GB)?

You're going to need to push memory voltage up and/or relax memory timings until you find a stable point.
All of my written content here on TR does not represent or reflect the views of my employer or any reasonable human being. All content and actions are my own.
