I tried one stick of memory at a time, one slot on the mobo at a time, and some completely different RAM from my brother's system. No change.
I reapplied TIM - temps dropped 10*C under load (49*C on Prime95 now)! It still hangs, though.
I got the latest firmware for the mobo (should have done that earlier), and in the process let it use all its defaults (I did sanity check them). It still hangs, but it's now giving the CPU even less voltage: it varies between 1.13 and 1.21V. Amazing, I think?
I tried giving the CPU NB (memory controller and L3 cache) a voltage bump to 1.3V (from 1.175). No change.
I tried disabling 4 of 6 cores (it didn't give me a choice on which two remained). No change.
Looking around for any other even slightly relevant settings in the firmware, I tried HPC mode. That made Prime95 throw some errors shortly before the hang, which actually had never happened before.
I removed all unnecessary parts (meaning an optical drive and a PCI-e wifi card). No change.
I swapped to my old PSU, a 500W rated (actually 450W, and only 360W on 12V) Cooler Master. No change.
Looking to test alternates for every component I could (I don't have access to any other compatible CPUs/mobos), I swapped in my brother's Sapphire HD 7770. With the Mint install, this gave no video output at the point where X should have started, so I went ahead and swapped the Minty drive for my laptop's SSD with Arch. I couldn't easily get a GUI up there either, but who needs that?
I forgot that this Arch install drops kernel messages into virtual consoles, and it dropped some very interesting ones when I fired up Prime95:
[ 601.496334] [Hardware Error]: Corrected error, no action required.
[ 601.497239] [Hardware Error]: CPU:0 (15:2:0) MC4_STATUS[-|CE|MiscV|-|AddrV|-|Poison|CECC]: 0x9d4948a5001d011b
[ 601.498130] [Hardware Error]: MC4_ADDR: 0x000000009aa29fc0
[ 601.499144] [Hardware Error]: MC4 Error (node 0): L3 cache tag error.
[ 601.500245] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
[ 301.077742] [Hardware Error]: Corrected error, no action required.
[ 301.078847] [Hardware Error]: CPU:0 (15:2:0) MC4_STATUS[Over|CE|MiscV|-|AddrV|-|Poison|CECC]: 0xdd4948a5001d011b
[ 301.079989] [Hardware Error]: MC4_ADDR: 0x00000001d8c09fc0
[ 301.081113] [Hardware Error]: MC4 Error (node 0): L3 cache tag error.
[ 301.082241] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
[ 451.496334] [Hardware Error]: Corrected error, no action required.
[ 451.497239] [Hardware Error]: CPU:0 (15:2:0) MC4_STATUS[Over|CE|MiscV|-|AddrV|-|Poison|CECC]: 0xdd4948a2001d011b
[ 451.498130] [Hardware Error]: MC4_ADDR: 0x00000000d4569fc0
[ 451.499144] [Hardware Error]: MC4 Error (node 0): L3 cache tag error.
[ 451.500245] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
I did two runs - the first error was from the first run, the second two were both from the second. It hung very shortly after the error on the first run - I glanced away for a second, and when I looked back, both had happened. I wasn't around at 301 on the second run, but I saw the last one as it appeared at 451, and it kept running for a couple seconds (2 or 3?) before the hang. For the uninitiated, the first item on each line is a timestamp of the number of seconds since boot.
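For anyone who wants to pull these out of a longer log rather than reading them off a console, here is a minimal Python sketch. It assumes the lines look exactly like the ones pasted above (bracketed timestamp, then "[Hardware Error]:"). The names parse_hardware_errors, LINE_RE and HEX_RE are made up for illustration, and the sample text is just the first report from this post.

```python
import re

# First report from the log above, used as sample input.
SAMPLE = """\
[ 601.496334] [Hardware Error]: Corrected error, no action required.
[ 601.497239] [Hardware Error]: CPU:0 (15:2:0) MC4_STATUS[-|CE|MiscV|-|AddrV|-|Poison|CECC]: 0x9d4948a5001d011b
[ 601.498130] [Hardware Error]: MC4_ADDR: 0x000000009aa29fc0
"""

# "[ <seconds since boot>] [Hardware Error]: <message>"
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+\[Hardware Error\]:\s+(.*)$")
# A register name (MCn_STATUS or MCn_ADDR), optional decoded flags, then a hex value.
HEX_RE = re.compile(r"(MC\d+_(?:STATUS|ADDR))[^:]*:\s+(0x[0-9a-fA-F]+)")


def parse_hardware_errors(text):
    """Return (timestamp, register_name, value) for each line carrying a hex register dump."""
    events = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue  # not a hardware error line
        timestamp, message = float(m.group(1)), m.group(2)
        reg = HEX_RE.search(message)
        if reg:
            events.append((timestamp, reg.group(1), int(reg.group(2), 16)))
    return events


events = parse_hardware_errors(SAMPLE)
for timestamp, name, value in events:
    print(f"{timestamp:10.3f}  {name:<11} {value:#018x}")
```

Feeding it a saved copy of the kernel log instead of SAMPLE should work the same way, as long as the line format matches.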
Cache tags are very much an internal-to-the-CPU thing, so this seems fairly damning of some L3 cache circuitry or other. This has potential to explain quite a lot: for the sake of argument, let's say a piece of the L3 cache in the last MB or two is bad (it has 8 MB), and causes problems when used. If the CPU isn't always keeping the cache filled with something or other (it arguably should be, but we know the 'dozers have some issues with that), then I'm unlikely to ever hit that last MB in my normal usage, as all my intensive tasks are small-dataset, branching all over the place kinds of things - nothing linear with big datasets, like video encoding. Hence, the stability day-to-day. Prime95 blend and Memtest will sure fill up a cache, though. I could probably test this theory further by restricting Prime95 to small FFT sizes.
(As I've said in another comment: "This is all theoretical, I'm just some guy on the internet, don't believe everything you read, yadda yadda yadda." I should put it in a sig or something.)
It's certainly curious that the timestamps are all aligned to 2.5 minute intervals (since boot, at that). It's plausible based on my previous experiences that the Prime95 hangs all are - again, I should do more testing. I know the Memtest hangs weren't so aligned.
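To put a number on that alignment, here is a quick Python check of how far each error timestamp sits from the nearest multiple of 150 seconds. The timestamps are the first line of each report above; the 150 s interval is my guess from eyeballing them, not something the kernel reported.

```python
# Timestamps (seconds since boot) of the first line of each error report.
timestamps = [601.496334, 301.077742, 451.496334]
INTERVAL = 150.0  # 2.5 minutes; assumed from eyeballing the log


def offset_from_interval(t, interval=INTERVAL):
    """Signed distance in seconds from t to the nearest multiple of interval."""
    nearest = round(t / interval) * interval
    return t - nearest


offsets = [offset_from_interval(t) for t in timestamps]
for t, off in zip(timestamps, offsets):
    print(f"{t:10.3f} s is {off:+.3f} s from a multiple of {INTERVAL:.0f} s")
```

All three land between one and two seconds after a 150 s boundary. Three samples is not much to go on, so this only says the pattern is worth testing further.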
It would also be very interesting to know for sure what those 64-bit hex values are, but my google-fu hasn't been strong enough to find much description on these messages. I'll hazard a guess that the first is a bunch of status flags (reflected by the adjacent text? what's Over mean?) at error time, and the second is a main RAM address (0x1d8c09fc0 just barely makes it under the 8 GB mark). Odd how the last 17 bits (128K alignment) are the same for all of them.
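Here is a small Python sketch that pokes at those hex values directly: which bits differ between the status words, how many low-order address bits all three share, and whether the addresses fit under 8 GiB. The helper names are made up for illustration. Reading bit 62 as the "Over" (overflow) flag is my assumption, based on it being the only bit that changes between the report without "Over" and the next one with it.

```python
# Values copied from the three error reports above, in the order they are listed.
statuses = [0x9d4948a5001d011b, 0xdd4948a5001d011b, 0xdd4948a2001d011b]
addrs = [0x000000009aa29fc0, 0x00000001d8c09fc0, 0x00000000d4569fc0]


def differing_bits(a, b):
    """Bit positions (0 = least significant) where two 64-bit values differ."""
    x = a ^ b
    return [i for i in range(64) if (x >> i) & 1]


def common_low_bits(values):
    """Number of low-order bits that are identical across all values."""
    n = 0
    while n < 64 and len({(v >> n) & 1 for v in values}) == 1:
        n += 1
    return n


# Status word of the report without "Over" vs. one with it.
over_diff = differing_bits(statuses[0], statuses[1])
# The two reports from the second run.
run2_diff = differing_bits(statuses[1], statuses[2])

shared = common_low_bits(addrs)
low_value = addrs[0] & ((1 << shared) - 1)
below_8gib = all(a < 8 * 2**30 for a in addrs)

print("bits differing, no-Over vs Over:", over_diff)
print("bits differing within run 2:", run2_diff)
print(f"addresses share their low {shared} bits: {low_value:#x}")
print("all addresses below 8 GiB:", below_8gib)
```

The first comparison differs only in bit 62, and the two second-run status words differ only in bits 32 to 34. I don't know what that small field encodes.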
Also, there's the fact that this happens in very close temporal proximity to the hang, but not at the moment of.
Where's a cross-eyed emoticon when you need it? X-(
Anyway, this looks likely to end in a CPU warranty claim (or could it still be mobo?). I haven't done anything to it that hasn't been described in this thread, so they should still take it. I'd really like to have a solid case that it's one component or another before trying to get it replaced, so there's still work to do. I'm trying to look through kernel logs more on the Mint install, but I find I'm more than a little rusty at logging, and the internet resources I'm finding are thoroughly unhelpful. I may have to try to find the book from 2005 I learned it from in the first place.
If anyone has any ideas on how to narrow it down to either the CPU or the mobo, that would be great, but at this point I'm just as much trying to figure out what the problem is for curiosity's sake as I am trying to get an OC-able system.