TR Forums

deepblueq · Wed Jul 02, 2014 2:20 pm

Hi all, I'm new to overclocking and having a bit of trouble.

The computer:

Rosewill 550W PSU
ASUS M5A97 R2.0 mobo
AMD FX-6300
8 GB of G.Skill DDR3-2133 9-11-10-28
MSI GTX 650 Ti Boost
92mm tower CPU cooler (Zalman)
Linux Mint 16 Xfce

Obviously there's nothing extreme about this setup, but I would find a bit more single-threaded performance handy, and the FX-6300 should have some headroom for it. I did some reading, tried some stuff, and ran into two problems fairly quickly.

First, Prime95 crashes even at all stock settings. The only changes made were when the computer was built, and that was just setting up the RAM for full performance (I've since double-checked everything is within spec). I do some gaming with this machine, and it's always been stable for that. Running Prime95, the computer hangs after ~2 mins, with the CPU temp >50*C (it looks to stabilize in the upper 50s). Looking around the internet, I've seen some references to AMD and Prime95 not getting along well, but that strikes me as implausible at best.

Second, I can't seem to get a Vcore reading with the OS booted. lm-sensors seems to be the standard tool for this on Linux, but I've fully set it up and am getting some rather garbage readings (these are all the voltages returned):

in0: +2.82V
in1: +2.87V
in2: +0.88V
+3.3V: +3.29V
in4: +2.47V
in5: +2.52V
in6: +1.54V
3VSB: +1.82V
Vbat: +3.36V

Everything returned from that interface (it8721-isa-0290) looks questionable (there are also temperatures that don't change and a fan speed that's high). Other interfaces (k10temp-pci-00c3 and fam15h_power-pci-00c4) provide CPU temp and CPU power draw reliably. The temp reads 10*C low, but I can work with that.
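
One rough way to sanity-check output like the above is to compare only the labelled rails against their nominal values. This is a minimal Python sketch: the 3.3 V nominal for all three labelled rails and the 10% tolerance are assumptions, not values from the board's documentation, and the unlabelled inN inputs are left alone because they usually need board-specific scaling before they mean anything.

```python
import re

# Readings copied from the post above.
SENSORS_OUTPUT = """\
in0: +2.82V
in1: +2.87V
in2: +0.88V
+3.3V: +3.29V
in4: +2.47V
in5: +2.52V
in6: +1.54V
3VSB: +1.82V
Vbat: +3.36V
"""

# Assumption: these labelled rails should all sit near 3.3 V.
NOMINAL = {"+3.3V": 3.3, "3VSB": 3.3, "Vbat": 3.3}
TOLERANCE = 0.10  # assumption: arbitrary 10% window


def parse_voltages(text):
    """Return {label: volts} for every 'label: +x.xxV' line."""
    readings = {}
    for line in text.splitlines():
        match = re.match(r"^(\S+):\s*\+?(-?\d+\.\d+)V$", line.strip())
        if match:
            readings[match.group(1)] = float(match.group(2))
    return readings


def out_of_range(readings):
    """List labelled rails that deviate from nominal by more than TOLERANCE."""
    bad = []
    for label, nominal in NOMINAL.items():
        if label in readings:
            if abs(readings[label] - nominal) / nominal > TOLERANCE:
                bad.append(label)
    return bad


readings = parse_voltages(SENSORS_OUTPUT)
print(out_of_range(readings))  # ['3VSB']
```

Only 3VSB is flagged here; that says nothing either way about the unlabelled inputs, which cannot be judged until their scaling is known.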

For the Prime95 hang, everything I can think of seems implausible. The one that seems to me least implausible is a power issue, but that's tough to diagnose without post-boot voltage readings, which are non-functional (all voltages look good from the firmware, but that's unloaded).

Thanks in advance for any ideas!

DPete27 · Thu Jul 03, 2014 2:47 pm

What do you get when you run Memtest overnight or up to 24 hrs?

deepblueq · Thu Jul 03, 2014 11:19 pm

That's a good idea, I just started that up and will look at it in the morning.

Along that line of thinking, before starting memtest I tried Prime95 again with the RAM at 1600 and 10-12-12-31, just to give it lots of room. It ran for about 10 minutes that time, far longer than before. CPU temps stabilized quickly on 61*C (I messed with fan settings at some point, I can do a bit better than that), and the mode of failure was still a hang.

Captain Ned · Thu Jul 03, 2014 11:34 pm

Ugh.

Do not change multiple settings in any given run. In the OC world it's all about single-variable equations. Does this increase the number of test runs? That's the bloody point.

EDIT: Take the minimum granularity for the important bits and record a journal of your results as you massage 3 variables one at a time. If this sounds painful, that's the way they designed it.

vargis14 · Fri Jul 04, 2014 7:23 am

Is your cooler mounted correctly....seems strange it will not run Prime @ stock settings.

deepblueq · Fri Jul 04, 2014 1:34 pm

Well, Memtest hung too. Oddly, the blinking characters designed to show that it's not hung were still going, but nothing else on the screen was changing and it was completely unresponsive to keyboard input.

It hung after only 7 mins running, during a block move in the 4-6 GB range. No errors were shown before the hang.

I'm running it again now, and the CPU fan does run a bit above idle, starting around the site of the crash. It's made it past that, and seems to be doing fine.

Memtest does show some odd stuff. Measured speeds are L1 = 66 GB/s, L2 = 33, L3 = 9.8, RAM = 8.7. Maybe it's not doing measurement in the obvious way, but those numbers look awfully low (my RAM should be capable of several times that, and L3 should be a lot faster than RAM). There's also a line showing some incorrect config:

Settings: RAM : 320 MHz (DDR640) / CAS : 1-3-3-3 / DDR2 (64 bits)

A legacy artifact in the way memtest detects memory, or an actual problem? I have no clue.

Approximately how long should Memtest take for a full pass with my setup? My recollection of using it in the past is dim, but I recall it going faster than this (looking like it will take well over an hour for a full pass).

I'm still biased away from RAM being the problem, since I've logged nearly a month of uptime on this machine at times, with lots of use in that time, and I've always had the rock solid stability and determinism I would expect from a real-time microcontroller, not a system with countless millions of lines of code and billions of transistors (contrast my brother's machines, which always accumulate minor glitches at alarming rates for no reason either of us can tell).

Captain Ned wrote:
Ugh.

Do not change multiple settings in any given run. In the OC world it's all about single-variable equations. Does this increase the number of test runs? That's the bloody point.

EDIT: Take the minimum granularity for the important bits and record a journal of your results as you massage 3 variables one at a time. If this sounds painful, that's the way they designed it.

I would almost always agree, but nothing useful is going to happen until I get it to pass Prime95 one way or another.

If you have a known working state, your advice is great for finding the boundaries at which it will stay working.

Unfortunately, all I'm trying to do at the moment is find a working state. Unless you think there's a significant chance that 1600/9-11-10-28 will work when 1600/10-12-12-31 won't, working with them separately seems like a quite literal waste of time.

In retrospect, setting it to 1066 and ignoring the timings would probably have been good, since lowering the speed also affects the timings.

vargis14 wrote:
Is your cooler mounted correctly....seems strange it will not run Prime @ stock settings.

I was a bit hurried when I put the system together, but I've double-checked everything since. If there's a problem in that area, it's that there's a bit more TIM than there needs to be. I should redo that at some point in this process, and it should be sooner rather than later. I'll get some new TIM next time it's convenient (probably Monday).

While there's likely a couple *C room for improvement there, I doubt it's the cause of the current problem. Mounting is all correct, I did take special care to avoid bubbles and the like (so probably no hotspots), and while the measured CPU temps are a bit higher than they might be, they're certainly within reason.

deepblueq · Fri Jul 04, 2014 2:12 pm

Memtest just hung again: 1 hour 15 mins running, one full pass successful, no errors, and again it was doing a block move when it failed (in a different address range, though).

It probably means nothing, but I noticed that the L1 cache measured speed is exactly twice that of the L2 cache: 66892 MB/s vs. 33446 MB/s. Odd.

DPete27 · Mon Jul 07, 2014 8:29 am

Strip down to one DIMM in the mobo at a time. If Memtest hangs, swap DIMMs and run again. If it hangs with more than 2 different DIMMs, try a different slot.

deepblueq · Mon Jul 07, 2014 11:52 pm

Dang, why didn't I think of that? :oops:

I'll be trying that directly.

In the meantime, I found the sensor correction tables I needed; all the PSU rails are right where they should be, but Vcore is strange at best. Idle is 0.88V, and when running Prime95 it varies between 1.21 and 1.30V. When I'm watching it in the BIOS, it varies between 1.27 and 1.43-ish V. I haven't found anything I would call reliable data online, but I was under the impression the FX-6300 is supposed to run at more like 1.43V at full load, in which case this thing is seriously starving for power. Amazing it works as well as it does, really. That assumes that my sensor data is correct - considering the implausibility of 3.5 GHz / 1.21V working at all, I don't know what to think. Even if the sensor data is lacking some scaling/offset, it's got an awful lot of variability under load.

Anyway, now that I have a Vcore reading, I can manually fix voltages in firmware before looking in the OS to get a better idea of how correct it is. Updates will follow.

deepblueq · Tue Jul 08, 2014 1:27 am

Those Vcore readings are dead accurate, and the default settings really did have this thing pushing 3.5 GHz @ 1.21V at times. I can't decide what to think of that, but it's not the cause of my problem.

After disabling Turbo Core, APM, C1E, CnQ, basically all the stuff that could be getting in the way, the Vcore at idle was 1.28 and loaded was 1.30V. It still hung. Then I put it in manual voltage mode and locked it in to 1.425V, the actual spec (AFAICT). 1.43V in OS, so the readings are good. It turns out my LLC is a bit aggressive, and it hit 1.46V when I loaded it up. Needless to say, the thermal load got a major bump, and it was running 68*C. I had high hopes for this, but Prime95 still only lasted about 15 minutes.
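
For a rough sense of why the jump from 1.30V to 1.46V under load heats things up so much, the usual first-order estimate is that dynamic power scales with voltage squared times frequency. The sketch below assumes that simple model and ignores leakage, which also rises with voltage and temperature, so treat the number as a lower-bound ballpark rather than a measurement.

```python
def relative_dynamic_power(v_new, v_old, f_new=1.0, f_old=1.0):
    """First-order CMOS estimate: dynamic power scales with V^2 * f."""
    return (v_new / v_old) ** 2 * (f_new / f_old)


# Loaded Vcore before (1.30 V) and after (1.46 V) from the post above,
# with the clock speed unchanged.
ratio = relative_dynamic_power(1.46, 1.30)
print(f"{ratio:.2f}x")  # 1.26x
```

Under that model the CPU dissipates roughly a quarter more power at the same clock, which fits a load temperature in the high 60s on a small cooler.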

It somehow managed to use less power idling at 1.28V with all the power saving features off than at 0.88V with power saving stuff maxed - now there's a head-scratcher.

I'll do RAM testing and get TIM reapplied tomorrow. Unless the TIM works wonders, my next upgrade is a new cooler; this one sounds like a jet when it's wound up, and isn't even cooling that well. On the plus side, if it can do 3.5 on 1.21V, 1.4-ish should take it pretty far.

deepblueq · Sat Jul 12, 2014 1:38 am

I tried one stick of memory at a time, one slot on the mobo at a time, and some completely different RAM from my brother's system. No change.

I reapplied TIM - temps dropped 10*C under load (49*C on Prime95 now)! It still hangs, though.

I got the latest firmware for the mobo (should have done that earlier), and in the process let it use all its defaults (I did sanity check them). It still hangs, but it's now giving the CPU even less power. It varies between 1.13 and 1.21V. Amazing, I think?

I tried giving the CPU NB (memory controller and L3 cache) a voltage bump to 1.3V (from 1.175). No change.

I tried disabling 4 of 6 cores (it didn't give me a choice on which two remained). No change.

Looking around for any other even slightly relevant settings in the firmware, I tried HPC mode. That made Prime95 throw some errors shortly before the hang, which actually had never happened before.

I removed all unnecessary parts (meaning an optical drive and a PCI-e wifi card). No change.

I swapped to my old PSU, a 500W rated (actually 450W, and only 360W on 12V) Cooler Master. No change.

Looking to test alternates for every component I could (I don't have access to any other compatible CPUs/mobos), I swapped in my brother's Sapphire HD 7770. This with the Mint install gave no video output when it should have gotten X running, so I went ahead and swapped the Minty drive for my laptop's SSD with Arch. I still couldn't get a GUI up too easily, but who needs that?

I forgot that this Arch install drops kernel messages into virtual consoles, and it dropped some very interesting ones when I fired up Prime95:

[  601.496334] [Hardware Error]: Corrected error, no action required.
[  601.497239] [Hardware Error]: CPU:0 (15:2:0) MC4_STATUS[-|CE|MiscV|-|AddrV|-|Poison|CECC]: 0x9d4948a5001d011b
[  601.498130] [Hardware Error]: MC4_ADDR: 0x000000009aa29fc0
[  601.499144] [Hardware Error]: MC4 Error (node 0): L3 cache tag error.
[  601.500245] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD

[  301.077742] [Hardware Error]: Corrected error, no action required.
[  301.078847] [Hardware Error]: CPU:0 (15:2:0) MC4_STATUS[Over|CE|MiscV|-|AddrV|-|Poison|CECC]: 0xdd4948a5001d011b
[  301.079989] [Hardware Error]: MC4_ADDR: 0x00000001d8c09fc0
[  301.081113] [Hardware Error]: MC4 Error (node 0): L3 cache tag error.
[  301.082241] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD

[  451.496334] [Hardware Error]: Corrected error, no action required.
[  451.497239] [Hardware Error]: CPU:0 (15:2:0) MC4_STATUS[Over|CE|MiscV|-|AddrV|-|Poison|CECC]: 0xdd4948a2001d011b
[  451.498130] [Hardware Error]: MC4_ADDR: 0x00000000d4569fc0
[  451.499144] [Hardware Error]: MC4 Error (node 0): L3 cache tag error.
[  451.500245] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD

I did two runs - the first error was from the first run, the second two were both from the second. It hung very shortly after the error on the first run - I glanced away for a second, and when I looked back, both had happened. I wasn't around at 301 on the second run, but I saw the last one as it appeared at 451, and it kept running for a couple seconds (2 or 3?) before the hang. For the uninitiated, the first item on each line is a timestamp of the number of seconds since boot.

Cache tags are very much an internal-to-the-CPU thing, so this seems fairly damning of some L3 cache circuitry or other. This has potential to explain quite a lot: for the sake of argument, let's say a piece of the L3 cache in the last MB or two is bad (it has 8 MB), and causes problems when used. If the CPU isn't always keeping the cache filled with something or other (it arguably should be, but we know the 'dozers have some issues with that), then I'm unlikely to ever hit that last MB in my normal usage, as all my intensive tasks are small-dataset, branching all over the place kinds of things - nothing linear with big datasets, like video encoding. Hence, the stability day-to-day. Prime95 blend and Memtest will sure fill up a cache, though. I could probably test this theory further by restricting Prime95 to small FFT sizes.
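
To pick FFT sizes for that test, a back-of-the-envelope footprint estimate may help. This Python sketch assumes a Prime95 FFT size of N "K" means roughly N x 1024 double-precision values at 8 bytes each, ignores lookup tables and other overhead, and looks at a single worker only. The 2 MB of L2 per two-core module is the commonly quoted FX-6300 figure rather than something stated in this thread; the 8 MB L3 is from the post above.

```python
L2_PER_MODULE = 2 * 1024 * 1024  # assumption: 2 MB L2 per two-core module
L3_TOTAL = 8 * 1024 * 1024       # 8 MB shared L3
BYTES_PER_ELEMENT = 8            # assumption: one double per FFT element


def fft_footprint_bytes(fft_size_k):
    """Approximate data size of one FFT of the given size (in K elements)."""
    return fft_size_k * 1024 * BYTES_PER_ELEMENT


def cache_level(fft_size_k):
    """Smallest level of the hierarchy the FFT data would fit in."""
    size = fft_footprint_bytes(fft_size_k)
    if size <= L2_PER_MODULE:
        return "L2"
    if size <= L3_TOTAL:
        return "L3"
    return "RAM"


for size_k in (8, 128, 512, 1024, 4096):
    print(f"{size_k}K FFT -> {cache_level(size_k)}")
```

If the hang only shows up once the sizes spill out of L2, that would point the same way as the kernel messages; if it also happens with tiny FFTs, the theory needs rethinking.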

(As I've said in another comment: "This is all theoretical, I'm just some guy on the internet, don't believe everything you read, yadda yadda yadda." I should put it in a sig or something.)

It's certainly curious that the timestamps are all aligned to 2.5 minute intervals (since boot, at that). It's plausible based on my previous experiences that the Prime95 hangs all are - again, I should do more testing. I know the Memtest hangs weren't so aligned.
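
The alignment is easy to check against the three timestamps from the log. A minimal Python sketch, with the 150-second interval taken from the observation above:

```python
# Seconds-since-boot timestamps of the first line of each error report.
timestamps = [601.496334, 301.077742, 451.496334]
INTERVAL = 150.0  # 2.5 minutes


def offset_from_interval(t, interval=INTERVAL):
    """Signed distance from t to the nearest multiple of interval."""
    nearest = round(t / interval) * interval
    return t - nearest


for t in sorted(timestamps):
    print(f"{t:10.3f} s  offset {offset_from_interval(t):+.3f} s")
```

All three land between one and one and a half seconds after a multiple of 150 s, which would be consistent with something periodic noticing and reporting the errors rather than with the errors themselves occurring on a schedule. That is a guess, not something the log proves.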

It would also be very interesting to know for sure what those 64-bit hex values are, but my google-fu hasn't been strong enough to find much description on these messages. I'll hazard a guess that the first is a bunch of status flags (reflected by the adjacent text? what's Over mean?) at error time, and the second is a main RAM address (0x1d8c09fc0 just barely makes it under the 8 GB mark). Odd how the last 17 bits (128K alignment) are the same for all of them.
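
Both guesses can be checked with a little bit-twiddling. The Python sketch below assumes the top bits of MC4_STATUS follow the standard x86 machine-check status layout (bit 63 Val, 62 Over, 61 UC, 60 En, 59 MiscV, 58 AddrV, 57 PCC); the lower, model-specific fields are not decoded. In that layout, Over means another error was detected while the register still held an earlier valid one, so at least one report was lost.

```python
STATUS = [0x9D4948A5001D011B, 0xDD4948A5001D011B, 0xDD4948A2001D011B]
ADDR = [0x000000009AA29FC0, 0x00000001D8C09FC0, 0x00000000D4569FC0]

# Assumption: standard x86 machine-check status bit positions.
FLAG_BITS = {"Val": 63, "Over": 62, "UC": 61, "En": 60,
             "MiscV": 59, "AddrV": 58, "PCC": 57}


def decode_flags(status):
    """Return {flag_name: bool} for the top status bits."""
    return {name: bool(status >> bit & 1) for name, bit in FLAG_BITS.items()}


def common_low_bits(addrs):
    """Count how many low-order bits are identical across all addresses."""
    diff = 0
    for a in addrs[1:]:
        diff |= a ^ addrs[0]
    if diff == 0:
        return max(a.bit_length() for a in addrs)
    # Index of the lowest set bit in diff = number of matching low bits.
    return (diff & -diff).bit_length() - 1


for s in STATUS:
    set_flags = [name for name, on in decode_flags(s).items() if on]
    print(f"{s:#018x}: {'|'.join(set_flags)}")

print(common_low_bits(ADDR))              # 17
print(all(a < 8 * 2**30 for a in ADDR))   # True: all below 8 GiB
```

The decoded flags line up with the text the kernel prints next to each value: UC is clear in all three (corrected errors), and Over is clear in the first report and set in the other two. The addresses do share exactly their low 17 bits, 0x09fc0.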

Also, there's the fact that this happens in very close temporal proximity to the hang, but not at the moment of.

Where's a cross-eyed emoticon when you need it? X-(

Anyway, this looks likely to end in a CPU warranty claim (or could it still be mobo?). I haven't done anything to it that hasn't been described in this thread, so they should still take it. I'd really like to have a solid case that it's one component or another before trying to get it replaced, so there's still work to do. I'm trying to look through kernel logs more on the Mint install, but I find I'm more than a little rusty at logging, and the internet resources I'm finding are thoroughly unhelpful. I may have to try to find the book from 2005 I learned it from in the first place.

If anyone has any ideas on how to narrow it down to either the CPU or the mobo, that would be great, but at this point I'm just as much trying to figure out what the problem is for curiosity's sake as I am trying to get an OC-able system.

OC newbie having trouble
