Personal computing discussed
[02:45:21] Completed 247500 out of 250000 steps (99%)
[03:02:14] Completed 250000 out of 250000 steps (100%)
[03:05:00]
[03:05:00] Finished Work Unit:
[03:05:00] - Reading up to 21124224 from "work/wudata_02.trr": Read 21124224
[03:05:01] trr file hash check passed.
[03:05:01] - Reading up to 4489500 from "work/wudata_02.xtc": Read 4489500
[03:05:01] xtc file hash check passed.
[03:05:01] edr file hash check passed.
[03:05:01] logfile size: 198417
[03:05:01] Leaving Run
[03:05:04] - Writing 26255405 bytes of core data to disk...
[03:05:04] ... Done.
[03:06:42] - Shutting down core
[03:28:55] ***** Got a SIGTERM signal (15)
[03:28:55] Killing all core threads
Folding@Home Client Shutdown.
--- Opening Log file [April 9 03:28:58]
# SMP Client ##################################################################
###############################################################################
Folding@Home Client Version 6.02
http://folding.stanford.edu
###############################################################################
###############################################################################
Launch directory: /home/ragnardan/folding/FAH
Executable: ./fah6
Arguments: -local -smp -forceasm -advmethods -verbosity 9
Warning:
By using the -forceasm flag, you are overriding
safeguards in the program. If you did not intend to
do this, please restart the program without -forceasm.
If work units are not completing fully (and particularly
if your machine is overclocked), then please discontinue
use of the flag.
[03:28:58] - Ask before connecting: No
[03:28:58] - User name: Ragnar_Dan (Team 2630)
[03:28:58] - User ID: 1503ECE6554148A8
[03:28:58] - Machine ID: 1
[03:28:58]
[03:28:59] Loaded queue successfully.
[03:28:59]
[03:28:59] + Processing work unit
[03:28:59] Core required: FahCore_a2.exe
[03:28:59] Core found.
[03:28:59] - Autosending finished units...
[03:28:59] Trying to send all finished work units
[03:28:59] + No unsent completed units remaining.
[03:28:59] - Autosend completed
[03:28:59] Working on Unit 02 [April 9 03:28:59]
[03:28:59] + Working ...
[03:28:59] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 02 -priority 96 -checkpoint 5 -forceasm -verbose -lifeline 9247 -version 602'
[03:28:59]
[03:28:59] *------------------------------*
[03:28:59] Folding@Home Gromacs SMP Core
[03:28:59] Version 2.00 (Wed Jul 9 13:11:25 PDT 2008)
[03:28:59]
[03:28:59] Preparing to commence simulation
[03:28:59] - Ensuring status. Please wait.
[03:29:08] - Assembly optimizations manually forced on.
[03:29:08] - Not checking prior termination.
[03:29:08] Need version 206
[03:29:08] Error: Work unit read from disk is invalid
[03:29:10] - Expanded 4836074 -> 23977273 (decompressed 495.8 percent)
[03:29:11] Called DecompressByteArray: compressed_data_size=4836074 data_size=23977273, decompressed_data_size=23977273 diff=0
[03:29:11] - Digital signature verified
[03:29:11]
[03:29:11] Project: 2669 (Run 2, Clone 7, Gen 107)
[03:29:11]
[03:29:11] Assembly optimizations on if available.
[03:29:11] Entering M.D.
[03:49:00] Completed 2500 out of 250000 steps (1%)
[20:03:17] Preparing to commence simulation
[20:03:17] - Ensuring status. Please wait.
[20:03:26] - Assembly optimizations manually forced on.
[20:03:26] - Not checking prior termination.
[20:03:26] Need version 206
[20:03:26] Error: Work unit read from disk is invalid
[20:03:31] - Expanded 4836074 -> 23977273 (decompressed 495.8 percent)
[20:03:32] Called DecompressByteArray: compressed_data_size=4836074 data_size=23977273, decompressed_data_size=23977273 diff=0
[20:03:32] - Digital signature verified
[20:03:32]
[20:03:32] Project: 2669 (Run 2, Clone 7, Gen 107)
[20:03:32]
[20:03:33] Assembly optimizations on if available.
[20:03:33] Entering M.D.
[20:03:39] Will resume from checkpoint file
[20:03:42] Resuming from checkpoint
[20:03:42] Verified work/wudata_02.log
[20:03:44] Verified work/wudata_02.trr
[20:03:44] Verified work/wudata_02.xtc
[20:03:44] Verified work/wudata_02.edr
[20:03:44] Completed 122520 out of 250000 steps (49%)
[11:36:11] Completed 247500 out of 250000 steps (99%)
[11:53:55] Completed 250000 out of 250000 steps (100%)
[11:56:35]
[11:56:35] Finished Work Unit:
[11:56:35] - Reading up to 21124224 from "work/wudata_02.trr": Read 21124224
[11:56:36] trr file hash check passed.
[11:56:36] - Reading up to 4489628 from "work/wudata_02.xtc": Read 4489628
[11:56:36] xtc file hash check passed.
[11:56:36] edr file hash check passed.
[11:56:36] logfile size: 202256
[11:56:36] Leaving Run
[11:56:40] - Writing 26265132 bytes of core data to disk...
[11:56:40] ... Done.
[11:56:44] - Shutting down core
[14:56:00] CoreStatus = 0 (0)
[14:56:00] Sending work to server
[14:56:00] Project: 2669 (Run 2, Clone 7, Gen 107)
[14:56:00] + Attempting to send results [April 10 14:56:00 UTC]
[14:56:00] - Reading file work/wuresults_02.dat from core
[14:56:00] (Read 26265132 bytes from disk)
[14:56:00] Connecting to http://171.64.65.56:8080/
[14:56:01] - Couldn't send HTTP request to server
[14:56:01] + Could not connect to Work Server (results)
[14:56:01] (171.64.65.56:8080)
[14:56:01] + Retrying using alternative port
[14:56:01] Connecting to http://171.64.65.56:80/
[14:56:01] - Couldn't send HTTP request to server
[14:56:01] + Could not connect to Work Server (results)
[14:56:01] (171.64.65.56:80)
[14:56:01] - Error: Could not transmit unit 02 (completed April 10) to work server.
[14:56:01] - 1 failed uploads of this unit.
[14:56:01] Keeping unit 02 in queue.
[14:56:01] Trying to send all finished work units
[14:56:01] Project: 2669 (Run 2, Clone 7, Gen 107)
[14:56:01] + Attempting to send results [April 10 14:56:01 UTC]
[14:56:01] - Reading file work/wuresults_02.dat from core
[14:56:01] (Read 26265132 bytes from disk)
[14:56:01] Connecting to http://171.64.65.56:8080/
[14:56:01] - Couldn't send HTTP request to server
[14:56:01] + Could not connect to Work Server (results)
[14:56:01] (171.64.65.56:8080)
[14:56:01] + Retrying using alternative port
[14:56:01] Connecting to http://171.64.65.56:80/
[14:56:01] - Couldn't send HTTP request to server
[14:56:01] + Could not connect to Work Server (results)
[14:56:01] (171.64.65.56:80)
[14:56:01] - Error: Could not transmit unit 02 (completed April 10) to work server.
[14:56:01] - 2 failed uploads of this unit.
[14:56:01] + Attempting to send results [April 10 14:56:01 UTC]
[14:56:01] - Reading file work/wuresults_02.dat from core
[14:56:01] (Read 26265132 bytes from disk)
[14:56:01] Connecting to http://171.67.108.25:8080/
[14:56:01] - Couldn't send HTTP request to server
[14:56:01] (Got status 503)
[14:56:01] + Could not connect to Work Server (results)
[14:56:01] (171.67.108.25:8080)
[14:56:01] + Retrying using alternative port
[14:56:01] Connecting to http://171.67.108.25:80/
[14:56:02] - Couldn't send HTTP request to server
[14:56:02] (Got status 503)
[14:56:02] + Could not connect to Work Server (results)
[14:56:02] (171.67.108.25:80)
[14:56:02] Could not transmit unit 02 to Collection server; keeping in queue.
[14:56:02] + Sent 0 of 1 completed units to the server
[14:56:02] - Preparing to get new work unit...
[14:56:02] + Attempting to get work packet
[14:56:02] - Will indicate memory of 500 MB
[14:56:02] - Connecting to assignment server
[14:56:02] Connecting to http://assign.stanford.edu:8080/
[14:56:02] Posted data.
[14:56:02] Initial: 40AB; - Successful: assigned to (171.64.65.64).
[14:56:02] + News From Folding@Home: Welcome to Folding@Home
[14:56:02] Loaded queue successfully.
[14:56:02] Connecting to http://171.64.65.64:8080/
[14:56:04] Posted data.
[14:56:04] Initial: 0000; - Receiving payload (expected size: 2437090)
[14:56:10] - Downloaded at ~396 kB/s
[14:56:10] - Averaged speed for that direction ~765 kB/s
[14:56:10] + Received work.
[14:56:10] Trying to send all finished work units
[14:56:10] Project: 2669 (Run 2, Clone 7, Gen 107)
[14:56:10] + Attempting to send results [April 10 14:56:10 UTC]
[14:56:10] - Reading file work/wuresults_02.dat from core
[14:56:10] (Read 26265132 bytes from disk)
[14:56:10] Connecting to http://171.64.65.56:8080/
[14:56:10] - Couldn't send HTTP request to server
[14:56:10] + Could not connect to Work Server (results)
[14:56:10] (171.64.65.56:8080)
[14:56:10] + Retrying using alternative port
[14:56:10] Connecting to http://171.64.65.56:80/
[14:56:10] - Couldn't send HTTP request to server
[14:56:10] + Could not connect to Work Server (results)
[14:56:10] (171.64.65.56:80)
[14:56:10] - Error: Could not transmit unit 02 (completed April 10) to work server.
[14:56:10] - 3 failed uploads of this unit.
[14:56:10] + Attempting to send results [April 10 14:56:10 UTC]
[14:56:10] - Reading file work/wuresults_02.dat from core
[14:56:10] (Read 26265132 bytes from disk)
[14:56:10] Connecting to http://171.67.108.25:8080/
[14:56:10] - Couldn't send HTTP request to server
[14:56:10] (Got status 503)
[14:56:10] + Could not connect to Work Server (results)
[14:56:10] (171.67.108.25:8080)
[14:56:10] + Retrying using alternative port
[14:56:10] Connecting to http://171.67.108.25:80/
[14:56:11] - Couldn't send HTTP request to server
[14:56:11] (Got status 503)
[14:56:11] + Could not connect to Work Server (results)
[14:56:11] (171.67.108.25:80)
[14:56:11] Could not transmit unit 02 to Collection server; keeping in queue.
[14:56:11] + Sent 0 of 1 completed units to the server
[14:56:11] + Closed connections
[14:56:16]
[14:56:16] + Processing work unit
[14:56:16] Work type a1 not eligible for variable processors
[14:56:16] Core required: FahCore_a1.exe
[14:56:16] Core found.
[14:56:16] Working on queue slot 03 [April 10 14:56:16 UTC]
[14:56:16] + Working ...
[14:56:16] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a1.exe -dir work/ -suffix 03 -priority 96 -checkpoint 5 -forceasm -verbose -lifeline 5516 -version 624'
[14:56:16]
[14:56:16] *------------------------------*
[14:56:16] Folding@Home Gromacs SMP Core
[14:56:16] Version 1.74 (November 27, 2006)
[14:56:16]
[14:56:16] Preparing to commence simulation
[14:56:16] - Ensuring status. Please wait.
[14:56:33] - Assembly optimizations manually forced on.
[14:56:33] - Not checking prior termination.
[14:56:34] - Expanded 2436578 -> 12916733 (decompressed 530.1 percent)
[14:56:34] - Starting from initial work packet
[14:56:34]
[14:56:34] Project: 2653 (Run 24, Clone 175, Gen 101)
[14:56:34]
[14:56:34] Assembly optimizations on if available.
[14:56:34] Entering M.D.
[14:56:41] Rejecting checkpoint
[14:56:42] Protein: Protein in POPC
[14:56:42] Writing local files
[14:56:42] Extra SSE boost OK.
[14:56:42] Writing local files
[14:56:43] Completed 0 out of 500000 steps (0 percent)
[15:06:58] Timered checkpoint triggered.
[15:17:55] *------------------------------*
[15:17:55] Folding@Home Gromacs SMP Core
[15:17:55] Version 2.04 (Thu Jan 29 16:43:57 PST 2009)
[15:17:55]
[15:17:55] Preparing to commence simulation
[15:17:55] - Ensuring status. Please wait.
[15:18:06] - Looking at optimizations...
[15:18:06] - Working with standard loops on this execution.
[15:18:06] - Files status OK
[15:18:06] Need version 206
[15:18:06] Error: Work unit read from disk is invalid
[15:18:06]
[15:18:06] Folding@home Core Shutdown: CORE_OUTDATED
[15:18:08] CoreStatus = 6E (110)
[15:18:08] + Core out of date. Auto updating...
[15:18:08] - Attempting to download new core...
[15:18:08] + Downloading new core: FahCore_a2.exe
[15:18:08] Downloading core (/~pande/Linux/AMD64/Core_a2.fah from www.stanford.edu)
[15:18:08] Initial: AFDE; + 10240 bytes downloaded
[15:18:08] Initial: 1FF1; + 20480 bytes downloaded
Ragnar Dan wrote: about a whole bunch of problems.
farmpuma wrote: [a nigh crazy idea about blowing up my computer from high in the sky, somewhat fictionalized here in this box]
just brew it! wrote: Update: I can now confirm that the new 2.06 a2 core seems to fix the issue. No more hangs here since forcing all of my systems to download the new version.
#!/bin/sh
# check_hang.sh - checks log files and kills/continues the cores if hung at completion
# Also does cleanup of stale files in the work directory
#
LOGFILE=/tmp/folding_hanglog.txt
while true
do
    # Run every 5 minutes
    sleep 300
    # Trim the log file to its last 1000 lines
    if [ -f $LOGFILE ]
    then
        tail -n 1000 $LOGFILE > /tmp/hanglog.bak
        mv /tmp/hanglog.bak $LOGFILE
    fi
    echo `date` " Checking " >> $LOGFILE
    # Check for FINISHED_UNIT without a CoreStatus line following it
    grep -E 'FINISHED_UNIT|CoreStatus' FAHlog.txt | tail -n 1 | grep -q FINISHED_UNIT
    if [ $? -eq 0 ]
    then
        # Give the client a chance to kill the cores
        echo "Potential hang found, waiting to see if it clears..." >> $LOGFILE
        sleep 300
        grep -E 'FINISHED_UNIT|CoreStatus' FAHlog.txt | tail -n 1 | grep -q FINISHED_UNIT
        if [ $? -eq 0 ]
        then
            echo "Hang failed to clear, killing cores" >> $LOGFILE
            ./kill_cores.sh $LOGFILE
        fi
    fi
    # Check for a completed upload with no attempt to download new work following it
    grep -E 'Number of Units Completed|Preparing to get new work unit|Starting local stats count at' FAHlog.txt | tail -n 1 | grep -qE 'Number of Units Completed|Starting local stats count at'
    if [ $? -eq 0 ]
    then
        # Give the client a chance to continue on its own
        echo "Potential stop found, waiting to see if it clears..." >> $LOGFILE
        sleep 300
        grep -E 'Number of Units Completed|Preparing to get new work unit|Starting local stats count at' FAHlog.txt | tail -n 1 | grep -qE 'Number of Units Completed|Starting local stats count at'
        if [ $? -eq 0 ]
        then
            echo "Stop failed to clear, continuing cores" >> $LOGFILE
            ./cont_cores.sh $LOGFILE
        fi
    fi
    # Clean up any stale files in the work directory
    slot=0
    while [ "$slot" -lt "10" ]
    do
        state=`./queueinfo queue.dat $slot`
        if [ "$state" -eq "0" ]
        then
            rm -f work/*_0$slot*
        fi
        slot=`expr $slot + 1`
    done
done
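The FINISHED_UNIT test above can be exercised in isolation. A minimal sketch against a fabricated two-line log (the sample lines are invented stand-ins, not a real FAHlog.txt): because the last matching line is a CoreStatus line, the client is judged healthy.

```shell
# Hedged sketch of check_hang.sh's hang test against a fake two-line log.
log=$(mktemp)
printf '%s\n' \
  '[17:18:04] Folding@home Core Shutdown: FINISHED_UNIT' \
  '[17:37:53] CoreStatus = 64 (100)' > "$log"
# A hang would leave FINISHED_UNIT as the last matching line with no
# CoreStatus after it; here CoreStatus follows, so no hang is flagged.
if grep -E 'FINISHED_UNIT|CoreStatus' "$log" | tail -n 1 | grep -q FINISHED_UNIT
then
    echo "hung"
else
    echo "not hung"   # prints "not hung"
fi
rm -f "$log"
```

Deleting the CoreStatus line from the sample log flips the result to "hung", which is the condition that triggers kill_cores.sh.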
#!/bin/sh
# kill_cores.sh - kills cores for the specified instance
#
CWD=`pwd`
echo "kill_cores.sh for $CWD" >> $1
# Walk /proc looking for processes
for procdir in `find /proc -name '[0-9]*' | awk '/\/proc\/[0-9]*$/ {print $0}'`
do
    # Check that the process has the right exe and the right cwd
    if [ -e $procdir/exe -a -e $procdir/cwd ]
    then
        if [ "`readlink $procdir/exe`" = "$CWD/FahCore_a1.exe" -a "`readlink $procdir/cwd`" = "$CWD" ]
        then
            # kill -9 the core procs to free the hang
            kill -9 `echo $procdir | awk -F / '{print $3}'`
            echo "Killing " `echo $procdir | awk -F / '{print $3}'` >> $1
        fi
    fi
done
#!/bin/sh
# cont_cores.sh - continues cores for the specified instance
#
CWD=`pwd`
echo "cont_cores.sh for $CWD" >> $1
# Walk /proc looking for processes
for procdir in `find /proc -name '[0-9]*' | awk '/\/proc\/[0-9]*$/ {print $0}'`
do
    # Check that the process has the right exe and the right cwd
    if [ -e $procdir/exe -a -e $procdir/cwd ]
    then
        if [ "`readlink $procdir/exe`" = "$CWD/FahCore_a2.exe" -a "`readlink $procdir/cwd`" = "$CWD" ]
        then
            # Send SIGCONT to the core procs to free the stall
            kill -CONT `echo $procdir | awk -F / '{print $3}'`
            echo "Continuing " `echo $procdir | awk -F / '{print $3}'` >> $1
        fi
    fi
done
/*
 * queueinfo.c - a program to output the state of the work unit slots
 * Reads from the queue.dat named in argv[1] the state of slot argv[2]
 * Copyright Nicholas Reilly 29 September 2008
 * Licensed under the GPL v2 or any later version
 */
#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <errno.h>
#include <unistd.h>

#define SIZE 7168

int main(int argc, char *argv[])
{
    char *addr, *stat;
    int fd, slot;

    if (argc != 3) {
        fprintf(stderr, "Usage: %s <queue.dat> <slot 0-9>\n", argv[0]);
        return EXIT_FAILURE;
    }
    slot = atoi(argv[2]);
    if ((slot < 0) || (slot > 9)) {
        fprintf(stderr, "Usage: %s <queue.dat> <slot 0-9>\n", argv[0]);
        return EXIT_FAILURE;
    }
    fd = open(argv[1], O_RDONLY);
    if (fd == -1) {
        perror("Failed to open queue.dat");
        return EXIT_FAILURE;
    }
    addr = mmap(NULL, SIZE, PROT_READ, MAP_PRIVATE, fd, 0);
    if (addr == MAP_FAILED) {
        perror("Failed to map file");
        return EXIT_FAILURE;
    }
    /* Skip the first 8 bytes (general header) */
    stat = addr + 8;
    /* Each queue entry is 712 bytes long, with the status as its first byte */
    stat += (712 * slot);
    printf("%d\n", *stat);
    (void)munmap(addr, SIZE);
    close(fd);
    return EXIT_SUCCESS;
}
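Under the same layout assumptions queueinfo.c makes (an 8-byte header, then 712-byte entries whose first byte is the status), the status byte can also be read with standard tools. A hedged sketch that builds a zero-filled stand-in queue.dat (not a real client file) with slot 3 marked in use, then reads it back with dd and od:

```shell
# Build a stand-in queue.dat: 7168 zero bytes, with slot 3's status byte set
# to 1. Offsets (8-byte header, 712-byte entries) are taken from queueinfo.c.
dd if=/dev/zero of=queue.dat bs=1 count=7168 2>/dev/null
printf '\001' | dd of=queue.dat bs=1 seek=$((8 + 712 * 3)) conv=notrunc 2>/dev/null
# Read the same byte queueinfo.c would print for slot 3.
slot=3
dd if=queue.dat bs=1 skip=$((8 + 712 * slot)) count=1 2>/dev/null | od -An -tu1 | tr -d ' '   # prints 1
```

This is only a convenience for checking the layout by hand; check_hang.sh should keep calling the compiled queueinfo, which validates its arguments and maps the file once.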
[17:18:04] - Shutting down core
[17:18:04]
[17:18:04] Folding@home Core Shutdown: FINISHED_UNIT
[17:37:53] CoreStatus = 64 (100)