 
Hoobie7
Gerbil
Topic Author
Posts: 21
Joined: Mon Oct 29, 2007 11:27 am

First Foray into Folding Farm Hell

Mon Oct 29, 2007 11:54 am

OK, so me and the buddies at work decided to start a farm. I currently have C2D 6400 and 6420 work nodes going. It's been an uphill battle, to say the least. My most pressing problem right now is that only one machine will hand in at a time. The 6420 is up and running now; it's handed in a work unit and has started its next.
The 6400 node, on the other hand, is having all kinds of problems. It pulls an IP and pxelinux and a WU . . . . usually. And it can crunch all the way through it, but then it hangs at the end. I did see that the 64-bit SMP Linux Folding client has this problem, so is this from that? So I reboot the system and it pulls the backed-up data and recrunches the last percent, then sits at "Folding@home Core Shutdown: FINISHED_UNIT", so I leave it overnight thinking I might just be impatient. Next day, no further. So I reboot it again and get this:

--- Opening Log file [October 30 00:18:55] 


# SMP Client ##################################################################
###############################################################################

                       Folding@Home Client Version 6.00beta1

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /etc/folding/1
Executable: ./fah6
Arguments: -local -forceasm -verbosity 9 -smp

Warning:
 By using the -forceasm flag, you are overriding
 safeguards in the program. If you did not intend to
 do this, please restart the program without -forceasm.
 If work units are not completing fully (and particularly
 if your machine is overclocked), then please discontinue
 use of the flag.

[00:18:55] - Ask before connecting: No
[00:18:55] - User name: TheFarm (Team 73303)
[00:18:55] - User ID: 77F980A6383CBE0D
[00:18:55] - Machine ID: 1
[00:18:55]
[00:18:55] Loaded queue successfully.
[00:18:55]
[00:18:55] + Processing work unit
[00:18:55] Core required: FahCore_a1.exe
[00:18:55] Core not found.
[00:18:55] - Core is not present or corrupted.
[00:18:55] - Attempting to download new core...
[00:18:55] + Downloading new core: FahCore_a1.exe
[00:18:55] Downloading core (/~pande/Linux/x86//Core_a1.fah from www.stanford.edu)
[00:18:55] - Autosending finished units...
[00:18:55] Trying to send all finished work units
[00:18:55] + No unsent completed units remaining.
[00:18:55] - Autosend completed
[00:18:55] Initial: AFDE; + 10240 bytes downloaded
[00:18:56] Initial: B54E; + 20480 bytes downloaded
[00:18:56] Initial: D6C2; + 30720 bytes downloaded
[00:18:56] Initial: 9F08; + 40960 bytes downloaded
[00:18:56] Initial: C6C3; + 51200 bytes downloaded
[00:18:56] Initial: EBA8; + 61440 bytes downloaded
[00:18:56] Initial: 3141; + 71680 bytes downloaded
[00:18:56] Initial: D218; + 81920 bytes downloaded
[00:18:56] Initial: F7AC; + 92160 bytes downloaded
[00:18:56] Initial: 820B; + 102400 bytes downloaded
[00:18:56] Initial: 1B1E; + 112640 bytes downloaded
[00:18:56] Initial: C249; + 122880 bytes downloaded
[00:18:56] Initial: 5EBD; + 133120 bytes downloaded
[00:18:56] Initial: CD6C; + 143360 bytes downloaded
[00:18:56] Initial: 221C; + 153600 bytes downloaded
[00:18:56] Initial: DB18; + 163840 bytes downloaded
[00:18:56] Initial: 237E; + 174080 bytes downloaded
[00:18:56] Initial: AEEC; + 184320 bytes downloaded
[00:18:56] Initial: 4C66; + 194560 bytes downloaded
[00:18:56] Initial: AE1E; + 204800 bytes downloaded
[00:18:56] Initial: A37E; + 215040 bytes downloaded
[00:18:56] Initial: 8193; + 225280 bytes downloaded
[00:18:56] Initial: 9F05; + 235520 bytes downloaded
[00:18:56] Initial: AAA5; + 245760 bytes downloaded
[00:18:56] Initial: 6400; + 256000 bytes downloaded
[00:18:56] Initial: 6E3D; + 266240 bytes downloaded
[00:18:56] Initial: EA6B; + 276480 bytes downloaded
[00:18:56] Initial: 820A; + 286720 bytes downloaded
[00:18:56] Initial: DE6D; + 296960 bytes downloaded
[00:18:56] Initial: B97B; + 307200 bytes downloaded
[00:18:56] Initial: 9D5D; + 317440 bytes downloaded
[00:18:56] Initial: 91D7; + 327680 bytes downloaded
[00:18:56] Initial: BB3B; + 337920 bytes downloaded
[00:18:56] Initial: 611B; + 348160 bytes downloaded
[00:18:56] Initial: B290; + 358400 bytes downloaded
[00:18:56] Initial: B0AA; + 368640 bytes downloaded
[00:18:56] Initial: 6A85; + 378880 bytes downloaded
[00:18:56] Initial: BF10; + 389120 bytes downloaded
[00:18:56] Initial: A818; + 399360 bytes downloaded
[00:18:56] Initial: 90E1; + 409600 bytes downloaded
[00:18:56] Initial: 2869; + 419840 bytes downloaded
[00:18:56] Initial: CAFE; + 430080 bytes downloaded
[00:18:56] Initial: 414B; + 440320 bytes downloaded
[00:18:56] Initial: 9B7A; + 450560 bytes downloaded
[00:18:56] Initial: 33AA; + 460800 bytes downloaded
[00:18:56] Initial: B1D5; + 471040 bytes downloaded
[00:18:56] Initial: 0206; + 481280 bytes downloaded
[00:18:56] Initial: 11F4; + 491520 bytes downloaded
[00:18:56] Initial: 31B5; + 501760 bytes downloaded
[00:18:56] Initial: 46B2; + 512000 bytes downloaded
[00:18:56] Initial: 3113; + 522240 bytes downloaded
[00:18:56] Initial: 525A; + 532480 bytes downloaded
[00:18:56] Initial: 66F9; + 542720 bytes downloaded
[00:18:56] Initial: 9672; + 552960 bytes downloaded
[00:18:56] Initial: 9058; + 563200 bytes downloaded
[00:18:56] Initial: 49ED; + 573440 bytes downloaded
[00:18:56] Initial: 515D; + 583680 bytes downloaded
[00:18:56] Initial: CAC0; + 593920 bytes downloaded
[00:18:56] Initial: 0B15; + 604160 bytes downloaded
[00:18:56] Initial: 5A89; + 614400 bytes downloaded
[00:18:56] Initial: 0F31; + 624640 bytes downloaded
[00:18:56] Initial: 2BC3; + 634880 bytes downloaded
[00:18:56] Initial: 3C06; + 645120 bytes downloaded
[00:18:56] Initial: 89C7; + 655360 bytes downloaded
[00:18:56] Initial: 6C54; + 665600 bytes downloaded
[00:18:56] Initial: 8D4D; + 675840 bytes downloaded
[00:18:56] Initial: EA59; + 686080 bytes downloaded
[00:18:56] Initial: C563; + 696320 bytes downloaded
[00:18:56] Initial: 8D45; + 706560 bytes downloaded
[00:18:56] Initial: 9BD0; + 716800 bytes downloaded
[00:18:56] Initial: 130C; + 727040 bytes downloaded
[00:18:56] Initial: CDA1; + 737280 bytes downloaded
[00:18:56] Initial: 7681; + 747520 bytes downloaded
[00:18:56] Initial: 1110; + 757760 bytes downloaded
[00:18:56] Initial: EE35; + 768000 bytes downloaded
[00:18:56] Initial: E5E1; + 778240 bytes downloaded
[00:18:56] Initial: 4B97; + 788480 bytes downloaded
[00:18:56] Initial: 4D75; + 798720 bytes downloaded
[00:18:56] Initial: E268; + 808960 bytes downloaded
[00:18:56] Initial: FAC6; + 819200 bytes downloaded
[00:18:56] Initial: A625; + 829440 bytes downloaded
[00:18:56] Initial: A12A; + 839680 bytes downloaded
[00:18:56] Initial: 83A3; + 849920 bytes downloaded
[00:18:56] Initial: 3BEA; + 860160 bytes downloaded
[00:18:56] Initial: 5298; + 870400 bytes downloaded
[00:18:56] Initial: 4811; + 880640 bytes downloaded
[00:18:56] Initial: EB07; + 890880 bytes downloaded
[00:18:56] Initial: 83FC; + 901120 bytes downloaded
[00:18:56] Initial: FA4E; + 911360 bytes downloaded
[00:18:57] Initial: 2945; + 921600 bytes downloaded
[00:18:57] Initial: 6BC9; + 931840 bytes downloaded
[00:18:57] Initial: E495; + 942080 bytes downloaded
[00:18:57] Initial: 1050; + 952320 bytes downloaded
[00:18:57] Initial: 2070; + 962560 bytes downloaded
[00:18:57] Initial: 1083; + 972800 bytes downloaded
[00:18:57] Initial: 96E5; + 983040 bytes downloaded
[00:18:57] Initial: 3EEE; + 993280 bytes downloaded
[00:18:57] Initial: 84AC; + 1003520 bytes downloaded
[00:18:57] Initial: 3B6B; + 1013760 bytes downloaded
[00:18:57] Initial: 3030; + 1024000 bytes downloaded
[00:18:57] Initial: 4B95; + 1034240 bytes downloaded
[00:18:57] Initial: D9BC; + 1044480 bytes downloaded
[00:18:57] Initial: C5B8; + 1054720 bytes downloaded
[00:18:57] Initial: A5EF; + 1064960 bytes downloaded
[00:18:57] Initial: 28DC; + 1075200 bytes downloaded
[00:18:57] Initial: 0943; + 1085440 bytes downloaded
[00:18:57] Initial: 338A; + 1095680 bytes downloaded
[00:18:57] Initial: ADFC; + 1105920 bytes downloaded
[00:18:57] Initial: ED39; + 1116160 bytes downloaded
[00:18:57] Initial: D284; + 1126400 bytes downloaded
[00:18:57] Initial: 0057; + 1136640 bytes downloaded
[00:18:57] Initial: 3E65; + 1146880 bytes downloaded
[00:18:57] Initial: FCB5; + 1157120 bytes downloaded
[00:18:57] Initial: A7D8; + 1167360 bytes downloaded
[00:18:57] Initial: A564; + 1177600 bytes downloaded
[00:18:57] Initial: 7654; + 1187840 bytes downloaded
[00:18:57] Initial: 0848; + 1198080 bytes downloaded
[00:18:57] Initial: 471E; + 1208320 bytes downloaded
[00:18:57] Initial: A7F3; + 1218560 bytes downloaded
[00:18:57] Initial: FA59; + 1228800 bytes downloaded
[00:18:57] Initial: FBF2; + 1239040 bytes downloaded
[00:18:57] Initial: F54E; + 1249280 bytes downloaded
[00:18:57] Initial: 3023; + 1259520 bytes downloaded
[00:18:57] Initial: AB37; + 1269760 bytes downloaded
[00:18:57] Initial: 0896; + 1280000 bytes downloaded
[00:18:57] Initial: 756D; + 1290240 bytes downloaded
[00:18:57] Initial: C1E7; + 1300480 bytes downloaded
[00:18:57] Initial: 9AAC; + 1310720 bytes downloaded
[00:18:57] Initial: E5AF; + 1320960 bytes downloaded
[00:18:57] Initial: BBE3; + 1331200 bytes downloaded
[00:18:57] Initial: 3596; + 1341440 bytes downloaded
[00:18:57] Initial: 924C; + 1351680 bytes downloaded
[00:18:57] Initial: 30B7; + 1361920 bytes downloaded
[00:18:57] Initial: AEB7; + 1372160 bytes downloaded
[00:18:57] Initial: 7D25; + 1382400 bytes downloaded
[00:18:57] Initial: 0FEB; + 1392640 bytes downloaded
[00:18:57] Initial: 3131; + 1402880 bytes downloaded
[00:18:57] Initial: 755F; + 1413120 bytes downloaded
[00:18:57] Initial: 4800; + 1423360 bytes downloaded
[00:18:57] Initial: 1282; + 1433600 bytes downloaded
[00:18:57] Initial: B2A3; + 1443840 bytes downloaded
[00:18:57] Initial: 21E9; + 1454080 bytes downloaded
[00:18:57] Initial: 789E; + 1464320 bytes downloaded
[00:18:57] Initial: 8542; + 1474560 bytes downloaded
[00:18:57] Initial: 3A56; + 1484800 bytes downloaded
[00:18:57] Initial: D4FE; + 1490945 bytes downloaded
[00:18:57] Verifying core Core_a1.fah...
[00:18:57] Signature is VALID
[00:18:57]
[00:18:57] Trying to unzip core FahCore_a1.exe
[00:18:57] Decompressed FahCore_a1.exe (3625104 bytes) successfully
[00:18:57] + Core successfully engaged
[16:19:03]
[16:19:03] + Processing work unit
[16:19:03] Core required: FahCore_a1.exe
[16:19:03] Core found.
[16:19:03] Working on Unit 01 [October 29 16:19:03]
[16:19:03] + Working ...
[16:19:03] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a1.exe -dir work/ -suffix 01 -checkpoint 15 -forceasm -verbose -lifeline 489 -version 600'

[16:19:03]
[16:19:03] *------------------------------*
[16:19:03] Folding@Home Gromacs SMP Core
[16:19:03] Version 1.74 (November 27, 2006)
[16:19:03]
[16:19:03] Preparing to commence simulation
[16:19:03] - Ensuring status. Please wait.
[16:19:21] - Assembly optimizations manually forced on.
[16:19:21] - Not checking prior termination.
[16:19:21] Finalizing output
[16:19:21] - Starting from initial work packet
[16:19:21]
[16:19:21] Project: 0 (Run 0, Clone 0, Gen 0)
[16:19:21]
[16:19:21] Error: Could not write local file.  Exiting.
[16:19:21] - Shutting down core
[16:21:07] CoreStatus = 12 (18)
[16:21:07] Client-core communications error: ERROR 0x12
[16:21:07] Deleting current work unit & continuing...
[16:25:28] - Warning: Could not delete all work unit files (1): Core returned invalid code
[16:25:28] Trying to send all finished work units
[16:25:28] + No unsent completed units remaining.
[16:25:28] - Preparing to get new work unit...
[16:25:28] + Attempting to get work packet
[16:25:28] - Will indicate memory of 486 MB
[16:25:28] - Detect CPU. Vendor: GenuineIntel, Family: 6, Model: 15, Stepping: 2
[16:25:28] - Connecting to assignment server
[16:25:28] Connecting to http://assign.stanford.edu:8080/
[16:25:29] Posted data.
[16:25:29] Initial: 40AB; - Successful: assigned to (171.64.65.56).
[16:25:29] + News From Folding@Home: Welcome to Folding@Home
[16:25:29] Loaded queue successfully.
[16:25:29] Connecting to http://171.64.65.56:8080/
[16:25:32] Posted data.
[16:25:32] Initial: 0000; - Receiving payload (expected size: 2448031)
[16:25:35] - Downloaded at ~796 kB/s
[16:25:35] - Averaged speed for that direction ~796 kB/s
[16:25:35] + Received work.
[16:25:35] + Closed connections
[16:25:40]
[16:25:40] + Processing work unit
[16:25:40] Core required: FahCore_a1.exe
[16:25:40] Core found.
[16:25:40] Working on Unit 02 [October 29 16:25:40]
[16:25:40] + Working ...
[16:25:40] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a1.exe -dir work/ -suffix 02 -checkpoint 15 -forceasm -verbose -lifeline 489 -version 600'

[16:25:40]
[16:25:40] *------------------------------*
[16:25:40] Folding@Home Gromacs SMP Core
[16:25:40] Version 1.74 (November 27, 2006)
[16:25:40]
[16:25:40] Preparing to commence simulation
[16:25:40] - Ensuring status. Please wait.
[16:25:57] - Assembly optimizations manually forced on.
[16:25:57] - Not checking prior termination.
[16:25:58] - Expanded 2447519 -> 12907669 (decompressed 527.3 percent)
[16:25:58] - Starting from initial work packet
[16:25:58]
[16:25:58] Project: 2605 (Run 12, Clone 44, Gen 48)
[16:25:58]
[16:25:58] Assembly optimizations on if available.
[16:25:58] Entering M.D.
[16:26:05] Rejecting checkpoint
[16:26:05] OPC
[16:26:05] Writing local files
[16:26:05]
[16:26:05] Writing local files
[16:26:06] Extra SSE boost OK.
[16:26:06] Writing local files
[16:26:06] Completed 0 out of 500000 steps  (0 percent)


This time it apparently does not like something in the backup, so it dumps it and starts over. I really have no idea what's going on anymore! :cry:

Any help would help!
Hoobie
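
For anyone skimming a log this long, here is a minimal Python sketch that pulls out only the lines that look like trouble. The marker strings are simply copied from the log above; the list is an assumption for illustration, not an official catalogue of client errors, and `find_trouble` is a made-up helper name.

```python
import re

# Matches client log lines such as
# "[16:19:21] Error: Could not write local file.  Exiting."
LINE_RE = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\]\s?(.*)$")

# Substrings taken from the log above that accompany a failed work unit.
# Assumption: this list is illustrative only, not exhaustive.
TROUBLE_MARKERS = ("Error:", "CoreStatus", "communications error", "Could not delete")


def find_trouble(log_text):
    """Return (timestamp, message) pairs for lines containing a trouble marker."""
    hits = []
    for raw in log_text.splitlines():
        match = LINE_RE.match(raw.strip())
        if not match:
            continue  # banner lines, blank lines, and so on
        stamp, message = match.groups()
        if any(marker in message for marker in TROUBLE_MARKERS):
            hits.append((stamp, message))
    return hits


# A few lines copied from the log above, used as sample input.
SAMPLE = """\
[16:19:21] Project: 0 (Run 0, Clone 0, Gen 0)
[16:19:21]
[16:19:21] Error: Could not write local file.  Exiting.
[16:19:21] - Shutting down core
[16:21:07] CoreStatus = 12 (18)
[16:21:07] Client-core communications error: ERROR 0x12
[16:21:07] Deleting current work unit & continuing...
"""

if __name__ == "__main__":
    for stamp, message in find_trouble(SAMPLE):
        print(stamp, message)
```

On the sample it reports the "Could not write local file" line, the CoreStatus line, and the client-core communications error, which is the sequence that ends with the work unit being deleted.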
 
cass
Minister of Gerbil Affairs
Posts: 2269
Joined: Mon Feb 10, 2003 9:12 am

Mon Oct 29, 2007 2:26 pm

Catch the thing before it ends, somewhere between 0 and 99%, and just reboot it. It should be fine then.

Just in case....

Make a manual backup of the WU sometime BEFORE it finishes. At the point you are at above, it's probably a goner, but you need to check the backup files. If you have a backup file that is approximately 23MB, use that one. If both your backup files are approx. 5-6MB ... it has gone past the end, written its data, and hung.
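
The size rule of thumb above can be sketched in a few lines of Python. The 15 MB cut-off is an assumption picked to sit between the two sizes mentioned (roughly 23MB versus 5-6MB), and the function names and file names are made up for illustration.

```python
MB = 1024 * 1024

# Assumed cut-off between an in-progress backup (~23 MB) and one written
# after the unit finished (~5-6 MB). The exact value is a guess.
IN_PROGRESS_MIN_BYTES = 15 * MB


def is_in_progress_backup(size_bytes, threshold=IN_PROGRESS_MIN_BYTES):
    """True if the backup is large enough to hold an unfinished work unit."""
    return size_bytes >= threshold


def pick_usable(backups):
    """Given {file name: size in bytes}, return the usable names, largest first."""
    usable = [name for name, size in backups.items() if is_in_progress_backup(size)]
    return sorted(usable, key=lambda name: backups[name], reverse=True)


if __name__ == "__main__":
    # Hypothetical file names and sizes, for illustration only.
    example = {"backup.A": 23 * MB, "backup.B": 6 * MB}
    print(pick_usable(example))
```

In practice the sizes could come from `os.path.getsize` on each backup file. An empty result would correspond to the "both files are small" case, where the unit has already gone past the end.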

Since I reset mine once, they have been doing fine.

The backup files are in /var/lib/tftpboot/ IIRC

You can use the backup feature from the http page and place the tarball where you like. I backed up both the backup.$IP.A.1 and the backup.$IP.B.1 manually on mine last night, but it had just finished when I noticed it hanging, so I was able to save my unit by booting from the older backup. The most recent backup gave the results you show above.

If I have any trouble after rebooting, I am going to start a cron job to copy my SMP backups every 12 hours so I can always get something back.
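
A rough sketch of what such a periodic copy could look like in Python. The `backup.*` pattern follows the backup.$IP.A.1 / backup.$IP.B.1 names mentioned above; the function name and the destination layout (one timestamped folder per run) are assumptions, not part of the diskless package.

```python
import shutil
import sys
import time
from pathlib import Path


def snapshot_backups(src_dir, dest_dir, pattern="backup.*"):
    """Copy every file matching pattern from src_dir into a new
    timestamped folder under dest_dir and return the copied paths."""
    src = Path(src_dir)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = Path(dest_dir) / stamp
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(src.glob(pattern)):
        if path.is_file():
            shutil.copy2(path, target / path.name)  # copy2 keeps modification times
            copied.append(target / path.name)
    return copied


# Only runs when called as: python3 <this script> SOURCE_DIR DEST_DIR
if __name__ == "__main__" and len(sys.argv) == 3:
    for copy in snapshot_backups(sys.argv[1], sys.argv[2]):
        print("saved", copy)
```

Run from cron every 12 hours with the TFTP directory as the source, this keeps every older snapshot, so a backup taken before the unit finished is still around after a hang. The old folders would need pruning by hand.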

My box looked just like yours above, and it finished the second unit and submitted just fine... I don't really know if the 1st unit was lost or not, but it's going on now.

Is the 6420 disk based or diskless?
 
Nitrodist
Grand Gerbil Poohbah
Posts: 3281
Joined: Wed Jul 19, 2006 1:51 am
Location: Minnesota

Mon Oct 29, 2007 2:38 pm

psychojoy had (has) the same problem.
 
notfred
Maximum Gerbil
Posts: 4610
Joined: Tue Aug 10, 2004 10:10 am
Location: Ottawa, Canada

Mon Oct 29, 2007 6:57 pm

Yes, I'm not sure quite what is going on but a few people seem to be reporting hangs at WU completion. Unfortunately work has been nuts the past week and looks like continuing the same way this week, plus I'm away over the weekend and then next week I'm in SJ for more work stuff so I don't have the time to troubleshoot this further. If anyone comes up with a solution I can probably patch it in fairly quickly.

One idea might be to try firing up the client with --config-only, or firing it up, killing it after 5 minutes and then restarting it, or some other similarly bizarre method. If anyone wants to work on it, the source code is on the website; you'll need to be running a 64-bit Linux version (I run Ubuntu 7.10, and there are instructions in the README file for extra packages that the build needs), and the main script is initrd_dir/init, which is an ash shell script.
 
Hoobie7
Gerbil
Topic Author
Posts: 21
Joined: Mon Oct 29, 2007 11:27 am

Mon Oct 29, 2007 8:00 pm

cass - Both machines are diskless and identical except for the proc.

notfred - Please, don't let me rush you. I'm just happy knowing I'm not the only one. Ya, I saw the source there, but this is all way over my head.

What I wish for more than anything else when troubleshooting a problem would be to have a little Uber nerd angel sitting on my shoulder telling me what I'm doing wrong and how to fix it. Is that just me? :D

Another question though, or a few. I'm using the TFTPD32 app on my Win 2K3 box for serving it. And I've come up with a couple questions:

1. I never get any info in the TFTP server tab's window, is this normal?

2. I currently have an IP pool of 2, one for each machine. If I reboot a node without first deleting its entry in the DHCP tab's window it won't give the node an IP. It knows the MAC address, why doesn't it reassign the same IP? Is this a limitation of the program?

3. I had the IP pool set higher, to like 10 or 20, but once in a while the DHCP server would freak out, often when I rebooted a node. It would start assigning IPs to a nonexistent MAC address of 46:46:3A:46:46:3A, usually like 10 or 15 IPs in a row. Again, is this the software?

4. Finally, I've got a couple thumb drives, if I go the bootable USB drive route and dump the whole DHCP/TFTP server thing will it work better? Is this 100% hang a problem with them too?
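
A side note on the odd MAC address in question 3: read as ASCII characters, its six octets spell out the text "FF:FF:". That looks like the first six characters of a broadcast address written as text (FF:FF:FF:FF:FF:FF) being stored as raw bytes somewhere. Whether that is really what happens inside TFTPD32 is only a guess. The decoding itself is easy to check with a small Python sketch (`mac_to_ascii` is a made-up helper name):

```python
def mac_to_ascii(mac):
    """Interpret each octet of a colon-separated MAC address as an ASCII character."""
    octets = bytes(int(part, 16) for part in mac.split(":"))
    return octets.decode("ascii")


if __name__ == "__main__":
    # 0x46 is 'F' and 0x3A is ':' in ASCII.
    print(mac_to_ascii("46:46:3A:46:46:3A"))  # prints FF:FF:
```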

Thanks very much!
Hoobie
 
Hoobie7
Gerbil
Topic Author
Posts: 21
Joined: Mon Oct 29, 2007 11:27 am

Tue Oct 30, 2007 10:43 pm

Ok, screw it. Diskless is more trouble than it's worth! I came home today to both nodes hung at 100%. I got one onto a thumb drive, booted off that, and saved that WU, I think. The other may be a write-off. So we're gonna try something different: a thumb drive in each board, and they're both humming along again. DHCP is now on the router again; I didn't have much trust in that TFTPD32 app, it seemed flaky. But with the same thing just on thumb drives, all seems to be better now. We'll see. notfred, thanks for the software, I think this will run as intended now.

Thanks
Hoobie :D
 
Flying Fox
Gerbil God
Posts: 25690
Joined: Mon May 24, 2004 2:19 am

Tue Oct 30, 2007 11:20 pm

Strange. Other than the fact that my unpatched Win2K VM running TFTP and ICS screwed up on the time change last weekend, my TFTP+ICS and diskless VMs team setup continues to work. Actually, I haven't seen the 100% hang problem (knock on wood) since I refreshed the .iso with the new client. Maybe it is the temperature cooling off, I don't know. :P
The Model M is not for the faint of heart. You either like them or hate them.

Gerbils unite! Fold for UnitedGerbilNation, team 2630.
