New release of diskless folding suite

Come join the... uh... er... fold.

Moderators: just brew it!, farmpuma

Re: New release of diskless folding suite

Posted on Wed Aug 13, 2008 10:10 pm

If the workunits are crashing before reaching 100% then it isn't this issue, I've had a couple of them bail out half way through, but nothing in the last two weeks.

If the workunits are getting to the end and uploading to Stanford but then not downloading, that is a well-known hang, but I don't think -forceasm has anything to do with it. -forceasm addresses a different issue: if the client is not shut down cleanly before being restarted, then on restart it disables SSE/3DNow and falls back to basic math, which runs a lot slower. That behaviour is meant to let the client run on machines that overheat when using SSE/3DNow. In the case of my diskless stuff there is no clean-shutdown flag, because you can just power the machines off (make sure one isn't actually in the middle of a backup, though!) and on restart they restore from the backup location. So -forceasm is about client start, not core exit.

The hang is actually due to the cores receiving the STOP signal. I've just uploaded a new version that should check for that and send them the CONT signal to keep them folding.
notfred
Grand Gerbil Poohbah
 
Posts: 3726
Joined: Tue Aug 10, 2004 10:10 am
Location: Ottawa, Canada

Re: New release of diskless folding suite

Posted on Wed Aug 13, 2008 10:40 pm

1) "waiting 10 seconds for USB drives to start" after booting from a USB drive is a logical error. It would also be nice if you could eat some of the lines printed when you plug in a USB drive, since they take up more than an entire screen.

2) "sh: fold.txt: unknown operand" error still shows up at some point in the boot process.

3) I think I'm still getting the problem where machines booted off your ISO have networking trouble if my gateway restarts while they're running. The gateway (but not the folding PCs) restarted this morning; I can't tell exactly what the result was, but strangely it appears I got credited with 2 extra WU submissions:

Code:
[01:10:53] Project: 2662 (Run 2, Clone 167, Gen 7)
[01:10:53]
[01:10:53] Assembly optimizations on if available.
[01:10:53] Entering M.D.
[01:20:45] Completed 2500 out of 250000 steps  (1%)
[01:30:30] Completed 5000 out of 250000 steps  (2%)
[01:40:13] Completed 7500 out of 250000 steps  (3%)
[01:49:58] Completed 10000 out of 250000 steps  (4%)
[01:59:43] Completed 12500 out of 250000 steps  (5%)
[02:09:29] Completed 15000 out of 250000 steps  (6%)
[02:19:12] Completed 17500 out of 250000 steps  (7%)
[02:28:58] Completed 20000 out of 250000 steps  (8%)
[02:38:43] Completed 22500 out of 250000 steps  (9%)
[02:48:27] Completed 25000 out of 250000 steps  (10%)
[02:58:11] Completed 27500 out of 250000 steps  (11%)
[03:07:56] Completed 30000 out of 250000 steps  (12%)
[03:17:41] Completed 32500 out of 250000 steps  (13%)
[03:27:27] Completed 35000 out of 250000 steps  (14%)
[03:37:12] Completed 37500 out of 250000 steps  (15%)
[03:46:56] Completed 40000 out of 250000 steps  (16%)
[03:56:42] Completed 42500 out of 250000 steps  (17%)
[04:06:27] Completed 45000 out of 250000 steps  (18%)
[04:16:11] Completed 47500 out of 250000 steps  (19%)
[04:25:55] Completed 50000 out of 250000 steps  (20%)
[04:35:42] Completed 52500 out of 250000 steps  (21%)
[04:45:27] Completed 55000 out of 250000 steps  (22%)
[04:55:11] Completed 57500 out of 250000 steps  (23%)
[05:04:55] Completed 60000 out of 250000 steps  (24%)
[05:14:39] Completed 62500 out of 250000 steps  (25%)
[05:24:22] Completed 65000 out of 250000 steps  (26%)
[05:34:07] Completed 67500 out of 250000 steps  (27%)
[05:43:51] Completed 70000 out of 250000 steps  (28%)
[05:53:36] Completed 72500 out of 250000 steps  (29%)
[06:03:20] Completed 75000 out of 250000 steps  (30%)
[06:13:05] Completed 77500 out of 250000 steps  (31%)
[06:22:50] Completed 80000 out of 250000 steps  (32%)
[06:32:34] Completed 82500 out of 250000 steps  (33%)
[06:42:20] Completed 85000 out of 250000 steps  (34%)
[06:52:04] Completed 87500 out of 250000 steps  (35%)
[06:57:14] - Autosending finished units...
[06:57:14] Trying to send all finished work units
[06:57:14] + No unsent completed units remaining.
[06:57:14] - Autosend completed
[07:01:48] Completed 90000 out of 250000 steps  (36%)
[07:11:31] Completed 92500 out of 250000 steps  (37%)
[07:21:15] Completed 95000 out of 250000 steps  (38%)
[07:30:59] Completed 97500 out of 250000 steps  (39%)
[07:40:43] Completed 100000 out of 250000 steps  (40%)
[07:50:27] Completed 102500 out of 250000 steps  (41%)
[08:00:11] Completed 105000 out of 250000 steps  (42%)
[08:09:56] Completed 107500 out of 250000 steps  (43%)
[08:19:39] Completed 110000 out of 250000 steps  (44%)
[08:29:23] Completed 112500 out of 250000 steps  (45%)
[08:39:09] Completed 115000 out of 250000 steps  (46%)
[08:48:54] Completed 117500 out of 250000 steps  (47%)
[08:58:38] Completed 120000 out of 250000 steps  (48%)
[09:08:23] Completed 122500 out of 250000 steps  (49%)
[09:18:07] Completed 125000 out of 250000 steps  (50%)
[09:27:51] Completed 127500 out of 250000 steps  (51%)
[09:37:35] Completed 130000 out of 250000 steps  (52%)
[09:47:19] Completed 132500 out of 250000 steps  (53%)
[09:57:02] Completed 135000 out of 250000 steps  (54%)
[10:06:46] Completed 137500 out of 250000 steps  (55%)
[10:16:29] Completed 140000 out of 250000 steps  (56%)
[10:26:13] Completed 142500 out of 250000 steps  (57%)
[10:35:56] Completed 145000 out of 250000 steps  (58%)
[10:45:41] Completed 147500 out of 250000 steps  (59%)
[10:55:26] Completed 150000 out of 250000 steps  (60%)
[11:05:12] Completed 152500 out of 250000 steps  (61%)
[11:14:56] Completed 155000 out of 250000 steps  (62%)
[11:24:41] Completed 157500 out of 250000 steps  (63%)
[11:34:25] Completed 160000 out of 250000 steps  (64%)
[11:44:09] Completed 162500 out of 250000 steps  (65%)
[11:53:55] Completed 165000 out of 250000 steps  (66%)
[12:03:40] Completed 167500 out of 250000 steps  (67%)
[12:13:25] Completed 170000 out of 250000 steps  (68%)
[12:23:09] Completed 172500 out of 250000 steps  (69%)
[12:32:53] Completed 175000 out of 250000 steps  (70%)
[12:42:38] Completed 177500 out of 250000 steps  (71%)
[12:52:21] Completed 180000 out of 250000 steps  (72%)
[12:57:14] - Autosending finished units...
[12:57:14] Trying to send all finished work units
[12:57:14] + No unsent completed units remaining.
[12:57:14] - Autosend completed
[13:02:05] Completed 182500 out of 250000 steps  (73%)
[13:11:48] Completed 185000 out of 250000 steps  (74%)
[13:21:34] Completed 187500 out of 250000 steps  (75%)
[13:31:18] Completed 190000 out of 250000 steps  (76%)
[13:41:03] Completed 192500 out of 250000 steps  (77%)
[13:50:46] Completed 195000 out of 250000 steps  (78%)
[14:00:30] Completed 197500 out of 250000 steps  (79%)
[14:10:15] Completed 200000 out of 250000 steps  (80%)
[14:20:00] Completed 202500 out of 250000 steps  (81%)
[14:29:46] Completed 205000 out of 250000 steps  (82%)
[14:39:30] Completed 207500 out of 250000 steps  (83%)
[14:49:14] Completed 210000 out of 250000 steps  (84%)
[14:58:59] Completed 212500 out of 250000 steps  (85%)
[15:08:43] Completed 215000 out of 250000 steps  (86%)
[15:18:28] Completed 217500 out of 250000 steps  (87%)
[15:28:13] Completed 220000 out of 250000 steps  (88%)
[15:37:57] Completed 222500 out of 250000 steps  (89%)
[15:47:41] Completed 225000 out of 250000 steps  (90%)
[15:57:25] Completed 227500 out of 250000 steps  (91%)
[16:07:10] Completed 230000 out of 250000 steps  (92%)
[16:16:54] Completed 232500 out of 250000 steps  (93%)
[16:26:38] Completed 235000 out of 250000 steps  (94%)
[16:36:22] Completed 237500 out of 250000 steps  (95%)
[16:46:06] Completed 240000 out of 250000 steps  (96%)
[16:55:50] Completed 242500 out of 250000 steps  (97%)
[17:05:34] Completed 245000 out of 250000 steps  (98%)
[17:15:18] Completed 247500 out of 250000 steps  (99%)
[17:25:03] Completed 250000 out of 250000 steps  (100%)
[17:26:04]
[17:26:04] Finished Work Unit:
[17:26:04] - Reading up to 21310704 from "work/wudata_00.trr": Read 21310704
[17:26:04] trr file hash check passed.
[17:26:04] - Reading up to 4698256 from "work/wudata_00.xtc": Read 4698256
[17:26:04] xtc file hash check passed.
[17:26:04] edr file hash check passed.
[17:26:04] logfile size: 181237
[17:26:04] Leaving Run
[17:26:05] - Writing 26441221 bytes of core data to disk...
[17:26:05]   ... Done.
[17:26:05] - Shutting down core
[17:26:05]
[17:26:05] Folding@home Core Shutdown: FINISHED_UNIT
[17:29:25] CoreStatus = 64 (100)
[17:29:25] Unit 0 finished with 77 percent of time to deadline remaining.
[17:29:25] Updated performance fraction: 0.765363
[17:29:25] Sending work to server


[17:29:25] + Attempting to send results
[17:29:25] - Reading file work/wuresults_00.dat from core
[17:29:25]   (Read 26441221 bytes from disk)
[17:29:25] Connecting to http://171.64.65.56:8080/
[17:32:28] Posted data.
[17:32:28] Initial: 0000; - Uploaded at ~138 kB/s
[17:32:31] - Averaged speed for that direction ~135 kB/s
[17:32:31] + Results successfully sent
[17:32:31] Thank you for your contribution to Folding@Home.
[17:32:31] + Number of Units Completed: 6

[17:32:43] - Warning: Could not delete all work unit files (0): Core file absent
[17:32:43] Trying to send all finished work units
[17:32:43] + No unsent completed units remaining.
[17:32:43] - Preparing to get new work unit...
[17:32:43] + Attempting to get work packet
[17:32:43] - Will indicate memory of 2010 MB
[17:32:43] - Connecting to assignment server
[17:32:43] Connecting to http://assign.stanford.edu:8080/
[17:32:43] Posted data.
[17:32:43] Initial: 40AB; - Successful: assigned to (171.64.65.56).
[17:32:43] + News From Folding@Home: Welcome to Folding@Home
[17:32:43] Loaded queue successfully.
[17:32:43] Connecting to http://171.64.65.56:8080/
[17:32:48] Posted data.
[17:32:48] Initial: 0000; - Receiving payload (expected size: 4921543)
[17:33:04] - Downloaded at ~300 kB/s
[17:33:04] - Averaged speed for that direction ~332 kB/s
[17:33:04] + Received work.
[17:33:04] Trying to send all finished work units
[17:33:04] + No unsent completed units remaining.
[17:33:04] + Closed connections
[17:33:04]
[17:33:04] + Processing work unit
[17:33:04] Core required: FahCore_a2.exe
[17:33:04] Core found.
[17:33:04] Working on Unit 01 [August 13 17:33:04]
[17:33:04] + Working ...
[17:33:04] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 01 -checkpoint 10 -forceasm -verbose -lifeline 541 -version 602'

[17:33:04]
[17:33:04] *------------------------------*
[17:33:04] Folding@Home Gromacs SMP Core
[17:33:04] Version 2.00 (Wed Jul 9 13:11:25 PDT 2008)
[17:33:04]
[17:33:04] Preparing to commence simulation
[17:33:04] - Ensuring status. Please wait.
[17:33:14] - Assembly optimizations manually forced on.
[17:33:14] - Not checking prior termination.
[17:33:15] - Expanded 4921031 -> 24360573 (decompressed 495.0 percent)
[17:33:15] Called DecompressByteArray: compressed_data_size=4921031 data_size=24360573, decompressed_data_size=24360573 diff=0
[17:33:15] - Digital signature verified
[17:33:15]
[17:33:15] Project: 2662 (Run 2, Clone 236, Gen 9)
[17:33:15]
[17:33:15] Assembly optimizations on if available.
[17:33:15] Entering M.D.
[17:33:21] Will resume from checkpoint file
[17:33:23] Resuming from checkpoint
[17:33:23] File work/wudata_01.log has changed since last checkpoint
[17:33:27] CoreStatus = FF (255)
[17:33:27] Client-core communications error: ERROR 0xff
[17:33:27] Deleting current work unit & continuing...
[17:33:40] - Warning: Could not delete all work unit files (1): Core file absent
[17:33:40] Trying to send all finished work units
[17:33:40] + No unsent completed units remaining.
[17:33:40] - Preparing to get new work unit...
[17:33:40] + Attempting to get work packet
[17:33:40] - Will indicate memory of 2010 MB
[17:33:40] - Connecting to assignment server
[17:33:40] Connecting to http://assign.stanford.edu:8080/
[17:33:41] Posted data.
[17:33:41] Initial: 40AB; - Successful: assigned to (171.64.65.56).
[17:33:41] + News From Folding@Home: Welcome to Folding@Home
[17:33:41] Loaded queue successfully.
[17:33:41] Connecting to http://171.64.65.56:8080/
[17:33:46] Posted data.
[17:33:46] Initial: 0000; - Receiving payload (expected size: 4921543)
[17:33:59] - Downloaded at ~369 kB/s
[17:33:59] - Averaged speed for that direction ~339 kB/s
[17:33:59] + Received work.
[17:33:59] + Closed connections
[17:34:04]
[17:34:04] + Processing work unit
[17:34:04] Core required: FahCore_a2.exe
[17:34:04] Core found.
[17:34:04] Working on Unit 02 [August 13 17:34:04]
[17:34:04] + Working ...
[17:34:04] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 02 -checkpoint 10 -forceasm -verbose -lifeline 541 -version 602'

[17:34:04]
[17:34:04] *------------------------------*
[17:34:04] Folding@Home Gromacs SMP Core
[17:34:04] Version 2.00 (Wed Jul 9 13:11:25 PDT 2008)
[17:34:04]
[17:34:04] Preparing to commence simulation
[17:34:04] - Ensuring status. Please wait.
[17:34:14] - Assembly optimizations manually forced on.
[17:34:14] - Not checking prior termination.
[17:34:15] - Expanded 4921031 -> 24360573 (decompressed 495.0 percent)
[17:34:15] Called DecompressByteArray: compressed_data_size=4921031 data_size=24360573, decompressed_data_size=24360573 diff=0
[17:34:15] - Digital signature verified
[17:34:15]
[17:34:15] Project: 2662 (Run 2, Clone 236, Gen 9)
[17:34:15]
[17:34:15] Assembly optimizations on if available.
[17:34:15] Entering M.D.
[17:34:21] Will resume from checkpoint file
[17:34:23] Resuming from checkpoint
[17:34:23] Verified work/wudata_02.log
[17:34:23] Verified work/wudata_02.trr
[17:34:23] File work/wudata_02.xtc has changed since last checkpoint
[17:34:27] CoreStatus = FF (255)
[17:34:27] Client-core communications error: ERROR 0xff
[17:34:27] Deleting current work unit & continuing...
[17:34:40] - Warning: Could not delete all work unit files (2): Core file absent
[17:34:40] Trying to send all finished work units
[17:34:40] + No unsent completed units remaining.
[17:34:40] - Preparing to get new work unit...
[17:34:40] + Attempting to get work packet
[17:34:40] - Will indicate memory of 2010 MB
[17:34:40] - Connecting to assignment server
[17:34:40] Connecting to http://assign.stanford.edu:8080/
[17:34:41] Posted data.
[17:34:41] Initial: 40AB; - Successful: assigned to (171.64.65.56).
[17:34:41] + News From Folding@Home: Welcome to Folding@Home
[17:34:41] Loaded queue successfully.
[17:34:41] Connecting to http://171.64.65.56:8080/
[17:34:46] Posted data.
[17:34:46] Initial: 0000; - Receiving payload (expected size: 4921543)
[17:35:04] - Downloaded at ~267 kB/s
[17:35:04] - Averaged speed for that direction ~325 kB/s
[17:35:04] + Received work.
[17:35:04] + Closed connections
[17:35:09]
[17:35:09] + Processing work unit
[17:35:09] Core required: FahCore_a2.exe
[17:35:09] Core found.
[17:35:09] Working on Unit 03 [August 13 17:35:09]
[17:35:09] + Working ...
[17:35:09] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 03 -checkpoint 10 -forceasm -verbose -lifeline 541 -version 602'

[17:35:09]
[17:35:09] *------------------------------*
[17:35:09] Folding@Home Gromacs SMP Core
[17:35:09] Version 2.00 (Wed Jul 9 13:11:25 PDT 2008)
[17:35:09]
[17:35:09] Preparing to commence simulation
[17:35:09] - Ensuring status. Please wait.
[17:35:18] - Assembly optimizations manually forced on.
[17:35:18] - Not checking prior termination.
[17:35:19] - Expanded 4921031 -> 24360573 (decompressed 495.0 percent)
[17:35:20] Called DecompressByteArray: compressed_data_size=4921031 data_size=24360573, decompressed_data_size=24360573 diff=0
[17:35:20] - Digital signature verified
[17:35:20]
[17:35:20] Project: 2662 (Run 2, Clone 236, Gen 9)
[17:35:20]
[17:35:20] Assembly optimizations on if available.
[17:35:20] Entering M.D.
[17:35:26] Will resume from checkpoint file
[17:35:27] Resuming from checkpoint
[17:35:27] fcSaveRestoreState: I/O failed dir=0, var=0000000001EC2240, varsize=591120
[17:35:27] Verified work/wudata_03.log
[17:35:27] Verified work/wudata_03.trr
[17:35:27] Verified work/wudata_03.xtc
[17:35:27] Verified work/wudata_03.edr
[17:35:27] Completed 247510 out of 250000 steps  (99%)
[17:45:11] Completed 250000 out of 250000 steps  (100%)
[17:46:12]
[17:46:12] Finished Work Unit:
[17:46:12] - Reading up to 21212784 from "work/wudata_03.trr": Read 21212784
[17:46:12] trr file hash check passed.
[17:46:12] - Reading up to 4472408 from "work/wudata_03.xtc": Read 4472408
[17:46:12] xtc file hash check passed.
[17:46:12] edr file hash check passed.
[17:46:12] logfile size: 181621
[17:46:12] Leaving Run
[17:46:16] - Writing 26119277 bytes of core data to disk...
[17:46:16]   ... Done.
[17:46:16] - Shutting down core
[17:46:16]
[17:46:16] Folding@home Core Shutdown: FINISHED_UNIT
[17:49:37] CoreStatus = 64 (100)
[17:49:37] Unit 3 finished with 100 percent of time to deadline remaining.
[17:49:37] Updated performance fraction: 0.811617
[17:49:37] Sending work to server


[17:49:37] + Attempting to send results
[17:49:37] - Reading file work/wuresults_03.dat from core
[17:49:37]   (Read 26119277 bytes from disk)
[17:49:37] Connecting to http://171.64.65.56:8080/
[17:52:37] Posted data.
[17:52:37] Initial: 0000; - Uploaded at ~140 kB/s
[17:52:39] - Averaged speed for that direction ~136 kB/s
[17:52:39] + Results successfully sent
[17:52:39] Thank you for your contribution to Folding@Home.
[17:52:39] + Number of Units Completed: 7

[17:52:53] - Warning: Could not delete all work unit files (3): Core file absent
[17:52:53] Trying to send all finished work units
[17:52:53] + No unsent completed units remaining.
[17:52:53] - Preparing to get new work unit...
[17:52:53] + Attempting to get work packet
[17:52:53] - Will indicate memory of 2010 MB
[17:52:53] - Connecting to assignment server
[17:52:53] Connecting to http://assign.stanford.edu:8080/
[17:52:53] Posted data.
[17:52:53] Initial: 40AB; - Successful: assigned to (171.64.65.56).
[17:52:53] + News From Folding@Home: Welcome to Folding@Home
[17:52:53] Loaded queue successfully.
[17:52:53] Connecting to http://171.64.65.56:8080/
[17:52:58] Posted data.
[17:52:58] Initial: 0000; - Receiving payload (expected size: 4920188)
[17:53:08] - Downloaded at ~480 kB/s
[17:53:08] - Averaged speed for that direction ~356 kB/s
[17:53:08] + Received work.
[17:53:08] Trying to send all finished work units
[17:53:08] + No unsent completed units remaining.
[17:53:08] + Closed connections
[17:53:08]
[17:53:08] + Processing work unit
[17:53:08] Core required: FahCore_a2.exe
[17:53:08] Core found.
[17:53:08] Working on Unit 04 [August 13 17:53:08]
[17:53:08] + Working ...
[17:53:08] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 04 -checkpoint 10 -forceasm -verbose -lifeline 541 -version 602'

[17:53:08]
[17:53:08] *------------------------------*
[17:53:08] Folding@Home Gromacs SMP Core
[17:53:08] Version 2.00 (Wed Jul 9 13:11:25 PDT 2008)
[17:53:08]
[17:53:08] Preparing to commence simulation
[17:53:08] - Ensuring status. Please wait.
[17:53:18] - Assembly optimizations manually forced on.
[17:53:18] - Not checking prior termination.
[17:53:19] - Expanded 4919676 -> 24360573 (decompressed 495.1 percent)
[17:53:19] Called DecompressByteArray: compressed_data_size=4919676 data_size=24360573, decompressed_data_size=24360573 diff=0
[17:53:19] - Digital signature verified
[17:53:19]
[17:53:19] Project: 2662 (Run 2, Clone 153, Gen 5)
[17:53:19]
[17:53:19] Assembly optimizations on if available.
[17:53:19] Entering M.D.
[17:53:25] Will resume from checkpoint file
[17:53:27] Resuming from checkpoint
[17:53:27] fcSaveRestoreState: I/O failed dir=0, var=0000000001EC2240, varsize=585852
[17:53:27] Verified work/wudata_04.log
[17:53:27] Verified work/wudata_04.trr
[17:53:27] Verified work/wudata_04.xtc
[17:53:27] Verified work/wudata_04.edr
[17:53:27] Completed 247510 out of 250000 steps  (99%)
[18:03:09] Completed 250000 out of 250000 steps  (100%)
[18:04:10]
[18:04:10] Finished Work Unit:
[18:04:10] - Reading up to 21310704 from "work/wudata_04.trr": Read 21310704
[18:04:10] trr file hash check passed.
[18:04:10] - Reading up to 4713096 from "work/wudata_04.xtc": Read 4713096
[18:04:10] xtc file hash check passed.
[18:04:10] edr file hash check passed.
[18:04:10] logfile size: 181744
[18:04:10] Leaving Run
[18:04:11] - Writing 26459448 bytes of core data to disk...
[18:04:11]   ... Done.
[18:04:11] - Shutting down core
[18:04:11]
[18:04:11] Folding@home Core Shutdown: FINISHED_UNIT
[18:07:31] CoreStatus = 64 (100)
[18:07:31] Unit 4 finished with 100 percent of time to deadline remaining.
[18:07:31] Updated performance fraction: 0.848628
[18:07:31] Sending work to server


[18:07:31] + Attempting to send results
[18:07:31] - Reading file work/wuresults_04.dat from core
[18:07:31]   (Read 26459448 bytes from disk)
[18:07:31] Connecting to http://171.64.65.56:8080/
[18:10:35] Posted data.
[18:10:35] Initial: 0000; - Uploaded at ~138 kB/s
[18:10:38] - Averaged speed for that direction ~136 kB/s
[18:10:38] + Results successfully sent
[18:10:38] Thank you for your contribution to Folding@Home.
[18:10:38] + Number of Units Completed: 8

[18:10:52] - Warning: Could not delete all work unit files (4): Core file absent
[18:10:52] Trying to send all finished work units
[18:10:52] + No unsent completed units remaining.
[18:10:52] - Preparing to get new work unit...
[18:10:52] + Attempting to get work packet
[18:10:52] - Will indicate memory of 2010 MB
[18:10:52] - Connecting to assignment server
[18:10:52] Connecting to http://assign.stanford.edu:8080/
[18:10:52] Posted data.
[18:10:52] Initial: 40AB; - Successful: assigned to (171.64.65.56).
[18:10:52] + News From Folding@Home: Welcome to Folding@Home
[18:10:52] Loaded queue successfully.
[18:10:52] Connecting to http://171.64.65.56:8080/
[18:10:57] Posted data.
[18:10:57] Initial: 0000; - Receiving payload (expected size: 5001213)
[18:11:08] - Downloaded at ~443 kB/s
[18:11:08] - Averaged speed for that direction ~373 kB/s
[18:11:08] + Received work.
[18:11:08] Trying to send all finished work units
[18:11:08] + No unsent completed units remaining.
[18:11:08] + Closed connections
[18:11:08]
[18:11:08] + Processing work unit
[18:11:08] Core required: FahCore_a2.exe
[18:11:08] Core found.
[18:11:08] Working on Unit 05 [August 13 18:11:08]
[18:11:08] + Working ...
[18:11:08] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 05 -checkpoint 10 -forceasm -verbose -lifeline 541 -version 602'

[18:11:08]
[18:11:08] *------------------------------*
[18:11:08] Folding@Home Gromacs SMP Core
[18:11:08] Version 2.00 (Wed Jul 9 13:11:25 PDT 2008)
[18:11:08]
[18:11:08] Preparing to commence simulation
[18:11:08] - Ensuring status. Please wait.
[18:11:18] - Assembly optimizations manually forced on.
[18:11:18] - Not checking prior termination.
[18:11:19] - Expanded 5000701 -> 24742709 (decompressed 494.7 percent)
[18:11:20] Called DecompressByteArray: compressed_data_size=5000701 data_size=24742709, decompressed_data_size=24742709 diff=0
[18:11:20] - Digital signature verified
[18:11:20]
[18:11:20] Project: 2662 (Run 1, Clone 113, Gen 12)
[18:11:20]
[18:11:20] Assembly optimizations on if available.
[18:11:20] Entering M.D.
[18:21:30] Completed 2500 out of 250000 steps  (1%)
[18:31:31] Completed 5000 out of 250000 steps  (2%)
[18:41:33] Completed 7500 out of 250000 steps  (3%)
[18:51:35] Completed 10000 out of 250000 steps  (4%)


I think that may be (or have been) the Linux version's problem, but I don't know.
Ragnar Dan
Gerbil Elder
Silver subscriber
 
 
Posts: 5355
Joined: Sun Jan 20, 2002 7:00 pm

Re: New release of diskless folding suite

Posted on Wed Aug 13, 2008 10:45 pm

Thanks for this, nf. It's now pretty clear to me that -forceasm isn't the culprit. I'm downloading the latest .iso as I type this; possibly one of the other adjustments will make the difference.
ArVee
Gerbil
 
Posts: 21
Joined: Sun May 18, 2008 12:21 pm

Re: New release of diskless folding suite

Posted on Wed Aug 20, 2008 8:53 pm

I was under the impression that to run an instance for every two processors, I just had to set that option when I downloaded the iso. I'm running this with VMWare Server 1.06 and it only sets up one instance with my Q6600. Isn't it supposed to set up two, or will it not work this way with VMWare?
slugbug
Gerbil
 
Posts: 15
Joined: Wed Aug 20, 2008 12:29 pm

Re: New release of diskless folding suite

Posted on Wed Aug 20, 2008 9:09 pm

slugbug wrote:I was under the impression that to run an instance for every two processors, I just had to set that option when I downloaded the iso. I'm running this with VMWare Server 1.06 and it only sets up one instance with my Q6600. Isn't it supposed to set up two, or will it not work this way with VMWare?

AFAIK VMware Server supports only up to 2 "virtual CPUs" (read: the VM will use at most 2 cores).
The Model M is not for the faint of heart. You either like them or hate them.

Gerbils unite! Fold for UnitedGerbilNation, team 2630.
Flying Fox
Gerbil God
 
Posts: 24357
Joined: Mon May 24, 2004 2:19 am

Re: New release of diskless folding suite

Posted on Thu Aug 21, 2008 11:14 am

I followed LumberJack's guide on page one of this thread to a T and am still unable to get this to save backups onto my USB thumb drive. Would it be a problem that I'm running two diskless folding instances and using one thumb drive for both?
slugbug
Gerbil
 
Posts: 15
Joined: Wed Aug 20, 2008 12:29 pm

Re: New release of diskless folding suite

Posted on Thu Aug 21, 2008 11:39 am

How are you running them? If it is two instances within the same OS, then that's fine: they should write to directories 1 and 2. If you are running 2 different VMs and have the same USB stick exported to both, then that is a problem, as they will both think they are instance 1.
notfred
Grand Gerbil Poohbah
 
Posts: 3726
Joined: Tue Aug 10, 2004 10:10 am
Location: Ottawa, Canada

Re: New release of diskless folding suite

Posted on Thu Aug 21, 2008 5:07 pm

Thanks for the response. By the way, what's the trick to getting FahMon to display the diskless folding clients' speeds properly? I've tried both bridged and NAT with VMWare, and most of the time FahMon shows the client as Hung even though it is working properly. Only rarely will it show the folding speed.
slugbug
Gerbil
 
Posts: 15
Joined: Wed Aug 20, 2008 12:29 pm

Re: New release of diskless folding suite

Posted on Fri Aug 22, 2008 8:22 am

Make sure you set the asynchronous clocks flag in FAHMon. There is a well-known problem with timekeeping in VMWare; I've set everything that I know about in the .vmx file, but the clock still appears to be all over the place.
notfred
Grand Gerbil Poohbah
 
Posts: 3726
Joined: Tue Aug 10, 2004 10:10 am
Location: Ottawa, Canada

Re: New release of diskless folding suite

Posted on Sat Aug 23, 2008 1:45 pm

I have no luck at all with this A2 core :cry: : "Out of memory: kill process 458 (mpiexec) score 5007 or a child / Killed process 463 (FahCore_a2.exe)".
Every one craps out after only 3 or 4 steps are completed.
slugbug
Gerbil
 
Posts: 15
Joined: Wed Aug 20, 2008 12:29 pm

Re: New release of diskless folding suite

Posted on Sat Aug 23, 2008 6:20 pm

Bump up the memory if you are seeing the Out Of Memory killer start killing processes. It seems to need over 512MB.
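If you want to confirm the OOM killer really is what's terminating the core before resizing the VM, the kernel log records each kill. A rough check, assuming the standard dmesg and free tools are available (dmesg may need root on some setups):

```shell
# See whether the kernel's OOM killer has fired recently; no output
# means no recorded kills (or dmesg access was denied).
dmesg 2>/dev/null | grep -i "out of memory" || true

# Check how much RAM the VM actually has; the a2 core seems to want
# well over 512 MB total.
free -m
```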
notfred
Grand Gerbil Poohbah
 
Posts: 3726
Joined: Tue Aug 10, 2004 10:10 am
Location: Ottawa, Canada

Re: New release of diskless folding suite

Posted on Sun Aug 24, 2008 10:47 am

I'm through the memory problem with 2662 (fixed by a bump to 1024), but now it appears Stanford has come up with another snag in the form of the new core a_2.01. Released earlier this week, it appears to be the cause of hangs after completion and submission of a WU: it sends the work, gives the thank-you message and the local stats count message, and then just sits there, requiring a VMWare restart. I think they said 2.01 was meant to address hangs, but I only had a couple before it came out and I've now had four in a row across two machines. I have the notfred 0813 release, and all was well before this new core update. Is anybody else having this problem? Any fix? Thanks.
ArVee
Gerbil
 
Posts: 21
Joined: Sun May 18, 2008 12:21 pm

Re: New release of diskless folding suite

Posted on Sun Aug 24, 2008 12:37 pm

I've been seeing the hangs as well, but AFAIK I'm not running 2.01 on any systems yet.

I don't have a fix, but it should not be necessary to restart the entire VM. If the killall tool is part of notfred's image (not sure if it is, can someone verify?), all you should need to do is run the following command on the VM whenever you notice that the client is stuck:
Code:
killall -9 FahCore_a2.exe
(this space intentionally left blank)
just brew it!
Administrator
Gold subscriber
 
 
Posts: 37632
Joined: Tue Aug 20, 2002 10:51 pm
Location: Somewhere, having a beer

Re: New release of diskless folding suite

Posted on Sun Aug 24, 2008 1:09 pm

If it needs a nudge I don't mind the VM restart; it resets the amount of RAM in use anyway. It's having to be here to nudge it that's the problem: time lost. You would have got a_2.01 automatically if a_2 was called for; the rollout started mid-to-late week. You can see the core version toward the start of a new WU in the log. It seems too coincidental for these hangs to begin right when 2.01 did, especially since that aspect was fine before it. Ironic, given it was stated that 2.01 was released to address that very issue: hangs at 100%.
ArVee
Gerbil
 
Posts: 21
Joined: Sun May 18, 2008 12:21 pm

Re: New release of diskless folding suite

Posted on Sun Aug 24, 2008 1:27 pm

I guess I'll go back to running two Linux instances with VMWare. These diskless folding problems are getting to be too much.
slugbug
Gerbil
 
Posts: 15
Joined: Wed Aug 20, 2008 12:29 pm

Re: New release of diskless folding suite

Posted on Sun Aug 24, 2008 1:47 pm

My impression is that the problem lies in the interaction between Linux and the core, with the emphasis on the latter, rather than in any shortcoming of notfred's diskless folding. Everything was fine until 2.01, and that came from Stanford.
ArVee
Gerbil
 
Posts: 21
Joined: Sun May 18, 2008 12:21 pm

Re: New release of diskless folding suite

Posted on Sun Aug 24, 2008 3:09 pm

It is definitely an issue with the a2 core. I'm seeing the hangs intermittently too, and I am not using notfred's diskless distro.
(this space intentionally left blank)
just brew it!
Administrator
Gold subscriber
 
 
Posts: 37632
Joined: Tue Aug 20, 2002 10:51 pm
Location: Somewhere, having a beer

Re: New release of diskless folding suite

Posted on Sat Aug 30, 2008 3:30 am

Notfred, I wrote up a Feature Request for you.

Add an SMP instances option for 8-, 12-, and 16-core machines.

Something you might want to look at before the release of Nehalem in a few months.

With the new A2 core out for Linux, this is going to be needed.

Kasson has already been tweaking the Assignment and Work Servers to give out work for those configurations.
EvilAlchemist
Gerbil
 
Posts: 28
Joined: Tue Jan 29, 2008 1:54 am

Re: New release of diskless folding suite

Posted on Sun Aug 31, 2008 10:55 pm

just brew it! wrote:I've been seeing the hangs as well, but AFAIK I'm not running 2.01 on any systems yet.

I don't have a fix, but it should not be necessary to restart the entire VM. If the killall tool is part of notfred's image (not sure if it is, can someone verify?), all you should need to do is run the following command on the VM whenever you notice that the client is stuck:
Code:
killall -9 FahCore_a2.exe

Yes, it's there. The kill command is there too, but it doesn't seem to be the same as other implementations: it won't accept non-numeric signal names like CONT, and I can't find CONT's numeric equivalent, so terminating the core seems to be all that's available.
Ragnar Dan
Gerbil Elder
Silver subscriber
 
 
Posts: 5355
Joined: Sun Jan 20, 2002 7:00 pm

Re: New release of diskless folding suite

Postposted on Mon Sep 01, 2008 10:39 am

CONT should be 18, so "killall -18 FahCore_a2.exe". I put auto-detection of the hang into the last build, but I recently spotted a bug in it, so it will not clear the hang automatically. I have it fixed, but I want to make another change to make upgrading VMs / USB sticks easier before I do a release.
notfred
Grand Gerbil Poohbah
 
Posts: 3726
Joined: Tue Aug 10, 2004 10:10 am
Location: Ottawa, Canada

Re: New release of diskless folding suite

Postposted on Wed Sep 03, 2008 9:00 pm

My Ubuntu 6.x (I forget the exact version number) quit letting me update to the next version through the update manager some months back when I tried. Its man page for kill lists numeric equivalents for lots of signals, but the entry for -CONT was left blank for some reason.

I should probably ask in the Linux forum what I have to do to update, and whether it's worth doing anyway. My experience is that newer versions add more ways to eat cycles and little else, but that was mostly before GPUs were around.

Anyway, thanks for the info. I'll try it if it comes up again.
Ragnar Dan
Gerbil Elder
Silver subscriber
 
 
Posts: 5355
Joined: Sun Jan 20, 2002 7:00 pm

Re: New release of diskless folding suite

Postposted on Thu Sep 04, 2008 8:29 am

Yup, the man page for kill only lists the signal numbers that are the same across different processor architectures; some of them (like CONT) change depending on the processor type. Try "man 7 signal" and note "Where three values are given, the first one is usually valid for alpha and sparc, the middle one for i386, ppc and sh, and the last one for mips." and "SIGCONT 19,18,25 Cont Continue if stopped", so it is 18 on Intel processors.
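For quick reference, on a full shell such as bash you can ask the shell itself instead of digging through man pages (a stripped-down busybox kill may not accept names, which sounds like what Ragnar Dan hit):

```shell
# Look up the number for SIGCONT on the current machine.
# bash's builtin kill accepts signal names with -l:
kill -l CONT
# → 18 on i386/x86_64 Linux (19 on alpha/sparc, 25 on mips)
```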
notfred
Grand Gerbil Poohbah
 
Posts: 3726
Joined: Tue Aug 10, 2004 10:10 am
Location: Ottawa, Canada

Re: New release of diskless folding suite

Postposted on Thu Sep 04, 2008 4:19 pm

Ah, that helps. Now that I'm more aware of things, I've noticed the "See also" list at the bottom of kill's man page, which includes signal(7).

I set up the tftp thing on my Ubuntu just to see how much CPU time it eats, and it doesn't seem to have used any that I can find over the last 20-some hours, which is good. Oddly, sendmail has used 1 second and reports "sendmail: MTA: accepting connections", though I don't know how that can be since I haven't configured anything about it. Anyway.

Next I'll try to see if I can make the machine and the DHCP server stuff work with my gateway to hand out IP addresses, so I can begin using tftp to back up to the hard drive area the Ubuntu VM has access to. Maybe I can also get one of my machines to boot off the network, since it's relying on a Maxtor HD and Win2K and sometimes takes an hour of restarting itself before it settles down and runs. (It has no working CD drive for a reason I'm not sure of yet; it might be the connector, but it looks to be the motherboard.) I have to read your setup instructions and any theory therein to figure out just what's going on and why things have to be done the way they are... and see if I can determine whether rebooting the DHCP server machine would screw up the various machines hanging off of it.
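For what it's worth, if the gateway can't be made to hand out the netboot options, a small dnsmasq config on the Ubuntu box can serve both DHCP and tftp. This is only a sketch; the address range, boot filename, and tftp root below are assumptions, not anyone's actual setup:

```shell
# Hypothetical /etc/dnsmasq.conf fragment for netbooting diskless clients
dhcp-range=192.168.1.100,192.168.1.150,12h   # pool of addresses to hand out
dhcp-boot=pxelinux.0                         # boot file name sent to PXE clients
enable-tftp                                  # serve files over tftp as well
tftp-root=/var/lib/tftpboot                  # directory holding pxelinux.0
```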
Ragnar Dan
Gerbil Elder
Silver subscriber
 
 
Posts: 5355
Joined: Sun Jan 20, 2002 7:00 pm

Re: New release of diskless folding suite

Postposted on Thu Sep 25, 2008 8:26 am

Currently using the latest version of the VM Appliance. I've got the memory bumped to 1GB.

I still get the following two issues on occasion:
- "Reached end of physical disk" (or some such message)
- Hangs after delivering, or while waiting to deliver, a WU and NOT getting a new WU while it's doing so

I've found that restarting the VM seems to correct both issues. The physical disk space issue I believe is related to virtual memory paging (as previously outlined by others); restarting the VM frees up the paging file and I no longer have issues. Restarting also corrects the hanging issue because it forces the client to start over and figure out where it is, i.e. "have I submitted all my WUs? If not, do so, then get a new WU."

Is there a way to get the VM Appliance to restart at specific intervals? I've looked for a way to do it via VMware, but the "tools" that are available require VMware Tools installed in the VM, which is obviously not an option here. I do see that there is CGI code behind the "reboot" hyperlink, so restarting the VM is possible, but I would like to do it every, say, 4 hours. That would prevent any issue with memory leakage and also limit my downtime if a hang occurs.

Feature Request: have the ability to "restart" the VM after a specific period of time has passed

In the meantime, is there a way using the Windows Task Scheduler to call the CGI code that does the restart?
capreppy
Gerbil In Training
 
Posts: 2
Joined: Wed Sep 10, 2008 4:35 pm

Re: New release of diskless folding suite

Postposted on Thu Sep 25, 2008 2:30 pm

I think I have the hang fixed, and I have ideas on disk cleanup that I posted in the other thread. There is no paging file though.

If you were on Linux (or any other Unix-like OS), you could set up a cron job that calls wget to hit the reload link, but I have no idea how to script such stuff in Windows.
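For anyone who is on Linux, a crontab entry along those lines might look like this; the IP address is made up and should be replaced with the folding VM's actual address:

```shell
# Hypothetical crontab entry: hit the VM's reboot CGI every 4 hours.
# Replace 192.168.1.80 with the address of your folding VM.
0 */4 * * * wget -q -O /dev/null http://192.168.1.80/cgi-bin/reboot.cgi
```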
notfred
Grand Gerbil Poohbah
 
Posts: 3726
Joined: Tue Aug 10, 2004 10:10 am
Location: Ottawa, Canada

Re: New release of diskless folding suite

Postposted on Fri Sep 26, 2008 4:13 pm

Found a solution for Windows and restarting the VM. Thanks, notfred, for the wget idea. I found a version of wget for Windows, and this works WELL!!!

Found good information here:
http://mohammednv.wordpress.com/2008/04 ... sing-wget/

I downloaded wget from here:
http://users.ugent.be/~bpuype/wget/wget.exe

Create a batch file containing the following:
"C:\wget\wget" -q -O nul "http://192.168.1.80/cgi-bin/reboot.cgi"

Replace the IP address above with the IP address of the VM. This obviously works REALLY well if you have static IP addresses; with dynamic addressing you'll need to occasionally change the IP address in the batch file. I'm happy now and I don't need to worry about this again!!!!
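To run that batch file on a schedule, the schtasks command that ships with Windows should be able to do it; the task name and batch-file path here are made up, so substitute your own:

```shell
rem Hypothetical: run the restart batch file every 4 hours
schtasks /create /tn "RestartFoldingVM" /tr "C:\wget\restart-vm.bat" /sc hourly /mo 4
```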

note: sorry about not creating hyperlinks above. not used to your editor :)
capreppy
Gerbil In Training
 
Posts: 2
Joined: Wed Sep 10, 2008 4:35 pm

Previous

Return to TR Distributed Computing Effort

Who is online

Users browsing this forum: No registered users and 1 guest