Personal computing discussed
[02:45:21] Completed 247500 out of 250000 steps (99%)
[03:02:14] Completed 250000 out of 250000 steps (100%)
[03:05:00]
[03:05:00] Finished Work Unit:
[03:05:00] - Reading up to 21124224 from "work/wudata_02.trr": Read 21124224
[03:05:01] trr file hash check passed.
[03:05:01] - Reading up to 4489500 from "work/wudata_02.xtc": Read 4489500
[03:05:01] xtc file hash check passed.
[03:05:01] edr file hash check passed.
[03:05:01] logfile size: 198417
[03:05:01] Leaving Run
[03:05:04] - Writing 26255405 bytes of core data to disk...
[03:05:04] ... Done.
[03:06:42] - Shutting down core
[03:28:55] ***** Got a SIGTERM signal (15)
[03:28:55] Killing all core threads
Folding@Home Client Shutdown.
--- Opening Log file [April 9 03:28:58]
# SMP Client ##################################################################
###############################################################################
Folding@Home Client Version 6.02
http://folding.stanford.edu
###############################################################################
###############################################################################
Launch directory: /home/ragnardan/folding/FAH
Executable: ./fah6
Arguments: -local -smp -forceasm -advmethods -verbosity 9
Warning:
By using the -forceasm flag, you are overriding
safeguards in the program. If you did not intend to
do this, please restart the program without -forceasm.
If work units are not completing fully (and particularly
if your machine is overclocked), then please discontinue
use of the flag.
[03:28:58] - Ask before connecting: No
[03:28:58] - User name: Ragnar_Dan (Team 2630)
[03:28:58] - User ID: 1503ECE6554148A8
[03:28:58] - Machine ID: 1
[03:28:58]
[03:28:59] Loaded queue successfully.
[03:28:59]
[03:28:59] + Processing work unit
[03:28:59] Core required: FahCore_a2.exe
[03:28:59] Core found.
[03:28:59] - Autosending finished units...
[03:28:59] Trying to send all finished work units
[03:28:59] + No unsent completed units remaining.
[03:28:59] - Autosend completed
[03:28:59] Working on Unit 02 [April 9 03:28:59]
[03:28:59] + Working ...
[03:28:59] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 02 -priority 96 -checkpoint 5 -forceasm -verbose -lifeline 9247 -version 602'
[03:28:59]
[03:28:59] *------------------------------*
[03:28:59] Folding@Home Gromacs SMP Core
[03:28:59] Version 2.00 (Wed Jul 9 13:11:25 PDT 2008)
[03:28:59]
[03:28:59] Preparing to commence simulation
[03:28:59] - Ensuring status. Please wait.
[03:29:08] - Assembly optimizations manually forced on.
[03:29:08] - Not checking prior termination.
[03:29:08] Need version 206
[03:29:08] Error: Work unit read from disk is invalid
[03:29:10] - Expanded 4836074 -> 23977273 (decompressed 495.8 percent)
[03:29:11] Called DecompressByteArray: compressed_data_size=4836074 data_size=23977273, decompressed_data_size=23977273 diff=0
[03:29:11] - Digital signature verified
[03:29:11]
[03:29:11] Project: 2669 (Run 2, Clone 7, Gen 107)
[03:29:11]
[03:29:11] Assembly optimizations on if available.
[03:29:11] Entering M.D.
[03:49:00] Completed 2500 out of 250000 steps (1%)
[20:03:17] Preparing to commence simulation
[20:03:17] - Ensuring status. Please wait.
[20:03:26] - Assembly optimizations manually forced on.
[20:03:26] - Not checking prior termination.
[20:03:26] Need version 206
[20:03:26] Error: Work unit read from disk is invalid
[20:03:31] - Expanded 4836074 -> 23977273 (decompressed 495.8 percent)
[20:03:32] Called DecompressByteArray: compressed_data_size=4836074 data_size=23977273, decompressed_data_size=23977273 diff=0
[20:03:32] - Digital signature verified
[20:03:32]
[20:03:32] Project: 2669 (Run 2, Clone 7, Gen 107)
[20:03:32]
[20:03:33] Assembly optimizations on if available.
[20:03:33] Entering M.D.
[20:03:39] Will resume from checkpoint file
[20:03:42] Resuming from checkpoint
[20:03:42] Verified work/wudata_02.log
[20:03:44] Verified work/wudata_02.trr
[20:03:44] Verified work/wudata_02.xtc
[20:03:44] Verified work/wudata_02.edr
[20:03:44] Completed 122520 out of 250000 steps (49%)
[11:36:11] Completed 247500 out of 250000 steps (99%)
[11:53:55] Completed 250000 out of 250000 steps (100%)
[11:56:35]
[11:56:35] Finished Work Unit:
[11:56:35] - Reading up to 21124224 from "work/wudata_02.trr": Read 21124224
[11:56:36] trr file hash check passed.
[11:56:36] - Reading up to 4489628 from "work/wudata_02.xtc": Read 4489628
[11:56:36] xtc file hash check passed.
[11:56:36] edr file hash check passed.
[11:56:36] logfile size: 202256
[11:56:36] Leaving Run
[11:56:40] - Writing 26265132 bytes of core data to disk...
[11:56:40] ... Done.
[11:56:44] - Shutting down core
[14:56:00] CoreStatus = 0 (0)
[14:56:00] Sending work to server
[14:56:00] Project: 2669 (Run 2, Clone 7, Gen 107)
[14:56:00] + Attempting to send results [April 10 14:56:00 UTC]
[14:56:00] - Reading file work/wuresults_02.dat from core
[14:56:00] (Read 26265132 bytes from disk)
[14:56:00] Connecting to http://171.64.65.56:8080/
[14:56:01] - Couldn't send HTTP request to server
[14:56:01] + Could not connect to Work Server (results)
[14:56:01] (171.64.65.56:8080)
[14:56:01] + Retrying using alternative port
[14:56:01] Connecting to http://171.64.65.56:80/
[14:56:01] - Couldn't send HTTP request to server
[14:56:01] + Could not connect to Work Server (results)
[14:56:01] (171.64.65.56:80)
[14:56:01] - Error: Could not transmit unit 02 (completed April 10) to work server.
[14:56:01] - 1 failed uploads of this unit.
[14:56:01] Keeping unit 02 in queue.
[14:56:01] Trying to send all finished work units
[14:56:01] Project: 2669 (Run 2, Clone 7, Gen 107)
[14:56:01] + Attempting to send results [April 10 14:56:01 UTC]
[14:56:01] - Reading file work/wuresults_02.dat from core
[14:56:01] (Read 26265132 bytes from disk)
[14:56:01] Connecting to http://171.64.65.56:8080/
[14:56:01] - Couldn't send HTTP request to server
[14:56:01] + Could not connect to Work Server (results)
[14:56:01] (171.64.65.56:8080)
[14:56:01] + Retrying using alternative port
[14:56:01] Connecting to http://171.64.65.56:80/
[14:56:01] - Couldn't send HTTP request to server
[14:56:01] + Could not connect to Work Server (results)
[14:56:01] (171.64.65.56:80)
[14:56:01] - Error: Could not transmit unit 02 (completed April 10) to work server.
[14:56:01] - 2 failed uploads of this unit.
[14:56:01] + Attempting to send results [April 10 14:56:01 UTC]
[14:56:01] - Reading file work/wuresults_02.dat from core
[14:56:01] (Read 26265132 bytes from disk)
[14:56:01] Connecting to http://171.67.108.25:8080/
[14:56:01] - Couldn't send HTTP request to server
[14:56:01] (Got status 503)
[14:56:01] + Could not connect to Work Server (results)
[14:56:01] (171.67.108.25:8080)
[14:56:01] + Retrying using alternative port
[14:56:01] Connecting to http://171.67.108.25:80/
[14:56:02] - Couldn't send HTTP request to server
[14:56:02] (Got status 503)
[14:56:02] + Could not connect to Work Server (results)
[14:56:02] (171.67.108.25:80)
[14:56:02] Could not transmit unit 02 to Collection server; keeping in queue.
[14:56:02] + Sent 0 of 1 completed units to the server
[14:56:02] - Preparing to get new work unit...
[14:56:02] + Attempting to get work packet
[14:56:02] - Will indicate memory of 500 MB
[14:56:02] - Connecting to assignment server
[14:56:02] Connecting to http://assign.stanford.edu:8080/
[14:56:02] Posted data.
[14:56:02] Initial: 40AB; - Successful: assigned to (171.64.65.64).
[14:56:02] + News From Folding@Home: Welcome to Folding@Home
[14:56:02] Loaded queue successfully.
[14:56:02] Connecting to http://171.64.65.64:8080/
[14:56:04] Posted data.
[14:56:04] Initial: 0000; - Receiving payload (expected size: 2437090)
[14:56:10] - Downloaded at ~396 kB/s
[14:56:10] - Averaged speed for that direction ~765 kB/s
[14:56:10] + Received work.
[14:56:10] Trying to send all finished work units
[14:56:10] Project: 2669 (Run 2, Clone 7, Gen 107)
[14:56:10] + Attempting to send results [April 10 14:56:10 UTC]
[14:56:10] - Reading file work/wuresults_02.dat from core
[14:56:10] (Read 26265132 bytes from disk)
[14:56:10] Connecting to http://171.64.65.56:8080/
[14:56:10] - Couldn't send HTTP request to server
[14:56:10] + Could not connect to Work Server (results)
[14:56:10] (171.64.65.56:8080)
[14:56:10] + Retrying using alternative port
[14:56:10] Connecting to http://171.64.65.56:80/
[14:56:10] - Couldn't send HTTP request to server
[14:56:10] + Could not connect to Work Server (results)
[14:56:10] (171.64.65.56:80)
[14:56:10] - Error: Could not transmit unit 02 (completed April 10) to work server.
[14:56:10] - 3 failed uploads of this unit.
[14:56:10] + Attempting to send results [April 10 14:56:10 UTC]
[14:56:10] - Reading file work/wuresults_02.dat from core
[14:56:10] (Read 26265132 bytes from disk)
[14:56:10] Connecting to http://171.67.108.25:8080/
[14:56:10] - Couldn't send HTTP request to server
[14:56:10] (Got status 503)
[14:56:10] + Could not connect to Work Server (results)
[14:56:10] (171.67.108.25:8080)
[14:56:10] + Retrying using alternative port
[14:56:10] Connecting to http://171.67.108.25:80/
[14:56:11] - Couldn't send HTTP request to server
[14:56:11] (Got status 503)
[14:56:11] + Could not connect to Work Server (results)
[14:56:11] (171.67.108.25:80)
[14:56:11] Could not transmit unit 02 to Collection server; keeping in queue.
[14:56:11] + Sent 0 of 1 completed units to the server
[14:56:11] + Closed connections
[14:56:16]
[14:56:16] + Processing work unit
[14:56:16] Work type a1 not eligible for variable processors
[14:56:16] Core required: FahCore_a1.exe
[14:56:16] Core found.
[14:56:16] Working on queue slot 03 [April 10 14:56:16 UTC]
[14:56:16] + Working ...
[14:56:16] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a1.exe -dir work/ -suffix 03 -priority 96 -checkpoint 5 -forceasm -verbose -lifeline 5516 -version 624'
[14:56:16]
[14:56:16] *------------------------------*
[14:56:16] Folding@Home Gromacs SMP Core
[14:56:16] Version 1.74 (November 27, 2006)
[14:56:16]
[14:56:16] Preparing to commence simulation
[14:56:16] - Ensuring status. Please wait.
[14:56:33] - Assembly optimizations manually forced on.
[14:56:33] - Not checking prior termination.
[14:56:34] - Expanded 2436578 -> 12916733 (decompressed 530.1 percent)
[14:56:34] - Starting from initial work packet
[14:56:34]
[14:56:34] Project: 2653 (Run 24, Clone 175, Gen 101)
[14:56:34]
[14:56:34] Assembly optimizations on if available.
[14:56:34] Entering M.D.
[14:56:41] Rejecting checkpoint
[14:56:42] Protein: Protein in POPC
[14:56:42] Writing local files
[14:56:42] Extra SSE boost OK.
[14:56:42] Writing local files
[14:56:43] Completed 0 out of 500000 steps (0 percent)
[15:06:58] Timered checkpoint triggered.
[15:17:55] *------------------------------*
[15:17:55] Folding@Home Gromacs SMP Core
[15:17:55] Version 2.04 (Thu Jan 29 16:43:57 PST 2009)
[15:17:55]
[15:17:55] Preparing to commence simulation
[15:17:55] - Ensuring status. Please wait.
[15:18:06] - Looking at optimizations...
[15:18:06] - Working with standard loops on this execution.
[15:18:06] - Files status OK
[15:18:06] Need version 206
[15:18:06] Error: Work unit read from disk is invalid
[15:18:06]
[15:18:06] Folding@home Core Shutdown: CORE_OUTDATED
[15:18:08] CoreStatus = 6E (110)
[15:18:08] + Core out of date. Auto updating...
[15:18:08] - Attempting to download new core...
[15:18:08] + Downloading new core: FahCore_a2.exe
[15:18:08] Downloading core (/~pande/Linux/AMD64/Core_a2.fah from www.stanford.edu)
[15:18:08] Initial: AFDE; + 10240 bytes downloaded
[15:18:08] Initial: 1FF1; + 20480 bytes downloaded
Ragnar Dan wrote: about a whole bunch of problems.
farmpuma wrote: [a nigh crazy idea about blowing up my computer from high in the sky, somewhat fictionalized here in this box]
just brew it! wrote: Update: I can now confirm that the new 2.06 a2 core seems to fix the issue. No more hangs here since forcing all of my systems to download the new version.
#!/bin/sh
# check_hang.sh - checks log files and kills/continues the cores if hung at completion
# Also does cleanup of stale files in the work directory
#
LOGFILE=/tmp/folding_hanglog.txt
while true
do
    # Run every 5 minutes
    sleep 300
    # Trim the log file to its last 1000 lines
    if [ -f $LOGFILE ]
    then
        tail -n 1000 $LOGFILE > /tmp/hanglog.bak
        mv /tmp/hanglog.bak $LOGFILE
    fi
    echo `date` " Checking " >> $LOGFILE
    # Check for FINISHED_UNIT without a CoreStatus line following it
    grep -E 'FINISHED_UNIT|CoreStatus' FAHlog.txt | tail -n 1 | grep -q FINISHED_UNIT
    if [ $? -eq 0 ]
    then
        # Give the client a chance to kill the cores
        echo "Potential hang found, waiting to see if it clears..." >> $LOGFILE
        sleep 300
        grep -E 'FINISHED_UNIT|CoreStatus' FAHlog.txt | tail -n 1 | grep -q FINISHED_UNIT
        if [ $? -eq 0 ]
        then
            echo "Hang failed to clear, killing cores" >> $LOGFILE
            ./kill_cores.sh $LOGFILE
        fi
    fi
    # Check for a completed upload with no attempt to download new work following it
    grep -E 'Number of Units Completed|Preparing to get new work unit|Starting local stats count at' FAHlog.txt | tail -n 1 | grep -qE 'Number of Units Completed|Starting local stats count at'
    if [ $? -eq 0 ]
    then
        # Give the client a chance to continue on its own
        echo "Potential stop found, waiting to see if it clears..." >> $LOGFILE
        sleep 300
        grep -E 'Number of Units Completed|Preparing to get new work unit|Starting local stats count at' FAHlog.txt | tail -n 1 | grep -qE 'Number of Units Completed|Starting local stats count at'
        if [ $? -eq 0 ]
        then
            echo "Stop failed to clear, continuing cores" >> $LOGFILE
            ./cont_cores.sh $LOGFILE
        fi
    fi
    # Clean up any stale files in the work directory
    slot=0
    while [ "$slot" -lt "10" ]
    do
        state=`./queueinfo queue.dat $slot`
        if [ "$state" -eq "0" ]
        then
            rm -f work/*_0$slot*
        fi
        slot=`expr $slot + 1`
    done
done
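The FINISHED_UNIT test above can be exercised in isolation. A minimal sketch against a fabricated two-line log (the sample lines are invented stand-ins, not a real FAHlog.txt): because the last matching line is a CoreStatus line, the client is judged healthy.

```shell
# Hedged sketch of check_hang.sh's hang test against a fake two-line log.
log=$(mktemp)
printf '%s\n' \
  '[17:18:04] Folding@home Core Shutdown: FINISHED_UNIT' \
  '[17:37:53] CoreStatus = 64 (100)' > "$log"
# A hang would leave FINISHED_UNIT as the last matching line with no
# CoreStatus after it; here CoreStatus follows, so no hang is flagged.
if grep -E 'FINISHED_UNIT|CoreStatus' "$log" | tail -n 1 | grep -q FINISHED_UNIT
then
    echo "hung"
else
    echo "not hung"   # prints "not hung"
fi
rm -f "$log"
```

Deleting the CoreStatus line from the sample log flips the result to "hung", which is the condition that triggers kill_cores.sh.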
#!/bin/sh
# kill_cores.sh - kills cores for the specified instance
#
CWD=`pwd`
echo "kill_cores.sh for $CWD" >> $1
# Walk /proc looking for processes
for procdir in `find /proc -name '[0-9]*' | awk '/\/proc\/[0-9]*$/ {print $0}'`
do
    # Check that the process has the right exe and the right cwd
    if [ -e $procdir/exe -a -e $procdir/cwd ]
    then
        if [ "`readlink $procdir/exe`" = "$CWD/FahCore_a1.exe" -a "`readlink $procdir/cwd`" = "$CWD" ]
        then
            # kill -9 the core procs to free the hang
            kill -9 `echo $procdir | awk -F / '{print $3}'`
            echo "Killing " `echo $procdir | awk -F / '{print $3}'` >> $1
        fi
    fi
done
#!/bin/sh
# cont_cores.sh - continues cores for the specified instance
#
CWD=`pwd`
echo "cont_cores.sh for $CWD" >> $1
# Walk /proc looking for processes
for procdir in `find /proc -name '[0-9]*' | awk '/\/proc\/[0-9]*$/ {print $0}'`
do
    # Check that the process has the right exe and the right cwd
    if [ -e $procdir/exe -a -e $procdir/cwd ]
    then
        if [ "`readlink $procdir/exe`" = "$CWD/FahCore_a2.exe" -a "`readlink $procdir/cwd`" = "$CWD" ]
        then
            # Send SIGCONT to the core procs to free the stall
            kill -CONT `echo $procdir | awk -F / '{print $3}'`
            echo "Continuing " `echo $procdir | awk -F / '{print $3}'` >> $1
        fi
    fi
done
/*
 * queueinfo.c - a program to output the state of the work unit slots
 * Reads from the queue.dat named in argv[1] the state of slot argv[2]
 * Copyright Nicholas Reilly 29 September 2008
 * Licensed under the GPL v2 or any later version
 */
#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <errno.h>
#include <unistd.h>

#define SIZE 7168

int main(int argc, char *argv[])
{
    char *addr, *stat;
    int fd, slot;

    if (argc != 3) {
        fprintf(stderr, "Usage: %s <queue.dat> <slot 0-9>\n", argv[0]);
        return EXIT_FAILURE;
    }
    slot = atoi(argv[2]);
    if ((slot < 0) || (slot > 9)) {
        fprintf(stderr, "Usage: %s <queue.dat> <slot 0-9>\n", argv[0]);
        return EXIT_FAILURE;
    }
    fd = open(argv[1], O_RDONLY);
    if (fd == -1) {
        perror("Failed to open queue.dat");
        return EXIT_FAILURE;
    }
    addr = mmap(NULL, SIZE, PROT_READ, MAP_PRIVATE, fd, 0);
    if (addr == MAP_FAILED) {
        perror("Failed to map file");
        return EXIT_FAILURE;
    }
    /* Skip the first 8 bytes (general header) */
    stat = addr + 8;
    /* Each queue entry is 712 bytes long, with the status as its first byte */
    stat += (712 * slot);
    printf("%d\n", *stat);
    (void)munmap(addr, SIZE);
    close(fd);
    return EXIT_SUCCESS;
}
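Under the same layout assumptions queueinfo.c makes (an 8-byte header, then 712-byte entries whose first byte is the status), the status byte can also be read with standard tools. A hedged sketch that builds a zero-filled stand-in queue.dat (not a real client file) with slot 3 marked in use, then reads it back with dd and od:

```shell
# Build a stand-in queue.dat: 7168 zero bytes, with slot 3's status byte set
# to 1. Offsets (8-byte header, 712-byte entries) are taken from queueinfo.c.
dd if=/dev/zero of=queue.dat bs=1 count=7168 2>/dev/null
printf '\001' | dd of=queue.dat bs=1 seek=$((8 + 712 * 3)) conv=notrunc 2>/dev/null
# Read the same byte queueinfo.c would print for slot 3.
slot=3
dd if=queue.dat bs=1 skip=$((8 + 712 * slot)) count=1 2>/dev/null | od -An -tu1 | tr -d ' '   # prints 1
```

This is only a convenience for checking the layout by hand; check_hang.sh should keep calling the compiled queueinfo, which validates its arguments and maps the file once.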
[17:18:04] - Shutting down core
[17:18:04]
[17:18:04] Folding@home Core Shutdown: FINISHED_UNIT
[17:37:53] CoreStatus = 64 (100)