Personal computing discussed
Moderators: renee, farmpuma, just brew it!
runlinux wrote:i believe i am as i have benchmark options upon boot.
i noticed the problem got worse when i moved my fan closer to the nodes to help keep them cool, so last night i moved it farther away and i hope that fixes it.
i tried building my own suite from nf's code to set up 1 client per 2 cpus, but i guess i don't have the stuff on my rig to get all the way through make all.
it was worth a shot as if it works for me, i can tell nf that it works just fine.
runlinux wrote:maybe i should add that it was a BIG box fan cooling the nodes... EMR ftl...
i can't wait till he can get around to working on this.
btw, i'm more than willing to help with this in any way. i have a bit of linux knowledge and have quite a few quad cores to test things on....
Flying Fox wrote:I recommend you join up on SourceForge to submit patches and stuff if you are interested.
runlinux wrote:it's very easy to change the settings to get 2 instances of folding up on a quad.
i did it on mine and i have successfully got 6 clients going over my 3 nodes.
i won't go into detail as this is NF's work, but for his sake, it is an easy task - took me about 3 minutes to get it working. he just needs the time to get around to working on it; time isn't too easy to come by these days...
notfred wrote:I'm more bothered about why the hang check has broken and I want that working on the next release. It's no good working on 2 WU instead of 1 if they don't upload!
EvilAlchemist wrote:notfred wrote:I'm more bothered about why the hang check has broken and I want that working on the next release. It's no good working on 2 WU instead of 1 if they don't upload!
I have not had any hang in the last 7 days.
I have bound all my diskless folders' MAC addresses to specific IPs (using the Linksys WRT300N DHCP Reservations tab).
Also changed the DHCP refresh time to 5 days.
Thanks for all your hard work Notfred.
I would rather have the Hang Check fixed than the /2 SMP switch.
[18:19:44] Completed 10000000 out of 10000000 steps (100 percent)
[18:19:45] Writing final coordinates.
[18:19:45] Past main M.D. loop
[18:19:45] Will end MPI now
[18:20:44]
[18:20:44] Finished Work Unit:
[18:20:44] - Reading up to 232416 from "work/wudata_07.arc": Read 232416
[18:20:44] - Reading up to 13720960 from "work/wudata_07.xtc": Read 13720960
[18:20:45] goefile size: 0
[18:20:45] logfile size: 265850
[18:20:45] Leaving Run
[18:20:48] - Writing 14619582 bytes of core data to disk...
[18:20:48] ... Done.
[18:20:48] - Shutting down core
[18:20:48]
[18:20:48] Folding@home Core Shutdown: FINISHED_UNIT
[18:35:47] CoreStatus = 64 (100)
[18:35:47] Unit 7 finished with 72 percent of time to deadline remaining.
[18:35:47] Updated performance fraction: 0.819928
[18:35:47] Sending work to server
[18:35:47] + Attempting to send results
[18:35:47] - Reading file work/wuresults_07.dat from core
[18:35:47] (Read 14619582 bytes from disk)
[18:35:47] Connecting to http://171.64.65.63:8080/
[18:39:49] Posted data.
[18:39:49] Initial: 0000; - Uploaded at ~58 kB/s
[18:39:50] - Averaged speed for that direction ~58 kB/s
[18:39:50] + Results successfully sent
[18:39:50] Thank you for your contribution to Folding@Home.
[18:39:50] + Starting local stats count at 1
[18:43:54] - Warning: Could not delete all work unit files (7): Core returned invalid code
[18:43:54] Trying to send all finished work units
[18:43:54] + No unsent completed units remaining.
[18:43:54] - Preparing to get new work unit...
[18:43:54] + Attempting to get work packet
[18:43:54] - Will indicate memory of 1000 MB
[18:43:54] - Detect CPU. Vendor: GenuineIntel, Family: 6, Model: 15, Stepping: 11
[18:43:54] - Connecting to assignment server
[18:43:54] Connecting to http://assign.stanford.edu:8080/
[18:43:54] Posted data.
[18:43:54] Initial: 40AB; - Successful: assigned to (171.64.65.64).
[18:43:54] + News From Folding@Home: Welcome to Folding@Home
[18:43:54] Loaded queue successfully.
[18:43:54] Connecting to http://171.64.65.64:8080/
[18:43:57] Posted data.
[18:43:57] Initial: 0000; - Receiving payload (expected size: 2965944)
[18:44:02] - Downloaded at ~579 kB/s
[18:44:02] - Averaged speed for that direction ~398 kB/s
[18:44:02] + Received work.
[18:44:02] Trying to send all finished work units
[18:44:02] + No unsent completed units remaining.
[18:44:02] + Closed connections
[18:44:02]
[18:44:02] + Processing work unit
[18:44:02] Core required: FahCore_a1.exe
[18:44:02] Core found.
[18:44:02] Working on Unit 08 [February 29 18:44:02]
[18:44:02] + Working ...
[18:44:02] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a1.exe -dir work/ -suffix 08 -checkpoint 15 -forceasm -verbose -lifeline 506 -version 601'
[18:44:02]
[18:44:02] *------------------------------*
[18:44:02] Folding@Home Gromacs SMP Core
[18:44:02] Version 1.74 (November 27, 2006)
[18:44:02]
[18:44:02] Preparing to commence simulation
[18:44:02] - Ensuring status. Please wait.
[18:44:19] - Assembly optimizations manually forced on.
[18:44:19] - Not checking prior termination.
[18:44:19] - Expanded 2965432 -> 15213631 (decompressed 513.0 percent)
[18:44:20] - Starting from initial work packet
[18:44:20]
[18:44:20] Project: 2653 (Run 18, Clone 194, Gen 69)
[18:44:20]
[18:44:20] Assembly optimizations on if available.
[18:44:20] Entering M.D.
[18:44:26] Rejecting checkpoint
[18:44:26] Protein: Protein in POPCExtra SSE boost OK.
[18:44:26]
[18:44:26] Extra SSE boost OK.
[18:44:27] Writing local files
[18:44:27] Completed 0 out of 500000 steps (0 percent)
[18:50:46]
[18:50:46] Folding@home Core Shutdown: INTERRUPTED
[18:50:50] CoreStatus = 66 (102)
[18:50:50] + Shutdown requested by user. Exiting.***** Got a SIGTERM signal (15)
[18:50:50] Killing all core threads
Folding@Home Client Shutdown.
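# Hang suspected if the newest FINISHED_UNIT/CoreStatus line is FINISHED_UNIT,
# i.e. the core finished but the client never logged a CoreStatus for it.
if grep -E 'FINISHED_UNIT|CoreStatus' "/etc/folding/$instance/FAHlog.txt" | tail -n 1 | grep -q FINISHED_UNIT
then
    # Give the client 10 minutes to catch up, then check again
    sleep 600
    if grep -E 'FINISHED_UNIT|CoreStatus' "/etc/folding/$instance/FAHlog.txt" | tail -n 1 | grep -q FINISHED_UNIT
    then
        # Do the killing code here
        :
    fi
fi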
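To see why the pattern needs CoreStatus as well as FINISHED_UNIT, here is a small sketch of the same test wrapped in a function and run against two throwaway logs. The function name looks_hung and the temp files are made up for illustration; the log lines are copied from the log above. This is not notfred's actual script, just the idea behind the check.

```shell
# Succeeds (exit 0) when the newest FINISHED_UNIT/CoreStatus line in the
# given log is FINISHED_UNIT, i.e. the client never logged a CoreStatus.
# looks_hung is a hypothetical helper name, not part of the real script.
looks_hung() {
    grep -E 'FINISHED_UNIT|CoreStatus' "$1" | tail -n 1 | grep -q FINISHED_UNIT
}

# Two throwaway logs: one where the core finished and nothing followed,
# one where the client went on to report CoreStatus.
stuck_log=$(mktemp)
ok_log=$(mktemp)
printf '%s\n' '[18:20:48] Folding@home Core Shutdown: FINISHED_UNIT' > "$stuck_log"
printf '%s\n' '[18:20:48] Folding@home Core Shutdown: FINISHED_UNIT' \
              '[18:35:47] CoreStatus = 64 (100)' > "$ok_log"

if looks_hung "$stuck_log"; then echo "stuck log: hang suspected"; fi
if ! looks_hung "$ok_log"; then echo "ok log: client moved on"; fi

rm -f "$stuck_log" "$ok_log"
```

Grepping for FINISHED_UNIT alone would flag both logs, since any log that has ever finished a unit contains that line.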
for procdir in `find /proc -name '[1-9]*' | awk '/\/proc\/[1-9]*$/ {print $0}'`
for procdir in `find /proc -name '[0-9]*' | awk '/\/proc\/[0-9]*$/ {print $0}'`
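The two loop headers above differ only in the character class: [1-9] versus [0-9]. In the awk filter, [1-9]* can never match a 0, so any /proc directory whose PID contains a zero is dropped. A quick sketch with made-up PIDs (the sample paths and function names are assumptions for illustration) shows the difference:

```shell
# Hypothetical sample of paths as find might print them;
# 1050 and 2007 are PIDs that contain a zero.
sample_paths() {
    printf '%s\n' /proc/1234 /proc/1050 /proc/2007 /proc/self /proc/1234/task/1234
}

# Old filter: [1-9]* cannot match a 0, so PIDs containing one are skipped.
old_filter() { awk '/\/proc\/[1-9]*$/ {print $0}'; }

# New filter: [0-9]* accepts every digit.
new_filter() { awk '/\/proc\/[0-9]*$/ {print $0}'; }

echo "old:"; sample_paths | old_filter
echo "new:"; sample_paths | new_filter
```

The old filter keeps only /proc/1234; the new one also keeps /proc/1050 and /proc/2007. Both still reject /proc/self and the nested task directory.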
notfred wrote:OK, new version is out:
1 March 08: Fix a bug in the hang check - wasn't killing cores with a 0 in the PID. Fix 1903637 - Add kill link to homepage. Fix 1870815 - Allow SMP per 2 CPUs.
Come and get it!
EvilAlchemist wrote:theMass, any results on the 2/smp vs 4/smp in the system with more than 1 GB ...??
notfred, the new /fixed hang check is awesome. Not a single hang since release date. Thanks
runlinux wrote:on my box that had 2gb of ram, it ran just fine with 2x SMP going.
it went from about 3kppd at 3ghz (1x SMP) to almost 4400ppd at times at the same speed. it really helped out the points.
but i noticed that on my boxes with only 1gb of ram i would randomly get Long 1-4 errors and the client would crash and nuke the WU.
bollix47 wrote:There has been a new beta release of the SMP client.
http://www.stanford.edu/group/pandegrou ... -Linux.tgz
One of my diskless computers shut down overnight and now it will not start up again because it can't find the old beta file.
Is there any way that this can be fixed so that a new kernel doesn't have to be created every time Stanford releases a new beta/release? Perhaps a text file with the various client links or the client names in it that we could modify when this happens?
Thanks for any guidance.