notfred wrote:As theMASS spotted, I spent this evening finishing off merging in the benchmark stuff. I've also added the new URL and a few more tweaks. I didn't think I would get it done tonight, so it is actually labelled with tomorrow's date.
Download it at the usual place - http://reilly.homeip.net/folding/
RAH wrote:Does this update fix the hang upon completion on the USB stick too?
I'm having that problem. Needs to be restarted to send.
Other than that, works great.
jmfox wrote:every couple of WU's it locks up.
jmfox wrote:Easiest way is to make sure you haven't disabled backups (i.e. "Backup Interval" is set to something other than 0) and have a USB stick plugged in. It should back up to the USB stick regularly and if it hangs you can reboot and it should restore from the USB stick.
This last time I copied out the backup tarball and the entire work directory. How might I go about restoring this after rebooting? Or is it better to just dump it and let it run?
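If you do want to put a copied-out backup back by hand, a rough sketch in Python follows. It assumes the backup is an ordinary tarball holding the client's files (for example the work directory). The names `restore_backup`, `backup.tar` and `fah` are placeholders, not the names the ISO actually uses. With backups enabled and a USB stick plugged in, the restore on reboot described above should make this unnecessary.

```python
import tarfile
from pathlib import Path


def restore_backup(tarball, client_dir):
    """Unpack a backup tarball into client_dir and list the restored files.

    Assumption: the tarball stores paths relative to the client directory,
    e.g. "work/wudata_03.arc". The real backup layout may differ.
    """
    client_dir = Path(client_dir)
    client_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tarball) as archive:
        if hasattr(tarfile, "data_filter"):
            # Newer Python releases: refuse members that escape client_dir.
            archive.extractall(client_dir, filter="data")
        else:
            archive.extractall(client_dir)
    # Report what is now on disk, as paths relative to client_dir.
    return sorted(
        p.relative_to(client_dir).as_posix()
        for p in client_dir.rglob("*")
        if p.is_file()
    )


if __name__ == "__main__":
    # Hypothetical file names, adjust to wherever the backup was copied.
    if Path("backup.tar").exists():
        for name in restore_backup("backup.tar", "fah"):
            print("restored", name)
```

Whether the client then picks the unit up cleanly depends on the state the files were in when the backup was taken, so treat this as a starting point only.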
[18:52:25] Completed 500000 out of 500000 steps (100 percent)
[18:52:25] Writing final coordinates.
[18:52:26] Past main M.D. loop
[18:52:26] Will end MPI now
[18:53:26]
[18:53:26] Finished Work Unit:
[18:53:26] - Reading up to 3728544 from "work/wudata_03.arc": Read 3728544
[18:53:26] - Reading up to 1785264 from "work/wudata_03.xtc": Read 1785264
[18:53:26] goefile size: 0
[18:53:26] logfile size: 19063
[18:53:26] Leaving Run
[18:53:27] - Writing 5537271 bytes of core data to disk...
[18:53:27] ... Done.
[18:53:27] - Shutting down core
[18:53:27]
[18:53:27] Folding@home Core Shutdown: FINISHED_UNIT
[22:03:09] - Autosending finished units...
[22:03:09] Trying to send all finished work units
[22:03:09] + No unsent completed units remaining.
[22:03:09] - Autosend completed
[22:56:18] Project: 2605 (Run 10, Clone 155, Gen 25)
[22:56:18]
[22:56:18] Assembly optimizations on if available.
[22:56:18] Entering M.D.
[22:56:24] Calling FAH init
[22:56:24] Read topology
[22:56:24] (Starting from checkpoint)
[22:56:25] 000 out of 500000 steps (99 percent)
[22:56:25] Extra SSE boost OK.
[22:56:25] es
[22:56:25] Completed 495000 out of 500000 steps (99 percent)
[22:56:25] Extra SSE boost OK.
[23:13:43] Writing local files
[23:13:43] Completed 500000 out of 500000 steps (100 percent)
[23:13:43] Writing final coordinates.
[23:13:43] Past main M.D. loop
[23:13:43] Will end MPI now
[23:14:43]
[23:14:43] Finished Work Unit:
[23:14:43] - Reading up to 3728544 from "work/wudata_03.arc": Read 3728544
[23:14:43] - Reading up to 1785296 from "work/wudata_03.xtc": Read 1785296
[23:14:43] goefile size: 0
[23:14:43] logfile size: 27075
[23:14:43] Leaving Run
[23:14:48] - Writing 5545315 bytes of core data to disk...
[23:14:48] ... Done.
[23:14:48] - Shutting down core
[23:14:48]
[23:14:48] Folding@home Core Shutdown: FINISHED_UNIT
[23:14:52] CoreStatus = 64 (100)
[23:14:52] Unit 3 finished with 65 percent of time to deadline remaining.
[23:14:52] Updated performance fraction: 0.682033
[23:14:52] Sending work to server
notfred wrote:Hmm, that should get picked up by the auto hang check in that it says FINISHED_UNIT but then doesn't say CoreStatus.
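For anyone following along, the check notfred describes can be sketched in a few lines of Python. This is only an illustration of the idea, not the script on the ISO: the function name `looks_hung` is made up here, and the two marker strings are taken from the logs posted in this thread.

```python
FINISHED_MARKER = "Folding@home Core Shutdown: FINISHED_UNIT"
STATUS_MARKER = "CoreStatus ="


def looks_hung(log_lines):
    """Return True if the last FINISHED_UNIT line has no CoreStatus after it.

    Other lines (such as autosend messages) in between are ignored, so
    extra text after the FINISHED_UNIT line does not hide a hang.
    """
    waiting_for_status = False
    for line in log_lines:
        if FINISHED_MARKER in line:
            waiting_for_status = True
        elif STATUS_MARKER in line:
            waiting_for_status = False
    return waiting_for_status


if __name__ == "__main__":
    # Lines taken from the log posted above.
    sample = [
        "[18:53:27] Folding@home Core Shutdown: FINISHED_UNIT",
        "[22:03:09] - Autosending finished units...",
        "[22:03:09] - Autosend completed",
    ]
    print(looks_hung(sample))
```

A real check would also need a grace period, since in the healthy run above the CoreStatus line only shows up a few seconds after FINISHED_UNIT.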
EvilAlchemist wrote:Also, I know you have had a ticket submitted on it for a while, but where do you stand on the 2 x SMP for the quads?
Would this be a major rewrite to run SMP per 2 Cores found .... or just a simple config change .. and are you willing to explore doing this?
notfred wrote:Thanks for the pointer to that. I am running a relatively old version of glibc, so it may be worth an upgrade...
EvilAlchemist wrote:Notfred - I got a PM from the user and the updated glibc.
He stated that it was still hanging sometimes .. just not as much ...
Update #2: Had one hang ... but it was the first one in five days .....
From what I can tell ... tons of users are having this happen, on both Linux and Mac. Guess we will just have to wait until the final version comes out to fix this completely!
theMASS wrote:The Hang Check seems to have been broken in the latest release. I've been beyond busy the last few weeks so I only updated one box and it hangs regularly. (It was fine before the upgrade) The 6 machines I run the previous version on still work perfectly.
EvilAlchemist wrote:theMASS wrote:The Hang Check seems to have been broken in the latest release. I've been beyond busy the last few weeks so I only updated one box and it hangs regularly. (It was fine before the upgrade) The 6 machines I run the previous version on still work perfectly.
Good information to have. Glad someone else is having the same thing happen.
I hope Notfred has some time soon to look at this. He is probably tired of seeing me on here .. so "theMass", it is your turn ...
PS ... Notfred ... thanks for all your hard work. I know you have been busy with family at your place lately!
notfred wrote:Didn't get much time to look at this tonight, but at least the basic hangcheck stuff is working as before (not tried it with extra text after the FINISHED_UNIT line yet, will try that next) in that it kill -9's the cores. There may be something different in the beta2 client that means the old workaround of kill -9 the cores no longer fixes the hang. Are there any reports of that?
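The `kill -9` workaround mentioned above boils down to sending SIGKILL to the core processes. Here is a minimal Python sketch, assuming a POSIX system and assuming you already know the PIDs of the stuck cores. How the ISO finds them is not shown in this thread, and `kill_cores` is a hypothetical name.

```python
import os
import signal


def kill_cores(pids):
    """Send SIGKILL (the signal behind `kill -9`) to each PID.

    Returns the PIDs that were actually signalled. PIDs that no longer
    exist are skipped instead of raising an error.
    """
    signalled = []
    for pid in pids:
        try:
            os.kill(pid, signal.SIGKILL)
        except ProcessLookupError:
            continue  # process already gone
        signalled.append(pid)
    return signalled
```

SIGKILL cannot be caught, so the core gets no chance to clean up. That fits the case here, where the unit is already finished and only the hung process is in the way.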
Ragnar Dan wrote:I found it sitting for several hours after reaching FINISHED_UNIT on at least 2 occasions last week on the one machine where I was using the newer ISO, so I reverted to the November 28 version. I'm probably going to set up the TFTP server or something to protect myself from this thing, but that will take some time to set up since it's in a VM and I'll also have to make it do DHCP instead of letting the gateway do it.
runlinux wrote:Hey notfred, I love your setup, as it is the easiest way for me to get my farm working.
But, like others around here, I have issues with the hang check. I have lost maybe 10+ units over the past few weeks, as the checks don't seem to be working for me. I have my Linux server working as DHCP and TFTP, so I get my backups done every 15 minutes. I have three rigs running headless and diskless, and they alone are worth 8.5k PPD.
I know how to work my way around Linux, and I took a look at the source, but I don't know how I would go about making changes and then applying them to my systems.
I think it would be nice if you could add a Kill link on the node, so that if one notices they are acting up, we can just kill the cores and have it restart and hopefully not lose any work.
I have posted about your work over at http://www.forums.extremeoverclocking.com and http://www.xtremesystems.org and people there seem to like it a whole lot, so I stand behind you on this project!
digital_exhaust wrote:
[23:41:28] Completed 500000 out of 500000 steps (100 percent)
[23:41:28] Writing final coordinates.
[23:41:29] Past main M.D. loop
[23:41:29] Will end MPI now
[23:42:29]
[23:42:29] Finished Work Unit:
[23:42:29] - Reading up to 3722208 from "work/wudata_02.arc": Read 3722208
[23:42:29] - Reading up to 1775176 from "work/wudata_02.xtc": Read 1775176
[23:42:29] goefile size: 0
[23:42:29] logfile size: 16912
[23:42:29] Leaving Run
[23:42:32] - Writing 5518696 bytes of core data to disk...
[23:42:32] ... Done.
[23:42:32] - Shutting down core
[23:42:32]
[23:42:32] Folding@home Core Shutdown: FINISHED_UNIT
[23:57:48] - Autosending finished units...
[23:57:48] Trying to send all finished work units
[23:57:48] + No unsent completed units remaining.
[23:57:48] - Autosend completed
runlinux wrote:
I'm having this issue when I do a remote reboot. Anytime during a unit it is fine, as they are backed up on my server, but when they finish, it just hangs, and upon reboot they are dead, saying that they can't open the local file. I believe I have the latest version too. I'll try the USB drive, as I have 3 I'm not using right now.
Upon a finished unit and the hang, the reboot spits this out:
[15:51:01] Folding@Home Gromacs SMP Core
[15:51:01] Version 1.74 (November 27, 2006)
[15:51:01]
[15:51:01] Preparing to commence simulation
[15:51:01] - Ensuring status. Please wait.
[15:51:01]
[15:51:01] Error: Could not write local file. Exiting.
[15:51:06] - Shutting down core
[15:51:18] Exiting.
[15:51:18] - Shutting down core
[15:53:10] CoreStatus = 12 (18)
[15:53:10] Client-core communications error: ERROR 0x12
[15:53:10] Deleting current work unit & continuing...
[15:57:31] - Warning: Could not delete all work unit files (7): Core returned invalid code