New diskless folding suite released

Come join the... uh... er... fold.

Moderators: just brew it!, farmpuma

Posted on Wed Dec 05, 2007 11:31 pm

Hmm, try that again and check that your DNS is not returning stale entries. Mine returns 171.67.20.36 for http://www.stanford.edu, so the transaction looks like:
Code: Select all
wget http://www.stanford.edu/group/pandegroup/folding/release/FAH6.00beta1-Linux.tgz
--23:29:59--  http://www.stanford.edu/group/pandegroup/folding/release/FAH6.00beta1-Linux.tgz
           => `FAH6.00beta1-Linux.tgz'
Resolving www.stanford.edu... 171.67.20.36
Connecting to www.stanford.edu|171.67.20.36|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 139,893 (137K) [application/x-gzip]

100%[====================================>] 139,893      184.39K/s             

23:30:01 (184.00 KB/s) - `FAH6.00beta1-Linux.tgz' saved [139893/139893]
notfred
Grand Gerbil Poohbah
 
Posts: 3775
Joined: Tue Aug 10, 2004 10:10 am
Location: Ottawa, Canada

Posted on Thu Dec 06, 2007 4:06 am

mac_h8r1 wrote:notfred, I've got an issue running the folding CD on a server I just built. It's a dual quad-core Xeon, so the client tries to start 2 instances of the SMP client. I get the following, though, and I know the internet connection is good:
Connecting to folding.stanford.edu [171.67.20.40]:80
wget:server returned error 404: HTTP/1.1 404 Not Found
tar:FAH_SMP_Linux.tgz:No such file or directory
Setting up instance 2
Setting up instance 1

The issue has shown up before, as whenever I build one of these I try the CD to get a few WUs out before the server ships.

Thoughts?


Are you using a recent iso? It looks like you're trying to connect to the old SMP client. The new beta is FAH6.00beta1-Linux.tgz and you're looking for FAH_SMP_Linux.tgz.

Try the latest build; notfred has really done a great job with it.
theMASS
Gerbil First Class
 
Posts: 132
Joined: Thu Sep 27, 2007 3:24 am

Posted on Thu Dec 06, 2007 9:58 am

Whoops, missed that - thanks theMASS, you are right. Stanford changed their website around and I updated the CD on 12th October to use the new URL.

The only thing I would be wary of with the 8 processors is that if it hangs and does an auto kill, it will mess up the other instance. See earlier in the thread for discussion of this. I didn't think there were that many 8 processor machines around; guess I need to catch up! :)
notfred
Grand Gerbil Poohbah
 
Posts: 3775
Joined: Tue Aug 10, 2004 10:10 am
Location: Ottawa, Canada

Posted on Sun Dec 09, 2007 4:25 am

Hi notfred!
Would it be possible for you to update the diskless image to append the kernel option 'pci=nommconf' at boot?

I've got a recent Intel G33 chipset board and I can't boot into any Linux distro without adding this to the boot arguments.

Should be an easy one? Cheers.
Brett
Gerbil
 
Posts: 23
Joined: Tue Oct 02, 2007 6:03 pm

Posted on Sun Dec 09, 2007 12:55 pm

I'm a bit wary of adding workarounds for specific boards that then may come back and not work on other boards.

If you are network booting then you can add that option yourself to the boot parameters in pxelinux.cfg/default. If you are working off the CD then I would suggest extracting the CD, editing the isolinux.cfg and rebuilding the iso. The command I use to build the iso is:
Code: Select all
mkisofs -o folding_cd.iso -b isolinux.bin -c boot.cat -no-emul-boot -boot-load-size 4 -boot-info-table boot
where boot is the directory containing kernel32, kernel64, initrd, isolinux.bin and isolinux.cfg

Another alternative would be if there is a workaround in a later kernel, I could update to that kernel version - it is currently running 2.6.22.1
notfred
Grand Gerbil Poohbah
 
Posts: 3775
Joined: Tue Aug 10, 2004 10:10 am
Location: Ottawa, Canada

Posted on Sun Dec 09, 2007 3:29 pm

Thanks notfred,
I'm using the CD generator, so I should just add arguments to the APPEND line in isolinux.cfg ?
Brett
Gerbil
 
Posts: 23
Joined: Tue Oct 02, 2007 6:03 pm

Posted on Sun Dec 09, 2007 7:15 pm

All the boards I run it on are G33 based. Have you played around with the BIOS? I had one box that was giving me trouble booting anything and I think the solution was a SATA setting relating to IDE/Native mode. Is your CD/DVD IDE or SATA?
theMASS
Gerbil First Class
 
Posts: 132
Joined: Thu Sep 27, 2007 3:24 am

Posted on Sun Dec 09, 2007 7:43 pm

Brett wrote:Thanks notfred,
I'm using the CD generator, so I should just add arguments to the APPEND line in isolinux.cfg ?

Yup, that gets passed straight to the kernel.
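
For anyone who would rather script that edit than do it by hand, here is a rough Python sketch that tacks an option onto every APPEND line of a syslinux-style config. The function name add_kernel_arg and the sample config text are made up for illustration (the real isolinux.cfg on the CD may well look different), so treat it as a starting point only:

```python
def add_kernel_arg(cfg_text, arg="pci=nommconf"):
    """Return cfg_text with arg added to each APPEND line that lacks it."""
    out = []
    for line in cfg_text.splitlines():
        words = line.split()
        # syslinux keywords are case-insensitive, so compare in lower case
        if words and words[0].lower() == "append" and arg not in words[1:]:
            line = line.rstrip() + " " + arg
        out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Hypothetical config, not copied from the actual CD
    sample = (
        "default kernel32\n"
        "label kernel32\n"
        "  kernel kernel32\n"
        "  append initrd=initrd\n"
    )
    print(add_kernel_arg(sample), end="")
```

Running it twice on the same text adds the option only once. After editing, rebuild the iso with the mkisofs command given earlier in the thread.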
notfred
Grand Gerbil Poohbah
 
Posts: 3775
Joined: Tue Aug 10, 2004 10:10 am
Location: Ottawa, Canada

Posted on Sun Dec 09, 2007 8:02 pm

theMASS wrote:All the boards I run it on are G33 based. Have you played around with the BIOS? I had one box that was giving me trouble booting anything and I think the solution was a SATA setting relating to IDE/Native mode. Is your CD/DVD IDE or SATA?

Yep I'll try that also - the KVM switch I use only has PS/2 and the board only has USB - so I've been using it just for display output so far & haven't changed anything from BIOS defaults.
The DVD-ROM is an external USB unit I share amongst all machines as needed (initial boot to install onto USB flash). I don't think this is to do with SATA specifically; more PCI IRQ assignment and APIC/LAPIC - which seem to be new ways of doing the same thing that the linux kernel doesn't handle well yet.

When installing OpenSUSE 10.3 onto an Intel DQ35JO board I ended up at a similar screen of output when booting; showing various PCI & IRQ messages and stopping dead. Google gave me the answer by way of appending 'pci=nommconf' which got it booting fine.
Brett
Gerbil
 
Posts: 23
Joined: Tue Oct 02, 2007 6:03 pm

Posted on Sun Dec 09, 2007 10:36 pm

Brett wrote:I've been using it just for display output so far & haven't changed anything from BIOS defaults.
The DVD-ROM is an external USB unit I share amongst all machines as needed (initial boot to install onto USB flash). I don't think this is to do with SATA specifically; more PCI IRQ assignment and APIC/LAPIC - which seem to be new ways of doing the same thing that the linux kernel doesn't handle well yet.

When installing OpenSUSE 10.3 onto an Intel DQ35JO board I ended up at a similar screen of output when booting; showing various PCI & IRQ messages and stopping dead.
Google gave me the answer by way of appending 'pci=nommconf' which got it booting fine.

It sounds like the BIOS isn't configured by default to boot from a USB device and/or it's not compatible with the default ACPI setting. I'm willing to bet a tweak or two in the BIOS will work... also you need to get into the BIOS after you have your USB flash drive ready to let it know to boot from it.
theMASS
Gerbil First Class
 
Posts: 132
Joined: Thu Sep 27, 2007 3:24 am

Posted on Mon Dec 10, 2007 5:59 am

OK I updated the APPEND arguments in isolinux.cfg, rebuild / burn / boot -> gets me into the OS fine, but without the onboard NIC (Intel GigE LOM).
Using an old PCI NIC gets me connected & folding; I'll try the LOM again next kernel update ;)
Brett
Gerbil
 
Posts: 23
Joined: Tue Oct 02, 2007 6:03 pm

Posted on Tue Dec 18, 2007 8:23 pm

The hang check missed one. First one it's missed since I've been running it.

Just an FYI... If it happens again I'll let you know.

Code: Select all
[15:47:07] Completed 9900000 out of 10000000 steps  (99 percent)
[15:53:46] Writing local files
[15:53:46] Completed 10000000 out of 10000000 steps  (100 percent)
[15:53:46] Writing final coordinates.
[15:53:46] Past main M.D. loop
[15:53:46] Will end MPI now
[15:54:47]
[15:54:47] Finished Work Unit:
[15:54:47] - Reading up to 232392 from "work/wudata_03.arc": Read 232392
[15:54:47] - Reading up to 13745940 from "work/wudata_03.xtc": Read 13745940
[15:54:47] goefile size: 0
[15:54:47] logfile size: 257315
[15:54:47] Leaving Run
[15:54:52] - Writing 14635011 bytes of core data to disk...
[15:54:52]   ... Done.
[15:54:52] - Shutting down core
[15:54:52]
[15:54:52] Folding@home Core Shutdown: FINISHED_UNIT
[16:00:00] - Autosending finished units...
[16:00:00] Trying to send all finished work units
[16:00:00] + No unsent completed units remaining.
[16:00:00] - Autosend completed
[22:00:00] - Autosending finished units...
[22:00:00] Trying to send all finished work units
[22:00:00] + No unsent completed units remaining.
[22:00:00] - Autosend completed


Possible theory... Maybe because "Autosend finished units" kicked in before "kill -9" script?

ADDED: It sent out fine on reboot :)

Something else that may be an issue... My next WU was a 3060..and I ran a backup at 0% so I could add it to the WU archive I'm starting to build for "more accurate" testing. The backup has swelled to 104MB so there must be some other stuck stuff in there. I'm running out for the night but I'll look at it tomorrow.
theMASS
Gerbil First Class
 
Posts: 132
Joined: Thu Sep 27, 2007 3:24 am

Posted on Tue Dec 18, 2007 9:46 pm

Thanks for the bug report. I've just started using Sourceforge for the Diskless Folding project https://sourceforge.net/projects/foldingcd/ and so I just submitted that as the first bug report under Tracker https://sourceforge.net/tracker/index.p ... id=1021757

I'll upload some of the other issues / requests people have reported as well. I've already uploaded the source in to their Subversion repository - although I still need to tag the last release and then I can work on fixing some of the bugs and adding new features.

If anyone else wants in on the project, drop me a PM.
notfred
Grand Gerbil Poohbah
 
Posts: 3775
Joined: Tue Aug 10, 2004 10:10 am
Location: Ottawa, Canada

Posted on Sat Jan 12, 2008 5:10 am

notfred-
Software is working great and seems to be for everyone since there hasn't been a post in almost a month ;)

Just thought I'd mention on the Diskless Folding Website... the version released Jan 11 is listed as 07... it's 08 now ;) ...and correctly labeled on the sourceforge page.

Gotta get myself a dual Quad to see how it works.
theMASS
Gerbil First Class
 
Posts: 132
Joined: Thu Sep 27, 2007 3:24 am

Posted on Sat Jan 12, 2008 12:54 pm

Thanks for catching that, it should be fixed on my website now. I was going to let this thread die and use viewtopic.php?t=56317 for the new release so people don't have to wade through 10 pages of obsolete info.
notfred
Grand Gerbil Poohbah
 
Posts: 3775
Joined: Tue Aug 10, 2004 10:10 am
Location: Ottawa, Canada

Posted on Mon Jan 14, 2008 4:00 pm

Your week of my 3200+ @ 2.22 GHz is half over... and for some reason right before the time I changed it to your folding name I notice your average PPD output had declined significantly. Lucky you, I helped push it up, some. :wink:

I wonder why you removed the "kill the cores" (or whatever it was called) link that used to be on your ISO. If you ask me, just because you put in code to do it automatically doesn't mean the option to do it manually should have been removed. I've noticed a number of times that thing taking almost 10 minutes to notice it's hung, and that can mess up meeting those 3-hour deadlines at EOC. :D
Ragnar Dan
Gerbil Elder
Silver subscriber
 
 
Posts: 5355
Joined: Sun Jan 20, 2002 7:00 pm

Posted on Mon Jan 14, 2008 10:05 pm

Thanks for folding for me! It looks like the memory on my work machine (an X2 4200) has gone flaky (the SMP cores started crashing) so I killed the folding on it, that caused the drop. Just a few days ago I added a second SMP instance on my home Q6600 so that will help bring the PPD back up along with your help. I also have new memory on the way for the work machine, should get here later this week.

I removed the "kill cores" link for several reasons, the first being the auto hang detection. That does have a 10 minute grace period on it as I have seen it take a couple of minutes to detect the cores are done on a non-hung system. The kill cores link also would completely bork folding if the cores were not at the end (hit the link by accident) and it also didn't work with multiple SMP instances. I don't think 10 minutes is too bad to wait, but if you really want it back, you could submit a feature request at http://sourceforge.net/tracker/?group_i ... id=1021760
I suppose I could tie the link in to the current kill mechanism to work with multiple SMP if really required, but I am inclined to make the link an option that is off by default if you really do want it back.
notfred
Grand Gerbil Poohbah
 
Posts: 3775
Joined: Tue Aug 10, 2004 10:10 am
Location: Ottawa, Canada

Posted on Mon Jan 14, 2008 11:35 pm

Very small amount of help I'm adding to you, that's for sure. It was great when those 310-point Project 3906 WU's were coming in consecutively when I brought the machine up, because it was delivering well above 400 PPD with them. But other WU's slowed it down noticeably (292 points taking near 30 hours, as opposed to 310 taking under 17). And then there's the problem of it being on a wireless connection which is only active when logged in, which slows progress down a bit. I usually time it well enough that most of the folding happens while logged out (Windows®, eating excess cycles for our customers since 1986 or thereabouts™ :wink:).

As for your change in your Q6600, when I was checking your stats on EOC, I noticed you getting some WU's that I never get on either of my dual-cores, so maybe they're restricting them to quad-cores... which means you won't get 'em now, though your points will probably increase fairly noticeably.

You seem to have more than your share of memory problems. I use better brands (Crucial, Mushkin, and Corsair though I'm less enamored of the last of those after reading that they substituted cheaper, less OC-able RAM chips in their DDR2-800) and never have problems as far as I know, although maybe OC-ing would go better with something else... I don't go as far as some do with that sort of thing.

With respect to the kill cores link, IIRC it asked for another confirmation click just like the reboot link does, so I'd think that should be enough. Me, submit a feature request? Then I'd have to learn how to work Sourceforge stuff, or something. I think I do have an account with them, hmm. It's not that big of a deal, since I can reboot and it often re-runs the last N minutes of the frame and continues on to submit it successfully (on ~12 minute frame WU's with 10 minute backups, it's often worth doing).
Ragnar Dan
Gerbil Elder
Silver subscriber
 
 
Posts: 5355
Joined: Sun Jan 20, 2002 7:00 pm

Posted on Tue Jan 15, 2008 12:27 pm

I will still get the quad core WUs, that's why Stanford isn't too keen on people running two copies of the SMP client on quad cores - the client detects the number of cores but not the other copy of the client. From what I have seen in WU times though, the second copy still enables both WUs to complete very close after the preferred deadlines (just miss by a few hours).

I've only had memory issues on 2 setups. The first was cheap generic RAM. I'm guessing this second issue is a memory issue (cores crash on a different WU even after reboot), but this one is running Crucial Ballistix DDR2, so once I get a chance to take down the work server and do more testing, it may just need a reseat of the DIMMs or I may need to take advantage of their warranty.
notfred
Grand Gerbil Poohbah
 
Posts: 3775
Joined: Tue Aug 10, 2004 10:10 am
Location: Ottawa, Canada

Posted on Wed Jan 16, 2008 10:45 am

I can't seem to get the USB stick folding to work.
I made/dl'd the iso. Boot to cd. USB stick found no problem.
The files found on the stick - initrd, kernel32, kernel64, syslinux, version.

Then a lot of errors? - Unable to STAT USBA/.svn/"file name",
No such file or directory

Then starts setting up instance one.

Reboot to USB drive. Nothing. I have used it before, but I'm not sure if it was with this AW8D motherboard.

What am I doing wrong?
RAH
Gerbil
 
Posts: 10
Joined: Tue Jan 15, 2008 10:08 pm

Posted on Wed Jan 16, 2008 12:32 pm

I just found this about 11pm last night. It looks like when I moved to using Subversion for source control it added a bunch of extra directories (.svn). There's also an error about /tmp permissions when running syslinux, and syslinux doesn't install ldlinux.sys. I'll fix it in the next release that I'm currently working on, but for now, if you just grab a copy of syslinux from my website http://reilly.homeip.net/folding/usb.html and then run that against the USB stick it should fix it.
notfred
Grand Gerbil Poohbah
 
Posts: 3775
Joined: Tue Aug 10, 2004 10:10 am
Location: Ottawa, Canada

Posted on Wed Jan 16, 2008 1:28 pm

Used syslinux.
That installs the hidden file ldlinux.sys.

It does not seem to install the MBR.
I get an "Invalid system disk" error.

Tried with a fresh blank stick and an HP formatted bootable stick.
RAH
Gerbil
 
Posts: 10
Joined: Tue Jan 15, 2008 10:08 pm

Posted on Wed Jan 16, 2008 4:18 pm

OK, unfortunately I don't know of a way for you to do that under Windows.

The MBR is on the CD. You would need to extract the initrd and unpack it (it is in gzip and cpio format); the file mbr.bin is then in the bin directory, and you need to copy that onto the very beginning of the USB stick. I know the Linux commands but I'm not even sure that it is possible under Windows, at least without installing a bunch of extra software.
notfred
Grand Gerbil Poohbah
 
Posts: 3775
Joined: Tue Aug 10, 2004 10:10 am
Location: Ottawa, Canada

Posted on Wed Jan 16, 2008 6:03 pm

Could I do it with a Linux live CD?
Copy the existing FAH CD to a folder on my desktop.
Shows six files - boot, initrd, isolinux, isolinux.cfg, kernel32, kernel64

Will the live CD allow me to extract the needed files to the stick?
RAH
Gerbil
 
Posts: 10
Joined: Tue Jan 15, 2008 10:08 pm

Posted on Thu Jan 17, 2008 10:28 pm

notfred wrote:I'm guessing this second issue is a memory issue (cores crash on a different WU even after reboot)

Turned out to be a corrupted install: after swapping the memory (from 2GB of DDR2 667 to 4GB of DDR2 800) I got the same error. I wiped the directory apart from the client.cfg, redownloaded from Stanford and fired it up again. Now fear my folding power! :-)

I have an X2 3800+ and an X2 4200+ both running 1 copy of the Linux SMP client, plus a Q6600 running 2 copies of the Linux SMP client.
notfred
Grand Gerbil Poohbah
 
Posts: 3775
Joined: Tue Aug 10, 2004 10:10 am
Location: Ottawa, Canada

Posted on Thu Jan 17, 2008 10:38 pm

RAH wrote:Could I do it with a Linux live CD?
Copy the existing FAH CD to a folder on my desktop.
Shows six files - boot, initrd, isolinux, isolinux.cfg, kernel32, kernel64

Will the live CD allow me to extract the needed files to the stick?

I think you can do it with a Linux live CD after copying the files to your desktop.
Code: Select all
1) Boot the Linux live CD
2) Copy the initrd from your desktop folder to somewhere in the Linux system (/tmp is a good choice)
3) Become root (will depend on your live CD, either "su" or "sudo su")
4) cd /tmp
5) zcat <path to initrd> | cpio -i (for example, zcat /tmp/initrd | cpio -i)
6) Plug in the USB stick, wait 10 seconds
7) Run dmesg; the last few lines will be about sda or sdb or sdc etc, make a note of which it is
8) cat /tmp/bin/mbr.bin > /dev/sda (or whatever it was from above; use the whole device, not a partition such as /dev/sda1)
9) Wait for the light on the stick to stop flashing before unplugging it, then reboot and try the stick.
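
If the live CD happens to have Python on it, step 8 could also be done with a small script that refuses to write anything bigger than the 440-byte boot code area, so picking the wrong file cannot clobber the partition table that starts further into the first sector. This is only a sketch: write_mbr is a made-up helper name, and the paths in the usage comment are the ones from the steps above, which may differ on your system.

```python
import os

# Boot code area of a classic MBR; the disk signature and the
# partition table live in the bytes after it.
MBR_CODE_MAX = 440


def write_mbr(mbr_path, device_path):
    """Copy the boot code in mbr_path onto the start of device_path."""
    with open(mbr_path, "rb") as src:
        code = src.read()
    if not 0 < len(code) <= MBR_CODE_MAX:
        raise ValueError(
            "%s is %d bytes, expected 1 to %d" % (mbr_path, len(code), MBR_CODE_MAX)
        )
    # "r+b" opens for update without truncating, so the write starts at
    # offset 0 and leaves everything after the boot code untouched.
    with open(device_path, "r+b") as dev:
        dev.write(code)
        dev.flush()
        os.fsync(dev.fileno())  # push the data out before the stick is unplugged
    return len(code)


# Example (as root, device name taken from dmesg):
# write_mbr("/tmp/bin/mbr.bin", "/dev/sda")
```

Double-check the device name before running it as root, same as with the cat command.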
notfred
Grand Gerbil Poohbah
 
Posts: 3775
Joined: Tue Aug 10, 2004 10:10 am
Location: Ottawa, Canada

Posted on Fri Jan 18, 2008 6:34 pm

Hi notfred,
I'm finding one of my diskless folders (Intel E6600) is getting its cores killed as soon as it starts folding:
Code: Select all
[23:22:18] Initial: D4FE; + 1490945 bytes downloaded
[23:22:18] Verifying core Core_a1.fah...
[23:22:18] Signature is VALID
[23:22:18]
[23:22:18] Trying to unzip core FahCore_a1.exe
[23:22:19] Decompressed FahCore_a1.exe (3625104 bytes) successfully
[23:22:19] + Core successfully engaged
[23:22:24]
[23:22:24] + Processing work unit
[23:22:24] Core required: FahCore_a1.exe
[23:22:24] Core found.
[23:22:24] Working on Unit 05 [January 18 23:22:24]
[23:22:24] + Working ...
[23:22:24] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a1.exe -dir work/ -suffix 05 -checkpoint 15 -forceasm -verbose -lifeline 520 -version 600'

[23:22:24]
[23:22:24] *------------------------------*
[23:22:24] Folding@Home Gromacs SMP Core
[23:22:24] Version 1.74 (November 27, 2006)
[23:22:24]
[23:22:24] Preparing to commence simulation
[23:22:24] - Ensuring status. Please wait.
[23:22:41] - Assembly optimizations manually forced on.
[23:22:41] - Not checking prior termination.
[23:22:42] - Expanded 2435509 -> 12886013 (decompressed 529.0 percent)
[23:22:42]
[23:22:42] Project: 2605 (Run 9, Clone 571, Gen 5)
[23:22:42]
[23:22:42] Assembly optimizations on if available.
[23:22:42] Entering M.D.
[23:22:48] Calling FAH init
[23:22:49] in POPC
[23:22:49] Writing local files
[23:22:49] Extra SSE boost OK.
[23:22:49] eckpoint
[23:22:49] Protein: Protein in POPC
[23:22:49] Writing local files
[23:22:49] Extra SSE boost OK.
[23:22:49] Writing local files
[23:22:49] Completed 0 out of 500000 steps  (0 percent)
[23:22:49]
[23:22:49] Folding@home Core Shutdown: INTERRUPTED
[23:22:53] CoreStatus = 66 (102)
[23:22:53] + Shutdown requested by user. Exiting.***** Got a SIGTERM signal (15)
[23:22:53] Killing all core threads

Folding@Home Client Shutdown.

It seems the hung core checks are returning false positives.
My other x86_64 machine (AMD x2 3800+) runs fine on the same diskless code & exact same core .. any ideas ?

Cheers.
Brett
Gerbil
 
Posts: 23
Joined: Tue Oct 02, 2007 6:03 pm

Posted on Fri Jan 18, 2008 9:43 pm

I'm not convinced it's the hang check killer, the reason being that the log reports being sent SIGTERM (15) and I send SIGKILL (9). Also the hang check is triggered by the presence of the text "FINISHED_UNIT" without it being followed by "CoreStatus" within 5 minutes.

Are you sure that this machine is rock stable?
notfred
Grand Gerbil Poohbah
 
Posts: 3775
Joined: Tue Aug 10, 2004 10:10 am
Location: Ottawa, Canada

Posted on Sat Jan 19, 2008 12:07 am

Hi notfred,
Yes, this machine is an Intel board (not just the chipset), so no settings for overclocking etc. exist.
The line
Code: Select all
[23:22:53] + Shutdown requested by user. Exiting.***** Got a SIGTERM signal (15)

.. shows the cores were given a sigterm.

Does the kill cores routine fire on a CoreStatus line such as this ?
Code: Select all
[23:22:53] CoreStatus = 66 (102)

It is happening every time I boot the machine, as the log shows - 4 seconds after starting :-?

Maybe I need to delete the USB stick contents & get a fresh WU ?

Thanks !
Brett
Gerbil
 
Posts: 23
Joined: Tue Oct 02, 2007 6:03 pm

Posted on Sat Jan 19, 2008 11:22 am

Yes, I would wipe the backup from the USB stick and try again.

The kill cores fires if it doesn't see a CoreStatus line like the one you posted. It looks for a line containing "FINISHED_UNIT" and then if it doesn't see CoreStatus within 5 minutes then it kills the cores.

I send sigkill and not sigterm, so I think it is something else going on, probably a corrupt WU.
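
To make the rule concrete, here is a minimal Python sketch of the check as described: remember the last FINISHED_UNIT line, forget it if a CoreStatus line follows, and report a hang once five minutes have gone by. This is an illustration rather than the actual script on the CD; the names stamp_seconds and looks_hung are invented here, and it assumes all the timestamps fall on the same day.

```python
import re

STAMP = re.compile(r"^\[(\d\d):(\d\d):(\d\d)\]")


def stamp_seconds(line):
    """Return seconds since midnight for a '[HH:MM:SS]' prefix, else None."""
    match = STAMP.match(line)
    if match is None:
        return None
    hours, minutes, secs = (int(part) for part in match.groups())
    return hours * 3600 + minutes * 60 + secs


def looks_hung(log_lines, now_seconds, grace_seconds=5 * 60):
    """True if the last FINISHED_UNIT has had no CoreStatus for grace_seconds."""
    finished_at = None
    for line in log_lines:
        if "FINISHED_UNIT" in line:
            finished_at = stamp_seconds(line)
        elif "CoreStatus" in line:
            finished_at = None  # the client saw the core exit, nothing to kill
    if finished_at is None:
        return False
    return now_seconds - finished_at >= grace_seconds
```

Fed the log Brett posted, it never sees FINISHED_UNIT at all, so a check along these lines would not fire there.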
notfred
Grand Gerbil Poohbah
 
Posts: 3775
Joined: Tue Aug 10, 2004 10:10 am
Location: Ottawa, Canada
