Stabilizing the Linux SMP client

Posted on Sun Sep 28, 2008 10:12 pm

I've been running the Linux Folding SMP client on as many as a dozen VMware guests over WinXP for the bulk of 2008. Lately, I've been "underperforming" and I haven't been able to pinpoint why. I think the problem is related to using the -advmethods flag to get the A2 cores. Something is badly amiss here, and it's frustrating.

It seems to be more of an art than science to keep the Linux SMP client running: random crashes, hangs, 100% completion with hangs... etc. Reboots, Qfix'es, Ctrl+C, hop up and down while hitting yourself on the left side of the head with your keyboard.

As I mentioned, my PPD has taken a hit lately, and I've started paying attention to each Work Unit's Run, Clone, and Generation. Does everyone do this?? Watch the Project RCG?

Now that I'm paying attention, I notice that several Ubuntu SMP guests are 100% completing a Work Unit, not uploading properly, and starting the same Work Unit again from scratch. Then completing it (AGAIN) to 100% and failing to upload it, and starting the same PRCG at the beginning.

I've noticed weird download timelines previously in FahMon, where it says that a WU was downloaded 2-3 days ago when it clearly was downloaded only hours ago. I always dismissed it as a calculation issue in FahMon. However, now that I look closely, the first instance of the project download was indeed done days before, but the guest has been repeatedly processing the same Project, Run, Clone, and Generation. Nice.
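
One way to spot this loop without relying on FahMon is to count how often each Project/Run/Clone/Gen line shows up in the client log. The sketch below assumes the log is FAHlog.txt in the client directory and that each unit start is logged as a line like "Project: 2669 (Run 3, Clone 45, Gen 12)"; the function name repeated_prcg is made up for illustration. The same line may also be logged when a unit is resumed after a restart, so a count above 1 is a hint to look closer rather than proof of a re-run.

```shell
#!/bin/bash
# Hypothetical helper: list every Project/Run/Clone/Gen that appears more
# than once in a client log. Assumes log lines of the form
#   [12:34:56] Project: 2669 (Run 3, Clone 45, Gen 12)
repeated_prcg() {
    local log="${1:-FAHlog.txt}"
    grep -oE 'Project: [0-9]+ \(Run [0-9]+, Clone [0-9]+, Gen [0-9]+\)' "$log" \
        | sort \
        | uniq -c \
        | awk '$1 > 1 { n = $1; $1 = ""; sub(/^ /, ""); print n "x " $0 }'
}
```

Run it from the client directory as repeated_prcg FAHlog.txt; each output line is a repeat count followed by the PRCG it belongs to.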

Now, most of the VMware Ubuntu guests that I have running have processed hundreds of WUs and been cloned numerous times, including being shut down hard countless times. Perhaps I need to do a new "clean" SMP guest build.

Some of you guys must be running into similar issues. Monitoring and playing with a dozen VMware Linux guests is a pain. I've learned a few things, which I'll post as I start watching more carefully.

- JP
Last edited by JPinTO on Sun Jun 07, 2009 6:04 am, edited 1 time in total.
JPinTO
Gerbil Team Leader
 
Posts: 240
Joined: Sat Jun 30, 2007 7:02 am
Location: Toronto, Ontario

Re: Vmware Linux SMP Client "issues"

Posted on Sun Sep 28, 2008 10:35 pm

A dozen guests? >24 cores, or multiple machines?
I've been here long enough that I think I can forgo a signature.
Forge
Darth Gerbil
 
Posts: 7959
Joined: Wed Dec 26, 2001 7:00 pm
Location: SouthEast PA

Re: Vmware Linux SMP Client "issues"

Posted on Sun Sep 28, 2008 11:00 pm

I took -advmethods off to see if that would stop the hangs after the client completes a WU, and for the most part it worked. I still get one here or there. I also still get projects using A2 (usually a mix of 1760 and 1920 PPD WUs). Last month I was reading the folding forum, and people there also mentioned that A2 projects don't need -advmethods any more.
i3-530 | HR-01 Plus (passive) | DH55TC | 4GB Kingston DDR3 | Toshiba 250GB 2.5" HDD | Mini P180 | picoPSU 150-xt w/102w brick | 21w idle
Pegasus
Gerbil First Class
 
Posts: 161
Joined: Fri Jun 30, 2006 1:13 am

Re: Vmware Linux SMP Client "issues"

Posted on Sun Sep 28, 2008 11:19 pm

I'm also experiencing the same issues with my VM setups. I'll remove the -advmethods flag to see if things improve.
Fold! And I don't mean your clothes!

Do you have a favorite gerbil recipe? Please share with the TR community!
flybywire
Gerbil Jedi
 
Posts: 1883
Joined: Wed Jun 16, 2004 2:28 pm
Location: Springfield, VA - USA

Re: Vmware Linux SMP Client "issues"

Posted on Sun Sep 28, 2008 11:54 pm

Removing -advmethods, for me, at least gives me a chance to get the 260x WUs back rather than all 266x. So I did it. But now, every time a WU finishes, I need to check whether it gets a new 260x. If not, it will almost certainly fail, and I am forced to trash the 266x until I get a 260x, which usually takes only a few tries. It's bad for the science, but it is still better than the thing running until about 25% and then erroring out, I think. :cry: :-?

It's ironic that having a "slow" S939 X2 3800+ is a benefit to me because I only need to do this dance every 3-4 days.
The Model M is not for the faint of heart. You either like them or hate them.

Gerbils unite! Fold for UnitedGerbilNation, team 2630.
Flying Fox
Gerbil God
 
Posts: 24286
Joined: Mon May 24, 2004 2:19 am

Re: Vmware Linux SMP Client "issues"

Posted on Mon Sep 29, 2008 10:38 am

Not sure which overrides which: the -advmethods switch when calling fah6, or the client.cfg setting.

I've got mine calling with -advmethods, but I get almost all 2605 WUs, with the odd 266x.

I'm glad I'm not the only one with issues... misery loves company.
JPinTO
Gerbil Team Leader
 
Posts: 240
Joined: Sat Jun 30, 2007 7:02 am
Location: Toronto, Ontario

Re: Vmware Linux SMP Client "issues"

Posted on Mon Sep 29, 2008 12:28 pm

I had my -advmethods in the script or shortcut calling FAH.
i3-530 | HR-01 Plus (passive) | DH55TC | 4GB Kingston DDR3 | Toshiba 250GB 2.5" HDD | Mini P180 | picoPSU 150-xt w/102w brick | 21w idle
Pegasus
Gerbil First Class
 
Posts: 161
Joined: Fri Jun 30, 2006 1:13 am

Re: Vmware Linux SMP Client "issues"

Posted on Mon Sep 29, 2008 2:43 pm

Anyone else notice that the work directory stays full of crap? There are 10 "queues", 00-09, in SMP FAH that round robin. So, if a machine is working on queue 00 and does 1 WU/day, in 10 days it will reuse queue 00.

My work files from a week ago are still there, and I think they are causing issues where the leftovers from the previous work unit 00 affect the current 00.

I've cleared out all my old work files, and so far, there seem to be a few fewer halts.
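
A rough sketch of that cleanup, for anyone who wants to automate it: delete files in the work directory that have not been touched for a set number of days. The function name clean_old_work and the 7-day default are assumptions, not anything that ships with the client. Files belonging to the unit in progress may be older than the cutoff on a slow machine, so pick a cutoff longer than your slowest unit and stop the client first.

```shell
#!/bin/bash
# Hypothetical helper: remove regular files directly under a work directory
# whose last modification is more than N days old (as judged by find -mtime).
# Prints each file it removes.
clean_old_work() {
    local dir="${1:-work}"
    local days="${2:-7}"
    [ -d "$dir" ] || { echo "no such directory: $dir" >&2; return 1; }
    find "$dir" -maxdepth 1 -type f -mtime +"$days" -print -delete
}
```

From the client directory that would be clean_old_work work 7. Dropping -delete from the find line turns it into a dry run that only lists what would go.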

- JP
JPinTO
Gerbil Team Leader
 
Posts: 240
Joined: Sat Jun 30, 2007 7:02 am
Location: Toronto, Ontario

Re: Vmware Linux SMP Client "issues"

Posted on Mon Sep 29, 2008 7:18 pm

Yup, the failure to clean up is a well-known issue. It also causes my diskless stuff to grow in memory usage (the work directory is on a ramdisk). I need to put some sort of cleanup in there.
notfred
Grand Gerbil Poohbah
 
Posts: 3712
Joined: Tue Aug 10, 2004 10:10 am
Location: Ottawa, Canada

Re: Vmware Linux SMP Client "issues"

Posted on Wed Oct 01, 2008 6:57 pm

I've decided to leave -advmethods on, but I've been a bit more vigilant about cleaning out the work directory.

This seems to have yielded slightly more stable production. The clients still have to be ended and restarted most of the time.

- JP
JPinTO
Gerbil Team Leader
 
Posts: 240
Joined: Sat Jun 30, 2007 7:02 am
Location: Toronto, Ontario

Re: startfah script

Posted on Sun Jun 07, 2009 6:02 am

I've been running this script for the last month, and it's been stable for me. It uses qfix to correct work unit errors and I don't lose many work units, if any.

The only thing is that the hang at 100% completion still occurs, so I have to hit Ctrl-C to kill the stalled fah6. Other than that, the script runs well for me.

-JP

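Code:
#!/bin/bash
# startfah: run one work unit at a time, repairing the queue with qfix
# before and after each run.
cd ~/folding/FAH || exit 1
while true; do
    ./qfix
    ./fah6 -forceasm -smp -verbosity 9 -advmethods -oneunit
    ./qfix
    # Delete the downloaded core binaries; the client should fetch a fresh copy
    rm -f -- ./*.exe
    # Kill any core process left behind by a stalled client
    killall -9 "FahCore_a2.exe"
    sleep 5
done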


qfix can be downloaded from here: http://foldingforum.org/viewtopic.php?f=8&t=191
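
The Ctrl-C step could probably be automated with a small watchdog run from a second terminal or from cron. This is only a sketch: it assumes the client writes to FAHlog.txt in ~/folding/FAH and treats a log that has not changed for 60 minutes as a hang. Both the function name log_is_stale and the threshold are made up, and a slow unit can legitimately go a long time between log lines, so set the limit well above your normal frame time.

```shell
#!/bin/bash
# Hypothetical watchdog helper: succeed (exit 0) if the given file exists
# and has not been modified in the last N minutes.
log_is_stale() {
    local log="$1"
    local minutes="${2:-60}"
    [ -f "$log" ] || return 1
    # find prints the path only when the mtime is older than the limit
    [ -n "$(find "$log" -mmin +"$minutes" -print)" ]
}
```

Used as log_is_stale ~/folding/FAH/FAHlog.txt 60 && killall fah6, this should stand in for the manual Ctrl-C, though whether the loop then carries on exactly as it does after Ctrl-C is an assumption worth checking on one guest first.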
JPinTO
Gerbil Team Leader
 
Posts: 240
Joined: Sat Jun 30, 2007 7:02 am
Location: Toronto, Ontario

Re: Stabilizing the Linux SMP client

Posted on Sun Jun 14, 2009 4:16 am

Latest trial script, which clears the work folder if the work units are uploaded OK:

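Code:
#!/bin/bash
# startfah: like the earlier script, but also clears the queue and work
# folder once qfix reports no units with status 1 or status 2.
cd ~/folding/FAH || exit 1
while true; do

    ./qfix
    ./fah6 -forceasm -smp -verbosity 9 -advmethods -oneunit
    ./qfix

    # Clear work files only when no unit is left at status 1 or status 2
    count1=$(./qfix | grep -c "status 1")
    count2=$(./qfix | grep -c "status 2")
    if [[ $count1 -eq 0 && $count2 -eq 0 ]]; then
        rm -f queue.dat
        rm -f work/*
    fi

    # Delete the downloaded core binaries and kill any leftover core process
    rm -f -- ./*.exe
    killall -9 "FahCore_a2.exe"
    sleep 5
done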
JPinTO
Gerbil Team Leader
 
Posts: 240
Joined: Sat Jun 30, 2007 7:02 am
Location: Toronto, Ontario

