Band-aid for a2 core hangs

Posted on Mon Dec 08, 2008 1:20 am

I'm currently testing a script to automatically recover from the "hang at end of work unit" issue with the a2 core. Anyone else who's interested can help me test it too. Save the script (at the end of this post) in your folding directory, and make sure execute permission is set on the file. To use, just run the script in a terminal window... e.g. if you named the script a2check and your folding directory is fah, you would execute the following commands:
Code:
cd fah
./a2check

If you'd rather run it in the background (and log to a file), you can do the following instead:
Code:
cd fah
./a2check >>a2check.log 2>&1 &

What it does: Every 5 minutes it counts the number of instances of FahCore_a2.exe running on the system. Anything other than 4 (the normal number) or 0 (indicating that the a2 core is not being used at all) means there's a potential problem. If the script detects an anomalous number of a2 cores running, it waits another 5 minutes and checks again (this should prevent "false positives" from occurring if we happen to check just as the WU is starting up or shutting down normally). If the second check still indicates a problem, it kills the wayward a2 cores. This should allow the main folding client to start the next WU normally.

The script runs continuously until killed, and logs a message every time it initiates a recovery.

Here's the script itself (save as a2check in your folding directory):
Code:
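#!/bin/bash
# Watchdog for a2 cores that hang at the end of a work unit.
app="FahCore_a2.exe"
while true; do
    # Count the a2 core processes currently running
    count=$(ps -A | grep -cF "$app")
    # Numeric tests: 0 (a2 core not in use) and 4 (normal) are both OK
    if [[ $count -gt 0 && $count -ne 4 ]]
    then
        # Wait and re-check to avoid false positives at WU start/shutdown
        sleep 300
        count=$(ps -A | grep -cF "$app")
        if [[ $count -gt 0 && $count -ne 4 ]]
        then
            echo "$(date): Nuking $count $app processes"
            killall -9 "$app"
        fi
    fi
    sleep 300
done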
(this space intentionally left blank)
just brew it!
Administrator
Gold subscriber
 
 
Posts: 37955
Joined: Tue Aug 20, 2002 10:51 pm
Location: Somewhere, having a beer

Re: Band-aid for a2 core hangs

Posted on Mon Dec 08, 2008 10:52 am

I think that will run into problems if people are running with options like "-smp 8".

Here's the way I do it in my diskless stuff: I look for the upload of the previous WU with no following attempt to download a new WU:

Code:
#!/bin/sh

while [ 1 ]
do
  # Run every 5 minutes
  sleep 300

  # Clean up the log file
  if [ -f /etc/folding/hanglog.txt ]
  then
    tail -n 1000 /etc/folding/hanglog.txt > /tmp/hanglog.txt
    mv /tmp/hanglog.txt /etc/folding/hanglog.txt
  fi

  # For each instance
  instance=1
  while [ -d /etc/folding/$instance ]
  do
    echo `date` " Checking instance " $instance >> /etc/folding/hanglog.txt

    # Check for upload and not trying to download following
    grep -E 'Number of Units Completed|Preparing to get new work unit|Starting local stats count at' /etc/folding/$instance/FAHlog.txt | tail -n 1 | grep -qE 'Number of Units Completed|Starting local stats count at'
    if [ $?  -eq 0 ]
    then
      # Give the client a chance to continue
      echo "Potential stop found, waiting to see if it clears..." >> /etc/folding/hanglog.txt

      sleep 300
      grep -E 'Number of Units Completed|Preparing to get new work unit|Starting local stats count at' /etc/folding/$instance/FAHlog.txt | tail -n 1 | grep -qE 'Number of Units Completed|Starting local stats count at'
      if [ $?  -eq 0 ]
      then
        echo "Stop failed to clear, continuing cores" >> /etc/folding/hanglog.txt
        killall -CONT FahCore_a2.exe
      fi
    fi
    instance=`expr $instance + 1`
  done
done


This is just part of the check_hang.sh script in folding_cd/initrd_dir/bin if you grab the source code for my stuff.
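
For anyone wanting to try the log test on its own, the pipeline from the script above can be pulled out into a small function. This is only a sketch: the function name wu_looks_stalled is made up for illustration and is not part of check_hang.sh, and it assumes the same three FAHlog.txt message strings the script already greps for.

```shell
#!/bin/sh
# Hypothetical helper (the name is an assumption, not from check_hang.sh).
# $1 = path to a FAHlog.txt file
# Succeeds (exit 0) when the most recent relevant log line shows a finished
# WU or a stats count, with no later attempt to get new work.
wu_looks_stalled() {
    grep -E 'Number of Units Completed|Preparing to get new work unit|Starting local stats count at' "$1" \
        | tail -n 1 \
        | grep -qE 'Number of Units Completed|Starting local stats count at'
}
```

It could then be called as `if wu_looks_stalled /etc/folding/$instance/FAHlog.txt`, once before and once after the 5 minute wait.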
notfred
Grand Gerbil Poohbah
 
Posts: 3761
Joined: Tue Aug 10, 2004 10:10 am
Location: Ottawa, Canada

Re: Band-aid for a2 core hangs

Posted on Mon Dec 08, 2008 1:04 pm

Ahh, yes... my "fix" is dependent on there being 4 cores running when things are operating normally. I guess I should take a closer look at how your check_hang script works.
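
One possible way around the hard-coded 4 is to make the expected count a parameter. The following is a minimal sketch, not a tested fix: the function name count_is_anomalous is invented for illustration, and it assumes you already know how many FahCore_a2.exe processes your particular -smp setting starts when things are healthy.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original a2check script).
# $1 = number of a2 core processes currently running
# $2 = number expected during normal operation (assumption: known to the user)
# Succeeds (exit 0) when the count is neither 0 (core not in use)
# nor the expected value.
count_is_anomalous() {
    [ "$1" -ne 0 ] && [ "$1" -ne "$2" ]
}
```

In the a2check loop, both tests would then read `if count_is_anomalous "$count" "$expected"`, with `expected` taken from a command-line argument and defaulting to 4.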

As an aside, my a2check script has already successfully unstuck two hung Linux clients with no manual intervention on my part. One on a Linux VM, and another on a native Linux box. No more babysitting the SMP cores... yay! :D

Edit: Interesting that we both seem to have arrived at very similar solutions, even down to the 5 minute check interval...
just brew it!

Re: Band-aid for a2 core hangs

Posted on Mon Dec 08, 2008 9:48 pm

I set up a2check on a couple of clients. I'll let you know how it goes. Thanks.

- JP
JPinTO
Gerbil Team Leader
 
Posts: 240
Joined: Sat Jun 30, 2007 7:02 am
Location: Toronto, Ontario

Re: Band-aid for a2 core hangs

Posted on Tue Dec 09, 2008 9:39 am

Nuking 1 Fahcore_a2.exe

Works like a charm! Thanks! :D

- JP
JPinTO

Re: Band-aid for a2 core hangs

Posted on Sun Dec 14, 2008 12:42 pm

Update: Since setting this up on my three dual-core systems about a week ago, it has kicked in 7 times, and worked flawlessly every time.

Until Stanford gets their act together and releases a new a2 core that fixes the end-of-WU shutdown bug, I strongly recommend using this script (or something similar like notfred's method) to keep your Linux SMP WUs flowing smoothly. It improves your point production (by minimizing idle time due to stuck a2 cores), and eliminates the aggravation of dealing with the Linux SMP client.
just brew it!

Re: Band-aid for a2 core hangs

Posted on Sat Dec 27, 2008 7:11 am

This has been working very well for me. I've got it running on 5 or 6 Linux VMs and I have had very few manual restarts in the last few weeks.

Great job!!!
JPinTO

Re: Band-aid for a2 core hangs

Posted on Sun Jun 07, 2009 5:46 am

Status update: I ran a2check on a dozen clients for several months. Unfortunately, killing the a2 core seems to create instabilities in the fah6 process, which sometimes led to WU corruption.

I was hoping that the a2check would help create a smoothly running SMP client, but unfortunately not. I was still doing too much manual intervention with qfix after each WU completed. I've been playing with another script that I've had reasonable success with lately.

- JP
JPinTO

