Creating the "perfect" FAH Monitoring Software

Come join the... uh... er... fold.

Moderators: just brew it!, farmpuma

Re: Creating the "perfect" FAH Monitoring Software

Posted on Sat Nov 08, 2008 3:03 pm

sdack wrote:Btw, why do you even bother about it? Are you the author of FahMon?
Nope, FahMon is written by Andrew Schofield, who goes by the handle "Uncle Fungus". I just want to enlighten people as to why it is so hard to do. I see that later on you posted that wuinfo file where the client has gone crazy, so you should be aware of some of the problems. The source code is open, so you could always post patches to fix these issues; lots of people would appreciate it.
notfred
Grand Gerbil Poohbah
 
Posts: 3762
Joined: Tue Aug 10, 2004 10:10 am
Location: Ottawa, Canada

Re: Creating the "perfect" FAH Monitoring Software

Posted on Sat Nov 08, 2008 4:50 pm

Right now do I see three moderators walking on a limb. Why do you not tell the OP what it is that you would like to see improved in FahMon, or are you lobbying for Uncle Fungus? :wink:
Last edited by sdack on Sat Nov 08, 2008 6:52 pm, edited 1 time in total.
sdack
Gerbil
 
Posts: 66
Joined: Mon Apr 21, 2008 4:47 am
Location: In another thread, having a nice talk

Re: Creating the "perfect" FAH Monitoring Software

Posted on Sat Nov 08, 2008 5:20 pm

It is becoming increasingly obvious that you have nothing positive to add to this thread. I respectfully ask you to change your tack or please refrain from further posting here.

JBI warned you earlier and I count this as strike two.
.* * M-51 * *. .The Whirlpool Galaxy.
farmpuma
Minister of Gerbil Affairs
Silver subscriber
 
 
Posts: 2320
Joined: Mon Mar 22, 2004 12:33 am
Location: Soybean field, IN, USA, Earth .. just a bit south of John .. err .... Fart Wayne, Indiana

Re: Creating the "perfect" FAH Monitoring Software

Posted on Sat Nov 08, 2008 5:56 pm

sdack wrote:Right now do I see three moderators walking on a limb.

This particular forum has two moderators: JBI and farmpuma. Their names are listed near the top. The other posters may have the green colours, but they are moderators for other forums. They have no moderating say in forums they don't moderate and are just like any other regular posters. Please check who is actually moderating a forum before making ill-informed comments.

And for the record, JBI, notfred and I are just highlighting some of the challenges involved in parsing those log files, just like you are highlighting your wishlist for whatever software the OP was thinking about. I see it as a healthy discussion, but some of your comments are not helping it.
The Model M is not for the faint of heart. You either like them or hate them.

Gerbils unite! Fold for UnitedGerbilNation, team 2630.
Flying Fox
Gerbil God
 
Posts: 24524
Joined: Mon May 24, 2004 2:19 am

Re: Creating the "perfect" FAH Monitoring Software

Posted on Sun Nov 09, 2008 8:11 pm

Thanks for the spirited conversation. Back to the subject, yes I'm not satisfied with some of the bugs of Fahmon, but I won't slam the author. It's free and I got what I paid for.

I'm monitoring 20+ clients, and could easily throw on a dozen more but I hold off because I don't have an adequate mechanism for reliably monitoring the 20 clients that I have, let alone the 30 I could have.

The log file parsing is tricky for sure, and I've started writing the parser code for it. Different formats depending on the WU/client will be a pain in the neck. But I've got some ideas that are already gelling in the monitoring tool which I'm excited about... I think the GUI is going to rock. I want some of the GUI simplicity of Fahmon, but with more advanced functionality for those with OCD who need to have all info at their fingertips. Sifting through the log file and manually eyeballing stuff is real drudgery and pointless.

In the end, I am sure that I'll have numerous bugs in the software... it's "hobby" level software when I'm not writing software during my day job. I've got a MS-Windows background and not opensource, so my first try will be using the crappy MS tools I already know.

One of my issues with Fahmon is its PPD calculation methods. I'm probably going to use a three-phase PPD calculation: actual PPD based on uploads, WU PPD and step PPD.
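
One possible reading of those three figures, as a rough Python sketch. The function names and the 100-frames-per-WU default are assumptions made for illustration; only the idea of three separate estimates comes from the post.

```python
SECONDS_PER_DAY = 86400

def step_ppd(wu_points, frame_seconds, frames_per_wu=100):
    """PPD projected from the time taken by the most recent frame (step)."""
    if frame_seconds <= 0 or frames_per_wu <= 0:
        return None  # bogus or missing value: do not guess
    return wu_points * SECONDS_PER_DAY / (frame_seconds * frames_per_wu)

def wu_ppd(wu_points, elapsed_seconds, percent_done):
    """PPD projected from the average pace over the whole WU so far."""
    if elapsed_seconds <= 0 or not 0 < percent_done <= 100:
        return None
    projected_total_seconds = elapsed_seconds * 100.0 / percent_done
    return wu_points * SECONDS_PER_DAY / projected_total_seconds

def actual_ppd(uploaded_points, window_days):
    """PPD from points credited for uploads within a window of days."""
    if window_days <= 0:
        return None
    return sum(uploaded_points) / window_days
```

The step figure reacts quickly to slowdowns, the WU figure smooths them out, and the upload figure only counts work that was actually returned.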

As for the complaints about the fahlog errors by Stanford, does anyone care to shed some light on the specifics so that I am cognizant of them? I don't think I'll even use the unitinfo.txt file at all and just stick to the logfiles.

- JP
Last edited by JPinTO on Sun Nov 09, 2008 8:43 pm, edited 3 times in total.
JPinTO
Gerbil Team Leader
 
Posts: 240
Joined: Sat Jun 30, 2007 7:02 am
Location: Toronto, Ontario

Re: Creating the "perfect" FAH Monitoring Software

Posted on Sun Nov 09, 2008 8:18 pm

I'm looking forward to seeing what you can come up with. I'll be glad to help you beta test it when you reach that stage. :D
Fold! And I don't mean your clothes!

Do you have a favorite gerbil recipe? Please share with the TR community!
flybywire
Gerbil Jedi
 
Posts: 1883
Joined: Wed Jun 16, 2004 2:28 pm
Location: Springfield, VA - USA

Re: Creating the "perfect" FAH Monitoring Software

Posted on Mon Nov 10, 2008 3:12 am

JPinTO wrote:As for the complaints about the fahlog errors by Stanford, does anyone care to shed some light on the specifics so that I am cognizant of them? I don't think I'll even use the unitinfo.txt file at all and just stick to the logfiles.

- JP

A couple of things need to be mentioned:

- Each value you pull out of the logs or derive from them needs a bounds check. If it falls out of bounds, then do not use it; create an estimate or warn about the possibly false value instead. For example, ETAs should not point into the past, percentages should range between 0 and 100, check for zeros before you do divisions, etc.
- Make your application rely on as little as possible, just like the FAH clients. FahMon requires the user to set the timezone while the clients work fine without it - this is unnecessary. Avoid user interactions with your application, pull the information automatically from the OS or create estimates. The last thing one wants is to care about 30 clients + 1 monitoring application.
- Use as few resources as possible. You do not want to steal too much time from the OS while you may be running a client on the same machine as your application. For instance, do not monitor anything while your app's window is not visible (minimized) and pick up monitoring as soon as it becomes visible. Do not iterate over the log files every 10 seconds when you could just wait for a file change event. (I know, network file systems are tricky, however.)
- Have an option to send out an email in case a client is estimated to miss its deadlines or anything else bad has occurred.
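
The bounds checks from the first point could look roughly like this in Python. This is only a sketch; the helper names are made up for illustration and are not taken from FahMon or any FAH tool.

```python
from datetime import datetime, timedelta

def checked_percent(value):
    """Return the percentage if it lies in 0..100, else None."""
    if value is None or not 0 <= value <= 100:
        return None
    return value

def safe_divide(numerator, denominator):
    """Divide, returning None instead of raising on a zero denominator."""
    if denominator == 0:
        return None
    return numerator / denominator

def checked_eta(now, percent, seconds_per_percent):
    """Return an ETA that cannot lie in the past, or None on bad input."""
    percent = checked_percent(percent)
    if percent is None or seconds_per_percent is None or seconds_per_percent <= 0:
        return None
    remaining_seconds = (100 - percent) * seconds_per_percent
    return now + timedelta(seconds=remaining_seconds)
```

Returning None instead of a number forces the caller to decide whether to show an estimate or a warning, rather than silently displaying a nonsensical value.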

And you can take this further:
Build a client/server architecture with a simple daemon/service on the client machines and without relying on network file systems. Have the user set up the monitoring daemon/service instead of a network file system. Implement support for SNMP, ... And there is more than monitoring. There is control, too. Implement control functions for setting affinity locks and priorities with day & night time / weekend schedules, ... Implement a web proxy so one can run the FAH clients through your monitoring application and monitor network events and traffic.

Last but not least, do not implement everything people tell you to but only what you would like to see, too.

I am looking forward to what you can do and do not forget to post your progress in here :)
sdack
Gerbil
 
Posts: 66
Joined: Mon Apr 21, 2008 4:47 am
Location: In another thread, having a nice talk

Re: Creating the "perfect" FAH Monitoring Software

Posted on Mon Nov 10, 2008 11:42 am

sdack wrote:A couple of things need to be mentioned:


LOL! Well, I did ask for feedback, and you weren't shy about providing that. I like your client/server concept, but I'm not going to think that far down the line otherwise the scope will become so large that it no longer feels like a "fun" project. Baby steps.

For simplicity's sake, I will continue with Stanford's UTC time reporting rather than worry about timezones. I think we're all used to the UTC format.

I agree with you that the setup should be trivial. It's one of the things I like about Fahmon: just add a client name and a path and it works. Simple.

Dead client notification will require periodic polling of the clients, so the application will need some resources periodically. It will probably be set to a default interval of 5-15 minutes. I will implement several kinds of client error detection: host not responding to ICMP ping, network share not responding, client dead for a certain interval (no log file timestamp change), client error, client warning.
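
The file-based part of that detection could be sketched in Python as follows. The status names and the 15-minute default are assumptions for illustration; the ICMP ping and the client error/warning checks are not covered here.

```python
import os
import time

def client_status(log_path, stale_after_seconds=15 * 60, now=None):
    """Classify a client from its log file's modification time.

    Returns "unreachable" if the file cannot be stat'ed (host or share
    down), "stale" if the log has not changed within the interval,
    otherwise "alive".
    """
    now = time.time() if now is None else now
    try:
        mtime = os.stat(log_path).st_mtime
    except OSError:
        return "unreachable"
    if now - mtime > stale_after_seconds:
        return "stale"
    return "alive"
```

Since some cores can stay silent for several minutes while shutting down, the stale interval probably wants to be generous.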

Email notification will be a phase 2 priority.

- JP
JPinTO
Gerbil Team Leader
 
Posts: 240
Joined: Sat Jun 30, 2007 7:02 am
Location: Toronto, Ontario

Re: Creating the "perfect" FAH Monitoring Software

Posted on Mon Nov 10, 2008 1:36 pm

There are books and standards covering the development of monitor & control applications (also known as "telemetry & telecommand"). Should you ever grow tired or feel you are running out of ideas, just pick one up.

Edit:
Regarding the idea of interval checks: I am not a Windows programmer, but if there is an equivalent to the UNIX system call poll() (or select()) under Windows, then that is already all you need (I am pretty sure it exists since it is very helpful). All you do is poll for a file change together with a large timeout. When poll() returns, it is either because the timeout has been reached, meaning the client is hung, or because the client has touched the file. The result of this function is unambiguous and serves both purposes with a minimum of resources. You will probably know this function ... If not, then try to find it. It will serve you as the "heart" of your application.
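
A hedged Python sketch of the "wait for a change or time out" idea. One caveat: on UNIX, poll() and select() report regular files as always ready, so they do not block until a log file changes; OS-specific notification facilities (inotify on Linux, ReadDirectoryChangesW on Windows) fill that role. To stay portable, this sketch falls back to comparing os.stat() results at a coarse interval. The function name and intervals are made up for illustration.

```python
import os
import time

def wait_for_change(path, timeout_seconds, check_every=1.0):
    """Block until the file's mtime or size changes, or the timeout expires.

    Returns True if the file changed (client alive), False on timeout
    (client possibly hung).
    """
    def snapshot():
        try:
            st = os.stat(path)
            return (st.st_mtime_ns, st.st_size)
        except OSError:
            return None  # file missing or share unreachable

    before = snapshot()
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        # Sleep one interval, but never past the deadline.
        time.sleep(min(check_every, max(0.0, deadline - time.monotonic())))
        if snapshot() != before:
            return True
    return False
```

The single return value carries both answers: True means the client wrote to its log, False means nothing happened for the whole timeout.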
sdack
Gerbil
 
Posts: 66
Joined: Mon Apr 21, 2008 4:47 am
Location: In another thread, having a nice talk

Re: Creating the "perfect" FAH Monitoring Software

Posted on Tue Nov 11, 2008 9:36 pm

JPinTO wrote:As for the complaints about the fahlog errors by Stanford, does anyone care to shed some light on the specifics so that I am cognizant of them?

Here's a list of some of the gotchas that I have run into with my diskless stuff.

Some WUs start at 0% straight away, others wait and then start at 1% later.

Even after the WU has finished, it can take a substantial amount of time for all the cores to shut down before beginning to send the results - I have personally seen 8 minutes on one Core_a1. This can make it very difficult to determine for sure when a core is hung.

If the log file gets too big, it will get moved to -Prev and a new one opened. If you have just kept your file handle open on the file, you are now pointing at -Prev and not the real file.
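
One way to cope with that rotation, sketched in Python: remember which file the open handle belongs to and reopen when the path starts pointing at a different file, or when the file has shrunk. The class name is hypothetical, and the sketch assumes the platform lets the log be renamed while a reader has it open.

```python
import os

class LogTailer:
    """Reads new lines from a log, reopening it if it was rotated."""

    def __init__(self, path):
        self.path = path
        self.handle = None
        self.file_id = None

    def _open(self):
        # Binary mode keeps tell() a plain byte offset.
        self.handle = open(self.path, "rb")
        st = os.fstat(self.handle.fileno())
        self.file_id = (st.st_dev, st.st_ino)

    def _read(self):
        return [raw.decode("utf-8", errors="replace")
                for raw in self.handle.readlines()]

    def read_new_lines(self):
        if self.handle is None:
            self._open()
        lines = self._read()  # drain whatever is left in the old file
        try:
            st = os.stat(self.path)
        except OSError:
            return lines  # log briefly missing while being rotated
        rotated = (st.st_dev, st.st_ino) != self.file_id
        truncated = st.st_size < self.handle.tell()
        if rotated or truncated:
            self.handle.close()
            self._open()
            lines += self._read()
        return lines
```

Draining the old handle before reopening means lines written just before the move are not lost.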

SMP WUs can print multiple copies of some info to the WU, particularly during WU startup. This can corrupt the printing of the WU so you may need to go to unitinfo.txt to get that reliably, although I think it doesn't always get updated correctly at the start if you have a WU that does the starting at 1%.

Older single core WUs used to vary between "%" and "percent", and some were in [] and some in (). Not sure if that is still the case.

If you run with verbosity 9 then the logfile still gets updated with notices about sending any unfinished WUs even if there aren't any and even if the cores have hung - this needs to be filtered out.
notfred
Grand Gerbil Poohbah
 
Posts: 3762
Joined: Tue Aug 10, 2004 10:10 am
Location: Ottawa, Canada

Re: Creating the "perfect" FAH Monitoring Software

Posted on Wed Nov 12, 2008 5:57 pm

Thank you, notfred. I've noticed many of those behaviors, and I'll keep my eye open for the others. I've noticed that others don't report a 100% completion.

- JP
JPinTO
Gerbil Team Leader
 
Posts: 240
Joined: Sat Jun 30, 2007 7:02 am
Location: Toronto, Ontario

Re: Creating the "perfect" FAH Monitoring Software

Posted on Thu Nov 13, 2008 5:09 am

Just do not expect to be able to rely on anything. It is best to base your monitoring on the frequency of file changes (=> "alive" or "hung"). Everything else, i.e. what you read out of the files, you should use only as additional information (=> "PPD" or "ETA").

This is the start of a run with an a2 core (Linux):
Code:
...
[08:00:42] Project: 2669 (Run 9, Clone 97, Gen 19)
[08:00:42]
[08:00:42] Assembly optimizations on if available.
[08:00:42] Entering M.D.
[08:00:52] (Run 9, Clone 97, Gen 19)
[08:00:52]
[08:00:53] Entering M.D.
[08:33:37] Completed 5009 out of 250000 steps  (2%)

It can start with 0% but here it starts with 2%. This is from an interrupted a1 core:
Code:
...
[10:13:20] Project: 2665 (Run 3, Clone 383, Gen 66)
[10:13:20]
[10:13:20] Entering M.D.
[10:13:26] Calling FAH init
[10:13:27] Read topology
[10:13:27] (Starting from checkpoint)
[10:13:27] Read checkpoint
[10:13:28] Protein: HGG in water
[10:13:28] Writing local files
[10:13:28] Completed 137500 out of 250000 steps  (55 percent)
...

The a1 core uses "percent" instead of "%". The Nvidia GPU2 core produces this:
Code:
...
[07:49:20] Project: 5506 (Run 0, Clone 58, Gen 229)
[07:49:20]
[07:49:20] Assembly optimizations on if available.
[07:49:20] Entering M.D.
[07:49:26] Will resume from checkpoint file
[07:49:27] Working on p5506_supervillin_e1
[07:49:27] Client config found, loading data.
[07:49:27] Starting GUI Server
[07:49:27] Resuming from checkpoint
[07:49:27] Verified work/wudata_00.log
[07:49:27] Verified work/wudata_00.edr
[07:49:27] Verified work/wudata_00.trr
[07:49:27] Verified work/wudata_00.xtc
[07:49:27] Completed 36%
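
The three progress formats shown above could be handled by a single pattern, roughly like this in Python. It is a sketch built only from these samples (plus the [] variant mentioned earlier in the thread); other cores may well print something else, so anything that does not match is returned as None rather than guessed at. The names are hypothetical.

```python
import re

PROGRESS_RE = re.compile(
    r"^\[(?P<time>\d{2}:\d{2}:\d{2})\]\s+Completed\s+"
    # Optional "N out of M steps (" or "[" prefix used by the CPU cores.
    r"(?:(?P<done>\d+)\s+out of\s+(?P<total>\d+)\s+steps\s*[\(\[])?"
    r"\s*(?P<percent>\d+)\s*(?:%|percent)"
)

def parse_progress(line):
    """Return (time, percent) from a 'Completed' log line, or None."""
    match = PROGRESS_RE.match(line)
    if match is None:
        return None
    percent = int(match.group("percent"))
    if not 0 <= percent <= 100:
        return None  # out of bounds: treat as unparseable
    return match.group("time"), percent
```

For example, parse_progress("[10:13:28] Completed 137500 out of 250000 steps  (55 percent)") gives ("10:13:28", 55), while a line such as "[08:00:42] Entering M.D." gives None.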
sdack
Gerbil
 
Posts: 66
Joined: Mon Apr 21, 2008 4:47 am
Location: In another thread, having a nice talk

Re: Creating the "perfect" FAH Monitoring Software

Posted on Sun Nov 16, 2008 7:08 pm

FahMon has four methods of calculating the ETA but it is still a bit dissatisfying. If you intend to copy these methods then, please, take the following into account:

- "all frames" in FahMon uses all the time frames for a WU from across all clients. It should however not mix the clients' times or the result becomes nonsensical because of individual client speeds.
- When using a single frame, the monitoring application should not wait for the first three frames to calculate an ETA, as is the case with FahMon.
- Have an option like "number of hours" to base the calculation of the ETA on a time limit, too. My preference with FahMon is to use the last three frames, but three frames can mean as little as three minutes or as much as several hours depending, again, on the client's type.

Best is to have options for specifying a free number of frames and hours (and perhaps days) for the calculation of an ETA, and if there is a lack of frames then it should use what is available and indicate the lack.
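
A sketch of such an ETA calculation in Python, under the assumption that each frame is stored as a (unix_time, percent) pair for a single client. The function name and parameters are made up for illustration. It uses up to max_frames recent frames, optionally only those within max_age_seconds of the newest one, and reports when it had fewer frames than requested.

```python
def estimate_remaining_seconds(frames, max_frames=3, max_age_seconds=None):
    """Estimate seconds left for a WU from (unix_time, percent) frames.

    frames must all come from one client, oldest first. Returns a tuple
    (seconds_remaining, frames_used, short_of_data), or None if no
    estimate is possible.
    """
    if len(frames) < 2:
        return None
    window = frames[-(max_frames + 1):]  # n frame intervals need n+1 points
    if max_age_seconds is not None:
        newest_time = window[-1][0]
        recent = [f for f in window if newest_time - f[0] <= max_age_seconds]
        if len(recent) >= 2:
            window = recent
    (t0, p0), (t1, p1) = window[0], window[-1]
    if p1 <= p0 or t1 <= t0:
        return None  # no progress or clock went backwards: do not guess
    seconds_per_percent = (t1 - t0) / (p1 - p0)
    frames_used = len(window) - 1
    short_of_data = frames_used < max_frames
    return (100 - p1) * seconds_per_percent, frames_used, short_of_data
```

With only two data points it still returns an estimate based on one frame, with short_of_data set to True so the GUI can flag it.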
sdack
Gerbil
 
Posts: 66
Joined: Mon Apr 21, 2008 4:47 am
Location: In another thread, having a nice talk

Re: Creating the "perfect" FAH Monitoring Software

Posted on Mon Nov 17, 2008 9:52 pm

I was hoping to come up with an ETA calc that just "works". Still ironing out the GUI design.
JPinTO
Gerbil Team Leader
 
Posts: 240
Joined: Sat Jun 30, 2007 7:02 am
Location: Toronto, Ontario

Re: Creating the "perfect" FAH Monitoring Software

Posted on Wed Nov 26, 2008 10:03 pm

Phew. This is a lot of work.
JPinTO
Gerbil Team Leader
 
Posts: 240
Joined: Sat Jun 30, 2007 7:02 am
Location: Toronto, Ontario

Re: Creating the "perfect" FAH Monitoring Software

Posted on Wed Nov 26, 2008 10:06 pm

I had some bizarre thing jam up Fahmon for the last few weeks. It was taking an inordinate amount of time to report the status of clients, appearing as if it was hung.

I finally found the problem: Fahmon reads the unitinfo.txt file, and for one of my clients, the file had grown to 168 MB in size. The progress line was full of [||||||||||||||||||||||||||||||||||||||||..repeat for 168 MB worth]

Don't know how that happened, but I'll be handling that situation.
JPinTO
Gerbil Team Leader
 
Posts: 240
Joined: Sat Jun 30, 2007 7:02 am
Location: Toronto, Ontario

Re: Creating the "perfect" FAH Monitoring Software

Posted on Wed Nov 26, 2008 11:07 pm

Check the FAHMon thread on the foldingforum bulletin board. I believe there's a new version out that only loads the first few bytes of the unitinfo.txt file. That way, even if the file is huge, the read of the WU is nice and quick.
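
Reading only the head of the file is simple to do. A minimal Python sketch, where the function name and the 4096-byte cap are assumptions and not what FahMon actually uses:

```python
def read_head(path, max_bytes=4096):
    """Read at most max_bytes from the start of a file and decode it."""
    with open(path, "rb") as handle:
        data = handle.read(max_bytes)
    # Replace undecodable bytes instead of failing on a corrupted file.
    return data.decode("utf-8", errors="replace")
```

The read cost then depends on the cap, not on how large the file has grown.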
notfred
Grand Gerbil Poohbah
 
Posts: 3762
Joined: Tue Aug 10, 2004 10:10 am
Location: Ottawa, Canada

Re: Creating the "perfect" FAH Monitoring Software

Posted on Wed Nov 26, 2008 11:30 pm

JPinTO wrote:I finally found the problem: Fahmon reads the unitinfo.txt file, and for one of my clients, the file had grown to 168 MB in size. The progress line was full of [||||||||||||||||||||||||||||||||||||||||..repeat for 168 MB worth]

:o Sigh, bugs... :-?

It reminded me why I did not join the Folding crowd until recently (was doing UD/WCG for the [H] before).
The Model M is not for the faint of heart. You either like them or hate them.

Gerbils unite! Fold for UnitedGerbilNation, team 2630.
Flying Fox
Gerbil God
 
Posts: 24524
Joined: Mon May 24, 2004 2:19 am
