JPinTO wrote:#1: Better visibility and reporting of stalled or hung clients. Reporting if you are processing the same WU twice in a row. Email notification of problems. Ability to mark clients as "offline" so that they aren't distracting you.
#2: Points Per Day Tracking: I want to see time per step, PPD per step, a PPD History graph calculated on each step like the Task Manager's CPU History graph, etc.
#3: Ability to spawn VNC
#4: Try (!) to compensate for poor time reporting on the client with VMWare Linux clients.
just brew it! wrote:Yeah, I've often thought about doing stalled node detection by checking the last time the FAHlog.txt file was touched (this time is the "last log update" line in my status report); just never got around to implementing it.
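The "last time FAHlog.txt was touched" idea can be sketched in a few lines of shell. This is only an illustration, not anyone's actual implementation: the function names `seconds_since_modified` and `is_stalled` are made up for the example, and it assumes GNU `stat` as found on Linux.

```bash
#!/bin/bash
# Sketch: flag a client as possibly stalled when its FAHlog.txt has not
# been modified for longer than a threshold. Function names are hypothetical.

# seconds_since_modified FILE -> prints the age of FILE in seconds
seconds_since_modified() {
    # GNU stat (Linux) prints the modification time as epoch seconds with -c %Y
    _mtime=$(stat -c %Y "$1" 2>/dev/null) || return 1
    echo $(( $(date +%s) - _mtime ))
}

# is_stalled FILE THRESHOLD_SECONDS -> exit status 0 if FILE is older than
# the threshold, 1 if it is fresh, 2 if the file could not be read
is_stalled() {
    _age=$(seconds_since_modified "$1") || return 2
    [ "$_age" -gt "$2" ]
}
```

Usage would look like `is_stalled /path/to/FAHlog.txt 1800 && echo "possibly stalled"`. The threshold is a judgment call and should be generous, since a raw modification time says nothing about what was written to the log.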
just brew it! wrote:The part of this that is a PITA is keeping your WU database updated with new WUs as they come out. I've been doing this manually, by maintaining a text file where I copy-paste new WUs from Stanford's project summary page. The utility that calculates the PPD reads the text file to get the WU point values.
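For illustration, a PPD calculation driven by such a text file might look roughly like the sketch below. The file layout (one `project points` pair per line), the function names `wu_points` and `ppd`, and the default of 100 frames per WU are all assumptions made for the example, not the format of the actual utility.

```bash
#!/bin/bash
# Sketch: look up a WU's point value in a hand-maintained text file and
# turn a measured time per frame into points per day (PPD).
# Assumed (hypothetical) file format: "<project number> <points>" per line.

# wu_points DBFILE PROJECT -> prints the point value, fails if not listed
wu_points() {
    awk -v p="$2" '$1 == p { print $2; found = 1; exit }
                   END { if (!found) exit 1 }' "$1"
}

# ppd POINTS SECONDS_PER_FRAME [FRAMES] -> prints points per day
# PPD = points * 86400 / (seconds per frame * frames per WU)
ppd() {
    awk -v pts="$1" -v spf="$2" -v frames="${3:-100}" 'BEGIN {
        # refuse a zero or negative frame time instead of dividing by zero
        if (spf <= 0 || frames <= 0) exit 1
        printf "%.1f\n", pts * 86400 / (spf * frames)
    }'
}
```

With a file containing a line such as `2669 1920` (a placeholder value), `ppd "$(wu_points wus.txt 2669)" 864` would print the PPD for a client taking 864 seconds per frame. A project that is missing from the file makes `wu_points` fail, which is a natural place to prompt for a database update.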
just brew it! wrote:I'm assuming this would be for local nodes only? Exposing VNC on the 'net isn't a good idea, since its security is poor.
just brew it! wrote:IMO a better solution here is to try to deal with the poor clock synchronization. This is one aspect of VMware which really pisses me off; VirtualBox seems to get the clock right, but unfortunately the SMP Linux client won't run under VirtualBox.
notfred wrote:I wouldn't reinvent the wheel. FAHMon is good, so start with that and extend it to do what you want; it's open source (GPL license), so go grab the code from svn.
Be aware that it is non-trivial to decide whether a client is hung. The log file can still get updates from autosend and the like that will not show it is hung, so you have to filter the updates to that file. Also, some things can take minutes to go through, e.g. at the end of a core_a1 WU, and you don't want to pull the trigger too early.
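One way to filter the updates is to look only at progress lines and ignore everything else in the log. The sketch below assumes progress lines of the form `[HH:MM:SS] Completed N out of M steps`; other cores may report progress differently, so the pattern is an assumption, and `last_progress_time` is a made-up name.

```bash
#!/bin/bash
# Sketch: find the timestamp of the last real progress line in FAHlog.txt,
# ignoring autosend and other housekeeping messages.
# Assumed line format: "[HH:MM:SS] Completed N out of M steps  (P%)"

# last_progress_time LOGFILE -> prints HH:MM:SS of the last progress line,
# or nothing if the log contains no progress lines
last_progress_time() {
    grep -E '^\[[0-9]{2}:[0-9]{2}:[0-9]{2}\] Completed [0-9]+ out of [0-9]+ steps' "$1" \
        | tail -n 1 \
        | sed -E 's/^\[([0-9:]{8})\].*/\1/'
}
```

A monitor would compare this timestamp, rather than the file's modification time, against a threshold long enough to cover the slow phases at the end of a WU.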
JPinTO wrote:I wasn't going to maintain a WU database. I was just going to pull the values directly from the Stanford psummary page when needed. Do they not update the psummary page frequently, or what am I missing?
I have not had success under Ubuntu with activating NTP under Gnome for whatever reason. I'm not a hardened Linux person, so perhaps I'm doing something wrong. I've activated the vmware tools "synchronization with host" function and that helped greatly. I still have a few guests that aren't able to sync... although these are usually guests that are sharing cores with other guests.
just brew it! wrote:I don't think the stock NTP client configuration is aggressive enough to deal with the magnitude of clock skew VMware can introduce. I've been using the following shell script (run it in a terminal window in the guest, with root privilege):
#!/bin/bash
while true; do
date
ntpd -q -g
sleep 150;
done

sdack wrote:I have one request - just do not be as stupid as the guy who programmed FahMon. I am being serious. The author of FahMon has programmed several classic bugs into his application, like division by zero, offsets being wrong by one and others.
At midnight, when the clock wraps around from 23:59 to 00:00, all my clients get reported as hung even when I enable the option to ignore asynchronous clocks. And when a client reaches 99%, its ETA points into the past. This is just ridiculous, and one cannot trust any of the application's numbers.
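Since the log timestamps carry no date, a monitor has to handle the wrap at midnight itself. A minimal sketch of one way to do it, assuming two-digit `HH:MM:SS` timestamps and that the two events are less than 24 hours apart (the function names are made up for the example):

```bash
#!/bin/bash
# Sketch: elapsed time between two date-less HH:MM:SS log timestamps,
# treating an earlier-looking end time as a wrap past midnight.

# to_seconds HH:MM:SS -> seconds since midnight (fields must be two digits)
to_seconds() {
    _h=${1%%:*}; _rest=${1#*:}; _m=${_rest%%:*}; _s=${_rest#*:}
    # strip one leading zero so "08" and "09" are not read as invalid octal
    echo $(( ${_h#0} * 3600 + ${_m#0} * 60 + ${_s#0} ))
}

# elapsed_seconds START END -> seconds from START to END, modulo 24 hours
elapsed_seconds() {
    _start=$(to_seconds "$1")
    _end=$(to_seconds "$2")
    echo $(( (_end - _start + 86400) % 86400 ))
}
```

For example, `elapsed_seconds 23:59:30 00:00:15` prints `45` rather than a large negative number. Gaps of a full day or more cannot be recovered from the timestamps alone, so a real monitor would need another source for the date, such as the file's modification time.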
notfred wrote:If you think you can do better, go ahead. FYI many of those issues are actually from Stanford's side and not FAHMon and there are threads on the folding forum requesting Stanford fix their progress indications so that they are not off by one for some WUs and not for others. Stanford do know about it, but it is not a priority for them because it doesn't impact the science contained in the WUs.
sdack wrote:Btw, why do you even bother about it? Are you the author of FahMon?
Heiwashin wrote:sdack wrote:Btw, why do you even bother about it? Are you the author of FahMon?
Are you a paying customer of FahMon?

Flying Fox wrote:FahMon/FahSpy/whatever is just a glorified log parser which knows how to read the FahLog.txt file.
sdack wrote:Flying Fox wrote:FahMon/FahSpy/whatever is just a glorified log parser which knows how to read the FahLog.txt file.
No, it [FahMon] does not know how to read the logfile, or else it would know what to do when times switch from 23:59 to 00:00. It does not take a genius to get that right. Do not try to get smart with me! This thread is about creating a new monitor application. So until you have anything smart to say I suggest you STFU.

Flying Fox wrote:bla bla bla
sdack wrote:Flying Fox wrote:bla bla bla
No, you STFU and get out of your arm chair. The OP wants to write a new application and I support him. What is it you think you are doing?
Btw, I am a senior software engineer. What is it that you are?
flybywire wrote:Yes, I can tell by the level of maturity that you've demonstrated thus far.

Flying Fox wrote:Did I ever say I am not supporting the OP?
That's why I never pay much attention to labels such as "senior" when I am interviewing people. I would rather take a person who knows what he/she is talking about and has real experience (note: this does not necessarily correlate with time on a job) than rely on qualifiers that can be obtained by various means without real skills/knowledge/experience backing them up.
just brew it! wrote:Edit: My take on the FahMon debate -- Given that it is Open Source software, it makes sense to use it as a starting point for any new effort. While it may have bugs, there is also a lot of working code there already; why reinvent the wheel? We should either figure out how to fix the bugs and submit the fixes back to the original developer, or fork the code base.
:~$ ls -l
...
-rw-r--r-- 1 sven sven 172316321 2008-11-08 18:02 unitinfo.txt
...
:~$ more unitinfo.txt
Current Work Unit
-----------------
Name: Gromacs
Tag: P2669R17C82G21
Download time: November 8 14:29:23
Due time: November 11 14:29:23
Progress: 1723161591% [||||||||||||...
sdack wrote:While you are right in general, I do think that, given FahMon's bugs, it might not be too big a mistake to have an alternative. A little bit of competition can improve software, too.
just brew it! wrote:I doubt anyone here has the time to do a full-blown FahMon style application completely from scratch.

Flying Fox wrote:The problem is to get a comprehensive list of WU types and their potential errors to work around, and to keep that list updated whenever Stanford pushes out new cores that change the behaviour. That is arguably the bigger problem to tackle, because it depends on external factors.