upload problems from one machine


Posted on Mon Apr 27, 2009 3:00 pm

I have an unusual problem. Something is currently broken on one of my fastest SMP machines which is preventing me from connecting to send results.

I have quite a few machines running F@H in the office, but I only run SMP on three that I have direct control over. On one of these, I have been getting the following sorts of messages:

[18:16:29] + Attempting to send results [April 27 18:16:29 UTC]
[18:16:29] - Reading file work/wuresults_05.dat from core
[18:16:29] (Read 22078106 bytes from disk)
[18:16:29] Connecting to http://171.64.65.64:8080/
[18:16:29] - Couldn't send HTTP request to server
[18:16:29] + Could not connect to Work Server (results)
[18:16:29] (171.64.65.64:8080)
[18:16:29] + Retrying using alternative port
[18:16:29] Connecting to http://171.64.65.64:80/
[18:16:29] - Couldn't send HTTP request to server
[18:16:29] + Could not connect to Work Server (results)
[18:16:29] (171.64.65.64:80)
[18:16:29] - Error: Could not transmit unit 05 (completed April 25) to work server.
[18:16:29] - 14 failed uploads of this unit.


On the other machine, also running SMP, I am having no difficulties connecting to the same server.

[15:44:34] Trying to send all finished work units
[15:44:34] + No unsent completed units remaining.
[15:44:34] - Preparing to get new work unit...
[15:44:34] + Attempting to get work packet
[15:44:34] - Will indicate memory of 1024 MB
[15:44:34] - Connecting to assignment server
[15:44:34] Connecting to http://assign.stanford.edu:8080/
[15:44:35] Posted data.
[15:44:35] Initial: 40AB; - Successful: assigned to (171.64.65.64).
[15:44:35] + News From Folding@Home: Welcome to Folding@Home
[15:44:35] Loaded queue successfully.
[15:44:35] Connecting to http://171.64.65.64:8080/
[15:44:38] Posted data.
[15:44:38] Initial: 0000; - Receiving payload (expected size: 2436771)
[15:44:58] - Downloaded at ~118 kB/s
[15:44:58] - Averaged speed for that direction ~106 kB/s
[15:44:58] + Received work.
[15:44:58] Trying to send all finished work units
[15:44:58] + No unsent completed units remaining.
[15:44:58] + Closed connections

This started happening a couple of weeks ago, and I haven't been able to upload a result from machine #1 since then, while machine #2 has worked flawlessly throughout, even when connecting to the same server.

I have no difficulty DOWNLOADING new units (or cores, or anything), just uploading, and just from this one machine.

Out of curiosity, and as a troubleshooting step, I took the queue and work files from the machine that CANNOT connect and transferred them to a machine that CAN ... and suddenly it couldn't connect to send the results either. Once I restored the original queue and work files, machine #2 worked fine again. So it's not anything in the machine's configuration. Could there be something in the queue itself that is causing this problem?

Any insight would be appreciated. I've got my heavy hitter out of the lineup, and farmpuma is getting away. :( :wink:
Bookrat
Gerbil
 
Posts: 65
Joined: Thu Mar 09, 2006 2:43 pm

Re: upload problems from one machine

Posted on Mon Apr 27, 2009 6:07 pm

I suspect your problem is related to the Win SMP client warning about file fragments left in the work folder at the end of each work unit. You could try something like qfix, although I've never had any luck with it fixing anything. Since you are probably very close to or past the WU deadline, I suggest trashing the work folder and the queue.dat file. :cry: Since I'm on dial-up I trash them after each finished work unit.

Catching me should soon become about 3,000 PPD harder, although extremely hot weather may limit me to folding only at night.
.* * M-51 * *. .The Whirlpool Galaxy.
farmpuma
Minister of Gerbil Affairs
Silver subscriber
 
 
Posts: 2306
Joined: Sun Mar 21, 2004 11:33 pm
Location: Soybean field, IN, USA, Earth .. just a bit south of John .. err .... Fart Wayne, Indiana

Re: upload problems from one machine

Posted on Wed Apr 29, 2009 9:06 am

farmpuma wrote: I suspect your problem is related to the Win SMP client warning about file fragments left in the work folder at the end of each work unit. You could try something like qfix, although I've never had any luck with it fixing anything. Since you are probably very close to or past the WU deadline, I suggest trashing the work folder and the queue.dat file. :cry: Since I'm on dial-up I trash them after each finished work unit.

Catching me should soon become about 3,000 PPD harder, although extremely hot weather may limit me to folding only at night.



I agree with Mr. Puma. It sucks to throw away points, but deleting the folders is usually the only fix that works. :cry:
Join UGN's Drive to the Top!
UnitedGerbilNation wants you!!
jeffry55
Grand Gerbil Poohbah
 
Posts: 3181
Joined: Sat Oct 30, 2004 3:38 pm
Location: Menlo Park - just down the street from the F@H Servers!

Re: upload problems from one machine

Posted on Fri May 01, 2009 10:42 am

Alright, so, on the advice of those above, I completely removed the queue.dat and work folder, then sat down to wait. And, life got busy, so I forgot to check again, until today. And now I'm even more mystified.

Enough time had gone by that the machine had completed TWO units. Looking through the log shows that the first completed unit is suffering exactly the same problem as before ... but look what happened this morning:


[09:58:01] Folding@home Core Shutdown: FINISHED_UNIT
[09:58:04] CoreStatus = 64 (100)
[09:58:04] Unit 2 finished with 70 percent of time to deadline remaining.
[09:58:04] Updated performance fraction: 0.726287
[09:58:04] Sending work to server
[09:58:04] Project: 2653 (Run 16, Clone 129, Gen 104)


[09:58:04] + Attempting to send results [May 1 09:58:04 UTC]
[09:58:04] - Reading file work/wuresults_02.dat from core
[09:58:04] (Read 5519177 bytes from disk)
[09:58:04] Connecting to http://171.64.65.64:8080/
[09:58:39] Posted data.
[09:58:39] Initial: 0000; - Uploaded at ~145 kB/s
[09:58:41] - Averaged speed for that direction ~145 kB/s
[09:58:41] + Results successfully sent
[09:58:41] Thank you for your contribution to Folding@Home.
[09:58:41] + Number of Units Completed: 96


[09:58:48] - Warning: Could not delete all work unit files (2): Core returned invalid code
[09:58:48] Trying to send all finished work units
[09:58:48] Project: 2665 (Run 1, Clone 24, Gen 109)


[09:58:48] + Attempting to send results [May 1 09:58:48 UTC]
[09:58:48] - Reading file work/wuresults_01.dat from core
[09:58:49] (Read 21933813 bytes from disk)
[09:58:49] Connecting to http://171.64.65.64:8080/
[09:58:49] - Couldn't send HTTP request to server
[09:58:49] + Could not connect to Work Server (results)
[09:58:49] (171.64.65.64:8080)
[09:58:49] + Retrying using alternative port
[09:58:49] Connecting to http://171.64.65.64:80/
[09:58:49] - Couldn't send HTTP request to server
[09:58:49] + Could not connect to Work Server (results)
[09:58:49] (171.64.65.64:80)
[09:58:49] - Error: Could not transmit unit 01 (completed April 30) to work server.
[09:58:49] - 9 failed uploads of this unit.



So it still can't upload the first WU, completed April 30, because it can't send an HTTP request to 171.64.65.64:8080 ... but it worked just fine uploading the second unit it finished??!?

Not that I'm complaining - it's the first WU uploaded from this machine in a couple of weeks. Just that it doesn't make sense, and it's non-reproducible. I guess I'll just see how things go with the next few from this machine, but anyone with any thoughts ... I'll listen.
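One way to see whether anything besides the unit number differs between attempts is to pull the upload size and the outcome out of the client log. Below is a minimal Python sketch, assuming the log lines look like the ones quoted in this thread; `summarize_uploads` and `SAMPLE` are illustrative names, not part of the F@H client.

```python
import re

# Patterns match the client log lines quoted in this thread.
READ_RE = re.compile(r"\(Read (\d+) bytes from disk\)")
SENT_RE = re.compile(r"\+ Results successfully sent")
FAIL_RE = re.compile(r"Could not transmit unit (\d+)")


def summarize_uploads(log_text):
    """Return a list of (size_in_bytes, outcome) pairs, one per upload attempt.

    outcome is "sent", "failed", or "unknown" if no result line follows.
    """
    attempts = []
    current_size = None
    for line in log_text.splitlines():
        read = READ_RE.search(line)
        if read:
            # A new attempt starts; flush any attempt with no recorded result.
            if current_size is not None:
                attempts.append((current_size, "unknown"))
            current_size = int(read.group(1))
        elif current_size is not None and SENT_RE.search(line):
            attempts.append((current_size, "sent"))
            current_size = None
        elif current_size is not None and FAIL_RE.search(line):
            attempts.append((current_size, "failed"))
            current_size = None
    if current_size is not None:
        attempts.append((current_size, "unknown"))
    return attempts


# Four lines copied from the log above (hypothetical variable name).
SAMPLE = """\
[09:58:04] (Read 5519177 bytes from disk)
[09:58:41] + Results successfully sent
[09:58:49] (Read 21933813 bytes from disk)
[09:58:49] - Error: Could not transmit unit 01 (completed April 30) to work server.
"""

for size, outcome in summarize_uploads(SAMPLE):
    print(f"{size / 1e6:6.1f} MB  {outcome}")
```

In the logs above, the upload that went through was about 5.5 MB and the ones that failed were about 22 MB. Whether size actually matters here is not established; the sketch only makes the comparison easy to repeat on a full FAHlog.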
Bookrat
Gerbil
 
Posts: 65
Joined: Thu Mar 09, 2006 2:43 pm

Re: upload problems from one machine

Posted on Fri May 01, 2009 5:06 pm

I would try creating a send shortcut using the "-send all" and "-verbosity 9" flags (without the quotes). This will allow you to repeatedly attempt to send the finished WU and possibly give us extra information on why the upload is failing. Be sure to stop the client before using the send shortcut.
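If a shortcut is awkward to set up, the same one-off send attempt could be scripted. This is only a sketch in Python, assuming the console client accepts the flags as `-send all -verbosity 9`; `build_send_command` and `run_send` are hypothetical helper names, and the client path and folder are whatever your install uses.

```python
import subprocess


def build_send_command(client_path):
    """Build the argument list for a one-off 'send results' run."""
    return [client_path, "-send", "all", "-verbosity", "9"]


def run_send(client_path, workdir):
    """Run the client once in send-only mode and return its console output.

    Stop the normally running client first, as advised above.
    """
    result = subprocess.run(
        build_send_command(client_path),
        cwd=workdir,           # folder holding queue.dat and the work folder
        capture_output=True,   # collect the console output for posting
        text=True,
    )
    return result.stdout + result.stderr
```

Calling `run_send` with the path to the client executable and its folder returns the console text, which can then be pasted into a reply.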
.* * M-51 * *. .The Whirlpool Galaxy.
farmpuma
Minister of Gerbil Affairs
Silver subscriber
 
 
Posts: 2306
Joined: Sun Mar 21, 2004 11:33 pm
Location: Soybean field, IN, USA, Earth .. just a bit south of John .. err .... Fart Wayne, Indiana

Re: upload problems from one machine

Posted on Tue May 05, 2009 10:01 am

There isn't much to show beyond what I already have, but here are the results. I have tried this off and on since yesterday afternoon, always with identical results... the only thing that's changing is that the "Number of failed uploads" counter is going up. :)


C:\FOLDING>fah.exe -send all -verbosity 9

Note: Please read the license agreement (fah.exe -license). Further
use of this software requires that you have read and accepted this agreement.

2 cores detected
If you see this twice, MPI is working
If you see this twice, MPI is working


--- Opening Log file [May 5 15:52:35 UTC]


# Windows SMP Console Edition #################################################
###############################################################################

Folding@Home Client Version 6.23 Beta R1

http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: C:\FOLDING
Executable: fah.exe
Arguments: -send all -verbosity 9 -forceasm -smp -verbosity 9

Warning:
By using the -forceasm flag, you are overriding
safeguards in the program. If you did not intend to
do this, please restart the program without -forceasm.
If work units are not completing fully (and particularly
if your machine is overclocked), then please discontinue
use of the flag.

[15:52:35] - Ask before connecting: No
[15:52:35] - Proxy: 3.20.128.5:88
[15:52:35] - User name: GE_Pharmacy (Team 2630)
[15:52:35] - User ID: 368F85C420D622CE
[15:52:35] - Machine ID: 1
[15:52:35]
[15:52:35] Loaded queue successfully.
[15:52:35] Attempting to return result(s) to server...
[15:52:35] Trying to send all finished work units
[15:52:35] Project: 2665 (Run 0, Clone 707, Gen 110)


[15:52:35] + Attempting to send results [May 5 15:52:35 UTC]
[15:52:35] - Reading file work/wuresults_03.dat from core
[15:52:35] (Read 22081935 bytes from disk)
[15:52:35] Connecting to http://171.64.65.64:8080/
[15:52:35] - Couldn't send HTTP request to server
[15:52:35] + Could not connect to Work Server (results)
[15:52:35] (171.64.65.64:8080)
[15:52:35] + Retrying using alternative port
[15:52:35] Connecting to http://171.64.65.64:80/
[15:52:35] - Couldn't send HTTP request to server
[15:52:35] + Could not connect to Work Server (results)
[15:52:35] (171.64.65.64:80)
[15:52:35] - Error: Could not transmit unit 03 (completed May 3) to work server.
[15:52:35] - 39 failed uploads of this unit.

[15:55:56] + Attempting to send results [May 5 15:55:56 UTC]
[15:55:56] - Reading file work/wuresults_03.dat from core
[15:55:56] (Read 22081935 bytes from disk)
[15:55:56] Connecting to http://171.67.108.25:8080/
[15:55:56] - Couldn't send HTTP request to server
[15:55:56] + Could not connect to Work Server (results)
[15:55:56] (171.67.108.25:8080)
[15:55:56] + Retrying using alternative port
[15:55:56] Connecting to http://171.67.108.25:80/
[15:55:56] - Couldn't send HTTP request to server
[15:55:56] + Could not connect to Work Server (results)
[15:55:56] (171.67.108.25:80)
[15:55:56] Could not transmit unit 03 to Collection server; keeping in queue.
[15:55:56] + Sent 0 of 1 completed units to the server
[15:55:56] - Failed to send all units to server
[15:55:56] ***** Got a SIGTERM signal (2)
[15:55:56] Killing all core threads
[15:55:56] Killing 2 cores
[15:55:56] Killing core 0
[15:55:56] Killing core 1

Folding@Home Client Shutdown.


Opening http://171.64.65.64 (on either port 80 or 8080) gives me the OK.
Opening http://171.67.108.25 (on either port 80 or 8080) gives me nothing.
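
The same reachability check can be repeated outside a browser. Here is a rough Python sketch that tries a plain TCP connection to each address and port from the log above; `can_connect`, `check_servers`, and `TARGETS` are illustrative names. Note the assumption: this connects directly, so it does not go through the proxy the client reports (3.20.128.5:88), and its result may differ from what the client sees.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a direct TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_servers(targets, timeout=5.0):
    """Map each (host, port) pair to True/False reachability."""
    return {(host, port): can_connect(host, port, timeout) for host, port in targets}


# Work server and collection server addresses taken from the log above.
TARGETS = [
    ("171.64.65.64", 8080),
    ("171.64.65.64", 80),
    ("171.67.108.25", 8080),
    ("171.67.108.25", 80),
]
```

Running `check_servers(TARGETS)` on both the failing and the working machine would show whether they differ at the TCP level. A successful connect only proves the port is open; it says nothing about whether a large POST gets through.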

I have checked the firewall logs, and it's showing that all traffic from FAH is being let through in both directions. It is a company firewall that is configured identically on all machines so I can't change it, but no other machines are having this problem.

I'm down to the "Have you tried uninstalling and reinstalling?" troubleshooting steps... nothing else seems to be doing any good (except for that one brief ray of hope after deleting the queue... but that was the SECOND unit to try after deleting the queue, so that doesn't make sense either.)

Thanks.
Bookrat
Gerbil
 
Posts: 65
Joined: Thu Mar 09, 2006 2:43 pm

