LIGO Support Ticket 19523
Ticket Information
Number: admin 19523
User: carsten.aulbert@aei.mpg.de
Email:
Status: new
Assigned To: tannenba
From: Carsten Aulbert <carsten.aulbert__AT__aei.mpg.de>
To: "condor-admin response tracking system" <condor-admin__AT__cs.wisc.edu>
Subject: LIGO condor_quill won't reconnect to dbmsd after network problems
Date: Fri, 24 Jul 2009 16:02:23 +0200
X-Seen-BY: mailfromd 4.1 silica.cs.wisc.edu
Hi,
(promise this will be my last email to -admin for this week, thanks a lot for
all the hard work you put into this)
We had a network problem (ARP-storm) yesterday which "severed" the links
between our head nodes. However, condor_quill did not realize this and kept
putting these lines into the logs:
7/24 06:35:07 ******** Start of Polling Job Queue Log ********
7/24 06:35:07 [SQL EXECUTION ERROR] no connection to the server
7/24 06:35:07 [ERRONEOUS SQL: SELECT last_file_mtime, last_file_size,
last_next_cmd_offset, last_cmd_offset, last_cmd_type, last_cmd_key,
last_cmd_mytype, last_cmd_targettype, last_cmd_name, last_cmd_value from
JobQueuePollingInfo where scheddname = 'atlas1.atlas.aei.uni-hannover.de']
7/24 06:35:07 Reading JobQueuePollInfo --- ERROR [SQL] SELECT last_file_mtime,
last_file_size, last_next_cmd_offset, last_cmd_offset, last_cmd_type,
last_cmd_key, last_cmd_mytype, last_cmd_targettype, last_cmd_name,
last_cmd_value from JobQueuePollingInfo where scheddname =
'atlas1.atlas.aei.uni-hannover.de'
7/24 06:35:07 [QUILL++] Reading JobQueuePollInfo --- ERROR
7/24 06:35:07 >>>>>>>> Fail: Polling Job Queue Log <<<<<<<<
7/24 06:35:07 ******** Start of Polling Event Log ********
7/24 06:35:07 ********* End of Polling Event Log *********
7/24 06:35:07 ******** Start of Polling XML Log ********
7/24 06:35:07 ********* End of Polling XML Log *********
7/24 06:35:07 ++++++++ Sending Quill ad to collector ++++++++
7/24 06:35:07 ++++++++ Sent Quill ad to collector ++++++++
This repeats every 10 seconds until one sends SIGHUP to the daemon and lets it
be restarted by condor_master. Thus I think the persistent connection to our
postgresql server on atlas4 is gone and the daemon does not realize this. Of
course, I would like this recovery to happen automatically :)
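A minimal sketch of the automatic recovery asked for above, not condor_quill's
actual code: when a query fails because the connection is gone, drop the stale
handle and open a fresh one before retrying. The names ConnectionLost,
ReconnectingPoller, connect and max_attempts are hypothetical and only
illustrate the pattern.

```python
class ConnectionLost(Exception):
    """Hypothetical error raised when the database server is unreachable."""


class ReconnectingPoller:
    """Runs queries and reopens the connection after a failure.

    `connect` is an assumed factory returning an object with an
    `execute(sql)` method; it stands in for the real database layer.
    """

    def __init__(self, connect, max_attempts=2):
        self._connect = connect
        self._max_attempts = max_attempts
        self._conn = None

    def poll(self, sql):
        last_error = None
        for _ in range(self._max_attempts):
            try:
                if self._conn is None:
                    # No usable handle: open a new connection.
                    self._conn = self._connect()
                return self._conn.execute(sql)
            except ConnectionLost as exc:
                # Discard the stale handle so the next attempt reconnects.
                last_error = exc
                self._conn = None
        raise last_error
```

With this shape, a polling cycle that hits "no connection to the server" would
reconnect on the next attempt instead of failing until the daemon is restarted.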
If you feel this is a bug, please put it into your queue of things to fix. If
you say that would be a nice feature, I would be very grateful if you would
add this sooner or later. If you think this is a problem with our network and
will never happen when we take care of it, I'll be still and offer a drink to
the one replying to this ticket first at the next Condor meeting ;)
Cheers and a nice weekend
Carsten
===========================================================================
Date of creation: Fri Jul 24 9:02:34 2009 (1248444157)
Subject: Actions
Assigned to gthain by cat
===========================================================================
Date of actions: Fri Jul 24 10:53:07 2009 (1248450787)
Subject: Actions
Assigned to tannenba by tannenba
===========================================================================
Date of actions: Fri Aug 14 15:20:41 2009 (1250281241)