LIGO Support Ticket 19523

Ticket Information
  Number:      admin 19523
  User:        carsten.aulbert@aei.mpg.de
  Email:       
  Status:      new
  Assigned To: tannenba
From: Carsten Aulbert <carsten.aulbert__AT__aei.mpg.de>
To: "condor-admin response tracking system" <condor-admin__AT__cs.wisc.edu>
Subject: LIGO condor_quill won't reconnect to dbmsd after network problems
Date: Fri, 24 Jul 2009 16:02:23 +0200
X-Seen-BY: mailfromd 4.1 silica.cs.wisc.edu

Hi,

(promise this will be my last email to -admin for this week, thanks a lot for 
all the hard work you put into this)

We had a network problem (ARP storm) yesterday which "severed" the links 
between our head nodes. However, condor_quill did not realize this and kept 
putting these lines into the logs:
7/24 06:35:07 ******** Start of Polling Job Queue Log ********
7/24 06:35:07 [SQL EXECUTION ERROR] no connection to the server

7/24 06:35:07 [ERRONEOUS SQL: SELECT last_file_mtime, last_file_size, 
last_next_cmd_offset, last_cmd_offset, last_cmd_type, last_cmd_key, 
last_cmd_mytype, last_cmd_targettype, last_cmd_name, last_cmd_value from 
JobQueuePollingInfo where scheddname = 'atlas1.atlas.aei.uni-hannover.de']
7/24 06:35:07 Reading JobQueuePollInfo --- ERROR [SQL] SELECT last_file_mtime, 
last_file_size, last_next_cmd_offset, last_cmd_offset, last_cmd_type, 
last_cmd_key, last_cmd_mytype, last_cmd_targettype, last_cmd_name, 
last_cmd_value from JobQueuePollingInfo where scheddname = 
'atlas1.atlas.aei.uni-hannover.de'
7/24 06:35:07 [QUILL++] Reading JobQueuePollInfo --- ERROR
7/24 06:35:07 >>>>>>>> Fail: Polling Job Queue Log <<<<<<<<
7/24 06:35:07 ******** Start of Polling Event Log ********
7/24 06:35:07 ********* End of Polling Event Log *********
7/24 06:35:07 ******** Start of Polling XML Log ********
7/24 06:35:07 ********* End of Polling XML Log *********
7/24 06:35:07 ++++++++ Sending Quill ad to collector ++++++++
7/24 06:35:07 ++++++++ Sent Quill ad to collector ++++++++

This repeats every 10 seconds until one sends SIGHUP to the daemon and lets it 
be restarted by condor_master. Thus I think the persistent connection to our 
postgresql server on atlas4 is gone and the daemon does not realize this. Of 
course I would like this to happen automatically :)
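
To make the requested behaviour concrete, here is a minimal sketch of 
"drop the stale handle and reconnect on failure". It is not condor_quill's 
actual code: ReconnectingClient, ConnectionLost and the connect factory are 
hypothetical names used only for illustration, and a real daemon would map 
its database driver's own "no connection to the server" error onto the 
exception used here.

```python
import time


class ConnectionLost(Exception):
    """Stand-in for a driver error such as 'no connection to the server'."""


class ReconnectingClient:
    """Wraps a persistent connection and reopens it when a query fails.

    `connect` is a hypothetical factory returning an object with an
    execute(sql) method; it is an assumption, not a real Condor or
    PostgreSQL API.
    """

    def __init__(self, connect, max_attempts=3, delay=0.0):
        self._connect = connect
        self._max_attempts = max_attempts
        self._delay = delay
        self._conn = None

    def execute(self, sql):
        last_error = None
        for _ in range(self._max_attempts):
            try:
                if self._conn is None:
                    # No usable handle: open a fresh connection.
                    self._conn = self._connect()
                return self._conn.execute(sql)
            except ConnectionLost as err:
                last_error = err
                # Drop the stale handle so the next attempt reconnects
                # instead of reusing the dead connection.
                self._conn = None
                time.sleep(self._delay)
        # Every attempt failed: surface the last error to the caller.
        raise last_error
```

With something along these lines, a polling cycle that hits a dead 
connection would retry on a fresh one instead of logging the same error 
every 10 seconds until the daemon is restarted by hand.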

If you feel this is a bug, please put it into your queue of things to fix. If 
you say that would be a nice feature, I would be very grateful if you would 
add this sooner or later. If you think this is a problem with our network and 
will never happen when we take care of it, I'll be still and offer a drink to 
the one replying to this ticket first at the next Condor meeting ;)

Cheers and a nice weekend

Carsten

===========================================================================
Date of creation: Fri Jul 24  9:02:34 2009 (1248444157)
Subject: Actions

Assigned to gthain by cat
===========================================================================
Date of actions: Fri Jul 24 10:53:07 2009 (1248450787)
Subject: Actions

Assigned to tannenba by tannenba
===========================================================================
Date of actions: Fri Aug 14 15:20:41 2009 (1250281241)