LIGO Support Ticket 14206

Ticket Information
  Number:      admin 14206
  User:        anderson@ligo.caltech.edu
  Email:       espinoza_e__AT__ligo.caltech.edu,jmeehean__AT__cs.wisc.edu,naughton__AT__cs.wisc.edu
  Status:      open
  Assigned To: miron
Date: Thu, 14 Sep 2006 15:50:51 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
CC: Erik Espinoza <espinoza_e__AT__ligo.caltech.edu>
Subject: LIGO 6.8.1pre condor_quill segfault

Is condor_quill meant to segfault when the underlying database crashes?
If so, please ignore this. Otherwise, here is the message from the 6.8.1
pre-release running on the LIGO CIT Condor pool triggered by the FC4
x86_64 Linux kernel Out-of-memory module killing postmaster.

Thanks.

"/ldcg/condor/sbin/condor_quill" on "ldas-grid.ligo.caltech.edu" died due to
signal 11.
Condor will automatically restart this process in 10 seconds.

*** Last 20 line(s) of file QuillLog:
9/14 06:59:42 ********* End of Probing Job Queue Log File *********
9/14 06:59:42 ++++++++ Sending schedd ad to collector ++++++++
9/14 06:59:42 ++++++++ Sent schedd ad to collector ++++++++
9/14 06:59:52 ******** Start of Probing Job Queue Log File ********
9/14 06:59:57 === Current Probing Information ===
9/14 06:59:57 fsize: 427599894          mtime: 1158242383
9/14 07:00:00 first log entry: 262 CreationTimestamp 1158181310
9/14 07:00:00 POLLING RESULT: ADDED
9/14 07:00:04 ********* End of Probing Job Queue Log File *********
9/14 07:00:04 ++++++++ Sending schedd ad to collector ++++++++
9/14 07:00:04 ++++++++ Sent schedd ad to collector ++++++++
9/14 07:00:14 ******** Start of Probing Job Queue Log File ********
9/14 07:00:14 === Current Probing Information ===
9/14 07:00:14 fsize: 427600467          mtime: 1158242409
9/14 07:00:14 first log entry: 262 CreationTimestamp 1158181310
9/14 07:00:14 POLLING RESULT: ADDED
9/14 07:22:41 [SQL EXECUTION ERROR2] server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
9/14 07:23:00 [SQL: DELETE FROM ProcAds_Num WHERE cid = 7187392 AND pid = 0 AND
attr = 'LastJobLeaseRenewal'; DELETE FROM ProcAds_Str WHERE cid = 7187392 AND
pid = 0 AND attr = 'LastJobLeaseRenewal'; INSERT INTO ProcAds_Num(cid, pid,
attr, val) VALUES (7187392, 0, 'LastJobLeaseRenewal', 1158242431);]
*** End of file QuillLog

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date of creation: Thu Sep 14 17:51:16 2006 (1158274278)
Subject: Actions

Assigned to miron by miron
===========================================================================
Date of actions: Fri Sep 15  7:21:42 2006 (1158322902)
Date: Fri, 15 Sep 2006 07:24:30 -0500
To: condor-admin__AT__cs.wisc.edu
From: Miron Livny <miron__AT__cs.wisc.edu>
Subject: Re: [condor-admin #14206] LIGO 6.8.1pre condor_quill segfault
CC: Joe Meehean <jmeehean__AT__cs.wisc.edu>, naughton__AT__cs.wisc.edu,
 Greg Thain <gthain__AT__cs.wisc.edu>

At 08:21 AM 9/15/2006, miron wrote:
>Is condor_quill meant to segfault when the underlying database crashes?

No, it is not! It should wait for the DB to come back ...
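
[Editorial sketch] The "wait for the DB to come back" behavior described here can be illustrated with a small retry loop. This is not condor_quill code: the function name run_with_reconnect, its parameters, and the use of ConnectionError as the "database is down" signal are all assumptions made for illustration.

```python
import time


def run_with_reconnect(connect, execute, retry_delay=10.0,
                       max_attempts=None, sleep=time.sleep):
    """Call execute(conn), reconnecting and retrying while the database is down.

    Hypothetical helper, not part of Condor.
    connect() must return a connection or raise ConnectionError.
    execute(conn) must raise ConnectionError if the connection is lost.
    max_attempts=None means retry forever.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            conn = connect()
            return execute(conn)
        except ConnectionError:
            # Give up only if the caller set a limit; otherwise keep waiting.
            if max_attempts is not None and attempts >= max_attempts:
                raise
            sleep(retry_delay)
```

The point of the sketch is that a lost connection is handled as a recoverable error and retried after a delay, rather than being allowed to crash the daemon.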

>If so, please ignore this. Otherwise, here is message from the 6.8.1
>pre-release running on the LIGO CIT Condor pool triggered by the FC4
>x86_64 Linux kernel Out-of-memory module killing postmaster.


We will look into this and get back to you.

As always, thank you for helping us debug Condor!

Miron






===========================================================================
Date mail was appended: Fri Sep 15  7:24:32 2006 (1158323072)
Date: Fri, 15 Sep 2006 11:36:37 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: espinoza_e__AT__ligo.caltech.edu, jmeehean__AT__cs.wisc.edu, naughton__AT__cs.wisc.edu
Subject: Re: [condor-admin #14206] LIGO 6.8.1pre condor_quill segfault

compared to schedd stability and 64-bit support.

I have posted the core file and a brief readme file that includes a second
occurrence (30 minutes later, before anyone noticed what was going on) at
http://www.ligo.caltech.edu/~anderson/condor.14206
My guess (not confirmed) is that this is easily reproducible on a test
system with: pkill -9 postmaster

P.S. A possibly related problem report, which may not be worth fixing and is
almost certainly moot in this instance, is that the second core file overwrote
the first one. In more interesting failure scenarios it is possible that
the first core dump image is the more informative one. If there is no
portable way to get unique coredump images (e.g., based on the PID), perhaps
condor_master could rename them based on the daemon name and timestamp
in a timely fashion.
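
[Editorial sketch] The renaming idea in the P.S. could look roughly like the following. It is a minimal illustration, not how condor_master works: the function name preserve_core, the assumption that the dump is a file literally named "core" in a known directory, and the core.<daemon>.<UTC timestamp> naming scheme are all hypothetical.

```python
import time
from pathlib import Path


def preserve_core(log_dir, daemon_name, now=None):
    """Rename log_dir/core (if present) to core.<daemon>.<UTC timestamp>.

    Hypothetical helper, not part of Condor.
    now is a Unix timestamp; None means the current time.
    Returns the new path, or None if no core file was found.
    """
    core = Path(log_dir) / "core"
    if not core.is_file():
        return None
    stamp = time.strftime("%Y%m%dT%H%M%S", time.gmtime(now))
    target = core.with_name(f"core.{daemon_name}.{stamp}")
    # Avoid clobbering an earlier dump that was saved in the same second.
    suffix = 0
    while target.exists():
        suffix += 1
        target = core.with_name(f"core.{daemon_name}.{stamp}.{suffix}")
    core.rename(target)
    return target
```

Called once after each daemon death, for example preserve_core(log_dir, "condor_quill"), this would keep the first dump intact when a second crash follows shortly afterwards.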

P.P.S. Has anyone written a condor crash report tool yet? For example,
condor_humpty_dumpy [-start postgrestimestamp] [-end postgrestimestamp]
which aggregates into a single gzip tar file all the log files
(possibly restricted by -start and/or -end), including looking back
at the rolled log files and history files, along with any interesting
configuration files and core dump images. Plus whatever system configuration
you find useful, e.g., uname -a, condor_version, ...
Might it also be possible to add an option to retrieve user log files
that might be interesting for debugging?
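
[Editorial sketch] The aggregation step of such a crash report tool might be sketched as below. No such tool is implied to exist: the function name collect_crash_report, its parameters, and the archive layout ("logs/" for log files, caller-chosen names for captured command output) are assumptions for illustration. The start and end bounds are taken as Unix timestamps and compared against each file's modification time.

```python
import io
import tarfile
import time
from pathlib import Path


def collect_crash_report(log_dir, out_path, start=None, end=None,
                         extra_text=None):
    """Bundle files under log_dir with mtime in [start, end] into a .tar.gz.

    Hypothetical helper, not part of Condor.
    start and end are Unix timestamps; None means unbounded.
    extra_text maps archive member names to strings, e.g. captured
    output of `uname -a` or `condor_version`.
    Returns the list of archive member names written.
    """
    log_dir = Path(log_dir)
    members = []
    with tarfile.open(out_path, "w:gz") as tar:
        for path in sorted(log_dir.rglob("*")):
            if not path.is_file():
                continue
            mtime = path.stat().st_mtime
            if start is not None and mtime < start:
                continue
            if end is not None and mtime > end:
                continue
            name = (Path("logs") / path.relative_to(log_dir)).as_posix()
            tar.add(path, arcname=name)
            members.append(name)
        # Store captured command output as in-memory text members.
        for name, text in sorted((extra_text or {}).items()):
            data = text.encode("utf-8")
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            info.mtime = int(time.time())
            tar.addfile(info, io.BytesIO(data))
            members.append(name)
    return members
```

The output archive should be written outside log_dir, otherwise the directory walk may pick up the partially written archive itself. Core dump images and configuration files could be added the same way as the log files.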

Thanks.



-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Fri Sep 15 13:37:10 2006 (1158345430)
Subject: Actions

Eta changed from  to quill by epaulson
===========================================================================
Date of actions: Wed Oct 18 10:58:09 2006 (1161187089)