LIGO Support Ticket 1677
Ticket Information
Number: support 1677
User: anderson@ligo.caltech.edu
Email: espinoza_e__AT__ligo.caltech.edu, espinoza__AT__ligo.caltech.edu
Status: resolved
Assigned To: gthain
Date: Sat, 9 Sep 2006 15:00:56 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu, tannenba__AT__cs.wisc.edu
CC: Erik Espinoza <espinoza_e__AT__ligo.caltech.edu>
Subject: LIGO condor-6.8.1 schedd segfault
A pre-release of condor-6.8.1 recently provided by Alan for running on the
LIGO CIT cluster just had the schedd segfault with the following message:
"/ldcg/condor/sbin/condor_schedd" on "ldas-grid.ligo.caltech.edu" died due to
signal 11.
Condor will automatically restart this process in 10 seconds.
*** Last 20 line(s) of file SchedLog:
9/9 13:20:26 (pid:25507) Resource vm2__AT__node56.ldas-cit.ligo.caltech.edu has been
unused for 1525 seconds, limit is 600, releasing
9/9 13:20:27 (pid:25507) Sent ad to central manager for gstef@ligo
9/9 13:20:27 (pid:25507) Sent ad to 1 collectors for gstef@ligo
9/9 13:20:27 (pid:25507) Sent ad to central manager for cokelaer@ligo
9/9 13:20:27 (pid:25507) Sent ad to 1 collectors for cokelaer@ligo
9/9 13:20:27 (pid:25507) Sent ad to central manager for arogan@ligo
9/9 13:20:27 (pid:25507) Sent ad to 1 collectors for arogan@ligo
9/9 13:20:27 (pid:25507) Sent ad to central manager for dbrown@ligo
9/9 13:20:27 (pid:25507) Sent ad to 1 collectors for dbrown@ligo
9/9 13:20:27 (pid:25507) Sent ad to central manager for lindy@ligo
9/9 13:20:27 (pid:25507) Sent ad to 1 collectors for lindy@ligo
9/9 13:20:27 (pid:25507) Sent ad to central manager for cadonati@ligo
9/9 13:20:27 (pid:25507) Sent ad to 1 collectors for cadonati@ligo
9/9 13:20:27 (pid:25507) Sent ad to central manager for dietz@ligo
9/9 13:20:27 (pid:25507) Sent ad to 1 collectors for dietz@ligo
9/9 13:20:27 (pid:25507) Starting add_shadow_birthdate(6950245.0)
9/9 13:20:27 (pid:25507) Started shadow for job 6950245.0 on
"<10.14.1.162:44569>", (shadow pid = 20430)
9/9 13:20:28 (pid:25507) DaemonCore: Command received via TCP from host
<10.14.0.12:39738>
9/9 13:20:28 (pid:25507) DaemonCore: received command 71003 (GIVE_MATCHES),
calling handler (DedicatedScheduler::giveMatches)
9/9 13:20:29 (pid:25507) Inserting new attribute Scheduler into non-active
cluster cid=6951018 acid=-1
*** End of file SchedLog
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date of creation: Sat Sep 9 17:05:47 2006 (1157839553)
Subject: Actions
Assigned to adesmet by adesmet
===========================================================================
Date of actions: Mon Sep 11 11:35:15 2006 (1157992516)
Subject: Actions
Assigned to gthain by gthain
===========================================================================
Date of actions: Mon Sep 11 12:42:33 2006 (1157996553)
Date: Mon, 11 Sep 2006 12:43:26 -0500
From: Greg Thain <gthain__AT__cs.wisc.edu>
To: condor-support__AT__cs.wisc.edu
Subject: Re: [condor-support #1677] LIGO condor-6.8.1 schedd segfault
> A pre-release of condor-6.8.1 recently provided by Alan for running on the
> LIGO CIT cluster just had the schedd segfault with the following
Stuart:
Is there a core file for this?
-Greg
===========================================================================
Date mail was appended: Mon Sep 11 12:43:32 2006 (1157996613)
Date: Mon, 11 Sep 2006 13:02:12 -0500
From: Greg Thain <gthain__AT__cs.wisc.edu>
To: condor-support__AT__cs.wisc.edu
Subject: Re: [condor-support #1677] LIGO condor-6.8.1 schedd segfault
Also, could you turn on D_FULLDEBUG on the schedd?
Thanks!
-Greg
===========================================================================
Date mail was appended: Mon Sep 11 13:02:20 2006 (1157997741)
Date: Mon, 11 Sep 2006 11:04:55 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: espinoza_e__AT__ligo.caltech.edu
Subject: Re: [condor-support #1677] LIGO condor-6.8.1 schedd segfault
On Mon, Sep 11, 2006 at 12:43:32PM -0600, condor-support response tracking system wrote:
>
> > A pre-release of condor-6.8.1 recently provided by Alan for running on the
> > LIGO CIT cluster just had the schedd segfault with the following
>
> Stuart:
>
> Is there a core file for this?
>
No. Unfortunately, condor_preen's 24hr sweep just happened to be a few minutes
after the crash. We have modified condor_config to exclude core from
condor_preen in the future.
However, there is a condor_schedd core file from the 30 August crash,
which also occurred on a post-6.8.0 version provided by Alan, at
http://www.ligo.caltech.edu/~anderson/condor.1652
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date mail was appended: Mon Sep 11 13:05:16 2006 (1157997916)
Date: Mon, 11 Sep 2006 11:10:09 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: espinoza_e__AT__ligo.caltech.edu
Subject: Re: [condor-support #1677] LIGO condor-6.8.1 schedd segfault
On Mon, Sep 11, 2006 at 01:02:20PM -0600, condor-support response tracking system wrote:
> Also, could you turn on D_FULLDEBUG on the schedd?
Yes. Note, we have been running with "D_COMMAND D_PID" for the last several
months.
Erik,
Please take care of this.
Thanks.
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date mail was appended: Mon Sep 11 13:10:31 2006 (1157998233)
Date: Mon, 11 Sep 2006 11:17:45 -0700
From: "Erik A. Espinoza" <espinoza__AT__ligo.caltech.edu>
To: Stuart Anderson <anderson__AT__ligo.caltech.edu>
CC: condor-support response tracking system <condor-support__AT__cs.wisc.edu>,
espinoza_e__AT__ligo.caltech.edu
Subject: Re: [condor-support #1677] LIGO condor-6.8.1 schedd segfault
Done.
Erik
Stuart Anderson wrote:
> On Mon, Sep 11, 2006 at 01:02:20PM -0600, condor-support response tracking system wrote:
>> Also, could you turn on D_FULLDEBUG on the schedd?
>
> Yes. Note, we have been running with "D_COMMAND D_PID" for the last several
> months.
>
> Erik,
> Please take care of this.
>
> Thanks.
>
--
Erik A. Espinoza
Systems Administrator
LIGO/Caltech - MS 18-34
Pasadena, CA 91125
Ph: 626-395-8517
===========================================================================
Date mail was appended: Mon Sep 11 13:18:20 2006 (1157998701)
Date: Mon, 11 Sep 2006 17:42:35 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>,
tannenba__AT__cs.wisc.edu
CC: espinoza_e__AT__ligo.caltech.edu
Subject: Re: [condor-support #1677] LIGO condor-6.8.1 schedd segfault
On Mon, Sep 11, 2006 at 01:02:20PM -0600, condor-support response tracking system wrote:
> Also, could you turn on D_FULLDEBUG on the schedd?
>
Greg,
Now that schedd is generating ~1GByte per hour of log files and we
currently have,
MAX_SCHEDD_LOG = 2000000000
this means we have at most 4 hours of lookback between SchedLog and
SchedLog.old. I am concerned that our next crash will happen at 2AM
and we will not have any logfiles left by the time someone notices.
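The lookback estimate above can be checked with a short calculation. This is
only a sketch: the function name lookback_hours is hypothetical, "~1GByte per
hour" is read as 10**9 bytes per hour, and it assumes exactly two files are
retained (SchedLog and SchedLog.old), each allowed to grow to MAX_SCHEDD_LOG.

```python
def lookback_hours(max_log_bytes: int, bytes_per_hour: float, files_kept: int = 2) -> float:
    """Upper bound on hours of history held across the active log and its rotated copy."""
    return files_kept * max_log_bytes / bytes_per_hour


MAX_SCHEDD_LOG = 2_000_000_000  # value quoted in the message above
LOG_RATE = 1_000_000_000        # ~1 GByte per hour, taken as 10**9 bytes (assumption)

if __name__ == "__main__":
    # Upper bound: just after a rotation the active file is nearly empty,
    # so the real lookback can be closer to half of this figure.
    print(lookback_hours(MAX_SCHEDD_LOG, LOG_RATE))
```

With these numbers the bound is 4 hours, which matches the "at most 4 hours"
figure in the message.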
In 6.8.1, is it now safe to increase MAX_*_LOG beyond 2GByte,
i.e., are the daemons now compiled to support largefile I/O?
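For context on the 2 GByte figure, the sketch below assumes the ceiling comes
from signed 32-bit file offsets, which is the usual limit for programs built
without large-file support. Whether that applies to these daemons is exactly
the open question here; the helper name fits_in_32bit_offset is hypothetical.

```python
INT32_MAX = 2**31 - 1  # largest signed 32-bit file offset: 2147483647 bytes


def fits_in_32bit_offset(size_bytes: int) -> bool:
    """True if a file of this size is addressable with a signed 32-bit offset."""
    return 0 <= size_bytes <= INT32_MAX


if __name__ == "__main__":
    print(fits_in_32bit_offset(2_000_000_000))  # current MAX_SCHEDD_LOG setting
    print(fits_in_32bit_offset(4_000_000_000))  # a hypothetical doubled setting
```

Under that assumption the current setting of 2000000000 sits just below the
limit, while any doubling of it would exceed it.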
In addition, is there a knob for specifying the number of rotated log files
to keep, similar to MAX_HISTORY_ROTATIONS? If not, please consider that
a low-priority enhancement request.
Thanks.
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date mail was appended: Mon Sep 11 19:43:05 2006 (1158021785)
Date: Sat, 28 Oct 2006 14:49:10 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: espinoza_e__AT__ligo.caltech.edu
Subject: Re: [condor-support #1677] LIGO condor-6.8.1 schedd segfault
We have not seen this problem since upgrading to condor 6.8.2. Please consider
closing this ticket.
Thanks.
On Mon, Sep 11, 2006 at 01:02:20PM -0600, condor-support response tracking system wrote:
> Also, could you turn on D_FULLDEBUG on the schedd?
>
> Thanks!
>
> -Greg
>
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date mail was appended: Sat Oct 28 16:49:25 2006 (1162072166)
Date: Mon, 30 Oct 2006 08:25:42 -0600
From: Greg Thain <gthain__AT__cs.wisc.edu>
To: condor-support__AT__cs.wisc.edu
Subject: Re: [condor-support #1677] LIGO condor-6.8.1 schedd segfault
condor-support response tracking system wrote:
> We have not seen this problem since upgrading to condor 6.8.2. Please consider
> closing this ticket.
Will do. I'm continuing to keep an eye out for this, though.
-Greg
===========================================================================
Date mail was appended: Mon Oct 30 8:25:48 2006 (1162218349)
Subject: Actions
Ticket resolved by gthain
===========================================================================
Date of actions: Mon Oct 30 8:25:48 2006 (1162218349)