LIGO Support Ticket 15278

Ticket Information
  Number:      admin 15278
  User:        anderson@ligo.caltech.edu
  Email:       espinoza_e__AT__ligo.caltech.edu,espinoza__AT__ligo.caltech.edu
  Status:      open
  Assigned To: psilord
Date: Fri, 6 Apr 2007 13:24:53 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
CC: Erik Espinoza <espinoza_e__AT__ligo.caltech.edu>
Subject: LIGO condor_starter.std core dumping

The LIGO CIT condor pool running,
# /ldcg/condor/bin/condor_version
$CondorVersion: 6.8.4 Feb  1 2007 $
$CondorPlatform: X86_64-LINUX_RHEL3 $

recently started generating SIGABRT core files for condor_starter.std.
We have been running this version for several weeks since it first came
out and only in the last few days have we had 371 instances of this, so
it is probably triggered by user error. However, I assume the assert
failure is an indication that this condor daemon found itself in an
"impossible" situation that might be worth investigating.

(gdb) where
#0  0x0000003dce22f200 in raise () from /lib64/libc.so.6
#1  0x0000003dce230730 in abort () from /lib64/libc.so.6
#2  0x0000003dce2282b6 in __assert_fail () from /lib64/libc.so.6
#3  0x00000000004a66af in REMOTE_CONDOR_reallyexit ()
#4  0x0000000000488b84 in send_final_status ()
#5  0x0000000000488aa4 in dispose_all ()
#6  0x00000000004c07d4 in StateMachine::execute ()
#7  0x00000000004883bd in main ()

Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date of creation: Fri Apr  6 15:25:12 2007 (1175891114)
Date: Fri, 6 Apr 2007 21:31:16 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #15278] LIGO condor_starter.std core dumping

Please see,
http://www.ligo.caltech.edu/~anderson/condor.15278
for the Starter log, core dump, and gdb stack trace output.

Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Fri Apr  6 23:31:31 2007 (1175920292)
Subject: Actions

Assigned to adesmet by adesmet
===========================================================================
Date of actions: Mon Apr 16 10:45:25 2007 (1176738325)
Subject: Actions

Assigned to psilord by adesmet
===========================================================================
Date of actions: Mon Apr 23 16:07:08 2007 (1177362428)
Date: Wed, 25 Apr 2007 10:46:04 -0500
From: Peter Keller <psilord__AT__cs.wisc.edu>
To: adesmet <condor-admin__AT__cs.wisc.edu>
Subject: Re: [condor-admin #15278] LIGO condor_starter.std core dumping

Hello,

On Mon, Apr 23, 2007 at 04:07:08PM -0500, adesmet wrote:
> The LIGO CIT condor pool running,
> # /ldcg/condor/bin/condor_version
> $CondorVersion: 6.8.4 Feb  1 2007 $
> $CondorPlatform: X86_64-LINUX_RHEL3 $
> 
> recently started generating SIGABRT core files for condor_starter.std.
> We have been running this version for several weeks since it first came
> out and only in the last few days have we had 371 instances of this, so
> it is probably triggered by user error. However, I assume the assert
> failure is an indication that this condor daemon found itself in an
> "impossible" situation that might be worth investigating.
> 
> (gdb) where
> #0  0x0000003dce22f200 in raise () from /lib64/libc.so.6
> #1  0x0000003dce230730 in abort () from /lib64/libc.so.6
> #2  0x0000003dce2282b6 in __assert_fail () from /lib64/libc.so.6
> #3  0x00000000004a66af in REMOTE_CONDOR_reallyexit ()
> #4  0x0000000000488b84 in send_final_status ()
> #5  0x0000000000488aa4 in dispose_all ()
> #6  0x00000000004c07d4 in StateMachine::execute ()
> #7  0x00000000004883bd in main ()

This ticket has been reassigned to me (Pete Keller).

Can you also provide an example of what the shadow log says when this happens?
Just put it at the same URL as the other files. I'll start inspecting the code
to see what I can find.

Thanks.

-pete

===========================================================================
Date mail was appended: Wed Apr 25 10:46:06 2007 (1177515966)
Date: Mon, 7 May 2007 14:34:15 -0700
From: "Erik A. Espinoza" <espinoza__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #15278] LIGO condor_starter.std core dumping
CC: anderson__AT__ligo.caltech.edu, espinoza_e__AT__ligo.caltech.edu

Greetings Pete,

My apologies. It appears as though I had lost this e-mail in the
shuffle. Here is the log you requested:

http://www.ligo.caltech.edu/~eespinoz/debug/04-23-2007/ShadowLog

It is approximately 90 MB.

Thanks,
Erik

===========================================================================
Date mail was appended: Mon May  7 16:34:30 2007 (1178573670)
Date: Thu, 10 May 2007 14:06:46 -0500
From: Peter Keller <psilord__AT__cs.wisc.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
Subject: Re: [condor-admin #15278] LIGO condor_starter.std core dumping

Hello,

On Mon, May 07, 2007 at 04:34:30PM -0500, condor-admin response tracking system wrote:
> Greetings Pete,
> 
> My apologies. It appears as though I had lost this e-mail in the
> shuffle. Here is the log you requested:
> 
> http://www.ligo.caltech.edu/~eespinoz/debug/04-23-2007/ShadowLog
> 
> It is approx 90mb.

Ok, I'll take a look. Thanks!

-pete

===========================================================================
Date mail was appended: Thu May 10 17:34:46 2007 (1178836486)
Date: Thu, 17 May 2007 16:52:15 -0500
From: Peter Keller <psilord__AT__cs.wisc.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
Subject: Re: [condor-admin #15278] LIGO condor_starter.std core dumping

Hello,

Ok, I analyzed the logs, but unfortunately I can't confirm my
theories until I get a synchronized set of logs.

So here's what I'd like:

1. Turn on D_SYSCALLS for the shadows.

2. Find a starter log that shows something similar to:

	http://www.ligo.caltech.edu/~anderson/condor.15278/StarterLog.vm3

with the job getting a signal 6 and the starter aborting.

3. Give me the synchronized starter & shadow logs for the failed job.
I need the complete run of the shadow and the starter in the respective logs.

4. Give me a backtrace from the starter and from the user job itself.

5. After you prepare the above, turn off D_SYSCALLS since it will
slow the shadows down and cause load on the system.
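
For step 3, here is a minimal sketch of one way to cut a matching time
window out of both logs and interleave the lines. It assumes every log
entry begins with an "M/D HH:MM:SS" stamp, which is an assumption about
the log format and is not confirmed anywhere in this ticket. The function
names and the default year are hypothetical, chosen only for illustration.

```python
import re
from datetime import datetime

# Assumed line prefix: "M/D HH:MM:SS " (for example "4/23 10:15:32 ...").
# This is an assumption about the log layout, not a documented format.
STAMP = re.compile(r"^(\d{1,2})/(\d{1,2}) (\d{2}):(\d{2}):(\d{2}) ")


def parse_stamp(line, year=2007):
    """Return the line's leading timestamp, or None if it has none."""
    m = STAMP.match(line)
    if m is None:
        return None
    month, day, hour, minute, second = map(int, m.groups())
    try:
        return datetime(year, month, day, hour, minute, second)
    except ValueError:
        # Digits matched the pattern but do not form a valid date.
        return None


def window(lines, start, end, tag):
    """Yield (timestamp, tag, line) for lines inside [start, end].

    Lines without a stamp (continuation lines) inherit the stamp of
    the most recent stamped line before them.
    """
    current = None
    for line in lines:
        stamp = parse_stamp(line)
        if stamp is not None:
            current = stamp
        if current is not None and start <= current <= end:
            yield current, tag, line.rstrip("\n")


def merge_logs(starter_lines, shadow_lines, start, end):
    """Interleave both logs by timestamp, tagging each line's source."""
    rows = list(window(starter_lines, start, end, "starter"))
    rows += list(window(shadow_lines, start, end, "shadow"))
    # sorted-in-place with a stable sort: each log keeps its own order,
    # and starter lines come first when two stamps are equal.
    rows.sort(key=lambda row: row[0])
    return [f"[{tag}] {line}" for _, tag, line in rows]
```

To use it, open the starter and shadow log files, pass the two file
objects to merge_logs() along with datetime values bracketing the failed
job's run, and write the returned lines to a single file. Because the
stamps carry no year, the year has to be supplied by the caller.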

I realize that this is some work and analysis on your part, but there
are too many questions I can't answer with only the pieces of the logs
I have now.

Thank you.

Condor Admin

===========================================================================
Date mail was appended: Thu May 17 16:52:16 2007 (1179438737)