LIGO Support Ticket 1824

Ticket Information
  Number:      support 1824
  User:        espinoza@ligo.caltech.edu
  Email:       dbrown__AT__ligo.caltech.edu,anderson__AT__ligo.caltech.edu
  Status:      resolved
  Assigned To: gthain
Date: Tue, 23 Jan 2007 12:38:33 -0800
From: "Erik A. Espinoza" <espinoza__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: Duncan Brown <dbrown__AT__ligo.caltech.edu>, Stuart Anderson
 <anderson__AT__ligo.caltech.edu>, Alain Roy <roy__AT__cs.wisc.edu>
Subject: LIGO: Crash on DedicatedScheduler condor_schedd
X-Enigmail-Version: 0.94.1.0
Openpgp: url=http://pgp.mit.edu/
X-No-Archive: Yes
X-Archive: No

Greetings,

Alain, could you please associate this with ligo-tickets?

Duncan, was there anything being run at approx 11:41am on Tuesday Jan
23, 2007?

We had a condor_schedd crash on our DedicatedScheduler system. We
have two schedds, one for DedicatedScheduler and one for general use.

Our system is running FC4 x86_64 with the following condor_version output:
$CondorVersion: 6.8.3 Jan  4 2007 $
$CondorPlatform: X86_64-LINUX_RHEL3 $

The core is available here:
http://www.ligo.caltech.edu/~eespinoz/debug/01-23-2007/core.26377

The gdb output is available here:
http://www.ligo.caltech.edu/~eespinoz/debug/01-23-2007/gdb.out

Today's MasterLog and SchedLog (the master and schedd are all that is
running on our DedicatedScheduler machine) are available here:
http://www.ligo.caltech.edu/~eespinoz/debug/01-23-2007/MasterLog
http://www.ligo.caltech.edu/~eespinoz/debug/01-23-2007/SchedLog

Please let me know if there is any further information I can provide for
this ticket.

Thanks,
Erik

-------- Original Message --------
Subject: [Condor] Problem
Date: Tue, 23 Jan 2007 11:41:26 -0800
From: condor__AT__ldas-pcdev1.ligo.caltech.edu
To: ldas_admin_cit__AT__ligo.caltech.edu

This is an automated email from the Condor system
on machine "ldas-pcdev1.ligo.caltech.edu".  Do not reply.

"/ldcg/condor/sbin/condor_schedd" on "ldas-pcdev1.ligo.caltech.edu" died
due to signal 11.
Condor will automatically restart this process in 10 seconds.

*** Last 20 line(s) of file SchedLog:
1/23 11:41:16 (pid:26377) Got VACATE_SERVICE from <10.14.2.48:51710>
1/23 11:41:16 (pid:26377) DaemonCore: Command received via TCP from host <10.14.1.86:57342>
1/23 11:41:16 (pid:26377) DaemonCore: received command 443 (VACATE_SERVICE), calling handler (vacate_service)
1/23 11:41:16 (pid:26377) Got VACATE_SERVICE from <10.14.1.86:57342>
1/23 11:41:16 (pid:26377) DaemonCore: Command received via TCP from host <10.14.2.60:44306>
1/23 11:41:16 (pid:26377) DaemonCore: received command 443 (VACATE_SERVICE), calling handler (vacate_service)
1/23 11:41:16 (pid:26377) Got VACATE_SERVICE from <10.14.2.60:44306>
1/23 11:41:16 (pid:26377) DaemonCore: Command received via TCP from host <10.14.1.86:42642>
1/23 11:41:16 (pid:26377) DaemonCore: received command 443 (VACATE_SERVICE), calling handler (vacate_service)
1/23 11:41:16 (pid:26377) Got VACATE_SERVICE from <10.14.1.86:42642>
1/23 11:41:17 (pid:26377) DaemonCore: Command received via TCP from host <10.14.1.39:41009>
1/23 11:41:17 (pid:26377) DaemonCore: received command 443 (VACATE_SERVICE), calling handler (vacate_service)
1/23 11:41:17 (pid:26377) Got VACATE_SERVICE from <10.14.1.39:41009>
1/23 11:41:17 (pid:26377) DaemonCore: Command received via TCP from host <10.14.1.34:53120>
1/23 11:41:17 (pid:26377) DaemonCore: received command 443 (VACATE_SERVICE), calling handler (vacate_service)
1/23 11:41:17 (pid:26377) Got VACATE_SERVICE from <10.14.1.34:53120>
1/23 11:41:17 (pid:26377) DaemonCore: Command received via TCP from host <10.14.1.34:33271>
1/23 11:41:17 (pid:26377) DaemonCore: received command 443 (VACATE_SERVICE), calling handler (vacate_service)
1/23 11:41:17 (pid:26377) Got VACATE_SERVICE from <10.14.1.34:33271>
1/23 11:41:18 (pid:26377) Inserting new attribute Scheduler into non-active cluster cid=29629 acid=-1
*** End of file SchedLog
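
The excerpt above shows a run of VACATE_SERVICE commands in the seconds before the schedd died. A minimal sketch for tallying which hosts sent them, assuming only the "Got VACATE_SERVICE from <ip:port>" line format visible in the excerpt; the function name count_vacate_senders and the sample variable are illustrative and not part of Condor:

```python
import re
from collections import Counter

# Matches SchedLog lines of the form seen in the excerpt, e.g.
# 1/23 11:41:16 (pid:26377) Got VACATE_SERVICE from <10.14.2.48:51710>
VACATE_RE = re.compile(r"Got VACATE_SERVICE from <(\d+\.\d+\.\d+\.\d+):\d+>")


def count_vacate_senders(log_text):
    """Return a Counter mapping sender IP address to VACATE_SERVICE count."""
    return Counter(VACATE_RE.findall(log_text))


# Three lines copied from the excerpt; a real run would read the SchedLog file.
sample = """\
1/23 11:41:16 (pid:26377) Got VACATE_SERVICE from <10.14.2.48:51710>
1/23 11:41:16 (pid:26377) Got VACATE_SERVICE from <10.14.1.86:57342>
1/23 11:41:16 (pid:26377) Got VACATE_SERVICE from <10.14.1.86:42642>
"""

for host, count in count_vacate_senders(sample).most_common():
    print(host, count)
```

Matching only on the "Got VACATE_SERVICE" lines keeps the tally independent of how the longer DaemonCore lines were wrapped by the mailer.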



-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Questions about this message or Condor in general?
Email address of the local Condor administrator:
ldas_admin_cit__AT__ligo.caltech.edu
The Official Condor Homepage is http://www.cs.wisc.edu/condor

-- 
Erik A. Espinoza
Systems Administrator
LIGO/Caltech - MS 18-34
Pasadena, CA 91125
Ph: 626-395-8517

===========================================================================
Date of creation: Tue Jan 23 14:42:50 2007 (1169584972)
Subject: Actions

Assigned to gthain by gquinn
===========================================================================
Date of actions: Tue Jan 23 16:22:55 2007 (1169590975)
Date: Mon, 29 Jan 2007 14:40:26 -0600
From: Greg Thain <gthain__AT__cs.wisc.edu>
To: condor-support__AT__cs.wisc.edu
Subject: Re: [condor-support #1824] LIGO: Crash on DedicatedScheduler     
 condor_schedd

Erik:

I'm having problems with gdb looking at this corefile on our x86_64 FC4 
machines: it doesn't give the same stack backtrace as you see.  Would it 
be possible for me to log into that machine to see what's going on?

-Greg

===========================================================================
Date mail was appended: Mon Jan 29 14:40:35 2007 (1170103235)
Date: Tue, 30 Jan 2007 11:07:31 -0800
From: "Erik A. Espinoza" <espinoza__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu
CC: dbrown__AT__ligo.caltech.edu, anderson__AT__ligo.caltech.edu
Subject: Re: [condor-support #1824] LIGO: Crash on DedicatedScheduler     
 condor_schedd
X-Enigmail-Version: 0.94.1.0
Openpgp: url=http://pgp.mit.edu/
X-No-Archive: Yes
X-Archive: No

Hi Greg,

Please send me your ssh pub key or your doegrids subject, I will create
a temporary account for you to take a closer look.

Thanks,
Erik


-- 
Erik A. Espinoza
Systems Administrator
LIGO/Caltech - MS 18-34
Pasadena, CA 91125
Ph: 626-395-8517

===========================================================================
Date mail was appended: Tue Jan 30 13:08:22 2007 (1170184103)
Date: Tue, 30 Jan 2007 15:06:24 -0600
From: Greg Thain <gthain__AT__cs.wisc.edu>
To: condor-support__AT__cs.wisc.edu
Subject: Re: [condor-support #1824] LIGO: Crash on DedicatedScheduler    
 condor_schedd

condor-support response tracking system wrote:
> Hi Greg,
> 
> Please send me your ssh pub key or your doegrids subject, I will create
> a temporary account for you to take a closer look.

ssh key attached.

-Greg




ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAIEAvOYdASjK8r6a0w9BTOD5Pr/U8IJbYOji4IxGY5OCiph3bAkMRcxm2ULbzvaM5ByBSGY0vtBRjRomGMi2ZQB3wc3byYAlMoJEK4e1RJGygmB3WUocj1e9to0I+ZZmFYPONIAMlOB8pIVGinZ4B4c9Eweaex3qzQg7Uq1xoR9c+jE= gthain__AT__chevre.cs.wisc.edu


===========================================================================
Date mail was appended: Tue Jan 30 15:06:31 2007 (1170191191)
Date: Tue, 30 Jan 2007 13:20:07 -0800
From: "Erik A. Espinoza" <espinoza__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu
CC: dbrown__AT__ligo.caltech.edu, anderson__AT__ligo.caltech.edu
Subject: Re: [condor-support #1824] LIGO: Crash on DedicatedScheduler     
 condor_schedd
X-Enigmail-Version: 0.94.1.0
Openpgp: url=http://pgp.mit.edu/
X-No-Archive: Yes
X-Archive: No

Hello Greg,

Please log into ldas-pcdev1.ligo.caltech.edu w/ username gthain.

Execute the following to get gdb working:
gdb /ldcg/condor/sbin/condor_schedd core.26377

The core.26377 is in your home dir.
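
The same gdb session can also be run non-interactively to capture just the backtrace. A rough sketch, assuming a gdb build that accepts the -batch and -ex options; the helper names gdb_backtrace_command and run_backtrace are hypothetical:

```python
import subprocess


def gdb_backtrace_command(binary, core):
    """Build a gdb command line that prints a backtrace and exits."""
    return ["gdb", "-batch", "-ex", "bt", binary, core]


def run_backtrace(binary, core):
    """Run gdb on the given binary and core file; return its stdout."""
    result = subprocess.run(
        gdb_backtrace_command(binary, core),
        capture_output=True,
        text=True,
    )
    return result.stdout


# Paths taken from the instructions above; nothing is executed here.
cmd = gdb_backtrace_command("/ldcg/condor/sbin/condor_schedd", "core.26377")
print(" ".join(cmd))
```

Calling run_backtrace with the same two arguments would execute gdb and return the backtrace text, provided gdb and both files are present on the machine.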

Thanks,
Erik



-- 
Erik A. Espinoza
Systems Administrator
LIGO/Caltech - MS 18-34
Pasadena, CA 91125
Ph: 626-395-8517

===========================================================================
Date mail was appended: Tue Jan 30 15:21:04 2007 (1170192064)
Date: Tue, 30 Jan 2007 13:41:13 -0800
From: "Erik A. Espinoza" <espinoza__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu
CC: dbrown__AT__ligo.caltech.edu, anderson__AT__ligo.caltech.edu
Subject: Re: [condor-support #1824] LIGO: Crash on DedicatedScheduler     
 condor_schedd
X-Enigmail-Version: 0.94.1.0
Openpgp: url=http://pgp.mit.edu/
X-No-Archive: Yes
X-Archive: No

Hey Greg,

This was discussed a bit during yesterday's conference call. At the time
of the condor_schedd crash on the DedicatedScheduler, there was an NFS issue.

Thanks,
Erik


-- 
Erik A. Espinoza
Systems Administrator
LIGO/Caltech - MS 18-34
Pasadena, CA 91125
Ph: 626-395-8517

===========================================================================
Date mail was appended: Tue Jan 30 15:42:21 2007 (1170193342)
Subject: Actions

Ticket resolved by gthain
===========================================================================
Date of actions: Tue Jan 30 16:05:26 2007 (1170194726)
Subject: Actions

Ticket was reopened by mailnull
===========================================================================
Date of actions: Tue Jan 30 16:05:27 2007 (1170194728)
Date: Tue, 30 Jan 2007 16:05:12 -0600
From: Greg Thain <gthain__AT__cs.wisc.edu>
To: condor-support__AT__cs.wisc.edu
Subject: Re: [condor-support #1824] LIGO: Crash on DedicatedScheduler    
 condor_schedd

condor-support response tracking system wrote:
> Hey Greg,
> 
> This was discussed a bit during yesterday's conference call. At the time
> of the condor_schedd crash on the DedicatedScheduler, there was an NFS issue.
> 
> Thanks,
> Erik

OK, I'll close this out. I now understand how to get a useful gdb 
session with your core dumps without logging into your machines.

-Greg

===========================================================================
Date mail was appended: Tue Jan 30 16:05:27 2007 (1170194728)
Subject: Actions

Ticket resolved by gthain
===========================================================================
Date of actions: Tue Jan 30 16:17:55 2007 (1170195475)