LIGO Support Ticket 1824
Ticket Information
Number: support 1824
User: espinoza@ligo.caltech.edu
Email: dbrown__AT__ligo.caltech.edu,anderson__AT__ligo.caltech.edu
Status: resolved
Assigned To: gthain
Date: Tue, 23 Jan 2007 12:38:33 -0800
From: "Erik A. Espinoza" <espinoza__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: Duncan Brown <dbrown__AT__ligo.caltech.edu>, Stuart Anderson
<anderson__AT__ligo.caltech.edu>, Alain Roy <roy__AT__cs.wisc.edu>
Subject: LIGO: Crash on DedicatedScheduler condor_schedd
X-Enigmail-Version: 0.94.1.0
Openpgp: url=http://pgp.mit.edu/
X-No-Archive: Yes
X-Archive: No
Greetings,
Alain, could you please associate this with ligo-tickets?
Duncan, was there anything being run at approx 11:41am on Tuesday Jan
23, 2007?
We had a condor_schedd crash on our DedicatedScheduler system. We
have two schedd's, one for DedicatedScheduler and one for general use.
Our system is running FC4 x86_64 w/ following condor_version:
$CondorVersion: 6.8.3 Jan 4 2007 $
$CondorPlatform: X86_64-LINUX_RHEL3 $
The core is available here:
http://www.ligo.caltech.edu/~eespinoz/debug/01-23-2007/core.26377
The gdb output is available here:
http://www.ligo.caltech.edu/~eespinoz/debug/01-23-2007/gdb.out
Today's MasterLog & SchedLog (the master and schedd are all that is
running on our DedicatedScheduler machine) are available here:
http://www.ligo.caltech.edu/~eespinoz/debug/01-23-2007/MasterLog
http://www.ligo.caltech.edu/~eespinoz/debug/01-23-2007/SchedLog
Please let me know if there is any further information I can provide for
this ticket.
Thanks,
Erik
-------- Original Message --------
Subject: [Condor] Problem
Date: Tue, 23 Jan 2007 11:41:26 -0800
From: condor__AT__ldas-pcdev1.ligo.caltech.edu
To: ldas_admin_cit__AT__ligo.caltech.edu
This is an automated email from the Condor system
on machine "ldas-pcdev1.ligo.caltech.edu". Do not reply.
"/ldcg/condor/sbin/condor_schedd" on "ldas-pcdev1.ligo.caltech.edu" died
due to signal 11.
Condor will automatically restart this process in 10 seconds.
*** Last 20 line(s) of file SchedLog:
1/23 11:41:16 (pid:26377) Got VACATE_SERVICE from <10.14.2.48:51710>
1/23 11:41:16 (pid:26377) DaemonCore: Command received via TCP from host
<10.14.1.86:57342>
1/23 11:41:16 (pid:26377) DaemonCore: received command 443
(VACATE_SERVICE), calling handler (vacate_service)
1/23 11:41:16 (pid:26377) Got VACATE_SERVICE from <10.14.1.86:57342>
1/23 11:41:16 (pid:26377) DaemonCore: Command received via TCP from host
<10.14.2.60:44306>
1/23 11:41:16 (pid:26377) DaemonCore: received command 443
(VACATE_SERVICE), calling handler (vacate_service)
1/23 11:41:16 (pid:26377) Got VACATE_SERVICE from <10.14.2.60:44306>
1/23 11:41:16 (pid:26377) DaemonCore: Command received via TCP from host
<10.14.1.86:42642>
1/23 11:41:16 (pid:26377) DaemonCore: received command 443
(VACATE_SERVICE), calling handler (vacate_service)
1/23 11:41:16 (pid:26377) Got VACATE_SERVICE from <10.14.1.86:42642>
1/23 11:41:17 (pid:26377) DaemonCore: Command received via TCP from host
<10.14.1.39:41009>
1/23 11:41:17 (pid:26377) DaemonCore: received command 443
(VACATE_SERVICE), calling handler (vacate_service)
1/23 11:41:17 (pid:26377) Got VACATE_SERVICE from <10.14.1.39:41009>
1/23 11:41:17 (pid:26377) DaemonCore: Command received via TCP from host
<10.14.1.34:53120>
1/23 11:41:17 (pid:26377) DaemonCore: received command 443
(VACATE_SERVICE), calling handler (vacate_service)
1/23 11:41:17 (pid:26377) Got VACATE_SERVICE from <10.14.1.34:53120>
1/23 11:41:17 (pid:26377) DaemonCore: Command received via TCP from host
<10.14.1.34:33271>
1/23 11:41:17 (pid:26377) DaemonCore: received command 443
(VACATE_SERVICE), calling handler (vacate_service)
1/23 11:41:17 (pid:26377) Got VACATE_SERVICE from <10.14.1.34:33271>
1/23 11:41:18 (pid:26377) Inserting new attribute Scheduler into
non-active cluster cid=29629 acid=-1
*** End of file SchedLog
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Questions about this message or Condor in general?
Email address of the local Condor administrator:
ldas_admin_cit__AT__ligo.caltech.edu
The Official Condor Homepage is http://www.cs.wisc.edu/condor
--
Erik A. Espinoza
Systems Administrator
LIGO/Caltech - MS 18-34
Pasadena, CA 91125
Ph: 626-395-8517
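
The SchedLog excerpt in the automated message above shows a burst of VACATE_SERVICE commands in the seconds before the schedd died. A minimal Python sketch for tallying those commands per source host follows. The function name, the regular expression, and the sample list are illustrative assumptions and not part of Condor; the pattern is inferred only from the "Got VACATE_SERVICE from <ip:port>" lines quoted in this ticket, and it assumes each such entry sits on a single line (the mail wrapping above splits some other entries across two lines).

```python
import re
from collections import Counter

# Pattern inferred from lines in this ticket such as:
#   1/23 11:41:16 (pid:26377) Got VACATE_SERVICE from <10.14.2.48:51710>
# It is an assumption about the log layout, not an official format definition.
VACATE_RE = re.compile(
    r"^(?P<date>\d+/\d+) (?P<time>\d+:\d+:\d+) \(pid:(?P<pid>\d+)\) "
    r"Got VACATE_SERVICE from <(?P<host>[\d.]+):(?P<port>\d+)>"
)


def count_vacates(lines):
    """Return a Counter mapping source IP to number of VACATE_SERVICE commands."""
    counts = Counter()
    for line in lines:
        match = VACATE_RE.match(line.strip())
        if match:
            counts[match.group("host")] += 1
    return counts


# The seven "Got VACATE_SERVICE" lines quoted in the automated email.
sample = [
    "1/23 11:41:16 (pid:26377) Got VACATE_SERVICE from <10.14.2.48:51710>",
    "1/23 11:41:16 (pid:26377) Got VACATE_SERVICE from <10.14.1.86:57342>",
    "1/23 11:41:16 (pid:26377) Got VACATE_SERVICE from <10.14.2.60:44306>",
    "1/23 11:41:16 (pid:26377) Got VACATE_SERVICE from <10.14.1.86:42642>",
    "1/23 11:41:17 (pid:26377) Got VACATE_SERVICE from <10.14.1.39:41009>",
    "1/23 11:41:17 (pid:26377) Got VACATE_SERVICE from <10.14.1.34:53120>",
    "1/23 11:41:17 (pid:26377) Got VACATE_SERVICE from <10.14.1.34:33271>",
]

if __name__ == "__main__":
    for host, n in count_vacates(sample).most_common():
        print(host, n)
```

To run it against a full log, pass an open file handle instead of the sample list, for example count_vacates(open("SchedLog")).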
===========================================================================
Date of creation: Tue Jan 23 14:42:50 2007 (1169584972)
Subject: Actions
Assigned to gthain by gquinn
===========================================================================
Date of actions: Tue Jan 23 16:22:55 2007 (1169590975)
Date: Mon, 29 Jan 2007 14:40:26 -0600
From: Greg Thain <gthain__AT__cs.wisc.edu>
To: condor-support__AT__cs.wisc.edu
Subject: Re: [condor-support #1824] LIGO: Crash on DedicatedScheduler
condor_schedd
gquinn wrote:
> We had a condor_schedd crash on our DedicatedScheduler system. We
> have two schedd's, one for DedicatedScheduler and one for general use.
>
> Our system is running FC4 x86_64 w/ following condor_version:
> $CondorVersion: 6.8.3 Jan 4 2007 $
> $CondorPlatform: X86_64-LINUX_RHEL3 $
>
> The core is available here:
> http://www.ligo.caltech.edu/~eespinoz/debug/01-23-2007/core.26377
>
> The gdb output is available here:
> http://www.ligo.caltech.edu/~eespinoz/debug/01-23-2007/gdb.out
>
> The logs consisting of today for MasterLog & SchedLog (all that is
> running on our DedicatedScheduler machine) are available here:
> http://www.ligo.caltech.edu/~eespinoz/debug/01-23-2007/MasterLog
> http://www.ligo.caltech.edu/~eespinoz/debug/01-23-2007/SchedLog
>
> Please let me know if there is any further information I can provide for
> this ticket.
>
> Thanks,
> Erik
Erik:
I'm having problems with gdb looking at this corefile on our x86_64 FC4
machines; it doesn't give the same stack backtrace that you see. Would it
be possible for me to log into that machine to see what's going on?
-Greg
===========================================================================
Date mail was appended: Mon Jan 29 14:40:35 2007 (1170103235)
Date: Tue, 30 Jan 2007 11:07:31 -0800
From: "Erik A. Espinoza" <espinoza__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu
CC: dbrown__AT__ligo.caltech.edu, anderson__AT__ligo.caltech.edu
Subject: Re: [condor-support #1824] LIGO: Crash on DedicatedScheduler
condor_schedd
X-Enigmail-Version: 0.94.1.0
Openpgp: url=http://pgp.mit.edu/
X-No-Archive: Yes
X-Archive: No
Hi Greg,
Please send me your ssh pub key or your doegrids subject, and I will create
a temporary account for you to take a closer look.
Thanks,
Erik
condor-support response tracking system wrote:
> gquinn wrote:
>
>> We had a condor_schedd crash on our DedicatedScheduler system. We
>> have two schedd's, one for DedicatedScheduler and one for general use.
>>
>> Our system is running FC4 x86_64 w/ following condor_version:
>> $CondorVersion: 6.8.3 Jan 4 2007 $
>> $CondorPlatform: X86_64-LINUX_RHEL3 $
>>
>> The core is available here:
>> http://www.ligo.caltech.edu/~eespinoz/debug/01-23-2007/core.26377
>>
>> The gdb output is available here:
>> http://www.ligo.caltech.edu/~eespinoz/debug/01-23-2007/gdb.out
>>
>> The logs consisting of today for MasterLog & SchedLog (all that is
>> running on our DedicatedScheduler machine) are available here:
>> http://www.ligo.caltech.edu/~eespinoz/debug/01-23-2007/MasterLog
>> http://www.ligo.caltech.edu/~eespinoz/debug/01-23-2007/SchedLog
>>
>> Please let me know if there is any further information I can provide for
>> this ticket.
>>
>> Thanks,
>> Erik
>
> Erik:
>
> I'm having problems with gdb looking at this corefile on our x86_64 FC4
> machines, it doesn't give the same stack backtrace as you see. Would it
> be possible for me to log into that machine to see what's going on?
>
> -Greg
>
>
> ========================================
> MESSAGE INFORMATION
> ========================================
> * From: Greg Thain <gthain__AT__cs.wisc.edu>
> * Ticket Email List: espinoza__AT__ligo.caltech.edu, dbrown__AT__ligo.caltech.edu,anderson__AT__ligo.caltech.edu
>
--
Erik A. Espinoza
Systems Administrator
LIGO/Caltech - MS 18-34
Pasadena, CA 91125
Ph: 626-395-8517
===========================================================================
Date mail was appended: Tue Jan 30 13:08:22 2007 (1170184103)
Date: Tue, 30 Jan 2007 15:06:24 -0600
From: Greg Thain <gthain__AT__cs.wisc.edu>
To: condor-support__AT__cs.wisc.edu
Subject: Re: [condor-support #1824] LIGO: Crash on DedicatedScheduler
condor_schedd
condor-support response tracking system wrote:
> Hi Greg,
>
> Please send me your ssh pub key or your doegrids subject, I will create
> a temporary account for you to take a closer look.
ssh key attached.
-Greg
ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAIEAvOYdASjK8r6a0w9BTOD5Pr/U8IJbYOji4IxGY5OCiph3bAkMRcxm2ULbzvaM5ByBSGY0vtBRjRomGMi2ZQB3wc3byYAlMoJEK4e1RJGygmB3WUocj1e9to0I+ZZmFYPONIAMlOB8pIVGinZ4B4c9Eweaex3qzQg7Uq1xoR9c+jE= gthain__AT__chevre.cs.wisc.edu
===========================================================================
Date mail was appended: Tue Jan 30 15:06:31 2007 (1170191191)
Date: Tue, 30 Jan 2007 13:20:07 -0800
From: "Erik A. Espinoza" <espinoza__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu
CC: dbrown__AT__ligo.caltech.edu, anderson__AT__ligo.caltech.edu
Subject: Re: [condor-support #1824] LIGO: Crash on DedicatedScheduler
condor_schedd
X-Enigmail-Version: 0.94.1.0
Openpgp: url=http://pgp.mit.edu/
X-No-Archive: Yes
X-Archive: No
Hello Greg,
Please log into ldas-pcdev1.ligo.caltech.edu w/ username gthain.
Execute the following to get gdb working:
gdb /ldcg/condor/sbin/condor_schedd core.26377
The core.26377 is in your home dir.
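
The same invocation can also be run non-interactively to capture a backtrace to a file. The sketch below is a hedged illustration, not something taken from this ticket: the helper names are hypothetical, and it assumes a gdb build that supports the -batch and -ex options (older gdb releases may lack -ex). The binary and core paths are the ones given above.

```python
import subprocess


def build_gdb_command(binary, core, gdb_commands=("bt",)):
    """Build a gdb argument list that runs the given commands and exits.

    Hypothetical helper: -batch makes gdb exit after processing its
    commands, and each -ex supplies one command to execute.
    """
    cmd = ["gdb", "-batch"]
    for gdb_command in gdb_commands:
        cmd += ["-ex", gdb_command]
    cmd += [binary, core]
    return cmd


def backtrace(binary, core):
    """Run gdb on the core file and return whatever it printed to stdout."""
    result = subprocess.run(
        build_gdb_command(binary, core),
        capture_output=True,
        text=True,
        check=False,  # gdb may exit non-zero; still return its output
    )
    return result.stdout


if __name__ == "__main__":
    # Print the command line only; call backtrace() on a machine with gdb
    # and the core file present to actually run it.
    print(" ".join(build_gdb_command("/ldcg/condor/sbin/condor_schedd",
                                     "core.26377")))
```

Running this form on the machine where the crash occurred uses that machine's own binary and libraries, as the interactive session does.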
Thanks,
Erik
condor-support response tracking system wrote:
> This is a multi-part message in MIME format.
> --------------070800000007010400010104
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
> Content-Transfer-Encoding: 7bit
>
> condor-support response tracking system wrote:
>> Hi Greg,
>>
>> Please send me your ssh pub key or your doegrids subject, I will create
>> a temporary account for you to take a closer look.
>
> ssh key attached.
>
> -Greg
>
>
>
> --------------070800000007010400010104
> Content-Type: text/plain;
> name="id_rsa.pub"
> Content-Transfer-Encoding: 7bit
> Content-Disposition: inline;
> filename="id_rsa.pub"
>
> ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAIEAvOYdASjK8r6a0w9BTOD5Pr/U8IJbYOji4IxGY5OCiph3bAkMRcxm2ULbzvaM5ByBSGY0vtBRjRomGMi2ZQB3wc3byYAlMoJEK4e1RJGygmB3WUocj1e9to0I+ZZmFYPONIAMlOB8pIVGinZ4B4c9Eweaex3qzQg7Uq1xoR9c+jE= gthain__AT__chevre.cs.wisc.edu
>
> --------------070800000007010400010104--
>
>
> ========================================
> MESSAGE INFORMATION
> ========================================
> * From: Greg Thain <gthain__AT__cs.wisc.edu>
> * Ticket Email List: espinoza__AT__ligo.caltech.edu, dbrown__AT__ligo.caltech.edu,anderson__AT__ligo.caltech.edu
>
--
Erik A. Espinoza
Systems Administrator
LIGO/Caltech - MS 18-34
Pasadena, CA 91125
Ph: 626-395-8517
===========================================================================
Date mail was appended: Tue Jan 30 15:21:04 2007 (1170192064)
Date: Tue, 30 Jan 2007 13:41:13 -0800
From: "Erik A. Espinoza" <espinoza__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu
CC: dbrown__AT__ligo.caltech.edu, anderson__AT__ligo.caltech.edu
Subject: Re: [condor-support #1824] LIGO: Crash on DedicatedScheduler
condor_schedd
X-Enigmail-Version: 0.94.1.0
Openpgp: url=http://pgp.mit.edu/
X-No-Archive: Yes
X-Archive: No
Hey Greg,
This was discussed a bit during yesterday's conference call. At the time
of the condor_schedd crash on the DedicatedScheduler, there was an NFS issue.
Thanks,
Erik
--
Erik A. Espinoza
Systems Administrator
LIGO/Caltech - MS 18-34
Pasadena, CA 91125
Ph: 626-395-8517
===========================================================================
Date mail was appended: Tue Jan 30 15:42:21 2007 (1170193342)
Subject: Actions
Ticket resolved by gthain
===========================================================================
Date of actions: Tue Jan 30 16:05:26 2007 (1170194726)
Subject: Actions
Ticket was reopened by mailnull
===========================================================================
Date of actions: Tue Jan 30 16:05:27 2007 (1170194728)
Date: Tue, 30 Jan 2007 16:05:12 -0600
From: Greg Thain <gthain__AT__cs.wisc.edu>
To: condor-support__AT__cs.wisc.edu
Subject: Re: [condor-support #1824] LIGO: Crash on DedicatedScheduler
condor_schedd
condor-support response tracking system wrote:
> Hey Greg,
>
> This was discussed a bit during yesterday's conference call. At the time
> of the condor_schedd crash on the DedicatedScheduler, there was an NFS issue.
>
> Thanks,
> Erik
OK, I'll close this out. I now understand how to get a useful gdb
session with your core dumps without logging into your machines.
-Greg
===========================================================================
Date mail was appended: Tue Jan 30 16:05:27 2007 (1170194728)
Subject: Actions
Ticket resolved by gthain
===========================================================================
Date of actions: Tue Jan 30 16:17:55 2007 (1170195475)