LIGO Support Ticket 19675
Ticket Information
Number: admin 19675
User: anderson@ligo.caltech.edu
Email:
Status: resolved
Assigned To: danb
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
Subject: LIGO: startd fatal exception
Date: Thu, 3 Sep 2009 11:04:59 -0700
X-Seen-BY: mailfromd 4.1 granite.cs.wisc.edu
The LIGO CIT Condor pool running,
# condor_version
$CondorVersion: 7.2.4 Jun 15 2009 BuildID: 159529 $
$CondorPlatform: X86_64-LINUX_RHEL5 $
# cat /etc/redhat-release
CentOS release 5.3 (Final)
# uname -a
Linux node266 2.6.18-128.4.1.el5 #1 SMP Tue Aug 4 20:19:25 EDT 2009
x86_64 x86_64 x86_64 GNU/Linux
generated the following startd fatal exception. Is this a bug or the
expected behavior after killing an unresponsive child pid?
Note, this appears to be a rare occurrence and is not happening
regularly.
Thanks.
Begin forwarded message:
> From: condor__AT__node266.ldas-cit.ligo.caltech.edu
> Date: September 2, 2009 9:09:18 PM PDT
> To: ldas_admin_cit__AT__ligo.caltech.edu
> Subject: [Condor] Problem node266.ldas-cit.ligo.caltech.edu:
> condor_startd exited (4)
>
> This is an automated email from the Condor system
> on machine "node266.ldas-cit.ligo.caltech.edu". Do not reply.
>
> "/usr/sbin/condor_startd" on "node266.ldas-cit.ligo.caltech.edu"
> exited with status 4.
> Condor will automatically restart this process in 10 seconds.
>
> *** Last 20 line(s) of file /usr1/condor/log/StartLog:
> 9/2 21:04:44 Return from HandleReq <command_handler> (handler:
> 0.000s, sec: 0.001s)
> 9/2 21:09:01 Received UDP command 60008 (DC_CHILDALIVE) from
> <10.14.2.16:57186>, access level DAEMON
> 9/2 21:09:01 Calling HandleReq <HandleChildAliveCommand> (0)
> 9/2 21:09:01 Return from HandleReq <HandleChildAliveCommand>
> (handler: 0.000s, sec: 0.001s)
> 9/2 21:09:01 ERROR: Child pid 14899 appears hung! Killing it hard.
> 9/2 21:09:01 error reading from named pipe: watchdog pipe has closed
> 9/2 21:09:01 ProcFamilyClient: failed to read response from ProcD
> 9/2 21:09:01 get_usage: ProcD communication error
> 9/2 21:09:01 ERROR "ProcD has failed" at line 599 in file
> proc_family_proxy.cpp
> 9/2 21:09:01 slot3: Changing state and activity: Claimed/Busy ->
> Preempting/Killing
> 9/2 21:09:01 slot4: Changing state and activity: Claimed/Busy ->
> Preempting/Killing
> 9/2 21:09:01 slot2: Changing activity: Busy -> Killing
> 9/2 21:09:01 slot2: State change: starter exited
> 9/2 21:09:01 slot2: Changing state and activity: Backfill/Killing ->
> Owner/Idle
> 9/2 21:09:01 slot2: State change: IS_OWNER is false
> 9/2 21:09:01 slot2: Changing state: Owner -> Unclaimed
> 9/2 21:09:01 slot2: State change: START_BACKFILL is TRUE
> 9/2 21:09:01 slot2: Changing state: Unclaimed -> Backfill
> 9/2 21:09:01 slot1: Changing activity: Busy -> Killing
> 9/2 21:09:01 startd exiting because of fatal exception.
> *** End of file StartLog
>
>
>
> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
> Questions about this message or Condor in general?
> Email address of the local Condor administrator: ldas_admin_cit__AT__ligo.caltech.edu
> The Official Condor Homepage is http://www.cs.wisc.edu/condor
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date of creation: Thu Sep 3 13:05:19 2009 (1252001121)
Subject: Actions
Assigned to danb by danb
===========================================================================
Date of actions: Thu Sep 3 13:46:12 2009 (1252003572)
Date: Thu, 03 Sep 2009 13:57:54 -0500
From: Dan Bradley <danb__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #19675] LIGO: startd fatal exception
X-Seen-BY: mailfromd 4.1 granite.cs.wisc.edu
It appears that something went wrong with the procd and the startd
noticed when it wanted to kill one of its children. The expected
behavior of the startd when the procd fails is to die and get restarted.
Is there any record of what pid 14899 was? Was it a starter? If so,
that pid should be recorded in the starter log if not also the startd log.
The most important question is what went wrong with the procd. There
can be two types of failures: the procd itself ran into a bug or the
named pipe to the procd got messed up in some way (e.g. it got
deleted). Unfortunately, the message in the StartLog doesn't make it
obvious to me what happened.
I don't suppose you have procd logging turned on? Typically, it is not
turned on, because it doesn't get rotated and therefore grows without bound.
--Dan
> The LIGO CIT Condor pool running,
>
> # condor_version
> $CondorVersion: 7.2.4 Jun 15 2009 BuildID: 159529 $
> $CondorPlatform: X86_64-LINUX_RHEL5 $
>
> # cat /etc/redhat-release
> CentOS release 5.3 (Final)
>
> # uname -a
> Linux node266 2.6.18-128.4.1.el5 #1 SMP Tue Aug 4 20:19:25 EDT 2009
> x86_64 x86_64 x86_64 GNU/Linux
>
>
> generated the following startd fatal exception. Is this a bug or the
> expected behavior after killing an unresponsive child pid?
>
> Note, this appears to be a rare occurrence and is not happening
> regularly.
>
> Thanks.
>
>
>
> Begin forwarded message:
>
>
>> From: condor__AT__node266.ldas-cit.ligo.caltech.edu
>> Date: September 2, 2009 9:09:18 PM PDT
>> To: ldas_admin_cit__AT__ligo.caltech.edu
>> Subject: [Condor] Problem node266.ldas-cit.ligo.caltech.edu:
>> condor_startd exited (4)
>>
>> This is an automated email from the Condor system
>> on machine "node266.ldas-cit.ligo.caltech.edu". Do not reply.
>>
>> "/usr/sbin/condor_startd" on "node266.ldas-cit.ligo.caltech.edu"
>> exited with status 4.
>> Condor will automatically restart this process in 10 seconds.
>>
>> *** Last 20 line(s) of file /usr1/condor/log/StartLog:
>> 9/2 21:04:44 Return from HandleReq <command_handler> (handler:
>> 0.000s, sec: 0.001s)
>> 9/2 21:09:01 Received UDP command 60008 (DC_CHILDALIVE) from
>> <10.14.2.16:57186>, access level DAEMON
>> 9/2 21:09:01 Calling HandleReq <HandleChildAliveCommand> (0)
>> 9/2 21:09:01 Return from HandleReq <HandleChildAliveCommand>
>> (handler: 0.000s, sec: 0.001s)
>> 9/2 21:09:01 ERROR: Child pid 14899 appears hung! Killing it hard.
>> 9/2 21:09:01 error reading from named pipe: watchdog pipe has closed
>> 9/2 21:09:01 ProcFamilyClient: failed to read response from ProcD
>> 9/2 21:09:01 get_usage: ProcD communication error
>> 9/2 21:09:01 ERROR "ProcD has failed" at line 599 in file
>> proc_family_proxy.cpp
>> 9/2 21:09:01 slot3: Changing state and activity: Claimed/Busy ->
>> Preempting/Killing
>> 9/2 21:09:01 slot4: Changing state and activity: Claimed/Busy ->
>> Preempting/Killing
>> 9/2 21:09:01 slot2: Changing activity: Busy -> Killing
>> 9/2 21:09:01 slot2: State change: starter exited
>> 9/2 21:09:01 slot2: Changing state and activity: Backfill/Killing ->
>> Owner/Idle
>> 9/2 21:09:01 slot2: State change: IS_OWNER is false
>> 9/2 21:09:01 slot2: Changing state: Owner -> Unclaimed
>> 9/2 21:09:01 slot2: State change: START_BACKFILL is TRUE
>> 9/2 21:09:01 slot2: Changing state: Unclaimed -> Backfill
>> 9/2 21:09:01 slot1: Changing activity: Busy -> Killing
>> 9/2 21:09:01 startd exiting because of fatal exception.
>> *** End of file StartLog
>>
>>
>>
>> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
>> Questions about this message or Condor in general?
>> Email address of the local Condor administrator: ldas_admin_cit__AT__ligo.caltech.edu
>> The Official Condor Homepage is http://www.cs.wisc.edu/condor
>>
>
> --
> Stuart Anderson anderson__AT__ligo.caltech.edu
> http://www.ligo.caltech.edu/~anderson
>
>
>
>
> ===========================================================================
> Date of creation: Thu Sep 3 13:05:19 2009 (1252001121)
>
>
> >From RUST Thu, 3 Sep 2009 13:46:12 -0500 (CDT)
> Subject: Actions
>
> Assigned to danb by danb
> ===========================================================================
> Date of actions: Thu Sep 3 13:46:12 2009 (1252003572)
>
===========================================================================
Date mail was appended: Thu Sep 3 13:58:01 2009 (1252004281)
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #19675] LIGO: startd fatal exception
Date: Thu, 3 Sep 2009 12:53:29 -0700
X-Seen-BY: mailfromd 4.1 obsidian.cs.wisc.edu
I do not have procd logging enabled (will 7.4 include log file
management for this daemon?).
If I search all the other daemon logs for lines with a case
insensitive match to the string "pid" and the string "14899" the only
reference I find in the last few days is,
/usr1/condor/log/StartLog:9/2 21:09:01 ERROR: Child pid 14899 appears
hung! Killing it hard.
I double checked the MasterLog and there is no reference to this pid
in that file since last September.
The named pipe is still on the local file system, but it also appears
that some very old ones are still there
[root@node266 log]# ls -l procd_pipe*
prw------- 1 condor condor 0 Dec 11 2007 procd_pipe
prw------- 1 condor condor 0 Dec 11 2007 procd_pipe.8388.0
prw------- 1 condor condor 0 Sep 3 12:51 procd_pipe.STARTD
prw------- 1 condor condor 0 Sep 2 21:09 procd_pipe.STARTD.14069.0
prw------- 1 condor condor 0 Sep 2 21:09 procd_pipe.STARTD.14899.0
prw------- 1 condor condor 0 Apr 14 11:37 procd_pipe.STARTD.2254.0
prw------- 1 condor condor 0 Apr 28 16:30 procd_pipe.STARTD.2618.0
prw------- 1 condor condor 0 Sep 2 21:09 procd_pipe.STARTD.2762.0
prw------- 1 condor condor 0 Apr 6 13:10 procd_pipe.STARTD.27994.0
prw------- 1 condor condor 0 Apr 6 13:10 procd_pipe.STARTD.27995.0
prw------- 1 condor condor 0 Sep 3 12:51 procd_pipe.STARTD.watchdog
prw------- 1 condor condor 0 Dec 1 2007 procd_pipe.watchdog
Only a few of these have any processes attached, i.e.,
[root@node266 log]# lsof procd_pipe*
COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
condor_st 15734 condor 12r FIFO 253,1 8454201
procd_pipe.STARTD.watchdog
condor_st 15734 condor 13w FIFO 253,1 8454202 procd_pipe.STARTD
condor_pr 15759 root 0r FIFO 253,1 8454201
procd_pipe.STARTD.watchdog
condor_pr 15759 root 1w FIFO 253,1 8454201
procd_pipe.STARTD.watchdog
condor_pr 15759 root 3r FIFO 253,1 8454202 procd_pipe.STARTD
condor_pr 15759 root 4w FIFO 253,1 8454202 procd_pipe.STARTD
so perhaps there is another small bug here regarding cleaning up.
Thanks.
On Sep 3, 2009, at 11:58 AM, condor-admin response tracking system
wrote:
>
> It appears that something went wrong with the procd and the startd
> noticed when it wanted to kill one of its children. The expected
> behavior of the startd when the procd fails is to die and get
> restarted.
>
> Is there any record of what pid 14899 was? Was it a starter? If so,
> that pid should be recorded in the starter log if not also the
> startd log.
>
> The most important question is what went wrong with the procd. There
> can be two types of failures: the procd itself ran into a bug or the
> named pipe to the procd got messed up in some way (e.g. it got
> deleted). Unfortunately, the message in the StartLog doesn't make it
> obvious to me what happened.
>
> I don't suppose you have procd logging turned on? Typically, it is
> not
> turned on, because it doesn't get rotated and therefore grows
> without bound.
>
> --Dan
>
>> The LIGO CIT Condor pool running,
>>
>> # condor_version
>> $CondorVersion: 7.2.4 Jun 15 2009 BuildID: 159529 $
>> $CondorPlatform: X86_64-LINUX_RHEL5 $
>>
>> # cat /etc/redhat-release
>> CentOS release 5.3 (Final)
>>
>> # uname -a
>> Linux node266 2.6.18-128.4.1.el5 #1 SMP Tue Aug 4 20:19:25 EDT 2009
>> x86_64 x86_64 x86_64 GNU/Linux
>>
>>
>> generated the following startd fatal exception. Is this a bug or the
>> expected behavior after killing an unresponsive child pid?
>>
>> Note, this appears to be a rare occurrence and is not happening
>> regularly.
>>
>> Thanks.
>>
>>
>>
>> Begin forwarded message:
>>
>>
>>> From: condor__AT__node266.ldas-cit.ligo.caltech.edu
>>> Date: September 2, 2009 9:09:18 PM PDT
>>> To: ldas_admin_cit__AT__ligo.caltech.edu
>>> Subject: [Condor] Problem node266.ldas-cit.ligo.caltech.edu:
>>> condor_startd exited (4)
>>>
>>> This is an automated email from the Condor system
>>> on machine "node266.ldas-cit.ligo.caltech.edu". Do not reply.
>>>
>>> "/usr/sbin/condor_startd" on "node266.ldas-cit.ligo.caltech.edu"
>>> exited with status 4.
>>> Condor will automatically restart this process in 10 seconds.
>>>
>>> *** Last 20 line(s) of file /usr1/condor/log/StartLog:
>>> 9/2 21:04:44 Return from HandleReq <command_handler> (handler:
>>> 0.000s, sec: 0.001s)
>>> 9/2 21:09:01 Received UDP command 60008 (DC_CHILDALIVE) from
>>> <10.14.2.16:57186>, access level DAEMON
>>> 9/2 21:09:01 Calling HandleReq <HandleChildAliveCommand> (0)
>>> 9/2 21:09:01 Return from HandleReq <HandleChildAliveCommand>
>>> (handler: 0.000s, sec: 0.001s)
>>> 9/2 21:09:01 ERROR: Child pid 14899 appears hung! Killing it hard.
>>> 9/2 21:09:01 error reading from named pipe: watchdog pipe has closed
>>> 9/2 21:09:01 ProcFamilyClient: failed to read response from ProcD
>>> 9/2 21:09:01 get_usage: ProcD communication error
>>> 9/2 21:09:01 ERROR "ProcD has failed" at line 599 in file
>>> proc_family_proxy.cpp
>>> 9/2 21:09:01 slot3: Changing state and activity: Claimed/Busy ->
>>> Preempting/Killing
>>> 9/2 21:09:01 slot4: Changing state and activity: Claimed/Busy ->
>>> Preempting/Killing
>>> 9/2 21:09:01 slot2: Changing activity: Busy -> Killing
>>> 9/2 21:09:01 slot2: State change: starter exited
>>> 9/2 21:09:01 slot2: Changing state and activity: Backfill/Killing ->
>>> Owner/Idle
>>> 9/2 21:09:01 slot2: State change: IS_OWNER is false
>>> 9/2 21:09:01 slot2: Changing state: Owner -> Unclaimed
>>> 9/2 21:09:01 slot2: State change: START_BACKFILL is TRUE
>>> 9/2 21:09:01 slot2: Changing state: Unclaimed -> Backfill
>>> 9/2 21:09:01 slot1: Changing activity: Busy -> Killing
>>> 9/2 21:09:01 startd exiting because of fatal exception.
>>> *** End of file StartLog
>>>
>>>
>>>
>>> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
>>> Questions about this message or Condor in general?
>>> Email address of the local Condor administrator: ldas_admin_cit__AT__ligo.caltech.edu
>>> The Official Condor Homepage is http://www.cs.wisc.edu/condor
>>>
>>
>> --
>> Stuart Anderson anderson__AT__ligo.caltech.edu
>> http://www.ligo.caltech.edu/~anderson
>>
>>
>>
>>
>> =
>> =
>> =
>> =
>> =
>> =
>> =====================================================================
>> Date of creation: Thu Sep 3 13:05:19 2009 (1252001121)
>>
>>
>>> From RUST Thu, 3 Sep 2009 13:46:12 -0500 (CDT)
>> Subject: Actions
>>
>> Assigned to danb by danb
>> =
>> =
>> =
>> =
>> =
>> =
>> =====================================================================
>> Date of actions: Thu Sep 3 13:46:12 2009 (1252003572)
>>
>
>
> ========================================
> MESSAGE INFORMATION
> ========================================
> * From: Dan Bradley <danb__AT__cs.wisc.edu>
> * Ticket Email List: anderson__AT__ligo.caltech.edu,
>
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date mail was appended: Thu Sep 3 14:53:51 2009 (1252007632)
Date: Wed, 09 Sep 2009 16:30:26 -0500
From: Dan Bradley <danb__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #19675] LIGO: startd fatal exception
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu
> I do not have procd logging enabled (will 7.4 include log file
> management for this daemon?).
>
7.4 does not include log file management for the procd.
> If I search all the other daemon logs for lines with a case
> insensitive match to the string "pid" and the string "14899" the only
> reference I find in the last few days is,
>
> /usr1/condor/log/StartLog:9/2 21:09:01 ERROR: Child pid 14899 appears
> hung! Killing it hard.
>
Are there any references to dir_14899? (Sorry for my late reply if your
logs have already rotated.)
The reason I ask is that the standard unvirse starter does not log its
PID, but since the PID is encoded in the scratch directory, it usually
gets recorded through that.
--Dan
> I double checked the MasterLog and there is no reference to this pid
> in that file since last September.
>
> The named pipe is still on the local file system, but it also appears
> that some very old ones are still there
>
> [root@node266 log]# ls -l procd_pipe*
> prw------- 1 condor condor 0 Dec 11 2007 procd_pipe
> prw------- 1 condor condor 0 Dec 11 2007 procd_pipe.8388.0
> prw------- 1 condor condor 0 Sep 3 12:51 procd_pipe.STARTD
> prw------- 1 condor condor 0 Sep 2 21:09 procd_pipe.STARTD.14069.0
> prw------- 1 condor condor 0 Sep 2 21:09 procd_pipe.STARTD.14899.0
> prw------- 1 condor condor 0 Apr 14 11:37 procd_pipe.STARTD.2254.0
> prw------- 1 condor condor 0 Apr 28 16:30 procd_pipe.STARTD.2618.0
> prw------- 1 condor condor 0 Sep 2 21:09 procd_pipe.STARTD.2762.0
> prw------- 1 condor condor 0 Apr 6 13:10 procd_pipe.STARTD.27994.0
> prw------- 1 condor condor 0 Apr 6 13:10 procd_pipe.STARTD.27995.0
> prw------- 1 condor condor 0 Sep 3 12:51 procd_pipe.STARTD.watchdog
> prw------- 1 condor condor 0 Dec 1 2007 procd_pipe.watchdog
>
> Only a few of these have any processes attached, i.e.,
>
> [root@node266 log]# lsof procd_pipe*
> COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
> condor_st 15734 condor 12r FIFO 253,1 8454201
> procd_pipe.STARTD.watchdog
> condor_st 15734 condor 13w FIFO 253,1 8454202 procd_pipe.STARTD
> condor_pr 15759 root 0r FIFO 253,1 8454201
> procd_pipe.STARTD.watchdog
> condor_pr 15759 root 1w FIFO 253,1 8454201
> procd_pipe.STARTD.watchdog
> condor_pr 15759 root 3r FIFO 253,1 8454202 procd_pipe.STARTD
> condor_pr 15759 root 4w FIFO 253,1 8454202 procd_pipe.STARTD
>
> so perhaps there is another small bug here regarding cleaning up.
>
> Thanks.
>
>
>
> On Sep 3, 2009, at 11:58 AM, condor-admin response tracking system
> wrote:
>
>
>> It appears that something went wrong with the procd and the startd
>> noticed when it wanted to kill one of its children. The expected
>> behavior of the startd when the procd fails is to die and get
>> restarted.
>>
>> Is there any record of what pid 14899 was? Was it a starter? If so,
>> that pid should be recorded in the starter log if not also the
>> startd log.
>>
>> The most important question is what went wrong with the procd. There
>> can be two types of failures: the procd itself ran into a bug or the
>> named pipe to the procd got messed up in some way (e.g. it got
>> deleted). Unfortunately, the message in the StartLog doesn't make it
>> obvious to me what happened.
>>
>> I don't suppose you have procd logging turned on? Typically, it is
>> not
>> turned on, because it doesn't get rotated and therefore grows
>> without bound.
>>
>> --Dan
>>
>>
>>> The LIGO CIT Condor pool running,
>>>
>>> # condor_version
>>> $CondorVersion: 7.2.4 Jun 15 2009 BuildID: 159529 $
>>> $CondorPlatform: X86_64-LINUX_RHEL5 $
>>>
>>> # cat /etc/redhat-release
>>> CentOS release 5.3 (Final)
>>>
>>> # uname -a
>>> Linux node266 2.6.18-128.4.1.el5 #1 SMP Tue Aug 4 20:19:25 EDT 2009
>>> x86_64 x86_64 x86_64 GNU/Linux
>>>
>>>
>>> generated the following startd fatal exception. Is this a bug or the
>>> expected behavior after killing an unresponsive child pid?
>>>
>>> Note, this appears to be a rare occurrence and is not happening
>>> regularly.
>>>
>>> Thanks.
>>>
>>>
>>>
>>> Begin forwarded message:
>>>
>>>
>>>
>>>> From: condor__AT__node266.ldas-cit.ligo.caltech.edu
>>>> Date: September 2, 2009 9:09:18 PM PDT
>>>> To: ldas_admin_cit__AT__ligo.caltech.edu
>>>> Subject: [Condor] Problem node266.ldas-cit.ligo.caltech.edu:
>>>> condor_startd exited (4)
>>>>
>>>> This is an automated email from the Condor system
>>>> on machine "node266.ldas-cit.ligo.caltech.edu". Do not reply.
>>>>
>>>> "/usr/sbin/condor_startd" on "node266.ldas-cit.ligo.caltech.edu"
>>>> exited with status 4.
>>>> Condor will automatically restart this process in 10 seconds.
>>>>
>>>> *** Last 20 line(s) of file /usr1/condor/log/StartLog:
>>>> 9/2 21:04:44 Return from HandleReq <command_handler> (handler:
>>>> 0.000s, sec: 0.001s)
>>>> 9/2 21:09:01 Received UDP command 60008 (DC_CHILDALIVE) from
>>>> <10.14.2.16:57186>, access level DAEMON
>>>> 9/2 21:09:01 Calling HandleReq <HandleChildAliveCommand> (0)
>>>> 9/2 21:09:01 Return from HandleReq <HandleChildAliveCommand>
>>>> (handler: 0.000s, sec: 0.001s)
>>>> 9/2 21:09:01 ERROR: Child pid 14899 appears hung! Killing it hard.
>>>> 9/2 21:09:01 error reading from named pipe: watchdog pipe has closed
>>>> 9/2 21:09:01 ProcFamilyClient: failed to read response from ProcD
>>>> 9/2 21:09:01 get_usage: ProcD communication error
>>>> 9/2 21:09:01 ERROR "ProcD has failed" at line 599 in file
>>>> proc_family_proxy.cpp
>>>> 9/2 21:09:01 slot3: Changing state and activity: Claimed/Busy ->
>>>> Preempting/Killing
>>>> 9/2 21:09:01 slot4: Changing state and activity: Claimed/Busy ->
>>>> Preempting/Killing
>>>> 9/2 21:09:01 slot2: Changing activity: Busy -> Killing
>>>> 9/2 21:09:01 slot2: State change: starter exited
>>>> 9/2 21:09:01 slot2: Changing state and activity: Backfill/Killing ->
>>>> Owner/Idle
>>>> 9/2 21:09:01 slot2: State change: IS_OWNER is false
>>>> 9/2 21:09:01 slot2: Changing state: Owner -> Unclaimed
>>>> 9/2 21:09:01 slot2: State change: START_BACKFILL is TRUE
>>>> 9/2 21:09:01 slot2: Changing state: Unclaimed -> Backfill
>>>> 9/2 21:09:01 slot1: Changing activity: Busy -> Killing
>>>> 9/2 21:09:01 startd exiting because of fatal exception.
>>>> *** End of file StartLog
>>>>
>>>>
>>>>
>>>> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
>>>> Questions about this message or Condor in general?
>>>> Email address of the local Condor administrator: ldas_admin_cit__AT__ligo.caltech.edu
>>>> The Official Condor Homepage is http://www.cs.wisc.edu/condor
>>>>
>>>>
>>> --
>>> Stuart Anderson anderson__AT__ligo.caltech.edu
>>> http://www.ligo.caltech.edu/~anderson
>>>
>>>
>>>
>>>
>>> =
>>> =
>>> =
>>> =
>>> =
>>> =
>>> =====================================================================
>>> Date of creation: Thu Sep 3 13:05:19 2009 (1252001121)
>>>
>>>
>>>
>>>> From RUST Thu, 3 Sep 2009 13:46:12 -0500 (CDT)
>>>>
>>> Subject: Actions
>>>
>>> Assigned to danb by danb
>>> =
>>> =
>>> =
>>> =
>>> =
>>> =
>>> =====================================================================
>>> Date of actions: Thu Sep 3 13:46:12 2009 (1252003572)
>>>
>>>
>> ========================================
>> MESSAGE INFORMATION
>> ========================================
>> * From: Dan Bradley <danb__AT__cs.wisc.edu>
>> * Ticket Email List: anderson__AT__ligo.caltech.edu,
>>
>>
>
> --
> Stuart Anderson anderson__AT__ligo.caltech.edu
> http://www.ligo.caltech.edu/~anderson
>
>
>
>
>
> ========================================
> MESSAGE INFORMATION
> ========================================
> * From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
> * Ticket Email List: anderson__AT__ligo.caltech.edu,
>
===========================================================================
Date mail was appended: Wed Sep 9 16:30:32 2009 (1252531833)
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #19675] LIGO: startd fatal exception
Date: Wed, 9 Sep 2009 14:48:42 -0700
X-Seen-BY: mailfromd 4.1 silica.cs.wisc.edu
On Sep 9, 2009, at 2:30 PM, condor-admin response tracking system wrote:
>
>> If I search all the other daemon logs for lines with a case
>> insensitive match to the string "pid" and the string "14899" the only
>> reference I find in the last few days is,
>>
>> /usr1/condor/log/StartLog:9/2 21:09:01 ERROR: Child pid 14899 appears
>> hung! Killing it hard.
>>
>
> Are there any references to dir_14899? (Sorry for my late reply if
> your
> logs have already rotated.)
>
> The reason I ask is that the standard unvirse starter does not log its
> PID, but since the PID is encoded in the scratch directory, it usually
> gets recorded through that.
That string does show up in the starter log for one of the slots. It
should be noted that the IP address in this section of the logs
(10.14.2.16) is the address of this execute machine.
9/2 20:09:01 ******************************************************
9/2 20:09:01 ** condor_starter (CONDOR_STARTER) STARTING UP
9/2 20:09:01 ** /usr/sbin/condor_starter
9/2 20:09:01 ** SubsystemInfo: name=STARTER type=STARTER(8)
class=DAEMON(1)
9/2 20:09:01 ** Configuration: subsystem:STARTER local:<NONE>
class:DAEMON
9/2 20:09:01 ** $CondorVersion: 7.2.4 Jun 15 2009 BuildID: 159529 $
9/2 20:09:01 ** $CondorPlatform: X86_64-LINUX_RHEL5 $
9/2 20:09:01 ** PID = 14899
9/2 20:09:01 ** Log last touched 9/2 19:57:48
9/2 20:09:01 ******************************************************
9/2 20:09:01 Using config source: /usr1/condor/condor_config
9/2 20:09:01 Using local config sources:
9/2 20:09:01 /usr1/condor/condor_config.local
9/2 20:09:01 DaemonCore: Command Socket at <10.14.2.16:57267>
9/2 20:09:01 Done setting resource limits
9/2 20:09:01 Communicating with shadow <10.14.0.12:58849>
9/2 20:09:01 Submitting machine is "ldas-grid.ligo.caltech.edu"
9/2 20:09:01 setting the orig job name in starter
9/2 20:09:01 setting the orig job iwd in starter
9/2 20:09:03 File transfer completed successfully.
9/2 20:09:04 warning: unable to chmod condor_exec.exe to ensure
execute bit is set: Operation not permitted
9/2 20:09:04 Job 53783853.0 set to execute immediately
9/2 20:09:04 Starting a VANILLA universe job with ID: 53783853.0
9/2 20:09:04 IWD: /usr1/condor/execute/dir_14899
9/2 20:09:04 Renice expr "0" evaluated to 0
9/2 20:09:04 About to exec /usr1/condor/execute/dir_14899/
condor_exec.exe
9/2 20:09:04 Create_Process succeeded, pid=14905
9/2 20:25:58 Process exited, pid=14905, status=1
9/2 21:09:01 Failed to receive GoAhead message from 10.14.2.16.
9/2 21:09:01 File transfer failed, forcing disconnect.
9/2 21:09:01 JIC::transferOutput() failed, waiting for job lease to
expire or for a reconnect attempt
9/2 21:10:15 ******************************************************
9/2 21:10:15 ** condor_starter (CONDOR_STARTER) STARTING UP
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date mail was appended: Wed Sep 9 16:49:01 2009 (1252532942)
Date: Fri, 09 Oct 2009 11:24:47 -0500
From: Dan Bradley <danb__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #19675] LIGO: startd fatal exception
X-Seen-BY: mailfromd 4.1 silica.cs.wisc.edu
Sorry for letting this issue stall.
I cannot think of a way to make further progress on debugging this
situation without more information about the procd, which probably means
turning on the procd log. I think we need to have a solution for log
rotation in the procd before we can recommend turning on procd logging
in a production pool.
--Dan
condor-admin response tracking system wrote:
> On Sep 9, 2009, at 2:30 PM, condor-admin response tracking system wrote:
>
>>> If I search all the other daemon logs for lines with a case
>>> insensitive match to the string "pid" and the string "14899" the only
>>> reference I find in the last few days is,
>>>
>>> /usr1/condor/log/StartLog:9/2 21:09:01 ERROR: Child pid 14899 appears
>>> hung! Killing it hard.
>>>
>>>
>> Are there any references to dir_14899? (Sorry for my late reply if
>> your
>> logs have already rotated.)
>>
>> The reason I ask is that the standard unvirse starter does not log its
>> PID, but since the PID is encoded in the scratch directory, it usually
>> gets recorded through that.
>>
>
> That string does show up in the starter log for one of the slots. It
> should be noted that the IP address in this section of the logs
> (10.14.2.16) is the address of this execute machine.
>
>
>
> 9/2 20:09:01 ******************************************************
> 9/2 20:09:01 ** condor_starter (CONDOR_STARTER) STARTING UP
> 9/2 20:09:01 ** /usr/sbin/condor_starter
> 9/2 20:09:01 ** SubsystemInfo: name=STARTER type=STARTER(8)
> class=DAEMON(1)
> 9/2 20:09:01 ** Configuration: subsystem:STARTER local:<NONE>
> class:DAEMON
> 9/2 20:09:01 ** $CondorVersion: 7.2.4 Jun 15 2009 BuildID: 159529 $
> 9/2 20:09:01 ** $CondorPlatform: X86_64-LINUX_RHEL5 $
> 9/2 20:09:01 ** PID = 14899
> 9/2 20:09:01 ** Log last touched 9/2 19:57:48
> 9/2 20:09:01 ******************************************************
> 9/2 20:09:01 Using config source: /usr1/condor/condor_config
> 9/2 20:09:01 Using local config sources:
> 9/2 20:09:01 /usr1/condor/condor_config.local
> 9/2 20:09:01 DaemonCore: Command Socket at <10.14.2.16:57267>
> 9/2 20:09:01 Done setting resource limits
> 9/2 20:09:01 Communicating with shadow <10.14.0.12:58849>
> 9/2 20:09:01 Submitting machine is "ldas-grid.ligo.caltech.edu"
> 9/2 20:09:01 setting the orig job name in starter
> 9/2 20:09:01 setting the orig job iwd in starter
> 9/2 20:09:03 File transfer completed successfully.
> 9/2 20:09:04 warning: unable to chmod condor_exec.exe to ensure
> execute bit is set: Operation not permitted
> 9/2 20:09:04 Job 53783853.0 set to execute immediately
> 9/2 20:09:04 Starting a VANILLA universe job with ID: 53783853.0
> 9/2 20:09:04 IWD: /usr1/condor/execute/dir_14899
> 9/2 20:09:04 Renice expr "0" evaluated to 0
> 9/2 20:09:04 About to exec /usr1/condor/execute/dir_14899/
> condor_exec.exe
> 9/2 20:09:04 Create_Process succeeded, pid=14905
> 9/2 20:25:58 Process exited, pid=14905, status=1
> 9/2 21:09:01 Failed to receive GoAhead message from 10.14.2.16.
> 9/2 21:09:01 File transfer failed, forcing disconnect.
> 9/2 21:09:01 JIC::transferOutput() failed, waiting for job lease to
> expire or for a reconnect attempt
> 9/2 21:10:15 ******************************************************
> 9/2 21:10:15 ** condor_starter (CONDOR_STARTER) STARTING UP
>
>
> --
> Stuart Anderson anderson__AT__ligo.caltech.edu
> http://www.ligo.caltech.edu/~anderson
>
>
>
>
>
> ========================================
> MESSAGE INFORMATION
> ========================================
> * From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
> * Ticket Email List: anderson__AT__ligo.caltech.edu,
>
===========================================================================
Date mail was appended: Fri Oct 9 11:24:52 2009 (1255105493)
Subject: Comments added
gt #780 tracks this ticket.
Comments added by psilord
===========================================================================
Date comments were added: Fri Oct 9 11:46:49 2009 (1255106809)
Subject: Comments added
Comments added by psilord
===========================================================================
Date comments were added: Fri Oct 9 15:49:52 2009 (1255121392)
Subject: Comments added
While this startd error actually happened due to some procd error. It is
not happening all the time. And the real problem is that the procd's log
doesn't rotate, so in a production pool, we can't turn on the log to track
down this problem. So, from ligo's point of view, this ticket will be resolved
on their end. One our end we will leave the ticket in gittrac to be fixed later,
and make a new ticket about the procd log needing a rotation feature.
Comments added by psilord
===========================================================================
Date comments were added: Fri Oct 9 15:49:59 2009 (1255121399)
Subject: Actions
Ticket resolved by psilord
===========================================================================
Date of actions: Fri Oct 9 15:51:37 2009 (1255121497)