LIGO Support Ticket 19675

Ticket Information
  Number:      admin 19675
  User:        anderson@ligo.caltech.edu
  Email:       
  Status:      resolved
  Assigned To: danb
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
Subject: LIGO: startd fatal exception
Date: Thu, 3 Sep 2009 11:04:59 -0700
X-Seen-BY: mailfromd 4.1 granite.cs.wisc.edu

The LIGO CIT Condor pool running,

# condor_version
$CondorVersion: 7.2.4 Jun 15 2009 BuildID: 159529 $
$CondorPlatform: X86_64-LINUX_RHEL5 $

# cat /etc/redhat-release
CentOS release 5.3 (Final)

# uname -a
Linux node266 2.6.18-128.4.1.el5 #1 SMP Tue Aug 4 20:19:25 EDT 2009  
x86_64 x86_64 x86_64 GNU/Linux


generated the following startd fatal exception. Is this a bug or the  
expected behavior after killing an unresponsive child pid?

Note, this appears to be a rare occurrence and is not happening  
regularly.

Thanks.



Begin forwarded message:

> From: condor__AT__node266.ldas-cit.ligo.caltech.edu
> Date: September 2, 2009 9:09:18 PM PDT
> To: ldas_admin_cit__AT__ligo.caltech.edu
> Subject: [Condor] Problem node266.ldas-cit.ligo.caltech.edu:  
> condor_startd exited (4)
>
> This is an automated email from the Condor system
> on machine "node266.ldas-cit.ligo.caltech.edu".  Do not reply.
>
> "/usr/sbin/condor_startd" on "node266.ldas-cit.ligo.caltech.edu"  
> exited with status 4.
> Condor will automatically restart this process in 10 seconds.
>
> *** Last 20 line(s) of file /usr1/condor/log/StartLog:
> 9/2 21:04:44 Return from HandleReq <command_handler> (handler:  
> 0.000s, sec: 0.001s)
> 9/2 21:09:01 Received UDP command 60008 (DC_CHILDALIVE) from   
> <10.14.2.16:57186>, access level DAEMON
> 9/2 21:09:01 Calling HandleReq <HandleChildAliveCommand> (0)
> 9/2 21:09:01 Return from HandleReq <HandleChildAliveCommand>  
> (handler: 0.000s, sec: 0.001s)
> 9/2 21:09:01 ERROR: Child pid 14899 appears hung! Killing it hard.
> 9/2 21:09:01 error reading from named pipe: watchdog pipe has closed
> 9/2 21:09:01 ProcFamilyClient: failed to read response from ProcD
> 9/2 21:09:01 get_usage: ProcD communication error
> 9/2 21:09:01 ERROR "ProcD has failed" at line 599 in file  
> proc_family_proxy.cpp
> 9/2 21:09:01 slot3: Changing state and activity: Claimed/Busy ->  
> Preempting/Killing
> 9/2 21:09:01 slot4: Changing state and activity: Claimed/Busy ->  
> Preempting/Killing
> 9/2 21:09:01 slot2: Changing activity: Busy -> Killing
> 9/2 21:09:01 slot2: State change: starter exited
> 9/2 21:09:01 slot2: Changing state and activity: Backfill/Killing ->  
> Owner/Idle
> 9/2 21:09:01 slot2: State change: IS_OWNER is false
> 9/2 21:09:01 slot2: Changing state: Owner -> Unclaimed
> 9/2 21:09:01 slot2: State change: START_BACKFILL is TRUE
> 9/2 21:09:01 slot2: Changing state: Unclaimed -> Backfill
> 9/2 21:09:01 slot1: Changing activity: Busy -> Killing
> 9/2 21:09:01 startd exiting because of fatal exception.
> *** End of file StartLog
>
>
>
> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
> Questions about this message or Condor in general?
> Email address of the local Condor administrator: ldas_admin_cit__AT__ligo.caltech.edu
> The Official Condor Homepage is http://www.cs.wisc.edu/condor

--
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson




===========================================================================
Date of creation: Thu Sep  3 13:05:19 2009 (1252001121)
Subject: Actions

Assigned to danb by danb
===========================================================================
Date of actions: Thu Sep  3 13:46:12 2009 (1252003572)
Date: Thu, 03 Sep 2009 13:57:54 -0500
From: Dan Bradley <danb__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #19675] LIGO: startd fatal exception
X-Seen-BY: mailfromd 4.1 granite.cs.wisc.edu


It appears that something went wrong with the procd and the startd 
noticed when it wanted to kill one of its children.  The expected 
behavior of the startd when the procd fails is to die and get restarted.

Is there any record of what pid 14899 was?  Was it a starter?  If so, 
that pid should be recorded in the starter log if not also the startd log.

The most important question is what went wrong with the procd.  There 
can be two types of failures: the procd itself ran into a bug or the 
named pipe to the procd got messed up  in some way (e.g. it got 
deleted).  Unfortunately, the message in the StartLog doesn't make it 
obvious to me what happened.

I don't suppose you have procd logging turned on?  Typically, it is not 
turned on, because it doesn't get rotated and therefore grows without bound.

--Dan

> The LIGO CIT Condor pool running,
>
> # condor_version
> $CondorVersion: 7.2.4 Jun 15 2009 BuildID: 159529 $
> $CondorPlatform: X86_64-LINUX_RHEL5 $
>
> # cat /etc/redhat-release
> CentOS release 5.3 (Final)
>
> # uname -a
> Linux node266 2.6.18-128.4.1.el5 #1 SMP Tue Aug 4 20:19:25 EDT 2009  
> x86_64 x86_64 x86_64 GNU/Linux
>
>
> generated the following startd fatal exception. Is this a bug or the  
> expected behavior after killing an unresponsive child pid?
>
> Note, this appears to be a rare occurrence and is not happening  
> regularly.
>
> Thanks.
>
>
>
> Begin forwarded message:
>
>   
>> From: condor__AT__node266.ldas-cit.ligo.caltech.edu
>> Date: September 2, 2009 9:09:18 PM PDT
>> To: ldas_admin_cit__AT__ligo.caltech.edu
>> Subject: [Condor] Problem node266.ldas-cit.ligo.caltech.edu:  
>> condor_startd exited (4)
>>
>> This is an automated email from the Condor system
>> on machine "node266.ldas-cit.ligo.caltech.edu".  Do not reply.
>>
>> "/usr/sbin/condor_startd" on "node266.ldas-cit.ligo.caltech.edu"  
>> exited with status 4.
>> Condor will automatically restart this process in 10 seconds.
>>
>> *** Last 20 line(s) of file /usr1/condor/log/StartLog:
>> 9/2 21:04:44 Return from HandleReq <command_handler> (handler:  
>> 0.000s, sec: 0.001s)
>> 9/2 21:09:01 Received UDP command 60008 (DC_CHILDALIVE) from   
>> <10.14.2.16:57186>, access level DAEMON
>> 9/2 21:09:01 Calling HandleReq <HandleChildAliveCommand> (0)
>> 9/2 21:09:01 Return from HandleReq <HandleChildAliveCommand>  
>> (handler: 0.000s, sec: 0.001s)
>> 9/2 21:09:01 ERROR: Child pid 14899 appears hung! Killing it hard.
>> 9/2 21:09:01 error reading from named pipe: watchdog pipe has closed
>> 9/2 21:09:01 ProcFamilyClient: failed to read response from ProcD
>> 9/2 21:09:01 get_usage: ProcD communication error
>> 9/2 21:09:01 ERROR "ProcD has failed" at line 599 in file  
>> proc_family_proxy.cpp
>> 9/2 21:09:01 slot3: Changing state and activity: Claimed/Busy ->  
>> Preempting/Killing
>> 9/2 21:09:01 slot4: Changing state and activity: Claimed/Busy ->  
>> Preempting/Killing
>> 9/2 21:09:01 slot2: Changing activity: Busy -> Killing
>> 9/2 21:09:01 slot2: State change: starter exited
>> 9/2 21:09:01 slot2: Changing state and activity: Backfill/Killing ->  
>> Owner/Idle
>> 9/2 21:09:01 slot2: State change: IS_OWNER is false
>> 9/2 21:09:01 slot2: Changing state: Owner -> Unclaimed
>> 9/2 21:09:01 slot2: State change: START_BACKFILL is TRUE
>> 9/2 21:09:01 slot2: Changing state: Unclaimed -> Backfill
>> 9/2 21:09:01 slot1: Changing activity: Busy -> Killing
>> 9/2 21:09:01 startd exiting because of fatal exception.
>> *** End of file StartLog
>>
>>
>>
>> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
>> Questions about this message or Condor in general?
>> Email address of the local Condor administrator: ldas_admin_cit__AT__ligo.caltech.edu
>> The Official Condor Homepage is http://www.cs.wisc.edu/condor
>>     
>
> --
> Stuart Anderson  anderson__AT__ligo.caltech.edu
> http://www.ligo.caltech.edu/~anderson
>
>
>
>
> ===========================================================================
> Date of creation: Thu Sep  3 13:05:19 2009 (1252001121)
>
>
> >From RUST Thu, 3 Sep 2009 13:46:12 -0500 (CDT)
> Subject: Actions
>
> Assigned to danb by danb
> ===========================================================================
> Date of actions: Thu Sep  3 13:46:12 2009 (1252003572)
>   

===========================================================================
Date mail was appended: Thu Sep  3 13:58:01 2009 (1252004281)
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #19675] LIGO: startd fatal exception
Date: Thu, 3 Sep 2009 12:53:29 -0700
X-Seen-BY: mailfromd 4.1 obsidian.cs.wisc.edu

I do not have procd logging enabled (will 7.4 include log file  
management for this daemon?).

If I search all the other daemon logs for lines with a case  
insensitive match to the string "pid" and the string "14899" the only  
reference I find in the last few days is,

/usr1/condor/log/StartLog:9/2 21:09:01 ERROR: Child pid 14899 appears  
hung! Killing it hard.

I double checked the MasterLog and there is no reference to this pid  
in that file since last September.

The named pipe is still on the local file system, but it also appears  
that some very old ones are still there

[root@node266 log]# ls -l procd_pipe*
prw------- 1 condor condor 0 Dec 11  2007 procd_pipe
prw------- 1 condor condor 0 Dec 11  2007 procd_pipe.8388.0
prw------- 1 condor condor 0 Sep  3 12:51 procd_pipe.STARTD
prw------- 1 condor condor 0 Sep  2 21:09 procd_pipe.STARTD.14069.0
prw------- 1 condor condor 0 Sep  2 21:09 procd_pipe.STARTD.14899.0
prw------- 1 condor condor 0 Apr 14 11:37 procd_pipe.STARTD.2254.0
prw------- 1 condor condor 0 Apr 28 16:30 procd_pipe.STARTD.2618.0
prw------- 1 condor condor 0 Sep  2 21:09 procd_pipe.STARTD.2762.0
prw------- 1 condor condor 0 Apr  6 13:10 procd_pipe.STARTD.27994.0
prw------- 1 condor condor 0 Apr  6 13:10 procd_pipe.STARTD.27995.0
prw------- 1 condor condor 0 Sep  3 12:51 procd_pipe.STARTD.watchdog
prw------- 1 condor condor 0 Dec  1  2007 procd_pipe.watchdog

Only a few of these have any processes attached, i.e.,

[root@node266 log]# lsof procd_pipe*
COMMAND     PID   USER   FD   TYPE DEVICE SIZE    NODE NAME
condor_st 15734 condor   12r  FIFO  253,1      8454201  
procd_pipe.STARTD.watchdog
condor_st 15734 condor   13w  FIFO  253,1      8454202 procd_pipe.STARTD
condor_pr 15759   root    0r  FIFO  253,1      8454201  
procd_pipe.STARTD.watchdog
condor_pr 15759   root    1w  FIFO  253,1      8454201  
procd_pipe.STARTD.watchdog
condor_pr 15759   root    3r  FIFO  253,1      8454202 procd_pipe.STARTD
condor_pr 15759   root    4w  FIFO  253,1      8454202 procd_pipe.STARTD

so perhaps there is another small bug here regarding cleaning up.

Thanks.



On Sep 3, 2009, at 11:58 AM, condor-admin response tracking system  
wrote:

>
> It appears that something went wrong with the procd and the startd
> noticed when it wanted to kill one of its children.  The expected
> behavior of the startd when the procd fails is to die and get  
> restarted.
>
> Is there any record of what pid 14899 was?  Was it a starter?  If so,
> that pid should be recorded in the starter log if not also the  
> startd log.
>
> The most important question is what went wrong with the procd.  There
> can be two types of failures: the procd itself ran into a bug or the
> named pipe to the procd got messed up  in some way (e.g. it got
> deleted).  Unfortunately, the message in the StartLog doesn't make it
> obvious to me what happened.
>
> I don't suppose you have procd logging turned on?  Typically, it is  
> not
> turned on, because it doesn't get rotated and therefore grows  
> without bound.
>
> --Dan
>
>> The LIGO CIT Condor pool running,
>>
>> # condor_version
>> $CondorVersion: 7.2.4 Jun 15 2009 BuildID: 159529 $
>> $CondorPlatform: X86_64-LINUX_RHEL5 $
>>
>> # cat /etc/redhat-release
>> CentOS release 5.3 (Final)
>>
>> # uname -a
>> Linux node266 2.6.18-128.4.1.el5 #1 SMP Tue Aug 4 20:19:25 EDT 2009
>> x86_64 x86_64 x86_64 GNU/Linux
>>
>>
>> generated the following startd fatal exception. Is this a bug or the
>> expected behavior after killing an unresponsive child pid?
>>
>> Note, this appears to be a rare occurrence and is not happening
>> regularly.
>>
>> Thanks.
>>
>>
>>
>> Begin forwarded message:
>>
>>
>>> From: condor__AT__node266.ldas-cit.ligo.caltech.edu
>>> Date: September 2, 2009 9:09:18 PM PDT
>>> To: ldas_admin_cit__AT__ligo.caltech.edu
>>> Subject: [Condor] Problem node266.ldas-cit.ligo.caltech.edu:
>>> condor_startd exited (4)
>>>
>>> This is an automated email from the Condor system
>>> on machine "node266.ldas-cit.ligo.caltech.edu".  Do not reply.
>>>
>>> "/usr/sbin/condor_startd" on "node266.ldas-cit.ligo.caltech.edu"
>>> exited with status 4.
>>> Condor will automatically restart this process in 10 seconds.
>>>
>>> *** Last 20 line(s) of file /usr1/condor/log/StartLog:
>>> 9/2 21:04:44 Return from HandleReq <command_handler> (handler:
>>> 0.000s, sec: 0.001s)
>>> 9/2 21:09:01 Received UDP command 60008 (DC_CHILDALIVE) from
>>> <10.14.2.16:57186>, access level DAEMON
>>> 9/2 21:09:01 Calling HandleReq <HandleChildAliveCommand> (0)
>>> 9/2 21:09:01 Return from HandleReq <HandleChildAliveCommand>
>>> (handler: 0.000s, sec: 0.001s)
>>> 9/2 21:09:01 ERROR: Child pid 14899 appears hung! Killing it hard.
>>> 9/2 21:09:01 error reading from named pipe: watchdog pipe has closed
>>> 9/2 21:09:01 ProcFamilyClient: failed to read response from ProcD
>>> 9/2 21:09:01 get_usage: ProcD communication error
>>> 9/2 21:09:01 ERROR "ProcD has failed" at line 599 in file
>>> proc_family_proxy.cpp
>>> 9/2 21:09:01 slot3: Changing state and activity: Claimed/Busy ->
>>> Preempting/Killing
>>> 9/2 21:09:01 slot4: Changing state and activity: Claimed/Busy ->
>>> Preempting/Killing
>>> 9/2 21:09:01 slot2: Changing activity: Busy -> Killing
>>> 9/2 21:09:01 slot2: State change: starter exited
>>> 9/2 21:09:01 slot2: Changing state and activity: Backfill/Killing ->
>>> Owner/Idle
>>> 9/2 21:09:01 slot2: State change: IS_OWNER is false
>>> 9/2 21:09:01 slot2: Changing state: Owner -> Unclaimed
>>> 9/2 21:09:01 slot2: State change: START_BACKFILL is TRUE
>>> 9/2 21:09:01 slot2: Changing state: Unclaimed -> Backfill
>>> 9/2 21:09:01 slot1: Changing activity: Busy -> Killing
>>> 9/2 21:09:01 startd exiting because of fatal exception.
>>> *** End of file StartLog
>>>
>>>
>>>
>>> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
>>> Questions about this message or Condor in general?
>>> Email address of the local Condor administrator: ldas_admin_cit__AT__ligo.caltech.edu
>>> The Official Condor Homepage is http://www.cs.wisc.edu/condor
>>>
>>
>> --
>> Stuart Anderson  anderson__AT__ligo.caltech.edu
>> http://www.ligo.caltech.edu/~anderson
>>
>>
>>
>>
>> = 
>> = 
>> = 
>> = 
>> = 
>> = 
>> =====================================================================
>> Date of creation: Thu Sep  3 13:05:19 2009 (1252001121)
>>
>>
>>> From RUST Thu, 3 Sep 2009 13:46:12 -0500 (CDT)
>> Subject: Actions
>>
>> Assigned to danb by danb
>> = 
>> = 
>> = 
>> = 
>> = 
>> = 
>> =====================================================================
>> Date of actions: Thu Sep  3 13:46:12 2009 (1252003572)
>>
>
>
> ========================================
> MESSAGE INFORMATION
> ========================================
> * From: Dan Bradley <danb__AT__cs.wisc.edu>
> * Ticket Email List: anderson__AT__ligo.caltech.edu,
>

--
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson




===========================================================================
Date mail was appended: Thu Sep  3 14:53:51 2009 (1252007632)
Date: Wed, 09 Sep 2009 16:30:26 -0500
From: Dan Bradley <danb__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #19675] LIGO: startd fatal exception
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu




> I do not have procd logging enabled (will 7.4 include log file  
> management for this daemon?).
>   

7.4 does not include log file management for the procd.

> If I search all the other daemon logs for lines with a case  
> insensitive match to the string "pid" and the string "14899" the only  
> reference I find in the last few days is,
>
> /usr1/condor/log/StartLog:9/2 21:09:01 ERROR: Child pid 14899 appears  
> hung! Killing it hard.
>   

Are there any references to dir_14899?  (Sorry for my late reply if your 
logs have already rotated.)

The reason I ask is that the standard unvirse starter does not log its 
PID, but since the PID is encoded in the scratch directory, it usually 
gets recorded through that.

--Dan

> I double checked the MasterLog and there is no reference to this pid  
> in that file since last September.
>
> The named pipe is still on the local file system, but it also appears  
> that some very old ones are still there
>
> [root@node266 log]# ls -l procd_pipe*
> prw------- 1 condor condor 0 Dec 11  2007 procd_pipe
> prw------- 1 condor condor 0 Dec 11  2007 procd_pipe.8388.0
> prw------- 1 condor condor 0 Sep  3 12:51 procd_pipe.STARTD
> prw------- 1 condor condor 0 Sep  2 21:09 procd_pipe.STARTD.14069.0
> prw------- 1 condor condor 0 Sep  2 21:09 procd_pipe.STARTD.14899.0
> prw------- 1 condor condor 0 Apr 14 11:37 procd_pipe.STARTD.2254.0
> prw------- 1 condor condor 0 Apr 28 16:30 procd_pipe.STARTD.2618.0
> prw------- 1 condor condor 0 Sep  2 21:09 procd_pipe.STARTD.2762.0
> prw------- 1 condor condor 0 Apr  6 13:10 procd_pipe.STARTD.27994.0
> prw------- 1 condor condor 0 Apr  6 13:10 procd_pipe.STARTD.27995.0
> prw------- 1 condor condor 0 Sep  3 12:51 procd_pipe.STARTD.watchdog
> prw------- 1 condor condor 0 Dec  1  2007 procd_pipe.watchdog
>
> Only a few of these have any processes attached, i.e.,
>
> [root@node266 log]# lsof procd_pipe*
> COMMAND     PID   USER   FD   TYPE DEVICE SIZE    NODE NAME
> condor_st 15734 condor   12r  FIFO  253,1      8454201  
> procd_pipe.STARTD.watchdog
> condor_st 15734 condor   13w  FIFO  253,1      8454202 procd_pipe.STARTD
> condor_pr 15759   root    0r  FIFO  253,1      8454201  
> procd_pipe.STARTD.watchdog
> condor_pr 15759   root    1w  FIFO  253,1      8454201  
> procd_pipe.STARTD.watchdog
> condor_pr 15759   root    3r  FIFO  253,1      8454202 procd_pipe.STARTD
> condor_pr 15759   root    4w  FIFO  253,1      8454202 procd_pipe.STARTD
>
> so perhaps there is another small bug here regarding cleaning up.
>
> Thanks.
>
>
>
> On Sep 3, 2009, at 11:58 AM, condor-admin response tracking system  
> wrote:
>
>   
>> It appears that something went wrong with the procd and the startd
>> noticed when it wanted to kill one of its children.  The expected
>> behavior of the startd when the procd fails is to die and get  
>> restarted.
>>
>> Is there any record of what pid 14899 was?  Was it a starter?  If so,
>> that pid should be recorded in the starter log if not also the  
>> startd log.
>>
>> The most important question is what went wrong with the procd.  There
>> can be two types of failures: the procd itself ran into a bug or the
>> named pipe to the procd got messed up  in some way (e.g. it got
>> deleted).  Unfortunately, the message in the StartLog doesn't make it
>> obvious to me what happened.
>>
>> I don't suppose you have procd logging turned on?  Typically, it is  
>> not
>> turned on, because it doesn't get rotated and therefore grows  
>> without bound.
>>
>> --Dan
>>
>>     
>>> The LIGO CIT Condor pool running,
>>>
>>> # condor_version
>>> $CondorVersion: 7.2.4 Jun 15 2009 BuildID: 159529 $
>>> $CondorPlatform: X86_64-LINUX_RHEL5 $
>>>
>>> # cat /etc/redhat-release
>>> CentOS release 5.3 (Final)
>>>
>>> # uname -a
>>> Linux node266 2.6.18-128.4.1.el5 #1 SMP Tue Aug 4 20:19:25 EDT 2009
>>> x86_64 x86_64 x86_64 GNU/Linux
>>>
>>>
>>> generated the following startd fatal exception. Is this a bug or the
>>> expected behavior after killing an unresponsive child pid?
>>>
>>> Note, this appears to be a rare occurrence and is not happening
>>> regularly.
>>>
>>> Thanks.
>>>
>>>
>>>
>>> Begin forwarded message:
>>>
>>>
>>>       
>>>> From: condor__AT__node266.ldas-cit.ligo.caltech.edu
>>>> Date: September 2, 2009 9:09:18 PM PDT
>>>> To: ldas_admin_cit__AT__ligo.caltech.edu
>>>> Subject: [Condor] Problem node266.ldas-cit.ligo.caltech.edu:
>>>> condor_startd exited (4)
>>>>
>>>> This is an automated email from the Condor system
>>>> on machine "node266.ldas-cit.ligo.caltech.edu".  Do not reply.
>>>>
>>>> "/usr/sbin/condor_startd" on "node266.ldas-cit.ligo.caltech.edu"
>>>> exited with status 4.
>>>> Condor will automatically restart this process in 10 seconds.
>>>>
>>>> *** Last 20 line(s) of file /usr1/condor/log/StartLog:
>>>> 9/2 21:04:44 Return from HandleReq <command_handler> (handler:
>>>> 0.000s, sec: 0.001s)
>>>> 9/2 21:09:01 Received UDP command 60008 (DC_CHILDALIVE) from
>>>> <10.14.2.16:57186>, access level DAEMON
>>>> 9/2 21:09:01 Calling HandleReq <HandleChildAliveCommand> (0)
>>>> 9/2 21:09:01 Return from HandleReq <HandleChildAliveCommand>
>>>> (handler: 0.000s, sec: 0.001s)
>>>> 9/2 21:09:01 ERROR: Child pid 14899 appears hung! Killing it hard.
>>>> 9/2 21:09:01 error reading from named pipe: watchdog pipe has closed
>>>> 9/2 21:09:01 ProcFamilyClient: failed to read response from ProcD
>>>> 9/2 21:09:01 get_usage: ProcD communication error
>>>> 9/2 21:09:01 ERROR "ProcD has failed" at line 599 in file
>>>> proc_family_proxy.cpp
>>>> 9/2 21:09:01 slot3: Changing state and activity: Claimed/Busy ->
>>>> Preempting/Killing
>>>> 9/2 21:09:01 slot4: Changing state and activity: Claimed/Busy ->
>>>> Preempting/Killing
>>>> 9/2 21:09:01 slot2: Changing activity: Busy -> Killing
>>>> 9/2 21:09:01 slot2: State change: starter exited
>>>> 9/2 21:09:01 slot2: Changing state and activity: Backfill/Killing ->
>>>> Owner/Idle
>>>> 9/2 21:09:01 slot2: State change: IS_OWNER is false
>>>> 9/2 21:09:01 slot2: Changing state: Owner -> Unclaimed
>>>> 9/2 21:09:01 slot2: State change: START_BACKFILL is TRUE
>>>> 9/2 21:09:01 slot2: Changing state: Unclaimed -> Backfill
>>>> 9/2 21:09:01 slot1: Changing activity: Busy -> Killing
>>>> 9/2 21:09:01 startd exiting because of fatal exception.
>>>> *** End of file StartLog
>>>>
>>>>
>>>>
>>>> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
>>>> Questions about this message or Condor in general?
>>>> Email address of the local Condor administrator: ldas_admin_cit__AT__ligo.caltech.edu
>>>> The Official Condor Homepage is http://www.cs.wisc.edu/condor
>>>>
>>>>         
>>> --
>>> Stuart Anderson  anderson__AT__ligo.caltech.edu
>>> http://www.ligo.caltech.edu/~anderson
>>>
>>>
>>>
>>>
>>> = 
>>> = 
>>> = 
>>> = 
>>> = 
>>> = 
>>> =====================================================================
>>> Date of creation: Thu Sep  3 13:05:19 2009 (1252001121)
>>>
>>>
>>>       
>>>> From RUST Thu, 3 Sep 2009 13:46:12 -0500 (CDT)
>>>>         
>>> Subject: Actions
>>>
>>> Assigned to danb by danb
>>> = 
>>> = 
>>> = 
>>> = 
>>> = 
>>> = 
>>> =====================================================================
>>> Date of actions: Thu Sep  3 13:46:12 2009 (1252003572)
>>>
>>>       
>> ========================================
>> MESSAGE INFORMATION
>> ========================================
>> * From: Dan Bradley <danb__AT__cs.wisc.edu>
>> * Ticket Email List: anderson__AT__ligo.caltech.edu,
>>
>>     
>
> --
> Stuart Anderson  anderson__AT__ligo.caltech.edu
> http://www.ligo.caltech.edu/~anderson
>
>
>
>
>
> ========================================
> MESSAGE INFORMATION
> ========================================
> * From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
> * Ticket Email List: anderson__AT__ligo.caltech.edu, 
>   

===========================================================================
Date mail was appended: Wed Sep  9 16:30:32 2009 (1252531833)
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #19675] LIGO: startd fatal exception
Date: Wed, 9 Sep 2009 14:48:42 -0700
X-Seen-BY: mailfromd 4.1 silica.cs.wisc.edu


On Sep 9, 2009, at 2:30 PM, condor-admin response tracking system wrote:
>
>> If I search all the other daemon logs for lines with a case
>> insensitive match to the string "pid" and the string "14899" the only
>> reference I find in the last few days is,
>>
>> /usr1/condor/log/StartLog:9/2 21:09:01 ERROR: Child pid 14899 appears
>> hung! Killing it hard.
>>
>
> Are there any references to dir_14899?  (Sorry for my late reply if  
> your
> logs have already rotated.)
>
> The reason I ask is that the standard unvirse starter does not log its
> PID, but since the PID is encoded in the scratch directory, it usually
> gets recorded through that.

That string does show up in the starter log for one of the slots. It  
should be noted that the IP address in this section of the logs  
(10.14.2.16) is the address of this execute machine.



9/2 20:09:01 ******************************************************
9/2 20:09:01 ** condor_starter (CONDOR_STARTER) STARTING UP
9/2 20:09:01 ** /usr/sbin/condor_starter
9/2 20:09:01 ** SubsystemInfo: name=STARTER type=STARTER(8)  
class=DAEMON(1)
9/2 20:09:01 ** Configuration: subsystem:STARTER local:<NONE>  
class:DAEMON
9/2 20:09:01 ** $CondorVersion: 7.2.4 Jun 15 2009 BuildID: 159529 $
9/2 20:09:01 ** $CondorPlatform: X86_64-LINUX_RHEL5 $
9/2 20:09:01 ** PID = 14899
9/2 20:09:01 ** Log last touched 9/2 19:57:48
9/2 20:09:01 ******************************************************
9/2 20:09:01 Using config source: /usr1/condor/condor_config
9/2 20:09:01 Using local config sources:
9/2 20:09:01    /usr1/condor/condor_config.local
9/2 20:09:01 DaemonCore: Command Socket at <10.14.2.16:57267>
9/2 20:09:01 Done setting resource limits
9/2 20:09:01 Communicating with shadow <10.14.0.12:58849>
9/2 20:09:01 Submitting machine is "ldas-grid.ligo.caltech.edu"
9/2 20:09:01 setting the orig job name in starter
9/2 20:09:01 setting the orig job iwd in starter
9/2 20:09:03 File transfer completed successfully.
9/2 20:09:04 warning: unable to chmod condor_exec.exe to ensure  
execute bit is set: Operation not permitted
9/2 20:09:04 Job 53783853.0 set to execute immediately
9/2 20:09:04 Starting a VANILLA universe job with ID: 53783853.0
9/2 20:09:04 IWD: /usr1/condor/execute/dir_14899
9/2 20:09:04 Renice expr "0" evaluated to 0
9/2 20:09:04 About to exec /usr1/condor/execute/dir_14899/ 
condor_exec.exe
9/2 20:09:04 Create_Process succeeded, pid=14905
9/2 20:25:58 Process exited, pid=14905, status=1
9/2 21:09:01 Failed to receive GoAhead message from 10.14.2.16.
9/2 21:09:01 File transfer failed, forcing disconnect.
9/2 21:09:01 JIC::transferOutput() failed, waiting for job lease to  
expire or for a reconnect attempt
9/2 21:10:15 ******************************************************
9/2 21:10:15 ** condor_starter (CONDOR_STARTER) STARTING UP


--
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson




===========================================================================
Date mail was appended: Wed Sep  9 16:49:01 2009 (1252532942)
Date: Fri, 09 Oct 2009 11:24:47 -0500
From: Dan Bradley <danb__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #19675] LIGO: startd fatal exception
X-Seen-BY: mailfromd 4.1 silica.cs.wisc.edu


Sorry for letting this issue stall.

I cannot think of a way to make further progress on debugging this 
situation without more information about the procd, which probably means 
turning on the procd log.  I think we need to have a solution for log 
rotation in the procd before we can recommend turning on procd logging 
in a production pool.

--Dan

condor-admin response tracking system wrote:
> On Sep 9, 2009, at 2:30 PM, condor-admin response tracking system wrote:
>   
>>> If I search all the other daemon logs for lines with a case
>>> insensitive match to the string "pid" and the string "14899" the only
>>> reference I find in the last few days is,
>>>
>>> /usr1/condor/log/StartLog:9/2 21:09:01 ERROR: Child pid 14899 appears
>>> hung! Killing it hard.
>>>
>>>       
>> Are there any references to dir_14899?  (Sorry for my late reply if  
>> your
>> logs have already rotated.)
>>
>> The reason I ask is that the standard unvirse starter does not log its
>> PID, but since the PID is encoded in the scratch directory, it usually
>> gets recorded through that.
>>     
>
> That string does show up in the starter log for one of the slots. It  
> should be noted that the IP address in this section of the logs  
> (10.14.2.16) is the address of this execute machine.
>
>
>
> 9/2 20:09:01 ******************************************************
> 9/2 20:09:01 ** condor_starter (CONDOR_STARTER) STARTING UP
> 9/2 20:09:01 ** /usr/sbin/condor_starter
> 9/2 20:09:01 ** SubsystemInfo: name=STARTER type=STARTER(8)  
> class=DAEMON(1)
> 9/2 20:09:01 ** Configuration: subsystem:STARTER local:<NONE>  
> class:DAEMON
> 9/2 20:09:01 ** $CondorVersion: 7.2.4 Jun 15 2009 BuildID: 159529 $
> 9/2 20:09:01 ** $CondorPlatform: X86_64-LINUX_RHEL5 $
> 9/2 20:09:01 ** PID = 14899
> 9/2 20:09:01 ** Log last touched 9/2 19:57:48
> 9/2 20:09:01 ******************************************************
> 9/2 20:09:01 Using config source: /usr1/condor/condor_config
> 9/2 20:09:01 Using local config sources:
> 9/2 20:09:01    /usr1/condor/condor_config.local
> 9/2 20:09:01 DaemonCore: Command Socket at <10.14.2.16:57267>
> 9/2 20:09:01 Done setting resource limits
> 9/2 20:09:01 Communicating with shadow <10.14.0.12:58849>
> 9/2 20:09:01 Submitting machine is "ldas-grid.ligo.caltech.edu"
> 9/2 20:09:01 setting the orig job name in starter
> 9/2 20:09:01 setting the orig job iwd in starter
> 9/2 20:09:03 File transfer completed successfully.
> 9/2 20:09:04 warning: unable to chmod condor_exec.exe to ensure  
> execute bit is set: Operation not permitted
> 9/2 20:09:04 Job 53783853.0 set to execute immediately
> 9/2 20:09:04 Starting a VANILLA universe job with ID: 53783853.0
> 9/2 20:09:04 IWD: /usr1/condor/execute/dir_14899
> 9/2 20:09:04 Renice expr "0" evaluated to 0
> 9/2 20:09:04 About to exec /usr1/condor/execute/dir_14899/ 
> condor_exec.exe
> 9/2 20:09:04 Create_Process succeeded, pid=14905
> 9/2 20:25:58 Process exited, pid=14905, status=1
> 9/2 21:09:01 Failed to receive GoAhead message from 10.14.2.16.
> 9/2 21:09:01 File transfer failed, forcing disconnect.
> 9/2 21:09:01 JIC::transferOutput() failed, waiting for job lease to  
> expire or for a reconnect attempt
> 9/2 21:10:15 ******************************************************
> 9/2 21:10:15 ** condor_starter (CONDOR_STARTER) STARTING UP
>
>
> --
> Stuart Anderson  anderson__AT__ligo.caltech.edu
> http://www.ligo.caltech.edu/~anderson
>
>
>
>
>
> ========================================
> MESSAGE INFORMATION
> ========================================
> * From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
> * Ticket Email List: anderson__AT__ligo.caltech.edu, 
>   

===========================================================================
Date mail was appended: Fri Oct  9 11:24:52 2009 (1255105493)
Subject: Comments added

gt #780 tracks this ticket.

Comments added by psilord

===========================================================================
Date comments were added: Fri Oct  9 11:46:49 2009 (1255106809)
Subject: Comments added


Comments added by psilord

===========================================================================
Date comments were added: Fri Oct  9 15:49:52 2009 (1255121392)
Subject: Comments added

While this startd error actually happened due to some procd error. It is
not happening all the time. And the real problem is that the procd's log
doesn't rotate, so in a production pool, we can't turn on the log to track
down this problem. So, from ligo's point of view, this ticket will be resolved
on their end. One our end we will leave the ticket in gittrac to be fixed later,
and make a new ticket about the procd log needing a rotation feature.

Comments added by psilord

===========================================================================
Date comments were added: Fri Oct  9 15:49:59 2009 (1255121399)
Subject: Actions

Ticket resolved by psilord
===========================================================================
Date of actions: Fri Oct  9 15:51:37 2009 (1255121497)