LIGO Support Ticket 17811

Ticket Information
  Number:      admin 17811
  User:        anderson@ligo.caltech.edu
  Email:       gmendell__AT__ligo-wa.caltech.edu,skoranda__AT__gravity.phys.uwm.edu
  Status:      open
  Assigned To: adesmet
Date: Sat, 29 Mar 2008 12:20:28 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
CC: Gregory Mendell <gmendell__AT__ligo-wa.caltech.edu>,         Scott Koranda
 <skoranda__AT__gravity.phys.uwm.edu>
Subject: LIGO: condor_procd running without FIFO
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu

It was observed on the LIGO LHO condor pool that after upgrading an execute machine
from 6.9.x to 7.0.0 and then 7.0.1,

# condor_version
$CondorVersion: 7.0.1 Feb 26 2008 BuildID: 76180 $
$CondorPlatform: I386-LINUX_RHEL3 $

that jobs where failing with the following error from the starter:

3/18 16:38:08 ******************************************************
3/18 16:38:08 ** condor_starter (CONDOR_STARTER) STARTING UP
3/18 16:38:08 ** /usr/sbin/condor_starter
3/18 16:38:08 ** $CondorVersion: 7.0.1 Feb 26 2008 BuildID: 76180 $
3/18 16:38:08 ** $CondorPlatform: I386-LINUX_RHEL3 $
3/18 16:38:08 ** PID = 15115
3/18 16:38:08 ** Log last touched 3/18 16:37:08
3/18 16:38:08 ******************************************************
3/18 16:38:08 Using config source: /usr1/condor/condor_config
3/18 16:38:08 Using local config sources: 
3/18 16:38:08    /usr1/condor/condor_config.local
3/18 16:38:08 DaemonCore: Command Socket at <10.12.1.88:40813>
3/18 16:38:08 Done setting resource limits
3/18 16:38:08 Communicating with shadow <10.12.0.12:36290>
3/18 16:38:08 Submitting machine is "ldas-gridi.ligo"
3/18 16:38:08 setting the orig job name in starter
3/18 16:38:08 setting the orig job iwd in starter
3/18 16:38:08 Job 13756046.0 set to execute immediately
3/18 16:38:08 Starting a VANILLA universe job with ID: 13756046.0
3/18 16:38:08 IWD: /archive/home/qonline/q/online/H2
3/18 16:38:08 Output file: /archive/home/qonline/q/online/H2/.condor_out.27889
3/18 16:38:08 Error file: /archive/home/qonline/q/online/H2/.condor_error.27889
3/18 16:38:08 Renice expr "0" evaluated to 0
3/18 16:38:08 About to exec /archive/home/qonline/q/online/H2/.condor_run.27889 
3/18 16:38:08 error opening watchdog pipe /usr1/condor/log/procd_pipe.STARTD.watchdog: No such file or directory (2)
3/18 16:38:08 ProcFamilyClient: error initializing LocalClient
3/18 16:38:08 ProcFamilyProxy: error initializing ProcFamilyClient
3/18 16:38:08 ERROR "ProcD has failed" at line 590 in file proc_family_proxy.C
3/18 16:38:08 ShutdownFast all jobs.


Upon further inspection it was found that the condor_procd process was running
but the FIFO (named pipes) where missing. This is under Linux (FC4) and here is
the output of lsof that shows that the FIFO's where there originally but where
somehow deleted.

[root@node88 condor]# lsof -p 3377
COMMAND    PID USER   FD   TYPE DEVICE   SIZE     NODE NAME
condor_pr 3377 root  cwd    DIR    9,2   4096 20119558 /usr1/condor/log
condor_pr 3377 root  rtd    DIR    9,1   4096        2 /
condor_pr 3377 root  txt    REG    9,1 112928  4374655 /usr/sbin/condor_procd
condor_pr 3377 root  mem    REG    9,1         1954592 /lib/libdl-2.3.6.so (path inode=1954574)
condor_pr 3377 root  DEL    REG    9,1         1955666 /lib/libgcc_s-4.0.2-20051126.so.1.#prelink#.5kP9CY
condor_pr 3377 root  mem    REG    9,1         1954624 /lib/libresolv-2.3.6.so (path inode=1954589)
condor_pr 3377 root  mem    REG    0,0               0 [vdso] (stat: No such file or directory)
condor_pr 3377 root  mem    REG    9,1         1954577 /lib/ld-2.3.6.so (path inode=1954562)
condor_pr 3377 root  mem    REG    9,1         1954594 /lib/libm-2.3.6.so (path inode=1954580)
condor_pr 3377 root  mem    REG    9,1         1954590 /lib/libcrypt-2.3.6.so (path inode=1954597)
condor_pr 3377 root  mem    REG    9,1         4377553 /usr/lib/libstdc++.so.5.0.7 (path inode=4371948)
condor_pr 3377 root  mem    REG    9,1         1954586 /lib/libc-2.3.6.so (path inode=1954570)
condor_pr 3377 root    0r  FIFO    9,2        20119578 /usr1/condor/log/procd_pipe.STARTD.watchdog (deleted)
condor_pr 3377 root    1w  FIFO    9,2        20119578 /usr1/condor/log/procd_pipe.STARTD.watchdog (deleted)
condor_pr 3377 root    3r  FIFO    9,2        20119579 /usr1/condor/log/procd_pipe.STARTD (deleted)
condor_pr 3377 root    4w  FIFO    9,2        20119579 /usr1/condor/log/procd_pipe.STARTD (deleted)


I believe this is a rare corner case, but if it is simple enough, it might make
sense to automatically catch/clean up from this this particular state where the
machine became a blackhole for jobs.

Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date of creation: Sat Mar 29 14:20:45 2008 (1206818448)
Subject: Actions

Assigned to adesmet by adesmet
===========================================================================
Date of actions: Thu Apr  3 11:36:03 2008 (1207240563)
Date: Thu, 3 Apr 2008 11:55:32 -0500
From: Alan De Smet <adesmet__AT__cs.wisc.edu>
To: adesmet <condor-admin__AT__cs.wisc.edu>
Subject: Re: [condor-admin #17811] LIGO: condor_procd running without FIFO

> Upon further inspection it was found that the condor_procd
> process was running but the FIFO (named pipes) where missing.
> This is under Linux (FC4) and here is the output of lsof that
> shows that the FIFO's where there originally but where somehow
> deleted.

Was this a one off problem, or did it happen repeatedly?

You say this turned the machine into a black hole.  Do you mean
that the machine matched jobs, but they failed to start, and were
rescheduled later, possibly onto the same machine again?

How we recover from this sort of situation will depend on the
frequency and severity.  Trying to recover without restarting
would be very difficult.  Even restarting Condor adds a lot of
complications.  We'll look into it and see what is realistic.

===========================================================================
Date mail was appended: Thu Apr  3 11:55:40 2008 (1207241740)
Date: Thu, 03 Apr 2008 11:03:47 -0800
From: Gregory Mendell <gmendell__AT__ligo-wa.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
CC: anderson__AT__ligo.caltech.edu, skoranda__AT__gravity.phys.uwm.edu
Subject: Re: [condor-admin #17811] LIGO: condor_procd running without FIFO
X-Enigmail-Version: 0.95.6
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu

condor-admin response tracking system wrote:
>> Upon further inspection it was found that the condor_procd
>> process was running but the FIFO (named pipes) where missing.
>> This is under Linux (FC4) and here is the output of lsof that
>> shows that the FIFO's where there originally but where somehow
>> deleted.
> 
> Was this a one off problem, or did it happen repeatedly?

There was specific problem on one node (node88) that was one off. 
However, as the result of this problem, jobs failed on that node until
((NumShadowExceptions > 30) || (NumShadowStarts > 110))' evaluated to 
TRUE and they were placed on hold.

For example:

# grep 13759289 /usr1/condor/log/ShadowLog

3/27 16:29:21 (13759289.0) (13722): ERROR "Error from starter on 
slot1__AT__node88.ldas.ligo-wa.caltech.edu: ProcD has failed" at line 649 in 
file pseudo_ops.C
3/27 16:29:21 Initializing a VANILLA shadow for job 13759289.0
3/27 16:29:21 (13759289.0) (13725): Request to run on <10.12.1.88:49519> 
was ACCEPTED
3/27 16:29:21 (13759289.0) (13725): ERROR "Error from starter on 
slot1__AT__node88.ldas.ligo-wa.caltech.edu: ProcD has failed" at line 649 in 
file pseudo_ops.C
3/27 16:29:22 Initializing a VANILLA shadow for job 13759289.0
3/27 16:29:22 (13759289.0) (13728): Request to run on <10.12.1.88:49519> 
was ACCEPTED
3/27 16:29:22 (13759289.0) (13728): ERROR "Error from starter on 
slot1__AT__node88.ldas.ligo-wa.caltech.edu: ProcD has failed" at line 649 in 
file pseudo_ops.C
3/27 16:29:22 Initializing a VANILLA shadow for job 13759289.0
3/27 16:29:22 (13759289.0) (13731): Request to run on <10.12.1.88:49519> 
was ACCEPTED
3/27 16:29:22 (13759289.0) (13731): ERROR "Error from starter on 
slot1__AT__node88.ldas.ligo-wa.caltech.edu: ProcD has failed" at line 649 in 
file pseudo_ops.C
3/27 21:26:05 Initializing a VANILLA shadow for job 13759289.0
3/27 21:26:05 (13759289.0) (21262): Request to run on <10.12.1.52:56901> 
was ACCEPTED
3/27 21:26:13 (13759289.0) (21262): Job 13759289.0 going into Hold state 
(code 3,0): The system macro SYSTEM_PERIODIC_HOLD expression 
'((NumShadowExceptions > 30) || (NumShadowStarts > 110))' evaluated to TRUE
3/27 21:26:13 (13759289.0) (21262): ZKM: setting default map to (null)
3/27 21:26:14 (13759289.0) (21262): **** condor_shadow (condor_SHADOW) 
EXITING WITH STATUS 112
4/3 10:59:34 Initializing a VANILLA shadow for job 13759289.0
4/3 10:59:34 (13759289.0) (6405): Request to run on <10.12.1.107:33858> 
was ACCEPTED
4/3 10:59:42 (13759289.0) (6405): Job 13759289.0 going into Hold state 
(code 3,0): The system macro SYSTEM_PERIODIC_HOLD expression 
'((NumShadowExceptions > 30) || (NumShadowStarts > 110))' evaluated to TRUE
4/3 10:59:42 (13759289.0) (6405): ZKM: setting default map to (null)
4/3 10:59:42 (13759289.0) (6405): **** condor_shadow (condor_SHADOW) 
EXITING WITH STATUS 112

Regards,
Greg

> 
> You say this turned the machine into a black hole.  Do you mean
> that the machine matched jobs, but they failed to start, and were
> rescheduled later, possibly onto the same machine again?
> 
> How we recover from this sort of situation will depend on the
> frequency and severity.  Trying to recover without restarting
> would be very difficult.  Even restarting Condor adds a lot of
> complications.  We'll look into it and see what is realistic.
> 
> 
> ========================================
> MESSAGE INFORMATION
> ========================================
> * From: Alan De Smet <adesmet__AT__cs.wisc.edu>
> * Ticket Email List: anderson__AT__ligo.caltech.edu, gmendell__AT__ligo-wa.caltech.edu,skoranda__AT__gravity.phys.uwm.edu
> 


-- 
Gregory Mendell | Ph: 509-372-8300 ext. 189 | Fax: 509-372-8137
LIGO Hanford Observatory, PO Box 159, MS S9-02, Richland WA 99352
http://www.ligo.caltech.edu | http://www.ligo-wa.caltech.edu

===========================================================================
Date mail was appended: Thu Apr  3 14:04:07 2008 (1207249452)