LIGO Support Ticket 17811
Ticket Information
Number: admin 17811
User: anderson@ligo.caltech.edu
Email: gmendell__AT__ligo-wa.caltech.edu,skoranda__AT__gravity.phys.uwm.edu
Status: open
Assigned To: adesmet
Date: Sat, 29 Mar 2008 12:20:28 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
CC: Gregory Mendell <gmendell__AT__ligo-wa.caltech.edu>, Scott Koranda
<skoranda__AT__gravity.phys.uwm.edu>
Subject: LIGO: condor_procd running without FIFO
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu
It was observed on the LIGO LHO condor pool that after upgrading an execute machine
from 6.9.x to 7.0.0 and then 7.0.1,
# condor_version
$CondorVersion: 7.0.1 Feb 26 2008 BuildID: 76180 $
$CondorPlatform: I386-LINUX_RHEL3 $
that jobs where failing with the following error from the starter:
3/18 16:38:08 ******************************************************
3/18 16:38:08 ** condor_starter (CONDOR_STARTER) STARTING UP
3/18 16:38:08 ** /usr/sbin/condor_starter
3/18 16:38:08 ** $CondorVersion: 7.0.1 Feb 26 2008 BuildID: 76180 $
3/18 16:38:08 ** $CondorPlatform: I386-LINUX_RHEL3 $
3/18 16:38:08 ** PID = 15115
3/18 16:38:08 ** Log last touched 3/18 16:37:08
3/18 16:38:08 ******************************************************
3/18 16:38:08 Using config source: /usr1/condor/condor_config
3/18 16:38:08 Using local config sources:
3/18 16:38:08 /usr1/condor/condor_config.local
3/18 16:38:08 DaemonCore: Command Socket at <10.12.1.88:40813>
3/18 16:38:08 Done setting resource limits
3/18 16:38:08 Communicating with shadow <10.12.0.12:36290>
3/18 16:38:08 Submitting machine is "ldas-gridi.ligo"
3/18 16:38:08 setting the orig job name in starter
3/18 16:38:08 setting the orig job iwd in starter
3/18 16:38:08 Job 13756046.0 set to execute immediately
3/18 16:38:08 Starting a VANILLA universe job with ID: 13756046.0
3/18 16:38:08 IWD: /archive/home/qonline/q/online/H2
3/18 16:38:08 Output file: /archive/home/qonline/q/online/H2/.condor_out.27889
3/18 16:38:08 Error file: /archive/home/qonline/q/online/H2/.condor_error.27889
3/18 16:38:08 Renice expr "0" evaluated to 0
3/18 16:38:08 About to exec /archive/home/qonline/q/online/H2/.condor_run.27889
3/18 16:38:08 error opening watchdog pipe /usr1/condor/log/procd_pipe.STARTD.watchdog: No such file or directory (2)
3/18 16:38:08 ProcFamilyClient: error initializing LocalClient
3/18 16:38:08 ProcFamilyProxy: error initializing ProcFamilyClient
3/18 16:38:08 ERROR "ProcD has failed" at line 590 in file proc_family_proxy.C
3/18 16:38:08 ShutdownFast all jobs.
Upon further inspection it was found that the condor_procd process was running
but the FIFO (named pipes) where missing. This is under Linux (FC4) and here is
the output of lsof that shows that the FIFO's where there originally but where
somehow deleted.
[root@node88 condor]# lsof -p 3377
COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
condor_pr 3377 root cwd DIR 9,2 4096 20119558 /usr1/condor/log
condor_pr 3377 root rtd DIR 9,1 4096 2 /
condor_pr 3377 root txt REG 9,1 112928 4374655 /usr/sbin/condor_procd
condor_pr 3377 root mem REG 9,1 1954592 /lib/libdl-2.3.6.so (path inode=1954574)
condor_pr 3377 root DEL REG 9,1 1955666 /lib/libgcc_s-4.0.2-20051126.so.1.#prelink#.5kP9CY
condor_pr 3377 root mem REG 9,1 1954624 /lib/libresolv-2.3.6.so (path inode=1954589)
condor_pr 3377 root mem REG 0,0 0 [vdso] (stat: No such file or directory)
condor_pr 3377 root mem REG 9,1 1954577 /lib/ld-2.3.6.so (path inode=1954562)
condor_pr 3377 root mem REG 9,1 1954594 /lib/libm-2.3.6.so (path inode=1954580)
condor_pr 3377 root mem REG 9,1 1954590 /lib/libcrypt-2.3.6.so (path inode=1954597)
condor_pr 3377 root mem REG 9,1 4377553 /usr/lib/libstdc++.so.5.0.7 (path inode=4371948)
condor_pr 3377 root mem REG 9,1 1954586 /lib/libc-2.3.6.so (path inode=1954570)
condor_pr 3377 root 0r FIFO 9,2 20119578 /usr1/condor/log/procd_pipe.STARTD.watchdog (deleted)
condor_pr 3377 root 1w FIFO 9,2 20119578 /usr1/condor/log/procd_pipe.STARTD.watchdog (deleted)
condor_pr 3377 root 3r FIFO 9,2 20119579 /usr1/condor/log/procd_pipe.STARTD (deleted)
condor_pr 3377 root 4w FIFO 9,2 20119579 /usr1/condor/log/procd_pipe.STARTD (deleted)
I believe this is a rare corner case, but if it is simple enough, it might make
sense to automatically catch/clean up from this this particular state where the
machine became a blackhole for jobs.
Thanks.
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date of creation: Sat Mar 29 14:20:45 2008 (1206818448)
Subject: Actions
Assigned to adesmet by adesmet
===========================================================================
Date of actions: Thu Apr 3 11:36:03 2008 (1207240563)
Date: Thu, 3 Apr 2008 11:55:32 -0500
From: Alan De Smet <adesmet__AT__cs.wisc.edu>
To: adesmet <condor-admin__AT__cs.wisc.edu>
Subject: Re: [condor-admin #17811] LIGO: condor_procd running without FIFO
> Upon further inspection it was found that the condor_procd
> process was running but the FIFO (named pipes) where missing.
> This is under Linux (FC4) and here is the output of lsof that
> shows that the FIFO's where there originally but where somehow
> deleted.
Was this a one off problem, or did it happen repeatedly?
You say this turned the machine into a black hole. Do you mean
that the machine matched jobs, but they failed to start, and were
rescheduled later, possibly onto the same machine again?
How we recover from this sort of situation will depend on the
frequency and severity. Trying to recover without restarting
would be very difficult. Even restarting Condor adds a lot of
complications. We'll look into it and see what is realistic.
===========================================================================
Date mail was appended: Thu Apr 3 11:55:40 2008 (1207241740)
Date: Thu, 03 Apr 2008 11:03:47 -0800
From: Gregory Mendell <gmendell__AT__ligo-wa.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
CC: anderson__AT__ligo.caltech.edu, skoranda__AT__gravity.phys.uwm.edu
Subject: Re: [condor-admin #17811] LIGO: condor_procd running without FIFO
X-Enigmail-Version: 0.95.6
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu
condor-admin response tracking system wrote:
>> Upon further inspection it was found that the condor_procd
>> process was running but the FIFO (named pipes) where missing.
>> This is under Linux (FC4) and here is the output of lsof that
>> shows that the FIFO's where there originally but where somehow
>> deleted.
>
> Was this a one off problem, or did it happen repeatedly?
There was specific problem on one node (node88) that was one off.
However, as the result of this problem, jobs failed on that node until
((NumShadowExceptions > 30) || (NumShadowStarts > 110))' evaluated to
TRUE and they were placed on hold.
For example:
# grep 13759289 /usr1/condor/log/ShadowLog
3/27 16:29:21 (13759289.0) (13722): ERROR "Error from starter on
slot1__AT__node88.ldas.ligo-wa.caltech.edu: ProcD has failed" at line 649 in
file pseudo_ops.C
3/27 16:29:21 Initializing a VANILLA shadow for job 13759289.0
3/27 16:29:21 (13759289.0) (13725): Request to run on <10.12.1.88:49519>
was ACCEPTED
3/27 16:29:21 (13759289.0) (13725): ERROR "Error from starter on
slot1__AT__node88.ldas.ligo-wa.caltech.edu: ProcD has failed" at line 649 in
file pseudo_ops.C
3/27 16:29:22 Initializing a VANILLA shadow for job 13759289.0
3/27 16:29:22 (13759289.0) (13728): Request to run on <10.12.1.88:49519>
was ACCEPTED
3/27 16:29:22 (13759289.0) (13728): ERROR "Error from starter on
slot1__AT__node88.ldas.ligo-wa.caltech.edu: ProcD has failed" at line 649 in
file pseudo_ops.C
3/27 16:29:22 Initializing a VANILLA shadow for job 13759289.0
3/27 16:29:22 (13759289.0) (13731): Request to run on <10.12.1.88:49519>
was ACCEPTED
3/27 16:29:22 (13759289.0) (13731): ERROR "Error from starter on
slot1__AT__node88.ldas.ligo-wa.caltech.edu: ProcD has failed" at line 649 in
file pseudo_ops.C
3/27 21:26:05 Initializing a VANILLA shadow for job 13759289.0
3/27 21:26:05 (13759289.0) (21262): Request to run on <10.12.1.52:56901>
was ACCEPTED
3/27 21:26:13 (13759289.0) (21262): Job 13759289.0 going into Hold state
(code 3,0): The system macro SYSTEM_PERIODIC_HOLD expression
'((NumShadowExceptions > 30) || (NumShadowStarts > 110))' evaluated to TRUE
3/27 21:26:13 (13759289.0) (21262): ZKM: setting default map to (null)
3/27 21:26:14 (13759289.0) (21262): **** condor_shadow (condor_SHADOW)
EXITING WITH STATUS 112
4/3 10:59:34 Initializing a VANILLA shadow for job 13759289.0
4/3 10:59:34 (13759289.0) (6405): Request to run on <10.12.1.107:33858>
was ACCEPTED
4/3 10:59:42 (13759289.0) (6405): Job 13759289.0 going into Hold state
(code 3,0): The system macro SYSTEM_PERIODIC_HOLD expression
'((NumShadowExceptions > 30) || (NumShadowStarts > 110))' evaluated to TRUE
4/3 10:59:42 (13759289.0) (6405): ZKM: setting default map to (null)
4/3 10:59:42 (13759289.0) (6405): **** condor_shadow (condor_SHADOW)
EXITING WITH STATUS 112
Regards,
Greg
>
> You say this turned the machine into a black hole. Do you mean
> that the machine matched jobs, but they failed to start, and were
> rescheduled later, possibly onto the same machine again?
>
> How we recover from this sort of situation will depend on the
> frequency and severity. Trying to recover without restarting
> would be very difficult. Even restarting Condor adds a lot of
> complications. We'll look into it and see what is realistic.
>
>
> ========================================
> MESSAGE INFORMATION
> ========================================
> * From: Alan De Smet <adesmet__AT__cs.wisc.edu>
> * Ticket Email List: anderson__AT__ligo.caltech.edu, gmendell__AT__ligo-wa.caltech.edu,skoranda__AT__gravity.phys.uwm.edu
>
--
Gregory Mendell | Ph: 509-372-8300 ext. 189 | Fax: 509-372-8137
LIGO Hanford Observatory, PO Box 159, MS S9-02, Richland WA 99352
http://www.ligo.caltech.edu | http://www.ligo-wa.caltech.edu
===========================================================================
Date mail was appended: Thu Apr 3 14:04:07 2008 (1207249452)