LIGO Support Ticket 17147
Ticket Information
Number: admin 17147
User: anderson@ligo.caltech.edu
Email: cannon_k__AT__ligo.caltech.edu,skoranda__AT__gravity.phys.uwm.edu
Status: new
Assigned To: tannenba
Date: Sun, 28 Oct 2007 19:16:25 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
CC: Kipp Cannon <cannon_k__AT__ligo.caltech.edu>, Scott Koranda
<skoranda__AT__gravity.phys.uwm.edu>
Subject: LIGO: Incorrect DAG execution when file server rebooted
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu
The LIGO Condor pool running at Caltech,
# condor_version
$CondorVersion: 6.9.4 Aug 30 2007 $
$CondorPlatform: X86_64-LINUX_RHEL3 $
has sub-optimal performance when an NFS file server gets stuck and is
rebooted. In particular, in the context of a running DAG it appears
that a failed node is returning an exit status of 0, causing the remainder
of the DAG to be improperly executed. The problem with the file server
is a known bug and we have an adequate workaround until we receive a
patch; however, I believe there is also a Condor issue here. Please consider
the Starter log below, which shows an errno=116 but then reports
exiting with status 0, which the Shadow accepts and exits with status 100.
This occurred with Python code running in the Vanilla universe.
I am also confused by the apparent reference to core files in the Starter
log file since all the condor daemons on the execute machine have been
started with the Linux resource limit of "ulimit -c 0".
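For reference, a minimal Python sketch of how the core file limit mentioned above can be read from inside a process. It only reports the limit of the process that runs it (a limit that child processes inherit), and it says nothing about why the Starter attempts the rename. The helper names core_limit and core_dumps_disabled are made up for illustration.

```python
import resource


def core_limit():
    """Return the (soft, hard) core file size limit of the current process."""
    return resource.getrlimit(resource.RLIMIT_CORE)


def core_dumps_disabled():
    """True when the soft limit is 0, which is what "ulimit -c 0" sets."""
    soft, _hard = core_limit()
    return soft == 0


if __name__ == "__main__":
    soft, hard = core_limit()
    print("core limit: soft=%s hard=%s disabled=%s" % (soft, hard, core_dumps_disabled()))
```

Running this from a job wrapper on the execute node would show whether the limit set for the daemons is actually the one the job sees.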
Thanks.
10/24 14:30:36 ******************************************************
10/24 14:30:36 ** condor_starter (CONDOR_STARTER) STARTING UP
10/24 14:30:36 ** /usr/sbin/condor_starter
10/24 14:30:36 ** $CondorVersion: 6.9.4 Aug 30 2007 $
10/24 14:30:36 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
10/24 14:30:36 ** PID = 3030
10/24 14:30:36 ** Log last touched 10/24 14:30:22
10/24 14:30:36 ******************************************************
10/24 14:30:36 Using config source: /usr1/condor/condor_config
10/24 14:30:36 Using local config sources:
10/24 14:30:36 /usr1/condor/condor_config.local
10/24 14:30:36 DaemonCore: Command Socket at <10.14.1.194:34104>
10/24 14:30:36 Done setting resource limits
10/24 14:33:40 Communicating with shadow <10.14.0.12:60131>
10/24 14:33:40 Submitting machine is "ldas-grid.ligo.caltech.edu"
10/24 14:33:40 setting the orig job name in starter
10/24 14:33:40 setting the orig job iwd in starter
10/24 14:33:40 Job 19985018.0 set to execute immediately
10/24 14:33:40 Starting a VANILLA universe job with ID: 19985018.0
10/24 14:33:40 IWD: /mnt/qfs1/kipp/playground
10/24 14:33:49 Output file: /mnt/qfs1/kipp/playground/logs/ligolw_bucluster-19985018-0.out
10/24 14:33:49 Error file: /mnt/qfs1/kipp/playground/logs/ligolw_bucluster-19985018-0.err
10/24 14:33:49 Renice expr "0" evaluated to 0
10/24 14:33:49 About to exec /archive/home/kipp/local/bin/ligolw_bucluster --comment INJECTIONS_PLAYGROUND --input-cache cache/ligolw_bucluster_INJECTIONS_PLAYGROUND_580.cache --cluster-algorithm excesspower
10/24 14:33:56 Create_Process succeeded, pid=3050
10/24 15:58:00 Process exited, pid=3050, status=0
10/24 15:58:00 Failed to rename(/mnt/qfs1/kipp/playground/core.3050,/mnt/qfs1/kipp/playground/core.19985018.0): errno 116 (Stale NFS file handle)
10/24 15:58:00 Failed to rename(/mnt/qfs1/kipp/playground/core,/mnt/qfs1/kipp/playground/core.19985018.0): errno 116 (Stale NFS file handle)
10/24 15:58:00 Got SIGQUIT. Performing fast shutdown.
10/24 15:58:00 ShutdownFast all jobs.
10/24 15:58:00 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0
The 10/24 15:58 time corresponds to a reboot of the file server for the
current working directory of the job in question on the execute nodes.
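The pattern in the Starter log above (a "Process exited ... status=0" line followed at the same timestamp by errno 116) can be searched for mechanically. The following is a rough Python sketch, not a Condor tool: the regular expressions are written only against the log lines quoted in this ticket, and the function name find_suspect_exits is made up for illustration.

```python
import re

# Patterns are assumptions based solely on the Starter log lines quoted here.
EXIT_RE = re.compile(r"^(\d+/\d+ \d+:\d+:\d+) Process exited, pid=(\d+), status=(\d+)")
STALE_RE = re.compile(r"^(\d+/\d+ \d+:\d+:\d+) .*errno 116 \(Stale NFS file handle\)")


def find_suspect_exits(lines):
    """Return (timestamp, pid) for each status=0 exit that is followed,
    before the next exit line, by an errno 116 line with the same timestamp."""
    suspects = []
    last_exit = None
    for line in lines:
        m = EXIT_RE.match(line)
        if m:
            ts, pid, status = m.group(1), int(m.group(2)), int(m.group(3))
            # Only exits reported as successful are of interest.
            last_exit = (ts, pid) if status == 0 else None
            continue
        m = STALE_RE.match(line)
        if m and last_exit and m.group(1) == last_exit[0]:
            if last_exit not in suspects:
                suspects.append(last_exit)
    return suspects


sample = [
    "10/24 15:58:00 Process exited, pid=3050, status=0",
    "10/24 15:58:00 Failed to rename(/mnt/qfs1/kipp/playground/core.3050,"
    "/mnt/qfs1/kipp/playground/core.19985018.0): errno 116 (Stale NFS file handle)",
    "10/24 15:58:00 Got SIGQUIT. Performing fast shutdown.",
]

if __name__ == "__main__":
    for ts, pid in find_suspect_exits(sample):
        print("suspect exit at %s, pid %d" % (ts, pid))
```

A match only means the exit status deserves a second look; it does not prove that the job itself failed.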
The shadow log records that the job completed successfully,
[root@ldas-grid log]# grep 19985018 ShadowLog
10/24 14:30:36 Initializing a VANILLA shadow for job 19985018.0
10/24 14:30:36 (19985018.0) (32723): Request to run on <10.14.1.194:60479> was ACCEPTED
10/24 14:48:49 (19985018.0) (32723): ZKM: setting default map to (null)
10/24 15:03:49 (19985018.0) (32723): ZKM: setting default map to (null)
10/24 15:18:49 (19985018.0) (32723): ZKM: setting default map to (null)
10/24 15:48:49 (19985018.0) (32723): ZKM: setting default map to (null)
10/24 15:58:14 (19985018.0) (32723): ZKM: setting default map to (null)
10/24 15:58:14 (19985018.0) (32723): Job 19985018.0 terminated: exited with status 0
10/24 15:58:14 (19985018.0) (32723): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 100
Thanks.
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date of creation: Sun Oct 28 21:16:42 2007 (1193624205)
Subject: Actions
Assigned to tannenba by pfc
===========================================================================
Date of actions: Mon Oct 29 14:38:51 2007 (1193687150)
Subject: Comments added
This is a really serious problem, and will be fixed for 7.2.0. It doesn't
have anything to do with DAGMan per se and is an underlying Condor problem.
Comments added by psilord
===========================================================================
Date comments were added: Fri Nov 7 13:20:31 2008 (1226085631)