LIGO Support Ticket 17237
Ticket Information
Number: admin 17237
User: skoranda@gravity.phys.uwm.edu
Email: anderson__AT__ligo.caltech.edu
Status: new
Assigned To: psilord
Date: Tue, 20 Nov 2007 14:56:27 -0600
From: Scott Koranda <skoranda__AT__gravity.phys.uwm.edu>
To: condor-admin__AT__cs.wisc.edu
CC: Stuart Anderson <anderson__AT__ligo.caltech.edu>
Subject: LIGO: remote file IO despite WantRemoteIO = FALSE
X-Seen-BY: mailfromd 4.1 obsidian.cs.wisc.edu
Hi,
At UWM we are currently running Condor 6.9.4.
A user submitted a set of jobs into the standard
universe. The user did not explicitly set
WantRemoteIO = FALSE but that is the default for our
pool. We achieve that result in our condor_config by
setting
WantRemoteIO = False
Notification = Error
SUBMIT_EXPRS = WantRemoteIO, Notification
So in the class ad for the user's jobs I do see
WantRemoteIO = FALSE
The user did, however, accidentally submit these jobs
with a current working directory of
Iwd = "/people/bose/s5/bns/20051104-20061114/playground_only/sw_injections/run_sel_segs_C03_L2_Nov07"
Note that the /people directory is NOT a
directory available from the compute machines. It is
local and only available on the submit host.
The jobs ran, and we noticed that the submit node
was under heavy load. Closer
inspection showed that the condor_shadow(s) for these
jobs were accumulating significant CPU time. By
looking at an strace of the condor_shadow we verified
that the job was using remote IO. Specifically we saw
that the condor_shadow was reading from a file on
/people, presumably to send the bytes from that file
to the job running on a compute node with no access
to /people.
The question is, given that WantRemoteIO was set to
FALSE in the job's class ad, should the condor_shadow
have provided remote IO for the job, or should an
error have been thrown (or the job simply have
failed)?
We intend to prevent this from happening again,
perhaps by using SYSTEM_PERIODIC_HOLD to put on hold
any standard universe job whose Iwd in the class ad
begins with /people. Still, it would be
helpful to know what should have happened in this
case and why.
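A rough, untested sketch of the kind of setting we
have in mind (this assumes the regexp() ClassAd
function is available in our version, and relies on
JobUniverse == 1 denoting the standard universe):
SYSTEM_PERIODIC_HOLD = (JobUniverse == 1) && regexp("^/people/", Iwd)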
Thanks,
Scott
===========================================================================
Date of creation: Tue Nov 20 14:56:37 2007 (1195592200)
Subject: Actions
Assigned to psilord by roy
===========================================================================
Date of actions: Wed Nov 21 12:47:22 2007 (1195670844)