LIGO Support Ticket 17136
Ticket Information
Number: admin 17136
User: anderson@ligo.caltech.edu
Email: skoranda__AT__gravity.phys.uwm.edu,shourov__AT__ligo.mit.edu
Status: open
Assigned To: zmiller
Date: Tue, 23 Oct 2007 17:04:31 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
CC: Scott Koranda <skoranda__AT__gravity.phys.uwm.edu>
Subject: LIGO: condor_run intermittently returning NULL results
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu
The Caltech LIGO pool running condor version,
# condor_version
$CondorVersion: 6.9.4 Aug 30 2007 $
$CondorPlatform: X86_64-LINUX_RHEL3 $
is intermittently losing the output from condor_run in the Vanilla
universe, e.g.,
# condor_run -u local /bin/hostname
ldas-pcdev1
# condor_run /bin/hostname
node105
# condor_run /bin/hostname
#
As far as I can tell from the log files Condor successfully ran the
job that showed no output. Here is the Shadow log file,
10/23 16:41:00 ******************************************************
10/23 16:41:00 ** condor_shadow (CONDOR_SHADOW) STARTING UP
10/23 16:41:00 ** /usr/sbin/condor_shadow
10/23 16:41:00 ** $CondorVersion: 6.9.4 Aug 30 2007 $
10/23 16:41:00 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
10/23 16:41:00 ** PID = 21391
10/23 16:41:00 ** Log last touched 10/23 16:38:59
10/23 16:41:00 ******************************************************
10/23 16:41:00 Using config source: /usr1/condor/condor_config
10/23 16:41:00 Using local config sources:
10/23 16:41:00 /usr1/condor/condor_config.local
10/23 16:41:00 DaemonCore: Command Socket at <10.14.0.18:47705>
10/23 16:41:00 Initializing a VANILLA shadow for job 416050.0
10/23 16:41:00 (416050.0) (21391): Request to run on <10.14.2.12:57088> was ACCEPTED
10/23 16:41:10 (416050.0) (21391): ZKM: setting default map to (null)
10/23 16:41:10 (416050.0) (21391): Job 416050.0 terminated: exited with status 0
10/23 16:41:10 (416050.0) (21391): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 100
This job ran on node262,
# condor_q -run
-- Submitter: ldas-pcdev1.ligo.caltech.edu : <10.14.0.18:35613> : ldas-pcdev1.ligo.caltech.edu
ID OWNER SUBMITTED RUN_TIME HOST(S)
416050.0 anderson 10/23 16:40 0+00:00:00 slot2__AT__node262.ldas-cit.ligo.caltech.edu
and the Startd log from node262 shows,
10/23 16:40:43 Return from HandleReq <command_handler>
10/23 16:40:43 Calling HandleReq <command_handler> (0)
10/23 16:40:43 Return from HandleReq <command_handler>
10/23 16:40:43 Calling HandleReq <HandleChildAliveCommand> (0)
10/23 16:40:43 Return from HandleReq <HandleChildAliveCommand>
10/23 16:40:47 Calling Timer handler 9 (poll_resources)
10/23 16:40:47 Return from Timer handler 9 (poll_resources)
10/23 16:40:52 Calling Timer handler 9 (poll_resources)
10/23 16:40:53 Return from Timer handler 9 (poll_resources)
10/23 16:40:53 Calling Timer handler 0 (check_parent)
10/23 16:40:53 Return from Timer handler 0 (check_parent)
10/23 16:40:57 DaemonCore: Command received via UDP from host <10.14.0.25:48431>
10/23 16:40:57 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
10/23 16:40:57 Calling HandleReq <command_match_info> (0)
10/23 16:40:57 slot2: match_info called
10/23 16:40:57 slot2: Received match <10.14.2.12:57088>#1193066318#193#...
10/23 16:40:57 slot2: State change: match notification protocol successful
10/23 16:40:57 slot2: Changing activity: Busy -> Killing
10/23 16:40:57 Return from HandleReq <command_match_info>
10/23 16:40:58 Calling Timer handler 9 (poll_resources)
10/23 16:40:58 Return from Timer handler 9 (poll_resources)
10/23 16:40:58 Calling Handler <DaemonCore::HandleReqSocketHandler>
10/23 16:40:58 DaemonCore: Command received via TCP from host <10.14.0.18:57564>
10/23 16:40:58 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)
10/23 16:40:58 Calling HandleReq <command_request_claim> (0)
10/23 16:40:58 slot2: Request accepted.
10/23 16:40:58 Return from HandleReq <command_request_claim>
10/23 16:40:58 Return from Handler <DaemonCore::HandleReqSocketHandler>
10/23 16:40:58 DaemonCore: pid 31841 exited with status 0, invoking reaper 2 <BOINC_BackfillMgr::reaper(
)>
10/23 16:40:58 BOINC client (pid 31841) exited with status 0
10/23 16:40:58 slot2: State change: starter exited
10/23 16:40:58 slot2: Remote owner is anderson@ligo
10/23 16:40:58 slot2: State change: claiming protocol successful
10/23 16:40:58 slot2: Changing state and activity: Backfill/Killing -> Claimed/Idle
10/23 16:40:58 DaemonCore: return from reaper for pid 31841
10/23 16:41:00 Calling Handler <DaemonCore::HandleReqSocketHandler>
10/23 16:41:00 DaemonCore: Command received via TCP from host <10.14.0.18:60213>
10/23 16:41:00 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_clai
m)
10/23 16:41:00 Calling HandleReq <command_activate_claim> (0)
10/23 16:41:00 slot2: Got activate_claim request from shadow (<10.14.0.18:60213>)
10/23 16:41:00 slot2: Remote job ID is 416050.0
10/23 16:41:00 slot2: Got universe "VANILLA" (5) from request classad
10/23 16:41:00 slot2: State change: claim-activation protocol successful
10/23 16:41:00 slot2: Changing activity: Idle -> Busy
10/23 16:41:00 Return from HandleReq <command_activate_claim>
10/23 16:41:00 Return from Handler <DaemonCore::HandleReqSocketHandler>
10/23 16:41:01 Calling HandleReq <HandleChildAliveCommand> (0)
10/23 16:41:01 Return from HandleReq <HandleChildAliveCommand>
10/23 16:41:02 Calling Timer handler 4857 (do_update)
10/23 16:41:02 Return from Timer handler 4857 (do_update)
10/23 16:41:03 Calling Timer handler 9 (poll_resources)
10/23 16:41:03 Return from Timer handler 9 (poll_resources)
10/23 16:41:08 Calling Timer handler 9 (poll_resources)
10/23 16:41:08 Return from Timer handler 9 (poll_resources)
10/23 16:41:10 Calling Handler <DaemonCore::HandleReqSocketHandler>
10/23 16:41:10 DaemonCore: Command received via TCP from host <10.14.0.18:35476>
10/23 16:41:10 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_ha
ndler)
10/23 16:41:10 Calling HandleReq <command_handler> (0)
10/23 16:41:10 slot2: Called deactivate_claim_forcibly()
10/23 16:41:10 Return from HandleReq <command_handler>
10/23 16:41:10 Return from Handler <DaemonCore::HandleReqSocketHandler>
10/23 16:41:10 DaemonCore: pid 31849 exited with status 0, invoking reaper 1 <reaper>
10/23 16:41:10 Starter pid 31849 exited with status 0
10/23 16:41:10 slot2: State change: starter exited
10/23 16:41:10 slot2: Changing activity: Busy -> Idle
10/23 16:41:10 DaemonCore: return from reaper for pid 31849
10/23 16:41:10 DaemonCore: Command received via UDP from host <10.14.0.18:41270>
10/23 16:41:10 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_release_claim)
10/23 16:41:10 Calling HandleReq <command_release_claim> (0)
10/23 16:41:10 slot2: State change: received RELEASE_CLAIM command
10/23 16:41:10 slot2: Changing state and activity: Claimed/Idle -> Preempting/Vacating
10/23 16:41:10 slot2: State change: No preempting claim, returning to owner
10/23 16:41:10 slot2: Changing state and activity: Preempting/Vacating -> Owner/Idle
10/23 16:41:10 slot2: State change: IS_OWNER is false
10/23 16:41:10 slot2: Changing state: Owner -> Unclaimed
10/23 16:41:10 slot2: State change: START_BACKFILL is TRUE
10/23 16:41:10 slot2: Changing state: Unclaimed -> Backfill
10/23 16:41:10 Return from HandleReq <command_release_claim>
10/23 16:41:12 Calling Timer handler 4834 (dc_touch_log_file)
10/23 16:41:12 Return from Timer handler 4834 (dc_touch_log_file)
10/23 16:41:13 Calling Timer handler 9 (poll_resources)
10/23 16:41:13 State change: BOINC client running for slot2
10/23 16:41:13 slot2: Changing activity: Idle -> Busy
and the Starter log for slot 2 shows,
10/23 16:40:38 ********* STARTER terminating normally **********
10/23 16:41:00 ******************************************************
10/23 16:41:00 ** condor_starter (CONDOR_STARTER) STARTING UP
10/23 16:41:00 ** /usr/sbin/condor_starter
10/23 16:41:00 ** $CondorVersion: 6.9.4 Aug 30 2007 $
10/23 16:41:00 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
10/23 16:41:00 ** PID = 31849
10/23 16:41:00 ** Log last touched 10/23 16:40:38
10/23 16:41:00 ******************************************************
10/23 16:41:00 Using config source: /usr1/condor/condor_config
10/23 16:41:00 Using local config sources:
10/23 16:41:00 /usr1/condor/condor_config.local
10/23 16:41:00 DaemonCore: Command Socket at <10.14.2.12:54063>
10/23 16:41:00 Done setting resource limits
10/23 16:41:00 Communicating with shadow <10.14.0.18:47705>
10/23 16:41:00 Submitting machine is "ldas-pcdev1.ligo.caltech.edu"
10/23 16:41:00 setting the orig job name in starter
10/23 16:41:00 setting the orig job iwd in starter
10/23 16:41:00 Job 416050.0 set to execute immediately
10/23 16:41:00 Starting a VANILLA universe job with ID: 416050.0
10/23 16:41:00 IWD: /mnt/qfs1/anderson
10/23 16:41:00 Output file: /mnt/qfs1/anderson/.condor_out.21372
10/23 16:41:00 Error file: /mnt/qfs1/anderson/.condor_error.21372
10/23 16:41:00 Renice expr "0" evaluated to 0
10/23 16:41:00 About to exec /mnt/qfs1/anderson/.condor_run.21372
10/23 16:41:00 Create_Process succeeded, pid=31850
10/23 16:41:08 Process exited, pid=31850, status=0
10/23 16:41:10 Got SIGQUIT. Performing fast shutdown.
10/23 16:41:10 ShutdownFast all jobs.
10/23 16:41:10 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0
10/23 16:41:25 ********** STARTER starting up ***********
The question is what happened to stdout for this job?
My first guess is that this has to do with a race condition for
the submit machine trying to read the contents of the temporary
.condor_out.$$ file before the execute machine has finished
flushing its output to that file. Is this possible?
Thanks.
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date of creation: Tue Oct 23 19:04:49 2007 (1193184293)
Subject: Actions
Assigned to zmiller by zmiller
===========================================================================
Date of actions: Wed Oct 24 6:43:23 2007 (1193226204)
Date: Wed, 24 Oct 2007 06:54:13 -0500
From: Zachary Miller <zmiller__AT__cs.wisc.edu>
To: zmiller <condor-admin__AT__cs.wisc.edu>
Subject: Re: [condor-admin #17136] LIGO: condor_run intermittently
returning NULL results
>
> The question is what happened to stdout for this job?
indeed. i'll look into it. i think your hunch is correct in that it's the
submit machine doing something wrong, probably the condor_run script itself.
thanks for the report, i'll let you know what i find.
cheers,
-zach
===========================================================================
Date mail was appended: Wed Oct 24 6:54:21 2007 (1193226861)
Date: Wed, 24 Oct 2007 09:34:43 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: skoranda__AT__gravity.phys.uwm.edu
Subject: Re: [condor-admin #17136] LIGO: condor_run intermittently
returning NULL results
X-Seen-BY: mailfromd 4.1 granite.cs.wisc.edu
On Wed, Oct 24, 2007 at 06:54:21AM -0500, condor-admin response tracking system wrote:
> >
> > The question is what happened to stdout for this job?
>
> indeed. i'll look into it. i think your hunch is correct in that it's the
> submit machine doing something wrong, probably the condor_run script itself.
>
> thanks for the report, i'll let you know what i find.
>
Looking at the timestamps of the .condor_* files created it does appear
that the problem is due to the relative timing of changes to the .condor_log
and .condor_out files generated by condor_run. The Shadow writes the .log file
locally on the submit machine and assumes that that the .out file is instantly
ready to be read on the same machine when the .log file indicates job
completion, even though the .out file is potentially written to a different
file server from a different execute machine. This race condition in
synchronizing the metadata for a shared filesystem can be avoided with
the following patch which also allows condor_run to work in more environments,
by removing the requirement of having a shared filesystem. Do you agree and
what do you think about incorporating this patch?
Thanks.
$ diff -Naur /usr/bin/condor_run ./condor_run
--- /usr/bin/condor_run 2007-08-30 13:39:04.000000000 -0700
+++ ./condor_run 2007-10-24 09:26:07.119881000 -0700
@@ -136,6 +136,8 @@
print JDF "executable = .condor_run.$$\n";
print JDF "initialdir = $pwd\n";
print JDF "notification = NEVER\n";
+print JDF "should_transfer_files = YES\n";
+print JDF "when_to_transfer_output = ON_EXIT\n";
print JDF "log = .condor_log.$$\n";
print JDF "output = .condor_out.$$\n";
print JDF "error = .condor_error.$$\n";
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date mail was appended: Wed Oct 24 11:34:58 2007 (1193243699)
Date: Sat, 10 Nov 2007 12:08:02 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: skoranda__AT__gravity.phys.uwm.edu, Shourov Chatterji <shourov__AT__ligo.mit.edu>
Subject: Re: [condor-admin #17136] LIGO: condor_run intermittently
returning NULL results
X-Seen-BY: mailfromd 4.1 silica.cs.wisc.edu
Upon further investigation the change in semantics to condor_run by adding
should_transfer_files is too significant of a change. I would like to
counter propose the following solution based on acquiring a file lock
for the .out file in analogy to what is done with the .log file.
In particular, if the starter acquires an exclusive file lock on the .out
file and the condor_run script blocks until it can acquire a lock on the
same file the race condition is removed without any change in behavior.
The following example is a proof by example that works on our Linux cluster:
# diff -Naur /usr/bin/condor_run ./condor_run
--- /usr/bin/condor_run 2007-08-30 13:39:04.000000000 -0700
+++ ./condor_run 2007-11-10 11:53:50.649938000 -0800
@@ -124,6 +124,7 @@
print CMD $arg, " ";
}
print CMD "\n";
+print CMD "flock .condor_out.$$ /bin/true\n";
close(CMD) ||
&abort("Failed to write temporary (CMD) file in current directory.\n");
# fix the permissions to something nice.
@@ -190,6 +191,7 @@
# display stdout to the terminal
open(OUTPUT, "<.condor_out.$$") || &abort("Failed to open job output file.\n");
+system "flock .condor_out.$$ /bin/true";
while (<OUTPUT>) {
print STDOUT;
}
If this basic idea is accepted, a cleaner solution would be to use the
perl flock module for acquiring lock on the submit machine and possibly to
add another knob, e.g., ENABLE_USEROUT_LOCKING, that does the equivalent
for user output files that ENABLE_USERLOG_LOCKING does for user log files,
i.e., have the starter acquire a file lock on the .out file when it is
opened on the execute machine and let it be freed when the file is closed.
Thanks.
On Wed, Oct 24, 2007 at 09:34:43AM -0700, Stuart Anderson wrote:
> On Wed, Oct 24, 2007 at 06:54:21AM -0500, condor-admin response tracking system wrote:
> > >
> > > The question is what happened to stdout for this job?
> >
> > indeed. i'll look into it. i think your hunch is correct in that it's the
> > submit machine doing something wrong, probably the condor_run script itself.
> >
> > thanks for the report, i'll let you know what i find.
> >
>
> Looking at the timestamps of the .condor_* files created it does appear
> that the problem is due to the relative timing of changes to the .condor_log
> and .condor_out files generated by condor_run. The Shadow writes the .log file
> locally on the submit machine and assumes that that the .out file is instantly
> ready to be read on the same machine when the .log file indicates job
> completion, even though the .out file is potentially written to a different
> file server from a different execute machine. This race condition in
> synchronizing the metadata for a shared filesystem can be avoided with
> the following patch which also allows condor_run to work in more environments,
> by removing the requirement of having a shared filesystem. Do you agree and
> what do you think about incorporating this patch?
>
> Thanks.
>
>
> $ diff -Naur /usr/bin/condor_run ./condor_run
> --- /usr/bin/condor_run 2007-08-30 13:39:04.000000000 -0700
> +++ ./condor_run 2007-10-24 09:26:07.119881000 -0700
> @@ -136,6 +136,8 @@
> print JDF "executable = .condor_run.$$\n";
> print JDF "initialdir = $pwd\n";
> print JDF "notification = NEVER\n";
> +print JDF "should_transfer_files = YES\n";
> +print JDF "when_to_transfer_output = ON_EXIT\n";
> print JDF "log = .condor_log.$$\n";
> print JDF "output = .condor_out.$$\n";
> print JDF "error = .condor_error.$$\n";
>
>
> --
> Stuart Anderson anderson__AT__ligo.caltech.edu
> http://www.ligo.caltech.edu/~anderson
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date mail was appended: Sat Nov 10 14:08:23 2007 (1194725304)