LIGO Support Ticket 17219

Ticket Information
  Number:      admin 17219
  User:        anderson@ligo.caltech.edu
  Email:       skoranda__AT__gravity.phys.uwm.edu
  Status:      open
  Assigned To: jfrey
Date: Thu, 15 Nov 2007 14:55:42 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
CC: Scott Koranda <skoranda__AT__gravity.phys.uwm.edu>
Subject: LIGO: stdout occasionally lost for jobmanager-condor
X-Seen-BY: mailfromd 4.1 silica.cs.wisc.edu

The following command run against the LIGO condor pool at CIT
(condor 6.9.5 and VDT 1.6.1e) occasionally loses stdout. I suspect
this is another instance of the race condition in [condor-admin #17136].

-bash-3.00$ globus-job-run ldas-grid.ligo.caltech.edu/jobmanager-condor /bin/hostname
node239
-bash-3.00$ globus-job-run ldas-grid.ligo.caltech.edu/jobmanager-condor /bin/hostname
-bash-3.00$ 

Please confirm that the globus-gatekeeper makes the same implicit assumption
as an unpatched condor_run script: that the OUT file for a condor job
written on an execute machine is instantly visible on a submit machine
once the Shadow process on the submit machine locally creates an entry
in the LOG file indicating that the job is done on the execute machine.
If this is true, perhaps a fix similar to the proposed condor_run patch
could be implemented for submitting jobs to a condor pool via globus
as well.


Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date of creation: Thu Nov 15 16:56:01 2007 (1195167364)
Subject: Actions

Assigned to jfrey by gthain
===========================================================================
Date of actions: Fri Nov 16  8:57:20 2007 (1195225040)
From: Jaime Frey <jfrey__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #17219] LIGO: stdout occasionally lost for     
 jobmanager-condor
Date: Fri, 1 Feb 2008 11:39:02 -0600

> The following command run against the LIGO condor pool at CIT
> (condor 6.9.5 and VDT 1.6.1e) occasionally loses stdout. I suspect
> this is another instance of the race condition in [condor-admin #17136].
>
> -bash-3.00$ globus-job-run ldas-grid.ligo.caltech.edu/jobmanager-condor /bin/hostname
> node239
> -bash-3.00$ globus-job-run ldas-grid.ligo.caltech.edu/jobmanager-condor /bin/hostname
> -bash-3.00$
>
> Please confirm that the globus-gatekeeper makes the same implicit assumption
> as an unpatched condor_run script: that the OUT file for a condor job
> written on an execute machine is instantly visible on a submit machine
> once the Shadow process on the submit machine locally creates an entry
> in the LOG file indicating that the job is done on the execute machine.
> If this is true, perhaps a fix similar to the proposed condor_run patch
> could be implemented for submitting jobs to a condor pool via globus
> as well.


I apologize for the late response.

The globus jobmanager does assume that the job's output files are  
available the moment it sees the terminate event in the condor user  
log. It does attempt to prevent nfs from using cached data on the  
gatekeeper machine before reading the files. You can see what it does  
in the nfssync() and stage_out() routines in JobManager.pm.

Thanks and regards,
Jaime Frey
UW-Madison Condor Team



===========================================================================
Date mail was appended: Fri Feb  1 11:39:15 2008 (1201887556)
Subject: Actions

Status changed from open to pending by jfrey
===========================================================================
Date of actions: Fri Feb  1 11:39:15 2008 (1201887557)
Date: Fri, 1 Feb 2008 14:16:16 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: skoranda__AT__gravity.phys.uwm.edu
Subject: Re: [condor-admin #17219] LIGO: stdout occasionally lost for  
 jobmanager-condor
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu

On Fri, Feb 01, 2008 at 11:39:15AM -0600, condor-admin response tracking system wrote:
> 
> The globus jobmanager does assume that the job's output files are  
> available the moment it sees the terminate event in the condor user  
> log. It does attempt to prevent nfs from using cached data on the  
> gatekeeper machine before reading the files. You can see what it does  
> in the nfssync() and stage_out() routines in JobManager.pm.
> 

Jaime,

The particular filesystem where I can reproduce this problem is QFS,
which is a distributed filesystem rather than a shared filesystem
like NFS. The distinction, if I have the terminology right, is that
QFS has multiple servers talking directly to the underlying storage
devices rather than a single server as is the common case with NFS.
In this architecture the relative timing of metadata synchronization
between the multiple filesystem servers can be controlled by other knobs.
However, my first choice is to see if there is a more general
application-level synchronization solution to this problem rather than
tuning the filesystem metadata synchronization knobs to minimize, or
possibly eliminate, this race condition for just one filesystem type.
I am also concerned that tuning QFS metadata synchronization to minimize
this latency will have an adverse performance impact on other
filesystem activity, i.e., basically reduce the system to the
same performance level as if there were just one file server.

Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Fri Feb  1 16:16:35 2008 (1201904196)
From: Jaime Frey <jfrey__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #17219] LIGO: stdout occasionally lost for    
 jobmanager-condor
Date: Mon, 4 Feb 2008 10:26:56 -0600

>> The globus jobmanager does assume that the job's output files are
>> available the moment it sees the terminate event in the condor user
>> log. It does attempt to prevent nfs from using cached data on the
>> gatekeeper machine before reading the files. You can see what it does
>> in the nfssync() and stage_out() routines in JobManager.pm.
>
> The particular filesystem where I can reproduce this problem is QFS,
> which is a distributed filesystem rather than a shared filesystem
> like NFS. The distinction, if I have the terminology right, is that
> QFS has multiple servers talking directly to the underlying storage
> devices rather than a single server as is the common case with NFS.
> In this architecture the relative timing of metadata synchronization
> between the multiple filesystem servers can be controlled by other knobs.
> However, my first choice is to see if there is a more general
> application-level synchronization solution to this problem rather than
> tuning the filesystem metadata synchronization knobs to minimize, or
> possibly eliminate, this race condition for just one filesystem type.
> I am also concerned that tuning QFS metadata synchronization to minimize
> this latency will have an adverse performance impact on other
> filesystem activity, i.e., basically reduce the system to the
> same performance level as if there were just one file server.


I'm not sure what application-level synchronization can be done that
isn't specific to the shared filesystem in use. I don't know anything
about QFS, so I don't know what its cache coherency behavior is.

If all data for a given directory (directory contents, file contents
and file metadata) is always coherent with each other, this approach
may work:
Wrap all jobs with a script that writes out a file in the job's
working directory after the job exits. In the poll() function in GRAM's
condor.pm Perl module on the gatekeeper machine, when the user log
says the job is complete, check for the existence of this extra file.
If it's missing, report the job as still running.

The job's stdout/err are in a different directory, so this wrapper
script may have to create two files on job exit. Also, this would only
work for pre-WS GRAM. WS GRAM uses different code to monitor the job's
status.
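
A minimal sketch of the flag-file idea, written in Python purely for
illustration (the real poll() routine is Perl, and nothing here is taken
from condor.pm). The names DONE_FLAG, run_wrapped and poll_status are
hypothetical, and the sketch assumes that a file created after the job
exits becomes visible on the gatekeeper no earlier than the job's output.

```python
import os
import subprocess

# Hypothetical marker name; any name the job itself never writes would do.
DONE_FLAG = ".job_done"


def run_wrapped(argv, workdir):
    """Execute side: run the job, then drop a flag file in workdir."""
    rc = subprocess.call(argv, cwd=workdir)
    flag = os.path.join(workdir, DONE_FLAG)
    with open(flag, "w") as f:
        f.write(str(rc))
        f.flush()
        os.fsync(f.fileno())  # push the flag to storage before returning
    return rc


def poll_status(log_says_done, workdir):
    """Gatekeeper side: report DONE only when the user log and the flag agree."""
    if not log_says_done:
        return "RUNNING"
    if os.path.exists(os.path.join(workdir, DONE_FLAG)):
        return "DONE"
    # Terminate event seen, but the flag is not visible here yet.
    return "RUNNING"
```

If stdout/err live in a different directory, the same wrapper would have
to write a second flag there and poll_status would check both.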

Thanks and regards,
Jaime Frey
UW-Madison Condor Team



===========================================================================
Date mail was appended: Mon Feb  4 10:27:06 2008 (1202142428)
Subject: Actions

Status changed from open to pending by jfrey
===========================================================================
Date of actions: Mon Feb  4 10:27:06 2008 (1202142429)
Date: Mon, 4 Feb 2008 20:25:17 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: skoranda__AT__gravity.phys.uwm.edu
Subject: Re: [condor-admin #17219] LIGO: stdout occasionally lost for  
 jobmanager-condor
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu

On Mon, Feb 04, 2008 at 10:27:06AM -0600, condor-admin response tracking system wrote:
> 
> I'm not sure what application-level synchronization can be done that
> isn't specific to the shared filesystem in use. I don't know anything
> about QFS, so I don't know what its cache coherency behavior is.
>
> If all data for a given directory (directory contents, file contents
> and file metadata) is always coherent with each other, this approach
> may work:
> Wrap all jobs with a script that writes out a file in the job's
> working directory after the job exits. In the poll() function in GRAM's
> condor.pm Perl module on the gatekeeper machine, when the user log
> says the job is complete, check for the existence of this extra file.
> If it's missing, report the job as still running.
>
> The job's stdout/err are in a different directory, so this wrapper
> script may have to create two files on job exit. Also, this would only
> work for pre-WS GRAM. WS GRAM uses different code to monitor the job's
> status.

I suspect a flag file would work. One could also try using some other file
attribute to determine when a job is done and the appropriate filesystem
metadata has been synchronized to the submit machine, e.g., set the execute
bit with chmod() while a job is running and clear it when the job is done on
the execute machine.
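
A rough sketch of the execute-bit idea, in Python for illustration only.
The function names are hypothetical, and it is an untested assumption that
a mode change reaches the submit machine no earlier than the file data it
is meant to guard.

```python
import os
import stat


def mark_running(path):
    """Execute side, at job start: set the owner execute bit on the file."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR)


def mark_done(path):
    """Execute side, as the last step: clear the owner execute bit."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode & ~stat.S_IXUSR)


def is_done(path):
    """Submit side: done only if the file is visible and the bit is clear."""
    try:
        mode = os.stat(path).st_mode
    except FileNotFoundError:
        return False  # file not visible on this machine yet
    return not (mode & stat.S_IXUSR)
```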

Another possibility might be to use a file lock on the stdout file as proposed
in ticket 17136 for the equivalent condor_run issue.

I haven't tested this, but a final rename() might also be a sufficient
metadata synchronization point, i.e., have the execute machine rename one
or more files as its final step, and then the submit machine would look for
the final names as a last step before processing them.
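
A sketch of the rename() idea, in Python for illustration. The function
names and the ".partial" naming are hypothetical, and whether rename() is
a sufficient synchronization point on QFS is exactly the untested
assumption.

```python
import os
import time


def publish_output(tmp_path, final_path):
    """Execute side: flush the data, then rename as the very last step."""
    with open(tmp_path, "ab") as f:
        os.fsync(f.fileno())  # make sure the contents reach storage first
    os.rename(tmp_path, final_path)


def wait_for_output(final_path, timeout=30.0, interval=0.5):
    """Submit side: wait until the final name is visible, or give up."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(final_path):
            return True
        time.sleep(interval)
    return os.path.exists(final_path)
```

The job would write to something like stdout.partial, and the submit
machine would only ever open the final name.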

Another random idea to consider is in-band flow control, i.e.,
always append a statement like "job XYZ has finished" to one of the files,
and have the submit machine block until this appears in the file and
then strip it off.
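
A sketch of the in-band marker idea, in Python for illustration. The
marker text and the function names are hypothetical, and the sketch
assumes the job's own output never ends with the marker line.

```python
import time


def end_marker(job_id):
    """Trailer line appended by the execute side (format is hypothetical)."""
    return "### job %s has finished ###\n" % job_id


def append_marker(path, job_id):
    """Execute side: append the trailer after the job has written its output."""
    with open(path, "a") as f:
        f.write(end_marker(job_id))


def read_when_complete(path, job_id, timeout=30.0, interval=0.5):
    """Submit side: block until the trailer appears, then strip it off.

    Returns the output without the trailer, or None on timeout.
    """
    marker = end_marker(job_id)
    deadline = time.monotonic() + timeout
    while True:
        try:
            with open(path) as f:
                data = f.read()
        except FileNotFoundError:
            data = ""  # file not visible on this machine yet
        if data.endswith(marker):
            return data[:-len(marker)]
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)
```

Unlike the flag-file and rename variants, this checks the file contents
themselves, so it does not depend on directory metadata being coherent
with file data.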

Thanks.


-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Mon Feb  4 22:25:34 2008 (1202185535)