LIGO Support Ticket 15277
Ticket Information
Number: admin 15277
User: anderson@ligo.caltech.edu
Email: espinoza_e__AT__ligo.caltech.edu,duncan__AT__gravity.phys.uwm.edu
Status: open
Assigned To: adesmet
Date: Fri, 6 Apr 2007 12:37:55 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu, wenger__AT__cs.wisc.edu
CC: Erik Espinoza <espinoza_e__AT__ligo.caltech.edu>, Brown Duncan
<duncan__AT__gravity.phys.uwm.edu>
Subject: LIGO DAGMan spool directory efficiency
This is a request to understand and possibly enhance the efficiency of the
Condor SPOOL directory use by DAGMan.
We recently had schedd exit with status 44--presumably due to the SPOOL
directory filling up even though there was 15GByte free when we looked
at it ([condor-support #1943]). Since each node in a large DAG creates
a separate copy of a potentially large executable in the SPOOL directory
at submit time, the hypothesis is that somebody submitted a large DAG that
filled up the filesystem and then failed/cleaned up after itself.
We have also observed (through unscientific random sampling) that the
schedd is often reporting via top that it is in the D state, presumably
waiting for I/O to the SPOOL directory as it spools lots of copies
of the same executables from our predominantly DAGMan users.
Therefore, for performance, both speed and scalability, would it be
possible for DAGMan jobs to more efficiently use the SPOOL directory?
For example, is it possible for DAGMan to automatically submit jobs as
clusters where appropriate? Alternatively, if the same executable
is already cached in the SPOOL directory, what about creating a
hard link for the new job id rather than making another copy?
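
As a rough illustration of the hard-link idea only (this is not how the
schedd works): the sketch below keeps one real copy of each distinct
executable, keyed by its SHA-256 digest, and gives every job a hard link
to it. The "cache" subdirectory, the "<job_id>.exe" naming, and the
function names are all hypothetical, and hard links require the cache
and the per-job names to live on the same filesystem.

```python
import hashlib
import os
import shutil


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def spool_executable(src, spool_dir, job_id):
    """Place an executable for job_id in spool_dir, sharing data by hard link.

    The layout used here (a "cache" subdirectory keyed by digest, plus one
    "<job_id>.exe" name per job) is hypothetical, not Condor's.
    """
    cache_dir = os.path.join(spool_dir, "cache")
    os.makedirs(cache_dir, exist_ok=True)
    cached = os.path.join(cache_dir, file_digest(src))
    if not os.path.exists(cached):
        # First job with this content: make the one real copy.
        shutil.copy2(src, cached)
    job_path = os.path.join(spool_dir, f"{job_id}.exe")
    # Later jobs get a new directory entry, not new data blocks.
    os.link(cached, job_path)
    return job_path
```

With this layout, removing a job's link when the job leaves the queue
frees no space until the cache entry itself is removed, so a real
implementation would also need to clean up cache entries whose link
count has dropped to one.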
Thanks.
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date of creation: Fri Apr 6 14:38:18 2007 (1175888300)
Subject: Actions
Assigned to adesmet by adesmet
===========================================================================
Date of actions: Tue Apr 10 14:52:57 2007 (1176234777)
Date: Tue, 10 Apr 2007 16:20:20 -0500
From: Alan De Smet <adesmet__AT__cs.wisc.edu>
To: adesmet <condor-admin__AT__cs.wisc.edu>
Subject: Re: [condor-admin #15277] LIGO DAGMan spool directory efficiency
> This is a request to understand and possibly enhance the
> efficiency of the Condor SPOOL directory use by DAGMan.
To be clear, DAGMan itself shouldn't be abusing your SPOOL
directory. Indeed, DAGMan itself shouldn't be putting anything
into your SPOOL. You're talking about the jobs that DAGMan is
submitting on your behalf. Presumably your problem is that
Condor defaults to copying those jobs' binaries into the SPOOL
directory when they are submitted.
The most direct solution is to add "copy_to_spool=FALSE" to your
submit files. This will eliminate the copying of the executable
file to the spool. This does mean that you need to keep your
executable in place and visible to Condor at the path that
"executable=" specifies in your submit files. (Starting with
Condor 6.9.1, this is the default.)
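
For sites with many existing submit files, the change could be applied
mechanically. The sketch below is a hypothetical helper, not a Condor
tool: it rewrites an existing copy_to_spool line, or inserts the setting
before the first "queue" statement. It assumes one command per line and
matches command names case-insensitively.

```python
import re


def disable_copy_to_spool(submit_text):
    """Return submit_text with copy_to_spool set to FALSE.

    An existing copy_to_spool line is rewritten; otherwise the setting is
    inserted before the first queue statement (or appended if there is none).
    """
    setting = "copy_to_spool = FALSE"
    key = re.compile(r"^\s*copy_to_spool\s*=", re.IGNORECASE)
    queue = re.compile(r"^\s*queue\b", re.IGNORECASE)
    lines = submit_text.splitlines()
    if any(key.match(line) for line in lines):
        lines = [setting if key.match(line) else line for line in lines]
    else:
        # Index of the first queue statement, or end of file if none.
        idx = next(
            (i for i, line in enumerate(lines) if queue.match(line)),
            len(lines),
        )
        lines.insert(idx, setting)
    return "\n".join(lines) + "\n"
```

Running the helper twice on the same text gives the same result as
running it once, so it is safe to apply to files that have already been
converted.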
Having DAGMan submit jobs as clusters would be a serious change
to DAGMan's implementation for relatively minor benefits. You'll
get a much larger benefit from copy_to_spool=FALSE. I'll note it
in our list of ideas, but I don't expect we'll implement it.
Having the schedd use hard links is an interesting idea and would
benefit a wider variety of users. Unfortunately it's also a
pretty serious change, and seems less useful since copy_to_spool
defaults to FALSE from 6.9.1 and onward. Again, I'll note it in
our list, but I'm doubtful that we'll implement it.
===========================================================================
Date mail was appended: Tue Apr 10 16:20:28 2007 (1176240029)
Date: Tue, 10 Apr 2007 21:51:19 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: espinoza_e__AT__ligo.caltech.edu, duncan__AT__gravity.phys.uwm.edu
Subject: Re: [condor-admin #15277] LIGO DAGMan spool directory efficiency
On Tue, Apr 10, 2007 at 04:20:28PM -0500, condor-admin response tracking system wrote:
> > This is a request to understand and possibly enhance the
> > efficiency of the Condor SPOOL directory use by DAGMan.
>
> To be clear, DAGMan itself shouldn't be abusing your SPOOL
> directory. Indeed, DAGMan itself shouldn't be putting anything
> into your SPOOL. You're talking about the jobs that DAGMan is
> submitting on your behalf. Presumably your problem is that
> Condor defaults to copying those jobs' binaries into the SPOOL
> directory when they are submitted.
Understood. However, we use DAGMan to manage the majority of our Condor jobs.
>
> The most direct solution is to add "copy_to_spool=FALSE" to your
> submit files. This will eliminate the copying of the executable
> file to the spool. This does mean that you need to keep your
> executable in place and visible to Condor at the path that
> "executable=" specifies in your submit files. (Starting with
> Condor 6.9.1, this is the default.)
>
> Having DAGMan submit jobs as clusters would be a serious change
> to DAGMan's implementation for relatively minor benefits. You'll
> get a much larger benefit from copy_to_spool=FALSE. I'll note it
> in our list of ideas, but I don't expect we'll implement it.
>
> Having the schedd use hard links is an interesting idea and would
> benefit a wider variety of users. Unfortunately it's also a
> pretty serious change, and seems less useful since copy_to_spool
> defaults to FALSE from 6.9.1 and onward. Again, I'll note it in
> our list, but I'm doubtful that we'll implement it.
The problem with "copy_to_spool=False" is that many of our users are doing
active development on their code, and so may very well be compiling a
new version while old jobs are still in the queue or actively running.
As I understand it, this would result in users not being sure which
version actually ran, or in a bus error/segfault as the executable is
changed out from under running instances due to the wonders of demand
paging on *nix. Therefore, we do want to take advantage of Condor's
ability to spool/cache executables at job submit time so each user
does not have to handle this extra bookkeeping.
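
To make the "extra bookkeeping" concrete: without spooling, each user
would have to do something like the following before every submit. This
is a hypothetical user-side sketch, not something proposed in this
ticket. It copies the freshly built executable to a file named after its
content, and the submit file's "executable=" would then point at the
returned path, so a later recompile cannot change what queued or running
jobs see. The function name and the snapshot directory are assumptions.

```python
import hashlib
import os
import shutil


def snapshot_executable(build_path, snapshot_dir):
    """Copy build_path to a content-named file and return the snapshot path."""
    h = hashlib.sha256()
    with open(build_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    os.makedirs(snapshot_dir, exist_ok=True)
    # Name the snapshot after its content so each build gets its own file.
    name = f"{os.path.basename(build_path)}.{h.hexdigest()[:16]}"
    snapshot = os.path.join(snapshot_dir, name)
    if not os.path.exists(snapshot):
        # Copy to a temporary name, then rename, so a partially written
        # snapshot is never visible under the final name.
        tmp = snapshot + ".tmp"
        shutil.copy2(build_path, tmp)
        os.replace(tmp, snapshot)
    return snapshot
```

Identical builds map to the same snapshot, so resubmitting an unchanged
executable costs no extra disk space, but old snapshots still have to be
cleaned up by hand once no job refers to them.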
I am surprised that you are changing the copy_to_spool default value.
If I understand this correctly, that will break the powerful Condor paradigm
of "shoot and forget". With this change you now have to remember to check the
queue before you ever recompile any of your executables that have been
submitted to Condor. In my opinion, it still makes more sense to put additional
effort into optimizing the use of the spool directory rather than disabling it.
Thanks.
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date mail was appended: Tue Apr 10 23:51:35 2007 (1176267096)
Subject: Comments added
This is tracked in GNATs as 828
http://condor-bugs.cs.wisc.edu/cgi-bin/gnats/gnatsweb.pl?cmd=view%20audit-trail&database=condor&pr=828
Comments added by adesmet
===========================================================================
Date comments were added: Fri Feb 1 11:36:05 2008 (1201887365)