LIGO Support Ticket 18816

Ticket Information
  Number:      admin 18816
  User:        anderson@ligo.caltech.edu
  Email:       condorligo__AT__aei.mpg.de
  Status:      resolved
  Assigned To: wenger
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: LIGO: Ability to upgrade active Condor pools running nested DAGs
Date: Mon, 8 Dec 2008 20:59:16 -0800
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu

As discussed in a recent Condor-LIGO meeting, it is desirable to be  
able to upgrade an active Condor pool (either upgrading just the  
DAGMan binaries, or a full daemon upgrade) that has nested DAGs in the  
queue without disrupting their proper execution.

As of Gnats PR 959 and the 7.2 release this should be robust for the  
case of non-nested DAGs since copy_to_spool will be set to true in  
the .condor.sub file. However, the full fix for the case of nested  
DAGs requires the planned lazy .condor.sub creation enhancement.

This ticket is a place holder to keep track of this request.

Thanks.

--
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson




===========================================================================
Date of creation: Mon Dec  8 22:59:26 2008 (1228798768)
Subject: Actions

Assigned to psilord by bt
===========================================================================
Date of actions: Mon Dec  8  8:45:57 2008 (1228835582)
Subject: Actions

Assigned to wenger by wenger
===========================================================================
Date of actions: Fri Dec 19 11:20:21 2008 (1229707221)
Date: Fri, 19 Dec 2008 11:26:03 -0600 (CST)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: wenger <condor-admin__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-admin #18816] LIGO: Ability to upgrade active Condor
 pools running nested DAGs

Stuart,

> As discussed in a recent Condor-LIGO meeting, it is desirable to be
> able to upgrade an active Condor pool (either upgrading just the
> DAGMan binaries, or a full daemon upgrade) that has nested DAGs in the
> queue without disrupting their proper execution.
>
> As of Gnats PR 959 and the 7.2 release this should be robust for the
> case of non-nested DAGs since copy_to_spool will be set to true in
> the .condor.sub file. However, the full fix for the case of nested
> DAGs requires the planned lazy .condor.sub creation enhancement.
>
> This ticket is a place holder to keep track of this request.

Just an update to the ticket's status here -- the copy_to_spool = true fix 
is in 7.2.0.  (This is *not* in the 7.2.0 pre-release you have, though.)

The full fix will come once we can defer generating the .condor.sub
files for nested DAGs until just before the subdag is actually
submitted.

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Fri Dec 19 11:26:09 2008 (1229707570)
CC: Condor/LIGO mailing list <condorligo__AT__aei.mpg.de>
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #18816] LIGO: Ability to upgrade active Condor
 pools running nested DAGs
Date: Fri, 2 Jan 2009 21:21:24 -0800
X-Seen-BY: mailfromd 4.1 obsidian.cs.wisc.edu


On Dec 19, 2008, at 9:26 AM, condor-admin response tracking system  
wrote:

> Stuart,
>
>> As discussed in a recent Condor-LIGO meeting, it is desirable to be
>> able to upgrade an active Condor pool (either upgrading just the
>> DAGMan binaries, or a full daemon upgrade) that has nested DAGs in  
>> the
>> queue without disrupting their proper execution.
>>
>> As of Gnats PR 959 and the 7.2 release this should be robust for the
>> case of non-nested DAGs since copy_to_spool will be set to true in
>> the .condor.sub file. However, the full fix for the case of nested
>> DAGs requires the planned lazy .condor.sub creation enhancement.
>>
>> This ticket is a place holder to keep track of this request.
>
> Just an update to the ticket's status here -- the copy_to_spool =  
> true fix
> is in 7.2.0.  (This is *not* in the 7.2.0 pre-release you have,  
> though.)
>
> The full fix to this problem will be when we are able to defer  
> generating
> the .condor.sub files for nested DAGs until just before the subdag is
> actually submitted.

Kent,
	I just upgraded the LIGO CIT cluster to 7.2.0 and noticed that this  
version of condor_submit_dag is schizophrenic and sets copy_to_spool  
to both True and False. The routine writeSubmitFile() in  
condor_submit_dag.cpp includes the following:

     fprintf(pSubFile, "universe\t= scheduler\n");
     fprintf(pSubFile, "executable\t= %s\n", opts.strDagmanPath.Value());
     fprintf(pSubFile, "copy_to_spool\t= True\n");

...

     fprintf(pSubFile, "copy_to_spool\t= False\n" );


I have confirmed that this actually shows up in the generated submit  
file for a simple test dag. Unfortunately, I suspect the later  
fprintf() takes precedence, i.e., the code change is a NOP?
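
The duplicated entry described above can be caught mechanically. The following is a minimal sketch, not part of Condor: collectAssignments() and duplicatedKeys() are hypothetical helper names, and the parsing is a simplification that treats every "key = value" line as an assignment. It lists keys that are assigned more than once in a generated submit file and records the values in file order, so the last one (the one suspected to win) is easy to inspect.

```cpp
#include <cassert>
#include <cctype>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Trim leading and trailing whitespace.
static std::string trim(const std::string& s) {
    size_t b = 0, e = s.size();
    while (b < e && std::isspace(static_cast<unsigned char>(s[b]))) ++b;
    while (e > b && std::isspace(static_cast<unsigned char>(s[e - 1]))) --e;
    return s.substr(b, e - b);
}

// Hypothetical helper: collect every value assigned to each key, in file
// order. Keys are lower-cased; lines without '=' and comment lines are skipped.
std::map<std::string, std::vector<std::string>>
collectAssignments(const std::string& text) {
    std::map<std::string, std::vector<std::string>> out;
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) {
        size_t eq = line.find('=');
        if (eq == std::string::npos) continue;
        std::string key = trim(line.substr(0, eq));
        if (key.empty() || key[0] == '#') continue;
        for (char& c : key)
            c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        out[key].push_back(trim(line.substr(eq + 1)));
    }
    return out;
}

// Hypothetical helper: names of keys that are assigned more than once.
std::vector<std::string> duplicatedKeys(const std::string& text) {
    std::vector<std::string> dups;
    for (const auto& kv : collectAssignments(text))
        if (kv.second.size() > 1) dups.push_back(kv.first);
    return dups;
}
```

Fed the text of a submit file written by the code quoted above, duplicatedKeys() would report copy_to_spool, and the last recorded value for that key would be False.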

Is there any record of why the False statement is in the previous code  
base, e.g., 7.0.5? Perhaps there was a good reason not to set this to  
True that we overlooked in attempting to solve the upgrade problem?

Out of curiosity will you be adding SUBMIT_DAGMAN_EXPRS (or similar)  
as part of the full fix, or is that not needed?

P.S. It looks like the ghost of Frank showed up to haunt us for  
Christmas :)

Thanks.


--
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson




===========================================================================
Date mail was appended: Fri Jan  2 23:21:41 2009 (1230960101)
Date: Mon, 5 Jan 2009 10:28:02 -0600 (CST)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: Stuart Anderson <anderson__AT__ligo.caltech.edu>
CC: condor-admin__AT__cs.wisc.edu, Condor/LIGO mailing list <condorligo__AT__aei.mpg.de>
Subject: Re: [CondorLIGO] Re: [condor-admin #18816] LIGO: Ability to
 upgrade active Condor pools running nested DAGs

On Fri, 2 Jan 2009, Stuart Anderson wrote:

> 	I just upgraded the LIGO CIT cluster to 7.2.0 and noticed that this 
> version of condor_submit_dag is schizophrenic and sets copy_to_spool to both 
> True and False. The routine writeSubmitFile() in condor_submit_dag.cpp 
> includes the following:
>
>   fprintf(pSubFile, "universe\t= scheduler\n");
>   fprintf(pSubFile, "executable\t= %s\n", opts.strDagmanPath.Value());
>   fprintf(pSubFile, "copy_to_spool\t= True\n");
>
> ...
>
>   fprintf(pSubFile, "copy_to_spool\t= False\n" );
>
>
> I have confirmed that this actually shows up in the generated submit file for 
> a simple test dag. Unfortunately, I suspect the later fprintf() takes 
> precedence, i.e., the code change is a NOP?
>
> Is there any record of why the False statement is in the previous code base, 
> e.g., 7.0.5? Perhaps there was a good reason not to set this to True that we 
> overlooked in attempting to solve the upgrade problem?
>
> Out of curiosity will you be adding SUBMIT_DAGMAN_EXPRS (or similar) as part 
> of the full fix, or is that not needed?
>
> P.S. It looks like the ghost of Frank showed up to haunt us for Christmas :)

Ah, hell, you're right!  I'll fix that and give you guys pre-release 7.2.1
binaries.

I think that copy_to_spool being false previously may just have been on 
the assumption that you wouldn't need to do that, and therefore we'd avoid 
wasted effort (not thinking about the "upgrade while running" issue).

Anyhow, I'll do some testing and make sure it doesn't cause problems.

As far as SUBMIT_DAGMAN_EXPRS goes, we already have -insert_sub_file and
-append for condor_submit_dag, but maybe it wouldn't hurt to also have 
something like SUBMIT_DAGMAN_EXPRS.  I don't think that needs to be part 
of the basic fix, though, and it probably should go into 7.3 as opposed to 
7.2.

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Mon Jan  5 10:28:07 2009 (1231172888)
Date: Mon, 5 Jan 2009 10:33:20 -0600 (CST)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-admin #18816] LIGO: Ability to upgrade active Condor
 pools running nested DAGs

Stuart,

Well, one of the other Condor people just talked to me about a case where 
we *don't* want the DAGMan binary copied to spool (submitting a large 
number of very small DAGs).

So I guess there needs to be something like a configuration macro for 
this...

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Mon Jan  5 10:33:22 2009 (1231173202)
Date: Fri, 9 Jan 2009 11:52:20 -0600 (CST)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-admin #18816] LIGO: Ability to upgrade active Condor
 pools running nested DAGs

Stuart,

I've generated 7.2.1 pre-release condor_dagman and condor_submit_dag 
binaries for you.  These have a new configuration macro 
(DAGMAN_COPY_TO_SPOOL) that controls whether the condor_dagman binary
is copied to the spool directory.  (This value defaults to false, so
you should set it to true in your configuration somewhere.)
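
As a rough illustration of the behavior described above, the sketch below emits exactly one copy_to_spool line, chosen by a single boolean that stands in for the DAGMAN_COPY_TO_SPOOL setting. buildSubmitHeader() is a hypothetical name, and how the real condor_submit_dag reads the macro is not shown; this is not the actual Condor code.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: build the first lines of a DAGMan submit file.
// 'copyToSpool' stands in for the value of DAGMAN_COPY_TO_SPOOL.
std::string buildSubmitHeader(const std::string& dagmanPath, bool copyToSpool) {
    std::string out;
    out += "universe\t= scheduler\n";
    out += "executable\t= " + dagmanPath + "\n";
    // Exactly one copy_to_spool line, so no later line can override it.
    out += std::string("copy_to_spool\t= ") +
           (copyToSpool ? "True" : "False") + "\n";
    return out;
}
```

With copyToSpool set to true, the generated text contains a single "copy_to_spool = True" line and no False line.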

The new binaries are available at:

ftp://ftp.cs.wisc.edu/condor/temporary/forligo/dagman-7.2.1-prerelease-2009-01-08/

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Fri Jan  9 11:52:23 2009 (1231523543)
CC: condorligo__AT__aei.mpg.de
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #18816] LIGO: Ability to upgrade active Condor
 pools running nested DAGs
Date: Fri, 9 Jan 2009 11:10:08 -0800
X-Seen-BY: mailfromd 4.1 obsidian.cs.wisc.edu

Thanks. I will give this a try.

When there is efficient job spool caching for duplicate executables,  
e.g.,
[condor-admin #15277] LIGO DAGMan spool directory efficiency
will you be changing this default setting and/or COPY_TO_SPOOL to  
True? If so, you might want to think about starting out this new knob  
with that setting. Just a random thought.

Thanks.

--
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson




===========================================================================
Date mail was appended: Fri Jan  9 13:10:18 2009 (1231528219)
Date: Fri, 9 Jan 2009 13:13:37 -0600 (CST)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: Stuart Anderson <anderson__AT__ligo.caltech.edu>
CC: condor-admin__AT__cs.wisc.edu, Condor/LIGO mailing list
 <condorligo__AT__aei.mpg.de>,
 "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [CondorLIGO] Re: [condor-admin #18816] LIGO: Ability to
 upgrade active Condor pools running nested DAGs

On Fri, 9 Jan 2009, Stuart Anderson wrote:

> When there is efficient job spool caching for duplicate executables, e.g.,
> [condor-admin #15277] LIGO DAGMan spool directory efficiency
> will you be changing this default setting and/or COPY_TO_SPOOL to True? If 
> so, you might want to think about starting out this new knob with that 
> setting. Just a random thought.

Well, usually when we add a new knob like this we make the default 
behavior whatever it was previously, which in this case is copy_to_spool = 
false.  It turns out that there's apparently more reason for it to be 
false than I originally realized, so I think that's an argument for the 
default being false.  We could change the default when the "multiple 
copies in the spool directory" issue is fixed.

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Fri Jan  9 13:13:40 2009 (1231528420)
CC: condorligo__AT__aei.mpg.de
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #18816] LIGO: Ability to upgrade active Condor
 pools running nested DAGs
Date: Fri, 9 Jan 2009 12:54:44 -0800
X-Seen-BY: mailfromd 4.1 silica.cs.wisc.edu

I can confirm that these binaries (at least the X86_64-LINUX_RHEL3)  
now properly create a single copy_to_spool entry in the generated  
condor submit file based on the setting of this new macro.

Kent,
	What other changes, if any, are in these pre-release binaries other  
than DAGMAN_COPY_TO_SPOOL?

For the LIGO sites I would like to go with these binaries and a  
setting of True for our upgrade on Monday.

Thanks.

--
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson




===========================================================================
Date mail was appended: Fri Jan  9 14:54:52 2009 (1231534492)
Date: Fri, 9 Jan 2009 15:02:44 -0600 (CST)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-admin #18816] LIGO: Ability to upgrade active Condor
 pools running nested DAGs

Stuart,

> I can confirm that these binaries (at least the X86_64-LINUX_RHEL3)
> now properly create a single copy_to_spool entry in the generated
> condor submit file based on the setting of this new macro.

Glad it works for you, too!

> 	What other changes, if any, are in these pre-release binaries other
> than DAGMAN_COPY_TO_SPOOL?

Compared to 7.2.0, the only other change besides the DAGMAN_COPY_TO_SPOOL 
stuff is that a small memory leak was fixed.

> For the LIGO sites I would like to go with these binaries and a
> setting of True for our upgrade on Monday.

Sounds good to me.

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Fri Jan  9 15:02:46 2009 (1231534966)
Date: Fri, 9 Jan 2009 15:23:06 -0600 (CST)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-admin #18816] LIGO: Ability to upgrade active Condor
 pools running nested DAGs

Stuart,

Here's a slightly better workaround:

   ident `which condor_submit_dag` | grep Condor

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Fri Jan  9 15:23:08 2009 (1231536188)
CC: Condor/LIGO mailing list <condorligo__AT__aei.mpg.de>
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #18816] LIGO: Ability to upgrade active Condor
 pools running nested DAGs
Date: Fri, 9 Jan 2009 16:59:23 -0800
X-Seen-BY: mailfromd 4.1 granite.cs.wisc.edu

To be pedantic, I also ran lsof to verify that this does exactly what  
we want, i.e., condor_dagman processes started after the configuration  
update have program text segments that are in the Condor spool  
directory and the bits that are in /usr/bin/dagman do not link back to  
any of these processes.
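
A similar check can be scripted. The sketch below does not use lsof; it swaps in the Linux-only /proc/<pid>/exe link to find the executable of a running process, then tests whether that path lies inside a given directory. Both function names are hypothetical, the spool directory must be supplied by the caller, and C++17 is assumed for std::filesystem.

```cpp
#include <cassert>
#include <filesystem>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// True if 'path' is 'dir' itself or lies inside it. The comparison is
// purely lexical, component by component; symlinks are not resolved.
bool isUnderDirectory(const fs::path& path, const fs::path& dir) {
    fs::path p = path.lexically_normal();
    fs::path d = dir.lexically_normal();
    auto pi = p.begin();
    for (auto di = d.begin(); di != d.end(); ++di) {
        if (di->empty()) continue;  // empty component left by a trailing '/'
        if (pi == p.end() || *pi != *di) return false;
        ++pi;
    }
    return true;
}

// Linux-only: resolve the executable of a running process through
// /proc/<pid>/exe. Returns an empty path if the link cannot be read.
fs::path executableOfPid(int pid) {
    std::error_code ec;
    fs::path target =
        fs::read_symlink("/proc/" + std::to_string(pid) + "/exe", ec);
    return ec ? fs::path() : target;
}
```

For each condor_dagman process started after the configuration update, isUnderDirectory(executableOfPid(pid), spoolDir) would be expected to return true, where spoolDir is the pool's spool directory.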

Thanks.

--
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson




===========================================================================
Date mail was appended: Fri Jan  9 18:59:47 2009 (1231549188)
Subject: Comments added

Stuart said we could close this ticket on LIGO call 1/16/09

Comments added by tannenba

===========================================================================
Date comments were added: Fri Jan 16 13:29:04 2009 (1232134144)
Subject: Actions

Ticket resolved by tannenba
===========================================================================
Date of actions: Fri Jan 16 13:29:38 2009 (1232134178)