LIGO Support Ticket 18816
Ticket Information
Number: admin 18816
User: anderson@ligo.caltech.edu
Email: condorligo__AT__aei.mpg.de
Status: resolved
Assigned To: wenger
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: LIGO: Ability to upgrade active Condor pools running nested DAGs
Date: Mon, 8 Dec 2008 20:59:16 -0800
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu
As discussed in a recent Condor-LIGO meeting, it is desirable to be
able to upgrade an active Condor pool (either upgrading just the
DAGMan binaries, or a full daemon upgrade) that has nested DAGs in the
queue without disrupting their proper execution.
As of Gnats PR 959 and the 7.2 release this should be robust for the
case of non-nested DAGs since copy_to_spool will be set to true in
the .condor.sub file. However, the full fix for the case of nested
DAGs requires the planned lazy .condor.sub creation enhancement.
This ticket is a placeholder to keep track of this request.
Thanks.
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date of creation: Mon Dec 8 22:59:26 2008 (1228798768)
Subject: Actions
Assigned to psilord by bt
===========================================================================
Date of actions: Mon Dec 8 8:45:57 2008 (1228835582)
Subject: Actions
Assigned to wenger by wenger
===========================================================================
Date of actions: Fri Dec 19 11:20:21 2008 (1229707221)
Date: Fri, 19 Dec 2008 11:26:03 -0600 (CST)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: wenger <condor-admin__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-admin #18816] LIGO: Ability to upgrade active Condor
pools running nested DAGs
Stuart,
> As discussed in a recent Condor-LIGO meeting, it is desirable to be
> able to upgrade an active Condor pool (either upgrading just the
> DAGMan binaries, or a full daemon upgrade) that has nested DAGs in the
> queue without disrupting their proper execution.
>
> As of Gnats PR 959 and the 7.2 release this should be robust for the
> case of non-nested DAGs since copy_to_spool will be set to true in
> the .condor.sub file. However, the full fix for the case of nested
> DAGs requires the planned lazy .condor.sub creation enhancement.
>
> This ticket is a placeholder to keep track of this request.
Just an update to the ticket's status here -- the copy_to_spool = true fix
is in 7.2.0. (This is *not* in the 7.2.0 pre-release you have, though.)
The full fix to this problem will be when we are able to defer generating
the .condor.sub files for nested DAGs until just before the subdag is
actually submitted.
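
As a rough illustration of the deferral idea, here is a minimal C++ sketch. It is not DAGMan code: the class name LazySubmitFile and its methods are hypothetical, and it only models the principle that the submit-file text is produced at sub-DAG submission time, so it reflects whatever is installed at that moment rather than at outer-DAG submission.

```cpp
#include <functional>
#include <string>
#include <utility>

// Hypothetical sketch (not Condor source): the text of a nested DAG's
// .condor.sub file is produced only when the sub-DAG is about to be
// submitted, not when the outer DAG is submitted.
class LazySubmitFile {
public:
    explicit LazySubmitFile(std::function<std::string()> generator)
        : generator_(std::move(generator)), generated_(false) {}

    // True once the generator has run.
    bool generated() const { return generated_; }

    // Called just before submission; runs the generator on first use only.
    const std::string& contentsForSubmit() {
        if (!generated_) {
            contents_ = generator_();
            generated_ = true;
        }
        return contents_;
    }

private:
    std::function<std::string()> generator_;
    std::string contents_;
    bool generated_;
};
```

With this shape, a generator that reads the DAGMan path when it runs would pick up a path that changed between outer-DAG submission and sub-DAG submission.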
Kent Wenger
Condor Team
===========================================================================
Date mail was appended: Fri Dec 19 11:26:09 2008 (1229707570)
CC: Condor/LIGO mailing list <condorligo__AT__aei.mpg.de>
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #18816] LIGO: Ability to upgrade active Condor
pools running nested DAGs
Date: Fri, 2 Jan 2009 21:21:24 -0800
X-Seen-BY: mailfromd 4.1 obsidian.cs.wisc.edu
On Dec 19, 2008, at 9:26 AM, condor-admin response tracking system
wrote:
> Stuart,
>
>> As discussed in a recent Condor-LIGO meeting, it is desirable to be
>> able to upgrade an active Condor pool (either upgrading just the
>> DAGMan binaries, or a full daemon upgrade) that has nested DAGs in
>> the queue without disrupting their proper execution.
>>
>> As of Gnats PR 959 and the 7.2 release this should be robust for the
>> case of non-nested DAGs since copy_to_spool will be set to true in
>> the .condor.sub file. However, the full fix for the case of nested
>> DAGs requires the planned lazy .condor.sub creation enhancement.
>>
>> This ticket is a placeholder to keep track of this request.
>
> Just an update to the ticket's status here -- the copy_to_spool = true
> fix is in 7.2.0. (This is *not* in the 7.2.0 pre-release you have,
> though.)
>
> The full fix to this problem will be when we are able to defer
> generating the .condor.sub files for nested DAGs until just before the
> subdag is actually submitted.
Kent,
I just upgraded the LIGO CIT cluster to 7.2.0 and noticed that this
version of condor_submit_dag is schizophrenic and sets copy_to_spool
to both True and False. The routine writeSubmitFile() in
condor_submit_dag.cpp includes the following:
    fprintf(pSubFile, "universe\t= scheduler\n");
    fprintf(pSubFile, "executable\t= %s\n", opts.strDagmanPath.Value());
    fprintf(pSubFile, "copy_to_spool\t= True\n");
    ...
    fprintf(pSubFile, "copy_to_spool\t= False\n");
I have confirmed that this actually shows up in the generated submit
file for a simple test dag. Unfortunately, I suspect the later
fprintf() takes precedence, i.e., the code change is a NOP?
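
One way to check a generated submit file for this kind of conflict is sketched below in C++. It is illustrative only: scanSubmitText and SubmitScan are hypothetical names, not part of Condor, and the "effective" value simply ASSUMES the later assignment wins, which is the suspicion stated above rather than confirmed behavior. Keys are compared exactly as written.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of Condor): strip spaces, tabs and CRs.
static std::string trim(const std::string& s) {
    const char* ws = " \t\r";
    const std::string::size_type first = s.find_first_not_of(ws);
    if (first == std::string::npos) return "";
    const std::string::size_type last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}

struct SubmitScan {
    std::map<std::string, std::string> effective;  // last value seen per key
    std::vector<std::string> duplicates;           // keys assigned more than once
};

// Scan "key = value" lines of a submit description.
// ASSUMPTION: a later assignment overrides an earlier one.
SubmitScan scanSubmitText(const std::string& text) {
    SubmitScan result;
    std::map<std::string, int> counts;
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) {
        const std::string stripped = trim(line);
        if (stripped.empty() || stripped[0] == '#') continue;  // blank or comment
        const std::string::size_type eq = stripped.find('=');
        if (eq == std::string::npos) continue;                 // not an assignment
        const std::string key = trim(stripped.substr(0, eq));
        if (key.empty()) continue;
        result.effective[key] = trim(stripped.substr(eq + 1));
        if (++counts[key] == 2) result.duplicates.push_back(key);  // report once
    }
    return result;
}
```

Feeding it the text of a generated .condor.sub file would list copy_to_spool under duplicates if both lines are present.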
Is there any record of why the False statement is in the previous code
base, e.g., 7.0.5? Perhaps there was a good reason not to set this to
True that we overlooked in attempting to solve the upgrade problem?
Out of curiosity will you be adding SUBMIT_DAGMAN_EXPRS (or similar)
as part of the full fix, or is that not needed?
P.S. It looks like the ghost of Frank showed up to haunt us for
Christmas :)
Thanks.
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date mail was appended: Fri Jan 2 23:21:41 2009 (1230960101)
Date: Mon, 5 Jan 2009 10:28:02 -0600 (CST)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: Stuart Anderson <anderson__AT__ligo.caltech.edu>
CC: condor-admin__AT__cs.wisc.edu, Condor/LIGO mailing list <condorligo__AT__aei.mpg.de>
Subject: Re: [CondorLIGO] Re: [condor-admin #18816] LIGO: Ability to
upgrade active Condor pools running nested DAGs
On Fri, 2 Jan 2009, Stuart Anderson wrote:
> I just upgraded the LIGO CIT cluster to 7.2.0 and noticed that this
> version of condor_submit_dag is schizophrenic and sets copy_to_spool to both
> True and False. The routine writeSubmitFile() in condor_submit_dag.cpp
> includes the following:
>
> fprintf(pSubFile, "universe\t= scheduler\n");
> fprintf(pSubFile, "executable\t= %s\n", opts.strDagmanPath.Value());
> fprintf(pSubFile, "copy_to_spool\t= True\n");
>
> ...
>
> fprintf(pSubFile, "copy_to_spool\t= False\n" );
>
>
> I have confirmed that this actually shows up in the generated submit file for
> a simple test dag. Unfortunately, I suspect the later fprintf() takes
> precedence, i.e., the code change is a NOP?
>
> Is there any record of why the False statement is in the previous code base,
> e.g., 7.0.5? Perhaps there was a good reason not to set this to True that we
> overlooked in attempting to solve the upgrade problem?
>
> Out of curiosity will you be adding SUBMIT_DAGMAN_EXPRS (or similar) as part
> of the full fix, or is that not needed?
>
> P.S. It looks like the ghost of Frank showed up to haunt us for Christmas :)
Ah, hell, you're right! I'll fix that and give you guys pre-release 7.2.1
binaries.
I think that copy_to_spool being false previously may just have been on
the assumption that you wouldn't need to do that, and therefore we'd avoid
wasted effort (not thinking about the "upgrade while running" issue).
Anyhow, I'll do some testing and make sure it doesn't cause problems.
As far as SUBMIT_DAGMAN_EXPRS goes, we already have -insert_sub_file and
-append for condor_submit_dag, but maybe it wouldn't hurt to also have
something like SUBMIT_DAGMAN_EXPRS. I don't think that needs to be part
of the basic fix, though, and it probably should go into 7.3 as opposed to
7.2.
Kent Wenger
Condor Team
===========================================================================
Date mail was appended: Mon Jan 5 10:28:07 2009 (1231172888)
Date: Mon, 5 Jan 2009 10:33:20 -0600 (CST)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-admin #18816] LIGO: Ability to upgrade active Condor
pools running nested DAGs
Stuart,
Well, one of the other Condor people just talked to me about a case where
we *don't* want the DAGMan binary copied to spool (submitting a large
number of very small DAGs).
So I guess there needs to be something like a configuration macro for
this...
Kent Wenger
Condor Team
===========================================================================
Date mail was appended: Mon Jan 5 10:33:22 2009 (1231173202)
Date: Fri, 9 Jan 2009 11:52:20 -0600 (CST)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-admin #18816] LIGO: Ability to upgrade active Condor
pools running nested DAGs
Stuart,
I've generated 7.2.1 pre-release condor_dagman and condor_submit_dag
binaries for you. These have a new configuration macro
(DAGMAN_COPY_TO_SPOOL) that controls whether the condor_dagman binary
is copied to the spool directory. (This value defaults to false, so
you should set it to true in your configuration somewhere.)
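
In configuration terms this would presumably be a line such as DAGMAN_COPY_TO_SPOOL = True. How the submit-file writer could then emit exactly one copy_to_spool line can be sketched in C++ as follows. This is a guess at the shape of the change, not the actual 7.2.1 code: copyToSpoolLine and writeCopyToSpool are hypothetical names, and the boolean parameter stands in for the value of the new macro.

```cpp
#include <cstdio>
#include <string>

// Hypothetical sketch (not the actual condor_submit_dag code): build the
// single copy_to_spool line for the generated .condor.sub file.
// copyToSpool stands in for the DAGMAN_COPY_TO_SPOOL setting, which
// defaults to false according to the message above.
std::string copyToSpoolLine(bool copyToSpool = false) {
    return std::string("copy_to_spool\t= ") +
           (copyToSpool ? "True" : "False") + "\n";
}

// Write that one line to an already-open submit file, in the same
// fprintf style as the excerpt quoted earlier in the thread.
void writeCopyToSpool(std::FILE* pSubFile, bool copyToSpool) {
    std::fprintf(pSubFile, "%s", copyToSpoolLine(copyToSpool).c_str());
}
```

Deriving the line from one boolean in one place avoids the earlier situation where two separate fprintf() calls wrote conflicting values.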
The new binaries are available at:
ftp://ftp.cs.wisc.edu/condor/temporary/forligo/dagman-7.2.1-prerelease-2009-01-08/
Kent Wenger
Condor Team
===========================================================================
Date mail was appended: Fri Jan 9 11:52:23 2009 (1231523543)
CC: condorligo__AT__aei.mpg.de
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #18816] LIGO: Ability to upgrade active Condor
pools running nested DAGs
Date: Fri, 9 Jan 2009 11:10:08 -0800
X-Seen-BY: mailfromd 4.1 obsidian.cs.wisc.edu
Thanks. I will give this a try.
When there is efficient job spool caching for duplicate executables,
e.g.,
[condor-admin #15277] LIGO DAGMan spool directory efficiency
will you be changing this default setting and/or COPY_TO_SPOOL to
True? If so, you might want to think about starting out this new knob
with that setting. Just a random thought.
Thanks.
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date mail was appended: Fri Jan 9 13:10:18 2009 (1231528219)
Date: Fri, 9 Jan 2009 13:13:37 -0600 (CST)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: Stuart Anderson <anderson__AT__ligo.caltech.edu>
CC: condor-admin__AT__cs.wisc.edu, Condor/LIGO mailing list
<condorligo__AT__aei.mpg.de>,
"R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [CondorLIGO] Re: [condor-admin #18816] LIGO: Ability to
upgrade active Condor pools running nested DAGs
On Fri, 9 Jan 2009, Stuart Anderson wrote:
> When there is efficient job spool caching for duplicate executables, e.g.,
> [condor-admin #15277] LIGO DAGMan spool directory efficiency
> will you be changing this default setting and/or COPY_TO_SPOOL to True? If
> so, you might want to think about starting out this new knob with that
> setting. Just a random thought.
Well, usually when we add a new knob like this we make the default
behavior whatever it was previously, which in this case is copy_to_spool =
false. It turns out that there's apparently more reason for it to be
false than I originally realized, so I think that's an argument for the
default being false. We could change the default when the "multiple
copies in the spool directory" issue is fixed.
Kent Wenger
Condor Team
===========================================================================
Date mail was appended: Fri Jan 9 13:13:40 2009 (1231528420)
CC: condorligo__AT__aei.mpg.de
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #18816] LIGO: Ability to upgrade active Condor
pools running nested DAGs
Date: Fri, 9 Jan 2009 12:54:44 -0800
X-Seen-BY: mailfromd 4.1 silica.cs.wisc.edu
I can confirm that these binaries (at least the X86_64-LINUX_RHEL3)
now properly create a single copy_to_spool entry in the generated
condor submit file based on the setting of this new macro.
Kent,
What other changes, if any, are in these pre-release binaries other
than DAGMAN_COPY_TO_SPOOL?
For the LIGO sites I would like to go with these binaries and a
setting of True for our upgrade on Monday.
Thanks.
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date mail was appended: Fri Jan 9 14:54:52 2009 (1231534492)
Date: Fri, 9 Jan 2009 15:02:44 -0600 (CST)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-admin #18816] LIGO: Ability to upgrade active Condor
pools running nested DAGs
Stuart,
> I can confirm that these binaries (at least the X86_64-LINUX_RHEL3)
> now properly create a single copy_to_spool entry in the generated
> condor submit file based on the setting of this new macro.
Glad it works for you, too!
> What other changes, if any, are in these pre-release binaries other
> than DAGMAN_COPY_TO_SPOOL?
Compared to 7.2.0, the only other change besides the DAGMAN_COPY_TO_SPOOL
stuff is that a small memory leak was fixed.
> For the LIGO sites I would like to go with these binaries and a
> setting of True for our upgrade on Monday.
Sounds good to me.
Kent Wenger
Condor Team
===========================================================================
Date mail was appended: Fri Jan 9 15:02:46 2009 (1231534966)
Date: Fri, 9 Jan 2009 15:23:06 -0600 (CST)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-admin #18816] LIGO: Ability to upgrade active Condor
pools running nested DAGs
Stuart,
Here's a slightly better workaround:
ident `which condor_submit_dag` | grep Condor
Kent Wenger
Condor Team
===========================================================================
Date mail was appended: Fri Jan 9 15:23:08 2009 (1231536188)
CC: Condor/LIGO mailing list <condorligo__AT__aei.mpg.de>
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #18816] LIGO: Ability to upgrade active Condor
pools running nested DAGs
Date: Fri, 9 Jan 2009 16:59:23 -0800
X-Seen-BY: mailfromd 4.1 granite.cs.wisc.edu
To be pedantic I also ran lsof to verify that this does exactly what
we want, i.e., condor_dagman processes started after the configuration
update have program text segments that are in the Condor spool
directory and the bits that are in /usr/bin/dagman do not link back to
any of these processes.
Thanks.
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date mail was appended: Fri Jan 9 18:59:47 2009 (1231549188)
Subject: Comments added
Stuart said we could close this ticket on LIGO call 1/16/09
Comments added by tannenba
===========================================================================
Date comments were added: Fri Jan 16 13:29:04 2009 (1232134144)
Subject: Actions
Ticket resolved by tannenba
===========================================================================
Date of actions: Fri Jan 16 13:29:38 2009 (1232134178)