LIGO Support Ticket 17849

Ticket Information
  Number:      admin 17849
  User:        anderson@ligo.caltech.edu
  Email:       skoranda__AT__gravity.phys.uwm.edu
  Status:      feature
  Assigned To: wenger
Date: Wed, 9 Apr 2008 15:14:29 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
CC: Kent Wenger <wenger__AT__cs.wisc.edu>, Scott Koranda <skoranda__AT__gravity.phys.uwm.edu>
Subject: LIGO: DAGMan protocol version number request
X-Seen-BY: mailfromd 4.1 obsidian.cs.wisc.edu

Recently a LIGO user ran into a non-obvious problem when running one version
of condor_submit_dag with the -no-submit option and then attempting to run
the resulting submit file with a different version of condor_submit_dag
that did not support some of the newer features of the initial version.

It is worth considering adding an explicit DAG protocol version number
to submit files and then having condor_submit_dag refuse to accept
(with a human-readable error message) a submit file with a newer
protocol number than it supports. Implied in this is that not
every Condor release would necessarily have a protocol version number change.
This would also allow more meaningful error messages if anyone accidentally
confuses condor_submit and condor_submit_dag, or even offer the possibility
of merging these two interfaces, since there would be no ambiguity about
whether a DAG file was being submitted.
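A minimal sketch of the proposed check, assuming a hypothetical marker line "# DAGProtocolVersion = N" written into the generated .condor.sub file and a hypothetical SUPPORTED_PROTOCOL constant in the installation that runs it. Neither exists in Condor; they only illustrate how a newer file would be refused with a readable message, and how the absence of the marker could identify a non-DAG submit file.

```python
import re

# Hypothetical: highest DAG protocol version this installation understands.
SUPPORTED_PROTOCOL = 3

# Hypothetical marker line in a generated .condor.sub file,
# e.g. "# DAGProtocolVersion = 2".
_VERSION_RE = re.compile(
    r"^#[ \t]*DAGProtocolVersion[ \t]*=[ \t]*(\d+)[ \t]*$", re.MULTILINE
)


class SubmitFileError(Exception):
    """Raised with a human-readable explanation when a file is rejected."""


def check_submit_file(text: str, supported: int = SUPPORTED_PROTOCOL) -> int:
    """Return the file's DAG protocol version, or raise SubmitFileError."""
    match = _VERSION_RE.search(text)
    if match is None:
        # No marker: not a DAG submit file. A merged submit command could
        # use this to route the file down the ordinary (non-DAG) path.
        raise SubmitFileError(
            "no DAGProtocolVersion line found; this does not look like a "
            "DAG submit file"
        )
    version = int(match.group(1))
    if version > supported:
        # Newer protocol than we understand: refuse instead of misbehaving.
        raise SubmitFileError(
            f"this submit file uses DAG protocol {version}, but this "
            f"installation supports only up to {supported}; regenerate it "
            "with this installation's condor_submit_dag, or upgrade"
        )
    return version
```

For example, check_submit_file("# DAGProtocolVersion = 2\nuniverse = scheduler\n") returns 2, while a file marked with protocol 4 is rejected. Because several releases may share one protocol number, only releases that change the submit-file contract would need to bump it.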

This might also help maintain backward compatibility for future changes
whose details depend on the protocol version.

It would also be worth considering adding an optional user-specified
condor version number requirement statement to handle the case of a DAG
that is known to fail on version x.y.z (due to a DAGMan bug or a missing
feature that was only added in a later version), but will work on x.y.z+1 or
newer.
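A sketch of such a requirement statement, assuming a hypothetical DAG-file line "REQUIRE_VERSION x.y.z" (not an existing DAGMan keyword). Versions are parsed into integer tuples so that, for instance, 7.10.0 correctly sorts after 7.9.0, which a plain string comparison would get wrong.

```python
from typing import Optional, Tuple

Version = Tuple[int, int, int]


def parse_version(text: str) -> Version:
    """Turn 'x.y.z' into a tuple so versions compare numerically."""
    major, minor, patch = (int(part) for part in text.strip().split("."))
    return (major, minor, patch)


def required_version(dag_text: str) -> Optional[Version]:
    """Return the strictest hypothetical REQUIRE_VERSION in the DAG, if any."""
    found = []
    for line in dag_text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0].upper() == "REQUIRE_VERSION":
            found.append(parse_version(parts[1]))
    return max(found) if found else None


def check_dag(dag_text: str, running: str) -> None:
    """Raise ValueError with a readable message if `running` is too old."""
    needed = required_version(dag_text)
    if needed is not None and parse_version(running) < needed:
        raise ValueError(
            f"this DAG requires version {'.'.join(map(str, needed))} or newer; "
            f"this installation is {running}"
        )
```

With a DAG containing "REQUIRE_VERSION 7.0.2" (a hypothetical value), check_dag accepts 7.0.2 and anything newer but raises for 7.0.1, matching the "fails on x.y.z, works on x.y.z+1" case.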

I believe some or all of this is captured in the internal Condor bug
tracking system as PR 941.

Thanks.


-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date of creation: Wed Apr  9 17:14:42 2008 (1207779285)
Subject: Actions

Assigned to wenger by wenger
===========================================================================
Date of actions: Wed Apr  9 17:39:04 2008 (1207780744)
Date: Wed, 9 Apr 2008 17:41:02 -0500 (CDT)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: wenger <condor-admin__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-admin #17849] LIGO: DAGMan protocol version number   
 request

Okay, I'm replying to this again to make sure the reply gets into the RUST
ticket...

> Recently a LIGO user ran into a non-obvious problem when running one version
> of condor_submit_dag with the -no-submit option and then attempting to run
> the resulting submit file with a different version of condor_submit_dag
> that did not support some of the newer features of the initial version.
>
> It is worth considering adding an explicit DAG protocol version number
> to submit files and then having condor_submit_dag refuse to accept
> (with a human-readable error message) a submit file with a newer
> protocol number than it supports. Implied in this is that not
> every Condor release would necessarily have a protocol version number change.
> This would also allow more meaningful error messages if anyone accidentally
> confuses condor_submit and condor_submit_dag, or even offer the possibility
> of merging these two interfaces, since there would be no ambiguity about
> whether a DAG file was being submitted.

I think the general idea of condor_dagman checking the condor_submit_dag
version is good, but I don't really think it's worth defining a protocol
version that's different from the Condor version.
There just is no reason for anyone to run different versions of
condor_submit_dag and condor_dagman, and trying to have a separate
protocol version seems like just a way to increase complexity and the
chance of an error (if we forget to increment the protocol version when we
should).

The only case where I can see any benefit to the protocol version idea
is if you have a bunch of .condor.sub files sitting around, and you don't
routinely re-run condor_submit_dag, and then you upgrade your
condor_dagman version.  Personally, I'd rather just say you have to re-run
condor_submit_dag in that case, unless that's a really big problem for
people.
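A sketch of the simpler alternative described here: no separate protocol number, just an exact match between the version of condor_submit_dag that produced the .condor.sub file and the running condor_dagman. It assumes a hypothetical line "# SubmitDagVersion = x.y.z" recorded in the submit file; that line name is illustrative only.

```python
import re
from typing import Optional

# Hypothetical line condor_submit_dag would write into the .condor.sub file,
# e.g. "# SubmitDagVersion = 7.0.1".
_RECORDED_RE = re.compile(
    r"^#[ \t]*SubmitDagVersion[ \t]*=[ \t]*(\S+)[ \t]*$", re.MULTILINE
)


def recorded_version(submit_text: str) -> Optional[str]:
    """Version of condor_submit_dag that produced the file, if recorded."""
    match = _RECORDED_RE.search(submit_text)
    return match.group(1) if match else None


def version_mismatch(submit_text: str, dagman_version: str) -> Optional[str]:
    """Return a readable error unless the file was made by this exact version."""
    recorded = recorded_version(submit_text)
    if recorded == dagman_version:
        return None
    return (
        f"this .condor.sub file was produced by condor_submit_dag "
        f"{recorded or 'of unknown version'}, but this condor_dagman is "
        f"{dagman_version}; please re-run condor_submit_dag"
    )
```

The trade-off is the one described above: there is no extra number to forget to increment, but every upgrade invalidates any .condor.sub files left lying around, and the remedy is simply to re-run condor_submit_dag.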

> This might also help maintain backward compatibility for future changes
> whose details depend on the protocol version.
>
> It would also be worth considering adding an optional user-specified
> condor version number requirement statement to handle the case of a DAG
> that is known to fail on version x.y.z (due to a DAGMan bug or a missing
> feature that was only added in a later version), but will work on x.y.z+1 or
> newer.

Hmm, I hadn't thought of that.  You mean, a DAGMan version specifier in
the DAG file itself, right?  That might get a little tricky, because we'd
probably need to allow two versions -- one for the development branch and
one for the production branch.
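To illustrate why two versions might be needed: Condor's numbering at the time used an even minor number for a stable (production) series such as 7.0.x and an odd one for a development series such as 7.1.x. If a fix landed in, say, 7.0.2 and 7.1.1 (hypothetical numbers), a single minimum cannot express that. The sketch below keeps one minimum per series; the dictionary format is an assumption for illustration.

```python
from typing import Dict, Tuple

Version = Tuple[int, int, int]


def parse_version(text: str) -> Version:
    major, minor, patch = (int(part) for part in text.strip().split("."))
    return (major, minor, patch)


def satisfies(running: str, minimums: Dict[str, str]) -> bool:
    """Check `running` against one minimum per release series.

    `minimums` maps a series such as "7.0" to the first release in that
    series assumed to work, e.g. {"7.0": "7.0.2", "7.1": "7.1.1"}.
    """
    if not minimums:
        return True
    run = parse_version(running)
    series = f"{run[0]}.{run[1]}"
    if series in minimums:
        return run >= parse_version(minimums[series])
    # Unlisted series: accept only if it is newer than every listed series.
    newest = max(parse_version(v)[:2] for v in minimums.values())
    return run[:2] > newest
```

With {"7.0": "7.0.2", "7.1": "7.1.1"}, version 7.1.0 is rejected even though it is numerically newer than 7.0.2; a single "7.0.2 or newer" rule would have wrongly accepted it.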

> I believe some or all of this is captured in the internal Condor bug
> tracking system as PR 941.

Some of it, but not all.  I'll probably add some more PRs for the new
stuff.

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Wed Apr  9 17:41:08 2008 (1207780869)
Date: Wed, 9 Apr 2008 16:35:42 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: skoranda__AT__gravity.phys.uwm.edu
Subject: Re: [condor-admin #17849] LIGO: DAGMan protocol version number  
 request

Kent,
	Part of the reason I was thinking of a fairly general solution to
this particular problem is that I have some vague notion of DAGs running
around a grid between pools that are running different versions of Condor.
For example, it might be possible/desirable some day to split off part of
a running DAG, locally generate a submit file, and ship it off to another
pool to complete that part of the graph.

	Even short of this, an automatic DAG-generating tool might want
to specify that a minimum Condor or DAG protocol version number is necessary
to run the DAGs it generates, e.g., to take advantage of some recently added
feature such as intra-DAG node prioritization or throttling.
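A sketch of how a DAG-generating tool might compute such a requirement: the DAG needs the newest of the minimum versions of all features it uses. The feature names and version numbers in the table are placeholders, not Condor's actual release history, and the "REQUIRE_VERSION" output line is a hypothetical DAG keyword.

```python
from typing import Dict, Iterable, Tuple

Version = Tuple[int, int, int]

# Placeholder table (hypothetical versions, NOT Condor's release history):
# feature name -> first version assumed to support it.
FEATURE_MINIMUMS: Dict[str, Version] = {
    "basic_dag": (6, 8, 0),
    "node_priority": (7, 1, 0),
    "throttling": (7, 0, 0),
}


def minimum_version(features: Iterable[str],
                    table: Dict[str, Version] = FEATURE_MINIMUMS) -> Version:
    """A DAG needs the newest of the minimums of all features it uses."""
    return max((table[name] for name in features), default=(0, 0, 0))


def requirement_line(features: Iterable[str]) -> str:
    """Format a hypothetical REQUIRE_VERSION line for the generated DAG."""
    major, minor, patch = minimum_version(features)
    return f"REQUIRE_VERSION {major}.{minor}.{patch}"
```

With these placeholder values, a DAG using basic_dag, node_priority, and throttling would get "REQUIRE_VERSION 7.1.0". A real tool would fill the table from release notes rather than hard-coding it.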

	This might very well be too generic a solution, but I thought
I would at least start the discussion.

Thanks.


-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Wed Apr  9 18:35:55 2008 (1207784156)
Subject: Actions

Status changed from open to feature by wenger
===========================================================================
Date of actions: Mon Jun 30 14:46:44 2008 (1214855204)