LIGO Support Ticket 15687

Ticket Information
  Number:      admin 15687
  User:        kipp@gravity.phys.uwm.edu
  Email:       anderson__AT__ligo.caltech.edu
  Status:      resolved
  Assigned To: wenger
Date: Thu, 7 Jun 2007 01:36:08 -0500 (CDT)
From: Kipp Cannon <kipp__AT__gravity.phys.uwm.edu>
To: condor-admin__AT__cs.wisc.edu
CC: Stuart Anderson <anderson__AT__ligo.caltech.edu>
Subject: LIGO Re: Re: dagman depth-first not working

On Tue, 5 Jun 2007, Kipp Cannon wrote:

> Hi Again,
>
> As a second follow-up,
>
> This is very embarrassing...
>
> In my previous e-mail I reported that a dagman was not running a DAG in 
> depth-first order, but only reversing the nodes.  To that e-mail I attached 
> the .dag and .dagman.out files.
>
> For unrelated reasons, that DAG ended with a .rescue DAG being written, and 
> for yet further unrelated reasons a filesystem problem caused the rescue DAG 
> to terminate fatally (no .rescue.rescue DAG).  That forced me to regenerate 
> and re-run the DAG from scratch, which I did this morning, and on this 
> occasion, I noticed that the DAG was now being run in a depth-first order!
>
> After spending a great deal of time with diff and sort, I finally 
> convinced myself the two .dag files describe the same graph, and so went and 
> took another look at the .dagman.out file from the original run (that I 
> included in my e-mail).
>
> And, this is the very embarrassing part, the nodes were in fact submitted in 
> depth-first order...
>
> I don't know what to say.  I swear, and I mean I really swear, I have a clear 
> memory of seeing all the first-level jobs submitted before a second level job 
> ran.  But I'm looking now at the log file again with my own two eyes and I 
> see the DAG in fact ran depth-first.
>
> Anyway, with much apology, I have to retract the bug report.  The depth-first 
> feature in dagman is running correctly, as far as I can tell.
>
> 							-Kipp
>
> PS --- I think I need to see a shrink... :-(


Thanks for being patient with me.

But can I change my mind again :-).  I think I've figured out what's going 
on, and the good news (for me) is that I'm not crazy.  There is a bug, 
after all, but it's much more straightforward.

In the DAG I ran, I enabled the depth-first feature by creating a local 
dagman config file, and adding a "CONFIG ..." line to the .dag.  It turns 
out that when the DAG fails, and dagman writes the .rescue DAG, the CONFIG 
line is not propagated into the .rescue DAG.  So when the .rescue is run, 
it runs in breadth-first order.
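
A minimal sketch of how one might check for this symptom, assuming the 
DAG and rescue DAG are plain text with one command per line.  The file 
contents and names below (power.dagman.config, datafind.sub, power.sub) 
are hypothetical stand-ins, not taken from the real DAG, and the helper 
functions are illustrative only, not part of Condor:

```python
def config_lines(dag_text):
    """Return the normalized CONFIG lines found in DAG file text."""
    found = []
    for line in dag_text.splitlines():
        parts = line.split()
        # Assumption: the keyword is matched case-insensitively.
        if parts and parts[0].upper() == "CONFIG":
            found.append(" ".join(parts))
    return found


def missing_config(original_text, rescue_text):
    """List CONFIG lines present in the original DAG but not the rescue DAG."""
    rescue = set(config_lines(rescue_text))
    return [c for c in config_lines(original_text) if c not in rescue]


# Hypothetical file contents, for illustration only.
original = """\
CONFIG power.dagman.config
JOB datafind_0 datafind.sub
JOB power_0 power.sub
PARENT datafind_0 CHILD power_0
"""
rescue = """\
JOB datafind_0 datafind.sub DONE
JOB power_0 power.sub
PARENT datafind_0 CHILD power_0
"""

print(missing_config(original, rescue))
```

An empty list would mean every CONFIG line survived into the rescue DAG; 
a non-empty list is the situation described above.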

And that, I believe, is what happened:  I was watching a rescue DAG run 
when I saw it was not going in depth-first order, but the log I sent you 
was from the first attempt of the DAG, which did run in depth-first order.

Anyway, so the bug report is that dagman doesn't propagate a CONFIG line 
from the .dag into the .rescue.  Phew,

 							-Kipp



>
>
>
> On Mon, 4 Jun 2007, Kipp Cannon wrote:
>
>> Hi,
>> 
>> As follow-up to the LIGO-Condor telecon today, attached is an example dag 
>> and the .dagman.out for a DAG with post scripts, no pre-scripts, that was 
>> run with DAGMAN_SUBMIT_DEPTH_FIRST = 1, but that didn't submit jobs depth 
>> first.
>> 
>> There are over 24000 nodes in the DAG, so I'll explain a bit of its 
>> structure.  It starts with a number of "LSCdataFind" jobs that all run in 
>> parallel (these are used to find the input data that will be analyzed). 
>> These jobs' children are "lalapps_power" jobs that also run in parallel. 
>> Most datafinds are shared by several power jobs.  Following the 
>> lalapps_power jobs are several layers of post-processing jobs.  Each layer 
>> tends to have many fewer nodes in it than the layer above, as the analysis 
>> "funnels down" into a small number of final output jobs.
>> 
>> Because the datafind and power jobs have postscripts, which run on the 
>> submit machine (a shared resource for us), the DAG is run with "-maxpost 
>> 1".  The postscripts run very quickly, however, and dagman's buffering of 
>> postscripts is almost always the only factor delaying their completion. 
>> This usually means job submissions are no more than 10 ahead of the 
>> completed postscripts. The specific command line used to submit the DAG was
>> 
>> $ condor_submit_dag -maxpost 1 power.dag
>> 
>>
>> 							-Kipp
>

===========================================================================
Date of creation: Thu Jun  7  1:36:21 2007 (1181198184)
Subject: Actions

Assigned to wenger by wenger
===========================================================================
Date of actions: Thu Jun  7 15:08:41 2007 (1181246921)
Date: Thu, 7 Jun 2007 15:10:56 -0500 (CDT)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: wenger <condor-admin__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-admin #15687] LIGO Re: Re: dagman depth-first not working

Kipp,

> > And, this is the very embarrassing part, the nodes were in fact submitted in
> > depth-first order...
> >
> > I don't know what to say.  I swear, and I mean I really swear, I have a clear
> > memory of seeing all the first-level jobs submitted before a second level job
> > ran.  But I'm looking now at the log file again with my own two eyes and I
> > see the DAG in fact ran depth-first.

Okay, that's actually a relief to me!

> In the DAG I ran, I enabled the depth-first feature by creating a local
> dagman config file, and adding a "CONFIG ..." line to the .dag.  It turns
> out that when the DAG fails, and dagman writes the .rescue DAG, the CONFIG
> line is not propagated into the .rescue DAG.  So when the .rescue is run,
> it runs in breadth-first order.
>
> And that, I believe, is what happened:  I was watching a rescue DAG run
> when I saw it was not going in depth-first order, but the log I sent you
> was from the first attempt of the DAG, which did run in depth-first order.
>
> Anyway, so the bug report is that dagman doesn't propagate a CONFIG line
> from the .dag into the .rescue.  Phew,

Yeah, the CONFIG file setting should get propagated to the rescue DAG.
I'll fix that...

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Thu Jun  7 15:10:58 2007 (1181247059)
Date: Fri, 8 Jun 2007 11:48:21 -0500 (CDT)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: wenger <condor-admin__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-admin #15687] LIGO Re: Re: dagman depth-first not working

Kipp,

I just wanted to let you know that I have a fix for the "CONFIG file
not propagated to rescue DAG" problem.

I should be able to get you a new condor_dagman binary with the fix
some time early next week.

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Fri Jun  8 11:48:37 2007 (1181321317)
Date: Wed, 13 Jun 2007 14:30:13 -0500 (CDT)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: wenger <condor-admin__AT__cs.wisc.edu>
Subject: Re: [condor-admin #15687] LIGO Re: Re: dagman depth-first not working

Kipp,

I've just built 6.9.4 pre-release binaries that have the fix to the
"CONFIG spec not propagated to rescue DAG" problem.  I've notified Stuart 
about the new binaries, so they should get installed soon, and that should
take care of things.

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Wed Jun 13 14:30:15 2007 (1181763015)
Subject: Actions

Ticket resolved by wenger
===========================================================================
Date of actions: Wed Jun 13 14:31:30 2007 (1181763091)
Subject: Actions

Ticket was reopened by mailnull
===========================================================================
Date of actions: Wed Jun 13 23:09:06 2007 (1181794147)
Date: Wed, 13 Jun 2007 23:08:56 -0500 (CDT)
From: Kipp Cannon <kipp__AT__gravity.phys.uwm.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: anderson__AT__ligo.caltech.edu
Subject: Re: [condor-admin #15687] LIGO Re: Re: dagman depth-first not working

Hi Kent,

Great, thanks a lot!

By the way, the depth-first feature, even in its preliminary form, is 
really fantastic!  It really smooths out the DAG's running.  Fewer files 
accumulate, they don't grow to as large a size, if there's a problem with 
the jobs in the DAG I find out within hours rather than after 3 days of 
running.  And so many little things just seem to work better.

Here's an example.  In our data analysis pipelines, the first thing the 
DAG does is run a large number of "data find" queries to metadata servers 
to locate the instrument data on the cluster filesystem.  Because of the 
need to throttle database server connections, and other things, the 
queries have to be run in sequence; we can't launch 1200 query jobs onto 
the cluster all at once.  They can take many many hours to run through. 
But with the depth-first dagman, after the first query runs, the jobs that 
use its output are launched and begin processing the data.  This happens 
in parallel with the next query, so while the queries are being run in 
sequence the analysis jobs now run in parallel alongside them.  The 
result is that the DAG requires almost 1/2 day less run time to complete, 
and the cluster nodes aren't as idle as they otherwise would've been.
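
The difference in submission order can be sketched with a toy model. 
This assumes a simplified scheduler that submits one node at a time and 
treats each node as finished immediately; "depth-first" is modelled as 
putting newly ready children at the head of the ready queue instead of 
the tail.  The node names are made up, and this is an illustration of 
the idea, not dagman's actual implementation:

```python
from collections import deque


def submit_order(children, roots, depth_first):
    """Return the order in which nodes are submitted, one at a time.

    children:    dict mapping node -> list of child nodes
    roots:       nodes with no parents, in DAG-file order
    depth_first: if True, newly ready children go to the head of the queue
    """
    # Count the unfinished parents of every child node.
    pending = {}
    for kids in children.values():
        for kid in kids:
            pending[kid] = pending.get(kid, 0) + 1

    ready = deque(roots)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)  # "submit" the node and treat it as finished
        newly_ready = []
        for kid in children.get(node, []):
            pending[kid] -= 1
            if pending[kid] == 0:
                newly_ready.append(kid)
        if depth_first:
            # Push to the head of the queue, preserving the children's order.
            ready.extendleft(reversed(newly_ready))
        else:
            ready.extend(newly_ready)
    return order


# Hypothetical two-query DAG: each query feeds two analysis jobs.
children = {
    "datafind_0": ["power_0a", "power_0b"],
    "datafind_1": ["power_1a", "power_1b"],
}
roots = ["datafind_0", "datafind_1"]

print(submit_order(children, roots, depth_first=False))
print(submit_order(children, roots, depth_first=True))
```

In the breadth-first ordering both queries are submitted before any 
analysis job; in the depth-first ordering the analysis jobs for the 
first query are submitted before the second query.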

I've been telling everyone how great the depth-first feature is...

 							-Kipp


On Wed, 13 Jun 2007, condor-admin response tracking system wrote:

> Kipp,
>
> I've just built 6.9.4 pre-release binaries that have the fix to the
> "CONFIG spec not propagated to rescue DAG" problem.  I've notified Stuart
> about the new binaries, so they should get installed soon, and that should
> take care of things.
>
> Kent Wenger
> Condor Team

===========================================================================
Date mail was appended: Wed Jun 13 23:09:06 2007 (1181794147)
Date: Fri, 15 Jun 2007 09:19:19 -0500 (CDT)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-admin #15687] LIGO Re: Re: dagman depth-first not working

Kipp,

> Great, thanks a lot!

You're welcome!

> By the way, the depth-first feature, even in its preliminary form, is
> really fantastic!  It really smooths out the DAG's running.  Fewer files
> accumulate, they don't grow to as large a size, if there's a problem with
> the jobs in the DAG I find out within hours rather than after 3 days of
> running.  And so many little things just seem to work better.
>
> Here's an example.  In our data analysis pipelines, the first thing the
> DAG does is run a large number of "data find" queries to metadata servers
> to locate the instrument data on the cluster filesystem.  Because of the
> need to throttle database server connections, and other things, the
> queries have to be run in sequence; we can't launch 1200 query jobs onto
> the cluster all at once.  They can take many many hours to run through.
> But with the depth-first dagman, after the first query runs, the jobs that
> use its output are launched and begin processing the data.  This happens
> in parallel with the next query, so while the queries are being run in
> sequence the analysis jobs now run in parallel alongside them.  The
> result is that the DAG requires almost 1/2 day less run time to complete,
> and the cluster nodes aren't as idle as they otherwise would've been.
>
> I've been telling everyone how great the depth-first feature is...

Cool!  Thanks for the positive feedback.

Hopefully we can get true depth-first traversal, and more flexible 
priorities, implemented pretty soon.

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Fri Jun 15  9:19:21 2007 (1181917162)
Subject: Actions

Ticket resolved by wenger
===========================================================================
Date of actions: Fri Jun 15  9:22:58 2007 (1181917379)