LIGO Support Ticket 15687
Ticket Information
Number: admin 15687
User: kipp@gravity.phys.uwm.edu
Email: anderson__AT__ligo.caltech.edu
Status: resolved
Assigned To: wenger
X-Authentication-Warning: deirdre.phys.uwm.edu: kipp owned process doing -bs
Date: Thu, 7 Jun 2007 01:36:08 -0500 (CDT)
From: Kipp Cannon <kipp__AT__gravity.phys.uwm.edu>
X-X-Sender: kipp__AT__deirdre.phys.uwm.edu
To: condor-admin__AT__cs.wisc.edu
CC: Stuart Anderson <anderson__AT__ligo.caltech.edu>
Subject: LIGO Re: Re: dagman depth-first not working
On Tue, 5 Jun 2007, Kipp Cannon wrote:
> Hi Again,
>
> As a second follow-up,
>
> This is very embarrassing...
>
> In my previous e-mail I reported that a dagman was not running a DAG in
> depth-first order, but only reversing the nodes. To that e-mail I attached
> the .dag and .dagman.out files.
>
> For unrelated reasons, that DAG ended with a .rescue DAG being written, and
> for yet further unrelated reasons a filesystem problem caused the rescue DAG
> to terminate fatally (no .rescue.rescue DAG). That forced me to regenerate
> and re-run the DAG from scratch, which I did this morning, and on this
> occasion, I noticed that the DAG was now being run in a depth-first order!
>
> After spending a great deal of time with diff and sort, I finally
> convinced myself the two .dag files describe the same graph, and so went and
> took another look at the .dagman.out file from the original run (that I
> included in my e-mail).
>
> And, this is the very embarrassing part, the nodes were in fact submitted in
> depth-first order...
>
> I don't know what to say. I swear, and I mean I really swear, I have a clear
> memory of seeing all the first-level jobs submitted before a second level job
> ran. But I'm looking now at the log file again with my own two eyes and I
> see the DAG in fact ran depth-first.
>
> Anyway, with much apology, I have to retract the bug report. The depth-first
> feature in dagman is running correctly, as far as I can tell.
>
> -Kipp
>
> PS --- I think I need to see a shrink... :-(
Thanks for being patient with me.
But can I change my mind again? :-) I think I've figured out what's going
on, and the good news (for me) is that I'm not crazy. There is a bug
after all, but it's much more straightforward.
In the DAG I ran, I enabled the depth-first feature by creating a local
dagman config file, and adding a "CONFIG ..." line to the .dag. It turns
out that when the DAG fails, and dagman writes the .rescue DAG, the CONFIG
line is not propagated into the .rescue DAG. So when the .rescue is run,
it runs in breadth-first order.
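
[Editorial sketch] The missing-CONFIG symptom described above can be checked
mechanically. The Python below is only an illustration, not part of dagman:
the function names are hypothetical, and it assumes a CONFIG line has the
form "CONFIG <file name>" and that the keyword can be matched without regard
to case.

```python
def config_lines(dag_path):
    """Return the config file names that CONFIG lines in a DAG file point at."""
    found = []
    with open(dag_path) as dag:
        for line in dag:
            fields = line.split()
            # Assumption of this sketch: keyword first, file name second,
            # keyword compared case-insensitively.
            if len(fields) >= 2 and fields[0].upper() == "CONFIG":
                found.append(fields[1])
    return found


def rescue_lost_config(dag_path, rescue_path):
    """True if the original DAG names a config file the rescue DAG does not."""
    original = set(config_lines(dag_path))
    rescued = set(config_lines(rescue_path))
    return bool(original - rescued)
```

Calling rescue_lost_config() with the path of the original .dag and the path
of the .rescue file written after a failure would return True for the
behaviour reported here, and False once the line is carried over.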
And that, I believe, is what happened: I was watching a rescue DAG run
when I saw it was not going in depth-first order, but the log I sent you
was from the first attempt of the DAG, which did run in depth-first order.
Anyway, so the bug report is that dagman doesn't propagate a CONFIG line
from the .dag into the .rescue. Phew,
-Kipp
>
>
>
> On Mon, 4 Jun 2007, Kipp Cannon wrote:
>
>> Hi,
>>
>> As follow-up to the LIGO-Condor telecon today, attached is an example dag
>> and the .dagman.out for a DAG with post scripts, no pre-scripts, that was
>> run with DAGMAN_SUBMIT_DEPTH_FIRST = 1, but that didn't submit jobs depth
>> first.
>>
>> There are over 24000 nodes in the DAG, so I'll explain a bit of its
>> structure. It starts with a number of "LSCdataFind" jobs that all run in
>> parallel (these are used to find the input data that will be analyzed).
>> These jobs' children are "lalapps_power" jobs that also run in parallel.
>> Most datafinds are shared by several power jobs. Following the
>> lalapps_power jobs are several layers of post-processing jobs. Each layer
>> tends to have many fewer nodes in it than the layer above, as the analysis
>> "funnels down" into a small number of final output jobs.
>>
>> Because the datafind and power jobs have postscripts, which run on the
>> submit machine (a shared resource for us), the DAG is run with "-maxpost
>> 1". The postscripts run very quickly, however, and dagman's buffering of
>> postscripts is almost always the only factor delaying their completion.
>> This usually means job submissions are no more than 10 ahead of the
>> completed postscripts. The specific command line used to submit the DAG was
>>
>> $ condor_submit_dag -maxpost 1 power.dag
>>
>>
>> -Kipp
>
===========================================================================
Date of creation: Thu Jun 7 1:36:21 2007 (1181198184)
Subject: Actions
Assigned to wenger by wenger
===========================================================================
Date of actions: Thu Jun 7 15:08:41 2007 (1181246921)
Date: Thu, 7 Jun 2007 15:10:56 -0500 (CDT)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: wenger <condor-admin__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-admin #15687] LIGO Re: Re: dagman depth-first not working
Kipp,
> > And, this is the very embarrassing part, the nodes were in fact submitted in
> > depth-first order...
> >
> > I don't know what to say. I swear, and I mean I really swear, I have a clear
> > memory of seeing all the first-level jobs submitted before a second level job
> > ran. But I'm looking now at the log file again with my own two eyes and I
> > see the DAG in fact ran depth-first.
Okay, that's actually a relief to me!
> In the DAG I ran, I enabled the depth-first feature by creating a local
> dagman config file, and adding a "CONFIG ..." line to the .dag. It turns
> out that when the DAG fails, and dagman writes the .rescue DAG, the CONFIG
> line is not propagated into the .rescue DAG. So when the .rescue is run,
> it runs in breadth-first order.
>
> And that, I believe, is what happened: I was watching a rescue DAG run
> when I saw it was not going in depth-first order, but the log I sent you
> was from the first attempt of the DAG, which did run in depth-first order.
>
> Anyway, so the bug report is that dagman doesn't propagate a CONFIG line
> from the .dag into the .rescue. Phew,
Yeah, the CONFIG file setting should get propagated to the rescue DAG.
I'll fix that...
Kent Wenger
Condor Team
===========================================================================
Date mail was appended: Thu Jun 7 15:10:58 2007 (1181247059)
Date: Fri, 8 Jun 2007 11:48:21 -0500 (CDT)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: wenger <condor-admin__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-admin #15687] LIGO Re: Re: dagman depth-first not working
Kipp,
I just wanted to let you know that I have a fix for the "CONFIG file
not propagated to rescue DAG" problem.
I should be able to get you a new condor_dagman binary with the fix
some time early next week.
Kent Wenger
Condor Team
===========================================================================
Date mail was appended: Fri Jun 8 11:48:37 2007 (1181321317)
Date: Wed, 13 Jun 2007 14:30:13 -0500 (CDT)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: wenger <condor-admin__AT__cs.wisc.edu>
Subject: Re: [condor-admin #15687] LIGO Re: Re: dagman depth-first not working
Kipp,
I've just built 6.9.4 pre-release binaries that have the fix to the
"CONFIG spec not propagated to rescue DAG" problem. I've notified Stuart
about the new binaries, so they should get installed soon, and that should
take care of things.
Kent Wenger
Condor Team
===========================================================================
Date mail was appended: Wed Jun 13 14:30:15 2007 (1181763015)
Subject: Actions
Ticket resolved by wenger
===========================================================================
Date of actions: Wed Jun 13 14:31:30 2007 (1181763091)
Subject: Actions
Ticket was reopened by mailnull
===========================================================================
Date of actions: Wed Jun 13 23:09:06 2007 (1181794147)
X-Authentication-Warning: deirdre.phys.uwm.edu: kipp owned process doing -bs
Date: Wed, 13 Jun 2007 23:08:56 -0500 (CDT)
From: Kipp Cannon <kipp__AT__gravity.phys.uwm.edu>
X-X-Sender: kipp__AT__deirdre.phys.uwm.edu
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: anderson__AT__ligo.caltech.edu
Subject: Re: [condor-admin #15687] LIGO Re: Re: dagman depth-first not working
Hi Kent,
Great, thanks a lot!
By the way, the depth-first feature, even in its preliminary form, is
really fantastic! It really smooths out the DAG's running. Fewer files
accumulate, they don't grow to as large a size, and if there's a problem
with the jobs in the DAG I find out within hours rather than after 3 days
of running. And so many little things just seem to work better.
Here's an example. In our data analysis pipelines, the first thing the
DAG does is run a large number of "data find" queries to metadata servers
to locate the instrument data on the cluster filesystem. Because of the
need to throttle database server connections, and other things, the
queries have to be run in sequence; we can't launch 1200 query jobs onto
the cluster all at once. They can take many, many hours to run through.
But with the depth-first dagman, after the first query runs, the jobs that
use its output are launched and begin processing the data. This happens
in parallel with the next query, so while the queries are being run in
sequence the analysis jobs now run in parallel alongside them. The
result is that the DAG requires almost 1/2 day less run time to complete,
and the cluster nodes aren't as idle as they otherwise would've been.
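
[Editorial sketch] The ordering effect described above can be shown with a
toy model in Python. This is not dagman's code. It assumes a single ready
queue from which one node is taken at a time, with newly ready children
placed at the tail for breadth-first order and at the head for depth-first
order. The node names and the submission_order() helper are made up for the
illustration.

```python
from collections import deque


def submission_order(children, parent_counts, roots, depth_first):
    """Return the order in which nodes leave a single ready queue."""
    remaining = dict(parent_counts)  # unfinished parents per non-root node
    ready = deque(roots)
    order = []
    while ready:
        node = ready.popleft()  # toy throttle: one node at a time
        order.append(node)
        newly_ready = []
        for child in children.get(node, []):
            remaining[child] -= 1
            if remaining[child] == 0:
                newly_ready.append(child)
        if depth_first:
            # Head of the queue, keeping the children's relative order.
            ready.extendleft(reversed(newly_ready))
        else:
            # Tail of the queue.
            ready.extend(newly_ready)
    return order


# Three sequential "data find" queries, each feeding analysis jobs.
children = {
    "find1": ["power1a", "power1b"],
    "find2": ["power2"],
    "find3": ["power3"],
}
parent_counts = {"power1a": 1, "power1b": 1, "power2": 1, "power3": 1}
roots = ["find1", "find2", "find3"]

print(submission_order(children, parent_counts, roots, depth_first=False))
print(submission_order(children, parent_counts, roots, depth_first=True))
```

In the breadth-first case all three queries come out before any analysis
job; in the depth-first case each query is followed at once by the jobs
that use its output, which is the interleaving described above.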
I've been telling everyone how great the depth-first feature is...
-Kipp
On Wed, 13 Jun 2007, condor-admin response tracking system wrote:
> Kipp,
>
> I've just built 6.9.4 pre-release binaries that have the fix to the
> "CONFIG spec not propagated to rescue DAG" problem. I've notified Stuart
> about the new binaries, so they should get installed soon, and that should
> take care of things.
>
> Kent Wenger
> Condor Team
>
>
> ========================================
> MESSAGE INFORMATION
> ========================================
> * From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
> * Ticket Email List: kipp__AT__gravity.phys.uwm.edu, anderson__AT__ligo.caltech.edu
>
>
===========================================================================
Date mail was appended: Wed Jun 13 23:09:06 2007 (1181794147)
Date: Fri, 15 Jun 2007 09:19:19 -0500 (CDT)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-admin #15687] LIGO Re: Re: dagman depth-first not working
Kipp,
> Great, thanks a lot!
You're welcome!
> By the way, the depth-first feature, even in its preliminary form, is
> really fantastic! It really smooths out the DAG's running. Fewer files
> accumulate, they don't grow to as large a size, and if there's a problem
> with the jobs in the DAG I find out within hours rather than after 3 days
> of running. And so many little things just seem to work better.
>
> Here's an example. In our data analysis pipelines, the first thing the
> DAG does is run a large number of "data find" queries to metadata servers
> to locate the instrument data on the cluster filesystem. Because of the
> need to throttle database server connections, and other things, the
> queries have to be run in sequence; we can't launch 1200 query jobs onto
> the cluster all at once. They can take many, many hours to run through.
> But with the depth-first dagman, after the first query runs, the jobs that
> use its output are launched and begin processing the data. This happens
> in parallel with the next query, so while the queries are being run in
> sequence the analysis jobs now run in parallel alongside them. The
> result is that the DAG requires almost 1/2 day less run time to complete,
> and the cluster nodes aren't as idle as they otherwise would've been.
>
> I've been telling everyone how great the depth-first feature is...
Cool! Thanks for the positive feedback.
Hopefully we can get true depth-first traversal, and more flexible
priorities, implemented pretty soon.
Kent Wenger
Condor Team
===========================================================================
Date mail was appended: Fri Jun 15 9:19:21 2007 (1181917162)
Subject: Actions
Ticket resolved by wenger
===========================================================================
Date of actions: Fri Jun 15 9:22:58 2007 (1181917379)