LIGO Support Ticket 19456

Ticket Information
  Number:      admin 19456
  User:        kcannon@ligo.caltech.edu
  Email:       anderson__AT__ligo.caltech.edu,dabrown__AT__physics.syr.edu,skoranda__AT__gravity.phys.uwm.edu,aplundgr__AT__syr.edu,cdcapano__AT__physics.syr.edu
  Status:      resolved
  Assigned To: wenger
Date: Sun, 5 Jul 2009 17:20:36 -0700 (PDT)
From: Kipp Cannon <kcannon__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
CC: Stuart Anderson <anderson__AT__ligo.caltech.edu>,        Duncan Brown
 <dabrown__AT__physics.syr.edu>,        Scott Koranda
 <skoranda__AT__gravity.phys.uwm.edu>,        Andrew P Lundgren
 <aplundgr__AT__syr.edu>,        Collin Capano <cdcapano__AT__physics.syr.edu>
Subject: LIGO: invalid rescue dags written by dagman
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu

Hi,

During the upgrade mentioned by Stuart in his e-mail included below, I had 
a dag running that consisted of a super-dag and several sub-dags.  At 
about the time of the upgrade (2009-06-27, 11:19 local) the dag crashed. 
The super-dag dagman wrote a .rescue dag.  When I attempted to re-launch 
the dag (run the rescue dag), dagman refused claiming the .rescue dag was 
invalid on account of there being nodes not marked DONE possessing child 
nodes that were marked DONE.

Checking the .rescue confirmed the error message:  there were jobs 
apparently not complete possessing child jobs marked DONE.  Checking the 
sub-dags confirmed that the parent jobs in question had completed 
successfully, and also that the children marked DONE had been launched and 
completed successfully.  A search revealed several other sub-dags that had 
completed successfully but were not marked DONE in the super-dag's 
.rescue.
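
A minimal sketch of the rule dagman reported, for anyone who wants to
audit a rescue dag by hand: it flags every parent/child pair where the
child is marked DONE but the parent is not.  This is not dagman's own
code.  It assumes the rescue file uses plain JOB, SUBDAG EXTERNAL, and
PARENT ... CHILD lines with an optional trailing DONE keyword; the
function name find_inconsistent_nodes and the sample dag are
hypothetical.

```python
def find_inconsistent_nodes(dag_text):
    """Return sorted (parent, child) pairs where the child is DONE
    but the parent is not."""
    done = {}    # node name -> True if its line ends with DONE
    edges = []   # (parent, child) pairs from PARENT ... CHILD lines
    for raw in dag_text.splitlines():
        tokens = raw.split()
        if not tokens or tokens[0].startswith("#"):
            continue
        keyword = tokens[0].upper()
        if keyword == "JOB" and len(tokens) >= 3:
            # Assumed layout: JOB <name> <submit file> ... [DONE]
            done[tokens[1]] = len(tokens) >= 4 and tokens[-1].upper() == "DONE"
        elif keyword == "SUBDAG" and len(tokens) >= 4:
            # Assumed layout: SUBDAG EXTERNAL <name> <dag file> ... [DONE]
            done[tokens[2]] = len(tokens) >= 5 and tokens[-1].upper() == "DONE"
        elif keyword == "PARENT":
            upper = [t.upper() for t in tokens]
            if "CHILD" not in upper:
                continue
            split = upper.index("CHILD")
            parents, children = tokens[1:split], tokens[split + 1:]
            edges.extend((p, c) for p in parents for c in children)
    return sorted(
        (p, c) for p, c in edges
        if done.get(c, False) and not done.get(p, False)
    )


if __name__ == "__main__":
    # Hypothetical four-node dag: A is not DONE, but its child B is.
    sample = """
    JOB A a.sub
    JOB B b.sub DONE
    JOB C c.sub DONE
    SUBDAG EXTERNAL D d.dag
    PARENT A CHILD B
    PARENT B CHILD C D
    """
    for parent, child in find_inconsistent_nodes(sample):
        print(f"{child} is DONE but its parent {parent} is not")
```

Any pair it reports is a candidate for the kind of manual correction
described in the next paragraph.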

I manually edited the super-dag's .rescue to indicate the correct exit 
status of the sub-dags and re-launched.  At the time of the initial crash, 
nearly all of the ~600,000 jobs in the dags had completed.  There were 
perhaps only ~1000 jobs left to run.  ~100 jobs had failed a week earlier 
due to filesystem issues but otherwise the dag had basically run correctly 
through to completion.  Upon launching the super-dag's corrected .rescue 
dag, as the sub-dags started, more than 100,000 jobs in the sub-dags showed 
up in the "ready" and "unready" states as though they had not run 
successfully.  As they ran, many of them failed.

I decided the dag was unrecoverable, rm -Rf'ed it, and am now attempting 
to re-run it from scratch.  I preserved many of the log files if they are 
of interest.  They are available at

http://www.ligo.caltech.edu/~kcannon/logs.tar.gz

The file is 261 MB.  When/if it is not needed, please let me know and I 
will delete it.  Thanks,

 							-Kipp



On Sat, 4 Jul 2009, Stuart Anderson wrote:

> We ran into a few problems on the LIGO condor pool at CIT with large DAGs 
> that were running when the pool was upgraded from 7.2.3 to 7.2.4. To get 
> started, here is one example of a user log file that shows a version mismatch 
> error. The procedure I followed was to run rpm -Uvh to install the 7.2.4 RPM 
> and let the condor daemons auto-detect the new version and do a clean 
> restart. However, this does not appear to be the right procedure when there 
> are active DAGs in the pool.
>
> What is the recommended way to upgrade a pool that is actively running DAGs?
>
> The second part of this ticket is why the user observed that "most of the dag 
> has run twice"?
>
> Note, there will be a second ticket opened where a much larger DAG was left 
> in a corrupt state after this upgrade and could not be manually repaired but 
> had to be completely re-run.
>
> Thanks.
>
>
> Begin forwarded message:
>
>> From: Andrew P Lundgren <aplundgr__AT__syr.edu>
>> Date: July 4, 2009 6:33:48 AM PDT
>> To: Stuart Anderson <anderson__AT__ligo.caltech.edu>, Duncan Brown 
>> <dabrown__AT__physics.syr.edu>
>> Cc: Kipp Cannon <kcannon__AT__ligo.caltech.edu>, Collin Capano 
>> <cdcapano__AT__physics.syr.edu>
>> Subject: RE: dag
>> 
>> Hi Stuart,
>> 
>> I've been looking through Collin's log files, figuring out run times on the 
>> E14 data.  After the SIGTERM, the dags restarted and then quit because of a 
>> version mismatch in condor.  They came back again (I think a few hours 
>> later) and ran normally this time.  However, before the SIGTERM, more than 
>> 2000 jobs had finished in the full_data part of the dag.  After, only 380 
>> jobs were marked as done.  So most of the dag has run twice.  Any idea what 
>> happened?
>> 
>> The dag was running on ldas-grid in 
>> /archive/home/cdcapano/e14/ihoperun/928875615-929134815.
>> 
>> Thanks,
>> Andy
>> 
>> Here are some lines from the dagman.out file.
>> 
>> 6/28 04:35:37  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
>> 6/28 04:35:37   ===     ===      ===     ===     ===        ===      ===
>> 6/28 04:35:37  2298       0        7       0       0         95        1
>> ...
>> 6/28 17:05:42 Got SIGTERM. Performing graceful shutdown.
>> 6/28 17:05:42 **** condor_scheduniv_exec.49881787.0 (condor_DAGMAN) pid 
>> 3313 EXITING WITH STATUS 3
>> ...
>> 6/28 17:08:04 Lock file inspiral_hipe_full_data.FULL_DATA.dag.lock 
>> detected,
>> 6/28 17:08:04 Duplicate DAGMan PID 3313 is no longer alive; this DAGMan 
>> should continue.
>> ...
>> 6/28 17:08:16 **** condor_scheduniv_exec.49881787.0 (condor_DAGMAN) pid 
>> 3483 EXITING WITH STATUS 1
>> 6/28 17:08:41 ******************************************************
>> 6/28 17:08:41 ** condor_scheduniv_exec.50455804.0 (CONDOR_DAGMAN) STARTING 
>> UP
>> ...
>> 6/28 17:08:41 Version mismatch: condor_submit_dag ($CondorVersion: 7.2.3 
>> May 11 2009 BuildID: 151729 $) vs. condor_dagman ($CondorVersion: 7.2.4 Jun 
>> 15 2009 BuildID: 159529 $)
>> 6/28 17:08:41 **** condor_scheduniv_exec.50455804.0 (condor_DAGMAN) pid 
>> 3788 EXITING WITH STATUS 1
>> 6/29 14:47:49 ******************************************************
>> 6/29 14:47:49 ** condor_scheduniv_exec.50527579.0 (CONDOR_DAGMAN) STARTING 
>> UP
>> ...
>> 6/29 14:47:49 Found rescue DAG number 1; running 
>> inspiral_hipe_full_data.FULL_DATA.dag.rescue001 instead of normal DAG file
>> 6/29 14:47:49 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> 6/29 14:47:49 RUNNING RESCUE DAG 
>> inspiral_hipe_full_data.FULL_DATA.dag.rescue001
>> ...
>> 6/29 14:48:16  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
>> 6/29 14:48:16   ===     ===      ===     ===     ===        ===      ===
>> 6/29 14:48:16   384       0        5       0     160       1852        0
>> 
>
> --
> Stuart Anderson  anderson__AT__ligo.caltech.edu
> http://www.ligo.caltech.edu/~anderson
>
>
>

===========================================================================
Date of creation: Sun Jul  5 19:21:00 2009 (1246839663)
Subject: Actions

Assigned to psilord by jfrey
===========================================================================
Date of actions: Mon Jul  6 14:24:56 2009 (1246908296)
Subject: Actions

Assigned to wenger by psilord
===========================================================================
Date of actions: Wed Jul  8 10:40:25 2009 (1247067625)
Date: Wed, 8 Jul 2009 16:52:40 -0500 (CDT)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: psilord <condor-admin__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-admin #19456] LIGO: invalid rescue dags written by dagman

Kipp,

> During the upgrade mentioned by Stuart in his e-mail included below, I had
> a dag running that consisted of a super-dag and several sub-dags.  At
> about the time of the upgrade (2009-06-27, 11:19 local) the dag crashed.
> The super-dag dagman wrote a .rescue dag.  When I attempted to re-launch
> the dag (run the rescue dag), dagman refused claiming the .rescue dag was
> invalid on account of there being nodes not marked DONE possessing child
> nodes that were marked DONE.
>
> Checking the .rescue confirmed the error message:  there were jobs
> apparently not complete possessing child jobs marked DONE.  Checking the
> sub-dags confirmed that the parent jobs in question had completed
> successfully, and also that the children marked DONE had been launched and
> completed successfully.  A search revealed several other sub-dags that had
> completed successfully but were not marked DONE in the super-dag's
> .rescue.
>
> I manually edited the super-dag's .rescue to indicate the correct exit
> status of the sub-dags and re-launched.  At the time of the initial crash,
> nearly all of the ~600,000 jobs in the dags had completed.  There were
> perhaps only ~1000 jobs left to run.  ~100 jobs had failed a week earlier
> due to filesystem issues but otherwise the dag had basically run correctly
> through to completion.  Upon launching the super-dag's corrected .rescue
> dag, as the sub-dags started, more than 100,000 jobs in the sub-dags showed
> up in the "ready" and "unready" states as though they had not run
> successfully.  As they ran, many of them failed.
>
> I decided the dag was unrecoverable, rm -Rf'ed it, and am now attempting
> to re-run it from scratch.  I preserved many of the log files if they are
> of interest.  They are available at
>
> http://www.ligo.caltech.edu/~kcannon/logs.tar.gz
>
> The file is 261 MB.  When/if it is not needed, please let me know and I
> will delete it.  Thanks,

Okay, I have the tarball.

I'll have to see what I can figure out.  I don't recall a case before of 
DAGMan writing a corrupted rescue DAG.

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Wed Jul  8 16:52:42 2009 (1247089962)
Date: Wed, 8 Jul 2009 17:18:22 -0500 (CDT)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: psilord <condor-admin__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-admin #19456] LIGO: invalid rescue dags written by dagman

Kipp,

I have a question about the logs you sent.  There is a pretty big gap in
the DAGMan.out file:

6/27 11:16:49 POST Script of Node e89e7e85711ef98acd1c8e7828312412 failed 
with status 1
6/27 11:16:49 LOG LINE CACHE: End Flush
6/27 11:17:08 **** condor_scheduniv_exec.16186971.0 (condor_DAGMAN) pid 
14405 EXITING WITH STATUS 1
6/28 15:18:58 ******************************************************
6/28 15:18:58 ** condor_scheduniv_exec.17263648.0 (CONDOR_DAGMAN) STARTING 
UP

Unfortunately, some time in there is when the offending rescue DAG got
written:

manta(182)% ls -l *rescue*
-rw-r--r-- 1 wenger wenger 33365 Jun 19 03:36 highmass_ihope.dag.rescue001
-rw-r--r-- 1 wenger wenger 33336 Jun 27 13:16 highmass_ihope.dag.rescue002
manta(183)%

Do you know of anything especially weird that happened during the gap? 
None of the dagman.out files seems to have anything for that time period.
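
A minimal sketch of how such a gap can be located automatically,
assuming every dagman.out record begins with an "M/D HH:MM:SS" stamp
as in the excerpt above.  The stamps carry no year, so the year
argument is an assumption, and the function name largest_gap is
hypothetical.

```python
import re
from datetime import datetime

# Timestamp prefix as it appears in the excerpt, e.g. "6/27 11:17:08".
STAMP = re.compile(r"^(\d{1,2})/(\d{1,2}) (\d{2}):(\d{2}):(\d{2})")


def largest_gap(lines, year=2009):
    """Return (gap, before, after) for the largest jump between
    consecutive timestamps, or None if fewer than two stamps exist."""
    previous = None
    best = None
    for line in lines:
        match = STAMP.match(line)
        if not match:
            continue  # wrapped continuation lines carry no timestamp
        month, day, hour, minute, second = map(int, match.groups())
        current = datetime(year, month, day, hour, minute, second)
        if previous is not None:
            gap = current - previous
            if best is None or gap > best[0]:
                best = (gap, previous, current)
        previous = current
    return best


if __name__ == "__main__":
    excerpt = [
        "6/27 11:16:49 LOG LINE CACHE: End Flush",
        "6/27 11:17:08 **** condor_scheduniv_exec.16186971.0 (condor_DAGMAN) pid",
        "14405 EXITING WITH STATUS 1",
        "6/28 15:18:58 ******************************************************",
    ]
    gap, before, after = largest_gap(excerpt)
    print(f"largest gap: {gap} between {before} and {after}")
```

The sketch assumes the log stays within one calendar year; a log that
spans New Year would need the year inferred some other way.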

--
R. Kent Wenger (wenger__AT__cs.wisc.edu, 608-262-6627,
http://www.cs.wisc.edu/~wenger/)
Computer Sciences Department
University of Wisconsin-Madison


===========================================================================
Date mail was appended: Wed Jul  8 17:18:27 2009 (1247091508)
Date: Wed, 8 Jul 2009 17:30:27 -0500 (CDT)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: psilord <condor-admin__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-admin #19456] LIGO: invalid rescue dags written by dagman

Kipp,

Hmm, one other question -- are any of the log files for the subdags and 
their nodes on NFS?  I haven't tested this out yet, but so far the only 
scenario I've thought of for generating the corrupt rescue DAG is that 
DAGMan misses reading some events in recovery mode.  *Maybe* that could 
be explained by some kind of file system flakiness...

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Wed Jul  8 17:30:33 2009 (1247092233)
Date: Wed, 8 Jul 2009 17:27:11 -0700 (PDT)
From: Kipp Cannon <kcannon__AT__ligo.caltech.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: anderson__AT__ligo.caltech.edu, dabrown__AT__physics.syr.edu,
 skoranda__AT__gravity.phys.uwm.edu, aplundgr__AT__syr.edu, cdcapano__AT__physics.syr.edu
Subject: Re: [condor-admin #19456] LIGO: invalid rescue dags written by dagman
X-Seen-BY: mailfromd 4.1 silica.cs.wisc.edu

Hi Kent,

Sorry, no, I don't understand it.  In fact, all during that time I was 
regularly running a script that parses the .dagman.out files and prints a 
summary of the progress of the DAGs and it was giving me changing results 
suggesting the logs were being written to during that time.
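
For reference, a minimal sketch of that kind of progress summary.
This is not the script mentioned above.  It assumes the seven-column
status table (Done, Pre, Queued, Post, Ready, Un-Ready, Failed) shown
in the dagman.out excerpt earlier in this ticket; the names FIELDS,
ROW, and status_rows are hypothetical.

```python
import re

FIELDS = ("done", "pre", "queued", "post", "ready", "unready", "failed")

# A status row is a timestamp followed by exactly seven integers.
ROW = re.compile(
    r"^(\d{1,2}/\d{1,2} \d{2}:\d{2}:\d{2})\s+"
    + r"\s+".join([r"(\d+)"] * len(FIELDS))
    + r"\s*$"
)


def status_rows(lines):
    """Return a list of (timestamp, counts) for every status row found."""
    rows = []
    for line in lines:
        match = ROW.match(line)
        if match:
            counts = dict(zip(FIELDS, map(int, match.groups()[1:])))
            rows.append((match.group(1), counts))
    return rows


if __name__ == "__main__":
    excerpt = [
        "6/28 04:35:37  Done     Pre   Queued    Post   Ready   Un-Ready   Failed",
        "6/28 04:35:37   ===     ===      ===     ===     ===        ===      ===",
        "6/28 04:35:37  2298       0        7       0       0         95        1",
        "6/29 14:48:16   384       0        5       0     160       1852        0",
    ]
    for stamp, counts in status_rows(excerpt):
        total = sum(counts.values())
        print(f"{stamp}: {counts['done']}/{total} done, {counts['failed']} failed")
```

A drop in the done count between two consecutive rows, as in the
excerpt, is a cheap signal that a dag restarted with less state than
it had before.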

Stuart, was there any file system issue that you might be aware of that 
might've caused the file to be truncated when the program exited?  It's 
hard to imagine a week of the log file disappearing.

 							-Kipp



===========================================================================
Date mail was appended: Wed Jul  8 19:27:32 2009 (1247099252)
Date: Wed, 8 Jul 2009 17:30:54 -0700 (PDT)
From: Kipp Cannon <kcannon__AT__ligo.caltech.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: anderson__AT__ligo.caltech.edu, dabrown__AT__physics.syr.edu,
 skoranda__AT__gravity.phys.uwm.edu, aplundgr__AT__syr.edu, cdcapano__AT__physics.syr.edu
Subject: Re: [condor-admin #19456] LIGO: invalid rescue dags written by dagman
X-Seen-BY: mailfromd 4.1 obsidian.cs.wisc.edu

Hi Kent,

The .dagman.out log files for all of the subdags should be there, in 
subdirectories, as well as the .dagman.log files (the condor log of the 
dagman jobs), but none of the logs for the individual jobs in the subdags. 
Those files numbered in the millions.

 							-Kipp


===========================================================================
Date mail was appended: Wed Jul  8 19:31:13 2009 (1247099473)
Date: Thu, 9 Jul 2009 12:00:36 -0500 (CDT)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-admin #19456] LIGO: invalid rescue dags written by dagman

Kipp,

> Sorry, no, I don't understand it.  In fact, all during that time I was
> regularly running a script that parses the .dagman.out files and prints a
> summary of the progress of the DAGs and it was giving me changing results
> suggesting the logs were being written to during that time.
>
> Stuart, was there any file system issue that you might be aware of that
> might've caused the file to be truncated when the program exited?  It's
> hard to imagine a week of the log file disappearing.

Well, the main dagman.out file is missing a little over a day.

Anyhow, my suggestion is this:  that some of the node job log files had 
stale file handles or something similar that *temporarily* prevented 
DAGMan from reading those log files.  If some files were unreadable and 
some were still readable, this would explain how DAGMan got into an 
illegal state internally.  So far, the only way I've been able to think of 
that the bad state could happen is some problem reading events in recovery 
mode.

I just verified that a problem reading events in recovery mode can, in 
fact, get DAGMan into such an illegal internal state.  It turns out that 
DAGMan does not check for any kind of consistent inter-node state while 
reading events in recovery mode, so it would just read the events it 
could, and then exit from recovery mode, and start running the jobs that 
were considered ready but not yet run -- which could basically be a random 
set depending on which log files it could read.  In fact, in the test I
just ran, I got rid of the event for the first node in the test DAG.
DAGMan happily read the events for all of the other nodes, finished 
recovery mode, and then decided that the first node needed to be run.
Once that finished, it then decided that the DAG had been correctly 
completed!  So we obviously need some more checking of the state of the 
DAG!
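
A minimal sketch of the test scenario and of the kind of inter-node
check that is missing, using a toy in-memory model rather than
DAGMan's actual data structures.  The functions recover, ready_nodes,
and inconsistent_nodes and the three-node DAG are hypothetical.

```python
def recover(parents, completed_events):
    """Replay completion events; return the set of nodes considered done.
    Nodes whose events could not be read are simply absent."""
    return {node for node in parents if node in completed_events}


def ready_nodes(parents, done):
    """Nodes that are not done and whose parents are all done."""
    return sorted(
        node for node, node_parents in parents.items()
        if node not in done and all(p in done for p in node_parents)
    )


def inconsistent_nodes(parents, done):
    """Nodes marked done although at least one parent is not done."""
    return sorted(
        node for node in done
        if any(p not in done for p in parents[node])
    )


if __name__ == "__main__":
    # Toy chain A -> B -> C; every node actually finished, but the
    # completion event for A is unreadable during recovery.
    parents = {"A": [], "B": ["A"], "C": ["B"]}
    done = recover(parents, completed_events={"B", "C"})

    print("ready after recovery:", ready_nodes(parents, done))
    print("inconsistent after recovery:", inconsistent_nodes(parents, done))

    # Without a check, A is simply re-run and the DAG looks complete.
    done.add("A")
    print("ready after re-running A:", ready_nodes(parents, done))
```

Running inconsistent_nodes at the end of recovery, before any job is
submitted, is one place such a check could go; whether the right
response is to abort or to retry reading the logs is a separate
decision.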

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Thu Jul  9 12:00:42 2009 (1247158842)
Date: Thu, 9 Jul 2009 15:20:30 -0700 (PDT)
From: Kipp Cannon <kcannon__AT__ligo.caltech.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: anderson__AT__ligo.caltech.edu, dabrown__AT__physics.syr.edu,
 skoranda__AT__gravity.phys.uwm.edu, aplundgr__AT__syr.edu, cdcapano__AT__physics.syr.edu
Subject: Re: [condor-admin #19456] LIGO: invalid rescue dags written by dagman
X-Seen-BY: mailfromd 4.1 granite.cs.wisc.edu

Hi Kent,

Great, sounds like some good will come of my disaster :-).  Thanks,

 							-Kipp


===========================================================================
Date mail was appended: Thu Jul  9 17:20:50 2009 (1247178050)
Date: Thu, 9 Jul 2009 17:33:00 -0500 (CDT)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-admin #19456] LIGO: invalid rescue dags written by dagman

Kipp,

> Great, sounds like some good will come of my disaster :-).  Thanks,

Yes, we'll have to decide how to deal with this in the best way.  I'm 
pretty sure that DAGMan should quit, but maybe *not* write a rescue DAG, 
since the rescue DAG will be goofed up.  Maybe it should leave the lock 
file hanging around so that a subsequent DAGMan run will start in recovery 
mode -- that would get you into the correct state if all of the log files 
were again readable.

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Thu Jul  9 17:33:10 2009 (1247178791)
Subject: Actions

Status changed from open to bug by wenger
===========================================================================
Date of actions: Fri Jul 31 14:35:56 2009 (1249068956)
Date: Thu, 24 Sep 2009 15:33:49 -0500 (CDT)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: psilord <condor-admin__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-admin #19456] LIGO: invalid rescue dags written by dagman

Kipp,

> During the upgrade mentioned by Stuart in his e-mail included below, I had
> a dag running that consisted of a super-dag and several sub-dags.  At
> about the time of the upgrade (2009-06-27, 11:19 local) the dag crashed.
> The super-dag dagman wrote a .rescue dag.  When I attempted to re-launch
> the dag (run the rescue dag), dagman refused claiming the .rescue dag was
> invalid on account of there being nodes not marked DONE possessing child
> nodes that were marked DONE.
>
> Checking the .rescue confirmed the error message:  there were jobs
> apparently not complete possessing child jobs marked DONE.  Checking the
> sub-dags confirmed that the parent jobs in question had completed
> successfully, and also that the children marked DONE had been launched and
> completed successfully.  A search revealed several other sub-dags that had
> completed successfully but were not marked DONE in the super-dag's
> .rescue.

We have a fix for this now.  Unfortunately, it missed 7.3.2 -- I seem to 
have committed it on the 7.5 branch for some reason.  I'm going to see if 
we can cherry-pick this over to 7.4.0.

How high a priority is this for you?  Do you feel like you need 7.5.0 
pre-release binaries?  (That might get a bit complicated, because of 
another fix I just made that's in the 7.4.0 pre-release binaries I just 
sent you guys.  I guess we could merge 7.4 to 7.5, and then 7.5 would
contain all of the bug fixes.)

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Thu Sep 24 15:33:55 2009 (1253824435)
Subject: Actions

Status changed from bug to pending by wenger
===========================================================================
Date of actions: Thu Sep 24 15:34:58 2009 (1253824498)
CC: kcannon__AT__ligo.caltech.edu, dabrown__AT__physics.syr.edu,
 skoranda__AT__gravity.phys.uwm.edu, aplundgr__AT__syr.edu, cdcapano__AT__physics.syr.edu
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #19456] LIGO: invalid rescue dags written by dagman
Date: Thu, 24 Sep 2009 17:45:40 -0700
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu


On Sep 24, 2009, at 1:33 PM, condor-admin response tracking system  
wrote:

> Kipp,
>
>> During the upgrade mentioned by Stuart in his e-mail included below, I had
>> a dag running that consisted of a super-dag and several sub-dags.  At
>> about the time of the upgrade (2009-06-27, 11:19 local) the dag crashed.
>> The super-dag dagman wrote a .rescue dag.  When I attempted to re-launch
>> the dag (run the rescue dag), dagman refused claiming the .rescue dag was
>> invalid on account of there being nodes not marked DONE possessing child
>> nodes that were marked DONE.
>>
>> Checking the .rescue confirmed the error message:  there were jobs
>> apparently not complete possessing child jobs marked DONE.  Checking the
>> sub-dags confirmed that the parent jobs in question had completed
>> successfully, and also that the children marked DONE had been launched and
>> completed successfully.  A search revealed several other sub-dags that had
>> completed successfully but were not marked DONE in the super-dag's
>> .rescue.
>
> We have a fix for this now.  Unfortunately, it missed 7.3.2 -- I seem to
> have committed it on the 7.5 branch for some reason.  I'm going to see if
> we can cherry-pick this over to 7.4.0.
>
> How high a priority is this for you?  Do you feel like you need 7.5.0
> pre-release binaries?  (That might get a bit complicated, because of
> another fix I just made that's in the 7.4.0 pre-release binaries I just
> sent you guys.  I guess we could merge 7.4 to 7.5, and then 7.5 would
> contain all of the bug fixes.)

Kent,
	I think this is an important enough bug fix that I would like to
encourage you to try and get it into the 7.4.0 release.

Thanks.

--
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson




===========================================================================
Date mail was appended: Thu Sep 24 19:45:53 2009 (1253839554)
Date: Sat, 26 Sep 2009 10:46:40 -0700 (PDT)
From: Kipp Cannon <kcannon__AT__ligo.caltech.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: anderson__AT__ligo.caltech.edu, dabrown__AT__physics.syr.edu,
 skoranda__AT__gravity.phys.uwm.edu, aplundgr__AT__syr.edu, cdcapano__AT__physics.syr.edu
Subject: Re: [condor-admin #19456] LIGO: invalid rescue dags written by dagman
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu

On Thu, 24 Sep 2009, condor-admin response tracking system wrote:

> Kipp,
>
>> During the upgrade mentioned by Stuart in his e-mail included below, I had
>> a dag running that consisted of a super-dag and several sub-dags.  At
>> about the time of the upgrade (2009-06-27, 11:19 local) the dag crashed.
>> The super-dag dagman wrote a .rescue dag.  When I attempted to re-launch
>> the dag (run the rescue dag), dagman refused claiming the .rescue dag was
>> invalid on account of there being nodes not marked DONE possessing child
>> nodes that were marked DONE.
>>
>> Checking the .rescue confirmed the error message:  there were jobs
>> apparently not complete possessing child jobs marked DONE.  Checking the
>> sub-dags confirmed that the parent jobs in question had completed
>> successfully, and also that the children marked DONE had been launched and
>> completed successfully.  A search revealed several other sub-dags that had
>> completed successfully but were not marked DONE in the super-dag's
>> .rescue.
>
> We have a fix for this now.  Unfortunately, it missed 7.3.2 -- I seem to
> have committed it on the 7.5 branch for some reason.  I'm going to see if
> we can cherry-pick this over to 7.4.0.
>
> How high a priority is this for you?  Do you feel like you need 7.5.0
> pre-release binaries?  (That might get a bit complicated, because of
> another fix I just made that's in the 7.4.0 pre-release binaries I just
> sent you guys.  I guess we could merge 7.4 to 7.5, and then 7.5 would
> contain all of the bug fixes.)
>
> Kent Wenger
> Condor Team

Hi Kent,

When it happened it was a significant inconvenience.  It hit at least 
three users and set back one of our analyses of our last science run by 
about a month.  However, as I understand it there is a simple work-around 
which is for admins to be careful to not upgrade condor while dags are 
running.  I believe that if they shut down dags first and then restart them 
after the upgrade it should be fine.  If that's true then I think it's not 
a high priority.  What I mean is that making it not happen is a very high 
priority, but there seems to be a simple way to achieve that in the short 
term while the proper fix makes its way through the pipe.

 							-Kipp



===========================================================================
Date mail was appended: Sat Sep 26 12:46:55 2009 (1253987215)
Date: Mon, 28 Sep 2009 12:01:30 -0500 (CDT)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-admin #19456] LIGO: invalid rescue dags written by dagman

Kipp and Stuart,

I have new DAGMan binaries available at the following URL:
ftp://ftp.cs.wisc.edu/condor/temporary/forligo/dagman-7.4.0-prerelease-2009-09-26/

These binaries have the following 7.4.0 improvements:

* The condor-support #7835/gittrac #744 problem (DAGMan errors out on
abnormal job termination in recovery mode) is fixed.

* The condor-admin #19456/gittrac #622 problem (invalid rescue DAG generated
because of event reading problems in recovery mode) is fixed.

* The "Warning: log file for node xxx already monitored" messages are
not printed at the default verbosity.

They also have all improvements from 7.3.2, such as the "lazy log file
evaluation" and the automatic default log file for node jobs that don't
specify a log file.  You may also want to explore setting the new
DAGMAN_USER_LOG_SCAN_INTERVAL configuration variable to a smaller value
than its default of 5 seconds, for DAGs that have many very small jobs.

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Mon Sep 28 12:01:38 2009 (1254157298)
Subject: Actions

Ticket resolved by wenger
===========================================================================
Date of actions: Mon Sep 28 12:01:56 2009 (1254157316)