LIGO Support Ticket 19273

Ticket Information
  Number:      admin 19273
  User:        dabrown@physics.syr.edu
  Email:       anderson__AT__ligo.caltech.edu,skoranda__AT__gravity.phys.uwm.edu,patrick__AT__gravity.phys.uwm.edu,jclayton__AT__gravity.phys.uwm.edu,gskelton__AT__gravity.phys.uwm.edu,rosso__AT__gravity.phys.uwm.edu
  Status:      resolved
  Assigned To: wenger
CC: Stuart Anderson <anderson__AT__ligo.caltech.edu>,        Scott Koranda
 <skoranda__AT__gravity.phys.uwm.edu>,        Patrick Brady
 <patrick__AT__gravity.phys.uwm.edu>,        Jessica Clayton
 <jclayton__AT__gravity.phys.uwm.edu>
From: Duncan Brown <dabrown__AT__physics.syr.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
Subject: LIGO: bug in dagman recovery procedure for retried jobs
Date: Wed, 6 May 2009 23:35:43 -0400

Hi Kent and Pete,

I think Jessica has found a bug in the dagman recovery procedure for  
DAGs that have retried nodes. I started to write you a (long) email  
about this but then I figured I could just come up with a test case  
which demonstrates the problem:

If a user tries to recover a dag which contains retried jobs, the  
recovery fails when dagman encounters the first retried node from the  
previous DAG. Here's a test case:

[dbrown@sugar ~]$ cat jessica.dag
JOB 1 sleep.sub
RETRY 1 1
JOB 2 sleep.sub
RETRY 2 1
PARENT 1 CHILD 2

[dbrown@sugar ~]$ cat sleep.sub
universe = vanilla
executable = /bin/sleep
arguments = 300
log = /tmp/duncantest.log
output = sleep.$(cluster).out
error = sleep.$(cluster).err
queue

I started jessica.dag, logged into the node that was running the sleep  
job and sent it a kill -9 so it failed. dagman then re-tried the sleep  
job, as it is supposed to. I then triggered the recovery procedure in  
dagman by doing a condor_hold on the dagman job and then a  
condor_release once it exited. The new dagman process started up in  
recovery mode, but failed with the error message

5/6 23:24:39 ERROR: node 1: job ID in userlog submit event (2847958.0)  
doesn't match ID reported earlier by submit command (2847957.0)!   
Aborting DAG; set DAGMAN_ABORT_ON_SCARY_SUBMIT to false if you are  
*sure* this shouldn't cause an abort.
5/6 23:24:39 Aborting DAG...
5/6 23:24:39 Writing Rescue DAG to jessica.dag.rescue001...

when it encountered the second submit message in the condor job log  
file during recovery. This is a problem for us, as in a large dag the  
retried node could occur early in the workflow, so a lot of  
completed work goes un-recovered when dagman writes the rescue dag.
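
One way to quantify how much completed work a rescue dag has lost is to count the nodes it marks as done and compare that with the count reported in dagman.out. The following is only a minimal Python sketch. It assumes the rescue dag marks finished nodes either with a trailing DONE on the JOB line or with a separate "DONE <node>" line, and the function name count_done_nodes is hypothetical:

```python
def count_done_nodes(dag_text):
    """Count the distinct nodes a (rescue) DAG file marks as DONE.

    Assumption: a finished node appears either as
        JOB <name> <submit file> ... DONE
    or as a standalone line
        DONE <name>
    """
    done = set()
    for raw in dag_text.splitlines():
        tokens = raw.split()
        # Skip blank lines and comment lines.
        if not tokens or tokens[0].startswith("#"):
            continue
        keyword = tokens[0].upper()
        if keyword == "JOB" and len(tokens) >= 3 and tokens[-1].upper() == "DONE":
            done.add(tokens[1])
        elif keyword == "DONE" and len(tokens) >= 2:
            done.add(tokens[1])
    return len(done)


if __name__ == "__main__":
    import sys

    # Usage: python count_done.py jessica.dag.rescue001
    with open(sys.argv[1]) as handle:
        print(count_done_nodes(handle.read()))
```

Running this on the rescue dag and comparing the result with the last "Done" count in dagman.out gives a rough measure of how many completed nodes were not recovered.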

Please could you take a look at the recovery code in dagman and see if  
you can fix it to handle recovering dags where a retry has occurred in  
the previous dag?

Jessica's original email and my initial thoughts on debugging are  
below, but this test case demonstrates the problem.

Cheers,
Duncan.

> On May 6, 2009, at 10:20 PM, Jessica Clayton wrote:
>> Recently we have been experiencing a strange problem when running  
>> the CBC pipeline on Nemo. We've now seen the problem happen at  
>> least five times in recent weeks and that has led us to dig into  
>> our log files. Patrick and I think that we have evidence for a bug  
>> in Condor, so we wanted to pass this information along so that you  
>> could see what you think. Here's a little background on what's  
>> happening.
>>
>> First, an ihope dag is submitted. While the dag is running, Condor  
>> is shut down (either because of maintenance or because a user did  
>> something bad). Condor is restarted and starts the dag recovery  
>> process. It must try to figure out which jobs have already  
>> completed and which jobs still need to be run. Generally, one can  
>> see in the dagman.out that this process appears to go smoothly.  
>> However, suddenly the dagman.out file has an error message about  
>> condor_ids not matching for a particular job. At this point, it  
>> simply gives up and writes a rescue dag regardless of how many jobs  
>> it has checked to see whether or not they were already done. The  
>> result is a bad rescue dag that does not accurately reflect the  
>> actual number of complete jobs, but instead reports far fewer  
>> completed jobs than there actually are.

I have seen the "Aborting DAG; set DAGMAN_ABORT_ON_SCARY_SUBMIT to  
false" message Jessica encounters in the following circumstances:

1. Running two (separate) DAGs which had the same node names, with the  
jobs in both dags writing their log messages to the same log file. The  
submit messages from one dagman were confusing the other.

2. The condor job log file was on an NFS-mounted filesystem.

However, Jessica is generating her DAGs using ihope rather than by  
hand, so she's not doing (1) and I'm pretty sure she's not doing (2),  
so I think that there may be a problem in the dagman recovery procedure.

I see these messages in the following order in the condor job log  
file (I have removed the irrelevant lines from other jobs, but  
made sure to preserve the order in which the messages appear):

000 (17487427.000.000) 04/28 04:31:21 Job submitted from host:  
<192.168.0.13:34296>
     DAG Node: b36d4cd5050310a2eaef87c5aec01cd9
...
001 (17487427.000.000) 04/28 05:22:09 Job executing on host:  
<192.168.3.235:40092>
...
005 (17487427.000.000) 04/28 05:22:09 Job terminated.
	(1) Normal termination (return value 1)
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
	1302  -  Run Bytes Sent By Job
	7892543  -  Run Bytes Received By Job
	1302  -  Total Bytes Sent By Job
	7892543  -  Total Bytes Received By Job
...
000 (17487700.000.000) 04/28 05:22:15 Job submitted from host:  
<192.168.0.13:34296>
     DAG Node: b36d4cd5050310a2eaef87c5aec01cd9
...

In the original DAG, node b36d4cd5050310a2eaef87c5aec01cd9 is  
submitted with ID 17487427.0, runs and then fails. It's then  
re-submitted with ID 17487700.0 since the DAG has RETRY 1 for this node.  
Note that node b36d4cd5050310a2eaef87c5aec01cd9 fully completes its  
submit-execute-terminate cycle before the next submit message appears  
in the log.
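
To find every node in a job log that was submitted more than once, the submit events can be grouped by DAG node name. This is a rough Python sketch, assuming the log layout shown in the excerpt above (a "000 (cluster.proc.subproc)" submit header, a "DAG Node:" line inside the event, and "..." closing each event). The function names are hypothetical:

```python
import re
from collections import defaultdict

# Submit event header, as in the excerpt:
#   000 (17487427.000.000) 04/28 04:31:21 Job submitted from host: ...
SUBMIT_RE = re.compile(r"^000 \((\d+)\.(\d+)\.\d+\)")
NODE_RE = re.compile(r"^\s*DAG Node:\s*(\S+)")


def submit_ids_by_node(log_text):
    """Return {node_name: [job_id, ...]} in the order the submits appear."""
    ids = defaultdict(list)
    pending = None  # job ID of the submit event currently being read
    for line in log_text.splitlines():
        header = SUBMIT_RE.match(line)
        if header:
            pending = f"{int(header.group(1))}.{int(header.group(2))}"
            continue
        if line.startswith("..."):
            pending = None  # end of the current event
            continue
        node = NODE_RE.match(line)
        if node and pending is not None:
            ids[node.group(1)].append(pending)
            pending = None
    return dict(ids)


def retried_nodes(log_text):
    """Nodes whose submit event appears more than once in the log."""
    return {
        node: jobs
        for node, jobs in submit_ids_by_node(log_text).items()
        if len(jobs) > 1
    }
```

Applied to the excerpt above, this would report node b36d4cd5050310a2eaef87c5aec01cd9 with the two IDs 17487427.0 and 17487700.0.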

At some point later the dagman process dies and so condor_dagman has  
to be restarted.

When dagman restarts and enters recovery mode, it is reading the  
condor job log file to recover the state information. It happily reads  
the first execution of the job:

5/1 19:04:13 Event: ULOG_SUBMIT for Condor Node  
b36d4cd5050310a2eaef87c5aec01cd9 (17487427.0)
5/1 19:04:13 Event: ULOG_EXECUTE for Condor Node  
b36d4cd5050310a2eaef87c5aec01cd9 (17487427.0)
5/1 19:04:13 Event: ULOG_JOB_TERMINATED for Condor Node  
b36d4cd5050310a2eaef87c5aec01cd9 (17487427.0)
5/1 19:04:13 Node b36d4cd5050310a2eaef87c5aec01cd9 job proc  
(17487427.0) failed with status 1.

and then prints a message about re-trying the job:

5/1 19:04:13 Retrying node b36d4cd5050310a2eaef87c5aec01cd9 (retry #1  
of 1)...

but then chokes when it sees the submit message from the retry from  
the original dag

5/1 19:04:13 ERROR: node b36d4cd5050310a2eaef87c5aec01cd9: job ID in  
userlog submit event (17487700.0) doesn't match ID reported earlier by  
submit command (17487427.0)!  Aborting DAG; set  
DAGMAN_ABORT_ON_SCARY_SUBMIT to false if you are *sure* this shouldn't  
cause an abort.

The dagman process then aborts, writing a rescue dag, but missing  
all the completed jobs _after_ the second submit of the retried job.
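
To make the suspected failure mode concrete, here is a toy Python model of replaying the log. It is not DAGMan's code. It only assumes, based on the error text above, that recovery remembers one job ID per node and compares each later submit event against it. The names replay, ScarySubmit and reset_on_retry are all hypothetical:

```python
class ScarySubmit(Exception):
    """Raised when a submit event's job ID differs from the remembered one."""


def replay(events, reset_on_retry):
    """Replay (kind, node, job_id) tuples and return the set of done nodes.

    expected[node] is the job ID this model currently associates with node.
    If reset_on_retry is False, a failed attempt leaves the old ID in place,
    so the submit event of the retry looks like a mismatch.
    """
    expected = {}
    done = set()
    for kind, node, job_id in events:
        if kind == "submit":
            if node in expected and expected[node] != job_id:
                raise ScarySubmit(
                    f"node {node}: job ID in submit event ({job_id}) "
                    f"doesn't match ID seen earlier ({expected[node]})"
                )
            expected[node] = job_id
        elif kind == "failed":
            if reset_on_retry:
                # Forget the old ID so the retry's submit event is accepted.
                expected.pop(node, None)
        elif kind == "succeeded":
            done.add(node)
    return done


# Event sequence modelled on the jessica.dag test case: node 1 fails once,
# is retried under a new job ID, and node 2 then runs.
EVENTS = [
    ("submit", "1", "2847957.0"),
    ("failed", "1", "2847957.0"),
    ("submit", "1", "2847958.0"),
    ("succeeded", "1", "2847958.0"),
    ("submit", "2", "2847959.0"),
    ("succeeded", "2", "2847959.0"),
]
```

In this model, replay(EVENTS, reset_on_retry=False) stops at the second submit of node 1, before node 2 is ever examined, while replay(EVENTS, reset_on_retry=True) recovers both nodes. Whether the real recovery code behaves this way is a guess from the log output, not something the logs prove.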

Another reason that I'm suspicious of the recovery code's handling of  
re-tried jobs is that b36d4cd5050310a2eaef87c5aec01cd9 is the _first_  
retried node that dagman encounters when it is parsing the job log  
file trying to recover. While this node was not the first (or only)  
failed job from the original dag, it is the first retried job dagman  
encounters in recovery mode.

I've posted a tar ball of all of Jessica's log files at:

http://www.gravity.phy.syr.edu/~duncan/computing/dagman_retry_log.tar.gz

Please could you look into this? This appears to be a fairly serious  
problem for us.

Cheers,
Duncan.

> We have repeatedly noticed this problem with the LV dags that have  
> been running lately. (These dags must run for several weeks, so the  
> extended run time makes it more likely that we encounter Condor  
> downtime.) After downtime, Condor tries to recover and only a  
> percentage of the complete jobs are marked as Done in the rescue. I  
> should also mention that this has happened in three different dags  
> running from three different head nodes.
>
> To see this effect, we used grep to isolate all messages related to a  
> single job in the affected dagman.out file. I've pasted these  
> messages below (and they are also in the attached tarball -  
> singlejob.dagman.out).
>
> First, job 17487427.0 begins.
> Next, 17487427.0 fails.
> Because retry is set to 1, ihope retries the job.
> The new condor id (for the retry) is 17487700.0.
> Next 17487700.0 successfully completes.
> Condor is shut down and restarted.
> The dag starts the recovery process.
> Condor notices that job 17487427.0 failed.
> Condor decides to retry that job, but notices that "job ID in  
> userlog submit event (17487700.0) doesn't match ID reported earlier  
> by submit command (17487427.0)!"
> The rescue dag is written immediately, even though it hasn't looked  
> to see if any more jobs were completed.
>
> The dagman.out contains a message that you should "set  
> DAGMAN_ABORT_ON_SCARY_SUBMIT to false if you are *sure* this  
> shouldn't cause an abort." However, it seems that Condor is actually  
> not treating the recovery process properly.
>
> I've attached some of the dagman and log files from one of these  
> cases. Please let us know what you think and let us know if we can  
> provide any more information.
>
> Thank you,
>
> Jessica (and Patrick)
> *******
> The attached file contains  
> inspiral_hipe_bnslininj.BNSLININJ.dag.dagman.out, the dag and the  
> rescue dag. The log file tmpebbFSQ is also included.
>
> Note that the rescue dag states that 3219 jobs are done, while the  
> dagman.out reported 6219 were complete before Condor went down.
>
> *******
> 4/28 04:31:21 Submitting Condor Node  
> b36d4cd5050310a2eaef87c5aec01cd9 job(s)...
> 4/28 04:31:21 submitting: condor_submit -a dag_node_name' '='  
> 'b36d4cd5050310a2eaef87c5aec01cd9 -a +DAGManJobId' '=' '17444740 -a  
> DAGManJobId' '=' '17444740 -a submit_event_notes' '=' 'DAG' 'Node:'  
> 'b36d4cd5050310a2eaef87c5aec01cd9 -a macroglob' '=' 'V1- 
> INCA_BNSLININJ-874979289-2007.xml -a macromissedinjections' '=' 'V1- 
> SIRE_INJECTIONS_1234_BNSLININJ_MISSED_FIRST_BNSLININJ 
> -874979289-2007.xml -a macrosummary' '=' 'V1- 
> SIRE_INJECTIONS_1234_BNSLININJ_FOUND_FIRST_BNSLININJ 
> -874979289-2007.txt -a macroinjectionfile' '=' 'HL- 
> INJECTIONS_1234_BNSLININJ-873567014-1665000.xml -a macrousertag' '='  
> 'BNSLININJ -a macrooutput' '=' 'V1- 
> SIRE_INJECTIONS_1234_BNSLININJ_FOUND_FIRST_BNSLININJ 
> -874979289-2007.xml -a macroifocut' '=' 'V1 -a +DAGParentNodeNames'  
> '=' '"890ad1b95b75bbe1b70eb1d9e2951deb, 
> 678ba29d80139bf976152ea88e66994d"  
> inspiral_hipe_bnslininj.sire.BNSLININJ.sub
> 4/28 04:31:22 Event: ULOG_SUBMIT for Condor Node  
> b36d4cd5050310a2eaef87c5aec01cd9 (17487427.0)
> 4/28 05:03:39   Node b36d4cd5050310a2eaef87c5aec01cd9, Condor ID  
> 17487427, status STATUS_SUBMITTED
> 4/28 05:16:36   Node b36d4cd5050310a2eaef87c5aec01cd9, Condor ID  
> 17487427, status STATUS_SUBMITTED
> 4/28 05:22:09 Event: ULOG_EXECUTE for Condor Node  
> b36d4cd5050310a2eaef87c5aec01cd9 (17487427.0)
> 4/28 05:22:09 Event: ULOG_JOB_TERMINATED for Condor Node  
> b36d4cd5050310a2eaef87c5aec01cd9 (17487427.0)
> 4/28 05:22:09 Node b36d4cd5050310a2eaef87c5aec01cd9 job proc  
> (17487427.0) failed with status 1.
> 4/28 05:22:09 Retrying node b36d4cd5050310a2eaef87c5aec01cd9 (retry  
> #1 of 1)...
> 4/28 05:22:14 Submitting Condor Node  
> b36d4cd5050310a2eaef87c5aec01cd9 job(s)...
> 4/28 05:22:14 submitting: condor_submit -a dag_node_name' '='  
> 'b36d4cd5050310a2eaef87c5aec01cd9 -a +DAGManJobId' '=' '17444740 -a  
> DAGManJobId' '=' '17444740 -a submit_event_notes' '=' 'DAG' 'Node:'  
> 'b36d4cd5050310a2eaef87c5aec01cd9 -a macroglob' '=' 'V1- 
> INCA_BNSLININJ-874979289-2007.xml -a macromissedinjections' '=' 'V1- 
> SIRE_INJECTIONS_1234_BNSLININJ_MISSED_FIRST_BNSLININJ 
> -874979289-2007.xml -a macrosummary' '=' 'V1- 
> SIRE_INJECTIONS_1234_BNSLININJ_FOUND_FIRST_BNSLININJ 
> -874979289-2007.txt -a macroinjectionfile' '=' 'HL- 
> INJECTIONS_1234_BNSLININJ-873567014-1665000.xml -a macrousertag' '='  
> 'BNSLININJ -a macrooutput' '=' 'V1- 
> SIRE_INJECTIONS_1234_BNSLININJ_FOUND_FIRST_BNSLININJ 
> -874979289-2007.xml -a macroifocut' '=' 'V1 -a +DAGParentNodeNames'  
> '=' '"890ad1b95b75bbe1b70eb1d9e2951deb, 
> 678ba29d80139bf976152ea88e66994d"  
> inspiral_hipe_bnslininj.sire.BNSLININJ.sub
> 4/28 05:22:16 Event: ULOG_SUBMIT for Condor Node  
> b36d4cd5050310a2eaef87c5aec01cd9 (17487700.0)
> 4/28 05:37:36 Event: ULOG_EXECUTE for Condor Node  
> b36d4cd5050310a2eaef87c5aec01cd9 (17487700.0)
> 4/28 05:37:44 Event: ULOG_JOB_TERMINATED for Condor Node  
> b36d4cd5050310a2eaef87c5aec01cd9 (17487700.0)
> 4/28 05:37:44 Node b36d4cd5050310a2eaef87c5aec01cd9 job proc  
> (17487700.0) completed successfully.
> 4/28 05:37:44 Node b36d4cd5050310a2eaef87c5aec01cd9 job completed
> 5/1 19:04:13 Event: ULOG_SUBMIT for Condor Node  
> b36d4cd5050310a2eaef87c5aec01cd9 (17487427.0)
> 5/1 19:04:13 Event: ULOG_EXECUTE for Condor Node  
> b36d4cd5050310a2eaef87c5aec01cd9 (17487427.0)
> 5/1 19:04:13 Event: ULOG_JOB_TERMINATED for Condor Node  
> b36d4cd5050310a2eaef87c5aec01cd9 (17487427.0)
> 5/1 19:04:13 Node b36d4cd5050310a2eaef87c5aec01cd9 job proc  
> (17487427.0) failed with status 1.
> 5/1 19:04:13 Retrying node b36d4cd5050310a2eaef87c5aec01cd9 (retry  
> #1 of 1)...
> 5/1 19:04:13 ERROR: node b36d4cd5050310a2eaef87c5aec01cd9: job ID in  
> userlog submit event (17487700.0) doesn't match ID reported earlier  
> by submit command (17487427.0)!  Aborting DAG; set  
> DAGMAN_ABORT_ON_SCARY_SUBMIT to false if you are *sure* this  
> shouldn't cause an abort.
>
> <bugcondor.tar.gz>

-- 

Duncan Brown                          Room 263-1, Department of Physics,
Assistant Professor of Physics        Syracuse University, NY 13244, USA
Phone: (315) 443 5993             http://www.gravity.phy.syr.edu/~duncan




===========================================================================
Date of creation: Wed May  6 22:35:56 2009 (1241667359)
Subject: Actions

Assigned to psilord by bgietzel
===========================================================================
Date of actions: Thu May  7  9:33:03 2009 (1241706783)
Date: Mon, 11 May 2009 15:59:42 -0500
From: Peter Keller <psilord__AT__cs.wisc.edu>
To: bgietzel <condor-admin__AT__cs.wisc.edu>
Subject: Re: [condor-admin #19273] LIGO: bug in dagman recovery procedure
 for retried jobs

Hello,

> From: Duncan Brown <dabrown__AT__physics.syr.edu>
> 
> I think Jessica has found a bug in the dagman recovery procedure for  
> DAGs that have retried nodes. I started to write you a (long) email  
> about this but then I figured I could just come up with a test case  
> which demonstrates the problem:

As a fast workaround, you could set:

DAGMAN_ABORT_ON_SCARY_SUBMIT = False

for now, which should stop the immediate problem, but will also hide
the other problems described in the initial email you copied into this
ticket (multiple dags using the same log file, etc). Are the true
errors that DAGMAN_ABORT_ON_SCARY_SUBMIT is supposed to detect rare
enough that this would work in the immediate term?

It looks to be a quick fix on our part. We'll try and get you a new
dagman executable in a couple of days.

Thank you.

Condor Admin

===========================================================================
Date mail was appended: Mon May 11 15:59:47 2009 (1242075587)
CC: anderson__AT__ligo.caltech.edu, skoranda__AT__gravity.phys.uwm.edu,
 patrick__AT__gravity.phys.uwm.edu, jclayton__AT__gravity.phys.uwm.edu
From: Duncan Brown <dabrown__AT__physics.syr.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #19273] LIGO: bug in dagman recovery procedure
 for retried jobs
Date: Mon, 11 May 2009 18:41:59 -0400

Hi Pete,

Thanks, we should be able to do this as an interim solution. Jessica,  
do you want to try it?

Cheers,
Duncan.

-- 

Duncan Brown                          Room 263-1, Department of Physics,
Assistant Professor of Physics        Syracuse University, NY 13244, USA
Phone: (315) 443 5993             http://www.gravity.phy.syr.edu/~duncan




===========================================================================
Date mail was appended: Mon May 11 17:42:12 2009 (1242081733)
Date: Tue, 12 May 2009 09:20:57 -0500
X-Google-Sender-Auth: 4840328f1f4583b5
Subject: Re: [condor-admin #19273] LIGO: bug in dagman recovery procedure
 for 	retried jobs
From: Jessica Clayton <jclayton__AT__gravity.phys.uwm.edu>
To: Duncan Brown <dabrown__AT__physics.syr.edu>
CC: condor-admin__AT__cs.wisc.edu, anderson__AT__ligo.caltech.edu,
 skoranda__AT__gravity.phys.uwm.edu, patrick__AT__gravity.phys.uwm.edu
X-Seen-BY: mailfromd 4.1 granite.cs.wisc.edu

--0015174be628b38d3c0469b7cdd2

Hi Duncan,
If we start any new dags, I would be happy to try this. I'll just
communicate with you then how I should use this option. (Where do I specify
it?) In the mean time, the large dags that the LV team has been running are
still going. They are happily running, so I do not want to stop them unless
there's another problem.

Jessica


--0015174be628b38d3c0469b7cdd2--

===========================================================================
Date mail was appended: Tue May 12  9:27:11 2009 (1242138431)
Date: Tue, 12 May 2009 09:55:04 -0500
From: Peter Keller <psilord__AT__cs.wisc.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
Subject: Re: [condor-admin #19273] LIGO: bug in dagman recovery procedure
 for retried jobs

On Tue, May 12, 2009 at 09:27:11AM -0500, condor-admin response tracking system wrote:
> If we start any new dags, I would be happy to try this. I'll just
> communicate with you then how I should use this option. (Where do I specify
> it?) In the mean time, the large dags that the LV team has been running are
> still going. They are happily running, so I do not want to stop them unless
> there's another problem.

The Condor administrator must set

DAGMAN_ABORT_ON_SCARY_SUBMIT = False

in the configuration file for the submit machine. All dagman
instances started on that schedd after the change will then use the
new value. If a dagman instance that was already running before the
change dies, its replacement will pick up the setting when it starts
in recovery/rescue mode.

Thank you.

Condor Admin


===========================================================================
Date mail was appended: Tue May 12  9:55:09 2009 (1242140110)
CC: Duncan Brown <dabrown__AT__physics.syr.edu>, condor-admin__AT__cs.wisc.edu,
 skoranda__AT__gravity.phys.uwm.edu, patrick__AT__gravity.phys.uwm.edu
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: Jessica Clayton <jclayton__AT__gravity.phys.uwm.edu>
Subject: Re: [condor-admin #19273] LIGO: bug in dagman recovery procedure
 for  retried jobs
Date: Tue, 12 May 2009 09:50:33 -0700
X-Seen-BY: mailfromd 4.1 obsidian.cs.wisc.edu

Please also note that Condor version 7.2.3 is due for release this  
week which purports to fix the schedd file descriptor leak that has  
been exacerbating this problem on Nemo. I also believe Pete Keller has  
reported he is only days away from having a set of pre-release DAGMan  
binaries that fix this actual problem.

Thanks.


--
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson




===========================================================================
Date mail was appended: Tue May 12 11:50:54 2009 (1242147054)
Date: Wed, 20 May 2009 10:45:36 -0500
From: Peter Keller <psilord__AT__cs.wisc.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
Subject: Re: [condor-admin #19273] LIGO: bug in dagman recovery procedure
 for retried jobs

On Tue, May 12, 2009 at 11:50:54AM -0500, condor-admin response tracking system wrote:
> Please also note that Condor version 7.2.3 is due for release this  
> week which purports to fix the schedd file descriptor leak that has  
> been exacerbating this problem on Nemo. I also believe Pete Keller has  
> reported he is only days away from having a set of pre-release DAGMan  
> binaries that fix this actual problem.

Ok, what architectures do you guys need with the retry fix? It turns out the
retry fix missed 7.2.3, so I'll have to give you a 7.2.4 pre-release.

Thank you.

-pete

===========================================================================
Date mail was appended: Wed May 20 10:45:40 2009 (1242834341)
CC: Duncan Brown <dabrown__AT__physics.syr.edu>,        Scott Koranda
 <skoranda__AT__gravity.phys.uwm.edu>,        Patrick Brady
 <patrick__AT__gravity.phys.uwm.edu>,        Jessica Clayton
 <jclayton__AT__gravity.phys.uwm.edu>,        Gregory Skelton
 <gskelton__AT__gravity.phys.uwm.edu>,        Ross Oldenburg
 <rosso__AT__gravity.phys.uwm.edu>
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #19273] LIGO: bug in dagman recovery procedure
 for retried jobs
Date: Wed, 20 May 2009 12:41:20 -0700
X-Seen-BY: mailfromd 4.1 silica.cs.wisc.edu


On May 20, 2009, at 8:45 AM, condor-admin response tracking system  
wrote:

> On Tue, May 12, 2009 at 11:50:54AM -0500, condor-admin response  
> tracking system wrote:
>> Please also note that Condor version 7.2.3 is due for release this
>> week which purports to fix the schedd file descriptor leak that has
>> been exacerbating this problem on Nemo. I also believe Pete Keller has
>> reported he is only days away from having a set of pre-release DAGMan
>> binaries that fix this actual problem.
>
> Ok, what architectures do you guys need with the retry fix? It turns out the
> retry fix missed 7.2.3, so I'll have to give you a 7.2.4 pre-release.

This is a question for whoever is testing on Nemo, if you are willing to
verify that this fixes the problem seen there.

Thanks.

--
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson




===========================================================================
Date mail was appended: Wed May 20 14:41:41 2009 (1242848502)
Date: Wed, 20 May 2009 14:57:35 -0500
From: Ross Oldenburg <rosso__AT__gravity.phys.uwm.edu>
To: condor-admin__AT__cs.wisc.edu
CC: Stuart Anderson <anderson__AT__ligo.caltech.edu>, condor-admin__AT__cs.wisc.edu,
 Duncan Brown <dabrown__AT__physics.syr.edu>,        Scott Koranda
 <skoranda__AT__gravity.phys.uwm.edu>,        Patrick Brady
 <patrick__AT__gravity.phys.uwm.edu>,        Jessica Clayton
 <jclayton__AT__gravity.phys.uwm.edu>,        Gregory Skelton
 <gskelton__AT__gravity.phys.uwm.edu>
Subject: Re: [condor-admin #19273] LIGO: bug in dagman recovery procedure
 for retried jobs
X-Enigmail-Version: 0.95.0
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu

Hi Pete,

We use the rhel5 x86_64 binaries at UWM.

Thanks,
Ross

Stuart Anderson wrote:
> 
> On May 20, 2009, at 8:45 AM, condor-admin response tracking system wrote:
> 
>> On Tue, May 12, 2009 at 11:50:54AM -0500, condor-admin response
>> tracking system wrote:
>>> Please also note that Condor version 7.2.3 is due for release this
>>> week which purports to fix the schedd file descriptor leak that has
>>> been exacerbating this problem on Nemo. I also believe Pete Keller has
>>> reported he is only days away from having a set of pre-release DAGMan
>>> binaries that fix this actual problem.
>>
>> Ok, what architectures do you guys need with the retry fix? It turns out the
>> retry fix missed 7.2.3, so I'll have to give you a 7.2.4 pre-release.
> 
> This is a question for whoever is testing on Nemo, if you are willing to
> verify that this fixes the problem seen there.
> 
> Thanks.
> 
> -- 
> Stuart Anderson  anderson__AT__ligo.caltech.edu
> http://www.ligo.caltech.edu/~anderson
> 
> 

===========================================================================
Date mail was appended: Wed May 20 14:57:15 2009 (1242849435)
Date: Thu, 21 May 2009 14:43:03 -0500
From: Peter Keller <psilord__AT__cs.wisc.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
Subject: Re: [condor-admin #19273] LIGO: bug in dagman recovery procedure
 for retried jobs

On Wed, May 20, 2009 at 02:57:15PM -0500, condor-admin response tracking system wrote:
> We use the rhel5 x86_64 binaries at UWM.

Ok, you can find a 7.2.4 dagman pre-release here:

ftp ftp.cs.wisc.edu
login: ftp
password: <email address>
ftp> bin
ftp> cd condor/temporary/forligo/dagman-7.2.4-prerelease-2009-05-21/x86_64_rhap_5
ftp> get condor_dagman
ftp> get condor_submit_dag
ftp> bye

This should hold the retry fixes Kent did so dagman doesn't get confused in
recovery mode when nodes had already retried in the dag.

Thank you.

-pete
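
The FTP session in the message above could also be scripted. The following is only a minimal sketch, assuming Python's standard ftplib module, anonymous login with an email address as the password, and that the server and directory quoted above are still reachable. The function names, the destination directory and the chmod step are illustrative assumptions, not part of the original instructions.

```python
from ftplib import FTP
from pathlib import Path

# Host, directory and file names are copied from the FTP session above.
HOST = "ftp.cs.wisc.edu"
REMOTE_DIR = ("condor/temporary/forligo/"
              "dagman-7.2.4-prerelease-2009-05-21/x86_64_rhap_5")
BINARIES = ("condor_dagman", "condor_submit_dag")


def retr_commands(names=BINARIES):
    """Return the FTP RETR command for each file name (hypothetical helper)."""
    return [f"RETR {name}" for name in names]


def fetch_binaries(email, dest=".", host=HOST, remote_dir=REMOTE_DIR):
    """Download the pre-release binaries in binary mode via anonymous FTP.

    Mirrors the manual steps: log in as 'ftp' with an email address as the
    password, cd to the pre-release directory, and get both files.
    """
    dest_dir = Path(dest)
    dest_dir.mkdir(parents=True, exist_ok=True)
    with FTP(host) as ftp:
        ftp.login(user="ftp", passwd=email)
        ftp.cwd(remote_dir)
        for name, command in zip(BINARIES, retr_commands()):
            target = dest_dir / name
            with open(target, "wb") as handle:
                # retrbinary transfers in binary mode, like 'bin' + 'get'.
                ftp.retrbinary(command, handle.write)
            # Assumption: the files should be executable once downloaded.
            target.chmod(0o755)
```

Calling fetch_binaries("you@example.org", dest="dagman-prerelease") would perform the download; nothing touches the network until that function is called.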

===========================================================================
Date mail was appended: Thu May 21 14:43:08 2009 (1242934988)
Date: Wed, 3 Jun 2009 14:29:18 -0500
From: Peter Keller <psilord__AT__cs.wisc.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
Subject: Re: [condor-admin #19273] LIGO: bug in dagman recovery procedure
 for retried jobs

Hello,

I'd heard from Ross that the new binaries were having some trouble. Is that
still the case? What is the status of this ticket from your point of view?

Thank you.

-pete

===========================================================================
Date mail was appended: Wed Jun  3 14:29:25 2009 (1244057365)
Subject: Actions

Assigned to wenger by psilord
===========================================================================
Date of actions: Fri Jun  5 13:35:26 2009 (1244226926)
Subject: Comments added

Assigned to wenger to follow up with LIGO. Problem should be fixed.

Comments added by psilord

===========================================================================
Date comments were added: Fri Jun  5 13:35:34 2009 (1244226934)
Date: Wed, 10 Jun 2009 17:43:40 -0500
From: Ross Oldenburg <rosso__AT__gravity.phys.uwm.edu>
To: condor-admin__AT__cs.wisc.edu
CC: Stuart Anderson <anderson__AT__ligo.caltech.edu>,        Scott Koranda
 <skoranda__AT__gravity.phys.uwm.edu>,        Patrick Brady
 <patrick__AT__gravity.phys.uwm.edu>,        Jessica Clayton
 <jclayton__AT__gravity.phys.uwm.edu>,        Duncan Brown
 <dabrown__AT__physics.syr.edu>
Subject: Re: [condor-admin #19273] LIGO: bug in dagman recovery procedure
X-Enigmail-Version: 0.95.0
X-Seen-BY: mailfromd 4.1 obsidian.cs.wisc.edu

The DAGMan recovery bug appears to be fixed, so I think you can probably
close this ticket.

I ran Duncan's test on Nemo (Condor 7.2.3, DAGMan
7.2.4-prerelease-2009-05-21) and it passed.  There were no DAGMan
failures during the recovery and the job completed.  A couple of DAGs also
recovered cleanly after a cluster upgrade yesterday.

Thanks for all of your hard work, we really appreciate it!

--Ross Oldenburg

===========================================================================
Date mail was appended: Wed Jun 10 17:43:08 2009 (1244673788)
Date: Thu, 11 Jun 2009 09:23:06 -0500 (CDT)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-admin #19273] LIGO: bug in dagman recovery procedure
 for retried jobs

Ross,

> The DAGMan recovery bug appears to be fixed, so I think you can probably
> close this ticket.
>
> I ran Duncan's test on Nemo (Condor 7.2.3, DAGMan
> 7.2.4-prerelease-2009-05-21) and it passed.  There were no DAGMan
> failures during the recovery and the job completed.  A couple of DAGs also
> recovered cleanly after a cluster upgrade yesterday.
>
> Thanks for all of your hard work, we really appreciate it!

That's great!  Thanks for the update.  I was a bit worried when we heard 
that maybe the problem wasn't fixed, because that would have meant that 
we didn't really completely understand the original failure.

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Thu Jun 11  9:23:11 2009 (1244730191)
Subject: Actions

Ticket resolved by wenger
===========================================================================
Date of actions: Thu Jun 11  9:23:15 2009 (1244730196)