LIGO Support Ticket 17748

Ticket Information
  Number:      admin 17748
  User:        nvf@gravity.phys.uwm.edu
  Email:       anderson__AT__ligo.caltech.edu,skoranda__AT__gravity.phys.uwm.edu
  Status:      open
  Assigned To: pfc
From: Nickolas Fotopoulos <nvf__AT__gravity.phys.uwm.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: LIGO: noop assertion error
Date: Sat, 15 Mar 2008 16:50:00 -0700
CC: Stuart Anderson <anderson__AT__ligo.caltech.edu>,         Scott Koranda
 <skoranda__AT__gravity.phys.uwm.edu>
X-Seen-BY: mailfromd 4.1 silica.cs.wisc.edu

When running many noop jobs, I am seeing the assertion error quoted  
below.

$ condor_version
$CondorVersion: 6.9.4 Aug 30 2007 $
$CondorPlatform: X86_64-LINUX_RHEL3 $
$ condor_dagman -version
$CondorVersion: 6.9.5 Nov 28 2007 BuildID: 65347 $
$CondorPlatform: X86_64-LINUX_RHEL3 $
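
The two banners above report different versions (6.9.4 from condor_version, 6.9.5 from condor_dagman). A minimal sketch of detecting such a mismatch, assuming the banner strings have already been captured from the two commands; the helper name parse_condor_version is hypothetical and not part of Condor:

```python
import re

# Matches the "$CondorVersion: X.Y.Z ..." banner format shown above.
VERSION_RE = re.compile(r"\$CondorVersion:\s*(\d+)\.(\d+)\.(\d+)\b")


def parse_condor_version(banner):
    """Extract (major, minor, patch) from a $CondorVersion banner line."""
    match = VERSION_RE.search(banner)
    if match is None:
        raise ValueError("no $CondorVersion banner found in: %r" % banner)
    return tuple(int(part) for part in match.groups())


# Banner lines copied from the output quoted in this ticket.
banners = {
    "condor_version": "$CondorVersion: 6.9.4 Aug 30 2007 $",
    "condor_dagman -version": "$CondorVersion: 6.9.5 Nov 28 2007 BuildID: 65347 $",
}
versions = {tool: parse_condor_version(text) for tool, text in banners.items()}

# More than one distinct version tuple means the tools are out of step.
mismatch = len(set(versions.values())) > 1
for tool, version in versions.items():
    print(tool, ".".join(str(n) for n in version))
print("version mismatch:", mismatch)
```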

On Mar 15, 2008, at 3:36 PM, Nickolas Fotopoulos wrote:

> Scott,
>
> The noop route seems the most appealing.  I tried it and it appeared  
> to work for a while, but I think I ran into a Condor bug ~200 jobs  
> into the noop jobs:
>
> ...
> ...
> 3/15 17:20:03 submitting: condor_submit -a dag_node_name' '='
> '97d736e41f382cd8477d1ff0e8ae484f -a +DAGManJobId' '=' '1653119 -a
> DAGManJobId' '=' '1653119 -a submit_event_notes' '=' 'DAG' 'Node:'
> '97d736e41f382cd8477d1ff0e8ae484f -a macrooutput' '='
> 'L1-SIRE_FIRST_GRB070714B_injections21-868429419-2048.xml -a
> macrosummary' '='
> 'L1-SIRE_FIRST_GRB070714B_injections21-868429419-2048.txt -a
> macrousertag' '=' 'GRB070714B_injections21 -a macroglob' '='
> 'L1-INSPIRAL_FIRST_GRB070714B_injections21_150-868429419-2048.xml.gz -a
> macroifocut' '=' 'L1 -a +DAGParentNodeNames' '='
> '"eb25ace1446075cf9f2d9f0eb93e0ae6"
> injections21.sire.GRB070714B_injections21.sub
> 3/15 17:20:04 From submit: Submitting job(s).
> 3/15 17:20:04 From submit: Logging submit event(s).
> 3/15 17:20:04 From submit: 1 job(s) submitted to cluster 1653253.
> 3/15 17:20:04   assigned Condor ID (1653253.0)
> ...
> ...
> 3/15 17:20:04 Number of idle job procs: 0
> 3/15 17:20:04 Event: ULOG_JOB_TERMINATED for Condor Node  
> 97d736e41f382cd8477d1ff0e8ae484f (1653253.0)
> 3/15 17:20:04 BAD EVENT: job (1653253.0.0) ended, submit count < 1 (0)
> 3/15 17:20:04 BAD EVENT is warning only
> 3/15 17:20:04 ERROR "Assertion ERROR on (node->_queuedNodeJobProcs >= 0)"
> at line 3024 in file dag.C
>
>
>
> On Mar 14, 2008, at 7:22 PM, Scott Koranda wrote:
>
>> Hi Nick,
>>
>> Rather than setting 'executable = /bin/true' you could add to
>> the submit file 'hold = True'. The child jobs will then be submitted
>> and held and will not run unless you explicitly call
>> condor_release on them.
>>
>> In a similar way you could set 'noop_job = True' for the child
>> jobs and the jobs will simply be marked as completed with a
>> return value of 0.
>>
>> Scott
>>
>>> Dear all,
>>>
>>> After a DAG has run partway through, I've decided that the
>>> bottom-most post-processing jobs (several thousand of them) should
>>> not, or cannot, be run.
>>> When my rescue DAG comes, as it inevitably does, I would like not to
>>> execute these.  So far, no problem; a one-line bash/sed invocation
>>> takes care of that:
>>>
>>> sed 's/.*mysubfile.*/& DONE/' "$f" > "${f}.sires_done"
>>>
>>> The problem is that not all of the parents have completed
>>> successfully.  I'd like to resubmit the parents, but not these
>>> children.  When I naively mark them as DONE, as above, I get the
>>> following error while dagman parses the DAG.
>>>
>>> 3/13 20:25:13 ERROR: AddParent( ea0bca7d3503cccca43dff66a99c1516 )
>>> failed for node a5bf08f49f3323fdd5f838f6d89918f7: STATUS_DONE child
>>> may not be given a new STATUS_READY parent
>>>
>>> Removing the JOB lines produces an error that the parent-child
>>> relationships refer to a non-existent job.  (I don't have the exact
>>> message handy.)
>>>
>>> I see a few solutions, none of which I like:
>>> * resubmit without modification and let the children fail (wastes
>>> resources)
>>> * change the submit files to point to /bin/true and run in the local
>>> universe (a lot of scheduling overhead, I'd think, but maybe this is
>>> negligible)
>>> * identify all nodes of a class and remove all references to each of
>>> them (more code than I want to write at the moment)
>>>
>>> Can I get some gut reactions to these options, or perhaps new,
>>> cleverer options?
>>>
>>> Thanks,
>>> Nick
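
The third option listed above (identify all nodes of a class and remove every reference to them) can be sketched roughly as follows. This is a minimal illustration, not a tested tool: it assumes the DAG file uses only the JOB, VARS, RETRY, SCRIPT and PARENT ... CHILD keywords, that the pruned nodes are leaves (so removing them does not sever a dependency chain between surviving nodes), and that a node's class is identified by its submit file name. The function name prune_dag and the file names in the example are hypothetical.

```python
# Keywords whose second token is a node name (SCRIPT is handled separately).
NODE_KEYWORDS = {"JOB", "VARS", "RETRY"}


def prune_dag(lines, submit_file):
    """Return DAG lines with every node that uses `submit_file` removed."""
    # Pass 1: collect the names of nodes whose JOB line names the submit file.
    doomed = set()
    for line in lines:
        tokens = line.split()
        if len(tokens) >= 3 and tokens[0].upper() == "JOB" and tokens[2] == submit_file:
            doomed.add(tokens[1])

    # Pass 2: drop per-node lines and strip the nodes from dependency lines.
    kept = []
    for line in lines:
        tokens = line.split()
        keyword = tokens[0].upper() if tokens else ""
        if keyword == "PARENT":
            upper = [t.upper() for t in tokens]
            if "CHILD" in upper:
                split = upper.index("CHILD")
                parents = [t for t in tokens[1:split] if t not in doomed]
                children = [t for t in tokens[split + 1:] if t not in doomed]
                # A dependency line with an empty side is meaningless; drop it.
                if parents and children:
                    kept.append("PARENT " + " ".join(parents)
                                + " CHILD " + " ".join(children))
                continue
        if keyword in NODE_KEYWORDS and len(tokens) > 1 and tokens[1] in doomed:
            continue
        # SCRIPT PRE|POST <node> ... puts the node name third.
        if keyword == "SCRIPT" and len(tokens) > 2 and tokens[2] in doomed:
            continue
        kept.append(line)
    return kept


# Hypothetical four-node DAG: nodes b and c use the submit file to prune.
example = [
    "JOB a inspiral.sub",
    "JOB b sire.sub",
    "VARS b macroifocut=\"L1\"",
    "JOB c sire.sub",
    "JOB d inspiral.sub",
    "PARENT a CHILD b",
    "PARENT a CHILD c d",
]
print("\n".join(prune_dag(example, "sire.sub")))
```

Because the pruned nodes vanish entirely, DAGMan never sees a DONE child with a READY parent, which is the condition that triggered the AddParent error above.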

===================================
Nickolas Fotopoulos
nvf__AT__gravity.phys.uwm.edu

Office: (414) 229-6438
Fax: (414) 229-5589
University of Wisconsin - Milwaukee
Physics Bldg, Rm 471
===================================
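
To locate failures like the one quoted in the message above in a long dagman.out file, something along these lines may help. It is only a sketch: the two patterns are modeled on the log lines quoted in this ticket, it assumes each log entry sits on a single unwrapped line, and the function name scan_dagman_out is hypothetical.

```python
import re

# Patterns modeled on the dagman.out lines quoted in this ticket.
BAD_EVENT_RE = re.compile(r"BAD EVENT: job \((\d+\.\d+\.\d+)\) (.+)")
ASSERT_RE = re.compile(r'ERROR "Assertion ERROR on \((.+)\)" at line (\d+) in file (\S+)')


def scan_dagman_out(lines):
    """Collect BAD EVENT reports and assertion failures from log lines."""
    bad_events = []   # (job id, description)
    assertions = []   # (failed expression, source line, source file)
    for line in lines:
        match = BAD_EVENT_RE.search(line)
        if match:
            bad_events.append((match.group(1), match.group(2).strip()))
        match = ASSERT_RE.search(line)
        if match:
            assertions.append((match.group(1), int(match.group(2)), match.group(3)))
    return bad_events, assertions


# Log lines copied from the excerpt in this ticket, unwrapped.
sample = [
    "3/15 17:20:04 BAD EVENT: job (1653253.0.0) ended, submit count < 1 (0)",
    "3/15 17:20:04 BAD EVENT is warning only",
    '3/15 17:20:04 ERROR "Assertion ERROR on (node->_queuedNodeJobProcs >= 0)"'
    " at line 3024 in file dag.C",
]
bad_events, assertions = scan_dagman_out(sample)
for job_id, description in bad_events:
    print("bad event for job", job_id, ":", description)
for expression, line_no, source in assertions:
    print("assertion failed:", expression, "at", source, "line", line_no)
```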


===========================================================================
Date of creation: Sat Mar 15 18:50:20 2008 (1205625023)
Subject: Actions

Assigned to bt by bt
===========================================================================
Date of actions: Mon Mar 17  9:21:07 2008 (1205774361)
Date: Mon, 17 Mar 2008 13:03:02 -0500
From: Bill Taylor <bt__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #17748] LIGO: noop assertion error

bt wrote:

Just a quick note to let you know I have moved this along to our
Dagman experts.

Bill
Condor Team

Note added from ticket
Re: [condor-admin #17756] LIGO Support Ticket 17748

Hi,

Please append this to LIGO support Ticket 17748.

In short, some of the children in a rescue DAG had 'noop_job = True'
added, and some time after the rescue DAG was submitted the schedd
died.
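
The change described above, adding 'noop_job = True' to the children's submit files, might be scripted as in the following sketch. It assumes a simple submit file whose queue statement starts a line of its own; the function name add_noop_job and the example submit file contents are hypothetical.

```python
def add_noop_job(submit_text):
    """Return submit-file text with 'noop_job = True' before the queue line."""
    lines = submit_text.splitlines()

    # Leave the file alone if noop_job is already set.
    for line in lines:
        if line.split("=")[0].strip().lower() == "noop_job":
            return submit_text

    out = []
    inserted = False
    for line in lines:
        # Submit commands must precede the queue statement to take effect.
        if not inserted and line.strip().lower().startswith("queue"):
            out.append("noop_job = True")
            inserted = True
        out.append(line)
    if not inserted:
        # No queue statement found; append the setting at the end.
        out.append("noop_job = True")
    return "\n".join(out) + "\n"


# Hypothetical submit file for one of the child jobs.
example = "universe = vanilla\nexecutable = lalapps_sire\nqueue 1\n"
print(add_noop_job(example), end="")
```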

Note, however, that this is for

$ condor_version
$CondorVersion: 6.9.4 Aug 30 2007 $
$CondorPlatform: X86_64-LINUX_RHEL3 $
$ condor_dagman -version
$CondorVersion: 6.9.5 Nov 28 2007 BuildID: 65347 $
$CondorPlatform: X86_64-LINUX_RHEL3 $

If you think this will be solved by upgrading to 7.0.1 and the
latest DAGMan binaries Kent has produced, let us know and we
will simply do that before debugging this further.

Thanks,

Scott




===========================================================================
Date mail was appended: Mon Mar 17 13:03:16 2008 (1205776999)
Subject: Actions

Assigned to pfc by bt
===========================================================================
Date of actions: Mon Mar 17  9:21:07 2008 (1205777129)
Date: Wed, 26 Mar 2008 15:51:37 -0500 (CDT)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: wenger <condor-admin__AT__cs.wisc.edu>
Subject: Re: [condor-admin #17748] LIGO: noop assertion error

See PR 940 (Noop job shouldn't require an executable).

--
R. Kent Wenger (wenger__AT__cs.wisc.edu, 608-262-6627,
http://www.cs.wisc.edu/~wenger/)
Computer Sciences Department
University of Wisconsin-Madison


===========================================================================
Date mail was appended: Wed Mar 26 15:51:44 2008 (1206564705)
CC: Stuart Anderson <anderson__AT__ligo.caltech.edu>,         Scott Koranda
 <skoranda__AT__gravity.phys.uwm.edu>
From: Nickolas Fotopoulos <nvf__AT__gravity.phys.uwm.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #17748] LIGO: noop assertion error
Date: Thu, 27 Mar 2008 13:28:19 -0500
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu

Kent,

I'm unable to find a PR 940 through Google or anything else.  The only  
tracker I know is the LIGO-related bug tracker:

<http://www.lsc-group.phys.uwm.edu/lscdatagrid/condorligo/index.html>

Can you point me to the PR you reference?

Thanks,
Nick

On Mar 26, 2008, at 3:51 PM, condor-admin response tracking system  
wrote:
> See PR 940 (Noop job shouldn't require an executable).



===========================================================================
Date mail was appended: Thu Mar 27 13:28:44 2008 (1206642525)