LIGO Support Ticket 17748
Ticket Information
Number: admin 17748
User: nvf@gravity.phys.uwm.edu
Email: anderson__AT__ligo.caltech.edu,skoranda__AT__gravity.phys.uwm.edu
Status: open
Assigned To: pfc
From: Nickolas Fotopoulos <nvf__AT__gravity.phys.uwm.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: LIGO: noop assertion error
Date: Sat, 15 Mar 2008 16:50:00 -0700
CC: Stuart Anderson <anderson__AT__ligo.caltech.edu>, Scott Koranda
<skoranda__AT__gravity.phys.uwm.edu>
X-Seen-BY: mailfromd 4.1 silica.cs.wisc.edu
When running many noop jobs, I am seeing the assertion error quoted
below.
$ condor_version
$CondorVersion: 6.9.4 Aug 30 2007 $
$CondorPlatform: X86_64-LINUX_RHEL3 $
$ condor_dagman -version
$CondorVersion: 6.9.5 Nov 28 2007 BuildID: 65347 $
$CondorPlatform: X86_64-LINUX_RHEL3 $
On Mar 15, 2008, at 3:36 PM, Nickolas Fotopoulos wrote:
> Scott,
>
> The noop route seems the most appealing. I tried it and it appeared
> to work for a while, but I think I ran into a Condor bug ~200 jobs
> into the noop jobs:
>
> ...
> ...
> 3/15 17:20:03 submitting: condor_submit -a dag_node_name' '='
> '97d736e41f382cd8477d1ff0e8ae484f -a +DAGManJobId' '=' '1653119 -a
> DAGManJobId' '=' '1653119 -a submit_event_notes' '=' 'DAG' 'Node:'
> '97d736e41f382cd8477d1ff0e8ae484f -a macrooutput' '='
> 'L1-SIRE_FIRST_GRB070714B_injections21-868429419-2048.xml -a
> macrosummary' '='
> 'L1-SIRE_FIRST_GRB070714B_injections21-868429419-2048.txt -a
> macrousertag' '=' 'GRB070714B_injections21 -a macroglob' '='
> 'L1-INSPIRAL_FIRST_GRB070714B_injections21_150-868429419-2048.xml.gz -a
> macroifocut' '=' 'L1 -a +DAGParentNodeNames' '='
> '"eb25ace1446075cf9f2d9f0eb93e0ae6"
> injections21.sire.GRB070714B_injections21.sub
> 3/15 17:20:04 From submit: Submitting job(s).
> 3/15 17:20:04 From submit: Logging submit event(s).
> 3/15 17:20:04 From submit: 1 job(s) submitted to cluster 1653253.
> 3/15 17:20:04 assigned Condor ID (1653253.0)
> ...
> ...
> 3/15 17:20:04 Number of idle job procs: 0
> 3/15 17:20:04 Event: ULOG_JOB_TERMINATED for Condor Node
> 97d736e41f382cd8477d1ff0e8ae484f (1653253.0)
> 3/15 17:20:04 BAD EVENT: job (1653253.0.0) ended, submit count < 1 (0)
> 3/15 17:20:04 BAD EVENT is warning only
> 3/15 17:20:04 ERROR "Assertion ERROR on (node->_queuedNodeJobProcs >= 0)"
> at line 3024 in file dag.C
>
>
>
> On Mar 14, 2008, at 7:22 PM, Scott Koranda wrote:
>
>> Hi Nick,
>>
>> Rather than setting 'executable = /bin/true' you could add to
>> the submit file 'hold = True'. The child jobs will then be submitted
>> and held and will not run unless you explicitly call
>> condor_release on them.
>>
>> In a similar way you could set 'noop_job = True' for the child
>> jobs and the jobs will simply be marked as completed with a
>> return value of 0.
>>
>> Scott
>>
>>> Dear all,
>>>
>>> After a DAG has run partway through, I've decided that the
>>> bottom-most post-processing jobs (several thousand of them)
>>> should not or cannot be run. When my rescue DAG comes, as it
>>> inevitably does, I would like not to execute these. So far, no
>>> problem; a one-line bash/sed invocation takes care of that:
>>>
>>> cat $f | sed 's/.*mysubfile.*/& DONE/' > ${f}.sires_done;
>>>
>>> The problem is that not all of the parents have completed
>>> successfully. I'd like to resubmit the parents, but not these
>>> children. When I naively mark them as DONE, as above, I get the
>>> following error while dagman parses the DAG.
>>>
>>> 3/13 20:25:13 ERROR: AddParent( ea0bca7d3503cccca43dff66a99c1516 )
>>> failed for node a5bf08f49f3323fdd5f838f6d89918f7: STATUS_DONE child
>>> may not be given a new STATUS_READY parent
>>>
>>> Removing the JOB lines produces an error that the parent-child
>>> relationships refer to a non-existent job. (I don't have the exact
>>> message handy.)
>>>
>>> I see a few solutions, none of which I like:
>>> * resubmit without modification and let the children fail (wastes
>>> resources)
>>> * change the submit files to point to /bin/true and run in the local
>>> universe (a lot of scheduling overhead, I'd think, but maybe this is
>>> negligible)
>>> * identify all nodes of a class and remove all references to each of
>>> them (more code than I want to write at the moment)
>>>
>>> Can I get some gut reactions to these options or perhaps new,
>>> cleverer options?
>>>
>>> Thanks,
>>> Nick
===================================
Nickolas Fotopoulos
nvf__AT__gravity.phys.uwm.edu
Office: (414) 229-6438
Fax: (414) 229-5589
University of Wisconsin - Milwaukee
Physics Bldg, Rm 471
===================================
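The third option listed in the original message (identify all nodes of a class and remove every reference to them) could be sketched roughly as follows. This is an illustration in Python and is not taken from the thread: the function name `drop_nodes` is hypothetical, and the sketch assumes the DAG file uses only the JOB, RETRY, VARS, SCRIPT and PARENT ... CHILD keywords, one statement per line. It also assumes the removed nodes are leaves; any dependency that passed through a removed node is simply dropped.

```python
def drop_nodes(dag_text, submit_file):
    """Return dag_text without the nodes whose JOB line uses submit_file.

    Hypothetical helper: handles JOB, RETRY, VARS, SCRIPT and
    PARENT ... CHILD lines only; every other line is copied unchanged.
    """
    lines = dag_text.splitlines()

    # Pass 1: collect the names of the nodes to remove.
    doomed = set()
    for line in lines:
        tok = line.split()
        if len(tok) >= 3 and tok[0].upper() == "JOB" and tok[2] == submit_file:
            doomed.add(tok[1])

    # Pass 2: drop every statement that refers to a removed node.
    kept = []
    for line in lines:
        tok = line.split()
        key = tok[0].upper() if tok else ""
        upper = [t.upper() for t in tok]
        if key == "PARENT" and "CHILD" in upper:
            split = upper.index("CHILD")
            parents = [t for t in tok[1:split] if t not in doomed]
            children = [t for t in tok[split + 1:] if t not in doomed]
            # Keep the relationship only if both sides are still non-empty.
            if parents and children:
                kept.append("PARENT " + " ".join(parents)
                            + " CHILD " + " ".join(children))
            continue
        if key in ("JOB", "RETRY", "VARS") and len(tok) >= 2 and tok[1] in doomed:
            continue
        # SCRIPT PRE|POST <node> ...: the node name is the third token.
        if key == "SCRIPT" and len(tok) >= 3 and tok[2] in doomed:
            continue
        kept.append(line)
    return "\n".join(kept) + "\n"
```

It would be applied to the text of the rescue DAG, with the result written to a new file, so the original rescue DAG stays available for comparison.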
===========================================================================
Date of creation: Sat Mar 15 18:50:20 2008 (1205625023)
Subject: Actions
Assigned to bt by bt
===========================================================================
Date of actions: Mon Mar 17 9:21:07 2008 (1205774361)
Date: Mon, 17 Mar 2008 13:03:02 -0500
From: Bill Taylor <bt__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #17748] LIGO: noop assertion error
bt wrote:
Just a quick note to let you know I have moved this along to our
Dagman experts.
Bill
Condor Team
Note added from ticket
Re: [condor-admin #17756] LIGO Support Ticket 17748
Hi,
Please append this to LIGO support Ticket 17748.
In short, in a rescue DAG some of the children had 'noop_job = True'
added, and when the rescue DAG was submitted the schedd died
after some point.
Note, however, that this is for
$ condor_version
$CondorVersion: 6.9.4 Aug 30 2007 $
$CondorPlatform: X86_64-LINUX_RHEL3 $
$ condor_dagman -version
$CondorVersion: 6.9.5 Nov 28 2007 BuildID: 65347 $
$CondorPlatform: X86_64-LINUX_RHEL3 $
If you think this will be solved by upgrading to 7.0.1 and the
latest DAGMan binaries Kent has produced, let us know and we
will simply do that before debugging this further.
Thanks,
Scott
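To see how many jobs hit the BAD EVENT condition before the assertion fired, the DAGMan log could be scanned with something like the sketch below. It is only an illustration: the function name `summarize` is hypothetical, and the line formats it matches are assumed from the log excerpt quoted in this ticket, with each log entry on a single line (not wrapped as in the email).

```python
import re

# Pattern assumed from the excerpt in this ticket, e.g.
#   3/15 17:20:04 BAD EVENT: job (1653253.0.0) ended, submit count < 1 (0)
BAD_EVENT = re.compile(r"BAD EVENT: job \((\d+)\.(\d+)\.(\d+)\)")


def summarize(log_text):
    """Return (bad_jobs, assertion_line) found in a DAGMan log.

    bad_jobs is a list of (cluster, proc, subproc) integer tuples;
    assertion_line is the first line containing "Assertion ERROR",
    or None if there is no such line.
    """
    bad_jobs = []
    assertion_line = None
    for line in log_text.splitlines():
        match = BAD_EVENT.search(line)
        if match:
            bad_jobs.append(tuple(int(g) for g in match.groups()))
        if assertion_line is None and "Assertion ERROR" in line:
            assertion_line = line.strip()
    return bad_jobs, assertion_line
```

Comparing the number of reported jobs with the number of noop nodes submitted would show whether every noop job triggers the warning or only some of them.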
===========================================================================
Date mail was appended: Mon Mar 17 13:03:16 2008 (1205776999)
Subject: Actions
Assigned to pfc by bt
===========================================================================
Date of actions: Mon Mar 17 9:21:07 2008 (1205777129)
Date: Wed, 26 Mar 2008 15:51:37 -0500 (CDT)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: wenger <condor-admin__AT__cs.wisc.edu>
Subject: Re: [condor-admin #17748] LIGO: noop assertion error
See PR 940 (Noop job shouldn't require an executable).
--
R. Kent Wenger (wenger__AT__cs.wisc.edu, 608-262-6627,
http://www.cs.wisc.edu/~wenger/)
Computer Sciences Department
University of Wisconsin-Madison
===========================================================================
Date mail was appended: Wed Mar 26 15:51:44 2008 (1206564705)
CC: Stuart Anderson <anderson__AT__ligo.caltech.edu>, Scott Koranda
<skoranda__AT__gravity.phys.uwm.edu>
From: Nickolas Fotopoulos <nvf__AT__gravity.phys.uwm.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #17748] LIGO: noop assertion error
Date: Thu, 27 Mar 2008 13:28:19 -0500
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu
Kent,
I'm unable to find a PR 940 through Google or anything else. The only
tracker I know is the LIGO-related bug tracker:
<http://www.lsc-group.phys.uwm.edu/lscdatagrid/condorligo/index.html>
Can you point me to the PR you reference?
Thanks,
Nick
===========================================================================
Date mail was appended: Thu Mar 27 13:28:44 2008 (1206642525)