LIGO Support Ticket 17748
Ticket Information
Number: admin 17748
User: nvf@gravity.phys.uwm.edu
Email: anderson__AT__ligo.caltech.edu,skoranda__AT__gravity.phys.uwm.edu
Status: open
Assigned To: wenger
From: Nickolas Fotopoulos <nvf__AT__gravity.phys.uwm.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: LIGO: noop assertion error
Date: Sat, 15 Mar 2008 16:50:00 -0700
CC: Stuart Anderson <anderson__AT__ligo.caltech.edu>, Scott Koranda
<skoranda__AT__gravity.phys.uwm.edu>
X-Seen-BY: mailfromd 4.1 silica.cs.wisc.edu
When running many noop jobs, I am seeing the assertion error quoted
below.
$ condor_version
$CondorVersion: 6.9.4 Aug 30 2007 $
$CondorPlatform: X86_64-LINUX_RHEL3 $
$ condor_dagman -version
$CondorVersion: 6.9.5 Nov 28 2007 BuildID: 65347 $
$CondorPlatform: X86_64-LINUX_RHEL3 $
On Mar 15, 2008, at 3:36 PM, Nickolas Fotopoulos wrote:
> Scott,
>
> The noop route seems the most appealing. I tried it and it appeared
> to work for a while, but I think I ran into a Condor bug ~200 jobs
> into the noop jobs:
>
> ...
> ...
> 3/15 17:20:03 submitting: condor_submit -a dag_node_name' '='
> '97d736e41f382cd8477d1ff0e8ae484f -a +DAGManJobId' '=' '1653119 -a
> DAGManJobId' '=' '1653119 -a submit_event_notes' '=' 'DAG' 'Node:'
> '97d736e41f382cd8477d1ff0e8ae484f -a macrooutput' '='
> 'L1-SIRE_FIRST_GRB070714B_injections21-868429419-2048.xml -a
> macrosummary' '='
> 'L1-SIRE_FIRST_GRB070714B_injections21-868429419-2048.txt -a
> macrousertag' '=' 'GRB070714B_injections21 -a macroglob' '='
> 'L1-INSPIRAL_FIRST_GRB070714B_injections21_150-868429419-2048.xml.gz -a
> macroifocut' '=' 'L1 -a +DAGParentNodeNames' '='
> '"eb25ace1446075cf9f2d9f0eb93e0ae6"
> injections21.sire.GRB070714B_injections21.sub
> 3/15 17:20:04 From submit: Submitting job(s).
> 3/15 17:20:04 From submit: Logging submit event(s).
> 3/15 17:20:04 From submit: 1 job(s) submitted to cluster 1653253.
> 3/15 17:20:04 assigned Condor ID (1653253.0)
> ...
> ...
> 3/15 17:20:04 Number of idle job procs: 0
> 3/15 17:20:04 Event: ULOG_JOB_TERMINATED for Condor Node
> 97d736e41f382cd8477d1ff0e8ae484f (1653253.0)
> 3/15 17:20:04 BAD EVENT: job (1653253.0.0) ended, submit count < 1 (0)
> 3/15 17:20:04 BAD EVENT is warning only
> 3/15 17:20:04 ERROR "Assertion ERROR on (node->_queuedNodeJobProcs >= 0)"
> at line 3024 in file dag.C
>
>
>
> On Mar 14, 2008, at 7:22 PM, Scott Koranda wrote:
>
>> Hi Nick,
>>
>> Rather than setting 'executable = /bin/true' you could add to
>> the submit file 'hold = True'. The child jobs will then be submitted
>> and held and will not run unless you explicitly call
>> condor_release on them.
>>
>> In a similar way you could set 'noop_job = True' for the child
>> jobs and the jobs will simply be marked as completed with a
>> return value of 0.
>>
>> Scott
>>
>>> Dear all,
>>>
>>> After a DAG has run partway through, I've decided that the
>>> bottom-most post-processing jobs (several thousand of them)
>>> should not or cannot be run. When my rescue DAG comes, as it
>>> inevitably does, I would like not to execute these. So far, no
>>> problem; a one-line bash/sed invocation takes care of that:
>>>
>>> sed 's/.*mysubfile.*/& DONE/' "$f" > "${f}.sires_done";
>>>
>>> The problem is that not all of the parents have completed
>>> successfully. I'd like to resubmit the parents, but not these
>>> children. When I naively mark them as DONE, as above, I get the
>>> following error while dagman parses the DAG.
>>>
>>> 3/13 20:25:13 ERROR: AddParent( ea0bca7d3503cccca43dff66a99c1516 )
>>> failed for node a5bf08f49f3323fdd5f838f6d89918f7: STATUS_DONE child
>>> may not be given a new STATUS_READY parent
>>>
>>> Removing the JOB lines produces an error that the parent-child
>>> relationships refer to a non-existent job. (I don't have the exact
>>> message handy.)
>>>
>>> I see a few solutions, none of which I like:
>>> * resubmit without modification and let the children fail (wastes
>>> resources)
>>> * change the submit files to point to /bin/true and run in the local
>>> universe (a lot of scheduling overhead, I'd think, but maybe this is
>>> negligible)
>>> * identify all nodes of a class and remove all references to each of
>>> them (more code than I want to write at the moment)
>>>
>>> Can I get some gut reactions to these options or perhaps new,
>>> cleverer options?
>>>
>>> Thanks,
>>> Nick
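
The parse error quoted above is reported when a node marked DONE still has a parent that has yet to run. Before submitting an edited rescue DAG, a check along the following lines could list the offending nodes. This is a rough sketch only: `unsafe_done_nodes` is a hypothetical helper, and it assumes that DONE appears as a trailing keyword on JOB lines and that dependencies are written as PARENT ... CHILD lines.

```python
def unsafe_done_nodes(dag_lines):
    """Return DONE nodes that have at least one parent not marked DONE."""
    done = set()        # names of nodes whose JOB line carries DONE
    parents_of = {}     # child name -> set of parent names
    for line in dag_lines:
        words = line.split()
        if not words:
            continue
        keyword = words[0].upper()
        if keyword == "JOB" and len(words) >= 3:
            # JOB <name> <submit file> [options ...] [DONE]
            if "DONE" in (w.upper() for w in words[3:]):
                done.add(words[1])
        elif keyword == "PARENT":
            # PARENT p1 p2 ... CHILD c1 c2 ...
            split = [w.upper() for w in words].index("CHILD")
            for child in words[split + 1:]:
                parents_of.setdefault(child, set()).update(words[1:split])
    return sorted(
        node for node in done
        if any(parent not in done for parent in parents_of.get(node, ()))
    )
```

Any node this returns would need its DONE marker removed, or its parents completed, before DAGMan would accept the file.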
===================================
Nickolas Fotopoulos
nvf__AT__gravity.phys.uwm.edu
Office: (414) 229-6438
Fax: (414) 229-5589
University of Wisconsin - Milwaukee
Physics Bldg, Rm 471
===================================
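
The third option in the original question (identifying all nodes of one class and removing every reference to them) might be sketched as follows. This is a minimal illustration, not a tested tool: the function name `prune_nodes` is hypothetical, and it assumes the pruned nodes are leaves of the DAG and that the file contains only JOB, PARENT ... CHILD, SCRIPT, and other per-node lines (such as VARS or RETRY) that name the node in the second field.

```python
def prune_nodes(dag_lines, submit_file):
    """Drop every node whose JOB line names submit_file, plus all references to it."""
    # First pass: collect the names of the nodes to drop.
    doomed = set()
    for line in dag_lines:
        words = line.split()
        if len(words) >= 3 and words[0].upper() == "JOB" and words[2] == submit_file:
            doomed.add(words[1])

    # Second pass: keep only lines that do not refer to a dropped node.
    kept = []
    for line in dag_lines:
        words = line.split()
        if not words:
            kept.append(line)
            continue
        keyword = words[0].upper()
        if keyword == "PARENT":
            # PARENT p1 p2 ... CHILD c1 c2 ...
            split = [w.upper() for w in words].index("CHILD")
            parents = [w for w in words[1:split] if w not in doomed]
            children = [w for w in words[split + 1:] if w not in doomed]
            if parents and children:
                kept.append("PARENT " + " ".join(parents) + " CHILD " + " ".join(children))
            continue
        if keyword == "SCRIPT":
            # SCRIPT PRE|POST <node> <executable> ...
            node = words[2] if len(words) > 2 else None
        else:
            # JOB, VARS, RETRY, ... name the node in the second field.
            node = words[1] if len(words) > 1 else None
        if node not in doomed:
            kept.append(line)
    return kept
```

The input is expected as a list of lines without trailing newlines, for example from `open(path).read().splitlines()`.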
===========================================================================
Date of creation: Sat Mar 15 18:50:20 2008 (1205625023)
Subject: Actions
Assigned to bt by bt
===========================================================================
Date of actions: Mon Mar 17 9:21:07 2008 (1205774361)
Date: Mon, 17 Mar 2008 13:03:02 -0500
From: Bill Taylor <bt__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #17748] LIGO: noop assertion error
bt wrote:
Just a quick note to let you know I have moved this along to our
Dagman experts.
Bill
Condor Team
Note added from ticket
Re: [condor-admin #17756] LIGO Support Ticket 17748
Hi,
Please append this to LIGO support Ticket 17748.
In short, in a rescue DAG some of the children had 'noop_job = True'
added, and when the rescue DAG was submitted the schedd died
at some point.
Note, however, that this is for
$ condor_version
$CondorVersion: 6.9.4 Aug 30 2007 $
$CondorPlatform: X86_64-LINUX_RHEL3 $
$ condor_dagman -version
$CondorVersion: 6.9.5 Nov 28 2007 BuildID: 65347 $
$CondorPlatform: X86_64-LINUX_RHEL3 $
If you think this will be solved by upgrading to 7.0.1 and the
latest DAGMan binaries Kent has produced, let us know and we
will simply do that before debugging this further.
Thanks,
Scott
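
The change described here, adding 'noop_job = True' to the children's submit files, could be applied with a small script along these lines. This is only a sketch: the helper name `add_noop` is hypothetical, and it assumes each submit description contains a plain `queue` statement, before which the new command has to appear.

```python
def add_noop(submit_lines):
    """Return the submit-file lines with 'noop_job = True' placed before the first queue statement."""
    out = []
    inserted = False
    for line in submit_lines:
        words = line.split()
        first = words[0].lower() if words else ""
        if first == "queue" and not inserted:
            # Submit commands only take effect if they precede the queue statement.
            out.append("noop_job = True")
            inserted = True
        out.append(line)
    if not inserted:
        # No queue statement was found; append the setting at the end instead.
        out.append("noop_job = True")
    return out
```

As with the pruning of DAG files, the lines are passed in and returned without trailing newlines, so the caller decides how to write the result back to disk.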
===========================================================================
Date mail was appended: Mon Mar 17 13:03:16 2008 (1205776999)
Subject: Actions
Assigned to pfc by bt
===========================================================================
Date of actions: Mon Mar 17 9:21:07 2008 (1205777129)
Date: Wed, 26 Mar 2008 15:51:37 -0500 (CDT)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: wenger <condor-admin__AT__cs.wisc.edu>
Subject: Re: [condor-admin #17748] LIGO: noop assertion error
See PR 940 (Noop job shouldn't require an executable).
--
R. Kent Wenger (wenger__AT__cs.wisc.edu, 608-262-6627,
http://www.cs.wisc.edu/~wenger/)
Computer Sciences Department
University of Wisconsin-Madison
===========================================================================
Date mail was appended: Wed Mar 26 15:51:44 2008 (1206564705)
CC: Stuart Anderson <anderson__AT__ligo.caltech.edu>, Scott Koranda
<skoranda__AT__gravity.phys.uwm.edu>
From: Nickolas Fotopoulos <nvf__AT__gravity.phys.uwm.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #17748] LIGO: noop assertion error
Date: Thu, 27 Mar 2008 13:28:19 -0500
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu
Kent,
I'm unable to find a PR 940 through Google or anything else. The only
tracker I know is the LIGO-related bug tracker:
<http://www.lsc-group.phys.uwm.edu/lscdatagrid/condorligo/index.html>
Can you point me to the PR you reference?
Thanks,
Nick
On Mar 26, 2008, at 3:51 PM, condor-admin response tracking system
wrote:
> See PR 940 (Noop job shouldn't require an executable).
>
> --
> R. Kent Wenger (wenger__AT__cs.wisc.edu, 608-262-6627,
> http://www.cs.wisc.edu/~wenger/)
> Computer Sciences Department
> University of Wisconsin-Madison
===================================
Nickolas Fotopoulos
nvf__AT__gravity.phys.uwm.edu
Office: (414) 229-6438
Fax: (414) 229-5589
University of Wisconsin - Milwaukee
Physics Bldg, Rm 471
===================================
===========================================================================
Date mail was appended: Thu Mar 27 13:28:44 2008 (1206642525)
Subject: Actions
Assigned to wenger by roy
===========================================================================
Date of actions: Mon Nov 3 12:24:47 2008 (1225736688)
Date: Mon, 22 Dec 2008 14:35:08 -0600 (CST)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: roy <condor-admin__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-admin #17748] LIGO: noop assertion error
Nick,
I was just going back through old emails, and realized that I never
replied to this...
Anyhow, if you still have the information around, I'd like to see the
complete dagman.out file for this, and the dag file and node job log files
if possible.
Also, there's another LIGO request to not actually submit noop jobs at all
in DAGMan. We haven't implemented it yet, but it's at least in the
pipeline.
Kent Wenger
Condor Team
===========================================================================
Date mail was appended: Mon Dec 22 14:35:11 2008 (1229978112)
Date: Mon, 22 Dec 2008 18:26:58 -0500
From: "Nickolas Fotopoulos" <nvf__AT__gravity.phys.uwm.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #17748] LIGO: noop assertion error
CC: anderson__AT__ligo.caltech.edu, skoranda__AT__gravity.phys.uwm.edu
X-Google-Sender-Auth: 0ef9b26624dc7d8f
X-Seen-BY: mailfromd 4.1 obsidian.cs.wisc.edu
Kent,
I apologize, but I no longer have the dag and dagman.out files
available. They must have fallen victim to the last call to free up
disk space. I will mention, though, that despite many noop jobs run
since then, I have seen no additional assertion errors.
As to not submitting noop jobs, that was probably my request also. I
would naïvely guess that whatever caused the assertion error is
probably shortcutted by not queueing the jobs at all. In light of all
this, I don't think that the assertion error PR (#17748) is very
important.
Thanks for checking up on this. Have a good holiday.
Take care,
Nick
On Mon, Dec 22, 2008 at 3:35 PM, condor-admin response tracking system
<condor-admin__AT__cs.wisc.edu> wrote:
> Nick,
>
> I was just going back through old emails, and realized that I never
> replied to this...
>
> Anyhow, if you still have the information around, I'd like to see the
> complete dagman.out file for this, and the dag file and node job log files
> if possible.
>
> Also, there's another LIGO request to not actually submit noop jobs at all
> in DAGMan. We haven't implemented it yet, but it's at least in the
> pipeline.
>
> Kent Wenger
> Condor Team
--
===================================
Nickolas Fotopoulos
nvf__AT__gravity.phys.uwm.edu
Office: (414) 229-6438
Fax: (414) 229-5589
University of Wisconsin - Milwaukee
Physics Bldg, Rm 471
===================================
===========================================================================
Date mail was appended: Mon Dec 22 17:27:04 2008 (1229988424)