LIGO Support Ticket 1679

Ticket Information
  Number:      support 1679
  User:        anderson@ligo.caltech.edu
  Email:       espinoza_e__AT__ligo.caltech.edu
  Status:      resolved
  Assigned To: wenger
Date: Sat, 9 Sep 2006 15:29:59 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu, tannenba__AT__cs.wisc.edu
CC: wenger__AT__cs.wisc.edu, adesmet__AT__cs.wisc.eduerik
Subject: LIGO condor-6.8.1 duplicate rescue dags

A pre-release of condor-6.8.1 recently provided by Alan for running on the
LIGO CIT cluster just had the schedd segfault and then exit with status 4
on restart a few minutes later, see [condor-support #1677] and
[condor-support #1677]. In addition, after these two schedd restarts
we now have multiple instances of rescue dags running (in triplicate
as is the normal case). Therefore, it appears that there is still a
problem with DAG's handling these disruptive crashes.

The hard evidence is the following:
[root@ldas-grid log]# ps -ef | grep Lockfile | awk '{print $8}' | sort | uniq -c
      1 condor_scheduniv_exec.6716973.0
      1 condor_scheduniv_exec.6716981.0
      1 condor_scheduniv_exec.6717000.0
      3 condor_scheduniv_exec.6717008.0
      1 condor_scheduniv_exec.6746185.0
      1 condor_scheduniv_exec.6832866.0
      3 condor_scheduniv_exec.6922506.0
      3 condor_scheduniv_exec.6927226.0
      1 grep

Note the 3 job IDs for which the Linux kernel reports 3 running processes
each. Here is one of these in more detail:

[root@ldas-grid log]# ps -ef | grep 6927226.0
gstef    32481     1  0 Sep08 ?        00:05:43 condor_scheduniv_exec.6927226.0 -f -l . -Debug 3 -Lockfile monitor_dag/the.dag.lock -Condorlog monitor_dag/the.dag.log -Dag monitor_dag/the.dag -Rescue monitor_dag/the.dag.rescue
gstef    20534     1  3 13:26 ?        00:04:21 condor_scheduniv_exec.6927226.0 -f -l . -Debug 3 -Lockfile monitor_dag/the.dag.lock -Condorlog monitor_dag/the.dag.log -Dag monitor_dag/the.dag -Rescue monitor_dag/the.dag.rescue
gstef    20622 20593  4 13:27 ?        00:04:56 condor_scheduniv_exec.6927226.0 -f -l . -Debug 3 -Lockfile monitor_dag/the.dag.lock -Condorlog monitor_dag/the.dag.log -Dag monitor_dag/the.dag -Rescue monitor_dag/the.dag.rescue
root      9495 30346  0 15:24 pts/16   00:00:00 grep 6927226.0

Note the Unix process start times for these 3 processes. Sep08 was after
we installed the 6.8.1 pre-release, and the last 2 were started ~1 minute
apart. The second schedd crash was at 13:26:33.
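
The same per-job count can be scripted for repeated checks. What follows is
a minimal Python sketch, not part of the original report. It assumes
`ps -ef` output with the usual seven columns ahead of the command, and it
keeps only processes started with a -Lockfile argument. The name
duplicate_dagman_jobs and the SAMPLE list (command lines shortened from the
listing above) are illustrative only.

```python
from collections import Counter


def duplicate_dagman_jobs(ps_lines):
    """Return {command_name: process_count} for DAGMan jobs seen more than once."""
    counts = Counter()
    for line in ps_lines:
        fields = line.split()
        # ps -ef columns: UID PID PPID C STIME TTY TIME CMD [args...]
        # Keep only processes that were started with a -Lockfile argument.
        if len(fields) < 8 or "-Lockfile" not in fields[8:]:
            continue
        counts[fields[7]] += 1
    return {name: n for name, n in counts.items() if n > 1}


# Command lines shortened from the ticket; the last line is the grep itself.
SAMPLE = [
    "gstef    32481     1  0 Sep08 ?        00:05:43 condor_scheduniv_exec.6927226.0 -f -l . -Debug 3 -Lockfile monitor_dag/the.dag.lock",
    "gstef    20534     1  3 13:26 ?        00:04:21 condor_scheduniv_exec.6927226.0 -f -l . -Debug 3 -Lockfile monitor_dag/the.dag.lock",
    "gstef    20622 20593  4 13:27 ?        00:04:56 condor_scheduniv_exec.6927226.0 -f -l . -Debug 3 -Lockfile monitor_dag/the.dag.lock",
    "root      9495 30346  0 15:24 pts/16   00:00:00 grep Lockfile",
]

duplicates = duplicate_dagman_jobs(SAMPLE)
print(duplicates)
```

To check a live machine, pass in the lines of real `ps -ef` output, for
example subprocess.run(["ps", "-ef"], capture_output=True,
text=True).stdout.splitlines().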

Until such time as we can prevent the schedd from segfaulting, or the
dedicated scheduler from crashing on restart, it is very important
that we can at least get dagman to be able to ride out the storm
without running duplicate jobs.

Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date of creation: Sat Sep  9 17:30:46 2006 (1157841048)
Subject: Actions

Assigned to wenger by adesmet
===========================================================================
Date of actions: Mon Sep 11 14:34:01 2006 (1158003241)
Date: Mon, 11 Sep 2006 14:51:53 -0500 (CDT)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: adesmet <condor-support__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #1679] LIGO condor-6.8.1 duplicate rescue dags

Stuart,

> A pre-release of condor-6.8.1 recently provided by Alan for running on the
> LIGO CIT cluster just had the schedd segfault and then exit with status 4
> on restart a few minutes later, see [condor-support #1677] and
> [condor-support #1677]. In addition, after these two schedd restarts
> we now have multiple instances of rescue dags running (in triplicate
> as is the normal case). Therefore, it appears that there is still a
> problem with DAG's handling these disruptive crashes.

As I said in the previous email, the dagman.out file is what I need to
see why the "avoiding duplicate DAGMan runs" feature didn't work right.

I'm not sure about the underlying issue of how you end up with multiple
DAGMan jobs in the first place...

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Mon Sep 11 14:54:55 2006 (1158004496)
Date: Mon, 11 Sep 2006 16:14:36 -0500 (CDT)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: adesmet <condor-support__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #1679] LIGO condor-6.8.1 duplicate rescue dags

Stuart,

> A pre-release of condor-6.8.1 recently provided by Alan for running on the
> LIGO CIT cluster just had the schedd segfault and then exit with status 4
> on restart a few minutes later, see [condor-support #1677] and
> [condor-support #1677]. In addition, after these two schedd restarts
> we now have multiple instances of rescue dags running (in triplicate
> as is the normal case). Therefore, it appears that there is still a
> problem with DAG's handling these disruptive crashes.

I think the mystery is solved.  I sent you the wrong config macro before.

You need to have DAGMAN_ABORT_DUPLICATES set to true; I glanced too
quickly at the manual.

According to the dagman.out file, DAGMAN_ABORT_DUPLICATES is not set to
true:

    9/8 10:34:38 DAGMAN_ABORT_DUPLICATES setting: 0
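
Checking this line across many dagman.out files can be automated. The
Python sketch below is only an illustration. It assumes the line format
shown above and further assumes that an enabled setting is logged as a
nonzero integer, which this ticket does not confirm. The function name
abort_duplicates_enabled is hypothetical.

```python
import re

# Matches the line format quoted from dagman.out in this ticket.
SETTING_RE = re.compile(r"DAGMAN_ABORT_DUPLICATES setting:\s*(\d+)")


def abort_duplicates_enabled(dagman_out_lines):
    """Return True or False from the last logged setting, or None if absent."""
    enabled = None
    for line in dagman_out_lines:
        match = SETTING_RE.search(line)
        if match:
            # Assumption: any nonzero value means the feature is on.
            enabled = int(match.group(1)) != 0
    return enabled


sample = ["9/8 10:34:38 DAGMAN_ABORT_DUPLICATES setting: 0"]
print(abort_duplicates_enabled(sample))
```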

It looks like I sent email about this a while ago, but that probably
wasn't in conjunction with you getting the 6.8.1 pre-release binaries.

Anyhow, here are the details:

There's a new config macro (DAGMAN_ABORT_DUPLICATES) that controls this.
DAGMAN_ABORT_DUPLICATES defaults to false, so the default behavior hasn't
changed.

If DAGMAN_ABORT_DUPLICATES is set to true, and multiple DAGMans get
started on the same DAG, all but the first should abort with a message
like this:

8/9 21:02:59 Duplicate DAGMan PID 28534 is alive; this DAGMan should abort.
8/9 21:02:59 Aborting because it looks like another instance of DAGMan is currently running on this DAG; if that is not the case, delete the lock file (dag_files/diamond39.dag.lock) and re-submit the DAG.

I've tested this out, and the second and subsequent DAGMans abort before
they actually submit any jobs, so they don't do anything that messes up
the first one.
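
The general idea behind a lock-file duplicate check can be sketched in
Python. This is not DAGMan's actual implementation, and the ticket does not
describe the real lock file format. The sketch assumes, purely for
illustration, that the lock file holds the owning process's PID as decimal
text. The names pid_is_alive and should_abort are hypothetical, and the
liveness probe is POSIX-only.

```python
import os
import tempfile


def pid_is_alive(pid):
    """POSIX-only probe: signal 0 tests for existence without sending a signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def should_abort(lock_path, is_alive=pid_is_alive):
    """Return True if the lock file names a live process (a likely duplicate)."""
    try:
        with open(lock_path) as f:
            pid = int(f.read().strip())  # assumed format: PID as decimal text
    except (FileNotFoundError, ValueError):
        return False  # no lock file, or unreadable contents
    return is_alive(pid)


with tempfile.TemporaryDirectory() as d:
    lock = os.path.join(d, "the.dag.lock")
    with open(lock, "w") as f:
        f.write(str(os.getpid()))
    # This process is alive, so a second instance would see a live lock holder.
    live_lock = should_abort(lock)
    # Simulate a stale lock by injecting a probe that reports the PID as dead.
    stale_lock = should_abort(lock, is_alive=lambda pid: False)

print(live_lock, stale_lock)
```

The stale-lock branch corresponds to the case in the message above where
no other DAGMan is running and the lock file has to be deleted by hand.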

Sorry for the confusion...

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Mon Sep 11 16:14:55 2006 (1158009296)
Date: Mon, 11 Sep 2006 14:22:32 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>,
 Erik Espinoza <espinoza_e__AT__ligo.caltech.edu>
Subject: Re: [condor-support #1679] LIGO condor-6.8.1 duplicate rescue dags

Kent,
	Thanks for tracking this down. You did indeed send us this message
a while back and I failed to make the connection between this new 6.8.1
feature and the 6.8.1 pre-release binaries. We will enable this option
today.

	How much of condor do we have to restart and/or reconfigure after
making this change to condor_config? Just the schedd, or do our users need
to restart their dagman jobs as well?

Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Mon Sep 11 16:22:53 2006 (1158009773)
Date: Mon, 11 Sep 2006 16:49:11 -0500 (CDT)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #1679] LIGO condor-6.8.1 duplicate rescue dags

> 	Thanks for tracking this down. You did indeed send us this message
> a while back and I failed to make the connection between this new 6.8.1
> feature and the 6.8.1 pre-release binaries. We will enable this option
> today.
>
> 	How much of condor do we have to restart and/or reconfigure after
> making this change to condor_config? Just the schedd, or do our users need
> to restart their dagman jobs as well?

You'll have to restart the DAGMans.  The schedd doesn't know anything
about DAGMAN_ABORT_DUPLICATES.

You can condor_rm the condor_dagman jobs, and then do condor_submit_dag
on the resulting rescue DAG files.  If you want to avoid wasting cycles,
you can condor_hold the DAGMans, wait for the already-submitted node jobs
to finish, and then condor_rm the DAGMans and submit the rescue DAGs.
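
The two restart options above can be written out as a dry-run plan. The
Python sketch below only builds command lines and runs nothing. The helper
name restart_plan is hypothetical, and the cluster ID and rescue file path
are reused from earlier in this ticket as example values.

```python
def restart_plan(cluster_ids, rescue_dags, drain_first=False):
    """Build (but do not run) the commands for restarting DAGMan jobs.

    Returns (hold_cmds, restart_cmds). When drain_first is True, run
    hold_cmds, wait for the already-submitted node jobs to finish, and
    only then run restart_cmds.
    """
    hold_cmds = []
    if drain_first:
        hold_cmds = [["condor_hold", str(c)] for c in cluster_ids]
    restart_cmds = [["condor_rm", str(c)] for c in cluster_ids]
    restart_cmds += [["condor_submit_dag", dag] for dag in rescue_dags]
    return hold_cmds, restart_cmds


hold_cmds, restart_cmds = restart_plan(
    [6927226], ["monitor_dag/the.dag.rescue"], drain_first=True
)
for argv in hold_cmds + restart_cmds:
    print(" ".join(argv))
```

Returning the two phases separately keeps the manual wait step explicit,
so the rescue DAG is not submitted while node jobs are still running.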

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Mon Sep 11 16:53:06 2006 (1158011586)
Date: Sat, 28 Oct 2006 14:51:51 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: espinoza_e__AT__ligo.caltech.edu
Subject: Re: [condor-support #1679] LIGO condor-6.8.1 duplicate rescue dags

We have not seen this problem since upgrading to condor 6.8.2. Please consider
closing this ticket.

Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Sat Oct 28 16:52:09 2006 (1162072329)
Date: Thu, 16 Nov 2006 15:17:58 -0600 (CST)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #1679] LIGO condor-6.8.1 duplicate rescue dags

Stuart,

> We have not seen this problem since upgrading to condor 6.8.2. Please consider
> closing this ticket.

Okay, glad to hear you haven't seen this.

I think there were some fixes in 6.8.2 to the underlying problem in the
schedd, plus the workaround in DAGMan itself.

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Thu Nov 16 15:21:45 2006 (1163712106)
Subject: Actions

Ticket resolved by wenger
===========================================================================
Date of actions: Wed Nov 22 11:49:06 2006 (1164217746)