LIGO Support Ticket 19455
Ticket Information
Number: admin 19455
User: anderson@ligo.caltech.edu
Email: dabrown__AT__physics.syr.edu,skoranda__AT__gravity.phys.uwm.edu,aplundgr__AT__syr.edu,kcannon__AT__ligo.caltech.edu,cdcapano__AT__physics.syr.edu,dmoraru__AT__ligo-wa.caltech.edu
Status: open
Assigned To: psilord
CC: Duncan Brown <dabrown__AT__physics.syr.edu>, Scott Koranda
<skoranda__AT__gravity.phys.uwm.edu>, Andrew P Lundgren
<aplundgr__AT__syr.edu>, Kipp Cannon <kcannon__AT__ligo.caltech.edu>,
Collin Capano <cdcapano__AT__physics.syr.edu>
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: LIGO: problems with active DAGs during Condor upgrade
Date: Sat, 4 Jul 2009 19:31:32 -0700
X-Seen-BY: mailfromd 4.1 obsidian.cs.wisc.edu
We ran into a few problems on the LIGO condor pool at CIT with large
DAGs that where running when the pool was upgraded from 7.2.3 to
7.2.4. To get started, here is one example of a user log file that
shows a version mismatch error. The procedure I followed was to run
rpm -Uvh to install the 7.2.4 RPM and let the condor daemons auto-
detect the new version and do a clean restart. However, this does not
appear to be the right procedure when there are active DAGs in the pool.
What is the recommended way to upgrade a pool that is actively running
DAGs?
The second part this ticket is why the user observed that "most of the
dag has run twice"?
Note, there will be a second ticket opened where a much larger DAG was
left in a corrupt state after this upgrade and could not be manually
repaired but had to be completely re-run.
Thanks.
Begin forwarded message:
> From: Andrew P Lundgren <aplundgr__AT__syr.edu>
> Date: July 4, 2009 6:33:48 AM PDT
> To: Stuart Anderson <anderson__AT__ligo.caltech.edu>, Duncan Brown <dabrown__AT__physics.syr.edu
> >
> Cc: Kipp Cannon <kcannon__AT__ligo.caltech.edu>, Collin Capano <cdcapano__AT__physics.syr.edu
> >
> Subject: RE: dag
>
> Hi Stuart,
>
> I've been looking through Collin's log files, figuring out run times
> on the E14 data. After the SIGTERM, the dags restarted and then
> quit because of a version mismatch in condor. They came back again
> (I think a few hours later) and ran normally this time. However,
> before the SIGTERM, more than 2000 jobs had finished in the
> full_data part of the dag. After, only 380 jobs were marked as
> done. So most of the dag has run twice. Any idea what happened?
>
> The dag was running on ldas-grid in /archive/home/cdcapano/e14/
> ihoperun/928875615-929134815.
>
> Thanks,
> Andy
>
> Here are some lines from the dagman.out file.
>
> 6/28 04:35:37 Done Pre Queued Post Ready Un-Ready
> Failed
> 6/28 04:35:37 === === === === === ===
> ===
> 6/28 04:35:37 2298 0 7 0 0
> 95 1
> ...
> 6/28 17:05:42 Got SIGTERM. Performing graceful shutdown.
> 6/28 17:05:42 **** condor_scheduniv_exec.49881787.0 (condor_DAGMAN)
> pid 3313 EXITING WITH STATUS 3
> ...
> 6/28 17:08:04 Lock file inspiral_hipe_full_data.FULL_DATA.dag.lock
> detected,
> 6/28 17:08:04 Duplicate DAGMan PID 3313 is no longer alive; this
> DAGMan should continue.
> ...
> 6/28 17:08:16 **** condor_scheduniv_exec.49881787.0 (condor_DAGMAN)
> pid 3483 EXITING WITH STATUS 1
> 6/28 17:08:41 ******************************************************
> 6/28 17:08:41 ** condor_scheduniv_exec.50455804.0 (CONDOR_DAGMAN)
> STARTING UP
> ...
> 6/28 17:08:41 Version mismatch: condor_submit_dag ($CondorVersion:
> 7.2.3 May 11 2009 BuildID: 151729 $) vs. condor_dagman
> ($CondorVersion: 7.2.4 Jun 15 2009 BuildID: 159529 $)
> 6/28 17:08:41 **** condor_scheduniv_exec.50455804.0 (condor_DAGMAN)
> pid 3788 EXITING WITH STATUS 1
> 6/29 14:47:49 ******************************************************
> 6/29 14:47:49 ** condor_scheduniv_exec.50527579.0 (CONDOR_DAGMAN)
> STARTING UP
> ...
> 6/29 14:47:49 Found rescue DAG number 1; running
> inspiral_hipe_full_data.FULL_DATA.dag.rescue001 instead of normal
> DAG file
> 6/29 14:47:49 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 6/29 14:47:49 RUNNING RESCUE DAG
> inspiral_hipe_full_data.FULL_DATA.dag.rescue001
> ...
> 6/29 14:48:16 Done Pre Queued Post Ready Un-Ready
> Failed
> 6/29 14:48:16 === === === === === ===
> ===
> 6/29 14:48:16 384 0 5 0 160
> 1852 0
>
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date of creation: Sat Jul 4 21:31:53 2009 (1246761116)
Subject: Actions
Assigned to psilord by jfrey
===========================================================================
Date of actions: Mon Jul 6 14:24:50 2009 (1246908290)
Date: Wed, 8 Jul 2009 12:04:47 -0500
From: Peter Keller <psilord__AT__cs.wisc.edu>
To: jfrey <condor-admin__AT__cs.wisc.edu>
Subject: Re: [condor-admin #19455] LIGO: problems with active DAGs during
Condor upgrade
Hello,
> From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
>
> We ran into a few problems on the LIGO condor pool at CIT with large
> DAGs that where running when the pool was upgraded from 7.2.3 to
> 7.2.4. To get started, here is one example of a user log file that
> shows a version mismatch error. The procedure I followed was to run
> rpm -Uvh to install the 7.2.4 RPM and let the condor daemons auto-
> detect the new version and do a clean restart. However, this does not
> appear to be the right procedure when there are active DAGs in the pool.
Is DAGMAN_COPY_TO_SPOOL set to TRUE for your pool? This is a config file
variable which showed up in 7.2.1 that copies the condor_dagman binary
to spool upon dag submission. This is so that over upgrades around long
running dags, you don't get certain kinds of version mismatch error.
The log message:
6/28 17:08:41 Version mismatch: condor_submit_dag ($CondorVersion:
7.2.3 May 11 2009 BuildID: 151729 $) vs. condor_dagman
($CondorVersion: 7.2.4 Jun 15 2009 BuildID: 159529 $)
tells me that the dag was submitted with a 7.2.3 condor_submit dag, but
was tried to be restarted with a 7.2.4 condor-dagman. This then tells
me that DAGMAN_COPY_TO_SPOOL is unset or false for your submit point. This
allowed a newer condor_dagman (resultant from the upgrade) to manage a
dagman submitted by an older condor_submit_dag.
I'd set DAGMAN_COPY_TO_SPOOL to TRUE for that submit point
since you have long running dags.
> What is the recommended way to upgrade a pool that is actively running
> DAGs?
We'd like it to be in the manner you just performed and will figure out
what happened.
> The second part this ticket is why the user observed that "most of the
> dag has run twice"?
It looks like a rescue dag had been made from near the start of the dag
run, and this was the one discovered by a later dagman and then executed.
It appears that in the situation of a condor-dagman being shut down by
the schedd because the schedd decided to restart itself, condor_dagman
doesn't make a rescue dag. I think I remember some other RUST ticket
about this exact thing where you are requesting dagman to make a rescue
dag in a similar situation, but I don't have it handy at the moment.
I've confirmed with Kent that this is exactly the case, when the schedd
restarts due to finding a newer version of itself, condor_dagman doesn't
make a rescue dag. If there is a previous rescue dag upon dstart up,
that gets run.
Thank you.
-pete
===========================================================================
Date mail was appended: Wed Jul 8 12:04:52 2009 (1247072693)
CC: Duncan Brown <dabrown__AT__physics.syr.edu>, Scott Koranda
<skoranda__AT__gravity.phys.uwm.edu>, Andrew P Lundgren
<aplundgr__AT__syr.edu>, Kipp Cannon <kcannon__AT__ligo.caltech.edu>,
Collin Capano <cdcapano__AT__physics.syr.edu>, Todd Tannenbaum
<tannenba__AT__cs.wisc.edu>, Dan Moraru <dmoraru__AT__ligo-wa.caltech.edu>
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #19455] LIGO: problems with active DAGs during
Condor upgrade
Date: Sat, 11 Jul 2009 11:42:39 -0700
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu
On Jul 8, 2009, at 10:04 AM, condor-admin response tracking system
wrote:
> Hello,
>
>> From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
>>
>> We ran into a few problems on the LIGO condor pool at CIT with large
>> DAGs that where running when the pool was upgraded from 7.2.3 to
>> 7.2.4. To get started, here is one example of a user log file that
>> shows a version mismatch error. The procedure I followed was to run
>> rpm -Uvh to install the 7.2.4 RPM and let the condor daemons auto-
>> detect the new version and do a clean restart. However, this does not
>> appear to be the right procedure when there are active DAGs in the
>> pool.
>
> Is DAGMAN_COPY_TO_SPOOL set to TRUE for your pool?
Yes,
[root@ldas-pcdev1 ~]# condor_config_val DAGMAN_COPY_TO_SPOOL
True
> This is a config file
> variable which showed up in 7.2.1 that copies the condor_dagman binary
> to spool upon dag submission. This is so that over upgrades around
> long
> running dags, you don't get certain kinds of version mismatch error.
>
> The log message:
>
> 6/28 17:08:41 Version mismatch: condor_submit_dag ($CondorVersion:
> 7.2.3 May 11 2009 BuildID: 151729 $) vs. condor_dagman
> ($CondorVersion: 7.2.4 Jun 15 2009 BuildID: 159529 $)
>
> tells me that the dag was submitted with a 7.2.3 condor_submit dag,
> but
> was tried to be restarted with a 7.2.4 condor-dagman. This then tells
> me that DAGMAN_COPY_TO_SPOOL is unset or false for your submit
> point. This
> allowed a newer condor_dagman (resultant from the upgrade) to manage a
> dagman submitted by an older condor_submit_dag.
>
> I'd set DAGMAN_COPY_TO_SPOOL to TRUE for that submit point
> since you have long running dags.
Understood, but it looks like there might be a simple bug here since
we had this setting enabled. Here are all the default DAGMAN_ settings
on the submit machine in question:
[root@ldas-pcdev1 condor]# grep DAGMAN_ condor_config
condor_config.local
condor_config:DAGMAN_SUBMIT_DEPTH_FIRST = True
condor_config:DAGMAN_DEBUG_CACHE_ENABLE = True
condor_config:DAGMAN_COPY_TO_SPOOL = True
condor_config.local:DAGMAN_MAX_JOBS_SUBMITTED = 2500
condor_config.local:DAGMAN_MAX_JOBS_IDLE = 20
condor_config.local:DAGMAN_PROHIBIT_MULTI_JOBS = True
condor_config.local:# If DAGMAN_ABORT_DUPLICATES is set to true, and
multiple DAGMans get
condor_config.local:# DAGMAN_ABORT_DUPLICATES defaults to false, so
the default behavior
condor_config.local:DAGMAN_ABORT_DUPLICATES = True
>
>> What is the recommended way to upgrade a pool that is actively
>> running
>> DAGs?
>
> We'd like it to be in the manner you just performed and will figure
> out
> what happened.
>
>> The second part this ticket is why the user observed that "most of
>> the
>> dag has run twice"?
>
> It looks like a rescue dag had been made from near the start of the
> dag
> run, and this was the one discovered by a later dagman and then
> executed.
> It appears that in the situation of a condor-dagman being shut down by
> the schedd because the schedd decided to restart itself, condor_dagman
> doesn't make a rescue dag. I think I remember some other RUST ticket
> about this exact thing where you are requesting dagman to make a
> rescue
> dag in a similar situation, but I don't have it handy at the moment.
>
> I've confirmed with Kent that this is exactly the case, when the
> schedd
> restarts due to finding a newer version of itself, condor_dagman
> doesn't
> make a rescue dag. If there is a previous rescue dag upon dstart up,
> that gets run.
Given the lack of automatic rescue dag generation and the recent
DAGMan code to automatically pick up and start from the last rescue
dag does this leave us in a guaranteed problem situation of restarting
from the wrong state?
If I understand this correctly we should do one or both of the
following:
1) A clean schedd shutdown should force a rescue dag dump. The
restarting of a complicated several 100k node nested DAG workflow
appears to be inherently complicated so the more metadata dumped about
the state of a properly running job under controlled circumstances the
better--perhaps for debugging purposes for future corner cases if
nothing else.
2) The automatic restart algorithm for DAGs should be enhanced to not
simply look at the last rescue dag, but perform some reasonable
heuristic to decide whether to roll back the logfiles or the last
rescue dag.
Thanks.
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date mail was appended: Sat Jul 11 13:43:00 2009 (1247337780)
CC: Duncan Brown <dabrown__AT__physics.syr.edu>, Scott Koranda
<skoranda__AT__gravity.phys.uwm.edu>, Andrew P Lundgren
<aplundgr__AT__syr.edu>, Kipp Cannon <kcannon__AT__ligo.caltech.edu>,
Collin Capano <cdcapano__AT__physics.syr.edu>, Todd Tannenbaum
<tannenba__AT__cs.wisc.edu>, Dan Moraru <dmoraru__AT__ligo-wa.caltech.edu>
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #19455] LIGO: problems with active DAGs during
Condor upgrade
Date: Mon, 13 Jul 2009 20:23:28 -0700
X-Seen-BY: mailfromd 4.1 obsidian.cs.wisc.edu
On Jul 11, 2009, at 11:42 AM, Stuart Anderson wrote:
>
> On Jul 8, 2009, at 10:04 AM, condor-admin response tracking system
> wrote:
>
>> Hello,
>>
>>> From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
>>>
>>> We ran into a few problems on the LIGO condor pool at CIT with large
>>> DAGs that where running when the pool was upgraded from 7.2.3 to
>>> 7.2.4. To get started, here is one example of a user log file that
>>> shows a version mismatch error. The procedure I followed was to run
>>> rpm -Uvh to install the 7.2.4 RPM and let the condor daemons auto-
>>> detect the new version and do a clean restart. However, this does
>>> not
>>> appear to be the right procedure when there are active DAGs in the
>>> pool.
>>
>> Is DAGMAN_COPY_TO_SPOOL set to TRUE for your pool?
>
> Yes,
>
> [root@ldas-pcdev1 ~]# condor_config_val DAGMAN_COPY_TO_SPOOL
> True
>
>> This is a config file
>> variable which showed up in 7.2.1 that copies the condor_dagman
>> binary
>> to spool upon dag submission. This is so that over upgrades around
>> long
>> running dags, you don't get certain kinds of version mismatch error.
>>
>> The log message:
>>
>> 6/28 17:08:41 Version mismatch: condor_submit_dag ($CondorVersion:
>> 7.2.3 May 11 2009 BuildID: 151729 $) vs. condor_dagman
>> ($CondorVersion: 7.2.4 Jun 15 2009 BuildID: 159529 $)
>>
>> tells me that the dag was submitted with a 7.2.3 condor_submit dag,
>> but
>> was tried to be restarted with a 7.2.4 condor-dagman. This then tells
>> me that DAGMAN_COPY_TO_SPOOL is unset or false for your submit
>> point. This
>> allowed a newer condor_dagman (resultant from the upgrade) to
>> manage a
>> dagman submitted by an older condor_submit_dag.
>>
>> I'd set DAGMAN_COPY_TO_SPOOL to TRUE for that submit point
>> since you have long running dags.
>
> Understood, but it looks like there might be a simple bug here since
> we had this setting enabled. Here are all the default DAGMAN_
> settings on the submit machine in question:
>
> [root@ldas-pcdev1 condor]# grep DAGMAN_ condor_config
> condor_config.local
> condor_config:DAGMAN_SUBMIT_DEPTH_FIRST = True
> condor_config:DAGMAN_DEBUG_CACHE_ENABLE = True
> condor_config:DAGMAN_COPY_TO_SPOOL = True
> condor_config.local:DAGMAN_MAX_JOBS_SUBMITTED = 2500
> condor_config.local:DAGMAN_MAX_JOBS_IDLE = 20
> condor_config.local:DAGMAN_PROHIBIT_MULTI_JOBS = True
> condor_config.local:# If DAGMAN_ABORT_DUPLICATES is set to true, and
> multiple DAGMans get
> condor_config.local:# DAGMAN_ABORT_DUPLICATES defaults to false, so
> the default behavior
> condor_config.local:DAGMAN_ABORT_DUPLICATES = True
It doesn't look like there is anything simple. Here is a snapshot of
the same submit machine (now running 7.2.4) that proves that all
currently running dags have their dagman executables spooled by condor
and that the spooled bits are the currently installed version of
condor_dagman:
[root@ldas-pcdev1 log]# md5sum /usr/bin/condor_dagman
4cc487c4de8a999a23d658341682fc05 /usr/bin/condor_dagman
[root@ldas-pcdev1 log]# ps -ef | grep -v awk | awk '/-Dag/ {print $2}'
| while read pid; do echo -n "$pid : "; lsof -p $pid | awk '$4 ~ /txt/
{print $NF}' | xargs -n1 md5sum; done
1054 : 4cc487c4de8a999a23d658341682fc05 /usr1/condor/spool/
cluster17805811.ickpt.subproc0
2231 : 4cc487c4de8a999a23d658341682fc05 /usr1/condor/spool/
cluster17805812.ickpt.subproc0
2773 : 4cc487c4de8a999a23d658341682fc05 /usr1/condor/spool/
cluster17805813.ickpt.subproc0
2930 : 4cc487c4de8a999a23d658341682fc05 /usr1/condor/spool/
cluster17805814.ickpt.subproc0
3292 : 4cc487c4de8a999a23d658341682fc05 /usr1/condor/spool/
cluster17822694.ickpt.subproc0
3749 : 4cc487c4de8a999a23d658341682fc05 /usr1/condor/spool/
cluster17805810.ickpt.subproc0
3750 : 4cc487c4de8a999a23d658341682fc05 /usr1/condor/spool/
cluster17805807.ickpt.subproc0
3754 : 4cc487c4de8a999a23d658341682fc05 /usr1/condor/spool/
cluster17805808.ickpt.subproc0
3755 : 4cc487c4de8a999a23d658341682fc05 /usr1/condor/spool/
cluster17805809.ickpt.subproc0
4254 : 4cc487c4de8a999a23d658341682fc05 /usr1/condor/spool/
cluster18703748.ickpt.subproc0
4911 : 4cc487c4de8a999a23d658341682fc05 /usr1/condor/spool/
cluster17822841.ickpt.subproc0
6652 : 4cc487c4de8a999a23d658341682fc05 /usr1/condor/spool/
cluster17822696.ickpt.subproc0
6839 : 4cc487c4de8a999a23d658341682fc05 /usr1/condor/spool/
cluster17822723.ickpt.subproc0
10446 : 4cc487c4de8a999a23d658341682fc05 /usr1/condor/spool/
cluster17822758.ickpt.subproc0
10468 : 4cc487c4de8a999a23d658341682fc05 /usr1/condor/spool/
cluster17805805.ickpt.subproc0
11389 : 4cc487c4de8a999a23d658341682fc05 /usr1/condor/spool/
cluster17805806.ickpt.subproc0
21376 : 4cc487c4de8a999a23d658341682fc05 /usr1/condor/spool/
cluster18551911.ickpt.subproc0
23690 : 4cc487c4de8a999a23d658341682fc05 /usr1/condor/spool/
cluster17822771.ickpt.subproc0
23844 : 4cc487c4de8a999a23d658341682fc05 /usr1/condor/spool/
cluster17822801.ickpt.subproc0
24443 : 4cc487c4de8a999a23d658341682fc05 /usr1/condor/spool/
cluster17822708.ickpt.subproc0
25227 : 4cc487c4de8a999a23d658341682fc05 /usr1/condor/spool/
cluster18687693.ickpt.subproc0
25547 : 4cc487c4de8a999a23d658341682fc05 /usr1/condor/spool/
cluster18704048.ickpt.subproc0
26257 : 4cc487c4de8a999a23d658341682fc05 /usr1/condor/spool/
cluster18696428.ickpt.subproc0
27806 : 4cc487c4de8a999a23d658341682fc05 /usr1/condor/spool/
cluster17822814.ickpt.subproc0
27938 : 4cc487c4de8a999a23d658341682fc05 /usr1/condor/spool/
cluster17822827.ickpt.subproc0
28819 : 4cc487c4de8a999a23d658341682fc05 /usr1/condor/spool/
cluster17822691.ickpt.subproc0
28995 : 4cc487c4de8a999a23d658341682fc05 /usr1/condor/spool/
cluster17821083.ickpt.subproc0
30794 : 4cc487c4de8a999a23d658341682fc05 /usr1/condor/spool/
cluster17805804.ickpt.subproc0
32077 : 4cc487c4de8a999a23d658341682fc05 /usr1/condor/spool/
cluster17805496.ickpt.subproc0
Kipp,
Does the version mismatch string perhaps come from your manually
running condor_submit_dag after the crash?
Thanks.
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date mail was appended: Mon Jul 13 22:23:47 2009 (1247541828)