LIGO Support Ticket 19259
Ticket Information
Number: admin 19259
User: dabrown@physics.syr.edu
Email:
Status: resolved
Assigned To: wenger
From: Duncan Brown <dabrown__AT__physics.syr.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
Subject: LIGO: dagman upgrade problem
Date: Wed, 29 Apr 2009 21:10:09 -0400
Hi Kent,
I just put all my condor jobs on hold and upgraded my condor install
from 7.2.0 (with pre7.2.1 dagman binaries) to 7.2.2 and all my users'
dags failed with the error below. Any idea how I can prevent this
error in future?
Cheers,
Duncan.
4/29 18:20:40 ******************************************************
4/29 18:20:40 ** condor_scheduniv_exec.2828574.0 (CONDOR_DAGMAN) STARTING UP
4/29 18:20:40 ** /usr/bin/condor_dagman
4/29 18:20:40 ** SubsystemInfo: name=DAGMAN type=DAEMON(10) class=DAEMON(1)
4/29 18:20:40 ** Configuration: subsystem:DAGMAN local:<NONE> class:DAEMON
4/29 18:20:40 ** $CondorVersion: 7.2.2 Apr 9 2009 BuildID: 145189 $
4/29 18:20:40 ** $CondorPlatform: X86_64-LINUX_RHEL5 $
4/29 18:20:40 ** PID = 22705
4/29 18:20:40 ** Log last touched 4/29 16:56:40
4/29 18:20:40 ******************************************************
4/29 18:20:40 Using config source: /usr1/condor/condor_config
4/29 18:20:40 Using local config sources:
4/29 18:20:40 /usr1/condor/condor_config.local
4/29 18:20:40 DaemonCore: Command Socket at <10.20.1.23:44829>
4/29 18:20:40 DAGMAN_DEBUG_CACHE_SIZE setting: 5242880
4/29 18:20:40 DAGMAN_DEBUG_CACHE_ENABLE setting: False
4/29 18:20:40 DAGMAN_SUBMIT_DELAY setting: 0
4/29 18:20:40 DAGMAN_MAX_SUBMIT_ATTEMPTS setting: 6
4/29 18:20:40 DAGMAN_STARTUP_CYCLE_DETECT setting: 0
4/29 18:20:40 DAGMAN_MAX_SUBMITS_PER_INTERVAL setting: 5
4/29 18:20:40 allow_events (DAGMAN_IGNORE_DUPLICATE_JOB_EXECUTION, DAGMAN_ALLOW_EVENTS) setting: 114
4/29 18:20:41 DAGMAN_RETRY_SUBMIT_FIRST setting: 1
4/29 18:20:41 DAGMAN_RETRY_NODE_FIRST setting: 0
4/29 18:20:41 DAGMAN_MAX_JOBS_IDLE setting: 50
4/29 18:20:41 DAGMAN_MAX_JOBS_SUBMITTED setting: 1400
4/29 18:20:41 DAGMAN_MUNGE_NODE_NAMES setting: 1
4/29 18:20:41 DAGMAN_DELETE_OLD_LOGS setting: 1
4/29 18:20:41 DAGMAN_PROHIBIT_MULTI_JOBS setting: 1
4/29 18:20:41 DAGMAN_SUBMIT_DEPTH_FIRST setting: 1
4/29 18:20:41 DAGMAN_ABORT_DUPLICATES setting: 1
4/29 18:20:41 DAGMAN_ABORT_ON_SCARY_SUBMIT setting: 1
4/29 18:20:41 DAGMAN_PENDING_REPORT_INTERVAL setting: 600
4/29 18:20:41 DAGMAN_AUTO_RESCUE setting: 1
4/29 18:20:41 DAGMAN_MAX_RESCUE_NUM setting: 100
4/29 18:20:41 argv[0] == "condor_scheduniv_exec.2828574.0"
4/29 18:20:41 argv[1] == "-Debug"
4/29 18:20:41 argv[2] == "3"
4/29 18:20:41 argv[3] == "-Lockfile"
4/29 18:20:41 argv[4] == "inspiral_hipe_full_data.FULL_DATA.dag.lock"
4/29 18:20:41 argv[5] == "-AutoRescue"
4/29 18:20:41 argv[6] == "1"
4/29 18:20:41 argv[7] == "-DoRescueFrom"
4/29 18:20:41 argv[8] == "0"
4/29 18:20:41 argv[9] == "-Dag"
4/29 18:20:41 argv[10] == "inspiral_hipe_full_data.FULL_DATA.dag"
4/29 18:20:41 argv[11] == "-CsdVersion"
4/29 18:20:41 argv[12] == "$CondorVersion: 7.2.1 Jan 8 2009 BuildID: 124562 PRE-RELEASE-UWCS $"
4/29 18:20:41 Version mismatch: condor_submit_dag ($CondorVersion: 7.2.1 Jan 8 2009 BuildID: 124562 PRE-RELEASE-UWCS $) vs. condor_dagman ($CondorVersion: 7.2.2 Apr 9 2009 BuildID: 145189 $)
4/29 18:20:41 **** condor_scheduniv_exec.2828574.0 (condor_DAGMAN) pid 22705 EXITING WITH STATUS 1
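
For anyone checking many dagman.out files for this failure, here is a minimal Python sketch that pulls the two version strings out of the "Version mismatch" message above. The function name is hypothetical and this is not a Condor tool. It assumes the log text has already been read into a string, and it collapses whitespace so that copies of the log wrapped by a mail client still match.

```python
import re

# Matches a version string such as
# "$CondorVersion: 7.2.2 Apr 9 2009 BuildID: 145189 $".
VERSION_RE = r"\$CondorVersion:[^$]*\$"


def find_version_mismatch(log_text):
    """Return (submit_dag_version, dagman_version) taken from a
    'Version mismatch' message, or None if the text has no such message."""
    # Collapse all runs of whitespace (including newlines) to single spaces.
    flat = " ".join(log_text.split())
    match = re.search(
        r"Version mismatch: condor_submit_dag \((" + VERSION_RE + r")\) "
        r"vs\. condor_dagman \((" + VERSION_RE + r")\)",
        flat,
    )
    return (match.group(1), match.group(2)) if match else None
```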
--
Duncan Brown Room 263-1, Department of Physics,
Assistant Professor of Physics Syracuse University, NY 13244, USA
Phone: (315) 443 5993 http://www.gravity.phy.syr.edu/~duncan
===========================================================================
Date of creation: Wed Apr 29 20:10:21 2009 (1241053824)
Subject: Actions
Assigned to psilord by gthain
===========================================================================
Date of actions: Wed Apr 29 21:30:50 2009 (1241058650)
Date: Thu, 30 Apr 2009 01:41:34 -0500
From: Peter Keller <psilord__AT__cs.wisc.edu>
To: gthain <condor-admin__AT__cs.wisc.edu>
Subject: Re: [condor-admin #19259] LIGO: dagman upgrade problem
Hello,
> From: Duncan Brown <dabrown__AT__physics.syr.edu>
>
> Hi Kent,
>
> I just put all my condor jobs on hold and upgraded my condor install
> from 7.2.0 (with pre7.2.1 dagman binaries) to 7.2.2 and all my users'
> dags failed with the error below. Any idea how I can prevent this
> error in future?
You must always upgrade condor_submit_dag in addition to condor_dagman.
Those two programs should always match in version.
-pete
===========================================================================
Date mail was appended: Thu Apr 30 1:41:36 2009 (1241073697)
From: Duncan Brown <dabrown__AT__physics.syr.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #19259] LIGO: dagman upgrade problem
Date: Thu, 30 Apr 2009 11:46:19 -0400
Hi Pete,
On Apr 30, 2009, at 2:41 AM, condor-admin response tracking system
wrote:
>> I just put all my condor jobs on hold and upgraded my condor install
>> from 7.2.0 (with pre7.2.1 dagman binaries) to 7.2.2 and all my users'
>> dags failed with the error below. Any idea how I can prevent this
>> error in future?
>
> You must always upgrade condor_submit_dag in addition to
> condor_dagman.
> Those two programs should always match in version.
I did upgrade them together. Here's what I did:
- condor_hold all running jobs
- wait for everything to be evicted from the cluster and schedd
- condor_off -all
- turned off the condor master on all nodes and head nodes
- installed the new 7.2.2 RPM on all nodes and head nodes
- turned condor back on
- checked that condor was back up cleanly
- released all held jobs
The regular condor jobs resumed, but the running dagman processes
(including top-level and sub-dags) all exited with the error in my
original message.
I cd-ed into a user's directory and ran condor_submit on their
mydag.dag.condor.sub, but this failed immediately with the same error
(as expected, since this is effectively what condor_release on a
dagman process does).
I then tried to condor_submit_dag the original dag, but
condor_submit_dag complained that the dag lock file existed and that
there may already be another dagman process running this dag (there
wasn't, obviously). I was reluctant to use -f, since that would delete
all the work done so far.
I was able to recover the dags by hand-editing the user's top-level
mydag.dag.condor.sub and changing the version string to match the new
version of the dagman binaries. I could then condor_submit the
mydag.dag.condor.sub, which re-created all the subdag.dag.condor.sub
files with the right version string.
This is OK for a couple of dags (my case) but it may be painful for Stuart.
I see a couple of solutions:
- insist that admins condor_rm dags before an upgrade and inform users
that they must re-submit any rescue dags afterwards by hand.
- make condor_dagman less picky so that, rather than demanding an exact
match between the version strings, the 7.2.2 condor_dagman can say:
OK, I can run a 7.2.1 generated sub file, but I'll balk at, e.g., a
6.0.0 generated file.
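
To make the second suggestion concrete, here is a minimal Python sketch of a less strict check. It is only an illustration of one possible policy, not how condor_dagman actually compares versions. The function names and the rule itself are assumptions: accept when the major and minor numbers match and the sub file is not newer than the running condor_dagman.

```python
import re


def parse_version(version_string):
    """Extract (major, minor, patch) from a '$CondorVersion: X.Y.Z ... $' string."""
    match = re.search(r"\$CondorVersion:\s*(\d+)\.(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError("not a Condor version string: %r" % (version_string,))
    return tuple(int(part) for part in match.groups())


def versions_compatible(submit_dag_version, dagman_version):
    """Hypothetical policy: the major and minor numbers must match, and the
    version that wrote the sub file must not be newer than condor_dagman."""
    csd = parse_version(submit_dag_version)
    dagman = parse_version(dagman_version)
    return csd[:2] == dagman[:2] and csd <= dagman
```

Under this rule a sub file generated by 7.2.1 would be accepted by a 7.2.2 condor_dagman, while one generated by 6.0.0 would still be rejected.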
Of course, my upgrade procedure may be screwy, so there could be other
ways to do things better.
Cheers,
Duncan.
--
Duncan Brown Room 263-1, Department of Physics,
Assistant Professor of Physics Syracuse University, NY 13244, USA
Phone: (315) 443 5993 http://www.gravity.phy.syr.edu/~duncan
===========================================================================
Date mail was appended: Thu Apr 30 10:46:35 2009 (1241106396)
Date: Thu, 30 Apr 2009 14:39:53 -0500
From: Peter Keller <psilord__AT__cs.wisc.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
Subject: Re: [condor-admin #19259] LIGO: dagman upgrade problem
Hello,
I showed Kent this ticket, and he expressed enough interest in it, since he
implemented it, that I'm reassigning it to him. After you get this message, the
next message you send will go to him.
Thank you.
-pete
===========================================================================
Date mail was appended: Thu Apr 30 14:39:59 2009 (1241120399)
Subject: Actions
Assigned to wenger by psilord
===========================================================================
Date of actions: Thu Apr 30 14:54:02 2009 (1241121243)
Date: Fri, 15 May 2009 10:51:15 -0500 (CDT)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: psilord <condor-admin__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-admin #19259] LIGO: dagman upgrade problem
Duncan,
Sorry it took a long time for me to follow up on this -- the non-Condor
part of my job has had some fires recently...
> I just put all my condor jobs on hold and upgraded my condor install
> from 7.2.0 (with pre7.2.1 dagman binaries) to 7.2.2 and all my users'
> dags failed with the error below. Any idea how I can prevent this
> error in future?
Do you know if you had DAGMAN_COPY_TO_SPOOL set to true in your
configuration? Also, depending on exactly where in the 7.2.1 series your
binaries came from, that might not have taken effect. You can tell
if you have some of the .condor.sub files hanging around -- if
copy_to_spool is true in the .condor.sub file, then our fix didn't work.
If copy_to_spool is false in the .condor.sub file, either your
condor_submit_dag binaries (pre-7.2.1) were just too old, or else you
didn't have DAGMAN_COPY_TO_SPOOL set to true in your configuration.
If you have DAGMAN_COPY_TO_SPOOL set to true, things *should* work from
now on, because the .condor.sub files generated by the 7.2.2
condor_submit_dag should have copy_to_spool = true, so then when you
upgrade to 7.2.3 (or whatever) the DAGs should continue with the same
DAGMan binaries they started with, and the versions will match between the
.condor.sub files and the binaries.
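
If there are many .condor.sub files to check, something like the following Python sketch could report the copy_to_spool value in each one. It is a rough illustration, not a Condor utility. The function name is hypothetical, and it assumes the setting appears as a simple "copy_to_spool = <value>" line, keeps the last such line it sees, and treats anything other than "true" as false.

```python
def copy_to_spool_setting(sub_file_text):
    """Return True or False for the copy_to_spool value found in the text of
    a .condor.sub file, or None if the file does not set it at all."""
    value = None
    for line in sub_file_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, rest = line.partition("=")
        if sep and key.strip().lower() == "copy_to_spool":
            # Keep the last assignment seen in the file.
            value = rest.strip().lower() == "true"
    return value
```

A result of None or False for files written by the old binaries would correspond to the second case described above.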
Kent Wenger
Condor Team
===========================================================================
Date mail was appended: Fri May 15 10:51:20 2009 (1242402681)
Subject: Actions
Ticket resolved by wenger
===========================================================================
Date of actions: Fri May 22 15:18:41 2009 (1243023521)