LIGO Support Ticket 17226
Ticket Information
Number: admin 17226
User: anderson@ligo.caltech.edu
Email: skoranda__AT__gravity.phys.uwm.edu
Status: resolved
Assigned To: gquinn
Date: Sat, 17 Nov 2007 11:21:11 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
CC: Scott Koranda <skoranda__AT__gravity.phys.uwm.edu>
Subject: LIGO: shadow assertionn error in pseudo_ops.C
X-Seen-BY: mailfromd 4.1 silica.cs.wisc.edu
The LIGO Condor pool at Caltech running,
# condor_version
$CondorVersion: 6.9.4 Aug 30 2007 $
$CondorPlatform: X86_64-LINUX_RHEL3 $
plus a 6.9.5 shadow pre-release,
# ident /usr/sbin/condor_shadow
/usr/sbin/condor_shadow:
$CondorVersion: 6.9.5 Nov 12 2007 BuildID: 63496 PRE-RELEASE-UWCS $
$CondorPlatform: X86_64-LINUX_RHEL3 $
$Id: kdb5_err.et 13854 2001-10-25 20:20:57Z tlyu $
$Id: krb5_err.et 16816 2004-10-13 16:18:27Z lxs $
$Id: accept_sec_context.c,v 1.30.4.2 2007/05/17 15:38:32 bester Exp $
$Id: acquire_cred.c,v 1.12.4.1 2007/05/17 15:38:32 bester Exp $
$Id: compare_name.c,v 1.22.4.2 2005/07/13 20:17:52 mlink Exp $
$Id: delete_sec_context.c,v 1.13 2005/04/15 23:37:16 meder Exp $
$Id: display_name.c,v 1.10 2005/04/15 23:37:16 meder Exp $
$Id: display_status.c,v 1.19 2005/04/15 23:37:16 meder Exp $
$Id: import_name.c,v 1.15 2005/04/15 23:37:18 meder Exp $
$Id: init_sec_context.c,v 1.31.4.4 2007/05/17 15:38:32 bester Exp $
$Id: inquire_cred.c,v 1.10 2005/04/15 23:37:19 meder Exp $
$Id: inquire_context.c,v 1.11 2005/04/15 23:37:19 meder Exp $
$Id: oid_functions.c,v 1.13 2005/04/15 23:37:19 meder Exp $
$Id: release_cred.c,v 1.5 2005/04/15 23:37:20 meder Exp $
$Id: release_name.c,v 1.6 2005/04/15 23:37:20 meder Exp $
$Id: unwrap.c,v 1.17 2005/04/15 23:37:20 meder Exp $
$Id: verify_mic.c,v 1.12 2005/04/15 23:37:21 meder Exp $
$Id: wrap.c,v 1.12 2005/04/15 23:37:21 meder Exp $
$Id: release_buffer.c,v 1.3 2005/04/15 23:37:20 meder Exp $
$Id: globus_i_gsi_gss_utils.c,v 1.38.4.1 2005/05/04 00:19:37 meder Exp $
$Id: import_cred.c,v 1.19.4.1 2007/05/17 15:38:32 bester Exp $
$Id: get_mic.c,v 1.7 2005/04/15 23:37:17 meder Exp $
$GCBVersion: 1.5.0 $
$GCBBuildDate: Nov 6 2007 $
currently has condor_shadow generating a large number of assertion errors
in pseudo_ops.C. In the Month prior to upgrading the condor_shadow
binary there where only 8 such errors all grouped within a few seconds:
# grep pseudo_ops.C ShadowLog.old
11/10 02:00:40 (20585223.0) (24928): ERROR "Error from starter on slot4__AT__node275.ldas-cit.ligo.caltech.edu: Assertion ERROR on (result)" at line 644 in file pseudo_ops.C
11/10 02:00:40 (20585225.0) (24932): ERROR "Error from starter on slot1__AT__node281.ldas-cit.ligo.caltech.edu: Assertion ERROR on (result)" at line 644 in file pseudo_ops.C
11/10 02:00:40 (20585222.0) (24929): ERROR "Error from starter on slot4__AT__node120.ldas-cit.ligo.caltech.edu: Assertion ERROR on (result)" at line 644 in file pseudo_ops.C
11/10 02:00:41 (20585224.0) (24935): ERROR "Error from starter on slot3__AT__node238.ldas-cit.ligo.caltech.edu: Assertion ERROR on (result)" at line 644 in file pseudo_ops.C
11/10 02:00:42 (20585241.0) (25193): ERROR "Error from starter on slot3__AT__node170.ldas-cit.ligo.caltech.edu: Assertion ERROR on (result)" at line 644 in file pseudo_ops.C
11/10 02:00:43 (20585246.0) (25189): ERROR "Error from starter on slot2__AT__node101.ldas-cit.ligo.caltech.edu: Assertion ERROR on (result)" at line 644 in file pseudo_ops.C
11/10 02:00:43 (20585233.0) (25190): ERROR "Error from starter on slot4__AT__node38.ldas-cit.ligo.caltech.edu: Assertion ERROR on (result)" at line 644 in file pseudo_ops.C
11/10 02:00:43 (20585244.0) (25196): ERROR "Error from starter on slot1__AT__node204.ldas-cit.ligo.caltech.edu: Assertion ERROR on (result)" at line 644 in file pseudo_ops.C
However, in the last two days while running with the 6.9.5 prerelease shadow
there have have been 1627 such errors, at the following line number,
11/17 10:54:36 (20821660.0) (11142): ERROR "Error from starter on slot2__AT__node253.ldas-cit.ligo.caltech.edu: Assertion ERROR on (result)" at line 649 in file pseudo_ops.C
Note, the 6.9.5 pre-release was installed 4 days ago, so the increase
in these error was does not necessarily caused by the new shadow binary
but may be the result of a different mix of jobs running in the pool.
Here is the relevant section of the ShadowLog for the most recent occurance
(job 20821660.0):
11/17 10:42:28 ******************************************************
11/17 10:42:28 ** condor_shadow (CONDOR_SHADOW) STARTING UP
11/17 10:42:28 ** /usr/sbin/condor_shadow
11/17 10:42:28 ** $CondorVersion: 6.9.5 Nov 12 2007 BuildID: 63496 PRE-RELEASE-UWCS $
11/17 10:42:28 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
11/17 10:42:28 ** PID = 11142
11/17 10:42:28 ** Log last touched 11/17 10:42:28
11/17 10:42:28 ******************************************************
11/17 10:42:28 Using config source: /usr1/condor/condor_config
11/17 10:42:28 Using local config sources:
11/17 10:42:28 /usr1/condor/condor_config.local
11/17 10:42:28 DaemonCore: Command Socket at <10.14.0.12:53783>
11/17 10:42:28 Initializing a VANILLA shadow for job 20821660.0
11/17 10:42:28 (20821660.0) (11142): Request to run on <10.14.2.3:60557> was ACCEPTED
11/17 10:42:28 ******************************************************
11/17 10:42:28 ** condor_shadow (CONDOR_SHADOW) STARTING UP
11/17 10:42:28 ** /usr/sbin/condor_shadow
11/17 10:42:28 ** $CondorVersion: 6.9.5 Nov 12 2007 BuildID: 63496 PRE-RELEASE-UWCS $
11/17 10:42:28 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
11/17 10:42:28 ** PID = 11144
11/17 10:42:28 ** Log last touched 11/17 10:42:28
11/17 10:42:28 ******************************************************
11/17 10:42:28 Using config source: /usr1/condor/condor_config
11/17 10:42:28 Using local config sources:
11/17 10:42:28 /usr1/condor/condor_config.local
11/17 10:42:28 DaemonCore: Command Socket at <10.14.0.12:59321>
11/17 10:42:28 Initializing a VANILLA shadow for job 20821665.0
11/17 10:42:28 (20821665.0) (11144): Request to run on <10.14.1.54:47061> was ACCEPTED
11/17 10:42:28 ******************************************************
...
11/17 10:54:36 (20727589.0) (13090):My_UID_Domain = "ligo"
11/17 10:54:36 (20727589.0) (13090): Entering pseudo_get_file_stream
11/17 10:54:36 (20727589.0) (13090): file = "/usr1/condor/spool/cluster20727589.ickpt.subproc0"
11/17 10:54:36 (20718426.0) (13088):Reaped child status - pid 13089 exited with status 0
11/17 10:54:36 (20727589.0) (13090):Reaped child status - pid 13091 exited with status 0
11/17 10:54:36 (20821660.0) (11142): ERROR "Error from starter on slot2__AT__node253.ldas-cit.ligo.caltech.edu: Assertion ERROR on (
result)" at line 649 in file pseudo_ops.C
11/17 10:54:36 (20820673.0) (2262): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 100
11/17 10:54:36 (20820666.0) (2260): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 100
11/17 10:54:36 (20820688.0) (2258): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 100
11/17 10:54:36 (20820684.0) (2530): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 100
11/17 10:54:36 (?.?) (13092):******* Standard Shadow starting up *******
11/17 10:54:36 (?.?) (13092):** $CondorVersion: 6.9.4 Aug 30 2007 $
11/17 10:54:36 (?.?) (13092):** $CondorPlatform: X86_64-LINUX_RHEL3 $
11/17 10:54:36 (?.?) (13092):*******************************************
11/17 10:54:36 (?.?) (13092):uid=0, euid=4135, gid=0, egid=4135
...
#
# This job was then restarted by the same slot on the same execute
# machine and is now actively running.
#
11/17 10:54:37 Using config source: /usr1/condor/condor_config
11/17 10:54:37 Using local config sources:
11/17 10:54:37 /usr1/condor/condor_config.local
11/17 10:54:37 DaemonCore: Command Socket at <10.14.0.12:53017>
11/17 10:54:37 Initializing a VANILLA shadow for job 20821660.0
11/17 10:54:37 (20821660.0) (13097): Request to run on <10.14.2.3:60557> was ACCEPTED
11/17 10:54:45 (20821751.0) (12641):Shadow: Job 20821751.0 exited, termsig = 0, coredump = 0, retcode = 0
11/17 10:54:45 (20821751.0) (12641):Shadow: Job exited normally with status 0
11/17 10:54:45 (20821751.0) (12641):user_time = 0 ticks
#
# Here is the Starter log from slot2@node253 that shows both attempts
# to start this job.
#
11/17 10:42:28 Communicating with shadow <10.14.0.12:53783>
11/17 10:42:28 Submitting machine is "ldas-grid.ligo.caltech.edu"
11/17 10:42:28 setting the orig job name in starter
11/17 10:42:28 setting the orig job iwd in starter
11/17 10:42:28 Job 20821660.0 set to execute immediately
11/17 10:42:28 Starting a VANILLA universe job with ID: 20821660.0
11/17 10:42:28 IWD: /mnt/qfs1/jclark/ExLodsPackage/condor
11/17 10:42:28 Output file: /archive/home/jclark/logs/mExLods-20821660-0.out
11/17 10:42:28 Error file: /archive/home/jclark/logs/mExLods-20821660-0.err
11/17 10:47:28 condor_read(): timeout reading 5 bytes from <10.14.0.12:54438>.
11/17 10:47:28 IO: Failed to read packet header
11/17 10:47:28 ERROR "Assertion ERROR on (result)" at line 322 in file NTsenders.C
11/17 10:47:28 ShutdownFast all jobs.
11/17 10:54:37 ******************************************************
11/17 10:54:37 ** condor_starter (CONDOR_STARTER) STARTING UP
11/17 10:54:37 ** /usr/sbin/condor_starter
11/17 10:54:37 ** $CondorVersion: 6.9.4 Aug 30 2007 $
11/17 10:54:37 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
11/17 10:54:37 ** PID = 13457
11/17 10:54:37 ** Log last touched 11/17 10:47:28
11/17 10:54:37 ******************************************************
11/17 10:54:37 Using config source: /usr1/condor/condor_config
11/17 10:54:37 Using local config sources:
11/17 10:54:37 /usr1/condor/condor_config.local
11/17 10:54:37 DaemonCore: Command Socket at <10.14.2.3:55562>
11/17 10:54:37 Done setting resource limits
11/17 10:54:37 Communicating with shadow <10.14.0.12:53017>
11/17 10:54:37 Submitting machine is "ldas-grid.ligo.caltech.edu"
11/17 10:54:37 setting the orig job name in starter
11/17 10:54:37 setting the orig job iwd in starter
11/17 10:54:37 Job 20821660.0 set to execute immediately
11/17 10:54:37 Starting a VANILLA universe job with ID: 20821660.0
11/17 10:54:37 IWD: /mnt/qfs1/jclark/ExLodsPackage/condor
11/17 10:54:37 Output file: /archive/home/jclark/logs/mExLods-20821660-0.out
11/17 10:54:37 Error file: /archive/home/jclark/logs/mExLods-20821660-0.err
11/17 10:55:32 Renice expr "0" evaluated to 0
11/17 10:55:32 About to exec /archive/home/jclark/ExLodsPackage/bin/runmExLods.sh /archive/home/jclark/VELA/data/onsource/H1_si
m_onsource.mat /archive/home/jclark/VELA/data/onsource/LODSout/H1_sim 2892 2897 1.0 1551 1600
11/17 10:55:32 Create_Process succeeded, pid=13500
What does this assertion error indciate and what can be done to avoid
the apparent need to attempt to start jobs multiple times?
Thanks.
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date of creation: Sat Nov 17 13:21:30 2007 (1195327293)
Subject: Actions
Assigned to gquinn by roy
===========================================================================
Date of actions: Mon Nov 19 10:45:57 2007 (1195490758)
Subject: Re: [condor-admin #17226] LIGO: shadow assertionn error in
pseudo_ops.C
From: Greg Quinn <gquinn__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Date: Mon, 19 Nov 2007 15:13:56 -0600
> However, in the last two days while running with the 6.9.5 prerelease shadow
> there have have been 1627 such errors, at the following line number,
>
> 11/17 10:54:36 (20821660.0) (11142): ERROR "Error from starter on slot2__AT__node253.ldas-cit.ligo.caltech.edu: Assertion ERROR on (result)" at line 649 in file pseudo_ops.C
>
> Note, the 6.9.5 pre-release was installed 4 days ago, so the increase
> in these error was does not necessarily caused by the new shadow binary
> but may be the result of a different mix of jobs running in the pool.
My initial guess was that these errors are a result of the new Shadow
binary, and some incompatibility between it and the 6.9.4 Starter. But
looking through the changes that have been made to the code since 6.9.4,
it is hard to justify this explanation.
It is also mighty fishy that you sometimes get this error and sometimes
don't.
I am going to proceed by trying to setup a small 6.9.4 pool here and
seeing if using a 6.9.5 Shadow results in the same problem for me.
On your side, it would be useful if you could add D_SYSCALLS to your
SHADOW_DEBUG setting and show me the log from when this failure occurs.
I could also provide you with a 6.9.5 Starter binary to see if it really
is a version incompatibility, but upgrading all your execute machines
may be a chore. I'm also very interested in finding out what's really
going on here so I'd like to hold off on this option for now.
Greg Quinn
Condor Team
===========================================================================
Date mail was appended: Mon Nov 19 15:18:50 2007 (1195507130)
Date: Sat, 24 Nov 2007 17:07:12 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: skoranda__AT__gravity.phys.uwm.edu
Subject: Re: [condor-admin #17226] LIGO: shadow assertionn error in
pseudo_ops.C
X-Seen-BY: mailfromd 4.1 silica.cs.wisc.edu
On Mon, Nov 19, 2007 at 03:18:50PM -0600, condor-admin response tracking system wrote:
>
> On your side, it would be useful if you could add D_SYSCALLS to your
> SHADOW_DEBUG setting and show me the log from when this failure occurs.
>
This has only happened once more since reporting the problem. Here is the
more detailed ShadowLog entries for job 21057423.0 with D_SYSCALLS enabled.
Note, this time corresponds to restarting the lock manager on the NFS server
to increase the number of conccurent locks it will handle.
11/24 11:21:51 ******************************************************
11/24 11:21:51 ** condor_shadow (CONDOR_SHADOW) STARTING UP
11/24 11:21:51 ** /usr/sbin/condor_shadow
11/24 11:21:51 ** $CondorVersion: 6.9.5 Nov 12 2007 BuildID: 63496 PRE-RELEASE-UWCS $
11/24 11:21:51 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
11/24 11:21:51 ** PID = 19534
11/24 11:21:51 ** Log last touched 11/24 11:21:51
11/24 11:21:51 ******************************************************
11/24 11:21:51 Using config source: /usr1/condor/condor_config
11/24 11:21:51 Using local config sources:
11/24 11:21:51 /usr1/condor/condor_config.local
11/24 11:21:51 DaemonCore: Command Socket at <10.14.0.12:57023>
11/24 11:21:51 (21057419.0) (19515): ZKM: setting default map to (null)
11/24 11:21:51 Initializing a VANILLA shadow for job 21057423.0
11/24 11:21:51 (21056845.0) (19533): About to decode condor_sysnum
11/24 11:21:51 (21056845.0) (19533): Got request for syscall get_job_info (-63)
11/24 11:21:51 (21056845.0) (19533): rval = 0, errno = 0
11/24 11:21:51 (21056845.0) (19533): About to decode condor_sysnum
11/24 11:21:51 (21056845.0) (19533): Got request for syscall register_starter_info (-77)
11/24 11:21:51 (21056845.0) (19533): StarterIpAddr = <10.14.1.182:42461>
11/24 11:21:51 (21056845.0) (19533): UidDomain = ligo
11/24 11:21:51 (21056845.0) (19533): FileSystemDomain = ligo
11/24 11:21:51 (21056845.0) (19533): Machine = slot2__AT__node182.ldas-cit.ligo.caltech.edu
11/24 11:21:51 (21056845.0) (19533): Arch = X86_64
11/24 11:21:51 (21056845.0) (19533): OpSys = LINUX
11/24 11:21:51 (21056845.0) (19533): CondorVersion = $CondorVersion: 6.9.4 Aug 30 2007 $
11/24 11:21:51 (21056845.0) (19533): HasReconnect = TRUE
11/24 11:21:51 (21056845.0) (19533): rval = 0, errno = 0
11/24 11:21:51 (21056845.0) (19533): About to decode condor_sysnum
11/24 11:21:51 (21056845.0) (19533): Got request for syscall begin_execution (-78)
11/24 11:21:51 (21057419.0) (19515): Job 21057419.0 terminated: exited with status 11
11/24 11:21:51 (21057423.0) (19534): Request to run on <10.14.1.13:48464> was ACCEPTED
11/24 11:21:51 (21057423.0) (19534): About to decode condor_sysnum
11/24 11:21:51 (21057423.0) (19534): Got request for syscall get_job_info (-63)
11/24 11:21:51 (21057423.0) (19534): rval = 0, errno = 0
11/24 11:21:51 (21057423.0) (19534): About to decode condor_sysnum
11/24 11:21:51 (21057423.0) (19534): Got request for syscall register_starter_info (-77)
11/24 11:21:51 (21057423.0) (19534): StarterIpAddr = <10.14.1.13:40351>
11/24 11:21:51 (21057423.0) (19534): UidDomain = ligo
11/24 11:21:51 (21057423.0) (19534): FileSystemDomain = ligo
11/24 11:21:51 (21057423.0) (19534): Machine = slot3__AT__node13.ldas-cit.ligo.caltech.edu
11/24 11:21:51 (21057423.0) (19534): Arch = X86_64
11/24 11:21:51 (21057423.0) (19534): OpSys = LINUX
11/24 11:21:51 (21057423.0) (19534): CondorVersion = $CondorVersion: 6.9.4 Aug 30 2007 $
11/24 11:21:51 (21057423.0) (19534): HasReconnect = TRUE
11/24 11:21:51 (21057423.0) (19534): rval = 0, errno = 0
11/24 11:21:51 (21056845.0) (19533): rval = 0, errno = 25
11/24 11:21:51 (21057423.0) (19534): About to decode condor_sysnum
11/24 11:21:51 (21057423.0) (19534): Got request for syscall begin_execution (-78)
11/24 11:21:51 (21057419.0) (19515): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 100
11/24 11:21:51 (21057420.0) (19516): About to decode condor_sysnum
11/24 11:21:51 (21057420.0) (19516): Got request for syscall job_exit (-65)
11/24 11:21:51 (21057420.0) (19516): in pseudo_job_exit: status=2816,reason=100
11/24 11:21:51 (21057420.0) (19516): rval = 0, errno = 25
11/24 11:21:51 (21057420.0) (19516): Shadow: do_REMOTE_syscall returned < 0
11/24 11:21:51 (21057421.0) (19518): About to decode condor_sysnum
11/24 11:21:51 (21057421.0) (19518): Got request for syscall job_exit (-65)
11/24 11:21:51 (21057421.0) (19518): in pseudo_job_exit: status=2816,reason=100
11/24 11:21:51 (21057421.0) (19518): rval = 0, errno = 25
11/24 11:21:51 (21057421.0) (19518): Shadow: do_REMOTE_syscall returned < 0
11/24 11:21:51 (21057420.0) (19516): ZKM: setting default map to (null)
11/24 11:21:52 (21057420.0) (19516): Job 21057420.0 terminated: exited with status 11
11/24 11:21:52 (21057421.0) (19518): ZKM: setting default map to (null)
11/24 11:21:52 (21057420.0) (19516): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 100
11/24 11:21:52 (21057422.0) (19520): About to decode condor_sysnum
11/24 11:21:52 (21057422.0) (19520): Got request for syscall job_exit (-65)
11/24 11:21:52 (21057422.0) (19520): in pseudo_job_exit: status=2816,reason=100
11/24 11:21:52 (21057422.0) (19520): rval = 0, errno = 25
11/24 11:21:52 (21057422.0) (19520): Shadow: do_REMOTE_syscall returned < 0
11/24 11:21:52 (21057421.0) (19518): Job 21057421.0 terminated: exited with status 11
11/24 11:21:52 (21057422.0) (19520): ZKM: setting default map to (null)
11/24 11:21:52 (21057421.0) (19518): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 100
11/24 11:21:52 (21057422.0) (19520): Job 21057422.0 terminated: exited with status 11
11/24 11:21:52 (21057422.0) (19520): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 100
11/24 11:21:53 ******************************************************
11/24 11:21:53 ** condor_shadow (CONDOR_SHADOW) STARTING UP
11/24 11:21:53 ** /usr/sbin/condor_shadow
11/24 11:21:53 ** $CondorVersion: 6.9.5 Nov 12 2007 BuildID: 63496 PRE-RELEASE-UWCS $
11/24 11:21:53 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
11/24 11:21:53 ** PID = 19536
11/24 11:21:53 ** Log last touched 11/24 11:21:52
11/24 11:21:53 ******************************************************
11/24 11:21:53 Using config source: /usr1/condor/condor_config
11/24 11:21:53 Using local config sources:
11/24 11:21:53 /usr1/condor/condor_config.local
11/24 11:21:53 DaemonCore: Command Socket at <10.14.0.12:38158>
11/24 11:21:53 Initializing a VANILLA shadow for job 21056842.0
...
1/24 11:27:25 (21002093.74) (2729): FileLock::obtain(1) failed - errno 37 (No locks available)
11/24 11:27:25 (21002093.82) (2728): FileLock::obtain(1) failed - errno 37 (No locks available)
11/24 11:27:25 (21002093.185) (2912): FileLock::obtain(1) failed - errno 37 (No locks available)
11/24 11:27:25 (21002093.21) (2725): FileLock::obtain(1) failed - errno 37 (No locks available)
11/24 11:27:25 (21002093.242) (2913): FileLock::obtain(1) failed - errno 37 (No locks available)
11/24 11:27:25 (21002093.111) (2724): FileLock::obtain(1) failed - errno 37 (No locks available)
11/24 11:27:25 (21002093.194) (2911): FileLock::obtain(1) failed - errno 37 (No locks available)
11/24 11:27:25 (21002093.50) (2727): condor_write(): send() 13 bytes to unknown source returned -1, timeout=300, errno=32 (Broken pipe). Assuming failure.
11/24 11:27:25 (21002093.50) (2727): Buf::write(): condor_write() failed
11/24 11:27:25 (21002093.50) (2727): ERROR "Assertion ERROR on (result)" at line 342 in file NTreceivers.C
11/24 11:27:26 (21002093.228) (2916): condor_write(): send() 13 bytes to unknown source returned -1, timeout=300, errno=32 (Broken pipe). Assuming failure.
11/24 11:27:26 (21002093.228) (2916): Buf::write(): condor_write() failed
11/24 11:27:26 (21002093.228) (2916): ERROR "Assertion ERROR on (result)" at line 342 in file NTreceivers.C
11/24 11:27:26 (21002093.194) (2911): condor_write(): send() 13 bytes to unknown source returned -1, timeout=300, errno=32 (Broken pipe). Assuming failure.
11/24 11:27:26 (21002093.194) (2911): Buf::write(): condor_write() failed
11/24 11:27:26 (21002093.194) (2911): ERROR "Assertion ERROR on (result)" at line 342 in file NTreceivers.C
11/24 11:27:26 (21002093.74) (2729): condor_write(): send() 13 bytes to unknown source returned -1, timeout=300, errno=32 (Broken pipe). Assuming failure.
11/24 11:27:26 (21002093.74) (2729): Buf::write(): condor_write() failed
11/24 11:27:26 (21002093.74) (2729): ERROR "Assertion ERROR on (result)" at line 342 in file NTreceivers.C
11/24 11:27:26 (21002093.111) (2724): condor_write(): send() 13 bytes to unknown source returned -1, timeout=300, errno=32 (Broken pipe). Assuming failure.
11/24 11:27:26 (21002093.111) (2724): Buf::write(): condor_write() failed
11/24 11:27:26 (21002093.111) (2724): ERROR "Assertion ERROR on (result)" at line 342 in file NTreceivers.C
11/24 11:27:26 (21057423.0) (19534): rval = 0, errno = 25
11/24 11:27:26 (21057423.0) (19534): About to decode condor_sysnum
11/24 11:27:26 (21057423.0) (19534): Got request for syscall ulog (284)
11/24 11:27:26 (21057669.0) (20584): rval = 0, errno = 25
11/24 11:27:26 (21002093.185) (2912): condor_write(): send() 13 bytes to unknown source returned -1, timeout=300, errno=32 (Broken pipe). Assuming failure.
11/24 11:27:26 (21002093.185) (2912): Buf::write(): condor_write() failed
11/24 11:27:26 (21002093.185) (2912): ERROR "Assertion ERROR on (result)" at line 342 in file NTreceivers.C
11/24 11:27:26 (21002093.82) (2728): condor_write(): send() 13 bytes to unknown source returned -1, timeout=300, errno=32 (Broken pipe). Assuming failure.
11/24 11:27:26 (21002093.220) (2914): condor_write(): send() 13 bytes to unknown source returned -1, timeout=300, errno=32 (Broken pipe). Assuming failure.
11/24 11:27:26 (21002093.220) (2914): Buf::write(): condor_write() failed
11/24 11:27:26 (21002093.220) (2914): ERROR "Assertion ERROR on (result)" at line 342 in file NTreceivers.C
11/24 11:27:26 (21002093.85) (2730): condor_write(): send() 13 bytes to unknown source returned -1, timeout=300, errno=32 (Broken pipe). Assuming failure.
11/24 11:27:26 (21002093.82) (2728): Buf::write(): condor_write() failed
11/24 11:27:26 (21002093.85) (2730): Buf::write(): condor_write() failed
11/24 11:27:26 (21002093.82) (2728): ERROR "Assertion ERROR on (result)" at line 342 in file NTreceivers.C
11/24 11:27:26 (21002093.85) (2730): ERROR "Assertion ERROR on (result)" at line 342 in file NTreceivers.C
11/24 11:27:26 (21002093.21) (2725): condor_write(): send() 13 bytes to unknown source returned -1, timeout=300, errno=32 (Broken pipe). Assuming failure.
11/24 11:27:26 (21002093.248) (2915): condor_write(): send() 13 bytes to unknown source returned -1, timeout=300, errno=32 (Broken pipe). Assuming failure.
11/24 11:27:26 (21002093.21) (2725): Buf::write(): condor_write() failed
11/24 11:27:26 (21002093.248) (2915): Buf::write(): condor_write() failed
11/24 11:27:26 (21002093.21) (2725): ERROR "Assertion ERROR on (result)" at line 342 in file NTreceivers.C
11/24 11:27:26 (21002093.248) (2915): ERROR "Assertion ERROR on (result)" at line 342 in file NTreceivers.C
11/24 11:27:26 (21002093.242) (2913): condor_write(): send() 13 bytes to unknown source returned -1, timeout=300, errno=32 (Broken pipe). Assuming failure.
11/24 11:27:26 (21002093.242) (2913): Buf::write(): condor_write() failed
11/24 11:27:26 (21002093.242) (2913): ERROR "Assertion ERROR on (result)" at line 342 in file NTreceivers.C
11/24 11:27:26 (21057423.0) (19534): ERROR "Error from starter on slot3__AT__node13.ldas-cit.ligo.caltech.edu: Assertion ERROR on (result)" at line 649 in file pseudo_ops.C
11/24 11:27:26 (21057694.0) (20666): About to decode condor_sysnum
11/24 11:27:26 (21057694.0) (20666): Got request for syscall job_exit (-65)
11/24 11:27:26 (21057694.0) (20666): in pseudo_job_exit: status=2816,reason=100
11/24 11:27:26 (21057694.0) (20666): rval = 0, errno = 25
11/24 11:27:26 (21057694.0) (20666): Shadow: do_REMOTE_syscall returned < 0
11/24 11:27:26 (21057694.0) (20666): ZKM: setting default map to (null)
11/24 11:27:26 (21057694.0) (20666): Job 21057694.0 terminated: exited with status 11
11/24 11:27:26 (21057694.0) (20666): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 100
11/24 11:27:26 (21057669.0) (20584): About to decode condor_sysnum
11/24 11:27:26 (21057669.0) (20584): Got request for syscall job_exit (-65)
11/24 11:27:26 (21057669.0) (20584): in pseudo_job_exit: status=0,reason=100
11/24 11:27:26 (21057669.0) (20584): rval = 0, errno = 25
11/24 11:27:26 (21057669.0) (20584): Shadow: do_REMOTE_syscall returned < 0
11/24 11:27:26 (21057669.0) (20584): ZKM: setting default map to (null)
11/24 11:27:26 (21057669.0) (20584): Job 21057669.0 terminated: exited with status 0
11/24 11:27:26 (21057669.0) (20584): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 100
11/24 11:27:26 (21057699.0) (20673): About to decode condor_sysnum
11/24 11:27:26 (21057699.0) (20673): Got request for syscall job_exit (-65)
11/24 11:27:26 (21057699.0) (20673): in pseudo_job_exit: status=2816,reason=100
11/24 11:27:26 (21057699.0) (20673): rval = 0, errno = 25
11/24 11:27:26 (21057699.0) (20673): Shadow: do_REMOTE_syscall returned < 0
...
11/24 11:27:28 Using config source: /usr1/condor/condor_config
11/24 11:27:28 Using local config sources:
11/24 11:27:28 /usr1/condor/condor_config.local
11/24 11:27:28 DaemonCore: Command Socket at <10.14.0.12:44785>
11/24 11:27:28 Initializing a VANILLA shadow for job 21057423.0
11/24 11:27:28 (21057423.0) (20719): Request to run on <10.14.1.13:48464> was ACCEPTED
11/24 11:27:28 ******************************************************
11/24 11:27:28 ** condor_shadow (CONDOR_SHADOW) STARTING UP
11/24 11:27:28 ** /usr/sbin/condor_shadow
11/24 11:27:28 ** $CondorVersion: 6.9.5 Nov 12 2007 BuildID: 63496 PRE-RELEASE-UWCS $
11/24 11:27:28 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
11/24 11:27:28 ** PID = 20720
11/24 11:27:28 ** Log last touched 11/24 11:27:28
11/24 11:27:28 ******************************************************
11/24 11:27:28 Using config source: /usr1/condor/condor_config
11/24 11:27:28 Using local config sources:
11/24 11:27:28 /usr1/condor/condor_config.local
11/24 11:27:28 DaemonCore: Command Socket at <10.14.0.12:60958>
11/24 11:27:28 Initializing a VANILLA shadow for job 21057701.0
11/24 11:27:28 (21057423.0) (20719): About to decode condor_sysnum
11/24 11:27:28 (21057423.0) (20719): Got request for syscall get_job_info (-63)
11/24 11:27:28 (21057423.0) (20719): rval = 0, errno = 0
11/24 11:27:28 (21057423.0) (20719): About to decode condor_sysnum
11/24 11:27:28 (21057423.0) (20719): Got request for syscall register_starter_info (-77)
11/24 11:27:28 (21057423.0) (20719): StarterIpAddr = <10.14.1.13:53058>
11/24 11:27:28 (21057423.0) (20719): UidDomain = ligo
11/24 11:27:28 (21057423.0) (20719): FileSystemDomain = ligo
11/24 11:27:28 (21057423.0) (20719): Machine = slot3__AT__node13.ldas-cit.ligo.caltech.edu
11/24 11:27:28 (21057423.0) (20719): Arch = X86_64
11/24 11:27:28 (21057423.0) (20719): OpSys = LINUX
11/24 11:27:28 (21057423.0) (20719): CondorVersion = $CondorVersion: 6.9.4 Aug 30 2007 $
11/24 11:27:28 (21057423.0) (20719): HasReconnect = TRUE
11/24 11:27:28 (21057423.0) (20719): rval = 0, errno = 0
11/24 11:27:28 (21057423.0) (20719): About to decode condor_sysnum
11/24 11:27:28 (21057423.0) (20719): Got request for syscall begin_execution (-78)
11/24 11:27:28 ******************************************************
11/24 11:27:28 ** condor_shadow (CONDOR_SHADOW) STARTING UP
11/24 11:27:28 ** /usr/sbin/condor_shadow
11/24 11:27:28 ** $CondorVersion: 6.9.5 Nov 12 2007 BuildID: 63496 PRE-RELEASE-UWCS $
11/24 11:27:28 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
11/24 11:27:28 ** PID = 20721
11/24 11:27:28 ** Log last touched 11/24 11:27:28
11/24 11:27:28 ******************************************************
11/24 11:27:28 Using config source: /usr1/condor/condor_config
11/24 11:27:28 Using local config sources:
11/24 11:27:28 /usr1/condor/condor_config.local
11/24 11:27:28 DaemonCore: Command Socket at <10.14.0.12:59108>
11/24 11:27:28 (21057423.0) (20719): rval = 0, errno = 25
11/24 11:27:28 (21057423.0) (20719): About to decode condor_sysnum
11/24 11:27:28 (21057423.0) (20719): Got request for syscall ulog (284)
11/24 11:27:28 Initializing a VANILLA shadow for job 21057702.0
11/24 11:27:28 (21057423.0) (20719): Job 21057423.0 going into Hold state (code 6,2): Error from starter on slot3__AT__node13.ldas-cit.ligo.caltech.edu: Failed to execute '/mnt/qfs1/anderson/.condor_run
.19517': No such file or directory
11/24 11:27:28 ******************************************************
11/24 11:27:28 ** condor_shadow (CONDOR_SHADOW) STARTING UP
11/24 11:27:28 ** /usr/sbin/condor_shadow
11/24 11:27:28 ** $CondorVersion: 6.9.5 Nov 12 2007 BuildID: 63496 PRE-RELEASE-UWCS $
11/24 11:27:28 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
11/24 11:27:28 ** PID = 20722
11/24 11:27:28 ** Log last touched 11/24 11:27:28
11/24 11:27:28 ******************************************************
11/24 11:27:28 Using config source: /usr1/condor/condor_config
11/24 11:27:28 Using local config sources:
11/24 11:27:28 /usr1/condor/condor_config.local
11/24 11:27:28 DaemonCore: Command Socket at <10.14.0.12:43485>
11/24 11:27:28 Initializing a VANILLA shadow for job 21057703.0
11/24 11:27:28 ******************************************************
11/24 11:27:28 ** condor_shadow (CONDOR_SHADOW) STARTING UP
11/24 11:27:28 ** /usr/sbin/condor_shadow
11/24 11:27:28 ** $CondorVersion: 6.9.5 Nov 12 2007 BuildID: 63496 PRE-RELEASE-UWCS $
11/24 11:27:28 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
11/24 11:27:28 ** PID = 20723
11/24 11:27:28 ** Log last touched 11/24 11:27:28
11/24 11:27:28 ******************************************************
11/24 11:27:28 Using config source: /usr1/condor/condor_config
11/24 11:27:28 Using local config sources:
11/24 11:27:28 /usr1/condor/condor_config.local
11/24 11:27:28 DaemonCore: Command Socket at <10.14.0.12:37880>
11/24 11:27:28 Initializing a VANILLA shadow for job 21057704.0
11/24 11:27:28 ******************************************************
11/24 11:27:28 ** condor_shadow (CONDOR_SHADOW) STARTING UP
11/24 11:27:28 ** /usr/sbin/condor_shadow
11/24 11:27:28 ** $CondorVersion: 6.9.5 Nov 12 2007 BuildID: 63496 PRE-RELEASE-UWCS $
11/24 11:27:28 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
11/24 11:27:28 ** PID = 20724
11/24 11:27:28 ** Log last touched 11/24 11:27:28
11/24 11:27:28 ******************************************************
11/24 11:27:28 Using config source: /usr1/condor/condor_config
11/24 11:27:28 Using local config sources:
11/24 11:27:28 /usr1/condor/condor_config.local
11/24 11:27:28 (21057423.0) (20719): ZKM: setting default map to (null)
11/24 11:27:28 DaemonCore: Command Socket at <10.14.0.12:59008>
11/24 11:27:28 Initializing a VANILLA shadow for job 21057705.0
11/24 11:27:28 (21057423.0) (20719): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 112
11/24 11:27:28 (21057678.0) (20572): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 100
11/24 11:27:28 (21057701.0) (20720): Request to run on <10.14.1.144:55028> was ACCEPTED
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date mail was appended: Mon Nov 26 10:35:03 2007 (1196094904)
Subject: Re: [condor-admin #17226] LIGO: shadow assertionn error in
pseudo_ops.C
From: Greg Quinn <gquinn__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Date: Tue, 04 Dec 2007 15:01:09 -0600
On Mon, 2007-11-26 at 10:35 -0600, condor-admin response tracking system
wrote:
> On Mon, Nov 19, 2007 at 03:18:50PM -0600, condor-admin response tracking system wrote:
> >
> > On your side, it would be useful if you could add D_SYSCALLS to your
> > SHADOW_DEBUG setting and show me the log from when this failure occurs.
> >
>
> This has only happened once more since reporting the problem. Here is the
> more detailed ShadowLog entries for job 21057423.0 with D_SYSCALLS enabled.
> Note, this time corresponds to restarting the lock manager on the NFS server
> to increase the number of conccurent locks it will handle.
Hello,
Looking through the log you sent, it appears that there is a large (~6
minute) gap in the log between the Shadow getting the "begin_execution"
RPC from the Starter and its response. This explains why the Starter
times out, assumes something is wrong, and aborts the the job's
execution.
I'm guessing the delay in the log is from waiting for NFS. Is it
possible that NFS could have caused the earlier slew of these types of
error that you saw? Are you still seeing these errors, or have things
calmed down now?
Greg
===========================================================================
Date mail was appended: Tue Dec 4 15:01:19 2007 (1196802080)
Date: Tue, 4 Dec 2007 13:50:13 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: skoranda__AT__gravity.phys.uwm.edu
Subject: Re: [condor-admin #17226] LIGO: shadow assertionn error in
pseudo_ops.C
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu
On Tue, Dec 04, 2007 at 03:01:19PM -0600, condor-admin response tracking system wrote:
>
> On Mon, 2007-11-26 at 10:35 -0600, condor-admin response tracking system
> wrote:
> > On Mon, Nov 19, 2007 at 03:18:50PM -0600, condor-admin response tracking system wrote:
> > >
> > > On your side, it would be useful if you could add D_SYSCALLS to your
> > > SHADOW_DEBUG setting and show me the log from when this failure occurs.
> > >
> >
> > This has only happened once more since reporting the problem. Here is the
> > more detailed ShadowLog entries for job 21057423.0 with D_SYSCALLS enabled.
> > Note, this time corresponds to restarting the lock manager on the NFS server
> > to increase the number of conccurent locks it will handle.
>
> Hello,
>
> Looking through the log you sent, it appears that there is a large (~6
> minute) gap in the log between the Shadow getting the "begin_execution"
> RPC from the Starter and its response. This explains why the Starter
> times out, assumes something is wrong, and aborts the the job's
> execution.
>
> I'm guessing the delay in the log is from waiting for NFS. Is it
> possible that NFS could have caused the earlier slew of these types of
> error that you saw? Are you still seeing these errors, or have things
> calmed down now?
Yes, no, yes.
With NFS anything is possible, and we are currently not getting assertion
errors in the shadow logfile.
Are you saying that a shadow assertion error is the mechanism by which
it discovers that a starter has aborted a job?
Thanks.
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date mail was appended: Tue Dec 4 15:50:29 2007 (1196805030)
Subject: Re: [condor-admin #17226] LIGO: shadow assertionn error in
pseudo_ops.C
From: Greg Quinn <gquinn__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Date: Tue, 04 Dec 2007 16:13:05 -0600
> > I'm guessing the delay in the log is from waiting for NFS. Is it
> > possible that NFS could have caused the earlier slew of these types of
> > error that you saw? Are you still seeing these errors, or have things
> > calmed down now?
>
> Yes, no, yes.
Ok, thanks.
> Are you saying that a shadow assertion error is the mechanism by which
> it discovers that a starter has aborted a job?
Well, more like a Shadow assertion is the mechanism the Shadow uses in
response to a Starter assertion.
And a Starter assertion is it's way of indicating that something went
wrong trying to run a job, but it wasn't necessarily the job's fault.
(Things that are the job's fault, like a missing executable, are handled
via a different mechanism that results in the job going on hold).
In any case, the SchedD will see the Shadow assertion and react
accordingly.
Greg
===========================================================================
Date mail was appended: Tue Dec 4 16:13:14 2007 (1196806395)
Subject: Actions
Ticket resolved by zmiller
===========================================================================
Date of actions: Fri Dec 14 13:26:19 2007 (1197660379)