LIGO Support Ticket 19282
Ticket Information
Number: admin 19282
User: anderson@ligo.caltech.edu
Email:
Status: pending
Assigned To: psilord
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: LIGO: condor_starter assertion failure
Date: Sun, 10 May 2009 18:07:46 -0700
X-Seen-BY: mailfromd 4.1 silica.cs.wisc.edu
The LIGO Caltech condor pool has had the following assertion failure
from condor_starter ~150 times in the last few months.
Currently the pool is running CentOS 5.3 with,
[root@node496 log]# condor_version
$CondorVersion: 7.2.2 Apr 9 2009 BuildID: 145189 $
$CondorPlatform: X86_64-LINUX_RHEL5 $
This appears to occur much more frequently on execute machines that
have 8 slots compared to those with 4. Note, in the following example,
the machine 10.14.0.12 is a submit machine running schedd and
associated shadow processes.
Thanks.
5/10 17:00:34 ********** STARTER starting up ***********
5/10 17:00:34 ** $CondorVersion: 7.2.2 Apr 9 2009 BuildID: 145189 $
5/10 17:00:34 ** $CondorPlatform: X86_64-LINUX_RHEL5 $
5/10 17:00:34 ******************************************
5/10 17:00:34 Submitting machine is "ldas-grid.ligo.caltech.edu"
5/10 17:00:34 EventHandler {
5/10 17:00:34 func = 0x4df3e2
5/10 17:00:34 mask = SIGALRM SIGHUP SIGINT SIGUSR1 SIGUSR2 SIGCHLD SIGTSTP
5/10 17:00:34 }
5/10 17:00:34 Done setting resource limits
5/10 17:00:34 *FSM* Transitioning to state "GET_PROC"
5/10 17:00:34 *FSM* Executing state func "get_proc()" [ ]
5/10 17:00:34 Entering get_proc()
5/10 17:00:34 Entering get_job_info()
5/10 17:05:34 condor_read(): timeout reading 5 bytes from <10.14.0.12:42471>.
5/10 17:05:34 IO: Failed to read packet header
Stack dump for process 3714 at timestamp 1242000334 (13 frames)
condor_starter(dprintf_dump_stack+0xb7)[0x4da972]
condor_starter[0x4dabde]
/lib64/libc.so.6[0x355b830280]
/lib64/libc.so.6(gsignal+0x35)[0x355b830215]
/lib64/libc.so.6(abort+0x110)[0x355b831cc0]
/lib64/libc.so.6(__assert_fail+0xf6)[0x355b829696]
condor_starter(REMOTE_CONDOR_startup_info_request+0x1a1)[0x4d3acb]
condor_starter(_Z12get_job_infov+0x52)[0x4a51aa]
condor_starter(_Z8get_procv+0x21)[0x4a52c7]
condor_starter(_ZN12StateMachine7executeEv+0x20e)[0x4df6ee]
condor_starter(main+0x157)[0x4a615d]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x355b81d974]
condor_starter(__gxx_personality_v0+0x2c9)[0x4a49a9]
5/10 17:05:34 ERROR "Can't find transition out of state "GET_PROC" for event "CHILD_EXIT"" at line 325 in file state_machine_driver.cpp
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date of creation: Sun May 10 20:08:06 2009 (1242004089)
Subject: Actions
Assigned to psilord by zmiller
===========================================================================
Date of actions: Mon May 11 14:32:53 2009 (1242070373)
Date: Fri, 5 Jun 2009 10:56:52 -0500
From: Peter Keller <psilord__AT__cs.wisc.edu>
To: zmiller <condor-admin__AT__cs.wisc.edu>
Subject: Re: [condor-admin #19282] LIGO: condor_starter assertion failure
Hello,
> From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
>
> The LIGO Caltech condor pool has had the following assertion failure
> from condor_starter ~150 times in the last few months.
>
> Currently the pool is running CentOS 5.3 with,
>
> [root@node496 log]# condor_version
> $CondorVersion: 7.2.2 Apr 9 2009 BuildID: 145189 $
> $CondorPlatform: X86_64-LINUX_RHEL5 $
>
> This appears to occur much more frequently on execute machines that
> have 8 slots compared to those with 4. Note, in the following example,
> the machine 10.14.0.12 is a submit machine running schedd and
> associated shadow processes.
OK, I looked at this. It seems this happens when the standard universe
(stduniv) shadow unexpectedly breaks its connection to the stduniv
starter.
So, if you can find an example of this happening and then look at the
shadow log for the job in question, the true error should be written
there or at least it can be determined if the shadow segfaulted.
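To locate examples, the starter logs can be scanned for the failed-transition message. The following is a minimal sketch, assuming each log line starts with the "M/D HH:MM:SS" stamp seen in the excerpt in this ticket and that the ERROR message sits on a single line in the real log. The function name find_assertions is hypothetical and not part of Condor.

```python
import re

# Timestamp prefix as it appears in the starter log excerpt: "5/10 17:05:34 ..."
STAMP = re.compile(r"^(\d{1,2}/\d{1,2} \d{2}:\d{2}:\d{2}) ")

# Fragment of the ERROR message reported in this ticket.
MARKER = "Can't find transition out of state"


def find_assertions(lines):
    """Return the timestamp of every log line that reports the failed transition."""
    hits = []
    for line in lines:
        m = STAMP.match(line)
        if m and MARKER in line:
            hits.append(m.group(1))
    return hits
```

Passing an open starter log file to find_assertions() yields the times to look up in the shadow log on the submit machine. The location of that log file depends on the site's configuration.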
Thank you.
-pete
===========================================================================
Date mail was appended: Fri Jun 5 10:56:57 2009 (1244217418)
Subject: Actions
Status changed from open to pending by psilord
===========================================================================
Date of actions: Fri Jun 5 10:57:03 2009 (1244217423)
Date: Thu, 11 Jun 2009 14:09:31 -0500
From: Peter Keller <psilord__AT__cs.wisc.edu>
To: zmiller <condor-admin__AT__cs.wisc.edu>
Subject: Re: [condor-admin #19282] LIGO: condor_starter assertion failure
Hello,
> From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
>
> The LIGO Caltech condor pool has had the following assertion failure
> from condor_starter ~150 times in the last few months.
>
> Currently the pool is running CentOS 5.3 with,
>
> [root@node496 log]# condor_version
> $CondorVersion: 7.2.2 Apr 9 2009 BuildID: 145189 $
> $CondorPlatform: X86_64-LINUX_RHEL5 $
>
> This appears to occur much more frequently on execute machines that
> have 8 slots compared to those with 4. Note, in the following example,
> the machine 10.14.0.12 is a submit machine running schedd and
> associated shadow processes.
In looking at this ticket, I would say that the starter asserted
because the connection to the shadow went away during the RPC call
REMOTE_CONDOR_startup_info_request() from the starter.
When you find one of these again, could you also send me the
corresponding section (by time) of the shadow log?
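One way to cut out the matching section is sketched below. It assumes the shadow log uses the same "M/D HH:MM:SS" line prefix as the starter log excerpt, that both machines log in the same local time zone (UTC-7 here, taken from the Date header of the original report), and that the year has to be supplied because the stamps omit it. The names stack_dump_time, parse_stamp and window are hypothetical helpers, not Condor tools.

```python
from datetime import datetime, timedelta, timezone


def stack_dump_time(epoch, utc_offset_hours):
    """Convert the epoch seconds from a 'Stack dump ... at timestamp N' line
    to naive local wall-clock time at the given UTC offset."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    return datetime.fromtimestamp(epoch, tz).replace(tzinfo=None)


def parse_stamp(line, year):
    """Parse a leading 'M/D HH:MM:SS' stamp; return None for lines without one."""
    parts = line.split(" ", 2)
    if len(parts) < 2:
        return None
    try:
        return datetime.strptime(f"{year}/{parts[0]} {parts[1]}",
                                 "%Y/%m/%d %H:%M:%S")
    except ValueError:
        return None


def window(lines, center, minutes=10, year=2009):
    """Yield log lines stamped within +/- minutes of center.
    Unstamped lines (e.g. stack frames) follow the preceding stamped line."""
    delta = timedelta(minutes=minutes)
    keep = False
    for line in lines:
        t = parse_stamp(line, year)
        if t is not None:
            keep = abs(t - center) <= delta
        if keep:
            yield line
```

For the failure in this ticket, stack_dump_time(1242000334, -7) gives 17:05:34 on 5/10, which agrees with the stamps on the surrounding starter log lines. The 10 minute default is an arbitrary choice that covers the 5 minute read timeout visible in the excerpt.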
Thank you.
-pete
===========================================================================
Date mail was appended: Thu Jun 11 14:09:35 2009 (1244747375)