LIGO Support Ticket 15441
Ticket Information
Number: admin 15441
User: anderson@ligo.caltech.edu
Email: espinoza_e__AT__ligo.caltech.edu
Status: new
Assigned To: wright
Date: Thu, 10 May 2007 17:05:28 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
CC: Erik Espinoza <espinoza_e__AT__ligo.caltech.edu>
Subject: LIGO: condor startd crash
The LIGO Caltech cluster running,
$CondorVersion: 6.8.4 Feb 1 2007 $
$CondorPlatform: X86_64-LINUX_RHEL3 $
on FC4 x86_64 has had a condor_startd process crash on one of the execute
nodes, apparently while trying to spawn a BOINC backfill job.
Here is the stack trace,
(gdb) where
#0 0x0000003190d0c848 in _dl_fixup () from /lib64/ld-linux-x86-64.so.2
#1 0x0000003190d12112 in _dl_runtime_resolve () from /lib64/ld-linux-x86-64.so.2
#2 0x00000000004fc504 in DaemonCore::Create_Process ()
#3 0x00000000004b73e4 in Starter::execDCStarter ()
#4 0x00000000004b710f in Starter::execBOINCStarter ()
#5 0x00000000004b6bc3 in Starter::spawn ()
#6 0x00000000004c9025 in BOINC_BackfillMgr::spawnClient ()
#7 0x00000000004c8cb1 in BOINC_BackfillMgr::start ()
#8 0x00000000004b4193 in Resource::start_backfill ()
#9 0x00000000004b00b4 in ResState::eval ()
#10 0x00000000004af4bf in Resource::eval_state ()
#11 0x00000000004ace48 in ResMgr::walk ()
#12 0x00000000004ade6a in ResMgr::eval_all ()
#13 0x0000000000511136 in TimerManager::Timeout ()
#14 0x00000000004f6319 in DaemonCore::Driver ()
#15 0x0000000000503cc1 in main ()
and here is the Obituary email,
----- Forwarded message from condor__AT__node316.ldas-cit.ligo.caltech.edu -----
This is an automated email from the Condor system
on machine "node316.ldas-cit.ligo.caltech.edu". Do not reply.
"/ldcg/condor/sbin/condor_startd" on "node316.ldas-cit.ligo.caltech.edu" died due to signal 7.
Condor will automatically restart this process in 10 seconds.
*** Last 20 line(s) of file StartLog:
5/10 16:43:11 vm1: Got universe "STANDARD" (1) from request classad
5/10 16:43:11 vm1: State change: claim-activation protocol successful
5/10 16:43:11 vm1: Changing activity: Idle -> Busy
5/10 16:45:37 DaemonCore: Command received via TCP from host <10.14.0.12:33451>
5/10 16:45:37 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
5/10 16:45:37 vm2: Called deactivate_claim_forcibly()
5/10 16:45:37 Starter pid 670 exited with status 0
5/10 16:45:37 vm2: State change: starter exited
5/10 16:45:37 vm2: Changing activity: Busy -> Idle
5/10 16:55:23 vm2: State change: idle claim shutting down due to CLAIM_WORKLIFE
5/10 16:55:23 vm2: Changing state and activity: Claimed/Idle -> Preempting/Vacating
5/10 16:55:23 vm2: State change: No preempting claim, returning to owner
5/10 16:55:23 vm2: Changing state and activity: Preempting/Vacating -> Owner/Idle
5/10 16:55:23 vm2: State change: IS_OWNER is false
5/10 16:55:23 vm2: Changing state: Owner -> Unclaimed
5/10 16:55:23 vm2: State change: START_BACKFILL is TRUE
5/10 16:55:23 vm2: Changing state: Unclaimed -> Backfill
5/10 16:55:23 DaemonCore: Command received via UDP from host <10.14.0.12:57103>
5/10 16:55:23 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_release_claim)
5/10 16:55:23 Warning: can't find resource with ClaimId (<10.14.2.66:53455>#1177343082#2822)
*** End of file StartLog
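The vm2 transitions in the excerpt above end with Unclaimed -> Backfill at 16:55:23, which lines up with the Resource::start_backfill frame in the stack trace. As a rough sketch, assuming the StartLog line layout shown above, the transitions for one slot could be pulled out like this (the regular expression and the transitions() helper are illustrative and not part of Condor):

```python
import re

# Matches lines such as:
#   5/10 16:55:23 vm2: Changing state: Owner -> Unclaimed
# The layout is inferred from the StartLog excerpt above (an assumption).
TRANSITION = re.compile(
    r"^(?P<date>\d+/\d+) (?P<time>\d+:\d+:\d+) (?P<slot>vm\d+): "
    r"Changing (?P<kind>state and activity|state|activity): "
    r"(?P<old>\S+) -> (?P<new>\S+)$"
)


def transitions(lines, slot):
    """Return (time, kind, old, new) tuples for the given slot, in log order."""
    found = []
    for line in lines:
        match = TRANSITION.match(line.strip())
        if match and match["slot"] == slot:
            found.append(
                (match["time"], match["kind"], match["old"], match["new"])
            )
    return found


# Lines copied from the StartLog excerpt above.
sample = [
    "5/10 16:43:11 vm1: Changing activity: Idle -> Busy",
    "5/10 16:45:37 vm2: Changing activity: Busy -> Idle",
    "5/10 16:55:23 vm2: Changing state and activity: Claimed/Idle -> Preempting/Vacating",
    "5/10 16:55:23 vm2: Changing state and activity: Preempting/Vacating -> Owner/Idle",
    "5/10 16:55:23 vm2: Changing state: Owner -> Unclaimed",
    "5/10 16:55:23 vm2: Changing state: Unclaimed -> Backfill",
]

for entry in transitions(sample, "vm2"):
    print(*entry)
```

Filtering by slot keeps the vm1 activity out of the way, so the last line printed is the final state change recorded for vm2 before the startd died.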
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Questions about this message or Condor in general?
Email address of the local Condor administrator: ldas_admin_cit__AT__ligo.caltech.edu
The Official Condor Homepage is http://www.cs.wisc.edu/condor
----- End forwarded message -----
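For reference when reading the obituary above: signal numbers are platform-specific, and on Linux x86_64 signal 7 is SIGBUS (bus error). A minimal Python sketch for decoding a signal number on whatever machine it is run on (the signal_name() helper is hypothetical and not part of Condor):

```python
import signal


def signal_name(signum: int) -> str:
    """Return the symbolic name of a signal number on the current platform."""
    try:
        return signal.Signals(signum).name
    except ValueError:
        # The number does not correspond to a signal known on this platform.
        return f"unknown signal {signum}"


# Run this on the execute node (or any host with the same OS and
# architecture), since the number-to-name mapping differs between platforms.
print(signal_name(7))
```

The lookup is done at run time rather than hard-coded because the same number can mean a different signal on another operating system.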
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date of creation: Thu May 10 19:06:36 2007 (1178841999)
Date: Thu, 10 May 2007 17:12:13 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #15441] LIGO: condor startd crash
The full core image for this crash can be found at,
http://www.ligo.caltech.edu/~anderson/condor.15441
Thanks.
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date mail was appended: Thu May 10 19:12:41 2007 (1178842361)
Subject: Actions
Assigned to wright by cat
===========================================================================
Date of actions: Fri May 11 11:09:42 2007 (1178899782)