LIGO Support Ticket 15441

Ticket Information
  Number:      admin 15441
  User:        anderson@ligo.caltech.edu
  Email:       espinoza_e__AT__ligo.caltech.edu
  Status:      new
  Assigned To: wright
Date: Thu, 10 May 2007 17:05:28 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
CC: Erik Espinoza <espinoza_e__AT__ligo.caltech.edu>
Subject: LIGO: condor startd crash

The LIGO Caltech cluster running,
$CondorVersion: 6.8.4 Feb  1 2007 $
$CondorPlatform: X86_64-LINUX_RHEL3 $
on FC4 x86_64 has had a condor_startd process crash on one of the execute
nodes, apparently while trying to spawn a BOINC backfill job.

Here is the stack trace,

(gdb) where
#0  0x0000003190d0c848 in _dl_fixup () from /lib64/ld-linux-x86-64.so.2
#1  0x0000003190d12112 in _dl_runtime_resolve () from /lib64/ld-linux-x86-64.so.2
#2  0x00000000004fc504 in DaemonCore::Create_Process ()
#3  0x00000000004b73e4 in Starter::execDCStarter ()
#4  0x00000000004b710f in Starter::execBOINCStarter ()
#5  0x00000000004b6bc3 in Starter::spawn ()
#6  0x00000000004c9025 in BOINC_BackfillMgr::spawnClient ()
#7  0x00000000004c8cb1 in BOINC_BackfillMgr::start ()
#8  0x00000000004b4193 in Resource::start_backfill ()
#9  0x00000000004b00b4 in ResState::eval ()
#10 0x00000000004af4bf in Resource::eval_state ()
#11 0x00000000004ace48 in ResMgr::walk ()
#12 0x00000000004ade6a in ResMgr::eval_all ()
#13 0x0000000000511136 in TimerManager::Timeout ()
#14 0x00000000004f6319 in DaemonCore::Driver ()
#15 0x0000000000503cc1 in main ()

and here is the Obituary email,

----- Forwarded message from condor__AT__node316.ldas-cit.ligo.caltech.edu -----

This is an automated email from the Condor system
on machine "node316.ldas-cit.ligo.caltech.edu".  Do not reply.

"/ldcg/condor/sbin/condor_startd" on "node316.ldas-cit.ligo.caltech.edu" died due to signal 7.
Condor will automatically restart this process in 10 seconds.

*** Last 20 line(s) of file StartLog:
5/10 16:43:11 vm1: Got universe "STANDARD" (1) from request classad
5/10 16:43:11 vm1: State change: claim-activation protocol successful
5/10 16:43:11 vm1: Changing activity: Idle -> Busy
5/10 16:45:37 DaemonCore: Command received via TCP from host <10.14.0.12:33451>
5/10 16:45:37 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
5/10 16:45:37 vm2: Called deactivate_claim_forcibly()
5/10 16:45:37 Starter pid 670 exited with status 0
5/10 16:45:37 vm2: State change: starter exited
5/10 16:45:37 vm2: Changing activity: Busy -> Idle
5/10 16:55:23 vm2: State change: idle claim shutting down due to CLAIM_WORKLIFE
5/10 16:55:23 vm2: Changing state and activity: Claimed/Idle -> Preempting/Vacating
5/10 16:55:23 vm2: State change: No preempting claim, returning to owner
5/10 16:55:23 vm2: Changing state and activity: Preempting/Vacating -> Owner/Idle
5/10 16:55:23 vm2: State change: IS_OWNER is false
5/10 16:55:23 vm2: Changing state: Owner -> Unclaimed
5/10 16:55:23 vm2: State change: START_BACKFILL is TRUE
5/10 16:55:23 vm2: Changing state: Unclaimed -> Backfill
5/10 16:55:23 DaemonCore: Command received via UDP from host <10.14.0.12:57103>
5/10 16:55:23 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_release_claim)
5/10 16:55:23 Warning: can't find resource with ClaimId (<10.14.2.66:53455>#1177343082#2822)
*** End of file StartLog



-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Questions about this message or Condor in general?
Email address of the local Condor administrator: ldas_admin_cit__AT__ligo.caltech.edu
The Official Condor Homepage is http://www.cs.wisc.edu/condor

----- End forwarded message -----
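
For reference, the obituary above reports only a signal number. The following is a minimal sketch, not part of Condor, of pulling the daemon path, host, and signal out of such a line and naming the signal. The function name parse_obituary and the small signal table are illustrative assumptions. The table covers only Linux on x86/x86_64, where signal 7 is SIGBUS; other platforms number signals differently.

```python
import re

# Assumed mapping: a few common fatal signals as numbered on Linux x86/x86_64.
# Other platforms assign different numbers, so this table is not portable.
LINUX_X86_SIGNALS = {
    4: "SIGILL",
    6: "SIGABRT",
    7: "SIGBUS",
    8: "SIGFPE",
    11: "SIGSEGV",
}

# Matches the obituary sentence format seen in the forwarded message.
OBITUARY_RE = re.compile(
    r'"(?P<daemon>[^"]+)" on "(?P<host>[^"]+)" died due to signal (?P<signo>\d+)\.'
)


def parse_obituary(line):
    """Return daemon, host, signal number and name, or None if no match."""
    match = OBITUARY_RE.search(line)
    if match is None:
        return None
    signo = int(match.group("signo"))
    return {
        "daemon": match.group("daemon"),
        "host": match.group("host"),
        "signo": signo,
        "signame": LINUX_X86_SIGNALS.get(signo, "unknown"),
    }


if __name__ == "__main__":
    line = ('"/ldcg/condor/sbin/condor_startd" on '
            '"node316.ldas-cit.ligo.caltech.edu" died due to signal 7.')
    print(parse_obituary(line))
```

Run against the obituary line in this ticket, the sketch reports SIGBUS for condor_startd on node316.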

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date of creation: Thu May 10 19:06:36 2007 (1178841999)
Date: Thu, 10 May 2007 17:12:13 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #15441] LIGO: condor startd crash

The full core image for this crash can be found at,
http://www.ligo.caltech.edu/~anderson/condor.15441

Thanks.


-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Thu May 10 19:12:41 2007 (1178842361)
Subject: Actions

Assigned to wright by cat
===========================================================================
Date of actions: Fri May 11 11:09:42 2007 (1178899782)