LIGO Support Ticket 17339
Ticket Information
Number: admin 17339
User: anderson@ligo.caltech.edu
Email: skoranda__AT__gravity.phys.uwm.edu
Status: resolved
Assigned To: nleroy
Date: Mon, 31 Dec 2007 16:06:04 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu, Todd Tannenbaum <tannenba__AT__cs.wisc.edu>
CC: Scott Koranda <skoranda__AT__gravity.phys.uwm.edu>
Subject: LIGO: schedd core dump in GridUniverseLogic::StartOrFindGManager
X-Seen-BY: mailfromd 4.1 obsidian.cs.wisc.edu
The LIGO Condor pool at Caltech, running
[root@ldas-grid ~]# condor_version
$CondorVersion: 6.9.5 Nov 28 2007 BuildID: 65347 $
$CondorPlatform: X86_64-LINUX_RHEL3 $
decided to end 2007 with a bang, or more precisely a schedd core dump.
Here is the stack trace:
(gdb) where
#0 0x000000000058b2e0 in WriteCoreDump ()
#1 0x00000000005787da in linux_sig_coredump ()
#2 <signal handler called>
#3 0x00002b6952cab9bb in _int_malloc () from /lib64/libc.so.6
#4 0x00002b6952cad58b in malloc () from /lib64/libc.so.6
#5 0x00002b695286e425 in operator new () from /usr/lib64/libstdc++.so.5
#6 0x0000000000538cd2 in GridUniverseLogic::StartOrFindGManager ()
#7 0x0000000000537d09 in GridUniverseLogic::JobAdded ()
#8 0x0000000000537c8a in GridUniverseLogic::JobCountUpdate ()
#9 0x00000000004e1fb7 in Scheduler::count_jobs ()
#10 0x00000000004e086b in Scheduler::timeout ()
#11 0x00000000004f8dc9 in Scheduler::reschedule_negotiator ()
#12 0x000000000056fa5d in DaemonCore::HandleReq ()
#13 0x000000000056c6dc in DaemonCore::HandleReq ()
#14 0x000000000056c175 in DaemonCore::Driver ()
#15 0x000000000057afa4 in main ()
Here is the obituary email from condor_master:
This is an automated email from the Condor system
on machine "ldas-grid.ligo.caltech.edu". Do not reply.
"/usr/sbin/condor_schedd" on "ldas-grid.ligo.caltech.edu" exited with status 1.
Condor will automatically restart this process in 10 seconds.
*** Last 20 line(s) of file /usr1/condor/log/SchedLog:
12/31 15:37:10 (pid:15872) Started shadow for job 23169831.0 on "<10.14.1.231:52581>", (shadow pid = 24837)
12/31 15:37:10 (pid:15872) Calling HandleReq <handle_q> (0)
12/31 15:37:10 (pid:15872) ZKM: setting default map to dietz@ligo
12/31 15:37:10 (pid:15872) Return from HandleReq <handle_q>
12/31 15:37:10 (pid:15872) DaemonCore: Command received via UDP from host <10.14.0.12:40900>, access level WRITE
12/31 15:37:10 (pid:15872) DaemonCore: received command 421 (RESCHEDULE), calling handler (reschedule_negotiator)
12/31 15:37:10 (pid:15872) Calling HandleReq <reschedule_negotiator> (0)
12/31 15:37:10 (pid:15872) Sent ad to central manager for cokelaer@ligo
12/31 15:37:10 (pid:15872) Sent ad to 1 collectors for cokelaer@ligo
12/31 15:37:10 (pid:15872) Sent ad to central manager for dietz@ligo
12/31 15:37:10 (pid:15872) Sent ad to 1 collectors for dietz@ligo
12/31 15:37:10 (pid:15872) Sent ad to central manager for dfazi@ligo
12/31 15:37:10 (pid:15872) Sent ad to 1 collectors for dfazi@ligo
12/31 15:37:10 (pid:15872) Sent ad to central manager for nvf@ligo
12/31 15:37:10 (pid:15872) Sent ad to 1 collectors for nvf@ligo
12/31 15:37:10 (pid:15872) Sent ad to central manager for sung@ligo
12/31 15:37:10 (pid:15872) Sent ad to 1 collectors for sung@ligo
12/31 15:37:10 (pid:15872) Sent ad to central manager for bhughey@ligo
12/31 15:37:10 (pid:15872) Sent ad to 1 collectors for bhughey@ligo
12/31 15:37:10 (pid:15872) warning: setting UserUid to 5008, was 5004 previously
*** End of file SchedLog
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date of creation: Mon Dec 31 18:06:25 2007 (1199145989)
Date: Tue, 1 Jan 2008 10:19:23 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
CC: Scott Koranda <skoranda__AT__gravity.phys.uwm.edu>
Subject: Re: [condor-admin #17339] LIGO: schedd core dump in GridUniverseLogic::StartOrFindGManager
X-Seen-BY: mailfromd 4.1 silica.cs.wisc.edu
Unfortunately, condor_preen removed the core file from the initial crash
reported in this ticket before I could make a copy (I have since modified the
configuration file to prevent that). However, 2008 is off to an auspicious start with
45 minutes of uptime in the new year before another schedd coredump.
Here is the stack trace and obituary email. The full core file may be found
compressed at:
http://www.ligo.caltech.edu/~anderson/condor.17339
Thanks.
(gdb) where
#0 0x000000000058b2e0 in WriteCoreDump ()
#1 0x00000000005787da in linux_sig_coredump ()
#2 <signal handler called>
#3 0x00002b14c0bd6cd8 in _int_malloc () from /lib64/libc.so.6
#4 0x00002b14c0bd858b in malloc () from /lib64/libc.so.6
#5 0x00002b14c0799425 in operator new () from /usr/lib64/libstdc++.so.5
#6 0x00002b14c0799549 in operator new[] () from /usr/lib64/libstdc++.so.5
#7 0x000000000060d11b in MyString::reserve ()
#8 0x000000000060d1df in MyString::reserve_at_least ()
#9 0x000000000060d304 in MyString::operator+= ()
#10 0x000000000063fb7a in AttrList::sPrint ()
#11 0x00000000006447f9 in ClassAd::sPrint ()
#12 0x000000000050e071 in CloseJobHistoryFile ()
#13 0x00000000005085bf in DestroyProc ()
#14 0x00000000004e5532 in jobIsFinishedDone ()
#15 0x00000000004fbbbd in Scheduler::jobIsFinishedHandler ()
#16 0x0000000000587990 in SelfDrainingQueue::timerHandler ()
#17 0x000000000058ad31 in TimerManager::Timeout ()
#18 0x000000000056b0f7 in DaemonCore::Driver ()
#19 0x000000000057afa4 in main ()
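
Note: both backtraces in this ticket fault in the same place (_int_malloc in
libc) but reach it through different schedd code paths. The following is a
minimal Python sketch, not part of the original report, that extracts the
frames from gdb "where" output and compares the two traces. The helper names
(parse_frames, frames_below_signal, first_app_frame) are hypothetical, and the
regular expression is only assumed to cover frame lines of the form shown here.

```python
import re

# Matches lines such as:
#   #3 0x00002b6952cab9bb in _int_malloc () from /lib64/libc.so.6
#   #2 <signal handler called>
FRAME_RE = re.compile(
    r"^#(\d+)\s+(?:0x[0-9a-fA-F]+\s+in\s+)?(.*?)(?:\s+\(\))?(?:\s+from\s+(\S+))?$"
)

def parse_frames(trace):
    """Return (function, shared_library_or_None) for each frame line."""
    frames = []
    for line in trace.splitlines():
        m = FRAME_RE.match(line.strip())
        if m:
            frames.append((m.group(2), m.group(3)))
    return frames

def frames_below_signal(frames):
    """Drop the core-dump handler frames; keep what ran when the signal hit."""
    for i, (name, _lib) in enumerate(frames):
        if name == "<signal handler called>":
            return frames[i + 1:]
    return frames

def first_app_frame(frames):
    """First frame that is not inside a shared library."""
    return next(name for name, lib in frames if lib is None)

# Excerpts copied from the two backtraces in this ticket.
first_trace = """
#0 0x000000000058b2e0 in WriteCoreDump ()
#1 0x00000000005787da in linux_sig_coredump ()
#2 <signal handler called>
#3 0x00002b6952cab9bb in _int_malloc () from /lib64/libc.so.6
#4 0x00002b6952cad58b in malloc () from /lib64/libc.so.6
#5 0x00002b695286e425 in operator new () from /usr/lib64/libstdc++.so.5
#6 0x0000000000538cd2 in GridUniverseLogic::StartOrFindGManager ()
"""
second_trace = """
#0 0x000000000058b2e0 in WriteCoreDump ()
#1 0x00000000005787da in linux_sig_coredump ()
#2 <signal handler called>
#3 0x00002b14c0bd6cd8 in _int_malloc () from /lib64/libc.so.6
#4 0x00002b14c0bd858b in malloc () from /lib64/libc.so.6
#5 0x00002b14c0799425 in operator new () from /usr/lib64/libstdc++.so.5
#6 0x00002b14c0799549 in operator new[] () from /usr/lib64/libstdc++.so.5
#7 0x000000000060d11b in MyString::reserve ()
"""

a = frames_below_signal(parse_frames(first_trace))
b = frames_below_signal(parse_frames(second_trace))

print("faulting frame:", a[0][0], "/", b[0][0])
print("first schedd frame:", first_app_frame(a), "/", first_app_frame(b))
```

The faulting frame is the same in both traces while the first schedd frame
differs, which is the pattern visible by eye in the two backtraces.
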
This is an automated email from the Condor system
on machine "ldas-grid.ligo.caltech.edu". Do not reply.
"/usr/sbin/condor_schedd" on "ldas-grid.ligo.caltech.edu" exited with status 1.
Condor will automatically restart this process in 10 seconds.
*** Last 20 line(s) of file /usr1/condor/log/SchedLog:
1/1 00:45:17 (pid:24850) Calling Handler <<Negotiator Command>>
1/1 00:45:17 (pid:24850) Activity on stashed negotiator socket
1/1 00:45:17 (pid:24850) Negotiating for owner: cokelaer@ligo
1/1 00:45:17 (pid:24850) Out of servers - 8 jobs matched, 169 jobs idle, 1 jobs rejected
1/1 00:45:17 (pid:24850) Return from Handler <<Negotiator Command>>
1/1 00:45:17 (pid:24850) DaemonCore: pid 20592 exited with status 0, invoking reaper 3 <GManagerReaper>
1/1 00:45:17 (pid:24850) condor_gridmanager (PID 20592, owner nvf) exited with return code 0.
1/1 00:45:17 (pid:24850) warning: setting UserUid to 5008, was 5136 previously
1/1 00:45:17 (pid:24850) DaemonCore: return from reaper for pid 20592
1/1 00:45:17 (pid:24850) Calling HandleReq <HandleChildAliveCommand> (0)
1/1 00:45:17 (pid:24850) Return from HandleReq <HandleChildAliveCommand>
1/1 00:45:17 (pid:24850) Calling HandleReq <handle_q> (0)
1/1 00:45:17 (pid:24850) ZKM: setting default map to isogait@ligo
1/1 00:45:17 (pid:24850) Return from HandleReq <handle_q>
1/1 00:45:17 (pid:24850) DaemonCore: pid 18059 exited with status 25600, invoking reaper 2 <child_exit>
1/1 00:45:17 (pid:24850) Shadow pid 18059 for job 23182581.0 exited with status 100
1/1 00:45:17 (pid:24850) match (<10.14.1.127:51275>#1197678898#2565#...) out of jobs (cluster id 23182581); relinquishing
1/1 00:45:17 (pid:24850) Sent RELEASE_CLAIM to startd at <10.14.1.127:51275>
1/1 00:45:17 (pid:24850) Match record (<10.14.1.127:51275>, 23182581, -1) deleted
1/1 00:45:17 (pid:24850) DaemonCore: return from reaper for pid 18059
1/1 00:45:17 (pid:24850) DaemonCore: return from reaper for pid 18059
*** End of file SchedLog
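
Note: the log above reports pid 18059 twice, once "exited with status 25600"
and once "exited with status 100". The two values are consistent if the first
is the raw POSIX wait status and the second is the decoded exit code; that
reading is an assumption based only on the numbers, not something the log
states. A short Python sketch of the decoding, for a POSIX system:

```python
import os

raw_status = 25600  # value printed on the DaemonCore line for pid 18059

# In a POSIX wait status the exit code sits in bits 8-15; the low bits
# carry the terminating signal, if any.
exited_normally = os.WIFEXITED(raw_status)
exit_code = os.WEXITSTATUS(raw_status)

# The same decoding done by hand with a shift and mask.
manual_exit_code = (raw_status >> 8) & 0xFF

print(exited_normally, exit_code, manual_exit_code)
```
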
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date mail was appended: Wed Jan 2 9:24:19 2008 (1199287460)
Subject: Actions
Assigned to nleroy by nleroy
===========================================================================
Date of actions: Wed Jan 2 8:54:43 2008 (1199287946)
Subject: Actions
Ticket resolved by zmiller
===========================================================================
Date of actions: Thu Jan 10 16:25:12 2008 (1200003913)