LIGO Support Ticket 2158

Ticket Information
  Number:      support 2158
  User:        anderson@ligo.caltech.edu
  Email:       skoranda__AT__gravity.phys.uwm.edu,nvf__AT__gravity.phys.uwm.edu
  Status:      resolved
  Assigned To: nleroy
Date: Tue, 1 Jan 2008 17:51:27 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu, Todd Tannenbaum <tannenba__AT__cs.wisc.edu>
CC: Scott Koranda <skoranda__AT__gravity.phys.uwm.edu>,         Nickolas
 Fotopoulos <nvf__AT__gravity.phys.uwm.edu>
Subject: LIGO: multiple schedd core dumps possibly related to  
 condor_gridmanager
X-Seen-BY: mailfromd 4.1 obsidian.cs.wisc.edu

The LIGO Condor pool running,

# condor_version
$CondorVersion: 6.9.5 Nov 28 2007 BuildID: 65347 $
$CondorPlatform: X86_64-LINUX_RHEL3 $

has now had the schedd core dump 3 times in the last 26 hours. This is
an escalation of ticket,
[condor-admin #17339] LIGO: schedd core dump in GridUniverseLogic::StartOrFindGManager

The stack trace looks different in all 3 cases (see below for the latest),
however, the schedd log file indicates one common feature in all cases:

First crash:
12/31 15:37:02 (pid:15872) condor_gridmanager (PID 24686, owner nvf) exited with return code 0.

Second crash:
1/1 00:45:17 (pid:24850) condor_gridmanager (PID 20592, owner nvf) exited with
return code 0.

And this the third one:
1/1 16:52:14 (pid:21126) condor_gridmanager (PID 9281, owner nvf) exited with
return code 0.

Perhaps there is a memory corruption occuring during the cleanup of
gridmanager jobs?

The core file for this 3rd crash may be found at,
http://www.ligo.caltech.edu/~anderson/condor.17339

Here is the stack trace and Obituary from the 3rd crash:

(gdb) where
#0  0x000000000058b2e0 in WriteCoreDump ()
#1  0x00000000005787da in linux_sig_coredump ()
#2  <signal handler called>
#3  0x00002ab4361eca84 in _int_malloc () from /lib64/libc.so.6
#4  0x00002ab4361ee58b in malloc () from /lib64/libc.so.6
#5  0x00002ab435daf425 in operator new () from /usr/lib64/libstdc++.so.5
#6  0x00002ab435daf549 in operator new[] () from /usr/lib64/libstdc++.so.5
#7  0x0000000000659f7b in Buf::Buf ()
#8  0x000000000065672c in ReliSock::SndMsg::SndMsg ()
#9  0x0000000000654020 in ReliSock::ReliSock ()
#10 0x0000000000654aff in ReliSock::accept ()
#11 0x000000000056c83c in DaemonCore::HandleReq ()
#12 0x000000000056c6dc in DaemonCore::HandleReq ()
#13 0x000000000056c175 in DaemonCore::Driver ()
#14 0x000000000057afa4 in main ()


This is an automated email from the Condor system
on machine "ldas-grid.ligo.caltech.edu".  Do not reply.

"/usr/sbin/condor_schedd" on "ldas-grid.ligo.caltech.edu" exited with status 1.
Condor will automatically restart this process in 10 seconds.

*** Last 20 line(s) of file /usr1/condor/log/SchedLog:
1/1 16:52:13 (pid:21126) Return from HandleReq <attempt_access_handler>
1/1 16:52:13 (pid:21126) Calling HandleReq <attempt_access_handler> (1)
1/1 16:52:13 (pid:21126) Return from HandleReq <attempt_access_handler>
1/1 16:52:13 (pid:21126) Calling HandleReq <attempt_access_handler> (1)
1/1 16:52:13 (pid:21126) Return from HandleReq <attempt_access_handler>
1/1 16:52:13 (pid:21126) Calling HandleReq <attempt_access_handler> (1)
1/1 16:52:13 (pid:21126) Return from HandleReq <attempt_access_handler>
1/1 16:52:13 (pid:21126) Calling HandleReq <attempt_access_handler> (1)
1/1 16:52:13 (pid:21126) Return from HandleReq <attempt_access_handler>
1/1 16:52:14 (pid:21126) Calling HandleReq <attempt_access_handler> (1)
1/1 16:52:14 (pid:21126) warning: setting UserUid to 5081, was 5011 previously
1/1 16:52:14 (pid:21126) Return from HandleReq <attempt_access_handler>
1/1 16:52:14 (pid:21126) Out of servers - 0 jobs matched, 210 jobs idle, 1 jobs
rejected
1/1 16:52:14 (pid:21126) Return from Handler <<Negotiator Command>>
1/1 16:52:14 (pid:21126) DaemonCore: pid 9281 exited with status 0, invoking
reaper 3 <GManagerReaper>
1/1 16:52:14 (pid:21126) condor_gridmanager (PID 9281, owner nvf) exited with
return code 0.
1/1 16:52:14 (pid:21126) warning: setting UserUid to 5008, was 5081 previously
1/1 16:52:14 (pid:21126) DaemonCore: return from reaper for pid 9281
1/1 16:52:15 (pid:21126) Starting add_shadow_birthdate(23202610.0)
1/1 16:52:16 (pid:21126) Started shadow for job 23202610.0 on
"<10.14.2.34:47276>", (shadow pid = 9485)
*** End of file SchedLog

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date of creation: Wed Jan  2  8:38:59 2008 (1199284743)
Subject: Actions

Assigned to nleroy by nleroy
===========================================================================
Date of actions: Wed Jan  2  8:54:43 2008 (1199287952)
From: Nick LeRoy <nleroy__AT__cs.wisc.edu>
To: condor-support__AT__cs.wisc.edu
Subject: Re: [condor-support #2158] LIGO: multiple schedd core dumps     
 possibly related to
Date: Wed, 2 Jan 2008 10:28:18 -0600

Stuart,

> The LIGO Condor pool running,

<snip>

> First crash:
> 12/31 15:37:02 (pid:15872) condor_gridmanager (PID 24686, owner nvf) exited
> with return code 0.
>
> Second crash:
> 1/1 00:45:17 (pid:24850) condor_gridmanager (PID 20592, owner nvf) exited
> with return code 0.
>
> And this the third one:
> 1/1 16:52:14 (pid:21126) condor_gridmanager (PID 9281, owner nvf) exited
> with return code 0.
>
> Perhaps there is a memory corruption occuring during the cleanup of
> gridmanager jobs?
>
> The core file for this 3rd crash may be found at,
> http://www.ligo.caltech.edu/~anderson/condor.17339
>
> Here is the stack trace and Obituary from the 3rd crash:

I don't have anything profound to tell you at this point, other than we're 
looking at this, and we'll get back to you as soon as possible.  Todd is on 
vacation today, but Jaime & I are on this.

In the mean time, would it be possible to run the schedd under valgrind?  I 
know that valgrind slows down the process


-- 
           <<< Why, oh, why, didn't I take the blue pill? >>>
 /`-_    Nicholas R. LeRoy               The Condor Project
{     }/ http://www.cs.wisc.edu/~nleroy  http://www.cs.wisc.edu/condor
 \    /  nleroy__AT__cs.wisc.edu              The University of Wisconsin
 |_*_|   608-265-5761                    Department of Computer Sciences

===========================================================================
Date mail was appended: Wed Jan  2 10:28:28 2008 (1199291308)
From: Nick LeRoy <nleroy__AT__cs.wisc.edu>
To: condor-support__AT__cs.wisc.edu
Subject: Re: [condor-support #2158] LIGO: multiple schedd core dumps     
 possibly related to
Date: Wed, 2 Jan 2008 10:30:03 -0600

Stuart,

Oops, didn't mean to send that reply off yet.

> I don't have anything profound to tell you at this point, other than we're
> looking at this, and we'll get back to you as soon as possible.  Todd is on
> vacation today, but Jaime & I are on this.

In the mean time, would it be possible to run the schedd under valgrind?  I 
know that valgrind slows down the target process, and that your schedd may be 
too busy to be valgrind-ed, but if we can do it, it might help a lot.

Thanks

-Nick

-- 
           <<< Why, oh, why, didn't I take the blue pill? >>>
 /`-_    Nicholas R. LeRoy               The Condor Project
{     }/ http://www.cs.wisc.edu/~nleroy  http://www.cs.wisc.edu/condor
 \    /  nleroy__AT__cs.wisc.edu              The University of Wisconsin
 |_*_|   608-265-5761                    Department of Computer Sciences

===========================================================================
Date mail was appended: Wed Jan  2 10:30:20 2008 (1199291422)
Date: Wed, 2 Jan 2008 11:08:45 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: skoranda__AT__gravity.phys.uwm.edu, nvf__AT__gravity.phys.uwm.edu
Subject: Re: [condor-support #2158] LIGO: multiple schedd core dumps  
 possibly related to
X-Seen-BY: mailfromd 4.1 silica.cs.wisc.edu

On Wed, Jan 02, 2008 at 10:28:28AM -0600, condor-support response tracking system wrote:
> 
> I don't have anything profound to tell you at this point, other than we're 
> looking at this, and we'll get back to you as soon as possible.  Todd is on 
> vacation today, but Jaime & I are on this.
> 
> In the mean time, would it be possible to run the schedd under valgrind?  I 
> know that valgrind slows down the process
> 

There was a 4th and 5th crash last night (with the core file from the 4th
being overwritten by the 5th since they are just named "core"). As an aside,
when using the Google core dump library how about appending the PID to the
core file name?


Both of these dumps also had a schedd log entry in the same second
as the crash indicating:

1/1 23:40:23 (pid:9497) condor_gridmanager (PID 11706, owner nvf) exited with   
return code 0.

1/2 02:54:30 (pid:11918) condor_gridmanager (PID 24545, owner nvf) exited with return code 0.

Here is the stack trace from the 5th crash and the core file (core.5.gz)
may be found at the same URL as the other ones,
http://www.ligo.caltech.edu/~anderson/condor.17339/

(gdb) where
#0  0x000000000058b2e0 in WriteCoreDump ()
#1  0x00000000005787da in linux_sig_coredump ()
#2  <signal handler called>
#3  0x00002b56eb19d9bb in _int_malloc () from /lib64/libc.so.6
#4  0x00002b56eb19f58b in malloc () from /lib64/libc.so.6
#5  0x00002b56ead60425 in operator new () from /usr/lib64/libstdc++.so.5
#6  0x0000000000654aee in ReliSock::accept ()
#7  0x000000000056c83c in DaemonCore::HandleReq ()
#8  0x000000000056c6dc in DaemonCore::HandleReq ()
#9  0x000000000056c175 in DaemonCore::Driver ()
#10 0x000000000057afa4 in main ()


I have now removed the 6 "nvf" jobs that where on hold in the queue
(pegasus kickstart jobs) to see if this stabilizes the schedd before
trying some thing more drastic like valgrind.

An example job classad for one of these nvf jobs is in ticket 17343.
Even if these jobs are a red-herring for the schedd stability problem,
is it a bug or a feature that there are condor_gridmanager pid's exiting
for jobs that are on hold in the queue?

Thanks.


-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Wed Jan  2 13:09:03 2008 (1199300945)
Date: Thu, 3 Jan 2008 13:56:55 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: skoranda__AT__gravity.phys.uwm.edu, nvf__AT__gravity.phys.uwm.edu
Subject: Re: [condor-support #2158] LIGO: multiple schedd core dumps  
 possibly related to
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu

In summary there where 6 crashes in 29 hours but there have been none in the
last 36 hours. The last 28 hours where run without any condor_gridmanager
processes for owner nvf. Those jobs, which where on hold, where removed from
the queue with -forces.  Here are the timestamps of the crashes that show
the possibly related codor_gridmanager log entries regarding these suspect
jobs.

The apparent time coincidence between owner nvf condor_gridmanager jobs
exiting and the schedd crashes (0, 8, 0, 2, 2, 1 seconds) should be compared
to nominal interval of condor_gridmanager being run at once per minute by
the schedd for this user.  The fraction of schedd crashes per
condor_gridmangaer reaping was 1 per 750 occurrences.

At this point the best scenario is that there will be no more crashes
and a code inspection will show a memory leak associated with the schedd
reaping condor_gridmanager processes. However, if we continue to have
crashes we will try running schedd with valgrind, or switch to 7.0.0
if that is released first :)


Thanks.



12/31 09:53:16 (pid:24590) condor_gridmanager (PID 15779, owner nvf) exited with return code 0.
...
12/31 09:53:17 (pid:24590) Shadow pid 15065 for job 23166285.0 exited with status 100
12/31 09:53:42 (pid:15872) ******************************************************
12/31 09:53:42 (pid:15872) ** condor_schedd (CONDOR_SCHEDD) STARTING UP


12/31 15:37:02 (pid:15872) condor_gridmanager (PID 24686, owner nvf) exited with return code 0.
...
12/31 15:37:10 (pid:15872) warning: setting UserUid to 5008, was 5004 previously
12/31 15:37:32 (pid:24850) ******************************************************
12/31 15:37:32 (pid:24850) ** condor_schedd (CONDOR_SCHEDD) STARTING UP


1/1 00:45:17 (pid:24850) condor_gridmanager (PID 20592, owner nvf) exited with return code 0.
...
1/1 00:45:17 (pid:24850) DaemonCore: return from reaper for pid 18059
1/1 00:45:52 (pid:21126) ******************************************************
1/1 00:45:52 (pid:21126) ** condor_schedd (CONDOR_SCHEDD) STARTING UP


1/1 16:52:14 (pid:21126) condor_gridmanager (PID 9281, owner nvf) exited with return code 0.
...
1/1 16:52:16 (pid:21126) Started shadow for job 23202610.0 on "<10.14.2.34:47276>", (shadow pid = 9485)
1/1 16:52:43 (pid:9497) ******************************************************
1/1 16:52:43 (pid:9497) ** condor_schedd (CONDOR_SCHEDD) STARTING UP


1/1 23:40:23 (pid:9497) condor_gridmanager (PID 11706, owner nvf) exited with return code 0.
...
1/1 23:40:25 (pid:9497) Shadow pid 29742 for job 23212029.0 exited with status 100
1/1 23:40:46 (pid:11918) ******************************************************
1/1 23:40:46 (pid:11918) ** condor_schedd (CONDOR_SCHEDD) STARTING UP


1/2 02:54:30 (pid:11918) condor_gridmanager (PID 24545, owner nvf) exited with return code 0.
...
1/2 02:54:31 (pid:11918) Return from HandleReq <HandleChildAliveCommand>
1/2 02:54:51 (pid:24973) ******************************************************
1/2 02:54:51 (pid:24973) ** condor_schedd (CONDOR_SCHEDD) STARTING UP


-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Thu Jan  3 15:57:20 2008 (1199397442)
Subject: Comments added

admin ticket 17339 was a dupe of this ticket.  i resolved 17339.  -zach

Comments added by zmiller

===========================================================================
Date comments were added: Thu Jan 10 16:25:23 2008 (1200003923)
From: Nick LeRoy <nleroy__AT__cs.wisc.edu>
To: condor-support__AT__cs.wisc.edu
Subject: Re: [condor-support #2158] LIGO: multiple schedd core dumps    
 possibly related to
Date: Fri, 18 Jan 2008 13:56:08 -0600
CC: Zach Miller <zmiller__AT__cs.wisc.edu>

On Thu January 3 2008, condor-support response tracking system wrote:
> In summary there where 6 crashes in 29 hours but there have been none in
> the last 36 hours. The last 28 hours where run without any
> condor_gridmanager processes for owner nvf. Those jobs, which where on
> hold, where removed from the queue with -forces.  Here are the timestamps
> of the crashes that show the possibly related codor_gridmanager log entries
> regarding these suspect jobs.
>
> The apparent time coincidence between owner nvf condor_gridmanager jobs
> exiting and the schedd crashes (0, 8, 0, 2, 2, 1 seconds) should be
> compared to nominal interval of condor_gridmanager being run at once per
> minute by the schedd for this user.  The fraction of schedd crashes per
> condor_gridmangaer reaping was 1 per 750 occurrences.
>
> At this point the best scenario is that there will be no more crashes
> and a code inspection will show a memory leak associated with the schedd
> reaping condor_gridmanager processes. However, if we continue to have
> crashes we will try running schedd with valgrind, or switch to 7.0.0
> if that is released first :)

So, what's the status of this?  How are the new binaries working?  Should I 
close this ticket?

Thanks

-Nick

-- 
           <<< Why, oh, why, didn't I take the blue pill? >>>
 /`-_    Nicholas R. LeRoy               The Condor Project
{     }/ http://www.cs.wisc.edu/~nleroy  http://www.cs.wisc.edu/condor
 \    /  nleroy__AT__cs.wisc.edu              The University of Wisconsin
 |_*_|   608-265-5761                    Department of Computer Sciences

===========================================================================
Date mail was appended: Fri Jan 18 14:00:49 2008 (1200686451)
Date: Fri, 18 Jan 2008 12:30:13 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>,
 Todd Tannenbaum <tannenba__AT__cs.wisc.edu>
CC: skoranda__AT__gravity.phys.uwm.edu, nvf__AT__gravity.phys.uwm.edu
Subject: Re: [condor-support #2158] LIGO: multiple schedd core dumps  
 possibly related to
X-Seen-BY: mailfromd 4.1 granite.cs.wisc.edu

On Fri, Jan 18, 2008 at 02:00:49PM -0600, condor-support response tracking system wrote:
> 
> So, what's the status of this?  How are the new binaries working?  Should I 
> close this ticket?
> 

Nick,
	We have been unable to reproduce the circumstances that would have
indicated whether the 7.0 pre-release binaries fixed the problem or not.
We just discussed this in a conference call with Todd, so please check with
him on whether to close this and assume coverity has fixed this, or if you
should pursue this further with in-house testing of the schedd reaping
gridmanager jobs under valgrind. Please recall, the memory corruption (assumed)
only caused the 6.9.5 schedd to crash after running ~750 gridmanager
jobs to checkup on jobs on hold in the queue.

	I would be satisfied with either response given that 7.0 is being
released today, and if the problem re-occurs under 7.0 we will certainly
let you know.

Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Fri Jan 18 14:30:46 2008 (1200688247)
Subject: Actions

Ticket resolved by tannenba
===========================================================================
Date of actions: Fri Feb 29 13:43:19 2008 (1204314199)