LIGO Support Ticket 1984

Ticket Information
  Number:      support 1984
  User:        espinoza@ligo.caltech.edu
  Email:       anderson__AT__ligo.caltech.edu
  Status:      resolved
  Assigned To: psilord
Date: Fri, 11 May 2007 10:43:52 -0700
From: "Erik A. Espinoza" <espinoza__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: Stuart Anderson <anderson__AT__ligo.caltech.edu>
Subject: LIGO - condor_master/condor_startd crash
X-Enigmail-Version: 0.95.0
Openpgp: url=http://pgp.mit.edu/

Greetings,

We had a crash of condor_master. The checkpoint was left orphaned and
running.

Here are the gdb traces and core files:
http://www.ligo.caltech.edu/~eespinoz/debug/05-10-2007/
Pid 2137: condor_master
Pid 2162: condor_startd

Here is a little bit more information:
[root@node222 log]# /ldcg/condor/bin/condor_version
$CondorVersion: 6.8.4 Feb  1 2007 $
$CondorPlatform: X86_64-LINUX_RHEL3 $
[root@node222 log]# cat /etc/fedora-release
Fedora Core release 4 (Stentz)
[root@node222 log]# uname -i
x86_64
[root@node222 log]#

Below is the info from the log:

4/23 08:44:15 ******************************************************
4/23 08:44:15 ** condor_master (CONDOR_MASTER) STARTING UP
4/23 08:44:15 ** /ldcg/stow_pkgs/condor-6.8.4/condor/sbin/condor_master
4/23 08:44:15 ** $CondorVersion: 6.8.4 Feb  1 2007 $
4/23 08:44:15 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
4/23 08:44:15 ** PID = 2137
4/23 08:44:15 ** Log last touched 4/23 08:05:34
4/23 08:44:15 ******************************************************
4/23 08:44:15 Using config source: /usr1/condor/condor_config
4/23 08:44:15 Using local config sources:
4/23 08:44:15    /usr1/condor/condor_config.local
4/23 08:44:15 DaemonCore: Command Socket at <10.14.1.222:40309>
4/23 08:44:17 Started DaemonCore process "/ldcg/condor/sbin/condor_startd", pid and pgroup = 2162
4/23 08:44:35 Started process "/ldcg/condor/sbin/condor_ckpt_server", pid and pgroup = 2481
4/23 09:44:38 Preen pid is 2914
4/23 09:44:38 Child 2914 died, but not a daemon -- Ignored
4/24 09:44:38 Preen pid is 16379
4/24 09:44:38 Child 16379 died, but not a daemon -- Ignored
4/25 09:44:38 Preen pid is 26269
4/25 09:44:38 Child 26269 died, but not a daemon -- Ignored
4/26 09:44:38 Preen pid is 3775
4/26 09:44:38 Child 3775 died, but not a daemon -- Ignored
4/27 09:44:38 Preen pid is 13989
4/27 09:44:38 Child 13989 died, but not a daemon -- Ignored
4/28 09:44:39 Preen pid is 25221
4/28 09:44:39 Child 25221 died, but not a daemon -- Ignored
4/29 09:44:39 Preen pid is 16574
4/29 09:44:39 Child 16574 died, but not a daemon -- Ignored
4/30 09:44:39 Preen pid is 4706
4/30 09:44:39 Child 4706 died, but not a daemon -- Ignored
5/1 09:44:39 Preen pid is 32157
5/1 09:44:39 Child 32157 died, but not a daemon -- Ignored
5/2 09:44:39 Preen pid is 17755
5/2 09:44:39 Child 17755 died, but not a daemon -- Ignored
5/3 09:44:39 Preen pid is 4452
5/3 09:44:39 Child 4452 died, but not a daemon -- Ignored
5/4 09:44:40 Preen pid is 20186
5/4 09:44:40 Child 20186 died, but not a daemon -- Ignored
5/5 09:44:41 Preen pid is 7262
5/5 09:44:41 Child 7262 died, but not a daemon -- Ignored
5/6 09:44:41 Preen pid is 24899
5/6 09:44:41 Child 24899 died, but not a daemon -- Ignored
5/7 09:44:41 Preen pid is 19645
5/7 09:44:41 Child 19645 died, but not a daemon -- Ignored
5/8 09:44:41 Preen pid is 32245
5/8 09:44:42 Child 32245 died, but not a daemon -- Ignored
5/9 09:44:41 Preen pid is 19060
5/9 09:44:41 Child 19060 died, but not a daemon -- Ignored
5/10 09:44:41 Preen pid is 5843
5/10 09:44:41 Child 5843 died, but not a daemon -- Ignored

-- 
Erik A. Espinoza
Systems Administrator
LIGO/Caltech - MS 18-34
Pasadena, CA 91125
Ph: 626-395-8517

===========================================================================
Date of creation: Fri May 11 12:44:18 2007 (1178905461)
Subject: Actions

Assigned to pavlo by pavlo
===========================================================================
Date of actions: Mon May 14  9:59:43 2007 (1179156772)
From: Andy Pavlo <pavlo__AT__cs.wisc.edu>
To: condor-support__AT__cs.wisc.edu
Subject: Re: [condor-support #1984] LIGO - condor_master/condor_startd crash
Date: Mon, 14 May 2007 10:53:44 -0500

Erik,

Our checkpoint server expert (Pete Keller) is out sick today. I have forwarded 
your information to him and we hope to have a response to you by early 
tomorrow. Thanks.
-- 
Andy Pavlo
pavlo__AT__cs.wisc.edu

===========================================================================
Date mail was appended: Mon May 14 10:53:51 2007 (1179158032)
Subject: Actions

Assigned to psilord by pavlo
===========================================================================
Date of actions: Mon May 14 10:54:17 2007 (1179158078)
Date: Tue, 15 May 2007 13:40:39 -0500
From: Peter Keller <psilord__AT__cs.wisc.edu>
To: pavlo <condor-support__AT__cs.wisc.edu>
Subject: Re: [condor-support #1984] LIGO - condor_master/condor_startd crash

Hello,

> From: "Erik A. Espinoza" <espinoza__AT__ligo.caltech.edu>
> 
> Greetings,
> 
> We had a crash of condor_master. The checkpoint was left orphaned and
> running.

Did you mean the checkpoint server was orphaned?

> Here are the gdb traces and core files:
> http://www.ligo.caltech.edu/~eespinoz/debug/05-10-2007/
> Pid 2137: condor_master
> Pid 2162: condor_startd

It looks like there are two problems in this ticket: the master and startd
segfaulting, and the checkpoint server being orphaned.

Are these problems happening all of the time? Rarely? 

I'll start by investigating the crashes, and then later the checkpoint server
unless the latter is causing you more problems.

Thank you.

Condor Admin

===========================================================================
Date mail was appended: Tue May 15 13:40:41 2007 (1179254442)
Date: Tue, 15 May 2007 11:51:34 -0700
From: "Erik A. Espinoza" <espinoza__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu
CC: anderson__AT__ligo.caltech.edu
Subject: Re: [condor-support #1984] LIGO - condor_master/condor_startd crash
X-Enigmail-Version: 0.95.0
Openpgp: url=http://pgp.mit.edu/



condor-support response tracking system wrote:
> Hello,
> 
>> From: "Erik A. Espinoza" <espinoza__AT__ligo.caltech.edu>
>>
>> Greetings,
>>
>> We had a crash of condor_master. The checkpoint was left orphaned and
>> running.
> 
> Did you mean the checkpoint server was orphaned?

Correct!

>> Here are the gdb traces and core files:
>> http://www.ligo.caltech.edu/~eespinoz/debug/05-10-2007/
>> Pid 2137: condor_master
>> Pid 2162: condor_startd
> 
> It looks like there are two problems in this ticket: the master and startd
> segfaulting, and the checkpoint server being orphaned.
> 
> Are these problems happening all of the time? Rarely? 

We've had this happen about 10 times since ~April 23. We hadn't noticed
since our monitoring system had a bug.

> I'll start by investigating the crashes, and then later the checkpoint server
> unless the latter is causing you more problems.

Nope. The segfaults are definitely the more important problems.

Thanks,
-- 
Erik A. Espinoza
Systems Administrator
LIGO/Caltech - MS 18-34
Pasadena, CA 91125
Ph: 626-395-8517

===========================================================================
Date mail was appended: Tue May 15 13:51:59 2007 (1179255119)
Date: Tue, 15 May 2007 15:52:14 -0500
From: Peter Keller <psilord__AT__cs.wisc.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
Subject: Re: [condor-support #1984] LIGO - condor_master/condor_startd crash

On Tue, May 15, 2007 at 01:51:59PM -0500, condor-support response tracking system wrote:
> We've had this happen about 10 times since ~April 23. We hadn't noticed
> since our monitoring system had a bug.

I've gotten what I can out of the core files and must ask, would this path be
on an NFS or AFS server:

/ldcg/condor/sbin/condor_starter

If so, then the server was probably down, which would cause the bus error.

My evidence is this error string I dug out of one of the corefiles:

stat(/ldcg/condor/sbin/condor_starter) failed with unexpected errno 5 (Input/output error)
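
For anyone wanting to repeat this check on an execute node, here is a minimal
sketch in Python, assuming a Linux host. The helper names fs_type_for and
stat_status are made up for illustration and are not part of Condor. The first
reports which mount (and filesystem type) a path lives on, given the text of
/proc/mounts; the second reports whether stat() on the path currently fails
with EIO, the errno 5 seen above.

```python
import errno
import os


def fs_type_for(path, mounts_text):
    """Return (mount_point, fs_type) for the longest mount point containing path.

    mounts_text is the content of /proc/mounts (device, mount point, type, ...).
    """
    best = ("", "unknown")
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        # Longest matching mount point wins; later entries win ties.
        if (path + "/").startswith(prefix) and len(mount_point) >= len(best[0]):
            best = (mount_point, fs_type)
    return best


def stat_status(path):
    """Classify stat(2) on path: 'ok', 'eio', or the symbolic errno name."""
    try:
        os.stat(path)
        return "ok"
    except OSError as exc:
        if exc.errno == errno.EIO:
            return "eio"  # errno 5, Input/output error
        return errno.errorcode.get(exc.errno, str(exc.errno))
```

Calling fs_type_for("/ldcg/condor/sbin/condor_starter", open("/proc/mounts").read())
on the affected node would show whether the path is served over nfs or afs, and
stat_status on the same path would show whether the error is still present.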

Both of the processes died with bus errors, and there are only two
reasons I've ever seen for that: A) a misaligned pointer dereference,
which causes a bus error on some architectures (on others, like x86_64,
the dereference is merely slow), or B) the kernel wanted some pages from
the executable but couldn't get them due to disk failure or the failure
of an NFS server to serve the requested pages.
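
Case B can be reproduced in miniature. The sketch below is an analogy, not the
actual failure: it assumes Linux and Python 3, and it uses a local file that is
truncated after being mapped as a stand-in for an NFS server that stops
delivering pages. A child process maps one page of a file, the backing data is
removed, and the child then touches the page. The name run_demo is made up for
illustration.

```python
import mmap
import signal
import subprocess
import sys
import tempfile

# Child program: map one page of a file, remove the backing data, touch the page.
CHILD_SOURCE = """
import mmap, os, sys
fd = os.open(sys.argv[1], os.O_RDWR)
mapping = mmap.mmap(fd, mmap.PAGESIZE)  # the page is backed by the file
os.ftruncate(fd, 0)                     # the backing store goes away
mapping[0]                              # the kernel cannot supply the page
"""


def run_demo():
    """Run the child and return its exit status.

    A negative status means the child was killed by that signal number.
    """
    with tempfile.NamedTemporaryFile() as backing:
        backing.write(b"x" * mmap.PAGESIZE)
        backing.flush()
        result = subprocess.run(
            [sys.executable, "-c", CHILD_SOURCE, backing.name]
        )
    return result.returncode


if __name__ == "__main__":
    status = run_demo()
    print("child killed by SIGBUS:", status == -signal.SIGBUS)
```

The child is killed by the signal rather than receiving an error code it could
handle, which is the same way a daemon dies when pages of its own executable
cannot be read.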

So, at this point, I'd look to see if the drive that is serving these programs
is unreliable, or if NFS server outages (if applicable) coincide with these
failures.
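
One way to check for such a coincidence is sketched below in Python. It assumes
you can collect two lists of timestamps yourself (daemon crashes and NFS
errors). The function names and the 10-minute window are assumptions for
illustration only. parse_condor_stamp reads the "M/D HH:MM:SS" prefix used in
the master log above, which carries no year, so the year has to be supplied.

```python
from datetime import datetime, timedelta


def parse_condor_stamp(line, year):
    """Parse the leading 'M/D HH:MM:SS' stamp of a Condor log line."""
    date_part, time_part = line.split()[:2]
    return datetime.strptime(f"{year}/{date_part} {time_part}", "%Y/%m/%d %H:%M:%S")


def crashes_near_outages(crash_times, outage_times, window=timedelta(minutes=10)):
    """Return the crashes that happened within `window` after any outage event."""
    outages = sorted(outage_times)
    matched = []
    for crash in sorted(crash_times):
        # A crash matches if some outage precedes it by no more than `window`.
        if any(timedelta(0) <= crash - outage <= window for outage in outages):
            matched.append(crash)
    return matched
```

If most crashes end up in the returned list, the outages are a likely cause. If
few do, the disk serving the binaries deserves a closer look.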

Thank you.

Condor Admin

===========================================================================
Date mail was appended: Tue May 15 15:52:16 2007 (1179262336)
Subject: Actions

Status changed from open to pending by psilord
===========================================================================
Date of actions: Tue May 15 15:52:28 2007 (1179262348)
Date: Sat, 19 May 2007 09:57:36 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: espinoza__AT__ligo.caltech.edu
Subject: Re: [condor-support #1984] LIGO - condor_master/condor_startd crash

Further investigation shows that these problems happen on execute machines
shortly after a burst of NFS mount errors like:

automount[26375]: >> nfs bindresvport: Address already in use

which in turn have been tracked down to the automounter using up all the
reserved TCP ports under Linux (due to a poor Linux implementation) when
rapidly mounting several filesystems (via the automounter in our case).
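
To see how close a node is to running out of reserved ports, the kernel's TCP
table can be inspected. The sketch below is in Python and assumes Linux, where
/proc/net/tcp lists each socket with its local address as hexadecimal IP:PORT.
It treats ports below 1024 as the reserved range. The function name
reserved_ports_in_use is made up for illustration.

```python
def reserved_ports_in_use(proc_net_tcp_text, limit=1024):
    """Return the sorted local TCP ports below `limit` found in the table text."""
    ports = set()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 2 or ":" not in fields[1]:
            continue
        # fields[1] is local_address, formatted as HEXIP:HEXPORT
        port = int(fields[1].rsplit(":", 1)[1], 16)
        if 0 < port < limit:
            ports.add(port)
    return sorted(ports)
```

Running len(reserved_ports_in_use(open("/proc/net/tcp").read())) during a burst
of automounts would show whether the count climbs toward the size of the
reserved range.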

The Condor daemon binaries are not automounted, but there may be a side
effect for currently mounted filesystems when new mount requests use up
all the reserved ports.  While it might be possible to track this down
further and make Condor more robust against this Linux bug, I think this is
a case where we will adjust our usage model instead.  In particular, we are
going to move all our condor binaries to local disk on every machine rather
than NFS mounting them.

Please feel free to close this ticket unless you want to pursue this further.

Thanks.


-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Sat May 19 11:58:02 2007 (1179593884)
Date: Mon, 21 May 2007 20:58:48 -0500
From: Peter Keller <psilord__AT__cs.wisc.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
Subject: Re: [condor-support #1984] LIGO - condor_master/condor_startd crash

Hello,

On Sat, May 19, 2007 at 11:58:02AM -0500, condor-support response tracking system wrote:
> Please feel free to close this ticket unless you want to pursue this further.

I think your solution is perfect and it is what I was going to recommend.

I'm closing this ticket.

Thank you.

Condor Admin

===========================================================================
Date mail was appended: Mon May 21 20:58:50 2007 (1179799131)
Subject: Actions

Ticket resolved by psilord
===========================================================================
Date of actions: Mon May 21 20:59:30 2007 (1179799170)