LIGO Support Ticket 17268

Ticket Information
  Number:      admin 17268
  User:        anderson@ligo.caltech.edu
  Email:       skoranda__AT__gravity.phys.uwm.edu
  Status:      resolved
  Assigned To: gquinn
Date: Sun, 2 Dec 2007 16:11:06 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
CC: Scott Koranda <skoranda__AT__gravity.phys.uwm.edu>
Subject: LIGO: Condor 6.9.5 master coredump
X-Seen-BY: mailfromd 4.1 granite.cs.wisc.edu

The LIGO Condor pool at Caltech running,
# condor_version
$CondorVersion: 6.9.5 Nov 28 2007 BuildID: 65347 $
$CondorPlatform: X86_64-LINUX_RHEL3 $

had the condor_master core dump on multiple execute machines
with the following stack trace,

(gdb) where
#0  0x00000000004e01e0 in WriteCoreDump ()
#1  0x00000000004cd2ae in linux_sig_coredump ()
#2  <signal handler called>
#3  0x0000003ac1a30845 in abort () from /lib64/libc.so.6
#4  0x0000003ac1a6534f in __libc_message () from /lib64/libc.so.6
#5  0x0000003ac1a6b14f in _int_free () from /lib64/libc.so.6
#6  0x0000003ac1a6b68e in free () from /lib64/libc.so.6
#7  0x000000000070b173 in ProcFamilyClient::unregister_family ()
#8  0x000000000052ca11 in ProcFamilyProxy::unregister_family ()
#9  0x00000000004c9975 in DaemonCore::HandleProcessExit ()
#10 0x00000000004c9542 in DaemonCore::HandleDC_SERVICEWAITPIDS ()
#11 0x00000000004bf83b in DaemonCore::Driver ()
#12 0x00000000004cfa78 in main ()

This happened during repeated attempts to restart the startd daemon
as reported in,
[condor-support #2146] LIGO: 6.9.5 starter assertion error at line 575


Here is the MasterLog,

12/1 11:04:29 ******************************************************
12/1 11:04:29 ** condor_master (CONDOR_MASTER) STARTING UP
12/1 11:04:29 ** /usr/sbin/condor_master
12/1 11:04:29 ** $CondorVersion: 6.9.5 Nov 28 2007 BuildID: 65347 $
12/1 11:04:29 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
12/1 11:04:29 ** PID = 18973
12/1 11:04:29 ** Log last touched 12/1 10:50:06
12/1 11:04:29 ******************************************************
12/1 11:04:29 Using config source: /usr1/condor/condor_config
12/1 11:04:29 Using local config sources: 
12/1 11:04:29    /usr1/condor/condor_config.local
12/1 11:04:29 DaemonCore: Command Socket at <10.14.2.70:59671>
12/1 11:04:29 GID-based process tracking requires use of ProcD; ignoring USE_PROCD setting
12/1 11:04:29 Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 18975
12/1 11:04:29 Started process "/usr/sbin/condor_ckpt_server", pid and pgroup = 18976
12/1 11:04:38 Calling HandleReq <HandleChildAliveCommand> (0)
12/1 11:04:38 Return from HandleReq <HandleChildAliveCommand>
12/1 11:08:40 DaemonCore: pid 18975 exited with status 1024, invoking reaper 1 <Daemons::DefaultReaper()>
12/1 11:08:40 The STARTD (pid 18975) exited with status 4
12/1 11:08:40 Sending obituary for "/usr/sbin/condor_startd"
12/1 11:08:53 restarting /usr/sbin/condor_startd in 10 seconds
12/1 11:08:53 DaemonCore: return from reaper for pid 18975
12/1 11:09:03 Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 19033
12/1 11:09:15 Calling HandleReq <HandleChildAliveCommand> (0)
12/1 11:09:15 Return from HandleReq <HandleChildAliveCommand>
12/1 11:11:17 DaemonCore: pid 19033 exited with status 1024, invoking reaper 1 <Daemons::DefaultReaper()>
12/1 11:11:17 The STARTD (pid 19033) exited with status 4
12/1 11:11:17 Sending obituary for "/usr/sbin/condor_startd"
12/1 11:11:25 restarting /usr/sbin/condor_startd in 11 seconds
12/1 11:11:25 DaemonCore: return from reaper for pid 19033
12/1 11:11:36 Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 19081
12/1 11:11:44 Calling HandleReq <HandleChildAliveCommand> (0)
12/1 11:11:44 Return from HandleReq <HandleChildAliveCommand>
12/1 11:12:04 DaemonCore: pid 19081 exited with status 1024, invoking reaper 1 <Daemons::DefaultReaper()>
12/1 11:12:04 The STARTD (pid 19081) exited with status 4
12/1 11:12:04 Sending obituary for "/usr/sbin/condor_startd"
12/1 11:12:12 restarting /usr/sbin/condor_startd in 13 seconds
12/1 11:12:12 DaemonCore: return from reaper for pid 19081
12/1 11:12:25 Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 19116
12/1 11:12:33 Calling HandleReq <HandleChildAliveCommand> (0)
12/1 11:12:33 Return from HandleReq <HandleChildAliveCommand>
12/1 11:14:13 DaemonCore: pid 19116 exited with status 1024, invoking reaper 1 <Daemons::DefaultReaper()>
12/1 11:14:13 The STARTD (pid 19116) exited with status 4
12/1 11:14:14 Sending obituary for "/usr/sbin/condor_startd"
12/1 11:14:21 restarting /usr/sbin/condor_startd in 17 seconds
12/1 11:14:21 DaemonCore: return from reaper for pid 19116
12/1 11:14:38 Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 19157
12/1 11:14:51 Calling HandleReq <HandleChildAliveCommand> (0)
12/1 11:14:51 Return from HandleReq <HandleChildAliveCommand>
12/1 11:16:42 DaemonCore: pid 19157 exited with status 1024, invoking reaper 1 <Daemons::DefaultReaper()>
12/1 11:16:42 The STARTD (pid 19157) exited with status 4
12/1 11:16:42 restarting /usr/sbin/condor_startd in 25 seconds
12/1 11:16:42 DaemonCore: return from reaper for pid 19157
12/1 11:17:07 Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 19198
12/1 11:17:15 Calling HandleReq <HandleChildAliveCommand> (0)
12/1 11:17:15 Return from HandleReq <HandleChildAliveCommand>
12/1 11:17:59 Calling Handler <DaemonCore::HandleReqSocketHandler>
12/1 11:17:59 DaemonCore: Command received via TCP from host <10.14.0.25:59980>, access level ADMINISTRATOR
12/1 11:17:59 DaemonCore: received command 467 (DAEMON_OFF), calling handler (admin_command_handler)
12/1 11:17:59 Calling HandleReq <admin_command_handler> (0)
12/1 11:17:59 Handling daemon-specific command for "STARTD"
12/1 11:17:59 Sent SIGTERM to STARTD (pid 19198)
12/1 11:17:59 Return from HandleReq <admin_command_handler>
12/1 11:17:59 Return from Handler <DaemonCore::HandleReqSocketHandler>
12/1 11:17:59 DaemonCore: pid 19198 exited with status 1024, invoking reaper 1 <Daemons::DefaultReaper()>
12/1 11:17:59 The STARTD (pid 19198) exited with status 4
12/1 11:17:59 DaemonCore: return from reaper for pid 19198
12/1 11:18:10 Calling Handler <DaemonCore::HandleReqSocketHandler>
12/1 11:18:10 DaemonCore: Command received via TCP from host <10.14.0.25:41515>, access level ADMINISTRATOR
12/1 11:18:10 DaemonCore: received command 467 (DAEMON_OFF), calling handler (admin_command_handler)
12/1 11:18:10 Calling HandleReq <admin_command_handler> (0)
12/1 11:18:10 Handling daemon-specific command for "STARTD"
12/1 11:18:10 Return from HandleReq <admin_command_handler>
12/1 11:18:10 Return from Handler <DaemonCore::HandleReqSocketHandler>
12/1 11:21:47 Calling Handler <DaemonCore::HandleReqSocketHandler>
12/1 11:21:47 DaemonCore: Command received via TCP from host <10.14.0.25:45618>, access level ADMINISTRATOR
12/1 11:21:47 DaemonCore: received command 469 (DAEMON_ON), calling handler (admin_command_handler)
12/1 11:21:47 Calling HandleReq <admin_command_handler> (0)
12/1 11:21:47 Handling daemon-specific command for "STARTD"
12/1 11:21:47 Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 19255
12/1 11:21:47 Return from HandleReq <admin_command_handler>
12/1 11:21:47 Return from Handler <DaemonCore::HandleReqSocketHandler>
12/1 11:21:55 Calling HandleReq <HandleChildAliveCommand> (0)
12/1 11:21:55 Return from HandleReq <HandleChildAliveCommand>
12/1 11:24:13 DaemonCore: pid 19255 exited with status 1024, invoking reaper 1 <Daemons::DefaultReaper()>
12/1 11:24:13 The STARTD (pid 19255) exited with status 4
12/1 11:24:13 restarting /usr/sbin/condor_startd in 41 seconds
12/1 11:24:13 DaemonCore: return from reaper for pid 19255
12/1 11:24:54 Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 19295
12/1 11:25:07 Calling HandleReq <HandleChildAliveCommand> (0)
12/1 11:25:07 Return from HandleReq <HandleChildAliveCommand>
12/1 11:25:29 DaemonCore: pid 19295 exited with status 1024, invoking reaper 1 <Daemons::DefaultReaper()>
12/1 11:25:29 The STARTD (pid 19295) exited with status 4
12/1 11:25:29 restarting /usr/sbin/condor_startd in 73 seconds
12/1 11:25:29 DaemonCore: return from reaper for pid 19295
12/1 11:26:42 Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 19327
12/1 11:26:51 Calling HandleReq <HandleChildAliveCommand> (0)
12/1 11:26:51 Return from HandleReq <HandleChildAliveCommand>
12/1 11:28:28 DaemonCore: pid 19327 exited with status 1024, invoking reaper 1 <Daemons::DefaultReaper()>
12/1 11:28:28 The STARTD (pid 19327) exited with status 4
12/1 11:28:28 restarting /usr/sbin/condor_startd in 137 seconds
12/1 11:28:28 DaemonCore: return from reaper for pid 19327
12/1 11:30:45 Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 19375
12/1 11:30:53 Calling HandleReq <HandleChildAliveCommand> (0)
12/1 11:30:53 Return from HandleReq <HandleChildAliveCommand>
12/1 11:50:23 Calling HandleReq <HandleChildAliveCommand> (0)
12/1 11:50:23 Return from HandleReq <HandleChildAliveCommand>
12/1 11:50:34 DaemonCore: pid 19375 exited with status 1024, invoking reaper 1 <Daemons::DefaultReaper()>
12/1 11:50:34 The STARTD (pid 19375) exited with status 4
12/1 11:50:35 Sending obituary for "/usr/sbin/condor_startd"
12/1 11:50:43 restarting /usr/sbin/condor_startd in 10 seconds
12/1 11:50:43 DaemonCore: return from reaper for pid 19375
12/1 11:50:43 error writing to named pipe: watchdog pipe has closed
12/1 11:50:43 LocalClient: error sending message to server
12/1 11:50:43 ProcFamilyClient: failed to start connection with ProcD

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date of creation: Sun Dec  2 18:11:22 2007 (1196640685)
Date: Sun, 2 Dec 2007 16:16:25 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #17268] LIGO: Condor 6.9.5 master coredump
X-Seen-BY: mailfromd 4.1 silica.cs.wisc.edu

An example condor_master core dump file may be found at,
http://www.ligo.caltech.edu/~anderson/condor.17268

Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Sun Dec  2 18:16:40 2007 (1196641001)
Subject: Actions

Assigned to gquinn by gquinn
===========================================================================
Date of actions: Sun Dec  2 21:28:21 2007 (1196652502)
Subject: Re: [condor-admin #17268] LIGO: Condor 6.9.5 master coredump
From: Greg Quinn <gquinn__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Date: Mon, 03 Dec 2007 11:02:35 -0600

Hi Stuart,

This crash was caused by a silly programming mistake in the code that
the Master uses when interacting with the ProcD. The problem only occurs
when communication with the ProcD fails. I've fixed the problem for
Condor 7.0.0 and could provide you with a patched binary if you'd like.

Regarding these types of errors, we have very recently started using
Coverity's static C/C++ analysis tool on our source code, and are in the
process of cleaning up the problems it detects in preparation for Condor
7.0.0. It did, in fact, find the problem reported in this ticket.
Hopefully, these types of errors become extremely rare in the future!

The more concerning issue this ticket brings to light is that
communication with the ProcD failed to begin with. Is this something you
have seen repeatedly or was it an isolated incident?

Greg


===========================================================================
Date mail was appended: Mon Dec  3 11:02:45 2007 (1196701366)
Subject: Actions

Status changed from open to pending by gquinn
===========================================================================
Date of actions: Mon Dec  3 11:10:56 2007 (1196701856)
Date: Mon, 3 Dec 2007 10:57:14 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: skoranda__AT__gravity.phys.uwm.edu
Subject: Re: [condor-admin #17268] LIGO: Condor 6.9.5 master coredump
X-Seen-BY: mailfromd 4.1 obsidian.cs.wisc.edu

On Mon, Dec 03, 2007 at 11:02:45AM -0600, condor-admin response tracking system wrote:
> Hi Stuart,
> 
> This crash was caused by a silly programming mistake in the code that
> the Master uses when interacting with the ProcD. The problem only occurs
> when communication with the ProcD fails. I've fixed the problem for
> Condor 7.0.0 and could provide you with a patched binary if you'd like.

Not at this point, but I will keep that in mind if we have additional
problems.

> 
> Regarding these types of errors, we have very recently started using
> Coverity's static C/C++ analysis tool on our source code, and are in the
> process of cleaning up the problems it detects in preparation for Condor
> 7.0.0. It did, in fact, find the problem reported in this ticket.
> Hopefully, these types of errors become extremely rare in the future!
> 
> The more concerning issue this ticket brings to light is that
> communication with the ProcD failed to begin with. Is this something you
> have seen repeatedly or was it an isolated incident?

It was happening repeatedly (in the sense of happening on multiple execute
macines) due to support ticket 2146 until we disabled backfill jobs.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Mon Dec  3 12:57:40 2007 (1196708260)
Date: Mon, 14 Jan 2008 20:11:17 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: skoranda__AT__gravity.phys.uwm.edu
Subject: Re: [condor-admin #17268] LIGO: Condor 6.9.5 master coredump
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu

On Mon, Dec 03, 2007 at 11:02:45AM -0600, condor-admin response tracking system wrote:
> Hi Stuart,
> 
> This crash was caused by a silly programming mistake in the code that
> the Master uses when interacting with the ProcD. The problem only occurs
> when communication with the ProcD fails. I've fixed the problem for
> Condor 7.0.0 and could provide you with a patched binary if you'd like.

Greg,
	If you are convinced you have fixed this please go ahead and close
this ticket. We are no longer seeing this problem, even with 6.9.5, since we
have a patched startd and have disabled GID tracking use of procd.

Thanks.


-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Mon Jan 14 22:11:37 2008 (1200370298)
Subject: Re: [condor-admin #17268] LIGO: Condor 6.9.5 master coredump
From: Greg Quinn <gquinn__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Date: Tue, 15 Jan 2008 08:47:02 -0600

Thanks for the update, Stuart. I'll resolve the ticket.

Greg

On Mon, 2008-01-14 at 22:11 -0600, condor-admin response tracking system
wrote:
> On Mon, Dec 03, 2007 at 11:02:45AM -0600, condor-admin response tracking system wrote:
> > Hi Stuart,
> > 
> > This crash was caused by a silly programming mistake in the code that
> > the Master uses when interacting with the ProcD. The problem only occurs
> > when communication with the ProcD fails. I've fixed the problem for
> > Condor 7.0.0 and could provide you with a patched binary if you'd like.
> 
> Greg,
> 	If you are convinced you have fixed this please go ahead and close
> this ticket. We are no longer seeing this problem, even with 6.9.5, since we
> have a patched startd and have disabled GID tracking use of procd.
> 
> Thanks.
> 
> 
> -- 
> Stuart Anderson  anderson__AT__ligo.caltech.edu
> http://www.ligo.caltech.edu/~anderson
> 
> 
> ========================================
> MESSAGE INFORMATION
> ========================================
> * From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
> * Ticket Email List: anderson__AT__ligo.caltech.edu, skoranda__AT__gravity.phys.uwm.edu
> 


===========================================================================
Date mail was appended: Tue Jan 15  8:47:14 2008 (1200408435)
Subject: Actions

Ticket resolved by gquinn
===========================================================================
Date of actions: Tue Jan 15  8:47:26 2008 (1200408446)