LIGO Support Ticket 1769

Ticket Information
  Number:      support 1769
  User:        anderson@ligo.caltech.edu
  Email:       espinoza_e__AT__ligo.caltech.edu
  Status:      resolved
  Assigned To: psilord
Date: Fri, 1 Dec 2006 11:50:28 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu
Subject: LIGO 6.8.2 ckpt_server core dump

On the LIGO Caltech Condor pool running,
# condor_version
$CondorVersion: 6.8.2 Oct 12 2006 $
$CondorPlatform: X86_64-LINUX_RHEL3 $
# cat /etc/redhat-release 
Fedora Core release 4 (Stentz)
# uname -a
Linux node329 2.6.18.3-CIT #1 SMP Sun Nov 26 13:16:15 PST 2006 x86_64 x86_64 x86_64 GNU/Linux

the checkpoint server reproducibly core dumps on fast shutdown when
condor_master sends it a SIGQUIT. Here is the stack trace:

(gdb) where
#0  0x00000033fabc2633 in __select_nocancel () from /lib64/libc.so.6
#1  0x00000000006959a8 in __wrap_select ()
#2  0x000000000048a7be in Server::Execute ()
#3  0x000000000048f331 in main ()


Here is the interesting part of MasterLog:

12/1 11:36:39 ProcFamily: parent: 11464 family: 11464
12/1 11:36:39 ProcFamily: alive_cpu_user = 0, exited_cpu = 0, max_image = 15508k
12/1 11:37:14 Getting monitoring info for pid 11462
12/1 11:37:39 ProcAPI::buildFamily() called w/ parent: 11464
12/1 11:37:39 ProcAPI::buildFamily() Found daddypid on the system: 11464
12/1 11:37:39 ProcFamily: parent: 11464 family: 11464
12/1 11:37:39 ProcFamily: alive_cpu_user = 0, exited_cpu = 0, max_image = 15508k
12/1 11:38:23 Got SIGQUIT.  Performing fast shutdown.
12/1 11:38:23 NumberOfChildren() returning 1
12/1 11:38:23 Sent SIGQUIT to CKPT_SERVER (pid 11464)
12/1 11:38:23 DaemonCore: No more children processes to reap.
12/1 11:38:23 The CKPT_SERVER (pid 11464) died due to signal 3
12/1 11:38:23 Entering ProcFamily::hardkill
12/1 11:38:23 ProcAPI::buildFamily() called w/ parent: 11464
12/1 11:38:23 ProcAPI::buildFamily failed: parent 11464 not found on system.
12/1 11:38:23 ProcFamily::takesnapshot: getPidFamily(11464) failed. Could not find the pid or any family members.
12/1 11:38:23 ProcAPI::getProcInfo() pid 11464 does not exist.
12/1 11:38:23 ProcAPI::getProcInfo() pid 11464 does not exist.
12/1 11:38:23 ProcAPI::getProcInfo() pid 11464 does not exist.
12/1 11:38:23 ProcAPI::getProcInfo() pid 11464 does not exist.
12/1 11:38:23 ProcAPI::getProcInfo() pid 11464 does not exist.
12/1 11:38:23 ProcFamily: parent: 11464 family:
12/1 11:38:23 ProcFamily: alive_cpu_user = 0, exited_cpu = 0, max_image = 15508k
12/1 11:38:23 Deleted ProcFamily w/ pid 11464 as parent
12/1 11:38:23 NumberOfChildren() returning 0
12/1 11:38:23 All daemons are gone.  Exiting.
12/1 11:38:23 **** condor_master (condor_MASTER) EXITING WITH STATUS 0
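
For context on the MasterLog line "died due to signal 3": on Linux, signal 3 is SIGQUIT, and its default action is to terminate the process and write a core file. A process that never installs a SIGQUIT handler will therefore die this way even though nothing has crashed, which is consistent with the stack trace showing the server simply blocked in select(). The thread does not state the actual root cause, so the following is only a minimal Python sketch of that default behavior. The child script and the helper names are hypothetical stand-ins, not Condor code.

```python
import resource
import signal
import subprocess
import sys

# Hypothetical stand-in for a daemon blocked in select(); not Condor code.
CHILD_SOURCE = (
    "import select, sys\n"
    "print('ready', flush=True)\n"
    "select.select([sys.stdin], [], [])\n"
)

def _child_setup():
    # Runs in the child before exec: force the default SIGQUIT action and
    # disable core files so the demonstration leaves nothing on disk.
    signal.signal(signal.SIGQUIT, signal.SIG_DFL)
    resource.setrlimit(resource.RLIMIT_CORE, (0, 0))

def quit_blocked_child():
    """Start a child that blocks in select(), send it SIGQUIT, return its exit status."""
    child = subprocess.Popen(
        [sys.executable, "-c", CHILD_SOURCE],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        preexec_fn=_child_setup,
    )
    child.stdout.readline()            # wait until the child is up and about to block
    child.send_signal(signal.SIGQUIT)  # same signal the master sends on fast shutdown
    status = child.wait()              # a negative value -N means "killed by signal N"
    child.stdin.close()
    child.stdout.close()
    return status
```

Calling quit_blocked_child() returns the negated signal number, which is the same condition the master reports as "died due to signal 3".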


The last entry in the CkptServerLog is from 11:36 when I manually checkpointed
the previous jobs to prove this was a reproducible problem, i.e.,


12/1 11:35:54 [reqid: 81] File successfully received
12/1 11:35:54 RECV transferred 340161567 bytes in 25 seconds (13606462 bytes / sec)
12/1 11:35:54 [reqid: 82] File successfully received
12/1 11:35:54 RECV transferred 352281631 bytes in 25 seconds (14091265 bytes / sec)
12/1 11:35:54 ERROR from waitpid(): 10 (No child processes)
12/1 11:36:23 ----------------------------------------------------
12/1 11:36:23 [reqid: 88] Receiving SERVICE request from 10.14.0.12
12/1 11:36:23     Using descriptor 8 to handle request
12/1 11:36:23     Service: SERVICE_EXIST
12/1 11:36:23     Owner:   anandss__AT__ldas-grid.ligo.caltech.edu
12/1 11:36:23     File:    cluster9506534.proc0.subproc0
12/1 11:36:23     Checking existance of file: ./10.14.0.12/anandss__AT__ldas-grid.ligo.caltech.edu/cluster9506534.proc0.subproc0
12/1 11:36:23     Service request successfully completed
12/1 11:36:29 ----------------------------------------------------
12/1 11:36:29 [reqid: 89] Receiving RESTORE request from 10.14.0.12
12/1 11:36:29     Using descriptor 8 to handle request
12/1 11:36:29     Owner:     anandss__AT__ldas-grid.ligo.caltech.edu
12/1 11:36:29     File name: cluster9506534.proc0.subproc0
12/1 11:36:29     Request to restore checkpoint file GRANTED
12/1 11:36:29     Transmitting checkpoint file: ./10.14.0.12/anandss__AT__ldas-grid.ligo.caltech.edu/cluster9506534.proc0.subproc0
12/1 11:36:32 [reqid: 89] File successfully transmitted
12/1 11:36:32 SEND transferred 345060383 bytes in 3 seconds (115020127 bytes / sec)
12/1 11:36:32 ERROR from waitpid(): 10 (No child processes)
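
The transfer rates in the CkptServerLog excerpt above can be checked with integer arithmetic. The short Python sketch below copies the byte counts, durations, and logged rates from the three RECV/SEND lines; the function and variable names are made up for illustration.

```python
# (direction, bytes transferred, seconds, rate printed in the log),
# copied from the CkptServerLog lines above.
TRANSFERS = [
    ("RECV", 340161567, 25, 13606462),
    ("RECV", 352281631, 25, 14091265),
    ("SEND", 345060383, 3, 115020127),
]

def bytes_per_second(nbytes, seconds):
    # Integer division truncates, which matches all three logged values;
    # rounding to nearest would give different results for the first and third.
    return nbytes // seconds

for direction, nbytes, seconds, logged in TRANSFERS:
    rate = bytes_per_second(nbytes, seconds)
    print(direction, rate, "matches log" if rate == logged else "differs from log")
```

All three values match the log when the quotient is truncated. Because the durations are logged in whole seconds, the rates are coarse, especially for the 3-second SEND.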


-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date of creation: Fri Dec  1 13:51:03 2006 (1165002668)
Date: Fri, 1 Dec 2006 12:10:02 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu
CC: Erik Espinoza <espinoza_e__AT__ligo.caltech.edu>
Subject: Re: [condor-support #1769] LIGO 6.8.2 ckpt_server core dump

An example core file may be found at,

http://www.ligo.caltech.edu/~anderson/condor.1769/core.11464

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Fri Dec  1 14:10:23 2006 (1165003823)
Subject: Actions

Assigned to psilord by jfrey
===========================================================================
Date of actions: Mon Dec  4 14:09:29 2006 (1165262969)
Date: Mon, 4 Dec 2006 15:51:12 -0600
From: Peter Keller <psilord__AT__cs.wisc.edu>
To: jfrey <condor-support__AT__cs.wisc.edu>
Subject: Re: [condor-support #1769] LIGO 6.8.2 ckpt_server core dump

Hello,

> From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
> 
> On the LIGO Caltech Condor pool running,
> # condor_version
> $CondorVersion: 6.8.2 Oct 12 2006 $
> $CondorPlatform: X86_64-LINUX_RHEL3 $
> # cat /etc/redhat-release 
> Fedora Core release 4 (Stentz)
> # uname -a
> Linux node329 2.6.18.3-CIT #1 SMP Sun Nov 26 13:16:15 PST 2006 x86_64 x86_64 x86_64 GNU/Linux
> 
> the checkpoint server reproducibly core dumps on fast shutdown when
> condor_master sends it a SIGQUIT. Here is the stack trace:
> 
> (gdb) where
> #0  0x00000033fabc2633 in __select_nocancel () from /lib64/libc.so.6
> #1  0x00000000006959a8 in __wrap_select ()
> #2  0x000000000048a7be in Server::Execute ()
> #3  0x000000000048f331 in main ()

Ok, I'll take a look at it and see if I can reproduce it here.

Thank you.

-pete

===========================================================================
Date mail was appended: Mon Dec  4 15:51:14 2006 (1165269075)
Date: Tue, 5 Dec 2006 15:27:00 -0600
From: Peter Keller <psilord__AT__cs.wisc.edu>
To: jfrey <condor-support__AT__cs.wisc.edu>
Subject: Re: [condor-support #1769] LIGO 6.8.2 ckpt_server core dump

Hello,

> From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
> 
> the checkpoint server reproducibly core dumps on fast shutdown when
> condor_master sends it a SIGQUIT. Here is the stack trace:

Ok, I've reproduced the behavior and fixed it. I'll let you know whether
I can get the fix into 6.8.3.

Thank you.

Condor Admin

===========================================================================
Date mail was appended: Tue Dec  5 15:27:05 2006 (1165354026)
Date: Tue, 5 Dec 2006 16:01:28 -0600
From: Peter Keller <psilord__AT__cs.wisc.edu>
To: jfrey <condor-support__AT__cs.wisc.edu>
Subject: Re: [condor-support #1769] LIGO 6.8.2 ckpt_server core dump

Hello,

I was able to get this fix into 6.8.3. The checkpoint server should now
shut down properly when condor_off -master or condor_off -fast -master is
used.

Thank you.

Condor Admin
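
The thread does not say how the fix was implemented. As a general illustration of how a select()-based server can exit cleanly on SIGQUIT rather than taking the signal's default action, here is a minimal Python sketch. It is not the Condor change; the function name and the "shutdown" return value are assumptions chosen for the example.

```python
import os
import select
import signal

def serve_until_quit():
    """Block in select() until SIGQUIT arrives, then return instead of dying."""
    wake_r, wake_w = os.pipe()
    os.set_blocking(wake_w, False)  # set_wakeup_fd requires a non-blocking fd
    # A handler must be installed; otherwise SIGQUIT keeps its default action
    # (terminate and dump core).
    old_handler = signal.signal(signal.SIGQUIT, lambda signum, frame: None)
    # The signal number is written to wake_w as one byte whenever a signal
    # arrives, which makes wake_r readable and wakes up select().
    old_wakeup = signal.set_wakeup_fd(wake_w)
    try:
        while True:
            # A real server would also list its request descriptors here
            # and service whichever ones select() reports as ready.
            ready, _, _ = select.select([wake_r], [], [])
            if wake_r in ready:
                signum = os.read(wake_r, 1)[0]
                if signum == signal.SIGQUIT:
                    return "shutdown"
    finally:
        signal.set_wakeup_fd(old_wakeup)
        # signal.signal() returns None if the previous handler was not
        # installed from Python; fall back to the default in that case.
        signal.signal(signal.SIGQUIT,
                      old_handler if old_handler is not None else signal.SIG_DFL)
        os.close(wake_r)
        os.close(wake_w)
```

The function has to be called from the main thread, since Python only allows signal handlers to be installed there. Sending the process a SIGQUIT while it is blocked makes it return normally, so the caller can flush state and exit with a regular status.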

===========================================================================
Date mail was appended: Tue Dec  5 16:01:32 2006 (1165356093)
Subject: Actions

Ticket resolved by psilord
===========================================================================
Date of actions: Tue Dec  5 16:14:35 2006 (1165356876)
Subject: Actions

Ticket was reopened by mailnull
===========================================================================
Date of actions: Tue Dec  5 16:15:07 2006 (1165356908)
Date: Tue, 5 Dec 2006 14:14:44 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: espinoza_e__AT__ligo.caltech.edu
Subject: Re: [condor-support #1769] LIGO 6.8.2 ckpt_server core dump

On Tue, Dec 05, 2006 at 04:01:32PM -0600, condor-support response tracking system wrote:
> Hello,
> 
> I was able to get this fix into 6.8.3. The checkpoint server should now
> shut down properly when condor_off -master or condor_off -fast -master is
> used.

Thanks Pete.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Tue Dec  5 16:15:07 2006 (1165356908)
Subject: Actions

Ticket resolved by psilord
===========================================================================
Date of actions: Tue Dec  5 16:34:09 2006 (1165358049)