LIGO Support Ticket 1769
Ticket Information
Number: support 1769
User: anderson@ligo.caltech.edu
Email: espinoza_e__AT__ligo.caltech.edu
Status: resolved
Assigned To: psilord
Date: Fri, 1 Dec 2006 11:50:28 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu
Subject: LIGO 6.8.2 ckpt_server core dump
On the LIGO Caltech Condor pool running,
# condor_version
$CondorVersion: 6.8.2 Oct 12 2006 $
$CondorPlatform: X86_64-LINUX_RHEL3 $
# cat /etc/redhat-release
Fedora Core release 4 (Stentz)
# uname -a
Linux node329 2.6.18.3-CIT #1 SMP Sun Nov 26 13:16:15 PST 2006 x86_64 x86_64 x86_64 GNU/Linux
the checkpoint server reproducibly core dumps on fast shutdown when
condor_master sends it a SIGQUIT. Here is the stack trace:
(gdb) where
#0 0x00000033fabc2633 in __select_nocancel () from /lib64/libc.so.6
#1 0x00000000006959a8 in __wrap_select ()
#2 0x000000000048a7be in Server::Execute ()
#3 0x000000000048f331 in main ()
Here is the interesting part of MasterLog:
12/1 11:36:39 ProcFamily: parent: 11464 family: 11464
12/1 11:36:39 ProcFamily: alive_cpu_user = 0, exited_cpu = 0, max_image = 15508k
12/1 11:37:14 Getting monitoring info for pid 11462
12/1 11:37:39 ProcAPI::buildFamily() called w/ parent: 11464
12/1 11:37:39 ProcAPI::buildFamily() Found daddypid on the system: 11464
12/1 11:37:39 ProcFamily: parent: 11464 family: 11464
12/1 11:37:39 ProcFamily: alive_cpu_user = 0, exited_cpu = 0, max_image = 15508k
12/1 11:38:23 Got SIGQUIT. Performing fast shutdown.
12/1 11:38:23 NumberOfChildren() returning 1
12/1 11:38:23 Sent SIGQUIT to CKPT_SERVER (pid 11464)
12/1 11:38:23 DaemonCore: No more children processes to reap.
12/1 11:38:23 The CKPT_SERVER (pid 11464) died due to signal 3
12/1 11:38:23 Entering ProcFamily::hardkill
12/1 11:38:23 ProcAPI::buildFamily() called w/ parent: 11464
12/1 11:38:23 ProcAPI::buildFamily failed: parent 11464 not found on system.
12/1 11:38:23 ProcFamily::takesnapshot: getPidFamily(11464) failed. Could not find the pid or any family members.
12/1 11:38:23 ProcAPI::getProcInfo() pid 11464 does not exist.
12/1 11:38:23 ProcAPI::getProcInfo() pid 11464 does not exist.
12/1 11:38:23 ProcAPI::getProcInfo() pid 11464 does not exist.
12/1 11:38:23 ProcAPI::getProcInfo() pid 11464 does not exist.
12/1 11:38:23 ProcAPI::getProcInfo() pid 11464 does not exist.
12/1 11:38:23 ProcFamily: parent: 11464 family:
12/1 11:38:23 ProcFamily: alive_cpu_user = 0, exited_cpu = 0, max_image = 15508k
12/1 11:38:23 Deleted ProcFamily w/ pid 11464 as parent
12/1 11:38:23 NumberOfChildren() returning 0
12/1 11:38:23 All daemons are gone. Exiting.
12/1 11:38:23 **** condor_master (condor_MASTER) EXITING WITH STATUS 0
The last entry in the CkptServerLog is from 11:36 when I manually checkpointed
the previous jobs to prove this was a reproducible problem, i.e.,
12/1 11:35:54 [reqid: 81] File successfully received
12/1 11:35:54 RECV transferred 340161567 bytes in 25 seconds (13606462 bytes / sec)
12/1 11:35:54 [reqid: 82] File successfully received
12/1 11:35:54 RECV transferred 352281631 bytes in 25 seconds (14091265 bytes / sec)
12/1 11:35:54 ERROR from waitpid(): 10 (No child processes)
12/1 11:36:23 ----------------------------------------------------
12/1 11:36:23 [reqid: 88] Receiving SERVICE request from 10.14.0.12
12/1 11:36:23 Using descriptor 8 to handle request
12/1 11:36:23 Service: SERVICE_EXIST
12/1 11:36:23 Owner: anandss__AT__ldas-grid.ligo.caltech.edu
12/1 11:36:23 File: cluster9506534.proc0.subproc0
12/1 11:36:23 Checking existance of file: ./10.14.0.12/anandss__AT__ldas-grid.ligo.caltech.edu/cluster9506534.proc0.subproc0
12/1 11:36:23 Service request successfully completed
12/1 11:36:29 ----------------------------------------------------
12/1 11:36:29 [reqid: 89] Receiving RESTORE request from 10.14.0.12
12/1 11:36:29 Using descriptor 8 to handle request
12/1 11:36:29 Owner: anandss__AT__ldas-grid.ligo.caltech.edu
12/1 11:36:29 File name: cluster9506534.proc0.subproc0
12/1 11:36:29 Request to restore checkpoint file GRANTED
12/1 11:36:29 Transmitting checkpoint file: ./10.14.0.12/anandss__AT__ldas-grid.ligo.caltech.edu/cluster9506534.proc0.subproc0
12/1 11:36:32 [reqid: 89] File successfully transmitted
12/1 11:36:32 SEND transferred 345060383 bytes in 3 seconds (115020127 bytes / sec)
12/1 11:36:32 ERROR from waitpid(): 10 (No child processes)
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
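
Note on the MasterLog excerpt above: the line "died due to signal 3" reports a raw signal number. On Linux, signal 3 is SIGQUIT, whose default action is to terminate the process and write a core file, which is consistent with the core dump reported here. The following is a minimal sketch, not part of Condor, for decoding such a line; the helper name explain_death and the regular expression are assumptions made for illustration, based only on the log format shown in this ticket.

```python
import re
import signal

# Pattern assumed from the single MasterLog line quoted in this ticket;
# other Condor versions may word the message differently.
DEATH_RE = re.compile(r"The (\w+) \(pid (\d+)\) died due to signal (\d+)")


def explain_death(line):
    """Return (daemon, pid, signal_name) for a 'died due to signal' line.

    Returns None when the line does not match the assumed format.
    """
    match = DEATH_RE.search(line)
    if match is None:
        return None
    daemon = match.group(1)
    pid = int(match.group(2))
    signum = int(match.group(3))
    # signal.Signals maps the platform's signal numbers to symbolic names.
    return daemon, pid, signal.Signals(signum).name


if __name__ == "__main__":
    log_line = "12/1 11:38:23 The CKPT_SERVER (pid 11464) died due to signal 3"
    print(explain_death(log_line))
```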
===========================================================================
Date of creation: Fri Dec 1 13:51:03 2006 (1165002668)
Date: Fri, 1 Dec 2006 12:10:02 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu
CC: Erik Espinoza <espinoza_e__AT__ligo.caltech.edu>
Subject: Re: [condor-support #1769] LIGO 6.8.2 ckpt_server core dump
An example core file may be found at,
http://www.ligo.caltech.edu/~anderson/condor.1769/core.11464
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date mail was appended: Fri Dec 1 14:10:23 2006 (1165003823)
Subject: Actions
Assigned to psilord by jfrey
===========================================================================
Date of actions: Mon Dec 4 14:09:29 2006 (1165262969)
Date: Mon, 4 Dec 2006 15:51:12 -0600
From: Peter Keller <psilord__AT__cs.wisc.edu>
To: jfrey <condor-support__AT__cs.wisc.edu>
Subject: Re: [condor-support #1769] LIGO 6.8.2 ckpt_server core dump
Hello,
> From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
>
> On the LIGO Caltech Condor pool running,
> # condor_version
> $CondorVersion: 6.8.2 Oct 12 2006 $
> $CondorPlatform: X86_64-LINUX_RHEL3 $
> # cat /etc/redhat-release
> Fedora Core release 4 (Stentz)
> # uname -a
> Linux node329 2.6.18.3-CIT #1 SMP Sun Nov 26 13:16:15 PST 2006 x86_64 x86_64 x86_64 GNU/Linux
>
> the checkpoint server reproducibly core dumps on fast shutdown when
> condor_master sends it a SIGQUIT. Here is the stack trace:
>
> (gdb) where
> #0 0x00000033fabc2633 in __select_nocancel () from /lib64/libc.so.6
> #1 0x00000000006959a8 in __wrap_select ()
> #2 0x000000000048a7be in Server::Execute ()
> #3 0x000000000048f331 in main ()
Ok, I'll take a look at it and see if I can reproduce it here.
Thank you.
-pete
===========================================================================
Date mail was appended: Mon Dec 4 15:51:14 2006 (1165269075)
Date: Tue, 5 Dec 2006 15:27:00 -0600
From: Peter Keller <psilord__AT__cs.wisc.edu>
To: jfrey <condor-support__AT__cs.wisc.edu>
Subject: Re: [condor-support #1769] LIGO 6.8.2 ckpt_server core dump
Hello,
> From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
>
> the checkpoint server reproducibly core dumps on fast shutdown when
> condor_master sends it a SIGQUIT. Here is the stack trace:
Ok, I've reproduced the behavior and fixed it. I'll let you know whether I am
able to get the fix into 6.8.3.
Thank you.
Condor Admin
===========================================================================
Date mail was appended: Tue Dec 5 15:27:05 2006 (1165354026)
Date: Tue, 5 Dec 2006 16:01:28 -0600
From: Peter Keller <psilord__AT__cs.wisc.edu>
To: jfrey <condor-support__AT__cs.wisc.edu>
Subject: Re: [condor-support #1769] LIGO 6.8.2 ckpt_server core dump
Hello,
I was able to get this fix into 6.8.3. The checkpoint server should now
shut down properly when condor_off -master or condor_off -fast -master is
used.
Thank you.
Condor Admin
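
Note: this ticket does not describe the actual code change made for 6.8.3. As a general illustration of the technique only, a server that blocks in select() can avoid the default SIGQUIT action (terminate and dump core) by installing a handler that wakes the loop so it can exit normally. The sketch below is in Python rather than the checkpoint server's own language, and the names run_server and on_sigquit are hypothetical. It uses a self-pipe so the handler can wake the blocked select() call; a real server would also include its listening sockets in the read set.

```python
import os
import select
import signal


def run_server(max_wait=5.0):
    """Block in select() until SIGQUIT arrives, then shut down cleanly.

    Must be called from the main thread, because Python only allows
    signal handlers to be installed there.
    """
    # Self-pipe: the handler writes one byte so the blocked select() wakes.
    read_fd, write_fd = os.pipe()
    os.set_blocking(write_fd, False)

    def on_sigquit(signum, frame):
        # Hypothetical handler: request shutdown instead of dumping core.
        os.write(write_fd, b"q")

    previous = signal.signal(signal.SIGQUIT, on_sigquit)
    try:
        ready, _, _ = select.select([read_fd], [], [], max_wait)
        if read_fd in ready:
            os.read(read_fd, 1)  # drain the wake-up byte
            return "clean shutdown"
        return "timed out"
    finally:
        # Restore the earlier disposition and release the pipe.
        signal.signal(signal.SIGQUIT, previous)
        os.close(read_fd)
        os.close(write_fd)


if __name__ == "__main__":
    print("pid", os.getpid(), "waiting for SIGQUIT (kill -QUIT <pid>)")
    print(run_server(max_wait=30.0))
```

With a handler like this installed, a SIGQUIT from the parent process leads to an ordinary exit path rather than a core file.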
===========================================================================
Date mail was appended: Tue Dec 5 16:01:32 2006 (1165356093)
Subject: Actions
Ticket resolved by psilord
===========================================================================
Date of actions: Tue Dec 5 16:14:35 2006 (1165356876)
Subject: Actions
Ticket was reopened by mailnull
===========================================================================
Date of actions: Tue Dec 5 16:15:07 2006 (1165356908)
Date: Tue, 5 Dec 2006 14:14:44 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: espinoza_e__AT__ligo.caltech.edu
Subject: Re: [condor-support #1769] LIGO 6.8.2 ckpt_server core dump
On Tue, Dec 05, 2006 at 04:01:32PM -0600, condor-support response tracking system wrote:
> Hello,
>
> I was able to get this fix into 6.8.3. The checkpoint server should now
> shut down properly when condor_off -master or condor_off -fast -master is
> used.
Thanks Pete.
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date mail was appended: Tue Dec 5 16:15:07 2006 (1165356908)
Subject: Actions
Ticket resolved by psilord
===========================================================================
Date of actions: Tue Dec 5 16:34:09 2006 (1165358049)