LIGO Support Ticket 17266
Ticket Information
Number: admin 17266
User: anderson@ligo.caltech.edu
Email: skoranda__AT__gravity.phys.uwm.edu
Status: open
Assigned To: danb
Date: Sat, 1 Dec 2007 17:16:26 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
CC: Scott Koranda <skoranda__AT__gravity.phys.uwm.edu>
Subject: LIGO: checkpoint server failure to start
X-Seen-BY: mailfromd 4.1 silica.cs.wisc.edu
The LIGO Caltech condor pool running,
# condor_version
$CondorVersion: 6.9.5 Nov 28 2007 BuildID: 65347 $
$CondorPlatform: X86_64-LINUX_RHEL3 $
is unable to reliably start the condor checkpoint server on execute
machines due to the previous problem reported in,
[condor-admin #14515] LIGO condor_ckpt_server continues to run after condor_master
This new ticket is to describe the compounded problem that if an old
orphaned checkpoint server is running then condor_master will fail
to start a new one or connect to the old one, and this will prevent
any other condor daemons from being started.
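One plausible reading of this failure, assuming the orphaned server still holds the checkpoint server's listening port, is an ordinary address-in-use conflict. The Python sketch below reproduces that situation with two plain TCP sockets. The helper name `try_bind` and the loopback address are illustrative only, and none of this is Condor code.

```python
import errno
import socket


def try_bind(port, host="127.0.0.1"):
    """Try to bind and listen on host:port.

    Returns (socket, None) on success or (None, errno_value) on failure.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen(1)
        return sock, None
    except OSError as exc:
        sock.close()
        return None, exc.errno


def demo_conflict():
    """Show a second bind failing while a first socket holds the port."""
    # Stand-in for the orphaned server: port 0 lets the kernel pick a free port.
    holder, _ = try_bind(0)
    port = holder.getsockname()[1]
    # Stand-in for the freshly started server: same port, so bind fails.
    newcomer, err = try_bind(port)
    holder.close()
    return newcomer, err


if __name__ == "__main__":
    newcomer, err = demo_conflict()
    print("second bind failed:", newcomer is None, errno.errorcode.get(err))
```

If this is the mechanism, every restart attempt will fail in the same way until the process holding the port exits.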
For example, from CkptServerLog,
12/1 11:25:37 ******************************************************
12/1 11:25:37 CKPT_SERVER running in directory /usr1/condor/checkpoint
12/1 11:25:37 ERROR: I_bind() returned an error (#28)
12/1 11:25:49 ******************************************************
12/1 11:25:49 ** condor_ckpt_server (CONDOR_CKPT_SERVER) STARTING UP
12/1 11:25:49 ** $CondorVersion: 6.9.5 Nov 28 2007 BuildID: 65347 $
12/1 11:25:49 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
12/1 11:25:49 ** PID = 4218
12/1 11:25:49 ******************************************************
12/1 11:25:49 CKPT_SERVER running in directory /usr1/condor/checkpoint
12/1 11:25:49 ERROR: I_bind() returned an error (#28)
12/1 11:26:00 ******************************************************
12/1 11:26:00 ** condor_ckpt_server (CONDOR_CKPT_SERVER) STARTING UP
12/1 11:26:00 ** $CondorVersion: 6.9.5 Nov 28 2007 BuildID: 65347 $
12/1 11:26:00 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
12/1 11:26:00 ** PID = 4228
12/1 11:26:00 ******************************************************
12/1 11:26:00 CKPT_SERVER running in directory /usr1/condor/checkpoint
12/1 11:26:00 ERROR: I_bind() returned an error (#28)
12/1 11:26:13 ******************************************************
and the corresponding MasterLog,
12/1 11:25:37 ******************************************************
12/1 11:25:37 ** condor_master (CONDOR_MASTER) STARTING UP
12/1 11:25:37 ** /usr/sbin/condor_master
12/1 11:25:37 ** $CondorVersion: 6.9.5 Nov 28 2007 BuildID: 65347 $
12/1 11:25:37 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
12/1 11:25:37 ** PID = 4196
12/1 11:25:37 ** Log last touched 12/1 11:09:32
12/1 11:25:37 ******************************************************
12/1 11:25:37 Using config source: /usr1/condor/condor_config
12/1 11:25:37 Using local config sources:
12/1 11:25:37 /usr1/condor/condor_config.local
12/1 11:25:37 DaemonCore: Command Socket at <10.14.1.248:47216>
12/1 11:25:37 GID-based process tracking requires use of ProcD; ignoring USE_PROCD setting
12/1 11:25:37 Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 4198
12/1 11:25:37 Started process "/usr/sbin/condor_ckpt_server", pid and pgroup = 4199
12/1 11:25:37 DaemonCore: pid 4199 exited with status 7168, invoking reaper 1 <Daemons::DefaultReaper()>
12/1 11:25:37 The CKPT_SERVER (pid 4199) exited with status 28
12/1 11:25:37 Sending obituary for "/usr/sbin/condor_ckpt_server"
12/1 11:25:39 restarting /usr/sbin/condor_ckpt_server in 10 seconds
12/1 11:25:39 DaemonCore: return from reaper for pid 4199
12/1 11:25:46 Calling HandleReq <HandleChildAliveCommand> (0)
12/1 11:25:46 Return from HandleReq <HandleChildAliveCommand>
12/1 11:25:49 Started process "/usr/sbin/condor_ckpt_server", pid and pgroup = 4218
12/1 11:25:49 DaemonCore: pid 4218 exited with status 7168, invoking reaper 1 <Daemons::DefaultReaper()>
12/1 11:25:49 The CKPT_SERVER (pid 4218) exited with status 28
12/1 11:25:49 Sending obituary for "/usr/sbin/condor_ckpt_server"
12/1 11:25:49 restarting /usr/sbin/condor_ckpt_server in 11 seconds
12/1 11:25:49 DaemonCore: return from reaper for pid 4218
12/1 11:26:00 Started process "/usr/sbin/condor_ckpt_server", pid and pgroup = 4228
12/1 11:26:00 DaemonCore: pid 4228 exited with status 7168, invoking reaper 1 <Daemons::DefaultReaper()>
12/1 11:26:00 The CKPT_SERVER (pid 4228) exited with status 28
12/1 11:26:00 Sending obituary for "/usr/sbin/condor_ckpt_server"
12/1 11:26:00 restarting /usr/sbin/condor_ckpt_server in 13 seconds
12/1 11:26:00 DaemonCore: return from reaper for pid 4228
12/1 11:26:13 Started process "/usr/sbin/condor_ckpt_server", pid and pgroup = 4237
12/1 11:26:13 DaemonCore: pid 4237 exited with status 7168, invoking reaper 1 <Daemons::DefaultReaper()>
12/1 11:26:13 The CKPT_SERVER (pid 4237) exited with status 28
12/1 11:26:14 Sending obituary for "/usr/sbin/condor_ckpt_server"
12/1 11:26:14 restarting /usr/sbin/condor_ckpt_server in 17 seconds
12/1 11:26:14 DaemonCore: return from reaper for pid 4237
ad infinitum.
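The two status values logged for each failure are consistent with each other: 7168 is 28 shifted left by eight bits, which is where a conventional POSIX wait status keeps the exit code. A minimal Python sketch of that decoding, assuming the usual Linux status layout (the function name is illustrative):

```python
def decode_wait_status(status):
    """Split a raw wait() status into (exit_code, signal_number).

    Assumes the conventional Linux layout: the low 7 bits hold the
    terminating signal (0 if the process exited normally) and bits
    8-15 hold the exit code.
    """
    signal_number = status & 0x7F
    exit_code = (status >> 8) & 0xFF
    return exit_code, signal_number


if __name__ == "__main__":
    # Raw status taken from the "exited with status 7168" log lines.
    print(decode_wait_status(7168))
```

Read this way, the checkpoint server exited on its own with code 28 rather than being killed by a signal, which matches the error number in the CkptServerLog.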
If the checkpoint server is not enhanced to use DaemonCore for reliable
shutdown (the old ticket) then please consider adding functionality for
the checkpoint server and/or condor_master to kill any old orphaned
servers for more reliable startup.
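As a rough illustration of what detecting an old orphaned server could look like on Linux, the sketch below scans `/proc` for processes with a given command name whose parent is PID 1. Treating re-parenting to PID 1 as the sign of an orphan is an assumption, all function names are hypothetical, and this is not how condor_master works.

```python
import os

# Linux truncates the command name in /proc/<pid>/stat to 15 characters,
# so "condor_ckpt_server" appears there as "condor_ckpt_ser".
COMM_MAX = 15


def parse_stat(text):
    """Parse the contents of /proc/<pid>/stat into (pid, ppid, comm)."""
    left = text.index("(")
    right = text.rindex(")")  # comm itself may contain parentheses
    pid = int(text[:left])
    comm = text[left + 1:right]
    fields = text[right + 1:].split()
    ppid = int(fields[1])  # fields[0] is the state, fields[1] the parent PID
    return pid, ppid, comm


def find_orphans(processes, name):
    """Return PIDs named `name` whose parent is PID 1."""
    short = name[:COMM_MAX]
    return [pid for pid, ppid, comm in processes
            if ppid == 1 and comm == short]


def read_process_table(proc_root="/proc"):
    """Read (pid, ppid, comm) for every process visible under proc_root."""
    table = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        path = os.path.join(proc_root, entry, "stat")
        try:
            with open(path, errors="replace") as handle:
                table.append(parse_stat(handle.read()))
        except OSError:
            continue  # the process exited while we were scanning
    return table
```

On a live Linux host, `find_orphans(read_process_table(), "condor_ckpt_server")` would list candidate PIDs. Whether and how to signal them is a separate decision.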
Thanks.
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date of creation: Sun Dec 2 0:16:34 2007 (1196576203)
Subject: Actions
Assigned to danb by danb
===========================================================================
Date of actions: Tue Dec 4 16:50:40 2007 (1196808641)
Date: Tue, 04 Dec 2007 16:54:09 -0600
From: Dan Bradley <danb__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #17266] LIGO: checkpoint server failure to start
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu
Hi,
Can you please confirm that you really mean it when you say "and this
will prevent any other condor daemons from being started."
I can understand the other parts of your report, but I am surprised to
hear that failure to start up the checkpoint server would prevent all
other daemons from being started as well. If that is the case, then we
had better understand why that is happening!
--Dan
===========================================================================
Date mail was appended: Tue Dec 4 16:54:12 2007 (1196808853)
Date: Tue, 4 Dec 2007 15:44:40 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: skoranda__AT__gravity.phys.uwm.edu, Todd Tannenbaum <tannenba__AT__cs.wisc.edu>
Subject: Re: [condor-admin #17266] LIGO: checkpoint server failure to start
X-Seen-BY: mailfromd 4.1 silica.cs.wisc.edu
On Tue, Dec 04, 2007 at 04:54:12PM -0600, condor-admin response tracking system wrote:
>
>
>
> Hi,
>
> Can you please confirm that you really mean it when you say "and this
> will prevent any other condor daemons from being started."
>
> I can understand the other parts of your report, but I am surprised to
> hear that failure to start up the checkpoint server would prevent all
> other daemons from being started as well. If that is the case, then we
> better understand why that is happening!
>
Dan,
I tried to split the several 6.9.5 problems into separate logical problem
reports, but that may have left this particular point convolved between
multiple tickets. Here is the global picture of what I remember from trying
to get 6.9.5 running:
* condor_startd crashes on startup (support 1750)
* condor_master core dumps after a few startd crashes (admin 17268)
* condor_ckpt_server is left orphaned (admin 14515)
* subsequent condor_master restart gets stuck trying to restart ckpt server
(this ticket: admin 17266) after a workaround for 1750 is in place.
Put another way, "preventing any other condor daemons from being started"
may have been mistakenly assigned to 17266 when it is just 1750. However,
how would condor_master behave if it successfully started the startd and
then got stuck in the infinite loop of trying to start the ckpt server, e.g.,
what happens if it then needed to stop or restart the startd at a later time:
would it process that in parallel with the ckpt server start failures,
or would such a request get blocked in a FIFO?
Thanks.
P.S. I CC'd Todd in case he wants to see a quick summary of how I think
these tickets relate to each other.
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date mail was appended: Tue Dec 4 17:44:57 2007 (1196811898)
Date: Tue, 04 Dec 2007 18:42:13 -0600
From: Dan Bradley <danb__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #17266] LIGO: checkpoint server failure to start
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu
Thanks for the summary.
> how would condor_master behave if it successfully started the startd and
> then got stuck in the infinite loop of trying to start the ckpt server
The operations of starting up daemons, detection of daemon failures, and
restarting daemons are handled asynchronously (with incremental backoff
in the restart), so there should be no case for a tight loop focussed on
restarting one particular daemon. Indeed, I cannot reproduce any such
problem in simple tests.
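The restart delays in the MasterLog earlier in this ticket (10, 11, 13, and 17 seconds) fit the pattern 9 + 2^n, where n counts the previous restarts. A minimal Python sketch of such a backoff follows. The constants are inferred from that log, and the ceiling value and function name are chosen only for illustration.

```python
def restart_delay(restarts, constant=9, factor=2.0, ceiling=3600):
    """Seconds to wait before the next restart attempt.

    `restarts` is the number of restarts already attempted. The defaults
    for `constant` and `factor` are inferred from the logged delays; the
    `ceiling` is an illustrative cap, not a value taken from the log.
    """
    return int(min(constant + factor ** restarts, ceiling))


if __name__ == "__main__":
    # First four delays, for comparison with the "restarting ... in N seconds" lines.
    print([restart_delay(n) for n in range(4)])
```

With growth like this the gap between attempts widens quickly, so a permanently failing daemon is retried less and less often rather than in a tight loop.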
Until anyone finds evidence to the contrary, I'll assume there is no
vicious side-effect of a failing checkpoint server causing other daemons
to fail to be started.
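To make the "asynchronous" point concrete, here is a toy timer-driven event loop in Python. It is only a model of the behaviour described above, not condor_master's implementation. The event names and the fixed 10-second retry delay are hypothetical.

```python
import heapq


def run_event_loop(initial_events, until, retry_delay=10):
    """Handle (time, name) events in time order and return the handled list.

    A failed "start_ckpt_server" attempt does not block the loop: it only
    schedules another attempt `retry_delay` seconds later, so unrelated
    events that fall due in between are handled on time.
    """
    queue = list(initial_events)
    heapq.heapify(queue)
    handled = []
    while queue and queue[0][0] <= until:
        now, name = heapq.heappop(queue)
        handled.append((now, name))
        if name == "start_ckpt_server":
            # In this model the start attempt always fails; queue a retry.
            heapq.heappush(queue, (now + retry_delay, "start_ckpt_server"))
    return handled


if __name__ == "__main__":
    events = [(0, "start_ckpt_server"), (15, "restart_startd")]
    for when, what in run_event_loop(events, until=35):
        print(when, what)
```

In this model the startd request scheduled for t=15 is handled at t=15, between two failed checkpoint server attempts, rather than queuing behind them.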
Thanks again for all the help in debugging!
--Dan
===========================================================================
Date mail was appended: Tue Dec 4 18:42:14 2007 (1196815334)