LIGO Support Ticket 14515
Ticket Information
Number: admin 14515
User: anderson@ligo.caltech.edu
Email: espinoza_e__AT__ligo.caltech.edu
Status: pending
Assigned To: nleroy
Date: Sun, 26 Nov 2006 17:32:39 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
CC: Erik Espinoza <espinoza_e__AT__ligo.caltech.edu>
Subject: LIGO condor_ckpt_server continues to run after condor_master
On one of the LIGO Caltech nodes (node209 FC4/x86_64) running,
# condor_version
$CondorVersion: 6.8.2 Oct 12 2006 $
$CondorPlatform: X86_64-LINUX_RHEL3 $
had an unexpected shutdown of condor_master.
Upon closer inspection this raises 3 questions:
1) Is condor_ckpt_server continuing to run after condor_master exits
considered to be a bug or a feature?
2) Why does condor_startd report calling getProcInfo() for negative (and
otherwise illegal numerical values) during shutdown?
3) Why did condor_master exit without any log messages? If this is not
obvious from the logs below, I suggest we do not put any further
effort into pursuing this unless it happens again.
Additional details:
1) Unlike condor_startd, which recorded the following before cleanly shutting
itself down:
11/25 15:47:44 DaemonCore::IsPidAlive(): kill returned errno 3, assuming pid 10931 is dead.
11/25 15:47:44 Our parent process (pid 10931) went away; shutting down
11/25 15:47:44 Got SIGTERM. Performing graceful shutdown.
11/25 15:47:44 Started timer to call main_shutdown_fast in 1800 seconds
11/25 15:47:44 shutdown graceful
11/25 15:47:44 StartdCronMgr: Shutting down
11/25 15:47:44 CronMgr: Killing all jobs
11/25 15:47:44 CronMgr: 0 jobs alive
The checkpoint server continued to run. Is this failure of condor_ckpt_server
to use IsPidAlive() to check on condor_master a feature or a bug?
2) Continuing further on in the StartLog after this initiation of shutdown
shows some strange calls to getProcInfo with illegal looking PID numbers,
i.e., < 0 and > 32768.
11/25 15:47:44 ProcAPI::buildFamily failed: parent 7023 not found on system.
11/25 15:47:44 ProcAPI::getProcInfo() pid 7023 does not exist.
11/25 15:47:44 ProcAPI::getProcInfo() pid 7023 does not exist.
11/25 15:47:44 ProcAPI::getProcInfo() pid 7023 does not exist.
11/25 15:47:44 ProcAPI::getProcInfo() pid 7023 does not exist.
11/25 15:47:44 ProcAPI::getProcInfo() pid 7023 does not exist.
11/25 15:47:44 ProcAPI::getProcInfo() pid -839702264 does not exist.
11/25 15:47:44 ProcAPI::getProcInfo() pid -839702264 does not exist.
11/25 15:47:44 ProcAPI::getProcInfo() pid -839702264 does not exist.
11/25 15:47:44 ProcAPI::getProcInfo() pid -839702264 does not exist.
11/25 15:47:44 ProcAPI::getProcInfo() pid -839702264 does not exist.
11/25 15:47:44 ProcAPI::getProcInfo() pid 54 does not exist.
11/25 15:47:44 ProcAPI::getProcInfo() pid 54 does not exist.
11/25 15:47:44 ProcAPI::getProcInfo() pid 54 does not exist.
11/25 15:47:44 ProcAPI::getProcInfo() pid 54 does not exist.
11/25 15:47:44 ProcAPI::getProcInfo() pid 54 does not exist.
11/25 15:47:44 ProcAPI::getProcInfo() pid 11473568 does not exist.
11/25 15:47:44 ProcAPI::getProcInfo() pid 11473568 does not exist.
11/25 15:47:44 ProcAPI::getProcInfo() pid 11473568 does not exist.
11/25 15:47:44 ProcAPI::getProcInfo() pid 11473568 does not exist.
11/25 15:47:44 ProcAPI::getProcInfo() pid 11473568 does not exist.
11/25 15:47:44 vm3: State change: starter exited
11/25 15:47:44 Entered vacate_client <10.14.0.12:44107> ldas-grid.ligo.caltech.edu...
11/25 15:47:44 vm3: State change: No preempting claim, returning to owner
A quick check on a few other nodes in this Condor pool shows that about
half of them have instances of negative pids at one time or another, i.e.,
grep getProcInfo StartLog | grep -e -
3) It is not clear what happened to the condor_master (PID = 10931).
There where no system level errors reported at that time, and there is no
core dump image. Here is the end of this section of the MasterLog and
unless there is something obvious here to an expert I doubt we can figure
out if this was an OS failure (e.g., silent Linux OOM killer) or a Condor bug,
so I propose we do not persue this any further unless it happens again:
11/25 15:41:59 Pid 4360 is predicted to be in family of 10932
11/25 15:41:59 Pid 4361 is predicted to be in family of 10932
11/25 15:41:59 ProcFamily: parent: 10932 family: 10932 28631 28632 3823 3824 3825 3826 4289 4290 4291 4292 4357 4359
4360 4361
11/25 15:41:59 ProcFamily: alive_cpu_user = 39277, exited_cpu = 346051, max_image = 2345656k
11/25 15:41:59 ProcAPI::buildFamily() called w/ parent: 10933
11/25 15:41:59 ProcAPI::buildFamily() Found daddypid on the system: 10933
11/25 15:41:59 ProcFamily: parent: 10933 family: 10933
11/25 15:41:59 ProcFamily: alive_cpu_user = 0, exited_cpu = 0, max_image = 15516k
11/25 15:42:59 ProcAPI::buildFamily() called w/ parent: 10932
11/25 15:42:59 ProcAPI::buildFamily() Found daddypid on the system: 10932
11/25 15:42:59 Pid 28631 is in family of 10932
11/25 15:42:59 Pid 28632 is in family of 28631
11/25 15:42:59 Pid 3823 is in family of 10932
11/25 15:42:59 Pid 3824 is predicted to be in family of 10932
11/25 15:42:59 Pid 3825 is predicted to be in family of 10932
11/25 15:42:59 Pid 3826 is predicted to be in family of 10932
11/25 15:42:59 Pid 4289 is in family of 10932
11/25 15:42:59 Pid 4290 is predicted to be in family of 10932
11/25 15:42:59 Pid 4291 is predicted to be in family of 10932
11/25 15:42:59 Pid 4292 is predicted to be in family of 10932
11/25 15:42:59 Pid 4357 is in family of 10932
11/25 15:42:59 Pid 4359 is predicted to be in family of 10932
11/25 15:42:59 Pid 4360 is predicted to be in family of 10932
11/25 15:42:59 Pid 4361 is predicted to be in family of 10932
11/25 15:42:59 ProcFamily: parent: 10932 family: 10932 28631 28632 3823 3824 3825 3826 4289 4290 4291 4292 4357 4359
4360 4361
11/25 15:42:59 ProcFamily: alive_cpu_user = 39513, exited_cpu = 346051, max_image = 2345656k
11/25 15:42:59 ProcAPI::buildFamily() called w/ parent: 10933
11/25 15:42:59 ProcAPI::buildFamily() Found daddypid on the system: 10933
11/25 15:42:59 ProcFamily: parent: 10933 family: 10933
11/25 15:42:59 ProcFamily: alive_cpu_user = 0, exited_cpu = 0, max_image = 15516k
11/25 15:43:59 ProcAPI::buildFamily() called w/ parent: 10932
11/25 15:43:59 ProcAPI::buildFamily() Found daddypid on the system: 10932
11/25 15:43:59 Pid 28631 is in family of 10932
11/25 15:43:59 Pid 28632 is in family of 28631
11/25 15:43:59 Pid 3823 is in family of 10932
11/25 15:43:59 Pid 3824 is predicted to be in family of 10932
11/25 15:43:59 Pid 3825 is predicted to be in family of 10932
11/25 15:43:59 Pid 3826 is predicted to be in family of 10932
11/25 15:43:59 Pid 4289 is in family of 10932
11/25 15:43:59 Pid 4290 is predicted to be in family of 10932
11/25 15:43:59 Pid 4291 is predicted to be in family of 10932
11/25 15:43:59 Pid 4292 is predicted to be in family of 10932
11/25 15:43:59 Pid 4357 is in family of 10932
11/25 15:43:59 Pid 4359 is predicted to be in family of 10932
11/25 15:43:59 Pid 4360 is predicted to be in family of 10932
11/25 15:43:59 Pid 4361 is predicted to be in family of 10932
11/25 15:43:59 ProcFamily: parent: 10932 family: 10932 28631 28632 3823 3824 3825 3826 4289 4290 4291 4292 4357 4359
4360 4361
11/25 15:43:59 ProcFamily: alive_cpu_user = 39744, exited_cpu = 346051, max_image = 2345656k
11/25 15:43:59 ProcAPI::buildFamily() called w/ parent: 10933
11/25 15:43:59 ProcAPI::buildFamily() Found daddypid on the system: 10933
11/25 15:43:59 ProcFamily: parent: 10933 family: 10933
11/25 15:43:59 ProcFamily: alive_cpu_user = 0, exited_cpu = 0, max_image = 15516k
11/25 15:44:59 ProcAPI::buildFamily() called w/ parent: 10932
11/25 15:44:59 ProcAPI::buildFamily() Found daddypid on the system: 10932
11/25 15:44:59 Pid 28631 is in family of 10932
11/25 15:44:59 Pid 28632 is in family of 28631
11/25 15:44:59 Pid 3823 is in family of 10932
11/25 15:44:59 Pid 3824 is predicted to be in family of 10932
11/25 15:44:59 Pid 3825 is predicted to be in family of 10932
11/25 15:44:59 Pid 3826 is predicted to be in family of 10932
11/25 15:44:59 Pid 4289 is in family of 10932
11/25 15:44:59 Pid 4290 is predicted to be in family of 10932
11/25 15:44:59 Pid 4291 is predicted to be in family of 10932
11/25 15:44:59 Pid 4292 is predicted to be in family of 10932
11/25 15:44:59 Pid 4357 is in family of 10932
11/25 15:44:59 Pid 4359 is predicted to be in family of 10932
11/25 15:44:59 Pid 4360 is predicted to be in family of 10932
11/25 15:44:59 Pid 4361 is predicted to be in family of 10932
11/25 15:44:59 ProcFamily: parent: 10932 family: 10932 28631 28632 3823 3824 3825 3826 4289 4290 4291 4292 4357 4359
4360 4361
11/25 15:44:59 ProcFamily: alive_cpu_user = 39977, exited_cpu = 346051, max_image = 2345656k
11/25 15:44:59 ProcAPI::buildFamily() called w/ parent: 10933
11/25 15:44:59 ProcAPI::buildFamily() Found daddypid on the system: 10933
11/25 15:44:59 ProcFamily: parent: 10933 family: 10933
11/25 15:44:59 ProcFamily: alive_cpu_user = 0, exited_cpu = 0, max_image = 15516k
11/25 15:45:29 Getting monitoring info for pid 10931
11/25 15:45:33 enter Daemons::UpdateCollector
11/25 15:45:33 Trying to update collector <10.14.0.12:9618>
11/25 15:45:33 Attempting to send update via UDP to collector ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
11/25 15:45:33 exit Daemons::UpdateCollector
11/25 15:45:37 enter Daemons::CheckForNewExecutable
11/25 15:45:37 Time stamp of running /ldcg/condor/sbin/condor_master: 1160676896
11/25 15:45:37 GetTimeStamp returned: 1160676896
11/25 15:45:37 Time stamp of running /ldcg/condor/sbin/condor_startd: 1161825357
11/25 15:45:37 GetTimeStamp returned: 1161825357
11/25 15:45:37 Time stamp of running /ldcg/condor/sbin/condor_ckpt_server: 1160676867
11/25 15:45:37 GetTimeStamp returned: 1160676867
11/25 15:45:37 exit Daemons::CheckForNewExecutable
11/25 15:45:59 ProcAPI::buildFamily() called w/ parent: 10932
11/25 15:45:59 ProcAPI::buildFamily() Found daddypid on the system: 10932
11/25 15:45:59 Pid 28631 is in family of 10932
11/25 15:45:59 Pid 28632 is in family of 28631
11/25 15:45:59 Pid 4289 is in family of 10932
11/25 15:45:59 Pid 4290 is predicted to be in family of 10932
11/25 15:45:59 Pid 4291 is predicted to be in family of 10932
11/25 15:45:59 Pid 4292 is predicted to be in family of 10932
11/25 15:45:59 Pid 4357 is in family of 10932
11/25 15:45:59 Pid 4359 is predicted to be in family of 10932
11/25 15:45:59 Pid 4360 is predicted to be in family of 10932
11/25 15:45:59 Pid 4361 is predicted to be in family of 10932
11/25 15:45:59 Pid 6997 is in family of 10932
11/25 15:45:59 Pid 6998 is predicted to be in family of 10932
11/25 15:45:59 Pid 6999 is predicted to be in family of 10932
11/25 15:45:59 Pid 7000 is predicted to be in family of 10932
11/25 15:45:59 Pid 7001 is predicted to be in family of 10932
11/25 15:45:59 Pid 7002 is predicted to be in family of 10932
11/25 15:45:59 ProcAPI::getProcInfo() pid 3823 does not exist.
11/25 15:45:59 ProcAPI::getProcInfo() pid 3823 does not exist.
11/25 15:45:59 ProcAPI::getProcInfo() pid 3823 does not exist.
11/25 15:45:59 ProcAPI::getProcInfo() pid 3823 does not exist.
11/25 15:45:59 ProcAPI::getProcInfo() pid 3823 does not exist.
11/25 15:45:59 ProcAPI::getProcInfo() pid 3824 does not exist.
11/25 15:45:59 ProcAPI::getProcInfo() pid 3824 does not exist.
11/25 15:45:59 ProcAPI::getProcInfo() pid 3824 does not exist.
11/25 15:45:59 ProcAPI::getProcInfo() pid 3824 does not exist.
11/25 15:45:59 ProcAPI::getProcInfo() pid 3824 does not exist.
11/25 15:45:59 ProcAPI::getProcInfo() pid 3825 does not exist.
11/25 15:45:59 ProcAPI::getProcInfo() pid 3825 does not exist.
11/25 15:45:59 ProcAPI::getProcInfo() pid 3825 does not exist.
11/25 15:45:59 ProcAPI::getProcInfo() pid 3825 does not exist.
11/25 15:45:59 ProcAPI::getProcInfo() pid 3825 does not exist.
11/25 15:45:59 ProcAPI::getProcInfo() pid 3826 does not exist.
11/25 15:45:59 ProcAPI::getProcInfo() pid 3826 does not exist.
11/25 15:45:59 ProcAPI::getProcInfo() pid 3826 does not exist.
11/25 15:45:59 ProcAPI::getProcInfo() pid 3826 does not exist.
11/25 15:45:59 ProcAPI::getProcInfo() pid 3826 does not exist.
11/25 15:45:59 ProcFamily: parent: 10932 family: 10932 28631 28632 4289 4290 4291 4292 4357 4359 4360 4361 6997 6998 699
9 7000 7001 7002
11/25 15:45:59 ProcFamily: alive_cpu_user = 33487, exited_cpu = 352730, max_image = 2345656k
11/25 15:45:59 ProcAPI::buildFamily() called w/ parent: 10933
11/25 15:45:59 ProcAPI::buildFamily() Found daddypid on the system: 10933
11/25 15:45:59 ProcFamily: parent: 10933 family: 10933
11/25 15:45:59 ProcFamily: alive_cpu_user = 0, exited_cpu = 0, max_image = 15516k
11/25 21:48:44 ******************************************************
11/25 21:48:44 ** condor_master (CONDOR_MASTER) STARTING UP
11/25 21:48:44 ** /ldcg/stow_pkgs/condor-6.8.2/condor/sbin/condor_master
11/25 21:48:44 ** $CondorVersion: 6.8.2 Oct 12 2006 $
11/25 21:48:44 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
11/25 21:48:44 ** PID = 11059
11/25 21:48:44 ** Log last touched 11/25 15:45:59
11/25 21:48:44 ******************************************************
Thanks.
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date of creation: Sun Nov 26 19:33:05 2006 (1164591187)
Subject: Actions
Assigned to nleroy by nleroy
===========================================================================
Date of actions: Mon Nov 27 9:54:31 2006 (1164659318)
From: Nick LeRoy <nleroy__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #14515] LIGO condor_ckpt_server continues to
run after condor_master
Date: Mon, 27 Nov 2006 14:44:02 -0600
> On one of the LIGO Caltech nodes (node209 FC4/x86_64) running,
> # condor_version
> $CondorVersion: 6.8.2 Oct 12 2006 $
> $CondorPlatform: X86_64-LINUX_RHEL3 $
> had an unexpected shutdown of condor_master.
>
> Upon closer inspection this raises 3 questions:
>
> 1) Is condor_ckpt_server continuing to run after condor_master exits
> considered to be a bug or a feature?
Ugh. The checkpoint server is based on really old code, code that predated
DaemonCore. Thus, it doesn't get all the goodness of DaemonCore, including
the "watch my parent and let me know if it goes away" logic.
> 2) Why does condor_startd report calling getProcInfo() for negative (and
> otherwise illegal numerical values) during shutdown?
This is most likely a race condition with the starter getting shutdown
(because the startd sent it a SIGTERM), but the startd doesn't know about it
because is hasn't its reapers haven't run yet. I don't think that it's
anything to worry about.
> 3) Why did condor_master exit without any log messages? If this is not
> obvious from the logs below, I suggest we do not put any further
> effort into pursuing this unless it happens again.
I honestly don't know... As you've observed, there's precious little
information to look at.
>
> Additional details:
>
> The checkpoint server continued to run. Is this failure of
> condor_ckpt_server to use IsPidAlive() to check on condor_master a feature
> or a bug?
This is a bug, but would require rewriting the checkpoint server to solve
correctly. Is it proving to be a big problem for you?
> 3) It is not clear what happened to the condor_master (PID = 10931).
> There where no system level errors reported at that time, and there is no
> core dump image. Here is the end of this section of the MasterLog and
> unless there is something obvious here to an expert I doubt we can figure
> out if this was an OS failure (e.g., silent Linux OOM killer) or a Condor
> bug, so I propose we do not persue this any further unless it happens
> again:
I'd agree. If you see it happen again, let us know. We could possibly have
you attach to it with gdb to see where it's going.
Thanks & good luck
-Nick
--
<<< The Matrix is everywhere. >>>
/`-_ Nicholas R. LeRoy The Condor Project
{ }/ http://www.cs.wisc.edu/~nleroy http://www.cs.wisc.edu/condor
\ / nleroy__AT__cs.wisc.edu The University of Wisconsin
|_*_| 608-265-5761 Department of Computer Sciences
===========================================================================
Date mail was appended: Mon Nov 27 14:44:09 2006 (1164660250)
Subject: Actions
Status changed from open to pending by nleroy
===========================================================================
Date of actions: Mon Nov 27 14:44:09 2006 (1164660251)
Date: Mon, 27 Nov 2006 13:53:43 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: espinoza_e__AT__ligo.caltech.edu
Subject: Re: [condor-admin #14515] LIGO condor_ckpt_server continues to
run after condor_master
On Mon, Nov 27, 2006 at 02:44:09PM -0600, condor-admin response tracking system wrote:
>
> > On one of the LIGO Caltech nodes (node209 FC4/x86_64) running,
> > # condor_version
> > $CondorVersion: 6.8.2 Oct 12 2006 $
> > $CondorPlatform: X86_64-LINUX_RHEL3 $
> > had an unexpected shutdown of condor_master.
> >
> > Upon closer inspection this raises 3 questions:
> >
> > 1) Is condor_ckpt_server continuing to run after condor_master exits
> > considered to be a bug or a feature?
>
> Ugh. The checkpoint server is based on really old code, code that predated
> DaemonCore. Thus, it doesn't get all the goodness of DaemonCore, including
> the "watch my parent and let me know if it goes away" logic.
OK, so this is a bug. However, fixing this is not a priority for LIGO since
we are currently running Condor on dedicated resources. I leave it to you
to decide what priority should be given to Condor properly cleaning up on
shared/leased resources.
>
> > 2) Why does condor_startd report calling getProcInfo() for negative (and
> > otherwise illegal numerical values) during shutdown?
>
> This is most likely a race condition with the starter getting shutdown
> (because the startd sent it a SIGTERM), but the startd doesn't know about it
> because is hasn't its reapers haven't run yet. I don't think that it's
> anything to worry about.
The idea that Condor is using invalid/random state information to make
decisions does not instill a lot of confidence. What happens if during
shutdown it picks a random non-condor PID that does exist, i.e., are
you sure the side effects of this are benign?
>
> > 3) Why did condor_master exit without any log messages? If this is not
> > obvious from the logs below, I suggest we do not put any further
> > effort into pursuing this unless it happens again.
>
> I honestly don't know... As you've observed, there's precious little
> information to look at.
OK, please consider this part closed as plausibly unrelated to Condor.
Thanks.
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date mail was appended: Mon Nov 27 15:54:18 2006 (1164664458)
From: Nick LeRoy <nleroy__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #14515] LIGO condor_ckpt_server continues to
run after condor_master
Date: Tue, 28 Nov 2006 10:55:38 -0600
> > > 1) Is condor_ckpt_server continuing to run after condor_master exits
> > > considered to be a bug or a feature?
> >
> > Ugh. The checkpoint server is based on really old code, code that
> > predated DaemonCore. Thus, it doesn't get all the goodness of
> > DaemonCore, including the "watch my parent and let me know if it goes
> > away" logic.
>
> OK, so this is a bug. However, fixing this is not a priority for LIGO since
> we are currently running Condor on dedicated resources. I leave it to you
> to decide what priority should be given to Condor properly cleaning up on
> shared/leased resources.
OK.
> > > 2) Why does condor_startd report calling getProcInfo() for negative
> > > (and otherwise illegal numerical values) during shutdown?
> >
> > This is most likely a race condition with the starter getting shutdown
> > (because the startd sent it a SIGTERM), but the startd doesn't know about
> > it because is hasn't its reapers haven't run yet. I don't think that
> > it's anything to worry about.
>
> The idea that Condor is using invalid/random state information to make
> decisions does not instill a lot of confidence. What happens if during
> shutdown it picks a random non-condor PID that does exist, i.e., are
> you sure the side effects of this are benign?
We'll look into this further... I'll let you know what I learn.
> > > 3) Why did condor_master exit without any log messages? If this is not
> > > obvious from the logs below, I suggest we do not put any further
> > > effort into pursuing this unless it happens again.
> >
> > I honestly don't know... As you've observed, there's precious little
> > information to look at.
>
> OK, please consider this part closed as plausibly unrelated to Condor.
OK, thanks.
-Nick
--
<<< The Matrix is everywhere. >>>
/`-_ Nicholas R. LeRoy The Condor Project
{ }/ http://www.cs.wisc.edu/~nleroy http://www.cs.wisc.edu/condor
\ / nleroy__AT__cs.wisc.edu The University of Wisconsin
|_*_| 608-265-5761 Department of Computer Sciences
===========================================================================
Date mail was appended: Tue Nov 28 10:55:46 2006 (1164732947)
Subject: Actions
Status changed from open to pending by nleroy
===========================================================================
Date of actions: Tue Nov 28 10:55:46 2006 (1164732948)