LIGO Support Ticket 17283
Ticket Information
Number: admin 17283
User: anderson@ligo.caltech.edu
Email: skoranda__AT__gravity.phys.uwm.edu
Status: open
Assigned To: gquinn
Date: Thu, 6 Dec 2007 16:31:42 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
CC: Scott Koranda <skoranda__AT__gravity.phys.uwm.edu>
Subject: LIGO: condor_procd failure
X-Seen-BY: mailfromd 4.1 obsidian.cs.wisc.edu
The LIGO Condor pool at Caltech running,
# condor_version
$CondorVersion: 6.9.5 Nov 28 2007 BuildID: 65347 $
$CondorPlatform: X86_64-LINUX_RHEL3 $
somehow lost its condor_procd process on a dedicated execute machine (node174)
and this resulted in the condor_master, startd, and 2 out of 4 starters
terminating with errors like the following MasterLog entry:
12/6 14:17:18 Return from HandleReq <HandleChildAliveCommand>
12/6 14:36:48 Calling HandleReq <HandleChildAliveCommand> (0)
12/6 14:36:48 Return from HandleReq <HandleChildAliveCommand>
12/6 14:56:18 Calling HandleReq <HandleChildAliveCommand> (0)
12/6 14:56:18 Return from HandleReq <HandleChildAliveCommand>
12/6 14:59:06 DaemonCore: pid 9290 exited with status 8, invoking reaper 2 <condor_procd reaper>
12/6 14:59:06 procd (pid = 9290) exited unexpectedly with status 8
12/6 14:59:06 ERROR "ProcD has failed" at line 590 in file proc_family_proxy.C
12/6 14:59:06 error writing to named pipe: watchdog pipe has closed
12/6 14:59:06 LocalClient: error sending message to server
12/6 14:59:06 ProcFamilyClient: failed to start connection with ProcD
12/6 14:59:06 kill_family: ProcD communication error
12/6 14:59:06 ERROR "ProcD has failed" at line 590 in file proc_family_proxy.C
This also left one starter to terminate with,
12/6 14:59:06 ********* STARTER terminating normally **********
one Vanilla universe job running without an associated condor_starter
process, a stranded check point server, and after rebooting the machine for a
for a completely clean restart, several unused procd_pipe* files.
Thanks.
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date of creation: Thu Dec 6 18:32:03 2007 (1196987528)
Subject: Actions
Assigned to gquinn by gquinn
===========================================================================
Date of actions: Fri Dec 7 10:30:48 2007 (1197045049)
Subject: Re: [condor-admin #17283] LIGO: condor_procd failure
From: Greg Quinn <gquinn__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Date: Fri, 07 Dec 2007 11:49:23 -0600
> The LIGO Condor pool at Caltech running,
>
> # condor_version
> $CondorVersion: 6.9.5 Nov 28 2007 BuildID: 65347 $
> $CondorPlatform: X86_64-LINUX_RHEL3 $
Hi Stuart,
If I am not mistaken, the exit status of 8 indicates a floating point
exception. I can think of two possibilities: divide by zero or some such
when the ProcD is aggregating process usage data or memory error that
causing the ProcD to execute random memory locations that look to the
CPU like FPU instructions.
> somehow lost its condor_procd process on a dedicated execute machine (node174)
> and this resulted in the condor_master, startd, and 2 out of 4 starters
> terminating with errors like the following MasterLog entry:
>
> 12/6 14:17:18 Return from HandleReq <HandleChildAliveCommand>
> 12/6 14:36:48 Calling HandleReq <HandleChildAliveCommand> (0)
> 12/6 14:36:48 Return from HandleReq <HandleChildAliveCommand>
> 12/6 14:56:18 Calling HandleReq <HandleChildAliveCommand> (0)
> 12/6 14:56:18 Return from HandleReq <HandleChildAliveCommand>
> 12/6 14:59:06 DaemonCore: pid 9290 exited with status 8, invoking reaper 2 <condor_procd reaper>
> 12/6 14:59:06 procd (pid = 9290) exited unexpectedly with status 8
> 12/6 14:59:06 ERROR "ProcD has failed" at line 590 in file proc_family_proxy.C
> 12/6 14:59:06 error writing to named pipe: watchdog pipe has closed
> 12/6 14:59:06 LocalClient: error sending message to server
> 12/6 14:59:06 ProcFamilyClient: failed to start connection with ProcD
> 12/6 14:59:06 kill_family: ProcD communication error
> 12/6 14:59:06 ERROR "ProcD has failed" at line 590 in file proc_family_proxy.C
Could you package up the logs from the other daemons as well? I'd like
to have as much info surrounding what the ProcD may have been doing as I
can get. Unfortunately, debugging problems with the ProcD is something
that we _need_ to make easier for ourselves. As you may or may not know,
the ProcD started as part of the privilege separation effort, and as
such was designed not to contain the loads and loads of Condor library
code that the rest of Condor links against. This is because the ProcD is
one of the two components that have root privilege in the PrivSep model,
and we wanted to keep it as small as possible. A side effect of this is
that it does not have the relatively mature logging system that the rest
of Condor has. You can enable logging for the ProcD with PROCD_LOG in
your condor_config, but the logs do not rotate and thus grow without
bound. More robust logging for the ProcD is definitely on the radar.
Another annoyance it that Linux seems to refuse to generate core files
for the ProcD. Previous attempts on our part to force Linux to generate
core files even for root-owned processes have not yielded any fruit.
This is another issue that we need to address.
> This also left one starter to terminate with,
> 12/6 14:59:06 ********* STARTER terminating normally **********
>
> one Vanilla universe job running without an associated condor_starter
> process, a stranded check point server, and after rebooting the machine for a
> for a completely clean restart, several unused procd_pipe* files.
Yes, things get unfortunately messy when the Master gets taken out. This
is why in our default configuration, the Master does not rely on the
ProcD for its own operation. It uses an older version of the process
tracking logic that is less scalable and does not work with PrivSep but
has been around forever and is thus more stable. We plan to continue to
ship with this default until we feel the ProcD has reached sufficient
stability.
The reason your Master is relying on the ProcD, I'm guessing, is because
of your use of GID-based process tracking, which requires the ProcD.
GID-based process tracking uses the ProcD to manage the pool of
available GIDs. For execute machines that do not run a SchedD (and
therefore cannot run local universe jobs), I believe you could employ a
minor configuration hack and still get away with the Master not relying
on the ProcD. Specifically, tell the Starter along that GID-based
tracking is to be used, and the Master will not require a ProcD:
STARTER.USE_GID_PROCESS_TRACKING = True
STARTER.MIN_TRACKING_GID = 901
STARTER.MAX_TRACKING_GID = 1000
Sorry for the inconvenience this has caused.
Greg Quinn
Condor Team
===========================================================================
Date mail was appended: Fri Dec 7 11:49:31 2007 (1197049772)
Date: Sat, 8 Dec 2007 11:49:48 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: skoranda__AT__gravity.phys.uwm.edu
Subject: Re: [condor-admin #17283] LIGO: condor_procd failure
X-Seen-BY: mailfromd 4.1 silica.cs.wisc.edu
On Fri, Dec 07, 2007 at 11:49:31AM -0600, condor-admin response tracking system wrote:
>
> If I am not mistaken, the exit status of 8 indicates a floating point
> exception. I can think of two possibilities: divide by zero or some such
> when the ProcD is aggregating process usage data or memory error that
> causing the ProcD to execute random memory locations that look to the
> CPU like FPU instructions.
This makes sense, since I believe I saw /bin/top report NaN on this
machine when I was performing a postmortem before rebooting, so if
procd performs floating point operations on the values from /proc that
may have been the final step that caused procd to crash.
That obviously leaves open the question of what was to the root cause, e.g.,
Liniux kernel bug, Condor procd bug, cosmic ray, or failing hardware.
For what it is worth, another 330 identical HW and condor configured
execute machines have been running fine with procd. Furthermore,
this particular machine has no interesting syslog error messages and
has been running fine since rebooting.
>
> Could you package up the logs from the other daemons as well? I'd like
> to have as much info surrounding what the ProcD may have been doing as I
> can get. Unfortunately, debugging problems with the ProcD is something
Please consider adding a condor specific postmortem data collection tool
to make bug reports easier, e.g.,
# condor_crash_report node174 -event "12/6 14:59:06" -history 3600
Could aggregate the following and possibly phone home with the information:
* Configuration files of interest (submit machine, central pool manager,
and specified node174).
* All log files for 1 hour before the specified time from these machines.
* Run /usr/bin/ident on all Condor binaries.
* Search for and include any core files found, possibly running a few
standard gdb commands on the local machine first.
* Basic information on the host computer (OS, patch level, memory, CPU, ...)
...
Here are the last 2 minutes from the logs on the execute machine:
[root@node174 log]# grep "^12\/6 14:5[89]" MasterLog
12/6 14:58:42 ProcAPI::buildFamily() called w/ parent: 28604
12/6 14:58:42 ProcAPI::buildFamily() Found daddypid on the system: 28604
12/6 14:58:42 ProcFamily: parent: 28604 family: 28604
12/6 14:58:42 ProcFamily: alive_cpu_user = 0, exited_cpu = 0, max_image = 15516k
12/6 14:58:42 ProcAPI::buildFamily() called w/ parent: 17515
12/6 14:58:42 ProcAPI::buildFamily() Found daddypid on the system: 17515
12/6 14:58:42 Pid 731 is in family of 17515
12/6 14:58:42 Pid 733 is predicted to be in family of 17515
12/6 14:58:42 Pid 734 is predicted to be in family of 17515
12/6 14:58:42 Pid 735 is predicted to be in family of 17515
12/6 14:58:42 Pid 746 is in family of 17515
12/6 14:58:42 Pid 747 is predicted to be in family of 17515
12/6 14:58:42 Pid 748 is predicted to be in family of 17515
12/6 14:58:42 Pid 749 is predicted to be in family of 17515
12/6 14:58:42 Pid 750 is in family of 17515
12/6 14:58:42 Pid 752 is predicted to be in family of 17515
12/6 14:58:42 Pid 753 is predicted to be in family of 17515
12/6 14:58:42 Pid 756 is predicted to be in family of 17515
12/6 14:58:42 Pid 770 is in family of 17515
12/6 14:58:42 Pid 771 is predicted to be in family of 17515
12/6 14:58:42 Pid 772 is predicted to be in family of 17515
12/6 14:58:42 Pid 773 is predicted to be in family of 17515
12/6 14:58:42 ProcFamily: parent: 17515 family: 17515 731 733 734 735 746 747 748 749 750 752 753 756 770 771 772 773
12/6 14:58:42 ProcFamily: alive_cpu_user = 25173, exited_cpu = 1947617, max_image = 2893588k
12/6 14:59:42 ProcAPI::buildFamily() called w/ parent: 28604
12/6 14:59:42 ProcAPI::buildFamily() Found daddypid on the system: 28604
12/6 14:59:42 ProcFamily: parent: 28604 family: 28604
12/6 14:59:42 ProcFamily: alive_cpu_user = 0, exited_cpu = 0, max_image = 15516k
12/6 14:59:42 ProcAPI::buildFamily() called w/ parent: 17515
12/6 14:59:42 ProcAPI::buildFamily() Found daddypid on the system: 17515
12/6 14:59:42 Pid 731 is in family of 17515
12/6 14:59:42 Pid 733 is predicted to be in family of 17515
12/6 14:59:42 Pid 734 is predicted to be in family of 17515
12/6 14:59:42 Pid 735 is predicted to be in family of 17515
12/6 14:59:42 Pid 746 is in family of 17515
12/6 14:59:42 Pid 747 is predicted to be in family of 17515
12/6 14:59:42 Pid 748 is predicted to be in family of 17515
12/6 14:59:42 Pid 749 is predicted to be in family of 17515
12/6 14:59:42 Pid 750 is in family of 17515
12/6 14:59:42 Pid 752 is predicted to be in family of 17515
12/6 14:59:42 Pid 753 is predicted to be in family of 17515
12/6 14:59:42 Pid 756 is predicted to be in family of 17515
12/6 14:59:42 Pid 770 is in family of 17515
12/6 14:59:42 Pid 771 is predicted to be in family of 17515
12/6 14:59:42 Pid 772 is predicted to be in family of 17515
12/6 14:59:42 Pid 773 is predicted to be in family of 17515
12/6 14:59:42 ProcFamily: parent: 17515 family: 17515 731 733 734 735 746 747 748 749 750 752 753 756 770 771 772 773
12/6 14:59:42 ProcFamily: alive_cpu_user = 25413, exited_cpu = 1947617, max_image = 2893588k
12/6 14:59:06 DaemonCore: pid 9290 exited with status 8, invoking reaper 2 <condor_procd reaper>
12/6 14:59:06 procd (pid = 9290) exited unexpectedly with status 8
12/6 14:59:06 ERROR "ProcD has failed" at line 590 in file proc_family_proxy.C
12/6 14:59:06 error writing to named pipe: watchdog pipe has closed
12/6 14:59:06 LocalClient: error sending message to server
12/6 14:59:06 ProcFamilyClient: failed to start connection with ProcD
12/6 14:59:06 kill_family: ProcD communication error
12/6 14:59:06 ERROR "ProcD has failed" at line 590 in file proc_family_proxy.C
root@node174 log]# grep "^12\/6 14:5[89]" StartLog
12/6 14:58:04 DaemonCore: Command received via UDP from host <10.14.0.12:44475>
12/6 14:58:04 DaemonCore: received command 441 (ALIVE), calling handler (command_handler)
12/6 14:58:04 DaemonCore: Command received via UDP from host <10.14.0.12:44475>
12/6 14:58:04 DaemonCore: received command 441 (ALIVE), calling handler (command_handler)
12/6 14:58:04 DaemonCore: Command received via UDP from host <10.14.0.12:44475>
12/6 14:58:04 DaemonCore: received command 441 (ALIVE), calling handler (command_handler)
12/6 14:58:04 DaemonCore: Command received via UDP from host <10.14.0.12:44475>
12/6 14:58:04 DaemonCore: received command 441 (ALIVE), calling handler (command_handler)
12/6 14:58:12 Swap space: 10482324
12/6 14:58:12 120785628 kbytes available for "/usr1/condor/execute"
12/6 14:58:12 Looking up RESERVED_DISK parameter
12/6 14:58:12 Reserving 5120 kbytes for file system
12/6 14:58:12 Disk space: 120780508
12/6 14:58:16 Trying to update collector <10.14.0.12:9618>
12/6 14:58:16 Attempting to send update via UDP to collector ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
12/6 14:58:16 vm1: Sent update to 1 collector(s)
12/6 14:58:17 Trying to update collector <10.14.0.12:9618>
12/6 14:58:17 Attempting to send update via UDP to collector ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
12/6 14:58:17 vm2: Sent update to 1 collector(s)
12/6 14:58:18 Trying to update collector <10.14.0.12:9618>
12/6 14:58:18 Attempting to send update via UDP to collector ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
12/6 14:58:18 vm3: Sent update to 1 collector(s)
12/6 14:58:19 Trying to update collector <10.14.0.12:9618>
12/6 14:58:19 Attempting to send update via UDP to collector ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
12/6 14:58:19 vm4: Sent update to 1 collector(s)
12/6 14:58:28 ProcAPI::buildFamily() Found daddypid on the system: 731
12/6 14:58:33 ProcAPI::buildFamily() Found daddypid on the system: 746
12/6 14:58:33 ProcAPI::buildFamily() Found daddypid on the system: 750
12/6 14:58:38 ProcAPI::buildFamily() Found daddypid on the system: 770
12/6 14:59:18 Unable to calculate keyboard/mouse idle time due to them both being USB or not present, assuming infinite idle time for these devices.
12/6 14:59:18 ProcAPI::buildFamily() Found daddypid on the system: 731
12/6 14:59:23 ProcAPI::buildFamily() Found daddypid on the system: 746
12/6 14:59:23 ProcAPI::buildFamily() Found daddypid on the system: 750
12/6 14:59:29 ProcAPI::buildFamily() Found daddypid on the system: 770
12/6 14:58:09 DaemonCore: Command received via UDP from host <10.14.0.25:53068>, access level NEGOTIATOR
12/6 14:58:09 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
12/6 14:58:09 Calling HandleReq <command_match_info> (0)
12/6 14:58:09 slot1: match_info called
12/6 14:58:09 slot1: Received match <10.14.1.174:34160>#1196536907#881#...
12/6 14:58:09 slot1: State change: match notification protocol successful
12/6 14:58:09 slot1: Changing state: Unclaimed -> Matched
12/6 14:58:09 Return from HandleReq <command_match_info>
12/6 14:58:13 Calling Handler <DaemonCore::HandleReqSocketHandler>
12/6 14:58:13 DaemonCore: Command received via TCP from host <10.14.0.12:48113>, access level DAEMON
12/6 14:58:13 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)
12/6 14:58:13 Calling HandleReq <command_request_claim> (0)
12/6 14:58:13 slot1: Request accepted.
12/6 14:58:13 slot1: Remote owner is psutton@ligo
12/6 14:58:13 slot1: State change: claiming protocol successful
12/6 14:58:13 slot1: Changing state: Matched -> Claimed
12/6 14:58:13 Return from HandleReq <command_request_claim>
12/6 14:58:13 Return from Handler <DaemonCore::HandleReqSocketHandler>
12/6 14:58:17 Calling Handler <DaemonCore::HandleReqSocketHandler>
12/6 14:58:17 DaemonCore: Command received via TCP from host <10.14.0.12:59415>, access level DAEMON
12/6 14:58:17 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)
12/6 14:58:17 Calling HandleReq <command_activate_claim> (0)
12/6 14:58:17 slot1: Got activate_claim request from shadow (<10.14.0.12:59415>)
12/6 14:58:17 slot1: Remote job ID is 21873276.0
12/6 14:58:17 slot1: Got universe "VANILLA" (5) from request classad
12/6 14:58:17 slot1: State change: claim-activation protocol successful
12/6 14:58:17 slot1: Changing activity: Idle -> Busy
12/6 14:58:17 Return from HandleReq <command_activate_claim>
12/6 14:58:17 Return from Handler <DaemonCore::HandleReqSocketHandler>
12/6 14:58:18 Calling HandleReq <HandleChildAliveCommand> (0)
12/6 14:58:18 Return from HandleReq <HandleChildAliveCommand>
12/6 14:58:25 Calling Handler <receiveJobClassAdUpdate>
12/6 14:58:25 Return from Handler <receiveJobClassAdUpdate>
12/6 14:58:29 Calling HandleReq <command_handler> (0)
12/6 14:58:29 Return from HandleReq <command_handler>
12/6 14:58:29 Calling HandleReq <command_handler> (0)
12/6 14:58:29 Return from HandleReq <command_handler>
12/6 14:58:29 Calling HandleReq <command_handler> (0)
12/6 14:58:29 Return from HandleReq <command_handler>
12/6 14:58:29 Calling HandleReq <command_handler> (0)
12/6 14:58:29 Return from HandleReq <command_handler>
12/6 14:59:06 Calling Handler <receiveJobClassAdUpdate>
12/6 14:59:06 Return from Handler <receiveJobClassAdUpdate>
12/6 14:59:06 DaemonCore: pid 14015 exited with status 1024, invoking reaper 1 <reaper>
12/6 14:59:06 Starter pid 14015 exited with status 4
12/6 14:59:06 error writing to named pipe: watchdog pipe has closed
12/6 14:59:06 LocalClient: error sending message to server
12/6 14:59:06 ProcFamilyClient: failed to start connection with ProcD
12/6 14:59:06 kill_family: ProcD communication error
12/6 14:59:06 ERROR "ProcD has failed" at line 590 in file proc_family_proxy.C
12/6 14:59:06 slot1: Changing state and activity: Claimed/Busy -> Preempting/Killing
12/6 14:59:06 slot4: Changing state and activity: Claimed/Idle -> Preempting/Killing
12/6 14:59:06 slot4: State change: No preempting claim, returning to owner
12/6 14:59:06 slot4: Changing state and activity: Preempting/Killing -> Owner/Idle
12/6 14:59:06 slot4: State change: IS_OWNER is false
12/6 14:59:06 slot4: Changing state: Owner -> Unclaimed
12/6 14:59:06 slot3: Changing state and activity: Claimed/Busy -> Preempting/Killing
12/6 14:59:06 slot3: State change: No preempting claim, returning to owner
12/6 14:59:06 slot3: Changing state and activity: Preempting/Killing -> Owner/Idle
12/6 14:59:06 slot3: State change: IS_OWNER is false
12/6 14:59:06 slot3: Changing state: Owner -> Unclaimed
12/6 14:59:06 slot2: Changing state and activity: Claimed/Busy -> Preempting/Killing
12/6 14:59:06 startd exiting because of fatal exception.
[root@node174 log]# grep "^12\/6 14:5[89]" StarterLog.slot1
12/6 14:58:17 ******************************************************
12/6 14:58:17 ** condor_starter (CONDOR_STARTER) STARTING UP
12/6 14:58:17 ** /usr/sbin/condor_starter
12/6 14:58:17 ** $CondorVersion: 6.9.5 Nov 28 2007 BuildID: 65347 $
12/6 14:58:17 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
12/6 14:58:17 ** PID = 14139
12/6 14:58:17 ** Log last touched 12/6 14:33:16
12/6 14:58:17 ******************************************************
12/6 14:58:17 Using config source: /usr1/condor/condor_config
12/6 14:58:17 Using local config sources:
12/6 14:58:17 /usr1/condor/condor_config.local
12/6 14:58:17 DaemonCore: Command Socket at <10.14.1.174:42231>
12/6 14:58:17 Done setting resource limits
12/6 14:58:17 Communicating with shadow <10.14.0.12:40165>
12/6 14:58:17 Submitting machine is "ldas-grid.ligo.caltech.edu"
12/6 14:58:17 setting the orig job name in starter
12/6 14:58:17 setting the orig job iwd in starter
12/6 14:58:17 Job 21873276.0 set to execute immediately
12/6 14:58:17 Starting a VANILLA universe job with ID: 21873276.0
12/6 14:58:17 IWD: /mnt/qfs1/psutton/coherent-network/grb/GRB070201_simData_MDC_999inj
12/6 14:58:17 Output file: /mnt/qfs1/psutton/coherent-network/grb/GRB070201_simData_MDC_999inj/logs/xsearch-21873276-0.out
12/6 14:58:17 Error file: /mnt/qfs1/psutton/coherent-network/grb/GRB070201_simData_MDC_999inj/logs/xsearch-21873276-0.err
12/6 14:58:17 Renice expr "0" evaluated to 0
12/6 14:58:17 About to exec /archive/home/psutton/coherent-network/xdetection.sh ./input/parameters_mdc_insp1414_2.txt 0 ./output/simulations_mdc_insp1414_2 501-550
12/6 14:58:17 Create_Process succeeded, pid=14140
12/6 14:59:06 Got SIGQUIT. Performing fast shutdown.
12/6 14:59:06 ShutdownFast all jobs.
12/6 14:59:06 error writing to named pipe: watchdog pipe has closed
12/6 14:59:06 LocalClient: error sending message to server
12/6 14:59:06 ProcFamilyClient: failed to start connection with ProcD
12/6 14:59:06 kill_family: ProcD communication error
12/6 14:59:06 ERROR "ProcD has failed" at line 590 in file proc_family_proxy.C
12/6 14:59:06 ShutdownFast all jobs.
12/6 14:59:06 error writing to named pipe: watchdog pipe has closed
12/6 14:59:06 LocalClient: error sending message to server
12/6 14:59:06 ProcFamilyClient: failed to start connection with ProcD
12/6 14:59:06 kill_family: ProcD communication error
12/6 14:59:06 ERROR "ProcD has failed" at line 590 in file proc_family_proxy.C
[root@node174 log]# grep "^12\/6 14:5[89]" StarterLog.slot2
12/6 14:59:06 *FSM* Got asynchronous event "DIE"
12/6 14:59:06 *FSM* Executing transition function "req_die"
12/6 14:59:06 req_exit_all: Proc -1 in state EXECUTING
12/6 14:59:06 Requesting Exit on proc #-1
12/6 14:59:06 UserProc::send_sig_no_privsep(): Sent signal SIGCONT to user job 14103
12/6 14:59:06 UserProc::send_sig(): Sent signal SIGKILL to user job 14103
12/6 14:59:06 *FSM* Transitioning to state "TERMINATE"
12/6 14:59:06 *FSM* Executing state func "terminate_all()" [ ]
12/6 14:59:06 *FSM* Transitioning to state "TERMINATE_WAIT"
12/6 14:59:06 *FSM* Executing state func "asynch_wait()" [ SUSPEND ALARM DIE CHILD_EXIT ]
12/6 14:59:06 *FSM* Got asynchronous event "CHILD_EXIT"
12/6 14:59:06 *FSM* Executing transition function "reaper"
12/6 14:59:06 Process 14103 killed by signal 9
12/6 14:59:06 Process exited by request
12/6 14:59:06 *FSM* Transitioning to state "TERMINATE"
12/6 14:59:06 *FSM* Executing state func "terminate_all()" [ ]
12/6 14:59:06 *FSM* Transitioning to state "SEND_STATUS_ALL"
12/6 14:59:06 *FSM* Executing state func "dispose_all()" [ ]
12/6 14:59:06 Sending final status for process 21873324.0
12/6 14:59:06 STATUS encoded as CKPT, *NOT* TRANSFERRED
12/6 14:59:06 User time = 0.000000 seconds
12/6 14:59:06 System time = 0.000000 seconds
12/6 14:59:06 Unlinked "dir_14102/condor_exec.21873324.0"
12/6 14:59:06 Removed directory "dir_14102"
12/6 14:59:06 *FSM* Reached state "END"
12/6 14:59:06 ********* STARTER terminating normally **********
[root@node174 log]# grep "^12\/6 14:5[89]" StarterLog.slot3
12/6 14:59:06 Process exited, pid=14017, status=0
12/6 14:59:06 error reading from named pipe: watchdog pipe has closed
12/6 14:59:06 ProcFamilyClient: failed to read response from ProcD
12/6 14:59:06 kill_family: ProcD communication error
12/6 14:59:06 ERROR "ProcD has failed" at line 590 in file proc_family_proxy.C
12/6 14:59:06 ShutdownFast all jobs.
12/6 14:59:06 error writing to named pipe: watchdog pipe has closed
12/6 14:59:06 LocalClient: error sending message to server
12/6 14:59:06 ProcFamilyClient: failed to start connection with ProcD
12/6 14:59:06 kill_family: ProcD communication error
12/6 14:59:06 ERROR "ProcD has failed" at line 590 in file proc_family_proxy.C
root@node174 log]# grep "^12\/6 14:5[89]" StarterLog.slot4
[root@node174 log]#
> Yes, things get unfortunately messy when the Master gets taken out. This
> is why in our default configuration, the Master does not rely on the
> ProcD for its own operation. It uses an older version of the process
> tracking logic that is less scalable and does not work with PrivSep but
> has been around forever and is thus more stable. We plan to continue to
> ship with this default until we feel the ProcD has reached sufficient
> stability.
>
> The reason your Master is relying on the ProcD, I'm guessing, is because
> of your use of GID-based process tracking, which requires the ProcD.
This does appear to be the case. From the MasterLog,
12/1 13:38:21 GID-based process tracking requires use of ProcD; ignoring USE_PROCD setting
So for now, we will simply disable GID tracking to avoid enabling ProcD
until it is more stable.
Thanks for the detailed explanations.
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date mail was appended: Sun Dec 9 15:20:33 2007 (1197235234)