LIGO Support Ticket 1707

Ticket Information
  Number:      support 1707
  User:        anderson@ligo.caltech.edu
  Email:       espinoza_e__AT__ligo.caltech.edu
  Status:      resolved
  Assigned To: wright
Date: Sun, 29 Oct 2006 10:25:51 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu, wright__AT__cs.wisc.edu, roy__AT__cs.wisc.edu
CC: Erik Espinoza <espinoza_e__AT__ligo.caltech.edu>
Subject: LIGO startd exit with status 44

The LIGO Caltech condor pool running,
$CondorVersion: 6.8.2 Oct 12 2006 $
$CondorPlatform: X86_64-LINUX_RHEL3 $
plus a patched version of startd with extra debugging information for
[condor-support #1694] (wherein startd is exiting with status 6)
has had condor_startd exit with status 44. This also took down the
condor_master processes and the node in question stopped running anything
having to do with condor.

Here is the email alert:

This is an automated email from the Condor system
on machine "node108.ldas-cit.ligo.caltech.edu".  Do not reply.
 
"/ldcg/condor/sbin/condor_startd" on "node108.ldas-cit.ligo.caltech.edu" exited 
with status 44.
Condor will automatically restart this process in 10 seconds.


Below are the last 200 lines from MasterLog and StartLog. It might also
be interesting to note the existence of the following 0 byte files in the
log directory. Any help in solving this problem would be appreciated.

Thanks.


[root@node108 log]# ls -lt
total 546848
-rw-r--r--  1 condor condor 217079808 Oct 29 10:17 StartLog
-rw-r--r--  1 condor condor         0 Oct 29 10:17 dprintf_failure.STARTD
-rw-r--r--  1 condor condor 149864448 Oct 29 01:29 MasterLog
-rw-r--r--  1 condor condor         0 Oct 29 01:29 dprintf_failure.MASTER
-rw-r--r--  1 condor condor  47151247 Oct 29 01:29 StarterLog.vm1
-rw-r--r--  1 condor condor  46794008 Oct 29 01:29 StarterLog.vm2
-rw-r--r--  1 condor condor  43422104 Oct 29 01:29 StarterLog.vm3
-rw-r--r--  1 condor condor  42905183 Oct 29 01:29 StarterLog.vm4
-rw-r--r--  1 condor condor   5371078 Oct 29 01:28 CkptServerLog
-rw-r--r--  1 condor condor   6771991 Oct 29 01:22 StarterLog.boinc
-rw-------  1 condor condor         0 Jan 20  2006 InstanceLock


MasterLog
---------
10/29 01:25:29 ProcAPI::buildFamily() called w/ parent: 3463
10/29 01:25:29 ProcAPI::buildFamily() Found daddypid on the system: 3463
10/29 01:25:29 ProcFamily: parent: 3463 family: 3463
10/29 01:25:29 ProcFamily: alive_cpu_user = 0, exited_cpu = 0, max_image = 15512k
10/29 01:26:29 ProcAPI::buildFamily() called w/ parent: 3462
10/29 01:26:29 ProcAPI::buildFamily() Found daddypid on the system: 3462
10/29 01:26:29 Pid 6306 is in family of 3462
10/29 01:26:29 Pid 6309 is predicted to be in family of 3462
10/29 01:26:29 Pid 6345 is predicted to be in family of 3462
10/29 01:26:29 Pid 6428 is in family of 6345
10/29 01:26:29 Pid 6437 is in family of 6428
10/29 01:26:29 Pid 6438 is in family of 6437
10/29 01:26:29 Pid 6544 is in family of 6438
10/29 01:26:29 Pid 6859 is in family of 3462
10/29 01:26:29 Pid 6862 is in family of 6859
10/29 01:26:29 Pid 6969 is in family of 6862
10/29 01:26:29 Pid 6973 is in family of 6969
10/29 01:26:29 Pid 26837 is in family of 3462
10/29 01:26:29 Pid 26838 is predicted to be in family of 3462
10/29 01:26:29 Pid 26904 is predicted to be in family of 3462
10/29 01:26:29 Pid 29396 is in family of 3462
10/29 01:26:29 Pid 29397 is predicted to be in family of 3462
10/29 01:26:29 Pid 29462 is predicted to be in family of 3462
10/29 01:26:29 Pid 29777 is predicted to be in family of 3462
10/29 01:26:29 ProcAPI::getProcInfo() pid 29694 does not exist.
10/29 01:26:29 ProcAPI::getProcInfo() pid 29694 does not exist.
10/29 01:26:29 ProcAPI::getProcInfo() pid 29694 does not exist.
10/29 01:26:29 ProcAPI::getProcInfo() pid 29694 does not exist.
10/29 01:26:29 ProcAPI::getProcInfo() pid 29694 does not exist.
10/29 01:26:29 ProcFamily: parent: 3462 family: 3462 6306 6309 6345 6428 6437 6438 6544 6859 6862 6969 6973 26837 26838 26904 29396 29397 29462 29777
10/29 01:26:29 ProcFamily: alive_cpu_user = 91079, exited_cpu = 388959, max_image = 4108484k
10/29 01:26:29 ProcAPI::buildFamily() called w/ parent: 3463
10/29 01:26:29 ProcAPI::buildFamily() Found daddypid on the system: 3463
10/29 01:26:29 ProcFamily: parent: 3463 family: 3463
10/29 01:26:29 ProcFamily: alive_cpu_user = 0, exited_cpu = 0, max_image = 15512k
10/29 01:27:29 ProcAPI::buildFamily() called w/ parent: 3462
10/29 01:27:29 ProcAPI::buildFamily() Found daddypid on the system: 3462
10/29 01:27:29 Pid 6306 is in family of 3462
10/29 01:27:29 Pid 6309 is predicted to be in family of 3462
10/29 01:27:29 Pid 6345 is predicted to be in family of 3462
10/29 01:27:29 Pid 6428 is in family of 6345
10/29 01:27:29 Pid 6437 is in family of 6428
10/29 01:27:29 Pid 6438 is in family of 6437
10/29 01:27:29 Pid 6544 is in family of 6438
10/29 01:27:29 Pid 6859 is in family of 3462
10/29 01:27:29 Pid 6862 is in family of 6859
10/29 01:27:29 Pid 6969 is in family of 6862
10/29 01:27:29 Pid 6973 is in family of 6969
10/29 01:27:29 Pid 26837 is in family of 3462
10/29 01:27:29 Pid 26838 is predicted to be in family of 3462
10/29 01:27:29 Pid 26904 is predicted to be in family of 3462
10/29 01:27:29 Pid 29396 is in family of 3462
10/29 01:27:29 Pid 29397 is predicted to be in family of 3462
10/29 01:27:29 Pid 29462 is predicted to be in family of 3462
10/29 01:27:29 Pid 29863 is predicted to be in family of 3462
10/29 01:27:29 ProcAPI::getProcInfo() pid 29777 does not exist.
10/29 01:27:29 ProcAPI::getProcInfo() pid 29777 does not exist.
10/29 01:27:29 ProcAPI::getProcInfo() pid 29777 does not exist.
10/29 01:27:29 ProcAPI::getProcInfo() pid 29777 does not exist.
10/29 01:27:29 ProcAPI::getProcInfo() pid 29777 does not exist.
10/29 01:27:29 ProcFamily: parent: 3462 family: 3462 6306 6309 6345 6428 6437 6438 6544 6859 6862 6969 6973 26837 26838 26904 29396 29397 29462 29863
10/29 01:27:29 ProcFamily: alive_cpu_user = 91266, exited_cpu = 388959, max_image = 4108484k
10/29 01:27:29 ProcAPI::buildFamily() called w/ parent: 3463
10/29 01:27:29 ProcAPI::buildFamily() Found daddypid on the system: 3463
10/29 01:27:29 ProcFamily: parent: 3463 family: 3463
10/29 01:27:29 ProcFamily: alive_cpu_user = 0, exited_cpu = 0, max_image = 15512k
10/29 01:28:29 ProcAPI::buildFamily() called w/ parent: 3462
10/29 01:28:29 ProcAPI::buildFamily() Found daddypid on the system: 3462
10/29 01:28:29 Pid 6306 is in family of 3462
10/29 01:28:29 Pid 6309 is predicted to be in family of 3462
10/29 01:28:29 Pid 6345 is predicted to be in family of 3462
10/29 01:28:30 Pid 6428 is in family of 6345
10/29 01:28:30 Pid 6437 is in family of 6428
10/29 01:28:30 Pid 6438 is in family of 6437
10/29 01:28:30 Pid 6544 is in family of 6438
10/29 01:28:30 Pid 6859 is in family of 3462
10/29 01:28:30 Pid 6862 is in family of 6859
10/29 01:28:30 Pid 6969 is in family of 6862
10/29 01:28:30 Pid 6973 is in family of 6969
10/29 01:28:30 Pid 26837 is in family of 3462
10/29 01:28:30 Pid 26838 is predicted to be in family of 3462
10/29 01:28:30 Pid 26904 is predicted to be in family of 3462
10/29 01:28:30 Pid 29396 is in family of 3462
10/29 01:28:30 Pid 29397 is predicted to be in family of 3462
10/29 01:28:30 Pid 29462 is predicted to be in family of 3462
10/29 01:28:30 Pid 29940 is predicted to be in family of 3462
10/29 01:28:30 ProcAPI::getProcInfo() pid 29863 does not exist.
10/29 01:28:30 ProcAPI::getProcInfo() pid 29863 does not exist.
10/29 01:28:30 ProcAPI::getProcInfo() pid 29863 does not exist.
10/29 01:28:30 ProcAPI::getProcInfo() pid 29863 does not exist.
10/29 01:28:30 ProcAPI::getProcInfo() pid 29863 does not exist.
10/29 01:28:30 ProcFamily: parent: 3462 family: 3462 6306 6309 6345 6428 6437 6438 6544 6859 6862 6969 6973 26837 26838 26904 29396 29397 29462 29940
10/29 01:28:30 ProcFamily: alive_cpu_user = 91459, exited_cpu = 388959, max_image = 4108484k
10/29 01:28:30 ProcAPI::buildFamily() called w/ parent: 3463
10/29 01:28:30 ProcAPI::buildFamily() Found daddypid on the system: 3463
10/29 01:28:30 ProcFamily: parent: 3463 family: 3463
10/29 01:28:30 ProcFamily: alive_cpu_user = 0, exited_cpu = 0, max_image = 15512k
10/29 01:28:34 enter Daemons::UpdateCollector
10/29 01:28:34 Trying to update collector <10.14.0.12:9618>
10/29 01:28:34 Attempting to send update via UDP to collector ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
10/29 01:28:34 exit Daemons::UpdateCollector
10/29 01:28:39 enter Daemons::CheckForNewExecutable
10/29 01:28:39 Time stamp of running /ldcg/condor/sbin/condor_master: 1160676896
10/29 01:28:39 GetTimeStamp returned: 1160676896
10/29 01:28:39 Time stamp of running /ldcg/condor/sbin/condor_startd: 1161825357
10/29 01:28:39 GetTimeStamp returned: 1161825357
10/29 01:28:39 Time stamp of running /ldcg/condor/sbin/condor_ckpt_server: 1160676867
10/29 01:28:39 GetTimeStamp returned: 1160676867
10/29 01:28:39 exit Daemons::CheckForNewExecutable
10/29 01:29:29 Getting monitoring info for pid 3461
10/29 01:29:30 ProcAPI::buildFamily() called w/ parent: 3462
10/29 01:29:30 ProcAPI::buildFamily() Found daddypid on the system: 3462
10/29 01:29:30 Pid 6306 is in family of 3462
10/29 01:29:30 Pid 6309 is predicted to be in family of 3462
10/29 01:29:30 Pid 6345 is predicted to be in family of 3462
10/29 01:29:30 Pid 6428 is in family of 6345
10/29 01:29:30 Pid 6437 is in family of 6428
10/29 01:29:30 Pid 6438 is in family of 6437
10/29 01:29:30 Pid 6544 is in family of 6438
10/29 01:29:30 Pid 6859 is in family of 3462
10/29 01:29:30 Pid 6862 is in family of 6859
10/29 01:29:30 Pid 6969 is in family of 6862
10/29 01:29:30 Pid 6973 is in family of 6969
10/29 01:29:30 Pid 26837 is in family of 3462
10/29 01:29:30 Pid 26838 is predicted to be in family of 3462
10/29 01:29:30 Pid 26904 is predicted to be in family of 3462
10/29 01:29:30 Pid 29396 is in family of 3462
10/29 01:29:30 Pid 29397 is predicted to be in family of 3462
10/29 01:29:30 Pid 29462 is predicted to be in family of 3462
10/29 01:29:30 Pid 30012 is predicted to be in family of 3462
10/29 01:29:30 ProcAPI::getProcInfo() pid 29940 does not exist.
10/29 01:29:30 ProcAPI::getProcInfo() pid 29940 does not exist.
10/29 01:29:30 ProcAPI::getProcInfo() pid 29940 does not exist.
10/29 01:29:30 ProcAPI::getProcInfo() pid 29940 does not exist.
10/29 01:29:30 ProcAPI::getProcInfo() pid 29940 does not exist.
10/29 01:29:30 ProcFamily: parent: 3462 family: 3462 6306 6309 6345 6428 6437 6438 6544 6859 6862 6969 6973 26837 26838 26904 29396 29397 29462 30012
10/29 01:29:30 ProcFamily: alive_cpu_user = 91643, exited_cpu = 388959, max_image = 4108484k
10/29 01:29:30 ProcAPI::buildFamily() called w/ parent: 3463
10/29 01:29:30 ProcAPI::buildFamily() Found daddypid on the system: 3463
10/29 01:29:30 ProcFamily: parent: 3463 family: 3463
10/29 01:29:30 ProcFamily: alive_cpu_user = 0, exited_cpu = 0, max_image = 15512k
10/29 01:29:47 DaemonCore: No more children processes to reap.
10/29 01:29:47 The STARTD (pid 3462) exited with status 44
10/29 01:29:47 Entering ProcFamily::hardkill
10/29 01:29:47 ProcAPI::buildFamily() called w/ parent: 3462
10/29 01:29:47 ProcAPI::buildFamily() Parent pid 3462 is gone. Found descendant 6306 via ancestor environment tracking and assigning as new "parent".
10/29 01:29:47 Pid 6309 is in family of 6306
10/29 01:29:47 Pid 6345 is predicted to be in family of 6306
10/29 01:29:47 Pid 6428 is in family of 6345
10/29 01:29:47 Pid 6437 is in family of 6428
10/29 01:29:47 Pid 6438 is in family of 6437
10/29 01:29:47 Pid 6544 is in family of 6438
10/29 01:29:47 Pid 6859 is predicted to be in family of 6306
10/29 01:29:47 Pid 6862 is in family of 6859
10/29 01:29:47 Pid 6969 is in family of 6862
10/29 01:29:47 Pid 6973 is in family of 6969
10/29 01:29:47 Pid 26837 is predicted to be in family of 6306
10/29 01:29:47 Pid 26838 is predicted to be in family of 6306
10/29 01:29:47 Pid 26904 is predicted to be in family of 6306
10/29 01:29:47 Pid 29396 is predicted to be in family of 6306
10/29 01:29:47 Pid 29397 is predicted to be in family of 6306
10/29 01:29:47 Pid 29462 is predicted to be in family of 6306
10/29 01:29:47 Pid 30036 is predicted to be in family of 6306
10/29 01:29:47 ProcAPI::getProcInfo() pid 3462 does not exist.
10/29 01:29:47 ProcAPI::getProcInfo() pid 3462 does not exist.
10/29 01:29:47 ProcAPI::getProcInfo() pid 3462 does not exist.
10/29 01:29:47 ProcAPI::getProcInfo() pid 3462 does not exist.
10/29 01:29:47 ProcAPI::getProcInfo() pid 3462 does not exist.
10/29 01:29:47 ProcAPI::getProcInfo() pid 30012 does not exist.
10/29 01:29:47 ProcAPI::getProcInfo() pid 30012 does not exist.
10/29 01:29:47 ProcAPI::getProcInfo() pid 30012 does not exist.
10/29 01:29:47 ProcAPI::getProcInfo() pid 30012 does not exist.
10/29 01:29:47 ProcAPI::getProcInfo() pid 30012 does not exist.
10/29 01:29:47 ProcFamily: parent: 3462 family: 6306 6309 6345 6428 6437 6438 6544 6859 6862 6969 6973 26837 26838 26904 29396 29397 29462 30036
10/29 01:29:47 ProcFamily: alive_cpu_user = 91490, exited_cpu = 389164, max_image = 4108484k
10/29 01:29:47 ProcFamily::safe_kill: about to kill pid 6544 with sig 9
10/29 01:29:47 ProcFamily::safe_kill: about to kill pid 6438 with sig 9
10/29 01:29:47 ProcFamily::safe_kill: about to kill pid 6437 with sig 9
10/29 01:29:47 ProcFamily::safe_kill: about to kill pid 6428 with sig 9
10/29 01:29:47 ProcFamily::safe_kill: about to kill pid 6345 with sig 9
10/29 01:29:47 ProcFamily::safe_kill: about to kill pid 6309 with sig 9
10/29 01:29:47 ProcFamily::safe_kill: about to kill pid 6306 with sig 9
10/29 01:29:47 ProcFamily::safe_kill: about to kill pid 6973 with sig 9
10/29 01:29:47 ProcFamily::safe_kill: about to kill pid 6969 with sig 9
10/29 01:29:47 ProcFamily::safe_kill: about to kill pid 6862 with sig 9
10/29 01:29:47 ProcFamily::safe_kill: about to kill pid 6859 with sig 9
10/29 01:29:47 ProcFamily::safe_kill: about to kill pid 26904 with sig 9
10/29 01:29:47 ProcFamily::safe_kill: about to kill pid 26838 with sig 9
10/29 01:29:47 ProcFamily::safe_kill: about to kill pid 26837 with sig 9
10/29 01:29:47 ProcFamily::safe_kill: about to kill pid 30036 with sig 9
10/29 01:29:47 ProcFamily::safe_kill: about to kill pid 29462 with sig 9
10/29 01:29:47 ProcFamily::safe_kill: about to kill pid 29397 with sig 9
10/29 01:29:47 ProcFamily::safe_kill: about to kill pid 29396 with sig 9
10/29 01:29:47 Deleted ProcFamily w/ pid 3462 as parent
10/29 01:29:47 Sending obituary for "/ldcg/condor/sbin/condor_startd"
10/29 01:29:47 Forking Mailer process...
10/29 01:29:54 restarting /ldcg/condor/sbin/condor_startd in 10 seconds
10/29 01:29:54 enter Daemons::UpdateCollector
10/29 01:29:54 Trying to update collector <10.14.0.12:9618>
10/29 01:29:54 Attempting to send update via UDP to


StartLog
--------
10/29 01:27:31 ProcAPI::buildFamily() Found daddypid on the system: 6306
10/29 01:27:31 ProcAPI::getProcInfo() pid 29792 does not exist.
10/29 01:27:31 ProcAPI::getProcInfo() pid 29792 does not exist.
10/29 01:27:31 ProcAPI::getProcInfo() pid 29792 does not exist.
10/29 01:27:31 ProcAPI::getProcInfo() pid 29792 does not exist.
10/29 01:27:31 ProcAPI::getProcInfo() pid 29792 does not exist.
10/29 01:27:36 ProcAPI::getProcInfo() pid 29864 does not exist.
10/29 01:27:36 ProcAPI::getProcInfo() pid 29864 does not exist.
10/29 01:27:36 ProcAPI::getProcInfo() pid 29864 does not exist.
10/29 01:27:36 ProcAPI::getProcInfo() pid 29864 does not exist.
10/29 01:27:36 ProcAPI::getProcInfo() pid 29864 does not exist.
10/29 01:27:36 ProcAPI::getProcSetInfo(): Pid 29864 does not exist, ignoring.
10/29 01:27:41 ProcAPI::getProcInfo() pid 29864 does not exist.
10/29 01:27:41 ProcAPI::getProcInfo() pid 29864 does not exist.
10/29 01:27:41 ProcAPI::getProcInfo() pid 29864 does not exist.
10/29 01:27:41 ProcAPI::getProcInfo() pid 29864 does not exist.
10/29 01:27:41 ProcAPI::getProcInfo() pid 29864 does not exist.
10/29 01:27:41 ProcAPI::getProcSetInfo(): Pid 29864 does not exist, ignoring.
10/29 01:27:46 ProcAPI::getProcInfo() pid 29864 does not exist.
10/29 01:27:46 ProcAPI::getProcInfo() pid 29864 does not exist.
10/29 01:27:46 ProcAPI::getProcInfo() pid 29864 does not exist.
10/29 01:27:46 ProcAPI::getProcInfo() pid 29864 does not exist.
10/29 01:27:46 ProcAPI::getProcInfo() pid 29864 does not exist.
10/29 01:27:46 ProcAPI::getProcSetInfo(): Pid 29864 does not exist, ignoring.
10/29 01:27:51 ProcAPI::getProcInfo() pid 29864 does not exist.
10/29 01:27:51 ProcAPI::getProcInfo() pid 29864 does not exist.
10/29 01:27:51 ProcAPI::getProcInfo() pid 29864 does not exist.
10/29 01:27:51 ProcAPI::getProcInfo() pid 29864 does not exist.
10/29 01:27:51 ProcAPI::getProcInfo() pid 29864 does not exist.
10/29 01:27:51 ProcAPI::getProcSetInfo(): Pid 29864 does not exist, ignoring.
10/29 01:27:51 ProcAPI::buildFamily() Found daddypid on the system: 29396
10/29 01:27:56 ProcAPI::getProcInfo() pid 29864 does not exist.
10/29 01:27:56 ProcAPI::getProcInfo() pid 29864 does not exist.
10/29 01:27:56 ProcAPI::getProcInfo() pid 29864 does not exist.
10/29 01:27:56 ProcAPI::getProcInfo() pid 29864 does not exist.
10/29 01:27:56 ProcAPI::getProcInfo() pid 29864 does not exist.
10/29 01:27:56 ProcAPI::getProcSetInfo(): Pid 29864 does not exist, ignoring.
10/29 01:28:01 ProcAPI::getProcInfo() pid 29864 does not exist.
10/29 01:28:01 ProcAPI::getProcInfo() pid 29864 does not exist.
10/29 01:28:01 ProcAPI::getProcInfo() pid 29864 does not exist.
10/29 01:28:01 ProcAPI::getProcInfo() pid 29864 does not exist.
10/29 01:28:01 ProcAPI::getProcInfo() pid 29864 does not exist.
10/29 01:28:01 ProcAPI::getProcSetInfo(): Pid 29864 does not exist, ignoring.
10/29 01:28:06 ProcAPI::getProcInfo() pid 29864 does not exist.
10/29 01:28:06 ProcAPI::getProcInfo() pid 29864 does not exist.
10/29 01:28:06 ProcAPI::getProcInfo() pid 29864 does not exist.
10/29 01:28:06 ProcAPI::getProcInfo() pid 29864 does not exist.
10/29 01:28:06 ProcAPI::getProcInfo() pid 29864 does not exist.
10/29 01:28:06 ProcAPI::getProcSetInfo(): Pid 29864 does not exist, ignoring.
10/29 01:28:06 ProcAPI::buildFamily() Found daddypid on the system: 6859
10/29 01:28:11 ProcAPI::getProcInfo() pid 29864 does not exist.
10/29 01:28:11 ProcAPI::getProcInfo() pid 29864 does not exist.
10/29 01:28:11 ProcAPI::getProcInfo() pid 29864 does not exist.
10/29 01:28:11 ProcAPI::getProcInfo() pid 29864 does not exist.
10/29 01:28:11 ProcAPI::getProcInfo() pid 29864 does not exist.
10/29 01:28:11 ProcAPI::getProcSetInfo(): Pid 29864 does not exist, ignoring.
10/29 01:28:17 ProcAPI::getProcInfo() pid 29864 does not exist.
10/29 01:28:17 ProcAPI::getProcInfo() pid 29864 does not exist.
10/29 01:28:17 ProcAPI::getProcInfo() pid 29864 does not exist.
10/29 01:28:17 ProcAPI::getProcInfo() pid 29864 does not exist.
10/29 01:28:17 ProcAPI::getProcInfo() pid 29864 does not exist.
10/29 01:28:17 ProcAPI::getProcSetInfo(): Pid 29864 does not exist, ignoring.
10/29 01:28:17 ProcAPI::buildFamily() Found daddypid on the system: 26837
10/29 01:28:22 ProcAPI::buildFamily() Found daddypid on the system: 6306
10/29 01:28:22 ProcAPI::getProcInfo() pid 29864 does not exist.
10/29 01:28:22 ProcAPI::getProcInfo() pid 29864 does not exist.
10/29 01:28:22 ProcAPI::getProcInfo() pid 29864 does not exist.
10/29 01:28:22 ProcAPI::getProcInfo() pid 29864 does not exist.
10/29 01:28:22 ProcAPI::getProcInfo() pid 29864 does not exist.
10/29 01:28:27 ProcAPI::getProcInfo() pid 29930 does not exist.
10/29 01:28:27 ProcAPI::getProcInfo() pid 29930 does not exist.
10/29 01:28:27 ProcAPI::getProcInfo() pid 29930 does not exist.
10/29 01:28:27 ProcAPI::getProcInfo() pid 29930 does not exist.
10/29 01:28:27 ProcAPI::getProcInfo() pid 29930 does not exist.
10/29 01:28:27 ProcAPI::getProcSetInfo(): Pid 29930 does not exist, ignoring.
10/29 01:28:32 ProcAPI::getProcInfo() pid 29930 does not exist.
10/29 01:28:32 ProcAPI::getProcInfo() pid 29930 does not exist.
10/29 01:28:32 ProcAPI::getProcInfo() pid 29930 does not exist.
10/29 01:28:32 ProcAPI::getProcInfo() pid 29930 does not exist.
10/29 01:28:32 ProcAPI::getProcInfo() pid 29930 does not exist.
10/29 01:28:32 ProcAPI::getProcSetInfo(): Pid 29930 does not exist, ignoring.
10/29 01:28:37 ProcAPI::getProcInfo() pid 29930 does not exist.
10/29 01:28:37 ProcAPI::getProcInfo() pid 29930 does not exist.
10/29 01:28:37 ProcAPI::getProcInfo() pid 29930 does not exist.
10/29 01:28:37 ProcAPI::getProcInfo() pid 29930 does not exist.
10/29 01:28:37 ProcAPI::getProcInfo() pid 29930 does not exist.
10/29 01:28:37 ProcAPI::getProcSetInfo(): Pid 29930 does not exist, ignoring.
10/29 01:28:42 ProcAPI::getProcInfo() pid 29930 does not exist.
10/29 01:28:42 ProcAPI::getProcInfo() pid 29930 does not exist.
10/29 01:28:42 ProcAPI::getProcInfo() pid 29930 does not exist.
10/29 01:28:42 ProcAPI::getProcInfo() pid 29930 does not exist.
10/29 01:28:42 ProcAPI::getProcInfo() pid 29930 does not exist.
10/29 01:28:42 ProcAPI::getProcSetInfo(): Pid 29930 does not exist, ignoring.
10/29 01:28:42 ProcAPI::buildFamily() Found daddypid on the system: 29396
10/29 01:28:43 Swap space: 10210556
10/29 01:28:43 1306040 kbytes available for "/usr1/condor/execute"
10/29 01:28:43 Looking up RESERVED_DISK parameter
10/29 01:28:43 Reserving 5120 kbytes for file system
10/29 01:28:43 Disk space: 1300920
10/29 01:28:43 ProcAPI::getProcInfo() pid 29930 does not exist.
10/29 01:28:43 ProcAPI::getProcInfo() pid 29930 does not exist.
10/29 01:28:43 ProcAPI::getProcInfo() pid 29930 does not exist.
10/29 01:28:43 ProcAPI::getProcInfo() pid 29930 does not exist.
10/29 01:28:43 ProcAPI::getProcInfo() pid 29930 does not exist.
10/29 01:28:43 ProcAPI::getProcSetInfo(): Pid 29930 does not exist, ignoring.
10/29 01:28:47 ProcAPI::getProcInfo() pid 29930 does not exist.
10/29 01:28:47 ProcAPI::getProcInfo() pid 29930 does not exist.
10/29 01:28:47 ProcAPI::getProcInfo() pid 29930 does not exist.
10/29 01:28:47 ProcAPI::getProcInfo() pid 29930 does not exist.
10/29 01:28:47 ProcAPI::getProcInfo() pid 29930 does not exist.
10/29 01:28:47 ProcAPI::getProcSetInfo(): Pid 29930 does not exist, ignoring.
10/29 01:28:47 Trying to update collector <10.14.0.12:9618>
10/29 01:28:47 Attempting to send update via UDP to collector ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
10/29 01:28:47 vm1: Sent update to 1 collector(s)
10/29 01:28:48 Trying to update collector <10.14.0.12:9618>
10/29 01:28:48 Attempting to send update via UDP to collector ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
10/29 01:28:48 vm2: Sent update to 1 collector(s)
10/29 01:28:49 Trying to update collector <10.14.0.12:9618>
10/29 01:28:49 Attempting to send update via UDP to collector ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
10/29 01:28:49 vm3: Sent update to 1 collector(s)
10/29 01:28:50 Trying to update collector <10.14.0.12:9618>
10/29 01:28:50 Attempting to send update via UDP to collector ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
10/29 01:28:50 vm4: Sent update to 1 collector(s)
10/29 01:28:52 ProcAPI::getProcInfo() pid 29930 does not exist.
10/29 01:28:52 ProcAPI::getProcInfo() pid 29930 does not exist.
10/29 01:28:52 ProcAPI::getProcInfo() pid 29930 does not exist.
10/29 01:28:52 ProcAPI::getProcInfo() pid 29930 does not exist.
10/29 01:28:52 ProcAPI::getProcInfo() pid 29930 does not exist.
10/29 01:28:52 ProcAPI::getProcSetInfo(): Pid 29930 does not exist, ignoring.
10/29 01:28:57 ProcAPI::getProcInfo() pid 29930 does not exist.
10/29 01:28:57 ProcAPI::getProcInfo() pid 29930 does not exist.
10/29 01:28:57 ProcAPI::getProcInfo() pid 29930 does not exist.
10/29 01:28:57 ProcAPI::getProcInfo() pid 29930 does not exist.
10/29 01:28:57 ProcAPI::getProcInfo() pid 29930 does not exist.
10/29 01:28:57 ProcAPI::getProcSetInfo(): Pid 29930 does not exist, ignoring.
10/29 01:28:57 ProcAPI::buildFamily() Found daddypid on the system: 6859
10/29 01:29:02 ProcAPI::getProcInfo() pid 29930 does not exist.
10/29 01:29:02 ProcAPI::getProcInfo() pid 29930 does not exist.
10/29 01:29:02 ProcAPI::getProcInfo() pid 29930 does not exist.
10/29 01:29:02 ProcAPI::getProcInfo() pid 29930 does not exist.
10/29 01:29:02 ProcAPI::getProcInfo() pid 29930 does not exist.
10/29 01:29:02 ProcAPI::getProcSetInfo(): Pid 29930 does not exist, ignoring.
10/29 01:29:07 ProcAPI::getProcInfo() pid 29930 does not exist.
10/29 01:29:07 ProcAPI::getProcInfo() pid 29930 does not exist.
10/29 01:29:07 ProcAPI::getProcInfo() pid 29930 does not exist.
10/29 01:29:07 ProcAPI::getProcInfo() pid 29930 does not exist.
10/29 01:29:07 ProcAPI::getProcInfo() pid 29930 does not exist.
10/29 01:29:07 ProcAPI::getProcSetInfo(): Pid 29930 does not exist, ignoring.
10/29 01:29:07 ProcAPI::buildFamily() Found daddypid on the system: 26837
10/29 01:29:12 ProcAPI::buildFamily() Found daddypid on the system: 6306
10/29 01:29:12 ProcAPI::getProcInfo() pid 29930 does not exist.
10/29 01:29:12 ProcAPI::getProcInfo() pid 29930 does not exist.
10/29 01:29:12 ProcAPI::getProcInfo() pid 29930 does not exist.
10/29 01:29:12 ProcAPI::getProcInfo() pid 29930 does not exist.
10/29 01:29:12 ProcAPI::getProcInfo() pid 29930 does not exist.
10/29 01:29:17 ProcAPI::getProcInfo() pid 29990 does not exist.
10/29 01:29:17 ProcAPI::getProcInfo() pid 29990 does not exist.
10/29 01:29:17 ProcAPI::getProcInfo() pid 29990 does not exist.
10/29 01:29:17 ProcAPI::getProcInfo() pid 29990 does not exist.
10/29 01:29:17 ProcAPI::getProcInfo() pid 29990 does not exist.
10/29 01:29:17 ProcAPI::getProcSetInfo(): Pid 29990 does not exist, ignoring.
10/29 01:29:22 ProcAPI::getProcInfo() pid 29990 does not exist.
10/29 01:29:22 ProcAPI::getProcInfo() pid 29990 does not exist.
10/29 01:29:22 ProcAPI::getProcInfo() pid 29990 does not exist.
10/29 01:29:22 ProcAPI::getProcInfo() pid 29990 does not exist.
10/29 01:29:22 ProcAPI::getProcInfo() pid 29990 does not exist.
10/29 01:29:22 ProcAPI::getProcSetInfo(): Pid 29990 does not exist, ignoring.
10/29 01:29:27 ProcAPI::getProcInfo() pid 29990 does not exist.
10/29 01:29:27 ProcAPI::getProcInfo() pid 29990 does not exist.
10/29 01:29:27 ProcAPI::getProcInfo() pid 29990 does not exist.
10/29 01:29:27 ProcAPI::getProcInfo() pid 29990 does not exist.
10/29 01:29:27 ProcAPI::getProcInfo() pid 29990 does not exist.
10/29 01:29:27 ProcAPI::getProcSetInfo(): Pid 29990 does not exist, ignoring.
10/29 01:29:32 ProcAPI::getProcInfo() pid 29990 does not exist.
10/29 01:29:32 ProcAPI::getProcInfo() pid 29990 does not exist.
10/29 01:29:32 ProcAPI::getProcInfo() pid 29990 does not exist.
10/29 01:29:32 ProcAPI::getProcInfo() pid 29990 does not exist.
10/29 01:29:32 ProcAPI::getProcInfo() pid 29990 does not exist.
10/29 01:29:32 ProcAPI::getProcSetInfo(): Pid 29990 does not exist, ignoring.
10/29 01:29:32 ProcAPI::buildFamily() Found daddypid on the system: 29396
10/29 01:29:37 ProcAPI::getProcInfo() pid 29990 does not exist.
10/29 01:29:37 ProcAPI::getProcInfo() pid 29990 does not exist.
10/29 01:29:37 ProcAPI::getProcInfo() pid 29990 does not exist.
10/29 01:29:37 ProcAPI::getProcInfo() pid 29990 does not exist.
10/29 01:29:37 ProcAPI::getProcInfo() pid 29990 does not exist.
10/29 01:29:37 ProcAPI::getProcSetInfo(): Pid 29990 does not exist, ignoring.
10/29 01:29:41 Getting monitoring info for pid 3462
10/29 01:29:42 ProcAPI::getProcInfo() pid 29990 does not exist.
10/29 01:29:42 ProcAPI::getProcInfo() pid 29990 does not exist.
10/29 01:29:42 ProcAPI::getProcInfo() pid 29990 does not exist.
10/29 01:29:42 ProcAPI::getProcInfo() pid 29990 does not exist.
10/29 01:29:42 ProcAPI::getProcInfo() pid 29990 does not exist.
10/29 01:29:42 ProcAPI::getProcSetInfo(): Pid 29990 does not exist, ignoring.
10/29 01:29:47 ProcAPI::getProcInfo() pid 29990 does not exist.
10/29 01:29:47 ProcAPI::getProcInfo() pid 29990 does not exist.
10/29 01:29:47 ProcAPI::getProcInfo() pid 29990 does not exist.
10/29 01:29:47 ProcAPI::getProcInfo() pid 29990 does not exist.
10/29 01:29:47 ProcAPI::getProcInfo() pid 29990 does not exist.
10/29 01:29:47 ProcAPI::getProcSetInfo(): Pid 29990 does not exist, ignoring.
10/29 01:29:47 ProcAPI::buildFamily() Found daddypid on the system: 6859



-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date of creation: Sun Oct 29 12:29:26 2006 (1162146569)
Date: Sun, 29 Oct 2006 10:37:10 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu
CC: Erik Espinoza <espinoza_e__AT__ligo.caltech.edu>
Subject: Re: [condor-support #1707] LIGO startd exit with status 44

I found the problem. A rogue condor job created a 213GByte _condor_stdout
file in its execute/dir_pid directory.
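
For anyone chasing a similar full-disk condition, the search for an
oversized job output file can be sketched as below. This is only an
illustration and not part of Condor: the function name largest_files and
its parameters are hypothetical, and the directory to scan (for example
the pool's execute directory) is passed in by the caller.

```python
import heapq
import os


def largest_files(root, count=5):
    """Return up to `count` (size_in_bytes, path) pairs under `root`, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat so that symlinks are measured themselves, not followed.
                sizes.append((os.lstat(path).st_size, path))
            except OSError:
                # The file vanished or is unreadable; skip it.
                continue
    return heapq.nlargest(count, sizes)
```

Calling largest_files on the execute directory and printing the result
would put a runaway _condor_stdout file at the top of the list.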

This ticket can be closed.

A possible enhancement request is to consider if there is an easy way
to pass back the ENOSPC error messages for easier debugging when this
happens again, e.g., avoid having to create a temporary file with the
email message at all (or use a different filesystem such as /tmp)?

Thanks.


On Sun, Oct 29, 2006 at 12:29:26PM -0600, condor-support__AT__cs.wisc.edu wrote:
> Greetings.  (This is an automated response.  There is no need to reply.)
> 
> Your message regarding: 
>   "LIGO startd exit with status 44"
> has been received by the condor-support response tracking system.
> 
> In order to help us track the progress of your request, we ask that you
> include the string:
>   "[condor-support #1707] LIGO startd exit with status 44"
> in the subject line of any further mail about this particular request.
> 
> You can do this by simply replying to this email.
> 
>                         Thank You,
>                         - condor-support response tracking system

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Sun Oct 29 12:37:32 2006 (1162147052)
Subject: Actions

Assigned to tannenba by pavlo
===========================================================================
Date of actions: Mon Oct 30  9:33:36 2006 (1162230761)
Subject: Actions

Assigned to wright by wright
===========================================================================
Date of actions: Mon Oct 30 13:08:37 2006 (1162235317)
From: Derek Wright <wright__AT__cs.wisc.edu>
Subject: Re: [condor-support #1707] LIGO startd exit with status 44
Date: Mon, 30 Oct 2006 12:26:39 -0800
To: condor-support__AT__cs.wisc.edu

>
> I found the problem. A rogue condor job created a 213GByte  
> _condor_stdout
> file in its execute/dir_pid directory.

right.  status 44 from a condor daemon always means it failed to  
write to its log file, which usually means the disk/partition filled  
up.  i tried to find the part of the manual where we talk about this,  
but apparently we never documented this(!).  ugh.
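
Since the usual cause is a full disk or partition, a quick free-space
check on the filesystem holding the log directory is a reasonable first
step. The sketch below is an assumption-laden illustration, not Condor
code: free_space_report and the 100 MiB default threshold are made up
for this example, and the path is whatever directory the caller wants
to check.

```python
import shutil


def free_space_report(path, min_free_bytes=100 * 1024 * 1024):
    """Summarize free space on the filesystem containing `path`.

    `min_free_bytes` is an arbitrary illustrative threshold, not a Condor setting.
    """
    usage = shutil.disk_usage(path)
    return {
        "path": path,
        "total_bytes": usage.total,
        "free_bytes": usage.free,
        # True when less than the threshold remains for new log writes.
        "low": usage.free < min_free_bytes,
    }
```

A report with "low" set to True on the log partition would be consistent
with the failure described above, though it does not prove it.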

> This ticket can be closed.

as it will be shortly. ;)

> A possible enhancement request is to consider if there is an easy way
> to pass back the ENOSPC error messages for easier debugging when this
> happens again, e.g., avoid having to create a temporary file with the
> email message at all (or use a different filesystem such as /tmp)?

pass back ENOSPC where?  the problem is the log file can't be written  
to.  where are we supposed to write out that we failed to write b/c  
of ENOSPC?  we could potentially try to generate a custom, special  
email message (e.g. directly from the condor_startd if that was what  
failed to write to its log file) in this case, with "i can't write to  
my log file and am about to die" and as much info as we can include.   
is that what you mean?  however, the email you got was just the  
default email from the master whenever one of its children die  
unexpectedly.  at that point, it's way too late to include anything  
more meaningful than it already does...

just trying to understand the nature of your suggested feature request.

thanks,
-derek



===========================================================================
Date mail was appended: Mon Oct 30 14:26:47 2006 (1162240008)
Subject: Actions

Status changed from open to pending by wright
===========================================================================
Date of actions: Mon Oct 30 14:32:21 2006 (1162240341)
Date: Mon, 30 Oct 2006 14:41:15 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: espinoza_e__AT__ligo.caltech.edu
Subject: Re: [condor-support #1707] LIGO startd exit with status 44

On Mon, Oct 30, 2006 at 02:26:47PM -0600, condor-support response tracking system wrote:
> >
> > A possible enhancement request is to consider if there is an easy way
> > to pass back the ENOSPC error messages for easier debugging when this
> > happens again, e.g., avoid having to create a temporary file with the
> > email message at all (or use a different filesystem such as /tmp)?
> 
> pass back ENOSPC where?  the problem is the log file can't be written  

In the email alert message.

> to.  where are we supposed to write out that we failed to write b/c  
> of ENOSPC?  we could potentially try to generate a custom, special  
> email message (e.g. directly from the condor_startd if that was what  
> failed to write to its log file) in this case, with "i can't write to  
> my log file and am about to die" and as much info as we can include.   
> is that what you mean?  however, the email you got was just the  

Yes.

> default email from the master whenever one of its children die  
> unexpectedly.  at that point, it's way too late to include anything  
> more meaningful than it already does...
> 
> just trying to understand the nature of your suggested feature request.

Given the architecture you described of all log messages having to go
through the filesystem, I think that rather than implementing another
complex backup scheme (shared memory, /tmp, ...) it would be sufficient
to just document all the exit status codes.

However, please consider adding directly in the email alert either a one
sentence description or a URL pointer to the online documentation that
explains what the various exit status codes mean.
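
[Editor's sketch, not part of the original mail.] The one-sentence description requested here could be driven by a small lookup table. The sketch below is hypothetical Python, not Condor's actual implementation: the table name, the function name, and the "Hint:" wording are assumptions, and the only entry is status 44, the one code whose meaning is stated in this thread.

```python
# Hypothetical table of exit status meanings. Only status 44 is
# documented in this ticket; other codes are deliberately left out.
EXIT_STATUS_NOTES = {
    44: "the daemon failed to write to its log file; "
        "this usually means the disk or partition filled up",
}


def obituary_line(daemon, host, status):
    """Build the alert sentence, adding a hint when the status is known."""
    base = '"{}" on "{}" exited with status {}.'.format(daemon, host, status)
    note = EXIT_STATUS_NOTES.get(status)
    if note is None:
        # Unknown status: fall back to the plain sentence
        return base
    return base + " Hint: " + note + "."
```

An unknown status produces the same sentence as the current alert, so the hint only ever adds information.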

Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Mon Oct 30 16:41:42 2006 (1162248102)
Date: Fri, 6 Apr 2007 13:13:06 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: espinoza_e__AT__ligo.caltech.edu
Subject: Re: [condor-support #1707] LIGO startd exit with status 44

What was the decision on returning more information when this happens,
e.g., adding a URL pointer in the obit email message that directs admins
to the documentation on what various exit code values mean?

Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Fri Apr  6 15:13:27 2007 (1175890407)
Subject: Comments added

user said we can close this ticket if we document exit status meanings
in the manual.  i sent the information to Karen and asked her to add this
info.

Comments added by tannenba

===========================================================================
Date comments were added: Tue Jul  3 17:26:29 2007 (1183501589)
Subject: Actions

Ticket resolved by tannenba
===========================================================================
Date of actions: Tue Jul  3 17:27:18 2007 (1183501638)