LIGO Support Ticket 1681

Ticket Information
  Number:      support 1681
  User:        anderson@ligo.caltech.edu
  Email:       espinoza_e__AT__ligo.caltech.edu,duncan__AT__gravity.phys.uwm.edu,patrick__AT__gravity.phys.uwm.edu,leonor__AT__oersted.uoregon.edu
  Status:      resolved
  Assigned To: gquinn
Date: Fri, 15 Sep 2006 21:14:44 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu
CC: Erik Espinoza <espinoza_e__AT__ligo.caltech.edu>, Brown Duncan
 <duncan__AT__gravity.phys.uwm.edu>, Patrick Brady
 <patrick__AT__gravity.phys.uwm.edu>
Subject: LIGO orphaned Condor jobs still running

On the LIGO CIT Condor pool, which is running a pre-release of 6.8.1, one of
our users just noticed that she still had jobs running: she observed
output data continuing to be generated on disk, even though condor_q
indicated that she did not have any jobs in the queue at all, i.e.,
"condor_q username" returned an empty list. The implications of this
for the proper running of a full data analysis pipeline, with or without
DAGMan support, could be rather severe.

I verified by scanning the cluster that she did indeed have 3 processes
running that were started by Condor but whose Unix parent process id
was 1, i.e., they were no longer connected to a condor_starter but had
been inherited by init. For example,

[root@node202 ~]# ps -ef | grep 23627
ileonor  23627     1 81 07:32 ?        10:31:29 /archive/home/ileonor/Grb/S5/ScriptsGeo/mainsearch_geo /archive/home/ileonor/Grb/S5/ScriptsGeo/Inputs/searchparameters_S5_h2g1.12.txt
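
A scan of this kind can be sketched as a filter over `ps -ef` output. This
is only a minimal sketch, not the command that was actually used on the
cluster; the function name `orphans_of_user` is hypothetical, and a parent
pid of 1 on its own does not prove that a process was started by Condor.

```shell
# Print the PID of every process owned by the given user whose parent is
# init (PPID 1). Reads `ps -ef`-style lines on stdin, where the first three
# columns are UID, PID and PPID.
#   $1: user name to look for
orphans_of_user() {
    awk -v user="$1" '$1 == user && $3 == 1 { print $2 }'
}
```

On a worker node this could be run as `ps -ef | orphans_of_user ileonor`.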

By comparison, the condor_master and condor_startd Unix process start times were
back on Sep 5 (today is Sep 15), i.e., this was not triggered by a restart
of Condor or by the master/startd crashing:

[root@node202 ~]# ps -ef | grep condor_     
condor   32435     1  0 Sep05 ?        00:04:27 /ldcg/condor/sbin/condor_master
condor   32436 32435  0 Sep05 ?        00:32:07 condor_startd -f
condor   32437 32435  0 Sep05 ?        00:00:00 condor_ckpt_server
condor    5933 32436  0 19:40 ?        00:00:02 condor_starter -f -append boinc -job-keyword boinc
nobody    5934  5933  0 19:40 ?        00:00:00 condor_exec.exe -update_prefs http://einstein.phys.uwm.edu/ -return_results_immediately


Process 23627 was registered within Condor at the same time as the kernel
reports the process was created, i.e., 7:32AM:

[root@node202 log]# grep ^9/15.*23627 MasterLog | head
9/15 07:32:55 Pid 23627 is in family of 23626
9/15 07:32:55 ProcFamily: parent: 32436 family: 32436 21622 21623 21624 23415 23416 23605 23606 23607 23608 23609 23610 23624 23626 23627
9/15 07:33:55 Pid 23627 is in family of 23626
9/15 07:33:55 ProcFamily: parent: 32436 family: 32436 21622 21623 21624 23415 23416 23624 23626 23627 23646 23648
9/15 07:34:55 Pid 23627 is in family of 23626
9/15 07:34:55 ProcFamily: parent: 32436 family: 32436 21622 21623 21624 23415 23416 23624 23626 23627 23646 23648
9/15 07:35:55 Pid 23627 is in family of 23626
9/15 07:35:55 ProcFamily: parent: 32436 family: 32436 21622 21623 21624 23415 23416 23624 23626 23627 23646 23648
9/15 07:36:55 Pid 23627 is in family of 23626
9/15 07:36:55 ProcFamily: parent: 32436 family: 32436 21622 21623 21624 23415 23416 23624 23626 23627 23646 23648
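
To find when a pid drops out of the tracked family, the ProcFamily lines can
be scanned for the last one that still lists it. The sketch below assumes
the line layout shown in the excerpt above (date, time, "ProcFamily:",
"parent:", parent pid, "family:", then the member pids); the function name
`last_seen_in_family` is hypothetical.

```shell
# Print the timestamp of the last ProcFamily line that lists the given PID.
# Assumed line layout:
#   <date> <time> ProcFamily: parent: <pid> family: <pid> <pid> ...
#   $1: pid to look for; log lines are read from stdin
last_seen_in_family() {
    awk -v pid="$1" '
        $3 == "ProcFamily:" {
            # Field 7 onward holds the family members.
            for (i = 7; i <= NF; i++)
                if ($i == pid) { seen = $1 " " $2; break }
        }
        END { if (seen != "") print seen }
    '
}
```

For example: `last_seen_in_family 23627 < MasterLog`. Nothing is printed if
the pid never appears in a family list.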

I will post the full set of condor log files from this machine (node202) for
all of today (9/15) at http://www.ligo.caltech.edu/~anderson/condor.XYZ
where XYZ is the support ticket number once it has been assigned to me.

P.S. Perhaps this is related to [condor-support #1679], where we have
seen duplicate rescue DAG processes running on the head node,
only this time it is happening on the worker nodes and is not associated
with a Condor daemon crashing.

Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date of creation: Fri Sep 15 23:15:04 2006 (1158380107)
Subject: Actions

Assigned to gquinn by gquinn
===========================================================================
Date of actions: Mon Sep 18 10:58:50 2006 (1158595130)
Date: Mon, 18 Sep 2006 11:00:54 -0500
From: Greg Quinn <gquinn__AT__cs.wisc.edu>
To: condor-support__AT__cs.wisc.edu
Subject: Re: [condor-support #1681] LIGO orphaned Condor jobs still running

Stuart,

I've begun looking into this issue. I'll let you know how my 
investigation goes as I learn more.

Greg Quinn
Condor Team

===========================================================================
Date mail was appended: Mon Sep 18 11:00:44 2006 (1158595245)
Date: Wed, 20 Sep 2006 12:45:23 -0500
From: Greg Quinn <gquinn__AT__cs.wisc.edu>
To: condor-support__AT__cs.wisc.edu
Subject: Re: [condor-support #1681] LIGO orphaned Condor jobs still running

> I will post the full set of condor log files from this machine (node202) for
> all of today (9/15) at http://www.ligo.caltech.edu/~anderson/condor.XYZ
> where XYZ is the support ticket number once it has been assigned to me.
> 
> P.S. Perhaps this is related to [condor-support #1679] where we have
> seen duplicating rescue dag processes running on the head node,
> only this time it is happening on the worker nodes, and not associated
> with a condor daemon crashing.

Hello,

It appears that the process that is being left behind was spawned by a 
shell script in the user job. One thing that will help us debug this is 
knowing if that shell script sets up a custom environment for the child. 
In particular, Condor uses a set of environment variables (prefixed 
_CONDOR_ANCESTOR) to track process families. Is it possible that the 
child of the shell script is not getting these variables in its 
environment? In this case, Condor should still be able to track the 
process and kill it when the job is evicted, but knowing the answer will 
help us in debugging.
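
On Linux, one way to check a single process is to look for the prefix in its
/proc/<pid>/environ file, which holds NUL-separated NAME=value entries. This
is a minimal sketch; the function name `has_condor_ancestor` is
hypothetical, and only the _CONDOR_ANCESTOR prefix named above is assumed,
not the full variable names or their values.

```shell
# Succeed (exit status 0) if the NUL-separated environment file named by $1
# contains a variable whose name starts with _CONDOR_ANCESTOR.
#   $1: path to an environment file, e.g. /proc/<pid>/environ
has_condor_ancestor() {
    tr '\000' '\n' < "$1" | grep -q '^_CONDOR_ANCESTOR'
}
```

For example: `has_condor_ancestor /proc/17853/environ && echo "variable set"`.
Reading another user's environ file normally requires root.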

It does appear as though this issue and the one from [condor-support 
#1679] could be related.

Greg

> 
> Thanks.
> 


===========================================================================
Date mail was appended: Wed Sep 20 12:45:29 2006 (1158774329)
Date: Wed, 20 Sep 2006 21:49:13 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>,
 leonor__AT__oersted.uoregon.edu
CC: espinoza_e__AT__ligo.caltech.edu, duncan__AT__gravity.phys.uwm.edu,
 patrick__AT__gravity.phys.uwm.edu
Subject: Re: [condor-support #1681] LIGO orphaned Condor jobs still running

Isabel,
	It would be helpful to the Condor development team in tracking
down the bug you found with detached/phantom jobs running in the LDAS-CIT
cluster if you could provide a copy of the shell script you submitted to
Condor that started the program found still running, e.g.,

[root@node202 ~]# ps -ef | grep 23627
ileonor  23627     1 81 07:32 ?        10:31:29
/archive/home/ileonor/Grb/S5/ScriptsGeo/mainsearch_geo
/archive/home/ileonor/Grb/S5/ScriptsGeo/Inputs/searchparameters_S5_h2g1.12.txt

Please also see the questions below from Greg.

Thanks.



Greg,
	I just took a look at one of our cluster nodes that is
currently running one of Isabel's jobs (user ileonor), and it appears
that the Condor daemons have CONDOR_ANCESTOR set but only 1 out of
the 4 user jobs does. In case this is the same job Isabel was
running when she first reported the problem, I have included her
trivial script below. However, why would pids 20980 and 25166, which are
standard universe jobs, not have this CONDOR_ANCESTOR environment variable set?

[root@node5 ~]# top | head -15
top - 21:34:06 up 19 days,  1:07,  1 user,  load average: 4.06, 3.90, 3.92
Tasks: 131 total,   5 running, 126 sleeping,   0 stopped,   0 zombie
Cpu(s): 45.5% us,  7.6% sy, 32.5% ni,  9.4% id,  5.0% wa,  0.0% hi,  0.1% si
Mem:   4100748k total,  3494836k used,   605912k free,    64404k buffers
Swap: 10482404k total,       92k used, 10482312k free,  2547828k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                         
20980 channa    25   0  333m 329m  688 R 99.9  8.2 243:20.25 condor_exec.746                                                                 
24076 spoprock  25   0 92804  83m 5516 R 99.9  2.1  62:59.44 xdetection                                                                      
17853 ileonor   25   0  118m  95m 4852 R 98.4  2.4 438:19.32 mainsearch_geo                                                                  
25166 cokelaer  25   0  236m 232m  952 R 98.4  5.8   5:19.76 condor_exec.746                                                                 
25323 root      15   0  7424 1004  728 R  2.0  0.0   0:00.01 top                                                                             
    1 root      16   0  4876  592  492 S  0.0  0.0   0:02.70 init                                                                            
    2 root      RT   0     0    0    0 S  0.0  0.0   0:00.07 migration/0                                                                     
    3 root      34  19     0    0    0 S  0.0  0.0   0:01.20 ksoftirqd/0                                                                     
[root@node5 ~]# ps -ef | grep condor_
condor   17344     1  0 Sep05 ?        00:06:18 /ldcg/condor/sbin/condor_master
condor   17345 17344  0 Sep05 ?        00:47:12 condor_startd -f
condor   17346 17344  0 Sep05 ?        00:00:03 condor_ckpt_server
condor   17851 17345  0 14:03 ?        00:00:14 condor_starter -f -a vm4 ldas-grid.ligo.caltech.edu
condor   20979 17345  0 17:30 ?        00:00:00 condor_starter ldas-grid.ligo.caltech.edu <10.14.1.5:40401> -a vm3
channa   20980 20979 99 17:30 ?        04:03:28 [condor_exec.746]
condor   24075 17345  0 20:30 ?        00:00:02 condor_starter -f -a vm1 ldas-grid.ligo.caltech.edu
condor   25163 17345  0 21:28 ?        00:00:00 condor_starter ldas-grid.ligo.caltech.edu <10.14.1.5:40401> -a vm2
cokelaer 25166 25163 99 21:28 ?        00:05:27 [condor_exec.746]
root     25332 25179  0 21:34 pts/0    00:00:00 grep condor_

[root@node5 ~]# grep CONDOR_ANCESTOR /proc/*/environ 
Binary file /proc/17345/environ matches
Binary file /proc/17346/environ matches
Binary file /proc/17851/environ matches
Binary file /proc/17852/environ matches
Binary file /proc/20979/environ matches
Binary file /proc/24075/environ matches
Binary file /proc/24076/environ matches
Binary file /proc/25163/environ matches

#
# Focusing on Isabel's job (pid 17853):
# 

[root@node5 ~]# ps -ef | grep 17852
ileonor  17852 17851  0 14:03 ?        00:00:00 /bin/sh /archive/home/ileonor/Grb/S5/ScriptsGeo/mainsearch_S5.sh /archive/home/ileonor/Grb/S5/ScriptsGeo/Inputs/searchparameters_S5_l1g1.26.txt
ileonor  17853 17852 97 14:03 ?        07:21:48 /archive/home/ileonor/Grb/S5/ScriptsGeo/mainsearch_geo /archive/home/ileonor/Grb/S5/ScriptsGeo/Inputs/searchparameters_S5_l1g1.26.txt

[root@node5 ~]# pstree 17851
condor_starter---mainsearch_S5.s---mainsearch_geo

#
# This appears to be the script in question
#

[root@node5 ~]# cat /archive/home/ileonor/Grb/S5/ScriptsGeo/mainsearch_S5.sh
#!/bin/sh

export MAINSEARCH=/archive/home/ileonor/Grb/S5/ScriptsGeo/mainsearch_geo
$MAINSEARCH $1


Thanks.


-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Wed Sep 20 23:49:30 2006 (1158814170)
Date: Thu, 21 Sep 2006 14:12:19 -0700 (PDT)
From: Isabel Leonor <leonor__AT__oersted.uoregon.edu>
To: Stuart Anderson <anderson__AT__ligo.caltech.edu>
CC: condor-support response tracking system <condor-support__AT__cs.wisc.edu>,
 espinoza_e__AT__ligo.caltech.edu, duncan__AT__gravity.phys.uwm.edu,
 patrick__AT__gravity.phys.uwm.edu
Subject: Re: [condor-support #1681] LIGO orphaned Condor jobs still running


Hi, 

	The shell script I submitted to condor is indeed a trivial one.
Here is a copy (where mainsearch_geo is a compiled matlab function):

[ileonor@ldas-grid ScriptsGeo]$ more mainsearch_S5.sh
#!/bin/sh

export MAINSEARCH=/archive/home/ileonor/Grb/S5/ScriptsGeo/mainsearch_geo
$MAINSEARCH $1


Thanks,
Isabel



===========================================================================
Date mail was appended: Thu Sep 21 16:49:36 2006 (1158875376)
Date: Sat, 30 Sep 2006 16:07:19 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>,
 tannenba__AT__cs.wisc.edu
CC: espinoza_e__AT__ligo.caltech.edu, duncan__AT__gravity.phys.uwm.edu
Subject: Re: [condor-support #1681] LIGO orphaned Condor jobs still running

Greg, Todd,

A full scan of our 1300-machine Condor pool shows that we had 314 orphaned
processes left. This is a serious problem, and I believe Todd mentioned
a week ago that this is now understood in terms of a drifting Condor
measurement of the Unix process lifetime. Is that theory still holding,
and if so, what is the time estimate for a fix?

One of the orphans was un-killable with "kill -9" as root because the job
was stuck in the Linux kernel D state, so that one is not a Condor problem.
One of the others was a vanilla universe job that was killable
(the original job reported in this problem ticket), and the remaining
312 processes are all related to parallel universe jobs, where it
is presumably harder for Condor to keep track of all the ssh connections
made for these MPI jobs, or perhaps because these are longer running jobs.
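
Separating the D-state orphans from the rest can be sketched as a filter over
`ps` output. This is only an illustration, not the scan that produced the
numbers above; the function name `classify_orphans` and the output labels
are hypothetical.

```shell
# For each process with PPID 1 in "PID PPID STAT USER" input, print the pid,
# the user, and whether the process is in uninterruptible sleep (STAT
# starting with D). Such a process does not act on signals, including
# SIGKILL, until it leaves that state.
classify_orphans() {
    awk '$2 == 1 {
        if ($3 ~ /^D/) print $1, $4, "D-state"
        else           print $1, $4, "other"
    }'
}
```

For example: `ps -eo pid=,ppid=,stat=,user= | classify_orphans`. System
daemons also have a parent pid of 1, so the output still has to be
restricted to the accounts that submit jobs.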

Duncan,
	Is there any more useful information you want to pass on regarding
your parallel universe jobs that might help with fixing this?

Thanks.


-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Sat Sep 30 18:09:25 2006 (1159657766)
Date: Thu, 05 Oct 2006 13:53:29 -0500
To: Stuart Anderson <anderson__AT__ligo.caltech.edu>, condor-support
 response tracking system <condor-support__AT__cs.wisc.edu>
From: Todd Tannenbaum <tannenba__AT__cs.wisc.edu>
Subject: Re: [condor-support #1681] LIGO orphaned Condor jobs still running
CC: espinoza_e__AT__ligo.caltech.edu, duncan__AT__gravity.phys.uwm.edu,
 gthain__AT__cs.wisc.edu

At 06:07 PM 9/30/2006, Stuart Anderson wrote:
>Greg, Todd,
>
>A full scan of our 1300 machine condor pool shows that we had 314 orphaned
>processes left. This is a serious problem, and I believe Todd mentioned
>a week ago that this is now understood in terms of a drifting Condor
>measurement of the Unix process lifetime. Is that theory still holding,
>and if so, what is the time estimate for a fix?


The theory is still holding, and the code to fix it has been written 
and committed into the v6.8 branch in CVS.  Thus it will appear in 
v6.8.2.  We can send you a "preview" of v6.8.2 for testing this fix 
anytime you wish --- it is supposed to go onto our UW machines this 
week, and onto the web next week.
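
The ticket does not show how Condor measures process lifetimes, so the
following is only an illustration of why a drifting measurement can matter,
not Condor's code. It assumes a tracker that identifies a process by its pid
plus an estimated birth time: if that estimate shifts between polls, an
exact comparison would conclude that the pid now belongs to a different
process, whereas a comparison with a tolerance would not. The function name
`same_process` and the tolerance parameter are hypothetical.

```shell
# Succeed if two birth-time estimates (seconds since the epoch) for the same
# pid agree to within a tolerance, i.e. plausibly describe the same process.
#   $1: birth time recorded when the process was first seen
#   $2: birth time recomputed on a later poll
#   $3: allowed drift in seconds (hypothetical tuning knob)
same_process() {
    delta=$(( $1 - $2 ))
    if [ "$delta" -lt 0 ]; then delta=$(( 0 - delta )); fi
    [ "$delta" -le "$3" ]
}
```

For example, `same_process 1158330775 1158330778 5` succeeds, while the same
call with a tolerance of 0 fails.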



>One of the orphans was un-killable with "kill -9" as root as the job
>was stuck in the Linux kernel D state, so that is not a Condor problem.
>One of the others was a vanilla universe job that was killable
>(the original job reported in this problem ticket), and the remaining
>312 processes are all related to parallel universe jobs, where it
>is presumably harder for Condor to keep track of all the ssh connections
>made for these MPI jobs, or perhaps because these are longer running jobs.

The parallel universe jobs would be susceptible to the same bugs 
fixed above.  Yes, since they run longer they are more susceptible.

These are parallel universe jobs, right?  Not mpi universe?  Assuming
the former, are you using the default "mpich" launch scripts that Condor
ships with, or did you roll your own?  Also, Condor is started as user
root on your execute machines, correct?  (since sshd is setuid to root)

Meanwhile, Greg is checking the code path for parallel universe jobs, 
making certain they kill descendants the same way as vanilla does (we 
think that is the case, but Greg is making certain).

best regards,
Todd

p.s. I am now (finally) back in the office after being out sick.


-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Todd Tannenbaum                       University of Wisconsin-Madison
Condor Project Research               Department of Computer Sciences
tannenba__AT__cs.wisc.edu                  1210 W. Dayton St. Rm #4257
http://www.cs.wisc.edu/~tannenba      Madison, WI 53706-1685
Phone: (608) 263-7132  FAX: (608) 262-9777


===========================================================================
Date mail was appended: Thu Oct  5 14:04:42 2006 (1160075082)
Date: Sun, 8 Oct 2006 20:06:34 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: espinoza_e__AT__ligo.caltech.edu, duncan__AT__gravity.phys.uwm.edu,
 patrick__AT__gravity.phys.uwm.edu, leonor__AT__oersted.uoregon.edu
Subject: Re: [condor-support #1681] LIGO orphaned Condor jobs still running

On Thu, Oct 05, 2006 at 02:04:42PM -0600, condor-support response tracking system wrote:
> At 06:07 PM 9/30/2006, Stuart Anderson wrote:
> >Greg, Todd,
> >
> >A full scan of our 1300 machine condor pool shows that we had 314 orphaned
> >processes left. This is a serious problem, and I believe Todd mentioned
> >a week ago that this is now understood in terms of a drifting Condor
> >measurement of the Unix process lifetime. Is that theory still holding,
> >and if so, what is the time estimate for a fix?
> 
> 
> The theory is still holding, and the code to fix it has been written 
> and committed into the v6.8 branch in CVS.  Thus it will appear in 
> v6.8.2.  We can send you a "preview" of v6.8.2 for testing this fix 
> anytime you wish --- it is supposed to go onto our UW machines this 
> week, and onto the web next week.

We will wait for 6.8.2.

> 
> 
> 
> >One of the orphans was un-killable with "kill -9" as root as the job
> >was stuck in the Linux kernel D state, so that is not a Condor problem.
> >One of the others was a vanilla universe job that was killable
> >(the original job reported in this problem ticket), and the remaining
> >312 processes are all related to parallel universe jobs, where it
> >is presumably harder for Condor to keep track of all the ssh connections
> >made for these MPI jobs, or perhaps because these are longer running jobs.
> 
> The parallel universe jobs would be susceptible to the same bugs 
> fixed above.  Yes, since they run longer they are more susceptible.
> 
> These are parallel universe jobs using, right?  Not mpi 
> universe?  Assuming the former, are you using the default "mpich" 
> launch scripts that Condor shipped with, or did you roll your 
> own?  Also, Condor is started as user root on your execute machines, 
> correct?  (since sshd is setuid to root)

They are MPI jobs run in the Parallel Universe using a script that Duncan
rolled himself, and Condor is started as root on all our machines. Duncan
had problems with the Condor-shipped MPI example script that we should
follow up on sometime when there are only less important bugs left to fix.

However, I am confused by your comment about sshd being setuid root.
On our execute machines, condor_master is started from an /etc/init.d/condor
startup script that is run as root, i.e., ssh is not involved in starting
condor.

> 
> Meanwhile, Greg is checking the code path for parallel universe jobs, 
> making certain they kill descendants the same way as vanilla does (we 
> think that is the case, but Greg is making certain).

Thanks.
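
A note on the _CONDOR_ANCESTOR environment variables that Greg asks about
in the Sep 20 message quoted below: the following is a minimal sketch, not
anything taken from Condor itself, of one way to check whether a running
process still carries variables with that prefix. It assumes a Linux
/proc filesystem and enough privilege (root or the job owner) to read
/proc/<pid>/environ, which shows the environment the process was started
with. The function names are made up for illustration.

```python
def condor_ancestor_vars(environ_bytes):
    """Return the variables whose names start with _CONDOR_ANCESTOR.

    environ_bytes is the raw content of /proc/<pid>/environ:
    NUL-separated NAME=VALUE entries.
    """
    found = {}
    for entry in environ_bytes.split(b"\0"):
        if b"=" not in entry:
            continue  # skip empty or malformed entries
        key, _, value = entry.partition(b"=")
        name = key.decode("utf-8", "replace")
        if name.startswith("_CONDOR_ANCESTOR"):
            found[name] = value.decode("utf-8", "replace")
    return found


def read_environ(pid):
    """Read the raw start-up environment of a process (Linux only)."""
    with open(f"/proc/{pid}/environ", "rb") as f:
        return f.read()
```

For a suspect pid, condor_ancestor_vars(read_environ(pid)) returning an
empty dict would suggest that a wrapper script replaced the environment
before starting the child.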

> 
> best regards,
> Todd
> 
> p.s. I am now (finally) back in the office after being out sick.
> 
> >Duncan,
> >         Is there any more useful information you want to pass on regarding
> >your parallel universe jobs that might help with fixing this?
> >
> >Thanks.
> >
> >
> >On Wed, Sep 20, 2006 at 12:45:29PM -0600, condor-support response 
> >tracking system wrote:
> > > > I will post the full set of condor log files from this machine
> > > > (node202) for all of today (9/15) at
> > > > http://www.ligo.caltech.edu/~anderson/condor.XYZ
> > > > where XYZ is the support ticket number once it has been assigned to me.
> > > >
> > > > P.S. Perhaps this is related to [condor-support #1679] where we have
> > > > seen duplicate rescue DAG processes running on the head node,
> > > > only this time it is happening on the worker nodes, and not associated
> > > > with a condor daemon crashing.
> > >
> > > Hello,
> > >
> > > It appears that the process that is being left behind was spawned by a
> > > shell script in the user job. One thing that will help us debug this is
> > > knowing if that shell script sets up a custom environment for the child.
> > > In particular, Condor uses a set of environment variables (prefixed
> > > _CONDOR_ANCESTOR) to track process families. Is it possible that the
> > > child of the shell script is not getting these variables in its
> > > environment? In this case, Condor should still be able to track the
> > > process and kill it when the job is evicted, but knowing the answer will
> > > help us in debugging.
> > >
> > > It does appear as though this issue and the one from [condor-support
> > > #1679] could be related.
> > >
> > > Greg
> > >
> > > >
> > > > Thanks.
> > > >
> > >
> > >
> > >
> > > ========================================
> > > MESSAGE INFORMATION
> > > ========================================
> > > * From: Greg Quinn <gquinn__AT__cs.wisc.edu>
> > > * Ticket Email List: anderson__AT__ligo.caltech.edu, 
> > espinoza_e__AT__ligo.caltech.edu,duncan__AT__gravity.phys.uwm.edu,patrick__AT__gravity.phys.uwm.edu
> > >
> > > --
> > > ======================================================================
> > > This mail was sent from the RUST Mail System
> > > Please direct all replies to condor-support__AT__cs.wisc.edu
> > > Please include the current subject line in your reply.
> > > ======================================================================
> > >
> >
> >--
> >Stuart Anderson  anderson__AT__ligo.caltech.edu
> >http://www.ligo.caltech.edu/~anderson
> 
> 
> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
> Todd Tannenbaum                       University of Wisconsin-Madison
> Condor Project Research               Department of Computer Sciences
> tannenba__AT__cs.wisc.edu                  1210 W. Dayton St. Rm #4257
> http://www.cs.wisc.edu/~tannenba      Madison, WI 53706-1685
> Phone: (608) 263-7132  FAX: (608) 262-9777
> 
> 
> 
> ========================================
> MESSAGE INFORMATION
> ========================================
> * From: Todd Tannenbaum <tannenba__AT__cs.wisc.edu>
> * Ticket Email List: anderson__AT__ligo.caltech.edu, espinoza_e__AT__ligo.caltech.edu,duncan__AT__gravity.phys.uwm.edu,patrick__AT__gravity.phys.uwm.edu,leonor__AT__oersted.uoregon.edu
> 
> -- 
> ======================================================================
> This mail was sent from the RUST Mail System
> Please direct all replies to condor-support__AT__cs.wisc.edu
> Please include the current subject line in your reply.
> ======================================================================
> 

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Sun Oct  8 22:07:04 2006 (1160363225)
Date: Sat, 28 Oct 2006 14:45:00 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: espinoza_e__AT__ligo.caltech.edu, duncan__AT__gravity.phys.uwm.edu,
 patrick__AT__gravity.phys.uwm.edu
Subject: Re: [condor-support #1681] LIGO orphaned Condor jobs still running

Todd,

	The fix appears to be working: we have not had any orphaned processes since
upgrading to Condor 6.8.2.  Please consider closing this ticket.

Thanks.
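
For reference, a check for orphans of the kind described in this ticket
(job processes whose Unix parent pid is 1) could be sketched roughly as
follows. This is an illustrative sketch, not the scan actually used on the
CIT cluster: the function names and the list of system accounts to ignore
are assumptions. Ordinary daemons are also children of init, so the output
is only a list of candidates to inspect by hand.

```python
import subprocess


def find_orphans(ps_output, system_users=("root", "condor")):
    """Return (pid, user, command) for processes whose parent is init.

    ps_output is the text printed by `ps -eo pid,ppid,user,args`.
    system_users is an assumed list of accounts to ignore.
    """
    orphans = []
    for line in ps_output.splitlines():
        fields = line.split(None, 3)
        if len(fields) < 4 or not fields[0].isdigit():
            continue  # skip the header line and anything malformed
        pid, ppid, user, args = fields
        if ppid == "1" and user not in system_users:
            orphans.append((int(pid), user, args))
    return orphans


def scan_local_node():
    """Run ps on the local machine and return the orphan candidates."""
    result = subprocess.run(
        ["ps", "-eo", "pid,ppid,user,args"],
        capture_output=True, text=True, check=True,
    )
    return find_orphans(result.stdout)
```

Running scan_local_node() on each execute node and collecting the results
would give a pool-wide list to review.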


On Thu, Oct 05, 2006 at 02:04:42PM -0600, condor-support response tracking system wrote:
> At 06:07 PM 9/30/2006, Stuart Anderson wrote:
> >Greg, Todd,
> >
> >A full scan of our 1300-machine Condor pool shows that we had 314 orphaned
> >processes left. This is a serious problem, and I believe Todd mentioned
> >a week ago that this is now understood in terms of a drifting Condor
> >measurement of the Unix process lifetime. Is that theory still holding,
> >and if so, what is the time estimate for a fix?
> 
> 
> The theory is still holding, and the code to fix it has been written 
> and committed into the v6.8 branch in CVS.  Thus it will appear in 
> v6.8.2.  We can send you a "preview" of v6.8.2 for testing this fix 
> anytime you wish --- it is supposed to go onto our UW machines this 
> week, and onto the web next week.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Sat Oct 28 16:45:50 2006 (1162071951)
Subject: Actions

Ticket resolved by tannenba
===========================================================================
Date of actions: Tue Jul  3 16:59:00 2007 (1183499941)