LIGO Support Ticket 17239

Ticket Information
  Number:      admin 17239
  User:        anderson@ligo.caltech.edu
  Email:       skoranda__AT__gravity.phys.uwm.edu,fairhurst_s__AT__ligo.caltech.edu
  Status:      open
  Assigned To: tannenba
Date: Wed, 21 Nov 2007 11:48:44 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
CC: Scott Koranda <skoranda__AT__gravity.phys.uwm.edu>
Subject: LIGO: condor_submit stuck in CPU spin-loop
X-Seen-BY: mailfromd 4.1 silica.cs.wisc.edu

Condor version 6.9.4 running on the LIGO Caltech pool currently has
4 condor_submit processes stuck in CPU spin-loops on one of the
submit machines, i.e., from top,

top - 11:29:56 up 4 days, 22:00, 28 users,  load average: 5.70, 6.04, 5.74
Tasks: 1607 total,   6 running, 1601 sleeping,   0 stopped,   0 zombie
Cpu(s): 50.8% us,  0.6% sy,  0.0% ni, 46.2% id,  2.3% wa,  0.0% hi,  0.1% si
Mem:  15903476k total, 15050304k used,   853172k free,   886864k buffers
Swap: 16779884k total,      240k used, 16779644k free, 10338392k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 5989 marion    25   0 17160 3200 2420 R 99.9  0.0   1020:31 condor_submit
12055 marion    25   0 17160 3200 2420 R 99.9  0.0   1573:16 condor_submit
14955 marion    25   0 17160 3196 2420 R 99.9  0.0 478:55.26 condor_submit
24884 marion    25   0 17160 3200 2420 R 99.9  0.0   1415:15 condor_submit
13302 condor    17   0  125m 111m 3596 R  2.6  0.7 253:39.09 condor_schedd


There are no system calls being made by these processes as reported
by strace, and here is example output from lsof on one of these
processes:

[root@ldas-grid ~]# lsof -p 24884
COMMAND     PID   USER   FD   TYPE   DEVICE    SIZE     NODE NAME
condor_su 24884 marion  cwd    DIR     0,20   24576 48012867 /mnt/qfs2/marion/analysis/866188015-866288015/playground (opterondata-cit:/home2)
condor_su 24884 marion  rtd    DIR     8,19    4096        2 /
condor_su 24884 marion  txt    REG     8,19 4209688   368189 /usr/bin/condor_submit
condor_su 24884 marion  mem    REG      0,0                0 [heap] (stat: No such file or directory)
condor_su 24884 marion  mem    REG     8,19  128760  1048602 /lib64/ld-2.3.6.so
condor_su 24884 marion  mem    REG     8,19   27904  1048683 /lib64/libcrypt-2.3.6.so
condor_su 24884 marion  mem    REG     8,19   20312  1048597 /lib64/libdl-2.3.6.so
condor_su 24884 marion  mem    REG     8,19   89800  1048613 /lib64/libresolv-2.3.6.so
condor_su 24884 marion  mem    REG     8,19  825496  2031624 /usr/lib64/libstdc++.so.5.0.7
condor_su 24884 marion  mem    REG     8,19  626832  1048638 /lib64/libm-2.3.6.so
condor_su 24884 marion  mem    REG     8,19   54024  1048801 /lib64/libgcc_s-4.0.2-20051126.so.1
condor_su 24884 marion  mem    REG     8,19 1548560  1048670 /lib64/libc-2.3.6.so
condor_su 24884 marion  mem    REG     8,19   57888  1048603 /lib64/libnss_files-2.3.6.so
condor_su 24884 marion    0r   CHR      1,3             1878 /dev/null
condor_su 24884 marion    1w  FIFO      0,5         14913087 pipe
condor_su 24884 marion    2w  FIFO      0,5         14913087 pipe
condor_su 24884 marion    3r  FIFO      0,5         14911254 pipe
condor_su 24884 marion    4w  FIFO      0,5         14911254 pipe
condor_su 24884 marion    5u  IPv4 14911255              TCP ldas-grid.ligo.caltech.edu:45985 (LISTEN)
condor_su 24884 marion    6u  IPv4 14911256              UDP ldas-grid.ligo.caltech.edu:45985 
condor_su 24884 marion    7r   REG     0,20    1164 48012039 /mnt/qfs2/marion/analysis/866188015-866288015/playground/inspiral_hipe_cat2_veto_playground.thinca2_slides_H1H2L1.sub (opterondata-cit:/home2)
condor_su 24884 marion    8u  sock      0,4         14913114 can't identify protocol


[root@ldas-grid ~]# ident /usr/bin/condor_submit
/usr/bin/condor_submit:
     $CondorVersion: 6.9.4 Aug 30 2007 $
     $CondorPlatform: X86_64-LINUX_RHEL3 $
     $Id: kdb5_err.et 13854 2001-10-25 20:20:57Z tlyu $
     $Id: krb5_err.et 16816 2004-10-13 16:18:27Z lxs $
     $Id: accept_sec_context.c,v 1.30.4.1 2005/10/27 00:53:26 kettimut Exp $
     $Id: acquire_cred.c,v 1.12 2005/04/15 23:37:15 meder Exp $
     $Id: compare_name.c,v 1.22.4.2 2005/07/13 20:17:52 mlink Exp $
     $Id: delete_sec_context.c,v 1.13 2005/04/15 23:37:16 meder Exp $
     $Id: display_name.c,v 1.10 2005/04/15 23:37:16 meder Exp $
     $Id: display_status.c,v 1.19 2005/04/15 23:37:16 meder Exp $
     $Id: import_name.c,v 1.15 2005/04/15 23:37:18 meder Exp $
     $Id: init_sec_context.c,v 1.31.4.3 2005/05/04 16:00:23 meder Exp $
     $Id: inquire_cred.c,v 1.10 2005/04/15 23:37:19 meder Exp $
     $Id: inquire_context.c,v 1.11 2005/04/15 23:37:19 meder Exp $
     $Id: oid_functions.c,v 1.13 2005/04/15 23:37:19 meder Exp $
     $Id: release_cred.c,v 1.5 2005/04/15 23:37:20 meder Exp $
     $Id: release_name.c,v 1.6 2005/04/15 23:37:20 meder Exp $
     $Id: unwrap.c,v 1.17 2005/04/15 23:37:20 meder Exp $
     $Id: verify_mic.c,v 1.12 2005/04/15 23:37:21 meder Exp $
     $Id: wrap.c,v 1.12 2005/04/15 23:37:21 meder Exp $
     $Id: release_buffer.c,v 1.3 2005/04/15 23:37:20 meder Exp $
     $Id: globus_i_gsi_gss_utils.c,v 1.38.4.1 2005/05/04 00:19:37 meder Exp $
     $Id: import_cred.c,v 1.19 2005/04/15 23:37:18 meder Exp $
     $Id: get_mic.c,v 1.7 2005/04/15 23:37:17 meder Exp $
     $GCBVersion: 1.5.0 $
     $GCBBuildDate: Aug 30 2007 $


Here is the full list of process arguments for one of these condor_submit jobs,

marion   24884 23912 99 Nov20 ?        23:43:03 condor_submit -a dag_node_name = 07864dbd2f17da5744666951fc3c10f7 -a +DAGManJobId = 20888877 -a DAGManJobId = 20888877 -a submit_event_notes = DAG Node: 07864dbd2f17da5744666951fc3c10f7 -a macronumslides = 29 -a macrol1triggers =  -a macroh1triggers =  -a macrousertag = $(macrousertag) -a macrogpsstarttime = 866285943 -a macrogpsendtime = 866286543 -a macroh2triggers =  -a macroifotag = SECOND_H1H2L1 -a macroarguments = H1-INSPIRAL_SECOND_H1H2L1-866284318-2048.xml.gz H1-INSPIRAL_SECOND_H1H2L1-866285959-2048.xml.gz H2-INSPIRAL_SECOND_H1H2L1-866284006-2048.xml.gz H2-INSPIRAL_SECOND_H1H2L1-866285926-2048.xml.gz L1-INSPIRAL_SECOND_H1H2L1-866284358-2048.xml.gz L1-INSPIRAL_SECOND_H1H2L1-866285959-2048.xml.gz -a +DAGParentNodeNames = "" inspiral_hipe_cat2_veto_playground.thinca2_slides_H1H2L1.sub


proverbial needle in a haystack.

[root@ldas-grid ~]# gdb -p 24884
GNU gdb Red Hat Linux (6.3.0.0-1.84rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Attaching to process 24884
Reading symbols from /usr/bin/condor_submit...(no debugging symbols found)...done.
Using host libthread_db library "/lib64/libthread_db.so.1".
Reading symbols from /lib64/libcrypt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libcrypt.so.1
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libresolv.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libresolv.so.2
Reading symbols from /usr/lib64/libstdc++.so.5...
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libstdc++.so.5
Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /lib64/libgcc_s.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libgcc_s.so.1
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...
(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_files.so.2
0x00002b66637ee2b7 in strstr () from /lib64/libc.so.6
(gdb) where
#0  0x00002b66637ee2b7 in strstr () from /lib64/libc.so.6
#1  0x00000000004e56e7 in get_special_var ()
#2  0x00000000004e4cd9 in expand_macro ()
#3  0x00000000004a26f7 in condor_param ()
#4  0x000000000049df7a in SetArguments ()
#5  0x00000000004a2c05 in queue ()
#6  0x00000000004a2295 in read_condor_file ()
#7  0x0000000000497f79 in main ()


Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date of creation: Wed Nov 21 13:49:03 2007 (1195674549)
Subject: Actions

Assigned to tannenba by tannenba
===========================================================================
Date of actions: Wed Nov 21 14:42:33 2007 (1195677753)
Date: Wed, 21 Nov 2007 14:50:17 -0600
From: Todd Tannenbaum <tannenba__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #17239] LIGO: condor_submit stuck in CPU spin-loop


> 
> Condor version 6.9.4 running on the LIGO Caltech pool currently has
> 4 condor_submit processes stuck in CPU spin-loops on one of the
> submit machines, i.e., from top,

Yikes.

Please pick one of the spinning condor_submits (one with a smaller image 
size), and kill it in order to produce a core file (kill -ABRT <pid> 
should do the trick).  Then send into this ticket the submit file given 
to condor_submit (if that is easy to do) and either the core file, or 
even better, a pointer to a URL where we can grab the core file.

Thanks,
UW Condor Team (this time you got Todd)


-- 
Todd Tannenbaum                       University of Wisconsin-Madison
Condor Project Research               Department of Computer Sciences
tannenba__AT__cs.wisc.edu                  1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                 Madison, WI 53706-1685

===========================================================================
Date mail was appended: Wed Nov 21 15:24:09 2007 (1195680250)
Subject: Actions

Status changed from open to pending by tannenba
===========================================================================
Date of actions: Wed Nov 21 15:56:47 2007 (1195682207)
Date: Wed, 21 Nov 2007 14:00:46 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: skoranda__AT__gravity.phys.uwm.edu
Subject: Re: [condor-admin #17239] LIGO: condor_submit stuck in CPU spin-loop
X-Seen-BY: mailfromd 4.1 silica.cs.wisc.edu

On Wed, Nov 21, 2007 at 03:24:09PM -0600, condor-admin response tracking system wrote:
> 
> > 
> > Condor version 6.9.4 running on the LIGO Caltech pool currently has
> > 4 condor_submit processes stuck in CPU spin-loops on one of the
> > submit machines, i.e., from top,
> 
> Yikes.
> 
> Please pick one of the spinning condor_submits (one with a smaller image 
> size), and kill it in order to produce a core file (kill -ABRT <pid> 
> should do the trick).  Then send into this ticket the submit file given 
> to condor_submit (if that is easy to do) and either the core file, or 
> even better, a pointer to a URL where we can grab the core file.
> 


Here is the submit file from pid 12055,

[root@ldas-grid marion]# cat /mnt/qfs2/marion/analysis/866088014-866188014/playground/inspiral_hipe_cat2_veto_playground.thinca2_slides_H1H2L1.sub
universe = standard
executable = ../executables/lalapps_thinca
arguments = --h1-h2-distance-cut --h2-slide 10 --h2-triggers $(macroh2triggers) --ifo-tag $(macroifotag) --gps-end-time $(macrogpsendtime) --h1-triggers $(macroh1triggers) --snr-cut 5.5 --debug-level 33 --gps-start-time $(macrogpsstarttime) --iota-cut-h1h2 0.6 --h1-h2-consistency --do-veto --data-type playground_only --h2-kappa 0.6 --h1-slide 0 --h2-veto-file ../H2-CATEGORY_2_VETO_SEGS-866088014-100000.txt --l1-veto-file ../L1-CATEGORY_2_VETO_SEGS-866088014-100000.txt --num-slides $(macronumslides) --v1-slide 15 --parameter-test ellipsoid --h1-kappa 0.6 --h1-epsilon 0.0 --l1-slide 5 --e-thinca-parameter 0.5 --l1-triggers $(macrol1triggers) --multi-ifo-coinc --user-tag $(macrousertag) --h1-veto-file ../H1-CATEGORY_2_VETO_SEGS-866088014-100000.txt --write-compress --h2-epsilon 0.0 $(macroarguments)
environment = KMP_LIBRARY=serial;MKL_SERIAL=yes
priority = 10
log = /usr1/marion/tmpG_BEtp
error = logs/thinca-$(macrogpsstarttime)-$(macrogpsendtime)-$(cluster)-$(process).err
output = logs/thinca-$(macrogpsstarttime)-$(macrogpsendtime)-$(cluster)-$(process).out
notification = never
queue 1


Here is the list of open files:

[root@ldas-grid marion]# lsof -p 12055
COMMAND     PID   USER   FD   TYPE   DEVICE    SIZE     NODE NAME
condor_su 12055 marion  cwd    DIR     0,20   28672 48014725 /mnt/qfs2/marion/analysis/866088014-866188014/playground (opterondata-cit:/home2)
condor_su 12055 marion  rtd    DIR     8,19    4096        2 /
condor_su 12055 marion  txt    REG     8,19 4209688   368189 /usr/bin/condor_submit
condor_su 12055 marion  mem    REG      0,0                0 [heap] (stat: No such file or directory)
condor_su 12055 marion  mem    REG     8,19  128760  1048602 /lib64/ld-2.3.6.so
condor_su 12055 marion  mem    REG     8,19   27904  1048683 /lib64/libcrypt-2.3.6.so
condor_su 12055 marion  mem    REG     8,19   20312  1048597 /lib64/libdl-2.3.6.so
condor_su 12055 marion  mem    REG     8,19   89800  1048613 /lib64/libresolv-2.3.6.so
condor_su 12055 marion  mem    REG     8,19  825496  2031624 /usr/lib64/libstdc++.so.5.0.7
condor_su 12055 marion  mem    REG     8,19  626832  1048638 /lib64/libm-2.3.6.so
condor_su 12055 marion  mem    REG     8,19   54024  1048801 /lib64/libgcc_s-4.0.2-20051126.so.1
condor_su 12055 marion  mem    REG     8,19 1548560  1048670 /lib64/libc-2.3.6.so
condor_su 12055 marion  mem    REG     8,19   57888  1048603 /lib64/libnss_files-2.3.6.so
condor_su 12055 marion    0r   CHR      1,3             1878 /dev/null
condor_su 12055 marion    1w  FIFO      0,5         14214536 pipe
condor_su 12055 marion    2w  FIFO      0,5         14214536 pipe
condor_su 12055 marion    3r  FIFO      0,5         14213904 pipe
condor_su 12055 marion    4w  FIFO      0,5         14213904 pipe
condor_su 12055 marion    5u  IPv4 14213905              TCP ldas-grid.ligo.caltech.edu:40318 (LISTEN)
condor_su 12055 marion    6u  IPv4 14213906              UDP ldas-grid.ligo.caltech.edu:40318 
condor_su 12055 marion    7r   REG     0,20    1164 48014547 /mnt/qfs2/marion/analysis/866088014-866188014/playground/inspiral_hipe_cat2_veto_playground.thinca2_slides_H1H2L1.sub (opterondata-cit:/home2)
condor_su 12055 marion    8u  sock      0,4         14214541 can't identify protocol


The log file is empty,

[root@ldas-grid marion]# ls -l /usr1/marion/tmpG_BEtp
-rw-r--r--  1 marion marion 0 Nov 20 09:15 /usr1/marion/tmpG_BEtp


I have put a gzipped core image, obtained with gcore, at
http://www.ligo.caltech.edu/~anderson/condor.17239/core.12055.gz
However, when I load this into gdb I do not see the full stack trace
that I see when attaching gdb to the running processes:

(gdb) where
#0  0x00002b8261418304 in strstr () from /lib64/libc.so.6
#1  0x00000000004e56e7 in get_special_var ()
#2  0x00000000004e4a40 in expand_macro ()
#3  0x00000000004a26f7 in condor_param ()
#4  0x000000000049df7a in SetArguments ()
#5  0x00000000004a2c05 in queue ()
#6  0x00000000004a2295 in read_condor_file ()
#7  0x0000000000497f79 in main ()


I ran "kill -ABRT" on a different process and there was no core file
left behind, but that job appeared to be restarted and a new spinning
condor_submit process was created.
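
One common reason that SIGABRT leaves no core file behind is a core size
limit of zero in the environment that started the process; the ticket does
not say whether that was the case here. A minimal Python sketch of that
check follows. The function name core_dumps_allowed is made up for this
illustration, and it inspects the limit of the calling process, so it would
have to run in the same environment that launches condor_submit.

```python
import resource


def core_dumps_allowed(limits=None):
    """Return True if the soft core-size limit permits writing a core file.

    `limits` is an optional (soft, hard) pair, mainly for testing; by
    default the limit of the current process is queried.
    """
    if limits is None:
        limits = resource.getrlimit(resource.RLIMIT_CORE)
    soft, _hard = limits
    # A soft limit of 0 suppresses core files; RLIM_INFINITY means unlimited.
    return soft != 0


if __name__ == "__main__":
    print("core dumps allowed:", core_dumps_allowed())
```

A zero limit is only one possible explanation; an unwritable working
directory or a caught signal would have the same visible effect.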

If there are any gdb commands you want me to run on one of the running
processes let me know. We can also give you login access to poke
at one of them if that makes it easier to debug.

Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Wed Nov 21 16:01:04 2007 (1195682465)
Date: Thu, 22 Nov 2007 22:39:33 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: skoranda__AT__gravity.phys.uwm.edu
Subject: Re: [condor-admin #17239] LIGO: condor_submit stuck in CPU spin-loop
X-Seen-BY: mailfromd 4.1 silica.cs.wisc.edu

I took a few more random snapshots a few seconds apart on the same spinning
condor_submit process and found that it is looping through Condor code and
not just stuck in one glibc call to strstr() never-never-land. However,
it does appear to always be in condor_param().


For example, here it is in strcmp() from lookup_macro()

(gdb) where
#0  0x00002b026e60f4b0 in strcmp () from /lib64/libc.so.6
#1  0x00000000004e5af6 in lookup_macro (name=0x94a1f4 "macrousertag", table=0x905200, table_size=32)
    at config.C:989
#2  0x00000000004e5014 in expand_macro (
    value=0x94bb00 "--h1-h2-distance-cut --h2-triggers $(macroh2triggers) --ifo-tag $(macroifotag) --gps-end-time $(macrogpsendtime) --h1-triggers $(macroh1triggers) --snr-cut 5.5 --debug-level 33 --gps-start-time $(macr"..., table=0x905200, table_size=32, self=0x0) at config.C:658
#3  0x00000000004a26f7 in condor_param (name=0x6f2423 "arguments", alt_name=0x703bbd "Args")
    at submit.C:5201
#4  0x000000000049df7a in SetArguments () at submit.C:3417
#5  0x00000000004a2c05 in queue (num=1) at submit.C:5409
#6  0x00000000004a2295 in read_condor_file (fp=0x9318d0) at submit.C:5066
#7  0x0000000000497f79 in main (argc=0, argv=0x7fff3cddff68) at submit.C:877


and then in strstr() from get_special_var(),

(gdb) where
#0  0x00002b026e6102b7 in strstr () from /lib64/libc.so.6
#1  0x00000000004e56e7 in get_special_var (prefix=0x6fd6a2 "$ENV", only_id_chars=true, 
    value=0x94a280 "--h1-h2-distance-cut --h2-triggers  --ifo-tag SECOND_H1H2L1 --gps-end-time 866287943 --h1-triggers  --snr-cut 5.5 --debug-level 33 --gps-start-time 866284862 --iota-cut-h1h2 0.6 --h1-h2-consistency --"..., leftp=0x7fff3cddd850, namep=0x7fff3cddd848, rightp=0x7fff3cddd838) at config.C:789
#2  0x00000000004e4a40 in expand_macro (
    value=0x94bb00 "--h1-h2-distance-cut --h2-triggers $(macroh2triggers) --ifo-tag $(macroifotag) --gps-end-time $(macrogpsendtime) --h1-triggers $(macroh1triggers) --snr-cut 5.5 --debug-level 33 --gps-start-time $(macr"..., table=0x905200, table_size=32, self=0x0) at config.C:564
#3  0x00000000004a26f7 in condor_param (name=0x6f2423 "arguments", alt_name=0x703bbd "Args")
    at submit.C:5201
#4  0x000000000049df7a in SetArguments () at submit.C:3417
#5  0x00000000004a2c05 in queue (num=1) at submit.C:5409
#6  0x00000000004a2295 in read_condor_file (fp=0x9318d0) at submit.C:5066
#7  0x0000000000497f79 in main (argc=0, argv=0x7fff3cddff68) at submit.C:877

and here is an instance in sprintf() from expand_macro()

(gdb) where
#0  0x00002b026e5de046 in vfprintf () from /lib64/libc.so.6
#1  0x00002b026e5fbe79 in vsprintf () from /lib64/libc.so.6
#2  0x00002b026e5e6d18 in sprintf () from /lib64/libc.so.6
#3  0x00000000004e5079 in expand_macro (
    value=0x94bb00 "--h1-h2-distance-cut --h2-triggers $(macroh2triggers) --ifo-tag $(macroifotag) --gps-end-time $(macrogpsendtime) --h1-triggers $(macroh1triggers) --snr-cut 5.5 --debug-level 33 --gps-start-time $(macr"..., table=0x905200, table_size=32, self=0x0) at config.C:665
#4  0x00000000004a26f7 in condor_param (name=0x6f2423 "arguments", alt_name=0x703bbd "Args") at submit.C:5201
#5  0x000000000049df7a in SetArguments () at submit.C:3417
#6  0x00000000004a2c05 in queue (num=1) at submit.C:5409
#7  0x00000000004a2295 in read_condor_file (fp=0x9318d0) at submit.C:5066
#8  0x0000000000497f79 in main (argc=0, argv=0x7fff3cddff68) at submit.C:877
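
Comparing several snapshots by eye works for three samples; for a larger
number, the comparison could be scripted. The Python sketch below parses the
text of gdb "where" output and reports the frames that every sample shares,
counted from main() inward. The helper names (frame_names,
common_outer_frames) are invented for this illustration and are not part of
gdb or Condor.

```python
import re
from typing import List

# Matches gdb frame lines such as
#   "#0  0x00002b026e60f4b0 in strcmp () from /lib64/libc.so.6"
#   "#3  0x00000000004a26f7 in condor_param (name=0x6f2423 ..."
FRAME_RE = re.compile(
    r"^#(\d+)\s+(?:0x[0-9a-fA-F]+\s+in\s+)?([A-Za-z_][\w:~]*)\s*\("
)


def frame_names(backtrace: str) -> List[str]:
    """Return function names from innermost (#0) to outermost frame."""
    names = []
    for line in backtrace.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:  # continuation lines ("at config.C:989") are skipped
            names.append(match.group(2))
    return names


def common_outer_frames(backtraces: List[str]) -> List[str]:
    """Return the frames shared by all samples, starting at main()."""
    stacks = [list(reversed(frame_names(bt))) for bt in backtraces]
    if not stacks or not all(stacks):
        return []
    common = []
    for frames in zip(*stacks):
        if len(set(frames)) != 1:
            break  # the samples diverge below this depth
        common.append(frames[0])
    return common
```

Applied to the three traces above, the shared frames should run from main()
down through condor_param() to expand_macro(), with only the innermost calls
(strcmp, strstr, sprintf) differing between samples.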


-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Fri Nov 23  1:36:45 2007 (1195803406)
Date: Mon, 26 Nov 2007 15:46:14 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: skoranda__AT__gravity.phys.uwm.edu,
    Steve Fairhurst <fairhurst_s__AT__ligo.caltech.edu>
Subject: Re: [condor-admin #17239] LIGO: condor_submit stuck in CPU spin-loop
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu

Here is a simple reproducible test case:

> cat bug.dag
JOB job1 bug.sub
VARS job1 macrooutput="$(macrooutput)"

> cat bug.sub
universe = vanilla
executable = /bin/echo
arguments = $(macrooutput)
getenv = False
log = /usr1/anderson/bug.log
error = bug.err
output = bug.out
notification = never
queue 1

> condor_submit_dag bug.dag
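
The VARS line defines macrooutput in terms of itself, which matches the
"macrousertag = $(macrousertag)" argument seen on the stuck processes. The
Python sketch below is not Condor's expand_macro(); it is a simplified
stand-in, with invented names (expand, MacroLoopError, max_passes), showing
why repeated substitution of such a definition never finishes unless the
expander checks for lack of progress. Treating an undefined macro as an
empty string is also an assumption of this sketch.

```python
import re

# $(name) references, as used in the submit file above.
MACRO_RE = re.compile(r"\$\(([A-Za-z_]\w*)\)")


class MacroLoopError(Exception):
    """Raised when expansion cannot terminate."""


def expand(value, table, max_passes=100):
    """Substitute $(name) references until none remain.

    Hypothetical sketch: one reference is replaced per pass. Without the
    two guards below, a self-referential entry would loop forever.
    """
    for _ in range(max_passes):
        match = MACRO_RE.search(value)
        if match is None:
            return value  # fully expanded
        name = match.group(1)
        replacement = table.get(name, "")  # assumption: undefined -> empty
        expanded = value[:match.start()] + replacement + value[match.end():]
        if expanded == value:
            # The macro was replaced by itself, so no progress is possible.
            raise MacroLoopError(f"macro {name!r} expands to itself")
        value = expanded
    # Catches longer cycles such as a -> b -> a.
    raise MacroLoopError(f"gave up after {max_passes} passes")
```

With the table {"macrooutput": "$(macrooutput)"}, expanding
"$(macrooutput)" raises MacroLoopError on the first pass instead of
spinning.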


A secondary issue is that condor_rm does not successfully kill the
resulting condor_submit process, but leaves an entry in the Condor
job queue in the "X" state until the process is killed manually.
Furthermore, it requires a SIGKILL signal to kill the processes
as it appears that SIGTERM, SIGINT, and SIGQUIT are blocked.
Are these two additional issues bugs or features?
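
Whether those signals are blocked, ignored, or caught could be confirmed on
Linux from the SigBlk, SigIgn, and SigCgt fields of /proc/<pid>/status,
which are hexadecimal bit masks in which bit (n - 1) stands for signal n.
The Python sketch below decodes them; the function names are invented for
this illustration, and the mask in the usage note is a made-up value, not
one read from the processes in this ticket.

```python
def signals_in_mask(mask_hex):
    """Return the signal numbers whose bits are set in a hex mask string."""
    mask = int(mask_hex, 16)
    # Bit (n - 1) corresponds to signal number n.
    return [n for n in range(1, 65) if mask & (1 << (n - 1))]


def read_signal_masks(pid):
    """Read SigBlk, SigIgn and SigCgt for `pid` (Linux /proc only)."""
    masks = {}
    with open(f"/proc/{pid}/status") as status:
        for line in status:
            key, _, value = line.partition(":")
            if key in ("SigBlk", "SigIgn", "SigCgt"):
                masks[key] = signals_in_mask(value.strip())
    return masks
```

For example, a hypothetical SigBlk value of 0000000000004006 decodes to
signals 2, 3, and 15, which on Linux are SIGINT, SIGQUIT, and SIGTERM.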

Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Mon Nov 26 17:46:44 2007 (1196120805)