LIGO Support Ticket 17239
Ticket Information
Number: admin 17239
User: anderson@ligo.caltech.edu
Email: skoranda__AT__gravity.phys.uwm.edu,fairhurst_s__AT__ligo.caltech.edu
Status: open
Assigned To: tannenba
Date: Wed, 21 Nov 2007 11:48:44 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
CC: Scott Koranda <skoranda__AT__gravity.phys.uwm.edu>
Subject: LIGO: condor_submit stuck in CPU spin-loop
X-Seen-BY: mailfromd 4.1 silica.cs.wisc.edu
Condor version 6.9.4 running on the LIGO Caltech pool currently has
4 condor_submit processes stuck in CPU spin-loops on one of the
submit machines, i.e., from top,
top - 11:29:56 up 4 days, 22:00, 28 users, load average: 5.70, 6.04, 5.74
Tasks: 1607 total, 6 running, 1601 sleeping, 0 stopped, 0 zombie
Cpu(s): 50.8% us, 0.6% sy, 0.0% ni, 46.2% id, 2.3% wa, 0.0% hi, 0.1% si
Mem: 15903476k total, 15050304k used, 853172k free, 886864k buffers
Swap: 16779884k total, 240k used, 16779644k free, 10338392k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5989 marion 25 0 17160 3200 2420 R 99.9 0.0 1020:31 condor_submit
12055 marion 25 0 17160 3200 2420 R 99.9 0.0 1573:16 condor_submit
14955 marion 25 0 17160 3196 2420 R 99.9 0.0 478:55.26 condor_submit
24884 marion 25 0 17160 3200 2420 R 99.9 0.0 1415:15 condor_submit
13302 condor 17 0 125m 111m 3596 R 2.6 0.7 253:39.09 condor_schedd
There are no system calls being made by these processes as reported
by strace, and here is example output from lsof on one of these
processes:
[root@ldas-grid ~]# lsof -p 24884
COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
condor_su 24884 marion cwd DIR 0,20 24576 48012867 /mnt/qfs2/marion/analysis/866188015-866288015/playground (opterondata-cit:/home2)
condor_su 24884 marion rtd DIR 8,19 4096 2 /
condor_su 24884 marion txt REG 8,19 4209688 368189 /usr/bin/condor_submit
condor_su 24884 marion mem REG 0,0 0 [heap] (stat: No such file or directory)
condor_su 24884 marion mem REG 8,19 128760 1048602 /lib64/ld-2.3.6.so
condor_su 24884 marion mem REG 8,19 27904 1048683 /lib64/libcrypt-2.3.6.so
condor_su 24884 marion mem REG 8,19 20312 1048597 /lib64/libdl-2.3.6.so
condor_su 24884 marion mem REG 8,19 89800 1048613 /lib64/libresolv-2.3.6.so
condor_su 24884 marion mem REG 8,19 825496 2031624 /usr/lib64/libstdc++.so.5.0.7
condor_su 24884 marion mem REG 8,19 626832 1048638 /lib64/libm-2.3.6.so
condor_su 24884 marion mem REG 8,19 54024 1048801 /lib64/libgcc_s-4.0.2-20051126.so.1
condor_su 24884 marion mem REG 8,19 1548560 1048670 /lib64/libc-2.3.6.so
condor_su 24884 marion mem REG 8,19 57888 1048603 /lib64/libnss_files-2.3.6.so
condor_su 24884 marion 0r CHR 1,3 1878 /dev/null
condor_su 24884 marion 1w FIFO 0,5 14913087 pipe
condor_su 24884 marion 2w FIFO 0,5 14913087 pipe
condor_su 24884 marion 3r FIFO 0,5 14911254 pipe
condor_su 24884 marion 4w FIFO 0,5 14911254 pipe
condor_su 24884 marion 5u IPv4 14911255 TCP ldas-grid.ligo.caltech.edu:45985 (LISTEN)
condor_su 24884 marion 6u IPv4 14911256 UDP ldas-grid.ligo.caltech.edu:45985
condor_su 24884 marion 7r REG 0,20 1164 48012039 /mnt/qfs2/marion/analysis/866188015-866288015/playground/inspiral_hipe_cat2_veto_playground.thinca2_slides_H1H2L1.sub (opterondata-cit:/home2)
condor_su 24884 marion 8u sock 0,4 14913114 can't identify protocol
[root@ldas-grid ~]# ident /usr/bin/condor_submit
/usr/bin/condor_submit:
$CondorVersion: 6.9.4 Aug 30 2007 $
$CondorPlatform: X86_64-LINUX_RHEL3 $
$Id: kdb5_err.et 13854 2001-10-25 20:20:57Z tlyu $
$Id: krb5_err.et 16816 2004-10-13 16:18:27Z lxs $
$Id: accept_sec_context.c,v 1.30.4.1 2005/10/27 00:53:26 kettimut Exp $
$Id: acquire_cred.c,v 1.12 2005/04/15 23:37:15 meder Exp $
$Id: compare_name.c,v 1.22.4.2 2005/07/13 20:17:52 mlink Exp $
$Id: delete_sec_context.c,v 1.13 2005/04/15 23:37:16 meder Exp $
$Id: display_name.c,v 1.10 2005/04/15 23:37:16 meder Exp $
$Id: display_status.c,v 1.19 2005/04/15 23:37:16 meder Exp $
$Id: import_name.c,v 1.15 2005/04/15 23:37:18 meder Exp $
$Id: init_sec_context.c,v 1.31.4.3 2005/05/04 16:00:23 meder Exp $
$Id: inquire_cred.c,v 1.10 2005/04/15 23:37:19 meder Exp $
$Id: inquire_context.c,v 1.11 2005/04/15 23:37:19 meder Exp $
$Id: oid_functions.c,v 1.13 2005/04/15 23:37:19 meder Exp $
$Id: release_cred.c,v 1.5 2005/04/15 23:37:20 meder Exp $
$Id: release_name.c,v 1.6 2005/04/15 23:37:20 meder Exp $
$Id: unwrap.c,v 1.17 2005/04/15 23:37:20 meder Exp $
$Id: verify_mic.c,v 1.12 2005/04/15 23:37:21 meder Exp $
$Id: wrap.c,v 1.12 2005/04/15 23:37:21 meder Exp $
$Id: release_buffer.c,v 1.3 2005/04/15 23:37:20 meder Exp $
$Id: globus_i_gsi_gss_utils.c,v 1.38.4.1 2005/05/04 00:19:37 meder Exp $
$Id: import_cred.c,v 1.19 2005/04/15 23:37:18 meder Exp $
$Id: get_mic.c,v 1.7 2005/04/15 23:37:17 meder Exp $
$GCBVersion: 1.5.0 $
$GCBBuildDate: Aug 30 2007 $
Here is the full list of process arguments for one of these condor_submit jobs,
marion 24884 23912 99 Nov20 ? 23:43:03 condor_submit -a dag_node_name = 07864dbd2f17da5744666951fc3c10f7 -a +DAGManJobId = 20888877 -a DAGManJobId = 20888877 -a submit_event_notes = DAG Node: 07864dbd2f17da5744666951fc3c10f7 -a macronumslides = 29 -a macrol1triggers = -a macroh1triggers = -a macrousertag = $(macrousertag) -a macrogpsstarttime = 866285943 -a macrogpsendtime = 866286543 -a macroh2triggers = -a macroifotag = SECOND_H1H2L1 -a macroarguments = H1-INSPIRAL_SECOND_H1H2L1-866284318-2048.xml.gz H1-INSPIRAL_SECOND_H1H2L1-866285959-2048.xml.gz H2-INSPIRAL_SECOND_H1H2L1-866284006-2048.xml.gz H2-INSPIRAL_SECOND_H1H2L1-866285926-2048.xml.gz L1-INSPIRAL_SECOND_H1H2L1-866284358-2048.xml.gz L1-INSPIRAL_SECOND_H1H2L1-866285959-2048.xml.gz -a +DAGParentNodeNames = "" inspiral_hipe_cat2_veto_playground.thinca2_slides_H1H2L1.sub
proverbial needle in a haystack.
[root@ldas-grid ~]# gdb -p 24884
GNU gdb Red Hat Linux (6.3.0.0-1.84rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Attaching to process 24884
Reading symbols from /usr/bin/condor_submit...(no debugging symbols found)...done.
Using host libthread_db library "/lib64/libthread_db.so.1".
Reading symbols from /lib64/libcrypt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libcrypt.so.1
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libresolv.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libresolv.so.2
Reading symbols from /usr/lib64/libstdc++.so.5...
(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libstdc++.so.5
Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /lib64/libgcc_s.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libgcc_s.so.1
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...
(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_files.so.2
0x00002b66637ee2b7 in strstr () from /lib64/libc.so.6
(gdb) where
#0 0x00002b66637ee2b7 in strstr () from /lib64/libc.so.6
#1 0x00000000004e56e7 in get_special_var ()
#2 0x00000000004e4cd9 in expand_macro ()
#3 0x00000000004a26f7 in condor_param ()
#4 0x000000000049df7a in SetArguments ()
#5 0x00000000004a2c05 in queue ()
#6 0x00000000004a2295 in read_condor_file ()
#7 0x0000000000497f79 in main ()
Thanks.
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date of creation: Wed Nov 21 13:49:03 2007 (1195674549)
Subject: Actions
Assigned to tannenba by tannenba
===========================================================================
Date of actions: Wed Nov 21 14:42:33 2007 (1195677753)
Date: Wed, 21 Nov 2007 14:50:17 -0600
From: Todd Tannenbaum <tannenba__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #17239] LIGO: condor_submit stuck in CPU spin-loop
>
> Condor version 6.9.4 running on the LIGO Caltech pool currently has
> 4 condor_submit processes stuck in CPU spin-loops on one of the
> submit machines, i.e., from top,
Yikes.
Please pick one of the spinning condor_submits (one with a smaller image
size), and kill it in order to produce a core file (kill -ABRT <pid>
should do the trick). Then send into this ticket the submit file given
to condor_submit (if that is easy to do) and either the core file, or
even better, a pointer to a URL where we can grab the core file.
Thanks,
UW Condor Team (this time you got Todd)
--
Todd Tannenbaum University of Wisconsin-Madison
Condor Project Research Department of Computer Sciences
tannenba__AT__cs.wisc.edu 1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132 Madison, WI 53706-1685
===========================================================================
Date mail was appended: Wed Nov 21 15:24:09 2007 (1195680250)
Subject: Actions
Status changed from open to pending by tannenba
===========================================================================
Date of actions: Wed Nov 21 15:56:47 2007 (1195682207)
Date: Wed, 21 Nov 2007 14:00:46 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: skoranda__AT__gravity.phys.uwm.edu
Subject: Re: [condor-admin #17239] LIGO: condor_submit stuck in CPU spin-loop
X-Seen-BY: mailfromd 4.1 silica.cs.wisc.edu
On Wed, Nov 21, 2007 at 03:24:09PM -0600, condor-admin response tracking system wrote:
>
> >
> > Condor version 6.9.4 running on the LIGO Caltech pool currently has
> > 4 condor_submit processes stuck in CPU spin-loops on one of the
> > submit machines, i.e., from top,
>
> Yikes.
>
> Please pick one of the spinning condor_submits (one with a smaller image
> size), and kill it in order to produce a core file (kill -ABRT <pid>
> should do the trick). Then send into this ticket the submit file given
> to condor_submit (if that is easy to do) and either the core file, or
> even better, a pointer to a URL where we can grab the core file.
>
Here is the submit file from pid 12055,
[root@ldas-grid marion]# cat /mnt/qfs2/marion/analysis/866088014-866188014/playground/inspiral_hipe_cat2_veto_playground.thinca2_slides_H1H2L1.sub
universe = standard
executable = ../executables/lalapps_thinca
arguments = --h1-h2-distance-cut --h2-slide 10 --h2-triggers $(macroh2triggers) --ifo-tag $(macroifotag) --gps-end-time $(macrogpsendtime) --h1-triggers $(macroh1triggers) --snr-cut 5.5 --debug-level 33 --gps-start-time $(macrogpsstarttime) --iota-cut-h1h2 0.6 --h1-h2-consistency --do-veto --data-type playground_only --h2-kappa 0.6 --h1-slide 0 --h2-veto-file ../H2-CATEGORY_2_VETO_SEGS-866088014-100000.txt --l1-veto-file ../L1-CATEGORY_2_VETO_SEGS-866088014-100000.txt --num-slides $(macronumslides) --v1-slide 15 --parameter-test ellipsoid --h1-kappa 0.6 --h1-epsilon 0.0 --l1-slide 5 --e-thinca-parameter 0.5 --l1-triggers $(macrol1triggers) --multi-ifo-coinc --user-tag $(macrousertag) --h1-veto-file ../H1-CATEGORY_2_VETO_SEGS-866088014-100000.txt --write-compress --h2-epsilon 0.0 $(macroarguments)
environment = KMP_LIBRARY=serial;MKL_SERIAL=yes
priority = 10
log = /usr1/marion/tmpG_BEtp
error = logs/thinca-$(macrogpsstarttime)-$(macrogpsendtime)-$(cluster)-$(process).err
output = logs/thinca-$(macrogpsstarttime)-$(macrogpsendtime)-$(cluster)-$(process).out
notification = never
queue 1
Here is the list of open files:
[root@ldas-grid marion]# lsof -p 12055
COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
condor_su 12055 marion cwd DIR 0,20 28672 48014725 /mnt/qfs2/marion/analysis/866088014-866188014/playground (opterondata-cit:/home2)
condor_su 12055 marion rtd DIR 8,19 4096 2 /
condor_su 12055 marion txt REG 8,19 4209688 368189 /usr/bin/condor_submit
condor_su 12055 marion mem REG 0,0 0 [heap] (stat: No such file or directory)
condor_su 12055 marion mem REG 8,19 128760 1048602 /lib64/ld-2.3.6.so
condor_su 12055 marion mem REG 8,19 27904 1048683 /lib64/libcrypt-2.3.6.so
condor_su 12055 marion mem REG 8,19 20312 1048597 /lib64/libdl-2.3.6.so
condor_su 12055 marion mem REG 8,19 89800 1048613 /lib64/libresolv-2.3.6.so
condor_su 12055 marion mem REG 8,19 825496 2031624 /usr/lib64/libstdc++.so.5.0.7
condor_su 12055 marion mem REG 8,19 626832 1048638 /lib64/libm-2.3.6.so
condor_su 12055 marion mem REG 8,19 54024 1048801 /lib64/libgcc_s-4.0.2-20051126.so.1
condor_su 12055 marion mem REG 8,19 1548560 1048670 /lib64/libc-2.3.6.so
condor_su 12055 marion mem REG 8,19 57888 1048603 /lib64/libnss_files-2.3.6.so
condor_su 12055 marion 0r CHR 1,3 1878 /dev/null
condor_su 12055 marion 1w FIFO 0,5 14214536 pipe
condor_su 12055 marion 2w FIFO 0,5 14214536 pipe
condor_su 12055 marion 3r FIFO 0,5 14213904 pipe
condor_su 12055 marion 4w FIFO 0,5 14213904 pipe
condor_su 12055 marion 5u IPv4 14213905 TCP ldas-grid.ligo.caltech.edu:40318 (LISTEN)
condor_su 12055 marion 6u IPv4 14213906 UDP ldas-grid.ligo.caltech.edu:40318
condor_su 12055 marion 7r REG 0,20 1164 48014547 /mnt/qfs2/marion/analysis/866088014-866188014/playground/inspiral_hipe_cat2_veto_playground.thinca2_slides_H1H2L1.sub (opterondata-cit:/home2)
condor_su 12055 marion 8u sock 0,4 14214541 can't identify protocol
The log file is empty,
[root@ldas-grid marion]# ls -l /usr1/marion/tmpG_BEtp
-rw-r--r-- 1 marion marion 0 Nov 20 09:15 /usr1/marion/tmpG_BEtp
I have put a gzipped core image, obtained with gcore, at
http://www.ligo.caltech.edu/~anderson/condor.17239/core.12055.gz
However, when I load this into gdb I do not see the full stack trace
that I see when attaching gdb to the running processes:
(gdb) where
#0 0x00002b8261418304 in strstr () from /lib64/libc.so.6
#1 0x00000000004e56e7 in get_special_var ()
#2 0x00000000004e4a40 in expand_macro ()
#3 0x00000000004a26f7 in condor_param ()
#4 0x000000000049df7a in SetArguments ()
#5 0x00000000004a2c05 in queue ()
#6 0x00000000004a2295 in read_condor_file ()
#7 0x0000000000497f79 in main ()
I ran "kill -ABRT" on a different process and there was no core file
left behind, but that job appeared to be restarted and a new spinning
condor_submit process was created.
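One possible reason for the missing core file, which nothing in this ticket confirms, is a core-size limit of zero in the environment that launched condor_submit. The sketch below is a minimal way to check that limit from Python. It reports the limit of the calling process only, so it is meaningful only when run from the same kind of shell or parent that starts the jobs. The function names are hypothetical.

```python
import resource


def describe_core_limit(soft_limit):
    """Turn a soft RLIMIT_CORE value into a short human-readable note."""
    if soft_limit == 0:
        return "core dumps disabled (soft limit is 0)"
    if soft_limit == resource.RLIM_INFINITY:
        return "core dumps unlimited"
    return "core dumps truncated at %d bytes" % soft_limit


def current_core_limit():
    """Describe the core-size limit of the calling process."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_CORE)
    return describe_core_limit(soft)


if __name__ == "__main__":
    print(current_core_limit())
```

If the limit does turn out to be zero, raising it in the submitting shell before the DAG is started would be the usual next step. Whether that applies here is an assumption.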
If there are any gdb commands you want me to run on one of the running
processes let me know. We can also give you login access to poke
at one of them if that makes it easier to debug.
Thanks.
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date mail was appended: Wed Nov 21 16:01:04 2007 (1195682465)
Date: Thu, 22 Nov 2007 22:39:33 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: skoranda__AT__gravity.phys.uwm.edu
Subject: Re: [condor-admin #17239] LIGO: condor_submit stuck in CPU spin-loop
X-Seen-BY: mailfromd 4.1 silica.cs.wisc.edu
I took a few more random snapshots a few seconds apart on the same spinning
condor_submit process and found that it is looping through Condor code and
not just stuck in one glibc call to strstr() never-never-land. However,
it does appear to always be in condor_param().
For example, here it is in strcmp() from lookup_macro()
(gdb) where
#0 0x00002b026e60f4b0 in strcmp () from /lib64/libc.so.6
#1 0x00000000004e5af6 in lookup_macro (name=0x94a1f4 "macrousertag", table=0x905200, table_size=32)
at config.C:989
#2 0x00000000004e5014 in expand_macro (
value=0x94bb00 "--h1-h2-distance-cut --h2-triggers $(macroh2triggers) --ifo-tag $(macroifotag) --gps-end-time $(macrogpsendtime) --h1-triggers $(macroh1triggers) --snr-cut 5.5 --debug-level 33 --gps-start-time $(macr"..., table=0x905200, table_size=32, self=0x0) at config.C:658
#3 0x00000000004a26f7 in condor_param (name=0x6f2423 "arguments", alt_name=0x703bbd "Args")
at submit.C:5201
#4 0x000000000049df7a in SetArguments () at submit.C:3417
#5 0x00000000004a2c05 in queue (num=1) at submit.C:5409
#6 0x00000000004a2295 in read_condor_file (fp=0x9318d0) at submit.C:5066
#7 0x0000000000497f79 in main (argc=0, argv=0x7fff3cddff68) at submit.C:877
and then in strstr() from get_special_var(),
(gdb) where
#0 0x00002b026e6102b7 in strstr () from /lib64/libc.so.6
#1 0x00000000004e56e7 in get_special_var (prefix=0x6fd6a2 "$ENV", only_id_chars=true,
value=0x94a280 "--h1-h2-distance-cut --h2-triggers --ifo-tag SECOND_H1H2L1 --gps-end-time 866287943 --h1-triggers --snr-cut 5.5 --debug-level 33 --gps-start-time 866284862 --iota-cut-h1h2 0.6 --h1-h2-consistency --"..., leftp=0x7fff3cddd850, namep=0x7fff3cddd848, rightp=0x7fff3cddd838) at config.C:789
#2 0x00000000004e4a40 in expand_macro (
value=0x94bb00 "--h1-h2-distance-cut --h2-triggers $(macroh2triggers) --ifo-tag $(macroifotag) --gps-end-time $(macrogpsendtime) --h1-triggers $(macroh1triggers) --snr-cut 5.5 --debug-level 33 --gps-start-time $(macr"..., table=0x905200, table_size=32, self=0x0) at config.C:564
#3 0x00000000004a26f7 in condor_param (name=0x6f2423 "arguments", alt_name=0x703bbd "Args")
at submit.C:5201
#4 0x000000000049df7a in SetArguments () at submit.C:3417
#5 0x00000000004a2c05 in queue (num=1) at submit.C:5409
#6 0x00000000004a2295 in read_condor_file (fp=0x9318d0) at submit.C:5066
#7 0x0000000000497f79 in main (argc=0, argv=0x7fff3cddff68) at submit.C:877
and here is an instance in sprintf() from expand_macro()
(gdb) where
#0 0x00002b026e5de046 in vfprintf () from /lib64/libc.so.6
#1 0x00002b026e5fbe79 in vsprintf () from /lib64/libc.so.6
#2 0x00002b026e5e6d18 in sprintf () from /lib64/libc.so.6
#3 0x00000000004e5079 in expand_macro (
value=0x94bb00 "--h1-h2-distance-cut --h2-triggers $(macroh2triggers) --ifo-tag $(macroifotag) --gps-end-time $(macrogpsendtime) --h1-triggers $(macroh1triggers) --snr-cut 5.5 --debug-level 33 --gps-start-time $(macr"..., table=0x905200, table_size=32, self=0x0) at config.C:665
#4 0x00000000004a26f7 in condor_param (name=0x6f2423 "arguments", alt_name=0x703bbd "Args") at submit.C:5201
#5 0x000000000049df7a in SetArguments () at submit.C:3417
#6 0x00000000004a2c05 in queue (num=1) at submit.C:5409
#7 0x00000000004a2295 in read_condor_file (fp=0x9318d0) at submit.C:5066
#8 0x0000000000497f79 in main (argc=0, argv=0x7fff3cddff68) at submit.C:877
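The snapshot comparison above can be automated. The sketch below is not part of Condor or gdb. It only parses pasted "where" output and reports which functions appear in every snapshot, and the helper names are hypothetical.

```python
import re
from collections import Counter

# Matches gdb frame lines such as
#   "#3 0x00000000004a26f7 in condor_param (name=...)"
# Continuation lines ("value=...", "at config.C:658") do not start
# with "#<n>" and are ignored.
FRAME_RE = re.compile(r"^#(\d+)\s+(?:0x[0-9a-fA-F]+\s+in\s+)?([A-Za-z_]\w*)\s*\(")


def frame_functions(backtrace):
    """Return the function names in one backtrace, innermost first."""
    names = []
    for line in backtrace.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            names.append(match.group(2))
    return names


def frames_in_every_snapshot(backtraces):
    """Return functions present in all snapshots, ordered as in the first."""
    stacks = [frame_functions(bt) for bt in backtraces]
    if not stacks:
        return []
    # Count each function once per snapshot, then keep the ones seen in all.
    counts = Counter(name for stack in stacks for name in set(stack))
    return [name for name in stacks[0] if counts[name] == len(stacks)]
```

Fed the three traces above, the innermost function it reports is expand_macro(), with condor_param() as its caller, which matches the observation that the process never leaves condor_param().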
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date mail was appended: Fri Nov 23 1:36:45 2007 (1195803406)
Date: Mon, 26 Nov 2007 15:46:14 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: skoranda__AT__gravity.phys.uwm.edu, Steve Fairhurst
<fairhurst_s__AT__ligo.caltech.edu>
Subject: Re: [condor-admin #17239] LIGO: condor_submit stuck in CPU spin-loop
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu
Here is a simple reproducible test case:
> cat bug.dag
JOB job1 bug.sub
VARS job1 macrooutput="$(macrooutput)"
> cat bug.sub
universe = vanilla
executable = /bin/echo
arguments = $(macrooutput)
getenv = False
log = /usr1/anderson/bug.log
error = bug.err
output = bug.out
notification = never
queue 1
> condor_submit_dag bug.dag
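The VARS line in bug.dag defines macrooutput in terms of itself, just as the original job was submitted with "-a macrousertag = $(macrousertag)". The sketch below shows why repeated substitution of such a definition never finishes, and how a substitution cap turns the hang into an error. It is an illustration only and is not Condor's expand_macro(). The names, the cap of 100, and the choice to expand unknown macros to an empty string are all assumptions.

```python
import re

# A $(name) reference, as used in the submit files in this ticket.
MACRO_RE = re.compile(r"\$\(([A-Za-z_]\w*)\)")


class MacroExpansionError(Exception):
    """Raised when references remain after the allowed substitutions."""


def expand_macros(value, table, max_substitutions=100):
    """Replace $(name) references from table until none remain.

    Unknown names expand to the empty string (an assumption of this
    sketch). A self-referential entry such as
    {"macrooutput": "$(macrooutput)"} would loop forever without the
    cap, because each substitution reintroduces the same reference.
    """
    done = 0
    while True:
        match = MACRO_RE.search(value)
        if match is None:
            return value
        if done >= max_substitutions:
            raise MacroExpansionError(
                "gave up after %d substitutions; still unresolved: %s"
                % (done, match.group(0)))
        replacement = table.get(match.group(1), "")
        value = value[:match.start()] + replacement + value[match.end():]
        done += 1
```

With a table of {"macrooutput": "$(macrooutput)"}, expanding "$(macrooutput)" raises MacroExpansionError instead of spinning, while an ordinary definition expands normally.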
A secondary issue is that condor_rm does not successfully kill the
resulting condor_submit process, but leaves an entry in the Condor
job queue in the "X" state until the process is killed manually.
Furthermore, it requires a SIGKILL signal to kill the processes
as it appears that SIGTERM, SIGINT, and SIGQUIT are blocked.
Are these two additional issues bugs or features?
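On Linux, /proc/<pid>/status can show whether those signals are blocked or ignored: the SigBlk, SigIgn, and SigCgt lines hold hexadecimal masks in which bit n-1 stands for signal number n. The sketch below decodes those masks. The function names are hypothetical, and it has not been run against the processes in this ticket.

```python
import signal


def signals_in_mask(mask_hex):
    """Decode a hex signal mask as printed in /proc/<pid>/status."""
    mask = int(mask_hex, 16)
    names = []
    for number in range(1, 65):
        if mask & (1 << (number - 1)):
            try:
                names.append(signal.Signals(number).name)
            except ValueError:
                # No symbolic name on this platform (e.g. most real-time signals).
                names.append("signal %d" % number)
    return names


def read_signal_masks(pid):
    """Return decoded SigBlk, SigIgn and SigCgt masks for a process."""
    masks = {}
    with open("/proc/%d/status" % pid) as status:
        for line in status:
            key, _, rest = line.partition(":")
            if key in ("SigBlk", "SigIgn", "SigCgt"):
                masks[key] = signals_in_mask(rest.strip())
    return masks
```

For example, a SigBlk value of 0000000000004006 decodes to SIGINT, SIGQUIT, and SIGTERM. Running read_signal_masks() on one of the spinning PIDs would show whether the signals are blocked (SigBlk) or ignored (SigIgn).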
Thanks.
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date mail was appended: Mon Nov 26 17:46:44 2007 (1196120805)