LIGO Support Ticket 1789

Ticket Information
  Number:      support 1789
  User:        anderson@ligo.caltech.edu
  Email:       espinoza_e__AT__ligo.caltech.edu,duncan__AT__gravity.phys.uwm.edu,bjohnson__AT__ligo-wa.caltech.edu,johnson_b__AT__ligo-wa.caltech.edu
  Status:      resolved
  Assigned To: danb
Date: Tue, 12 Dec 2006 21:16:45 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu, wenger__AT__cs.wisc.edu,         Alain Roy
 <roy__AT__cs.wisc.edu>, Todd Tannenbaum <tannenba__AT__cs.wisc.edu>
CC: Erik Espinoza <espinoza_e__AT__ligo.caltech.edu>,         Brown Duncan
 <duncan__AT__gravity.phys.uwm.edu>,         Ben Johnson
 <bjohnson__AT__ligo-wa.caltech.edu>
Subject: LIGO: Condor re-running old jobs

The LIGO LHO Condor pool running,
# uname -a
Linux ldas-grid 2.6.17-1.2142_FC4smp #1 SMP Tue Jul 11 22:57:02 EDT 2006 i686 i686 i386 GNU/Linux
# condor_version
$CondorVersion: 6.8.2 Oct 12 2006 $
$CondorPlatform: I386-LINUX_RHEL3 $
is occassionally trying to re-run Local Universe nodes in a Dag several
days after they have succesfully completed. Since these jobs are no
longer able to run several days later (by design) they are then spinning
in the scheduler and using up local universe slots which prevents other
legitimate jobs from running.

Here is an example log file that shows job "8329674.0" trying to
run again 3 days after it successfuly ran.  Note, this log file is on
a local ext3 filesystem so hopefully it is not a file locking problem.
It is also does not appear to happen on a very nearly identical cluser
running the same hardware, OS, number of nodes, condor version, which
we call LLO.  This has happened more than once and so is not just a
cosmic ray.  Any help in tracking this down further would be greatly
appreciated.



Ben, Duncan,
	Did we have this problem while running Condor 6.7.20 or did it only
start after upgrading LHO to 6.8.2?

Duncan,
	Can you tell if it is just inidividual nodes in the Dag that are
being re-run after 3 days or if the entire Dag job itself is trying to re-run?

Ben,
	Please respond to this ticket with any further requested log files
from the LHO cluster to help track this down. The most effecient way to do
that is to post large files on a publically accessible web site and just
put the URL into the email to this bug tracking system.


Thanks.



000 (8329674.000.000) 11/26 04:27:23 Job submitted from host: <10.12.0.12:54510>
    DAG Node: 0154becefe51a3a7de469c3e0109ac09
...
001 (8329674.000.000) 11/26 04:30:06 Job executing on host: <10.12.0.12:42039>
...
005 (8329674.000.000) 11/26 04:30:10 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
        0  -  Total Bytes Sent By Job
        0  -  Total Bytes Received By Job
...
000 (8329708.000.000) 11/26 04:31:30 Job submitted from host: <10.12.0.12:54510>
    DAG Node: 0b8eaf4e1d28e3d652c9cc2c83863a64

(snip)

021 (8329674.000.000) 11/29 02:07:46 Error from starter on <10.12.0.12:46496>:
        Failed to open '/archive/home/inspiralbns/detchar/onasysd.inspiralbns_S5_H1.jobs/005/82/70
/cache/H-R-848576912-848578976.cache' as standard output: No such file or directory (errno 2)
...
021 (8329674.000.000) 11/29 02:07:46 Error from starter on <10.12.0.12:46496>:
        Failed to open '/archive/home/inspiralbns/detchar/onasysd.inspiralbns_S5_H1.jobs/005/82/70
/logs/datafind-H-R-848576912-848578976-8329674-0.err' as standard error: No such file or directory
 (errno 2)
...

--
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date of creation: Tue Dec 12 23:19:33 2006 (1165987176)
Date: Tue, 12 Dec 2006 21:25:08 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu
CC: Erik Espinoza <espinoza_e__AT__ligo.caltech.edu>,         Ben Johnson
 <bjohnson__AT__ligo-wa.caltech.edu>,         Brown Duncan
 <duncan__AT__gravity.phys.uwm.edu>
Subject: Re: [condor-support #1789] LIGO: Condor re-running old jobs

A larger section of the log file that includes the example job may be found at:
http://www.ligo.caltech.edu/~anderson/condor.1789/inspiralbns_S5_H1-848576912-2064.log.gz

Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Tue Dec 12 23:25:30 2006 (1165987531)
Subject: Actions

Assigned to danb by danb
===========================================================================
Date of actions: Wed Dec 13 14:46:32 2006 (1166042793)
Date: Wed, 13 Dec 2006 15:13:09 -0600
From: Dan Bradley <danb__AT__cs.wisc.edu>
To: condor-support__AT__cs.wisc.edu
Subject: Re: [condor-support #1789] LIGO: Condor re-running old jobs
X-PMX-Version: 5.2.0.264296, Antispam-Engine: 2.4.0.264935, Antispam-Data:
 2006.12.13.125433
X-PMX-Spam-Score: Probability=7%, Report='__CT 0, __CTE 0, __CT_TEXT_PLAIN
 0, __HAS_MSGID 0, __MIME_TEXT_ONLY 0, __MIME_VERSION 0, __SANE_MSGID 0,  
 __USER_AGENT 0'


Can we please see the StartLog and the SchedLog for the 3 days in which 
this job completed and then eventually ran again?

Also, what is recorded in the job queue?  For example, what is the 
output of either of the following:

condor_q -l 8329674

condor_history -l 8329674

Thanks,
--Dan


===========================================================================
Date mail was appended: Wed Dec 13 15:13:15 2006 (1166044395)
Date: Wed, 13 Dec 2006 14:02:18 -0800
From: Ben Johnson <johnson_b__AT__ligo-wa.caltech.edu>
To: condor-support__AT__cs.wisc.edu
CC: anderson__AT__ligo.caltech.edu, espinoza_e__AT__ligo.caltech.edu,
 duncan__AT__gravity.phys.uwm.edu
Subject: Re: [condor-support #1789] LIGO: Condor re-running old jobs
X-Enigmail-Version: 0.94.1.0

Dan,

This is Ben, the admin at LHO. I have put the log files under

http://www.ligo-wa.caltech.edu/~bjohnson/LDAS/Condor/

Note: some files decompress to > 1GB. Also, there is no StartLog file at
all, but there are StarterLog files which seem relevant.


File contents:
SchedLog.old._2006_12_12.bz2 - SchedLog on and prior to 2006/12/12
SchedLog.old._2006_12_13.bz2 - SchedLog on and after 2006/12/12
StarterLog.old._2006_12_10.bz2 - StarterLog on and prior to 2006/12/10
StarterLog.old._2006_12_13.bz2 - StarterLog on and after 2006/12/10
8329674.condor_history.bz2 - output of condor_history -l 8329674

Let us know if you need any more information,
Ben

condor-support response tracking system wrote:
> Can we please see the StartLog and the SchedLog for the 3 days in which 
> this job completed and then eventually ran again?
> 
> Also, what is recorded in the job queue?  For example, what is the 
> output of either of the following:
> 
> condor_q -l 8329674
> 
> condor_history -l 8329674
> 
> Thanks,
> --Dan
> 
> 
> 
> ========================================
> MESSAGE INFORMATION
> ========================================
> * From: Dan Bradley <danb__AT__cs.wisc.edu>
> * Ticket Email List: anderson__AT__ligo.caltech.edu, espinoza_e__AT__ligo.caltech.edu,duncan__AT__gravity.phys.uwm.edu,bjohnson__AT__ligo-wa.caltech.edu
> 


===========================================================================
Date mail was appended: Wed Dec 13 16:02:34 2006 (1166047355)
Date: Wed, 13 Dec 2006 17:10:03 -0600
From: Dan Bradley <danb__AT__cs.wisc.edu>
To: condor-support__AT__cs.wisc.edu
Subject: Re: [condor-support #1789] LIGO: Condor re-running old jobs
X-PMX-Version: 5.2.0.264296, Antispam-Engine: 2.4.0.264935, Antispam-Data:
 2006.12.13.145433
X-PMX-Spam-Score: Probability=7%, Report='__CP_URI_IN_BODY 0, __CT 0,  
 __CTE 0, __CT_TEXT_PLAIN 0, __HAS_MSGID 0, __MIME_TEXT_ONLY 0,  
 __MIME_VERSION 0, __SANE_MSGID 0, __USER_AGENT 0'


The StarterLogs that you uploaded go back to 11/29 22:10:35, which is 
after the event on 11/26 04:27:23 when the job first ran.  Do you still 
have logs
from this previous time?

And yes, I did mean StarterLog instead of StartLog.

Thanks,
--Dan

condor-support response tracking system wrote:
> Dan,
>
> This is Ben, the admin at LHO. I have put the log files under
>
> http://www.ligo-wa.caltech.edu/~bjohnson/LDAS/Condor/
>
> Note: some files decompress to > 1GB. Also, there is no StartLog file at
> all, but there are StarterLog files which seem relevant.
>
>
> File contents:
> SchedLog.old._2006_12_12.bz2 - SchedLog on and prior to 2006/12/12
> SchedLog.old._2006_12_13.bz2 - SchedLog on and after 2006/12/12
> StarterLog.old._2006_12_10.bz2 - StarterLog on and prior to 2006/12/10
> StarterLog.old._2006_12_13.bz2 - StarterLog on and after 2006/12/10
> 8329674.condor_history.bz2 - output of condor_history -l 8329674
>
> Let us know if you need any more information,
> Ben
>
> condor-support response tracking system wrote:
>   
>> Can we please see the StartLog and the SchedLog for the 3 days in which 
>> this job completed and then eventually ran again?
>>
>> Also, what is recorded in the job queue?  For example, what is the 
>> output of either of the following:
>>
>> condor_q -l 8329674
>>
>> condor_history -l 8329674
>>
>> Thanks,
>> --Dan
>>
>>
>>
>> ========================================
>> MESSAGE INFORMATION
>> ========================================
>> * From: Dan Bradley <danb__AT__cs.wisc.edu>
>> * Ticket Email List: anderson__AT__ligo.caltech.edu, espinoza_e__AT__ligo.caltech.edu,duncan__AT__gravity.phys.uwm.edu,bjohnson__AT__ligo-wa.caltech.edu
>>
>>     
>
>
>
> ========================================
> MESSAGE INFORMATION
> ========================================
> * From: Ben Johnson <johnson_b__AT__ligo-wa.caltech.edu>
> * Ticket Email List: anderson__AT__ligo.caltech.edu, espinoza_e__AT__ligo.caltech.edu,duncan__AT__gravity.phys.uwm.edu,bjohnson__AT__ligo-wa.caltech.edu,johnson_b__AT__ligo-wa.caltech.edu
>
>   

===========================================================================
Date mail was appended: Wed Dec 13 17:10:13 2006 (1166051413)
Date: Wed, 13 Dec 2006 15:12:03 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu
CC: Ben Johnson <bjohnson__AT__ligo-wa.caltech.edu>,
 espinoza_e__AT__ligo.caltech.edu,         duncan__AT__gravity.phys.uwm.edu
Subject: Re: [condor-support #1789] LIGO: Condor re-running old jobs
X-MIME-Autoconverted: from quoted-printable to 8bit by chopin.cs.wisc.edu 
 id kBDNCeEu000690

Dan,
	Here is another problematic job on the same condor pool that
I suspect is suffering from the same problem. Job 8724331.0 is a Local Universe
job that ran at ~1AM this morning and to all appearances completed succesfully,
however, it is still in the schedd queue and listed in the "R" state. This job
has not been run multiple times yet, but I suspect since it is still
in the queue it will run again in a few days. Please also see the end of
this email for another slightly different manifistation of this problem.

$ condor_q -direct schedd 8724331.0


-- Submitter: ldas-grid.ligo-wa.caltech.edu : <10.12.0.12:54510> : ldas-grid.ligo-wa.caltech.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
8724331.0   hoft           12/13 01:06   0+13:44:10 R  0   9.8  LSCdataFind --obse

If you ask the schedd what machine it is running on it appears confused:

$ condor_q -direct schedd -run 8724331.0


-- Submitter: ldas-grid.ligo-wa.caltech.edu : <10.12.0.12:54510> : ldas-grid.ligo-wa.caltech.edu
 ID      OWNER            SUBMITTED     RUN_TIME HOST(S)         
8724331.0   hoft           12/13 01:06   0+13:46:05 [????????????????]


Here is the full job classadd,


schedd 8724331.0


-- Submitter: ldas-grid.ligo-wa.caltech.edu : <10.12.0.12:54510> : ldas-grid.ligo-wa.caltech.edu
MyType = "Job"
TargetType = "Machine"
ClusterId = 8724331
QDate = 1166000762
CompletionDate = 0
Owner = "hoft"
RemoteWallClockTime = 0.000000
LocalUserCpu = 0.000000
LocalSysCpu = 0.000000
RemoteUserCpu = 0.000000
RemoteSysCpu = 0.000000
ExitStatus = 0
NumCkpts_RAW = 0
NumCkpts = 0
NumRestarts = 0
NumSystemHolds = 0
CommittedTime = 0
TotalSuspensions = 0
CumulativeSuspensionTime = 0
ExitBySignal = FALSE
Notification = ERROR
WantBadgers = TRUE
CondorVersion = "$CondorVersion: 6.8.2 Oct 12 2006 $"
CondorPlatform = "$CondorPlatform: I386-LINUX_RHEL3 $"
RootDir = "/"
Iwd = "/archive/home/hoft/S5-online/H2hoft/onasysd.strain_S5H2.jobs/001/73/48"
JobUniverse = 12
Cmd = "/opt/lscsoft/glue/bin/LSCdataFind"
MinHosts = 1
MaxHosts = 1
WantRemoteSyscalls = FALSE
WantCheckpoint = FALSE
JobPrio = 0
User = "hoft@ligo"
NiceUser = FALSE
Environment = "_CONDOR_DAGMAN_LOG=/archive/home/hoft/S5-online/H2hoft/onasysd.strain_S5H2.jobs/001/73/48/strain_S5H2-850033639-1616.dag.dagman.out _CONDOR_MAX_DAGMAN_LOG=0 GRID_SECURITY_DIR=/etc/grid-security MANPATH=/ldcg/condor/man:/opt/lscsoft/lalapps/share/man:/opt/lscsoft/lal/share/man:/opt/lscsoft/glue/man:/opt/lscsoft/libframe/man:/opt/lscsoft/libmetaio/man:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/vds/man:/ldcg/condor//man:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/globus/man:::/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/perl/man:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/expat/man:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/logrotate/man:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/jdk1.4/man:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/edg/share/man:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/mysql/man:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/vds/contrib/gstar/man:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/glite/share/man TERM=xterm LAL_PREFIX=/opt/lscsoft/lal LSCSOFT_PREFIX=/opt/lscsoft HOSTNAME=ldas-grid MYSQL_UNIX_PORT=/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/vdt-app-data/mysql/var/mysql.sock SASL_PATH=/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/globus/lib/sasl GSTAR_LOCATION=/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/vds/contrib/gstar SHELL=/bin/bash MATLABPATH=/ligotools/matlab HISTSIZE=1000 EDG_LOCATION=/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/edg LDG_LOCATION=/ldcg/stow_pkgs/ldg-4.3/ldg VOMS_USERCONF=/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/glite/etc GLOBUS_LOCATION=/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/globus SSH_CLIENT=131.215.114.74' '53708' '22 GLOBUS_PATH=/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/globus X509_CADIR=/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/globus/TRUSTED_CA X509_CERT_DIR=/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/globus/TRUSTED_CA PERL5LIB=/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/vds/lib/perl:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/vdt/lib:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/perl/lib/5.8.0:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/perl/lib/5.8.0/i686-linux-thread-multi:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/perl/lib/site_perl/5.8.0:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/perl/lib/site_perl/5.8.0/i686-linux-thread-multi::/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/vds/contrib/gstar/lib SHLVL=4 SSH_TTY=/dev/pts/7 GLITE_LOCATION_LOG=/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/glite/log GLOBUS_TCP_PORT_RANGE= USER=hoft VDS_HOME=/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/vds CONDOR_PARENT_ID=ldas-grid:18187:1166000760 LD_LIBRARY_PATH=/opt/lscsoft/lal/lib:/opt/lscsoft/glue/lib:/opt/lscsoft/libframe/lib:/opt/lscsoft/libmetaio/lib:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/glite/lib:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/netlogger/lib:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/myodbc/lib:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/unixodbc/lib:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/mysql/lib/mysql:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/jdk1.4/jre/lib/i386:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/jdk1.4/jre/lib/i386/server:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/jdk1.4/jre/lib/i386/client:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/berkeley-db/lib:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/expat/lib:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/globus/lib:/ldcg/ldg/vdt/globus/lib:/ldcg/ldg/vdt/globus/lib:/ldcg/ldg/vdt/globus/lib:/ligotools/lib GLOBUS_ERROR_VERBOSE=true GPT_LOCATION=/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/gpt LDG_INSTALL_LOG=/ldcg/stow_pkgs/ldg-4.3/ldg/ldg-server/etc/ldg-install.log GLITE_LOCATION_TMP=/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/glite/tmp LDG_SOFTWARE_LOCATION=http://www.ligo.mit.edu/ldg4.3/software SSH_AUTH_SOCK=/tmp/ssh-yZubL12223/agent.12223 LIBPATH=/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/globus/lib:/ldcg/ldg/vdt/globus/lib:/ldcg/ldg/vdt/globus/lib:/ldcg/ldg/vdt/globus/lib:/usr/lib:/lib CONDOR_ID=8724328.0 LAL_LOCATION=/opt/lscsoft/lalapps _CONDOR_ANCESTOR_2508=2527:1163537401:1225610882 PATH=/ldcg/condor/bin:/ldcg/condor/sbin:/opt/lscsoft/lalapps/bin:/opt/lscsoft/lal/bin:/opt/lscsoft/glue/bin:/opt/lscsoft/libframe/bin:/opt/lscsoft/libmetaio/bin:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/glite/sbin:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/glite/bin:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/vds/bin:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/pyglobus-url-copy/bin:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/netlogger/bin:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/unixodbc/bin:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/mysql/bin:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/edg/sbin:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/ftsh/bin:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/jdk1.4/bin:/ldcg/condor//sbin:/ldcg/condor//bin:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/logrotate/sbin:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/gpt/sbin:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/globus/bin:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/globus/sbin:/ldcg/pacman/src:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/vdt/sbin:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/vdt/bin:/ldcg/stow_pkgs/ldg-4.3/ldg/ldg-server/bin:/usr/kerberos/bin:/usr/bin:/bin:/usr/sbin:/sbin:/ldcg/ldg/vdt/globus/bin:/usr/X11R6/bin:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/vds/contrib/gstar/bin:/ligotools/bin:/ldcg/matlab_r2006a/bin:/archive/home/hoft/bin MAIL=/var/spool/mail/hoft JAVA_HOME=/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/jdk1.4 CONDOR_LOCATION=/ldcg/condor VDT_LOCATION=/ldcg/stow_pkgs/ldg-4.3/ldg/vdt CONDOR_CONFIG=/ldcg/condor/etc/condor_config DAGDBUPDATORLOCKFILE=/etc/onasys-dblockfile PWD=/archive/home/hoft/S5-online/H2hoft/onasysd.strain_S5H2.jobs/001/73/48 LOGNAME=hoft INPUTRC=/etc/inputrc OLDPWD=/archive/home/hoft/S5-online/H2hoft/onasysd.strain_S5H2.jobs/001/73/48 ODBCINI=/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/unixodbc/etc/odbc.ini LSC_SEGFIND_SERVER=ldas.ligo-wa.caltech.edu X509_USER_CERT=/archive/home/hoft/hoftcert.pem LANG=C HOME=/archive/home/hoft LIGOTOOLS=/ligotools GLITE_LOCATION_VAR=/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/glite/var X509_USER_PROXY=/tmp/x509up_p12223.fileqdnaGp.1 BOSSDIR=/etc VDT_INSTALL_LOG=vdt-install.log DYLD_LIBRARY_PATH=/opt/lscsoft/lal/lib:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/globus/lib X509_USER_KEY=/archive/home/hoft/hoftkey.pem VDS_JAVA_HEAPMAX=1024 LSC_DATAFIND_SERVER=ldas.ligo-wa.caltech.edu PYTHONPATH=/opt/lscsoft/lalapps/lib/python2.4/site-packages:/opt/lscsoft/glue/lib/python:/opt/lscsoft/libframe/lib/python:/opt/lscsoft/libmetaio/lib/python:/ldcg/stow_pkgs/ldg-4.3/ldg/ldg-server/lib/python:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/netlogger/lib/python2.2/site-packages:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/globus/lib64/python:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/globus/lib/python:/ldcg/pacman/src::/archive/home/hoft/lib/:/archive/home/hoft/lib/python2.4/:/archive/home/hoft/lib/python2.4/site-packages/:/archive/home/hoft/lib/python2.4/site-packages/onasys/:/archive/home/hoft/lib/python2.4/site-packages/dagdbUpdator/ SSH_CONNECTION=131.215.114.74' '53708' '198.129.208.250' '22 LSC_DATAGRID_SERVER_LOCATION=/ldcg/ldg GLOBUS_MYSQL_PATH=/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/mysql CLASSPATH=/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/vds/lib/commons-pool.jar:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/vds/lib/cog-jglobus.jar:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/vds/lib/java-getopt-1.0.9.jar:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/vds/lib/cryptix-asn1.jar:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/vds/lib/cryptix.jar:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/vds/lib/cryptix32.jar:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/vds/lib/exist-optional.jar:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/vds/lib/exist.jar:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/vds/lib/gvds.jar:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/vds/lib/jakarta-oro.jar:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/vds/lib/loggerservice-stub.jar:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/vds/lib/jce-jdk13-117.jar:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/vds/lib/jlinker.jar:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/vds/lib/junit.jar:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/vds/lib/log4j-1.2.8.jar:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/vds/lib/resolver.jar:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/vds/lib/puretls.jar:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/vds/lib/mysql-connector-java-3.0.11-stable-bin.jar:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/vds/lib/postgresql-8.1dev-400.jdbc3.jar:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/vds/lib/xercesImpl.jar:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/vds/lib/rls.jar:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/vds/lib/xmlParserAPIs.jar:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/vds/lib/xmldb.jar:/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/vds/lib/xmlrpc.jar LESSOPEN=|/usr/bin/lesspipe.sh' '%s LDG_DIRECTORY=/ldcg/stow_pkgs/ldg-4.3/ldg/ldg-server PKG_CONFIG_PATH=/opt/lscsoft/lal/lib/pkgconfig:/opt/lscsoft/glue/lib/pkgconfig:/opt/lscsoft/libframe/lib/pkgconfig:/opt/lscsoft/libmetaio/lib/pkgconfig: SHLIB_PATH=/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/globus/lib:/ldcg/ldg/vdt/globus/lib:/ldcg/ldg/vdt/globus/lib:/ldcg/ldg/vdt/globus/lib VDT_POSTINSTALL_README=/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/post-install/README DISPLAY=localhost:17.0 GLITE_LOCATION=/ldcg/stow_pkgs/ldg-4.3/ldg/vdt/glite _=/ldcg/condor/bin/condor_submit _CONDOR_ANCESTOR_2527=18187:1166000760:4254020225 G_BROKEN_FILENAMES=1 PACMAN_LOCATION=/ldcg/pacman"
JobNotification = 0
WantRemoteIO = FALSE
UserLog = "/usr1/hoft/tmpcnyMF1"
CoreSize = -1
KillSig = "SIGTERM"
Rank = 0.000000
In = "/dev/null"
TransferIn = FALSE
Out = "cache/H-RDS_R_L1-850033511-850035383.cache"
StreamOut = FALSE
Err = "logs/datafind-H-RDS_R_L1-850033511-850035383-8724331-0.err"
StreamErr = FALSE
BufferSize = 524288
BufferBlockSize = 32768
ShouldTransferFiles = "NO"
TransferFiles = "NEVER"
ImageSize_RAW = 13
ImageSize = 10000
ExecutableSize_RAW = 13
ExecutableSize = 10000
DiskUsage_RAW = 13
DiskUsage = 10000
Requirements = (Arch == "INTEL") && (OpSys == "LINUX") && (Disk >= DiskUsage) && ((Memory * 1024) >= ImageSize)
FileSystemDomain = "ligo"
PeriodicHold = FALSE
PeriodicRelease = FALSE
PeriodicRemove = FALSE
OnExitHold = FALSE
OnExitRemove = TRUE
LeaveJobInQueue = FALSE
Args = "--observatory H --url-type file --server ldas.ligo-wa.caltech.edu --gps-start-time 850033511 --gps-end-time 850035383 --lal-cache --type RDS_R_L1"
DAGNodeName = "df_H2_0"
DAGManJobId = 8724328
DAGParentNodeNames = ""
GlobalJobId = "ldas-grid.ligo-wa.caltech.edu#1166000780#8724331.0"
ProcId = 0
CurrentHosts = 1
AutoClusterId = 1
AutoClusterAttrs = "JobUniverse,LastCheckpointPlatform,NumCkpts,JobStart,DiskUsage,ImageSize,Requirements,NiceUser"
OrigMaxHosts = 1
JobStatus = 2
EnteredCurrentStatus = 1166000801
LastSuspensionTime = 0
ShadowBday = 1166000801
JobStartDate = 1166000801
JobCurrentStartDate = 1166000801
JobRunCount = 1
WallClockCheckpoint = 49521
ServerTime = 1166050687
 


	Another problem with these jobs is that if you condor_rm them
they stay in the X state for weeks, e.g.,
$ condor_q 8196327.0  


-- Quill: lhoquill@ligo : <10.12.0.12:5432> : lhoquill_db
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
8196327.0   inspiralbns    11/17 01:17   0+00:00:00 X  0   9.8  LSCdataFind --obse


	It appears that this problem with jobs not leaving the queue/running
again in the Local Universe either started happening when either we
upgraded Condor from 6.7.20 to 6.8.2 or when we switched from Scheduler
to Local Universe or when we made the jobs themselves run much more
quickly. Otherwise these jobs have run for several Months with 6.7.20
in the Scheduler Universe without problems.

	Please let us know if you direct login access to the schedd machine
to take a closer look or if you need access to the user log files for the
first job id above.

Thanks.

---
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson


===========================================================================
Date mail was appended: Wed Dec 13 17:12:47 2006 (1166051567)
Date: Wed, 13 Dec 2006 16:13:57 -0800
From: Ben Johnson <johnson_b__AT__ligo-wa.caltech.edu>
To: condor-support__AT__cs.wisc.edu
CC: anderson__AT__ligo.caltech.edu, espinoza_e__AT__ligo.caltech.edu,
 duncan__AT__gravity.phys.uwm.edu
Subject: Re: [condor-support #1789] LIGO: Condor re-running old jobs
X-Enigmail-Version: 0.94.1.0

Dan,

You should find an earlier log file, StarterLog.old._2006_11_29.bz2, on
the web directory.


Ben

condor-support response tracking system wrote:
> The StarterLogs that you uploaded go back to 11/29 22:10:35, which is 
> after the event on 11/26 04:27:23 when the job first ran.  Do you still 
> have logs
> from this previous time?
> 
> And yes, I did mean StarterLog instead of StartLog.
> 
> Thanks,
> --Dan
> 
> condor-support response tracking system wrote:
>> Dan,
>>
>> This is Ben, the admin at LHO. I have put the log files under
>>
>> http://www.ligo-wa.caltech.edu/~bjohnson/LDAS/Condor/
>>
>> Note: some files decompress to > 1GB. Also, there is no StartLog file at
>> all, but there are StarterLog files which seem relevant.
>>
>>
>> File contents:
>> SchedLog.old._2006_12_12.bz2 - SchedLog on and prior to 2006/12/12
>> SchedLog.old._2006_12_13.bz2 - SchedLog on and after 2006/12/12
>> StarterLog.old._2006_12_10.bz2 - StarterLog on and prior to 2006/12/10
>> StarterLog.old._2006_12_13.bz2 - StarterLog on and after 2006/12/10
>> 8329674.condor_history.bz2 - output of condor_history -l 8329674
>>
>> Let us know if you need any more information,
>> Ben
>>
>> condor-support response tracking system wrote:
>>   
>>> Can we please see the StartLog and the SchedLog for the 3 days in which 
>>> this job completed and then eventually ran again?
>>>
>>> Also, what is recorded in the job queue?  For example, what is the 
>>> output of either of the following:
>>>
>>> condor_q -l 8329674
>>>
>>> condor_history -l 8329674
>>>
>>> Thanks,
>>> --Dan
>>>
>>>
>>>
>>> ========================================
>>> MESSAGE INFORMATION
>>> ========================================
>>> * From: Dan Bradley <danb__AT__cs.wisc.edu>
>>> * Ticket Email List: anderson__AT__ligo.caltech.edu, espinoza_e__AT__ligo.caltech.edu,duncan__AT__gravity.phys.uwm.edu,bjohnson__AT__ligo-wa.caltech.edu
>>>
>>>     
>>
>>
>> ========================================
>> MESSAGE INFORMATION
>> ========================================
>> * From: Ben Johnson <johnson_b__AT__ligo-wa.caltech.edu>
>> * Ticket Email List: anderson__AT__ligo.caltech.edu, espinoza_e__AT__ligo.caltech.edu,duncan__AT__gravity.phys.uwm.edu,bjohnson__AT__ligo-wa.caltech.edu,johnson_b__AT__ligo-wa.caltech.edu
>>
>>   
> 
> 
> ========================================
> MESSAGE INFORMATION
> ========================================
> * From: Dan Bradley <danb__AT__cs.wisc.edu>
> * Ticket Email List: anderson__AT__ligo.caltech.edu, espinoza_e__AT__ligo.caltech.edu,duncan__AT__gravity.phys.uwm.edu,bjohnson__AT__ligo-wa.caltech.edu,johnson_b__AT__ligo-wa.caltech.edu
> 

===========================================================================
Date mail was appended: Wed Dec 13 18:14:14 2006 (1166055255)
Date: Mon, 18 Dec 2006 13:49:40 -0600
From: Dan Bradley <danb__AT__cs.wisc.edu>
To: condor-support__AT__cs.wisc.edu
Subject: Re: [condor-support #1789] LIGO: Condor re-running old jobs
X-PMX-Version: 5.2.0.264296, Antispam-Engine: 2.4.0.264935, Antispam-Data:
 2006.12.18.112932
X-PMX-Spam-Score: Probability=7%, Report='__CP_NAME_BODY 0, __CT 0, __CTE 
 0, __CT_TEXT_PLAIN 0, __FRAUD_419_BADTHINGS 0, __HAS_MSGID 0,  
 __MIME_TEXT_ONLY 0, __MIME_VERSION 0, __SANE_MSGID 0, __STOCK_PHRASE_24 0,
 __USER_AGENT 0'


Here's a summary of what I see happening in the logs:

SchedLog:

11/26 04:30:06 (pid:2527) Spawned local starter (pid 10172) for job 
8329674.0
11/29 02:03:08 (pid:2527) ERROR: Child pid 10172 appears hung! Killing 
it hard.
11/29 02:04:23 (pid:2527) Local starter pid 10172 successfully killed 
because it was hung.
11/29 02:04:23 (pid:2527) Local starter pid 10172 died with signal 4
11/29 02:04:43 (pid:2527) Received child alive command from unknown pid 
10172

StarterLog:

11/26 04:30:06 Starting a LOCAL universe job with ID: 8329674.0
11/26 04:30:08 Create_Process succeeded, pid=10177
11/26 04:30:10 Process exited, pid=10177, status=0


Either the starter is hanging, or the schedd is taking a _very_ long 
time to observe the exit of the starter.

You don't happen to have logged 'ps' output from this machine, do you?  
If not, then would it be possible for someone to generate a process list 
next time there is a local universe job stuck in the 'R' or 'X' state?  
If there really is a starter process running with the pid of the starter 
associated with the stuck job, it would be useful to attach to it with 
'gdb' to see what the call stack looks like.  Example:

gdb -p <pid>
(gdb) where


In the mean time, to avoid local universe jobs repeatedly failing after 
they have already run once, you can insert the following into the submit 
script:

#put the job on hold if more than one run attempt is made
periodic_hold = JobRunCount > 1

or

#remove the job if more than one run attempt is made
periodic_remove = JobRunCount > 1

--Dan


===========================================================================
Date mail was appended: Mon Dec 18 13:49:48 2006 (1166471389)
Date: Mon, 18 Dec 2006 14:13:52 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: espinoza_e__AT__ligo.caltech.edu, duncan__AT__gravity.phys.uwm.edu,
 bjohnson__AT__ligo-wa.caltech.edu, johnson_b__AT__ligo-wa.caltech.edu
Subject: Re: [condor-support #1789] LIGO: Condor re-running old jobs

Here is a gdb stack trace from a very old condor_starter process:

gdb) where
#0  0x009c3402 in ?? ()
#1  0x00aa91ed in ___newselect_nocancel () from /lib/libc.so.6
#2  0x08328c8f in __wrap_select ()
#3  0x0811863e in DaemonCore::Driver ()
#4  0x0812555d in main ()

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Mon Dec 18 16:14:14 2006 (1166480055)
Subject: Actions

Status changed from open to bug by tannenba
===========================================================================
Date of actions: Mon Dec 18 16:35:31 2006 (1166481331)
Date: Mon, 18 Dec 2006 17:43:32 -0600
From: Dan Bradley <danb__AT__cs.wisc.edu>
To: condor-support__AT__cs.wisc.edu
Subject: Re: [condor-support #1789] LIGO: Condor re-running old jobs
X-PMX-Version: 5.2.0.264296, Antispam-Engine: 2.4.0.264935, Antispam-Data:
 2006.12.18.135933
X-PMX-Spam-Score: Probability=7%, Report='__CT 0, __CTE 0, __CT_TEXT_PLAIN
 0, __HAS_MSGID 0, __MIME_TEXT_ONLY 0, __MIME_VERSION 0, __SANE_MSGID 0,  
 __USER_AGENT 0'


Interesting.

If you strace the process for a minute, are any system calls reported?  
Is the process accumulating any significant amount of cpu time, or is it 
mostly just sitting idle?

Thanks,
--Dan

condor-support response tracking system wrote:
> Here is a gdb stack trace from a very old condor_starter process:
>
> gdb) where
> #0  0x009c3402 in ?? ()
> #1  0x00aa91ed in ___newselect_nocancel () from /lib/libc.so.6
> #2  0x08328c8f in __wrap_select ()
> #3  0x0811863e in DaemonCore::Driver ()
> #4  0x0812555d in main ()
>
>   

===========================================================================
Date mail was appended: Mon Dec 18 17:44:03 2006 (1166485444)
Date: Mon, 18 Dec 2006 15:50:13 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: espinoza_e__AT__ligo.caltech.edu, duncan__AT__gravity.phys.uwm.edu,
 bjohnson__AT__ligo-wa.caltech.edu, johnson_b__AT__ligo-wa.caltech.edu
Subject: Re: [condor-support #1789] LIGO: Condor re-running old jobs

On Mon, Dec 18, 2006 at 05:44:03PM -0600, condor-support response tracking system wrote:
> 
> Interesting.
> 
> If you strace the process for a minute, are any system calls reported?  
> Is the process accumulating any significant amount of cpu time, or is it 
> mostly just sitting idle?
> 

I believe Derek has explained this as a bug in the Starter. When its one
and only attempt to contact the Schedd on job completion fails, presumably
due to a busy Schedd, then the Starter just sits around waiting for the
end of the Universe. Evidently a Shadow helps clean this up in other
Universes but there is no Shadow for Local Universe jobs.

We are going to switch back to using the Scheduler Universe for most
of our local machine jobs until this is fixed, however, please let
us know if you need any more information on this problem.

Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Mon Dec 18 19:53:42 2006 (1166493223)
Date: Thu, 21 Dec 2006 11:39:30 -0600
From: Dan Bradley <danb__AT__cs.wisc.edu>
To: condor-support__AT__cs.wisc.edu
Subject: Re: [condor-support #1789] LIGO: Condor re-running old jobs


condor-support response tracking system wrote:

>On Mon, Dec 18, 2006 at 05:44:03PM -0600, condor-support response tracking system wrote:
>  
>
>>Interesting.
>>
>>If you strace the process for a minute, are any system calls reported?  
>>Is the process accumulating any significant amount of cpu time, or is it 
>>mostly just sitting idle?
>>
>>    
>>
>
>I believe Derek has explained this as a bug in the Starter.
>

Indeed.  Derek will work on a fix for this problem.  The current plan is 
that this would go into 6.8.4, assuming it does not require deep 
changes.  It is too late in the release process to get it into 6.8.3.

--Dan


===========================================================================
Date mail was appended: Thu Dec 21 11:39:41 2006 (1166722781)
Date: Fri, 6 Apr 2007 12:51:15 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: espinoza_e__AT__ligo.caltech.edu, duncan__AT__gravity.phys.uwm.edu,
 bjohnson__AT__ligo-wa.caltech.edu, johnson_b__AT__ligo-wa.caltech.edu
Subject: Re: [condor-support #1789] LIGO: Condor re-running old jobs

I believe the full fix for this problem in the Local Universe is going into
6.9, but that the most aggregious case has already been solved in 6.8.4 so
please go ahead and close this ticket.

Thanks.

On Thu, Dec 21, 2006 at 11:39:41AM -0600, condor-support response tracking system wrote:
> 
> 
> condor-support response tracking system wrote:
> 
> >On Mon, Dec 18, 2006 at 05:44:03PM -0600, condor-support response tracking system wrote:
> >  
> >
> >>Interesting.
> >>
> >>If you strace the process for a minute, are any system calls reported?  
> >>Is the process accumulating any significant amount of cpu time, or is it 
> >>mostly just sitting idle?
> >>
> >>    
> >>
> >
> >I believe Derek has explained this as a bug in the Starter.
> >
> 
> Indeed.  Derek will work on a fix for this problem.  The current plan is 
> that this would go into 6.8.4, assuming it does not require deep 
> changes.  It is too late in the release process to get it into 6.8.3.
> 
> --Dan
> 
> 
> 
> ========================================
> MESSAGE INFORMATION
> ========================================
> * From: Dan Bradley <danb__AT__cs.wisc.edu>
> * Ticket Email List: anderson__AT__ligo.caltech.edu, espinoza_e__AT__ligo.caltech.edu,duncan__AT__gravity.phys.uwm.edu,bjohnson__AT__ligo-wa.caltech.edu,johnson_b__AT__ligo-wa.caltech.edu
> 
> -- 
> ======================================================================
> This mail was sent from the RUST Mail System
> Please direct all replies to condor-support__AT__cs.wisc.edu
> Please include the current subject line in your reply.
> ======================================================================
> 

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Fri Apr  6 14:51:31 2007 (1175889091)
Subject: Actions

Ticket resolved by tannenba
===========================================================================
Date of actions: Tue Jul  3 16:58:52 2007 (1183499932)