LIGO Support Ticket 17143

Ticket Information
  Number:      admin 17143
  User:        anderson@ligo.caltech.edu
  Email:       skoranda__AT__gravity.phys.uwm.edu
  Status:      open
  Assigned To: psilord
Date: Thu, 25 Oct 2007 09:44:37 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
CC: Scott Koranda <skoranda__AT__gravity.phys.uwm.edu>
Subject: LIGO: ImageSize update problem
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu

The LIGO Condor pool at Caltech running,

# condor_version
$CondorVersion: 6.9.4 Aug 30 2007 $
$CondorPlatform: X86_64-LINUX_RHEL3 $

does not appear to be reliably updating the ImageSize job ClassAdd.
For example, here are 3 jobs that where submitted from the same submit
machine by the same user and have been running on the same execute
machine for an extended period of time. However, only one of them is
showing the right ImageSize.


# condor_q 19975928.0 19991712.0 19991783.0


-- Submitter: ldas-grid.ligo.caltech.edu : <10.14.0.12:42671> : ldas-grid.ligo.caltech.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
19975928.0   dkeppel        10/24 08:31   0+22:39:23 R  0   332.0 lalapps_inspiral -
19991712.0   dkeppel        10/24 21:42   0+11:45:16 R  0   9.8  lalapps_inspiral -
19991783.0   dkeppel        10/24 21:47   0+11:33:31 R  0   9.8  lalapps_inspiral -



# condor_q -run 19975928.0 19991712.0 19991783.0


-- Submitter: ldas-grid.ligo.caltech.edu : <10.14.0.12:42671> : ldas-grid.ligo.caltech.edu
 ID      OWNER            SUBMITTED     RUN_TIME HOST(S)         
19975928.0   dkeppel        10/24 08:31   0+22:40:10 slot2__AT__node264.ldas-cit.ligo.caltech.edu
19991712.0   dkeppel        10/24 21:42   0+11:46:03 slot3__AT__node264.ldas-cit.ligo.caltech.edu
19991783.0   dkeppel        10/24 21:47   0+11:34:18 slot4__AT__node264.ldas-cit.ligo.caltech.edu


# condor_q -long 19975928.0 19991712.0 19991783.0 | grep "^ImageSize ="
ImageSize = 340000
ImageSize = 10000
ImageSize = 10000


However, here is the output of top from node264for user dkeppel that
shows the actual memory usage:


top - 09:36:53 up 22 days, 17:20,  1 user,  load average: 4.08, 4.10, 4.09
_asks: 139 total,   5 running, 134 sleeping,   0 stopped,   0 zombie
Cpu(s): 100.0% us,  0.0% sy,  0.0% ni,  0.0% id,  0.0% wa,  0.0% hi,  0.0% si
Mem:   4101496k total,  3861084k used,   240412k free,    13820k buffers
Swap: 10482404k total,    10940k used, 10471464k free,  2660116k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 6430 dkeppel   25   0  322m 318m  700 R 99.9  7.9  57:16.88 condor_exec.199
32618 dkeppel   25   0  315m 310m 1976 R 99.9  7.8 704:40.19 condor_exec.199
32706 dkeppel   25   0  314m 309m 1976 R 99.9  7.7 692:18.97 condor_exec.199


Thanks.


-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date of creation: Thu Oct 25 11:44:58 2007 (1193330700)
Subject: Actions

Assigned to zmiller by zmiller
===========================================================================
Date of actions: Fri Oct 26 10:39:34 2007 (1193413174)
Date: Fri, 26 Oct 2007 10:42:37 -0500
From: Zachary Miller <zmiller__AT__cs.wisc.edu>
To: zmiller <condor-admin__AT__cs.wisc.edu>
Subject: Re: [condor-admin #17143] LIGO: ImageSize update problem

> # condor_version
> $CondorVersion: 6.9.4 Aug 30 2007 $
> $CondorPlatform: X86_64-LINUX_RHEL3 $
> 
> does not appear to be reliably updating the ImageSize job ClassAdd.

okay.

> # condor_q -long 19975928.0 19991712.0 19991783.0 | grep "^ImageSize ="
> ImageSize = 340000
> ImageSize = 10000
> ImageSize = 10000

do you happen to have the starterlog (for the slot that ran one of the
jobs with the incorrect image size) and/or the shadow log?  i'm curious
if the size was detected incorrectly or if it got updated on the submit
side due to a communications problem.

if possible, can you increase the log verbosity to D_ALL and replicate
it again before sending the logs?


cheers,
-zach


===========================================================================
Date mail was appended: Fri Oct 26 10:42:46 2007 (1193413366)
Date: Sat, 27 Oct 2007 18:38:24 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: skoranda__AT__gravity.phys.uwm.edu, zmiller__AT__cs.wisc.edu
Subject: Re: [condor-admin #17143] LIGO: ImageSize update problem
X-Seen-BY: mailfromd 4.1 granite.cs.wisc.edu

--4bRzO86E/ozDv8r1

On Fri, Oct 26, 2007 at 10:42:46AM -0500, condor-admin response tracking system wrote:
> > # condor_version
> > $CondorVersion: 6.9.4 Aug 30 2007 $
> > $CondorPlatform: X86_64-LINUX_RHEL3 $
> > 
> > does not appear to be reliably updating the ImageSize job ClassAdd.
> 
> okay.
> 
> > # condor_q -long 19975928.0 19991712.0 19991783.0 | grep "^ImageSize ="
> > ImageSize = 340000
> > ImageSize = 10000
> > ImageSize = 10000
> 
> do you happen to have the starterlog (for the slot that ran one of the
> jobs with the incorrect image size) and/or the shadow log?  i'm curious
> if the size was detected incorrectly or if it got updated on the submit
> side due to a communications problem.

Here is another set of jobs being run by the same user on the same execute
machine and for which one is reporting the wrong ImageSize. Attached
are the requested log files. In particular,


[root@ldas-grid ~]# condor_q 20024617.0 20029039.0 20030610.0


-- Submitter: ldas-grid.ligo.caltech.edu : <10.14.0.12:42671> : ldas-grid.ligo.caltech.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
20024617.0   dkeppel        10/27 06:07   0+11:27:47 R  0   322.3 lalapps_inspiral -
20029039.0   dkeppel        10/27 09:31   0+08:05:11 R  0   322.3 lalapps_inspiral -
20030610.0   dkeppel        10/27 10:46   0+07:20:17 R  0   9.8  lalapps_inspiral -


All 3 of these have accumulated several hours of run time on the same
execute machine and are using approximately the same amount of memory
as can seen in the attached output of top, however, one of them is
clearly reporting the wrong ImageSize.

The Starter log for slot3 it is also attached and shows that
job id 20030610.0 corresponds to pid=20219 on the execute machine.

Also attached is the section of the ShadowLog that corresponds to
this job being started.

> 
> if possible, can you increase the log verbosity to D_ALL and replicate
> it again before sending the logs?
> 

I would rather you take a look at these logs first before I change
the cluster configuration.

Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

--4bRzO86E/ozDv8r1

top - 18:25:07 up 25 days,  2:09,  2 users,  load average: 4.01, 4.05, 4.02
Tasks: 142 total,   5 running, 137 sleeping,   0 stopped,   0 zombie
Cpu(s): 100.0% us,  0.0% sy,  0.0% ni,  0.0% id,  0.0% wa,  0.0% hi,  0.0% si
Mem:   4101496k total,  2357848k used,  1743648k free,    80212k buffers
Swap: 10482404k total,    10496k used, 10471908k free,   940064k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
20085 dkeppel   25   0  314m 309m  700 R 99.9  7.7 462:16.39 condor_exec.200
20081 dkeppel   25   0  313m 308m  712 R 99.7  7.7 464:24.21 condor_exec.200
20219 dkeppel   25   0  314m 310m 1976 R 99.7  7.7 442:16.26 condor_exec.200
23304 igor      25   0  349m 256m  13m R 99.7  6.4  75:06.69 root.exe
    1 root      18   0  4880  500  460 S  0.0  0.0   0:03.02 init

--4bRzO86E/ozDv8r1

10/27 11:00:26 Sending final status for process 20030583.0
10/27 11:00:26 STATUS encoded as NORMAL
10/27 11:00:26 User time =3D 43.110000 seconds
10/27 11:00:26 System time =3D 1.290000 seconds
10/27 11:00:26 Unlinked "dir_20199/condor_exec.20030583.0"
10/27 11:00:26 Removed directory "dir_20199"
10/27 11:00:26 	*FSM* Transitioning to state "SUPERVISE"
10/27 11:00:26 	*FSM* Got asynchronous event "DIE"
10/27 11:00:26 	*FSM* Executing transition function "req_die"
10/27 11:00:26 	*FSM* Transitioning to state "TERMINATE"
10/27 11:00:26 	*FSM* Executing state func "terminate_all()" [  ]
10/27 11:00:26 	*FSM* Transitioning to state "SEND_STATUS_ALL"
10/27 11:00:26 	*FSM* Executing state func "dispose_all()" [  ]
10/27 11:00:26 	*FSM* Reached state "END"
10/27 11:00:26 ********* STARTER terminating normally **********
10/27 11:00:28 ********** STARTER starting up ***********
10/27 11:00:28 ** $CondorVersion: 6.9.4 Aug 30 2007 $
10/27 11:00:28 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
10/27 11:00:28 ******************************************
10/27 11:00:28 Submitting machine is "ldas-grid.ligo.caltech.edu"
10/27 11:00:28 EventHandler {
10/27 11:00:28 	func =3D 0x4c64e4
10/27 11:00:28 	mask =3D SIGALRM SIGHUP SIGINT SIGUSR1 SIGUSR2 SIGCHLD SIGT=
STP=20
10/27 11:00:28 }
10/27 11:00:28 Done setting resource limits
10/27 11:00:28 	*FSM* Transitioning to state "GET_PROC"
10/27 11:00:28 	*FSM* Executing state func "get_proc()" [  ]
10/27 11:00:28 Entering get_proc()
10/27 11:00:28 Entering get_job_info()
10/27 11:00:28 Startup Info:
10/27 11:00:28 	Version Number: 1
10/27 11:00:28 	Id: 20030587.0
10/27 11:00:28 	JobClass: STANDARD
10/27 11:00:28 	Uid: 4093
10/27 11:00:28 	Gid: 4093
10/27 11:00:28 	VirtPid: -1
10/27 11:00:28 	SoftKillSignal: 20
10/27 11:00:28 	Cmd: "/archive/home/dkeppel/codes/s5_1yr/lalapps/bin/lalapp=
s_inspiral"
10/27 11:00:28 	Args: "--do-rsq-veto --trig-end-time 0 --cluster-method tem=
plate --dynamic-range-exponent 69.0 --bank-file L1-TRIGBANK_H1H2L1-84230967=
9-2048.xml.gz --high-pass-order 8 --strain-high-pass-order 8 --ifo-tag H1H2=
L1 --gps-end-time 842311727 --calibrated-data real_8 --channel-name L1:LSC-=
STRAIN --rsq-veto-coeff 0.0075 --snr-threshold 5.5 --enable-rsq-veto --numb=
er-of-segments 15 --trig-start-time 0 --enable-high-pass 30.0 --debug-level=
 33 --gps-start-time 842309679 --enable-filter-inj-only --high-pass-attenua=
tion 0.1 --chisq-bins 16 --inverse-spec-length 16 --rsq-veto-threshold 15.0=
 --segment-length 1048576 --low-frequency-cutoff 40.0 --pad-data 8 --sample=
-rate 4096 --chisq-threshold 10.0 --rsq-veto-max-snr 12.0 --resample-filter=
 ldas --strain-high-pass-atten 0.1 --strain-high-pass-freq 30.0 --injection=
-file /archive/home/dkeppel/analysis/lowcbc/20051104-20061114/full_data/sw_=
injections/PPN_run1/HL-INJECTIONS_1-815160323-32395247.xml --segment-overla=
p 524288 --frame-cache cache/L-L1_RDS_C03_L2-842309671-842316427.cache --ch=
isq-delta 0.2 --approximant FindChirpSP --rsq-veto-pow 1.124764 --rsq-veto-=
time-thresh 0.0002 --write-compress --enable-output --rsq-veto-window 6.0 -=
-ts-cluster T0T3TcAS --ts-metric-scaling 0.06 --spectrum-type median"
10/27 11:00:28 	Env: "KMP_LIBRARY=3Dserial;MKL_SERIAL=3Dyes"
10/27 11:00:28 	Iwd: "/mnt/qfs2/dkeppel/analysis/lowcbc/20051104-20061114/f=
ull_data/sw_injections/PPN_run1_cat123_rsq"
10/27 11:00:28 	Ckpt Wanted: TRUE
10/27 11:00:28 	Is Restart: FALSE
10/27 11:00:28 	Core Limit Valid: TRUE
10/27 11:00:28 	Coredump Limit 0
10/27 11:00:28 User uid set to 4093
10/27 11:00:28 User uid set to 4093
10/27 11:00:28 User Process 20030587.0 {
10/27 11:00:28   cmd =3D /archive/home/dkeppel/codes/s5_1yr/lalapps/bin/lal=
apps_inspiral
10/27 11:00:28   args =3D --do-rsq-veto --trig-end-time 0 --cluster-method =
template --dynamic-range-exponent 69.0 --bank-file L1-TRIGBANK_H1H2L1-84230=
9679-2048.xml.gz --high-pass-order 8 --strain-high-pass-order 8 --ifo-tag H=
1H2L1 --gps-end-time 842311727 --calibrated-data real_8 --channel-name L1:L=
SC-STRAIN --rsq-veto-coeff 0.0075 --snr-threshold 5.5 --enable-rsq-veto --n=
umber-of-segments 15 --trig-start-time 0 --enable-high-pass 30.0 --debug-le=
vel 33 --gps-start-time 842309679 --enable-filter-inj-only --high-pass-atte=
nuation 0.1 --chisq-bins 16 --inverse-spec-length 16 --rsq-veto-threshold 1=
5.0 --segment-length 1048576 --low-frequency-cutoff 40.0 --pad-data 8 --sam=
ple-rate 4096 --chisq-threshold 10.0 --rsq-veto-max-snr 12.0 --resample-fil=
ter ldas --strain-high-pass-atten 0.1 --strain-high-pass-freq 30.0 --inject=
ion-file /archive/home/dkeppel/analysis/lowcbc/20051104-20061114/full_data/=
sw_injections/PPN_run1/HL-INJECTIONS_1-815160323-32395247.xml --segment-ove=
rlap 524288 --frame-cache cache/L-L1_RDS_C03_L2-842309671-842316427.cache -=
-chisq-delta 0.2 --approximant FindChirpSP --rsq-veto-pow 1.124764 --rsq-ve=
to-time-thresh 0.0002 --write-compress --enable-output --rsq-veto-window 6.=
0 --ts-cluster T0T3TcAS --ts-metric-scaling 0.06 --spectrum-type median
10/27 11:00:28   env =3D KMP_LIBRARY=3Dserial MKL_SERIAL=3Dyes _CONDOR_SLOT=
=3Dslot2 CONDOR_VM=3Dslot2 _condor_BIND_ALL_INTERFACES=3DFALSE CONDOR_SCRAT=
CH_DIR=3D/usr1/condor/execute/dir_20208
10/27 11:00:28   local_dir =3D dir_20208
10/27 11:00:28   cur_ckpt =3D dir_20208/condor_exec.20030587.0
10/27 11:00:28   core_name =3D (either 'core' or 'core.<pid>')
10/27 11:00:28   uid =3D 4093, gid =3D 4093
10/27 11:00:28   v_pid =3D -1
10/27 11:00:28   pid =3D (NOT CURRENTLY EXECUTING)
10/27 11:00:28   exit_status_valid =3D FALSE
10/27 11:00:28   exit_status =3D (NEVER BEEN EXECUTED)
10/27 11:00:28   ckpt_wanted =3D TRUE
10/27 11:00:28   coredump_limit_exists =3D TRUE
10/27 11:00:28   coredump_limit =3D 0
10/27 11:00:28   soft_kill_sig =3D 20
10/27 11:00:28   job_class =3D STANDARD
10/27 11:00:28   state =3D NEW
10/27 11:00:28   new_ckpt_created =3D FALSE
10/27 11:00:28   ckpt_transferred =3D FALSE
10/27 11:00:28   core_created =3D FALSE
10/27 11:00:28   core_transferred =3D FALSE
10/27 11:00:28   exit_requested =3D FALSE
10/27 11:00:28   image_size =3D -1 blocks
10/27 11:00:28   user_time =3D 0
10/27 11:00:28   sys_time =3D 0
10/27 11:00:28   guaranteed_user_time =3D 0
10/27 11:00:28   guaranteed_sys_time =3D 0
10/27 11:00:28 }
10/27 11:00:28 	*FSM* Transitioning to state "GET_EXEC"
10/27 11:00:28 	*FSM* Executing state func "get_exec()" [ SUSPEND VACATE DI=
E  ]
10/27 11:00:28 Entering get_exec()
10/27 11:00:28 Executable is located on submitting host
10/27 11:00:28 Expanded executable name is "/usr1/condor/spool/cluster20030=
587.ickpt.subproc0"
10/27 11:00:28 Going to try 3 attempts at getting the initial executable
10/27 11:00:28 Entering get_file( /usr1/condor/spool/cluster20030587.ickpt.=
subproc0, dir_20208/condor_exec.20030587.0, 0755 )
10/27 11:00:28 Opened "/usr1/condor/spool/cluster20030587.ickpt.subproc0" v=
ia file stream
10/27 11:00:28 Get_file() transferred 9345252 bytes, 113213971 bytes/second
10/27 11:00:28 Fetched orig ckpt file "/usr1/condor/spool/cluster20030587.i=
ckpt.subproc0" into "dir_20208/condor_exec.20030587.0" with 1 attempt
10/27 11:00:28 Executable 'dir_20208/condor_exec.20030587.0' is linked with=
 "$CondorVersion: 6.8.5 May 17 2007 $" on a "$CondorPlatform: X86_64-LINUX_=
RHEL3 $"
10/27 11:00:28 	*FSM* Executing transition function "spawn_all"
10/27 11:00:28 Pipe built
10/27 11:00:28 New pipe_fds[14,1]
10/27 11:00:28 cmd_fd =3D 14
10/27 11:00:28 Calling execve( "/usr1/condor/execute/dir_20208/condor_exec.=
20030587.0", "condor_exec.20030587.0", "-_condor_cmd_fd", "14", "--do-rsq-v=
eto", "--trig-end-time", "0", "--cluster-method", "template", "--dynamic-ra=
nge-exponent", "69.0", "--bank-file", "L1-TRIGBANK_H1H2L1-842309679-2048.xm=
l.gz", "--high-pass-order", "8", "--strain-high-pass-order", "8", "--ifo-ta=
g", "H1H2L1", "--gps-end-time", "842311727", "--calibrated-data", "real_8",=
 "--channel-name", "L1:LSC-STRAIN", "--rsq-veto-coeff", "0.0075", "--snr-th=
reshold", "5.5", "--enable-rsq-veto", "--number-of-segments", "15", "--trig=
-start-time", "0", "--enable-high-pass", "30.0", "--debug-level", "33", "--=
gps-start-time", "842309679", "--enable-filter-inj-only", "--high-pass-atte=
nuation", "0.1", "--chisq-bins", "16", "--inverse-spec-length", "16", "--rs=
q-veto-threshold", "15.0", "--segment-length", "1048576", "--low-frequency-=
cutoff", "40.0", "--pad-data", "8", "--sample-rate", "4096", "--chisq-thres=
hold", "10.0", "--rsq-veto-max-snr", "12.0", "--resample-filter", "ldas", "=
--strain-high-pass-atten", "0.1", "--strain-high-pass-freq", "30.0", "--inj=
ection-file", "/archive/home/dkeppel/analysis/lowcbc/20051104-20061114/full=
_data/sw_injections/PPN_run1/HL-INJECTIONS_1-815160323-32395247.xml", "--se=
gment-overlap", "524288", "--frame-cache", "cache/L-L1_RDS_C03_L2-842309671=
-842316427.cache", "--chisq-delta", "0.2", "--approximant", "FindChirpSP", =
"--rsq-veto-pow", "1.124764", "--rsq-veto-time-thresh", "0.0002", "--write-=
compress", "--enable-output", "--rsq-veto-window", "6.0", "--ts-cluster", "=
T0T3TcAS", "--ts-metric-scaling", "0.06", "--spectrum-type", "median", 0, "=
KMP_LIBRARY=3Dserial", "MKL_SERIAL=3Dyes", "_CONDOR_SLOT=3Dslot2", "CONDOR_=
VM=3Dslot2", "_condor_BIND_ALL_INTERFACES=3DFALSE", "CONDOR_SCRATCH_DIR=3D/=
usr1/condor/execute/dir_20208", 0 )
10/27 11:00:28 Started user job - PID =3D 20209
10/27 11:00:28 cmd_fp =3D 0xa46c90
10/27 11:00:28 end
10/27 11:00:28 	*FSM* Transitioning to state "SUPERVISE"
10/27 11:00:28 	*FSM* Executing state func "supervise_all()" [ GET_NEW_PROC=
 SUSPEND VACATE ALARM DIE CHILD_EXIT PERIODIC_CKPT  ]
10/27 11:01:21 	*FSM* Got asynchronous event "CHILD_EXIT"
10/27 11:01:21 	*FSM* Executing transition function "reaper"
10/27 11:01:21 Process 20209 exited with status 0
10/27 11:01:21 	*FSM* Transitioning to state "PROC_EXIT"
10/27 11:01:21 	*FSM* Executing state func "proc_exit()" [ DIE  ]
10/27 11:01:21 	*FSM* Executing transition function "dispose_one"
10/27 11:01:21 Sending final status for process 20030587.0
10/27 11:01:21 STATUS encoded as NORMAL
10/27 11:01:21 User time =3D 47.240000 seconds
10/27 11:01:21 System time =3D 1.910000 seconds
10/27 11:01:21 Unlinked "dir_20208/condor_exec.20030587.0"
10/27 11:01:21 Removed directory "dir_20208"
10/27 11:01:21 	*FSM* Transitioning to state "SUPERVISE"
10/27 11:01:21 	*FSM* Got asynchronous event "DIE"
10/27 11:01:21 	*FSM* Executing transition function "req_die"
10/27 11:01:21 	*FSM* Transitioning to state "TERMINATE"
10/27 11:01:21 	*FSM* Executing state func "terminate_all()" [  ]
10/27 11:01:21 	*FSM* Transitioning to state "SEND_STATUS_ALL"
10/27 11:01:21 	*FSM* Executing state func "dispose_all()" [  ]
10/27 11:01:21 	*FSM* Reached state "END"
10/27 11:01:21 ********* STARTER terminating normally **********
10/27 11:01:22 ********** STARTER starting up ***********
10/27 11:01:22 ** $CondorVersion: 6.9.4 Aug 30 2007 $
10/27 11:01:22 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
10/27 11:01:22 ******************************************
10/27 11:01:22 Submitting machine is "ldas-grid.ligo.caltech.edu"
10/27 11:01:22 EventHandler {
10/27 11:01:22 	func =3D 0x4c64e4
10/27 11:01:22 	mask =3D SIGALRM SIGHUP SIGINT SIGUSR1 SIGUSR2 SIGCHLD SIGT=
STP=20
10/27 11:01:22 }
10/27 11:01:22 Done setting resource limits
10/27 11:01:22 	*FSM* Transitioning to state "GET_PROC"
10/27 11:01:22 	*FSM* Executing state func "get_proc()" [  ]
10/27 11:01:22 Entering get_proc()
10/27 11:01:22 Entering get_job_info()
10/27 11:01:22 Startup Info:
10/27 11:01:22 	Version Number: 1
10/27 11:01:22 	Id: 20030610.0
10/27 11:01:22 	JobClass: STANDARD
10/27 11:01:22 	Uid: 4093
10/27 11:01:22 	Gid: 4093
10/27 11:01:22 	VirtPid: -1
10/27 11:01:22 	SoftKillSignal: 20
10/27 11:01:22 	Cmd: "/archive/home/dkeppel/codes/s5_1yr/lalapps/bin/lalapp=
s_inspiral"
10/27 11:01:22 	Args: "--do-rsq-veto --trig-end-time 0 --cluster-method tem=
plate --disable-filter-inj-only --bank-file L1-TRIGBANK_H1H2L1-840934559-20=
48.xml.gz --high-pass-order 8 --strain-high-pass-order 8 --ifo-tag H1H2L1 -=
-gps-end-time 840936607 --calibrated-data real_8 --channel-name L1:LSC-STRA=
IN --rsq-veto-coeff 0.0075 --snr-threshold 5.5 --enable-rsq-veto --number-o=
f-segments 15 --trig-start-time 0 --dynamic-range-exponent 69.0 --enable-hi=
gh-pass 30.0 --debug-level 33 --gps-start-time 840934559 --high-pass-attenu=
ation 0.1 --chisq-bins 16 --inverse-spec-length 16 --rsq-veto-threshold 15.=
0 --segment-length 1048576 --low-frequency-cutoff 40.0 --pad-data 8 --sampl=
e-rate 4096 --chisq-threshold 10.0 --rsq-veto-max-snr 12.0 --resample-filte=
r ldas --strain-high-pass-atten 0.1 --strain-high-pass-freq 30.0 --segment-=
overlap 524288 --frame-cache cache/L-L1_RDS_C03_L2-840928791-840945622.cach=
e --chisq-delta 0.2 --approximant FindChirpSP --rsq-veto-pow 1.124764 --rsq=
-veto-time-thresh 0.0002 --write-compress --enable-output --rsq-veto-window=
 6.0 --ts-cluster T0T3TcAS --ts-metric-scaling 0.06 --spectrum-type median"
10/27 11:01:22 	Env: "KMP_LIBRARY=3Dserial;MKL_SERIAL=3Dyes"
10/27 11:01:22 	Iwd: "/mnt/qfs2/dkeppel/analysis/lowcbc/20051104-20061114/f=
ull_data/slides_cat123_rsq"
10/27 11:01:22 	Ckpt Wanted: TRUE
10/27 11:01:22 	Is Restart: FALSE
10/27 11:01:22 	Core Limit Valid: TRUE
10/27 11:01:22 	Coredump Limit 0
10/27 11:01:22 User uid set to 4093
10/27 11:01:22 User uid set to 4093
10/27 11:01:22 User Process 20030610.0 {
10/27 11:01:22   cmd =3D /archive/home/dkeppel/codes/s5_1yr/lalapps/bin/lal=
apps_inspiral
10/27 11:01:22   args =3D --do-rsq-veto --trig-end-time 0 --cluster-method =
template --disable-filter-inj-only --bank-file L1-TRIGBANK_H1H2L1-840934559=
-2048.xml.gz --high-pass-order 8 --strain-high-pass-order 8 --ifo-tag H1H2L=
1 --gps-end-time 840936607 --calibrated-data real_8 --channel-name L1:LSC-S=
TRAIN --rsq-veto-coeff 0.0075 --snr-threshold 5.5 --enable-rsq-veto --numbe=
r-of-segments 15 --trig-start-time 0 --dynamic-range-exponent 69.0 --enable=
-high-pass 30.0 --debug-level 33 --gps-start-time 840934559 --high-pass-att=
enuation 0.1 --chisq-bins 16 --inverse-spec-length 16 --rsq-veto-threshold =
15.0 --segment-length 1048576 --low-frequency-cutoff 40.0 --pad-data 8 --sa=
mple-rate 4096 --chisq-threshold 10.0 --rsq-veto-max-snr 12.0 --resample-fi=
lter ldas --strain-high-pass-atten 0.1 --strain-high-pass-freq 30.0 --segme=
nt-overlap 524288 --frame-cache cache/L-L1_RDS_C03_L2-840928791-840945622.c=
ache --chisq-delta 0.2 --approximant FindChirpSP --rsq-veto-pow 1.124764 --=
rsq-veto-time-thresh 0.0002 --write-compress --enable-output --rsq-veto-win=
dow 6.0 --ts-cluster T0T3TcAS --ts-metric-scaling 0.06 --spectrum-type medi=
an
10/27 11:01:22   env =3D KMP_LIBRARY=3Dserial MKL_SERIAL=3Dyes _CONDOR_SLOT=
=3Dslot2 CONDOR_VM=3Dslot2 _condor_BIND_ALL_INTERFACES=3DFALSE CONDOR_SCRAT=
CH_DIR=3D/usr1/condor/execute/dir_20218
10/27 11:01:22   local_dir =3D dir_20218
10/27 11:01:22   cur_ckpt =3D dir_20218/condor_exec.20030610.0
10/27 11:01:22   core_name =3D (either 'core' or 'core.<pid>')
10/27 11:01:22   uid =3D 4093, gid =3D 4093
10/27 11:01:22   v_pid =3D -1
10/27 11:01:22   pid =3D (NOT CURRENTLY EXECUTING)
10/27 11:01:22   exit_status_valid =3D FALSE
10/27 11:01:22   exit_status =3D (NEVER BEEN EXECUTED)
10/27 11:01:22   ckpt_wanted =3D TRUE
10/27 11:01:22   coredump_limit_exists =3D TRUE
10/27 11:01:22   coredump_limit =3D 0
10/27 11:01:22   soft_kill_sig =3D 20
10/27 11:01:22   job_class =3D STANDARD
10/27 11:01:22   state =3D NEW
10/27 11:01:22   new_ckpt_created =3D FALSE
10/27 11:01:22   ckpt_transferred =3D FALSE
10/27 11:01:22   core_created =3D FALSE
10/27 11:01:22   core_transferred =3D FALSE
10/27 11:01:22   exit_requested =3D FALSE
10/27 11:01:22   image_size =3D -1 blocks
10/27 11:01:22   user_time =3D 0
10/27 11:01:22   sys_time =3D 0
10/27 11:01:22   guaranteed_user_time =3D 0
10/27 11:01:22   guaranteed_sys_time =3D 0
10/27 11:01:22 }
10/27 11:01:22 	*FSM* Transitioning to state "GET_EXEC"
10/27 11:01:22 	*FSM* Executing state func "get_exec()" [ SUSPEND VACATE DI=
E  ]
10/27 11:01:22 Entering get_exec()
10/27 11:01:22 Executable is located on submitting host
10/27 11:01:22 Expanded executable name is "/usr1/condor/spool/cluster20030=
610.ickpt.subproc0"
10/27 11:01:22 Going to try 3 attempts at getting the initial executable
10/27 11:01:22 Entering get_file( /usr1/condor/spool/cluster20030610.ickpt.=
subproc0, dir_20218/condor_exec.20030610.0, 0755 )
10/27 11:01:22 Opened "/usr1/condor/spool/cluster20030610.ickpt.subproc0" v=
ia file stream
10/27 11:01:22 Get_file() transferred 9345252 bytes, 113671323 bytes/second
10/27 11:01:22 Fetched orig ckpt file "/usr1/condor/spool/cluster20030610.i=
ckpt.subproc0" into "dir_20218/condor_exec.20030610.0" with 1 attempt
10/27 11:01:22 Executable 'dir_20218/condor_exec.20030610.0' is linked with=
 "$CondorVersion: 6.8.5 May 17 2007 $" on a "$CondorPlatform: X86_64-LINUX_=
RHEL3 $"
10/27 11:01:22 	*FSM* Executing transition function "spawn_all"
10/27 11:01:22 Pipe built
10/27 11:01:22 New pipe_fds[14,1]
10/27 11:01:22 cmd_fd =3D 14
10/27 11:01:22 Calling execve( "/usr1/condor/execute/dir_20218/condor_exec.=
20030610.0", "condor_exec.20030610.0", "-_condor_cmd_fd", "14", "--do-rsq-v=
eto", "--trig-end-time", "0", "--cluster-method", "template", "--disable-fi=
lter-inj-only", "--bank-file", "L1-TRIGBANK_H1H2L1-840934559-2048.xml.gz", =
"--high-pass-order", "8", "--strain-high-pass-order", "8", "--ifo-tag", "H1=
H2L1", "--gps-end-time", "840936607", "--calibrated-data", "real_8", "--cha=
nnel-name", "L1:LSC-STRAIN", "--rsq-veto-coeff", "0.0075", "--snr-threshold=
", "5.5", "--enable-rsq-veto", "--number-of-segments", "15", "--trig-start-=
time", "0", "--dynamic-range-exponent", "69.0", "--enable-high-pass", "30.0=
", "--debug-level", "33", "--gps-start-time", "840934559", "--high-pass-att=
enuation", "0.1", "--chisq-bins", "16", "--inverse-spec-length", "16", "--r=
sq-veto-threshold", "15.0", "--segment-length", "1048576", "--low-frequency=
-cutoff", "40.0", "--pad-data", "8", "--sample-rate", "4096", "--chisq-thre=
shold", "10.0", "--rsq-veto-max-snr", "12.0", "--resample-filter", "ldas", =
"--strain-high-pass-atten", "0.1", "--strain-high-pass-freq", "30.0", "--se=
gment-overlap", "524288", "--frame-cache", "cache/L-L1_RDS_C03_L2-840928791=
-840945622.cache", "--chisq-delta", "0.2", "--approximant", "FindChirpSP", =
"--rsq-veto-pow", "1.124764", "--rsq-veto-time-thresh", "0.0002", "--write-=
compress", "--enable-output", "--rsq-veto-window", "6.0", "--ts-cluster", "=
T0T3TcAS", "--ts-metric-scaling", "0.06", "--spectrum-type", "median", 0, "=
KMP_LIBRARY=3Dserial", "MKL_SERIAL=3Dyes", "_CONDOR_SLOT=3Dslot2", "CONDOR_=
VM=3Dslot2", "_condor_BIND_ALL_INTERFACES=3DFALSE", "CONDOR_SCRATCH_DIR=3D/=
usr1/condor/execute/dir_20218", 0 )
10/27 11:01:22 Started user job - PID =3D 20219
10/27 11:01:22 cmd_fp =3D 0xa46a40
10/27 11:01:22 end
10/27 11:01:22 	*FSM* Transitioning to state "SUPERVISE"
10/27 11:01:22 	*FSM* Executing state func "supervise_all()" [ GET_NEW_PROC=
 SUSPEND VACATE ALARM DIE CHILD_EXIT PERIODIC_CKPT  ]

--4bRzO86E/ozDv8r1

10/27 11:01:14 (?.?) (14167):******* Standard Shadow starting up *******
10/27 11:01:14 (?.?) (14167):** $CondorVersion: 6.9.4 Aug 30 2007 $
10/27 11:01:14 (?.?) (14167):** $CondorPlatform: X86_64-LINUX_RHEL3 $
10/27 11:01:14 (?.?) (14167):*******************************************
10/27 11:01:14 (?.?) (14167):uid=0, euid=4135, gid=0, egid=4135
10/27 11:01:14 (?.?) (14167):Hostname = "<10.14.1.238:56934>", Job = 20030603.0
10/27 11:01:14 (20030603.0) (14167):Requesting Primary Starter
10/27 11:01:14 (20030603.0) (14167):Shadow: Request to run a job was ACCEPTED
10/27 11:01:14 (20030603.0) (14167):Shadow: RSC_SOCK connected, fd = 17
10/27 11:01:14 (20030603.0) (14167):Shadow: CLIENT_LOG connected, fd = 18
10/27 11:01:14 (20030603.0) (14167):My_Filesystem_Domain = "ligo"
10/27 11:01:14 (20030603.0) (14167):My_UID_Domain = "ligo"
10/27 11:01:14 (20030603.0) (14167):	Entering pseudo_get_file_stream
10/27 11:01:14 (20030603.0) (14167):	file = "/usr1/condor/spool/cluster20030603.ickpt.subproc0"
10/27 11:01:14 (20030603.0) (14167):Reaped child status - pid 14168 exited with status 0
10/27 11:01:14 (20012127.0) (14155): DoDownload: SHADOW failed to receive file(s) from <10.14.1.208:36391>; STARTER at 10.14.1.208 failed to send file(s) to <10.14.0.12:54401>: error reading from /usr1/condor/execute/dir_13246/Coinc_HSG6_S5_R2_3.35: (errno 2) No such file or directory
10/27 11:01:14 (20012127.0) (14155): Job 20012127.0 going into Hold state (code 13,2): Error from starter on slot2__AT__node208.ldas-cit.ligo.caltech.edu: STARTER at 10.14.1.208 failed to send file(s) to <10.14.0.12:54401>: error reading from /usr1/condor/execute/dir_13246/Coinc_HSG6_S5_R2_3.35: (errno 2) No such file or directory; SHADOW failed to receive file(s) from <10.14.1.208:36391>
10/27 11:01:14 (20012127.0) (14155): ZKM: setting default map to (null)
10/27 11:01:14 (20030603.0) (14167):Read: User Job - $CondorPlatform: X86_64-LINUX_RHEL3 $
10/27 11:01:14 (20030603.0) (14167):Read: User Job - $CondorVersion: 6.8.5 May 17 2007 $
10/27 11:01:14 (20030603.0) (14167):Read: Checkpoint file name is "/usr1/condor/spool/cluster20030603.proc0.subproc0"
10/27 11:01:14 (20012128.0) (14157): DoDownload: SHADOW failed to receive file(s) from <10.14.1.249:40884>; STARTER at 10.14.1.249 failed to send file(s) to <10.14.0.12:52465>: error reading from /usr1/condor/execute/dir_25546/Coinc_HSG6_S5_R2_6.68: (errno 2) No such file or directory
10/27 11:01:14 (20012127.0) (14155): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 112
10/27 11:01:14 (20012128.0) (14157): Job 20012128.0 going into Hold state (code 13,2): Error from starter on slot2__AT__node249.ldas-cit.ligo.caltech.edu: STARTER at 10.14.1.249 failed to send file(s) to <10.14.0.12:52465>: error reading from /usr1/condor/execute/dir_25546/Coinc_HSG6_S5_R2_6.68: (errno 2) No such file or directory; SHADOW failed to receive file(s) from <10.14.1.249:40884>
10/27 11:01:15 (20012128.0) (14157): ZKM: setting default map to (null)
10/27 11:01:15 (20012128.0) (14157): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 112
10/27 11:01:18 ******************************************************
10/27 11:01:18 ** condor_shadow (CONDOR_SHADOW) STARTING UP
10/27 11:01:18 ** /usr/sbin/condor_shadow
10/27 11:01:18 ** $CondorVersion: 6.9.4 Aug 30 2007 $
10/27 11:01:18 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
10/27 11:01:18 ** PID = 14169
10/27 11:01:18 ** Log last touched 10/27 11:01:15
10/27 11:01:18 ******************************************************
10/27 11:01:18 Using config source: /usr1/condor/condor_config
10/27 11:01:18 Using local config sources: 
10/27 11:01:18    /usr1/condor/condor_config.local
10/27 11:01:18 DaemonCore: Command Socket at <10.14.0.12:50157>
10/27 11:01:18 Initializing a VANILLA shadow for job 20012136.0
10/27 11:01:18 (20012136.0) (14169): Request to run on <10.14.1.208:38412> was ACCEPTED
10/27 11:01:18 ******************************************************
10/27 11:01:18 ** condor_shadow (CONDOR_SHADOW) STARTING UP
10/27 11:01:18 ** /usr/sbin/condor_shadow
10/27 11:01:18 ** $CondorVersion: 6.9.4 Aug 30 2007 $
10/27 11:01:18 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
10/27 11:01:18 ** PID = 14170
10/27 11:01:18 ** Log last touched 10/27 11:01:18
10/27 11:01:18 ******************************************************
10/27 11:01:18 Using config source: /usr1/condor/condor_config
10/27 11:01:18 Using local config sources: 
10/27 11:01:18    /usr1/condor/condor_config.local
10/27 11:01:18 DaemonCore: Command Socket at <10.14.0.12:44167>
10/27 11:01:18 Initializing a VANILLA shadow for job 20012137.0
10/27 11:01:18 (20012137.0) (14170): Request to run on <10.14.1.249:38133> was ACCEPTED
10/27 11:01:21 (20030586.0) (14123):Shadow: Job 20030586.0 exited, termsig = 0, coredump = 0, retcode = 0
10/27 11:01:21 (20030586.0) (14123):Shadow: Job exited normally with status 0
10/27 11:01:21 (20030586.0) (14123):user_time = 0 ticks
10/27 11:01:21 (20030586.0) (14123):sys_time = 2 ticks
10/27 11:01:21 (20030586.0) (14123):ZKM: setting default map to (null)
10/27 11:01:21 (20030587.0) (14124):Shadow: Job 20030587.0 exited, termsig = 0, coredump = 0, retcode = 0
10/27 11:01:21 (20030587.0) (14124):Shadow: Job exited normally with status 0
10/27 11:01:21 (20030587.0) (14124):user_time = 1 ticks
10/27 11:01:21 (20030587.0) (14124):sys_time = 1 ticks
10/27 11:01:21 (20030587.0) (14124):ZKM: setting default map to (null)
10/27 11:01:21 (20030586.0) (14123):Static Policy: removing job because OnExitRemove has become true
10/27 11:01:21 (20030586.0) (14123):********** Shadow Exiting(100) **********
10/27 11:01:21 (20028238.0) (8455):Shadow: Job 20028238.0 exited, termsig = 0, coredump = 0, retcode = 0
10/27 11:01:21 (20028238.0) (8455):Shadow: Job exited normally with status 0
10/27 11:01:21 (20028238.0) (8455):user_time = 0 ticks
10/27 11:01:21 (20028238.0) (8455):sys_time = 5 ticks
10/27 11:01:21 (20030587.0) (14124):Static Policy: removing job because OnExitRemove has become true
10/27 11:01:21 (20030587.0) (14124):********** Shadow Exiting(100) **********
10/27 11:01:21 (?.?) (14172):******* Standard Shadow starting up *******
10/27 11:01:21 (?.?) (14172):** $CondorVersion: 6.9.4 Aug 30 2007 $
10/27 11:01:21 (?.?) (14172):** $CondorPlatform: X86_64-LINUX_RHEL3 $
10/27 11:01:21 (?.?) (14172):*******************************************
10/27 11:01:21 (?.?) (14172):uid=0, euid=4135, gid=0, egid=4135
10/27 11:01:21 (?.?) (14172):Hostname = "<10.14.1.241:48655>", Job = 20030604.0
10/27 11:01:21 (20028238.0) (8455):ZKM: setting default map to (null)
10/27 11:01:21 (20030604.0) (14172):Requesting Primary Starter
10/27 11:01:22 (20030604.0) (14172):Shadow: Request to run a job was ACCEPTED
10/27 11:01:22 (20030604.0) (14172):Shadow: RSC_SOCK connected, fd = 17
10/27 11:01:22 (20030604.0) (14172):Shadow: CLIENT_LOG connected, fd = 18
10/27 11:01:22 (20030604.0) (14172):My_Filesystem_Domain = "ligo"
10/27 11:01:22 (20030604.0) (14172):My_UID_Domain = "ligo"
10/27 11:01:22 (20030604.0) (14172):	Entering pseudo_get_file_stream
10/27 11:01:22 (20030604.0) (14172):	file = "/usr1/condor/spool/cluster20030604.ickpt.subproc0"
10/27 11:01:22 (20028238.0) (8455):Static Policy: removing job because OnExitRemove has become true
10/27 11:01:22 (20028238.0) (8455):********** Shadow Exiting(100) **********
10/27 11:01:22 (20030604.0) (14172):Reaped child status - pid 14173 exited with status 0
10/27 11:01:22 (?.?) (14174):******* Standard Shadow starting up *******
10/27 11:01:22 (?.?) (14174):** $CondorVersion: 6.9.4 Aug 30 2007 $
10/27 11:01:22 (?.?) (14174):** $CondorPlatform: X86_64-LINUX_RHEL3 $
10/27 11:01:22 (?.?) (14174):*******************************************
10/27 11:01:22 (?.?) (14174):uid=0, euid=4135, gid=0, egid=4135
10/27 11:01:22 (?.?) (14174):Hostname = "<10.14.1.123:43563>", Job = 20030610.0
10/27 11:01:22 (20030610.0) (14174):Requesting Primary Starter
10/27 11:01:22 (20030610.0) (14174):Shadow: Request to run a job was ACCEPTED
10/27 11:01:22 (20030610.0) (14174):Shadow: RSC_SOCK connected, fd = 17
10/27 11:01:22 (20030610.0) (14174):Shadow: CLIENT_LOG connected, fd = 18
10/27 11:01:22 (20030610.0) (14174):My_Filesystem_Domain = "ligo"
10/27 11:01:22 (20030610.0) (14174):My_UID_Domain = "ligo"
10/27 11:01:22 (20030610.0) (14174):	Entering pseudo_get_file_stream
10/27 11:01:22 (20030610.0) (14174):	file = "/usr1/condor/spool/cluster20030610.ickpt.subproc0"
10/27 11:01:22 (20030610.0) (14174):Reaped child status - pid 14175 exited with status 0
10/27 11:01:22 (20030604.0) (14172):Read: User Job - $CondorPlatform: X86_64-LINUX_RHEL3 $
10/27 11:01:22 (20030604.0) (14172):Read: User Job - $CondorVersion: 6.8.5 May 17 2007 $
10/27 11:01:22 (20030604.0) (14172):Read: Checkpoint file name is "/usr1/condor/spool/cluster20030604.proc0.subproc0"
10/27 11:01:22 (?.?) (14176):******* Standard Shadow starting up *******
10/27 11:01:22 (?.?) (14176):** $CondorVersion: 6.9.4 Aug 30 2007 $
10/27 11:01:22 (?.?) (14176):** $CondorPlatform: X86_64-LINUX_RHEL3 $
10/27 11:01:22 (?.?) (14176):*******************************************
10/27 11:01:22 (?.?) (14176):uid=0, euid=4135, gid=0, egid=4135
10/27 11:01:22 (?.?) (14176):Hostname = "<10.14.1.105:42925>", Job = 20030611.0
10/27 11:01:22 (20030611.0) (14176):Requesting Primary Starter
10/27 11:01:22 (20030611.0) (14176):Shadow: Request to run a job was ACCEPTED
10/27 11:01:22 (20030611.0) (14176):Shadow: RSC_SOCK connected, fd = 17
10/27 11:01:22 (20030611.0) (14176):Shadow: CLIENT_LOG connected, fd = 18
10/27 11:01:22 (20030611.0) (14176):My_Filesystem_Domain = "ligo"
10/27 11:01:22 (20030611.0) (14176):My_UID_Domain = "ligo"
10/27 11:01:22 (20030611.0) (14176):	Entering pseudo_get_file_stream
10/27 11:01:22 (20030611.0) (14176):	file = "/usr1/condor/spool/cluster20030611.ickpt.subproc0"
10/27 11:01:22 (20030611.0) (14176):Reaped child status - pid 14177 exited with status 0
10/27 11:01:22 (20030610.0) (14174):Read: User Job - $CondorPlatform: X86_64-LINUX_RHEL3 $
10/27 11:01:22 (20030610.0) (14174):Read: User Job - $CondorVersion: 6.8.5 May 17 2007 $
10/27 11:01:22 (20030610.0) (14174):Read: Checkpoint file name is "/usr1/condor/spool/cluster20030610.proc0.subproc0"
10/27 11:01:22 (20030611.0) (14176):Read: User Job - $CondorPlatform: X86_64-LINUX_RHEL3 $
10/27 11:01:22 (20030611.0) (14176):Read: User Job - $CondorVersion: 6.8.5 May 17 2007 $
10/27 11:01:22 (20030611.0) (14176):Read: Checkpoint file name is "/usr1/condor/spool/cluster20030611.proc0.subproc0"
10/27 11:01:26 (20012137.0) (14170): DoDownload: SHADOW failed to receive file(s) from <10.14.1.249:37895>; STARTER at 10.14.1.249 failed to send file(s) to <10.14.0.12:44167>: error reading from /usr1/condor/execute/dir_25649/Coinc_HSG6_S5_R2_6.68: (errno 2) No such file or directory
10/27 11:01:26 (20012137.0) (14170): Job 20012137.0 going into Hold state (code 13,2): Error from starter on slot2__AT__node249.ldas-cit.ligo.caltech.edu: STARTER at 10.14.1.249 failed to send file(s) to <10.14.0.12:44167>: error reading from /usr1/condor/execute/dir_25649/Coinc_HSG6_S5_R2_6.68: (errno 2) No such file or directory; SHADOW failed to receive file(s) from <10.14.1.249:37895>
10/27 11:01:26 (20012137.0) (14170): ZKM: setting default map to (null)
10/27 11:01:26 (20012137.0) (14170): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 112
10/27 11:01:26 (20012136.0) (14169): DoDownload: SHADOW failed to receive file(s) from <10.14.1.208:37203>; STARTER at 10.14.1.208 failed to send file(s) to <10.14.0.12:50157>: error reading from /usr1/condor/execute/dir_13349/Coinc_HSG6_S5_R2_3.35: (errno 2) No such file or directory
10/27 11:01:26 ******************************************************
10/27 11:01:26 ** condor_shadow (CONDOR_SHADOW) STARTING UP
10/27 11:01:26 ** /usr/sbin/condor_shadow
10/27 11:01:26 ** $CondorVersion: 6.9.4 Aug 30 2007 $
10/27 11:01:26 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
10/27 11:01:26 ** PID = 14179
10/27 11:01:26 ** Log last touched 10/27 11:01:26
10/27 11:01:26 ******************************************************
10/27 11:01:26 Using config source: /usr1/condor/condor_config
10/27 11:01:26 Using local config sources: 
10/27 11:01:26    /usr1/condor/condor_config.local
10/27 11:01:26 DaemonCore: Command Socket at <10.14.0.12:37058>
10/27 11:01:26 Initializing a VANILLA shadow for job 20012138.0
10/27 11:01:26 (20012136.0) (14169): Job 20012136.0 going into Hold state (code 13,2): Error from starter on slot2__AT__node208.ldas-cit.ligo.caltech.edu: STARTER at 10.14.1.208 failed to send file(s) to <10.14.0.12:50157>: error reading from /usr1/condor/execute/dir_13349/Coinc_HSG6_S5_R2_3.35: (errno 2) No such file or directory; SHADOW failed to receive file(s) from <10.14.1.208:37203>
10/27 11:01:26 (20012136.0) (14169): ZKM: setting default map to (null)
10/27 11:01:26 (20012138.0) (14179): Request to run on <10.14.1.249:38133> was ACCEPTED
10/27 11:01:26 (20012136.0) (14169): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 112
10/27 11:01:26 ******************************************************
10/27 11:01:26 ** condor_shadow (CONDOR_SHADOW) STARTING UP
10/27 11:01:26 ** /usr/sbin/condor_shadow
10/27 11:01:26 ** $CondorVersion: 6.9.4 Aug 30 2007 $
10/27 11:01:26 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
10/27 11:01:26 ** PID = 14180
10/27 11:01:26 ** Log last touched 10/27 11:01:26
10/27 11:01:26 ******************************************************
10/27 11:01:26 Using config source: /usr1/condor/condor_config
10/27 11:01:26 Using local config sources: 
10/27 11:01:26    /usr1/condor/condor_config.local
10/27 11:01:26 DaemonCore: Command Socket at <10.14.0.12:44275>
10/27 11:01:26 Initializing a VANILLA shadow for job 20012139.0
10/27 11:01:26 (20012139.0) (14180): Request to run on <10.14.1.208:38412> was ACCEPTED
10/27 11:01:30 (20030824.0) (13394): ZKM: setting default map to (null)
10/27 11:01:30 (20030824.0) (13394): Job 20030824.0 terminated: exited with status 0
10/27 11:01:30 (20030824.0) (13394): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 100
10/27 11:01:34 (20012138.0) (14179): DoDownload: SHADOW failed to receive file(s) from <10.14.1.249:58830>; STARTER at 10.14.1.249 failed to send file(s) to <10.14.0.12:37058>: error reading from /usr1/condor/execute/dir_25752/Coinc_HSG6_S5_R2_3.35: (errno 2) No such file or directory
10/27 11:01:34 (20012139.0) (14180): DoDownload: SHADOW failed to receive file(s) from <10.14.1.208:55713>; STARTER at 10.14.1.208 failed to send file(s) to <10.14.0.12:44275>: error reading from /usr1/condor/execute/dir_13451/Coinc_HSG6_S5_R2_6.68: (errno 2) No such file or directory
10/27 11:01:34 (20012138.0) (14179): Job 20012138.0 going into Hold state (code 13,2): Error from starter on slot2__AT__node249.ldas-cit.ligo.caltech.edu: STARTER at 10.14.1.249 failed to send file(s) to <10.14.0.12:37058>: error reading from /usr1/condor/execute/dir_25752/Coinc_HSG6_S5_R2_3.35: (errno 2) No such file or directory; SHADOW failed to receive file(s) from <10.14.1.249:58830>
10/27 11:01:34 (20012139.0) (14180): Job 20012139.0 going into Hold state (code 13,2): Error from starter on slot2__AT__node208.ldas-cit.ligo.caltech.edu: STARTER at 10.14.1.208 failed to send file(s) to <10.14.0.12:44275>: error reading from /usr1/condor/execute/dir_13451/Coinc_HSG6_S5_R2_6.68: (errno 2) No such file or directory; SHADOW failed to receive file(s) from <10.14.1.208:55713>
10/27 11:01:34 (20012138.0) (14179): ZKM: setting default map to (null)
10/27 11:01:35 (20012138.0) (14179): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 112
10/27 11:01:35 (20012139.0) (14180): ZKM: setting default map to (null)
10/27 11:01:35 (20012139.0) (14180): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 112
10/27 11:01:37 (20024603.0) (9182): ZKM: setting default map to (null)
10/27 11:01:37 (20024603.0) (9182): Job 20024603.0 terminated: exited with status 0
10/27 11:01:37 (20024603.0) (9182): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 100
10/27 11:01:39 ******************************************************
10/27 11:01:39 ** condor_shadow (CONDOR_SHADOW) STARTING UP
10/27 11:01:39 ** /usr/sbin/condor_shadow
10/27 11:01:39 ** $CondorVersion: 6.9.4 Aug 30 2007 $
10/27 11:01:39 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
10/27 11:01:39 ** PID = 14325
10/27 11:01:39 ** Log last touched 10/27 11:01:37
10/27 11:01:39 ******************************************************
10/27 11:01:39 Using config source: /usr1/condor/condor_config
10/27 11:01:39 Using local config sources: 
10/27 11:01:39    /usr1/condor/condor_config.local
10/27 11:01:39 DaemonCore: Command Socket at <10.14.0.12:42514>
10/27 11:01:39 Initializing a VANILLA shadow for job 20012140.0
10/27 11:01:39 (20012140.0) (14325): Request to run on <10.14.1.249:38133> was ACCEPTED
10/27 11:01:39 ******************************************************
10/27 11:01:39 ** condor_shadow (CONDOR_SHADOW) STARTING UP
10/27 11:01:39 ** /usr/sbin/condor_shadow
10/27 11:01:39 ** $CondorVersion: 6.9.4 Aug 30 2007 $
10/27 11:01:39 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
10/27 11:01:39 ** PID = 14326
10/27 18:32:04 (20032162.0) (16347): ZKM: setting default map to (null)

--4bRzO86E/ozDv8r1--

===========================================================================
Date mail was appended: Sat Oct 27 20:38:44 2007 (1193535524)
Subject: Actions

Assigned to psilord by zmiller
===========================================================================
Date of actions: Fri Nov 16 16:25:26 2007 (1195251926)
Date: Fri, 16 Nov 2007 16:38:47 -0600
From: Peter Keller <psilord__AT__cs.wisc.edu>
To: zmiller <condor-admin__AT__cs.wisc.edu>
Subject: Re: [condor-admin #17143] LIGO: ImageSize update problem

Hello,

> From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
> 
> The LIGO Condor pool at Caltech running,
> 
> # condor_version
> $CondorVersion: 6.9.4 Aug 30 2007 $
> $CondorPlatform: X86_64-LINUX_RHEL3 $
> 
> does not appear to be reliably updating the ImageSize job ClassAdd.
> For example, here are 3 jobs that where submitted from the same submit
> machine by the same user and have been running on the same execute
> machine for an extended period of time. However, only one of them is
> showing the right ImageSize.

Ok, after looking to the codebase, I found this behavior:

The ImageSize for a standard universe job is updated (these are the
imporant ones) at four conceptual locations in the code: A) when the
job is submitted--the size of the executable is used, B) when the job
starts executing on the remote node, C) when the job finishes execution,
and D) when the job checkpoints for any reason, even if that checkpoint
is aborted because WantCheckpoint is false. So, if you set your periodic
checkpoint value to something like 24 hours, the starter will only inform
the shadow every 24 hours of the change of image size.

However, the update only gets written to the shadow's in-memory classad and
is only updated to the schedd upon job termination or eviction.

So, this tells me that you'll only see the updated ImageSize attribute
from the perspective of condor_q on job termination/eviction boundaries.

What behavior would you like instead? Once I know, then I can consider the
ramfifications of it. Be aware that it is expensive for a shadow to call
back to a schedd via a socket, since the schedd is usually heavily loaded.

Thank you.

-pete

===========================================================================
Date mail was appended: Fri Nov 16 16:38:56 2007 (1195252737)
Subject: Actions

Status changed from open to pending by psilord
===========================================================================
Date of actions: Fri Nov 16 16:38:57 2007 (1195252737)
Date: Fri, 16 Nov 2007 22:22:07 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: skoranda__AT__gravity.phys.uwm.edu
Subject: Re: [condor-admin #17143] LIGO: ImageSize update problem
X-Seen-BY: mailfromd 4.1 obsidian.cs.wisc.edu

Pete,
	Thanks for the explanation. I think a starting position would
be to understand the performance impact of having the standard universe
report back the same information on the same timescale as the vanilla
universe, perhaps even using the same configuration knobs. If that
level of reporting adversely affects pool performance then I may flip
this around and ask to be able to disable this for the vanilla universe.

Thanks.

On Fri, Nov 16, 2007 at 04:38:56PM -0600, condor-admin response tracking system wrote:
> Hello,
> 
> > From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
> > 
> > The LIGO Condor pool at Caltech running,
> > 
> > # condor_version
> > $CondorVersion: 6.9.4 Aug 30 2007 $
> > $CondorPlatform: X86_64-LINUX_RHEL3 $
> > 
> > does not appear to be reliably updating the ImageSize job ClassAdd.
> > For example, here are 3 jobs that where submitted from the same submit
> > machine by the same user and have been running on the same execute
> > machine for an extended period of time. However, only one of them is
> > showing the right ImageSize.
> 
> Ok, after looking to the codebase, I found this behavior:
> 
> The ImageSize for a standard universe job is updated (these are the
> imporant ones) at four conceptual locations in the code: A) when the
> job is submitted--the size of the executable is used, B) when the job
> starts executing on the remote node, C) when the job finishes execution,
> and D) when the job checkpoints for any reason, even if that checkpoint
> is aborted because WantCheckpoint is false. So, if you set your periodic
> checkpoint value to something like 24 hours, the starter will only inform
> the shadow every 24 hours of the change of image size.
> 
> However, the update only gets written to the shadow's in-memory classad and
> is only updated to the schedd upon job termination or eviction.
> 
> So, this tells me that you'll only see the updated ImageSize attribute
> from the perspective of condor_q on job termination/eviction boundaries.
> 
> What behavior would you like instead? Once I know, then I can consider the
> ramfifications of it. Be aware that it is expensive for a shadow to call
> back to a schedd via a socket, since the schedd is usually heavily loaded.
> 
> Thank you.
> 
> -pete
> 
> 
> ========================================
> MESSAGE INFORMATION
> ========================================
> * From: Peter Keller <psilord__AT__cs.wisc.edu>
> * Ticket Email List: anderson__AT__ligo.caltech.edu, skoranda__AT__gravity.phys.uwm.edu
> 
> -- 
> ======================================================================
> This mail was sent from the RUST Mail System
> Please direct all replies to condor-admin__AT__cs.wisc.edu
> Please include the current subject line in your reply.
> ======================================================================
> 

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Sat Nov 17  1:16:55 2007 (1195283815)
Date: Wed, 28 Nov 2007 13:26:29 -0600
From: Peter Keller <psilord__AT__cs.wisc.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
Subject: Re: [condor-admin #17143] LIGO: ImageSize update problem

Hello,

> 	Thanks for the explanation. I think a starting position would
> be to understand the performance impact of having the standard universe
> report back the same information on the same timescale as the vanilla
> universe, perhaps even using the same configuration knobs. If that
> level of reporting adversely affects pool performance then I may flip
> this around and ask to be able to disable this for the vanilla universe.

Ok, I'll check out how difficult it would be to do this and what impact
it might have.

Thank you.

-pete

===========================================================================
Date mail was appended: Wed Nov 28 13:26:38 2007 (1196277999)