LIGO Support Ticket 15795

Ticket Information
  Number:      admin 15795
  User:        anderson@ligo.caltech.edu
  Email:       espinoza_e__AT__ligo.caltech.edu,skoranda__AT__gravity.phys.uwm.edu,dan__AT__hep.wisc.edu
  Status:      resolved
  Assigned To: zmiller
Date: Tue, 3 Jul 2007 09:26:43 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
CC: Erik Espinoza <espinoza_e__AT__ligo.caltech.edu>,         Scott Koranda
 <skoranda__AT__gravity.phys.uwm.edu>
Subject: LIGO: Excessive Shadow stat() calls

On the LIGO CIT Condor pool running,

# condor_version
$CondorVersion: 6.8.5 May 17 2007 $
$CondorPlatform: X86_64-LINUX_RHEL3 $

it has been observed that some Vanilla universe jobs have their Shadow
processes making an excessive number of stat() system calls, which is
overloading the submit machine.

This is occurring even though,

WantRemoteSyscalls = FALSE
WantRemoteIO = FALSE
ShouldTransferFiles = "NO"
TransferFiles = "NEVER"

in the job ClassAd. The files being stat()'ed are all user owned/created
files. With the above job ClassAd settings, under what circumstances
should the Shadow processes stat() user files in the Vanilla universe?

It appears that most (perhaps all?) of these jobs were started with
condor_run, if that makes any difference.  These are also jobs that are
eventually failing with "Shadow exception!" and then being re-run. So
here is another vote for keeping the shadow-exception count in the
job ClassAd so we can add a PeriodicHold expression to catch
failing jobs with a more precise criterion than JobRunCount.



Here is an example job ClassAd:
# condor_q -long 15119687.0


-- Quill: citquill@ligo : <10.14.0.25:5432> : citquill_db
MyType = "Job"
TargetType = "Machine"
RemoteWallClockTime = 12871
MaxHosts = 1
OrigMaxHosts = 1
JobStatus = 2
EnteredCurrentStatus = 1183477895
LastSuspensionTime = 0
CurrentHosts = 1
RemoteVirtualMachineID = 2
ShadowBday = 1183477904
JobLastStartDate = 1183475834
JobCurrentStartDate = 1183477904
JobRunCount = 9
ProcId = 0
JobStartDate = 1183462546
LastVacateTime = 1183464967
BytesSent = 0
BytesRecvd = 0
LastJobLeaseRenewal = 1183479181
LastRemoteHost = "vm2__AT__node92.ldas-cit.ligo.caltech.edu"
LastPublicClaimId = "<10.14.1.92:42375>#1182879166#979#..."
LastPublicClaimIds = ""
PublicClaimId = "<10.14.1.28:42106>#1182879134#1025#..."
RemoteHost = "vm2__AT__node28.ldas-cit.ligo.caltech.edu"
GlobalJobId = "ldas-grid.ligo.caltech.edu#1183461572#15119687.0"
ServerTime = 1183479184
ClusterId = 15119687
QDate = 1183461461
CompletionDate = 0
LocalUserCpu = 0
LocalSysCpu = 0
RemoteUserCpu = 0
RemoteSysCpu = 0
ExitStatus = 0
NumCkpts_RAW = 0
NumCkpts = 0
NumRestarts = 0
NumSystemHolds = 0
CommittedTime = 0
TotalSuspensions = 0
CumulativeSuspensionTime = 0
JobUniverse = 5
MinHosts = 1
JobPrio = 0
JobNotification = 0
CoreSize = 0
Rank = 0
BufferSize = 524288
BufferBlockSize = 32768
ImageSize_RAW = 1
ImageSize = 10000
ExecutableSize_RAW = 1
ExecutableSize = 10000
DiskUsage_RAW = 1
DiskUsage = 10000
JobLeaseDuration = 1200
Owner = "dietz"
ExitBySignal = FALSE
Notification = ERROR
WantBadgers = TRUE
periodic_hold = (JobRunCount > 1000)
CondorVersion = "$CondorVersion: 6.8.5 May 17 2007 $"
CondorPlatform = "$CondorPlatform: X86_64-LINUX_RHEL3 $"
RootDir = "/"
Iwd = "/mnt/qfs/dietz/Work/E_ExtTrig/E200_Runs/E231_070201/InjectionV3/InjectionSet3/GRB854378604/injections21"
Cmd = "/mnt/qfs/dietz/Work/E_ExtTrig/E200_Runs/E231_070201/InjectionV3/InjectionSet3/GRB854378604/injections21/.condor_run.4632"
WantRemoteSyscalls = FALSE
WantCheckpoint = FALSE
User = "dietz@ligo"
NiceUser = FALSE
Environment = "O1=alex__AT__adietz.phys.lsu.edu GRID_SECURITY_DIR=/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/globus/etc MANPATH=/archive/home/dietz/Install/vds/man:/archive/home/dietz/Install/LAL//man:/archive/home/dietz/Install/LAL//man:/opt/lscsoft/lalapps/share/man:/opt/lscsoft/libframe/man:/opt/lscsoft/libmetaio/man:/opt/lscsoft/framecpp/share/man:/opt/lscsoft/root/man:/opt/lscsoft/lalapps/share/man:/opt/lscsoft/libframe/man:/opt/lscsoft/libmetaio/man:/opt/lscsoft/framecpp/share/man:/opt/lscsoft/root/man:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/vds/man:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/globus/man::/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/vdt/man:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/perl/man:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/expat/man:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/logrotate/man:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/jdk1.4/man:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/edg/share/man:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/mysql/man:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/vds/contrib/gstar/man:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/glite/share/man:/archive/home/dietz/Install/vds/contrib/gstar/man TERM=xterm LAL_PREFIX=/archive/home/dietz/Install/LAL/ LSCSOFT_PREFIX=/opt/lscsoft HOSTNAME=ldas-grid MYSQL_UNIX_PORT=/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/vdt-app-data/mysql/var/mysql.sock SASL_PATH=/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/globus/lib/sasl GSTAR_LOCATION=/archive/home/dietz/Install/vds/contrib/gstar 
LD_LIBRARY_PATH=/archive/home/dietz/Install/pylal/lib64/python2.4/site-packages:/archive/home/dietz/Install/glue/lib64/python2.4/site-packages:/archive/home/dietz/Install/LAL//lib:/opt/lscsoft/glue/lib64/python2.4/site-packages:/opt/lscsoft/libframe/lib64:/opt/lscsoft/libmetaio/lib64:/opt/lscsoft/framecpp/lib64:/opt/lscsoft/dol/lib64:/opt/lscsoft/root/lib64:/opt/lscsoft/glue/lib64/python2.4/site-packages:/opt/lscsoft/libframe/lib64:/opt/lscsoft/libmetaio/lib64:/opt/lscsoft/framecpp/lib64:/opt/lscsoft/dol/lib64:/opt/lscsoft/root/lib64:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/glite/lib:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/myodbc/lib:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/unixodbc/lib:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/mysql/lib/mysql:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/jdk1.4/jre/lib/i386:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/jdk1.4/jre/lib/i386/server:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/jdk1.4/jre/lib/i386/client:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/berkeley-db/lib:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/expat/lib:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/globus/lib:/ldcg/ldg/vdt/globus/lib:/ligotools/lib LSCSOFT_LOCATION=/opt/lscsoft GLITE_LOCATION_VAR=/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/glite/var MLHO=dietz__AT__ldas-grid.ligo-wa.caltech.edu SHELL=/bin/bash MATLABPATH=/ligotools/matlab HISTSIZE=1000 EDG_LOCATION=/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/edg LDG_LOCATION=/ldcg/stow_pkgs/ldg-4.4/ldg FRAMECPP_PREFIX=/opt/lscsoft/framecpp VOMS_USERCONF=/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/glite/etc/vomses PYLAL_LOCATION=/archive/home/dietz/Install/pylal GLOBUS_LOCATION=/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/globus SSH_CLIENT=131.251.40.230' '22760' '22 GLOBUS_PATH=/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/globus FRAMECPP_LOCATION=/opt/lscsoft/framecpp GLITE_LOCATION_LOG=/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/glite/log _condor_ALL_DEBUG= X509_CADIR=/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/globus/TRUSTED_CA X509_CERT_DIR=/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/globus/TRUSTED_CA 
PERL5LIB=/archive/home/dietz/Install/vds/lib/perl:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/vds/lib/perl:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/vdt/lib:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/perl/lib/5.8.0:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/perl/lib/5.8.0/x86_64-linux-thread-multi:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/perl/lib/site_perl/5.8.0:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/perl/lib/site_perl/5.8.0/x86_64-linux-thread-multi::/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/vds/contrib/gstar/lib:/archive/home/dietz/Install/vds/contrib/gstar/lib CVSROOT=:pserver:dietz__AT__gravity.phys.uwm.edu:2402/usr/local/cvs/lscsoft PYLAL_PREFIX=/archive/home/dietz/Install/pylal MT=alex__AT__dietztower.astro.cf.ac.uk MLLO=dietz__AT__ldas-grid.ligo-la.caltech.edu HT=alex__AT__dietztower.astro.cf.ac.uk:/home/alex DAGDBUPDATORLOCKFILE=/etc/onasys-dblockfile PWD=/archive/home/dietz/Work/E_ExtTrig/E200_Runs/E231_070201/InjectionV3/InjectionSet3/GRB854378604/injections21 KDEDIR=/usr DOL_LOCATION=/opt/lscsoft/dol SHLVL=2 SSH_TTY=/dev/pts/0 MUWM=dietz__AT__hydra.phys.uwm.edu GLOBUS_TCP_PORT_RANGE=40000,45000 USER=dietz VDS_HOME=/archive/home/dietz/Install/vds MSOL=adietz__AT__sol.phys.lsu.edu LS_COLORS=no=00:fi=00:di=00;34:ln=00;36:pi=40;33:so=00;35:bd=40;33;01:cd=40;33;01:or=01;05;37;41:mi=01;05;37;41:ex=00;32:*.cmd=00;32:*.exe=00;32:*.com=00;32:*.btm=00;32:*.bat=00;32:*.sh=00;32:*.csh=00;32:*.tar=00;31:*.tgz=00;31:*.arj=00;31:*.taz=00;31:*.lzh=00;31:*.zip=00;31:*.z=00;31:*.Z=00;31:*.gz=00;31:*.bz2=00;31:*.bz=00;31:*.tz=00;31:*.rpm=00;31:*.cpio=00;31:*.jpg=00;35:*.gif=00;35:*.bmp=00;35:*.xbm=00;35:*.xpm=00;35:*.png=00;35:*.tif=00;35: GLUE_LOCATION=/archive/home/dietz/Install/glue GLOBUS_ERROR_VERBOSE=true MPUB=dietz__AT__phys.lsu.edu:/home3/dietz/public_html/ ROOTSYS=/opt/lscsoft/root GPT_LOCATION=/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/gpt LDG_INSTALL_LOG=/ldcg/stow_pkgs/ldg-4.4/ldg/ldg-server/etc/ldg-install.log GLITE_LOCATION_TMP=/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/glite/tmp MDECATUR=dietz__AT__decatur.ligo-la.caltech.edu 
LDG_SOFTWARE_LOCATION=http://www.ldas-sw.ligo.caltech.edu/ldg_dist/ldg4.4/software SSH_AUTH_SOCK=/tmp/ssh-eytFv26346/agent.26346 MHELIX=adietz__AT__helix.bcvc.lsu.edu LIBPATH=/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/globus/lib:/ldcg/ldg/vdt/globus/lib:/usr/lib:/lib MOFFICE=alex__AT__adietz.phys.lsu.edu MLAGUACHA=dietz__AT__laguacha.phys.lsu.edu MSOLMAT=matlab__AT__sol.phys.lsu.edu LAL_LOCATION=/archive/home/dietz/Install/LAL PATH=/archive/home/dietz/Install/pylal/bin:/archive/home/dietz/Install/vds/bin:/archive/home/dietz/Install/glue/bin:/archive/home/dietz/Install/LAL//bin:/archive/home/dietz/Install/LAL//bin:/opt/lscsoft/lalapps/bin:/opt/lscsoft/glue/bin:/opt/lscsoft/libframe/bin:/opt/lscsoft/libmetaio/bin:/opt/lscsoft/framecpp/bin:/opt/lscsoft/dol/bin:/opt/lscsoft/root/bin:/opt/lscsoft/lalapps/bin:/opt/lscsoft/glue/bin:/opt/lscsoft/libframe/bin:/opt/lscsoft/libmetaio/bin:/opt/lscsoft/framecpp/bin:/opt/lscsoft/dol/bin:/opt/lscsoft/root/bin:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/glite/sbin:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/glite/bin:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/vds/bin:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/pyglobus-url-copy/bin:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/unixodbc/bin:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/mysql/bin:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/edg/sbin:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/jdk1.4/bin:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/logrotate/sbin:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/gpt/sbin:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/globus/bin:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/globus/sbin:/ldcg/pacman/stow_pkgs/pacman-3.19/src:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/vdt/sbin:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/vdt/bin:/ldcg/stow_pkgs/ldg-4.4/ldg/ldg-server/bin:/usr/kerberos/bin:/usr/bin:/bin:/usr/sbin:/sbin:/ldcg/ldg/vdt/globus/bin:/usr/X11R6/bin:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/vds/contrib/gstar/bin:/ligotools/bin:/ldcg/matlab_r2006a/bin:/archive/home/dietz/Install/vds/contrib/gstar/bin:/archive/home/dietz/bin:. 
MAIL=/var/spool/mail/dietz _=/usr/bin/condor_run JAVA_HOME=/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/jdk1.4 CONDOR_LOCATION=/usr VDT_LOCATION=/ldcg/stow_pkgs/ldg-4.4/ldg/vdt CONDOR_CONFIG=/usr1/condor/condor_config LOGNAME=dietz INPUTRC=/etc/inputrc ODBCINI=/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/unixodbc/etc/odbc.ini LSC_SEGFIND_SERVER=ldas-cit.ligo.caltech.edu ROOT_LOCATION=/opt/lscsoft/root LANG=C LSCSOFTCVS=:pserver:dietz__AT__gravity.phys.uwm.edu:2402/usr/local/cvs/lscsoft MOFFICE2=alex__AT__dietz.phys.lsu.edu MLAPTOP=alex__AT__hvproj9.phys.lsu.edu:/home/alex HOME=/archive/home/dietz LIGOTOOLS=/ligotools MCOMA=spxad1__AT__coma.astro.cf.ac.uk GLUE_PREFIX=/archive/home/dietz/Install/glue X509_USER_PROXY=/tmp/x509up_p26346.fileUjuJXS.1 O2=alex__AT__dietz.phys.lsu.edu BOSSDIR=/etc MMAIL=dietz__AT__phys.lsu.edu VDT_INSTALL_LOG=vdt-install.log DYLD_LIBRARY_PATH=/archive/home/dietz/Install/pylal/lib64/python2.4/site-packages:/archive/home/dietz/Install/glue/lib64/python2.4/site-packages:/archive/home/dietz/Install/LAL//lib:/opt/lscsoft/glue/lib64/python2.4/site-packages:/opt/lscsoft/framecpp/lib64:/opt/lscsoft/glue/lib64/python2.4/site-packages:/opt/lscsoft/framecpp/lib64:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/globus/lib VDS_JAVA_HEAPMAX=1024 LSC_DATAFIND_SERVER=ldas-cit.ligo.caltech.edu 
PYTHONPATH=/archive/home/dietz/Install/pylal/lib/python:/archive/home/dietz/Install/glue/lib/python:/archive/home/dietz/Install/pylal/lib64/python2.4/site-packages:/archive/home/dietz/Install/pylal/lib/python2.4/site-packages:/archive/home/dietz/Install/glue/lib64/python2.4/site-packages:/archive/home/dietz/Install/glue/lib/python2.4/site-packages:/archive/home/dietz/Install/LAL//lib64/python2.4/site-packages:/archive/home/dietz/Install/LAL//lib/python2.4/site-packages:/opt/lscsoft/lalapps/lib64/python2.4/site-packages:/opt/lscsoft/lalapps/lib/python2.4/site-packages:/opt/lscsoft/glue/lib64/python2.4/site-packages:/var/tmp/glue-1.13-root/opt/lscsoft/glue/lib/python2.4/site-packages:/opt/lscsoft/libframe/lib64/python:/opt/lscsoft/libmetaio/lib64/python:/opt/lscsoft/lalapps/lib/python:/opt/lscsoft/lalapps/lib64/python2.4/site-packages:/opt/lscsoft/lalapps/lib/python2.4/site-packages:/opt/lscsoft/glue/lib64/python2.4/site-packages:/var/tmp/glue-1.13-root/opt/lscsoft/glue/lib/python2.4/site-packages:/opt/lscsoft/libframe/lib64/python:/opt/lscsoft/libmetaio/lib64/python:/ldcg/stow_pkgs/ldg-4.4/ldg/ldg-server/lib64/python:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/globus/lib64/python:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/globus/lib/python:/ldcg/pacman/stow_pkgs/pacman-3.19/src: SSH_CONNECTION=131.251.40.230' '22760' '131.215.114.6' '22 LSC_DATAGRID_SERVER_LOCATION=/ldcg/ldg GLOBUS_MYSQL_PATH=/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/mysql 
CLASSPATH=/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/vds/lib/gvds.jar:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/vds/lib/xmldb.jar:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/vds/lib/xmlParserAPIs.jar:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/vds/lib/cryptix-asn1.jar:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/vds/lib/commons-pool.jar:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/vds/lib/postgresql-8.1dev-400.jdbc3.jar:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/vds/lib/mysql-connector-java-3.0.11-stable-bin.jar:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/vds/lib/jlinker.jar:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/vds/lib/jce-jdk13-117.jar:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/vds/lib/resolver.jar:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/vds/lib/puretls.jar:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/vds/lib/junit.jar:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/vds/lib/xmlrpc.jar:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/vds/lib/cog-jglobus.jar:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/vds/lib/java-getopt-1.0.9.jar:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/vds/lib/exist.jar:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/vds/lib/xercesImpl.jar:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/vds/lib/loggerservice-stub.jar:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/vds/lib/jakarta-oro.jar:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/vds/lib/cryptix32.jar:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/vds/lib/rls.jar:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/vds/lib/log4j-1.2.8.jar:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/vds/lib/exist-optional.jar:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/vds/lib/cryptix.jar:/archive/home/dietz/Install/vds/lib/gvds.jar:/archive/home/dietz/Install/vds/lib/rls.jar:/archive/home/dietz/Install/vds/lib/exist.jar:/archive/home/dietz/Install/vds/lib/commons-pool.jar:/archive/home/dietz/Install/vds/lib/jlinker.jar:/archive/home/dietz/Install/vds/lib/puretls.jar:/archive/home/dietz/Install/vds/lib/cog-jglobus.jar:/archive/home/dietz/Install/vds/lib/xmlrpc.jar:/archive/home/dietz/Install/vds/lib/xmlParserAPIs.jar:/archive/home/dietz/Install/vds/lib/cryptix-asn1.jar:/archive/home/dietz/Install/vds/lib/xmldb.jar:/archive/home/dietz/Install/vds/lib/resolver.jar:/archive/home/dietz/Install/vds/lib/cryptix32.jar:/a
rchive/home/dietz/Install/vds/lib/postgresql-8.1dev-400.jdbc3.jar:/archive/home/dietz/Install/vds/lib/mysql-connector-java-3.0.11-stable-bin.jar:/archive/home/dietz/Install/vds/lib/java-getopt-1.0.9.jar:/archive/home/dietz/Install/vds/lib/log4j-1.2.8.jar:/archive/home/dietz/Install/vds/lib/cryptix.jar:/archive/home/dietz/Install/vds/lib/exist-optional.jar:/archive/home/dietz/Install/vds/lib/xercesImpl.jar:/archive/home/dietz/Install/vds/lib/jce-jdk13-117.jar:/archive/home/dietz/Install/vds/lib/jakarta-oro.jar:/archive/home/dietz/Install/vds/lib/loggerservice-stub.jar:/archive/home/dietz/Install/vds/lib/junit.jar LESSOPEN=|/usr/bin/lesspipe.sh' '%s LDG_DIRECTORY=/ldcg/stow_pkgs/ldg-4.4/ldg/ldg-server PKG_CONFIG_PATH=/archive/home/dietz/Install/LAL//lib/pkgconfig:/opt/lscsoft/libframe/lib64/pkgconfig:/opt/lscsoft/libmetaio/lib64/pkgconfig:/opt/lscsoft/framecpp/lib64/pkgconfig:/opt/lscsoft/dol/lib64/pkgconfig:/opt/lscsoft/root/lib64/pkgconfig:/opt/lscsoft/libframe/lib64/pkgconfig:/opt/lscsoft/libmetaio/lib64/pkgconfig:/opt/lscsoft/framecpp/lib64/pkgconfig:/opt/lscsoft/dol/lib64/pkgconfig:/opt/lscsoft/root/lib64/pkgconfig: SHLIB_PATH=/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/globus/lib:/ldcg/ldg/vdt/globus/lib VDT_POSTINSTALL_README=/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/post-install/README DISPLAY=localhost:10.0 GLITE_LOCATION=/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/glite G_BROKEN_FILENAMES=1 PACMAN_LOCATION=/ldcg/pacman/stow_pkgs/pacman-3.19"
WantRemoteIO = FALSE
UserLog = "/mnt/qfs/dietz/Work/E_ExtTrig/E200_Runs/E231_070201/InjectionV3/InjectionSet3/GRB854378604/injections21/.condor_log.4632"
KillSig = "SIGTERM"
In = "/dev/null"
TransferIn = FALSE
Out = ".condor_out.4632"
StreamOut = FALSE
Err = ".condor_error.4632"
StreamErr = FALSE
ShouldTransferFiles = "NO"
TransferFiles = "NEVER"
Requirements = (Arch == "X86_64") && (OpSys == "LINUX") && (Disk >= DiskUsage) && ((Memory * 1024) >= ImageSize) && (TARGET.FileSystemDomain == MY.FileSystemDomain)
FileSystemDomain = "ligo"
PeriodicHold = (JobRunCount > 1000)
PeriodicRelease = FALSE
PeriodicRemove = FALSE
OnExitHold = FALSE
OnExitRemove = TRUE
LeaveJobInQueue = FALSE
Arguments = ""


Here is the UserLog file,
cat /home//dietz/Work/E_ExtTrig/E200_Runs/E231_070201/InjectionV3/InjectionSet3/GRB854378604/injections21/.condor_log.4632
000 (15119687.000.000) 07/03 04:19:32 Job submitted from host: <10.14.0.12:44955>
...
007 (15119687.000.000) 07/03 04:52:12 Shadow exception!
        Assertion ERROR on (result)
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
...
007 (15119687.000.000) 07/03 06:00:59 Shadow exception!
        Assertion ERROR on (result)
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
...
007 (15119687.000.000) 07/03 06:34:48 Shadow exception!
        Assertion ERROR on (result)
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
...
007 (15119687.000.000) 07/03 07:08:09 Shadow exception!
        Assertion ERROR on (result)
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
...
007 (15119687.000.000) 07/03 07:31:59 Shadow exception!
        Assertion ERROR on (result)
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
...
007 (15119687.000.000) 07/03 08:02:11 Shadow exception!
        Assertion ERROR on (result)
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
...
007 (15119687.000.000) 07/03 08:32:51 Shadow exception!
        Assertion ERROR on (result)
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
...
007 (15119687.000.000) 07/03 09:18:12 Shadow exception!
        Assertion ERROR on (result)
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
...
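
Until a shadow-exception count is available in the job ClassAd, the count can be recovered from the user log itself. The following is a minimal sketch, assuming the log keeps the layout shown above (a three-digit event code, the job id in parentheses, then date, time and message); the function name and the regular expression are illustrative and not part of Condor.

```python
import re

# Event header lines in the log above look like:
#   007 (15119687.000.000) 07/03 04:52:12 Shadow exception!
# This pattern is an assumption based on that sample, not a Condor API.
EVENT_RE = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\d\d/\d\d) (\d\d:\d\d:\d\d) (.*)$"
)

def count_shadow_exceptions(log_text):
    """Return {"cluster.proc": number of 007 (Shadow exception) events}."""
    counts = {}
    for line in log_text.splitlines():
        m = EVENT_RE.match(line)
        if m and m.group(1) == "007":
            job = f"{int(m.group(2))}.{int(m.group(3))}"
            counts[job] = counts.get(job, 0) + 1
    return counts
```

Calling count_shadow_exceptions(open(path).read()) on a .condor_log file gives a per-job tally that a cron script could compare against a threshold before running condor_hold.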


Here is an example of strace output on this job's shadow process:

# strace -p 7438
Process 7438 attached - interrupt to quit
lstat("/mnt/qfs/dietz/Work/E_ExtTrig/E200_Runs/E231_070201/InjectionV3/InjectionSet3/GRB854378604/injections21/H1-TRIGBANK_H1H2_2189-854377580-2048.xml", {st_mode=S_IFREG|0644, st_size=13760, ...}) = 0
stat("/mnt/qfs/dietz/Work/E_ExtTrig/E200_Runs/E231_070201/InjectionV3/InjectionSet3/GRB854378604/injections21/H1-TRIGBANK_H1H2_2172-854377580-2048.xml", {st_mode=S_IFREG|0644, st_size=14500, ...}) = 0
lstat("/mnt/qfs/dietz/Work/E_ExtTrig/E200_Runs/E231_070201/InjectionV3/InjectionSet3/GRB854378604/injections21/H1-TRIGBANK_H1H2_2172-854377580-2048.xml", {st_mode=S_IFREG|0644, st_size=14500, ...}) = 0
stat("/mnt/qfs/dietz/Work/E_ExtTrig/E200_Runs/E231_070201/InjectionV3/InjectionSet3/GRB854378604/injections21/H2-TRIGBANK_H1H2_2189-854377580-2048.xml", {st_mode=S_IFREG|0644, st_size=14504, ...}) = 0
lstat("/mnt/qfs/dietz/Work/E_ExtTrig/E200_Runs/E231_070201/InjectionV3/InjectionSet3/GRB854378604/injections21/H2-TRIGBANK_H1H2_2189-854377580-2048.xml", {st_mode=S_IFREG|0644, st_size=14504, ...}) = 0
stat("/mnt/qfs/dietz/Work/E_ExtTrig/E200_Runs/E231_070201/InjectionV3/InjectionSet3/GRB854378604/injections21/H1-TRIGBANK_H1H2_2165-854377580-2048.xml", {st_mode=S_IFREG|0644, st_size=13022, ...}) = 0
lstat("/mnt/qfs/dietz/Work/E_ExtTrig/E200_Runs/E231_070201/InjectionV3/InjectionSet3/GRB854378604/injections21/H1-TRIGBANK_H1H2_2165-854377580-2048.xml", {st_mode=S_IFREG|0644, st_size=13022, ...}) = 0
stat("/mnt/qfs/dietz/Work/E_ExtTrig/E200_Runs/E231_070201/InjectionV3/InjectionSet3/GRB854378604/injections21/H2-TRIGBANK_H1H2_2165-854377580-2048.xml", {st_mode=S_IFREG|0644, st_size=13765, ...}) = 0
lstat("/mnt/qfs/dietz/Work/E_ExtTrig/E200_Runs/E231_070201/InjectionV3/InjectionSet3/GRB854378604/injections21/H2-TRIGBANK_H1H2_2165-854377580-2048.xml", {st_mode=S_IFREG|0644, st_size=13765, ...}) = 0
...
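
To quantify how heavy the stat() traffic is, a captured strace log can be tallied by call type and by directory. This is a rough sketch that assumes the strace lines have the form shown above (call name, then the quoted path as the first argument); the helper name is hypothetical.

```python
import os
import re
from collections import Counter

# Matches lines such as: stat("/some/path", {...}) = 0  or  lstat("/some/path", ...
STAT_RE = re.compile(r'^(l?stat)\("([^"]+)"')

def tally_stat_calls(strace_text):
    """Count stat/lstat calls by call name and by parent directory."""
    by_call = Counter()
    by_dir = Counter()
    for line in strace_text.splitlines():
        m = STAT_RE.match(line)
        if not m:
            continue  # skip the attach banner, "..." and other syscalls
        call, path = m.groups()
        by_call[call] += 1
        by_dir[os.path.dirname(path)] += 1
    return by_call, by_dir
```

If nearly all calls land in a single directory, as in the excerpt above where every path is under the job's Iwd, that points at a directory scan rather than at access to individual files.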


Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson


===========================================================================
Date of creation: Tue Jul  3 11:27:37 2007 (1183480061)
Subject: Actions

Assigned to danb by danb
===========================================================================
Date of actions: Tue Jul  3 13:25:04 2007 (1183487105)
Date: Tue, 03 Jul 2007 17:45:24 -0500
From: Dan Bradley <danb__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #15795] LIGO: Excessive Shadow stat() calls


The only statting I have found so far while looking through the vanilla 
shadow code happens on shadow startup and when trying to reconnect to 
the starter (and this statting happens in the job's initial working 
directory, which is consistent with the strace that you have sent).  Are 
these shadows continually failing and reconnecting to the starter by any 
chance?  What does the shadow log look like for one of these shadows?

--Dan
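
One way to see whether the shadows are failing on a regular cycle is to compute the gaps between successive "Shadow exception!" timestamps from the user log. A small sketch, assuming all timestamps fall on the same day and are passed in log order; the function name is illustrative only.

```python
from datetime import datetime

def exception_gaps_minutes(times):
    """times: 'HH:MM:SS' strings from one day, in log order.

    Returns the gap in minutes between each consecutive pair.
    """
    parsed = [datetime.strptime(t, "%H:%M:%S") for t in times]
    return [
        round((later - earlier).total_seconds() / 60, 1)
        for earlier, later in zip(parsed, parsed[1:])
    ]
```

Feeding it the 007 event times from the .condor_log file shows whether the failures cluster around a fixed interval, which could then be compared against the shadow log and the JobLeaseDuration in the job ClassAd.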

> pt/lscsoft/glue/lib64/python2.4/site-packages:/var/tmp/glue-1.13-root/opt/lscsoft/glue/lib/python2.4/site-packages:/opt/lscsoft/libframe/lib64/python:/opt/lscsoft/libmetaio/lib64/python:/ldcg/stow_pkgs/ldg-4.4/ldg/ldg-server/lib64/python:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/globus/lib64/python:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/globus/lib/python:/ldcg/pacman/stow_pkgs/pacman-3.19/src: SSH_CONNECTION=131.251.40.230' '22760' '131.215.114.6' '22 LSC_DATAGRID_SERVER_LOCATION=/ldcg/ldg GLOBUS_MYSQL_PATH=/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/mysql CLASSPATH=/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/vds/lib/gvds.jar:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/vds/lib/xmldb.jar:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/vds/lib/xmlParserAPIs.jar:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/vds/lib/cryptix-asn1.jar:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/vds/lib/commons-pool.jar:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/vds/lib/postgresql-8.1dev-400.jdbc3.jar:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/vds/lib/mysql-connector-java-3.0.11-stable-bin.jar:/ldcg/stow_pkgs/ld!
> g-4.4/ldg/vdt/vds/lib/jlinker.jar:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/vds/
>lib/jce-jdk13-117.jar:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/vds/lib/resolver.jar:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/vds/lib/puretls.jar:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/vds/lib/junit.jar:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/vds/lib/xmlrpc.jar:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/vds/lib/cog-jglobus.jar:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/vds/lib/java-getopt-1.0.9.jar:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/vds/lib/exist.jar:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/vds/lib/xercesImpl.jar:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/vds/lib/loggerservice-stub.jar:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/vds/lib/jakarta-oro.jar:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/vds/lib/cryptix32.jar:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/vds/lib/rls.jar:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/vds/lib/log4j-1.2.8.jar:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/vds/lib/exist-optional.jar:/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/vds/lib/cryptix.jar:/archive/home/dietz/Install/vds/lib/gvds.jar:/archive/home/dietz/Install/vds/lib/rls.jar:/archive/home/dietz/Install/vds/lib/exist.jar:/archive/home/d!
> ietz/Install/vds/lib/commons-pool.jar:/archive/home/dietz/Install/vds/lib/jlinker.jar:/archive/home/dietz/Install/vds/lib/puretls.jar:/archive/home/dietz/Install/vds/lib/cog-jglobus.jar:/archive/home/dietz/Install/vds/lib/xmlrpc.jar:/archive/home/dietz/Install/vds/lib/xmlParserAPIs.jar:/archive/home/dietz/Install/vds/lib/cryptix-asn1.jar:/archive/home/dietz/Install/vds/lib/xmldb.jar:/archive/home/dietz/Install/vds/lib/resolver.jar:/archive/home/dietz/Install/vds/lib/cryptix32.jar:/archive/home/dietz/Install/vds/lib/postgresql-8.1dev-400.jdbc3.jar:/archive/home/dietz/Install/vds/lib/mysql-connector-java-3.0.11-stable-bin.jar:/archive/home/dietz/Install/vds/lib/java-getopt-1.0.9.jar:/archive/home/dietz/Install/vds/lib/log4j-1.2.8.jar:/archive/home/dietz/Install/vds/lib/cryptix.jar:/archive/home/dietz/Install/vds/lib/exist-optional.jar:/archive/home/dietz/Install/vds/lib/xercesImpl.jar:/archive/home/dietz/Install/vds/lib/jce-jdk13-117.jar:/archive/home/dietz/Install/vds/lib/ja!
> karta-oro.jar:/archive/home/dietz/Install/vds/lib/loggerservice-stub.j
>ar:/archive/home/dietz/Install/vds/lib/junit.jar LESSOPEN=|/usr/bin/lesspipe.sh' '%s LDG_DIRECTORY=/ldcg/stow_pkgs/ldg-4.4/ldg/ldg-server PKG_CONFIG_PATH=/archive/home/dietz/Install/LAL//lib/pkgconfig:/opt/lscsoft/libframe/lib64/pkgconfig:/opt/lscsoft/libmetaio/lib64/pkgconfig:/opt/lscsoft/framecpp/lib64/pkgconfig:/opt/lscsoft/dol/lib64/pkgconfig:/opt/lscsoft/root/lib64/pkgconfig:/opt/lscsoft/libframe/lib64/pkgconfig:/opt/lscsoft/libmetaio/lib64/pkgconfig:/opt/lscsoft/framecpp/lib64/pkgconfig:/opt/lscsoft/dol/lib64/pkgconfig:/opt/lscsoft/root/lib64/pkgconfig: SHLIB_PATH=/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/globus/lib:/ldcg/ldg/vdt/globus/lib VDT_POSTINSTALL_README=/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/post-install/README DISPLAY=localhost:10.0 GLITE_LOCATION=/ldcg/stow_pkgs/ldg-4.4/ldg/vdt/glite G_BROKEN_FILENAMES=1 PACMAN_LOCATION=/ldcg/pacman/stow_pkgs/pacman-3.19"
>WantRemoteIO = FALSE
>UserLog = "/mnt/qfs/dietz/Work/E_ExtTrig/E200_Runs/E231_070201/InjectionV3/InjectionSet3/GRB854378604/injections21/.condor_log.4632"
>KillSig = "SIGTERM"
>In = "/dev/null"
>TransferIn = FALSE
>Out = ".condor_out.4632"
>StreamOut = FALSE
>Err = ".condor_error.4632"
>StreamErr = FALSE
>ShouldTransferFiles = "NO"
>TransferFiles = "NEVER"
>Requirements = (Arch == "X86_64") && (OpSys == "LINUX") && (Disk >= DiskUsage) && ((Memory * 1024) >= ImageSize) && (TARGET.FileSystemDomain == MY.FileSystemDomain)
>FileSystemDomain = "ligo"
>PeriodicHold = (JobRunCount > 1000)
>PeriodicRelease = FALSE
>PeriodicRemove = FALSE
>OnExitHold = FALSE
>OnExitRemove = TRUE
>LeaveJobInQueue = FALSE
>Arguments = ""
>
>
>Here is the UserLog file,
>cat /home//dietz/Work/E_ExtTrig/E200_Runs/E231_070201/InjectionV3/InjectionSet3/GRB854378604/injections21/.condor_log.4632
>000 (15119687.000.000) 07/03 04:19:32 Job submitted from host: <10.14.0.12:44955>
>...
>007 (15119687.000.000) 07/03 04:52:12 Shadow exception!
>        Assertion ERROR on (result)
>        0  -  Run Bytes Sent By Job
>        0  -  Run Bytes Received By Job
>...
>007 (15119687.000.000) 07/03 06:00:59 Shadow exception!
>        Assertion ERROR on (result)
>        0  -  Run Bytes Sent By Job
>        0  -  Run Bytes Received By Job
>...
>007 (15119687.000.000) 07/03 06:34:48 Shadow exception!
>        Assertion ERROR on (result)
>        0  -  Run Bytes Sent By Job
>        0  -  Run Bytes Received By Job
>...
>007 (15119687.000.000) 07/03 07:08:09 Shadow exception!
>        Assertion ERROR on (result)
>        0  -  Run Bytes Sent By Job
>        0  -  Run Bytes Received By Job
>...
>007 (15119687.000.000) 07/03 07:31:59 Shadow exception!
>        Assertion ERROR on (result)
>        0  -  Run Bytes Sent By Job
>        0  -  Run Bytes Received By Job
>...
>007 (15119687.000.000) 07/03 08:02:11 Shadow exception!
>        Assertion ERROR on (result)
>        0  -  Run Bytes Sent By Job
>        0  -  Run Bytes Received By Job
>...
>007 (15119687.000.000) 07/03 08:32:51 Shadow exception!
>        Assertion ERROR on (result)
>        0  -  Run Bytes Sent By Job
>        0  -  Run Bytes Received By Job
>...
>007 (15119687.000.000) 07/03 09:18:12 Shadow exception!
>        Assertion ERROR on (result)
>        0  -  Run Bytes Sent By Job
>        0  -  Run Bytes Received By Job
>...
>
>
>Here is an example of strace output on this job's shadow process:
>
># strace -p 7438
>Process 7438 attached - interrupt to quit
>lstat("/mnt/qfs/dietz/Work/E_ExtTrig/E200_Runs/E231_070201/InjectionV3/InjectionSet3/GRB854378604/injections21/H1-TRIGBANK_H1H2_2189-854377580-2048.xml", {st_mode=S_IFREG|0644, st_size=13760, ...}) = 0
>stat("/mnt/qfs/dietz/Work/E_ExtTrig/E200_Runs/E231_070201/InjectionV3/InjectionSet3/GRB854378604/injections21/H1-TRIGBANK_H1H2_2172-854377580-2048.xml", {st_mode=S_IFREG|0644, st_size=14500, ...}) = 0
>lstat("/mnt/qfs/dietz/Work/E_ExtTrig/E200_Runs/E231_070201/InjectionV3/InjectionSet3/GRB854378604/injections21/H1-TRIGBANK_H1H2_2172-854377580-2048.xml", {st_mode=S_IFREG|0644, st_size=14500, ...}) = 0
>stat("/mnt/qfs/dietz/Work/E_ExtTrig/E200_Runs/E231_070201/InjectionV3/InjectionSet3/GRB854378604/injections21/H2-TRIGBANK_H1H2_2189-854377580-2048.xml", {st_mode=S_IFREG|0644, st_size=14504, ...}) = 0
>lstat("/mnt/qfs/dietz/Work/E_ExtTrig/E200_Runs/E231_070201/InjectionV3/InjectionSet3/GRB854378604/injections21/H2-TRIGBANK_H1H2_2189-854377580-2048.xml", {st_mode=S_IFREG|0644, st_size=14504, ...}) = 0
>stat("/mnt/qfs/dietz/Work/E_ExtTrig/E200_Runs/E231_070201/InjectionV3/InjectionSet3/GRB854378604/injections21/H1-TRIGBANK_H1H2_2165-854377580-2048.xml", {st_mode=S_IFREG|0644, st_size=13022, ...}) = 0
>lstat("/mnt/qfs/dietz/Work/E_ExtTrig/E200_Runs/E231_070201/InjectionV3/InjectionSet3/GRB854378604/injections21/H1-TRIGBANK_H1H2_2165-854377580-2048.xml", {st_mode=S_IFREG|0644, st_size=13022, ...}) = 0
>stat("/mnt/qfs/dietz/Work/E_ExtTrig/E200_Runs/E231_070201/InjectionV3/InjectionSet3/GRB854378604/injections21/H2-TRIGBANK_H1H2_2165-854377580-2048.xml", {st_mode=S_IFREG|0644, st_size=13765, ...}) = 0
>lstat("/mnt/qfs/dietz/Work/E_ExtTrig/E200_Runs/E231_070201/InjectionV3/InjectionSet3/GRB854378604/injections21/H2-TRIGBANK_H1H2_2165-854377580-2048.xml", {st_mode=S_IFREG|0644, st_size=13765, ...}) = 0
>...
>
>
>Thanks.
>
>  
>



===========================================================================
Date mail was appended: Tue Jul  3 17:46:30 2007 (1183502805)
Date: Tue, 3 Jul 2007 16:46:54 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: espinoza_e__AT__ligo.caltech.edu, skoranda__AT__gravity.phys.uwm.edu
Subject: Re: [condor-admin #15795] LIGO: Excessive Shadow stat() calls

On Tue, Jul 03, 2007 at 05:46:30PM -0500, condor-admin response tracking system wrote:
> 
> The only statting I have found so far while looking through the vanilla 
> shadow code happens on shadow startup and when trying to reconnect to 
> the starter (and this statting happens in the job's initial working 
> directory, which is consistent with the strace that you have sent).  Are 
> these shadows continually failing and reconnecting to the starter by any 
> chance?  What does the shadow log look like for one of these shadows?
> 


The Shadow Exceptions appear to be correlated with file locking problems,
e.g., from the ShadowLog:

7/3 07:47:44 Initializing a VANILLA shadow for job 15119687.0
7/3 07:47:44 (15119687.0) (8498): Request to run on <10.14.2.31:40863> was ACCEPTED
7/3 08:02:11 (15119687.0) (8498): condor_write(): Socket closed when trying to write 4096 bytes to <10.14.2.31:40863>, fd is 7
7/3 08:02:11 (15119687.0) (8498): Buf::write(): condor_write() failed
7/3 08:02:11 (15119687.0) (8498): ERROR "Assertion ERROR on (result)" at line 233 in file NTreceivers.C
7/3 08:02:11 (15119687.0) (8498): FileLock::obtain(1) failed - errno 37 (No locks available)
7/3 08:17:15 Initializing a VANILLA shadow for job 15119687.0
7/3 08:17:15 (15119687.0) (22473): Request to run on <10.14.1.92:42375> was ACCEPTED
7/3 08:32:51 (15119687.0) (22473): condor_write(): Socket closed when trying to write 4096 bytes to <10.14.1.92:42375>, fd is 7
7/3 08:32:51 (15119687.0) (22473): Buf::write(): condor_write() failed
7/3 08:32:51 (15119687.0) (22473): ERROR "Assertion ERROR on (result)" at line 233 in file NTreceivers.C
7/3 08:32:51 (15119687.0) (22473): FileLock::obtain(1) failed - errno 37 (No locks available)


-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Tue Jul  3 18:47:25 2007 (1183506447)
Date: Thu, 05 Jul 2007 16:06:40 -0500
From: Dan Bradley <danb__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #15795] LIGO: Excessive Shadow stat() calls



condor-admin response tracking system wrote:

>On Tue, Jul 03, 2007 at 05:46:30PM -0500, condor-admin response tracking system wrote:
>  
>
>>The only statting I have found so far while looking through the vanilla 
>>shadow code happens on shadow startup and when trying to reconnect to 
>>the starter (and this statting happens in the job's initial working 
>>directory, which is consistent with the strace that you have sent).  Are 
>>these shadows continually failing and reconnecting to the starter by any 
>>chance?  What does the shadow log look like for one of these shadows?
>>
>>    
>>
>
>
>The Shadow Exceptions appear to be correlated with file locking problems,
>e.g., from the ShadowLog:
>
>7/3 07:47:44 Initializing a VANILLA shadow for job 15119687.0
>7/3 07:47:44 (15119687.0) (8498): Request to run on <10.14.2.31:40863> was ACCEPTED
>7/3 08:02:11 (15119687.0) (8498): condor_write(): Socket closed when trying to write 4096 bytes to <10.14.2.31:40863>, fd is 7
>7/3 08:02:11 (15119687.0) (8498): Buf::write(): condor_write() failed
>7/3 08:02:11 (15119687.0) (8498): ERROR "Assertion ERROR on (result)" at line 233 in file NTreceivers.C
>7/3 08:02:11 (15119687.0) (8498): FileLock::obtain(1) failed - errno 37 (No locks available)
>7/3 08:17:15 Initializing a VANILLA shadow for job 15119687.0
>7/3 08:17:15 (15119687.0) (22473): Request to run on <10.14.1.92:42375> was ACCEPTED
>7/3 08:32:51 (15119687.0) (22473): condor_write(): Socket closed when trying to write 4096 bytes to <10.14.1.92:42375>, fd is 7
>7/3 08:32:51 (15119687.0) (22473): Buf::write(): condor_write() failed
>7/3 08:32:51 (15119687.0) (22473): ERROR "Assertion ERROR on (result)" at line 233 in file NTreceivers.C
>7/3 08:32:51 (15119687.0) (22473): FileLock::obtain(1) failed - errno 37 (No locks available)
>  
>

These errors look like what I would expect to see if the job "user log" 
is in NFS and you have not configured IGNORE_NFS_LOCK_ERRORS=True.  
Since the user is using condor_run, the user log will be in the initial 
working directory.
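
For anyone decoding the ShadowLog lines quoted above: on Linux, errno 37 
is ENOLCK, which fcntl(2) lists for lock-table exhaustion and for remote 
locking failures such as locking over NFS.  A minimal Python sketch for 
looking up such codes follows; the helper name describe_errno is made up 
for illustration, and the numeric value 37 is Linux-specific.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic_name, message) for a numeric errno value."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


if __name__ == "__main__":
    # 37 is the value reported in the ShadowLog; on Linux it maps to ENOLCK.
    print(describe_errno(37))
```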

As for the problem with the shadow doing lots of stat operations, from 
the shadow log you sent, I would expect each shadow to only do one such 
sweep through the job's initial working directory, statting all of the 
files in it.  Do you think it is likely that each shadow is doing only 
one such operation, or is it your impression that each shadow is doing 
this more frequently?

--Dan


===========================================================================
Date mail was appended: Thu Jul  5 16:06:53 2007 (1183669614)
Date: Thu, 5 Jul 2007 14:30:39 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: espinoza_e__AT__ligo.caltech.edu, skoranda__AT__gravity.phys.uwm.edu
Subject: Re: [condor-admin #15795] LIGO: Excessive Shadow stat() calls

On Thu, Jul 05, 2007 at 04:06:53PM -0500, condor-admin response tracking system wrote:
> 
> These errors look like what I would expect to see if the job "user log" 
> is in NFS and you have not configured IGNORE_NFS_LOCK_ERRORS=True.  

Indeed, we are running with the default setting of False for this; in
fact, I did not know that this knob existed. However, I was able to
stop the Shadow FileLock::obtain(1) errors (at least for now) by
increasing the NFS server value of LOCKD_LISTEN_BACKLOG from the default
of 32 to 128. Note that we already had LOCKD_SERVERS increased from the
default of 20 to 200.

> Since the user is using condor_run, the user log will be in the initial 
> working directory.
> 
> As for the problem with the shadow doing lots of stat operations, from 
> the shadow log you sent, I would expect each shadow to only do one such 
> sweep through the job's initial working directory, statting all of the 
> files in it.  Do you think it is likely that each shadow is doing only 
> one such operation, or is it your impression that each shadow is doing 
> this more frequently?

It looks like just two calls per file: lstat() followed by stat(). However,
I did not realize it does this, and this explains other problems we have had.
In particular, we have some users with a very large number of files in one
directory (100k+).  While we have discouraged this for other performance
reasons, e.g., /bin/ls, I did not realize that the Condor Shadow process
runs stat() on every file in the CWD. We may need to institute a stricter
policy on the users; however, before we do that, why does the shadow need
to stat() every file in the Vanilla universe?  Does it also do this in the
other universes, at least Standard, Local, Scheduler, and Parallel?
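
One way to check the "two calls per file" observation is to tally the
calls in a captured strace log. The sketch below is only illustrative:
the function name count_stat_calls and the sample paths are invented, and
it assumes strace's one-call-per-line output as shown earlier in this
ticket.

```python
import re
from collections import Counter

# Matches lines such as: lstat("/path/file.xml", {...}) = 0
STAT_LINE = re.compile(r'^(lstat|stat)\("([^"]+)"')


def count_stat_calls(lines):
    """Return {path: Counter({'lstat': n, 'stat': m})} for strace output lines."""
    calls = {}
    for line in lines:
        match = STAT_LINE.match(line.strip())
        if match:
            syscall, path = match.groups()
            calls.setdefault(path, Counter())[syscall] += 1
    return calls


if __name__ == "__main__":
    # Hypothetical sample lines in the same shape as the strace excerpt.
    sample = [
        'lstat("/tmp/example/a.xml", {st_mode=S_IFREG|0644, st_size=1, ...}) = 0',
        'stat("/tmp/example/a.xml", {st_mode=S_IFREG|0644, st_size=1, ...}) = 0',
        'lstat("/tmp/example/b.xml", {st_mode=S_IFREG|0644, st_size=2, ...}) = 0',
    ]
    for path, counter in count_stat_calls(sample).items():
        print(path, dict(counter))
```

With the 100k+ files per directory mentioned above, two calls per file
works out to at least 200,000 system calls each time a shadow sweeps the
directory.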

Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Thu Jul  5 16:31:00 2007 (1183671061)
Date: Thu, 05 Jul 2007 16:45:27 -0500
From: Dan Bradley <danb__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #15795] LIGO: Excessive Shadow stat() calls


>I did not realize that the Condor Shadow process
>runs stat() on every file in the CWD. We may need to institute a stricter
>policy on the users; however, before we do that, why does the shadow need
>to stat() every file in the Vanilla universe?  Does it also do this in the
>other universes, at least Standard, Local, Scheduler, and Parallel?
>  
>

Starting in Condor 6.8.3, Condor's file transfer mechanism builds a 
catalog of all the files in the working directory.  I see no reason why 
this needs to be happening in the shadow at all (and in your specific 
case, it especially doesn't need to be happening, because these jobs are 
not even using file transfers).  I will bring this issue up and try to 
get it resolved.

From my reading of the code, standard, local, and scheduler universe 
should not be affected.  Vanilla and parallel universe shadows are affected.

--Dan


===========================================================================
Date mail was appended: Thu Jul  5 16:45:38 2007 (1183671939)
Date: Thu, 5 Jul 2007 15:14:51 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: espinoza_e__AT__ligo.caltech.edu, skoranda__AT__gravity.phys.uwm.edu
Subject: Re: [condor-admin #15795] LIGO: Excessive Shadow stat() calls

On Thu, Jul 05, 2007 at 04:45:38PM -0500, condor-admin response tracking system wrote:
> 
> Starting in Condor 6.8.3, Condor's file transfer mechanism builds a 
> catalog of all the files in the working directory.  I see no reason why 
> this needs to be happening in the shadow at all (and in your specific 
> case, it especially doesn't need to be happening, because these jobs are 
> not even using file transfers).  I will bring this issue up and try to 
> get it resolved.
> 
> From my reading of the code, standard, local, and scheduler universe 
> should not be affected.  Vanilla and parallel universe shadows are affected.
> 

Thanks for the update. Hopefully there is a simple way to have Condor not
build the CWD file catalog except when needed.
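
To make the request concrete, here is a rough sketch of the kind of guard
being asked for. It is not Condor's code and says nothing about how any
actual patch works: the function names and the dictionary standing in for
the job ad are hypothetical, and only the attribute ShouldTransferFiles
and its value "NO" come from the job classad in this ticket.

```python
import os
import tempfile


def needs_file_catalog(job_ad):
    """Hypothetical policy: a catalog is only useful when files are transferred."""
    return str(job_ad.get("ShouldTransferFiles", "NO")).upper() != "NO"


def build_file_catalog(directory):
    """Map each entry name to (mtime, size); costs one lstat() per entry."""
    catalog = {}
    for name in os.listdir(directory):
        info = os.lstat(os.path.join(directory, name))
        catalog[name] = (info.st_mtime, info.st_size)
    return catalog


def maybe_build_file_catalog(job_ad, directory):
    """Skip the directory sweep entirely for jobs that do not transfer files."""
    if not needs_file_catalog(job_ad):
        return None
    return build_file_catalog(directory)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        with open(os.path.join(workdir, "a.txt"), "w") as handle:
            handle.write("abc")
        print(maybe_build_file_catalog({"ShouldTransferFiles": "NO"}, workdir))
        print(maybe_build_file_catalog({"ShouldTransferFiles": "YES"}, workdir))
```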

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Thu Jul  5 17:15:15 2007 (1183673716)
Date: Mon, 09 Jul 2007 16:33:59 -0500
From: Dan Bradley <dan__AT__hep.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #15795] LIGO: Excessive Shadow stat() calls


> Thanks for the update. Hopefully there is a simple way to have Condor not
> build the CWD file catalog except when needed.
>   

A patch for this problem is under review now.  We hope to release it as 
part of 6.8.6 and 6.9.4.

--Dan


===========================================================================
Date mail was appended: Mon Jul  9 16:46:07 2007 (1184017568)
Subject: Actions

Ticket resolved by danb
===========================================================================
Date of actions: Mon Jul  9 16:46:07 2007 (1184017569)
Subject: Actions

Ticket was reopened by mailnull
===========================================================================
Date of actions: Mon Jul  9 17:20:41 2007 (1184019643)
Date: Mon, 9 Jul 2007 15:20:00 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: espinoza_e__AT__ligo.caltech.edu, skoranda__AT__gravity.phys.uwm.edu
Subject: Re: [condor-admin #15795] LIGO: Excessive Shadow stat() calls

On Mon, Jul 09, 2007 at 04:34:08PM -0500, condor-admin response tracking system wrote:
> 
> 
> > Thanks for the update. Hopefully there is a simple way to have Condor not
> > build the CWD file catalog except when needed.
> >   
> 
> A patch for this problem is under review now.  We hope to release it as 
> part of 6.8.6 and 6.9.4.
> 

Fantastic. With this patch, under what circumstances will a file catalog
still be built, by which daemons, and under the control of which Condor
configuration settings?

P.S. In the instances when it is still desirable to build a file catalog,
is it really necessary to call lstat() before stat() for every file?
If Condor's file transfer mechanism really needs to know the stat
structure for symbolic link files themselves, perhaps it makes sense to
optimize the case where there are also some non-symbolic link files, i.e.,
call stat() and only follow up with lstat() if st_mode matches S_IFLNK.

Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Mon Jul  9 17:20:41 2007 (1184019643)
Date: Mon, 09 Jul 2007 17:59:01 -0500
From: Dan Bradley <danb__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #15795] LIGO: Excessive Shadow stat() calls



> Fantastic. With this patch, under what circumstances will a file catalog
> still be built, by which daemons, and under the control of which Condor
> configuration settings?
>   

I believe there will be no configuration knobs and that the file catalog 
will only be instantiated by the condor starter.  I will confirm with 
the author of the patch.

> P.S. In the instances when it is still desirable to build a file catalog
> is it really necessary to call lstat() before stat() for every file?
> If Condor's file transfer mechanism really needs to know the stat
> structure for symbolic link files themselves perhaps it makes sense to
> optimize the case where there are also some non-symbolic link files, i.e.,
> call stat() and only follow up with lstat() if st_mode matches S_IFLNK.
>   

For a symbolic link, the st_mode returned by stat() doesn't tell you 
whether the file is a symbolic link or not, so I don't see any way 
around making both calls for each file.
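
This behaviour is easy to confirm.  The Python sketch below (the helper 
name and the temporary file names are only for illustration) creates a 
regular file and a symbolic link to it, then compares what stat() and 
lstat() report.  stat() follows the link and reports the target's type, 
so only lstat() can reveal that the entry is a link.

```python
import os
import stat
import tempfile


def link_flags(path):
    """Return (seen_by_stat, seen_by_lstat): whether each call reports a symlink."""
    return (stat.S_ISLNK(os.stat(path).st_mode),
            stat.S_ISLNK(os.lstat(path).st_mode))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        target = os.path.join(workdir, "data.xml")
        link = os.path.join(workdir, "data-link.xml")
        with open(target, "w") as handle:
            handle.write("example")
        os.symlink(target, link)
        print("regular file: ", link_flags(target))
        print("symbolic link:", link_flags(link))
```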

--Dan


===========================================================================
Date mail was appended: Mon Jul  9 17:59:07 2007 (1184021948)
Date: Mon, 9 Jul 2007 19:50:10 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: espinoza_e__AT__ligo.caltech.edu, skoranda__AT__gravity.phys.uwm.edu,
 dan__AT__hep.wisc.edu
Subject: Re: [condor-admin #15795] LIGO: Excessive Shadow stat() calls

On Mon, Jul 09, 2007 at 05:59:07PM -0500, condor-admin response tracking system wrote:
> 
> 
> > Fantastic. With this patch, under what circumstances will a file catalog
> > still be built, by which daemons, and under the control of which Condor
> > configuration settings?
> >   
> 
> I believe there will be no configuration knobs and that the file catalog 
> will only be instantiated by the condor starter.  I will confirm with 
> the author of the patch.

Under what circumstances will the starter do this? I am still concerned
about the performance impact for users running in large directories.

> 
> > P.S. In the instances when it is still desirable to build a file catalog
> > is it really necessary to call lstat() before stat() for every file?
> > If Condor's file transfer mechanism really needs to know the stat
> > structure for symbolic link files themselves perhaps it makes sense to
> > optimize the case where there are also some non-symbolic link files, i.e.,
> > call stat() and only follow up with lstat() if st_mode matches S_IFLNK.
> >   
> 
> For a symbolic link, the st_mode returned by stat() doesn't tell you 
> whether the file is a symbolic link or not, so I don't see any way 
> around making both calls for each file.
> 

My mistake, but if you will allow me one more question: I am still confused
about why a follow-up call to stat() is needed after lstat() on a regular file.
I would have thought that in this case the initial lstat() call has
already returned a stat structure with all the necessary information.
What extra information is returned by the second stat() call in this case?
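
The pattern behind this question can be sketched as follows: call lstat()
first, and issue the second stat() only when lstat() reports a symbolic
link. This is an illustration of the idea, not a description of Condor's
file transfer code; catalog_entry and its return shape are hypothetical.

```python
import os
import stat
import tempfile


def catalog_entry(path):
    """Return (is_symlink, stat_result, syscalls_used) using lstat() first."""
    info = os.lstat(path)
    if not stat.S_ISLNK(info.st_mode):
        # Not a link: lstat() and stat() describe the same inode, so one call suffices.
        return False, info, 1
    # A link: a second call is needed to learn about the link's target.
    # Note that os.stat() raises an OSError if the link is dangling.
    return True, os.stat(path), 2


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        target = os.path.join(workdir, "data.xml")
        link = os.path.join(workdir, "data-link.xml")
        with open(target, "w") as handle:
            handle.write("abcd")
        os.symlink(target, link)
        for path in (target, link):
            is_link, info, calls = catalog_entry(path)
            print(os.path.basename(path), is_link, info.st_size, calls)
```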

Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Mon Jul  9 21:50:29 2007 (1184035829)
Subject: Actions

Assigned to zmiller by danb
===========================================================================
Date of actions: Tue Jul 10 15:57:16 2007 (1184101037)
Subject: Actions

Ticket resolved by tannenba
===========================================================================
Date of actions: Mon Sep 10 16:21:09 2007 (1189459270)