LIGO Support Ticket 18335

Ticket Information
  Number:      admin 18335
  User:        anderson@ligo.caltech.edu
  Email:       
  Status:      resolved
  Assigned To: wenger
Date: Wed, 13 Aug 2008 11:13:48 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu, Kent Wenger <wenger__AT__cs.wisc.edu>
Subject: LIGO: dagman memory usage
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu

I just noticed that the LIGO CIT condor pool has a condor_dagman process
running on one of the head nodes that is using 6 GByte of memory, i.e.,

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 5287 spxiwh    15   0 6105m 6.0g 1724 S  0.0 39.2   8:57.25 condor_dagman

This seems rather excessive given that the next largest dagman process
is using just 121MByte, i.e.,

 1438 jaredm    15   0 12760 2240 1748 S  0.0  0.0   0:00.26 condor_dagman
 5252 dietz     15   0 12648 2240 1780 S  0.0  0.0   0:00.20 condor_dagman
 5257 dietz     15   0 42160  23m 1764 S  0.0  0.2   0:00.70 condor_dagman
 5261 dietz     15   0 39960  15m 1704 S  0.0  0.1   0:02.55 condor_dagman
 5273 dietz     15   0  131m 121m 1788 S  0.0  0.8   0:47.79 condor_dagman
 5279 gjones    15   0 14872 4428 1784 S  0.0  0.0   0:03.76 condor_dagman
 5283 dietz     15   0  131m 121m 1788 S  0.0  0.8   0:49.88 condor_dagman
 5287 spxiwh    15   0 6106m 6.0g 1724 S  0.0 39.3   8:57.44 condor_dagman
 5289 dietz     15   0  131m 121m 1788 S  0.0  0.8   0:47.99 condor_dagman
 5292 dietz     15   0 12644 2236 1780 S  0.0  0.0   0:00.26 condor_dagman
 5293 dietz     15   0  131m 121m 1788 S  0.0  0.8   0:49.96 condor_dagman
 5296 volodya   15   0 20540 9.8m 1740 S  0.0  0.1   0:08.11 condor_dagman
 5303 dietz     15   0 42160  24m 1764 S  0.0  0.2   0:00.65 condor_dagman
 9388 thorne    18   0 15688 6292 1788 S  0.0  0.0   0:03.91 condor_dagman
 9922 jaredm    15   0 12760 2216 1732 S  0.0  0.0   0:00.14 condor_dagman
15384 jaredm    15   0 12760 2264 1756 S  0.0  0.0   0:00.26 condor_dagman
16019 jaredm    15   0 12756 2264 1756 S  0.0  0.0   0:00.27 condor_dagman
18331 elr       15   0 13832 3372 1760 S  0.0  0.0   0:00.78 condor_dagman
20127 jaredm    15   0 12756 2252 1748 S  0.0  0.0   0:00.16 condor_dagman
32502 jaredm    15   0 12760 2264 1768 S  0.0  0.0   0:00.26 condor_dagman

This is on a 64-bit FC4 machine, running condor 7.0.3 with the 7.1.1 dag
binaries, i.e.,

[root@ldas-grid ~]# condor_version
$CondorVersion: 7.0.3 Jun 20 2008 BuildID: 91405 $
$CondorPlatform: X86_64-LINUX_RHEL3 $

[root@ldas-grid ~]# ident `which condor_dagman`
/usr/bin/condor_dagman:
     $CondorVersion: 7.1.1 Jul 14 2008 BuildID: 94553 $
     $CondorPlatform: X86_64-LINUX_RHEL3 $
     $Id: kdb5_err.et 13854 2001-10-25 20:20:57Z tlyu $
     $Id: krb5_err.et 16816 2004-10-13 16:18:27Z lxs $
     $Id: accept_sec_context.c,v 1.30.4.2 2007/05/17 15:38:32 bester Exp $
     $Id: acquire_cred.c,v 1.12.4.1 2007/05/17 15:38:32 bester Exp $
     $Id: compare_name.c,v 1.22.4.2 2005/07/13 20:17:52 mlink Exp $
     $Id: delete_sec_context.c,v 1.13 2005/04/15 23:37:16 meder Exp $
     $Id: display_name.c,v 1.10 2005/04/15 23:37:16 meder Exp $
     $Id: display_status.c,v 1.19 2005/04/15 23:37:16 meder Exp $
     $Id: import_name.c,v 1.15 2005/04/15 23:37:18 meder Exp $
     $Id: init_sec_context.c,v 1.31.4.4 2007/05/17 15:38:32 bester Exp $
     $Id: inquire_cred.c,v 1.10 2005/04/15 23:37:19 meder Exp $
     $Id: inquire_context.c,v 1.11 2005/04/15 23:37:19 meder Exp $
     $Id: oid_functions.c,v 1.13 2005/04/15 23:37:19 meder Exp $
     $Id: release_cred.c,v 1.5 2005/04/15 23:37:20 meder Exp $
     $Id: release_name.c,v 1.6 2005/04/15 23:37:20 meder Exp $
     $Id: unwrap.c,v 1.17 2005/04/15 23:37:20 meder Exp $
     $Id: verify_mic.c,v 1.12 2005/04/15 23:37:21 meder Exp $
     $Id: wrap.c,v 1.12 2005/04/15 23:37:21 meder Exp $
     $Id: release_buffer.c,v 1.3 2005/04/15 23:37:20 meder Exp $
     $Id: globus_i_gsi_gss_utils.c,v 1.38.4.1 2005/05/04 00:19:37 meder Exp $
     $Id: import_cred.c,v 1.19.4.1 2007/05/17 15:38:32 bester Exp $
     $Id: get_mic.c,v 1.7 2005/04/15 23:37:17 meder Exp $
     $GCBVersion: 1.5.0 $
     $GCBBuildDate: Jul  3 2008 $
     $BINDId: base64.c,v 8.7 1999/10/13 16:39:33 vixie Exp $


Here is the memory map of the dagman process:

[root@ldas-grid ~]# pmap 5287
5287:   condor_scheduniv_exec.33283134.0 -f -l . -Debug 3 -Lockfile ihope.dag.lock -AutoRescue 1 -DoRescueFrom 0 -Condorlog /mnt/qfs1/spxiwh/ihope/853000000-853010000/datafind/inspiral_hipe_datafind.DATAFIND.dag.dagman.log -Dag ihope.dag -CsdVersion $CondorVersion: 7.1.1 Jul 14 2008 BuildID: 94553 $
0000000000400000   5208K r-x--  /usr/bin/condor_dagman
0000000000a16000    184K rw---  /usr/bin/condor_dagman
0000000000a44000 6240720K rw---    [ anon ]
00002b9790557000     44K r-x--  /lib64/libnss_files-2.3.6.so
00002b9790562000   1020K -----  /lib64/libnss_files-2.3.6.so
00002b9790661000      8K rw---  /lib64/libnss_files-2.3.6.so
00002b9790663000   1208K r-x--  /lib64/libc-2.3.6.so
00002b9790791000   1020K -----  /lib64/libc-2.3.6.so
00002b9790890000     24K rw---  /lib64/libc-2.3.6.so
00002b9790896000     16K rw---    [ anon ]
00002b979089a000    104K r-x--  /lib64/ld-2.3.6.so
00002b97908b4000   1020K -----  /lib64/ld-2.3.6.so
00002b97909b3000      8K rw---  /lib64/ld-2.3.6.so
00002b97909b5000   3072K rw---    [ anon ]
00007fff1a55c000    100K rw---    [ stack ]
ffffffffff600000      4K r-x--    [ anon ]
 total          6253760K
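To see at a glance where the memory sits, the pmap rows can be sorted by size. The following Python sketch assumes the four-column layout shown above (address, size in K, permissions, mapping name); the function name largest_mappings is invented for the example. The large anonymous region directly after the binary's data segment is presumably the heap, though pmap itself does not label it as such.

```python
import re

# Matches pmap rows of the form: <hex address> <size>K <perms> <mapping name>
ROW = re.compile(r"^([0-9a-f]+)\s+(\d+)K\s+(\S+)\s+(.*)$")


def largest_mappings(pmap_text: str, n: int = 3):
    """Return the n largest mappings as (size_kib, perms, name) tuples."""
    rows = []
    for line in pmap_text.splitlines():
        m = ROW.match(line.strip())
        if m:  # the command line and the "total" line do not match
            rows.append((int(m.group(2)), m.group(3), m.group(4).strip()))
    return sorted(rows, reverse=True)[:n]


# A few rows copied from the map above.
sample = """\
0000000000400000   5208K r-x--  /usr/bin/condor_dagman
0000000000a16000    184K rw---  /usr/bin/condor_dagman
0000000000a44000 6240720K rw---    [ anon ]
00002b97909b5000   3072K rw---    [ anon ]
 total          6253760K
"""
top_mapping = largest_mappings(sample, 1)[0]
share = top_mapping[0] / 6253760  # fraction of the reported total
print(top_mapping, f"{share:.1%}")
```

On this sample a single anonymous mapping accounts for nearly the whole process, which points at heap growth rather than at shared libraries or the stack.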




Here is the dagman ClassAd:


-- Submitter: ldas-grid.ligo.caltech.edu : <10.14.0.12:46271> : ldas-grid.ligo.caltech.edu
MyType = "Job"
TargetType = "Machine"
ClusterId = 33283134
QDate = 1216900728
CompletionDate = 0
Owner = "spxiwh"
LocalUserCpu = 0.000000
LocalSysCpu = 0.000000
RemoteUserCpu = 0.000000
RemoteSysCpu = 0.000000
NumCkpts_RAW = 0
NumCkpts = 0
NumRestarts = 0
NumSystemHolds = 0
CommittedTime = 0
TotalSuspensions = 0
CumulativeSuspensionTime = 0
Notification = ERROR
WantBadgers = TRUE
JOB_LEASE_DURATION = 3600
copy_to_spool = TRUE
CondorVersion = "$CondorVersion: 7.0.3 Jun 20 2008 BuildID: 91405 $"
CondorPlatform = "$CondorPlatform: X86_64-LINUX_RHEL3 $"
RootDir = "/"
Iwd = "/mnt/qfs1/spxiwh/ihope/853000000-853010000"
JobUniverse = 7
Cmd = "/usr/bin/condor_dagman"
MinHosts = 1
MaxHosts = 1
WantRemoteSyscalls = FALSE
WantCheckpoint = FALSE
JobPrio = 0
User = "spxiwh@ligo"
NiceUser = FALSE
Env =3D "CLASSPATH=3D/ldcg/stow_pkgs/ldg-4.5/ldg/vdt/pegasus/lib/exist.jar:=
/ldcg/stow_pkgs/ldg-4.5/ldg/vdt/pegasus/lib/cog-jglobus.jar:/ldcg/stow_pkgs=
/ldg-4.5/ldg/vdt/pegasus/lib/xmlParserAPIs.jar:/ldcg/stow_pkgs/ldg-4.5/ldg/=
vdt/pegasus/lib/puretls.jar:/ldcg/stow_pkgs/ldg-4.5/ldg/vdt/pegasus/lib/glo=
bus_rls_client.jar:/ldcg/stow_pkgs/ldg-4.5/ldg/vdt/pegasus/lib/jakarta-oro.=
jar:/ldcg/stow_pkgs/ldg-4.5/ldg/vdt/pegasus/lib/xmlrpc.jar:/ldcg/stow_pkgs/=
ldg-4.5/ldg/vdt/pegasus/lib/jce-jdk13-125.jar:/ldcg/stow_pkgs/ldg-4.5/ldg/v=
dt/pegasus/lib/postgresql-8.1dev-400.jdbc3.jar:/ldcg/stow_pkgs/ldg-4.5/ldg/=
vdt/pegasus/lib/java-getopt-1.0.9.jar:/ldcg/stow_pkgs/ldg-4.5/ldg/vdt/pegas=
us/lib/exist-optional.jar:/ldcg/stow_pkgs/ldg-4.5/ldg/vdt/pegasus/lib/xerce=
sImpl.jar:/ldcg/stow_pkgs/ldg-4.5/ldg/vdt/pegasus/lib/mysql-connector-java-=
5.0.5-bin.jar:/ldcg/stow_pkgs/ldg-4.5/ldg/vdt/pegasus/lib/accessors.jar:/ld=
cg/stow_pkgs/ldg-4.5/ldg/vdt/pegasus/lib/resolver.jar:/ldcg/stow_pkgs/ldg-4=
.5/ldg/vdt/pegasus/lib/pegasus.jar:/ldcg/stow_pkgs/ldg-4.5/ldg/vdt/pegasus/=
lib/log4j-1.2.8.jar:/ldcg/stow_pkgs/ldg-4.5/ldg/vdt/pegasus/lib/cryptix32.j=
ar:/ldcg/stow_pkgs/ldg-4.5/ldg/vdt/pegasus/lib/commons-pool.jar:/ldcg/stow_=
pkgs/ldg-4.5/ldg/vdt/pegasus/lib/cryptix.jar:/ldcg/stow_pkgs/ldg-4.5/ldg/vd=
t/pegasus/lib/commons-logging.jar:/ldcg/stow_pkgs/ldg-4.5/ldg/vdt/pegasus/l=
ib/cryptix-asn1.jar:/ldcg/stow_pkgs/ldg-4.5/ldg/vdt/pegasus/lib/xmldb.jar:/=
ldcg/stow_pkgs/ldg-4.5/ldg/vdt/pegasus/lib/junit.jar:/ldcg/stow_pkgs/ldg-4.=
5/ldg/vdt/pegasus/lib/preservcsl.jar;SHLIB_PATH=3D/ldcg/stow_pkgs/ldg-4.5/l=
dg/vdt/globus/lib:/ldcg/ldg/vdt/globus/lib;LSCSOFT_LOCATION=3D/archive/home=
/spxiwh/opt/lscsoft;GLOBUS_OPTIONS=3D-Xmx512M;PAC_ANCHOR=3D/ldcg/stow_pkgs/=
ldg-4.5/ldg;LIGOVIRGOCVS=3D:pserver:spxiwh__AT__gravity.phys.uwm.edu:/usr/local/=
cvs/ligovirgo;SHLVL=3D2;PWD=3D/archive/home/spxiwh/ihope/853000000-85301000=
0;LSC_DATAGRID_SERVER_LOCATION=3D/ldcg/ldg;GRID_SECURITY_DIR=3D/ldcg/stow_p=
kgs/ldg-4.5/ldg/vdt/globus/etc;SSH_AUTH_SOCK=3D/tmp/ssh-oLnGx10135/agent.10=
135;VDT_LOCATION=3D/ldcg/stow_pkgs/ldg-4.5/ldg/vdt;SSH_CLIENT=3D131.251.46.=
194 43915 22;CVS_RSH=3Dssh;VDT_POSTINSTALL_README=3D/ldcg/stow_pkgs/ldg-4.5=
/ldg/vdt/post-install/README;PATH=3D/archive/home/spxiwh/lscsoft/executable=
s/cbc_s5_1yr_20070129/glue/bin:/archive/home/spxiwh/lscsoft/executables/cbc=
_s5_1yr_20070129/pylal/bin:/archive/home/spxiwh/lscsoft/executables/cbc_s5_=
1yr_20070129/lalapps//bin:/archive/home/spxiwh/lscsoft/executables/cbc_s5_1=
yr_20070129/lal//bin:/archive/home/spxiwh/lscsoft/executables/cbc_s5_1yr_20=
070129/glue/bin:/archive/home/spxiwh/lscsoft/executables/cbc_s5_1yr_2007012=
9/pylal/bin:/archive/home/spxiwh/lscsoft/executables/cbc_s5_1yr_20070129/la=
lapps//bin:/archive/home/spxiwh/lscsoft/executables/cbc_s5_1yr_20070129/lal=
//bin:/opt/lscsoft/lalapps/bin:/opt/lscsoft/lal/bin:/opt/lscsoft/glue/bin:/=
opt/lscsoft/libframe/bin:/opt/lscsoft/libmetaio/bin:/opt/lscsoft/framecpp/b=
in:/opt/lscsoft/dol/bin:/opt/lscsoft/root/bin:/ldcg/stow_pkgs/ldg-4.5/ldg/v=
dt/apache/bin:/ldcg/stow_pkgs/ldg-4.5/ldg/vdt/ant/bin:/ldcg/stow_pkgs/ldg-4=
.5/ldg/vdt/glite/sbin:/ldcg/stow_pkgs/ldg-4.5/ldg/vdt/glite/bin:/ldcg/stow_=
pkgs/ldg-4.5/ldg/vdt/pegasus/bin:/ldcg/stow_pkgs/ldg-4.5/ldg/vdt/pyglobus-u=
rl-copy/bin:/ldcg/stow_pkgs/ldg-4.5/ldg/vdt/unixodbc/bin:/ldcg/stow_pkgs/ld=
g-4.5/ldg/vdt/mysql/bin:/ldcg/stow_pkgs/ldg-4.5/ldg/vdt/edg/sbin:/ldcg/stow=
_pkgs/ldg-4.5/ldg/vdt/jdk1.5/bin:/ldcg/stow_pkgs/ldg-4.5/ldg/vdt/logrotate/=
sbin:/ldcg/stow_pkgs/ldg-4.5/ldg/vdt/gpt/sbin:/ldcg/stow_pkgs/ldg-4.5/ldg/v=
dt/globus/bin:/ldcg/stow_pkgs/ldg-4.5/ldg/vdt/globus/sbin:/ldcg/pacman/stow=
_pkgs/pacman-3.21/bin:/ldcg/stow_pkgs/ldg-4.5/ldg/vdt/vdt/sbin:/ldcg/stow_p=
kgs/ldg-4.5/ldg/vdt/vdt/bin:/ldcg/stow_pkgs/ldg-4.5/ldg/ldg-server/bin:/usr=
/kerberos/bin:/usr/bin:/bin:/usr/sbin:/sbin:/ldcg/ldg/vdt/globus/bin:/usr/X=
11R6/bin:/ligotools/bin:/ldcg/matlab_r2007a/bin:/archive/home/spxiwh/bin;SA=
SL_PATH=3D/ldcg/stow_pkgs/ldg-4.5/ldg/vdt/globus/lib/sasl;VDT_INSTALL_LOG=
=3Dvdt-install.log;CFLAGS=3D-g -O2;GLITE_LOCATION_LOG=3D/ldcg/stow_pkgs/ldg=
-4.5/ldg/vdt/glite/log;ROOTSYS=3D/opt/lscsoft/root;DYLD_LIBRARY_PATH=3D/arc=
hive/home/spxiwh/lscsoft/executables/cbc_s5_1yr_20070129/glue/lib64/python2=
.4/site-packages:/archive/home/spxiwh/lscsoft/executables/cbc_s5_1yr_200701=
29/pylal/lib64/python2.4/site-packages:/archive/home/spxiwh/lscsoft/executa=
bles/cbc_s5_1yr_20070129/lal//lib:/archive/home/spxiwh/lscsoft/executables/=
cbc_s5_1yr_20070129/glue/lib64/python2.4/site-packages:/archive/home/spxiwh=
/lscsoft/executables/cbc_s5_1yr_20070129/pylal/lib64/python2.4/site-package=
s:/archive/home/spxiwh/lscsoft/executables/cbc_s5_1yr_20070129/lal//lib:/op=
t/lscsoft/lal/lib64:/opt/lscsoft/glue/lib64/python2.4/site-packages:/opt/ls=
csoft/framecpp/lib64:/ldcg/stow_pkgs/ldg-4.5/ldg/vdt/globus/lib;_CONDOR_DAG=
MAN_LOG_ON_NFS_IS_ERROR=3DFALSE;GLOBUS_TCP_PORT_RANGE=3D40000,45000;GLOBUS_=
PATH=3D/ldcg/stow_pkgs/ldg-4.5/ldg/vdt/globus;X509_CERT_DIR=3D/ldcg/stow_pk=
gs/ldg-4.5/ldg/vdt/globus/TRUSTED_CA;LAL_PREFIX=3D/archive/home/spxiwh/lscs=
oft/executables/cbc_s5_1yr_20070129/lal/;VOMS_USERCONF=3D/ldcg/stow_pkgs/ld=
g-4.5/ldg/vdt/glite/etc/vomses;LDG_SOFTWARE_LOCATION=3Dhttp://www.ldas-sw.l=
igo.caltech.edu/ldg_dist/ldg4.5/software;INPUTRC=3D/etc/inputrc;LALCVSROOT=
=3D:pserver:spxiwh__AT__gravity.phys.uwm.edu:2402/usr/local/cvs/lscsoft;LSCSOFT_=
PREFIX=3D/archive/home/spxiwh/opt/lscsoft;ROOT_LOCATION=3D/opt/lscsoft/root=
;PKG_CONFIG_PATH=3D/archive/home/spxiwh/lscsoft/executables/cbc_s5_1yr_2007=
0129/lal//lib/pkgconfig:/archive/home/spxiwh/lscsoft/executables/cbc_s5_1yr=
_20070129/lal//lib/pkgconfig:/opt/lscsoft/lal/lib64/pkgconfig:/opt/lscsoft/=
libframe/lib64/pkgconfig:/opt/lscsoft/libmetaio/lib64/pkgconfig:/opt/lscsof=
t/framecpp/lib64/pkgconfig:/opt/lscsoft/dol/lib64/pkgconfig:/opt/lscsoft/ro=
ot/lib64/pkgconfig:;KDEDIR=3D/usr;GLITE_LOCATION_TMP=3D/ldcg/stow_pkgs/ldg-=
4.5/ldg/vdt/glite/tmp;LIGOTOOLS=3D/ligotools;LSCSOFT_SRCURL=3Dhttp://www.ls=
c-group.phys.uwm.edu/daswg/download/software/source;GLITE_LOCATION=3D/ldcg/=
stow_pkgs/ldg-4.5/ldg/vdt/glite;SSH_TTY=3D/dev/pts/7;GLITE_LOCATION_VAR=3D/=
ldcg/stow_pkgs/ldg-4.5/ldg/vdt/glite/var;LIBPATH=3D/ldcg/stow_pkgs/ldg-4.5/=
ldg/vdt/globus/lib:/ldcg/ldg/vdt/globus/lib:/usr/lib:/lib;SHELL=3D/bin/bash=
;LDG_INSTALL_LOG=3D/ldcg/stow_pkgs/ldg-4.5/ldg/ldg-server/etc/ldg-install.l=
og;FRAMECPP_PREFIX=3D/opt/lscsoft/framecpp;LSCSOFT_USER=3Dspxiwh;LDG_DIRECT=
ORY=3D/ldcg/stow_pkgs/ldg-4.5/ldg/ldg-server;MAIL=3D/var/spool/mail/spxiwh;=
_CONDOR_MAX_DAGMAN_LOG=3D0;MANPATH=3D/archive/home/spxiwh/lscsoft/executabl=
es/cbc_s5_1yr_20070129/lalapps//man:/archive/home/spxiwh/lscsoft/executable=
s/cbc_s5_1yr_20070129/lal//man:/archive/home/spxiwh/lscsoft/executables/cbc=
_s5_1yr_20070129/lalapps//man:/archive/home/spxiwh/lscsoft/executables/cbc_=
s5_1yr_20070129/lal//man:/opt/lscsoft/lalapps/share/man:/opt/lscsoft/lal/sh=
are/man:/opt/lscsoft/libframe/man:/opt/lscsoft/libmetaio/man:/opt/lscsoft/f=
ramecpp/share/man:/opt/lscsoft/root/man:/ldcg/stow_pkgs/ldg-4.5/ldg/vdt/peg=
asus/man:/ldcg/stow_pkgs/ldg-4.5/ldg/vdt/globus/man::/ldcg/stow_pkgs/ldg-4.=
5/ldg/vdt/vdt/man:/ldcg/stow_pkgs/ldg-4.5/ldg/vdt/perl/man:/ldcg/stow_pkgs/=
ldg-4.5/ldg/vdt/expat/man:/ldcg/stow_pkgs/ldg-4.5/ldg/vdt/logrotate/man:/ld=
cg/stow_pkgs/ldg-4.5/ldg/vdt/jdk1.5/man:/ldcg/stow_pkgs/ldg-4.5/ldg/vdt/edg=
/share/man:/ldcg/stow_pkgs/ldg-4.5/ldg/vdt/mysql/man:/ldcg/stow_pkgs/ldg-4.=
5/ldg/vdt/glite/share/man:/ldcg/stow_pkgs/ldg-4.5/ldg/vdt/apache/man;GLUE_P=
REFIX=3D/archive/home/spxiwh/lscsoft/executables/cbc_s5_1yr_20070129/glue;M=
YSQL_UNIX_PORT=3D/ldcg/stow_pkgs/ldg-4.5/ldg/vdt/vdt-app-data/mysql/var/mys=
ql.sock;DISPLAY=3Dlocalhost:45.0;GLUE_LOCATION=3D/archive/home/spxiwh/lscso=
ft/executables/cbc_s5_1yr_20070129/glue/;PERL5LIB=3D/ldcg/stow_pkgs/ldg-4.5=
/ldg/ldg-server/lib:/ldcg/stow_pkgs/ldg-4.5/ldg/vdt/pegasus/lib/perl:/ldcg/=
stow_pkgs/ldg-4.5/ldg/vdt/vdt/lib:/ldcg/stow_pkgs/ldg-4.5/ldg/vdt/perl/lib/=
5.8.0:/ldcg/stow_pkgs/ldg-4.5/ldg/vdt/perl/lib/5.8.0/x86_64-linux-thread-mu=
lti:/ldcg/stow_pkgs/ldg-4.5/ldg/vdt/perl/lib/site_perl/5.8.0:/ldcg/stow_pkg=
s/ldg-4.5/ldg/vdt/perl/lib/site_perl/5.8.0/x86_64-linux-thread-multi:;ANT_H=
OME=3D/ldcg/stow_pkgs/ldg-4.5/ldg/vdt/ant;USER=3Dspxiwh;SSH_CONNECTION=3D13=
1.251.46.194 43915 131.215.114.6 22;DOL_LOCATION=3D/opt/lscsoft/dol;HOSTNAM=
E=3Dldas-grid;LD_LIBRARY_PATH=3D/archive/home/spxiwh/lscsoft/executables/cb=
c_s5_1yr_20070129/glue/lib64/python2.4/site-packages:/archive/home/spxiwh/l=
scsoft/executables/cbc_s5_1yr_20070129/pylal/lib64/python2.4/site-packages:=
/archive/home/spxiwh/lscsoft/executables/cbc_s5_1yr_20070129/lal//lib:/arch=
ive/home/spxiwh/lscsoft/executables/cbc_s5_1yr_20070129/glue/lib64/python2.=
4/site-packages:/archive/home/spxiwh/lscsoft/executables/cbc_s5_1yr_2007012=
9/pylal/lib64/python2.4/site-packages:/archive/home/spxiwh/lscsoft/executab=
les/cbc_s5_1yr_20070129/lal//lib:/opt/lscsoft/lal/lib64:/opt/lscsoft/glue/l=
ib64/python2.4/site-packages:/opt/lscsoft/libframe/lib64:/opt/lscsoft/libme=
taio/lib64:/opt/lscsoft/framecpp/lib64:/opt/lscsoft/dol/lib64:/opt/lscsoft/=
root/lib64:/ldcg/stow_pkgs/ldg-4.5/ldg/vdt/tclglobus/lib:/ldcg/stow_pkgs/ld=
g-4.5/ldg/vdt/apache/lib:/ldcg/stow_pkgs/ldg-4.5/ldg/vdt/glite/lib:/ldcg/st=
ow_pkgs/ldg-4.5/ldg/vdt/myodbc/lib:/ldcg/stow_pkgs/ldg-4.5/ldg/vdt/unixodbc=
/lib:/ldcg/stow_pkgs/ldg-4.5/ldg/vdt/mysql/lib/mysql:/ldcg/stow_pkgs/ldg-4.=
5/ldg/vdt/jdk1.5/jre/lib/i386:/ldcg/stow_pkgs/ldg-4.5/ldg/vdt/jdk1.5/jre/li=
b/i386/server:/ldcg/stow_pkgs/ldg-4.5/ldg/vdt/jdk1.5/jre/lib/i386/client:/l=
dcg/stow_pkgs/ldg-4.5/ldg/vdt/berkeley-db/lib:/ldcg/stow_pkgs/ldg-4.5/ldg/v=
dt/expat/lib:/ldcg/stow_pkgs/ldg-4.5/ldg/vdt/globus/lib:/ldcg/ldg/vdt/globu=
s/lib:/ligotools/lib;PYTHONPATH=3D/archive/home/spxiwh/lscsoft/executables/=
cbc_s5_1yr_20070129/glue/lib64/python2.4/site-packages:/archive/home/spxiwh=
/lscsoft/executables/cbc_s5_1yr_20070129/glue/lib/python2.4/site-packages:/=
archive/home/spxiwh/lscsoft/executables/cbc_s5_1yr_20070129/pylal/lib64/pyt=
hon2.4/site-packages:/archive/home/spxiwh/lscsoft/executables/cbc_s5_1yr_20=
070129/pylal/lib/python2.4/site-packages:/archive/home/spxiwh/lscsoft/execu=
tables/cbc_s5_1yr_20070129/lalapps//lib64/python2.4/site-packages:/archive/=
home/spxiwh/lscsoft/executables/cbc_s5_1yr_20070129/lalapps//lib/python2.4/=
site-packages:/archive/home/spxiwh/lscsoft/executables/cbc_s5_1yr_20070129/=
glue/lib64/python2.4/site-packages:/archive/home/spxiwh/lscsoft/executables=
/cbc_s5_1yr_20070129/glue/lib/python2.4/site-packages:/archive/home/spxiwh/=
lscsoft/executables/cbc_s5_1yr_20070129/pylal/lib64/python2.4/site-packages=
:/archive/home/spxiwh/lscsoft/executables/cbc_s5_1yr_20070129/pylal/lib/pyt=
hon2.4/site-packages:/archive/home/spxiwh/lscsoft/executables/cbc_s5_1yr_20=
070129/lalapps//lib64/python2.4/site-packages:/archive/home/spxiwh/lscsoft/=
executables/cbc_s5_1yr_20070129/lalapps//lib/python2.4/site-packages:/opt/l=
scsoft/lalapps/lib64/python2.4/site-packages:/opt/lscsoft/lalapps/lib/pytho=
n2.4/site-packages:/opt/lscsoft/glue/lib64/python2.4/site-packages:/opt/lsc=
soft/glue/lib/python2.4/site-packages:/opt/lscsoft/libframe/lib64/python:/o=
pt/lscsoft/libmetaio/lib64/python:/ldcg/stow_pkgs/ldg-4.5/ldg/ldg-server/li=
b64/python:/ldcg/stow_pkgs/ldg-4.5/ldg/vdt/globus/lib64/python:/ldcg/stow_p=
kgs/ldg-4.5/ldg/vdt/globus/lib/python:;X509_CADIR=3D/ldcg/stow_pkgs/ldg-4.5=
/ldg/vdt/globus/TRUSTED_CA;ODBCINI=3D/ldcg/stow_pkgs/ldg-4.5/ldg/vdt/unixod=
bc/etc/odbc.ini;CATALINA_OPTS=3D-Dorg.globus.wsrf.container.persistence.dir=
=3D/ldcg/stow_pkgs/ldg-4.5/ldg/vdt/vdt-app-data/globus/persisted;HOME=3D/ar=
chive/home/spxiwh;LAL_LOCATION=3D/archive/home/spxiwh/lscsoft/executables/c=
bc_s5_1yr_20070129/lal/;LOGNAME=3Dspxiwh;LIGOVIRGOPREFIX=3D/archive/home/sp=
xiwh/ligovirgocvs;EDG_LOCATION=3D/ldcg/stow_pkgs/ldg-4.5/ldg/vdt/edg;MATLAB=
PATH=3D/ligotools/matlab;GPT_LOCATION=3D/ldcg/stow_pkgs/ldg-4.5/ldg/vdt/gpt=
;TAG=3Dcbc_s5_1yr_20070129;GLOBUS_ERROR_VERBOSE=3Dtrue;_=3D/usr/bin/condor_=
submit;JAVA_HOME=3D/ldcg/stow_pkgs/ldg-4.5/ldg/vdt/jdk1.5;G_BROKEN_FILENAME=
S=3D1;FRAMECPP_LOCATION=3D/opt/lscsoft/framecpp;LANG=3DC;GLOBUS_LOCATION=3D=
/ldcg/stow_pkgs/ldg-4.5/ldg/vdt/globus;CONDOR_CONFIG=3D/usr1/condor/condor_=
config;HISTSIZE=3D1000;LSC_SEGFIND_SERVER=3Dldas-cit.ligo.caltech.edu;PYLAL=
_PREFIX=3D/archive/home/spxiwh/lscsoft/executables/cbc_s5_1yr_20070129/pyla=
l;GLOBUS_MYSQL_PATH=3D/ldcg/stow_pkgs/ldg-4.5/ldg/vdt/mysql;LSCSOFT_TMPDIR=
=3D/archive/home/spxiwh/tmp/lscsoft-install;PACMAN_LOCATION=3D/ldcg/pacman/=
stow_pkgs/pacman-3.21;PEGASUS_HOME=3D/ldcg/stow_pkgs/ldg-4.5/ldg/vdt/pegasu=
s;DAGDBUPDATORLOCKFILE=3D/etc/onasys-dblockfile;X509_VOMS_DIR=3D/ldcg/stow_=
pkgs/ldg-4.5/ldg/vdt/glite/vomsdir;CONDOR_LOCATION=3D/usr;PYLAL_LOCATION=3D=
/archive/home/spxiwh/lscsoft/executables/cbc_s5_1yr_20070129/pylal/;LDG_LOC=
ATION=3D/ldcg/stow_pkgs/ldg-4.5/ldg;TERM=3Dxterm;CC=3Dgcc;CVSROOT=3D:ext:sp=
xiwh@alexandria:/usr/local/cvs/IANCVS/;LALAPPS_LOCATION=3D/archive/home/spx=
iwh/lscsoft/executables/cbc_s5_1yr_20070129/lalapps/;LESSOPEN=3D|/usr/bin/l=
esspipe.sh %s;BOSSDIR=3D/etc;LSC_DATAFIND_SERVER=3Dldas-cit.ligo.caltech.ed=
u;_CONDOR_DAGMAN_LOG=3Dihope.dag.dagman.out"
EnvDelim = ";"
JobNotification = 3
WantRemoteIO = FALSE
UserLog = "/mnt/qfs1/spxiwh/ihope/853000000-853010000/ihope.dag.dagman.log"
CoreSize = 0
KillSig = "SIGTERM"
RemoveKillSig = "SIGUSR1"
Rank = 0.000000
In = "/dev/null"
TransferIn = FALSE
Out = "ihope.dag.lib.out"
StreamOut = FALSE
Err = "ihope.dag.lib.err"
StreamErr = FALSE
BufferSize = 524288
BufferBlockSize = 32768
ShouldTransferFiles = "NO"
TransferFiles = "NEVER"
ImageSize_RAW = 5439
ImageSize = 7500
ExecutableSize_RAW = 5439
ExecutableSize = 7500
DiskUsage_RAW = 5439
DiskUsage = 7500
Requirements = (Arch == "X86_64") && (OpSys == "LINUX") && (Disk >= DiskUsage) && ((Memory * 1024) >= ImageSize)
FileSystemDomain = "ligo"
JobLeaseDuration = 3600
PeriodicHold = FALSE
PeriodicRelease = FALSE
PeriodicRemove = FALSE
OnExitHold = FALSE
OnExitRemove = (ExitSignal =?= 11 || (ExitCode =!= UNDEFINED && ExitCode >= 0 && ExitCode <= 2))
LeaveJobInQueue = FALSE
Arguments = "-f -l . -Debug 3 -Lockfile ihope.dag.lock -AutoRescue 1 -DoRescueFrom 0 -Condorlog /mnt/qfs1/spxiwh/ihope/853000000-853010000/datafind/inspiral_hipe_datafind.DATAFIND.dag.dagman.log -Dag ihope.dag -CsdVersion $CondorVersion:' '7.1.1' 'Jul' '14' '2008' 'BuildID:' '94553' '$"
GlobalJobId = "ldas-grid.ligo.caltech.edu#1216900728#33283134.0"
ProcId = 0
AutoClusterId = 7
AutoClusterAttrs = "JobUniverse,LastCheckpointPlatform,NumCkpts,JobStart,DiskUsage,ImageSize,Requirements,NiceUser"
JobStartDate = 1216906019
ExitStatus = 44
ExitBySignal = FALSE
ExitCode = 44
RemoteWallClockTime = 1653687.000000
LastRemoteHost = ""
LastPublicClaimId = ""
LastPublicClaimIds = ""
OrigMaxHosts = 1
JobStatus = 2
EnteredCurrentStatus = 1218564259
LastSuspensionTime = 0
NumJobStarts = 4
CurrentHosts = 1
ShadowBday = 1218564259
JobLastStartDate = 1218413482
JobCurrentStartDate = 1218564259
JobRunCount = 0
WallClockCheckpoint = 83438
ServerTime = 1218650829
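The time-valued attributes in the ClassAd are easier to read once converted. The Python sketch below treats QDate and ServerTime as Unix epoch seconds and RemoteWallClockTime as a duration in seconds, which matches how the values line up with the UserLog timestamps; the helper name epoch_to_utc is invented for the example.

```python
from datetime import datetime, timedelta, timezone


def epoch_to_utc(seconds: int) -> str:
    """Format Unix epoch seconds as a UTC timestamp string."""
    stamp = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return stamp.strftime("%Y-%m-%d %H:%M:%S UTC")


# Values copied from the ClassAd above.
qdate = 1216900728
server_time = 1218650829
wall_clock = timedelta(seconds=1653687)

print("QDate:              ", epoch_to_utc(qdate))
print("ServerTime:         ", epoch_to_utc(server_time))
print("Time in queue:      ", timedelta(seconds=server_time - qdate))
print("RemoteWallClockTime:", wall_clock)
```

Read this way, the job had been in the queue for about 20 days when the ClassAd was dumped, with roughly 19 days of accumulated wall-clock time.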



Here is the UserLog:
root@ldas-cit ~:>>cat /home1/spxiwh/ihope/853000000-853010000/ihope.dag.dagman.log
000 (33283134.000.000) 07/24 04:58:48 Job submitted from host: <10.14.0.12:46271>
...
001 (33283134.000.000) 07/24 06:26:59 Job executing on host: <10.14.0.12:46271>
...
001 (33283134.000.000) 07/28 15:35:42 Job executing on host: <10.14.0.12:46271>
...
001 (33283134.000.000) 08/10 17:11:22 Job executing on host: <10.14.0.12:46271>
...
001 (33283134.000.000) 08/12 11:04:19 Job executing on host: <10.14.0.12:46271>
...


Here are the beginning and end of the large -Condorlog:

root@ldas-cit ~:>>ls -lh /home1/spxiwh/ihope/853000000-853010000/datafind/inspiral_hipe_datafind.DATAFIND.dag.dagman.log
104M -rw-------  1 5134 5134 104M Aug 13 11:09 /home1/spxiwh/ihope/853000000-853010000/datafind/inspiral_hipe_datafind.DATAFIND.dag.dagman.log
root@ldas-cit ~:>>head /home1/spxiwh/ihope/853000000-853010000/datafind/inspiral_hipe_datafind.DATAFIND.dag.dagman.log
000 (33286900.000.000) 07/24 06:27:14 Job submitted from host: <10.14.0.12:46271>
    DAG Node: 5d39d831be7ca646520b8e0b8776e92c
...
001 (33286900.000.000) 07/24 06:27:52 Job executing on host: <10.14.0.12:46271>
...
004 (33286900.000.000) 07/24 06:28:07 Job was evicted.
        (0) Job was not checkpointed.
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        0  -  Run Bytes Sent By Job
root@ldas-cit ~:>>tail /home1/spxiwh/ihope/853000000-853010000/datafind/inspiral_hipe_datafind.DATAFIND.dag.dagman.log
...
001 (33286900.000.000) 08/13 11:09:34 Job executing on host: <10.14.0.12:46271>
...
004 (33286900.000.000) 08/13 11:09:37 Job was evicted.
        (0) Job was not checkpointed.
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
...


Note, there are ~300k jobs reported as having been executed in the Condorlog,
so perhaps there is a memory leak somewhere?

root@ldas-cit ~:>>grep executing /home1/spxiwh/ihope/853000000-853010000/datafind/inspiral_hipe_datafind.DATAFIND.dag.dagman.log | wc -l
317363
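The grep above counts only the "executing" lines. A slightly more general Python sketch tallies every event type in the log by its three-digit code. It assumes each event starts with a header line of the form "NNN (cluster.proc.subproc) ...", as in the excerpts above; count_events is a name made up for the example. The final line is a back-of-envelope figure that assumes, without evidence, that the anonymous mapping from the pmap output grew evenly per execute event.

```python
import re
from collections import Counter

# Event header: three-digit code followed by (cluster.proc.subproc).
EVENT_HEADER = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) ")


def count_events(lines):
    """Count user-log events by their three-digit event code."""
    counts = Counter()
    for line in lines:
        m = EVENT_HEADER.match(line)
        if m:
            counts[m.group(1)] += 1
    return counts


# Lines copied from the head of the log above.
sample = [
    "000 (33286900.000.000) 07/24 06:27:14 Job submitted from host: <10.14.0.12:46271>",
    "    DAG Node: 5d39d831be7ca646520b8e0b8776e92c",
    "...",
    "001 (33286900.000.000) 07/24 06:27:52 Job executing on host: <10.14.0.12:46271>",
    "...",
    "004 (33286900.000.000) 07/24 06:28:07 Job was evicted.",
    "        (0) Job was not checkpointed.",
]
counts = count_events(sample)
print(dict(counts))

# Rough estimate only: anonymous mapping size (KiB) per execute event.
kib_per_execute = 6240720 / 317363
print(f"{kib_per_execute:.1f} KiB per execute event")
```

For the real file, an open file object can be passed directly, e.g. `count_events(open(path))`, since the function only iterates over lines.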


Any ideas on why this particular dagman process is using so much memory
would be appreciated.


Thanks.


-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date of creation: Wed Aug 13 13:14:00 2008 (1218651242)
Subject: Actions

Assigned to wenger by bt
===========================================================================
Date of actions: Tue Aug 12 10:05:40 2008 (1218657203)
Date: Wed, 13 Aug 2008 15:11:37 -0500 (CDT)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: bt <condor-admin__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-admin #18335] LIGO: dagman memory usage

Stuart,

> I just noticed that the LIGO CIT condor pool has a condor_dagman process
> running on one of the head nodes uses 6GByte of memory, i.e.,

> ...

> Note, there are ~300k jobs reported as having been executed in the Condorlog,
> so perhaps there is a memory leak somewhere?

> ...

> Any ideas on why this particular dagman process is using soo much memory
> would be appreciated.

Okay, I guess the first question I have is how many nodes are in the DAG
that it's running.  That's the easiest way I know of to make DAGMan
really use a lot of memory.

I also notice that the DAGMan has been running for a pretty long time,
so maybe there is some kind of memory leak.

Could you put the DAG file and the dagman.out file somewhere where I can
grab them?

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Wed Aug 13 15:11:43 2008 (1218658304)
Date: Wed, 13 Aug 2008 15:02:04 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
Subject: Re: [condor-admin #18335] LIGO: dagman memory usage
X-Seen-BY: mailfromd 4.1 granite.cs.wisc.edu

On Wed, Aug 13, 2008 at 03:11:43PM -0500, condor-admin response tracking system wrote:
> Stuart,
> 
> > I just noticed that the LIGO CIT condor pool has a condor_dagman process
> > running on one of the head nodes uses 6GByte of memory, i.e.,
> 
> > ...
> 
> > Note, there are ~300k jobs reported as having been executed in the Condorlog,
> > so perhaps there is a memory leak somewhere?
> 
> > ...
> 
> > Any ideas on why this particular dagman process is using soo much memory
> > would be appreciated.
> 
> Okay, I guess the first question I have is how many nodes are in the DAG
> that it's running.  That's the easiest way I know of to make DAGMan
> really use a lot of memory.

Currently, it is only running one node:

[root@ldas-grid ~]# condor_q -dag spxiwh


-- Submitter: ldas-grid.ligo.caltech.edu : <10.14.0.12:46271> : ldas-grid.ligo.caltech.edu
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD               
33283134.0   spxiwh          7/24 04:58  20+07:08:29 R  0   7.3  condor_dagman -f -
33286900.0    |-5d39d831be7  7/24 06:27   1+13:08:53 R  0   7.3  condor_dagman -f -


> 
> I also notice that the DAGMan has been running for a pretty long time,
> so maybe there is some kind of memory leak.
> 
> Could you put the DAG file and the dagman.out file somewhere where I can
> grab them?


The ihope.dag and ihope.dag.dagman.out files may be found at,
http://www.ligo.caltech.edu/~anderson/condor.18335

Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Wed Aug 13 17:02:15 2008 (1218664935)
Date: Wed, 13 Aug 2008 17:10:42 -0500 (CDT)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-admin #18335] LIGO: dagman memory usage

Stuart,

> Currently, it is only running one node:

FYI: the memory footprint of DAGMan mostly depends on the total number
of nodes in the DAG, not the number currently running.  As it turns out,
though, there are only 28 nodes in that DAG, so that's not the problem.
Looks like you probably have found a case that triggered a memory leak.

I'll see what I can figure out.

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Wed Aug 13 17:10:44 2008 (1218665444)
Date: Wed, 13 Aug 2008 15:29:17 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
Subject: Re: [condor-admin #18335] LIGO: dagman memory usage
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu

On Wed, Aug 13, 2008 at 05:10:44PM -0500, condor-admin response tracking system wrote:
> Stuart,
> 
> > Currently, it is only running one node:
> 
> FYI: the memory footprint of DAGMan mostly depends on the total number
> of nodes in the DAG, not the number currently running.  As it turns out,
> though, there are only 28 nodes in that DAG, so that's not the problem.
> Looks like you probably have found a case that triggered a memory leak.
> 

It looks like it is still growing:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 5287 spxiwh    15   0 6215m 6.1g 1724 S  0.0 40.0   9:13.87 condor_dagman
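
To put a rough number on "still growing", the two top samples for PID 5287 in this thread can be turned into a growth rate. The Python sketch below assumes the email Date headers are close to the moments the samples were taken, which is only approximately true, so the result is an order-of-magnitude figure at best.

```python
from email.utils import parsedate_to_datetime

# (email Date header, VIRT in MiB) for PID 5287, copied from this thread.
first = ("Wed, 13 Aug 2008 11:13:48 -0700", 6105)
second = ("Wed, 13 Aug 2008 15:29:17 -0700", 6215)

t0 = parsedate_to_datetime(first[0])
t1 = parsedate_to_datetime(second[0])

elapsed_hours = (t1 - t0).total_seconds() / 3600.0
growth_mib = second[1] - first[1]
rate_mib_per_hour = growth_mib / elapsed_hours

print(f"{growth_mib} MiB in {elapsed_hours:.2f} h "
      f"= {rate_mib_per_hour:.1f} MiB/h")
```

A steady rate of this order while the DAG has only one node running is consistent with a leak rather than with normal per-node bookkeeping.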


In case it helps, here is a snippet of strace output on the dagman process
showing lots of stat() calls failing:

[root@ldas-grid ~]# strace -p 5287
Process 5287 attached - interrupt to quit
select(7, [3 5 6], [], [], {2, 944000}) = 0 (Timeout)
rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV PROF], NULL, 8) = 0
read(3, 0x7fff1a56e6e0, 8)              = -1 EAGAIN (Resource temporarily unavailable)
time([1218666447])                      = 1218666447
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/playground/inspiral_hipe_playground.PLAYGROUND.dag.dagman.log", 0x1844e8fb8) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/full_data/inspiral_hipe_full_data.FULL_DATA.dag.dagman.log", 0x1844e8fb8) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/bnsinj/inspiral_hipe_bnsinj.BNSINJ.dag.dagman.log", 0x1844e8fb8) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/bbhinj/inspiral_hipe_bbhinj.BBHINJ.dag.dagman.log", 0x1844e8fb8) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/playground/inspiral_hipe_playground_cat2_veto.PLAYGROUND_CAT_2_VETO.dag.dagman.log", 0x184471f18) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/full_data/inspiral_hipe_full_data_cat2_veto.FULL_DATA_CAT_2_VETO.dag.dagman.log", 0x184471f18) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/bnsinj/inspiral_hipe_bnsinj_cat2_veto.BNSINJ_CAT_2_VETO.dag.dagman.log", 0x184471f18) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/bbhinj/inspiral_hipe_bbhinj_cat2_veto.BBHINJ_CAT_2_VETO.dag.dagman.log", 0x184471f18) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/playground/inspiral_hipe_playground_cat3_veto.PLAYGROUND_CAT_3_VETO.dag.dagman.log", 0x184471f18) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/full_data/inspiral_hipe_full_data_cat3_veto.FULL_DATA_CAT_3_VETO.dag.dagman.log", 0x184471f18) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/bnsinj/inspiral_hipe_bnsinj_cat3_veto.BNSINJ_CAT_3_VETO.dag.dagman.log", 0x184471f18) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/bbhinj/inspiral_hipe_bbhinj_cat3_veto.BBHINJ_CAT_3_VETO.dag.dagman.log", 0x184471f18) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/playground/inspiral_hipe_playground_cat4_veto.PLAYGROUND_CAT_4_VETO.dag.dagman.log", 0x184471f18) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/full_data/inspiral_hipe_full_data_cat4_veto.FULL_DATA_CAT_4_VETO.dag.dagman.log", 0x184471f18) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/bnsinj/inspiral_hipe_bnsinj_cat4_veto.BNSINJ_CAT_4_VETO.dag.dagman.log", 0x184471f18) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/bbhinj/inspiral_hipe_bbhinj_cat4_veto.BBHINJ_CAT_4_VETO.dag.dagman.log", 0x184471f18) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/playground_summary_plots/plot_hipe_playground_summary_plots.PLAYGROUND_SUMMARY_PLOTS.dag.dagman.log", 0x1844e9998) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/playground_summary_plots/plot_hipe_playground_summary_plots_cat_2_veto.PLAYGROUND_SUMMARY_PLOTS_CAT_2_VETO.dag.dagman.log", 0x184471f18) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/full_data_slide_summary_plots/plot_hipe_full_data_slide_summary_plots.FULL_DATA_SLIDE_SUMMARY_PLOTS.dag.dagman.log", 0x1844e9998) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/full_data_slide_summary_plots/plot_hipe_full_data_slide_summary_plots_cat_2_veto.FULL_DATA_SLIDE_SUMMARY_PLOTS_CAT_2_VETO.dag.dagman.log", 0x184471f18) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/bnsinj_summary_plots/plot_hipe_bnsinj_summary_plots.BNSINJ_SUMMARY_PLOTS.dag.dagman.log", 0x1844e9998) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/bnsinj_summary_plots/plot_hipe_bnsinj_summary_plots_cat_2_veto.BNSINJ_SUMMARY_PLOTS_CAT_2_VETO.dag.dagman.log", 0x1844ea968) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/bbhinj_summary_plots/plot_hipe_bbhinj_summary_plots.BBHINJ_SUMMARY_PLOTS.dag.dagman.log", 0x1844e9998) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/bbhinj_summary_plots/plot_hipe_bbhinj_summary_plots_cat_2_veto.BBHINJ_SUMMARY_PLOTS_CAT_2_VETO.dag.dagman.log", 0x1844ea968) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/allinj_summary_plots/plot_hipe_allinj_summary_plots.ALLINJ_SUMMARY_PLOTS.dag.dagman.log", 0x1844e1f98) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/allinj_summary_plots/plot_hipe_allinj_summary_plots_cat_2_veto.ALLINJ_SUMMARY_PLOTS_CAT_2_VETO.dag.dagman.log", 0x1844ea968) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/followup/followup_pipe_followup.dag.dagman.log", 0x1844ea968) = -1 ENOENT (No such file or directory)
fstat(7, {st_mode=S_IFREG|0600, st_size=109836279, ...}) = 0
time([1218666447])                      = 1218666447
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/playground/inspiral_hipe_playground.PLAYGROUND.dag.dagman.log", 0x1844ea968) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/full_data/inspiral_hipe_full_data.FULL_DATA.dag.dagman.log", 0x1844ea968) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/bnsinj/inspiral_hipe_bnsinj.BNSINJ.dag.dagman.log", 0x1844ea968) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/bbhinj/inspiral_hipe_bbhinj.BBHINJ.dag.dagman.log", 0x1844ea968) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/playground/inspiral_hipe_playground_cat2_veto.PLAYGROUND_CAT_2_VETO.dag.dagman.log", 0x1844eb4f8) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/full_data/inspiral_hipe_full_data_cat2_veto.FULL_DATA_CAT_2_VETO.dag.dagman.log", 0x1844eb4f8) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/bnsinj/inspiral_hipe_bnsinj_cat2_veto.BNSINJ_CAT_2_VETO.dag.dagman.log", 0x1844eb4f8) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/bbhinj/inspiral_hipe_bbhinj_cat2_veto.BBHINJ_CAT_2_VETO.dag.dagman.log", 0x1844eb4f8) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/playground/inspiral_hipe_playground_cat3_veto.PLAYGROUND_CAT_3_VETO.dag.dagman.log", 0x1844eb4f8) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/full_data/inspiral_hipe_full_data_cat3_veto.FULL_DATA_CAT_3_VETO.dag.dagman.log", 0x1844eb4f8) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/bnsinj/inspiral_hipe_bnsinj_cat3_veto.BNSINJ_CAT_3_VETO.dag.dagman.log", 0x1844eb4f8) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/bbhinj/inspiral_hipe_bbhinj_cat3_veto.BBHINJ_CAT_3_VETO.dag.dagman.log", 0x1844eb4f8) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/playground/inspiral_hipe_playground_cat4_veto.PLAYGROUND_CAT_4_VETO.dag.dagman.log", 0x1844eb4f8) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/full_data/inspiral_hipe_full_data_cat4_veto.FULL_DATA_CAT_4_VETO.dag.dagman.log", 0x1844eb4f8) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/bnsinj/inspiral_hipe_bnsinj_cat4_veto.BNSINJ_CAT_4_VETO.dag.dagman.log", 0x1844eb4f8) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/bbhinj/inspiral_hipe_bbhinj_cat4_veto.BBHINJ_CAT_4_VETO.dag.dagman.log", 0x1844eb4f8) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/playground_summary_plots/plot_hipe_playground_summary_plots.PLAYGROUND_SUMMARY_PLOTS.dag.dagman.log", 0x1844ebed8) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/playground_summary_plots/plot_hipe_playground_summary_plots_cat_2_veto.PLAYGROUND_SUMMARY_PLOTS_CAT_2_VETO.dag.dagman.log", 0x1844eb4f8) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/full_data_slide_summary_plots/plot_hipe_full_data_slide_summary_plots.FULL_DATA_SLIDE_SUMMARY_PLOTS.dag.dagman.log", 0x1844ebed8) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/full_data_slide_summary_plots/plot_hipe_full_data_slide_summary_plots_cat_2_veto.FULL_DATA_SLIDE_SUMMARY_PLOTS_CAT_2_VETO.dag.dagman.log", 0x1844eb4f8) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/bnsinj_summary_plots/plot_hipe_bnsinj_summary_plots.BNSINJ_SUMMARY_PLOTS.dag.dagman.log", 0x1844ebed8) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/bnsinj_summary_plots/plot_hipe_bnsinj_summary_plots_cat_2_veto.BNSINJ_SUMMARY_PLOTS_CAT_2_VETO.dag.dagman.log", 0x1844ecea8) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/bbhinj_summary_plots/plot_hipe_bbhinj_summary_plots.BBHINJ_SUMMARY_PLOTS.dag.dagman.log", 0x1844ebed8) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/bbhinj_summary_plots/plot_hipe_bbhinj_summary_plots_cat_2_veto.BBHINJ_SUMMARY_PLOTS_CAT_2_VETO.dag.dagman.log", 0x1844ecea8) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/allinj_summary_plots/plot_hipe_allinj_summary_plots.ALLINJ_SUMMARY_PLOTS.dag.dagman.log", 0x1844e1f98) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/allinj_summary_plots/plot_hipe_allinj_summary_plots_cat_2_veto.ALLINJ_SUMMARY_PLOTS_CAT_2_VETO.dag.dagman.log", 0x1844ecea8) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/followup/followup_pipe_followup.dag.dagman.log", 0x1844ecea8) = -1 ENOENT (No such file or directory)
fcntl(7, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=0}) = 0
lseek(7, 109051904, SEEK_SET)           = 109051904
read(7, "4.0.12:46271>\n...\n004 (33286900."..., 1048576) = 784375
time([1218666447])                      = 1218666447
fcntl(7, F_SETLKW, {type=F_UNLCK, whence=SEEK_SET, start=0, len=0}) = 0
lseek(7, 109836279, SEEK_SET)           = 109836279
fstat(7, {st_mode=S_IFREG|0600, st_size=109836279, ...}) = 0
time(NULL)                              = 1218666447
rt_sigprocmask(SIG_BLOCK, ~[ILL TRAP ABRT BUS FPE SEGV], ~[ILL TRAP ABRT BUS FPE KILL SEGV STOP PROF], 8) = 0
umask(022)                              = 022
time([1218666447])                      = 1218666447
open("/mnt/qfs1/spxiwh/ihope/853000000-853010000/ihope.dag.dagman.out", O_WRONLY|O_APPEND|O_CREAT|O_EXCL, 0644) = -1 EEXIST (File exists)
open("/mnt/qfs1/spxiwh/ihope/853000000-853010000/ihope.dag.dagman.out", O_WRONLY|O_APPEND) = 8
fcntl(8, F_GETFL)                       = 0x8401 (flags O_WRONLY|O_APPEND|O_LARGEFILE)
fstat(8, {st_mode=S_IFREG|0644, st_size=269730479, ...}) = 0
mmap(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b9790cb5000
lseek(8, 0, SEEK_CUR)                   = 0
lseek(8, 0, SEEK_END)                   = 269730479
write(8, "8/13 15:27:27 Event: ULOG_EXECUT"..., 96) = 96
close(8)                                = 0
munmap(0x2b9790cb5000, 1048576)         = 0
umask(022)                              = 022
rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE KILL SEGV STOP PROF], NULL, 8) = 0
open("/dev/null", O_WRONLY|O_CREAT|O_EXCL, 0644) = -1 EEXIST (File exists)
open("/dev/null", O_WRONLY)             = 8
fcntl(8, F_GETFL)                       = 0x8001 (flags O_WRONLY|O_LARGEFILE)
fstat(8, {st_mode=S_IFCHR|0666, st_rdev=makedev(1, 3), ...}) = 0
ioctl(8, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7fff1a56da80) = -1 ENOTTY (Inappropriate ioctl for device)
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b9790cb5000
lseek(8, 0, SEEK_CUR)                   = 0
write(8, "(33286900.0.0)", 14)          = 14
close(8)                                = 0
munmap(0x2b9790cb5000, 4096)            = 0
rt_sigprocmask(SIG_BLOCK, ~[ILL TRAP ABRT BUS FPE SEGV], ~[ILL TRAP ABRT BUS FPE KILL SEGV STOP PROF], 8) = 0
umask(022)                              = 022
time([1218666447])                      = 1218666447
open("/mnt/qfs1/spxiwh/ihope/853000000-853010000/ihope.dag.dagman.out", O_WRONLY|O_APPEND|O_CREAT|O_EXCL, 0644) = -1 EEXIST (File exists)
open("/mnt/qfs1/spxiwh/ihope/853000000-853010000/ihope.dag.dagman.out", O_WRONLY|O_APPEND) = 8
fcntl(8, F_GETFL)                       = 0x8401 (flags O_WRONLY|O_APPEND|O_LARGEFILE)
fstat(8, {st_mode=S_IFREG|0644, st_size=269730575, ...}) = 0
mmap(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b9790cb5000
lseek(8, 0, SEEK_CUR)                   = 0
lseek(8, 0, SEEK_END)                   = 269730575
write(8, "8/13 15:27:27 Number of idle job"..., 42) = 42
close(8)                                = 0
munmap(0x2b9790cb5000, 1048576)         = 0
umask(022)                              = 022
rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE KILL SEGV STOP PROF], NULL, 8) = 0
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/playground/inspiral_hipe_playground.PLAYGROUND.dag.dagman.log", 0x1844ecea8) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/full_data/inspiral_hipe_full_data.FULL_DATA.dag.dagman.log", 0x1844ecea8) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/bnsinj/inspiral_hipe_bnsinj.BNSINJ.dag.dagman.log", 0x1844ecea8) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/bbhinj/inspiral_hipe_bbhinj.BBHINJ.dag.dagman.log", 0x1844ecea8) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/playground/inspiral_hipe_playground_cat2_veto.PLAYGROUND_CAT_2_VETO.dag.dagman.log", 0x1844ecea8) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/full_data/inspiral_hipe_full_data_cat2_veto.FULL_DATA_CAT_2_VETO.dag.dagman.log", 0x1844ecea8) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/bnsinj/inspiral_hipe_bnsinj_cat2_veto.BNSINJ_CAT_2_VETO.dag.dagman.log", 0x1844ecea8) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/bbhinj/inspiral_hipe_bbhinj_cat2_veto.BBHINJ_CAT_2_VETO.dag.dagman.log", 0x1844ecea8) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/playground/inspiral_hipe_playground_cat3_veto.PLAYGROUND_CAT_3_VETO.dag.dagman.log", 0x1844ecea8) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/full_data/inspiral_hipe_full_data_cat3_veto.FULL_DATA_CAT_3_VETO.dag.dagman.log", 0x1844ecea8) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/bnsinj/inspiral_hipe_bnsinj_cat3_veto.BNSINJ_CAT_3_VETO.dag.dagman.log", 0x1844ecea8) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/bbhinj/inspiral_hipe_bbhinj_cat3_veto.BBHINJ_CAT_3_VETO.dag.dagman.log", 0x1844ecea8) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/playground/inspiral_hipe_playground_cat4_veto.PLAYGROUND_CAT_4_VETO.dag.dagman.log", 0x1844ecea8) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/full_data/inspiral_hipe_full_data_cat4_veto.FULL_DATA_CAT_4_VETO.dag.dagman.log", 0x1844ecea8) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/bnsinj/inspiral_hipe_bnsinj_cat4_veto.BNSINJ_CAT_4_VETO.dag.dagman.log", 0x1844ecea8) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/bbhinj/inspiral_hipe_bbhinj_cat4_veto.BBHINJ_CAT_4_VETO.dag.dagman.log", 0x1844ecea8) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/playground_summary_plots/plot_hipe_playground_summary_plots.PLAYGROUND_SUMMARY_PLOTS.dag.dagman.log", 0x1844e1f98) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/playground_summary_plots/plot_hipe_playground_summary_plots_cat_2_veto.PLAYGROUND_SUMMARY_PLOTS_CAT_2_VETO.dag.dagman.log", 0x1844ecea8) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/full_data_slide_summary_plots/plot_hipe_full_data_slide_summary_plots.FULL_DATA_SLIDE_SUMMARY_PLOTS.dag.dagman.log", 0x1844e1f98) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/full_data_slide_summary_plots/plot_hipe_full_data_slide_summary_plots_cat_2_veto.FULL_DATA_SLIDE_SUMMARY_PLOTS_CAT_2_VETO.dag.dagman.log", 0x1844ecea8) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/bnsinj_summary_plots/plot_hipe_bnsinj_summary_plots.BNSINJ_SUMMARY_PLOTS.dag.dagman.log", 0x1844e1f98) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/bnsinj_summary_plots/plot_hipe_bnsinj_summary_plots_cat_2_veto.BNSINJ_SUMMARY_PLOTS_CAT_2_VETO.dag.dagman.log", 0x1844ef418) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/bbhinj_summary_plots/plot_hipe_bbhinj_summary_plots.BBHINJ_SUMMARY_PLOTS.dag.dagman.log", 0x1844e1f98) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/bbhinj_summary_plots/plot_hipe_bbhinj_summary_plots_cat_2_veto.BBHINJ_SUMMARY_PLOTS_CAT_2_VETO.dag.dagman.log", 0x1844ef418) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/allinj_summary_plots/plot_hipe_allinj_summary_plots.ALLINJ_SUMMARY_PLOTS.dag.dagman.log", 0x1844eda38) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/allinj_summary_plots/plot_hipe_allinj_summary_plots_cat_2_veto.ALLINJ_SUMMARY_PLOTS_CAT_2_VETO.dag.dagman.log", 0x1844ef418) = -1 ENOENT (No such file or directory)
stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/followup/followup_pipe_followup.dag.dagman.log", 0x1844ef418) = -1 ENOENT (No such file or directory)
fcntl(7, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=0}) = 0
lseek(7, 109836279, SEEK_SET)           = 109836279
time([1218666447])                      = 1218666447
fcntl(7, F_SETLKW, {type=F_UNLCK, whence=SEEK_SET, start=0, len=0}) = 0
lseek(7, 109051904, SEEK_SET)           = 109051904
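
In the trace above, the same set of sub-DAG log paths is checked with stat() on every pass, and every check fails with ENOENT. One way to confirm this across a longer capture is to tally the failed stat() calls per path. The following is only a sketch: the function name and the regular expression are illustrative and assume strace's default output format as it appears above.

```python
import re
from collections import Counter

# Matches strace lines of the form seen in the trace above, e.g.
#   stat("/path/file.log", 0x1844e8fb8) = -1 ENOENT (No such file or directory)
STAT_ENOENT = re.compile(r'^stat\("([^"]+)",\s*0x[0-9a-f]+\)\s*=\s*-1 ENOENT')


def count_missing_stats(trace_lines):
    """Return a Counter mapping each path to its number of failed stat() calls."""
    counts = Counter()
    for line in trace_lines:
        match = STAT_ENOENT.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Two lines copied from the trace, plus one unrelated call that is ignored.
    sample = [
        'stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/followup/'
        'followup_pipe_followup.dag.dagman.log", 0x1844ea968) = -1 ENOENT '
        '(No such file or directory)',
        'fstat(7, {st_mode=S_IFREG|0600, st_size=109836279, ...}) = 0',
        'stat("/mnt/qfs1/spxiwh/ihope/853000000-853010000/followup/'
        'followup_pipe_followup.dag.dagman.log", 0x1844ecea8) = -1 ENOENT '
        '(No such file or directory)',
    ]
    for path, n in count_missing_stats(sample).most_common():
        print(n, path)
```

To run it over a saved capture, pass the lines of the file to count_missing_stats() instead of the inline sample.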

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Wed Aug 13 17:29:27 2008 (1218666567)
Date: Thu, 14 Aug 2008 10:16:27 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
Subject: Re: [condor-admin #18335] LIGO: dagman memory usage
X-Seen-BY: mailfromd 4.1 silica.cs.wisc.edu

FWIW, this dag is still running just the same single job which is another
dagman process, and the memory is still growing, i.e.,

[root@ldas-grid ~]# condor_q -dag spxiwh


-- Submitter: ldas-grid.ligo.caltech.edu : <10.14.0.12:46271> : ldas-grid.ligo.caltech.edu
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD               
33283134.0   spxiwh          7/24 04:58  21+02:31:51 R  0   7.3  condor_dagman -f -
33286900.0    |-5d39d831be7  7/24 06:27   1+13:43:01 I  0   7.3  condor_dagman -f -

2 jobs; 1 idle, 1 running, 0 held



  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 5287 spxiwh    15   0 6697m 6.5g 1724 S  0.0 43.1  10:20.19 condor_dagman
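
For a rough sense of how fast the process is growing, the RES values reported by top can be converted to bytes and divided by the elapsed time. This is only a sketch: the helper names are made up for illustration, and the timestamps used in the demo are the email times from this thread, which only approximate when each top sample was actually taken.

```python
from datetime import datetime

# top prints memory fields in KiB unless a unit suffix is present.
UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3, "t": 1024 ** 4}


def parse_top_size(field):
    """Convert a top memory field such as '6.5g', '121m' or '2240' to bytes."""
    suffix = field[-1].lower()
    if suffix in UNITS:
        return float(field[:-1]) * UNITS[suffix]
    return float(field) * 1024  # bare number: KiB


def growth_mib_per_hour(res_start, t_start, res_end, t_end):
    """Average growth in MiB per hour between two top samples."""
    hours = (t_end - t_start).total_seconds() / 3600
    delta = parse_top_size(res_end) - parse_top_size(res_start)
    return delta / 2 ** 20 / hours


if __name__ == "__main__":
    # RES was 6.1g in the Aug 13 15:29 message and 6.5g in the Aug 14 10:16 one.
    rate = growth_mib_per_hour(
        "6.1g", datetime(2008, 8, 13, 15, 29),
        "6.5g", datetime(2008, 8, 14, 10, 16),
    )
    print(f"approximate growth: {rate:.1f} MiB/hour")
```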



On Wed, Aug 13, 2008 at 03:29:16PM -0700, Stuart Anderson wrote:
> On Wed, Aug 13, 2008 at 05:10:44PM -0500, condor-admin response tracking system wrote:
> > Stuart,
> > 
> > > Currently, it is only running one node:
> > 
> > FYI: the memory footprint of DAGMan mostly depends on the total number
> > of nodes in the DAG, not the number currently running.  As it turns out,
> > though, there are only 28 nodes in that DAG, so that's not the problem.
> > Looks like you probably have found a case that triggered a memory leak.
> > 
> 
> It looks like it is still growing:
> 
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  5287 spxiwh    15   0 6215m 6.1g 1724 S  0.0 40.0   9:13.87 condor_dagman
> 
> 


-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Thu Aug 14 12:16:46 2008 (1218734207)
Date: Fri, 15 Aug 2008 12:44:57 -0500 (CDT)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-admin #18335] LIGO: dagman memory usage

Stuart,

> FWIW, this dag is still running just the same single job which is another
> dagman process, and the memory is still growing, i.e.,

I haven't had a chance to check in detail yet, but I suspect that this is 
related to the very large number of evicted events.  It's such a small DAG 
it can't be related to the size of the DAG.

It's been a while since I've run valgrind on DAGMan; I guess it's time to 
do it again.

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Fri Aug 15 12:45:01 2008 (1218822302)
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #18335] LIGO: dagman memory usage
Date: Sat, 13 Sep 2008 14:02:32 -0700
X-Seen-BY: mailfromd 4.1 granite.cs.wisc.edu

Kent,
	It looks like this corner case has been triggered by another user, so
it is not just a one-off occurrence. However, that being said, LIGO would
still like DAG tickets 18355 and 18376 worked on before this.

Thanks.


On Aug 15, 2008, at 10:45 AM, condor-admin response tracking system wrote:

> Stuart,
>
>> FWIW, this dag is still running just the same single job which is
>> another dagman process, and the memory is still growing, i.e.,
>
> I haven't had a chance to check in detail yet, but I suspect that this is
> related to the very large number of evicted events.  It's such a small DAG
> it can't be related to the size of the DAG.
>
> It's been a while since I've run valgrind on DAGMan; I guess it's time to
> do it again.
>
> Kent Wenger
> Condor Team
>

--
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson




===========================================================================
Date mail was appended: Sat Sep 13 16:02:39 2008 (1221339760)
Date: Wed, 17 Sep 2008 15:37:41 -0500 (CDT)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-admin #18335] LIGO: dagman memory usage

Stuart,

> 	It looks like this corner case has been triggered by another user, so
> it is not just a one-off occurrence. However, that being said, LIGO would
> still like DAG tickets 18355 and 18376 worked on before this.

Hmm, is condor-admin #18355 really what you meant here?  That shows up for 
me with a subject unrelated to DAGMan.

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Wed Sep 17 15:37:45 2008 (1221683867)
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #18335] LIGO: dagman memory usage
Date: Wed, 17 Sep 2008 19:30:58 -0700
X-Seen-BY: mailfromd 4.1 silica.cs.wisc.edu



On Sep 17, 2008, at 1:37 PM, condor-admin response tracking system wrote:

> Stuart,
>
>> 	It looks like this corner case has been triggered by another user, so
>> it is not just a one-off occurrence. However, that being said, LIGO
>> would still like DAG tickets 18355 and 18376 worked on before this.
>
> Hmm, is condor-admin #18355 really what you meant here?  That shows up
> for me with a subject unrelated to DAGMan.

Kent,
	My mistake, I meant to refer to automatically and recursively
regenerating sub-dag submit files and the use of DIR in dag splicing.

When you and Pete have had a chance to discuss those two issues, I
would appreciate an update on when you think a pre-release set of dag
binaries might be available.

Thanks.

--
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson





===========================================================================
Date mail was appended: Wed Sep 17 21:31:10 2008 (1221705071)
Date: Fri, 19 Sep 2008 13:43:11 -0500 (CDT)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-admin #18335] LIGO: dagman memory usage

Stuart,

> When you and Pete have had a chance to discuss those two issues, I
> would appreciate an update on when you think a pre-release set of dag
> binaries might be available.

I think I should have condor_submit_dag recursion ready to go some time
next week.  The basic framework is in place now; I have to add the
"preserve maxjobs, etc." feature that Duncan wants and do some more 
testing to just make sure I have all of the different cases covered.

Pete's not sure about the DIR/splicing fix -- he's wrangling the 7.0.5
release, working on some other stuff, and dealing with a family emergency 
right now...

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Fri Sep 19 13:43:14 2008 (1221849795)
Date: Thu, 18 Dec 2008 16:11:42 -0600 (CST)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: bt <condor-admin__AT__cs.wisc.edu>
Subject: Re: [condor-admin #18335] LIGO: dagman memory usage

Stuart,

I've finally looked into this in detail, and unfortunately I haven't been 
able to figure out much.  I took a look at the dagman.out file, and the
only thing that's unusual about it is that there are a ton of 
evicted/execute event pairs for the DAG job.  So I created a job that
artificially introduced evicted and execute events into the log.
Unfortunately, that hasn't shown up any related memory leaks.
(In the process of working on this, I did find one small leak, but it's
leaked when a job is submitted, and in the example you sent, there's only 
one submission, so that obviously isn't the problem.)
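
A rough way to quantify the evicted/execute pairing described above is to count event headers per job in a user log. This is a sketch, not part of DAGMan: it assumes each event begins with a line of the form "NNN (cluster.proc.subproc) ..." and that 001 is the execute event and 004 the evicted event, which matches the "004 (33286900." fragment visible in the strace output earlier in this ticket. The function name and the sample log text are made up for illustration.

```python
import re
from collections import Counter

# Event header: three-digit event code, then (cluster.proc.subproc).
EVENT_HEADER = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\)", re.MULTILINE)

# Assumed labels for the two event codes of interest here.
EVENT_NAMES = {"001": "execute", "004": "evicted"}


def count_events(log_text):
    """Return a Counter keyed by (cluster, event_code)."""
    counts = Counter()
    for match in EVENT_HEADER.finditer(log_text):
        code, cluster = match.group(1), int(match.group(2))
        counts[(cluster, code)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical log excerpt, not taken from the real job log.
    sample = (
        "001 (33286900.000.000) 08/13 15:20:01 Job executing on host: <10.14.0.12:46271>\n"
        "...\n"
        "004 (33286900.000.000) 08/13 15:20:02 Job was evicted.\n"
        "...\n"
        "001 (33286900.000.000) 08/13 15:27:27 Job executing on host: <10.14.0.12:46271>\n"
        "...\n"
    )
    for (cluster, code), n in sorted(count_events(sample).items()):
        print(cluster, EVENT_NAMES.get(code, code), n)
```

For a log the size of the one in this ticket (roughly 110 MB), reading the file in chunks or line by line would be kinder to memory than loading it in one piece.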

The main thing I can think of at this point is as follows:  if you see 
this happen again, if you can save all of the node job log files, in 
addition to the dag file, dagman.out file, etc., I can use those log files
to "play back" the events and see if I can reproduce the problem.

Sorry I don't have anything more to offer at this point.

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Thu Dec 18 16:11:44 2008 (1229638304)
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #18335] LIGO: dagman memory usage
Date: Thu, 18 Dec 2008 20:17:27 -0800
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu

OK, we will keep an eye out for this happening again. Is there a
script that I can point at a dagman job in the queue that will pull
all the necessary information together?

Thanks.

On Dec 18, 2008, at 2:11 PM, condor-admin response tracking system wrote:

> Stuart,
>
> I've finally looked into this in detail, and unfortunately I haven't been
> able to figure out much.  I took a look at the dagman.out file, and the
> only thing that's unusual about it is that there are a ton of
> evicted/execute event pairs for the DAG job.  So I created a job that
> artificially introduced evicted and execute events into the log.
> Unfortunately, that hasn't shown up any related memory leaks.
> (In the process of working on this, I did find one small leak, but it's
> leaked when a job is submitted, and in the example you sent, there's only
> one submission, so that obviously isn't the problem.)
>
> The main thing I can think of at this point is as follows:  if you see
> this happen again, if you can save all of the node job log files, in
> addition to the dag file, dagman.out file, etc., I can use those log files
> to "play back" the events and see if I can reproduce the problem.
>
> Sorry I don't have anything more to offer at this point.
>
> Kent Wenger
> Condor Team
>

--
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson




===========================================================================
Date mail was appended: Thu Dec 18 22:17:38 2008 (1229660258)
Date: Fri, 19 Dec 2008 10:47:34 -0600 (CST)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-admin #18335] LIGO: dagman memory usage

Stuart,

> OK, we will keep an eye out for this happening again. Is there a
> script that I can point at a dagman job in the queue that will pull
> all the necessary information together?

Unfortunately we don't have such a script -- maybe we should write one!
(I know that the Condor team has talked a bit about making some kind of 
tool that would make it easier for users to gather information we need
to debug things.)

Kent Wenger
Condor Team
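
Since no such script exists, the following is a minimal sketch of what one might look like. It is not a Condor tool, and it starts from a DAG file path rather than from a job in the queue. It assumes the usual condor_submit_dag output names (.dagman.out, .dagman.log, .lib.out, .lib.err, .condor.sub), DAG lines of the form "JOB name submitfile [DIR dir]", and a plain "log = file" line in each submit file. Macros in log paths, SUBDAG and SPLICE lines are not handled, and the function names are invented for illustration.

```python
import re
import tarfile
from pathlib import Path

# Files that condor_submit_dag normally writes next to the DAG file.
DAG_SUFFIXES = ["", ".dagman.out", ".dagman.log", ".lib.out", ".lib.err", ".condor.sub"]
JOB_LINE = re.compile(r"^\s*JOB\s+(\S+)\s+(\S+)(?:\s+DIR\s+(\S+))?", re.IGNORECASE)
LOG_LINE = re.compile(r"^\s*log\s*=\s*(\S+)", re.IGNORECASE)


def find_debug_files(dag_path):
    """Return the existing DAG-level files, node submit files and node logs."""
    dag_path = Path(dag_path)
    base = dag_path.parent
    found = set()
    for suffix in DAG_SUFFIXES:
        candidate = Path(str(dag_path) + suffix)
        if candidate.is_file():
            found.add(candidate)
    for line in dag_path.read_text().splitlines():
        job = JOB_LINE.match(line)
        if not job:
            continue
        # A DIR argument makes the submit file relative to that directory.
        node_dir = base / job.group(3) if job.group(3) else base
        submit_file = node_dir / job.group(2)
        if not submit_file.is_file():
            continue
        found.add(submit_file)
        for sub_line in submit_file.read_text().splitlines():
            log = LOG_LINE.match(sub_line)
            if log and (node_dir / log.group(1)).is_file():
                found.add(node_dir / log.group(1))
    return sorted(found)


def bundle(dag_path, tar_path):
    """Write every file found for the DAG into a gzip-compressed tar archive."""
    files = find_debug_files(dag_path)
    with tarfile.open(tar_path, "w:gz") as tar:
        for path in files:
            tar.add(path)
    return files
```

A call such as bundle("ihope.dag", "ihope-debug.tar.gz") would then produce a single archive to attach to a ticket; the archive name here is arbitrary.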

===========================================================================
Date mail was appended: Fri Dec 19 10:47:36 2008 (1229705256)
Subject: Actions

Status changed from open to pending by wenger
===========================================================================
Date of actions: Fri Dec 19 13:43:11 2008 (1229715791)
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #18335] LIGO: dagman memory usage
Date: Thu, 22 Oct 2009 12:34:19 -0700
X-Seen-BY: mailfromd 4.1 granite.cs.wisc.edu

Please close this ticket as un-reproducible.

Thanks.

--
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson




===========================================================================
Date mail was appended: Thu Oct 22 14:34:30 2009 (1256240070)
Date: Thu, 22 Oct 2009 15:19:47 -0500 (CDT)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-admin #18335] LIGO: dagman memory usage

Stuart,

> Please close this ticket as un-reproducible.

Okay, thanks for the update.  I'll close the ticket.

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Thu Oct 22 15:19:55 2009 (1256242797)
Subject: Actions

Ticket resolved by wenger
===========================================================================
Date of actions: Thu Oct 22 15:22:17 2009 (1256242937)