LIGO Support Ticket 7741
Ticket Information
Number: support 7741
User: dabrown@physics.syr.edu
Email: anderson__AT__ligo.caltech.edu,carsten.aulbert__AT__aei.mpg.de,henning.fehrmann__AT__aei.mpg.de
Status: resolved
Assigned To: wenger
From: Duncan Brown <dabrown__AT__physics.syr.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
Subject: LIGO: problem with large dagman.out files on atlas
Date: Tue, 28 Oct 2008 18:02:10 -0400
CC: Stuart Anderson <anderson__AT__ligo.caltech.edu>, Carsten Aulbert
<carsten.aulbert__AT__aei.mpg.de>, Henning Fehrmann
<henning.fehrmann__AT__aei.mpg.de>
Hi Kent,
Is condor_dagman currently limited to 32-bit file support? We have
some inspiral users whose dags are constantly starting and then
exiting repeatedly. The messages in the dagman.log file are
001 (5476704.000.000) 10/28 22:45:40 Job executing on host: <10.20.30.2:56331>
...
004 (5476704.000.000) 10/28 22:45:40 Job was evicted.
(0) Job was not checkpointed.
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
...
001 (5476704.000.000) 10/28 22:45:50 Job executing on host: <10.20.30.2:56331>
...
004 (5476704.000.000) 10/28 22:45:50 Job was evicted.
(0) Job was not checkpointed.
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
...
over and over again. In the ScheddLog, I see the messages:
10/28 20:08:28 (pid:2297) Starting add_shadow_birthdate(5476704.0)
10/28 20:08:28 (pid:2297) Called reschedule_negotiator()
10/28 20:08:46 (pid:2297) scheduler universe job (5476704.0) pid 6741 exited with status 44
The only thing I can see that may be causing a problem is that the
dagman.out file is over 2 GB:
[dbrown@h2 injections32]$ ls -l *dag.dagman*
-rw-r--r-- 1 marion atlas 2909132 Oct 28 22:38 injections32.GRB070714_injections32.dag.dagman.log
-rw-r--r-- 1 marion atlas 2147483700 Oct 28 05:52 injections32.GRB070714_injections32.dag.dagman.out
Is this a large file support issue? This is condor_dagman 7.1.1 on the
Atlas cluster. Below is the full job class ad for the dagman process.
Cheers,
Duncan.
[dbrown@h2 condor]$ condor_q -long 5476704.0
-- Submitter: h2.atlas.local : <10.20.30.2:56331> : h2.atlas.local
MyType = "Job"
TargetType = "Machine"
ClusterId = 5476704
QDate = 1225220908
CompletionDate = 0
Owner = "marion"
LocalUserCpu = 0.000000
LocalSysCpu = 0.000000
RemoteUserCpu = 0.000000
RemoteSysCpu = 0.000000
NumCkpts_RAW = 0
NumCkpts = 0
NumRestarts = 0
NumSystemHolds = 0
CommittedTime = 0
TotalSuspensions = 0
CumulativeSuspensionTime = 0
CondorVersion = "$CondorVersion: 7.0.5 Oct 22 2008 $"
CondorPlatform = "$CondorPlatform: X86_64-LINUX_DEBIAN40 $"
RootDir = "/"
Iwd = "/home/marion/analysis/GRB070714/injections32"
JobUniverse = 7
Cmd = "/opt/condor/bin/condor_dagman"
MinHosts = 1
MaxHosts = 1
WantRemoteSyscalls = FALSE
WantCheckpoint = FALSE
JobPrio = 0
User = "marion__AT__atlas.local"
NiceUser = FALSE
Env = "CLASSPATH=/usr/local/ldg-4.5-server/vdt/pegasus/lib/exist.jar:/
usr/local/ldg-4.5-server/vdt/pegasus/lib/preservcsl.jar:/usr/local/
ldg-4.5-server/vdt/pegasus/lib/log4j-1.2.8.jar:/usr/local/ldg-4.5-
server/vdt/pegasus/lib/cog-jglobus.jar:/usr/local/ldg-4.5-server/vdt/
pegasus/lib/pegasus.jar:/usr/local/ldg-4.5-server/vdt/pegasus/lib/
accessors.jar:/usr/local/ldg-4.5-server/vdt/pegasus/lib/jakarta-
oro.jar:/usr/local/ldg-4.5-server/vdt/pegasus/lib/xmlrpc.jar:/usr/
local/ldg-4.5-server/vdt/pegasus/lib/puretls.jar:/usr/local/ldg-4.5-
server/vdt/pegasus/lib/xercesImpl.jar:/usr/local/ldg-4.5-server/vdt/
pegasus/lib/globus_rls_client.jar:/usr/local/ldg-4.5-server/vdt/
pegasus/lib/mysql-connector-java-5.0.5-bin.jar:/usr/local/ldg-4.5-
server/vdt/pegasus/lib/cryptix.jar:/usr/local/ldg-4.5-server/vdt/
pegasus/lib/cryptix32.jar:/usr/local/ldg-4.5-server/vdt/pegasus/lib/
jce-jdk13-125.jar:/usr/local/ldg-4.5-server/vdt/pegasus/lib/
postgresql-8.1dev-400.jdbc3.jar:/usr/local/ldg-4.5-server/vdt/pegasus/
lib/cryptix-asn1.jar:/usr/local/ldg-4.5-server/vdt/pegasus/lib/commons-
pool.jar:/usr/local/ldg-4.5-server/vdt/pegasus/lib/xmldb.jar:/usr/
local/ldg-4.5-server/vdt/pegasus/lib/commons-logging.jar:/usr/local/
ldg-4.5-server/vdt/pegasus/lib/java-getopt-1.0.9.jar:/usr/local/
ldg-4.5-server/vdt/pegasus/lib/resolver.jar:/usr/local/ldg-4.5-server/
vdt/pegasus/lib/xmlParserAPIs.jar:/usr/local/ldg-4.5-server/vdt/
pegasus/lib/exist-optional.jar:/usr/local/ldg-4.5-server/vdt/pegasus/
lib/junit.jar;HOSTTYPE=x86_64-linux;SHLIB_PATH=/usr/local/ldg-4.5-
server/vdt/globus/lib;LSCSOFT_LOCATION=/home/nvf/s5grb/
releases;GLOBUS_OPTIONS=-Xmx512M;PAC_ANCHOR=/usr/local/ldg-4.5-
server;SHLVL=1;PWD=/home/marion/analysis/GRB070714/
injections32;GRID_SECURITY_DIR=/usr/local/ldg-4.5-server/vdt/globus/
etc;VDT_LOCATION=/usr/local/ldg-4.5-server/
vdt;SSH_CLIENT=193.205.72.95 39455 22;
REMOTEHOST=mbta1.virgo.infn.it;VDT_POSTINSTALL_README=/usr/local/
ldg-4.5-server/vdt/post-install/README;PATH=/home/nvf/s5grb/releases/
s5_exttrig_20081003/glue/bin:/home/nvf/s5grb/releases/
s5_exttrig_20081011/pylal/bin:/home/nvf/s5grb/releases/
s5_exttrig_20081003/lalapps/bin:/home/nvf/s5grb/releases/
s5_exttrig_20081011/lal/bin:/opt/condor/bin:/opt/condor/sbin:/usr/
local/ldg-4.5-server/vdt/tcl/bin:/usr/local/ldg-4.5-server/vdt/apache/
bin:/usr/local/ldg-4.5-server/vdt/ant/bin:/usr/local/ldg-4.5-server/
vdt/glite/sbin:/usr/local/ldg-4.5-server/vdt/glite/bin:/usr/local/
ldg-4.5-server/vdt/pegasus/bin:/usr/local/ldg-4.5-server/vdt/pyglobus-
url-copy/bin:/usr/local/ldg-4.5-server/vdt/unixodbc/bin:/usr/local/
ldg-4.5-server/vdt/mysql/bin:/usr/local/ldg-4.5-server/vdt/edg/sbin:/
usr/local/ldg-4.5-server/vdt/jdk1.5/bin:/usr/local/ldg-4.5-server/vdt/
condor/sbin:/usr/local/ldg-4.5-server/vdt/condor/bin:/usr/local/
ldg-4.5-server/vdt/logrotate/sbin:/usr/local/ldg-4.5-server/vdt/gpt/
sbin:/usr/local/ldg-4.5-server/vdt/globus/bin:/usr/local/ldg-4.5-
server/vdt/globus/sbin:/usr/local/ldg-4.5-server/pacman-3.21/bin:/usr/
local/ldg-4.5-server/vdt/vdt/sbin:/usr/local/ldg-4.5-server/vdt/vdt/
bin:/usr/local/ldg-4.5-server/ldg-server/bin:/usr/bin:/bin:/usr/sbin:/
sbin;SASL_PATH=/usr/local/ldg-4.5-server/vdt/globus/lib/
sasl;GLITE_LOCATION_LOG=/usr/local/ldg-4.5-server/vdt/glite/
log;VDT_INSTALL_LOG=vdt-install.log;DYLD_LIBRARY_PATH=/home/nvf/s5grb/
releases/s5_exttrig_20081003/glue/lib/python2.4/site-packages:/home/
nvf/s5grb/releases/s5_exttrig_20081011/pylal/lib/python2.4/site-
packages:/home/nvf/s5grb/releases/s5_exttrig_20081011/lal/lib:/usr/
local/ldg-4.5-server/vdt/globus/lib;GLOBUS_PATH=/usr/local/ldg-4.5-
server/vdt/globus;X509_CERT_DIR=/usr/local/ldg-4.5-server/vdt/globus/
TRUSTED_CA;GLOBUS_TCP_PORT_RANGE=40000,45000;LAL_PREFIX=/home/nvf/
s5grb/releases/s5_exttrig_20081011/lal;VENDOR=unknown;VOMS_USERCONF=/
usr/local/ldg-4.5-server/vdt/glite/etc/
vomses;HOST=h2;
LDG_SOFTWARE_LOCATION=http://www.ldas-sw.ligo.caltech.edu/ldg_dist/ldg4.5/software;
PKG_CONFIG_PATH=/home/nvf/s5grb/releases/s5_exttrig_20081011/lal/lib/pkgconfig;
GLITE_LOCATION_TMP=/usr/local/ldg-4.5-server/vdt/glite/tmp;
GLITE_LOCATION=/usr/local/ldg-4.5-server/vdt/glite;SSH_TTY=/dev/pts/4;
GLITE_LOCATION_VAR=/usr/local/ldg-4.5-server/vdt/glite/var;
LIBPATH=/usr/local/ldg-4.5-server/vdt/globus/lib:/usr/lib:/lib;SHELL=/usr/bin/tcsh;
LDG_INSTALL_LOG=/usr/local/ldg-4.5-server/ldg-server/etc/ldg-install.log;
MACHTYPE=x86_64;LDG_DIRECTORY=/usr/local/ldg-4.5-server/ldg-server;
MAIL=/var/mail/marion;_CONDOR_MAX_DAGMAN_LOG=0;
MANPATH=/home/nvf/s5grb/releases/s5_exttrig_20081003/lalapps/share/man:
/home/nvf/s5grb/releases/s5_exttrig_20081011/lal/share/man:
/usr/local/ldg-4.5-server/vdt/pegasus/man:/usr/local/ldg-4.5-server/vdt/condor/man:
/usr/local/ldg-4.5-server/vdt/globus/man:/usr/local/ldg-4.5-server/vdt/vdt/man:
/usr/local/ldg-4.5-server/vdt/perl/man:/usr/local/ldg-4.5-server/vdt/expat/man:
/usr/local/ldg-4.5-server/vdt/logrotate/man:/usr/local/ldg-4.5-server/vdt/jdk1.5/man:
/usr/local/ldg-4.5-server/vdt/edg/share/man:/usr/local/ldg-4.5-server/vdt/mysql/man:
/usr/local/ldg-4.5-server/vdt/glite/share/man:/usr/local/ldg-4.5-server/vdt/apache/man:
/usr/local/ldg-4.5-server/vdt/tcl/man;
GLUE_PREFIX=/home/nvf/s5grb/releases/s5_exttrig_20081003/glue;
MYSQL_UNIX_PORT=/usr/local/ldg-4.5-server/vdt/vdt-app-data/mysql/var/mysql.sock;
DISPLAY=localhost:16.0;GLUE_LOCATION=/home/nvf/s5grb/releases/s5_exttrig_20081003/glue;
PERL5LIB=/usr/local/ldg-4.5-server/ldg-server/lib:
/usr/local/ldg-4.5-server/vdt/pegasus/lib/perl:/usr/local/ldg-4.5-server/vdt/vdt/lib:
/usr/local/ldg-4.5-server/vdt/perl/lib/5.8.0:
/usr/local/ldg-4.5-server/vdt/perl/lib/5.8.0/x86_64-linux-thread-multi:
/usr/local/ldg-4.5-server/vdt/perl/lib/site_perl/5.8.0:
/usr/local/ldg-4.5-server/vdt/perl/lib/site_perl/5.8.0/x86_64-linux-thread-multi:;
ANT_HOME=/usr/local/ldg-4.5-server/vdt/ant;USER=marion;
SSH_CONNECTION=193.205.72.95 39455 130.75.116.11 22;LD_LIBRARY_PATH=/home/
nvf/s5grb/releases/s5_exttrig_20081003/glue/lib/python2.4/site-
packages:/home/nvf/s5grb/releases/s5_exttrig_20081011/pylal/lib/
python2.4/site-packages:/home/nvf/s5grb/releases/s5_exttrig_20081011/
lal/lib:/usr/local/ldg-4.5-server/vdt/tclglobus/lib:/usr/local/ldg-4.5-
server/vdt/tcl/lib:/usr/local/ldg-4.5-server/vdt/apache/lib:/usr/local/
ldg-4.5-server/vdt/glite/lib:/usr/local/ldg-4.5-server/vdt/myodbc/lib:/
usr/local/ldg-4.5-server/vdt/unixodbc/lib:/usr/local/ldg-4.5-server/
vdt/mysql/lib/mysql:/usr/local/ldg-4.5-server/vdt/jdk1.5/jre/lib/i386:/
usr/local/ldg-4.5-server/vdt/jdk1.5/jre/lib/i386/server:/usr/local/
ldg-4.5-server/vdt/jdk1.5/jre/lib/i386/client:/usr/local/ldg-4.5-
server/vdt/berkeley-db/lib:/usr/local/ldg-4.5-server/vdt/expat/lib:/
usr/local/ldg-4.5-server/vdt/globus/lib;PYTHONPATH=/home/nvf/s5grb/
releases/s5_exttrig_20081003/glue/lib/python2.4/site-packages:/home/
nvf/s5grb/releases/s5_exttrig_20081011/pylal/lib/python2.4/site-
packages:/home/nvf/s5grb/releases/s5_exttrig_20081003/lalapps/lib/
python2.4/site-packages:/usr/local/ldg-4.5-server/ldg-server/lib/
python:/usr/local/ldg-4.5-server/vdt/globus/lib64/python:/usr/local/
ldg-4.5-server/vdt/globus/lib/python;X509_CADIR=/usr/local/ldg-4.5-
server/vdt/globus/TRUSTED_CA;CATALINA_OPTS=
-Dorg.globus.wsrf.container.persistence.dir=/usr/local/ldg-4.5-server/
vdt/vdt-app-data/globus/persisted;ODBCINI=/usr/local/ldg-4.5-server/
vdt/unixodbc/etc/odbc.ini;LAL_LOCATION=/home/nvf/s5grb/releases/
s5_exttrig_20081011/lal;HOME=/home/marion;LOGNAME=marion;EDG_LOCATION=/
usr/local/ldg-4.5-server/vdt/edg;GPT_LOCATION=/usr/local/ldg-4.5-
server/vdt/gpt;GLOBUS_ERROR_VERBOSE=true;GROUP=atlas;JAVA_HOME=/usr/
local/ldg-4.5-server/vdt/jdk1.5;GLOBUS_LOCATION=/usr/local/ldg-4.5-
server/vdt/globus;CONDOR_CONFIG=/opt/condor/etc/
condor_config;PYLAL_PREFIX=/home/marion/s5grb/releases/
s5_exttrig_20081011/pylal;GLOBUS_MYSQL_PATH=/usr/local/ldg-4.5-server/
vdt/mysql;PACMAN_LOCATION=/usr/local/ldg-4.5-server/
pacman-3.21;PEGASUS_HOME=/usr/local/ldg-4.5-server/vdt/
pegasus;X509_VOMS_DIR=/usr/local/ldg-4.5-server/vdt/glite/
vomsdir;CONDOR_LOCATION=/usr/local/ldg-4.5-server/vdt/
condor;PYLAL_LOCATION=/home/nvf/s5grb/releases/s5_exttrig_20081011/
pylal;LDG_LOCATION=/usr/local/ldg-4.5-
server;TERM=vt100;OSTYPE=linux;LALAPPS_LOCATION=/home/nvf/s5grb/
releases/s5_exttrig_20081003/
lalapps;_CONDOR_DAGMAN_LOG=injections32.GRB070714_injections32.dag.dagman.out"
EnvDelim = ";"
JobNotification = 2
WantRemoteIO = FALSE
UserLog = "/home/marion/analysis/GRB070714/injections32/injections32.GRB070714_injections32.dag.dagman.log"
CoreSize = 0
KillSig = "SIGTERM"
RemoveKillSig = "SIGUSR1"
Rank = 0.000000
In = "/dev/null"
TransferIn = FALSE
Out = "injections32.GRB070714_injections32.dag.lib.out"
StreamOut = FALSE
Err = "injections32.GRB070714_injections32.dag.lib.err"
StreamErr = FALSE
BufferSize = 524288
BufferBlockSize = 32768
ShouldTransferFiles = "NO"
TransferFiles = "NEVER"
ImageSize_RAW = 5564
ImageSize = 7500
ExecutableSize_RAW = 5564
ExecutableSize = 7500
DiskUsage_RAW = 5564
DiskUsage = 7500
Requirements = (Arch == "X86_64") && (OpSys == "LINUX") && (Disk >= DiskUsage) && ((Memory * 1024) >= ImageSize)
FileSystemDomain = "atlas.local"
PeriodicHold = FALSE
PeriodicRelease = FALSE
PeriodicRemove = FALSE
OnExitHold = FALSE
OnExitRemove = (ExitSignal =?= 11 || (ExitCode =!= UNDEFINED && ExitCode >= 0 && ExitCode <= 2))
LeaveJobInQueue = FALSE
Arguments = "-f -l . -Debug 3 -Lockfile injections32.GRB070714_injections32.dag.lock -AutoRescue 1 -DoRescueFrom 0 -Condorlog /local/marion/logs/tmp9UFyEE -Dag injections32.GRB070714_injections32.dag -CsdVersion $CondorVersion:' '7.1.1' 'Oct' '22' '2008' '$"
GlobalJobId = "h2.atlas.local#1225220908#5476704.0"
ProcId = 0
JobStartDate = 1225220908
AutoClusterId = 3
AutoClusterAttrs = "JobUniverse,LastCheckpointPlatform,NumCkpts,DiskUsage,ImageSize,Requirements,NiceUser"
OrigMaxHosts = 1
NumJobStarts = 1680
JobLastStartDate = 1225229826
JobCurrentStartDate = 1225229864
JobRunCount = 1680
ExitStatus = 44
ExitBySignal = FALSE
ExitCode = 44
JobStatus = 1
EnteredCurrentStatus = 1225229864
LastSuspensionTime = 0
RemoteWallClockTime = 99.000000
LastRemoteHost = ""
LastPublicClaimId = ""
LastPublicClaimIds = ""
CurrentHosts = 0
ServerTime = 1225229890
--
Duncan Brown Room 263-1, Department of Physics,
Assistant Professor of Physics Syracuse University, NY 13244, USA
Phone: (315) 443 5993 http://www.gravity.phy.syr.edu/~duncan
===========================================================================
Date of creation: Tue Oct 28 17:02:53 2008 (1225231375)
Subject: Actions
Assigned to wenger by gthain
===========================================================================
Date of actions: Wed Oct 29 8:08:00 2008 (1225285680)
Date: Wed, 29 Oct 2008 09:03:00 -0500 (CDT)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: gthain <condor-support__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #7741] LIGO: problem with large dagman.out
files on atlas
Duncan,
> Is condor_dagman currently limited to 32-bit file support? We have
> some inspiral users whose dags are constantly starting and then
> exiting repeatedly. The messages in the dagman.log file are
>
> 001 (5476704.000.000) 10/28 22:45:40 Job executing on host: <10.20.30.2:56331>
> ...
> 004 (5476704.000.000) 10/28 22:45:40 Job was evicted.
> (0) Job was not checkpointed.
> Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
> Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
> 0 - Run Bytes Sent By Job
> 0 - Run Bytes Received By Job
> ...
> 001 (5476704.000.000) 10/28 22:45:50 Job executing on host: <10.20.30.2:56331>
> ...
> 004 (5476704.000.000) 10/28 22:45:50 Job was evicted.
> (0) Job was not checkpointed.
> Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
> Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
> 0 - Run Bytes Sent By Job
> 0 - Run Bytes Received By Job
> ...
>
> over and over again. In the ScheddLog, I see the messages:
>
> 10/28 20:08:28 (pid:2297) Starting add_shadow_birthdate(5476704.0)
> 10/28 20:08:28 (pid:2297) Called reschedule_negotiator()
> 10/28 20:08:46 (pid:2297) scheduler universe job (5476704.0) pid 6741 exited with status 44
>
> The only thing I can see that may be causing a problem is that the
> dagman.out file is over 2 GB:
>
> [dbrown@h2 injections32]$ ls -l *dag.dagman*
> -rw-r--r-- 1 marion atlas 2909132 Oct 28 22:38 injections32.GRB070714_injections32.dag.dagman.log
> -rw-r--r-- 1 marion atlas 2147483700 Oct 28 05:52 injections32.GRB070714_injections32.dag.dagman.out
>
> Is this a large file support issue? This is condor_dagman 7.1.1 on the
> Atlas cluster. Below is the full job class ad for the dagman process.
I'll bet you're right about it being a large file problem. Exit code 44
means a problem from dprintf, which is what prints to the dagman.out file.
And it looks like the dprintf code does not support large files.
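For reference, the dagman.out size in the listing above is just past the largest offset a signed 32-bit integer can hold (2^31 - 1 = 2147483647 bytes), which is consistent with a large-file problem. A minimal Python sketch of that check follows; the function name is made up for illustration and is not part of Condor.

```python
# Largest file offset representable in a signed 32-bit integer.
INT32_MAX = 2**31 - 1  # 2147483647


def exceeds_32bit_offset(size_bytes: int) -> bool:
    """Return True if a file of this size cannot be addressed with a signed 32-bit offset."""
    return size_bytes > INT32_MAX


# Sizes taken from the `ls -l` output in this ticket.
dagman_out_size = 2147483700
dagman_log_size = 2909132

print(exceeds_32bit_offset(dagman_out_size))  # True: 53 bytes past the limit
print(exceeds_32bit_offset(dagman_log_size))  # False
```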
There are two things to do:
1. I'm going to put in a request for the dprintf code to support
large files.
2. As an immediate workaround, you can reduce the amount of output
going to the dagman.out file by giving a lower debug level to DAGMan.
(This is done with the -debug flag in condor_submit_dag. The default
level is 3 -- you might want to check what you're at now, and go down
at least one level from that. DAGMan echoes all of the command-line
arguments at startup, so it should be pretty easy to see what you've
been running with.)
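A rough Python sketch of that workaround: it reads the current level from the -Debug argument that DAGMan echoes at startup and builds a condor_submit_dag command line one level lower. The helper names are hypothetical; only the -debug flag of condor_submit_dag, the -Debug argument, and the default level of 3 come from this ticket.

```python
import re
from typing import List, Optional


def current_debug_level(arguments: str) -> Optional[int]:
    """Extract the value following -Debug from a DAGMan argument string, if present."""
    match = re.search(r"-debug\s+(\d+)", arguments, flags=re.IGNORECASE)
    return int(match.group(1)) if match else None


def lowered_submit_command(dag_file: str, arguments: str, default_level: int = 3) -> List[str]:
    """Build a condor_submit_dag command with the debug level reduced by one (never below 0)."""
    level = current_debug_level(arguments)
    if level is None:
        level = default_level  # the ticket states that the default level is 3
    return ["condor_submit_dag", "-debug", str(max(level - 1, 0)), dag_file]


# Argument string abbreviated from the job class ad in this ticket.
args = "-f -l . -Debug 3 -Lockfile injections32.GRB070714_injections32.dag.lock"
print(lowered_submit_command("injections32.GRB070714_injections32.dag", args))
```

The command is returned as a list so it can be inspected before anything is run.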
Kent Wenger
Condor Team
===========================================================================
Date mail was appended: Wed Oct 29 9:03:51 2008 (1225289032)
CC: anderson__AT__ligo.caltech.edu, carsten.aulbert__AT__aei.mpg.de,
henning.fehrmann__AT__aei.mpg.de
From: Duncan Brown <dabrown__AT__physics.syr.edu>
To: condor-support__AT__cs.wisc.edu
Subject: Re: [condor-support #7741] LIGO: problem with large dagman.out
files on atlas
Date: Wed, 29 Oct 2008 11:30:12 -0400
Hi Kent,
On Oct 29, 2008, at 10:03 AM, condor-support response tracking system
wrote:
> 2. As an immediate workaround, you can reduce the amount of output
> going to the dagman.out file by giving a lower debug level to DAGMan.
> (This is done with the -debug flag in condor_submit_dag. The default
> level is 3 -- you might want to check what you're at now, and go down
> at least one level from that. DAGMan echos all of the command-line
> arguments at startup, so it should be pretty easy to see what you've
> been running with.)
Thanks. I did a little digging to try and figure out why the log file
grew to over 2 GB (I've never seen this before). It turns out that the
condor job log file was corrupted (I've attached it so you can take a
look at it). This was causing the dagman process to keep looping and
looping until the dagman.out file hit 2 GB.
The interesting parts of dagman.log are below. The dagman process
would start up, parse the log file, fail with the error
10/29 14:37:11 ERROR "Assertion ERROR on (node->GetThrottleInfo()->_currentJobs >= 0)" at line 3086 in file dag.C
and then restart from scratch. Any idea why this error caused a hard
abort without writing a rescue dag?
Cheers,
Duncan.
10/29 14:32:57 ** condor_scheduniv_exec.5476704.0 (CONDOR_DAGMAN) STARTING UP
10/29 14:32:57 ** /scr/opt/condor-7.0.5/bin/condor_dagman
10/29 14:32:57 ** $CondorVersion: 7.1.1 Oct 22 2008 $
10/29 14:32:57 ** $CondorPlatform: X86_64-LINUX_DEBIAN40 $
10/29 14:32:57 ** PID = 15960
10/29 14:32:57 ** Log last touched 10/29 14:32:48
10/29 14:32:57 ******************************************************
10/29 14:32:57 Using config source: /opt/condor/etc/condor_config
10/29 14:32:57 Using local config sources:
10/29 14:32:57 /etc/default/condor|
10/29 14:32:57 DaemonCore: Command Socket at <10.20.30.2:47770>
10/29 14:32:57 DAGMAN_DEBUG_CACHE_SIZE setting: 5242880
10/29 14:32:57 DAGMAN_DEBUG_CACHE_ENABLE setting: False
10/29 14:32:57 DAGMAN_SUBMIT_DELAY setting: 0
10/29 14:32:57 DAGMAN_MAX_SUBMIT_ATTEMPTS setting: 6
10/29 14:32:57 DAGMAN_STARTUP_CYCLE_DETECT setting: 0
10/29 14:32:57 DAGMAN_MAX_SUBMITS_PER_INTERVAL setting: 5
10/29 14:32:57 allow_events (DAGMAN_IGNORE_DUPLICATE_JOB_EXECUTION, DAGMAN_ALLOW_EVENTS) setting: 114
10/29 14:32:57 DAGMAN_RETRY_SUBMIT_FIRST setting: 1
10/29 14:32:57 DAGMAN_RETRY_NODE_FIRST setting: 0
10/29 14:32:57 DAGMAN_MAX_JOBS_IDLE setting: 0
10/29 14:32:57 DAGMAN_MAX_JOBS_SUBMITTED setting: 0
10/29 14:32:57 DAGMAN_MUNGE_NODE_NAMES setting: 1
10/29 14:32:57 DAGMAN_DELETE_OLD_LOGS setting: 1
10/29 14:32:57 DAGMAN_PROHIBIT_MULTI_JOBS setting: 0
10/29 14:32:57 DAGMAN_SUBMIT_DEPTH_FIRST setting: 0
10/29 14:32:57 DAGMAN_ABORT_DUPLICATES setting: 1
10/29 14:32:57 DAGMAN_ABORT_ON_SCARY_SUBMIT setting: 1
10/29 14:32:57 DAGMAN_PENDING_REPORT_INTERVAL setting: 600
10/29 14:32:58 DAGMAN_AUTO_RESCUE setting: 1
10/29 14:32:58 DAGMAN_MAX_RESCUE_NUM setting: 100
10/29 14:32:58 argv[0] == "condor_scheduniv_exec.5476704.0"
10/29 14:32:58 argv[1] == "-Debug"
10/29 14:32:58 argv[2] == "3"
10/29 14:32:58 argv[3] == "-Lockfile"
10/29 14:32:58 argv[4] == "injections32.GRB070714_injections32.dag.lock"
10/29 14:32:58 argv[5] == "-AutoRescue"
10/29 14:32:58 argv[6] == "1"
10/29 14:32:58 argv[7] == "-DoRescueFrom"
10/29 14:32:58 argv[8] == "0"
10/29 14:32:58 argv[9] == "-Condorlog"
10/29 14:32:58 argv[10] == "/local/marion/logs/tmp9UFyEE"
10/29 14:32:58 argv[11] == "-Dag"
10/29 14:32:58 argv[12] == "injections32.GRB070714_injections32.dag"
10/29 14:32:58 argv[13] == "-CsdVersion"
10/29 14:32:58 argv[14] == "$CondorVersion: 7.1.1 Oct 22 2008 $"
10/29 14:32:58 DAG Lockfile will be written to injections32.GRB070714_injections32.dag.lock
10/29 14:32:58 DAG Input file is injections32.GRB070714_injections32.dag
10/29 14:32:58 Found rescue DAG number 1; running injections32.GRB070714_injections32.dag.rescue001 instead of normal DAG file
10/29 14:32:58 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
10/29 14:32:58 RUNNING RESCUE DAG injections32.GRB070714_injections32.dag.rescue001
10/29 14:33:08 All DAG node user log files:
10/29 14:33:09 /local/marion/logs/tmp9UFyEE (Condor)
10/29 14:33:09 Parsing 1 dagfiles
10/29 14:33:09 Parsing injections32.GRB070714_injections32.dag.rescue001 ...
10/29 14:33:19 Dag contains 41080 total jobs
10/29 14:33:19 Lock file injections32.GRB070714_injections32.dag.lock detected,
10/29 14:33:19 Duplicate DAGMan PID 26629 is no longer alive; this DAGMan should continue.
10/29 14:33:19 Sleeping for 12 seconds to ensure ProcessId uniqueness
10/29 14:33:31 Bootstrapping...
10/29 14:33:31 Number of pre-completed nodes: 0
10/29 14:33:31 Running in RECOVERY mode...
10/29 14:33:31 Event: ULOG_SUBMIT for Condor Node 44cb65cac5d75cc7e78ea1700c48364d (4930833.0)
10/29 14:33:31 Number of idle job procs: 1
10/29 14:37:09 ReadMultipleUserLogs: read error on log /local/marion/logs/tmp9UFyEE
10/29 14:37:09 ERROR: failure to read job log
A log event may be corrupt. DAGMan will skip the event and try to
continue, but information may have been lost. If DAGMan exits
unfinished, but reports no failed jobs, re-submit the rescue file
to complete the DAG.
10/29 14:37:09 ------------------------------
10/29 14:37:09 Condor Recovery Complete
10/29 14:37:09 ------------------------------
10/29 14:37:11 Number of idle job procs: 7
10/29 14:37:11 Event: ULOG_JOB_TERMINATED for Condor Node eba560d794eaa321e8cad38f03789131 (5054131.0)
10/29 14:37:11 Node eba560d794eaa321e8cad38f03789131 job proc (5054131.0) completed successfully.
10/29 14:37:11 Node eba560d794eaa321e8cad38f03789131 job completed
10/29 14:37:11 Number of idle job procs: 7
10/29 14:37:11 Event: ULOG_EXECUTE for Condor Node 1547594aceefe6970715450e8542d8d9 (5054137.0)
10/29 14:37:11 Number of idle job procs: 6
10/29 14:37:11 Event: ULOG_SUBMIT for Condor Node a1ed5e58b8b3243ae08923e89df12355 (5054196.0)
10/29 14:37:11 Unexpected submit event (for job "a1ed5e58b8b3243ae08923e89df12355") found in log; job "ec8f413180c5f0246e07365d3884b469" was expected.
10/29 14:37:11 Number of idle job procs: 7
10/29 14:37:11 Event: ULOG_IMAGE_SIZE for Condor Node 53040d865e5fc64d2c694d032c95a822 (5039116.0)
10/29 14:37:11 Event: ULOG_IMAGE_SIZE for Condor Node 0d23fa62038dcfd5007e550d797e2a9f (5035633.0)
10/29 14:37:11 Event: ULOG_JOB_TERMINATED for Condor Node 1547594aceefe6970715450e8542d8d9 (5054137.0)
10/29 14:37:11 ERROR "Assertion ERROR on (node->GetThrottleInfo()->_currentJobs >= 0)" at line 3086 in file dag.C
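To confirm the restart loop from the dagman.out file itself, a small script can count startup banners and assertion failures. This is only a sketch: it assumes the line formats shown in the excerpt above (a banner containing "STARTING UP" and a failure line containing `ERROR "Assertion ERROR`), and the function name is made up.

```python
from typing import Dict, Iterable


def summarize_dagman_out(lines: Iterable[str]) -> Dict[str, int]:
    """Count DAGMan startups and assertion failures in dagman.out text."""
    counts = {"startups": 0, "assertions": 0}
    for line in lines:
        if "STARTING UP" in line:
            counts["startups"] += 1
        if 'ERROR "Assertion ERROR' in line:
            counts["assertions"] += 1
    return counts


# Sample lines copied from the excerpt in this ticket.
sample = [
    "10/29 14:32:57 ** condor_scheduniv_exec.5476704.0 (CONDOR_DAGMAN) STARTING UP",
    "10/29 14:33:31 Running in RECOVERY mode...",
    '10/29 14:37:11 ERROR "Assertion ERROR on (node->GetThrottleInfo()->_currentJobs >= 0)" at line 3086 in file dag.C',
]
print(summarize_dagman_out(sample))
```

For the real file, passing an open file object (for example `open(path, errors="replace")`) streams it line by line, which matters for a log of roughly 2 GB.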
--
Duncan Brown Room 263-1, Department of Physics,
Assistant Professor of Physics Syracuse University, NY 13244, USA
Phone: (315) 443 5993 http://www.gravity.phy.syr.edu/~duncan
===========================================================================
Date mail was appended: Wed Oct 29 10:30:21 2008 (1225294221)
Date: Wed, 29 Oct 2008 15:24:58 -0500 (CDT)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #7741] LIGO: problem with large dagman.out
files on atlas
Duncan,
> Thanks. I did a little digging to try and figure out why the log file
> grew to over 2Gb (I've never seen this before). It turns out that the
> condor job log file was corrupted (I've attached it so you can take a
> look at it). This was causing the dagman process to keep looping and
> looping until the dagman.out file hit 2Gb.
Hmm, I didn't get an attachment on this email, but it might be a problem
with feeding it through RUST. Maybe you can stick the log somewhere
I can grab it via ftp or http?
> The interesting parts of dagman.log are below. The dagman process
> would start up, parse the log file, fail with the error
>
> 10/29 14:37:11 ERROR "Assertion ERROR on (node->GetThrottleInfo()->_currentJobs >= 0)" at line 3086 in file dag.C
>
> and then restart from scratch. Any idea why this error caused a hard
> abort without writing a rescue dag?
Yeah, an assert() bypasses the rescue DAG writing. That may not be
a good idea, but you're only supposed to assert when something really
drastic has gone wrong.
So I guess there are several questions:
1. What happened to goof up the job log?
2. Is there any way DAGMan can deal with this better?
3. Should an assert bypass creating a rescue DAG?
Anyhow, if I can get a look at the log file I can at least think about
#1 and #2.
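As a rough illustration of question 2, the sketch below replays a list of job-log events and reports the ones that do not fit the history seen so far, instead of aborting on the first one. It is only a hypothetical sketch, not how dag.C tracks jobs: the function name, the (event, job id) tuple layout, and the per-job counter are assumptions, and ULOG_EXECUTE is an assumed event name (the other event names appear in the dagman.out excerpts in this ticket).

```python
from collections import Counter

# Hypothetical sketch: detect events that would drive a "running jobs"
# counter negative, which is the kind of inconsistency a corrupted job
# log could produce. This is NOT DAGMan's actual bookkeeping.
def find_inconsistent_events(events):
    """events: iterable of (event_name, job_id) pairs in log order.

    Returns a list of (index, event_name, job_id) for every evict or
    terminate event that has no matching earlier execute event.
    """
    running = Counter()   # job_id -> number of outstanding executions
    problems = []
    for index, (name, job_id) in enumerate(events):
        if name == "ULOG_EXECUTE":  # assumed event name
            running[job_id] += 1
        elif name in ("ULOG_JOB_EVICTED", "ULOG_JOB_TERMINATED"):
            if running[job_id] <= 0:
                # Record the problem instead of asserting, so the caller
                # can still save its state before giving up.
                problems.append((index, name, job_id))
            else:
                running[job_id] -= 1
    return problems
```

A caller could write out whatever recovery state it has whenever the returned list is non-empty, rather than exiting immediately.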
Kent Wenger
Condor Team
===========================================================================
Date mail was appended: Wed Oct 29 15:25:58 2008 (1225311959)
CC: anderson__AT__ligo.caltech.edu, carsten.aulbert__AT__aei.mpg.de,
henning.fehrmann__AT__aei.mpg.de
From: Duncan Brown <dabrown__AT__physics.syr.edu>
To: condor-support__AT__cs.wisc.edu
Subject: Re: [condor-support #7741] LIGO: problem with large dagman.out
files on atlas
Date: Fri, 31 Oct 2008 09:25:16 -0400
--Apple-Mail-13-191697354
Hi Kent,
On Oct 29, 2008, at 4:25 PM, condor-support response tracking system
wrote:
>> Thanks. I did a little digging to try and figure out why the log file
>> grew to over 2Gb (I've never seen this before). It turns out that the
>> condor job log file was corrupted (I've attached it so you can take a
>> look at it). This was causing the dagman process to keep looping and
>> looping until the dagman.out file hit 2Gb.
>
> Hmm, I didn't get an attachment on this email, but it might be a problem
> with feeding it through RUST. Maybe you can stick the log somewhere
> I can grab it via ftp or http?
It's posted at
http://www.gravity.phy.syr.edu/~duncan/computing/tmp9UFyEE.backup.gz
>> The interesting parts of dagman.log are below. The dagman process
>> would start up, parse the log file, fail with the error
>>
>> 10/29 14:37:11 ERROR "Assertion ERROR on (node->GetThrottleInfo()->_currentJobs >= 0)" at line 3086 in file dag.C
>>
>> and then restart from scratch. Any idea why this error caused a hard
>> abort without writing a rescue dag?
>
> Yeah, an assert() bypasses the rescue DAG writing. That may not be
> a good idea, but you're only supposed to assert when something really
> drastic has gone wrong.
Another user has seen a similar issue, but with a different assert
statement ending the dag. His assert is
==> injections32.GRB070810B_injections32.dag.dagman.out <==
10/31 11:17:19 Event: ULOG_JOB_EVICTED for Condor Node 355746515a5e8cd8f1b888b24ab466df (5552354.0)
10/31 11:17:19 Number of idle job procs: 61
10/31 11:17:19 Event: ULOG_CHECKPOINTED for Condor Node 17848e49d66ef7faac54fcea6f345bf4 (5558822.0)
10/31 11:17:19 Event: ULOG_JOB_TERMINATED for Condor Node 4b71760ebc01b8af61a53324cf6f285b (5541931.0)
10/31 11:17:19 Node 4b71760ebc01b8af61a53324cf6f285b job proc (5541931.0) completed successfully.
10/31 11:17:19 Node 4b71760ebc01b8af61a53324cf6f285b job completed
10/31 11:17:19 Number of idle job procs: 61
10/31 11:17:19 Event: ULOG_SUBMIT for Condor Node 64528b8cf6a494795e1657df4e2b80f1 (5568858.0)
10/31 11:17:19 Number of idle job procs: 62
10/31 11:17:19 ERROR "Assertion ERROR on (tmpNode == node)" at line 2699 in file dag.C
I've asked him for his log.
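For anyone triaging similar reports, a small script can pull the assertion failures out of a dagman.out file. The sketch below is only an assumption-laden example: it assumes each entry sits on a single line in the real file (the excerpts in this ticket were wrapped by the mail client) and that the message follows the exact wording quoted above. The function name and the regular expression are illustrative, not part of Condor.

```python
import re

# Matches lines such as:
# 10/31 11:17:19 ERROR "Assertion ERROR on (tmpNode == node)" at line 2699 in file dag.C
ASSERT_RE = re.compile(
    r'^(\d\d/\d\d \d\d:\d\d:\d\d) ERROR "Assertion ERROR on \((.*)\)" '
    r'at line (\d+) in file (\S+)'
)

def find_assertion_failures(lines):
    """Return (timestamp, condition, line_number, source_file) per match."""
    failures = []
    for line in lines:
        match = ASSERT_RE.match(line)
        if match:
            timestamp, condition, line_no, source_file = match.groups()
            failures.append((timestamp, condition, int(line_no), source_file))
    return failures
```

Seeing the same condition repeated many times in one dagman.out would be consistent with the restart loop described earlier in the ticket.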
> So I guess there are several questions:
>
> 1. What happened to goof up the job log?
Good question. I'm not sure. We have multiple jobs writing to the same
log file, but it's on a non-NFS filesystem.
> 2. Is there any way DAGMan can deal with this better?
>
> 3. Should an assert bypass creating a rescue DAG?
Cheers,
Duncan.
> Anyhow, if I can get a look at the log file I can at least think about
> #1 and #2.
>
> Kent Wenger
> Condor Team
>
>
> ========================================
> MESSAGE INFORMATION
> ========================================
> * From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
> * Ticket Email List: dabrown__AT__physics.syr.edu, anderson__AT__ligo.caltech.edu
> ,carsten.aulbert__AT__aei.mpg.de,henning.fehrmann__AT__aei.mpg.de
--
Duncan Brown Room 263-1, Department of Physics,
Assistant Professor of Physics Syracuse University, NY 13244, USA
Phone: (315) 443 5993 http://www.gravity.phy.syr.edu/~duncan
--Apple-Mail-13-191697354
--Apple-Mail-13-191697354--
===========================================================================
Date mail was appended: Fri Oct 31 8:25:24 2008 (1225459525)
Date: Fri, 7 Nov 2008 14:59:36 -0600 (CST)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: gthain <condor-support__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #7741] LIGO: problem with large dagman.out
files on atlas
Duncan,
> Is condor_dagman currently limited to 32 byte file support? We have
> some inspiral users who's dags are constantly starting and then
> exiting repeatedly. The messages in the dagman.log file are
As per today's discussion, I've put this higher on my priority list,
and created a corresponding PR (953).
Kent Wenger
Condor Team
===========================================================================
Date mail was appended: Fri Nov 7 15:00:15 2008 (1226091616)
CC: condor-support__AT__cs.wisc.edu, anderson__AT__ligo.caltech.edu,
carsten.aulbert__AT__aei.mpg.de, henning.fehrmann__AT__aei.mpg.de
From: Duncan Brown <dabrown__AT__physics.syr.edu>
To: Duncan Brown <dabrown__AT__physics.syr.edu>
Subject: Re: [condor-support #7741] LIGO: problem with large dagman.out
files on atlas
Date: Sun, 9 Nov 2008 18:05:34 -0500
--Apple-Mail-9-1004115191
Hi Kent,
On Oct 31, 2008, at 9:25 AM, Duncan Brown wrote:
> Another user has seen a similar issue, but with a different assert
> statement ending the dag. His assert is
>
> ==> injections32.GRB070810B_injections32.dag.dagman.out <==
> 10/31 11:17:19 Event: ULOG_JOB_EVICTED for Condor Node 355746515a5e8cd8f1b888b24ab466df (5552354.0)
> 10/31 11:17:19 Number of idle job procs: 61
> 10/31 11:17:19 Event: ULOG_CHECKPOINTED for Condor Node 17848e49d66ef7faac54fcea6f345bf4 (5558822.0)
> 10/31 11:17:19 Event: ULOG_JOB_TERMINATED for Condor Node 4b71760ebc01b8af61a53324cf6f285b (5541931.0)
> 10/31 11:17:19 Node 4b71760ebc01b8af61a53324cf6f285b job proc (5541931.0) completed successfully.
> 10/31 11:17:19 Node 4b71760ebc01b8af61a53324cf6f285b job completed
> 10/31 11:17:19 Number of idle job procs: 61
> 10/31 11:17:19 Event: ULOG_SUBMIT for Condor Node 64528b8cf6a494795e1657df4e2b80f1 (5568858.0)
> 10/31 11:17:19 Number of idle job procs: 62
> 10/31 11:17:19 ERROR "Assertion ERROR on (tmpNode == node)" at line 2699 in file dag.C
>
> I've asked him for his log.
The second corrupted log file is posted at:
<http://www.gravity.phy.syr.edu/~duncan/computing/tmpQLmReP.gz>
If you can figure out what's causing the underlying problem here, that
would be great.
Cheers,
Duncan.
--
Duncan Brown Room 263-1, Department of Physics,
Assistant Professor of Physics Syracuse University, NY 13244, USA
Phone: (315) 443 5993 http://www.gravity.phy.syr.edu/~duncan
--Apple-Mail-9-1004115191
--Apple-Mail-9-1004115191--
===========================================================================
Date mail was appended: Sun Nov 9 17:05:44 2008 (1226271945)
CC: anderson__AT__ligo.caltech.edu, carsten.aulbert__AT__aei.mpg.de,
henning.fehrmann__AT__aei.mpg.de
From: Duncan Brown <dabrown__AT__physics.syr.edu>
To: condor-support__AT__cs.wisc.edu
Subject: Re: [condor-support #7741] LIGO: problem with large dagman.out
files on atlas
Date: Sun, 9 Nov 2008 18:22:48 -0500
--Apple-Mail-14-1005149405
Great, thanks.
Cheers,
Duncan.
On Nov 7, 2008, at 4:00 PM, condor-support response tracking system
wrote:
> Duncan,
>
>> Is condor_dagman currently limited to 32 byte file support? We have
>> some inspiral users who's dags are constantly starting and then
>> exiting repeatedly. The messages in the dagman.log file are
>
> As per today's discussion, I've put this higher on my priority list,
> and created a corresponding PR (953).
>
> Kent Wenger
> Condor Team
--
Duncan Brown Room 263-1, Department of Physics,
Assistant Professor of Physics Syracuse University, NY 13244, USA
Phone: (315) 443 5993 http://www.gravity.phy.syr.edu/~duncan
--Apple-Mail-14-1005149405
--Apple-Mail-14-1005149405--
===========================================================================
Date mail was appended: Sun Nov 9 17:22:55 2008 (1226272976)
Date: Fri, 30 Jan 2009 10:55:35 -0600 (CST)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: gthain <condor-support__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #7741] LIGO: problem with large dagman.out
files on atlas
Duncan,
> Is condor_dagman currently limited to 32 byte file support?
In the most recent LIGO conference, Todd said he thought this was fixed.
But I just tried it out with 7.2.1 pre-release DAGMan, and it's not. So
I'll have to look into this some more.
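For context on the 2 GB figure in this ticket: a signed 32-bit file offset can address at most 2**31 - 1 bytes, which is presumably why the trouble appeared once dagman.out reached that size. The sketch below is only an illustration with hypothetical helper names; it checks a size, or a file on disk, against that ceiling and says nothing about how DAGMan itself opens its files.

```python
import os

# Largest value a signed 32-bit file offset can hold (2 GiB minus one byte).
MAX_SIGNED_32BIT_OFFSET = 2**31 - 1

def exceeds_32bit_offset(size_in_bytes):
    """True if a file of this size cannot be addressed with a signed 32-bit offset."""
    return size_in_bytes > MAX_SIGNED_32BIT_OFFSET

def file_exceeds_32bit_offset(path):
    """Same check for a file on disk (path is whatever log you want to inspect)."""
    return exceeds_32bit_offset(os.path.getsize(path))
```

Running the file check periodically against a growing dagman.out would give some warning before the ceiling is reached.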
Kent Wenger
Condor Team
===========================================================================
Date mail was appended: Fri Jan 30 10:55:48 2009 (1233334548)
CC: anderson__AT__ligo.caltech.edu, carsten.aulbert__AT__aei.mpg.de,
henning.fehrmann__AT__aei.mpg.de
From: Duncan Brown <dabrown__AT__physics.syr.edu>
To: condor-support__AT__cs.wisc.edu
Subject: Re: [condor-support #7741] LIGO: problem with large dagman.out
files on atlas
Date: Fri, 30 Jan 2009 12:05:11 -0500
--Apple-Mail-14--522642898
Hi Kent,
Thanks for looking into this. It's not the most urgent thing, but it
would be useful.
Cheers,
Duncan.
On Jan 30, 2009, at 11:55 AM, condor-support response tracking system
wrote:
> Duncan,
>
>> Is condor_dagman currently limited to 32 byte file support?
>
> In the most recent LIGO conference, Todd said he thought this was fixed.
> But I just tried it out with 7.2.1 pre-release DAGMan, and it's not. So
> I'll have to look into this some more.
>
> Kent Wenger
> Condor Team
--
Duncan Brown Room 263-1, Department of Physics,
Assistant Professor of Physics Syracuse University, NY 13244, USA
Phone: (315) 443 5993 http://www.gravity.phy.syr.edu/~duncan
--Apple-Mail-14--522642898
--Apple-Mail-14--522642898--
===========================================================================
Date mail was appended: Fri Jan 30 11:05:21 2009 (1233335122)
Date: Tue, 24 Feb 2009 17:26:45 -0600 (CST)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: gthain <condor-support__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #7741] LIGO: problem with large dagman.out
files on atlas
Duncan,
> Is condor_dagman currently limited to 32 byte file support? We have
> some inspiral users who's dags are constantly starting and then
> exiting repeatedly. The messages in the dagman.log file are
I finally got this fixed! The fix will be in 7.2.2 and 7.3.1 (I think it
just missed 7.3.0). (The fix wasn't that hard, it was just a matter of
this getting to the top of my priorities.)
If you want, I can give you pre-release binaries. I'd like to hold off at
least a couple of days or so on that, though, because I'm hoping to also
deal with condor-admin #18817 (DAGMan core dump possibly related to
splicing) this week. So, if possible, I'd like to give you both fixes in
one shot.
Kent Wenger
Condor Team
===========================================================================
Date mail was appended: Tue Feb 24 17:27:03 2009 (1235518024)
CC: dabrown__AT__physics.syr.edu, carsten.aulbert__AT__aei.mpg.de,
henning.fehrmann__AT__aei.mpg.de
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu
Subject: Re: [condor-support #7741] LIGO: problem with large dagman.out
files on atlas
Date: Tue, 24 Feb 2009 15:54:52 -0800
X-Seen-BY: mailfromd 4.1 granite.cs.wisc.edu
On Feb 24, 2009, at 3:27 PM, condor-support response tracking system
wrote:
> Duncan,
>
>> Is condor_dagman currently limited to 32 byte file support? We have
>> some inspiral users who's dags are constantly starting and then
>> exiting repeatedly. The messages in the dagman.log file are
>
> I finally got this fixed! The fix will be in 7.2.2 and 7.3.1 (I think it
> just missed 7.3.0). (The fix wasn't that hard, it was just a matter of
> this getting to the top of my priorities.)
>
> If you want, I can give you pre-release binaries. I'd like to hold off at
> least a couple of days or so on that, though, because I'm hoping to also
> deal with condor-admin #18817 (DAGMan core dump possibly related to
> splicing) this week. So, if possible, I'd like to give you both fixes in
> one shot.
Fantastic. We can wait for a pre-release until you have a chance to
look into the splicing issue as well.
Thanks.
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date mail was appended: Tue Feb 24 17:55:07 2009 (1235519707)
CC: condor-support__AT__cs.wisc.edu, carsten.aulbert__AT__aei.mpg.de,
henning.fehrmann__AT__aei.mpg.de
From: Duncan Brown <dabrown__AT__physics.syr.edu>
To: Stuart Anderson <anderson__AT__ligo.caltech.edu>
Subject: Re: [condor-support #7741] LIGO: problem with large dagman.out
files on atlas
Date: Wed, 25 Feb 2009 10:10:15 -0500
Hi Kent,
That's great. No rush for the binaries, we'll wait for both fixes.
Cheers,
Duncan.
On Feb 24, 2009, at 6:54 PM, Stuart Anderson wrote:
>
> On Feb 24, 2009, at 3:27 PM, condor-support response tracking system
> wrote:
>
>> Duncan,
>>
>>> Is condor_dagman currently limited to 32 byte file support? We have
>>> some inspiral users who's dags are constantly starting and then
>>> exiting repeatedly. The messages in the dagman.log file are
>>
>> I finally got this fixed! The fix will be in 7.2.2 and 7.3.1 (I think it
>> just missed 7.3.0). (The fix wasn't that hard, it was just a matter of
>> this getting to the top of my priorities.)
>>
>> If you want, I can give you pre-release binaries. I'd like to hold off at
>> least a couple of days or so on that, though, because I'm hoping to also
>> deal with condor-admin #18817 (DAGMan core dump possibly related to
>> splicing) this week. So, if possible, I'd like to give you both fixes in
>> one shot.
>
> Fantastic. We can wait for a pre-release until you have a chance to
> look into the splicing issue as well.
>
> Thanks.
>
>
> --
> Stuart Anderson anderson__AT__ligo.caltech.edu
> http://www.ligo.caltech.edu/~anderson
>
>
--
Duncan Brown Room 263-1, Department of Physics,
Assistant Professor of Physics Syracuse University, NY 13244, USA
Phone: (315) 443 5993 http://www.gravity.phy.syr.edu/~duncan
===========================================================================
Date mail was appended: Wed Feb 25 9:10:35 2009 (1235574638)
Date: Wed, 11 Mar 2009 14:38:44 -0500 (CDT)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #7741] LIGO: problem with large dagman.out
files on atlas
Duncan,
> That's great. No rush for the binaries, we'll wait for both fixes.
Now that I've given you guys binaries with the fix for this problem, I'm
going to resolve the ticket. You can always re-open it if somehow the
problem is not really fixed.
Kent Wenger
Condor Team
===========================================================================
Date mail was appended: Wed Mar 11 14:39:16 2009 (1236800356)
Subject: Actions
Ticket resolved by wenger
===========================================================================
Date of actions: Wed Mar 11 14:39:33 2009 (1236800373)