LIGO Support Ticket 17975
Ticket Information
Number: admin 17975
User: dabrown@physics.syr.edu
Email: anderson__AT__ligo.caltech.edu,skoranda__AT__gravity.phys.uwm.edu,'__AT__cs.wisc.edu
Status: resolved
Assigned To: psilord
CC: Stuart Anderson <anderson__AT__ligo.caltech.edu>, Scott Koranda
<skoranda__AT__gravity.phys.uwm.edu>
From: Duncan Brown <dabrown__AT__physics.syr.edu>
Subject: LIGO: CentOS 5 condor jobs are not checkpointing
Date: Wed, 7 May 2008 20:24:43 -0400
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
X-Scanner: InterScan AntiVirus for Sendmail
X-Seen-BY: mailfromd 4.1 obsidian.cs.wisc.edu
Hi Stuart,
Standard universe jobs on CentOS 5 are definitely not checkpointing.
I just tried a
condor_vacate -graceful slot3__AT__node046.sugar.phy.syr.edu
to checkpoint a condor compiled job. The job log says
004 (29259.000.000) 05/07 20:12:34 Job was evicted.
(0) Job was not checkpointed.
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
445 - Run Bytes Sent By Job
14792791 - Run Bytes Received By Job
...
Tailing the StarterLog gives
==> /usr1/condor/log/StarterLog.slot3 <==
5/7 20:07:29 *FSM* Got asynchronous event "VACATE"
5/7 20:07:29 *FSM* Executing transition function "req_vacate"
5/7 20:07:29 req_ckpt_exit_all: Proc -1 in state EXECUTING
5/7 20:07:29 Requesting Exit on proc #-1
5/7 20:07:29 UserProc::send_sig_no_privsep(): Sent signal SIGCONT to user job 21890
5/7 20:07:29 UserProc::send_sig(): Sent signal SIGTSTP to user job 21890
5/7 20:07:29 *FSM* Transitioning to state "TERMINATE"
5/7 20:07:29 *FSM* Executing state func "terminate_all()" [ ]
5/7 20:07:29 *FSM* Transitioning to state "TERMINATE_WAIT"
5/7 20:07:29 *FSM* Executing state func "asynch_wait()" [ SUSPEND ALARM DIE CHILD_EXIT ]
5/7 20:12:32 *FSM* Got asynchronous event "DIE"
5/7 20:12:32 *FSM* Executing transition function "req_die"
5/7 20:12:32 req_exit_all: Proc -1 in state EXECUTING
5/7 20:12:32 Requesting Exit on proc #-1
5/7 20:12:32 UserProc::send_sig_no_privsep(): Sent signal SIGCONT to user job 21890
5/7 20:12:32 UserProc::send_sig(): Sent signal SIGKILL to user job 21890
5/7 20:12:33 *FSM* Got asynchronous event "CHILD_EXIT"
5/7 20:12:33 *FSM* Executing transition function "reaper"
5/7 20:12:33 Process 21890 killed by signal 9
5/7 20:12:33 Process exited by request
5/7 20:12:33 *FSM* Transitioning to state "TERMINATE"
5/7 20:12:33 *FSM* Executing state func "terminate_all()" [ ]
5/7 20:12:33 *FSM* Transitioning to state "SEND_STATUS_ALL"
5/7 20:12:33 *FSM* Executing state func "dispose_all()" [ ]
5/7 20:12:33 Sending final status for process 29259.0
5/7 20:12:33 STATUS encoded as CKPT, *NOT* TRANSFERRED
5/7 20:12:33 User time = 0.000000 seconds
5/7 20:12:33 System time = 0.000000 seconds
5/7 20:12:33 Unlinked "dir_21888/condor_exec.29259.0"
5/7 20:12:33 Removing directory "dir_21888"
5/7 20:12:33 *FSM* Reached state "END"
5/7 20:12:33 ********* STARTER terminating normally **********
There is nothing in the checkpoint server log other than
==> /usr1/condor/log/CkptServerLog <==
5/7 20:10:50 Sending ckpt server ad to collector...
This is a freshly compiled piece of code running
[root@node046 ~]# rpm -qa | grep condor
condor-7.0.1-1
on the execute node and the same version on the submit node used to
compile the code:
[dbrown@sugar-dev1 ~]$ rpm -qa | grep condor
condor-7.0.1-1
The code is condor compiled:
[dbrown@sugar-dev1 ~]$ /home/dbrown/projects/cbc/s5/ligovirgo/head_bug/2pn_spa/lalapps_inspiral --version
Condor: Notice: Will checkpoint to /home/dbrown/projects/cbc/s5/ligovirgo/head_bug/2pn_spa/lalapps_inspiral.ckpt
Condor: Notice: Remote system calls disabled.
LIGO/LSC Standalone Inspiral Search Engine
Duncan Brown <duncan__AT__gravity.phys.uwm.edu>
CVS Version: $Id: inspiral.c,v 1.270 2008/04/03 16:51:30 spxcar Exp $
CVS Tag: $Name: $
and the full classads are below. I think this is a show stopper for
moving to CentOS 5, unless I've done something stupid.
Cheers,
Duncan.
-- Schedd: sugar.phy.syr.edu : <10.20.1.23:58095>
MyType = "Job"
TargetType = "Machine"
ClusterId = 29259
QDate = 1210193021
CompletionDate = 0
Owner = "dbrown"
NumCkpts_RAW = 0
NumCkpts = 0
NumJobStarts = 0
NumRestarts = 0
NumSystemHolds = 0
CommittedTime = 0
ExitBySignal = FALSE
Notification = ERROR
WantBadgers = TRUE
JOB_LEASE_DURATION = 3600
copy_to_spool = TRUE
CondorVersion = "$CondorVersion: 7.0.1 Feb 26 2008 BuildID: 76180 $"
CondorPlatform = "$CondorPlatform: X86_64-LINUX_RHEL5 $"
RootDir = "/"
Iwd = "/home/dbrown/projects/cbc/s5/ligovirgo/head_bug/2pn_spa"
JobUniverse = 1
Cmd = "/home/dbrown/projects/cbc/s5/ligovirgo/head_bug/2pn_spa/lalapps_inspiral"
MinHosts = 1
WantRemoteSyscalls = TRUE
WantCheckpoint = TRUE
JobPrio = 0
User = "dbrown@sugar"
NiceUser = FALSE
MaxJobRetirementTime = 0
Env = "KMP_LIBRARY=serial;MKL_SERIAL=yes"
EnvDelim = ";"
JobNotification = 0
WantRemoteIO = FALSE
UserLog = "/usr1/dbrown/log/tmp770Clx"
CoreSize = 0
KillSig = "SIGTSTP"
Rank = 0.000000
In = "/dev/null"
TransferIn = FALSE
Out = "logs/inspiral-817736603-817738651-29259-0.out"
StreamOut = FALSE
Err = "logs/inspiral-817736603-817738651-29259-0.err"
StreamErr = FALSE
BufferSize = 524288
BufferBlockSize = 32768
ShouldTransferFiles = "NO"
TransferFiles = "NEVER"
ExecutableSize_RAW = 14445
ExecutableSize = 15000
DiskUsage_RAW = 14445
DiskUsage = 15000
Requirements = (Arch == "X86_64") && (OpSys == "LINUX") && ((CkptArch == Arch) || (CkptArch =?= UNDEFINED)) && ((CkptOpSys == OpSys) || (CkptOpSys =?= UNDEFINED)) && (Disk >= DiskUsage) && ((Memory * 1024) >= ImageSize)
FileSystemDomain = "sugar"
JobLeaseDuration = 3600
PeriodicHold = FALSE
PeriodicRelease = FALSE
PeriodicRemove = FALSE
OnExitHold = FALSE
OnExitRemove = TRUE
LeaveJobInQueue = FALSE
Args = "--trig-end-time 0 --cluster-method window --dynamic-range-exponent 69.0 --disable-rsq-veto --bank-file H1-TMPLTBANK-817736603-2048.xml.gz --high-pass-order 8 --strain-high-pass-order 8 --ifo-tag FIRST --gps-end-time 817738651 --calibrated-data real_8 --channel-name H1:LSC-STRAIN --snr-threshold 5.5 --number-of-segments 15 --trig-start-time 0 --enable-high-pass 30.0 --debug-level 33 --gps-start-time 817736603 --enable-filter-inj-only --high-pass-attenuation 0.1 --chisq-bins 0 --inverse-spec-length 16 --segment-length 1048576 --low-frequency-cutoff 40.0 --pad-data 8 --cluster-window 16 --sample-rate 4096 --chisq-threshold 10.0 --resample-filter ldas --strain-high-pass-atten 0.1 --strain-high-pass-freq 30.0 --segment-overlap 524288 --frame-cache cache/H-H1_RDS_C03_L2-817730835-817747213.cache --chisq-delta 0.2 --bank-veto-subbank-size 1 --approximant FindChirpSP --write-compress --enable-output --order twoPN --spectrum-type median"
DAGNodeName = "1812a31af3998b75e881e767e4062960"
DAGParentNodeNames = "57c6c3d3a94afe7909838ec90b1de878,c2422f87b1967077e766b6c6fb1eb714"
DAGManJobId = 28662
GlobalJobId = "sugar.phy.syr.edu#1210193021#29259.0"
ProcId = 0
JobStartDate = 1210201165
FileReadCount = 0.000000
FileReadBytes = 0.000000
FileWriteCount = 0.000000
FileWriteBytes = 0.000000
FileSeekCount = 0.000000
TotalSuspensions = 0
CumulativeSuspensionTime = 0
ImageSize_RAW = 15000
ImageSize = 15000
ExitStatus = 0
LocalUserCpu = 0.000000
LocalSysCpu = 0.000000
RemoteUserCpu = 0.000000
RemoteSysCpu = 0.000000
BytesSent = 14792791.000000
BytesRecvd = 445.000000
RSCBytesSent = 1796.000000
RSCBytesRecvd = 445.000000
LastVacateTime = 1210205554
RemoteWallClockTime = 4389.000000
LastRemoteHost = "slot3__AT__node046.sugar.phy.syr.edu"
LastPublicClaimId = "<10.20.2.46:45686>#1209066322#791#..."
LastPublicClaimIds = ""
MaxHosts = 1
AutoClusterId = 0
AutoClusterAttrs = "JobUniverse,LastCheckpointPlatform,NumCkpts,DiskUsage,ImageSize,Requirements,NiceUser"
LastRejMatchReason = "no match found"
LastRejMatchTime = 1210205560
WantMatchDiagnostics = TRUE
LastMatchTime = 1210205560
NumJobMatches = 2
OrigMaxHosts = 1
JobStatus = 2
EnteredCurrentStatus = 1210205562
LastSuspensionTime = 0
CurrentHosts = 1
PublicClaimId = "<10.20.2.38:59180>#1209066321#751#..."
RemoteHost = "slot2__AT__node038.sugar.phy.syr.edu"
RemoteSlotID = 2
ShadowBday = 1210205562
JobLastStartDate = 1210201165
JobCurrentStartDate = 1210205562
NumShadowStarts = 2
JobRunCount = 2
LastJobLeaseRenewal = 1210206164
ServerTime = 1210206165
--
Duncan Brown Room 263-1, Department of Physics,
Assistant Professor of Physics Syracuse University, NY 13244, USA
Phone: (315) 443 5993 http://www.gravity.phy.syr.edu/~duncan
===========================================================================
Date of creation: Wed May 7 19:24:42 2008 (1210206284)
Subject: Actions
Assigned to psilord by jfrey
===========================================================================
Date of actions: Fri May 9 8:42:29 2008 (1210340550)
CC: Scott Koranda <skoranda__AT__gravity.phys.uwm.edu>, Stuart Anderson
<anderson__AT__ligo.caltech.edu>
From: Duncan Brown <dabrown__AT__physics.syr.edu>
Subject: Re: [condor-admin #17975] LIGO: CentOS 5 condor jobs are not
checkpointing
Date: Thu, 15 May 2008 16:46:05 -0400
To: condor-admin__AT__cs.wisc.edu
X-Scanner: InterScan AntiVirus for Sendmail
X-Seen-BY: mailfromd 4.1 obsidian.cs.wisc.edu
Hi Todd and Pete,
I tried running the condor compiled job interactively and sending it
the checkpoint signal, as Scott suggested. This also fails to produce
a checkpoint with either SIGTSTP or SIGUSR2. SIGUSR1 kills the job.
Here's the output of the shell with the job:
[dbrown@sugar-dev1 ckpt]$ lalapps_inspiral --enable-output --trig-end-time 0 --cluster-method window --dynamic-range-exponent 6.900000e+01 --bank-file H1-TMPLTBANK-817701712-2048.xml.gz --high-pass-order 8 --strain-high-pass-order 8 --ifo-tag FIRST --gps-end-time 817703760 --calibrated-data real_8 --channel-name H1:LSC-STRAIN --snr-threshold 5.5 --number-of-segments 15 --trig-start-time 0 --enable-high-pass 3.000000e+01 --debug-level 33 --gps-start-time 817701712 --high-pass-attenuation 1.000000e-01 --chisq-bins 0 --inverse-spec-length 16 --segment-length 1048576 --low-frequency-cutoff 4.000000e+01 --pad-data 8 --cluster-window 1.600000e+01 --sample-rate 4096 --chisq-threshold 10.0 --resample-filter ldas --strain-high-pass-atten 1.000000e-01 --strain-high-pass-freq 3.000000e+01 --segment-overlap 524288 --frame-cache cache/H-H1_RDS_C03_L2-817701704-817707343.cache --chisq-delta 2.000000e-01 --bank-veto-subbank-size 1 --approximant FindChirpSP --order twoPN --spectrum-type median --write-compress --disable-rsq-veto --enable-filter-inj-only
Condor: Notice: Will checkpoint to lalapps_inspiral.ckpt
Condor: Notice: Remote system calls disabled.
User defined signal 1
Here's the output of the shell I used to send the signals:
[root@sugar-dev1 ~]# ps -u dbrown
PID TTY TIME CMD
24863 ? 00:00:00 sshd
24864 pts/8 00:00:00 bash
25230 ? 00:00:00 sshd
25231 pts/9 00:00:00 bash
25354 pts/9 00:00:00 man
25357 pts/9 00:00:00 sh
25358 pts/9 00:00:00 sh
25363 pts/9 00:00:00 less
25383 pts/8 00:00:06 lalapps_inspira
[root@sugar-dev1 ~]# kill -SIGTSTP 25383
[root@sugar-dev1 ~]# kill -SIGUSR2 25383
[root@sugar-dev1 ~]# kill -SIGUSR1 25383
Nothing happened until I sent it the USR1.
I also tried running the inspiral code with the arguments
--data-checkpoint and --checkpoint-path. This causes it to call the
condor function ckpt_and_exit() to trigger a checkpoint after it has
read in the data. This also fails to checkpoint, and the job just
carries on executing:
[dbrown@sugar-dev1 ckpt]$ lalapps_inspiral --enable-output --trig-end-time 0 --cluster-method window --dynamic-range-exponent 6.900000e+01 --bank-file H1-TMPLTBANK-817701712-2048.xml.gz --high-pass-order 8 --strain-high-pass-order 8 --ifo-tag FIRST --gps-end-time 817703760 --calibrated-data real_8 --channel-name H1:LSC-STRAIN --snr-threshold 5.5 --number-of-segments 15 --trig-start-time 0 --enable-high-pass 3.000000e+01 --debug-level 33 --gps-start-time 817701712 --high-pass-attenuation 1.000000e-01 --chisq-bins 0 --inverse-spec-length 16 --segment-length 1048576 --low-frequency-cutoff 4.000000e+01 --pad-data 8 --cluster-window 1.600000e+01 --sample-rate 4096 --chisq-threshold 10.0 --resample-filter ldas --strain-high-pass-atten 1.000000e-01 --strain-high-pass-freq 3.000000e+01 --segment-overlap 524288 --frame-cache cache/H-H1_RDS_C03_L2-817701704-817707343.cache --chisq-delta 2.000000e-01 --bank-veto-subbank-size 1 --approximant FindChirpSP --order twoPN --spectrum-type median --write-compress --disable-rsq-veto --enable-filter-inj-only --data-checkpoint --checkpoint-path `pwd` --verbose
Condor: Notice: Will checkpoint to lalapps_inspiral.ckpt
Condor: Notice: Remote system calls disabled.
parsed 5333 templates from H1-TMPLTBANK-817701712-2048.xml.gz
reading frame file locations from cache file: cache/H-H1_RDS_C03_L2-817701704-817707343.cache
resampleParams.deltaT = 2.441406e-04
chan.deltaT = 6.103516e-05
input channel will be resampled
input channel H1:LSC-STRAIN has sample interval (deltaT) = 6.103516e-05
reading 33816576 points from frame stream
reading REAL8 h(t) data from frames... done
applying 8 order high pass to REAL8 h(t) data: 0.90 of signal passes at 30.00 Hz
read channel H1:LSC-STRAIN from frame stream
got 33816576 points with deltaT 6.103516e-05
starting at GPS time 817701704 sec 0 ns
generating response at time 817701712 sec 0 ns
resampling input data from 6.103516e-05 to 2.441406e-04
channel H1:LSC-STRAIN resampled:
8454144 points with deltaT 2.441406e-04
starting at GPS time 817701704 sec 0 ns
checkpointing to file /home/dbrown/projects/daswg/condor/ckpt/H1-INSPIRAL_FIRST-817701712-2048.ckpt
applying 8 order high pass: 0.90 of signal passes at 30.00 Hz
after removal of 8 second padding at start and end:
data channel sample interval (deltaT) = 2.441406e-04
data channel length = 8388608
starting at 817701712 sec 0 ns
computing median psd with overlap 524288
initializing findchirp... done
findchirp conditioning data for SP
and so on... Notice the line "checkpointing to file" above. Here's
the code around that printf:
if ( dataCheckpoint )
{
#ifdef LALAPPS_CONDOR
  condor_compress_ckpt = 1;
  if ( ckptPath[0] )
  {
    LALSnprintf( fname, FILENAME_MAX * sizeof(CHAR), "%s/%s.ckpt",
        ckptPath, fileName );
  }
  else
  {
    LALSnprintf( fname, FILENAME_MAX * sizeof(CHAR), "%s.ckpt",
        fileName );
  }
  if ( vrbflg ) fprintf( stdout, "checkpointing to file %s\n",
      fname );
  init_image_with_file_name( fname );
  ckpt_and_exit();
#else
  fprintf( stderr, "--data-checkpoint cannot be used unless "
      "lalapps is condor compiled\n" );
  exit( 1 );
#endif
}
I think we have a broken checkpoint library on CentOS 5.
Here's the shadow log for the job I mentioned in the original email.
5/7 18:59:25 (29259.0) (7680):Requesting Primary Starter
5/7 18:59:25 (29259.0) (7680):Shadow: Request to run a job was ACCEPTED
5/7 18:59:25 (29259.0) (7680):Shadow: RSC_SOCK connected, fd = 17
5/7 18:59:25 (29259.0) (7680):Shadow: CLIENT_LOG connected, fd = 18
5/7 18:59:25 (29259.0) (7680):My_Filesystem_Domain = "sugar"
5/7 18:59:25 (29259.0) (7680):My_UID_Domain = "sugar"
5/7 18:59:25 (29259.0) (7680): Entering pseudo_get_file_stream
5/7 18:59:25 (29259.0) (7680): file = "/usr1/condor/spool/cluster29259.ickpt.subproc0"
5/7 18:59:25 (29259.0) (7680):Reaped child status - pid 7681 exited with status 0
5/7 18:59:26 (29259.0) (7680):Read: User Job - $CondorPlatform: X86_64-LINUX_RHEL5 $
5/7 18:59:26 (29259.0) (7680):Read: User Job - $CondorVersion: 7.0.1 Feb 26 2008 BuildID: 76180 $
5/7 18:59:26 (29259.0) (7680):Read: Checkpoint file name is "/usr1/condor/spool/cluster29259.proc0.subproc0"
5/7 20:07:31 (29259.0) (7680):Read: received ckpt signal 20, but deferred it for later
5/7 20:12:34 (29259.0) (7680):Shadow: Job 29259.0 exited, termsig = 9, coredump = 0, retcode = 0
5/7 20:12:34 (29259.0) (7680):Shadow: Job was kicked off without a checkpoint
5/7 20:12:34 (29259.0) (7680):Shadow: DoCleanup: unlinking TmpCkpt '/usr1/condor/spool/cluster29259.proc0.subproc0.tmp'
5/7 20:12:34 (29259.0) (7680):Trying to unlink /usr1/condor/spool/cluster29259.proc0.subproc0.tmp
5/7 20:12:34 (29259.0) (7680):user_time = 0 ticks
5/7 20:12:34 (29259.0) (7680):sys_time = 1 ticks
5/7 20:12:34 (29259.0) (7680):ZKM: setting default map to (null)
5/7 20:12:34 (29259.0) (7680):********** Shadow Exiting(107) **********
5/7 20:12:42 (29259.0) (8689):Requesting Primary Starter
5/7 20:12:42 (29259.0) (8689):Shadow: Request to run a job was ACCEPTED
5/7 20:12:42 (29259.0) (8689):Shadow: RSC_SOCK connected, fd = 17
5/7 20:12:42 (29259.0) (8689):Shadow: CLIENT_LOG connected, fd = 18
5/7 20:12:42 (29259.0) (8689):My_Filesystem_Domain = "sugar"
5/7 20:12:42 (29259.0) (8689):My_UID_Domain = "sugar"
5/7 20:12:42 (29259.0) (8689):Shadow: Request to run a job was ACCEPTED
5/7 20:12:42 (29259.0) (8689):Shadow: RSC_SOCK connected, fd = 17
5/7 20:12:42 (29259.0) (8689):Shadow: CLIENT_LOG connected, fd = 18
5/7 20:12:42 (29259.0) (8689):My_Filesystem_Domain = "sugar"
5/7 20:12:42 (29259.0) (8689):My_UID_Domain = "sugar"
5/7 20:12:42 (29259.0) (8689): Entering pseudo_get_file_stream
5/7 20:12:42 (29259.0) (8689): file = "/usr1/condor/spool/cluster29259.ickpt.subproc0"
5/7 20:12:42 (29259.0) (8689):Reaped child status - pid 8691 exited with status 0
5/7 20:12:42 (29259.0) (8689):Read: User Job - $CondorPlatform: X86_64-LINUX_RHEL5 $
5/7 20:12:42 (29259.0) (8689):Read: User Job - $CondorVersion: 7.0.1 Feb 26 2008 BuildID: 76180 $
5/7 20:12:42 (29259.0) (8689):Read: Checkpoint file name is "/usr1/condor/spool/cluster29259.proc0.subproc0"
5/7 22:48:06 (29259.0) (8689):Shadow: Job 29259.0 exited, termsig = 0, coredump = 0, retcode = 0
5/7 22:48:06 (29259.0) (8689):Shadow: Job exited normally with status 0
5/7 22:48:06 (29259.0) (8689):user_time = 1 ticks
5/7 22:48:06 (29259.0) (8689):sys_time = 2 ticks
5/7 22:48:06 (29259.0) (8689):ZKM: setting default map to (null)
5/7 22:48:06 (29259.0) (8689):Static Policy: removing job because OnExitRemove has become true
5/7 22:48:06 (29259.0) (8689):********** Shadow Exiting(100) **********
Cheers,
Duncan.
On May 7, 2008, at 8:24 PM, condor-admin__AT__cs.wisc.edu wrote:
> Greetings. (This is an automated response. There is no need to
> reply.)
>
> Your message regarding:
> "LIGO: CentOS 5 condor jobs are not checkpointing"
> has been received by the condor-admin response tracking system.
>
> In order to help us track the progress of your request, we ask that
> you
> include the string:
> "[condor-admin #17975] LIGO: CentOS 5 condor jobs are not
> checkpointing"
> in the subject line of any further mail about this particular request.
>
> You can do this by simply replying to this email.
>
> While you are waiting for a reply, please look at the Condor Manual:
> http://www.cs.wisc.edu/condor/manual/
> for full documentation of Condor. Your problem may have already
> been solved or explained.
>
> Support for Condor through the condor-admin list is free of charge.
> We will make a best effort to respond in a timely fashion, but please
> keep in mind that our resources are limited.
>
> We offer a higher level of support for a fee. If you are
> interested in
> this, please send a message to condor-support__AT__cs.wisc.edu.
>
> If possible, we encourage you to try to experiment a little to see if
> you can solve the problem yourself.
>
> Thank You,
> - condor-admin response tracking system
--
Duncan Brown Room 263-1, Department of Physics,
Assistant Professor of Physics Syracuse University, NY 13244, USA
Phone: (315) 443 5993 http://www.gravity.phy.syr.edu/~duncan
===========================================================================
Date mail was appended: Thu May 15 15:46:12 2008 (1210884372)
Subject: Comments added
It seems that with RHEL 5, gethostbyname() opens a socket to a DNS cache service
if nsswitch says to use DNS resolving. This socket is kept open, thus
checkpoints are suspended.
Comments added by tannenba
===========================================================================
Date comments were added: Fri May 23 13:34:32 2008 (1211567672)
CC: Scott Koranda <skoranda__AT__gravity.phys.uwm.edu>, Stuart Anderson
<anderson__AT__ligo.caltech.edu>
From: Duncan Brown <dabrown__AT__physics.syr.edu>
Subject: Re: [condor-admin #17975] LIGO: CentOS 5 condor jobs are not
checkpointing
Date: Fri, 23 May 2008 18:23:52 -0400
To: condor-admin__AT__cs.wisc.edu
X-Scanner: InterScan AntiVirus for Sendmail
X-Seen-BY: mailfromd 4.1 obsidian.cs.wisc.edu
Hi Pete,
The offending function preventing checkpointing is getpwuid(). If I
comment out the following lines:
    if(!(pw = getpwuid(uid)))
      snprintf(ptable->username, LIGOMETA_USERNAME_MAX, "%d", uid);
    else
      snprintf(ptable->username, LIGOMETA_USERNAME_MAX, "%s", pw->pw_name);
then the inspiral code successfully checkpoints when told to. The
presence of the gethostname() call makes no difference.
I'll check the stderr stuff tomorrow.
Cheers,
Duncan.
--
Duncan Brown Room 263-1, Department of Physics,
Assistant Professor of Physics Syracuse University, NY 13244, USA
Phone: (315) 443 5993 http://www.gravity.phy.syr.edu/~duncan
===========================================================================
Date mail was appended: Fri May 23 17:24:09 2008 (1211581449)
Date: Tue, 27 May 2008 10:36:12 -0500
From: Peter Keller <psilord__AT__cs.wisc.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>,
'__AT__cs.wisc.edu
Subject: Re: [condor-admin #17975] LIGO: CentOS 5 condor jobs are not
checkpointing
On Fri, May 23, 2008 at 05:24:45PM -0500, condor-admin response tracking system wrote:
> The offending function preventing checkpointing is getpwuid(). If I
> comment out the following lines:
>
> if(!(pw = getpwuid(uid)))
> snprintf(ptable->username, LIGOMETA_USERNAME_MAX, "%d", uid);
> else
> snprintf(ptable->username, LIGOMETA_USERNAME_MAX, "%s", pw->pw_name);
>
> then the inspiral code successfully checkpoints when told to. The
> presence of the gethostname() call makes no difference.
>
> I'll check the stderr stuff tomorrow.
It looks like the nscd.conf file has a method by which one may turn off
using the /var/run/nscd/socket communication method. Check out setting
'persistent passwd no' and 'shared passwd yes' in the config file and
see if that fixes the default behavior.
Even if it does, though, I still should figure out a method in stduniv
by which no configuration changes on your part are needed.
Thank you.
-pete
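For reference, the quoted lines can be isolated in a small standalone helper, which may be useful as a reproducer when relinked for the standard universe. This is only a sketch: lookup_username() and USERNAME_MAX_SKETCH are hypothetical names standing in for the LAL code and for LIGOMETA_USERNAME_MAX, whose real value is not given in this thread.

```c
#include <pwd.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical buffer size; the real LIGOMETA_USERNAME_MAX is not shown here. */
#define USERNAME_MAX_SKETCH 64

/*
 * Fill buf with the login name for uid, or with the numeric uid if the
 * lookup fails. Returns 1 if a name was resolved, 0 if the numeric
 * fallback was used. getpwuid() is the call under suspicion: it goes
 * through the name service switch and may contact nscd.
 */
static int lookup_username(uid_t uid, char *buf, size_t len)
{
    struct passwd *pw = getpwuid(uid);

    if (!pw) {
        snprintf(buf, len, "%lu", (unsigned long)uid);
        return 0;
    }
    snprintf(buf, len, "%s", pw->pw_name);
    return 1;
}
```

Calling lookup_username(getuid(), buf, sizeof buf) once near program start, followed by a long sleep or compute loop, should be enough to test whether a vacate request still produces a checkpoint.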
===========================================================================
Date mail was appended: Tue May 27 10:36:13 2008 (1211902574)
CC: Stuart Anderson <anderson__AT__ligo.caltech.edu>, Scott Koranda
<skoranda__AT__gravity.phys.uwm.edu>
From: Duncan Brown <dabrown__AT__physics.syr.edu>
Subject: Re: [condor-admin #17975] LIGO: CentOS 5 condor jobs are not
checkpointing
Date: Tue, 27 May 2008 12:04:35 -0400
To: condor-admin__AT__cs.wisc.edu
X-Scanner: InterScan AntiVirus for Sendmail
X-Seen-BY: mailfromd 4.1 obsidian.cs.wisc.edu
--Apple-Mail-45--478641795
Hi Pete,
On May 27, 2008, at 11:47 AM, condor-admin response tracking system wrote:
> On Fri, May 23, 2008 at 05:24:45PM -0500, condor-admin response tracking system wrote:
>> The offending function preventing checkpointing is getpwuid(). If I
>> comment out the following lines:
>>
>> if(!(pw = getpwuid(uid)))
>> snprintf(ptable->username, LIGOMETA_USERNAME_MAX, "%d", uid);
>> else
>> snprintf(ptable->username, LIGOMETA_USERNAME_MAX, "%s", pw->pw_name);
>>
>> then the inspiral code successfully checkpoints when told to. The
>> presence of the gethostname() call makes no difference.
>>
>> I'll check the stderr stuff tomorrow.
>
> It looks like the nscd.conf file has a method by which one may turn off
> using the /var/run/nscd/socket communication method. Check out setting
> 'persistent passwd no' and 'shared passwd yes' in the config file and
> see if that fixes the default behavior.
No, it doesn't fix the checkpointing. I set
enable-cache passwd yes
positive-time-to-live passwd 600
negative-time-to-live passwd 20
suggested-size passwd 211
check-files passwd yes
persistent passwd no
shared passwd yes
max-db-size passwd 33554432
auto-propagate passwd yes
and the code still refuses to checkpoint, with or without the nscd
daemon running.
> Even if it does, though, I still should figure out a method in stduniv
> by which no configuration changes on your part are needed.
Great, thanks.
Cheers,
Duncan.
> Thank you.
>
> -pete
>
>
>
> ========================================
> MESSAGE INFORMATION
> ========================================
> * From: Peter Keller <psilord__AT__cs.wisc.edu>
> * Ticket Email List: dabrown__AT__physics.syr.edu,
> anderson__AT__ligo.caltech.edu,skoranda__AT__gravity.phys.uwm.edu,'__AT__cs.wisc.edu
--
Duncan Brown Room 263-1, Department of Physics,
Assistant Professor of Physics Syracuse University, NY 13244, USA
Phone: (315) 443 5993 http://www.gravity.phy.syr.edu/~duncan
--Apple-Mail-45--478641795
--Apple-Mail-45--478641795--
===========================================================================
Date mail was appended: Tue May 27 11:04:28 2008 (1211904269)
Date: Tue, 27 May 2008 09:56:40 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: Duncan Brown <dabrown__AT__physics.syr.edu>
CC: condor-admin__AT__cs.wisc.edu, Scott Koranda <skoranda__AT__gravity.phys.uwm.edu>
Subject: Re: [condor-admin #17975] LIGO: CentOS 5 condor jobs are not
checkpointing
X-Seen-BY: mailfromd 4.1 granite.cs.wisc.edu
Duncan,
Is this issue CentOS 5 specific? i.e., do you have this problem
on any of the other LIGO FC4 sites running the same 7.0.1 version of Condor
with the same application code base?
Is the call to getpwuid() new to the CBC code base since we
first passed the CentOS 5 acceptance testing with an earlier version
of Condor 7.0.x at Syracuse?
Is there an acceptable workaround for the CBC code to avoid
this call for the short term, or is this still a blocking bug for
upgrading the rest of the LDG?
Does this have any causal relationship to the stderr issue?
Thanks.
On Tue, May 27, 2008 at 12:04:35PM -0400, Duncan Brown wrote:
> Hi Pete,
>
> On May 27, 2008, at 11:47 AM, condor-admin response tracking system
> wrote:
>
> >On Fri, May 23, 2008 at 05:24:45PM -0500, condor-admin response tracking system wrote:
> >>The offending function preventing checkpointing is getpwuid(). If I
> >>comment out the following lines:
> >>
> >>if(!(pw = getpwuid(uid)))
> >> snprintf(ptable->username, LIGOMETA_USERNAME_MAX, "%d", uid);
> >>else
> >> snprintf(ptable->username, LIGOMETA_USERNAME_MAX, "%s", pw->pw_name);
> >>
> >>then the inspiral code successfully checkpoints when told to. The
> >>presence of the gethostname() call makes no difference.
> >>
> >>I'll check the stderr stuff tomorrow.
> >
> >It looks like the nscd.conf file has a method by which one may turn off
> >using the /var/run/nscd/socket communication method. Check out setting
> >'persistent passwd no' and 'shared passwd yes' in the config file and
> >see if that fixes the default behavior.
>
> No, it doesn't fix the checkpointing. I set
>
> enable-cache passwd yes
> positive-time-to-live passwd 600
> negative-time-to-live passwd 20
> suggested-size passwd 211
> check-files passwd yes
> persistent passwd no
> shared passwd yes
> max-db-size passwd 33554432
> auto-propagate passwd yes
>
> and the code still refuses to checkpoint, with or without the nscd
> daemon running.
>
> >Even if it does, though, I still should figure out a method in stduniv
> >by which no configuration changes on your part are needed.
>
> Great, thanks.
>
> Cheers,
> Duncan.
>
>
> >Thank you.
> >
> >-pete
> >
> >
> >
>
> --
>
> Duncan Brown Room 263-1, Department of Physics,
> Assistant Professor of Physics Syracuse University, NY 13244, USA
> Phone: (315) 443 5993 http://www.gravity.phy.syr.edu/~duncan
>
>
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date mail was appended: Tue May 27 11:56:57 2008 (1211907417)
Date: Tue, 27 May 2008 15:55:19 -0500
From: Peter Keller <psilord__AT__cs.wisc.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
Subject: Re: [condor-admin #17975] LIGO: CentOS 5 condor jobs are not
checkpointing
On Tue, May 27, 2008 at 11:56:57AM -0500, condor-admin response tracking system wrote:
> Is this issue CentOS 5 specific?
Which CentOS 5 release are you using: 5.0, 5.1, or the latest whenever it
comes out? And on what architectures?
Thank you.
-pete
===========================================================================
Date mail was appended: Tue May 27 15:55:22 2008 (1211921723)
Date: Tue, 27 May 2008 16:50:39 -0500
From: Peter Keller <psilord__AT__cs.wisc.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
Subject: Re: [condor-admin #17975] LIGO: CentOS 5 condor jobs are not
checkpointing
Hello,
As a potential workaround when running this job under Condor, what if
you specified this in the submit description file:
file_remaps = "/var/run/nscd/socket=local:/this-file-should-not-exist"
Then, when the stduniv libraries try to open the former, the open gets
redirected to the latter, and you get ENOENT. If any fallback mechanism
exists in glibc, this should show it.
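The failure that the remap is meant to provoke can be checked outside Condor. The sketch below assumes that glibc reaches nscd by connecting to a Unix-domain socket at that path; try_unix_connect() is a hypothetical helper, and it does not exercise the file_remaps mechanism itself.

```c
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/*
 * Try to connect to a Unix-domain stream socket at path.
 * Returns 0 on success, otherwise the errno value of the failing call.
 */
static int try_unix_connect(const char *path)
{
    struct sockaddr_un addr;
    int fd;
    int err = 0;

    /* Refuse paths that do not fit in sun_path, including the NUL. */
    if (strlen(path) >= sizeof(addr.sun_path))
        return ENAMETOOLONG;

    fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0)
        return errno;

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strcpy(addr.sun_path, path);

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        err = errno;

    close(fd);
    return err;
}
```

On Linux, calling try_unix_connect("/this-file-should-not-exist") is expected to return ENOENT, which is the condition a client library would have to handle by falling back to reading the passwd database directly.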
Thank you.
-pete
===========================================================================
Date mail was appended: Tue May 27 16:50:41 2008 (1211925042)
Date: Wed, 28 May 2008 13:52:28 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: dabrown__AT__physics.syr.edu, skoranda__AT__gravity.phys.uwm.edu, '__AT__cs.wisc.edu
Subject: Re: [condor-admin #17975] LIGO: CentOS 5 condor jobs are not
checkpointing
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu
Duncan,
It might also be worth checking whether getuid(), or possibly
geteuid(), provides all the information this code needs without opening
a socket.
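A minimal sketch of that idea, assuming the numeric uid is an acceptable value for the username field: uid_as_username() is a hypothetical helper, not part of the CBC code. It only formats the value returned by getuid(), so it performs no passwd lookup; how getuid() itself is serviced inside a standard universe job is not covered here.

```c
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write the real uid of the calling process into buf as a decimal string.
 * Returns the snprintf() result: the number of characters that the full
 * string needs, excluding the terminating NUL.
 */
static int uid_as_username(char *buf, size_t len)
{
    uid_t uid = getuid();  /* swap in geteuid() if the effective uid is wanted */

    return snprintf(buf, len, "%lu", (unsigned long)uid);
}
```

The result is the same string the existing code already writes when getpwuid() returns NULL, so downstream readers of the username field would see the fallback form on every run.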
Thanks.
On Tue, May 27, 2008 at 04:50:41PM -0500, condor-admin response tracking system wrote:
> Hello,
>
> As a potential workaround when running this job under Condor, what if
> you specified this in the submit description file:
>
> file_remaps = "/var/run/nscd/socket=local:/this-file-should-not-exist"
>
> Then, when the stduniv libraries try to open the former, it gets redirected to
> the latter, and you get ENOENT. If any fallback mechanism exists in the
> glibc, this should show it.
>
> Thank you.
>
> -pete
>
>
>
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date mail was appended: Wed May 28 15:52:44 2008 (1212007965)
CC: anderson__AT__ligo.caltech.edu, skoranda__AT__gravity.phys.uwm.edu, '__AT__cs.wisc.edu
From: Duncan Brown <dabrown__AT__physics.syr.edu>
Subject: Re: [condor-admin #17975] LIGO: CentOS 5 condor jobs are not
checkpointing
Date: Thu, 29 May 2008 10:57:30 -0400
To: condor-admin__AT__cs.wisc.edu
X-Scanner: InterScan AntiVirus for Sendmail
X-Seen-BY: mailfromd 4.1 obsidian.cs.wisc.edu
--Apple-Mail-53--309866861
Hi Pete,
[dbrown@sugar-dev1 ~]$ cat /etc/redhat-release
CentOS release 5 (Final)
[dbrown@sugar-dev1 ~]$ uname -a
Linux sugar-dev1.phy.syr.edu 2.6.18-8.el5xen #1 SMP Thu Mar 15 19:56:43 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux
[dbrown@sugar-dev1 ~]$ rpm -qa | grep gcc
gcc-4.1.1-52.el5
compat-gcc-34-g77-3.4.6-4
compat-gcc-34-3.4.6-4
libgcc-4.1.1-52.el5
libgcc-4.1.1-52.el5
gcc-gfortran-4.1.1-52.el5
compat-gcc-34-c++-3.4.6-4
compat-libgcc-296-2.96-138
gcc-c++-4.1.1-52.el5
gcc-java-4.1.1-52.el5
[dbrown@sugar-dev1 ~]$ rpm -qa | grep glibc
glibc-common-2.5-12
glibc-2.5-12
compat-glibc-headers-2.3.4-2.26
glibc-2.5-12
glibc-headers-2.5-12
compat-glibc-2.3.4-2.26
glibc-devel-2.5-12
glibc-devel-2.5-12
compat-glibc-2.3.4-2.26
Cheers,
Duncan.
On May 27, 2008, at 4:55 PM, condor-admin response tracking system
wrote:
> On Tue, May 27, 2008 at 11:56:57AM -0500, condor-admin response
> tracking system wrote:
>> Is this issue CentOS 5 specific?
>
> Which Centos 5 are you using? 5.0, 5.1? The latest whenever it
> comes out?
> And on what architectures?
>
> Thank you.
>
> -pete
>
>
>
--
Duncan Brown Room 263-1, Department of Physics,
Assistant Professor of Physics Syracuse University, NY 13244, USA
Phone: (315) 443 5993 http://www.gravity.phy.syr.edu/~duncan
--Apple-Mail-53--309866861
--Apple-Mail-53--309866861--
===========================================================================
Date mail was appended: Thu May 29 9:57:19 2008 (1212073040)
CC: anderson__AT__ligo.caltech.edu, skoranda__AT__gravity.phys.uwm.edu, '__AT__cs.wisc.edu
From: Duncan Brown <dabrown__AT__physics.syr.edu>
Subject: Re: [condor-admin #17975] LIGO: CentOS 5 condor jobs are not
checkpointing
Date: Thu, 29 May 2008 11:26:07 -0400
To: condor-admin__AT__cs.wisc.edu
X-Scanner: InterScan AntiVirus for Sendmail
X-Seen-BY: mailfromd 4.1 granite.cs.wisc.edu
--Apple-Mail-55--308150283
Hi Pete,
That doesn't seem to work either, but I don't see any mention of the
file_remap in the Shadow or the Starter log.
[dbrown@sugar ckpt]$ cat inspiral.remap.sub
universe = standard
executable = /home/dbrown/projects/daswg/condor/ckpt/lalapps_inspiral
file_remaps = "/var/run/nscd/socket=local:/this-file-should-not-exist"
arguments = --enable-output --trig-end-time 0 --cluster-method window --dynamic-range-exponent 6.900000e+01 --bank-file H1-TMPLTBANK-817701712-2048.xml.gz --high-pass-order 8 --strain-high-pass-order 8 --ifo-tag FIRST --gps-end-time 817703760 --calibrated-data real_8 --channel-name H1:LSC-STRAIN --snr-threshold 5.5 --number-of-segments 15 --trig-start-time 0 --enable-high-pass 3.000000e+01 --debug-level 33 --gps-start-time 817701712 --high-pass-attenuation 1.000000e-01 --chisq-bins 0 --inverse-spec-length 16 --segment-length 1048576 --low-frequency-cutoff 4.000000e+01 --pad-data 8 --cluster-window 1.600000e+01 --sample-rate 4096 --chisq-threshold 10.0 --resample-filter ldas --strain-high-pass-atten 1.000000e-01 --strain-high-pass-freq 3.000000e+01 --segment-overlap 524288 --frame-cache cache/H-H1_RDS_C03_L2-817701704-817707343.cache --chisq-delta 2.000000e-01 --bank-veto-subbank-size 1 --approximant FindChirpSP --order twoPN --spectrum-type median --write-compress --disable-rsq-veto --enable-filter-inj-only --data-checkpoint --checkpoint-path /home/dbrown/projects/daswg/condor/ckpt --verbose
output = inspiral.remap.$(cluster).$(process).out
error = inspiral.remap.$(cluster).$(process).err
log = inspiral.remap.$(cluster).log
queue
[dbrown@sugar ckpt]$ condor_submit inspiral.remap.sub
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 47899.
000 (47899.000.000) 05/29 10:59:27 Job submitted from host: <10.20.1.23:33101>
...
001 (47899.000.000) 05/29 10:59:30 Job executing on host: <10.20.2.30:41414>
...
004 (47899.000.000) 05/29 11:06:30 Job was evicted.
(0) Job was not checkpointed.
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
445 - Run Bytes Sent By Job
14792827 - Run Bytes Received By Job
...
001 (47899.000.000) 05/29 11:07:20 Job executing on host: <10.20.2.30:41414>
...
5/29 10:59:30 (?.?) (9084):******* Standard Shadow starting up *******
5/29 10:59:30 (?.?) (9084):** $CondorVersion: 7.0.1 Feb 26 2008 BuildID: 76180 $
5/29 10:59:30 (?.?) (9084):** $CondorPlatform: X86_64-LINUX_RHEL5 $
5/29 10:59:30 (?.?) (9084):*******************************************
5/29 10:59:30 (?.?) (9084):uid=0, euid=103, gid=0, egid=103
5/29 10:59:30 (?.?) (9084):Hostname = "<10.20.2.30:41414>", Job = 47899.0
5/29 10:59:30 (47899.0) (9084):Requesting Primary Starter
5/29 10:59:30 (47899.0) (9084):Shadow: Request to run a job was ACCEPTED
5/29 10:59:30 (47899.0) (9084):Shadow: RSC_SOCK connected, fd = 17
5/29 10:59:30 (47899.0) (9084):Shadow: CLIENT_LOG connected, fd = 18
5/29 10:59:30 (47899.0) (9084):My_Filesystem_Domain = "sugar"
5/29 10:59:30 (47899.0) (9084):My_UID_Domain = "sugar"
5/29 10:59:30 (47899.0) (9084): Entering pseudo_get_file_stream
5/29 10:59:30 (47899.0) (9084): file = "/usr1/condor/spool/cluster47899.ickpt.subproc0"
5/29 10:59:30 (47899.0) (9084):Reaped child status - pid 9085 exited with status 0
5/29 10:59:30 (47899.0) (9084):Read: User Job - $CondorPlatform: X86_64-LINUX_RHEL5 $
5/29 10:59:30 (47899.0) (9084):Read: User Job - $CondorVersion: 7.0.1 Feb 26 2008 BuildID: 76180 $
5/29 10:59:30 (47899.0) (9084):Read: Checkpoint file name is "/usr1/condor/spool/cluster47899.proc0.subproc0"
5/29 10:59:57 (47899.0) (9084):Read: About to send CHECKPOINT and EXIT signal to SELF
5/29 10:59:57 (47899.0) (9084):Read: received ckpt signal 20, but deferred it for later
5/29 11:01:28 (47899.0) (9084):Read: received ckpt signal 20, but deferred it for later
5/29 11:06:30 (47899.0) (9084):Shadow: Job 47899.0 exited, termsig = 9, coredump = 0, retcode = 0
5/29 11:06:30 (47899.0) (9084):Shadow: Job was kicked off without a checkpoint
5/29 11:06:30 (47899.0) (9084):Shadow: DoCleanup: unlinking TmpCkpt '/usr1/condor/spool/cluster47899.proc0.subproc0.tmp'
5/29 11:06:30 (47899.0) (9084):Trying to unlink /usr1/condor/spool/cluster47899.proc0.subproc0.tmp
5/29 11:06:30 (47899.0) (9084):user_time = 1 ticks
5/29 11:06:30 (47899.0) (9084):sys_time = 1 ticks
5/29 10:59:27 ********** STARTER starting up ***********
5/29 10:59:27 ** $CondorVersion: 7.0.1 Feb 26 2008 BuildID: 76180 $
5/29 10:59:27 ** $CondorPlatform: X86_64-LINUX_RHEL5 $
5/29 10:59:27 ******************************************
5/29 10:59:27 Submitting machine is "sugar-internal.sugar"
5/29 10:59:27 EventHandler {
5/29 10:59:27 func = 0x4ce4fe
5/29 10:59:27 mask = SIGALRM SIGHUP SIGINT SIGUSR1 SIGUSR2 SIGCHLD SIGTSTP
5/29 10:59:27 }
5/29 10:59:27 Done setting resource limits
5/29 10:59:27 *FSM* Transitioning to state "GET_PROC"
5/29 10:59:27 *FSM* Executing state func "get_proc()" [ ]
5/29 10:59:27 Entering get_proc()
5/29 10:59:27 Entering get_job_info()
5/29 10:59:27 Startup Info:
5/29 10:59:27 Version Number: 1
5/29 10:59:27 Id: 47899.0
5/29 10:59:27 JobClass: STANDARD
5/29 10:59:27 Uid: 620
5/29 10:59:27 Gid: 1001
5/29 10:59:27 VirtPid: -1
5/29 10:59:27 SoftKillSignal: 20
5/29 10:59:27 Cmd: "/home/dbrown/projects/daswg/condor/ckpt/lalapps_inspiral"
5/29 10:59:27 Args: "--enable-output --trig-end-time 0 --cluster-method window --dynamic-range-exponent 6.900000e+01 --bank-file H1-TMPLTBANK-817701712-2048.xml.gz --high-pass-order 8 --strain-high-pass-order 8 --ifo-tag FIRST --gps-end-time 817703760 --calibrated-data real_8 --channel-name H1:LSC-STRAIN --snr-threshold 5.5 --number-of-segments 15 --trig-start-time 0 --enable-high-pass 3.000000e+01 --debug-level 33 --gps-start-time 817701712 --high-pass-attenuation 1.000000e-01 --chisq-bins 0 --inverse-spec-length 16 --segment-length 1048576 --low-frequency-cutoff 4.000000e+01 --pad-data 8 --cluster-window 1.600000e+01 --sample-rate 4096 --chisq-threshold 10.0 --resample-filter ldas --strain-high-pass-atten 1.000000e-01 --strain-high-pass-freq 3.000000e+01 --segment-overlap 524288 --frame-cache cache/H-H1_RDS_C03_L2-817701704-817707343.cache --chisq-delta 2.000000e-01 --bank-veto-subbank-size 1 --approximant FindChirpSP --order twoPN --spectrum-type median --write-compress --disable-rsq-veto --enable-filter-inj-only --data-checkpoint --checkpoint-path /home/dbrown/projects/daswg/condor/ckpt --verbose"
5/29 10:59:27 Env: ""
5/29 10:59:27 Iwd: "/home/dbrown/projects/daswg/condor/ckpt"
5/29 10:59:27 Ckpt Wanted: TRUE
5/29 10:59:27 Is Restart: FALSE
5/29 10:59:27 Core Limit Valid: TRUE
5/29 10:59:27 Coredump Limit 0
5/29 10:59:27 User uid set to 620
5/29 10:59:27 User uid set to 1001
5/29 10:59:27 User Process 47899.0 {
5/29 10:59:27 cmd = /home/dbrown/projects/daswg/condor/ckpt/lalapps_inspiral
5/29 10:59:27 args = --enable-output --trig-end-time 0 --cluster-method window --dynamic-range-exponent 6.900000e+01 --bank-file H1-TMPLTBANK-817701712-2048.xml.gz --high-pass-order 8 --strain-high-pass-order 8 --ifo-tag FIRST --gps-end-time 817703760 --calibrated-data real_8 --channel-name H1:LSC-STRAIN --snr-threshold 5.5 --number-of-segments 15 --trig-start-time 0 --enable-high-pass 3.000000e+01 --debug-level 33 --gps-start-time 817701712 --high-pass-attenuation 1.000000e-01 --chisq-bins 0 --inverse-spec-length 16 --segment-length 1048576 --low-frequency-cutoff 4.000000e+01 --pad-data 8 --cluster-window 1.600000e+01 --sample-rate 4096 --chisq-threshold 10.0 --resample-filter ldas --strain-high-pass-atten 1.000000e-01 --strain-high-pass-freq 3.000000e+01 --segment-overlap 524288 --frame-cache cache/H-H1_RDS_C03_L2-817701704-817707343.cache --chisq-delta 2.000000e-01 --bank-veto-subbank-size 1 --approximant FindChirpSP --order twoPN --spectrum-type median --write-compress --disable-rsq-veto --enable-filter-inj-only --data-checkpoint --checkpoint-path /home/dbrown/projects/daswg/condor/ckpt --verbose
5/29 10:59:27 env = _CONDOR_SLOT=slot4 CONDOR_VM=slot4 CONDOR_SCRATCH_DIR=/usr1/condor/execute/dir_23827 _condor_BIND_ALL_INTERFACES=FALSE
5/29 10:59:27 local_dir = dir_23827
5/29 10:59:27 cur_ckpt = dir_23827/condor_exec.47899.0
5/29 10:59:27 core_name = (either 'core' or 'core.<pid>')
5/29 10:59:27 uid = 620, gid = 1001
5/29 10:59:27 v_pid = -1
5/29 10:59:27 pid = (NOT CURRENTLY EXECUTING)
5/29 10:59:27 exit_status_valid = FALSE
5/29 10:59:27 exit_status = (NEVER BEEN EXECUTED)
5/29 10:59:27 ckpt_wanted = TRUE
5/29 10:59:27 coredump_limit_exists = TRUE
5/29 10:59:27 coredump_limit = 0
5/29 10:59:27 soft_kill_sig = 20
5/29 10:59:27 job_class = STANDARD
5/29 10:59:27 state = NEW
5/29 10:59:27 new_ckpt_created = FALSE
5/29 10:59:27 ckpt_transferred = FALSE
5/29 10:59:27 core_created = FALSE
5/29 10:59:27 core_transferred = FALSE
5/29 10:59:27 exit_requested = FALSE
5/29 10:59:27 image_size = -1 blocks
5/29 10:59:27 user_time = 0
5/29 10:59:27 sys_time = 0
5/29 10:59:27 guaranteed_user_time = 0
5/29 10:59:27 guaranteed_sys_time = 0
5/29 10:59:27 }
5/29 10:59:27 *FSM* Transitioning to state "GET_EXEC"
5/29 10:59:27 *FSM* Executing state func "get_exec()" [ SUSPEND
VACATE DIE ]
5/29 10:59:27 Entering get_exec()
5/29 10:59:27 Executable is located on submitting host
5/29 10:59:27 Expanded executable name is "/usr1/condor/spool/cluster47899.ickpt.subproc0"
5/29 10:59:27 Going to try 3 attempts at getting the initial executable
5/29 10:59:27 Entering get_file( /usr1/condor/spool/cluster47899.ickpt.subproc0, dir_23827/condor_exec.47899.0, 0755 )
5/29 10:59:27 Opened "/usr1/condor/spool/cluster47899.ickpt.subproc0" via file stream
5/29 10:59:27 Get_file() transferred 14790995 bytes, 103210598 bytes/second
5/29 10:59:27 Fetched orig ckpt file "/usr1/condor/spool/cluster47899.ickpt.subproc0" into "dir_23827/condor_exec.47899.0" with 1 attempt
5/29 10:59:28 Executable 'dir_23827/condor_exec.47899.0' is linked
with "$CondorVersion: 7.0.1 Feb 26 2008 BuildID: 76180 $" on a
"$CondorPlatform: X86_64-LINUX_RHEL5 $"
5/29 10:59:28 *FSM* Executing transition function "spawn_all"
5/29 10:59:28 Pipe built
5/29 10:59:28 New pipe_fds[14,1]
5/29 10:59:28 cmd_fd = 14
5/29 10:59:28 Calling execve(
"/usr1/condor/execute/dir_23827/condor_exec.47899.0",
"condor_exec.47899.0", "-_condor_cmd_fd", "14", "--enable-output",
"--trig-end-time", "0", "--cluster-method", "window",
"--dynamic-range-exponent", "6.900000e+01", "--bank-file",
"H1-TMPLTBANK-817701712-2048.xml.gz", "--high-pass-order", "8",
"--strain-high-pass-order", "8", "--ifo-tag", "FIRST",
"--gps-end-time", "817703760", "--calibrated-data", "real_8",
"--channel-name", "H1:LSC-STRAIN", "--snr-threshold", "5.5",
"--number-of-segments", "15", "--trig-start-time", "0",
"--enable-high-pass", "3.000000e+01", "--debug-level", "33",
"--gps-start-time", "817701712", "--high-pass-attenuation",
"1.000000e-01", "--chisq-bins", "0", "--inverse-spec-length", "16",
"--segment-length", "1048576", "--low-frequency-cutoff",
"4.000000e+01", "--pad-data", "8", "--cluster-window",
"1.600000e+01", "--sample-rate", "4096", "--chisq-threshold", "10.0",
"--resample-filter", "ldas", "--strain-high-pass-atten",
"1.000000e-01", "--strain-high-pass-freq", "3.000000e+01",
"--segment-overlap", "524288", "--frame-cache",
"cache/H-H1_RDS_C03_L2-817701704-817707343.cache", "--chisq-delta",
"2.000000e-01", "--bank-veto-subbank-size", "1", "--approximant",
"FindChirpSP", "--order", "twoPN", "--spectrum-type", "median",
"--write-compress", "--disable-rsq-veto", "--enable-filter-inj-only",
"--data-checkpoint", "--checkpoint-path",
"/home/dbrown/projects/daswg/condor/ckpt", "--verbose", 0,
"_CONDOR_SLOT=slot4", "CONDOR_VM=slot4",
"CONDOR_SCRATCH_DIR=/usr1/condor/execute/dir_23827",
"_condor_BIND_ALL_INTERFACES=FALSE", 0 )
5/29 10:59:28 Started user job - PID = 23829
5/29 10:59:28 cmd_fp = 0xe245690
5/29 10:59:28 end
5/29 10:59:28 *FSM* Transitioning to state "SUPERVISE"
5/29 10:59:28 *FSM* Got asynchronous event "CHILD_EXIT"
5/29 10:59:28 *FSM* Executing transition function "reaper"
5/29 10:59:28 *FSM* Aborting transition function "reaper"
5/29 10:59:28 *FSM* Executing state func "supervise_all()" [
GET_NEW_PROC SUSPEND VACATE ALARM DIE CHILD_EXIT PERIODIC_CKPT ]
5/29 11:01:26 *FSM* Got asynchronous event "VACATE"
5/29 11:01:26 *FSM* Executing transition function "req_vacate"
5/29 11:01:26 req_ckpt_exit_all: Proc -1 in state EXECUTING
5/29 11:01:26 Requesting Exit on proc #-1
5/29 11:01:26 UserProc::send_sig_no_privsep(): Sent signal SIGCONT to
user job 23829
5/29 11:01:26 UserProc::send_sig(): Sent signal SIGTSTP to user job
23829
5/29 11:01:26 *FSM* Transitioning to state "TERMINATE"
5/29 11:01:26 *FSM* Executing state func "terminate_all()" [ ]
5/29 11:01:26 *FSM* Transitioning to state "TERMINATE_WAIT"
5/29 11:01:26 *FSM* Executing state func "asynch_wait()" [ SUSPEND
ALARM DIE CHILD_EXIT ]
5/29 11:06:28 *FSM* Got asynchronous event "DIE"
5/29 11:06:28 *FSM* Executing transition function "req_die"
5/29 11:06:28 req_exit_all: Proc -1 in state EXECUTING
5/29 11:06:28 Requesting Exit on proc #-1
5/29 11:06:28 UserProc::send_sig_no_privsep(): Sent signal SIGCONT to
user job 23829
5/29 11:06:28 UserProc::send_sig(): Sent signal SIGKILL to user job
23829
5/29 11:06:28 *FSM* Got asynchronous event "CHILD_EXIT"
5/29 11:06:28 *FSM* Executing transition function "reaper"
5/29 11:06:28 Process 23829 killed by signal 9
5/29 11:06:28 Process exited by request
5/29 11:06:28 *FSM* Transitioning to state "TERMINATE"
5/29 11:06:28 *FSM* Executing state func "terminate_all()" [ ]
5/29 11:06:28 *FSM* Transitioning to state "SEND_STATUS_ALL"
5/29 11:06:28 *FSM* Executing state func "dispose_all()" [ ]
5/29 11:06:28 Sending final status for process 47899.0
5/29 11:06:28 STATUS encoded as CKPT, *NOT* TRANSFERRED
5/29 11:06:28 User time = 0.000000 seconds
5/29 11:06:28 System time = 0.000000 seconds
5/29 11:06:28 Unlinked "dir_23827/condor_exec.47899.0"
5/29 11:06:28 Removing directory "dir_23827"
5/29 11:06:28 *FSM* Reached state "END"
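The log above shows a two-step eviction: the starter sends SIGCONT and then the soft kill signal (soft_kill_sig = 20, reported as SIGTSTP), waits, and about five minutes later sends SIGKILL because the job never exited. A minimal C sketch of that escalation pattern follows. The function name vacate_child and the grace_sec parameter are hypothetical, chosen for illustration; this is not the Condor starter's code.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: ask child `pid` to exit with `soft_sig`, wait up to
 * `grace_sec` seconds, then fall back to SIGKILL.
 * Returns 0 if the child exited within the grace period, 1 if SIGKILL was
 * needed, and -1 on error. */
static int vacate_child(pid_t pid, int soft_sig, int grace_sec)
{
    int status;
    assert(pid > 0 && grace_sec >= 0);

    /* Wake the child first so a stopped process can see the request. */
    if (kill(pid, SIGCONT) < 0 || kill(pid, soft_sig) < 0)
        return -1;

    /* Poll ten times per second for the child to exit on its own. */
    for (int i = 0; i < grace_sec * 10; i++) {
        pid_t r = waitpid(pid, &status, WNOHANG);
        if (r == pid)
            return 0;
        if (r < 0)
            return -1;
        struct timespec tick = { 0, 100000000L };
        nanosleep(&tick, NULL);
    }

    /* Grace period expired: hard kill and reap the child. */
    if (kill(pid, SIGKILL) < 0)
        return -1;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return 1;
}
```

A return value of 1 corresponds to the outcome in this log: the soft signal did not lead to an exit, so no checkpoint was produced before the hard kill.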
On May 27, 2008, at 5:50 PM, condor-admin response tracking system
wrote:
> Hello,
>
> As a potential workaround when running this job under Condor, what if
> you specified this in the submit description file:
>
> file_remaps = "/var/run/nscd/socket=local:/this-file-should-not-exist"
>
> Then, when the stduniv libraries try to open the former, it gets
> redirected to the latter, and you get ENOENT. If any fallback
> mechanism exists in the glibc, this should show it.
>
> Thank you.
>
> -pete
>
>
> ========================================
> MESSAGE INFORMATION
> ========================================
> * From: Peter Keller <psilord__AT__cs.wisc.edu>
> * Ticket Email List: dabrown__AT__physics.syr.edu,
> anderson__AT__ligo.caltech.edu,skoranda__AT__gravity.phys.uwm.edu,'__AT__cs.wisc.edu
--
Duncan Brown Room 263-1, Department of Physics,
Assistant Professor of Physics Syracuse University, NY 13244, USA
Phone: (315) 443 5993 http://www.gravity.phy.syr.edu/~duncan
===========================================================================
Date mail was appended: Thu May 29 10:26:00 2008 (1212074761)
CC: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>,
Scott Koranda <skoranda__AT__gravity.phys.uwm.edu>
From: Duncan Brown <dabrown__AT__physics.syr.edu>
Subject: Re: [condor-admin #17975] LIGO: CentOS 5 condor jobs are not
checkpointing
Date: Thu, 29 May 2008 11:51:04 -0400
To: Stuart Anderson <anderson__AT__ligo.caltech.edu>
X-Scanner: InterScan AntiVirus for Sendmail
X-Seen-BY: mailfromd 4.1 granite.cs.wisc.edu
Hi Stuart and Pete,
The code uses geteuid() to get the effective userid of the process
and then getpwuid() to map that to a username. The code does have a
fallback, so that if getpwuid() fails and returns NULL, only the
numeric uid is stored.
Pete, what do you think of these two possible solutions:
1. Fix getpwuid() in the standard universe so it correctly returns
the username.
2. Add a standard universe implementation of getpwuid() that just
returns NULL.
Obviously 1 is preferable; however, in standard universe code condor
compiled on FC4 the call to getpwuid() fails and returns NULL, so 2
would be no worse than the existing situation. I recall now that I
put in the fallback to store the uid to handle condor compiled code.
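For reference, the lookup-with-fallback described above can be sketched as a small self-contained C function. The name format_username and the buffer handling are assumptions made for illustration; the real code writes into ptable->username with LIGOMETA_USERNAME_MAX, as quoted elsewhere in this ticket.

```c
#include <assert.h>
#include <pwd.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write the user name for `uid` into `buf`. If
 * getpwuid() finds no entry (it returns NULL), fall back to the decimal
 * uid. Returns the snprintf() result. */
static int format_username(uid_t uid, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);

    struct passwd *pw = getpwuid(uid);
    if (pw == NULL)
        return snprintf(buf, len, "%lu", (unsigned long)uid);
    return snprintf(buf, len, "%s", pw->pw_name);
}
```

Calling format_username(geteuid(), buf, sizeof buf) always leaves something usable in buf, which is why option 2 would behave like the existing FC4 case.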
How long do you think it would take to implement either of these two
options?
Cheers,
Duncan.
On May 28, 2008, at 4:52 PM, Stuart Anderson wrote:
> Duncan,
> It might also be worth checking whether the getuid() or possibly
> geteuid() provide all the information this code needs without opening
> a socket.
>
> Thanks.
>
>
> On Tue, May 27, 2008 at 04:50:41PM -0500, condor-admin response
> tracking system wrote:
>> Hello,
>>
>> As a potential workaround when running this job under Condor, what if
>> you specified this in the submit description file:
>>
>> file_remaps = "/var/run/nscd/socket=local:/this-file-should-not-exist"
>>
>> Then, when the stduniv libraries try to open the former, it gets
>> redirected to the latter, and you get ENOENT. If any fallback
>> mechanism exists in the glibc, this should show it.
>>
>> Thank you.
>>
>> -pete
>>
>>
>> ========================================
>> MESSAGE INFORMATION
>> ========================================
>> * From: Peter Keller <psilord__AT__cs.wisc.edu>
>> * Ticket Email List: dabrown__AT__physics.syr.edu,
>> anderson__AT__ligo.caltech.edu,skoranda__AT__gravity.phys.uwm.edu,'__AT__cs.wisc.edu
>>
>
> --
> Stuart Anderson anderson__AT__ligo.caltech.edu
> http://www.ligo.caltech.edu/~anderson
--
Duncan Brown Room 263-1, Department of Physics,
Assistant Professor of Physics Syracuse University, NY 13244, USA
Phone: (315) 443 5993 http://www.gravity.phy.syr.edu/~duncan
===========================================================================
Date mail was appended: Thu May 29 10:50:53 2008 (1212076254)
CC: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>,
Scott Koranda <skoranda__AT__gravity.phys.uwm.edu>, Jaime Frey
<jfrey__AT__cs.wisc.edu>
From: Duncan Brown <dabrown__AT__physics.syr.edu>
Subject: Re: [condor-admin #17975] LIGO: CentOS 5 condor jobs are not
checkpointing
Date: Thu, 29 May 2008 12:16:56 -0400
To: Stuart Anderson <anderson__AT__ligo.caltech.edu>
X-Scanner: InterScan AntiVirus for Sendmail
X-Seen-BY: mailfromd 4.1 obsidian.cs.wisc.edu
Hi Stuart,
On May 27, 2008, at 12:56 PM, Stuart Anderson wrote:
> Is this issue CentOS 5 specific? i.e., do you have this problem
> on any of the other LIGO FC4 sites running the same 7.0.1 version
> of Condor
> with the same application code base?
The call to getpwuid() on FC4 fails and returns NULL, and so
the code just stores the uid.
> Is the call to getpwuid() new to the CBC code base since we
> first passed the CentOS 5 acceptance testing with an earlier version
> of Condor 7.0.x at Syracuse?
No, the call has been there for the last four years. I deleted the
directory with the original testing I did, so assume that I screwed
up the testing, in the absence of hard evidence to the contrary.
> Is there an acceptable work around for the CBC code to avoid
> this call for the short term or is this still a blocking bug for
> upgrading the rest of the LDG?
Modifying the code would likely cause problems, as all users would
have to make sure that patch was in their code. They wouldn't realize
> Does this have any causal relationship to the stderr issue?
No, code without the call to getpwuid() still exhibits the stderr issue.
Cheers,
Duncan.
> Thanks.
>
>
> On Tue, May 27, 2008 at 12:04:35PM -0400, Duncan Brown wrote:
>> Hi Pete,
>>
>> On May 27, 2008, at 11:47 AM, condor-admin response tracking system
>> wrote:
>>
>>> On Fri, May 23, 2008 at 05:24:45PM -0500, condor-admin response
>>> tracking system
>>> wrote:
>>>> The offending function preventing checkpointing is getpwuid(). If I
>>>> comment out the following lines:
>>>>
>>>> if(!(pw = getpwuid(uid)))
>>>>     snprintf(ptable->username, LIGOMETA_USERNAME_MAX, "%d", uid);
>>>> else
>>>>     snprintf(ptable->username, LIGOMETA_USERNAME_MAX, "%s", pw->pw_name);
>>>>
>>>> then the inspiral code successfully checkpoints when told to. The
>>>> presence of the gethostname() call makes no difference.
>>>>
>>>> I'll check the stderr stuff tomorrow.
>>>
>>> It looks like the nscd.conf file has a method by which one may turn
>>> off using the /var/run/nscd/socket communication method. Check out
>>> setting 'persistent passwd no' and 'shared passwd yes' in the config
>>> file and see if that fixes the default behavior.
>>
>> No, it doesn't fix the checkpointing. I set
>>
>> enable-cache passwd yes
>> positive-time-to-live passwd 600
>> negative-time-to-live passwd 20
>> suggested-size passwd 211
>> check-files passwd yes
>> persistent passwd no
>> shared passwd yes
>> max-db-size passwd 33554432
>> auto-propagate passwd yes
>>
>> and the code still refuses to checkpoint, with or without the nscd
>> daemon running.
>>
>>> Even if it does, though, I still should figure out a method in
>>> stduniv
>>> by which no configuration changes on your part are needed.
>>
>> Great, thanks.
>>
>> Cheers,
>> Duncan.
>>
>>
>>> Thank you.
>>>
>>> -pete
>>>
>>>
>>>
>>> ========================================
>>> MESSAGE INFORMATION
>>> ========================================
>>> * From: Peter Keller <psilord__AT__cs.wisc.edu>
>>> * Ticket Email List: dabrown__AT__physics.syr.edu,
>>> anderson__AT__ligo.caltech.edu,skoranda__AT__gravity.phys.uwm.edu,'__AT__cs.wisc.edu
>>
>> --
>>
>> Duncan Brown Room 263-1, Department of
>> Physics,
>> Assistant Professor of Physics Syracuse University, NY
>> 13244, USA
>> Phone: (315) 443 5993 http://www.gravity.phy.syr.edu/
>> ~duncan
>>
>>
>
>
>
> --
> Stuart Anderson anderson__AT__ligo.caltech.edu
> http://www.ligo.caltech.edu/~anderson
--
Duncan Brown Room 263-1, Department of Physics,
Assistant Professor of Physics Syracuse University, NY 13244, USA
Phone: (315) 443 5993 http://www.gravity.phy.syr.edu/~duncan
===========================================================================
Date mail was appended: Thu May 29 11:16:52 2008 (1212077814)
CC: Stuart Anderson <anderson__AT__ligo.caltech.edu>, condor-admin
response tracking system <condor-admin__AT__cs.wisc.edu>, Scott Koranda
<skoranda__AT__gravity.phys.uwm.edu>, Jaime Frey <jfrey__AT__cs.wisc.edu>
From: Duncan Brown <dabrown__AT__physics.syr.edu>
Subject: Re: [condor-admin #17975] LIGO: CentOS 5 condor jobs are not
checkpointing
Date: Thu, 29 May 2008 16:35:38 -0400
To: Duncan Brown <dabrown__AT__physics.syr.edu>
X-Scanner: InterScan AntiVirus for Sendmail
X-Seen-BY: mailfromd 4.1 obsidian.cs.wisc.edu
Hi all,
On May 29, 2008, at 12:16 PM, Duncan Brown wrote:
>> Does this have any causal relationship to the stderr issue?
>
> No, code without the call to getpwuid() still exhibits the stderr
> issue.
Sorry, I take this back! If I remove the call to getpwuid() then I DO
see the error message in the log file.
If the code is
uid = geteuid();
/* if(!(pw = getpwuid(uid))) */
if(1)
    snprintf(ptable->username, LIGOMETA_USERNAME_MAX, "%d", uid);
else
    snprintf(ptable->username, LIGOMETA_USERNAME_MAX, "%s", pw->pw_name);
ptable->process_id = process_id;
I get
[dbrown@sugar stderr]$ condor_submit inspiral_test.sub
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 48223.
000 (48223.000.000) 05/29 16:27:37 Job submitted from host:
<10.20.1.23:33101>
...
001 (48223.000.000) 05/29 16:27:55 Job executing on host:
<10.20.2.43:57593>
...
005 (48223.000.000) 05/29 16:27:55 Job terminated.
(1) Normal termination (return value 1)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
773 - Run Bytes Sent By Job
10591708 - Run Bytes Received By Job
773 - Total Bytes Sent By Job
10591708 - Total Bytes Received By Job
...
[dbrown@sugar stderr]$ cat inspiral_test.err
condor_exec.48223.0: unrecognized option `--badgers'
[dbrown@sugar stderr]$
If I remove the if(1) and put the call to getpwuid() back in, so the
code is:
if(!(pw = getpwuid(uid)))
    snprintf(ptable->username, LIGOMETA_USERNAME_MAX, "%d", uid);
else
    snprintf(ptable->username, LIGOMETA_USERNAME_MAX, "%s", pw->pw_name);
ptable->process_id = process_id;
I get no output to stderr:
[dbrown@sugar stderr]$ condor_submit inspiral_test.sub
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 48224.
000 (48224.000.000) 05/29 16:31:55 Job submitted from host:
<10.20.1.23:33101>
...
001 (48224.000.000) 05/29 16:32:18 Job executing on host:
<10.20.2.43:57593>
...
005 (48224.000.000) 05/29 16:32:18 Job terminated.
(1) Normal termination (return value 1)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
941 - Run Bytes Sent By Job
10604281 - Run Bytes Received By Job
941 - Total Bytes Sent By Job
10604281 - Total Bytes Received By Job
[dbrown@sugar stderr]$ cat inspiral_test.err
[dbrown@sugar stderr]$
So condor ticket 17983 is directly related to this ticket. Sorry for
the confusion.
Cheers,
Duncan.
--
Duncan Brown Room 263-1, Department of Physics,
Assistant Professor of Physics Syracuse University, NY 13244, USA
Phone: (315) 443 5993 http://www.gravity.phy.syr.edu/~duncan
===========================================================================
Date mail was appended: Thu May 29 15:35:26 2008 (1212093327)
Date: Fri, 30 May 2008 11:23:55 -0500
From: Peter Keller <psilord__AT__cs.wisc.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
Subject: Re: [condor-admin #17975] LIGO: CentOS 5 condor jobs are not
checkpointing
Hello,
On Thu, May 29, 2008 at 10:26:00AM -0500, condor-admin response tracking system wrote:
> That doesn't seem to work either, but I don't see any mention of the
> file_remap in the Shadow or the Starter log.
Ok, after looking into this it turns out that the file remapping codebase
only handles files opened with open() (and a few other things). For this
unix domain socket, the filename is carried in the sockaddr parameter of
the bind() call, and we don't have a remapping system for filenames
passed that way.
So this is why the remap failed to work.
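
To make the point above concrete, here is a minimal sketch in C of how a
unix domain socket filename reaches the kernel. It is an illustration only,
not Condor or glibc code; the function names make_unix_addr and
bind_unix_socket and the example path are hypothetical. The path is copied
into a struct sockaddr_un and handed to bind() (connect() takes the same
structure), so it never passes through open(), and a remapper that only
rewrites open() arguments does not see it.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Hypothetical helper: fill a sockaddr_un with a filesystem path, as a
 * caller would before bind() or connect(). Returns 0 on success and -1
 * if the path does not fit in sun_path. */
int make_unix_addr(const char *path, struct sockaddr_un *addr)
{
    assert(path != NULL && addr != NULL);
    if (strlen(path) >= sizeof(addr->sun_path))
        return -1;                     /* sun_path is a small fixed array */
    memset(addr, 0, sizeof(*addr));
    addr->sun_family = AF_UNIX;
    strcpy(addr->sun_path, path);      /* safe: length checked above */
    return 0;
}

/* Hypothetical helper: create a stream socket and bind it to `path`.
 * Returns the descriptor, or -1 on failure. Note that the filename is
 * only ever visible inside the sockaddr argument of bind(). */
int bind_unix_socket(const char *path)
{
    struct sockaddr_un addr;
    int fd;

    if (make_unix_addr(path, &addr) != 0)
        return -1;
    fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

A caller would use it as bind_unix_socket("/tmp/example.sock"), where the
path is a placeholder. Remapping such a name would need a hook on the
sockaddr contents, not on open().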
-pete
===========================================================================
Date mail was appended: Fri May 30 11:24:00 2008 (1212164641)
Date: Fri, 30 May 2008 13:27:29 -0500
From: Peter Keller <psilord__AT__cs.wisc.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
Subject: Re: [condor-admin #17975] LIGO: CentOS 5 condor jobs are not
checkpointing
Hello,
Ok, I have determined what I am going to do given the current software
layers and features stduniv supports. I will patch glibc itself to
force it not to default to opening the nscd unix domain socket. In our
case, we never want that socket opened in the case of a stduniv, and the
fallback code in glibc should do the right thing. Of course, I will test
it to ensure it works and give you a test executable to test as well.
I am trying hard to get this into our imminent stable release of Condor.
This should give you full ability to get the real information you require
from getpwuid() (and the other calls going through nscd) via alternative
resolvers the stduniv glibc knows about.
Thank you.
-pete
===========================================================================
Date mail was appended: Fri May 30 13:27:30 2008 (1212172051)
CC: Stuart Anderson <anderson__AT__ligo.caltech.edu>, Scott Koranda
<skoranda__AT__gravity.phys.uwm.edu>
From: Duncan Brown <dabrown__AT__physics.syr.edu>
Subject: Re: [condor-admin #17975] LIGO: CentOS 5 condor jobs are not
checkpointing
Date: Fri, 30 May 2008 14:49:19 -0400
To: condor-admin__AT__cs.wisc.edu
X-Scanner: InterScan AntiVirus for Sendmail
X-Seen-BY: mailfromd 4.1 granite.cs.wisc.edu
Hi Pete,
That's great, thanks. Please let me know when you have a new version
that you want me to test. Does this also solve the stderr problem?
Cheers,
Duncan.
--
Duncan Brown Room 263-1, Department of Physics,
Assistant Professor of Physics Syracuse University, NY 13244, USA
Phone: (315) 443 5993 http://www.gravity.phy.syr.edu/~duncan
===========================================================================
Date mail was appended: Fri May 30 13:49:08 2008 (1212173349)
Date: Fri, 30 May 2008 13:51:41 -0500
From: Peter Keller <psilord__AT__cs.wisc.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
Subject: Re: [condor-admin #17975] LIGO: CentOS 5 condor jobs are not
checkpointing
On Fri, May 30, 2008 at 01:49:08PM -0500, condor-admin response tracking system wrote:
> That's great, thanks. Please let me know when you have a new version
> that you want me to test. Does this also solve the stderr problem?
I don't know yet whether it solves the stderr problem. Once I finish testing
correct checkpoint deferral, I'll see if it solves the other thing.
Thank you.
-pete
===========================================================================
Date mail was appended: Fri May 30 13:51:46 2008 (1212173506)
Date: Sat, 31 May 2008 17:26:39 -0500
From: Peter Keller <psilord__AT__cs.wisc.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
Subject: Re: [condor-admin #17975] LIGO: CentOS 5 condor jobs are not
checkpointing
Hello,
On Fri, May 30, 2008 at 01:49:08PM -0500, condor-admin response tracking system wrote:
> That's great, thanks. Please let me know when you have a new version
> that you want me to test. Does this also solve the stderr problem?
If you could go to the ftp repository here and grab a.out.cndr and run it
on the machines with nscd running, we can see if my fix will help.
ftp ftp.cs.wisc.edu
login: ftp
passwd: email address
bin
cd condor/temporary/forligo/getpwuid_fix
get foo.c
get a.out.cndr
quit
The binary is a stduniv binary linked against my newly patched glibc
which ignores the nscd under all conditions. foo.c is the source code
which produced the binary.
You'd run it like this on an x86_64 machine:
setarch x86_64 -L -R -B ./a.out.cndr <uid>
If my fix is successful, it will immediately checkpoint with a SIGUSR2 and
no output (other than the usual Condor header).
Then, restart it like this:
setarch x86_64 -L -R -B ./a.out.cndr -_condor_restart ./a.out.cndr.ckpt
and if the uid was in /etc/passwd, you should get the username for it,
if the uid is not in /etc/passwd (suppose you use LDAP or something),
then you'll get the uid in numeric form as output.
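
For readers without access to the FTP directory, the lookup behaviour
described above can be sketched in C as follows. This is an illustration
only and is not the actual foo.c: the function name uid_to_text is
hypothetical, and the checkpoint step is left out. It prints the login name
when getpwuid() finds an entry and falls back to the numeric uid when it
does not.

```c
#include <assert.h>
#include <pwd.h>
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical helper: write the login name for `uid` into buf, or the
 * uid in decimal when no passwd entry can be found. Returns what
 * snprintf() returns, i.e. the length of the text it produced. */
int uid_to_text(uid_t uid, char *buf, size_t len)
{
    struct passwd *pw;

    assert(buf != NULL && len > 0);
    pw = getpwuid(uid);        /* on a stock glibc this may consult nscd */
    if (pw != NULL)
        return snprintf(buf, len, "%s", pw->pw_name);
    return snprintf(buf, len, "%lu", (unsigned long)uid);
}
```

Calling uid_to_text(620, buf, sizeof(buf)) and printing buf gives either a
username or the text "620", depending on whether the resolvers available to
the process know that uid.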
You can recompile the foo.c file against your installation to confirm that
it is in fact failing to checkpoint with your install of Condor.
Let me know if this works! If it does, I think I can still get my fixes into
the stable series.
Thank you.
-pete
===========================================================================
Date mail was appended: Sat May 31 17:26:43 2008 (1212272806)
CC: Stuart Anderson <anderson__AT__ligo.caltech.edu>, Scott Koranda
<skoranda__AT__gravity.phys.uwm.edu>
From: Duncan Brown <dabrown__AT__physics.syr.edu>
Subject: Re: [condor-admin #17975] LIGO: CentOS 5 condor jobs are not
checkpointing
Date: Sun, 1 Jun 2008 12:31:00 -0400
To: condor-admin__AT__cs.wisc.edu
X-Scanner: InterScan AntiVirus for Sendmail
X-Seen-BY: mailfromd 4.1 granite.cs.wisc.edu
Hi Pete,
Success.
Cheers,
Duncan.
[dbrown@sugar-dev1 ckpt2]$ setarch x86_64 -L -R -B ./a.out.cndr 620
Condor: Notice: Will checkpoint to ./a.out.cndr.ckpt
Condor: Notice: Remote system calls disabled.
User defined signal 2
[dbrown@sugar-dev1 ckpt2]$ setarch x86_64 -L -R -B ./a.out.cndr -_condor_restart ./a.out.cndr.ckpt
Condor: Notice: Will restart from ./a.out.cndr.ckpt
dbrown
[dbrown@sugar-dev1 ckpt2]$ setarch x86_64 -L -R -B ./a.out.cndr 1234
Condor: Notice: Will checkpoint to ./a.out.cndr.ckpt
Condor: Notice: Remote system calls disabled.
User defined signal 2
[dbrown@sugar-dev1 ckpt2]$ setarch x86_64 -L -R -B ./a.out.cndr -_condor_restart ./a.out.cndr.ckpt
Condor: Notice: Will restart from ./a.out.cndr.ckpt
1234
[dbrown@sugar-dev1 ckpt2]$ condor_compile gcc -o a.out foo.c
LINKING FOR CONDOR : /usr/bin/ld -L/usr/lib64 -Bstatic --eh-frame-hdr -m elf_x86_64 --hash-style=gnu -dynamic-linker /lib64/ld-linux-x86-64.so.2 -o a.out /usr/lib64/condor_rt0.o /usr/lib64/crti.o /usr/lib/gcc/x86_64-redhat-linux/4.1.1/crtbeginT.o -L/usr/lib64 -L/usr/lib/gcc/x86_64-redhat-linux/4.1.1 -L/usr/lib/gcc/x86_64-redhat-linux/4.1.1 -L/usr/lib/gcc/x86_64-redhat-linux/4.1.1/../../../../lib64 -L/lib/../lib64 -L/usr/lib/../lib64 /tmp/ccQJzqLM.o /usr/lib64/libcondorsyscall.a /usr/lib64/libcondor_z.a /usr/lib64/libcomp_libstdc++.a /usr/lib64/libcomp_libgcc.a /usr/lib64/libcomp_libgcc_eh.a --as-needed --no-as-needed -lcondor_c -lcondor_nss_files -lcondor_nss_dns -lcondor_resolv -lcondor_c -lcondor_nss_files -lcondor_nss_dns -lcondor_resolv -lcondor_c /usr/lib64/libcomp_libgcc.a /usr/lib64/libcomp_libgcc_eh.a --as-needed --no-as-needed /usr/lib/gcc/x86_64-redhat-linux/4.1.1/crtend.o /usr/lib64/crtn.o
/usr/lib64/libcondorsyscall.a(condor_file_agent.o): In function `CondorFileAgent::open(char const*, int, int)':
/home/condor/execute/dir_15919/userdir/src/condor_ckpt/condor_file_agent.C:106: warning: the use of `tmpnam' is dangerous, better use `mkstemp'
/tmp/ccQJzqLM.o: In function `main':
foo.c:(.text+0x48): warning: Using 'getpwuid' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
/usr/lib64/libcondorsyscall.a(special_stubs.o): In function `condor_gethostbyaddr':
/home/condor/execute/dir_15919/userdir/src/condor_syscall_lib/special_stubs.C:200: warning: Using 'gethostbyaddr' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
/usr/lib64/libcondorsyscall.a(special_stubs.o): In function `condor_gethostbyname':
/home/condor/execute/dir_15919/userdir/src/condor_syscall_lib/special_stubs.C:193: warning: Using 'gethostbyname' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
/usr/lib64/libcondorsyscall.a(sock.o): In function `Sock::getportbyserv(char*)':
/home/condor/execute/dir_15919/userdir/src/condor_io/sock.C:208: warning: Using 'getservbyname' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
[dbrown@sugar-dev1 ckpt2]$ ./a.out 620
Condor: Notice: Will checkpoint to ./a.out.ckpt
Condor: Notice: Remote system calls disabled.
dbrown
===========================================================================
Date mail was appended: Sun Jun 1 11:30:47 2008 (1212337848)
Date: Mon, 2 Jun 2008 10:42:02 -0500
From: Peter Keller <psilord__AT__cs.wisc.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
Subject: Re: [condor-admin #17975] LIGO: CentOS 5 condor jobs are not
checkpointing
On Thu, May 29, 2008 at 03:35:26PM -0500, condor-admin response tracking system wrote:
> Sorry, I take this back! If I remove the call to getpwuid() then I DO
> see the error message in the log file.
I haven't looked at this one in great detail yet and I don't think I
can hold back the release any longer. However, I've modified my foo.c
program using the patched glibc and it emits stderr properly around the
ckpt_and_exit(). I've looked through the stduniv codebase and don't see
any obvious reason why data emitted to stderr would have been tied to
whether or not a checkpoint is deferred.
So, given your statement about stderr working if you commented out
getpwuid(), are you willing to wait and see on the stderr issue after
you get the new release?
Thank you.
-pete
===========================================================================
Date mail was appended: Mon Jun 2 10:42:13 2008 (1212421333)
CC: Stuart Anderson <anderson__AT__ligo.caltech.edu>, Scott Koranda
<skoranda__AT__gravity.phys.uwm.edu>
From: Duncan Brown <dabrown__AT__physics.syr.edu>
Subject: Re: [condor-admin #17975] LIGO: CentOS 5 condor jobs are not
checkpointing
Date: Mon, 2 Jun 2008 13:07:45 -0400
To: condor-admin__AT__cs.wisc.edu
X-Scanner: InterScan AntiVirus for Sendmail
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu
Hi Pete,
Sounds good to me. Let me know when the new release is ready and I'll
re-test our code.
Cheers,
Duncan.
===========================================================================
Date mail was appended: Mon Jun 2 12:07:26 2008 (1212426447)
Date: Mon, 2 Jun 2008 15:48:27 -0500
From: Peter Keller <psilord__AT__cs.wisc.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
Subject: Re: [condor-admin #17975] LIGO: CentOS 5 condor jobs are not
checkpointing
On Mon, Jun 02, 2008 at 12:07:26PM -0500, condor-admin response tracking system wrote:
> Sounds good to me. Let me know when the new release is ready and I'll
> re-test our code.
Ok, the fix for the nscd deferred checkpoint problem is checked in and will
be in Condor 7.0.2.
I'll let you know when it is available.
Would you like me to resolve this ticket? If stderr proves to be a problem
for you, you can always respond to this ticket and we can reopen it.
Thank you.
-pete
===========================================================================
Date mail was appended: Mon Jun 2 15:48:29 2008 (1212439710)
CC: Stuart Anderson <anderson__AT__ligo.caltech.edu>, Scott Koranda
<skoranda__AT__gravity.phys.uwm.edu>
From: Duncan Brown <dabrown__AT__physics.syr.edu>
Subject: Re: [condor-admin #17975] LIGO: CentOS 5 condor jobs are not
checkpointing
Date: Mon, 2 Jun 2008 16:56:58 -0400
To: condor-admin__AT__cs.wisc.edu
X-Scanner: InterScan AntiVirus for Sendmail
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu
Hi Pete,
Yes, please close this one and the stderr ticket. I'll reply if there
is still a problem with stderr.
Cheers,
Duncan.
===========================================================================
Date mail was appended: Mon Jun 2 15:56:39 2008 (1212440200)
Subject: Actions
Ticket resolved by psilord
===========================================================================
Date of actions: Mon Jun 2 16:04:27 2008 (1212440667)