LIGO Support Ticket 18995
Ticket Information
Number: admin 18995
User: anderson@ligo.caltech.edu
Email:
Status: resolved
Assigned To: psilord
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu, Peter Keller <psilord__AT__cs.wisc.edu>
Subject: LIGO: condor_checkpoint fails to generate checkpoint image
Date: Mon, 9 Feb 2009 13:13:28 -0800
X-Seen-BY: mailfromd 4.1 granite.cs.wisc.edu
It was observed while running,
[root@ldas-grid ~]# condor_version
$CondorVersion: 7.2.0 Dec 19 2008 BuildID: 121001 $
$CondorPlatform: X86_64-LINUX_RHEL3 $
on an FC4 x86_64 machine that the command condor_checkpoint to a
machine running 4 Standard Universe jobs failed to produce a check
point image for one of the jobs, instead the job terminated with
signal 6 and was restarted. The user log file showed:
001 (43231768.000.000) 02/09 08:25:45 Job executing on host:
<10.14.1.161:53070>
...
005 (43231768.000.000) 02/09 12:35:34 Job terminated.
(0) Abnormal termination (signal 6)
(0) No core file
Usr 0 04:06:08, Sys 0 00:00:15 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 04:06:08, Sys 0 00:00:15 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
589 - Run Bytes Sent By Job
6226275 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
0 - Total Bytes Received By Job
...
009 (43231768.000.000) 02/09 12:35:35 Job was aborted by the user.
...
000 (43250747.000.000) 02/09 12:35:47 Job submitted from host:
<10.14.0.12:57413>
DAG Node: 6bb15148427f34fea8d95b485dead0ea
...
001 (43250747.000.000) 02/09 12:35:59 Job executing on host:
<10.14.2.35:47977>
...
The Starter log files shows the job starting at 8:25 and then failing
to check point at 12:35,
2/9 08:25:45 ********** STARTER starting up ***********
2/9 08:25:45 ** $CondorVersion: 7.2.0 Dec 19 2008 BuildID: 121001 $
2/9 08:25:45 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
2/9 08:25:45 ******************************************
2/9 08:25:45 Submitting machine is "ldas-grid.ligo.caltech.edu"
2/9 08:25:45 EventHandler {
2/9 08:25:45 func = 0x4ca580
2/9 08:25:45 mask = SIGALRM SIGHUP SIGINT SIGUSR1 SIGUSR2 SIGCHLD
SIGTSTP
2/9 08:25:45 }
2/9 08:25:45 Done setting resource limits
2/9 08:25:45 *FSM* Transitioning to state "GET_PROC"
2/9 08:25:45 *FSM* Executing state func "get_proc()" [ ]
2/9 08:25:45 Entering get_proc()
2/9 08:25:45 Entering get_job_info()
2/9 08:25:45 Startup Info:
2/9 08:25:45 Version Number: 1
2/9 08:25:45 Id: 43231768.0
2/9 08:25:45 JobClass: STANDARD
2/9 08:25:45 Uid: 451
2/9 08:25:45 Gid: 451
2/9 08:25:45 VirtPid: -1
2/9 08:25:45 SoftKillSignal: 20
2/9 08:25:45 Cmd: "/mnt/qfs1/ajw/projects/ihope/run/s5_high_mass/
820555271-831355391/full_data/../executables/lalapps_coire"
2/9 08:25:45 Args: "--eff-snr-denom-fac 50 --output H1H2L1-
COIRE_SLIDE_SUMMARY_SECOND_H1H2L1_FULL_DATA-820555271-10800120.xml --
cluster-time 10000 -
-user-tag FULL_DATA --summary H1H2L1-
COIRE_SLIDE_SUMMARY_SECOND_H1H2L1_FULL_DATA-820555271-10800120.txt --
num-slides 50 --data-type all_data --coinc-st
at effective_snrsq --input H1H2L1-
COIRE_SLIDE_SUMMARY_SECOND_H1H2L1_FULL_DATA-820555271-10800120.input"
2/9 08:25:45 Env: ""
2/9 08:25:45 Iwd: "/mnt/qfs1/ajw/projects/ihope/run/s5_high_mass/
820555271-831355391/full_data"
2/9 08:25:45 Ckpt Wanted: TRUE
2/9 08:25:45 Is Restart: FALSE
2/9 08:25:45 Core Limit Valid: TRUE
2/9 08:25:45 Coredump Limit 0
2/9 08:25:45 User uid set to 451
2/9 08:25:45 User uid set to 451
2/9 08:25:45 User Process 43231768.0 {
2/9 08:25:45 cmd = /mnt/qfs1/ajw/projects/ihope/run/s5_high_mass/
820555271-831355391/full_data/../executables/lalapps_coire
2/9 08:25:45 args = --eff-snr-denom-fac 50 --output H1H2L1-
COIRE_SLIDE_SUMMARY_SECOND_H1H2L1_FULL_DATA-820555271-10800120.xml --
cluster-time 10000 --
user-tag FULL_DATA --summary H1H2L1-
COIRE_SLIDE_SUMMARY_SECOND_H1H2L1_FULL_DATA-820555271-10800120.txt --
num-slides 50 --data-type all_data --coinc-sta
t effective_snrsq --input H1H2L1-
COIRE_SLIDE_SUMMARY_SECOND_H1H2L1_FULL_DATA-820555271-10800120.input
2/9 08:25:45 env = _CONDOR_SLOT=slot4 CONDOR_VM=slot4
CONDOR_SCRATCH_DIR=/usr1/condor/execute/dir_27805
_condor_BIND_ALL_INTERFACES=TRUE
2/9 08:25:45 local_dir = dir_27805
2/9 08:25:45 cur_ckpt = dir_27805/condor_exec.43231768.0
2/9 08:25:45 core_name = (either 'core' or 'core.<pid>')
2/9 08:25:45 uid = 451, gid = 451
2/9 08:25:45 v_pid = -1
2/9 08:25:45 pid = (NOT CURRENTLY EXECUTING)
2/9 08:25:45 exit_status_valid = FALSE
2/9 08:25:45 exit_status = (NEVER BEEN EXECUTED)
2/9 08:25:45 ckpt_wanted = TRUE
2/9 08:25:45 coredump_limit_exists = TRUE
2/9 08:25:45 coredump_limit = 0
2/9 08:25:45 soft_kill_sig = 20
2/9 08:25:45 job_class = STANDARD
2/9 08:25:45 state = NEW
2/9 08:25:45 new_ckpt_created = FALSE
2/9 08:25:45 ckpt_transferred = FALSE
2/9 08:25:45 core_created = FALSE
2/9 08:25:45 core_transferred = FALSE
2/9 08:25:45 exit_requested = FALSE
2/9 08:25:45 image_size = -1 blocks
2/9 08:25:45 user_time = 0
2/9 08:25:45 sys_time = 0
2/9 08:25:45 guaranteed_user_time = 0
2/9 08:25:45 guaranteed_sys_time = 0
2/9 08:25:45 }
2/9 08:25:45 *FSM* Transitioning to state "GET_EXEC"
2/9 08:25:45 *FSM* Executing state func "get_exec()" [ SUSPEND
VACATE DIE ]
2/9 08:25:45 Entering get_exec()
2/9 08:25:45 Executable is located on submitting host
2/9 08:25:45 Expanded executable name is "/usr1/condor/spool/
cluster43231768.ickpt.subproc0"
2/9 08:25:45 Going to try 3 attempts at getting the initial executable
2/9 08:25:45 Entering get_file( /usr1/condor/spool/
cluster43231768.ickpt.subproc0, dir_27805/condor_exec.43231768.0, 0755 )
2/9 08:25:45 Opened "/usr1/condor/spool/
cluster43231768.ickpt.subproc0" via file stream
2/9 08:25:45 Get_file() transferred 6225045 bytes, 113256632 bytes/
second
2/9 08:25:45 Fetched orig ckpt file "/usr1/condor/spool/
cluster43231768.ickpt.subproc0" into "dir_27805/condor_exec.
43231768.0" with 1 attempt
2/9 08:25:45 Executable 'dir_27805/condor_exec.43231768.0' is linked
with "$CondorVersion: 7.2.0 Dec 19 2008 BuildID: 121001 $" on a
"$CondorPlatform:
X86_64-LINUX_RHEL3 $"
2/9 08:25:45 *FSM* Executing transition function "spawn_all"
2/9 08:25:45 Pipe built
2/9 08:25:45 New pipe_fds[14,1]
2/9 08:25:45 cmd_fd = 14
2/9 08:25:45 Calling execve( "/usr1/condor/execute/dir_27805/
condor_exec.43231768.0", "condor_exec.43231768.0", "-_condor_cmd_fd",
"14", "--eff-snr-den
om-fac", "50", "--output", "H1H2L1-
COIRE_SLIDE_SUMMARY_SECOND_H1H2L1_FULL_DATA-820555271-10800120.xml",
"--cluster-time", "10000", "--user-tag", "FULL_
DATA", "--summary", "H1H2L1-
COIRE_SLIDE_SUMMARY_SECOND_H1H2L1_FULL_DATA-820555271-10800120.txt",
"--num-slides", "50", "--data-type", "all_data", "--co
inc-stat", "effective_snrsq", "--input", "H1H2L1-
COIRE_SLIDE_SUMMARY_SECOND_H1H2L1_FULL_DATA-820555271-10800120.input",
0, "_CONDOR_SLOT=slot4", "CONDO
R_VM=slot4", "CONDOR_SCRATCH_DIR=/usr1/condor/execute/dir_27805",
"_condor_BIND_ALL_INTERFACES=TRUE", 0 )
2/9 08:25:45 Started user job - PID = 27809
2/9 08:25:45 cmd_fp = 0xa3cb70
2/9 08:25:45 end
2/9 08:25:45 *FSM* Transitioning to state "SUPERVISE"
2/9 08:25:45 *FSM* Got asynchronous event "CHILD_EXIT"
2/9 08:25:45 *FSM* Executing transition function "reaper"
2/9 08:25:45 *FSM* Aborting transition function "reaper"
2/9 08:25:45 *FSM* Executing state func
"supervise_all()" [ GET_NEW_PROC SUSPEND VACATE ALARM DIE CHILD_EXIT
PERIODIC_CKPT ]
2/9 12:35:28 *FSM* Got asynchronous event "PERIODIC_CKPT"
2/9 12:35:28 *FSM* Executing transition function "periodic_ckpt_all"
2/9 12:35:28 UserProc::send_sig_no_privsep(): Sent signal SIGCONT to
user job 27809
2/9 12:35:28 UserProc::send_sig(): Sent signal SIGUSR2 to user job 27809
2/9 12:35:28 Requested user job to do a periodic checkpoint
2/9 12:35:28 *FSM* Got asynchronous event "CHILD_EXIT"
2/9 12:35:28 *FSM* Executing transition function "reaper"
2/9 12:35:28 Process 27809 killed by signal 6
2/9 12:35:28 Process exited abnormally
2/9 12:35:28 *FSM* Transitioning to state "PROC_EXIT"
2/9 12:35:28 *FSM* Executing state func "proc_exit()" [ DIE ]
2/9 12:35:28 *FSM* Transitioning to state "SEND_CORE"
2/9 12:35:28 *FSM* Executing state func "send_core()" [ SUSPEND
VACATE DIE ]
2/9 12:35:28 No core file to send - probably ran out of disk
2/9 12:35:28 *FSM* Executing transition function "dispose_one"
2/9 12:35:28 Sending final status for process 43231768.0
2/9 12:35:28 STATUS encoded as ABNORMAL, NO CORE
2/9 12:35:28 User time = 14768.740000 seconds
2/9 12:35:28 System time = 15.200000 seconds
2/9 12:35:28 IO: Incoming packet improperly sized (len=0,end=0)
Stack dump for process 27805 at timestamp 1234211728 (13 frames)
condor_starter(dprintf_dump_stack+0x9b)[0x4c602e]
condor_starter[0x4c62ab]
/lib64/libc.so.6[0x331ae2f2b0]
/lib64/libc.so.6(gsignal+0x40)[0x331ae2f200]
/lib64/libc.so.6(abort+0x100)[0x331ae30730]
/lib64/libc.so.6(__assert_fail+0xf6)[0x331ae282b6]
condor_starter(REMOTE_CONDOR_reallyexit+0x1f6)[0x4b01b8]
condor_starter(_Z17send_final_statusP8UserProc+0x9e)[0x4922c4]
condor_starter(_Z11dispose_onev+0x1f)[0x4929a7]
condor_starter(_ZN12StateMachine7executeEv+0x279)[0x4ca513]
condor_starter(main+0xff)[0x491b3d]
/lib64/libc.so.6(__libc_start_main+0xef)[0x331ae1c40f]
condor_starter(tcsetattr+0xaa)[0x49192a]
2/9 12:35:28 *FSM* Got asynchronous event "DIE"
2/9 12:35:28 *FSM* Executing transition function "req_die"
2/9 12:35:28 req_exit_all: Proc -1 in state ABNORMAL_EXIT
2/9 12:35:28 *FSM* Transitioning to state "TERMINATE"
2/9 12:35:28 *FSM* Executing state func "terminate_all()" [ ]
2/9 12:35:28 No core file to send - probably ran out of disk
2/9 12:35:28 *FSM* Transitioning to state "SEND_STATUS_ALL"
2/9 12:35:28 *FSM* Executing state func "dispose_all()" [ ]
2/9 12:35:28 Sending final status for process 43231768.0
2/9 12:35:28 STATUS encoded as ABNORMAL, NO CORE
2/9 12:35:28 User time = 14768.740000 seconds
2/9 12:35:28 System time = 15.200000 seconds
2/9 12:35:34 condor_read(): recv() returned -1, errno = 104, assuming
failure reading 256 bytes from unknown source.
2/9 12:35:34 Buf::read(): condor_read() failed
2/9 12:35:34 IO: Packet read failed: read -1 of 256
2/9 12:36:00 ******************************************************
2/9 12:36:00 ** condor_starter (CONDOR_STARTER) STARTING UP
2/9 12:36:00 ** /usr/sbin/condor_starter
2/9 12:36:00 ** SubsystemInfo: name=STARTER type=STARTER(8)
class=DAEMON(1)
2/9 12:36:00 ** Configuration: subsystem:STARTER local:<NONE>
class:DAEMON
2/9 12:36:00 ** $CondorVersion: 7.2.0 Dec 19 2008 BuildID: 121001 $
2/9 12:36:00 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
2/9 12:36:00 ** PID = 2144
2/9 12:36:00 ** Log last touched 2/9 12:35:34
2/9 12:36:00 ******************************************************
The Checkpoint Server log shows no reference to this job, though it
did apparently successfully checkpoint 3 other jobs running on the
same machine,
2/9 12:29:11 Sending ckpt server ad to collector...
2/9 12:34:11 Sending ckpt server ad to collector...
2/9 12:35:28 ----------------------------------------------------
2/9 12:35:28 [reqid: 392] Receiving STORE request from 10.14.0.12
2/9 12:35:28 Using descriptor 8 to handle request
2/9 12:35:28 Owner: shaon__AT__ldas-grid.ligo.caltech.edu
2/9 12:35:28 File name: cluster43242783.proc0.subproc0.tmp
2/9 12:35:28 File size: 309736479
2/9 12:35:28 STORE service address: 10.14.1.161:55941
2/9 12:35:28 Request to store checkpoint file GRANTED
2/9 12:35:28 Receiving checkpoint to file: ./10.14.0.12/shaon__AT__ldas-grid.ligo.caltech.edu
/cluster43242783.proc0.subproc0.tmp
2/9 12:35:28 ----------------------------------------------------
2/9 12:35:28 [reqid: 393] Receiving STORE request from 10.14.0.12
2/9 12:35:28 Using descriptor 8 to handle request
2/9 12:35:28 Owner: shaon__AT__ldas-grid.ligo.caltech.edu
2/9 12:35:28 File name: cluster43243113.proc0.subproc0.tmp
2/9 12:35:28 File size: 309650463
2/9 12:35:28 STORE service address: 10.14.1.161:53480
2/9 12:35:28 Request to store checkpoint file GRANTED
2/9 12:35:28 ----------------------------------------------------
2/9 12:35:28 Receiving checkpoint to file: ./10.14.0.12/shaon__AT__ldas-grid.ligo.caltech.edu
/cluster43243113.proc0.subproc0.tmp
2/9 12:35:28 [reqid: 394] Receiving STORE request from 10.14.0.12
2/9 12:35:28 Using descriptor 8 to handle request
2/9 12:35:28 Owner: shaon__AT__ldas-grid.ligo.caltech.edu
2/9 12:35:28 File name: cluster43226390.proc0.subproc0.tmp
2/9 12:35:28 File size: 308786207
2/9 12:35:28 STORE service address: 10.14.1.161:36881
2/9 12:35:28 Receiving checkpoint to file: ./10.14.0.12/shaon__AT__ldas-grid.ligo.caltech.edu
/cluster43226390.proc0.subproc0.tmp
2/9 12:35:28 Request to store checkpoint file GRANTED
2/9 12:35:31 ----------------------------------------------------
2/9 12:35:31 [reqid: 395] Receiving SERVICE request from 10.14.0.12
2/9 12:35:31 Using descriptor 8 to handle request
2/9 12:35:31 Service: SERVICE_RENAME
2/9 12:35:31 Owner: shaon__AT__ldas-grid.ligo.caltech.edu
2/9 12:35:31 File: cluster43243113.proc0.subproc0.tmp
2/9 12:35:31 Renaming file: ./10.14.0.12/shaon__AT__ldas-grid.ligo.caltech.edu
/cluster43243113.proc0.subproc0.tmp to ./10.14.0.12/shaon__AT__ldas-grid.ligo.caltech.edu
/cluster43243113.proc0.subproc0
2/9 12:35:31 Deleting file: ./10.14.0.12/shaon__AT__ldas-grid.ligo.caltech.edu
/cluster43243113.proc0.subproc0.tmp
2/9 12:35:31 Service request successfully completed
2/9 12:35:31 ----------------------------------------------------
2/9 12:35:31 [reqid: 396] Receiving SERVICE request from 10.14.0.12
2/9 12:35:31 Using descriptor 8 to handle request
2/9 12:35:31 Service: SERVICE_RENAME
2/9 12:35:31 Owner: shaon__AT__ldas-grid.ligo.caltech.edu
2/9 12:35:31 File: cluster43242783.proc0.subproc0.tmp
2/9 12:35:31 Renaming file: ./10.14.0.12/shaon__AT__ldas-grid.ligo.caltech.edu
/cluster43242783.proc0.subproc0.tmp to ./10.14.0.12/shaon__AT__ldas-grid.ligo.caltech.edu
/cluster43242783.proc0.subproc0
2/9 12:35:31 Deleting file: ./10.14.0.12/shaon__AT__ldas-grid.ligo.caltech.edu
/cluster43242783.proc0.subproc0.tmp
2/9 12:35:31 Service request successfully completed
2/9 12:35:31 ----------------------------------------------------
2/9 12:35:31 [reqid: 397] Receiving SERVICE request from 10.14.0.12
2/9 12:35:31 Using descriptor 8 to handle request
2/9 12:35:31 Service: SERVICE_RENAME
2/9 12:35:31 Owner: shaon__AT__ldas-grid.ligo.caltech.edu
2/9 12:35:31 File: cluster43226390.proc0.subproc0.tmp
2/9 12:35:31 Renaming file: ./10.14.0.12/shaon__AT__ldas-grid.ligo.caltech.edu
/cluster43226390.proc0.subproc0.tmp to ./10.14.0.12/shaon__AT__ldas-grid.ligo.caltech.edu
/cluster43226390.proc0.subproc0
2/9 12:35:31 Deleting file: ./10.14.0.12/shaon__AT__ldas-grid.ligo.caltech.edu
/cluster43226390.proc0.subproc0.tmp
2/9 12:35:31 Service request successfully completed
2/9 12:35:48 [reqid: 392] File successfully received
2/9 12:35:48 RECV transferred 309736479 bytes in 20 seconds (15486823
bytes / sec)
2/9 12:35:52 [reqid: 393] File successfully received
2/9 12:35:53 RECV transferred 309650463 bytes in 25 seconds (12386018
bytes / sec)
2/9 12:35:53 [reqid: 394] File successfully received
2/9 12:35:53 RECV transferred 308786207 bytes in 25 seconds (12351448
bytes / sec)
2/9 12:35:53 ERROR from waitpid(): 10 (No child processes)
2/9 12:39:11 Sending ckpt server ad to collector...
2/9 12:44:11 Sending ckpt server ad to collector...
My hypothesis is that the reason the one job failed to checkpoint
while the other 3 succeeded is that the failed job was using more
memory and was close to the shell imposed "ulimit -v" value. If so,
there is a bug in the checkpoint code for handling the memory
allocation error that resulted in exceeding this limit.
How much memory does the checkpoint library code in a Standard
Universe job need to process a PERIODIC_CKPT event?
Thanks.
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date of creation: Mon Feb 9 15:13:33 2009 (1234214016)
Subject: Actions
Assigned to psilord by nleroy
===========================================================================
Date of actions: Mon Feb 9 9:04:53 2009 (1234215173)
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #18995] LIGO: condor_checkpoint fails to
generate checkpoint image
Date: Mon, 9 Feb 2009 15:38:42 -0800
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu
I have somewhat confirmed the ulimit hypothesis by moving this job to
a different slots with more available memory and then successfully
running condor_checkpoint.
Thanks.
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date mail was appended: Mon Feb 9 17:38:49 2009 (1234222730)
Subject: Actions
Ticket resolved by tannenba
===========================================================================
Date of actions: Fri Mar 13 16:24:44 2009 (1236979484)