LIGO Support Ticket 18995

Ticket Information
  Number:      admin 18995
  User:        anderson@ligo.caltech.edu
  Email:       
  Status:      resolved
  Assigned To: psilord
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu, Peter Keller <psilord__AT__cs.wisc.edu>
Subject: LIGO: condor_checkpoint fails to generate checkpoint image
Date: Mon, 9 Feb 2009 13:13:28 -0800
X-Seen-BY: mailfromd 4.1 granite.cs.wisc.edu

It was observed while running,

[root@ldas-grid ~]# condor_version
$CondorVersion: 7.2.0 Dec 19 2008 BuildID: 121001 $
$CondorPlatform: X86_64-LINUX_RHEL3 $

on an FC4 x86_64 machine that  the command condor_checkpoint to a  
machine running 4 Standard Universe jobs failed to produce a check  
point image for one of the jobs, instead the job terminated with  
signal 6 and was restarted. The user log file showed:

001 (43231768.000.000) 02/09 08:25:45 Job executing on host:  
<10.14.1.161:53070>
...
005 (43231768.000.000) 02/09 12:35:34 Job terminated.
	(0) Abnormal termination (signal 6)
	(0) No core file
		Usr 0 04:06:08, Sys 0 00:00:15  -  Run Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
		Usr 0 04:06:08, Sys 0 00:00:15  -  Total Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
	589  -  Run Bytes Sent By Job
	6226275  -  Run Bytes Received By Job
	0  -  Total Bytes Sent By Job
	0  -  Total Bytes Received By Job
...
009 (43231768.000.000) 02/09 12:35:35 Job was aborted by the user.
...
000 (43250747.000.000) 02/09 12:35:47 Job submitted from host:  
<10.14.0.12:57413>
     DAG Node: 6bb15148427f34fea8d95b485dead0ea
...
001 (43250747.000.000) 02/09 12:35:59 Job executing on host:  
<10.14.2.35:47977>
...


The Starter log files shows the job starting at 8:25 and then failing  
to check point at 12:35,


2/9 08:25:45 ********** STARTER starting up ***********
2/9 08:25:45 ** $CondorVersion: 7.2.0 Dec 19 2008 BuildID: 121001 $
2/9 08:25:45 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
2/9 08:25:45 ******************************************
2/9 08:25:45 Submitting machine is "ldas-grid.ligo.caltech.edu"
2/9 08:25:45 EventHandler {
2/9 08:25:45    func = 0x4ca580
2/9 08:25:45    mask = SIGALRM SIGHUP SIGINT SIGUSR1 SIGUSR2 SIGCHLD  
SIGTSTP
2/9 08:25:45 }
2/9 08:25:45 Done setting resource limits
2/9 08:25:45    *FSM* Transitioning to state "GET_PROC"
2/9 08:25:45    *FSM* Executing state func "get_proc()" [  ]
2/9 08:25:45 Entering get_proc()
2/9 08:25:45 Entering get_job_info()
2/9 08:25:45 Startup Info:
2/9 08:25:45    Version Number: 1
2/9 08:25:45    Id: 43231768.0
2/9 08:25:45    JobClass: STANDARD
2/9 08:25:45    Uid: 451
2/9 08:25:45    Gid: 451
2/9 08:25:45    VirtPid: -1
2/9 08:25:45    SoftKillSignal: 20
2/9 08:25:45    Cmd: "/mnt/qfs1/ajw/projects/ihope/run/s5_high_mass/ 
820555271-831355391/full_data/../executables/lalapps_coire"
2/9 08:25:45    Args: "--eff-snr-denom-fac 50 --output H1H2L1- 
COIRE_SLIDE_SUMMARY_SECOND_H1H2L1_FULL_DATA-820555271-10800120.xml -- 
cluster-time 10000 -
-user-tag FULL_DATA --summary H1H2L1- 
COIRE_SLIDE_SUMMARY_SECOND_H1H2L1_FULL_DATA-820555271-10800120.txt -- 
num-slides 50 --data-type all_data --coinc-st
at effective_snrsq --input H1H2L1- 
COIRE_SLIDE_SUMMARY_SECOND_H1H2L1_FULL_DATA-820555271-10800120.input"
2/9 08:25:45    Env: ""
2/9 08:25:45    Iwd: "/mnt/qfs1/ajw/projects/ihope/run/s5_high_mass/ 
820555271-831355391/full_data"
2/9 08:25:45    Ckpt Wanted: TRUE
2/9 08:25:45    Is Restart: FALSE
2/9 08:25:45    Core Limit Valid: TRUE
2/9 08:25:45    Coredump Limit 0
2/9 08:25:45 User uid set to 451
2/9 08:25:45 User uid set to 451
2/9 08:25:45 User Process 43231768.0 {
2/9 08:25:45   cmd = /mnt/qfs1/ajw/projects/ihope/run/s5_high_mass/ 
820555271-831355391/full_data/../executables/lalapps_coire
2/9 08:25:45   args = --eff-snr-denom-fac 50 --output H1H2L1- 
COIRE_SLIDE_SUMMARY_SECOND_H1H2L1_FULL_DATA-820555271-10800120.xml -- 
cluster-time 10000 --
user-tag FULL_DATA --summary H1H2L1- 
COIRE_SLIDE_SUMMARY_SECOND_H1H2L1_FULL_DATA-820555271-10800120.txt -- 
num-slides 50 --data-type all_data --coinc-sta
t effective_snrsq --input H1H2L1- 
COIRE_SLIDE_SUMMARY_SECOND_H1H2L1_FULL_DATA-820555271-10800120.input
2/9 08:25:45   env = _CONDOR_SLOT=slot4 CONDOR_VM=slot4  
CONDOR_SCRATCH_DIR=/usr1/condor/execute/dir_27805  
_condor_BIND_ALL_INTERFACES=TRUE
2/9 08:25:45   local_dir = dir_27805
2/9 08:25:45   cur_ckpt = dir_27805/condor_exec.43231768.0
2/9 08:25:45   core_name = (either 'core' or 'core.<pid>')
2/9 08:25:45   uid = 451, gid = 451
2/9 08:25:45   v_pid = -1
2/9 08:25:45   pid = (NOT CURRENTLY EXECUTING)
2/9 08:25:45   exit_status_valid = FALSE
2/9 08:25:45   exit_status = (NEVER BEEN EXECUTED)
2/9 08:25:45   ckpt_wanted = TRUE
2/9 08:25:45   coredump_limit_exists = TRUE
2/9 08:25:45   coredump_limit = 0
2/9 08:25:45   soft_kill_sig = 20
2/9 08:25:45   job_class = STANDARD
2/9 08:25:45   state = NEW
2/9 08:25:45   new_ckpt_created = FALSE
2/9 08:25:45   ckpt_transferred = FALSE
2/9 08:25:45   core_created = FALSE
2/9 08:25:45   core_transferred = FALSE
2/9 08:25:45   exit_requested = FALSE
2/9 08:25:45   image_size = -1 blocks
2/9 08:25:45   user_time = 0
2/9 08:25:45   sys_time = 0
2/9 08:25:45   guaranteed_user_time = 0
2/9 08:25:45   guaranteed_sys_time = 0
2/9 08:25:45 }
2/9 08:25:45    *FSM* Transitioning to state "GET_EXEC"
2/9 08:25:45    *FSM* Executing state func "get_exec()" [ SUSPEND  
VACATE DIE  ]
2/9 08:25:45 Entering get_exec()
2/9 08:25:45 Executable is located on submitting host
2/9 08:25:45 Expanded executable name is "/usr1/condor/spool/ 
cluster43231768.ickpt.subproc0"
2/9 08:25:45 Going to try 3 attempts at getting the initial executable
2/9 08:25:45 Entering get_file( /usr1/condor/spool/ 
cluster43231768.ickpt.subproc0, dir_27805/condor_exec.43231768.0, 0755 )
2/9 08:25:45 Opened "/usr1/condor/spool/ 
cluster43231768.ickpt.subproc0" via file stream
2/9 08:25:45 Get_file() transferred 6225045 bytes, 113256632 bytes/ 
second
2/9 08:25:45 Fetched orig ckpt file "/usr1/condor/spool/ 
cluster43231768.ickpt.subproc0" into "dir_27805/condor_exec. 
43231768.0" with 1 attempt
2/9 08:25:45 Executable 'dir_27805/condor_exec.43231768.0' is linked  
with "$CondorVersion: 7.2.0 Dec 19 2008 BuildID: 121001 $" on a  
"$CondorPlatform:
X86_64-LINUX_RHEL3 $"
2/9 08:25:45    *FSM* Executing transition function "spawn_all"
2/9 08:25:45 Pipe built
2/9 08:25:45 New pipe_fds[14,1]
2/9 08:25:45 cmd_fd = 14
2/9 08:25:45 Calling execve( "/usr1/condor/execute/dir_27805/ 
condor_exec.43231768.0", "condor_exec.43231768.0", "-_condor_cmd_fd",  
"14", "--eff-snr-den
om-fac", "50", "--output", "H1H2L1- 
COIRE_SLIDE_SUMMARY_SECOND_H1H2L1_FULL_DATA-820555271-10800120.xml",  
"--cluster-time", "10000", "--user-tag", "FULL_
DATA", "--summary", "H1H2L1- 
COIRE_SLIDE_SUMMARY_SECOND_H1H2L1_FULL_DATA-820555271-10800120.txt",  
"--num-slides", "50", "--data-type", "all_data", "--co
inc-stat", "effective_snrsq", "--input", "H1H2L1- 
COIRE_SLIDE_SUMMARY_SECOND_H1H2L1_FULL_DATA-820555271-10800120.input",  
0, "_CONDOR_SLOT=slot4", "CONDO
R_VM=slot4", "CONDOR_SCRATCH_DIR=/usr1/condor/execute/dir_27805",  
"_condor_BIND_ALL_INTERFACES=TRUE", 0 )
2/9 08:25:45 Started user job - PID = 27809
2/9 08:25:45 cmd_fp = 0xa3cb70
2/9 08:25:45 end
2/9 08:25:45    *FSM* Transitioning to state "SUPERVISE"
2/9 08:25:45    *FSM* Got asynchronous event "CHILD_EXIT"
2/9 08:25:45    *FSM* Executing transition function "reaper"
2/9 08:25:45    *FSM* Aborting transition function "reaper"
2/9 08:25:45    *FSM* Executing state func  
"supervise_all()" [ GET_NEW_PROC SUSPEND VACATE ALARM DIE CHILD_EXIT  
PERIODIC_CKPT  ]
2/9 12:35:28    *FSM* Got asynchronous event "PERIODIC_CKPT"
2/9 12:35:28    *FSM* Executing transition function "periodic_ckpt_all"
2/9 12:35:28 UserProc::send_sig_no_privsep(): Sent signal SIGCONT to  
user job 27809
2/9 12:35:28 UserProc::send_sig(): Sent signal SIGUSR2 to user job 27809
2/9 12:35:28    Requested user job to do a periodic checkpoint
2/9 12:35:28    *FSM* Got asynchronous event "CHILD_EXIT"
2/9 12:35:28    *FSM* Executing transition function "reaper"
2/9 12:35:28 Process 27809 killed by signal 6
2/9 12:35:28 Process exited abnormally
2/9 12:35:28    *FSM* Transitioning to state "PROC_EXIT"
2/9 12:35:28    *FSM* Executing state func "proc_exit()" [ DIE  ]
2/9 12:35:28    *FSM* Transitioning to state "SEND_CORE"
2/9 12:35:28    *FSM* Executing state func "send_core()" [ SUSPEND  
VACATE DIE  ]
2/9 12:35:28 No core file to send - probably ran out of disk
2/9 12:35:28    *FSM* Executing transition function "dispose_one"
2/9 12:35:28 Sending final status for process 43231768.0
2/9 12:35:28 STATUS encoded as ABNORMAL, NO CORE
2/9 12:35:28 User time = 14768.740000 seconds
2/9 12:35:28 System time = 15.200000 seconds
2/9 12:35:28 IO: Incoming packet improperly sized (len=0,end=0)
Stack dump for process 27805 at timestamp 1234211728 (13 frames)
condor_starter(dprintf_dump_stack+0x9b)[0x4c602e]
condor_starter[0x4c62ab]
/lib64/libc.so.6[0x331ae2f2b0]
/lib64/libc.so.6(gsignal+0x40)[0x331ae2f200]
/lib64/libc.so.6(abort+0x100)[0x331ae30730]
/lib64/libc.so.6(__assert_fail+0xf6)[0x331ae282b6]
condor_starter(REMOTE_CONDOR_reallyexit+0x1f6)[0x4b01b8]
condor_starter(_Z17send_final_statusP8UserProc+0x9e)[0x4922c4]
condor_starter(_Z11dispose_onev+0x1f)[0x4929a7]
condor_starter(_ZN12StateMachine7executeEv+0x279)[0x4ca513]
condor_starter(main+0xff)[0x491b3d]
/lib64/libc.so.6(__libc_start_main+0xef)[0x331ae1c40f]
condor_starter(tcsetattr+0xaa)[0x49192a]
2/9 12:35:28    *FSM* Got asynchronous event "DIE"
2/9 12:35:28    *FSM* Executing transition function "req_die"
2/9 12:35:28 req_exit_all: Proc -1 in state ABNORMAL_EXIT
2/9 12:35:28    *FSM* Transitioning to state "TERMINATE"
2/9 12:35:28    *FSM* Executing state func "terminate_all()" [  ]
2/9 12:35:28 No core file to send - probably ran out of disk
2/9 12:35:28    *FSM* Transitioning to state "SEND_STATUS_ALL"
2/9 12:35:28    *FSM* Executing state func "dispose_all()" [  ]
2/9 12:35:28 Sending final status for process 43231768.0
2/9 12:35:28 STATUS encoded as ABNORMAL, NO CORE
2/9 12:35:28 User time = 14768.740000 seconds
2/9 12:35:28 System time = 15.200000 seconds
2/9 12:35:34 condor_read(): recv() returned -1, errno = 104, assuming  
failure reading 256 bytes from unknown source.
2/9 12:35:34 Buf::read(): condor_read() failed
2/9 12:35:34 IO: Packet read failed: read -1 of 256
2/9 12:36:00 ******************************************************
2/9 12:36:00 ** condor_starter (CONDOR_STARTER) STARTING UP
2/9 12:36:00 ** /usr/sbin/condor_starter
2/9 12:36:00 ** SubsystemInfo: name=STARTER type=STARTER(8)  
class=DAEMON(1)
2/9 12:36:00 ** Configuration: subsystem:STARTER local:<NONE>  
class:DAEMON
2/9 12:36:00 ** $CondorVersion: 7.2.0 Dec 19 2008 BuildID: 121001 $
2/9 12:36:00 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
2/9 12:36:00 ** PID = 2144
2/9 12:36:00 ** Log last touched 2/9 12:35:34
2/9 12:36:00 ******************************************************


The Checkpoint Server log shows no reference to this job, though it  
did apparently successfully checkpoint 3 other jobs running on the  
same machine,

2/9 12:29:11 Sending ckpt server ad to collector...
2/9 12:34:11 Sending ckpt server ad to collector...
2/9 12:35:28 ----------------------------------------------------
2/9 12:35:28 [reqid: 392] Receiving STORE request from 10.14.0.12
2/9 12:35:28     Using descriptor 8 to handle request
2/9 12:35:28     Owner:     shaon__AT__ldas-grid.ligo.caltech.edu
2/9 12:35:28     File name: cluster43242783.proc0.subproc0.tmp
2/9 12:35:28     File size: 309736479
2/9 12:35:28     STORE service address: 10.14.1.161:55941
2/9 12:35:28     Request to store checkpoint file GRANTED
2/9 12:35:28     Receiving checkpoint to file: ./10.14.0.12/shaon__AT__ldas-grid.ligo.caltech.edu 
/cluster43242783.proc0.subproc0.tmp
2/9 12:35:28 ----------------------------------------------------
2/9 12:35:28 [reqid: 393] Receiving STORE request from 10.14.0.12
2/9 12:35:28     Using descriptor 8 to handle request
2/9 12:35:28     Owner:     shaon__AT__ldas-grid.ligo.caltech.edu
2/9 12:35:28     File name: cluster43243113.proc0.subproc0.tmp
2/9 12:35:28     File size: 309650463
2/9 12:35:28     STORE service address: 10.14.1.161:53480
2/9 12:35:28     Request to store checkpoint file GRANTED
2/9 12:35:28 ----------------------------------------------------
2/9 12:35:28     Receiving checkpoint to file: ./10.14.0.12/shaon__AT__ldas-grid.ligo.caltech.edu 
/cluster43243113.proc0.subproc0.tmp
2/9 12:35:28 [reqid: 394] Receiving STORE request from 10.14.0.12
2/9 12:35:28     Using descriptor 8 to handle request
2/9 12:35:28     Owner:     shaon__AT__ldas-grid.ligo.caltech.edu
2/9 12:35:28     File name: cluster43226390.proc0.subproc0.tmp
2/9 12:35:28     File size: 308786207
2/9 12:35:28     STORE service address: 10.14.1.161:36881
2/9 12:35:28     Receiving checkpoint to file: ./10.14.0.12/shaon__AT__ldas-grid.ligo.caltech.edu 
/cluster43226390.proc0.subproc0.tmp
2/9 12:35:28     Request to store checkpoint file GRANTED
2/9 12:35:31 ----------------------------------------------------
2/9 12:35:31 [reqid: 395] Receiving SERVICE request from 10.14.0.12
2/9 12:35:31     Using descriptor 8 to handle request
2/9 12:35:31     Service: SERVICE_RENAME
2/9 12:35:31     Owner:   shaon__AT__ldas-grid.ligo.caltech.edu
2/9 12:35:31     File:    cluster43243113.proc0.subproc0.tmp
2/9 12:35:31     Renaming file: ./10.14.0.12/shaon__AT__ldas-grid.ligo.caltech.edu 
/cluster43243113.proc0.subproc0.tmp to ./10.14.0.12/shaon__AT__ldas-grid.ligo.caltech.edu 
/cluster43243113.proc0.subproc0
2/9 12:35:31     Deleting file: ./10.14.0.12/shaon__AT__ldas-grid.ligo.caltech.edu 
/cluster43243113.proc0.subproc0.tmp
2/9 12:35:31     Service request successfully completed
2/9 12:35:31 ----------------------------------------------------
2/9 12:35:31 [reqid: 396] Receiving SERVICE request from 10.14.0.12
2/9 12:35:31     Using descriptor 8 to handle request
2/9 12:35:31     Service: SERVICE_RENAME
2/9 12:35:31     Owner:   shaon__AT__ldas-grid.ligo.caltech.edu
2/9 12:35:31     File:    cluster43242783.proc0.subproc0.tmp
2/9 12:35:31     Renaming file: ./10.14.0.12/shaon__AT__ldas-grid.ligo.caltech.edu 
/cluster43242783.proc0.subproc0.tmp to ./10.14.0.12/shaon__AT__ldas-grid.ligo.caltech.edu 
/cluster43242783.proc0.subproc0
2/9 12:35:31     Deleting file: ./10.14.0.12/shaon__AT__ldas-grid.ligo.caltech.edu 
/cluster43242783.proc0.subproc0.tmp
2/9 12:35:31     Service request successfully completed
2/9 12:35:31 ----------------------------------------------------
2/9 12:35:31 [reqid: 397] Receiving SERVICE request from 10.14.0.12
2/9 12:35:31     Using descriptor 8 to handle request
2/9 12:35:31     Service: SERVICE_RENAME
2/9 12:35:31     Owner:   shaon__AT__ldas-grid.ligo.caltech.edu
2/9 12:35:31     File:    cluster43226390.proc0.subproc0.tmp
2/9 12:35:31     Renaming file: ./10.14.0.12/shaon__AT__ldas-grid.ligo.caltech.edu 
/cluster43226390.proc0.subproc0.tmp to ./10.14.0.12/shaon__AT__ldas-grid.ligo.caltech.edu 
/cluster43226390.proc0.subproc0
2/9 12:35:31     Deleting file: ./10.14.0.12/shaon__AT__ldas-grid.ligo.caltech.edu 
/cluster43226390.proc0.subproc0.tmp
2/9 12:35:31     Service request successfully completed
2/9 12:35:48 [reqid: 392] File successfully received
2/9 12:35:48 RECV transferred 309736479 bytes in 20 seconds (15486823  
bytes / sec)
2/9 12:35:52 [reqid: 393] File successfully received
2/9 12:35:53 RECV transferred 309650463 bytes in 25 seconds (12386018  
bytes / sec)
2/9 12:35:53 [reqid: 394] File successfully received
2/9 12:35:53 RECV transferred 308786207 bytes in 25 seconds (12351448  
bytes / sec)
2/9 12:35:53 ERROR from waitpid(): 10 (No child processes)
2/9 12:39:11 Sending ckpt server ad to collector...
2/9 12:44:11 Sending ckpt server ad to collector...


My hypothesis is that the reason the one job failed to checkpoint  
while the other 3 succeeded is that the failed job was using more  
memory and was close to the shell imposed "ulimit -v" value. If so,  
there is a bug in the checkpoint code for handling the memory  
allocation error that resulted in exceeding this limit.

How much memory does the checkpoint library code in a Standard  
Universe job need to process a PERIODIC_CKPT event?

Thanks.

--
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson




===========================================================================
Date of creation: Mon Feb  9 15:13:33 2009 (1234214016)
Subject: Actions

Assigned to psilord by nleroy
===========================================================================
Date of actions: Mon Feb  9  9:04:53 2009 (1234215173)
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #18995] LIGO: condor_checkpoint fails to
 generate checkpoint image
Date: Mon, 9 Feb 2009 15:38:42 -0800
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu

I have somewhat confirmed the ulimit hypothesis by moving this job to  
a different slots with more available memory and then successfully  
running condor_checkpoint.

Thanks.

--
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson




===========================================================================
Date mail was appended: Mon Feb  9 17:38:49 2009 (1234222730)
Subject: Actions

Ticket resolved by tannenba
===========================================================================
Date of actions: Fri Mar 13 16:24:44 2009 (1236979484)