Condor provides checkpointing services to single-process jobs on a
number of Unix platforms.
To enable checkpointing, the user must link the program with the
Condor system call library (libcondorsyscall.a), using the
condor_compile command.
This means that the
user must have the object files or source code of the program to use
Condor checkpointing. However, the checkpointing services provided by
Condor are strictly optional. So, while there are some classes of
jobs for which Condor does not provide checkpointing services, these
jobs may still be submitted to Condor to take advantage of Condor's
resource management functionality. (See section 2.4.1 for a
description of the classes of jobs for which Condor does not provide
checkpointing services.)
Process checkpointing is implemented in the Condor system call library as a signal handler. When Condor sends a checkpoint signal to a process linked with this library, the provided signal handler writes the state of the process out to a file or a network socket. This state includes the contents of the process stack and data segments, all shared library code and data mapped into the process's address space, the state of all open files, and any signal handlers and pending signals. On restart, the process reads this state from the file, restoring the stack, shared library and data segments, file state, signal handlers, and pending signals. The checkpoint signal handler then returns to user code, which continues from where it left off when the checkpoint signal arrived.
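The signal-handler approach described above can be illustrated with a toy sketch. It is not the Condor library's implementation: the "process state" here is a single integer rather than the stack, data segments, and file state, and every name in it (toy_progress, toy_install_handler, toy_restore) is hypothetical. SIGUSR2 is used only because it is a convenient user signal.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Toy "process state" (hypothetical). The real library saves far more. */
volatile sig_atomic_t toy_progress = 0;

/* File the handler writes to; set once by toy_install_handler(). */
static const char *toy_ckpt_path = NULL;

/* Signal handler: write the state out using only async-signal-safe
 * calls (open, write, close), then return to the interrupted code. */
static void toy_ckpt_handler(int signo)
{
    (void)signo;
    int saved_errno = errno;          /* do not disturb the interrupted code */
    int value = (int)toy_progress;
    int fd = open(toy_ckpt_path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd >= 0) {
        ssize_t n = write(fd, &value, sizeof value);
        (void)n;                      /* a real handler would check this */
        close(fd);
    }
    errno = saved_errno;
}

/* Install the toy handler for SIGUSR2. Returns 0 on success. */
int toy_install_handler(const char *path)
{
    assert(path != NULL);
    toy_ckpt_path = path;

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = toy_ckpt_handler;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGUSR2, &sa, NULL);
}

/* Restore the toy state from the file. Returns 0 on success, -1 on error. */
int toy_restore(const char *path)
{
    assert(path != NULL);
    int value = 0;
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t n = read(fd, &value, sizeof value);
    close(fd);
    if (n != (ssize_t)sizeof value)
        return -1;
    toy_progress = value;
    return 0;
}
```

A program would call toy_install_handler() once at startup, update toy_progress as it works, and call toy_restore() on restart. Sending the process SIGUSR2 then saves the current value without the main code taking any explicit action.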
Condor processes for which checkpointing is enabled perform a checkpoint when preempted from a machine. When a suitable replacement execution machine is found (of the same architecture and operating system), the process is restored on this new machine from the checkpoint, and computation resumes from where it left off. Jobs that cannot be checkpointed are preempted and restarted from the beginning.
Condor's periodic checkpointing provides fault tolerance. Each Condor pool is configured with the PERIODIC_CHECKPOINT expression, which controls when and how often checkpointable jobs perform periodic checkpoints (for example, never, or every three hours). When the time for a periodic checkpoint arrives, the job suspends processing, performs the checkpoint, and immediately continues from where it left off. There is also a condor_ckpt command, which allows the user to request that a Condor job immediately perform a periodic checkpoint.
In all cases, Condor jobs continue execution from the most recent complete checkpoint. If service is interrupted while a checkpoint is being performed, causing that checkpoint to fail, the process will restart from the previous checkpoint. Condor uses a commit style algorithm for writing checkpoints: a previous checkpoint is deleted only after a new complete checkpoint has been written successfully.
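One common way to get this "never lose the last good checkpoint" property on POSIX systems is to write the new image to a temporary file and then rename it over the old one. The sketch below shows that technique. It is an assumption for illustration, not a description of how Condor itself commits checkpoints, and the function name commit_checkpoint and the ".tmp" suffix are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Write `len` bytes to `path` so that any existing file at `path` is
 * replaced only after the new image is completely on disk.
 * Returns 0 on success, -1 on failure (the old file is left untouched). */
int commit_checkpoint(const char *path, const void *data, size_t len)
{
    assert(path != NULL && data != NULL);

    char tmp[4096];
    int needed = snprintf(tmp, sizeof tmp, "%s.tmp", path);
    if (needed < 0 || needed >= (int)sizeof tmp)
        return -1;                        /* path too long for this sketch */

    FILE *f = fopen(tmp, "wb");
    if (f == NULL)
        return -1;

    /* Write the whole image and force it to stable storage. */
    int ok = (fwrite(data, 1, len, f) == len);
    ok = ok && fflush(f) == 0 && fsync(fileno(f)) == 0;
    if (fclose(f) != 0)
        ok = 0;

    if (!ok) {
        remove(tmp);                      /* discard the partial image */
        return -1;
    }

    /* Commit point: rename() replaces the previous checkpoint atomically. */
    if (rename(tmp, path) != 0) {
        remove(tmp);
        return -1;
    }
    return 0;
}
```

If the process dies at any point before the rename, the previous checkpoint at `path` is still intact, which matches the restart behavior described above.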
In certain cases, checkpointing may be delayed until a more appropriate time. For example, a Condor job will defer a checkpoint request if it is communicating with another process over the network. When the network connection is closed, the checkpoint will occur.
The Condor checkpointing facility can also be used for any Unix process outside of the Condor batch environment. Standalone checkpointing is described in section 4.2.1.
Condor can now read and write compressed checkpoints. This new functionality is provided in the libcondorzsyscall.a library. If /usr/lib/libz.a exists on your workstation, condor_compile will automatically link your job with the compression-enabled version of the checkpointing library.
By default, a checkpoint is written to a file on the local disk of the
machine where the job was submitted. A checkpoint server is available
to serve as a repository for checkpoints. (See
section 3.11.5.)
When a host is configured to use a checkpoint server, jobs submitted
on that machine write and read checkpoints to and from the server
rather than the local disk of the submitting machine, taking the
burden of storing checkpoint files off of the submitting machines and
placing it instead on server machines (with disk space dedicated to
the purpose of storing checkpoints).
Using the Condor checkpoint library without the remote system call functionality and outside of the Condor system is known as standalone mode checkpointing.
To prepare a program for standalone checkpointing, use the condor_compile utility as for a standard Condor job, but do not use condor_submit. Instead, run your program normally from the command line. The checkpointing library prints a message to tell you that checkpointing is enabled and where the checkpoint image is stored:
Condor: Notice: Will checkpoint to program_name.ckpt
Condor: Notice: Remote system calls disabled.
To force the program to write a checkpoint image and stop, send it the SIGTSTP signal or press control-Z. To force the program to write a checkpoint image and continue executing, send it the SIGUSR2 signal.
To restart the program from a checkpoint, run it again with the option -_condor_restart and the name of the checkpoint image file.
To use a different filename for the checkpoint image, use the option -_condor_ckpt and the name of the file you want checkpoints written to.
Some programs have fundamental limitations that make them unsafe for checkpointing. For example, a program that both reads and writes a single file may enter an unexpected state. Here is an example of how this might happen.
In this example, the program would re-read data from the file, but instead of finding the original data, would see data created in the future, and yield unexpected results.
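The hazard can be reproduced in miniature without any checkpointing library. The sketch below is illustrative only, and its function names are hypothetical. One program step reads a counter from a file and writes the incremented value back to the same file. Re-running that step, as a rollback to an earlier checkpoint would, reads the value written by the first run rather than the original data.

```c
#include <assert.h>
#include <stdio.h>

/* Seed the file with an initial counter. Returns 0 on success, -1 on error. */
int write_counter(const char *path, long value)
{
    assert(path != NULL);
    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;
    int ok = fprintf(f, "%ld\n", value) > 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* One program step: read a non-negative counter from the file, then
 * write counter + 1 back to the same file.
 * Returns the value that was read, or -1 on error. */
long read_then_increment(const char *path)
{
    assert(path != NULL);
    long value = 0;
    FILE *f = fopen(path, "r+");
    if (f == NULL)
        return -1;
    if (fscanf(f, "%ld", &value) != 1) {
        fclose(f);
        return -1;
    }
    rewind(f);                       /* reposition before switching to writing */
    int ok = fprintf(f, "%ld\n", value + 1) > 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? value : -1;
}
```

If the file starts at 10, the first call returns 10. A second call, standing in for the same step replayed after a rollback, returns 11, because the process memory was rolled back but the file contents were not.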
To prevent this sort of accident, Condor displays a warning if a file is used for both reading and writing. You can ignore or disable these warnings if you choose (see section 4.2.3), but please understand that your program may compute incorrect results.
Condor has a number of messages that warn you of unexpected behaviors in your program. For example, if a file is opened for reading and writing, you will see:
Condor: Warning: READWRITE: File '/tmp/x' used for both reading and writing.
You may control how these messages are displayed with the
-_condor_warning command-line argument. This argument
accepts a warning category and a mode. The category describes a certain
class of messages, such as READWRITE or ALL. The mode describes what
to do with the category. It may be ON, OFF, or ONCE.
If a category is ON, it is always displayed.
If a category is OFF, it is never displayed.
If a category is ONCE, it is displayed only once.
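The three modes can be expressed as a small decision function. This is a sketch of the semantics listed above, not the library's code: the struct warn_category and the function warn_should_display are hypothetical names, and treating an unrecognized mode like OFF is an assumption of this sketch.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical model of one warning category. */
struct warn_category {
    const char *mode;   /* "ON", "OFF", or "ONCE" */
    int shown;          /* how many times a message has been displayed */
};

/* Returns 1 if a message in this category should be displayed now,
 * 0 otherwise, and records the display. */
int warn_should_display(struct warn_category *c)
{
    assert(c != NULL && c->mode != NULL);

    int show;
    if (strcmp(c->mode, "ON") == 0)
        show = 1;                    /* always displayed */
    else if (strcmp(c->mode, "ONCE") == 0)
        show = (c->shown == 0);      /* displayed only the first time */
    else
        show = 0;                    /* "OFF": never displayed */

    if (show)
        c->shown++;
    return show;
}
```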
To show all the available categories and modes, just use
-_condor_warning with no arguments.
For example, to limit read/write warnings to one instance:
-_condor_warning READWRITE ONCE
To turn all ordinary notices off:
-_condor_warning NOTICE OFF
The same effect can be accomplished within a program by using the function
_condor_warning_config, described in section 4.2.4.
A program need not be rewritten to take advantage of checkpointing. However, the checkpointing library provides several C entry points that allow for a program to control its own checkpointing behavior if needed.
void ckpt()
void ckpt_and_exit()
void init_image_with_file_name( char *ckpt_file_name )
restart() must be called to perform the
actual restart.
void init_image_with_file_descriptor( int fd )
restart() must be called to
perform the actual restart.
void restart()
void _condor_ckpt_disable()
_condor_ckpt_disable(), access the
file, and then call _condor_ckpt_enable(). Some program
actions, such as opening a socket or a pipe, implicitly cause
checkpointing to be disabled.
void _condor_ckpt_enable()
_condor_ckpt_disable(). If a checkpointing signal arrived
while checkpointing was disabled, the checkpoint will occur when
this function is called. Disabling and enabling of checkpointing
must occur in matched pairs. _condor_ckpt_enable() must
be called once for every time that _condor_ckpt_disable()
is called.
int _condor_warning_config( const char *kind, const char *mode )
The kind and mode arguments are the same as for the
-_condor_warning option described in section 4.2.3. This function returns true
if the arguments are understood and accepted. Otherwise, it returns false.
extern int condor_compress_ckpt
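As an illustration of how a program might drive its own checkpointing through these entry points, the sketch below requests a checkpoint after every fixed number of work units. The checkpoint routine is passed in as a function pointer so that the example compiles without the Condor library. In a program built with condor_compile, the ckpt() entry point listed above would presumably be passed instead. The names run_with_checkpoints, fake_ckpt, and fake_ckpt_calls are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for ckpt(), used only so this sketch runs without Condor. */
int fake_ckpt_calls = 0;
void fake_ckpt(void)
{
    fake_ckpt_calls++;
}

/* Run `iterations` units of work, invoking `do_ckpt` after every
 * `interval` completed units. Returns the number of checkpoint requests. */
long run_with_checkpoints(long iterations, long interval,
                          void (*do_ckpt)(void))
{
    assert(interval > 0);
    long requests = 0;
    for (long i = 1; i <= iterations; i++) {
        /* ... one unit of real work would go here ... */
        if (i % interval == 0) {
            if (do_ckpt != NULL)
                do_ckpt();           /* e.g. ckpt() in a Condor-linked build */
            requests++;
        }
    }
    return requests;
}
```

Calling run_with_checkpoints(10, 3, fake_ckpt) requests a checkpoint after units 3, 6, and 9. Choosing the interval is a trade-off between the time spent writing images and the amount of work lost on a failure.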