A job is submitted for execution to Condor using the condor_submit command. condor_submit takes as an argument the name of a file called a submit description file. This file contains commands and keywords to direct the queuing of jobs. In the submit description file, Condor finds everything it needs to know about the job. Items such as the name of the executable to run, the initial working directory, and command-line arguments to the program all go into the submit description file. condor_submit creates a job ClassAd based upon the information, and Condor works toward running the job.
The contents of a submit file can save time for Condor users. It is easy to submit multiple runs of a program to Condor. To run the same program 500 times on 500 different input data sets, arrange your data files accordingly so that each run reads its own input, and each run writes its own output. Each individual run may have its own initial working directory, stdin, stdout, stderr, command-line arguments, and shell environment. A program that directly opens its own files will read the file names to use either from stdin or from the command line. A program that opens a static filename every time will need to use a separate subdirectory for the output of each run.
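One way to arrange the data for many runs is to give each run its own subdirectory before submitting. The Python sketch below is an illustration only: the directory prefix run_, the file name input.dat, and the function name prepare_run_dirs are assumptions made for this example, not anything Condor requires.

```python
import shutil
from pathlib import Path


def prepare_run_dirs(base_dir, data_files, prefix="run_", input_name="input.dat"):
    """Create one subdirectory per data file and copy the file into it.

    The names "run_" and "input.dat" are hypothetical; use whatever your
    program and your submit description file expect.
    """
    base = Path(base_dir)
    run_dirs = []
    for index, data_file in enumerate(data_files):
        run_dir = base / f"{prefix}{index}"
        run_dir.mkdir(parents=True, exist_ok=True)
        # Every run sees the same static file name inside its own directory.
        shutil.copyfile(data_file, run_dir / input_name)
        run_dirs.append(run_dir)
    return run_dirs
```

A program that always opens the same file name can then be pointed at a different directory for each run, so that no run overwrites the files of another.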
The condor_submit manual page contains a complete description of how to use condor_submit.
In addition to the examples of submit description files given in the condor_submit manual page, here are a few more.
Example 1 is the simplest submit description file possible. It queues up one copy of the program foo (which had been created by condor_compile) for execution by Condor. Since no platform is specified, Condor will use its default, which is to run the job on a machine which has the same architecture and operating system as the machine from which it was submitted. No input, output, or error commands are given in the submit description file, so the files stdin, stdout, and stderr will all refer to /dev/null. The program may produce output by explicitly opening a file and writing to it. A log file, foo.log, will also be produced that records the events of the job's lifetime inside of Condor. When the job finishes, its exit conditions will be noted in the log file. It is recommended that you always have a log file so you know what happened to your jobs.
####################
#
# Example 1
# Simple condor job description file
#
####################
Executable = foo
Log = foo.log
Queue
Example 2 queues two copies of the program mathematica. The first copy will run in directory run_1, and the second will run in directory run_2. For both queued copies, stdin will be test.data, stdout will be loop.out, and stderr will be loop.error. There will be two sets of files written, as the files are each written to their own directories. This is a convenient way to organize data if you have a large group of Condor jobs to run. The example file shows program submission of mathematica as a vanilla universe job. This may be necessary if the source and/or object code to program mathematica is not available.
####################
#
# Example 2: demonstrate use of multiple
# directories for data organization.
#
####################
Executable = mathematica
Universe = vanilla
input = test.data
output = loop.out
error = loop.error
Log = loop.log
Initialdir = run_1
Queue
Initialdir = run_2
Queue
The submit description file for Example 3 queues 150 runs of program foo, which has been compiled and linked for Silicon Graphics workstations running IRIX 6.5. This job requires Condor to run the program on machines which have at least 32 megabytes of physical memory, and expresses a preference to run the program on machines with at least 64 megabytes, if such machines are available. It also advises Condor that it will use up to 28 megabytes of memory when running. Each of the 150 runs of the program is given its own process number, starting with process number 0. So, files stdin, stdout, and stderr will refer to in.0, out.0, and err.0 for the first run of the program, in.1, out.1, and err.1 for the second run of the program, and so forth. A log file containing entries about when and where Condor runs, checkpoints, and migrates processes for the 150 queued programs will be written into file foo.log.
####################
#
# Example 3: Show off some fancy features including
# use of pre-defined macros and logging.
#
####################
Executable = foo
Requirements = Memory >= 32 && OpSys == "IRIX65" && Arch == "SGI"
Rank = Memory >= 64
Image_Size = 28 Meg
Error = err.$(Process)
Input = in.$(Process)
Output = out.$(Process)
Log = foo.log
Queue 150
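The naming scheme that the $(Process) macro produces in Example 3 can be sketched in Python. This only illustrates which file names result; the function name expand_process_macro is hypothetical, and the real substitution is performed by condor_submit.

```python
def expand_process_macro(template, count):
    """Return the names that $(Process) yields for runs 0 .. count-1."""
    return [template.replace("$(Process)", str(n)) for n in range(count)]


inputs = expand_process_macro("in.$(Process)", 150)
print(inputs[0], inputs[1], inputs[-1])  # prints: in.0 in.1 in.149
```

Such a list is handy for confirming, before submission, that every expected input file is present.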
The requirements and rank commands in the submit description file are powerful and flexible. Using them effectively requires care, and this section presents those details.
Both requirements and rank must be specified as valid Condor ClassAd expressions. If they are not defined in the submit description file, condor_submit sets default values. As the condor_submit manual page and the examples above show, writing ClassAd expressions is intuitive, especially for those familiar with the C programming language, and ClassAds allow quite expressive conditions. A complete description of ClassAds and their expressions can be found in section 4.1.
All of the commands in the submit description file are case insensitive, except for the ClassAd attribute string values. ClassAd attribute names are case insensitive, but ClassAd string values are always case sensitive. The correct specification for an architecture is
requirements = arch == "ALPHA"
so an accidental specification of
requirements = arch == "alpha"
will not work due to the incorrect case.
The allowed ClassAd attributes are those that appear in a machine or a job ClassAd. To see all of the machine ClassAd attributes for all machines in the Condor pool, run condor_status -l. The -l argument to condor_status means to display the complete machine ClassAds. The job ClassAds, if there are jobs in the queue, can be seen with the condor_q -l command. This will show you all the attributes available for use in expressions.
To help you out with what these attributes all signify, descriptions follow for the attributes which will be common to every machine ClassAd. Remember that because ClassAds are flexible, the machine ads in your pool may include additional attributes specific to your site's installation and policies.
If executables are available for the different platforms of machines in the Condor pool, Condor can be allowed the choice of a larger number of machines when allocating a machine for a job. Modifications to the submit description file allow this choice of platforms.
A simplified example is a cross submission. An executable is available for one platform, but the submission is done from a different platform. Given the correct executable, the requirements command in the submit description file specifies the target architecture. For example, an executable compiled for a Sun 4, submitted from an Intel architecture running Linux would add the requirement
requirements = Arch == "SUN4x" && OpSys == "SOLARIS251"
Without this requirement, condor_submit will assume that the program is to be executed on a machine with the same platform as the machine where the job is submitted.
Cross submission works for both standard and vanilla universes. The burden is on the user to both obtain and specify the correct executable for the target architecture. To list the architecture and operating systems of the machines in a pool, run condor_status.
A more complex example of a heterogeneous submission occurs when a job may be executed on many different architectures to gain full use of a diverse architecture and operating system pool. If the executables are available for the different architectures, then a modification to the submit description file will allow Condor to choose an executable after an available machine is chosen.
A special-purpose MachineAd substitution macro can be used in the executable, environment, and arguments attributes in the submit description file. The macro has the form
$$(MachineAdAttribute)
Note that this macro is ignored in all other submit description attributes. The $$() informs Condor to substitute the requested MachineAdAttribute from the machine where the job will be executed.
An example of the heterogeneous job submission has executables available for three platforms: LINUX Intel, Solaris26 Intel, and Irix 6.5 SGI machines. This example uses povray to render images using a popular free rendering engine.
The substitution macro chooses a specific executable after a platform for running the job is chosen. These executables must therefore be named based on the machine attributes that describe a platform. The executables named
povray.LINUX.INTEL
povray.SOLARIS26.INTEL
povray.IRIX65.SGI
will work correctly for the macro
povray.$$(OpSys).$$(Arch)
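The effect of the substitution can be sketched in Python. This is a simplified illustration, not Condor's implementation: the machine ClassAd is modeled as a plain dictionary, and the function name substitute_machine_attrs is an assumption made for the example.

```python
import re


def substitute_machine_attrs(text, machine_ad):
    """Replace each $$(Attr) in text with the value from machine_ad.

    machine_ad is a plain dict standing in for a machine ClassAd.
    Attribute names are matched without regard to case, since ClassAd
    attribute names are case insensitive.
    """
    lowered = {name.lower(): value for name, value in machine_ad.items()}

    def lookup(match):
        attr = match.group(1)
        if attr.lower() not in lowered:
            # This sketch simply raises for an attribute the machine lacks.
            raise KeyError(f"machine ad has no attribute {attr!r}")
        return str(lowered[attr.lower()])

    # Only the double-dollar form is touched; $(Process) is left alone.
    return re.sub(r"\$\$\((\w+)\)", lookup, text)


machine = {"OpSys": "IRIX65", "Arch": "SGI"}
print(substitute_machine_attrs("povray.$$(OpSys).$$(Arch)", machine))
# prints: povray.IRIX65.SGI
```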
The executables or links to executables with this name are placed into the initial working directory so that they may be found by Condor. A submit description file that queues three jobs for this example:
####################
#
# Example of heterogeneous submission
#
####################
universe = vanilla
Executable = povray.$$(OpSys).$$(Arch)
Log = povray.log
Output = povray.out.$(Process)
Error = povray.err.$(Process)
Requirements = (Arch == "INTEL" && OpSys == "LINUX") || \
(Arch == "INTEL" && OpSys =="SOLARIS26") || \
(Arch == "SGI" && OpSys == "IRIX65")
Arguments = +W1024 +H768 +Iimage1.pov
Queue
Arguments = +W1024 +H768 +Iimage2.pov
Queue
Arguments = +W1024 +H768 +Iimage3.pov
Queue
These jobs are submitted to the vanilla universe to assure that once a job is started on a specific platform, it will finish running on that platform. Switching platforms in the middle of job execution cannot work correctly.
There are two common errors made with the substitution macro. The first is the use of a non-existent MachineAdAttribute. If the specified MachineAdAttribute does not exist in the machine's ClassAd, then Condor will place the job on hold until the problem is resolved.
The second common error occurs due to an incomplete job setup. For example, the submit description file given above specifies three available executables. If one is missing, Condor reports that an executable is missing when it happens to match the job with a resource that requires the missing binary.
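A missing executable can be caught before submission with a small check. The Python sketch below is an assumption-laden illustration: the function name missing_executables is hypothetical, and the list of platforms must be kept in step by hand with the Requirements expression in the submit description file.

```python
from pathlib import Path


def missing_executables(directory, base_name, platforms):
    """List the expected per-platform executables that are absent.

    platforms is a sequence of (OpSys, Arch) pairs; the expected file
    name for each pair is base_name.OpSys.Arch.
    """
    missing = []
    for opsys, arch in platforms:
        name = f"{base_name}.{opsys}.{arch}"
        # is_file() follows symbolic links, so links to executables count.
        if not (Path(directory) / name).is_file():
            missing.append(name)
    return missing
```

For the povray example, the platforms would be ("LINUX", "INTEL"), ("SOLARIS26", "INTEL"), and ("IRIX65", "SGI"), and an empty result means all three executables are in place.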
Jobs submitted to the standard universe may produce checkpoints. A checkpoint can then be used to start up and continue execution of a partially completed job. For a partially completed job, the checkpoint and the job are specific to a platform. If the job is migrated to a different machine, correct execution requires that the platform remain the same.
A more complex requirements expression tells Condor to migrate a partially completed job to another machine with the same platform.
CkptRequirements = ((CkptArch == Arch) || (CkptArch =?= UNDEFINED)) && \
((CkptOpSys == OpSys) || (CkptOpSys =?= UNDEFINED))
Requirements = ( (Arch == "INTEL" && OpSys == "LINUX") || \
(Arch == "INTEL" && OpSys =="SOLARIS26") || \
(Arch == "SGI" && OpSys == "IRIX65") ) && $(CkptRequirements)
The Requirements expression in the example uses a macro to add an additional expression, called CkptRequirements. The CkptRequirements expression guarantees correct operation in the two possible cases for a job. In the first case, the job has not produced a checkpoint. The ClassAd attributes CkptArch and CkptOpSys will be undefined, and therefore the meta operator (=?=) evaluates to true. In the second case, the job has produced a checkpoint. The attributes CkptArch and CkptOpSys will be defined, so further execution is restricted to machines of the same platform as the one used just before the checkpoint.
Note that this restriction of platforms also applies to platforms where the executables are binary compatible.
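The two cases can be traced with a small Python model of the CkptRequirements expression. This is a deliberately simplified sketch, not Condor's evaluator: UNDEFINED is modeled as None, ERROR values and ClassAd type rules are ignored, and all function names are assumptions made for the example.

```python
UNDEFINED = None  # stands in for the ClassAd UNDEFINED value


def eq(a, b):
    """Model of ==: the result is UNDEFINED if either operand is."""
    if a is UNDEFINED or b is UNDEFINED:
        return UNDEFINED
    return a == b


def is_identical(a, b):
    """Model of =?=: always True or False, never UNDEFINED."""
    if a is UNDEFINED or b is UNDEFINED:
        return a is b
    return a == b


def or_(a, b):
    """Model of ||: True wins over UNDEFINED."""
    if a is True or b is True:
        return True
    if a is False and b is False:
        return False
    return UNDEFINED


def and_(a, b):
    """Model of &&: False wins over UNDEFINED."""
    if a is False or b is False:
        return False
    if a is True and b is True:
        return True
    return UNDEFINED


def ckpt_requirements(job, machine):
    """Evaluate the CkptRequirements expression for one job and machine."""
    ckpt_arch = job.get("CkptArch", UNDEFINED)
    ckpt_opsys = job.get("CkptOpSys", UNDEFINED)
    arch_ok = or_(eq(ckpt_arch, machine["Arch"]),
                  is_identical(ckpt_arch, UNDEFINED))
    opsys_ok = or_(eq(ckpt_opsys, machine["OpSys"]),
                   is_identical(ckpt_opsys, UNDEFINED))
    return and_(arch_ok, opsys_ok)
```

In this model, a job with no checkpoint attributes evaluates to True on any machine, while a job checkpointed on an SGI machine running IRIX65 evaluates to True only for a machine whose Arch and OpSys match.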
The complete submit description file for this example:
####################
#
# Example of heterogeneous submission
#
####################
universe = standard
Executable = povray.$$(OpSys).$$(Arch)
Log = povray.log
Output = povray.out.$(Process)
Error = povray.err.$(Process)
CkptRequirements = ((CkptArch == Arch) || (CkptArch =?= UNDEFINED)) && \
((CkptOpSys == OpSys) || (CkptOpSys =?= UNDEFINED))
Requirements = ( (Arch == "INTEL" && OpSys == "LINUX") || \
(Arch == "INTEL" && OpSys =="SOLARIS26") || \
(Arch == "SGI" && OpSys == "IRIX65") ) && $(CkptRequirements)
Arguments = +W1024 +H768 +Iimage1.pov
Queue
Arguments = +W1024 +H768 +Iimage2.pov
Queue
Arguments = +W1024 +H768 +Iimage3.pov
Queue