A directed acyclic graph (DAG) can be used to represent a set of programs where the input, output, or execution of one or more programs is dependent on one or more other programs. The programs are nodes (vertices) in the graph, and the edges (arcs) identify the dependencies. Condor alone finds machines for the execution of programs, but it does not schedule programs (jobs) based on dependencies. The Directed Acyclic Graph Manager (DAGMan) is a meta-scheduler for Condor jobs. DAGMan submits jobs to Condor in an order represented by a DAG and processes the results. An input file defined prior to submission describes the DAG, and a Condor submit description file for each program in the DAG is used by Condor.
Each node (program) in the DAG needs its own Condor submit description file. As DAGMan submits jobs to Condor, it uses a single Condor log file to enforce the ordering required for the DAG. The DAG itself is defined by the contents of a DAGMan input file. DAGMan is responsible for scheduling, recovery, and reporting for the set of programs submitted to Condor.
The following sections specify the use of DAGMan.
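The input file used by DAGMan specifies three items:
1. A list of the programs in the DAG, each mapped to its Condor submit description file.
2. Any processing that must take place before a program is submitted to Condor or after it completes.
3. The dependencies between the programs in the DAG.
These three items are placed in the input file for DAGMan in the order listed.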
Comments may be placed in the input file that describes the DAG. The pound character (#) as the first character on a line identifies the line as a comment. Comments do not span lines.
An example input file for DAGMan is
# Filename: diamond.dag
#
Job A A.condor
Job B B.condor
Job C C.condor
Job D D.condor
Script PRE A top_pre.csh
Script PRE B mid_pre.perl $JOB
Script POST B mid_post.perl $JOB $RETURN
Script PRE C mid_pre.perl $JOB
Script POST C mid_post.perl $JOB $RETURN
Script PRE D bot_pre.csh
PARENT A CHILD B C
PARENT B C CHILD D
This input file describes the DAG shown in Figure 2.2.
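The diamond shape can be made concrete with a small sketch. The Python below (standard library graphlib, Python 3.9 or later) encodes the PARENT/CHILD lines of diamond.dag as a mapping from each node to its parents and prints one order in which the nodes could run. It only illustrates the dependency structure. It is not how DAGMan is implemented, and the variable names are made up for this example.

```python
from graphlib import TopologicalSorter

# Each node maps to the set of parents it depends on, taken from
# the PARENT/CHILD lines of diamond.dag.
diamond_parents = {
    "A": set(),
    "B": {"A"},
    "C": {"A"},
    "D": {"B", "C"},
}

# static_order() yields every node only after all of its parents.
order = list(TopologicalSorter(diamond_parents).static_order())
print(order)
```

A always comes first and D always comes last. B and C have no dependency on each other, so either may precede the other.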
The first section of the input file lists all the programs that appear in the DAG. Each program is described by a single line called a Job Entry. The syntax used for each Job Entry is
JOB JobName CondorSubmitDescriptionFile [DONE]
A Job Entry maps a JobName to a Condor submit description file. The JobName uniquely identifies programs within the DAGMan input file and within output messages.
The keyword JOB and the JobName are not case sensitive. A JobName of joba is equivalent to JobA. The CondorSubmitDescriptionFile is case sensitive, since the UNIX file system is case sensitive. The JobName can be any string that contains no white space.
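As an illustration of these rules, the following Python sketch parses a Job Entry. The function parse_job_entry is a hypothetical helper, not part of Condor. It lower-cases the JobName so that joba and JobA compare equal, and it leaves the file name untouched. Accepting DONE in any letter case is an assumption of this sketch.

```python
def parse_job_entry(line):
    """Parse 'JOB JobName CondorSubmitDescriptionFile [DONE]'.

    Hypothetical helper for illustration only.
    """
    fields = line.split()
    if len(fields) not in (3, 4) or fields[0].upper() != "JOB":
        raise ValueError(f"not a Job Entry: {line!r}")
    done = False
    if len(fields) == 4:
        # Assumption: DONE is matched without regard to case.
        if fields[3].upper() != "DONE":
            raise ValueError(f"unexpected trailing field: {fields[3]!r}")
        done = True
    return {
        # Job names are not case sensitive, so store them in lower case.
        "name": fields[1].lower(),
        # The file name keeps its case, since the file system is case sensitive.
        "submit_file": fields[2],
        "done": done,
    }


print(parse_job_entry("Job A A.condor"))
print(parse_job_entry("JOB a A.condor DONE"))
```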
The optional DONE identifies a job as being already completed. This is useful in situations where the user wishes to verify results, but does not need all programs within the dependency graph to be executed. The DONE feature is also utilized when an error occurs causing the DAG to not be completed. DAGMan generates a Rescue DAG, a DAGMan input file that can be used to restart and complete a DAG without re-executing completed programs.
The second type of item in a DAGMan input file enumerates processing that is done either before a program within the DAG is submitted to Condor for execution or after a program within the DAG completes its execution. Processing done before a program is submitted to Condor is called a PRE script. Processing done after a program completes its execution under Condor is called a POST script. A node in the DAG consists of the program together with its PRE and/or POST scripts. The dependencies in the DAG are enforced based on nodes.
The syntax for PRE and POST script lines within the input file is:
SCRIPT PRE JobName ExecutableName [arguments]
SCRIPT POST JobName ExecutableName [arguments]
The SCRIPT keyword identifies the type of line within the DAG input file. The PRE or POST keyword specifies the relative timing of when the script is to be run. The JobName specifies the job to which the script is attached. The ExecutableName specifies the script to be executed, and it may be followed by any command line arguments to that script. The ExecutableName and optional arguments have their case preserved.
Scripts are optional for each job, and any scripts are executed on the machine to which the DAGMan is submitted.
The PRE and POST scripts are commonly used when files must be placed into a staging area for the job to use, and files are cleaned up or removed once the job is finished running. An example using PRE/POST scripts involves staging files that are stored on tape. The PRE script reads compressed input files from the tape drive, and it uncompresses them, placing the input files in the current directory. The program within the DAG node is submitted to Condor, and it reads these input files. The program produces output files. The POST script compresses the output files, writes them out to the tape, and then deletes the staged input and output files.
DAGMan takes note of the exit value of the program as well as the exit value of its scripts. If the PRE script fails (exit value != 0), then neither the job nor the POST script runs, and the node is marked as failed.
If the PRE script succeeds, the program is submitted to Condor. If the program fails, the DAG node is marked as failed. An exit value not equal to 0 indicates program failure. It is therefore important that the program returns the exit value 0 to indicate the program did not fail.
The POST script is run regardless of the job's return value. If the POST script fails (exit value != 0), then the node is marked as failed.
A node not marked as failed at any point is successful.
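The exit value rules above can be summarized in a short Python sketch. The function node_succeeded and its parameter names are hypothetical. The sketch follows only the rules stated in this section and makes no claim about how DAGMan implements them.

```python
def node_succeeded(job_exit, pre_exit=None, post_exit=None):
    """Return True if the node counts as successful.

    pre_exit and post_exit are None when the node has no such script.
    job_exit and post_exit are ignored when the PRE script fails,
    because neither the job nor the POST script runs in that case.
    """
    if pre_exit is not None and pre_exit != 0:
        return False            # PRE failed: job and POST never run
    failed = job_exit != 0      # a non-zero job exit marks the node failed
    if post_exit is not None and post_exit != 0:
        failed = True           # POST runs regardless, and may also fail the node
    return not failed


print(node_succeeded(0))                            # True
print(node_succeeded(0, pre_exit=1))                # False
print(node_succeeded(1, pre_exit=0, post_exit=0))   # False
print(node_succeeded(0, pre_exit=0, post_exit=2))   # False
```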
Two variables are available to ease script writing. The $JOB variable evaluates to JobName. The $RETURN variable evaluates to the return value of the program. The variables may be placed anywhere within the arguments.
As an example, suppose the PRE script expands a compressed file named JobName.gz. The SCRIPT entries for jobs A, B, and C are
SCRIPT PRE A pre.csh $JOB .gz
SCRIPT PRE B pre.csh $JOB .gz
SCRIPT PRE C pre.csh $JOB .gz
The script pre.csh may use these arguments:
#!/bin/csh
gunzip $argv[1]$argv[2]
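The substitution of the two variables can be sketched in Python. The function expand_script_args is a hypothetical stand-in for whatever DAGMan does internally. It performs plain text replacement and then splits the result on white space, which is enough to show what pre.csh receives.

```python
def expand_script_args(arguments, job_name, return_value=None):
    """Replace $JOB and $RETURN in a SCRIPT line's argument string.

    Hypothetical helper; return_value is only known for POST scripts.
    """
    expanded = arguments.replace("$JOB", job_name)
    if return_value is not None:
        expanded = expanded.replace("$RETURN", str(return_value))
    return expanded.split()


argv = expand_script_args("$JOB .gz", "A")
print(argv)                 # ['A', '.gz']
print(argv[0] + argv[1])    # A.gz, the name pre.csh passes to gunzip
```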
The third type of item in the DAG input file describes the dependencies within the DAG. Nodes are parents and/or children within the DAG. A parent node must be completed successfully before any child node may be started. A child node is started once all its parents have successfully completed.
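The rule for starting a child node can be sketched as a set comparison. In the hypothetical Python function below, a node is ready when it has not yet been started and every one of its parents is in the set of successfully completed nodes. A failed parent never enters that set, so its children never become ready.

```python
def ready_nodes(parents, succeeded, started):
    """Return nodes whose parents have all succeeded and that have not started.

    parents   -- dict mapping each node to the set of its parent nodes
    succeeded -- set of nodes that completed successfully
    started   -- set of nodes already started (running, done, or failed)
    """
    return sorted(
        node
        for node, needed in parents.items()
        if node not in started and needed <= succeeded
    )


parents = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}
print(ready_nodes(parents, succeeded=set(), started=set()))            # ['A']
print(ready_nodes(parents, succeeded={"A"}, started={"A"}))            # ['B', 'C']
# C is still running (or has failed), so D must wait.
print(ready_nodes(parents, succeeded={"A", "B"}, started={"A", "B", "C"}))  # []
```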
The syntax of a dependency line within the DAG input file:
PARENT ParentJobName... CHILD ChildJobName...
The PARENT keyword is followed by one or more ParentJobNames. The CHILD keyword is followed by one or more ChildJobNames. Each child job depends on every parent job on the line. A single line in the input file can specify the dependencies from one or more parents to one or more children. As an example, the line
PARENT p1 p2 CHILD c1 c2
produces four dependencies:
p1 to c1
p1 to c2
p2 to c1
p2 to c2
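The expansion of one dependency line into individual parent-to-child pairs is a cross product, which the following Python sketch spells out. The function parse_dependency is a hypothetical helper. Treating the PARENT and CHILD keywords as case insensitive is an assumption of the sketch.

```python
def parse_dependency(line):
    """Expand 'PARENT p... CHILD c...' into a list of (parent, child) pairs."""
    fields = line.split()
    # Assumption: keywords are matched without regard to case.
    upper = [field.upper() for field in fields]
    if not upper or upper[0] != "PARENT" or "CHILD" not in upper:
        raise ValueError(f"not a dependency line: {line!r}")
    split_at = upper.index("CHILD")
    parent_names = fields[1:split_at]
    child_names = fields[split_at + 1:]
    if not parent_names or not child_names:
        raise ValueError("need at least one parent and one child")
    # Every child depends on every parent named on the line.
    return [(p, c) for p in parent_names for c in child_names]


for parent, child in parse_dependency("PARENT p1 p2 CHILD c1 c2"):
    print(parent, "to", child)
```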
Each node in a DAG may be a unique executable, each with a unique Condor submit description file. Each program may be submitted to a different universe within Condor, for example standard, vanilla, or DAGMan.
Two limitations exist. First, each Condor submit description file must submit only one job. There may not be multiple queue lines, or DAGMan will fail. The second limitation is that the submit description files for all jobs within the DAG must use the same log. DAGMan enforces the dependencies within a DAG using the events recorded in the log file produced by job submission to Condor.
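Both limitations can be checked before submission with a few lines of Python. The sketch below is a simplified, hypothetical checker. It reads submit description text from a dictionary rather than from disk, recognizes only plain "name = value" lines and lines beginning with queue, and does not understand macros or a queue line that requests several jobs at once.

```python
def check_submit_files(submit_texts):
    """Check the two limitations for a dict of {file name: file contents}.

    Returns a list of problem descriptions; an empty list means none found.
    """
    problems = []
    logs = set()
    for name, text in submit_texts.items():
        queue_lines = 0
        log_value = None
        for raw in text.splitlines():
            line = raw.strip()
            if not line or line.startswith("#"):
                continue                      # skip blanks and comments
            if line.split()[0].lower() == "queue":
                queue_lines += 1
            elif "=" in line:
                key, value = line.split("=", 1)
                if key.strip().lower() == "log":
                    log_value = value.strip()
        if queue_lines != 1:
            problems.append(f"{name}: expected one queue line, found {queue_lines}")
        logs.add(log_value)
    if len(logs) > 1:
        problems.append("submit description files do not share one log file")
    return problems


good = "executable = a.exe\nlog = diamond_condor.log\nqueue\n"
bad = "executable = b.exe\nlog = other.log\nqueue\nqueue\n"
print(check_submit_files({"A.condor": good, "B.condor": good}))
print(check_submit_files({"A.condor": good, "B.condor": bad}))
```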
Here is an example Condor submit description file to go with the diamond-shaped DAG example.
# Filename: diamond_job.condor
#
executable = /path/diamond.exe
output = diamond.out.$(cluster)
error = diamond.err.$(cluster)
log = diamond_condor.log
universe = vanilla
notification = NEVER
queue
This example uses the same Condor submit description file for all the jobs in the DAG. This implies that each node within the DAG runs the same program. The $(cluster) macro is used to produce unique file names for each program's output. Each job is submitted separately, into its own cluster, so this provides unique names for the output files.
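The effect of the $(cluster) macro on file names can be illustrated with a plain text substitution. The Python function expand_cluster is hypothetical and the cluster numbers are invented for the example. Condor performs the real expansion when the job is submitted.

```python
def expand_cluster(pattern, cluster):
    """Substitute a cluster number into a file name pattern (illustration only)."""
    return pattern.replace("$(cluster)", str(cluster))


# Two nodes submitted from the same submit description file land in
# different clusters, so their output files do not collide.
for cluster in (11, 12):
    print(expand_cluster("diamond.out.$(cluster)", cluster))
# diamond.out.11
# diamond.out.12
```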
The notification is set to NEVER in this example. This tells Condor not to send e-mail about the completion of a program submitted to Condor. For DAGs with many nodes, this setting reduces or eliminates an excessive number of e-mail messages.
A DAG is submitted using the program condor_submit_dag. See the manual page for complete details.
A simple submission has the syntax
condor_submit_dag DAGInputFileName
The example may be submitted with
condor_submit_dag diamond.dag
In order to guarantee recoverability, the DAGMan program itself is run as a Condor job. As such, it needs a submit description file. DAGMan produces the needed file, naming it by appending .condor.sub to the DAGInputFileName. This submit description file may be edited if the DAG is submitted with
condor_submit_dag -no_submit diamond.dag
causing DAGMan to generate the submit description file, but not submit DAGMan to Condor. To submit the DAG, once the submit description file is edited, use
condor_submit diamond.dag.condor.sub
An optional argument to condor_submit_dag, -maxjobs, specifies the maximum number of jobs that DAGMan may submit to Condor at one time. It is commonly used when there is a limited amount of input file staging capacity. As a specific example, consider a case where each job requires 4 MB of input files, and the jobs run in a directory on a volume with 100 MB of free space. Using the argument -maxjobs 25 guarantees that at most 25 jobs are submitted to Condor at one time, so the staged input files never need more than 25 x 4 MB = 100 MB.
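The arithmetic behind such a limit can be written out in a few lines of Python. The variable names are made up for the example, and the calculation considers staged input files only. Space needed for output files would lower the limit.

```python
free_space_mb = 100      # free space on the staging volume
input_per_job_mb = 4     # input files staged for each job

# Largest number of jobs whose input files fit at the same time.
max_jobs = free_space_mb // input_per_job_mb

print(f"condor_submit_dag -maxjobs {max_jobs} diamond.dag")
```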
After submission, the progress of the DAG can be monitored by looking at the common log file, observing the e-mail that program submission to Condor causes, or by using condor_q.
A DAG can fail in one of two ways. Either DAGMan itself fails, or a node within the DAG fails. If DAGMan fails, no Condor jobs will remain. Currently, if a node within the DAG fails, DAGMan continues running as a Condor job.
condor_submit_dag attempts to check the DAG input file to verify that all the nodes in the DAG specify the same log file. If a problem is detected, condor_submit_dag prints out an error message and aborts.
To omit the check that all nodes use the same log file, as may be desired in the case where there are thousands of nodes, submit the job with the -log option. An example of this submission:
condor_submit_dag -log diamond_condor.log
This option tells condor_submit_dag to omit the verification step and use the given file as the log file.
To remove an entire DAG, consisting of DAGMan plus any jobs submitted to Condor, remove the DAGMan job running under Condor. condor_q will list the job number. Use the job number to remove the job, for example
% condor_q
-- Submitter: turunmaa.cs.wisc.edu : <128.105.175.125:36165> : turunmaa.cs.wisc.edu
 ID      OWNER     SUBMITTED     RUN_TIME ST PRI SIZE CMD
   9.0   smoler   10/12 11:47   0+00:01:32 R  0   8.7  condor_dagman -f -
  11.0   smoler   10/12 11:48   0+00:00:00 I  0   3.6  B.out
  12.0   smoler   10/12 11:48   0+00:00:00 I  0   3.6  C.out
3 jobs; 2 idle, 1 running, 0 held
% condor_rm 9.0
Before the DAGMan job stops running, it uses condor_rm to remove any Condor jobs within the DAG that are running.
In the case where a machine is scheduled to go down, DAGMan will clean up memory and exit. However, it will leave any submitted jobs in Condor's queue.
NOTE: The Rescue DAG feature is not implemented.
DAGMan does not support job resubmission on failure. If any node in the DAG fails, the entire DAG is aborted. As a substitute for resubmission, DAGMan offers an approach called the Rescue DAG.
The Rescue DAG is a DAG input file, functionally the same as the original DAG file. It additionally marks each successfully completed node with the DONE option on its Job Entry. If the DAG is resubmitted, the jobs marked as completed will not be resubmitted.
The Rescue DAG is automatically generated by DAGMan when a node within the DAG fails. The file is named by appending the suffix .rescue to the DAGInputFileName. Statistics about the failed DAG execution are presented as comments at the beginning of the Rescue DAG input file.
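The idea of a Rescue DAG can be sketched in Python. The functions below are hypothetical and do not reproduce the exact file DAGMan writes. In particular, the statistics comments are omitted. The sketch copies the original input file and appends DONE to the Job Entry of each node that completed successfully.

```python
def make_rescue_dag(dag_text, completed):
    """Return DAG input text in which completed nodes are marked DONE."""
    completed = {name.lower() for name in completed}   # names are not case sensitive
    out = []
    for line in dag_text.splitlines():
        fields = line.split()
        is_job = bool(fields) and fields[0].upper() == "JOB"
        # Only a three-field Job Entry lacks DONE already.
        if is_job and len(fields) == 3 and fields[1].lower() in completed:
            line = line.rstrip() + " DONE"
        out.append(line)
    return "\n".join(out) + "\n"


def rescue_file_name(dag_input_file_name):
    """Name of the Rescue DAG file: the input file name plus .rescue."""
    return dag_input_file_name + ".rescue"


original = "Job A A.condor\nJob B B.condor\nPARENT A CHILD B\n"
print(rescue_file_name("diamond.dag"))
print(make_rescue_dag(original, completed={"A"}), end="")
```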