A directed acyclic graph (DAG) can be used to represent a set of programs where the input, output, or execution of one or more programs is dependent on one or more other programs. The programs are nodes (vertices) in the graph, and the edges (arcs) identify the dependencies. Condor finds machines for the execution of programs, but it does not schedule programs (jobs) based on dependencies. The Directed Acyclic Graph Manager (DAGMan) is a meta-scheduler for Condor jobs. DAGMan submits jobs to Condor in an order represented by a DAG and processes the results. An input file defined prior to submission describes the DAG, and a Condor submit description file for each program in the DAG is used by Condor.
Each node (program) in the DAG needs its own Condor submit description file. As DAGMan submits jobs to Condor, it uses a single Condor log file to enforce the ordering required for the DAG. The DAG itself is defined by the contents of a DAGMan input file. DAGMan is responsible for scheduling, recovery, and reporting for the set of programs submitted to Condor.
The following sections specify the use of DAGMan.
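The input file used by DAGMan specifies four items:
1. A list of the programs in the DAG, which names each program and gives its Condor submit description file.
2. Processing that takes place before a program in the DAG is submitted to Condor, or after the program completes its execution.
3. A description of the dependencies in the DAG.
4. The number of times to retry a node within the DAG if it fails.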
Comments may be placed in the input file that describes the DAG. The pound character (#) as the first character on a line identifies the line as a comment. Comments do not span lines.
An example input file for DAGMan is
# Filename: diamond.dag
#
Job A A.condor
Job B B.condor
Job C C.condor
Job D D.condor
Script PRE A top_pre.csh
Script PRE B mid_pre.perl $JOB
Script POST B mid_post.perl $JOB $RETURN
Script PRE C mid_pre.perl $JOB
Script POST C mid_post.perl $JOB $RETURN
Script PRE D bot_pre.csh
PARENT A CHILD B C
PARENT B C CHILD D
Retry C 3
This input file describes the DAG shown in Figure 2.2.
The first section of the input file lists all the programs that appear in the DAG. Each program is described by a single line called a Job Entry. The syntax used for each Job Entry is
JOB JobName CondorSubmitDescriptionFile [DONE]
A Job Entry maps a JobName to a Condor submit description file. The JobName uniquely identifies nodes within the DAGMan input file and within output messages.
The keyword JOB and the JobName are not case sensitive. A JobName of joba is equivalent to JobA. The CondorSubmitDescriptionFile is case sensitive, since the UNIX file system is case sensitive. The JobName can be any string that contains no white space.
The optional DONE identifies a job as being already completed. This is useful in situations where the user wishes to verify results, but does not need all programs within the dependency graph to be executed. The DONE feature is also utilized when an error occurs causing the DAG to not be completed. DAGMan generates a Rescue DAG, a DAGMan input file that can be used to restart and complete a DAG without re-executing completed programs.
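The Job Entry rules above can be illustrated with a small parser. This is only a sketch of the syntax as described here, not DAGMan's own parsing code; the names JobEntry and parse_job_entry are hypothetical, and matching DONE without regard to case is an assumption.

```python
from typing import NamedTuple


class JobEntry(NamedTuple):
    name: str         # JobName, lower-cased because JobNames are not case sensitive
    submit_file: str  # case preserved, since file names are case sensitive
    done: bool        # True when the optional DONE marker is present


def parse_job_entry(line: str) -> JobEntry:
    """Parse one 'JOB JobName CondorSubmitDescriptionFile [DONE]' line."""
    fields = line.split()
    if len(fields) not in (3, 4) or fields[0].upper() != "JOB":
        raise ValueError(f"not a Job Entry: {line!r}")
    # Assumption: DONE is matched without regard to case, like JOB.
    if len(fields) == 4 and fields[3].upper() != "DONE":
        raise ValueError(f"unexpected trailing field: {fields[3]!r}")
    return JobEntry(fields[1].lower(), fields[2], len(fields) == 4)
```

With this sketch, the entries "Job JobA A.condor" and "JOB joba A.condor" yield the same node name, while the submit description file name is left untouched.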
The second type of item in a DAGMan input file enumerates processing that is done either before a program within the DAG is submitted to Condor for execution or after a program within the DAG completes its execution. Processing done before a program is submitted to Condor is called a PRE script. Processing done after a program completes its execution under Condor is called a POST script. A node in the DAG consists of the program together with any PRE and/or POST scripts. The dependencies in the DAG are enforced based on nodes.
The syntax for PRE and POST script lines within the input file is
SCRIPT PRE JobName ExecutableName [arguments]
SCRIPT POST JobName ExecutableName [arguments]
The SCRIPT keyword identifies the type of line within the DAG input file. The PRE or POST keyword specifies the relative timing of when the script is to be run. The JobName specifies the node to which the script is attached. The ExecutableName specifies the script to be executed, and it may be followed by any command line arguments to that script. The ExecutableName and optional arguments have their case preserved.
Scripts are optional for each job, and any scripts are executed on the machine to which the DAG is submitted.
The PRE and POST scripts are commonly used when files must be placed into a staging area for the job to use, and files are cleaned up or removed once the job is finished running. An example using PRE/POST scripts involves staging files that are stored on tape. The PRE script reads compressed input files from the tape drive, and it uncompresses them, placing the input files in the current directory. The program within the DAG node is submitted to Condor, and it reads these input files. The program produces output files. The POST script compresses the output files, writes them out to the tape, and then deletes the staged input and output files.
DAGMan takes note of the exit value of the scripts as well as the program. If the PRE script fails (exit value != 0), then neither the program nor the POST script runs, and the node is marked as failed.
If the PRE script succeeds, the program is submitted to Condor. If the program fails and there is no POST script, the DAG node is marked as failed. An exit value not equal to 0 indicates program failure. It is therefore important that the program returns the exit value 0 to indicate the program did not fail.
If the program fails and there is a POST script, node failure is determined by the exit value of the POST script. A failing value from the POST script marks the node as failed. A succeeding value from the POST script (even with a failed program) marks the node as successful. Therefore, the POST script may need to consider the return value from the program.
By default, the POST script is run regardless of the program's return value. To prevent POST scripts from running after failed jobs, pass the -NoPostFail argument to condor_submit_dag.
A node not marked as failed at any point is successful.
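The success and failure rules above can be condensed into a single decision function. The following is a minimal sketch, not DAGMan code; the function name and parameters are hypothetical. It assumes that a failing POST script also fails the node when the program itself succeeded, which the text implies but does not state outright.

```python
from typing import Optional


def node_succeeded(pre_exit: Optional[int],
                   job_exit: int,
                   post_exit: Optional[int],
                   no_post_fail: bool = False) -> bool:
    """Return True when a node counts as successful.

    pre_exit and post_exit are None when the node has no such script.
    job_exit is ignored when the PRE script fails, because the program
    is never submitted in that case.
    """
    if pre_exit is not None and pre_exit != 0:
        return False          # PRE failed: program and POST never run
    job_ok = (job_exit == 0)
    if post_exit is None:
        return job_ok         # no POST script: the program decides
    if not job_ok and no_post_fail:
        return False          # -NoPostFail: POST is skipped after a failed program
    return post_exit == 0     # otherwise the POST script decides
```

For example, node_succeeded(0, 1, 0) is True: the program failed, but the POST script reported success, so the node is marked successful.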
Two variables are available to ease script writing. The $JOB variable evaluates to JobName. For POST scripts, the $RETURN variable evaluates to the return value of the program. The variables may be placed anywhere within the arguments.
As an example, suppose the PRE script expands a compressed file named JobName.gz. The SCRIPT entries for jobs A, B, and C are
SCRIPT PRE A pre.csh $JOB .gz
SCRIPT PRE B pre.csh $JOB .gz
SCRIPT PRE C pre.csh $JOB .gz
The script pre.csh may use these arguments
#!/bin/csh
gunzip $argv[1]$argv[2]
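To make the variable substitution concrete, the sketch below expands $JOB and $RETURN in a list of script arguments. It assumes plain textual replacement within each argument; how DAGMan itself tokenizes and expands the line is not specified here, and expand_script_args is a hypothetical name.

```python
from typing import List, Optional


def expand_script_args(args: List[str], job_name: str,
                       return_value: Optional[int] = None) -> List[str]:
    """Replace $JOB (and, for POST scripts, $RETURN) in each argument."""
    expanded = []
    for arg in args:
        arg = arg.replace("$JOB", job_name)
        if return_value is not None:   # $RETURN only applies to POST scripts
            arg = arg.replace("$RETURN", str(return_value))
        expanded.append(arg)
    return expanded


# Arguments of "SCRIPT PRE A pre.csh $JOB .gz" for node A
pre_args = expand_script_args(["$JOB", ".gz"], "A")
```

Here pre_args is ["A", ".gz"], and pre.csh joins its two arguments to form the file name A.gz.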
The third type of item in the DAG input file describes the dependencies within the DAG. Nodes are parents and/or children within the DAG. A parent node must be completed successfully before any child node may be started. A child node is started once all its parents have successfully completed.
The syntax of a dependency line within the DAG input file:
PARENT ParentJobName... CHILD ChildJobName...
The PARENT keyword is followed by one or more ParentJobNames. The CHILD keyword is followed by one or more ChildJobNames. Each child job depends on every parent job on the line. A single line in the input file can specify the dependencies from one or more parents to one or more children. As an example, the line
PARENT p1 p2 CHILD c1 c2
produces four dependencies:
p1 to c1
p1 to c2
p2 to c1
p2 to c2
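The expansion of one dependency line into individual parent-to-child edges can be sketched as follows. This is an illustration only; expand_dependencies is a hypothetical helper, and treating the PARENT and CHILD keywords as case insensitive is an assumption.

```python
from typing import List, Tuple


def expand_dependencies(line: str) -> List[Tuple[str, str]]:
    """Turn 'PARENT p... CHILD c...' into a list of (parent, child) pairs."""
    fields = line.split()
    keywords = [f.upper() for f in fields]
    if not keywords or keywords[0] != "PARENT" or "CHILD" not in keywords:
        raise ValueError(f"not a dependency line: {line!r}")
    split_at = keywords.index("CHILD")
    parents = fields[1:split_at]
    children = fields[split_at + 1:]
    if not parents or not children:
        raise ValueError(f"need at least one parent and one child: {line!r}")
    # Every child depends on every parent named on the line.
    return [(p, c) for p in parents for c in children]
```

Applied to the line PARENT p1 p2 CHILD c1 c2, the function returns the same four pairs listed above, in the same order.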
The fourth type of item in the DAG input file provides an optional way to retry failed nodes. The syntax for retry is
Retry JobName NumberOfRetries
where the JobName is the same as the name given in a Job Entry line, and NumberOfRetries is an integer, the number of times to retry the node after failure. The default number of retries for any node is 0, the same as not having a retry line in the file.
In the event of retry, all parts of a node within the DAG are redone, following the same rules regarding node failure as given above. The PRE script is executed first, followed by submitting the program to Condor upon success of the PRE script. Failure of the node is then determined by the return value of the program and by the existence and return value of a POST script.
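The retry behavior amounts to a bounded loop around one complete attempt at the node. The sketch below is an assumption-laden illustration, not DAGMan's implementation: run_node_once is a hypothetical callable that performs the PRE script, the program, and the POST script, and reports whether the node succeeded.

```python
from typing import Callable


def run_node_with_retries(run_node_once: Callable[[], bool],
                          retries: int = 0) -> bool:
    """Attempt a node once, then up to `retries` more times after failure."""
    attempts_allowed = 1 + retries   # the default of 0 retries means one attempt
    for _ in range(attempts_allowed):
        if run_node_once():          # redoes PRE, program, and POST each time
            return True
    return False
```

With the entry Retry C 3 from the example file, node C may therefore be attempted up to four times in total.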
Each node in a DAG may be a unique executable, each with a unique Condor submit description file. Each program may be submitted to a different universe within Condor, for example standard, vanilla, or DAGMan.
Two limitations exist. First, each Condor submit description file must submit only one job. There may not be multiple queue lines, or DAGMan will fail. The second limitation is that the submit description files for all jobs within the DAG must specify the same log. DAGMan enforces the dependencies within a DAG using the events recorded in the log file produced by job submission to Condor.
Here is an example Condor submit description file to go with the diamond-shaped DAG example.
# Filename: diamond_job.condor
#
executable = /path/diamond.exe
output = diamond.out.$(cluster)
error = diamond.err.$(cluster)
log = diamond_condor.log
universe = vanilla
notification = NEVER
queue
This example uses the same Condor submit description file for all the jobs in the DAG. This implies that each node within the DAG runs the same program. The $(cluster) macro is used to produce unique file names for each program's output. Each job is submitted separately, into its own cluster, so this provides unique names for the output files.
The notification is set to NEVER in this example. This tells Condor not to send e-mail about the completion of a program submitted to Condor. For DAGs with many nodes, this is recommended to reduce or eliminate excessive numbers of e-mails.
A DAG is submitted using the program condor_submit_dag. See the manual page for complete details.
A simple submission has the syntax
condor_submit_dag DAGInputFileName
The example may be submitted with
condor_submit_dag diamond.dag
In order to guarantee recoverability, the DAGMan program itself is run as a Condor job. As such, it needs a submit description file. condor_submit_dag produces the needed file, naming it by appending .condor.sub to the DAGInputFileName. This submit description file may be edited if the DAG is submitted with
condor_submit_dag -no_submit diamond.dag
causing condor_submit_dag to generate the submit description file, but not submit DAGMan to Condor. To submit the DAG, once the submit description file is edited, use
condor_submit diamond.dag.condor.sub
An optional argument to condor_submit_dag, -maxjobs, is used to specify the maximum number of Condor jobs that DAGMan may submit to Condor at one time. It is commonly used when there is a limited amount of input file staging capacity. As a specific example, consider a case where each job requires 4 Mbytes of input files, and the jobs run in a directory on a volume with 100 Mbytes of free space. Using the argument -maxjobs 25 guarantees that a maximum of 25 jobs, using a maximum of 100 Mbytes of space, will be submitted to Condor at one time.
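The value 25 in this example is simply the free staging space divided by the space needed per job, rounded down. A small sketch of that arithmetic follows; max_jobs_for_staging is a hypothetical helper, not part of Condor.

```python
def max_jobs_for_staging(free_space_mbytes: int, per_job_mbytes: int) -> int:
    """Largest number of jobs whose staged input fits in the free space."""
    if per_job_mbytes <= 0:
        raise ValueError("per-job space must be positive")
    return free_space_mbytes // per_job_mbytes   # round down: no partial jobs


print(max_jobs_for_staging(100, 4))  # prints 25
```

The result would be passed as the value of -maxjobs. If the jobs also write output files into the same volume, the per-job figure should include that space as well.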
While the -maxjobs argument is used to limit the number of Condor jobs submitted at one time, it may be desirable to limit the number of scripts running at one time. The optional -maxpre argument limits the number of PRE scripts that may be running at one time, while the optional -maxpost argument limits the number of POST scripts that may be running at one time.
After submission, the progress of the DAG can be monitored by looking at the common log file, observing the e-mail that program submission to Condor causes, or by using condor_q -dag.
condor_submit_dag attempts to check the DAG input file to verify that all the nodes in the DAG specify the same log file. If a problem is detected, condor_submit_dag prints out an error message and aborts.
To omit the check that all nodes use the same log file, as may be desired in the case where there are thousands of nodes, submit the job with the -log option. An example of this submission:
condor_submit_dag -log diamond_condor.log
This option tells condor_submit_dag to omit the verification step and use the given file as the log file.
To remove an entire DAG, consisting of DAGMan plus any jobs submitted to Condor, remove the DAGMan job running under Condor. condor_q will list the job number. Use the job number to remove the job, for example
% condor_q
-- Submitter: turunmaa.cs.wisc.edu : <128.105.175.125:36165> : turunmaa.cs.wisc.edu
 ID      OWNER     SUBMITTED     RUN_TIME    ST  PRI  SIZE  CMD
  9.0    smoler    10/12 11:47   0+00:01:32  R   0    8.7   condor_dagman -f -
 11.0    smoler    10/12 11:48   0+00:00:00  I   0    3.6   B.out
 12.0    smoler    10/12 11:48   0+00:00:00  I   0    3.6   C.out
3 jobs; 2 idle, 1 running, 0 held
% condor_rm 9.0
Before the DAGMan job stops running, it uses condor_rm to remove any Condor jobs within the DAG that are running.
In the case where a machine is scheduled to go down, DAGMan will clean up memory and exit. However, it will leave any submitted jobs in Condor's queue.
DAGMan can help with the resubmission of uncompleted portions of a DAG when one or more nodes resulted in failure. If any node in the DAG fails, the remainder of the DAG is continued until no more forward progress can be made based on the DAG's dependencies. At this point, DAGMan produces a file called a Rescue DAG.
The Rescue DAG is a DAG input file, functionally the same as the original DAG file. It additionally marks successfully completed nodes using the DONE option in the input description file. If the DAG is resubmitted using this Rescue DAG input file, the nodes marked as completed will not be re-executed.
The Rescue DAG is automatically generated by DAGMan when a node within the DAG fails. The file is named by appending the suffix .rescue to the DAGInputFileName. Statistics about the failed DAG execution are presented as comments at the beginning of the Rescue DAG input file.
If the Rescue DAG file is generated before all retries of a node are completed, then the Rescue DAG file will also contain Retry entries. The number of retries will be set to the appropriate remaining number of retries.
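The core idea of a Rescue DAG, marking completed nodes with DONE, can be sketched in a few lines. This is an illustration only and does not reproduce the file DAGMan actually writes: it omits the statistics comments and the adjustment of Retry counts, and mark_completed_jobs is a hypothetical name.

```python
from typing import Iterable, List, Set


def mark_completed_jobs(dag_lines: Iterable[str],
                        completed: Set[str]) -> List[str]:
    """Append DONE to the Job Entry of every node named in `completed`."""
    done_names = {name.lower() for name in completed}  # JobNames ignore case
    rescued = []
    for line in dag_lines:
        fields = line.split()
        # Only plain three-field Job Entries are changed; entries that
        # already carry DONE, and all other line types, pass through as-is.
        is_job = len(fields) == 3 and fields[0].upper() == "JOB"
        if is_job and fields[1].lower() in done_names:
            line = line.rstrip() + " DONE"
        rescued.append(line)
    return rescued
```

For the diamond example, if node A completed before a failure in node B, the line Job A A.condor would become Job A A.condor DONE, and resubmitting the resulting file would skip node A.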