A directed acyclic graph (DAG) can be used to represent a set of computations where the input, output, or execution of one or more computations is dependent on one or more other computations. The computations are nodes (vertices) in the graph, and the edges (arcs) identify the dependencies. Condor finds machines for the execution of programs, but it does not schedule programs based on dependencies. The Directed Acyclic Graph Manager (DAGMan) is a meta-scheduler for the execution of programs (computations). DAGMan submits the programs to Condor in an order represented by a DAG and processes the results. A DAG input file describes the DAG, and further submit description file(s) are used by DAGMan when submitting programs to run under Condor.
DAGMan is itself executed as a scheduler universe job within Condor. As DAGMan submits programs, it monitors log file(s) to enforce the ordering required within the DAG. DAGMan is also responsible for scheduling, recovery, and reporting on the set of programs submitted to Condor.
To DAGMan, a node in a DAG may encompass more than a single program submitted to run under Condor. Figure 2.2 illustrates the elements of a node.
At one time, the number of Condor jobs per node was restricted to one. This restriction is now relaxed such that all Condor jobs within a node must share a single cluster number. See the condor_submit manual page for a further definition of a cluster. A limitation exists such that all jobs within the single cluster must use the same log file.
As DAGMan schedules and submits jobs within nodes to Condor, these jobs are defined to succeed or fail based on their return values. This success or failure is propagated in well-defined ways to the level of a node within a DAG. Further progression of computation (towards completing the DAG) may be defined based upon the success or failure of one or more nodes.
The failure of a single job within a cluster of multiple jobs (within a single node) causes the entire cluster of jobs to fail. Any other jobs within the failed cluster of jobs are immediately removed. Each node within a DAG is further defined to succeed or fail, based upon the return values of a PRE script, the job(s) within the cluster, and/or a POST script.
The input file used by DAGMan is called a DAG input file. It may specify twelve types of items:
All items are optional, but there must be at least one JOB or DATA item.
Comments may be placed in the DAG input file. The pound character (#) as the first character on a line identifies the line as a comment. Comments do not span lines.
A simple diamond-shaped DAG, as shown in Figure 2.3, is presented as a starting point for examples. This DAG contains 4 nodes.
A very simple DAG input file for this diamond-shaped DAG is
# File name: diamond.dag
#
JOB A A.condor
JOB B B.condor
JOB C C.condor
JOB D D.condor
PARENT A CHILD B C
PARENT B C CHILD D
Each DAG input file key word is described below.
The JOB key word specifies a job to be managed by Condor. The syntax used for each JOB entry is
JOB JobName SubmitDescriptionFileName [DIR directory] [DONE]
A JOB entry maps a JobName to a Condor submit description file. The JobName uniquely identifies nodes within the DAGMan input file and in output messages. Note that the name for each node within the DAG must be unique.
The key words JOB and DONE are not case sensitive. Therefore, DONE, Done, and done are all equivalent. The values defined for JobName and SubmitDescriptionFileName are case sensitive, as file names in the Unix file system are case sensitive. The JobName can be any string that contains no white space, except for the strings PARENT and CHILD (in upper, lower, or mixed case).
The DIR option specifies a working directory for this node, from which the Condor job will be submitted, and from which a PRE and/or POST script will be run. Note that a DAG containing DIR specifications cannot be run in conjunction with the -usedagdir command-line argument to condor_submit_dag. A rescue DAG generated by a DAG run with the -usedagdir argument will contain DIR specifications, so the rescue DAG must be run without the -usedagdir argument.
The optional DONE identifies a job as being already completed. This is useful in situations where the user wishes to verify results, but does not need all programs within the dependency graph to be executed. The DONE feature is also utilized when an error occurs causing the DAG to be aborted without completion. DAGMan generates a Rescue DAG, a DAG input file that can be used to restart and complete a DAG without re-executing completed nodes.
The SUBDAG keyword specifies a special case of a JOB node, where the job is a DAG.
SUBDAG EXTERNAL JobName DagFileName [DIR directory] [DONE]
A SUBDAG node is essentially the same as a "normal" node, except that the nested DAG file is specified instead of the Condor submit file. ("SUBDAG EXTERNAL A foo.dag" is functionally equivalent to "JOB A foo.dag.condor.sub", but SUBDAG EXTERNAL is now the preferred syntax for specifying such a node.) "EXTERNAL" means that the SUBDAG is run in its own instance of condor_dagman.
For more information, see section 2.10.8.
The SPLICE keyword creates a named instance of a DAG as specified in another file as an entity which may have PARENT and CHILD dependencies associated with other splice names or node names in the including DAG file. The syntax for SPLICE is
SPLICE SpliceName DagFileName [DIR directory]
After parsing incorporates a splice, all nodes within the splice become nodes within the including DAG. For detailed information about splices, see section 2.10.10.
The DATA key word specifies a job to be managed by the Stork data placement server. The syntax used for each DATA entry is
DATA JobName SubmitDescriptionFileName [DIR directory] [DONE]
A DATA entry maps a JobName to a Stork submit description file. In all other respects, the DATA key word is identical to the JOB key word.
Here is an example of a simple DAG that stages in data using Stork, processes the data using Condor, and stages the processed data out using Stork. Depending upon the implementation, multiple data jobs to stage in data or to stage out data may be run in parallel.
DATA STAGE_IN1 stage_in1.stork
DATA STAGE_IN2 stage_in2.stork
JOB PROCESS process.condor
DATA STAGE_OUT1 stage_out1.stork
DATA STAGE_OUT2 stage_out2.stork
PARENT STAGE_IN1 STAGE_IN2 CHILD PROCESS
PARENT PROCESS CHILD STAGE_OUT1 STAGE_OUT2
The SCRIPT key word specifies processing that is done either before a job within the DAG is submitted to Condor or Stork for execution or after a job within the DAG completes its execution. Processing done before a job is submitted to Condor or Stork is called a PRE script. Processing done after a job completes its execution under Condor or Stork is called a POST script. A node in the DAG is comprised of the job together with PRE and/or POST scripts.
PRE and POST script lines within the DAG input file use the syntax:
SCRIPT PRE JobName ExecutableName [arguments]
SCRIPT POST JobName ExecutableName [arguments]
The SCRIPT key word identifies the type of line within the DAG input file. The PRE or POST key word specifies the relative timing of when the script is to be run. The JobName specifies the node to which the script is attached. The ExecutableName specifies the script to be executed, and it may be followed by any command line arguments to that script. The ExecutableName and optional arguments are case sensitive; they have their case preserved.
Scripts are optional for each job, and any scripts are executed on the machine from which the DAG is submitted; this is not necessarily the same machine upon which the node's Condor or Stork job is run. Further, a single cluster of Condor jobs may be spread across several machines.
A PRE script is commonly used to place files in a staging area for the cluster of jobs to use. A POST script is commonly used to clean up or remove files once the cluster of jobs is finished running. An example uses PRE and POST scripts to stage files that are stored on tape. The PRE script reads compressed input files from the tape drive, and it uncompresses them, placing the input files in the current directory. The cluster of Condor jobs reads these input files and produces output files. The POST script compresses the output files, writes them out to the tape, and then removes both the staged input files and the output files.
DAGMan takes note of the exit value of the scripts as well as the job. A script with an exit value not equal to 0 fails. If the PRE script fails, then neither the job nor the POST script runs, and the node fails.
If the PRE script succeeds, the Condor or Stork job is submitted. If the job fails and there is no POST script, the DAG node is marked as failed. An exit value not equal to 0 indicates program failure. It is therefore important that a successful program return the exit value 0.
If the job fails and there is a POST script, node failure is determined by the exit value of the POST script. A failing value from the POST script marks the node as failed. A succeeding value from the POST script (even with a failed job) marks the node as successful. Therefore, the POST script may need to consider the return value from the job.
By default, the POST script is run regardless of the job's return value.
A node not marked as failed at any point is successful. Table 2.1 summarizes the success or failure of an entire node for all possibilities. An S stands for success, an F stands for failure, and the dash character (-) identifies that there is no script.
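The rules above for combining the outcomes of a PRE script, the job, and a POST script can be sketched as a small Python function. This is only an illustration of the rules as stated in the text, not DAGMan code; the function name node_succeeds and the use of None to mean "no script" are assumptions made for the example.

```python
def node_succeeds(pre, job, post):
    """Decide node success from its parts.

    Each argument is True (success), False (failure), or None (no such
    script). `job` is ignored when the PRE script fails, because the job
    and the POST script are then never run.
    """
    if pre is False:
        # A failed PRE script fails the node outright.
        return False
    if post is None:
        # Without a POST script, the job's result decides the node.
        return job
    # With a POST script, its result decides the node, even if the job failed.
    return post
```

For instance, node_succeeds(True, False, True) returns True: the job failed, but the POST script succeeded, so the node is marked as successful.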
Two variables may be used within the DAG input file, and may ease script writing. The variables are often utilized in the arguments passed to a PRE or POST script. The variable $JOB evaluates to the (case sensitive) string defined for JobName. For use as an argument to POST scripts, the $RETURN variable evaluates to the return value of the Condor or Stork job. A job that dies due to a signal is reported with a $RETURN value representing the negative signal number. For example, SIGKILL (signal 9) is reported as -9. A job whose batch system submission fails is reported as -1001. A job that is externally removed from the batch system queue (by something other than condor_dagman) is reported as -1002.
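A POST script receiving $RETURN as an argument might interpret the value along the following lines. This Python sketch only encodes the special values listed above; the function name describe_return and the wording of the messages are hypothetical.

```python
def describe_return(value):
    """Turn a $RETURN value into a short human-readable description."""
    if value == -1001:
        return "batch system submission failed"
    if value == -1002:
        return "removed from the queue externally"
    if value < 0:
        # Any other negative value is the negated signal number.
        return "killed by signal %d" % -value
    if value == 0:
        return "succeeded"
    return "failed with exit value %d" % value


print(describe_return(-9))
```

The two reserved values are checked before the general negative case so that they are not mistaken for signal numbers.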
As an example, consider the diamond-shaped DAG example. Suppose the PRE script expands a compressed file needed as input to nodes B and C. The file name has the form JobName.gz. The DAG input file becomes
# File name: diamond.dag
#
JOB A A.condor
JOB B B.condor
JOB C C.condor
JOB D D.condor
SCRIPT PRE B pre.csh $JOB .gz
SCRIPT PRE C pre.csh $JOB .gz
PARENT A CHILD B C
PARENT B C CHILD D
The script pre.csh uses the arguments to form the file name of the compressed file:
#!/bin/csh
gunzip $argv[1]$argv[2]
The PARENT and CHILD key words specify the dependencies within the DAG. Nodes are parents and/or children within the DAG. A parent node must be completed successfully before any of its children may be started. A child node may only be started once all its parents have successfully completed.
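The rule that a child may start only after all of its parents have succeeded can be illustrated with a short Python sketch. It is not how DAGMan is implemented; the function name ready_nodes and the dictionary layout (node name mapped to a list of parent names) are assumptions chosen for the example.

```python
def ready_nodes(parents_of, succeeded, started):
    """Return the nodes that may start now, in the order they were defined.

    parents_of: dict mapping each node name to a list of its parents.
    succeeded:  set of nodes that have completed successfully.
    started:    set of nodes already started (running or finished).
    """
    return [
        node
        for node, parents in parents_of.items()
        if node not in started and all(p in succeeded for p in parents)
    ]


# The diamond-shaped DAG: A is the parent of B and C, which are parents of D.
diamond = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}
print(ready_nodes(diamond, succeeded=set(), started=set()))
print(ready_nodes(diamond, succeeded={"A"}, started={"A"}))
```

Node D stays out of the result until both B and C are in the succeeded set.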
The syntax of a dependency line within the DAG input file is:
PARENT ParentJobName... CHILD ChildJobName...
The PARENT key word is followed by one or more ParentJobNames. The CHILD key word is followed by one or more ChildJobNames. Each child job depends on every parent job within the line. A single line in the input file can specify the dependencies from one or more parents to one or more children. As an example, the line
PARENT p1 p2 CHILD c1 c2
produces four dependencies:
p1 to c1
p1 to c2
p2 to c1
p2 to c2
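The expansion of one dependency line into individual parent-to-child pairs can be sketched in Python. This is an illustration, not DAGMan's parser: the function name expand_dependencies is hypothetical, and the sketch assumes a well-formed line and that the PARENT and CHILD key words are matched without regard to case.

```python
def expand_dependencies(line):
    """Return (parent, child) pairs for one PARENT ... CHILD ... line."""
    tokens = line.split()
    upper = [t.upper() for t in tokens]
    if not upper or upper[0] != "PARENT" or "CHILD" not in upper:
        raise ValueError("not a dependency line: %r" % line)
    split_at = upper.index("CHILD")
    parents = tokens[1:split_at]
    children = tokens[split_at + 1:]
    if not parents or not children:
        raise ValueError("need at least one parent and one child")
    # Every child depends on every parent named on the line.
    return [(p, c) for p in parents for c in children]


print(expand_dependencies("PARENT p1 p2 CHILD c1 c2"))
```

For the example line, the result contains the same four pairs listed above.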
The RETRY key word provides a way to retry failed nodes. The use of retry is optional. The syntax for retry is
RETRY JobName NumberOfRetries [UNLESS-EXIT value]
where JobName identifies the node. NumberOfRetries is an integer number of times to retry the node after failure. The implied number of retries for any node is 0, the same as not having a retry line in the file. Retry is implemented on nodes, not parts of a node.
The diamond-shaped DAG example may be modified to retry node C:
# File name: diamond.dag
#
JOB A A.condor
JOB B B.condor
JOB C C.condor
JOB D D.condor
PARENT A CHILD B C
PARENT B C CHILD D
Retry C 3
If node C is marked as failed (for any reason), then it is started over as a first retry. The node will be tried a second and third time, if it continues to fail. If the node is marked as successful, then further retries do not occur.
Retry of a node may be short circuited using the optional key word UNLESS-EXIT (followed by an integer exit value). If the node exits with the specified integer exit value, then no further processing will be done on the node.
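The retry behavior described above can be modeled with a short Python loop. This is a simplified sketch: run_node is a hypothetical callable standing in for one complete attempt at the node, and the sketch assumes that the node reports a single integer exit value, with 0 meaning success.

```python
def run_with_retries(run_node, retries=0, unless_exit=None):
    """Try a node up to 1 + retries times.

    Returns (succeeded, attempts). Stops early on success, or when the
    node exits with the UNLESS-EXIT value.
    """
    attempts = 0
    for _ in range(retries + 1):
        attempts += 1
        value = run_node()
        if value == 0:
            return True, attempts
        if unless_exit is not None and value == unless_exit:
            # The UNLESS-EXIT value short-circuits any remaining retries.
            break
    return False, attempts
```

With RETRY C 3, a node that always fails is attempted four times in total: the initial attempt plus three retries.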
The VARS key word provides a method for defining a macro that can be referenced in the node's submit description file. These macros are defined on a per-node basis, using the following syntax:
VARS JobName macroname= "string" [macroname= "string"... ]
The macro may be used within the submit description file of the relevant node. A macroname consists of alphanumeric characters (a-z, A-Z, and 0-9), as well as the underscore character. The space character delimits macros, when there is more than one macro defined for a node.
Correct syntax requires that the string be enclosed in double quotes. To use a double quote inside string, escape it with the backslash character (\). To add the backslash character itself, use two backslashes (\\). The string $(JOB) may be used in string and will expand to JobName. If the VARS line appears in a DAG file used as a splice file, then $(JOB) will be the fully scoped name of the node.
Note that macro names cannot begin with the string "queue" (in any combination of upper and lower case).
If the DAG input file contains
# File name: diamond.dag
#
JOB A A.condor
JOB B B.condor
JOB C C.condor
JOB D D.condor
VARS A state="Wisconsin"
then file A.condor may use the macro state. This example submit description file for the Condor job in node A passes the value of the macro as a command-line argument to the job.
# file name: A.condor
executable = A.exe
log = A.log
error = A.err
arguments = $(state)
queue
This Condor job's command line will be
A.exe Wisconsin
The use of macros may allow a reduction in the necessary number of unique submit description files.
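The effect of a VARS macro on a submit description file line can be imitated with a few lines of Python. This is a sketch of the idea only; Condor performs the real substitution. The function name expand_macros is hypothetical, and the sketch assumes that macro names are matched exactly as written and that unknown macros are left untouched.

```python
import re


def expand_macros(text, macros):
    """Replace each $(name) in text with macros[name], if defined."""
    def replace(match):
        name = match.group(1)
        # Leave the reference as-is when no value is defined for it.
        return macros.get(name, match.group(0))
    return re.sub(r"\$\((\w+)\)", replace, text)


print(expand_macros("arguments = $(state)", {"state": "Wisconsin"}))
```

Applied to the arguments line of A.condor with state="Wisconsin", the sketch yields the line arguments = Wisconsin.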
The PRIORITY key word assigns a priority to a DAG node. The syntax for PRIORITY is
PRIORITY JobName PriorityValue
The node priority affects the order in which nodes that are ready at the same time will be submitted. Note that node priority does not override the DAG dependencies.
Node priority is mainly relevant if node submission is throttled via the -maxjobs or -maxidle command-line flags or the DAGMAN_MAX_JOBS_SUBMITTED or DAGMAN_MAX_JOBS_IDLE configuration macros. Note that PRE scripts can affect the order in which jobs run, so DAGs containing PRE scripts may not run the nodes in exact priority order, even if doing so would satisfy the DAG dependencies.
The priority value is an integer (which can be negative). A larger numerical priority is better (will be run before a smaller numerical value). The default priority is 0.
The following adds PRIORITY for node C in the diamond-shaped DAG:
# File name: diamond.dag
#
JOB A A.condor
JOB B B.condor
JOB C C.condor
JOB D D.condor
PARENT A CHILD B C
PARENT B C CHILD D
Retry C 3
PRIORITY C 1
This will cause node C to be submitted before node B (normally, node B would be submitted first).
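The ordering effect of node priorities can be sketched in Python. The function name submission_order is hypothetical, and the tie-breaking rule (nodes of equal priority keep their original relative order) is an assumption made for the illustration; as noted above, PRE scripts and throttling can change the order actually observed.

```python
def submission_order(ready, priorities):
    """Order ready nodes so that larger priority values come first.

    ready:      list of node names that are ready at the same time.
    priorities: dict of node name to integer priority (default is 0).
    """
    # sorted() is stable, so nodes with equal priority keep their order.
    return sorted(ready, key=lambda name: -priorities.get(name, 0))


print(submission_order(["B", "C"], {"C": 1}))
```

With PRIORITY C 1 and no entry for B, node C sorts ahead of node B.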
The CATEGORY key word assigns a category to a DAG node. The syntax for CATEGORY is
CATEGORY JobName CategoryName
Node categories are used for job submission throttling (see MAXJOBS below). Category names cannot contain white space.
The MAXJOBS key word limits the number of submitted jobs for a node category. The syntax for MAXJOBS is
MAXJOBS CategoryName MaxJobsValue
If the number of submitted jobs for a given category reaches the limit, no further jobs in that category will be submitted until other jobs in the category terminate. If there is no MAXJOBS entry for a given node category, the limit is set to infinity.
Note that a single invocation of condor_submit counts as one job, even if the submit file produces a multi-job cluster.
The DAGMAN_MAX_JOBS_SUBMITTED configuration macro and the condor_submit_dag -maxjobs command-line flag are still in effect if node category throttles are used.
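The category throttle can be pictured as a simple check made before each submission. The Python sketch below is an illustration only; the function name may_submit and the three dictionaries are hypothetical, and the overall -maxjobs limit, which still applies, is not modeled here.

```python
def may_submit(node, category_of, submitted, limits):
    """Return True if the category throttle allows this node to be submitted.

    category_of: dict of node name to category name (from CATEGORY lines).
    submitted:   dict of category name to number of currently submitted jobs.
    limits:      dict of category name to its MAXJOBS value.
    """
    category = category_of.get(node)
    if category is None or category not in limits:
        # No category, or no MAXJOBS entry: no category limit applies.
        return True
    return submitted.get(category, 0) < limits[category]
```

A node in a category that has reached its limit is held back until another job in the same category terminates and the count drops.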
The ABORT-DAG-ON key word provides a way to abort the entire DAG if a given node returns a specific exit code. The syntax for ABORT-DAG-ON is
ABORT-DAG-ON JobName AbortExitValue [RETURN DAGReturnValue]
If the node specified by JobName returns the specified AbortExitValue, the DAG is immediately aborted. A DAG abort differs from a node failure, in that a DAG abort causes all nodes within the DAG to be stopped immediately. This includes removing the jobs in nodes that are currently running. A node failure allows the DAG to continue running, until no more progress can be made due to dependencies.
An abort overrides node retries. If a node returns the abort exit value, the DAG is aborted, even if the node has retry specified.
When a DAG aborts, by default it exits with the node return value that caused the abort. This can be changed by using the optional RETURN key word along with specifying the desired DAGReturnValue. The DAG abort return value can be used for DAGs within DAGs, allowing an inner DAG to cause an abort of an outer DAG.
Adding ABORT-DAG-ON for node C in the diamond-shaped DAG
# File name: diamond.dag
#
JOB A A.condor
JOB B B.condor
JOB C C.condor
JOB D D.condor
PARENT A CHILD B C
PARENT B C CHILD D
Retry C 3
ABORT-DAG-ON C 10 RETURN 1
causes the DAG to be aborted, if node C exits with a return value of 10. Any other currently running nodes (only node B is a possibility for this particular example) are stopped and removed. If this abort occurs, the return value for the DAG is 1.
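The choice of the DAG's return value on an abort can be expressed as a small Python function. This is a sketch of the rule as described in the text, with a hypothetical function name; returning None to mean "no abort" is a convention of the example only.

```python
def dag_abort_return(node_exit, abort_exit, dag_return=None):
    """Return the DAG's exit value if the node triggers an abort, else None.

    node_exit:  the value the node returned.
    abort_exit: the AbortExitValue from the ABORT-DAG-ON line.
    dag_return: the optional DAGReturnValue given after RETURN.
    """
    if node_exit != abort_exit:
        return None  # no abort; normal failure and retry handling applies
    # By default the DAG exits with the node value that caused the abort.
    return node_exit if dag_return is None else dag_return


print(dag_abort_return(10, 10, dag_return=1))
```

For ABORT-DAG-ON C 10 RETURN 1, a return value of 10 from node C gives a DAG return value of 1.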
The CONFIG keyword specifies a configuration file to be used to set condor_dagman configuration options when running this DAG. The syntax for CONFIG is
CONFIG ConfigFileName
If the DAG file contains a line like this:
CONFIG dagman.config
the configuration values in the file dagman.config will be used for this DAG.
For more information about how condor_dagman configuration files work, see section 2.10.12.
Each node in a DAG may use a unique submit description file. One key limitation is that each Condor submit description file must submit jobs described by a single cluster number. At the present time DAGMan cannot deal with a submit file producing multiple job clusters.
At one time, DAGMan required that all jobs within all nodes specify the same, single log file. This is no longer the case. However, if the DAG utilizes a large number of separate log files, performance may suffer. Therefore, it is better to have fewer, or even only a single log file. Unfortunately, each Stork job currently requires a separate log file. DAGMan enforces the dependencies within a DAG using the events recorded in the log file(s) produced by job submission to Condor.
Here is a modified version of the DAG input file for the diamond-shaped DAG. The modification has each node use the same submit description file.
# File name: diamond.dag
#
JOB A diamond_job.condor
JOB B diamond_job.condor
JOB C diamond_job.condor
JOB D diamond_job.condor
PARENT A CHILD B C
PARENT B C CHILD D
Here is the single Condor submit description file for this DAG:
# File name: diamond_job.condor
#
executable = /path/diamond.exe
output = diamond.out.$(cluster)
error = diamond.err.$(cluster)
log = diamond_condor.log
universe = vanilla
notification = NEVER
queue
This example uses the same Condor submit description file for all the jobs in the DAG. This implies that each node within the DAG runs the same job. The $(cluster) macro produces unique file names for each job's output. As the Condor job within each node causes a separate job submission, each has a unique cluster number.
Notification is set to NEVER in this example. This tells Condor not to send e-mail about the completion of a job submitted to Condor. For DAGs with many nodes, this reduces or eliminates excessive numbers of e-mails.
A separate example shows an intended use of a VARS entry in the DAG input file. This use may dramatically reduce the number of Condor submit description files needed for a DAG. In the case where the submit description file for each node varies only in file naming, the use of a substitution macro within the submit description file reduces the need to a single submit description file. Note that the user log file for a job currently cannot be specified using a macro passed from the DAG.
The example uses a single submit description file in the DAG input file, and uses the VARS entry to name output files.
The relevant portion of the DAG input file appears as
JOB A theonefile.sub
JOB B theonefile.sub
JOB C theonefile.sub
VARS A outfilename="A"
VARS B outfilename="B"
VARS C outfilename="C"
The submit description file appears as
# submit description file called: theonefile.sub
executable = progX
universe = standard
output = $(outfilename)
error = error.$(outfilename)
log = progX.log
queue
For a DAG like this one with thousands of nodes, being able to write and maintain a single submit description file and a single, yet more complex, DAG input file is preferable.
A DAG is submitted using the program condor_submit_dag. See the manual page for complete details.
A simple submission has the syntax
condor_submit_dag DAGInputFileName
The diamond-shaped DAG example may be submitted with
condor_submit_dag diamond.dag
In order to guarantee recoverability, the DAGMan program itself is run as a Condor job. As such, it needs a submit description file. condor_submit_dag produces this needed submit description file, naming it by appending .condor.sub to the DAGInputFileName. This submit description file may be edited if the DAG is submitted with
condor_submit_dag -no_submit diamond.dag
causing condor_submit_dag to generate the submit description file, but not submit DAGMan to Condor. To submit the DAG, once the submit description file is edited, use
condor_submit diamond.dag.condor.sub
An optional argument to condor_submit_dag, -maxjobs, is used to specify the maximum number of batch jobs that DAGMan may submit at one time. It is commonly used when there is a limited amount of input file staging capacity. As a specific example, consider a case where each job will require 4 Mbytes of input files, and the jobs will run in a directory with a volume of 100 Mbytes of free space. Using the argument -maxjobs 25 guarantees that a maximum of 25 jobs, using a maximum of 100 Mbytes of space, will be submitted to Condor and/or Stork at one time.
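The arithmetic behind the choice of -maxjobs 25 is a simple integer division, shown here as a Python sketch. The function name max_jobs_for_space is hypothetical, and the sketch assumes that every job needs the same amount of staging space.

```python
def max_jobs_for_space(free_mbytes, mbytes_per_job):
    """Largest number of jobs whose input files fit in the free space."""
    if mbytes_per_job <= 0:
        raise ValueError("mbytes_per_job must be positive")
    # Integer division rounds down, so the space is never oversubscribed.
    return free_mbytes // mbytes_per_job


print(max_jobs_for_space(100, 4))  # 25, the value passed to -maxjobs
```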
While the -maxjobs argument is used to limit the number of batch system jobs submitted at one time, it may be desirable to limit the number of scripts running at one time. The optional -maxpre argument limits the number of PRE scripts that may be running at one time, while the optional -maxpost argument limits the number of POST scripts that may be running at one time.
An optional argument to condor_submit_dag, -maxidle, is used to limit the number of idle jobs within a given DAG. When the number of idle node jobs in the DAG reaches the specified value, condor_dagman will stop submitting jobs, even if there are ready nodes in the DAG. Once some of the idle jobs start to run, condor_dagman will resume submitting jobs. Note that this parameter only limits the number of idle jobs submitted by a given instance of condor_dagman. Idle jobs submitted by other sources (including other condor_dagman runs) are ignored.
After submission, the progress of the DAG can be monitored by looking at the log file(s), observing the e-mail that job submission to Condor causes, or by using condor_q -dag. There is a large amount of information in an extra file. The name of this extra file is produced by appending .dagman.out to DAGInputFileName; for example, if the DAG file is diamond.dag, this extra file is diamond.dag.dagman.out. If this extra file grows too large, limit its size with the MAX_DAGMAN_LOG configuration macro (see section 3.3.4).
If you have some kind of problem in your DAGMan run, please save the corresponding dagman.out file; it is the most important debugging tool for DAGMan. As of version 6.8.2, the dagman.out is appended to, rather than overwritten, with each new DAGMan run.
condor_submit_dag attempts to check the DAG input file. If a problem is detected, condor_submit_dag prints out an error message and aborts.
To remove an entire DAG, consisting of DAGMan plus any jobs submitted to Condor or Stork, remove the DAGMan job running under Condor. condor_q will list the job number. Use the job number to remove the job, for example
% condor_q
-- Submitter: turunmaa.cs.wisc.edu : <128.105.175.125:36165> : turunmaa.cs.wisc.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
9.0 smoler 10/12 11:47 0+00:01:32 R 0 8.7 condor_dagman -f -
11.0 smoler 10/12 11:48 0+00:00:00 I 0 3.6 B.out
12.0 smoler 10/12 11:48 0+00:00:00 I 0 3.6 C.out
3 jobs; 2 idle, 1 running, 0 held
% condor_rm 9.0
Before the DAGMan job stops running, it uses condor_rm and/or stork_rm to remove any jobs within the DAG that are running.
In the case where a machine is scheduled to go down, DAGMan will clean up memory and exit. However, it will leave any submitted jobs in Condor's queue.
DAGMan can help with the resubmission of uncompleted portions of a DAG, when one or more nodes result in failure. If any node in the DAG fails, the remainder of the DAG is continued until no more forward progress can be made based on the DAG's dependencies. At this point, DAGMan produces a file called a Rescue DAG.
The Rescue DAG is a DAG input file, functionally the same as the original DAG file. (Note that if multiple DAG files are specified on the condor_submit_dag command line, a single rescue DAG encompassing all of the input DAGs is generated.) The rescue DAG additionally contains an indication of successfully completed nodes by appending the DONE key word to the node's JOB or DATA lines. If the DAG is resubmitted using this Rescue DAG input file, the nodes marked as completed will not be re-executed.
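The way completed nodes are recorded in a Rescue DAG can be imitated with a short Python sketch. DAGMan writes the real file itself; the function name mark_done is hypothetical, and the sketch assumes simple, well-formed JOB and DATA lines.

```python
def mark_done(dag_lines, completed):
    """Append DONE to the JOB or DATA line of every completed node.

    dag_lines: list of lines from a DAG input file.
    completed: set of node names that finished successfully.
    """
    result = []
    for line in dag_lines:
        tokens = line.split()
        is_node_line = len(tokens) >= 3 and tokens[0].upper() in ("JOB", "DATA")
        if is_node_line and tokens[1] in completed and tokens[-1].upper() != "DONE":
            line = line.rstrip() + " DONE"
        result.append(line)
    return result


for line in mark_done(["JOB A A.condor", "JOB B B.condor"], {"A"}):
    print(line)
```

Lines for nodes that did not complete are passed through unchanged, so those nodes are executed again when the result is resubmitted.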
If the Rescue DAG file is generated before all retries of a node are completed, then the Rescue DAG file will also contain Retry entries. The number of retries will be set to the appropriate remaining number of retries.
The Rescue DAG is automatically generated by DAGMan when a node within the DAG fails or when condor_dagman itself is removed with condor_rm. The file name of the rescue DAG depends on whether "old-style" or "new-style" rescue DAG naming is used (see below). Statistics about the failed DAG execution are presented as comments at the beginning of the Rescue DAG input file.
The granularity defining success or failure in the Rescue DAG input file is the node: if a node fails, all parts of the node will be re-run, even if some of them worked the first time. For example, if a node's PRE script succeeds, but then the node job fails, when the rescue DAG is run the PRE script for that node will be re-run. The Condor job within a node may result in the submission of multiple Condor jobs under a single cluster. If one of the multiple jobs fails, the node fails. Therefore, a resubmission of the Rescue DAG will again result in the submission of the entire cluster of jobs.
Prior to version 7.1.0, the default has been that, for a DAG file my.dag, the rescue DAG created (if any) would be named my.dag.rescue. To run the rescue DAG, you would run condor_submit_dag on my.dag.rescue, and if that run fails, it would produce a rescue DAG named my.dag.rescue.rescue, and so on.
Starting with version 7.1.0, however, the default behavior is different. For a DAG file my.dag, an initial failure produces a rescue DAG named my.dag.rescue001. If you then re-run condor_submit_dag on my.dag, the rescue DAG my.dag.rescue001 will automatically be run instead of the original DAG. If that run fails, the rescue DAG written will be my.dag.rescue002. In fact, running condor_submit_dag on the original DAG file will automatically run the newest (highest numbered) rescue DAG. You can specify a rescue DAG to run with the -dorescuefrom flag; for example, if you specify -dorescuefrom 3, rescue DAG number 3 will be run, and all newer rescue DAGs will be renamed (.old will be appended to the file name, and previous .old files, if any, will be overwritten).
The maximum rescue DAG number is configured by the DAGMAN_MAX_RESCUE_NUM configuration macro.
The old behavior can be maintained by setting DAGMAN_OLD_RESCUE to true and DAGMAN_AUTO_RESCUE to false.
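The selection of the newest "new-style" rescue DAG can be sketched in Python. This only illustrates the naming rule described above and is not DAGMan's own logic; the function name newest_rescue_dag is hypothetical, and the list of existing file names is passed in rather than read from disk.

```python
import re


def newest_rescue_dag(dag_file, existing_files):
    """Return the highest numbered rescue DAG, or dag_file if none exists."""
    pattern = re.compile(re.escape(dag_file) + r"\.rescue(\d+)")
    best_number = 0
    best_name = dag_file
    for name in existing_files:
        # fullmatch ignores renamed files such as my.dag.rescue003.old.
        match = pattern.fullmatch(name)
        if match and int(match.group(1)) > best_number:
            best_number = int(match.group(1))
            best_name = name
    return best_name


files = ["my.dag", "my.dag.rescue001", "my.dag.rescue002", "my.dag.rescue003.old"]
print(newest_rescue_dag("my.dag", files))
```

In this example the sketch selects my.dag.rescue002, since the file ending in .old no longer counts as a rescue DAG.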
It can be helpful to see a picture of a DAG. DAGMan can assist you in visualizing a DAG by creating the input files used by the AT&T Research Labs graphviz package. dot is a program within this package, available from http://www.graphviz.org/, and it is used to draw pictures of DAGs.
DAGMan produces one or more dot files as the result of an extra line in a DAGMan input file. The line appears as
DOT dag.dot
This creates a file called dag.dot, which contains a specification of the DAG before any jobs within the DAG are submitted to Condor. The dag.dot file is used to create a visualization of the DAG by using this file as input to dot. This example creates a Postscript file, with a visualization of the DAG:
dot -Tps dag.dot -o dag.ps
Within the DAGMan input file, the DOT command can take several optional parameters:
DOT dag.dot DONT-OVERWRITE
causes files dag.dot.0, dag.dot.1, dag.dot.2, etc. to be created. This option is most useful combined with the UPDATE option to visualize the history of the DAG after it has finished executing.
label=.
This may be useful if further editing of the created files would be necessary, perhaps because you are automatically visualizing the DAG as it progresses.
If conflicting parameters are used in a DOT command, the last one listed is used.
The organization and dependencies of the jobs within a DAG are the keys to its utility. There are cases when a DAG is easier to visualize and construct hierarchically, in other words when a node within a DAG is also a DAG. Condor DAGMan handles this situation quite easily. (Note that DAGs can be nested to any depth.)
Since more than one DAG is being discussed, terminology is introduced to clarify which DAG is which. Reuse the example diamond-shaped DAG as given in Figure 2.3. Assume that node B of this diamond-shaped DAG will itself be a DAG. The DAG of node B is called the inner DAG, and the diamond-shaped DAG is called the outer DAG.
Work on the inner DAG first. Here is a very simple linear DAG input file used as an example of the inner DAG.
# File name: inner.dag
#
JOB X X.submit
JOB Y Y.submit
JOB Z Z.submit
PARENT X CHILD Y
PARENT Y CHILD Z
The Condor submit file corresponding to this DAG will be named inner.dag.condor.sub. (The DAGMan submit file is always named <DAG file name>.condor.sub.)
A simple example of a DAG input file for the outer DAG is
# File name: diamond.dag
#
JOB A A.submit
SUBDAG EXTERNAL B inner.dag
JOB C C.submit
JOB D D.submit
PARENT A CHILD B C
PARENT B C CHILD D
This is equivalent, but the version above is now preferred:
# File name: diamond.dag
#
JOB A A.submit
JOB B inner.dag.condor.sub
JOB C C.submit
JOB D D.submit
PARENT A CHILD B C
PARENT B C CHILD D
The outer DAG is then submitted as before, with the command
condor_submit_dag diamond.dag
In Condor 7.1.4 and later, when you run condor_submit_dag on the outer DAG file, condor_submit_dag -no_submit -update_submit is automatically run on the inner DAG file before the outer DAG is actually run. (If you want to disable this feature, you can do so by passing the -no_recurse command-line flag to condor_submit_dag.)
The following command-line flags are passed to the lower-level condor_submit_dag:
The following command-line flags are preserved in existing lower-level DAG submit files (if any exist):
Note that the -force option will cause existing DAG submit files to be overwritten without preserving any existing values.
Because of the automatic recursion in condor_submit_dag, normally you only need to run condor_submit_dag on your outermost DAG. But you can manually run condor_submit_dag on an inner DAG or DAGs to set -maxjobs or other values. For instance, using the example in the previous section, you could do the following:
condor_submit_dag -no_submit -maxjobs 1 inner.dag
condor_submit_dag diamond.dag
This would set maxjobs to 1 for the inner DAG, and then run the entire workflow.
When using nested DAGs, it is strongly recommended that you use "new-style" rescue DAGs (this is the default). Using "new-style" rescue DAGs will automatically run the proper rescue DAG(s) if there is a failure in your workflow. For example, if one of the nodes in inner.dag fails, this will produce a rescue DAG for inner.dag (inner.dag.rescue.001, etc.). Then, since inner.dag failed, node B of diamond.dag will fail, producing a rescue DAG for diamond.dag (diamond.dag.rescue.001, etc.). If you re-run condor_submit_dag diamond.dag the most recent outer rescue DAG will be run, and this will re-run the inner DAG, which will actually run the most recent inner rescue DAG. If you use "old-style" rescue DAGs, you would have to either rename the inner rescue DAG or run it manually.
Remember that, unless you use the DIR keyword in your outer DAG, the inner DAG will be submitted from the directory in which you run the outer DAG. Therefore, all paths in the inner DAG file (to submit files, etc.) must be specified accordingly.
A single use of condor_submit_dag may execute multiple, independent DAGs. Each independent DAG has its own DAG input file. These DAG input files are command-line arguments to condor_submit_dag (see the condor_submit_dag manual page).
Internally, all of the independent DAGs are combined into a single, larger DAG, with no dependencies between the original independent DAGs. As a result, any generated rescue DAG file represents all of the input DAGs as a single DAG. The file name of this rescue DAG is based on the DAG input file listed first within the command-line arguments to condor_submit_dag. Unlike a single-DAG rescue DAG file, which is named <whatever>.dag.rescue or <whatever>.dag.rescueNNN, the multi-DAG rescue file is named <whatever>.dag_multi.rescue or <whatever>.dag_multi.rescueNNN. Other files such as dagman.out and the lock file also have names based on this first DAG input file.
The success or failure of the independent DAGs is well defined. When multiple, independent DAGs are submitted with a single command, the success of the composite DAG is defined as the logical AND of the success of each independent DAG. This implies that failure is defined as the logical OR of the failure of any of the independent DAGs.
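The AND/OR rule above can be pictured with a few lines of Python. This is only an illustrative sketch: the function names and the dictionary of per-DAG results are hypothetical and are not part of DAGMan.

```python
def composite_succeeded(dag_results):
    """dag_results maps a DAG input file name to True if that DAG succeeded.

    The composite DAG succeeds only if every independent DAG succeeded
    (logical AND of the individual successes).
    """
    return all(dag_results.values())


def composite_failed(dag_results):
    """The composite DAG fails if any independent DAG failed (logical OR)."""
    return any(not ok for ok in dag_results.values())


results = {"one.dag": True, "two.dag": False}
print(composite_succeeded(results))  # False
print(composite_failed(results))     # True
```

The two functions are always complementary, which is the implication stated in the text.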
By default, DAGMan internally renames the nodes to avoid node name collisions. If all node names are unique, the renaming of nodes may be disabled by setting the configuration variable DAGMAN_MUNGE_NODE_NAMES to False (see section 3.3.25).
A weakness in scalability exists when submitting a DAG within a DAG. Each executing independent DAG requires its own invocation of condor_dagman to be running. The scaling issue presents itself when the same semantic DAG is reused hundreds or thousands of times in a larger DAG. Further, there may be many rescue DAGs created if a problem occurs. To alleviate these concerns, the DAGMan language introduces the concept of graph splicing.
A splice is a named instance of a subgraph which is specified in a separate DAG file. The splice is treated as a whole entity during dependency specification in the including DAG. The same DAG file may be reused as differently named splices, each one incorporating a copy of the dependency graph (and nodes therein) into the including DAG. Any splice in an including DAG may have dependencies between the sets of initial and final nodes. A splice may be incorporated into an including DAG without any dependencies; it is considered a disjoint DAG within the including DAG. The nodes within a splice are scoped according to a hierarchy of names associated with the splices, as the splices are parsed from the top level DAG file. The scoping character to describe the inclusion hierarchy of nodes into the top level DAG is '+'. This character is chosen due to a restriction in the allowable characters which may be in a file name across the variety of ports that Condor supports. In any DAG file, all splices must have unique names, but the same splice name may be reused in different DAG files.
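As an illustration of the scoping rule, the following Python sketch builds a scoped node name by joining the chain of splice names and the node name with '+'. The helper name and the assumption that names are joined outermost splice first are illustrative; the exact strings DAGMan produces should be checked against its own output.

```python
SCOPE_CHAR = "+"  # scoping character named in the text


def scoped_node_name(splice_path, node_name):
    """Join the enclosing splice names (outermost first) and the node name.

    splice_path is a list such as ["S3", "X1"]; an empty list means the
    node belongs to the top level DAG and keeps its plain name.
    """
    return SCOPE_CHAR.join(list(splice_path) + [node_name])


print(scoped_node_name(["DIAMOND"], "A"))   # DIAMOND+A
print(scoped_node_name(["S3", "X1"], "A"))  # S3+X1+A
```

Because splice names are unique within any one DAG file, two nodes can only receive the same scoped name if they are the same node, which is why the same DAG file can be spliced in more than once.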
Condor neither detects nor supports splices that form a cycle within the DAG. A DAGMan job that causes a cyclic inclusion of splices will eventually exhaust available memory and crash.
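Since a cyclic inclusion is not detected, it may be worth checking DAG files before submission. The following Python sketch is one possible pre-submission check and is not a DAGMan feature. It assumes SPLICE lines have the form "SPLICE <splice name> <DAG file name>", optionally followed by further tokens, and it works on file contents passed in as strings; both function names are hypothetical.

```python
def spliced_files(dag_text):
    """Return the DAG file names referenced by SPLICE lines in dag_text."""
    files = []
    for line in dag_text.splitlines():
        tokens = line.split()
        # Assumed form: SPLICE <splice name> <DAG file name> [...]
        if len(tokens) >= 3 and tokens[0].upper() == "SPLICE":
            files.append(tokens[2])
    return files


def find_splice_cycle(dag_texts, top_level):
    """Depth-first search over splice inclusions.

    dag_texts maps a DAG file name to its contents. Returns the chain of
    file names that loops back on itself, or None if there is no cycle.
    """
    def visit(name, chain):
        if name in chain:
            return chain[chain.index(name):] + [name]
        for child in spliced_files(dag_texts.get(name, "")):
            cycle = visit(child, chain + [name])
            if cycle:
                return cycle
        return None

    return visit(top_level, [])


dags = {
    "a.dag": "JOB A submit.condor\nSPLICE S b.dag\n",
    "b.dag": "JOB B submit.condor\nSPLICE T a.dag\n",
}
print(find_splice_cycle(dags, "a.dag"))  # ['a.dag', 'b.dag', 'a.dag']
```

Reusing one DAG file in several splices is not reported, because only a file that appears twice along a single inclusion chain counts as a cycle.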
The following series of examples illustrates potential uses of splicing. To simplify the examples, presume that each and every job uses the same, simple Condor submit description file:
# BEGIN SUBMIT FILE submit.condor
executable = /bin/echo
arguments = OK
universe = vanilla
output = $(jobname).out
error = $(jobname).err
log = submit.log
notification = NEVER
queue
# END SUBMIT FILE submit.condor
This first simple example splices a diamond-shaped DAG in between the two nodes of a top level DAG. Here is the DAG input file for the diamond-shaped DAG:
# BEGIN DAG FILE diamond.dag
JOB A submit.condor
VARS A jobname="$(JOB)"
JOB B submit.condor
VARS B jobname="$(JOB)"
JOB C submit.condor
VARS C jobname="$(JOB)"
JOB D submit.condor
VARS D jobname="$(JOB)"
PARENT A CHILD B C
PARENT B C CHILD D
# END DAG FILE diamond.dag
The top level DAG incorporates the diamond-shaped splice:
# BEGIN DAG FILE toplevel.dag
JOB X submit.condor
VARS X jobname="$(JOB)"
JOB Y submit.condor
VARS Y jobname="$(JOB)"
# This is an instance of diamond.dag, given the symbolic name DIAMOND
SPLICE DIAMOND diamond.dag
# Set up a relationship between the nodes in this dag and the splice
PARENT X CHILD DIAMOND
PARENT DIAMOND CHILD Y
# END DAG FILE toplevel.dag
Figure 2.4 illustrates the resulting top level DAG and the dependencies produced. Notice the naming of nodes scoped with the splice name. This hierarchy of splice names assures unique names associated with all nodes.
Figure 2.5 illustrates the starting point for a more complex example. The DAG input file X.dag describes this X-shaped DAG. The completed example displays more of the spatial constructs provided by splices. Pay particular attention to the notion that each named splice creates a new graph, even when the same DAG input file is specified.
# BEGIN DAG FILE X.dag
JOB A submit.condor
VARS A jobname="$(JOB)"
JOB B submit.condor
VARS B jobname="$(JOB)"
JOB C submit.condor
VARS C jobname="$(JOB)"
JOB D submit.condor
VARS D jobname="$(JOB)"
JOB E submit.condor
VARS E jobname="$(JOB)"
JOB F submit.condor
VARS F jobname="$(JOB)"
JOB G submit.condor
VARS G jobname="$(JOB)"
# Make an X-shaped dependency graph
PARENT A B C CHILD D
PARENT D CHILD E F G
# END DAG FILE X.dag
File s1.dag continues the example, presenting the DAG input file that incorporates two separate splices of the X-shaped DAG. Figure 2.6 illustrates the resulting DAG.
# BEGIN DAG FILE s1.dag
JOB A submit.condor
VARS A jobname="$(JOB)"
JOB B submit.condor
VARS B jobname="$(JOB)"
# name two individual splices of the X-shaped DAG
SPLICE X1 X.dag
SPLICE X2 X.dag
# Define dependencies
# A must complete before the initial nodes in X1 can start
PARENT A CHILD X1
# All final nodes in X1 must finish before the initial nodes in X2 can begin
PARENT X1 CHILD X2
# All final nodes in X2 must finish before B may begin.
PARENT X2 CHILD B
# END DAG FILE s1.dag
The top level DAG in the hierarchy of this complex example is described by the DAG input file toplevel.dag. Figure 2.7 illustrates the final DAG. Notice that the DAG has two disjoint graphs in it as a result of splice S3 not having any dependencies associated with it in this top level DAG.
# BEGIN DAG FILE toplevel.dag
JOB A submit.condor
VARS A jobname="$(JOB)"
JOB B submit.condor
VARS B jobname="$(JOB)"
JOB C submit.condor
VARS C jobname="$(JOB)"
JOB D submit.condor
VARS D jobname="$(JOB)"
# a diamond-shaped DAG
PARENT A CHILD B C
PARENT B C CHILD D
# This splice of the X-shaped DAG can only run after
# the diamond dag finishes
SPLICE S2 X.dag
PARENT D CHILD S2
# Since there are no dependencies for S3,
# the following splice is disjoint
SPLICE S3 s1.dag
# END DAG FILE toplevel.dag
The DIR option specifies a working directory for a splice, from which the splice will be parsed and the jobs it contains submitted. The directory given by a splice's DIR specification is propagated as a prefix to all nodes in the splice and in any included splices. If a node already has a DIR specification, the splice's DIR specification is prepended to the node's, separated by a directory separator character. Jobs in included splices with an absolute path for their DIR specification keep that specification untouched. Note that a DAG containing DIR specifications cannot be run in conjunction with the -usedagdir command-line argument to condor_submit_dag. A rescue DAG generated by a DAG run with the -usedagdir argument will contain DIR specifications, so the rescue DAG must be run without the -usedagdir argument.
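The prefixing rule for DIR can be sketched in Python as follows. This is a minimal illustration under the assumption of POSIX-style paths; the function name and the example directory names are hypothetical, and DAGMan's own implementation may differ in detail.

```python
import posixpath


def effective_dir(splice_dir, node_dir=None):
    """Combine a splice's DIR with a node's own DIR, if the node has one.

    - A node without its own DIR takes the splice's DIR.
    - A node with a relative DIR gets the splice's DIR as a prefix.
    - A node with an absolute DIR is left untouched.
    """
    if node_dir is None:
        return splice_dir
    if posixpath.isabs(node_dir):
        return node_dir
    return posixpath.join(splice_dir, node_dir)


print(effective_dir("run1"))                    # run1
print(effective_dir("run1", "stage"))           # run1/stage
print(effective_dir("run1", "/scratch/stage"))  # /scratch/stage
```

For nested splices the same rule would be applied once per level, from the innermost splice outward.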
By default, condor_dagman assumes that all relative paths in a DAG input file and the associated Condor submit description files are relative to the current working directory when condor_submit_dag is run. Note that relative paths in submit description files can be modified by the submit command initialdir; see the condor_submit manual page for more details. The rest of this discussion ignores initialdir.
In most cases, treating path names as relative to the current working directory is the desired behavior. However, when running multiple DAGs with a single condor_dagman, with each DAG in its own directory, this will cause problems. In this case, use the -usedagdir command-line argument to condor_submit_dag (see the condor_submit_dag manual page for more details). This tells condor_dagman to run each DAG as if condor_submit_dag had been run in the directory in which the relevant DAG file exists.
For example, assume that a directory called parent contains two subdirectories called dag1 and dag2, and that dag1 contains the DAG input file one.dag and dag2 contains the DAG input file two.dag. Further, assume that each DAG is set up to be run from its own directory with the following command:
cd dag1; condor_submit_dag one.dag
This will correctly run one.dag.
The goal is to run the two, independent DAGs located within dag1 and dag2 while the current working directory is parent. To do so, run the following command:
condor_submit_dag -usedagdir dag1/one.dag dag2/two.dag
Of course, if all paths in the DAG input file(s) and the relevant submit description files are absolute, the -usedagdir argument is not needed; however, using absolute paths is NOT generally a good idea.
If you do not use -usedagdir, relative paths can still work for multiple DAGs, if all file paths are given relative to the current working directory as condor_submit_dag is executed. However, this means that, if the DAGs are in separate directories, they cannot be submitted from their own directories, only from the parent directory the paths are set up for.
Note that if you use the -usedagdir argument, and your run results in a rescue DAG, the rescue DAG file will be written to the current working directory, and should be run from that directory. The rescue DAG includes all the path information necessary to run each node job in the proper directory.
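The effect of -usedagdir on relative paths, as described in this section, can be pictured with the following Python sketch. It is an illustration only: the function name and the /home/user/parent directory are hypothetical, POSIX-style paths are assumed, and initialdir is ignored as in the discussion above.

```python
import posixpath


def resolve_path(path, submit_cwd, dag_file, use_dag_dir):
    """Return the absolute location a path in a DAG input file refers to.

    submit_cwd  -- directory in which condor_submit_dag is run
    dag_file    -- DAG input file as given on the command line
    use_dag_dir -- True if -usedagdir was given
    """
    if posixpath.isabs(path):
        return path  # absolute paths are unaffected
    if use_dag_dir:
        # Behave as if condor_submit_dag had been run in the DAG's directory.
        base = posixpath.join(submit_cwd, posixpath.dirname(dag_file))
    else:
        base = submit_cwd
    return posixpath.normpath(posixpath.join(base, path))


cwd = "/home/user/parent"
print(resolve_path("A.submit", cwd, "dag1/one.dag", use_dag_dir=True))
# /home/user/parent/dag1/A.submit
print(resolve_path("A.submit", cwd, "dag1/one.dag", use_dag_dir=False))
# /home/user/parent/A.submit
```

Without -usedagdir, a relative path written for dag1 is looked up in parent, which is why DAGs set up to run from their own directories fail when submitted together from the parent directory.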
Configuration macros for condor_dagman can be specified in several ways:
In the above list, configuration values specified later in the list override ones specified earlier (e.g., a value specified on the condor_submit_dag command line overrides corresponding values in any configuration file; a value specified in a DAGMan-specific configuration file overrides values specified in a general Condor configuration file).
Non-condor_dagman, non-daemoncore configuration macros in a condor_dagman-specific configuration file are ignored.
Only a single configuration file can be specified for a given condor_dagman run. For example, if one file is specified in a DAG, and a different file is specified on the condor_submit_dag command line, this is a fatal error at submit time. The same is true if different configuration files are specified in multiple DAG files referenced in a single condor_submit_dag command.
If multiple DAGs are run in a single condor_dagman run, the configuration options specified in the condor_dagman configuration file, if any, apply to all DAGs, even if some of the DAGs specify no configuration file.
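The override order mentioned above (a command-line value overrides a DAGMan-specific configuration file, which in turn overrides a general Condor configuration file) behaves like a layered lookup. The Python sketch below models only those three layers; the dictionaries and their values are made up for illustration, and this is not how condor_dagman is implemented.

```python
from collections import ChainMap

# Hypothetical settings from each source, lowest precedence first.
general_config = {"DAGMAN_MUNGE_NODE_NAMES": "True"}
dagman_specific_config = {"DAGMAN_MUNGE_NODE_NAMES": "False"}
command_line = {}  # nothing given on the condor_submit_dag command line

# ChainMap searches its maps from left to right, so the highest-precedence
# source is listed first.
effective = ChainMap(command_line, dagman_specific_config, general_config)

print(effective["DAGMAN_MUNGE_NODE_NAMES"])  # False
```

A value present in a higher-precedence layer hides the same name in every lower layer, while names set only in a lower layer remain visible.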
Configuration variables relating to DAGMan may be found in section 3.3.25.