
This is a report on how USCMS is using Condor-G matchmaking to dynamically assign computing jobs to grid sites. The specific recipe described here should be easy to apply in other cases.
The USCMS computing grid is composed of a single tier 1 site called the master site, located at Fermi Lab. The master site submits computing jobs to a growing number of other computing sites, using Condor's Globus Universe (i.e. Condor-G).
Each computing job may contain multiple steps, but at the very least, they have a step to stage-in files, run the job, stage-out files, and clean up, releasing resources allocated on the remote site. These steps are described by a DAG, which is managed by Condor DAGMan, running at the master site. Each node in the DAG submits a job to the remote site. Of course, all of the steps of a job have to run at the same site, so matching jobs to sites must happen at the whole DAG level rather than at the individual job level.
Before the addition of matchmaking, the production manager had to choose which remote site to use for each job. In order to make this manageable, batches of jobs were lumped together into "group DAGs". Execution of these group DAGs was then manually scheduled as resources became available.
We wanted to preserve several good features of this system. One was the "just-in-time-scheduling" approach. Having a low latency between the time a job is scheduled to run at a site and the time it actually begins execution makes it easier to adapt to changing conditions and reduces the need for accurate prediction.
Another feature of the manually-scheduled system that we wanted to keep was the grouping of jobs into batches. Although automated scheduling could make decisions on a job-by-job basis, we wanted to at least preserve the option of operating on larger units of jobs. This is mostly an issue of scalability. Currently, DAGMan runs only at the master site, so we wanted to be able to have many running jobs per instance of DAGMan.
One final feature of the manual system that we wanted to preserve was the ability to back out of a scheduling decision. If a job gets matched to a site and for some reason this choice has to be modified, we wanted it to be a simple matter of stopping the job and rescheduling it. Ideally, this would happen under the control of an automated policy that specifies when to reschedule.
Matchmaking takes place between ClassAds describing sites and ClassAds describing jobs. A ClassAd is basically a set of attributes, and an expression of requirements, and an expression of preferences.
In the case of Condor-G (6.5.x and higher), matchmaking happens
when the GlobusScheduler attribute in the job's
submit file is not directly specified. Instead of assigning it
to a hard-coded gatekeeper URL, it is defined in the following
way:
GlobusScheduler = $$(GatekeeperURL)
The $$() ClassAd syntax inserts information from the
target ClassAd when a match is made, so the above example says to
use the value of GatekeeperURL from the site ClassAd when submitting
the job to Globus.
In the case of USCMS, however, the mechanism is a little bit more complex, because we want to bind an entire DAG to a remote site, such that the DAG itself runs at the submission site and the nodes in the DAG run at the remote site. We would potentially be interested in also being able to run the DAG remotely, but having it run at the master site was our first goal, since this required the least changes to the existing architecture.
Binding DAG nodes to a remote site through Condor-G matchmaking turned out to require only one additional step over the simple matchmaking scheme outlined above.
The DAG job is submitted with the following lines in the Condor
submit file:
GlobusScheduler = $$(MatchmakerURL)
Environment = $$(SiteEnvironment)
Each site ClassAd then defines an attribute
MatchmakerURL, which points at the master
gatekeeper, rather than the remote site's gatekeeper. When the DAG
job is matched to a site, it therefore runs at the master site but it
has environment variables that come from the remote site's ClassAd.
These variables are used to direct the individual jobs, represented by
the DAG nodes, to the chosen remote site.
Data from the environment is injected into the ClassAd of
individual jobs via the $ENV() submit file syntax. For
example:
GlobusScheduler = $ENV(SiteGatekeeperURL)
Remote_Initialdir = $ENV(SiteWorkingArea)
Environment = SiteVDTLocation=$ENV(SiteVDTLocation);SiteExportDir=$ENV(SiteExportDir)
This means that we can write generic DAGs and DAG nodes. None of the information about the remote site is written in the DAG or the submit files for DAG nodes. Instead, all site-specific information is passed from the site ClassAd into the environment of DAGMan, and from there it is referenced by the job submit files, which submit jobs to the remote site.
Unlike matchmaking within a Condor pool, there is currently no
automatic system for publishing and updating ClassAds describing Grid
sites. Instead, the ClassAd for each site must be periodically
written into a file and published to the Condor Collector at the
master site via condor_advertise.
One could potentially obtain information dynamically at the remote site about resource availability and publish this in the site's ClassAd. However, we were only really interested in keeping the job queues for each site just slightly over-stuffed, so that the worker nodes would stay busy. The simplest way to achieve this requires just two numbers, one describing the maximum desired depth of the queue and another indicating the current depth.
Technically, there could be other jobs running at a remote site, so a measure of current queue depth would involve an actual query of the remote system. However, we decided that in the first version, it would be good enough to simply know how many jobs had been submitted to the remote site from the USCMS master, which is information readily available right at the master site itself. (This also side-steps a potential race condition where jobs that have already been matched to a remote site have not yet showed up there in the remote queue when the queue is queried.)
A simple script was written to scan the submission queue for DAG
jobs matched to sites. (Here, "scanning the queue" means parsing the
output from .) This script, the
matchmaker monitor, is then run periodically to publish new ClassAds
for each site.
A typical site ClassAd looks something like this:
MyType = "Machine"
TargetType = "Job"
Name = "uw-hep-USCMS-Site"
StartdIpAddr = "<127.0.0.1:1234>"
UpdateSequenceNumber = 1057960761
Rank = 0.000000
CurrentRank = 0.000000
WantAdRevaluate = True
CurMatches = 0
MaxMatches = 2
Requirements = (CurMatches < MaxMatches) && (TARGET.WANT_USCMS_MATCHMAKER =?= TRUE)
Is_USCMS_Site = True
MatchmakerURL = "ramen.fnal.gov:/jobmanager-condor"
SiteEnvironment = "SiteGatekeeperURL=dill.hep.wisc.edu:/jobmanager-condor;SiteVDTLocation=/dpe;SiteWorkingArea=/scratch/work;SiteExportDir=/scratch/export"
MOP_SITE="uw-hep"
Each line from the above ClassAd is described below:
MyType = "Machine"
TargetType = "Job"
This informs the Condor Negotiator that the ClassAd is a resource
provider rather than a resource consumer. Although the site is not a
single machine, this syntax is simply a carry-over from matchmaking
at an individual machine level.
Name = "uw-hep-USCMS-Site"
Each site ClassAd must have a unique name.
StartdIpAddr = "<127.0.0.1:1234>"
This attribute is meaningless in the context of the Grid. It is
just there to keep the negotiator happy in the current version of Condor.
UpdateSequenceNumber = 1057960761
Each time a new ClassAd is published, it must have a higher
sequence number than the previous one or it will be ignored by the
negotiator. We used the UNIX timestamp value.
Rank = 0.000000
CurrentRank = 0.000000
In general, Rank may be an expression specifying a
site's preference to run some types of jobs over others. The
CurrentRank attribute is meaningless in the context of
the Globus Universe.
CurMatches = 0
MaxMatches = 2
Requirements = (CurMatches < MaxMatches) && ...
WantAdRevaluate = True
CurMatches and MaxMatches are the number
of jobs in the site's queue and the maximum depth of the queue
respectively. CurMatches is determined by simply
scanning the master submission queue. MaxMatches is
obtained from a site configuration file. This file could be updated
manually or by querying MDS; whatever the mechanism, it is
independent of how the matchmaking works.
WantAdRevaluate causes CurMatches to be automatically
incremented (by the negotiator) each time a match is made. There is
no automatic way to decrement this value as jobs finish; it is up to
the USCMS-provided matchmaker monitor to set CurMatches
correctly when it periodically updates the ClassAd (every 2 minutes or
so).
Requirements = ... (TARGET.WANT_USCMS_MATCHMAKER =?= TRUE)
Is_USCMS_Site = True
In order to prevent the possibility of jobs not meant for the USCMS
grid getting matched to a site ClassAd, the above pieces were added to
the site ClassAd. In the DAG submit file, we then add the following:
+WANT_USCMS_MATCHMAKER = True
Requirements = (Is_USCMS_Site =?= True)
This ensures that only site ClassAds will be accepted by the job
and only jobs that were made to match with a site are accepted by the
site ClassAd. If you never had any other type of job or machine ad
published to the collector, these extra requirements would not be
necessary.
MatchmakerURL = "ramen.fnal.gov:/jobmanager-condor"
SiteEnvironment = ...
These final pieces of the ClassAd are required because we
reference them in the DAG submit file with $$(), as
described earlier. The MatchmakerURL is the master gatekeeper
where we run DAGMan jobs that have been matched to a remote site.
The SiteEnvironment setting is where site-specific
information is passed to the DAG job.
Specifically, how are site ClassAds published?
condor_advertise UPDATE_STARTD_AD file_containing_classad
The reference to startd here is a hold-over from matchmaking
in a Condor cluster. We are not really publishing a ClassAd for
condor_startd, but we advertise it that way because we
want the Condor collector to treat the site ClassAd as a computing resource.
In the case of USCMS, condor_advertise is called every
few minutes for each site by the matchmaker monitor.
How are CurMatches and MaxMatches maintained?
The output of condor_q -l contains the full ClassAd
for each submitted job. Every two minutes when the matchmaker monitor
script scans this list, it adds up the number of jobs which have been
matched to each site.
When a job gets matched, extra values are inserted into the job
ClassAd, containing information from the site ClassAd needed to
perform $$() substitution. These extra ClassAd
attributes are of the form MATCH_XXX where
XXX is the name of the attribute from the site ClassAd.
In our case, we look for MATCH_SiteEnvironment, and
inside there is an environment variable assigned to the name of the
site. This allows us to increment the value of
CurMatches for the site that was matched to each job.
Currently, there is no mechanism for specifying the size of a DAG
that groups together a number of jobs. Since we tend to use a
standard group size (like 50 jobs per DAG), we simply divide the
maximum number of jobs that should be simultaneously submitted to a
site by the standard group size to obtain a value for
MaxMatches, which is then inserted into the site ClassAd.
So the bottom line is this: MaxMatches is relatively
static information about the overall size of a given site. It does
not need to be frequently modified as a result of changing conditions
on the remote site.
CurMatches, on the other hand, is dynamic information
maintained automatically by the Condor Negotiator and USCMS's
matchmaker monitor. The negotiator increments CurMatches
when a match is made and then the value will eventually fall when the
job finishes, because the monitor periodically updates the value with
a count of how many matches still exist.
How are DAG jobs submitted for matchmaking?
Normally, one submits DAGs via condor_submit_dag. If
you use the -no_submit option, this will produce a condor
submit file suitable for running DAGMan in the Scheduler Universe.
The USCMS submission software (MOP) simply modifies this submit file
slightly in order to submit the DAG job to the Globus Universe.
The original submit file looks like this:
universe = scheduler
executable = /condor/bin/condor_dagman
getenv = True
output = group.dag.lib.out
error = group.dag.lib.out
log = group.dag.dagman.log
remove_kill_sig = SIGUSR1
arguments = -f -l . -Debug 3 -Lockfile group.dag.lock -Dag group.dag -Rescue group.dag.rescue -Condorlog /dpe/.../group.log
environment = _CONDOR_DAGMAN_LOG=group.dag.dagman.out;_CONDOR_MAX_DAGMAN_LOG=0
notification = never
queue
The modified submit file looks like this:
universe = globus
+WANT_USCMS_MATCHMAKER = True
Requirements = Is_USCMS_Site =?= True
Rank = -CurMatches/MaxMatches
environment = $$(SiteEnvironment);PATH=$ENV(PATH);USER=$ENV(USER);_CONDOR_DAGMAN_LOG=/dpe/.../group.dag.dagman.out;_CONDOR_MAX_DAGMAN_LOG=0
executable = /condor/bin/condor_dagman
transfer_executable = false
arguments = -f -l . -Debug 3 -Lockfile group.dag.lock -Dag group.dag -Rescue group.dag.rescue -Condorlog /dpe/.../group.log
remote_initialdir = /dpe/...
GlobusScheduler = $$(MatchmakerURL)
output = group.dag.lib.out
error = group.dag.lib.out
log = group.dag.dagman.log
notification = never
GlobusRSL = (condor_submit=(universe scheduler)(remove_kill_sig SIGUSR1))
queue
The above submit file assumes that MatchmakerURL
refers to a Condor jobmanager. This is not absolutely necessary.
DAGMan can even run under the fork jobmanager, but running it in
Condor's Scheduler Universe provides extra reliability.
There are also some assumptions about how the environment should be set up. For example, the above code works when the user who is submitting the jobs to the matchmaker is the same user that the jobs are mapped to when they run. It would not be hard to adapt to a different case.
This work was made possible by the financial support of the
National Science Foundation and the University of Wisconsin.