JobRouter allows you to specify a policy for routing vanilla universe jobs to one or more grid sites, through any of the grid protocols supported by Condor (e.g. gt2, gt4, Condor-C). The idea is to do as little scheduling in advance as possible and to only feed jobs to the sites as they consume them. Meanwhile, the jobs waiting to be routed are ordinary vanilla universe jobs, so they may run in the local Condor pool or in other pools via flocking. Except for having your excess jobs queue up in the vanilla universe job queue, you can get a similar effect by submitting all of your jobs as grid universe jobs and using Condor-G matchmaking. However, the router adds some additional convenience features: tracking of aggregate job states for use in routing policy, MaxIdleJobs, and blackhole throttling.
JobRouter is most appropriate for high-throughput workflows, where you have many more jobs than computers and you just want to keep as many of the computers busy as possible. It is less suitable when you have a small number of jobs and need a scheduler to choose the best place to run them so that they finish as quickly as possible. The JobRouter doesn't know which site will run your jobs faster, but it can decide whether to send more jobs to a site based on whether jobs already submitted to that site are sitting idle and whether you have experienced a lot of recent job failures at that site.
What is meant by job routing is this: a vanilla universe job is transformed into a grid universe job by making a copy of the job ClassAd and modifying some attributes of the job as described by a configurable policy. The copy of the ClassAd shows up in the job queue as a new job id. It is referred to as the routed copy of the job.
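The transformation described above can be modeled in a few lines of Python. This is only an illustrative sketch, not the JobRouter's implementation: job ClassAds are represented as plain dictionaries, the function name make_routed_copy and the cluster ids are hypothetical, and the handling of set_ attributes follows the convention used in the example routing policy later in this document.

```python
import copy

VANILLA_UNIVERSE = 5  # numeric JobUniverse code for vanilla jobs
GRID_UNIVERSE = 9     # numeric JobUniverse code for grid jobs


def make_routed_copy(job_ad, route, new_cluster_id):
    """Return a routed copy of job_ad; the original ad is left untouched."""
    routed = copy.deepcopy(job_ad)
    routed["ClusterId"] = new_cluster_id      # the copy appears as a new job id
    routed["JobUniverse"] = GRID_UNIVERSE     # vanilla job becomes a grid job
    routed["GridResource"] = route["GridResource"]
    # Route attributes named set_<Attr> overwrite <Attr> in the routed copy.
    for key, value in route.items():
        if key.startswith("set_"):
            routed[key[len("set_"):]] = value
    return routed


# Hypothetical job and route, for illustration only.
original = {"ClusterId": 12, "JobUniverse": VANILLA_UNIVERSE, "WantJobRouter": True}
route = {
    "name": "Site 1",
    "GridResource": "gt2 site1.edu/jobmanager-condor",
    "set_WantJobRouter": False,
}
routed = make_routed_copy(original, route, new_cluster_id=13)
print(routed["JobUniverse"], routed["GridResource"], routed["WantJobRouter"])
```

The original dictionary is unchanged after the call, which mirrors the fact that the original job stays in the queue alongside its routed copy.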
Until the routed job finishes or is removed, the original copy of the job passively mirrors the state of the routed job. During this time, it is not available for matchmaking as a normal vanilla universe job, because it is tied to the routed job. It also does not evaluate periodic expressions, such as PeriodicHold. Only the periodic expressions of the routed job are evaluated. When the routed job completes, the original job ClassAd is updated so that it reflects the final status of the job. If the routed job is removed, the original job returns to the normal idle state and may be available for matchmaking or rerouting as usual. If instead the original job is removed or goes on hold, the routed job is removed.
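The rules in the previous paragraph amount to a small state machine. The following Python sketch restates them under simplifying assumptions: jobs are dictionaries, the event names and the tied_to_routed_copy key are hypothetical, and only the JobStatus codes (1 idle, 2 running, 3 removed, 4 completed, 5 held) are Condor's own.

```python
IDLE, RUNNING, REMOVED, COMPLETED, HELD = 1, 2, 3, 4, 5  # Condor JobStatus codes


def handle_event(original, routed, event):
    """Update an (original, routed) pair of jobs in response to one event."""
    if event == "routed_updated":
        # The original passively mirrors the state of the routed copy.
        original["JobStatus"] = routed["JobStatus"]
        if routed["JobStatus"] == COMPLETED:
            original["tied_to_routed_copy"] = False
    elif event == "routed_removed":
        # The original returns to idle and may be matched or rerouted again.
        routed["JobStatus"] = REMOVED
        original["JobStatus"] = IDLE
        original["tied_to_routed_copy"] = False
    elif event == "original_removed_or_held":
        # Removing or holding the original removes the routed copy.
        routed["JobStatus"] = REMOVED
        original["tied_to_routed_copy"] = False
    else:
        raise ValueError(f"unknown event: {event}")


original = {"JobStatus": IDLE, "tied_to_routed_copy": True}
routed = {"JobStatus": IDLE}

routed["JobStatus"] = RUNNING
handle_event(original, routed, "routed_updated")
status_while_running = original["JobStatus"]

handle_event(original, routed, "routed_removed")
status_after_removal = original["JobStatus"]
print(status_while_running, status_after_removal)
```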
Not all jobs are suitable for routing. The following section gives a more specific example of how job routing can be used and what types of jobs are suitable.
Suppose you submit jobs to a Condor pool and you would like to send excess jobs to other available sites, such as resources on the Open Science Grid. Here is how you could use JobRouter to make this work:
Jobs must enable file transfer and list their input and output files explicitly:
should_transfer_files = yes
when_to_transfer_output = ON_EXIT
transfer_input_files = input1,input2
transfer_output_files = output1,output2
Note that unlike in the vanilla universe, if your job is transformed into a globus job and you have not explicitly listed output files, files produced in the working directory of your job will not be automatically transferred back when the job completes. Only files you explicitly list will be returned.
Set an attribute that the routing policy can use to identify jobs that are candidates for routing:
+WantJobRouter = True
You could make this expression fancier. For example, suppose you want jobs to first be rejected by your local Condor matchmaker before being candidates for routing to the grid:
+WantJobRouter = LastRejMatchTime =!= UNDEFINED
Supply a grid proxy in the submit file:
x509userproxy = /tmp/x509up_u275
This is not necessary if JobRouter is configured to add a grid proxy to your job for you.
Submit the job as usual:
$ condor_submit job1.sub
where job1.sub might look like this:
universe = vanilla
executable = my_executable
output = job1.stdout
error = job1.stderr
log = job1.ulog
should_transfer_files = true
when_to_transfer_output = on_exit
+WantJobRouter = LastRejMatchTime =!= UNDEFINED
x509userproxy = /tmp/x509up_u275
queue
To see the full job queue, use condor_q as usual. To see a more specialized view of the routed jobs, use condor_router_q. Example:
$ condor_router_q -S
   JOBS ST Route      GridResource
     40  I Site1      site1.edu/jobmanager-condor
     10  I Site2      site2.edu/jobmanager-pbs
      2  R Site3      condor submit.site3.edu condor.site3.edu
To see the history of routed jobs, use condor_router_history. Example:
$ condor_router_history
Routed job history from 2007-06-27 23:38 to 2007-06-28 23:38
Site            Hours    Jobs    Runs
                      Completed Aborted
-------------------------------------------------------
Site1              10       2     0
Site2               8       2     1
Site3              40       6     0
-------------------------------------------------------
TOTAL              58      10     1
This is a specific example of how you could configure JobRouter to send jobs to grid sites. A general discussion of configuration options will follow.
This example sets up three routes for jobs. One is a Condor site accessed via the Globus gt2 protocol. Another is a PBS site, also accessed via Globus gt2. The third is a Condor site accessed by schedd-to-schedd job submission, a.k.a. Condor-C. The JobRouter doesn't know which site would be best for a given job, but, as specified in the following policy, it will stop sending more jobs to a site if ten jobs that have already been sent there are idle.
These configuration settings should be made in the local config file of the submit machine. If you have not already successfully submitted grid jobs from this machine, it is a good idea to get that working before you attempt to use JobRouter. Typically, the only thing you need to add (supposing you are using GSI authentication for the grid) is an X509 trusted certification authority directory in a place recognized by Condor (e.g. /etc/grid-security/certificates). The VDT (http://vdt.cs.wisc.edu) provides a convenient way to set up and install trusted CAs if you are using one of the common CAs in your grid.
# These settings become the default settings for all routes
JOB_ROUTER_DEFAULTS = \
[ \
requirements=target.WantJobRouter is True; \
MaxIdleJobs = 10; \
MaxJobs = 200; \
\
/* now modify routed job attributes */ \
/* remove routed job if it goes on hold or stays idle for over 6 hours */ \
set_PeriodicRemove = JobStatus == 5 || \
(JobStatus == 1 && (CurrentTime - QDate) > 3600*6); \
set_WantJobRouter = false; \
set_requirements = true; \
]
# This could be made an attribute of the job, rather than being hard-coded
ROUTED_JOB_MAX_TIME = 1440
# Now we define each of the routes to send jobs on
JOB_ROUTER_ENTRIES = \
[ GridResource = "gt2 site1.edu/jobmanager-condor"; \
name = "Site 1"; \
] \
[ GridResource = "gt2 site2.edu/jobmanager-pbs"; \
name = "Site 2"; \
set_GlobusRSL = "(maxwalltime=$(ROUTED_JOB_MAX_TIME))(jobType=single)"; \
] \
[ GridResource = "condor submit.site3.edu condor.site3.edu"; \
name = "Site 3"; \
set_remote_jobuniverse = 5; \
]
# Reminder: you must restart Condor for changes to DAEMON_LIST to take effect.
DAEMON_LIST = $(DAEMON_LIST) JOB_ROUTER
# For testing, set this to a small value to speed things up.
# Once you are running at large scale, set it to a higher value
# to prevent the JobRouter from using too much cpu.
JOB_ROUTER_POLLING_PERIOD = 10
#It is good to save lots of schedd queue history
#for use with the router_history command.
MAX_HISTORY_ROTATIONS = 20
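To make the effect of MaxIdleJobs and MaxJobs in the policy above concrete, here is a simplified Python model of the throttling decision. It is a sketch of the idea only, not the JobRouter's actual algorithm: the function jobs_to_send, the dictionary layout, and the sample job counts are all hypothetical.

```python
IDLE, RUNNING = 1, 2  # Condor JobStatus codes


def jobs_to_send(route, routed_jobs, candidates):
    """Return how many of `candidates` waiting jobs may be sent to `route` now."""
    on_route = [job for job in routed_jobs if job["route"] == route["name"]]
    idle = sum(1 for job in on_route if job["JobStatus"] == IDLE)
    room_total = route["MaxJobs"] - len(on_route)  # limit on all routed jobs
    room_idle = route["MaxIdleJobs"] - idle        # limit on idle routed jobs
    return max(0, min(candidates, room_total, room_idle))


site1 = {"name": "Site 1", "MaxIdleJobs": 10, "MaxJobs": 200}

# 4 idle and 50 running jobs already routed to Site 1.
routed_jobs = (
    [{"route": "Site 1", "JobStatus": IDLE}] * 4
    + [{"route": "Site 1", "JobStatus": RUNNING}] * 50
)
print(jobs_to_send(site1, routed_jobs, candidates=100))  # 6

# Once ten routed jobs sit idle at the site, nothing more is sent.
saturated = [{"route": "Site 1", "JobStatus": IDLE}] * 10
print(jobs_to_send(site1, saturated, candidates=100))  # 0
```

In this model a site that consumes jobs quickly keeps its idle count low and therefore keeps receiving work, while a site that lets jobs sit idle is automatically throttled.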
Some questions you may have after reading the above policy: Can the routing table be dynamically generated from grid information systems? Do users have to have their own grid credentials or can JobRouter insert service credentials for them? What's up with the syntax of the routing table: C-style comments, strange ClassAd expressions, escaped end of lines? The next section covers the specifics of JobRouter configuration. Read on!
JobRouter is configured with a routing table, which is a list of ClassAds describing each site where jobs may be sent. The ClassAd syntax is slightly different from much of the rest of Condor, because it uses New ClassAds, a re-implementation of ClassAds that Condor is gradually adopting and which may one day completely replace the old implementation. A good place to learn about the syntax of New ClassAds is the Informal Language Description in the C++ ClassAds tutorial: http://www.cs.wisc.edu/condor/classad/c++tut.html. For the most part, everything in the old ClassAds language is supported by New ClassAds, with the exception of a number of ClassAd functions that have not yet been added to New ClassAds. So if job ClassAds make use of ClassAd functions, they cannot currently be routed.
Since JobRouter is configured with New ClassAds but operates on Old ClassAds stored in the job queue, it may be confusing at first to understand which ClassAd expressions are evaluated as New ClassAds and which are evaluated as Old ClassAds. For example, the requirements expression of each route in the routing table is evaluated by the JobRouter, so it may use New ClassAds features, whereas the PeriodicHold expression in the routed job is evaluated by condor_schedd, so it may only use features of Old ClassAds. As long as the expressions you use in the routing table are compatible with both implementations of ClassAds, you do not need to be concerned about this detail. In case you do need to use special features, the expressions that are evaluated (as New ClassAds) by the JobRouter will be identified in the reference below.
The most basic thing to know about New ClassAd syntax is simply that
each ClassAd is surrounded by square brackets, and each assignment
statement in the ClassAd should end with a semicolon. When the
ClassAd is embedded in a Condor configuration file, it could all
appear on a single line, but the readability of the ClassAd is often
improved by inserting line continuations (i.e. backslashes followed by
newlines) after each assignment statement in the ClassAd, as in the
examples above. Unfortunately, this makes it a little awkward to
insert configuration comments in the ClassAd, because of the way line
continuations and the Condor configuration comment character `#' work.
One alternative is to use C-style comments /* ... */ as in the
examples above. Another option is to read in the JobRouter entries
from a separate file, rather than embedding them in the Condor
configuration file.
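If you keep the routing table in a separate file, you can generate it with a short program instead of writing the brackets and semicolons by hand. The Python sketch below is one way to do that under stated assumptions: the helper names format_value and format_route are hypothetical, and only literal values (strings, numbers, booleans) are handled, so attributes whose values are ClassAd expressions would need to be written out differently.

```python
def format_value(value):
    """Render a Python literal as a New ClassAd literal."""
    if isinstance(value, bool):  # test bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def format_route(route):
    """Render one route as a bracketed ClassAd with ';' after each assignment."""
    body = " ".join(f"{name} = {format_value(value)};" for name, value in route.items())
    return f"[ {body} ]"


routes = [
    {"GridResource": "gt2 site1.edu/jobmanager-condor", "name": "Site 1"},
    {"GridResource": "gt2 site2.edu/jobmanager-pbs", "name": "Site 2", "MaxIdleJobs": 10},
]
table = "\n".join(format_route(route) for route in routes)
print(table)
```

Because each ClassAd is written on its own line in its own file, no line continuations are needed, and the comment problem described above does not arise.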
The JobRouter configuration parameters are listed below:
You may modify the routing table and issue condor_reconfig to have JobRouter rebuild the routing table. If you do so and you change the name of a route, then the count of currently routed jobs on that route will not be accurate until the existing jobs finish. Another option if you want dynamic routing is to read the routing table from an external source via JOB_ROUTER_ENTRIES_FILE or JOB_ROUTER_ENTRIES_CMD.
The meaning of each attribute in the routing entries is listed below.
Job Route ClassAd Attributes
The Open Science Grid has a service called ReSS (Resource Selection Service). It presents grid sites as ClassAds in a Condor collector. This example builds a routing table from the site ClassAds in the ReSS collector.
Using JOB_ROUTER_ENTRIES_CMD, we tell JobRouter to call a
simple script which queries the collector and outputs a routing table.
The script, called osg_ress_routing_table.sh, is just this:
#!/bin/sh
# you _MUST_ change this:
export condor_status=/path/to/condor_status
# if no command line arguments specify -pool, use this:
export _CONDOR_COLLECTOR_HOST=osg-ress-1.fnal.gov
$condor_status -format '[ ' BeginAd \
-format 'GridResource = "gt2 %s"; ' GlueCEInfoContactString \
-format ']\n' EndAd "$@" | uniq
Save this script to a file and make sure the permissions on the file mark it as executable. Test this script by calling it by hand before trying to use it with JobRouter. You may supply additional arguments such as -constraint to limit the sites which are returned.
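When testing the script by hand, it can help to check its output mechanically. The following Python sketch validates text in the format the script prints (one bracketed GridResource entry per line) and collects the distinct contact strings. The function name check_routing_table and the sample host names are hypothetical. Note that uniq only collapses adjacent duplicate lines, so the check below deduplicates across the whole output.

```python
import re

# One line of script output looks like: [ GridResource = "gt2 <contact>"; ]
ENTRY = re.compile(r'^\[\s*GridResource\s*=\s*"gt2 ([^"]+)";\s*\]$')


def check_routing_table(text):
    """Return (distinct contact strings in order, list of malformed lines)."""
    contacts, malformed = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = ENTRY.match(line)
        if match is None:
            malformed.append(line)
        elif match.group(1) not in contacts:
            contacts.append(match.group(1))
    return contacts, malformed


# Hypothetical sample output, including a non-adjacent duplicate and a bad line.
sample = """\
[ GridResource = "gt2 ce1.example.edu/jobmanager-condor"; ]
[ GridResource = "gt2 ce2.example.edu/jobmanager-pbs"; ]
[ GridResource = "gt2 ce1.example.edu/jobmanager-condor"; ]
[ ]
"""
contacts, malformed = check_routing_table(sample)
print(contacts)
print(malformed)
```

To check real output, you could save the script's output to a file and pass the file's contents to check_routing_table.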
Once you are satisfied that the routing table constructed by the script is what you want, configure JobRouter to use it:
# command to build the routing table
JOB_ROUTER_ENTRIES_CMD = /path/to/osg_ress_routing_table.sh <extra arguments>
# how often to rebuild the routing table:
JOB_ROUTER_ENTRIES_REFRESH = 3600
Starting from the previous example JobRouter configuration, you may
simply use the above settings to replace JOB_ROUTER_ENTRIES. (Or you may
leave JOB_ROUTER_ENTRIES there and have a routing table
containing entries from both sources.) When you restart or reconfigure
JobRouter, you should see messages in JobRouterLog indicating that it
is adding more routes to the table.