Condor-C allows jobs in one machine's job queue to be moved to another machine's job queue. These machines may be far removed from each other, providing powerful grid computation mechanisms, while requiring only Condor software and its configuration.
Condor-C is highly resistant to network disconnections and machine failures on both the submission and remote sides. An expected usage sets up Personal Condor on a laptop, submits some jobs that are sent to a Condor pool, waits until the jobs are staged on the pool, then turns off the laptop. When the laptop reconnects at a later time, any results can be pulled back.
Condor-C scales gracefully when compared with Condor's flocking mechanism. The machine upon which jobs are submitted maintains a single process and network connection to a remote machine, without regard to the number of jobs queued or running.
Configuration of a machine from which jobs are submitted requires two extra configuration variables:
CONDOR_GAHP=$(SBIN)/condor_c-gahp
C_GAHP_LOG=/tmp/CGAHPLog.$(USERNAME)
The acronym GAHP stands for Grid ASCII Helper Protocol. A GAHP server provides grid-related services for a variety of underlying middleware systems. The configuration variable CONDOR_GAHP gives a full path to the GAHP server utilized by Condor-C. The configuration variable C_GAHP_LOG defines the location of the log that the Condor GAHP server writes. The log for the Condor GAHP is written as the user on whose behalf it is running; thus, like GRIDMANAGER_LOG, the C_GAHP_LOG configuration variable must point to a location the end user can write to.
A submit machine must also have a condor_collector daemon to which the condor_schedd daemon can submit a query. The query is for the location (IP address and port) of the intended remote machine's condor_schedd daemon. This facilitates communication between the two machines. This condor_collector does not need to be the same collector that the local condor_schedd daemon reports to. The remote_pool submit file command is used to specify the condor_collector daemon to contact for the location.
The machine upon which jobs are executed must also be configured correctly. This machine must be running a condor_schedd daemon. Unless specified explicitly in a submit file, CONDOR_HOST must point to a condor_collector daemon that this machine can write to and that the machine upon which jobs are submitted can read from. This facilitates communication between the two machines.
An important aspect of configuration is the security configuration relating to authentication. Condor-C on the remote machine relies on an authentication protocol to know the identity of the user under which to run a job. The following is a working example of the security configuration for authentication. This authentication method, CLAIMTOBE, trusts the identity claimed by a host or IP address.
SEC_DEFAULT_NEGOTIATION = OPTIONAL
SEC_DEFAULT_AUTHENTICATION_METHODS = CLAIMTOBE
The following represents a minimal submit description file for a job.
# minimal submit description file for a Condor-C job
universe = grid
grid_type = condor
executable = myjob
output = myoutput
error = myerror
log = mylog
remote_schedd = joe@remotemachine.example.com
remote_pool = remotecentralmanager.example.com
+remote_jobuniverse = 5
+remote_requirements = True
+remote_ShouldTransferFiles = "YES"
+remote_WhenToTransferOutput = "ON_EXIT"
queue
The remote machine needs to understand the attributes of the job. These are specified in the submit description file using the '+' syntax, followed by the string remote_. At a minimum, this will be the job's universe and the job's requirements. It is likely that other attributes specific to the job's universe (on the remote pool) will also be necessary. Note that attributes set with '+' are inserted directly into the job's ClassAd. Specify attributes as they must appear in the job's ClassAd, not the submit description file. For example, the universe is specified using an integer assigned for a job ClassAd JobUniverse. Similarly, place quotation marks around string expressions. As an example, a submit description file would ordinarily contain
when_to_transfer_output = ON_EXIT
This must appear in the Condor-C job submit description file as
+remote_WhenToTransferOutput = "ON_EXIT"
For convenience, the specific entries of universe, remote_schedd, remote_pool, globus_rsl, and globus_scheduler can be specified as remote_ commands using the standard submit file syntax and without the leading '+'. So instead of
+remote_universe = 5
you can say
remote_universe = vanilla
Similarly, instead of
+remote_remote_schedd = "schedd.example.com"
you can write
remote_remote_schedd = schedd.example.com
This behavior only works for the specific entries universe, remote_schedd, remote_pool, globus_rsl, and globus_scheduler. For all other entries, you will need to use the '+' syntax, remembering to place strings in quotation marks.
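The two syntaxes can be illustrated with a small Python sketch. The helper remote_line is hypothetical and not part of Condor; it only formats text for a submit description file, under the assumption that a Python string stands for a ClassAd string, and that booleans and numbers are written as ClassAd literals. More general ClassAd expressions are outside its scope.

```python
# Hypothetical helper, not part of Condor: it only formats one line of a
# submit description file according to the rules described in the text.

# Entries that may be given as plain remote_ commands, without '+'.
PLAIN_REMOTE_ENTRIES = {"universe", "remote_schedd", "remote_pool",
                        "globus_rsl", "globus_scheduler"}


def remote_line(name, value):
    """Return a submit-file line that sets `name` for the remote job."""
    if name.lower() in PLAIN_REMOTE_ENTRIES:
        # Standard submit syntax: no leading '+', no quotation marks.
        return "remote_%s = %s" % (name, value)
    # '+' syntax: the value must look as it does in the job ClassAd.
    if isinstance(value, bool):          # check bool before int
        rendered = "True" if value else "False"
    elif isinstance(value, (int, float)):
        rendered = str(value)
    else:
        # Strings are quoted; embedded double quotes are escaped.
        rendered = '"%s"' % str(value).replace('"', '\\"')
    return "+remote_%s = %s" % (name, rendered)


print(remote_line("universe", "vanilla"))
print(remote_line("WhenToTransferOutput", "ON_EXIT"))
```

The first call uses the convenience syntax, while the second falls back to the '+' syntax and adds the quotation marks that the ClassAd language requires for strings.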
For this particular example, the job is to be run as a vanilla universe job at the remote pool. The (remote pool's) condor_schedd daemon is likely to place its job queue data on a local disk and execute the job on another machine within the pool of machines. This implies that the file systems for the resulting submit machine (the machine specified by remote_schedd) and the execute machine (the machine that runs the job) will not be shared. Thus, the two inserted ClassAd attributes
+remote_ShouldTransferFiles = "YES"
+remote_WhenToTransferOutput = "ON_EXIT"
are used to invoke Condor's file transfer mechanism.
As Condor-C is a recent addition to Condor, the universes, associated integer assignments, and notes about the existence of functionality are given in Table 5.1. The note "untested" implies that submissions under the given universe have not yet been thoroughly tested. They may already work.
For communication between condor_schedd daemons on the submit and remote machines, the location of the remote condor_schedd daemon is needed. This information resides in the condor_collector of the remote machine's pool. The remote_pool command in the submit description file says which condor_collector should be queried for the remote condor_schedd daemon's location. An example of this submit command is
remote_pool = machine1.example.com
If the remote condor_collector is not listening on the standard port (9618), then the port it is listening on needs to be specified, like so:
remote_pool = machine1.example.com:12345
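As a rough illustration of how a remote_pool value can be read, the following Python sketch splits the value into host and port and falls back to the standard port when none is given. The function name split_remote_pool is an assumption made for this example; it is not how Condor itself parses the command, and it does not handle IPv6 literals.

```python
# Illustrative only: split_remote_pool is a hypothetical helper.
DEFAULT_COLLECTOR_PORT = 9618  # standard condor_collector port, per the text


def split_remote_pool(value):
    """Split 'host' or 'host:port' into a (host, port) pair."""
    host, sep, port = value.strip().partition(":")
    if not host:
        raise ValueError("remote_pool needs a host name")
    # Without an explicit ':port' part, assume the standard port.
    return host, int(port) if sep else DEFAULT_COLLECTOR_PORT


print(split_remote_pool("machine1.example.com"))
print(split_remote_pool("machine1.example.com:12345"))
```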
File transfer of a job's executable, stdin, stdout, and stderr is automatic. When other files need to be transferred using Condor's file transfer mechanism (see section 2.5.4), the mechanism is applied based on the resulting job universe on the remote machine.
Condor-G is the name given to Condor when grid universe jobs are sent to grid resources utilizing Globus software for job execution. The Globus Toolkit provides a framework for building grid systems and applications. See the Globus Alliance web page at http://www.globus.org for descriptions and details of the Globus software.
Condor provides the same job management capabilities for Condor-G jobs as for other jobs. From Condor, a user may effectively submit jobs, manage jobs, and have jobs execute on widely distributed machines.
It may appear that Condor-G is a simple replacement for the Globus Toolkit's globusrun command. However, Condor-G does much more. It allows the submission of many jobs at once, along with the monitoring of those jobs with a convenient interface. There is notification when jobs complete or fail and maintenance of Globus credentials that may expire while a job is running. On top of this, Condor-G is a fault-tolerant system; if a machine crashes, all of these functions are again available as the machine returns.
Condor (and Globus) utilize the following protocols and terminology. The protocols allow Condor to interact with grid machines toward the end result of executing jobs.
Figure 5.1 shows how Condor interacts with Globus software towards running jobs. The diagram is specific to the gt2 grid_type. Condor contains a GASS server, used to transfer the executable, stdin, stdout, and stderr to and from the remote job execution site. Condor uses the GRAM protocol to contact the remote gatekeeper and request that a new jobmanager be started. The GRAM protocol is also used when monitoring the job's progress. Condor detects and intelligently handles cases such as a crash of the remote resource.
There are now three different versions of the GRAM protocol. Condor supports all three. Condor's grid universe uses the grid_type command within a submit description file to distinguish among them.
Condor-G supports submitting jobs to remote resources running the Globus Toolkit versions 1 and 2, also called the pre-web services GRAM. These Condor-G jobs are submitted the same as any other Condor job. The universe is grid, and the pre-web services GRAM protocol is specified by setting the submit command grid_type to gt2. Older submit description files specifying a globus universe job will default to this.
Under Condor, successful job submission to the grid universe with gt2 requires credentials. An X.509 certificate is used to create a proxy, and an account, authorization, or allocation to use a grid resource is required. For more information on proxies and certificates, please consult the Alliance PKI pages at
http://archive.ncsa.uiuc.edu/SCD/Alliance/GridSecurity/
Before submitting a job to Condor under the grid universe, use grid-proxy-init to create a proxy.
Here is a simple submit description file. The example specifies a gt2 job to be run on an NCSA machine.
executable = test
globusscheduler = modi4.ncsa.uiuc.edu/jobmanager
universe = grid
grid_type = gt2
output = test.out
log = test.log
queue
The executable for this example is transferred from the local machine to the remote machine. By default, Condor transfers the executable, as well as any files specified by an input command. Note that the executable must be compiled for its intended platform.
The command globusscheduler is a required command for Condor-G jobs. It specifies the scheduling software to be used on the remote resource. There is a specific jobmanager for each type of batch system supported by Globus. The full syntax for this command line appears as
globusscheduler = machinename[:port]/jobmanagername[:X.509 distinguished name]
The portions of this syntax specification enclosed within square brackets ([ and ]) are optional. On a machine where the jobmanager is listening on a nonstandard port, include the port number. The jobmanagername is one of five strings:
jobmanager
jobmanager-condor
jobmanager-pbs
jobmanager-lsf
jobmanager-sge
The Globus software running on the remote resource uses this string to identify and select the correct service to perform. Other jobmanagername strings may be used, where additional services are defined and implemented.
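The syntax above can be assembled mechanically. The Python sketch below is an illustration only: the function gt2_globusscheduler and its parameter names are assumptions made for this example, and the port number used is made up. It does not validate the jobmanager name, since other names may be defined at a site.

```python
# Illustrative only: gt2_globusscheduler is a hypothetical helper that
# follows machinename[:port]/jobmanagername[:X.509 distinguished name].
def gt2_globusscheduler(machine, jobmanager="jobmanager",
                        port=None, subject_dn=None):
    """Return a globusscheduler line for a gt2 submit description file."""
    value = machine
    if port is not None:            # optional: nonstandard port
        value += ":%d" % port
    value += "/" + jobmanager
    if subject_dn is not None:      # optional: X.509 distinguished name
        value += ":" + subject_dn
    return "globusscheduler = " + value


print(gt2_globusscheduler("modi4.ncsa.uiuc.edu"))
print(gt2_globusscheduler("example.cs.wisc.edu", "jobmanager-pbs", port=12345))
```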
No input file is specified for this example job. Any output (file specified by an output command) or error (file specified by an error command) is transferred from the remote machine to the local machine as it is produced. This implies that these files may be incomplete in the case where the executable does not finish running on the remote resource. The ability to transfer standard output and standard error as they are produced may be disabled by adding to the submit description file:
stream_output = False
stream_error = False
As a result, standard output and standard error will be transferred only after the job completes.
The job log file is maintained on the submit machine.
Example output from condor_q for this submission looks like:
% condor_q

-- Submitter: wireless48.cs.wisc.edu : <128.105.48.148:33012> : wireless48.cs.wi
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
   7.0   smith           3/26 14:08   0+00:00:00 I  0   0.0  test

1 jobs; 1 idle, 0 running, 0 held
After a short time, the Globus resource accepts the job. Again running condor_q will now result in
% condor_q

-- Submitter: wireless48.cs.wisc.edu : <128.105.48.148:33012> : wireless48.cs.wi
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
   7.0   smith           3/26 14:08   0+00:01:15 R  0   0.0  test

1 jobs; 0 idle, 1 running, 0 held
Then, very shortly after that, the queue will be empty again, because the job has finished:
% condor_q

-- Submitter: wireless48.cs.wisc.edu : <128.105.48.148:33012> : wireless48.cs.wi
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD

0 jobs; 0 idle, 0 running, 0 held
A second example of a submit description file runs the Unix ls program on a different Globus resource.
executable = /bin/ls
transfer_executable = false
globusscheduler = vulture.cs.wisc.edu/jobmanager
universe = grid
grid_type = gt2
output = ls-test.out
log = ls-test.log
queue
In this example, the executable (the binary) has been pre-staged. The executable is on the remote machine, and it is not to be transferred before execution. Note that the required globusscheduler and universe commands are present. The command
transfer_executable = false
within the submit description file identifies the executable as being pre-staged. In this case, the executable command gives the path to the executable on the remote machine.
A third example submits a Perl script to be run as a submitted Condor job. The Perl script both lists and sets environment variables for a job. Save the following Perl script with the name env-test.pl, to be used as a Condor job executable.
#!/usr/bin/env perl
foreach $key (sort keys(%ENV))
{
print "$key = $ENV{$key}\n"
}
exit 0;
Run the Unix command
chmod 755 env-test.pl
to make the Perl script executable.
Now create the following submit description file. Replace example.cs.wisc.edu/jobmanager with a resource you are authorized to use.
executable = env-test.pl
globusscheduler = example.cs.wisc.edu/jobmanager
universe = grid
grid_type = gt2
environment = foo=bar; zot=qux
output = env-test.out
log = env-test.log
queue
When the job has completed, the output file, env-test.out, should contain something like this:
GLOBUS_GRAM_JOB_CONTACT = https://example.cs.wisc.edu:36213/30905/1020633947/
GLOBUS_GRAM_MYJOB_CONTACT = URLx-nexus://example.cs.wisc.edu:36214
GLOBUS_LOCATION = /usr/local/globus
GLOBUS_REMOTE_IO_URL = /home/smith/.globus/.gass_cache/globus_gass_cache_1020633948
HOME = /home/smith
LANG = en_US
LOGNAME = smith
X509_USER_PROXY = /home/smith/.globus/.gass_cache/globus_gass_cache_1020633951
foo = bar
zot = qux
Of particular interest is the GLOBUS_REMOTE_IO_URL environment variable. Condor-G automatically starts up a GASS remote I/O server on the submit machine. Because of the potential for either side of the connection to fail, the URL for the server cannot be passed directly to the job. Instead, it is placed into a file, and the GLOBUS_REMOTE_IO_URL environment variable points to this file. Remote jobs can read this file and use the URL it contains to access the remote GASS server running inside Condor-G. If the location of the GASS server changes (for example, if Condor-G restarts), Condor-G will contact the Globus gatekeeper and update this file on the machine where the job is running. It is therefore important that all accesses to the remote GASS server check this file for the latest location.
The following example is a Perl script that uses the GASS server in Condor-G to copy input files to the execute machine. In this example, the remote job copies the file /etc/hosts from the submit machine and prints its contents.
#!/usr/bin/env perl
use FileHandle;
use Cwd;
STDOUT->autoflush();
$gassUrl = `cat $ENV{GLOBUS_REMOTE_IO_URL}`;
chomp $gassUrl;
$ENV{LD_LIBRARY_PATH} = $ENV{GLOBUS_LOCATION}. "/lib";
$urlCopy = $ENV{GLOBUS_LOCATION}."/bin/globus-url-copy";
# globus-url-copy needs a full pathname
$pwd = getcwd();
print "$urlCopy $gassUrl/etc/hosts file://$pwd/temporary.hosts\n\n";
`$urlCopy $gassUrl/etc/hosts file://$pwd/temporary.hosts`;
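# stop with a message if the copy did not produce the file
open(my $fh, "<", "temporary.hosts") or die "cannot open temporary.hosts: $!";
while(<$fh>) {
print $_;
}
close($fh);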
exit 0;
The submit description file used to submit the Perl script as a Condor job appears as:
executable = gass-example.pl
globusscheduler = example.cs.wisc.edu/jobmanager
universe = grid
grid_type = gt2
output = gass.out
log = gass.log
queue
There are two optional submit description file commands of note: x509userproxy and globusrsl. The x509userproxy command specifies the path to an X.509 proxy. The command is of the form:
x509userproxy = /path/to/proxy
If this optional command is not present in the submit description file, then Condor-G checks the value of the environment variable X509_USER_PROXY for the location of the proxy. If this environment variable is not present, then Condor-G looks for the proxy in the file /tmp/x509up_uXXXX, where the characters XXXX in this file name are replaced with the Unix user id.
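The lookup order for the proxy can be written down as a short Python sketch. The function find_proxy and its parameters are hypothetical and exist only to illustrate the order of the three checks; this is not Condor-G's code. One assumption is labeled in the comments: an empty X509_USER_PROXY is treated the same as a missing one.

```python
import os


# Illustrative only: find_proxy is a hypothetical helper.
def find_proxy(x509userproxy=None, environ=None, uid=None):
    """Return the proxy path, following the lookup order in the text."""
    environ = os.environ if environ is None else environ
    # 1. The x509userproxy submit command, when present.
    if x509userproxy:
        return x509userproxy
    # 2. The X509_USER_PROXY environment variable.
    #    Assumption: an empty value counts as "not present".
    if environ.get("X509_USER_PROXY"):
        return environ["X509_USER_PROXY"]
    # 3. The default file name, built from the Unix user id.
    uid = os.getuid() if uid is None else uid
    return "/tmp/x509up_u%d" % uid


print(find_proxy(None, {}, 1000))
```

Passing the environment and user id as arguments keeps the sketch easy to try with made-up values.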
The globusrsl command is used to add additional attribute settings to a job's RSL string. The format of the globusrsl command is
globusrsl = (name=value)(name=value)
Here is an example of this command from a submit description file:
globusrsl = (project=Test_Project)
This example's attribute name for the additional RSL is project, and the value assigned is Test_Project.
Condor-G supports submitting jobs to remote resources running the Globus Toolkit version 3.2. Please note that this Globus Toolkit version is not compatible with the Globus Toolkit version 3.0. See http://www-unix.globus.org/toolkit/docs/3.2/index.html for more information about the Globus Toolkit version 3.2.
For grid_type gt3 jobs, the submit description file is much the same as for grid_type gt2 jobs. The globusscheduler command is still required, but the format changes from gt2 to one that is a URL. The syntax follows the form:
globusscheduler = http://hostname[:port]/ogsa/services/base/gram/
                  XXXManagedJobFactoryService
or
globusscheduler = http://IPaddress[:port]/ogsa/services/base/gram/
                  XXXManagedJobFactoryService
This value is placed on two lines for formatting purposes, but is all on a single line within a submit description file. The portion of this syntax specification enclosed within square brackets ([ and ]) is optional. The substring XXX within the last part of the value is replaced by one of five strings that (as for gt2) identifies and selects the correct service to perform. The five strings that replace XXX are
Fork
Condor
PBS
LSF
SGE
An example, given on two lines (again, for formatting reasons) is
globusscheduler = http://198.51.254.40:8080/ogsa/services/base/gram/
                  ForkManagedJobFactoryService
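Because the value must appear on a single line in a real submit description file, it can help to generate it. The following Python sketch does so; the function gt3_globusscheduler is hypothetical and simply fills in the template described above.

```python
# Illustrative only: gt3_globusscheduler is a hypothetical helper.
GT3_SERVICES = ("Fork", "Condor", "PBS", "LSF", "SGE")


def gt3_globusscheduler(host, service="Fork", port=None):
    """Return the single-line globusscheduler command for a gt3 job."""
    if service not in GT3_SERVICES:
        raise ValueError("service must be one of %s" % ", ".join(GT3_SERVICES))
    # host may be a host name or an IP address; the port is optional.
    netloc = host if port is None else "%s:%d" % (host, port)
    return ("globusscheduler = http://%s/ogsa/services/base/gram/"
            "%sManagedJobFactoryService" % (netloc, service))


print(gt3_globusscheduler("198.51.254.40", "Fork", 8080))
```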
On the machine where the job is submitted, there is no requirement for any Globus Toolkit 3.2 components. Condor itself installs all necessary framework within the directory $(LIB)/lib/gt3. The machine where the job is submitted is required to have Java 1.4 or a higher version installed. The configuration variable JAVA must identify the location of the installation. See section 3.3 for the complete description of the configuration variable JAVA.
Condor-G supports submitting jobs to remote resources running the Globus Toolkit version 4.0. Please note that this Globus Toolkit version is not compatible with the Globus Toolkit version 3.0 or 3.2. See http://www-unix.globus.org/toolkit/docs/4.0/index.html for more information about the Globus Toolkit version 4.0.
For grid_type gt4 jobs, the submit description file is much the same as for grid_type gt2 or gt3 jobs. The globusscheduler command is still required, and is given in the form of a URL; the syntax follows the form:
globusscheduler = https://hostname[:port]/wsrf/services/ManagedJobFactoryService
or
globusscheduler = https://IPaddress[:port]/wsrf/services/ManagedJobFactoryService
The portion of this syntax specification enclosed within square brackets ([ and ]) is optional.
A new submit command called jobmanager_type distinguishes the correct service to perform. The value of jobmanager_type is one of five strings:
Fork Condor PBS LSF SGE
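For gt4, the service is selected by a separate command rather than by a part of the URL. The Python sketch below shows the two resulting submit commands side by side; the function gt4_commands is hypothetical, and the host name and port in the usage line are made up for illustration.

```python
# Illustrative only: gt4_commands is a hypothetical helper.
GT4_JOBMANAGER_TYPES = ("Fork", "Condor", "PBS", "LSF", "SGE")


def gt4_commands(host, jobmanager_type="Fork", port=None):
    """Return the two submit commands that select a gt4 service."""
    if jobmanager_type not in GT4_JOBMANAGER_TYPES:
        raise ValueError("unknown jobmanager_type: %s" % jobmanager_type)
    netloc = host if port is None else "%s:%d" % (host, port)
    return [
        "globusscheduler = https://%s/wsrf/services/"
        "ManagedJobFactoryService" % netloc,
        "jobmanager_type = %s" % jobmanager_type,
    ]


for line in gt4_commands("example.cs.wisc.edu", "PBS", 8443):
    print(line)
```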
File transfer occurs as expected for a Condor job (for the executable, input, and output). However, the underlying transfer mechanism requires access to a GridFTP server from the machine where the job is submitted.
On this machine, there is no requirement for any Globus Toolkit 4.0 components. Condor itself installs all necessary framework within the directory $(LIB)/lib/gt4. The machine where the job is submitted is also required to have Java 1.4.2 or a higher version installed. The configuration variable JAVA must identify the location of the installation. See section 3.3 for the complete description of the configuration variable JAVA.
The delimiting of arguments passed to a Condor-G job varies based on the grid_type of the job. For the gt2 and gt3 grid_types, there are two languages involved, leading to two sets of parsing rules that must work together. gt4 grid_type jobs are less complex with respect to the delimiting of arguments, as Condor encapsulates one set of parsing rules, thereby isolating the user from needing to understand or use them.
For all Condor-G jobs, the arguments to a job are kept in the job ClassAd attribute Args. This attribute is a string, and therefore enclosed within double quote marks. Condor uses space characters to delimit the listed arguments. Here is an arguments command from a submit description file with spaces to delimit the arguments.
arguments = 13 argument2 argument3
The Args ClassAd attribute becomes
Args = "13 argument2 argument3"
All further parsing of the arguments uses the Args attribute as a starting point. A query upon this attribute, such as to give the arguments, results in the 3 arguments
argv[1] = 13
argv[2] = argument2
argv[3] = argument3
Since the double quote mark character (") marks the beginning and end of a string (in the ClassAd language), an escaped double quote mark (\") is utilized to have a double quote mark within the string. For example, the submit description file arguments command
arguments = 13 argument2 \"string3\"
gives the ClassAd attribute
Args = "13 argument2 \"string3\""
Again, all further parsing of the arguments uses the Args attribute as a starting point. A query upon this attribute, such as to give the arguments, results in
argv[1] = 13
argv[2] = argument2
argv[3] = "string3"
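The step from the Args attribute to the list of arguments can be sketched in Python. The function args_to_argv is hypothetical; it assumes that the only escape of interest is the escaped double quote mark, and it treats any run of whitespace as one delimiter. It is not Condor's parser.

```python
# Illustrative only: args_to_argv is a hypothetical helper.
def args_to_argv(classad_literal):
    """Turn the quoted value of Args into a list of arguments."""
    if (len(classad_literal) < 2
            or classad_literal[0] != '"' or classad_literal[-1] != '"'):
        raise ValueError("Args must be a double-quoted ClassAd string")
    # Drop the enclosing quote marks, then undo the \" escapes.
    value = classad_literal[1:-1].replace('\\"', '"')
    # Spaces delimit the arguments.
    return value.split()


print(args_to_argv('"13 argument2 argument3"'))
print(args_to_argv('"13 argument2 \\"string3\\""'))
```

In the second call the double quote marks survive as part of the third argument, which matches the result shown for the escaped example.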
For the gt2 and gt3 grid_types, the jobmanager on the remote resource must receive information about job arguments in RSL (Resource Specification Language). This language has its own way of delimiting arguments. Therefore, the arguments command in the submit description file (and the associated ClassAd attribute) must take both languages into account.
Delimiters in RSL are spaces, the single quote mark, and the double quote mark. In addition, the characters +, &, %, (, and ) have special meaning in RSL, so they must be enclosed within quote marks to be included in an argument. Placing a space character into an argument is likewise accomplished by delimiting with one of the quote marks. As an example, the submit description file command
arguments = '%s' 'argument with spaces' '+%d'
results in the Condor-G job receiving the arguments
argv[1] = %s
argv[2] = argument with spaces
argv[3] = +%d
Should the arguments themselves contain the single quote character, an argument may be delimited with a double quote mark. Note that because the ClassAd attribute Args represents the information, the double quote marks must be escaped in the submit description file command. The submit description file command
arguments = \"don't\" \"mess with\" \"quoting rules\"
results in the RSL arguments
argv[1] = don't
argv[2] = mess with
argv[3] = quoting rules
And, if the job arguments have both single and double quotes, the appearance of a quote character twice in a row is converted (in RSL) to a single instance of the character and the literal continues until the next solo quote character. The submit description file command
arguments = 'don''t yell \"No!\"' '+%s'
results in the RSL arguments
argv[1] = don't yell "No!"
argv[2] = +%s
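The quoting rules described for gt2 and gt3 can be tried out with a small Python sketch. It is a simplified model of the rules as stated in this section, not the Globus RSL parser: the function split_rsl_arguments is hypothetical, and its input is assumed to be the argument string after the ClassAd escapes (\") have already been turned into plain double quote marks.

```python
# Illustrative only: a simplified model of the quoting rules in the text.
def split_rsl_arguments(text):
    """Split an argument string using RSL-style quote delimiting."""
    args, current = [], []
    in_arg = False   # True once the current argument has started
    quote = None     # the quote character that opened the literal, if any
    i = 0
    while i < len(text):
        ch = text[i]
        if quote is not None:
            if ch == quote:
                if i + 1 < len(text) and text[i + 1] == quote:
                    # A doubled quote stands for one literal quote.
                    current.append(quote)
                    i += 2
                    continue
                quote = None          # a solo quote ends the literal
            else:
                current.append(ch)
        elif ch in ("'", '"'):
            quote = ch
            in_arg = True
        elif ch == " ":
            if in_arg:                # a space outside quotes ends it
                args.append("".join(current))
                current, in_arg = [], False
        else:
            current.append(ch)
            in_arg = True
        i += 1
    if quote is not None:
        raise ValueError("unterminated quote")
    if in_arg:
        args.append("".join(current))
    return args


print(split_rsl_arguments("'%s' 'argument with spaces' '+%d'"))
print(split_rsl_arguments("'don''t yell \"No!\"' '+%s'"))
```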
For gt4 grid_type jobs, follow Condor's ClassAd language rules for delimiting arguments. Spaces delimit arguments, and the double quote mark character must be escaped to be included in an argument. Condor itself will modify the arguments to be expressed correctly in RSL. Note that the space character cannot be a part of an argument.
Difficulties with proxy expiration occur in two cases. The first case is long-running jobs, which do not complete before the proxy expires. The second case occurs when great numbers of jobs are submitted. Some of the jobs may not yet be started or not yet completed before the proxy expires. One proposed solution to these difficulties is to generate longer-lived proxies. This, however, presents a greater security problem. Remember that a GSI proxy is sent to the remote Globus resource. If a proxy falls into the hands of a malicious user at the remote site, the malicious user can impersonate the proxy owner for the duration of the proxy's lifetime. The longer the proxy's lifetime, the more time a malicious user has to misuse the owner's credentials. To minimize the window of opportunity of a malicious user, it is recommended that proxies have a short lifetime (on the order of several hours).
The MyProxy software generates proxies using credentials (a user certificate or a long-lived proxy) located on a secure MyProxy server. Condor-G talks to the MyProxy server, renewing a proxy as it is about to expire. Another advantage is that the user no longer has to store a GSI user certificate and private key on the machine where jobs are submitted. This may be particularly important if a shared Condor-G submit machine is used by several users.
In a typical case, the following steps occur:
Condor-G keeps track of the password to the MyProxy server for credential renewal. Although Condor-G tries to keep the password encrypted and secure, it is still possible (although highly unlikely) for the password to be intercepted from the Condor-G machine (more precisely, from the machine that the condor_schedd daemon that manages the grid universe jobs runs on, which may be distinct from the machine where jobs are submitted). The following safeguard practices are recommended.
myproxy-init -s <host> -x -r <cert subject> -k <cred name>
The option -x -r <cert subject> essentially tells the MyProxy server to require two forms of authentication:
condor_submit -p mypassword /home/user/myjob.submit
A submit description file may include the password. An example contains commands of the form:
executable = /usr/bin/my-executable
universe = grid
grid_type = gt3
globusscheduler = condor-unsup-7
MyProxyHost = example.cs.wisc.edu:7512
MyProxyServerDN = /O=doesciencegrid.org/OU=People/CN=Jane Doe 25900
MyProxyPassword = password
MyProxyCredentialName = my_executable_run
queue
Note that storing this submit file reduces security, as it contains the password.
Currently, Condor-G calls the myproxy-get-delegation command-line tool, passing it the necessary arguments. The location of the myproxy-get-delegation executable is determined by the configuration variable MYPROXY_GET_DELEGATION in the configuration file on the Condor-G machine. This variable is read by the condor_gridmanager. If myproxy-get-delegation is a dynamically-linked executable (verify this with ldd myproxy-get-delegation), point MYPROXY_GET_DELEGATION to a wrapper shell script that sets LD_LIBRARY_PATH to the correct MyProxy library or Globus library directory and then calls myproxy-get-delegation. Here is an example of such a wrapper script:
#!/bin/sh
# make the MyProxy/Globus shared libraries visible to the tool
export LD_LIBRARY_PATH=/opt/myglobus/lib
# quote "$@" so that arguments containing spaces are passed through intact
exec /opt/myglobus/bin/myproxy-get-delegation "$@"
Condor's Grid Monitor is designed to improve the scalability of machines running Globus Toolkit 2 gatekeepers. Normally, this gatekeeper runs a jobmanager process for every job submitted to the gatekeeper. This includes both currently running jobs and jobs waiting in the queue. Each jobmanager runs a Perl script at frequent intervals (every 10 seconds) to poll the state of its job in the local batch system. For example, with 400 jobs submitted to a gatekeeper, there will be 400 jobmanagers running, each regularly starting a Perl script. When a large number of jobs have been submitted to a single gatekeeper, this frequent polling can heavily load the gatekeeper. When the gatekeeper is under heavy load, the system can become non-responsive, and a variety of problems can occur.
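The load described above is easy to estimate. The Python sketch below works through the numbers from the example (400 jobs, one poll every 10 seconds); the function name jobmanager_poll_load is made up for this illustration, and the result is an average that ignores how the polls are spread over time.

```python
# Illustrative arithmetic only, using the figures from the text.
def jobmanager_poll_load(jobs, poll_interval_seconds=10):
    """Return (jobmanager processes, Perl script starts per second)."""
    processes = jobs                      # one jobmanager per submitted job
    starts_per_second = jobs / poll_interval_seconds
    return processes, starts_per_second


processes, rate = jobmanager_poll_load(400)
print(processes, "jobmanagers,", rate, "script starts per second on average")
```

With 400 jobs the gatekeeper hosts 400 jobmanager processes and, on average, starts 40 polling scripts every second.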
Condor's Grid Monitor temporarily replaces these jobmanagers. It is named the Grid Monitor, because it replaces the monitoring (polling) duties previously done by jobmanagers. When the Grid Monitor runs, Condor attempts to start a single process to poll all of a user's jobs at a given gatekeeper. While a job is waiting in the queue, but not yet running, Condor shuts down the associated jobmanager, and instead relies on the Grid Monitor to report changes in status. The jobmanager started to add the job to the remote batch system queue is shut down. The jobmanager restarts when the job begins running.
By default, standard output and standard error are streamed back to the submitting machine while the job is running. Streamed I/O requires the jobmanager. As a result, the Grid Monitor cannot replace the jobmanager for jobs that use streaming. If possible, disable streaming for all jobs; this is accomplished by placing the following lines in each job's submit description file:
stream_output = False
stream_error = False
The Grid Monitor requires that the gatekeeper support the fork jobmanager with the name jobmanager-fork. If the gatekeeper does not support the fork jobmanager, the Grid Monitor will not be used for that site. The condor_gridmanager log file reports any problems using the Grid Monitor.
To enable the Grid Monitor, two variables are added to the Condor configuration file. The configuration macro GRID_MONITOR is already present in current distributions of Condor, but it may be missing from earlier versions of Condor. Also set the configuration macro ENABLE_GRID_MONITOR to True.
GRID_MONITOR = $(SBIN)/grid_monitor.sh
ENABLE_GRID_MONITOR = TRUE
When you remove a job with condor_rm, you may find that the job enters the X state for a very long time. This is normal: Condor is attempting to communicate with the remote scheduling system and ensure that the job has been properly cleaned up. If it takes too long or (in rare circumstances) is never removed, you can force the job to leave the job queue by using the -forcex option to condor_rm. This will forcibly remove jobs that are in the X state without attempting to finish any cleanup at the remote scheduler.
In its simplest usage, the grid universe allows users to specify the single grid site they wish to submit their job to. Often this is sufficient: perhaps a user knows exactly which grid site they wish to use, or a higher-level resource broker (such as the European Data Grid's resource broker) has decided which grid site should be used. But when users have a variety of sites to choose from and there is no other resource broker to make the decision, the grid universe can use matchmaking to decide which grid site a job should run on.
Please note that the grid universe's matchmaking ability is relatively new. Work is being done to improve it and make it easier to use. For now, please expect some rough edges.
The grid universe uses the same matchmaking mechanism that the other Condor universes use: the condor_collector and condor_negotiator daemons, which are described in Section 3.1.2.
There are two differences in how matchmaking is done in the grid universe, versus the other universes. First, the grid sites that are available must be advertised, so that they are known and considered during the matchmaking process. This is accomplished by writing ClassAd attributes and using condor_advertise to place the attributes into the ClassAd used in matchmaking. The second difference is in the submit description file. This file needs to specify requirements that describe what type of grid site can be used, instead of identifying a specific grid site.
In the following sections, examples are given for a GT2 grid-type job and resource. A couple of minor changes may be required for other grid-types. Primarily, an attribute other than globusscheduler may have to be used.
Each grid site that is available for matching purposes needs to be advertised to the condor_collector. Normally in Condor this is done with the condor_startd daemon, and you do not normally need to be aware of the contents of this advertisement. Currently, there is no equivalent to the condor_startd daemon for advertising grid sites, so you need to have a deeper understanding.
To properly advertise a grid site, a ClassAd needs to be sent periodically to the condor_collector. A ClassAd is a list of attributes and values that describe a job, a machine, or a grid site. ClassAds are briefly described in Section 2.3 and some of the common attributes of machine ClassAds are described in Section 2.5.2.
When you advertise a grid site, it looks very similar to a ClassAd for a machine. In fact, the condor_collector will believe it is a machine, but with a different set of attributes.
To advertise a grid site, you first need to describe the site in a file. Here is a sample ClassAd that describes a grid site:
# This is a comment
MyType = "Machine"
TargetType = "Job"
Name = "Example1_Gatekeeper"
Machine = "Example1_Gatekeeper"
gatekeeper_url = "grid.example.com/jobmanager"
Requirements = (CurMatches < 10) && (TARGET.JobUniverse == 9) && (TARGET.JobGridType =?= "gt2")
Rank = 0.000000
CurrentRank = 0.000000
WantAdRevaluate = True
UpdateSequenceNumber = 4
CurMatches = 0
Let's look at each line:
# This is a comment
Your file can have comments that begin with the hash mark (#).
MyType = "Machine"
Your grid site is pretending to be a single machine, for the purpose of matchmaking. MyType is an attribute that the condor_ negotiator daemon will expect to be a string. Strings must be surrounded by double-quote marks, as in this example. You may have surprising, unintuitive errors if they are not quoted. You will always want MyType to be ``Machine''.
TargetType = "Job"
This is an attribute that says the grid site (machine) wants to be matched with a job. Leave this as it is.
Name = "Example1_Gatekeeper"
You will want a unique name for each grid site. Any name is fine, as long as it is quoted.
Machine = "Example1_Gatekeeper"
Machine is just like Name. Normally in Condor, the Machine and Name may be slightly different if you have multiple CPUs. For grid matchmaking, they should probably be the same.
gatekeeper_url = "grid.example.com/jobmanager"
This is the Globus gatekeeper contact string for your grid site. It is probably a machine name followed by a slash followed by the name of the jobmanager. If you have different job managers, you can only specify one per ClassAd.
UpdateSequenceNumber = 4
UpdateSequenceNumber is a positive number that must increase each time you advertise a grid site. Normally you advertise your grid site every five minutes. The condor_ collector daemon will discard a grid site's ClassAd after 15 minutes if there have been no updates. A good number to set this to is the current time in seconds (the epoch, as given by the C time function call), but if you are worried about your clock running backward, you can set it to whatever you like. If ClassAds are received with a sequence number older than the last ClassAd, they are ignored.
CurMatches = 0
This number is incremented each time a match is made for this grid site. Unlike a normal machine ClassAd that can only be matched against once, grid site advertisements can be matched against many times.
You will probably want to set this number to be the number of grid jobs that you have running on your site, and keep it updated each time you submit a new ClassAd. If you do not specify CurMatches, Condor will assume it is 0.
Condor will increment this number every time it makes a match against a grid site.
Requirements = (CurMatches < 10) && (TARGET.JobUniverse == 9) && (TARGET.JobGridType =?= "gt2")
These are the requirements that the grid site insists must be true before it will accept a job. They can refer to attributes of the job's ClassAd. In this case, the site will take any grid universe job of grid-type gt2, as long as the site has fewer than 10 matches currently. This ensures that Condor will run at most 10 jobs at your site at a time, assuming that you keep CurMatches up to date when jobs finish. Of course, you can edit this statement to have different requirements. For example, if you want to accept all jobs, you can have Requirements = True.
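As a rough illustration of how the sample Requirements expression behaves, the following Python sketch mimics it with plain dictionaries standing in for ClassAds. The function name site_accepts and the dictionary representation are hypothetical and are not part of Condor; the sketch also ignores ClassAd details such as case-insensitive attribute names.

```python
GRID_UNIVERSE = 9  # the JobUniverse value tested by the sample Requirements


def site_accepts(site_ad, job_ad, max_matches=10):
    """Mimic: (CurMatches < 10) && (TARGET.JobUniverse == 9)
              && (TARGET.JobGridType =?= "gt2")"""
    # A missing CurMatches is treated as 0, as described for that attribute.
    cur_matches = site_ad.get("CurMatches", 0)
    return (
        cur_matches < max_matches
        and job_ad.get("JobUniverse") == GRID_UNIVERSE
        # =?= is only true when the attribute is defined and equal.
        and job_ad.get("JobGridType") == "gt2"
    )


job = {"JobUniverse": 9, "JobGridType": "gt2"}
print(site_accepts({"CurMatches": 0}, job))   # True
print(site_accepts({"CurMatches": 10}, job))  # False: site is full
print(site_accepts({}, {"JobUniverse": 9}))   # False: no JobGridType
```

The second call shows why CurMatches needs to be kept current: once it reaches the limit, the site stops matching until a fresh ClassAd with a lower value is advertised.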
Rank = 0.000000
CurrentRank = 0.000000
This is a numerical ranking that will be assigned to a job. Right now it is not used, but should be set to 0.
WantAdRevaluate = True
The WantAdRevaluate attribute distinguishes grid site ClassAds from normal machine ClassAds and allows multiple matches to be made against a single site. It should be in your ad and should be true. Note that True is not in quotes, and it should not be.
You can add other attributes to your ClassAd, to make it easy for a job to decide which grid site it wants to use. For instance, if you have pre-installed the Bamboozle software environment on your grid site, you could advertise HaveBamboozle = True and BamboozleVersion = 10. Jobs can require a grid site that has Bamboozle installed by extending their requirements with HaveBamboozle == True. (Note the double equal sign in the requirements.)
As an aside, we recommend that jobs that need specific applications bring those applications along, instead of relying on having them pre-installed at a grid site. You will have more reliable execution if you do.
Once you have a file that describes your site, you need to send it to the condor_ collector daemon. For this, use condor_ advertise. We recommend that you write a script to create the file containing the ClassAd, then run the script every five minutes with cron. The script should probably update the CurMatches variable, if you want to restrict the number of grid jobs that can be submitted at one time.
For condor_ advertise, specify UPDATE_STARTD_AD for the update command. For example, if your ClassAd is specified in a file named grid-ad you would do:
condor_advertise UPDATE_STARTD_AD grid-ad
condor_ advertise usually uses UDP to transmit your ClassAd. In wide-area networks, this may be insufficient. You can use TCP by specifying the -tcp option.
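The script recommended above might look something like the following Python sketch. It is only an outline under stated assumptions: the function names are hypothetical, the number of running grid jobs is passed in by the caller because how to count them is site-specific, and the -tcp option is assumed to be placed before the update command. The current time in seconds is used for UpdateSequenceNumber, as suggested earlier.

```python
import subprocess
import time

# Mirrors the sample grid site ClassAd; only the braces are filled in.
AD_TEMPLATE = """\
MyType = "Machine"
TargetType = "Job"
Name = "{name}"
Machine = "{name}"
gatekeeper_url = "{gatekeeper_url}"
Requirements = (CurMatches < {max_jobs}) && (TARGET.JobUniverse == 9) && (TARGET.JobGridType =?= "gt2")
Rank = 0.000000
CurrentRank = 0.000000
WantAdRevaluate = True
UpdateSequenceNumber = {sequence}
CurMatches = {cur_matches}
"""


def render_grid_ad(name, gatekeeper_url, cur_matches, max_jobs=10, now=None):
    """Return the ClassAd text; the sequence number is the time in seconds."""
    sequence = int(time.time() if now is None else now)
    return AD_TEMPLATE.format(
        name=name,
        gatekeeper_url=gatekeeper_url,
        max_jobs=max_jobs,
        sequence=sequence,
        cur_matches=cur_matches,
    )


def advertise_command(ad_path, use_tcp=False):
    """Build the condor_advertise command line as an argument list."""
    cmd = ["condor_advertise"]
    if use_tcp:
        cmd.append("-tcp")  # assumption: options precede the update command
    cmd += ["UPDATE_STARTD_AD", ad_path]
    return cmd


def write_and_advertise(ad_path, name, gatekeeper_url, cur_matches, use_tcp=False):
    """Write the ClassAd file, then hand it to condor_advertise."""
    with open(ad_path, "w") as ad_file:
        ad_file.write(render_grid_ad(name, gatekeeper_url, cur_matches))
    subprocess.run(advertise_command(ad_path, use_tcp), check=True)
```

A small wrapper run from cron every five minutes could call write_and_advertise("grid-ad", "Example1_Gatekeeper", "grid.example.com/jobmanager", cur_matches) after determining cur_matches for the site.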
Submitting a grid universe job that requires matchmaking is straightforward. Instead of specifying a particular scheduler with globusscheduler like this:
globusscheduler = grid.example.com/jobmanager
you instead specify requirements and tell Condor where to find the gatekeeper URL in the grid site ClassAd:
globusscheduler = $$(gatekeeper_url)
requirements = TARGET.gatekeeper_url =!= UNDEFINED
This will allow the job to run at any grid site whose ClassAd defines gatekeeper_url, and Condor will extract the value of the gatekeeper_url attribute from the matched ClassAd. There is no magic meaning behind the name gatekeeper_url; you could use GatekeeperContactString if you desired, as long as the name is the same in both the job description and the grid site ClassAd.
The requirements specified here are a bit simple. Perhaps you only want to run at a site that has the Bamboozle software installed, and the sites that have it installed specify ``HaveBamboozle = True'', as described above. A complete job description may look like this:
universe = grid
grid_type = gt2
executable = analyze_bamboozle_data
output = aaa.$(Cluster).out
error = aaa.$(Cluster).err
log = aaa.log
globusscheduler = $$(gatekeeper_url)
requirements = (HaveBamboozle == True) && (TARGET.gatekeeper_url =!= UNDEFINED)
leave_in_queue = jobstatus == 4
queue
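To make the effect of the requirements and of $$(gatekeeper_url) in this job description more concrete, here is a small Python sketch. It is not how Condor implements matchmaking: the two site ads, the helper names, and the use of dictionaries for ClassAds are all invented for illustration.

```python
import re


def job_requirements(site_ad):
    """Mimic: (HaveBamboozle == True) && (TARGET.gatekeeper_url =!= UNDEFINED)"""
    return site_ad.get("HaveBamboozle") is True and "gatekeeper_url" in site_ad


def expand_dollar_dollar(value, site_ad):
    """Replace each $$(attr) with that attribute's value in the matched ad."""
    return re.sub(r"\$\$\((\w+)\)", lambda m: str(site_ad[m.group(1)]), value)


# Two hypothetical grid site ads; only SiteB advertises Bamboozle.
sites = [
    {"Name": "SiteA", "gatekeeper_url": "a.example.com/jobmanager"},
    {"Name": "SiteB", "gatekeeper_url": "b.example.com/jobmanager",
     "HaveBamboozle": True},
]

eligible = [ad for ad in sites if job_requirements(ad)]
scheduler = expand_dollar_dollar("$$(gatekeeper_url)", eligible[0])
print(scheduler)  # b.example.com/jobmanager
```

SiteA is skipped because it does not advertise HaveBamboozle, so the job's globusscheduler ends up holding SiteB's gatekeeper contact string.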
What if a job fails to run at a grid site due to an error? It will be returned to the queue, and Condor will attempt to match it and re-run it at another site. Condor isn't very clever about avoiding sites that may be bad, but you can give it some assistance. Let's say that you want to avoid running at the last grid site you ran at. You could add this to your job description:
match_list_length = 1
Rank = TARGET.Name != LastMatchName0
This will prefer to run at a grid site that was not just tried, but it will allow the job to be run there if there is no other option.
When you specify match_list_length, you provide an integer N, and Condor will keep track of the last N matches. The most recent match will be LastMatchName0, the next most recent will be LastMatchName1, and so on. (See the condor_ submit manual page for more details.) The Rank expression allows you to specify a numerical ranking for different matches. When combined with match_list_length, you can prefer to avoid sites that you have already run at.
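The interaction between match_list_length and Rank can be pictured with the following Python sketch. It assumes that LastMatchName0 holds the most recently matched site, which is what the Rank example above relies on to avoid the last site tried. The function names are hypothetical, and treating an empty history as rank 0 is a simplification made for this illustration.

```python
def record_match(history, site_name, match_list_length):
    """Return the updated match list; index 0 plays the role of LastMatchName0."""
    return ([site_name] + history)[:match_list_length]


def rank(site_name, history):
    """Mimic: Rank = TARGET.Name != LastMatchName0 (true counts as 1, false as 0)."""
    if not history:
        return 0.0  # simplification: no previous match, all sites rank equally
    return 1.0 if site_name != history[0] else 0.0


def pick_site(candidates, history):
    """Choose the highest-ranked candidate; ties go to the first one listed."""
    return max(candidates, key=lambda name: rank(name, history))


history = record_match([], "SiteA", match_list_length=1)
print(pick_site(["SiteA", "SiteB"], history))  # SiteB: the other site is preferred
print(pick_site(["SiteA"], history))           # SiteA: still allowed if it is the only option
```

Because Rank expresses a preference rather than a requirement, the previously tried site is only deprioritized, never excluded.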
In addition, condor_ submit has two options to help you control grid universe job resubmissions and rematching. See globus_resubmit and globus_rematch in the condor_ submit manual page. These options are independent of match_list_length.
There are some new attributes that will be added to the job ClassAd, and they may be useful to you when you write your rank, requirements, globus_resubmit, or globus_rematch expressions. Please refer to Section 2.5.2 for their descriptions. The examples that follow use NumSystemHolds, NumGlobusSubmits, HoldReason, and EnteredCurrentStatus.
The following example of a command within the submit description file releases jobs 5 minutes after being held, increasing the time between releases by 5 minutes each time. It will continue to retry up to 4 times per Globus submission, plus 4. The plus 4 is necessary in case the job goes on hold before being submitted to Globus, although this is unlikely.
periodic_release = ( NumSystemHolds <= ((NumGlobusSubmits * 4) + 4) ) \
   && (NumGlobusSubmits < 4) && \
   ( HoldReason != "via condor_hold (by user $ENV(USER))" ) && \
   ((CurrentTime - EnteredCurrentStatus) > ( NumSystemHolds *60*5 ))
The following example forces Globus resubmission after a job has been held 4 times per Globus submission.
globus_resubmit = NumSystemHolds == (NumGlobusSubmits + 1) * 4
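The arithmetic in the periodic_release and globus_resubmit expressions above can be checked with a plain Python restatement. This is only a sketch: the function and parameter names are hypothetical, $ENV(USER) is modeled as an ordinary user argument, and ClassAd evaluation rules for undefined attributes are not modeled.

```python
def should_release(num_system_holds, num_globus_submits, hold_reason,
                   current_time, entered_current_status, user):
    """Restate the periodic_release expression term by term."""
    user_hold = "via condor_hold (by user {})".format(user)
    return (
        num_system_holds <= (num_globus_submits * 4) + 4
        and num_globus_submits < 4
        and hold_reason != user_hold  # never release a hold the user asked for
        # The wait grows by 5 minutes (60*5 seconds) with each system hold.
        and (current_time - entered_current_status) > num_system_holds * 60 * 5
    )


def should_resubmit(num_system_holds, num_globus_submits):
    """Restate: globus_resubmit = NumSystemHolds == (NumGlobusSubmits + 1) * 4"""
    return num_system_holds == (num_globus_submits + 1) * 4


held_at = 1000
# First hold: released only once more than 300 seconds have passed.
print(should_release(1, 0, "Globus error", held_at + 300, held_at, "alice"))  # False
print(should_release(1, 0, "Globus error", held_at + 301, held_at, "alice"))  # True
# Second hold: the required wait doubles to 600 seconds.
print(should_release(2, 0, "Globus error", held_at + 301, held_at, "alice"))  # False
# Fourth hold with no Globus submission yet triggers a resubmission.
print(should_resubmit(4, 0))  # True
```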
If you are concerned about unknown or malicious grid sites reporting to your condor_ collector, you should use Condor's security options, documented in Section 3.7.