


5.3 The Grid Universe


5.3.1 Condor-C, the condor grid_type

Condor-C allows jobs in one machine's job queue to be moved to another machine's job queue. These machines may be far removed from each other, providing powerful grid computation mechanisms, while requiring only Condor software and its configuration.

Condor-C is highly resistant to network disconnections and machine failures on both the submission and remote sides. An expected usage sets up Personal Condor on a laptop, submits some jobs that are sent to a Condor pool, waits until the jobs are staged on the pool, then turns off the laptop. When the laptop reconnects at a later time, any results can be pulled back.

Condor-C scales gracefully when compared with Condor's flocking mechanism. The machine upon which jobs are submitted maintains a single process and network connection to a remote machine, without regard to the number of jobs queued or running.


5.3.1.1 Condor-C Configuration

There are two aspects to configuration to enable the submission and execution of Condor-C jobs. These two aspects correspond to the endpoints of the communication: there is the machine from which jobs are submitted, and there is the remote machine upon which the jobs are placed in the queue (executed).

Configuration of a machine from which jobs are submitted requires two extra configuration variables:

CONDOR_GAHP=$(SBIN)/condor_c-gahp
C_GAHP_LOG=/tmp/CGAHPLog.$(USERNAME)

The acronym GAHP stands for Grid ASCII Helper Protocol. A GAHP server provides grid-related services for a variety of underlying middleware systems. The configuration variable CONDOR_GAHP gives the full path to the GAHP server used by Condor-C. The configuration variable C_GAHP_LOG defines the location of the log that the Condor GAHP server writes. This log is written as the user on whose behalf the GAHP server is running; thus, like GRIDMANAGER_LOG, the C_GAHP_LOG configuration variable must point to a location to which the end user can write.

A submit machine must also have a condor_collector daemon to which the condor_schedd daemon can submit a query. The query is for the location (IP address and port) of the intended remote machine's condor_schedd daemon. This facilitates communication between the two machines. This condor_collector does not need to be the same collector that the local condor_schedd daemon reports to. The remote_pool submit file command is used to specify the condor_collector daemon to contact for the location.

The machine upon which jobs are executed must also be configured correctly. This machine must be running a condor_schedd daemon. Unless specified explicitly in a submit file, CONDOR_HOST must point to a condor_collector daemon that it can write to, and that the machine upon which jobs are submitted can read from. This facilitates communication between the two machines.

An important aspect of configuration is the security configuration relating to authentication. Condor-C on the remote machine relies on an authentication protocol to know the identity of the user under which to run a job. The following is a working example of the security configuration for authentication. This authentication method, CLAIMTOBE, trusts the identity claimed by a host or IP address.

SEC_DEFAULT_NEGOTIATION = OPTIONAL
SEC_DEFAULT_AUTHENTICATION_METHODS = CLAIMTOBE


5.3.1.2 Condor-C Job Submission

Job submission of Condor-C jobs is the same as for any Condor job. The universe is grid, and the grid_type is condor. The remote pool's job queue is defined and located by the submit description file command remote_schedd. The value assigned to this command is the same as the condor_schedd ClassAd attribute Name on the remote machine. The remote pool's condor_collector is defined by the submit description file command remote_pool.

The following represents a minimal submit description file for a job.

   # minimal submit description file for a Condor-C job
   universe = grid
   grid_type = condor
   executable = myjob
   output = myoutput
   error = myerror
   log = mylog

   remote_schedd = joe@remotemachine.example.com
   remote_pool = remotecentralmanager.example.com
   +remote_jobuniverse = 5
   +remote_requirements = True
   +remote_ShouldTransferFiles = "YES"
   +remote_WhenToTransferOutput = "ON_EXIT"
   queue

The remote machine needs to understand the attributes of the job. These are specified in the submit description file using the '+' syntax, followed by the string remote_. At a minimum, this will be the job's universe and the job's requirements. It is likely that other attributes specific to the job's universe (on the remote pool) will also be necessary. Note that attributes set with '+' are inserted directly into the job's ClassAd. Specify attributes as they must appear in the job's ClassAd, not the submit description file. For example, the universe is specified using an integer assigned for a job ClassAd JobUniverse. Similarly, place quotation marks around string expressions. As an example, a submit description file would ordinarily contain

when_to_transfer_output = ON_EXIT
This must appear in the Condor-C job submit description file as
+remote_WhenToTransferOutput = "ON_EXIT"
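
The translation from an ordinary submit command to a '+remote_' attribute can be sketched in Python. This is an illustration only and not part of Condor: the helper name remote_attribute is hypothetical, the universe integers are copied from Table 5.1, and the caller is assumed to pass the attribute name as it is spelled in the job ClassAd. Only literal values are handled; ClassAd expressions are out of scope.

```python
# Illustrative sketch only; not part of Condor.
# Universe integers are those listed in Table 5.1.
JOB_UNIVERSE = {
    "standard": 1, "pvm": 4, "vanilla": 5, "scheduler": 7, "mpi": 8,
    "grid": 9, "java": 10, "parallel": 11, "local": 12,
}

def remote_attribute(classad_name, value):
    """Format one '+remote_' line for a Condor-C submit description file.

    classad_name is the attribute as it appears in the job ClassAd,
    for example JobUniverse or WhenToTransferOutput (hypothetical helper).
    """
    if classad_name.lower() == "jobuniverse":
        # The universe must be given as its integer value.
        rendered = str(JOB_UNIVERSE[value.lower()])
    elif isinstance(value, bool):
        rendered = "True" if value else "False"
    elif isinstance(value, (int, float)):
        rendered = str(value)
    else:
        # Strings must carry quotation marks in the ClassAd;
        # embedded double quotes are escaped with a backslash.
        rendered = '"' + str(value).replace('"', '\\"') + '"'
    return "+remote_%s = %s" % (classad_name, rendered)

print(remote_attribute("JobUniverse", "vanilla"))
print(remote_attribute("WhenToTransferOutput", "ON_EXIT"))
```

The second call produces the quoted form shown above, while the first shows the universe name replaced by its integer.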

For convenience, the specific entries of universe, remote_schedd, remote_pool, globus_rsl, globus_xml, and globus_scheduler can be specified as remote_ commands using the standard submit file syntax and without the leading '+'. So instead of

+remote_universe = 5
you can write
remote_universe = vanilla
Similarly, instead of
+remote_remote_schedd = "schedd.example.com"
you can write
remote_remote_schedd = schedd.example.com
This behavior only works for the specific entries universe, remote_schedd, remote_pool, globus_rsl, globus_xml, and globus_scheduler. For all other entries, use the '+' syntax, remembering to place strings in quotation marks.

For this particular example, the job is to be run as a vanilla universe job at the remote pool. The remote pool's condor_schedd daemon is likely to place its job queue data on a local disk and execute the job on another machine within the pool. This implies that the file systems of the remote submit machine (the machine specified by remote_schedd) and the execute machine (the machine that runs the job) will not be shared. Thus, the two inserted ClassAd attributes

+remote_ShouldTransferFiles = "YES"
+remote_WhenToTransferOutput = "ON_EXIT"
are used to invoke Condor's file transfer mechanism.

As Condor-C is a recent addition to Condor, the universes, associated integer assignments, and notes about the existence of functionality are given in Table 5.1. The note "untested" means that submissions under the given universe have not yet been thoroughly tested; they may already work.


Table 5.1: Functionality of remote job universes with Condor-C
Universe Name          Value   Notes
standard               1       untested
PVM                    4       untested
vanilla                5       works well
scheduler              7       works well
MPI                    8       untested
grid                   9
  grid_type = condor           works well
  grid_type = gt2              untested
  grid_type = gt3              untested
  grid_type = gt4              untested
java                   10      untested
parallel               11      untested
local                  12      works well


For communication between condor_schedd daemons on the submit and remote machines, the location of the remote condor_schedd daemon is needed. This information resides in the condor_collector of the remote machine's pool. The remote_pool command in the submit description file says which condor_collector should be queried for the remote condor_schedd daemon's location. An example of this submit command is

remote_pool = machine1.example.com
If the remote condor_collector is not listening on the standard port (9618), then the port it is listening on needs to be specified, like so:
remote_pool = machine1.example.com:12345
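
The two forms of the remote_pool value can be read as a host name with an optional port. The following Python sketch shows that interpretation; the function name parse_remote_pool is hypothetical, and the sketch assumes a plain host name or IPv4 address (it does not attempt IPv6 literals). It is not how Condor itself parses the value.

```python
DEFAULT_COLLECTOR_PORT = 9618  # standard condor_collector port named above

def parse_remote_pool(value):
    """Split a remote_pool value into (host, port) (hypothetical helper)."""
    host, sep, port = value.strip().partition(":")
    if not host:
        raise ValueError("remote_pool needs a host name")
    # Fall back to the standard port when none is given.
    return host, int(port) if sep else DEFAULT_COLLECTOR_PORT

print(parse_remote_pool("machine1.example.com"))
print(parse_remote_pool("machine1.example.com:12345"))
```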

File transfer of a job's executable, stdin, stdout, and stderr is automatic. When other files need to be transferred using Condor's file transfer mechanism (see section 2.5.4), the mechanism is applied based on the resulting job universe on the remote machine.


5.3.1.3 Current Limitations in Condor-C

Submitting jobs to run under the grid universe has not yet been perfected. The following is a list of known limitations with Condor-C:

  1. Authentication methods other than CLAIMTOBE, such as GSI and KERBEROS, are untested, and may not yet work.

  2. Platforms that run a Windows operating system are not yet supported as either a submit or remote execute machine.


5.3.2 Condor-G, the gt2, gt3, and gt4 grid_types

Condor-G is the name given to Condor when grid universe jobs are sent to grid resources utilizing Globus software for job execution. The Globus Toolkit provides a framework for building grid systems and applications. See the Globus Alliance web page at http://www.globus.org for descriptions and details of the Globus software.

Condor provides the same job management capabilities for Condor-G jobs as for other jobs. From Condor, a user may effectively submit jobs, manage jobs, and have jobs execute on widely distributed machines.

It may appear that Condor-G is a simple replacement for the Globus Toolkit's globusrun command. However, Condor-G does much more. It allows the submission of many jobs at once, along with the monitoring of those jobs with a convenient interface. There is notification when jobs complete or fail and maintenance of Globus credentials that may expire while a job is running. On top of this, Condor-G is a fault-tolerant system; if a machine crashes, all of these functions are again available as the machine returns.


5.3.2.1 Globus Protocols and Terminology

The Globus software provides a well-defined set of protocols that allow authentication, data transfer, and remote job execution. Authentication is a mechanism by which an identity is verified. Given proper authentication, authorization to use a resource is required. Authorization is a policy that determines who is allowed to do what.

Condor (and Globus) utilize the following protocols and terminology. The protocols allow Condor to interact with grid machines toward the end result of executing jobs.

GSI
The Globus Toolkit's Grid Security Infrastructure (GSI) provides essential building blocks for other grid protocols and Condor-G. This authentication and authorization system makes it possible to authenticate a user just once, using public key infrastructure (PKI) mechanisms to verify a user-supplied grid credential. GSI then handles the mapping of the grid credential to the diverse local credentials and authentication/authorization mechanisms that apply at each site.
GRAM
The Grid Resource Allocation and Management (GRAM) protocol supports remote submission of a computational request (for example, to run a program) to a remote computational resource, and it supports subsequent monitoring and control of the computation. GRAM is the Globus protocol that Condor-G uses to talk to remote Globus jobmanagers.
GASS
The Globus Toolkit's Global Access to Secondary Storage (GASS) service provides mechanisms for transferring data to and from a remote HTTP, FTP, or GASS server. GASS is used by Condor for the gt2 and gt3 grid_types to transfer a job's files between the machine where the job is submitted and the remote resource.
GridFTP
GridFTP is an extension of FTP that provides strong security and high-performance options for large data transfers. It is used with the gt4 grid_type to transfer the job's files between the machine where the job is submitted and the remote resource.
RSL
RSL (Resource Specification Language) is the language GRAM accepts to specify job information.
gatekeeper
A gatekeeper is a software daemon executing on a remote machine on the grid. It is relevant only to the gt2 grid_type, and this daemon handles the initial communication between Condor and a remote resource.
jobmanager
A jobmanager is the Globus service that is initiated at a remote resource to submit, keep track of, and manage grid I/O for jobs running on an underlying batch system. There is a specific jobmanager for each type of batch system supported by Globus (examples are Condor, LSF, and PBS).

Figure 5.1: Condor-G interaction with Globus-managed resources

Figure 5.1 shows how Condor interacts with Globus software to run jobs. The diagram is specific to the gt2 grid_type. Condor contains a GASS server, used to transfer the executable, stdin, stdout, and stderr to and from the remote job execution site. Condor uses the GRAM protocol to contact the remote gatekeeper and request that a new jobmanager be started. The GRAM protocol is also used to monitor the job's progress. Condor detects and intelligently handles cases such as a crash of the remote resource.

There are now three different versions of the GRAM protocol. Condor supports all three. Condor's grid universe uses the grid_type command within a submit description file to distinguish among them.

gt2
This initial GRAM protocol is used in Globus Toolkit versions 1 and 2. It is still used by many production systems. Where it is made available alongside the more recent versions of the protocol, gt2 is referred to as the pre-web services GRAM.
gt3
The gt3 grid_type corresponds to Globus Toolkit version 3, part of Globus' shift to web services-based protocols. It has been replaced by Globus Toolkit version 4. An installation of Globus Toolkit version 3 may also include the pre-web services GRAM.
gt4
This GRAM protocol was introduced in Globus Toolkit version 4 as a more standards-compliant version of the GT3 web services-based GRAM. An installation of Globus Toolkit version 4 may also include the pre-web services GRAM.


5.3.2.2 The gt2 grid_type

Condor-G supports submitting jobs to remote resources running the Globus Toolkit versions 1 and 2, also called the pre-web services GRAM. These Condor-G jobs are submitted the same as any other Condor job. The universe is grid, and the pre-web services GRAM protocol is specified by setting the submit command grid_type to gt2. Older submit description files specifying a globus universe job will default to this.

Under Condor, successful job submission to the grid universe with gt2 requires credentials. An X.509 certificate is used to create a proxy, and an account, authorization, or allocation to use a grid resource is required. For more information on proxies and certificates, please consult the Alliance PKI pages at

http://archive.ncsa.uiuc.edu/SCD/Alliance/GridSecurity/

Before submitting a job to Condor under the grid universe, use grid-proxy-init to create a proxy.

Here is a simple submit description file. The example specifies a gt2 job to be run on an NCSA machine.

executable = test
globusscheduler = modi4.ncsa.uiuc.edu/jobmanager
universe = grid
grid_type = gt2
output = test.out
log = test.log
queue

The executable for this example is transferred from the local machine to the remote machine. By default, Condor transfers the executable, as well as any files specified by an input command. Note that the executable must be compiled for its intended platform.

The command globusscheduler is a required command for Condor-G jobs. It specifies the scheduling software to be used on the remote resource. There is a specific jobmanager for each type of batch system supported by Globus. The full syntax for this command line appears as

globusscheduler = machinename[:port]/jobmanagername[:X.509 distinguished name]
The portions of this syntax specification enclosed within square brackets ([ and ]) are optional. On a machine where the jobmanager is listening on a nonstandard port, include the port number. The jobmanagername is one of five strings:
jobmanager
jobmanager-condor
jobmanager-pbs
jobmanager-lsf
jobmanager-sge
The Globus software running on the remote resource uses this string to identify and select the correct service to perform. Other jobmanagername strings may be used, where additional services are defined and implemented.
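
To make the pieces of the globusscheduler value concrete, here is a small Python sketch that splits a gt2 value into its parts. The function name and the returned dictionary keys are hypothetical, and the port and distinguished name in the second call are made-up illustration values. The sketch follows the syntax line above and is not Condor's own parser.

```python
def parse_gt2_globusscheduler(value):
    """Split machinename[:port]/jobmanagername[:X.509 distinguished name]."""
    contact, slash, service = value.strip().partition("/")
    if not slash or not contact or not service:
        raise ValueError("expected machinename[:port]/jobmanagername")
    host, colon, port = contact.partition(":")
    # Only the first ':' after the jobmanager name separates the optional
    # distinguished name, which may itself contain '/' and ':' characters.
    jobmanager, _, subject = service.partition(":")
    return {
        "host": host,
        "port": int(port) if colon else None,  # None means no port was given
        "jobmanager": jobmanager,
        "subject": subject or None,
    }

print(parse_gt2_globusscheduler("modi4.ncsa.uiuc.edu/jobmanager"))
# Hypothetical port and distinguished name, for illustration only.
print(parse_gt2_globusscheduler(
    "example.cs.wisc.edu:12345/jobmanager-pbs:/O=Example/CN=host"))
```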

No input file is specified for this example job. Any output (file specified by an output command) or error (file specified by an error command) is transferred from the remote machine to the local machine as it is produced. This implies that these files may be incomplete in the case where the executable does not finish running on the remote resource. The ability to transfer standard output and standard error as they are produced may be disabled by adding to the submit description file:

stream_output = False
stream_error  = False
As a result, standard output and standard error will be transferred only after the job completes.

The job log file is maintained on the submit machine.

Example output from condor_q for this submission looks like:

% condor_q


-- Submitter: wireless48.cs.wisc.edu : <128.105.48.148:33012> : wireless48.cs.wi

 ID      OWNER         SUBMITTED     RUN_TIME ST PRI SIZE CMD
   7.0   smith        3/26 14:08   0+00:00:00 I  0   0.0  test

1 jobs; 1 idle, 0 running, 0 held

After a short time, the Globus resource accepts the job. Running condor_q again now results in

% condor_q


-- Submitter: wireless48.cs.wisc.edu : <128.105.48.148:33012> : wireless48.cs.wi

 ID      OWNER         SUBMITTED     RUN_TIME ST PRI SIZE CMD
   7.0   smith        3/26 14:08   0+00:01:15 R  0   0.0  test

1 jobs; 0 idle, 1 running, 0 held

Then, very shortly after that, the queue will be empty again, because the job has finished:

% condor_q


-- Submitter: wireless48.cs.wisc.edu : <128.105.48.148:33012> : wireless48.cs.wi

 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD

0 jobs; 0 idle, 0 running, 0 held

A second example of a submit description file runs the Unix ls program on a different Globus resource.

executable = /bin/ls
transfer_executable = false
globusscheduler = vulture.cs.wisc.edu/jobmanager
universe = grid
grid_type = gt2
output = ls-test.out
log = ls-test.log
queue

In this example, the executable (the binary) has been pre-staged. The executable is on the remote machine, and it is not to be transferred before execution. Note that the required globusscheduler and universe commands are present. The command

transfer_executable = false
within the submit description file identifies the executable as being pre-staged. In this case, the executable command gives the path to the executable on the remote machine.

A third example submits a Perl script to be run as a Condor job. The Perl script lists the environment variables set for the job, including those set in the submit description file. Save the following Perl script with the name env-test.pl, to be used as a Condor job executable.

#!/usr/bin/env perl

foreach $key (sort keys(%ENV))
{
   print "$key = $ENV{$key}\n"
}

exit 0;

Run the Unix command

chmod 755 env-test.pl
to make the Perl script executable.

Now create the following submit description file. Replace example.cs.wisc.edu/jobmanager with a resource you are authorized to use.

executable = env-test.pl
globusscheduler = example.cs.wisc.edu/jobmanager
universe = grid
grid_type = gt2
environment = foo=bar; zot=qux
output = env-test.out
log = env-test.log
queue

When the job has completed, the output file, env-test.out, should contain something like this:

GLOBUS_GRAM_JOB_CONTACT = https://example.cs.wisc.edu:36213/30905/1020633947/
GLOBUS_GRAM_MYJOB_CONTACT = URLx-nexus://example.cs.wisc.edu:36214
GLOBUS_LOCATION = /usr/local/globus
GLOBUS_REMOTE_IO_URL = /home/smith/.globus/.gass_cache/globus_gass_cache_1020633948
HOME = /home/smith
LANG = en_US
LOGNAME = smith
X509_USER_PROXY = /home/smith/.globus/.gass_cache/globus_gass_cache_1020633951
foo = bar
zot = qux

Of particular interest is the GLOBUS_REMOTE_IO_URL environment variable. Condor-G automatically starts up a GASS remote I/O server on the submit machine. Because of the potential for either side of the connection to fail, the URL for the server cannot be passed directly to the job. Instead, it is placed into a file, and the GLOBUS_REMOTE_IO_URL environment variable points to this file. Remote jobs can read this file and use the URL it contains to access the remote GASS server running inside Condor-G. If the location of the GASS server changes (for example, if Condor-G restarts), Condor-G will contact the Globus gatekeeper and update this file on the machine where the job is running. It is therefore important that all accesses to the remote GASS server check this file for the latest location.

The following example is a Perl script that uses the GASS server in Condor-G to copy input files to the execute machine. In this example, the remote job copies the file /etc/hosts from the submit machine and prints its contents.

#!/usr/bin/env perl
use FileHandle;
use Cwd;

STDOUT->autoflush();
$gassUrl = `cat $ENV{GLOBUS_REMOTE_IO_URL}`;
chomp $gassUrl;

$ENV{LD_LIBRARY_PATH} = $ENV{GLOBUS_LOCATION}. "/lib";
$urlCopy = $ENV{GLOBUS_LOCATION}."/bin/globus-url-copy";

# globus-url-copy needs a full pathname
$pwd = getcwd();
print "$urlCopy $gassUrl/etc/hosts file://$pwd/temporary.hosts\n\n";
`$urlCopy $gassUrl/etc/hosts file://$pwd/temporary.hosts`;

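# print the copied file, stopping with a message if the copy failed
open(my $fh, '<', 'temporary.hosts') or die "cannot open temporary.hosts: $!";
while (<$fh>) {
print $_;
}
close($fh);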

exit 0;

The submit description file used to submit the Perl script as a Condor job appears as:

executable = gass-example.pl
globusscheduler = example.cs.wisc.edu/jobmanager
universe = grid
grid_type = gt2
output = gass.out
log = gass.log
queue

There are two optional submit description file commands of note: x509userproxy and globus_rsl. The x509userproxy command specifies the path to an X.509 proxy. The command is of the form:

x509userproxy = /path/to/proxy
If this optional command is not present in the submit description file, then Condor-G checks the value of the environment variable X509_USER_PROXY for the location of the proxy. If this environment variable is not present, then Condor-G looks for the proxy in the file /tmp/x509up_uXXXX, where the characters XXXX in this file name are replaced with the Unix user id.
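
The search order for the proxy can be written out as a short Python sketch. The function name find_proxy_path and its parameters are hypothetical; the submit commands are assumed to be available as a dictionary, and the environment and user id are passed in so the order can be exercised without touching the real account. It restates the three steps above and is not Condor-G's implementation.

```python
import os

def find_proxy_path(submit_commands, environ=None, uid=None):
    """Return the proxy path using the search order described above."""
    environ = os.environ if environ is None else environ
    # 1. An explicit x509userproxy command in the submit description file.
    if submit_commands.get("x509userproxy"):
        return submit_commands["x509userproxy"]
    # 2. The X509_USER_PROXY environment variable.
    if environ.get("X509_USER_PROXY"):
        return environ["X509_USER_PROXY"]
    # 3. The default file, named after the Unix user id.
    if uid is None:
        uid = os.getuid()  # Unix only
    return "/tmp/x509up_u%d" % uid

# The uid 1000 and the paths below are illustration values.
print(find_proxy_path({"x509userproxy": "/path/to/proxy"}, {}, 1000))
print(find_proxy_path({}, {"X509_USER_PROXY": "/home/smith/proxy"}, 1000))
print(find_proxy_path({}, {}, 1000))
```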

The globus_rsl command is used to add additional attribute settings to a job's RSL string. The format of the globus_rsl command is

globus_rsl = (name=value)(name=value)
Here is an example of this command from a submit description file:
globus_rsl = (project=Test_Project)
This example's attribute name for the additional RSL is project, and the value assigned is Test_Project.
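
When several RSL attributes are needed, the value is the concatenation of one parenthesized pair per attribute. A minimal Python sketch of that formatting follows; the helper name format_globus_rsl is hypothetical, and the sketch assumes names and values that need no RSL quoting.

```python
def format_globus_rsl(attributes):
    """Render (name, value) pairs as a globus_rsl submit command.

    Hypothetical helper; values are assumed to need no RSL quoting.
    """
    pairs = "".join("(%s=%s)" % (name, value) for name, value in attributes)
    return "globus_rsl = " + pairs

print(format_globus_rsl([("project", "Test_Project")]))
```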


5.3.2.3 The gt3 grid_type

Condor-G supports submitting jobs to remote resources running the Globus Toolkit version 3.2. Please note that this Globus Toolkit version is not compatible with the Globus Toolkit version 3.0. See http://www-unix.globus.org/toolkit/docs/3.2/index.html for more information about the Globus Toolkit version 3.2.

For grid_type gt3 jobs, the submit description file is much the same as for grid_type gt2 jobs. The globusscheduler command is still required, but the format changes from gt2 to one that is a URL. The syntax follows the form:

globusscheduler = http://hostname[:port]/ogsa/services/base/gram/
XXXManagedJobFactoryService

or

globusscheduler = http://IPaddress[:port]/ogsa/services/base/gram/
XXXManagedJobFactoryService

This value is placed on two lines for formatting purposes, but is all on a single line within a submit description file. The portion of this syntax specification enclosed within square brackets ([ and ]) is optional. The substring XXX within the last part of the value is replaced by one of five strings that (as for gt2) identify and select the correct service to perform. The five strings that replace XXX are

Fork
Condor
PBS
LSF
SGE

An example, given on two lines (again, for formatting reasons) is

globusscheduler = http://198.51.254.40:8080/ogsa/services/base/gram/
ForkManagedJobFactoryService
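
Because the value must be a single line in the submit description file, it can help to assemble it programmatically. The Python sketch below builds the gt3 value from its parts; the function name gt3_globusscheduler is hypothetical, and the sketch simply follows the syntax given above.

```python
GT3_SERVICES = ("Fork", "Condor", "PBS", "LSF", "SGE")

def gt3_globusscheduler(host, service, port=None):
    """Build a single-line gt3 globusscheduler value (hypothetical helper)."""
    if service not in GT3_SERVICES:
        raise ValueError("service must be one of %s" % ", ".join(GT3_SERVICES))
    # The port is optional in the syntax, so include it only when given.
    netloc = host if port is None else "%s:%d" % (host, port)
    return ("http://%s/ogsa/services/base/gram/%sManagedJobFactoryService"
            % (netloc, service))

print("globusscheduler = " + gt3_globusscheduler("198.51.254.40", "Fork", 8080))
```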

On the machine where the job is submitted, there is no requirement for any Globus Toolkit 3.2 components. Condor itself installs the necessary framework within the directory $(LIB)/lib/gt3. The machine where the job is submitted is required to have Java 1.4 or a higher version installed. The configuration variable JAVA must identify the location of the installation. See section 3.3 for the complete description of the configuration variable JAVA.


5.3.2.4 The gt4 grid_type

Condor-G supports submitting jobs to remote resources running the Globus Toolkit version 4.0. Please note that this Globus Toolkit version is not compatible with the Globus Toolkit version 3.0 or 3.2. See http://www-unix.globus.org/toolkit/docs/4.0/index.html for more information about the Globus Toolkit version 4.0.

For grid_type gt4 jobs, the submit description file is much the same as for grid_type gt2 or gt3 jobs. The globusscheduler command is still required, and is given in the form of a URL; the syntax follows the form:

globusscheduler = [https://]hostname[:port][/wsrf/services/ManagedJobFactoryService]

or

globusscheduler = [https://]IPaddress[:port][/wsrf/services/ManagedJobFactoryService]
The portions of this syntax specification enclosed within square brackets ([ and ]) are optional.

A new submit command called jobmanager_type distinguishes the correct service to perform. The value of jobmanager_type is one of five strings:

Fork
Condor
PBS
LSF
SGE
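
The optional parts of the gt4 globusscheduler value, together with the jobmanager_type check, can be illustrated with a regular expression. The Python sketch below is an assumption-laden illustration: the function name check_gt4_job and the returned keys are hypothetical, the port in the example is a made-up value, and the sketch only reports which optional parts are present rather than guessing what Condor substitutes for missing ones.

```python
import re

GT4_JOBMANAGER_TYPES = ("Fork", "Condor", "PBS", "LSF", "SGE")

# [https://]hostname[:port][/wsrf/services/ManagedJobFactoryService]
_GT4_PATTERN = re.compile(
    r"^(?P<scheme>https://)?"
    r"(?P<host>[^:/]+)"
    r"(?::(?P<port>\d+))?"
    r"(?P<path>/wsrf/services/ManagedJobFactoryService)?$"
)

def check_gt4_job(globusscheduler, jobmanager_type):
    """Validate a gt4 globusscheduler value and its jobmanager_type."""
    match = _GT4_PATTERN.match(globusscheduler.strip())
    if match is None:
        raise ValueError("unrecognized gt4 globusscheduler: %r" % globusscheduler)
    if jobmanager_type not in GT4_JOBMANAGER_TYPES:
        raise ValueError("unknown jobmanager_type: %r" % jobmanager_type)
    port = match.group("port")
    return {
        "host": match.group("host"),
        "port": int(port) if port else None,
        "has_scheme": match.group("scheme") is not None,
        "has_path": match.group("path") is not None,
        "jobmanager_type": jobmanager_type,
    }

# The port 8443 is an illustration value only.
print(check_gt4_job(
    "https://example.cs.wisc.edu:8443/wsrf/services/ManagedJobFactoryService",
    "PBS"))
print(check_gt4_job("example.cs.wisc.edu", "Fork"))
```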

The globus_xml command can be used to add additional attributes to the XML-based RSL string that Condor writes to submit the job to GRAM. Here is an example of this command from a submit description file:

globus_xml = <project>Test_Project</project>
This example's attribute name for the additional RSL is project, and the value assigned is Test_Project.

File transfer occurs as expected for a Condor job (for the executable, input, and output). However, the underlying transfer mechanism requires access to a GridFTP server from the machine where the job is submitted. On this machine, there is no requirement for any Globus Toolkit 4.0 components, other than the GridFTP server for file transfer. Condor itself installs the necessary framework within the directory $(LIB)/lib/gt4. The machine where the job is submitted is also required to have Java 1.4.2 or a higher version installed. The configuration variable JAVA must identify the location of the installation. See section 3.3 for the complete description of the configuration variable JAVA.


5.3.2.5 Delimiting Arguments

The delimiting of arguments passed to a Condor-G job varies based on the grid_type of the job. For the gt2 and gt3 grid_types, there are two languages involved, leading to two sets of parsing rules that must work together. gt4 grid_type jobs are less complex with respect to the delimiting of arguments, as Condor encapsulates one set of parsing rules, thereby isolating the user from needing to understand or use them.

For all Condor-G jobs, the arguments to a job are kept in the job ClassAd attribute Args. This attribute is a string, and therefore enclosed within double quote marks. Condor uses space characters to delimit the listed arguments. Here is an arguments command from a submit description file with spaces to delimit the arguments.

arguments = 13 argument2 argument3
The Args ClassAd attribute becomes
Args = "13 argument2 argument3"
All further parsing of the arguments uses the Args attribute as a starting point. A query upon this attribute, such as to give the arguments, results in the 3 arguments
argv[1] = 13
argv[2] = argument2
argv[3] = argument3

Since the double quote mark character (") marks the beginning and end of a string (in the ClassAd language), an escaped double quote mark (\") is utilized to have a double quote mark within the string. For example, the submit description file arguments command

arguments = 13 argument2 \"string3\"
gives the ClassAd attribute
Args = "13 argument2 \"string3\""
Again, all further parsing of the arguments uses the Args attribute as a starting point. A query upon this attribute, such as to give the arguments, results in
argv[1] = 13
argv[2] = argument2
argv[3] = "string3"

For the gt2 and gt3 grid_types, the jobmanager on the remote resource must receive information about job arguments in RSL (Resource Specification Language). This language has its own way of delimiting arguments. Therefore, the arguments command in the submit description file (and the associated ClassAd attribute) must take both languages into account.

Delimiters in RSL are spaces, the single quote mark, and the double quote mark. In addition, the characters +, &, %, (, and ) have special meaning in RSL, so an argument that contains them must be enclosed in quote marks. A space character is likewise placed into an argument by enclosing the argument in one of the quote marks. As an example, the submit description file command

arguments = '%s' 'argument with spaces' '+%d'
results in the Condor-G job receiving the arguments
argv[1] = %s
argv[2] = argument with spaces
argv[3] = +%d

Should the arguments themselves contain the single quote character, an argument may be delimited with a double quote mark. Note that because the ClassAd attribute Args represents the information, the double quote marks must be escaped in the submit description file command. The submit description file command

arguments = \"don't\" \"mess with\" \"quoting rules\"
results in the RSL arguments
argv[1] = don't
argv[2] = mess with
argv[3] = quoting rules

And, if the job arguments have both single and double quotes, the appearance of a quote character twice in a row is converted (in RSL) to a single instance of the character and the literal continues until the next solo quote character. The submit description file command

arguments = 'don''t yell \"No!\"' '+%s'
results in the RSL arguments
argv[1] = don't yell "No!"
argv[2] = +%s
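
The two layers of quoting for gt2 and gt3 jobs can be combined mechanically: first quote each argument for RSL, then escape double quote marks for the Args ClassAd string. The Python sketch below applies exactly the rules described above. The function names rsl_quote and submit_arguments are hypothetical, and the sketch always uses single quote marks as the RSL delimiter; it does not cover characters such as the backslash, which this section does not discuss.

```python
def rsl_quote(argument):
    """Quote one argument for RSL using single quote marks.

    A single quote inside the argument is written twice, so that RSL
    reads it as one literal quote character.
    """
    return "'" + argument.replace("'", "''") + "'"

def submit_arguments(argv):
    """Build a gt2/gt3 arguments command from a list of argument strings."""
    rsl = " ".join(rsl_quote(arg) for arg in argv)
    # The value is stored in the Args ClassAd string, so every double
    # quote mark must be escaped with a backslash in the submit file.
    return "arguments = " + rsl.replace('"', '\\"')

print(submit_arguments(["%s", "argument with spaces", "+%d"]))
print(submit_arguments(["don't yell \"No!\"", "+%s"]))
```

Both calls reproduce arguments commands shown earlier in this section, which gives a quick check that the sketch matches the stated rules.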

For gt4 grid_type jobs, follow Condor's ClassAd language rules for delimiting arguments. Spaces delimit arguments, and the double quote mark character must be escaped to be included in an argument. Condor itself will modify the arguments to be expressed correctly in RSL. Note that the space character cannot be a part of an argument.


5.3.2.6 Credential Management with MyProxy

Condor-G can use MyProxy software to automatically renew GSI proxies for grid universe jobs with grid_types gt2, gt3, or gt4. MyProxy is a software component developed at NCSA and used widely throughout the grid community. For more information see: http://myproxy.ncsa.uiuc.edu/

Difficulties with proxy expiration occur in two cases. The first case is a long-running job, which does not complete before the proxy expires. The second case occurs when great numbers of jobs are submitted. Some of the jobs may not yet be started or not yet completed before the proxy expires. One proposed solution to these difficulties is to generate longer-lived proxies. This, however, presents a greater security problem. Remember that a GSI proxy is sent to the remote Globus resource. If a proxy falls into the hands of a malicious user at the remote site, the malicious user can impersonate the proxy owner for the duration of the proxy's lifetime. The longer the proxy's lifetime, the more time a malicious user has to misuse the owner's credentials. To minimize the window of opportunity of a malicious user, it is recommended that proxies have a short lifetime (on the order of several hours).

The MyProxy software generates proxies using credentials (a user certificate or a long-lived proxy) located on a secure MyProxy server. Condor-G talks to the MyProxy server, renewing a proxy as it is about to expire. A further advantage is that the user no longer has to store a GSI user certificate and private key on the machine where jobs are submitted. This may be particularly important if a shared Condor-G submit machine is used by several users.

In a typical case, the following steps occur:

  1. The user creates a long-lived credential on a secure MyProxy server, using the myproxy-init command. Each organization generally has its own MyProxy server.

  2. The user creates a short-lived proxy on a local submit machine, using grid-proxy-init or myproxy-get-delegation.

  3. The user submits a Condor-G job, specifying:
    MyProxy server name (host:port)
    MyProxy credential name (optional)
    MyProxy password

  4. When the short-lived proxy is about to expire, Condor-G talks to the MyProxy server to refresh the proxy.

Condor-G keeps track of the password to the MyProxy server for credential renewal. Although Condor-G tries to keep the password encrypted and secure, it is still possible (although highly unlikely) for the password to be intercepted from the Condor-G machine (more precisely, from the machine that the condor_schedd daemon that manages the grid universe jobs runs on, which may be distinct from the machine from which jobs are submitted). The following safeguard practices are recommended.

  1. Provide time limits for credentials on the MyProxy server. The default is one week, but you may want to make it shorter.

  2. Create several different MyProxy credentials, maybe as many as one for each submitted job. Each credential has a unique name, which is identified with the MyProxyCredentialName command in the submit description file.

  3. Use the following options when initializing the credential on the MyProxy server:

    myproxy-init -s <host> -x -r <cert subject> -k <cred name>
    

    The option -x -r <cert subject> essentially tells the MyProxy server to require two forms of authentication:

    1. a password (initially set with myproxy-init)
    2. an existing proxy (the proxy to be renewed)

  4. A submit description file may include the password. An example contains commands of the form:
    executable      = /usr/bin/my-executable
    universe        = grid
    grid_type       = gt3
    globusscheduler = condor-unsup-7
    MyProxyHost     = example.cs.wisc.edu:7512
    MyProxyServerDN = /O=doesciencegrid.org/OU=People/CN=Jane Doe 25900
    MyProxyPassword = password
    MyProxyCredentialName = my_executable_run
    queue
    
    Note that placing the password within the submit description file is only as secure as the file system permissions that protect the file. This may still be better than option 5.

  5. Use the -p option to condor_submit. The submit command appears as
    condor_submit -p mypassword /home/user/myjob.submit
    
    The argument list for condor_submit defaults to being publicly available. An attacker with a login on the local machine could write a simple shell script to watch for the password.
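If the password is kept in the submit description file as in item 4, the file can at least be created so that only its owner can read it. The following Python sketch shows one way to do that on a POSIX system. The function name write_private_submit_file and the file name are illustrative assumptions; nothing here is part of Condor.

```python
import os
import stat


def write_private_submit_file(path, lines):
    """Create `path` readable and writable by its owner only, then write `lines`."""
    # O_EXCL refuses to reuse an existing file, whose permissions might be looser.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w") as f:
        f.write("\n".join(lines) + "\n")
    # Set the mode explicitly so the result does not depend on the process umask.
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
    return path
```

This narrows who can read the password on the submit machine, but it does not protect against anyone who can already read the owner's files.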

Currently, Condor-G calls the myproxy-get-delegation command-line tool, passing it the necessary arguments. The location of the myproxy-get-delegation executable is determined by the configuration variable MYPROXY_GET_DELEGATION in the configuration file on the Condor-G machine. This variable is read by the condor_gridmanager. If myproxy-get-delegation is a dynamically-linked executable (verify this with ldd myproxy-get-delegation), point MYPROXY_GET_DELEGATION to a wrapper shell script that sets LD_LIBRARY_PATH to the correct MyProxy library or Globus library directory and then calls myproxy-get-delegation. Here is an example of such a wrapper script:

#!/bin/sh
export LD_LIBRARY_PATH=/opt/myglobus/lib
exec /opt/myglobus/bin/myproxy-get-delegation "$@"


5.3.2.7 The Grid Monitor

Condor's Grid Monitor is designed to improve the scalability of machines running Globus Toolkit 2 gatekeepers. Normally, this gatekeeper runs a jobmanager process for every job submitted to the gatekeeper. This includes both currently running jobs and jobs waiting in the queue. Each jobmanager runs a Perl script at frequent intervals (every 10 seconds) to poll the state of its job in the local batch system. For example, with 400 jobs submitted to a gatekeeper, there will be 400 jobmanagers running, each regularly starting a Perl script. When a large number of jobs have been submitted to a single gatekeeper, this frequent polling can heavily load the gatekeeper. When the gatekeeper is under heavy load, the system can become non-responsive, and a variety of problems can occur.
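To get a feel for the load described above, the following back-of-the-envelope Python sketch computes how often a polling script is started, assuming the polls are spread evenly over the interval. The function name is illustrative only.

```python
def poll_starts_per_second(num_jobs, poll_interval_seconds=10):
    """Average number of polling scripts started per second on one gatekeeper."""
    return num_jobs / poll_interval_seconds


# 400 jobs, each polled every 10 seconds
print(poll_starts_per_second(400))  # 40.0
```

With the figures from the text, the gatekeeper starts about 40 Perl scripts every second just to monitor job state.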

Condor's Grid Monitor temporarily replaces these jobmanagers. It is named the Grid Monitor, because it replaces the monitoring (polling) duties previously done by jobmanagers. When the Grid Monitor runs, Condor attempts to start a single process to poll all of a user's jobs at a given gatekeeper. While a job is waiting in the queue, but not yet running, Condor shuts down the associated jobmanager, and instead relies on the Grid Monitor to report changes in status. The jobmanager started to add the job to the remote batch system queue is shut down. The jobmanager restarts when the job begins running.

By default, standard output and standard error are streamed back to the submitting machine while the job is running. Streamed I/O requires the jobmanager. As a result, the Grid Monitor cannot replace the jobmanager for jobs that use streaming. If possible, disable streaming for all jobs; this is accomplished by placing the following lines in each job's submit description file:

stream_output = False
stream_error  = False

The Grid Monitor requires that the gatekeeper support the fork jobmanager with the name jobmanager-fork. If the gatekeeper does not support the fork jobmanager, the Grid Monitor will not be used for that site. The condor_gridmanager log file reports any problems using the Grid Monitor.

To enable the Grid Monitor, two variables are added to the Condor configuration file. The configuration macro GRID_MONITOR is already present in current distributions of Condor, but it may be missing from earlier versions of Condor. Also set the configuration macro ENABLE_GRID_MONITOR to True.

GRID_MONITOR        = $(SBIN)/grid_monitor.sh
ENABLE_GRID_MONITOR = TRUE


5.3.2.8 Limitations of Condor-G

Submitting jobs to run under the grid universe has not yet been perfected. The following is a list of known limitations:

  1. No checkpoints.
  2. No job exit codes. Job exit codes are not available (when using grid_types gt2 and gt3).
  3. Limited platform availability. Windows support is not yet available.

5.3.3 Removing Grid Universe jobs

When you remove a job with condor_rm, you may find that the job enters the "X" state for a very long time. This is normal: Condor is attempting to communicate with the remote scheduling system and ensure that the job has been properly cleaned up. If it takes too long or (in rare circumstances) is never removed, you can force the job to leave the job queue by using the -forcex option to condor_rm. This will forcibly remove jobs that are in the X state without attempting to finish any cleanup at the remote scheduler.


5.3.4 Matchmaking in the Grid Universe

In its simplest usage, the grid universe allows users to specify the single grid site they wish to submit their job to. Often this is sufficient: perhaps a user knows exactly which grid site they wish to use, or a higher-level resource broker (such as the European Data Grid's resource broker) has decided which grid site should be used. But when users have a variety of sites to choose from and there is no other resource broker to make the decision, the grid universe can use matchmaking to decide which grid site a job should run on.

Please note that the grid universe's matchmaking ability is relatively new. Work is being done to improve it and make it easier to use. For now, please expect some rough edges.

The grid universe uses the same matchmaking mechanism that the other Condor universes use: the condor_collector and condor_negotiator daemons, which are described in Section 3.1.2.

There are two differences in how matchmaking is done in the grid universe, versus the other universes. First, the grid sites that are available must be advertised, so that they are known and considered during the matchmaking process. This is accomplished by writing ClassAd attributes and using condor_advertise to place the attributes into the ClassAd used in matchmaking. The second difference is in the submit description file. This file needs to specify requirements that describe what type of grid site can be used, instead of identifying a specific grid site.

In the following sections, examples are given for a GT2 grid-type job and resource. A couple of minor changes may be required for other grid-types. Primarily, an attribute other than globusscheduler may have to be used.

5.3.4.1 Advertising grid sites to Condor

Each grid site that is available for matching purposes needs to be advertised to the condor_collector. Normally in Condor this is done with the condor_startd daemon, and you do not normally need to be aware of the contents of this advertisement. Currently, there is no equivalent to the condor_startd daemon for advertising grid sites, so you need to have a deeper understanding.

To properly advertise a grid site, a ClassAd needs to be sent periodically to the condor_collector. A ClassAd is a list of attributes and values that describe a job, a machine, or a grid site. ClassAds are briefly described in Section 2.3 and some of the common attributes of machine ClassAds are described in Section 2.5.2.

When you advertise a grid site, it looks very similar to a ClassAd for a machine. In fact, the condor_collector will believe it is a machine, but with a different set of attributes.

To advertise a grid site, you first need to describe the site in a file. Here is a sample ClassAd that describes a grid site:

# This is a comment
MyType                = "Machine"
TargetType            = "Job"
Name                  = "Example1_Gatekeeper"
Machine               = "Example1_Gatekeeper"
gatekeeper_url        = "grid.example.com/jobmanager"
Requirements          = (CurMatches < 10) && (TARGET.JobUniverse == 9) && (TARGET.JobGridType =?= "gt2")
Rank                  = 0.000000
CurrentRank           = 0.000000
WantAdRevaluate       = True
UpdateSequenceNumber  = 4
CurMatches            = 0

Let's look at each line:

# This is a comment

Your file can have comments that begin with the hash mark (#).

MyType                = "Machine"

Your grid site is pretending to be a single machine, for the purpose of matchmaking. MyType is an attribute that the condor_negotiator daemon will expect to be a string. Strings must be surrounded by double-quote marks, as in this example. You may have surprising, unintuitive errors if they are not quoted. You will always want MyType to be "Machine".

TargetType            = "Job"

This is an attribute that says the grid site (machine) wants to be matched with a job. Leave this as it is.

Name                  = "Example1_Gatekeeper"

You will want a unique name for each grid site. Any name is fine, as long as it is quoted.

Machine               = "Example1_Gatekeeper"

Machine is just like Name. Normally in Condor, the Machine and Name may be slightly different if you have multiple CPUs. For grid matchmaking, they should probably be the same.

gatekeeper_url        = "grid.example.com/jobmanager"

This is the Globus gatekeeper contact string for your grid site. It is probably a machine name followed by a slash followed by the name of the jobmanager. If you have different job managers, you can only specify one per ClassAd.

UpdateSequenceNumber  = 4

UpdateSequenceNumber is a positive number that must increase each time you advertise a grid site. Normally you advertise your grid site every five minutes. The condor_collector daemon will discard a grid site's ClassAd after 15 minutes if there have been no updates. A good number to set this to is the current time in seconds (the epoch, as given by the C time function call), but if you are worried about your clock running backward, you can set it to whatever you like. If ClassAds are received with a sequence number older than the last ClassAd, they are ignored.

CurMatches            = 0

Condor increments this number each time it makes a match against this grid site. Unlike a normal machine ClassAd that can only be matched against once, grid site advertisements can be matched against many times.

You will probably want to set this number to be the number of grid jobs that you have running on your site, and keep it updated each time you submit a new ClassAd. If you do not specify CurMatches, Condor will assume it is 0.

Requirements          = (CurMatches < 10) && (TARGET.JobUniverse == 9) && (TARGET.JobGridType =?= "gt2")

These are the requirements that the grid site insists must be true before it will accept a job. These could refer to features of the job's ClassAd. In this case, we will take any grid universe job that is of grid-type gt2, as long as we have fewer than 10 matches currently. This will ensure that Condor will only run 10 jobs at your site, assuming that you keep CurMatches up to date when jobs finish. Of course, you can edit this statement to have different requirements. For example, if you want to accept all jobs, you can have Requirements = True.

Rank                  = 0.000000
CurrentRank           = 0.000000

This is a numerical ranking that will be assigned to a job. Right now it is not used, but should be set to 0.

WantAdRevaluate       = True

The WantAdRevaluate attribute distinguishes grid site ClassAds from normal machine ClassAds and allows multiple matches to be made against a single site. It should be in your ad and should be true. Note that True is not in quotes, and it should not be.
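To make the sample Requirements expression concrete, the following Python sketch models it with ordinary dictionaries standing in for the site and job ClassAds. It is only an illustration of the logic, not a ClassAd evaluator, and the function name site_accepts is an assumption.

```python
def site_accepts(site_ad, job_ad):
    """Plain-Python model of the sample Requirements expression:

    (CurMatches < 10) && (TARGET.JobUniverse == 9) && (TARGET.JobGridType =?= "gt2")
    """
    # CurMatches is taken to be 0 when the site ad does not specify it.
    cur_matches = site_ad.get("CurMatches", 0)
    # dict.get returns None for a missing job attribute, so the comparison
    # is False and the job is not accepted.
    return (
        cur_matches < 10
        and job_ad.get("JobUniverse") == 9
        and job_ad.get("JobGridType") == "gt2"
    )
```

For example, a gt2 grid universe job is accepted while CurMatches is 0 through 9, and rejected once CurMatches reaches 10.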

You can add other attributes to your ClassAd, to make it easy for a job to decide which grid site it wants to use. For instance, if you have pre-installed the Bamboozle software environment on your grid site, you could advertise HaveBamboozle = True and BamboozleVersion = 10. Jobs can require a grid site that has Bamboozle installed by extending their requirements with HaveBamboozle == True. (Note the double equal sign in the requirements.)

As an aside, we recommend that jobs that need specific applications bring those applications along, instead of relying on having them pre-installed at a grid site. You will have more reliable execution if you do.

Once you have a file that describes your site, you need to send it to the condor_collector daemon. For this, use condor_advertise. We recommend that you write a script to create the file containing the ClassAd, then run the script every five minutes with cron. The script should probably update the CurMatches variable, if you want to restrict the number of grid jobs that can be submitted at one time.

For condor_advertise, specify UPDATE_STARTD_AD for the update command. For example, if your ClassAd is specified in a file named grid-ad you would do:

    condor_advertise UPDATE_STARTD_AD grid-ad

condor_advertise usually uses UDP to transmit your ClassAd. In wide-area networks, this may be insufficient. You can use TCP by specifying the -tcp option.
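The script recommended above might look like the following Python sketch. It builds the ClassAd text with the current epoch time as UpdateSequenceNumber and assembles the condor_advertise command line. The function names, the template, and the way the current number of matches is obtained are all assumptions made for illustration; how you count running grid jobs depends on your site.

```python
import time

# Same attributes as the sample ClassAd, with the varying values left open.
AD_TEMPLATE = """\
MyType                = "Machine"
TargetType            = "Job"
Name                  = "{name}"
Machine               = "{name}"
gatekeeper_url        = "{gatekeeper_url}"
Requirements          = (CurMatches < 10) && (TARGET.JobUniverse == 9) && (TARGET.JobGridType =?= "gt2")
Rank                  = 0.000000
CurrentRank           = 0.000000
WantAdRevaluate       = True
UpdateSequenceNumber  = {sequence}
CurMatches            = {cur_matches}
"""


def build_grid_ad(name, gatekeeper_url, cur_matches, now=None):
    """Return the ClassAd text for one grid site."""
    # The epoch time increases between runs, as UpdateSequenceNumber requires.
    sequence = int(time.time() if now is None else now)
    return AD_TEMPLATE.format(
        name=name,
        gatekeeper_url=gatekeeper_url,
        sequence=sequence,
        cur_matches=cur_matches,
    )


def advertise_command(ad_path, use_tcp=False):
    """Return the condor_advertise argument list for the ad file."""
    cmd = ["condor_advertise"]
    if use_tcp:
        cmd.append("-tcp")
    return cmd + ["UPDATE_STARTD_AD", ad_path]
```

A cron job run every five minutes could write the result of build_grid_ad to the file grid-ad and then pass advertise_command("grid-ad") to subprocess.run.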

5.3.4.2 Submitting grid universe jobs that use matchmaking

Submitting a grid universe job that requires matchmaking is straightforward. Instead of specifying a particular scheduler with globusscheduler like this:

globusscheduler = grid.example.com/jobmanager

you instead specify requirements and tell Condor where to find the gatekeeper URL in the grid site ClassAd:

globusscheduler = $$(gatekeeper_url)
requirements    = TARGET.gatekeeper_url =!= UNDEFINED

This will allow the job to run at any grid site, and will extract the gatekeeper_url attribute from the matched ClassAd. There is no magic meaning behind gatekeeper_url: you could use GatekeeperContactString if you desired, as long as it is the same in both the job description and the grid site ClassAd.
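The effect of the $$() reference can be pictured with the following Python sketch, which substitutes attribute values from a matched grid site ad into a submit file value. This is a simplified model written for illustration, not Condor's implementation, and the function name expand_match_macros is an assumption.

```python
import re


def expand_match_macros(value, matched_ad):
    """Replace each $$(attr) in `value` with that attribute from the matched ad."""
    def lookup(match):
        # A missing attribute raises KeyError in this sketch.
        return str(matched_ad[match.group(1)])

    return re.sub(r"\$\$\((\w+)\)", lookup, value)


site_ad = {"gatekeeper_url": "grid.example.com/jobmanager"}
print(expand_match_macros("$$(gatekeeper_url)", site_ad))
```

With the sample grid site ad, globusscheduler ends up with the value grid.example.com/jobmanager.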

The requirements specified here are a bit simple. Perhaps you only want to run at a site that has the Bamboozle software installed, and the sites that have it installed specify "HaveBamboozle = True", as described above. A complete job description may look like this:

universe        = grid
grid_type       = gt2
executable      = analyze_bamboozle_data
output          = aaa.$(Cluster).out
error           = aaa.$(Cluster).err
log             = aaa.log
globusscheduler = $$(gatekeeper_url)
requirements    = (HaveBamboozle == True) && (TARGET.gatekeeper_url =!= UNDEFINED)
leave_in_queue  = jobstatus == 4
queue

5.3.4.3 Advanced usage

What if a job fails to run at a grid site due to an error? It will be returned to the queue, and Condor will attempt to match it and re-run it at another site. Condor isn't very clever about avoiding sites that may be bad, but you can give it some assistance. Let's say that you want to avoid running at the last grid site you ran at. You could add this to your job description:

match_list_length = 1
Rank              = TARGET.Name != LastMatchName0

This will prefer to run at a grid site that was not just tried, but it will allow the job to be run there if there is no other option.
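The difference between a preference and a requirement can be seen in the following Python sketch, which picks the candidate with the highest rank. It assumes that a true Rank expression counts as 1 and a false one as 0. The function name and site names are illustrative only.

```python
def pick_site(candidate_names, last_match_name):
    """Pick the candidate with the highest Rank, modeling
    Rank = TARGET.Name != LastMatchName0 (true counts as 1, false as 0)."""
    def rank(name):
        return 1 if name != last_match_name else 0

    # max() keeps the first of equally ranked candidates.
    return max(candidate_names, key=rank)


print(pick_site(["siteA", "siteB"], "siteA"))  # siteB
print(pick_site(["siteA"], "siteA"))           # siteA
```

When another site is available it wins, but the site that was just tried is still chosen when it is the only candidate.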

When you specify match_list_length, you provide an integer N, and Condor will keep track of the last N matches. The most recent match will be LastMatchName0, the next most recent will be LastMatchName1, and so on. (See the condor_submit manual page for more details.) The Rank expression allows you to specify a numerical ranking for different matches. When combined with match_list_length, you can prefer to avoid sites that you have already run at.

In addition, condor_submit has two options to help you control grid universe job resubmissions and rematching. See globus_resubmit and globus_rematch in the condor_submit manual page. These options are independent of match_list_length.

There are some new attributes that will be added to the Job ClassAd, and may be useful to you when you write your rank, requirements, globus_resubmit or globus_rematch option. Please refer to Section 2.5.2 for descriptions of these attributes, such as NumSystemHolds and NumGlobusSubmits, which are used in the examples that follow.

The following example of a command within the submit description file releases jobs 5 minutes after being held, increasing the time between releases by 5 minutes each time. It will continue to retry up to 4 times per Globus submission, plus 4. The plus 4 is necessary in case the job goes on hold before being submitted to Globus, although this is unlikely.

periodic_release = ( NumSystemHolds <= ((NumGlobusSubmits * 4) + 4) ) \
   && (NumGlobusSubmits < 4) && \
   ( HoldReason != "via condor_hold (by user $ENV(USER))" ) && \
   ((CurrentTime - EnteredCurrentStatus) > ( NumSystemHolds *60*5 ))
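To see when this expression becomes true, the following Python sketch mirrors it term by term. The parameter seconds_held stands for CurrentTime - EnteredCurrentStatus and user stands for the expanded $ENV(USER). The function name and these parameter names are assumptions made for illustration.

```python
def should_release(num_system_holds, num_globus_submits, seconds_held,
                   hold_reason, user):
    """Plain-Python mirror of the periodic_release expression."""
    user_hold = "via condor_hold (by user %s)" % user
    return (
        # At most 4 system holds per Globus submission, plus 4.
        num_system_holds <= num_globus_submits * 4 + 4
        # Fewer than 4 Globus submissions so far.
        and num_globus_submits < 4
        # Never release a job that the user put on hold.
        and hold_reason != user_hold
        # Wait 5 more minutes for each system hold so far.
        and seconds_held > num_system_holds * 60 * 5
    )
```

For example, a job on its first system hold is released once it has been held for more than 300 seconds, and on its second hold after more than 600 seconds.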

The following example forces Globus resubmission after a job has been held 4 times per Globus submission.

globus_resubmit = NumSystemHolds == (NumGlobusSubmits + 1) * 4
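As a quick check of the arithmetic, the following Python sketch lists the value of NumSystemHolds at which the expression above becomes true for the first few values of NumGlobusSubmits. The function name is illustrative only.

```python
def resubmit_hold_count(num_globus_submits):
    """NumSystemHolds value at which the globus_resubmit expression is true."""
    return (num_globus_submits + 1) * 4


for submits in range(4):
    print(submits, resubmit_hold_count(submits))
```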

If you are concerned about unknown or malicious grid sites reporting to your condor_collector, you should use Condor's security options, documented in Section 3.7.


condor-admin@cs.wisc.edu