Condor-G is a "window to the grid." The function of Condor-G becomes clear with a brief overview of the software that forms a Condor pool. For this discussion, the software of a Condor pool is divided into two parts. The first part does job management. It keeps track of a user's jobs. You can ask the job management part of Condor to show you the job queue, to submit new jobs to the system, to put jobs on hold, and to request information about jobs that have completed. The other part of the Condor software does resource management. It keeps track of which machines are available to run jobs, how the available machines should be utilized given all the users who want to run jobs on them, and when a machine is no longer available. Condor-G works with grid resources, allowing users to effectively submit jobs, manage jobs, and have jobs execute on widely distributed machines.
A machine with the job management part installed is called a submit machine. A machine with the resource management part installed is called an execute machine. Each machine may have one part or both. Condor-G is the job management part of Condor. Condor-G lets you submit jobs into a queue, have a log detailing the life cycle of your jobs, manage your input and output files, along with everything else you expect from a job queuing system.
Condor uses Globus to provide underlying software needed to utilize grid resources, such as authentication, remote program execution and data transfer. Condor's capabilities when executing jobs on Globus resources have significantly increased. The same Condor tools that access local resources are now able to use the Globus protocols to access resources at multiple sites.
Condor-G is a program that manages both a queue of jobs and the resources from one or more sites where those jobs can execute. It communicates with these resources and transfers files to and from these resources using Globus mechanisms. (In particular, Condor-G uses the GRAM protocol for job submission, and it runs a local GASS server for file transfers.)
It may appear that Condor-G is a simple replacement for the Globus toolkit's globusrun command. However, Condor-G does much more. It allows you to submit many jobs at once and then to monitor those jobs with a convenient interface, receive notification when jobs complete or fail, and maintain your Globus credentials which may expire while a job is running. On top of this, Condor-G is a fault-tolerant system; if your machine crashes, you can still perform all of these functions when your machine returns to life.
The Globus software provides a well-defined set of protocols that allow authentication, data transfer, and remote job submission.
Authentication is a mechanism by which an identity is verified. Once an identity has been authenticated, authorization to use a resource is still required. Authorization is a policy that determines who is allowed to do what.
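Condor-G allows the user to treat the Grid as a local resource, and the same command-line tools perform basic job management, such as submitting jobs, showing the job queue, placing jobs on hold, and requesting information about completed jobs.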
These are features that Condor has provided for many years. Condor-G extends this to the grid, providing resource management while still providing fault tolerance and exactly-once execution semantics.
Figure 5.1 shows how Condor-G interacts with Globus protocols. Condor-G contains a GASS server, used to transfer the executable, stdin, stdout, and stderr to and from the remote job execution site. Condor-G uses the GRAM protocol to contact the remote Globus Gatekeeper and request that a new job manager be started. GRAM is also used to monitor the job's progress. Condor-G detects and handles failures such as a crash of the remote Globus resource.
There are two steps necessary before a globus universe job can be submitted:
GRIDMANAGER = $(SBIN)/condor_gridmanager
GAHP = $(SBIN)/gahp_server
MAX_GRIDMANAGER_LOG = 64000
GRIDMANAGER_DEBUG = D_COMMAND
GRIDMANAGER_LOG = $(LOG)/GridLogs/GridmanagerLog.$(USERNAME)
GLIDEIN_SERVER_NAME = gridftp.cs.wisc.edu
GLIDEIN_SERVER_DIR = /p/condor/public/binaries/glidein
If Condor-G is installed as root, the file set by the configuration variable GRIDMANAGER_LOG must have world-write permission. All of the parent directories for this file must also have world-execute permission. The Gridmanager runs as the user who submitted the job, so the Gridmanager may not be able to write to the ordinary $(LOG) directory. Use of the definition of GRIDMANAGER_LOG shown above will likely require the creation of the directory $(LOG)/GridLogs. Permissions on this directory should be set by running chmod using the value 1777.
Another option is to locate the Gridmanager log files somewhere else, like so:
GRIDMANAGER_LOG = /tmp/GridmanagerLog.$(USERNAME)
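As a rough illustration of the directory setup described above, the following Python sketch creates a GridLogs directory and sets mode 1777 on it, which has the same effect as mkdir followed by chmod 1777. The function name make_gridlogs_dir and its log_dir argument are hypothetical; log_dir stands in for whatever $(LOG) expands to at your site.

```python
import os
import stat


def make_gridlogs_dir(log_dir):
    """Create <log_dir>/GridLogs with mode 1777 and return its path.

    log_dir is an assumption: pass the directory that $(LOG) refers to
    in your Condor configuration.
    """
    path = os.path.join(log_dir, "GridLogs")
    os.makedirs(path, exist_ok=True)
    # 1777 = world readable, writable, and searchable, plus the sticky bit,
    # so each user's Gridmanager can create its own log file there.
    os.chmod(path, 0o1777)
    return path


def mode_of(path):
    """Return the permission bits of path, for checking the result."""
    return stat.S_IMODE(os.stat(path).st_mode)
```

The chmod is applied explicitly after the directory is created because the mode passed to a directory-creation call is normally filtered by the process umask.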
If you make any changes to the configuration file while Condor is running, you will need to issue a condor_reconfig command.
See section 3.3 for more information about configuration file entries. See section 3.3.18 for information about configuration file entries specific to the Condor-G gridmanager.
Condor-G periodically checks for an updated proxy at an interval given by the configuration variable GRIDMANAGER_CHECKPROXY_INTERVAL. The value is defined in terms of seconds. For example, if you create a 12-hour proxy, and then 6 hours later re-run grid-proxy-init, Condor-G will check the proxy within this time interval, and use the new proxy it finds there. The default interval is 10 minutes.
Condor-G also knows when the proxy of each job will expire. If the proxy has not been refreshed by the time GRIDMANAGER_MINIMUM_PROXY_TIME seconds remain before it expires, the Condor-G grid manager daemon exits. Since the grid manager daemon keeps track of all jobs associated with a proxy, its tasks (such as authentication, file transfer, and job log maintenance) stop once it exits. So, if GRIDMANAGER_MINIMUM_PROXY_TIME is 180 and the proxy is 3 minutes away from expiring, Condor-G attempts to shut down safely, instead of simply losing contact with the remote job because Condor-G is unable to authenticate. The default setting is 3 minutes (180 seconds).
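The arithmetic behind the two proxy settings can be worked through with a short Python sketch. It only restates the timing rules given above, using the stated defaults of 600 seconds (10 minutes) and 180 seconds; the function names and the sample timestamps are hypothetical, and nothing here queries Condor-G itself.

```python
from datetime import datetime, timedelta


def latest_pickup_time(refresh_time, checkproxy_interval=600):
    """Latest time by which a proxy refreshed at refresh_time is noticed.

    checkproxy_interval mirrors GRIDMANAGER_CHECKPROXY_INTERVAL (seconds).
    """
    return refresh_time + timedelta(seconds=checkproxy_interval)


def shutdown_time(proxy_expiry, minimum_proxy_time=180):
    """Time at which the grid manager exits if the proxy is not refreshed.

    minimum_proxy_time mirrors GRIDMANAGER_MINIMUM_PROXY_TIME (seconds).
    """
    return proxy_expiry - timedelta(seconds=minimum_proxy_time)


if __name__ == "__main__":
    created = datetime(2002, 3, 26, 8, 0, 0)   # hypothetical creation time
    expiry = created + timedelta(hours=12)      # a 12-hour proxy
    refreshed = created + timedelta(hours=6)    # grid-proxy-init re-run
    print("refresh noticed by:", latest_pickup_time(refreshed))
    print("without a refresh, exit at:", shutdown_time(expiry))
```

With these sample values, a proxy created at 08:00 and refreshed at 14:00 is picked up no later than 14:10, and an unrefreshed proxy would cause the grid manager to exit at 19:57.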
This section contains what users need to know to run and manage jobs under the globus universe.
Under Condor, successful job submission to the Globus universe requires credentials. An X.509 certificate is used to create a proxy, and an account, authorization, or allocation to use a grid resource is required. For more information on proxies and certificates, please consult the Alliance PKI pages at
http://archive.ncsa.uiuc.edu/SCD/Alliance/GridSecurity/
Before submitting a job to Condor under the Globus universe, make sure you have your Grid credentials and have used grid-proxy-init to create a proxy.
A job is submitted for execution to Condor using the condor_submit command. condor_submit takes as an argument the name of a file called a submit description file. The following sample submit description file runs a job on the Origin2000 at NCSA.
executable = test
globusscheduler = modi4.ncsa.uiuc.edu/jobmanager
universe = globus
output = test.out
log = test.log
queue
The executable for this example is transferred from the local machine to the remote machine. By default, Condor transfers the executable, as well as any files specified by the input command. Note that this executable must be compiled for the platform on which it is intended to run.
The globusscheduler command is dependent on the scheduling software available on the remote resource. This required command will change based on the Grid resource intended for execution of the job. A jobmanager is the Globus service that is spawned at a remote site to submit, keep track of, and manage Grid I/O for jobs running on the local batch system there. There is a specific jobmanager for each type of batch system supported by Globus (examples are Condor, LSF, and PBS).
All Condor-G jobs (intended for execution on Globus-controlled resources) are submitted to the globus universe. The universe = globus command is required in the submit description file.
No input file is specified for this example job. Any output (file specified by the output command) or error (file specified by the error command) is transferred from the remote machine to the local machine as it is produced. This implies that these files may be incomplete in the case where the executable does not finish running on the remote resource. The job log file is maintained on the local machine.
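Since both universe = globus and globusscheduler are required, a submit description file can be checked for them before it is handed to Condor. The Python sketch below is a simplified illustration only: it is not the parser used by condor_submit, it ignores any line without an equals sign (such as queue), and the function names are hypothetical.

```python
def parse_submit_description(text):
    """Collect 'name = value' lines into a dict keyed by lowercase name.

    A deliberately simple reader; lines without '=' (for example 'queue')
    and lines starting with '#' are skipped.
    """
    commands = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        commands[name.strip().lower()] = value.strip()
    return commands


def missing_globus_commands(text):
    """Return the required globus universe commands that text lacks."""
    commands = parse_submit_description(text)
    missing = []
    if commands.get("universe", "").lower() != "globus":
        missing.append("universe = globus")
    if not commands.get("globusscheduler"):
        missing.append("globusscheduler")
    return missing


SAMPLE = """\
executable = test
globusscheduler = modi4.ncsa.uiuc.edu/jobmanager
universe = globus
output = test.out
log = test.log
queue
"""
```

Calling missing_globus_commands(SAMPLE) on the sample file from this section yields an empty list; removing either required line makes the corresponding entry appear in the result.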
To submit this job to Condor-G for execution on the remote machine, use
condor_submit test.submit
where test.submit is the name of the submit description file.
Example output from condor_q for this submission looks like:
% condor_q

-- Submitter: wireless48.cs.wisc.edu : <128.105.48.148:33012> : wireless48.cs.wi
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
   7.0   epaulson        3/26 14:08   0+00:00:00 I  0   0.0  test

1 jobs; 1 idle, 0 running, 0 held
After a short time, Globus accepts the job. Running condor_q again now results in
% condor_q

-- Submitter: wireless48.cs.wisc.edu : <128.105.48.148:33012> : wireless48.cs.wi
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
   7.0   epaulson        3/26 14:08   0+00:01:15 R  0   0.0  test

1 jobs; 0 idle, 1 running, 0 held
Then, very shortly after that, the queue will be empty again, because the job has finished:
% condor_q

-- Submitter: wireless48.cs.wisc.edu : <128.105.48.148:33012> : wireless48.cs.wi
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD

0 jobs; 0 idle, 0 running, 0 held
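A script that waits for a job to leave the queue could look at the summary line that ends each condor_q listing above. The following Python sketch assumes the summary has exactly the form shown in these examples ("N jobs; N idle, N running, N held"); the function names are hypothetical, and the text is expected to have been captured from condor_q by some other means.

```python
import re

# Matches the final summary line of the condor_q output shown above.
SUMMARY = re.compile(r"(\d+) jobs; (\d+) idle, (\d+) running, (\d+) held")


def parse_summary(condor_q_output):
    """Return the job counts from condor_q output, or None if not found."""
    match = SUMMARY.search(condor_q_output)
    if match is None:
        return None
    total, idle, running, held = (int(n) for n in match.groups())
    return {"jobs": total, "idle": idle, "running": running, "held": held}


def queue_is_empty(condor_q_output):
    """True only when a summary line was found and it reports zero jobs."""
    summary = parse_summary(condor_q_output)
    return summary is not None and summary["jobs"] == 0
```

Returning None when no summary line is present keeps an unexpected or truncated listing from being mistaken for an empty queue.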
A second example of a submit description file runs the Unix ls program on a different Globus resource.
executable = /bin/ls
Transfer_Executable = false
globusscheduler = vulture.cs.wisc.edu/jobmanager
universe = globus
output = ls-test.out
log = ls-test.log
queue
In this example, the executable (the binary) has been pre-staged. The executable is on the remote machine, and it is not to be transferred before execution. Note that the required globusscheduler and universe commands are present. The command
Transfer_Executable = FALSE
within the submit description file identifies the executable as being pre-staged. In this case, the executable command gives the path to the executable on the remote machine.
A third example submits a Perl script to be run as a Condor job. The example both sets environment variables for the job (in the submit description file) and lists them (in the Perl script). Save the following Perl script with the name env-test.pl, to be used as a Condor job executable.
#!/usr/bin/env perl
# Print every environment variable visible to the job, sorted by name.
foreach my $key (sort keys(%ENV))
{
    print "$key = $ENV{$key}\n";
}
exit 0;
Run the Unix command
chmod 755 env-test.pl
to make the Perl script executable.
Now create the following submit description file (replace biron.cs.wisc.edu/jobmanager with a resource you are authorized to use):
executable = env-test.pl
globusscheduler = biron.cs.wisc.edu/jobmanager
universe = globus
environment = foo=bar; zot=qux
output = env-test.out
log = env-test.log
queue
When the job has completed, the output file env-test.out should contain something like this:
GLOBUS_GRAM_JOB_CONTACT = https://biron.cs.wisc.edu:36213/30905/1020633947/
GLOBUS_GRAM_MYJOB_CONTACT = URLx-nexus://biron.cs.wisc.edu:36214
GLOBUS_LOCATION = /usr/local/globus
GLOBUS_REMOTE_IO_URL = /home/epaulson/.globus/.gass_cache/globus_gass_cache_1020633948
HOME = /home/epaulson
LANG = en_US
LOGNAME = epaulson
X509_USER_PROXY = /home/epaulson/.globus/.gass_cache/globus_gass_cache_1020633951
foo = bar
zot = qux
Of particular interest is the GLOBUS_REMOTE_IO_URL environment variable. Condor-G automatically starts up a GASS remote I/O server on the submitting machine. Because of the potential for either side of the connection to fail, the URL for the server cannot be passed directly to the job. Instead, it is put into a file, and the GLOBUS_REMOTE_IO_URL environment variable points to this file. Remote jobs can read this file and use the URL it contains to access the remote GASS server running inside Condor-G. If the location of the GASS server changes (for example, if Condor-G restarts), Condor-G will contact the Globus gatekeeper and update this file on the machine where the job is running. It is therefore important that all accesses to the remote GASS server check this file for the latest location.
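The advice to check the file before every access can be sketched in Python as follows. This is an illustration under assumptions, not part of Condor-G: the function name current_gass_url is hypothetical, and the only behavior taken from the text is that GLOBUS_REMOTE_IO_URL names a file whose contents are the current URL of the GASS server.

```python
import os


def current_gass_url(environ=None):
    """Return the GASS server URL stored in the file named by
    GLOBUS_REMOTE_IO_URL.

    The file is read on every call rather than cached, so a URL that
    Condor-G has updated (for example after a restart) is picked up by
    the next access.
    """
    environ = os.environ if environ is None else environ
    path = environ["GLOBUS_REMOTE_IO_URL"]
    with open(path) as url_file:
        return url_file.read().strip()
```

A job would call current_gass_url() immediately before each transfer instead of storing the result once at startup.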
The following example is a Perl script that uses the GASS server in Condor-G to copy input files to the execute machine. In this example, the remote job copies the file /etc/hosts from the submit machine and prints its contents.
#!/usr/bin/env perl
use strict;
use warnings;
use FileHandle;
use Cwd;

STDOUT->autoflush();

# GLOBUS_REMOTE_IO_URL names a file holding the current URL of the GASS server.
my $gassUrl = `cat $ENV{GLOBUS_REMOTE_IO_URL}`;
chomp $gassUrl;

$ENV{LD_LIBRARY_PATH} = $ENV{GLOBUS_LOCATION} . "/lib";
my $urlCopy = $ENV{GLOBUS_LOCATION} . "/bin/globus-url-copy";

# globus-url-copy needs a full pathname
my $pwd = getcwd();
print "$urlCopy $gassUrl/etc/hosts file://$pwd/temporary.hosts\n\n";
`$urlCopy $gassUrl/etc/hosts file://$pwd/temporary.hosts`;

# Print the copied file.
open(my $file, "<", "temporary.hosts") or die "Cannot open temporary.hosts: $!";
while (<$file>) {
    print $_;
}
close($file);
exit 0;
The submit description file used to submit the Perl script as a Condor job appears as:
executable = gass-example.pl
globusscheduler = biron.cs.wisc.edu/jobmanager
universe = globus
output = gass.out
log = gass.log
queue
There are two optional submit description file commands of note: x509userproxy and globusrsl. The x509userproxy command specifies the path to an X.509 proxy. The command is of the form:
x509userproxy = /path/to/proxy
If this optional command is not present in the submit description file, then Condor-G checks the value of the environment variable X509_USER_PROXY for the location of the proxy. If this environment variable is not present, then Condor-G looks for the proxy in the file /tmp/x509up_u0000, where the trailing zeros in this file name are replaced with the Unix user id.
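The lookup order just described (submit description file, then environment variable, then the default file in /tmp) can be expressed as a small Python sketch. It mirrors the description only and is not code from Condor-G; the function name find_proxy and its parameters are hypothetical.

```python
import os


def find_proxy(submit_value=None, environ=None, uid=None):
    """Return the proxy path chosen by the lookup order described above.

    submit_value: value of x509userproxy from the submit description
                  file, or None if the command is absent.
    environ:      mapping to consult for X509_USER_PROXY
                  (defaults to the process environment).
    uid:          Unix user id (defaults to the current user).
    """
    environ = os.environ if environ is None else environ
    if submit_value:
        return submit_value
    from_env = environ.get("X509_USER_PROXY")
    if from_env:
        return from_env
    uid = os.getuid() if uid is None else uid
    return "/tmp/x509up_u%d" % uid
```

For a user with id 1000 and neither the command nor the environment variable set, the sketch returns /tmp/x509up_u1000.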
The globusrsl command is used to add additional attribute settings to a job's RSL string. The format of the globusrsl command is
globusrsl = (name=value)(name=value)
Here is an example of this command from a submit description file:
globusrsl = (project=Test_Project)
This example's attribute name for the additional RSL is project, and the value assigned is Test_Project.
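When several RSL attributes are needed, the globusrsl line can be assembled from name and value pairs. The Python sketch below only follows the (name=value)(name=value) format shown above; the function name is hypothetical, and it makes no attempt to quote or escape values containing special characters, which real RSL strings may require.

```python
def globusrsl_line(attributes):
    """Build a globusrsl submit command from (name, value) pairs.

    No quoting or escaping is applied; values are assumed to be simple
    tokens such as Test_Project.
    """
    pairs = "".join("(%s=%s)" % (name, value) for name, value in attributes)
    return "globusrsl = " + pairs
```

For instance, globusrsl_line([("project", "Test_Project")]) reproduces the example line given above.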
Glidein is a mechanism by which one or more Grid resources (remote machines) temporarily join a local Condor pool. The program condor_glidein is used to add a machine to a Condor pool. During the period of time when the added resource is part of the local pool, the resource is visible to users of the pool, but the resource is only available for use by the user that added the resource to the pool.
After glidein, the user may submit jobs for execution on the added resource the same way that all Condor jobs are submitted. To force a submitted job to run on the added resource, the submit description file contains a requirement that the job run specifically on the added resource.
The local Condor pool configuration file(s) must give HOSTALLOW_WRITE permission to every resource that will be added using condor_glidein. Wildcards are permitted in this specification. For example, you can add every machine at cs.wisc.edu by adding *.cs.wisc.edu to the HOSTALLOW_WRITE list. Recall that you must run condor_reconfig for configuration file changes to take effect.
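To see which host names a wildcard entry such as *.cs.wisc.edu would cover, the following Python sketch uses shell-style pattern matching from the standard fnmatch module. This is only an approximation for illustration: Condor's own matching of HOSTALLOW_WRITE entries may differ in detail (fnmatch also interprets ? and [...], for example), and the function name host_allowed is hypothetical.

```python
from fnmatch import fnmatchcase


def host_allowed(hostname, allow_list):
    """True if hostname matches any pattern in allow_list.

    Host names are compared case-insensitively by lowercasing both
    sides before the shell-style match.
    """
    host = hostname.lower()
    return any(fnmatchcase(host, pattern.lower()) for pattern in allow_list)
```

With the list ["*.cs.wisc.edu"], a host such as biron.cs.wisc.edu matches, while denali.mcs.anl.gov does not and would need its own entry.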
Make sure that the Condor and Globus tools are in your PATH.
condor_glidein first contacts the Globus resource and checks for the presence of the necessary configuration files and Condor executables. If the executables are not present for the machine architecture, operating system version, and Condor version required, a server running at UW is contacted to transfer the needed executables. You can also set up your own server for condor_glidein to contact. To gain access to the server or learn how to set up your own server, send email to condor-admin@cs.wisc.edu.
When the files are correctly in place, Condor daemons are started. condor_glidein does this by creating a submit description file for condor_submit, which runs the condor_master under the Globus universe. This implies that execution of the condor_master is started on the Globus resource. The Condor daemons exit gracefully when no jobs run on the daemons for a configurable period of time. The default length of time is 20 minutes.
The Condor daemons on the Globus resource contact the local pool and attempt to join the pool. The START expression for the condor_startd daemon requires that the username of the person running condor_glidein matches the username of the jobs submitted through Condor.
After a short length of time, the Globus resource can be seen in the local Condor pool, as with this example.
% condor_status | grep denal
7591386@denal IRIX65       SGI    Unclaimed  Idle       3.700  24064  0+00:06:35
Once the Globus resource has been added to the local Condor pool with condor_glidein, job(s) may be submitted. To force a job to run on the Globus resource, specify that Globus resource as a machine requirement in the submit description file. Here is an example from within the submit description file that forces submission to the Globus resource denali.mcs.anl.gov:
requirements = ( machine == "denali.mcs.anl.gov" ) \
&& FileSystemDomain != "" \
&& Arch != "" && OpSys != ""
This example requires that the job run only on denali.mcs.anl.gov, and it prevents Condor from inserting the filesystem domain, architecture, and operating system attributes as requirements in the matchmaking process. Condor must be told not to use the submission machine's attributes in those cases where the Globus resource's attributes do not match the submission machine's attributes.
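If jobs are targeted at different glided-in machines, the requirements line can be generated rather than typed by hand. The Python sketch below produces the same expression as the example above, written on a single line instead of with continuation characters; the function name glidein_requirements is hypothetical, and the host name is inserted as given without validation.

```python
def glidein_requirements(hostname):
    """Return a requirements command that pins a job to hostname and
    supplies its own FileSystemDomain, Arch, and OpSys clauses, as in
    the example above."""
    return ('requirements = ( machine == "%s" ) '
            '&& FileSystemDomain != "" '
            '&& Arch != "" && OpSys != ""' % hostname)
```

The returned string can be written into the submit description file in place of the hand-written requirements expression.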