Condor or Condor-G: Which one is right for me?
A source of confusion for many new (and old) Condor users is the relationship
between Condor and Condor-G. People who are new to grid computing may have
heard that Condor-G gives them a way to easily get their jobs running on
the resources of the emerging Grids. Managers and users of existing Condor
pools also want to link their resources with other pools to establish a larger
Grid. These paragraphs explain the differences between Condor-G
and Condor, give example use cases, and help you decide which is the
right one to install.
The difference between Condor and Condor-G will become clearer with a
brief overview of the software that forms a Condor pool. For this discussion,
the software of a Condor pool is divided into two parts. The first part does
job management. It keeps track of a user's jobs. You can ask the job management
part of Condor to show you the job queue, to submit new jobs to the system,
to put jobs on hold, and to request information about jobs that have completed.
The other part of the Condor software does resource management. It keeps
track of which machines are available to run jobs, how the available machines
should be utilized given all the users who want to run jobs on them, and
when a machine is no longer available.
A machine with the job management part installed is called a submit machine.
A machine with the resource management part installed is called an execute
machine. Each machine may have one part or both. For example, a cluster of
10 machines may have one machine as a submit machine, and the other nine
as execute machines. At the UW Computer Sciences department, we have both
parts installed on all of our 300 users' desktops. This allows anyone in
the department to submit from his or her own machine. No one needs to log in
to a central server to use the thousand machines in the pool, and every machine
is potentially available to run jobs.
What Is Condor-G?
Condor-G is the job management part of Condor. Condor-G lets you submit
jobs into a queue, have a log detailing the life cycle of your jobs, manage
your input and output files, along with everything else you expect from a
job queuing system. Condor-G gets its name from how it talks to the resource
management part. Instead of using the Condor-developed protocols to start
running a job on a remote machine, Condor-G uses the Globus Toolkit(tm) to
start the job on the remote machine. Condor-G provides a "window to the Grid"
for users to both access resources and manage jobs running on remote resources.
Use Condor-G to look across the Grid and see instantly how your jobs are
doing. You can trust Condor-G to keep watching your jobs while your attention
is elsewhere.
Who should want to install Condor-G?
Install Condor-G to submit to resources accessible through a Globus interface.
For example, the Origin2000 at Boston University is running LSF, and it also
provides a Globus interface. One way of running jobs on this machine is to
use 'ssh' to connect to the Origin, and run the LSF submit command. Another
way of running jobs on this machine is to use the Globus interface directly
to ask the Origin to run a job for you. Better yet, use Condor-G to submit
the jobs. Condor-G gives you a consistent interface for submitting jobs,
whether to another machine in a local pool or to a resource sitting behind
a firewall on the other side of the country. Condor-G users use the same
interface to run a job on a Linux cluster managed by PBS or the NQS queue
on a Cray T3E.
Do not install Condor-G if you want to add a Globus interface to your
existing Condor pool. Remember, Condor-G does not *create* a grid service.
It only deals with *using* remote grid services.
Of course, much like you can have a machine be both a submit machine and
an execute machine in a Condor pool, you can have Condor-G and the Globus
interface to a pool installed at the same time. Users will submit jobs through
Condor-G in order to have their jobs execute on resources outside the Condor
pool. External users will use the Globus interface to submit jobs to be executed
inside the Condor pool. (Yes, you could use the submission semantics of Condor-G
to force a job to be submitted to the Globus interface on the local Condor
pool, but that is a rather circuitous way to run jobs.)
How do I get Condor-G for installation?
See the instructions on the Condor-G web page.
Answers to some frequently asked questions
- Do I need to have a Condor pool installed to use Condor-G?
No, you do not. Condor-G is only the job management part of Condor. It makes
perfect sense to install Condor-G on just one machine within an organization.
Note that access to remote resources using a Globus interface will be done
through that one machine using Condor-G.
- I am an existing Condor pool administrator, and I want to be able
to let my users access resources managed by Globus, as well as access the
Condor pool. Should I install Condor-G?
Yes.
- I want to install Globus on my Condor pool, and have external
users submit jobs into the pool. Do I need to install Condor-G?
No, you do not need to install Condor-G. You need to install Globus, to
get the "Condor jobmanager setup package." See the Condor-G home page for
a link to Globus.
- I just bought a 64-node cluster. Should I install Condor or another
system such as PBS, or LSF, or GridEngine?
It is up to you. We obviously would rather see you run only Condor
on your new cluster, but your decision depends on what you're most experienced
and comfortable with. If you install a different batch system, you can still
install Globus on the front end, and you can use Condor-G to let external
users submit jobs to run on your cluster.
- I am the administrator at Physics, and I have a 64-node cluster
running Condor. The administrator at Chemistry is also running Condor on
her 64-node cluster. We would like to be able to share resources. Do I need
to install Globus and Condor-G?
You may, but you do not have a need to. Condor has a feature called flocking,
which lets multiple Condor pools share resources. By setting configuration
variables, jobs may be executed on either cluster. Flocking is good (better
than Condor-G) because all the Condor features continue to work. Examples
of such features are checkpointing and remote system calls. Unfortunately, flocking
only works between Condor pools. So, access to a resource managed by PBS,
for example, will still require Globus and Condor-G to submit jobs into the
PBS queue.
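As a rough illustration of the configuration involved, the sketch below prints the two flocking settings each pool might use. FLOCK_TO and FLOCK_FROM are Condor configuration variables; the host names and the helper function are hypothetical, and a real setup also needs matching authorization settings, which are omitted here.

```python
def flocking_config(remote_central_manager, remote_submit_hosts):
    """Render the flocking settings for one pool (illustrative helper)."""
    return "\n".join([
        # Where this pool's submit machines may send jobs when the
        # local pool has no free machines.
        f"FLOCK_TO = {remote_central_manager}",
        # Which remote submit machines may send jobs into this pool.
        f"FLOCK_FROM = {', '.join(remote_submit_hosts)}",
    ])


# Hypothetical host names for the two departments.
physics_config = flocking_config(
    "cm.chemistry.example.edu", ["submit.chemistry.example.edu"]
)
chemistry_config = flocking_config(
    "cm.physics.example.edu", ["submit.physics.example.edu"]
)

print("# Physics pool\n" + physics_config)
print("# Chemistry pool\n" + chemistry_config)
```

Each pool names the other in both directions, which is what lets jobs run on either cluster.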
- Why would I want to use Condor-G?
If you have more than a trivial set of work to get done, you will find that
you spend a great deal of time managing your jobs without Condor. Your time
would be better spent working with your results. For example, imagine you
want to run 1000 jobs on the NCSA Itanium cluster without Condor-G. To use
the Globus interface directly, you will type 'globusrun' 1000 times. Now
imagine that you want to check on the status of your jobs. Using 'globus-job-status'
1000 times will not be much fun. And, heaven help you if you find a bug and
need to cancel your thousand jobs. This will be followed by resubmitting
all 1000 jobs. Under Condor-G, job submission and management is simplified.
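To make the contrast concrete, here is a minimal sketch that generates a single submit description covering all 1000 jobs. The keywords "universe = globus" and "globusscheduler" are the ones used by Condor-G releases of this era as best we recall; later releases use different keywords, so check the manual for your version. The gatekeeper contact string, executable name, and file names are hypothetical.

```python
def condor_g_submit_description(executable, gatekeeper, count):
    """Build one submit description that queues `count` Condor-G jobs."""
    lines = [
        "universe = globus",
        # Assumed contact string of the remote Globus gatekeeper.
        f"globusscheduler = {gatekeeper}",
        f"executable = {executable}",
        # $(Process) expands to 0, 1, 2, ... for each queued job.
        "arguments = $(Process)",
        "output = job.$(Process).out",
        "error = job.$(Process).err",
        # One shared log records the life cycle of every job.
        "log = jobs.log",
        f"queue {count}",
    ]
    return "\n".join(lines) + "\n"


submit_text = condor_g_submit_description(
    "analyze", "gatekeeper.example.edu/jobmanager-pbs", 1000
)
print(submit_text)
```

With the text saved to a file, one condor_submit call queues every job, one condor_q call shows their status, and one condor_rm call removes them all.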
Condor-G will also ease submission and management of jobs when job flow is
complex. For example, suppose the first 100 jobs must run before any of the next
100 jobs, and once those 200 jobs are done, the next 700 may start. The last
100 may run only after the first 900 finish. Condor DAGMan manages all of
the dependencies for you. You can avoid writing a rather complex Perl script.
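The staged workflow described above can be sketched as a short script that writes a DAGMan input file. The JOB and PARENT ... CHILD lines are DAGMan syntax; the job names and the per-job submit file names (job0.sub and so on) are hypothetical. Because each stage depends on the one before it, chaining adjacent stages is enough to enforce the ordering.

```python
def build_dag(stage_sizes):
    """Return DAGMan input text in which each stage waits for the previous one."""
    lines = []
    stages = []
    next_id = 0
    for size in stage_sizes:
        names = [f"job{next_id + i}" for i in range(size)]
        next_id += size
        stages.append(names)
        # One JOB line per job: a name and its submit description file.
        lines.extend(f"JOB {name} {name}.sub" for name in names)
    # Every job in a stage is a parent of every job in the next stage.
    for parents, children in zip(stages, stages[1:]):
        lines.append(f"PARENT {' '.join(parents)} CHILD {' '.join(children)}")
    return "\n".join(lines) + "\n"


# Stages of 100, 100, 700, and 100 jobs, as in the example above.
dag_text = build_dag([100, 100, 700, 100])
print(len(dag_text.splitlines()), "lines in the DAG file")
```

Once the text is saved to a file, condor_submit_dag hands the whole workflow to DAGMan.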
- What piece of software is most like Condor?
Condor is comparable to LSF, Sun Grid Engine, or PBS.
- What piece of software is most like Condor-G?
Condor-G is most like globus-job-submit, or a web job-submission portal.
- Can I install both a Globus interface to a Condor pool and Condor-G
at the same time?
Yes. In fact, you might want to. (See question 2 above.)
- What is Glidein?
Glidein provides a way to temporarily add a Globus resource to a local Condor
pool. Condor-G is not required to use glidein. Glidein uses Globus resource-management
software to run jobs. It works by executing portions of the Condor software
on the Globus resource. Then, Condor may execute the user's jobs. There are
several benefits to working in this way. Checkpoints can be made of the jobs
running on the Globus resource, and remote system calls can be supported.
Condor can also dynamically schedule jobs across the Grid.
- What is planned for future Condor-G versions?
In the short term, there are a few minor queuing features missing from Condor-G,
such as sending job-completion e-mail. Globus 2.2 adds file-staging to Globus
resources, so a future Condor-G will be able to transfer all of a job's data
files to a remote site, and not just stdin, stdout, and stderr. A Windows
port of Globus is expected sometime in late 2002 or 2003, and when that is
complete, Condor-G will run on Windows.
Condor home page
- condor-admin@cs.wisc.edu