Condor High Throughput Computing

Top Five Myths About Condor


  1. Myth: Condor requires users to recompile their applications.
    Reality: Condor runs ordinary, unmodified applications.

    Condor can run users' ordinary, unmodified executables and scripts in the "vanilla" universe. These programs may rely on an existing shared file system for I/O, or Condor may optionally stage the needed data files in and out. If desired, users may re-link their applications with the Condor C library and submit them to the "standard" universe. This provides checkpointing and transparent remote I/O without any modification to the application source code. However, this step is entirely optional. (Source: Online Manual)
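
    As a rough illustration of the vanilla universe described above, a submit description file for an unmodified script might look like the sketch below. The file and program names (analyze.sh, input.dat, and the output names) are hypothetical placeholders, and the file-transfer commands assume a Condor version that supports them; check the Online Manual for the exact commands available in your installation.

    ```
    # Hypothetical submit description file for a vanilla-universe job.
    # All file names here are placeholders chosen for illustration.
    universe                = vanilla
    executable              = analyze.sh
    arguments               = input.dat

    # Ask Condor to stage files in and out instead of relying on a
    # shared file system (assumes file-transfer support is available).
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    transfer_input_files    = input.dat

    # Where the job's stdout, stderr, and the Condor event log are written.
    output                  = analyze.out
    error                   = analyze.err
    log                     = analyze.log

    queue
    ```

    Assuming the file is saved as analyze.sub, it would be submitted with "condor_submit analyze.sub". On a pool with a shared file system, the three file-transfer lines could likely be dropped.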

  2. Myth: Condor has a single point of failure.
    Reality: Condor has excellent failure isolation.

    The failure of a single component in a Condor system affects only the processes that deal directly with it. For example, the failure of an execution machine causes only the jobs running on that machine to roll back and execute elsewhere. The failure of a submission machine causes the jobs submitted from that machine to roll back and execute elsewhere once it reboots. The failure of a central manager temporarily prevents new jobs from starting, but no running jobs are affected. (Source: Online FAQ)

  3. Myth: Condor is only good at "cycle stealing."
    Reality: Condor can effectively manage many kinds of distributed systems.

    Condor was originally designed for harnessing the unused cycles of desktop workstations, sometimes known as "cycle stealing." However, the same system design that works for cycle stealing is also effective for managing uniform clusters, multiprocessor machines, and wide-area distributed systems. The Condor pool at UW-Madison manages workstations, several clusters, and several multiprocessors all in one system. A national-scale Condor pool in Italy harnesses autonomous workstations spread throughout ten cities. The key observation is that all are susceptible to failures, to a greater or lesser degree. Condor's original attention to failures for "cycle stealing" makes it a uniquely robust system for many environments. (Source: "Condor and The Grid" in "Grid Computing" by Berman, et al.)

  4. Myth: Condor only runs sequential jobs.
    Reality: Condor has extensive support for parallel programming environments.

    Condor provides explicit support for the MPI, PVM, and Master-Worker programming environments. Users may write standard applications to these interfaces and execute them on Condor-managed machines. Of course, MPI and PVM jobs expect a highly reliable environment, so such users should instruct Condor to harness reliable machines for these jobs. Applications written to the Master-Worker interface are naturally more fault tolerant and are better able to harness unreliable machines such as personal workstations. (Sources: Online Manual and Research Publications)

  5. Myth: Condor doesn't do "grid" computing.
    Reality: Condor is involved in many forms of distributed computing, including the "grid."

    There seems to be little agreement over exactly what the term "grid" means. However, we can describe Condor using more concrete terms:

    Condor is a reliable, scalable, and proven distributed system, with many years of operating experience. It manages a computing system with multiple owners, multiple users, and no centralized administrative structure. It deals gracefully with failures and delivers a large number of computing cycles on the time scale of months to years.

    Condor systems can become part of a "grid" in several ways. Multiple independent Condor pools may be allied together through a technique known as "flocking." Condor users may harness remote non-Condor resources via Condor-G, a fault-tolerant agent for "grid" computing.

    (Sources: Online Manual and Research Publications)




condor-admin@cs.wisc.edu