Condor High Throughput Computing

Managing Jobs That Use Licenses With Condor

An incredibly wide variety of software may be run under Condor. However, it gets complicated if the software needs one of a limited number of available licenses. While Condor does not directly know of or manage licenses on behalf of jobs run under Condor, Condor is flexible enough to enable workarounds. None are perfect, but they do work. Each section below describes an approach.

Use Condor's Concurrency Limits mechanism

Condor's Concurrency Limits mechanism allows Condor to keep a count of available and used resources. A license is a resource under this mechanism. As jobs that require a license are run, Condor maintains a count, and only starts jobs as licenses are available.

The Condor Manual, section 3.12.11 describes the mechanism.
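
As a rough sketch of how this looks in practice, assume 10 licenses and a limit named examplesoftware (a name chosen here only for illustration). The configuration on the pool's central manager would define the size of the limit:

EXAMPLESOFTWARE_LIMIT = 10

Each job that needs a license would then claim one unit of that limit in its submit description file:

concurrency_limits = examplesoftware

Check the manual section cited above for the exact behavior in the installed version of Condor.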

The mechanism has a potential drawback. If licenses can also be used outside of Condor (for example, by a user starting an interactive session on their own machine), Condor cannot know that such a license is in use, and may start a job for which no license is actually available.

Assign Licenses to Machines

Assign licenses to machines. If there are 10 licenses, then pick 10 machines, and declare that only these 10 machines are allowed to use the licenses. Within the Condor configuration, advertise that these machines have licenses. For example, suppose the software package is called Example Software version 3.12. On the 10 machines chosen to have the 10 licenses, the condor_config configuration file may advertise the license with:

HasLicenseExampleSoftware3_12 = TRUE
STARTD_EXPRS = HasLicenseExampleSoftware3_12

Then, each job submitted to Condor that needs a license will specify that requirement in the submit description file, appearing as:

requirements = HasLicenseExampleSoftware3_12

Note that the requirements expression for the job uses the same name as the machine ClassAd attribute defined in the configuration.

It is likely that not all jobs within the Condor pool require a license. If so, consider how the 10 machines with licenses should treat jobs that do not need one. If the licenses are extremely scarce, such that the 10 machines ought to be dedicated to running jobs that need licenses, then the machines may be configured to start only jobs that need a license. This is done by setting the START expression. The configuration becomes:

HasLicenseExampleSoftware3_12 = TRUE
STARTD_EXPRS = HasLicenseExampleSoftware3_12
START = NeedExampleSoftware3_12

The submit description file both requires a machine that has the software and advertises that the job needs the software:

requirements = HasLicenseExampleSoftware3_12
+NeedExampleSoftware3_12 = TRUE

Where licenses are not the overriding consideration in matchmaking, dedicating the machines is wasteful: they remain idle whenever fewer jobs need a license than there are machines advertising one. Instead, the machines may use the RANK expression to state that they prefer jobs that need a license over jobs that do not. The configuration file for each machine that has a license becomes:

HasLicenseExampleSoftware3_12 = TRUE
STARTD_EXPRS = HasLicenseExampleSoftware3_12
RANK = NeedExampleSoftware3_12

The submit description file is the same as before:

requirements = HasLicenseExampleSoftware3_12
+NeedExampleSoftware3_12 = TRUE

It is possible to use the SUBMIT_EXPRS configuration variable to automatically add attributes, such as NeedExampleSoftware3_12, to jobs submitted from a particular submit machine.
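
For instance, if every job submitted from a given submit machine needs the license, the configuration on that machine might contain something like the following. This is a sketch that reuses the attribute name from the examples above:

NeedExampleSoftware3_12 = TRUE
SUBMIT_EXPRS = NeedExampleSoftware3_12

The attribute is then inserted into every job submitted from that machine, so the +NeedExampleSoftware3_12 line can be left out of the submit description file. Because it applies to all jobs from that machine, including any that do not need a license, this suits a submit machine used only for licensed work.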

Assigning licenses to machines has the same potential drawback as concurrency limits: if licenses can be used outside of Condor, for example, by a user starting an interactive session on their own machine, Condor cannot know that a license is in use and may start a job for which no license is available.

It is more work to configure, but the machine ClassAd attribute HasLicenseExampleSoftware3_12 can be made dynamic. This assumes the presence of a script that counts the number of licenses available, and it uses Condor's "cron" functionality. See the STARTD_CRON_* configuration settings in the manual for this functionality. This is still an imperfect solution, as race conditions exist: a job can start before the machine notices that the license is gone.
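
A sketch of the dynamic approach follows. It assumes a site-written script, given here the hypothetical path /usr/local/bin/count_example_licenses, that queries the license server and prints a single line to its standard output, either HasLicenseExampleSoftware3_12 = TRUE or HasLicenseExampleSoftware3_12 = FALSE. The job name LICENSE and the 300-second period are also illustrative. The exact STARTD_CRON_* variable names differ between Condor versions, so check them against the manual for the installed version:

# Hypothetical cron job named LICENSE; run the script every 300 seconds
STARTD_CRON_JOBLIST = LICENSE
STARTD_CRON_LICENSE_EXECUTABLE = /usr/local/bin/count_example_licenses
STARTD_CRON_LICENSE_PERIOD = 300

Because the script's output is published in the machine ClassAd, the attribute should no longer need to be defined statically in the configuration file.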

Have Jobs Restart

While a bit of a brute force solution, each job that needs a license may be resubmitted if it starts, but then fails to acquire a license.

This approach requires that the job be wrapped in a script that detects the failure to acquire a license. The script then returns a well-known exit code; for this example, use the code 52. The following line is then added to the job's submit description file:

on_exit_remove = (ExitBySignal == TRUE) || (ExitCode != 52)
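
Put together, a submit description file for this approach might look like the sketch below. The wrapper name run_with_license.sh and the output file names are hypothetical; the wrapper stands for a site-written script that runs the licensed program and exits with code 52 when no license could be acquired.

universe = vanilla
executable = run_with_license.sh
output = job.out
error = job.err
log = job.log
on_exit_remove = (ExitBySignal == TRUE) || (ExitCode != 52)
queue

With this expression, a job that exits with code 52 stays in the queue and is run again, while any other exit code, or termination by a signal, removes the job from the queue as usual.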

If DAGMan is already used to manage workflows, then DAGMan's POST scripts and RETRY accomplish the same resubmission approach, with the test for failure performed by a script that runs on the submit machine.
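
A minimal sketch of such a DAG input file is shown below. The node name LicensedJob, the file names, and the retry count of 5 are chosen only for illustration. The hypothetical script check_license.sh receives the job's exit code through $RETURN and exits with a non-zero status when that code signals a failure to acquire a license, which marks the node as failed and triggers a retry:

JOB LicensedJob licensed_job.submit
SCRIPT POST LicensedJob check_license.sh $RETURN
RETRY LicensedJob 5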




condor-admin@cs.wisc.edu