4.4 Job Hooks

Historically, Condor has sent work to execute machines by pushing jobs to the condor_startd daemon, either from the condor_schedd daemon or via condor_cod. Beginning with Condor 7.1.0, the condor_startd daemon can also pull work by fetching jobs via a system of plug-ins, or hooks. Any site can configure a set of hooks to fetch work completely outside of the usual Condor matchmaking system.

A hook is an external program or script invoked by Condor at various points during the life cycle of a job. Instead of putting all the code and logic directly into the Condor daemons to handle the variety of external systems from which it might fetch work, sites can write their own programs or scripts and allow Condor to invoke these hooks at the right moments to accomplish the desired outcome. This eliminates the expense of the matchmaking and scheduling provided by the condor_schedd and the condor_negotiator, although at the price of the flexibility they offer. The hooks thus allow Condor to interface more easily and directly with external scheduling systems.

A projected use of the hook mechanism is to implement what might be termed a glide-in factory, especially where the factory is behind a firewall. Without the hook mechanism to fetch work, a glide-in condor_startd daemon behind a firewall depends on GCB to help it listen for, and eventually receive, work pushed from elsewhere. With the hook mechanism, a glide-in condor_startd daemon behind a firewall uses the hook to pull work. The hook needs only an outbound network connection to complete its task, so it can operate from behind the firewall without the intervention of GCB.

The following sections describe how this system of hooks works, the semantics of fetched jobs, the interaction between fetched work and regular Condor jobs, what hooks are invoked by Condor at various stages of the job work flow, how to configure a machine to fetch jobs, and how to write your own hooks.


4.4.1 Overview of Fetching Work

Periodically, each execution slot managed by a condor_startd will invoke a hook to see if there is any work that can be fetched. Whenever this hook returns a valid job, the condor_startd will evaluate the current state of the slot and decide if it should start executing the fetched work. If the slot is unclaimed and the Start expression evaluates to TRUE, a new claim will be created for the fetched job. If the slot is claimed, the condor_startd will evaluate the Rank expression relative to the fetched job, compare it to the value of Rank for the currently running job, and decide if the existing job should be preempted because the fetched job has a higher rank. If the slot is unavailable for whatever reason, the condor_startd will refuse the fetched job and ignore it. In every case, once the condor_startd decides what it should do with the fetched job, it will invoke another hook to reply to the attempt to fetch work, so that the external system knows what happened to that work unit.
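The acceptance rules in the paragraph above can be summarized as a small decision function. The following is a minimal sketch in Python, not Condor's implementation: the function name decide_fetched_job and its parameters are hypothetical, ranks are assumed to be plain numbers, and an unclaimed slot whose Start expression is not TRUE is assumed to be treated as unavailable.

```python
def decide_fetched_job(slot_state, start_is_true, fetched_rank, current_rank=0.0):
    """Return "accept" or "reject" for a fetched job (illustrative model only)."""
    if slot_state == "Unclaimed":
        # A new claim is created only when Start evaluates to TRUE.
        # Assumption: otherwise the slot counts as unavailable.
        return "accept" if start_is_true else "reject"
    if slot_state == "Claimed":
        # The running job is preempted only by a strictly higher rank.
        return "accept" if fetched_rank > current_rank else "reject"
    # Any other state (for example Owner) means the slot is unavailable.
    return "reject"


print(decide_fetched_job("Unclaimed", True, fetched_rank=0.0))
print(decide_fetched_job("Claimed", True, fetched_rank=2.0, current_rank=1.0))
print(decide_fetched_job("Owner", False, fetched_rank=5.0))
```

The returned string matches the two argument values that the reply hook receives, accept or reject.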

If the job is accepted, a claim is created for it and the slot moves into the Claimed state. As soon as this happens, the condor_startd will spawn a condor_starter to manage the execution of the job. At this point, from the perspective of the condor_startd, this claim is just like any other. The usual policy expressions are evaluated, and if the job needs to be suspended or evicted, it will be. If a higher-ranked job being managed by a condor_schedd is matched with the slot, that job will preempt the fetched work.

The condor_starter itself can optionally invoke additional hooks to help manage the execution of the specific job. There are hooks to prepare the execution environment for the job, to periodically update information about the job as it runs, to notify when the job exits, and to take special actions when the job is being evicted.

Assuming there are no interruptions, once the job completes and the condor_starter exits, the condor_startd will invoke the hook to fetch work again. If another job is available, the existing claim will be reused and a new condor_starter is spawned. If the hook returns that there is no more work to perform, the claim will be evicted, and the slot will return to the Owner state.


4.4.2 Hooks Invoked by Condor

There are a handful of hooks invoked by Condor related to fetching work, some of which are called by the condor_startd and others by the condor_starter. Each hook is described below, including when it is invoked, what task it is supposed to accomplish, what data is passed to the hook, and what output and, when relevant, exit status are expected.


4.4.2.1 Hook: Fetch Work

HOOK_FETCH_WORK is invoked whenever the condor_startd wants to see if there is any work to fetch. A related configuration expression called FetchWorkDelay, described in detail in section 4.4.4 below, determines how long the condor_startd will wait between attempts to fetch work. HOOK_FETCH_WORK is the most important hook in the whole system, and is the only hook that must be defined for any of the other condor_startd hooks to operate.

Arguments
None.

Standard input
ClassAd of the slot that is looking for work.

Expected output
ClassAd of a job that can be run. If there is no work, the hook should return no output.

Exit status
Ignored.

The job ClassAd returned by the hook needs to contain enough information for the condor_starter to eventually spawn the work. The required and optional attributes in this ClassAd are identical to the ones described for Computing on Demand (COD) jobs in section 4.3.3 ``COD Application Attributes''.
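As an illustration of the contract above (slot ClassAd on standard input, job ClassAd or nothing on standard output), here is a minimal sketch of the core of a fetch hook in Python. It is not taken from the Condor sources. The helper names, the in-memory queue standing in for an external system, and the sample attribute values are all hypothetical; the attribute names Cmd, Args, Owner, JobUniverse and IWD are assumed here, and section 4.3.3 remains the authoritative list.

```python
def parse_classad(text):
    """Parse simple 'Name = Value' lines into a dict of raw value strings."""
    ad = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep and name.strip():
            ad[name.strip()] = value.strip()
    return ad


def format_classad(attrs):
    """Render a dict as ClassAd text; strings are quoted, numbers are not.

    Simplification: string values are assumed to contain no double quotes.
    """
    lines = []
    for name, value in attrs.items():
        if isinstance(value, str):
            lines.append(f'{name} = "{value}"')
        else:
            lines.append(f"{name} = {value}")
    return "\n".join(lines) + "\n"


def fetch_work(slot_text, queue):
    """Return job ClassAd text for this slot, or "" when there is no work."""
    slot_ad = parse_classad(slot_text)  # available for matching decisions
    if not queue:
        return ""  # no output tells the condor_startd there is no work
    return format_classad(queue.pop(0))


# Hypothetical sample data; a real hook would read sys.stdin and query
# the external system instead of using an in-memory list.
sample_slot = 'Name = "slot1@host.example"\nState = "Unclaimed"\n'
queue = [{"Cmd": "/bin/sleep", "Args": "60", "Owner": "nobody",
          "JobUniverse": 5, "IWD": "/tmp"}]
print(fetch_work(sample_slot, queue), end="")
print(repr(fetch_work(sample_slot, queue)))  # queue is now empty
```

In a real hook script, the last step would be something like sys.stdout.write(fetch_work(sys.stdin.read(), queue)); since the exit status is ignored, the presence or absence of output is the only signal.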


4.4.2.2 Hook: Reply Fetch

HOOK_REPLY_FETCH is invoked whenever HOOK_FETCH_WORK returns data and the condor_startd has decided whether or not to accept the fetched job.

Arguments
Either the string accept or reject.

Standard input
A copy of the job ClassAd and the slot ClassAd (separated by the string ----- and a new line).

Expected output
None.

Exit status
Ignored.

The condor_startd will not wait for this hook to return before taking other actions, and ignores all output. The hook is simply advisory, and has no impact on the behavior of the condor_startd.
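Several hooks receive the job ClassAd and the slot ClassAd on standard input, separated by a line containing -----. The sketch below shows one way a reply hook might split that input and record the decision. It is an illustration under assumptions, not Condor code: the function names are hypothetical, ClassAds are assumed to arrive as simple one-attribute-per-line text, and the Cmd and Name attributes in the sample are assumed to be present.

```python
def parse_classad(text):
    """Parse simple 'Name = Value' lines into a dict of raw value strings."""
    ad = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep and name.strip():
            ad[name.strip()] = value.strip()
    return ad


def split_ads(text):
    """Split hook input into (job_text, slot_text) at the '-----' line."""
    job_lines, slot_lines = [], []
    current = job_lines
    for line in text.splitlines():
        if line.strip() == "-----":
            current = slot_lines  # everything after the separator is the slot ad
            continue
        current.append(line)
    return "\n".join(job_lines), "\n".join(slot_lines)


def handle_reply(decision, stdin_text):
    """Summarize what the condor_startd decided about a fetched job."""
    if decision not in ("accept", "reject"):
        raise ValueError(f"unexpected argument: {decision!r}")
    job_text, slot_text = split_ads(stdin_text)
    job, slot = parse_classad(job_text), parse_classad(slot_text)
    return {
        "decision": decision,
        "cmd": job.get("Cmd", "").strip('"'),
        "slot": slot.get("Name", "").strip('"'),
    }


# Hypothetical input; a real hook would use sys.argv[1] and sys.stdin.read().
sample = 'Cmd = "/bin/sleep"\nArgs = "60"\n-----\nName = "slot1@host.example"\n'
print(handle_reply("accept", sample))
```

Because the condor_startd neither waits for this hook nor reads its output, any record of the decision has to be written by the hook itself, for example back to the external system that supplied the job.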


4.4.2.3 Hook: Evict Claim

HOOK_EVICT_CLAIM is invoked whenever the condor_startd needs to evict a claim representing fetched work.

Arguments
None.

Standard input
A copy of the job ClassAd and the slot ClassAd (separated by the string ----- and a new line).

Expected output
None.

Exit status
Ignored.

The condor_startd will not wait for this hook to return before taking other actions, and ignores all output. The hook is simply advisory, and has no impact on the behavior of the condor_startd.


4.4.2.4 Hook: Prepare Job

HOOK_PREPARE_JOB is invoked by the condor_starter before a job is run. This hook provides a chance to execute commands to set up the job environment, for example to transfer input files.

Arguments
None.

Standard input
A copy of the job ClassAd and the slot ClassAd (separated by the string ----- and a new line).

Expected output
None.

Exit status
0 for success preparing the job, any non-zero value on failure.

The condor_starter waits until this hook returns before attempting to execute the job. If the hook returns a non-zero exit status, the condor_starter will assume an error occurred while attempting to set up the job environment and will abort the job.
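Since the exit status of this hook decides whether the job runs, a prepare hook should map every failure to a non-zero status. The sketch below illustrates that pattern in Python under stated assumptions: the preparation step (making sure the job's working directory exists) is only an example, the function names are hypothetical, and the job ClassAd is assumed to carry an IWD attribute spelled exactly that way.

```python
import os
import tempfile


def parse_classad(text):
    """Parse simple 'Name = Value' lines into a dict of raw value strings."""
    ad = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep and name.strip():
            ad[name.strip()] = value.strip()
    return ad


def prepare_job(stdin_text):
    """Return the exit status the hook should use: 0 on success, 1 on failure."""
    # Standard input holds the job ad, a '-----' line, then the slot ad.
    job_text = stdin_text.split("\n-----\n", 1)[0]
    job = parse_classad(job_text)
    iwd = job.get("IWD", "").strip('"')  # assumed attribute name
    if not iwd:
        return 1  # nothing to prepare against: report failure
    try:
        os.makedirs(iwd, exist_ok=True)
    except OSError:
        return 1  # any set-up error must surface as a non-zero status
    return 0


# Hypothetical input using a throwaway directory.
scratch = os.path.join(tempfile.mkdtemp(), "job_dir")
sample = f'Cmd = "/bin/sleep"\nIWD = "{scratch}"\n-----\nName = "slot1@host.example"\n'
print(prepare_job(sample), os.path.isdir(scratch))
```

A real hook script would finish with sys.exit(prepare_job(sys.stdin.read())) so that the computed status becomes the exit status seen by the condor_starter.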


4.4.2.5 Hook: Update Job Info

HOOK_UPDATE_JOB_INFO is invoked periodically during the life of the job to update information about the status of the job. When the job is first spawned, the condor_starter will invoke this hook after STARTER_INITIAL_UPDATE_INTERVAL seconds (defaults to 8). Thereafter, the condor_starter will invoke the hook every STARTER_UPDATE_INTERVAL seconds (defaults to 300, in other words, every 5 minutes).

Arguments
None.

Standard input
A copy of the job ClassAd that has been augmented with additional attributes describing the current status and execution behavior of the job.

Expected output
None.

Exit status
Ignored.

The condor_starter will not wait for this hook to return before taking other actions, and ignores all output. The hook is simply advisory, and has no impact on the behavior of the condor_starter.

The additional attributes included inside the job ClassAd are:

JobState
The current state of the job. Can be either ``Running'' or ``Suspended''.

JobPid
The process identifier for the initial job directly spawned by the condor_starter.

NumPids
The number of processes that the job has currently spawned.

JobStartDate
The epoch time when the job was first spawned by the condor_starter.

RemoteSysCpu
The total number of seconds of system CPU time (the time spent in system calls) the job has used.

RemoteUserCpu
The total number of seconds of user CPU time the job has used.

ImageSize
The memory image size of the job in Kbytes.
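To show how an update hook might consume these attributes, the following Python sketch turns the augmented job ClassAd into a small status record. It is a hedged illustration: the function names and sample values are hypothetical, the ClassAd is assumed to arrive as one attribute per line, and the numeric formats (integers for JobPid, JobStartDate and ImageSize, numbers for the CPU times) are assumptions.

```python
def parse_classad(text):
    """Parse simple 'Name = Value' lines into a dict of raw value strings."""
    ad = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep and name.strip():
            ad[name.strip()] = value.strip()
    return ad


def summarize_update(ad_text, now):
    """Build a status record from the attributes listed above."""
    ad = parse_classad(ad_text)
    return {
        "state": ad["JobState"].strip('"'),
        "pid": int(ad["JobPid"]),
        # Total CPU time is system time plus user time.
        "cpu_seconds": float(ad["RemoteSysCpu"]) + float(ad["RemoteUserCpu"]),
        # JobStartDate is an epoch time, so wall time is a subtraction.
        "wall_seconds": now - int(ad["JobStartDate"]),
        "image_kbytes": int(ad["ImageSize"]),
    }


# Hypothetical sample of what standard input might contain.
sample = (
    'JobState = "Running"\n'
    "JobPid = 4242\n"
    "NumPids = 1\n"
    "JobStartDate = 1200000000\n"
    "RemoteSysCpu = 1.5\n"
    "RemoteUserCpu = 10.0\n"
    "ImageSize = 2048\n"
)
print(summarize_update(sample, now=1200000060))
```

In a real hook, now would come from time.time() and the record would be forwarded to the external system, since the condor_starter ignores the hook's output.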


4.4.2.6 Hook: Job Exit

HOOK_JOB_EXIT is invoked whenever a job exits, either on its own or when being evicted from an execution slot.

Arguments
A string describing how the job exited.

Standard input
A copy of the job ClassAd that has been augmented with additional attributes describing the execution behavior of the job and its final results.

Expected output
None.

Exit status
Ignored.

The condor_starter will wait for this hook to return before taking any other actions. In the case of jobs that are being managed by a condor_shadow, this hook is invoked before the condor_starter does its own optional file transfer back to the submission machine, writes to the local user log file, or notifies the condor_shadow that the job has exited.

The job ClassAd passed to this hook contains all of the extra attributes described above for HOOK_UPDATE_JOB_INFO, and the following additional attributes that are only present once a job exits:

ExitReason
A human-readable string describing why the job exited.

ExitBySignal
A boolean indicating if the job exited due to being killed by a signal, or if it exited with an exit status.

ExitSignal
If ExitBySignal is true, the signal number that killed the job.

ExitCode
If ExitBySignal is false, the integer exit code of the job.

JobDuration
The number of seconds that the job ran during this invocation.
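The exit attributes are meant to be read together: ExitBySignal selects which of ExitSignal or ExitCode is meaningful. The Python sketch below shows that branching. It is illustrative only; the function names and sample values are hypothetical, and boolean values are assumed to appear as TRUE or FALSE in any letter case.

```python
def parse_classad(text):
    """Parse simple 'Name = Value' lines into a dict of raw value strings."""
    ad = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep and name.strip():
            ad[name.strip()] = value.strip()
    return ad


def describe_exit(ad_text):
    """Return a short description of how the job ended."""
    ad = parse_classad(ad_text)
    if ad.get("ExitBySignal", "").lower() == "true":
        # ExitSignal is only meaningful when ExitBySignal is true.
        return f"killed by signal {int(ad['ExitSignal'])}"
    # Otherwise ExitCode carries the job's integer exit code.
    return f"exited with code {int(ad['ExitCode'])}"


# Hypothetical samples of the exit-related part of the job ClassAd.
normal = "ExitBySignal = FALSE\nExitCode = 3\nJobDuration = 12\n"
killed = "ExitBySignal = TRUE\nExitSignal = 9\nJobDuration = 4\n"
print(describe_exit(normal))
print(describe_exit(killed))
```

Because the condor_starter waits for this hook to return, a hook built on this pattern should keep any reporting to the external system short.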


4.4.3 Keywords to Define Hooks in the Condor Configuration files

Hooks are defined in the Condor configuration files by prefixing the name of the hook with a keyword. This way, a given machine can have multiple sets of hooks, each set identified by a specific keyword.

Each slot on the machine can define a separate keyword for the set of hooks that should be used (SLOTN_JOB_HOOK_KEYWORD). The ``N'' in ``SLOTN'' should be replaced with the slot identification number; for example, on slot1, the setting would be called SLOT1_JOB_HOOK_KEYWORD. If the slot-specific keyword is not defined, the condor_startd will use a global keyword (STARTD_JOB_HOOK_KEYWORD).

Once a job is fetched via HOOK_FETCH_WORK, the condor_startd will insert the keyword used to fetch that job into the job ClassAd as HookKeyword. This way, the same keyword will be used to select the hooks invoked by the condor_starter during the actual execution of the job. However, STARTER_JOB_HOOK_KEYWORD can be defined to force the condor_starter to always use a given keyword for its own hooks, instead of looking in the job ClassAd for a HookKeyword attribute.

For example, the following configuration defines two sets of hooks, and on a machine with 4 slots, 3 of the slots use the global keyword for running work from a database-driven system, and one of the slots uses a custom keyword to handle work fetched from a web service.

  # Most slots fetch and run work from the database system.
  STARTD_JOB_HOOK_KEYWORD = DATABASE

  # Slot4 fetches and runs work from a web service.
  SLOT4_JOB_HOOK_KEYWORD = WEB

  # The database system needs to both provide work and know the reply
  # for each attempted claim.
  DATABASE_HOOK_DIR = /usr/local/condor/fetch/database
  DATABASE_HOOK_FETCH_WORK = $(DATABASE_HOOK_DIR)/fetch_work.php
  DATABASE_HOOK_REPLY_FETCH = $(DATABASE_HOOK_DIR)/reply_fetch.php

  # The web system only needs to fetch work.
  WEB_HOOK_DIR = /usr/local/condor/fetch/web
  WEB_HOOK_FETCH_WORK = $(WEB_HOOK_DIR)/fetch_work.php

The keywords ``DATABASE'' and ``WEB'' are completely arbitrary, so each site is encouraged to use different (more specific) names as appropriate for their own needs.
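The lookup order described in this section (slot-specific keyword first, then the global keyword, then the keyword-prefixed hook setting) can be modeled in a few lines of Python. This is a sketch of the naming scheme only, not of how Condor reads its configuration: the function names are hypothetical, the configuration is represented as a plain dictionary, and the $(...) macros from the example are expanded by hand.

```python
def hook_keyword(config, slot_id):
    """Return the keyword for a slot: slot-specific first, then global."""
    slot_key = f"SLOT{slot_id}_JOB_HOOK_KEYWORD"
    return config.get(slot_key) or config.get("STARTD_JOB_HOOK_KEYWORD")


def hook_path(config, slot_id, hook_name):
    """Return the configured program for a hook, or None if undefined."""
    keyword = hook_keyword(config, slot_id)
    if keyword is None:
        return None
    # Hook settings are named <KEYWORD>_HOOK_<NAME>.
    return config.get(f"{keyword}_HOOK_{hook_name}")


# The example configuration above, with macros expanded by hand.
config = {
    "STARTD_JOB_HOOK_KEYWORD": "DATABASE",
    "SLOT4_JOB_HOOK_KEYWORD": "WEB",
    "DATABASE_HOOK_FETCH_WORK": "/usr/local/condor/fetch/database/fetch_work.php",
    "DATABASE_HOOK_REPLY_FETCH": "/usr/local/condor/fetch/database/reply_fetch.php",
    "WEB_HOOK_FETCH_WORK": "/usr/local/condor/fetch/web/fetch_work.php",
}
for slot in (1, 4):
    print(slot, hook_keyword(config, slot), hook_path(config, slot, "REPLY_FETCH"))
```

Under this model, slot4 resolves to the WEB keyword and has no reply hook, which matches the comment in the example that the web system only needs to fetch work.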


4.4.4 Defining the FetchWorkDelay Expression

There are two events that trigger the condor_startd to attempt to fetch new work:

Even if a given compute slot is already busy running other work, it is possible that the condor_startd would prefer newly fetched work (via the Rank expression) over the work it is currently running. However, the condor_startd evaluates its own state frequently, especially when a slot is claimed, so fetching on every evaluation could mean very frequent hook invocations. Therefore, administrators can define an expression which controls how long the condor_startd will wait between attempts to fetch new work. This expression is called FetchWorkDelay.

The FetchWorkDelay expression must evaluate to an integer, which defines the number of seconds that must pass after the last fetch attempt completed before the condor_startd will attempt to fetch more work. Because it is a ClassAd expression (evaluated in the context of the ClassAd of the slot considering whether to fetch more work, and the ClassAd of the currently running job, if any), the length of the delay can be based on the current state of the slot and even on the currently running job.

For example, a very common configuration would be to always wait 5 minutes (300 seconds) between attempts to fetch work, unless the slot is Claimed/Idle, in which case the condor_startd should fetch immediately:

FetchWorkDelay = ifThenElse(State == "Claimed" && Activity == "Idle", 0, 300)

If the condor_startd wants to fetch work, but the time since the last attempted fetch is shorter than the current value of the delay expression, the condor_startd will set a timer to fetch as soon as the delay expires.

If this expression is not defined, the condor_startd will default to a five minute (300 second) delay between all attempts to fetch work.
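The timing rule can be made concrete with a little arithmetic. The Python sketch below models the example FetchWorkDelay expression and the timer described above; it is an illustration with hypothetical function names, and times are assumed to be plain epoch seconds.

```python
def fetch_work_delay(state, activity):
    """Model of the example expression: 0 when Claimed/Idle, else 300."""
    return 0 if state == "Claimed" and activity == "Idle" else 300


def seconds_until_next_fetch(now, last_fetch_done, state, activity):
    """Return how long to wait before the next fetch attempt is allowed.

    The delay counts from the completion of the last fetch attempt; a
    result of 0 means the fetch may happen immediately.
    """
    due = last_fetch_done + fetch_work_delay(state, activity)
    return max(0, due - now)


# A busy slot that last fetched 100 seconds ago must wait 200 more seconds.
print(seconds_until_next_fetch(1100, 1000, "Claimed", "Busy"))
# A Claimed/Idle slot may fetch right away.
print(seconds_until_next_fetch(1100, 1000, "Claimed", "Idle"))
# Once the delay has fully elapsed there is nothing left to wait for.
print(seconds_until_next_fetch(1400, 1000, "Unclaimed", "Idle"))
```

A non-zero result corresponds to the case where the condor_startd sets a timer and fetches as soon as the delay expires.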


condor-admin@cs.wisc.edu