Hawkeye - Condor-based System Monitoring

Hawkeye

A Monitoring and Management Tool for Distributed Systems

Hawkeye uses technologies already present in Condor and ClassAds to provide rich mechanisms for collecting, storing, and using information about computers. A Hawkeye system can monitor various attributes of a collection of systems, and the same monitoring mechanism can then be used to help manage those systems.

Because Hawkeye is based on Condor, configuration is extremely flexible. Hawkeye is easily customized for your site and for each individual system being monitored.

Hawkeye is currently being used to monitor systems on the US/CMS test bed. See the US/CMS Status Page to view the current status of the US/CMS test bed. Note that this web page requires a web browser that can interpret XML and XSL. Recent versions of Internet Explorer and Mozilla work; Netscape 4.x does not. The Condor team also uses Hawkeye to monitor portions of our main Condor pool. See the CS Pool Status Page to view our use of Hawkeye. The current status of our C2 cluster (generated from Hawkeye data) is also on this page.

Hawkeye works by configuring Condor such that it periodically executes specified program(s). These programs are typically scripts. A program produces output in the form of ClassAd attribute/value pairs. These pairs are then added (using defined naming conventions) to the machine ClassAd. The machine ClassAd then contains attributes which may be used in expressions (such as START and SUSPEND, as well as the submit description file REQUIREMENTS expression). A ClassAd may be displayed using condor_status.
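As a rough illustration of what such a program might look like, here is a minimal sketch of a module written in Python. It assumes the output is one `attribute = value` line per pair, with strings double-quoted and embedded quotes backslash-escaped; the authoritative format is the one given in the module output specification. The function name `format_module_output` and the attribute names `DiskFreeKb`, `DiskPath`, and `DiskNearlyFull` are hypothetical and chosen only for this example.

```python
import shutil


def format_module_output(attributes):
    """Render attribute/value pairs as 'name = value' lines.

    Assumed conventions (not taken from the Hawkeye specification):
    strings are double-quoted, booleans become True/False, and numbers
    are printed as-is. No prefix is added here, because the configured
    prefix is applied when the pairs are merged into the machine ClassAd.
    """
    lines = []
    for name, value in attributes.items():
        if isinstance(value, bool):  # check bool first: bool is a subclass of int
            rendered = "True" if value else "False"
        elif isinstance(value, (int, float)):
            rendered = repr(value)
        else:
            rendered = '"' + str(value).replace('"', '\\"') + '"'
        lines.append(f"{name} = {rendered}")
    return lines


if __name__ == "__main__":
    # Hypothetical measurement: free space on the root file system.
    usage = shutil.disk_usage("/")
    sample = {
        "DiskFreeKb": usage.free // 1024,
        "DiskPath": "/",
        "DiskNearlyFull": usage.free < 0.05 * usage.total,
    }
    print("\n".join(format_module_output(sample)))
```

Once attributes like these are in the machine ClassAd, they could be referenced from expressions such as START in the same way as any built-in attribute.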

Availability

Version 1.0.0 (26-Oct-2006) of the Hawkeye software is currently available for download for Linux and Solaris. This version fixes quite a number of issues, and is the first to support the "new" configuration syntax.

Version 1.0 RC5 (22-April-2004) of the Hawkeye software is currently available for download for Linux and Solaris. This version fixes a bug in the startd in RC4 which can cause it to consume 100% of the available CPU time. All users of Hawkeye 1.0 RC4 are advised to upgrade to 1.0 RC5.

The 1.0 releases differ from previous releases in that the Hawkeye monitoring modules are no longer distributed with Hawkeye, but are instead available separately. This allows the Hawkeye system to be updated independently of the modules.

New: Version 0.1.0 of the Condor "Setup Hawkeye" package is now available. We've been getting an increasing stream of requests to make it easier to add Hawkeye features to an existing Condor installation. To address this, we've created the above package, which modifies your Condor configuration and installs the same "install module" script as is used in Hawkeye (except that here it's named condor_install_module instead of hawkeye_install_module). You can then download and install the same modules as you can for Hawkeye, and use the dynamic attributes in your Machine Ad for match-making purposes. It's available on the normal Hawkeye download page.

Modules

These are the programs that monitor attributes within a Condor pool. The modules can be individually enabled and configured, and it is easy to create and install your own custom modules. The currently available modules are:

Details

Configuration details

Specification of module output

Status

Hawkeye is very much a work-in-progress. As described above, Hawkeye v1.0.0 and its modules are available.

Note: The "old" Hawkeye configuration syntax specified the entire job list in a single HAWKEYE_JOBS list. This syntax was somewhat similar to that of the classic UNIX "cron" program, but proved to be confusing and error-prone. The new "joblist" syntax is much more consistent and easier to maintain. In this new syntax, the user provides in the "joblist" a list of job names to run. Hawkeye then uses the names in that list to look up the operational parameters for each of those jobs, with the attribute names based on the job name.

See section 3.3.9 "condor_startd Configuration File Macros" of the Condor manual for more details, in particular the block starting with "STARTD_CRON_JOBLIST".
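The name-based lookup described in the note can be pictured with a small Python sketch. This is an illustration of the naming scheme only, not Hawkeye's actual implementation: the function `resolve_jobs` and the use of a plain dictionary for the configuration are assumptions made for the example. Names are compared in upper case on the understanding that Condor configuration macro names are case-insensitive.

```python
def resolve_jobs(config, base="HAWKEYE"):
    """Map each job named in <base>_JOBLIST to its own parameters.

    `config` is a plain dict of configuration names to string values.
    For a job called job1, every entry named <base>_job1_<PARAM> is
    collected under the key <PARAM>.
    """
    # Normalise names to upper case so job1 and JOB1 are treated alike.
    upper = {key.upper(): value for key, value in config.items()}
    jobs = {}
    for name in upper.get(f"{base}_JOBLIST".upper(), "").split():
        marker = f"{base}_{name}_".upper()
        jobs[name] = {
            key[len(marker):]: value
            for key, value in upper.items()
            if key.startswith(marker)
        }
    return jobs


if __name__ == "__main__":
    example = {
        "HAWKEYE_JOBLIST": "job1",
        "HAWKEYE_job1_PREFIX": "j1prefix_",
        "HAWKEYE_job1_PERIOD": "5m",
    }
    print(resolve_jobs(example))
```

Because each parameter lives under its own name, a single job can be adjusted or removed without touching the entries for any other job.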

The net effect is that the following configuration syntax should be considered obsolete, and should be rewritten, as shown below:


# Hawkeye Job Definitions
HAWKEYE_JOBS   = $(HAWKEYE_JOBS) job1:j1prefix_:$(MODULES)/job1_exe:5m:nokill
HAWKEYE_JOBS   = $(HAWKEYE_JOBS) job2:j2prefix_:$(MODULES)/job2_exe:1h
HAWKEYE_JOB1_ARGS =-foo -bar
HAWKEYE_JOB1_ENV = xyzzy=somevalue
HAWKEYE_JOB2_ENV = lwpi=somevalue

Instead, write this as:

# Hawkeye Job Definitions
HAWKEYE_JOBLIST   =

# Job 1
HAWKEYE_JOBLIST = $(HAWKEYE_JOBLIST) job1
HAWKEYE_job1_PREFIX = j1prefix_
HAWKEYE_job1_EXECUTABLE = $(MODULES)/job1_exe
HAWKEYE_job1_PERIOD = 5m
HAWKEYE_job1_MODE = periodic
HAWKEYE_job1_RECONFIG = false
HAWKEYE_job1_KILL = false
HAWKEYE_job1_ARGS =-foo -bar
HAWKEYE_job1_ENV = xyzzy=somevalue

# Job 2
HAWKEYE_JOBLIST = $(HAWKEYE_JOBLIST) job2
HAWKEYE_job2_PREFIX = j2prefix_
HAWKEYE_job2_EXECUTABLE = $(MODULES)/job2_exe
HAWKEYE_job2_PERIOD = 1h
HAWKEYE_job2_MODE = periodic
HAWKEYE_job2_RECONFIG = false
HAWKEYE_job2_KILL = false
HAWKEYE_job2_ARGS =
HAWKEYE_job2_ENV = lwpi=somevalue

This new syntax is also easier to read and maintain.
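For sites with many old-style entries, the rewrite could be done mechanically. The Python sketch below is one possible approach, not a tool shipped with Hawkeye. It assumes the old entry layout is `name:prefix:executable:period[:option]`, which is inferred from the examples above rather than from a formal specification, and it only knows the `nokill` option seen there. The function name `convert_old_job` is hypothetical. The separate ARGS and ENV settings are not handled, since only their names would need to follow the job name.

```python
def convert_old_job(entry, base="HAWKEYE"):
    """Translate one old-style job entry into new-style configuration lines.

    Assumed field order: name, prefix, executable, period, then optional
    options. Executable paths containing ':' are not supported by this sketch.
    """
    fields = entry.split(":")
    if len(fields) < 4:
        raise ValueError(f"expected at least 4 colon-separated fields: {entry!r}")
    name, prefix, executable, period = fields[:4]
    lines = [
        f"{base}_JOBLIST = $({base}_JOBLIST) {name}",
        f"{base}_{name}_PREFIX = {prefix}",
        f"{base}_{name}_EXECUTABLE = {executable}",
        f"{base}_{name}_PERIOD = {period}",
    ]
    for option in fields[4:]:
        if option == "nokill":
            lines.append(f"{base}_{name}_KILL = false")
        else:
            # Unknown options are flagged for manual review, not guessed at.
            lines.append(f"# untranslated option for {name}: {option}")
    return lines


if __name__ == "__main__":
    for line in convert_old_job("job1:j1prefix_:$(MODULES)/job1_exe:5m:nokill"):
        print(line)
```

The generated lines still deserve a manual review, in particular to add MODE and RECONFIG settings where the defaults are not what the site wants.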

Future Effort

Hawkeye is undergoing active development in the following areas:

Papers and Slides

Useful Links


condor-admin@cs.wisc.edu