3.11 Quill
This section gives an overview of the installation, deployment,
and use of Quill, and describes the aspects of Quill that
are significant to deployment.
3.11.1 Theory of Operation
Quill consists of three components:
- The condor_quill server, which maintains the job queue and history
tables in the database.
- A modified condor_q tool, which can be used to query the database
tables.
- A modified condor_history tool.
In this release, the two modified query tools replace their traditional
counterparts. In addition to querying the database, the modified tools still give
users the old features of querying the schedd and the history file,
respectively.
3.11.2 Installing and Configuring Quill
Here is an overview of the steps needed to install and use quill, followed by
a description of each step.
- Install the postgres server and client libraries if they aren't already installed
- Configure postgres to suit quill
- Unpack and build the quill server and client tools
- Modify condor_config with quill related options
- Invoke the quill daemon and start querying it
- Miscellaneous issues (database schema, security, etc.)
Each of the above steps is detailed below.
- The Postgres Server
Quill uses the postgres server as its backend and the postgres client
library, libpq, to talk to the server. While the 8.0 version of postgres
will be included as part of condor externals, quill has also been tested
with earlier versions of postgres (specifically 7.4).
In addition, one can obtain the postgres source from:
http://www.postgresql.org/ftp/source/
Installation instructions are detailed in:
http://www.postgresql.org/docs/8.0/static/installation.html
- Configuration of Postgres
The following steps need to be taken after postgres is installed. These
are done only once; quill takes care of all other database creation and
maintenance tasks.
- The quill daemon and the client tools connect to the database as
users ``quillwriter'' and ``quillreader'' respectively.
These are database users, not operating system users (the two are
completely different things). If those users don't already exist,
they need to be added using the 'createuser' command in the postgres bin
directory. They also need to be assigned appropriate passwords; these
passwords will be used by the quill tools to connect to the database in a
secure way. User ``quillreader'' should not be allowed to create
more databases or more users. User ``quillwriter'' should
also not be allowed to create more users, but it should be allowed
to create more databases. The following commands create the two users
with the appropriate permissions (be ready to enter the corresponding
passwords when prompted):
/path/to/postgres/bin/directory/createuser quillreader \
--no-createdb --no-adduser --pwprompt
/path/to/postgres/bin/directory/createuser quillwriter \
--createdb --no-adduser --pwprompt
- Postgres should be configured to accept TCP/IP connections. In version
7, this was done by setting tcpip_socket=true in the postgresql.conf file.
In version 8, this has changed: use the listen_addresses variable
in the same file instead. Set it appropriately, e.g. listen_addresses = '*'
(here, '*' means any IP interface).
- Postgres also needs to be configured to accept TCP/IP connections from
specific hosts, which is what enables remote connections. This is done in
the pg_hba.conf file, which usually resides in the postgres server's
data directory. While the particular host based
configuration will vary from site to site, one basically needs to allow
access from every host that will access this database server, either by
way of the quill daemon itself writing to the server, or by way of the
condor_q tool querying this server. For example, to give
database users ``quillreader'' and ``quillwriter''
password-enabled access to all databases on the current machine from any
other machine in the 128.105.0.0 network, add the following:
host    all    quillreader    128.105.0.0    255.255.0.0    password
host    all    quillwriter    128.105.0.0    255.255.0.0    password
Note that in addition to the database specified by
QUILL_DB_NAME in the condor_config file, the quill
daemon also needs access to the database 'template1'. This is because
in order to create the former database in the first place, it needs to
connect to the latter.
Once the server is up and running and the client libraries are installed,
we can now go ahead and install quill.
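To make the address/netmask pair in the pg_hba.conf entries above concrete, here is a minimal Python sketch that checks whether a client address falls inside 128.105.0.0 with netmask 255.255.0.0. It only illustrates the matching rule; it is not how postgres itself evaluates pg_hba.conf, and the function name client_allowed and the sample client addresses are made up for the example.

```python
import ipaddress

# Network and netmask copied from the pg_hba.conf example entries above.
ALLOWED_NETWORK = ipaddress.ip_network("128.105.0.0/255.255.0.0")


def client_allowed(client_ip: str) -> bool:
    """Return True if client_ip lies inside the allowed address/netmask pair."""
    return ipaddress.ip_address(client_ip) in ALLOWED_NETWORK


if __name__ == "__main__":
    # Hypothetical client addresses, used only to exercise the check.
    for ip in ("128.105.14.7", "192.168.1.10"):
        print(ip, client_allowed(ip))
```

A host running condor_q or condor_quill outside that network would need its own entry in pg_hba.conf.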
- Compiling and linking Quill
Quill has been fully integrated into the condor build system. This means
that condor_quill appears as a directory under the top level src/
directory in the condor source, and building the condor source will also
automatically build quill and both of its query tools.
- Modifying condor_config
Now that we have built and installed quill, it is time to edit the
condor_config file to include quill related options.
The following variables need to either be modified or added:
The first one is DAEMON_LIST. Add QUILL to this
list as shown below:
DAEMON_LIST = MASTER, etc. etc., QUILL
Add .quillwritepassword to the VALID_SPOOL_FILES
variable, since we do not want condor_preen to delete this file
thinking it is junk:
VALID_SPOOL_FILES = job_queue.log, etc. etc., .quillwritepassword
The condor_master needs to be told where the quill binary resides and what its start-up arguments are:
QUILL = $(SBIN)/condor_quill
QUILL_ARGS = -f
Quill writes to its own log just as the other daemons. This log can be
checked to see quill's run-time behavior and any malfunctions.
QUILL_LOG = $(LOG)/QuillLog
The following options go in the daemon-specific (in this case, quill)
section with appropriately modified values to suit the local environment:
QUILL_ENABLED = TRUE
QUILL_NAME = some-unique-quill-name.cs.wisc.edu
QUILL_DB_NAME = database-for-some-unique-quill-name
QUILL_DB_IP_ADDR = databaseipaddress:port
# the following parameter is in seconds
QUILL_POLLING_PERIOD = 10
# the following parameter is in hours
QUILL_HISTORY_CLEANING_INTERVAL = 24
# the following parameter is in days
QUILL_HISTORY_DURATION = 30
QUILL_IS_REMOTELY_QUERYABLE = TRUE
QUILL_DB_QUERY_PASSWORD = password-for-database-user-quillreader
QUILL_ADDRESS_FILE = $(LOG)/.quill_address
Following is a description of each. Skip to the next section for a brief
overview of how to query quill:
- QUILL_ENABLED
This flag enables or disables quill functionality.
When it is TRUE, the quill server runs normally and the client tools
can access it. When it is FALSE, the quill server
writes a descriptive message to its log file and exits, and the clients
condor_q and condor_history behave as though there is no quill instance
running. As with many other variables, a change to this variable while quill is
running should be followed by a condor_reconfig command for the change to take effect.
Note that one can also turn off quill by removing it from the DAEMON_LIST variable.
However, QUILL_ENABLED can be used to disable quill not only during the initial setup of
the condor instance but also while the quill server is running.
- QUILL_NAME
This is the name of this quill server. Each quill server sends an ad to
the collector containing its name. As such, it is important that the name
of a quill server does not conflict with that of any other quill server
or, for that matter, any schedd. The latter is because each quill sends a
QUILL_AD to the collector with both its own name and the name
of the schedd it is mirroring, so for querying purposes it is important to keep
them unique. It might be convenient to simply name the quill server
quill@machinename.fully.qualified.address
- QUILL_DB_NAME and QUILL_DB_IP_ADDR
These two variables determine the location of the database
server that this quill talks to, and the name of the database that
it creates. More than one quill server can talk to the same database
server. This is done by simply setting QUILL_DB_IP_ADDR
on each of them to point to the same database server.
Notice:
- QUILL_POLLING_PERIOD
This controls the frequency with which quill polls the
job_queue.log file. By default, it is 10 seconds. Since quill
works by periodically sniffing the log file for updates and then sending
those updates to the database, this variable controls the tradeoff between
the currency of query results and quill's load on the system, which is usually
negligible.
- QUILL_HISTORY_CLEANING_INTERVAL and
QUILL_HISTORY_DURATION
These two variables control the deletion of historical jobs from the
database. QUILL_HISTORY_DURATION is the number of days
after completion (more precisely, the number of days since the history ad got
into the history database - those two might be different if a job is completed
but stays in the queue for a while) that a given job will stay in the database.
So all jobs beyond QUILL_HISTORY_DURATION will be deleted. Now,
scanning the entire database for old jobs can get pretty expensive,
so the other variable QUILL_HISTORY_CLEANING_INTERVAL
is the number of hours between two successive scans. By default,
QUILL_HISTORY_DURATION is set to 180 days and
QUILL_HISTORY_CLEANING_INTERVAL is set to 24 hours.
- QUILL_IS_REMOTELY_QUERYABLE
Thanks to postgres, one can now remotely query both the job queue and the
history tables. This variable controls whether this remote querying
feature is enabled. By default it is TRUE. Note that even if
this is FALSE, one can still query the job queue in the remote schedd.
This variable only controls whether the database tables are remotely queryable.
- QUILL_DB_QUERY_PASSWORD
In order for the query tools to connect to a database, they need to provide
the password that was assigned to database user ``quillreader'' above.
The value of this variable is advertised by the quill daemon to the collector.
This facility enables remote querying: remote condor_q query tools first
ask the collector for the password associated with a particular quill database
and then query that database. Users who do not have access to the collector
cannot view the password and as such cannot query the database. Again, this
password provides only 'read' access to the database.
- QUILL_ADDRESS_FILE
When quill starts up, it can place its address (IP and port)
into a file. This way, tools running on the local machine don't
need to query the central manager to find quill. This
feature can be turned off by commenting out the variable.
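The settings described above can be sanity-checked before starting the daemons. The following is a minimal Python sketch, not Condor's own configuration parser: it reads simple NAME = value lines and ignores macro expansion such as $(LOG), line continuations, and defaults. The helper names, the choice of which variables to treat as required, and the sample values (database name, address, port) are assumptions made for the example.

```python
def parse_config(text: str) -> dict:
    """Parse simple 'NAME = value' lines, skipping blanks and '#' comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        settings[name.strip().upper()] = value.strip()
    return settings


def split_list(value: str) -> list:
    """Split a comma-separated config value into trimmed items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def check_quill_settings(settings: dict) -> list:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if "QUILL" not in split_list(settings.get("DAEMON_LIST", "")):
        problems.append("QUILL is missing from DAEMON_LIST")
    spool_files = split_list(settings.get("VALID_SPOOL_FILES", ""))
    if ".quillwritepassword" not in spool_files:
        problems.append(".quillwritepassword is missing from VALID_SPOOL_FILES")
    # Assumption: these three have no usable default and must be set by hand.
    for name in ("QUILL_NAME", "QUILL_DB_NAME", "QUILL_DB_IP_ADDR"):
        if not settings.get(name):
            problems.append(name + " is not set")
    addr = settings.get("QUILL_DB_IP_ADDR", "")
    host, sep, port = addr.rpartition(":")
    if addr and (not sep or not host or not port.isdigit()):
        problems.append("QUILL_DB_IP_ADDR should look like address:port")
    return problems


# Hypothetical configuration fragment used only to exercise the checks.
SAMPLE = """
DAEMON_LIST = MASTER, SCHEDD, QUILL
VALID_SPOOL_FILES = job_queue.log, .quillwritepassword
QUILL_NAME = quill@regular.cs.wisc.edu
QUILL_DB_NAME = quill_regular
QUILL_DB_IP_ADDR = 128.105.0.10:5432
"""

if __name__ == "__main__":
    print(check_quill_settings(parse_config(SAMPLE)))
```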
- Invoking the quill daemon and querying it.
Once the condor_config file is updated with the above settings,
the quill daemon can be started either by restarting condor using
condor_restart or by starting condor using condor_master. All the
daemons in the DAEMON_LIST variable, as updated above, are
started and managed by the master accordingly.
The condor_quill daemon is responsible for maintaining a database
mirror of the job_queue and history logs. One can query those two
using condor_q and condor_history respectively. Both tools
retain all their old functionality, i.e. condor_q can be used to query
the schedd and condor_history can be used to query the history file.
Moreover, they retain all their old options and gain some more thanks to
database technology. For example, as before, we can query both using
the job id (cluster.proc), owner, dags, io, cputime, etc. Orthogonally,
just as we could query remote schedds, we can also query remote quill
databases for job queue and historical information. The latter is new
functionality made possible by the remote querying functionality in Postgres.
The -help option can be used to look at all the options supported by
both tools.
3.11.3 Examples
- Query a remote quill daemon on regular.cs.wisc.edu for all the jobs in
the queue
condor_q -name quill@regular.cs.wisc.edu
condor_q -name schedd@regular.cs.wisc.edu
There are two ways to get to a quill daemon: directly, using its name as
specified in the QUILL_NAME variable described above, or indirectly,
by querying the schedd using its name. In the latter case, condor_q will detect
whether that schedd is being serviced by a database, and if so, query the database directly.
In both cases, the IP address and port of the database server and the name of the
database holding this remote quill daemon's data are taken from the QUILL_DB_IP_ADDR
and QUILL_DB_NAME variables specified in the QUILL_AD
sent by the quill daemon to the collector and in the SCHEDD_AD sent by
the schedd.
- Query a remote quill daemon on regular.cs.wisc.edu for all historical
jobs belonging to owner 'akini'.
condor_history -name quill@regular.cs.wisc.edu akini
- Query the local quill daemon for the average time spent in the queue
for all non-completed jobs.
condor_q -avgqueuetime
This is a new query. -avgqueuetime is defined as the average of
(currenttime - jobsubmissiontime) over all jobs which are neither
completed (JobStatus == 4) nor removed (JobStatus == 3).
- Query the local quill daemon for all historical jobs completed since
Apr 1, 2005 at 13h 00m.
condor_history -completedsince '04/01/2005 13:00'
This is also a new query. It fetches all jobs
which got into the 'Completed' state on or after the
specified timestamp. We follow Postgres's date/time
syntax rules as it encompasses most format options. See
http://www.postgresql.org/docs/8.0/static/datatype-datetime.html#AEN4516
for the various timestamp formats.
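The -avgqueuetime definition given above can be written out as a small calculation. This is a minimal Python sketch of the formula only, not the query that condor_q actually sends to the database. It assumes each job is a dictionary whose QDate attribute holds the submission time in seconds since the epoch; the function name and the sample jobs are made up for the example.

```python
# JobStatus codes named in the text above.
REMOVED = 3
COMPLETED = 4


def avg_queue_time(jobs, now):
    """Average of (now - submission time) over jobs neither completed nor removed.

    Returns None when no job qualifies, so the caller never divides by zero.
    """
    waits = [
        now - job["QDate"]
        for job in jobs
        if job["JobStatus"] not in (REMOVED, COMPLETED)
    ]
    if not waits:
        return None
    return sum(waits) / len(waits)


if __name__ == "__main__":
    # Hypothetical queue contents; times are in seconds since the epoch.
    sample_jobs = [
        {"QDate": 100, "JobStatus": 2},
        {"QDate": 200, "JobStatus": 1},
        {"QDate": 50, "JobStatus": COMPLETED},
        {"QDate": 10, "JobStatus": REMOVED},
    ]
    print(avg_queue_time(sample_jobs, now=300))
```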
3.11.4 Quill and Its RDBMS Schema
With only 7 tables and 2 views, quill uses a relatively simple database
schema. These can be broadly divided into tables used to store job
queue information and those used to store historical information.
The job queue part of the schema closely follows condor's classad data
model, i.e. each row in these tables describes an <attribute,value>
pair of the classad. Additionally, just as condor distinguishes a
ClusterAd from a ProcAd, where the former stores attributes common to all
jobs within a cluster and the latter stores attributes specific to
each job, the schema also makes this distinction. Finally, numerical
and string valued attributes are stored separately.
Thus we have four tables:
- ClusterAds_Str (cid int,
attr text,
val text,
primary key (cid, attr))
- ClusterAds_Num (cid int,
attr text,
val double precision,
primary key (cid, attr))
- ProcAds_Str (cid int,
pid int,
attr text,
val text,
primary key (cid, pid, attr))
- ProcAds_Num (cid int,
pid int,
attr text,
val double precision,
primary key (cid, pid, attr))
In addition to the <attribute, value>, each row contains the cluster-id
(cid) and in the case of the ProcAd tables, also the proc-id (pid).
Since each classad would be split into potentially two tables (string
and numeric), there are views that unify them into a single entity in
order to simplify queries.
Here are the view definitions:
- Definition of ClusterAds view
CREATE VIEW ClusterAds as select cid,
attr,
val from ClusterAds_Str UNION ALL
select cid,
attr,
cast(val as text) from ClusterAds_Num;
- Definition of ProcAds view
CREATE VIEW ProcAds as select cid,
pid,
attr,
val from ProcAds_Str UNION ALL
select cid,
pid,
attr,
cast(val as text) from ProcAds_Num;
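To see how the views unify the string and numeric tables, the ClusterAds definitions above can be tried out in a scratch database. The sketch below uses Python's built-in sqlite3 module as a stand-in for Postgres, which is an assumption made purely so the example is self-contained; the text form of a numeric value after the cast may differ between the two databases. The function name and the inserted rows are hypothetical.

```python
import sqlite3


def build_cluster_ads():
    """Create the two ClusterAds tables and the unifying view in memory."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE ClusterAds_Str (cid int, attr text, val text,
                                     primary key (cid, attr));
        CREATE TABLE ClusterAds_Num (cid int, attr text, val double precision,
                                     primary key (cid, attr));
        CREATE VIEW ClusterAds AS
            SELECT cid, attr, val FROM ClusterAds_Str
            UNION ALL
            SELECT cid, attr, cast(val AS text) FROM ClusterAds_Num;
        """
    )
    return conn


if __name__ == "__main__":
    conn = build_cluster_ads()
    # Hypothetical attributes of cluster 1: one string, one number.
    conn.execute("INSERT INTO ClusterAds_Str VALUES (1, 'Owner', 'akini')")
    conn.execute("INSERT INTO ClusterAds_Num VALUES (1, 'JobPrio', 5)")
    # One query against the view returns both attributes as text.
    for attr, val in conn.execute(
        "SELECT attr, val FROM ClusterAds WHERE cid = 1 ORDER BY attr"
    ):
        print(attr, val)
```

A query tool can therefore read a whole cluster ad with a single SELECT on the view instead of querying both tables.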
Finally, the job queue part of the schema also contains a table that
stores metadata information related to the job_queue.log file.
- JobQueuePollingInfo (last_file_mtime BIGINT,
last_file_size BIGINT,
last_next_cmd_offset BIGINT,
last_cmd_offset BIGINT,
last_cmd_type SMALLINT,
last_cmd_key text,
last_cmd_mytype text,
last_cmd_targettype text,
last_cmd_name text,
last_cmd_value text)
At all times, there is only one row in this table, and it describes
the state recorded the last time quill polled the job_queue.log file.
- last_file_mtime and last_file_size
The last modified time and size of the file.
- last_cmd_offset and last_next_cmd_offset
The offsets of the record last read from the file and of the record that follows it.
- last_cmd_type
The command type (101, 102, etc.) of the record.
- last_cmd_key,
last_cmd_mytype,
last_cmd_targettype,
last_cmd_name,
and
last_cmd_value
Together, these attributes define the record itself. The key
refers to the combined "cid.pid" pair, mytype and targettype usually
contain Job and Machine respectively, and the name and
value contain the <attribute,value> pair.
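The polling state described above suggests how an incremental reader of job_queue.log might work. The following Python sketch is an illustration under stated assumptions, not the condor_quill implementation: it treats every newline-terminated line as one record, keeps the state in memory rather than in the JobQueuePollingInfo table, and re-reads from the start if the file has shrunk. The class and function names are made up for the example.

```python
import os
from dataclasses import dataclass


@dataclass
class PollingInfo:
    """In-memory counterpart of three JobQueuePollingInfo columns."""
    last_file_mtime: int = 0
    last_file_size: int = 0
    last_next_cmd_offset: int = 0


def poll_log(path, info):
    """Return the records appended since the previous poll; update info in place."""
    st = os.stat(path)
    if (st.st_size == info.last_file_size
            and int(st.st_mtime) == info.last_file_mtime):
        return []  # nothing changed since the last poll
    if st.st_size < info.last_file_size:
        # Assumption: a smaller file means it was rewritten, so start over.
        info.last_next_cmd_offset = 0
    records = []
    with open(path, "rb") as log:
        log.seek(info.last_next_cmd_offset)
        while True:
            line = log.readline()
            if not line.endswith(b"\n"):
                break  # end of file, or a record that is still being written
            records.append(line.decode().rstrip("\n"))
            info.last_next_cmd_offset = log.tell()
    info.last_file_mtime = int(st.st_mtime)
    info.last_file_size = st.st_size
    return records
```

Calling poll_log once every QUILL_POLLING_PERIOD seconds and forwarding the returned records to the database would mirror the behaviour described in this section; a partially written last record is left for the next poll.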
The historical information, on the other hand, is designed slightly
differently. Instead of a purely vertical data model (each row is an
<attribute,value> pair), we have two tables that together represent the
complete job classad. Their schema is as follows:
- History_Horizontal (cid int,
pid int,
EnteredHistoryTable timestamp with time zone,
Owner text,
QDate int,
RemoteWallClockTime int,
RemoteUserCpu float,
RemoteSysCpu float,
ImageSize int,
JobStatus int,
JobPrio int,
Cmd text,
CompletionDate int,
LastRemoteHost text,
primary key(cid,pid))
- History_Vertical (cid int, pid int, attr text, val text, primary key
(cid, pid, attr))
Each historical job ad is divided into its horizontal and vertical
counterparts. This division was made for query performance
reasons. While it is easier to store classads in a vertical table,
queries on vertical tables generally perform worse than those on
horizontal tables, since the latter have far fewer records. However, in
Condor, since job ads don't have a fixed schema (users can define their
own attributes), a purely horizontal schema would end up having a lot
of null values. As such, we have a hybrid schema where attributes on
which queries are frequently performed (via condor_history) are put
in the History_Horizontal table and the other attributes
are stored vertically (just as in the Cluster/Proc tables above) in the
History_Vertical table. Also, History_Horizontal
contains all the attributes needed to service the short form of the
condor_history command (i.e. without the -l option).
The resulting hybrid schema has proven to be the most efficient in
servicing condor_history queries. The job queue tables (Cluster and
Proc) were not designed in this hybrid manner because job queues aren't
as large as history; a purely vertical schema performed well enough.
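The split between the two history tables can be sketched as follows. This is a minimal Python illustration, assuming that each attribute is stored in exactly one of the two tables and that EnteredHistoryTable is filled in when the row is inserted; neither assumption is confirmed by the text above. The function name and the sample ad are hypothetical.

```python
# Attribute columns of History_Horizontal, taken from the schema above
# (cid, pid and EnteredHistoryTable are handled separately).
HORIZONTAL_ATTRS = (
    "Owner", "QDate", "RemoteWallClockTime", "RemoteUserCpu", "RemoteSysCpu",
    "ImageSize", "JobStatus", "JobPrio", "Cmd", "CompletionDate",
    "LastRemoteHost",
)


def split_history_ad(cid, pid, ad):
    """Split one job ad into a horizontal row and a list of vertical rows."""
    horizontal = {"cid": cid, "pid": pid}
    for attr in HORIZONTAL_ATTRS:
        # Attributes missing from the ad become NULL (None) columns.
        horizontal[attr] = ad.get(attr)
    vertical = [
        (cid, pid, attr, str(val))
        for attr, val in ad.items()
        if attr not in HORIZONTAL_ATTRS
    ]
    return horizontal, vertical


if __name__ == "__main__":
    # Hypothetical job ad with one user-defined attribute.
    ad = {"Owner": "akini", "JobStatus": 4, "MyCustomAttr": "x"}
    row, rest = split_history_ad(7, 0, ad)
    print(row["Owner"], rest)
```

Only the user-defined or rarely queried attributes end up as one row each, which is why the vertical table stays much smaller than a fully vertical history would be.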
3.11.5 Quill and Security
There are several layers of security in Quill, some provided by condor and
others provided by the database. Firstly, all accesses to the database
are password-protected.
- As mentioned in the postgres configuration step above, the query tools, condor_q and
condor_history, connect to the database as user ``quillreader''.
The password for this user can vary from one database to another and,
as such, each quill daemon advertises this password to the collector.
The query tools then obtain this password from the collector and
use it to connect to the database. Access to the database by the
``quillreader'' user is read-only, as this is sufficient for the
query tools. The quill daemon enforces this read-only access using the SQL
GRANT command when it first creates the tables in the database. Note that
access to the ``quillreader'' password itself can be blocked by
blocking access to the collector, a feature already supported in Condor.
- The quill daemon, on the other hand, needs read and write access
to the database. As such, it connects as user ``quillwriter'',
who has owner privileges to the database. Since this gives all
access to the ``quillwriter'' user, its password cannot
be stored in a public place (such as the collector). For this
reason, the ``quillwriter'' password is stored in a file called
.quillwritepassword in the condor spool directory. Appropriate read/write
protections on this file restrict who can obtain write access to the database.
This file must be created and protected by the site administrator;
if this file does not exist as and where expected, the condor_quill
daemon logs an error and exits.
- Finally, as mentioned in the description of QUILL_IS_REMOTELY_QUERYABLE above, the
IsRemotelyQueryable attribute in the ``quill ad'' advertised
by the quill daemon to the collector can be used by site administrators
to disallow the database from being read by all remote condor query tools.
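Since the protection of .quillwritepassword is left to the site administrator, a quick check of its permissions can be useful. The sketch below is a hypothetical helper written in Python, not something condor_quill is documented to do. It assumes a POSIX file system and treats the file as adequately protected only when neither group nor other users have any access to it.

```python
import os
import stat


def password_file_is_private(path):
    """Return True if path is a regular file with no group or other access."""
    mode = os.stat(path).st_mode
    open_bits = stat.S_IRWXG | stat.S_IRWXO  # any group/other permission bit
    return stat.S_ISREG(mode) and (mode & open_bits) == 0
```

For example, a file with mode 0600 passes the check, while one with mode 0644 does not, because other users could then read the ``quillwriter'' password.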
3.11.6 Maintenance of Quill
There are virtually no maintenance issues in Quill. Once started, it
checks whether all necessary database related structures (the database itself,
tables, indices, views) are present and creates them if they are
not. It also purges old historical jobs based on user policy
(see QUILL_HISTORY_DURATION above) and performs garbage collection in the database (using the
postgres VACUUM ANALYZE command). Of course, if Quill is shut
down and the database is no longer needed, it can be dropped using the
postgres dropdb command.
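The purge of old historical jobs can be pictured with a small experiment. The following Python sketch uses the built-in sqlite3 module as a stand-in for Postgres and stores EnteredHistoryTable as seconds since the epoch instead of a timestamp with time zone; both are simplifications made for the example, as are the reduced table definitions and the function names. It does not show the statements condor_quill actually issues, and the VACUUM ANALYZE step is omitted.

```python
import sqlite3

SECONDS_PER_DAY = 24 * 60 * 60


def make_history_db():
    """Create cut-down versions of the two history tables in memory."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE History_Horizontal (cid int, pid int,
            EnteredHistoryTable int, Owner text, primary key (cid, pid));
        CREATE TABLE History_Vertical (cid int, pid int, attr text, val text,
            primary key (cid, pid, attr));
        """
    )
    return conn


def purge_old_history(conn, duration_days, now):
    """Delete jobs that entered history more than duration_days before now.

    Returns the number of jobs removed from History_Horizontal.
    """
    cutoff = now - duration_days * SECONDS_PER_DAY
    # Remove the vertical rows first, while the horizontal rows still
    # identify which jobs are too old.
    conn.execute(
        """DELETE FROM History_Vertical WHERE EXISTS (
               SELECT 1 FROM History_Horizontal h
               WHERE h.cid = History_Vertical.cid
                 AND h.pid = History_Vertical.pid
                 AND h.EnteredHistoryTable < ?)""",
        (cutoff,),
    )
    cur = conn.execute(
        "DELETE FROM History_Horizontal WHERE EnteredHistoryTable < ?",
        (cutoff,),
    )
    conn.commit()
    return cur.rowcount
```

Running purge_old_history once every QUILL_HISTORY_CLEANING_INTERVAL hours with duration_days set to QUILL_HISTORY_DURATION corresponds to the policy described earlier in this section.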
condor-admin@cs.wisc.edu