8.3 Development Release Series 6.9
This is the development release series of Condor.
The details of each version are described below.
Version 6.9.4
Release Notes:
- The default in standard universe for copy_to_spool
is now true. In 6.9.3, it was changed to false for all
universes for performance reasons, but this is deemed too risky for
standard universe, because any modification of the executable is
likely to make it impossible to resume execution using checkpoint
files made from the original version of the executable.
- Version 1.5.0 of the Generic Connection Broker (GCB) is
now used for building Condor.
This version of GCB fixes a few critical bugs.
- GCB was unable to pass information about sockets registered
at a GCB broker to child processes due to a bug in the way a
special environment variable was being set.
- All sockets for outbound connections were being registered
at the GCB broker, which was putting severe strain on the GCB
broker even under relatively low load.
Now, only sockets that are listening for inbound connections are
registered at the broker.
- The USE_CLONE_TO_CREATE_PROCESSES setting was
causing havoc for applications linked with GCB.
This configuration setting is now always disabled if GCB is enabled.
- Fixed a race condition in GCB_connect() that would
frequently cause connect() attempts to fail, especially
non-blocking connections.
- Fixed bugs in GCB_select() when GCB changes the direction
of a connection from active to passive (for example, so that a
Condor daemon running behind a firewall will use an outbound
connection to communicate with a public client that had
attempted to initiate contact via the GCB broker).
- Also improved logging at the GCB broker.
Additionally, there was a bug in how Condor was publishing the
ClassAds for GCB-enabled daemons.
Condor used to rewrite any attributes containing an IP address
when a ClassAd was sent over a network connection (in an
effort to provide correct behavior for multi-homed machines).
Now, this rewriting is disabled whenever GCB is enabled, since GCB
already has logic to determine the correct IP addresses to advertise.
For more information about GCB, see section 3.7.3.
- The owner of the log file for the condor_gridmanager
has changed to the condor user.
In Condor 6.9.3 and previous versions, it was owned by the
user submitting the job.
Therefore, the owner of and permissions on an existing log file
are likely to be incorrect.
Condor issues an error if the condor_gridmanager is unable
to read and write the existing file.
To correct the problem, an administrator may modify file
permissions such that the condor user may read and
write the log file.
Alternatively, an administrator may delete the file, and
Condor will create a new file with the expected owner and
permissions.
In addition, the definition for GRIDMANAGER_LOG
in the condor_config.generic file has changed for
Condor 6.9.4.
New Features:
Configuration Variable Additions and Changes:
Bugs Fixed:
- Trailing commas in lists of items in submit files and
configuration files are now ignored. Previously, Condor would treat trailing
commas in various surprising ways.
- Numerous bugs in GCB and the interaction between Condor and GCB.
See the release notes above for details.
- The submit file entry "coresize" was not being honored properly in
many universes. It is now honored in all universes except pvm and the grid
universes (except where the grid type is Condor). For the java universe,
it controls the core file size for the JVM itself.
- The condor_configure installation script now allows Condor
to be installed on hosts without a fully-qualified domain name.
- Fixed a bug in condor_dagman: if a DAG run with a per-DAG
configuration file specification generated a rescue DAG, the rescue
DAG file did not contain the appropriate DAG configuration file line.
(This bug was introduced when the per-DAG configuration file option
was added in version 6.9.2.)
- Fixed a bug introduced in 6.9.3 when handling local universe jobs.
The starter ignored failures in contacting the condor_schedd in the
final update to the job queue.
- When the condor_schedd is issued a graceful shutdown command, any jobs
that are running with a job lease are allowed to keep running. When the condor_schedd
starts back up at a later time, it will spawn condor_shadow to reconnect
to the jobs if they are still executing. This mimics the behavior of
a fast shutdown. This also fixes a bug in 6.9.3 in which the condor_schedd
would fail to reconnect to jobs that were left running during a graceful
shutdown.
- When the condor_starter is gracefully shutting down and
has become disconnected from the condor_shadow, it will wait for the
job lease time to expire before giving up on telling the condor_shadow
that the job was evicted. Previously, the condor_starter would exit
as soon as it was done evicting the job.
- Job ad attribute HoldReasonCode is now properly set when
condor_hold is called and when jobs are submitted on hold.
- If a job specified a job lease duration, and the condor_schedd
was killed or crashed, the condor_shadow used to notice when the
condor_schedd was gone, and gracefully shut down the job (evicting
the job at the remote site).
Now, the condor_shadow honors the job lease duration, and if the
lease has not yet expired, it simply exits without evicting the
job, in the hope that the condor_schedd will be restarted in time
to reconnect to the still-running job and resume computation.
- Fixed a bug from 6.9.3 in which condor_q -format no longer
worked when given an expression (as opposed to a simple attribute reference).
The expression was always treated as being undefined.
- When a Condor daemon such as the condor_schedd or
condor_negotiator tried to establish many new security sessions for
UDP messages in a short span of time, it was possible for the daemon
to run out of file descriptors, causing it to abort execution and be
restarted by the condor_master. A problem was found and fixed in the
mechanism that protects against this.
- Improved error descriptions when Condor-C encounters failures when
sending input files to the remote schedd.
- Rare failure conditions during stage in would cause Condor-C to put
the state of the job in the remote schedd into an invalid state in which
it would run but later fail during stage out. This now results in the
job on the submit side going on hold with a staging failure.
- Fixed a bug which could cause condor_store_cred to crash during
common use.
- Fixed a bug where the vanilla universe condor_starter would possibly
crash when running a job not as the owner of the job.
- Fixed a bug which would cause a condor_starter being used for
the local universe to core dump.
- Fixed a bug which caused the condor_schedd to core dump while
processing a job's crontab entries in the submit description file.
- Fixed a privilege separation bug in the standard universe
condor_starter.
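As an illustration of the trailing-comma fix listed among the bug fixes above, the following Python sketch splits a comma-separated list and discards empty items at the end of the list. It is not Condor's parser; the function name split_items and the sample values are hypothetical.

```python
def split_items(raw):
    """Split a comma-separated list, ignoring trailing commas.

    Hypothetical helper for illustration only; Condor's own parsing
    code is not shown in these release notes.
    """
    items = [part.strip() for part in raw.split(",")]
    # Drop the empty items produced by one or more trailing commas.
    while items and items[-1] == "":
        items.pop()
    return items


if __name__ == "__main__":
    print(split_items("input.dat, params.txt,"))
```

Only trailing empty items are removed in this sketch; an empty item in the middle of a list is kept, since the release note does not say how such input is treated.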
Known Bugs:
- Standard universe jobs do not work
when writing binary data. The behavior exhibited in this
case may include the job crashing or corrupt binary data being written.
- Grid universe jobs for the gt4 grid type do not work
if Condor daemons are started as root
and there is file transfer associated with or specified by the job.
These jobs are placed on hold.
- The STARTD_RESOURCE_PREFIX setting on Windows results
in broken behavior on both Condor 6.9.3 and 6.9.4. Specifically, when
this setting is given a value other than its default ("slot"), all
jobs will run using the "condor-reuse-slot1" user account,
regardless of the actual slot used for execution.
Additions and Changes to the Manual:
- New documentation for the new vm universe
in the User's Manual, section 2.12.
Definitions of configuration variables for the vm universe are in
section 3.3.25.
- New RDBMS schema tables added for Quill in
section 3.11.4.
Version 6.9.3
Release Notes:
- As of version 6.9.3, the entire Condor system has undergone a
major terminology change.
For almost 10 years, Condor has used the term virtual machine
or vm to refer to each distinct resource that could run a
Condor job (for example, each of the CPUs on an SMP machine).
Back when we chose this terminology, it made sense, since each of
these resources was like an independent machine in a pool, with
its own state, ClassAd, claims, and so on.
However, in recent years, the term virtual machine has become
almost universally associated with the kinds of virtual machines
created using tools such as VMware and Xen. Entire operating systems
run inside a given process, usually emulating the underlying
hardware on a host machine.
So, to avoid confusion with these other kinds of virtual machines,
the old virtual machine terminology has been replaced by
the term slot.
Numerous configuration settings, command-line arguments to Condor
tools, ClassAd attribute names, and so on, have all been
modified to reflect the new slot terminology.
In general, the old settings and options will still work, but are
now retired and may disappear in the future.
- The condor_install installation script has
been removed.
All sites should use condor_configure when setting up a new Condor
installation.
- The SECONDARY_COLLECTOR_LIST configuration variable has
been removed.
Sites relying on this variable should instead use the configuration
variable COLLECTOR_HOST. It may be used to
define a list of condor_collector daemon hosts.
- Cleaned up and improved help information for condor_history.
New Features:
- Numerous scalability and performance improvements. Given enough
memory, the schedd can now handle much larger job queues (e.g. tens of
thousands of jobs) without the severe degradation in performance that used to
be the case.
- Added the START_LOCAL_UNIVERSE and START_SCHEDULER_UNIVERSE
parameters for the condor_schedd. These allow administrators to control whether
a Local/Scheduler universe job will be started. Each expression is evaluated
against the job's ClassAd before the Requirements expression.
- All Local and Scheduler universe jobs now have their Requirements
expressions evaluated before execution. If the expression evaluates to false, the
job will not be allowed to begin running. In previous versions of Condor, Local
and Scheduler universe jobs could begin execution without the condor_schedd checking
the validity of the Requirements.
- Added SCHEDD_INTERVAL_TIMESLICE and
PERIODIC_EXPR_TIMESLICE. These indicate the maximum
fraction of time that the schedd will spend on the respective
activities. Previously, these activities were done on a fixed
interval, so with very large job queue sizes, the fraction of time
spent was increasing to unreasonable levels.
- Under Intel Linux, added USE_CLONE_TO_CREATE_PROCESSES.
This defaults to true and results in scalability improvements for processes
using large amounts of memory (e.g. a schedd with a lot of jobs in the queue).
- Jobs in the parallel universe now can have $$ expanded in their
ads in the same way as other universes.
- Local universe jobs now support policy expression evaluation, which includes
the ON_EXIT_REMOVE, ON_EXIT_HOLD, PERIODIC_REMOVE,
PERIODIC_HOLD, and PERIODIC_RELEASE attributes. The periodic
expressions are evaluated at intervals determined by the
PERIODIC_EXPR_INTERVAL configuration macro.
- Jobs can be scheduled to execute periodically, similar to the crontab
functionality found in Unix systems. The condor_schedd calculates the next
runtime for a job based on the new CRON_MINUTE, CRON_HOUR,
CRON_DAY_OF_MONTH, CRON_MONTH, and
CRON_DAY_OF_WEEK attributes. A preparation time defined by the
CRON_PREP_TIME attribute allows a job to be submitted to the
execution machine before the actual time the job is to begin execution.
Jobs that are to be run repeatedly will need to define
the ON_EXIT_REMOVE attribute properly so that they are
re-queued after executing each time.
- Condor now looks for its configuration file in /usr/local/etc
if the CONDOR_CONFIG environment variable is not set and there is
no condor_config file located in /etc/condor. This allows a default
Condor installation to be more compatible with FreeBSD.
- If a user job requests streaming input or output in the submit
file, the job can now run with job leases and the job will continue
to run for the lease duration should the submit machine crash. Previously,
jobs with streaming i/o would be evicted if the submit machine crashed.
While the submit machine is down, if the job tries to issue a streaming
read or write, the job will block until the submit machine returns or the
job lease expires.
- Ever since version 6.7.19, condor_submit has added a default
job lease duration of 20 minutes to all jobs that support these
leases.
However, there was no way to disable this functionality if a user
did not want job lease semantics.
Now, a user can place
job_lease_duration = 0 in their submit
file to manually disable the job lease.
- Added new configuration knob STARTER_UPLOAD_TIMEOUT
which sets the timeout for the starter to upload output files to the
shadow on job exit. The default value is 200 seconds, which should
be sufficient for serial jobs. For parallel jobs, this may need to
be increased if many large output files are sent back to the shadow
on job exit.
- condor_dagman now aborts the DAG on "scary" submit events.
These are submit events in which
the Condor ID of the event does not match the
expected value.
Previously, condor_dagman printed a warning, but continued.
To restore Condor to the previous behavior,
set the new DAGMAN_ABORT_ON_SCARY_SUBMIT configuration variable
to False.
- When the condor_master detects that its GCB broker is unavailable
and there is a list of alternative brokers,
it will restart immediately if MASTER_WAITS_FOR_GCB_BROKER is
set to False instead of waiting for another broker to become available.
condor_glidein now sets MASTER_WAITS_FOR_GCB_BROKER
to False in its configuration file.
- When using GCB and a list of brokers is available, the
condor_master will now pick a random broker rather than the least-loaded
one.
- All Condor daemons now evaluate some ClassAd expressions
whenever they are about to publish an update to the
condor_collector.
Currently, the two supported expressions are:
- DAEMON_SHUTDOWN: If True, the daemon will gracefully shut itself down and will not
be restarted by the condor_master (as if it sent itself a
condor_off command).
- DAEMON_SHUTDOWN_FAST: If True, the daemon will quickly shut itself down and will not be
restarted by the condor_master (as if it sent itself a
condor_off command using the -fast option).
For more information about these expressions, see
section 3.3.5.
- When the condor_master sends email announcing that another daemon has
died, exited, or been killed, it now notes the name of the machine, the
daemon's name, and a summary of the situation in the Subject line.
- Anyplace in a Condor configuration or submit description file where
wild cards may be used, you can now place wild cards at both the beginning
and end of the string pattern (i.e. match strings that contain the text
between the wild cards anywhere in the string). Previously, only one
wild card could appear in the string pattern.
- Added optional configuration setting
NEGOTIATOR_MATCH_EXPRS. This allows the negotiator to
insert expressions into the matched ClassAd. See
the definition of this configuration variable for more information.
- Increased speed of ClassAd parsing.
- Added DEDICATED_EXECUTE_ACCOUNT_REGEXP and
deprecated the boolean setting
EXECUTE_LOGIN_IS_DEDICATED, because the latter could not
handle a policy where some jobs run as the job owner and some run as
dedicated execution accounts. Also added support for
STARTER_ALLOW_RUNAS_OWNER under Unix. See
Section 3.3.7 and
Section 3.6.12 for more information.
- All Condor daemons now publish a MyCurrentTime attribute
which is the current local time at the time the update was generated
and sent to the condor_collector.
This is in addition to the LastHeardFrom attribute which is
inserted by the condor_collector (the current local time at the
collector when the update is received).
- condor_history now accepts partial command line
arguments. For example, -constraint can be abbreviated -const.
This brings condor_history in line with other Condor command
line tools.
- condor_history can now emit ClassAds formatted as XML with
the new -xml option.
This brings condor_history more in line with condor_q.
- The
$$ substitution macro syntax now supports the insertion
of literal $$ characters through the use of $$(DOLLARDOLLAR).
Also, $$ expansion is no longer recursive, so if the value being
substituted in place of a $$ macro itself contains $$ characters,
these are no longer interpreted as substitution macros but are instead
inserted literally.
- When started as root on a Linux 64-bit x86 machine, Condor daemons will
now leave core files in the log directory when they crash. This matches
Condor's behavior on most other Unix-like operating systems, including
32-bit x86 versions of Linux.
- The _CONDOR_SLOT variable is now placed into the
environment for jobs of all universes.
This variable indicates what slot a given job is running on, and
will have the same value as the SlotID from the machine
ClassAd where the job is running.
The _CONDOR_SLOT variable replaces the deprecated
CONDOR_VM environment variable, which was only defined for
standard universe jobs.
- Added a USE_PROCD configuration parameter. If this
parameter is set to true for a given daemon, the daemon will use the
condor_procd program to monitor process families. If set to false,
the daemon will execute process family monitoring logic on its
own. The condor_procd is more scalable and is also an essential
piece in the ongoing privilege separation effort. The disadvantage of
using the ProcD is that it is newer, less-hardened code.
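The non-recursive $$ expansion described in the feature list above can be sketched in Python. This is an illustration only, not Condor's implementation: the function name expand_dollar_dollar, the dictionary standing in for a machine ClassAd, and the restriction to the simple $$(Name) form are all assumptions made for the example.

```python
import re

# Matches the simple $$(Name) form only (an assumption of this sketch).
_MACRO = re.compile(r"\$\$\(([^)]+)\)")


def expand_dollar_dollar(text, machine_ad):
    """Expand $$(Name) macros in a single, non-recursive pass.

    machine_ad is a plain dict standing in for a machine ClassAd
    (hypothetical). $$(DOLLARDOLLAR) yields a literal "$$".
    A name missing from machine_ad raises KeyError in this sketch.
    """
    def substitute(match):
        name = match.group(1)
        if name == "DOLLARDOLLAR":
            return "$$"
        return str(machine_ad[name])

    # re.sub scans the input once; substituted text is never rescanned,
    # so "$$" characters inside a substituted value are inserted literally.
    return _MACRO.sub(substitute, text)


if __name__ == "__main__":
    ad = {"Arch": "INTEL", "Tag": "$$(Arch)"}
    print(expand_dollar_dollar("run.$$(Arch) costs $$(DOLLARDOLLAR)5", ad))
    print(expand_dollar_dollar("$$(Tag)", ad))
```

Because the substitution is done in one pass, the value of Tag in the usage example is inserted as-is rather than being expanded a second time.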
Configuration Variable Additions and Changes:
- The SECONDARY_COLLECTOR_LIST configuration variable has
been removed.
Sites relying on this variable should instead use the configuration
variable COLLECTOR_HOST to
define a list of condor_collector daemon hosts.
- Added new configuration variables START_LOCAL_UNIVERSE
and START_SCHEDULER_UNIVERSE for the condor_schedd daemon.
These boolean expressions default to True.
START_LOCAL_UNIVERSE is relevant only to local universe jobs.
START_SCHEDULER_UNIVERSE is relevant only to scheduler
universe jobs.
These new variables allow an administrator to define
a START expression specific to these jobs.
The expression is evaluated
against the job's ClassAd before the Requirements expression.
- Added new configuration variables SCHEDD_INTERVAL_TIMESLICE
and PERIODIC_EXPR_TIMESLICE. These configuration variables
address a scalability issue for very large job queues.
Previously, the condor_schedd daemon handled an activity related
to counting jobs, as well as the activity related to evaluating
periodic expressions for jobs, at a fixed time interval of 5 minutes.
With large job queues, the fraction of the condor_schedd daemon
execution time devoted to these two activities became excessive,
such that it could be doing little else.
The fixed time interval is now gone, and Condor measures the amount
of time spent on the two activities, using these new configuration
variables to calculate an appropriate time interval.
Each is a floating point value within the range
(noninclusive) 0.0 to 1.0.
Each determines the maximum fraction of the time interval that the
condor_schedd daemon will spend on the respective
activity.
SCHEDD_INTERVAL_TIMESLICE defaults to the value 0.05,
such that the calculated time interval will be 20 times the amount
of time spent on the counting jobs activity.
PERIODIC_EXPR_TIMESLICE defaults to the value 0.01,
such that the calculated time interval will be 100 times the amount
of time spent on the periodic expression evaluation activity.
- Added new configuration variable
USE_CLONE_TO_CREATE_PROCESSES, relevant only to the
Intel Linux platform.
This boolean value defaults to True, and it results in scalability
improvements for Condor processes using large amounts of memory.
These processes may clone themselves instead of forking themselves.
An example of the improvement occurs for a condor_schedd
daemon with a lot of jobs in the queue.
- Added new configuration variable STARTER_UPLOAD_TIMEOUT,
which allows a configurable time (in seconds) for a timeout used by the
condor_starter.
The default value of 200 seconds replaces the previously hard coded
value of 20 seconds.
This timeout limits how long the condor_starter may take to upload output files to the
condor_shadow upon job exit before the job is considered to have failed.
The default value should be sufficient for serial jobs.
For parallel jobs, it may need to
be increased if there are many large output files.
- Added new configuration variable DAGMAN_ABORT_ON_SCARY_SUBMIT.
This boolean variable defaults to True, and causes
condor_dagman to abort the DAG on "scary" submit events.
These are submit events in which
the Condor ID of the event does not match the expected value.
Previously, condor_dagman printed a warning, but continued.
To restore Condor to the previous behavior,
set DAGMAN_ABORT_ON_SCARY_SUBMIT to False.
- Added new configuration variable NEGOTIATOR_MATCH_EXPRS.
It causes the condor_negotiator to
insert expressions into the matched ClassAd. See
the definition of this configuration variable for details.
- Added new configuration variable
DEDICATED_EXECUTE_ACCOUNT_REGEXP to replace the retired
EXECUTE_LOGIN_IS_DEDICATED,
because EXECUTE_LOGIN_IS_DEDICATED could not
handle a policy where some jobs run as the job owner and others run as
dedicated execution accounts. Also added support for
the existing configuration variable
STARTER_ALLOW_RUNAS_OWNER under Unix. See
Section 3.3.7 and
Section 3.6.12 for more information.
- Added new configuration variable USE_PROCD.
This boolean variable defaults to False for the
condor_master, and True for all other daemons.
When True, the daemon will use the
condor_procd program to monitor process families.
When False, a daemon will execute process family
monitoring logic on its own.
The condor_procd is more scalable and is also an essential
piece in the ongoing privilege separation effort. The disadvantage of
using the condor_procd is that it is newer, less-hardened code.
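The interval calculation described above for SCHEDD_INTERVAL_TIMESLICE and PERIODIC_EXPR_TIMESLICE amounts to dividing the measured time by the timeslice fraction. The Python sketch below shows that arithmetic under stated assumptions: the function name next_interval is hypothetical, and any minimum or maximum bounds the condor_schedd may apply are not modeled.

```python
def next_interval(seconds_spent, timeslice):
    """Return the interval (in seconds) between runs of an activity so
    that the activity uses at most `timeslice` of the schedd's time.

    Hypothetical helper: it mirrors the arithmetic in the release notes
    (0.05 gives 20 times the time spent, 0.01 gives 100 times).
    """
    if not 0.0 < timeslice < 1.0:
        raise ValueError("timeslice must be strictly between 0.0 and 1.0")
    return seconds_spent / timeslice


if __name__ == "__main__":
    # Counting jobs took 2 seconds, default SCHEDD_INTERVAL_TIMESLICE.
    print(next_interval(2.0, 0.05))
    # Periodic expression evaluation took 3 seconds, default PERIODIC_EXPR_TIMESLICE.
    print(next_interval(3.0, 0.01))
```

With this scheme the interval grows with the size of the job queue, so the fraction of time spent on each activity stays bounded instead of increasing as it did with the fixed 5 minute interval.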
Bugs Fixed:
- On Unix systems, Condor can now handle file descriptors larger than
FD_SETSIZE when using the select system call. Previously, file descriptors
larger than FD_SETSIZE would cause memory corruption and crashes.
- When an update to the condor_collector from the
condor_startd is lost, it is possible for multiple claims to the
same resource to be handed out by the condor_negotiator. This is
still true. What is fixed is that these multiple claims will not
result in mutual annihilation of the various attempts to use the
resource. Instead, the first claim to be successfully requested will
proceed and the others will be rejected.
- condor_glidein was setting PREEN_INTERVAL=0 in the default
configuration, but this is no longer a legal value, as of 6.9.2.
- condor_glidein was not setting necessary configuration parameters
for condor_procd in the default glidein configuration.
- In 6.9.2, Condor daemons crashed after failing to authenticate a
network connection.
- condor_status will now accurately report the ActvtyTime
(activity time) value in Condor pools where not all machines are in
the same timezone, or if there is clock-skew between the hosts.
- Fixed the known issue in Condor 6.9.2 where using the
EXECUTE_LOGIN_IS_DEDICATED setting on UNIX platforms would
cause the condor_procd to crash.
- Failure when activating a COD claim no longer will result in an
opportunistic job running on the same condor_startd being left
suspended. This problem was most likely to be seen when using the
GLEXEC_STARTER feature.
- In Condor 6.9.2 for Tru64 UNIX, the condor_master would
immediately fail if started as root. This problem has been fixed.
- Condor 6.9.2 introduced a problem where the condor_master
would fail if started as root with the UID part of the
CONDOR_IDS parameter set to 0 (root). This issue has been
fixed.
Known Bugs:
- The 6.9.3 condor_schedd daemon incorrectly handles jobs with leases
(true by default for vanilla, java, and parallel universe jobs) when
shutting down gracefully. These jobs are allowed to continue running,
but when the condor_schedd daemon is started back up, it fails to reconnect
to them. The result is that the orphaned jobs are left running for
the duration of the job's lease time (a default time of 20 minutes).
The state of the jobs in the restarted queue is independent of any
orphaned running jobs, so these queued jobs may begin running on another
machine while orphans are still running.
- condor_q -format in 6.9.3 does not work with expressions. It
behaves as if the expression evaluates to an undefined result.
Version 6.9.2
Release Notes:
- As part of ongoing security enhancements, Condor now has a
new, required daemon: condor_procd. This daemon is
automatically started by the condor_master; you do not need to
add it to DAEMON_LIST.
However, you must be certain to update the condor_master
if you update any of the other Condor daemons.
- Some configuration settings that previously accepted 0 no
longer do so. Instead, the daemon using the setting will exit
after writing an error message listing the acceptable range to its log.
For these settings, 0 was equivalent to requesting the default.
As this was undocumented and confusing behavior, it is no longer
present. To request that a setting use its default, either comment it
out, or set it to nothing ("EXAMPLE_SETTING=").
Settings impacted include, but are not limited to:
MASTER_BACKOFF_CONSTANT,
MASTER_BACKOFF_CEILING,
MASTER_RECOVER_FACTOR,
MASTER_UPDATE_INTERVAL,
MASTER_NEW_BINARY_DELAY,
PREEN_INTERVAL,
SHUTDOWN_FAST_TIMEOUT,
SHUTDOWN_GRACEFUL_TIMEOUT,
MASTER_<name>_BACKOFF_CONSTANT,
MASTER_<name>_BACKOFF_CEILING.
- Version 1.4.1 of the Generic Connection Broker (GCB) is
now used for building Condor. This version of GCB fixes a timing bug
where a client may incorrectly think a network connection has been established,
and also guards against an unresponsive client causing a denial of
service by the broker.
For more information about GCB, see section 3.7.3.
New Features:
- On UNIX, an execute-side Condor installation can run without
root privileges and still execute jobs as different users, properly
clean up when a job exits, and correctly enforce policies specified by
the Condor administrator and resource owners. Privileged functionality
has been separated into a well-defined set of functions provided by a
setuid helper program. This feature currently does not work for the
standard or PVM universes.
- Added support for EmailAttributes in the parallel universe.
Previously, it was only valid in the vanilla and standard universes.
- Added configuration parameter DEDICATED_SCHEDULER_USE_FIFO
which defaults to true. When false, the dedicated scheduler will
use a best-fit algorithm to schedule parallel jobs. This setting is
not recommended, as it can cause starvation. When true, the dedicated
scheduler will schedule jobs in a first-in, first-out manner.
- Added -dump to condor_config_val which will print out
all of the macros defined in any of the configuration files found by
the program.
condor_config_val -dump -v will augment the output
with exactly what line and in what file each configuration variable
was found.
NOTE: The output format of the -dump option will most likely
change in a future revision of Condor.
- Node names in condor_dagman DAG files can now be DAG
keywords, except for PARENT and CHILD.
- Improved the log message when OnExitRemove or
OnExitHold evaluates to UNDEFINED.
- Added the DAGMAN_ON_EXIT_REMOVE configuration macro,
which allows customization of the OnExitRemove expression
generated by condor_submit_dag.
- When using GCB, Condor can now be told to choose from a list of
brokers. NET_REMAP_INAGENT is now a space and comma separated
list of brokers. On start up, the condor_master will query all of the
brokers and pick the least-used one for it and its children to use. If
none of the brokers are operational, then the condor_master will wait
until one is working. This waiting can be disabled by setting
MASTER_WAITS_FOR_GCB_BROKER to FALSE in the configuration file.
If the chosen broker fails and recovery is not possible or another broker
is available, the condor_master will restart all of the daemons.
- When using GCB, communications between parent and child
Condor daemons on the same host no longer use the GCB broker.
This improves scalability and also allows a single host to
continue functioning if the GCB broker is unavailable.
- The condor_schedd now uses non-blocking methods to send the
"alive" message to the condor_startd when renewing the job lease.
This prevents the condor_schedd from blocking for 20 seconds while
trying to connect to a machine that has become disconnected from the
network.
- condor_advertise can read the ClassAd to be advertised from
standard input.
- Unix Condor daemons now reinitialize their DNS
configuration (e.g. IP addresses of the name servers) on reconfig.
- A configuration file for condor_dagman can now be specified
in a DAG file or on the condor_submit_dag command line.
- Added condor_cod option -lease for creation of COD claims
with a limited duration lease. This provides automatic cleanup of COD
claims that are not renewed by the user. The default lease is infinitely
long, so existing behavior is unchanged unless -lease is explicitly
specified.
- Added condor_cod command delegate_proxy which will
delegate an x509 proxy to the requested COD claim.
This is primarily useful for sites wishing to use glexec to spawn the
condor_starter used for COD jobs.
The new command optionally takes an -x509proxy argument to
specify the proxy file.
If this argument is not present, condor_cod will search for the
proxy using the same logic as condor_submit does.
- STARTD_DEBUG can now be empty, indicating a default, minimal
log level. It now defaults to empty.
Previously it had to be non-empty and defaulted to include D_COMMAND.
- The addition of the condor_procd daemon means that all process
family monitoring and control logic is no longer replicated in each
Condor daemon that needs it. This improves Condor's scalability,
particularly on machines with many processes.
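The broker selection described above for NET_REMAP_INAGENT (a space and comma separated list, with the least-used operational broker chosen at start up) can be sketched in Python as follows. This is a rough illustration, not Condor or GCB code: the function names and the usage mapping, which stands in for the result of querying each broker, are hypothetical.

```python
import re


def parse_broker_list(value):
    """Split a space and comma separated broker list into addresses."""
    return [item for item in re.split(r"[\s,]+", value.strip()) if item]


def pick_least_used(brokers, usage):
    """Pick the least-used operational broker.

    `usage` maps a broker address to a usage count, or to None when the
    broker did not respond (a hypothetical stand-in for querying the
    brokers). Returns None when no broker is operational, the case in
    which the condor_master would wait for one to become available.
    """
    operational = [b for b in brokers if usage.get(b) is not None]
    if not operational:
        return None
    return min(operational, key=lambda b: usage[b])


if __name__ == "__main__":
    brokers = parse_broker_list("192.0.2.10, 192.0.2.11 192.0.2.12")
    usage = {"192.0.2.10": 40, "192.0.2.11": None, "192.0.2.12": 7}
    print(pick_least_used(brokers, usage))
```

The addresses in the usage example come from a documentation address range and carry no meaning beyond the illustration.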
Bugs Fixed:
- Under various circumstances, Condor 6.9.1 daemons would abort
with the message, "ERROR: Unexpected pending status for fake message
delivery." A specific example is when OnExitRemove or
OnExitHold evaluated to UNDEFINED. This caused the
condor_schedd to abort.
- In Condor 6.9.1, the condor_schedd would die during startup
when trying to reconnect to running jobs for which the condor_schedd
could not find a startd ClassAd. This would happen shortly after
logging the following message: "Could not find machine ClassAds for
one or more jobs. May be flocking, or machine may be down.
Attempting to reconnect anyway."
- Improved Condor's validity checking of configuration values.
For example, in some cases where Condor was expecting an integer but
was given an expression such as 12*60, it would silently interpret
this as 12. Such cases now result in the Condor daemon exiting
after issuing an error message into the log file.
- When sending a WM_CLOSE message to a process on Windows,
Condor daemons now invoke the helper program condor_softkill to do
so. This prevents the daemon from needing to temporarily switch away
from its dedicated service Window Station and Desktop. It also fixes a
bug where daemons would leak Window Station and Desktop handles. This
was mainly a problem in the condor_schedd when running many scheduler
universe jobs.
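The configuration validity fix above (a value such as 12*60 being silently read as 12) can be illustrated with two small Python functions. This is a sketch only: the names lenient_int and strict_int are hypothetical, and the assumption that the old behavior came from reading just the leading digits is a guess made for the example, not something the release notes state.

```python
import re


def lenient_int(text):
    """Read only the leading integer, ignoring whatever follows.

    Models the old, surprising behavior (assumed mechanism).
    """
    match = re.match(r"\s*([+-]?\d+)", text)
    if match is None:
        raise ValueError(f"expected an integer, got {text!r}")
    return int(match.group(1))


def strict_int(text):
    """Accept the value only if the whole string is an integer."""
    try:
        return int(text.strip())
    except ValueError:
        raise ValueError(f"expected an integer, got {text!r}") from None


if __name__ == "__main__":
    print(lenient_int("12*60"))  # trailing "*60" is silently dropped
    try:
        strict_int("12*60")
    except ValueError as err:
        print("rejected:", err)
```

Rejecting the whole value, as strict_int does, matches the new behavior in spirit: the problem is reported instead of a truncated number being used.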
Known Bugs:
- condor_ glidein generates a default config file that sets
PREEN_INTERVAL to an invalid value (0). To fix this,
remove the setting of PREEN_INTERVAL.
- There are a couple of known issues with Condor's
GLEXEC_STARTER feature when used in conjunction with
COD. First, the condor_cod tool invoked with the
delegate_proxy option will sometimes incorrectly report that the
operation has failed. In addition, the GLEXEC_STARTER
feature will not work properly with COD unless the UID that each
COD job runs as is different from the UID of the opportunistic job and
of any other COD jobs that are running on the execute machine when the
COD claim is activated.
- The EXECUTE_LOGIN_IS_DEDICATED feature has been found
to be broken on UNIX platforms. Its use will cause the condor_procd
to crash, bringing down the other Condor daemons with it.
Version 6.9.1
Release Notes:
- The 6.9.1 release contains all of the bug fixes and enhancements
from the 6.8.x series up to and including version 6.8.3.
- Version 1.4.0 of the Generic Connection Broker (GCB) library is
now used for building Condor, and it is the 1.4.0 versions of the
gcb_broker and gcb_relay_server programs that are
included in this release.
This version of GCB includes enhancements used by Condor
along with a new GCB-related command-line tool:
gcb_broker_query.
Condor 6.9.1 will not work properly with older versions of the
gcb_broker or gcb_relay_server.
For more information about GCB, see section 3.7.3.
New Features:
- Improved the performance of the ClassAd matching algorithm,
which speeds up the condor_schedd and other daemons.
- Improved the scalability of the algorithm used by
the condor_schedd daemon to find runnable jobs.
This makes a noticeable difference in condor_schedd daemon performance
when there are on the order of thousands of jobs in the queue.
- The D_COMMAND debugging level has been enhanced to
log many more messages.
- Updated the version of DRMAA, which contains several
improvements regarding scalability and race conditions.
- Added the DAGMAN_SUBMIT_DEPTH_FIRST configuration macro,
which causes condor_dagman to submit ready nodes in more-or-less depth-first
order, if set to True. The default behavior is to submit
the ready nodes in breadth-first order.
- Added configuration parameter USE_PROCESS_GROUPS.
If it is set to False,
then Condor daemons on Unix machines will not create new
sessions or process groups. This is intended for use with Glidein, as
we have had reports that some batch systems cannot properly track jobs that
create new process groups. The default value is True.
- The default value for the submit file command
copy_to_spool has been changed to False, because copying
the executable to the spool directory for each job (or job cluster) is almost
never desired. Previously, the default was True in all
cases, except for grid universe jobs and remote submissions.
- More types of file transfer errors now result in the job going
on hold, with a specific error message about what went wrong. The new
cases involve failures to write output files to disk on the submit
side (for example, when the disk is full).
As always, the specific error number is
recorded in HoldReasonSubCode, so you can enforce an automated
error handling policy using periodic_release or
periodic_remove.
- Added the <SUBSYS>_DAEMON_AD_FILE
configuration variable, which is similar to the
<SUBSYS>_ADDRESS_FILE.
This new variable will be used in future versions of Condor, but is
not necessary for 6.9.1.
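The difference between the breadth-first and depth-first submission orders mentioned for DAGMAN_SUBMIT_DEPTH_FIRST can be sketched with a toy model. This is only an illustration, not condor_dagman's implementation: it assumes a single ready queue where newly ready nodes go to the back (breadth-first) or to the front (depth-first), and it treats a node as finished the moment it is submitted. The function name submit_order and the dictionary layout are hypothetical.

```python
from collections import deque


def submit_order(children, depth_first=False):
    """Return the order in which a toy scheduler would submit DAG nodes.

    children maps every node to the list of nodes that depend on it.
    """
    # Count the unfinished parents of every node.
    pending = {node: 0 for node in children}
    for kids in children.values():
        for kid in kids:
            pending[kid] = pending.get(kid, 0) + 1

    # Nodes without parents are ready immediately.
    ready = deque(node for node in children if pending[node] == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        newly_ready = []
        for kid in children.get(node, []):
            pending[kid] -= 1
            if pending[kid] == 0:
                newly_ready.append(kid)
        if depth_first:
            # Newly ready nodes jump ahead of everything already waiting.
            ready.extendleft(reversed(newly_ready))
        else:
            # Newly ready nodes wait behind everything already queued.
            ready.extend(newly_ready)
    return order


dag = {"A": ["B", "C"], "B": ["D"], "C": ["E"], "D": [], "E": []}
print(submit_order(dag))                    # ['A', 'B', 'C', 'D', 'E']
print(submit_order(dag, depth_first=True))  # ['A', 'B', 'D', 'C', 'E']
```

In the depth-first run, node D is submitted before C because it became ready as a result of B, so one branch of the DAG is followed to its end before the next one starts.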
Bugs Fixed:
- Fixed a bug in the condor_master so that it will now send obituary
e-mails when it kills child processes that it considers hung.
- condor_configure used to always make a personal Condor with
-install even when -type called for only execute or
submit types. Now, condor_configure honors the -type
argument, even when using -install.
If -type is not specified, the default is still to install a
full personal Condor with the following daemons:
condor_master, condor_collector,
condor_negotiator, condor_schedd, condor_startd.
- While removing, putting on hold, or vacating a large number of
jobs, it was possible for the condor_schedd and the condor_shadow to
temporarily deadlock with each other. This has been fixed under Unix,
but not yet under Windows.
- Communication from a condor_schedd to a condor_startd
now occurs in a nonblocking manner.
This fixes the problem of the condor_schedd blocking
when the claimed machine running the condor_startd
cannot be reached, for example because the machine is turned off.
Known Bugs:
- Under various circumstances, Condor 6.9.1 daemons abort
with the message, ``ERROR: Unexpected pending status for fake message
delivery.'' A specific example is when OnExitRemove or
OnExitHold evaluates to UNDEFINED, which causes the
condor_schedd to abort.
- In Condor 6.9.1, the condor_schedd will die during startup
when trying to reconnect to running jobs for which the condor_schedd
cannot find a startd ClassAd. This happens shortly after
logging the following message: ``Could not find machine ClassAds for
one or more jobs. May be flocking, or machine may be down.
Attempting to reconnect anyway.''
Version 6.9.0
Release Notes:
- The 6.9.0 release contains all of the bug fixes and enhancements
from the 6.8.x series up to and including version 6.8.2.
New Features:
- Preliminary support for using glexec on execute machines
has been added. This feature causes the condor_startd to spawn the
condor_starter as the user that glexec determines based on
the user's GSI credential.
- A ``per-job history files'' feature has been added to the
condor_schedd. When enabled, this will cause the condor_schedd to
write out a copy of each job's ClassAd when it leaves the job
queue. The directory to place these files in is determined by the
parameter PER_JOB_HISTORY_DIR. It is the responsibility of
whatever external entity (for example, an accounting or monitoring system) is
using these files to remove them as it completes its processing.
- The condor_chirp command now supports writing messages to the user log.
- condor_chirp getattr and putattr now send all ClassAd getattr
and putattr commands to the proc 0 ClassAd, which allows parallel jobs
with multiple procs to use proc 0 as a scratch pad.
- Parallel jobs now support an AllRemoteHosts attribute,
which lists all the hosts across all procs in a cluster.
- The DAGMAN_ABORT_DUPLICATES configuration macro (which causes
condor_dagman to abort itself if it detects another condor_dagman
running on the same DAG) now defaults to True instead of
False.
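Because the per-job history files described above are left for an external system to clean up, a consumer might look roughly like the following Python sketch. It is a minimal illustration under stated assumptions: the function name drain_per_job_history and the handle callback are hypothetical, the directory is whatever PER_JOB_HISTORY_DIR points to, and the sketch makes no assumption about how the files are named. It also does not guard against reading a file that is still being written.

```python
import os


def drain_per_job_history(history_dir, handle):
    """Process every regular file in history_dir, then delete it.

    handle(text) is a caller-supplied callback that receives the file
    contents. A file is removed only after handle returns without
    raising, so a failed run leaves it in place for the next pass.
    Returns the names of the files that were processed.
    """
    processed = []
    for name in sorted(os.listdir(history_dir)):
        path = os.path.join(history_dir, name)
        if not os.path.isfile(path):
            continue  # skip subdirectories and other non-file entries
        with open(path, "r") as f:
            text = f.read()
        handle(text)       # e.g. feed an accounting or monitoring system
        os.remove(path)    # cleanup is the consumer's responsibility
        processed.append(name)
    return processed
```

A periodic task could call drain_per_job_history with the configured directory and a callback that records each job's ClassAd text wherever the accounting system needs it.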
Bugs Fixed:
Known Bugs:
condor-admin@cs.wisc.edu