The following sections describe how to set up Condor for use in a
number of special environments or configurations.
See section 3.4 for installation instructions for
the various ``contrib modules'' that you can optionally download and
install in your pool.
If you are using AFS at your site, be sure to read section 3.3.5 on ``Shared Filesystem Config Files Entries'' for details on configuring your machines to interact with and use shared filesystems, AFS in particular.
Condor does not currently have a way to authenticate itself to AFS. This is true of the Condor daemons that would like to authenticate as AFS user Condor, and the condor_shadow, which would like to authenticate as the user who submitted the job it is serving. Since neither of these things can happen yet, there are a number of special things people who use AFS with Condor must do. Some of this must be done by the administrator(s) installing Condor. Some of this must be done by Condor users who submit jobs.
The most important thing is that since the Condor daemons can't authenticate to AFS, the LOCAL_DIR (and its subdirectories like ``log'' and ``spool'') for each machine must be either writable to unauthenticated users, or must not be on AFS. The first option is a VERY bad security hole, so you should NOT have your local directory on AFS. If you've got NFS installed as well and want to have your LOCAL_DIR for each machine on a shared file system, use NFS. Otherwise, you should put the LOCAL_DIR on a local partition on each machine in your pool. This means that you should run condor_install to install your release directory and configure your pool, setting the LOCAL_DIR parameter to some local partition. When that's complete, log into each machine in your pool and run condor_init to set up the local Condor directory.
The RELEASE_DIR, which holds all the Condor binaries, libraries and scripts, can and probably should be on AFS. None of the Condor daemons need to write to these files; they just need to read them. So, you just have to make your RELEASE_DIR world readable and Condor will work just fine. This makes it easier to upgrade your binaries at a later date, means that your users can find the Condor tools in a consistent location on all the machines in your pool, and lets you keep the Condor config files in a centralized location. This is what we do at UW-Madison's CS department Condor pool and it works quite well.
Finally, you might want to set up some special AFS groups to help your users deal with Condor and AFS better (you'll want to read the section below anyway, since you're probably going to have to explain this stuff to your users). Basically, if you can, create an AFS group that contains all unauthenticated users but that is restricted to a given host or subnet. You're supposed to be able to make these host-based ACLs with AFS, but we've had some trouble getting that working here at UW-Madison. What we have instead is a special group for all machines in our department. So, the users here just have to make their output directories on AFS writable to any process running on any of our machines, instead of any process on any machine with AFS on the Internet.
The condor_shadow process runs on the machine where you submitted your Condor jobs and performs all file system access for your jobs. Because this process isn't authenticated to AFS as the user who submitted the job, it will not normally be able to write any output. So, when you submit jobs, any directories where your job will be creating output files will need to be world writable (to non-authenticated AFS users). In addition, if your program writes to stdout or stderr, or you're using a user log for your jobs, those files will need to be in a directory that's world-writable.
Any input for your job, either the file you specify as input in your submit file, or any files your program opens explicitly, needs to be world-readable.
Some sites may have special AFS groups set up that can make this unauthenticated access to your files less scary. For example, there's supposed to be a way with AFS to grant access to any unauthenticated process on a given host. That way, you only have to grant write access to unauthenticated processes on your submit machine, instead of any unauthenticated process on the Internet. Similarly, unauthenticated read access could be granted only to processes running on your submit machine. Ask your AFS administrators about the existence of such AFS groups and details of how to use them.
The other solution to this problem is to just not use AFS at all. If you have disk space on your submit machine in a partition that is not on AFS, you can submit your jobs from there. While the condor_shadow is not authenticated to AFS, it does run with the effective UID of the user who submitted the jobs. So, on a local (or NFS) file system, the condor_shadow will be able to access your files normally, and you won't have to grant any special permissions to anyone other than yourself. If the Condor daemons are not started as root however, the shadow will not be able to run with your effective UID, and you'll have a similar problem as you would with files on AFS. See the section on ``Running Condor as Non-Root'' for details.
Beginning with Condor version 6.0.1, you can use a single, global config file for all platforms in your Condor pool, with only platform-specific settings placed in separate files. This greatly simplifies administration of a heterogeneous pool by allowing you to change platform-independent, global settings in one place, instead of separately for each platform. This is made possible by the LOCAL_CONFIG_FILE parameter being treated by Condor as a list of files, instead of a single file. Of course, this will only help you if you are using a shared filesystem for the machines in your pool, so that multiple machines can actually share a single set of configuration files.
If you have multiple platforms, you should put all platform-independent settings (the vast majority) into your regular condor_config file, which would be shared by all platforms. This global file would be the one that is found with the CONDOR_CONFIG environment variable, user condor's home directory, or /etc/condor/condor_config.
You would then set the LOCAL_CONFIG_FILE parameter from that global config file to specify both a platform-specific config file and optionally, a local, machine-specific config file (this parameter is described in section 3.3.2 on ``Condor-wide Config File Entries'').
The order in which you specify files in the LOCAL_CONFIG_FILE parameter is important, because settings in files at the beginning of the list are overridden if the same settings occur in files later in the list. So, if you specify the platform-specific file and then the machine-specific file, settings in the machine-specific file would override those in the platform-specific file (which is probably what you want).
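The override rule can be pictured with a small sketch. This is not Condor's own parser, only an illustration of "later files win"; the file names and the `/bin/mail` and `MASTER, STARTD` values are hypothetical.

```python
def merge_config_files(files):
    """Merge settings from an ordered list of (file_name, settings) pairs.

    Files are processed in list order, so a setting that appears in a
    later file replaces the same setting from an earlier file.
    """
    merged = {}
    for _file_name, settings in files:
        merged.update(settings)
    return merged


# Hypothetical contents, given in LOCAL_CONFIG_FILE order:
# the platform-specific file first, then the machine-specific file.
platform_file = (
    "condor_config.INTEL.LINUX",
    {"MAIL": "/bin/mail", "DAEMON_LIST": "MASTER, STARTD"},
)
machine_file = ("vulture.local", {"MAIL": "/usr/local/bin/mail"})

effective = merge_config_files([platform_file, machine_file])
print(effective["MAIL"])         # value from the machine-specific file
print(effective["DAEMON_LIST"])  # only set in the platform-specific file
```

Reversing the list order would let the platform-specific file override the machine-specific one, which is usually not what you want.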
To specify the platform-specific file, you could simply use the ARCH and OPSYS parameters which are defined automatically by Condor. For example, if you had Intel Linux machines, Sparc Solaris 2.6 machines, and SGIs running IRIX 6.x, you might have files named:
condor_config.INTEL.LINUX
condor_config.SUN4x.SOLARIS26
condor_config.SGI.IRIX6
Then, assuming these three files were in the directory held in the ETC macro, and you were using machine-specific config files in the same directory, named by each machine's hostname, your LOCAL_CONFIG_FILE parameter would be set to:
LOCAL_CONFIG_FILE = $(ETC)/condor_config.$(ARCH).$(OPSYS), \
$(ETC)/$(HOSTNAME).local
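To see what this setting resolves to on a particular machine, here is a rough sketch of the macro substitution. It is a simplified stand-in for Condor's own expansion; the ETC path and the short hostname are assumed values for illustration.

```python
import re


def expand_macros(value, macros):
    """Replace each $(NAME) in value with macros[NAME].

    Names that are not defined are left untouched in this sketch.
    """
    def substitute(match):
        return macros.get(match.group(1), match.group(0))

    return re.sub(r"\$\((\w+)\)", substitute, value)


def local_config_files(macros):
    """Expand the LOCAL_CONFIG_FILE value shown above into a list of paths."""
    template = "$(ETC)/condor_config.$(ARCH).$(OPSYS), $(ETC)/$(HOSTNAME).local"
    return [part.strip() for part in expand_macros(template, macros).split(",")]


# Assumed values for one Intel Linux machine.
macros = {
    "ETC": "/usr/local/condor/etc",
    "ARCH": "INTEL",
    "OPSYS": "LINUX",
    "HOSTNAME": "vulture",
}
for path in local_config_files(macros):
    print(path)
```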
Alternatively, if you are using AFS, you can use an ``@sys link'' to specify the platform-specific config file and let AFS resolve this link differently on different systems. For example, perhaps you have a soft link named ``condor_config.platform'' that points to ``condor_config.@sys''. In this case, your files might be named:
condor_config.i386_linux2
condor_config.sun4x_56
condor_config.sgi_64
condor_config.platform -> condor_config.@sys
and your LOCAL_CONFIG_FILE parameter would be set to:
LOCAL_CONFIG_FILE = $(ETC)/condor_config.platform, \
$(ETC)/$(HOSTNAME).local
The only settings that are truly platform-specific are:
Reasonable defaults for all of these settings will be found in the default config files inside a given platform's binary distribution (except the RELEASE_DIR, since it is up to you where you want to install your Condor binaries and libraries). If you have multiple platforms, simply take one of the condor_config files you get from either running condor_install or from the <release_dir>/etc/examples/condor_config.generic file, take these settings out and save them into a platform-specific file, and install the resulting platform-independent file as your global config file. Then, find the same settings from the config files for any other platforms you are setting up and put them in their own platform-specific files. Finally, set your LOCAL_CONFIG_FILE parameter to point to the appropriate platform-specific file, as described above.
Not even all of these settings are necessarily going to be different. For example, if you have installed a mail program that understands the ``-s'' option in /usr/local/bin/mail on all your platforms, you could just set MAIL to that in your global file and not define it anywhere else. If you've only got Digital Unix and IRIX machines, the DAEMON_LIST will be the same for each, so there's no reason not to put that in the global config file (or, if you have no IRIX or Digital Unix machines, DAEMON_LIST won't have to be platform-specific either).
It is certainly possible that you might want other settings to be platform-specific as well. Perhaps you want a different startd policy for one of your platforms. Maybe different people should get the email about problems with different platforms. There's nothing hard-coded about any of this. What you decide should be shared and what should not is entirely up to you and how you lay out your config files.
Since the LOCAL_CONFIG_FILE parameter can be an arbitrary list of files, you can even break up your global, platform-independent settings into separate files. In fact, your global config file might only contain a definition for LOCAL_CONFIG_FILE, and all other settings would be handled in separate files.
You might want to give different people permission to change different Condor settings. For example, if you wanted some user to be able to change certain settings, but nothing else, you could specify those settings in a file which was early in the LOCAL_CONFIG_FILE list, give that user write permission on that file, then include all the other files after that one. That way, if the user was trying to change settings she/he shouldn't, they would simply be overridden.
As you can see, this mechanism is quite flexible and powerful. If you have very specific configuration needs, they can probably be met by using file permissions, the LOCAL_CONFIG_FILE setting, and your imagination.
The Checkpoint Server maintains a repository for checkpoint files. Using checkpoint servers reduces the disk requirements of submitting machines in the pool, since the submitting machines no longer need to store checkpoint files locally. Checkpoint server machines should have a large amount of disk space available, and they should have a fast connection to machines in the Condor pool.
If your spool directories are on a network file system, then checkpoint files will make two trips over the network: one between the submitting machine and the execution machine, and a second between the submitting machine and the network file server. If you install a checkpoint server and configure it to use the server's local disk, the checkpoint will travel only once over the network, between the execution machine and the checkpoint server. You may also obtain checkpointing network performance benefits by using multiple checkpoint servers, as discussed below.
NOTE: It is a good idea to pick very stable machines for your checkpoint servers. If individual checkpoint servers crash, the Condor system will continue to operate, although poorly. While the Condor system will recover from a checkpoint server crash as best it can, there are two problems that can (and will) occur:
for details).
This parameter represents the maximum amount of CPU time you are
willing to discard by starting a job over from scratch if the
checkpoint server is not responding to requests.
The location of checkpoints changes upon the installation
of a checkpoint server.
Because of this configuration change,
currently queued jobs that already have checkpoints
will not be able to find them.
Such jobs remain queued indefinitely (never running)
because their checkpoints cannot be located.
It is therefore best to
either remove jobs from the queues or let them complete
before installing a checkpoint server.
It is advisable to shut your pool down before doing any
maintenance on your checkpoint server.
See section 3.10 for details on shutting
down your pool.
A graduated installation of the checkpoint server may be accomplished by configuring submit machines as their queues empty.
To install a checkpoint server, download the appropriate binary contrib module for the platform(s) on which your server will run. Uncompress and untar the file to result in a directory that contains a README, ckpt_server.tar, and so on. The file ckpt_server.tar acts much like the release.tar file from a main release. This archive contains the files:
sbin/condor_ckpt_server
sbin/condor_cleanckpts
etc/examples/condor_config.local.ckpt.server
These new files are not found in the main release, so you can
safely untar the archive directly into your existing release
directory.
condor_ckpt_server is the checkpoint server binary.
condor_cleanckpts is a script that can be periodically run to
remove stale checkpoint files from your server.
The checkpoint server normally cleans all old files itself.
However, in certain error situations, stale files can be left that are
no longer needed.
You may set up a cron job that calls
condor_cleanckpts every week or so to automate the cleaning up
of any
stale files.
The example configuration file given with the module
is described below.
After unpacking the module, there are three steps to complete. Each is discussed in its own section:
Place settings in the local configuration file of the checkpoint server. The file etc/examples/condor_config.local.ckpt.server contains the needed settings. Insert these into the local configuration file of your checkpoint server machine.
The CKPT_SERVER_DIR must be customized. The CKPT_SERVER_DIR attribute defines where your checkpoint files are to be located. It is better if this is on a very fast local file system (preferably a RAID). The speed of this file system will have a direct impact on the speed at which your checkpoint files can be retrieved from the remote machines.
The other optional settings are:
The rest of these settings are the checkpoint server-specific versions
of the Condor logging entries, as described in
section 3.3.3.
To start the newly configured checkpoint server,
restart Condor on that host to enable
the condor_master to notice the new configuration.
Do this by sending a condor_restart command from any machine
with administrator access to your pool.
See section 3.8 for full details about IP/host-based
security in Condor.
After the checkpoint server is running, you change a few settings in your configuration files to let your pool know about your new server:
It is most convenient to set these parameters in your global configuration file, so they affect all submission machines. However, you may configure each submission machine separately (using local configuration files) if you do not want all of your submission machines to start using the checkpoint server at one time. If USE_CKPT_SERVER is set to FALSE, the submission machine will not use a checkpoint server.
Once these settings are in place, send a
condor_reconfig to all machines in your pool so the changes take
effect.
This is described in section 3.10.2.
It is possible to configure a Condor pool to use multiple checkpoint servers. The deployment of checkpoint servers across the network improves checkpointing performance. In this case, Condor machines are configured to checkpoint to the nearest checkpoint server. There are two main performance benefits to deploying multiple checkpoint servers:
Once you have multiple checkpoint servers running in your pool, the following configuration changes are required to make them active.
First, USE_CKPT_SERVER should be set to TRUE (the default) on all
submitting machines where Condor jobs should use a checkpoint server.
Additionally, STARTER_CHOOSES_CKPT_SERVER should be set to
TRUE (the default) on these submitting machines.
When TRUE, this parameter specifies that the checkpoint server
specified by the machine running the job should be used instead of the
checkpoint server specified by the submitting machine.
See section 3.3.6 for more
details.
This allows the job to use the checkpoint server closest to the
machine on which it is running, instead of the server closest to the
submitting machine.
For convenience, set these parameters in the
global configuration file.
Second, set CKPT_SERVER_HOST on each machine. As described, this is set to the full hostname of the checkpoint server machine. In the case of multiple checkpoint servers, set this in the local configuration file of each machine, using the hostname of the checkpoint server nearest to that machine.
Third, send a
condor_reconfig to all machines in the pool so the changes take
effect.
This is described in section 3.10.2.
After completing these three steps, the jobs in your pool will send checkpoints to the nearest checkpoint server. On restart, a job will remember where its checkpoint was stored and get it from the appropriate server. After a job successfully writes a checkpoint to a new server, it will remove any previous checkpoints left on other servers.
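The way these three settings interact when a job writes a checkpoint can be sketched as follows. This is an illustration of the rules described above, not Condor source code; the dictionaries stand in for each machine's configuration, and the hostnames are hypothetical.

```python
def checkpoint_server_for_write(submit_cfg, execute_cfg):
    """Return the checkpoint server a job writes to, or None if unused.

    Both arguments are dicts of config settings. USE_CKPT_SERVER and
    STARTER_CHOOSES_CKPT_SERVER default to True, as described in the text.
    """
    if not submit_cfg.get("USE_CKPT_SERVER", True):
        return None  # the submitting machine does not use a checkpoint server
    if submit_cfg.get("STARTER_CHOOSES_CKPT_SERVER", True):
        # Use the server configured on the machine running the job.
        return execute_cfg["CKPT_SERVER_HOST"]
    # Otherwise fall back to the submitting machine's server.
    return submit_cfg["CKPT_SERVER_HOST"]


# Hypothetical hostnames for two checkpoint servers.
submit_cfg = {"CKPT_SERVER_HOST": "ckpt-a.example.edu"}
execute_cfg = {"CKPT_SERVER_HOST": "ckpt-b.example.edu"}
print(checkpoint_server_for_write(submit_cfg, execute_cfg))
```

The sketch covers only where a new checkpoint is written. On restart, the job reads from whichever server holds its last checkpoint.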
NOTE: If the configured checkpoint server is unavailable, the job will keep trying to contact that server as described above. It will not use alternate checkpoint servers. This may change in future versions of Condor.
The configuration described in the previous section ensures that jobs will always write checkpoints to their nearest checkpoint server. In some circumstances, it is also useful to configure Condor to localize checkpoint read transfers, which occur when the job restarts from its last checkpoint on a new machine. To localize these transfers, we want to schedule the job on a machine which is near the checkpoint server on which the job's checkpoint is stored.
We can say that all of the machines configured to use checkpoint server ``A'' are in ``checkpoint server domain A.'' To localize checkpoint transfers, we want jobs which run on machines in a given checkpoint server domain to continue running on machines in that domain, transferring checkpoint files in a single local area of the network. There are two possible configurations which specify what a job should do when there are no available machines in its checkpoint server domain:
The first step in implementing checkpoint server domains is to include the name of the nearest checkpoint server in the machine ClassAd, so this information can be used in job scheduling decisions. To do this, add the following configuration to each machine:
CkptServer = "$(CKPT_SERVER_HOST)"
STARTD_EXPRS = $(STARTD_EXPRS), CkptServer
For convenience, we suggest that you set these parameters in the global config file. Note that this example assumes that STARTD_EXPRS is defined previously in your configuration. If not, then you should use the following configuration instead:
CkptServer = "$(CKPT_SERVER_HOST)"
STARTD_EXPRS = CkptServer
Now, all machine ClassAds will include a CkptServer attribute, which is the name of the checkpoint server closest to this machine. So, the CkptServer attribute defines the checkpoint server domain of each machine.
To restrict jobs to one checkpoint server domain, we need to modify the jobs' Requirements expression as follows:
Requirements = ((LastCkptServer == TARGET.CkptServer) || (LastCkptServer =?= UNDEFINED))
This Requirements expression uses the LastCkptServer attribute in the job's ClassAd, which specifies where the job last wrote a checkpoint, and the CkptServer attribute in the machine ClassAd, which specifies the checkpoint server domain. If the job has not written a checkpoint yet, the LastCkptServer attribute will be UNDEFINED, and the job will be able to execute in any checkpoint server domain. However, once the job performs a checkpoint, LastCkptServer will be defined and the job will be restricted to the checkpoint server domain where it started running.
If instead we want to allow jobs to transfer to other checkpoint server domains when there are no available machines in the current checkpoint server domain, we need to modify the jobs' Rank expression as follows:
Rank = ((LastCkptServer == TARGET.CkptServer) || (LastCkptServer =?= UNDEFINED))
This Rank expression will evaluate to 1 for machines in the job's checkpoint server domain and 0 for other machines. So, the job will prefer to run on machines in its checkpoint server domain, but if no such machines are available, the job will run in a new checkpoint server domain.
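The difference between the two policies can be seen in a small simulation. This is a simplified model, not the ClassAd evaluator: UNDEFINED is represented by a missing dictionary key, the matchmaker is reduced to picking the highest-ranked candidate, and the machine and server names are hypothetical.

```python
def in_ckpt_domain(job_ad, machine_ad):
    """Mimic (LastCkptServer == TARGET.CkptServer) ||
    (LastCkptServer =?= UNDEFINED) for plain dict ads."""
    last = job_ad.get("LastCkptServer")
    if last is None:  # no checkpoint written yet: any domain is acceptable
        return True
    return last == machine_ad.get("CkptServer")


def pick_machine(job_ad, machines, restrict):
    """Pick a machine for the job.

    restrict=True models the Requirements policy (stay in the domain);
    restrict=False models the Rank policy (prefer the domain).
    """
    if restrict:
        candidates = [m for m in machines if in_ckpt_domain(job_ad, m)]
    else:
        candidates = list(machines)
    if not candidates:
        return None  # the job stays idle until a suitable machine appears
    # Rank is 1 inside the job's domain and 0 outside it.
    return max(candidates, key=lambda m: 1 if in_ckpt_domain(job_ad, m) else 0)


job = {"LastCkptServer": "ckpt-a"}
far = {"Name": "m1", "CkptServer": "ckpt-b"}
near = {"Name": "m2", "CkptServer": "ckpt-a"}

print(pick_machine(job, [far, near], restrict=True)["Name"])   # in-domain machine
print(pick_machine(job, [far], restrict=False)["Name"])        # falls back to m1
print(pick_machine(job, [far], restrict=True))                 # no match
```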
You can automatically append the checkpoint server domain
Requirements or Rank expressions to all STANDARD
universe jobs submitted in your pool using
APPEND_REQ_STANDARD or APPEND_RANK_STANDARD.
See section 3.3.13 for more details.
The condor_schedd may be configured to submit jobs to more than one
pool.
In the default configuration, the condor_schedd contacts the
Central Manager specified by the CONDOR_HOST macro (described
in section 3.3.2)
to locate execute machines
available to run jobs in its queue.
However, the
FLOCK_NEGOTIATOR_HOSTS and FLOCK_COLLECTOR_HOSTS
macros (described in
section 3.3.9) may
be used to specify additional
Central Managers for the condor_schedd to contact.
When the local
pool does not satisfy all job requests, the condor_schedd will try
the pools specified by these macros in turn until all jobs are
satisfied.
$(HOSTALLOW_NEGOTIATOR_SCHEDD) (see section 3.3.4) must also be configured to allow negotiators from all of the $(FLOCK_NEGOTIATOR_HOSTS) to contact the schedd. Please make sure the $(NEGOTIATOR_HOST) is first in the $(HOSTALLOW_NEGOTIATOR_SCHEDD) list. Similarly, the central managers of the remote pools must be configured to listen to requests from this schedd.
This section describes how to configure the condor_startd for SMP (Symmetric Multi-Processor) machines. Beginning with Condor version 6.1, machines with more than one CPU can be configured to run more than one job at a time. As always, owners of the resources have great flexibility in defining the policy under which multiple jobs may run, suspend, vacate, etc.
The way SMP machines are represented to the Condor system is that the shared resources are broken up into individual virtual machines (``VM'') that can be matched with and claimed by users. Each virtual machine is represented by an individual ``ClassAd'' (see the ClassAd reference, section 4.1, for details). In this way, a single SMP machine will appear to the Condor system as a collection of separate virtual machines. So for example, if you had an SMP machine named ``vulture.cs.wisc.edu'', it would appear to Condor as multiple machines, named ``vm1@vulture.cs.wisc.edu'', ``vm2@vulture.cs.wisc.edu'', and so on.
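As a tiny illustration of this naming scheme, the following sketch lists the names under which one SMP host would appear. The function name is made up for this example.

```python
def vm_names(hostname, num_vms):
    """Return the per-VM names for an SMP host, numbered from vm1."""
    return [f"vm{i}@{hostname}" for i in range(1, num_vms + 1)]


for name in vm_names("vulture.cs.wisc.edu", 2):
    print(name)
```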
You can configure how you want the condor_startd to break up the shared system resources into the different virtual machines. All shared system resources (like RAM, disk space, swap space, etc.) can either be divided evenly among all the virtual machines, with each CPU getting its own virtual machine, or you can define your own virtual machine types, so that resources can be unevenly partitioned. The following section gives details on how to configure Condor to divide the resources on an SMP machine into separate virtual machines.
This section describes the settings that allow you to define your own virtual machine types and to control how many virtual machines of each type are reported to Condor.
There are two main ways to go about dividing an SMP machine:
Beginning with Condor version 6.1.6, the number of each type being reported can be changed at run-time, by issuing a simple reconfig to the condor_startd (sending a SIGHUP or using condor_reconfig). However, the definitions for the types themselves cannot be changed with a reconfig. If you change any VM type definitions, you must use ``condor_restart -startd'' for that change to take effect.
To define your own virtual machine types, you simply add config file parameters that list how much of each system resource you want in the given VM type. You do this with settings of the form VIRTUAL_MACHINE_TYPE_<N>. The <N> is to be replaced with an integer, for example, VIRTUAL_MACHINE_TYPE_1, which specifies the virtual machine type you're defining. You will use this number later to configure how many VMs of this type you want to advertise.
A type describes what share of the total system resources a given virtual machine has available to it.
The type can be defined in a number of ways:
Some attributes, such as the number of CPUs and total amount of RAM in the machine, do not change (unless the machine is turned off and more chips are added to it). For these two attributes, you can specify either absolute values, or percentages of the total available amount. For example, in a machine with 128 megs of RAM, you could specify any of the following to get the same effect: ``mem=64'', ``mem=1/2'', or ``mem=50%''. Other resources are dynamic, such as disk space and swap space. For these, you must specify the percentage or fraction of the total value that is allotted to each VM, instead of specifying absolute values. As the total values of these resources change on your machine, each VM will take its fraction of the total and report that as its available amount.
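The three notations can be compared with a short sketch. It is not the startd's parser; the function name is invented here, and it handles only the value part of a setting such as ``mem=1/2''.

```python
from fractions import Fraction


def resolve_share(spec, total):
    """Turn a share such as '64', '1/2' or '50%' into an absolute amount.

    Absolute values only make sense for fixed resources (CPUs and RAM);
    dynamic resources such as disk and swap need a fraction or percentage.
    """
    spec = spec.strip()
    if spec.endswith("%"):
        return total * Fraction(spec[:-1]) / 100
    if "/" in spec:
        return total * Fraction(spec)
    return Fraction(spec)  # already an absolute value


total_ram = 128  # megs, as in the example above
for spec in ("64", "1/2", "50%"):
    print(spec, "->", resolve_share(spec, total_ram))
```

Fractions are used instead of floating point so that equivalent notations compare exactly equal.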
All attribute names are case insensitive when defining VM types. You can use as much or as little of each word as you'd like. The attributes you can tune are:
Assume the host has 4 CPUs and 256 megs of RAM. Here are some example VM type definitions, all of which are valid. Types 1-3 are all equivalent to each other, as are types 4-6:
VIRTUAL_MACHINE_TYPE_1 = cpus=2, ram=128, swap=25%, disk=1/2
VIRTUAL_MACHINE_TYPE_2 = cpus=1/2, memory=128, virt=25%, disk=50%
VIRTUAL_MACHINE_TYPE_3 = c=1/2, m=50%, v=1/4, disk=1/2
VIRTUAL_MACHINE_TYPE_4 = c=25%, m=64, v=1/4, d=25%
VIRTUAL_MACHINE_TYPE_5 = 25%
VIRTUAL_MACHINE_TYPE_6 = 1/4
If you are not defining your own VM types, all you have to configure is how many of the evenly divided VMs you want reported to Condor. You do this by setting the NUM_VIRTUAL_MACHINES parameter. You just supply the number of machines you want reported. If you do not define this yourself, Condor will advertise all the CPUs in your machines by default.
If you define your own types, things are slightly more complicated. Now, you must specify how many virtual machines of each type should be reported. You do this with settings of the form NUM_VIRTUAL_MACHINES_TYPE_<N>. The <N> is to be replaced with an actual number, for example, NUM_VIRTUAL_MACHINES_TYPE_1.
NOTE: Be sure you have read and understand section 3.6 on ``Configuring The Startd Policy'' before you proceed with this section.
Each virtual machine from an SMP is treated as an independent machine, with its own view of its machine state. For now, a single set of policy expressions is in place for all virtual machines simultaneously. Eventually, you will be able to explicitly specify separate policies for each one. However, since you do have control over each virtual machine's view of its own state, you can effectively have separate policies for each resource.
For example, you can configure how many of the virtual machines ``notice'' console or tty activity on the SMP as a whole. Ones that aren't configured to notice any activity will report ConsoleIdle and KeyboardIdle times from when the startd was started (plus a configurable number of seconds). So, you can set up a 4 CPU machine with all the default startd policy settings and with the keyboard and console ``connected'' to only one virtual machine. Assuming there isn't too much load average (see section 3.11.7 below on ``Load Average for SMP Machines''), only one virtual machine will suspend or vacate its job when the owner starts typing at their machine again. The rest of the virtual machines could be matched with jobs and leave them running, even while the user was interactively using the machine.
Or, if you wish, you can configure all virtual machines to notice all tty and console activity. In this case, if a machine owner came back to her machine, all the currently running jobs would suspend or preempt (depending on your policy expressions), all at the same time.
All of this is controlled with the config file parameters listed
below.
These settings are fully described in
section 3.3.8, which lists all the
configuration file settings for the condor_startd.
Most operating systems define the load average for an SMP machine as the total load on all CPUs. For example, if you have a 4 CPU machine with 3 CPU-bound processes running at the same time, the load would be 3.0. In Condor, we maintain this view of the total load average and publish it in all resource ClassAds as TotalLoadAvg.
However, we also define the ``per-CPU'' load average for SMP machines. In this way, the model that each node on an SMP is a virtual machine, totally separate from the other nodes, can be maintained. All of the default, single-CPU policy expressions can be used directly on SMP machines, without modification, since the LoadAvg and CondorLoadAvg attributes are the per-virtual machine versions, not the total, SMP-wide versions.
The per-CPU load average on SMP machines is a number we basically invented. There is no system call you can use to ask your operating system for this value. Here's how it works:
We already compute the load average generated by Condor on each virtual machine. We do this by close monitoring of all processes spawned by any of the Condor daemons, even ones that are orphaned and then inherited by init. This Condor load average per virtual machine is reported as CondorLoadAvg in all resource ClassAds, and the total Condor load average for the entire machine is reported as TotalCondorLoadAvg. We also have the total, system-wide load average for the entire machine (reported as TotalLoadAvg). Basically, we walk through all the virtual machines and assign out portions of the total load average to each one. First, we assign out the known Condor load average to each node that is generating any. If there's any load average left in the total system load, that's considered owner load. Any virtual machines we already think are in the Owner state (like ones that have keyboard activity, etc), are the first to get assigned this owner load. We hand out owner load in increments of at most 1.0, so generally speaking, no virtual machine has a load average above 1.0. If we run out of total load average before we run out of virtual machines, all the remaining machines think they have no load average at all. If, instead, we run out of virtual machines and we still have owner load left, we start assigning that load to Condor nodes, too, creating individual nodes with a load average higher than 1.0.
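The procedure just described can be sketched in a few lines. This is a reading of the prose, not the startd's code. It assumes that VMs with no Condor load are visited before Condor nodes once the Owner-state VMs have been handled, and that each VM is visited once, so any owner load still left after that single pass is dropped. The function and argument names are invented for the example.

```python
def assign_vm_loads(total_load, condor_loads, in_owner_state):
    """Split the system-wide load average into per-VM load averages.

    condor_loads[i] is the Condor-generated load on VM i, and
    in_owner_state[i] says whether VM i is currently in the Owner state.
    """
    loads = list(condor_loads)  # every VM starts with its own Condor load
    owner_left = max(0.0, total_load - sum(condor_loads))

    def visit_order(i):
        # Owner-state VMs first, then idle VMs, then Condor nodes
        # (False sorts before True).
        return (not in_owner_state[i], condor_loads[i] > 0)

    for i in sorted(range(len(loads)), key=visit_order):
        share = min(1.0, owner_left)  # hand out at most 1.0 per VM
        loads[i] += share
        owner_left -= share
    return [round(load, 3) for load in loads]


# Total load 1.25, with vm2 running a job that generates 0.996 of it
# and vm1 in the Owner state.
print(assign_vm_loads(1.25, [0.0, 0.996], [True, False]))
```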
This section describes how the startd handles its debug messages for SMP machines. In general, a given log message will either be something that is machine-wide (like reporting the total system load average), or it will be specific to a given virtual machine. Any log entries specific to a virtual machine will have an extra header printed out in the entry: vm#:. So, for example, here's the output about system resources that are being gathered (with D_FULLDEBUG and D_LOAD turned on) on a 2 CPU machine with no Condor activity, and the keyboard connected to both virtual machines:
11/25 18:15 Swap space: 131064
11/25 18:15 number of kbytes available for (/home/condor/execute): 1345063
11/25 18:15 Looking up RESERVED_DISK parameter
11/25 18:15 Reserving 5120 kbytes for file system
11/25 18:15 Disk space: 1339943
11/25 18:15 Load avg: 0.340000 0.800000 1.170000
11/25 18:15 Idle Time: user= 0 , console= 4 seconds
11/25 18:15 SystemLoad: 0.340 TotalCondorLoad: 0.000 TotalOwnerLoad: 0.340
11/25 18:15 vm1: Idle time: Keyboard: 0 Console: 4
11/25 18:15 vm1: SystemLoad: 0.340 CondorLoad: 0.000 OwnerLoad: 0.340
11/25 18:15 vm2: Idle time: Keyboard: 0 Console: 4
11/25 18:15 vm2: SystemLoad: 0.000 CondorLoad: 0.000 OwnerLoad: 0.000
11/25 18:15 vm1: State: Owner Activity: Idle
11/25 18:15 vm2: State: Owner Activity: Idle
If, on the other hand, this machine only had one virtual machine connected to the keyboard and console, and the other vm was running a job, it might look something like this:
11/25 18:19 Load avg: 1.250000 0.910000 1.090000
11/25 18:19 Idle Time: user= 0 , console= 0 seconds
11/25 18:19 SystemLoad: 1.250 TotalCondorLoad: 0.996 TotalOwnerLoad: 0.254
11/25 18:19 vm1: Idle time: Keyboard: 0 Console: 0
11/25 18:19 vm1: SystemLoad: 0.254 CondorLoad: 0.000 OwnerLoad: 0.254
11/25 18:19 vm2: Idle time: Keyboard: 1496 Console: 1496
11/25 18:19 vm2: SystemLoad: 0.996 CondorLoad: 0.996 OwnerLoad: 0.000
11/25 18:19 vm1: State: Owner Activity: Idle
11/25 18:19 vm2: State: Claimed Activity: Busy
As you can see, messages about shared system resources (like total swap space) are printed without the header, while VM-specific messages (like the load average or state of each VM) get the special header prepended.
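If you want to filter such a log by virtual machine, the header makes it easy to separate the two kinds of entries. The sketch below assumes the ``MM/DD HH:MM'' timestamp layout seen in the sample output above; the function name is invented for the example.

```python
import re

# timestamp, optional "vm#: " header, then the rest of the message
LOG_ENTRY = re.compile(r"^(\d+/\d+ \d+:\d+) (?:(vm\d+): )?(.*)$")


def split_log_entry(line):
    """Return (timestamp, vm, message); vm is None for machine-wide entries."""
    match = LOG_ENTRY.match(line)
    if match is None:
        raise ValueError(f"unrecognized log line: {line!r}")
    return match.groups()


print(split_log_entry("11/25 18:19 vm2: State: Claimed Activity: Busy"))
print(split_log_entry("11/25 18:15 Swap space: 131064"))
```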
Beginning with Condor version 6.1.5, Condor can run on machines with multiple network interfaces. Basically, you tell each host with multiple interfaces which IP address you want the host to use for incoming and outgoing Condor network communication. You do this by setting the NETWORK_INTERFACE parameter in the local config file of each such host. There are a few other special cases you might have to deal with, described below.
If your Central Manager is on a machine with multiple interfaces, instead of defining the COLLECTOR_HOST or NEGOTIATOR_HOST parameters (which are usually both defined in terms of CONDOR_HOST), you should set the CM_IP_ADDR parameter.
WARNING: The default HOSTALLOW_ADMINISTRATOR setting in the config file references $(CONDOR_HOST), and the default HOSTALLOW_NEGOTIATOR setting references $(NEGOTIATOR_HOST). So you'll need to change both of these settings to reference $(CM_IP_ADDR) instead.
If your Checkpoint Server is on a machine with multiple interfaces, the only way to get things to work is if your different interfaces have different hostnames associated with them, and you set CKPT_SERVER_HOST to the hostname that corresponds with the IP address you want to use. You will still need to specify NETWORK_INTERFACE in the local config file for your Checkpoint Server.