Please see the web page http://www.cs.wisc.edu/condor/uwcs. As it explains, your home directory is in AFS, which by default has access control restrictions that can prevent Condor jobs from running properly. That page explains how to solve the problem.
Generally you shouldn't ignore all of the mail Condor sends, but you can reduce the amount you get by telling Condor that you don't want to be notified every time a job successfully completes, only when a job experiences an error. To do this, include a line in your submit file like the following:
Notification = Error
See the Notification parameter in the condor_submit man page of this manual for more information.
Check the following:
See Section 3.3.5.
See Section 3.3.5.
See Section 3.7.2.
This can occur when the machine your job is running on is missing a shared library required by your program. One solution is to install the shared library on all machines the job may execute on. Another, easier, solution is to try to re-link your program statically so it contains all the routines it needs.
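One way to confirm that a shared library is missing is to run ldd on the executable on the machine in question and look for entries marked "not found". The following is a minimal sketch, assuming a Linux-style ldd is installed there; the function names missing_libraries and check_binary are hypothetical helpers for illustration, not part of Condor.

```python
import subprocess
from typing import List


def missing_libraries(ldd_output: str) -> List[str]:
    """Return the library names that ldd reported as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        # Resolved and unresolved entries both look like "name => target".
        name, arrow, target = line.partition("=>")
        if arrow and target.strip() == "not found":
            missing.append(name.strip())
    return missing


def check_binary(path: str) -> List[str]:
    """Run ldd on a binary and list its unresolved shared libraries."""
    result = subprocess.run(
        ["ldd", path], capture_output=True, text=True, check=False
    )
    return missing_libraries(result.stdout)
```

An empty list from check_binary suggests that every shared library resolved on that machine. For a statically linked program, ldd typically reports that the file is not a dynamic executable, and the list is empty as well.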
Problems like the following are often reported to us:
> I have submitted 100 jobs to my pool, and only 18 appear to be
> running, but there are plenty of machines available. What should I
> do to investigate the reason why this happens?
Start by following these steps to understand the problem:
See if the jobs are starting to run but then exiting right away, or if they never even start.
No. You may only use binary compatibility between SPARC Solaris 2.5.1 and SPARC Solaris 2.6 and between SPARC Solaris 2.7 and SPARC Solaris 2.8, but not between SPARC Solaris 2.6 and SPARC Solaris 2.7. We may implement support for this feature in a future release of Condor.
Condor tries to provide a number, the "Condor Load Average" (reported in the machine ClassAd as CondorLoadAvg), which is intended to represent the total load average on the system caused by any running Condor job(s). Unfortunately, it is impossible to get an accurate number for this without support from the operating system, and that support is not available. So Condor does the best it can, and the result is usable in most cases. However, there are a number of ways this statistic can go wrong.
The old default Condor policy was to suspend if the non-Condor load average went over a certain threshold. However, because of the problems providing accurate numbers for this (described below), some jobs would go into a cycle of getting suspended and resumed. The default suspend policy now shipped with Condor uses the solution explained here.
There are too many technical details of why CondorLoadAvg might be wrong for a short answer here, so only a brief explanation is presented. When a job has periodic behavior, the load it places upon a machine changes over time, and so does the system load. However, Condor thinks that the job's share of the system load (what it uses to compute CondorLoadAvg) is also changing. So, when the job has been running and then stops, both the system load and the Condor load start falling. If it all worked correctly, they would fall at exactly the same rate, and the non-Condor load would stay constant. Unfortunately, because Condor thinks the job's share of the total load is falling too, CondorLoadAvg falls faster than the system load. The non-Condor load therefore goes up, and the old default SUSPEND expression becomes true.
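The effect can be illustrated with a toy model. This is only a sketch under loudly stated assumptions, not Condor's actual algorithm: the system load average is assumed to decay exponentially with a 60 second time constant once the job stops, Condor's estimate of the job's share of that load is assumed to decay with a 10 second time constant, and the job is assumed to be the only load on the machine. The constants LOAD_TAU and SHARE_TAU are hypothetical values chosen for illustration.

```python
import math

LOAD_TAU = 60.0   # assumed decay constant (seconds) of the system load average
SHARE_TAU = 10.0  # assumed decay constant (seconds) of the job's estimated share
HIGH_LOAD = 0.5   # threshold used by the default CPUBusy setting


def non_condor_load(t: float) -> float:
    """Apparent non-Condor load t seconds after the only job stops computing."""
    load_avg = math.exp(-t / LOAD_TAU)      # system load, starting from 1.0
    job_share = math.exp(-t / SHARE_TAU)    # share Condor attributes to the job
    condor_load_avg = load_avg * job_share  # falls faster than load_avg
    return load_avg - condor_load_avg


# Sample every 5 seconds for two minutes and find the worst case.
samples = [(t, non_condor_load(t)) for t in range(0, 121, 5)]
peak_time, peak = max(samples, key=lambda s: s[1])
```

In this model there is no real non-Condor load at all, yet the computed value rises from zero and peaks above HIGH_LOAD, which is the kind of spurious reading that made the old SUSPEND expression fire.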
It appears that Condor should be able to avoid this problem, but for a host of reasons, it cannot. There is no good way to get this right without help from the operating systems Condor runs on, and that help does not exist. The only way to compute these numbers more accurately without support from the operating system is to sample everything at such a high rate that Condor itself would create a large load average, just to try to compute the load average. This is Heisenberg's uncertainty principle in action.
A similar sampling error can occur when Condor is starting a job within the vanilla universe with many processes and with a heavy initial load. Condor mistakenly decides that the load on the machine has gotten too high while the job is in the initialization phase and kicks the job off the machine.
To correct this problem, Condor needs to check whether the load of the machine has been high over an interval of time. There is an attribute, CpuBusyTime, that can be used for this purpose. This attribute gives the length of time that $(CpuBusy) (defined in the default configuration file) has been true, or 0 if $(CpuBusy) is false. $(CpuBusy) is usually defined in terms of non-Condor load. These are the default settings:
NonCondorLoadAvg = (LoadAvg - CondorLoadAvg)
HighLoad = 0.5
CPUBusy = ($(NonCondorLoadAvg) >= $(HighLoad))
To take advantage of CpuBusyTime, you can use it in your SUSPEND expression.
Here is an example:
SUSPEND = (CpuBusyTime > 3 * $(MINUTE)) && ((CurrentTime - JobStart) > 90)
The above policy suspends the job only if the CPU has been busy with non-Condor load for more than three minutes and more than 90 seconds have passed since the start of the job.
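To make the logic of these expressions concrete, here is a small sketch that mirrors them as plain functions. It is an illustration only: Condor evaluates the real expressions as ClassAds, the function names cpu_busy and should_suspend are hypothetical, and all times are assumed to be in seconds.

```python
MINUTE = 60      # seconds, matching the $(MINUTE) macro used in the policy
HIGH_LOAD = 0.5  # the default HighLoad setting shown above


def cpu_busy(load_avg: float, condor_load_avg: float) -> bool:
    """Mirror of CPUBusy: non-Condor load is at or above the threshold."""
    non_condor_load_avg = load_avg - condor_load_avg
    return non_condor_load_avg >= HIGH_LOAD


def should_suspend(cpu_busy_time: float, current_time: float,
                   job_start: float) -> bool:
    """Mirror of the example SUSPEND expression.

    cpu_busy_time is how long cpu_busy() has been continuously true,
    or 0 if it is currently false.
    """
    busy_long_enough = cpu_busy_time > 3 * MINUTE
    past_startup = (current_time - job_start) > 90
    return busy_long_enough and past_startup
```

A short spike in load leaves cpu_busy_time small, so the job keeps running, and a job that is still within its first 90 seconds is never suspended, however busy the machine looks.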
There are four circumstances under which Condor may evict a job. They are controlled by different expressions.
Reason number 1 is the user priority, controlled by the PREEMPTION_REQUIREMENTS expression in the configuration file. If there is a job from a higher priority user sitting idle, the condor_negotiator daemon may evict a currently running job submitted by a lower priority user if PREEMPTION_REQUIREMENTS is True. For more on user priorities, see section 2.7 and section 3.5.
Reason number 2 is the owner (machine) policy, controlled by the PREEMPT expression in the configuration file. When a job is running and the PREEMPT expression evaluates to True, the condor_startd will evict the job. The PREEMPT expression should reflect the conditions under which the machine owner will not permit a job to continue to run. For example, a policy to evict a currently running job when a key is hit, or when it is the 9:00am work arrival time, would be expressed in the PREEMPT expression and enforced by the condor_startd. For more on the PREEMPT expression, see section 3.6.
Reason number 3 is the owner (machine) preference, controlled by the RANK expression in the configuration file (sometimes called the startd rank or machine rank). The RANK expression is evaluated as a floating point number. When one job is running, a second idle job that evaluates to a higher RANK value tells the condor_startd to prefer the second job over the first. Therefore, the condor_startd will evict the first job so that it can start running the second (preferred) job. For more on RANK, see section 3.6.
Reason number 4 is a Condor shutdown. When Condor is to be shut down on a machine that is currently running a job, Condor evicts the running job before proceeding with the shutdown.
The answer is dependent on the universe of the jobs.
Under the scheduler universe, the signal jobs get upon condor_rm can be set by the user in the submit description file with a command of the form
remove_kill_sig = SIGWHATEVER
If this command is not defined, Condor further looks for a command in the submit description file of the form
kill_sig = SIGWHATEVER
And, if that command is also not given, Condor uses SIGTERM.
For all other universes, the jobs get the value of the submit description file command kill_sig, which is SIGTERM by default.
If a job is killed or evicted, the job is sent the kill_sig signal, unless it is on the receiving end of a hard kill, in which case it gets SIGKILL.
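The lookup order described above can be summarized in a short sketch. This is an illustration of the stated rules only, not Condor's implementation: the function removal_signal is hypothetical, the submit description file is modeled as a plain dictionary of commands, and the hard kill case is assumed to take precedence in every universe.

```python
from typing import Dict


def removal_signal(universe: str, submit: Dict[str, str],
                   hard_kill: bool = False) -> str:
    """Return the name of the signal a job would receive on removal."""
    if hard_kill:
        # A hard kill always results in SIGKILL.
        return "SIGKILL"
    if universe == "scheduler" and "remove_kill_sig" in submit:
        # Only the scheduler universe consults remove_kill_sig first.
        return submit["remove_kill_sig"]
    # Every universe falls back to kill_sig, then to SIGTERM.
    return submit.get("kill_sig", "SIGTERM")
```

For example, a scheduler universe job whose submit description file sets only kill_sig = SIGINT would be sent SIGINT, while a job in another universe that sets only remove_kill_sig would still be sent SIGTERM.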
Under all universes, the signal is sent only to the parent PID of the job, namely, the first child of the condor_starter. If the child itself is forking, the child must catch and forward signals as appropriate. This in turn depends on the user's desired behavior. The exception to this is (again) where the job is receiving a hard kill. In that case, Condor sends SIGKILL to all the PIDs in the family.
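A job that starts its own child process can relay signals to it with a small wrapper. The following is a minimal sketch for POSIX systems, assuming the wrapper itself is the process started by Condor and that it launches a single child; the names forward_signals and run_with_forwarding are hypothetical, and which signals to relay depends on the behavior the user wants.

```python
import os
import signal
import subprocess


def forward_signals(child, signums=(signal.SIGTERM, signal.SIGINT)):
    """Install handlers that relay each listed signal to the child process."""
    def relay(signum, frame):
        # Only signal a child that has not already exited.
        if child.poll() is None:
            os.kill(child.pid, signum)

    for signum in signums:
        signal.signal(signum, relay)


def run_with_forwarding(argv):
    """Start argv as a child, relay signals to it, and return its exit status."""
    child = subprocess.Popen(argv)
    forward_signals(child)
    return child.wait()
```

SIGKILL cannot be caught, so a wrapper like this has no chance to relay a hard kill. That is consistent with Condor sending SIGKILL to every PID in the family itself.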