Please see the web page http://www.cs.wisc.edu/condor/uwcs. As it explains, your home directory is in AFS, whose default access control restrictions can prevent Condor jobs from running properly. That page explains how to solve the problem.
Generally you shouldn't ignore all of the mail Condor sends, but you can reduce the amount you get by telling Condor that you don't want to be notified every time a job successfully completes, only when a job experiences an error. To do this, include a line in your submit file like the following:
Notification = Error
See the Notification parameter in the condor_submit man page of this manual for more information.
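As a minimal sketch of where this line fits, a complete submit file might look like the following. The executable and file names are hypothetical placeholders, and the vanilla universe is only an assumption made for the example.

```
# Hypothetical submit file; my_program and the file names are placeholders.
universe     = vanilla
executable   = my_program
output       = my_program.out
error        = my_program.err
log          = my_program.log

# Send mail only when the job experiences an error.
notification = Error

queue
```

The other values accepted for this parameter are Always, Complete, and Never.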
Check the following:
See Section 3.3.5.
See Section 3.3.5.
See Section 3.7.1.
This can occur when the machine your job is running on is missing a shared library required by your program. One solution is to install the shared library on all machines the job may execute on. Another, easier, solution is to try to re-link your program statically so it contains all the routines it needs.
Problems like the following are often reported to us:
> I have submitted 100 jobs to my pool, and only 18 appear to be
> running, but there are plenty of machines available. What should I
> do to investigate the reason why this happens?
Start by following these steps to understand the problem:
See if the jobs are starting to run but then exiting right away, or if they never even start.
No. Binary compatibility holds only between SPARC Solaris 2.5.1 and SPARC Solaris 2.6, and between SPARC Solaris 2.7 and SPARC Solaris 2.8. It does not hold between SPARC Solaris 2.6 and SPARC Solaris 2.7. We may implement support for this feature in a future release of Condor.
This is a load sampling error that Condor makes when starting a vanilla job that spawns many processes and puts a heavy initial load on the machine. While the job is still in its initialization phase, Condor mistakenly decides that the load on the machine has become too high and kicks the job off the machine.
What is needed is a way for Condor to check whether the load on the machine has been high over a certain period of time. The startd attribute CpuBusyTime can be used for this purpose. It reports how long $(CpuBusy) (usually defined in the default config file) has been true. $(CpuBusy) is defined in terms of non-Condor load.
To take advantage of this attribute, use it in your SUSPEND expression. Here is an example:
SUSPEND = (CpuBusyTime > 3 * $(MINUTE)) && ((CurrentTime - JobStart) > 90)
The above policy says to suspend the job only if the CPU has been busy with non-Condor load for more than three minutes and more than 90 seconds have passed since the start of the job.
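For context, the SUSPEND expression relies on a few supporting macros. The sketch below shows how they might be defined. The definitions of MINUTE, NonCondorLoadAvg, HighLoad, and CpuBusy are assumptions modeled on a typical default config file, so check your own configuration before relying on these values.

```
# Assumed supporting macros; your default config file may define them differently.
MINUTE           = 60
NonCondorLoadAvg = (LoadAvg - CondorLoadAvg)
HighLoad         = 0.5
CpuBusy          = ($(NonCondorLoadAvg) >= $(HighLoad))

# Suspend only after sustained non-Condor load, and never during
# the first 90 seconds of the job.
SUSPEND = (CpuBusyTime > 3 * $(MINUTE)) && ((CurrentTime - JobStart) > 90)
```

The 90-second grace period is what shields the job during its initialization phase. Raise it if your job takes longer to start up.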
There are four circumstances under which Condor may evict a job. They are controlled by different expressions.
Reason number 1 is the user priority: controlled by the PREEMPTION_REQUIREMENTS expression in the configuration file. If there is a job from a higher priority user sitting idle, the condor_negotiator daemon may evict a currently running job submitted from a lower priority user if PREEMPTION_REQUIREMENTS is True. For more on user priorities, see section 2.7 and section 3.5.
Reason number 2 is the owner (machine) policy: controlled by the PREEMPT expression in the configuration file. When a job is running and the PREEMPT expression evaluates to True, the condor_startd will evict the job. The PREEMPT expression should reflect the conditions under which the machine owner will not permit a job to continue to run. For example, a policy to evict a currently running job when a key is hit, or when it is the 9:00am work arrival time, would be expressed in the PREEMPT expression and enforced by the condor_startd. For more on the PREEMPT expression, see section 3.6.
Reason number 3 is the owner (machine) preference: controlled by the RANK expression in the configuration file (sometimes called the startd rank or machine rank). The RANK expression is evaluated as a floating point number. When one job is running, a second idle job that evaluates to a higher RANK value tells the condor_startd to prefer the second job over the first. Therefore, the condor_startd will evict the first job so that it can start running the second (preferred) job. For more on RANK, see section 3.6.
Reason number 4 is a Condor shutdown: if Condor is to be shut down on a machine that is currently running a job, Condor evicts the currently running job before proceeding with the shutdown.
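The first three reasons each map to one expression in the configuration file. The fragment below is an illustrative sketch, not a recommended policy. The thresholds and the user name alice are hypothetical, and it assumes the machine attributes KeyboardIdle (seconds since the last keyboard activity) and ClockMin (minutes since midnight) and the priority attributes RemoteUserPrio and SubmittorPrio are available in your pool.

```
MINUTE = 60

# Reason 1, user priority: allow eviction when the running job's user has a
# numerically larger (worse) priority value than the idle job's submitter,
# here by more than 20 percent.
PREEMPTION_REQUIREMENTS = RemoteUserPrio > SubmittorPrio * 1.2

# Reason 2, owner policy: evict when a key was hit within the last minute,
# or when it is 9:00am or later (540 minutes after midnight).
PREEMPT = (KeyboardIdle < $(MINUTE)) || (ClockMin >= 540)

# Reason 3, owner preference: prefer jobs from the hypothetical user alice.
# Her jobs rank 1.0 and all others rank 0.0.
RANK = (Owner == "alice")
```

No expression is shown for the fourth reason because a shutdown always evicts the running job.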