Please see the web page http://www.cs.wisc.edu/condor/uwcs. As it explains, your home directory is in AFS, whose default access control restrictions can prevent Condor jobs from running properly. That page explains how to solve the problem.
Generally you shouldn't ignore all of the mail Condor sends, but you can reduce the amount you get by telling Condor that you don't want to be notified every time a job successfully completes, only when a job experiences an error. To do this, include a line in your submit file like the following:
Notification = Error
See the Notification parameter in the condor_submit man page of this manual for more information.
Check the following:
See Section 3.3.5.
See Section 3.3.5.
See Section 3.12.1.
This can occur when the machine your job is running on is missing a shared library required by your program. One solution is to install the shared library on all machines the job may execute on. Another, easier, solution is to try to re-link your program statically so it contains all the routines it needs.
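Before re-linking, it can help to confirm which library is actually missing on the execute machine. The following is a minimal Python sketch, not part of Condor, and it assumes the standard ldd utility is available on that machine. The function names and the sample ldd output are hypothetical and serve only as an illustration.

```python
import subprocess
from typing import List


def missing_libraries(ldd_output: str) -> List[str]:
    """Return names of libraries that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        # Unresolved dependencies appear as: "libname.so.N => not found"
        if "=>" in line and "not found" in line:
            missing.append(line.split("=>")[0].strip())
    return missing


def check_program(path: str) -> List[str]:
    """Run ldd on a program (hypothetical helper) and list missing libraries."""
    result = subprocess.run(["ldd", path], capture_output=True, text=True)
    return missing_libraries(result.stdout)


# Illustrative ldd output; library names and addresses are made up.
SAMPLE = """\
\tlibm.so.6 => /lib/libm.so.6 (0x00007f0000000000)
\tlibexample.so.1 => not found
\tlibc.so.6 => /lib/libc.so.6 (0x00007f0000200000)
"""

print(missing_libraries(SAMPLE))  # prints ['libexample.so.1']
```

Running check_program on the job's executable on a machine where the job failed would list the libraries that need to be installed there, or that static linking would fold into the program.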
Problems like the following are often reported to us:
> I have submitted 100 jobs to my pool, and only 18 appear to be
> running, but there are plenty of machines available. What should I
> do to investigate the reason why this happens?
Start by following these steps to understand the problem:
See if the jobs are starting to run but then exiting right away, or if they never even start.
No. You may only use binary compatibility between SPARC Solaris 2.5.1 and SPARC Solaris 2.6 and between SPARC Solaris 2.7 and SPARC Solaris 2.8, but not between SPARC Solaris 2.6 and SPARC Solaris 2.7. We may implement support for this feature in a future release of Condor.
This is a load sampling error that Condor makes when starting a many-process vanilla job with a heavy initial load. Condor mistakenly decides that the load on the machine has gotten too high while the job is in its initialization phase, and kicks the job off the machine.
What is needed is a way for Condor to check whether the load on the machine has been high over a certain period of time. There is a startd attribute, CpuBusyTime, that can be used for this purpose. It gives the length of time, in seconds, that $(CpuBusy) (usually defined in the default config file) has been true. $(CpuBusy) is defined in terms of non-Condor load.
To take advantage of this attribute, use it in your SUSPEND macro. Here is an example:
SUSPEND = (CpuBusyTime > 3 * $(MINUTE)) && ((CurrentTime - JobStart) > 90)
The above policy suspends the job only if the CPU has been busy with non-Condor load for more than three minutes and more than 90 seconds have passed since the job started.
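To see how the two conditions interact, the expression can be mirrored in a few lines of Python. This is only a sketch of the boolean logic, not of how the startd evaluates ClassAd expressions (undefined attributes, for example, are not modeled). The function name and the sample values are hypothetical.

```python
MINUTE = 60  # seconds, mirroring Condor's $(MINUTE) macro


def should_suspend(cpu_busy_time: int, current_time: int, job_start: int) -> bool:
    """Mirror of: (CpuBusyTime > 3 * $(MINUTE)) && ((CurrentTime - JobStart) > 90)"""
    busy_long_enough = cpu_busy_time > 3 * MINUTE
    past_startup = (current_time - job_start) > 90
    return busy_long_enough and past_startup


# Hypothetical samples: (CpuBusyTime, CurrentTime, JobStart), all in seconds.
samples = [
    (240, 1030, 1000),  # busy 4 minutes, but the job is only 30 s old
    (60, 1300, 1000),   # job is 5 minutes old, but busy only 1 minute
    (240, 1300, 1000),  # both conditions hold
]
for busy, now, start in samples:
    print(busy, now - start, should_suspend(busy, now, start))
```

Only the last sample yields True. The first shows the case this policy is meant to protect: a job still in its first 90 seconds is left alone even though the machine looks busy.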