Next: 6.4 Condor on Windows Up: 6. Frequently Asked Questions Previous: 6.2 Setting up Condor

6.3 Running Condor Jobs

6.3.1 I'm at the University of Wisconsin-Madison Computer Science Dept., and I am having problems!

Please see the web page http://www.cs.wisc.edu/condor/uwcs. As it explains, your home directory is in AFS, which by default has access control restrictions that can prevent Condor jobs from running properly. That page explains how to solve the problem.

6.3.2 I'm getting a lot of email from Condor. Can I just delete it all?

Generally, you shouldn't ignore all of the mail Condor sends, but you can reduce the amount you get by telling Condor to notify you only when a job experiences an error, rather than every time a job completes successfully. To do this, include a line like the following in your submit file:

Notification = Error

See the Notification parameter in the condor_submit man page of this manual for more information.

6.3.3 Why will my vanilla jobs only run on the machine where I submitted them from?

Check the following:

1. Did you submit the job from a local filesystem that other computers can't access? See Section 3.3.5.

2. Did you set a special requirements expression for vanilla jobs that prevents them, but not other jobs, from running? See Section 3.3.5.

3. Is Condor running as a non-root user? See Section 3.12.1.

6.3.4 My job starts but exits right away with signal 9.

 

This can occur when the machine your job is running on is missing a shared library required by your program. One solution is to install the shared library on all machines the job may execute on. Another, easier, solution is to try to re-link your program statically so it contains all the routines it needs.
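To find out which shared libraries are missing, you can run ldd on your executable on the execute machine and look for entries it could not resolve. The following is a minimal sketch, assuming Python is available there and that ldd marks unresolved libraries with the words "not found". The function names and the sample output are illustrative only and are not part of Condor.

```python
import subprocess


def missing_libraries(ldd_output):
    """Return the names of libraries that ldd reported as not found."""
    missing = []
    for line in ldd_output.splitlines():
        # Assumption: unresolved entries contain the text "not found".
        if "not found" in line:
            parts = line.split()
            if parts:
                # The library name is the first token on the line.
                missing.append(parts[0])
    return missing


def check_executable(path):
    """Run ldd on the given executable and list its unresolved libraries."""
    result = subprocess.run(["ldd", path], capture_output=True, text=True)
    return missing_libraries(result.stdout)


# Illustrative ldd output; libfoo.so.1 is a made-up library name.
sample = (
    "\tlibfoo.so.1 => not found\n"
    "\tlibc.so.6 => /lib/libc.so.6 (0x0)\n"
)
print(missing_libraries(sample))
```

If the list is empty on the submit machine but not on an execute machine, the libraries it names are the ones to install there, or the reason to re-link statically.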

 

6.3.5 Why aren't any or all of my jobs running?

Problems like the following are often reported to us:

> I have submitted 100 jobs to my pool, and only 18 appear to be
> running, but there are plenty of machines available. What should I
> do to investigate the reason why this happens?

Start by following these steps to understand the problem:

1. Run condor_q -analyze and see what it says.

2. Look at the User Log file (whatever you specified as "log = XXX" in the submit file). See if the jobs are starting to run but then exiting right away, or if they never even start.

3. Look at the SchedLog on the submit machine after it negotiates for this user. If a user doesn't have enough priority to get more machines, the SchedLog will contain a message like "lost priority, no more jobs".

4. If jobs are successfully being matched with machines, they still might be dying when they try to execute due to file permission problems or the like. Check the ShadowLog on the submit machine for warnings or errors.

5. Look at the NegotiatorLog during the negotiation for the user. Look for messages about priority, "no more machines", or similar.
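The log checks in the steps above can be partly automated. The following is a minimal sketch, not a Condor tool: the HINTS table and the find_hints function are hypothetical names, and the phrases are simply the ones quoted in the steps, so the exact wording in your logs may differ.

```python
# Phrases to look for in each daemon log, taken from the steps above.
# Assumption: matching is done case-insensitively on plain substrings.
HINTS = {
    "SchedLog": ["lost priority, no more jobs"],
    "ShadowLog": ["warning", "error"],
    "NegotiatorLog": ["priority", "no more machines"],
}


def find_hints(log_name, lines):
    """Return (line_number, text) pairs for lines containing a hint phrase."""
    phrases = [phrase.lower() for phrase in HINTS.get(log_name, [])]
    hits = []
    for number, line in enumerate(lines, start=1):
        lowered = line.lower()
        if any(phrase in lowered for phrase in phrases):
            hits.append((number, line.rstrip("\n")))
    return hits


# Illustrative log lines, not real Condor output.
example = [
    "negotiating for user\n",
    "lost priority, no more jobs\n",
]
for number, text in find_hints("SchedLog", example):
    print(number, text)
```

To use it on a real log, open the file and pass the file object as the lines argument. Running condor_config_val LOG should print the directory where the daemon logs are kept.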

     
6.3.6 Can I submit my standard universe SPARC Solaris 2.6 jobs and have them run on a SPARC Solaris 2.7 machine?

No. Binary compatibility is supported only between SPARC Solaris 2.5.1 and SPARC Solaris 2.6, and between SPARC Solaris 2.7 and SPARC Solaris 2.8. It is not supported between SPARC Solaris 2.6 and SPARC Solaris 2.7. We may implement support for this feature in a future release of Condor.

   
6.3.7 Why do my vanilla jobs keep cycling between suspended and unsuspended?

This results from a load sampling error that Condor makes when starting a many-process vanilla job with a heavy initial load. Condor mistakenly decides that the load on the machine has gotten too high while the job is still in its initialization phase, and kicks the job off the machine.

What is needed is a way for Condor to check whether the load on the machine has been high over a certain period of time. The startd attribute CpuBusyTime can be used for this purpose. It gives the length of time that $(CpuBusy) (usually defined in the default config file) has been true. $(CpuBusy) is defined in terms of non-Condor load.

To take advantage of this attribute, you can use it in your SUSPEND macro. Here is an example:

SUSPEND = (CpuBusyTime > 3 * $(MINUTE)) && ((CurrentTime - JobStart) > 90)

The above policy says to suspend the job only if the CPU has been busy with non-Condor load for more than three minutes and more than 90 seconds have passed since the start of the job.
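To see how the two conditions combine, the policy can be modeled outside of Condor. The following is a minimal sketch that only mirrors the logic of the expression. It assumes that $(MINUTE) is 60 and that CpuBusyTime, CurrentTime, and JobStart are all measured in seconds. The function name should_suspend is hypothetical.

```python
MINUTE = 60  # Assumption: mirrors the $(MINUTE) macro, in seconds.


def should_suspend(cpu_busy_time, current_time, job_start, minute=MINUTE):
    """Model of the example SUSPEND expression.

    Suspend only when the CPU has been busy with non-Condor load for more
    than three minutes AND the job has been running for more than 90 seconds.
    """
    busy_long_enough = cpu_busy_time > 3 * minute
    job_past_startup = (current_time - job_start) > 90
    return busy_long_enough and job_past_startup


# A job that started 50 seconds ago is left alone even under sustained load.
print(should_suspend(cpu_busy_time=200, current_time=1000, job_start=950))
# The same load suspends a job that started 100 seconds ago.
print(should_suspend(cpu_busy_time=200, current_time=1000, job_start=900))
```

The second condition is what stops the cycling: during its first 90 seconds a job cannot be suspended, no matter how heavy its start-up load is.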


condor-admin@cs.wisc.edu