LIGO Support Ticket 13576

Ticket Information
  Number:      admin 13576
  User:        anderson@ligo.caltech.edu
  Email:       
  Status:      feature
  Assigned To: tannenba
Date: Wed, 12 Apr 2006 15:01:27 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Global limit on queued jobs

	Is there a way to set a global limit on the number of queued jobs
to prevent schedd from crashing? I am running 6.7.18 and recently had a
user accidentally submit 50k jobs. This resulted in schedd becoming
unresponsive and getting killed by condor_master:

4/10 19:10:08 ERROR: Child pid 13885 appears hung! Killing it hard.
4/10 19:10:08 The SCHEDD (pid 13885) was killed because it was no longer responding
4/10 19:10:08 restarting /ldcg/condor/sbin/condor_schedd in 10 seconds

	Given all the different ways I have seen condor_schedd crash or get
killed when users abuse the queue, I think it would be helpful to be able
to set a global maximum on the number of jobs allowed to be in the queue.
When that limit is reached any job submitted through any means to that
schedd should be rejected with an appropriate error message.

Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date of creation: Wed Apr 12 17:02:03 2006 (1144879332)
Subject: Actions

Assigned to pavlo by pavlo
===========================================================================
Date of actions: Tue Apr 11 14:39:33 2006 (1144936790)
Subject: Actions

Status changed from new to feature by pavlo
===========================================================================
Date of actions: Tue Apr 11 14:39:33 2006 (1144952168)
From: Andy Pavlo <pavlo__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #13576] Global limit on queued jobs
Date: Thu, 13 Apr 2006 13:16:01 -0500

Stuart,

I am sorry but there is currently no way for Condor to limit the number of 
jobs submitted to a schedd. This is actually a difficult issue because one 
would have to define a flexible policy for the limits. For instance, if the 
limit is 500 jobs, what would happen if someone submitted 1000 jobs? Do the 
first 500 get accepted and all others get blocked? Is this per user or for 
all users, and is it per schedd or for the entire pool?

I will bring this up with the other developers to see if this is something we 
can get into Condor in the future. My suggestion to you for now is to write a 
wrapper script for condor_submit that can check to see whether the schedd is 
near some limit. Bear in mind, though, that condor_q can be slow at times.
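
A minimal sketch of such a wrapper, in Python, might look like the
following. Everything in it is an assumption for illustration: the limit
of 5000, the function names, and the idea that counting one line of
"condor_q -format" output per job is an acceptable way to measure the
queue. It rejects a submission outright once the queue is at the limit;
it does not try to accept part of a large submission.

```python
"""Hypothetical condor_submit wrapper: refuse to submit when the queue is full."""
import subprocess
import sys

MAX_QUEUED_JOBS = 5000  # assumed site-specific limit, not a Condor setting
REAL_SUBMIT = "condor_submit"  # assumed to resolve to the real binary, not this wrapper


def count_jobs(condor_q_output):
    """Count jobs, given one non-empty line of output per queued job."""
    return sum(1 for line in condor_q_output.splitlines() if line.strip())


def may_submit(queued, limit=MAX_QUEUED_JOBS):
    """Return True while the queue is still below the limit."""
    return queued < limit


def main(argv):
    # Print one ClusterId per job so the line count equals the job count.
    result = subprocess.run(
        ["condor_q", "-format", "%d\n", "ClusterId"],
        capture_output=True, text=True, check=True,
    )
    queued = count_jobs(result.stdout)
    if not may_submit(queued):
        sys.stderr.write(
            "Submission rejected: %d jobs queued (limit %d)\n"
            % (queued, MAX_QUEUED_JOBS)
        )
        return 1
    # Hand the original arguments to the real condor_submit unchanged.
    return subprocess.call([REAL_SUBMIT] + argv)
```

To run it as a script, end the file with sys.exit(main(sys.argv[1:])).
The check and the submit are separate steps, so two users submitting at
the same moment could still push the queue past the limit.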

I hope this helps.
-- 
Andy Pavlo
pavlo__AT__cs.wisc.edu

===========================================================================
Date mail was appended: Thu Apr 13 13:16:12 2006 (1144952173)
Date: Thu, 13 Apr 2006 13:33:39 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
Subject: Re: [condor-admin #13576] Global limit on queued jobs

On Thu, Apr 13, 2006 at 01:16:12PM -0600, condor-admin response tracking system wrote:
> Stuart,
> 
> I am sorry but there is currently no way for Condor to limit the number of 
> jobs submitted to a schedd. This is actually a difficult issue because one 
> would have to define a flexible policy for the limits. For instance, if the 
> limit is 500 jobs, what would happen if someone submitted 1000 jobs? Do the 
> first 500 get accepted and all others get blocked? Is this per user or for 
> all users, and is it per schedd or for the entire pool?

The immediate problem I am trying to solve is schedd crashing or getting
killed by condor_master. This has happened 3-4 times in the last 2 months
on the LIGO Caltech cluster due to users submitting too many jobs. I would
guess that a per-schedd limit would be most appropriate for this.

In your example I would suggest that the first 500 get accepted and the
remainder get blocked. Almost any behavior would be better than schedd
crashing and leaving DAG jobs in a confused state that requires manual
cleanup by each user, i.e., rescue DAGs don't work.

> 
> I will bring this up with the other developers to see if this is something we 
> can get into Condor in the future. My suggestion to you for now is to write a 
> wrapper script for condor_submit that can check to see whether the schedd is 
> near some limit. Bear in mind, though, that condor_q can be slow at times.

Thanks. Slow condor_q performance is another reason to consider limiting the
number of idle jobs. I am quite confident that running condor_q before
every job would effectively cripple the cluster and prevent most of the
CPUs from doing any work at any given time. It would increase the time
to schedule each job by more than an order of magnitude.

Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Thu Apr 13 15:33:56 2006 (1144960437)
Subject: Actions

Assigned to tannenba by pavlo
===========================================================================
Date of actions: Mon May  1  9:47:08 2006 (1146495520)