LIGO Support Ticket 12616

Ticket Information
  Number:      admin 12616
  User:        sba@srl.caltech.edu
  Email:       duncan__AT__gravity.phys.uwm.edu,skoranda__AT__gravity.phys.uwm.edu
  Status:      open
  Assigned To: wright
Date: Sat, 06 Aug 2005 12:34:40 -0500
To: condor-admin__AT__cs.wisc.edu
From: Stuart Anderson <sba__AT__srl.caltech.edu> (by way of Alain Roy
 <roy__AT__cs.wisc.edu>)
Subject: Controlling resource abusive jobs

	I would like to request an enhancement to Condor to support the
restriction of resource abusive jobs.

PROBLEM:
	condor_starter has no way to monitor critical resource usage such
as memory on Linux machines in the Vanilla Universe when users submit
simple shell scripts that then fork/exec (rather than just exec) the
"real" application. As far as I can tell, in this case the condor_starter
process just monitors the resources of its direct child process,
which is just a simple shell, and not the grandchildren.

SOLUTION:
	Have condor_starter call setrlimit() at initialization time before
fork/exec of the user code, with the resource limits being specified
in a standard condor_config variable.


Is there an already existing solution for Condor monitoring all the
progeny of a Vanilla Universe job and not just the direct child process?


Thanks.

P.S. Besides memory limitations, this would be where I would also
like to be able to globally set the coredumpsize to 0 for all users
under normal circumstances.
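
A minimal sketch of the proposal above, written in Python rather than in
Condor's own code: the parent forks, the child lowers its limits, and only
then is the user code exec'ed, so every descendant inherits them. The
function names and the idea of passing the limits as arguments are
assumptions made for illustration; nothing here is an existing Condor
interface.

```python
import resource
import subprocess


def clamp_to_hard_limit(which, requested):
    """Lower `requested` if needed so it never exceeds the current hard limit."""
    _, hard = resource.getrlimit(which)
    if hard == resource.RLIM_INFINITY:
        return requested
    return min(requested, hard)


def run_with_limits(argv, max_address_space, max_core=0):
    """Run argv with hard limits set in the child between fork and exec."""
    as_limit = clamp_to_hard_limit(resource.RLIMIT_AS, max_address_space)
    core_limit = clamp_to_hard_limit(resource.RLIMIT_CORE, max_core)

    def apply_limits():
        # Runs in the forked child only, just before the user code is exec'ed.
        resource.setrlimit(resource.RLIMIT_AS, (as_limit, as_limit))
        resource.setrlimit(resource.RLIMIT_CORE, (core_limit, core_limit))

    return subprocess.run(argv, preexec_fn=apply_limits).returncode
```

Because the limits are hard limits, a shell script that forks the "real"
application cannot raise them again, and max_core=0 covers the core dump
request in the P.S. Note that these are per-process limits; they do not
cap the total used by a whole process tree.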


-- 
Stuart Anderson  sba__AT__srl.caltech.edu  http://www.srl.caltech.edu/personnel/sba


===========================================================================
Date of creation: Sat Aug  6 12:34:48 2005 (1123349691)
Subject: Actions

Assigned to roy by roy
===========================================================================
Date of actions: Sat Aug  6 12:35:23 2005 (1123349723)
Date: Sat, 06 Aug 2005 12:37:03 -0500
To: condor-admin__AT__cs.wisc.edu
From: Alain Roy <roy__AT__cs.wisc.edu>
Subject: Re: [condor-admin #12616] Controlling resource abusive jobs

Hi Stuart,

It's an intriguing idea, but I'm not sure of all the implications so I'll 
talk to some other staff next week.

In what scenarios is this important to you? Is it just a "would be nice" 
feature, or is it something that is really important?

-alain


===========================================================================
Date mail was appended: Sat Aug  6 12:37:08 2005 (1123349828)
From: Stuart Anderson <sba__AT__srl.caltech.edu>
Subject: Re: [condor-admin #12616] Controlling resource abusive jobs
To: condor-admin__AT__cs.wisc.edu, roy__AT__cs.wisc.edu
Date: Sat, 6 Aug 2005 15:05:15 -0700 (PDT)
CC: Brown Duncan <duncan__AT__gravity.phys.uwm.edu>, Scott Koranda
 <skoranda__AT__gravity.phys.uwm.edu>

According to condor-admin response tracking system:
> Hi Stuart,
> 
> It's an intriguing idea, but I'm not sure of all the implications so I'll 
> talk to some other staff next week.
> 
> In what scenarios is this important to you? Is it just a "would be nice" 
> feature, or is it something that is really important?

It is fairly important to have some robust control over memory abusive
jobs. These are happening in practice and the consequences are significant:

1) The Linux kernel OOM killer often picks something other than the user
application to kill when it runs out of memory.

2) There is a problem with the Linux NFS/RPC implementation wherein a
node that is very busy swapping (because of an abusive job) will no
longer respond to NFS/RPC queries from other Linux nodes that are
NFS clients. Even when the abusive job is killed it is then necessary to
run "umount -f" on many of the other cluster nodes for filesystems served
up from the problem node since they are stuck and generate un-killable
jobs on the client machines.

Initially I thought the PREEMPT variable would handle this. However, we
have mixed and inconclusive results on the efficacy of this for Standard
Universe jobs, and as I understand the architecture there is no way PREEMPT
or KILL can handle someone submitting shell scripts in the Vanilla Universe
that call fork--since Condor will just monitor the shell process resources
and not the child application.

What would be nice is if we could limit the memory use on a per virtual
machine level so we can run one "small" and one "large" memory virtual
machine on each of our dual-processor SMP nodes. Currently we advertise
700MB and 1200MB for each node that has 2GB of physical memory.

Is there a different way to limit memory abusive jobs that call fork()
in the Vanilla Universe?

Thanks.

-- 
Stuart Anderson  sba__AT__srl.caltech.edu  http://www.srl.caltech.edu/personnel/sba

===========================================================================
Date mail was appended: Sat Aug  6 17:05:21 2005 (1123365922)
Date: Tue, 09 Aug 2005 17:31:36 -0500
To: condor-admin__AT__cs.wisc.edu
From: Alain Roy <roy__AT__cs.wisc.edu>
Subject: Re: [condor-admin #12616] Controlling resource abusive jobs


>Is there a different way to limit memory abusive jobs that call fork()
>in the Vanilla Universe?

Hi Stuart,

I'm still catching up from my vacation (I just got back today)--sorry for 
the slow response. You're right--it sounds like you need some solution for 
this.

Here are a few ideas.

1) Condor lets you specify a USER_JOB_WRAPPER. This is a program that will 
run before your job. You can make a small program that calls setrlimit() 
then runs the program. I'm not sure you can figure out what VM you are on, 
but someone else thinks that might be in the environment.

http://www.cs.wisc.edu/condor/manual/v6.7/3_3Configuration.html#10861

2) I forget how setrlimit() works, exactly, but you might be able to start 
up the condor_master with a limit and have it inherited by everything it 
starts, including the jobs. I'm not sure that will work--it seems iffy to me.

3) We can add real limits in Condor. This will take longer though. It will 
go on our very long "queue of work".

I think that the USER_JOB_WRAPPER is a very promising idea--what do you think?
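
A minimal sketch of such a wrapper, in Python. It assumes the wrapper is
invoked with the job's executable and arguments as its own arguments; the
1200MB cap and the function name are placeholders, not Condor settings.

```python
import os
import resource
import sys

# Hypothetical site-wide cap in bytes; the value is an assumption.
MEMORY_LIMIT_BYTES = 1200 * 1024 * 1024


def wrap(argv, limit=MEMORY_LIMIT_BYTES):
    """Apply the limit, then replace this process with the job in argv[1:]."""
    if len(argv) < 2:
        sys.exit("usage: wrapper job-executable [job-args...]")
    _, hard = resource.getrlimit(resource.RLIMIT_AS)
    if hard != resource.RLIM_INFINITY:
        # An unprivileged process cannot raise an existing hard limit.
        limit = min(limit, hard)
    resource.setrlimit(resource.RLIMIT_AS, (limit, limit))
    # exec keeps the same pid, so the job and its children inherit the limit.
    os.execvp(argv[1], argv[1:])
```

An installed script would end with a call to wrap(sys.argv). Since the
wrapper execs the job instead of forking it, no extra process is left in
the tree.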

-alain



===========================================================================
Date mail was appended: Tue Aug  9 17:31:42 2005 (1123626703)
Subject: Actions

Status changed from open to pending by roy
===========================================================================
Date of actions: Tue Aug  9 17:31:42 2005 (1123626704)
From: Stuart Anderson <sba__AT__srl.caltech.edu>
Subject: Re: [condor-admin #12616] Controlling resource abusive jobs
To: condor-admin__AT__cs.wisc.edu, roy__AT__cs.wisc.edu
Date: Tue, 9 Aug 2005 16:30:35 -0700 (PDT)
CC: sba__AT__srl.caltech.edu, duncan__AT__gravity.phys.uwm.edu,
 skoranda__AT__gravity.phys.uwm.edu

According to condor-admin response tracking system:
> 
> 
> >Is there a different way to limit memory abusive jobs that call fork()
> >in the Vanilla Universe?
> 
> Hi Stuart,
> 
> I'm still catching up from my vacation (I just got back today)--sorry for 
> the slow response. You're right--it sounds like you need some solution for 
> this.
> 
> Here are a few ideas.
> 
> 1) Condor lets you specify a USER_JOB_WRAPPER. This is a program that will 
> run before your job. You can make a small program that calls setrlimit() 
> then runs the program. I'm not sure you can figure out what VM you are on, 
> but someone else thinks that might be in the environment.
> 
> http://www.cs.wisc.edu/condor/manual/v6.7/3_3Configuration.html#10861

Interesting.

> 
> 2) I forget how setrlimit() works, exactly, but you might be able to start 
> up the condor_master with a limit and have it inherited by everything it 
> starts, including the jobs. I'm not sure that will work--it seems iffy to me.

This is my nominal workaround: modify /etc/init.d/condor on the worker
nodes to set hard process limits that are inherited by condor_master
and so on down the process tree all the way to the offending applications.

> 
> 3) We can add real limits in Condor. This will take longer though. It will 
> go on our very long "queue of work".

Thanks. Please assume that the init.d process limit is sufficient as a
temporary workaround unless we find it does not work, and that the
request for integrated process limits should just go on the long queue.

However, a different user ran into this just today, i.e., they
submitted a shell script which fork/execs an application that then
used too much memory, but it was only evicted from the node because of
priority, not because of PREEMPT, since condor was unable to monitor the
actual application memory use.

> 
> I think that the USER_JOB_WRAPPER is a very promising idea--what do you think?

Not being a condor expert, I think modifying /etc/init.d/condor is safer
and should accomplish the same thing, unless there is a way for
USER_JOB_WRAPPER to know which Virtual Machine it is running as.

-- 
Stuart Anderson  sba__AT__srl.caltech.edu  http://www.srl.caltech.edu/personnel/sba

===========================================================================
Date mail was appended: Tue Aug  9 18:30:41 2005 (1123630241)
Date: Wed, 10 Aug 2005 09:19:16 -0500
To: condor-admin__AT__cs.wisc.edu
From: Alain Roy <roy__AT__cs.wisc.edu>
Subject: Re: [condor-admin #12616] Controlling resource abusive jobs


> > Here are a few ideas.

> > 3) We can add real limits in Condor. This will take longer though. It will
> > go on our very long "queue of work".
>
>Thanks.

I said "It's an idea..." Just to be clear, I'm not promising it yet. This 
is the sort of thing I would want to run by Miron first.

> > I think that the USER_JOB_WRAPPER is a very promising idea--what do you 
> think?
>
>Not being a condor expert, I think modifying /etc/init.d/condor is safer
>and should accomplish the same thing, unless there is a way for
>USER_JOB_WRAPPER to know which Virtual Machine it is running as.

I'm not sure what the safety issue is--can you explain?

Getting the VM is a bit of a hack, but it can be done without too much 
difficulty.

1) You want each job to specify:

Environment = VM=$$(VirtualMachineID)

You can force this by making a wrapper to condor_submit, which does (in 
essence)

condor_submit -a 'Environment = VM=$$(VirtualMachineID)' $@

2) The USER_JOB_WRAPPER now has access to the virtual machine id from the 
environment, and it can call setrlimit() appropriately.
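
A minimal sketch of step 2, in Python, showing only how a wrapper might
pick a limit from the VM variable set in step 1. The 700MB and 1200MB
figures are the sizes Stuart mentioned; which VM gets which, the fallback
to the smaller cap, and the names are assumptions.

```python
import os

# Assumed mapping of VirtualMachineID to a memory cap in bytes.
VM_LIMITS = {
    "1": 700 * 1024 * 1024,
    "2": 1200 * 1024 * 1024,
}
# If VM is missing or unknown, fall back to the most restrictive cap.
DEFAULT_LIMIT = min(VM_LIMITS.values())


def limit_for_vm(environ=os.environ):
    """Return the memory cap for the VM named in the job environment."""
    vm = environ.get("VM", "").strip()
    return VM_LIMITS.get(vm, DEFAULT_LIMIT)
```

The returned value would then be handed to setrlimit() before the job is
exec'ed. Falling back to the smaller cap means a job that arrives without
the variable is still limited.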

Whaddya think?

-alain



===========================================================================
Date mail was appended: Wed Aug 10  9:19:21 2005 (1123683561)
Subject: Actions

Status changed from open to pending by roy
===========================================================================
Date of actions: Wed Aug 10  9:19:21 2005 (1123683563)
From: Stuart Anderson <sba__AT__srl.caltech.edu>
Subject: Re: [condor-admin #12616] Controlling resource abusive jobs
To: condor-admin__AT__cs.wisc.edu, roy__AT__cs.wisc.edu
Date: Wed, 10 Aug 2005 22:28:47 -0700 (PDT)
CC: Brown Duncan <duncan__AT__gravity.phys.uwm.edu>, Scott Koranda
 <skoranda__AT__gravity.phys.uwm.edu>

According to condor-admin response tracking system:
> 
> 
> > > Here are a few ideas.
> 
> > > 3) We can add real limits in Condor. This will take longer though. It will
> > > go on our very long "queue of work".
> >
> >Thanks.
> 
> I said "It's an idea..." Just to be clear, I'm not promising it yet. This 
> is the sort of thing I would want to run by Miron first.

Understood.

> 
> > > I think that the USER_JOB_WRAPPER is a very promising idea--what do you 
> > think?
> >
> >Not being a condor expert, I think modifying /etc/init.d/condor is safer
> >and should accomplish the same thing, unless there is a way for
> >USER_JOB_WRAPPER to know which Virtual Machine it is running as.
> 
> I'm not sure what the safety issue is--can you explain?

It is just that I am concerned, since I have never used USER_JOB_WRAPPER,
that it is possible to insert such a script into every Condor Universe
job without any side effects. It probably works, but would still have
problems with VM specific settings as discussed below.

> 
> Getting the VM is a bit of a hack, but it can be done without too much 
> difficulty.
> 
> 1) You want each job to specify:
> 
> Environment = VM=$$(VirtualMachineID)
> 
> You can force this by making a wrapper to condor_submit, which does (in 
> essence)
> 
> condor_submit -a 'Environment = VM=$$(VirtualMachineID)' $@
> 
> 2) The USER_JOB_WRAPPER now has access to the virtual machine id from the 
> environment, and it can call setrlimit() appropriately.
> 
> Whaddya think?

Sounds like too much of a hack since it would only cover jobs submitted via
the condor_submit script and we are also supporting jobs coming in
remotely via globus, or have I misunderstood?

However, the same idea of replacing condor executables with shell wrappers
could presumably be done for condor_starter and this script could then parse
the process arguments to look for the vm string and set the Unix process
limits accordingly before calling the "real" condor_starter processes.

Nonetheless, I still think it would be a useful enhancement for Condor to
have a fully integrated method for setting process limits via setrlimit().

-- 
Stuart Anderson  sba__AT__srl.caltech.edu  http://www.srl.caltech.edu/personnel/sba

===========================================================================
Date mail was appended: Thu Aug 11  0:28:52 2005 (1123738132)
Date: Thu, 11 Aug 2005 14:10:45 -0500
To: condor-admin__AT__cs.wisc.edu
From: Alain Roy <roy__AT__cs.wisc.edu>
Subject: Re: [condor-admin #12616] Controlling resource abusive jobs


> > I'm not sure what the safety issue is--can you explain?
>
>It is just that I am concerned, since I have never used USER_JOB_WRAPPER,
>that it is possible to insert such a script into every Condor Universe
>job without any side effects. It probably works, but would still have
>problems with VM specific settings as discussed below.

Fair enough. But I'm confused, because before you wrote:

>Is there a different way to limit memory abusive jobs that call fork()
>in the Vanilla Universe?

So I thought you were just worried about vanilla jobs right now. That said, 
the USER_JOB_WRAPPER should work the same for both vanilla and standard 
universe jobs. I'm 90% certain it will work fine for MPI, Java, and PVM 
universe jobs. It won't work for Condor-G jobs though. (Well jobs that end 
up in a Condor pool can have a USER_JOB_WRAPPER, but it's the one 
associated with the pool, not the Condor-G submit point.)

>Sounds like too much of a hack since it would only cover jobs submitted via
>the condor_submit script and we are also supporting jobs coming in
>remotely via globus, or have I misunderstood?

Globus jobs are submitted with condor_submit. The only thing it won't cover 
is jobs submitted via the Condor SOAP API, but I am willing to bet that you 
don't use that. It's new and almost no one uses it.

>However, the same idea of replacing condor executables with shell wrappers
>could presumably be done for condor_starter and this script could then parse
>the process arguments to look for the vm string and set the Unix process
>limits accordingly before calling the "real" condor_starter processes.

I think that would work, but I'm not certain. The starter inherits things 
like a network connection to the shadow so you have to be careful to get 
things just right.

>Nonetheless, I still think it would be a useful enhancement for Condor to
>have a fully integrated method for setting process limits via setrlimit().

Agreed--I'm just trying hard to find a short-term solution that can hold 
you over while we improve things.

-alain



===========================================================================
Date mail was appended: Thu Aug 11 14:10:48 2005 (1123787448)
Subject: Actions

Status changed from open to pending by roy
===========================================================================
Date of actions: Thu Aug 11 14:10:48 2005 (1123787450)
From: Stuart Anderson <sba__AT__srl.caltech.edu>
Subject: Re: [condor-admin #12616] Controlling resource abusive jobs
To: condor-admin__AT__cs.wisc.edu, roy__AT__cs.wisc.edu
Date: Thu, 11 Aug 2005 20:43:34 -0700 (PDT)
CC: Brown Duncan <duncan__AT__gravity.phys.uwm.edu>, Scott Koranda
 <skoranda__AT__gravity.phys.uwm.edu>

Alain,

According to condor-admin response tracking system:
> 
> 
> > > I'm not sure what the safety issue is--can you explain?
> >
> >It is just that I am concerned, since I have never used USER_JOB_WRAPPER,
> >that it is possible to insert such a script into every Condor Universe
> >job without any side effects. It probably works, but would still have
> >problems with VM specific settings as discussed below.
> 
> Fair enough. But I'm confused, because before you wrote:
> 
> >Is there a different way to limit memory abusive jobs that call fork()
> >in the Vanilla Universe?
> 
> So I thought you were just worried about vanilla jobs right now. That said, 
> the USER_JOB_WRAPPER should work the same for both vanilla and standard 
> universe jobs. I'm 90% certain it will work fine for MPI, Java, and PVM 
> universe jobs. It won't work for Condor-G jobs though. (Well jobs that end 
> up in a Condor pool can have a USER_JOB_WRAPPER, but it's the one 
> associated with the pool, not the Condor-G submit point.)

Sorry for the confusion.  We have seen both Standard and non-forking Vanilla
jobs having a problem with using too much memory and not being evicted by
PREEMPT.

What I would like is to figure out how to use PREEMPT to properly handle
standard and non-forking vanilla jobs, as well as have a catch-all
process limit (or something similar) to handle jobs that fork.

On the not-a-problem-but-might-be-helpful-in-the-future wish list would be a
system to limit the total resources used by a forking job by adding up all
the processes in its tree and not just the resources of individual processes.
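
A rough sketch of that kind of accounting on Linux, in Python, reading
/proc directly. It is only an illustration of adding up a tree, not how
Condor measures image size; the function names are made up, and summing
resident set sizes counts pages shared between processes more than once.

```python
import os


def read_stat(pid):
    """Return (parent pid, resident pages) from /proc/<pid>/stat."""
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # The command name is parenthesised and may contain spaces,
    # so split only the part after the closing parenthesis.
    fields = data[data.rfind(")") + 2:].split()
    return int(fields[1]), int(fields[21])


def snapshot():
    """Map every visible pid to its (parent pid, resident pages)."""
    table = {}
    for name in os.listdir("/proc"):
        if name.isdigit():
            try:
                table[int(name)] = read_stat(int(name))
            except OSError:
                pass  # the process exited while we were scanning
    return table


def descendants(root_pid, table):
    """Return root_pid plus every pid whose ancestry leads back to it."""
    children = {}
    for pid, (ppid, _) in table.items():
        children.setdefault(ppid, []).append(pid)
    found, stack = [], [root_pid]
    while stack:
        pid = stack.pop()
        found.append(pid)
        stack.extend(children.get(pid, []))
    return found


def tree_rss_bytes(root_pid):
    """Sum resident memory over the process tree rooted at root_pid."""
    table = snapshot()
    page = os.sysconf("SC_PAGE_SIZE")
    pids = descendants(root_pid, table)
    return sum(table[pid][1] * page for pid in pids if pid in table)
```

A process that detaches and is re-parented to init drops out of the tree,
so this approach would miss a job that daemonizes itself.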

> 
> >Sounds like too much of a hack since it would only cover jobs submitted via
> >the condor_submit script and we are also supporting jobs coming in
> >remotely via globus, or have I misunderstood?
> 
> Globus jobs are submitted with condor_submit. The only thing it won't cover 
> is jobs submitted via the Condor SOAP API, but I am willing to bet that you 
> don't use that. It's new and almost no one uses it.
> 
> >However, the same idea of replacing condor executables with shell wrappers
> >could presumably be done for condor_starter and this script could then parse
> >the process arguments to look for the vm string and set the Unix process
> >limits accordingly before calling the "real" condor_starter processes.
> 
> I think that would work, but I'm not certain. The starter inherits things 
> like a network connection to the shadow so you have to be careful to get 
> things just right.
> 
> >Nonetheless, I still think it would be a useful enhancement for Condor to
> >have a fully integrated method for setting process limits via setrlimit().
> 
> Agreed--I'm just trying hard to find a short-term solution that can hold 
> you over while we improve things.
> 

I will probably stick with modifying /etc/init.d/condor on the worker nodes
as a temporary workaround until an integrated VM-specific solution is
available.

However, do you know of a Linux equivalent to the Solaris /usr/bin/plimit?
If that was available I could write a periodic cron job that searches 
for any child processes of "condor_starter" and based on the "vm?" argument
calls plimit to set the process limits on the user processes according to
which vm it belongs to.

Thanks.

-- 
Stuart Anderson  sba__AT__srl.caltech.edu  http://www.srl.caltech.edu/personnel/sba

===========================================================================
Date mail was appended: Thu Aug 11 22:43:43 2005 (1123818223)
Subject: Actions

Assigned to wright by wright
===========================================================================
Date of actions: Thu Jun  1  4:19:32 2006 (1149153572)
CC: Todd Tannenbaum <tannenba__AT__cs.wisc.edu>, Alain Roy <roy__AT__cs.wisc.edu>
From: Derek Wright <wright__AT__cs.wisc.edu>
Subject: Re: [condor-admin #12616] Controlling resource abusive jobs
Date: Thu, 1 Jun 2006 14:00:07 -0700
To: condor-admin__AT__cs.wisc.edu

as per our discussion at condor-week about this issue, i've been  
working on making sure that the condor code is correctly updating the  
image size in the job queue whenever a vanilla job is getting preempted.

unfortunately, i started from the assumption it was broken, and  
started writing code to fix it. ;)

the more i dug into the starter's code, the more it seemed like it  
was *already* doing this.  so, then i switched to trying to reproduce  
the problem you've reported, and was unable to do so.  in local  
testing, with a job that malloc's 10K every second, everything seemed  
to work as expected.  periodic imagesize updates made it into the job  
queue just fine.  when PREEMPT became TRUE (when the imagesize  
crossed a threshold), the job was evicted and the imagesize was  
updated one last time during the eviction.  of course, if KILL is  
true and the startd is telling the starter to "hardkill" (or if you  
use condor_vacate -fast), then no such updates happen, but that's by  
design (since the starter is under the assumption it must  
*immediately* evict, and has no time for costly operations like TCP  
communication with the shadow).
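
for reference, a test job of the kind described above might look roughly
like this python sketch. the actual test program is not shown in this
ticket, so the name, the parameters and the use of bytearray in place of
malloc are all assumptions.

```python
import time


def grow(chunk_bytes=10 * 1024, interval=1.0, max_chunks=None):
    """Allocate chunk_bytes every `interval` seconds and keep it all alive."""
    chunks = []
    while max_chunks is None or len(chunks) < max_chunks:
        # bytearray zero-fills the buffer, so the memory is actually touched.
        chunks.append(bytearray(chunk_bytes))
        time.sleep(interval)
    return chunks
```

with the defaults it grows by 10K a second until it is evicted or killed;
max_chunks is only there so the loop can be stopped in a quick check.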

looking at the code (but not testing it), i can verify that this does  
*not* happen with standard universe jobs.  there, if the startd  
decides to preempt (for any reason), the starter does not do a final  
imagesize update to the shadow.  maybe that's the only problem you're  
worried about?

can you confirm or deny that this is really a problem with vanilla  
jobs (esp in 6.7.19)?  that seemed to be your top priority based on  
the meeting we had, but i can't find a problem, either in the code or  
in testing.  maybe it used to be a bug back in 2005 (when this issue  
was created) but enough things have changed in the code that this has  
already been fixed?

thanks,
-derek




===========================================================================
Date mail was appended: Thu Jun  1 16:00:22 2006 (1149195623)