LIGO Support Ticket 12616
Ticket Information
Number: admin 12616
User: sba@srl.caltech.edu
Email: duncan__AT__gravity.phys.uwm.edu,skoranda__AT__gravity.phys.uwm.edu
Status: open
Assigned To: wright
Date: Sat, 06 Aug 2005 12:34:40 -0500
To: condor-admin__AT__cs.wisc.edu
From: Stuart Anderson <sba__AT__srl.caltech.edu> (by way of Alain Roy
<roy__AT__cs.wisc.edu>)
Subject: Controlling resource abusive jobs
I would like to request an enhancement to Condor to support the
restriction of resource abusive jobs.
PROBLEM:
condor_starter has no way to monitor critical resource usage such
as memory on Linux machines in the Vanilla Universe when users submit
simple shell scripts that then fork/exec (rather than just exec) the
"real" application. As far as I can tell, in this case the condor_starter
process just monitors the resources of its direct child process,
which is just a simple shell, and not the grandchildren.
SOLUTION:
Have condor_starter call setrlimit() at initialization time before
fork/exec of the user code, with the resource limits being specified
in a standard condor_config variable.
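
The proposal relies on rlimits being inherited across fork/exec. A minimal
sketch of that behavior in Python, not Condor code: run_with_limits is a
hypothetical helper that applies limits in the child just before exec, and the
test command is an outer shell that forks an inner shell (standing in for a
job script that forks the "real" application).

```python
import resource
import subprocess


def run_with_limits(argv, limits):
    """Run argv with the given rlimits applied in the child before exec.

    limits maps a resource constant (e.g. resource.RLIMIT_CORE) to the value
    used for both the soft and the hard limit.
    """
    def apply_limits():
        # Runs in the forked child just before exec, so the limits apply to
        # the job and to everything it later forks, but not to the parent.
        for res, value in limits.items():
            resource.setrlimit(res, (value, value))

    return subprocess.run(argv, preexec_fn=apply_limits,
                          capture_output=True, text=True)


# The outer shell forks an inner shell, which reports its own core limit.
result = run_with_limits(
    ["/bin/sh", "-c", "sh -c 'ulimit -c'; exit 0"],
    {resource.RLIMIT_CORE: 0},
)
print(result.stdout.strip())
```

The inner shell reports a core limit of 0 even though only its parent's
parent set it. A memory cap would be passed the same way, for example with
resource.RLIMIT_AS and a byte count.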
Is there an already existing solution for Condor monitoring all the
progeny of a Vanilla Universe job and not just the direct child process?
Thanks.
P.S. Besides memory limitations, this would be where I would also
like to be able to globally set the coredumpsize to 0 for all users
under normal circumstances.
--
Stuart Anderson sba__AT__srl.caltech.edu http://www.srl.caltech.edu/personnel/sba
===========================================================================
Date of creation: Sat Aug 6 12:34:48 2005 (1123349691)
Subject: Actions
Assigned to roy by roy
===========================================================================
Date of actions: Sat Aug 6 12:35:23 2005 (1123349723)
Date: Sat, 06 Aug 2005 12:37:03 -0500
To: condor-admin__AT__cs.wisc.edu
From: Alain Roy <roy__AT__cs.wisc.edu>
Subject: Re: [condor-admin #12616] Controlling resource abusive jobs
Hi Stuart,
It's an intriguing idea, but I'm not sure of all the implications so I'll
talk to some other staff next week.
In what scenarios is this important to you? Is it just a "would be nice"
feature, or is it something that is really important?
-alain
===========================================================================
Date mail was appended: Sat Aug 6 12:37:08 2005 (1123349828)
From: Stuart Anderson <sba__AT__srl.caltech.edu>
Subject: Re: [condor-admin #12616] Controlling resource abusive jobs
To: condor-admin__AT__cs.wisc.edu, roy__AT__cs.wisc.edu
Date: Sat, 6 Aug 2005 15:05:15 -0700 (PDT)
CC: Brown Duncan <duncan__AT__gravity.phys.uwm.edu>, Scott Koranda
<skoranda__AT__gravity.phys.uwm.edu>
According to condor-admin response tracking system:
> Hi Stuart,
>
> It's an intriguing idea, but I'm not sure of all the implications so I'll
> talk to some other staff next week.
>
> In what scenarios is this important to you? Is it just a "would be nice"
> feature, or is it something that is really important?
It is fairly important to have some robust control over memory abusive
jobs. These are happening in practice and the consequences are significant:
1) The Linux kernel OOM killer often picks something other than the user
application to kill when it runs out of memory.
2) There is a problem with the Linux NFS/RPC implementation wherein a
node that is very busy swapping (because of an abusive job) will no
longer respond to NFS/RPC queries from other Linux nodes that are
NFS clients. Even when the abusive job is killed it is then necessary to
run "umount -f" on many of the other cluster nodes for filesystems served
up from the problem node since they are stuck and generate un-killable
jobs on the client machines.
Initially I thought the PREEMPT variable would handle this. However, we
have mixed and inconclusive results on the efficacy of this for Standard
Universe jobs, and as I understand the architecture there is no way PREEMPT
or KILL can handle someone submitting shell scripts in the Vanilla Universe
that call fork--since Condor will just monitor the shell process resources
and not the child application.
What would be nice is if we could limit the memory use on a per virtual
machine level so we can run one "small" and one "large" memory virtual
machine on each of our dual-processor SMP nodes. Currently we advertise
700MB and 1200MB for each node that has 2GB of physical memory.
Is there a different way to limit memory abusive jobs that call fork()
in the Vanilla Universe?
Thanks.
--
Stuart Anderson sba__AT__srl.caltech.edu http://www.srl.caltech.edu/personnel/sba
===========================================================================
Date mail was appended: Sat Aug 6 17:05:21 2005 (1123365922)
Date: Tue, 09 Aug 2005 17:31:36 -0500
To: condor-admin__AT__cs.wisc.edu
From: Alain Roy <roy__AT__cs.wisc.edu>
Subject: Re: [condor-admin #12616] Controlling resource abusive jobs
>Is there a different way to limit memory abusive jobs that call fork()
>in the Vanilla Universe?
Hi Stuart,
I'm still catching up from my vacation (I just got back today)--sorry for
the slow response. You're right--it sounds like you need some solution for
this.
Here are a few ideas.
1) Condor lets you specify a USER_JOB_WRAPPER. This is a program that will
run before your job. You can make a small program that calls setrlimit()
then runs the program. I'm not sure you can figure out what VM you are on,
but someone else thinks that might be in the environment.
http://www.cs.wisc.edu/condor/manual/v6.7/3_3Configuration.html#10861
2) I forget how setrlimit() works, exactly, but you might be able to start
up the condor_master with a limit and have it inherited by everything it
starts, including the jobs. I'm not sure that will work--it seems iffy to me.
3) We can add real limits in Condor. This will take longer though. It will
go on our very long "queue of work".
I think that the USER_JOB_WRAPPER is a very promising idea--what do you think?
-alain
===========================================================================
Date mail was appended: Tue Aug 9 17:31:42 2005 (1123626703)
Subject: Actions
Status changed from open to pending by roy
===========================================================================
Date of actions: Tue Aug 9 17:31:42 2005 (1123626704)
From: Stuart Anderson <sba__AT__srl.caltech.edu>
Subject: Re: [condor-admin #12616] Controlling resource abusive jobs
To: condor-admin__AT__cs.wisc.edu, roy__AT__cs.wisc.edu
Date: Tue, 9 Aug 2005 16:30:35 -0700 (PDT)
CC: sba__AT__srl.caltech.edu, duncan__AT__gravity.phys.uwm.edu,
skoranda__AT__gravity.phys.uwm.edu
According to condor-admin response tracking system:
>
>
> >Is there a different way to limit memory abusive jobs that call fork()
> >in the Vanilla Universe?
>
> Hi Stuart,
>
> I'm still catching up from my vacation (I just got back today)--sorry for
> the slow response. You're right--it sounds like you need some solution for
> this.
>
> Here are a few ideas.
>
> 1) Condor lets you specify a USER_JOB_WRAPPER. This is a program that will
> run before your job. You can make a small program that calls setrlimit()
> then runs the program. I'm not sure you can figure out what VM you are on,
> but someone else thinks that might be in the environment.
>
> http://www.cs.wisc.edu/condor/manual/v6.7/3_3Configuration.html#10861
Interesting.
>
> 2) I forget how setrlimit() works, exactly, but you might be able to start
> up the condor_master with a limit and have it inherited by everything it
> starts, including the jobs. I'm not sure that will work--it seems iffy to me.
This is my nominal workaround: modify /etc/init.d/condor on the worker
nodes to set hard process limits that are inherited by condor_master
and so on down the processes tree all the way to the offending applications.
>
> 3) We can add real limits in Condor. This will take longer though. It will
> go on our very long "queue of work".
Thanks. Please assume that the init.d process limit is sufficient for a
temporary work around unless we find it does not work, and that the
request for integrated process limits should just go on the long queue.
However, I did just have a different user run into this just today, i.e.,
they submitted a shell script which fork/exec'd an application which then
used too much memory, but it was only evicted from the node because of
priority, not because of PREEMPT since condor was unable to monitor the
actual application memory use.
>
> I think that the USER_JOB_WRAPPER is a very promising idea--what do you think?
Not being a condor expert, I think modifying /etc/init.d/condor is safer
and should accomplish the same thing, unless there is a way for
USER_JOB_WRAPPER to know which Virtual Machine it is running as.
--
Stuart Anderson sba__AT__srl.caltech.edu http://www.srl.caltech.edu/personnel/sba
===========================================================================
Date mail was appended: Tue Aug 9 18:30:41 2005 (1123630241)
Date: Wed, 10 Aug 2005 09:19:16 -0500
To: condor-admin__AT__cs.wisc.edu
From: Alain Roy <roy__AT__cs.wisc.edu>
Subject: Re: [condor-admin #12616] Controlling resource abusive jobs
> > Here are a few ideas.
> > 3) We can add real limits in Condor. This will take longer though. It will
> > go on our very long "queue of work".
>
>Thanks.
I said "It's an idea..." Just to be clear, I'm not promising it yet. This
is the sort of thing I would want to run by Miron first.
> > I think that the USER_JOB_WRAPPER is a very promising idea--what do you
> think?
>
>Not being a condor expert, I think modifying /etc/init.d/condor is safer
>and should accomplish the same thing, unless there is a way for
>USER_JOB_WRAPPER to know which Virtual Machine it is running as.
I'm not sure what the safety issue is--can you explain?
Getting the VM is a bit of a hack, but it can be done without too much
difficulty.
1) You want each job to specify:
Environment = VM=$$(VirtualMachineID)
You can force this by making a wrapper to condor_submit, which does (in
essence)
condor_submit -a 'Environment = VM=$$(VirtualMachineID)' $@
2) The USER_JOB_WRAPPER now has access to the virtual machine id from the
environment, and it can call setrlimit() appropriately.
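
A minimal sketch of such a wrapper in Python, under several assumptions: the
VM id arrives in the environment variable VM as "1" or "2" (per step 1), the
job's executable and arguments are passed to the wrapper as its own arguments,
and the 700MB/1200MB caps are taken from the memory sizes advertised earlier
in this ticket. The mapping of id to cap and the default are illustrative only.

```python
import os
import resource
import sys

MB = 1024 * 1024

# Assumed mapping from virtual machine id to an address-space cap.
VM_LIMITS = {"1": 700 * MB, "2": 1200 * MB}
# Assumed fallback when VM is missing or unrecognized.
DEFAULT_LIMIT = 700 * MB


def limit_for_vm(environ):
    """Return the byte limit for the VM named in the job environment."""
    return VM_LIMITS.get(environ.get("VM", ""), DEFAULT_LIMIT)


def main(argv):
    limit = limit_for_vm(os.environ)
    resource.setrlimit(resource.RLIMIT_AS, (limit, limit))
    # Replace the wrapper with the real job; the limit is inherited by the
    # job and by anything it forks.
    os.execvp(argv[1], argv[1:])


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv)
```

Using exec rather than spawning a child keeps the job as the wrapper's own
process, so the starter still sees the job's exit status directly.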
Whaddya think?
-alain
===========================================================================
Date mail was appended: Wed Aug 10 9:19:21 2005 (1123683561)
Subject: Actions
Status changed from open to pending by roy
===========================================================================
Date of actions: Wed Aug 10 9:19:21 2005 (1123683563)
From: Stuart Anderson <sba__AT__srl.caltech.edu>
Subject: Re: [condor-admin #12616] Controlling resource abusive jobs
To: condor-admin__AT__cs.wisc.edu, roy__AT__cs.wisc.edu
Date: Wed, 10 Aug 2005 22:28:47 -0700 (PDT)
CC: Brown Duncan <duncan__AT__gravity.phys.uwm.edu>, Scott Koranda
<skoranda__AT__gravity.phys.uwm.edu>
According to condor-admin response tracking system:
>
>
> > > Here are a few ideas.
>
> > > 3) We can add real limits in Condor. This will take longer though. It will
> > > go on our very long "queue of work".
> >
> >Thanks.
>
> I said "It's an idea..." Just to be clear, I'm not promising it yet. This
> is the sort of thing I would want to run by Miron first.
Understood.
>
> > > I think that the USER_JOB_WRAPPER is a very promising idea--what do you
> > think?
> >
> >Not being a condor expert, I think modifying /etc/init.d/condor is safer
> >and should accomplish the same thing, unless there is a way for
> >USER_JOB_WRAPPER to know which Virtual Machine it is running as.
>
> I'm not sure what the safety issue is--can you explain?
It is just that, since I have never used USER_JOB_WRAPPER, I am not sure
whether it is possible to insert such a script into every Condor Universe
job without any side effects. It probably works, but would still have
problems with VM specific settings as discussed below.
>
> Getting the VM is a bit of a hack, but it can be done without too much
> difficulty.
>
> 1) You want each job to specify:
>
> Environment = VM=$$(VirtualMachineID)
>
> You can force this by making a wrapper to condor_submit, which does (in
> essence)
>
> condor_submit -a 'Environment = VM=$$(VirtualMachineID)' $@
>
> 2) The USER_JOB_WRAPPER now has access to the virtual machine id from the
> environment, and it can call setrlimit() appropriately.
>
> Whaddya think?
Sounds like too much of a hack since it would only cover jobs submitted via
the condor_submit script and we are also supporting jobs coming in
remotely via globus, or have I misunderstood?
However, the same idea of replacing condor executables with shell wrappers
could presumably be done for condor_starter and this script could then parse
the process arguments to look for the vm string and set the Unix process
limits accordingly before calling the "real" condor_starter processes.
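
The argument-parsing half of that idea might look like the sketch below. The
exact form of the condor_starter command line is an assumption here: the
sketch only supposes that one argument is the string "vm" followed by a
number, and the sample argument list in the usage lines is hypothetical.

```python
import re

# Assumed form of the VM argument: "vm" followed by digits, e.g. "vm2".
VM_ARG = re.compile(r"^vm(\d+)$")


def vm_from_args(argv):
    """Return the VM number found in argv, or None if no argument matches."""
    for arg in argv:
        match = VM_ARG.match(arg)
        if match:
            return int(match.group(1))
    return None


# Hypothetical starter command line, for illustration only.
print(vm_from_args(["condor_starter", "-f", "-a", "vm2", "node1"]))
```

A wrapper built on this would set its limits from the returned number and
then exec the real condor_starter with the unchanged argument list.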
Nonetheless, I still think it would be a useful enhancement for Condor to
have a fully integrated method for setting process limits via setrlimit().
--
Stuart Anderson sba__AT__srl.caltech.edu http://www.srl.caltech.edu/personnel/sba
===========================================================================
Date mail was appended: Thu Aug 11 0:28:52 2005 (1123738132)
Date: Thu, 11 Aug 2005 14:10:45 -0500
To: condor-admin__AT__cs.wisc.edu
From: Alain Roy <roy__AT__cs.wisc.edu>
Subject: Re: [condor-admin #12616] Controlling resource abusive jobs
> > I'm not sure what the safety issue is--can you explain?
>
>It is just that, since I have never used USER_JOB_WRAPPER, I am not sure
>whether it is possible to insert such a script into every Condor Universe
>job without any side effects. It probably works, but would still have
>problems with VM specific settings as discussed below.
Fair enough. But I'm confused, because before you wrote:
>Is there a different way to limit memory abusive jobs that call fork()
>in the Vanilla Universe?
So I thought you were just worried about vanilla jobs right now. That said,
the USER_JOB_WRAPPER should work the same for both vanilla and standard
universe jobs. I'm 90% certain it will work fine for MPI, Java, and PVM
universe jobs. It won't work for Condor-G jobs though. (Well jobs that end
up in a Condor pool can have a USER_JOB_WRAPPER, but it's the one
associated with the pool, not the Condor-G submit point.)
>Sounds like too much of a hack since it would only cover jobs submitted via
>the condor_submit script and we are also supporting jobs coming in
>remotely via globus, or have I misunderstood?
Globus jobs are submitted with condor_submit. The only thing it won't cover
is jobs submitted via the Condor SOAP API, but I am willing to bet that you
don't use that. It's new and almost no one uses it.
>However, the same idea of replacing condor executables with shell wrappers
>could presumably be done for condor_starter and this script could then parse
>the process arguments to look for the vm string and set the Unix process
>limits accordingly before calling the "real" condor_starter processes.
I think that would work, but I'm not certain. The starter inherits things
like a network connection to the shadow so you have to be careful to get
things just right.
>Nonetheless, I still think it would be a useful enhancement for Condor to
>have a fully integrated method for setting process limits via setrlimit().
Agreed--I'm just trying hard to find a short-term solution that can hold
you over while we improve things.
-alain
===========================================================================
Date mail was appended: Thu Aug 11 14:10:48 2005 (1123787448)
Subject: Actions
Status changed from open to pending by roy
===========================================================================
Date of actions: Thu Aug 11 14:10:48 2005 (1123787450)
From: Stuart Anderson <sba__AT__srl.caltech.edu>
Subject: Re: [condor-admin #12616] Controlling resource abusive jobs
To: condor-admin__AT__cs.wisc.edu, roy__AT__cs.wisc.edu
Date: Thu, 11 Aug 2005 20:43:34 -0700 (PDT)
CC: Brown Duncan <duncan__AT__gravity.phys.uwm.edu>, Scott Koranda
<skoranda__AT__gravity.phys.uwm.edu>
Alain,
According to condor-admin response tracking system:
>
>
> > > I'm not sure what the safety issue is--can you explain?
> >
> >It is just that, since I have never used USER_JOB_WRAPPER, I am not sure
> >whether it is possible to insert such a script into every Condor Universe
> >job without any side effects. It probably works, but would still have
> >problems with VM specific settings as discussed below.
>
> Fair enough. But I'm confused, because before you wrote:
>
> >Is there a different way to limit memory abusive jobs that call fork()
> >in the Vanilla Universe?
>
> So I thought you were just worried about vanilla jobs right now. That said,
> the USER_JOB_WRAPPER should work the same for both vanilla and standard
> universe jobs. I'm 90% certain it will work fine for MPI, Java, and PVM
> universe jobs. It won't work for Condor-G jobs though. (Well jobs that end
> up in a Condor pool can have a USER_JOB_WRAPPER, but it's the one
> associated with the pool, not the Condor-G submit point.)
Sorry for the confusion. We have seen both Standard and non-forking Vanilla
jobs having a problem with using too much memory and not being evicted by
PREEMPT.
What I would like is to figure out how to use PREEMPT to properly handle
standard and non-forking vanilla jobs, as well as have a catch-all
process limit (or something similar) to handle jobs that fork.
On the not-a-problem-but-might-be-helpful-in-the-future wish list would be a
system to limit the total resources used by a forking job by adding up all
the processes in its tree and not just the resources of individual processes.
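
One way such a tree total could be computed on Linux is sketched below. This
is not how Condor does it; it assumes a /proc filesystem, and both function
names are hypothetical. The first function takes a snapshot of every process
as pid -> (parent pid, resident KB); the second adds up one subtree.

```python
import os

PAGE_KB = os.sysconf("SC_PAGE_SIZE") // 1024


def read_proc_snapshot(proc="/proc"):
    """Return {pid: (ppid, rss_kb)} for every process readable under proc."""
    snapshot = {}
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "stat")) as f:
                stat = f.read()
            # The command name is in parentheses and may contain spaces, so
            # split after the last ")". The remaining fields start at state.
            fields = stat.rsplit(")", 1)[1].split()
            ppid = int(fields[1])        # 4th field of stat
            rss_pages = int(fields[21])  # 24th field of stat
        except (OSError, IndexError, ValueError):
            continue  # process exited or entry unreadable
        snapshot[int(entry)] = (ppid, rss_pages * PAGE_KB)
    return snapshot


def tree_rss_kb(snapshot, root):
    """Sum resident memory of root and all of its descendants."""
    children = {}
    for pid, (ppid, _) in snapshot.items():
        children.setdefault(ppid, []).append(pid)
    total, stack, seen = 0, [root], set()
    while stack:
        pid = stack.pop()
        if pid in seen or pid not in snapshot:
            continue
        seen.add(pid)
        total += snapshot[pid][1]
        stack.extend(children.get(pid, []))
    return total
```

Calling tree_rss_kb(read_proc_snapshot(), starter_pid) would give one number
for the shell, the application it forked, and any further descendants.
Processes that have been reparented (for example after a double fork) fall
outside the tree and would be missed by this approach.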
>
> >Sounds like too much of a hack since it would only cover jobs submitted via
> >the condor_submit script and we are also supporting jobs coming in
> >remotely via globus, or have I misunderstood?
>
> Globus jobs are submitted with condor_submit. The only thing it won't cover
> is jobs submitted via the Condor SOAP API, but I am willing to bet that you
> don't use that. It's new and almost no one uses it.
>
> >However, the same idea of replacing condor executables with shell wrappers
> >could presumably be done for condor_starter and this script could then parse
> >the process arguments to look for the vm string and set the Unix process
> >limits accordingly before calling the "real" condor_starter processes.
>
> I think that would work, but I'm not certain. The starter inherits things
> like a network connection to the shadow so you have to be careful to get
> things just right.
>
> >Nonetheless, I still think it would be a useful enhancement for Condor to
> >have a fully integrated method for setting process limits via setrlimit().
>
> Agreed--I'm just trying hard to find a short-term solution that can hold
> you over while we improve things.
>
I will probably stick with modifying /etc/init.d/condor on the worker nodes
as a temporary work around until an integrated VM-specific solution is
available.
However, do you know of a Linux equivalent to the Solaris /usr/bin/plimit?
If that was available I could write a periodic cron job that searches
for any child processes of "condor_starter" and based on the "vm?" argument
calls plimit to set the process limits on the user processes according to
which vm it belongs to.
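
For reference, Linux kernels from 2.6.36 onward provide a prlimit() system
call that can change the limits of an already running process, and Python
exposes it as resource.prlimit. A minimal sketch of the per-process step such
a cron job would perform, with set_limit as a hypothetical helper name:

```python
import os
import resource


def set_limit(pid, res, value):
    """Set soft and hard limit `res` on process `pid`; return the old pair."""
    return resource.prlimit(pid, res, (value, value))


# Demonstrated on the current process so that no privileges are needed.
old = set_limit(os.getpid(), resource.RLIMIT_CORE, 0)
print(resource.getrlimit(resource.RLIMIT_CORE))
```

Changing the limits of another user's process requires matching user ids or
the CAP_SYS_RESOURCE capability, so a cron job of this kind would normally
run as root.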
Thanks.
--
Stuart Anderson sba__AT__srl.caltech.edu http://www.srl.caltech.edu/personnel/sba
===========================================================================
Date mail was appended: Thu Aug 11 22:43:43 2005 (1123818223)
Subject: Actions
Assigned to wright by wright
===========================================================================
Date of actions: Thu Jun 1 4:19:32 2006 (1149153572)
CC: Todd Tannenbaum <tannenba__AT__cs.wisc.edu>, Alain Roy <roy__AT__cs.wisc.edu>
From: Derek Wright <wright__AT__cs.wisc.edu>
Subject: Re: [condor-admin #12616] Controlling resource abusive jobs
Date: Thu, 1 Jun 2006 14:00:07 -0700
To: condor-admin__AT__cs.wisc.edu
as per our discussion at condor-week about this issue, i've been
working on making sure that the condor code is correctly updating the
image size in the job queue whenever a vanilla job is getting preempted.
unfortunately, i started from the assumption it was broken, and
started writing code to fix it. ;)
the more i dug into the starter's code, the more it seemed like it
was *already* doing this. so, then i switched to trying to reproduce
the problem you've reported, and was unable to do so. in local
testing, with a job that malloc's 10K every second, everything seemed
to work as expected. periodic imagesize updates made it into the job
queue just fine. when PREEMPT became TRUE (when the imagesize
crossed a threshold), the job was evicted and the imagesize was
updated one last time during the eviction. of course, if KILL is
true and the startd is telling the starter to "hardkill" (or if you
use condor_vacate -fast), then no such updates happen, but that's by
design (since the starter is under the assumption it must
*immediately* evict, and has no time for costly operations like TCP
communication with the shadow).
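
a test job of the kind described above could look roughly like this sketch.
it is not the actual program used in the testing; grow and its parameters are
hypothetical, with defaults matching the 10K-per-second description.

```python
import time


def grow(chunk_bytes=10 * 1024, steps=60, delay=1.0):
    """Allocate chunk_bytes every `delay` seconds; return the bytes held."""
    held = []
    for _ in range(steps):
        # Keep a reference to each chunk so the memory is never released.
        held.append(bytearray(chunk_bytes))
        time.sleep(delay)
    return sum(len(chunk) for chunk in held)


if __name__ == "__main__":
    # Short run for illustration; a real test would run far longer.
    print(grow(steps=3, delay=0))
```

to exercise the case from the original report, the same function would be
run from a child forked by a shell script rather than as the job itself.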
looking at the code (but not testing it), i can verify that this does
*not* happen with standard universe jobs. there, if the startd
decides to preempt (for any reason), the starter does not do a final
imagesize update to the shadow. maybe that's the only problem you're
worried about?
can you confirm or deny that this is really a problem with vanilla
jobs (esp in 6.7.19)? that seemed to be your top priority based on
the meeting we had, but i can't find a problem, either in the code or
in testing. maybe it used to be a bug back in 2005 (when this issue
was created) but enough things have changed in the code that this has
already been fixed?
thanks,
-derek
===========================================================================
Date mail was appended: Thu Jun 1 16:00:22 2006 (1149195623)