LIGO Support Ticket 1754

Ticket Information
  Number:      support 1754
  User:        anderson@ligo.caltech.edu
  Email:       espinoza_e__AT__ligo.caltech.edu
  Status:      resolved
  Assigned To: adesmet
Date: Mon, 27 Nov 2006 17:35:30 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu
CC: Erik Espinoza <espinoza_e__AT__ligo.caltech.edu>
Subject: LIGO problem with condor_off and condor_vacate being ignored

The LIGO Condor pool at Caltech running,

# condor_version 
$CondorVersion: 6.8.2 Oct 12 2006 $
$CondorPlatform: X86_64-LINUX_RHEL3 $

fails to honor condor_vacate or condor_off -startd -graceful requests
after a condor_off -startd -peaceful command. This causes problems when
shutting down the cluster. I would like to be able to run,

condor_off -startd -peaceful
# Wait for a reasonable amount of time for currently running un-checkpointable
# jobs to complete.
condor_off -startd -graceful
# Kick off all remaining jobs, but let those that know how to check point
# do so.
#
# After KILLING_TIMEOUT it is now guaranteed that there are no longer
# any running jobs.


Another less frequent scenario where this functionality is desirable is when
condor_off -startd -peaceful has been run, but it is then realized that there
is not enough time to wait for all the jobs to finish. Currently it is then
necessary to run, condor_off -startd -fast.  I have not confirmed this,
but I suspect "fast" implies no checkpointing?

Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date of creation: Mon Nov 27 19:35:49 2006 (1164677752)
Subject: Actions

Assigned to adesmet by adesmet
===========================================================================
Date of actions: Tue Nov 28 14:59:47 2006 (1164747587)
Date: Tue, 28 Nov 2006 16:20:34 -0600
From: Alan De Smet <adesmet__AT__cs.wisc.edu>
To: adesmet <condor-support__AT__cs.wisc.edu>
Subject: Re: [condor-support #1754] LIGO problem with condor_off and  
 condor_vacate being ignored

> fails to honor condor_vacate or condor_off -startd -graceful requests
> after a condor_off -startd -peaceful command. 

Regrettably this is a known deficiency.  Once Condor gets a
particular shutdown path set in its mind, it's rather insistent
about it.  While annoying, it's not currently on our schedule to
fix in the near term.  How important is this to you?

> necessary to run, condor_off -startd -fast.  I have not confirmed this,
> but I suspect "fast" implies no checkpointing?

Correct.  A graceful shutdown will allow a job
SHUTDOWN_GRACEFUL_TIMEOUT seconds (default: 1800) to exit, and
thus checkpoint before SIGKILLing it.  A fast shutdown goes right
to SIGKILL.

===========================================================================
Date mail was appended: Tue Nov 28 16:20:37 2006 (1164752437)
Date: Tue, 28 Nov 2006 20:09:05 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>,
 Alain Roy <roy__AT__cs.wisc.edu>, Todd Tannenbaum <tannenba__AT__cs.wisc.edu>
CC: espinoza_e__AT__ligo.caltech.edu
Subject: Re: [condor-support #1754] LIGO problem with condor_off and  
 condor_vacate being ignored

Alain,
	Please add this to LIGO-tickets page.

Todd,
	Any progress on automatically filtering incoming LIGO tickets?

Alan,
	This is a significant enough bug that I would like you to reconsider
adding this to your list of fixes. Alternatively, or as a temporary workaround,
what is the current best practice for the following scenario:

Given a scheduled Condor pool shutdown at T0, how do I notify the Vanilla
Universe at T0-X that it should continue to run any jobs that are currently
running but not start any new ones, and then at T0-epsilon notify all jobs in
any Universe that can checkpoint (Standard and Parallel in our case) that
they should checkpoint now, and kill any jobs that cannot checkpoint?

This seems to me to be the "right" way to shut down a Condor pool, i.e.,
maximize efficiency of resources and minimize the impact to users. However,
please let me know if I have missed the "Condor" way of doing this.

Thanks.


On Tue, Nov 28, 2006 at 04:20:37PM -0600, condor-support response tracking system wrote:
> > fails to honor condor_vacate or condor_off -startd -graceful requests
> > after a condor_off -startd -peaceful command. 
> 
> Regrettably this is a known deficiency.  Once Condor gets a
> particular shutdown path set in its mind, it's rather insistent
> about it.  While annoying, it's not currently on our schedule to
> fix in the near term.  How important is this to you?

Enough so that I would like you to reconsider adding it to the enhancement
list.

> 
> > necessary to run, condor_off -startd -fast.  I have not confirmed this,
> > but I suspect "fast" implies no checkpointing?
> 
> Correct.  A graceful shutdown will allow a job
> SHUTDOWN_GRACEFUL_TIMEOUT seconds (default: 1800) to exit, and
> thus checkpoint before SIGKILLing it.  A fast shutdown goes right
> to SIGKILL.
> 

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Tue Nov 28 22:11:43 2006 (1164773503)
Date: Wed, 29 Nov 2006 10:49:55 -0600
From: Alan De Smet <adesmet__AT__cs.wisc.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
Subject: Re: [condor-support #1754] LIGO problem with condor_off and  
 condor_vacate being ignored

> Given a scheduled Condor pool shutdown at T0, how do I notify the Vanilla
> Universe at T0-X that it should continue to run any jobs that are currently
> running but not start any new ones, and then at T0-epsilon notify all jobs in
> any Universe that can checkpoint (Standard and Parallel in our case) that
> they should checkpoint now, and kill any jobs that cannot checkpoint?
> 
> This seems to me to be the "right" way to shut down a Condor pool, i.e.,
> maximize efficiency of resources and minimize the impact to users. However,
> please let me know if I have missed the "Condor" way of doing this.

Unfortunately there isn't a way to accomplish this directly with
Condor.

Allowing condor_off -graceful to work after condor_off
-peaceful would be ideal.  I'll bring it up with the other
staff today and we'll discuss reprioritizing it.

I'm also looking at a possible workaround that would use Condor's
"cron" functionality (also known as Hawkeye).  I'll let you know
when I have a more certain answer.


===========================================================================
Date mail was appended: Wed Nov 29 10:49:57 2006 (1164818997)
Date: Thu, 30 Nov 2006 18:41:07 -0600
From: Alan De Smet <adesmet__AT__cs.wisc.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
Subject: Re: [condor-support #1754] LIGO problem with condor_off and  
 condor_vacate being ignored


I have a workaround for the peaceful to graceful shutdown you'd
like.  The setting MAXJOBRETIREMENTTIME can be used to "delay" a
graceful shutdown.  The attached script implements this.

As it's currently written it lets a job continue for 45 minutes
before preempting it (the PEACEFULTIME setting).  Note that the
timer is inaccurate as it relies on information from the
collector which will lag the information the startd has.

The script as currently written expects to be run on the machine
you wish to shut down, although it could be modified to do remote
shutdowns.

This requires that MAXJOBRETIREMENTTIME be settable through the
runtime configuration option.  The following will enable this
functionality for the administrator:

ENABLE_RUNTIME_CONFIG = TRUE
SETTABLE_ATTRS_ADMINISTRATOR = MAXJOBRETIREMENTTIME


The "correct" solution (allowing -graceful after -peaceful) is
still on our list of things to do, but it may be a while.  In the
meantime, is this an acceptable workaround?

-- 
Alan De Smet                              Condor Project Research
adesmet__AT__cs.wisc.edu                 http://www.condorproject.org/


#! /bin/sh

# NOTE: new security in 6.8 means you need to specifically
# allow access to this setting.  The following settings will
# enable this:
#
# ENABLE_RUNTIME_CONFIG = TRUE
# SETTABLE_ATTRS_ADMINISTRATOR = MAXJOBRETIREMENTTIME

# Allow 45 minutes.
# Note that this is approximate and can easily be several minutes
# off as it relies on information from the collector which lags
# the startd.
PEACEFULTIME=`expr 60 '*' 45`

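MACHINE=`hostname`
RUNTIME=`condor_status "$MACHINE" -format '%d' TotalJobRunTime`
# If the collector reports no value, treat the runtime as 0 so that
# expr is not handed an empty operand.
RUNTIME=${RUNTIME:-0}
RETIRETIME=`expr $PEACEFULTIME + $RUNTIME`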

condor_config_val -startd -rset "MAXJOBRETIREMENTTIME=$RETIRETIME"
condor_reconfig -subsystem startd
condor_off -startd -graceful

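
The arithmetic in the script above can be tried without touching a running
pool. The following is a minimal sketch, not part of the original attachment:
retire_time is a hypothetical helper name (not a Condor command), and the
choice to treat a missing runtime as 0 is an assumption. As the script's
arithmetic implies, the value assigned to MAXJOBRETIREMENTTIME is the peaceful
window plus the time the job has already run.

```shell
#!/bin/sh
# Sketch only. retire_time is a hypothetical helper, not part of Condor.
# Usage: retire_time PEACEFUL_MINUTES [JOB_RUNTIME_SECONDS]
# Prints the value the script above would assign to MAXJOBRETIREMENTTIME:
# the peaceful window in seconds plus how long the job has already run.
retire_time() {
    minutes=$1
    runtime=${2:-0}    # assumption: empty or missing runtime counts as 0
    echo $(( $minutes * 60 + $runtime ))
}

retire_time 45 600    # job has already run for 10 minutes
retire_time 45 ""     # no runtime reported
```

With the 45 minute window from the script, the two calls print 3300 and 2700.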

===========================================================================
Date mail was appended: Thu Nov 30 18:41:09 2006 (1164933670)
Date: Thu, 30 Nov 2006 19:25:21 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: espinoza_e__AT__ligo.caltech.edu
Subject: Re: [condor-support #1754] LIGO problem with condor_off and  
 condor_vacate being ignored

Alan,

Very nice. Yes, this is an acceptable workaround for now. However, please
clarify what would happen if MAXJOBRETIREMENTTIME is changed after
the graceful command, i.e.,

condor_config_val -startd -rset "MAXJOBRETIREMENTTIME=$RETIRETIME"
condor_reconfig -subsystem startd
condor_off -startd -graceful
# New logistical constraints are realized
condor_config_val -startd -rset "MAXJOBRETIREMENTTIME=$NEWRETIRETIME"
condor_reconfig -subsystem startd

will the first RETIRETIME be honored or the second NEWRETIRETIME,
or something else?

Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Thu Nov 30 21:25:41 2006 (1164943542)
Date: Fri, 1 Dec 2006 11:22:36 -0600
From: Alan De Smet <adesmet__AT__cs.wisc.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
Subject: Re: [condor-support #1754] LIGO problem with condor_off and  
 condor_vacate being ignored

> Very nice. Yes, this is an acceptable workaround for now. However, please
> clarify what would happen if MAXJOBRETIREMENTTIME is changed after
> the graceful command, i.e.,
> 
> condor_config_val -startd -rset "MAXJOBRETIREMENTTIME=$RETIRETIME"
> condor_reconfig -subsystem startd
> condor_off -startd -graceful
> # New logistical constraints are realized
> condor_config_val -startd -rset "MAXJOBRETIREMENTTIME=$NEWRETIRETIME"
> condor_reconfig -subsystem startd
> 
> will the first RETIRETIME be honored or the second NEWRETIRETIME,
> or something else?

NEWRETIRETIME should replace RETIRETIME.  Graceful shutdowns
allow MAXJOBRETIREMENTTIME to change.  Note that peaceful
shutdowns do _not_ allow it to change (peaceful is implemented in
part through MAXJOBRETIREMENTTIME).
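
Per the answer above, a later change to MAXJOBRETIREMENTTIME should take
effect during a graceful (but not a peaceful) shutdown. The following is a
minimal dry-run sketch of issuing that change. The helper names run and
set_retirement are hypothetical, and 900 is an arbitrary illustrative value;
the two Condor commands are the ones quoted earlier in this thread. With
DRY_RUN=1 nothing is executed, the commands are only printed.

```shell
#!/bin/sh
# Sketch only. run and set_retirement are hypothetical helpers.
DRY_RUN=1    # set to 0 to actually execute the commands

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# set_retirement SECONDS: push a new MAXJOBRETIREMENTTIME to the local
# startd and ask it to re-read its configuration.
set_retirement() {
    run condor_config_val -startd -rset "MAXJOBRETIREMENTTIME=$1"
    run condor_reconfig -subsystem startd
}

set_retirement 900
```

This still requires the ENABLE_RUNTIME_CONFIG and SETTABLE_ATTRS_ADMINISTRATOR
settings given earlier in the thread.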


===========================================================================
Date mail was appended: Fri Dec  1 11:22:39 2006 (1164993760)
Subject: Actions

Ticket resolved by adesmet
===========================================================================
Date of actions: Fri Dec  1 11:22:39 2006 (1164993761)