LIGO Support Ticket 13160

Ticket Information
  Number:      admin 13160
  User:        anderson@ligo.caltech.edu
  Email:       
  Status:      open
  Assigned To: jfrey
Date: Fri, 6 Jan 2006 10:15:45 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: condor_off fails to checkpoint in 6.7.14?

From the man page for condor_off I was expecting it to checkpoint standard
universe jobs before shutting down a node. However, in version 6.7.14
condor_off nodename
appears to shut down the ckpt server before checkpointing running jobs,
i.e., tail CkptServerLog

                        Sending ckpt server ad to collector...
1/6 10:05:45    SIGTERM trapped; shutting down checkpoint server


After 5 minutes the starter then issued,

1/6 10:10:46 vm2: State change: KILL is TRUE
1/6 10:10:46 vm2: Changing activity: Vacating -> Killing

but that appears to have failed, so it ran,

1/6 10:11:16 vm2: starter (pid 25382) is not responding to the request to hardkill its job. 
 The startd will now directly hard kill the starter and all its decendents.

Is this a bug or a feature?


Thanks.


-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date of creation: Fri Jan  6 12:16:01 2006 (1136571364)
Subject: Actions

Assigned to jfrey by jfrey
===========================================================================
Date of actions: Fri Jan  6 13:54:42 2006 (1136577282)
From: Jaime Frey <jfrey__AT__cs.wisc.edu>
Subject: Re: [condor-admin #13160] condor_off fails to checkpoint in 6.7.14?
Date: Fri, 6 Jan 2006 14:03:39 -0600
To: condor-admin__AT__cs.wisc.edu

> From the man page for condor_off I was expecting it to checkpoint  
> standard
> universe jobs before shutting down a node. However, in version 6.7.14
> condor_off nodename
> appears to shutdown the ckpt server before checkpointing running jobs,
> i.e., tail CkptServerLog
>
>                         Sending ckpt server ad to collector...
> 1/6 10:05:45    SIGTERM trapped; shutting down checkpoint server
>
>
> After 5 minutes the starter then issued,
>
> 1/6 10:10:46 vm2: State change: KILL is TRUE
> 1/6 10:10:46 vm2: Changing activity: Vacating -> Killing
>
> but that appears to have failed, so it ran,
>
> 1/6 10:11:16 vm2: starter (pid 25382) is not responding to the  
> request to hardkill its job.
>  The startd will now directly hard kill the starter and all its  
> decendents.
>
> Is this a bug or a feature?

Are you running checkpoint servers on all of your execute machines?  
The anticipated use is to have a small handful of checkpoint servers  
for a pool. There's no coordination between shutting down checkpoint  
servers and startds because they're usually not on the same machines.

As for the starter waiting 5 minutes to kill the job and then failing  
to do so, I would need the full log files to try diagnosing the problem.

Thanks and regards,
UW-Madison Condor Team



===========================================================================
Date mail was appended: Fri Jan  6 14:03:33 2006 (1136577814)
Subject: Actions

Status changed from open to pending by jfrey
===========================================================================
Date of actions: Fri Jan  6 14:03:33 2006 (1136577815)
Date: Fri, 6 Jan 2006 15:04:35 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
Subject: Re: [condor-admin #13160] condor_off fails to checkpoint in 6.7.14?

On Fri, Jan 06, 2006 at 02:03:33PM -0600, condor-admin response tracking system wrote:
> > From the man page for condor_off I was expecting it to checkpoint  
> > standard
> > universe jobs before shutting down a node. However, in version 6.7.14
> > condor_off nodename
> > appears to shutdown the ckpt server before checkpointing running jobs,
> > i.e., tail CkptServerLog
> >
> >                         Sending ckpt server ad to collector...
> > 1/6 10:05:45    SIGTERM trapped; shutting down checkpoint server
> >
> >
> > After 5 minutes the starter then issued,
> >
> > 1/6 10:10:46 vm2: State change: KILL is TRUE
> > 1/6 10:10:46 vm2: Changing activity: Vacating -> Killing
> >
> > but that appears to have failed, so it ran,
> >
> > 1/6 10:11:16 vm2: starter (pid 25382) is not responding to the  
> > request to hardkill its job.
> >  The startd will now directly hard kill the starter and all its  
> > decendents.
> >
> > Is this a bug or a feature?
> 
> Are you running checkpoint servers on all of your execute machines?  

Yes.

> The anticipated use is to have a small handful of checkpoint servers  
> for a pool. There's no coordination between shutting down checkpoint  
> servers and startds because they're usually not on the same machines.

Unfortunate.

The configuration for the LIGO clusters, with dedicated compute nodes
for running Condor, is that each node in the cluster runs its own checkpoint
server since it has a local connection to a high speed disk. Otherwise, a
synchronous shutdown of hundreds of nodes running >100 MByte standard universe
jobs could overload the checkpoint servers with 10-100 GByte of data
to checkpoint, or at a minimum take a long time.

Please consider adding synchronization logic to delay shutting down any
checkpoint servers until the condor_startd/condor_starter processes
have had a chance to checkpoint (if possible).

> 
> As for the starter waiting 5 minutes to kill the job and then failing  
> to do so, I would need the full log files to try diagnosing the problem.

Possibly because the starter was blocked trying to talk to the now
non-existent checkpoint server? Other than that this part of the puzzle
is probably not worth following up much further right now.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Fri Jan  6 17:04:52 2006 (1136588692)
From: Jaime Frey <jfrey__AT__cs.wisc.edu>
Subject: Re: [condor-admin #13160] condor_off fails to checkpoint in 6.7.14?
Date: Fri, 6 Jan 2006 17:39:17 -0600
To: condor-admin__AT__cs.wisc.edu

On Jan 6, 2006, at 5:04 PM, condor-admin response tracking system wrote:

> On Fri, Jan 06, 2006 at 02:03:33PM -0600, condor-admin response  
> tracking system wrote:
>>> From the man page for condor_off I was expecting it to checkpoint
>>> standard
>>> universe jobs before shutting down a node. However, in version  
>>> 6.7.14
>>> condor_off nodename
>>> appears to shutdown the ckpt server before checkpointing running  
>>> jobs,
>>> i.e., tail CkptServerLog
>>>
>>>                         Sending ckpt server ad to collector...
>>> 1/6 10:05:45    SIGTERM trapped; shutting down checkpoint server
>>>
>>>
>>> After 5 minutes the starter then issued,
>>>
>>> 1/6 10:10:46 vm2: State change: KILL is TRUE
>>> 1/6 10:10:46 vm2: Changing activity: Vacating -> Killing
>>>
>>> but that appears to have failed, so it ran,
>>>
>>> 1/6 10:11:16 vm2: starter (pid 25382) is not responding to the
>>> request to hardkill its job.
>>>  The startd will now directly hard kill the starter and all its
>>> decendents.
>>>
>>> Is this a bug or a feature?
>>
>> Are you running checkpoint servers on all of your execute machines?
>
> Yes.
>
>> The anticipated use is to have a small handful of checkpoint servers
>> for a pool. There's no coordination between shutting down checkpoint
>> servers and startds because they're usually not on the same machines.
>
> Unfortunate.
>
> The configuration for the LIGO clusters, with dedicated compute nodes
> for running Condor, is that each node in the cluster runs its own  
> checkpoint
> server since it has a local connection to a high speed disk.  
> Otherwise, a
> synchronous shutdown of 100's of nodes running >100MByte standard  
> universe
> jobs would possibly overload the checkpoint servers with  
> 10-100GByte of data
> to checkpoint, or at a minimum take a long time.
>
> Please consider adding synchronization logic to delay shutting down any
> checkpoint servers until the condor_startd/condor_starter processes
> have had a chance to checkpoint (if possible).

Your setup assumes that all machines go up and down together. Failure  
of individual machines will make some jobs unable to run until the  
machine comes back up.

You should be able to approximate the synchronized logic by shutting  
down all of the startds first, then everything else, like so:

condor_off -all -startd
sleep 300
condor_off -all
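
The same two-stage sequence could be wrapped in a small script so the
grace period is adjustable and the commands can be previewed first. This is
only a sketch under stated assumptions: the function names and the
GRACE_SECONDS and DRY_RUN variables are hypothetical, and the only
condor_off invocations used are the two shown above.

```shell
#!/bin/sh
# Two-stage pool shutdown: stop the startds first so running jobs get a
# chance to checkpoint, wait, then stop the remaining daemons (including
# any per-node checkpoint servers).
#
# Assumptions (hypothetical, not part of Condor): the function names and
# the GRACE_SECONDS / DRY_RUN environment variables.

# Run a command, or only print it when DRY_RUN is set to a non-empty value.
run_cmd() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

staged_condor_off() {
    # Default of 300 seconds matches the "sleep 300" above.
    grace="${GRACE_SECONDS:-300}"

    # Stage 1: ask every startd to shut down; give up if this fails.
    run_cmd condor_off -all -startd || return 1

    # Give the jobs time to checkpoint before the servers disappear.
    run_cmd sleep "$grace"

    # Stage 2: shut down everything that is still running.
    run_cmd condor_off -all
}
```

Running `DRY_RUN=1 staged_condor_off` prints the three commands without
executing them. The fixed wait is still only an approximation: nothing here
checks that the checkpoints actually finished before stage 2 starts.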

Thanks and regards,
UW-Madison Condor Team



===========================================================================
Date mail was appended: Fri Jan  6 17:39:00 2006 (1136590741)
Subject: Actions

Status changed from open to pending by jfrey
===========================================================================
Date of actions: Fri Jan  6 17:39:00 2006 (1136590742)
Date: Fri, 6 Jan 2006 16:05:01 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
Subject: Re: [condor-admin #13160] condor_off fails to checkpoint in 6.7.14?

On Fri, Jan 06, 2006 at 05:39:00PM -0600, condor-admin response tracking system wrote:
> On Jan 6, 2006, at 5:04 PM, condor-admin response tracking system wrote:
> 
> > On Fri, Jan 06, 2006 at 02:03:33PM -0600, condor-admin response  
> > tracking system wrote:
> >>> From the man page for condor_off I was expecting it to checkpoint
> >>> standard
> >>> universe jobs before shutting down a node. However, in version  
> >>> 6.7.14
> >>> condor_off nodename
> >>> appears to shutdown the ckpt server before checkpointing running  
> >>> jobs,
> >>> i.e., tail CkptServerLog
> >>>
> >>>                         Sending ckpt server ad to collector...
> >>> 1/6 10:05:45    SIGTERM trapped; shutting down checkpoint server
> >>>
> >>>
> >>> After 5 minutes the starter then issued,
> >>>
> >>> 1/6 10:10:46 vm2: State change: KILL is TRUE
> >>> 1/6 10:10:46 vm2: Changing activity: Vacating -> Killing
> >>>
> >>> but that appears to have failed, so it ran,
> >>>
> >>> 1/6 10:11:16 vm2: starter (pid 25382) is not responding to the
> >>> request to hardkill its job.
> >>>  The startd will now directly hard kill the starter and all its
> >>> decendents.
> >>>
> >>> Is this a bug or a feature?
> >>
> >> Are you running checkpoint servers on all of your execute machines?
> >
> > Yes.
> >
> >> The anticipated use is to have a small handful of checkpoint servers
> >> for a pool. There's no coordination between shutting down checkpoint
> >> servers and startds because they're usually not on the same machines.
> >
> > Unfortunate.
> >
> > The configuration for the LIGO clusters, with dedicated compute nodes
> > for running Condor, is that each node in the cluster runs its own  
> > checkpoint
> > server since it has a local connection to a high speed disk.  
> > Otherwise, a
> > synchronous shutdown of 100's of nodes running >100MByte standard  
> > universe
> > jobs would possibly overload the checkpoint servers with  
> > 10-100GByte of data
> > to checkpoint, or at a minimum take a long time.
> >
> > Please consider adding synchronization logic to delay shutting down any
> > checkpoint servers until the condor_startd/condor_starter processes
> > have had a chance to checkpoint (if possible).
> 
> Your setup assumes that all machines go up and down together. Failure  
> of individual machines will make some jobs unable to run until the  
> machine comes back up.
> 
> You should be able to approximate the synchronized logic by shutting  
> down all of the startds first, then everything else, like so:
> 
> condor_off -all -startd
> sleep 300
> condor_off -all
> 

Understood. However, if you add logic to shut down condor_startd before
the checkpoint server (if one exists), this would not change the
behavior for your anticipated dedicated checkpoint server model, but it
would have the advantage of supporting a simple shutdown mechanism for
clusters that use local disk storage for high-speed parallel checkpointing.

What do you think?

Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Fri Jan  6 18:05:19 2006 (1136592320)
From: Jaime Frey <jfrey__AT__cs.wisc.edu>
Subject: Re: [condor-admin #13160] condor_off fails to checkpoint in 6.7.14?
Date: Fri, 6 Jan 2006 18:27:42 -0600
To: condor-admin__AT__cs.wisc.edu

> Understood. However, if you add the logic to shutdown condor_startd  
> first
> before the checkpoint server (if it exists) this would not change the
> behavior for your anticipated dedicated checkpoint server model,  
> but it
> would have the advantage of supporting a simple shutdown mechanism for
> clusters that use local disk storage for high speed parallel  
> checkpointing.
>
> What do you think?

We'll consider it.

Thanks and regards,
UW-Madison Condor Team



===========================================================================
Date mail was appended: Fri Jan  6 18:27:26 2006 (1136593646)
Subject: Actions

Ticket resolved by jfrey
===========================================================================
Date of actions: Fri Jan  6 18:27:26 2006 (1136593647)
Subject: Actions

Ticket was reopened by mailnull
===========================================================================
Date of actions: Thu Oct 19 16:53:52 2006 (1161294832)
Date: Thu, 19 Oct 2006 14:53:26 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>,
 tannenba__AT__cs.wisc.edu
Subject: Re: [condor-admin #13160] condor_off fails to checkpoint in 6.7.14?

On Fri, Jan 06, 2006 at 06:27:26PM -0600, condor-admin response tracking system wrote:
> > Understood. However, if you add the logic to shutdown condor_startd  
> > first
> > before the checkpoint server (if it exists) this would not change the
> > behavior for your anticipated dedicated checkpoint server model,  
> > but it
> > would have the advantage of supporting a simple shutdown mechanism for
> > clusters that use local disk storage for high speed parallel  
> > checkpointing.
> >
> > What do you think?
> 
> We'll consider it.

Jaime, Todd,
	What was the conclusion on this?

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Thu Oct 19 16:53:52 2006 (1161294832)