LIGO Support Ticket 13160
Ticket Information
Number: admin 13160
User: anderson@ligo.caltech.edu
Email:
Status: open
Assigned To: jfrey
Date: Fri, 6 Jan 2006 10:15:45 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: condor_off fails to checkpoint in 6.7.14?
From the man page for condor_off I was expecting it to checkpoint standard
universe jobs before shutting down a node. However, in version 6.7.14
condor_off nodename
appears to shut down the ckpt server before checkpointing running jobs,
i.e., tail CkptServerLog
Sending ckpt server ad to collector...
1/6 10:05:45 SIGTERM trapped; shutting down checkpoint server
After 5 minutes the starter then issued,
1/6 10:10:46 vm2: State change: KILL is TRUE
1/6 10:10:46 vm2: Changing activity: Vacating -> Killing
but that appears to have failed, so it ran,
1/6 10:11:16 vm2: starter (pid 25382) is not responding to the request to hardkill its job.
The startd will now directly hard kill the starter and all its decendents.
Is this a bug or a feature?
Thanks.
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date of creation: Fri Jan 6 12:16:01 2006 (1136571364)
Subject: Actions
Assigned to jfrey by jfrey
===========================================================================
Date of actions: Fri Jan 6 13:54:42 2006 (1136577282)
From: Jaime Frey <jfrey__AT__cs.wisc.edu>
Subject: Re: [condor-admin #13160] condor_off fails to checkpoint in 6.7.14?
Date: Fri, 6 Jan 2006 14:03:39 -0600
To: condor-admin__AT__cs.wisc.edu
> From the man page for condor_off I was expecting it to checkpoint
> standard
> universe jobs before shutting down a node. However, in version 6.7.14
> condor_off nodename
> appears to shutdown the ckpt server before checkpointing running jobs,
> i.e., tail CkptServerLog
>
> Sending ckpt server ad to collector...
> 1/6 10:05:45 SIGTERM trapped; shutting down checkpoint server
>
>
> After 5 minutes the starter then issued,
>
> 1/6 10:10:46 vm2: State change: KILL is TRUE
> 1/6 10:10:46 vm2: Changing activity: Vacating -> Killing
>
> but that appears to have failed, so it ran,
>
> 1/6 10:11:16 vm2: starter (pid 25382) is not responding to the
> request to hardkill its job.
> The startd will now directly hard kill the starter and all its
> decendents.
>
> Is this a bug or a feature?
Are you running checkpoint servers on all of your execute machines?
The anticipated use is to have a small handful of checkpoint servers
for a pool. There's no coordination between shutting down checkpoint
servers and startds because they're usually not on the same machines.
As for the starter waiting 5 minutes to kill the job and then failing
to do so, I would need the full log files to try diagnosing the problem.
Thanks and regards,
UW-Madison Condor Team
===========================================================================
Date mail was appended: Fri Jan 6 14:03:33 2006 (1136577814)
Subject: Actions
Status changed from open to pending by jfrey
===========================================================================
Date of actions: Fri Jan 6 14:03:33 2006 (1136577815)
Date: Fri, 6 Jan 2006 15:04:35 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
Subject: Re: [condor-admin #13160] condor_off fails to checkpoint in 6.7.14?
On Fri, Jan 06, 2006 at 02:03:33PM -0600, condor-admin response tracking system wrote:
> > From the man page for condor_off I was expecting it to checkpoint
> > standard
> > universe jobs before shutting down a node. However, in version 6.7.14
> > condor_off nodename
> > appears to shutdown the ckpt server before checkpointing running jobs,
> > i.e., tail CkptServerLog
> >
> > Sending ckpt server ad to collector...
> > 1/6 10:05:45 SIGTERM trapped; shutting down checkpoint server
> >
> >
> > After 5 minutes the starter then issued,
> >
> > 1/6 10:10:46 vm2: State change: KILL is TRUE
> > 1/6 10:10:46 vm2: Changing activity: Vacating -> Killing
> >
> > but that appears to have failed, so it ran,
> >
> > 1/6 10:11:16 vm2: starter (pid 25382) is not responding to the
> > request to hardkill its job.
> > The startd will now directly hard kill the starter and all its
> > decendents.
> >
> > Is this a bug or a feature?
>
> Are you running checkpoint servers on all of your execute machines?
Yes.
> The anticipated use is to have a small handful of checkpoint servers
> for a pool. There's no coordination between shutting down checkpoint
> servers and startds because they're usually not on the same machines.
Unfortunate.
The configuration for the LIGO clusters, with dedicated compute nodes
for running Condor, is that each node in the cluster runs its own checkpoint
server since it has a local connection to a high speed disk. Otherwise, a
synchronous shutdown of 100's of nodes running >100MByte standard universe
jobs would possibly overload the checkpoint servers with 10-100GByte of data
to checkpoint, or at a minimum take a long time.
Please consider adding synchronization logic to delay shutting down any
checkpoint servers until the condor_startd/condor_starter processes
have had a chance to checkpoint (if possible).
>
> As for the starter waiting 5 minutes to kill the job and then failing
> to do so, I would need the full log files to try diagnosing the problem.
Possibly because the starter was blocked trying to talk to the now
non-existent checkpoint server? Other than that this part of the puzzle
is probably not worth following up much further right now.
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date mail was appended: Fri Jan 6 17:04:52 2006 (1136588692)
From: Jaime Frey <jfrey__AT__cs.wisc.edu>
Subject: Re: [condor-admin #13160] condor_off fails to checkpoint in 6.7.14?
Date: Fri, 6 Jan 2006 17:39:17 -0600
To: condor-admin__AT__cs.wisc.edu
On Jan 6, 2006, at 5:04 PM, condor-admin response tracking system wrote:
> On Fri, Jan 06, 2006 at 02:03:33PM -0600, condor-admin response
> tracking system wrote:
>>> From the man page for condor_off I was expecting it to checkpoint
>>> standard
>>> universe jobs before shutting down a node. However, in version
>>> 6.7.14
>>> condor_off nodename
>>> appears to shutdown the ckpt server before checkpointing running
>>> jobs,
>>> i.e., tail CkptServerLog
>>>
>>> Sending ckpt server ad to collector...
>>> 1/6 10:05:45 SIGTERM trapped; shutting down checkpoint server
>>>
>>>
>>> After 5 minutes the starter then issued,
>>>
>>> 1/6 10:10:46 vm2: State change: KILL is TRUE
>>> 1/6 10:10:46 vm2: Changing activity: Vacating -> Killing
>>>
>>> but that appears to have failed, so it ran,
>>>
>>> 1/6 10:11:16 vm2: starter (pid 25382) is not responding to the
>>> request to hardkill its job.
>>> The startd will now directly hard kill the starter and all its
>>> decendents.
>>>
>>> Is this a bug or a feature?
>>
>> Are you running checkpoint servers on all of your execute machines?
>
> Yes.
>
>> The anticipated use is to have a small handful of checkpoint servers
>> for a pool. There's no coordination between shutting down checkpoint
>> servers and startds because they're usually not on the same machines.
>
> Unfortunate.
>
> The configuration for the LIGO clusters, with dedicated compute nodes
> for running Condor, is that each node in the cluster runs its own
> checkpoint
> server since it has a local connection to a high speed disk.
> Otherwise, a
> synchronous shutdown of 100's of nodes running >100MByte standard
> universe
> jobs would possibly overload the checkpoint servers with
> 10-100GByte of data
> to checkpoint, or at a minimum take a long time.
>
> Please consider adding synchronization logic to delay shutting down any
> checkpoint servers until the condor_startd/condor_starter processes
> have had a chance to checkpoint (if possible).
Your setup assumes that all machines go up and down together. Failure
of individual machines will make some jobs unable to run until the
machine comes back up.
You should be able to approximate the synchronized logic by shutting
down all of the startds first, then everything else, like so:
condor_off -all -startd
sleep 300
condor_off -all
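The same two-phase sequence could be wrapped in a small shell function, as a rough sketch only. The function name `staged_shutdown` and the `CONDOR_OFF` and `VACATE_WAIT` variables are hypothetical names introduced for illustration; they are not Condor configuration settings. The `-all -startd` and `-all` options are the ones shown in the commands above, and the 300-second default simply mirrors the `sleep 300`.

```shell
# Sketch: stop the startds first, wait for jobs to vacate, then stop
# every remaining daemon (including any local checkpoint server).
# CONDOR_OFF and VACATE_WAIT are hypothetical overrides, not Condor knobs.
staged_shutdown() {
    # Phase 1: shut down only the startds so running jobs get a chance
    # to checkpoint while the checkpoint servers are still up.
    "${CONDOR_OFF:-condor_off}" -all -startd || return 1

    # Give the jobs time to finish writing their checkpoints.
    sleep "${VACATE_WAIT:-300}"

    # Phase 2: shut down everything else.
    "${CONDOR_OFF:-condor_off}" -all
}
```

Calling `staged_shutdown` with no overrides runs the three steps exactly as listed above. Setting `VACATE_WAIT` to a larger value would allow more time for large checkpoints; a suitable value depends on the pool and is not something this sketch can determine.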
Thanks and regards,
UW-Madison Condor Team
===========================================================================
Date mail was appended: Fri Jan 6 17:39:00 2006 (1136590741)
Subject: Actions
Status changed from open to pending by jfrey
===========================================================================
Date of actions: Fri Jan 6 17:39:00 2006 (1136590742)
Date: Fri, 6 Jan 2006 16:05:01 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
Subject: Re: [condor-admin #13160] condor_off fails to checkpoint in 6.7.14?
On Fri, Jan 06, 2006 at 05:39:00PM -0600, condor-admin response tracking system wrote:
> On Jan 6, 2006, at 5:04 PM, condor-admin response tracking system wrote:
>
> > On Fri, Jan 06, 2006 at 02:03:33PM -0600, condor-admin response
> > tracking system wrote:
> >>> From the man page for condor_off I was expecting it to checkpoint
> >>> standard
> >>> universe jobs before shutting down a node. However, in version
> >>> 6.7.14
> >>> condor_off nodename
> >>> appears to shutdown the ckpt server before checkpointing running
> >>> jobs,
> >>> i.e., tail CkptServerLog
> >>>
> >>> Sending ckpt server ad to collector...
> >>> 1/6 10:05:45 SIGTERM trapped; shutting down checkpoint server
> >>>
> >>>
> >>> After 5 minutes the starter then issued,
> >>>
> >>> 1/6 10:10:46 vm2: State change: KILL is TRUE
> >>> 1/6 10:10:46 vm2: Changing activity: Vacating -> Killing
> >>>
> >>> but that appears to have failed, so it ran,
> >>>
> >>> 1/6 10:11:16 vm2: starter (pid 25382) is not responding to the
> >>> request to hardkill its job.
> >>> The startd will now directly hard kill the starter and all its
> >>> decendents.
> >>>
> >>> Is this a bug or a feature?
> >>
> >> Are you running checkpoint servers on all of your execute machines?
> >
> > Yes.
> >
> >> The anticipated use is to have a small handful of checkpoint servers
> >> for a pool. There's no coordination between shutting down checkpoint
> >> servers and startds because they're usually not on the same machines.
> >
> > Unfortunate.
> >
> > The configuration for the LIGO clusters, with dedicated compute nodes
> > for running Condor, is that each node in the cluster runs its own
> > checkpoint
> > server since it has a local connection to a high speed disk.
> > Otherwise, a
> > synchronous shutdown of 100's of nodes running >100MByte standard
> > universe
> > jobs would possibly overload the checkpoint servers with
> > 10-100GByte of data
> > to checkpoint, or at a minimum take a long time.
> >
> > > Please consider adding synchronization logic to delay shutting down any
> > checkpoint servers until the condor_startd/condor_starter processes
> > have had a chance to checkpoint (if possible).
>
> Your setup assumes that all machines go up and down together. Failure
> of individual machines will make some jobs unable to run until the
> machine comes back up.
>
> You should be able to approximate the synchronized logic by shutting
> down all of the startds first, then everything else, like so:
>
> condor_off -all -startd
> sleep 300
> condor_off -all
>
Understood. However, if you add the logic to shut down condor_startd first
before the checkpoint server (if it exists) this would not change the
behavior for your anticipated dedicated checkpoint server model, but it
would have the advantage of supporting a simple shutdown mechanism for
clusters that use local disk storage for high speed parallel checkpointing.
What do you think?
Thanks.
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date mail was appended: Fri Jan 6 18:05:19 2006 (1136592320)
From: Jaime Frey <jfrey__AT__cs.wisc.edu>
Subject: Re: [condor-admin #13160] condor_off fails to checkpoint in 6.7.14?
Date: Fri, 6 Jan 2006 18:27:42 -0600
To: condor-admin__AT__cs.wisc.edu
> Understood. However, if you add the logic to shutdown condor_startd
> first
> before the checkpoint server (if it exists) this would not change the
> behavior for your anticipated dedicated checkpoint server model,
> but it
> would have the advantage of supporting a simple shutdown mechanism for
> clusters that use local disk storage for high speed parallel
> checkpointing.
>
> What do you think?
We'll consider it.
Thanks and regards,
UW-Madison Condor Team
===========================================================================
Date mail was appended: Fri Jan 6 18:27:26 2006 (1136593646)
Subject: Actions
Ticket resolved by jfrey
===========================================================================
Date of actions: Fri Jan 6 18:27:26 2006 (1136593647)
Subject: Actions
Ticket was reopened by mailnull
===========================================================================
Date of actions: Thu Oct 19 16:53:52 2006 (1161294832)
Date: Thu, 19 Oct 2006 14:53:26 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>,
tannenba__AT__cs.wisc.edu
Subject: Re: [condor-admin #13160] condor_off fails to checkpoint in 6.7.14?
On Fri, Jan 06, 2006 at 06:27:26PM -0600, condor-admin response tracking system wrote:
> > Understood. However, if you add the logic to shutdown condor_startd
> > first
> > before the checkpoint server (if it exists) this would not change the
> > behavior for your anticipated dedicated checkpoint server model,
> > but it
> > would have the advantage of supporting a simple shutdown mechanism for
> > clusters that use local disk storage for high speed parallel
> > checkpointing.
> >
> > What do you think?
>
> We'll consider it.
Jaime, Todd,
What was the conclusion on this?
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date mail was appended: Thu Oct 19 16:53:52 2006 (1161294832)