LIGO Support Ticket 17529

Ticket Information
  Number:      admin 17529
  User:        anderson@ligo.caltech.edu
  Email:       skoranda__AT__gravity.phys.uwm.edu
  Status:      open
  Assigned To: burnett
Date: Sat, 1 Mar 2008 13:03:50 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
CC: Scott Koranda <skoranda__AT__gravity.phys.uwm.edu>
Subject: LIGO: condor_master unexpectedly restarts other condor daemons on
 self restart
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu

Running,

# condor_version
$CondorVersion: 7.0.1 Feb 26 2008 BuildID: 76180 $
$CondorPlatform: X86_64-LINUX_RHEL3 $

on,

# cat /etc/redhat-release 
Fedora Core release 4 (Stentz)

it was observed that when condor_master restarts itself after automatically
detecting a new condor_master binary image on disk, it restarts all of the
daemons in DAEMON_LIST, even if they were not all running before the
exec() call.

This was observed when upgrading (rpm -U) the condor binaries on an
execute machine that had the startd deliberately turned off, i.e.,
after condor_master restarted itself it proceeded to start the Startd.

This is not a critical problem, but if it is easy to "fix", I would find
it more intuitive if condor_master only restarted those condor daemons it
was actively managing when it restarts itself.

Thanks.


-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date of creation: Sat Mar  1 15:04:11 2008 (1204405457)
Subject: Actions

Assigned to burnett by burnett
===========================================================================
Date of actions: Mon Mar  3 14:30:07 2008 (1204576207)
Date: Mon, 03 Mar 2008 14:35:33 -0600
From: Ben Burnett <burnett__AT__cs.wisc.edu>
Subject: RE: [condor-admin #17529] LIGO: condor_master unexpectedly
 restarts other condor daemons on
To: condor-admin__AT__cs.wisc.edu
Content-Language: en-us
Thread-Index: Ach9bV+q29ziBzLgQVSx7Qcnf+5/CQAAA5hg
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu

Hi Stuart:

The reason for this is that the master retains no persistent state, so if a
person were to call condor_off for a particular subsystem, it is only that
instance of the master that will be mindful of that operation.

The preferred method would be to configure DAEMON_LIST on the machines in
question to start only the daemons you want.  
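
As a minimal sketch of such a per-machine override, assuming an execute
machine that normally runs only the master and the startd (the site's actual
default daemon list is an assumption here), the local configuration file
could contain:

```
# condor_config.local on the problem execute machine (illustrative)
# STARTD is left out, so the master will not start it, even after the
# master restarts itself on a new binary.
DAEMON_LIST = MASTER
```

Once the machine is fixed, the original DAEMON_LIST entry would need to be
restored for the startd to be managed by the master again.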

Sorry I can't offer a simpler solution.

Regards,
-B




===========================================================================
Date mail was appended: Mon Mar  3 14:36:15 2008 (1204576575)
Date: Mon, 3 Mar 2008 13:26:25 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: skoranda__AT__gravity.phys.uwm.edu
Subject: Re: [condor-admin #17529] LIGO: condor_master unexpectedly  
 restarts other condor daemons on
X-Seen-BY: mailfromd 4.1 granite.cs.wisc.edu

On Mon, Mar 03, 2008 at 02:36:15PM -0600, condor-admin response tracking system wrote:
> Hi Stuart:
> 
> The reason for this is that the master retains no persistent state, so if a
> person were to call condor_off for a particular subsystem, it is only that
> instance of the master that will be mindful of that operation.
> 
> The preferred method would be to configure DAEMON_LIST on the machines in
> question to start only the daemons you want.  
> 
> Sorry I can't offer a simpler solution.
> 

Understood.

What do you think about having condor_master pass state information to
its reincarnation (via the exec() call for example) so it restarts
in the same state as it was before restarting itself?
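
One way to picture this idea is the sketch below. It is not Condor code: it
is a hypothetical Python illustration, assuming the set of deliberately
stopped daemons is handed to the new process image through an environment
variable. The variable name MASTER_STOPPED_DAEMONS and all function names
are invented for this example.

```python
import os
import sys

# Hypothetical environment variable used to carry state across exec().
STATE_VAR = "MASTER_STOPPED_DAEMONS"


def encode_stopped(stopped):
    """Serialize the set of stopped daemon names as a comma-separated string."""
    return ",".join(sorted(stopped))


def decode_stopped(value):
    """Parse the comma-separated string back into a set, ignoring empty items."""
    return {name for name in value.split(",") if name}


def daemons_to_start(daemon_list, env):
    """Return the configured daemons minus those stopped before the exec()."""
    stopped = decode_stopped(env.get(STATE_VAR, ""))
    return [d for d in daemon_list if d not in stopped]


def reexec_self(stopped):
    """Replace the current process with a fresh copy, passing state in the env."""
    env = dict(os.environ)
    env[STATE_VAR] = encode_stopped(stopped)
    argv = [sys.executable] + sys.argv
    os.execve(sys.executable, argv, env)  # does not return on success
```

With this approach the new process would call
daemons_to_start(["MASTER", "STARTD"], os.environ) at startup and skip the
startd if it had been turned off before the restart. The state lives only in
the exec() hand-off, so a master started from scratch still starts everything
in its configured list.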

Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Mon Mar  3 15:26:41 2008 (1204579602)
Date: Mon, 03 Mar 2008 15:33:48 -0600
From: Ben Burnett <burnett__AT__cs.wisc.edu>
Subject: RE: [condor-admin #17529] LIGO: condor_master unexpectedly
 restarts other condor daemons on
To: condor-admin__AT__cs.wisc.edu
Content-Language: en-us
Thread-Index: Ach9dUamMc97U8PORDuiXrh5FL/38QAALUOg
X-Seen-BY: mailfromd 4.1 obsidian.cs.wisc.edu

>>>
What do you think about having condor_master pass state information to
its reincarnation (via the exec() call for example) so it restarts
in the same state as it was before restarting itself?
<<<

It's certainly a viable option; is there any reason you could not use per
machine local configurations? (I'm just trying to understand the situation.)

-B


===========================================================================
Date mail was appended: Mon Mar  3 15:34:52 2008 (1204580092)
Date: Mon, 3 Mar 2008 13:51:14 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: skoranda__AT__gravity.phys.uwm.edu
Subject: Re: [condor-admin #17529] LIGO: condor_master unexpectedly  
 restarts other condor daemons on
X-Seen-BY: mailfromd 4.1 silica.cs.wisc.edu

On Mon, Mar 03, 2008 at 03:34:52PM -0600, condor-admin response tracking system wrote:
> >>>
> What do you think about having condor_master pass state information to
> its reincarnation (via the exec() call for example) so it restarts
> in the same state as it was before restarting itself?
> <<<
> 
> It's certainly a viable option; is there any reason you could not use per
> machine local configurations? (I'm just trying to understand the situation.)
> 

I was working on a problem on an execute machine, so I had run
"condor_off -startd XYZ", and then along came the 7.0.1 release
so I upgraded the entire pool without taking specific action for
the problem machine and then noticed that the problem execute machine
was running startd again.

I view this as just a minor convenience factor, i.e., if I had known
about and remembered this feature, I could have edited the condor_config.local
file on the problematic execute machine or not run "rpm -U" until the machine
was fixed. However, it would be even nicer to not have to remember any of
this by having condor_master state preserved.

However, this is definitely a very low priority task, so if it is not simple
to implement and not obvious to the Condor team that it is worth doing more
generally, we can just leave this in the wouldn't-it-be-nice-someday category.

Thanks.


-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Mon Mar  3 15:51:33 2008 (1204581093)
Date: Mon, 03 Mar 2008 16:30:48 -0600
From: Ben Burnett <burnett__AT__cs.wisc.edu>
Subject: RE: [condor-admin #17529] LIGO: condor_master unexpectedly
 restarts other condor daemons on
To: condor-admin__AT__cs.wisc.edu
Content-Language: en-us
Thread-Index: Ach9eL/zBJhDvS9yQiaQnVXW0qKAfwABR7qg
X-Seen-BY: mailfromd 4.1 granite.cs.wisc.edu

Fair enough, I can see how that would be a little irritating/frustrating.  I'll
bring it up and see what the others have to say.  As you say, it's not a major
change if we restrict it to the exec() level.

-B


===========================================================================
Date mail was appended: Mon Mar  3 16:31:28 2008 (1204583488)
Subject: Comments added

Both Condor and LIGO agree that this is a feature request, and very low priority.

Comments added by tannenba

===========================================================================
Date comments were added: Fri Jul 18 13:46:11 2008 (1216406771)