LIGO Support Ticket 17529
Ticket Information
Number: admin 17529
User: anderson@ligo.caltech.edu
Email: skoranda__AT__gravity.phys.uwm.edu
Status: open
Assigned To: burnett
Date: Sat, 1 Mar 2008 13:03:50 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
CC: Scott Koranda <skoranda__AT__gravity.phys.uwm.edu>
Subject: LIGO: condor_master unexpectedly restarts other condor daemons on
self restart
Running,
# condor_version
$CondorVersion: 7.0.1 Feb 26 2008 BuildID: 76180 $
$CondorPlatform: X86_64-LINUX_RHEL3 $
on,
# cat /etc/redhat-release
Fedora Core release 4 (Stentz)
it was observed that when condor_master restarts itself after automatically
detecting a new condor_master binary image on disk, it restarts all of the
daemons in DAEMON_LIST, even if they were not all running before the
exec() call.
This was observed when upgrading (rpm -U) the condor binaries on an
execute machine that had the startd deliberately turned off, i.e.,
after condor_master restarted itself it proceeded to start the Startd.
This is not a critical problem, but if it is easy to "fix", I would find
it more intuitive if condor_master only restarted those condor daemons it
was actively managing when it restarts itself.
Thanks.
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date of creation: Sat Mar 1 15:04:11 2008 (1204405457)
Subject: Actions
Assigned to burnett by burnett
===========================================================================
Date of actions: Mon Mar 3 14:30:07 2008 (1204576207)
Date: Mon, 03 Mar 2008 14:35:33 -0600
From: Ben Burnett <burnett__AT__cs.wisc.edu>
Subject: RE: [condor-admin #17529] LIGO: condor_master unexpectedly
restarts other condor daemons on
To: condor-admin__AT__cs.wisc.edu
Hi Stuart:
The reason for this is that the master retains no persistent state, so if a
person were to call condor_off for a particular subsystem, it is only that
instance of the master that will be mindful of that operation.
The preferred method would be to configure DAEMON_LIST on the machines in
question to start only the daemons you want.
Sorry I can't offer a simpler solution.
Regards,
-B
===========================================================================
Date mail was appended: Mon Mar 3 14:36:15 2008 (1204576575)
Date: Mon, 3 Mar 2008 13:26:25 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: skoranda__AT__gravity.phys.uwm.edu
Subject: Re: [condor-admin #17529] LIGO: condor_master unexpectedly
restarts other condor daemons on
On Mon, Mar 03, 2008 at 02:36:15PM -0600, condor-admin response tracking system wrote:
> Hi Stuart:
>
> The reason for this is that the master retains no persistent state, so if a
> person were to call condor_off for a particular subsystem, it is only that
> instance of the master that will be mindful of that operation.
>
> The preferred method would be to configure DAEMON_LIST on the machines in
> question to start only the daemons you want.
>
> Sorry I can't offer a simpler solution.
>
Understood.
What do you think about having condor_master pass state information to
its reincarnation (via the exec() call for example) so it restarts
in the same state as it was before restarting itself?
Thanks.
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
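
A minimal sketch of the state-passing idea raised above, written in Python
purely as an illustration and not taken from Condor's source. It assumes the
restarting process hands the set of daemons it was actively managing to its
successor through an environment variable across the exec() call. The
variable name MASTER_ACTIVE_DAEMONS and all three function names are
hypothetical; nothing here corresponds to an existing Condor setting or API.

```python
import os
import sys

# Hypothetical variable name used only for this sketch; not a Condor setting.
STATE_ENV_VAR = "MASTER_ACTIVE_DAEMONS"


def encode_state(active_daemons):
    """Serialize the daemons currently being managed into one string."""
    return ",".join(sorted(active_daemons))


def daemons_to_start(daemon_list, environ):
    """Pick the daemons a freshly exec'ed master should start.

    With no inherited state (a cold start), fall back to the full
    configured list. With inherited state, start only the daemons that
    were active before the restart and are still configured.
    """
    inherited = environ.get(STATE_ENV_VAR)
    if inherited is None:
        return list(daemon_list)
    active = {name for name in inherited.split(",") if name}
    return [name for name in daemon_list if name in active]


def restart_self(active_daemons):
    """Re-exec the current program, passing the state in the environment."""
    env = dict(os.environ)
    env[STATE_ENV_VAR] = encode_state(active_daemons)
    os.execve(sys.executable, [sys.executable] + sys.argv, env)
```

Under these assumptions, a master that had its startd turned off would pass
only the remaining daemons along, and its successor would leave the startd
off, while a cold start with no inherited variable would still start
everything in the configured list.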
===========================================================================
Date mail was appended: Mon Mar 3 15:26:41 2008 (1204579602)
Date: Mon, 03 Mar 2008 15:33:48 -0600
From: Ben Burnett <burnett__AT__cs.wisc.edu>
Subject: RE: [condor-admin #17529] LIGO: condor_master unexpectedly
restarts other condor daemons on
To: condor-admin__AT__cs.wisc.edu
>>>
What do you think about having condor_master pass state information to
its reincarnation (via the exec() call for example) so it restarts
in the same state as it was before restarting itself?
<<<
It's certainly a viable option; is there any reason you could not use per
machine local configurations? (I'm just trying to understand the situation.)
-B
===========================================================================
Date mail was appended: Mon Mar 3 15:34:52 2008 (1204580092)
Date: Mon, 3 Mar 2008 13:51:14 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: skoranda__AT__gravity.phys.uwm.edu
Subject: Re: [condor-admin #17529] LIGO: condor_master unexpectedly
restarts other condor daemons on
On Mon, Mar 03, 2008 at 03:34:52PM -0600, condor-admin response tracking system wrote:
> >>>
> What do you think about having condor_master pass state information to
> its reincarnation (via the exec() call for example) so it restarts
> in the same state as it was before restarting itself?
> <<<
>
> It's certainly a viable option; is there any reason you could not use per
> machine local configurations? (I'm just trying to understand the situation.)
>
I was working on a problem on an execute machine, so I had run
"condor_off -startd XYZ", and then along came the 7.0.1 release
so I upgraded the entire pool without taking specific action for
the problem machine and then noticed that the problem execute machine
was running startd again.
I view this as just a minor convenience factor, i.e., if I had known
about and remembered this feature I could have edited the condor_config.local
file on the problematic execute machine or not run "rpm -U" until the machine
was fixed. However, it would be even nicer to not have to remember any of
this by having condor_master state preserved.
However, this is definitely a very low priority task, so if it is not simple
to implement and not obvious to the Condor team that it is worth doing more
generally, we can just leave this in the wouldn't-it-be-nice-someday category.
Thanks.
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date mail was appended: Mon Mar 3 15:51:33 2008 (1204581093)
Date: Mon, 03 Mar 2008 16:30:48 -0600
From: Ben Burnett <burnett__AT__cs.wisc.edu>
Subject: RE: [condor-admin #17529] LIGO: condor_master unexpectedly
restarts other condor daemons on
To: condor-admin__AT__cs.wisc.edu
Fair enough, I can see how that would be a little irritating/frustrating. I'll
bring it up and see what the others have to say. As you say, it's not a major
change if we restrict it to the exec() level.
-B
===========================================================================
Date mail was appended: Mon Mar 3 16:31:28 2008 (1204583488)
Subject: Comments added
Both Condor and LIGO agree that this is a feature request, and very low priority.
Comments added by tannenba
===========================================================================
Date comments were added: Fri Jul 18 13:46:11 2008 (1216406771)