LIGO Support Ticket 17168

Ticket Information
  Number:      admin 17168
  User:        anderson@ligo.caltech.edu
  Email:       skoranda__AT__gravity.phys.uwm.edu,cannon_k__AT__ligo.caltech.edu
  Status:      resolved
  Assigned To: tannenba
Date: Fri, 2 Nov 2007 16:28:33 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
CC: Scott Koranda <skoranda__AT__gravity.phys.uwm.edu>,
    Kipp Cannon <cannon_k__AT__ligo.caltech.edu>
Subject: LIGO: Shadow failures to connect to schedd
X-Seen-BY: mailfromd 4.1 granite.cs.wisc.edu

The LIGO Caltech Condor pool running,

# condor_version
$CondorVersion: 6.9.4 Aug 30 2007 $
$CondorPlatform: X86_64-LINUX_RHEL3 $

is experiencing jobs being re-run due to failures of the Shadow to
communicate with the Schedd after the Starter reports the jobs have
completed. This is similar to ticket 15711. However, most of the
current failures are not associated with the error message
"Forcing job requeue!" in the ShadowLog. Instead, a typical failure
looks like this:


ShadowLog
---------
11/2 11:31:43 (?.?) (29248):******* Standard Shadow starting up *******
11/2 11:31:43 (?.?) (29248):** $CondorVersion: 6.9.4 Aug 30 2007 $
11/2 11:31:43 (?.?) (29248):** $CondorPlatform: X86_64-LINUX_RHEL3 $
11/2 11:31:43 (?.?) (29248):*******************************************
11/2 11:31:43 (?.?) (29248):uid=0, euid=4135, gid=0, egid=4135
11/2 11:31:43 (?.?) (29248):Hostname = "<10.14.2.78:33288>", Job = 20269592.0
11/2 11:31:43 (20269592.0) (29248):Requesting Primary Starter

11/2 11:31:43 (20269592.0) (29248):Shadow: Request to run a job was ACCEPTED
11/2 11:31:43 (20269592.0) (29248):Shadow: RSC_SOCK connected, fd = 17
11/2 11:31:43 (20269592.0) (29248):Shadow: CLIENT_LOG connected, fd = 18
11/2 11:31:43 (20269592.0) (29248):My_Filesystem_Domain = "ligo"
11/2 11:31:43 (20269592.0) (29248):My_UID_Domain = "ligo"
11/2 11:31:43 (20269592.0) (29248):     Entering pseudo_get_file_stream
11/2 11:31:43 (20269592.0) (29248):     file = "/usr1/condor/spool/cluster20269592.ickpt.subproc0"

11/2 11:31:43 (20269592.0) (29248):Reaped child status - pid 29249 exited with status 0

11/2 11:31:43 (20269592.0) (29248):Read: User Job - $CondorPlatform: X86_64-LINUX_RHEL3 $
11/2 11:31:43 (20269592.0) (29248):Read: User Job - $CondorVersion: 6.9.4 Oct 16 2007 10/16/07 stduniv binary write patch $
11/2 11:31:43 (20269592.0) (29248):Read: Checkpoint file name is "/usr1/condor/spool/cluster20269592.proc0.subproc0"

11/2 12:10:54 (20269592.0) (29248):Shadow: Job 20269592.0 exited, termsig = 0, coredump = 0, retcode = 0
11/2 12:10:54 (20269592.0) (29248):Shadow: Job exited normally with status 0
11/2 12:10:54 (20269592.0) (29248):user_time = 0 ticks
11/2 12:10:54 (20269592.0) (29248):sys_time = 3 ticks

11/2 12:14:03 (20269592.0) (29248):attempt to connect to <10.14.0.12:38431> failed: Connection timed out (connect errno = 110).  Will keep trying for 300 total seconds (111 to go).

11/2 12:15:54 (20269592.0) (29248):attempt to connect to <10.14.0.12:38431> failed: Connection timed out (connect errno = 110).
11/2 12:15:54 (20269592.0) (29248):Can't connect to queue manager: CEDAR:6001:Failed to connect to <10.14.0.12:38431>
11/2 12:15:54 (20269592.0) (29248):ERROR "Failed to connect to schedd!" at line 1012 in file shadow.C
11/2 12:15:55 (20269592.0) (29248):Shadow: DoCleanup: unlinking TmpCkpt '/usr1/condor/spool/cluster20269592.proc0.subproc0.tmp'
11/2 12:15:55 (20269592.0) (29248):Trying to unlink /usr1/condor/spool/cluster20269592.proc0.subproc0.tmp


and then presumably exiting with no further log message. The Schedd then logs

ScheddLog
---------
11/2 12:15:55 (pid:25163) Shadow pid 29248 for job 20269592.0 exited with status 4

and then it re-runs that job.


Much less frequently, we also get the following.

11/2 14:03:57 (20275539.0) (19606): Maximum number of job cleanup retry attempts (SHADOW_MAX_JOB_CLEANUP_RETRIES=5) reached; Forcing job requeue!


Is there anything that can be done to make this final Shadow to Schedd
communication more robust when the submit machine is busy?

In particular, where does the 300 sec come from, and why are there only
2 "failed to connect" messages? From the following 6.9 manual entries
I was expecting to see 5 connection failures separated by 30 sec each:


SHADOW_JOB_CLEANUP_RETRY_DELAY
    This is an integer specifying the number of seconds to wait between tries to commit the final update to the job ClassAd in the condor_schedd's job queue. The default is 30.

SHADOW_MAX_JOB_CLEANUP_RETRIES
    This is an integer specifying the number of times to try committing the final update to the job ClassAd in the condor_schedd's job queue. The default is 5.
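
Read literally, the two manual entries above describe a bounded retry
loop. A minimal Python sketch of that reading follows. The function name
commit_with_retries and the injected commit and sleep callables are
hypothetical, and whether Condor pauses after the final failed try is an
assumption (here it does not).

```python
import time


def commit_with_retries(commit, max_retries=5, retry_delay=30, sleep=time.sleep):
    """Call commit() up to max_retries times, pausing retry_delay seconds
    between tries. Returns the number of tries used on success, or None if
    every try failed (the caller would then requeue the job).

    The defaults mirror SHADOW_MAX_JOB_CLEANUP_RETRIES=5 and
    SHADOW_JOB_CLEANUP_RETRY_DELAY=30 from the manual text. The loop itself
    is an illustrative assumption, not Condor source code.
    """
    for attempt in range(1, max_retries + 1):
        if commit():
            return attempt
        # Assumption: no pause after the last failed try.
        if attempt < max_retries:
            sleep(retry_delay)
    return None
```

Under this reading a persistent failure would produce five failed tries
with four 30-second pauses between them, which is the pattern expected
here.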



In the immediate case the submit machine is busy running lots of Shadow
processes that are stat()'ing tens of thousands of files each due to
ticket 15795, the fix for which will hopefully be released in 6.9.5 in
a few weeks. Nonetheless, I expect there will always be yet another reason
why the submit machine can get overloaded so hardening the Shadow-Schedd
communication is probably of more general use.

Thanks.


-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date of creation: Fri Nov  2 18:28:50 2007 (1194046133)
Subject: Actions

Assigned to wenger by wenger
===========================================================================
Date of actions: Tue Nov  6 16:14:06 2007 (1194387246)
Date: Tue, 6 Nov 2007 16:24:41 -0600 (CST)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: wenger <condor-admin__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-admin #17168] LIGO: Shadow failures to connect to schedd

Stuart,

Just a quick update on this -- I'm trying to get ahold of Greg Quinn to 
take a look at this ticket, since he worked on 15711, and he knows more
about the shadow and schedd than I do.

At any rate, I just wanted to let you know that we're not just ignoring 
this.  I think you just missed last week's RUST crew, and yesterday was 
pretty crazy for me.

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Tue Nov  6 16:24:50 2007 (1194387890)
Subject: Actions

Assigned to gquinn by wenger
===========================================================================
Date of actions: Tue Nov  6 16:55:07 2007 (1194389708)
Date: Wed, 07 Nov 2007 10:12:08 -0600
From: Greg Quinn <gquinn__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #17168] LIGO: Shadow failures to connect to schedd

> The LIGO Caltech Condor pool running,
>
> # condor_version
> $CondorVersion: 6.9.4 Aug 30 2007 $
> $CondorPlatform: X86_64-LINUX_RHEL3 $
>
> is experiencing jobs being re-run due to failures of the Shadow to
> communicate with the Schedd after the Starter reports the jobs have
> completed. This is similar to ticket 15711. However, most of the
> current failures are not associated with the error message, "Forcing
> job requeue!", in the ShadowLog, instead

Hi Stuart,

It certainly appears that you are running into the same problem as
experienced in ticket 15711. The fix that was included in Condor 6.9.4
to address that issue was only implemented for the "new" Shadow, which
unfortunately doesn't handle standard universe jobs. So the messages:

> 11/2 12:14:03 (20269592.0) (29248):attempt to connect to <10.14.0.12:38431> failed: Connection timed out (connect errno = 110).  Will keep trying for 300 total seconds (111 to go).
> 11/2 12:15:54 (20269592.0) (29248):attempt to connect to <10.14.0.12:38431> failed: Connection timed out (connect errno = 110).
> 11/2 12:15:54 (20269592.0) (29248):Can't connect to queue manager: CEDAR:6001:Failed to connect to <10.14.0.12:38431>
> 11/2 12:15:54 (20269592.0) (29248):ERROR "Failed to connect to schedd!" at line 1012 in file shadow.C

are a result of the old (standard universe) Shadow failing to send an 
update to the SchedD. These messages:

> 11/2 14:03:57 (20275539.0) (19606): Maximum number of job cleanup retry attempts (SHADOW_MAX_JOB_CLEANUP_RETRIES=5) reached; Forcing job requeue!

are a result of the analogous situation for jobs that use the new Shadow 
(vanilla universe jobs) with the retry logic.

How often are you seeing these problems? In the earlier ticket, you had 
mentioned something on the order of 300 times an hour. Have things 
improved since then?
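
One way to put a number on the failure rate would be to count the fatal
messages per hour in the ShadowLog. The following Python sketch assumes
the log line layout shown in the excerpts in this ticket and matches only
the two messages quoted above. The function name count_failures_by_hour
is illustrative.

```python
import re
from collections import Counter

# Leading "month/day hour:minute:second " stamp of a ShadowLog line,
# as seen in the excerpts in this ticket.
STAMP = re.compile(r"^(\d{1,2}/\d{1,2}) (\d{2}):\d{2}:\d{2} ")

# The two fatal messages quoted in this ticket.
PATTERNS = ('ERROR "Failed to connect to schedd!"', "Forcing job requeue!")


def count_failures_by_hour(lines):
    """Return a Counter mapping (date, hour) to the number of matching lines."""
    counts = Counter()
    for line in lines:
        match = STAMP.match(line)
        if match and any(pattern in line for pattern in PATTERNS):
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts
```

It accepts any iterable of lines, so an open ShadowLog file can be passed
in directly. The log's location is site-specific.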

> Is there anything that can be done to make this final Shadow to
> Schedd communication more robust when the submit machine is busy?

One idea would be to implement the same retry logic in the old Shadow as 
we already did in the new. I'm not sure if this is on our radar. In 
addition, as your logs show, this doesn't completely eliminate the 
problem since it still happens for vanilla jobs.

> In particular, where does the 300sec come from and why are there
> only 2 "failed to connect" messages.

The 300-second limit comes from retry logic that is internal to Condor's
communications layer. That level of retry has existed for a very long time.
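
The timestamps in the ShadowLog excerpt are consistent with a single
300-second deadline that started when the job exited, with the first
blocking connect attempt taking about 189 seconds to time out and the
second being cut off at the deadline. That would account for only two
"failed to connect" messages. The start of the deadline is inferred from
the log, not confirmed from the Condor source. The arithmetic can be
checked with a short Python sketch:

```python
from datetime import datetime

FMT = "%H:%M:%S"

# Timestamps copied from the ShadowLog excerpt for job 20269592.0.
job_exit = datetime.strptime("12:10:54", FMT)       # "Job exited normally"
first_failure = datetime.strptime("12:14:03", FMT)  # "(111 to go)"
final_failure = datetime.strptime("12:15:54", FMT)  # "Failed to connect to schedd!"

TOTAL_DEADLINE = 300  # "Will keep trying for 300 total seconds"

# Seconds from job exit to the first logged connect failure.
elapsed_first = int((first_failure - job_exit).total_seconds())
# What would be left of the deadline if it started at job exit.
remaining_after_first = TOTAL_DEADLINE - elapsed_first
# Seconds from job exit to the final failure.
elapsed_final = int((final_failure - job_exit).total_seconds())

print(elapsed_first, remaining_after_first, elapsed_final)
```

The computed remainder matches the "111 to go" in the log, and the final
failure lands exactly 300 seconds after the job exit.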

> In the immediate case the submit machine is busy running lots of
> Shadow processes that are stat()'ing tens of thousands of files each
> due to ticket 15795, the fix for which will hopefully be released in
> 6.9.5 in a few weeks. Nonetheless, I expect there will always be yet
> another reason why the submit machine can get overloaded so hardening
> the Shadow-Schedd communication is probably of more general use.

The only other suggestion I have at this point (if you haven't tried it 
already) is to try setting SHADOW_RENICE_INCREMENT to 0 (it defaults to 
10). This causes Shadows to run at the same nice level as the SchedD. 
Dan Bradley has had success using this to improve submit-side 
scalability. In fact, this will be the default setting for the upcoming 
6.9.5.

Greg Quinn
Condor Team

===========================================================================
Date mail was appended: Wed Nov  7 11:16:11 2007 (1194455772)
Date: Wed, 7 Nov 2007 12:50:40 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: skoranda__AT__gravity.phys.uwm.edu, cannon_k__AT__ligo.caltech.edu
Subject: Re: [condor-admin #17168] LIGO: Shadow failures to connect to schedd
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu

On Wed, Nov 07, 2007 at 11:16:11AM -0600, condor-admin response tracking system wrote:
> 
> How often are you seeing these problems? In the earlier ticket, you had 
> mentioned something on the order of 300 times an hour. Have things 
> improved since then?

It strongly depends on what the current mix of jobs being run is.

> 
> > Is there anything that can be done to make this final Shadow to
> > Schedd communication more robust when the submit machine is busy?
> 
> One idea would be to implement the same retry logic in the old Shadow as 
> we already did in the new. I'm not sure if this is on our radar. In 
> addition, as your logs show, this doesn't completely eliminate the 
> problem since it still happens for vanilla jobs.

Please consider making this communication more robust for Standard Universe
jobs as well.

> 
> The only other suggestion I have at this point (if you haven't tried it 
> already) is to try setting SHADOW_RENICE_INCREMENT to 0 (it defaults to 
> 10). This causes Shadows to run at the same nice level as the SchedD. 
> Dan Bradley has had success using this to improve submit-side 
> scalability. In fact, this will be the default setting for the upcoming 
> 6.9.5.

I have just made this change.

Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Wed Nov  7 14:50:57 2007 (1194468658)
Date: Thu, 08 Nov 2007 11:36:15 -0600
From: Greg Quinn <gquinn__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #17168] LIGO: Shadow failures to connect to schedd

>>> Is there anything that can be done to make this final Shadow to
>>> Schedd communication more robust when the submit machine is busy?
>> One idea would be to implement the same retry logic in the old Shadow as 
>> we already did in the new. I'm not sure if this is on our radar. In 
>> addition, as your logs show, this doesn't completely eliminate the 
>> problem since it still happens for vanilla jobs.
> 
> Please consider making this communication more robust for Standard Universe
> jobs as well.

OK, I'll bring this up at our next developer meeting.

>> The only other suggestion I have at this point (if you haven't tried it 
>> already) is to try setting SHADOW_RENICE_INCREMENT to 0 (it defaults to 
>> 10). This causes Shadows to run at the same nice level as the SchedD. 
>> Dan Bradley has had success using this to improve submit-side 
>> scalability. In fact, this will be the default setting for the upcoming 
>> 6.9.5.
> 
> I have just made this change.

Cool. Let me know if it helps.

Greg Quinn
Condor Team

===========================================================================
Date mail was appended: Thu Nov  8 11:36:20 2007 (1194543380)
Date: Fri, 9 Nov 2007 14:09:49 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: skoranda__AT__gravity.phys.uwm.edu, cannon_k__AT__ligo.caltech.edu
Subject: Re: [condor-admin #17168] LIGO: Shadow failures to connect to schedd
X-Seen-BY: mailfromd 4.1 silica.cs.wisc.edu

On Thu, Nov 08, 2007 at 11:36:20AM -0600, condor-admin response tracking system wrote:
> >>> Is there anything that can be done to make this final Shadow to
> >>> Schedd communication more robust when the submit machine is busy?
> >> One idea would be to implement the same retry logic in the old Shadow as 
> >> we already did in the new. I'm not sure if this is on our radar. In 
> >> addition, as your logs show, this doesn't completely eliminate the 
> >> problem since it still happens for vanilla jobs.
> > 
> > Please consider making this communication more robust for Standard Universe
> > jobs as well.
> 
> OK, I'll bring this up at our next developer meeting.

Thanks, I think this would be very helpful.

> 
> >> The only other suggestion I have at this point (if you haven't tried it 
> >> already) is to try setting SHADOW_RENICE_INCREMENT to 0 (it defaults to 
> >> 10). This causes Shadows to run at the same nice level as the SchedD. 
> >> Dan Bradley has had success using this to improve submit-side 
> >> scalability. In fact, this will be the default setting for the upcoming 
> >> 6.9.5.
> > 
> > I have just made this change.
> 
> Cool. Let me know if it helps.

I am not certain yet, but it may have hurt quite a bit. Last night the
schedd was killed by condor_master for being unresponsive 7 times.
I have not been able to determine if this was due to changing the
SHADOW_RENICE_INCREMENT from 10 to 0 or not.

What is the current status of the 6.9.5 release? I am eager to get the
condor_shadow don't-build-a-file-catalog fix as soon as possible to reduce
the load on the submit machine. If the 6.9.5 release has been delayed, is it
possible to get a condor_shadow pre-release to fix this for the short term,
or are there other substantive changes to the shadow that would make this
ill-advised?

Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Fri Nov  9 16:10:16 2007 (1194646218)
Subject: Re: [condor-admin #17168] LIGO: Shadow failures to connect to schedd
From: Greg Quinn <gquinn__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Date: Mon, 12 Nov 2007 15:04:57 -0600

Hello,

> > >> The only other suggestion I have at this point (if you haven't tried it 
> > >> already) is to try setting SHADOW_RENICE_INCREMENT to 0 (it defaults to 
> > >> 10). This causes Shadows to run at the same nice level as the SchedD. 
> > >> Dan Bradley has had success using this to improve submit-side 
> > >> scalability. In fact, this will be the default setting for the upcoming 
> > >> 6.9.5.
> > > 
> > > I have just made this change.
> > 
> > Cool. Let me know if it helps.
> 
> I am not certain yet, but it may have hurt quite a bit. Last night the
> schedd was killed by condor_master for being unresponsive 7 times.
> I have not been able to determine if this was due to changing the
> SHADOW_RENICE_INCREMENT from 10 to 0 or not.

Too bad :(

> What is the current status of the 6.9.5 release? I am eager to get the
> condor_shadow don't-build-a-file-catalog fix as soon as possible to reduce
> the load on the submit machine. If the 6.9.5 release has been delayed, is it
> possible to get a condor_shadow pre-release to fix this for the short term,
> or are there other substantive changes to the shadow that would make this
> ill-advised?

The 6.9.5 release is currently waiting for two last features to be
checked - I am unsure of their timelines. I can't see the harm in
running a pre-release version of the vanilla Shadow. I have made such a
binary (X86_64/RHEL3) available at:

http://ftp.cs.wisc.edu/condor/temporary/forligo/x86_64_rhel3/condor_shadow

Greg Quinn
Condor Team


===========================================================================
Date mail was appended: Mon Nov 12 15:08:27 2007 (1194901707)
Date: Mon, 12 Nov 2007 18:23:16 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: skoranda__AT__gravity.phys.uwm.edu, cannon_k__AT__ligo.caltech.edu
Subject: Re: [condor-admin #17168] LIGO: Shadow failures to connect to schedd
X-Seen-BY: mailfromd 4.1 silica.cs.wisc.edu

On Mon, Nov 12, 2007 at 03:08:27PM -0600, condor-admin response tracking system wrote:
> 
> The 6.9.5 release is currently waiting for two last features to be
> checked - I am unsure of their timelines. I can't see the harm in
> running a pre-release version of the vanilla Shadow. I have made such a
> binary (X86_64/RHEL3) available at:
> 
> http://ftp.cs.wisc.edu/condor/temporary/forligo/x86_64_rhel3/condor_shadow
> 

To support standard universe jobs do I also need a condor_shadow.std
pre-release binary?

Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Mon Nov 12 20:23:33 2007 (1194920613)
Subject: Re: [condor-admin #17168] LIGO: Shadow failures to connect to schedd
From: Greg Quinn <gquinn__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Date: Tue, 13 Nov 2007 09:22:45 -0600

> To support standard universe jobs do I also need a condor_shadow.std
> pre-release binary?

The file catalog code is specific to file transfer, which is not used in
standard universe. So just replacing your vanilla shadow with the one
provided should eliminate all file catalog overhead on your system.

Greg


===========================================================================
Date mail was appended: Tue Nov 13  9:26:26 2007 (1194967587)
Date: Tue, 13 Nov 2007 15:41:52 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: skoranda__AT__gravity.phys.uwm.edu, cannon_k__AT__ligo.caltech.edu
Subject: Re: [condor-admin #17168] LIGO: Shadow failures to connect to schedd
X-Seen-BY: mailfromd 4.1 obsidian.cs.wisc.edu

On Mon, Nov 12, 2007 at 03:08:27PM -0600, condor-admin response tracking system wrote:
> 
> The 6.9.5 release is currently waiting for two last features to be
> checked - I am unsure of their timelines. I can't see the harm in
> running a pre-release version of the vanilla Shadow. I have made such a
> binary (X86_64/RHEL3) available at:
> 
> http://ftp.cs.wisc.edu/condor/temporary/forligo/x86_64_rhel3/condor_shadow

I have confirmed that this pre-release condor_shadow does not stat()
all the files in the IWD when starting Vanilla universe jobs.

Many thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Tue Nov 13 17:42:15 2007 (1194997336)
Date: Thu, 20 Dec 2007 20:16:49 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
CC: skoranda__AT__gravity.phys.uwm.edu, cannon_k__AT__ligo.caltech.edu
Subject: Re: [condor-admin #17168] LIGO: Shadow failures to connect to schedd
X-Seen-BY: mailfromd 4.1 obsidian.cs.wisc.edu

Greg,
	The pre-release shadow worked well, and we have since upgraded
to the full 6.9.5 release, so the side issue of excessive stat() calls
on the submit machine has been resolved.

	However, the remaining issue in this ticket is regarding
robust shadow to schedd communication for standard universe jobs.
What is the current thinking regarding porting the new shadow changes
to the standard universe?

Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Thu Dec 20 22:17:06 2007 (1198210627)
Subject: Re: [condor-admin #17168] LIGO: Shadow failures to connect to schedd
From: Greg Quinn <gquinn__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Date: Fri, 21 Dec 2007 08:58:46 -0600

Hi Stuart,

Good to hear 6.9.5 has resolved most of this issue.

As far as implementing retries in the old Shadow, we did not have time to
get this in for the upcoming 7.0.0 release. What most of us around here
would like to see in the next development series is the merging of the
Shadows and Starters into a single pair. This would give standard
universe jobs the retry logic for "free". Of course, this effort is much
larger than just focusing on implementing the retries (but the benefit
would be big, especially in terms of development costs for future
improvements). If the Shadow and Starter unification projects do not
make the roadmap for the 7.1 series, we should be able to implement the
retries instead.

Greg



===========================================================================
Date mail was appended: Fri Dec 21  8:58:56 2007 (1198249137)
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #17168] LIGO: Shadow failures to connect to schedd
Date: Thu, 29 Jan 2009 20:42:35 -0800
X-Seen-BY: mailfromd 4.1 obsidian.cs.wisc.edu

This ticket should be closed.

A quick audit of a 7.2.0 system with a busy Schedd is not showing this  
problem.

Thanks.




--
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson




===========================================================================
Date mail was appended: Thu Jan 29 22:42:47 2009 (1233290568)
Date: Fri, 30 Jan 2009 08:24:31 -0600
From: Greg Quinn <gquinn__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #17168] LIGO: Shadow failures to connect to schedd

Hello,

condor-admin response tracking system wrote:
> This ticket should be closed.
> 
> A quick audit of a 7.2.0 system with a busy Schedd is not showing this  
> problem.

Good to hear and thank you for the update. I'll close the ticket.

Greg


===========================================================================
Date mail was appended: Fri Jan 30  8:23:58 2009 (1233325439)
Subject: Actions

Ticket resolved by gquinn
===========================================================================
Date of actions: Fri Jan 30  8:24:22 2009 (1233325462)
Subject: Actions

Assigned to tannenba by gquinn
===========================================================================
Date of actions: Thu Mar  5  9:23:24 2009 (1236266604)