LIGO Support Ticket 1678

Ticket Information
  Number:      support 1678
  User:        anderson@ligo.caltech.edu
  Email:       espinoza_e__AT__ligo.caltech.edu,patrick__AT__gravity.phys.uwm.edu,lazz__AT__ligo.caltech.edu,espinoza__AT__ligo.caltech.edu,duncan__AT__gravity.phys.uwm.edu
  Status:      resolved
  Assigned To: gthain
Date: Sat, 9 Sep 2006 15:12:42 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu, tannenba__AT__cs.wisc.edu
Subject: LIGO condor-6.8.1 schedd exit status 4

A pre-release of condor-6.8.1 recently provided by Alan for running on the
LIGO CIT cluster just had the schedd exit with status 4 and the following
message.  Note this occurred 6 minutes after the segfault reported in
[condor-support #1677] and is similar to the spawnJobs errors reported
in [condor-support #1652], which also occurred ~6 minutes after a schedd
crash.

Note there does not appear to have been a network failure at this time,
but the load on the head-node was at least 50 (normally just a few) during
the schedd restart and may be the cause of the failure-to-reconnect messages.

It appears that restarting the schedd with a busy queue (~1000 jobs) that
has a mix of Standard/Vanilla and parallel (Dedicated Scheduler) jobs in it
is not safe.



"/ldcg/condor/sbin/condor_schedd" on "ldas-grid.ligo.caltech.edu" exited with
status 4.
Condor will automatically restart this process in 10 seconds.   
 
*** Last 20 line(s) of file SchedLog:
9/9 13:26:33 (pid:20456) Dedicated Scheduler:: couldn't find machine
vm1__AT__node246.ldas-cit.ligo.caltech.edu to reconnect to
9/9 13:26:33 (pid:20456) Dedicated Scheduler:: couldn't find machine
vm4__AT__node165.ldas-cit.ligo.caltech.edu to reconnect to
9/9 13:26:33 (pid:20456) Dedicated Scheduler:: couldn't find machine
vm1__AT__node217.ldas-cit.ligo.caltech.edu to reconnect to
9/9 13:26:33 (pid:20456) Dedicated Scheduler:: couldn't find machine
vm2__AT__node209.ldas-cit.ligo.caltech.edu to reconnect to
9/9 13:26:33 (pid:20456) Dedicated Scheduler:: couldn't find machine
vm1__AT__node20.ldas-cit.ligo.caltech.edu to reconnect to
9/9 13:26:33 (pid:20456) Dedicated Scheduler:: couldn't find machine
vm3__AT__node253.ldas-cit.ligo.caltech.edu to reconnect to
9/9 13:26:33 (pid:20456) Dedicated Scheduler:: couldn't find machine
vm2__AT__node253.ldas-cit.ligo.caltech.edu to reconnect to
9/9 13:26:33 (pid:20456) Dedicated Scheduler:: couldn't find machine
vm2__AT__node258.ldas-cit.ligo.caltech.edu to reconnect to
9/9 13:26:33 (pid:20456) Dedicated Scheduler:: couldn't find machine
vm3__AT__node115.ldas-cit.ligo.caltech.edu to reconnect to
9/9 13:26:33 (pid:20456) Dedicated Scheduler:: couldn't find machine
vm2__AT__node119.ldas-cit.ligo.caltech.edu to reconnect to
9/9 13:26:33 (pid:20456) Dedicated Scheduler:: couldn't find machine
vm4__AT__node271.ldas-cit.ligo.caltech.edu to reconnect to
9/9 13:26:33 (pid:20456) Dedicated Scheduler:: couldn't find machine
vm2__AT__node189.ldas-cit.ligo.caltech.edu to reconnect to  
9/9 13:26:33 (pid:20456) Dedicated Scheduler:: couldn't find machine
vm1__AT__node178.ldas-cit.ligo.caltech.edu to reconnect to
9/9 13:26:33 (pid:20456) Dedicated Scheduler:: couldn't find machine
vm2__AT__node222.ldas-cit.ligo.caltech.edu to reconnect to
9/9 13:26:33 (pid:20456) Dedicated Scheduler:: couldn't find machine
vm1__AT__node56.ldas-cit.ligo.caltech.edu to reconnect to
9/9 13:26:33 (pid:20456) Dedicated Scheduler:: couldn't find machine
vm2__AT__node56.ldas-cit.ligo.caltech.edu to reconnect to
9/9 13:26:33 (pid:20456) Dedicated Scheduler:: couldn't find machine
vm4__AT__node164.ldas-cit.ligo.caltech.edu to reconnect to
9/9 13:26:33 (pid:20456) Dedicated Scheduler:: couldn't find machine
vm3__AT__node239.ldas-cit.ligo.caltech.edu to reconnect to
9/9 13:26:33 (pid:20456) Dedicated Scheduler:: couldn't find machine
vm4__AT__node159.ldas-cit.ligo.caltech.edu to reconnect to
9/9 13:26:33 (pid:20456) ERROR "spawnJobs(): allocation node has no matches!" at
line 2351 in file dedicated_scheduler.C
*** End of file SchedLog
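
An excerpt like the one above can be summarized by counting how many VM
slots per node the Dedicated Scheduler failed to find. The following is a
minimal Python sketch, not a Condor tool: the function name and the regular
expression are hypothetical, and it assumes the log is wrapped as shown
(machine name on the continuation line, `__AT__` standing in for `@`).

```python
import re
from collections import Counter

# Hypothetical pattern: matches the Dedicated Scheduler message once the
# wrapped lines have been joined with single spaces.
MISSING = re.compile(r"couldn't find machine\s+(\S+)\s+to reconnect to")


def missing_machines(log_text):
    """Count, per node, the VM slots the Dedicated Scheduler could not find."""
    joined = " ".join(line.strip() for line in log_text.splitlines())
    per_node = Counter()
    for slot in MISSING.findall(joined):
        # slot looks like vm1__AT__node246.ldas-cit.ligo.caltech.edu
        node = slot.split("__AT__", 1)[-1]
        per_node[node] += 1
    return per_node


sample = """\
9/9 13:26:33 (pid:20456) Dedicated Scheduler:: couldn't find machine
vm3__AT__node253.ldas-cit.ligo.caltech.edu to reconnect to
9/9 13:26:33 (pid:20456) Dedicated Scheduler:: couldn't find machine
vm2__AT__node253.ldas-cit.ligo.caltech.edu to reconnect to
9/9 13:26:33 (pid:20456) Dedicated Scheduler:: couldn't find machine
vm1__AT__node56.ldas-cit.ligo.caltech.edu to reconnect to
"""
counts = missing_machines(sample)
print(counts.most_common())
```

Nodes with several missing slots (node253 and node56 in the excerpt) are
worth checking first, since a whole machine may have been unreachable.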


-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date of creation: Sat Sep  9 17:18:03 2006 (1157840289)
Subject: Actions

Assigned to adesmet by adesmet
===========================================================================
Date of actions: Mon Sep 11 15:09:27 2006 (1158005367)
Subject: Actions

Assigned to gthain by adesmet
===========================================================================
Date of actions: Mon Sep 11 15:12:34 2006 (1158005554)
Date: Mon, 11 Sep 2006 16:57:34 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu, gthain__AT__cs.wisc.edu,         Erik Espinoza
 <espinoza_e__AT__ligo.caltech.edu>
Subject: Re: [condor-support #1678] LIGO condor-6.8.1 schedd exit status 4

We now have a second schedd running that is meant to submit only parallel
universe jobs, and both schedds are now running with D_FULLDEBUG, which is
generating ~1GB per hour of log files. One interesting thing I have
noticed is that even though ldas-grid is no longer running any
parallel universe jobs, the following log file section indicates that
it is trying to process one.

9/11 16:43:14 Saving classad to history file
9/11 16:43:14 SelfDrainingQueue job_is_finished_queue is empty, not resetting timer
9/11 16:43:14 Canceling timer for SelfDrainingQueue job_is_finished_queue (timer id: 298564)
9/11 16:43:14 In DedicatedScheduler::checkReconnectQueue
9/11 16:43:14 In checkReconnectQueue(), job: 191697400.159333944
9/11 16:43:14 Job 191697400.159333944 missing from queue?
9/11 16:43:14 Trying to query collector <10.14.0.12:9618>
9/11 16:43:15 DedicatedScheduler found machine vm1__AT__node1.ldas-cit.ligo.caltech.edu for possibly reconnection for job (191697400.159333944)
9/11 16:43:15 DedicatedScheduler found machine vm3__AT__node4.ldas-cit.ligo.caltech.edu for possibly reconnection for job (191697400.159333944)
9/11 16:43:15 DedicatedScheduler found machine vm1__AT__node5.ldas-cit.ligo.caltech.edu for possibly reconnection for job (191697400.159333944)

It is perhaps important to note that current valid job IDs are somewhere
around 7051962.0, not 191M and certainly not .159333944.
This is in the running schedd on ldas-grid. Is there anything useful
we can extract from the running ReconnectQueue to figure out what
it is confused about?
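
The check described above (job IDs far outside the range currently in use)
can be automated against a D_FULLDEBUG log. This is a minimal Python sketch
under stated assumptions: the function name and the bounds are hypothetical,
the pattern is taken from the log line quoted above, and the last sample
line is made up to show a plausible ID for comparison.

```python
import re

# Matches the checkReconnectQueue() line quoted above and captures the
# cluster and proc parts of the job ID.
JOB_LINE = re.compile(r"In checkReconnectQueue\(\), job: (\d+)\.(\d+)")


def implausible_jobs(log_lines, max_cluster, max_proc):
    """Return (cluster, proc) pairs whose IDs exceed the given bounds."""
    flagged = []
    for line in log_lines:
        match = JOB_LINE.search(line)
        if match is None:
            continue
        cluster, proc = int(match.group(1)), int(match.group(2))
        if cluster > max_cluster or proc > max_proc:
            flagged.append((cluster, proc))
    return flagged


lines = [
    "9/11 16:43:14 In DedicatedScheduler::checkReconnectQueue",
    "9/11 16:43:14 In checkReconnectQueue(), job: 191697400.159333944",
    # Made-up line with a plausible ID, for contrast only.
    "9/11 16:43:14 In checkReconnectQueue(), job: 7051962.0",
]
# Bounds are assumptions for this pool: valid clusters are near 7051962.
flagged = implausible_jobs(lines, max_cluster=10_000_000, max_proc=100_000)
print(flagged)
```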

Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Mon Sep 11 18:58:07 2006 (1158019088)
Date: Wed, 13 Sep 2006 10:13:51 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu, miron__AT__cs.wisc.edu, tannenba__AT__cs.wisc.edu,
 zmiller__AT__cs.wisc.edu, adesmet__AT__cs.wisc.edu, gthain__AT__cs.wisc.edu,
 wenger__AT__cs.wisc.edu
CC: Erik Espinoza <espinoza_e__AT__ligo.caltech.edu>,         Patrick Brady
 <patrick__AT__gravity.phys.uwm.edu>,         Albert Lazzarini
 <lazz__AT__ligo.caltech.edu>
Subject: Re: [condor-support #1678] LIGO condor-6.8.1 schedd exit status 4

It happened again this morning, i.e.,

9/13 08:03:50 Create_Thread: ERROR: we've had 101 consecutive pid collisions,
giving up! (6270 PIDs being tracked internally.)
9/13 08:03:50 ERROR "Assertion ERROR on (tid != 0)" at line 115 in file
datathread.C

Fortunately, I was able to get to the Schedd log files before they
rolled over.  Please see the readme file for further comments and my
initial observations on the crash relating to the dedicated scheduler
and duplicate rescue DAGs at
http://www.ligo.caltech.edu/~anderson/condor.1678/readme

The detailed log files may be found in the same directory.


Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Wed Sep 13 12:16:02 2006 (1158167763)
Date: Wed, 13 Sep 2006 12:48:40 -0500
To: Stuart Anderson <anderson__AT__ligo.caltech.edu>,
 condor-support__AT__cs.wisc.edu,         tannenba__AT__cs.wisc.edu,
 zmiller__AT__cs.wisc.edu, adesmet__AT__cs.wisc.edu,         gthain__AT__cs.wisc.edu,
 wenger__AT__cs.wisc.edu
From: Miron Livny <miron__AT__cs.wisc.edu>
Subject: Re: [condor-support #1678] LIGO condor-6.8.1 schedd exit   status 4
CC: Erik Espinoza <espinoza_e__AT__ligo.caltech.edu>,         Patrick Brady
 <patrick__AT__gravity.phys.uwm.edu>,         Albert Lazzarini
 <lazz__AT__ligo.caltech.edu>

Who is looking into this on the Condor side?  Miron

At 12:13 PM 9/13/2006, Stuart Anderson wrote:
>It happened again this morning, i.e.,
>
>9/13 08:03:50 Create_Thread: ERROR: we've had 101 consecutive pid collisions,
>giving up! (6270 PIDs being tracked internally.)
>9/13 08:03:50 ERROR "Assertion ERROR on (tid != 0)" at line 115 in file
>datathread.C
>
>Fortunately, I was able to get to the Schedd log files before they
>rolled over.  Please see the readme file for further comments and my
>initial observations on the crash relating to the dedicated scheduler
>and duplicate rescue DAGs at,
>http://www.ligo.caltech.edu/~anderson/condor.1678/readme
>
>The detailed log files may be found in the same directory.
>
>
>Thanks.
>
>--
>Stuart Anderson  anderson__AT__ligo.caltech.edu
>http://www.ligo.caltech.edu/~anderson


===========================================================================
Date mail was appended: Wed Sep 13 12:51:38 2006 (1158169898)
Date: Wed, 13 Sep 2006 13:00:34 -0500
From: Greg Thain <gthain__AT__cs.wisc.edu>
To: Miron Livny <miron__AT__cs.wisc.edu>
CC: Stuart Anderson <anderson__AT__ligo.caltech.edu>,
 condor-support__AT__cs.wisc.edu,         tannenba__AT__cs.wisc.edu,
 zmiller__AT__cs.wisc.edu, adesmet__AT__cs.wisc.edu,         wenger__AT__cs.wisc.edu,
 Erik Espinoza <espinoza_e__AT__ligo.caltech.edu>,         Patrick Brady
 <patrick__AT__gravity.phys.uwm.edu>,         Albert Lazzarini
 <lazz__AT__ligo.caltech.edu>
Subject: Re: [condor-support #1678] LIGO condor-6.8.1 schedd exit  status  4

Miron Livny wrote:
> Who is looking into this on the Condor side?  Miron

I am.

-Greg

===========================================================================
Date mail was appended: Wed Sep 13 13:00:58 2006 (1158170459)
Date: Wed, 13 Sep 2006 16:38:14 -0500
From: Greg Thain <gthain__AT__cs.wisc.edu>
To: Stuart Anderson <anderson__AT__ligo.caltech.edu>
CC: condor-support__AT__cs.wisc.edu, miron__AT__cs.wisc.edu, tannenba__AT__cs.wisc.edu,
 zmiller__AT__cs.wisc.edu, adesmet__AT__cs.wisc.edu, wenger__AT__cs.wisc.edu,
 Erik Espinoza <espinoza_e__AT__ligo.caltech.edu>,         Patrick Brady
 <patrick__AT__gravity.phys.uwm.edu>,         Albert Lazzarini
 <lazz__AT__ligo.caltech.edu>
Subject: Re: [condor-support #1678] LIGO condor-6.8.1 schedd exit status 4

Stuart Anderson wrote:
> It happened again this morning, i.e.,
> 
> 9/13 08:03:50 Create_Thread: ERROR: we've had 101 consecutive pid collisions,
> giving up! (6270 PIDs being tracked internally.)
> 9/13 08:03:50 ERROR "Assertion ERROR on (tid != 0)" at line 115 in file
> datathread.C

Stuart:

The logs have been very helpful in identifying what is going on here. 
While we are working on a code fix, I think that upping the parameter

MAX_PID_COLLISION_RETRY

again to 1000 should be helpful.

Thanks, and I'll be updating you with more information,

-Greg

===========================================================================
Date mail was appended: Wed Sep 13 16:41:38 2006 (1158183700)
Date: Wed, 13 Sep 2006 15:42:39 -0700
From: "Erik A. Espinoza" <espinoza__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu
CC: anderson__AT__ligo.caltech.edu, espinoza_e__AT__ligo.caltech.edu,
 patrick__AT__gravity.phys.uwm.edu, lazz__AT__ligo.caltech.edu
Subject: Re: [condor-support #1678] LIGO condor-6.8.1 schedd exit status 4
X-Enigmail-Version: 0.94.0.0
Openpgp: url=http://pgp.mit.edu/
X-No-Archive: Yes
X-Archive: No

The MAX_PID_COLLISION_RETRY has been raised to 1,000.

Thanks,
Erik

condor-support response tracking system wrote:
> Stuart Anderson wrote:
>> It happened again this morning, i.e.,
>>
>> 9/13 08:03:50 Create_Thread: ERROR: we've had 101 consecutive pid collisions,
>> giving up! (6270 PIDs being tracked internally.)
>> 9/13 08:03:50 ERROR "Assertion ERROR on (tid != 0)" at line 115 in file
>> datathread.C
> 
> Stuart:
> 
> The logs have been very helpful in identifying what is going on here. 
> While we are working on a code fix, I think that upping the parameter
> 
> MAX_PID_COLLISION_RETRY
> 
> again to 1000 should be helpful.
> 
> Thanks, and I'll be updating you with more information,
> 
> -Greg
> 
> 
> ========================================
> MESSAGE INFORMATION
> ========================================
> * From: Greg Thain <gthain__AT__cs.wisc.edu>
> * Ticket Email List: anderson__AT__ligo.caltech.edu, espinoza_e__AT__ligo.caltech.edu,patrick__AT__gravity.phys.uwm.edu,lazz__AT__ligo.caltech.edu
> 

-- 
Erik A. Espinoza
Systems Administrator
LIGO/Caltech - MS 18-34
Pasadena, CA 91125
Ph: 626-395-8517

===========================================================================
Date mail was appended: Wed Sep 13 17:43:06 2006 (1158187387)
Date: Wed, 13 Sep 2006 20:39:04 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: Greg Thain <gthain__AT__cs.wisc.edu>
CC: condor-support__AT__cs.wisc.edu, miron__AT__cs.wisc.edu, tannenba__AT__cs.wisc.edu,
 zmiller__AT__cs.wisc.edu, adesmet__AT__cs.wisc.edu, wenger__AT__cs.wisc.edu,
 Erik Espinoza <espinoza_e__AT__ligo.caltech.edu>,         Patrick Brady
 <patrick__AT__gravity.phys.uwm.edu>,         Albert Lazzarini
 <lazz__AT__ligo.caltech.edu>
Subject: Re: [condor-support #1678] LIGO condor-6.8.1 schedd exit status 4

On Wed, Sep 13, 2006 at 04:38:14PM -0500, Greg Thain wrote:
> Stuart Anderson wrote:
> >It happened again this morning, i.e.,
> >
> >9/13 08:03:50 Create_Thread: ERROR: we've had 101 consecutive pid 
> >collisions,
> >giving up! (6270 PIDs being tracked internally.)
> >9/13 08:03:50 ERROR "Assertion ERROR on (tid != 0)" at line 115 in file
> >datathread.C
> 
> Stuart:
> 
> The logs have been very helpful in identifying what is going on here. 
> While we are working on a code fix, I think that upping the parameter
> 
> MAX_PID_COLLISION_RETRY
> 
> again to 1000 should be helpful.

Are you sure? As Erik indicated earlier, we have made this change. However,
in a recent discussion Alan indicated that this retry mechanism is implemented
via recursion, so there was some concern about recursing too deeply.
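
To illustrate the recursion concern, here is a minimal Python sketch of a
bounded retry on ID collisions written as a loop, so that raising the limit
to 1000 does not deepen the call stack. This is not Condor's Create_Thread
code; every name is hypothetical, and the "give up after limit + 1
collisions" behavior is only an assumption inferred from the "101
consecutive pid collisions" log message.

```python
import itertools


def allocate_id(candidates, in_use, max_retries):
    """Return the first candidate ID that is not already tracked.

    Gives up once more than max_retries consecutive collisions occur.
    The retry is a plain loop, so the limit does not affect stack depth.
    """
    collisions = 0
    for candidate in candidates:
        if candidate not in in_use:
            return candidate
        collisions += 1
        if collisions > max_retries:
            raise RuntimeError(
                f"{collisions} consecutive ID collisions, giving up"
            )
    raise RuntimeError("candidate source exhausted")


# 150 IDs already tracked; candidates are tried in increasing order.
tracked = set(range(1, 151))
chosen = allocate_id(itertools.count(1), tracked, max_retries=1000)
print(chosen)
```

With a limit of 100 the same call raises on the 101st collision, which is
the failure mode the larger limit is meant to make less likely.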

> 
> Thanks, and I'll be updating you with more information,

I look forward to that.

Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Wed Sep 13 22:40:58 2006 (1158205259)
Date: Thu, 14 Sep 2006 09:40:51 -0500
From: Greg Thain <gthain__AT__cs.wisc.edu>
To: condor-support__AT__cs.wisc.edu
Subject: Re: [condor-support #1678] LIGO condor-6.8.1 schedd exit status 4


> Are you sure? As Erik indicated earlier we have made this change, however,
> in a recent discussion Alan indicated this retry mechanism was implemented
> via recursion so there was some concern about recursing too deeply.

Yes, although there is some concern about deep recursion, I think it 
will be ok, and even if it isn't, it is no worse than the alternative. 
We are discussing code fixes now, and hope to be able to update you soon.

-Greg

===========================================================================
Date mail was appended: Thu Sep 14  9:40:57 2006 (1158244858)
Date: Thu, 14 Sep 2006 11:03:22 -0500 (CDT)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: Stuart Anderson <anderson__AT__ligo.caltech.edu>
CC: condor-support__AT__cs.wisc.edu, miron__AT__cs.wisc.edu, tannenba__AT__cs.wisc.edu,
 zmiller__AT__cs.wisc.edu, adesmet__AT__cs.wisc.edu, gthain__AT__cs.wisc.edu,
 Erik Espinoza <espinoza_e__AT__ligo.caltech.edu>,         Patrick Brady
 <patrick__AT__gravity.phys.uwm.edu>,         Albert Lazzarini
 <lazz__AT__ligo.caltech.edu>,
 "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #1678] LIGO condor-6.8.1 schedd exit status 4

Stuart,

> 9/13 08:03:50 Create_Thread: ERROR: we've had 101 consecutive pid collisions,
> giving up! (6270 PIDs being tracked internally.)
> 9/13 08:03:50 ERROR "Assertion ERROR on (tid != 0)" at line 115 in file
> datathread.C
>
> Fortunately, I was able to get to the Schedd log files before they
> rolled over.  Please see the readme file for further comments and my
> initial observations on the crash relating to the dedicated scheduler
> and duplicate rescue DAGs at,
> http://www.ligo.caltech.edu/~anderson/condor.1678/readme
>
> The detailed log files may be found in the same directory.

Do you have the relevant dagman.out file anywhere?

Do you know if the "original" (pre-schedd-crash) DAGMan was started after
"DAGMAN_ABORT_DUPLICATES = True" was added to the configuration?

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Thu Sep 14 11:07:35 2006 (1158250056)
Date: Fri, 15 Sep 2006 18:05:55 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
CC: condor-support__AT__cs.wisc.edu, Erik Espinoza <espinoza_e__AT__ligo.caltech.edu>
Subject: Re: [condor-support #1678] LIGO condor-6.8.1 schedd exit status 4

On Thu, Sep 14, 2006 at 11:03:22AM -0500, R. Kent Wenger wrote:
> Stuart,
> 
> > 9/13 08:03:50 Create_Thread: ERROR: we've had 101 consecutive pid collisions,
> > giving up! (6270 PIDs being tracked internally.)
> > 9/13 08:03:50 ERROR "Assertion ERROR on (tid != 0)" at line 115 in file
> > datathread.C
> >
> > Fortunately, I was able to get to the Schedd log files before they
> > rolled over.  Please see the readme file for further comments and my
> > initial observations on the crash relating to the dedicated scheduler
> > and duplicate rescue DAGs at,
> > http://www.ligo.caltech.edu/~anderson/condor.1678/readme
> >
> > The detailed log files may be found in the same directory.
> 
> Do you have the relevant dagman.out file anywhere?
> 
> Do you know if the "original" (pre-schedd-crash) DAGMan was started after
> "DAGMAN_ABORT_DUPLICATES = True" was added to the configuration?
> 

Erik, please correct me if I am wrong, but I believe we determined that
these DAGMans were started after the configuration change; however,
you were unable to find the dagman.out files.

Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Fri Sep 15 20:14:06 2006 (1158369247)
Date: Sun, 17 Sep 2006 22:52:10 -0500
To: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>,
 Stuart Anderson <anderson__AT__ligo.caltech.edu>
From: Miron Livny <miron__AT__cs.wisc.edu>
Subject: Re: [condor-support #1678] LIGO condor-6.8.1 schedd exit   status 4
CC: condor-support__AT__cs.wisc.edu, tannenba__AT__cs.wisc.edu, zmiller__AT__cs.wisc.edu,
 adesmet__AT__cs.wisc.edu, gthain__AT__cs.wisc.edu,         Erik Espinoza
 <espinoza_e__AT__ligo.caltech.edu>,         Patrick Brady
 <patrick__AT__gravity.phys.uwm.edu>,         Albert Lazzarini
 <lazz__AT__ligo.caltech.edu>,
 "R. Kent Wenger" <wenger__AT__cs.wisc.edu>

Greg and Kent,

Any news or progress?

Miron

At 11:03 AM 9/14/2006, R. Kent Wenger wrote:
>Stuart,
>
> > 9/13 08:03:50 Create_Thread: ERROR: we've had 101 consecutive pid 
> collisions,
> > giving up! (6270 PIDs being tracked internally.)
> > 9/13 08:03:50 ERROR "Assertion ERROR on (tid != 0)" at line 115 in file
> > datathread.C
> >
> > Fortunately, I was able to get to the Schedd log files before they
> > rolled over.  Please see the readme file for further comments and my
> > initial observations on the crash relating to the dedicated scheduler
> > and duplicate rescue DAGs at,
> > http://www.ligo.caltech.edu/~anderson/condor.1678/readme
> >
> > The detailed log files may be found in the same directory.
>
>Do you have the relevant dagman.out file anywhere?
>
>Do you know if the "original" (pre-schedd-crash) DAGMan was started after
>"DAGMAN_ABORT_DUPLICATES = True" was added to the configuration?
>
>Kent Wenger
>Condor Team


===========================================================================
Date mail was appended: Sun Sep 17 22:53:49 2006 (1158551630)
Date: Mon, 18 Sep 2006 08:12:25 -0500
From: Greg Thain <gthain__AT__cs.wisc.edu>
To: Miron Livny <miron__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>,
 Stuart Anderson <anderson__AT__ligo.caltech.edu>,  condor-support__AT__cs.wisc.edu,
 tannenba__AT__cs.wisc.edu, zmiller__AT__cs.wisc.edu,  adesmet__AT__cs.wisc.edu,
 Erik Espinoza <espinoza_e__AT__ligo.caltech.edu>,  Patrick Brady
 <patrick__AT__gravity.phys.uwm.edu>,         Albert Lazzarini
 <lazz__AT__ligo.caltech.edu>
Subject: Re: [condor-support #1678] LIGO condor-6.8.1 schedd exit  status  4

Miron Livny wrote:
> Greg and Kent,
> 
> Any news or progress?
> 
> Miron

On the PID collision front, we understand why it happens, and we are 
investigating several possible solutions.  In the short term, LIGO has 
implemented a configuration change which we believe will lessen the 
chance of hitting this problem.

-Greg

===========================================================================
Date mail was appended: Mon Sep 18  8:12:57 2006 (1158585178)
Date: Mon, 18 Sep 2006 15:31:42 -0500 (CDT)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: Miron Livny <miron__AT__cs.wisc.edu>
CC: Stuart Anderson <anderson__AT__ligo.caltech.edu>,
 condor-support__AT__cs.wisc.edu,         tannenba__AT__cs.wisc.edu,
 zmiller__AT__cs.wisc.edu, adesmet__AT__cs.wisc.edu,         gthain__AT__cs.wisc.edu,
 Erik Espinoza <espinoza_e__AT__ligo.caltech.edu>,         Patrick Brady
 <patrick__AT__gravity.phys.uwm.edu>,         Albert Lazzarini
 <lazz__AT__ligo.caltech.edu>,
 "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #1678] LIGO condor-6.8.1 schedd exit  status 4

Miron,

> Any news or progress?

I have an idea on the DAGMan end, but I can't confirm it without a
dagman.out file.

I may send them a pre-release DAGMan that will fix things if that is the
problem.

Kent

===========================================================================
Date mail was appended: Mon Sep 18 15:35:05 2006 (1158611707)
Date: Mon, 18 Sep 2006 13:40:15 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
CC: Miron Livny <miron__AT__cs.wisc.edu>, condor-support__AT__cs.wisc.edu,
 tannenba__AT__cs.wisc.edu, zmiller__AT__cs.wisc.edu, adesmet__AT__cs.wisc.edu,
 gthain__AT__cs.wisc.edu, Erik Espinoza <espinoza_e__AT__ligo.caltech.edu>,
 Patrick Brady <patrick__AT__gravity.phys.uwm.edu>
Subject: Re: [condor-support #1678] LIGO condor-6.8.1 schedd exit  status 4

On Mon, Sep 18, 2006 at 03:31:42PM -0500, R. Kent Wenger wrote:
> Miron,
> 
> > Any news or progress?
> 
> I have an idea on the DAGMan end, but I can't confirm it without a
> dagman.out file.
> 
> I may send them a pre-release DAGMan that will fix things if that is the
> problem.

Kent,
	Is this a possible fix for the root problem or a fix for the 
"DAGMAN_ABORT_DUPLICATES = True" patch?

	To avoid the problem of users cleaning up their own dagman.out
files before we can find them for a problem report, would it be possible,
and desirable, to add a DAGMAN debug option that keeps a more permanent
copy of any interesting dagman related logging information in a similar
fashion to the main condor daemons, e.g., DAGMAN_DEBUG = ...?

Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Mon Sep 18 15:45:06 2006 (1158612307)
Date: Mon, 18 Sep 2006 17:37:47 -0500 (CDT)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: Stuart Anderson <anderson__AT__ligo.caltech.edu>
CC: Miron Livny <miron__AT__cs.wisc.edu>, condor-support__AT__cs.wisc.edu,
 tannenba__AT__cs.wisc.edu, zmiller__AT__cs.wisc.edu, adesmet__AT__cs.wisc.edu,
 gthain__AT__cs.wisc.edu, Erik Espinoza <espinoza_e__AT__ligo.caltech.edu>,
 Patrick Brady <patrick__AT__gravity.phys.uwm.edu>,
 "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #1678] LIGO condor-6.8.1 schedd exit  status 4

Stuart,

> On Mon, Sep 18, 2006 at 03:31:42PM -0500, R. Kent Wenger wrote:
> >
> > I may send them a pre-release DAGMan that will fix things if that is the
> > problem.
>
> 	Is this a possible fix for the root problem or a fix for the
> "DAGMAN_ABORT_DUPLICATES = True" patch?

This would be a fix to the DAGMAN_ABORT_DUPLICATES feature.

You need RedHat 9 binaries, right?

> 	To avoid the problem of users cleaning up their own dagman.out
> files before we can find them for a problem report, would it be possible,
> and desirable, to add a DAGMAN debug option that keeps a more permanent
> copy of any interesting dagman related logging information in a similar
> fashion to the main condor daemons, e.g., DAGMAN_DEBUG = ...?

Well, there's already a DAGMAN_DEBUG that controls how much info goes
into the dagman.out file.

The main change I can think of would be an option to give the dagman.out
file a unique name -- maybe something like
foobar.dag.dagman.out.<machine>.<cluster>.  That way re-running the same
DAG wouldn't overwrite an existing dagman.out file.
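
The naming scheme proposed above can be sketched in a few lines of Python.
This only illustrates the proposal; it is not an existing DAGMan option, and
the function name, the example host, and the example cluster number are
hypothetical.

```python
def unique_dagman_out(dag_file, machine, cluster):
    """Build a per-run name of the form <dag>.dagman.out.<machine>.<cluster>."""
    return f"{dag_file}.dagman.out.{machine}.{cluster}"


# Two runs of the same DAG get distinct output files because the
# cluster ID of the DAGMan job differs between submissions.
first = unique_dagman_out("foobar.dag", "ldas-grid.ligo.caltech.edu", 7051962)
second = unique_dagman_out("foobar.dag", "ldas-grid.ligo.caltech.edu", 7051963)
print(first)
print(second)
```

Including the machine name keeps files distinct when the same DAG is
submitted from more than one schedd host, where cluster IDs could repeat.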

Do you think something like that would solve the problem?  My inclination
is to not add something that essentially duplicates dagman.out.

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Mon Sep 18 17:37:54 2006 (1158619075)
Date: Mon, 18 Sep 2006 16:06:41 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>,
 Brown Duncan <duncan__AT__gravity.phys.uwm.edu>
CC: Erik Espinoza <espinoza_e__AT__ligo.caltech.edu>
Subject: Re: [condor-support #1678] LIGO condor-6.8.1 schedd exit status 4

Duncan,
	Please read below.

On Mon, Sep 18, 2006 at 05:37:54PM -0600, condor-support response tracking system wrote:
> Stuart,
> 
> > On Mon, Sep 18, 2006 at 03:31:42PM -0500, R. Kent Wenger wrote:
> > >
> > > I may send them a pre-release DAGMan that will fix things if that is the
> > > problem.
> >
> > 	Is this a possible fix for the root problem or a fix for the
> > "DAGMAN_ABORT_DUPLICATES = True" patch?
> 
> This would be a fix to the DAGMAN_ABORT_DUPLICATES feature.
> 
> You need RedHat 9 binaries, right?

Kent,
	I am still confused about the last discussion on the re-basing of
Condor Linux builds from RH9 to RHEL3, so whichever one of these you think
is the "right" one to run on an FC4 x86_64 machine will be fine.

> 
> > 	To avoid the problem of users cleaning up their own dagman.out
> > files before we can find them for a problem report, would it be possible,
> > and desirable, to add a DAGMAN debug option that keeps a more permanent
> > copy of any interesting dagman related logging information in a similar
> > fashion to the main condor daemons, e.g., DAGMAN_DEBUG = ...?
> 
> Well, there's already a DAGMAN_DEBUG that controls how much info goes
> into the dagman.out file.
> 
> The main change I can think of would be an option to give the dagman.out
> file a unique name -- maybe something like
> foobar.dag.dagman.out.<machine>.<cluster>.  That way re-running the same
> DAG wouldn't overwrite an existing dagman.out file.
> 
> Do you think something like that would solve the problem?  My inclination
> is to not add something that essentially duplicates dagman.out.
> 

Kent,
	This sounds reasonable to me but I often get confused about all the
files generated when running a DAG (and what they should be called).

Duncan,
	What do you think? The goal is to come up with a scheme that makes
it easier to provide Kent with the necessary debugging information in the
future when we have problems with DAGs, without confusing our users or
breaking current analysis pipelines.

Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Mon Sep 18 18:07:00 2006 (1158620820)
Date: Tue, 19 Sep 2006 17:17:41 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: Greg Thain <gthain__AT__cs.wisc.edu>
CC: Miron Livny <miron__AT__cs.wisc.edu>,
 "R. Kent Wenger" <wenger__AT__cs.wisc.edu>,
 condor-support__AT__cs.wisc.edu, tannenba__AT__cs.wisc.edu, zmiller__AT__cs.wisc.edu,
 adesmet__AT__cs.wisc.edu, Erik Espinoza <espinoza_e__AT__ligo.caltech.edu>,
 Patrick Brady <patrick__AT__gravity.phys.uwm.edu>
Subject: Re: [condor-support #1678] LIGO condor-6.8.1 schedd exit  status 4

Greg, Alan, others?

What bugs have been fixed in the 6.8.1 release today compared to the
6.8.1 pre-release Alan gave us that we are running at LIGO Caltech?
That is, should we upgrade to 6.8.1 final or wait for more of the bug
fixes that are being worked on? Any advice would be appreciated.

Thanks.

On Mon, Sep 18, 2006 at 08:12:25AM -0500, Greg Thain wrote:
> Miron Livny wrote:
> >Greg and Kent,
> >
> >Any news or progress?
> >
> >Miron
> 
> On the PID collision front, we understand why it happens, and we are 
> investigating several possible solutions.  In the short term, LIGO has 
> implemented a configuration change which we believe will lessen the 
> chance of hitting this problem.
> 
> -Greg

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Tue Sep 19 19:19:05 2006 (1158711546)
Date: Wed, 20 Sep 2006 10:28:48 -0500
From: Alan De Smet <adesmet__AT__cs.wisc.edu>
To: Stuart Anderson <anderson__AT__ligo.caltech.edu>
CC: Greg Thain <gthain__AT__cs.wisc.edu>, Miron Livny <miron__AT__cs.wisc.edu>,
 "R. Kent Wenger" <wenger__AT__cs.wisc.edu>,
 condor-support__AT__cs.wisc.edu,         tannenba__AT__cs.wisc.edu,
 zmiller__AT__cs.wisc.edu,         Erik Espinoza <espinoza_e__AT__ligo.caltech.edu>,
 Patrick Brady <patrick__AT__gravity.phys.uwm.edu>
Subject: Re: [condor-support #1678] LIGO condor-6.8.1 schedd exit  status 4

Stuart Anderson <anderson__AT__ligo.caltech.edu> wrote:
> What bugs have been fixed in the 6.8.1 release today compared to the
> 6.8.1 pre-release Alan gave us that we are running at LIGO Caltech?

Not a lot, although there might be a security fix or two.  I'll
get you more complete details and a recommendation (wait or
upgrade) in a moment.

-- 
Alan De Smet                              Condor Project Research
adesmet__AT__cs.wisc.edu                 http://www.condorproject.org/

===========================================================================
Date mail was appended: Wed Sep 20 10:31:02 2006 (1158766263)
Date: Wed, 20 Sep 2006 10:46:34 -0500
From: Alan De Smet <adesmet__AT__cs.wisc.edu>
To: Stuart Anderson <anderson__AT__ligo.caltech.edu>
CC: Greg Thain <gthain__AT__cs.wisc.edu>, Miron Livny <miron__AT__cs.wisc.edu>,
 "R. Kent Wenger" <wenger__AT__cs.wisc.edu>,
 condor-support__AT__cs.wisc.edu,         tannenba__AT__cs.wisc.edu,
 zmiller__AT__cs.wisc.edu,         Erik Espinoza <espinoza_e__AT__ligo.caltech.edu>,
 Patrick Brady <patrick__AT__gravity.phys.uwm.edu>
Subject: Re: [condor-support #1678] LIGO condor-6.8.1 schedd exit  status 4

Stuart Anderson <anderson__AT__ligo.caltech.edu> wrote:
> What bugs have been fixed in the 6.8.1 release today compared to the
> 6.8.1 pre-release Alan gave us that we are running at LIGO Caltech?

A number of infrequent bugs have been fixed.

- Improvements to DAGMan's duplicate run detection.

- Fixes to some negotiator and schedd optimizations that could
  cause Condor to make incorrect matching decisions.

- Support for large STARTD_NAMEs.

- Fixes to CLAIMTOBE authorization to allow more precise
  filtering.  Not enabled by default because of backward
  compatibility issues.

- shadows are better at detecting that the associated starter is
  gone for good (say, because of a reboot), speeding up a number
  of things.

- File descriptor leak fixed.

> i.e., should we upgrade to 6.8.1 final or wait for more of these bug
> fixes that are being worked on? Any advice would be appreciated.

You should probably upgrade, but there is no rush.

You're not (to our knowledge) getting hit by most of the bugs we
fixed.  You are hitting the DAGMan problems, but Kent's giving
you binaries that are even newer than 6.8.1.  You might be
hitting the negotiator/schedd optimization problems, but the
impact is minor (jobs run in the "wrong" order).

We're working on a more permanent fix for your PID collision
problems; I'm hoping to be able to offer you a new schedd in
about a week.  Given that replacing your schedd is very severe
for your pool, you might put off the 6.8.1 upgrade until we can
provide a new schedd binary.  If the upgrade isn't a big deal,
you can certainly upgrade sooner and you might get some minor
benefits.

-- 
Alan De Smet                              Condor Project Research
adesmet__AT__cs.wisc.edu                 http://www.condorproject.org/

===========================================================================
Date mail was appended: Wed Sep 20 10:51:20 2006 (1158767480)
Date: Wed, 20 Sep 2006 22:02:56 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: espinoza_e__AT__ligo.caltech.edu, patrick__AT__gravity.phys.uwm.edu,
 espinoza__AT__ligo.caltech.edu, duncan__AT__gravity.phys.uwm.edu
Subject: Re: [condor-support #1678] LIGO condor-6.8.1 schedd exit status 4

Alan,
	Thanks for the details. We will stick with 6.8.1pre for now and
wait for the 6.8.1post schedd before upgrading. Is there any chance
that this schedd will also have fixes for the dedicatedScheduler problems?

Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Thu Sep 21  0:03:15 2006 (1158814996)
Date: Thu, 21 Sep 2006 10:49:31 -0500 (CDT)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: Stuart Anderson <anderson__AT__ligo.caltech.edu>
CC: Miron Livny <miron__AT__cs.wisc.edu>, condor-support__AT__cs.wisc.edu,
 tannenba__AT__cs.wisc.edu, zmiller__AT__cs.wisc.edu, adesmet__AT__cs.wisc.edu,
 gthain__AT__cs.wisc.edu, Erik Espinoza <espinoza_e__AT__ligo.caltech.edu>,
 Patrick Brady <patrick__AT__gravity.phys.uwm.edu>,
 "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #1678] LIGO condor-6.8.1 schedd exit status 4

Stuart,

Okay, I have a 6.8.2 pre-release condor_dagman for you, which should fix
the problems with the "abort duplicate runs" feature.

You can get it at:

    ftp://ftp.cs.wisc.edu/condor/temporary/forligo/condor_dagman

Note that in order for this condor_dagman to properly abort itself, the
earlier condor_dagman must also be the 6.8.2 pre-release version, not
6.8.1.  In other words, to get the "no duplicates" protection on existing
runs, you'll have to condor_rm them and start the rescue DAGs with the new
condor_dagman binary.

All of the configuration macros are the same as for 6.8.1.

Please let me know if you have any questions.

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Thu Sep 21 10:50:53 2006 (1158853854)
Date: Tue, 26 Sep 2006 11:39:52 -0700
From: "Erik A. Espinoza" <espinoza__AT__ligo.caltech.edu>
To: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #1678] LIGO condor-6.8.1 schedd exit status 4
CC: "Stuart Anderson" <anderson__AT__ligo.caltech.edu>,
 "Miron Livny" <miron__AT__cs.wisc.edu>,
 condor-support__AT__cs.wisc.edu,         tannenba__AT__cs.wisc.edu,
 zmiller__AT__cs.wisc.edu, adesmet__AT__cs.wisc.edu,         gthain__AT__cs.wisc.edu,
 "Erik Espinoza" <espinoza_e__AT__ligo.caltech.edu>,
 "Patrick Brady" <patrick__AT__gravity.phys.uwm.edu>

Could we have this in RHAS please?

Thanks,
Erik

===========================================================================
Date mail was appended: Tue Sep 26 13:40:14 2006 (1159296015)
Date: Tue, 26 Sep 2006 18:44:55 -0500 (CDT)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: "Erik A. Espinoza" <espinoza__AT__ligo.caltech.edu>
CC: Stuart Anderson <anderson__AT__ligo.caltech.edu>,         Miron Livny
 <miron__AT__cs.wisc.edu>, condor-support__AT__cs.wisc.edu,  tannenba__AT__cs.wisc.edu,
 zmiller__AT__cs.wisc.edu, adesmet__AT__cs.wisc.edu,  gthain__AT__cs.wisc.edu,
 Erik Espinoza <espinoza_e__AT__ligo.caltech.edu>,  Patrick Brady
 <patrick__AT__gravity.phys.uwm.edu>,
 "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #1678] LIGO condor-6.8.1 schedd exit status 4

On Tue, 26 Sep 2006, Erik A. Espinoza wrote:

> Could we have this in RHAS please?

Okay, I'm planning to get a couple more fixes in this week before I leave.
If it's okay with you, I'll wait until those are in place -- that would
probably mean some time Thursday.

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Tue Sep 26 18:47:30 2006 (1159314466)
Date: Thu, 28 Sep 2006 17:40:51 -0500 (CDT)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: "Erik A. Espinoza" <espinoza__AT__ligo.caltech.edu>
CC: Stuart Anderson <anderson__AT__ligo.caltech.edu>,         Miron Livny
 <miron__AT__cs.wisc.edu>, condor-support__AT__cs.wisc.edu,  tannenba__AT__cs.wisc.edu,
 zmiller__AT__cs.wisc.edu, adesmet__AT__cs.wisc.edu,  gthain__AT__cs.wisc.edu,
 Erik Espinoza <espinoza_e__AT__ligo.caltech.edu>,  Patrick Brady
 <patrick__AT__gravity.phys.uwm.edu>,
 "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #1678] LIGO condor-6.8.1 schedd exit status 4

On Tue, 26 Sep 2006, Erik A. Espinoza wrote:

> Could we have this in RHAS please?

Okay, I have 6.8.2 pre-release condor_dagman and condor_submit_dag
binaries ready for you.  These have the change that the dagman.out
file is appended to instead of overwritten.  I'm still working on the
NFS log file checking; I'll follow up on the appropriate ticket.

The binaries are available at:

ftp://ftp.cs.wisc.edu/condor/temporary/forligo/ia64_rhas_3/condor_dagman
ftp://ftp.cs.wisc.edu/condor/temporary/forligo/ia64_rhas_3/condor_submit_dag
ftp://ftp.cs.wisc.edu/condor/temporary/forligo/x86_rh_9/condor_dagman
ftp://ftp.cs.wisc.edu/condor/temporary/forligo/x86_rh_9/condor_submit_dag
ftp://ftp.cs.wisc.edu/condor/temporary/forligo/x86_rhas_3/condor_dagman
ftp://ftp.cs.wisc.edu/condor/temporary/forligo/x86_rhas_3/condor_submit_dag

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Thu Sep 28 17:44:32 2006 (1159483473)
Date: Fri, 20 Oct 2006 10:07:50 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: espinoza_e__AT__ligo.caltech.edu, patrick__AT__gravity.phys.uwm.edu,
 espinoza__AT__ligo.caltech.edu, duncan__AT__gravity.phys.uwm.edu
Subject: Re: [condor-support #1678] LIGO condor-6.8.1 schedd exit status 4

We have not had this problem as of condor-6.8.2, with the 3x reduction
in the number of schedd child processes created, so this ticket can be
closed.

Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Fri Oct 20 12:08:14 2006 (1161364094)
Subject: Actions

Ticket resolved by gthain
===========================================================================
Date of actions: Mon Oct 30  9:34:56 2006 (1162222496)