LIGO Support Ticket 1885

Ticket Information
  Number:      support 1885
  User:        anderson@ligo.caltech.edu
  Email:       espinoza_e__AT__ligo.caltech.edu,espinoza__AT__ligo.caltech.edu
  Status:      resolved
  Assigned To: wenger
Date: Tue, 20 Feb 2007 20:28:04 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu
CC: Erik Espinoza <espinoza_e__AT__ligo.caltech.edu>
Subject: LIGO: NTsenders.C/NTreceivers.C Assertion errors

The LIGO CIT Condor pool was upgraded to 6.8.4 today and while things
are mostly working there are some jobs that are stuck generating repeated
shadow/starter assertion failures. Any help in figuring out what is
wrong would be appreciated.  For example, consider the job 10514982.0,

$ condor_q -run 10514982.0


-- Quill: citquill@ligo : <10.14.0.25:5432> : citquill_db
 ID      OWNER            SUBMITTED     RUN_TIME HOST(S)         
10514982.0   vmandic         2/20 18:41   0+01:21:07 vm1__AT__node107.ldas-cit.ligo.caltech.edu


However, it is not actually running on node107 due to the following errors:


The ShadowLog has entries:

$ tac /usr1/condor/log/ShadowLog | grep 10514982.0 | head
2/20 19:43:46 (10514982.0) (26689): Request to run on <10.14.1.107:35873> was ACCEPTED
2/20 19:43:46 Initializing a VANILLA shadow for job 10514982.0
2/20 19:43:32 (10514982.0) (26305): ERROR "Assertion ERROR on (result)" at line 236 in file NTreceivers.C
2/20 19:43:32 (10514982.0) (26305): Buf::write(): condor_write() failed
2/20 19:43:32 (10514982.0) (26305): condor_write(): Socket closed when trying to write 3698 bytes to <10.14.1.107:35873>, fd is 7
2/20 19:34:08 (10514982.0) (26305): Request to run on <10.14.1.107:35873> was ACCEPTED
2/20 19:34:08 Initializing a VANILLA shadow for job 10514982.0
2/20 19:34:01 (10514982.0) (26186): ERROR "Assertion ERROR on (result)" at line 236 in file NTreceivers.C
2/20 19:34:01 (10514982.0) (26186): Buf::write(): condor_write() failed
2/20 19:34:01 (10514982.0) (26186): condor_write(): Socket closed when trying to write 3698 bytes to <10.14.1.107:35873>, fd is 7

and a short snippet of the log shows:

2/20 19:34:08 ******************************************************
2/20 19:34:08 ** condor_shadow (CONDOR_SHADOW) STARTING UP
2/20 19:34:08 ** /ldcg/stow_pkgs/condor-6.8.4/condor/sbin/condor_shadow
2/20 19:34:08 ** $CondorVersion: 6.8.4 Feb  1 2007 $
2/20 19:34:08 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
2/20 19:34:08 ** PID = 26305
2/20 19:34:08 ** Log last touched 2/20 19:34:06
2/20 19:34:08 ******************************************************
2/20 19:34:08 Using config source: /usr1/condor/condor_config
2/20 19:34:08 Using local config sources: 
2/20 19:34:08    /usr1/condor/condor_config.local
2/20 19:34:08 DaemonCore: Command Socket at <10.14.0.12:44397>
2/20 19:34:08 Initializing a VANILLA shadow for job 10514982.0
2/20 19:34:08 (10514982.0) (26305): Request to run on <10.14.1.107:35873> was ACCEPTED

...

2/20 19:43:24 (10514998.0) (25201): Job 10514998.0 terminated: exited with status 0
2/20 19:43:24 (10514998.0) (25201): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 100
2/20 19:43:32 (10514982.0) (26305): condor_write(): Socket closed when trying to write 3698 bytes to <10.14.1.107:35873>, fd is 7
2/20 19:43:32 (10514982.0) (26305): Buf::write(): condor_write() failed
2/20 19:43:32 (10514982.0) (26305): ERROR "Assertion ERROR on (result)" at line 236 in file NTreceivers.C
2/20 19:43:32 (10514992.0) (26307): condor_write(): Socket closed when trying to write 3698 bytes to <10.14.1.112:49238>, fd is 7
2/20 19:43:32 (10514992.0) (26307): Buf::write(): condor_write() failed
2/20 19:43:32 (10514992.0) (26307): ERROR "Assertion ERROR on (result)" at line 236 in file NTreceivers.C
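
For anyone triaging a similar ShadowLog, the grep above can be extended to group assertion failures by job ID. The following is a minimal sketch, not part of the original report: the function name and the regular expression are assumptions based only on the line format visible in the excerpts above, and the sample lines are copied from this ticket.

```python
import re
from collections import defaultdict

# Assumed line format, taken from the excerpts in this ticket, e.g.:
# 2/20 19:43:32 (10514982.0) (26305): ERROR "Assertion ERROR on (result)" at line 236 in file NTreceivers.C
ASSERT_RE = re.compile(
    r'^(?P<date>\d+/\d+) (?P<time>\d+:\d+:\d+) '
    r'\((?P<job>\d+\.\d+)\) \((?P<pid>\d+)\): '
    r'ERROR "(?P<msg>[^"]*)" at line (?P<line>\d+) in file (?P<file>\S+)'
)


def assertion_failures_by_job(lines):
    """Group shadow ERROR lines by job ID (hypothetical helper)."""
    failures = defaultdict(list)
    for raw in lines:
        match = ASSERT_RE.match(raw.strip())
        if match:
            failures[match.group("job")].append(
                (
                    match.group("date"),
                    match.group("time"),
                    int(match.group("pid")),
                    match.group("file"),
                    int(match.group("line")),
                )
            )
    return dict(failures)


# Sample lines copied from the ShadowLog excerpts in this ticket.
SAMPLE = [
    '2/20 19:34:01 (10514982.0) (26186): ERROR "Assertion ERROR on (result)" at line 236 in file NTreceivers.C',
    '2/20 19:43:32 (10514982.0) (26305): Buf::write(): condor_write() failed',
    '2/20 19:43:32 (10514982.0) (26305): ERROR "Assertion ERROR on (result)" at line 236 in file NTreceivers.C',
    '2/20 19:43:32 (10514992.0) (26307): ERROR "Assertion ERROR on (result)" at line 236 in file NTreceivers.C',
]

failures = assertion_failures_by_job(SAMPLE)
for job, events in sorted(failures.items()):
    print(job, len(events), "assertion failure(s)")
```

To run it against a real log, pass an open file object instead of SAMPLE; lines that do not match the assumed format are skipped.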


And the StarterLogs on node107 (ip=10.14.1.107) have matching errors:

[root@node107 log]# grep "2/20 " StarterLog.vm? | grep Assertion
StarterLog.vm1:2/20 19:20:11 ERROR "Assertion ERROR on (result)" at line 148 in file NTsenders.C
StarterLog.vm1:2/20 19:29:20 ERROR "Assertion ERROR on (result)" at line 148 in file NTsenders.C
StarterLog.vm1:2/20 19:39:08 ERROR "Assertion ERROR on (result)" at line 148 in file NTsenders.C
StarterLog.vm1:2/20 19:48:46 ERROR "Assertion ERROR on (result)" at line 148 in file NTsenders.C
StarterLog.vm2:2/20 19:30:00 ERROR "Assertion ERROR on (result)" at line 148 in file NTsenders.C
StarterLog.vm2:2/20 19:39:14 ERROR "Assertion ERROR on (result)" at line 148 in file NTsenders.C
StarterLog.vm2:2/20 19:48:51 ERROR "Assertion ERROR on (result)" at line 148 in file NTsenders.C


The full entry for one of the StarterLog jobs is:

2/20 19:43:46 ******************************************************
2/20 19:43:46 ** condor_starter (CONDOR_STARTER) STARTING UP
2/20 19:43:46 ** /ldcg/stow_pkgs/condor-6.8.4/condor/sbin/condor_starter
2/20 19:43:46 ** $CondorVersion: 6.8.4 Feb  1 2007 $
2/20 19:43:46 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
2/20 19:43:46 ** PID = 11043
2/20 19:43:46 ** Log last touched 2/20 19:39:08
2/20 19:43:46 ******************************************************
2/20 19:43:46 Using config source: /usr1/condor/condor_config
2/20 19:43:46 Using local config sources: 
2/20 19:43:46    /usr1/condor/condor_config.local
2/20 19:43:46 DaemonCore: Command Socket at <10.14.1.107:34066>
2/20 19:43:46 Done setting resource limits
2/20 19:48:46 condor_read(): timeout reading 5 bytes from <10.14.0.12:49696>.
2/20 19:48:46 ERROR "Assertion ERROR on (result)" at line 148 in file NTsenders.C
2/20 19:48:46 ERROR "LocalUserLog::logStarterError() called before init()" at line 205 in file local_user_log.C
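
The timestamps in the starter excerpt above can be compared directly. The following is a minimal sketch, not part of the original report, that computes the gap between two log timestamps. The helper name is hypothetical, and the year 2007 is an assumption taken from the ticket's date header, since the log lines carry no year. It only measures the interval and does not establish a cause.

```python
from datetime import datetime


def elapsed_seconds(start, end, year=2007):
    """Seconds between two 'M/D HH:MM:SS' log timestamps (hypothetical helper).

    The logs omit the year, so one is supplied explicitly; 2007 is assumed
    from the ticket's date header.
    """
    fmt = "%Y %m/%d %H:%M:%S"
    t0 = datetime.strptime(f"{year} {start}", fmt)
    t1 = datetime.strptime(f"{year} {end}", fmt)
    return (t1 - t0).total_seconds()


# "Done setting resource limits" versus the condor_read() timeout line.
gap = elapsed_seconds("2/20 19:43:46", "2/20 19:48:46")
print(gap)  # 300.0
```

For the excerpt above, the starter logged its read timeout 300 seconds after it finished starting up.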


The user log file is showing,

...
007 (10514982.000.000) 02/20 19:43:32 Shadow exception!
        Assertion ERROR on (result)
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
...


Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date of creation: Tue Feb 20 22:28:23 2007 (1172032105)
Subject: Actions

Assigned to wenger by wenger
===========================================================================
Date of actions: Wed Feb 21 10:59:11 2007 (1172077151)
Date: Wed, 21 Feb 2007 11:08:16 -0600 (CST)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: wenger <condor-support__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #1885] LIGO: NTsenders.C/NTreceivers.C      
 Assertion errors

Stuart,

> The LIGO CIT Condor pool was upgraded to 6.8.4 today and while things
> are mostly working there are some jobs that are stuck generating repeated
> shadow/starter assertion failures. Any help in figuring out what is
> wrong would be appreciated.  For example, consider the job 10514982.0,

We'll look into this and get back to you ASAP.

I'm not familiar enough with the Windows code to have any smart ideas
offhand...

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Wed Feb 21 11:09:33 2007 (1172077776)
Date: Wed, 21 Feb 2007 09:54:23 -0800
From: "Erik A. Espinoza" <espinoza__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu
CC: anderson__AT__ligo.caltech.edu, espinoza_e__AT__ligo.caltech.edu
Subject: Re: [condor-support #1885] LIGO: NTsenders.C/NTreceivers.C     
 Assertion  errors
X-Enigmail-Version: 0.94.2.0
Openpgp: url=http://pgp.mit.edu/

Hello Kent,

This is very weird, as we are NOT using any Windows machines in our pool 
at all. All of our machines are using FC4/x86_64 w/ the 64-bit condor.

Thanks,
Erik

-- 
Erik A. Espinoza
Systems Administrator
LIGO/Caltech - MS 18-34
Pasadena, CA 91125
Ph: 626-395-8517

===========================================================================
Date mail was appended: Wed Feb 21 11:54:55 2007 (1172080495)
Date: Wed, 21 Feb 2007 12:03:08 -0600 (CST)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #1885] LIGO: NTsenders.C/NTreceivers.C    
 Assertion  errors

Erik,

> This is very weird, as we are NOT using any Windows machines in our pool
> at all. All of our machines are using FC4/x86_64 w/ the 64-bit condor.

Hmm, maybe that's a clue in and of itself.

I have to confirm that it really *is* Windows code, I guess.  I assumed
from the filename (NTreceivers.C) that it is, but maybe NT stands for
something different in that case.

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Wed Feb 21 12:04:17 2007 (1172081058)
Date: Wed, 21 Feb 2007 12:25:08 -0600 (CST)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #1885] LIGO: NTsenders.C/NTreceivers.C    
 Assertion  errors

Erik,

> This is very weird, as we are NOT using any Windows machines in our pool
> at all. All of our machines are using FC4/x86_64 w/ the 64-bit condor.

Okay, I have a little more info here -- the NTreceivers.C file name is
left over from a time when that code was Windows-only, but now it's not.

So the fact that you're getting an error message from that file isn't a
problem in and of itself.

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Wed Feb 21 12:29:13 2007 (1172082553)
Date: Wed, 21 Feb 2007 13:19:13 -0600 (CST)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: wenger <condor-support__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #1885] LIGO: NTsenders.C/NTreceivers.C      
 Assertion errors

Stuart,

Can you please send the complete ShadowLog?

Thanks.

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Wed Feb 21 13:24:12 2007 (1172085853)
Date: Wed, 21 Feb 2007 11:54:19 -0800
From: "Erik A. Espinoza" <espinoza__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu
CC: anderson__AT__ligo.caltech.edu, espinoza_e__AT__ligo.caltech.edu
Subject: Re: [condor-support #1885] LIGO: NTsenders.C/NTreceivers.C     
 Assertion  errors
X-Enigmail-Version: 0.94.2.0
Openpgp: url=http://pgp.mit.edu/

Hello Kent,

Here are the ShadowLogs:
http://www.ligo.caltech.edu/~eespinoz/debug/02-20-2007/

ShadowLog.bz2 is the full log and is ~1g uncompressed
ShadowLog-02-20-2007.bz2 is the log for yesterday and today. It is ~260 
uncompressed.

Take your pick.

Thanks,
Erik

-- 
Erik A. Espinoza
Systems Administrator
LIGO/Caltech - MS 18-34
Pasadena, CA 91125
Ph: 626-395-8517

===========================================================================
Date mail was appended: Wed Feb 21 13:54:45 2007 (1172087686)
Date: Wed, 21 Feb 2007 14:18:46 -0600 (CST)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #1885] LIGO: NTsenders.C/NTreceivers.C    
 Assertion  errors

Erik,

> Here are the ShadowLogs:
> http://www.ligo.caltech.edu/~eespinoz/debug/02-20-2007/
>
> ShadowLog.bz2 is the full log and is ~1g uncompressed
> ShadowLog-02-20-2007.bz2 is the log for yesterday and today. It is ~260
> uncompressed.
>
> Take your pick.

Thanks!

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Wed Feb 21 14:23:15 2007 (1172089395)
Date: Wed, 21 Feb 2007 14:36:33 -0600 (CST)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: wenger <condor-support__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #1885] LIGO: NTsenders.C/NTreceivers.C      
 Assertion errors

Stuart,

A few more questions:

- What version of Condor were you running prior to upgrading to 6.8.4?

- Are you sure this error has only shown up since the upgrade?

- Roughly how often is this happening (i.e., what percentage of jobs)?

- Do the failures seem to be tied to a specific job (i.e., if a job
  fails when it tries to run on one machine, will that same job fail
  again on another machine)?

- Are there any other patterns to the failures that you can think of?

Thanks.

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Wed Feb 21 14:39:12 2007 (1172090352)
Date: Wed, 21 Feb 2007 12:51:40 -0800
From: "Erik A. Espinoza" <espinoza__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu
CC: anderson__AT__ligo.caltech.edu, espinoza_e__AT__ligo.caltech.edu
Subject: Re: [condor-support #1885] LIGO: NTsenders.C/NTreceivers.C     
 Assertion  errors
X-Enigmail-Version: 0.94.2.0
Openpgp: url=http://pgp.mit.edu/

Hello

> - What version of Condor were you running prior to upgrading to 6.8.4?

Condor 6.8.3

> - Are you sure this error has only shown up since the upgrade?

Looks like we've had this error show up on various jobs in the past. I 
am unable to pull up condor_history as we set up a new PostgreSQL server 
and the history only goes back till yesterday.

> - Roughly how often is this happening (i.e., what percentage of jobs)?

Currently this is only happening for one user's jobs. It is hard to say 
right now, but probably a very small percentage.

> - Do the failures seem to be tied to a specific job (i.e., if a job
>   fails when it tries to run on one machine, will that same job fail
>   again on another machine)?

Yes, it does appear that the same jobs keep popping up the same error.

> - Are there any other patterns to the failures that you can think of?

Not that I know of currently.

Thanks,
-- 
Erik A. Espinoza
Systems Administrator
LIGO/Caltech - MS 18-34
Pasadena, CA 91125
Ph: 626-395-8517

===========================================================================
Date mail was appended: Wed Feb 21 14:52:06 2007 (1172091126)
Date: Thu, 22 Feb 2007 09:32:05 -0600 (CST)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #1885] LIGO: NTsenders.C/NTreceivers.C    
 Assertion  errors

> > - Do the failures seem to be tied to a specific job (i.e., if a job
> >   fails when it tries to run on one machine, will that same job fail
> >   again on another machine)?
>
> Yes, it does appear that the same jobs keep popping up the same error.

One more thing (for now, anyhow!) -- can you send the submit file of one
of the jobs that gets the errors?

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Thu Feb 22  9:34:26 2007 (1172158467)
Date: Thu, 22 Feb 2007 14:03:46 -0600 (CST)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #1885] LIGO: NTsenders.C/NTreceivers.C    
 Assertion  errors

> Currently this is only happening for one user's jobs. It is hard to say
> right now, but probably a very small percentage.

Hmm -- do you know whether they've always submitted on the same machine?
Has anyone else submitted on that machine?

Also, can you send the full StarterLog that shows the starter end of one
of these problems?

Thanks...

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Thu Feb 22 14:04:25 2007 (1172174666)
Date: Thu, 22 Feb 2007 15:02:01 -0600 (CST)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #1885] LIGO: NTsenders.C/NTreceivers.C    
 Assertion  errors

One more thing to try -- set SHADOW_DEBUG to D_NETWORK on the machine
from which the failed jobs are being submitted; once you hit the error,
send the ShadowLog.
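
A minimal sketch of the corresponding configuration change follows. Only the SHADOW_DEBUG setting itself comes from this message; placing it in the local config file named in the logs above (/usr1/condor/condor_config.local) is an assumption.

```
# Assumed location: /usr1/condor/condor_config.local on the submit machine.
# Adds network-level debug output to the ShadowLog.
SHADOW_DEBUG = D_NETWORK
```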

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Thu Feb 22 15:04:22 2007 (1172178263)
Date: Sat, 24 Feb 2007 14:28:35 -0800
From: "Erik A. Espinoza" <espinoza__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu
CC: Stuart Anderson <anderson__AT__ligo.caltech.edu>
Subject: Re: [condor-support #1885] LIGO: NTsenders.C/NTreceivers.C     
 Assertion  errors
X-Enigmail-Version: 0.94.2.0
Openpgp: url=http://pgp.mit.edu/

> One more thing (for now, anyhow!) -- can you send the submit file of one
> of the jobs that gets the errors?
The jobs were submitted as part of a dag.

Here is the submit file used by the dag:
executable = tfcoh_dag.sh
output = out.$(jobNumber)
error = err.$(jobNumber)
arguments = $(paramsFile) $(jobsFile) $(jobNumber)
requirements = Memory >= 128 && OpSys == "LINUX"
universe = vanilla
notification = never
environment = HOME=$(home);LD_LIBRARY_PATH=$(ld_library_path)
log = /usr1/vmandic/stochastic_pipe.log
+MaxHours = 12
queue 1

Thanks,
-- 
Erik A. Espinoza
Systems Administrator
LIGO/Caltech - MS 18-34
Pasadena, CA 91125
Ph: 626-395-8517

===========================================================================
Date mail was appended: Sat Feb 24 16:29:08 2007 (1172356149)
Date: Sat, 24 Feb 2007 14:32:03 -0800
From: "Erik A. Espinoza" <espinoza__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu
CC: anderson__AT__ligo.caltech.edu
Subject: Re: [condor-support #1885] LIGO: NTsenders.C/NTreceivers.C     
 Assertion  errors
X-Enigmail-Version: 0.94.2.0
Openpgp: url=http://pgp.mit.edu/

Hello,

> Hmm -- do you know whether they've always submitted on the same machine?
> Has anyone else submitted on that machine?

Yes. We only have two submit machines and one is for DedicatedScheduler. 
The other is used for general use.

> Also, can you send the full StarterLog that shows the starter end of one
> of these problems?

2/24 03:31:36 ******************************************************
2/24 03:31:36 ** condor_starter (CONDOR_STARTER) STARTING UP
2/24 03:31:36 ** /ldcg/stow_pkgs/condor-6.8.4/condor/sbin/condor_starter
2/24 03:31:36 ** $CondorVersion: 6.8.4 Feb  1 2007 $
2/24 03:31:36 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
2/24 03:31:36 ** PID = 712
2/24 03:31:36 ** Log last touched 2/24 03:30:53
2/24 03:31:36 ******************************************************
2/24 03:31:36 Using config source: /usr1/condor/condor_config
2/24 03:31:36 Using local config sources:
2/24 03:31:36    /usr1/condor/condor_config.local
2/24 03:31:36 DaemonCore: Command Socket at <10.14.2.20:46086>
2/24 03:31:36 Done setting resource limits
2/24 03:33:27 Communicating with shadow <10.14.0.12:33616>
2/24 03:33:27 Submitting machine is "ldas-grid.ligo.caltech.edu"
2/24 03:33:27 Starting a VANILLA universe job with ID: 10568087.0
2/24 03:33:27 IWD: /archive/home/vmandic/detchar
2/24 03:33:27 Output file: /archive/home/vmandic/detchar/out.23577
2/24 03:33:27 Error file: /archive/home/vmandic/detchar/err.23577
2/24 03:33:27 Renice expr "0" evaluated to 0
2/24 03:33:27 About to exec /archive/home/vmandic/detchar/tfcoh_dag.sh /archive/home/vmandic/detchar/paramfiles/S5H1L1_strain_0p001Hz_200701_params.txt /archive/home/vmandic/sgwb/S5/input/jobfiles/S5H1L1_200701_jobs.txt 23577
2/24 03:33:27 Create_Process succeeded, pid=901
2/24 03:33:28 Process exited, pid=901, status=0
2/24 03:33:28 Got SIGQUIT.  Performing fast shutdown.
2/24 03:33:28 ShutdownFast all jobs.

Thanks,
-- 
Erik A. Espinoza
Systems Administrator
LIGO/Caltech - MS 18-34
Pasadena, CA 91125
Ph: 626-395-8517

===========================================================================
Date mail was appended: Sat Feb 24 16:32:33 2007 (1172356354)
Date: Sat, 24 Feb 2007 14:42:14 -0800
From: "Erik A. Espinoza" <espinoza__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu
CC: anderson__AT__ligo.caltech.edu, espinoza_e__AT__ligo.caltech.edu
Subject: Re: [condor-support #1885] LIGO: NTsenders.C/NTreceivers.C     
 Assertion  errors
X-Enigmail-Version: 0.94.2.0
Openpgp: url=http://pgp.mit.edu/

Hey Kent,

When I make the SHADOW_DEBUG change do I have to use condor_reconfigure 
or something?? I assumed shadows would take the new config immediately. 
I basically put the jobs on hold, modified the SHADOW_DEBUG and released 
the jobs a few minutes later. I'm not entirely sure this worked.

Thanks,
Erik

2/24 13:33:14 (10574154.0) (18617): condor_write(): Socket closed when trying to write 3705 bytes to <10.14.1.142:39868>, fd is 7
2/24 13:33:14 (10574154.0) (18617): Buf::write(): condor_write() failed
2/24 13:33:14 (10574154.0) (18617): ERROR "Assertion ERROR on (result)" at line 236 in file NTreceivers.C
2/24 13:33:38 ******************************************************
2/24 13:33:38 ** condor_shadow (CONDOR_SHADOW) STARTING UP
2/24 13:33:38 ** /ldcg/stow_pkgs/condor-6.8.4/condor/sbin/condor_shadow
2/24 13:33:38 ** $CondorVersion: 6.8.4 Feb  1 2007 $
2/24 13:33:38 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
2/24 13:33:38 ** PID = 18747
2/24 13:33:38 ** Log last touched 2/24 13:33:38
2/24 13:33:38 ******************************************************
2/24 13:33:38 Using config source: /usr1/condor/condor_config
2/24 13:33:38 Using local config sources:
2/24 13:33:38    /usr1/condor/condor_config.local
2/24 13:33:38 DaemonCore: Command Socket at <10.14.0.12:59942>
2/24 13:33:38 Initializing a VANILLA shadow for job 10574154.0
2/24 13:33:38 (10574154.0) (18747): CONNECT src=<10.14.0.12:57099> fd=7 dst=<10.14.2.52:40926>
2/24 13:33:38 (10574154.0) (18747): Request to run on <10.14.2.52:40926> was ACCEPTED


-- 
Erik A. Espinoza
Systems Administrator
LIGO/Caltech - MS 18-34
Pasadena, CA 91125
Ph: 626-395-8517

===========================================================================
Date mail was appended: Sat Feb 24 16:42:52 2007 (1172356972)
Date: Mon, 26 Feb 2007 10:24:30 -0600 (CST)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #1885] LIGO: NTsenders.C/NTreceivers.C    
 Assertion  errors

Erik,

> When I make the SHADOW_DEBUG change do I have to use condor_reconfigure
> or something?? I assumed shadows would take the new config immediately.
> I basically put the jobs on hold, modified the SHADOW_DEBUG and released
> the jobs a few minutes later. I'm not entirely sure this worked.

Ah, yes, I guess I should have mentioned that.

Take a look at the condor_reconfig man page:

    http://www.cs.wisc.edu/condor/manual/v6.8/condor_reconfig.html

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Mon Feb 26 10:28:22 2007 (1172507303)
Date: Mon, 26 Feb 2007 08:40:39 -0800
From: "Erik A. Espinoza" <espinoza__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu
CC: anderson__AT__ligo.caltech.edu, espinoza_e__AT__ligo.caltech.edu
Subject: Re: [condor-support #1885] LIGO: NTsenders.C/NTreceivers.C     
 Assertion  errors
X-Enigmail-Version: 0.94.2.0
Openpgp: url=http://pgp.mit.edu/

Correct, I know about condor_reconfig. I was asking what service I am 
supposed to reconfig. Or should I just do a condor_reconfig -all?


-- 
Erik A. Espinoza
Systems Administrator
LIGO/Caltech - MS 18-34
Pasadena, CA 91125
Ph: 626-395-8517

===========================================================================
Date mail was appended: Mon Feb 26 10:41:10 2007 (1172508070)
Date: Mon, 26 Feb 2007 11:52:00 -0600 (CST)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
Subject: Re: [condor-support #1885] LIGO: NTsenders.C/NTreceivers.C    
 Assertion  errors

Erik,

> Correct, I know about condor_reconfig. I was asking what service I am
> supposed to reconfig. Or should I just do a condor_reconfig -all?

No, you should be able to do

    condor_reconfig -subsystem startd

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Mon Feb 26 11:52:04 2007 (1172512325)
Date: Tue, 27 Feb 2007 14:23:20 -0600 (CST)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #1885] LIGO: NTsenders.C/NTreceivers.C    
 Assertion  errors

Stuart and Erik,

Did you have a chance yet to get a ShadowLog with SHADOW_DEBUG set to
D_NETWORK, which shows the error?

I think that will be our most valuable debugging info at this point.

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Tue Feb 27 14:26:11 2007 (1172607971)
Date: Tue, 27 Feb 2007 12:44:05 -0800
From: "Erik A. Espinoza" <espinoza__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu
CC: Stuart Anderson <anderson__AT__ligo.caltech.edu>
Subject: Re: [condor-support #1885] LIGO: NTsenders.C/NTreceivers.C     
 Assertion  errors
X-Enigmail-Version: 0.94.2.0
Openpgp: url=http://pgp.mit.edu/

The user has removed the jobs from our queue. Unless we can reproduce 
the problem with other jobs, we will not be able to do further testing 
for now.

Thanks,
Erik

-- 
Erik A. Espinoza
Systems Administrator
LIGO/Caltech - MS 18-34
Pasadena, CA 91125
Ph: 626-395-8517

===========================================================================
Date mail was appended: Tue Feb 27 14:44:50 2007 (1172609091)
Date: Tue, 27 Feb 2007 14:46:35 -0600 (CST)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #1885] LIGO: NTsenders.C/NTreceivers.C    
 Assertion  errors

Erik,

> The user has removed the jobs from our queue. Unless we can reproduce
> the problem with other jobs, we will not be able to do further testing
> for now.

Okay, we'll see what we can figure out on our end with the info we have.

Please let us know if the problem happens again, though.

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Tue Feb 27 14:47:54 2007 (1172609274)
Subject: Comments added

LIGO has said the problem is so infrequent, we can drop the ticket.

Comments added by tannenba

===========================================================================
Date comments were added: Mon Mar 12 16:53:05 2007 (1173736386)
Subject: Actions

Ticket resolved by wenger
===========================================================================
Date of actions: Mon Mar 12 17:21:14 2007 (1173738074)