LIGO Support Ticket 1885
Ticket Information
Number: support 1885
User: anderson@ligo.caltech.edu
Email: espinoza_e__AT__ligo.caltech.edu,espinoza__AT__ligo.caltech.edu
Status: resolved
Assigned To: wenger
Date: Tue, 20 Feb 2007 20:28:04 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu
CC: Erik Espinoza <espinoza_e__AT__ligo.caltech.edu>
Subject: LIGO: NTsenders.C/NTreceivers.C Assertion errors
The LIGO CIT Condor pool was upgraded to 6.8.4 today and while things
are mostly working there are some jobs that are stuck generating repeated
shadow/starter assertion failures. Any help in figuring out what is
wrong would be appreciated. For example, consider the job 10514982.0,
$ condor_q -run 10514982.0
-- Quill: citquill@ligo : <10.14.0.25:5432> : citquill_db
ID OWNER SUBMITTED RUN_TIME HOST(S)
10514982.0 vmandic 2/20 18:41 0+01:21:07 vm1@node107.ldas-cit.ligo.caltech.edu
However, it is not actually running on node107 due to the following errors:
The ShadowLog has entries:
$ tac /usr1/condor/log/ShadowLog | grep 10514982.0 | head
2/20 19:43:46 (10514982.0) (26689): Request to run on <10.14.1.107:35873> was ACCEPTED
2/20 19:43:46 Initializing a VANILLA shadow for job 10514982.0
2/20 19:43:32 (10514982.0) (26305): ERROR "Assertion ERROR on (result)" at line 236 in file NTreceivers.C
2/20 19:43:32 (10514982.0) (26305): Buf::write(): condor_write() failed
2/20 19:43:32 (10514982.0) (26305): condor_write(): Socket closed when trying to write 3698 bytes to <10.14.1.107:35873>, fd is 7
2/20 19:34:08 (10514982.0) (26305): Request to run on <10.14.1.107:35873> was ACCEPTED
2/20 19:34:08 Initializing a VANILLA shadow for job 10514982.0
2/20 19:34:01 (10514982.0) (26186): ERROR "Assertion ERROR on (result)" at line 236 in file NTreceivers.C
2/20 19:34:01 (10514982.0) (26186): Buf::write(): condor_write() failed
2/20 19:34:01 (10514982.0) (26186): condor_write(): Socket closed when trying to write 3698 bytes to <10.14.1.107:35873>, fd is 7
and a short snippet of the log shows:
2/20 19:34:08 ******************************************************
2/20 19:34:08 ** condor_shadow (CONDOR_SHADOW) STARTING UP
2/20 19:34:08 ** /ldcg/stow_pkgs/condor-6.8.4/condor/sbin/condor_shadow
2/20 19:34:08 ** $CondorVersion: 6.8.4 Feb 1 2007 $
2/20 19:34:08 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
2/20 19:34:08 ** PID = 26305
2/20 19:34:08 ** Log last touched 2/20 19:34:06
2/20 19:34:08 ******************************************************
2/20 19:34:08 Using config source: /usr1/condor/condor_config
2/20 19:34:08 Using local config sources:
2/20 19:34:08 /usr1/condor/condor_config.local
2/20 19:34:08 DaemonCore: Command Socket at <10.14.0.12:44397>
2/20 19:34:08 Initializing a VANILLA shadow for job 10514982.0
2/20 19:34:08 (10514982.0) (26305): Request to run on <10.14.1.107:35873> was ACCEPTED
...
2/20 19:43:24 (10514998.0) (25201): Job 10514998.0 terminated: exited with status 0
2/20 19:43:24 (10514998.0) (25201): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 100
2/20 19:43:32 (10514982.0) (26305): condor_write(): Socket closed when trying to write 3698 bytes to <10.14.1.107:35873>, fd is 7
2/20 19:43:32 (10514982.0) (26305): Buf::write(): condor_write() failed
2/20 19:43:32 (10514982.0) (26305): ERROR "Assertion ERROR on (result)" at line 236 in file NTreceivers.C
2/20 19:43:32 (10514992.0) (26307): condor_write(): Socket closed when trying to write 3698 bytes to <10.14.1.112:49238>, fd is 7
2/20 19:43:32 (10514992.0) (26307): Buf::write(): condor_write() failed
2/20 19:43:32 (10514992.0) (26307): ERROR "Assertion ERROR on (result)" at line 236 in file NTreceivers.C
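[Editorial sketch, not part of the original message.] To see which jobs are affected across a large ShadowLog, the per-job lines can be grouped programmatically. The following is a minimal Python sketch that assumes the line layout seen in the excerpts above (date, time, job ID in parentheses, shadow PID in parentheses, then the message). The function name `assertion_failures` and the regular expression are illustrative and are not part of Condor.

```python
import re
from collections import Counter

# Layout assumed from the excerpts in this ticket:
#   "2/20 19:43:32 (10514982.0) (26305): message"
LINE_RE = re.compile(
    r"^(?P<date>\d+/\d+) (?P<time>\d+:\d+:\d+) "
    r"\((?P<job>\d+\.\d+)\) \((?P<pid>\d+)\): (?P<msg>.*)$"
)

def assertion_failures(lines):
    """Count 'Assertion ERROR' lines per job ID (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and "Assertion ERROR" in m.group("msg"):
            counts[m.group("job")] += 1
    return counts

# Sample lines copied from the ShadowLog excerpt in this ticket.
sample = [
    "2/20 19:43:24 (10514998.0) (25201): Job 10514998.0 terminated: exited with status 0",
    "2/20 19:43:32 (10514982.0) (26305): Buf::write(): condor_write() failed",
    '2/20 19:43:32 (10514982.0) (26305): ERROR "Assertion ERROR on (result)" at line 236 in file NTreceivers.C',
    "2/20 19:43:32 (10514992.0) (26307): Buf::write(): condor_write() failed",
    '2/20 19:43:32 (10514992.0) (26307): ERROR "Assertion ERROR on (result)" at line 236 in file NTreceivers.C',
]

for job, n in sorted(assertion_failures(sample).items()):
    print(job, n)
```

On a real log, the same function could be fed `open("/usr1/condor/log/ShadowLog")` to list every job that hit the assertion, not only the one queried by hand.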
And the StarterLogs on node107 (ip=10.14.1.107) have matching errors:
[root@node107 log]# grep "2/20 " StarterLog.vm? | grep Assertion
StarterLog.vm1:2/20 19:20:11 ERROR "Assertion ERROR on (result)" at line 148 in file NTsenders.C
StarterLog.vm1:2/20 19:29:20 ERROR "Assertion ERROR on (result)" at line 148 in file NTsenders.C
StarterLog.vm1:2/20 19:39:08 ERROR "Assertion ERROR on (result)" at line 148 in file NTsenders.C
StarterLog.vm1:2/20 19:48:46 ERROR "Assertion ERROR on (result)" at line 148 in file NTsenders.C
StarterLog.vm2:2/20 19:30:00 ERROR "Assertion ERROR on (result)" at line 148 in file NTsenders.C
StarterLog.vm2:2/20 19:39:14 ERROR "Assertion ERROR on (result)" at line 148 in file NTsenders.C
StarterLog.vm2:2/20 19:48:51 ERROR "Assertion ERROR on (result)" at line 148 in file NTsenders.C
The full entry for one of the StarterLog jobs is:
2/20 19:43:46 ******************************************************
2/20 19:43:46 ** condor_starter (CONDOR_STARTER) STARTING UP
2/20 19:43:46 ** /ldcg/stow_pkgs/condor-6.8.4/condor/sbin/condor_starter
2/20 19:43:46 ** $CondorVersion: 6.8.4 Feb 1 2007 $
2/20 19:43:46 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
2/20 19:43:46 ** PID = 11043
2/20 19:43:46 ** Log last touched 2/20 19:39:08
2/20 19:43:46 ******************************************************
2/20 19:43:46 Using config source: /usr1/condor/condor_config
2/20 19:43:46 Using local config sources:
2/20 19:43:46 /usr1/condor/condor_config.local
2/20 19:43:46 DaemonCore: Command Socket at <10.14.1.107:34066>
2/20 19:43:46 Done setting resource limits
2/20 19:48:46 condor_read(): timeout reading 5 bytes from <10.14.0.12:49696>.
2/20 19:48:46 ERROR "Assertion ERROR on (result)" at line 148 in file NTsenders.C
2/20 19:48:46 ERROR "LocalUserLog::logStarterError() called before init()" at line 205 in file local_user_log.C
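[Editorial sketch, not part of the original message.] The starter entry above starts at 19:43:46 and reports the condor_read() timeout at 19:48:46. The short Python sketch below works out that interval. The year 2007 is taken from the ticket date because the log stamps omit it, and `parse_stamp` is an illustrative helper name. Whether the interval corresponds to a particular configured timeout is not stated anywhere in this thread.

```python
from datetime import datetime

def parse_stamp(stamp, year=2007):
    """Parse a 'M/D HH:MM:SS' log stamp; the year is an assumption (ticket date)."""
    return datetime.strptime(f"{year}/{stamp}", "%Y/%m/%d %H:%M:%S")

# Stamps copied from the StarterLog entry above.
started = parse_stamp("2/20 19:43:46")    # condor_starter STARTING UP
timed_out = parse_stamp("2/20 19:48:46")  # condor_read(): timeout reading 5 bytes

elapsed = (timed_out - started).total_seconds()
print(elapsed)  # 300.0
```

So the starter waited exactly five minutes for data from the shadow at 10.14.0.12 before the assertion fired.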
The user log file is showing,
...
007 (10514982.000.000) 02/20 19:43:32 Shadow exception!
Assertion ERROR on (result)
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
...
Thanks.
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date of creation: Tue Feb 20 22:28:23 2007 (1172032105)
Subject: Actions
Assigned to wenger by wenger
===========================================================================
Date of actions: Wed Feb 21 10:59:11 2007 (1172077151)
Date: Wed, 21 Feb 2007 11:08:16 -0600 (CST)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: wenger <condor-support__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #1885] LIGO: NTsenders.C/NTreceivers.C
Assertion errors
Stuart,
> The LIGO CIT Condor pool was upgraded to 6.8.4 today and while things
> are mostly working there are some jobs that are stuck generating repeated
> shadow/starter assertion failures. Any help in figuring out what is
> wrong would be appreciated. For example, consider the job 10514982.0,
We'll look into this and get back to you ASAP.
I'm not familiar enough with the Windows code to have any smart ideas
offhand...
Kent Wenger
Condor Team
===========================================================================
Date mail was appended: Wed Feb 21 11:09:33 2007 (1172077776)
Date: Wed, 21 Feb 2007 09:54:23 -0800
From: "Erik A. Espinoza" <espinoza__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu
CC: anderson__AT__ligo.caltech.edu, espinoza_e__AT__ligo.caltech.edu
Subject: Re: [condor-support #1885] LIGO: NTsenders.C/NTreceivers.C
Assertion errors
X-Enigmail-Version: 0.94.2.0
Openpgp: url=http://pgp.mit.edu/
Hello Kent,
This is very weird, as we are NOT using any Windows machines in our pool
at all. All of our machines are using FC4/x86_64 w/ the 64-bit condor.
Thanks,
Erik
condor-support response tracking system wrote:
> Stuart,
>
>> The LIGO CIT Condor pool was upgraded to 6.8.4 today and while things
>> are mostly working there are some jobs that are stuck generating repeated
>> shadow/starter assertion failures. Any help in figuring out what is
>> wrong would be appreciated. For example, consider the job 10514982.0,
>
> We'll look into this and get back to you ASAP.
>
> I'm not familiar enough with the Windows code to have any smart ideas
> offhand...
>
> Kent Wenger
> Condor Team
--
Erik A. Espinoza
Systems Administrator
LIGO/Caltech - MS 18-34
Pasadena, CA 91125
Ph: 626-395-8517
===========================================================================
Date mail was appended: Wed Feb 21 11:54:55 2007 (1172080495)
Date: Wed, 21 Feb 2007 12:03:08 -0600 (CST)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #1885] LIGO: NTsenders.C/NTreceivers.C
Assertion errors
Erik,
> This is very weird, as we are NOT using any Windows machines in our pool
> at all. All of our machines are using FC4/x86_64 w/ the 64-bit condor.
Hmm, maybe that's a clue in and of itself.
I have to confirm that it really *is* Windows code, I guess. I assumed
from the filename (NTreceivers.C) that it is, but maybe NT stands for
something different in that case.
Kent Wenger
Condor Team
===========================================================================
Date mail was appended: Wed Feb 21 12:04:17 2007 (1172081058)
Date: Wed, 21 Feb 2007 12:25:08 -0600 (CST)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #1885] LIGO: NTsenders.C/NTreceivers.C
Assertion errors
Erik,
> This is very weird, as we are NOT using any Windows machines in our pool
> at all. All of our machines are using FC4/x86_64 w/ the 64-bit condor.
Okay, I have a little more info here -- the NTreceivers.C file name is
left over from a time when that code was Windows-only, but now it's not.
So the fact that you're getting an error message from that file isn't a
problem in and of itself.
Kent Wenger
Condor Team
===========================================================================
Date mail was appended: Wed Feb 21 12:29:13 2007 (1172082553)
Date: Wed, 21 Feb 2007 13:19:13 -0600 (CST)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: wenger <condor-support__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #1885] LIGO: NTsenders.C/NTreceivers.C
Assertion errors
Stuart,
Can you please send the complete ShadowLog?
Thanks.
Kent Wenger
Condor Team
===========================================================================
Date mail was appended: Wed Feb 21 13:24:12 2007 (1172085853)
Date: Wed, 21 Feb 2007 11:54:19 -0800
From: "Erik A. Espinoza" <espinoza__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu
CC: anderson__AT__ligo.caltech.edu, espinoza_e__AT__ligo.caltech.edu
Subject: Re: [condor-support #1885] LIGO: NTsenders.C/NTreceivers.C
Assertion errors
Hello Kent,
Here are the ShadowLogs:
http://www.ligo.caltech.edu/~eespinoz/debug/02-20-2007/
ShadowLog.bz2 is the full log and is ~1g uncompressed
ShadowLog-02-20-2007.bz2 is the log for yesterday and today. It is ~260
uncompressed.
Take your pick.
Thanks,
Erik
condor-support response tracking system wrote:
> Stuart,
>
> Can you please send the complete ShadowLog?
>
> Thanks.
>
> Kent Wenger
> Condor Team
--
Erik A. Espinoza
Systems Administrator
LIGO/Caltech - MS 18-34
Pasadena, CA 91125
Ph: 626-395-8517
===========================================================================
Date mail was appended: Wed Feb 21 13:54:45 2007 (1172087686)
Date: Wed, 21 Feb 2007 14:18:46 -0600 (CST)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #1885] LIGO: NTsenders.C/NTreceivers.C
Assertion errors
Erik,
> Here are the ShadowLogs:
> http://www.ligo.caltech.edu/~eespinoz/debug/02-20-2007/
>
> ShadowLog.bz2 is the full log and is ~1g uncompressed
> ShadowLog-02-20-2007.bz2 is the log for yesterday and today. It is ~260
> uncompressed.
>
> Take your pick.
Thanks!
Kent Wenger
Condor Team
===========================================================================
Date mail was appended: Wed Feb 21 14:23:15 2007 (1172089395)
Date: Wed, 21 Feb 2007 14:36:33 -0600 (CST)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: wenger <condor-support__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #1885] LIGO: NTsenders.C/NTreceivers.C
Assertion errors
Stuart,
A few more questions:
- What version of Condor were you running prior to upgrading to 6.8.4?
- Are you sure this error has only shown up since the upgrade?
- Roughly how often is this happening (i.e., what percentage of jobs)?
- Do the failures seem to be tied to a specific job (i.e., if a job
fails when it tries to run on one machine, will that same job fail
again on another machine)?
- Are there any other patterns to the failures that you can think of?
Thanks.
Kent Wenger
Condor Team
===========================================================================
Date mail was appended: Wed Feb 21 14:39:12 2007 (1172090352)
Date: Wed, 21 Feb 2007 12:51:40 -0800
From: "Erik A. Espinoza" <espinoza__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu
CC: anderson__AT__ligo.caltech.edu, espinoza_e__AT__ligo.caltech.edu
Subject: Re: [condor-support #1885] LIGO: NTsenders.C/NTreceivers.C
Assertion errors
Hello
> - What version of Condor were you running prior to upgrading to 6.8.4?
Condor 6.8.3
> - Are you sure this error has only shown up since the upgrade?
Looks like we've had this error show up on various jobs in the past. I
am unable to pull up condor_history as we set up a new PostgreSQL server
and the history only goes back till yesterday.
> - Roughly how often is this happening (i.e., what percentage of jobs)?
Currently this is only happening for one user's jobs. It is hard to say
right now, but probably a very small percentage.
> - Do the failures seem to be tied to a specific job (i.e., if a job
> fails when it tries to run on one machine, will that same job fail
> again on another machine)?
Yes, it does appear that the same jobs keep popping up the same error.
> - Are there any other patterns to the failures that you can think of?
Not that I know of currently.
Thanks,
--
Erik A. Espinoza
Systems Administrator
LIGO/Caltech - MS 18-34
Pasadena, CA 91125
Ph: 626-395-8517
===========================================================================
Date mail was appended: Wed Feb 21 14:52:06 2007 (1172091126)
Date: Thu, 22 Feb 2007 09:32:05 -0600 (CST)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #1885] LIGO: NTsenders.C/NTreceivers.C
Assertion errors
> > - Do the failures seem to be tied to a specific job (i.e., if a job
> > fails when it tries to run on one machine, will that same job fail
> > again on another machine)?
>
> Yes, it does appear that the same jobs keep popping up the same error.
One more thing (for now, anyhow!) -- can you send the submit file of one
of the jobs that gets the errors?
Kent Wenger
Condor Team
===========================================================================
Date mail was appended: Thu Feb 22 9:34:26 2007 (1172158467)
Date: Thu, 22 Feb 2007 14:03:46 -0600 (CST)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #1885] LIGO: NTsenders.C/NTreceivers.C
Assertion errors
> Currently this is only happening for one user's jobs. It is hard to say
> right now, but probably a very small percentage.
Hmm -- do you know whether they've always submitted on the same machine?
Has anyone else submitted on that machine?
Also, can you send the full StarterLog that shows the starter end of one
of these problems?
Thanks...
Kent Wenger
Condor Team
===========================================================================
Date mail was appended: Thu Feb 22 14:04:25 2007 (1172174666)
Date: Thu, 22 Feb 2007 15:02:01 -0600 (CST)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #1885] LIGO: NTsenders.C/NTreceivers.C
Assertion errors
One more thing to try -- set SHADOW_DEBUG to D_NETWORK on the machine
from which the failed jobs are being submitted; once you hit the error,
send the ShadowLog.
Kent Wenger
Condor Team
===========================================================================
Date mail was appended: Thu Feb 22 15:04:22 2007 (1172178263)
Date: Sat, 24 Feb 2007 14:28:35 -0800
From: "Erik A. Espinoza" <espinoza__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu
CC: Stuart Anderson <anderson__AT__ligo.caltech.edu>
Subject: Re: [condor-support #1885] LIGO: NTsenders.C/NTreceivers.C
Assertion errors
> One more thing (for now, anyhow!) -- can you send the submit file of one
> of the jobs that gets the errors?
The jobs were submitted as part of a dag.
Here is the submit file used by the dag:
executable = tfcoh_dag.sh
output = out.$(jobNumber)
error = err.$(jobNumber)
arguments = $(paramsFile) $(jobsFile) $(jobNumber)
requirements = Memory >= 128 && OpSys == "LINUX"
universe = vanilla
notification = never
environment = HOME=$(home);LD_LIBRARY_PATH=$(ld_library_path)
log = /usr1/vmandic/stochastic_pipe.log
+MaxHours = 12
queue 1
Thanks,
--
Erik A. Espinoza
Systems Administrator
LIGO/Caltech - MS 18-34
Pasadena, CA 91125
Ph: 626-395-8517
===========================================================================
Date mail was appended: Sat Feb 24 16:29:08 2007 (1172356149)
Date: Sat, 24 Feb 2007 14:32:03 -0800
From: "Erik A. Espinoza" <espinoza__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu
CC: anderson__AT__ligo.caltech.edu
Subject: Re: [condor-support #1885] LIGO: NTsenders.C/NTreceivers.C
Assertion errors
Hello,
> Hmm -- do you know whether they've always submitted on the same machine?
> Has anyone else submitted on that machine?
Yes. We only have two submit machines and one is for DedicatedScheduler.
The other is used for general use.
> Also, can you send the full StarterLog that shows the starter end of one
> of these problems?
2/24 03:31:36 ******************************************************
2/24 03:31:36 ** condor_starter (CONDOR_STARTER) STARTING UP
2/24 03:31:36 ** /ldcg/stow_pkgs/condor-6.8.4/condor/sbin/condor_starter
2/24 03:31:36 ** $CondorVersion: 6.8.4 Feb 1 2007 $
2/24 03:31:36 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
2/24 03:31:36 ** PID = 712
2/24 03:31:36 ** Log last touched 2/24 03:30:53
2/24 03:31:36 ******************************************************
2/24 03:31:36 Using config source: /usr1/condor/condor_config
2/24 03:31:36 Using local config sources:
2/24 03:31:36 /usr1/condor/condor_config.local
2/24 03:31:36 DaemonCore: Command Socket at <10.14.2.20:46086>
2/24 03:31:36 Done setting resource limits
2/24 03:33:27 Communicating with shadow <10.14.0.12:33616>
2/24 03:33:27 Submitting machine is "ldas-grid.ligo.caltech.edu"
2/24 03:33:27 Starting a VANILLA universe job with ID: 10568087.0
2/24 03:33:27 IWD: /archive/home/vmandic/detchar
2/24 03:33:27 Output file: /archive/home/vmandic/detchar/out.23577
2/24 03:33:27 Error file: /archive/home/vmandic/detchar/err.23577
2/24 03:33:27 Renice expr "0" evaluated to 0
2/24 03:33:27 About to exec /archive/home/vmandic/detchar/tfcoh_dag.sh /archive/home/vmandic/detchar/paramfiles/S5H1L1_strain_0p001Hz_200701_params.txt /archive/home/vmandic/sgwb/S5/input/jobfiles/S5H1L1_200701_jobs.txt 23577
2/24 03:33:27 Create_Process succeeded, pid=901
2/24 03:33:28 Process exited, pid=901, status=0
2/24 03:33:28 Got SIGQUIT. Performing fast shutdown.
2/24 03:33:28 ShutdownFast all jobs.
Thanks,
--
Erik A. Espinoza
Systems Administrator
LIGO/Caltech - MS 18-34
Pasadena, CA 91125
Ph: 626-395-8517
===========================================================================
Date mail was appended: Sat Feb 24 16:32:33 2007 (1172356354)
Date: Sat, 24 Feb 2007 14:42:14 -0800
From: "Erik A. Espinoza" <espinoza__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu
CC: anderson__AT__ligo.caltech.edu, espinoza_e__AT__ligo.caltech.edu
Subject: Re: [condor-support #1885] LIGO: NTsenders.C/NTreceivers.C
Assertion errors
Hey Kent,
When I make the SHADOW_DEBUG change do I have to use condor_reconfigure
or something? I assumed shadows would take the new config immediately.
I basically put the jobs on hold, modified the SHADOW_DEBUG and released
the jobs a few minutes later. I'm not entirely sure this worked.
Thanks,
Erik
2/24 13:33:14 (10574154.0) (18617): condor_write(): Socket closed when trying to write 3705 bytes to <10.14.1.142:39868>, fd is 7
2/24 13:33:14 (10574154.0) (18617): Buf::write(): condor_write() failed
2/24 13:33:14 (10574154.0) (18617): ERROR "Assertion ERROR on (result)" at line 236 in file NTreceivers.C
2/24 13:33:38 ******************************************************
2/24 13:33:38 ** condor_shadow (CONDOR_SHADOW) STARTING UP
2/24 13:33:38 ** /ldcg/stow_pkgs/condor-6.8.4/condor/sbin/condor_shadow
2/24 13:33:38 ** $CondorVersion: 6.8.4 Feb 1 2007 $
2/24 13:33:38 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
2/24 13:33:38 ** PID = 18747
2/24 13:33:38 ** Log last touched 2/24 13:33:38
2/24 13:33:38 ******************************************************
2/24 13:33:38 Using config source: /usr1/condor/condor_config
2/24 13:33:38 Using local config sources:
2/24 13:33:38 /usr1/condor/condor_config.local
2/24 13:33:38 DaemonCore: Command Socket at <10.14.0.12:59942>
2/24 13:33:38 Initializing a VANILLA shadow for job 10574154.0
2/24 13:33:38 (10574154.0) (18747): CONNECT src=<10.14.0.12:57099> fd=7 dst=<10.14.2.52:40926>
2/24 13:33:38 (10574154.0) (18747): Request to run on <10.14.2.52:40926> was ACCEPTED
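[Editorial sketch, not part of the original message.] One way to check whether the SHADOW_DEBUG change took effect is to look for socket-level lines such as the CONNECT entry above, which does not appear in the earlier ShadowLog excerpts in this ticket. That such lines come from D_NETWORK is an assumption here, not something the thread confirms. The Python sketch below extracts CONNECT records from log text. `connect_events` and the regular expression are illustrative names, and the pattern tolerates the line wrap introduced by the mail client.

```python
import re

# Matches "CONNECT src=<ip:port> fd=N dst=<ip:port>", allowing the
# whitespace or line break that a mail client may insert before "dst=".
CONNECT_RE = re.compile(
    r"CONNECT src=<(?P<src>[\d.]+:\d+)> fd=(?P<fd>\d+)\s+dst=<(?P<dst>[\d.]+:\d+)>"
)

def connect_events(text):
    """Return one dict per CONNECT record found in the log text (hypothetical helper)."""
    return [m.groupdict() for m in CONNECT_RE.finditer(text)]

# Text copied from the excerpt above, including the mail-client line wrap.
log_text = (
    "2/24 13:33:38 Initializing a VANILLA shadow for job 10574154.0\n"
    "2/24 13:33:38 (10574154.0) (18747): CONNECT src=<10.14.0.12:57099> fd=7\n"
    "dst=<10.14.2.52:40926>\n"
    "2/24 13:33:38 (10574154.0) (18747): Request to run on <10.14.2.52:40926>\n"
    "was ACCEPTED\n"
)

events = connect_events(log_text)
for ev in events:
    print(ev["src"], "->", ev["dst"], "fd", ev["fd"])
```

If no such records appear for shadows started after the change, the new debug level has probably not been picked up.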
condor-support response tracking system wrote:
> One more thing to try -- set SHADOW_DEBUG to D_NETWORK on the machine
> from which the failed jobs are being submitted; once you hit the error,
> send the ShadowLog.
>
> Kent Wenger
> Condor Team
--
Erik A. Espinoza
Systems Administrator
LIGO/Caltech - MS 18-34
Pasadena, CA 91125
Ph: 626-395-8517
===========================================================================
Date mail was appended: Sat Feb 24 16:42:52 2007 (1172356972)
Date: Mon, 26 Feb 2007 10:24:30 -0600 (CST)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #1885] LIGO: NTsenders.C/NTreceivers.C
Assertion errors
Erik,
> When I make the SHADOW_DEBUG change do I have to use condor_reconfigure
> or something? I assumed shadows would take the new config immediately.
> I basically put the jobs on hold, modified the SHADOW_DEBUG and released
> the jobs a few minutes later. I'm not entirely sure this worked.
Ah, yes, I guess I should have mentioned that.
Take a look at the condor_reconfig man page:
http://www.cs.wisc.edu/condor/manual/v6.8/condor_reconfig.html
Kent Wenger
Condor Team
===========================================================================
Date mail was appended: Mon Feb 26 10:28:22 2007 (1172507303)
Date: Mon, 26 Feb 2007 08:40:39 -0800
From: "Erik A. Espinoza" <espinoza__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu
CC: anderson__AT__ligo.caltech.edu, espinoza_e__AT__ligo.caltech.edu
Subject: Re: [condor-support #1885] LIGO: NTsenders.C/NTreceivers.C
Assertion errors
Correct, I know about condor_reconfig. I was asking what service I am
supposed to reconfig. Or should I just do a condor_reconfig -all?
condor-support response tracking system wrote:
> Erik,
>
>> When I make the SHADOW_DEBUG change do I have to use condor_reconfigure
>> or something? I assumed shadows would take the new config immediately.
>> I basically put the jobs on hold, modified the SHADOW_DEBUG and released
>> the jobs a few minutes later. I'm not entirely sure this worked.
>
> Ah, yes, I guess I should have mentioned that.
>
> Take a look at the condor_reconfig man page:
>
> http://www.cs.wisc.edu/condor/manual/v6.8/condor_reconfig.html
>
> Kent Wenger
> Condor Team
--
Erik A. Espinoza
Systems Administrator
LIGO/Caltech - MS 18-34
Pasadena, CA 91125
Ph: 626-395-8517
===========================================================================
Date mail was appended: Mon Feb 26 10:41:10 2007 (1172508070)
Date: Mon, 26 Feb 2007 11:52:00 -0600 (CST)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
Subject: Re: [condor-support #1885] LIGO: NTsenders.C/NTreceivers.C
Assertion errors
Erik,
> Correct, I know about condor_reconfig. I was asking what service I am
> supposed to reconfig. Or should I just do a condor_reconfig -all?
No, you should be able to do
condor_reconfig -subsystem startd
Kent Wenger
Condor Team
===========================================================================
Date mail was appended: Mon Feb 26 11:52:04 2007 (1172512325)
Date: Tue, 27 Feb 2007 14:23:20 -0600 (CST)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #1885] LIGO: NTsenders.C/NTreceivers.C
Assertion errors
Stuart and Erik,
Did you have a chance yet to get a ShadowLog with SHADOW_DEBUG set to
D_NETWORK, which shows the error?
I think that will be our most valuable debugging info at this point.
Kent Wenger
Condor Team
===========================================================================
Date mail was appended: Tue Feb 27 14:26:11 2007 (1172607971)
Date: Tue, 27 Feb 2007 12:44:05 -0800
From: "Erik A. Espinoza" <espinoza__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu
CC: Stuart Anderson <anderson__AT__ligo.caltech.edu>
Subject: Re: [condor-support #1885] LIGO: NTsenders.C/NTreceivers.C
Assertion errors
The user has removed the jobs from our queue. Unless we can reproduce
the problem with other jobs, we will not be able to do further testing
for now.
Thanks,
Erik
condor-support response tracking system wrote:
> Stuart and Erik,
>
> Did you have a chance yet to get a ShadowLog with SHADOW_DEBUG set to
> D_NETWORK, which shows the error?
>
> I think that will be our most valuable debugging info at this point.
>
> Kent Wenger
> Condor Team
--
Erik A. Espinoza
Systems Administrator
LIGO/Caltech - MS 18-34
Pasadena, CA 91125
Ph: 626-395-8517
===========================================================================
Date mail was appended: Tue Feb 27 14:44:50 2007 (1172609091)
Date: Tue, 27 Feb 2007 14:46:35 -0600 (CST)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #1885] LIGO: NTsenders.C/NTreceivers.C
Assertion errors
Erik,
> The user has removed the jobs from our queue. Unless we can reproduce
> the problem with other jobs, we will not be able to do further testing
> for now.
Okay, we'll see what we can figure out on our end with the info we have.
Please let us know if the problem happens again, though.
Kent Wenger
Condor Team
===========================================================================
Date mail was appended: Tue Feb 27 14:47:54 2007 (1172609274)
Subject: Comments added
LIGO has said the problem is so infrequent, we can drop the ticket.
Comments added by tannenba
===========================================================================
Date comments were added: Mon Mar 12 16:53:05 2007 (1173736386)
Subject: Actions
Ticket resolved by wenger
===========================================================================
Date of actions: Mon Mar 12 17:21:14 2007 (1173738074)