LIGO Support Ticket 7591

Ticket Information
  Number:      support 7591
  User:        dabrown@physics.syr.edu
  Email:       
  Status:      resolved
  Assigned To: wenger
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
From: Duncan Brown <dabrown__AT__physics.syr.edu>
Subject: LIGO: problem with startd starting backfill jobs
Date: Tue, 5 Aug 2008 11:45:40 -0400
X-Scanner: InterScan AntiVirus for Sendmail
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu

Hi Todd,

I am seeing the following error in my Startd logs when they try and  
fire up boinc backfill jobs:

8/5 10:56:30 ******************************************************
8/5 10:56:30 ** condor_starter (CONDOR_STARTER) STARTING UP
8/5 10:56:30 ** /usr/sbin/condor_starter
8/5 10:56:30 ** $CondorVersion: 7.0.2 Jun  9 2008 BuildID: 89891 $
8/5 10:56:30 ** $CondorPlatform: X86_64-LINUX_RHEL5 $
8/5 10:56:30 ** PID = 4039
8/5 10:56:30 ** Log last touched 8/4 23:32:11
8/5 10:56:30 ******************************************************
8/5 10:56:30 Using config source: /usr1/condor/condor_config
8/5 10:56:30 Using local config sources:
8/5 10:56:30    /usr1/condor/condor_config.local
8/5 10:56:30 DaemonCore: Command Socket at <10.20.2.55:55356>
8/5 10:56:30 Done setting resource limits
8/5 10:56:31 condor_read(): recv() returned -1, errno = 104, assuming failure reading 5 bytes from unknown source.
8/5 10:56:31 IO: Failed to read packet header
8/5 10:56:31 ERROR "Assertion ERROR on (result)" at line 207 in file NTsenders.C
8/5 10:56:31 ERROR "LocalUserLog::logStarterError() called before init()" at line 223 in file local_user_log.C

Any idea what is happening? This occurred both before and after  
upgrading the schedd to 7.0.4.

Cheers,
Duncan.

-- 

Duncan Brown                          Room 263-1, Department of Physics,
Assistant Professor of Physics        Syracuse University, NY 13244, USA
Phone: (315) 443 5993             http://www.gravity.phy.syr.edu/~duncan
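
A note on the "errno = 104" in the condor_read() line of the log above: on Linux that value is ECONNRESET ("Connection reset by peer"), meaning the other end of the socket dropped the connection while the starter was reading. Below is a minimal Python sketch for decoding such values. The helper name describe_errno is hypothetical, and numeric errno values are platform-specific, so it is only meaningful when run on the same OS as the machine that wrote the log.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message) for a numeric errno on this platform."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


# 104 is the value from the StarterLog excerpt; on Linux it is ECONNRESET.
name, message = describe_errno(104)
print(f"errno 104: {name} ({message})")
```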



===========================================================================
Date of creation: Tue Aug  5 10:45:44 2008 (1217951146)
Subject: Actions

Assigned to wenger by wenger
===========================================================================
Date of actions: Tue Aug  5 11:06:12 2008 (1217952372)
Date: Tue, 5 Aug 2008 11:15:20 -0500 (CDT)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: wenger <condor-support__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #7591] LIGO: problem with startd starting
 backfill jobs

Duncan,

I'm grabbing this because I'm on RUST duty this week, and Todd is on 
"grant-writing vacation" right now.

Anyhow, I'll take a look at it.  Does this relate to some previous RUST
ticket that Todd handled or something like that?

I don't have any great ideas about this offhand -- I'll have to take a 
look at the code where the error is coming from.

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Tue Aug  5 11:19:40 2008 (1217953180)
From: Duncan Brown <dabrown__AT__physics.syr.edu>
Subject: Re: [condor-support #7591] LIGO: problem with startd starting
 backfill jobs
Date: Tue, 5 Aug 2008 12:22:48 -0400
To: condor-support__AT__cs.wisc.edu
X-Scanner: InterScan AntiVirus for Sendmail
X-Seen-BY: mailfromd 4.1 obsidian.cs.wisc.edu

Hi Kent,

Nope, this is a new error. Thanks for looking into it.

Cheers,
Duncan.




===========================================================================
Date mail was appended: Tue Aug  5 11:22:52 2008 (1217953373)
Date: Tue, 5 Aug 2008 13:08:01 -0500 (CDT)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: wenger <condor-support__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #7591] LIGO: problem with startd starting
 backfill jobs

Duncan,

> I am seeing the following error in my Startd logs when they try and
> fire up boinc backfill jobs:

One other thing -- could you send the job classad of one of the boinc 
jobs that's causing this error to show up?

Thanks...

Kent Wenger
Condor Team

===========================================================================
Date mail was appended: Tue Aug  5 13:12:56 2008 (1217959976)
From: Duncan Brown <dabrown__AT__physics.syr.edu>
Subject: Re: [condor-support #7591] LIGO: problem with startd starting
 backfill jobs
Date: Tue, 5 Aug 2008 16:04:21 -0400
To: condor-support__AT__cs.wisc.edu
X-Scanner: InterScan AntiVirus for Sendmail
X-Seen-BY: mailfromd 4.1 obsidian.cs.wisc.edu

Hi Kent,

How do I get the classad of a backfill job?

Cheers,
Duncan.




===========================================================================
Date mail was appended: Tue Aug  5 15:04:57 2008 (1217966698)
Date: Wed, 6 Aug 2008 13:54:30 -0500 (CDT)
From: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: "R. Kent Wenger" <wenger__AT__cs.wisc.edu>
Subject: Re: [condor-support #7591] LIGO: problem with startd starting
 backfill jobs

Duncan,

> How do I get the classad of a backfill job?

Ah, yeah, I wasn't thinking too well when I asked that.

Taking another tack, a few more things:

* What previous version of Condor were you running that exhibited this
problem?

* Do you know if this happens *every* time a backfill job is started?

* Can you send us your configuration file that has the backfill-related
configuration?

* Can you send a full StarterLog and StartLog that include the error?

Kent Wenger
Condor Team
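
One way to check the "every time" question against a full StarterLog is to split the log at each starter startup banner and count how many of those runs logged an ERROR. The following is only a rough sketch under assumptions: the function names are hypothetical, the banner text is taken from the log excerpt in this ticket, and the second run in the sample text is made up for illustration.

```python
import re

# Banner line as it appears in the StarterLog excerpt from this ticket.
BANNER = re.compile(r"\*\* condor_starter \(CONDOR_STARTER\) STARTING UP")


def split_starter_runs(log_text):
    """Split StarterLog text into one list of lines per starter startup."""
    runs, current = [], []
    for line in log_text.splitlines():
        # A new banner closes the run collected so far.
        if BANNER.search(line) and current:
            runs.append(current)
            current = []
        current.append(line)
    if current:
        runs.append(current)
    return runs


def runs_with_errors(runs):
    """Keep only the runs in which at least one line is logged as an ERROR."""
    return [run for run in runs if any(" ERROR " in line for line in run)]


# First run is from the ticket; the second (clean) run is hypothetical.
sample = """\
8/5 10:56:30 ** condor_starter (CONDOR_STARTER) STARTING UP
8/5 10:56:31 IO: Failed to read packet header
8/5 10:56:31 ERROR "Assertion ERROR on (result)" at line 207 in file NTsenders.C
8/5 11:02:00 ** condor_starter (CONDOR_STARTER) STARTING UP
8/5 11:02:00 Done setting resource limits
"""

runs = split_starter_runs(sample)
failing = runs_with_errors(runs)
print(f"{len(failing)} of {len(runs)} starter runs logged an ERROR")
```

If every run in the real log lands in the failing list, the problem occurs on every starter launch recorded there; a mix would suggest it is intermittent.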

===========================================================================
Date mail was appended: Wed Aug  6 13:58:02 2008 (1218049083)
Subject: Actions

Ticket resolved by wenger
===========================================================================
Date of actions: Fri Aug 29 13:53:31 2008 (1220036011)