LIGO Support Ticket 2146

Ticket Information
  Number:      support 2146
  User:        anderson@ligo.caltech.edu
  Email:       skoranda__AT__gravity.phys.uwm.edu
  Status:      resolved
  Assigned To: danb
Date: Sat, 1 Dec 2007 12:36:53 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu
CC: Scott Koranda <skoranda__AT__gravity.phys.uwm.edu>
Subject: LIGO: 6.9.5 starter assertion error at line 575
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu

I am unable to run backfill/boinc jobs in conjunction with regular condor
jobs on the LIGO Caltech pool after upgrading from 6.9.4 to 6.9.5. I am
able to run either boinc or regular jobs, but the transition from backfill
to regular jobs is broken. In particular the following infinite loop
is triggered:

1) master starts startd
2) startd starts boinc backfill manager
3) boinc appliation starts
4) a match is made for a regular queued job
5) startd kills backfill
6) startd generates Assertion ERROR (s_claim && s_claim->rip()) and exits
   with status 4
7) master notices and cycles back to state 1)

The current work around is to disable backfill via,
ENABLE_BACKFILL = FALSE


Here is an example from the StartLog,

12/1 11:03:39 ******************************************************
12/1 11:03:39 ** condor_startd (CONDOR_STARTD) STARTING UP
12/1 11:03:39 ** /usr/sbin/condor_startd
12/1 11:03:39 ** $CondorVersion: 6.9.5 Nov 28 2007 BuildID: 65347 $
12/1 11:03:39 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
12/1 11:03:39 ** PID = 27202
12/1 11:03:39 ** Log last touched 12/1 10:42:17
12/1 11:03:39 ******************************************************
12/1 11:03:39 Using config source: /usr1/condor/condor_config
12/1 11:03:39 Using local config sources: 
12/1 11:03:39    /usr1/condor/condor_config.local
12/1 11:03:39 DaemonCore: Command Socket at <10.14.1.100:34817>
12/1 11:03:45 "/usr/sbin/condor_starter.pvm -classad" did not produce any output, ignoring
12/1 11:03:45 slot1: New machine resource of type 1 allocated
12/1 11:03:45 slot2: New machine resource of type 1 allocated
12/1 11:03:45 slot3: New machine resource of type 2 allocated
12/1 11:03:45 slot4: New machine resource of type 2 allocated
12/1 11:03:45 Instantiating a BOINC_BackfillMgr
12/1 11:03:45 Created a BOINC Backfill Manager
12/1 11:03:45 About to run initial benchmarks.
12/1 11:03:52 Completed initial benchmarks.
12/1 11:03:52 slot1: State change: IS_OWNER is false
12/1 11:03:52 slot1: Changing state: Owner -> Unclaimed
12/1 11:03:52 slot1: State change: START_BACKFILL is TRUE
12/1 11:03:52 slot1: Changing state: Unclaimed -> Backfill
12/1 11:03:52 slot2: State change: IS_OWNER is false
12/1 11:03:52 slot2: Changing state: Owner -> Unclaimed
12/1 11:03:52 slot2: State change: START_BACKFILL is TRUE
12/1 11:03:52 slot2: Changing state: Unclaimed -> Backfill
12/1 11:03:52 slot3: State change: IS_OWNER is false
12/1 11:03:52 slot3: Changing state: Owner -> Unclaimed
12/1 11:03:52 slot3: State change: START_BACKFILL is TRUE
12/1 11:03:52 slot3: Changing state: Unclaimed -> Backfill
12/1 11:03:52 slot4: State change: IS_OWNER is false
12/1 11:03:52 slot4: Changing state: Owner -> Unclaimed
12/1 11:03:52 slot4: State change: START_BACKFILL is TRUE
12/1 11:03:52 slot4: Changing state: Unclaimed -> Backfill
12/1 11:03:57 State change: BOINC client running for slot1
12/1 11:03:57 slot1: Changing activity: Idle -> Busy
12/1 11:03:57 State change: BOINC client running for slot2
12/1 11:03:57 slot2: Changing activity: Idle -> Busy
12/1 11:03:57 State change: BOINC client running for slot3
12/1 11:03:57 slot3: Changing activity: Idle -> Busy
12/1 11:03:57 State change: BOINC client running for slot4
12/1 11:03:57 slot4: Changing activity: Idle -> Busy
12/1 11:03:58 Calling HandleReq <HandleChildAliveCommand> (0)
12/1 11:03:58 Return from HandleReq <HandleChildAliveCommand>
12/1 11:08:41 DaemonCore: Command received via UDP from host <10.14.0.25:49215>, access level NEGOTIATOR
12/1 11:08:41 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
12/1 11:08:41 Calling HandleReq <command_match_info> (0)
12/1 11:08:41 slot1: match_info called
12/1 11:08:41 slot1: Received match <10.14.1.100:34817>#1196535819#1#...
12/1 11:08:41 slot1: State change: match notification protocol successful
12/1 11:08:41 slot1: Changing activity: Busy -> Killing
12/1 11:08:41 slot1: State change: starter exited
12/1 11:08:41 slot1: Changing state and activity: Backfill/Killing -> Matched/Idle
12/1 11:08:41 Return from HandleReq <command_match_info>
12/1 11:08:41 DaemonCore: Command received via UDP from host <10.14.0.25:49215>, access level NEGOTIATOR
12/1 11:08:41 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
12/1 11:08:41 Calling HandleReq <command_match_info> (0)
12/1 11:08:41 slot2: match_info called
12/1 11:08:41 slot2: Received match <10.14.1.100:34817>#1196535819#2#...
12/1 11:08:41 slot2: State change: match notification protocol successful
12/1 11:08:41 slot2: Changing activity: Busy -> Killing
12/1 11:08:41 slot2: State change: starter exited
12/1 11:08:41 slot2: Changing state and activity: Backfill/Killing -> Matched/Idle
12/1 11:08:41 Return from HandleReq <command_match_info>
12/1 11:08:42 DaemonCore: Command received via UDP from host <10.14.0.25:49215>, access level NEGOTIATOR
12/1 11:08:42 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
12/1 11:08:42 Calling HandleReq <command_match_info> (0)
12/1 11:08:42 slot3: match_info called
12/1 11:08:42 slot3: Received match <10.14.1.100:34817>#1196535819#3#...
12/1 11:08:42 slot3: State change: match notification protocol successful
12/1 11:08:42 slot3: Changing activity: Busy -> Killing
12/1 11:08:42 slot3: State change: starter exited
12/1 11:08:42 slot3: Changing state and activity: Backfill/Killing -> Matched/Idle
12/1 11:08:42 Return from HandleReq <command_match_info>
12/1 11:08:42 DaemonCore: Command received via UDP from host <10.14.0.25:49215>, access level NEGOTIATOR
12/1 11:08:42 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
12/1 11:08:42 Calling HandleReq <command_match_info> (0)
12/1 11:08:42 slot4: match_info called
12/1 11:08:42 slot4: Received match <10.14.1.100:34817>#1196535819#4#...
12/1 11:08:42 slot4: State change: match notification protocol successful
12/1 11:08:42 slot4: Changing activity: Busy -> Killing
12/1 11:08:42 Return from HandleReq <command_match_info>
12/1 11:08:42 Calling Handler <receiveJobClassAdUpdate>
12/1 11:08:42 Return from Handler <receiveJobClassAdUpdate>
12/1 11:08:42 DaemonCore: pid 27210 exited with status 0, invoking reaper 2 <BOINC_BackfillMgr::reaper()>
12/1 11:08:42 BOINC client (pid 27210) exited with status 0
12/1 11:08:42 ERROR "Assertion ERROR on (s_claim && s_claim->rip())" at line 575 in file Starter.C
12/1 11:08:42 slot1: Changing state: Matched -> Owner
12/1 11:08:42 slot1: State change: IS_OWNER is false
12/1 11:08:42 slot1: Changing state: Owner -> Unclaimed
12/1 11:08:42 slot1: State change: START_BACKFILL is TRUE
12/1 11:08:42 slot1: Changing state: Unclaimed -> Backfill
12/1 11:08:42 slot2: Changing state: Matched -> Owner
12/1 11:08:42 slot2: State change: IS_OWNER is false
12/1 11:08:42 slot2: Changing state: Owner -> Unclaimed
12/1 11:08:42 slot2: State change: START_BACKFILL is TRUE
12/1 11:08:42 slot2: Changing state: Unclaimed -> Backfill
12/1 11:08:42 slot3: Changing state: Matched -> Owner
12/1 11:08:42 slot3: State change: IS_OWNER is false
12/1 11:08:42 slot3: Changing state: Owner -> Unclaimed
12/1 11:08:42 slot3: State change: START_BACKFILL is TRUE
12/1 11:08:42 slot3: Changing state: Unclaimed -> Backfill
12/1 11:08:42 startd exiting because of fatal exception.
12/1 11:09:05 ******************************************************
12/1 11:09:05 ** condor_startd (CONDOR_STARTD) STARTING UP



and MasterLog,

12/1 11:03:39 ******************************************************
12/1 11:03:39 ** condor_master (CONDOR_MASTER) STARTING UP
12/1 11:03:39 ** /usr/sbin/condor_master
12/1 11:03:39 ** $CondorVersion: 6.9.5 Nov 28 2007 BuildID: 65347 $
12/1 11:03:39 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
12/1 11:03:39 ** PID = 27200
12/1 11:03:39 ** Log last touched 12/1 10:48:40
12/1 11:03:39 ******************************************************
12/1 11:03:39 Using config source: /usr1/condor/condor_config
12/1 11:03:39 Using local config sources: 
12/1 11:03:39    /usr1/condor/condor_config.local
12/1 11:03:39 DaemonCore: Command Socket at <10.14.1.100:58280>
12/1 11:03:39 GID-based process tracking requires use of ProcD; ignoring USE_PROCD setting
12/1 11:03:39 Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 27202
12/1 11:03:39 Started process "/usr/sbin/condor_ckpt_server", pid and pgroup = 27203
12/1 11:03:52 Calling HandleReq <HandleChildAliveCommand> (0)
12/1 11:03:52 Return from HandleReq <HandleChildAliveCommand>
12/1 11:08:42 DaemonCore: pid 27202 exited with status 1024, invoking reaper 1 <Daemons::DefaultReaper()>
12/1 11:08:42 The STARTD (pid 27202) exited with status 4
12/1 11:08:42 Sending obituary for "/usr/sbin/condor_startd"
12/1 11:08:55 restarting /usr/sbin/condor_startd in 10 seconds
12/1 11:08:55 DaemonCore: return from reaper for pid 27202
12/1 11:09:05 Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 27277
12/1 11:09:14 Calling HandleReq <HandleChildAliveCommand> (0)
12/1 11:09:14 Return from HandleReq <HandleChildAliveCommand>
12/1 11:11:17 DaemonCore: pid 27277 exited with status 1024, invoking reaper 1 <Daemons::DefaultReaper()>
12/1 11:11:17 The STARTD (pid 27277) exited with status 4
12/1 11:11:17 Sending obituary for "/usr/sbin/condor_startd"
12/1 11:11:25 restarting /usr/sbin/condor_startd in 11 seconds
12/1 11:11:25 DaemonCore: return from reaper for pid 27277
12/1 11:11:36 Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 27325
12/1 11:11:45 Calling HandleReq <HandleChildAliveCommand> (0)
12/1 11:11:45 Return from HandleReq <HandleChildAliveCommand>
12/1 11:12:45 DaemonCore: pid 27325 exited with status 1024, invoking reaper 1 <Daemons::DefaultReaper()>
12/1 11:12:45 The STARTD (pid 27325) exited with status 4
12/1 11:12:45 Sending obituary for "/usr/sbin/condor_startd"

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date of creation: Sun Dec  2  0:18:18 2007 (1196576302)
Subject: Actions

Assigned to gquinn by gquinn
===========================================================================
Date of actions: Sun Dec  2 21:37:53 2007 (1196653073)
Subject: Actions

Assigned to danb by danb
===========================================================================
Date of actions: Mon Dec  3 17:55:28 2007 (1196726128)
Date: Tue, 04 Dec 2007 09:38:45 -0600
From: Dan Bradley <danb__AT__cs.wisc.edu>
To: condor-support__AT__cs.wisc.edu
Subject: Re: [condor-support #2146] LIGO: 6.9.5 starter assertion error at
 line 575
X-Seen-BY: mailfromd 4.1 silica.cs.wisc.edu


>I am unable to run backfill/boinc jobs in conjunction with regular condor
>jobs on the LIGO Caltech pool after upgrading from 6.9.4 to 6.9.5. I am
>able to run either boinc or regular jobs, but the transition from backfill
>to regular jobs is broken.
>

Thanks for the report.  We have prepared a patch for this problem.  The 
fix will be released as part of 7.0.  If you need it before the end of 
the year, well need to give you a pre-release.

Cheers,
--Dan


===========================================================================
Date mail was appended: Tue Dec  4  9:38:58 2007 (1196782738)
Subject: Actions

Ticket resolved by danb
===========================================================================
Date of actions: Tue Dec  4  9:38:58 2007 (1196782740)
Subject: Actions

Ticket was reopened by mailnull
===========================================================================
Date of actions: Tue Dec  4 12:46:31 2007 (1196793991)
Date: Tue, 4 Dec 2007 10:46:15 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: skoranda__AT__gravity.phys.uwm.edu
Subject: Re: [condor-support #2146] LIGO: 6.9.5 starter assertion error at
 line 575
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu

On Tue, Dec 04, 2007 at 09:38:58AM -0600, condor-support response tracking system wrote:
> 
> 
> >I am unable to run backfill/boinc jobs in conjunction with regular condor
> >jobs on the LIGO Caltech pool after upgrading from 6.9.4 to 6.9.5. I am
> >able to run either boinc or regular jobs, but the transition from backfill
> >to regular jobs is broken.
> >
> 
> Thanks for the report.  We have prepared a patch for this problem.  The 
> fix will be released as part of 7.0.  If you need it before the end of 
> the year, well need to give you a pre-release.
> 

If the patch is restricted to the condor_startd executable, then yes,
I would like to get a pre-release to test (RHEL3 x86 and x86_64).

Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Tue Dec  4 12:46:31 2007 (1196793991)
Date: Fri, 14 Dec 2007 16:11:01 -0600
From: Dan Bradley <danb__AT__cs.wisc.edu>
To: condor-support__AT__cs.wisc.edu
Subject: Re: [condor-support #2146] LIGO: 6.9.5 starter assertion error at
 line 575
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu


condor-support response tracking system wrote:
> On Tue, Dec 04, 2007 at 09:38:58AM -0600, condor-support response tracking system wrote:
>   
>>     
>>> I am unable to run backfill/boinc jobs in conjunction with regular condor
>>> jobs on the LIGO Caltech pool after upgrading from 6.9.4 to 6.9.5. I am
>>> able to run either boinc or regular jobs, but the transition from backfill
>>> to regular jobs is broken.
>>>
>>>       
>> Thanks for the report.  We have prepared a patch for this problem.  The 
>> fix will be released as part of 7.0.  If you need it before the end of 
>> the year, well need to give you a pre-release.
>>
>>     
>
> If the patch is restricted to the condor_startd executable, then yes,
> I would like to get a pre-release to test (RHEL3 x86 and x86_64).
>   

Here's a build with the BOINC problem fixed (and a lot of other things 
too, since we're in bugfix only mode).  You should only need to upgrade 
the startd, but if you upgrade all binaries on the execute side (startd, 
master, starters), that would eliminate worries about an unexpected 
internal version incompatibility that I am not aware of.

ftp://ftp.cs.wisc.edu/condor/temporary/forligo/6.9.5-pre-release-2/

--Dan


===========================================================================
Date mail was appended: Fri Dec 14 16:11:13 2007 (1197670273)
Subject: Actions

Ticket resolved by danb
===========================================================================
Date of actions: Fri Dec 14 16:11:13 2007 (1197670274)
Subject: Actions

Ticket was reopened by mailnull
===========================================================================
Date of actions: Fri Dec 14 19:39:14 2007 (1197682754)
Date: Fri, 14 Dec 2007 17:38:50 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: skoranda__AT__gravity.phys.uwm.edu
Subject: Re: [condor-support #2146] LIGO: 6.9.5 starter assertion error at
 line 575
X-Seen-BY: mailfromd 4.1 obsidian.cs.wisc.edu

On Fri, Dec 14, 2007 at 04:11:13PM -0600, condor-support response tracking system wrote:
> 
> Here's a build with the BOINC problem fixed (and a lot of other things 
> too, since we're in bugfix only mode).  You should only need to upgrade 
> the startd, but if you upgrade all binaries on the execute side (startd, 
> master, starters), that would eliminate worries about an unexpected 
> internal version incompatibility that I am not aware of.
> 
> ftp://ftp.cs.wisc.edu/condor/temporary/forligo/6.9.5-pre-release-2/
> 

Thanks, Dan. I decided to start with just the startd binary update
and that is working so far. At initial startup it successfully
starts the boinc/backfill jobs and then kicks them off almost
immediately as matches are made for "normal" jobs.

As far as I am concerned you can close this ticket.

Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Fri Dec 14 19:39:14 2007 (1197682754)
Subject: Actions

Ticket resolved by danb
===========================================================================
Date of actions: Sat Dec 15  8:11:16 2007 (1197727876)