LIGO Support Ticket 1694

Ticket Information
  Number:      support 1694
  User:        espinoza@ligo.caltech.edu
  Email:       anderson__AT__ligo.caltech.edu,espinoza_e__AT__ligo.caltech.edu
  Status:      resolved
  Assigned To: wright
Date: Wed, 18 Oct 2006 11:38:23 -0700
From: "Erik A. Espinoza" <espinoza__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
Subject: LIGO Ticket: Condor 6.8.2 x86_64 Upgrade [Fwd: [Condor] Problem]
X-No-Archive: Yes
X-Archive: No

Greetings,

Yesterday we upgraded to Condor 6.8.2 x86_64 from Condor 6.8.1 i386. 
We've had about 4 or 5 of these e-mails coming in with messages similar 
to this.

Do you have any logs or info that I should send, or is this a known bug?

Thanks,
Erik

-------- Original Message --------
Subject: [Condor] Problem
Date: Wed, 18 Oct 2006 11:30:52 -0700
From: condor__AT__node203.ldas-cit.ligo.caltech.edu
To: ldas_admin_cit__AT__ligo.caltech.edu

This is an automated email from the Condor system
on machine "node203.ldas-cit.ligo.caltech.edu".  Do not reply.

"/ldcg/condor/sbin/condor_startd" on "node203.ldas-cit.ligo.caltech.edu" 
died due to signal 6.
Condor will automatically restart this process in 10 seconds.

*** Last 20 line(s) of file StartLog:
10/18 11:25:49 vm4: Got universe "VANILLA" (5) from request classad
10/18 11:25:49 vm4: State change: claim-activation protocol successful
10/18 11:25:49 vm4: Changing activity: Idle -> Busy
10/18 11:28:03 Unable to calculate keyboard/mouse idle time due to them 
both being USB or not present, assuming infinite idle time for these 
devices.
10/18 11:30:40 DaemonCore: Command received via UDP from host 
<10.14.0.12:38631>
10/18 11:30:40 DaemonCore: received command 440 (MATCH_INFO), calling 
handler (command_match_info)
10/18 11:30:40 vm1: match_info called
10/18 11:30:41 DaemonCore: Command received via TCP from host 
<10.14.0.12:44797>
10/18 11:30:41 DaemonCore: received command 442 (REQUEST_CLAIM), calling 
handler (command_request_claim)
10/18 11:30:41 vm1: Preempting claim has correct ClaimId.
10/18 11:30:41 vm1: New claim has sufficient rank, preempting current claim.
10/18 11:30:41 vm1: State change: preempting claim based on user priority
10/18 11:30:41 vm1: State change: retiring due to preempting claim
10/18 11:30:41 vm1: Changing activity: Busy -> Retiring
10/18 11:30:41 vm1: State change: retirement ended/expired
10/18 11:30:41 vm1: Changing state and activity: Claimed/Retiring -> 
Preempting/Vacating
10/18 11:30:41 DaemonCore: Command received via TCP from host 
<10.14.0.12:34417>
10/18 11:30:41 DaemonCore: received command 404 
(DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
10/18 11:30:41 vm1: Got KILL_FRGN_JOB while in Preempting state, ignoring.
10/18 11:30:41 Starter pid 324 exited with status 0
*** End of file StartLog



-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Questions about this message or Condor in general?
Email address of the local Condor administrator: 
ldas_admin_cit__AT__ligo.caltech.edu
The Official Condor Homepage is http://www.cs.wisc.edu/condor

-- 
Erik A. Espinoza
Systems Administrator
LIGO/Caltech - MS 18-34
Pasadena, CA 91125
Ph: 626-395-8517

===========================================================================
Date of creation: Wed Oct 18 13:39:04 2006 (1161196747)
Subject: Actions

Assigned to jfrey by pfc
===========================================================================
Date of actions: Wed Oct 18 13:49:50 2006 (1161197518)
Subject: Actions

Assigned to wright by tannenba
===========================================================================
Date of actions: Wed Oct 18 17:47:05 2006 (1161211625)
From: Derek Wright <wright__AT__cs.wisc.edu>
Subject: Re: [condor-support #1694] LIGO Ticket: Condor 6.8.2 x86_64     
 Upgrade [Fwd: [Condor] Problem]
Date: Wed, 18 Oct 2006 16:49:12 -0700
To: condor-support__AT__cs.wisc.edu


On Oct 18, 2006, at 3:47 PM, tannenba wrote:

> Yesterday we upgraded to Condor 6.8.2 x86_64 from Condor 6.8.1 i386.
> We've had about 4 or 5 of these e-mails coming in with messages  
> similar
> to this.
>
> Do you have any logs or info that I should send, or is this a known  
> bug?

this isn't a known bug, so more info would be most appreciated.   
especially if you've got any core files.

however, my first question:  are you using the official 6.8.2  
release, or are you using some kind of pre-release?  what does  
condor_version return on your binaries?  it should be this:

$CondorVersion: 6.8.2 Oct 12 2006 $

if not, you should upgrade to the official 6.8.2 release, before we  
spend any more time investigating this, since there were some last  
minute fixes that went into 6.8.2 that probably aren't present in any  
kind of pre-release you might be running.

but, assuming you have the valid, official release, and you're still  
seeing this problem, then yes, access to a core file would be very  
helpful.

thanks,
-derek





===========================================================================
Date mail was appended: Wed Oct 18 18:49:23 2006 (1161215363)
Subject: Actions

Status changed from open to pending by wright
===========================================================================
Date of actions: Wed Oct 18 18:49:34 2006 (1161215375)
Date: Wed, 18 Oct 2006 17:27:53 -0700
From: "Erik A. Espinoza" <espinoza__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu
Subject: Re: [condor-support #1694] LIGO Ticket: Condor 6.8.2 x86_64     
 Upgrade  [Fwd: [Condor] Problem]
X-Enigmail-Version: 0.94.1.0
Openpgp: url=http://pgp.mit.edu/
X-No-Archive: Yes
X-Archive: No

> this isn't a known bug, so more info would be most appreciated.   
> especially if you've got any core files.

No core files were left behind as a result of this.

> however, my first question:  are you using the official 6.8.2  
> release, or are you using some kind of pre-release?  what does  
> condor_version return on your binaries?  it should be this:

This is the official release:
$CondorVersion: 6.8.2 Oct 12 2006 $
$CondorPlatform: X86_64-LINUX_RHEL3 $
-- 
Erik A. Espinoza
Systems Administrator
LIGO/Caltech - MS 18-34
Pasadena, CA 91125
Ph: 626-395-8517

===========================================================================
Date mail was appended: Wed Oct 18 19:28:27 2006 (1161217707)
Subject: Actions

Status changed from open to pending by wright
===========================================================================
Date of actions: Thu Oct 19 16:27:54 2006 (1161293274)
From: Derek Wright <wright__AT__cs.wisc.edu>
Subject: Re: [condor-support #1694] LIGO Ticket: Condor 6.8.2 x86_64    
 Upgrade [Fwd: [Condor] Problem]
Date: Thu, 19 Oct 2006 14:27:44 -0700
To: condor-support__AT__cs.wisc.edu


On Oct 18, 2006, at 5:28 PM, condor-support response tracking system  
wrote:

> No core files were left behind as a result of this.

alas.

> This is the official release:

ok, duly noted.  a few more follow-up questions:

1) how often is this happening?  only on

2) is D_FULLDEBUG in your STARTD_DEBUG setting?  if not, can you  
please add it and send a longer chunk of the StartLog file when one  
of these deaths occurs?

3) would you be willing to run a different version of the startd on a  
few machines to try to debug this?

i've got a few theories, but i'd have to modify the code to add more  
debugging output to see if any of them are valid...

thanks,
-derek



===========================================================================
Date mail was appended: Thu Oct 19 16:27:55 2006 (1161293275)
Date: Mon, 23 Oct 2006 15:26:06 -0700
From: "Erik A. Espinoza" <espinoza__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu
CC: Stuart Anderson <anderson__AT__ligo.caltech.edu>
Subject: Re: [condor-support #1694] LIGO Ticket: Condor 6.8.2 x86_64     
 Upgrade  [Fwd: [Condor] Problem]
X-Enigmail-Version: 0.94.1.0
Openpgp: url=http://pgp.mit.edu/
X-No-Archive: Yes
X-Archive: No

Hey Derek,

Alain just forwarded me this message. Looks like I never got this one.

 > 1) how often is this happening?  only on

Pretty often. We have a 329 machine cluster (~1300 Condor VM's) and we 
get about 2-5 per day.

 > 2) is D_FULLDEBUG in your STARTD_DEBUG setting?  if not, can you
 > please add it and send a longer chunk of the StartLog file when one
 > of these deaths occurs?

The D_FULLDEBUG is not set. I can enable this. It's not happening on 
just one node, and so far has been pretty random.

Quick question. We have a relatively busy cluster. Since D_FULLDEBUG 
writes incredibly verbose logs, how much time would it take for the 
important data to be overwritten?

 > 3) would you be willing to run a different version of the startd on a
 > few machines to try to debug this?

Not my call, but I think it'd be okay. We'd do it on the entire cluster, 
again, because we don't know which node is going to show this problem next.

 > i've got a few theories, but i'd have to modify the code to add more
 > debugging output to see if any of them are valid...

Any ideas or suggestions would be appreciated.

Thanks,
Erik


condor-support response tracking system wrote:
> On Oct 18, 2006, at 3:47 PM, tannenba wrote:
> 
>> Yesterday we upgraded to Condor 6.8.2 x86_64 from Condor 6.8.1 i386.
>> We've had about 4 or 5 of these e-mails coming in with messages  
>> similar
>> to this.
>>
>> Do you have any logs or info that I should send, or is this a known  
>> bug?
> 
> this isn't a known bug, so more info would be most appreciated.   
> especially if you've got any core files.
> 
> however, my first question:  are you using the official 6.8.2  
> release, or are you using some kind of pre-release?  what does  
> condor_version return on your binaries?  it should be this:
> 
> $CondorVersion: 6.8.2 Oct 12 2006 $
> 
> if not, you should upgrade to the official 6.8.2 release, before we  
> spend any more time investigating this, since there were some last  
> minute fixes that went into 6.8.2 that probably aren't present in any  
> kind of pre-release you might be running.
> 
> but, assuming you have the valid, official release, and you're still  
> seeing this problem, then yes, access to a core file would be very  
> helpful.
> 
> thanks,
> -derek
> 
> 
> 
> 
> 
> 
> ========================================
> MESSAGE INFORMATION
> ========================================
> * From: Derek Wright <wright__AT__cs.wisc.edu>
> * Ticket Email List: espinoza__AT__ligo.caltech.edu, 
> 

-- 
Erik A. Espinoza
Systems Administrator
LIGO/Caltech - MS 18-34
Pasadena, CA 91125
Ph: 626-395-8517

===========================================================================
Date mail was appended: Mon Oct 23 17:26:45 2006 (1161642406)
Subject: Actions

Status changed from open to pending by wright
===========================================================================
Date of actions: Mon Oct 23 17:35:46 2006 (1161642947)
From: Derek Wright <wright__AT__cs.wisc.edu>
Subject: Re: [condor-support #1694] LIGO Ticket: Condor 6.8.2 x86_64    
 Upgrade [Fwd: [Condor] Problem]
Date: Mon, 23 Oct 2006 15:35:38 -0700
To: condor-support__AT__cs.wisc.edu


On Oct 23, 2006, at 3:26 PM, condor-support response tracking system  
wrote:

> Alain just forwarded me this message. Looks like I never got this one.

evil. :(  sorry about that.

> Pretty often. We have a 329 machine cluster (~1300 Condor VM's) and we
> get about 2-5 per day.

duly noted.

> The D_FULLDEBUG is not set. I can enable this. It's not happening on
> just one node, and so far has been pretty random.

ok.

> Quick question. We have a relatively busy cluster. Since D_FULLDEBUG
> writes incredibly verbose logs, how much time would it take for the
> important data to be overwritten?

all i'd really care about are the ~200 lines from right before it  
exists from the death-by-signal.  if you had D_FULLDEBUG on, i'd say  
you should just set this value as well:

OBITUARY_LOG_LENGTH = 200

then, the master will automatically include the last 200 lines in the  
emails when the startd exits abnormally (instead of the 20 lines that  
are sent by default).

> Not my call, but I think it'd be okay. We'd do it on the entire  
> cluster,
> again, because we don't know which node is going to show this  
> problem next.

right.  i'd be careful to use exactly the 6.8.2 code, with only a few  
minor changes for the debugging messages, nothing substantive.

> Any ideas or suggestions would be appreciated.

really, the only reason they would die with signal 6 (SIGABRT) is if  
a call to assert() was failing.  in condor code, we have a function  
ASSERT() that writes something to the debug log before the actual  
SIGABRT, which is what we use in most of the code.  unfortunately,  
there are a few places in the startd's code that are still using  
assert(), which doesn't log anything.  my theory is that one of those  
is getting hit, but i'm not sure which one, or why.  so, i'm going to  
convert those to ASSERT(), and if necessary, add a few other  
D_FULLDEBUG messages right before them to dump out some other  
potentially useful state so that if we hit one of those cases, we'll  
have a better idea *why* the assert() is failing.

this is only happening on $CondorPlatform: X86_64-LINUX_RHEL3 $ ??   
is that the only platform in your pool?

thanks,
-derek



===========================================================================
Date mail was appended: Mon Oct 23 17:35:47 2006 (1161642948)
Date: Mon, 23 Oct 2006 15:50:06 -0700
From: "Erik A. Espinoza" <espinoza__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu
CC: anderson__AT__ligo.caltech.edu
Subject: Re: [condor-support #1694] LIGO Ticket: Condor 6.8.2 x86_64     
 Upgrade  [Fwd: [Condor] Problem]
X-Enigmail-Version: 0.94.1.0
Openpgp: url=http://pgp.mit.edu/
X-No-Archive: Yes
X-Archive: No



condor-support response tracking system wrote:
> On Oct 23, 2006, at 3:26 PM, condor-support response tracking system  
> wrote:
> 
>> Alain just forwarded me this message. Looks like I never got this one.
> 
> evil. :(  sorry about that.

It happens.

>> Pretty often. We have a 329 machine cluster (~1300 Condor VM's) and we
>> get about 2-5 per day.
> 
> duly noted.
> 
>> The D_FULLDEBUG is not set. I can enable this. It's not happening on
>> just one node, and so far has been pretty random.
> 
> ok.
> 
>> Quick question. We have a relatively busy cluster. Since D_FULLDEBUG
>> writes incredibly verbose logs, how much time would it take for the
>> important data to be overwritten?
> 
> all i'd really care about are the ~200 lines from right before it  
> exists from the death-by-signal.  if you had D_FULLDEBUG on, i'd say  
> you should just set this value as well:
> 
> OBITUARY_LOG_LENGTH = 200
> 
> then, the master will automatically include the last 200 lines in the  
> emails when the startd exits abnormally (instead of the 20 lines that  
> are sent by default).

Sure thing.

>> Not my call, but I think it'd be okay. We'd do it on the entire  
>> cluster,
>> again, because we don't know which node is going to show this  
>> problem next.
> 
> right.  i'd be careful to use exactly the 6.8.2 code, with only a few  
> minor changes for the debugging messages, nothing substantive.

I think we can accommodate? Do you believe that we should do the 
D_FULLDEBUG and the new startd at the same time, or do one then the other?


>> Any ideas or suggestions would be appreciated.
> 
> really, the only reason they would die with signal 6 (SIGABRT) is if  
> a call to assert() was failing.  in condor code, we have a function  
> ASSERT() that writes something to the debug log before the actual  
> SIGABRT, which is what we use in most of the code.  unfortunately,  
> there are a few places in the startd's code that are still using  
> assert(), which doesn't log anything.  my theory is that one of those  
> is getting hit, but i'm not sure which one, or why.  so, i'm going to  
> convert those to ASSERT(), and if necessary, add a few other  
> D_FULLDEBUG messages right before them to dump out some other  
> potentially useful state so that if we hit one of those cases, we'll  
> have a better idea *why* the assert() is failing.
> 
> this is only happening on $CondorPlatform: X86_64-LINUX_RHEL3 $ ??   
> is that the only platform in your pool?

Yes

> thanks,
> -derek
> 
> 
> 
> 
> ========================================
> MESSAGE INFORMATION
> ========================================
> * From: Derek Wright <wright__AT__cs.wisc.edu>
> * Ticket Email List: espinoza__AT__ligo.caltech.edu, anderson__AT__ligo.caltech.edu
> 

-- 
Erik A. Espinoza
Systems Administrator
LIGO/Caltech - MS 18-34
Pasadena, CA 91125
Ph: 626-395-8517

===========================================================================
Date mail was appended: Mon Oct 23 17:50:35 2006 (1161643836)
From: Derek Wright <wright__AT__cs.wisc.edu>
Subject: Re: [condor-support #1694] LIGO Ticket: Condor 6.8.2 x86_64    
 Upgrade [Fwd: [Condor] Problem]
Date: Mon, 23 Oct 2006 17:06:10 -0700
To: condor-support__AT__cs.wisc.edu


On Oct 23, 2006, at 3:50 PM, condor-support response tracking system  
wrote:

> I think we can accommodate? Do you believe that we should do the
> D_FULLDEBUG and the new startd at the same time, or do one then the  
> other?

it wouldn't hurt to turn on D_FULLDEBUG and up the  
OBITUARY_LOG_LENGTH already.  perhaps that will give me enough info  
to see what's happening.  meanwhile, while you're collecting data,  
i'll get you a new copy of the startd with the better logging info...

duly noted re: x86_64-rhel3 being the only platform.

sorry this is taking a while to resolve... :(

-derek





===========================================================================
Date mail was appended: Mon Oct 23 19:06:23 2006 (1161648384)
Subject: Actions

Status changed from open to pending by wright
===========================================================================
Date of actions: Mon Oct 23 19:06:57 2006 (1161648417)
Date: Mon, 23 Oct 2006 18:09:53 -0700
From: "Erik A. Espinoza" <espinoza__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu
CC: anderson__AT__ligo.caltech.edu
Subject: Re: [condor-support #1694] LIGO Ticket: Condor 6.8.2 x86_64     
 Upgrade  [Fwd: [Condor] Problem]
X-Enigmail-Version: 0.94.1.0
Openpgp: url=http://pgp.mit.edu/
X-No-Archive: Yes
X-Archive: No

This has been enabled. I will foward the next message.

Thanks,
Erik

condor-support response tracking system wrote:
> On Oct 23, 2006, at 3:50 PM, condor-support response tracking system  
> wrote:
> 
>> I think we can accommodate? Do you believe that we should do the
>> D_FULLDEBUG and the new startd at the same time, or do one then the  
>> other?
> 
> it wouldn't hurt to turn on D_FULLDEBUG and up the  
> OBITUARY_LOG_LENGTH already.  perhaps that will give me enough info  
> to see what's happening.  meanwhile, while you're collecting data,  
> i'll get you a new copy of the startd with the better logging info...
> 
> duly noted re: x86_64-rhel3 being the only platform.
> 
> sorry this is taking a while to resolve... :(
> 
> -derek
> 
> 
> 
> 
> 
> 
> ========================================
> MESSAGE INFORMATION
> ========================================
> * From: Derek Wright <wright__AT__cs.wisc.edu>
> * Ticket Email List: espinoza__AT__ligo.caltech.edu, anderson__AT__ligo.caltech.edu
> 

-- 
Erik A. Espinoza
Systems Administrator
LIGO/Caltech - MS 18-34
Pasadena, CA 91125
Ph: 626-395-8517

===========================================================================
Date mail was appended: Mon Oct 23 20:10:29 2006 (1161652230)
Subject: Actions

Status changed from open to pending by wright
===========================================================================
Date of actions: Tue Oct 24  1:00:04 2006 (1161669604)
From: Derek Wright <wright__AT__cs.wisc.edu>
Subject: Re: [condor-support #1694] LIGO Ticket: Condor 6.8.2 x86_64    
 Upgrade [Fwd: [Condor] Problem]
Date: Tue, 24 Oct 2006 16:37:35 -0700
To: condor-support__AT__cs.wisc.edu

ok, a new version of the condor_startd binary with the (hopefully  
better) logging info in the case of signal 6 is here:

ftp://ftp.cs.wisc.edu/condor/temporary/forligo/x86_64_rhel3/ 
condor-6.8.3-pre-x86_64_rhel3.tar.gz

in there, there's a .static and .dynamic version of the startd.  i  
gave you the unstripped version, so that we could do better  
interactive debugging with it, too.

so, ideally, you'd install this on all your x86_64 boxes and leave  
the D_FULLDEBUG and OBITUARY_LOG_LENGTH stuff enabled.  hopefully,  
the next time you get an email about this, we should see a message at  
the end of the log saying what assert() is failing.  if that alone  
isn't enough for me to resolve the bug, we can attempt to catch it in  
the act via attaching to a few startds using gdb, and leaving them  
that way until they die.  this will be harder to notice, since we  
won't get email about it until the startd exists (which won't happen  
if we're still attached via gdb), but it's an option of last resort  
if we need to go that route.

let me know how it goes, or if you have any trouble downloading these  
new pre-release binaries.

thanks,
-derek




===========================================================================
Date mail was appended: Tue Oct 24 18:37:43 2006 (1161733064)
Date: Wed, 25 Oct 2006 13:25:16 -0700
From: "Erik A. Espinoza" <espinoza__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu
CC: anderson__AT__ligo.caltech.edu
Subject: Re: [condor-support #1694] LIGO Ticket: Condor 6.8.2 x86_64     
 Upgrade  [Fwd: [Condor] Problem]
X-Enigmail-Version: 0.94.1.0
Openpgp: url=http://pgp.mit.edu/
X-No-Archive: Yes
X-Archive: No

Greetings,

I've seen about 8 of these today. Upon investigation it doesn't seem to 
be limited to jobs run by one user, but rather scattered between a few.

We are running condor 6.8.2 x86_64 release. Any ideas or further 
information I can provide?

No core files were left behind. Our startd is running with D_FULLDEBUG. 
I have not put in the new debugging startd, is this still necessary.

Thanks,
Erik A. Espinoza

-------- Original Message --------
Subject: [Condor] Problem
Date: Wed, 25 Oct 2006 12:51:16 -0700
From: condor__AT__node276.ldas-cit.ligo.caltech.edu
To: ldas_admin_cit__AT__ligo.caltech.edu

This is an automated email from the Condor system
on machine "node276.ldas-cit.ligo.caltech.edu".  Do not reply.

"/ldcg/condor/sbin/condor_startd" on "node276.ldas-cit.ligo.caltech.edu" 
died due to signal 6.
Condor will automatically restart this process in 10 seconds.

*** Last 200 line(s) of file StartLog:
10/25 12:35:46 ProcAPI::buildFamily() Found daddypid on the system: 6498
10/25 12:36:11 ProcAPI::buildFamily() Found daddypid on the system: 3792
10/25 12:36:17 Getting monitoring info for pid 26355
10/25 12:36:22 ProcAPI::buildFamily() Found daddypid on the system: 8442
10/25 12:36:37 ProcAPI::buildFamily() Found daddypid on the system: 3833
10/25 12:36:37 ProcAPI::buildFamily() Found daddypid on the system: 6498
10/25 12:37:02 ProcAPI::buildFamily() Found daddypid on the system: 3792
10/25 12:37:12 ProcAPI::buildFamily() Found daddypid on the system: 8442
10/25 12:37:27 ProcAPI::buildFamily() Found daddypid on the system: 3833
10/25 12:37:27 ProcAPI::buildFamily() Found daddypid on the system: 6498
10/25 12:37:36 Swap space: 10457612
10/25 12:37:36 278592812 kbytes available for "/usr1/condor/execute"
10/25 12:37:36 Looking up RESERVED_DISK parameter
10/25 12:37:36 Reserving 5120 kbytes for file system
10/25 12:37:36 Disk space: 278587692
10/25 12:37:40 Trying to update collector <10.14.0.12:9618>
10/25 12:37:40 Attempting to send update via UDP to collector 
ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
10/25 12:37:40 vm1: Sent update to 1 collector(s)
10/25 12:37:41 Trying to update collector <10.14.0.12:9618>
10/25 12:37:41 Attempting to send update via UDP to collector 
ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
10/25 12:37:41 vm2: Sent update to 1 collector(s)
10/25 12:37:42 Trying to update collector <10.14.0.12:9618>
10/25 12:37:42 Attempting to send update via UDP to collector 
ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
10/25 12:37:42 vm3: Sent update to 1 collector(s)
10/25 12:37:43 Trying to update collector <10.14.0.12:9618>
10/25 12:37:43 Attempting to send update via UDP to collector 
ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
10/25 12:37:43 vm4: Sent update to 1 collector(s)
10/25 12:37:52 ProcAPI::buildFamily() Found daddypid on the system: 3792
10/25 12:37:55 DaemonCore: Command received via UDP from host 
<10.14.0.12:43743>
10/25 12:37:55 DaemonCore: received command 441 (ALIVE), calling handler 
(command_handler)
10/25 12:37:55 DaemonCore: Command received via UDP from host 
<10.14.0.12:44007>
10/25 12:37:55 DaemonCore: received command 441 (ALIVE), calling handler 
(command_handler)
10/25 12:37:55 DaemonCore: Command received via UDP from host 
<10.14.0.12:44007>
10/25 12:37:55 DaemonCore: received command 441 (ALIVE), calling handler 
(command_handler)
10/25 12:37:55 DaemonCore: Command received via UDP from host 
<10.14.0.12:44007>
10/25 12:37:55 DaemonCore: received command 441 (ALIVE), calling handler 
(command_handler)
10/25 12:38:02 ProcAPI::buildFamily() Found daddypid on the system: 8442
10/25 12:38:17 ProcAPI::buildFamily() Found daddypid on the system: 3833
10/25 12:38:17 ProcAPI::buildFamily() Found daddypid on the system: 6498
10/25 12:38:42 ProcAPI::buildFamily() Found daddypid on the system: 3792
10/25 12:38:52 ProcAPI::buildFamily() Found daddypid on the system: 8442
10/25 12:39:07 ProcAPI::buildFamily() Found daddypid on the system: 3833
10/25 12:39:07 ProcAPI::buildFamily() Found daddypid on the system: 6498
10/25 12:39:33 ProcAPI::buildFamily() Found daddypid on the system: 3792
10/25 12:39:43 ProcAPI::buildFamily() Found daddypid on the system: 8442
10/25 12:39:47 DaemonCore: Command received via UDP from host 
<10.14.2.26:35676>
10/25 12:39:47 DaemonCore: received command 60008 (DC_CHILDALIVE), 
calling handler (HandleChildAliveCommand)
10/25 12:39:59 ProcAPI::buildFamily() Found daddypid on the system: 3833
10/25 12:39:59 ProcAPI::buildFamily() Found daddypid on the system: 6498
10/25 12:40:17 Getting monitoring info for pid 26355
10/25 12:40:24 ProcAPI::buildFamily() Found daddypid on the system: 3792
10/25 12:40:34 ProcAPI::buildFamily() Found daddypid on the system: 8442
10/25 12:40:49 ProcAPI::buildFamily() Found daddypid on the system: 3833
10/25 12:40:49 ProcAPI::buildFamily() Found daddypid on the system: 6498
10/25 12:41:08 DaemonCore: Command received via UDP from host 
<10.14.2.26:35676>
10/25 12:41:08 DaemonCore: received command 60008 (DC_CHILDALIVE), 
calling handler (HandleChildAliveCommand)
10/25 12:41:14 ProcAPI::buildFamily() Found daddypid on the system: 3792
10/25 12:41:25 ProcAPI::buildFamily() Found daddypid on the system: 8442
10/25 12:41:40 ProcAPI::buildFamily() Found daddypid on the system: 3833
10/25 12:41:40 ProcAPI::buildFamily() Found daddypid on the system: 6498
10/25 12:42:05 ProcAPI::buildFamily() Found daddypid on the system: 3792
10/25 12:42:15 ProcAPI::buildFamily() Found daddypid on the system: 8442
10/25 12:42:21 DaemonCore: Command received via UDP from host 
<10.14.2.26:35676>
10/25 12:42:21 DaemonCore: received command 60008 (DC_CHILDALIVE), 
calling handler (HandleChildAliveCommand)
10/25 12:42:30 ProcAPI::buildFamily() Found daddypid on the system: 3833
10/25 12:42:30 ProcAPI::buildFamily() Found daddypid on the system: 6498
10/25 12:42:36 Swap space: 10457612
10/25 12:42:36 278592796 kbytes available for "/usr1/condor/execute"
10/25 12:42:36 Looking up RESERVED_DISK parameter
10/25 12:42:36 Reserving 5120 kbytes for file system
10/25 12:42:36 Disk space: 278587676
10/25 12:42:40 Trying to update collector <10.14.0.12:9618>
10/25 12:42:40 Attempting to send update via UDP to collector 
ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
10/25 12:42:40 vm1: Sent update to 1 collector(s)
10/25 12:42:41 Trying to update collector <10.14.0.12:9618>
10/25 12:42:41 Attempting to send update via UDP to collector 
ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
10/25 12:42:41 vm2: Sent update to 1 collector(s)
10/25 12:42:42 Trying to update collector <10.14.0.12:9618>
10/25 12:42:42 Attempting to send update via UDP to collector 
ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
10/25 12:42:42 vm3: Sent update to 1 collector(s)
10/25 12:42:43 Trying to update collector <10.14.0.12:9618>
10/25 12:42:43 Attempting to send update via UDP to collector 
ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
10/25 12:42:43 vm4: Sent update to 1 collector(s)
10/25 12:42:55 ProcAPI::buildFamily() Found daddypid on the system: 3792
10/25 12:42:56 DaemonCore: Command received via UDP from host 
<10.14.0.12:50791>
10/25 12:42:56 DaemonCore: received command 441 (ALIVE), calling handler 
(command_handler)
10/25 12:42:56 DaemonCore: Command received via UDP from host 
<10.14.0.12:50791>
10/25 12:42:56 DaemonCore: received command 441 (ALIVE), calling handler 
(command_handler)
10/25 12:42:56 DaemonCore: Command received via UDP from host 
<10.14.0.12:50791>
10/25 12:42:56 DaemonCore: received command 441 (ALIVE), calling handler 
(command_handler)
10/25 12:42:56 DaemonCore: Command received via UDP from host 
<10.14.0.12:50791>
10/25 12:42:56 DaemonCore: received command 441 (ALIVE), calling handler 
(command_handler)
10/25 12:43:05 ProcAPI::buildFamily() Found daddypid on the system: 8442
10/25 12:43:20 ProcAPI::buildFamily() Found daddypid on the system: 3833
10/25 12:43:20 ProcAPI::buildFamily() Found daddypid on the system: 6498
10/25 12:43:45 ProcAPI::buildFamily() Found daddypid on the system: 3792
10/25 12:43:55 ProcAPI::buildFamily() Found daddypid on the system: 8442
10/25 12:44:10 ProcAPI::buildFamily() Found daddypid on the system: 3833
10/25 12:44:10 ProcAPI::buildFamily() Found daddypid on the system: 6498
10/25 12:44:17 Getting monitoring info for pid 26355
10/25 12:44:35 ProcAPI::buildFamily() Found daddypid on the system: 3792
10/25 12:44:45 ProcAPI::buildFamily() Found daddypid on the system: 8442
10/25 12:45:00 ProcAPI::buildFamily() Found daddypid on the system: 3833
10/25 12:45:00 ProcAPI::buildFamily() Found daddypid on the system: 6498
10/25 12:45:25 ProcAPI::buildFamily() Found daddypid on the system: 3792
10/25 12:45:35 ProcAPI::buildFamily() Found daddypid on the system: 8442
10/25 12:45:51 ProcAPI::buildFamily() Found daddypid on the system: 3833
10/25 12:45:51 ProcAPI::buildFamily() Found daddypid on the system: 6498
10/25 12:46:16 ProcAPI::buildFamily() Found daddypid on the system: 3792
10/25 12:46:26 ProcAPI::buildFamily() Found daddypid on the system: 8442
10/25 12:46:41 ProcAPI::buildFamily() Found daddypid on the system: 3833
10/25 12:46:41 ProcAPI::buildFamily() Found daddypid on the system: 6498
10/25 12:47:06 ProcAPI::buildFamily() Found daddypid on the system: 3792
10/25 12:47:16 ProcAPI::buildFamily() Found daddypid on the system: 8442
10/25 12:47:31 ProcAPI::buildFamily() Found daddypid on the system: 3833
10/25 12:47:31 ProcAPI::buildFamily() Found daddypid on the system: 6498
10/25 12:47:36 Swap space: 10457612
10/25 12:47:36 278592784 kbytes available for "/usr1/condor/execute"
10/25 12:47:36 Looking up RESERVED_DISK parameter
10/25 12:47:36 Reserving 5120 kbytes for file system
10/25 12:47:36 Disk space: 278587664
10/25 12:47:40 Trying to update collector <10.14.0.12:9618>
10/25 12:47:40 Attempting to send update via UDP to collector 
ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
10/25 12:47:40 vm1: Sent update to 1 collector(s)
10/25 12:47:41 Trying to update collector <10.14.0.12:9618>
10/25 12:47:41 Attempting to send update via UDP to collector 
ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
10/25 12:47:41 vm2: Sent update to 1 collector(s)
10/25 12:47:42 Trying to update collector <10.14.0.12:9618>
10/25 12:47:42 Attempting to send update via UDP to collector 
ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
10/25 12:47:42 vm3: Sent update to 1 collector(s)
10/25 12:47:43 Trying to update collector <10.14.0.12:9618>
10/25 12:47:43 Attempting to send update via UDP to collector 
ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
10/25 12:47:43 vm4: Sent update to 1 collector(s)
10/25 12:47:56 DaemonCore: Command received via UDP from host 
<10.14.0.12:35149>
10/25 12:47:56 DaemonCore: received command 441 (ALIVE), calling handler 
(command_handler)
10/25 12:47:56 ProcAPI::buildFamily() Found daddypid on the system: 3792
10/25 12:47:56 DaemonCore: Command received via UDP from host 
<10.14.0.12:35149>
10/25 12:47:56 DaemonCore: received command 441 (ALIVE), calling handler 
(command_handler)
10/25 12:47:56 DaemonCore: Command received via UDP from host 
<10.14.0.12:35175>
10/25 12:47:56 DaemonCore: received command 441 (ALIVE), calling handler 
(command_handler)
10/25 12:47:56 DaemonCore: Command received via UDP from host 
<10.14.0.12:35175>
10/25 12:47:56 DaemonCore: received command 441 (ALIVE), calling handler 
(command_handler)
10/25 12:48:06 ProcAPI::buildFamily() Found daddypid on the system: 8442
10/25 12:48:17 Getting monitoring info for pid 26355
10/25 12:48:21 ProcAPI::buildFamily() Found daddypid on the system: 3833
10/25 12:48:21 ProcAPI::buildFamily() Found daddypid on the system: 6498
10/25 12:48:46 ProcAPI::buildFamily() Found daddypid on the system: 3792
10/25 12:48:56 ProcAPI::buildFamily() Found daddypid on the system: 8442
10/25 12:49:11 ProcAPI::buildFamily() Found daddypid on the system: 3833
10/25 12:49:11 ProcAPI::buildFamily() Found daddypid on the system: 6498
10/25 12:49:36 ProcAPI::buildFamily() Found daddypid on the system: 3792
10/25 12:49:46 ProcAPI::buildFamily() Found daddypid on the system: 8442
10/25 12:50:01 ProcAPI::buildFamily() Found daddypid on the system: 3833
10/25 12:50:01 ProcAPI::buildFamily() Found daddypid on the system: 6498
10/25 12:50:26 ProcAPI::buildFamily() Found daddypid on the system: 3792
10/25 12:50:36 ProcAPI::buildFamily() Found daddypid on the system: 8442
10/25 12:50:52 ProcAPI::buildFamily() Found daddypid on the system: 3833
10/25 12:50:52 ProcAPI::buildFamily() Found daddypid on the system: 6498
10/25 12:51:01 vm3: match_info called
10/25 12:51:01 vm3: Started match timer (27147) for 120 seconds.
10/25 12:51:03 vm3: Canceled match timer (27147)
10/25 12:51:03 vm3: Schedd addr = <10.14.0.12:43253>
10/25 12:51:03 vm3: Alive interval = 300
10/25 12:51:03 vm3: Received ClaimId from schedd 
(<10.14.2.26:55212>#1161142062#1464)
10/25 12:51:03 vm3: Rank of this claim is: 0.000000
10/25 12:51:03 vm3: Preempting claim has correct ClaimId.
10/25 12:51:03 vm3: New claim has sufficient rank, preempting current claim.
10/25 12:51:03 vm3: Got RemoteUser (thorne@ligo) from request classad
10/25 12:51:03 vm3: State change: preempting claim based on user priority
10/25 12:51:03 vm3: State change: retiring due to preempting claim
10/25 12:51:03 vm3: Changing activity: Busy -> Retiring
10/25 12:51:03 vm3: State change: retirement ended/expired
10/25 12:51:03 vm3: Canceled ClaimLease timer (26496)
10/25 12:51:03 vm3: Changing state and activity: Claimed/Retiring -> 
Preempting/Vacating
10/25 12:51:03 vm3: In Starter::kill() with pid 3792, sig 15 (SIGTERM)
10/25 12:51:03 vm3: Got KILL_FRGN_JOB while in Preempting state, ignoring.
10/25 12:51:03 DaemonCore: No more children processes to reap.
10/25 12:51:03 Starter pid 3792 exited with status 0
10/25 12:51:03 ProcAPI::buildFamily failed: parent 3792 not found on system.
10/25 12:51:03 ProcAPI::getProcInfo() pid 3792 does not exist.
10/25 12:51:03 ProcAPI::getProcInfo() pid 3792 does not exist.
10/25 12:51:03 ProcAPI::getProcInfo() pid 3792 does not exist.
10/25 12:51:03 ProcAPI::getProcInfo() pid 3792 does not exist.
10/25 12:51:03 ProcAPI::getProcInfo() pid 3792 does not exist.
10/25 12:51:03 ProcAPI::getProcInfo() pid 3793 does not exist.
10/25 12:51:03 ProcAPI::getProcInfo() pid 3793 does not exist.
10/25 12:51:03 ProcAPI::getProcInfo() pid 3793 does not exist.
10/25 12:51:03 ProcAPI::getProcInfo() pid 3793 does not exist.
10/25 12:51:03 ProcAPI::getProcInfo() pid 3793 does not exist.
10/25 12:51:03 ProcAPI::getProcInfo() pid 3794 does not exist.
10/25 12:51:03 ProcAPI::getProcInfo() pid 3794 does not exist.
10/25 12:51:03 ProcAPI::getProcInfo() pid 3794 does not exist.
10/25 12:51:03 ProcAPI::getProcInfo() pid 3794 does not exist.
10/25 12:51:03 ProcAPI::getProcInfo() pid 3794 does not exist.
10/25 12:51:03 ProcAPI::getProcInfo() pid 3795 does not exist.
10/25 12:51:03 ProcAPI::getProcInfo() pid 3795 does not exist.
10/25 12:51:03 ProcAPI::getProcInfo() pid 3795 does not exist.
10/25 12:51:03 ProcAPI::getProcInfo() pid 3795 does not exist.
10/25 12:51:03 ProcAPI::getProcInfo() pid 3795 does not exist.
10/25 12:51:03 ProcAPI::buildFamily failed: parent 3795 not found on system.
*** End of file StartLog



-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Questions about this message or Condor in general?
Email address of the local Condor administrator: 
ldas_admin_cit__AT__ligo.caltech.edu
The Official Condor Homepage is http://www.cs.wisc.edu/condor


condor-support response tracking system wrote:
> ok, a new version of the condor_startd binary with the (hopefully  
> better) logging info in the case of signal 6 is here:
> 
> ftp://ftp.cs.wisc.edu/condor/temporary/forligo/x86_64_rhel3/ 
> condor-6.8.3-pre-x86_64_rhel3.tar.gz
> 
> in there, there's a .static and .dynamic version of the startd.  i  
> gave you the unstripped version, so that we could do better  
> interactive debugging with it, too.
> 
> so, ideally, you'd install this on all your x86_64 boxes and leave  
> the D_FULLDEBUG and OBITUARY_LOG_LENGTH stuff enabled.  hopefully,  
> the next time you get an email about this, we should see a message at  
> the end of the log saying what assert() is failing.  if that alone  
> isn't enough for me to resolve the bug, we can attempt to catch it in  
> the act via attaching to a few startds using gdb, and leaving them  
> that way until they die.  this will be harder to notice, since we  
> won't get email about it until the startd exists (which won't happen  
> if we're still attached via gdb), but it's an option of last resort  
> if we need to go that route.
> 
> let me know how it goes, or if you have any trouble downloading these  
> new pre-release binaries.
> 
> thanks,
> -derek
> 
> 
> 
> 
> 
> ========================================
> MESSAGE INFORMATION
> ========================================
> * From: Derek Wright <wright__AT__cs.wisc.edu>
> * Ticket Email List: espinoza__AT__ligo.caltech.edu, anderson__AT__ligo.caltech.edu
> 

-- 
Erik A. Espinoza
Systems Administrator
LIGO/Caltech - MS 18-34
Pasadena, CA 91125
Ph: 626-395-8517

===========================================================================
Date mail was appended: Wed Oct 25 15:25:52 2006 (1161807952)
From: Derek Wright <wright__AT__cs.wisc.edu>
Subject: Re: [condor-support #1694] LIGO Ticket: Condor 6.8.2 x86_64    
 Upgrade [Fwd: [Condor] Problem]
Date: Wed, 25 Oct 2006 13:45:25 -0700
To: condor-support__AT__cs.wisc.edu


On Oct 25, 2006, at 1:25 PM, condor-support response tracking system  
wrote:

> I have not put in the new debugging startd, is this still necessary.

that'd be nice.  the logs you provide suggest a possible clue, but i  
think the debugging binaries i sent will be much more explicit about  
what's happening.  if it's not a huge pain for you, i'd really love  
to see the logs the new binaries will generate in this case.

thanks,
-derek


===========================================================================
Date mail was appended: Wed Oct 25 15:45:33 2006 (1161809133)
Subject: Actions

Status changed from open to pending by wright
===========================================================================
Date of actions: Wed Oct 25 15:45:49 2006 (1161809150)
Date: Thu, 26 Oct 2006 18:31:23 -0700
From: "Erik A. Espinoza" <espinoza__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu
CC: anderson__AT__ligo.caltech.edu
Subject: Re: [condor-support #1694] LIGO Ticket: Condor 6.8.2 x86_64     
 Upgrade  [Fwd: [Condor] Problem]
X-Enigmail-Version: 0.94.1.0
Openpgp: url=http://pgp.mit.edu/
X-No-Archive: Yes
X-Archive: No

Here is a log from the new binary.



-------- Original Message --------
Subject: [Condor] Problem
Date: Thu, 26 Oct 2006 17:59:29 -0700
From: condor__AT__node305.ldas-cit.ligo.caltech.edu
To: ldas_admin_cit__AT__ligo.caltech.edu

This is an automated email from the Condor system
on machine "node305.ldas-cit.ligo.caltech.edu".  Do not reply.

"/ldcg/condor/sbin/condor_startd" on "node305.ldas-cit.ligo.caltech.edu" 
died due to signal 6.
Condor will automatically restart this process in 10 seconds.

*** Last 200 line(s) of file StartLog:
10/26 17:50:11 DaemonCore: received command 441 (ALIVE), calling handler 
(command_handler)
10/26 17:50:11 DaemonCore: Command received via UDP from host 
<10.14.0.12:43212>
10/26 17:50:11 DaemonCore: received command 441 (ALIVE), calling handler 
(command_handler)
10/26 17:50:19 Swap space: 10482320
10/26 17:50:19 331462136 kbytes available for "/usr1/condor/execute"
10/26 17:50:19 Looking up RESERVED_DISK parameter
10/26 17:50:19 Reserving 5120 kbytes for file system
10/26 17:50:19 Disk space: 331457016
10/26 17:50:23 Trying to update collector <10.14.0.12:9618>
10/26 17:50:23 Attempting to send update via UDP to collector 
ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
10/26 17:50:23 vm1: Sent update to 1 collector(s)
10/26 17:50:24 Trying to update collector <10.14.0.12:9618>
10/26 17:50:24 Attempting to send update via UDP to collector 
ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
10/26 17:50:24 vm2: Sent update to 1 collector(s)
10/26 17:50:25 Trying to update collector <10.14.0.12:9618>
10/26 17:50:25 Attempting to send update via UDP to collector 
ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
10/26 17:50:25 vm3: Sent update to 1 collector(s)
10/26 17:50:26 Trying to update collector <10.14.0.12:9618>
10/26 17:50:26 Attempting to send update via UDP to collector 
ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
10/26 17:50:26 vm4: Sent update to 1 collector(s)
10/26 17:50:26 ProcAPI::buildFamily() Found daddypid on the system: 24814
10/26 17:50:31 ProcAPI::buildFamily() Found daddypid on the system: 24815
10/26 17:50:36 ProcAPI::buildFamily() Found daddypid on the system: 24743
10/26 17:50:41 ProcAPI::buildFamily() Found daddypid on the system: 19548
10/26 17:50:52 DaemonCore: Command received via UDP from host 
<10.14.2.55:35834>
10/26 17:50:52 DaemonCore: received command 60008 (DC_CHILDALIVE), 
calling handler (HandleChildAliveCommand)
10/26 17:50:52 DaemonCore: Command received via UDP from host 
<10.14.2.55:35834>
10/26 17:50:52 DaemonCore: received command 60008 (DC_CHILDALIVE), 
calling handler (HandleChildAliveCommand)
10/26 17:51:16 ProcAPI::buildFamily() Found daddypid on the system: 24814
10/26 17:51:17 Getting monitoring info for pid 5271
10/26 17:51:21 ProcAPI::buildFamily() Found daddypid on the system: 24815
10/26 17:51:26 ProcAPI::buildFamily() Found daddypid on the system: 24743
10/26 17:51:31 ProcAPI::buildFamily() Found daddypid on the system: 19548
10/26 17:52:06 ProcAPI::buildFamily() Found daddypid on the system: 24814
10/26 17:52:11 ProcAPI::buildFamily() Found daddypid on the system: 24815
10/26 17:52:16 ProcAPI::buildFamily() Found daddypid on the system: 24743
10/26 17:52:21 ProcAPI::buildFamily() Found daddypid on the system: 19548
10/26 17:52:41 DaemonCore: Command received via UDP from host 
<10.14.2.55:35837>
10/26 17:52:41 DaemonCore: received command 60008 (DC_CHILDALIVE), 
calling handler (HandleChildAliveCommand)
10/26 17:52:56 ProcAPI::buildFamily() Found daddypid on the system: 24814
10/26 17:53:01 ProcAPI::buildFamily() Found daddypid on the system: 24815
10/26 17:53:06 ProcAPI::buildFamily() Found daddypid on the system: 24743
10/26 17:53:11 ProcAPI::buildFamily() Found daddypid on the system: 19548
10/26 17:53:46 ProcAPI::buildFamily() Found daddypid on the system: 24814
10/26 17:53:51 ProcAPI::buildFamily() Found daddypid on the system: 24815
10/26 17:53:56 ProcAPI::buildFamily() Found daddypid on the system: 24743
10/26 17:54:01 ProcAPI::buildFamily() Found daddypid on the system: 19548
10/26 17:54:36 ProcAPI::buildFamily() Found daddypid on the system: 24814
10/26 17:54:41 ProcAPI::buildFamily() Found daddypid on the system: 24815
10/26 17:54:46 ProcAPI::buildFamily() Found daddypid on the system: 24743
10/26 17:54:51 ProcAPI::buildFamily() Found daddypid on the system: 19548
10/26 17:55:11 DaemonCore: Command received via UDP from host 
<10.14.0.12:57676>
10/26 17:55:11 DaemonCore: received command 441 (ALIVE), calling handler 
(command_handler)
10/26 17:55:11 DaemonCore: Command received via UDP from host 
<10.14.0.12:57676>
10/26 17:55:11 DaemonCore: received command 441 (ALIVE), calling handler 
(command_handler)
10/26 17:55:11 DaemonCore: Command received via UDP from host 
<10.14.0.12:57676>
10/26 17:55:11 DaemonCore: received command 441 (ALIVE), calling handler 
(command_handler)
10/26 17:55:11 DaemonCore: Command received via UDP from host 
<10.14.0.12:57676>
10/26 17:55:11 DaemonCore: received command 441 (ALIVE), calling handler 
(command_handler)
10/26 17:55:17 Getting monitoring info for pid 5271
10/26 17:55:19 Swap space: 10482320
10/26 17:55:19 331461904 kbytes available for "/usr1/condor/execute"
10/26 17:55:19 Looking up RESERVED_DISK parameter
10/26 17:55:19 Reserving 5120 kbytes for file system
10/26 17:55:19 Disk space: 331456784
10/26 17:55:23 Trying to update collector <10.14.0.12:9618>
10/26 17:55:23 Attempting to send update via UDP to collector 
ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
10/26 17:55:23 vm1: Sent update to 1 collector(s)
10/26 17:55:24 Trying to update collector <10.14.0.12:9618>
10/26 17:55:24 Attempting to send update via UDP to collector 
ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
10/26 17:55:24 vm2: Sent update to 1 collector(s)
10/26 17:55:25 Trying to update collector <10.14.0.12:9618>
10/26 17:55:25 Attempting to send update via UDP to collector 
ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
10/26 17:55:25 vm3: Sent update to 1 collector(s)
10/26 17:55:26 Trying to update collector <10.14.0.12:9618>
10/26 17:55:26 Attempting to send update via UDP to collector 
ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
10/26 17:55:26 vm4: Sent update to 1 collector(s)
10/26 17:55:26 ProcAPI::buildFamily() Found daddypid on the system: 24814
10/26 17:55:31 ProcAPI::buildFamily() Found daddypid on the system: 24815
10/26 17:55:36 ProcAPI::buildFamily() Found daddypid on the system: 24743
10/26 17:55:41 ProcAPI::buildFamily() Found daddypid on the system: 19548
10/26 17:56:16 ProcAPI::buildFamily() Found daddypid on the system: 24814
10/26 17:56:21 ProcAPI::buildFamily() Found daddypid on the system: 24815
10/26 17:56:26 ProcAPI::buildFamily() Found daddypid on the system: 24743
10/26 17:56:31 ProcAPI::buildFamily() Found daddypid on the system: 19548
10/26 17:57:06 ProcAPI::buildFamily() Found daddypid on the system: 24814
10/26 17:57:11 ProcAPI::buildFamily() Found daddypid on the system: 24815
10/26 17:57:16 ProcAPI::buildFamily() Found daddypid on the system: 24743
10/26 17:57:21 ProcAPI::buildFamily() Found daddypid on the system: 19548
10/26 17:57:56 ProcAPI::buildFamily() Found daddypid on the system: 24814
10/26 17:58:01 ProcAPI::buildFamily() Found daddypid on the system: 24815
10/26 17:58:06 ProcAPI::buildFamily() Found daddypid on the system: 24743
10/26 17:58:11 ProcAPI::buildFamily() Found daddypid on the system: 19548
10/26 17:58:46 ProcAPI::buildFamily() Found daddypid on the system: 24814
10/26 17:58:52 ProcAPI::buildFamily() Found daddypid on the system: 24815
10/26 17:58:57 ProcAPI::buildFamily() Found daddypid on the system: 24743
10/26 17:59:02 ProcAPI::buildFamily() Found daddypid on the system: 19548
10/26 17:59:03 vm4: Called deactivate_claim_forcibly()
10/26 17:59:03 vm4: In Starter::kill() with pid 24814, sig 3 (SIGQUIT)
10/26 17:59:03 DaemonCore: No more children processes to reap.
10/26 17:59:03 Starter pid 24814 exited with status 0
10/26 17:59:03 vm4: Canceled hardkill-starter timer (2684)
10/26 17:59:03 ProcAPI::buildFamily failed: parent 24814 not found on 
system.
10/26 17:59:03 ProcAPI::getProcInfo() pid 24814 does not exist.
10/26 17:59:03 ProcAPI::getProcInfo() pid 24814 does not exist.
10/26 17:59:03 ProcAPI::getProcInfo() pid 24814 does not exist.
10/26 17:59:03 ProcAPI::getProcInfo() pid 24814 does not exist.
10/26 17:59:03 ProcAPI::getProcInfo() pid 24814 does not exist.
10/26 17:59:03 ProcAPI::getProcInfo() pid 24853 does not exist.
10/26 17:59:03 ProcAPI::getProcInfo() pid 24853 does not exist.
10/26 17:59:03 ProcAPI::getProcInfo() pid 24853 does not exist.
10/26 17:59:03 ProcAPI::getProcInfo() pid 24853 does not exist.
10/26 17:59:03 ProcAPI::getProcInfo() pid 24853 does not exist.
10/26 17:59:03 ProcAPI::getProcInfo() pid 24855 does not exist.
10/26 17:59:03 ProcAPI::getProcInfo() pid 24855 does not exist.
10/26 17:59:03 ProcAPI::getProcInfo() pid 24855 does not exist.
10/26 17:59:03 ProcAPI::getProcInfo() pid 24855 does not exist.
10/26 17:59:03 ProcAPI::getProcInfo() pid 24855 does not exist.
10/26 17:59:03 vm4: State change: starter exited
10/26 17:59:03 vm4: Changing activity: Busy -> Idle
10/26 17:59:03 vm4: Computing claimWorklifeExpired(); ClaimAge=803, 
ClaimWorklife=1800
10/26 17:59:03 vm2: Called deactivate_claim_forcibly()
10/26 17:59:03 vm2: In Starter::kill() with pid 24815, sig 3 (SIGQUIT)
10/26 17:59:03 DaemonCore: No more children processes to reap.
10/26 17:59:03 Starter pid 24815 exited with status 0
10/26 17:59:03 vm2: Canceled hardkill-starter timer (2686)
10/26 17:59:03 ProcAPI::buildFamily failed: parent 24815 not found on 
system.
10/26 17:59:03 ProcAPI::getProcInfo() pid 24815 does not exist.
10/26 17:59:03 ProcAPI::getProcInfo() pid 24815 does not exist.
10/26 17:59:03 ProcAPI::getProcInfo() pid 24815 does not exist.
10/26 17:59:03 ProcAPI::getProcInfo() pid 24815 does not exist.
10/26 17:59:03 ProcAPI::getProcInfo() pid 24815 does not exist.
10/26 17:59:03 ProcAPI::getProcInfo() pid 24854 does not exist.
10/26 17:59:03 ProcAPI::getProcInfo() pid 24854 does not exist.
10/26 17:59:03 ProcAPI::getProcInfo() pid 24854 does not exist.
10/26 17:59:03 ProcAPI::getProcInfo() pid 24854 does not exist.
10/26 17:59:03 ProcAPI::getProcInfo() pid 24854 does not exist.
10/26 17:59:03 ProcAPI::getProcInfo() pid 24856 does not exist.
10/26 17:59:03 ProcAPI::getProcInfo() pid 24856 does not exist.
10/26 17:59:03 ProcAPI::getProcInfo() pid 24856 does not exist.
10/26 17:59:03 ProcAPI::getProcInfo() pid 24856 does not exist.
10/26 17:59:03 ProcAPI::getProcInfo() pid 24856 does not exist.
10/26 17:59:03 vm2: State change: starter exited
10/26 17:59:03 vm2: Changing activity: Busy -> Idle
10/26 17:59:03 vm2: Computing claimWorklifeExpired(); ClaimAge=803, 
ClaimWorklife=1800
10/26 17:59:07 vm4: Computing claimWorklifeExpired(); ClaimAge=807, 
ClaimWorklife=1800
10/26 17:59:07 vm2: Computing claimWorklifeExpired(); ClaimAge=807, 
ClaimWorklife=1800
10/26 17:59:08 Trying to update collector <10.14.0.12:9618>
10/26 17:59:08 Attempting to send update via UDP to collector 
ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
10/26 17:59:08 vm2: Sent update to 1 collector(s)
10/26 17:59:10 Trying to update collector <10.14.0.12:9618>
10/26 17:59:10 Attempting to send update via UDP to collector 
ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
10/26 17:59:10 vm4: Sent update to 1 collector(s)
10/26 17:59:12 vm4: Computing claimWorklifeExpired(); ClaimAge=812, 
ClaimWorklife=1800
10/26 17:59:12 vm2: Computing claimWorklifeExpired(); ClaimAge=812, 
ClaimWorklife=1800
10/26 17:59:17 vm3: match_info called
10/26 17:59:17 vm3: Started match timer (2689) for 120 seconds.
10/26 17:59:17 Getting monitoring info for pid 5271
10/26 17:59:17 vm4: Computing claimWorklifeExpired(); ClaimAge=817, 
ClaimWorklife=1800
10/26 17:59:17 vm2: Computing claimWorklifeExpired(); ClaimAge=817, 
ClaimWorklife=1800
10/26 17:59:17 vm3: Canceled match timer (2689)
10/26 17:59:17 vm3: Schedd addr = <10.14.0.12:43253>
10/26 17:59:17 vm3: Alive interval = 300
10/26 17:59:17 vm3: Received ClaimId from schedd 
(<10.14.2.55:38265>#1161828907#42)
10/26 17:59:17 vm3: Rank of this claim is: 0.000000
10/26 17:59:17 vm3: Preempting claim has correct ClaimId.
10/26 17:59:17 vm3: New claim has sufficient rank, preempting current claim.
10/26 17:59:17 vm3: Got RemoteUser (vmandic@ligo) from request classad
10/26 17:59:17 vm3: State change: preempting claim based on user priority
10/26 17:59:17 vm3: State change: retiring due to preempting claim
10/26 17:59:17 vm3: Changing activity: Busy -> Retiring
10/26 17:59:17 vm3: State change: retirement ended/expired
10/26 17:59:17 vm3: Canceled ClaimLease timer (2008)
10/26 17:59:17 vm3: Changing state and activity: Claimed/Retiring -> 
Preempting/Vacating
10/26 17:59:17 vm3: In Starter::kill() with pid 19548, sig 15 (SIGTERM)
10/26 17:59:17 vm3: Got KILL_FRGN_JOB while in Preempting state, ignoring.
10/26 17:59:17 DaemonCore: No more children processes to reap.
10/26 17:59:17 Starter pid 19548 exited with status 0
10/26 17:59:17 ProcAPI::buildFamily failed: parent 19548 not found on 
system.
10/26 17:59:17 ProcAPI::getProcInfo() pid 19548 does not exist.
10/26 17:59:17 ProcAPI::getProcInfo() pid 19548 does not exist.
10/26 17:59:17 ProcAPI::getProcInfo() pid 19548 does not exist.
10/26 17:59:17 ProcAPI::getProcInfo() pid 19548 does not exist.
10/26 17:59:17 ProcAPI::getProcInfo() pid 19548 does not exist.
10/26 17:59:17 ProcAPI::getProcInfo() pid 19549 does not exist.
10/26 17:59:17 ProcAPI::getProcInfo() pid 19549 does not exist.
10/26 17:59:17 ProcAPI::getProcInfo() pid 19549 does not exist.
10/26 17:59:17 ProcAPI::getProcInfo() pid 19549 does not exist.
10/26 17:59:17 ProcAPI::getProcInfo() pid 19549 does not exist.
10/26 17:59:17 ProcAPI::getProcInfo() pid 19550 does not exist.
10/26 17:59:17 ProcAPI::getProcInfo() pid 19550 does not exist.
10/26 17:59:17 ProcAPI::getProcInfo() pid 19550 does not exist.
10/26 17:59:17 ProcAPI::getProcInfo() pid 19550 does not exist.
10/26 17:59:17 ProcAPI::getProcInfo() pid 19550 does not exist.
10/26 17:59:17 ProcAPI::getProcInfo() pid 19551 does not exist.
10/26 17:59:17 ProcAPI::getProcInfo() pid 19551 does not exist.
10/26 17:59:17 ProcAPI::getProcInfo() pid 19551 does not exist.
10/26 17:59:17 ProcAPI::getProcInfo() pid 19551 does not exist.
10/26 17:59:17 ProcAPI::getProcInfo() pid 19551 does not exist.
10/26 17:59:17 ProcAPI::buildFamily failed: parent 19551 not found on 
system.
*** End of file StartLog



-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Questions about this message or Condor in general?
Email address of the local Condor administrator: 
ldas_admin_cit__AT__ligo.caltech.edu
The Official Condor Homepage is http://www.cs.wisc.edu/condor


condor-support response tracking system wrote:
> On Oct 25, 2006, at 1:25 PM, condor-support response tracking system  
> wrote:
> 
>> I have not put in the new debugging startd, is this still necessary.
> 
> that'd be nice.  the logs you provide suggest a possible clue, but i  
> think the debugging binaries i sent will be much more explicit about  
> what's happening.  if it's not a huge pain for you, i'd really love  
> to see the logs the new binaries will generate in this case.
> 
> thanks,
> -derek
> 
> 
> 
> ========================================
> MESSAGE INFORMATION
> ========================================
> * From: Derek Wright <wright__AT__cs.wisc.edu>
> * Ticket Email List: espinoza__AT__ligo.caltech.edu, anderson__AT__ligo.caltech.edu
> 

-- 
Erik A. Espinoza
Systems Administrator
LIGO/Caltech - MS 18-34
Pasadena, CA 91125
Ph: 626-395-8517

===========================================================================
Date mail was appended: Thu Oct 26 20:31:55 2006 (1161912716)
Date: Thu, 26 Oct 2006 20:45:23 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu
CC: Erik Espinoza <espinoza_e__AT__ligo.caltech.edu>
Subject: Re: [condor-support #1694] LIGO Ticket: Condor 6.8.2 x86_64     
 Upgrade [Fwd: [Condor] Problem]

In case it helps here is a second occurance from a different node:

This is an automated email from the Condor system
on machine "node101.ldas-cit.ligo.caltech.edu".  Do not reply.

"/ldcg/condor/sbin/condor_startd" on "node101.ldas-cit.ligo.caltech.edu" died due to signal 6.
Condor will automatically restart this process in 10 seconds.

*** Last 200 line(s) of file StartLog:
10/26 20:07:53 ProcAPI::buildFamily() Found daddypid on the system: 11573
10/26 20:08:29 ProcAPI::buildFamily() Found daddypid on the system: 31052
10/26 20:08:34 ProcAPI::buildFamily() Found daddypid on the system: 6282
10/26 20:08:44 ProcAPI::buildFamily() Found daddypid on the system: 28353
10/26 20:08:44 ProcAPI::buildFamily() Found daddypid on the system: 11573
10/26 20:09:19 ProcAPI::buildFamily() Found daddypid on the system: 31052
10/26 20:09:24 ProcAPI::buildFamily() Found daddypid on the system: 6282
10/26 20:09:34 ProcAPI::buildFamily() Found daddypid on the system: 28353
10/26 20:09:34 ProcAPI::buildFamily() Found daddypid on the system: 11573
10/26 20:10:09 Swap space: 10482320
10/26 20:10:09 161409676 kbytes available for "/usr1/condor/execute"
10/26 20:10:09 Looking up RESERVED_DISK parameter
10/26 20:10:09 Reserving 5120 kbytes for file system
10/26 20:10:09 Disk space: 161404556
10/26 20:10:09 ProcAPI::buildFamily() Found daddypid on the system: 31052
10/26 20:10:13 Trying to update collector <10.14.0.12:9618>
10/26 20:10:13 Attempting to send update via UDP to collector ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
10/26 20:10:13 vm1: Sent update to 1 collector(s)
10/26 20:10:14 Trying to update collector <10.14.0.12:9618>
10/26 20:10:14 Attempting to send update via UDP to collector ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
10/26 20:10:14 vm2: Sent update to 1 collector(s)
10/26 20:10:14 ProcAPI::buildFamily() Found daddypid on the system: 6282
10/26 20:10:15 Trying to update collector <10.14.0.12:9618>
10/26 20:10:15 Attempting to send update via UDP to collector ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
10/26 20:10:15 vm3: Sent update to 1 collector(s)
10/26 20:10:16 Trying to update collector <10.14.0.12:9618>
10/26 20:10:16 Attempting to send update via UDP to collector ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
10/26 20:10:16 vm4: Sent update to 1 collector(s)
10/26 20:10:24 ProcAPI::buildFamily() Found daddypid on the system: 28353
10/26 20:10:24 ProcAPI::buildFamily() Found daddypid on the system: 11573
10/26 20:10:36 DaemonCore: Command received via UDP from host <10.14.1.101:34054>
10/26 20:10:36 DaemonCore: received command 60008 (DC_CHILDALIVE), calling handler (HandleChildAliveCommand)
10/26 20:10:55 DaemonCore: Command received via UDP from host <10.14.0.12:33186>
10/26 20:10:55 DaemonCore: received command 441 (ALIVE), calling handler (command_handler)
10/26 20:10:56 DaemonCore: Command received via UDP from host <10.14.0.12:33186>
10/26 20:10:56 DaemonCore: received command 441 (ALIVE), calling handler (command_handler)
10/26 20:10:56 DaemonCore: Command received via UDP from host <10.14.0.12:33186>
10/26 20:10:56 DaemonCore: received command 441 (ALIVE), calling handler (command_handler)
10/26 20:10:56 DaemonCore: Command received via UDP from host <10.14.0.12:33186>
10/26 20:10:56 DaemonCore: received command 441 (ALIVE), calling handler (command_handler)
10/26 20:10:59 ProcAPI::buildFamily() Found daddypid on the system: 31052
10/26 20:11:04 ProcAPI::buildFamily() Found daddypid on the system: 6282
10/26 20:11:09 Getting monitoring info for pid 17398
10/26 20:11:14 ProcAPI::buildFamily() Found daddypid on the system: 28353
10/26 20:11:14 ProcAPI::buildFamily() Found daddypid on the system: 11573
10/26 20:11:49 ProcAPI::buildFamily() Found daddypid on the system: 31052
10/26 20:11:54 ProcAPI::buildFamily() Found daddypid on the system: 6282
10/26 20:12:04 ProcAPI::buildFamily() Found daddypid on the system: 28353
10/26 20:12:04 ProcAPI::buildFamily() Found daddypid on the system: 11573
10/26 20:12:39 ProcAPI::buildFamily() Found daddypid on the system: 31052
10/26 20:12:44 ProcAPI::buildFamily() Found daddypid on the system: 6282
10/26 20:12:54 ProcAPI::buildFamily() Found daddypid on the system: 28353
10/26 20:12:54 ProcAPI::buildFamily() Found daddypid on the system: 11573
10/26 20:13:22 DaemonCore: in SendAliveToParent()
10/26 20:13:22 DaemonCore: attempting to connect to '<10.14.1.101:49638>'
10/26 20:13:29 ProcAPI::buildFamily() Found daddypid on the system: 31052
10/26 20:13:34 ProcAPI::buildFamily() Found daddypid on the system: 6282
10/26 20:13:44 ProcAPI::buildFamily() Found daddypid on the system: 28353
10/26 20:13:44 ProcAPI::buildFamily() Found daddypid on the system: 11573
10/26 20:13:57 DaemonCore: Command received via UDP from host <10.14.1.101:34055>
10/26 20:13:57 DaemonCore: received command 60008 (DC_CHILDALIVE), calling handler (HandleChildAliveCommand)
10/26 20:14:19 ProcAPI::buildFamily() Found daddypid on the system: 31052
10/26 20:14:24 ProcAPI::buildFamily() Found daddypid on the system: 6282
10/26 20:14:34 ProcAPI::buildFamily() Found daddypid on the system: 28353
10/26 20:14:34 ProcAPI::buildFamily() Found daddypid on the system: 11573
10/26 20:15:10 Swap space: 10482320
10/26 20:15:10 161332216 kbytes available for "/usr1/condor/execute"
10/26 20:15:10 Looking up RESERVED_DISK parameter
10/26 20:15:10 Reserving 5120 kbytes for file system
10/26 20:15:10 Disk space: 161327096
10/26 20:15:10 ProcAPI::buildFamily() Found daddypid on the system: 31052
10/26 20:15:10 Getting monitoring info for pid 17398
10/26 20:15:14 Trying to update collector <10.14.0.12:9618>
10/26 20:15:14 Attempting to send update via UDP to collector ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
10/26 20:15:14 vm1: Sent update to 1 collector(s)
10/26 20:15:15 Trying to update collector <10.14.0.12:9618>
10/26 20:15:15 Attempting to send update via UDP to collector ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
10/26 20:15:15 vm2: Sent update to 1 collector(s)
10/26 20:15:15 ProcAPI::buildFamily() Found daddypid on the system: 6282
10/26 20:15:16 Trying to update collector <10.14.0.12:9618>
10/26 20:15:16 Attempting to send update via UDP to collector ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
10/26 20:15:16 vm3: Sent update to 1 collector(s)
10/26 20:15:17 Trying to update collector <10.14.0.12:9618>
10/26 20:15:17 Attempting to send update via UDP to collector ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
10/26 20:15:17 vm4: Sent update to 1 collector(s)
10/26 20:15:25 ProcAPI::buildFamily() Found daddypid on the system: 28353
10/26 20:15:25 ProcAPI::buildFamily() Found daddypid on the system: 11573
10/26 20:15:57 DaemonCore: Command received via UDP from host <10.14.0.12:56994>
10/26 20:15:57 DaemonCore: received command 441 (ALIVE), calling handler (command_handler)
10/26 20:15:57 DaemonCore: Command received via UDP from host <10.14.0.12:56994>
10/26 20:15:57 DaemonCore: received command 441 (ALIVE), calling handler (command_handler)
10/26 20:15:57 DaemonCore: Command received via UDP from host <10.14.0.12:56994>
10/26 20:15:57 DaemonCore: received command 441 (ALIVE), calling handler (command_handler)
10/26 20:15:57 DaemonCore: Command received via UDP from host <10.14.0.12:56994>
10/26 20:15:57 DaemonCore: received command 441 (ALIVE), calling handler (command_handler)
10/26 20:16:00 ProcAPI::buildFamily() Found daddypid on the system: 31052
10/26 20:16:05 ProcAPI::buildFamily() Found daddypid on the system: 6282
10/26 20:16:15 Unable to calculate keyboard/mouse idle time due to them both being USB or not present, assuming infinite idle time for these devices.
10/26 20:16:15 ProcAPI::buildFamily() Found daddypid on the system: 28353
10/26 20:16:15 ProcAPI::buildFamily() Found daddypid on the system: 11573
10/26 20:16:50 ProcAPI::buildFamily() Found daddypid on the system: 31052
10/26 20:16:55 ProcAPI::buildFamily() Found daddypid on the system: 6282
10/26 20:17:05 ProcAPI::buildFamily() Found daddypid on the system: 28353
10/26 20:17:05 ProcAPI::getProcInfo() pid 13026 does not exist.
10/26 20:17:05 ProcAPI::getProcInfo() pid 13026 does not exist.
10/26 20:17:05 ProcAPI::getProcInfo() pid 13026 does not exist.
10/26 20:17:05 ProcAPI::getProcInfo() pid 13026 does not exist.
10/26 20:17:05 ProcAPI::getProcInfo() pid 13026 does not exist.
10/26 20:17:05 ProcAPI::buildFamily() Found daddypid on the system: 11573
10/26 20:17:40 ProcAPI::buildFamily() Found daddypid on the system: 31052
10/26 20:17:45 ProcAPI::buildFamily() Found daddypid on the system: 6282
10/26 20:17:55 ProcAPI::buildFamily() Found daddypid on the system: 28353
10/26 20:17:55 ProcAPI::buildFamily() Found daddypid on the system: 11573
10/26 20:18:30 ProcAPI::buildFamily() Found daddypid on the system: 31052
10/26 20:18:36 ProcAPI::buildFamily() Found daddypid on the system: 6282
10/26 20:18:46 ProcAPI::buildFamily() Found daddypid on the system: 28353
10/26 20:18:46 ProcAPI::buildFamily() Found daddypid on the system: 11573
10/26 20:19:10 Getting monitoring info for pid 17398
10/26 20:19:21 ProcAPI::buildFamily() Found daddypid on the system: 31052
10/26 20:19:26 ProcAPI::buildFamily() Found daddypid on the system: 6282
10/26 20:19:36 ProcAPI::buildFamily() Found daddypid on the system: 28353
10/26 20:19:36 ProcAPI::buildFamily() Found daddypid on the system: 11573
10/26 20:20:10 Swap space: 10482320
10/26 20:20:10 161273828 kbytes available for "/usr1/condor/execute"
10/26 20:20:10 Looking up RESERVED_DISK parameter
10/26 20:20:10 Reserving 5120 kbytes for file system
10/26 20:20:10 Disk space: 161268708
10/26 20:20:11 ProcAPI::buildFamily() Found daddypid on the system: 31052
10/26 20:20:14 Trying to update collector <10.14.0.12:9618>
10/26 20:20:14 Attempting to send update via UDP to collector ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
10/26 20:20:14 vm1: Sent update to 1 collector(s)
10/26 20:20:15 Trying to update collector <10.14.0.12:9618>
10/26 20:20:15 Attempting to send update via UDP to collector ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
10/26 20:20:15 vm2: Sent update to 1 collector(s)
10/26 20:20:16 Trying to update collector <10.14.0.12:9618>
10/26 20:20:16 Attempting to send update via UDP to collector ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
10/26 20:20:16 vm3: Sent update to 1 collector(s)
10/26 20:20:16 ProcAPI::buildFamily() Found daddypid on the system: 6282
10/26 20:20:17 Trying to update collector <10.14.0.12:9618>
10/26 20:20:17 Attempting to send update via UDP to collector ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
10/26 20:20:17 vm4: Sent update to 1 collector(s)
10/26 20:20:26 ProcAPI::buildFamily() Found daddypid on the system: 28353
10/26 20:20:26 ProcAPI::buildFamily() Found daddypid on the system: 11573
10/26 20:20:27 vm1: match_info called
10/26 20:20:27 vm1: Started match timer (3188) for 120 seconds.
10/26 20:20:28 vm1: Canceled match timer (3188)
10/26 20:20:28 vm1: Schedd addr = <10.14.0.12:43253>
10/26 20:20:28 vm1: Alive interval = 300
10/26 20:20:28 vm1: Received ClaimId from schedd (<10.14.1.101:43817>#1161828891#52)
10/26 20:20:28 vm1: Rank of this claim is: 0.000000
10/26 20:20:28 vm1: Preempting claim has correct ClaimId.
10/26 20:20:28 vm1: New claim has sufficient rank, preempting current claim.
10/26 20:20:28 vm1: Got RemoteUser (anandss@ligo) from request classad
10/26 20:20:28 vm1: State change: preempting claim based on user priority
10/26 20:20:28 vm1: State change: retiring due to preempting claim
10/26 20:20:28 vm1: Changing activity: Busy -> Retiring
10/26 20:20:28 vm1: State change: retirement ended/expired
10/26 20:20:28 vm1: Canceled ClaimLease timer (2055)
10/26 20:20:28 vm1: Changing state and activity: Claimed/Retiring -> Preempting/Vacating
10/26 20:20:28 vm1: In Starter::kill() with pid 31052, sig 15 (SIGTERM)
10/26 20:20:28 vm2: Schedd addr = <10.14.0.12:43253>
10/26 20:20:28 vm2: Alive interval = 300
10/26 20:20:28 vm2: Received ClaimId from schedd (<10.14.1.101:43817>#1161828891#45)
10/26 20:20:28 vm2: Rank of this claim is: 0.000000
10/26 20:20:28 vm2: Preempting claim has correct ClaimId.
10/26 20:20:28 vm2: New claim has sufficient rank, preempting current claim.
10/26 20:20:28 vm2: Got RemoteUser (eirini@ligo) from request classad
10/26 20:20:28 vm2: State change: preempting claim based on user priority
10/26 20:20:28 vm2: State change: retiring due to preempting claim
10/26 20:20:28 vm2: Changing activity: Busy -> Retiring
10/26 20:20:28 vm2: State change: retirement ended/expired
10/26 20:20:28 vm2: Canceled ClaimLease timer (1763)
10/26 20:20:28 vm2: Changing state and activity: Claimed/Retiring -> Preempting/Vacating
10/26 20:20:28 vm2: In Starter::kill() with pid 28353, sig 15 (SIGTERM)
10/26 20:20:28 vm2: Got MATCH_INFO while in Preempting state, ignoring.
10/26 20:20:28 vm2: Got KILL_FRGN_JOB while in Preempting state, ignoring.
10/26 20:20:28 DaemonCore: No more children processes to reap.
10/26 20:20:28 Starter pid 28353 exited with status 0
10/26 20:20:28 ProcAPI::buildFamily failed: parent 28353 not found on system.
10/26 20:20:28 ProcAPI::getProcInfo() pid 28353 does not exist.
10/26 20:20:28 ProcAPI::getProcInfo() pid 28353 does not exist.
10/26 20:20:28 ProcAPI::getProcInfo() pid 28353 does not exist.
10/26 20:20:28 ProcAPI::getProcInfo() pid 28353 does not exist.
10/26 20:20:28 ProcAPI::getProcInfo() pid 28353 does not exist.
10/26 20:20:28 ProcAPI::getProcInfo() pid 28354 does not exist.
10/26 20:20:28 ProcAPI::getProcInfo() pid 28354 does not exist.
10/26 20:20:28 ProcAPI::getProcInfo() pid 28354 does not exist.
10/26 20:20:28 ProcAPI::getProcInfo() pid 28354 does not exist.
10/26 20:20:28 ProcAPI::getProcInfo() pid 28354 does not exist.
10/26 20:20:28 ProcAPI::getProcInfo() pid 28355 does not exist.
10/26 20:20:28 ProcAPI::getProcInfo() pid 28355 does not exist.
10/26 20:20:28 ProcAPI::getProcInfo() pid 28355 does not exist.
10/26 20:20:28 ProcAPI::getProcInfo() pid 28355 does not exist.
10/26 20:20:28 ProcAPI::getProcInfo() pid 28355 does not exist.
10/26 20:20:28 ProcAPI::getProcInfo() pid 28356 does not exist.
10/26 20:20:28 ProcAPI::getProcInfo() pid 28356 does not exist.
10/26 20:20:28 ProcAPI::getProcInfo() pid 28356 does not exist.
10/26 20:20:28 ProcAPI::getProcInfo() pid 28356 does not exist.
10/26 20:20:28 ProcAPI::getProcInfo() pid 28356 does not exist.
10/26 20:20:28 ProcAPI::buildFamily failed: parent 28356 not found on system.
*** End of file StartLog



On Thu, Oct 26, 2006 at 06:31:23PM -0700, Erik A. Espinoza wrote:
> Here is a log from the new binary.
> 
> 
> 
> -------- Original Message --------
> Subject: [Condor] Problem
> Date: Thu, 26 Oct 2006 17:59:29 -0700
> From: condor__AT__node305.ldas-cit.ligo.caltech.edu
> To: ldas_admin_cit__AT__ligo.caltech.edu
> 
> This is an automated email from the Condor system
> on machine "node305.ldas-cit.ligo.caltech.edu".  Do not reply.
> 
> "/ldcg/condor/sbin/condor_startd" on "node305.ldas-cit.ligo.caltech.edu" 
> died due to signal 6.
> Condor will automatically restart this process in 10 seconds.
> 

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Thu Oct 26 23:02:47 2006 (1161921768)
Date: Fri, 27 Oct 2006 11:33:38 -0700
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu
CC: Erik Espinoza <espinoza_e__AT__ligo.caltech.edu>
Subject: Re: [condor-support #1694] LIGO Ticket: Condor 6.8.2 x86_64     
 Upgrade [Fwd: [Condor] Problem]

Here is another example that was triggered by a condor_off command
on the head node. It is also different in that 3 out of the 4
VM's on this physical node reported,
Got KILL_FRGN_JOB while in Preempting state, ignoring
withing 1 second of each other.


This is an automated email from the Condor system
on machine "node183.ldas-cit.ligo.caltech.edu".  Do not reply.

"/ldcg/condor/sbin/condor_startd" on "node183.ldas-cit.ligo.caltech.edu" died due to signal 6.
Condor will automatically restart this process in 10 seconds.

*** Last 200 line(s) of file StartLog:
10/27 08:57:35 ProcAPI::buildFamily() Found daddypid on the system: 3622
10/27 08:57:51 ProcAPI::buildFamily() Found daddypid on the system: 3592
10/27 08:57:51 ProcAPI::buildFamily() Found daddypid on the system: 3578
10/27 08:57:51 ProcAPI::buildFamily() Found daddypid on the system: 3590
10/27 08:57:52 DaemonCore: Command received via UDP from host <10.14.0.12:46685>
10/27 08:57:52 DaemonCore: received command 441 (ALIVE), calling handler (command_handler)
10/27 08:57:52 DaemonCore: Command received via UDP from host <10.14.0.12:46685>
10/27 08:57:52 DaemonCore: received command 441 (ALIVE), calling handler (command_handler)
10/27 08:57:52 DaemonCore: Command received via UDP from host <10.14.0.12:46685>
10/27 08:57:52 DaemonCore: received command 441 (ALIVE), calling handler (command_handler)
10/27 08:57:52 DaemonCore: Command received via UDP from host <10.14.0.12:46685>
10/27 08:57:52 DaemonCore: received command 441 (ALIVE), calling handler (command_handler)
10/27 08:58:26 ProcAPI::buildFamily() Found daddypid on the system: 3622
10/27 08:58:41 ProcAPI::buildFamily() Found daddypid on the system: 3592
10/27 08:58:41 ProcAPI::buildFamily() Found daddypid on the system: 3578
10/27 08:58:41 ProcAPI::buildFamily() Found daddypid on the system: 3590
10/27 08:58:56 Getting monitoring info for pid 31827
10/27 08:59:05 DaemonCore: Command received via UDP from host <10.14.1.183:39976>
10/27 08:59:05 DaemonCore: received command 60008 (DC_CHILDALIVE), calling handler (HandleChildAliveCommand)
10/27 08:59:07 DaemonCore: Command received via UDP from host <10.14.1.183:39976>
10/27 08:59:07 DaemonCore: received command 60008 (DC_CHILDALIVE), calling handler (HandleChildAliveCommand)
10/27 08:59:07 DaemonCore: Command received via UDP from host <10.14.1.183:39976>
10/27 08:59:07 DaemonCore: received command 60008 (DC_CHILDALIVE), calling handler (HandleChildAliveCommand)
10/27 08:59:16 ProcAPI::buildFamily() Found daddypid on the system: 3622
10/27 08:59:31 ProcAPI::buildFamily() Found daddypid on the system: 3592
10/27 08:59:31 ProcAPI::buildFamily() Found daddypid on the system: 3578
10/27 08:59:31 ProcAPI::buildFamily() Found daddypid on the system: 3590
10/27 08:59:39 DaemonCore: Command received via UDP from host <10.14.1.183:39976>
10/27 08:59:39 DaemonCore: received command 60008 (DC_CHILDALIVE), calling handler (HandleChildAliveCommand)
10/27 09:00:01 Swap space: 10433396
10/27 09:00:01 259249372 kbytes available for "/usr1/condor/execute"
10/27 09:00:01 Looking up RESERVED_DISK parameter
10/27 09:00:01 Reserving 5120 kbytes for file system
10/27 09:00:01 Disk space: 259244252
10/27 09:00:05 Trying to update collector <10.14.0.12:9618>
10/27 09:00:05 Attempting to send update via UDP to collector ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
10/27 09:00:05 vm1: Sent update to 1 collector(s)
10/27 09:00:06 Trying to update collector <10.14.0.12:9618>
10/27 09:00:06 Attempting to send update via UDP to collector ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
10/27 09:00:06 vm2: Sent update to 1 collector(s)
10/27 09:00:06 ProcAPI::buildFamily() Found daddypid on the system: 3622
10/27 09:00:07 Trying to update collector <10.14.0.12:9618>
10/27 09:00:07 Attempting to send update via UDP to collector ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
10/27 09:00:07 vm3: Sent update to 1 collector(s)
10/27 09:00:08 Trying to update collector <10.14.0.12:9618>
10/27 09:00:08 Attempting to send update via UDP to collector ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
10/27 09:00:08 vm4: Sent update to 1 collector(s)
10/27 09:00:21 ProcAPI::buildFamily() Found daddypid on the system: 3592
10/27 09:00:21 ProcAPI::buildFamily() Found daddypid on the system: 3578
10/27 09:00:21 ProcAPI::buildFamily() Found daddypid on the system: 3590
10/27 09:00:56 ProcAPI::buildFamily() Found daddypid on the system: 3622
10/27 09:01:11 ProcAPI::buildFamily() Found daddypid on the system: 3592
10/27 09:01:11 ProcAPI::buildFamily() Found daddypid on the system: 3578
10/27 09:01:11 ProcAPI::buildFamily() Found daddypid on the system: 3590
10/27 09:01:46 ProcAPI::buildFamily() Found daddypid on the system: 3622
10/27 09:02:01 ProcAPI::buildFamily() Found daddypid on the system: 3592
10/27 09:02:01 ProcAPI::buildFamily() Found daddypid on the system: 3578
10/27 09:02:01 ProcAPI::buildFamily() Found daddypid on the system: 3590
10/27 09:02:36 ProcAPI::buildFamily() Found daddypid on the system: 3622
10/27 09:02:51 ProcAPI::buildFamily() Found daddypid on the system: 3592
10/27 09:02:51 ProcAPI::buildFamily() Found daddypid on the system: 3578
10/27 09:02:52 ProcAPI::buildFamily() Found daddypid on the system: 3590
10/27 09:02:55 DaemonCore: Command received via UDP from host <10.14.0.12:47323>
10/27 09:02:55 DaemonCore: received command 441 (ALIVE), calling handler (command_handler)
10/27 09:02:55 DaemonCore: Command received via UDP from host <10.14.0.12:47579>
10/27 09:02:55 DaemonCore: received command 441 (ALIVE), calling handler (command_handler)
10/27 09:02:55 DaemonCore: Command received via UDP from host <10.14.0.12:47579>
10/27 09:02:55 DaemonCore: received command 441 (ALIVE), calling handler (command_handler)
10/27 09:02:55 DaemonCore: Command received via UDP from host <10.14.0.12:47579>
10/27 09:02:55 DaemonCore: received command 441 (ALIVE), calling handler (command_handler)
10/27 09:02:56 Getting monitoring info for pid 31827
10/27 09:03:27 ProcAPI::buildFamily() Found daddypid on the system: 3622
10/27 09:03:42 ProcAPI::buildFamily() Found daddypid on the system: 3592
10/27 09:03:42 ProcAPI::buildFamily() Found daddypid on the system: 3578
10/27 09:03:42 ProcAPI::buildFamily() Found daddypid on the system: 3590
10/27 09:03:43 Got SIGTERM. Performing graceful shutdown.
10/27 09:03:43 Started timer to call main_shutdown_fast in 1800 seconds
10/27 09:03:43 shutdown graceful
10/27 09:03:43 StartdCronMgr: Shutting down
10/27 09:03:43 CronMgr: Killing all jobs
10/27 09:03:43 CronMgr: 0 jobs alive
10/27 09:03:43 vm1: Changing activity: Busy -> Retiring
10/27 09:03:43 vm1: State change: retirement ended/expired
10/27 09:03:43 vm1: Canceled ClaimLease timer (4204)
10/27 09:03:43 vm1: Changing state and activity: Claimed/Retiring -> Preempting/Vacating
10/27 09:03:46 vm1: In Starter::kill() with pid 3592, sig 15 (SIGTERM)
10/27 09:03:46 vm2: Changing activity: Busy -> Retiring
10/27 09:03:46 vm2: State change: retirement ended/expired
10/27 09:03:46 vm2: Canceled ClaimLease timer (4205)
10/27 09:03:46 vm2: Changing state and activity: Claimed/Retiring -> Preempting/Vacating
10/27 09:03:46 vm2: In Starter::kill() with pid 3578, sig 15 (SIGTERM)
10/27 09:03:46 vm3: Changing activity: Busy -> Retiring
10/27 09:03:46 vm3: State change: retirement ended/expired
10/27 09:03:46 vm3: Canceled ClaimLease timer (4206)
10/27 09:03:46 vm3: Changing state and activity: Claimed/Retiring -> Preempting/Vacating
10/27 09:03:46 vm3: In Starter::kill() with pid 3590, sig 15 (SIGTERM)
10/27 09:03:46 vm4: Changing activity: Busy -> Retiring
10/27 09:03:46 vm4: State change: retirement ended/expired
10/27 09:03:46 vm4: Canceled ClaimLease timer (4215)
10/27 09:03:46 vm4: Changing state and activity: Claimed/Retiring -> Preempting/Vacating
10/27 09:03:46 vm4: In Starter::kill() with pid 3622, sig 15 (SIGTERM)
10/27 09:03:46 CronMgr: 0 jobs alive
10/27 09:03:47 Trying to update collector <10.14.0.12:9618>
10/27 09:03:47 Attempting to send update via UDP to collector ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
10/27 09:03:47 vm1: Sent update to 1 collector(s)
10/27 09:03:51 Trying to update collector <10.14.0.12:9618>
10/27 09:03:51 Attempting to send update via UDP to collector ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
10/27 09:03:51 vm2: Sent update to 1 collector(s)
10/27 09:03:51 CronMgr: 0 jobs alive
10/27 09:03:52 Trying to update collector <10.14.0.12:9618>
10/27 09:03:52 Attempting to send update via UDP to collector ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
10/27 09:03:52 vm3: Sent update to 1 collector(s)
10/27 09:03:53 Trying to update collector <10.14.0.12:9618>
10/27 09:03:53 Attempting to send update via UDP to collector ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
10/27 09:03:53 vm4: Sent update to 1 collector(s)
10/27 09:03:56 CronMgr: 0 jobs alive
10/27 09:04:01 CronMgr: 0 jobs alive
10/27 09:04:06 CronMgr: 0 jobs alive
10/27 09:04:11 CronMgr: 0 jobs alive
10/27 09:04:11 vm3: Got KILL_FRGN_JOB while in Preempting state, ignoring.
10/27 09:04:11 vm2: Got KILL_FRGN_JOB while in Preempting state, ignoring.
10/27 09:04:12 vm1: Got KILL_FRGN_JOB while in Preempting state, ignoring.
10/27 09:04:12 ProcAPI::getProcInfo() pid 3596 does not exist.
10/27 09:04:12 ProcAPI::getProcInfo() pid 3596 does not exist.
10/27 09:04:12 ProcAPI::getProcInfo() pid 3596 does not exist.
10/27 09:04:12 ProcAPI::getProcInfo() pid 3596 does not exist.
10/27 09:04:12 ProcAPI::getProcInfo() pid 3596 does not exist.
10/27 09:04:12 ProcAPI::getProcSetInfo(): Pid 3596 does not exist, ignoring.
10/27 09:04:12 ProcAPI::getProcInfo() pid 3597 does not exist.
10/27 09:04:12 ProcAPI::getProcInfo() pid 3597 does not exist.
10/27 09:04:12 ProcAPI::getProcInfo() pid 3597 does not exist.
10/27 09:04:12 ProcAPI::getProcInfo() pid 3597 does not exist.
10/27 09:04:12 ProcAPI::getProcInfo() pid 3597 does not exist.
10/27 09:04:12 ProcAPI::getProcSetInfo(): Pid 3597 does not exist, ignoring.
10/27 09:04:12 ProcAPI::getProcInfo() pid 3579 does not exist.
10/27 09:04:12 ProcAPI::getProcInfo() pid 3579 does not exist.
10/27 09:04:12 ProcAPI::getProcInfo() pid 3579 does not exist.
10/27 09:04:12 ProcAPI::getProcInfo() pid 3579 does not exist.
10/27 09:04:12 ProcAPI::getProcInfo() pid 3579 does not exist.
10/27 09:04:12 ProcAPI::getProcSetInfo(): Pid 3579 does not exist, ignoring.
10/27 09:04:12 ProcAPI::getProcInfo() pid 3580 does not exist.
10/27 09:04:12 ProcAPI::getProcInfo() pid 3580 does not exist.
10/27 09:04:12 ProcAPI::getProcInfo() pid 3580 does not exist.
10/27 09:04:12 ProcAPI::getProcInfo() pid 3580 does not exist.
10/27 09:04:12 ProcAPI::getProcInfo() pid 3580 does not exist.
10/27 09:04:12 ProcAPI::getProcSetInfo(): Pid 3580 does not exist, ignoring.
10/27 09:04:12 ProcAPI::getProcInfo() pid 3593 does not exist.
10/27 09:04:12 ProcAPI::getProcInfo() pid 3593 does not exist.
10/27 09:04:12 ProcAPI::getProcInfo() pid 3593 does not exist.
10/27 09:04:12 ProcAPI::getProcInfo() pid 3593 does not exist.
10/27 09:04:12 ProcAPI::getProcInfo() pid 3593 does not exist.
10/27 09:04:12 ProcAPI::getProcSetInfo(): Pid 3593 does not exist, ignoring.
10/27 09:04:12 ProcAPI::getProcInfo() pid 3594 does not exist.
10/27 09:04:12 ProcAPI::getProcInfo() pid 3594 does not exist.
10/27 09:04:12 ProcAPI::getProcInfo() pid 3594 does not exist.
10/27 09:04:12 ProcAPI::getProcInfo() pid 3594 does not exist.
10/27 09:04:12 ProcAPI::getProcInfo() pid 3594 does not exist.
10/27 09:04:12 ProcAPI::getProcSetInfo(): Pid 3594 does not exist, ignoring.
10/27 09:04:12 ProcAPI::getProcInfo() pid 3623 does not exist.
10/27 09:04:12 ProcAPI::getProcInfo() pid 3623 does not exist.
10/27 09:04:12 ProcAPI::getProcInfo() pid 3623 does not exist.
10/27 09:04:12 ProcAPI::getProcInfo() pid 3623 does not exist.
10/27 09:04:12 ProcAPI::getProcInfo() pid 3623 does not exist.
10/27 09:04:12 ProcAPI::getProcSetInfo(): Pid 3623 does not exist, ignoring.
10/27 09:04:12 ProcAPI::getProcInfo() pid 3624 does not exist.
10/27 09:04:12 ProcAPI::getProcInfo() pid 3624 does not exist.
10/27 09:04:12 ProcAPI::getProcInfo() pid 3624 does not exist.
10/27 09:04:12 ProcAPI::getProcInfo() pid 3624 does not exist.
10/27 09:04:12 ProcAPI::getProcInfo() pid 3624 does not exist.
10/27 09:04:12 ProcAPI::getProcSetInfo(): Pid 3624 does not exist, ignoring.
10/27 09:04:12 vm4: Got KILL_FRGN_JOB while in Preempting state, ignoring.
10/27 09:04:12 DaemonCore: No more children processes to reap.
10/27 09:04:12 Starter pid 3592 exited with status 0
10/27 09:04:12 ProcAPI::getProcInfo() pid 6491 does not exist.
10/27 09:04:12 ProcAPI::getProcInfo() pid 6491 does not exist.
10/27 09:04:12 ProcAPI::getProcInfo() pid 6491 does not exist.
10/27 09:04:12 ProcAPI::getProcInfo() pid 6491 does not exist.
10/27 09:04:12 ProcAPI::getProcInfo() pid 6491 does not exist.
10/27 09:04:12 ProcAPI::buildFamily failed: parent 3592 not found on system.
10/27 09:04:12 ProcAPI::getProcInfo() pid 3592 does not exist.
10/27 09:04:12 ProcAPI::getProcInfo() pid 3592 does not exist.
10/27 09:04:12 ProcAPI::getProcInfo() pid 3592 does not exist.
10/27 09:04:12 ProcAPI::getProcInfo() pid 3592 does not exist.
10/27 09:04:12 ProcAPI::getProcInfo() pid 3592 does not exist.
10/27 09:04:12 ProcAPI::getProcInfo() pid 3596 does not exist.
10/27 09:04:12 ProcAPI::getProcInfo() pid 3596 does not exist.
10/27 09:04:12 ProcAPI::getProcInfo() pid 3596 does not exist.
10/27 09:04:12 ProcAPI::getProcInfo() pid 3596 does not exist.
10/27 09:04:12 ProcAPI::getProcInfo() pid 3596 does not exist.
10/27 09:04:12 ProcAPI::getProcInfo() pid 3597 does not exist.
10/27 09:04:12 ProcAPI::getProcInfo() pid 3597 does not exist.
10/27 09:04:12 ProcAPI::getProcInfo() pid 3597 does not exist.
10/27 09:04:12 ProcAPI::getProcInfo() pid 3597 does not exist.
10/27 09:04:12 ProcAPI::getProcInfo() pid 3597 does not exist.
10/27 09:04:12 ProcAPI::getProcInfo() pid 3598 does not exist.
10/27 09:04:12 ProcAPI::getProcInfo() pid 3598 does not exist.
10/27 09:04:12 ProcAPI::getProcInfo() pid 3598 does not exist.
10/27 09:04:12 ProcAPI::getProcInfo() pid 3598 does not exist.
10/27 09:04:12 ProcAPI::getProcInfo() pid 3598 does not exist.
10/27 09:04:12 ProcAPI::buildFamily failed: parent 3598 not found on system.
*** End of file StartLog

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Fri Oct 27 13:34:05 2006 (1161974046)
Date: Mon, 30 Oct 2006 17:21:09 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu
CC: Erik Espinoza <espinoza_e__AT__ligo.caltech.edu>
Subject: Re: [condor-support #1694] LIGO Ticket: Condor 6.8.2 x86_64     
 Upgrade [Fwd: [Condor] Problem]

Derek,
	Any luck tracking this down further from the extra debug output?
Is there any more information we can provide that would be helpful in
determining the cause of this problem?

Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Mon Oct 30 19:21:26 2006 (1162257687)
From: Derek Wright <wright__AT__cs.wisc.edu>
Subject: Re: [condor-support #1694] LIGO Ticket: Condor 6.8.2 x86_64    
 Upgrade [Fwd: [Condor] Problem]
Date: Thu, 2 Nov 2006 12:14:00 -0800
To: condor-support__AT__cs.wisc.edu


On Oct 30, 2006, at 5:21 PM, condor-support response tracking system  
wrote:

> 	Any luck tracking this down further from the extra debug output?

sadly, no.  the problem doesn't appear to be coming from where i  
thought it was. :(  i have a few new theories, but i'd need to get  
y'all a new version of the startd to verify if they're true.

> Is there any more information we can provide that would be helpful  
> in determining the cause of this problem?

probably.  unfortunately, i was nearly killed tuesday night in a  
freakish hit-and-run car accident, so i've been a little shaken up  
and not terribly productive for the last few days. :(

i'll dust this ticket off and give it some more love this afternoon,  
and i'll try to get y'all a new version of the startd with yet more  
debugging output...

sorry for the delays.  part of the annoying thing is this doesn't  
seem to be happening anywhere else, so it's harder to debug. :(

thanks for your patience,
-derek



===========================================================================
Date mail was appended: Thu Nov  2 14:14:10 2006 (1162498450)
Date: Mon, 6 Nov 2006 14:46:07 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: espinoza__AT__ligo.caltech.edu, espinoza_e__AT__ligo.caltech.edu
Subject: Re: [condor-support #1694] LIGO Ticket: Condor 6.8.2 x86_64  
 Upgrade [Fwd: [Condor] Problem]

On Thu, Nov 02, 2006 at 02:14:10PM -0600, condor-support response tracking system wrote:
> 
> On Oct 30, 2006, at 5:21 PM, condor-support response tracking system  
> wrote:
> 
> > 	Any luck tracking this down further from the extra debug output?
> 
> sadly, no.  the problem doesn't appear to be coming from where i  
> thought it was. :(  i have a few new theories, but i'd need to get  
> y'all a new version of the startd to verify if they're true.

Do you have an estimate for this so I can schedule an update to our
production condor pool?

> 
> > Is there any more information we can provide that would be helpful  
> > in determining the cause of this problem?
> 
> probably.  unfortunately, i was nearly killed tuesday night in a  
> freakish hit-and-run car accident, so i've been a little shaken up  
> and not terribly productive for the last few days. :(
> 
> i'll dust this ticket off and give it some more love this afternoon,  
> and i'll try to get y'all a new version of the startd with yet more  
> debugging output...
> 
> sorry for the delays.  part of the annoying thing is this doesn't  
> seem to be happening anywhere else, so it's harder to debug. :(
> 
> thanks for your patience,

Glad to hear you survived!

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Mon Nov  6 16:46:31 2006 (1162853192)
Date: Thu, 9 Nov 2006 09:46:53 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: espinoza__AT__ligo.caltech.edu, espinoza_e__AT__ligo.caltech.edu
Subject: Re: [condor-support #1694] LIGO Ticket: Condor 6.8.2 x86_64  
 Upgrade [Fwd: [Condor] Problem]

On Mon, Nov 06, 2006 at 02:46:07PM -0800, Stuart Anderson wrote:
> On Thu, Nov 02, 2006 at 02:14:10PM -0600, condor-support response tracking system wrote:
> > 
> > On Oct 30, 2006, at 5:21 PM, condor-support response tracking system  
> > wrote:
> > 
> > > 	Any luck tracking this down further from the extra debug output?
> > 
> > sadly, no.  the problem doesn't appear to be coming from where i  
> > thought it was. :(  i have a few new theories, but i'd need to get  
> > y'all a new version of the startd to verify if they're true.
> 
> Do you have an estimate for this so I can schedule an update to our
> production condor pool?
> 

Derek,
	I would appreciate an update on the status of this bug fix. If you want
different/more log files just let us know, if you have another debug version of
startd we are willing to run that.

Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Thu Nov  9 11:47:26 2006 (1163094447)
Date: Mon, 13 Nov 2006 17:56:17 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: espinoza__AT__ligo.caltech.edu, espinoza_e__AT__ligo.caltech.edu
Subject: Re: [condor-support #1694] LIGO Ticket: Condor 6.8.2 x86_64  
 Upgrade [Fwd: [Condor] Problem]

On Thu, Nov 09, 2006 at 09:46:53AM -0800, Stuart Anderson wrote:
> On Mon, Nov 06, 2006 at 02:46:07PM -0800, Stuart Anderson wrote:
> > On Thu, Nov 02, 2006 at 02:14:10PM -0600, condor-support response tracking system wrote:
> > > 
> > > On Oct 30, 2006, at 5:21 PM, condor-support response tracking system  
> > > wrote:
> > > 
> > > > 	Any luck tracking this down further from the extra debug output?
> > > 
> > > sadly, no.  the problem doesn't appear to be coming from where i  
> > > thought it was. :(  i have a few new theories, but i'd need to get  
> > > y'all a new version of the startd to verify if they're true.
> > 
> > Do you have an estimate for this so I can schedule an update to our
> > production condor pool?
> > 
> 
> Derek,
> 	I would appreciate an update on the status of this bug fix. If you want
> different/more log files just let us know, if you have another debug version of
> startd we are willing to run that.
> 

Here is another instance that is slightly different in that it does not
include the string "KILL_FRGN_JOB".


This is an automated email from the Condor system
on machine "node94.ldas-cit.ligo.caltech.edu".  Do not reply.

"/ldcg/condor/sbin/condor_startd" on "node94.ldas-cit.ligo.caltech.edu" died due to signal 6.
Condor will automatically restart this process in 10 seconds.

*** Last 200 line(s) of file StartLog:
11/13 17:34:02 ProcAPI::getProcSetInfo(): Pid 30156 does not exist, ignoring.
11/13 17:34:07 ProcAPI::getProcInfo() pid 30156 does not exist.
11/13 17:34:07 ProcAPI::getProcInfo() pid 30156 does not exist.
11/13 17:34:07 ProcAPI::getProcInfo() pid 30156 does not exist.
11/13 17:34:07 ProcAPI::getProcInfo() pid 30156 does not exist.
11/13 17:34:07 ProcAPI::getProcInfo() pid 30156 does not exist.
11/13 17:34:07 ProcAPI::getProcSetInfo(): Pid 30156 does not exist, ignoring.
11/13 17:34:12 ProcAPI::getProcInfo() pid 30156 does not exist.
11/13 17:34:12 ProcAPI::getProcInfo() pid 30156 does not exist.
11/13 17:34:12 ProcAPI::getProcInfo() pid 30156 does not exist.
11/13 17:34:12 ProcAPI::getProcInfo() pid 30156 does not exist.
11/13 17:34:12 ProcAPI::getProcInfo() pid 30156 does not exist.
11/13 17:34:12 ProcAPI::getProcSetInfo(): Pid 30156 does not exist, ignoring.
11/13 17:34:17 ProcAPI::getProcInfo() pid 30156 does not exist.
11/13 17:34:17 ProcAPI::getProcInfo() pid 30156 does not exist.
11/13 17:34:17 ProcAPI::getProcInfo() pid 30156 does not exist.
11/13 17:34:17 ProcAPI::getProcInfo() pid 30156 does not exist.
11/13 17:34:17 ProcAPI::getProcInfo() pid 30156 does not exist.
11/13 17:34:17 ProcAPI::getProcSetInfo(): Pid 30156 does not exist, ignoring.
11/13 17:34:22 ProcAPI::getProcInfo() pid 30156 does not exist.
11/13 17:34:22 ProcAPI::getProcInfo() pid 30156 does not exist.
11/13 17:34:22 ProcAPI::getProcInfo() pid 30156 does not exist.
11/13 17:34:22 ProcAPI::getProcInfo() pid 30156 does not exist.
11/13 17:34:22 ProcAPI::getProcInfo() pid 30156 does not exist.
11/13 17:34:22 ProcAPI::getProcSetInfo(): Pid 30156 does not exist, ignoring.
11/13 17:34:26 DaemonCore: Command received via UDP from host <10.14.0.18:60810>
11/13 17:34:26 DaemonCore: received command 441 (ALIVE), calling handler (command_handler)
11/13 17:34:26 vm4: Got ALIVE while in Preempting state, ignoring.
11/13 17:34:27 ProcAPI::getProcInfo() pid 30156 does not exist.
11/13 17:34:27 ProcAPI::getProcInfo() pid 30156 does not exist.
11/13 17:34:27 ProcAPI::getProcInfo() pid 30156 does not exist.
11/13 17:34:27 ProcAPI::getProcInfo() pid 30156 does not exist.
11/13 17:34:27 ProcAPI::getProcInfo() pid 30156 does not exist.
11/13 17:34:27 ProcAPI::getProcSetInfo(): Pid 30156 does not exist, ignoring.
11/13 17:34:32 Swap space: 10446880
11/13 17:34:32 91390200 kbytes available for "/usr1/condor/execute"
11/13 17:34:32 Looking up RESERVED_DISK parameter
11/13 17:34:32 Reserving 5120 kbytes for file system
11/13 17:34:32 Disk space: 91385080
11/13 17:34:32 ProcAPI::getProcInfo() pid 30156 does not exist.
11/13 17:34:32 ProcAPI::getProcInfo() pid 30156 does not exist.
11/13 17:34:32 ProcAPI::getProcInfo() pid 30156 does not exist.
11/13 17:34:32 ProcAPI::getProcInfo() pid 30156 does not exist.
11/13 17:34:32 ProcAPI::getProcInfo() pid 30156 does not exist.
11/13 17:34:32 ProcAPI::getProcSetInfo(): Pid 30156 does not exist, ignoring.
11/13 17:34:32 ProcAPI::getProcInfo() pid 30156 does not exist.
11/13 17:34:32 ProcAPI::getProcInfo() pid 30156 does not exist.
11/13 17:34:32 ProcAPI::getProcInfo() pid 30156 does not exist.
11/13 17:34:32 ProcAPI::getProcInfo() pid 30156 does not exist.
11/13 17:34:32 ProcAPI::getProcInfo() pid 30156 does not exist.
11/13 17:34:32 ProcAPI::getProcSetInfo(): Pid 30156 does not exist, ignoring.
11/13 17:34:36 Trying to update collector <10.14.0.12:9618>
11/13 17:34:36 Attempting to send update via UDP to collector ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
11/13 17:34:36 vm1: Sent update to 1 collector(s)
11/13 17:34:37 Trying to update collector <10.14.0.12:9618>
11/13 17:34:37 Attempting to send update via UDP to collector ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
11/13 17:34:37 vm2: Sent update to 1 collector(s)
11/13 17:34:37 ProcAPI::getProcInfo() pid 30156 does not exist.
11/13 17:34:37 ProcAPI::getProcInfo() pid 30156 does not exist.
11/13 17:34:37 ProcAPI::getProcInfo() pid 30156 does not exist.
11/13 17:34:37 ProcAPI::getProcInfo() pid 30156 does not exist.
11/13 17:34:37 ProcAPI::getProcInfo() pid 30156 does not exist.
11/13 17:34:37 ProcAPI::getProcSetInfo(): Pid 30156 does not exist, ignoring.
11/13 17:34:38 Trying to update collector <10.14.0.12:9618>
11/13 17:34:38 Attempting to send update via UDP to collector ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
11/13 17:34:38 vm3: Sent update to 1 collector(s)
11/13 17:34:39 Trying to update collector <10.14.0.12:9618>
11/13 17:34:39 Attempting to send update via UDP to collector ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
11/13 17:34:39 vm4: Sent update to 1 collector(s)
11/13 17:34:42 ProcAPI::buildFamily() Found daddypid on the system: 6945
11/13 17:34:42 ProcAPI::getProcInfo() pid 30156 does not exist.
11/13 17:34:42 ProcAPI::getProcInfo() pid 30156 does not exist.
11/13 17:34:42 ProcAPI::getProcInfo() pid 30156 does not exist.
11/13 17:34:42 ProcAPI::getProcInfo() pid 30156 does not exist.
11/13 17:34:42 ProcAPI::getProcInfo() pid 30156 does not exist.
11/13 17:34:47 ProcAPI::buildFamily() Found daddypid on the system: 15527
11/13 17:34:47 ProcAPI::getProcInfo() pid 30267 does not exist.
11/13 17:34:47 ProcAPI::getProcInfo() pid 30267 does not exist.
11/13 17:34:47 ProcAPI::getProcInfo() pid 30267 does not exist.
11/13 17:34:47 ProcAPI::getProcInfo() pid 30267 does not exist.
11/13 17:34:47 ProcAPI::getProcInfo() pid 30267 does not exist.
11/13 17:34:47 ProcAPI::getProcSetInfo(): Pid 30267 does not exist, ignoring.
11/13 17:34:52 ProcAPI::getProcInfo() pid 30267 does not exist.
11/13 17:34:52 ProcAPI::getProcInfo() pid 30267 does not exist.
11/13 17:34:52 ProcAPI::getProcInfo() pid 30267 does not exist.
11/13 17:34:52 ProcAPI::getProcInfo() pid 30267 does not exist.
11/13 17:34:52 ProcAPI::getProcInfo() pid 30267 does not exist.
11/13 17:34:52 ProcAPI::getProcSetInfo(): Pid 30267 does not exist, ignoring.
11/13 17:34:57 ProcAPI::getProcInfo() pid 30267 does not exist.
11/13 17:34:57 ProcAPI::getProcInfo() pid 30267 does not exist.
11/13 17:34:57 ProcAPI::getProcInfo() pid 30267 does not exist.
11/13 17:34:57 ProcAPI::getProcInfo() pid 30267 does not exist.
11/13 17:34:57 ProcAPI::getProcInfo() pid 30267 does not exist.
11/13 17:34:57 ProcAPI::getProcSetInfo(): Pid 30267 does not exist, ignoring.
11/13 17:35:02 ProcAPI::getProcInfo() pid 30267 does not exist.
11/13 17:35:02 ProcAPI::getProcInfo() pid 30267 does not exist.
11/13 17:35:02 ProcAPI::getProcInfo() pid 30267 does not exist.
11/13 17:35:02 ProcAPI::getProcInfo() pid 30267 does not exist.
11/13 17:35:02 ProcAPI::getProcInfo() pid 30267 does not exist.
11/13 17:35:02 ProcAPI::getProcSetInfo(): Pid 30267 does not exist, ignoring.
11/13 17:35:07 ProcAPI::getProcInfo() pid 30267 does not exist.
11/13 17:35:07 ProcAPI::getProcInfo() pid 30267 does not exist.
11/13 17:35:07 ProcAPI::getProcInfo() pid 30267 does not exist.
11/13 17:35:07 ProcAPI::getProcInfo() pid 30267 does not exist.
11/13 17:35:07 ProcAPI::getProcInfo() pid 30267 does not exist.
11/13 17:35:07 ProcAPI::getProcSetInfo(): Pid 30267 does not exist, ignoring.
11/13 17:35:12 ProcAPI::getProcInfo() pid 30267 does not exist.
11/13 17:35:12 ProcAPI::getProcInfo() pid 30267 does not exist.
11/13 17:35:12 ProcAPI::getProcInfo() pid 30267 does not exist.
11/13 17:35:12 ProcAPI::getProcInfo() pid 30267 does not exist.
11/13 17:35:12 ProcAPI::getProcInfo() pid 30267 does not exist.
11/13 17:35:12 ProcAPI::getProcSetInfo(): Pid 30267 does not exist, ignoring.
11/13 17:35:17 ProcAPI::getProcInfo() pid 30267 does not exist.
11/13 17:35:17 ProcAPI::getProcInfo() pid 30267 does not exist.
11/13 17:35:17 ProcAPI::getProcInfo() pid 30267 does not exist.
11/13 17:35:17 ProcAPI::getProcInfo() pid 30267 does not exist.
11/13 17:35:17 ProcAPI::getProcInfo() pid 30267 does not exist.
11/13 17:35:17 ProcAPI::getProcSetInfo(): Pid 30267 does not exist, ignoring.
11/13 17:35:22 ProcAPI::getProcInfo() pid 30267 does not exist.
11/13 17:35:22 ProcAPI::getProcInfo() pid 30267 does not exist.
11/13 17:35:22 ProcAPI::getProcInfo() pid 30267 does not exist.
11/13 17:35:22 ProcAPI::getProcInfo() pid 30267 does not exist.
11/13 17:35:22 ProcAPI::getProcInfo() pid 30267 does not exist.
11/13 17:35:22 ProcAPI::getProcSetInfo(): Pid 30267 does not exist, ignoring.
11/13 17:35:27 ProcAPI::getProcInfo() pid 30267 does not exist.
11/13 17:35:27 ProcAPI::getProcInfo() pid 30267 does not exist.
11/13 17:35:27 ProcAPI::getProcInfo() pid 30267 does not exist.
11/13 17:35:27 ProcAPI::getProcInfo() pid 30267 does not exist.
11/13 17:35:27 ProcAPI::getProcInfo() pid 30267 does not exist.
11/13 17:35:27 ProcAPI::getProcSetInfo(): Pid 30267 does not exist, ignoring.
11/13 17:35:32 ProcAPI::buildFamily() Found daddypid on the system: 6945
11/13 17:35:32 ProcAPI::getProcInfo() pid 30267 does not exist.
11/13 17:35:32 ProcAPI::getProcInfo() pid 30267 does not exist.
11/13 17:35:32 ProcAPI::getProcInfo() pid 30267 does not exist.
11/13 17:35:32 ProcAPI::getProcInfo() pid 30267 does not exist.
11/13 17:35:32 ProcAPI::getProcInfo() pid 30267 does not exist.
11/13 17:35:37 ProcAPI::buildFamily() Found daddypid on the system: 15527
11/13 17:35:37 ProcAPI::getProcInfo() pid 30386 does not exist.
11/13 17:35:37 ProcAPI::getProcInfo() pid 30386 does not exist.
11/13 17:35:37 ProcAPI::getProcInfo() pid 30386 does not exist.
11/13 17:35:37 ProcAPI::getProcInfo() pid 30386 does not exist.
11/13 17:35:37 ProcAPI::getProcInfo() pid 30386 does not exist.
11/13 17:35:37 ProcAPI::getProcSetInfo(): Pid 30386 does not exist, ignoring.
11/13 17:35:42 ProcAPI::getProcInfo() pid 30386 does not exist.
11/13 17:35:42 ProcAPI::getProcInfo() pid 30386 does not exist.
11/13 17:35:42 ProcAPI::getProcInfo() pid 30386 does not exist.
11/13 17:35:42 ProcAPI::getProcInfo() pid 30386 does not exist.
11/13 17:35:42 ProcAPI::getProcInfo() pid 30386 does not exist.
11/13 17:35:42 ProcAPI::getProcSetInfo(): Pid 30386 does not exist, ignoring.
11/13 17:35:47 ProcAPI::getProcInfo() pid 30386 does not exist.
11/13 17:35:47 ProcAPI::getProcInfo() pid 30386 does not exist.
11/13 17:35:47 ProcAPI::getProcInfo() pid 30386 does not exist.
11/13 17:35:47 ProcAPI::getProcInfo() pid 30386 does not exist.
11/13 17:35:47 ProcAPI::getProcInfo() pid 30386 does not exist.
11/13 17:35:47 ProcAPI::getProcSetInfo(): Pid 30386 does not exist, ignoring.
11/13 17:35:52 ProcAPI::getProcInfo() pid 30386 does not exist.
11/13 17:35:52 ProcAPI::getProcInfo() pid 30386 does not exist.
11/13 17:35:52 ProcAPI::getProcInfo() pid 30386 does not exist.
11/13 17:35:52 ProcAPI::getProcInfo() pid 30386 does not exist.
11/13 17:35:52 ProcAPI::getProcInfo() pid 30386 does not exist.
11/13 17:35:52 ProcAPI::getProcSetInfo(): Pid 30386 does not exist, ignoring.
11/13 17:35:54 DaemonCore: No more children processes to reap.
11/13 17:35:54 Starter pid 6945 exited with status 0
11/13 17:35:54 ProcAPI::buildFamily failed: parent 6945 not found on system.
11/13 17:35:54 ProcAPI::getProcInfo() pid 6945 does not exist.
11/13 17:35:54 ProcAPI::getProcInfo() pid 6945 does not exist.
11/13 17:35:54 ProcAPI::getProcInfo() pid 6945 does not exist.
11/13 17:35:54 ProcAPI::getProcInfo() pid 6945 does not exist.
11/13 17:35:54 ProcAPI::getProcInfo() pid 6945 does not exist.
11/13 17:35:54 ProcAPI::getProcInfo() pid 6947 does not exist.
11/13 17:35:54 ProcAPI::getProcInfo() pid 6947 does not exist.
11/13 17:35:54 ProcAPI::getProcInfo() pid 6947 does not exist.
11/13 17:35:54 ProcAPI::getProcInfo() pid 6947 does not exist.
11/13 17:35:54 ProcAPI::getProcInfo() pid 6947 does not exist.
11/13 17:35:54 ProcAPI::getProcInfo() pid 6963 does not exist.
11/13 17:35:54 ProcAPI::getProcInfo() pid 6963 does not exist.
11/13 17:35:54 ProcAPI::getProcInfo() pid 6963 does not exist.
11/13 17:35:54 ProcAPI::getProcInfo() pid 6963 does not exist.
11/13 17:35:54 ProcAPI::getProcInfo() pid 6963 does not exist.
11/13 17:35:54 ProcAPI::getProcInfo() pid 7154 does not exist.
11/13 17:35:54 ProcAPI::getProcInfo() pid 7154 does not exist.
11/13 17:35:54 ProcAPI::getProcInfo() pid 7154 does not exist.
11/13 17:35:54 ProcAPI::getProcInfo() pid 7154 does not exist.
11/13 17:35:54 ProcAPI::getProcInfo() pid 7154 does not exist.
11/13 17:35:54 ProcAPI::getProcInfo() pid 7158 does not exist.
11/13 17:35:54 ProcAPI::getProcInfo() pid 7158 does not exist.
11/13 17:35:54 ProcAPI::getProcInfo() pid 7158 does not exist.
11/13 17:35:54 ProcAPI::getProcInfo() pid 7158 does not exist.
11/13 17:35:54 ProcAPI::getProcInfo() pid 7158 does not exist.
11/13 17:35:54 ProcAPI::getProcInfo() pid 7159 does not exist.
11/13 17:35:54 ProcAPI::getProcInfo() pid 7159 does not exist.
11/13 17:35:54 ProcAPI::getProcInfo() pid 7159 does not exist.
11/13 17:35:54 ProcAPI::getProcInfo() pid 7159 does not exist.
11/13 17:35:54 ProcAPI::getProcInfo() pid 7159 does not exist.
11/13 17:35:54 ProcAPI::getProcInfo() pid 7243 does not exist.
11/13 17:35:54 ProcAPI::getProcInfo() pid 7243 does not exist.
11/13 17:35:54 ProcAPI::getProcInfo() pid 7243 does not exist.
11/13 17:35:54 ProcAPI::getProcInfo() pid 7243 does not exist.
11/13 17:35:54 ProcAPI::getProcInfo() pid 7243 does not exist.
11/13 17:35:54 ProcAPI::buildFamily failed: parent 7159 not found on system.
*** End of file StartLog

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Mon Nov 13 19:57:12 2006 (1163469432)
From: Derek Wright <wright__AT__cs.wisc.edu>
Subject: Re: [condor-support #1694] LIGO Ticket: Condor 6.8.2 x86_64    
 Upgrade [Fwd: [Condor] Problem]
Date: Mon, 20 Nov 2006 12:08:01 -0800
To: condor-support__AT__cs.wisc.edu


On Nov 13, 2006, at 5:57 PM, condor-support response tracking system  
wrote:

> Here is another instance that is slightly different in that it does  
> not
> include the string "KILL_FRGN_JOB".

thanks for the report.  however, after much reading of the code, this  
still makes absolutely no sense. :(

are you getting any core files from these SIGABRT deaths?  that would  
be *incredibly* valuable.

also, can you please give me a sense of how often this is happening?

and, just to confirm, this is only happening on the 64 bit machines,  
right?

once again, sorry for the delays... i've had a very terrible couple  
of weeks. :(

-derek




===========================================================================
Date mail was appended: Mon Nov 20 14:08:11 2006 (1164053292)
Date: Mon, 20 Nov 2006 20:58:31 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: espinoza__AT__ligo.caltech.edu, espinoza_e__AT__ligo.caltech.edu
Subject: Re: [condor-support #1694] LIGO Ticket: Condor 6.8.2 x86_64  
 Upgrade [Fwd: [Condor] Problem]

On Mon, Nov 20, 2006 at 02:08:11PM -0600, condor-support response tracking system wrote:
> 
> On Nov 13, 2006, at 5:57 PM, condor-support response tracking system  
> wrote:
> 
> > Here is another instance that is slightly different in that it does  
> > not
> > include the string "KILL_FRGN_JOB".
> 
> thanks for the report.  however, after much reading of the code, this  
> still makes absolutely no sense. :(

Hopefully this means it will be an interesting bug to fix, and not just
a stupid configuration error on our part.

> 
> are you getting any core files from these SIGABRT deaths?  that would  
> be *incredibly* valuable.

No, but we now suspect that we unknowingly lost Condor coredump functionality
when we upgraded to Condor 6.8.2 (x86_64). Todd tells me that 64-bit Condor
daemons (unlike 32-bit ones) do not override the Linux default settings
regarding setuid() coredumps. We will test this theory and if so restart all
the startd's with coredump enabled sometime soon. We will let you know
either way.

> 
> also, can you please give me a sense of how often this is happening?

Initially it was happening several times a day, but more recently it has
dropped to a few per week as the mix of jobs running in the pool has changed.

> 
> and, just to confirm, this is only happening on the 64 bit machines,  
> right?

Correct, we have not seen this problem on our 32-bit pools running 6.8.2,
however, they are admittedly are not running nearly as many jobs.

> 
> once again, sorry for the delays... i've had a very terrible couple  
> of weeks. :(
> 

Make sure you cook your Turkey long enough to avoid food poisoning :)

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Mon Nov 20 22:58:54 2006 (1164085135)
From: Derek Wright <wright__AT__cs.wisc.edu>
Subject: Re: [condor-support #1694] LIGO Ticket: Condor 6.8.2 x86_64    
 Upgrade [Fwd: [Condor] Problem]
Date: Mon, 20 Nov 2006 22:30:58 -0800
To: condor-support__AT__cs.wisc.edu


On Nov 20, 2006, at 8:58 PM, condor-support response tracking system  
wrote:

> Hopefully this means it will be an interesting bug to fix, and not  
> just
> a stupid configuration error on our part.

hehe. ;)

> No, but we now suspect that we unknowingly lost Condor coredump  
> functionality
> when we upgraded to Condor 6.8.2 (x86_64). Todd tells me that 64- 
> bit Condor
> daemons (unlike 32-bit ones) do not override the Linux default  
> settings
> regarding setuid() coredumps.

indeed, that's no theory, it's the honest truth.  you have to tell  
your kernel to let daemons drop core files even if ruid != euid if  
you want core files on 64 bit linux boxes, and in this case, you  
*definitely* want the core files. ;)

> We will test this theory and if so restart all the startd's with  
> coredump enabled sometime soon. We will let you know either way.

yes, please.  i think a core file is going to be the only way to find  
this bug now (i just hope the core file gives us useful info!).

> Initially it was happening several times a day, but more recently  
> it has
> dropped to a few per week as the mix of jobs running in the pool  
> has changed.

duly noted.  so, while annoying, this isn't actually a critical  
problem for you...  seems like it only happens when a job had just  
finished and the starter is exiting, anyway, so it's not like jobs  
are dying and you're losing computations, right?

> Correct, we have not seen this problem on our 32-bit pools running  
> 6.8.2,
> however, they are admittedly are not running nearly as many jobs.

gotcha.

> Make sure you cook your Turkey long enough to avoid food poisoning :)

good advice. ;)  however, it turns out that what i thought was food  
poisoning was really the first day in a multi-week flu/plague (it  
moved quickly from my gut to my throat and head).  joy.

anyway, i'm hopeful once you get a core file to me i'll be able to  
finally squash this problem, whether it's a condor bug or a LIGO mis- 
configuration. ;)

thanks,
-derek



===========================================================================
Date mail was appended: Tue Nov 21 10:11:05 2006 (1164125467)
Subject: Actions

Status changed from open to pending by wright
===========================================================================
Date of actions: Tue Nov 21 10:29:54 2006 (1164126594)
Date: Tue, 21 Nov 2006 11:11:32 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: espinoza__AT__ligo.caltech.edu, espinoza_e__AT__ligo.caltech.edu
Subject: Re: [condor-support #1694] LIGO Ticket: Condor 6.8.2 x86_64  
 Upgrade [Fwd: [Condor] Problem]

On Tue, Nov 21, 2006 at 10:11:05AM -0600, condor-support response tracking system wrote:
> 
> > Initially it was happening several times a day, but more recently  
> > it has
> > dropped to a few per week as the mix of jobs running in the pool  
> > has changed.
> 
> duly noted.  so, while annoying, this isn't actually a critical  
> problem for you...  seems like it only happens when a job had just  
> finished and the starter is exiting, anyway, so it's not like jobs  
> are dying and you're losing computations, right?

Yes, I would categorize it as currently annoying and not critical, but it
is a question of whether users (or DAGMan) will properly respond to the
shadow exceptions that occur when a shadow lose its connection to a
startd, i.e., I don't think the side effects are completely benign.

Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Tue Nov 21 13:11:54 2006 (1164136315)
From: Derek Wright <wright__AT__cs.wisc.edu>
Subject: Re: [condor-support #1694] LIGO Ticket: Condor 6.8.2 x86_64    
 Upgrade [Fwd: [Condor] Problem]
Date: Tue, 21 Nov 2006 12:43:46 -0800
To: condor-support__AT__cs.wisc.edu


On Nov 21, 2006, at 11:11 AM, condor-support response tracking system  
wrote:

> I don't think the side effects are completely benign.

fair enough.

well, let me know as soon as you've got a core file.
preferably, stick the core file in some FTP/web location and just  
send a link to it into this ticket.  our problem tracking system  
doesn't handle attachments very well at all. :(

thanks,
-derek



===========================================================================
Date mail was appended: Tue Nov 21 14:43:57 2006 (1164141837)
Subject: Actions

Status changed from open to pending by wright
===========================================================================
Date of actions: Tue Nov 21 14:44:14 2006 (1164141854)
Date: Wed, 22 Nov 2006 15:38:34 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: espinoza__AT__ligo.caltech.edu, espinoza_e__AT__ligo.caltech.edu
Subject: Re: [condor-support #1694] LIGO Ticket: Condor 6.8.2 x86_64  
 Upgrade [Fwd: [Condor] Problem]

On Tue, Nov 21, 2006 at 02:43:57PM -0600, condor-support response tracking system wrote:
> 
> On Nov 21, 2006, at 11:11 AM, condor-support response tracking system  
> wrote:
> 
> > I don't think the side effects are completely benign.
> 
> fair enough.
> 
> well, let me know as soon as you've got a core file.
> preferably, stick the core file in some FTP/web location and just  
> send a link to it into this ticket.  our problem tracking system  
> doesn't handle attachments very well at all. :(
> 

Derek,
	Do you have a preference on whether we restart our coredump
enabled condor pool with the 6.8.2 condor_startd, or the patched 6.8.3pre
version you generated for this ticket?

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Wed Nov 22 17:38:57 2006 (1164238738)
From: Derek Wright <wright__AT__cs.wisc.edu>
Subject: Re: [condor-support #1694] LIGO Ticket: Condor 6.8.2 x86_64    
 Upgrade [Fwd: [Condor] Problem]
Date: Wed, 22 Nov 2006 23:07:45 -0800
To: condor-support__AT__cs.wisc.edu


On Nov 22, 2006, at 3:38 PM, condor-support response tracking system  
wrote:

> patched 6.8.3pre version you generated for this ticket

yes, please.

thanks,
-derek




===========================================================================
Date mail was appended: Thu Nov 23  1:07:59 2006 (1164265680)
Date: Sun, 26 Nov 2006 15:47:40 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: espinoza__AT__ligo.caltech.edu, espinoza_e__AT__ligo.caltech.edu
Subject: Re: [condor-support #1694] LIGO Ticket: Condor 6.8.2 x86_64  
 Upgrade [Fwd: [Condor] Problem]

On Thu, Nov 23, 2006 at 01:07:59AM -0600, condor-support response tracking system wrote:
> 
> On Nov 22, 2006, at 3:38 PM, condor-support response tracking system  
> wrote:
> 
> > patched 6.8.3pre version you generated for this ticket
> 
> yes, please.
> 

Derek,
	Now that we have 64-bit setuid root core dump enabled on our
cluster (for 2 days) we are observing a large number of condor_starter core
dumps.  We have not had any condor_startd core dumps yet, but you might
want to look at [condor-support #1750] to see if the condor_starter SIGABRT
problem is related to condor_startd SIGABRT.

Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Sun Nov 26 17:48:05 2006 (1164584885)
Date: Sun, 26 Nov 2006 20:50:18 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: espinoza__AT__ligo.caltech.edu, espinoza_e__AT__ligo.caltech.edu
Subject: Re: [condor-support #1694] LIGO Ticket: Condor 6.8.2 x86_64  
 Upgrade [Fwd: [Condor] Problem]

Derek,

We finally have a core dump image (core.13747.gz) for this problem
that can be found at,
http://www.ligo.caltech.edu/~anderson/condor.1694

Here is the associated email message with the last 200 lines before
the startd SIGABRT:


This is an automated email from the Condor system
on machine "node318.ldas-cit.ligo.caltech.edu".  Do not reply.

"/ldcg/condor/sbin/condor_startd" on "node318.ldas-cit.ligo.caltech.edu" died due to signal 6.
Condor will automatically restart this process in 10 seconds.

*** Last 200 line(s) of file StartLog:
11/26 20:28:03 Reserving 5120 kbytes for file system
11/26 20:28:03 Disk space: 216279812
11/26 20:28:07 Trying to update collector <10.14.0.12:9618>
11/26 20:28:07 Attempting to send update via UDP to collector ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
11/26 20:28:07 vm1: Sent update to 1 collector(s)
11/26 20:28:08 Trying to update collector <10.14.0.12:9618>
11/26 20:28:08 Attempting to send update via UDP to collector ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
11/26 20:28:08 vm2: Sent update to 1 collector(s)
11/26 20:28:09 Trying to update collector <10.14.0.12:9618>
11/26 20:28:09 Attempting to send update via UDP to collector ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
11/26 20:28:09 vm3: Sent update to 1 collector(s)
11/26 20:28:10 Trying to update collector <10.14.0.12:9618>
11/26 20:28:10 Attempting to send update via UDP to collector ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
11/26 20:28:10 vm4: Sent update to 1 collector(s)
11/26 20:28:14 ProcAPI::buildFamily() Found daddypid on the system: 14243
11/26 20:28:29 ProcAPI::buildFamily() Found daddypid on the system: 12943
11/26 20:28:34 ProcAPI::buildFamily() Found daddypid on the system: 14469
11/26 20:28:43 DaemonCore: Command received via UDP from host <10.14.0.12:44771>
11/26 20:28:43 DaemonCore: received command 441 (ALIVE), calling handler (command_handler)
11/26 20:28:43 DaemonCore: Command received via UDP from host <10.14.0.12:44771>
11/26 20:28:43 DaemonCore: received command 441 (ALIVE), calling handler (command_handler)
11/26 20:28:43 DaemonCore: Command received via UDP from host <10.14.0.12:44771>
11/26 20:28:43 DaemonCore: received command 441 (ALIVE), calling handler (command_handler)
11/26 20:28:43 DaemonCore: Command received via UDP from host <10.14.0.12:44771>
11/26 20:28:43 DaemonCore: received command 441 (ALIVE), calling handler (command_handler)
11/26 20:28:49 ProcAPI::buildFamily() Found daddypid on the system: 14719
11/26 20:29:04 ProcAPI::buildFamily() Found daddypid on the system: 14243
11/26 20:29:19 ProcAPI::buildFamily() Found daddypid on the system: 12943
11/26 20:29:25 ProcAPI::buildFamily() Found daddypid on the system: 14469
11/26 20:29:40 ProcAPI::buildFamily() Found daddypid on the system: 14719
11/26 20:29:55 ProcAPI::buildFamily() Found daddypid on the system: 14243
11/26 20:30:10 ProcAPI::buildFamily() Found daddypid on the system: 12943
11/26 20:30:15 ProcAPI::buildFamily() Found daddypid on the system: 14469
11/26 20:30:30 Unable to calculate keyboard/mouse idle time due to them both being USB or not present, assuming infinite idle time for these devices.
11/26 20:30:30 ProcAPI::buildFamily() Found daddypid on the system: 14719
11/26 20:30:45 ProcAPI::buildFamily() Found daddypid on the system: 14243
11/26 20:31:00 ProcAPI::buildFamily() Found daddypid on the system: 12943
11/26 20:31:05 ProcAPI::buildFamily() Found daddypid on the system: 14469
11/26 20:31:20 ProcAPI::buildFamily() Found daddypid on the system: 14719
11/26 20:31:35 ProcAPI::buildFamily() Found daddypid on the system: 14243
11/26 20:31:50 ProcAPI::buildFamily() Found daddypid on the system: 12943
11/26 20:31:55 ProcAPI::buildFamily() Found daddypid on the system: 14469
11/26 20:32:02 Getting monitoring info for pid 13747
11/26 20:32:10 ProcAPI::buildFamily() Found daddypid on the system: 14719
11/26 20:32:17 DaemonCore: Command received via UDP from host <10.14.2.68:33911>
11/26 20:32:17 DaemonCore: received command 60008 (DC_CHILDALIVE), calling handler (HandleChildAliveCommand)
11/26 20:32:25 ProcAPI::buildFamily() Found daddypid on the system: 14243
11/26 20:32:40 ProcAPI::buildFamily() Found daddypid on the system: 12943
11/26 20:32:45 ProcAPI::buildFamily() Found daddypid on the system: 14469
11/26 20:33:00 ProcAPI::buildFamily() Found daddypid on the system: 14719
11/26 20:33:03 Swap space: 10475612
11/26 20:33:03 216284896 kbytes available for "/usr1/condor/execute"
11/26 20:33:03 Looking up RESERVED_DISK parameter
11/26 20:33:03 Reserving 5120 kbytes for file system
11/26 20:33:03 Disk space: 216279776
11/26 20:33:07 Trying to update collector <10.14.0.12:9618>
11/26 20:33:07 Attempting to send update via UDP to collector ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
11/26 20:33:07 vm1: Sent update to 1 collector(s)
11/26 20:33:08 Trying to update collector <10.14.0.12:9618>
11/26 20:33:08 Attempting to send update via UDP to collector ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
11/26 20:33:08 vm2: Sent update to 1 collector(s)
11/26 20:33:09 Trying to update collector <10.14.0.12:9618>
11/26 20:33:09 Attempting to send update via UDP to collector ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
11/26 20:33:09 vm3: Sent update to 1 collector(s)
11/26 20:33:10 Trying to update collector <10.14.0.12:9618>
11/26 20:33:10 Attempting to send update via UDP to collector ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
11/26 20:33:10 vm4: Sent update to 1 collector(s)
11/26 20:33:15 ProcAPI::buildFamily() Found daddypid on the system: 14243
11/26 20:33:30 ProcAPI::buildFamily() Found daddypid on the system: 12943
11/26 20:33:35 ProcAPI::buildFamily() Found daddypid on the system: 14469
11/26 20:33:43 DaemonCore: Command received via UDP from host <10.14.0.12:44771>
11/26 20:33:43 DaemonCore: received command 441 (ALIVE), calling handler (command_handler)
11/26 20:33:43 DaemonCore: Command received via UDP from host <10.14.0.12:44771>
11/26 20:33:43 DaemonCore: received command 441 (ALIVE), calling handler (command_handler)
11/26 20:33:43 DaemonCore: Command received via UDP from host <10.14.0.12:44771>
11/26 20:33:43 DaemonCore: received command 441 (ALIVE), calling handler (command_handler)
11/26 20:33:43 DaemonCore: Command received via UDP from host <10.14.0.12:44771>
11/26 20:33:43 DaemonCore: received command 441 (ALIVE), calling handler (command_handler)
11/26 20:33:50 ProcAPI::buildFamily() Found daddypid on the system: 14719
11/26 20:34:05 ProcAPI::buildFamily() Found daddypid on the system: 14243
11/26 20:34:20 ProcAPI::buildFamily() Found daddypid on the system: 12943
11/26 20:34:25 ProcAPI::buildFamily() Found daddypid on the system: 14469
11/26 20:34:40 ProcAPI::buildFamily() Found daddypid on the system: 14719
11/26 20:34:55 ProcAPI::buildFamily() Found daddypid on the system: 14243
11/26 20:35:10 ProcAPI::buildFamily() Found daddypid on the system: 12943
11/26 20:35:15 ProcAPI::buildFamily() Found daddypid on the system: 14469
11/26 20:35:30 ProcAPI::buildFamily() Found daddypid on the system: 14719
11/26 20:35:45 ProcAPI::buildFamily() Found daddypid on the system: 14243
11/26 20:36:00 ProcAPI::buildFamily() Found daddypid on the system: 12943
11/26 20:36:02 Getting monitoring info for pid 13747
11/26 20:36:05 ProcAPI::buildFamily() Found daddypid on the system: 14469
11/26 20:36:20 ProcAPI::buildFamily() Found daddypid on the system: 14719
11/26 20:36:23 DaemonCore: Command received via UDP from host <10.14.2.68:33918>
11/26 20:36:23 DaemonCore: received command 60008 (DC_CHILDALIVE), calling handler (HandleChildAliveCommand)
11/26 20:36:35 ProcAPI::buildFamily() Found daddypid on the system: 14243
11/26 20:36:50 ProcAPI::buildFamily() Found daddypid on the system: 12943
11/26 20:36:55 ProcAPI::buildFamily() Found daddypid on the system: 14469
11/26 20:37:10 ProcAPI::buildFamily() Found daddypid on the system: 14719
11/26 20:37:25 ProcAPI::buildFamily() Found daddypid on the system: 14243
11/26 20:37:40 ProcAPI::buildFamily() Found daddypid on the system: 12943
11/26 20:37:45 ProcAPI::buildFamily() Found daddypid on the system: 14469
11/26 20:38:00 ProcAPI::buildFamily() Found daddypid on the system: 14719
11/26 20:38:03 Swap space: 10475612
11/26 20:38:03 216284840 kbytes available for "/usr1/condor/execute"
11/26 20:38:03 Looking up RESERVED_DISK parameter
11/26 20:38:03 Reserving 5120 kbytes for file system
11/26 20:38:03 Disk space: 216279720
11/26 20:38:07 vm1: match_info called
11/26 20:38:07 vm1: Started match timer (6391) for 120 seconds.
11/26 20:38:07 Trying to update collector <10.14.0.12:9618>
11/26 20:38:07 Attempting to send update via UDP to collector ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
11/26 20:38:07 vm1: Sent update to 1 collector(s)
11/26 20:38:07 vm3: match_info called
11/26 20:38:07 vm3: Started match timer (6392) for 120 seconds.
11/26 20:38:07 vm4: match_info called
11/26 20:38:07 vm4: Started match timer (6393) for 120 seconds.
11/26 20:38:08 vm1: Canceled match timer (6391)
11/26 20:38:08 vm1: Schedd addr = <10.14.0.12:44107>
11/26 20:38:08 vm1: Alive interval = 300
11/26 20:38:08 vm1: Received ClaimId from schedd (<10.14.2.68:37360>#1164400071#82)
11/26 20:38:08 vm1: Rank of this claim is: 0.000000
11/26 20:38:08 vm1: Preempting claim has correct ClaimId.
11/26 20:38:08 vm1: New claim has sufficient rank, preempting current claim.
11/26 20:38:08 vm1: Got RemoteUser (ileonor@ligo) from request classad
11/26 20:38:08 vm1: State change: preempting claim based on user priority
11/26 20:38:08 vm1: State change: retiring due to preempting claim
11/26 20:38:08 vm1: Changing activity: Busy -> Retiring
11/26 20:38:08 vm1: State change: retirement ended/expired
11/26 20:38:08 vm1: Canceled ClaimLease timer (5736)
11/26 20:38:08 vm1: Changing state and activity: Claimed/Retiring -> Preempting/Vacating
11/26 20:38:08 vm1: In Starter::kill() with pid 14469, sig 15 (SIGTERM)
11/26 20:38:08 Trying to update collector <10.14.0.12:9618>
11/26 20:38:08 Attempting to send update via UDP to collector ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
11/26 20:38:08 vm2: Sent update to 1 collector(s)
11/26 20:38:08 vm3: Canceled match timer (6392)
11/26 20:38:08 vm3: Schedd addr = <10.14.0.12:44107>
11/26 20:38:08 vm3: Alive interval = 300
11/26 20:38:08 vm3: Received ClaimId from schedd (<10.14.2.68:37360>#1164400071#84)
11/26 20:38:08 vm3: Rank of this claim is: 0.000000
11/26 20:38:08 vm3: Preempting claim has correct ClaimId.
11/26 20:38:08 vm3: New claim has sufficient rank, preempting current claim.
11/26 20:38:08 vm3: Got RemoteUser (ileonor@ligo) from request classad
11/26 20:38:08 vm3: State change: preempting claim based on user priority
11/26 20:38:08 vm3: State change: retiring due to preempting claim
11/26 20:38:08 vm3: Changing activity: Busy -> Retiring
11/26 20:38:08 vm3: State change: retirement ended/expired
11/26 20:38:08 vm3: Canceled ClaimLease timer (5763)
11/26 20:38:08 vm3: Changing state and activity: Claimed/Retiring -> Preempting/Vacating
11/26 20:38:08 vm3: In Starter::kill() with pid 14719, sig 15 (SIGTERM)
11/26 20:38:08 vm4: Canceled match timer (6393)
11/26 20:38:08 vm4: Schedd addr = <10.14.0.12:44107>
11/26 20:38:08 vm4: Alive interval = 300
11/26 20:38:08 vm4: Received ClaimId from schedd (<10.14.2.68:37360>#1164400071#80)
11/26 20:38:08 vm4: Rank of this claim is: 0.000000
11/26 20:38:08 vm4: Preempting claim has correct ClaimId.
11/26 20:38:08 vm4: New claim has sufficient rank, preempting current claim.
11/26 20:38:08 vm4: Got RemoteUser (ileonor@ligo) from request classad
11/26 20:38:08 vm4: State change: preempting claim based on user priority
11/26 20:38:08 vm4: State change: retiring due to preempting claim
11/26 20:38:08 vm4: Changing activity: Busy -> Retiring
11/26 20:38:08 vm4: State change: retirement ended/expired
11/26 20:38:08 vm4: Canceled ClaimLease timer (5700)
11/26 20:38:08 vm4: Changing state and activity: Claimed/Retiring -> Preempting/Vacating
11/26 20:38:08 vm4: In Starter::kill() with pid 14243, sig 15 (SIGTERM)
11/26 20:38:08 vm1: Got KILL_FRGN_JOB while in Preempting state, ignoring.
11/26 20:38:08 vm4: Got KILL_FRGN_JOB while in Preempting state, ignoring.
11/26 20:38:08 DaemonCore: No more children processes to reap.
11/26 20:38:08 Starter pid 14469 exited with status 0
11/26 20:38:08 ProcAPI::getProcInfo() pid 14246 does not exist.
11/26 20:38:08 ProcAPI::getProcInfo() pid 14246 does not exist.
11/26 20:38:08 ProcAPI::getProcInfo() pid 14246 does not exist.
11/26 20:38:08 ProcAPI::getProcInfo() pid 14246 does not exist.
11/26 20:38:08 ProcAPI::getProcInfo() pid 14246 does not exist.
11/26 20:38:08 ProcAPI::buildFamily failed: parent 14469 not found on system.
11/26 20:38:08 ProcAPI::getProcInfo() pid 14469 does not exist.
11/26 20:38:08 ProcAPI::getProcInfo() pid 14469 does not exist.
11/26 20:38:08 ProcAPI::getProcInfo() pid 14469 does not exist.
11/26 20:38:08 ProcAPI::getProcInfo() pid 14469 does not exist.
11/26 20:38:08 ProcAPI::getProcInfo() pid 14469 does not exist.
11/26 20:38:08 ProcAPI::getProcInfo() pid 14470 does not exist.
11/26 20:38:08 ProcAPI::getProcInfo() pid 14470 does not exist.
11/26 20:38:08 ProcAPI::getProcInfo() pid 14470 does not exist.
11/26 20:38:08 ProcAPI::getProcInfo() pid 14470 does not exist.
11/26 20:38:08 ProcAPI::getProcInfo() pid 14470 does not exist.
11/26 20:38:08 ProcAPI::getProcInfo() pid 14471 does not exist.
11/26 20:38:08 ProcAPI::getProcInfo() pid 14471 does not exist.
11/26 20:38:08 ProcAPI::getProcInfo() pid 14471 does not exist.
11/26 20:38:08 ProcAPI::getProcInfo() pid 14471 does not exist.
11/26 20:38:08 ProcAPI::getProcInfo() pid 14471 does not exist.
11/26 20:38:08 ProcAPI::getProcInfo() pid 14472 does not exist.
11/26 20:38:08 ProcAPI::getProcInfo() pid 14472 does not exist.
11/26 20:38:08 ProcAPI::getProcInfo() pid 14472 does not exist.
11/26 20:38:08 ProcAPI::getProcInfo() pid 14472 does not exist.
11/26 20:38:08 ProcAPI::getProcInfo() pid 14472 does not exist.
11/26 20:38:08 ProcAPI::getProcInfo() pid 20948 does not exist.
11/26 20:38:08 ProcAPI::getProcInfo() pid 20948 does not exist.
11/26 20:38:08 ProcAPI::getProcInfo() pid 20948 does not exist.
11/26 20:38:08 ProcAPI::getProcInfo() pid 20948 does not exist.
11/26 20:38:08 ProcAPI::getProcInfo() pid 20948 does not exist.
11/26 20:38:08 ProcAPI::buildFamily failed: parent 14472 not found on system.
*** End of file StartLog

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Sun Nov 26 22:50:40 2006 (1164603040)
Date: Mon, 27 Nov 2006 09:33:51 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: espinoza__AT__ligo.caltech.edu, espinoza_e__AT__ligo.caltech.edu
Subject: Re: [condor-support #1694] LIGO Ticket: Condor 6.8.2 x86_64  
 Upgrade [Fwd: [Condor] Problem]

I have posted 5 more core dump images and assocated 200 line log file
messages at,
http://www.ligo.caltech.edu/~anderson/condor.1694

Thanks.

On Sun, Nov 26, 2006 at 08:50:18PM -0800, Stuart Anderson wrote:
> Derek,
> 
> We finally have a core dump image (core.13747.gz) for this problem
> that can be found at,
> http://www.ligo.caltech.edu/~anderson/condor.1694
> 
> Here is the associated email message with the last 200 lines before
> the startd SIGABRT:
> 

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Mon Nov 27 11:34:17 2006 (1164648857)
From: Derek Wright <wright__AT__cs.wisc.edu>
Subject: Re: [condor-support #1694] LIGO Ticket: Condor 6.8.2 x86_64    
 Upgrade [Fwd: [Condor] Problem]
Date: Mon, 27 Nov 2006 12:13:09 -0800
To: condor-support__AT__cs.wisc.edu


On Nov 27, 2006, at 9:34 AM, condor-support response tracking system  
wrote:

> I have posted 5 more core dump images and assocated 200 line log file
> messages at,
> http://www.ligo.caltech.edu/~anderson/condor.1694

i could download the core files (thanks!), but i can't get the logs:


Forbidden

You don't have permission to access /~anderson/condor.1694/log.4816  
on this server.



i'll take a look at the core files now and see if they can show me  
anything.

cheers,

-derek







===========================================================================
Date mail was appended: Mon Nov 27 14:13:21 2006 (1164658402)
Subject: Actions

Status changed from open to pending by wright
===========================================================================
Date of actions: Mon Nov 27 14:13:21 2006 (1164658403)
Date: Mon, 27 Nov 2006 12:18:23 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: espinoza__AT__ligo.caltech.edu, espinoza_e__AT__ligo.caltech.edu
Subject: Re: [condor-support #1694] LIGO Ticket: Condor 6.8.2 x86_64  
 Upgrade [Fwd: [Condor] Problem]

On Mon, Nov 27, 2006 at 02:13:21PM -0600, condor-support response tracking system wrote:
> 
> 
> On Nov 27, 2006, at 9:34 AM, condor-support response tracking system  
> wrote:
> 
> > I have posted 5 more core dump images and assocated 200 line log file
> > messages at,
> > http://www.ligo.caltech.edu/~anderson/condor.1694
> 
> i could download the core files (thanks!), but i can't get the logs:
> 
> 
> Forbidden
> 
> You don't have permission to access /~anderson/condor.1694/log.4816  
> on this server.
> 

Should work now.

> 
> 
> i'll take a look at the core files now and see if they can show me  
> anything.
> 

Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Mon Nov 27 14:18:47 2006 (1164658727)
From: Derek Wright <wright__AT__cs.wisc.edu>
Subject: Re: [condor-support #1694] LIGO Ticket: Condor 6.8.2 x86_64    
 Upgrade [Fwd: [Condor] Problem]
Date: Mon, 27 Nov 2006 15:16:04 -0800
To: condor-support__AT__cs.wisc.edu

> Should work now.

indeed, thanks.

>> i'll take a look at the core files now and see if they can show me  
>> anything.

:(  sadly, the stack frame on all of these core files is totally  
busted (as you've probably noticed).

a few follow-up questions and random thoughts:

1) are you *SURE* these machines were running the version of the  
condor_startd i sent you?

% ident condor_startd.dynamic | grep CondorVersion
      $CondorVersion: 6.8.3 Oct 24 2006 PRE-RELEASE-UWCS $

% file condor_startd.dynamic
condor_startd.dynamic: ELF 64-bit LSB executable, AMD x86-64, version  
1 (SYSV), for GNU/Linux 2.4.0, dynamically linked (uses shared libs),  
not stripped

% md5sum condor_startd.dynamic
47191eb6f8b509bc0ddc7a35c87444b5  condor_startd.dynamic


2) any chance these machines are running out of RAM?


i'll keep trying to analyze the core files to see if i can find any  
hidden meaning or other clues about what's causing the SIGABRT.   
meanwhile, we might be able to try another version of the  
condor_startd with different compiler options that could help us  
debug trashed stack frames.  i'll let you know as soon as i've got  
another version for you to try (if that's what we end up having to do).

stay tuned,
-derek



===========================================================================
Date mail was appended: Mon Nov 27 17:16:16 2006 (1164669377)
Date: Mon, 27 Nov 2006 15:42:59 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: espinoza__AT__ligo.caltech.edu, espinoza_e__AT__ligo.caltech.edu
Subject: Re: [condor-support #1694] LIGO Ticket: Condor 6.8.2 x86_64  
 Upgrade [Fwd: [Condor] Problem]

On Mon, Nov 27, 2006 at 05:16:16PM -0600, condor-support response tracking system wrote:
> >> i'll take a look at the core files now and see if they can show me  
> >> anything.
> 
> :(  sadly, the stack frame on all of these core files is totally  
> busted (as you've probably noticed).

The stack looks reasonable to me:


[root@node155 log]# gdb /ldcg/condor/sbin/condor_startd core.3813 
GNU gdb Red Hat Linux (6.3.0.0-1.84rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...Using host libthread_db library "/lib64/libthread_db.so.1".

Core was generated by `condor_startd -f'.
Program terminated with signal 6, Aborted.
wReading symbols from /lib64/libcrypt.so.1...done.
Loaded symbols for /lib64/libcrypt.so.1
Reading symbols from /lib64/libdl.so.2...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libresolv.so.2...done.
Loaded symbols for /lib64/libresolv.so.2
Reading symbols from /usr/lib64/libstdc++.so.5...done.
Loaded symbols for /usr/lib64/libstdc++.so.5
Reading symbols from /lib64/libm.so.6...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /lib64/libgcc_s.so.1...done.
Loaded symbols for /lib64/libgcc_s.so.1
Reading symbols from /lib64/libc.so.6...hdone.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib64/libnss_files.so.2...done.
Loaded symbols for /lib64/libnss_files.so.2
#0  0x0000003eab92f280 in raise () from /lib64/libc.so.6
(gdb) where
#0  0x0000003eab92f280 in raise () from /lib64/libc.so.6
#1  0x0000003eab930750 in abort () from /lib64/libc.so.6
#2  0x00002b22a7230146 in __cxa_call_unexpected () from /usr/lib64/libstdc++.so.5
#3  0x00002b22a7230183 in std::terminate () from /usr/lib64/libstdc++.so.5
#4  0x00002b22a72302a3 in __cxa_throw () from /usr/lib64/libstdc++.so.5
#5  0x00002b22a7230492 in operator new () from /usr/lib64/libstdc++.so.5
#6  0x00002b22a7230549 in operator new[] () from /usr/lib64/libstdc++.so.5
#7  0x000000000051fa41 in ExtArray<int>::resize (this=0x7fff03935c50, newsz=134217728) at extArray.h:260
#8  0x00000000004ff94d in ExtArray<int>::operator[] (this=0x7fff03935c50, i=67108864) at ../condor_c++_util/extArray.h:149
#9  0x000000000054bd23 in ProcFamily::takesnapshot (this=0xae6e60) at killfamily.C:349
#10 0x000000000054b721 in ProcFamily::hardkill (this=0xae6e60) at killfamily.C:86
#11 0x00000000004b60f8 in Starter::exited (this=0xb44f50) at Starter.C:584
#12 0x00000000004a5021 in Claim::starterExited (this=0xa96cc0) at claim.C:958
#13 0x00000000004bd418 in reaper (pid=10684, status=0) at startd_main.C:770
#14 0x00000000004fe26f in DaemonCore::HandleProcessExit (this=0xa31e00, pid=10684, exit_status=0) at daemon_core.C:7427
#15 0x00000000004fdf0a in DaemonCore::HandleDC_SERVICEWAITPIDS (this=0xa31e00, sig=60009) at daemon_core.C:7024
#16 0x00000000004f5565 in DaemonCore::Driver (this=0xa31e00) at daemon_core.C:2207
#17 0x0000000000502d1d in main (argc=1, argv=0x7fff039366d0) at daemon_core_main.C:1919


> 
> a few follow-up questions and random thoughts:
> 
> 1) are you *SURE* these machines were running the version of the  
> condor_startd i sent you?
> 
> % ident condor_startd.dynamic | grep CondorVersion
>       $CondorVersion: 6.8.3 Oct 24 2006 PRE-RELEASE-UWCS $
> 
> % file condor_startd.dynamic
> condor_startd.dynamic: ELF 64-bit LSB executable, AMD x86-64, version  
> 1 (SYSV), for GNU/Linux 2.4.0, dynamically linked (uses shared libs),  
> not stripped
> 
> % md5sum condor_startd.dynamic
> 47191eb6f8b509bc0ddc7a35c87444b5  condor_startd.dynamic
> 

Yes,

[root@node1 ~]# lsof /ldcg/stow_pkgs/condor-6.8.2/condor/sbin/condor_startd
COMMAND    PID   USER  FD   TYPE DEVICE     SIZE   NODE NAME
condor_st 4524 condor txt    REG   0,18 12150006 181033 /ldcg/stow_pkgs/condor-6.8.2/condor/sbin/condor_startd

[root@node1 ~]# md5sum /ldcg/stow_pkgs/condor-6.8.2/condor/sbin/condor_startd
47191eb6f8b509bc0ddc7a35c87444b5  /ldcg/stow_pkgs/condor-6.8.2/condor/sbin/condor_startd

and even more conclusively,

$ strings core.9404 | grep CondorVersion
CondorVersion
CondorVersion
$CondorVersion: 6.8.2 Oct 12 2006 $
$CondorVersion: 6.8.2 Oct 12 2006 $
CondorVersion
$CondorVersion: 6.8.3 Oct 24 2006 PRE-RELEASE-UWCS $
$CondorVersion: 6.8.3 Oct 24 2006 PRE-RELEASE-UWCS $


> 
> 2) any chance these machines are running out of RAM?
> 

Always a possibility, but unlikely. The machines have 4GB of physical RAM
and 10GB of swap. They are only running Condor jobs and we run,
ulimit -v 1300000
in the start up script for condor_master to prevent users from using
up all the memory.

It appears that normally condor_startd only uses ~26MByte of memory on our
nodes (as reported by pmap) so that is pretty far away from 1.3GByte of
allowed virual memory. However, the uncompressed core dump images are
~800MByte which may be a clue in and of itself.

> 
> i'll keep trying to analyze the core files to see if i can find any  
> hidden meaning or other clues about what's causing the SIGABRT.   
> meanwhile, we might be able to try another version of the  
> condor_startd with different compiler options that could help us  
> debug trashed stack frames.  i'll let you know as soon as i've got  
> another version for you to try (if that's what we end up having to do).
> 
> stay tuned,
> -derek
> 

Tomorrow we will be shutting down the pool for hardware maintenance work,
so this is a good time to drop a new binary in place.

Thanks.


-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Mon Nov 27 17:43:26 2006 (1164671007)
From: Derek Wright <wright__AT__cs.wisc.edu>
Subject: Re: [condor-support #1694] LIGO Ticket: Condor 6.8.2 x86_64    
 Upgrade [Fwd: [Condor] Problem]
Date: Mon, 27 Nov 2006 20:08:14 -0800
To: condor-support__AT__cs.wisc.edu


> The stack looks reasonable to me:

wow, i wonder why you're getting such useful info and i'm just getting:

#0  0x00000033f5b2f280 in ?? ()
#1  0x00000033f5b30750 in ?? ()
#2  0x00002b1a92cc8980 in ?? ()
#3  0x00002b1a92aec130 in ?? ()
#4  0x00007fff17fdc9b8 in ?? ()
#5  0x00002b1a92aebde8 in ?? ()
#6  0x0000000000000000 in ?? ()

i'm using the binaries i sent you.  looks like the only difference is  
our copy of /usr/bin/gdb is a slightly older version than yours.  weird.

anyway, that backtrace is *incredibly* helpful.  armed with that  
info, and after a very quick glance at the code, i now have a pretty  
solid lead on what the underlying problem is.  it'll take me a little  
while to code the fix and test it thoroughly, but the end is  
*finally* in sight.

i'll let you know as soon as i have a version that fixes the problem  
which you could install as a beta tester before the official 6.8.3  
release.  expect new binaries sometime in the next day or 2 at the  
latest.

thanks so much for all your patience and help!

-derek



===========================================================================
Date mail was appended: Mon Nov 27 22:08:21 2006 (1164686902)
Date: Mon, 27 Nov 2006 20:29:41 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: espinoza__AT__ligo.caltech.edu, espinoza_e__AT__ligo.caltech.edu
Subject: Re: [condor-support #1694] LIGO Ticket: Condor 6.8.2 x86_64  
 Upgrade [Fwd: [Condor] Problem]

On Mon, Nov 27, 2006 at 10:08:21PM -0600, condor-support response tracking system wrote:
> 
> > The stack looks reasonable to me:
> 
> wow, i wonder why you're getting such useful info and i'm just getting:
> 
> #0  0x00000033f5b2f280 in ?? ()
> #1  0x00000033f5b30750 in ?? ()
> #2  0x00002b1a92cc8980 in ?? ()
> #3  0x00002b1a92aec130 in ?? ()
> #4  0x00007fff17fdc9b8 in ?? ()
> #5  0x00002b1a92aebde8 in ?? ()
> #6  0x0000000000000000 in ?? ()
> 
> i'm using the binaries i sent you.  looks like the only difference is  
> our copy of /usr/bin/gdb is a slightly older version than yours.  weird.

Are you running gdb on an FC4 x86_64 system? If so, then besides the gdb
version itself you might want to check how close your system is to the
glibc and libstdc++ libraries from our systems that are generating the core
images, i.e.,
compat-libstdc++-33-3.2.3-47.fc4
glibc-2.3.5-10.3


> 
> anyway, that backtrace is *incredibly* helpful.  armed with that  
> info, and after a very quick glance at the code, i now have a pretty  
> solid lead on what the underlying problem is.  it'll take me a little  
> while to code the fix and test it thoroughly, but the end is  
> *finally* in sight.


Here is another stack traced from a different crash today that looks the same
to me, and hopefully helps confirm your current working hypothesis.

(gdb) where
#0  0x0000003551f2f280 in raise () from /lib64/libc.so.6
#1  0x0000003551f30750 in abort () from /lib64/libc.so.6
#2  0x00002b566948d146 in __cxa_call_unexpected () from /usr/lib64/libstdc++.so.5
#3  0x00002b566948d183 in std::terminate () from /usr/lib64/libstdc++.so.5
#4  0x00002b566948d2a3 in __cxa_throw () from /usr/lib64/libstdc++.so.5
#5  0x00002b566948d492 in operator new () from /usr/lib64/libstdc++.so.5
#6  0x00002b566948d549 in operator new[] () from /usr/lib64/libstdc++.so.5
#7  0x000000000051fa41 in ExtArray<int>::resize (this=0x7fff416d8aa0, newsz=134217728) at extArray.h:260
#8  0x00000000004ff94d in ExtArray<int>::operator[] (this=0x7fff416d8aa0, i=67108864)
    at ../condor_c++_util/extArray.h:149
#9  0x000000000054bd23 in ProcFamily::takesnapshot (this=0xb52c50) at killfamily.C:349
#10 0x000000000054b721 in ProcFamily::hardkill (this=0xb52c50) at killfamily.C:86
#11 0x00000000004b60f8 in Starter::exited (this=0xb11800) at Starter.C:584
#12 0x00000000004a5021 in Claim::starterExited (this=0xb5c820) at claim.C:958
#13 0x00000000004bd418 in reaper (pid=28394, status=0) at startd_main.C:770
#14 0x00000000004fe26f in DaemonCore::HandleProcessExit (this=0xa31de0, pid=28394, exit_status=0) at daemon_core.C:7427
#15 0x00000000004fdf0a in DaemonCore::HandleDC_SERVICEWAITPIDS (this=0xa31de0, sig=60009) at daemon_core.C:7024
#16 0x00000000004f5565 in DaemonCore::Driver (this=0xa31de0) at daemon_core.C:2207
#17 0x0000000000502d1d in main (argc=1, argv=0x7fff416d9520) at daemon_core_main.C:1919

> 
> i'll let you know as soon as i have a version that fixes the problem  
> which you could install as a beta tester before the official 6.8.3  
> release.  expect new binaries sometime in the next day or 2 at the  
> latest.

Great. I would also be interested in a brief discription of the problem
at some point.

Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Mon Nov 27 22:30:04 2006 (1164688205)
From: Derek Wright <wright__AT__cs.wisc.edu>
Subject: Re: [condor-support #1694] LIGO Ticket: Condor 6.8.2 x86_64    
 Upgrade [Fwd: [Condor] Problem]
Date: Mon, 27 Nov 2006 20:55:00 -0800
To: condor-support__AT__cs.wisc.edu


On Nov 27, 2006, at 8:30 PM, Stuart Anderson wrote:

> Are you running gdb on an FC4 x86_64 system?

no, i was using an rhas_3 system. :(  that's the platform i built the  
binaries on, which i assumed was what you were using. ;)

> If so, then besides the gdb
> version itself you might want to check how close your system is to the
> glibc and libstdc++ libraries from our systems that are generating  
> the core
> images, i.e.,
> compat-libstdc++-33-3.2.3-47.fc4
> glibc-2.3.5-10.3

duly noted, thanks.

> Here is another stack traced from a different crash today that  
> looks the same
> to me, and hopefully helps confirm your current working hypothesis.

yup, definitely matches my current theory, thanks.

> I would also be interested in a brief discription of the problem at  
> some point.

sure.  i'll wait until i've actually started fixing the code and make  
sure that solves it, but writing up my findings for some other condor  
developers has pretty thoroughly convinced me of the correctness of  
my theory.   probably tomorrow afternoon i'll have new binaries for  
you to try, and an explanation of what went wrong and how the new  
binaries fix it.

short answer: some recently added code isn't checking a return value  
it should be and is walking off the end of an uninitialized array as  
a result. :(  i haven't looked closely enough to see why this isn't  
happening all the time, and also why we never caught this bug with  
our nightly tests.  that's included in my list of "bugs" to fix as a  
result of this ticket... not just the code, but why we never saw the  
bug until now.

-derek



===========================================================================
Date mail was appended: Mon Nov 27 22:55:08 2006 (1164689708)
Date: Tue, 28 Nov 2006 10:28:34 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: espinoza__AT__ligo.caltech.edu, espinoza_e__AT__ligo.caltech.edu
Subject: Re: [condor-support #1694] LIGO Ticket: Condor 6.8.2 x86_64  
 Upgrade [Fwd: [Condor] Problem]

On Mon, Nov 27, 2006 at 10:55:08PM -0600, condor-support response tracking system wrote:
> 
> On Nov 27, 2006, at 8:30 PM, Stuart Anderson wrote:
> 
> > Are you running gdb on an FC4 x86_64 system?
> 
> no, i was using an rhas_3 system. :(  that's the platform i built the  
> binaries on, which i assumed was what you were using. ;)

There is no FC4 x86_64 natively built release of Condor that I know of
and I believe the recomended bits for this platform are the RHEL3 ones,
however, please correct me if I am wrong.

Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Tue Nov 28 12:29:03 2006 (1164738544)
From: Derek Wright <wright__AT__cs.wisc.edu>
Subject: Re: [condor-support #1694] LIGO Ticket: Condor 6.8.2 x86_64    
 Upgrade [Fwd: [Condor] Problem]
Date: Tue, 28 Nov 2006 10:46:00 -0800
To: condor-support__AT__cs.wisc.edu


On Nov 28, 2006, at 10:29 AM, Stuart Anderson wrote:

> There is no FC4 x86_64 natively built release of Condor that I know of
> and I believe the recomended bits for this platform are the RHEL3  
> ones,
> however, please correct me if I am wrong.

all true.  i just assumed that when you asked for the RHEL3 binaries,  
that's what you needed.  your reply about gdb was the first time you  
referenced "FC4" in the entire ticket. ;)  a few times i asked about  
rhel3 and got "yes" as the answer...

anyway, no big deal.  we're talking about adding some code so that  
when the daemons print up their start-up banner in the log files,  
they print out now just the compiled platform, but also the runtime  
OS, to help avoid these sorts of miscommunications in the future.

however, stay tuned for the bug fix,
-derek



===========================================================================
Date mail was appended: Tue Nov 28 12:46:08 2006 (1164739568)
From: Derek Wright <wright__AT__cs.wisc.edu>
Subject: Re: [condor-support #1694] LIGO Ticket: Condor 6.8.2 x86_64    
 Upgrade [Fwd: [Condor] Problem]
Date: Wed, 29 Nov 2006 12:08:12 -0800
To: condor-support__AT__cs.wisc.edu


a new 6.8.3 pre-release of the startd that should fix this bug is now  
available here:

ftp://ftp.cs.wisc.edu/condor/temporary/forligo/x86_64_rhel3/ 
condor-6.8.3-pre-x86_64_rhel3-2006-11-29.tar.gz

again, there's .static and .dynamic versions, and both are unstripped.

after futher review, it's clear that there's a race condition which  
has to be hit for this bug to appear, which is why it's been elusive  
in local testing/debugging.

please give this a try and let me know as soon as possible if it  
solves your startd SIGABRT woes.  we're rapidly approaching a code  
freeze for the 6.8.3 release, and i'd *love* to get this committed  
before that happens.

thanks!
-derek

p.s. i've already started looking into "[condor-support #1750] LIGO  
condor_starter core dumps".  i'll reply to that ticket, but just  
wanted to let you know i haven't forgotten about that one, either. ;)




===========================================================================
Date mail was appended: Wed Nov 29 14:08:23 2006 (1164830903)
Subject: Actions

Status changed from open to pending by wright
===========================================================================
Date of actions: Wed Nov 29 14:08:23 2006 (1164830905)
Date: Wed, 29 Nov 2006 12:11:53 -0800
From: "Erik A. Espinoza" <espinoza__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu
CC: anderson__AT__ligo.caltech.edu, espinoza_e__AT__ligo.caltech.edu
Subject: Re: [condor-support #1694] LIGO Ticket: Condor 6.8.2 x86_64     
 Upgrade  [Fwd: [Condor] Problem]
X-Enigmail-Version: 0.94.1.0
Openpgp: url=http://pgp.mit.edu/
X-No-Archive: Yes
X-Archive: No

Hey Derek,

I will be testing this on my local install first. If all goes well,
we'll probably deploy this on the main cluster by the end of the day.

Thanks,
Erik

condor-support response tracking system wrote:
> a new 6.8.3 pre-release of the startd that should fix this bug is now  
> available here:
> 
> ftp://ftp.cs.wisc.edu/condor/temporary/forligo/x86_64_rhel3/ 
> condor-6.8.3-pre-x86_64_rhel3-2006-11-29.tar.gz
> 
> again, there's .static and .dynamic versions, and both are unstripped.
> 
> after futher review, it's clear that there's a race condition which  
> has to be hit for this bug to appear, which is why it's been elusive  
> in local testing/debugging.
> 
> please give this a try and let me know as soon as possible if it  
> solves your startd SIGABRT woes.  we're rapidly approaching a code  
> freeze for the 6.8.3 release, and i'd *love* to get this committed  
> before that happens.
> 
> thanks!
> -derek
> 
> p.s. i've already started looking into "[condor-support #1750] LIGO  
> condor_starter core dumps".  i'll reply to that ticket, but just  
> wanted to let you know i haven't forgotten about that one, either. ;)
> 
> 
> 
> 
> 
> ========================================
> MESSAGE INFORMATION
> ========================================
> * From: Derek Wright <wright__AT__cs.wisc.edu>
> * Ticket Email List: espinoza__AT__ligo.caltech.edu, anderson__AT__ligo.caltech.edu,espinoza_e__AT__ligo.caltech.edu
> 

-- 
Erik A. Espinoza
Systems Administrator
LIGO/Caltech - MS 18-34
Pasadena, CA 91125
Ph: 626-395-8517

===========================================================================
Date mail was appended: Wed Nov 29 14:12:20 2006 (1164831141)
From: Derek Wright <wright__AT__cs.wisc.edu>
Subject: Re: [condor-support #1694] LIGO Ticket: Condor 6.8.2 x86_64    
 Upgrade [Fwd: [Condor] Problem]
Date: Wed, 29 Nov 2006 12:13:47 -0800
To: condor-support__AT__cs.wisc.edu


On Nov 29, 2006, at 12:12 PM, Erik A. Espinoza wrote:

> I will be testing this on my local install first. If all goes well,
> we'll probably deploy this on the main cluster by the end of the day.

lovely.  please let me know as soon as you have any news (either good  
or bad). ;)

thanks,
-derek



===========================================================================
Date mail was appended: Wed Nov 29 14:13:59 2006 (1164831240)
Subject: Actions

Status changed from open to pending by wright
===========================================================================
Date of actions: Wed Nov 29 14:13:59 2006 (1164831241)
Date: Wed, 29 Nov 2006 12:31:22 -0800
From: "Erik A. Espinoza" <espinoza__AT__ligo.caltech.edu>
To: condor-support__AT__cs.wisc.edu
CC: anderson__AT__ligo.caltech.edu, espinoza_e__AT__ligo.caltech.edu
Subject: Re: [condor-support #1694] LIGO Ticket: Condor 6.8.2 x86_64     
 Upgrade  [Fwd: [Condor] Problem]
X-Enigmail-Version: 0.94.1.0
Openpgp: url=http://pgp.mit.edu/
X-No-Archive: Yes
X-Archive: No

> lovely.  please let me know as soon as you have any news (either good  
> or bad). ;)

All tests on my local system have succeeded. We will be proceeding with
the deployment today. I'll keep you posted.

Thanks,
-- 
Erik A. Espinoza
Systems Administrator
LIGO/Caltech - MS 18-34
Pasadena, CA 91125
Ph: 626-395-8517

===========================================================================
Date mail was appended: Wed Nov 29 14:32:00 2006 (1164832321)
Subject: Actions

Status changed from open to pending by wright
===========================================================================
Date of actions: Wed Nov 29 14:52:57 2006 (1164833577)
Date: Wed, 29 Nov 2006 15:41:33 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: espinoza__AT__ligo.caltech.edu, espinoza_e__AT__ligo.caltech.edu
Subject: Re: [condor-support #1694] LIGO Ticket: Condor 6.8.2 x86_64  
 Upgrade [Fwd: [Condor] Problem]

On Wed, Nov 29, 2006 at 03:23:23PM -0800, Stuart Anderson wrote:
> On Wed, Nov 29, 2006 at 02:13:59PM -0600, condor-support response tracking system wrote:
> > 
> > 
> > On Nov 29, 2006, at 12:12 PM, Erik A. Espinoza wrote:
> > 
> > > I will be testing this on my local install first. If all goes well,
> > > we'll probably deploy this on the main cluster by the end of the day.
> > 
> > lovely.  please let me know as soon as you have any news (either good  
> > or bad). ;)
> > 
> 
> Bad news, it does not work at all on our production cluster and exits
> shortly after startup with:
> 
> 11/29 15:01:23 ERROR "DaemonCore: select() returned an unexpected error: 9 (Bad file descriptor)" at line 2410 in file daemon_core.C
> 

We now suspect this is because part of the fix is presumably in daemoncore
and Erik tested the static binary on his desktop machine (which worked),
but we installed the dynamically linked binary on the production cluster
which is then using the old (6.8.2) daemon core.

Is this plausible enough from what you know that we should try again
with the static version, or is there likely to be a different problem?

Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Wed Nov 29 18:31:19 2006 (1164846679)
Date: Wed, 29 Nov 2006 15:23:23 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: espinoza__AT__ligo.caltech.edu, espinoza_e__AT__ligo.caltech.edu
Subject: Re: [condor-support #1694] LIGO Ticket: Condor 6.8.2 x86_64  
 Upgrade [Fwd: [Condor] Problem]

On Wed, Nov 29, 2006 at 02:13:59PM -0600, condor-support response tracking system wrote:
> 
> 
> On Nov 29, 2006, at 12:12 PM, Erik A. Espinoza wrote:
> 
> > I will be testing this on my local install first. If all goes well,
> > we'll probably deploy this on the main cluster by the end of the day.
> 
> lovely.  please let me know as soon as you have any news (either good  
> or bad). ;)
> 

Bad news, it does not work at all on our production cluster and exits
shortly after startup with:

11/29 15:01:23 ERROR "DaemonCore: select() returned an unexpected error: 9 (Bad file descriptor)" at line 2410 in file daemon_core.C


Here is an example 200 line condor_startd crash report.


This is an automated email from the Condor system
on machine "node8.ldas-cit.ligo.caltech.edu".  Do not reply.

"/ldcg/condor/sbin/condor_startd" on "node8.ldas-cit.ligo.caltech.edu" exited with status 4.
Condor will automatically restart this process in 10 seconds.

*** Last 200 line(s) of file StartLog:
11/29 14:54:43 vm4: Changing state: Unclaimed -> Backfill
11/29 14:54:43 Got SIGQUIT.  Performing fast shutdown.
11/29 14:54:43 shutdown fast
11/29 14:54:43 StartdCronMgr: Shutting down
11/29 14:54:43 CronMgr: Killing all jobs
11/29 14:54:43 CronMgr: 0 jobs alive
11/29 14:54:43 Deleting Cronmgr
11/29 14:54:43 StartdCronMgr: Shutting down
11/29 14:54:43 CronMgr: Killing all jobs
11/29 14:54:43 StartdCronMgr: Bye
11/29 14:54:43 CronMgr: bye
11/29 14:54:43 About to send final update to the central manager
11/29 14:54:43 Trying to update collector <10.14.0.12:9618>
11/29 14:54:43 Attempting to send update via UDP to collector ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
11/29 14:54:43 Trying to update collector <10.14.0.12:9618>
11/29 14:54:43 Attempting to send update via UDP to collector ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
11/29 14:54:43 Trying to update collector <10.14.0.12:9618>
11/29 14:54:43 Attempting to send update via UDP to collector ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
11/29 14:54:43 Trying to update collector <10.14.0.12:9618>
11/29 14:54:43 Attempting to send update via UDP to collector ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
11/29 14:54:43 Destroying a BOINC_BackfillMgr
11/29 14:54:43 Destroying a BackfillMgr
11/29 14:54:43 All resources are free, exiting.
11/29 14:54:43 **** condor_startd (condor_STARTD) EXITING WITH STATUS 0
11/29 14:59:38 ******************************************************
11/29 14:59:38 ** condor_startd (CONDOR_STARTD) STARTING UP
11/29 14:59:38 ** /ldcg/stow_pkgs/condor-6.8.2/condor/sbin/condor_startd
11/29 14:59:38 ** $CondorVersion: 6.8.3 Nov 29 2006 PRE-RELEASE-UWCS $
11/29 14:59:38 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
11/29 14:59:38 ** PID = 16286
11/29 14:59:38 ** Log last touched 11/29 14:54:43
11/29 14:59:38 ******************************************************
11/29 14:59:38 Using config source: /usr1/condor/condor_config
11/29 14:59:38 Using local config sources: 
11/29 14:59:38    /usr1/condor/condor_config.local
11/29 14:59:38 DaemonCore: Command Socket at <10.14.1.8:37728>
11/29 14:59:38 Memory: Detected 4005 megs RAM
11/29 14:59:43 "/ldcg/condor/sbin/condor_starter.pvm -classad" did not produce any output, ignoring
11/29 14:59:43 Will use UDP to update collector ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
11/29 14:59:44 vm1: New machine resource of type 1 allocated
11/29 14:59:44 vm2: New machine resource of type 1 allocated
11/29 14:59:44 vm3: New machine resource of type 2 allocated
11/29 14:59:44 vm4: New machine resource of type 2 allocated
11/29 14:59:44 Instantiating a BackfillMgr
11/29 14:59:44 Instantiating a BOINC_BackfillMgr
11/29 14:59:44 Created a BOINC Backfill Manager
11/29 14:59:44 UidDomain = "ligo"
11/29 14:59:44 FileSystemDomain = "ligo"
11/29 14:59:44 Subnet = "10.14.1"
11/29 14:59:44 Swap space: 10482320
11/29 14:59:44 165741980 kbytes available for "/usr1/condor/execute"
11/29 14:59:44 Looking up RESERVED_DISK parameter
11/29 14:59:44 Reserving 5120 kbytes for file system
11/29 14:59:44 Disk space: 165736860
11/29 14:59:44 /dev/null major dev num is 1
11/29 14:59:44 Initialized last_km_activity
11/29 14:59:44 About to run initial benchmarks.
11/29 14:59:44 About to compute mips
11/29 14:59:44 Computed mips: 4378
11/29 14:59:44 About to compute kflops
11/29 14:59:47 Computed kflops: 1081365
11/29 14:59:47 recalc:DHRY_MIPS=4378, CLINPACK KFLOPS=1081365
11/29 14:59:47 Completed initial benchmarks.
11/29 14:59:47 CronMgr: Constructing 'startd'
11/29 14:59:47 CronMgr: Setting name to 'startd'
11/29 14:59:47 CronMgr: Setting parameter base to 'startd'
11/29 14:59:47 CronMgr: Doing config (initial)
11/29 14:59:47 vm1: State change: IS_OWNER is false
11/29 14:59:47 vm1: Changing state: Owner -> Unclaimed
11/29 14:59:47 vm1: State change: START_BACKFILL is TRUE
11/29 14:59:47 vm1: Changing state: Unclaimed -> Backfill
11/29 14:59:47 Started polling timer.
11/29 14:59:47 vm2: State change: IS_OWNER is false
11/29 14:59:47 vm2: Changing state: Owner -> Unclaimed
11/29 14:59:47 vm2: State change: START_BACKFILL is TRUE
11/29 14:59:47 vm2: Changing state: Unclaimed -> Backfill
11/29 14:59:47 vm3: State change: IS_OWNER is false
11/29 14:59:47 vm3: Changing state: Owner -> Unclaimed
11/29 14:59:47 vm3: State change: START_BACKFILL is TRUE
11/29 14:59:47 vm3: Changing state: Unclaimed -> Backfill
11/29 14:59:47 vm4: State change: IS_OWNER is false
11/29 14:59:47 vm4: Changing state: Owner -> Unclaimed
11/29 14:59:47 vm4: State change: START_BACKFILL is TRUE
11/29 14:59:47 vm4: Changing state: Unclaimed -> Backfill
11/29 14:59:47 DaemonCore: No more children processes to reap.
11/29 14:59:47 Getting monitoring info for pid 16286
11/29 14:59:47 DaemonCore: in SendAliveToParent()
11/29 14:59:47 DaemonCore: attempting to connect to '<10.14.1.8:51829>'
11/29 14:59:51 Trying to update collector <10.14.0.12:9618>
11/29 14:59:51 Attempting to send update via UDP to collector ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
11/29 14:59:51 vm1: Sent update to 1 collector(s)
11/29 14:59:52 Entering BOINC_BackfillMgr::spawnClient()
11/29 14:59:52 About to Create_Process "condor_starter -f -append boinc -job-keyword boinc"
11/29 14:59:52 ProcAPI::buildFamily() Found daddypid on the system: 16294
11/29 14:59:52 Spawned BOINC starter: (pid 16294)
11/29 14:59:52 New BOINC_BackfillVM (id 1) created
11/29 14:59:52 State change: BOINC client running for vm1
11/29 14:59:52 vm1: Changing activity: Idle -> Busy
11/29 14:59:52 VM 2 wants to do backfill, already have a BOINC client running (pid 16294)
11/29 14:59:52 New BOINC_BackfillVM (id 2) created
11/29 14:59:52 State change: BOINC client running for vm2
11/29 14:59:52 vm2: Changing activity: Idle -> Busy
11/29 14:59:52 VM 3 wants to do backfill, already have a BOINC client running (pid 16294)
11/29 14:59:52 New BOINC_BackfillVM (id 3) created
11/29 14:59:52 State change: BOINC client running for vm3
11/29 14:59:52 vm3: Changing activity: Idle -> Busy
11/29 14:59:52 VM 4 wants to do backfill, already have a BOINC client running (pid 16294)
11/29 14:59:52 New BOINC_BackfillVM (id 4) created
11/29 14:59:52 State change: BOINC client running for vm4
11/29 14:59:52 vm4: Changing activity: Idle -> Busy
11/29 14:59:52 Trying to update collector <10.14.0.12:9618>
11/29 14:59:52 Attempting to send update via UDP to collector ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
11/29 14:59:52 vm2: Sent update to 1 collector(s)
11/29 14:59:53 Trying to update collector <10.14.0.12:9618>
11/29 14:59:53 Attempting to send update via UDP to collector ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
11/29 14:59:53 vm3: Sent update to 1 collector(s)
11/29 14:59:53 DaemonCore: Command received via UDP from host <10.14.1.8:34745>
11/29 14:59:53 DaemonCore: received command 60008 (DC_CHILDALIVE), calling handler (HandleChildAliveCommand)
11/29 14:59:54 Trying to update collector <10.14.0.12:9618>
11/29 14:59:54 Attempting to send update via UDP to collector ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
11/29 14:59:54 vm4: Sent update to 1 collector(s)
11/29 14:59:56 Trying to update collector <10.14.0.12:9618>
11/29 14:59:56 Attempting to send update via UDP to collector ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
11/29 14:59:56 vm1: Sent update to 1 collector(s)
11/29 15:00:55 vm3: match_info called
11/29 15:00:55 vm3: Received match <10.14.1.8:37728>#1164841178#3
11/29 15:00:55 vm3: Started match timer (17) for 120 seconds.
11/29 15:00:55 vm3: State change: match notification protocol successful
11/29 15:00:55 vm3: Set destination state to Matched
11/29 15:00:55 vm3: Changing activity: Busy -> Killing
11/29 15:00:55 BOINC_BackfillVM (id 3) deleted
11/29 15:00:55 vm3: State change: starter exited
11/29 15:00:55 vm3: Changing state and activity: Backfill/Killing -> Matched/Idle
11/29 15:00:58 vm4: match_info called
11/29 15:00:58 vm4: Received match <10.14.1.8:37728>#1164841178#4
11/29 15:00:58 vm4: Started match timer (19) for 120 seconds.
11/29 15:00:58 vm4: State change: match notification protocol successful
11/29 15:00:58 vm4: Set destination state to Matched
11/29 15:00:58 vm4: Changing activity: Busy -> Killing
11/29 15:00:58 BOINC_BackfillVM (id 4) deleted
11/29 15:00:58 vm4: State change: starter exited
11/29 15:00:58 vm4: Changing state and activity: Backfill/Killing -> Matched/Idle
11/29 15:01:01 Trying to update collector <10.14.0.12:9618>
11/29 15:01:01 Attempting to send update via UDP to collector ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
11/29 15:01:01 vm3: Sent update to 1 collector(s)
11/29 15:01:05 Trying to update collector <10.14.0.12:9618>
11/29 15:01:05 Attempting to send update via UDP to collector ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
11/29 15:01:05 vm4: Sent update to 1 collector(s)
11/29 15:01:11 vm2: match_info called
11/29 15:01:11 vm2: Received match <10.14.1.8:37728>#1164841178#2
11/29 15:01:11 vm2: Started match timer (23) for 120 seconds.
11/29 15:01:11 vm2: State change: match notification protocol successful
11/29 15:01:11 vm2: Set destination state to Matched
11/29 15:01:11 vm2: Changing activity: Busy -> Killing
11/29 15:01:11 BOINC_BackfillVM (id 2) deleted
11/29 15:01:11 vm2: State change: starter exited
11/29 15:01:11 vm2: Changing state and activity: Backfill/Killing -> Matched/Idle
11/29 15:01:16 Trying to update collector <10.14.0.12:9618>
11/29 15:01:16 Attempting to send update via UDP to collector ldas-grid.ligo.caltech.edu <10.14.0.12:9618>
11/29 15:01:16 vm2: Sent update to 1 collector(s)
11/29 15:01:23 vm3: Canceled match timer (17)
11/29 15:01:23 vm3: Schedd addr = <10.14.0.12:47360>
11/29 15:01:23 vm3: Alive interval = 300
11/29 15:01:23 vm3: Received ClaimId from schedd (<10.14.1.8:37728>#1164841178#3)
11/29 15:01:23 vm3: Rank of this claim is: 0.000000
11/29 15:01:23 vm3: Request accepted.
11/29 15:01:23 vm3: Remote owner is igor@ligo
11/29 15:01:23 vm3: State change: claiming protocol successful
11/29 15:01:23 vm3: Changing state: Matched -> Claimed
11/29 15:01:23 vm3: Started ClaimLease timer (26) w/ 1800 second lease duration
11/29 15:01:23 vm3: Computing claimWorklifeExpired(); ClaimAge=0, ClaimWorklife=1800
11/29 15:01:23 ERROR "DaemonCore: select() returned an unexpected error: 9 (Bad file descriptor)
" at line 2410 in file daemon_core.C
11/29 15:01:23 CronMgr: 0 jobs alive
11/29 15:01:23 vm3: Canceled ClaimLease timer (26)
11/29 15:01:23 vm3: Changing state and activity: Claimed/Idle -> Preempting/Killing
11/29 15:01:23 Entered vacate_client <10.14.0.12:47360> ldas-grid.ligo.caltech.edu...
11/29 15:01:23 vm3: State change: No preempting claim, returning to owner
11/29 15:01:23 vm3: Changing state and activity: Preempting/Killing -> Owner/Idle
11/29 15:01:23 vm3: State change: IS_OWNER is false
11/29 15:01:23 vm3: Changing state: Owner -> Unclaimed
11/29 15:01:23 vm3: State change: START_BACKFILL is TRUE
11/29 15:01:23 vm3: Changing state: Unclaimed -> Backfill
11/29 15:01:23 vm4: Changing state: Matched -> Owner
11/29 15:01:23 vm4: Canceled match timer (19)
11/29 15:01:23 vm4: State change: IS_OWNER is false
11/29 15:01:23 vm4: Changing state: Owner -> Unclaimed
11/29 15:01:23 vm4: State change: START_BACKFILL is TRUE
11/29 15:01:23 vm4: Changing state: Unclaimed -> Backfill
11/29 15:01:23 vm2: Changing state: Matched -> Owner
11/29 15:01:23 vm2: Canceled match timer (23)
11/29 15:01:23 vm2: State change: IS_OWNER is false
11/29 15:01:23 vm2: Changing state: Owner -> Unclaimed
11/29 15:01:23 vm2: State change: START_BACKFILL is TRUE
11/29 15:01:23 vm2: Changing state: Unclaimed -> Backfill
11/29 15:01:23 vm1: Set destination state to Owner
11/29 15:01:23 vm1: Changing activity: Busy -> Killing
11/29 15:01:23 In Starter::kill() with pid 16294, sig 3 (SIGQUIT)
11/29 15:01:23 BOINC_BackfillMgr::killClient(): told BOINC starter (pid 16294) to hardkill
11/29 15:01:23 startd exiting because of fatal exception.
*** End of file StartLog



-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Questions about this message or Condor in general?
Email address of the local Condor administrator: ldas_admin_cit__AT__ligo.caltech.edu
The Official Condor Homepage is http://www.cs.wisc.edu/condor


-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Wed Nov 29 18:31:20 2006 (1164846680)
From: Derek Wright <wright__AT__cs.wisc.edu>
Subject: Re: [condor-support #1694] LIGO Ticket: Condor 6.8.2 x86_64    
 Upgrade [Fwd: [Condor] Problem]
Date: Wed, 29 Nov 2006 16:53:12 -0800
To: condor-support__AT__cs.wisc.edu


On Nov 29, 2006, at 4:31 PM, Stuart Anderson wrote:

> We now suspect this is because part of the fix is presumably in  
> daemoncore
> and Erik tested the static binary on his desktop machine (which  
> worked),
> but we installed the dynamically linked binary on the production  
> cluster
> which is then using the old (6.8.2) daemon core.

uhh, no. ;)

a) all condor libraries are always statically linked, no matter  
what.  the dynamic vs. static is just for the system libraries (libc,  
libm, etc).

b) the fix didn't touch daemoncore at all.  the only change i made  
was in src/condor_c++_util/killfamily.C (if you're following along at  
home).

> Is this plausible enough from what you know that we should try again
> with the static version, or is there likely to be a different problem?

i'm at a complete loss as to where this select() problem is coming  
from. :(  i didn't touch anything remotely related to this, so i have  
no idea what's going on with this.  sadly, select() doesn't tell you  
*which* fd is invalid when it returns with EBADF.

however, from looking at the CVS logs, i have a suspicion. ;)  lemme  
build you a new version with slightly older code and see if this  
clears up the SIGABRT w/o introducing the new badness...

all that said, it wouldn't hurt to give the static binaries a quick  
try, just in case. ;)

stay tuned for new binaries.

-derek



===========================================================================
Date mail was appended: Wed Nov 29 18:53:22 2006 (1164848003)
From: Derek Wright <wright__AT__cs.wisc.edu>
Subject: Re: [condor-support #1694] LIGO Ticket: Condor 6.8.2 x86_64    
 Upgrade [Fwd: [Condor] Problem]
Date: Wed, 29 Nov 2006 17:15:51 -0800
To: condor-support__AT__cs.wisc.edu


On Nov 29, 2006, at 4:31 PM, Stuart Anderson wrote:

> 11/29 15:01:23 ERROR "DaemonCore: select() returned an unexpected  
> error: 9 (Bad file descriptor)" at line 2410 in file daemon_core.C

ok, please try this:

ftp://ftp.cs.wisc.edu/condor/temporary/forligo/x86_64_rhel3/
condor-6.8.3-pre-x86_64_rhel3-2006-11-29_2.tar.gz

(note the "_2" near the end).

this is code from before some recent changes to DaemonCore, so it  
should be closer to the earlier 6.8.3 pre-release i gave you, but  
still with my fix for killfamily.C.

please let me know ASAP what kind of luck you have with it.

thanks, and sorry for the hassles!
-derek



===========================================================================
Date mail was appended: Wed Nov 29 19:16:00 2006 (1164849361)
Subject: Actions

Status changed from open to pending by wright
===========================================================================
Date of actions: Wed Nov 29 19:16:00 2006 (1164849362)
Date: Wed, 29 Nov 2006 17:30:57 -0800
From: "Erik A. Espinoza" <espinoza__AT__ligo.caltech.edu>
To: Stuart Anderson <anderson__AT__ligo.caltech.edu>
CC: condor-support response tracking system <condor-support__AT__cs.wisc.edu>,
 espinoza_e__AT__ligo.caltech.edu
Subject: Re: [condor-support #1694] LIGO Ticket: Condor 6.8.2 x86_64  
 Upgrade  [Fwd: [Condor] Problem]
X-Enigmail-Version: 0.94.1.0
Openpgp: url=http://pgp.mit.edu/
X-No-Archive: Yes
X-Archive: No

Hey Derek,

> We now suspect this is because part of the fix is presumably in daemoncore
> and Erik tested the static binary on his desktop machine (which worked),
> but we installed the dynamically linked binary on the production cluster
> which is then using the old (6.8.2) daemon core.

I tested the dynamically linked binary on frog and it worked as well.
The issue appears to be related to the job being run, and not strictly
limited to a dynamic vs. static thing.

Thanks,
-- 
Erik A. Espinoza
Systems Administrator
LIGO/Caltech - MS 18-34
Pasadena, CA 91125
Ph: 626-395-8517

===========================================================================
Date mail was appended: Wed Nov 29 19:31:32 2006 (1164850293)
Subject: Actions

Status changed from open to pending by wright
===========================================================================
Date of actions: Wed Nov 29 19:35:31 2006 (1164850531)
From: Derek Wright <wright__AT__cs.wisc.edu>
Subject: Re: [condor-support #1694] LIGO Ticket: Condor 6.8.2 x86_64    
 Upgrade [Fwd: [Condor] Problem]
Date: Wed, 29 Nov 2006 17:45:42 -0800
To: condor-support__AT__cs.wisc.edu


On Nov 29, 2006, at 5:31 PM, condor-support response tracking system  
wrote:

> I tested the dynamically linked binary on frog and it worked as well.

a) which binary? (the original one i sent today, or the "_2" version?)
b) "worked" meaning what? ;)  no ERROR from DaemonCore about select()?

> The issue appears to be related to the job being run, and not strictly
> limited to a dynamic vs. static thing.

this doesn't make much sense to me.  DaemonCore's select() loop  
shouldn't care at all as to whether a job is running or not.  you  
mean that you only see the error once a job starts running?  do you  
*always* see the error when a job is running?  do you only see it  
when the machine goes into the Backfill state?  or, perhaps the error  
seems to be a race condition that only happens occasionally, and  
there's no direct correlation to if jobs are running...

more data points needed, please. ;)

also, do you see the same errors with the "_2" version of the startd  
i recently posted or not?

thanks,
-derek



===========================================================================
Date mail was appended: Wed Nov 29 19:45:53 2006 (1164851153)
Date: Wed, 29 Nov 2006 19:31:15 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: "Erik A. Espinoza" <espinoza__AT__ligo.caltech.edu>
CC: condor-support response tracking system <condor-support__AT__cs.wisc.edu>,
 espinoza_e__AT__ligo.caltech.edu
Subject: Re: [condor-support #1694] LIGO Ticket: Condor 6.8.2 x86_64  
 Upgrade [Fwd: [Condor] Problem]

On Wed, Nov 29, 2006 at 05:30:57PM -0800, Erik A. Espinoza wrote:
> Hey Derek,
> 
> > We now suspect this is because part of the fix is presumably in daemoncore
> > and Erik tested the static binary on his desktop machine (which worked),
> > but we installed the dynamically linked binary on the production cluster
> > which is then using the old (6.8.2) daemon core.
> 
> I tested the dynamically linked binary on frog and it worked as well.
> The issue appears to be related to the job being run, and not strictly
> limited to a dynamic vs. static thing.

Derek,

For clarification, what "worked as well" means in this context is that
some simple condor jobs jobs where successfully run on a standalone
condor installation on a desktop machine of similar hardware/OS as the
production cluster. This probably means my wild guess about dynamic/static
linking being the cause of the first 6.8.3pre-release binary failures
was probably just wrong--further confirmed in your previous email regarding
what was changed in startd and what dynamic linking means in this context.

Note, the original startd SIGABRT problem has never occured on this
standalone machine so this does not indicate anything about whether
the original startd problem has been fixed, this was just an attempt
to figure out why the first 6.8.3pre1 binary from today was failing
on the large cluster with exit status 4.

We will now try to give the new 6.8.3pre2 startd binary a test.

Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Wed Nov 29 21:31:38 2006 (1164857499)
Date: Wed, 29 Nov 2006 21:19:51 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: espinoza__AT__ligo.caltech.edu, espinoza_e__AT__ligo.caltech.edu
Subject: Re: [condor-support #1694] LIGO Ticket: Condor 6.8.2 x86_64  
 Upgrade [Fwd: [Condor] Problem]

On Wed, Nov 29, 2006 at 07:16:00PM -0600, condor-support response tracking system wrote:
> 
> 
> On Nov 29, 2006, at 4:31 PM, Stuart Anderson wrote:
> 
> > 11/29 15:01:23 ERROR "DaemonCore: select() returned an unexpected  
> > error: 9 (Bad file descriptor)" at line 2410 in file daemon_core.C
> 
> ok, please try this:
> 
> ftp://ftp.cs.wisc.edu/condor/temporary/forligo/x86_64_rhel3/
> condor-6.8.3-pre-x86_64_rhel3-2006-11-29_2.tar.gz
> 
> (note the "_2" near the end).
> 
> this is code from before some recent changes to DaemonCore, so it  
> should be closer to the earlier 6.8.3 pre-release i gave you, but  
> still with my fix for killfamily.C.
> 
> please let me know ASAP what kind of luck you have with it.
> 

We now have 329 copies of this binary running. This version is able to start
up and run jobs. There have been no SIGABRT's in the first 45 minutes of
running. It is not completely clear how long we need to run with a null result
before declaring success, but we will keep you updated.

It is interesting to note that ~10% of the previously running condor_startd
processes (the original debug version from a Month ago) did SIGABRT at the
same time when condor_master restarted them. This happened synchronously as
the individual compute nodes detected the new binary on a shared filesystem
and tried to automatically restart condor_startd with the new bits. You may
be able to exploit this this on one of your test clusters to reproduce
the race condition that causes the SIGABRT and help provide evidence that
your patch is a sufficient fix.

Presumably we also serendipitously (and somewhat painfully from our end)
learned that there is some post-6.8.2 code that needs to _not_ be
included in 6.8.3.

Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Wed Nov 29 23:20:13 2006 (1164864014)
From: Derek Wright <wright__AT__cs.wisc.edu>
Subject: Re: [condor-support #1694] LIGO Ticket: Condor 6.8.2 x86_64    
 Upgrade [Fwd: [Condor] Problem]
Date: Wed, 29 Nov 2006 21:45:57 -0800
To: condor-support__AT__cs.wisc.edu


On Nov 29, 2006, at 9:20 PM, condor-support response tracking system  
wrote:

> We now have 329 copies of this binary running. This version is able  
> to start
> up and run jobs. There have been no SIGABRT's in the first 45  
> minutes of
> running. It is not completely clear how long we need to run with a  
> null result
> before declaring success, but we will keep you updated.

lovely, that's exactly what i was hoping you'd say. ;)  but yes,  
please keep me posted if you observe any more SIGABRT badness.

> It is interesting to note that ~10% of the previously running  
> condor_startd
> processes (the original debug version from a Month ago) did SIGABRT  
> at the
> same time when condor_master restarted them. This happened  
> synchronously as
> the individual compute nodes detected the new binary on a shared  
> filesystem
> and tried to automatically restart condor_startd with the new bits.

this makes sense.  the problem comes when a starter exits, and the  
startd is trying to clean up any possible traces of the process  
family.  there's code to see if all the pids are gone that attempts  
to reconstruct the process family tree of everything associated with  
the job.  the race condition comes if there are processes in the tree  
that haven't been reaped yet.  there's a part of the function that's  
trying to reconstruct the tree that checks if a given pid is still  
valid in the system.  if so, we try to find all of its descendents  
and add them into the tree.  meanwhile, if the pid gets reaped and is  
no longer on the system before we try to find all of its children,  
our internal library to find all the children returns failure.  the  
bug was that once the first check for the existence of the pid  
passes, some careless code assumed we'd find the children, and didn't  
check the return value for that step.  so, the function failed,  
didn't fill in any valid information about children, but the code  
blindly assumed we had a nice big array (of uninitialized memory)  
that represented other pids we needed to check on. :(  iterating over  
this array, adding all the "pids" to the family tree until we hit a 0  
(which for some reason, never came), caused the library to keep  
allocating more and more RAM (hence the enormous sizes for the core  
files) and is what eventually caused malloc() to fail which threw the  
SIGABRT.

so, when the condor_master saw the new binaries, it told all of your  
startds to shutdown.  that caused a bunch of starters to exit around  
the same time, and about 10% of them hit this race condition and  
triggered the bug.

the fix is simply to a) always check the return values, since a pid  
could potentially "disappear" at any time and b) initialize that  
array before we pass it off to the helper method (as added  
paranoia... we never actually touch that array if the function fails,  
but i wanted to be extra defensive just in case).

hopefully that answers any lingering questions you have about the  
nature of the bug, the fix, and under what conditions you'd expect to  
see the SIGABRTs.

> You may be able to exploit this this on one of your test clusters  
> to reproduce
> the race condition that causes the SIGABRT and help provide  
> evidence that
> your patch is a sufficient fix.

not exactly.  i think part of the problem is that we have a tiny  
handful of x86_64 boxes in our pool (only 2 as far as i know), and  
you've got a huge pile.  i suspect on other platforms, we're just  
getting lucky when we hit that uninitialized array, find a 0 right  
away and no harm is done.  for some reason, the version of gcc, or  
the nature of the kernel itself on the x86_64_rhas3/fc* boxes ensures  
we never see a 64 bit 0, and that causes all the badness.  so, the  
combo of needing to never hit a 0, and hitting this race of when the  
pids are reaped is rare enough that it only really showed up on ~10%  
of your large pool, whereas our tiny "pool" of this platform just  
wasn't giving us a big enough data set to see the bug. :(

> Presumably we also serendipitously (and somewhat painfully from our  
> end)
> learned that there is some post-6.8.2 code that needs to _not_ be
> included in 6.8.3.

to be clear, you're talking about the DaemonCore stuff now, not the  
killfamily.C fix, right? ;) i just want to make sure there's no  
miscommunication.

thanks again for all your pain, patience, and help.  if it's any  
consolation, it's been quite painful for me, too. ;)  i'd much rather  
be working on developing new features than debugging other people's  
code, but such is the life of a senior condor developer sometimes... ;)

cheers,
-derek



===========================================================================
Date mail was appended: Wed Nov 29 23:46:08 2006 (1164865569)
Subject: Actions

Status changed from open to pending by wright
===========================================================================
Date of actions: Wed Nov 29 23:46:08 2006 (1164865570)
From: Derek Wright <wright__AT__cs.wisc.edu>
Subject: Re: [condor-support #1694] LIGO Ticket: Condor 6.8.2 x86_64    
 Upgrade [Fwd: [Condor] Problem]
Date: Wed, 29 Nov 2006 22:09:34 -0800
To: condor-support__AT__cs.wisc.edu


On Nov 29, 2006, at 9:20 PM, Stuart Anderson wrote:

> Presumably we also serendipitously (and somewhat painfully from our  
> end)
> learned that there is some post-6.8.2 code that needs to _not_ be
> included in 6.8.3.

amazingly, todd is still awake over in madison, and replied that he's  
pretty sure he sees the problem (it was his change to DaemonCore that  
i thought was causing the select() errors).  he's planning to have it  
fixed for our builds tonight (and i'll commit my killfamily.C fix,  
too).  so, in theory, tomorrow morning's builds should solve all the  
known problems we've seen in this ticket.

did you see the select() error even on your small test pool?

would you be willing to run tomorrow's binaries and see if they  
continue to work?

that would be a big help for ensuring that our 6.8.3 release  
candidate is really stable.

just let me know.

thanks!
-derek



===========================================================================
Date mail was appended: Thu Nov 30  0:09:46 2006 (1164866990)
Date: Wed, 29 Nov 2006 22:10:37 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>,
 Todd Tannenbaum <tannenba__AT__cs.wisc.edu>
CC: espinoza__AT__ligo.caltech.edu, espinoza_e__AT__ligo.caltech.edu
Subject: Re: [condor-support #1694] LIGO Ticket: Condor 6.8.2 x86_64  
 Upgrade [Fwd: [Condor] Problem]

On Wed, Nov 29, 2006 at 11:46:08PM -0600, condor-support response tracking system wrote:
> 
> hopefully that answers any lingering questions you have about the  
> nature of the bug, the fix, and under what conditions you'd expect to  
> see the SIGABRTs.

Yes, thanks.

> > Presumably we also serendipitously (and somewhat painfully from our  
> > end)
> > learned that there is some post-6.8.2 code that needs to _not_ be
> > included in 6.8.3.
> 
> to be clear, you're talking about the DaemonCore stuff now, not the  
> killfamily.C fix, right? ;) i just want to make sure there's no  
> miscommunication.

Yes.

> thanks again for all your pain, patience, and help.  if it's any  
> consolation, it's been quite painful for me, too. ;)  i'd much rather  
> be working on developing new features than debugging other people's  
> code, but such is the life of a senior condor developer sometimes... ;)

Todd,
	It appears that this particular bug, and a few of the other
more nasty ones that we have run into at CIT have been more readily
exposed when stressing different condor daemons at startup/shutdown, e.g.,
the DedicatedScheduler bugs you where never able to reproduce after
we stopped crashing/restarting schedd for other reasons. If you do not
already, has the flight worthy team considered taking pot shots at a
flying condor pool to see if it crash lands or keeps flying, e.g.,
pkill -9 condor_schedd, touch condor_startd on all the execute machine
(if that is sufficient for condor_master to think there are new bits
to restart), /sbin/shutdown -r now on random machines, submit a ridiculous
number of jobs and/or dags and then remove them, ... just a few random
thoughts.

Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Thu Nov 30  0:14:50 2006 (1164867292)
Date: Wed, 29 Nov 2006 22:21:14 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>,
 Todd Tannenbaum <tannenba__AT__cs.wisc.edu>
CC: espinoza__AT__ligo.caltech.edu, espinoza_e__AT__ligo.caltech.edu
Subject: Re: [condor-support #1694] LIGO Ticket: Condor 6.8.2 x86_64  
 Upgrade [Fwd: [Condor] Problem]

On Thu, Nov 30, 2006 at 12:09:46AM -0600, condor-support response tracking system wrote:
> 
> 
> On Nov 29, 2006, at 9:20 PM, Stuart Anderson wrote:
> 
> > Presumably we also serendipitously (and somewhat painfully from our  
> > end)
> > learned that there is some post-6.8.2 code that needs to _not_ be
> > included in 6.8.3.
> 
> amazingly, todd is still awake over in madison, and replied that he's  
> pretty sure he sees the problem (it was his change to DaemonCore that  
> i thought was causing the select() errors).  he's planning to have it  
> fixed for our builds tonight (and i'll commit my killfamily.C fix,  
> too).  so, in theory, tomorrow morning's builds should solve all the  
> known problems we've seen in this ticket.
> 
> did you see the select() error even on your small test pool?

No, that was the cause of the red-herring with respect to static/symbolic
linked versions. A few small test jobs on a standalone system did not
have select() errors. However, on the larger cluster, I believe all
(or at least a very large fraction) of the 329 execute machines where
looping over select error crash/restart by condor_master.

What I first noticed was that the schedd was trying to connect to
ports on the execute machine that nobody was listening too, e.g.,

11/29 15:19:58 (pid:1369) attempt to connect to <10.14.1.228:42636> failed: Connection refused (connect errno = 111).
11/29 15:19:58 (pid:1369) attempt to connect to <10.14.1.154:52532> failed: Connection refused (connect errno = 111).
11/29 15:19:58 (pid:1369) attempt to connect to <10.14.1.124:34482> failed: Connection refused (connect errno = 111).

when I logged into those machine and ran lsof -i :XYZ there was nobody
listening, yet there was a condor_startd running but listening on a different
port.


> 
> would you be willing to run tomorrow's binaries and see if they  
> continue to work?
> 
> that would be a big help for ensuring that our 6.8.3 release  
> candidate is really stable.
> 
> just let me know.
> 

Go ahead and send me a pointer when it is ready and I will take a look
at the tea leaves tomorrow and decide.

Thanks.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Thu Nov 30  0:22:53 2006 (1164867773)
From: Derek Wright <wright__AT__cs.wisc.edu>
Subject: Re: [condor-support #1694] LIGO Ticket: Condor 6.8.2 x86_64    
 Upgrade [Fwd: [Condor] Problem]
Date: Wed, 29 Nov 2006 22:34:24 -0800
To: condor-support__AT__cs.wisc.edu

On Nov 29, 2006, at 10:22 PM, Stuart Anderson wrote:

> A few small test jobs on a standalone system did not
> have select() errors. However, on the larger cluster, I believe all
> (or at least a very large fraction) of the 329 execute machines where
> looping over select error crash/restart by condor_master.

duly noted.

> Go ahead and send me a pointer when it is ready and I will take a look
> at the tea leaves tomorrow and decide.

ok, cool.  will do.

keep in mind that the 2006-11-29_2 binaries you're currently running  
have some known bugs that have since been fixed in the V6_8-branch  
(the code is really from 2006-11-20, except for my fix to  
killfamily.C).  so, assuming todd's fix to his previous fix works,  
the binaries from tomorrow should be better code than what you have  
now... ;)  hopefully that'll help the tea leaves point in a good  
direction.

thanks,
-derek



===========================================================================
Date mail was appended: Thu Nov 30  0:34:33 2006 (1164868473)
From: Derek Wright <wright__AT__cs.wisc.edu>
Subject: Re: [condor-support #1694] LIGO Ticket: Condor 6.8.2 x86_64    
 Upgrade [Fwd: [Condor] Problem]
Date: Thu, 30 Nov 2006 13:30:44 -0800
To: condor-support__AT__cs.wisc.edu


On Nov 29, 2006, at 10:22 PM, Stuart Anderson wrote:

> Go ahead and send me a pointer when it is ready

ftp://ftp.cs.wisc.edu/condor/temporary/forligo/x86_64_rhel3/
condor-6.8.3-linux-x86_64-rhel3-dynamic-unstripped.tar.gz

that's a complete set of unstripped, dynamic binaries from this  
morning's builds.

i reviewed todd's fix to DaemonCore last night and it definitely  
should fix the kinds of errors you were seeing with select(), based  
on the other stuff going on in the log file at that time.  i have  
high hopes. ;)

cheers,
-derek



===========================================================================
Date mail was appended: Thu Nov 30 15:30:54 2006 (1164922255)
Subject: Actions

Status changed from open to pending by wright
===========================================================================
Date of actions: Thu Nov 30 15:30:54 2006 (1164922256)
Date: Thu, 30 Nov 2006 13:43:17 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: espinoza__AT__ligo.caltech.edu, espinoza_e__AT__ligo.caltech.edu
Subject: Re: [condor-support #1694] LIGO Ticket: Condor 6.8.2 x86_64  
 Upgrade [Fwd: [Condor] Problem]

On Thu, Nov 30, 2006 at 03:30:54PM -0600, condor-support response tracking system wrote:
> 
> 
> On Nov 29, 2006, at 10:22 PM, Stuart Anderson wrote:
> 
> > Go ahead and send me a pointer when it is ready
> 
> ftp://ftp.cs.wisc.edu/condor/temporary/forligo/x86_64_rhel3/
> condor-6.8.3-linux-x86_64-rhel3-dynamic-unstripped.tar.gz
> 
> that's a complete set of unstripped, dynamic binaries from this  
> morning's builds.
> 
> i reviewed todd's fix to DaemonCore last night and it definitely  
> should fix the kinds of errors you were seeing with select(), based  
> on the other stuff going on in the log file at that time.  i have  
> high hopes. ;)

OK. I will give this version a try.

We successfully completed 17647 jobs with the 2006-11-29_2 condor_startd
binary since last night without any core dumps yet:

$ condor_history -completedsince '11/29/2006 20:42' | wc -l; /bin/date
17647
Thu Nov 30 13:41:24 PST 2006

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Thu Nov 30 15:43:37 2006 (1164923017)
Subject: Actions

Status changed from open to pending by wright
===========================================================================
Date of actions: Thu Nov 30 15:45:25 2006 (1164923125)
Date: Thu, 30 Nov 2006 15:25:38 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: espinoza__AT__ligo.caltech.edu, espinoza_e__AT__ligo.caltech.edu
Subject: Re: [condor-support #1694] LIGO Ticket: Condor 6.8.2 x86_64  
 Upgrade [Fwd: [Condor] Problem]

On Thu, Nov 30, 2006 at 01:43:17PM -0800, Stuart Anderson wrote:
> On Thu, Nov 30, 2006 at 03:30:54PM -0600, condor-support response tracking system wrote:
> > 
> > 
> > On Nov 29, 2006, at 10:22 PM, Stuart Anderson wrote:
> > 
> > > Go ahead and send me a pointer when it is ready
> > 
> > ftp://ftp.cs.wisc.edu/condor/temporary/forligo/x86_64_rhel3/
> > condor-6.8.3-linux-x86_64-rhel3-dynamic-unstripped.tar.gz
> > 
> > that's a complete set of unstripped, dynamic binaries from this  
> > morning's builds.
> > 
> > i reviewed todd's fix to DaemonCore last night and it definitely  
> > should fix the kinds of errors you were seeing with select(), based  
> > on the other stuff going on in the log file at that time.  i have  
> > high hopes. ;)
> 
> OK. I will give this version a try.
> 
> We successfully completed 17647 jobs with the 2006-11-29_2 condor_startd
> binary since last night without any core dumps yet:
> 
> $ condor_history -completedsince '11/29/2006 20:42' | wc -l; /bin/date
> 17647
> Thu Nov 30 13:41:24 PST 2006
> 

This new condor_startd binary is now running (150 jobs and counting). However,
there is a new problem. The "2006-11-29_2 condor_startd" that was previously
running flawlessly since last night generated a segfault on ~90% of our nodes
when it was shutdown by condor_master as it detected the new 11-30 bits on
a common shared filesystem. This is clearly a different problem since these
core images are only ~2.6MB (rather than 800MB). An example core dump image
may be found at:
http://www.ligo.caltech.edu/~anderson/condor.1694/core.712


Here is the stack trace from gdb:

# gdb /ldcg/condor/sbin/condor_startd.6.8.3pre2 core.712 
GNU gdb Red Hat Linux (6.3.0.0-1.84rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...Using host libthread_db library "/lib64/libthread_db.so.1".


warning: core file may not match specified executable file.
Core was generated by `condor_startd -f'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /lib64/libcrypt.so.1...done.
Loaded symbols for /lib64/libcrypt.so.1
Reading symbols from /lib64/libdl.so.2...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libresolv.so.2...done.
Loaded symbols for /lib64/libresolv.so.2
Reading symbols from /usr/lib64/libstdc++.so.5...done.
Loaded symbols for /usr/lib64/libstdc++.so.5
Reading symbols from /lib64/libm.so.6...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /lib64/libgcc_s.so.1...done.
Loaded symbols for /lib64/libgcc_s.so.1
Reading symbols from /lib64/libc.so.6...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib64/libnss_files.so.2...done.
Loaded symbols for /lib64/libnss_files.so.2
#0  0x000000000051df9a in Item (this=Cannot access memory at address 0x910164c8
) at list.h:229
229     list.h: No such file or directory.
        in list.h
(gdb) where
#0  0x000000000051df9a in Item (this=Cannot access memory at address 0x910164c8
) at list.h:229
Cannot access memory at address 0x910164d0




#0  0x000000000051df9a in Item (this=Cannot access memory at address 0x9f67e938
) at list.h:229
229     list.h: No such file or directory.
        in list.h
(gdb) where
#0  0x000000000051df9a in Item (this=Cannot access memory at address 0x9f67e938
) at list.h:229
Cannot access memory at address 0x9f67e940


-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Thu Nov 30 17:26:03 2006 (1164929164)
From: Derek Wright <wright__AT__cs.wisc.edu>
Subject: Re: [condor-support #1694] LIGO Ticket: Condor 6.8.2 x86_64    
 Upgrade [Fwd: [Condor] Problem]
Date: Thu, 30 Nov 2006 15:45:12 -0800
To: condor-support__AT__cs.wisc.edu


On Nov 30, 2006, at 3:26 PM, condor-support response tracking system  
wrote:

> However,
> there is a new problem. The "2006-11-29_2 condor_startd" that was  
> previously
> running flawlessly since last night generated a segfault on ~90% of  
> our nodes
> when it was shutdown by condor_master

my first impression is that this is due to the somewhat crazy mix of  
code i put into that release snapshot, given that i had to back out a  
bunch of changes.  seg faults like this are often caused by  
incomplete rebuilds, and i fear that in my haste (and exhaustion), i  
only partly rebuilt the source tree, and that's what caused the  
error.  also, as i said, those binaries contain some known bugs that  
were already fixed (in the 2006-11-30 binaries), so i'm not too  
interested in investing a lot of time trying to debug them.

the binaries from today are a totally clean rebuild from our nightly  
build system, with all the latest bug fixes, so they shouldn't have  
this problem.  if you can stand it, try touching the condor_startd  
binary and watching the pool restart itself, from the 2006-11-30  
binaries, and see if you hit any seg faults.  if so, i'd be pretty  
surprised.

thanks,
-derek



===========================================================================
Date mail was appended: Thu Nov 30 17:45:22 2006 (1164930322)
Subject: Actions

Status changed from open to pending by wright
===========================================================================
Date of actions: Thu Nov 30 17:45:22 2006 (1164930324)
Date: Thu, 30 Nov 2006 20:35:33 -0800
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-support response tracking system <condor-support__AT__cs.wisc.edu>
CC: espinoza__AT__ligo.caltech.edu, espinoza_e__AT__ligo.caltech.edu
Subject: Re: [condor-support #1694] LIGO Ticket: Condor 6.8.2 x86_64  
 Upgrade [Fwd: [Condor] Problem]

On Thu, Nov 30, 2006 at 05:45:22PM -0600, condor-support response tracking system wrote:
> 
> 
> On Nov 30, 2006, at 3:26 PM, condor-support response tracking system  
> wrote:
> 
> > However,
> > there is a new problem. The "2006-11-29_2 condor_startd" that was  
> > previously
> > running flawlessly since last night generated a segfault on ~90% of  
> > our nodes
> > when it was shutdown by condor_master
> 
> my first impression is that this is due to the somewhat crazy mix of  
> code i put into that release snapshot, given that i had to back out a  
> bunch of changes.  seg faults like this are often caused by  
> incomplete rebuilds, and i fear that in my haste (and exhaustion), i  
> only partly rebuilt the source tree, and that's what caused the  
> error.  also, as i said, those binaries contain some known bugs that  
> were already fixed (in the 2006-11-30 binaries), so i'm not too  
> interested in investing a lot of time trying to debug them.

Good call.

> 
> the binaries from today are a totally clean rebuild from our nightly  
> build system, with all the latest bug fixes, so they shouldn't have  
> this problem.  if you can stand it, try touching the condor_startd  
> binary and watching the pool restart itself, from the 2006-11-30  
> binaries, and see if you hit any seg faults.  if so, i'd be pretty  
> surprised.

You are right. I ran a small controlled test on 10 nodes, and rather
my expection of 9 core dumps there where 0 when condor_master restarted
the 2006-22-30 condor_startd processes with the same bits.

I will let you know sooner if someting goes wrong, otherwise, I will
wait until Monday to declare complete success from my perspective and
we will upgrade when 6.8.3 is released.

Many thanks for tracking this down to the bitter end.

-- 
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson

===========================================================================
Date mail was appended: Thu Nov 30 22:36:07 2006 (1164947767)
From: Derek Wright <wright__AT__cs.wisc.edu>
Subject: Re: [condor-support #1694] LIGO Ticket: Condor 6.8.2 x86_64    
 Upgrade [Fwd: [Condor] Problem]
Date: Thu, 30 Nov 2006 23:47:01 -0800
To: condor-support__AT__cs.wisc.edu


On Nov 30, 2006, at 8:36 PM, Stuart Anderson wrote:

> I will let you know sooner if someting goes wrong, otherwise, I will
> wait until Monday to declare complete success from my perspective and
> we will upgrade when 6.8.3 is released.

excellent.  once again, this is exactly what i wanted to hear.

> Many thanks for tracking this down to the bitter end

many thanks for being the guinea pig for the 6.8.3 stable release.  ;)

i'm hereby resolving this ticket.  feel free to reopen if anything  
surfaces, but if i don't see any more replies in this thread, i'll be  
happy knowing that all's well.

cheers,
-derek



===========================================================================
Date mail was appended: Fri Dec  1  1:47:11 2006 (1164959231)
Subject: Actions

Ticket resolved by wright
===========================================================================
Date of actions: Fri Dec  1  1:47:11 2006 (1164959234)