LIGO Support Ticket 18390

Ticket Information
  Number:      admin 18390
  User:        carsten.aulbert@aei.mpg.de
  Email:       anderson__AT__ligo.caltech.edu,skoranda__AT__gravity.phys.uwm.edu
  Status:      open
  Assigned To: nleroy
Date: Fri, 29 Aug 2008 20:46:32 +0200
From: Carsten Aulbert <carsten.aulbert__AT__aei.mpg.de>
To: condor-admin__AT__cs.wisc.edu
CC: Stuart Anderson <anderson__AT__ligo.caltech.edu>,        Scott Koranda
 <skoranda__AT__gravity.phys.uwm.edu>
Subject: LIGO: Accountant information badly skewed
X-Enigmail-Version: 0.95.0
X-Seen-BY: mailfromd 4.1 silica.cs.wisc.edu

Hi,

as indicated in the telecon today, we ran a - possibly badly configured - HA setup between two nodes when his happened. At some point this happened:

condor_userprio: "Can't find address for negotiator"

and possibly related the following situation arose.

h2[3038]:~% condor_userprio -all -allusers
Last Priority Update:  8/29 20:39
                                    Effective   Real     Priority   Res   Accumulated       Usage            Last
User Name                           Priority  Priority    Factor    Used  Usage (hrs)    Start Time       Usage Time
------------------------------      --------- -------- ------------ ----  ----------- ---------------- ----------------
ballen__AT__atlas.local                      50.00     0.50       100.00    0       168.94  4/08/2008 23:11  4/08/2008 23:50
chrism__AT__atlas.local                      50.00     0.50       100.00    0         1.70  5/07/2008 19:58  5/07/2008 20:13
miroslav__AT__atlas.local                    50.00     0.50       100.00    0        16.45  4/06/2008 14:31  4/11/2008 18:56
fehrmann__AT__atlas.local                    50.00     0.50       100.00    0         9.90  5/06/2008 16:24  5/06/2008 16:46
carsten__AT__atlas.local                     50.00     0.50       100.00    0      1189.61  4/09/2008 20:34  4/11/2008 17:48
kwwette__AT__atlas.local                    164.86     1.65       100.00    9        45.26  8/29/2008 15:37  8/29/2008 20:39
christian__AT__atlas.local                  397.14     3.97       100.00    4     13438.05  5/27/2008 18:45  8/29/2008 20:39
hpletsch__AT__atlas.local                   411.54     4.12       100.00    0     19452.60  4/11/2008 15:33  8/29/2008 18:10
bose__AT__atlas.local                       500.00     0.50      1000.00    0       511.32  4/11/2008 18:18  4/15/2008 00:05
rubab__AT__atlas.local                      500.00     0.50      1000.00    0         0.00  1/01/1970 01:00        ???
ajith__AT__atlas.local                      500.00     0.50      1000.00    0     73891.45  6/26/2008 15:41  7/15/2008 15:43
chmess__AT__atlas.local                     690.48     6.90       100.00    0       509.13  8/28/2008 14:35  8/29/2008 13:37
dfazi__AT__atlas.local                     1275.92     1.28      1000.00    0   -202281.96  6/26/2008 15:41  8/29/2008 05:39
sfairhur__AT__atlas.local                  6978.91     6.98      1000.00    8    206069.49  4/11/2008 15:28  8/29/2008 20:39
benk__AT__atlas.local                     10517.92   105.18       100.00  283     22306.49  5/06/2008 15:12  8/29/2008 20:39
igor__AT__atlas.local                     12493.31    12.49      1000.00    0      5113.03  8/21/2008 13:37  8/28/2008 14:35
lucia__AT__atlas.local                   208067.61   208.07      1000.00 3160      7445.06  8/29/2008 17:55  8/29/2008 20:39
spxiwh__AT__atlas.local                  219876.44   219.88      1000.00    0     78972.76  8/21/2008 13:37  8/29/2008 15:43
volodya__AT__atlas.local                 480233.00   480.23      1000.00 1307     25203.58  8/21/2008 13:37  8/29/2008 20:39
stas__AT__atlas.local                    639320.44   639.32      1000.00  522    537400.53  5/27/2008 18:45  8/29/2008 20:39
pau__AT__atlas.local                     669049.44   669.05      1000.00   49    103649.87  8/21/2008 13:37  8/29/2008 20:39
------------------------------      --------- -------- ------------ ----  ----------- ---------------- ----------------
Number of users: 21                                                 5342    893113.24  4/11/2008 15:28        ???

On the other node it looked normal at the time, i.e. no negative numbers and about 3.6m CPU-hours more. In the mean time we went back to a set-up 
with only a single condor head node. Sorry for being very vague about what might have happened.

Cheers

Carsten

PS: Sorry for the long lines, but this way the table looks nicer ;) 

-- 
Dr. Carsten Aulbert - Max Planck Institute for Gravitational Physics
Callinstrasse 38, 30167 Hannover, Germany
Phone/Fax: +49 511 762-17185 / -17193
http://www.top500.org/system/9234 | http://www.top500.org/connfam/6/list/31 

===========================================================================
Date of creation: Fri Aug 29 13:46:42 2008 (1220035604)
Subject: Actions

Assigned to nleroy by gthain
===========================================================================
Date of actions: Fri Aug 29 20:20:29 2008 (1220059229)
From: Nick LeRoy <nleroy__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #18390] LIGO: Accountant information badly skewed
Date: Tue, 2 Sep 2008 13:22:59 -0500

> Hi,
Hello Carsten,

> as indicated in the telecon today, we ran a - possibly badly configured -
> HA setup between two nodes when his happened. At some point this happened:
>
> condor_userprio: "Can't find address for negotiator"
>
> and possibly related the following situation arose.
>
> h2[3038]:~% condor_userprio -all -allusers
> Last Priority Update:  8/29 20:39

> On the other node it looked normal at the time, i.e. no negative numbers
> and about 3.6m CPU-hours more. In the mean time we went back to a set-up
> with only a single condor head node. Sorry for being very vague about what
> might have happened.

Are you running the replication service of HAD?  If so, it seems that the 
replication did something bad...

In any case, can you send me the HAD related logs from the time of the event 
(from both systems)?

Thanks

-Nick

-- 
           <<< Follow the white rabbit. >>>
 /`-_    Nicholas R. LeRoy               The Condor Project
{     }/ http://www.cs.wisc.edu/~nleroy  http://www.cs.wisc.edu/condor
 \    /  nleroy__AT__cs.wisc.edu              The University of Wisconsin
 |_*_|   608-265-5761                    Department of Computer Sciences

===========================================================================
Date mail was appended: Tue Sep  2 13:22:41 2008 (1220379761)
Subject: Actions

Status changed from open to pending by nleroy
===========================================================================
Date of actions: Tue Sep  2 13:22:41 2008 (1220379763)
Date: Tue, 02 Sep 2008 20:31:26 +0200
From: Carsten Aulbert <carsten.aulbert__AT__aei.mpg.de>
To: condor-admin__AT__cs.wisc.edu
CC: anderson__AT__ligo.caltech.edu, skoranda__AT__gravity.phys.uwm.edu
Subject: Re: [condor-admin #18390] LIGO: Accountant information badly skewed
X-Enigmail-Version: 0.95.0
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu

Hi Nick (et al)

condor-admin response tracking system wrote:
> 
> Are you running the replication service of HAD?  If so, it seems that the 
> replication did something bad...
> 
The relevant excerpt from our config file:
HAD_PORT = 51450
HAD_ARGS = -p \$(HAD_PORT)
HAD_LIST =
\$(CENTRAL_MANAGER1):\$(HAD_PORT),\$(CENTRAL_MANAGER2):\$(HAD_PORT)
HAD_CONNECTION_TIMEOUT = 4
HAD_USE_PRIMARY = TRUE
HOSTALLOW_ADMINISTRATOR = \$(COLLECTOR_HOST)
HAD = \$(SBIN)/condor_had
HOSTALLOW_NEGOTIATOR = \$(COLLECTOR_HOST)
HOSTALLOW_NEGOTIATOR_SCHEDD = \$(COLLECTOR_HOST)
DAEMON_LIST = MASTER,COLLECTOR,NEGOTIATOR,HAD,SCHEDD
MAX_HAD_LOG = 640000
HAD_DEBUG = D_COMMAND
MAX_JOBS_RUNNING = 10000
HAD_LOG = \$(LOG)/HADLog

That's all I can spot here.

> In any case, can you send me the HAD related logs from the time of the event 
> (from both systems)?

The system where the numbers are bad has only a very small log file size
and I don't have those logs, sorry about this. This is from the other
machine, where the data looked ok. This is the only "unusual" stuff I
can spot around the time when this happened. 10.20.30.1 is
CENTRAL_MANAGER1 (h1) and 10.20.30.2 is CENTRAL_MANAGER2 (h2) - I'm not
aware that condor was restarted on either node at that time.

8/21 13:03:36 HADStateMachine::updateCollectors 2 successful updates
8/21 13:08:36 HADStateMachine::updateCollectors 2 successful updates
8/21 13:13:36 HADStateMachine::updateCollectors 2 successful updates
8/21 13:18:36 HADStateMachine::updateCollectors 2 successful updates
8/21 13:23:36 HADStateMachine::updateCollectors 2 successful updates
8/21 13:28:36 HADStateMachine::updateCollectors 2 successful updates
8/21 13:33:36 HADStateMachine::updateCollectors 2 successful updates
8/21 13:34:40 attempt to connect to <10.20.30.2:51450> failed:
Connection refused (connect errno = 111).
8/21 13:34:40 cannot connect to addr <10.20.30.2:51450>
8/21 13:34:57 attempt to connect to <10.20.30.2:51450> failed:
Connection refused (connect errno = 111).
8/21 13:34:57 cannot connect to addr <10.20.30.2:51450>
8/21 13:35:14 attempt to connect to <10.20.30.2:51450> failed:
Connection refused (connect errno = 111).
8/21 13:35:14 cannot connect to addr <10.20.30.2:51450>
8/21 13:35:31 attempt to connect to <10.20.30.2:51450> failed:
Connection refused (connect errno = 111).
8/21 13:35:31 cannot connect to addr <10.20.30.2:51450>
8/21 13:35:48 attempt to connect to <10.20.30.2:51450> failed:
Connection refused (connect errno = 111).
8/21 13:35:48 cannot connect to addr <10.20.30.2:51450>
8/21 13:36:05 attempt to connect to <10.20.30.2:51450> failed:
Connection refused (connect errno = 111).
8/21 13:36:05 cannot connect to addr <10.20.30.2:51450>
8/21 13:36:22 attempt to connect to <10.20.30.2:51450> failed:
Connection refused (connect errno = 111).
8/21 13:36:22 cannot connect to addr <10.20.30.2:51450>
8/21 13:36:39 attempt to connect to <10.20.30.2:51450> failed:
Connection refused (connect errno = 111).
8/21 13:36:39 cannot connect to addr <10.20.30.2:51450>
8/21 13:36:56 attempt to connect to <10.20.30.2:51450> failed:
Connection refused (connect errno = 111).
8/21 13:36:56 cannot connect to addr <10.20.30.2:51450>
8/21 13:37:13 attempt to connect to <10.20.30.2:51450> failed:
Connection refused (connect errno = 111).
8/21 13:37:13 cannot connect to addr <10.20.30.2:51450>
8/21 13:37:30 DaemonCore: Command received via UDP from host
<10.20.30.2:48999>, access level ALLOW
8/21 13:37:30 DaemonCore: received command 60014 (DC_INVALIDATE_KEY),
calling handler (handle_invalidate_key())
8/21 13:37:30 Calling HandleReq <handle_invalidate_key()> (0)
8/21 13:37:30 Return from HandleReq <handle_invalidate_key()>
8/21 13:38:36 HADStateMachine::updateCollectors 2 successful updates
8/21 13:38:36 DaemonCore: Command received via UDP from host
<10.20.30.2:35438>, access level ALLOW
8/21 13:38:36 DaemonCore: received command 60014 (DC_INVALIDATE_KEY),
calling handler (handle_invalidate_key())
8/21 13:38:36 Calling HandleReq <handle_invalidate_key()> (0)
8/21 13:38:36 Return from HandleReq <handle_invalidate_key()>
8/21 13:43:36 HADStateMachine::updateCollectors 2 successful updates
8/21 13:43:36 Calling Handler <<10.20.30.2:9618>>
8/21 13:43:36 Return from Handler <<10.20.30.2:9618>>
8/21 13:48:36 HADStateMachine::updateCollectors 2 successful updates
8/21 13:53:36 HADStateMachine::updateCollectors 2 successful updates
8/21 13:58:36 HADStateMachine::updateCollectors 2 successful updates
8/21 14:03:36 HADStateMachine::updateCollectors 2 successful updates
8/21 14:08:36 HADStateMachine::updateCollectors 2 successful updates
8/21 14:13:36 HADStateMachine::updateCollectors 2 successful updates
8/21 14:18:36 HADStateMachine::updateCollectors 2 successful updates
8/21 14:23:36 HADStateMachine::updateCollectors 2 successful updates
8/21 14:28:36 HADStateMachine::updateCollectors 2 successful updates
8/21 14:33:36 HADStateMachine::updateCollectors 2 successful updates

Sorry that I don't have more

Cheers

Carsten

-- 
Dr. Carsten Aulbert - Max Planck Institute for Gravitational Physics
Callinstrasse 38, 30167 Hannover, Germany
Phone/Fax: +49 511 762-17185 / -17193
http://www.top500.org/system/9234 | http://www.top500.org/connfam/6/list/31

===========================================================================
Date mail was appended: Tue Sep  2 13:31:40 2008 (1220380300)
From: Nick LeRoy <nleroy__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #18390] LIGO: Accountant information badly skewed
Date: Tue, 2 Sep 2008 13:45:10 -0500

On Tue September 2 2008, condor-admin response tracking system wrote:
> Hi Nick (et al)

Hello again,

> The system where the numbers are bad has only a very small log file size
> and I don't have those logs, sorry about this. This is from the other
> machine, where the data looked ok. This is the only "unusual" stuff I
> can spot around the time when this happened. 10.20.30.1 is
> CENTRAL_MANAGER1 (h1) and 10.20.30.2 is CENTRAL_MANAGER2 (h2) - I'm not
> aware that condor was restarted on either node at that time.

Just want to make sure that I have everything clear:

1. h1 is the primary CM
2. the log snippet you sent me is from h1
3. the accountant data on h1 is fine
4. h2 is the secondary CM
5. the accountant data on h2 is corrupt

Is that all correct?  Do you have relevant log snippets from either of the 
MasterLog files or NegotiatorLog files?

Thanks

-Nick

-- 
           <<< The answer is out there, Neo. >>>
 /`-_    Nicholas R. LeRoy               The Condor Project
{     }/ http://www.cs.wisc.edu/~nleroy  http://www.cs.wisc.edu/condor
 \    /  nleroy__AT__cs.wisc.edu              The University of Wisconsin
 |_*_|   608-265-5761                    Department of Computer Sciences

===========================================================================
Date mail was appended: Tue Sep  2 13:44:52 2008 (1220381092)
Date: Tue, 02 Sep 2008 21:08:16 +0200
From: Carsten Aulbert <carsten.aulbert__AT__aei.mpg.de>
To: condor-admin__AT__cs.wisc.edu
CC: anderson__AT__ligo.caltech.edu, skoranda__AT__gravity.phys.uwm.edu
Subject: Re: [condor-admin #18390] LIGO: Accountant information badly skewed
X-Enigmail-Version: 0.95.0
X-Seen-BY: mailfromd 4.1 granite.cs.wisc.edu



condor-admin response tracking system wrote:
> 
> 1. h1 is the primary CM
> 2. the log snippet you sent me is from h1
> 3. the accountant data on h1 is fine
> 4. h2 is the secondary CM
> 5. the accountant data on h2 is corrupt
> 
Almost yes to all of them. Both h1 and h2 ran with the same settings at
that time and if I recall correctly h2 was the first machine to be
started at that time.

> Is that all correct?  Do you have relevant log snippets from either of the 
> MasterLog files or NegotiatorLog files?

Let me see. This will be a bit longer since I do not know 100% sure when
the problem happened. The first report by a user was on 2008-08-21. I
think that subsequently condor was restarted on the head nodes to fix
the initial error of

condor_userprio: "Can't find address for negotiator"

(of course a futile attempt since the problem was that the negotiator
was running on h1 at that time and bound to the wrong NIC):

MasterLog (h2)
8/19 16:44:59 Preen pid is 22056
8/19 16:45:14 Child 22056 died, but not a daemon -- Ignored
8/20 16:45:00 Preen pid is 11479
8/20 16:45:09 Child 11479 died, but not a daemon -- Ignored
8/21 13:34:40 Got SIGQUIT.  Performing fast shutdown.
8/21 13:34:40 Sent SIGQUIT to COLLECTOR (pid 7475)
8/21 13:34:40 Sent SIGQUIT to HAD (pid 7483)
8/21 13:34:40 Sent SIGQUIT to SCHEDD (pid 7484)
8/21 13:34:40 Handling daemon-specific command for "NEGOTIATOR"
8/21 13:34:40 The HAD (pid 7483) exited with status 0
8/21 13:34:40 The COLLECTOR (pid 7475) exited with status 0
8/21 13:35:05 ******************************************************
8/21 13:35:05 ** condor_master (CONDOR_MASTER) STARTING UP
8/21 13:35:05 ** /scr/opt/condor-7.0.1/sbin/condor_master
8/21 13:35:05 ** $CondorVersion: 7.0.1 Mar 21 2008 $
8/21 13:35:05 ** $CondorPlatform: X86_64-LINUX_DEBIAN40 $
8/21 13:35:05 ** PID = 29709
8/21 13:35:05 ** Log last touched 8/21 13:34:40
8/21 13:35:05 ******************************************************
8/21 13:35:05 Using config source: /opt/condor/etc/condor_config
8/21 13:35:05 Using local config sources:
8/21 13:35:05    /etc/default/condor|
8/21 13:35:05 FileLock::obtain(1) failed - errno 11 (Resource
temporarily unavailable)
8/21 13:35:05 ERROR "Can't get lock on "/tmp//InstanceLock"" at line 848
in file master.C
8/21 13:36:42 The SCHEDD (pid 7484) exited with status 1
8/21 13:36:42 All daemons are gone.  Exiting.
8/21 13:36:42 **** condor_master (condor_MASTER) EXITING WITH STATUS 0
8/21 13:37:23 ******************************************************
8/21 13:37:23 ** condor_master (CONDOR_MASTER) STARTING UP
8/21 13:37:23 ** /scr/opt/condor-7.0.1/sbin/condor_master
8/21 13:37:23 ** $CondorVersion: 7.0.1 Mar 21 2008 $
8/21 13:37:23 ** $CondorPlatform: X86_64-LINUX_DEBIAN40 $
8/21 13:37:23 ** PID = 30300
8/21 13:37:23 ** Log last touched 8/21 13:36:42
8/21 13:37:23 ******************************************************
8/21 13:37:23 Using config source: /opt/condor/etc/condor_config
8/21 13:37:23 Using local config sources:
8/21 13:37:23    /etc/default/condor|
8/21 13:37:23 DaemonCore: Command Socket at <10.20.30.2:58489>
8/21 13:37:23 Started DaemonCore process
"/opt/condor/sbin/condor_collector", pid and pgroup = 30301
8/21 13:37:26 Started DaemonCore process
"/opt/condor/sbin/condor_negotiator", pid and pgroup = 30309
8/21 13:37:26 Started DaemonCore process "/opt/condor/sbin/condor_had",
pid and pgroup = 30310
8/21 13:37:26 Started DaemonCore process
"/opt/condor/sbin/condor_schedd", pid and pgroup = 30311
8/21 13:37:27 Handling daemon-specific command for "NEGOTIATOR"
8/21 13:37:27 Sent SIGQUIT to NEGOTIATOR (pid 30309)
8/21 13:37:27 The NEGOTIATOR (pid 30309) exited with status 0
8/21 14:37:26 Preen pid is 21638
8/21 14:37:32 Child 21638 died, but not a daemon -- Ignored

MasterLog (h1)
8/19 17:17:51 Preen pid is 22043
8/19 17:17:51 Child 22043 died, but not a daemon -- Ignored
8/20 17:17:51 Preen pid is 26267
8/20 17:17:51 Child 26267 died, but not a daemon -- Ignored
8/21 17:17:51 Preen pid is 21081
8/21 17:17:51 Child 21081 died, but not a daemon -- Ignored


NegotiatorLog (h1)
This one is quite big:

h1# grep -c '^8/21' NegotiatorLog
25416

First glance does not look strange. I can pack this together if you want
all the gory details.

Since h2 is still running Condor, the oldest logs are from the 28th and
I don't have the log files in the backup (which is now on my todo list).

Sorry for not having thought about that earlier.

Thanks for looking into this

Carsten

===========================================================================
Date mail was appended: Tue Sep  2 14:08:42 2008 (1220382522)