LIGO Support Ticket 18390
Ticket Information
Number: admin 18390
User: carsten.aulbert@aei.mpg.de
Email: anderson__AT__ligo.caltech.edu,skoranda__AT__gravity.phys.uwm.edu
Status: open
Assigned To: nleroy
Date: Fri, 29 Aug 2008 20:46:32 +0200
From: Carsten Aulbert <carsten.aulbert__AT__aei.mpg.de>
To: condor-admin__AT__cs.wisc.edu
CC: Stuart Anderson <anderson__AT__ligo.caltech.edu>, Scott Koranda
<skoranda__AT__gravity.phys.uwm.edu>
Subject: LIGO: Accountant information badly skewed
X-Enigmail-Version: 0.95.0
X-Seen-BY: mailfromd 4.1 silica.cs.wisc.edu
Hi,
as indicated in the telecon today, we ran a - possibly badly configured - HA setup between two nodes when his happened. At some point this happened:
condor_userprio: "Can't find address for negotiator"
and possibly related the following situation arose.
h2[3038]:~% condor_userprio -all -allusers
Last Priority Update: 8/29 20:39
Effective Real Priority Res Accumulated Usage Last
User Name Priority Priority Factor Used Usage (hrs) Start Time Usage Time
------------------------------ --------- -------- ------------ ---- ----------- ---------------- ----------------
ballen__AT__atlas.local 50.00 0.50 100.00 0 168.94 4/08/2008 23:11 4/08/2008 23:50
chrism__AT__atlas.local 50.00 0.50 100.00 0 1.70 5/07/2008 19:58 5/07/2008 20:13
miroslav__AT__atlas.local 50.00 0.50 100.00 0 16.45 4/06/2008 14:31 4/11/2008 18:56
fehrmann__AT__atlas.local 50.00 0.50 100.00 0 9.90 5/06/2008 16:24 5/06/2008 16:46
carsten__AT__atlas.local 50.00 0.50 100.00 0 1189.61 4/09/2008 20:34 4/11/2008 17:48
kwwette__AT__atlas.local 164.86 1.65 100.00 9 45.26 8/29/2008 15:37 8/29/2008 20:39
christian__AT__atlas.local 397.14 3.97 100.00 4 13438.05 5/27/2008 18:45 8/29/2008 20:39
hpletsch__AT__atlas.local 411.54 4.12 100.00 0 19452.60 4/11/2008 15:33 8/29/2008 18:10
bose__AT__atlas.local 500.00 0.50 1000.00 0 511.32 4/11/2008 18:18 4/15/2008 00:05
rubab__AT__atlas.local 500.00 0.50 1000.00 0 0.00 1/01/1970 01:00 ???
ajith__AT__atlas.local 500.00 0.50 1000.00 0 73891.45 6/26/2008 15:41 7/15/2008 15:43
chmess__AT__atlas.local 690.48 6.90 100.00 0 509.13 8/28/2008 14:35 8/29/2008 13:37
dfazi__AT__atlas.local 1275.92 1.28 1000.00 0 -202281.96 6/26/2008 15:41 8/29/2008 05:39
sfairhur__AT__atlas.local 6978.91 6.98 1000.00 8 206069.49 4/11/2008 15:28 8/29/2008 20:39
benk__AT__atlas.local 10517.92 105.18 100.00 283 22306.49 5/06/2008 15:12 8/29/2008 20:39
igor__AT__atlas.local 12493.31 12.49 1000.00 0 5113.03 8/21/2008 13:37 8/28/2008 14:35
lucia__AT__atlas.local 208067.61 208.07 1000.00 3160 7445.06 8/29/2008 17:55 8/29/2008 20:39
spxiwh__AT__atlas.local 219876.44 219.88 1000.00 0 78972.76 8/21/2008 13:37 8/29/2008 15:43
volodya__AT__atlas.local 480233.00 480.23 1000.00 1307 25203.58 8/21/2008 13:37 8/29/2008 20:39
stas__AT__atlas.local 639320.44 639.32 1000.00 522 537400.53 5/27/2008 18:45 8/29/2008 20:39
pau__AT__atlas.local 669049.44 669.05 1000.00 49 103649.87 8/21/2008 13:37 8/29/2008 20:39
------------------------------ --------- -------- ------------ ---- ----------- ---------------- ----------------
Number of users: 21 5342 893113.24 4/11/2008 15:28 ???
On the other node it looked normal at the time, i.e. no negative numbers and about 3.6m CPU-hours more. In the mean time we went back to a set-up
with only a single condor head node. Sorry for being very vague about what might have happened.
Cheers
Carsten
PS: Sorry for the long lines, but this way the table looks nicer ;)
--
Dr. Carsten Aulbert - Max Planck Institute for Gravitational Physics
Callinstrasse 38, 30167 Hannover, Germany
Phone/Fax: +49 511 762-17185 / -17193
http://www.top500.org/system/9234 | http://www.top500.org/connfam/6/list/31
===========================================================================
Date of creation: Fri Aug 29 13:46:42 2008 (1220035604)
Subject: Actions
Assigned to nleroy by gthain
===========================================================================
Date of actions: Fri Aug 29 20:20:29 2008 (1220059229)
From: Nick LeRoy <nleroy__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #18390] LIGO: Accountant information badly skewed
Date: Tue, 2 Sep 2008 13:22:59 -0500
> Hi,
Hello Carsten,
> as indicated in the telecon today, we ran a - possibly badly configured -
> HA setup between two nodes when his happened. At some point this happened:
>
> condor_userprio: "Can't find address for negotiator"
>
> and possibly related the following situation arose.
>
> h2[3038]:~% condor_userprio -all -allusers
> Last Priority Update: 8/29 20:39
> On the other node it looked normal at the time, i.e. no negative numbers
> and about 3.6m CPU-hours more. In the mean time we went back to a set-up
> with only a single condor head node. Sorry for being very vague about what
> might have happened.
Are you running the replication service of HAD? If so, it seems that the
replication did something bad...
In any case, can you send me the HAD related logs from the time of the event
(from both systems)?
Thanks
-Nick
--
<<< Follow the white rabbit. >>>
/`-_ Nicholas R. LeRoy The Condor Project
{ }/ http://www.cs.wisc.edu/~nleroy http://www.cs.wisc.edu/condor
\ / nleroy__AT__cs.wisc.edu The University of Wisconsin
|_*_| 608-265-5761 Department of Computer Sciences
===========================================================================
Date mail was appended: Tue Sep 2 13:22:41 2008 (1220379761)
Subject: Actions
Status changed from open to pending by nleroy
===========================================================================
Date of actions: Tue Sep 2 13:22:41 2008 (1220379763)
Date: Tue, 02 Sep 2008 20:31:26 +0200
From: Carsten Aulbert <carsten.aulbert__AT__aei.mpg.de>
To: condor-admin__AT__cs.wisc.edu
CC: anderson__AT__ligo.caltech.edu, skoranda__AT__gravity.phys.uwm.edu
Subject: Re: [condor-admin #18390] LIGO: Accountant information badly skewed
X-Enigmail-Version: 0.95.0
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu
Hi Nick (et al)
condor-admin response tracking system wrote:
>
> Are you running the replication service of HAD? If so, it seems that the
> replication did something bad...
>
The relevant excerpt from our config file:
HAD_PORT = 51450
HAD_ARGS = -p \$(HAD_PORT)
HAD_LIST =
\$(CENTRAL_MANAGER1):\$(HAD_PORT),\$(CENTRAL_MANAGER2):\$(HAD_PORT)
HAD_CONNECTION_TIMEOUT = 4
HAD_USE_PRIMARY = TRUE
HOSTALLOW_ADMINISTRATOR = \$(COLLECTOR_HOST)
HAD = \$(SBIN)/condor_had
HOSTALLOW_NEGOTIATOR = \$(COLLECTOR_HOST)
HOSTALLOW_NEGOTIATOR_SCHEDD = \$(COLLECTOR_HOST)
DAEMON_LIST = MASTER,COLLECTOR,NEGOTIATOR,HAD,SCHEDD
MAX_HAD_LOG = 640000
HAD_DEBUG = D_COMMAND
MAX_JOBS_RUNNING = 10000
HAD_LOG = \$(LOG)/HADLog
That's all I can spot here.
> In any case, can you send me the HAD related logs from the time of the event
> (from both systems)?
The system where the numbers are bad has only a very small log file size
and I don't have those logs, sorry about this. This is from the other
machine, where the data looked ok. This is the only "unusual" stuff I
can spot around the time when this happened. 10.20.30.1 is
CENTRAL_MANAGER1 (h1) and 10.20.30.2 is CENTRAL_MANAGER2 (h2) - I'm not
aware that condor was restarted on either node at that time.
8/21 13:03:36 HADStateMachine::updateCollectors 2 successful updates
8/21 13:08:36 HADStateMachine::updateCollectors 2 successful updates
8/21 13:13:36 HADStateMachine::updateCollectors 2 successful updates
8/21 13:18:36 HADStateMachine::updateCollectors 2 successful updates
8/21 13:23:36 HADStateMachine::updateCollectors 2 successful updates
8/21 13:28:36 HADStateMachine::updateCollectors 2 successful updates
8/21 13:33:36 HADStateMachine::updateCollectors 2 successful updates
8/21 13:34:40 attempt to connect to <10.20.30.2:51450> failed:
Connection refused (connect errno = 111).
8/21 13:34:40 cannot connect to addr <10.20.30.2:51450>
8/21 13:34:57 attempt to connect to <10.20.30.2:51450> failed:
Connection refused (connect errno = 111).
8/21 13:34:57 cannot connect to addr <10.20.30.2:51450>
8/21 13:35:14 attempt to connect to <10.20.30.2:51450> failed:
Connection refused (connect errno = 111).
8/21 13:35:14 cannot connect to addr <10.20.30.2:51450>
8/21 13:35:31 attempt to connect to <10.20.30.2:51450> failed:
Connection refused (connect errno = 111).
8/21 13:35:31 cannot connect to addr <10.20.30.2:51450>
8/21 13:35:48 attempt to connect to <10.20.30.2:51450> failed:
Connection refused (connect errno = 111).
8/21 13:35:48 cannot connect to addr <10.20.30.2:51450>
8/21 13:36:05 attempt to connect to <10.20.30.2:51450> failed:
Connection refused (connect errno = 111).
8/21 13:36:05 cannot connect to addr <10.20.30.2:51450>
8/21 13:36:22 attempt to connect to <10.20.30.2:51450> failed:
Connection refused (connect errno = 111).
8/21 13:36:22 cannot connect to addr <10.20.30.2:51450>
8/21 13:36:39 attempt to connect to <10.20.30.2:51450> failed:
Connection refused (connect errno = 111).
8/21 13:36:39 cannot connect to addr <10.20.30.2:51450>
8/21 13:36:56 attempt to connect to <10.20.30.2:51450> failed:
Connection refused (connect errno = 111).
8/21 13:36:56 cannot connect to addr <10.20.30.2:51450>
8/21 13:37:13 attempt to connect to <10.20.30.2:51450> failed:
Connection refused (connect errno = 111).
8/21 13:37:13 cannot connect to addr <10.20.30.2:51450>
8/21 13:37:30 DaemonCore: Command received via UDP from host
<10.20.30.2:48999>, access level ALLOW
8/21 13:37:30 DaemonCore: received command 60014 (DC_INVALIDATE_KEY),
calling handler (handle_invalidate_key())
8/21 13:37:30 Calling HandleReq <handle_invalidate_key()> (0)
8/21 13:37:30 Return from HandleReq <handle_invalidate_key()>
8/21 13:38:36 HADStateMachine::updateCollectors 2 successful updates
8/21 13:38:36 DaemonCore: Command received via UDP from host
<10.20.30.2:35438>, access level ALLOW
8/21 13:38:36 DaemonCore: received command 60014 (DC_INVALIDATE_KEY),
calling handler (handle_invalidate_key())
8/21 13:38:36 Calling HandleReq <handle_invalidate_key()> (0)
8/21 13:38:36 Return from HandleReq <handle_invalidate_key()>
8/21 13:43:36 HADStateMachine::updateCollectors 2 successful updates
8/21 13:43:36 Calling Handler <<10.20.30.2:9618>>
8/21 13:43:36 Return from Handler <<10.20.30.2:9618>>
8/21 13:48:36 HADStateMachine::updateCollectors 2 successful updates
8/21 13:53:36 HADStateMachine::updateCollectors 2 successful updates
8/21 13:58:36 HADStateMachine::updateCollectors 2 successful updates
8/21 14:03:36 HADStateMachine::updateCollectors 2 successful updates
8/21 14:08:36 HADStateMachine::updateCollectors 2 successful updates
8/21 14:13:36 HADStateMachine::updateCollectors 2 successful updates
8/21 14:18:36 HADStateMachine::updateCollectors 2 successful updates
8/21 14:23:36 HADStateMachine::updateCollectors 2 successful updates
8/21 14:28:36 HADStateMachine::updateCollectors 2 successful updates
8/21 14:33:36 HADStateMachine::updateCollectors 2 successful updates
Sorry that I don't have more
Cheers
Carsten
--
Dr. Carsten Aulbert - Max Planck Institute for Gravitational Physics
Callinstrasse 38, 30167 Hannover, Germany
Phone/Fax: +49 511 762-17185 / -17193
http://www.top500.org/system/9234 | http://www.top500.org/connfam/6/list/31
===========================================================================
Date mail was appended: Tue Sep 2 13:31:40 2008 (1220380300)
From: Nick LeRoy <nleroy__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #18390] LIGO: Accountant information badly skewed
Date: Tue, 2 Sep 2008 13:45:10 -0500
On Tue September 2 2008, condor-admin response tracking system wrote:
> Hi Nick (et al)
Hello again,
> The system where the numbers are bad has only a very small log file size
> and I don't have those logs, sorry about this. This is from the other
> machine, where the data looked ok. This is the only "unusual" stuff I
> can spot around the time when this happened. 10.20.30.1 is
> CENTRAL_MANAGER1 (h1) and 10.20.30.2 is CENTRAL_MANAGER2 (h2) - I'm not
> aware that condor was restarted on either node at that time.
Just want to make sure that I have everything clear:
1. h1 is the primary CM
2. the log snippet you sent me is from h1
3. the accountant data on h1 is fine
4. h2 is the secondary CM
5. the accountant data on h2 is corrupt
Is that all correct? Do you have relevant log snippets from either of the
MasterLog files or NegotiatorLog files?
Thanks
-Nick
--
<<< The answer is out there, Neo. >>>
/`-_ Nicholas R. LeRoy The Condor Project
{ }/ http://www.cs.wisc.edu/~nleroy http://www.cs.wisc.edu/condor
\ / nleroy__AT__cs.wisc.edu The University of Wisconsin
|_*_| 608-265-5761 Department of Computer Sciences
===========================================================================
Date mail was appended: Tue Sep 2 13:44:52 2008 (1220381092)
Date: Tue, 02 Sep 2008 21:08:16 +0200
From: Carsten Aulbert <carsten.aulbert__AT__aei.mpg.de>
To: condor-admin__AT__cs.wisc.edu
CC: anderson__AT__ligo.caltech.edu, skoranda__AT__gravity.phys.uwm.edu
Subject: Re: [condor-admin #18390] LIGO: Accountant information badly skewed
X-Enigmail-Version: 0.95.0
X-Seen-BY: mailfromd 4.1 granite.cs.wisc.edu
condor-admin response tracking system wrote:
>
> 1. h1 is the primary CM
> 2. the log snippet you sent me is from h1
> 3. the accountant data on h1 is fine
> 4. h2 is the secondary CM
> 5. the accountant data on h2 is corrupt
>
Almost yes to all of them. Both h1 and h2 ran with the same settings at
that time and if I recall correctly h2 was the first machine to be
started at that time.
> Is that all correct? Do you have relevant log snippets from either of the
> MasterLog files or NegotiatorLog files?
Let me see. This will be a bit longer since I do not know 100% sure when
the problem happened. The first report by a user was on 2008-08-21. I
think that subsequently condor was restarted on the head nodes to fix
the initial error of
condor_userprio: "Can't find address for negotiator"
(of course a futile attempt since the problem was that the negotiator
was running on h1 at that time and bound to the wrong NIC):
MasterLog (h2)
8/19 16:44:59 Preen pid is 22056
8/19 16:45:14 Child 22056 died, but not a daemon -- Ignored
8/20 16:45:00 Preen pid is 11479
8/20 16:45:09 Child 11479 died, but not a daemon -- Ignored
8/21 13:34:40 Got SIGQUIT. Performing fast shutdown.
8/21 13:34:40 Sent SIGQUIT to COLLECTOR (pid 7475)
8/21 13:34:40 Sent SIGQUIT to HAD (pid 7483)
8/21 13:34:40 Sent SIGQUIT to SCHEDD (pid 7484)
8/21 13:34:40 Handling daemon-specific command for "NEGOTIATOR"
8/21 13:34:40 The HAD (pid 7483) exited with status 0
8/21 13:34:40 The COLLECTOR (pid 7475) exited with status 0
8/21 13:35:05 ******************************************************
8/21 13:35:05 ** condor_master (CONDOR_MASTER) STARTING UP
8/21 13:35:05 ** /scr/opt/condor-7.0.1/sbin/condor_master
8/21 13:35:05 ** $CondorVersion: 7.0.1 Mar 21 2008 $
8/21 13:35:05 ** $CondorPlatform: X86_64-LINUX_DEBIAN40 $
8/21 13:35:05 ** PID = 29709
8/21 13:35:05 ** Log last touched 8/21 13:34:40
8/21 13:35:05 ******************************************************
8/21 13:35:05 Using config source: /opt/condor/etc/condor_config
8/21 13:35:05 Using local config sources:
8/21 13:35:05 /etc/default/condor|
8/21 13:35:05 FileLock::obtain(1) failed - errno 11 (Resource
temporarily unavailable)
8/21 13:35:05 ERROR "Can't get lock on "/tmp//InstanceLock"" at line 848
in file master.C
8/21 13:36:42 The SCHEDD (pid 7484) exited with status 1
8/21 13:36:42 All daemons are gone. Exiting.
8/21 13:36:42 **** condor_master (condor_MASTER) EXITING WITH STATUS 0
8/21 13:37:23 ******************************************************
8/21 13:37:23 ** condor_master (CONDOR_MASTER) STARTING UP
8/21 13:37:23 ** /scr/opt/condor-7.0.1/sbin/condor_master
8/21 13:37:23 ** $CondorVersion: 7.0.1 Mar 21 2008 $
8/21 13:37:23 ** $CondorPlatform: X86_64-LINUX_DEBIAN40 $
8/21 13:37:23 ** PID = 30300
8/21 13:37:23 ** Log last touched 8/21 13:36:42
8/21 13:37:23 ******************************************************
8/21 13:37:23 Using config source: /opt/condor/etc/condor_config
8/21 13:37:23 Using local config sources:
8/21 13:37:23 /etc/default/condor|
8/21 13:37:23 DaemonCore: Command Socket at <10.20.30.2:58489>
8/21 13:37:23 Started DaemonCore process
"/opt/condor/sbin/condor_collector", pid and pgroup = 30301
8/21 13:37:26 Started DaemonCore process
"/opt/condor/sbin/condor_negotiator", pid and pgroup = 30309
8/21 13:37:26 Started DaemonCore process "/opt/condor/sbin/condor_had",
pid and pgroup = 30310
8/21 13:37:26 Started DaemonCore process
"/opt/condor/sbin/condor_schedd", pid and pgroup = 30311
8/21 13:37:27 Handling daemon-specific command for "NEGOTIATOR"
8/21 13:37:27 Sent SIGQUIT to NEGOTIATOR (pid 30309)
8/21 13:37:27 The NEGOTIATOR (pid 30309) exited with status 0
8/21 14:37:26 Preen pid is 21638
8/21 14:37:32 Child 21638 died, but not a daemon -- Ignored
MasterLog (h1)
8/19 17:17:51 Preen pid is 22043
8/19 17:17:51 Child 22043 died, but not a daemon -- Ignored
8/20 17:17:51 Preen pid is 26267
8/20 17:17:51 Child 26267 died, but not a daemon -- Ignored
8/21 17:17:51 Preen pid is 21081
8/21 17:17:51 Child 21081 died, but not a daemon -- Ignored
NegotiatorLog (h1)
This one is quite big:
h1# grep -c '^8/21' NegotiatorLog
25416
First glance does not look strange. I can pack this together if you want
all the gory details.
Since h2 is still running Condor, the oldest logs are from the 28th and
I don't have the log files in the backup (which is now on my todo list).
Sorry for not having thought about that earlier.
Thanks for looking into this
Carsten
===========================================================================
Date mail was appended: Tue Sep 2 14:08:42 2008 (1220382522)