LIGO Support Ticket 19593
Ticket Information
Number: admin 19593
User: carsten.aulbert@aei.mpg.de
Email: condorligo__AT__aei.mpg.de
Status: resolved
Assigned To: psilord
To: condor-admin__AT__cs.wisc.edu
Subject: LIGO: Why does Condor start jobs from the user with the worst prio?
From: Carsten Aulbert <carsten.aulbert__AT__aei.mpg.de>
Date: Fri, 14 Aug 2009 21:12:29 +0200
X-Seen-BY: mailfromd 4.1 obsidian.cs.wisc.edu
Hi all (resending this to create a ticket),
the current situation on Atlas is the following:
RUN/HELD/IDLE | atlas1 atlas2 | Total
--------------+------------------------------+--------------
benk | 41/0/11782 | 41/0/11782
christian | 10/0/0 | 10/0/0
jclayton | 410/0/1758 | 410/0/1758
jveitch | 330/0/1565 | 330/0/1565
kazu | 7/0/0 | 7/0/0
rahul | 11/0/24 | 11/0/24
volodya | 5540/0/461 | 5540/0/461
--------------+------------------------------+--------------
Total | 458/0/13540 5891/0/2050 | 6349/0/15590
$ condor_userprio -all
Last Priority Update: 8/12 20:51
Effective Real Priority Res Accumulated Usage Last
User Name Priority Priority Factor Used Usage (hrs) Start Time Usage Time
------------------------------ --------- -------- ------------ ---- ----------- ---------------- ----------------
omega__AT__atlas.local 500.00 0.50 1000.00 0 3.45 7/31/2009 20:39 8/12/2009 19:24
jrollins__AT__atlas.local 500.00 0.50 1000.00 0 0.01 8/12/2009 20:00 8/12/2009 20:00
ajw__AT__atlas.local 1012.80 1.01 1000.00 0 327880.36 9/23/2008 18:51 8/12/2009 06:06
christian__AT__atlas.local 1658.56 16.59 100.00 10 168815.48 5/27/2008 23:42 8/12/2009 20:51
kazu__AT__atlas.local 4771.63 4.77 1000.00 6 350.55 7/31/2009 15:33 8/12/2009 20:51
benk__AT__atlas.local 15334.47 153.34 100.00 41 4958965.19 5/29/2008 08:35 8/12/2009 20:51
rahul__AT__atlas.local 16075.66 16.08 1000.00 6 77230.44 7/31/2009 00:05 8/12/2009 20:51
elr__AT__atlas.local 16359.38 16.36 1000.00 143 5485.87 8/04/2009 17:15 8/12/2009 20:51
vedovato__AT__atlas.local 23576.82 23.58 1000.00 10 844500.48 11/29/2008 22:22 8/12/2009 20:51
jclayton__AT__atlas.local 101497.28 101.50 1000.00 403 4093.22 8/12/2009 02:10 8/12/2009 20:51
jveitch__AT__atlas.local 576979.19 576.98 1000.00 330 1731070.15 10/07/2008 19:01 8/12/2009 20:51
volodya__AT__atlas.local 5444216.00 5444.22 1000.00 5539 4962344.96 8/14/2008 04:40 8/12/2009 20:51
------------------------------ --------- -------- ------------ ---- ----------- ---------------- ----------------
Number of users: 12 6488 13080739.84 5/27/2008 23:42 8/11/2009 20:51
As you can easily see, user volodya has the worst priority, he's running a large dag job:
8/12 20:54:42 Number of idle job procs: 487
8/12 20:54:42 Of 187000 nodes total:
8/12 20:54:42 Done Pre Queued Post Ready Un-Ready Failed
8/12 20:54:42 === === === === === === ===
8/12 20:54:42 19566 0 6000 0 161331 0 103
Now it's getting interesting. On our current negotiator atlas4 I see no new matches for volodya, in fact these being
the last ones in the Negotiator log:
8/10 13:37:37 Matched 777105.0 volodya__AT__atlas.local <10.20.30.2:46449> preempting none <10.10.9.0:46291> slot1__AT__n0900.atlas.local
8/10 13:41:34 Matched 777116.0 volodya__AT__atlas.local <10.20.30.2:46449> preempting none <10.10.11.6:33452> slot3__AT__n1106.atlas.local
8/10 13:42:09 Matched 777119.0 volodya__AT__atlas.local <10.20.30.2:46449> preempting none <10.10.3.31:41259> slot4__AT__n0331.atlas.local
8/10 13:43:21 Matched 777120.0 volodya__AT__atlas.local <10.20.30.2:46449> preempting none <10.10.2.61:59892> slot1__AT__n0261.atlas.local
8/10 13:43:21 Matched 777121.0 volodya__AT__atlas.local <10.20.30.2:46449> preempting none <10.10.8.82:58399> slot4__AT__n0882.atlas.local
8/12 12:11:54 Matched 789331.0 volodya__AT__atlas.local <10.20.30.2:46449> preempting none <10.10.0.16:38652> slot4__AT__n0016.atlas.local
Since then, the scheduler on atlas2 decides on its own, to substitute a finished job without asking the negotiator again.
CLAIM_WORKLIFE is set to 0 for months now, and I thought we fixed it with this, any idea what this could be about?
Cheers
Carsten
PS: setup for condor on atlas2 (this is sourced from condor_config):
UID_DOMAIN=atlas.local
FILESYSTEM_DOMAIN=atlas.local
TRUST_UID_DOMAIN=TRUE
LOG=/var/log/condor
RELEASE_DIR=/opt/condor
LOCAL_DIR=/local/condor.atlas2
REQUIRE_LOCAL_CONFIG_FILE=TRUE
CONDOR_ADMIN=root@localhost
MAIL=/usr/bin/mail
SOFT_UID_DOMAIN=true
COLLECTOR_NAME=MPI-GRAPHY-AEI-Hannover
CONDOR_HOST=$(HOSTNAME)
LOCK=/tmp/
START=TRUE
SUSPEND=FALSE
PREEMPT=FALSE
KILL=FALSE
HOSTALLOW_WRITE=*
NEGOTIATOR_HOST=
CONDOR_HOST=
WANT_SUSPEND=FALSE
WANT_VACATE=FALSE
CENTRAL_MANAGER1=atlas1
CENTRAL_MANAGER2=atlas2
CENTRAL_MANAGER3=atlas3
CENTRAL_MANAGER4=atlas4
COLLECTOR_HOST=$(CENTRAL_MANAGER4),$(CENTRAL_MANAGER3),$(CENTRAL_MANAGER2),$(CENTRAL_MANAGER1)
USE_CKPT_SERVER=TRUE
STARTER_CHOOSES_CKPT_SERVER=TRUE
PERIODIC_CHECKPOINT = $(LastCkpt) > ($(HOUR) + $RANDOM_INTEGER(0,60,10)*$(MINUTE))
COMPRESS_PERIODIC_CKPT=TRUE
COMPRESS_VACATE_CKPT=TRUE
MAX_MASTER_LOG=(100*$(MByte))
MAX_CKPT_SERVER_LOG=(100*$(MByte))
MAX_STARTD_LOG=(500*$(MByte))
MAX_COLLECTOR_LOG=(500*$(MByte))
MAX_NEGOTIATOR_LOG=(500*$(MByte))
MAX_HAD_LOG=(100*$(MByte))
MAX_SCHEDD_LOG=(500*$(MByte))
MAX_SHADOW_LOG=(500*$(MByte))
MAX_MATCH_LOG=(500*$(MByte))
JOB_RENICE_INCREMENT=0
DEFAULT_PRIO_FACTOR=1000
NICE_USER_PRIO_FACTOR=10000
REMOTE_PRIO_FACTOR=1000000
CONDOR_IDS=666.666
CLAIM_WORKLIFE=0
HAD_PORT=51450
HAD_ARGS=-p $(HAD_PORT)
HAD_LIST=$(CENTRAL_MANAGER4):$(HAD_PORT),$(CENTRAL_MANAGER1):$(HAD_PORT),$(CENTRAL_MANAGER2):$(HAD_PORT),$(CENTRAL_MANAGER3):$(HAD_PORT)
HAD_CONNECTION_TIMEOUT=4
HAD_USE_PRIMARY=TRUE
# REPLICATION_LIST=$(HAD_LIST)
# HAD_USE_REPLICATION=TRUE
HOSTALLOW_ADMINISTRATOR=$(COLLECTOR_HOST)
HAD=$(SBIN)/condor_had
HOSTALLOW_NEGOTIATOR=$(COLLECTOR_HOST)
HOSTALLOW_NEGOTIATOR_SCHEDD=$(COLLECTOR_HOST)
MAX_HAD_LOG=640000
HAD_DEBUG=D_COMMAND
MAX_JOBS_RUNNING=10000
HAD_LOG=$(LOG)/HADLog
PREEMPTION_REQUIREMENTS=FALSE
PREEMPTION_RANK=0
NEGOTIATOR_CONSIDER_PREEMPTION=False
WantRemoteIO=FALSE
notification=NEVER
SUBMIT_EXPRS=notification, WantRemoteIO
HISTORY=$(SPOOL)/history
QUILL_ENABLED=TRUE
QUILL_USE_SQL_LOG=FALSE
QUILL_NAME=quill
QUILL_DB_USER=quillwriter
QUILL_DB_NAME=quill
QUILL_DB_IP_ADDR=10.20.30.4:5432
QUILL_POLLING_PERIOD=10
QUILL_HISTORY_DURATION=30
QUILL_MANAGE_VACUUM=FALSE
QUILL_IS_REMOTELY_QUERYABLE=TRUE
QUILL_DB_QUERY_PASSWORD=quill
QUILL_ADDRESS_FILE=$(LOG)/.quill_address
QUILL_DB_TYPE=PGSQL
DATABASE_PURGE_INTERVAL=86400
DATABASE_REINDEX_INTERVAL=86400
QUILL_RESOURCE_HISTORY_DURATION=7
QUILL_RUN_HISTORY_DURATION=7
QUILL_JOB_HISTORY_DURATION=3650
QUILL_DBSIZE_LIMIT=20
QUILL_MAINTAIN_DB_CONN=TRUE
SCHEDD_SQLLOG=$(LOG)/schedd_sql.log
SCHEDD_DAEMON_AD_FILE=$(LOG)/.schedd_classad
# DAG config
DAGMAN_SUBMIT_DELAY=0
DAGMAN_MAX_JOBS_IDLE=500
DAGMAN_MAX_JOBS_SUBMITTED=6000
DAGMAN_SUBMIT_DEPTH_FIRST=TRUE
DAGMAN_ABORT_DUPLICATES=TRUE
DAGMAN_MAX_SUBMITS_PER_INTERVAL=200
DAGMAN_COPY_TO_SPOOL=TRUE
#special debug settings
MAX_DBMSD_LOG=(100*$(MByte))
DBMSD_NOT_RESPONDING_TIMEOUT=14400
NETWORK_INTERFACE=10.20.30.2
DAEMON_LIST=MASTER,COLLECTOR,NEGOTIATOR,HAD,SCHEDD,QUILL
I've put some (hopefully relevant) log files here:
http://atlas1.atlas.aei.uni-hannover.de/~carsten/Condor/ScheddNegotiator/
===========================================================================
Date of creation: Fri Aug 14 14:12:38 2009 (1250277160)
Subject: Actions
Eta changed from to spam by gthain
===========================================================================
Date of actions: Mon Aug 17 10:41:55 2009 (1250523715)
Subject: Actions
Ticket resolved by gthain
===========================================================================
Date of actions: Mon Aug 17 10:41:56 2009 (1250523716)
Subject: Actions
Ticket reopened by gthain
===========================================================================
Date of actions: Mon Aug 17 10:44:10 2009 (1250523850)
Subject: Actions
Ticket was reopened by gthain
===========================================================================
Date of actions: Mon Aug 17 10:44:10 2009 (1250523850)
Subject: Actions
Assigned to zmiller by gthain
===========================================================================
Date of actions: Mon Aug 17 14:09:24 2009 (1250536164)
From: Carsten Aulbert <carsten.aulbert__AT__aei.mpg.de>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #19593] LIGO: Why does Condor start jobs from
the user with the worst prio?
Date: Tue, 18 Aug 2009 20:06:43 +0200
X-Seen-BY: mailfromd 4.1 obsidian.cs.wisc.edu
Just a brief question, if someone had a look already into the log files or the
details. As this is not a real show stopper right now only a (big) nuisance
the other big LIGO ticket has a much higher prio than this one.
Cheers
Carsten
===========================================================================
Date mail was appended: Tue Aug 18 13:06:52 2009 (1250618813)
From: Carsten Aulbert <carsten.aulbert__AT__aei.mpg.de>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #19593] LIGO: Why does Condor start jobs from
the user with the worst prio?
Date: Thu, 27 Aug 2009 08:36:21 +0200
X-Seen-BY: mailfromd 4.1 obsidian.cs.wisc.edu
Hi
from time to time I'm still looking into this.
At first I thought this might have to do with PIDs above 32768 which was only
activated on this particular machine. However, reverting that, did not cure
it.
It's happening still and I found this on a node whose slot was reallocated
without re-negotiation:
8/27 08:25:08 slot2: Called deactivate_claim_forcibly()
8/27 08:25:08 condor_write(): Socket closed when trying to write 56 bytes to
<10.20.30.2:47976>, fd is 6
8/27 08:25:08 Buf::write(): condor_write() failed
8/27 08:25:08 Starter pid 5544 exited with status 0
8/27 08:25:08 slot2: State change: starter exited
8/27 08:25:08 slot2: Changing activity: Busy -> Idle
8/27 08:25:08 slot2: Got activate_claim request from shadow
(<10.20.30.2:50369>)
8/27 08:25:08 slot2: Remote job ID is 994697.0
8/27 08:25:08 slot2: Got universe "STANDARD" (1) from request classad
8/27 08:25:08 slot2: State change: claim-activation protocol successful
8/27 08:25:08 slot2: Changing activity: Idle -> Busy
I don't understand why atlas2 (10.20.30.2) would not react to network traffic
as it's pretty idle currently. Other settings which might be relevant:
atlas2# cat /proc/sys/fs/file-max
1587193
atlas2# ulimit -n
1024
atlas2# cat /proc/sys/kernel/pid_max
32768
atlas2# cat /proc/sys/net/ipv4/ip_local_port_range
32768 61000
Any ideas?
Carsten
===========================================================================
Date mail was appended: Thu Aug 27 1:36:29 2009 (1251354991)
Subject: Actions
Eta changed from spam to by tannenba
===========================================================================
Date of actions: Fri Aug 28 15:34:36 2009 (1251491676)
Subject: Actions
Assigned to psilord by tannenba
===========================================================================
Date of actions: Fri Aug 28 16:14:36 2009 (1251494076)
From: Carsten Aulbert <carsten.aulbert__AT__aei.mpg.de>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #19593] LIGO: Why does Condor start jobs from
the user with the worst prio?
Date: Mon, 31 Aug 2009 13:45:11 +0200
X-Seen-BY: mailfromd 4.1 silica.cs.wisc.edu
Hi again,
I think I've found some culprit here:
CLAIM_WORKLIFE is a startd configuration item right?
We only set this on our submit machines as we wrongly believed this being a
schedd setting. I think if we activate this on all compute nodes, this should
solve the problem, right?
Todd, what do you think?
Cheers
Carsten <=== feeling stupid if I'm right this time.
===========================================================================
Date mail was appended: Mon Aug 31 6:45:40 2009 (1251719140)
From: Carsten Aulbert <carsten.aulbert__AT__aei.mpg.de>
To: condor-admin__AT__cs.wisc.edu,
"Tannenbaum, Todd" <tannenba__AT__cs.wisc.edu>,
"ML: CondorLIGO" <condorligo__AT__aei.mpg.de>
Subject: Re: [condor-admin #19593] LIGO: Why does Condor start jobs from
the user with the worst prio?
Date: Tue, 1 Sep 2009 11:25:41 +0200
X-Seen-BY: mailfromd 4.1 granite.cs.wisc.edu
Hi Todd,
On Monday 31 August 2009 13:45:11 Carsten Aulbert wrote:
> CLAIM_WORKLIFE is a startd configuration item right?
>
> We only set this on our submit machines as we wrongly believed this being a
> schedd setting. I think if we activate this on all compute nodes, this
> should solve the problem, right?
>
> Todd, what do you think?
OK, I think you can close this ticket due to PEBKAC. We totally ignored the
fact that CLAIM_WORKLIFE is a startd configuration and only set it for our
schedds.
After setting it to 0 on all compute nodes and reconfiguring condor it worked
and users with better priority were able to take over slots from finishing
jobs from other users.
As we thought we used it since ~October 2008 but in fact had it always at the
default value of -1 we reverted to 3600 now to allow the schedds to run fast
jobs for some limited time on their own (which might especially help CBC jobs
compared to the setting of 0 when the jobs run only a couple of seconds).
Sorry for the hassle, but we have been too blind to check this earlier from
this angle.
Cheers
Carsten
===========================================================================
Date mail was appended: Tue Sep 1 4:25:51 2009 (1251797152)
Subject: Comments added
closed by request.
Comments added by psilord
===========================================================================
Date comments were added: Fri Sep 11 13:32:04 2009 (1252693924)
Subject: Actions
Ticket resolved by psilord
===========================================================================
Date of actions: Fri Sep 11 13:33:04 2009 (1252693984)