LIGO Support Ticket 19593

Ticket Information
  Number:      admin 19593
  User:        carsten.aulbert@aei.mpg.de
  Email:       condorligo__AT__aei.mpg.de
  Status:      resolved
  Assigned To: psilord
To: condor-admin__AT__cs.wisc.edu
Subject: LIGO: Why does Condor start jobs from the user with the worst prio?
From: Carsten Aulbert <carsten.aulbert__AT__aei.mpg.de>
Date: Fri, 14 Aug 2009 21:12:29 +0200
X-Seen-BY: mailfromd 4.1 obsidian.cs.wisc.edu

Hi all (resending this to create a ticket),

the current situation on Atlas is the following:

RUN/HELD/IDLE | atlas1         atlas2        | Total
--------------+------------------------------+--------------
benk          | 41/0/11782                   | 41/0/11782
christian     |                10/0/0        | 10/0/0
jclayton      | 410/0/1758                   | 410/0/1758
jveitch       |                330/0/1565    | 330/0/1565
kazu          | 7/0/0                        | 7/0/0
rahul         |                11/0/24       | 11/0/24
volodya       |                5540/0/461    | 5540/0/461
--------------+------------------------------+--------------
Total         | 458/0/13540    5891/0/2050   | 6349/0/15590

$ condor_userprio -all
Last Priority Update:  8/12 20:51
                                    Effective   Real     Priority   Res   Accumulated       Usage            Last
User Name                           Priority  Priority    Factor    Used  Usage (hrs)    Start Time       Usage Time
------------------------------      --------- -------- ------------ ----  ----------- ---------------- ----------------
omega__AT__atlas.local                      500.00     0.50      1000.00    0         3.45  7/31/2009 20:39  8/12/2009 19:24
jrollins__AT__atlas.local                   500.00     0.50      1000.00    0         0.01  8/12/2009 20:00  8/12/2009 20:00
ajw__AT__atlas.local                       1012.80     1.01      1000.00    0    327880.36  9/23/2008 18:51  8/12/2009 06:06
christian__AT__atlas.local                 1658.56    16.59       100.00   10    168815.48  5/27/2008 23:42  8/12/2009 20:51
kazu__AT__atlas.local                      4771.63     4.77      1000.00    6       350.55  7/31/2009 15:33  8/12/2009 20:51
benk__AT__atlas.local                     15334.47   153.34       100.00   41   4958965.19  5/29/2008 08:35  8/12/2009 20:51
rahul__AT__atlas.local                    16075.66    16.08      1000.00    6     77230.44  7/31/2009 00:05  8/12/2009 20:51
elr__AT__atlas.local                      16359.38    16.36      1000.00  143      5485.87  8/04/2009 17:15  8/12/2009 20:51
vedovato__AT__atlas.local                 23576.82    23.58      1000.00   10    844500.48 11/29/2008 22:22  8/12/2009 20:51
jclayton__AT__atlas.local                101497.28   101.50      1000.00  403      4093.22  8/12/2009 02:10  8/12/2009 20:51
jveitch__AT__atlas.local                 576979.19   576.98      1000.00  330   1731070.15 10/07/2008 19:01  8/12/2009 20:51
volodya__AT__atlas.local                5444216.00  5444.22      1000.00 5539   4962344.96  8/14/2008 04:40  8/12/2009 20:51
------------------------------      --------- -------- ------------ ----  ----------- ---------------- ----------------
Number of users: 12                                                 6488  13080739.84  5/27/2008 23:42  8/11/2009 20:51

As you can easily see, user volodya has the worst priority, he's running a large dag job:

8/12 20:54:42 Number of idle job procs: 487
8/12 20:54:42 Of 187000 nodes total:
8/12 20:54:42  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
8/12 20:54:42   ===     ===      ===     ===     ===        ===      ===
8/12 20:54:42 19566       0     6000       0   161331          0      103


Now it's getting interesting. On our current negotiator atlas4 I see no new matches for volodya, in fact these being
the last ones in the Negotiator log:

8/10 13:37:37       Matched 777105.0 volodya__AT__atlas.local <10.20.30.2:46449> preempting none <10.10.9.0:46291> slot1__AT__n0900.atlas.local
8/10 13:41:34       Matched 777116.0 volodya__AT__atlas.local <10.20.30.2:46449> preempting none <10.10.11.6:33452> slot3__AT__n1106.atlas.local
8/10 13:42:09       Matched 777119.0 volodya__AT__atlas.local <10.20.30.2:46449> preempting none <10.10.3.31:41259> slot4__AT__n0331.atlas.local
8/10 13:43:21       Matched 777120.0 volodya__AT__atlas.local <10.20.30.2:46449> preempting none <10.10.2.61:59892> slot1__AT__n0261.atlas.local
8/10 13:43:21       Matched 777121.0 volodya__AT__atlas.local <10.20.30.2:46449> preempting none <10.10.8.82:58399> slot4__AT__n0882.atlas.local
8/12 12:11:54       Matched 789331.0 volodya__AT__atlas.local <10.20.30.2:46449> preempting none <10.10.0.16:38652> slot4__AT__n0016.atlas.local

Since then, the scheduler on atlas2 decides on its own, to substitute a finished job without asking the negotiator again.

CLAIM_WORKLIFE is set to 0 for months now, and I thought we fixed it with this, any idea what this could be about?

Cheers

Carsten

PS: setup for condor on atlas2 (this is sourced from condor_config):

        UID_DOMAIN=atlas.local
        FILESYSTEM_DOMAIN=atlas.local
        TRUST_UID_DOMAIN=TRUE        
        LOG=/var/log/condor          
        RELEASE_DIR=/opt/condor      
        LOCAL_DIR=/local/condor.atlas2
        REQUIRE_LOCAL_CONFIG_FILE=TRUE
        CONDOR_ADMIN=root@localhost   
        MAIL=/usr/bin/mail            
        SOFT_UID_DOMAIN=true          
        COLLECTOR_NAME=MPI-GRAPHY-AEI-Hannover
        CONDOR_HOST=$(HOSTNAME)               
        LOCK=/tmp/                            
        START=TRUE                            
        SUSPEND=FALSE                         
        PREEMPT=FALSE                         
        KILL=FALSE                            
        HOSTALLOW_WRITE=*                     
        NEGOTIATOR_HOST=                      
        CONDOR_HOST=                          
        WANT_SUSPEND=FALSE                    
        WANT_VACATE=FALSE                     
        CENTRAL_MANAGER1=atlas1               
        CENTRAL_MANAGER2=atlas2               
        CENTRAL_MANAGER3=atlas3               
        CENTRAL_MANAGER4=atlas4               

        COLLECTOR_HOST=$(CENTRAL_MANAGER4),$(CENTRAL_MANAGER3),$(CENTRAL_MANAGER2),$(CENTRAL_MANAGER1)

        USE_CKPT_SERVER=TRUE
        STARTER_CHOOSES_CKPT_SERVER=TRUE
        PERIODIC_CHECKPOINT = $(LastCkpt) > ($(HOUR) + $RANDOM_INTEGER(0,60,10)*$(MINUTE))
        COMPRESS_PERIODIC_CKPT=TRUE                                                       
        COMPRESS_VACATE_CKPT=TRUE                                                         

        MAX_MASTER_LOG=(100*$(MByte)) 
        MAX_CKPT_SERVER_LOG=(100*$(MByte)) 
        MAX_STARTD_LOG=(500*$(MByte))      
        MAX_COLLECTOR_LOG=(500*$(MByte))   
        MAX_NEGOTIATOR_LOG=(500*$(MByte))  
        MAX_HAD_LOG=(100*$(MByte))         
        MAX_SCHEDD_LOG=(500*$(MByte))      
        MAX_SHADOW_LOG=(500*$(MByte))      
        MAX_MATCH_LOG=(500*$(MByte))       

        JOB_RENICE_INCREMENT=0

        DEFAULT_PRIO_FACTOR=1000
        NICE_USER_PRIO_FACTOR=10000
        REMOTE_PRIO_FACTOR=1000000 
        CONDOR_IDS=666.666         

        CLAIM_WORKLIFE=0
        HAD_PORT=51450  
        HAD_ARGS=-p $(HAD_PORT)
        HAD_LIST=$(CENTRAL_MANAGER4):$(HAD_PORT),$(CENTRAL_MANAGER1):$(HAD_PORT),$(CENTRAL_MANAGER2):$(HAD_PORT),$(CENTRAL_MANAGER3):$(HAD_PORT)                                                                                                                                              
        HAD_CONNECTION_TIMEOUT=4                                                                                                               
        HAD_USE_PRIMARY=TRUE                                                                                                                   
#        REPLICATION_LIST=$(HAD_LIST)                                                                                                          
#        HAD_USE_REPLICATION=TRUE                                                                                                              
        HOSTALLOW_ADMINISTRATOR=$(COLLECTOR_HOST)                                                                                              
        HAD=$(SBIN)/condor_had                                                                                                                 
        HOSTALLOW_NEGOTIATOR=$(COLLECTOR_HOST)                                                                                                 
        HOSTALLOW_NEGOTIATOR_SCHEDD=$(COLLECTOR_HOST)                                                                                          
        MAX_HAD_LOG=640000                                                                                                                     
        HAD_DEBUG=D_COMMAND                                                                                                                    
        MAX_JOBS_RUNNING=10000                                                                                                                 
        HAD_LOG=$(LOG)/HADLog                                                                                                                  
        PREEMPTION_REQUIREMENTS=FALSE                                                                                                          
        PREEMPTION_RANK=0                                                                                                                      
        NEGOTIATOR_CONSIDER_PREEMPTION=False
        WantRemoteIO=FALSE
        notification=NEVER
        SUBMIT_EXPRS=notification, WantRemoteIO

        HISTORY=$(SPOOL)/history
        QUILL_ENABLED=TRUE
        QUILL_USE_SQL_LOG=FALSE
        QUILL_NAME=quill
        QUILL_DB_USER=quillwriter
        QUILL_DB_NAME=quill
        QUILL_DB_IP_ADDR=10.20.30.4:5432
        QUILL_POLLING_PERIOD=10
        QUILL_HISTORY_DURATION=30
        QUILL_MANAGE_VACUUM=FALSE
        QUILL_IS_REMOTELY_QUERYABLE=TRUE
        QUILL_DB_QUERY_PASSWORD=quill
        QUILL_ADDRESS_FILE=$(LOG)/.quill_address
        QUILL_DB_TYPE=PGSQL
        DATABASE_PURGE_INTERVAL=86400
        DATABASE_REINDEX_INTERVAL=86400
        QUILL_RESOURCE_HISTORY_DURATION=7
        QUILL_RUN_HISTORY_DURATION=7
        QUILL_JOB_HISTORY_DURATION=3650
        QUILL_DBSIZE_LIMIT=20
        QUILL_MAINTAIN_DB_CONN=TRUE
        SCHEDD_SQLLOG=$(LOG)/schedd_sql.log
        SCHEDD_DAEMON_AD_FILE=$(LOG)/.schedd_classad

        # DAG config
        DAGMAN_SUBMIT_DELAY=0
        DAGMAN_MAX_JOBS_IDLE=500
        DAGMAN_MAX_JOBS_SUBMITTED=6000
        DAGMAN_SUBMIT_DEPTH_FIRST=TRUE
        DAGMAN_ABORT_DUPLICATES=TRUE
        DAGMAN_MAX_SUBMITS_PER_INTERVAL=200
        DAGMAN_COPY_TO_SPOOL=TRUE

        #special debug settings
        MAX_DBMSD_LOG=(100*$(MByte))
        DBMSD_NOT_RESPONDING_TIMEOUT=14400
NETWORK_INTERFACE=10.20.30.2
DAEMON_LIST=MASTER,COLLECTOR,NEGOTIATOR,HAD,SCHEDD,QUILL


I've put some (hopefully relevant) log files here:

http://atlas1.atlas.aei.uni-hannover.de/~carsten/Condor/ScheddNegotiator/


===========================================================================
Date of creation: Fri Aug 14 14:12:38 2009 (1250277160)
Subject: Actions

Eta changed from  to spam by gthain
===========================================================================
Date of actions: Mon Aug 17 10:41:55 2009 (1250523715)
Subject: Actions

Ticket resolved by gthain
===========================================================================
Date of actions: Mon Aug 17 10:41:56 2009 (1250523716)
Subject: Actions

Ticket reopened by gthain
===========================================================================
Date of actions: Mon Aug 17 10:44:10 2009 (1250523850)
Subject: Actions

Ticket was reopened by gthain
===========================================================================
Date of actions: Mon Aug 17 10:44:10 2009 (1250523850)
Subject: Actions

Assigned to zmiller by gthain
===========================================================================
Date of actions: Mon Aug 17 14:09:24 2009 (1250536164)
From: Carsten Aulbert <carsten.aulbert__AT__aei.mpg.de>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #19593] LIGO: Why does Condor start jobs from
 the user with the worst prio?
Date: Tue, 18 Aug 2009 20:06:43 +0200
X-Seen-BY: mailfromd 4.1 obsidian.cs.wisc.edu

Just a brief question, if someone had a look already into the log files or the 
details. As this is not a real show stopper right now only a (big) nuisance 
the other big LIGO ticket has a much higher prio than this one.

Cheers
Carsten

===========================================================================
Date mail was appended: Tue Aug 18 13:06:52 2009 (1250618813)
From: Carsten Aulbert <carsten.aulbert__AT__aei.mpg.de>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #19593] LIGO: Why does Condor start jobs from
 the user with the worst prio?
Date: Thu, 27 Aug 2009 08:36:21 +0200
X-Seen-BY: mailfromd 4.1 obsidian.cs.wisc.edu

Hi

from time to time I'm still looking into this.

At first I thought this might have to do with PIDs above 32768 which was only 
activated on this particular machine. However, reverting that, did not cure 
it.

It's happening still and I found this on a node whose slot was reallocated 
without re-negotiation:


8/27 08:25:08 slot2: Called deactivate_claim_forcibly()
8/27 08:25:08 condor_write(): Socket closed when trying to write 56 bytes to 
<10.20.30.2:47976>, fd is 6
8/27 08:25:08 Buf::write(): condor_write() failed
8/27 08:25:08 Starter pid 5544 exited with status 0
8/27 08:25:08 slot2: State change: starter exited
8/27 08:25:08 slot2: Changing activity: Busy -> Idle
8/27 08:25:08 slot2: Got activate_claim request from shadow 
(<10.20.30.2:50369>)
8/27 08:25:08 slot2: Remote job ID is 994697.0
8/27 08:25:08 slot2: Got universe "STANDARD" (1) from request classad
8/27 08:25:08 slot2: State change: claim-activation protocol successful
8/27 08:25:08 slot2: Changing activity: Idle -> Busy

I don't understand why atlas2 (10.20.30.2) would not react to network traffic 
as it's pretty idle currently. Other settings which might be relevant:


atlas2# cat /proc/sys/fs/file-max
1587193
atlas2# ulimit -n
1024
atlas2# cat /proc/sys/kernel/pid_max
32768
atlas2# cat /proc/sys/net/ipv4/ip_local_port_range
32768   61000


Any ideas?

Carsten


===========================================================================
Date mail was appended: Thu Aug 27  1:36:29 2009 (1251354991)
Subject: Actions

Eta changed from spam to   by tannenba
===========================================================================
Date of actions: Fri Aug 28 15:34:36 2009 (1251491676)
Subject: Actions

Assigned to psilord by tannenba
===========================================================================
Date of actions: Fri Aug 28 16:14:36 2009 (1251494076)
From: Carsten Aulbert <carsten.aulbert__AT__aei.mpg.de>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #19593] LIGO: Why does Condor start jobs from
 the user with the worst prio?
Date: Mon, 31 Aug 2009 13:45:11 +0200
X-Seen-BY: mailfromd 4.1 silica.cs.wisc.edu

Hi again,

I think I've found some culprit here:

CLAIM_WORKLIFE is a startd configuration item right?

We only set this on our submit machines as we wrongly believed this being a 
schedd setting. I think if we activate this on all compute nodes, this should 
solve the problem, right?

Todd, what do you think?

Cheers

Carsten <=== feeling stupid if I'm right this time.

===========================================================================
Date mail was appended: Mon Aug 31  6:45:40 2009 (1251719140)
From: Carsten Aulbert <carsten.aulbert__AT__aei.mpg.de>
To: condor-admin__AT__cs.wisc.edu,
 "Tannenbaum, Todd" <tannenba__AT__cs.wisc.edu>,
 "ML: CondorLIGO" <condorligo__AT__aei.mpg.de>
Subject: Re: [condor-admin #19593] LIGO: Why does Condor start jobs from
 the user with the worst prio?
Date: Tue, 1 Sep 2009 11:25:41 +0200
X-Seen-BY: mailfromd 4.1 granite.cs.wisc.edu

Hi Todd,

On Monday 31 August 2009 13:45:11 Carsten Aulbert wrote:
> CLAIM_WORKLIFE is a startd configuration item right?
>
> We only set this on our submit machines as we wrongly believed this being a
> schedd setting. I think if we activate this on all compute nodes, this
> should solve the problem, right?
>
> Todd, what do you think?

OK, I think you can close this ticket due to PEBKAC. We totally ignored the 
fact that CLAIM_WORKLIFE is a startd configuration and only set it for our 
schedds.

After setting it to 0 on all compute nodes and reconfiguring condor it worked 
and users with better priority were able to take over slots from finishing 
jobs from other users.

As we thought we used it since ~October 2008 but in fact had it always at the 
default value of -1 we reverted to 3600 now to allow the schedds to run fast 
jobs for some limited time on their own (which might especially help CBC jobs 
compared to the setting of 0 when the jobs run only a couple of seconds).

Sorry for the hassle, but we have been too blind to check this earlier from 
this angle.

Cheers

Carsten

===========================================================================
Date mail was appended: Tue Sep  1  4:25:51 2009 (1251797152)
Subject: Comments added

closed by request.

Comments added by psilord

===========================================================================
Date comments were added: Fri Sep 11 13:32:04 2009 (1252693924)
Subject: Actions

Ticket resolved by psilord
===========================================================================
Date of actions: Fri Sep 11 13:33:04 2009 (1252693984)