LIGO Support Ticket 19245
Ticket Information
Number: admin 19245
User: anderson@ligo.caltech.edu
Email:
Status: open
Assigned To: psilord
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Possible leak in MAX_JOBS_RUNNING
Date: Sat, 25 Apr 2009 13:03:27 -0700
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu
On the LIGO Caltech Condor pool running version,
# condor_version
$CondorVersion: 7.2.2 Apr 9 2009 BuildID: 145189 $
$CondorPlatform: X86_64-LINUX_RHEL3 $
the Schedd is logging MAX_JOBS_RUNNING reached when condor_q against
the same schedd does not agree.
# tac SchedLog | grep MAX_JOBS_RUNNING | head -1
4/25 12:55:04 (pid:12546) Reached MAX_JOBS_RUNNING: no more can run, 5 jobs matched, 499 jobs idle
# condor_config_val MAX_JOBS_RUNNING -schedd
1500
# condor_q | tail -1
2270 jobs; 1291 idle, 970 running, 9 held
This may be a byproduct of some local jobs that were stuck in the X
state for ~24 hours that I finally removed from the schedd queue with,
condor_rm -forcex
Thanks.
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
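Note on the figures in the report above: the mismatch can be checked mechanically. The following is a minimal Python sketch, not part of any Condor tool, that parses the condor_q summary line quoted in the report and compares the running count with the configured limit. The function names (parse_summary, headroom) are made up for illustration; the input values are copied from the report.

```python
import re

# Values copied verbatim from the report above.
SUMMARY_LINE = "2270 jobs; 1291 idle, 970 running, 9 held"
MAX_JOBS_RUNNING = 1500  # output of `condor_config_val MAX_JOBS_RUNNING -schedd`

# Matches the last line printed by condor_q in the report.
SUMMARY_RE = re.compile(
    r"(?P<total>\d+) jobs; (?P<idle>\d+) idle, "
    r"(?P<running>\d+) running, (?P<held>\d+) held"
)


def parse_summary(line):
    """Return the job counts from a condor_q summary line as a dict of ints."""
    match = SUMMARY_RE.search(line)
    if match is None:
        raise ValueError("unrecognised condor_q summary line: %r" % line)
    return {name: int(value) for name, value in match.groupdict().items()}


def headroom(summary_line, limit):
    """Number of additional jobs that could start before reaching the limit."""
    return limit - parse_summary(summary_line)["running"]


if __name__ == "__main__":
    counts = parse_summary(SUMMARY_LINE)
    print("running:", counts["running"], "limit:", MAX_JOBS_RUNNING)
    print("headroom:", headroom(SUMMARY_LINE, MAX_JOBS_RUNNING))
```

With the reported numbers the queue is 530 running jobs short of the limit, which is why the "Reached MAX_JOBS_RUNNING" log message looks wrong.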
===========================================================================
Date of creation: Sat Apr 25 15:03:46 2009 (1240689828)
Subject: Actions
Assigned to psilord by gthain
===========================================================================
Date of actions: Mon Apr 27 16:56:31 2009 (1240869391)
Date: Tue, 28 Apr 2009 12:04:28 -0500
From: Peter Keller <psilord__AT__cs.wisc.edu>
To: gthain <condor-admin__AT__cs.wisc.edu>
Subject: Re: [condor-admin #19245] Possible leak in MAX_JOBS_RUNNING
Hello,
Hrm, does a condor_reconfig -full fix it? Also, was there a big time difference
between the writing of the log file entry and the querying of the schedd?
Condor Admin
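Note on the time-difference question: the original report contains only two timestamps, the SchedLog entry and the Date header of the email. The Python sketch below computes the gap between them as a rough upper bound. It rests on assumptions that the ticket does not confirm: that the SchedLog stamp is local time with the same UTC offset and year as the email, and that condor_q was run after the log entry and before the email was sent. The function name gap_seconds is hypothetical.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime

LOG_STAMP = "4/25 12:55:04"                    # from the SchedLog line in the report
MAIL_DATE = "Sat, 25 Apr 2009 13:03:27 -0700"  # Date header of the report


def gap_seconds(log_stamp, mail_date):
    """Seconds between a SchedLog stamp and an RFC 2822 mail date."""
    mailed = parsedate_to_datetime(mail_date)
    # ASSUMPTION: the log stamp has no year or zone of its own, so borrow
    # both from the mail date.
    logged = datetime.strptime(
        "%d/%s" % (mailed.year, log_stamp), "%Y/%m/%d %H:%M:%S"
    ).replace(tzinfo=mailed.tzinfo)
    return (mailed - logged).total_seconds()


if __name__ == "__main__":
    seconds = gap_seconds(LOG_STAMP, MAIL_DATE)
    print("gap: %d min %d s" % divmod(int(seconds), 60))
```

Under those assumptions the log entry was written 8 min 23 s before the report was sent, so the two observations were at most that far apart.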
===========================================================================
Date mail was appended: Tue Apr 28 12:04:33 2009 (1240938274)
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #19245] Possible leak in MAX_JOBS_RUNNING
Date: Tue, 28 Apr 2009 10:30:47 -0700
On Apr 28, 2009, at 10:04 AM, condor-admin response tracking system
wrote:
> Hrm, does a condor_reconfig -full fix it?
I ended up fully restarting the schedd process to fix the problem, so
I don't know if something short of that would have worked.
> Also, was there a big time difference
> in between the writing of the log file entry and the querying of the
> schedd?
Looking at the load on the pool I strongly suspect that it was in this
incorrect state for several hours, i.e., not a short transient
problem. Please also note that I had not changed the value of
MAX_JOBS_RUNNING since the startup of the original schedd process.
Thanks.
--
Stuart Anderson anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
===========================================================================
Date mail was appended: Tue Apr 28 12:31:01 2009 (1240939861)
Subject: Comments added
Seems like condor_rm -forcex should either (1) try to send a kill -9 to the corresponding shadow,
or (2) decrement the running-jobs counter that is checked against MAX_JOBS_RUNNING, or both.
Comments added by tannenba
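Illustration of the comment above: the suspected behaviour can be reproduced in a toy model. This Python sketch is purely hypothetical and is not taken from the Condor source. The class ToySchedd, its attributes, and the fix_counter flag are invented here to show how a forced removal that skips the counter decrement would leave the limit check out of step with the queue. Whether the 7.2.2 schedd actually behaves this way is not established in this ticket.

```python
class ToySchedd:
    """Hypothetical model of a scheduler with a running-jobs limit."""

    def __init__(self, max_jobs_running):
        self.max_jobs_running = max_jobs_running
        self.counted_running = 0  # internal counter used by the limit check
        self.running = set()      # jobs a queue query would report as running

    def start(self, job_id):
        """Start a job unless the internal counter says the limit is reached."""
        if self.counted_running >= self.max_jobs_running:
            return False
        self.running.add(job_id)
        self.counted_running += 1
        return True

    def finish(self, job_id):
        """Normal exit: the job leaves the queue and the counter is decremented."""
        if job_id in self.running:
            self.running.remove(job_id)
            self.counted_running -= 1

    def force_remove(self, job_id, fix_counter=False):
        """Forced removal; without fix_counter the counter is left untouched."""
        if job_id in self.running:
            self.running.remove(job_id)
            if fix_counter:
                self.counted_running -= 1  # option (2) from the comment


if __name__ == "__main__":
    schedd = ToySchedd(max_jobs_running=3)
    for job in ("a", "b", "c"):
        schedd.start(job)
    schedd.force_remove("a")  # counter leaks: still 3, queue shows 2
    print(len(schedd.running), schedd.counted_running, schedd.start("d"))
```

In this model the queue reports two running jobs while the limit check still sees three, so no new job starts until the counter is corrected or the process is restarted.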
===========================================================================
Date comments were added: Fri May 22 13:48:39 2009 (1243018119)
Subject: Comments added
Comments added by tannenba
===========================================================================
Date comments were added: Fri May 22 13:53:36 2009 (1243018416)