LIGO Support Ticket 19725

Ticket Information
  Number:      admin 19725
  User:        anderson@ligo.caltech.edu
  Email:       
  Status:      open
  Assigned To: psilord
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin response tracking system <condor-admin__AT__cs.wisc.edu>
Subject: LIGO: condor_history performance
Date: Thu, 17 Sep 2009 12:31:16 -0700
X-Seen-BY: mailfromd 4.1 silica.cs.wisc.edu

Accessing the job ClassAd for a historical job on version

# condor_version
$CondorVersion: 7.2.4 Jun 15 2009 BuildID: 159529 $
$CondorPlatform: X86_64-LINUX_RHEL5 $

takes an extremely long time for what naively appears to be a simple  
task, e.g.,

[root@ldas-grid ~]# time condor_history -long 54230554.0
MyType = "Job"
...
JobFinishedHookDone = 1253208688


real	4m7.432s
user	4m4.844s
sys	0m2.480s


What are the prospects of improving this performance?

Thanks.

--
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson




===========================================================================
Date of creation: Thu Sep 17 14:31:28 2009 (1253215891)
Subject: Actions

Assigned to griswold by griswold
===========================================================================
Date of actions: Fri Sep 18 11:10:04 2009 (1253290204)
Date: Fri, 18 Sep 2009 14:03:07 -0500
X-Google-Sender-Auth: a509d4ee095df6fd
Subject: Re: [condor-admin #19725] LIGO: condor_history performance
From: Nathaniel Griswold <griswold__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu

This is a common problem and one that needs attention. For now, you can
either a) run Quill, which will give you faster queries and finer control of
the rotation, or b) clean up rotated history files on the filesystem
yourself. Also note the -reverse option, which could be useful for
performance.

-nate
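
For option b), a minimal sketch of pruning rotated history files by age.
It assumes the rotated files sit next to the live "history" file and are
named "history.<suffix>"; the directory, the naming pattern, and the
function names are illustrative and should be checked against the local
installation. Removing a rotated file also discards the job ads in it, so
the default is a dry run that only reports what would be deleted.

```python
import time
from pathlib import Path


def find_stale_history_files(spool_dir, max_age_days, now=None):
    """Return rotated history files older than max_age_days, oldest first.

    Assumption: rotated files are named "history.<suffix>". The live
    "history" file has no suffix, so the glob below never matches it.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    stale = [
        path
        for path in Path(spool_dir).glob("history.*")
        if path.is_file() and path.stat().st_mtime < cutoff
    ]
    return sorted(stale, key=lambda p: p.stat().st_mtime)


def prune_history(spool_dir, max_age_days, dry_run=True):
    """Delete stale rotated history files; with dry_run, only list them."""
    stale = find_stale_history_files(spool_dir, max_age_days)
    for path in stale:
        if not dry_run:
            path.unlink()
    return stale
```

Calling prune_history(spool_dir, 30) lists the files older than 30 days;
passing dry_run=False actually removes them.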


===========================================================================
Date mail was appended: Fri Sep 18 14:03:20 2009 (1253300600)
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #19725] LIGO: condor_history performance
Date: Fri, 18 Sep 2009 12:34:35 -0700
X-Seen-BY: mailfromd 4.1 gypsum.cs.wisc.edu

I don't see a "-reverse" option, but if you are referring to
"-backwards", that appears to be even slower in this case, e.g.,

[root@ldas-grid ~]# time condor_history -backwards -long 54230554.0
MyType = "Job"
...
JobFinishedHookDone = 1253208688


real	10m45.653s
user	0m1.024s
sys	0m2.614s


We have enough interesting problems with Quill that I am reluctant to
consider that again unless we come up with another compelling reason.

Thanks.



--
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson




===========================================================================
Date mail was appended: Fri Sep 18 14:34:48 2009 (1253302489)
Date: Tue, 22 Sep 2009 17:44:14 -0500
X-Google-Sender-Auth: 364837339ae6cda8
Subject: Re: [condor-admin #19725] LIGO: condor_history performance
From: Nathaniel Griswold <griswold__AT__cs.wisc.edu>
To: condor-admin__AT__cs.wisc.edu
X-Seen-BY: mailfromd 4.1 granite.cs.wisc.edu

Stuart,

Yes, I meant to say -backwards.

I was just speaking with someone who was wondering if maybe you would
benefit from leave_in_queue (
http://www.cs.wisc.edu/condor/manual/v7.3/2_5Submitting_Job.html#1946 ).
This would allow you to keep select jobs in the queue for querying and
manipulation, say for doing some processing on the jobs before finally
removing them from the queue.

As far as Quill problems go, if you are having performance problems, you
might benefit from only enabling Quill on select daemons of select servers.

Condor is considering adding indexing to condor_history. I'm wondering what
exactly you are trying to do. Do you need to query arbitrary job IDs in
condor_history, or are they always jobs of a certain kind? If you can't
really benefit from leave_in_queue, that is also good for us to know.

-nate


===========================================================================
Date mail was appended: Tue Sep 22 17:44:27 2009 (1253659468)
From: Stuart Anderson <anderson__AT__ligo.caltech.edu>
To: condor-admin__AT__cs.wisc.edu
Subject: Re: [condor-admin #19725] LIGO: condor_history performance
Date: Sat, 26 Sep 2009 20:13:11 -0700
X-Seen-BY: mailfromd 4.1 granite.cs.wisc.edu


On Sep 22, 2009, at 3:44 PM, condor-admin response tracking system  
wrote:

> Stuart,
>
> Yes, I meant to say -backwards.
>
> I was just speaking with someone who was wondering if maybe you would
> benefit from leave_in_queue (
> http://www.cs.wisc.edu/condor/manual/v7.3/2_5Submitting_Job.html#1946 ).
> This would allow you to keep select jobs in the queue for querying and
> manipulation, say for doing some processing on the jobs before finally
> removing them from the queue.

We probably run too many jobs to ask the Schedd to remember everything.

>
> As far as Quill problems go, if you are having performance problems, you
> might benefit from only enabling Quill on select daemons of select servers.

I am more worried about the Quill stability and scaling problems I  
have seen reported by other users.

>
> Condor is considering adding indexing to condor_history. I'm wondering what
> exactly you are trying to do. Do you need to query arbitrary job IDs in
> condor_history, or are they always jobs of a certain kind? If you can't
> really benefit from leave_in_queue, that is also good for us to know.

I am interested in condor_history to help debug problems reported by
users after their jobs have left the queue; usually this means taking
a look at the job ClassAd. Our schedds have typically run O(10M) jobs
each, so I am pretty sure that leaving completed jobs in the queue for a
significant period of time for the schedd to keep track of would be
problematic.

I don't need this access very frequently, so this is not a
high-priority request, but I think some sort of indexing for condor_history
would be a generally good thing--especially when you get compared to a
Google search :)

Thanks.

--
Stuart Anderson  anderson__AT__ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
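
The forward query at the top of this ticket spent nearly all of its four
minutes in user CPU time, which is consistent with every ad being parsed
just to find one job. A rough sketch of the indexing idea discussed above:
scan the history file once, remember the byte range of each ad keyed by
(ClusterId, ProcId), and then seek directly to the ad that is wanted. The
file layout is an assumption here (each ad is a run of "Attr = value"
lines closed by a banner line starting with "***") and should be verified
against a real history file. The function names are hypothetical, and this
is not how condor_history itself is implemented.

```python
import re

# Assumed layout: "Attr = value" lines, then a banner line starting with "***".
_ID_RE = re.compile(rb"^(ClusterId|ProcId)\s*=\s*(\d+)\s*$")


def build_index(history_path):
    """Scan the file once; map (cluster, proc) -> (byte offset, length)."""
    index = {}
    ids = {}
    with open(history_path, "rb") as f:
        start = f.tell()
        while True:
            line = f.readline()
            if not line:
                break  # a trailing ad with no closing banner is not indexed
            if line.startswith(b"***"):
                if b"ClusterId" in ids and b"ProcId" in ids:
                    end = f.tell() - len(line)  # ad ends where the banner begins
                    key = (ids[b"ClusterId"], ids[b"ProcId"])
                    index[key] = (start, end - start)
                ids = {}
                start = f.tell()  # next ad begins right after the banner
                continue
            match = _ID_RE.match(line)
            if match:
                ids[match.group(1)] = int(match.group(2))
    return index


def read_ad(history_path, index, cluster, proc):
    """Seek straight to one ad and return its text, or None if not indexed."""
    entry = index.get((cluster, proc))
    if entry is None:
        return None
    offset, length = entry
    with open(history_path, "rb") as f:
        f.seek(offset)
        return f.read(length).decode("utf-8", errors="replace")
```

The index here lives only in memory; for a file that holds millions of ads
it would need to be written to disk and refreshed as the history file
grows or rotates.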




===========================================================================
Date mail was appended: Sat Sep 26 22:13:25 2009 (1254021205)
Subject: Comments added

git trac ticket #785 is tracking this rust ticket.

Comments added by psilord

===========================================================================
Date comments were added: Fri Oct  9 11:44:03 2009 (1255106643)
Subject: Comments added

LIGO has stated that this is currently a low-priority ticket. It is being
kept open because condor_history is useful enough for Condor debugging that
they would like to keep an eye on it until it is fixed.

Comments added by psilord

===========================================================================
Date comments were added: Fri Oct  9 14:11:59 2009 (1255115519)
Subject: Actions

Assigned to psilord by psilord
===========================================================================
Date of actions: Fri Oct  9 14:14:45 2009 (1255115685)