Reading the report

The grid exerciser distributes lightweight testing scripts to one or more Globus gatekeepers. Each site is allocated a fixed number of jobs; as jobs finish, more jobs are submitted to keep the site at its allocation.
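The top-up behavior can be sketched in a few lines of Python. This is only an illustration of the idea, not the grid exerciser's actual code; the function name jobs_to_submit and its parameters are hypothetical.

```python
def jobs_to_submit(allocation, active):
    """Return how many new jobs are needed to bring a site back to its allocation.

    allocation: number of jobs the site is allowed to have outstanding
    active:     number of jobs currently outstanding at the site
    (Both names are assumptions made for this sketch.)
    """
    if allocation < 0 or active < 0:
        raise ValueError("allocation and active must be non-negative")
    # Never return a negative number, even if the site is over its allocation.
    return max(0, allocation - active)


# A site allocated 10 jobs with 7 still outstanding needs 3 more.
print(jobs_to_submit(10, 7))
```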

Status Summary

The grid exerciser currently generates two sections of results. The first is the Status Summary:

Site                                                Simul Submit  Rec'd   Done Errors    Run Time
Default
 atlas.iu.edu/jobmanager-pbs                           10    478    478    478      0      107.09
 citgrid3.cacr.caltech.edu/jobmanager-condor           10      0      0      0      0        0.00
 cluster28.knu.ac.kr/jobmanager-condor                  3      0      0      0      0        0.00
 spider.usatlas.bnl.gov/jobmanager-condor              10    849    849    850     13      200.42
 tier2b.cacr.caltech.edu/jobmanager-pbs                10     10      0      0   1320        0.00
 ufgrid01.phys.ufl.edu/jobmanager-condor               10      0      0      0      0        0.00
Known Bad
 atgrid.grid.umich.edu/jobmanager-condor               10      0      0      0   2232        0.00

For each site the summary lists the number of jobs that succeeded, the number of times jobs went on hold, and a rough total of the CPU time used on the remote site.

"Site" is the jobmanager contact string that jobs are being sent to. Sites are sorted into categories as determined by the person running the report generator.

"Simul" is the maximum number of simultaneous submits. The grid monitor attempts to keep exactly this many jobs submitted at all times.

"Submit" is the number of jobs actually submitted to Condor in the given time period.

"Rec'd" is the number of jobs actually received by the remote Globus gatekeeper in the given time period.

"Done" is the number of jobs that ran successfully in the given time period.

"Errors" is the number of problems encountered. Condor-G places a job on hold when it encounters a problem. The grid exerciser notes that the job has been placed on hold and retries it after a few minutes. Thus, if a site is having problems, it's not uncommon for the error count to be very large as jobs repeatedly fail to run.

"Run Time" is the time used by the jobs and is measured in hours. Strictly speaking, the run time measures from when Globus reports that the job has started (which may lag the actual start) until Condor-G decides the job is done (including staging back output). It can only be viewed as an approximation.
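Because the Status Summary is plain whitespace-separated text, it can be loaded into records for further processing. The following Python is a minimal sketch under the assumption that the layout matches the example above: a site row is a contact string followed by five integers and one decimal number, and any other non-header line is a category name. The names SiteStatus and parse_status_summary are hypothetical and not part of the grid exerciser.

```python
from dataclasses import dataclass


@dataclass
class SiteStatus:
    category: str
    site: str
    simul: int
    submit: int
    recd: int
    done: int
    errors: int
    run_time_hours: float


def parse_status_summary(text):
    """Parse Status Summary text into a list of SiteStatus records."""
    rows = []
    category = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # blank line
        if len(fields) == 7:
            try:
                counts = [int(f) for f in fields[1:6]]
                run_time = float(fields[6])
            except ValueError:
                pass  # not a site row; fall through
            else:
                rows.append(SiteStatus(category, fields[0], *counts, run_time))
                continue
        if fields[0] == "Site" and "Simul" in fields:
            continue  # column header
        category = line.strip()  # e.g. "Default" or "Known Bad"
    return rows


SAMPLE = """\
Site                                                Simul Submit  Rec'd   Done Errors    Run Time
Default
 atlas.iu.edu/jobmanager-pbs                           10    478    478    478      0      107.09
 tier2b.cacr.caltech.edu/jobmanager-pbs                10     10      0      0   1320        0.00
Known Bad
 atgrid.grid.umich.edu/jobmanager-condor               10      0      0      0   2232        0.00
"""

for row in parse_status_summary(SAMPLE):
    print(row.category, row.site, row.done, row.errors)
```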

In the above example, tier2b appears to have serious problems. spider has been doing well overall but has had some problems. (Those problems might be minor transient issues.) citgrid3 has neither failed jobs nor run jobs. It's possible that there is a problem, but it is also possible that citgrid3 is simply very busy and the grid exerciser has not been able to get any processor time.
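The same reading of the "Done" and "Errors" columns can be written as a rough rule of thumb. The Python below is only a sketch of that reasoning; the function classify_site and its labels are assumptions made for illustration, and the report itself does no such classification.

```python
def classify_site(done, errors):
    """Give a rough reading of one site's Done and Errors counts."""
    if done == 0 and errors == 0:
        # Nothing ran and nothing failed: the site may be broken or just busy.
        return "no activity"
    if done == 0:
        # Errors but no completed jobs suggests a serious problem.
        return "failing"
    if errors > 0:
        # Jobs complete, but some were placed on hold along the way.
        return "working with some errors"
    return "healthy"


# (done, errors) pairs taken from the example summary above.
EXAMPLE = {
    "atlas": (478, 0),
    "citgrid3": (0, 0),
    "spider": (850, 13),
    "tier2b": (0, 1320),
}

for name, (done, errors) in EXAMPLE.items():
    print(f"{name}: {classify_site(done, errors)}")
```

A label of "no activity" is deliberately noncommittal, since the counts alone cannot distinguish a broken site from a busy one.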

Sites can optionally be assigned categories using the grid file. Any site not assigned a category appears in the "Default" category. In the above example atgrid has been placed in the "Known Bad" category, perhaps to signify that it is expected to fail. The significance of categories is entirely under the control of the person running the reporting tool.

Error Details

The Error Details section gives more details on the errors encountered.

grid.dpcc.uta.edu
            590 Globus error 7: an authentication operation failed
           1748 Globus error 7: authentication with the remote server failed

The first number is the number of times the error was encountered. The rest of the line is the error message.
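The Error Details section can be read in the same way. The sketch below assumes the layout shown above: a site name on a line by itself, followed by lines that each hold a count and an error message. The name parse_error_details is hypothetical.

```python
import re

# A detail line is a count followed by the error message.
ERROR_LINE = re.compile(r"^\s*(\d+)\s+(.*\S)\s*$")


def parse_error_details(text):
    """Map each site to a list of (count, message) pairs."""
    details = {}
    site = None
    for line in text.splitlines():
        if not line.strip():
            continue
        match = ERROR_LINE.match(line)
        if match and site is not None:
            details[site].append((int(match.group(1)), match.group(2)))
        else:
            # Any other non-blank line is assumed to start a new site.
            site = line.strip()
            details[site] = []
    return details


SAMPLE_ERRORS = """\
grid.dpcc.uta.edu
            590 Globus error 7: an authentication operation failed
           1748 Globus error 7: authentication with the remote server failed
"""

for site_name, errors in parse_error_details(SAMPLE_ERRORS).items():
    for count, message in errors:
        print(site_name, count, message)
```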