Grid Exerciser

site_submit

site_submit starts a Grid Exerciser run.

Execution

site_submit takes pairs of arguments. The first element in each pair is a Globus resource. The second is the number of simultaneous jobs to send to that resource. Thus, to send 10 jobs to grid.example.com and 20 jobs to grid2.example.net, you could run:

site_submit grid.example.com 10 grid2.example.net 20
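
The pairing of arguments can be illustrated with a short Python sketch. It is not taken from site_submit itself; the function name parse_site_pairs and its error handling are assumptions made for illustration.

```python
def parse_site_pairs(args):
    """Turn ['host', '10', 'host2', '20'] into [('host', 10), ('host2', 20)]."""
    if len(args) % 2 != 0:
        raise ValueError("expected resource/count pairs, got an odd number of arguments")
    pairs = []
    # Even positions hold resources, odd positions hold job counts.
    for resource, count in zip(args[0::2], args[1::2]):
        jobs = int(count)  # raises ValueError if the count is not an integer
        if jobs < 0:
            raise ValueError(f"job count for {resource} must not be negative")
        pairs.append((resource, jobs))
    return pairs


if __name__ == "__main__":
    print(parse_site_pairs(["grid.example.com", "10", "grid2.example.net", "20"]))
```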

site_submit creates a directory in the current working directory. The directory is named after the current date and time. This directory holds all of the working files and log files.

Each site is submitted as a separate DAGMan job. As a result, you can use "condor_q -dag" to group the jobs by site. When a job finishes, it is immediately resubmitted to the same site to maintain the load.

The --proxy option can be used to specify the path to a different X509 proxy for the jobs.

The --no-max-per-site option removes any limits on how many jobs are sent to any site. --max-per-site takes an integer argument; no site will be sent more than that many jobs regardless of the number specified for the site.
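
The effect of the cap can be sketched as follows. This is a minimal illustration, not site_submit's own code: the function name apply_site_cap is hypothetical, and None is used here to stand in for --no-max-per-site. The default cap, if any, is not stated above, so none is assumed.

```python
def apply_site_cap(pairs, max_per_site=None):
    """Clamp each requested job count to max_per_site; None means no cap."""
    if max_per_site is None:
        return list(pairs)
    return [(resource, min(jobs, max_per_site)) for resource, jobs in pairs]


if __name__ == "__main__":
    requested = [("grid.example.com", 10), ("grid2.example.net", 20)]
    # With a cap of 15, the second site is reduced from 20 jobs to 15.
    print(apply_site_cap(requested, max_per_site=15))
```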

--grid-file specifies a grid file from which to pull sites to send jobs to. This can replace the normal usage of specifying sites on the command line.

Here is an example usage:

#! /bin/sh
./site_submit \
        --proxy /p/condor/workspaces/adesmet/gridexerciser/doegrids-2week-proxy \
        a197107.n1.vanderbilt.edu/jobmanager-pbs 12 \
        acdc.ccr.buffalo.edu/jobmanager-pbs 80 \
        atgrid.grid.umich.edu/jobmanager-condor 11 \
        atlas12.hep.anl.gov/jobmanager-condor 1 \
        atlas.iu.edu/jobmanager-pbs 64 \
        citgrid3.cacr.caltech.edu/jobmanager-condor 12 \
        cluster28.knu.ac.kr/jobmanager-condor 3 \
        cmssrv04.fnal.gov/jobmanager-fbsng 80 \
        garlic.hep.wisc.edu/jobmanager-condor 64 \
        grid02.uchicago.edu/jobmanager-condor 3 \
        grid.dpcc.uta.edu/jobmanager-pbs 162 \
        iuatlas.physics.indiana.edu/jobmanager-condor 4 \
        lldimu.alliance.unm.edu/jobmanager-pbs 516 \
        spider.usatlas.bnl.gov/jobmanager-condor 20 \
        t2cms0.sdsc.edu/jobmanager-condor 42 \
        tam01.fnal.gov/jobmanager-condor 10 \
        ufgrid01.phys.ufl.edu/jobmanager-condor 42 \
        ufloridaPG.phys.ufl.edu/jobmanager-condor 80 \
        uscmstb0.ucsd.edu/jobmanager-condor 2

Implementation

site_submit first creates a directory based on the current date and time in the current working directory. This directory holds the many files necessary for an exerciser run.
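
Creating a timestamped run directory might look like the sketch below. The exact name format site_submit uses is not given here, so the strftime pattern is an assumption, as is the function name make_run_dir.

```python
import os
from datetime import datetime


def make_run_dir(base=".", now=None, fmt="%Y-%m-%d-%H%M%S"):
    """Create a directory under base named after the given (or current) time."""
    if now is None:
        now = datetime.now()
    path = os.path.join(base, now.strftime(fmt))  # fmt is an assumed format
    os.makedirs(path)  # raises FileExistsError rather than reusing an old run
    return path


if __name__ == "__main__":
    print(make_run_dir())
```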

For each resource to which jobs are sent, a resourcename.dag DAGMan file is created. For each simultaneous job sent to that site, an entry is created in that file in the form:

Job resourcename-0001 uscmstb0.ucsd.edu-jobmanager-condor.submit
script post resourcename-0001 /bin/false
retry resourcename-0001 100000

Thus, when a given job finishes, DAGMan considers it to have failed (because the POST script, /bin/false, always exits with a non-zero status) and immediately retries it (up to 100,000 times). This maintains the load on the site.
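
The DAG entries described above could be generated along these lines. This is a sketch under assumptions: the function name dag_entries is hypothetical, node numbering is assumed to start at 0001, and the file-safe resource name is assumed to be the resource with "/" replaced by "-", which matches the file names shown in this document.

```python
def dag_entries(resource, jobs, retries=100000):
    """Return DAGMan text with one always-failing, retried node per job."""
    name = resource.replace("/", "-")  # assumed mapping, e.g. host/jm -> host-jm
    lines = []
    for i in range(1, jobs + 1):
        node = f"{name}-{i:04d}"
        lines.append(f"Job {node} {name}.submit")
        # /bin/false makes DAGMan treat every completion as a failure...
        lines.append(f"script post {node} /bin/false")
        # ...so the retry line causes the node to be submitted again.
        lines.append(f"retry {node} {retries}")
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    print(dag_entries("uscmstb0.ucsd.edu/jobmanager-condor", 2), end="")
```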

All jobs to a given resource share the same submit file. That submit file currently sends a simple shell script named "simple". Like the other files, simple is written by site_submit.

Jobs that go on hold are currently configured to be automatically released (periodic_release) and resubmitted (globus_resubmit).

A typical current submit file:

executable=simple
universe = globus
globusscheduler = a197107.n1.vanderbilt.edu/jobmanager-pbs
submit_event_user_notes=GLOBUSRESOURCE:a197107.n1.vanderbilt.edu/jobmanager-pbs
log = a197107.n1.vanderbilt.edu-jobmanager-pbs.joblog
output=output/a197107.n1.vanderbilt.edu-jobmanager-pbs.out.$(Cluster)
error=output/a197107.n1.vanderbilt.edu-jobmanager-pbs.err.$(Cluster)
notification=never
periodic_release = ((CurrentTime - EnteredCurrentStatus) > 300) && (HoldReason =!= "via condor_hold (by user $(USERNAME))")
globus_resubmit = (NumSystemHolds >= (NumGlobusSubmits + 1) * 2)
globusrsl = (maxTime=22)
x509userproxy=/p/condor/workspaces/adesmet/gridexerciser/doegrids-2week-proxy
queue
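
To make the globus_resubmit threshold in the submit file above concrete, its arithmetic can be transcribed into Python. This is only a transcription of the expression, not a model of how Condor evaluates it; the function and argument names are illustrative.

```python
def should_resubmit(num_system_holds, num_globus_submits):
    """Mirror of: NumSystemHolds >= (NumGlobusSubmits + 1) * 2"""
    return num_system_holds >= (num_globus_submits + 1) * 2


def holds_needed(num_globus_submits):
    """Smallest NumSystemHolds value for which the expression is true."""
    return (num_globus_submits + 1) * 2


if __name__ == "__main__":
    for submits in range(3):
        print(submits, holds_needed(submits))
```

When NumGlobusSubmits is 0 the expression becomes true at two system holds, at 1 it takes four, and at 2 it takes six, so each additional submit raises the threshold by two holds.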

Output files are placed into a subdirectory named "output". At this time the files are ignored and will accumulate over time. A future goal is a more advanced POST script that confirms the output came back successfully and then deletes it.

site_submit also writes a resourcename.info file. This file contains key=value pairs. At the moment it contains only the resource name, hostname, jobmanager, and the number of jobs sent. Of those, only the number of jobs sent is used by summarize-condor-log.
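
A key=value file of this kind could be written and read back as sketched below. The actual key names used by site_submit are not given here, so the keys resource, hostname, jobmanager, and jobs are hypothetical, as are the function names.

```python
def format_info(resource, jobs):
    """Render the four fields as key=value lines (key names are assumed)."""
    hostname, _, jobmanager = resource.partition("/")
    fields = {
        "resource": resource,
        "hostname": hostname,
        "jobmanager": jobmanager,
        "jobs": jobs,
    }
    return "".join(f"{key}={value}\n" for key, value in fields.items())


def parse_info(text):
    """Parse key=value lines into a dict of strings, skipping blank lines."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, value = line.partition("=")
        result[key.strip()] = value.strip()
    return result


if __name__ == "__main__":
    text = format_info("a197107.n1.vanderbilt.edu/jobmanager-pbs", 12)
    print(text, end="")
    print(int(parse_info(text)["jobs"]))
```

A log summarizer would then need only the job count, read here with int(parse_info(text)["jobs"]).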