Condor - High Throughput Computing

Condor Pool Goodput Index

The following graphs show goodput statistics computed by tracking the behavior of a representative set of jobs in the pool. The representative set contained 10 Intel and Sparc Solaris jobs, with checkpoint sizes of 4, 24, 48, 96, and 180 MB. The cost of remote file access is not included in the overhead statistic (i.e., the representative jobs perform no file I/O), since this cost varies from job to job.

The first graph compares weekly throughput and goodput totals for these jobs. Throughput varies significantly with pool usage and system outages. For example, when many applications are submitted to the system, the throughput allocated to the representative jobs is reduced. The maximum possible weekly throughput for these jobs is 1680 hours (7 days * 24 hours * 10 jobs). The gap between throughput and goodput is the result of roll-back and network wait time.
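The arithmetic behind these totals can be sketched as follows. This is only an illustration: the function names are hypothetical, and the sample figures in the usage lines are made-up inputs, not measurements from the pool.

```python
HOURS_PER_WEEK = 7 * 24   # 168 hours in a week
NUM_JOBS = 10             # size of the representative job set


def max_weekly_throughput(num_jobs=NUM_JOBS):
    """Upper bound on weekly throughput: every job runs all week."""
    return HOURS_PER_WEEK * num_jobs


def weekly_goodput(throughput_hours, rollback_hours, network_wait_hours):
    """Goodput is throughput minus the time lost to roll-back and network wait."""
    lost = rollback_hours + network_wait_hours
    if lost > throughput_hours:
        raise ValueError("lost time cannot exceed allocated throughput")
    return throughput_hours - lost


# Hypothetical week: 1000 allocated hours, 50 lost to roll-back, 150 to network wait.
print(max_weekly_throughput())             # 1680
print(weekly_goodput(1000.0, 50.0, 150.0))  # 800.0
```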

Goodput Graph

The following graph shows the throughput lost due to roll-back and network wait time. Network wait time varies due to changes in available network capacity and the length of workstation allocations. Roll-back occurs when migrations fail, due to network outages or missed deadlines.

Badput Graph

The following graph shows the cumulative distribution of the length of workstation allocations between 7/16/98 and 3/26/99. The majority of allocations are less than two hours long, so applications must checkpoint frequently to make forward progress. However, in rare cases, an allocation can last many days.

(X, Y) = Y% of CPU allocations were X minutes or shorter in length.

Allocation Distribution Graph
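As a rough sketch of how one point on such a curve could be computed, the function below counts the share of allocation instances at or below a given length. The function name and the sample allocation lengths are hypothetical and are not taken from the pool's data.

```python
def allocation_cdf(lengths_min, x):
    """Return Y: the percentage of allocations that were x minutes or shorter."""
    if not lengths_min:
        raise ValueError("need at least one allocation")
    count = sum(1 for length in lengths_min if length <= x)
    return 100.0 * count / len(lengths_min)


# Hypothetical allocation lengths in minutes (four short, one very long).
sample = [5, 30, 60, 90, 600]
print(allocation_cdf(sample, 120))  # 80.0
```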

The following graph shows the cumulative distribution of the length of workstation allocations as a percentage of total throughput, instead of a percentage of allocation instances, as above.

(X, Y) = Y% of total allocated throughput was obtained from CPU allocations X minutes or shorter in length.

Allocation Distribution Graph
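The difference between counting allocation instances and weighting by throughput can be illustrated with a small sketch. The function name and sample data are hypothetical. Each allocation here contributes its own length to the total, so a single long allocation can dominate the throughput even when most allocations are short.

```python
def throughput_cdf(lengths_min, x):
    """Return Y: the percentage of total allocated time that came from
    allocations x minutes or shorter."""
    total = sum(lengths_min)
    if total <= 0:
        raise ValueError("total allocated time must be positive")
    from_short = sum(length for length in lengths_min if length <= x)
    return 100.0 * from_short / total


# Hypothetical allocation lengths in minutes (four short, one very long).
sample = [5, 30, 60, 90, 600]
# Four of the five allocations are 120 minutes or shorter, but together
# they supply only 185 of the 785 allocated minutes.
print(round(throughput_cdf(sample, 120), 1))
```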

The following graph shows the cumulative distribution of the work lost due to roll-back between 7/16/98 and 3/26/99. When a migration fails, the application must roll back to an earlier state. This can cost a few minutes of work for short allocations or several hours of work for longer allocations. Jobs in our pool are configured to checkpoint only work of more than ten minutes, so the vast majority of roll-backs seen here are for work under ten minutes.

(X, Y) = Y% of failures lost X minutes or less of CPU time.

Failed Migration Distribution Graph

The following graph shows the cumulative distribution of roll-backs as a percentage of total lost throughput. A significant number of roll-backs occur at the three-hour mark. This is the point at which the job performs its first periodic checkpoint. Unfortunately, the periodic checkpoint operation is prone to failure. The policy of only checkpointing allocations over ten minutes is also clear on this graph. Because every job performs a periodic checkpoint every three hours, a loss of more than three hours of work should be very rare. However, there have been a few instances of losses of over 12 hours of work, apparently in cases where the periodic checkpoint timer failed to go off.

(X, Y) = Y% of total lost throughput resulted from individual losses of X minutes or less.

Failed Migration Distribution Graph
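A simplified model shows why a three-hour periodic checkpoint should bound the work lost, and why a silent timer removes that bound. This is a sketch under stated assumptions, not the pool's actual logic: it assumes every periodic checkpoint that fires succeeds (which, as noted above, is not always the case), and the function and parameter names are hypothetical.

```python
PERIOD_MIN = 180  # periodic checkpoint interval: three hours


def work_lost_on_failed_migration(minutes_run, period=PERIOD_MIN, timer_works=True):
    """Minutes of work rolled back when the migration at the end of an
    allocation fails.

    Assumption: each periodic checkpoint that fires succeeds, so only the
    work done since the most recent one is lost.
    """
    if minutes_run < 0:
        raise ValueError("minutes_run must be non-negative")
    if timer_works:
        return minutes_run % period  # work since the last periodic checkpoint
    return minutes_run               # no checkpoint was taken: everything is lost


print(work_lost_on_failed_migration(400))                     # 40
print(work_lost_on_failed_migration(750, timer_works=False))  # 750
```

Under this model the loss never reaches three hours while the timer works, which is consistent with losses of over 12 hours pointing to a timer that never fired.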


condor-admin@cs.wisc.edu