Condor - High Throughput Computing

Checkpoint Server Statistics

The following graphs show the daily totals for checkpoint data transferred to the checkpoint server (Received) and from the checkpoint server (Sent). The first graph shows the aggregate amount of data successfully sent and received. This value depends on the number of jobs running in the pool, the sizes of their checkpoints, and the number of checkpoint transfers each job performs each day. The checkpoint server should typically receive more traffic than it sends: each eviction checkpoint (a receive) should be followed by one checkpoint restore (a send), and periodic checkpointing adds further receives that have no matching send. So when send traffic is greater than receive traffic, there is a problem that requires further investigation. One common cause is a checkpoint restore that fails and is retried by the job over and over again.

Checkpoint Transfer Graph (GB)
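
The rule of thumb above (sent traffic should not exceed received traffic) can be sketched as a simple check over daily totals. This is only an illustration: the `DailyTotals` record, its field names, and the sample numbers are hypothetical and do not reflect the format of the checkpoint server's actual logs.

```python
from dataclasses import dataclass


@dataclass
class DailyTotals:
    """One day of aggregate checkpoint traffic (hypothetical record layout)."""
    day: str            # label for the day, e.g. "day-1"
    received_gb: float  # data transferred to the checkpoint server
    sent_gb: float      # data transferred from the checkpoint server


def suspicious_days(totals):
    """Return the days on which the server sent more data than it received."""
    return [t.day for t in totals if t.sent_gb > t.received_gb]


# Made-up numbers, for illustration only.
sample = [
    DailyTotals("day-1", received_gb=12.0, sent_gb=7.5),
    DailyTotals("day-2", received_gb=9.0, sent_gb=14.2),
]
flagged = suspicious_days(sample)
print(flagged)
```

A flagged day is only a hint to look closer, for example at repeated restore attempts by the same job.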

This graph shows the number of successful checkpoint transfers in the pool each day. Again, when the pool is running smoothly, receives should outnumber sends because of periodic checkpointing.

Checkpoint Successful Transfer Graph

This graph shows the number of failed checkpoint transfers in the pool each day. The server fails to send a checkpoint when the job's allocation is terminated before all bytes of the checkpoint arrive, or when a network outage occurs during the transfer. The server fails to receive a checkpoint when the job initiates a transfer but does not send a commit notification.

Checkpoint Failed Transfer Graph

This graph shows the average data throughput of the checkpoint server. Because transfers occur simultaneously, this average is not the aggregate throughput of the server.

Checkpoint Data Throughput Graph (Mb/s)
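
To illustrate why an average differs from the aggregate when transfers overlap, here is a small sketch. It assumes the average is taken over per-transfer rates, which this page does not state explicitly. The tuple layout `(start_s, end_s, size_mb)`, the function names, and the sample values are all hypothetical.

```python
def mean_per_transfer_rate(transfers):
    """Average of each transfer's own rate (size divided by its duration)."""
    rates = [size / (end - start) for start, end, size in transfers]
    return sum(rates) / len(rates)


def aggregate_rate(transfers):
    """Total data moved divided by the wall-clock span covering all transfers."""
    total = sum(size for _, _, size in transfers)
    first_start = min(start for start, _, _ in transfers)
    last_end = max(end for _, end, _ in transfers)
    return total / (last_end - first_start)


# Two made-up transfers running at the same time: (start_s, end_s, size_mb).
transfers = [(0.0, 10.0, 100.0), (0.0, 10.0, 100.0)]

mean_rate = mean_per_transfer_rate(transfers)  # each transfer runs at 10 per second
total_rate = aggregate_rate(transfers)         # together they move 20 per second
print(mean_rate, total_rate)
```

With two fully overlapping transfers, the server as a whole moves data twice as fast as the per-transfer average suggests.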

The following time-of-day (TOD) graphs compare the checkpoint totals over 10-minute intervals of the day, which reveals daily checkpointing patterns. For example, a number of workstations are rebooted for maintenance at 4 a.m. each day. This first produces a burst of checkpoint receives while the applications are being checkpointed, followed by a burst of checkpoint sends while the applications restart after the reboot. Also, the checkpoint server sends more checkpoints during work hours because jobs are frequently started for short periods of time while users are away from their workstations.

TOD Checkpoint Transfer Graph
TOD Checkpoint Successful Transfer Graph
TOD Checkpoint Failed Transfer Graph
TOD Checkpoint Data Throughput Graph (Mb/s)
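
Grouping transfers into 10-minute time-of-day intervals can be sketched as follows. The function names and sample timestamps are hypothetical; the sketch only shows one way to fold events from many days into the 144 intervals of a single day.

```python
from collections import Counter
from datetime import datetime


def tod_bin(ts):
    """Map a timestamp to its 10-minute interval of the day (0 to 143)."""
    return (ts.hour * 60 + ts.minute) // 10


def bin_label(index):
    """Start time of an interval as HH:MM, e.g. interval 24 starts at 04:00."""
    minutes = index * 10
    return f"{minutes // 60:02d}:{minutes % 60:02d}"


def count_by_interval(timestamps):
    """Count events per interval, ignoring the calendar date."""
    return Counter(tod_bin(ts) for ts in timestamps)


# Made-up transfer times on different days, for illustration only.
events = [
    datetime(2004, 3, 1, 4, 3),
    datetime(2004, 3, 2, 4, 7),
    datetime(2004, 3, 2, 4, 15),
]
counts = count_by_interval(events)
for index in sorted(counts):
    print(bin_label(index), counts[index])
```

Because the date is discarded, events from the same interval on different days accumulate in one bucket, which is what makes a recurring pattern such as a nightly reboot stand out.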

The following graphs show cumulative distributions of checkpoint sizes in our pool by year.

Checkpoint Size Graph
Checkpoint Size Graph
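
A cumulative distribution of checkpoint sizes answers the question "what fraction of checkpoints are at most this large?". A minimal sketch of how such a curve can be computed is shown below; the function names and sample sizes are hypothetical and are not taken from our pool's data.

```python
def empirical_cdf(sizes_mb):
    """Return (size, cumulative fraction) points for a list of sizes."""
    ordered = sorted(sizes_mb)
    n = len(ordered)
    return [(size, (i + 1) / n) for i, size in enumerate(ordered)]


def fraction_at_or_below(sizes_mb, threshold_mb):
    """Fraction of checkpoints whose size does not exceed the threshold."""
    return sum(1 for s in sizes_mb if s <= threshold_mb) / len(sizes_mb)


# Made-up checkpoint sizes in MB, for illustration only.
sizes = [150, 20, 5, 20]
points = empirical_cdf(sizes)
share_small = fraction_at_or_below(sizes, 20)
print(points)
print(share_small)
```

Computing one such curve per year, as in the graphs above, makes it possible to compare how the size distribution shifts over time.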


condor-admin@cs.wisc.edu