

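This graph shows the number of successful checkpoint transfers in the pool each day. Again, when the pool is running smoothly, the receives should outnumber the sends because of periodic checkpointing: the server receives a checkpoint each time a job checkpoints, but sends one only when a job restarts.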

This graph shows the number of failed checkpoint transfers in the pool each day. A send fails when the job's allocation is terminated before all bytes of the checkpoint arrive, or when a network outage occurs during the transfer. A receive fails when the job initiates a transfer but does not send a commit notification.

This graph shows the average data throughput of the checkpoint server. Because multiple transfers can be in progress at the same time, this figure is not the server's aggregate throughput.
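
The difference between an average and an aggregate throughput can be illustrated with a minimal sketch. It assumes the plotted value is the mean of per-transfer rates, which the page does not state explicitly; the `Transfer` record, both function names, and the sample values are hypothetical and are not taken from the checkpoint server's logs.

```python
from dataclasses import dataclass


@dataclass
class Transfer:
    """Hypothetical record of one checkpoint transfer (not the server's log format)."""
    start: float   # seconds since some reference time
    end: float     # seconds since the same reference time
    nbytes: int    # bytes moved by this transfer


def mean_per_transfer_throughput(transfers):
    """Average of each transfer's own rate, in bytes per second."""
    rates = [t.nbytes / (t.end - t.start) for t in transfers if t.end > t.start]
    return sum(rates) / len(rates) if rates else 0.0


def aggregate_throughput(transfers):
    """Total bytes moved divided by the wall-clock span covering all transfers."""
    if not transfers:
        return 0.0
    span = max(t.end for t in transfers) - min(t.start for t in transfers)
    total = sum(t.nbytes for t in transfers)
    return total / span if span > 0 else 0.0


# Two made-up transfers that overlap completely in time.
sample = [Transfer(0.0, 10.0, 1_000_000), Transfer(0.0, 10.0, 1_000_000)]
per_transfer = mean_per_transfer_throughput(sample)
aggregate = aggregate_throughput(sample)
```

With two fully overlapping transfers, each one sees half of what the server is moving in total, so under this assumption the per-transfer average is lower than the aggregate.
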
The following graphs compare the checkpoint totals in 10-minute intervals across the day. These graphs show daily checkpointing patterns. For example, a number of workstations are rebooted for maintenance at 4 a.m. each morning. This first produces a burst of checkpoint receives while the applications are being checkpointed, followed by a burst of checkpoint sends while the applications restart after the reboot. The checkpoint server also sends more checkpoints during work hours, because jobs are frequently started for short periods while users are away from their workstations.
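
One way such 10-minute totals could be computed is sketched below. The input format (a timestamp paired with the string "send" or "receive") and all function names are assumptions made for illustration; the page does not describe how the server's logs are actually processed.

```python
from collections import Counter
from datetime import datetime

BIN_MINUTES = 10  # width of each time-of-day interval


def bin_index(ts):
    """Index of the 10-minute interval of the day that contains ts (0 to 143)."""
    return (ts.hour * 60 + ts.minute) // BIN_MINUTES


def bin_label(index):
    """Start time of an interval as HH:MM."""
    minutes = index * BIN_MINUTES
    return f"{minutes // 60:02d}:{minutes % 60:02d}"


def totals_by_interval(events):
    """Count events per (interval, kind), pooling all days together."""
    counts = Counter()
    for ts, kind in events:
        counts[(bin_index(ts), kind)] += 1
    return counts


# Made-up events around a 4 a.m. reboot: receives first, then a send.
events = [
    (datetime(2000, 1, 1, 4, 3), "receive"),
    (datetime(2000, 1, 2, 4, 7), "receive"),
    (datetime(2000, 1, 1, 4, 12), "send"),
]
totals = totals_by_interval(events)
```

Because the interval index ignores the date, events from different days fall into the same time-of-day bin, which is what lets a recurring pattern such as the morning reboot stand out.
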
The following graphs show cumulative distributions of checkpoint sizes in our pool by year.
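
For readers unfamiliar with cumulative distributions, the following sketch shows how an empirical one can be built per year. The (year, size) record format, the function names, and the sample sizes are hypothetical; they are not the pool's data.

```python
from collections import defaultdict


def empirical_cdf(sizes):
    """Return (size, fraction of checkpoints <= size) pairs in ascending size order."""
    ordered = sorted(sizes)
    n = len(ordered)
    points = []
    for i, size in enumerate(ordered, start=1):
        if points and points[-1][0] == size:
            # Repeated size: keep a single point carrying the highest fraction.
            points[-1] = (size, i / n)
        else:
            points.append((size, i / n))
    return points


def cdf_by_year(records):
    """Group (year, size) records by year and build one CDF per year."""
    by_year = defaultdict(list)
    for year, size in records:
        by_year[year].append(size)
    return {year: empirical_cdf(sizes) for year, sizes in by_year.items()}


# Made-up checkpoint sizes in megabytes, for illustration only.
records = [(2001, 4), (2001, 16), (2001, 16), (2001, 64), (2002, 8)]
curves = cdf_by_year(records)
```

Each point on a curve gives the fraction of that year's checkpoints that are no larger than the given size, so a curve that rises further to the right indicates larger checkpoints.
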