The Grid Console

Problem

A user has many programs to run and uses a system such as Condor or Globus to distribute them to idle machines spread around the world. As these programs run, they produce output that the user wants to collect and analyze on the home machine. Because the remote machines are a motley collection with differing capabilities and owners, they do not share a distributed filesystem on which to write output. How can the output data be sent back to the user's home machine?

There are two traditional solutions to this problem:

  1. Store the files on a disk, then send the whole file back by FTP or a similar protocol.
  2. Connect the standard output directly to a TCP stream back to the home machine.

Both of these solutions are problematic.

Solution #1 does not satisfy the user who needs to see progressively updated output. Many users rely on log-like output to determine whether their applications are running correctly. In addition, the remote machine may not have enough disk space to store all of the output. Both objections are magnified for a program that is designed never to terminate.

Solution #2 makes the application dependent upon the whims of the network. We expect users to make use of hundreds of machines spread around the globe, and at any given time, some of them will be disconnected from the home machine. We want the user's programs to make forward progress even when they are disconnected from home. Although the user prefers to have continuous output, a partitioned network should not bring all work to a screeching halt.

Solution

The Grid Console (GC) is a system for getting mostly-continuous output from remote programs running on an unreliable network. On such a network, common problems include crashed machines, partitioned networks, full disks, and mysteriously missing services. The GC is robust to any of these failures. Its first priority is to keep jobs running. Its second priority is to keep the output moving when conditions permit.

The GC is implemented using Bypass and consists of two software components: an agent and a shadow. An agent intercepts some of the I/O operations performed by an application running on a remote machine. When possible, it sends that output back to a shadow process running on the home machine. Any number of remote processes may use a single shadow to record their output. A GC system looks like this when everything is working perfectly:

The shadow manages the input and output files according to the request of the agent. Of course, it does not trust any arbitrary agent that connects. If the user so desires, the processes may authenticate using Globus GSS or by simple domain names.

If the agent is unable to send its output to the shadow, it will instead write the output to the local disk.

It doesn't matter why the output operation failed -- whether it be a full disk, a refused connection, or a broken route -- the GC will keep the process running. At regular intervals, it will attempt the network connection again. If the connection succeeds, it will transfer any buffered data to the shadow, and then resume normal operation.

Of course, there are many reasons that writing to the disk might fail! The most likely is that the disk is full. However, we can imagine many other failure modes that are hard to plan for. Perhaps the temporary disk is mounted via NFS, and that server is broken as well. Perhaps the system is temporarily out of file descriptors. No matter -- the GC will pause the job and keep attempting both the disk and the network connection.

When one or the other succeeds, the GC will commit any buffered data and resume the application.
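
The fallback order described above (shadow first, then local disk, then pause and retry both) can be sketched as follows. This is only an illustration under stated assumptions, not the GC's actual code: the class name ConsoleSketch, the two callables, and the in-memory spool list standing in for the on-disk buffer are all hypothetical, and the sketch re-tries the network on every write rather than on a timer.

```python
import time


class ConsoleSketch:
    """Hypothetical model of the agent's output path; not the real GC agent."""

    def __init__(self, send_to_shadow, write_to_disk,
                 sleep=time.sleep, retry_seconds=10):
        self.send_to_shadow = send_to_shadow  # raises OSError when the network fails
        self.write_to_disk = write_to_disk    # raises OSError when the disk fails
        self.sleep = sleep
        self.retry_seconds = retry_seconds
        self.spooled = []  # stand-in for data buffered on the local disk

    def _flush_then_send(self, chunk):
        # Transfer buffered data first so the shadow sees output in order.
        while self.spooled:
            self.send_to_shadow(self.spooled[0])
            self.spooled.pop(0)
        self.send_to_shadow(chunk)

    def write(self, chunk):
        """Deliver one chunk; return "shadow" or "disk" to say where it went."""
        while True:
            try:
                self._flush_then_send(chunk)
                return "shadow"
            except OSError:
                pass  # network unavailable: fall back to the local disk
            try:
                self.write_to_disk(chunk)
                self.spooled.append(chunk)
                return "disk"
            except OSError:
                pass  # disk unavailable too: the job is effectively paused
            self.sleep(self.retry_seconds)
```

The application only ever blocks inside write when both the shadow and the disk are unavailable, which mirrors the priority stated earlier: keep the job running first, keep the output moving second.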

To set up and run the GC, the shadow must first be started on the home machine:

% grid_console_shadow -port 50000

To run the GC agent on the remote machine, set some environment variables and then execute the application:

% setenv BYPASS_SHADOW_HOST  server.cs.wisc.edu
% setenv BYPASS_SHADOW_PORT  50000
% setenv GRID_CONSOLE_STDOUT outfile.5
% setenv GRID_CONSOLE_STDERR errfile.5
% setenv GRID_CONSOLE_STDIN  infile.5
% setenv LD_PRELOAD          /path/to/grid_console_agent.so
% /path/to/application
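
When many jobs are launched from a script, the same settings can be assembled programmatically. The sketch below builds the environment for one job using the variable names shown above. The function name agent_environment and the convention of suffixing file names with a job number are assumptions inferred from the example values, not part of the GC.

```python
import os


def agent_environment(shadow_host, shadow_port, job_id, agent_path, base=None):
    """Return a copy of the environment with the GC agent settings added.

    Hypothetical helper: the outfile.N / errfile.N / infile.N naming is an
    assumption based on the example above.
    """
    env = dict(os.environ if base is None else base)
    env.update({
        "BYPASS_SHADOW_HOST": shadow_host,
        "BYPASS_SHADOW_PORT": str(shadow_port),  # environment values must be strings
        "GRID_CONSOLE_STDOUT": f"outfile.{job_id}",
        "GRID_CONSOLE_STDERR": f"errfile.{job_id}",
        "GRID_CONSOLE_STDIN": f"infile.{job_id}",
        "LD_PRELOAD": agent_path,
    })
    return env
```

The resulting dictionary could then be passed as the env argument of subprocess.run when starting /path/to/application.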

The GC agent retries a failed operation at regular intervals, up to a fixed number of times, after which it gives up and kills the process. We expect that each user will have different requirements for this failure condition. The number of times to retry and the number of seconds to wait between retries can be set with two environment variables:

% setenv GRID_CONSOLE_RETRY_LIMIT 100
% setenv GRID_CONSOLE_RETRY_TIMEOUT 10

If the retry limit is set to zero, the GC will keep retrying errors forever.
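
The interaction of the two settings, including the special meaning of zero, can be sketched as below. This is an illustration only: the function name run_with_retries is hypothetical, the sketch assumes the limit counts retries after the first failed attempt (the text does not say whether the first attempt is included), and it raises an exception where the real agent would kill the process.

```python
import time


def run_with_retries(operation, retry_limit=100, retry_timeout=10,
                     sleep=time.sleep):
    """Call operation until it succeeds; retry_limit == 0 means retry forever."""
    retries = 0
    while True:
        try:
            return operation()
        except OSError:
            # Assumption: the limit counts retries, not the initial attempt.
            if retry_limit != 0 and retries >= retry_limit:
                raise RuntimeError("retry limit reached") from None
            retries += 1
            sleep(retry_timeout)  # wait retry_timeout seconds before trying again
```

With the values shown above (a limit of 100 and a timeout of 10), a job under this model would tolerate a little over sixteen minutes of continuous failure before giving up.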

The Grid Console is distributed in the examples directory of the standard Bypass package.

