There are two traditional solutions to this problem:
Both of these solutions are problematic.
Solution #1 does not satisfy the user who needs to see progressively updated output. Many users rely on log-like output to determine if their applications are running correctly. In addition, the remote machine may not have enough disk space to store all of the output. Both of the objections are magnified for a program that is designed to never terminate.
Solution #2 makes the application dependent upon the whims of the network. We expect users will be making use of hundreds of machines spread around the globe, and at any given time, some of them will be disconnected from the home machine. We want the user's applications to keep making forward progress while they are disconnected from home. Although the user prefers to have continuous output, a partitioned network should not bring all work to a screeching halt.
The GC is implemented using Bypass and consists of two software components: an agent and a shadow. An agent intercepts some of the I/O operations performed by an application running on a remote machine. When possible, it sends that output back to a shadow process running on the home machine. Any number of remote processes may use a single shadow to record their output. A GC system looks like this when everything is working perfectly:
The shadow manages the input and output files according to the request of the agent. Of course, it does not trust any arbitrary agent that connects. If the user so desires, the processes may authenticate using Globus GSS or by simple domain names.
If the agent is unable to send its output to the shadow, it will instead write the output to the local disk.
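The fallback can be pictured with a small sketch in POSIX sh. This is an illustration only, not the agent's code: the real agent is a preloaded library that intercepts I/O operations, and `send_to_shadow`, `SHADOW_REACHABLE`, `SHADOW_FILE`, and `LOCAL_SPOOL` are hypothetical names standing in for the network write, the state of the network, and the two destinations.

```shell
#!/bin/sh
# Hypothetical sketch of the "shadow first, local disk otherwise" rule.
# None of these names come from the Grid Console itself.

# Stand-in for the network write. It succeeds only when the
# (hypothetical) variable SHADOW_REACHABLE is set to 1.
send_to_shadow() {
    [ "${SHADOW_REACHABLE:-0}" = "1" ] || return 1
    printf '%s\n' "$1" >> "$SHADOW_FILE"
}

# Write one line of output: prefer the shadow, fall back to local disk.
write_output() {
    if ! send_to_shadow "$1"; then
        printf '%s\n' "$1" >> "$LOCAL_SPOOL"
    fi
}
```

The application itself never sees the difference: in both cases the write succeeds, and only the destination of the data changes.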
To set up and run the GC, the shadow must first be started on the home machine:
% grid_console_shadow -port 50000
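Then, on the remote machine, point the agent at the shadow, name the console files, preload the agent library, and run the application:
% setenv BYPASS_SHADOW_HOST server.cs.wisc.edu
% setenv BYPASS_SHADOW_PORT 50000
% setenv GRID_CONSOLE_STDOUT outfile.5
% setenv GRID_CONSOLE_STDERR errfile.5
% setenv GRID_CONSOLE_STDIN infile.5
% setenv LD_PRELOAD /path/to/grid_console_agent.so
% /path/to/application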
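For users of Bourne-style shells, where `setenv` is not available, the same setup could be wrapped in a small function. This is a sketch under assumptions: `run_with_console` and `AGENT_LIB` are hypothetical names, the host and port are the example values used above, and naming the files by a job number is only suggested by the `.5` suffix in the example.

```shell
#!/bin/sh
# Hypothetical wrapper: run_with_console JOB_ID COMMAND [ARGS...]
# Runs COMMAND with the Grid Console variables set for that command only.
run_with_console() {
    job_id=$1
    shift
    # env sets the variables for the launched command alone, so the
    # calling shell's own environment is left untouched.
    env \
        BYPASS_SHADOW_HOST="${BYPASS_SHADOW_HOST:-server.cs.wisc.edu}" \
        BYPASS_SHADOW_PORT="${BYPASS_SHADOW_PORT:-50000}" \
        GRID_CONSOLE_STDOUT="outfile.$job_id" \
        GRID_CONSOLE_STDERR="errfile.$job_id" \
        GRID_CONSOLE_STDIN="infile.$job_id" \
        LD_PRELOAD="${AGENT_LIB:-/path/to/grid_console_agent.so}" \
        "$@"
}
```

With this in place, `run_with_console 5 /path/to/application` would correspond to the csh commands above.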
The GC agent retries a failed operation at regular intervals, up to a fixed number of times. If the operation still fails, the agent gives up and kills the process. We expect that each user will have different requirements for this failure condition, so both the number of retries and the number of seconds between retries can be set with two environment variables:
% setenv GRID_CONSOLE_RETRY_LIMIT 100
% setenv GRID_CONSOLE_RETRY_TIMEOUT 10
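The retry policy amounts to a bounded loop with a fixed pause. The POSIX sh sketch below shows that shape; it is not the agent's implementation. `retry_op` is a hypothetical helper, the fallback values 100 and 10 simply mirror the example above (the actual defaults are not stated here), and the limit is assumed to count attempts.

```shell
#!/bin/sh
# Hypothetical helper: retry_op COMMAND [ARGS...]
# Runs COMMAND until it succeeds or the attempt limit is reached.
retry_op() {
    limit=${GRID_CONSOLE_RETRY_LIMIT:-100}   # assumed fallback, from the example
    pause=${GRID_CONSOLE_RETRY_TIMEOUT:-10}  # seconds between attempts
    attempt=0
    while [ "$attempt" -lt "$limit" ]; do
        if "$@"; then
            return 0
        fi
        attempt=$((attempt + 1))
        # Pause only when another attempt remains.
        if [ "$attempt" -lt "$limit" ]; then
            sleep "$pause"
        fi
    done
    # Every attempt failed; the caller decides what giving up means.
    return 1
}
```

A caller would treat a non-zero return as the give-up case, which in the agent corresponds to killing the process.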
The Grid Console is distributed in the examples directory of the standard Bypass package.