next up previous contents index
Next: 2.10 Extending your Condor Up: 2. Users' Manual Previous: 2.8 Parallel Applications in

Subsections

  
2.9 Running MPICH jobs in Condor

In addition to PVM, Condor also supports the execution of parallel jobs that utilize MPI. Our current implementation supports the following features: However, there are some limitations to our current implementation.

  
2.9.1 Caveats

MPICH
Your MPI job must be compiled with MPICH, Argonne National Labs' implementation of MPI. Specifically, you must use the ``ch_p4'' device for MPICH. For information on MPICH, see Argonne's web page at http://www-unix.mcs.anl.gov/mpi/mpich/. Your version of MPICH must not be compiled with the path to RSH hard-coded into the library (As a result of running configure as ./configure -rsh=/path/to/your/rsh possilbly.) Condor provides a special version of rsh that it uses to start jobs.

Dedicated Resources
You must make sure that your MPICH jobs will be running on machines that will not vacate the job before the job terminates naturally. (This is a limitation of MPICH and the MPI specification.) Unlike PVM (Section 2.8), the current MPICH implementation does not support dynamic resource management. That is, processes in the virtual machine may NOT join or leave the computation at any time. If you start an MPI job with 4 nodes, for example, none of those 4 nodes can be preempted by other Condor jobs or the machine's owner.

Scheduling
We do not yet have a sophisticated scheduling algorithm in place for MPI jobs. If you set things up properly, there shouldn't be much of a problem. However, if there are several users trying to run MPI jobs on the same machines, it may be the case that no jobs will run at all and Condor's scheduling will deadlock. Writing a good scheduler for this environment is high on the priority list for Condor version 6.5.

``New'' shadow and starter
We have been developing new versions of the condor_shadow and the condor_starter. You have to use these new versions to run MPI jobs. For information on obtaining these binaries, see below.

Shared File System
The machines where you want your MPI job to run must have a shared file system. There is no remote I/O for our MPI support like there is for our Standard Universe jobs.

Condor Version 6.1.15+
You must be running this version of the Condor distribution (or greater) in order to use this contrib module.

  
2.9.2 Getting the Binaries

There is now an MPI ``contrib'' module available with Condor. It can be found in the contrib section of the downloads. When you un-tar the tarfile, there will be three files:

The last item is named rsh, but it is not the rsh utility you're familiar with -- it's a wrapper that is required for our implementation to function correctly. These three binaries should go in Condor's sbin directory, where many other files like them reside.

  
2.9.3 Configuring Condor

Now that you've got the necessary binaries, you'll have to configure Condor to use MPI. Insert the following lines in the main condor_config file:

ALTERNATE_STARTER_2	= $(SBIN)/condor_starter.v61
STARTER_2_IS_DC		= TRUE
MPI_CONDOR_RSH_PATH	= $(SBIN)
SHADOW_MPI		= $(SBIN)/condor_shadow.v61
Reconfigure your pool by typing
condor_reconfig `condor_status -m`
The -m argument tells condor_status to return just the names of all the running condor_master daemons in your pool. Note that you have to do this from a machine with administrator privileges.

  
2.9.4 Managing Dedicated Machines

There are several ways that you can set up a pool to run MPI jobs without interruption. We will cover two methods that will work, although more sophisticated solutions are possible. Familiarity with Startd policy configuration (Section 3.6) is necessary to understand the following examples.

For the first example, let's assume that you have a cluster of machines which do not have regular users on them. Let's also assume that these machines are solely dedicated to the use of Condor. The simplest way to set up your policy is as follows:

START       = TRUE
CONTINUE    = TRUE
SUSPEND     = FALSE
PREEMPT     = FALSE
KILL        = FALSE

With the above configuration, the machines will accept any Condor job, and the jobs will never be suspended, preempted, or killed. You will never have to worry about an MPI job (or any job, for that matter) being evicted from the machines.

For a more complex example, let us assume you have machines with sophisticated policies already in place, and you'd like the machines to manage MPI jobs differently. The following macros (which should be specified near other Startd policy support macros) allow you to accomplish the task easily.

MPI	  = 8
IsMPI = (JobUniverse == $(MPI))
Now change your configuration from
START	= /* your interesting policy here */
to
FORMER_START	= /* your interesting policy here */
Similarly, the CONTINUE    , SUSPEND    , PREEMPT    , and KILL     expressions should be changed to macros named FORMER_CONTINUE, etc. The following configuration will ensure that MPI jobs are never suspended or evicted while implementing your former policy for all other jobs.
START		= ( $(FORMER_START) )
CONTINUE	= ( $(FORMER_CONTINUE) )
SUSPEND		= ( $(FORMER_SUSPEND) && ((IsMPI) == FALSE ) )
PREEMPT		= ( $(FORMER_PREEMPT) && ((IsMPI) == FALSE ) )
KILL		= ( $(FORMER_KILL) && ((IsMPI) == FALSE ) )
Thus, Condor will never attempt to vacate an MPI job from a machine once it starts running on that machine. Some machine owners may not like this setup, so you may need to customize your configuration to suit your needs. The most important point to remember when creating your Startd policy is that MPI jobs are immediately killed if one or more nodes of the job leave the computation.

  
2.9.5 Submitting to Condor

Here is a minimal submit file to submit an MPI job to Condor. For more information on writing submit files, see Section 2.5.1.

universe = MPI
executable = your_mpi_program
machine_count = 4
queue

This tells Condor to start the executable named your_mpi_program on four machines. These four machines will be of the same architechture and operating system as the submitting machine. Note the universe = MPI line tells Condor that an MPICH job is being submitted.

Now let's try a more sophisticated submit file:

###################################################################
## submitfile                                                    ##
###################################################################
universe = MPI
executable = simplempi
log = logfile
input = infile.$(NODE)
output = outfile.$(NODE)
error = errfile.$(NODE)
machine_count = 4
queue

Notice the $(NODE)   macro, which is expanded when the job starts so that it becomes equivalent to the MPI ``id'' of the MPICH job. The first process started becomes ``0'', the second is ``1'', etc. For example, let's say I prepared four input files, named infile.0 through infile.3:

infile.0: 
Hello number zero.

infile.1: 
Hello number one.
etc. I then created a simple MPI job, named simplempi.c
/******************************************************************
 * simplempi.c
 ******************************************************************/
#include <stdio.h>
#include "mpi.h"

int main(argc,argv)
    int argc;
    char *argv[];
{
    int myid;
    char line[128];

    MPI_Init(&argc,&argv);
    MPI_Comm_rank(MPI_COMM_WORLD,&myid);

    fprintf ( stdout, "Printing to stdout...%d\n", myid );
    fprintf ( stderr, "Printing to stderr...%d\n", myid );
    fgets ( line, 128, stdin );
    fprintf ( stdout, "From stdin: %s", line );

    MPI_Finalize();
    return 0;
}
And to complete the demonstration, here's the Makefile:
###################################################################
## This is a very basic Makefile                                 ##
###################################################################

# Change this part to your mpicc, obviously....
CC          = /usr/local/bin/mpicc
CLINKER     = $(CC)

CFLAGS    = -g
EXECS     = simplempi

all: $(EXECS)

simplempi: simplempi.o
        $(CLINKER) -o simplempi simplempi.o -lm

.c.o:
        $(CC) $(CFLAGS) -c $*.c

Once simplempi is built, use condor_submit to submit your job. This job should finish pretty quickly once it finds machines to run on, and the results will be what you expect: 8 files will be created: errfile.[0-3] and outfile.[0-3]. For example, outfile.0 will contain

Printing to stdout...0
From stdin: Hello number zero.
and errfile.0 will contain
Printing to stderr...0

Of course, individual tasks may open other files; this example was constructed to demonstrate the $(NODE) feature and the setup of the expected stdin, stdout, and stderr files in the MPI universe.


next up previous contents index
Next: 2.10 Extending your Condor Up: 2. Users' Manual Previous: 2.8 Parallel Applications in
condor-admin@cs.wisc.edu