There is now an MPI ``contrib'' module available with Condor. It can be found in the contrib section of the downloads. When you un-tar the tarfile, there will be three files:
The last item is named rsh, but it is not the rsh utility you're familiar with -- it's a wrapper that is required for our implementation to function correctly. These three binaries should go in Condor's sbin directory, where many other files like them reside.
Now that you've got the necessary binaries, you'll have to configure Condor to use MPI. Insert the following lines in the main condor_config file:
ALTERNATE_STARTER_2 = $(SBIN)/condor_starter.v61 STARTER_2_IS_DC = TRUE MPI_CONDOR_RSH_PATH = $(SBIN) SHADOW_MPI = $(SBIN)/condor_shadow.v61Reconfigure your pool by typing
condor_reconfig `condor_status -m`The -m argument tells condor_status to return just the names of all the running condor_master daemons in your pool. Note that you have to do this from a machine with administrator privileges.
There are several ways that you can set up a pool to run MPI jobs without interruption. We will cover two methods that will work, although more sophisticated solutions are possible. Familiarity with Startd policy configuration (Section 3.6) is necessary to understand the following examples.
For the first example, let's assume that you have a cluster of machines which do not have regular users on them. Let's also assume that these machines are solely dedicated to the use of Condor. The simplest way to set up your policy is as follows:
START = TRUE CONTINUE = TRUE SUSPEND = FALSE PREEMPT = FALSE KILL = FALSE
With the above configuration, the machines will accept any Condor job, and the jobs will never be suspended, preempted, or killed. You will never have to worry about an MPI job (or any job, for that matter) being evicted from the machines.
For a more complex example, let us assume you have machines with sophisticated policies already in place, and you'd like the machines to manage MPI jobs differently. The following macros (which should be specified near other Startd policy support macros) allow you to accomplish the task easily.
MPI = 8 IsMPI = (JobUniverse == $(MPI))Now change your configuration from
START = /* your interesting policy here */to
FORMER_START = /* your interesting policy here */Similarly, the CONTINUE , SUSPEND , PREEMPT , and KILL expressions should be changed to macros named FORMER_CONTINUE, etc. The following configuration will ensure that MPI jobs are never suspended or evicted while implementing your former policy for all other jobs.
START = ( $(FORMER_START) ) CONTINUE = ( $(FORMER_CONTINUE) ) SUSPEND = ( $(FORMER_SUSPEND) && ((IsMPI) == FALSE ) ) PREEMPT = ( $(FORMER_PREEMPT) && ((IsMPI) == FALSE ) ) KILL = ( $(FORMER_KILL) && ((IsMPI) == FALSE ) )Thus, Condor will never attempt to vacate an MPI job from a machine once it starts running on that machine. Some machine owners may not like this setup, so you may need to customize your configuration to suit your needs. The most important point to remember when creating your Startd policy is that MPI jobs are immediately killed if one or more nodes of the job leave the computation.
Here is a minimal submit file to submit an MPI job to Condor. For more information on writing submit files, see Section 2.5.1.
universe = MPI executable = your_mpi_program machine_count = 4 queue
This tells Condor to start the executable named your_mpi_program
on four machines. These four machines will be of the same architechture
and operating system as the submitting machine. Note the
universe = MPI line tells Condor that an MPICH job is being submitted.
Now let's try a more sophisticated submit file:
################################################################### ## submitfile ## ################################################################### universe = MPI executable = simplempi log = logfile input = infile.$(NODE) output = outfile.$(NODE) error = errfile.$(NODE) machine_count = 4 queue
Notice the $(NODE) macro, which is expanded when the job starts so that it becomes equivalent to the MPI ``id'' of the MPICH job. The first process started becomes ``0'', the second is ``1'', etc. For example, let's say I prepared four input files, named infile.0 through infile.3:
infile.0: Hello number zero. infile.1: Hello number one.etc. I then created a simple MPI job, named simplempi.c
/******************************************************************
* simplempi.c
******************************************************************/
#include <stdio.h>
#include "mpi.h"
int main(argc,argv)
int argc;
char *argv[];
{
int myid;
char line[128];
MPI_Init(&argc,&argv);
MPI_Comm_rank(MPI_COMM_WORLD,&myid);
fprintf ( stdout, "Printing to stdout...%d\n", myid );
fprintf ( stderr, "Printing to stderr...%d\n", myid );
fgets ( line, 128, stdin );
fprintf ( stdout, "From stdin: %s", line );
MPI_Finalize();
return 0;
}
And to complete the demonstration, here's the Makefile:
###################################################################
## This is a very basic Makefile ##
###################################################################
# Change this part to your mpicc, obviously....
CC = /usr/local/bin/mpicc
CLINKER = $(CC)
CFLAGS = -g
EXECS = simplempi
all: $(EXECS)
simplempi: simplempi.o
$(CLINKER) -o simplempi simplempi.o -lm
.c.o:
$(CC) $(CFLAGS) -c $*.c
Once simplempi is built, use condor_submit to submit your job. This job should finish pretty quickly once it finds machines to run on, and the results will be what you expect: 8 files will be created: errfile.[0-3] and outfile.[0-3]. For example, outfile.0 will contain
Printing to stdout...0 From stdin: Hello number zero.and errfile.0 will contain
Printing to stderr...0
Of course, individual tasks may open other files; this example was constructed to demonstrate the $(NODE) feature and the setup of the expected stdin, stdout, and stderr files in the MPI universe.