next up previous contents index
Next: 2.11 Parallel Applications Up: 2. Users' Manual Previous: 2.9 PVM Applications   Contents   Index

Subsections


2.10 MPI Applications

MPI stands for Message Passing Interface. It provides an environment under which parallel programs may synchronize, by providing communication support. Running the MPI-based parallel programs within Condor eases the programmer's effort. Condor dedicates machines for running the programs, and it does so using the same interface used when submitting non-MPI jobs.

The MPI universe in Condor currently supports MPICH versions 1.2.2, 1.2.3, and 1.2.4 using the ch_p4 device. The MPI universe does not support MPICH version 1.2.5. These supported implementations are offered by Argonne National Labs without charge by download. See the web page at http://www-unix.mcs.anl.gov/mpi/mpich/ for details and availability. Programs to be submitted for execution under Condor will have been compiled using mpicc. No further compilation or linking is necessary to run jobs under Condor.

The Parallel universe 2.11 is now the preferred way to run MPI jobs. Support for the MPI universe will be removed from Condor at a future date.


2.10.1 MPI Details of Set Up

Administratively, Condor must be configured such that resources (machines) running MPI jobs are dedicated. Dedicated machines never vacate their running condor jobs should the machine's interactive owner return. Once the dedicated scheduler claims a dedicated machine for use, it will try to use that machine to satisfy the requirements of the queue of MPI jobs.

Since Condor is not ordinarily used in this manner (Condor uses opportunistic scheduling), machines that are to be used as dedicated resources must be configured as such. Section 3.13.8 of Administrator's Manual describes the necessary configuration and provides detailed examples.

To simplify the dedicated scheduling of resources, a single machine becomes the scheduler of dedicated resources. This leads to a further restriction that jobs submitted to execute under the MPI universe (with dedicated machines) must be submitted from the machine running as the dedicated scheduler.


2.10.2 MPI Job Submission

Once the programs are written and compiled, and Condor resources are correctly configured, jobs may be submitted. Each Condor job requires a submit description file. The simplest submit description file for an MPI job:

#############################################
##   submit description file for mpi_program
#############################################
universe = MPI
executable = mpi_program
machine_count = 4
queue

This job specifies the universe as mpi, letting Condor know that dedicated resources will be required. The machine_count command identifies the number of machines required by the job. The four machines that run the program will default to be of the same architecture and operating system as the machine on which the job is submitted, since a platform is not specified as a requirement.

The simplest example does not specify an input or output, meaning that the computation completed is useless, since both input comes from and the output goes to /dev/null. A more complex example of a submit description file utilizes other features.

######################################
## MPI example submit description file
######################################
universe = MPI
executable = simplempi
log = logfile
input = infile.$(NODE)
output = outfile.$(NODE)
error = errfile.$(NODE)
machine_count = 4
queue

The specification of the input, output, and error files utilize a predefined macro that is only relevant to mpi universe jobs. See the condor_ submit manual page on page [*] for further description of predefined macros. The $(NODE) macro is given a unique value as programs are assigned to machines. This value is what the MPICH version ch_p4 implementation terms the rank of a program. Note that this term is unrelated and independent of the Condor term rank. The $(NODE) value is fixed for the entire length of the job. It can therefore be used to identify individual aspects of the computation. In this example, it is used to give unique names to input and output files.

If your site does NOT have a shared file system across all the nodes where your MPI computation will execute, you can use Condor's file transfer mechanism. You can find out more details about these settings by reading the condor_ submit man page or section 2.5.4 on page [*]. Assuming your job only reads input from STDIN, here is an example submit file for a site without a shared file system:

######################################
## MPI example submit description file
## without using a shared file system
######################################
universe = MPI
executable = simplempi
log = logfile
input = infile.$(NODE)
output = outfile.$(NODE)
error = errfile.$(NODE)
machine_count = 4
should_transfer_files = yes
when_to_transfer_output = on_exit
queue

Consider the following C program that uses this example submit description file.

/**************
 * simplempi.c
 **************/
#include <stdio.h>
#include "mpi.h"

int main(argc,argv)
    int argc;
    char *argv[];
{
    int myid;
    char line[128];

    MPI_Init(&argc,&argv);
    MPI_Comm_rank(MPI_COMM_WORLD,&myid);

    fprintf ( stdout, "Printing to stdout...%d\n", myid );
    fprintf ( stderr, "Printing to stderr...%d\n", myid );
    fgets ( line, 128, stdin );
    fprintf ( stdout, "From stdin: %s", line );

    MPI_Finalize();
    return 0;
}

Here is a makefile that works with the example. It would build the MPI executable, using the MPICH version ch_p4 implementation.

###################################################################
## This is a very basic Makefile                                 ##
###################################################################

# the location of the MPICH compiler
CC          = /usr/local/bin/mpicc
CLINKER     = $(CC)

CFLAGS    = -g
EXECS     = simplempi

all: $(EXECS)

simplempi: simplempi.o
        $(CLINKER) -o simplempi simplempi.o -lm

.c.o:
        $(CC) $(CFLAGS) -c $*.c

The submission to Condor requires exactly four machines, and queues four programs. Each of these programs requires an input file (correctly named) and produces an output file.

If input file for $(NODE) = 0 (called infile.0) contains

Hello number zero.
and the input file for $(NODE) = 1 (called infile.1) contains
Hello number one.
then after the job is submitted to Condor, there will be eight files created: errfile.[0-3] and outfile.[0-3]. outfile.0 will contain
Printing to stdout...0
From stdin: Hello number zero.
and errfile.0 will contain
Printing to stderr...0

Different nodes for an MPI job can have different machine requirements. For example, often the first node, sometimes called the head node, needs to run on a specific machine. This can be also useful for debugging. Condor accomodates this by supporting multiple queue statements in the submit file, much like with the other universes. For example:

######################################
## MPI example submit description file
## with multiple procs
######################################
universe = MPI
executable = simplempi
log = logfile
input = infile.$(NODE)
output = outfile.$(NODE)
error = errfile.$(NODE)
machine_count = 1
should_transfer_files = yes
when_to_transfer_output = on_exit
requirements = ( machine == "machine1")
queue

requirements = ( machine =!= "machine1")
machine_count = 3
queue

The dedicated scheduler will allocate four machines (nodes) total in two procs for this job. The first proc has one node, (rank 0 in MPI terms) and will run on the machine named machine1. The other three nodes, in the second proc, will run on other machines. Like in the other condor universes, the second requirements command overwrites the first, but the other commands are inherited from the first proc.

When submitting jobs with multiple requirements, it is best to write the requirements to be mutually exclusive, or to have the most selective requirement first in the submit file. This is because the scheduler tries to match jobs to machine in submit file order. If the requirements are not mutually exclusive, it can happen that the scheduler may unable to schedule the job, even if all needed resources are available.


next up previous contents index
Next: 2.11 Parallel Applications Up: 2. Users' Manual Previous: 2.9 PVM Applications   Contents   Index
condor-admin@cs.wisc.edu