How to Install and Use MPI Stack

OpenMP has several strong points: it is a very simple system to use and it is widely available in compilers for most major platforms. There are however, other methods to express parallelism in your code. On distributed parallel systems, like Linux clusters, the Message Passing Interface (MPI) is widely used. MPI is not a programming language, but rather a standard library that is used to send messages between multiple processes. These processes can be located on the same system (a single multi-core SMP system) or on a collection of distributed servers. Unlike OpenMP, the distributed nature of MPI allows it to work on almost any parallel environment.

On a multi-core system a single shared address space is the basis for a shared memory programming model. OpenMP makes relatively easy work of this environment. Sharing nothing between “threads”, or independent processes running on possibly independent machines, is typically called a shared nothing programming model, or more succinctly, a distributed programming model. There is nothing that precludes a shared nothing or distributed program from running on a shared memory system.

Distributed Programming Models

Distributed memory memory models assume that every CPU has access to its own memory space, and can alter its own local variables. To alter memory or variables or another/remote process, you need to send that process a message.

Again, the remote process can be on the same machine or on another machine and has to agree to accept the message. In essence, distributed memory programming requires copying memory from one memory space to another, sending and receiving messages. Unlike the shared memory model, you the programmer, have to consciously move data between “threads” or processes. The compiler will not handle this for you as it does in OpenMP. This requirement often makes MPI programming a bit more difficult than using OpenMP.

In addition to using physically remote machines, MPI has the additional “advantage” of forcing you divide your memory use between processes. In essence, you have greater control over locality. On modern CPUs with multiple cores, caches and memory hierarchies, you want to enable as much data re-use as possible. The distributed model forces you to think about that, as you have to decide what to move, and when to move it.

The rest of this article will focus upon distributed memory programming using MPI (Message Passing Interface) on a single multi-core system. We will provide some basic examples similar to the OpenMP tutorial.


  1. Source Code to parallelize using MPI
  2. One multi-core (or cluster) computer
  3. A few hundred megabytes of disk space (for compiler)
  4. A few gigabytes of ram (for program examples)
  5. An editor (pick your favorite)
  6. A compiler
  7. An MPI stack

The preparation time is a little more than an hour before programming if you have to build your own compiler and MPI packages. A few minutes if you can install pre-built compiler and MPI packages.

If you have already checked your cores, memory size, file space, and installed the GNU compilers from the previous tutorials, then you can skip to the Installing Open MPIsection.

Preliminary Configuration

We are going to assume a Linux based machine with a few cores. I am using a Pegasus many-core workstation. To see how many cores you have in your unit, try this:

grep 'processor.*:'  /proc/cpuinfo  | wc -l

The Pegasus system I’m using reports 8 CPUs. We need to see how much disk space we have to work with. This is fairly easy to do. The df command is your tool of choice. A quick df -h will tell us what we have available, though it might not tell us if we can use it.

df -h `pwd`
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda3              62G   29G   33G  48% /

This tells us that our home directory has as much as 33GB available. So let’s make ourselves a workspace, and continue.

mkdir ~/workspace
cd ~/workspace

We would like to know how much physical RAM we have, so try this:

grep "MemTotal:" /proc/meminfo

The Pegasus machine has 8184008 kB, or about 8 GB ram. For my editor, I may use Vim or pico. Any text editor should work.

We need a compiler. The GNU Foundation’s GNU Compiler Collection has excellent free compilers, and commercial compilers from Intel, Portland Group, and others available for your use, and they all support building and using MPI stacks. This may represent a dilemma for you: why should you purchase the other compilers if you can get GCC for free? In short, the other compilers do offer more in the way of optimization, support, and added elements of value.

If you can get gcc packages for your distribution, it is usually easiest to have those installed. Unfortunately, your system administrators may not let you modify the machines , so you might be forced to install the packages yourself into your own directories. Luckily, it is not hard. To build GCC, you need GMP and MPFR. Get GMP, build it and install it as follows:

cd ~
wget http://ftp.sunet.se/pub/gnu/gmp/gmp-4.2.2.tar.bz2
cd workspace
tar -xjvf /home/joe/gmp-4.2.2.tar.bz2
cd gmp-4.2.2/
./configure 0x2013prefix=/home/joe/local
make -j4
make install

Get MPFR, built it, and install it:

cd ~
wget http://www.mpfr.org/mpfr-current/mpfr-2.3.0.tar.bz2
cd workspace
tar -xjvf /home/joe/mpfr-2.3.0.tar.bz2
cd mpfr-2.3.0
./configure --prefix=/home/joe/local --with-gmp=/home/joe/local
make -j4
make install

Next, get GCC, uncompress, configure, and build it as follows:

cd ~
wget ftp://ftp.gnu.org/gnu/gcc/gcc-4.2.2/gcc-4.2.2.tar.bz2
cd workspace
tar -xjf /home/joe/gcc-4.2.2.tar.bz2
mkdir build
cd build
../gcc-4.2.2/configure --prefix=/home/joe/local         

time make -j8 bootstrap
make install

Compilation required 12.25 minutes on 8 processors. It would take about eight times longer on one processor, which is why we used the -j8 switch on the Makefile (parallel build).

Now to use this compiler, we need to set an environment variable. This setting enables the linker to find the libraries.

export LD_LIBRARY_PATH=/home/joe/local/lib64/lib

The compiler itself is located in /home/joe/bin/gcc which can be set in the CC variable in the Makefile. Of course your path will be different.

One more thing is needed to correct a bug in the installation for 64 bit platforms. (It shouldn’t be needed on 32 bit platforms.)

ln -s /home/joe/local/lib64/libgomp.spec  

Without this, we might get errors that look like the following on 64 bit platforms:

gcc: libgomp.spec: No such file or directory

As a sanity check, lets compile a simple test case. Here’s the classic “hello world” C program.

#include "stdio.h"

int main(int argc, char *argv[])
  printf("hello MPI user!n");

We will call this hello.c and place it in our home directory. To verify, let’s compile and run it:

/home/joe/local/bin/gcc hello.c -o hello.exe

If everything worked, when you run it you should see:

hello MPI user!

Our new Fortran compiler should also work. Here is a quick translation of the hello.c code into fortran:

program hello

print *,"hello Fortran MPI user!"

And a quick compilation and execution:

/home/joe/local/bin/gfortran hello.f 
        -o hello-fortran.exe

If everything worked you should see something like the following when you run this code:

hello Fortran MPI user!

To run the serial version of the code, first make the executable (make -f Makefile.rzf) type and run the executable as follows:

./rzf.exe -l INFINITY -n n

Where INFINITY is an integer value (e.g. 1000000) of how many terms you would like to use for your sum, and n is the argument to the Riemann Zeta Function. If you use 2 for n, then it will calculate 0x03C0 for you as well. The sum is basically a loop that looks like the following:

sum     = 0.0;
for( k=1 ; k<=inf ; k++)
  sum += pow(k,(double)n);

As it turns out, for numerical stability (roundoff error accumulation) reasons, you run the sum in the other order (from inf to 1), which we do in the code.

Unlike OpenMP, we cannot think of simply parallelizing this at a loop level. More about this in a moment. This loop is an example of something called a reduction operation. Specifically, the loop is a sum reduction. A reduction operation occurs (in a technical sense) when you reduce the rank or number of dimensions of something. In this case, imagine that we have a big long string (or vector) of numbers.

[1-2, 2-2, 3-2, ... , inf-2]

We are going to reduce these numbers to a single number by summing (taking a dot product with them [1, 1, 1, …, 1]). First things first, how does the code perform, and specifically, where is the slow area? Lets run it and find out.

time ./rzf.exe -n 2 -l 1000000000
D: checking arguments: N_args=5
D: arg[0] = ./rzf.exe
D: arg[1] = -n
D: N found to be = 2
D: should be 2
D: arg[2] = 2
D: arg[3] = -l
D: infinity found to be = 2
D: should be 1000000000
D: arg[4] = 1000000000
D: running on machine = pegasus-i
zeta(2)  = 1.644934065848226
pi = 3.141592652634863
error in pi = 0.000000000954930
relative error in pi = 0.000000000303964
Milestone 0 to 1: time = 0.000s
Milestone 1 to 2: time = 50.946s

real    0m50.949s
user    0m50.911s
sys     0m0.040s

Using -n 2 and -l 1000000000, the loop took about 51 seconds to execute. The loop is the most time consuming portion of this program. It makes sense to pay attention to it, and see what we can do with it.

The wallclock and user times are important as well. In this case, the wallclock reports being 3 milliseconds more than the loop time, and the user CPU time reports being a little less than the wallclock. The sum of user and sys times does in fact, pretty nearly equal the wallclock time. This result is good.

It is important to note that as you scale up your parallel application, this time sum should remain very nearly constant. This goal may sound counter-intuitive, but it means that you are effectively partitioning work among the N COU/Cores you are using. If this sum increases significantly with increasing processor count, you may not be operating as effectively as you could be, and your code won’t scale.

Can we get any benefit out of running this in parallel? And how hard is it? Well, lets add in the MPI bits as we did before, and then think about how to run this loop in parallel. First make a copy of the program and Makefile (Note: the source code contains the program and Makefile with the changes we are making), and rename them rzf_2.c and Makefile.rzf_2, adding in the MPI elements. The rzf_2.c is more complex, and requires some explanation.

Remember, the loop is the time expensive portion of this code, so after setting up the MPI processes, wouldn’t it be nice to partition or decompose the loop such that the calculations occurred in parallel? Imagine that our loop had an “outer loop” around it, with some additional helper bits. The “outer loop” sets the lower and upper bounds of the index for the inner loop.

    loop_min    = 1 +  (int)((tid + 0)  *  (inf-1)/NCPUs);
    loop_max    =      (int)((tid + 1)  *  (inf-1)/NCPUs);

          sum += 1.0/pow(i,(double)n);

The “outer loop” is indexed by the value of tid. Imagine (though you don’t see it there) that you have:

for(tid=0; tid < NCPU; tid++)
   /*  the above inner loop */

This is effectively what MPI provides you, the caveat being that MPI is running these loop bodies in parallel (simultaneously) on different processes, and you cannot trivially share data between loops (tid) without explicitly moving it.

Before we continue, look at the construction of loop_min and loop_max. Note that we have something that looks like tid * inf, and for large enough inf and tid we will overflow the integer maximum (2^31-1). Specifically, lets use Octave to investigate (users with Matlab should be able to perform the same calculations without changing the code):

octave:1> inf=1000000000
inf =  1.0000e+09
octave:2> tid=[0:7]
tid =

   0   1   2   3   4   5   6   7

octave:3> temp = tid .* (inf - 1)
temp =

 Columns 1 through 6:

   0.0000e+00   1.0000e+09   2.0000e+09   3.0000e+09   4.0000e+09   5.0000e+09

 Columns 7 and 8:

   6.0000e+09   7.0000e+09

octave:4> dec2bin([temp,2^31-1])
ans =


The second bit in from the left is the sign bit for 32 bit integers. If it is “1,” then the quantity is negative (and the 32 bit integer will happily throw away the bits above it). That is, you will get a wrap-around effect. Which you will see reflected in the index calculation, when some come out negative.

This is an issue in that you want to perform the multiplication before the division, so as not to lose bits of precision. What you need to do is to operate in a higher precision, in order to do this. The rzf_3.c code has been modified for this reason, so we use tis version so you can see the difference between this and the naive rzf_2.c version.

When we run this version, each process sets its own value of sum based upon the calculation over the index range it had. These are partial sums. We have to further reduce them to a sum. Hence we will use a sum reduction. First, it is worth your time to relabel the sum in the inner loop as p_sum to reflect its nature as a partial sum. Second, we have to add an explicit sum reduction call to move the data back to the main process. It turns out this is fairly easy. We add in a line after the summation finishes, that looks like this:


This calls the MPI_Reduce function, taking values from every processes double precision p_sum variable, and combining them using sum reduction into the tid == 0 variable named sum.

That’s our message passing. Not hard at all, but you still have to think about it.

Now, recall that the sequential loop took about 50.95 seconds. For 8 processors, if the work is evenly divided among processes, we should get about 50.95/8 second run time for the loop. We estimate that this would be 6.37 seconds to traverse this loop. Don’t forget to make the program ( make -f Makefile.rzf_3.) The output will look something like this (though the output has been trimmed for space.)

time /home/joe/local/bin/mpirun -np 8 -hostfile hostfile ./rzf_3.exe -n 2 -l 1000000000
D[tid=0]: checking arguments: N_args=5
D[tid=0]: arg[0] = ./rzf_3.exe
D[tid=0]: NCPUs found to be = 8
D[tid=0]: arg[1] = -n
D[tid=0]: n found to be = 2
D[tid=0]: should be 2
D[tid=0]: NCPUs found to be = 8
D[tid=0]: arg[2] = 2
D[tid=0]: NCPUs found to be = 8
D[tid=0]: arg[3] = -l
D[tid=0]: infinity found to be = 2
D[tid=0]: should be 1000000000
D[tid=0]: NCPUs found to be = 8
D[tid=0]: arg[4] = 1000000000
D[tid=0]: NCPUs found to be = 8
D[tid=0]: running on machine = pegasus-i
D[tid=0]: loop_max=124999999, loop_min=1
D[tid=1]: checking arguments: N_args=5
D[tid=1]: arg[0] = ./rzf_3.exe
D[tid=1]: NCPUs found to be = 8
D[tid=1]: arg[1] = -n
D[tid=1]: n found to be = 2
D[tid=1]: should be 2
D[tid=1]: NCPUs found to be = 8
D[tid=1]: arg[2] = 2
D[tid=1]: NCPUs found to be = 8
D[tid=1]: arg[3] = -l
D[tid=1]: infinity found to be = 2
D[tid=1]: should be 1000000000
D[tid=1]: NCPUs found to be = 8
D[tid=3]: checking arguments: N_args=5
D[tid=3]: arg[0] = ./rzf_3.exe
D[tid=3]: NCPUs found to be = 8
D[tid=3]: arg[1] = -n
D[tid=3]: n found to be = 2
D[tid=3]: should be 2
D[tid=3]: NCPUs found to be = 8
D[tid=3]: arg[2] = 2
D[tid=3]: NCPUs found to be = 8
D[tid=7]: NCPUs found to be = 8
D[tid=7]: arg[2] = 2
D[tid=7]: NCPUs found to be = 8
D[tid=7]: arg[3] = -l
D[tid=7]: infinity found to be = 2
D[tid=7]: should be 1000000000
D[tid=7]: NCPUs found to be = 8
D[tid=7]: arg[4] = 1000000000
D[tid=7]: NCPUs found to be = 8
D[tid=7]: running on machine = pegasus-i
D[tid=7]: loop_max=999999999, loop_min=875000000
[tid=5] Milestone 1 to 2: time = 6.335s
[tid=1] Milestone 1 to 2: time = 6.336s
[tid=2] Milestone 1 to 2: time = 6.341s
[tid=7] Milestone 1 to 2: time = 6.337s
[tid=4] Milestone 1 to 2: time = 6.344s
[tid=6] Milestone 1 to 2: time = 6.349s
[tid=3] Milestone 1 to 2: time = 6.421s
zeta(2)  = 1.644934065848226
pi = 3.141592652634863
error in pi = 0.000000000954930
relative error in pi = 0.000000000303964
[tid=0] Milestone 0 to 1: time = 0.000s
[tid=0] Milestone 1 to 2: time = 6.421s

real    0m8.520s
user    0m0.008s
sys     0m0.008s

What we measure is 6.33-6.41 seconds, quite similar to the 6.35 we predicted. Notice that the estimated error is different than in the single core run. This result leads into a whole other discussion, which we will talk about very briefly below. The point is that our code, or at least that loop, is now about 8 times faster on 8 CPUs than on one CPU. The result is quite good. Especially considering how little work this required.

Notice also that the user time is about the same as a single processes user time. Slightly more in fact, and this is due to the cost to set up and launch processes on this system. About 2 seconds appear to be needed to launch 8 processes on this system, averaging about 0.25 seconds per process. This time is expensive relative to the calculation time for this particular example, but in general, the launch time should be a small fraction of the overall run time.

Very briefly, the way roundoff error accumulates in these calculations does have an impact upon the final results. Running this sum in the other direction, which 99+% of people would normally do, accumulates error in a different order, leading to a different result. That is, when you change the order of your floating point computations, you will change the roundoff error accumulation. It it possible to observe a catastrophic loss of accuracy, impacting first and second digits of a summation like this, by not being aware of these issues.

Application: Matrix Multiply

We now have had a fairly “trivial” set of examples, and we have used up about one half of our 30 minutes. The hello case is great to use as a sanity check, specifically something to use to verify that MPI is installed and operational. The rzf code is an example of an effectively embarrassingly parallel code, there is no communication between parallel processes until the very last iteration where a sum reduction is performed.

A more complicated problem is matrix multiplication. The core algorithm in the matmul code is the triply nested for loop. You can see what happens when we run it with reasonable values of DIM. Use make -f Makefile.matmul to make the program for yourself. You can easily change the value of DIM by using matmul.exe -n DIM where DIM is the size of the array. If no arguments are given, then the default DIM is 4000.

Table One: Matrix size vs run time

N (matrix dimension) T(N) in seconds
100 0.003
200 0.023
400 0.553
1000 14.46
1500 57.46
2000 137.1
4000 1107

A 4000 x 4000 matrix requires 16 million (16Mi) elements. Each element is 8 bytes (double precision floating point). This means we have each array being 128Mi bytes, or 122.07 MB (remember a MB is different than an MiB).

Looking over the calipers for the 4000×4000 run (enter ./matmul.exe -n 4000), we see:

milestone 0 to 1 time=0.337 seconds
milestone 1 to 2 time=0.819 seconds
milestone 2 to 3 time=1.004 seconds
milestone 3 to 4 time=1106.836 seconds
milestone 4 to 5 time=0.004 seconds

This suggests that the matrix fill time (calipers 1 to 2) and the normalization time (calipers 2 to 3) are under 0.1% of the execution time. So we should focus our effort upon the matrix multiplication loops.

Now that we have established what takes the most time, how do we parallelize this? As it turns out, we use similar methods to what was used before. The main multiplication loop looks like this:

/* matrix multiply
    *   c[i][j]= a_row[i] dot b_col[j]  for all i,j
    *           a_row[i] -< a[i][0 .. DIM-1]
    *            b_col[j] -< b[0 .. DIM-1][j]
           dot += a[i][k]*b[k][j];


You will likely notice that the interior loop is a sum reduction. It is basically a dot product between two vectors. That inner loop is executed N2 times. This loop is not the appropriate place to perform our decomposition.

In general, for most parallelization efforts, you want to enclose the maximum amount of work within the parallel region. This goal suggests you really want to put the directives outside the outer most loop, or as high up in the loop hierarchy as possible.

So we use a similar technique to what was used in rzf.c. Precompute the loop indexes. We will compute “panels0x201D, that is, outer loop index ranges, of the c matrix on each CPU.

    loop_min    =(int)((long)(tid + 0) *
    loop_max    =(int)((long)(tid + 1) *

           dot += a[i][k]*b[k][j];


Again, each process computes its own minimum and maximum for the outer loop index. These computations only require knowing the total number of processes in the MPI session, and the particular process number.

Before we discuss the rest of the code, please note that we did not restrict the argument parsing to the first process. There is no need to. It would take more effort to parse on the master node, then distribute the arguments than it would to parse on all nodes at once. Also note that we construct the initial matrices (a, b) on all of the nodes, the same way, simultaneously. In real codes, we would probably not want to do this as the matrix will be passed (sent) to the nodes, or constructed in some other manner. However, if you can divide and conquer the construction, it could be worth the effort in some cases.

At the end of the calculation, we use MPI_Send and MPI_Recv pairs to send every row of the computed panel back to the master process. This method is not meant to be efficient. It is designed solely to show that you need to pair MPI_Send with MPI_Recv. If one does not correctly complete a posted Send with a Receive), your code will block.

To move the data back to the master process, our loops look like this:

       /* have the master process receive the row */
       loop_min =(int)((long)(i + 0) *
       loop_max =(int)((long)(i + 1) *

       /* receive the rows from the i_th_ process to the master process */
       if (tid == 0)


       /* send the rows from the i_th_ process to the master process */
       if (tid == i)

As mentioned, this code is not efficient, and it does not make good use of MPI. Its entire purpose is to illustrate MPI Send and Recieve pairs. You can view the code in matmult_mpi.c. To build the code, enter, make -f Makefile.matmul_mpi, then to run on four cores mpirun -np 4 -hostfile hostfile ./matmul_mpi -n 1000. (Adjust the np argument to reflect your number of cores.)

We could effectively remove the inner loop over k with a larger single Send/Recieve pair. This way, instead of DIM Send/Recieve pairs, we have NCPU Send/Recieve pairs. If this is one of the optimizations you would like to do, first make it work correctly, then apply the optimization.

As for performance, this is what we observe for the N=4000 case:

Table Two: Speed-up vs number of cores for first version of matmul_mpi.c

Number Cores Time Speed-up Efficiency
1 1111.0 1 100.0%
2 573.40 1.94 96.9%
4 302.60 3.67 91.8%
8 197.60 5.62 70.3%

This isn’t bad. We measure 5.62x speedup on 8 CPUs. About 70% efficient. The parallel version takes 197.6 seconds as compared to over 1111 seconds.

Can we make it go any faster? There are two ways, either get better speedup using MPI or tweak the code slightly, and get good parallel performance as well as absolute performance?

The answer is yes to both. With a relatively simple code adjustment (that some compilers might do for you if you can coax them to), you can see significantly better performance in parallel. What we do is increase the amount of work done in parallel. We do this by unrolling a loop. Our code now looks like this (see matmult_mpi2.c):

    loop_min    =(int)((long)(tid + 0) *
    loop_max    =(int)((long)(tid + 1) *

           dot[0] += a[i+0][k]*b[k][j];
           dot[1] += a[i+1][k]*b[k][j];
           dot[2] += a[i+2][k]*b[k][j];
           dot[3] += a[i+3][k]*b[k][j];



With this modification, and a few additional ones to the definition and use of the variable dot (now an array), we now measure the performance for the N=4000 case. To build the code, enter, make -f Makefile.matmul_mpi2, then to run on four cores mpirun -np 4 -hostfile hostfile ./matmul_mpi -n 4000. Also note that since we “unrolled the loop” by 4, we need to make sure the array dimension (-n value) divided by the number of processes (-np value) is divisible by 4. Otherwise the program will crash, fixing it is an exercise for the reader.

Table Three: Speed-up vs number of cores for second version of matmul_mpi2.c

Number Cores Time Speed-up Efficiency
1 311.18 1 100.0%
2 160.22 1.94 97.1%
4 87.70 3.55 88.7%
8 60.87 5.11 63.9%

This version of the program operates upon 4 rows at a time, thus increasing the amount of work done per iteration. We have also reduced the cache miss penalty per iteration by reusing some of the more expensive elements (the b[k][j]). If we were sufficiently clever, we would unroll by 2 on the j loop, which would allow us to use a cache-line at a time of b[k][j .. j+1], which could lead to even better overall performance. Of course, after all of these modifications, we check the results to make sure that the addition of parallelization has not also caused the addition of bugs.

In the final version of this code (matmult_3.c), we fix the MPI_Send/MPI_Recv pairs so that they transfer a panel at a time. This improves the performance by not having so much work focused upon data transfer at the tail end of the computation. We could further overlap communication and computation by using MPI_Isend/MPI_Irecv pairs so that after we finish each panel row or column, we start a data transfer operation. This is one of the more interesting aspects of MPI, and allows you to overlap operations, effectively hiding the latency associated with longer delay operations.


We have shown in this article how to obtain, build, and use an MPI stack for Linux machines. The MPI examples shown range from simple “hello” examples, through parallel matrix multiplication. All codes demonstrate excellent performance. The exercise takes slightly more than 30 minutes and allows one to develop and run MPI codes on a multi-core server or on a HPC cluster.