Flynn's taxonomy provides a framework to classify computer architectures based on the number of instruction streams and the number of data streams the system can simultaneously manage.
Flynn's taxonomy only refers to the hardware of the system.
The categories within Flynn's taxonomy can be broken down further based upon software implementations that operate on the different platforms.
Flynn's Taxonomy
A classical von Neumann system is categorised as a single instruction, single data stream (SISD) system.
Such a system executes a single instruction at a time on a single item of data.
Examples: PIC and AVR microcontrollers, single-core x86-64 processors, etc.
Flynn's Taxonomy
Single instruction, multiple data (SIMD) systems are parallel machines that operate on multiple data streams.
In such a system, the same instructions are applied to multiple (different) data items.
These systems can be thought of as having a single control unit with multiple Arithmetic Logic Units (ALUs).
Instructions are broadcast from the control unit, and each ALU applies the same instruction to its own data item.
The ALUs are either all executing the same instruction or idle (i.e. they cannot execute different logic).
Flynn's Taxonomy
The SIMD approach:
Flynn's Taxonomy
A motivating example:
for (i = 0; i < n; i++) {
    x[i] += y[i];
}
If we have a system with n ALUs, then the operation on x[i] and y[i] can be performed by the ith ALU.
Parallelism achieved through the SIMD approach is referred to as Data Parallelism.
SIMD is very efficient for large data-parallel problems.
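As a rough illustration only (my own sketch, not from the lecture): the same loop written with x86 SSE intrinsics, assuming an x86 target, a compiler that provides <xmmintrin.h>, and that n is a multiple of 4. Each _mm_add_ps applies one add instruction to four floats at once, which is the SIMD idea in miniature.

#include <xmmintrin.h>

void vec_add_simd(float *x, const float *y, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 xv = _mm_loadu_ps(&x[i]);            /* load 4 floats from x */
        __m128 yv = _mm_loadu_ps(&y[i]);            /* load 4 floats from y */
        _mm_storeu_ps(&x[i], _mm_add_ps(xv, yv));   /* x[i..i+3] += y[i..i+3] */
    }
}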
Flynn's Taxonomy
Graphics Processing Units make use of the SIMD approach for performing the complex calculations required to render detailed 3D graphics.
GPUs use a processing pipeline to convert an internal representation into an array of pixels that are sent to the screen.
Stages within the pipeline are programmable through shader functions
NVIDIA's Compute Unified Device Architecture (CUDA) and OpenCL provide platforms for general-purpose computation on these GPU SIMD devices.
GPUs now enjoy wide-spread use for high-performance computing tasks.
Flynn's Taxonomy
A CUDA kernel function that performs an addition operation over N items.
// Kernel definition
__global__ void VecAdd(float* A, float* B, float* C){
int i = threadIdx.x;
C[i] = A[i] + B[i];
}
int main() {
... // Kernel invocation with N threads
VecAdd<<<1, N>>>(A, B, C); // 1 block of N threads; thread i computes C[i]
...
}
Flynn's Taxonomy
Multiple Instruction, Multiple Data (MIMD) systems support simultaneous instruction streams operating over multiple data streams.
They are typically constructed using a set of independent processors, where each controls its own ALU.
Instructions in MIMD platforms are usually executed asynchronously; there is typically no global clock signal.
Flynn's Taxonomy
MIMD systems can be implemented using shared memory or distributed memory
In a shared memory arrangement, each processing unit can access a global block of memory - e.g. A multicore processor
In a distributed memory arrangement, each processing unit has its own memory and must rely on IPC to work with other processing units.
The Beowulf cluster we will be working with is an example of a distributed memory platform.
We will work with the Message Passing Interface (MPI) software platform to provide IPC.
Flynn's Taxonomy
Multiple Instruction, Single Data (MISD) systems are generally implemented to achieve fault tolerance.
Multiple processors independently operate on the same stream of data.
This approach can be used to detect and correct errors in calculations.
Flynn's Taxonomy
At the software level, MIMD hardware can support two common architectures:
Single Program, Multiple Data streams (SPMD)
Multiple Program, Multiple Data streams (MPMD)
SPMD is the most common approach: a single executable is run across all processing units but behaves differently on different processors (see the sketch below).
Some of the multi-process programs we constructed in the first module fall into this category.
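A minimal sketch of the SPMD idea (my own example, not code from the module), written with MPI, which is covered in detail later: one executable whose behaviour is selected by the process's rank.

#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* The same executable acts as a coordinator on rank 0 ... */
        printf("rank 0: coordinating\n");
    } else {
        /* ... and as a worker on every other rank. */
        printf("rank %d: working\n", rank);
    }

    MPI_Finalize();
    return 0;
}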
Flynn's Taxonomy
In the MPMD architecture, different executables run on the different processing units.
A master-slave architecture is an example of this approach.
An MPMD architecture can be implemented with multiple processes using the exec( ... ) system call to execute separate programs.
MPI supports the MPMD architecture by allowing different nodes in the cluster to run different programs.
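For example (the program names are hypothetical), Open MPI and MPICH-style launchers accept colon-separated application contexts, so one master and four worker processes could be started with:

mpirun -np 1 ./master : -np 4 ./worker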
Performance
The performance of a parallel program can be evaluated using two metrics:
Speedup - The time saved from parallelism
Efficiency - A measure of how effectively the system is being used.
Speedup (S) is given by the expression:
\( S = \frac{T_{serial}}{T_{parallel}}\)
Where, \(T_{serial}\) is the serial runtime of the program and \(T_{parallel}\) is the parallel runtime.
The theoretical best case is linear speedup, where \(T_{parallel} = \frac{T_{serial}}{p}\) and p is the number of processes.
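As an illustration (the numbers are mine, not from the lecture): if a program takes \(T_{serial} = 24\)s to run serially and \(T_{parallel} = 6\)s on \(p = 4\) processes, then \(S = 24/6 = 4\), which is exactly linear speedup for 4 processes.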
Performance
Efficiency (E) is given by:
\( E = \frac{S}{p} = \frac{\big(\frac{T_{serial}}{T_{parallel}}\big)}{p} = \frac{T_{serial}}{p \cdot T_{parallel}}\)
Where, S is the speedup, p is the number of processes, \(T_{serial}\) is the serial runtime of the program and \(T_{parallel}\) is the parallel runtime
\(T_{parallel}\), S and E all depend on the number of processes and threads.
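Continuing the illustrative numbers above: a speedup of \(S = 4\) on \(p = 4\) processes gives \(E = 4/4 = 1\); obtaining the same speedup of 4 with \(p = 8\) processes would give only \(E = 4/8 = 0.5\), i.e. only half of the available processing power is doing useful work.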
Performance
Unfortunately in the real world there are additional overheads that are associated with parallel execution.
We have looked at some of these when we covered processes and threads:
Context switches.
Maintaining meta-data for processes and threads (e.g. program counters, file descriptors etc.)
Network communication times in distributed systems.
Delays due to mutual exclusion etc.
Performance
This means that the actual parallel runtime will be \(T_{parallel} = \frac{T_{serial}}{p} + T_{overhead}\), so the speedup achieved in practice falls short of the linear best case.
The MPI Paradigm
Note that the number of processes is a command-line argument to mpirun!
Thus we never explicitly write forks, spawns, creates or whatever.
We will cover MPI in two stages:
Elementary MPI
Advanced MPI
Elementary MPI is small - just six (6) functions.
Advanced MPI is large: 372 functions for MPI-3
The MPI Paradigm
One can master elementary MPI using just 6 functions:
MPI_Init -- initializing
MPI_Comm_size -- number of peers
MPI_Comm_rank -- rank amongst peers
MPI_Send -- sending
MPI_Recv -- receiving
MPI_Finalize -- exiting
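A minimal sketch (my own, not from the lecture) that exercises all six functions: rank 0 sends a single int to rank 1. Run it with at least two processes.

#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv) {
    int me, nproc, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    if (nproc >= 2) {
        if (me == 0) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (me == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d from rank 0\n", value);
        }
    }

    MPI_Finalize();
    return 0;
}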
MPI Elementary Functions
#include "mpi.h"
int MPI_Init(int *argc, char ***argv)
Initializes the MPI execution environment
On exit from this routine, all processes will have a copy of the argument list.
Compared to main there is an extra * in both arguments!
Thus a typical call looks like:
MPI_Init(&argc, &argv);
MPI Elementary Functions
MPI itself specifies no command-line arguments, but it does allow an MPI implementation to make use of them.
Distributing the argument list to every process is not required by the MPI standard.
Portable programs should not rely on it.
It is provided as a service by the implementation we use.
MPI Elementary Functions
An MPI implementation is allowed to distribute the command line arguments but is not required to.
It is actually mpirun that distributes argc and argv.
All MPI routines return an error value; MPI_SUCCESS indicates success.
I'll omit error checking from my code; you shouldn't!
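A minimal sketch (my own, not from the lecture) of checking the value returned by an MPI call:

#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

int main(int argc, char **argv) {
    /* Check the error code returned by MPI_Init before going any further. */
    int rc = MPI_Init(&argc, &argv);
    if (rc != MPI_SUCCESS) {
        fprintf(stderr, "MPI_Init failed with error code %d\n", rc);
        return EXIT_FAILURE;
    }
    /* ... rest of the program ... */
    MPI_Finalize();
    return EXIT_SUCCESS;
}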
MPI Elementary Functions
#include "mpi.h"
int MPI_Finalize()
Terminates MPI execution environment
All processes must call this routine before exiting.
The MPI standard does not say what a program can do before an MPI_INIT or after an MPI_FINALIZE.
You should do as little as possible:
It is best not to perform much more than a return after calling MPI_Finalize.
In particular, avoid anything that changes the external state of the program, such as opening files, reading standard input or writing to standard output.
The number of processes running after this routine is called is undefined.
MPI Elementary Functions
Self-Discovery in MPI involves querying the system to answer two questions:
How many processes are there? size
Which one of them am I? rank
How many?
Is answered using: MPI_Comm_size.
Which one?
Is answered using: MPI_Comm_rank.
The rank ranges from 0 up to size - 1.
MPI Elementary Functions
whoami.c
#include <stdio.h>
#include "mpi.h"
int main(int argc, char** argv) {
int me, nproc;
MPI_Init(&argc, &argv); /* set up the MPI environment */
MPI_Comm_rank(MPI_COMM_WORLD, &me); /* my rank within MPI_COMM_WORLD */
MPI_Comm_size(MPI_COMM_WORLD, &nproc); /* total number of processes */
printf("I'm clone %d out of %d total!\n", me, nproc);
MPI_Finalize(); /* shut MPI down before exiting */
return 0;
}
MPI Elementary Functions
Process numbers bigger than 20 cause noticeable load spikes on turing!
Process numbers bigger than 50 might make you very unpopular!
[cosc330@bourbaki examples] $ mpiexec -np 20 whoAmI
I'm clone 8 out of 20 total!
I'm clone 10 out of 20 total!
I'm clone 11 out of 20 total!
I'm clone 12 out of 20 total!
I'm clone 13 out of 20 total!
I'm clone 17 out of 20 total!
I'm clone 0 out of 20 total!
I'm clone 1 out of 20 total!
I'm clone 2 out of 20 total!
I'm clone 3 out of 20 total!
I'm clone 6 out of 20 total!
I'm clone 7 out of 20 total!
I'm clone 16 out of 20 total!
I'm clone 19 out of 20 total!
I'm clone 4 out of 20 total!
I'm clone 5 out of 20 total!
I'm clone 15 out of 20 total!
I'm clone 18 out of 20 total!
I'm clone 9 out of 20 total!
I'm clone 14 out of 20 total!
[cosc330@bourbaki examples] $
MPI Elementary Functions
whereAmI.c
#include <stdio.h>
#include <unistd.h>
#include <limits.h>
#include "mpi.h"
int main(int argc, char** argv) {
int me, nproc;
char hostname[PATH_MAX];
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &me);
MPI_Comm_size(MPI_COMM_WORLD, &nproc);
if(gethostname(hostname, PATH_MAX) == 0){
printf("I'm clone %d out of %d total located @ %s\n",
me, nproc, hostname);
} else {
printf("I'm a lost clone %d out of %d total!\n",
me, nproc);
}
MPI_Finalize();
return 0;
}
MPI Elementary Functions
[cosc330@bourbaki examples] $ mpiexec -np 10 whereAmI
I'm clone 0 out of 10 total located @ bourbaki.une.edu.au
I'm clone 1 out of 10 total located @ bourbaki.une.edu.au
I'm clone 2 out of 10 total located @ bourbaki.une.edu.au
I'm clone 4 out of 10 total located @ bourbaki.une.edu.au
I'm clone 7 out of 10 total located @ bourbaki.une.edu.au
I'm clone 8 out of 10 total located @ bourbaki.une.edu.au
I'm clone 9 out of 10 total located @ bourbaki.une.edu.au
I'm clone 3 out of 10 total located @ bourbaki.une.edu.au
I'm clone 5 out of 10 total located @ bourbaki.une.edu.au
I'm clone 6 out of 10 total located @ bourbaki.une.edu.au
[cosc330@bourbaki examples] $
MPI Elementary Functions
So far we haven't actually distributed our processing across the nodes of the cluster.
All processes are executing on bourbaki
We need to create a multi-node machine and instruct MPI on how to use it.
The first step is to create a hostfile that lists all of the nodes that we are going to use.
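The exact contents of the course's b1tob4 hostfile are an assumption on my part, but an Open MPI-style hostfile for the four single-core b-nodes seen in the output below would simply list one hostname per line:

b1
b2
b3
b4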
Now we can use the host file to tell MPI to run our program on the nodes listed.
We will use the -map-by option to specify how many instances of the program will run on each node:
[cosc330@bourbaki examples] $ mpirun -np 5 --map-by node --hostfile b1tob4 whereAmI
I'm clone 0 out of 5 total located @ b1
I'm clone 4 out of 5 total located @ b1
I'm clone 1 out of 5 total located @ b2
I'm clone 2 out of 5 total located @ b3
I'm clone 3 out of 5 total located @ b4
[cosc330@bourbaki examples] $ mpirun -np 10 --map-by node --hostfile b1tob4 whereAmI
I'm clone 3 out of 10 total located @ b4
I'm clone 7 out of 10 total located @ b4
I'm clone 9 out of 10 total located @ b2
I'm clone 5 out of 10 total located @ b2
I'm clone 0 out of 10 total located @ b1
I'm clone 4 out of 10 total located @ b1
I'm clone 1 out of 10 total located @ b2
I'm clone 2 out of 10 total located @ b3
I'm clone 6 out of 10 total located @ b3
I'm clone 8 out of 10 total located @ b1
[cosc330@bourbaki examples] $
MPI Elementary Functions
The b-nodes are single core.
[cosc330@bourbaki examples] $ mpirun -np 10 --map-by core --hostfile b1tob4 whereAmI
I'm clone 8 out of 10 total located @ b4
I'm clone 9 out of 10 total located @ b4
I'm clone 1 out of 10 total located @ b1
I'm clone 2 out of 10 total located @ b1
I'm clone 5 out of 10 total located @ b2
I'm clone 3 out of 10 total located @ b2
I'm clone 7 out of 10 total located @ b3
I'm clone 0 out of 10 total located @ b1
I'm clone 6 out of 10 total located @ b3
I'm clone 4 out of 10 total located @ b2
[cosc330@bourbaki examples] $
MPI Elementary Functions
#include "mpi.h"
int MPI_Comm_rank ( MPI_Comm comm, int *rank )
Determines the rank of the calling process in the communicator comm, placed in the location rank.
A communicator is essentially a collection of processes that can send each other messages.
The rank is a number between 0 and size - 1 (one less than the size).
The rank of a process (in the communicator) is also used as its address, when sending or receiving messages.
MPI Elementary Functions
We'll mostly use MPI_COMM_WORLD as the first argument, for a while.
It consists of all the processes running when program execution begins (i.e. the <num> used in the mpirun -np <num> <prog>).
Typical call:
MPI_Comm_rank(MPI_COMM_WORLD,&me);
MPI Elementary Functions
#include "mpi.h"
int MPI_Comm_size ( MPI_Comm comm, int *size )
Puts the number of processes in the group of comm into size
We'll mostly use MPI_COMM_WORLD as the first argument, for a while.
Typical call:
MPI_Comm_size(MPI_COMM_WORLD,&nproc);
Summary
Flynn's Taxonomy
Performance
Amdahl's Law
The Beowulf Cluster
Message Passing Interface
MPI Elementary Functions
Reading
Chapters 2 and 3 from An Introduction to Parallel Programming by Peter Pacheco