Last lecture we started looking at inter-process communication.
We covered the communication primitives:
MPI_Send(...)
MPI_Recv(...)
We used these to communicate buffers of data between our processes in the trapezoidal rule example.
These functions provide the basic point-to-point communication.
Now we will start looking at collective communication, where we need to distribute data to, or collect data from, a set of processes.
Collective Communication
We have seen two forms of collective communication in MPI:
MPI_Bcast
MPI_Reduce
MPI_Bcast acts:
like a send for the process whose rank is root.
like a receive for all other processes.
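As a reminder, a broadcast is a single collective call made by every process in the communicator. A minimal sketch (the variable names here are illustrative, not taken from the course examples):
int data = 0;
if (my_rank == 0)
    data = 42;                               /* the root fills the buffer ...   */
MPI_Bcast(&data, 1, MPI_INT, 0, MPI_COMM_WORLD);
/* ... and after the call, data == 42 on every process in the communicator.    */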
Collective Communication
MPI_Reduce acts:
like a receive for the process whose rank is root.
like a send for all other processes.
the thing received is the accumulation of all the sends.
it is accumulated using the binary operation given by the MPI_Op argument.
Built-in MPI_Ops:
MPI_MAX MPI_MIN MPI_SUM MPI_PROD MPI_LAND MPI_LOR
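For example, a global sum of one int per process might look like this (a sketch; local_val and global_sum are illustrative names, and my_rank is assumed to be set up as usual):
int local_val = my_rank + 1;     /* each process contributes one value          */
int global_sum = 0;
/* Rank 0 receives the MPI_SUM accumulation of every process's local_val.      */
MPI_Reduce(&local_val, &global_sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);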
Collective Communication
MPI_Gather acts:
like a receive for the process whose rank is root.
like a send for all processes, including root.
the thing received is the accumulation of all the sends.
it is accumulated via concatenation in the receive buffer, ordered by rank.
Thus MPI_Gather would be perfect for collecting the rows of a matrix.
Collective Communication
#include "mpi.h"
int MPI_Gather (void *sndbuf, int sndcnt, MPI_Datatype sndtyp,
void *rcvbuf, int rcvcnt, MPI_Datatype rcvtyp,
int root, MPI_Comm comm )
Gathers together values from a group of processes.
sndbuf is the address of send buffer.
sndcnt is the number of elements in send buffer.
sndtyp is the type of send buffer items.
rcvbuf is the address of receive buffer.
significant only at root.
rcvcnt is the number of elements for any single receive.
significant only at root.
rcvtyp is the type of receive buffer items.
significant only at root.
root is the rank of the receiving process.
comm is the communicator of the processes participating in the gathering.
Collective Communication
The rcv arguments are only significant at the root process.
rcvcnt is the number of elements received from each particular process, not the total.
Thus rcvbuf should be capable of storing rcvcnt times the number of processes in the communicator.
Collective Communication
Each process in the comm communicator:
sends the contents of sndbuf to the process with rank root.
This includes the process with rank root.
The process root concatenates the received data in process rank order in rcvbuf.
That means:
The data from process 0 comes first,
Followed by the data from process 1,
and so on
all the way to the last process in the communicator comm.
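A minimal sketch of gathering one row per process into a matrix at the root (the names and sizes are illustrative, and my_rank, p, root and comm are assumed to be set up as in the earlier examples):
enum { N = 8, MAX_PROCS = 64 };   /* illustrative sizes                         */
double my_row[N];                 /* each process fills its own row             */
double matrix[MAX_PROCS * N];     /* only examined at root                      */
/* rcvcnt is N (the per-process count); root receives rank 0's row first,
   then rank 1's, and so on, concatenated in rank order.                        */
MPI_Gather(my_row, N, MPI_DOUBLE, matrix, N, MPI_DOUBLE, root, comm);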
Collective Communication
#include "mpi.h"
int MPI_Scatter(void *sndbuf, int sndcnt, MPI_Datatype sndtyp,
void *rcvbuf, int rcvcnt, MPI_Datatype rcvtyp,
int root, MPI_Comm comm )
Scatters the values to a group of processes.
Collective Communication
sndbuf is the address of send buffer.
significant only at root.
sndcnt is the number of elements in send buffer.
significant only at root.
sndtyp is the type of send buffer items.
significant only at root.
rcvbuf is the address of receive buffer.
rcvcnt is the number of elements for any single receive.
rcvtyp is the type of receive buffer items.
root is the rank of the scattering process.
comm is the communicator of the processes participating in the scattering.
Collective Communication
Another procedure that would be great for porting!
MPI_Scatter is the inverse of the gather operation.
The process with rank root distributes the contents of sndbuf among the processes in the communicator comm.
The contents of the sndbuf are split into nproc segments, each consisting of sndcnt items.
Collective Communication
Scattering means:
process 0 gets the first segment.
process 1 gets the second segment.
and so on
all the way to the last process in the communicator comm, which gets the nproc-th segment.
The snd arguments are only significant at the root process.
The rcv arguments are significant at all processes.
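A minimal sketch, the mirror image of the gather sketch above (again the names and sizes are illustrative):
enum { N = 8, MAX_PROCS = 64 };   /* illustrative sizes                         */
double all_rows[MAX_PROCS * N];   /* only examined at root                      */
double my_row[N];                 /* every process receives one segment         */
/* Rank 0 receives the first N elements of all_rows, rank 1 the next N,
   and so on through the communicator.                                          */
MPI_Scatter(all_rows, N, MPI_DOUBLE, my_row, N, MPI_DOUBLE, root, comm);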
Matrix Multiplication (v1)
Let's use MPI_Scatter and MPI_Gather to do matrix multiplication ring style.
There are two versions of the program in the Examples subdirectory.
The example works by using the scatter operation to distribute rows of matrix A and rows of matrix B evenly to each process.
This means that no process has an entire copy of either matrix. (Imagine that the matrices are too large to be stored by each process.)
The program then uses the MPI_Send( ... ) and MPI_Recv( ... ) functions to pass rows of matrix B to the next process in the ring.
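A sketch of the compute-and-pass loop is shown below. It is not the Examples code: the names are illustrative, it assumes n is divisible by p, and it uses MPI_Sendrecv_replace (a combined send/receive related to the MPI_Sendrecv call shown later in this lecture) rather than the separate MPI_Send/MPI_Recv calls used in the Examples programs.
void Ring_mat_mult(double A_local[], double B_local[], double C_local[],
                   int n, int nlocal, int my_rank, int p, MPI_Comm comm) {
    int step, i, j, k;
    for (i = 0; i < nlocal * n; i++)
        C_local[i] = 0.0;
    for (step = 0; step < p; step++) {
        /* The B rows currently held originated at process 'owner', i.e. they
           are global rows owner*nlocal ... owner*nlocal + nlocal - 1 of B.     */
        int owner = (my_rank - step + p) % p;
        for (i = 0; i < nlocal; i++)
            for (k = 0; k < nlocal; k++)
                for (j = 0; j < n; j++)
                    C_local[i*n + j] +=
                        A_local[i*n + owner*nlocal + k] * B_local[k*n + j];
        /* Pass our B rows to the next process in the ring and receive the
           previous process's rows in their place.                             */
        MPI_Sendrecv_replace(B_local, nlocal*n, MPI_DOUBLE,
                             (my_rank + 1) % p, 0,
                             (my_rank - 1 + p) % p, 0,
                             comm, MPI_STATUS_IGNORE);
    }
}
After the loop each process holds its nlocal rows of C, which can then be gathered back to the root with MPI_Gather.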
Tree-Structured Communication
Last lecture, we looked at an example where we calculate a global sum.
The result of this sum was communicated back to the root process.
We saw a simple way to do this by communicating all outputs back to the root process and having it complete the sum.
E.g. Version 1 of the trapezoid program.
This doesn't take advantage of the processing power within the cluster to parallelise the calculation.
Tree-Structured Communication
By adopting a tree-structured topology for communication, we can use multiple processes to calculate parts of the result.
We can achieve this through the use of the MPI_Bcast and MPI_Reduce functions.
These functions make use of a tree structure for calculating and communicating the result.
In this structure, processes (other than the root process) are used to calculate parts of the result.
Tree-Structured Communication
The simplest case is the global sum:
Tree-Structured Communication
In this approach, the work of computing the operation (in this case a sum) is distributed to some of the participating processes.
In this example, processes 1, 3, 5 and 7 calculate intermediate results and communicate them to the next process in the tree.
This results in most of the calculation being completed concurrently (i.e. the root process isn't responsible for calculating all of the result).
To demonstrate the idea, we can implement a global sum operation that uses the tree-structured topology with the MPI_Send and MPI_Recv primitives.
Tree-Structured Communication
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
const int MAX_CONTRIB = 10;
int Global_sum(int my_contrib, int my_rank, int p, MPI_Comm comm);
int main(int argc, char* argv[]) {
int p, my_rank;
MPI_Comm comm;
int my_contrib;
int sum;
MPI_Init(&argc, &argv);
comm = MPI_COMM_WORLD;
MPI_Comm_size(comm, &p);
MPI_Comm_rank(comm, &my_rank);
...
Tree-Structured Communication
...
srandom(my_rank+1);
my_contrib = random() % MAX_CONTRIB;
printf("Proc %d > my_contrib = %d\n", my_rank, my_contrib);
sum = Global_sum(my_contrib, my_rank, p, comm);
if (my_rank == 0)
printf("Proc %d > global sum = %d\n", my_rank, sum);
MPI_Finalize();
return 0;
}
Tree-Structured Communication
int Global_sum(int my_contrib, int my_rank, int p, MPI_Comm comm) {
int sum = my_contrib;
int temp;
int partner;
int done = 0;
unsigned bitmask = 1;
MPI_Status status;
...
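One possible completion of the function body, a sketch assuming p is a power of two (the course's Examples code may differ in detail):
    while (!done && bitmask < p) {
        partner = my_rank ^ bitmask;
        if (my_rank & bitmask) {
            /* The higher-ranked partner at this level sends its partial
               sum up the tree and then drops out.                        */
            MPI_Send(&sum, 1, MPI_INT, partner, 0, comm);
            done = 1;
        } else {
            /* The lower-ranked partner receives, accumulates, and moves
               up one level of the tree.                                  */
            MPI_Recv(&temp, 1, MPI_INT, partner, 0, comm, &status);
            sum += temp;
            bitmask <<= 1;
        }
    }
    return sum;   /* only process 0 holds the complete global sum */
}  /* Global_sum */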
Although creating our own implementation of tree-structured communication is a good exercise to understand the principles, MPI implements this for us in the MPI_Reduce function.
parallelTrap02.c uses the MPI_Reduce function to take advantage of this.
There may be situations where you will need to modify this arrangement to implement custom operations.
Tree-structured communication scales well for very large problems: the number of communication phases grows as log2(p) rather than linearly with the number of processes.
Butterfly Communication
There are many situations where the result of a particular calculation may need to be communicated to all processes in the communicator group.
The simple way to do this would be to have a single process send the calculated data out to each individual process.
This doesn't use any of the available parallelism.
A slightly better approach is to use the tree-based communication structure to compute the global operation, then use a tree-based structure to distribute the result.
We can still do better: The Butterfly structure.
Butterfly Communication
Global sum using a butterfly structure.
Butterfly Communication
This structure can be implemented using a similar scheme to the tree structure.
The pairings are created using exclusive OR operations.
The terminating condition is different as the partial results need to be communicated multiple times.
Butterfly Communication
int Global_sum(int my_contrib, int my_rank, int p, MPI_Comm comm) {
    int sum = my_contrib;
    int temp;
    int partner;
    unsigned bitmask = 1;
    /* At each stage, exchange partial sums with the process whose rank
       differs in the current bit; as written this assumes p is a power
       of two. */
    while (bitmask < p) {
        partner = my_rank ^ bitmask;
        MPI_Sendrecv(&sum, 1, MPI_INT, partner, 0,
                     &temp, 1, MPI_INT, partner, 0,
                     comm, MPI_STATUS_IGNORE);
        sum += temp;
        bitmask <<= 1;
    }
    return sum;   /* every process ends up with the full global sum */
} /* Global_sum */
The send-receive operation combines in one call the sending of a message to one destination and the receiving of another message from another process.
MPI_Sendrecv executes a blocking send and receive operation. Both send and receive use the same communicator, but possibly different tags.
The send and receive buffers must be disjoint, and may have different lengths and datatypes.
#include <mpi.h>
int MPI_Sendrecv(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
int dest, int sendtag, void *recvbuf, int recvcount,
MPI_Datatype recvtype, int source, int recvtag,
MPI_Comm comm, MPI_Status *status)
Butterfly Communication
Parameters:
sendbuf: Initial address of send buffer (choice).
sendcount: Number of elements to send (integer).
sendtype: Type of elements in send buffer (handle).
dest: Rank of destination (integer).
sendtag: Send tag (integer).
recvbuf: Initial address of receive buffer (choice).
recvcount: Maximum number of elements to receive (integer).
recvtype: Type of elements in receive buffer (handle).
source: Rank of source (integer).
recvtag: Receive tag (integer).
comm: Communicator (handle).
status: Status object; refers to the receive operation (Status).
Butterfly Communication
MPI implements and optimises butterfly-structured communication in the function MPI_Allreduce.
We will see this function next lecture.
We can modify this structure to suit specific requirements.
For example, not all processes may need the result; to implement this, we can put additional conditions before the send.
Summary
Collective Communication
Matrix Multiplication (v1)
Tree-Structured Communication
Butterfly Communication
Reading
Chapter 3 of An Introduction to Parallel Programming by Peter Pacheco