Last lecture we started looking at inter-process communication.
We covered the communication primitives:
MPI_Send(...)
MPI_Recv(...)
We used these to communicate buffers of data between our processes in the trapezoidal rule example.
These functions provide the basic point-to-point communication.
Now we will start looking at collective communication, where we need to distribute data to, or collect data from, a set of processes.
Collective Communication
We have seen two forms of collective communication in MPI:
MPI_Bcast
MPI_Reduce
MPI_Bcast acts:
like a send for the process whose rank is root.
like a receive for all other processes.
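As a reminder, a broadcast is a single collective call made by every process in the communicator. A minimal sketch (the variable names here are illustrative, not taken from the course examples):
int data = 0;
if (my_rank == 0)
    data = 42;                               /* the root fills the buffer ...   */
MPI_Bcast(&data, 1, MPI_INT, 0, MPI_COMM_WORLD);
/* ... and after the call, data == 42 on every process in the communicator.    */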
Collective Communication
MPI_Reduce acts:
like a receive for the process whose rank is root.
like a send for all other processes.
the thing received is the accumulation of all the sends.
it is accumulated using the binary operation given by the MPI_Op argument.
Built-in MPI_Ops:
MPI_MAX MPI_MIN MPI_SUM MPI_PROD MPI_LAND MPI_LOR
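For example, a global sum of one int per process might look like this (a sketch; local_val and global_sum are illustrative names, and my_rank is assumed to be set up as usual):
int local_val = my_rank + 1;     /* each process contributes one value          */
int global_sum = 0;
/* Rank 0 receives the MPI_SUM accumulation of every process's local_val.      */
MPI_Reduce(&local_val, &global_sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);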
Collective Communication
MPI_Gather acts:
like a receive for the process whose rank is root.
like a send for all processes, including root.
the thing received is the accumulation of all the sends.
it is accumulated via concatenation in the receive buffer, ordered by rank.
Thus MPI_Gather would be perfect for collecting the rows of a matrix.
Collective Communication
#include "mpi.h"
int MPI_Gather (void *sndbuf, int sndcnt, MPI_Datatype sndtyp,
void *rcvbuf, int rcvcnt, MPI_Datatype rcvtyp,
int root, MPI_Comm comm )
Gathers together values from a group of processes.
sndbuf is the address of send buffer.
sndcnt is the number of elements in send buffer.
sndtyp is the type of send buffer items.
rcvbuf is the address of receive buffer.
significant only at root.
rcvcnt is the number of elements for any single receive.
significant only at root.
rcvtyp is the type of receive buffer items.
significant only at root.
root is the rank of the receiving process.
comm is the communicator of the processes participating in the gathering.
Collective Communication
The rcv arguments are only significant at the root process.
rcvcnt is the number of elements received from each particular process, not the total.
Thus rcvbuf should be capable of storing rcvcnt times the number of processes in the communicator.
Collective Communication
Each process in the comm communicator:
sends the contents of sndbuf to the process with rank root.
This includes the process with rank root.
The process root concatenates the received data in process rank order in rcvbuf.
That means:
The data from process 0 comes first,
Followed by the data from process 1,
and so on
all the way to the last process in the communicator comm.
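A minimal sketch of gathering one row per process into a matrix at the root (the names and sizes are illustrative, and my_rank, p, root and comm are assumed to be set up as in the earlier examples):
enum { N = 8, MAX_PROCS = 64 };   /* illustrative sizes                         */
double my_row[N];                 /* each process fills its own row             */
double matrix[MAX_PROCS * N];     /* only examined at root                      */
/* rcvcnt is N (the per-process count); root receives rank 0's row first,
   then rank 1's, and so on, concatenated in rank order.                        */
MPI_Gather(my_row, N, MPI_DOUBLE, matrix, N, MPI_DOUBLE, root, comm);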
Collective Communication
#include "mpi.h"
int MPI_Scatter(void *sndbuf, int sndcnt, MPI_Datatype sndtyp,
void *rcvbuf, int rcvcnt, MPI_Datatype rcvtyp,
int root, MPI_Comm comm )
Scatters the values to a group of processes.
Collective Communication
sndbuf is the address of send buffer.
significant only at root.
sndcnt is the number of elements in send buffer.
significant only at root.
sndtyp is the type of send buffer items.
significant only at root.
rcvbuf is the address of receive buffer.
rcvcnt is the number of elements for any single receive.
rcvtyp is the type of receive buffer items.
root is the rank of the scattering process.
comm is the communicator of the processes participating in the scattering.
Collective Communication
Another procedure that would be great for porting!
MPI_Scatter is the inverse of the gather operation.
The process with rank root distributes the contents of sndbuf among the processes in the communicator comm.
The contents of the sndbuf are split into nproc segments, each consisting of sndcnt items.
Collective Communication
Scattering means:
process 0 gets the first segment.
process 1 gets the second segment.
and so on
all the way to the last process in the communicator comm, which gets the nproc-th segment.
The snd arguments are only significant at the root process.
The rcv arguments are significant at all processes.
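A minimal sketch, the mirror image of the gather sketch above (again the names and sizes are illustrative):
enum { N = 8, MAX_PROCS = 64 };   /* illustrative sizes                         */
double all_rows[MAX_PROCS * N];   /* only examined at root                      */
double my_row[N];                 /* every process receives one segment         */
/* Rank 0 receives the first N elements of all_rows, rank 1 the next N,
   and so on through the communicator.                                          */
MPI_Scatter(all_rows, N, MPI_DOUBLE, my_row, N, MPI_DOUBLE, root, comm);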
Matrix Multiplication (v1)
Let's use MPI_Scatter and MPI_Gather to do matrix multiplication ring style.
There are two versions of the program in the Examples subdirectory.
The example works by using the scatter operation to distribute rows of matrix A and rows of matrix B evenly to each process.
This means that no process has an entire copy of either matrix. (Imagine that the matrices are too large to be stored by each process.)
The program then uses the MPI_Send( ... ) and MPI_Recv( ... ) functions to pass rows of matrix B to the next process in the ring.
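A sketch of the compute-and-pass loop is shown below. It is not the Examples code: the names are illustrative, it assumes n is divisible by p, and it uses MPI_Sendrecv_replace (a combined send/receive related to the MPI_Sendrecv call shown later in this lecture) rather than the separate MPI_Send/MPI_Recv calls used in the Examples programs.
void Ring_mat_mult(double A_local[], double B_local[], double C_local[],
                   int n, int nlocal, int my_rank, int p, MPI_Comm comm) {
    int step, i, j, k;
    for (i = 0; i < nlocal * n; i++)
        C_local[i] = 0.0;
    for (step = 0; step < p; step++) {
        /* The B rows currently held originated at process 'owner', i.e. they
           are global rows owner*nlocal ... owner*nlocal + nlocal - 1 of B.     */
        int owner = (my_rank - step + p) % p;
        for (i = 0; i < nlocal; i++)
            for (k = 0; k < nlocal; k++)
                for (j = 0; j < n; j++)
                    C_local[i*n + j] +=
                        A_local[i*n + owner*nlocal + k] * B_local[k*n + j];
        /* Pass our B rows to the next process in the ring and receive the
           previous process's rows in their place.                             */
        MPI_Sendrecv_replace(B_local, nlocal*n, MPI_DOUBLE,
                             (my_rank + 1) % p, 0,
                             (my_rank - 1 + p) % p, 0,
                             comm, MPI_STATUS_IGNORE);
    }
}
After the loop each process holds its nlocal rows of C, which can then be gathered back to the root with MPI_Gather.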
Tree-Structured Communication
Last lecture, we looked at an example where we calculate a global sum.
The result of this sum was communicated back to the root process.
We saw a simple way to do this by communicating all outputs back to the root process and having it complete the sum.
E.g. Version 1 of the trapezoid program.
This doesn't take advantage of the processing power within the cluster to parallelise the calculation.
Tree-Structured Communication
By adopting a tree-structured topology for communication, we can use multiple processes to calculate parts of the result.
We can achieve this through the use of the MPI_Bcast and MPI_Reduce functions.
These functions make use of a tree structure for calculating and communicating the result.
In this structure, processes (other than the root process) are used to calculate parts of the result.
Tree-Structured Communication
The simplest case is the global sum:
Tree-Structured Communication
In this approach, the work of computing the operation (in this case a sum) is distributed to some of the participating processes.
In this example, processes 1, 3, 5 and 7 calculate intermediate results and communicate them to the next process in the tree.
This results in most of the calculation being completed concurrently (i.e. the root process isn't responsible for calculating all of the result).
To demonstrate the idea, we can implement a global sum operation that uses the tree-structured topology with the MPI_Send and MPI_Recv primitives.
Tree-Structured Communication
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
const int MAX_CONTRIB = 10;
int Global_sum(int my_contrib, int my_rank, int p, MPI_Comm comm);
int main(int argc, char* argv[]) {
int p, my_rank;
MPI_Comm comm;
int my_contrib;
int sum;
MPI_Init(&argc, &argv);
comm = MPI_COMM_WORLD;
MPI_Comm_size(comm, &p);
MPI_Comm_rank(comm, &my_rank);
...
Tree-Structured Communication
...
srandom(my_rank+1);
my_contrib = random() % MAX_CONTRIB;
printf("Proc %d > my_contrib = %d\n", my_rank, my_contrib);
sum = Global_sum(my_contrib, my_rank, p, comm);
if (my_rank == 0)
printf("Proc %d > global sum = %d\n", my_rank, sum);
MPI_Finalize();
return 0;
}
Tree-Structured Communication
int Global_sum(int my_contrib, int my_rank, int p, MPI_Comm comm) {
int sum = my_contrib;
int temp;
int partner;
int done = 0;
unsigned bitmask = 1;
MPI_Status status;
...
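One possible completion of the function body, a sketch assuming p is a power of two (the course's Examples code may differ in detail):
    while (!done && bitmask < p) {
        partner = my_rank ^ bitmask;
        if (my_rank & bitmask) {
            /* The higher-ranked partner at this level sends its partial
               sum up the tree and then drops out.                        */
            MPI_Send(&sum, 1, MPI_INT, partner, 0, comm);
            done = 1;
        } else {
            /* The lower-ranked partner receives, accumulates, and moves
               up one level of the tree.                                  */
            MPI_Recv(&temp, 1, MPI_INT, partner, 0, comm, &status);
            sum += temp;
            bitmask <<= 1;
        }
    }
    return sum;   /* only process 0 holds the complete global sum */
}  /* Global_sum */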
Although creating our own implementation of tree-structured communication is a good exercise to understand the principles, MPI implements this for us in the MPI_Reduce function.
parallelTrap02.c uses the MPI_Reduce function to take advantage of this.
There may be situations where you will need to modify this arrangement to implement custom operations.
Tree-structured communication scales well for very large problems: the number of communication phases grows as log2(p) rather than linearly with the number of processes.
Butterfly Communication
There are many situations where the result of a particular calculation may need to be communicated to all processes in the communicator group.
The simple way to do this would be to have a single process send the calculated data out to each individual process.
This doesn't use any of the available parallelism.
A slightly better approach is to use the tree-based communication structure to compute the global operation, then use a tree-based structure to distribute the result.
We can still do better: The Butterfly structure.
Butterfly Communication
Global sum using a butterfly structure.
Butterfly Communication
This structure can be implemented using a similar scheme to the tree structure.
The pairings are created using exclusive OR operations.
The terminating condition is different as the partial results need to be communicated multiple times.
Butterfly Communication
int Global_sum(int my_contrib, int my_rank, int p, MPI_Comm comm) {
    int sum = my_contrib;
    int temp;
    int partner;
    unsigned bitmask = 1;
    /* At each stage, exchange partial sums with the process whose rank
       differs in the current bit; as written this assumes p is a power
       of two. */
    while (bitmask < p) {
        partner = my_rank ^ bitmask;
        MPI_Sendrecv(&sum, 1, MPI_INT, partner, 0,
                     &temp, 1, MPI_INT, partner, 0,
                     comm, MPI_STATUS_IGNORE);
        sum += temp;
        bitmask <<= 1;
    }
    return sum;   /* every process ends up with the full global sum */
} /* Global_sum */
The send-receive operation combines in one call the sending of a message to one destination and the receiving of another message from another process.
MPI_Sendrecv executes a blocking send and receive operation. Both send and receive use the same communicator, but possibly different tags.
The send and receive buffers must be disjoint, and may have different lengths and datatypes.
#include <mpi.h>
int MPI_Sendrecv(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
int dest, int sendtag, void *recvbuf, int recvcount,
MPI_Datatype recvtype, int source, int recvtag,
MPI_Comm comm, MPI_Status *status)
Butterfly Communication
Parameters:
sendbuf: Initial address of send buffer (choice).
sendcount: Number of elements to send (integer).
sendtype: Type of elements in send buffer (handle).
dest: Rank of destination (integer).
sendtag: Send tag (integer).
recvbuf: Initial address of receive buffer (choice).
recvcount: Maximum number of elements to receive (integer).
recvtype: Type of elements in receive buffer (handle).
source: Rank of source (integer).
recvtag: Receive tag (integer).
comm: Communicator (handle).
status: Status object; refers to the receive operation (Status).
Butterfly Communication
MPI implements and optimises butterfly-structured communication in the function MPI_Allreduce.
We will see this function next lecture.
We can modify this structure to suit specific requirements.
For example, not all processes may need the result; to implement this, we can put additional conditions before the send.
Summary
Collective Communication
Matrix Multiplication (v1)
Tree-Structured Communication
Butterfly Communication
Reading
Chapter 3 of An Introduction to Parallel Programming by Peter Pacheco