
COSC330/530 Parallel and Distributed Computing

Lecture 16 - Advanced Communication and Performance Analysis in MPI

Dr. Mitchell Welch


Reading


Summary


Collective Communication


int MPI_Allreduce(void *sndbuf, void *rcvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm);
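A minimal usage sketch (assuming MPI has been initialised and my_rank holds this process's rank; the local_sum and global_sum names are illustrative): every process contributes a value and, unlike MPI_Reduce, the reduced result arrives on every process.

float local_sum = my_rank;  /* stand-in for a locally computed partial result */
float global_sum;
/* Sum the local_sum values from all processes; every process receives the total */
MPI_Allreduce(&local_sum, &global_sum, 1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);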

Collective Communication

int MPI_Allgather(void *sndbuf,
                  int sndcnt,
                  MPI_Datatype sndtyp,
                  void *rcvbuf,
                  int rcvcnt,
                  MPI_Datatype rcvtyp,
                  MPI_Comm comm);
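A minimal usage sketch (assuming comm_sz processes, with my_rank holding this process's rank; the variable names are illustrative): each process contributes one int and every process receives the full gathered array. Note that rcvcnt is the count received from each process, not the total.

int my_val = my_rank;   /* this process's contribution */
int all_vals[comm_sz];  /* receive buffer: one slot per process */
MPI_Allgather(&my_val, 1, MPI_INT, all_vals, 1, MPI_INT, MPI_COMM_WORLD);
/* all_vals now holds {0, 1, ..., comm_sz - 1} on every process */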



Derived Datatypes

float a; //start of interval
float b; //end of interval
int n;   //number of trapezoids
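Broadcasting these from the root without a derived datatype takes three separate collective calls, one per variable (a minimal sketch, assuming process 0 has read the values):

MPI_Bcast(&a, 1, MPI_FLOAT, 0, MPI_COMM_WORLD);
MPI_Bcast(&b, 1, MPI_FLOAT, 0, MPI_COMM_WORLD);
MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

Each broadcast carries its own communication overhead; derived datatypes let us ship all three values in a single message.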


Derived Datatypes

typedef struct {
  float a;
  float b;
  int   n;
} ParTrapData;

printf("Enter a, b, and n\n");
scanf("%f %f %d", a_ptr, b_ptr, n_ptr);
ParTrapData ptd = { *a_ptr, *b_ptr, *n_ptr };
/* ParTrapData is a C type, not an MPI_Datatype: before this broadcast can
   work, a matching MPI derived datatype (mpi_ptdatatype) must be built and
   committed, as shown on the following slides. */
MPI_Bcast(&ptd, 1, mpi_ptdatatype, 0, MPI_COMM_WORLD);

Derived Datatypes

#include "mpi.h"
int MPI_Type_create_struct(int count, 
                    int blocklens[], 
               MPI_Aint indices[], 
           MPI_Datatype old_types[], 
           MPI_Datatype *newtype )

Derived Datatypes

MPI_Datatype mpi_ptdatatype;

int count = 3;
int blocklens[3] = {1, 1, 1};   /* one element in each block */
MPI_Aint indices[3];            /* byte displacements, filled in below */
MPI_Datatype old_types[3] = {MPI_FLOAT, MPI_FLOAT, MPI_INT};

printf("Enter a, b, and n\n");
scanf("%f %f %d", a_ptr, b_ptr, n_ptr);
ParTrapData ptd = { *a_ptr, *b_ptr, *n_ptr };

Derived Datatypes

MPI_Aint addresses[count + 1];
MPI_Get_address(&ptd, &addresses[0]);      /* base address of the struct */
MPI_Get_address(&(ptd.a), &addresses[1]);
MPI_Get_address(&(ptd.b), &addresses[2]);
MPI_Get_address(&(ptd.n), &addresses[3]);
indices[0] = addresses[1] - addresses[0];  /* displacement of a */
indices[1] = addresses[2] - addresses[0];  /* displacement of b */
indices[2] = addresses[3] - addresses[0];  /* displacement of n */
MPI_Type_create_struct(count, blocklens, indices, old_types, &mpi_ptdatatype);

Derived Datatypes

#include "mpi.h"
int MPI_Type_commit(MPI_Datatype *datatype);
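A minimal sketch of committing and using the type built above (MPI_Type_free is the standard cleanup call once the type is no longer needed):

MPI_Type_commit(&mpi_ptdatatype);  /* must be called before the type is used */
MPI_Bcast(&ptd, 1, mpi_ptdatatype, 0, MPI_COMM_WORLD);
/* ... compute with ptd ... */
MPI_Type_free(&mpi_ptdatatype);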

Derived Datatypes

#define BUFF 100
char buff[BUFF];
int position = 0;

/* position advances past each packed item */
MPI_Pack(a_ptr, 1, MPI_FLOAT, buff, BUFF, &position, MPI_COMM_WORLD);
MPI_Pack(b_ptr, 1, MPI_FLOAT, buff, BUFF, &position, MPI_COMM_WORLD);
MPI_Pack(n_ptr, 1, MPI_INT, buff, BUFF, &position, MPI_COMM_WORLD);


Derived Datatypes

MPI_Bcast(buff, BUFF, MPI_PACKED, 0, MPI_COMM_WORLD);
/* On receiving processes, reset position to 0 before unpacking,
   and unpack in the same order the data was packed. */
MPI_Unpack(buff, BUFF, &position, a_ptr, 1, MPI_FLOAT, MPI_COMM_WORLD);
MPI_Unpack(buff, BUFF, &position, b_ptr, 1, MPI_FLOAT, MPI_COMM_WORLD);
MPI_Unpack(buff, BUFF, &position, n_ptr, 1, MPI_INT, MPI_COMM_WORLD);
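Putting the two slides together, a sketch of the complete flow (assuming my_rank is set and a_ptr, b_ptr and n_ptr point at valid storage on every process):

int position = 0;
if (my_rank == 0) {  /* root packs the three values */
  MPI_Pack(a_ptr, 1, MPI_FLOAT, buff, BUFF, &position, MPI_COMM_WORLD);
  MPI_Pack(b_ptr, 1, MPI_FLOAT, buff, BUFF, &position, MPI_COMM_WORLD);
  MPI_Pack(n_ptr, 1, MPI_INT, buff, BUFF, &position, MPI_COMM_WORLD);
}
MPI_Bcast(buff, BUFF, MPI_PACKED, 0, MPI_COMM_WORLD);
if (my_rank != 0) {  /* everyone else unpacks, in the same order */
  position = 0;
  MPI_Unpack(buff, BUFF, &position, a_ptr, 1, MPI_FLOAT, MPI_COMM_WORLD);
  MPI_Unpack(buff, BUFF, &position, b_ptr, 1, MPI_FLOAT, MPI_COMM_WORLD);
  MPI_Unpack(buff, BUFF, &position, n_ptr, 1, MPI_INT, MPI_COMM_WORLD);
}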



Timings in MPI

double start, finish;

start = MPI_Wtime();

/* Something time-consuming */

finish = MPI_Wtime();
printf("Proc %d > Elapsed time = %e seconds\n", my_rank, finish - start);



Timings in MPI

#include "timer.h"

double start, finish;
GET_TIME(start);

/* Something time-consuming*/

GET_TIME(finish);

finish = MPI_Wtime(void);
printf("Proc %d > Elapsed time = %e seconds\n",my_rank, finish-start);

Timings in MPI


double local_start, local_finish, local_elapsed, elapsed;
...
MPI_Barrier(comm);
local_start = MPI_Wtime();
/* Code to be timed */
...
local_finish = MPI_Wtime();
local_elapsed = local_finish - local_start;
MPI_Reduce(&local_elapsed, &elapsed, 1, MPI_DOUBLE, MPI_MAX, 0, comm);
if (my_rank == 0) {
    printf("Elapsed time = %e seconds\n", elapsed);
}



Performance Evaluation of MPI Applications

[cosc330@bourbaki examples] $ mpiexec -n 1 mat_vect_mult_time
Enter the number of rows
1024
Enter the number of columns
1024
Elapsed time = 3.496170e-03
[cosc330@bourbaki examples] $ mpiexec -n 2 mat_vect_mult_time
Enter the number of rows
1024
Enter the number of columns
1024
Elapsed time = 1.714945e-03


Performance Evaluation of MPI Applications

Run-times of the parallel matrix-vector multiplication program (times in milliseconds):

Processes   Order 1024   Order 2048   Order 4096   Order 8192   Order 16384
    1           3.43        13.21        53.62       211.94        837.77
    2           1.78         6.91        26.83       107.39        422.23
    4           1.03         3.83        16.88        66.63        233.58
    8           0.76         2.04         9.03        34.59        169.47
   16           0.71         1.71         6.33        24.70         94.06
   32           0.80         1.92         6.61        32.94         91.35

Performance Evaluation of MPI Applications

When graphed by matrix order:

[Figure: run-time versus number of processes, one curve per matrix order]


Performance Evaluation of MPI Applications

\( S = \frac{T_{serial}}{T_{parallel}}\)
\( E = \frac{S}{p} = \frac{\big(\frac{T_{serial}}{T_{parallel}}\big)}{p} = \frac{T_{serial}}{p \cdot T_{parallel}}\)

where \(E\) is the efficiency, \(S\) is the speedup, \(p\) is the number of processes, \(T_{serial}\) is the serial run-time of the program, and \(T_{parallel}\) is the parallel run-time.
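For example, taking the run-times tabulated earlier for the order 16384 matrix on 16 processes: \( S = \frac{837.77}{94.06} \approx 8.9 \) and \( E = \frac{8.9}{16} \approx 0.56 \), matching the entries in the speedup and efficiency tables below.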


Performance Evaluation of MPI Applications

Speedups relative to the single-process run-time:

Processes   Order 1024   Order 2048   Order 4096   Order 8192   Order 16384
    1           1.00         1.00         1.00         1.00         1.00
    2           1.92         1.91         1.99         1.97         1.98
    4           3.33         3.44         3.17         3.18         3.58
    8           4.51         6.47         5.93         6.12         4.94
   16           4.83         7.72         8.46         8.58         8.90
   32           4.28         6.88         8.11         6.43         9.17

Performance Evaluation of MPI Applications

Efficiencies (\(E = S/p\)):

Processes   Order 1024   Order 2048   Order 4096   Order 8192   Order 16384
    1           1.00         1.00         1.00         1.00         1.00
    2           0.96         0.96         1.00         0.99         0.99
    4           0.83         0.86         0.79         0.80         0.90
    8           0.56         0.81         0.74         0.77         0.62
   16           0.30         0.48         0.53         0.54         0.56
   32           0.13         0.22         0.25         0.20         0.29



Summary


Reading