You are all familiar, by now, with the idea of multiplying matrices by passing the rows of B around a ring.
Now we will elaborate on this method by adding an extra dimension.
We shall multiply matrices, not on a ring, but on a torus!
Matrix Multiplication on a Torus
Suppose A and B are square matrices 2m x 2m.
Then to form C = AB
We can actually think of A and B as 2 x 2 matrices, whose entries are m by m matrices.
Thus:
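Written out (this is the standard 2 x 2 block-matrix identity, spelled out here for reference):
\[
\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}
\begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}
=
\begin{pmatrix}
A_{11}B_{11} + A_{12}B_{21} & A_{11}B_{12} + A_{12}B_{22} \\
A_{21}B_{11} + A_{22}B_{21} & A_{21}B_{12} + A_{22}B_{22}
\end{pmatrix}
\]
where each \( A_{ij} \) and \( B_{ij} \) is an m x m block.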
Matrix Multiplication on a Torus
For Fox's algorithm, we assume that the factor matrices A = (a_{ij}) and B = (b_{ij}) have order n.
We will need the number of processes, p, to be a perfect square whose square root evenly divides n.
So we can say p = q^2, and n_sub = n / q. (For example, n = 8 and p = 16 give q = 4 and n_sub = 2.)
This gives us a grid of q x q processes, each assigned an n_sub x n_sub sub-matrix to calculate.
Matrix Multiplication on a Torus
We could also think of A and B as m x m matrices whose elements are 2 x 2 matrices.
This is called block matrix multiplication, and we are going to do such a computation, on a torus.
Let's start by describing the algorithm; then you'll see why it's done on a torus.
We are going to have m x m tasks, and each task will manage the appropriate entry in the matrices.
I.e. each task corresponds to an entry in the matrix.
Now the entries in the matrix will themselves be matrices of size blk x blk.
Thus in reality we will be multiplying square matrices of size (m · blk) x (m · blk).
Matrix Multiplication on a Torus
First let's explain why it's a torus.
We have already seen an algorithm where we passed rows (of B) up, and received rows (of B) from below.
In this algorithm we will be passing blocks of B up, and receiving blocks of B from below.
In addition to this we will be broadcasting blocks of A horizontally.
Thus we have rings up, and rings around.
That's a torus.
Matrix Multiplication on a Torus
Here is how to visualise it as a torus.
We start with a square matrix.
Each entry in the matrix will actually be a block.
Furthermore, for each block there will be a task.
Thus each task will manage three blocks: an A block, a B block, and a C block, the corresponding answer block.
So we can think of the +'s as either blocks or tasks.
Matrix Multiplication on a Torus
Now the bottom row's nearest neighbours lie in the row above and in the top row.
Thus we can think of the thing as rolled over to form a tube.
The same reasoning goes for the columns; thus the tube is itself rolled over and joined.
Matrix Multiplication on a Torus
Now this is how the algorithm works:
Each task starts out with an A block, a B block, and a C block (initialised to zero).
In step one, the tasks on the diagonal (i.e. row i = column i) broadcast their A block along their row.
They each then multiply this A block with their B block and add it to their C block.
Now we rotate the B blocks.
We pass our B block to the task above us (if we are in the top row, we pass our B to the task in the bottom row); if you think of the thing as a torus, we are just passing it up.
Matrix Multiplication on a Torus
The process now iterates:
We go to the next diagonal along.
I.e. the tasks on the shifted diagonal (row i, column (i + 1) mod m) broadcast their A's, do the multiplication, add it to C, then pass the B's up.
After m iterations everything returns to its original place and we have our result C = AB.
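To state the pattern compactly (this formula is not on the slides, but it follows directly from the description above): at stage s, for s = 0, ..., m - 1, the task at position (i, j) receives the broadcast block \( A_{i,(i+s) \bmod m} \) and multiplies it by the B block it currently holds, \( B_{(i+s) \bmod m,\, j} \):
\[
C_{ij} \leftarrow C_{ij} + A_{i,(i+s) \bmod m}\, B_{(i+s) \bmod m,\, j}.
\]
Summing over all m stages gives \( C_{ij} = \sum_{k} A_{ik} B_{kj} \), i.e. C = AB.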
Matrix Multiplication on a Torus
Before we move on, I will run through a simple example of the algorithm in action.
We will apply the algorithm to 2 x 2 matrices A and B. The product will be a 2 x 2 matrix C.
Matrix Multiplication on a Torus
We will use n^2 = 4 processes to perform our calculation; think of them as arranged in a 2 x 2 grid, with P_{0,0} and P_{0,1} in the top row and P_{1,0} and P_{1,1} in the bottom row.
The algorithm starts off with all elements of matrix C initialised to 0.
Matrix Multiplication on a Torus
We start in stage 0. We want A_{i,i} on process P_{i,j}, so we broadcast the diagonal elements of A across the rows.
The A elements on each row will then be A_{0,0} on every process in row 0, and A_{1,1} on every process in row 1.
Matrix Multiplication on a Torus
Block B_{i,j} is placed on process P_{i,j}.
Now the A and B values on each process will be:
P_{0,0}: A_{0,0}, B_{0,0};   P_{0,1}: A_{0,0}, B_{0,1};
P_{1,0}: A_{1,1}, B_{1,0};   P_{1,1}: A_{1,1}, B_{1,1}.
Matrix Multiplication on a Torus
Now we can calculate our first intermediate result:
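Spelling this stage out (the values follow directly from the algorithm): each process multiplies its current A and B blocks and adds the product to its C block:
\[
C_{0,0} = A_{0,0} B_{0,0}, \quad C_{0,1} = A_{0,0} B_{0,1}, \quad
C_{1,0} = A_{1,1} B_{1,0}, \quad C_{1,1} = A_{1,1} B_{1,1}.
\]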
Once this is done, we can move to stage 1.
Matrix Multiplication on a Torus
We start by broadcasting the next A column across each row. For the first row, A_{0,1} will be broadcast.
For the second row, A_{1,0} will be broadcast; in this case the column index has wrapped around (mod n).
This gives each process the following A values: A_{0,1} on both processes in row 0, and A_{1,0} on both processes in row 1.
Matrix Multiplication on a Torus
In this step we complete our torus: we shift B up, wrapping around according to mod n.
B_{1,0} moves up from P_{1,0} to P_{0,0}.
B_{0,0} moves up from P_{0,0} to P_{1,0} (wrapping around, mod n). Likewise, in the second column B_{1,1} moves to P_{0,1} and B_{0,1} moves to P_{1,1}.
Matrix Multiplication on a Torus
Now we can calculate our final result:
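Spelling this final stage out as well: each process multiplies the newly broadcast A block by the B block it received from below and adds the product to C, giving the complete product C = AB:
\[
C_{0,0} = A_{0,0} B_{0,0} + A_{0,1} B_{1,0}, \qquad
C_{0,1} = A_{0,0} B_{0,1} + A_{0,1} B_{1,1},
\]
\[
C_{1,0} = A_{1,0} B_{0,0} + A_{1,1} B_{1,0}, \qquad
C_{1,1} = A_{1,0} B_{0,1} + A_{1,1} B_{1,1}.
\]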
The algorithm is now complete after n stages.
The Torus MPI Implementation
To implement the algorithm in MPI we need to:
Be able to broadcast our \( A \) block along our row of sibling processes.
Have a simple way of referring to the source and destination in our \( B \) column of sibling processes.
To do these both nicely and efficiently in MPI requires futzing with:
communicators.
topologies.
These will be the topic for today.
The Torus MPI Implementation
A communicator in MPI is:
A collection of processes that can communicate with one another, and
Participate in collective communication operations.
A communicator consists of:
A group: an ordered set of processes.
A context: a handle that uniquely identifies the communicator.
The Torus MPI Implementation
There are various methods for constructing communicators.
The simplest three are:
Subdividing an existing communicator using:
MPI_Comm_split (a minimal sketch follows this list).
Constructing a communicator from a subset of an existing communicator using
MPI_Comm_group, MPI_Group_incl, and MPI_Comm_create.
Using topologies; this is done in two separate steps:
imposing a grid-like topology on an already existing communicator using MPI_Cart_create,
then using this grid-like topology to form column and row communicators using MPI_Cart_sub.
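To make the first method concrete, here is a minimal sketch of carving row communicators straight out of MPI_COMM_WORLD with MPI_Comm_split. It assumes p = q x q processes numbered in row-major order; the value of q is illustrative, not part of the slides.

int world_rank;
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
int q = 2;                      // assumed grid side, so p = q * q processes
int colour = world_rank / q;    // same colour (row index) => same communicator
int key    = world_rank % q;    // key orders processes by column within the row
MPI_Comm row_comm;
MPI_Comm_split(MPI_COMM_WORLD, colour, key, &row_comm);

For Fox's algorithm we will instead use the topology route described below.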
The Torus MPI Implementation
A topology on a communicator is a scheme for addressing processes in that communicator.
MPI offers two different schemes:
A Cartesian Grid based scheme.
A Graph-like scheme.
We are only interested in the former.
To create a grid-like topology on a communicator one needs to specify:
The number of dimensions in the grid.
The size of each dimension.
Wraparound: whether the endpoints in a particular dimension should be considered to be neighbours.
The Torus MPI Implementation
In the torus for the Fox algorithm we have:
A two-dimensional grid; in fact, a square grid.
Both dimensions have the same size: m, the dimension of our matrix of blocks.
For a torus we have wraparound in both dimensions.
Sure, why not?
The Torus MPI Implementation
To create our grid communicator we execute:
MPI_Comm grid;              // The resulting communicator
int dim = 2;                // The number of dimensions
int dims[2] = { m, m };     // The size of each dimension
int wrap[2] = { 1, 1 };     // Wraparound in each dimension?
int optimize = 1;           // Allow MPI to reorder (optimise) ranks?
MPI_Cart_create(MPI_COMM_WORLD, dim, dims, wrap, optimize, &grid);
After executing this, grid will be a communicator with our desired grid topology.
Because the topology may be optimised, a process's rank in grid may differ from its rank in MPI_COMM_WORLD.
In other words, optimising may reorder the processes.
MPI_Cart_create is a collective operation!
Every process must perform it.
The Torus MPI Implementation
To find out one's rank in the new communicator:
int grid_rank;
MPI_Comm_rank(grid, &grid_rank);
To find out one's coordinates in the grid:
// dim == 2
int coords[dim];
MPI_Cart_coords(grid, grid_rank, dim, coords);
Note that we need our grid rank to find our coordinates.
The Torus MPI Implementation
To make a communicator corresponding to our row:
// dim == 2
MPI_Comm row;
int direction[2] = { 0, 1 };   // keep the column dimension varying
MPI_Cart_sub(grid, direction, &row);
To make a communicator corresponding to our column:
// dim == 2
MPI_Comm col;
int direction[2] = { 1, 0 };   // keep the row dimension varying
MPI_Cart_sub(grid, direction, &col);
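With grid, row, and col in hand, each stage of Fox's algorithm is just a broadcast along row and a shift along col. The following is a minimal sketch, not the full program from the slides: the block size blk, the grid side m, the coords array obtained from MPI_Cart_coords, the scratch buffer a_tmp, and the helper local_mult are names introduced here for illustration.

#include <string.h>
#include <mpi.h>

// C += A * B on blk x blk row-major blocks (illustrative helper).
static void local_mult(const double *a, const double *b, double *c, int blk)
{
    for (int i = 0; i < blk; i++)
        for (int k = 0; k < blk; k++)
            for (int j = 0; j < blk; j++)
                c[i * blk + j] += a[i * blk + k] * b[k * blk + j];
}

// One possible per-process Fox loop over the m x m grid of blocks.
// a_blk, b_blk, c_blk are this process's blk x blk blocks; c_blk starts at zero.
static void fox_multiply(double *a_blk, double *b_blk, double *c_blk,
                         double *a_tmp, int blk, int m, const int coords[2],
                         MPI_Comm row, MPI_Comm col)
{
    for (int stage = 0; stage < m; stage++) {
        // Column of the A block used this stage; it is also the
        // broadcasting process's rank inside the row communicator.
        int root = (coords[0] + stage) % m;
        if (coords[1] == root)
            memcpy(a_tmp, a_blk, (size_t)blk * blk * sizeof(double));
        MPI_Bcast(a_tmp, blk * blk, MPI_DOUBLE, root, row);   // broadcast A along the row

        local_mult(a_tmp, b_blk, c_blk, blk);                 // C += A * B locally

        // Pass our B block up one row and receive one from below;
        // computing the ranks mod m gives the torus wraparound.
        int up   = (coords[0] + m - 1) % m;
        int down = (coords[0] + 1) % m;
        MPI_Sendrecv_replace(b_blk, blk * blk, MPI_DOUBLE,
                             up, 0, down, 0, col, MPI_STATUS_IGNORE);
    }
}

After m stages every C block holds its final value and the B blocks are back where they started.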