You are all familiar, by now, with the idea of multiplying matrices by passing the rows of B around a ring.
Now we will elaborate on this method by adding an extra dimension.
We shall multiply matrices, not on a ring, but on a torus!
Matrix Multiplication on a Torus
Suppose A and B are square matrices 2m x 2m.
Then to form C = AB
We can actually think of A and B as 2 x 2 matrices, whose entries are m by m matrices.
Thus:
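Written out (this is the standard 2 x 2 block-matrix identity, spelled out here for reference):
\[
\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}
\begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}
=
\begin{pmatrix}
A_{11}B_{11} + A_{12}B_{21} & A_{11}B_{12} + A_{12}B_{22} \\
A_{21}B_{11} + A_{22}B_{21} & A_{21}B_{12} + A_{22}B_{22}
\end{pmatrix}
\]
where each \( A_{ij} \) and \( B_{ij} \) is an m x m block.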
Matrix Multiplication on a Torus
For Fox's algorithm, we assume that the factor matrices A = (a_{ij}) and B = (b_{ij}) have order n.
We will need the number of processes, p, to be a perfect square whose square root evenly divides n.
So we can say p = q^2, and n_sub = n / q. (For example, n = 8 and p = 16 give q = 4 and n_sub = 2.)
This gives us a grid of q x q processes, each assigned an n_sub x n_sub sub-matrix to calculate.
Matrix Multiplication on a Torus
We could also think of A and B as m x m matrices whose elements are 2 x 2 matrices.
This is called block matrix multiplication, and we are going to do such a computation, on a torus.
Let's start by describing the algorithm; then you'll see why it's done on a torus.
We are going to have m x m tasks, and each task will manage the appropriate entry in the matrices.
I.e. each task corresponds to an entry in the matrix.
Now the entries in the matrix will themselves be matrices of size blk x blk.
Thus in reality we will be multiplying square matrices of size (m · blk) x (m · blk).
Matrix Multiplication on a Torus
First let's explain why it's a torus.
We have already seen an algorithm where we passed rows (of B) up, and received rows (of B) from below.
In this algorithm we will be passing blocks of B up, and receiving blocks of B from below.
In addition to this we will be broadcasting blocks of A horizontally.
Thus we have rings up, and rings around.
That's a torus.
Matrix Multiplication on a Torus
Here is how to visualise it as a torus.
We start with a square matrix.
Each entry in the matrix will actually be a block.
Furthermore, for each block there will be a task.
Thus each task will manage three blocks: an A block, a B block, and a C block, the corresponding answer block.
So we can think of the +'s as either blocks or tasks.
Matrix Multiplication on a Torus
Now the bottom row's nearest neighbours lie in the row above and in the top row.
Thus we can think of the thing as rolled over to form a tube.
The same reasoning goes for the columns; thus the tube is itself rolled over and joined.
Matrix Multiplication on a Torus
Now this is how the algorithm works:
Each task starts out with an A block, a B block, and a C block (initialised to zero).
In step one, the tasks on the diagonal (i.e. row i = column i) broadcast their A block along their row.
They each then multiply this A block with their B block and add it to their C block.
Now we rotate the B blocks.
We pass our B block to the task above us (if we are in the top row, we pass our B to the task in the bottom row); if you think of the thing as a torus, we are just passing it up.
Matrix Multiplication on a Torus
The process now iterates:
We go to the next diagonal along.
I.e. the tasks on the shifted diagonal (row i, column (i + 1) mod m) broadcast their A's, do the multiplication, add it to C, then pass the B's up.
After m iterations everything returns to its original place and we have our result C = AB.
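To state the pattern compactly (this formula is not on the slides, but it follows directly from the description above): at stage s, for s = 0, ..., m - 1, the task at position (i, j) receives the broadcast block \( A_{i,(i+s) \bmod m} \) and multiplies it by the B block it currently holds, \( B_{(i+s) \bmod m,\, j} \):
\[
C_{ij} \leftarrow C_{ij} + A_{i,(i+s) \bmod m}\, B_{(i+s) \bmod m,\, j}.
\]
Summing over all m stages gives \( C_{ij} = \sum_{k} A_{ik} B_{kj} \), i.e. C = AB.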
Matrix Multiplication on a Torus
Before we move on, I will run through a simple example of the algorithm in action.
We will apply the algorithm to 2 x 2 matrices A and B. The product will be a 2 x 2 matrix C.
Matrix Multiplication on a Torus
We will use n^2 = 4 processes to perform our calculation; think of them as arranged in a 2 x 2 grid, with P_{0,0} and P_{0,1} in the top row and P_{1,0} and P_{1,1} in the bottom row.
The algorithm starts off with all elements of matrix C initialised to 0.
Matrix Multiplication on a Torus
We start in stage 0. We want A_{i,i} on process P_{i,j}, so we broadcast the diagonal elements of A across the rows.
The A elements on each row will then be A_{0,0} on every process in row 0, and A_{1,1} on every process in row 1.
Matrix Multiplication on a Torus
Block B_{i,j} is placed on process P_{i,j}.
Now the A and B values on each process will be:
P_{0,0}: A_{0,0}, B_{0,0};   P_{0,1}: A_{0,0}, B_{0,1};
P_{1,0}: A_{1,1}, B_{1,0};   P_{1,1}: A_{1,1}, B_{1,1}.
Matrix Multiplication on a Torus
Now we can calculate our first intermediate result:
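Spelling this stage out (the values follow directly from the algorithm): each process multiplies its current A and B blocks and adds the product to its C block:
\[
C_{0,0} = A_{0,0} B_{0,0}, \quad C_{0,1} = A_{0,0} B_{0,1}, \quad
C_{1,0} = A_{1,1} B_{1,0}, \quad C_{1,1} = A_{1,1} B_{1,1}.
\]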
Once this is done, we can move to stage 1.
Matrix Multiplication on a Torus
We start by broadcasting the next A column across each row. For the first row, A_{0,1} will be broadcast.
For the second row, A_{1,0} will be broadcast; in this case the column index has wrapped around (mod n).
This gives each process the following A values: A_{0,1} on both processes in row 0, and A_{1,0} on both processes in row 1.
Matrix Multiplication on a Torus
In this step we complete our torus: we shift B up, wrapping around according to mod n.
B_{1,0} moves up from P_{1,0} to P_{0,0}.
B_{0,0} moves up from P_{0,0} to P_{1,0} (wrapping around, mod n). Likewise, in the second column B_{1,1} moves to P_{0,1} and B_{0,1} moves to P_{1,1}.
Matrix Multiplication on a Torus
Now we can calculate our final result:
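Spelling this final stage out as well: each process multiplies the newly broadcast A block by the B block it received from below and adds the product to C, giving the complete product C = AB:
\[
C_{0,0} = A_{0,0} B_{0,0} + A_{0,1} B_{1,0}, \qquad
C_{0,1} = A_{0,0} B_{0,1} + A_{0,1} B_{1,1},
\]
\[
C_{1,0} = A_{1,0} B_{0,0} + A_{1,1} B_{1,0}, \qquad
C_{1,1} = A_{1,0} B_{0,1} + A_{1,1} B_{1,1}.
\]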
The algorithm is now complete after n stages.
The Torus MPI Implementation
To implement the algorithm in MPI we need to:
Be able to broadcast our \( A \) block along our row of sibling processes.
Have a simple way of referring to the source and destination in our \( B \) column of sibling processes.
To do these both nicely and efficiently in MPI requires futzing with:
communicators.
topologies.
These will be the topic for today.
The Torus MPI Implementation
A communicator in MPI is:
A collection of processes that can communicate with one another, and
Participate in collective communication operations.
A communicator consists of:
A group: an ordered set of processes.
A context: a handle that uniquely identifies the communicator.
The Torus MPI Implementation
There are various methods for constructing communicators.
The simplest three are:
Subdividing an existing communicator using:
MPI_Comm_split (a minimal sketch follows this list).
Constructing a communicator from a subset of an existing communicator using
MPI_Comm_group, MPI_Group_incl, and MPI_Comm_create.
Using topologies; this is done in two separate steps:
imposing a grid-like topology on an already existing communicator using MPI_Cart_create,
then using this grid-like topology to form column and row communicators using MPI_Cart_sub.
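To make the first method concrete, here is a minimal sketch of carving row communicators straight out of MPI_COMM_WORLD with MPI_Comm_split. It assumes p = q x q processes numbered in row-major order; the value of q is illustrative, not part of the slides.

int world_rank;
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
int q = 2;                      // assumed grid side, so p = q * q processes
int colour = world_rank / q;    // same colour (row index) => same communicator
int key    = world_rank % q;    // key orders processes by column within the row
MPI_Comm row_comm;
MPI_Comm_split(MPI_COMM_WORLD, colour, key, &row_comm);

For Fox's algorithm we will instead use the topology route described below.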
The Torus MPI Implementation
A topology on a communicator is a scheme for addressing processes in that communicator.
MPI offers two different schemes:
A Cartesian Grid based scheme.
A Graph-like scheme.
We are only interested in the former.
To create a grid-like topology on a communicator one needs to specify:
The number of dimensions in the grid.
The size of each dimension.
Wraparound: whether the endpoints in a particular dimension should be considered to be neighbours.
The Torus MPI Implementation
In the torus for the Fox algorithm we have:
A two-dimensional grid; in fact, a square grid.
Both dimensions have the same size: m, the dimension of our matrix of blocks.
For a torus we have wraparound in both dimensions.
Sure, why not?
The Torus MPI Implementation
To create our grid communicator we execute:
MPI_Comm grid;              // The resulting communicator
int dim = 2;                // The number of dimensions
int dims[2] = { m, m };     // The size of each dimension
int wrap[2] = { 1, 1 };     // Wraparound in each dimension?
int optimize = 1;           // Allow MPI to reorder (optimise) ranks?
MPI_Cart_create(MPI_COMM_WORLD, dim, dims, wrap, optimize, &grid);
After executing this, grid will be a communicator with our desired grid topology.
Because the topology may be optimised, a process's rank in grid may differ from its rank in MPI_COMM_WORLD.
In other words, optimising may reorder the processes.
MPI_Cart_create is a collective operation!
Every process must perform it.
The Torus MPI Implementation
To find out one's rank in the new communicator:
int grid_rank;
MPI_Comm_rank(grid, &grid_rank);
To find out one's coordinates in the grid:
// dim == 2
int coords[dim];
MPI_Cart_coords(grid, grid_rank, dim, coords);
Note that we need our grid rank to find our coordinates.
The Torus MPI Implementation
To make a communicator corresponding to our row:
// dim == 2
MPI_Comm row;
int direction[2] = { 0, 1 };   // keep the column dimension varying
MPI_Cart_sub(grid, direction, &row);
To make a communicator corresponding to our column:
// dim == 2
MPI_Comm col;
int direction[2] = { 1, 0 };   // keep the row dimension varying
MPI_Cart_sub(grid, direction, &col);
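With grid, row, and col in hand, each stage of Fox's algorithm is just a broadcast along row and a shift along col. The following is a minimal sketch, not the full program from the slides: the block size blk, the grid side m, the coords array obtained from MPI_Cart_coords, the scratch buffer a_tmp, and the helper local_mult are names introduced here for illustration.

#include <string.h>
#include <mpi.h>

// C += A * B on blk x blk row-major blocks (illustrative helper).
static void local_mult(const double *a, const double *b, double *c, int blk)
{
    for (int i = 0; i < blk; i++)
        for (int k = 0; k < blk; k++)
            for (int j = 0; j < blk; j++)
                c[i * blk + j] += a[i * blk + k] * b[k * blk + j];
}

// One possible per-process Fox loop over the m x m grid of blocks.
// a_blk, b_blk, c_blk are this process's blk x blk blocks; c_blk starts at zero.
static void fox_multiply(double *a_blk, double *b_blk, double *c_blk,
                         double *a_tmp, int blk, int m, const int coords[2],
                         MPI_Comm row, MPI_Comm col)
{
    for (int stage = 0; stage < m; stage++) {
        // Column of the A block used this stage; it is also the
        // broadcasting process's rank inside the row communicator.
        int root = (coords[0] + stage) % m;
        if (coords[1] == root)
            memcpy(a_tmp, a_blk, (size_t)blk * blk * sizeof(double));
        MPI_Bcast(a_tmp, blk * blk, MPI_DOUBLE, root, row);   // broadcast A along the row

        local_mult(a_tmp, b_blk, c_blk, blk);                 // C += A * B locally

        // Pass our B block up one row and receive one from below;
        // computing the ranks mod m gives the torus wraparound.
        int up   = (coords[0] + m - 1) % m;
        int down = (coords[0] + 1) % m;
        MPI_Sendrecv_replace(b_blk, blk * blk, MPI_DOUBLE,
                             up, 0, down, 0, col, MPI_STATUS_IGNORE);
    }
}

After m stages every C block holds its final value and the B blocks are back where they started.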