Title: COMP309/509 - Lecture 18

COMP309/509 - Parallel and Distributed Computing

Lecture 18 - PVM Matrix Multiplication

By Mitchell Welch

University of New England

Reading

The Program Examples from PVM: Parallel Virtual Machine A Users' Guide and Tutorial for Networked Parallel Computing
Assignment 3
Project (COMP509)

Summary

Matrix Multiplication on a Torus
Torus Example Implementation

Matrix Multiplication on a Torus

You are all familiar, by now, with the idea of multiplying matrices by passing the rows of B around a ring.
What I want to do is elaborate on this method by adding an extra dimension.
We shall multiply matrices, not on a ring, but on a torus!

Matrix Multiplication on a Torus

This algorithm is called Fox’s Matrix Multiplication.
It comes from the PVM book pages 76 through 83.

http://www.netlib.org/pvm3/book/pvm-book.html

So you may want to read that part of the book.
A similar but not identical algorithm also makes an appearance, with pictures, on pages 36 and 37 of the PVM book.
The first really nice property of matrices that the algorithm uses is that one can divide the computation up into blocks.

Matrix Multiplication on a Torus

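Suppose A and B are square 2m x 2m matrices.
To form C = AB, we can actually think of A and B as 2 x 2 matrices whose entries are m x m matrices. Thus:

$$
\begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix}
=
\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}
\begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix},
\qquad
C_{ij} = A_{i1} B_{1j} + A_{i2} B_{2j}
$$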

Matrix Multiplication on a Torus

We could also think of A and B as m x m matrices whose elements are 2 x 2 matrices.
This is called block matrix multiplication, and we are going to do such a computation on a torus.
Let’s start by describing the algorithm, then you’ll see why it’s done on a torus. We are going to have m x m tasks, and each task will manage the appropriate entry in the matrices.
I.e. each task corresponds to an entry in the matrix.
Now the entries in the matrix will themselves be square matrices (blocks) of size blk x blk.
Thus in reality we will be multiplying square matrices with m x blk rows and columns.
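
To make the block view concrete, here is a minimal sketch in Python (not taken from the PVM book) that cuts a square matrix into an m x m grid of blk x blk blocks. The function name `split_into_blocks` and the list-of-lists layout are assumptions made purely for illustration.

```python
def split_into_blocks(matrix, blk):
    """Return an m x m grid whose entries are blk x blk blocks of `matrix`."""
    n = len(matrix)
    if n % blk != 0:
        raise ValueError("matrix side must be a multiple of blk")
    m = n // blk
    return [
        [
            # Block (i, j): rows i*blk .. (i+1)*blk - 1, columns j*blk .. (j+1)*blk - 1.
            [row[j * blk:(j + 1) * blk] for row in matrix[i * blk:(i + 1) * blk]]
            for j in range(m)
        ]
        for i in range(m)
    ]


# A 4 x 4 matrix seen as a 2 x 2 grid (m = 2) of 2 x 2 blocks (blk = 2).
matrix = [[4 * r + c for c in range(4)] for r in range(4)]
grid = split_into_blocks(matrix, blk=2)
print(grid[0][1])  # top-right block: [[2, 3], [6, 7]]
```

Each entry `grid[i][j]` is what a single task would hold, so the grid has m x m tasks and the full matrix has m x blk rows and columns.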

Matrix Multiplication on a Torus

First let’s explain why it’s a torus.
We have already seen an algorithm where we passed rows (of B ) up, and received rows (of B ) from below.
In this algorithm we will be passing blocks of B up, and receiving blocks of B from below.
In addition to this we will be (broadcasting) blocks of A horizontally.
Thus we have rings up, and rings around.
That’s a torus.

Matrix Multiplication on a Torus

Here is how to visualize it as a torus.
We start with a square matrix.
Each entry in the matrix will actually be a block.
Furthermore for each block there will be a task.
Thus tasks will manage three blocks, an A block, a B block, and a C block, the corresponding answer block.
So we can think of the +’s as either blocks or tasks.

Matrix Multiplication on a Torus

Now the bottom row’s nearest neighbours lie in the row above it and in the top row.
Thus we can think of the thing as rolled over to form a tube.
The same reasoning goes for the columns, thus the tube is itself rolled over and its ends joined.

Matrix Multiplication on a Torus

[Figure: the grid of blocks rolled into a tube, and the tube joined end to end to form a torus]

Torus Example Implementation

Now this is how the algorithm works:
Each task starts out with an A block, a B block, and a C block (initialized to zero).
In step one, the tasks on the diagonal (i.e. row i = column i) broadcast their A block along their row. Every task in the row then multiplies this A block with its own B block and adds the product to its C block.
Now we rotate the B blocks.
We pass our B block to the task above us. If we are in the top row we pass our B block to the task in the bottom row, but if you think of the thing as a torus we are just passing it up.

Torus Example Implementation

The process now iterates:
- We go to the next diagonal along.
- I.e. tasks on the next diagonal (i.e. row i, column i + 1, wrapping around past the last column) broadcast their A’s, do the multiplication, add it to C, then pass the B’s up.
After m iterations everything returns to its original place and we have finished: our result is C = AB.
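
The steps above can be checked with a small serial simulation. This is only a sketch in Python, not the PVM program from the book: there are no tasks or messages, the "broadcast" is a plain variable and the "pass up" is a list rotation. The names `fox_multiply` and `mat_mul_add`, and the list-of-lists block layout, are assumptions for illustration.

```python
def mat_mul_add(c, a, b):
    """c += a * b for small square blocks stored as lists of lists."""
    n = len(a)
    for i in range(n):
        for j in range(n):
            c[i][j] += sum(a[i][k] * b[k][j] for k in range(n))


def fox_multiply(a_blocks, b_blocks):
    """Serial simulation of the torus algorithm on an m x m grid of blocks."""
    m = len(a_blocks)
    blk = len(a_blocks[0][0])
    # Every simulated task starts with a C block initialized to zero.
    c_blocks = [[[[0] * blk for _ in range(blk)] for _ in range(m)] for _ in range(m)]
    b = [row[:] for row in b_blocks]  # current placement of the B blocks
    for step in range(m):
        for i in range(m):
            # The task in column (i + step) mod m broadcasts its A block along row i.
            a_bcast = a_blocks[i][(i + step) % m]
            for j in range(m):
                mat_mul_add(c_blocks[i][j], a_bcast, b[i][j])
        # Rotate the B blocks up one row; the top row wraps around to the bottom.
        b = b[1:] + b[:1]
    return c_blocks


# 2 x 2 grid of 1 x 1 blocks: [[1, 2], [3, 4]] times [[5, 6], [7, 8]].
a = [[[[1]], [[2]]], [[[3]], [[4]]]]
b = [[[[5]], [[6]]], [[[7]], [[8]]]]
print(fox_multiply(a, b))  # [[[[19]], [[22]]], [[[43]], [[50]]]]
```

At step s the task in row i, column j holds the B block that started in row (i + s) mod m, and receives the A block from column (i + s) mod m, so over m steps it accumulates every term of its C block exactly once.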

Torus Example Implementation

The following is straight out of the book.
There are some obvious toy aspects to the code.
Like creating A, B, and C on the fly. We’ll obviously want to get them from files.
The toy program generates a random A, and uses the identity as B, thus we can easily check the answer by comparing the binary forms of A and C using cmp!
In fact we’ll have to think about this aspect of the algorithm in detail.
We will retain the choice of A and B since it is quite convenient for verification purposes. There is also an unnecessary goto in the code.
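
As a rough illustration of that verification trick, here is a Python sketch that multiplies a random A by the identity, writes A and C as raw doubles, and compares the two files byte for byte, which is what cmp does. The file names, the binary layout and the helper names are assumptions; the book's program may store its matrices differently.

```python
import filecmp
import os
import random
import tempfile
from array import array


def write_matrix(path, matrix):
    """Write the matrix row by row as raw doubles (no header)."""
    with open(path, "wb") as f:
        for row in matrix:
            array("d", row).tofile(f)


def multiply(a, b):
    """Plain serial product, standing in for the parallel program."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


n = 4
rng = random.Random(309)  # fixed seed so the run is repeatable
a = [[rng.random() for _ in range(n)] for _ in range(n)]
identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
c = multiply(a, identity)

with tempfile.TemporaryDirectory() as tmp:
    a_path = os.path.join(tmp, "a.dat")
    c_path = os.path.join(tmp, "c.dat")
    write_matrix(a_path, a)
    write_matrix(c_path, c)
    same = filecmp.cmp(a_path, c_path, shallow=False)  # byte-for-byte, like cmp

print("C matches A:", same)
```

The comparison can be exact here because multiplying by 1.0 and adding 0.0 does not change a non-negative double, so no floating-point tolerance is needed for this particular choice of B.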

Torus Example Implementation

The code is not particularly complex.
It uses dynamic groups for identification purposes.
Each task joins the group mmult, and receives its membership number (from 0 up to one less than the size of the group).
From this number it figures out its row and column number, the identity of the other tasks in its row, and the tasks immediately above it and below it (columnwise) on the torus.
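
The arithmetic involved is short enough to sketch. The Python below assumes the membership numbers are laid out row by row on an m x m grid; the function name `torus_position` and the returned field names are made up for this illustration. In the real program these numbers would still have to be turned into task ids (PVM provides `pvm_gettid` for that).

```python
def torus_position(me, m):
    """Map a membership number (0 .. m*m - 1) to its place on the m x m torus.

    Assumes membership numbers are laid out row by row (row-major).
    """
    row, col = divmod(me, m)
    return {
        "row": row,
        "col": col,
        # Membership numbers of every task in the same row (broadcast targets).
        "row_members": [row * m + c for c in range(m)],
        # Neighbours above and below, wrapping around the torus.
        "up": ((row - 1) % m) * m + col,
        "down": ((row + 1) % m) * m + col,
    }


# On a 3 x 3 torus, task 1 sits in the top row, so its "up" neighbour
# wraps around to task 7 in the bottom row.
print(torus_position(1, 3))
```

The modulo on the row index is the whole of the torus: the top row's "up" is the bottom row, and the bottom row's "down" is the top row.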

Torus Example Implementation

mm.0.0.c

Torus Example Implementation

OK, as I mentioned last time, we are going to try to tweak mmult until we get it running fast and smooth on the cluster.
We will do this in several stages.
All the programs are in:

http://turing.une.edu.au/~comp309/markdown_lectures/lecture_18/examples/Torus/

Please take a look and play with the stuff there.

Torus Example Implementation

We can expect to get an order of magnitude speed up, and an order of magnitude increase in the size of the matrices we can multiply.
Maybe more if we are optimists, or use algorithms specifically designed for distributed computing.
The best would be a speed up of 16 (for the same algorithm), and a size increase of two orders of magnitude.
We could probably do better on this last figure; after all, we could subdivide the blocks to multiply them, and so on and so forth.
That’s for another time.

Summary

Matrix Multiplication on a Torus
Torus Example Implementation


Questions?

Reading

The Program Examples from PVM: Parallel Virtual Machine A Users' Guide and Tutorial for Networked Parallel Computing
Assignment 3
Project (COMP509)