Flynn's taxonomy provides a framework to classify computer architectures based on the number of instruction streams and the number of data streams the system can simultaneously manage.
Flynn's taxonomy only refers to the hardware of the system.
The categories within Flynn's taxonomy can be broken down further based upon software implementations that operate on the different platforms.
Flynn's Taxonomy
A classical von Neumann system is categorised as a single instruction, single data stream (SISD) system.
Such a system executes a single instruction at a time on a single item of data.
Examples: PIC and AVR microcontrollers, single-core x86-64 processors, etc.
Flynn's Taxonomy
Single instruction, multiple data (SIMD) systems are parallel machines that operate on multiple data streams.
In such a system, the same instructions are applied to multiple (different) data items.
These systems can be thought of as having a single control unit with multiple Arithmetic Logic Units (ALUs).
Instructions are broadcast from the control unit, and each ALU applies the same instruction to its own data item.
The ALUs are either all executing the same instruction or idle (i.e. they cannot execute different logic).
Flynn's Taxonomy
The SIMD approach:
Flynn's Taxonomy
A motivating example:
for (i = 0; i < n; i++) {
    x[i] += y[i];
}
If we have a system with n ALUs, then the operation on x[i] and y[i] can be performed by the ith ALU.
Parallelism achieved through the SIMD approach is referred to as Data Parallelism.
SIMD is very efficient for large data-parallel problems.
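As a rough illustration only (my own sketch, not from the lecture): the same loop written with x86 SSE intrinsics, assuming an x86 target, a compiler that provides <xmmintrin.h>, and that n is a multiple of 4. Each _mm_add_ps applies one add instruction to four floats at once, which is the SIMD idea in miniature.

#include <xmmintrin.h>

void vec_add_simd(float *x, const float *y, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 xv = _mm_loadu_ps(&x[i]);            /* load 4 floats from x */
        __m128 yv = _mm_loadu_ps(&y[i]);            /* load 4 floats from y */
        _mm_storeu_ps(&x[i], _mm_add_ps(xv, yv));   /* x[i..i+3] += y[i..i+3] */
    }
}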
Flynn's Taxonomy
Graphics Processing Units make use of the SIMD approach for performing the complex calculations required to render detailed 3D graphics.
GPUs use a processing pipeline to convert an internal representation into an array of pixels that are sent to the screen.
Stages within the pipeline are programmable through shader functions
NVIDIA's Compute Unified Device Architecture (CUDA) and OpenCL provide platforms for general-purpose computation on these GPU SIMD devices.
GPUs now enjoy wide-spread use for high-performance computing tasks.
Flynn's Taxonomy
A CUDA kernel function that performs an addition operation over N items.
// Kernel definition
__global__ void VecAdd(float* A, float* B, float* C){
int i = threadIdx.x;
C[i] = A[i] + B[i];
}
int main() {
... // Kernel invocation with N threads
VecAdd<<<1, N>>>(A, B, C); // 1 block of N threads; thread i computes C[i]
...
}
Flynn's Taxonomy
Multiple Instruction, Multiple Data (MIMD) systems support simultaneous instruction streams operating over multiple data streams.
They are typically constructed using a set of independent processors, where each controls its own ALU.
Instructions in MIMD platforms are usually executed asynchronously; there is typically no global clock signal.
Flynn's Taxonomy
MIMD systems can be implemented using shared memory or distributed memory
In a shared memory arrangement, each processing unit can access a global block of memory - e.g. A multicore processor
In a distributed memory arrangement, each processing unit has its own memory and must rely on IPC to work with other processing units.
The Beowulf cluster we will be working with is an example of a distributed memory platform.
We will work with the Message Passing Interface (MPI) software platform to provide IPC.
Flynn's Taxonomy
Multiple Instruction, Single Data (MISD) systems are generally implemented to achieve fault tolerance.
Multiple processors independently operate on the same stream of data.
This approach can be used to detect and correct errors in calculations.
Flynn's Taxonomy
At the software level, MIMD hardware can support two common architectures:
Single Program, Multiple Data streams (SPMD)
Multiple Program, Multiple Data streams (MPMD)
SPMD is the most common approach: a single executable is run across all processing units but behaves differently on different processors (see the sketch below).
Some of the multi-process programs we constructed in the first module fall into this category.
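A minimal sketch of the SPMD idea (my own example, not code from the module), written with MPI, which is covered in detail later: one executable whose behaviour is selected by the process's rank.

#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* The same executable acts as a coordinator on rank 0 ... */
        printf("rank 0: coordinating\n");
    } else {
        /* ... and as a worker on every other rank. */
        printf("rank %d: working\n", rank);
    }

    MPI_Finalize();
    return 0;
}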
Flynn's Taxonomy
In the MPMD architecture, different executables run on the different processing units.
A master-slave architecture is an example of this approach.
An MPMD architecture can be implemented with multiple processes using the exec( ... ) system call to execute separate programs.
MPI supports the MPMD architecture by allowing different nodes in the cluster to run different programs.
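For example (the program names are hypothetical), Open MPI and MPICH-style launchers accept colon-separated application contexts, so one master and four worker processes could be started with:

mpirun -np 1 ./master : -np 4 ./worker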
Performance
The performance of a parallel program can be evaluated using two metrics:
Speedup - The time saved from parallelism
Efficiency - A measure of how effectively the system is being used.
Speedup (S) is given by the expression:
\( S = \frac{T_{serial}}{T_{parallel}}\)
Where, \(T_{serial}\) is the serial runtime of the program and \(T_{parallel}\) is the parallel runtime.
The theoretical best case is linear speedup, where \(T_{parallel} = \frac{T_{serial}}{p}\) and p is the number of processes.
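As an illustration (the numbers are mine, not from the lecture): if a program takes \(T_{serial} = 24\)s to run serially and \(T_{parallel} = 6\)s on \(p = 4\) processes, then \(S = 24/6 = 4\), which is exactly linear speedup for 4 processes.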
Performance
Efficiency (E) is given by:
\( E = \frac{S}{p} = \frac{\big(\frac{T_{serial}}{T_{parallel}}\big)}{p} = \frac{T_{serial}}{p \cdot T_{parallel}}\)
Where, S is the speedup, p is the number of processes, \(T_{serial}\) is the serial runtime of the program and \(T_{parallel}\) is the parallel runtime
\(T_{parallel}\), S and E all depend on the number of processes and threads.
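Continuing the illustrative numbers above: a speedup of \(S = 4\) on \(p = 4\) processes gives \(E = 4/4 = 1\); obtaining the same speedup of 4 with \(p = 8\) processes would give only \(E = 4/8 = 0.5\), i.e. only half of the available processing power is doing useful work.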
Performance
Unfortunately in the real world there are additional overheads that are associated with parallel execution.
We have looked at some of these when we covered processes and threads:
Context switches.
Maintaining meta-data for processes and threads (e.g. program counters, file descriptors etc.)
Network communication times in distributed systems.
Delays due to mutual exclusion etc.
Performance
This means that the actual parallel runtime will be \(T_{parallel} = \frac{T_{serial}}{p} + T_{overhead}\), so the speedup achieved in practice falls short of the linear best case.
The MPI Paradigm
Note that the number of processes is a command-line argument to mpirun!
Thus we never explicitly write forks, spawns, creates or whatever.
We will cover MPI in two stages:
Elementary MPI
Advanced MPI
Elementary MPI is small - just six (6) functions.
Advanced MPI is large: 372 functions for MPI-3
The MPI Paradigm
One can master elementary MPI using just 6 functions:
MPI_Init -- initializing
MPI_Comm_size -- number of peers
MPI_Comm_rank -- rank amongst peers
MPI_Send -- sending
MPI_Recv -- receiving
MPI_Finalize -- exiting
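A minimal sketch (my own, not from the lecture) that exercises all six functions: rank 0 sends a single int to rank 1. Run it with at least two processes.

#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv) {
    int me, nproc, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    if (nproc >= 2) {
        if (me == 0) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (me == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d from rank 0\n", value);
        }
    }

    MPI_Finalize();
    return 0;
}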
MPI Elementary Functions
#include "mpi.h"
int MPI_Init(int *argc, char ***argv)
Initializes the MPI execution environment
On exit from this routine, all processes will have a copy of the argument list.
Compared to main there is an extra * in both arguments!
Thus a typical call looks like:
MPI_Init(&argc, &argv);
MPI Elementary Functions
MPI itself specifies no command-line arguments, but it does allow an MPI implementation to make use of them.
Distributing the argument list to every process is not required by the MPI standard.
Portable programs should not rely on it.
It is provided as a service by the implementation we use.
MPI Elementary Functions
An MPI implementation is allowed to distribute the command line arguments but is not required to.
It is actually mpirun that distributes argc and argv.
All MPI routines return an error value; MPI_SUCCESS indicates success.
I'll omit error checking from my code; you shouldn't!
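A minimal sketch (my own, not from the lecture) of checking the value returned by an MPI call:

#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

int main(int argc, char **argv) {
    /* Check the error code returned by MPI_Init before going any further. */
    int rc = MPI_Init(&argc, &argv);
    if (rc != MPI_SUCCESS) {
        fprintf(stderr, "MPI_Init failed with error code %d\n", rc);
        return EXIT_FAILURE;
    }
    /* ... rest of the program ... */
    MPI_Finalize();
    return EXIT_SUCCESS;
}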
MPI Elementary Functions
#include "mpi.h"
int MPI_Finalize()
Terminates MPI execution environment
All processes must call this routine before exiting.
The MPI standard does not say what a program can do before an MPI_INIT or after an MPI_FINALIZE.
You should do as little as possible:
It is best not to perform much more than a return after calling MPI_Finalize.
In particular, avoid anything that changes the external state of the program, such as opening files, reading standard input or writing to standard output.
The number of processes running after this routine is called is undefined.
MPI Elementary Functions
Self-Discovery in MPI involves querying the system to answer two questions:
How many processes are there? size
Which one of them am I? rank
How many?
Is answered using: MPI_Comm_size.
Which one?
Is answered using: MPI_Comm_rank.
The rank ranges from 0 up to size - 1.
MPI Elementary Functions
whoami.c
#include <stdio.h>
#include "mpi.h"
int main(int argc, char** argv) {
int me, nproc;
MPI_Init(&argc, &argv); /* set up the MPI environment */
MPI_Comm_rank(MPI_COMM_WORLD, &me); /* my rank within MPI_COMM_WORLD */
MPI_Comm_size(MPI_COMM_WORLD, &nproc); /* total number of processes */
printf("I'm clone %d out of %d total!\n", me, nproc);
MPI_Finalize(); /* shut MPI down before exiting */
return 0;
}
MPI Elementary Functions
Process numbers bigger than 20 cause noticeable load spikes on turing!
Process numbers bigger than 50 might make you very unpopular!
[cosc330@bourbaki examples] $ mpiexec -np 20 whoAmI
I'm clone 8 out of 20 total!
I'm clone 10 out of 20 total!
I'm clone 11 out of 20 total!
I'm clone 12 out of 20 total!
I'm clone 13 out of 20 total!
I'm clone 17 out of 20 total!
I'm clone 0 out of 20 total!
I'm clone 1 out of 20 total!
I'm clone 2 out of 20 total!
I'm clone 3 out of 20 total!
I'm clone 6 out of 20 total!
I'm clone 7 out of 20 total!
I'm clone 16 out of 20 total!
I'm clone 19 out of 20 total!
I'm clone 4 out of 20 total!
I'm clone 5 out of 20 total!
I'm clone 15 out of 20 total!
I'm clone 18 out of 20 total!
I'm clone 9 out of 20 total!
I'm clone 14 out of 20 total!
[cosc330@bourbaki examples] $
MPI Elementary Functions
whereAmI.c
#include <stdio.h>
#include <unistd.h>
#include <limits.h>
#include "mpi.h"
int main(int argc, char** argv) {
int me, nproc;
char hostname[PATH_MAX];
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &me);
MPI_Comm_size(MPI_COMM_WORLD, &nproc);
if(gethostname(hostname, PATH_MAX) == 0){
printf("I'm clone %d out of %d total located @ %s\n",
me, nproc, hostname);
} else {
printf("I'm a lost clone %d out of %d total!\n",
me, nproc);
}
MPI_Finalize();
return 0;
}
MPI Elementary Functions
[cosc330@bourbaki examples] $ mpiexec -np 10 whereAmI
I'm clone 0 out of 10 total located @ bourbaki.une.edu.au
I'm clone 1 out of 10 total located @ bourbaki.une.edu.au
I'm clone 2 out of 10 total located @ bourbaki.une.edu.au
I'm clone 4 out of 10 total located @ bourbaki.une.edu.au
I'm clone 7 out of 10 total located @ bourbaki.une.edu.au
I'm clone 8 out of 10 total located @ bourbaki.une.edu.au
I'm clone 9 out of 10 total located @ bourbaki.une.edu.au
I'm clone 3 out of 10 total located @ bourbaki.une.edu.au
I'm clone 5 out of 10 total located @ bourbaki.une.edu.au
I'm clone 6 out of 10 total located @ bourbaki.une.edu.au
[cosc330@bourbaki examples] $
MPI Elementary Functions
So far we haven't actually distributed our processing across the nodes of the cluster.
All processes are executing on bourbaki
We need to create a multi-node machine and instruct MPI on how to use it.
The first step is to create a hostfile that lists all of the nodes that we are going to use.
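The exact contents of the course's b1tob4 hostfile are an assumption on my part, but an Open MPI-style hostfile for the four single-core b-nodes seen in the output below would simply list one hostname per line:

b1
b2
b3
b4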
Now we can use the host file to tell MPI to run our program on the nodes listed.
We will use the -map-by option to specify how many instances of the program will run on each node:
[cosc330@bourbaki examples] $ mpirun -np 5 --map-by node --hostfile b1tob4 whereAmI
I'm clone 0 out of 5 total located @ b1
I'm clone 4 out of 5 total located @ b1
I'm clone 1 out of 5 total located @ b2
I'm clone 2 out of 5 total located @ b3
I'm clone 3 out of 5 total located @ b4
[cosc330@bourbaki examples] $ mpirun -np 10 --map-by node --hostfile b1tob4 whereAmI
I'm clone 3 out of 10 total located @ b4
I'm clone 7 out of 10 total located @ b4
I'm clone 9 out of 10 total located @ b2
I'm clone 5 out of 10 total located @ b2
I'm clone 0 out of 10 total located @ b1
I'm clone 4 out of 10 total located @ b1
I'm clone 1 out of 10 total located @ b2
I'm clone 2 out of 10 total located @ b3
I'm clone 6 out of 10 total located @ b3
I'm clone 8 out of 10 total located @ b1
[cosc330@bourbaki examples] $
MPI Elementary Functions
The b-nodes are single core.
[cosc330@bourbaki examples] $ mpirun -np 10 --map-by core --hostfile b1tob4 whereAmI
I'm clone 8 out of 10 total located @ b4
I'm clone 9 out of 10 total located @ b4
I'm clone 1 out of 10 total located @ b1
I'm clone 2 out of 10 total located @ b1
I'm clone 5 out of 10 total located @ b2
I'm clone 3 out of 10 total located @ b2
I'm clone 7 out of 10 total located @ b3
I'm clone 0 out of 10 total located @ b1
I'm clone 6 out of 10 total located @ b3
I'm clone 4 out of 10 total located @ b2
[cosc330@bourbaki examples] $
MPI Elementary Functions
#include "mpi.h"
int MPI_Comm_rank ( MPI_Comm comm, int *rank )
Determines the rank of the calling process in the communicator comm, placed in the location rank.
A communicator is essentially a collection of processes that can send each other messages.
The rank is a number between 0 and size - 1 (one less than the size).
The rank of a process (in the communicator) is also used as its address, when sending or receiving messages.
MPI Elementary Functions
We'll mostly use MPI_COMM_WORLD as the first argument, for a while.
It consists of all the processes running when program execution begins (i.e. the <num> used in the mpirun -np <num> <prog>).
Typical call:
MPI_Comm_rank(MPI_COMM_WORLD,&me);
MPI Elementary Functions
#include "mpi.h"
int MPI_Comm_size ( MPI_Comm comm, int *size )
Puts the number of processes in the group of comm into size
We'll mostly use MPI_COMM_WORLD as the first argument, for a while.
Typical call:
MPI_Comm_size(MPI_COMM_WORLD,&nproc);
Summary
Flynn's Taxonomy
Performance
Amdahl's Law
The Beowulf Cluster
Message Passing Interface
MPI Elementary Functions
Reading
Chapters 2 and 3 from An Introduction to Parallel Programming by Peter Pacheco