Note the use of cudaGetLastError() after the kernel call.
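A minimal sketch of that pattern (the kernel name and launch configuration here are illustrative, not taken from the lecture example):
MyKernel<<<numBlocks, threadsPerBlock>>>(devPtr);
cudaError_t err = cudaGetLastError();            // picks up any error from the launch
if (err != cudaSuccess)
    printf("Kernel launch failed: %s\n", cudaGetErrorString(err));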
The CUDA Memory Hierarchy
Last lecture we briefly looked at some of the different memory stores that are available on a CUDA device:
Global Memory
Shared Memory
Local Memory
Texture Memory
The CUDA Memory Hierarchy
Examples that we have worked with thus far have all made use of global and local memory.
Global memory resides in device memory, and device memory is accessed via 32-, 64-, or 128-byte memory transactions.
Within the GPU, memory transactions from multiple threads can be grouped (coalesced) into longer contiguous reads - this depends on the size and distribution of the memory regions being read.
Individual global memory instructions support reading or writing words of 1, 2, 4, 8, or 16 bytes.
The CUDA Memory Hierarchy
Any access to data residing in global memory compiles to a single instruction if and only if the data has one of these sizes and is naturally aligned.
E.g. accessing a single (4-byte) floating-point number requires only a single instruction, whereas accessing a memory segment of six char-type values (6 bytes) requires multiple operations.
Try to avoid odd memory access sizes to keep performance high.
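As a purely illustrative sketch of what this means in practice: a struct of three 2-byte values is a 6-byte element and cannot be fetched in one aligned transaction, whereas padding and aligning it to 8 bytes allows a single 8-byte access.
struct __align__(8) Sample {
    short x, y, z;   // 6 bytes of actual data
    short pad;       // padding so that each element occupies a full 8 bytes
};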
The CUDA Memory Hierarchy
Global memory is allocated using cudaMalloc() as linear blocks of memory.
Two-dimensional arrays can be inefficient - recall that a pointer-based 2D array consists of an array of pointers that point to the actual memory locations of the row/column arrays.
Each access requires two look-ups to get to the data item you want.
Solution - Flatten them out.
It's faster to calculate the position within a 1D array using the width and height than it is to go through the extra pointer lookups.
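A minimal sketch (with a hypothetical kernel) of a flattened 2D access - the element at (row, col) of a width x height array lives at index row * width + col of the 1D block:
__global__ void FlatKernel(float* data, int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < height && col < width)
        data[row * width + col] *= 2.0f;   // one index calculation, no pointer chasing
}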
Linear memory can also be allocated through cudaMallocPitch() and cudaMalloc3D().
These functions are recommended for allocations of 2D or 3D arrays as they make sure that the allocation is appropriately padded to meet the alignment requirements.
This ensures that accesses to the memory occur in transaction-sized blocks.
The returned pitch (or stride) must be used to access array elements.
Essentially, CUDA takes care of laying the memory out so that accesses stay efficient.
The CUDA Memory Hierarchy
Here is an optimised 2D array access example.
// Host code
int width = 64, height = 64;
float* devPtr;
size_t pitch;
cudaMallocPitch(&devPtr, &pitch, width * sizeof(float), height);
MyKernel<<<100, 512>>>(devPtr, pitch, width, height);
// Device code
__global__ void MyKernel(float* devPtr, size_t pitch, int width, int height) {
    for (int r = 0; r < height; ++r) {
        /* Tricky pointer arithmetic to get a pointer to the start of row r */
        float* row = (float*)((char*)devPtr + r * pitch);
        for (int c = 0; c < width; ++c) {
            float element = row[c];
        }
    }
}
The CUDA Memory Hierarchy
Shared memory works at the block-level - all threads within a given block can access the shared memory.
Shared memory is faster than global memory, so any chance to exploit this should improve performance.
Shared memory is declared using the __shared__ variable type qualifier.
__shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
The CUDA Memory Hierarchy
To motivate our discussion on shared memory, we will first look at an example of matrix multiplication on CUDA.
We've done this example on every other platform - why stop now?
The initial example is fairly simplistic - it uses one thread per output element.
The threads determine the element that they will work on using the blockIdx and threadIdx structures.
This version has each thread making multiple accesses to global memory.
// Matrix multiplication kernel called by MatMul()
__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C) {
    // Each thread computes one element of C
    // by accumulating results into Cvalue
    float Cvalue = 0;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    for (int e = 0; e < A.width; ++e)
        Cvalue += A.elements[row * A.width + e] * B.elements[e * B.width + col];
    C.elements[row * C.width + col] = Cvalue;
}
The CUDA Memory Hierarchy
We can improve this by copying blocks of global memory into the shared memory so that subsequent accesses are more efficient.
We have to re-work the arrangement of threads so that each block computes the results for one sub-matrix of C.
This will allow us to copy blocks of the global memory over to shared memory that the threads can operate on, as in the sketch below.
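A minimal sketch of the shared-memory (tiled) kernel, assuming the same Matrix struct as before, square BLOCK_SIZE x BLOCK_SIZE thread blocks, and matrix dimensions that are multiples of BLOCK_SIZE:
#define BLOCK_SIZE 16

__global__ void MatMulSharedKernel(Matrix A, Matrix B, Matrix C) {
    // Shared-memory tiles holding the current sub-matrices of A and B
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    float Cvalue = 0;

    // Walk along the tiles of A and B needed to compute C[row][col]
    for (int t = 0; t < A.width / BLOCK_SIZE; ++t) {
        // Each thread copies one element of each tile from global memory
        As[threadIdx.y][threadIdx.x] = A.elements[row * A.width + t * BLOCK_SIZE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B.elements[(t * BLOCK_SIZE + threadIdx.y) * B.width + col];

        // Wait until the whole tile has been loaded before using it
        __syncthreads();

        // Multiply the two tiles together, accumulating into Cvalue
        for (int e = 0; e < BLOCK_SIZE; ++e)
            Cvalue += As[threadIdx.y][e] * Bs[e][threadIdx.x];

        // Wait until all threads are finished with this tile before
        // the next iteration overwrites it
        __syncthreads();
    }

    C.elements[row * C.width + col] = Cvalue;
}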
As you should have worked out by now, CUDA threads execute in parallel.
Just like the POSIX threads that we looked at before, they can be scheduled on and off the GPU.
The threads are scheduled block by block.
This means that there is an arbitrary order in which they access items.
This means that we must think about race conditions.
Thread Synchronisation
There will be situations where we will need to synchronise threads at different levels - e.g. grid, block, etc.
We will look at three measures:
__syncthreads() for synchronising threads in a block.
Splitting across kernels for synchronising multiple grids.
Atomic Functions for performing thread-safe operations. (We will look at these next lecture)
Thread Synchronisation
__syncthreads() acts as a barrier at which all threads in the block must wait before any is allowed to proceed.
Very similar to the MPI barrier function.
Notice that this call operates at the block level - threads outside the block will not be synchronised.
__syncthreads()
All threads in the block need to call the function - otherwise the behaviour is undefined (the block may hang or produce incorrect results).
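A minimal sketch of this rule (an illustrative kernel, assuming a one-dimensional block of 256 threads): the barrier is fine when every thread reaches it, but placing it inside a branch that only some threads take leads to undefined behaviour.
__global__ void ReverseKernel(float* data) {
    __shared__ float buffer[256];
    int i = threadIdx.x;

    buffer[i] = data[blockIdx.x * blockDim.x + i];
    __syncthreads();                         // fine: every thread in the block reaches this

    // if (i < 128) __syncthreads();         // WRONG: only half the threads would reach
    //                                       // the barrier - undefined behaviour

    data[blockIdx.x * blockDim.x + i] = buffer[blockDim.x - 1 - i];
}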
Thread Synchronisation
In the shared-memory matrix multiplication example, we used the __syncthreads() call to ensure that all threads had completed their calculations before new items were loaded into the shared memory.
We will see another example shortly.
Thread Synchronisation
There is no function for synchronising threads across multiple blocks.
This is because threads are only ever executed on the GPU in their block groups - different blocks may be scheduled at different times and on different multiprocessors.
The only way to do this is to split up the logic across two kernel calls.
At the point in the logic where we need all threads synchronised, we end the first kernel and launch a second one.
Recall that global memory contents are retained across multiple kernel calls.
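A minimal sketch of the idea (the kernel names here are hypothetical): devState is allocated in global memory, so its contents persist between the two launches.
// Host code
PhaseOneKernel<<<numBlocks, threadsPerBlock>>>(devState);
PhaseTwoKernel<<<numBlocks, threadsPerBlock>>>(devState);
// Kernels launched into the same stream run in order, so every thread of
// PhaseOneKernel has finished before any thread of PhaseTwoKernel starts.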
Conway's Game of Life in CUDA
To demonstrate the synchronisation approaches, we will develop a simple implementation of Conway's Game of Life.
Conway's Game of Life consists of a two-dimensional grid of squares (representing cells).
Each square is either alive (1) or dead (0).
The state of each cell changes in discrete time steps.
The final version of our program has two key features:
The kernel can complete multiple iterations of the simulation, without multiple kernel calls.
The kernel uses a single array to store the values because the __syncthreads() call has been placed after the calculations for the next state have been completed.
This means that the old state information is not needed and can be overwritten.
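A minimal sketch of that kernel structure (hypothetical names), assuming the whole grid fits in a single thread block with one thread per cell, so that __syncthreads() can synchronise every cell between steps:
__global__ void GameOfLifeKernel(int* cells, int width, int height, int iterations) {
    int col = threadIdx.x;
    int row = threadIdx.y;
    int idx = row * width + col;

    for (int step = 0; step < iterations; ++step) {
        // Count the live neighbours of this cell from the current state
        int alive = 0;
        for (int dr = -1; dr <= 1; ++dr)
            for (int dc = -1; dc <= 1; ++dc) {
                if (dr == 0 && dc == 0) continue;
                int r = row + dr, c = col + dc;
                if (r >= 0 && r < height && c >= 0 && c < width)
                    alive += cells[r * width + c];
            }

        // Standard rules: a cell is alive next step if it has exactly three live
        // neighbours, or has two live neighbours and is currently alive
        int next = (alive == 3 || (alive == 2 && cells[idx] == 1)) ? 1 : 0;

        // Every thread has now finished reading the old state,
        // so the single array can safely be overwritten
        __syncthreads();
        cells[idx] = next;

        // Make sure every cell has been updated before the next iteration reads it
        __syncthreads();
    }
}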