Note the use of cudaGetLastError() after the kernel call.
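A minimal sketch of that pattern (the kernel name and launch configuration here are illustrative, not taken from the lecture example):
MyKernel<<<numBlocks, threadsPerBlock>>>(devPtr);
cudaError_t err = cudaGetLastError();            // picks up any error from the launch
if (err != cudaSuccess)
    printf("Kernel launch failed: %s\n", cudaGetErrorString(err));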
The CUDA Memory Hierarchy
Last lecture we briefly looked at some of the different memory stores that are available on a CUDA device:
Global Memory
Shared Memory
Local Memory
Texture Memory
The CUDA Memory Hierarchy
Examples that we have worked with thus far have all made use of global and local memory.
Global memory resides in device memory, and device memory is accessed via 32-, 64-, or 128-byte memory transactions.
Within the GPU, memory transactions from multiple threads can be grouped (coalesced) into longer contiguous reads - this depends on the size and distribution of the memory regions being read.
Individual global memory instructions support reading or writing words of 1, 2, 4, 8, or 16 bytes.
The CUDA Memory Hierarchy
Any access to data residing in global memory compiles to a single instruction if and only if the data has one of these sizes and is naturally aligned.
E.g. accessing a single (4-byte) floating-point number requires only a single instruction, whereas accessing a memory segment of six char-type values (6 bytes) requires multiple operations.
Try to avoid odd memory access sizes to keep performance high.
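As a purely illustrative sketch of what this means in practice: a struct of three 2-byte values is a 6-byte element and cannot be fetched in one aligned transaction, whereas padding and aligning it to 8 bytes allows a single 8-byte access.
struct __align__(8) Sample {
    short x, y, z;   // 6 bytes of actual data
    short pad;       // padding so that each element occupies a full 8 bytes
};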
The CUDA Memory Hierarchy
Global memory is allocated using cudaMalloc() as linear blocks of memory.
Two-dimensional arrays can be inefficient - recall that a pointer-based 2D array consists of an array of pointers that point to the actual memory locations of the row/column arrays.
Each access requires two look-ups to get to the data item you want.
Solution - Flatten them out.
It's faster to calculate the position within a 1D array using the width and height than it is to go through the extra pointer lookups.
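A minimal sketch (with a hypothetical kernel) of a flattened 2D access - the element at (row, col) of a width x height array lives at index row * width + col of the 1D block:
__global__ void FlatKernel(float* data, int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < height && col < width)
        data[row * width + col] *= 2.0f;   // one index calculation, no pointer chasing
}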
Linear memory can also be allocated through cudaMallocPitch() and cudaMalloc3D().
These functions are recommended for allocations of 2D or 3D arrays as they make sure that the allocation is appropriately padded to meet the alignment requirements.
This ensures that accesses to the memory occur in transaction-sized blocks.
The returned pitch (or stride) must be used to access array elements.
Essentially, CUDA takes care of laying the memory out so that accesses stay efficient.
The CUDA Memory Hierarchy
Here is an optimised 2D array access example.
// Host code
int width = 64, height = 64;
float* devPtr;
size_t pitch;
cudaMallocPitch(&devPtr, &pitch, width * sizeof(float), height);
MyKernel<<<100, 512>>>(devPtr, pitch, width, height);
// Device code
__global__ void MyKernel(float* devPtr, size_t pitch, int width, int height) {
    for (int r = 0; r < height; ++r) {
        /* Tricky pointer arithmetic to get a pointer to the start of row r */
        float* row = (float*)((char*)devPtr + r * pitch);
        for (int c = 0; c < width; ++c) {
            float element = row[c];
        }
    }
}
The CUDA Memory Hierarchy
Shared memory works at the block-level - all threads within a given block can access the shared memory.
Shared memory is faster than global memory, so any chance to exploit this should improve performance.
Shared memory is declared using the __shared__ variable type qualifier.
__shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
The CUDA Memory Hierarchy
To motivate our discussion on shared memory, we will first look at an example of matrix multiplication on CUDA.
We've done this example on every other platform - why stop now?
The initial example is fairly simplistic - it uses one thread per output element.
The threads determine the element that they will work on using the blockIdx and threadIdx structures.
This version has each thread making multiple accesses to global memory.
// Matrix multiplication kernel called by MatMul()
__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C) {
    // Each thread computes one element of C
    // by accumulating results into Cvalue
    float Cvalue = 0;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    for (int e = 0; e < A.width; ++e)
        Cvalue += A.elements[row * A.width + e] * B.elements[e * B.width + col];
    C.elements[row * C.width + col] = Cvalue;
}
The CUDA Memory Hierarchy
We can improve this by copying blocks of global memory into the shared memory so that subsequent accesses are more efficient.
We have to re-work the arrangement of threads so that each block computes the results for one sub-matrix of C.
This will allow us to copy blocks of the global memory over to shared memory that the threads can operate on, as in the sketch below.
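A minimal sketch of the shared-memory (tiled) kernel, assuming the same Matrix struct as before, square BLOCK_SIZE x BLOCK_SIZE thread blocks, and matrix dimensions that are multiples of BLOCK_SIZE:
#define BLOCK_SIZE 16

__global__ void MatMulSharedKernel(Matrix A, Matrix B, Matrix C) {
    // Shared-memory tiles holding the current sub-matrices of A and B
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    float Cvalue = 0;

    // Walk along the tiles of A and B needed to compute C[row][col]
    for (int t = 0; t < A.width / BLOCK_SIZE; ++t) {
        // Each thread copies one element of each tile from global memory
        As[threadIdx.y][threadIdx.x] = A.elements[row * A.width + t * BLOCK_SIZE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B.elements[(t * BLOCK_SIZE + threadIdx.y) * B.width + col];

        // Wait until the whole tile has been loaded before using it
        __syncthreads();

        // Multiply the two tiles together, accumulating into Cvalue
        for (int e = 0; e < BLOCK_SIZE; ++e)
            Cvalue += As[threadIdx.y][e] * Bs[e][threadIdx.x];

        // Wait until all threads are finished with this tile before
        // the next iteration overwrites it
        __syncthreads();
    }

    C.elements[row * C.width + col] = Cvalue;
}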
As you should have worked out by now, CUDA threads execute in parallel.
Just like the POSIX threads that we looked at before, they can be scheduled on and off the GPU.
The threads are scheduled block by block.
This means that there is an arbitrary order in which they access items.
This means that we must think about race conditions.
Thread Synchronisation
There will be situations where we will need to synchronise threads at different levels - e.g. grid, block, etc.
We will look at three measures:
__syncthreads() for synchronising threads in a block.
Splitting across kernels for synchronising multiple grids.
Atomic Functions for performing thread-safe operations. (We will look at these next lecture)
Thread Synchronisation
__syncthreads() acts as a barrier at which all threads in the block must wait before any is allowed to proceed.
Very similar to the MPI barrier function.
Notice that this call operates at the block level - threads outside the block will not be synchronised.
__syncthreads()
All threads in the block need to call the function - otherwise the behaviour is undefined (the block may hang or produce incorrect results).
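A minimal sketch of this rule (an illustrative kernel, assuming a one-dimensional block of 256 threads): the barrier is fine when every thread reaches it, but placing it inside a branch that only some threads take leads to undefined behaviour.
__global__ void ReverseKernel(float* data) {
    __shared__ float buffer[256];
    int i = threadIdx.x;

    buffer[i] = data[blockIdx.x * blockDim.x + i];
    __syncthreads();                         // fine: every thread in the block reaches this

    // if (i < 128) __syncthreads();         // WRONG: only half the threads would reach
    //                                       // the barrier - undefined behaviour

    data[blockIdx.x * blockDim.x + i] = buffer[blockDim.x - 1 - i];
}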
Thread Synchronisation
In the shared-memory matrix multiplication example, we used the __syncthreads() call to ensure that all threads had completed their calculations before new items were loaded into the shared memory.
We will see another example shortly.
Thread Synchronisation
There is no function for synchronising threads across multiple blocks.
This is because threads are only ever executed on the GPU in their block groups - different blocks may be scheduled at different times and on different multiprocessors.
The only way to do this is to split up the logic across two kernel calls.
At the point in the logic where we need all threads synchronised, we end the first kernel and launch a second one.
Recall that global memory contents are retained across multiple kernel calls.
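A minimal sketch of the idea (the kernel names here are hypothetical): devState is allocated in global memory, so its contents persist between the two launches.
// Host code
PhaseOneKernel<<<numBlocks, threadsPerBlock>>>(devState);
PhaseTwoKernel<<<numBlocks, threadsPerBlock>>>(devState);
// Kernels launched into the same stream run in order, so every thread of
// PhaseOneKernel has finished before any thread of PhaseTwoKernel starts.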
Conway's Game of Life in CUDA
To demonstrate the synchronisation approaches, we will develop a simple implementation of Conway's Game of Life.
Conway's Game of Life consists of a two-dimensional grid of squares (representing cells).
Each square is either alive (1) or dead (0).
The state of each cell changes in discrete time steps.
The final version of our program has two key features:
The kernel can complete multiple iterations of the simulation, without multiple kernel calls.
The kernel uses a single array to store the values because the __syncthreads() call has been placed after the calculations for the next state have been completed.
This means that the old state information is not needed and can be overwritten.
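A minimal sketch of that kernel structure (hypothetical names), assuming the whole grid fits in a single thread block with one thread per cell, so that __syncthreads() can synchronise every cell between steps:
__global__ void GameOfLifeKernel(int* cells, int width, int height, int iterations) {
    int col = threadIdx.x;
    int row = threadIdx.y;
    int idx = row * width + col;

    for (int step = 0; step < iterations; ++step) {
        // Count the live neighbours of this cell from the current state
        int alive = 0;
        for (int dr = -1; dr <= 1; ++dr)
            for (int dc = -1; dc <= 1; ++dc) {
                if (dr == 0 && dc == 0) continue;
                int r = row + dr, c = col + dc;
                if (r >= 0 && r < height && c >= 0 && c < width)
                    alive += cells[r * width + c];
            }

        // Standard rules: a cell is alive next step if it has exactly three live
        // neighbours, or has two live neighbours and is currently alive
        int next = (alive == 3 || (alive == 2 && cells[idx] == 1)) ? 1 : 0;

        // Every thread has now finished reading the old state,
        // so the single array can safely be overwritten
        __syncthreads();
        cells[idx] = next;

        // Make sure every cell has been updated before the next iteration reads it
        __syncthreads();
    }
}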