An atomic function performs a read-modify-write atomic operation on one 32-bit or
64-bit word residing in global or shared memory.
The operation is atomic in the sense that it is guaranteed to be performed without interference from other threads.
Mutual exclusion is enforced, removing the possibility of lost updates caused by race conditions.
CUDA provides a full suite of atomic functions for performing arithmetic operations.
Atomic Operations and Mutual Exclusion
The first function we will look at is atomicAdd( ... ):
int atomicAdd(int* address, int val);
unsigned int atomicAdd(unsigned int* address, unsigned int val);
unsigned long long int atomicAdd(unsigned long long int* address, unsigned long long int val);
float atomicAdd(float* address, float val);
Reads the 32-bit or 64-bit word old located at the address address in global or shared
memory, computes (old + val), and stores the result back to memory at the same
address.
These three operations are performed in one atomic transaction.
The function returns old.
To motivate our discussion we will look at an example that calculates a total from a vector, using a CUDA kernel that contains a thread race (updates to the running total will be lost).
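A minimal sketch of the racy kernel and of the atomicAdd fix (the names sumRace, sumAtomic, vec and total are illustrative, not the original example code):
__global__ void sumRace(const float* vec, int n, float* total)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n)
        *total = *total + vec[i];   /* unprotected read-modify-write: updates can be lost */
}
__global__ void sumAtomic(const float* vec, int n, float* total)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n)
        atomicAdd(total, vec[i]);   /* atomic read-modify-write: no lost updates */
}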
There are atomic functions for a range of aggregations:
atomicSub - subtraction
atomicMin - minimum
atomicMax - maximum (see the sketch below)
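As a minimal sketch of this aggregation pattern (maxKernel, data and d_result are illustrative names), atomicMax can be used to find the maximum of an array:
__global__ void maxKernel(const int* data, int n, int* d_result)
{
    /* d_result must be initialised (e.g. to INT_MIN) before the launch */
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n)
        atomicMax(d_result, data[i]);   /* propose this element as the running maximum */
}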
There are a couple of more specialised ones:
int atomicExch(int* address, int val);
unsigned int atomicExch(unsigned int* address, unsigned int val);
unsigned long long int atomicExch(unsigned long long int* address, unsigned long long int val);
float atomicExch(float* address, float val);
Reads the 32-bit or 64-bit word old located at the address address in global or shared
memory and stores val back to memory at the same address.
These two operations are performed in one atomic transaction.
The function returns old.
int atomicCAS(int* address, int compare, int val);
unsigned int atomicCAS(unsigned int* address, unsigned int compare, unsigned int val);
unsigned long long int atomicCAS(unsigned long long int* address, unsigned long long int compare, unsigned long long int val);
Reads the 32-bit or 64-bit word old located at the address address in global or shared
memory, computes (old == compare ? val : old), and stores the result back
to memory at the same address.
These three operations are performed in one atomic
transaction.
The function returns old (Compare And Swap).
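A classic use of atomicCAS is to build an atomic operation that the hardware does not provide directly. The sketch below follows the retry-loop pattern from the CUDA C Programming Guide for emulating a double-precision atomic add on older GPUs (atomicAddDouble is an illustrative name):
__device__ double atomicAddDouble(double* address, double val)
{
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        /* the swap only succeeds if no other thread changed the word in the meantime */
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);
}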
Recall from module 2 that atomic operations do not solve every synchronisation problem.
There are situations where a sequence of actions must be carried out within a single atomic transaction.
CUDA does not provide functions for mutexes.
We can implement them ourselves using the atomic functions.
However, there is an issue: this does not work reliably on all GPUs.
Spin locks can cause a deadlock because of the way threads within a warp are scheduled.
An example of mutex locks in action can be found in dotProduct.cu - this frequently locks up on bourbaki.
Mutexes and critical sections should be used sparingly as they circumvent many of the benefits of the data-parallel approach for computation.
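A minimal sketch of a spin-lock mutex built from the atomic functions (illustrative only; this is not the dotProduct.cu implementation, and as noted above it can deadlock when threads in the same warp contend for the lock):
__device__ void lock(int* mutex)
{
    /* spin until the mutex changes from 0 (unlocked) to 1 (locked) */
    while (atomicCAS(mutex, 0, 1) != 0) { /* busy-wait */ }
}
__device__ void unlock(int* mutex)
{
    atomicExch(mutex, 0);   /* release the lock */
}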
Warps and Divergence
As we discussed in previous lectures, CUDA threads are executed block by block.
In order for a block to be executed, it is assigned to one of the GPU's streaming multiprocessors (SMs).
The GPU on bourbaki has 108 such multiprocessors, each with 64 CUDA cores (see the deviceQuery example).
Once a block of threads has been allocated to a processor, it is then further divided into 32-thread units called warps.
The number of threads per warp (32) is implementation specific and may change in the future - know the properties of the GPU that you are targeting.
The warps are structured so that they consist of consecutive threadIdx values.
Threads 0 ... 31 will be in the first warp, threads 32 ... 63 will be in the next.
The number of warps in a block is the number of threads in the block divided by 32, rounded up; for example, a block of 100 threads occupies 4 warps, the last of which is only partially full.
The warps are the unit used by the GPU for thread scheduling.
The processors can switch between warps with no apparent overhead.
Just like POSIX threads, CUDA threads can be in different states: waiting for data, ready to execute, or executing.
Waiting for data occurs when threads within a warp need the result of another calculation.
Once the data is available, they are ready to execute and the whole warp will be scheduled onto the processor.
The ability to schedule warps within a block gives the GPU the ability to tolerate long-latency operations.
Threads are always allocated in warp-sized denominations, even if the logic in the program does not make use of them.
When we are working with the kernel launch configuration, we should maximise the number of warps that an SM has to work with.
This gives the scheduler the flexibility to ensure that there is no idle time.
kernel<<<N, 1>>>( ... );         /* Bad design - each warp is almost empty */
kernel<<<N / 32, 32>>>( ... );   /* Okay design - each warp is full */
kernel<<<N / 128, 128>>>( ... ); /* Better design - maximise the number of full warps */
When the number of threads is not a multiple of the block size, use some bounds checking:
__global__ void kn(int n, .... )
{
int i = threadIdx.x + blockDim.x * blockIdx.x;
if (i < n)
{
// Your task
}
}
This prevents buffer over-runs if we intend to use threadIdx.x to calculate an array index.
This is the pattern that many of the examples we have looked at this far have used.
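On the host side, the matching launch rounds the number of blocks up so that every element is covered (a sketch; kn and n are taken from the kernel above, with the other arguments elided):
int threadsPerBlock = 128;
int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;   /* round up */
kn<<<blocks, threadsPerBlock>>>(n, ... );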
Let's motivate our next section with a code snippet:
extern __shared__ float partialSum[];   /* one element per thread, sized at launch */
...
unsigned int t = threadIdx.x;
for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
    __syncthreads();
    if (t % (2 * stride) == 0)
        partialSum[t] += partialSum[t + stride];
}
The code snippet conducts a parallel sum reduction.
The original array is copied from global memory to shared memory, and the result is then copied back (these steps are not shown here; see the sketch below).
Each iteration of the loop completes one round of the reduction.
The __syncthreads() call ensures that all threads see the results from the previous iteration before the next one begins.
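A sketch of the complete kernel, including the omitted copy-in and copy-out steps (reduceSum, input and blockSums are illustrative names; blockDim.x is assumed to be a power of two):
__global__ void reduceSum(const float* input, float* blockSums, int n)
{
    extern __shared__ float partialSum[];          /* one float per thread */
    unsigned int t = threadIdx.x;
    unsigned int i = threadIdx.x + blockDim.x * blockIdx.x;
    partialSum[t] = (i < n) ? input[i] : 0.0f;     /* copy in from global memory */
    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
        __syncthreads();
        if (t % (2 * stride) == 0)
            partialSum[t] += partialSum[t + stride];
    }
    if (t == 0)
        blockSums[blockIdx.x] = partialSum[0];     /* copy the block's result back */
}
It would be launched with the shared array sized via the third launch parameter, e.g. reduceSum<<<blocks, threads, threads * sizeof(float)>>>( ... ), and the per-block results in blockSums would then be reduced again, either on the host or with a second kernel launch.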
It is very similar in structure to the tree communication pattern we looked at when we covered MPI.
The approach requires log2(N) rounds and performs O(N) additions in total, so it is reasonably efficient.
There is divergence caused by the branching.
Divergence causes some of the threads within each warp to block - essentially the branches need to be serialised.
The first pass will run all threads that step into the if statement.
A second pass will be required to run those that do not enter the if statement.
This decreases the efficiency of the approach.
The aim is to get all threads within a warp taking the same path.
We can still have divergence, it just needs to be organised in a different way.
In our example, we can modify the stride to limit the divergence within each warp.
Instead of adding neighbouring elements in the first round, it adds elements that are half a section away from each other.
extern __shared__ float partialSum[];   /* one element per thread, sized at launch */
unsigned int t = threadIdx.x;
for (unsigned int stride = blockDim.x / 2; stride >= 1; stride /= 2) {
    __syncthreads();
    if (t < stride)
        partialSum[t] += partialSum[t + stride];
}
Measuring Performance
Measuring performance with CUDA involves the cudaEvent_t type together with the cudaEventCreate, cudaEventRecord, cudaEventSynchronize and cudaEventElapsedTime functions.
The general pattern (without error checking) looks like:
cudaEvent_t start;
cudaEventCreate(&start);
cudaEvent_t stop;
cudaEventCreate(&stop);
// Record the start event
cudaEventRecord(start, NULL);
...
//Launch kernels etc.
...
// Record the stop event
cudaEventRecord(stop, NULL);
// Wait for the stop event to complete
cudaEventSynchronize(stop);
float msecTotal = 0.0f;
cudaEventElapsedTime(&msecTotal, start, stop);
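cudaEventElapsedTime reports the elapsed time in milliseconds. Once the measurement is finished, the events should be released; a typical ending (the printf message is illustrative) is:
printf("Kernel time: %.3f ms\n", msecTotal);
cudaEventDestroy(start);
cudaEventDestroy(stop);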