A CUDA stream is a queue of GPU operations that are executed in a specific order.
The order in which operations are added to this queue determines their order of execution.
Independent streams of operations can be executed concurrently and asynchronously.
Streams take advantage of the capability of the GPU to overlap kernel execution with memory copy operations.
Nearly all CUDA-enabled cards with a compute capability of 1.1 or higher can do this.
Single CUDA Streams
To test whether a GPU is capable of overlapped memory copies, we can run the following sample.
#include <stdio.h>

int main( void ) {
    cudaDeviceProp prop;
    int whichDevice;

    cudaGetDevice( &whichDevice );
    cudaGetDeviceProperties( &prop, whichDevice );

    // deviceOverlap is non-zero if the device can overlap
    // kernel execution with memory copies
    if (!prop.deviceOverlap) {
        printf( "Device will not handle overlaps, so no speed up from streams\n" );
    } else {
        printf( "Device will handle overlaps\n" );
    }
    return 0;
}
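(A side note: on recent CUDA versions the deviceOverlap field is deprecated; cudaDeviceProp::asyncEngineCount reports the number of copy engines instead, so an equivalent check would be if (prop.asyncEngineCount > 0).)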
The overlap capability will become relevant when we move to the use of multiple streams.
To set up a stream, we will make use of cudaStreamCreate().
We can then begin to add operations to the stream.
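As a minimal sketch (the variable name stream is our own choice), creating and later destroying a stream looks like this:
cudaStream_t stream;
cudaStreamCreate( &stream );   // initialize the stream

// ... enqueue copies and kernel launches into the stream ...

cudaStreamDestroy( stream );   // clean up when finished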
To motivate this example, we will use a CUDA kernel that averages three neighbouring values from each of two input vectors and then averages the two results (just a toy calculation).
// assumes N is #defined elsewhere (the size of one chunk of data)
__global__ void kernel( int *a, int *b, int *c ) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < N) {
        // average three neighbouring values from each input,
        // wrapping within a 256-element window
        int idx1 = (idx + 1) % 256;
        int idx2 = (idx + 2) % 256;
        float as = (a[idx] + a[idx1] + a[idx2]) / 3.0f;
        float bs = (b[idx] + b[idx1] + b[idx2]) / 3.0f;
        c[idx] = (as + bs) / 2;
    }
}
The first step is to allocate host memory to hold the a, b and c arrays.
To use these host memory locations within the stream (i.e. to copy their values to and from the GPU in the stream), the memory will need to be allocated as page-locked memory.
Page-locked (or pinned) memory behaves like normal heap memory (that we would allocate through malloc), except it is guaranteed never to be paged out to disk by the OS.
The OS ensures that the memory always remains in the physical memory of the system.
Because the memory is never moved by the OS, the GPU can use direct memory access (DMA) to copy data to and from this location on the host.
DMA works without intervention from the CPU.
Interestingly, even when copying from pageable memory, the GPU still uses DMA.
The pageable memory block (potentially residing in virtual memory on disk) is first copied to a page-locked staging area.
The DMA transfer is then performed from the staging area to the GPU.
As a result, transfers from pageable memory are bounded by the transfer speed of the virtual memory.
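To see the difference in practice, here is a rough timing sketch using cudaEvent timers (the buffer size SIZE and the iteration count are arbitrary choices for illustration, not part of the original example):
#include <stdio.h>
#include <stdlib.h>

#define SIZE (64 * 1024 * 1024)   // 64 MB test buffer (arbitrary)

float time_copy( int *host, int *dev ) {
    cudaEvent_t start, stop;
    float elapsed;
    cudaEventCreate( &start );
    cudaEventCreate( &stop );
    cudaEventRecord( start, 0 );
    for (int i = 0; i < 10; i++) {
        cudaMemcpy( dev, host, SIZE, cudaMemcpyHostToDevice );
    }
    cudaEventRecord( stop, 0 );
    cudaEventSynchronize( stop );
    cudaEventElapsedTime( &elapsed, start, stop );
    cudaEventDestroy( start );
    cudaEventDestroy( stop );
    return elapsed;
}

int main( void ) {
    int *dev, *pageable, *pinned;
    cudaMalloc( (void**)&dev, SIZE );

    pageable = (int*)malloc( SIZE );                              // ordinary heap memory
    cudaHostAlloc( (void**)&pinned, SIZE, cudaHostAllocDefault ); // page-locked memory

    printf( "pageable: %.1f ms\n", time_copy( pageable, dev ) );
    printf( "pinned:   %.1f ms\n", time_copy( pinned, dev ) );

    free( pageable );
    cudaFreeHost( pinned );
    cudaFree( dev );
    return 0;
}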
Remember that your system has a finite amount of physical memory, so be careful with page-locked memory allocation.
The system will run out of memory much faster if page-locked memory is overused.
Know the limits of your system and the other applications that share its memory.
To allocate page-locked memory, we can use the cudaHostAlloc() function.
cudaError_t cudaHostAlloc( void **pHost,
                           size_t size,
                           unsigned int flags )
Allocates size bytes of host memory that is page-locked and accessible to the device.
Parameters:
pHost - Pointer to the allocated host memory
size - Requested allocation size in bytes
flags - Requested properties of the allocated memory. We will use cudaHostAllocDefault
We will allocate some page-locked memory for our asynchronous stream.
We will fill the a and b arrays with random values.
// allocate page-locked memory, used for streaming
// (FULL_DATA_SIZE is #defined elsewhere as the total element count)
int *host_a, *host_b, *host_c;
cudaHostAlloc( (void**)&host_a, FULL_DATA_SIZE * sizeof(int), cudaHostAllocDefault );
cudaHostAlloc( (void**)&host_b, FULL_DATA_SIZE * sizeof(int), cudaHostAllocDefault );
cudaHostAlloc( (void**)&host_c, FULL_DATA_SIZE * sizeof(int), cudaHostAllocDefault );

// fill the input arrays with random values
for (int i = 0; i < FULL_DATA_SIZE; i++) {
    host_a[i] = rand();
    host_b[i] = rand();
}
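One point worth noting: memory allocated with cudaHostAlloc() must be released with cudaFreeHost(), not free(). When the arrays are no longer needed:
cudaFreeHost( host_a );
cudaFreeHost( host_b );
cudaFreeHost( host_c );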
We will now copy our page-locked memory blocks to the GPU.
Up until now, we have used cudaMemcpy().
This is a synchronous function that only returns once the copy has completed.
Instead, we will use cudaMemcpyAsync(), which returns to the caller immediately; the copy operation completes at some time after the call.
A cudaStream_t is passed in as a parameter, which guarantees that the copy will be completed before the next operation in the same stream begins.
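As a sketch of how these pieces fit together (the chunk size N and the stream variable carry over from the earlier snippets; dev_a, dev_b and dev_c are assumed to have been allocated with cudaMalloc), enqueueing a copy, a kernel launch and a copy back into a single stream might look like this:
// copy a chunk of each input to the device, asynchronously in 'stream'
cudaMemcpyAsync( dev_a, host_a, N * sizeof(int),
                 cudaMemcpyHostToDevice, stream );
cudaMemcpyAsync( dev_b, host_b, N * sizeof(int),
                 cudaMemcpyHostToDevice, stream );

// launch the kernel in the same stream; it will not start
// until the preceding copies in the stream have finished
kernel<<< N/256, 256, 0, stream >>>( dev_a, dev_b, dev_c );

// copy the result chunk back to page-locked host memory
cudaMemcpyAsync( host_c, dev_c, N * sizeof(int),
                 cudaMemcpyDeviceToHost, stream );

// wait for all work queued in the stream to complete
cudaStreamSynchronize( stream );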