Sample: globalToShmemAsyncCopy
Minimum spec: SM 7.0

This sample implements matrix multiplication that copies data asynchronously from global to shared memory when running on devices of compute capability 8.0 or higher. It also demonstrates the arrive-wait barrier for synchronization.

Key concepts: CUDA Runtime API, Linear Algebra, CPP11 CUDA
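
The pattern described above (an asynchronous global-to-shared copy whose completion is tracked by an arrive-wait barrier) can be sketched roughly as follows. This is not the sample's matrix multiplication code; it is a minimal illustration that stages one tile per block in shared memory and scales it. The names `scaleViaShared`, `scaleOnDevice`, and `TILE` are hypothetical and chosen only for this sketch. It assumes the input length is a multiple of `TILE` and omits CUDA error checking for brevity.

```cuda
#include <cassert>
#include <cstddef>
#include <vector>
#include <cooperative_groups.h>
#include <cuda/barrier>
#include <cuda_runtime.h>

namespace cg = cooperative_groups;

constexpr int TILE = 256;  // elements staged per block (illustrative choice)

// Each block copies one tile from global to shared memory asynchronously,
// waits on a block-scoped arrive-wait barrier, then scales the tile.
__global__ void scaleViaShared(const float* in, float* out, float factor) {
    __shared__ float tile[TILE];
    __shared__ cuda::barrier<cuda::thread_scope_block> bar;

    cg::thread_block block = cg::this_thread_block();
    if (block.thread_rank() == 0) {
        init(&bar, block.size());  // one expected arrival per thread
    }
    block.sync();  // make the initialized barrier visible to all threads

    // Collective copy issued by the whole block; completion is tied to `bar`.
    cuda::memcpy_async(block, tile, in + blockIdx.x * TILE,
                       sizeof(float) * TILE, bar);

    // Arrive, then wait until all threads have arrived and the copy is done.
    bar.arrive_and_wait();

    out[blockIdx.x * TILE + threadIdx.x] = tile[threadIdx.x] * factor;
}

// Hypothetical host helper: runs the kernel over `host` and returns the result.
std::vector<float> scaleOnDevice(const std::vector<float>& host, float factor) {
    assert(!host.empty() && host.size() % TILE == 0);  // assumption of this sketch
    const std::size_t bytes = host.size() * sizeof(float);

    float* dIn = nullptr;
    float* dOut = nullptr;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, host.data(), bytes, cudaMemcpyHostToDevice);

    const unsigned int blocks = static_cast<unsigned int>(host.size() / TILE);
    scaleViaShared<<<blocks, TILE>>>(dIn, dOut, factor);

    std::vector<float> result(host.size());
    cudaMemcpy(result.data(), dOut, bytes, cudaMemcpyDeviceToHost);  // blocks until the kernel finishes

    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

A sketch like this would need to be built for SM 7.0 or newer (for example `nvcc -arch=sm_80`), since `cuda::barrier` is not available on older architectures. The same source is expected to run on SM 7.x, where the copy is carried out without the hardware acceleration available on compute capability 8.0 and higher.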