Initializing...
GPU Device 0: "Hopper" with compute capability 9.0

M: 4096 (16 x 256)
N: 4096 (16 x 256)
K: 4096 (16 x 256)
Preparing data for GPU...
Required shared memory size: 64 Kb
Computing... using high performance kernel compute_gemm
Time: 1.223904 ms
TFLOPS: 112.30
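
The reported throughput can be cross-checked from the other values in the log. The sketch below assumes the conventional GEMM operation count of 2 * M * N * K (one multiply and one add per inner-product term); the log itself does not state which formula the benchmark uses, and the helper name gemm_tflops is hypothetical.

```python
def gemm_tflops(m: int, n: int, k: int, time_ms: float) -> float:
    """Return TFLOPS for an M x N x K GEMM that took time_ms milliseconds.

    Assumes 2 * M * N * K floating-point operations (a multiply and an add
    per inner-product term), which is a common convention and not something
    the log confirms.
    """
    flops = 2.0 * m * n * k
    seconds = time_ms / 1000.0
    return flops / seconds / 1e12


# Values taken from the log above.
tflops = gemm_tflops(4096, 4096, 4096, 1.223904)
print(f"TFLOPS: {tflops:.2f}")  # prints "TFLOPS: 112.30"
```

Under that assumption the computed value agrees with the logged 112.30 TFLOPS, so the Time and TFLOPS lines are consistent with each other.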