Initializing...
GPU Device 0: "Hopper" with compute capability 9.0

M: 4096 (16 x 256)
N: 4096 (16 x 256)
K: 4096 (16 x 256)
Preparing data for GPU...
Required shared memory size: 64 Kb
Computing... using high performance kernel compute_gemm_imma
Time: 0.629184 ms
TOPS: 218.44
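
The reported TOPS figure can be cross-checked from the dimensions and time printed above. The sketch below assumes the usual GEMM convention of counting each multiply-accumulate as two operations, giving 2 * M * N * K operations in total. The log itself does not state this convention, and the variable names are illustrative only.

```python
# Values copied from the log above.
M = N = K = 4096
elapsed_ms = 0.629184

# Each dimension is printed as "16 x 256"; confirm the product matches.
assert 16 * 256 == M

# Assumption: one multiply-accumulate counts as 2 operations.
ops = 2 * M * N * K

# Convert milliseconds to seconds, then operations/second to tera-operations/second.
tops = ops / (elapsed_ms * 1e-3) / 1e12
print(f"TOPS: {tops:.2f}")
```

Under that assumption the result agrees with the logged value of 218.44 to two decimal places.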
