Initializing...
GPU Device 0: "Hopper" with compute capability 9.0

M: 8192 (16 x 512)
N: 8192 (16 x 512)
K: 4096 (8 x 512)
Preparing data for GPU...
Required shared memory size: 72 Kb
Computing using high performance kernel = 0 - compute_tf32gemm_async_copy
Time: 69.720161 ms
TFLOPS: 7.89
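
The reported throughput can be cross-checked from the dimensions and the elapsed time printed above. The following is a minimal sketch, not the benchmark's own code: it assumes the conventional GEMM operation count of 2*M*N*K (one multiply and one add per inner-product term), and the variable names are illustrative only.

```python
# Values copied from the log above.
M, N, K = 8192, 8192, 4096
elapsed_ms = 69.720161

# Sanity-check the tile decomposition printed in the log.
assert M == 16 * 512 and N == 16 * 512 and K == 8 * 512

# Assumption: a GEMM costs 2*M*N*K floating-point operations.
flops = 2 * M * N * K

# Convert milliseconds to seconds, then scale to tera-operations per second.
tflops = flops / (elapsed_ms * 1e-3) / 1e12
print(f"TFLOPS: {tflops:.2f}")
```

Under that assumption the result rounds to the same 7.89 figure the log reports.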
