
Linear Kernel Performance on 256x256x1 Matrix Multiplication NVIDIA H800 GPU (test)

Best result: BWTA_QK, latency 5.54 us

Updated 3mo ago

Evaluation Results

Method	Links	Latency (us)
BWTA_QK 2026.04		5.54	5,915
BWTA_Attn 2026.04		6.71	4,883
Binary A x V (Supported) 2026.04		6.97	4,701
Binary Linear (Supported) 2026.04		7.29	4,495
BWTA 2026.04		8.56	3,828
bitlinear_int8xint2(Bitnet) 2026.04		9.08	3,609
torch.nn.functional 2026.04		10.74	3,050
bnb.nn.Linear4bit 2026.04		60.33	543
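
To compare entries against a common reference, the latencies can be turned into speedup ratios. The sketch below is illustrative only: treating torch.nn.functional as the baseline is an assumption made here, not something the leaderboard states, and the helper name speedup_over_baseline is hypothetical. The latency values are copied from the table above.

```python
# Latencies in microseconds, copied from the results table.
latency_us = {
    "BWTA_QK": 5.54,
    "BWTA_Attn": 6.71,
    "Binary A x V (Supported)": 6.97,
    "Binary Linear (Supported)": 7.29,
    "BWTA": 8.56,
    "bitlinear_int8xint2(Bitnet)": 9.08,
    "torch.nn.functional": 10.74,
    "bnb.nn.Linear4bit": 60.33,
}

# Assumption: the stock PyTorch entry is used as the reference point.
BASELINE = "torch.nn.functional"


def speedup_over_baseline(latencies, baseline=BASELINE):
    """Return baseline_latency / method_latency for every method.

    Values above 1.0 mean the method is faster than the baseline.
    """
    base = latencies[baseline]
    return {name: base / value for name, value in latencies.items()}


speedups = speedup_over_baseline(latency_us)

# Print methods from fastest to slowest relative to the baseline.
for name, factor in sorted(speedups.items(), key=lambda kv: -kv[1]):
    print(f"{name:30s} {latency_us[name]:6.2f} us  {factor:5.2f}x")
```

A ratio is used rather than an absolute difference so that the comparison stays readable across the wide range of latencies in the table.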