Linear Kernel Performance on 256x256x1 Matrix Multiplication NVIDIA H800 GPU (test)
[Chart: Latency (us) and FLOPs (x10^3) over time. Best result: BWTA_QK, 5.54 us, Apr 5, 2026.]
Evaluation Results
Method                        Wbit/Abit  Date     Latency (us)  FLOPs (x10^3)
BWTA_QK                       1.5/1.5    2026.04          5.54          5,915
BWTA_Attn                     1/1.5      2026.04          6.71          4,883
Binary A x V (Supported)      1/1        2026.04          6.97          4,701
Binary Linear (Supported)     1/1        2026.04          7.29          4,495
BWTA                          1/1.5      2026.04          8.56          3,828
bitlinear_int8xint2(Bitnet)   2/8        2026.04          9.08          3,609
torch.nn.functional           16/16      2026.04         10.74          3,050
bnb.nn.Linear4bit             4/16       2026.04         60.33            543
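
The results above are easier to compare as speedups over a reference row. The following is a minimal sketch, not part of the benchmark itself: it assumes torch.nn.functional (Wbit/Abit=16/16) is the reference, and the names RESULTS, BASELINE, speedups, and latency_flops_products are hypothetical helpers introduced here for illustration. It also multiplies each row's latency by its FLOPs (x10^3) value; with the reported numbers that product comes out close to 32,768 in every row, which suggests (the page does not say so) that the FLOPs column is a rate derived from latency rather than an independent measurement.

```python
# Rows copied from the results table: (method, latency in us, FLOPs column in x10^3).
RESULTS = [
    ("BWTA_QK", 5.54, 5915),
    ("BWTA_Attn", 6.71, 4883),
    ("Binary A x V (Supported)", 6.97, 4701),
    ("Binary Linear (Supported)", 7.29, 4495),
    ("BWTA", 8.56, 3828),
    ("bitlinear_int8xint2(Bitnet)", 9.08, 3609),
    ("torch.nn.functional", 10.74, 3050),
    ("bnb.nn.Linear4bit", 60.33, 543),
]

# Assumption: the 16/16 row serves as the reference; the page names no baseline.
BASELINE = "torch.nn.functional"


def speedups(rows, baseline=BASELINE):
    """Return {method: baseline_latency / method_latency}; values above 1 are faster."""
    base_latency = next(lat for name, lat, _ in rows if name == baseline)
    return {name: base_latency / lat for name, lat, _ in rows}


def latency_flops_products(rows):
    """Return {method: latency * FLOPs column} to check how the two columns relate."""
    return {name: lat * flops for name, lat, flops in rows}


if __name__ == "__main__":
    sp = speedups(RESULTS)
    prod = latency_flops_products(RESULTS)
    for name, lat, _ in RESULTS:
        print(f"{name:30s} {lat:6.2f} us  {sp[name]:5.2f}x  product={prod[name]:9.1f}")
```

Under that assumed baseline, BWTA_QK is roughly 1.94x faster than torch.nn.functional, while bnb.nn.Linear4bit is about 5.6x slower.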