
TorchAO: PyTorch-Native Training-to-Serving Model Optimization

About

We present TorchAO, a PyTorch-native model optimization framework that leverages quantization and sparsity to provide an end-to-end, training-to-serving workflow for AI models. TorchAO supports a variety of popular model optimization techniques, including FP8 quantized training, quantization-aware training (QAT), post-training quantization (PTQ), and 2:4 sparsity, and uses a novel tensor subclass abstraction to represent a variety of widely used, backend-agnostic low-precision data types, including INT4, INT8, FP8, MXFP4, MXFP6, and MXFP8. TorchAO integrates closely with the broader ecosystem at each step of the model optimization pipeline, from pre-training (TorchTitan) to fine-tuning (TorchTune, Axolotl) to serving (HuggingFace, vLLM, SGLang, ExecuTorch), connecting an otherwise fragmented space into a single, unified workflow. TorchAO has enabled recent launches of the quantized Llama 3.2 1B/3B and LlamaGuard3-8B models and is open-source at https://github.com/pytorch/ao/.
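As a rough illustration of the arithmetic underlying the PTQ techniques mentioned above, the sketch below shows symmetric per-tensor int8 weight quantization in plain Python. This is not TorchAO's actual API (TorchAO operates on torch tensors via its `quantize_` entry point and tensor subclasses); it only demonstrates the quantize/dequantize round trip.

```python
# Symmetric per-tensor int8 quantization sketch (illustrative only,
# not TorchAO's implementation).

def quantize_int8(weights):
    """Map float weights to int8 values plus a per-tensor scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs else 1.0
    # Round to nearest integer and clamp to the int8 range.
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from int8 values and scale."""
    return [v * scale for v in q]

weights = [0.1, -0.5, 0.25, 1.0]
q, scale = quantize_int8(weights)
approx = dequantize_int8(q, scale)
```

The per-tensor scale is the coarsest granularity; production flows typically use per-channel or per-group scales to reduce quantization error.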

Andrew Or, Apurva Jain, Daniel Vega-Myhre, Jesse Cai, Charles David Hernandez, Zhenrui Zheng, Driss Guessous, Vasiliy Kuznetsov, Christian Puhrsch, Mark Saroufim, Supriya Rao, Thien Tran, Aleksandar Samardžić • 2025

Related benchmarks

Task | Dataset | Result | Rank
Language Understanding | MMLU 5-shot (test) | Accuracy: 72.5 | 149
Zero-shot Downstream Task Evaluation | LM-EVAL (average of HellaSwag, PIQA, ARC-Easy, ARC-Challenge, and WinoGrande), zero-shot | Average Accuracy: 72.4 | 30
Model Compression | Llama 3.1 8B | Model Size (GB): 5.7 | 7
Model Compression | Qwen 2.5 7B | Model Size (GB): 6 | 7
Latency Profiling | Llama-3.1-8B | TPOT (ms): 10.2 | 5
Energy Profiling | Qwen-2.5-7B on Jetson Orin Nano 8G (512 prefill, 512 generation tokens), inference | Energy per Request: 381.1 | 3
Latency Profiling | Qwen 2.5 7B on Jetson Orin Nano 8G, inference | TPOT (ms): 133.6 | 3
