
TorchAO: PyTorch-Native Training-to-Serving Model Optimization

About

We present TorchAO, a PyTorch-native model optimization framework that leverages quantization and sparsity to provide an end-to-end, training-to-serving workflow for AI models. TorchAO supports a variety of popular model optimization techniques, including FP8 quantized training, quantization-aware training (QAT), post-training quantization (PTQ), and 2:4 sparsity, and uses a novel tensor subclass abstraction to represent a variety of widely used, backend-agnostic low-precision data types, including INT4, INT8, FP8, MXFP4, MXFP6, and MXFP8. TorchAO integrates closely with the broader ecosystem at each step of the model optimization pipeline, from pre-training (TorchTitan) to fine-tuning (TorchTune, Axolotl) to serving (HuggingFace, vLLM, SGLang, ExecuTorch), connecting an otherwise fragmented space into a single, unified workflow. TorchAO has enabled recent launches of the quantized Llama 3.2 1B/3B and LlamaGuard3-8B models and is open-source at https://github.com/pytorch/ao/.
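As a rough illustration of the arithmetic underlying the PTQ techniques mentioned above, the sketch below shows symmetric per-tensor int8 weight quantization in plain Python. This is not TorchAO's actual API (TorchAO operates on torch tensors via its `quantize_` entry point and tensor subclasses); it only demonstrates the quantize/dequantize round trip.

```python
# Symmetric per-tensor int8 quantization sketch (illustrative only,
# not TorchAO's implementation).

def quantize_int8(weights):
    """Map float weights to int8 values plus a per-tensor scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs else 1.0
    # Round to nearest integer and clamp to the int8 range.
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from int8 values and scale."""
    return [v * scale for v in q]

weights = [0.1, -0.5, 0.25, 1.0]
q, scale = quantize_int8(weights)
approx = dequantize_int8(q, scale)
```

The per-tensor scale is the coarsest granularity; production flows typically use per-channel or per-group scales to reduce quantization error.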

Andrew Or, Apurva Jain, Daniel Vega-Myhre, Jesse Cai, Charles David Hernandez, Zhenrui Zheng, Driss Guessous, Vasiliy Kuznetsov, Christian Puhrsch, Mark Saroufim, Supriya Rao, Thien Tran, Aleksandar Samardžić • 2025

Related benchmarks

Task | Dataset | Result | Rank
Language Understanding | MMLU 5-shot (test) | Accuracy: 72.5 | 149
Zero-shot Downstream Task Evaluation | LM-EVAL (average of HellaSwag, PIQA, ARC-Easy, ARC-Challenge, and WinoGrande), zero-shot | Average Accuracy: 72.4 | 30
Model Compression | Llama 3.1 8B | Model Size (GB): 5.7 | 7
Model Compression | Qwen 2.5 7B | Model Size (GB): 6 | 7
Latency Profiling | Llama-3.1-8B | TPOT (ms): 10.2 | 5
Energy Profiling | Qwen-2.5-7B on Jetson Orin Nano 8G (512 prefill, 512 generation tokens), inference | Energy per Request: 381.1 | 3
Latency Profiling | Qwen 2.5 7B on Jetson Orin Nano 8G, inference | TPOT (ms): 133.6 | 3
