Pre$^3$: Enabling Deterministic Pushdown Automata for Faster Structured LLM Generation

About

Many LLM applications demand efficient structured generation, particularly for LR(1) grammars, to produce outputs in specified formats (e.g., JSON). Existing methods primarily parse LR(1) grammars into a pushdown automaton (PDA), which incurs runtime execution overhead for context-dependent token processing and is especially inefficient under large inference batches. To address these issues, we propose Pre$^3$, which exploits deterministic pushdown automata (DPDAs) to optimize constrained LLM decoding efficiency. First, by precomputing prefix-conditioned edges during preprocessing, Pre$^3$ enables ahead-of-time edge analysis and thus makes parallel transition processing possible. Second, by leveraging the prefix-conditioned edges, Pre$^3$ introduces a novel approach that transforms LR(1) transition graphs into a DPDA, eliminating the need for runtime path exploration and achieving edge transitions with minimal overhead. Pre$^3$ can be seamlessly integrated into standard LLM inference frameworks, reducing time per output token (TPOT) by up to 40% and increasing throughput by up to 36% in our experiments. Our code is available at https://github.com/ModelTC/lightllm.
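The key property a DPDA buys is that at most one edge exists per (state, stack-top, token) triple, so both the legal-token mask and the transition itself reduce to table lookups that can be precomputed offline. The following is a minimal, hypothetical Python sketch of such a lookup structure on a toy balanced-parentheses grammar; the `Edge` and `DPDA` classes and their fields are illustrative assumptions, not the paper's or lightllm's actual data structures.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass(frozen=True)
class Edge:
    """A precomputed, prefix-conditioned DPDA edge (illustrative)."""
    next_state: int
    pop: int                   # number of symbols popped off the stack
    push: Tuple[str, ...]      # symbols pushed onto the stack

class DPDA:
    """Toy DPDA whose edges are precomputed offline from a transition graph.

    Each (state, stack_top, token) key maps to at most one edge, so a
    transition is a single dict lookup with no runtime path exploration.
    The per-(state, stack_top) token masks are also built ahead of time,
    which is what allows a batched decoder to apply them in parallel.
    """
    def __init__(self, edges: Dict[Tuple[int, str, str], Edge]):
        self.edges = edges
        self.masks: Dict[Tuple[int, str], set] = {}
        for (state, top, token) in edges:
            self.masks.setdefault((state, top), set()).add(token)

    def allowed_tokens(self, state: int, stack: list) -> set:
        # Ahead-of-time mask lookup: tokens that keep the prefix valid.
        return self.masks.get((state, stack[-1]), set())

    def step(self, state: int, stack: list, token: str) -> int:
        edge = self.edges[(state, stack[-1], token)]  # deterministic, O(1)
        if edge.pop:
            del stack[-edge.pop:]
        stack.extend(edge.push)
        return edge.next_state

# Usage on a toy balanced-parentheses grammar ("$" is the stack bottom).
edges = {
    (0, "$", "("): Edge(next_state=0, pop=0, push=("A",)),
    (0, "A", "("): Edge(next_state=0, pop=0, push=("A",)),
    (0, "A", ")"): Edge(next_state=0, pop=1, push=()),
}
dpda = DPDA(edges)
state, stack = 0, ["$"]
for tok in "(())":
    assert tok in dpda.allowed_tokens(state, stack)
    state = dpda.step(state, stack, tok)
print(stack)  # ['$'] -- back at the bottom marker, so the prefix is complete
```

In a nondeterministic PDA the decoder may have to explore several candidate paths per token to compute the mask; in the sketch above both `allowed_tokens` and `step` are constant-time lookups, which is the efficiency argument the abstract makes.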

Junyi Chen, Shihao Bai, Zaijun Wang, Siyu Wu, Chuheng Du, Hailong Yang, Ruihao Gong, Shengzhong Liu, Fan Wu, Guihai Chen • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Constrained LLM Decoding | Llama-3-8B | Inference Time (ms) | 11.77 | 14 |
| Constrained LLM Decoding | Qwen2-14B INT8 | Inference Time (ms) | 16.52 | 14 |
| Constrained LLM Decoding | DeepSeek-V2-Lite-Chat 15.7B | Inference Time (ms) | 49.91 | 10 |
| Constrained LLM Decoding | Llama-2-70B | Latency (ms) | 27.2 | 10 |
| LLM Decoding | Llama-3-8B | Decode Time per Step | 0.5172 | 4 |
| LLM Decoding | Llama-2-70B | Per-step Decoding Latency | 0.2163 | 4 |

Other info

Code: https://github.com/ModelTC/lightllm
