Extending Puzzle for Mixture-of-Experts Reasoning Models with Application to GPT-OSS Acceleration
About
Reasoning-focused LLMs improve answer quality by generating longer reasoning traces, but the additional tokens dramatically increase serving cost, motivating inference optimization. We extend and apply Puzzle, a post-training neural architecture search (NAS) framework, to gpt-oss-120B to produce gpt-oss-puzzle-88B, a deployment-optimized derivative. Our approach combines heterogeneous MoE expert pruning, selective replacement of full-context attention with window attention, FP8 KV-cache quantization with calibrated scales, and post-training reinforcement learning to recover accuracy, while keeping generation length low. In per-token terms, gpt-oss-puzzle-88B achieves 1.63X and 1.22X throughput speedups on an 8xH100 node in long-context and short-context settings, respectively, and a 2.82X throughput speedup on a single NVIDIA H100 GPU. However, because token counts can change with reasoning effort and model variant, per-token throughput (tok/s) and latency (ms/token) do not necessarily translate into end-to-end speedups: a 2X throughput gain is erased if traces grow 2X. Conversely, throughput gains can be spent on more reasoning tokens to improve accuracy; we therefore advocate request-level efficiency metrics that normalize throughput by the number of tokens generated and trace an accuracy--speed frontier across reasoning efforts. We show that gpt-oss-puzzle-88B improves over gpt-oss-120B along the entire frontier, delivering up to 1.29X higher request-level efficiency. Across various benchmarks, gpt-oss-puzzle-88B matches or slightly exceeds the parent on suite-average accuracy across reasoning efforts, with retention ranging from 100.8% (high effort) to 108.2% (low effort), showing that post-training architecture search can substantially reduce inference cost without sacrificing quality.
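The request-level view described above can be made concrete with a small arithmetic sketch. The numbers and the helper function below are illustrative only (they are not the paper's measurements or its exact metric definition); the point is that per-token throughput must be normalized by tokens generated per request before comparing model variants.

```python
# Illustrative sketch of request-level efficiency: per-token throughput
# normalized by tokens generated per request. All numbers are hypothetical.

def request_level_efficiency(throughput_tok_s: float, tokens_per_request: float) -> float:
    """Requests completed per second for a single stream:
    per-token throughput divided by tokens generated per request."""
    return throughput_tok_s / tokens_per_request

# Baseline model: 1000 tok/s, 4000-token reasoning traces.
base = request_level_efficiency(1000.0, 4000.0)   # 0.25 requests/s

# Variant A: 2X per-token throughput, but traces also grow 2X,
# so the apparent speedup is erased at the request level.
variant_a = request_level_efficiency(2000.0, 8000.0)  # 0.25 requests/s

# Variant B: 1.63X per-token throughput with unchanged trace length,
# which carries through to a 1.63X request-level speedup.
variant_b = request_level_efficiency(1630.0, 4000.0)

print(f"baseline: {base:.4f} req/s, A: {variant_a:.4f} req/s, B: {variant_b:.4f} req/s")
print(f"request-level speedup of B over baseline: {variant_b / base:.2f}X")
```

This is why the abstract reports both per-token speedups and request-level efficiency: the former is a property of the serving stack, while the latter also accounts for how verbose the model is at a given reasoning effort.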
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Language Understanding | MMLU-Pro | Accuracy | 79.32 | 70 |
| Scientific Reasoning | GPQA Diamond | Accuracy | 75.25 | 45 |
| Reasoning | Reasoning Benchmark Suite Aggregate | Average Score | 59.44 | 26 |
| Instruction Following | IFBench | Accuracy | 67.77 | 25 |
| General Reasoning | HLE | Accuracy | 17.52 | 21 |
| Mathematics | AIME25 | Accuracy | 93.33 | 16 |
| Reasoning | AALCR | Accuracy | 42.25 | 16 |
| Scientific Coding | SciCode | Accuracy | 0.4142 | 16 |
| Retrieval | RULER 128K context | Accuracy | 66.71 | 12 |
| Max token throughput | 64K/64K serving scenario, 8xH100 node 1.0 | Max Throughput (K tok/s) | 9.3 | 5 |