
Teaching Metric Distance to Discrete Autoregressive Language Models

About

Large language models (LLMs) operate as autoregressive predictors over discrete token vocabularies, a formulation that has enabled their adaptation far beyond natural language to vision, robotics, and multimodal reasoning. However, training against one-hot targets disregards metric relationships between tokens and limits effectiveness on tasks where distance is meaningful, such as numerical values, spatial coordinates, or quantized embeddings. We introduce DIST2Loss, a distance-aware objective for discrete autoregressive models that replaces one-hot targets with reward-weighted distributions derived from predefined token distances. DIST2Loss can be interpreted as the closed-form solution to entropy-regularized policy optimization with known per-token rewards, retaining the core mechanism of reinforcement learning while avoiding sampling, rollouts, and instability. Our experiments show that DIST2Loss improves data efficiency and downstream performance across diverse domains. It yields tighter bounding boxes in visual grounding, accelerates robotic manipulation by improving action learning, enhances reward modeling for LLM alignment, and strengthens vector-quantized image generation. These results demonstrate that distance-aware supervision offers a simple and general alternative to one-hot supervision for discrete autoregressive models.
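The abstract does not give the loss formula, so the following is only a minimal sketch of the idea it describes. It assumes tokens that stand for numeric values, absolute difference as the predefined token distance, and a temperature `tau`; the function names and toy logits are hypothetical. The target distribution is p(v) proportional to exp(-d(v, target) / tau), which is the standard closed-form maximizer of an entropy-regularized objective when the per-token reward is the negative distance.

```python
import math

def distance_soft_targets(values, target_index, tau=1.0):
    """Soft target p(v) proportional to exp(-|value_v - value_target| / tau).

    Assumption: tokens map to numeric values and distance is absolute
    difference; the paper may use other distances per domain.
    """
    target_value = values[target_index]
    scores = [-abs(v - target_value) / tau for v in values]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    top = max(logits)
    log_z = top + math.log(sum(math.exp(l - top) for l in logits))
    return [l - log_z for l in logits]

def soft_cross_entropy(model_logits, targets):
    """Cross-entropy between a target distribution and the model's softmax."""
    log_probs = log_softmax(model_logits)
    return -sum(p * lp for p, lp in zip(targets, log_probs))

# Toy vocabulary of five numeric tokens; the correct token is the value 2.
values = [0, 1, 2, 3, 4]
soft = distance_soft_targets(values, target_index=2, tau=0.5)
one_hot = [0.0, 0.0, 1.0, 0.0, 0.0]

near_miss = [0.0, 0.0, 0.0, 5.0, 0.0]  # model peaks on token 3 (distance 1)
far_miss = [5.0, 0.0, 0.0, 0.0, 0.0]   # model peaks on token 0 (distance 2)

print("one-hot:", soft_cross_entropy(near_miss, one_hot),
      soft_cross_entropy(far_miss, one_hot))
print("soft:   ", soft_cross_entropy(near_miss, soft),
      soft_cross_entropy(far_miss, soft))
```

In this toy setup the one-hot loss is the same for both wrong predictions, while the distance-aware target gives the near miss a lower loss than the far miss. As `tau` approaches zero, the soft target collapses back toward the one-hot target.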

Jiwan Chung, Saejin Kim, Yongrae Jo, Jaewoo Park, Dongjun Min, Youngjae Yu • 2025

Related benchmarks

Task              Dataset           Metric      Result  Rank
Visual Grounding  RefCOCO+ (val)    Accuracy    87.1    253
Visual Grounding  RefCOCO+ (testA)  Accuracy    92.2    245
Visual Grounding  RefCOCO+ (testB)  Accuracy    81.4    219
Reward Modeling   RewardBench       Chat Score  95      216
Visual Grounding  RefCOCO (val)     Accuracy    94.8    172
Visual Grounding  RefCOCO (testA)   Accuracy    94.5    162
Visual Grounding  RefCOCO (testB)   Accuracy    87.3    159
Visual Grounding  RefCOCOg (val)    Accuracy    92.8    158
Visual Grounding  RefCOCOg (test)   Accuracy    88      155
Image Generation  ImageNet          FID         3.04    101

Showing 10 of 16 rows
