Teaching Metric Distance to Discrete Autoregressive Language Models

About

Large language models (LLMs) operate as autoregressive predictors over discrete token vocabularies, a formulation that has enabled their adaptation far beyond natural language to vision, robotics, and multimodal reasoning. However, training against one-hot targets disregards metric relationships between tokens and limits effectiveness on tasks where distance is meaningful, such as numerical values, spatial coordinates, or quantized embeddings. We introduce DIST2Loss, a distance-aware objective for discrete autoregressive models that replaces one-hot targets with reward-weighted distributions derived from predefined token distances. DIST2Loss can be interpreted as the closed-form solution to entropy-regularized policy optimization with known per-token rewards, retaining the core mechanism of reinforcement learning while avoiding sampling, rollouts, and instability. Our experiments show that DIST2Loss improves data efficiency and downstream performance across diverse domains. It yields tighter bounding boxes in visual grounding, accelerates robotic manipulation by improving action learning, enhances reward modeling for LLM alignment, and strengthens vector-quantized image generation. These results demonstrate that distance-aware supervision offers a simple and general alternative to one-hot supervision for discrete autoregressive models.
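The mechanism described in the abstract can be illustrated with a small sketch. Entropy-regularized optimization with a known reward r(v) has the closed-form solution p(v) ∝ exp(r(v) / τ). Taking r(v) as the negative distance to the ground-truth token gives a soft target that stands in for the one-hot vector. Everything specific below is an assumption made for illustration, not the paper's confirmed implementation: the absolute-difference distance over ordered bins, the temperature value, the cross-entropy form of the loss, and all function names.

```python
import math

def distance_aware_target(target_id, vocab_size, temperature=1.0):
    """Soft target with p(v) proportional to exp(-|v - target_id| / temperature)."""
    # Assumption: tokens are ordered bins, so |v - target_id| is a valid distance.
    rewards = [-abs(v - target_id) / temperature for v in range(vocab_size)]
    max_r = max(rewards)  # subtract the max for numerical stability
    weights = [math.exp(r - max_r) for r in rewards]
    total = sum(weights)
    return [w / total for w in weights]

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def soft_cross_entropy(logits, target_probs):
    """Cross-entropy between a soft target distribution and softmax(logits)."""
    return -sum(p * lp for p, lp in zip(target_probs, log_softmax(logits)))

def one_hot_cross_entropy(logits, target_id):
    """Standard cross-entropy against a one-hot target."""
    return -log_softmax(logits)[target_id]

# Ground-truth bin is 5. One prediction peaks at the neighbouring bin 4,
# the other at the distant bin 0.
target = distance_aware_target(target_id=5, vocab_size=10, temperature=1.0)
near_miss = [0.0] * 10
near_miss[4] = 5.0
far_miss = [0.0] * 10
far_miss[0] = 5.0

soft_near = soft_cross_entropy(near_miss, target)
soft_far = soft_cross_entropy(far_miss, target)
hard_near = one_hot_cross_entropy(near_miss, 5)
hard_far = one_hot_cross_entropy(far_miss, 5)
```

In this toy setup the one-hot loss is the same for both predictions, since each assigns the same probability to bin 5, while the distance-aware loss is lower for the near miss than for the far miss. No sampling or rollouts are needed because the target distribution is computed directly from the distances.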

Jiwan Chung, Saejin Kim, Yongrae Jo, Jaewoo Park, Dongjun Min, Youngjae Yu • 2025

Related benchmarks

Task	Dataset	Result
Reward Modeling	RewardBench	Safety Score 86.5	284
Visual Grounding	RefCOCO+ (val)	Accuracy 87.1	264
Visual Grounding	RefCOCO+ (testA)	Accuracy 92.2	256
Visual Grounding	RefCOCO+ (testB)	Accuracy 81.4	230
Visual Grounding	RefCOCO (val)	Accuracy 94.8	177
Visual Grounding	RefCOCO (testA)	Accuracy 94.5	167
Visual Grounding	RefCOCO (testB)	Accuracy 87.3	164
Visual Grounding	RefCOCOg (val)	Accuracy 92.8	163
Visual Grounding	RefCOCOg (test)	Accuracy 88	160
Image Generation	ImageNet	FID 3.04	106

Showing 10 of 16 rows
