Topology-Enhanced Alignment for Large Language Models: Trajectory Topology Loss and Topological Preference Optimization

About

Alignment of large language models (LLMs) via SFT and RLHF/DPO typically ignores the global geometry of the representation space, relying instead on local token likelihoods or scalar scores. We view generation as tracing a semantic trajectory in hidden space and propose a topology-enhanced alignment framework that regularizes these trajectories using 0-dimensional persistent homology. First, for SFT, we introduce Trajectory Topology Loss (TTL). Treating prompt and gold-answer embeddings as a mixed point cloud, we use a 0D persistent homology algorithm to extract "prompt-answer bridges." TTL aligns the model's actual update direction with these topological bridges rather than with arbitrary directions. Second, for DPO, we propose Topological Preference Optimization (TPO). TPO constructs topic-specific semantic preference vectors and, in an intermediate hidden layer, aligns the improvement direction from rejected to chosen responses with these vectors. We also introduce a dynamic weighting scheme to balance the DPO and TPO losses. In evaluations on Qwen2.5-7B-Instruct using UltraChat and Anthropic HH-RLHF, our topology-enhanced objectives consistently outperform strong non-topological baselines (e.g., per-example, nearest-neighbor, and random regularizers) on automatic preference metrics and LLM-judge evaluations, while maintaining or improving toxicity scores. These results suggest that persistent homology and trajectory geometry offer a promising direction for controllable alignment.

Yurui Pan, Ke Xu, Bo Peng • 2026
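
The abstract does not spell out how "prompt-answer bridges" are extracted, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It relies on one standard fact: the 0-dimensional persistent homology of a Vietoris-Rips filtration is determined by the minimum spanning tree of the point cloud, since each tree edge marks a scale at which two connected components merge. The sketch assumes that a "bridge" is such a merge edge joining one prompt point to one answer point, and that alignment is measured by cosine similarity. The function names (`zero_dim_merge_edges`, `prompt_answer_bridges`, `alignment_penalty`) are hypothetical.

```python
import numpy as np


def zero_dim_merge_edges(points):
    """Return the edges (i, j, length) at which connected components merge.

    These are the minimum-spanning-tree edges found by Kruskal's algorithm;
    each one corresponds to a finite bar ending in the 0D persistence diagram.
    """
    n = len(points)
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    pairs = sorted((dists[i, j], i, j) for i in range(n) for j in range(i + 1, n))
    parent = list(range(n))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    merges = []
    for d, i, j in pairs:
        ri, rj = find(i), find(j)
        if ri != rj:  # this edge joins two previously separate components
            parent[ri] = rj
            merges.append((i, j, float(d)))
    return merges


def prompt_answer_bridges(prompt_emb, answer_emb):
    """ASSUMPTION: a bridge is a merge edge with one prompt and one answer endpoint.

    Returns (prompt_index, answer_index, direction) triples, where direction
    points from the prompt embedding to the answer embedding.
    """
    points = np.vstack([prompt_emb, answer_emb])
    n_prompt = len(prompt_emb)
    bridges = []
    for i, j, _ in zero_dim_merge_edges(points):
        # Edges are stored with i < j, so a mixed edge has i in the prompt block.
        if i < n_prompt <= j:
            bridges.append((i, j - n_prompt, points[j] - points[i]))
    return bridges


def alignment_penalty(update_dir, bridge_dir, eps=1e-8):
    """ASSUMPTION: 1 - cosine similarity between an update and a bridge direction."""
    denom = np.linalg.norm(update_dir) * np.linalg.norm(bridge_dir) + eps
    return 1.0 - float(update_dir @ bridge_dir) / denom


# Toy usage with 2-D stand-ins for hidden-state embeddings.
prompt_emb = np.array([[0.0, 0.0], [1.0, 0.0]])
answer_emb = np.array([[0.0, 2.0], [1.0, 3.0]])
for p_idx, a_idx, direction in prompt_answer_bridges(prompt_emb, answer_emb):
    print(p_idx, a_idx, direction, alignment_penalty(np.array([0.0, 1.0]), direction))
```

In an actual training loop the penalty would have to be computed on differentiable hidden states and added to the SFT loss with some weight; the abstract gives neither the layer nor the weighting, so those choices are left open here.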

Related benchmarks

Task	Dataset	Metric	Result	
Instruction Following	IFEval	IFEval Accuracy	71.8	854
Instruction Following	AlpacaEval 2.0	Win Rate	55.6	752
Multi-turn Dialogue Evaluation	MT-Bench	Overall Score	8.88	532
Instruction Following	AlpacaEval	Win Rate	56.5	423
Reward Modeling	RewardBench	--	--	284
Multi-turn dialogue	MT-Bench	MT-Bench Score	8.81	126
Harmlessness evaluation	HH-RLHF	Harmlessness Rate	94.5	6
Reward Modeling Evaluation	RewardBench	R-Bench Score	88.1	3
Alignment Evaluation	HH-RLHF (test)	Reward Model Score	65.4	2
Instruction Following	UltraChat	RM Score	67.8	2
