RLearner-LLM: Balancing Logical Grounding and Fluency in Large Language Models via Hybrid Direct Preference Optimization

About

Direct Preference Optimization (DPO), the efficient alternative to PPO-based RLHF, falls short on knowledge-intensive generation: standard preference signals from human annotators or LLM judges exhibit a systematic verbosity bias that rewards fluency over logical correctness. This blindspot leaves a logical alignment gap -- SFT models reach NLI entailment of only 0.05-0.22 despite producing fluent text. We propose RLearner-LLM with Hybrid-DPO: an automated preference pipeline that fuses a DeBERTa-v3 NLI signal with a verifier LLM score, removing human annotation while overcoming the "alignment tax" of single-signal optimization. Evaluated across five academic domains (Biology, Medicine, Law) with three base architectures (LLaMA-2-13B, Qwen3-8B, Gemma 4 E4B-it), RLearner-LLM yields up to 6x NLI improvement over SFT, with NLI gains in 11 of 15 cells and consistent answer-coverage gains. On Gemma 4 E4B-it (4.5B effective params), Hybrid-DPO lifts NLI in four of five domains (+11.9% to +2.4x) with faster inference across all five, scaling down to compact base models without losing the alignment-tax mitigation. Our Qwen3-8B RLearner-LLM wins 95% of pairwise comparisons against its own SFT baseline; GPT-4o-mini in turn wins 95% against our concise output -- alongside the 69% win the same judge gives a verbose SFT over our DPO model, this replicates verbosity bias on a frontier comparator and motivates logic-aware metrics (NLI, ACR) over LLM-as-a-judge for knowledge-intensive generation.

Qiming Bao, Juho Leinonen, Paul Denny, Michael J. Witbrock • 2026
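
The abstract describes an automated preference pipeline that fuses an NLI signal with a verifier LLM score, but it does not state the fusion rule. The following is a minimal sketch of one way such a pipeline could build DPO preference pairs, not the authors' implementation. The convex weight `alpha`, the margin filter `min_margin`, the `Candidate` type, and the assumption that both scores lie in [0, 1] are all hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Candidate:
    text: str
    nli_entailment: float  # assumed in [0, 1], e.g. an entailment probability
    verifier_score: float  # assumed in [0, 1], e.g. a normalised verifier LLM rating


def hybrid_score(c: Candidate, alpha: float = 0.5) -> float:
    """Convex combination of the two signals (alpha is a hypothetical weight)."""
    return alpha * c.nli_entailment + (1.0 - alpha) * c.verifier_score


def build_preference_pair(
    candidates: List[Candidate],
    alpha: float = 0.5,
    min_margin: float = 0.1,
) -> Optional[Tuple[str, str]]:
    """Return (chosen, rejected) texts, or None if the pair is too ambiguous."""
    if len(candidates) < 2:
        return None
    ranked = sorted(candidates, key=lambda c: hybrid_score(c, alpha), reverse=True)
    chosen, rejected = ranked[0], ranked[-1]
    # Skip pairs whose hybrid scores are too close to give a clear preference.
    if hybrid_score(chosen, alpha) - hybrid_score(rejected, alpha) < min_margin:
        return None
    return chosen.text, rejected.text


# Made-up scores for illustration only.
candidates = [
    Candidate("verbose but weakly grounded", nli_entailment=0.10, verifier_score=0.90),
    Candidate("concise and grounded", nli_entailment=0.80, verifier_score=0.70),
    Candidate("off-topic", nli_entailment=0.05, verifier_score=0.30),
]
hybrid_pair = build_preference_pair(candidates)               # both signals
verifier_only_pair = build_preference_pair(candidates, alpha=0.0)  # single signal
```

With these made-up scores, the verifier-only setting prefers the verbose answer, while the hybrid setting prefers the grounded one. This mirrors the verbosity-bias argument in the abstract in miniature; it carries no empirical weight.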

Related benchmarks

Task	Dataset	Result
Scientific Reasoning	Sydney Biology per-architecture breakdown (full)	BLEU 8.93	8
Natural Language Inference	Cardiff Biology	NLI Accuracy 35.05	7
Natural Language Inference	Sydney Biology	NLI Score 0.3562	7
Natural Language Inference	Auckland Law	NLI Score 43.77	7
Natural Language Inference	UK Medicine Y1	NLI Score 42.51	7
Natural Language Inference	UK Medicine Y2	NLI Score 38.92	7
Legal Reasoning and Text Generation	Auckland Law	BLEU 13.15	7
Medical Educational Explanation Generation	UK Medicine full per-architecture breakdown (Year 2)	BLEU 0.0325	7
Medical Explanation Generation	UK Medicine Year 1 (full breakdown)	BLEU 0.028	7
