RLearner-LLM: Balancing Logical Grounding and Fluency in Large Language Models via Hybrid Direct Preference Optimization
About
Direct Preference Optimization (DPO), an efficient alternative to PPO-based RLHF, falls short on knowledge-intensive generation: standard preference signals from human annotators or LLM judges exhibit a systematic verbosity bias that rewards fluency over logical correctness. This blind spot leaves a logical alignment gap: SFT models reach NLI entailment scores of only 0.05-0.22 despite producing fluent text. We propose RLearner-LLM with Hybrid-DPO, an automated preference pipeline that fuses a DeBERTa-v3 NLI signal with a verifier LLM score, removing the need for human annotation while overcoming the "alignment tax" of single-signal optimization. Evaluated on five academic datasets spanning Biology, Medicine, and Law with three base architectures (LLaMA-2-13B, Qwen3-8B, Gemma 4 E4B-it), RLearner-LLM yields up to a 6x NLI improvement over SFT, with NLI gains in 11 of 15 dataset-architecture cells and consistent answer-coverage gains. On Gemma 4 E4B-it (4.5B effective parameters), Hybrid-DPO lifts NLI in four of five domains (+11.9% to +2.4x) with faster inference across all five, showing that the approach scales down to compact base models without losing the alignment-tax mitigation. Our Qwen3-8B RLearner-LLM wins 95% of pairwise comparisons against its own SFT baseline, while GPT-4o-mini in turn wins 95% against our concise output. Together with the 69% win rate the same judge gives a verbose SFT model over our DPO model, this replicates verbosity bias on a frontier comparator and motivates logic-aware metrics (NLI, ACR) over LLM-as-a-judge for knowledge-intensive generation.
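The fusion of an NLI signal with a verifier score into automated preference pairs can be illustrated with a minimal sketch. It is not the paper's implementation: the linear weighting `alpha`, the `min_margin` filter, the best-versus-worst pairing rule, and all names (`Candidate`, `hybrid_score`, `build_preference_pair`) are assumptions made for illustration, and both scores are assumed to be precomputed and rescaled to [0, 1].

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Candidate:
    text: str
    nli: float       # assumed: entailment probability in [0, 1] from an NLI model
    verifier: float  # assumed: verifier LLM score rescaled to [0, 1]


def hybrid_score(c: Candidate, alpha: float = 0.5) -> float:
    """Weighted mix of the two signals (the weighting scheme is an assumption)."""
    return alpha * c.nli + (1.0 - alpha) * c.verifier


def build_preference_pair(
    candidates: List[Candidate],
    alpha: float = 0.5,
    min_margin: float = 0.1,
) -> Optional[Tuple[Candidate, Candidate]]:
    """Return (chosen, rejected) for DPO, or None if the signals cannot separate them."""
    if len(candidates) < 2:
        return None
    ranked = sorted(candidates, key=lambda c: hybrid_score(c, alpha), reverse=True)
    chosen, rejected = ranked[0], ranked[-1]
    # Skip near-ties so that noisy score differences do not become training pairs.
    if hybrid_score(chosen, alpha) - hybrid_score(rejected, alpha) < min_margin:
        return None
    return chosen, rejected


if __name__ == "__main__":
    concise = Candidate("Short, well-grounded answer.", nli=0.9, verifier=0.6)
    verbose = Candidate("Long, fluent, weakly grounded answer.", nli=0.1, verifier=0.8)
    pair = build_preference_pair([concise, verbose])
    print(pair[0].text if pair else "no pair")
```

In this sketch a fluent but weakly grounded answer can still be rejected even when the verifier alone prefers it, which is the kind of single-signal bias the hybrid signal is meant to offset.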
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Scientific Reasoning | Sydney Biology per-architecture breakdown (full) | BLEU | 8.93 | 8 |
| Natural Language Inference | Cardiff Biology | NLI Accuracy | 35.05 | 7 |
| Natural Language Inference | Sydney Biology | NLI Score | 0.3562 | 7 |
| Natural Language Inference | Auckland Law | NLI Score | 43.77 | 7 |
| Natural Language Inference | UK Medicine Y1 | NLI Score | 42.51 | 7 |
| Natural Language Inference | UK Medicine Y2 | NLI Score | 38.92 | 7 |
| Legal Reasoning and Text Generation | Auckland Law | BLEU | 13.15 | 7 |
| Medical Educational Explanation Generation | UK Medicine full per-architecture breakdown (Year 2) | BLEU | 0.0325 | 7 |
| Medical Explanation Generation | UK Medicine Year 1 (full breakdown) | BLEU | 0.028 | 7 |