# Unlocking Data Value in Finance: A Study on Distillation and Difficulty-Aware Training

## About
Large Language Models (LLMs) have demonstrated strong general capabilities, yet their deployment in finance remains challenging due to dense domain-specific terminology, stringent numerical reasoning requirements, and low tolerance for factual errors. We conduct a controlled empirical study showing that in specialized vertical domains, performance is largely determined by the quality and the difficulty/verifiability profile of post-training data. We introduce **ODA-Fin-SFT-318k**, constructed via multi-stage distillation and verification to produce high-quality Chain-of-Thought (CoT) supervision, and **ODA-Fin-RL-12k**, curated for hard-but-verifiable tasks that balance reward precision and task diversity. Using standard SFT and RL pipelines, we show that high-quality CoT distillation establishes a robust foundation during SFT, while difficulty- and verifiability-aware sampling improves RL generalization. Evaluated on nine benchmarks spanning general financial tasks, sentiment analysis, and numerical reasoning, our ODA-Fin-RL-8B model consistently surpasses open-source state-of-the-art (SOTA) financial LLMs of comparable size. We release the ODA-Fin-SFT-318k and ODA-Fin-RL-12k datasets, along with the trained models, to advance data-centric financial AI research.
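The "difficulty- and verifiability-aware sampling" idea can be illustrated with a minimal sketch: sample several candidate solutions per task, score them with an automatic verifier, and keep only tasks whose empirical pass rate falls in a middle band (hard enough to drive learning, verifiable enough to yield a reliable reward). The function names, the band `[lo, hi]`, and the toy solver/verifier below are all illustrative assumptions, not the paper's actual pipeline.

```python
from typing import Callable, Iterable, List

def pass_rate(item: dict, solve: Callable, verify: Callable,
              n_samples: int = 8) -> float:
    """Fraction of n_samples solution attempts that the verifier accepts."""
    return sum(bool(verify(item, solve(item))) for _ in range(n_samples)) / n_samples

def select_hard_verifiable(items: Iterable[dict], solve: Callable,
                           verify: Callable, lo: float = 0.1,
                           hi: float = 0.7) -> List[dict]:
    """Keep items whose pass rate lies in [lo, hi]: not already solved
    (rate <= hi) yet still automatically checkable (rate >= lo)."""
    return [it for it in items if lo <= pass_rate(it, solve, verify) <= hi]

# Toy demo (hypothetical data): each item is answered correctly on a fixed
# fraction of attempts, simulated deterministically per item.
def make_solver() -> Callable:
    calls = {}
    def solve(item: dict):
        k = calls.get(item["id"], 0)
        calls[item["id"]] = k + 1
        # correct on the first round(frac * 8) attempts, wrong afterwards
        return item["answer"] if k < round(item["frac"] * 8) else None
    return solve

items = [
    {"id": 1, "frac": 1.0, "answer": 42},  # too easy -> dropped
    {"id": 2, "frac": 0.5, "answer": 7},   # hard but verifiable -> kept
    {"id": 3, "frac": 0.0, "answer": 9},   # never verified -> dropped
]
verify = lambda item, pred: pred == item["answer"]
pool = select_hard_verifiable(items, make_solver(), verify)
# only the item with a mid-range pass rate (id 2) survives
```

The band thresholds are a hyperparameter: widening `[lo, hi]` trades reward precision for task diversity, which is the balance the dataset curation targets.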
## Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Sentiment Analysis | FOMC | -- | -- | 44 |
| Financial Reasoning | FinQA | Accuracy | 73.3 | 33 |
| Financial Reasoning | ConvFinQA | Accuracy | 80.4 | 23 |
| Sentiment Analysis | FPB | Weighted F1 | 0.834 | 15 |
| Sentiment Analysis | Headlines | Weighted F1 | 78.5 | 15 |
| Financial Knowledge | FinanceIQ | Accuracy | 74.2 | 15 |
| Financial Knowledge | Fineval | Accuracy | 77 | 15 |
| Numerical Reasoning | TATQA | Accuracy | 89.3 | 14 |
| Financial Knowledge | Finova | Accuracy | 54.6 | 14 |