PhyloLM: Inferring the Phylogeny of Large Language Models and Predicting their Performances in Benchmarks

About

This paper introduces PhyloLM, a method adapting phylogenetic algorithms to Large Language Models (LLMs) to explore whether and how they relate to each other and to predict their performance characteristics. Our method calculates a phylogenetic distance metric based on the similarity of LLMs' outputs. The resulting metric is then used to construct dendrograms, which satisfactorily capture known relationships across a set of 111 open-source and 45 closed models. Furthermore, our phylogenetic distance predicts performance in standard benchmarks, thus demonstrating its functional validity and paving the way for a time- and cost-effective estimation of LLM capabilities. To sum up, by translating population genetic concepts to machine learning, we propose and validate a tool to evaluate LLM development, relationships and capabilities, even in the absence of transparent training information.
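The pipeline described above (compare model outputs, derive a distance, build a dendrogram) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the distance formula, so the code assumes a Nei-style genetic distance borrowed from population genetics (negative log of the normalized overlap between token distributions), and it uses plain average-linkage clustering as a stand-in for whichever tree-building algorithm the paper applies. All function names, model names, prompts, and tokens are hypothetical toy values.

```python
import math
from collections import Counter


def token_frequencies(samples):
    """Turn a list of sampled tokens into a {token: probability} map."""
    counts = Counter(samples)
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}


def similarity_distance(outputs_a, outputs_b):
    """Assumed Nei-style distance: -log of the normalized distribution overlap.

    outputs_a / outputs_b map each prompt to the tokens a model sampled for it.
    """
    overlap = norm_a = norm_b = 0.0
    for prompt in outputs_a.keys() & outputs_b.keys():
        pa = token_frequencies(outputs_a[prompt])
        pb = token_frequencies(outputs_b[prompt])
        overlap += sum(p * pb.get(tok, 0.0) for tok, p in pa.items())
        norm_a += sum(p * p for p in pa.values())
        norm_b += sum(p * p for p in pb.values())
    if overlap == 0.0:
        return math.inf  # no shared outputs at all
    # Clamp at 0 to absorb floating-point rounding for identical models.
    return max(0.0, -math.log(overlap / math.sqrt(norm_a * norm_b)))


def build_dendrogram(names, dist):
    """Average-linkage agglomerative clustering; returns a nested-tuple tree."""
    clusters = [(name, [name]) for name in names]  # (subtree, member models)
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                mi, mj = clusters[i][1], clusters[j][1]
                avg = sum(dist[a][b] for a in mi for b in mj) / (len(mi) * len(mj))
                if best is None or avg < best[0]:
                    best = (avg, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


# Hypothetical toy data: tokens sampled by three models on two prompts.
toy_outputs = {
    "base": {"p1": ["a", "a", "b", "a"], "p2": ["x", "y", "x", "x"]},
    "finetune": {"p1": ["a", "a", "b", "b"], "p2": ["x", "x", "x", "y"]},
    "unrelated": {"p1": ["c", "c", "b", "c"], "p2": ["z", "z", "y", "z"]},
}
names = list(toy_outputs)
dist = {
    a: {b: similarity_distance(toy_outputs[a], toy_outputs[b]) for b in names}
    for a in names
}
tree = build_dendrogram(names, dist)
print(tree)
```

On this toy data the two models with similar outputs ("base" and "finetune") are merged first, and "unrelated" joins the tree last, which is the kind of structure a dendrogram of related models is expected to show.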

Nicolas Yax, Pierre-Yves Oudeyer, Stefano Palminteri • 2024

Related benchmarks

Task	Dataset	Metric	Result
Ranking correlation with full dataset evaluation	WinoGrande	Kendall Correlation	0.439	13
Static model relationship recognition	Bench-A 20 fixed (train test)	Accuracy	85	10
LLM-relation detection	LLM Relationship Detection (test)	Accuracy	74.2	6
Model Relation Prediction	Model Relation Prediction Dataset 135 pairs	Accuracy	70.4	5
Benchmark Ranking Prediction	MMLU	Kendall's Tau Correlation	0.557	3
Benchmark Ranking Prediction	GSM8K	Kendall's Tau Correlation	0.536	3
Benchmark Ranking Prediction	ARC	Kendall's Tau	0.514	3
Benchmark Ranking Prediction	HellaSwag	Kendall's Tau	0.601	3
Benchmark Ranking Prediction	TruthfulQA	Kendall's Tau Correlation	0.26	3

