Abstract
We study empirical scaling laws for language model merging measured by cross-entropy. Despite wide practical use, merging has lacked a quantitative rule predicting returns as experts are added or model size scales. We identify a compact power law coupling model size and expert count: a size-dependent loss floor that decreases with capacity and an inverse‑k merging tail with clear diminishing returns. The law holds in-domain and cross-domain, tightly fits curves across architectures and methods (Average, TA, TIES, DARE), and explains two regularities: most gains arrive early and variability shrinks as more experts are included. A simple theory accounts for the ~1/k gain pattern and links floor and tail to base model properties and cross-domain diversity. This enables predictive planning: estimate experts needed for a target loss, decide when to stop, and trade off scaling the base model versus adding experts under a fixed budget—turning merging from heuristic practice into a computationally efficient, plannable alternative to multitask training. It suggests a scaling principle for distributed generative AI: predictable gains via composing specialists, offering a complementary path toward AGI-level systems.
Model Merging Scaling Law. Cross-entropy (CE) vs. number of merged experts $k$ at multiple model sizes $N$ for four merging methods.
In our new paper, “Model Merging Scaling Laws in Large Language Models,” we introduce the first unified scaling law for model merging. This law quantitatively predicts how cross-entropy loss decreases as you scale:
- The number of merged experts $k$
- The base model size $N$
Our findings reveal a consistent power-law relationship that holds across domains, architectures, and merging methods — turning merging from a heuristic practice into a predictable, budget-aware strategy.
The Merging Scaling Law: What It Is and Why It Matters
We propose a compact, interpretable scaling law that captures the joint effect of model size and expert count:
Scaling law formula visualization.
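Concretely, the shape implied by the floor-plus-inverse-$k$ description is (our sketch of the law's form; see the paper for the exact parameterization of $L_\infty(N)$ and $A(N)$):

$$L(N, k) \;\approx\; L_\infty(N) + \frac{A(N)}{k}$$

where $L_\infty(N)$ is the size-dependent loss floor and $A(N)$ is the merging-tail amplitude, both shrinking as the base model size $N$ grows.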
This means:
- Larger models have lower floors and shorter tails — they’re easier to merge and saturate faster
- Most gains come from the first few experts — merging is highly efficient early on
- The same law applies to both in-domain and cross-domain merging
Overview of merging vs. multitask training.
The law is validated across 10,506 merged models spanning 0.5B to 72B parameters, 9 domains, and 4 merging methods (Average, TA, TIES, DARE), with fits achieving $R^2 > 0.98$.
Key Insights & Takeaways
- Larger Models Are Easier to Merge
  - The loss floor $L_\infty(N)$ drops predictably with model size
  - Tail amplitude $A(N)$ shrinks — larger models need fewer experts to reach saturation
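Spelled out under the same sketch (with $L_0$, $b$, and $\alpha$ as illustrative placeholders rather than the paper's fitted constants), a power-law floor plus a shrinking amplitude reads:

$$L_\infty(N) \;\approx\; L_0 + b\,N^{-\alpha}, \qquad A(N)\ \text{decreasing in}\ N$$

which is why larger backbones both end lower and saturate after fewer experts.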
Larger models are easier to merge.
Method sensitivity is low at scale.
Cross-backbone validation on LLaMA.
Practical Recommendations for Model Merging
Predicting the k-curve from three points.
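As a minimal sketch of how such a three-point prediction could work, assuming the floor-plus-inverse-$k$ form above (the measurements, parameter names, and target value below are hypothetical, not the paper's data or code):

```python
import numpy as np
from scipy.optimize import curve_fit

def merge_law(k, L_inf, A):
    """Assumed floor-plus-inverse-k form: L(k) = L_inf + A / k."""
    return L_inf + A / k

# Hypothetical CE measurements after merging k experts (illustrative numbers only).
k_obs = np.array([1, 2, 4], dtype=float)
ce_obs = np.array([2.10, 1.85, 1.72])

# Fit the two parameters (floor and tail amplitude) from the three points.
(L_inf_hat, A_hat), _ = curve_fit(merge_law, k_obs, ce_obs, p0=[ce_obs.min(), 1.0])
print(f"estimated floor L_inf = {L_inf_hat:.3f}, tail amplitude A = {A_hat:.3f}")

# Predict the rest of the k-curve.
for k in range(1, 9):
    print(f"k={k}: predicted CE = {merge_law(k, L_inf_hat, A_hat):.3f}")

# Experts needed for a hypothetical target CE (only reachable if it sits above the floor).
target = 1.68
if target > L_inf_hat:
    k_needed = int(np.ceil(A_hat / (target - L_inf_hat)))
    print(f"about k = {k_needed} experts needed to reach CE <= {target}")
else:
    print("target is below the estimated floor; scale the base model instead")
```

Because this form has only two free parameters per curve, three measured points are enough to pin down the floor and tail and to read off the expert count needed for a target loss.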
- Choose the Right Stopping Point
  - Most domains saturate around k = 5–6 experts
  - Use the law to find when marginal gains drop below your threshold (a sketch follows after this list)
- Trade Off Model Size vs. Expert Count
  - Under a fixed budget: decide whether to scale $N$ or increase $k$
  - Larger $N$ gives better floors; more $k$ improves early gains
- Don’t Over-optimize Method Choice at Scale
  - Method differences compress with larger $k$ and $N$
  - Focus on diversity and quality of experts instead
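For the stopping rule, the assumed inverse-$k$ tail gives a closed-form marginal gain: expert $k{+}1$ improves CE by roughly $A/k - A/(k+1) = A/(k(k+1))$. A minimal sketch (the tolerance value is made up for illustration):

```python
def stopping_point(A: float, tol: float, k_max: int = 50) -> int:
    """Smallest k at which adding one more expert improves CE by less than tol,
    under the assumed tail L(k) = L_inf + A / k (the floor cancels out)."""
    for k in range(1, k_max):
        marginal_gain = A / (k * (k + 1))  # = A/k - A/(k + 1)
        if marginal_gain < tol:
            return k
    return k_max

# Using the amplitude fitted in the sketch above and a made-up tolerance of 0.012 CE.
print(stopping_point(A=0.5, tol=0.012))  # -> 6, consistent with saturation around k = 5-6
```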
Conclusion: From Art to Science
Our work provides:
- A unified scaling law for model merging
- Large-scale validation across domains, sizes, and methods
- A simple theory explaining the inverse-k tail
- An operational recipe for predictive planning
Model merging is no longer a black art — it’s a predictable, budgetable alternative to multitask training that scales with clear, quantifiable returns.
Citation Information
If you find this work useful, a citation to the following paper is welcome:
@article{wang2025model,
  title={Model Merging Scaling Laws in Large Language Models},
  author={Wang, Yuanyi and Gu, Yanggan and Zhang, Yiming and Zhou, Qi and Yan, Zhaoyi and Xie, Congkai and Wang, Xinyao and Yuan, Jianbo and Yang, Hongxia},
  journal={arXiv preprint arXiv:2509.24244},
  year={2025}
}
Read the full paper & code:
Contact: hongxia.yang@polyu.edu.hk