Model Merging Scaling Laws: A New Way to Predict and Plan LLM Composition

Yuanyi Wang, Yanggan Gu, Yiming Zhang, Qi Zhou, Zhaoyi Yan, Congkai Xie, Xinyao Wang, Jianbo Yuan, Hongxia Yang
InfiX.ai

Abstract

We study empirical scaling laws for language model merging measured by cross-entropy. Despite wide practical use, merging has lacked a quantitative rule predicting returns as experts are added or model size scales. We identify a compact power law coupling model size and expert count: a size-dependent loss floor that decreases with capacity and an inverse‑k merging tail with clear diminishing returns. The law holds in-domain and cross-domain, tightly fits curves across architectures and methods (Average, TA, TIES, DARE), and explains two regularities: most gains arrive early and variability shrinks as more experts are included. A simple theory accounts for the ~1/k gain pattern and links floor and tail to base model properties and cross-domain diversity. This enables predictive planning: estimate experts needed for a target loss, decide when to stop, and trade off scaling the base model versus adding experts under a fixed budget—turning merging from heuristic practice into a computationally efficient, plannable alternative to multitask training. It suggests a scaling principle for distributed generative AI: predictable gains via composing specialists, offering a complementary path toward AGI-level systems.


Figure: Model Merging Scaling Law. CE vs. number of merged experts $k$ at multiple model sizes $N$ for four merging methods.


In our new paper, “Model Merging Scaling Laws in Large Language Models,” we introduce the first unified scaling law for model merging. This law quantitatively predicts how cross-entropy loss decreases as you scale:

  • The number of merged experts $k$
  • The base model size $N$

Our findings reveal a consistent power-law relationship that holds across domains, architectures, and merging methods — turning merging from a heuristic practice into a predictable, budget-aware strategy.


The Merging Scaling Law: What It Is and Why It Matters

We propose a compact, interpretable scaling law that captures the joint effect of model size and expert count:

$$
L(N, k) = L_\infty(N) + \frac{A(N)}{k + b}
$$

Where:

  • $L_\infty(N)$ is the loss floor: the best achievable performance for a given model size
  • The merging tail $\frac{A(N)}{k+b}$ captures diminishing returns as you add more experts

This means:

  • Larger models have lower floors and shorter tails — they’re easier to merge and saturate faster
  • Most gains come from the first few experts — merging is highly efficient early on
  • The same law applies to both in-domain and cross-domain merging
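
As a minimal sketch of what this looks like in practice, the law is easy to evaluate directly; the parameter values below are illustrative placeholders, not fitted numbers from the paper:

```python
# Minimal sketch of the merging law L(k) = L_inf + A / (k + b).
# L_inf, A, and b are illustrative placeholders, not fitted values from the paper.

def merging_loss(k, L_inf=2.10, A=0.60, b=0.50):
    """Predicted cross-entropy after merging k domain experts at a fixed model size."""
    return L_inf + A / (k + b)

previous = merging_loss(1)
print(f"k=1: CE={previous:.3f}")
for k in range(2, 9):
    current = merging_loss(k)
    print(f"k={k}: CE={current:.3f}  gain from expert {k}: {previous - current:.3f}")
    previous = current
# The remaining headroom above the floor falls off like 1/k, so most of the
# benefit arrives from the first few experts.
```
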
Figure: Overview of merging vs. multitask training.

Validated across 10,506 merged models, from 0.5B to 72B parameters, 9 domains, and 4 merging methods (Average, TA, TIES, DARE), with $R^2$ > 0.98.
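
Here $R^2$ is the usual coefficient of determination between measured cross-entropy and the law's predictions; a minimal sketch with made-up numbers (not data from the paper):

```python
import numpy as np

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((observed - predicted) ** 2)
    ss_tot = np.sum((observed - observed.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Toy example: measured CE at k = 1..5 vs. the fitted law's predictions (made-up values).
measured = [2.50, 2.34, 2.27, 2.23, 2.21]
predicted = [2.49, 2.35, 2.27, 2.24, 2.20]
print(f"R^2 = {r_squared(measured, predicted):.4f}")  # ~0.99 for this toy fit
```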


Key Insights & Takeaways

  1. Larger Models Are Easier to Merge
  • The loss floor $L_\infty(N)$ drops predictably with model size
  • Tail amplitude $A(N)$ shrinks — larger models need fewer experts to reach saturation
Figure: Larger models are easier to merge.
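
A hedged sketch of how this trend can be checked on one's own merged checkpoints: fit the law separately per base-model size and compare the fitted floor $L_\infty$ and tail amplitude $A$. The `curves` data and the target loss below are synthetic placeholders, not measurements from the paper, and `scipy` is assumed available:

```python
import numpy as np
from scipy.optimize import curve_fit

def merging_loss(k, L_inf, A, b):
    """Scaling law L(k) = L_inf + A / (k + b) at a fixed base-model size."""
    return L_inf + A / (k + b)

# Synthetic (k, cross-entropy) curves per model size; replace with real measurements.
curves = {
    "3B": ([1, 2, 3, 4, 6, 8], [2.80, 2.64, 2.57, 2.53, 2.49, 2.47]),
    "8B": ([1, 2, 3, 4, 6, 8], [2.42, 2.31, 2.26, 2.24, 2.21, 2.20]),
}

target_ce = 2.50  # illustrative planning target

for size, (ks, ces) in curves.items():
    (L_inf, A, b), _ = curve_fit(merging_loss, np.array(ks), np.array(ces),
                                 p0=[min(ces), 0.5, 0.5])
    report = f"{size}: floor L_inf={L_inf:.3f}, tail amplitude A={A:.3f}, offset b={b:.3f}"
    if target_ce > L_inf:
        # Invert the law to estimate how many experts reach the target loss.
        k_needed = A / (target_ce - L_inf) - b
        report += f", experts for CE<={target_ce}: ~{max(1, int(np.ceil(k_needed)))}"
    else:
        report += f", CE<={target_ce} is below the floor at this size"
    print(report)
```

In this toy setup the larger backbone fits to a lower floor and a smaller tail amplitude, which is exactly the pattern described above; with real measurements the same two fitted quantities indicate how much merging headroom a given base model has.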

  2. Method Differences Shrink at Scale
  • Early advantages of TA/TIES narrow as $k$ and $N$ grow
  • All methods converge to similar performance with enough experts
Figure: Method sensitivity is low at scale.

  3. The Law Transfers Across Architectures
  • Validated on LLaMA-3.2 3B and LLaMA-3 8B
  • Same functional form, same diminishing-returns pattern
Figure: Cross-backbone validation on LLaMA.


Practical Recommendations for Model Merging

Figure: Predicting the k-curve from three points.

  1. Predict Full Performance from Few Points
  • Fit the law using just $k = \{1, 2, 4\}$ to forecast the entire curve (see the sketch after this list)
  • Enables early stopping and budget planning
  2. Choose the Right Stopping Point
  • Most domains saturate around k = 5–6 experts
  • Use the law to find when marginal gains drop below your threshold
  3. Trade Off Model Size vs. Expert Count
  • Under a fixed budget, decide whether to scale $N$ or increase $k$
  • Larger $N$ gives better floors; more $k$ improves early gains
  4. Don't Over-optimize Method Choice at Scale
  • Method differences compress with larger $k$ and $N$
  • Focus on the diversity and quality of experts instead
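
A hedged sketch of recommendations 1 and 2 combined: fit the law from three cheap merges at k = 1, 2, 4, forecast the full k-curve, and stop once the predicted marginal gain per added expert drops below a chosen threshold. The cross-entropy values and the threshold are illustrative placeholders, and `scipy` is assumed available:

```python
import numpy as np
from scipy.optimize import curve_fit

def merging_loss(k, L_inf, A, b):
    """Scaling law L(k) = L_inf + A / (k + b) at a fixed base-model size."""
    return L_inf + A / (k + b)

# Step 1: fit the law from a few cheap merges, e.g. k = 1, 2, 4 (synthetic CE values).
k_measured = np.array([1.0, 2.0, 4.0])
ce_measured = np.array([2.50, 2.34, 2.23])
params, _ = curve_fit(merging_loss, k_measured, ce_measured, p0=[ce_measured.min(), 0.5, 0.5])
print("Fitted: L_inf={:.3f}, A={:.3f}, b={:.3f}".format(*params))

# Step 2: forecast the full k-curve and pick a stopping point where the marginal
# gain from one more expert falls below a threshold you care about.
threshold = 0.015
for k in range(1, 13):
    gain = merging_loss(k, *params) - merging_loss(k + 1, *params)
    print(f"k={k}: predicted CE={merging_loss(k, *params):.3f}, gain from expert {k + 1}: {gain:.4f}")
    if gain < threshold:
        print(f"Stop around k={k}: the next expert is predicted to improve CE by less than {threshold}")
        break
```

With these toy numbers the rule stops around k = 6, in line with the saturation point observed for most domains; the same fitted curve can also be read off when deciding whether to spend budget on a larger base model (lower floor) or on more experts (faster early gains).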

Conclusion: From Art to Science

Our work provides:

  • A unified scaling law for model merging
  • Large-scale validation across domains, sizes, and methods
  • A simple theory explaining the inverse-k tail
  • An operational recipe for predictive planning

Model merging is no longer a black art — it’s a predictable, budgetable alternative to multitask training that scales with clear, quantifiable returns.


Citation Information

If you find this work useful, a citation to the following paper is welcome:

@article{wang2025model,
  title={Model Merging Scaling Laws in Large Language Models},
  author={Wang, Yuanyi and Gu, Yanggan and Zhang, Yiming and Zhou, Qi and Yan, Zhaoyi and Xie, Congkai and Wang, Xinyao and Yuan, Jianbo and Yang, Hongxia},
  journal={arXiv preprint arXiv:2509.24244},
  year={2025}
}

Read the full paper (arXiv:2509.24244) & code. Contact: hongxia.yang@polyu.edu.hk