Abstract
We study empirical scaling laws for language model merging measured by cross-entropy. Despite wide practical use, merging has lacked a quantitative rule predicting returns as experts are added or model size scales. We identify a compact power law coupling model size and expert count: a size-dependent loss floor that decreases with capacity and an inverse‑k merging tail with clear diminishing returns. The law holds in-domain and cross-domain, tightly fits curves across architectures and methods (Average, TA, TIES, DARE), and explains two regularities: most gains arrive early and variability shrinks as more experts are included. A simple theory accounts for the ~1/k gain pattern and links floor and tail to base model properties and cross-domain diversity. This enables predictive planning: estimate experts needed for a target loss, decide when to stop, and trade off scaling the base model versus adding experts under a fixed budget—turning merging from heuristic practice into a computationally efficient, plannable alternative to multitask training. It suggests a scaling principle for distributed generative AI: predictable gains via composing specialists, offering a complementary path toward AGI-level systems.
Model Merging Scaling Law. Cross-entropy (CE) vs. number of merged experts $k$ at multiple model sizes $N$ for four merging methods.
In our new paper, “Model Merging Scaling Laws in Large Language Models,” we introduce the first unified scaling law for model merging. This law quantitatively predicts how cross-entropy loss decreases as you scale:
- The number of merged experts $k$
- The base model size $N$
Our findings reveal a consistent power-law relationship that holds across domains, architectures, and merging methods — turning merging from a heuristic practice into a predictable, budget-aware strategy.
The Merging Scaling Law: What It Is and Why It Matters
We propose a compact, interpretable scaling law that captures the joint effect of model size and expert count:
Scaling law formula visualization.
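Concretely, the shape implied by the floor-plus-inverse-$k$ description is (our sketch of the law's form; see the paper for the exact parameterization of $L_\infty(N)$ and $A(N)$):

$$L(N, k) \;\approx\; L_\infty(N) + \frac{A(N)}{k}$$

where $L_\infty(N)$ is the size-dependent loss floor and $A(N)$ is the merging-tail amplitude, both shrinking as the base model size $N$ grows.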
This means:
- Larger models have lower floors and shorter tails — they’re easier to merge and saturate faster
- Most gains come from the first few experts — merging is highly efficient early on
- The same law applies to both in-domain and cross-domain merging
Overview of merging vs. multitask training.
The law is validated across 10,506 merged models spanning 0.5B to 72B parameters, 9 domains, and 4 merging methods (Average, TA, TIES, DARE), with fits achieving $R^2 > 0.98$.
Key Insights & Takeaways
- Larger Models Are Easier to Merge
  - The loss floor $L_\infty(N)$ drops predictably with model size
  - Tail amplitude $A(N)$ shrinks — larger models need fewer experts to reach saturation
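Spelled out under the same sketch (with $L_0$, $b$, and $\alpha$ as illustrative placeholders rather than the paper's fitted constants), a power-law floor plus a shrinking amplitude reads:

$$L_\infty(N) \;\approx\; L_0 + b\,N^{-\alpha}, \qquad A(N)\ \text{decreasing in}\ N$$

which is why larger backbones both end lower and saturate after fewer experts.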
Larger models are easier to merge.
Method sensitivity is low at scale.
Cross-backbone validation on LLaMA.
Practical Recommendations for Model Merging
Predicting the k-curve from three points.
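As a minimal sketch of how such a three-point prediction could work, assuming the floor-plus-inverse-$k$ form above (the measurements, parameter names, and target value below are hypothetical, not the paper's data or code):

```python
import numpy as np
from scipy.optimize import curve_fit

def merge_law(k, L_inf, A):
    """Assumed floor-plus-inverse-k form: L(k) = L_inf + A / k."""
    return L_inf + A / k

# Hypothetical CE measurements after merging k experts (illustrative numbers only).
k_obs = np.array([1, 2, 4], dtype=float)
ce_obs = np.array([2.10, 1.85, 1.72])

# Fit the two parameters (floor and tail amplitude) from the three points.
(L_inf_hat, A_hat), _ = curve_fit(merge_law, k_obs, ce_obs, p0=[ce_obs.min(), 1.0])
print(f"estimated floor L_inf = {L_inf_hat:.3f}, tail amplitude A = {A_hat:.3f}")

# Predict the rest of the k-curve.
for k in range(1, 9):
    print(f"k={k}: predicted CE = {merge_law(k, L_inf_hat, A_hat):.3f}")

# Experts needed for a hypothetical target CE (only reachable if it sits above the floor).
target = 1.68
if target > L_inf_hat:
    k_needed = int(np.ceil(A_hat / (target - L_inf_hat)))
    print(f"about k = {k_needed} experts needed to reach CE <= {target}")
else:
    print("target is below the estimated floor; scale the base model instead")
```

Because this form has only two free parameters per curve, three measured points are enough to pin down the floor and tail and to read off the expert count needed for a target loss.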
- Choose the Right Stopping Point
  - Most domains saturate around k = 5–6 experts
  - Use the law to find when marginal gains drop below your threshold (a sketch follows after this list)
- Trade Off Model Size vs. Expert Count
  - Under a fixed budget: decide whether to scale $N$ or increase $k$
  - Larger $N$ gives better floors; more $k$ improves early gains
- Don’t Over-optimize Method Choice at Scale
  - Method differences compress with larger $k$ and $N$
  - Focus on diversity and quality of experts instead
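For the stopping rule, the assumed inverse-$k$ tail gives a closed-form marginal gain: expert $k{+}1$ improves CE by roughly $A/k - A/(k+1) = A/(k(k+1))$. A minimal sketch (the tolerance value is made up for illustration):

```python
def stopping_point(A: float, tol: float, k_max: int = 50) -> int:
    """Smallest k at which adding one more expert improves CE by less than tol,
    under the assumed tail L(k) = L_inf + A / k (the floor cancels out)."""
    for k in range(1, k_max):
        marginal_gain = A / (k * (k + 1))  # = A/k - A/(k + 1)
        if marginal_gain < tol:
            return k
    return k_max

# Using the amplitude fitted in the sketch above and a made-up tolerance of 0.012 CE.
print(stopping_point(A=0.5, tol=0.012))  # -> 6, consistent with saturation around k = 5-6
```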
Conclusion: From Art to Science
Our work provides:
- A unified scaling law for model merging
- Large-scale validation across domains, sizes, and methods
- A simple theory explaining the inverse-k tail
- An operational recipe for predictive planning
Model merging is no longer a black art — it’s a predictable, budgetable alternative to multitask training that scales with clear, quantifiable returns.
Citation Information
If you find this work useful, a citation to the following paper is welcome:
@article{wang2025model,
  title={Model Merging Scaling Laws in Large Language Models},
  author={Wang, Yuanyi and Gu, Yanggan and Zhang, Yiming and Zhou, Qi and Yan, Zhaoyi and Xie, Congkai and Wang, Xinyao and Yuan, Jianbo and Yang, Hongxia},
  journal={arXiv preprint arXiv:2509.24244},
  year={2025}
}
Read the full paper & code:
Contact: hongxia.yang@polyu.edu.hk