InfiGFusion: Graph-on-Logits Distillation for Scalable Model Fusion

Yuanyi Wang , Zhaoyi Yan , Yiming Zhang , Qi Zhou , Yanggan Gu , Fei Wu , Hongxia Yang
InfiX.ai
To appear at NeurIPS 2025!

Abstract

Recent advances in large language models (LLMs) have intensified efforts to fuse heterogeneous open-source models into a unified system that inherits their complementary strengths. Existing logit-based fusion methods maintain inference efficiency but treat vocabulary dimensions independently, overlooking semantic dependencies encoded by cross-dimension interactions. These dependencies reflect how token types interact under a model's internal reasoning and are essential for aligning models with diverse generation behaviors. To explicitly model these dependencies, we propose InfiGFusion, the first structure-aware fusion framework with a novel Graph-on-Logits Distillation (GLD) loss. Specifically, we retain the top-k logits per output and aggregate their outer products across sequence positions to form a global co-activation graph, where nodes represent vocabulary channels and edges quantify their joint activations. To ensure scalability and efficiency, we design a sorting-based closed-form approximation that reduces the original O(n^4) cost of Gromov-Wasserstein distance to O(n log n), with provable approximation guarantees. Experiments across multiple fusion settings show that GLD consistently improves fusion quality and stability. InfiGFusion outperforms SOTA models and fusion baselines across 11 benchmarks spanning reasoning, coding, and mathematics. It shows particular strength in complex reasoning tasks, with +35.6 improvement on Multistep Arithmetic and +37.06 on Causal Judgement over SFT, demonstrating superior multi-step and relational inference.

🚀 Executive Snapshot

Notation. Parenthetical deltas (e.g., (+6.0↑) / (−0.8↓)) show absolute change versus the Phi-4 (14B) baseline. Bold marks the best score for each task (column).

InfiGFusion (14B) achieves the highest average across 18 tasks, improving by +6.73 points over the Phi-4 (14B) baseline and by +6.22 over the strongest SFT baseline (Phi-4 SFT).

| Model ↓ / Task → | Abstract Algebra | Marketing | International Law | Moral Scenarios | Virology | Formal Logic | Security Studies | logical_fallacies | Ruin Names | Tracking 7 Objects | Tracking 5 Objects | Logical Deduction 3 | Logical Deduction 5 | Logical Deduction 7 | Colored Objects | Multistep Arithmetic | Dyck Languages | Causal Judgement | Average (18 Tasks) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5-Instruct (14B) | 73.00 | 90.17 | 81.82 | 71.96 | 52.41 | 68.25 | 77.96 | 85.28 | 82.80 | 80.40 | 79.20 | 97.60 | 80.40 | 68.80 | 93.60 | 96.40 | 35.60 | 43.85 | 75.53 |
| Mistral (24B) | 65.00 | 92.31 | 81.82 | 68.60 | 49.40 | 66.67 | 78.78 | 84.66 | 76.80 | 96.40 | **99.20** | **98.80** | 82.00 | 62.80 | 90.00 | 93.20 | 37.20 | 68.98 | 77.37 |
| DeepSeek-R1-Distill-Qwen (14B) | 85.00 | 89.32 | 82.64 | 73.41 | **54.22** | **91.27** | 78.37 | 85.28 | 39.60 | 80.80 | 82.80 | 83.20 | 86.00 | 83.20 | 87.60 | 81.60 | 9.60 | 45.35 | 73.29 |
| Phi-4 (14B) | 82.00 | 92.74 | **91.74** | **75.75** | 53.01 | 77.78 | 77.14 | 86.50 | **88.80** | 94.40 | 96.80 | 98.40 | 85.60 | 88.40 | 96.40 | 64.00 | 11.20 | 32.99 | 77.43 |
| Phi-4 (SFT, 14B) | 86.00 (+4.0↑) | 92.74 (+0.0) | 90.91 (−0.83↓) | 74.75 (−1.0↓) | 53.61 (+0.60↑) | 78.57 (+0.79↑) | 79.59 (+2.45↑) | **87.73** (+1.23↑) | 88.00 (−0.8↓) | 90.00 (−4.4↓) | 95.60 (−1.2↓) | 96.40 (−2.0↓) | 92.40 (+6.8↑) | 88.80 (+0.4↑) | **96.80** (+0.4↑) | 62.00 (−2.0↓) | 13.20 (+2.0↑) | 35.83 (+2.84↑) | 77.94 (+0.51↑) |
| InfiGFusion (14B) | **88.00** (+6.0↑) | **95.30** (+2.56↑) | 90.91 (−0.83↓) | 73.97 (−1.78↓) | **54.22** (+1.21↑) | 74.60 (−3.18↓) | **81.22** (+4.08↑) | **87.73** (+1.23↑) | 88.40 (−0.4↓) | **96.80** (+2.40↑) | 96.80 (+0.0) | 97.60 (−0.8↓) | **94.00** (+8.4↑) | **89.20** (+0.8↑) | 96.40 (+0.0) | **99.60** (+35.6↑) | **40.00** (+28.8↑) | **70.05** (+37.06↑) | **84.16** (+6.73↑) |

Interpretation.

  • Against a strong SFT baseline (Phi-4 SFT, 14B), InfiGFusion (14B) boosts average accuracy from 77.94 → 84.16 (+6.22); the gain over the Phi-4 (14B) base model is +6.73.
  • Gains are especially pronounced on tasks emphasizing multi-step reasoning and structure-sensitive signals (e.g., Multistep Arithmetic, Dyck Languages, Causal Judgement).
  • On a few tasks (e.g., Formal Logic), specialized systems (e.g., DeepSeek-R1-Distill-Qwen) may still lead, illustrating that teacher diversity/selection remains a complementary lever—InfiGFusion cleanly plugs into such strategies.

📌 At-a-Glance

InfiGFusion is a unified framework for multi-source model fusion built on two complementary losses:

  • Universal Logit Distillation (ULD): token-level distribution alignment between teacher(s) and student.
  • Graph-on-Logits Distillation (GLD): structural alignment by constructing graphs on logits and aligning them with approximated Gromov–Wasserstein (GW). We introduce a sorting-based closed-form approximation that reduces complexity from O(n⁴) to O(n log n), together with stability and error analyses.

Result: fuse knowledge from multiple experts without teacher weights and without changing the student architecture, delivering consistent improvements on cross-domain, multi-step reasoning, and other complex reasoning tasks at limited training cost.


🧩 Motivation: Distribution Alone Isn’t Enough

Token-level objectives (e.g., KL/CE) align marginals but ignore the geometry between classes or semantic dimensions. In multi-expert fusion, complementarity and conflicts abound:

  • Which classes (or semantic axes) are close or distant?
  • How should we preserve relative structure across teachers?

Key idea: logits are not just probabilities—they encode relational structure. Aligning the internal geometry (who is similar to whom, how strongly) is crucial for robust fusion.


🛠️ Method

Figure: InfiGFusion framework overview.

ULD (Universal Logit Distillation) — Token-Level Alignment

  • A unified, multi-teacher-friendly objective that aligns p(y|x) distributions (a minimal sketch follows this list).
  • Provides a strong, stable training signal.
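
A minimal sketch of a token-level alignment loss in this spirit is shown below. It assumes ULD compares sorted output distributions (a closed-form, 1-D optimal-transport style distance that tolerates mismatched vocabularies); the exact formulation used in InfiGFusion may differ in details such as padding and normalization.

```python
import torch
import torch.nn.functional as F

def uld_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
    """Token-level alignment in the spirit of ULD.

    Both tensors are (batch, seq_len, vocab); vocabulary sizes may differ
    across tokenizers, so the shorter sorted distribution is zero-padded.
    """
    p_s = F.softmax(student_logits, dim=-1)
    p_t = F.softmax(teacher_logits, dim=-1)

    # Sorting both distributions yields a closed-form 1-D optimal-transport
    # style distance that does not require index-aligned vocabularies.
    p_s, _ = torch.sort(p_s, dim=-1, descending=True)
    p_t, _ = torch.sort(p_t, dim=-1, descending=True)

    v = max(p_s.size(-1), p_t.size(-1))
    p_s = F.pad(p_s, (0, v - p_s.size(-1)))  # pad the smaller vocab with zeros
    p_t = F.pad(p_t, (0, v - p_t.size(-1)))

    return (p_s - p_t).abs().sum(dim=-1).mean()
```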

GLD (Graph-on-Logits Distillation) — Structural Alignment on Logits

  • Build sparse graphs over logits (e.g., top-k neighbors or thresholded edges).
  • Use GW to align pairwise distance structures between teacher and student graphs.
  • Sorting-based closed-form approximation: replaces the intractable O(n⁴) GW computation with an O(n log n) approximation while remaining effective as a training signal; we also provide error bounds and stability guarantees (a sketch follows this list).
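
The sketch below illustrates the two ingredients on toy tensors: a top-k co-activation graph built by aggregating outer products of logits across positions (as described in the abstract), and a simple sorting-based surrogate that compares sorted edge weights instead of solving a transport plan. The surrogate, the rank-based node set, the normalization, and the default k are assumptions here, not the paper's exact closed form.

```python
import torch
import torch.nn.functional as F

def coactivation_graph(logits: torch.Tensor, k: int = 64) -> torch.Tensor:
    """Aggregate a top-k co-activation graph over sequence positions.

    logits: (batch, seq_len, vocab). At each position we keep the k largest
    logit values, take the outer product of that k-vector with itself, and
    average over positions, yielding a (batch, k, k) matrix whose entries
    measure how strongly the dominant channels fire together.
    """
    topk = torch.topk(logits, k, dim=-1).values        # (B, T, k)
    graph = torch.einsum("bti,btj->bij", topk, topk)    # sum of outer products
    return graph / logits.size(1)

def sorted_gw_surrogate(g_s: torch.Tensor, g_t: torch.Tensor) -> torch.Tensor:
    """Sorting-based surrogate for the Gromov-Wasserstein discrepancy.

    Edge weights of both graphs are sorted and compared elementwise, so the
    cost is dominated by an O(m log m) sort over the m = k*k edges rather
    than an O(n^4) transport problem.
    """
    s = torch.sort(g_s.flatten(1), dim=-1).values
    t = torch.sort(g_t.flatten(1), dim=-1).values
    return F.mse_loss(s, t)

def gld_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor, k: int = 64) -> torch.Tensor:
    return sorted_gw_surrogate(
        coactivation_graph(student_logits, k),
        coactivation_graph(teacher_logits, k),
    )
```

Gradients flow through `torch.topk` and `torch.sort` (both route gradients via the selected indices), so the surrogate can be used directly as a differentiable training term.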

Intuition: ULD makes the student look like the source(s); GLD makes the student’s relationships look like the source(s). The two are complementary.


🧪 Why Structure Helps (GLD vs. KL)

Figure: comparison between GLD and KL divergence on logits.

  • KL/CE/ULD only: may over-concentrate on a few dominant classes and miss inter-class geometry.
  • Add GLD: preserves relative structure, amplifying cross-teacher complementarity and improving long-tail robustness and reasoning-heavy tasks.

⚖️ Case Study: Cross-Domain Complementarity & Conflicts

Figure: cross-domain relations heatmap showing complementarity and conflicts.

  • Domains such as math, physics, multi-step reasoning show structural proximity on sub-tasks.
  • GLD helps amplify complementarities and mitigate conflicts, yielding more consistent fused models.

📊 Simulation Study for the GW Approximation

We conducted a controlled simulation: sample n ∈ {50, 100, 200, 500, 1000, 1500, 2000, 2500, 3000} points from N(0, I) and N(1, I), build the two n×n Euclidean distance matrices, and compare:

  • Exact GW
  • Sinkhorn GW (widely-used, high-quality approximation)
  • Approx GW (ours) (sorting-based closed-form estimator)

Key takeaways:

  • Estimator quality: Sinkhorn matches Exact (relative error ≈ 0). The relative error of Approx GW shrinks as n grows, from ~0.59 at n=50 to ~0.33 at n=3000.
  • Runtime: Exact and Sinkhorn become impractical as n grows (seconds at n≈1000, tens of seconds by n≈1500-3000); Approx stays sub-second even at n=3000.
  • Conclusion: Our Approx GW is designed as a training objective that balances scalability and fidelity, which is precisely what KD/fusion scenarios require at scale (a reproduction sketch follows the table below).
| n | Exact GW | Sinkhorn GW | Approx GW | Sinkhorn RE | Approx RE | Time Exact (s) | Time Sinkhorn (s) | Time Approx (s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 50 | 0.4815±0.0452 | 0.4815±0.0452 | 0.1978±0.0207 | 0.0000 | 0.5888 | 0.0171 | 0.0027 | 0.0004 |
| 100 | 0.3247±0.0297 | 0.3247±0.0297 | 0.1261±0.0217 | 0.0000 | 0.6048 | 0.0072 | 0.0070 | 0.0001 |
| 200 | 0.2033±0.0088 | 0.2033±0.0088 | 0.0973±0.0125 | 0.0000 | 0.5230 | 0.1390 | 0.0983 | 0.0359 |
| 500 | 0.1131±0.0050 | 0.1131±0.0050 | 0.0587±0.0143 | 0.0000 | 0.4850 | 0.4405 | 0.4815 | 0.0607 |
| 1000 | 0.0764±0.0011 | 0.0764±0.0011 | 0.0387±0.0133 | 0.0000 | 0.4921 | 2.3631 | 2.2456 | 0.1238 |
| 1500 | 0.0621±0.0028 | 0.0621±0.0028 | 0.0308±0.0016 | 0.0000 | 0.5031 | 12.7151 | 9.1998 | 0.0982 |
| 2000 | 0.0496±0.0028 | 0.0496±0.0028 | 0.0294±0.0058 | 0.0000 | 0.3995 | 20.1236 | 19.8941 | 0.0411 |
| 2500 | 0.0455±0.0008 | 0.0455±0.0008 | 0.0305±0.0075 | 0.0000 | 0.3285 | 20.9198 | 30.1367 | 0.0026 |
| 3000 | 0.0404±0.0002 | 0.0404±0.0002 | 0.0324±0.0141 | 0.0000 | 0.3941 | 52.8898 | 53.5049 | 0.2408 |
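
A minimal sketch of this simulation is given below. It assumes the POT library (`pip install pot`) for the exact and entropic (Sinkhorn) GW solvers; the sample dimension, the Sinkhorn epsilon, and the sorted-distance stand-in used for "Approx GW" are assumptions rather than the paper's exact estimator, and POT signatures may vary slightly across versions.

```python
import time
import numpy as np
import ot  # POT: pip install pot

def run_trial(n: int, dim: int = 2, seed: int = 0):
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, 1.0, size=(n, dim))   # samples from N(0, I)
    y = rng.normal(1.0, 1.0, size=(n, dim))   # samples from N(1, I)
    c_x = ot.dist(x, x, metric="euclidean")    # n x n Euclidean distance matrices
    c_y = ot.dist(y, y, metric="euclidean")
    p, q = ot.unif(n), ot.unif(n)              # uniform node weights

    t0 = time.time()
    gw_exact = ot.gromov.gromov_wasserstein2(c_x, c_y, p, q, loss_fun="square_loss")
    t_exact = time.time() - t0

    t0 = time.time()
    gw_sink = ot.gromov.entropic_gromov_wasserstein2(
        c_x, c_y, p, q, loss_fun="square_loss", epsilon=5e-3
    )
    t_sink = time.time() - t0

    # Stand-in for the sorting-based estimator: compare the sorted distance
    # profiles of the two spaces (NOT the paper's closed-form approximation).
    t0 = time.time()
    approx = np.mean((np.sort(c_x, axis=None) - np.sort(c_y, axis=None)) ** 2)
    t_approx = time.time() - t0

    return gw_exact, gw_sink, approx, t_exact, t_sink, t_approx

for n in (50, 100, 200):
    print(n, run_trial(n))
```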

🎯 How to Use InfiGFusion (Quick Start)

  1. Collect teacher logits (no access to teacher weights required).
  2. Compute student logits by a forward pass.
  3. Apply ULD (token-level distribution alignment).
  4. Build graphs on logits and compute GLD (use our O(n log n) approximation and recommended top-k).
  5. Combine the losses and train (see the sketch after this list).
  6. For multiple teachers, consider temperature/weight tuning; for strong conflicts/redundancy, add source selection/weighting.
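
A compact sketch of steps 2 through 5 is shown below, reusing the `uld_loss` and `gld_loss` sketches from the Method section. The loss weights, the top-k value, and the HuggingFace-style student interface (an output object exposing `.loss` and `.logits`) are assumptions, not the paper's published recipe.

```python
import torch

# Hypothetical loss weights; the actual coefficients/schedule may differ.
ALPHA_ULD = 1.0
BETA_GLD = 0.1

def fusion_step(student, batch, teacher_logits, optimizer, k: int = 64) -> float:
    """One fusion training step: task loss + ULD + GLD against cached teacher logits."""
    out = student(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
        labels=batch["labels"],  # standard causal-LM cross-entropy
    )
    loss = out.loss
    # uld_loss / gld_loss are the sketches from the Method section above.
    loss = loss + ALPHA_ULD * uld_loss(out.logits, teacher_logits)      # token-level alignment
    loss = loss + BETA_GLD * gld_loss(out.logits, teacher_logits, k=k)  # structural alignment

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```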

🔗 BibTeX

@article{wang2025infigfusion,
  title={InfiGFusion: Graph-on-Logits Distillation via Efficient Gromov-Wasserstein for Model Fusion},
  author={Wang, Yuanyi and Yan, Zhaoyi and Zhang, Yiming and Zhou, Qi and Gu, Yanggan and Wu, Fei and Yang, Hongxia},
  journal={arXiv preprint arXiv:2505.13893},
  year={2025}
}

📫 Contact

For collaborations, implementation details, or deployment support, please feel free to reach out to yuanyi-2000.wang@connect.polyu.hk.