Research

Explore our publications, models, and datasets advancing AI research.

Dec 11, 2025

From Generalist to Medical Specialist: Building Domain-Specific Multimodal LLMs via CPT, SFT and RLVR

Domain Adaptation · Medical AI

A systematic overview of a three-stage pipeline for building medical multimodal LLMs from generalist backbones: Continual Pretraining (CPT), Supervised Fine-Tuning (SFT), and Reinforcement Learning with Verifiable Rewards (RLVR). InfiMed2-4B achieves 60.7% average accuracy across medical benchmarks, a 7.0-point improvement over the baseline.

Oct 3, 2025

InfiAgent: A self-evolving pyramid agent framework for infinite scenarios

With the rapid development of artificial intelligence, Large Language Model (LLM) agents have demonstrated remarkable capabilities in organizing and executing complex tasks. However, their development heavily relies on carefully designed workflows, repeatedly debugged prompts, and deep domain expertise. This highly manual approach significantly hinders the large-scale adoption and cost-effectiveness of agent-based technology across industries.

Oct 3, 2025

Model Merging Scaling Laws: A New Way to Predict and Plan LLM Composition

We study empirical scaling laws for language model merging measured by cross-entropy. Despite wide practical use, merging has lacked a quantitative rule predicting returns as experts are added or model size scales. We identify a compact power law coupling model size and expert count: a size-dependent loss floor that decreases with capacity and an inverse-k merging tail with clear diminishing returns. The law holds in-domain and cross-domain, tightly fits curves across architectures and methods (Average, TA, TIES, DARE), and explains two regularities: most gains arrive early and variability shrinks as more experts are included. A simple theory accounts for the ~1/k gain pattern and links floor and tail to base model properties and cross-domain diversity. This enables predictive planning: estimate experts needed for a target loss, decide when to stop, and trade off scaling the base model versus adding experts under a fixed budget—turning merging from heuristic practice into a computationally efficient, plannable alternative to multitask training. It suggests a scaling principle for distributed generative AI: predictable gains via composing specialists, offering a complementary path toward AGI-level systems.
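
The exact coefficients and functional form are given in the paper; as a rough illustration of the planning workflow the law enables, here is a minimal sketch that fits an assumed floor-plus-inverse-k form, L(k) = L_floor + A/k, to hypothetical merge-loss measurements and solves for the expert count needed to reach a target loss (all numbers illustrative, not from the paper):

```python
# Minimal sketch of scaling-law-based merge planning. The functional form
# L(k) = L_floor + A / k is an assumption matching the abstract's
# description; the data points below are hypothetical.
import numpy as np

k = np.array([1, 2, 4, 8, 16], dtype=float)      # number of merged experts
loss = np.array([2.10, 1.85, 1.72, 1.66, 1.63])  # hypothetical measured CE

# Linear least squares in the basis [1, 1/k]: loss ~ L_floor + A * (1/k).
X = np.stack([np.ones_like(k), 1.0 / k], axis=1)
(l_floor, a), *_ = np.linalg.lstsq(X, loss, rcond=None)
print(f"fitted floor={l_floor:.3f}, tail coefficient A={a:.3f}")

# Predictive planning: how many experts are needed for a target loss?
target = 1.65
if target > l_floor:
    k_needed = a / (target - l_floor)
    print(f"need ~{np.ceil(k_needed):.0f} experts for CE <= {target}")
else:
    print("target below the fitted floor; scale the base model instead")
```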

Sep 26, 2025

InfiMed: Low-Resource Medical MLLMs with Advancing Understanding and Reasoning

We introduce the InfiMed series, InfiMed-SFT-3B and InfiMed-RL-3B: medical-focused Multimodal Large Language Models (MLLMs) developed by the InfiX-AI team. InfiMed-RL-3B achieves an average accuracy of 59.2% across seven authoritative medical benchmarks (including MMMU Health & Medicine, OmniMedVQA, and PMC-VQA), significantly outperforming all comparable-scale models such as MedGemma-4B-IT (54.8%) and even surpassing the larger InternVL3-8B (57.3%).

Sep 26, 2025

InfiR2: A Comprehensive FP8 Training Recipe for Reasoning-Enhanced Language Models

Low-Precision Training · Post-Training

The immense computational cost of training Large Language Models (LLMs) presents a major barrier to innovation. While FP8 training offers a promising solution with significant theoretical efficiency gains, its widespread adoption has been hindered by the lack of a comprehensive, open-source training recipe. To bridge this gap, we introduce an end-to-end FP8 training recipe that seamlessly integrates continual pre-training and supervised fine-tuning. Our methodology employs a fine-grained, hybrid-granularity quantization strategy to maintain numerical fidelity while maximizing computational efficiency. Through extensive experiments, including the continual pre-training of models on a 160B-token corpus, we demonstrate that our recipe is not only remarkably stable but also essentially lossless, achieving performance on par with the BF16 baseline across a suite of reasoning benchmarks. Crucially, this is achieved with substantial efficiency improvements, including up to a 22% reduction in training time, a 14% decrease in peak memory usage, and a 19% increase in throughput. Our results establish FP8 as a practical and robust alternative to BF16, and we will release the accompanying code to further democratize large-scale model training.
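
The recipe's exact quantization scheme will ship with the code release; as a minimal sketch of what fine-grained, block-wise FP8 quantization looks like in practice, the following assumes a recent PyTorch build with float8 dtypes (block size and layout are illustrative, not the paper's configuration):

```python
# Minimal sketch of fine-grained FP8 quantization with per-block scaling,
# in the spirit of the hybrid-granularity strategy described above. Not the
# paper's implementation; requires PyTorch >= 2.1 for torch.float8_e4m3fn.
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3

def quantize_fp8_blockwise(x: torch.Tensor, block: int = 128):
    """Quantize a 2-D tensor to FP8 with one scale per `block` columns."""
    rows, cols = x.shape
    x_blocks = x.view(rows, cols // block, block)
    # One scale per block: map the block's max magnitude onto the FP8 range.
    amax = x_blocks.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    scale = FP8_MAX / amax
    x_fp8 = (x_blocks * scale).to(torch.float8_e4m3fn)  # quantize
    return x_fp8, scale

def dequantize(x_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Undo the per-block scale and restore the original 2-D layout.
    return (x_fp8.to(torch.float32) / scale).flatten(1)

w = torch.randn(4, 256)
w_fp8, s = quantize_fp8_blockwise(w)
err = (w - dequantize(w_fp8, s)).abs().max()
print(f"max abs reconstruction error: {err:.5f}")
```

Finer blocks mean more scales to store but tighter fidelity per block; a hybrid-granularity scheme varies this trade-off across tensor types.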

Sep 15, 2025

InfiR-FP8

Reasoning-Enhanced (FP8) · Size not specified

A smaller reasoning-enhanced model trained from scratch using FP8 precision, achieving successful convergence.

Aug 13, 2025

InfiAlign-Qwen-7B-DPO

Preference-Optimized Reasoning Pro · 7B

DPO-enhanced version delivering notable gains in mathematical reasoning, built upon our high-quality SFT foundation.

Aug 12, 2025

InfiAlign-Qwen-7B-SFT

Supervised Fine-Tuned Reasoning Specialist · 7B

Our data-efficient SFT model achieves DeepSeek-R1 parity using only 1/8 of the training data, built on a training corpus curated along multiple quality dimensions.

Aug 12, 2025

InfiAlign: A Scalable and Sample-Efficient Framework for Enhancing LLM Reasoning

LLM Alignment · Reasoning

InfiAlign is a novel framework that combines SFT and DPO with an advanced data selection pipeline to efficiently enhance LLM reasoning capabilities.

Aug 12, 2025

InfiGUI-G1-3B

GUI Agent · 3B

A novel policy optimization framework for multimodal large language models that addresses semantic alignment challenges in GUI grounding.

Aug 12, 2025

InfiGUI-G1-7B

GUI Agent · 7B

A novel policy optimization framework for multimodal large language models that addresses semantic alignment challenges in GUI grounding.

Aug 5, 2025

InfiGUI-G1: Advancing GUI Grounding with Adaptive Exploration Policy Optimization

We introduce InfiGUI-G1, a multimodal GUI agent that employs Adaptive Exploration Policy Optimization (AEPO) to improve semantic alignment in GUI grounding, achieving up to 8.3% relative improvement over baseline methods.

Jul 31, 2025

android_control_test

Android Control

Test dataset for evaluating Android control models, complementing the training set with unseen scenarios to assess generalization and performance of automation agents on Android interfaces.

Jul 31, 2025

android_control_train

Android Control

Training dataset for Android control tasks, likely containing interaction data, command sequences, or GUI operation logs to support the development of Android-based automation agents.

Jul 31, 2025

InfiGUIAgent-Data

GUI Agent

Specialized dataset for training InfiGUIAgent models, containing multimodal data (e.g., GUI screenshots, user actions, task descriptions) to enable robust GUI task automation and reasoning.

Jul 31, 2025

s1K-1.1-850

General

An updated version (1.1) of a small dataset, likely containing 850 samples (inferred from '850'). May extend or refine the content of 's1K-QwQ' for more targeted model training.

Jul 31, 2025

s1K-QwQ

General

A dataset with 's1K' (likely 1,000 samples) in its name, potentially focused on question-answer pairs, dialogue, or reasoning tasks, supporting model training in conversational or logical reasoning abilities.

Jul 31, 2025

Infi-MMR-3B

Multimodal Reasoning · 3B

A multimodal model developed via the Infi-MMR three-phase curriculum framework, enhancing multimodal reasoning capabilities in small language models.

Jul 31, 2025

InfiFPO-14B

Preference Alignment Fusion · 14B

A lightweight fusion method applied during the preference alignment phase, injecting the behavior of fused source models into preference learning.

Jul 31, 2025

InfiFusion-14B

Model Fusion · 14B

A logit-level fusion pipeline based on Universal Logit Distillation, enhanced with Top-K filtering and logits standardization. Supports both pairwise and unified fusion strategies to balance performance and efficiency.
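
The exact ULD variant is specified in the InfiFusion paper; as a rough sketch of the ingredients named here, the following compares standardized top-K logits via an L1 distance over sorted probabilities (the value of K and the choice of distance are illustrative assumptions):

```python
# Hedged sketch of logit-level fusion with standardization and Top-K
# filtering, ULD-style: because source and target vocabularies may differ,
# the loss compares sorted top-K probability mass rather than aligning
# token ids. K and the L1 distance are assumptions, not the paper's exact
# recipe.
import torch

def uld_topk_loss(student_logits, teacher_logits, k=50):
    def standardize(z):  # zero-mean, unit-variance over the vocab dim
        return (z - z.mean(-1, keepdim=True)) / z.std(-1, keepdim=True)

    s = standardize(student_logits)
    t = standardize(teacher_logits)
    # topk returns values sorted descending, so these are sorted already.
    s_top = s.topk(k, dim=-1).values.softmax(-1)
    t_top = t.topk(k, dim=-1).values.softmax(-1)
    return (s_top - t_top).abs().sum(-1).mean()  # L1 between sorted dists

student = torch.randn(2, 8, 32000)   # (batch, seq, student vocab)
teacher = torch.randn(2, 8, 100000)  # (batch, seq, teacher vocab)
print(uld_topk_loss(student, teacher))
```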

Jul 31, 2025

InfiGFusion-14B

Structure-Aware Model Fusion · 14B

A structure-aware extension that builds co-activation graphs from logits and aligns them via an efficient Gromov-Wasserstein loss approximation.

Jul 31, 2025

InfiGUI-R1-3B

GUI Agent · 3B

A GUI agent developed via the Actor2Reasoner framework, evolving a reactive model into a deliberative reasoner through spatial reasoning distillation and reinforcement learning.

Jul 31, 2025

InfiGUIAgent-2B-Stage1

GUI Agent · 2B

A multimodal generalist GUI agent with native hierarchical and expectation-reflection reasoning through a unique two-stage supervised pipeline.

Jul 31, 2025

InfiR-1B-Base

Reasoning-Enhanced · 1B

Part of the InfiR reasoning-enhanced low-resource training pipeline, crafted to be an effective small language model with improved reasoning.

Jul 31, 2025

InfiR-1B-Instruct

Instructed Reasoning-Enhanced · 1B

An instructed version of the InfiR small language model, part of the reasoning-enhanced low-resource training pipeline.

May 21, 2025

InfiGFusion: Graph-on-Logits Distillation for Scalable Model Fusion

To appear at NeurIPS 2025!

Recent advances in large language models (LLMs) have intensified efforts to fuse heterogeneous open-source models into a unified system that inherits their complementary strengths. Existing logit-based fusion methods maintain inference efficiency but treat vocabulary dimensions independently, overlooking semantic dependencies encoded by cross-dimension interactions. These dependencies reflect how token types interact under a model's internal reasoning and are essential for aligning models with diverse generation behaviors. To explicitly model these dependencies, we propose InfiGFusion, the first structure-aware fusion framework with a novel Graph-on-Logits Distillation (GLD) loss. Specifically, we retain the top-k logits per output and aggregate their outer products across sequence positions to form a global co-activation graph, where nodes represent vocabulary channels and edges quantify their joint activations. To ensure scalability and efficiency, we design a sorting-based closed-form approximation that reduces the original O(n^4) cost of Gromov-Wasserstein distance to O(n log n), with provable approximation guarantees. Experiments across multiple fusion settings show that GLD consistently improves fusion quality and stability. InfiGFusion outperforms SOTA models and fusion baselines across 11 benchmarks spanning reasoning, coding, and mathematics. It shows particular strength in complex reasoning tasks, with +35.6 improvement on Multistep Arithmetic and +37.06 on Causal Judgement over SFT, demonstrating superior multi-step and relational inference.
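
To make the co-activation graph step concrete, here is a minimal sketch under simplifying assumptions: the graph is built over per-position top-k slots rather than true vocabulary ids, and a sorted-edge comparison stands in for the paper's closed-form Gromov-Wasserstein approximation.

```python
# Hedged sketch of the Graph-on-Logits construction: keep the top-k logits
# per position and sum their outer products across the sequence to get a
# global co-activation graph. This is an illustration, not the paper's
# implementation.
import torch

def co_activation_graph(logits: torch.Tensor, k: int = 16) -> torch.Tensor:
    """logits: (seq_len, vocab). For simplicity the graph is over the k
    per-position top-k slots rather than actual vocabulary channels."""
    vals, _ = logits.topk(k, dim=-1)  # (seq, k) retained activations
    probs = vals.softmax(-1)
    # Aggregate outer products across positions: edges = joint activations.
    return torch.einsum("si,sj->ij", probs, probs) / logits.shape[0]

teacher = torch.randn(128, 32000)
student = torch.randn(128, 32000)
g_t, g_s = co_activation_graph(teacher), co_activation_graph(student)
# Crude stand-in for the sorting-based closed-form GW approximation:
# compare sorted edge weights, O(n log n) in the number of edges.
gld = (g_t.flatten().sort().values - g_s.flatten().sort().values).abs().mean()
print(gld)
```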

May 20, 2025

InfiFPO: Implicit Model Fusion via Preference Optimization in Large Language Models

To appear at NeurIPS 2025 (Spotlight)!

Model Fusion · Preference Optimization

Model fusion combines multiple Large Language Models (LLMs) with different strengths into a more powerful, integrated model through lightweight training methods. Existing work on model fusion focuses primarily on supervised fine-tuning (SFT), leaving preference alignment (PA), a critical phase for enhancing LLM performance, largely unexplored. The few existing fusion methods for the PA phase, such as WRPO, simplify the process by utilizing only response outputs from source models while discarding their probability information. To address this limitation, we propose InfiFPO, a preference optimization method for implicit model fusion. InfiFPO replaces the reference model in Direct Preference Optimization (DPO) with a fused source model that synthesizes multi-source probabilities at the sequence level, circumventing the complex vocabulary-alignment challenges of previous works while preserving probability information. By introducing probability clipping and max-margin fusion strategies, InfiFPO enables the pivot model to align with human preferences while effectively distilling knowledge from source models. Comprehensive experiments on 11 widely-used benchmarks demonstrate that InfiFPO consistently outperforms existing model fusion and preference optimization methods. When using Phi-4 as the pivot model, InfiFPO improves its average performance from 79.95 to 83.33 on the 11 benchmarks, significantly strengthening its capabilities in mathematics, coding, and reasoning tasks.
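
As a rough illustration of the idea, the sketch below writes a DPO-style loss whose reference log-probability comes from a fusion of source-model sequence-level log-probs; the clipping threshold and the max-based fusion rule are assumptions standing in for the paper's exact formulation.

```python
# Hedged sketch of InfiFPO's core move: a DPO-style loss in which the
# reference model's log-probability is replaced by a fusion of several
# source models' sequence-level log-probs. Clipping range and max-based
# fusion below are illustrative assumptions.
import torch
import torch.nn.functional as F

def fused_reference_logprob(source_logps: torch.Tensor,
                            clip_min: float = -50.0) -> torch.Tensor:
    """source_logps: (num_sources, batch) sequence log-probs.
    Clipping keeps degenerate sources from dominating; taking the max
    keeps the strongest source per sequence (a max-margin-style rule)."""
    return source_logps.clamp(min=clip_min).max(dim=0).values

def infifpo_style_loss(pi_logp_w, pi_logp_l, src_logps_w, src_logps_l,
                       beta: float = 0.1) -> torch.Tensor:
    ref_w = fused_reference_logprob(src_logps_w)  # chosen responses
    ref_l = fused_reference_logprob(src_logps_l)  # rejected responses
    margin = (pi_logp_w - ref_w) - (pi_logp_l - ref_l)
    return -F.logsigmoid(beta * margin).mean()

# Toy shapes: 3 source models, batch of 4 preference pairs.
loss = infifpo_style_loss(torch.randn(4), torch.randn(4),
                          torch.randn(3, 4), torch.randn(3, 4))
print(loss)
```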

Apr 19, 2025

InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners

InfiGUI-R1 is a multimodal large language model-based GUI agent trained using reinforcement learning to enhance planning and error recovery skills for GUI tasks, achieving state-of-the-art performance on multiple benchmarks.

Feb 17, 2025

InfiFusion: A Unified Framework for Enhanced Cross-Model Reasoning via LLM Fusion

Unified Fusion · Knowledge Distillation

Jan 9, 2025

InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection

GUI Agent · Multimodal

InfiGUIAgent is a multimodal generalist GUI agent trained through a two-stage supervised fine-tuning approach, focusing on fundamental GUI understanding skills and advanced reasoning capabilities for native GUI interactions.