Publication · Sep 26, 2025

InfiR2: A Comprehensive FP8 Training Recipe for Reasoning-Enhanced Language Models

Wenjun Wang*¹, Shuo Cai*¹, Congkai Xie², Mingfa Feng², Yiming Zhang¹, Zhen Li¹², Kejing Yang², Ming Li¹, Jiannong Cao¹, Hongxia Yang¹²

¹The Hong Kong Polytechnic University, ²InfiX.ai

Abstract

The immense computational cost of training Large Language Models (LLMs) presents a major barrier to innovation. While FP8 training offers a promising solution with significant theoretical efficiency gains, its widespread adoption has been hindered by the lack of a comprehensive, open-source training recipe. To bridge this gap, we introduce an end-to-end FP8 training recipe that seamlessly integrates continual pre-training and supervised fine-tuning. Our methodology employs a fine-grained, hybrid-granularity quantization strategy to maintain numerical fidelity while maximizing computational efficiency. Through extensive experiments, including continual pre-training of models on a 160B-token corpus, we demonstrate that our recipe is not only remarkably stable but also essentially lossless, achieving performance on par with the BF16 baseline across a suite of reasoning benchmarks. Crucially, this is achieved with substantial efficiency improvements, including up to a 22% reduction in training time, a 14% decrease in peak memory usage, and a 19% increase in throughput. Our results establish FP8 as a practical and robust alternative to BF16, and we will release the accompanying code to further democratize large-scale model training.

Low Precision Training · Post-Training · Large Language Models

Main Contributions

  • End-to-End FP8 Training Recipe: InfiR2 is the first open-source, end-to-end FP8 training recipe that unifies continual pre-training (CPT) and supervised fine-tuning (SFT) in one workflow, offering a practical alternative to traditional BF16 training. It addresses the longstanding difficulties that have kept FP8 out of full model training pipelines.
  • Hybrid-Granularity Quantization: InfiR2 applies finer-grained FP8 quantization to the computationally intensive operators, as shown in Figure 1 (an illustrative code sketch follows this list):
    - For Linear/GEMM operations, it applies block-wise quantization to weights and token-wise quantization to activations, balancing accuracy against computational speed to fully exploit the FP8 throughput of Tensor Cores.
    - For critical components such as master weights, optimizer states, and gradient accumulation, it keeps a high-precision representation for direct optimization and updates, effectively providing a "safety belt" for the most sensitive parts of training.
  • Stable and Reproducible Performance: We demonstrate for the first time that FP8 training can match full-precision training in quality across critical reasoning benchmarks. InfiR2's FP8-trained models achieve accuracy on par with BF16 on tasks like AIME, GPQA, and LiveCodeBench (often within 1-2% or less). Notably, for smaller models (e.g., ~7B parameters), FP8 training even slightly outperforms BF16 on some benchmarks (acting as a form of regularization), confirming that FP8 is not only computationally lighter but also rock-solid in training convergence.
  • Community Release & Impact: Based on this complete FP8 training workflow (applying CPT and SFT to the Qwen2.5 model series), we have developed two high-performing models: InfiR2-1.5B-FP8 and InfiR2-7B-FP8. We are open-sourcing the model checkpoints and the complete training code repository to the community.
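To make the hybrid-granularity scheme concrete, below is a minimal PyTorch sketch of the idea described above: weights are quantized to FP8 with one scale per fixed-size block, activations with one scale per token, while the FP32 master weights are left untouched for the optimizer update. The block size of 128, the E4M3 format for both operands, and the helper names are illustrative assumptions for this sketch, not the released implementation; a real kernel would run the matmul directly in FP8 on Tensor Cores instead of dequantizing first.

```python
# Illustrative sketch of hybrid-granularity FP8 quantization (not the released code).
# Assumptions: E4M3 storage via torch.float8_e4m3fn, block size 128 for weights,
# per-token (per-row) scaling for activations, and shapes divisible by the block size.
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # ~448 for E4M3
BLOCK = 128  # hypothetical block size for weight quantization


def quantize_blockwise(w: torch.Tensor):
    """Quantize a 2-D weight to FP8 with one scale per (BLOCK x BLOCK) block."""
    out_f, in_f = w.shape
    blocks = w.reshape(out_f // BLOCK, BLOCK, in_f // BLOCK, BLOCK)
    amax = blocks.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-12)
    scale = FP8_MAX / amax                        # per-block scale factor (FP32)
    q = (blocks * scale).to(torch.float8_e4m3fn)  # FP8 payload
    return q, scale


def quantize_tokenwise(x: torch.Tensor):
    """Quantize activations to FP8 with one scale per token (row)."""
    amax = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    scale = FP8_MAX / amax                        # per-token scale factor (FP32)
    q = (x * scale).to(torch.float8_e4m3fn)
    return q, scale


def fp8_linear(x: torch.Tensor, master_w: torch.Tensor) -> torch.Tensor:
    """Forward pass with FP8 weights/activations; FP32 master weights stay untouched."""
    qw, w_scale = quantize_blockwise(master_w)    # master_w keeps FP32 ("safety belt")
    qx, x_scale = quantize_tokenwise(x)
    w_deq = (qw.to(torch.float32) / w_scale).reshape(master_w.shape)
    x_deq = qx.to(torch.float32) / x_scale
    return x_deq @ w_deq.t()                      # a fused FP8 GEMM would avoid this dequant


# Example (hypothetical shapes, divisible by BLOCK):
# x = torch.randn(16, 1024); w = torch.randn(2048, 1024)
# y = fp8_linear(x, w)   # ~ x @ w.t() up to FP8 rounding error
```

In this sketch a 4096 x 4096 weight carries a 32 x 32 grid of scales while each activation row carries its own scale; per the recipe above, optimizer states and gradient accumulation would likewise remain in high precision rather than FP8.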

Figure: Pretraining performance comparison between FP8 and BF16.

Experiments

  • Pretraining with FP8 is lossless. Both the 1.5B and 7B FP8-trained models show almost no performance loss on AIME 25, AIME 24, GPQA, and LiveCodeBench v5 compared to their BF16 baselines, as detailed in the table below.
| Base Model | Quantization Method | AIME 25 | AIME 24 | GPQA | LiveCodeBench v5 |
|---|---|---|---|---|---|
| Qwen2.5-Math-7B (Stage 1) | BF16 | 44.16 | 56.87 | 45.14 | 32.22 |
| | FP8 w. FP32 scale | 44.06 | 56.67 | 47.98 | 32.18 |
| | FP8 | 44.89 | 57.81 | 47.10 | 31.34 |
| Qwen2.5-Math-7B (Stage 2) | BF16 | 50.00 | 59.48 | 48.36 | 35.22 |
| | FP8 w. FP32 scale | 46.46 | 57.92 | 45.39 | 35.87 |
| | FP8 | 49.79 | 59.69 | 46.78 | 36.21 |
| Qwen2.5-Math-1.5B (Stage 1) | BF16 | 15.41 | 18.33 | 24.68 | 10.71 |
| | FP8 w. FP32 scale | 15.73 | 18.65 | 25.38 | 10.14 |
| | FP8 | 17.50 | 16.88 | 23.17 | 9.84 |
| Qwen2.5-Math-1.5B (Stage 2) | BF16 | 17.92 | 21.35 | 24.48 | 12.16 |
| | FP8 w. FP32 scale | 20.62 | 22.81 | 27.78 | 12.69 |
| | FP8 | 20.73 | 21.77 | 25.13 | 12.96 |
  • SFT with FP8 is lossless. After fine-tuning, Qwen2.5-Math-1.5B and 7B also show no degradation compared to the BF16 baseline, and even achieve a 1-2 point improvement on the AIME math competition dataset, as detailed in the results below.
| Precision | AIME 25 | AIME 24 | GPQA | LiveCodeBench v5 |
|---|---|---|---|---|
| BF16 | 17.91 | 17.50 | 31.94 | 16.41 |
| FP8 | 18.45 | 17.39 | 29.48 | 17.10 |
  • Memory optimization and speed-up. Compared to the widely used BF16, FP8 delivers the following gains, as detailed in the tables below (a short sketch after the tables shows how the ratio columns are computed):
    - Up to a 22% reduction in end-to-end training time.
    - Up to 14% savings in peak memory usage.
    - Up to a 19% increase in end-to-end throughput.

Model Size = 1.5B

Context Length = 32k, TP (tensor parallel) = 2, CP (context parallel) = 1, MBS (micro-batch size) = 1

| Precision | Forward | Backward | Total | Time Ratio | Peak Memory | Memory Ratio | Throughput | Throughput Ratio |
|---|---|---|---|---|---|---|---|---|
| BF16 | 841 ms | 2329 ms | 3170 ms | - | 57.8 GB | - | 345 TFlops | - |
| FP8 | 875 ms | 2075 ms | 2950 ms | 0.93× | 51.7 GB | 0.89× | 360 TFlops | 1.04× |

Context Length = 8k, TP = 1, CP = 1, MBS = 2

| Precision | Forward | Backward | Total | Time Ratio | Peak Memory | Memory Ratio | Throughput | Throughput Ratio |
|---|---|---|---|---|---|---|---|---|
| BF16 | 463 ms | 1567 ms | 2030 ms | - | 68.1 GB | - | 340 TFlops | - |
| FP8 | 529 ms | 1061 ms | 1590 ms | 0.78× | 58.3 GB | 0.86× | 376 TFlops | 1.10× |

Model Size = 7B

Context Length = 32k, TP = 4, CP = 1, MBS = 1

| Precision | Forward | Backward | Total | Time Ratio | Peak Memory | Memory Ratio | Throughput | Throughput Ratio |
|---|---|---|---|---|---|---|---|---|
| BF16 | 2790 ms | 6800 ms | 9590 ms | - | 78.1 GB | - | 409 TFlops | - |
| FP8 | 2660 ms | 5700 ms | 8360 ms | 0.87× | 67.4 GB | 0.86× | 461 TFlops | 1.14× |

Context Length = 8k, TP = 2, CP = 1, MBS = 1

| Precision | Forward | Backward | Total | Time Ratio | Peak Memory | Memory Ratio | Throughput | Throughput Ratio |
|---|---|---|---|---|---|---|---|---|
| BF16 | 1760 ms | 5320 ms | 7080 ms | - | 53.2 GB | - | 453 TFlops | - |
| FP8 | 2300 ms | 3230 ms | 5530 ms | 0.78× | 50.8 GB | 0.95× | 537 TFlops | 1.19× |
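Reading note for the tables above: each ratio column is the FP8 measurement divided by the corresponding BF16 measurement, so values below 1× mean less time or memory and values above 1× mean higher throughput. The short snippet below reproduces the ratios of the last table (7B model, 8k context) from its raw numbers under that reading; the dictionary names are just for illustration.

```python
# Reproduce the ratio columns from the raw measurements of the last table
# (Model Size = 7B, Context Length = 8k). Ratio = FP8 value / BF16 value.
bf16 = {"total_ms": 7080, "peak_mem_gb": 53.2, "throughput_tflops": 453}
fp8 = {"total_ms": 5530, "peak_mem_gb": 50.8, "throughput_tflops": 537}

time_ratio = fp8["total_ms"] / bf16["total_ms"]                     # ~0.78x -> ~22% less time
mem_ratio = fp8["peak_mem_gb"] / bf16["peak_mem_gb"]                # ~0.95x -> ~5% less memory
tput_ratio = fp8["throughput_tflops"] / bf16["throughput_tflops"]   # ~1.19x -> ~19% more throughput

print(f"time {time_ratio:.2f}x, memory {mem_ratio:.2f}x, throughput {tput_ratio:.2f}x")
```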

Citation Information

If you find this work useful, please consider citing the following paper:

@misc{wang2025infir2comprehensivefp8training,
      title={InfiR2: A Comprehensive FP8 Training Recipe for Reasoning-Enhanced Language Models},
      author={Wenjun Wang and Shuo Cai and Congkai Xie and Mingfa Feng and Yiming Zhang and Zhen Li and Kejing Yang and Ming Li and Jiannong Cao and Hongxia Yang},
      year={2025},
      eprint={2509.22536},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.22536},
}

Acknowledgements

We would like to express our gratitude to the following open-source projects: Slime, LLaMA-Factory, and Qwen2.5.