InfiR2: A Comprehensive FP8 Training Recipe for Reasoning-Enhanced Language Models

Wenjun Wang*¹, Shuo Cai*¹, Congkai Xie², Mingfa Feng², Yiming Zhang¹, Zhen Li¹², Kejing Yang², Ming Li¹, Jiannong Cao¹, Yuan Xie³, Hongxia Yang¹²
¹The Hong Kong Polytechnic University, ²InfiX.ai, ³The Hong Kong University of Science and Technology
Low-Precision Training · Post-Training · Large Language Models

Abstract

The immense computational cost of training Large Language Models (LLMs) presents a major barrier to innovation. While FP8 training offers a promising solution with significant theoretical efficiency gains, its widespread adoption has been hindered by the lack of a comprehensive, open-source training recipe. To bridge this gap, we introduce an end-to-end FP8 training recipe that seamlessly integrates continual pre-training and supervised fine-tuning. Our methodology employs a fine-grained, hybrid-granularity quantization strategy to maintain numerical fidelity while maximizing computational efficiency. Through extensive experiments, including the continual pre-training of models on a 160B-token corpus, we demonstrate that our recipe is not only remarkably stable but also essentially lossless, achieving performance on par with the BF16 baseline across a suite of reasoning benchmarks. Crucially, this is achieved with substantial efficiency improvements, including up to a 22% reduction in training time, a 14% decrease in peak memory usage, and a 19% increase in throughput. Our results establish FP8 as a practical and robust alternative to BF16, and we will release the accompanying code to further democratize large-scale model training.

Main Contributions

  • End-to-End FP8 Training Recipe: InfiR2 is the first open-source, end-to-end FP8 training recipe that unifies continual pre-training and supervised fine-tuning in one workflow, offering a practical alternative to traditional BF16 training. It addresses the numerical-stability and accuracy issues that have long kept FP8 out of full model training pipelines.
  • Hybrid-Granularity Quantization: InfiR2 applies finer-grained FP8 quantization to the computationally intensive operators, as shown in Figure 1 (see the sketch following this list):
    - For Linear/GEMM operations, it applies block-wise quantization to weights and token-wise quantization to activations, balancing accuracy against computational speed to fully exploit FP8 Tensor Cores.
    - For critical components such as Master Weights, Optimizer States, and Gradient Accumulation, it keeps a high-precision representation for direct optimization and updates, effectively providing a "safety belt" for the most crucial parts of training.
  • Stable and Reproducible Performance: We demonstrate for the first time that FP8 training can match full-precision training in quality across critical reasoning benchmarks. InfiR2's FP8-trained models achieve accuracy on par with BF16 on tasks such as AIME, GPQA, and LiveCodeBench (typically within 1-2%). Notably, for smaller models (e.g., ~7B parameters), FP8 training even slightly outperforms BF16 on some benchmarks, possibly acting as a form of regularization, confirming that FP8 is not only computationally lighter but also rock-solid in training convergence.
  • Community Release & Impact: Applying this complete FP8 training workflow (CPT followed by SFT) to the Qwen2.5 model series, we have developed two high-performing models: InfiR2-1.5B-FP8 and InfiR2-7B-FP8. We are open-sourcing the model checkpoints and the complete training code repository to the community.
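To make the hybrid-granularity scheme concrete, below is a minimal PyTorch sketch of the two quantization granularities used for Linear/GEMM inputs: block-wise scaling for weights and token-wise scaling for activations. The 128×128 block size, the FP8 E4M3 storage format, and the function names are illustrative assumptions for this sketch, not the released implementation.

```python
# Minimal sketch of hybrid-granularity FP8 quantization (illustrative, not the
# released code). Assumptions: 128x128 weight blocks and the E4M3 FP8 format.
import torch

FP8_MAX = 448.0  # largest finite magnitude representable in float8_e4m3fn


def quantize_weight_blockwise(w: torch.Tensor, block: int = 128):
    """Block-wise quantization: one scale per (block x block) weight tile."""
    out_f, in_f = w.shape
    q = torch.empty_like(w, dtype=torch.float8_e4m3fn)
    scales = torch.empty((out_f + block - 1) // block,
                         (in_f + block - 1) // block, device=w.device)
    for i in range(0, out_f, block):
        for j in range(0, in_f, block):
            tile = w[i:i + block, j:j + block]
            s = tile.abs().amax().clamp(min=1e-12) / FP8_MAX
            q[i:i + block, j:j + block] = (tile / s).to(torch.float8_e4m3fn)
            scales[i // block, j // block] = s
    return q, scales


def quantize_activation_tokenwise(x: torch.Tensor):
    """Token-wise quantization: one scale per token (row) of the activation."""
    s = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / FP8_MAX
    return (x / s).to(torch.float8_e4m3fn), s


# Master weights, optimizer states, and gradient accumulation are kept in high
# precision; only the GEMM operands above are cast to FP8.
```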

Figure: Pretraining performance comparison between FP8 and BF16.

Experiments

  • Pretraining with FP8 is lossless. Continual pre-training in FP8 shows almost no performance loss on AIME 25, AIME 24, GPQA, and LiveCodeBench compared to the BF16 baseline, for both the 1.5B and 7B models, as detailed in the table below.
| Base Model | Quantization Method | AIME 25 | AIME 24 | GPQA | LiveCodeBench v5 |
|---|---|---|---|---|---|
| Qwen2.5-Math-7B (Stage 1) | BF16 | 44.16 | 56.87 | 45.14 | 32.22 |
| | FP8 w. FP32 scale | 44.06 | 56.67 | 47.98 | 32.18 |
| | FP8 | 44.89 | 57.81 | 47.10 | 31.34 |
| Qwen2.5-Math-7B (Stage 2) | BF16 | 50.00 | 59.48 | 48.36 | 35.22 |
| | FP8 w. FP32 scale | 46.46 | 57.92 | 45.39 | 35.87 |
| | FP8 | 49.79 | 59.69 | 46.78 | 36.21 |
| Qwen2.5-Math-1.5B (Stage 1) | BF16 | 15.41 | 18.33 | 24.68 | 10.71 |
| | FP8 w. FP32 scale | 15.73 | 18.65 | 25.38 | 10.14 |
| | FP8 | 17.50 | 16.88 | 23.17 | 9.84 |
| Qwen2.5-Math-1.5B (Stage 2) | BF16 | 17.92 | 21.35 | 24.48 | 12.16 |
| | FP8 w. FP32 scale | 20.62 | 22.81 | 27.78 | 12.69 |
| | FP8 | 20.73 | 21.77 | 25.13 | 12.96 |
  • SFT with FP8 is lossless. After fine-tuning, Qwen2.5-Math-1.5B and 7B also show no degradation compared to the BF16 baseline, and even achieve a 1-2 point improvement on the AIME math competition dataset, as detailed in the results below.
| Precision | AIME 25 | AIME 24 | GPQA | LiveCodeBench v5 |
|---|---|---|---|---|
| BF16 | 17.91 | 17.50 | 31.94 | 16.41 |
| FP8 | 18.45 | 17.39 | 29.48 | 17.10 |
  • Memory Optimization and Speed-up: Compared to the widely used BF16, FP8 delivers:
    - Up to a 22% reduction in end-to-end training time.
    - Up to 14% savings in peak memory usage.
    - Up to a 19% increase in end-to-end throughput.
    The tables below break these gains down by model size and training configuration; a short worked example of how the ratios are computed follows the tables.

Model Size = 1.5B

Context Length = 32k, TP = 2, CP = 1, MBS = 1

| Precision | Forward | Backward | Total | Time Ratio | Peak Memory | Memory Ratio | Throughput | Throughput Ratio |
|---|---|---|---|---|---|---|---|---|
| BF16 | 841 ms | 2329 ms | 3170 ms | - | 57.8 GB | - | 345 TFLOPS | - |
| FP8 | 875 ms | 2075 ms | 2950 ms | 0.93× | 51.7 GB | 0.89× | 360 TFLOPS | 1.04× |

Context Length = 8k, TP = 1, CP = 1, MBS = 2

| Precision | Forward | Backward | Total | Time Ratio | Peak Memory | Memory Ratio | Throughput | Throughput Ratio |
|---|---|---|---|---|---|---|---|---|
| BF16 | 463 ms | 1567 ms | 2030 ms | - | 68.1 GB | - | 340 TFLOPS | - |
| FP8 | 529 ms | 1061 ms | 1590 ms | 0.78× | 58.3 GB | 0.86× | 376 TFLOPS | 1.10× |

Model Size = 7B

Context Length = 32k, TP = 4, CP = 1, MBS = 1

| Precision | Forward | Backward | Total | Time Ratio | Peak Memory | Memory Ratio | Throughput | Throughput Ratio |
|---|---|---|---|---|---|---|---|---|
| BF16 | 2790 ms | 6800 ms | 9590 ms | - | 78.1 GB | - | 409 TFLOPS | - |
| FP8 | 2660 ms | 5700 ms | 8360 ms | 0.87× | 67.4 GB | 0.86× | 461 TFLOPS | 1.14× |

Context Length = 32k, TP = 2, CP = 1, MBS = 1

| Precision | Forward | Backward | Total | Time Ratio | Peak Memory | Memory Ratio | Throughput | Throughput Ratio |
|---|---|---|---|---|---|---|---|---|
| BF16 | 1760 ms | 5320 ms | 7080 ms | - | 53.2 GB | - | 453 TFLOPS | - |
| FP8 | 2300 ms | 3230 ms | 5530 ms | 0.78× | 50.8 GB | 0.95× | 537 TFLOPS | 1.19× |
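As a reading aid, the snippet below relates the Ratio columns to the headline numbers above, using values copied from the 7B, 32k-context, TP = 2 table (rounding is approximate).

```python
# Illustrative arithmetic: values copied from the 7B, 32k, TP = 2 table above.
bf16_total_ms, fp8_total_ms = 7080, 5530
bf16_mem_gb, fp8_mem_gb = 53.2, 50.8
bf16_tflops, fp8_tflops = 453, 537

print(f"time ratio:       {fp8_total_ms / bf16_total_ms:.2f}x")  # ~0.78x -> ~22% less step time
print(f"memory ratio:     {fp8_mem_gb / bf16_mem_gb:.2f}x")      # ~0.95x here; best case 0.86x above
print(f"throughput ratio: {fp8_tflops / bf16_tflops:.2f}x")      # ~1.19x -> ~19% higher throughput
```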

Citation Information

If you find this work useful, please consider citing the following paper:

@misc{wang2025infir2comprehensivefp8training,
      title={InfiR2: A Comprehensive FP8 Training Recipe for Reasoning-Enhanced Language Models}, 
      author={Wenjun Wang and Shuo Cai and Congkai Xie and Mingfa Feng and Yiming Zhang and Zhen Li and Kejing Yang and Ming Li and Jiannong Cao and Yuan Xie and Hongxia Yang},
      year={2025},
      eprint={2509.22536},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.22536}, 
}

Acknowledgements

We would like to express our gratitude to the following open-source projects: Slime, LLaMA-Factory, and Qwen2.5.