Abstract
The immense computational cost of training Large Language Models (LLMs) presents a major barrier to innovation. While FP8 training offers a promising solution with significant theoretical efficiency gains, its widespread adoption has been hindered by the lack of a comprehensive, open-source training recipe. To bridge this gap, we introduce an end-to-end FP8 training recipe that seamlessly integrates continual pre-training and supervised fine-tuning. Our methodology employs a fine-grained, hybrid-granularity quantization strategy to maintain numerical fidelity while maximizing computational efficiency. Through extensive experiments, including the continual pre-training of models on a 160B-token corpus, we demonstrate that our recipe is not only remarkably stable but also essentially lossless, achieving performance on par with the BF16 baseline across a suite of reasoning benchmarks. Crucially, this is achieved with substantial efficiency improvements, including up to a 22% reduction in training time, a 14% decrease in peak memory usage, and a 19% increase in throughput. Our results establish FP8 as a practical and robust alternative to BF16, and we will release the accompanying code to further democratize large-scale model training.
Main Contributions
- End-to-End FP8 Training Recipe: InfiR2 is the first open-source, end-to-end FP8 training recipe that unifies continual pre-training and supervised fine-tuning in one workflow, offering a practical alternative to traditional BF16 training. It resolves the longstanding difficulties that prevented FP8 from being used in full model training pipelines.
- Hybrid-Granularity Quantization: InfiR2 applies finer-grained FP8 quantization to the computationally intensive operators, as shown in Figure 1 (a minimal code sketch follows this list):
  - For Linear/GEMM operations, it applies block-wise quantization to weights and token-wise quantization to activations, balancing accuracy with computational speed to fully leverage FP8 Tensor Cores.
  - For critical components such as master weights, optimizer states, and gradient accumulation, it employs a high-precision strategy for optimization and weight updates, effectively providing a “safety belt” for the most crucial parts of model training.
- Stable and Reproducible Performance: We demonstrate for the first time that FP8 training can match full-precision training in quality across critical reasoning benchmarks. InfiR2’s FP8-trained models achieve accuracy on par with BF16 on tasks like AIME, GPQA, and LiveCodeBench (typically within 1-2%). Notably, for smaller models (e.g., ~7B parameters), FP8 training even slightly outperforms BF16 on some benchmarks (acting as a form of regularization), confirming that FP8 is not only computationally lighter but also rock-solid in training convergence.
- Community Release & Impact: Based on this complete FP8 training workflow (applying CPT and SFT to the Qwen-2.5 model series), we have developed two high-performing models: InfiR2-1.5B-FP8 and InfiR2-7B-FP8. We are open-sourcing the model checkpoints and the complete training code repository to the community.
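As a concrete illustration of the hybrid-granularity scheme described above, the following is a minimal PyTorch sketch of the two quantization granularities: per-tile FP8 for weights and per-token FP8 for activations, each carrying an FP32 scale, while master weights, optimizer states, and gradient accumulation stay in high precision. The 128×128 tile size, the `torch.float8_e4m3fn` format, and the helper names are illustrative assumptions, not the exact InfiR2 kernels.

```python
import torch
import torch.nn.functional as F

FP8_MAX = 448.0  # largest finite magnitude representable in torch.float8_e4m3fn

def quantize_weight_blockwise(w: torch.Tensor, block: int = 128):
    """Block-wise FP8 quantization of a 2-D weight: one FP32 scale per (block x block) tile."""
    rows, cols = w.shape
    w = F.pad(w, (0, (-cols) % block, 0, (-rows) % block))      # pad so both dims divide evenly into tiles
    tiles = w.unfold(0, block, block).unfold(1, block, block)   # (n_row_tiles, n_col_tiles, block, block)
    amax = tiles.abs().amax(dim=(-1, -2)).clamp(min=1e-12)      # per-tile absolute maximum
    scale = FP8_MAX / amax                                      # FP32 scale, stored alongside the FP8 data
    w_fp8 = (tiles * scale[..., None, None]).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return w_fp8, scale

def quantize_activation_tokenwise(x: torch.Tensor):
    """Token-wise FP8 quantization of activations: one FP32 scale per token (row)."""
    amax = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    scale = FP8_MAX / amax
    x_fp8 = (x * scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return x_fp8, scale

# Master weights, optimizer states, and gradient accumulation are NOT quantized:
# only the GEMM operands are cast to FP8; everything the optimizer touches stays in FP32/BF16.
w_fp8, w_scale = quantize_weight_blockwise(torch.randn(4096, 4096))
x_fp8, x_scale = quantize_activation_tokenwise(torch.randn(16, 4096))
print(w_fp8.shape, w_scale.shape, x_fp8.dtype)  # [32, 32, 128, 128], [32, 32], torch.float8_e4m3fn
```

Keeping the scales in FP32 while only the GEMM operands are cast to FP8 is what allows the low-precision matmuls to run on Tensor Cores without destabilizing the optimizer state.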
Figure: Pretraining performance comparison between FP8 and BF16.
Experiments
- Pretraining with FP8 is essentially lossless. Both the 1.5B and 7B FP8-pretrained models show almost no performance loss on AIME25, AIME24, GPQA, and LiveCodeBench compared to the BF16 baseline, as detailed in the results below.
| Base Model | Quantization Method | AIME 25 | AIME 24 | GPQA | LiveCodeBench v5 |
|---|---|---|---|---|---|
| Qwen2.5-Math-7B (Stage 1) | BF16 | 44.16 | 56.87 | 45.14 | 32.22 |
| | FP8 w. FP32 scale | 44.06 | 56.67 | 47.98 | 32.18 |
| | FP8 | 44.89 | 57.81 | 47.10 | 31.34 |
| Qwen2.5-Math-7B (Stage 2) | BF16 | 50.00 | 59.48 | 48.36 | 35.22 |
| | FP8 w. FP32 scale | 46.46 | 57.92 | 45.39 | 35.87 |
| | FP8 | 49.79 | 59.69 | 46.78 | 36.21 |
| Qwen2.5-Math-1.5B (Stage 1) | BF16 | 15.41 | 18.33 | 24.68 | 10.71 |
| | FP8 w. FP32 scale | 15.73 | 18.65 | 25.38 | 10.14 |
| | FP8 | 17.50 | 16.88 | 23.17 | 9.84 |
| Qwen2.5-Math-1.5B (Stage 2) | BF16 | 17.92 | 21.35 | 24.48 | 12.16 |
| | FP8 w. FP32 scale | 20.62 | 22.81 | 27.78 | 12.69 |
| | FP8 | 20.73 | 21.77 | 25.13 | 12.96 |
- SFT with FP8 is lossless. After fine-tuning, Qwen2.5-Math-1.5B and 7B also show no degradation compared to the BF16 baseline, and even achieve improvements of up to 1-2 points on the AIME math competition benchmarks, as detailed in the results below.
| Precision | AIME 25 | AIME 24 | GPQA | LiveCodeBench v5 |
|---|---|---|---|---|
| BF16 | 17.91 | 17.50 | 31.94 | 16.41 |
| FP8 | 18.45 | 17.39 | 29.48 | 17.10 |
- Memory optimization and speedup. Compared to the widely used BF16, FP8 training delivers:
  - Up to a 22% reduction in end-to-end training time.
  - Up to 14% savings in peak memory usage.
  - Up to a 19% increase in end-to-end throughput.
  Per-configuration measurements are listed in the tables below, and a quick ratio check across the four configurations is sketched after the tables.
Model Size = 1.5B
Context Length = 32k, TP = 2, CP = 1, MBS = 1
| Precision | Forward | Backward | Total | Ratio | Peak Memory | Ratio | Throughput | Ratio |
|---|---|---|---|---|---|---|---|---|
| BF16 | 841 ms | 2329 ms | 3170 ms | - | 57.8 GB | - | 345 TFlops | - |
| FP8 | 875 ms | 2075 ms | 2950 ms | 0.93× | 51.7 GB | 0.89× | 360 TFlops | 1.04× |
Context Length = 8k, TP = 1, CP = 1, MBS = 2
| Precision | Forward | Backward | Total | Ratio | Peak Memory | Ratio | Throughput | Ratio |
|---|---|---|---|---|---|---|---|---|
| BF16 | 463 ms | 1567 ms | 2030 ms | - | 68.1 GB | - | 340 TFlops | - |
| FP8 | 529 ms | 1061 ms | 1590 ms | 0.78× | 58.3 GB | 0.86× | 376 TFlops | 1.10× |
Model Size = 7B
Context Length = 32k, TP = 4, CP = 1, MBS = 1
| Precision | Forward | Backward | Total | Ratio | Peak Memory | Ratio | Throughput | Ratio |
|---|---|---|---|---|---|---|---|---|
| BF16 | 2790 ms | 6800 ms | 9590 ms | - | 78.1 GB | - | 409 TFlops | - |
| FP8 | 2660 ms | 5700 ms | 8360 ms | 0.87× | 67.4 GB | 0.86× | 461 TFlops | 1.14× |
Context Length = 8k, TP = 2, CP = 1, MBS = 1
| Precision | Forward | Backward | Total | Ratio | Peak Memory | Ratio | Throughput | Ratio |
|---|---|---|---|---|---|---|---|---|
| BF16 | 1760 ms | 5320 ms | 7080 ms | - | 53.2 GB | - | 453 TFlops | - |
| FP8 | 2300 ms | 3230 ms | 5530 ms | 0.78× | 50.8 GB | 0.95× | 537 TFlops | 1.19× |
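For reference, the "up to" numbers quoted above can be recovered from the four benchmark tables. The short script below is only an illustrative sketch that recomputes the relative savings from the reported step times, peak memory, and throughput; it uses no project code.

```python
# Step time (ms), peak memory (GB), and throughput (TFLOPS) taken from the tables above.
configs = {
    "1.5B, 32k, TP=2": {"bf16": (3170, 57.8, 345), "fp8": (2950, 51.7, 360)},
    "1.5B,  8k, TP=1": {"bf16": (2030, 68.1, 340), "fp8": (1590, 58.3, 376)},
    "7B,   32k, TP=4": {"bf16": (9590, 78.1, 409), "fp8": (8360, 67.4, 461)},
    "7B,    8k, TP=2": {"bf16": (7080, 53.2, 453), "fp8": (5530, 50.8, 537)},
}

for name, c in configs.items():
    (t_bf, m_bf, f_bf), (t_f8, m_f8, f_f8) = c["bf16"], c["fp8"]
    print(f"{name}: step time -{1 - t_f8 / t_bf:.0%}, "
          f"peak memory -{1 - m_f8 / m_bf:.0%}, "
          f"throughput +{f_f8 / f_bf - 1:.0%}")

# Best cases across the four configurations: roughly 22% shorter step time (8k rows),
# 14% lower peak memory, and 19% higher throughput, matching the headline numbers.
```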
Citation Information
If you find this work useful, please consider citing the following paper:
@misc{wang2025infir2comprehensivefp8training,
title={InfiR2: A Comprehensive FP8 Training Recipe for Reasoning-Enhanced Language Models},
author={Wenjun Wang and Shuo Cai and Congkai Xie and Mingfa Feng and Yiming Zhang and Zhen Li and Kejing Yang and Ming Li and Jiannong Cao and Hongxia Yang},
year={2025},
eprint={2509.22536},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2509.22536},
}
Acknowledgements
We would like to express our gratitude to the following open-source projects: Slime, LLaMA-Factory, and Qwen2.5.