
Defeating the Training-Inference Mismatch via FP16

Penghui Qi, Zichen Liu, Xiangxin Zhou, Tianyu Pang, Chao Du, Wee Sun Lee, Min Lin

Paper | GitHub

Overview

Figure 1: Training reward comparison between BF16 and FP16. We evaluate across diverse settings: our sanity test with various algorithms (GRPO, GSPO, TIS, MIS, PG); different model families (R1D, Qwen, and OctoThinker); an alternative fine-tuning method (LoRA); and larger-scale models (Dense-14B, MoE). Results are validated on two independent frameworks (VeRL and Oat).

Figure 2: Evaluation comparisons between BF16 and FP16 across various frameworks, algorithms, datasets and training regimes.

Figure 3: Simply switching from BF16 to FP16 stabilizes and prolongs RL training. The basic importance-weighted policy gradient algorithm in FP16 outperforms all baselines in BF16.

Figure 4: Comparisons between various algorithms based on FP16.

Figure 5: FP16 significantly reduces the training-inference mismatch. The left two plots show the token-level probability distribution, and the right two plots show the distribution of the sequence-level log-probability ratio between the inference policy ($\mu$) and the training policy ($\pi$).
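
The mismatch shown in Figure 5 can be measured directly from the per-token log-probabilities that the inference engine and the trainer assign to the same sampled responses. The snippet below is a minimal illustrative sketch, not the repository's implementation; the tensor names, shapes, and masking convention are assumptions.

import torch

def mismatch_stats(logp_mu: torch.Tensor, logp_pi: torch.Tensor, mask: torch.Tensor):
    """Measure training-inference mismatch for a batch of sampled responses.

    logp_mu: per-token log-probs of the sampled tokens under the inference policy (mu), shape [B, T]
    logp_pi: per-token log-probs of the same tokens under the training policy (pi), shape [B, T]
    mask:    1.0 for response tokens, 0.0 for padding, shape [B, T]
    (Names, shapes, and masking are illustrative assumptions.)
    """
    valid = mask.bool()

    # Token-level probabilities each policy assigns to the sampled tokens
    # (the quantity whose distribution appears in the left panels of Figure 5).
    token_prob_mu = logp_mu[valid].exp()
    token_prob_pi = logp_pi[valid].exp()

    # Sequence-level log-probability ratio log pi(y|x) - log mu(y|x):
    # per-token log-prob gaps summed over response tokens
    # (the quantity whose distribution appears in the right panels of Figure 5).
    seq_log_ratio = ((logp_pi - logp_mu) * mask).sum(dim=-1)
    return token_prob_mu, token_prob_pi, seq_log_ratio

Exponentiating seq_log_ratio gives the sequence-level importance weight used by the importance-weighted policy-gradient variants referenced above (e.g., PG-Seq-IS).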

Reproduce the Sanity Test 🎯

# ALGO: [PPO, PPO-Token-TIS, PPO-Seq-MIS, PPO-Seq-TIS, PG-Seq-IS, PG-Seq-TIS, PG-Seq-MIS, Vanilla-GSPO]
# DTYPE: [bfloat16, float16]
ALGO=PG-Seq-IS DTYPE=float16 ./run_sanity_test.sh
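
To sweep the full grid of algorithms and dtypes, the same script can be driven from Python. The loop below is a convenience sketch that assumes, as in the command above, that run_sanity_test.sh reads ALGO and DTYPE from the environment; whether to run sequentially or in parallel depends on available GPUs.

import itertools
import os
import subprocess

# Algorithm and dtype choices listed in the comments above.
ALGOS = ["PPO", "PPO-Token-TIS", "PPO-Seq-MIS", "PPO-Seq-TIS",
         "PG-Seq-IS", "PG-Seq-TIS", "PG-Seq-MIS", "Vanilla-GSPO"]
DTYPES = ["bfloat16", "float16"]

for algo, dtype in itertools.product(ALGOS, DTYPES):
    env = {**os.environ, "ALGO": algo, "DTYPE": dtype}
    print(f"=== sanity test: ALGO={algo} DTYPE={dtype} ===")
    subprocess.run(["./run_sanity_test.sh"], env=env, check=True)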

Data

The sanity test dataset for DeepSeek-R1-Distill-Qwen-1.5B is included in the folder sanity_test. The data processing script will be released soon.

Citation

If you find our work useful for your research, please consider citing:

@article{qi2025precisionrl,
  title={Defeating the Training-Inference Mismatch via FP16},
  author={Qi, Penghui and Liu, Zichen and Zhou, Xiangxin and Pang, Tianyu and Du, Chao and Lee, Wee Sun and Lin, Min},
  journal={arXiv preprint arXiv:2510.26788},
  year={2025}
}
