
Defeating the Training-Inference Mismatch via FP16

Penghui Qi, Zichen Liu, Xiangxin Zhou, Tianyu Pang, Chao Du, Wee Sun Lee, Min Lin

Paper | GitHub

Overview

Figure 1: Training reward comparison between BF16 and FP16. We evaluate across diverse settings: our sanity test with various algorithms (GRPO, GSPO, TIS, MIS, PG); different model families (R1D, Qwen, and OctoThinker); an alternative fine-tuning method (LoRA); and larger-scale models (Dense-14B, MoE). Results are validated on two independent frameworks (VeRL and Oat).

Figure 2: Evaluation comparisons between BF16 and FP16 across various frameworks, algorithms, datasets and training regimes.

Figure 3: Simply switching from BF16 to FP16 stabilizes and prolongs RL training. The basic importance-weighted policy gradient algorithm in FP16 outperforms all baselines in BF16.

Figure 4: Comparisons between various algorithms based on FP16.

Figure 5: FP16 significantly reduces the training-inference mismatch. The left two plots show the token-level probability distribution, and the right two plots show the distribution of the sequence-level log-probability ratio between the inference policy ($\mu$) and the training policy ($\pi$).
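
The mismatch shown in Figure 5 can be measured directly from the per-token log-probabilities that the inference engine and the trainer assign to the same sampled responses. The snippet below is a minimal illustrative sketch, not the repository's implementation; the tensor names, shapes, and masking convention are assumptions.

import torch

def mismatch_stats(logp_mu: torch.Tensor, logp_pi: torch.Tensor, mask: torch.Tensor):
    """Measure training-inference mismatch for a batch of sampled responses.

    logp_mu: per-token log-probs of the sampled tokens under the inference policy (mu), shape [B, T]
    logp_pi: per-token log-probs of the same tokens under the training policy (pi), shape [B, T]
    mask:    1.0 for response tokens, 0.0 for padding, shape [B, T]
    (Names, shapes, and masking are illustrative assumptions.)
    """
    valid = mask.bool()

    # Token-level probabilities each policy assigns to the sampled tokens
    # (the quantity whose distribution appears in the left panels of Figure 5).
    token_prob_mu = logp_mu[valid].exp()
    token_prob_pi = logp_pi[valid].exp()

    # Sequence-level log-probability ratio log pi(y|x) - log mu(y|x):
    # per-token log-prob gaps summed over response tokens
    # (the quantity whose distribution appears in the right panels of Figure 5).
    seq_log_ratio = ((logp_pi - logp_mu) * mask).sum(dim=-1)
    return token_prob_mu, token_prob_pi, seq_log_ratio

Exponentiating seq_log_ratio gives the sequence-level importance weight used by the importance-weighted policy-gradient variants referenced above (e.g., PG-Seq-IS).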

Reproduce the Sanity Test 🎯

# ALGO: [PPO, PPO-Token-TIS, PPO-Seq-MIS, PPO-Seq-TIS, PG-Seq-IS, PG-Seq-TIS, PG-Seq-MIS, Vanilla-GSPO]
# DTYPE: [bfloat16, float16]
ALGO=PG-Seq-IS DTYPE=float16 ./run_sanity_test.sh
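
To sweep the full grid of algorithms and dtypes, the same script can be driven from Python. The loop below is a convenience sketch that assumes, as in the command above, that run_sanity_test.sh reads ALGO and DTYPE from the environment; whether to run sequentially or in parallel depends on available GPUs.

import itertools
import os
import subprocess

# Algorithm and dtype choices listed in the comments above.
ALGOS = ["PPO", "PPO-Token-TIS", "PPO-Seq-MIS", "PPO-Seq-TIS",
         "PG-Seq-IS", "PG-Seq-TIS", "PG-Seq-MIS", "Vanilla-GSPO"]
DTYPES = ["bfloat16", "float16"]

for algo, dtype in itertools.product(ALGOS, DTYPES):
    env = {**os.environ, "ALGO": algo, "DTYPE": dtype}
    print(f"=== sanity test: ALGO={algo} DTYPE={dtype} ===")
    subprocess.run(["./run_sanity_test.sh"], env=env, check=True)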

Data

The sanity test dataset for DeepSeek-R1-Distill-Qwen-1.5B is included in the folder sanity_test. The data processing script will be released soon.

Citation

If you find our work useful for your research, please consider citing:

@article{qi2025precisionrl,
  title={Defeating the Training-Inference Mismatch via FP16},
  author={Qi, Penghui and Liu, Zichen and Zhou, Xiangxin and Pang, Tianyu and Du, Chao and Lee, Wee Sun and Lin, Min},
  journal={arXiv preprint arXiv:2510.26788},
  year={2025}
}
