AlphaDPO: Adaptive Reward Margin for Direct Preference Optimization

Abstract

Aligning large language models (LLMs) with human preferences requires balancing policy optimization with computational stability. While recent offline methods like DPO and SimPO bypass reinforcement learning’s complexity, they face critical limitations: DPO relies on static reference models that degrade with policy updates, and SimPO assumes a uniform target reward margin that ignores instance-wise preference strength. We propose AlphaDPO, an adaptive preference optimization framework that dynamically reparameterizes the reference distribution to address these issues. Our key innovation lies in an implicit reference model, which interpolates between policy-driven specialization and uniform exploration while enabling instance-adaptive reward margins. Theoretically, we prove AlphaDPO implicitly controls sequential KL divergence between iterative policy updates, ensuring stability even with poorly calibrated reference models. Empirically, AlphaDPO achieves state-of-the-art performance on AlpacaEval 2 (58.7% LC win rate) and Arena-Hard (35.7% win rate) across Mistral2-7B, Llama3-8B, and Gemma2-9B, demonstrating robust alignment without multi-stage training. Our work establishes adaptive reference reparameterization as a principled mechanism for preference optimization.
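To make the adaptive-margin idea concrete, the sketch below shows one way an instance-wise target margin could be plugged into a SimPO-style pairwise loss. It is an illustrative approximation, not the exact AlphaDPO objective from the paper: the function name `alpha_dpo_style_loss`, the hyperparameter values `beta`, `gamma`, `alpha`, and the specific form of the adjustment term are assumptions made for this example.

```python
import torch
import torch.nn.functional as F

def alpha_dpo_style_loss(policy_chosen_logps, policy_rejected_logps,
                         ref_chosen_logps, ref_rejected_logps,
                         beta=2.5, gamma=0.5, alpha=0.1):
    """Illustrative pairwise preference loss with an instance-adaptive margin.

    All *_logps are length-normalized sequence log-probabilities (average
    token log-prob), shape (batch,). With alpha=0 this reduces to a fixed
    target margin gamma, as in SimPO; alpha > 0 shifts the margin per example
    according to how the policy and reference disagree on the preference pair.
    This is a sketch of the adaptive-margin idea, not the paper's exact loss.
    """
    # Length-normalized implicit reward difference under the current policy.
    policy_margin = beta * (policy_chosen_logps - policy_rejected_logps)

    # Instance-wise term: gap between policy and reference log-ratios for the
    # chosen vs. rejected response, treated as a constant (no gradient).
    with torch.no_grad():
        ref_gap = (policy_chosen_logps - ref_chosen_logps) \
                - (policy_rejected_logps - ref_rejected_logps)

    # Per-example target margin instead of a single uniform gamma.
    adaptive_margin = gamma + alpha * ref_gap

    # Bradley-Terry style logistic loss against the adaptive margin.
    return -F.logsigmoid(policy_margin - adaptive_margin).mean()
```

In use, the four log-probability tensors would come from scoring each chosen/rejected pair under the training policy and a (possibly frozen) reference model; only the policy log-probabilities carry gradients.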

Publication
In ICML 2025

Citation:

@inproceedings{wu2025alpha,
  title     = {AlphaDPO: Adaptive Reward Margin for Direct Preference Optimization},
  author    = {Junkang Wu and Xue Wang and Zhengyi Yang and Jiancan Wu and Jinyang Gao and Bolin Ding and Xiang Wang and Xiangnan He},
  booktitle = {Forty-second International Conference on Machine Learning},
  year      = {2025}
}
Junkang Wu (吴俊康)
Jiancan Wu (吴剑灿), Associate Researcher
Xiang Wang (王翔), Professor
Xiangnan He (何向南), Professor