$\beta$-DPO: Direct Preference Optimization with Dynamic $\beta$

Abstract

Direct Preference Optimization (DPO) has emerged as a compelling approach for training Large Language Models (LLMs) to adhere to human preferences. However, the performance of DPO is sensitive to the choice of its trade-off parameter β, as well as to the quality of the preference data. We analyze the impact of β and data quality on DPO, and find that the optimal β value varies with the informativeness of the pairwise data. To address the limitations of a static β, we introduce a novel framework that dynamically calibrates β at the batch level, informed by data quality considerations. Our method additionally incorporates β-guided data filtering to safeguard against the influence of outliers. Through empirical evaluation, we demonstrate that our dynamic β adjustment technique significantly improves DPO’s performance across a range of models and datasets, offering a more robust and adaptable training paradigm for aligning LLMs with human feedback. The code is available at https://github.com/junkangwu/beta-DPO.
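
For concreteness, below is a minimal PyTorch sketch of a DPO loss whose β is recalibrated per batch from the mean implicit reward margin, with a simple margin-based down-weighting standing in for the β-guided data filtering. The function name, the hyperparameters beta0, alpha, and m0, and the exact calibration and weighting rules are illustrative assumptions, not the paper's reference implementation; see the linked repository for the authors' code.

import torch
import torch.nn.functional as F

def dpo_loss_dynamic_beta(policy_chosen_logps, policy_rejected_logps,
                          ref_chosen_logps, ref_rejected_logps,
                          beta0=0.1, alpha=0.6, m0=0.0):
    # Implicit per-pair reward margins: difference of policy and reference log-ratios.
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    margins = pi_logratios - ref_logratios

    # Batch-level calibration (assumed form): raise beta when the batch's mean
    # margin exceeds a threshold m0, lower it otherwise.
    batch_margin = margins.detach().mean()
    beta = torch.clamp(beta0 * (1.0 + alpha * (batch_margin - m0)), min=1e-3)

    # Illustrative beta-guided filtering: down-weight pairs whose margin lies far
    # from the batch mean, treating them as likely outliers.
    z = (margins.detach() - batch_margin) / (margins.detach().std() + 1e-6)
    weights = torch.exp(-0.5 * z ** 2)

    # Standard DPO objective, -log sigmoid(beta * margin), weighted and averaged.
    losses = -F.logsigmoid(beta * margins) * weights
    return losses.sum() / weights.sum(), beta

# Example call with dummy per-sequence log-probabilities for a batch of 4 pairs.
loss, beta = dpo_loss_dynamic_beta(torch.randn(4), torch.randn(4),
                                   torch.randn(4), torch.randn(4))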

Publication
In NeurIPS 2024

Citation:

@inproceedings{wu2024beta,
  title={{$\beta$}-DPO: Direct Preference Optimization with Dynamic {$\beta$}},
  author={Junkang Wu and Yuexiang Xie and Zhengyi Yang and Jiancan Wu and Jinyang Gao and Bolin Ding and Xiang Wang and Xiangnan He},
  booktitle={NeurIPS},
  year={2024}
}
Junkang Wu (吴俊康), Zhengyi Yang (杨正一), Jiancan Wu (吴剑灿, Postdoctoral Researcher), Xiang Wang (王翔, Professor), Xiangnan He (何向南, Professor)