K-order Ranking Preference Optimization for Large Language Models

Abstract

To adapt Large Language Models (LLMs) to ranking tasks, existing list-wise methods, represented by list-wise Direct Preference Optimization (DPO), focus on optimizing partial-order or full-order list ranking consistency for LLMs to enhance their ranking abilities. However, we argue that optimizing top-K ranking consistency could be more appropriate for real-world applications. There are two main reasons: (1) users are typically concerned with only the top-K results, making top-K ranking more important, and (2) tail items often lack precise feedback, making top-K ranking more reliable. Based on this, we propose K-order Ranking Preference Optimization (KPO) by extending DPO's Plackett-Luce model to accommodate top-K rankings. Additionally, recognizing that the number of important items can vary across queries, we extend KPO to dynamically determine the appropriate K for different samples and introduce a curriculum learning strategy to boost training efficiency. Extensive experiments demonstrate the effectiveness of KPO, highlighting its high sample efficiency and robustness to noise.
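
To make the idea concrete, below is a minimal PyTorch sketch of a top-K Plackett-Luce objective in the DPO style, assuming DPO-style implicit rewards beta * log(pi_theta / pi_ref) and candidates pre-sorted by their ground-truth ranking; it is an illustrative sketch of the general technique, not the paper's exact KPO objective, and the function name, beta value, and averaging over K are assumptions.

import torch

def topk_plackett_luce_loss(policy_logps, ref_logps, beta=0.1, k=3):
    """Illustrative top-K Plackett-Luce loss (hypothetical sketch).

    policy_logps, ref_logps: 1-D tensors of shape (N,) with sequence
    log-probabilities of N candidate responses under the policy and the
    reference model, ordered by the ground-truth ranking (best first).
    Only the top-k positions contribute ranking terms; items beyond k
    serve as the unranked tail in each softmax denominator.
    """
    # DPO-style implicit rewards: beta * log(pi_theta / pi_ref)
    rewards = beta * (policy_logps - ref_logps)
    loss = rewards.new_zeros(())
    for i in range(k):
        # Step i: the i-th ranked item competes against all not-yet-chosen candidates.
        loss = loss - (rewards[i] - torch.logsumexp(rewards[i:], dim=0))
    return loss / k

In this formulation, setting k = N - 1 recovers a full-order list-wise objective, while smaller k ignores the ordering among tail items, which is the intuition behind optimizing only top-K ranking consistency.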

Publication
In ACL 2025 Findings

Citation:

@inproceedings{cai2025kpo,
  title={K-order Ranking Preference Optimization for Large Language Models},
  author={Cai, Shihao and Gao, Chongming and Zhang, Yang and Shi, Wentao and Zhang, Jizhi and Bao, Keqin and Wang, Qifan and Feng, Fuli},
  booktitle={Findings of the Association for Computational Linguistics: ACL 2025},
  year={2025}
}
Shihao Cai (蔡仕豪) · Chongming Gao (高崇铭, Postdoctoral Researcher) · Wentao Shi (石文焘) · Jizhi Zhang (张及之) · Keqin Bao (鲍克勤) · Fuli Feng (冯福利, Professor)