HellaSwag-Pro: A Large-Scale Bilingual Benchmark for Evaluating the Robustness of LLMs in Commonsense Reasoning

Abstract

Large language models (LLMs) have shown remarkable capabilities in commonsense reasoning; however, minor variations in how a question is phrased can trigger incorrect responses. Do these models truly understand commonsense knowledge, or do they merely memorize expression patterns? To investigate this question, we present the first extensive robustness evaluation of LLMs in commonsense reasoning. We introduce HellaSwag-Pro, a large-scale bilingual benchmark of 11,200 cases, constructed by designing and compiling seven types of question variants. To build this benchmark, we propose a two-stage method to develop Chinese HellaSwag, a finely annotated dataset comprising 12,000 instances across 56 categories. We conduct extensive experiments on 41 representative LLMs, revealing that these LLMs are far from robust in commonsense reasoning, and that this robustness varies with the language in which the LLM is tested. This work establishes a high-quality evaluation benchmark, and our extensive experiments offer valuable insights to the community on commonsense reasoning in LLMs.
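
To make the robustness notion above concrete, below is a minimal sketch (not the authors' released code; the item format, field names, and model_fn interface are assumptions for illustration) of how accuracy under question variants could be scored: an item counts as robust only if the model answers the original HellaSwag-style question and every one of its variants correctly.

# Minimal sketch of a variant-consistency ("robust accuracy") metric.
# The data layout and the model_fn stand-in are illustrative assumptions,
# not the benchmark's actual schema or evaluation harness.
from typing import Callable, Dict, List

def robust_accuracy(
    items: List[Dict],                          # each item: original question + its variants
    model_fn: Callable[[str, List[str]], int],  # returns index of the chosen ending
) -> float:
    """Fraction of items answered correctly on the original AND every variant."""
    robust = 0
    for item in items:
        questions = [item["original"]] + item["variants"]
        if all(model_fn(q["context"], q["endings"]) == q["label"] for q in questions):
            robust += 1
    return robust / len(items) if items else 0.0

# Toy usage with a dummy "model" that always picks ending 0.
example_items = [{
    "original": {"context": "She plugged in the kettle.",
                 "endings": ["Then she waited for the water to boil.",
                             "Then she painted the ceiling."],
                 "label": 0},
    "variants": [{"context": "After plugging in the kettle, she",
                  "endings": ["waited for the water to boil.",
                              "painted the ceiling."],
                  "label": 0}],
}]
print(robust_accuracy(example_items, lambda ctx, endings: 0))  # -> 1.0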

Publication
In ACL 2025

Citation:

@inproceedings{li2025hellaswag,
  title={HellaSwag-Pro: A Large-Scale Bilingual Benchmark for Evaluating the Robustness of LLMs in Commonsense Reasoning},
  author={Li, Xiaoyuan and Li, Moxin and Men, Rui and Zhang, Yichang and Bao, Keqin and Wang, Wenjie and Feng, Fuli and Liu, Dayiheng and Lin, Junyang},
  booktitle={The 63rd Annual Meeting of the Association for Computational Linguistics},
  year={2025}
}
Xiaoyuan Li (李晓媛)
Keqin Bao (鲍克勤)
Prof. Wenjie Wang (王文杰)
Prof. Fuli Feng (冯福利)