Mateusz Nowak, Xavier Cadet, Peter Chin
The paper introduces a bias-reduced evaluation protocol for LLMs in multiple-choice questions that improves robustness to answer permutations with minimal performance loss.
When large language models (LLMs) are evaluated with multiple-choice questions, biases can arise from the position of the answers, the labels assigned to them, and the examples used in the prompts. This study identifies these biases and proposes a method to reduce them: answer options are presented under uniform, unordered labels, and the model is required to consider each answer in full rather than rely on its label. This makes LLM performance more consistent across different arrangements of the answer options, so the models' true capabilities can be assessed more accurately. The method maintains high performance while reducing variability across permutations, making it a more reliable evaluation tool.
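The summary does not spell out the exact protocol, but a minimal sketch of the general idea follows, assuming a model wrapper `answer_fn` that maps a prompt string to the model's free-text reply. All names here (`build_prompt`, `permutation_accuracy`, `answer_fn`) are hypothetical, not the authors' API: options are shown under identical unordered markers instead of A/B/C/D, the model must reproduce the full answer text, and accuracy is averaged over several shuffles of the options.

```python
import random
from typing import Callable

def build_prompt(question: str, options: list[str]) -> str:
    """Present every option with the same unordered marker so no
    positional or label cue (A/B/C/D, 1/2/3/4) is available."""
    lines = [question] + [f"- {opt}" for opt in options]
    lines.append("Respond with the full text of the correct option.")
    return "\n".join(lines)

def permutation_accuracy(
    answer_fn: Callable[[str], str],  # hypothetical model call: prompt -> reply
    question: str,
    options: list[str],
    correct: str,
    n_permutations: int = 8,
    seed: int = 0,
) -> float:
    """Average accuracy over shuffled answer orderings. Credit is given
    for reproducing the full answer text, never a position or label."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_permutations):
        shuffled = options[:]
        rng.shuffle(shuffled)  # new arrangement each trial
        reply = answer_fn(build_prompt(question, shuffled))
        hits += int(reply.strip().lower() == correct.strip().lower())
    return hits / n_permutations
```

Averaging over permutations makes the reported score invariant to the original option order, which is exactly the robustness property the summary describes: a model that keys on "the first option" or "option C" loses its advantage, while one that judges the answer text itself scores consistently.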