Xin Sheng, Jiaxin Li, Yujuan Pang, Ran Peng, Yong Ma
The paper introduces a method called positive-negative pairing for prompt selection in reinforcement learning with verifiable rewards, leading to improved performance on deterministic reasoning tasks by amplifying rare event signals.
This study explores how to train large language models more effectively with reinforcement learning with verifiable rewards (RLVR), a setup in which tasks have clear, automatically checkable outcomes. The authors propose a new prompt-selection strategy: pairing a challenging but solvable prompt with an easier prompt that still fails occasionally. This pairing amplifies rare-event signals on both sides (unexpected successes on hard prompts and unexpected failures on easy ones), helping the model learn more effectively. The results show that this method outperforms standard prompt-selection approaches, even when using fewer prompts.
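The selection idea described above can be sketched in a few lines. This is a minimal illustration, not the paper's actual implementation: the function name, the success-rate thresholds, and the use of empirical per-prompt success rates are all assumptions made for the example.

```python
import random

def pair_prompts(success_rates, hard_max=0.3, easy_min=0.7):
    """Pair a hard-but-solvable prompt with an easy-but-imperfect one.

    success_rates: dict mapping prompt id -> empirical success rate.
    Prompts that are never solved (rate 0.0) or always solved (rate 1.0)
    carry no learning signal under verifiable rewards, so both pools
    exclude them; the thresholds here are illustrative placeholders.
    """
    hard = [p for p, r in success_rates.items() if 0.0 < r <= hard_max]
    easy = [p for p, r in success_rates.items() if easy_min <= r < 1.0]
    if not hard or not easy:
        return None  # no valid pair available in this batch
    return random.choice(hard), random.choice(easy)

# Example: "c" (never solved) and "d" (always solved) are filtered out.
rates = {"a": 0.1, "b": 0.9, "c": 0.0, "d": 1.0}
print(pair_prompts(rates))  # -> ('a', 'b')
```

The key design point the sketch captures is that both members of the pair sit strictly inside (0, 1): a hard prompt contributes rare success signal, and an easy prompt contributes rare failure signal.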