Xiaoliang Fu, Jiaye Lin, Yangyi Fang, Binbin Zheng, Chaowen Hu, Zekai Shao, Cong Qin, Lu Pan, Ke Zeng, Xunliang Cai
MASPO is a policy-optimization framework that addresses three limitations of existing RLVR (reinforcement learning with verifiable rewards) algorithms for large language models, improving gradient utilization, accounting for probability mass, and stabilizing updates under unreliable reward signals, and it achieves better performance than current methods.
Large language models (LLMs) are powerful tools that can understand and generate human-like text, but training them with reinforcement learning is hampered by limitations in current methods. A new approach, Mass-Adaptive Soft Policy Optimization (MASPO), addresses three of these issues: inefficient use of gradient signals, neglect of the probability mass spread over possible outputs, and inconsistent updates driven by unreliable feedback (a sketch of the baseline objective this contrasts with follows below). By improving these areas, MASPO helps LLMs learn more effectively and reliably, which can make them better at tasks such as conversation and translation.
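The summary does not reproduce MASPO's actual objective, so the sketch below only contrasts the standard hard-clipped RLVR surrogate (as in PPO/GRPO, the usual baseline in this setting) with a generic soft reweighting, to make the "inefficient use of learning signals" point concrete. The function names, the Gaussian-style soft weight, and the `tau` parameter are illustrative assumptions, not MASPO's published formulation.

```python
import torch

def hard_clip_loss(logp_new, logp_old, adv, eps=0.2):
    """PPO/GRPO-style hard-clipped surrogate (the common RLVR baseline).

    Tokens whose importance ratio exp(logp_new - logp_old) leaves the
    [1 - eps, 1 + eps] band contribute zero gradient, so part of the
    learning signal is discarded outright.
    """
    ratio = torch.exp(logp_new - logp_old)
    return -torch.min(ratio * adv,
                      torch.clamp(ratio, 1 - eps, 1 + eps) * adv).mean()

def soft_weighted_loss(logp_new, logp_old, adv, tau=0.2):
    """Illustrative *soft* alternative (not MASPO's actual objective).

    Rather than a hard cutoff, each token is smoothly down-weighted by
    how far it has drifted off-policy, so every token keeps some
    gradient and the weight reflects how much probability mass the
    policy has moved.
    """
    log_ratio = logp_new - logp_old
    weight = torch.exp(-(log_ratio ** 2) / (2 * tau ** 2)).detach()
    return -(weight * logp_new * adv).mean()

if __name__ == "__main__":
    logp_old = torch.tensor([-1.2, -3.0, -0.5])
    logp_new = torch.tensor([-0.8, -4.0, -0.5], requires_grad=True)
    adv = torch.tensor([1.0, 1.0, -1.0])  # e.g. group-normalized verifiable rewards
    print(hard_clip_loss(logp_new, logp_old, adv).item())
    print(soft_weighted_loss(logp_new, logp_old, adv).item())
```

The `.detach()` on the soft weight keeps the reweighting factor from feeding back into the gradient, a common choice in importance-weighting schemes so that only the weighted log-probability term is optimized.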