Yuelin Hu, Zhengxue Cheng, Wei Liu, Li Song
EGSPO improves large language model training through token-level, entropy-gated gradient modulation, enhancing performance on mathematical reasoning tasks with minimal computational overhead.
The paper introduces Entropy Gated Selective Policy Optimization (EGSPO), a method for improving the training of large language models. Traditional training pipelines combine supervised learning with reinforcement learning; EGSPO adds a step that adjusts learning at a finer granularity. By focusing on individual pieces of text (tokens) and scaling how much each one influences training according to its uncertainty (entropy), the method helps the model learn more effectively from both correct and incorrect examples. This approach has been shown to improve the model's performance on math-related tasks while adding only a small amount of computational overhead.
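To make the idea of entropy-gated, token-level modulation concrete, the sketch below shows one plausible way such a gate could be applied to a per-token policy-gradient loss in PyTorch. It is only an illustration under stated assumptions, not the paper's actual implementation: the hard threshold `threshold`, the function names, the normalization by log vocabulary size, and the REINFORCE-style loss are all assumptions; EGSPO may use a soft weighting or a different selection rule.

```python
import torch
import torch.nn.functional as F

def entropy_gated_token_weights(logits, threshold=0.5):
    """Illustrative per-token gate from predictive entropy (assumed design).

    logits: (batch, seq_len, vocab_size) raw model outputs.
    Returns a (batch, seq_len) tensor of gate values in {0, 1}.
    The threshold is a hypothetical hyperparameter, not taken from the paper.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    # Token-level predictive entropy, normalized by log(vocab) to lie in [0, 1].
    entropy = -(probs * log_probs).sum(dim=-1)
    entropy = entropy / torch.log(torch.tensor(float(logits.size(-1))))
    # Hard gate: down-weight (here, zero out) tokens whose entropy exceeds the threshold.
    return (entropy <= threshold).float()

def gated_policy_loss(logits, actions, advantages, threshold=0.5):
    """REINFORCE-style per-token loss modulated by the entropy gate (assumed)."""
    log_probs = F.log_softmax(logits, dim=-1)
    # Log-probability of each sampled token: actions has shape (batch, seq_len).
    token_logp = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    gate = entropy_gated_token_weights(logits, threshold)
    # Gated tokens contribute nothing to the gradient; the rest are weighted by advantage.
    return -(gate * advantages * token_logp).mean()
```

In this sketch the gate simply masks high-entropy tokens out of the policy-gradient update; the overhead is one softmax and an entropy reduction per token, which is consistent with the summary's claim of only a small added computational cost.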