

MASPO: Unifying Gradient Utilization, Probability Mass, and Signal Reliability for Robust and Sample-Efficient LLM Reasoning

Source: arXiv

Xiaoliang Fu, Jiaye Lin, Yangyi Fang, Binbin Zheng, Chaowen Hu, Zekai Shao, Cong Qin, Lu Pan, Ke Zeng, Xunliang Cai

cs.LG | cs.AI | Feb 19, 2026

One-line Summary

MASPO is a framework that addresses limitations of existing RLVR algorithms for large language models by improving gradient utilization, probability-mass awareness, and reward-signal reliability, and it outperforms current methods.

Plain-language Overview

Large language models (LLMs) are powerful tools that can understand and generate human-like text, but training them with reinforcement learning from verifiable rewards (RLVR) is hampered by limitations in current methods. The new approach, Mass-Adaptive Soft Policy Optimization (MASPO), addresses three of these: inefficient use of learning signals, insensitivity to how probability mass is distributed over possible outputs, and inconsistent updates driven by unreliable feedback. By improving these areas, MASPO helps LLMs learn more effectively and reliably, making them better at reasoning-heavy tasks such as conversation, translation, and more.
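To make the three issues concrete, the toy sketch below shows what a policy-gradient surrogate loss *could* look like when each one is addressed: a smooth gate instead of a hard clip (so more samples contribute gradient), an explicit probability-mass term, and a per-sample reliability weight on the reward signal. Every name and formula here is an illustrative assumption, not the paper's actual objective.

```python
import numpy as np

def maspo_style_loss(logp_new, logp_old, advantages, reward_confidence,
                     tau=0.5, temp=0.1):
    """Hypothetical surrogate loss illustrating the three ideas:
    1. gradient utilization: a smooth sigmoid gate replaces a hard
       PPO-style clip, so samples outside the trust region still
       contribute (damped) gradient;
    2. probability mass: updates are scaled by the policy's current
       probability mass on the sampled token;
    3. signal reliability: per-sample confidence weights damp updates
       driven by noisy or ambiguous verifier feedback.
    None of this is the published MASPO formulation."""
    ratio = np.exp(logp_new - logp_old)      # importance-sampling ratio
    # soft gate: near 1 when ratio is close to 1, decays smoothly beyond tau
    soft_gate = 1.0 / (1.0 + np.exp((np.abs(ratio - 1.0) - tau) / temp))
    mass = np.exp(logp_new)                  # probability mass on the token
    weights = soft_gate * mass * reward_confidence
    # weighted surrogate loss (to be minimized)
    return -np.mean(weights * ratio * advantages)
```

With identical old/new log-probabilities, positive advantages, and full confidence, the gate is nearly open and the loss is negative, pushing the optimizer to raise probability on rewarded samples; lowering `reward_confidence` shrinks the update rather than flipping it.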

Technical Details