

OPUS: Towards Efficient and Principled Data Selection in Large Language Model Pre-training in Every Iteration

Source: arXiv

Shaobo Wang, Xuan Ouyang, Tianyi Xu, Yuzheng Hu, Jialin Liu, Guo Chen, Tianyu Zhang, Junhao Zheng, Kexin Yang, Xingzhang Ren, Dayiheng Liu, Linfeng Zhang

cs.CL | Feb 5, 2026

One-line Summary

OPUS is a dynamic data selection framework that improves the efficiency of large language model pre-training by selecting better tokens based on optimizer-induced updates, achieving significant performance gains with less data.

Plain-language Overview

As high-quality text data for training large language models becomes scarce, selecting the best data matters more than simply gathering more of it. OPUS is a method that chooses better training data by examining how updates from modern optimizers affect the model. It selects data dynamically during training, yet adds minimal extra computational cost. In reported experiments it outperforms traditional selection methods while using fewer tokens, which is especially valuable when high-quality data is limited.
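To make the idea of optimizer-aware data selection concrete, here is a minimal illustrative sketch, not the actual OPUS algorithm: each candidate micro-batch is scored by how well its gradient aligns with an Adam-style preconditioned update direction built from the optimizer's running moments, and the top-k batches are kept. All function and variable names here (`select_batches`, `exp_avg`, `exp_avg_sq`) are hypothetical.

```python
import numpy as np

def select_batches(grads, exp_avg, exp_avg_sq, k, eps=1e-8):
    """Illustrative (not OPUS's actual) dynamic data selection:
    score each candidate micro-batch gradient by its cosine
    alignment with the optimizer-induced update direction, then
    keep the indices of the top-k batches.

    grads      : list of 1-D numpy arrays (flattened per-batch gradients)
    exp_avg    : Adam first-moment estimate (same shape as each gradient)
    exp_avg_sq : Adam second-moment estimate
    """
    # Adam-style preconditioned update direction from optimizer state
    update_dir = exp_avg / (np.sqrt(exp_avg_sq) + eps)
    update_dir = update_dir / (np.linalg.norm(update_dir) + eps)

    # Cosine alignment of each batch gradient with the update direction
    scores = []
    for g in grads:
        g_unit = g / (np.linalg.norm(g) + eps)
        scores.append(float(g_unit @ update_dir))

    # Keep the k best-aligned batches (returned as sorted indices)
    top = np.argsort(scores)[::-1][:k]
    return sorted(top.tolist())

# Toy usage: the update direction points along the first coordinate,
# so batches whose gradients point that way are preferred.
grads = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([-1.0, 0.0])]
chosen = select_batches(grads, exp_avg=np.array([1.0, 0.0]),
                        exp_avg_sq=np.array([1.0, 1.0]), k=2)
```

The key difference from static filtering is that the score depends on the optimizer's current state, so the selected data changes as training progresses, which is the "dynamic" aspect the overview describes.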

Technical Details