Shaobo Wang, Xuan Ouyang, Tianyi Xu, Yuzheng Hu, Jialin Liu, Guo Chen, Tianyu Zhang, Junhao Zheng, Kexin Yang, Xingzhang Ren, Dayiheng Liu, Linfeng Zhang
OPUS is a dynamic data selection framework that improves the efficiency of large language model pre-training by selecting better tokens based on optimizer-induced updates, achieving significant performance gains with less data.
As the supply of high-quality text for training large language models runs low, selecting the best data is becoming more important than simply collecting more of it. OPUS is a new method that chooses better training data by examining how the updates applied by modern optimizers affect the model. It selects data dynamically during training while adding only minimal extra computational cost. OPUS has shown strong results, outperforming traditional selection methods while using fewer tokens, which is especially valuable when high-quality data is scarce.
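To make the idea of optimizer-induced selection concrete, here is a minimal, hypothetical sketch in PyTorch. The summary does not specify OPUS's actual scoring rule, so this illustrates one plausible reading: score each candidate batch by the size of the parameter update the optimizer would actually apply for it, by preconditioning the raw gradient with Adam's second-moment estimates. The function name `opus_style_scores` and all details below are assumptions for illustration, not the authors' implementation.

```python
import torch

def opus_style_scores(model, optimizer, candidates, loss_fn, eps=1e-8):
    """Hypothetical sketch: score candidate batches by the squared norm of
    the optimizer-induced update they would trigger. `candidates` is an
    iterable of (inputs, targets) pairs; `optimizer` is assumed to be Adam
    (or AdamW) that has taken at least one step, so that second-moment
    estimates ('exp_avg_sq') exist in its state."""
    scores = []
    for inputs, targets in candidates:
        model.zero_grad(set_to_none=True)
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        score = 0.0
        for p in model.parameters():
            if p.grad is None:
                continue
            state = optimizer.state.get(p, {})
            # Precondition the raw gradient the way Adam would:
            # divide elementwise by sqrt(second-moment estimate).
            v = state.get("exp_avg_sq")
            update = p.grad / (v.sqrt() + eps) if v is not None else p.grad
            score += update.pow(2).sum().item()
        scores.append(score)
    return scores
```

A selection loop under this sketch would rank candidate batches by these scores each step (or every few steps) and train only on the top fraction, which is how a dynamic criterion like this can stay cheap relative to full training.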