Yizhao Gao, Jianyu Wei, Qihao Zhang, Yu Cheng, Shimao Chen, Zhengju Tang, Zihan Jiang, Yifan Song, Hailin Zhang, Liang Zhao, Bo Yang, Gang Wang, Shijie Cao, Fuli Luo
HySparse is a hybrid sparse attention architecture that improves performance and reduces memory usage by using full attention layers as oracles for token selection and sharing their KV caches with the sparse layers.
Researchers have developed HySparse, a hybrid attention design that makes large language models more efficient. It combines full attention layers with sparse attention layers: the full layers identify the tokens that matter most, and the sparse layers reuse their KV caches and attend only to those tokens. This simplifies token selection and avoids the extra memory and computation a separate selection mechanism would require. HySparse shows significant performance gains over existing sparse attention models while using less memory, making it a promising approach for handling long inputs efficiently.
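To make the mechanism concrete, here is a minimal sketch, assuming a single decoding step in PyTorch, of how a full attention layer could double as a token-selection oracle whose KV cache and selected token indices are then reused by a sparse layer. The function names, tensor shapes, top-k rule, and head-averaged importance score are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch only (not the HySparse implementation): a full-attention
# layer scores all cached tokens, the top-scoring positions act as an "oracle"
# selection, and a sparse layer reuses the same KV cache but attends only to
# the selected positions.

import torch
import torch.nn.functional as F

def full_attention_with_selection(q, k_cache, v_cache, top_k):
    """Full attention over the whole KV cache; also returns the indices of
    the top_k cached tokens by attention weight (the oracle selection)."""
    # q: (heads, d); k_cache, v_cache: (heads, seq, d)
    scores = torch.einsum("hd,hsd->hs", q, k_cache) / q.shape[-1] ** 0.5
    weights = F.softmax(scores, dim=-1)                       # (heads, seq)
    out = torch.einsum("hs,hsd->hd", weights, v_cache)        # (heads, d)
    # Assumed selection rule: average importance across heads, take top_k.
    importance = weights.mean(dim=0)                          # (seq,)
    selected = importance.topk(min(top_k, importance.numel())).indices
    return out, selected

def sparse_attention_reusing_cache(q, k_cache, v_cache, selected):
    """Sparse attention that reuses the full layer's KV cache but attends
    only to the token positions chosen by the full-attention oracle."""
    k_sel = k_cache[:, selected, :]                           # (heads, k, d)
    v_sel = v_cache[:, selected, :]
    scores = torch.einsum("hd,hkd->hk", q, k_sel) / q.shape[-1] ** 0.5
    weights = F.softmax(scores, dim=-1)
    return torch.einsum("hk,hkd->hd", weights, v_sel)

if __name__ == "__main__":
    heads, seq, d, top_k = 4, 128, 64, 16
    q = torch.randn(heads, d)
    k_cache, v_cache = torch.randn(heads, seq, d), torch.randn(heads, seq, d)
    full_out, selected = full_attention_with_selection(q, k_cache, v_cache, top_k)
    sparse_out = sparse_attention_reusing_cache(q, k_cache, v_cache, selected)
    print(full_out.shape, sparse_out.shape, selected.shape)
```

The point the sketch illustrates is that the sparse layer performs no selection of its own and stores no cache of its own: it simply indexes into the full layer's existing KV cache, which is where the memory and computation savings described above would come from.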