Hao Li, Shuai Yang, Yilun Chen, Yang Tian, Xiaoda Yang, Xinyi Chen, Hanqing Wang, Tai Wang, Feng Zhao, Dahua Lin, Jiangmiao Pang
CronusVLA enhances vision-language-action models by efficiently incorporating multi-frame motion data, achieving state-of-the-art performance in manipulation tasks.
CronusVLA is a new approach that improves how robots perceive and act in their environment by conditioning on multiple past video frames instead of a single one. Existing vision-language-action (VLA) models have struggled to exploit multiple frames because passing every frame through the full model is computationally expensive. CronusVLA addresses this with a mechanism that efficiently processes and reuses information from past frames, leading to better performance on tasks such as object manipulation. The approach improves success rates in simulated environments and also demonstrates strong results in real-world experiments.
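As a rough illustration of the multi-frame idea, here is a minimal sketch in PyTorch. The module names, the cache structure, and the choice of cross-attention for aggregation are assumptions for illustration, not the paper's confirmed implementation; the point is only that each frame goes through the expensive backbone once, while a lightweight module fuses the cached history at every step.

```python
import collections
import torch
import torch.nn as nn

class MotionFeatureCache:
    """Keeps the last `max_frames` per-frame feature vectors
    (hypothetical cache, standing in for whatever the model
    actually stores between control steps)."""
    def __init__(self, max_frames: int = 8):
        self.buffer = collections.deque(maxlen=max_frames)

    def append(self, feat: torch.Tensor) -> None:
        self.buffer.append(feat)

    def stacked(self) -> torch.Tensor:
        # Shape (1, T, D): the history of cached frame features.
        return torch.stack(list(self.buffer), dim=1)

class CrossFrameAggregator(nn.Module):
    """Fuses the current frame's features with cached past-frame
    features via cross-attention (an assumed aggregation choice)."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, current: torch.Tensor, history: torch.Tensor) -> torch.Tensor:
        # current: (1, 1, D) query; history: (1, T, D) keys/values.
        fused, _ = self.attn(current, history, history)
        return self.norm(current + fused)

# Usage: encode each new frame once, cache it, aggregate cheaply per step.
dim = 512
backbone = nn.Linear(768, dim)      # stand-in for the expensive VLM encoder
aggregator = CrossFrameAggregator(dim)
cache = MotionFeatureCache(max_frames=8)

for step in range(5):
    frame_embedding = torch.randn(1, 768)  # placeholder observation features
    feat = backbone(frame_embedding)       # heavy encoder runs once per frame
    cache.append(feat)
    fused = aggregator(feat.unsqueeze(1), cache.stacked())
    # `fused` would condition an action head; only the newest frame
    # passed through the backbone, so cost grows far slower than
    # re-encoding the whole frame history at every step.
    print(step, fused.shape)
```

The key design point the sketch tries to capture is the asymmetry of cost: the per-frame encoder is amortized across time via the cache, so extending the temporal context adds only the small aggregator's overhead rather than multiplying the backbone's.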