Wenhao Li, Wenwu Li, Chuyun Shen, Junjie Sheng, Zixiao Huang, Di Wu, Yun Hua, Wei Yin, Xiangfeng Wang, Hongyuan Zha, Bo Jin
TextAtari is a benchmark that tests language agents on long-horizon decision-making by describing classic Atari games in text rather than pixels, revealing substantial performance gaps relative to human players.
TextAtari is a new benchmark designed to test how well language-based AI agents can play classic Atari games when the game state is described in text rather than presented visually. The benchmark comprises nearly 100 distinct tasks derived from these games and challenges agents to make decisions over very long horizons of up to 100,000 steps. The study evaluated several open-source large language models paired with different agent strategies to see how they perform in these text-only game scenarios. The results show that these agents fall far short of human players, especially on tasks that demand extended planning and sustained reasoning across many moves.
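To make the setup concrete, below is a minimal sketch of what a text-based Atari agent loop could look like. It assumes a Gymnasium Atari environment with RAM observations; the `describe_state` and `choose_action` helpers are hypothetical placeholders for the benchmark's actual text-rendering and LLM-querying components, which this summary does not specify.

```python
# Sketch of a text-based Atari agent loop (requires `pip install "gymnasium[atari]"`).
# `describe_state` and `choose_action` are hypothetical stand-ins, not TextAtari's API.
import gymnasium as gym


def describe_state(ram: bytes) -> str:
    """Hypothetical: render the game's RAM state as a textual description."""
    return f"RAM snapshot (first 8 bytes): {list(ram[:8])}"


def choose_action(prompt: str, n_actions: int) -> int:
    """Hypothetical: an LLM call would map the prompt to one of n_actions.

    A fixed placeholder action stands in for the model here.
    """
    return 0  # NOOP in most Atari action sets


env = gym.make("ALE/Breakout-v5", obs_type="ram")
obs, info = env.reset(seed=0)
total_reward = 0.0
for step in range(1000):  # the benchmark itself runs episodes up to 100,000 steps
    prompt = describe_state(bytes(obs))
    action = choose_action(prompt, env.action_space.n)
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    if terminated or truncated:
        break
env.close()
print(f"Episode ended after {step + 1} steps, return = {total_reward}")
```

The long-horizon difficulty is visible even in this toy loop: every decision depends only on the current textual snapshot, so an agent must carry any memory of earlier events in its prompts or reasoning traces across tens of thousands of such calls.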