Tina Khezresmaeilzadeh, Jike Zhong, Konstantinos Psounis
The VRIQ benchmark reveals that current Vision Language Models struggle with visual reasoning, with failures driven primarily by perception limitations rather than reasoning itself.
This study introduces VRIQ, a new benchmark for testing how well Vision Language Models (VLMs) perform visual reasoning tasks. The models struggle markedly: on abstract puzzles they reach only about 28% accuracy, close to random guessing, and even on natural-image tasks they achieve just 45%. Most failures, however, stem from perception issues rather than reasoning, suggesting that improving how these models perceive visual information could enhance their reasoning capabilities.