Mengqi He, Xinyu Tian, Xin Shen, Jinhong Ni, Shu Zou, Zhaoyuan Yang, Jing Zhang
The study shows that adversarial attacks targeting high-entropy tokens can degrade vision-language model performance and expose safety vulnerabilities while using far fewer resources than broader attacks.
Vision-language models (VLMs) are great at understanding and generating text based on images, but they can still be tricked by adversarial attacks. This research found that only a small portion of the tokens in a sentence (about 20%), the most uncertain or unpredictable ones, are key to changing the model's output. By focusing perturbations on these critical tokens, attackers can disrupt the model's performance just as effectively as attacks spread across the whole input, but with fewer resources. The approach even transfers across different types of VLMs, suggesting that these models share a common weak spot.
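To make the core idea concrete, below is a minimal sketch, not the authors' implementation, of how one might rank token positions by predictive entropy and keep the most uncertain ~20% as candidate attack targets. The model choice ("gpt2" standing in for a VLM's text decoder), the `high_entropy_token_indices` helper, and the exact 20% ratio are illustrative assumptions.

```python
# Illustrative sketch: score each token position in a caption by the entropy
# of the language model's predictive distribution, then keep the top ~20%
# most uncertain positions as targets for an adversarial perturbation.
# "gpt2" is a stand-in for the VLM's text decoder (assumption).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def high_entropy_token_indices(text: str, ratio: float = 0.2) -> list[int]:
    """Return indices of the `ratio` fraction of token positions whose
    preceding-context predictive distribution has the highest entropy."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits[0]          # (seq_len, vocab_size)
    # logits[i] is the distribution over token i+1 given tokens <= i,
    # so its entropy measures how unpredictable token i+1 is.
    probs = torch.softmax(logits[:-1], dim=-1)
    entropy = -(probs * torch.log(probs + 1e-12)).sum(dim=-1)
    k = max(1, int(ratio * entropy.numel()))
    top = torch.topk(entropy, k).indices + 1     # shift to token positions
    return top.tolist()

caption = "A man rides a horse on the beach at sunset."
targets = high_entropy_token_indices(caption)
tokens = tokenizer.convert_ids_to_tokens(tokenizer(caption)["input_ids"])
print([tokens[i] for i in targets])              # most uncertain tokens
```

In this sketch the "uncertain" tokens are simply those the decoder finds hardest to predict from context; an attack budget concentrated on those positions is what the summary above describes as matching broader attacks at lower cost.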