Jiayu Hu, Beibei Li, Jiangwei Xia, Yanjun Qin, Bing Ji, Zhongshi He
The paper introduces an adversarial parametric editing framework to reduce hallucinations in Vision-Language Models by prioritizing visual evidence over linguistic biases.
Vision-Language Models, which combine image and text understanding, often produce incorrect outputs, known as hallucinations, because they lean on learned language priors rather than the visual input. This paper presents an approach that reduces these errors by actively identifying and editing the model parameters most prone to such mistakes. The method first builds a dataset that contrasts accurate outputs with hallucinated ones, then fine-tunes the identified parameters so the model weighs visual evidence more heavily. This has been shown to significantly improve accuracy on both generation and understanding tasks, as illustrated in the sketch below.
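The two stages described above, selecting hallucination-prone parameters and then editing them with a contrastive signal that favors visually grounded outputs, could look roughly like the following PyTorch sketch. The module-selection rule (`"cross_attn"` in the name), the batch fields (`image`, `accurate_ids`, `hallucinated_ids`), and the margin loss are illustrative assumptions, not the paper's actual procedure.

```python
import torch
import torch.nn.functional as F

def select_hallucination_prone_params(model):
    """Stage 1 (illustrative): freeze the model and unfreeze only the
    modules assumed to drive hallucination (here: cross-attention)."""
    for p in model.parameters():
        p.requires_grad = False
    editable = [p for name, p in model.named_parameters()
                if "cross_attn" in name]  # assumed target modules
    for p in editable:
        p.requires_grad = True
    return editable

def contrastive_edit_step(model, optimizer, batch, margin=1.0):
    """Stage 2 (illustrative): push probability mass toward the grounded
    answer and away from the hallucinated one for the same image."""
    # negative log-likelihood of the visually grounded continuation
    good = model(pixel_values=batch["image"],
                 input_ids=batch["accurate_ids"],
                 labels=batch["accurate_ids"]).loss
    # negative log-likelihood of the hallucinated continuation
    bad = model(pixel_values=batch["image"],
                input_ids=batch["hallucinated_ids"],
                labels=batch["hallucinated_ids"]).loss
    # minimize the grounded NLL while keeping the hallucinated NLL
    # at least `margin` higher (hinge-style contrastive objective)
    loss = good + F.relu(margin - (bad - good))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage sketch (model and data loader are assumed to exist):
#   params = select_hallucination_prone_params(vlm)
#   optimizer = torch.optim.AdamW(params, lr=1e-5)
#   for batch in contrastive_loader:
#       contrastive_edit_step(vlm, optimizer, batch)
```

Restricting the optimizer to the selected parameters keeps the edit localized, which is the usual motivation for parametric-editing approaches over full fine-tuning.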