Jiulong Wu, Zhengliang Shi, Shuaiqiang Wang, Jizhou Huang, Dawei Yin, Lingyong Yan, Min Cao, Min Zhang
The paper introduces Entity-centric Multimodal Preference Optimization (EMPO) to reduce hallucinations in Large Vision-Language Models by improving modality alignment and utilizing automatically constructed high-quality preference data.
Large Vision-Language Models (LVLMs) are powerful tools for a variety of tasks involving both images and text. However, they sometimes produce "hallucinations," outputs that are not grounded in the visual input, caused by misalignment between the visual and textual modalities. This paper presents a new method, Entity-centric Multimodal Preference Optimization (EMPO), that improves modality alignment to reduce such hallucinations. By training on automatically constructed high-quality preference data, EMPO significantly decreases the rate of hallucinations in LVLMs.