SEMINAR

Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs

Daeun Moon

2026.01.19

LVLM

Venue: ECCV 2024

Paper link: arXiv

Overview

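LVLMs process images and text jointly, but in practice they tend to rely more on the text than on the image.
As a result, they produce hallucinations that do not match the input image.
A key cause is text inertia: the model generates the same response even when the image input is removed.
PAI is a training-free method that adjusts attention and logits to strengthen image-grounded responses.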

Key Takeaways

Problem Setting

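An LVLM consists of an image encoder and a language model, and image tokens are used less than text tokens.
Because of text inertia, the model relies on existing text patterns rather than on the image.
The same hallucinated output can be generated even when the image input is missing or changed.
Attention analysis shows that the share of attention placed on image tokens is low.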
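
The attention analysis mentioned above can be illustrated with a small sketch. It sums the post-softmax attention of one query row over each token group. The function name `attention_share`, the span layout, and the toy numbers are all hypothetical and chosen only for illustration; they are not measurements from the paper.

```python
def attention_share(weights, spans):
    """Sum post-softmax attention per token group for one query row.

    weights: list of floats that sums to about 1 (one attention row)
    spans:   dict mapping a group name to a half-open (start, end) index range
    """
    return {name: sum(weights[start:end]) for name, (start, end) in spans.items()}


# Toy row (made-up values): 1 BOS token, 4 image tokens,
# 3 instruction tokens, 2 history tokens.
row = [0.40, 0.02, 0.03, 0.02, 0.03, 0.15, 0.10, 0.05, 0.10, 0.10]
spans = {
    "bos": (0, 1),
    "image": (1, 5),
    "instruction": (5, 8),
    "history": (8, 10),
}

shares = attention_share(row, spans)
for name, value in shares.items():
    print(f"{name}: {value:.2f}")
```

In this toy row the image tokens receive a smaller share than either the BOS token or the instruction tokens, which is the kind of imbalance the analysis describes.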

Main Idea

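Proposes strengthening image attention at inference time, without any training
Text Inertia
- The model relies on the text prior rather than on the image
- Image tokens are not used sufficiently, which leads to hallucination
PAI (Pay Attention to Image)
- Directly modifies self-attention to increase the influence of image tokens
- Steers attention toward image-grounded responses
Step 1: Attention Extraction
- Compute the attention matrix when generating the current token
- Separate image, instruction, and history tokens for analysis
Step 2: Attention Intervention
- Increase the attention weights on image tokens
- Strengthen the use of image information based on a trustful direction
Step 3: Attention Mode Prior
- Mitigate unnecessary attention concentration, such as on the BOS token
- Apply the intervention at appropriate layers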
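
The attention intervention in the steps above can be sketched as follows. This is a minimal illustration, assuming the boost is applied to pre-softmax scores at image positions as `score + alpha * |score|`, so the adjustment never lowers an image score. The update rule, the function names, `alpha`, and the toy scores are assumptions for illustration and are not taken from this summary.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def amplify_image_attention(scores, image_span, alpha):
    """Boost pre-softmax scores at image positions (assumed update rule).

    scores:     pre-softmax attention scores for one query row
    image_span: half-open (start, end) index range of image tokens
    alpha:      hypothetical strength of the boost (0 leaves scores unchanged)
    """
    start, end = image_span
    boosted = list(scores)
    for j in range(start, end):
        boosted[j] = scores[j] + alpha * abs(scores[j])
    return boosted


# Toy scores (made-up): index 0 is BOS, 1..4 are image tokens, 5..7 are text.
scores = [2.0, 0.1, -0.2, 0.3, 0.0, 1.0, 0.8, 0.5]
image_span = (1, 5)

before = softmax(scores)
after = softmax(amplify_image_attention(scores, image_span, alpha=0.5))

image_before = sum(before[1:5])
image_after = sum(after[1:5])
print(f"image share before: {image_before:.3f}, after: {image_after:.3f}")
```

In a real model the same adjustment would be applied inside selected attention layers rather than on a single row; which layers to use and how to handle the BOS token are left out of this sketch.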
Logit Refinement
- Suppress the text prior, using the distribution generated without the image as a reference
- Adjust the probabilities to emphasize the image-conditioned prediction
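
Logit refinement can be sketched as a contrast between two forward passes, one with the image and one without it. The sketch below assumes the combination `gamma * with_image - (gamma - 1) * text_only`; this form, the name `refine_logits`, the value of `gamma`, and the toy vocabulary and logits are assumptions for illustration only.

```python
def refine_logits(logits_with_image, logits_text_only, gamma):
    """Emphasize image-conditioned logits over the text-only prior.

    gamma = 1 returns the image-conditioned logits unchanged;
    gamma > 1 pushes down tokens that the text-only pass already favors.
    """
    return [
        gamma * with_img - (gamma - 1.0) * text_only
        for with_img, text_only in zip(logits_with_image, logits_text_only)
    ]


def argmax(xs):
    """Index of the largest value in a list."""
    return max(range(len(xs)), key=lambda i: xs[i])


# Toy vocabulary and logits (made-up values).
vocab = ["dog", "cat", "frisbee"]
logits_with_image = [2.0, 1.0, 1.8]  # pass with the image
logits_text_only = [2.0, 0.5, 0.2]   # pass without the image (text prior)

refined = refine_logits(logits_with_image, logits_text_only, gamma=1.5)

print("before:", vocab[argmax(logits_with_image)])
print("after: ", vocab[argmax(refined)])
```

In this toy case "dog" scores highly even without the image, so it is treated as a text prior and loses its lead to "frisbee", whose score depends on the image.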

Result

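Hallucination is reduced across a variety of decoding methods
Performance improves consistently across several settings, including QA, VQA, and description
In GPT-4-based evaluation, both accuracy and detailedness improve
Image-grounded descriptions become more accurate even in long responses
Performance improves stably regardless of model size
Attention visualization confirms that the model tends to focus more on image regions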
