Experiment Pipeline: Human vs. VLM Emotion Rating on AI-Generated Face Images

Lab: GIST LCBL (Language, Cognition, and Brain Lab)
Domain: Cognitive Psychology — Affective Computing
Last Updated: 2026-02-22

1. Research Overview

1.1 Research Question

How do emotion ratings (emotion category, valence, arousal) from humans and Vision Language Models (VLMs) agree or diverge on AI-generated face images? In particular, is there systematic bias or misalignment as a function of gender, race, or emotion category?

1.2 Core Contribution

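Stimulus sets used in prior emotion research (e.g., KDEF, ADFES, FER2013) do not control for appearance differences between the pictured individuals (hairstyle, lighting, background, etc.), which introduces confounds. This study contributes:

Controlled stimulus generation: neutral faces are generated per race with OpenArt, then only the emotion is transformed with Gemini 2.5 Flash (image generation), so that emotional expression is manipulated systematically while each face keeps the same identity
Large-scale human ratings: emotion ratings collected from 1,000 participants
Multi-VLM comparison: the same stimuli are rated by VLMs of varying size and architecture, and their agreement with humans is compared systematically
Attention-based disagreement analysis: when a VLM disagrees with humans, attention analysis shows where in the image the model was looking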

2. Stimuli Construction

2.1 Image Generation Pipeline

OpenArt (neutral face generation)
    → 3 races × 2 genders × 40 identities = 240 base images
    → Gemini 2.5 Flash (emotion-only transformation)
    → 6 emotions per identity
    → Total: 1,440 images
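
The factorial design above can be sanity-checked with a short enumeration. This is a minimal sketch: the group prefixes and emotion abbreviations follow Sections 2.2 and 2.3, but the `image_id` naming scheme (`BM01_Ang` and so on) is a hypothetical placeholder, not the project's actual file naming.

```python
from itertools import product

RACES = {"B": "Black", "C": "Caucasian", "K": "Korean"}
GENDERS = {"M": "Man", "W": "Woman"}
EMOTIONS = ["Ang", "Dis", "Fea", "Hap", "Sad", "Neu"]
N_IDENTITIES = 40


def enumerate_stimuli():
    """Yield one record per stimulus in the 3 x 2 x 40 x 6 design."""
    for (r, race), (g, gender), identity, emotion in product(
        RACES.items(), GENDERS.items(), range(1, N_IDENTITIES + 1), EMOTIONS
    ):
        yield {
            # Hypothetical ID format, used only for this illustration.
            "image_id": f"{r}{g}{identity:02d}_{emotion}",
            "race": race,
            "gender": gender,
            "identity": identity,
            "emotion": emotion,
        }


stimuli = list(enumerate_stimuli())
base_images = {(s["race"], s["gender"], s["identity"]) for s in stimuli}
print(len(base_images), len(stimuli))  # 240 1440
```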

2.2 Demographic Groups

Directory	Prefix	Count	Description
Black Man	BM##	240	40 identities × 6 emotions
Black Woman	BW##	240	40 identities × 6 emotions
Caucasian Man	CM#	240	40 identities × 6 emotions
Caucasian Woman	CW#	240	40 identities × 6 emotions
Korean Man	KM#	240	40 identities × 6 emotions
Korean Woman	KW#	240	40 identities × 6 emotions

2.3 Emotion Categories

5 basic emotions + neutral (6 categories):

Angry (Ang)
Disgusted (Dis)
Fearful (Fea)
Happy (Hap)
Sad (Sad)
Neutral (Neu / NES)

2.4 Stimulus Control Advantages

Factor	Traditional Datasets	This Study
Identity consistency	Different individuals per emotion	Same identity across emotions
Background/lighting	Variable	Uniform (AI-generated)
Demographic balance	Often imbalanced	3 races × 2 genders, equal counts
Emotion intensity	Uncontrolled	Controlled via prompt specification

3. Human Rating Collection

Participants: N = 1,000
Platform: Online survey
Rating dimensions:
- Emotion category: forced-choice from 7 categories (happy, sad, angry, fear, disgust, surprised, neutral)
- Valence: 1 (very unpleasant) – 9 (very pleasant)
- Arousal: 1 (very calm) – 9 (very excited)
Data format: data/human_ratings/ratings.csv
- Columns: image_id, participant_id, emotion, valence, arousal
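
A long-format file with these columns can be collapsed to one row per image before comparing against VLM output. The sketch below uses only the standard library and an inline sample in place of data/human_ratings/ratings.csv; the sample rows, the function name `aggregate_ratings`, and the choice of modal emotion plus mean valence/arousal as the per-image summary are assumptions for illustration.

```python
import csv
import io
from collections import Counter, defaultdict
from statistics import mean

# Invented sample rows with the column layout described above.
SAMPLE_CSV = """image_id,participant_id,emotion,valence,arousal
img_a,p1,angry,2,7
img_a,p2,angry,3,8
img_a,p3,disgust,2,6
img_b,p1,happy,8,6
img_b,p2,happy,9,5
"""


def aggregate_ratings(handle):
    """Collapse long-format ratings to one summary dict per image."""
    per_image = defaultdict(lambda: {"emotion": [], "valence": [], "arousal": []})
    for row in csv.DictReader(handle):
        entry = per_image[row["image_id"]]
        entry["emotion"].append(row["emotion"])
        entry["valence"].append(float(row["valence"]))
        entry["arousal"].append(float(row["arousal"]))

    summary = {}
    for image_id, entry in per_image.items():
        summary[image_id] = {
            # Ties are resolved in favor of the label seen first.
            "modal_emotion": Counter(entry["emotion"]).most_common(1)[0][0],
            "mean_valence": mean(entry["valence"]),
            "mean_arousal": mean(entry["arousal"]),
            "n_raters": len(entry["emotion"]),
        }
    return summary


summary = aggregate_ratings(io.StringIO(SAMPLE_CSV))
print(summary["img_b"])
```

For the real file, the same function can be called on `open("data/human_ratings/ratings.csv", newline="")`.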

4. VLM Inference Pipeline

4.1 Models

Model	HuggingFace ID	Size	Architecture	Status
PaliGemma2	`google/paligemma2-3b-mix-224`	3B	SigLIP + Gemma2	✅ Done (1,440 images)
LLaVA 1.5	`llava-hf/llava-1.5-7b-hf`	7B	CLIP ViT-L + Vicuna	✅ Done
Qwen2-VL	`Qwen/Qwen2-VL-7B-Instruct`	7B	ViT + Qwen2	Available
Qwen2.5-VL 3B	`Qwen/Qwen2.5-VL-3B-Instruct`	3B	ViT + Qwen2.5	Available
Qwen2.5-VL 7B	`Qwen/Qwen2.5-VL-7B-Instruct`	7B	ViT + Qwen2.5	Available
InternVL3 1B	`OpenGVLab/InternVL3-1B-hf`	1B	InternViT + Qwen2	Available
InternVL3 8B	`OpenGVLab/InternVL3-8B-hf`	8B	InternViT + Qwen2	Available
PaliGemma2 10B	`google/paligemma2-10b-mix-448`	10B	SigLIP + Gemma2	Available

4.1b MLX Backend Models (direct inference on Apple Silicon)

Models run with local GPU inference through MLX (mlx-vlm 0.4.1). They execute in-process, without Ollama's HTTP overhead.

Verified models (mlx-vlm compatibility check)

Model	MLX HuggingFace ID	Size	Quant	MLX compatible	Notes
Qwen2.5-VL 3B	`mlx-community/Qwen2.5-VL-3B-Instruct-4bit`	3B	4bit	✅ Working	93.4 tok/s, stable
Qwen2.5-VL 7B	`mlx-community/Qwen2.5-VL-7B-Instruct-4bit`	7B	4bit	✅ (unverified)
Qwen2-VL 2B	`mlx-community/Qwen2-VL-2B-Instruct-4bit`	2B	4bit	✅ (unverified)
Qwen3-VL 4B	`mlx-community/Qwen3-VL-4B-Instruct-4bit`	4B	4bit	❌ Bug	image token mapping error (see below)
Qwen3-VL 4B (lmstudio)	`lmstudio-community/Qwen3-VL-4B-Instruct-MLX-4bit`	4B	4bit	❌ Bug	same error
Qwen3-VL 8B	`lmstudio-community/Qwen3-VL-8B-Instruct-MLX-4bit`	8B	4bit	❌ (presumed)	same architecture, same bug expected
Qwen3.5 4B	`mlx-community/Qwen3.5-4B-4bit` (text-only)	4B	4bit	❌	processor not supported (“Only returning PyTorch tensors”)
LLaMA 3.2 11B Vision	`mlx-community/Llama-3.2-11B-Vision-Instruct-4bit`	11B	4bit	✅ (unverified)	loadable on M1 Max 32GB
Gemma3 (Ollama only)	—	4B/12B	—	N/A	mlx-vlm supports `gemma3`; separate verification needed
InternVL Chat	—	—	—	✅ (unverified)	mlx-vlm supports `internvl_chat`
Phi-4 Vision	—	—	—	✅ (unverified)	mlx-vlm supports `phi4_siglip`
LFM2-VL 1.6B	`mlx-community/LFM2.5-VL-1.6B-4bit`	1.6B	4bit	✅ (unverified)	ultra-small VLM

Qwen3-VL compatibility issue (2026-03-23)

Loading a Qwen3-VL model with mlx-vlm 0.4.1 raises the following error:

ValueError: Image features and image tokens do not match: tokens: 0, features 1024

Cause: merge_input_ids_with_image_features() in mlx_vlm/models/qwen3_vl/qwen3_vl.py treats the image token count as 0
Impact: the same bug is expected for every Qwen3-VL size (2B, 4B, 8B, 30B, 32B)
Status: waiting for an upstream fix in mlx-vlm
Workaround: Qwen3.5 vision can be used through the Ollama backend (qwen3.5:4b) (verified, 72.7% accuracy)

Full list of architectures supported by mlx-vlm 0.4.1

aya_vision, deepseek_vl_v2, deepseekocr, ernie4_5_moe_vl, fastvlm,
florence2, gemma3, gemma3n, glm4v, hunyuan_vl, idefics2, idefics3,
internvl_chat, jina_vlm, kimi_vl, lfm2_vl, llama4, llava, llava_bunny,
llava_next, minicpmo, mistral3, mistral4, mllama, molmo, molmo2,
moondream3, paligemma, phi3_v, phi4_siglip, phi4mm, pixtral,
qwen2_5_vl, qwen2_vl, qwen3_5, qwen3_5_moe, qwen3_vl, qwen3_vl_moe,
smolvlm

4.1c Ollama Backend Models (local inference via HTTP API)

Inference on GGUF-quantized models through an Ollama server. Configuration: configs/ollama.yaml

Model	Ollama ID	Size	Strategy	Thinking	Status
Qwen3.5 4B	`qwen3.5:4b`	3.4GB	context_carry	✅	🔄 In progress (244/1,440)
Qwen3.5 4B (single)	`qwen3.5:4b`	3.4GB	json_single	✅	Pending
Gemma3 4B	`gemma3:4b`	2.5GB	json_single	—	Pending
Qwen3.5 8B	`qwen3.5:8b`	5.5GB	json_single	✅	Pending
Gemma3 12B	`gemma3:12b`	8GB	json_single	—	Pending
Qwen2.5-VL 7B	`qwen2.5vl:7b`	4.5GB	json_single	—	Pending
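
A request body for one image could be assembled along the following lines. This is a minimal sketch, assuming Ollama's `/api/generate` endpoint on its default local port; `build_ollama_request` and `send_request` are hypothetical helper names, and the project's own client code and the contents of configs/ollama.yaml are not shown in this document. Setting the temperature to 0 is an assumption meant to mirror the greedy decoding described in Section 4.3.

```python
import base64
import json
from urllib import request


def build_ollama_request(model, prompt, image_bytes, temperature=0.0):
    """Build a non-streaming /api/generate payload with one base64 image."""
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
        "options": {"temperature": temperature},
    }


def send_request(payload, url="http://localhost:11434/api/generate"):
    """POST the payload to a running Ollama server and return the reply text."""
    req = request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]


# Placeholder bytes stand in for a real image file here.
payload = build_ollama_request("qwen3.5:4b", "Describe the face.", b"\x89PNG")
print(sorted(payload))
```

`send_request(payload)` is only meaningful with a server running and the model pulled, so it is not called in the sketch.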

4.2 Inference Protocol

Each image is rated through 3-step sequential prompting with context carry. This reproduces the process in the human experiment, where participants judge emotion → valence → arousal in sequence and each earlier judgment influences the next (anchoring effect).

Step 1: Emotion Classification
  "What is the facial expression in this image?
   Choose one from: {emotions}. Answer with a single word only."
  → Parse immediately to obtain the emotion result (e.g., "angry")

Step 2: Valence Rating (emotion context injected)
  "You identified this face as {emotion}.
   How pleasant is this facial expression?
   Rate from 1 to 9 where 1 is very unpleasant and 9 is very pleasant.
   Answer with a single number only."
  → Parse immediately to obtain the valence result (e.g., 2)

Step 3: Arousal Rating (emotion + valence context injected)
  "You identified this face as {emotion} with pleasantness {valence} out of 9.
   How intense or activated is the emotion in this face?
   Rate from 1 to 9 where 1 is very calm and 9 is very excited.
   Answer with a single number only."
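
The three steps can be sketched as a small chain in which each parsed answer is injected into the next prompt. This is an illustration under stated assumptions, not the code in `BaseVLMBackend.generate_with_attention()`: `generate` stands for any callable that takes an image and a prompt and returns text, and `parse_emotion`, `parse_rating`, `rate_image`, and `scripted_generate` are hypothetical names. The label list passed in as `{emotions}` is also an assumption.

```python
import re

# Prompt templates copied from the three steps above.
EMOTION_PROMPT = (
    "What is the facial expression in this image? "
    "Choose one from: {emotions}. Answer with a single word only."
)
VALENCE_PROMPT = (
    "You identified this face as {emotion}. "
    "How pleasant is this facial expression? "
    "Rate from 1 to 9 where 1 is very unpleasant and 9 is very pleasant. "
    "Answer with a single number only."
)
AROUSAL_PROMPT = (
    "You identified this face as {emotion} with pleasantness {valence} out of 9. "
    "How intense or activated is the emotion in this face? "
    "Rate from 1 to 9 where 1 is very calm and 9 is very excited. "
    "Answer with a single number only."
)


def parse_emotion(text, emotions):
    """Return the first label (in list order) that appears as a whole word."""
    lowered = text.lower()
    for label in emotions:
        if re.search(rf"\b{re.escape(label)}\b", lowered):
            return label
    return None


def parse_rating(text):
    """Return the first standalone digit 1-9 in the reply, else None."""
    match = re.search(r"\b([1-9])\b", text)
    return int(match.group(1)) if match else None


def rate_image(generate, image, emotions):
    """Run emotion -> valence -> arousal, carrying each result forward."""
    result = {"emotion": None, "valence": None, "arousal": None}
    reply = generate(image, EMOTION_PROMPT.format(emotions=", ".join(emotions)))
    result["emotion"] = parse_emotion(reply, emotions)
    if result["emotion"] is None:
        return result  # nothing to carry into step 2
    reply = generate(image, VALENCE_PROMPT.format(emotion=result["emotion"]))
    result["valence"] = parse_rating(reply)
    if result["valence"] is None:
        return result  # nothing to carry into step 3
    reply = generate(
        image,
        AROUSAL_PROMPT.format(emotion=result["emotion"], valence=result["valence"]),
    )
    result["arousal"] = parse_rating(reply)
    return result


def scripted_generate(image, prompt):
    """Stand-in for a real VLM call; returns canned replies per step."""
    if "Choose one from" in prompt:
        return "Angry."
    if "How pleasant" in prompt:
        return "2"
    return "8"


print(rate_image(scripted_generate, None, ["happy", "sad", "angry"]))
```

Stopping early when a step cannot be parsed is one possible design choice; it avoids injecting a missing value such as "None" into the next prompt.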

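Rationale for the Context Carry Design

Aspect	Independent calls (previous approach)	Context Carry (current)
Correspondence to human experiment	Sequential judgment process not reflected	Earlier judgments influence later ones (anchoring)
Internal consistency	emotion=angry with valence=7 is possible	Emotion context steers the valence judgment
Performance cost	3 generate calls	Same 3 generate calls (negligible increase in prompt length)
Implementation location	`BaseVLMBackend.generate_with_attention()`	Same method; only the parsing order changes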

4.3 Decoding Strategy

Greedy decoding (do_sample=False, temperature=None)
Ensures deterministic output for reproducibility
Determinism is verified with a test-retest reliability pilot (50 images × 3 repeats)

4.4 Technical Specifications

Hardware: Apple M1 Max 32GB
Precision: FP16 (float16) on MPS
Attention: attn_implementation="eager" (MPS compatibility)
Memory management: models run sequentially, with explicit memory release between models
Environment: PYTORCH_ENABLE_MPS_FALLBACK=1

4.5 Data Extraction

Three kinds of data are extracted from each inference:

Predictions: emotion category, valence (1-9), arousal (1-9)
Cross-modal attention: the [generated_tokens, image_tokens] slice of self-attention
Dark knowledge: top-50 softmax distribution (logits) at each generation step
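
The "dark knowledge" item amounts to keeping the k most probable tokens at each generation step. A minimal standard-library sketch of that reduction is shown below; `top_k_softmax` is a hypothetical helper, and the real pipeline presumably operates on framework tensors rather than Python lists.

```python
import math


def top_k_softmax(logits, k=50):
    """Return (token_id, probability) pairs for the k most probable tokens."""
    peak = max(logits)
    # Subtracting the maximum keeps exp() numerically stable.
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]


# Toy vocabulary of three tokens; real vocabularies are far larger.
print(top_k_softmax([0.0, 1.0, 2.0], k=2))
```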

5. Analysis Pipeline

5.0 Test-Retest Reliability (Pilot Validation)

Before the main experiment, a pilot stage verifies the reliability of VLM outputs:

# Greedy determinism verification
uv run python main.py test-retest -n 50 -r 3 -m paligemma2
 
# Stochastic test-retest (optional, for reporting)
uv run python main.py test-retest -n 50 -r 5 -m paligemma2 -t 0.3

Greedy mode: verifies the rate at which identical inputs yield identical outputs (target: 99%+)
Metrics: Fleiss’ kappa (emotion), ICC(2,k) (valence, arousal), within-image SD
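
Three of these quantities are simple enough to sketch with the standard library: the exact-repeat rate used for the greedy check, the within-image SD, and Fleiss' kappa for repeated emotion labels. The function names and the dict-of-lists input layout are assumptions; this is not the code in scripts/test_retest_reliability.py.

```python
from collections import Counter
from statistics import mean, stdev


def exact_repeat_rate(runs):
    """Fraction of images whose repeated outputs are all identical."""
    return mean(1.0 if len(set(outputs)) == 1 else 0.0 for outputs in runs.values())


def within_image_sd(runs):
    """Mean sample SD of repeated numeric ratings, computed per image."""
    return mean(stdev(values) for values in runs.values())


def fleiss_kappa(runs):
    """Fleiss' kappa over repeats; every image needs the same repeat count."""
    tables = [Counter(labels) for labels in runs.values()]
    n = sum(tables[0].values())  # ratings per image
    if n < 2 or any(sum(t.values()) != n for t in tables):
        raise ValueError("need the same number (>= 2) of ratings per image")
    n_items = len(tables)
    # Mean per-image agreement.
    p_bar = mean(
        (sum(c * c for c in t.values()) - n) / (n * (n - 1)) for t in tables
    )
    # Chance agreement from the pooled label proportions.
    totals = Counter()
    for t in tables:
        totals.update(t)
    p_e = sum((c / (n_items * n)) ** 2 for c in totals.values())
    if p_e == 1.0:
        return float("nan")  # only one label ever used; kappa is undefined
    return (p_bar - p_e) / (1 - p_e)


emotion_runs = {"img_a": ["angry", "angry", "sad"], "img_b": ["sad", "sad", "sad"]}
valence_runs = {"img_a": [5, 5, 5], "img_b": [4, 5, 6]}
print(exact_repeat_rate(emotion_runs), within_image_sd(valence_runs))
print(fleiss_kappa(emotion_runs))
```

Note that a fully deterministic model that always outputs one label makes kappa undefined, which is why the exact-repeat rate is the more direct check in greedy mode.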

5.1 Human Baseline (Inter-Rater Reliability)

Agreement among human raters is computed first to set the ceiling for agreement:

Valence/Arousal: ICC(2,k) via pingouin
Emotion: majority vote agreement rate
This baseline serves as the reference point for VLM-Human agreement
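
For reference, ICC(2,k) (two-way random effects, average of k raters) can be computed from the mean squares of a two-way ANOVA. The sketch below is a standard-library illustration of that formula, not the pingouin call the pipeline uses; `icc2k` is a hypothetical name, and it assumes a complete images × raters matrix, which may not hold if each participant rates only a subset of the images. The example matrix is the six-target, four-judge worked example from Shrout & Fleiss (1979).

```python
from statistics import mean


def icc2k(matrix):
    """ICC(2,k) for a complete targets x raters matrix of ratings."""
    n = len(matrix)      # targets (images)
    k = len(matrix[0])   # raters
    if n < 2 or k < 2 or any(len(row) != k for row in matrix):
        raise ValueError("need a complete matrix with >= 2 rows and columns")

    grand = mean(x for row in matrix for x in row)
    row_means = [mean(row) for row in matrix]
    col_means = [mean(row[j] for row in matrix) for j in range(k)]

    # Two-way ANOVA sums of squares.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in matrix for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (ms_rows + (ms_cols - ms_error) / n)


ratings = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
print(round(icc2k(ratings), 2))  # 0.62
```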

5.2 VLM-Human Agreement

Agreement with human ratings is computed for each model:

Measure	Metric	Tool
Emotion category	Cohen’s kappa (κ)	sklearn
Valence	ICC(2,k)	pingouin
Arousal	ICC(2,k)	pingouin
Valence bias	Bland-Altman (mean diff, LoA)	custom
Arousal bias	Bland-Altman (mean diff, LoA)	custom
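
Two rows of this table can be written out directly: Cohen's kappa for the emotion category and the Bland-Altman summary for valence or arousal. The sketch below uses only the standard library rather than sklearn; the function names are hypothetical, and it assumes one human value per image (for example the modal emotion and the mean rating), which this document does not specify. Differences are taken as VLM minus human.

```python
from collections import Counter
from statistics import mean, stdev


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two aligned label sequences."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("label sequences must be non-empty and equally long")
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from the two marginal label distributions.
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return float("nan")  # both sequences use a single identical label
    return (p_o - p_e) / (1 - p_e)


def bland_altman(human, vlm):
    """Mean difference (VLM - human) and 95% limits of agreement."""
    diffs = [v - h for h, v in zip(human, vlm)]
    bias = mean(diffs)
    sd = stdev(diffs)
    return {
        "mean_diff": bias,
        "loa_lower": bias - 1.96 * sd,
        "loa_upper": bias + 1.96 * sd,
    }


human_emotion = ["angry", "angry", "sad", "sad"]
vlm_emotion = ["angry", "sad", "sad", "sad"]
print(cohens_kappa(human_emotion, vlm_emotion))
print(bland_altman([2, 4, 6, 8], [3, 5, 6, 9]))
```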

5.3 Bias & Misalignment Analysis

5.3.1 Emotion-Level Bias

Systematic differences between VLM and human ratings are tested for each emotion category:

Shapiro-Wilk normality test → paired t-test or Wilcoxon signed-rank test
Effect size: Cohen’s d
Confusion matrix: which emotions are confused with one another
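
The effect size and the confusion counts can be sketched as follows. The document says only "Cohen's d"; the paired variant used here (mean difference divided by the SD of the differences, often written d_z) is an assumption, as are the function names and the toy data.

```python
from collections import Counter
from statistics import mean, stdev


def paired_cohens_d(human, vlm):
    """Paired effect size: mean(VLM - human) / SD of the differences."""
    diffs = [v - h for h, v in zip(human, vlm)]
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("all differences are identical; d is undefined")
    return mean(diffs) / sd


def confusion_counts(reference, predicted):
    """Count (reference label, predicted label) pairs."""
    return Counter(zip(reference, predicted))


# Toy per-image values for one emotion category.
print(paired_cohens_d([2, 4, 6, 8], [3, 5, 6, 9]))

counts = confusion_counts(
    ["angry", "angry", "sad"],
    ["angry", "disgust", "sad"],
)
print(counts[("angry", "disgust")])
```

Off-diagonal keys with high counts, such as `("angry", "disgust")` in the toy data, point to the emotion pairs a model tends to confuse.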

5.3.2 Demographic Bias (Gender × Race)

For the 6 demographic groups (3 races × 2 genders):

Emotion accuracy by group: whether VLM emotion classification accuracy differs by race or gender
Valence/Arousal bias by group: whether the VLM rates a particular race or gender systematically higher or lower
Interaction effects: mixed-effects model Rating ~ RaterType * Emotion * Race * Gender + (1|Image)
Multiple-comparison correction: Bonferroni or FDR
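
Both correction options reduce to adjusting a list of p-values. The sketch below assumes that "FDR" refers to the Benjamini-Hochberg procedure, which this document does not state explicitly; the function names and example p-values are illustrative.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


p_values = [0.01, 0.04, 0.03, 0.005]
print(bonferroni(p_values))
print(benjamini_hochberg(p_values))
```

Bonferroni controls the family-wise error rate and is the stricter of the two; Benjamini-Hochberg controls the expected proportion of false discoveries and keeps more power when many group-by-emotion contrasts are tested.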

5.3.3 Planned Comparisons

Comparison	Question
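Race main effect	Does the VLM rate faces of a particular race more negatively or positively?
Gender main effect	Does the VLM rate emotions on male and female faces differently?
Race × Emotion	Is there bias for specific emotions in specific races (e.g., Black × Angry)?
Gender × Emotion	Do per-emotion misalignment patterns differ between men and women?
Model size effect	Does the degree of bias vary with model size?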

5.4 Attention Analysis

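Where in the image the model attended when the VLM disagrees with humans:

Cross-modal attention heatmap: extract the image-token region from LLM self-attention
Per-task comparison: differences in attention distribution across the Emotion / Valence / Arousal steps
Agreement vs. disagreement: compare attention patterns for images where the model agreed with humans vs. images where it disagreed
Vision tower attention: ViT self-attention (for models where it is available)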
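
The heatmap step can be illustrated with plain lists: slice the image-token columns out of the attention rows for the generated tokens, average over those rows, and reshape to the patch grid. This is a sketch under several assumptions that the document does not confirm: attention has already been averaged over layers and heads, the image tokens are contiguous, and the patch grid is square (models with dynamic image resolution may not satisfy this). `image_attention_grid` is a hypothetical name, not the API of src/attention/.

```python
import math


def image_attention_grid(attn, image_start, n_image_tokens):
    """Average attention to each image token and reshape to a square grid.

    attn: one row per generated token, each a list over all key positions.
    """
    side = math.isqrt(n_image_tokens)
    if side * side != n_image_tokens:
        raise ValueError("expected a square patch grid")
    cols = range(image_start, image_start + n_image_tokens)
    # Mean attention per image token across the generated tokens.
    per_patch = [sum(row[c] for row in attn) / len(attn) for c in cols]
    # Renormalize so the grid sums to 1 over the image region.
    total = sum(per_patch) or 1.0
    normed = [v / total for v in per_patch]
    return [normed[r * side:(r + 1) * side] for r in range(side)]


# Two generated tokens; key positions 1-4 are a toy 2 x 2 image grid.
toy_attn = [
    [0.1, 0.2, 0.2, 0.2, 0.2, 0.1],
    [0.3, 0.1, 0.1, 0.1, 0.3, 0.1],
]
grid = image_attention_grid(toy_attn, image_start=1, n_image_tokens=4)
print(grid)
```

Computing one such grid per prompting step (emotion, valence, arousal) gives the inputs for the per-task and agreement vs. disagreement comparisons listed above.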

6. Execution Commands

6.1 Full Pipeline

# 1. Install dependencies
uv sync
 
# 2. Test-retest reliability pilot (do this first)
uv run python main.py test-retest -n 50 -r 3 -m paligemma2
 
# 3. Run inference (all enabled models)
uv run python main.py infer
 
# 4. Run inference (specific model)
uv run python main.py infer -m paligemma2
 
# 5. Statistical analysis
uv run python main.py analyze
 
# 6. Generate visualizations
uv run python main.py visualize
 
# 7. Or run inference, analysis, and visualization in one command
uv run python main.py pipeline

6.2 Quick Sample Check

# 5-image diverse sample with attention heatmaps
uv run python main.py sample --n 5 --model paligemma2 --seed 42

6.3 Run Tests

uv run pytest tests/ -v

7. Output Structure

outputs/
├── {model}_predictions.json          # Per-image predictions
├── {model}_attention_{image_id}.npz  # Cross-modal attention maps
├── {model}_logits_{image_id}.npz     # Dark knowledge (top-50 softmax)
├── test_retest_{model}.json          # Test-retest reliability report
├── inference_results.xlsx            # All results in Excel
└── figures/
    ├── {model}_scatter_valence.png   # Human vs VLM scatter
    ├── {model}_scatter_arousal.png
    ├── {model}_ba_valence.png        # Bland-Altman plots
    ├── {model}_ba_arousal.png
    ├── {model}_boxplot_valence.png   # Per-emotion box plots
    ├── {model}_boxplot_arousal.png
    ├── {model}_confusion.png         # Emotion confusion matrix
    └── {model}_attention_{id}.png    # Per-task attention heatmaps

8. Expected Analyses for Publication

Table 1: Human Inter-Rater Reliability (Baseline)

Valence ICC(2,k), Arousal ICC(2,k), Emotion agreement rate

Table 2: VLM-Human Agreement by Model

Emotion κ, Valence ICC, Arousal ICC per model
Comparison against human baseline ceiling

Table 3: Demographic Bias

Valence/Arousal mean difference by Race × Gender × Model
Significance tests with effect sizes

Table 4: Emotion-Level Confusion

Per-emotion accuracy, most common confusions per model

Figure Set

Confusion matrices (per model)
Bland-Altman plots (valence, arousal × model)
Attention heatmap examples (agreement vs. disagreement cases)
Demographic bias bar charts

9. Repository Structure

AI-Face-Rating-Analysis/
├── CLAUDE.md                         # Development instructions
├── main.py                           # CLI entry point (Typer)
├── configs/
│   └── default.yaml                  # Experiment configuration
├── src/
│   ├── config.py                     # Pydantic v2 config models
│   ├── models/
│   │   ├── base.py                   # BaseVLMBackend ABC
│   │   ├── registry.py               # @register_model decorator
│   │   ├── paligemma.py              # PaliGemma2 backend
│   │   ├── llava.py                  # LLaVA 1.5 backend
│   │   ├── qwen2vl.py               # Qwen2-VL backend
│   │   ├── qwen25vl.py              # Qwen2.5-VL backend
│   │   └── internvl3.py             # InternVL3 backend
│   ├── inference/
│   │   ├── runner.py                 # Batch inference orchestration
│   │   └── prompt.py                 # Response parsing (JSON/regex)
│   ├── analysis/
│   │   ├── statistics.py             # ICC, κ, Bland-Altman
│   │   └── bias.py                   # Per-emotion bias detection
│   ├── data/
│   │   ├── image_loader.py           # ImageDataset + metadata
│   │   └── human_ratings.py          # Human rating store
│   ├── attention/
│   │   ├── extractor.py              # Attention extraction
│   │   ├── mapper.py                 # Grid → image mapping
│   │   └── comparator.py            # Attention agreement
│   └── visualization/
│       ├── rating_plots.py           # Scatter, Bland-Altman, confusion
│       └── heatmaps.py              # Attention heatmap overlays
├── scripts/
│   └── test_retest_reliability.py    # Test-retest pilot script
├── tests/
│   ├── test_prompt_parsing.py        # Response parser tests
│   └── test_registry.py             # Model registry tests
├── data/
│   └── human_ratings/ratings.csv     # Human rating data
└── docs/
    └── experiment_pipeline.md        # This document
