Manuscript v6 Change Guide

This document lists the sections that must change in the v5 → v6 update, together with the exact figures.
The full v6 manuscript is produced by applying this guide to v5.


Abstract Changes

v5: “two instruction-tuned VLMs — Gemma3-4B-IT (Google) and LLaMA-3.2-11B-Vision (Meta)”
v6: “five VLMs spanning local open-source (Gemma3-4B-IT, LLaMA-3.2-11B-Vision, Qwen3-VL-4B-Thinking) and frontier API models (GPT-4o-mini, Gemini 2.5 Flash)”

v5: “moderate-to-substantial categorical agreement (κ = 0.535–0.671)”
v6: “moderate-to-almost-perfect categorical agreement (κ = 0.458–0.855)”

Addition: “Models with chain-of-thought reasoning (thinking) consistently outperform non-thinking counterparts by 7–8 percentage points in accuracy, with the largest gains on sadness recognition (55–58% vs. 9–25%). A 4B local thinking model (Qwen3-VL, κ = 0.764) achieves performance parity with a frontier non-thinking model (GPT-4o-mini, κ = 0.766), suggesting that explicit reasoning partially compensates for model scale.”


Section 3.3 VLM Inference Changes

v5: two models (Gemma3-4B, LLaMA-3.2-11B)
v6: expanded to five models

Model descriptions to add:

Qwen3-VL-4B-Thinking (Alibaba, 4B parameters, 4-bit quantized): MLX framework, thinking mode enabled (thinking_budget = 1024 tokens per step). Performs chain-of-thought reasoning inside <think>...</think> tags, then emits the JSON response.

GPT-4o-mini (OpenAI, frontier, full-precision): API-based, temperature = 0, seed = 42, image_detail = "high". Thinking not supported. Uses the same JSON prompt as the context-carry 3-step setup rather than structured output.

Gemini 2.5 Flash (Google, frontier, full-precision): API-based, temperature = 0, thinking mode enabled (thinking_budget = dynamic). Thinking traces collected via includeThoughts: true. media_resolution = MEDIUM.
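For the thinking models, each raw response must be split into the chain-of-thought trace and the JSON answer. A minimal sketch, assuming the <think>...</think>-then-JSON output format described above; the function name and fallback behavior are illustrative assumptions, not the pipeline's actual code:

```python
import json
import re

def split_thinking_response(raw: str):
    """Separate a model response into its thinking trace and JSON payload.

    Assumes the format described above: an optional <think>...</think>
    block followed by a single JSON object.
    """
    match = re.search(r"<think>(.*?)</think>", raw, flags=re.DOTALL)
    thinking = match.group(1).strip() if match else ""
    # Everything after the closing tag should contain the JSON answer.
    tail = raw[match.end():] if match else raw
    json_match = re.search(r"\{.*\}", tail, flags=re.DOTALL)
    payload = json.loads(json_match.group(0)) if json_match else None
    return thinking, payload

raw = '<think>Downturned mouth, lowered gaze.</think>\n{"emotion": "sad", "valence": -4}'
trace, answer = split_thinking_response(raw)
```

The same helper degrades gracefully for non-thinking models, returning an empty trace and the parsed JSON.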

Table: 5-model summary

Model            | Provider | Parameters | Quantization   | Thinking        | Backend
Gemma3-4B-IT     | Google   | 4B         | QAT 4-bit      | —               | MLX (local)
LLaMA-3.2-11B    | Meta     | 11B        | 4-bit          | —               | MLX (local)
Qwen3-VL-4B      | Alibaba  | 4B         | 4-bit          | ✅ (budget=1024) | MLX (local)
GPT-4o-mini      | OpenAI   | frontier   | full-precision | —               | API
Gemini 2.5 Flash | Google   | frontier   | full-precision | ✅ (dynamic)     | API

Additional text: “The inclusion of two frontier API models (GPT-4o-mini, Gemini 2.5 Flash) operating at full precision serves dual purposes: establishing a performance ceiling unconstrained by quantization artifacts, and enabling partial disentanglement of quantization effects from model architecture limitations. Recent work demonstrates that calibration-based 4-bit quantization retains 92–95% of FP16 quality on standard benchmarks (Lang et al., 2024), with vision tokens being less sensitive to quantization than language tokens due to higher redundancy (Li et al., CVPR 2025).”
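The GPT-4o-mini settings listed above (temperature = 0, seed = 42, image_detail = "high", plain JSON prompt without structured output) map onto an OpenAI Chat Completions request roughly as sketched below; the prompt text and base64 image are placeholders:

```python
def build_gpt4o_mini_request(prompt: str, image_b64: str) -> dict:
    """Build Chat Completions kwargs with the settings described above.

    The prompt and base64 image are caller-supplied placeholders;
    deterministic decoding is requested via temperature=0 and seed=42.
    No structured-output schema is attached.
    """
    return {
        "model": "gpt-4o-mini",
        "temperature": 0,
        "seed": 42,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/jpeg;base64,{image_b64}",
                    "detail": "high",  # image_detail = "high" from the setup above
                }},
            ],
        }],
    }

req = build_gpt4o_mini_request("Classify the facial emotion. Answer in JSON.", "<base64>")
```

In the actual pipeline the dict would be passed to `openai.OpenAI().chat.completions.create(**req)`.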


Section 4.1 Emotion Classification Changes

Table 1 → 10-model ranking (5 VLM + 5 FER)

Model            | Type | Thinking | Accuracy | Macro F1 | Cohen’s κ
PosterV2         | FER  | —        | 0.899    | 0.900    | 0.878
Gemini 2.5 Flash | VLM  | ✅        | 0.881    | —        | 0.855
MobileViT        | FER  | —        | 0.875    | 0.874    | 0.848
EfficientNet     | FER  | —        | 0.854    | 0.856    | 0.823
GPT-4o-mini      | VLM  | —        | 0.812    | —        | 0.766
Qwen3-VL-4B      | VLM  | ✅        | 0.806    | —        | 0.764
BEiT             | FER  | —        | 0.766    | 0.772    | 0.713
Gemma3-4B        | VLM  | —        | 0.726    | 0.683    | 0.646
EmoNet           | FER  | —        | 0.731    | 0.724    | 0.665
LLaMA-3.2-11B    | VLM  | —        | 0.613    | 0.402    | 0.458
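Cohen’s κ in Table 1 measures categorical agreement corrected for chance agreement under the marginal label distributions. A minimal, dependency-free sketch of the unweighted computation (equivalent to scikit-learn’s cohen_kappa_score):

```python
from collections import Counter

def cohens_kappa(y_true, y_pred):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two label sequences."""
    n = len(y_true)
    # Observed agreement: fraction of exact matches.
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement from the marginal label distributions.
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (p_o - p_e) / (1 - p_e)

k = cohens_kappa(["a", "a", "b", "b"], ["a", "a", "b", "a"])  # → 0.5
```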

Table 2 → 10-model emotion accuracy

Emotion | Gemini | Qwen3-VL | GPT   | Gemma3 | LLaMA | PosterV2 | MobileViT | EfficientNet | BEiT  | EmoNet
Happy   | 1.000  | 1.000    | 1.000 | 1.000  | 1.000 | 1.000    | 1.000     | 1.000        | 0.979 | 1.000
Neutral | 0.992  | 0.962    | 1.000 | 1.000  | 1.000 | 0.912    | 0.863     | 0.729        | 0.529 | 0.533
Fear    | 0.971  | 0.896    | 0.929 | 0.979  | 0.654 | 0.933    | 0.942     | 0.846        | 0.792 | 0.912
Angry   | 0.929  | 0.875    | 0.942 | 0.404  | 0.921 | 0.917    | 0.954     | 0.887        | 0.800 | 0.637
Disgust | 0.808  | 0.554    | 0.750 | 0.842  | 0.008 | 0.642    | 0.533     | 0.679        | 0.754 | 0.846
Sad     | 0.583  | 0.546    | 0.254 | 0.126  | 0.092 | 0.992    | 0.958     | 0.983        | 0.742 | 0.454

NEW Section 4.X: Thinking Effect on Emotion Classification

Key result: thinking models consistently outperform their non-thinking counterparts.

Frontier pair: Gemini (thinking, 88.1%) vs GPT (no-thinking, 81.2%) → +6.9%p
Local 4B pair: Qwen3-VL (thinking, 80.6%) vs Gemma3 (no-thinking, 72.5%) → +8.1%p

Dramatic effect on sad:

  • No-thinking: LLaMA 9.2%, Gemma3 12.1%, GPT 25.4%
  • Thinking: Qwen3-VL 54.6%, Gemini 58.3%
  • Thinking is decisive for sad–neutral discrimination (a 2–6× improvement)

4B thinking ≈ Frontier no-thinking: Qwen3-VL (κ=0.764) ≈ GPT-4o-mini (κ=0.766)
→ “Explicit reasoning partially compensates for model scale”


Section 4.2 Valence Changes

Table 3 → 8-model valence (3 VLM + 5 FER → 5 VLM + 3 FER)

Model            | Type | Thinking | Pearson r | MAE   | Bias
Gemini 2.5 Flash | VLM  | ✅        | .963      | 1.842 | −1.280
MobileViT        | FER  | —        | .950      | 0.916 | —
EfficientNet     | FER  | —        | .940      | 1.063 | —
GPT-4o-mini      | VLM  | —        | .938      | 1.626 | −1.018
EmoNet           | FER  | —        | .928      | 0.795 | —
Qwen3-VL-4B      | VLM  | ✅        | .913      | 1.445 | −0.824
LLaMA-3.2-11B    | VLM  | —        | .901      | 1.808 | —
Gemma3-4B        | VLM  | —        | .891      | 1.456 | —
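The valence columns can be reproduced from paired predictions with a small helper. The signed-bias convention, mean(prediction − ground truth), is an assumption consistent with the negative values in the table:

```python
import math

def valence_metrics(y_true, y_pred):
    """Return (pearson_r, mae, bias) for paired valence ratings.

    Bias is taken as mean(prediction - ground truth); the sign
    convention is assumed, matching the negative values in Table 3.
    """
    n = len(y_true)
    mt = sum(y_true) / n
    mp = sum(y_pred) / n
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mt) ** 2 for t in y_true)
    var_p = sum((p - mp) ** 2 for p in y_pred)
    r = cov / math.sqrt(var_t * var_p)           # Pearson correlation
    mae = sum(abs(p - t) for t, p in zip(y_true, y_pred)) / n
    bias = sum(p - t for t, p in zip(y_true, y_pred)) / n
    return r, mae, bias

r, mae, bias = valence_metrics([-3, 0, 2, 4], [-4, -1, 1, 3])  # r=1.0, mae=1.0, bias=-1.0
```

A uniformly negative bias, as seen for all three VLMs in the table, indicates systematic underestimation of valence.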

Section 4.3 Arousal Changes

Table 5 → 5 VLM + 1 FER

Model            | Type | Thinking | Pearson r | MAE
Gemini 2.5 Flash | VLM  | ✅        | .767      | 1.951
Qwen3-VL-4B      | VLM  | ✅        | .758      | 2.013
GPT-4o-mini      | VLM  | —        | .622      | 1.572
LLaMA-3.2-11B    | VLM  | —        | .783      | 1.777
Gemma3-4B        | VLM  | —        | .759      | 1.137
EfficientNet     | FER  | —        | .448      | 1.696

Thinking effect on arousal: r = 0.758–0.767 (thinking) vs 0.622 (GPT, no-thinking)
→ “Thinking enhances arousal estimation by providing intermediate reasoning about emotional intensity”


NEW Section 5.X: Thinking Tokens as Cognitive Load Proxy

The number of thinking tokens serves as a proxy for the model’s cognitive load.

Thinking tokens by emotion:

  • Highest on sad (Gemini 1,290 tokens, Qwen3-VL 3,915 tokens)
  • Lowest on happy (Gemini 949, Qwen3-VL 1,608)
  • Sad/happy ratio: Gemini 1.36×, Qwen3-VL 2.43×

Thinking increases on incorrect answers:

  • Gemini: correct 993 → incorrect 1,248 (+26%)
  • Qwen3-VL: correct 2,339 → incorrect 3,959 (+69%)
  • Interpretation: higher uncertainty → longer thinking, but not higher accuracy

Per-step pattern: thinking is longest at the arousal step → arousal is the hardest dimension for the models as well, consistent with human agreement (α = 0.125).
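The per-emotion and per-correctness aggregations above reduce to grouping token counts and taking means. A sketch over hypothetical trial records; the field names 'emotion', 'correct', and 'thinking_tokens' are assumptions, not the pipeline's actual schema:

```python
from statistics import mean

def mean_tokens_by(trials, key):
    """Mean thinking-token count per group of trial records.

    `trials` are dicts with assumed fields: 'emotion' (str),
    'correct' (bool), and 'thinking_tokens' (int).
    """
    groups = {}
    for t in trials:
        groups.setdefault(t[key], []).append(t["thinking_tokens"])
    return {k: mean(v) for k, v in groups.items()}

trials = [
    {"emotion": "sad",   "correct": False, "thinking_tokens": 3900},
    {"emotion": "sad",   "correct": True,  "thinking_tokens": 2400},
    {"emotion": "happy", "correct": True,  "thinking_tokens": 1600},
]
by_emotion = mean_tokens_by(trials, "emotion")
by_correct = mean_tokens_by(trials, "correct")
# The sad/happy ratio and the correct/incorrect gap follow directly from these means.
```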


Section 5.6 Limitations Changes

Resolved limitation: “extending to frontier APIs” → addressed by adding GPT-4o-mini and Gemini 2.5 Flash

Quantization-related updates:

  • The sad failure is also observed in frontier models (GPT 25.4%) → a structural limitation, not a quantization artifact
  • Polarity exaggeration is identical in frontier models → unrelated to quantization
  • Fixed-value output is more pronounced in local models → quantization is a partial contributor (MBQ, CVPR 2025)

New limitations:

  • The thinking-effect comparison confounds model architecture and training data (Gemini vs. GPT and Qwen vs. Gemma differ in more than thinking)
  • Whether thinking_budget = 1024 is optimal is unclear (Qwen3-VL is force-terminated when the budget is reached)

Section 6. Conclusion Changes

v5: “two VLMs”
v6: “five VLMs spanning three parameter scales (4B local, 11B local, frontier API) and two reasoning modes (standard and chain-of-thought thinking)”

Additional conclusions:

  1. Chain-of-thought thinking improves emotion classification by 7–8%p, with the largest gains on perceptually ambiguous emotions (sad: 55–58% vs. 9–25%)
  2. A 4B local thinking model achieves performance parity with a frontier non-thinking model (κ = 0.764 vs. 0.766)
  3. Polarity exaggeration and sadness–neutral confusion persist even in frontier full-precision models, confirming these as architectural rather than quantization-induced limitations
  4. Thinking tokens serve as a cognitive load proxy: models generate 26–69% more reasoning tokens on incorrect trials

Additional References

  • Lang, J. et al. (2024). A Comprehensive Study on Quantization Techniques for LLMs. arXiv:2411.02530.
  • Wang, C. et al. (2024). Q-VLM: Post-training Quantization for Large VLMs. arXiv:2410.08119.
  • Low-Bit Quantization Favors Undertrained LLMs. ACL 2025.
  • Quantization Methods, Task Difficulty, and Model Size. IJCAI 2025.