Manuscript v6 Change Guide

This document lists the sections that must change in the v5 → v6 update, together with the exact figures.
The full v6 manuscript is produced by applying this guide to v5.


Abstract Changes

v5: “two instruction-tuned VLMs — Gemma3-4B-IT (Google) and LLaMA-3.2-11B-Vision (Meta)”
v6: “five VLMs spanning local open-source (Gemma3-4B-IT, LLaMA-3.2-11B-Vision, Qwen3-VL-4B-Thinking) and frontier API models (GPT-4o-mini, Gemini 2.5 Flash)”

v5: “moderate-to-substantial categorical agreement (κ = 0.535–0.671)”
v6: “moderate-to-almost-perfect categorical agreement (κ = 0.458–0.855)”

Addition: “Models with chain-of-thought reasoning (thinking) consistently outperform non-thinking counterparts by 7–8 percentage points in accuracy, with the largest gains on sadness recognition (55–58% vs. 9–25%). A 4B local thinking model (Qwen3-VL, κ = 0.764) achieves performance parity with a frontier non-thinking model (GPT-4o-mini, κ = 0.766), suggesting that explicit reasoning partially compensates for model scale.”


Section 3.3 VLM Inference Changes

v5: two models (Gemma3-4B, LLaMA-3.2-11B)
v6: expanded to five models

Model descriptions to add:

Qwen3-VL-4B-Thinking (Alibaba, 4B parameters, 4-bit quantized): MLX framework, thinking mode enabled (thinking_budget = 1024 tokens per step). Performs chain-of-thought reasoning inside <think>...</think> tags, then emits the JSON response.

GPT-4o-mini (OpenAI, frontier, full-precision): API-based, temperature = 0, seed = 42, image_detail = "high". Thinking not supported. Uses the same JSON prompt as the context-carry 3-step setup rather than structured output.

Gemini 2.5 Flash (Google, frontier, full-precision): API-based, temperature = 0, thinking mode enabled (thinking_budget = dynamic). Thinking traces collected via includeThoughts: true. media_resolution = MEDIUM.
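For the thinking models, each raw response must be split into the chain-of-thought trace and the JSON answer. A minimal sketch, assuming the <think>...</think>-then-JSON output format described above; the function name and fallback behavior are illustrative assumptions, not the pipeline's actual code:

```python
import json
import re

def split_thinking_response(raw: str):
    """Separate a model response into its thinking trace and JSON payload.

    Assumes the format described above: an optional <think>...</think>
    block followed by a single JSON object.
    """
    match = re.search(r"<think>(.*?)</think>", raw, flags=re.DOTALL)
    thinking = match.group(1).strip() if match else ""
    # Everything after the closing tag should contain the JSON answer.
    tail = raw[match.end():] if match else raw
    json_match = re.search(r"\{.*\}", tail, flags=re.DOTALL)
    payload = json.loads(json_match.group(0)) if json_match else None
    return thinking, payload

raw = '<think>Downturned mouth, lowered gaze.</think>\n{"emotion": "sad", "valence": -4}'
trace, answer = split_thinking_response(raw)
```

The same helper degrades gracefully for non-thinking models, returning an empty trace and the parsed JSON.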

Table: 5-model summary

Model            | Provider | Parameters | Quantization   | Thinking        | Backend
Gemma3-4B-IT     | Google   | 4B         | QAT 4-bit      | —               | MLX (local)
LLaMA-3.2-11B    | Meta     | 11B        | 4-bit          | —               | MLX (local)
Qwen3-VL-4B      | Alibaba  | 4B         | 4-bit          | ✅ (budget=1024) | MLX (local)
GPT-4o-mini      | OpenAI   | frontier   | full-precision | —               | API
Gemini 2.5 Flash | Google   | frontier   | full-precision | ✅ (dynamic)     | API

Additional text: “The inclusion of two frontier API models (GPT-4o-mini, Gemini 2.5 Flash) operating at full precision serves dual purposes: establishing a performance ceiling unconstrained by quantization artifacts, and enabling partial disentanglement of quantization effects from model architecture limitations. Recent work demonstrates that calibration-based 4-bit quantization retains 92–95% of FP16 quality on standard benchmarks (Lang et al., 2024), with vision tokens being less sensitive to quantization than language tokens due to higher redundancy (Li et al., CVPR 2025).”
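The GPT-4o-mini settings listed above (temperature = 0, seed = 42, image_detail = "high", plain JSON prompt without structured output) map onto an OpenAI Chat Completions request roughly as sketched below; the prompt text and base64 image are placeholders:

```python
def build_gpt4o_mini_request(prompt: str, image_b64: str) -> dict:
    """Build Chat Completions kwargs with the settings described above.

    The prompt and base64 image are caller-supplied placeholders;
    deterministic decoding is requested via temperature=0 and seed=42.
    No structured-output schema is attached.
    """
    return {
        "model": "gpt-4o-mini",
        "temperature": 0,
        "seed": 42,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/jpeg;base64,{image_b64}",
                    "detail": "high",  # image_detail = "high" from the setup above
                }},
            ],
        }],
    }

req = build_gpt4o_mini_request("Classify the facial emotion. Answer in JSON.", "<base64>")
```

In the actual pipeline the dict would be passed to `openai.OpenAI().chat.completions.create(**req)`.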


Section 4.1 Emotion Classification Changes

Table 1 → 10-model ranking (5 VLM + 5 FER)

Model            | Type | Thinking | Accuracy | Macro F1 | Cohen’s κ
PosterV2         | FER  | —        | 0.899    | 0.900    | 0.878
Gemini 2.5 Flash | VLM  | ✅        | 0.881    | —        | 0.855
MobileViT        | FER  | —        | 0.875    | 0.874    | 0.848
EfficientNet     | FER  | —        | 0.854    | 0.856    | 0.823
GPT-4o-mini      | VLM  | —        | 0.812    | —        | 0.766
Qwen3-VL-4B      | VLM  | ✅        | 0.806    | —        | 0.764
BEiT             | FER  | —        | 0.766    | 0.772    | 0.713
Gemma3-4B        | VLM  | —        | 0.726    | 0.683    | 0.646
EmoNet           | FER  | —        | 0.731    | 0.724    | 0.665
LLaMA-3.2-11B    | VLM  | —        | 0.613    | 0.402    | 0.458
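Cohen’s κ in Table 1 measures categorical agreement corrected for chance agreement under the marginal label distributions. A minimal, dependency-free sketch of the unweighted computation (equivalent to scikit-learn’s cohen_kappa_score):

```python
from collections import Counter

def cohens_kappa(y_true, y_pred):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two label sequences."""
    n = len(y_true)
    # Observed agreement: fraction of exact matches.
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement from the marginal label distributions.
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (p_o - p_e) / (1 - p_e)

k = cohens_kappa(["a", "a", "b", "b"], ["a", "a", "b", "a"])  # → 0.5
```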

Table 2 → 10-model emotion accuracy

Emotion | Gemini | Qwen3-VL | GPT   | Gemma3 | LLaMA | PosterV2 | MobileViT | EfficientNet | BEiT  | EmoNet
Happy   | 1.000  | 1.000    | 1.000 | 1.000  | 1.000 | 1.000    | 1.000     | 1.000        | 0.979 | 1.000
Neutral | 0.992  | 0.962    | 1.000 | 1.000  | 1.000 | 0.912    | 0.863     | 0.729        | 0.529 | 0.533
Fear    | 0.971  | 0.896    | 0.929 | 0.979  | 0.654 | 0.933    | 0.942     | 0.846        | 0.792 | 0.912
Angry   | 0.929  | 0.875    | 0.942 | 0.404  | 0.921 | 0.917    | 0.954     | 0.887        | 0.800 | 0.637
Disgust | 0.808  | 0.554    | 0.750 | 0.842  | 0.008 | 0.642    | 0.533     | 0.679        | 0.754 | 0.846
Sad     | 0.583  | 0.546    | 0.254 | 0.126  | 0.092 | 0.992    | 0.958     | 0.983        | 0.742 | 0.454

NEW Section 4.X: Thinking Effect on Emotion Classification

Key result: thinking models consistently outperform their non-thinking counterparts.

Frontier pair: Gemini (thinking, 88.1%) vs GPT (no-thinking, 81.2%) → +6.9%p
Local 4B pair: Qwen3-VL (thinking, 80.6%) vs Gemma3 (no-thinking, 72.5%) → +8.1%p

Dramatic effect on sad:

  • No-thinking: LLaMA 9.2%, Gemma3 12.1%, GPT 25.4%
  • Thinking: Qwen3-VL 54.6%, Gemini 58.3%
  • Thinking is decisive for sad–neutral discrimination (a 2–6× improvement)

4B thinking ≈ Frontier no-thinking: Qwen3-VL (κ=0.764) ≈ GPT-4o-mini (κ=0.766)
→ “Explicit reasoning partially compensates for model scale”


Section 4.2 Valence Changes

Table 3 → 8-model valence (3 VLM + 5 FER → 5 VLM + 3 FER)

Model            | Type | Thinking | Pearson r | MAE   | Bias
Gemini 2.5 Flash | VLM  | ✅        | .963      | 1.842 | −1.280
MobileViT        | FER  | —        | .950      | 0.916 | —
EfficientNet     | FER  | —        | .940      | 1.063 | —
GPT-4o-mini      | VLM  | —        | .938      | 1.626 | −1.018
EmoNet           | FER  | —        | .928      | 0.795 | —
Qwen3-VL-4B      | VLM  | ✅        | .913      | 1.445 | −0.824
LLaMA-3.2-11B    | VLM  | —        | .901      | 1.808 | —
Gemma3-4B        | VLM  | —        | .891      | 1.456 | —
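The valence columns can be reproduced from paired predictions with a small helper. The signed-bias convention, mean(prediction − ground truth), is an assumption consistent with the negative values in the table:

```python
import math

def valence_metrics(y_true, y_pred):
    """Return (pearson_r, mae, bias) for paired valence ratings.

    Bias is taken as mean(prediction - ground truth); the sign
    convention is assumed, matching the negative values in Table 3.
    """
    n = len(y_true)
    mt = sum(y_true) / n
    mp = sum(y_pred) / n
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mt) ** 2 for t in y_true)
    var_p = sum((p - mp) ** 2 for p in y_pred)
    r = cov / math.sqrt(var_t * var_p)           # Pearson correlation
    mae = sum(abs(p - t) for t, p in zip(y_true, y_pred)) / n
    bias = sum(p - t for t, p in zip(y_true, y_pred)) / n
    return r, mae, bias

r, mae, bias = valence_metrics([-3, 0, 2, 4], [-4, -1, 1, 3])  # r=1.0, mae=1.0, bias=-1.0
```

A uniformly negative bias, as seen for all three VLMs in the table, indicates systematic underestimation of valence.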

Section 4.3 Arousal Changes

Table 5 → 5 VLM + 1 FER

Model            | Type | Thinking | Pearson r | MAE
Gemini 2.5 Flash | VLM  | ✅        | .767      | 1.951
Qwen3-VL-4B      | VLM  | ✅        | .758      | 2.013
GPT-4o-mini      | VLM  | —        | .622      | 1.572
LLaMA-3.2-11B    | VLM  | —        | .783      | 1.777
Gemma3-4B        | VLM  | —        | .759      | 1.137
EfficientNet     | FER  | —        | .448      | 1.696

Thinking effect on arousal: r = 0.758–0.767 (thinking) vs 0.622 (GPT, no-thinking)
→ “Thinking enhances arousal estimation by providing intermediate reasoning about emotional intensity”


NEW Section 5.X: Thinking Tokens as Cognitive Load Proxy

The number of thinking tokens serves as a proxy for the model’s cognitive load.

Thinking tokens by emotion:

  • Highest on sad (Gemini 1,290 tokens, Qwen3-VL 3,915 tokens)
  • Lowest on happy (Gemini 949, Qwen3-VL 1,608)
  • Sad/happy ratio: Gemini 1.36×, Qwen3-VL 2.43×

Thinking increases on incorrect answers:

  • Gemini: correct 993 → incorrect 1,248 (+26%)
  • Qwen3-VL: correct 2,339 → incorrect 3,959 (+69%)
  • Interpretation: higher uncertainty → longer thinking, but not higher accuracy

Per-step pattern: thinking is longest at the arousal step → arousal is the hardest dimension for the models as well, consistent with human agreement (α = 0.125).
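The per-emotion and per-correctness aggregations above reduce to grouping token counts and taking means. A sketch over hypothetical trial records; the field names 'emotion', 'correct', and 'thinking_tokens' are assumptions, not the pipeline's actual schema:

```python
from statistics import mean

def mean_tokens_by(trials, key):
    """Mean thinking-token count per group of trial records.

    `trials` are dicts with assumed fields: 'emotion' (str),
    'correct' (bool), and 'thinking_tokens' (int).
    """
    groups = {}
    for t in trials:
        groups.setdefault(t[key], []).append(t["thinking_tokens"])
    return {k: mean(v) for k, v in groups.items()}

trials = [
    {"emotion": "sad",   "correct": False, "thinking_tokens": 3900},
    {"emotion": "sad",   "correct": True,  "thinking_tokens": 2400},
    {"emotion": "happy", "correct": True,  "thinking_tokens": 1600},
]
by_emotion = mean_tokens_by(trials, "emotion")
by_correct = mean_tokens_by(trials, "correct")
# The sad/happy ratio and the correct/incorrect gap follow directly from these means.
```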


Section 5.6 Limitations Changes

Resolved limitation: “extending to frontier APIs” → addressed by adding GPT-4o-mini and Gemini 2.5 Flash

Quantization-related updates:

  • The sad failure is also observed in frontier models (GPT 25.4%) → a structural limitation, not a quantization artifact
  • Polarity exaggeration is identical in frontier models → unrelated to quantization
  • Fixed-value output is more pronounced in local models → quantization is a partial contributor (MBQ, CVPR 2025)

New limitations:

  • The thinking-effect comparison confounds model architecture and training data (Gemini vs. GPT and Qwen vs. Gemma differ in more than thinking)
  • Whether thinking_budget = 1024 is optimal is unclear (Qwen3-VL is force-terminated when the budget is reached)

Section 6. Conclusion Changes

v5: “two VLMs”
v6: “five VLMs spanning three parameter scales (4B local, 11B local, frontier API) and two reasoning modes (standard and chain-of-thought thinking)”

Additional conclusions:

  1. Chain-of-thought thinking improves emotion classification by 7–8%p, with the largest gains on perceptually ambiguous emotions (sad: 55–58% vs. 9–25%)
  2. A 4B local thinking model achieves performance parity with a frontier non-thinking model (κ = 0.764 vs. 0.766)
  3. Polarity exaggeration and sadness–neutral confusion persist even in frontier full-precision models, confirming these as architectural rather than quantization-induced limitations
  4. Thinking tokens serve as a cognitive load proxy: models generate 26–69% more reasoning tokens on incorrect trials

Additional References

  • Lang, J. et al. (2024). A Comprehensive Study on Quantization Techniques for LLMs. arXiv:2411.02530.
  • Wang, C. et al. (2024). Q-VLM: Post-training Quantization for Large VLMs. arXiv:2410.08119.
  • Low-Bit Quantization Favors Undertrained LLMs. ACL 2025.
  • Quantization Methods, Task Difficulty, and Model Size. IJCAI 2025.