Do Vision Language Models See Emotions Like Humans? A Dual-Process Account of VLM Emotion Perception on AI-Generated Facial Stimuli

Authors: Jini Tae, Ju-Hyeon Park, Wonil Choi

Affiliation: Gwangju Institute of Science and Technology (GIST), South Korea


Abstract

Vision Language Models (VLMs) are increasingly deployed as scalable substitutes for human emotion annotation, yet their alignment with human emotion perception remains poorly understood beyond categorical accuracy. This study introduces a psychometric framework that treats VLMs as additional raters in a human emotion rating paradigm, comparing six VLMs — three local open-source models (Gemma3-4B-IT, Gemma3-12B-IT, LLaMA-3.2-11B-Vision) and two frontier API models (GPT-4o-mini, Gemini 2.5 Flash), plus one local thinking model (Qwen3-VL-4B-Thinking) — against 1,000 human participants on 1,440 AI-generated facial images balanced across three races (Black, Caucasian, Korean), two genders, and six emotion categories (five basic emotions plus neutral). Using Cohen’s κ, Pearson correlation, MAE, and mixed-effects models, we evaluate categorical agreement, dimensional alignment (valence and arousal), and demographic bias against human inter-rater reliability as a benchmark.

The six VLMs span moderate-to-almost-perfect categorical agreement (κ = 0.536–0.848). Cross-model comparisons suggest a roughly 7 percentage point accuracy advantage for models with thinking capability, though an output suppression test on Gemini 2.5 Flash — which reduced but did not eliminate internal reasoning tokens — showed no accuracy difference (87.8% vs. 87.4%), leaving the causal role of thinking mode unresolved. The largest accuracy gains appear on sadness recognition, where thinking-enabled models achieve 55–58% accuracy compared to 9–27% for non-thinking models. We provide three lines of convergent evidence that sadness recognition difficulty is a cross-agent phenomenon requiring deliberative processing: (1) human raters show the longest response times for sad stimuli (Mdn = 1.745 s for arousal), (2) VLM thinking models generate 31–143% longer reasoning traces for sad versus happy stimuli, and (3) sad stimuli receive higher human naturalness ratings than fear, disgust, and angry stimuli, ruling out stimulus quality as an explanation. These findings converge on a dual-process account (Kahneman, 2011): non-thinking VLMs perform in ways that parallel System 1 processing, failing on low-intensity emotions, while thinking VLMs engage deliberation that partially compensates for this limitation — though the causal role of thinking mode per se is qualified by the output suppression finding. A 4B local thinking model (Qwen3-VL, κ = 0.761) achieves near-parity with a frontier non-thinking model (GPT-4o-mini, κ = 0.768), demonstrating that architectural differences including chain-of-thought capability may partially compensate for model scale.

Valence correlations are high (r = .891–.963) but absolute errors are large (MAE = 1.48–1.95) due to polarity exaggeration bias that persists even in frontier full-precision models, confirming this as an architectural rather than quantization-induced limitation. Arousal correlations are moderate across all models (r = .622–.797), with no consistent thinking advantage: non-thinking models LLaMA (r = .783) and Gemma3-4B (r = .739) match or exceed thinking models Gemini (r = .742) and Qwen3-VL (r = .733). Demographic bias patterns are model-specific, with frontier models showing smaller racial accuracy gaps (3.9 percentage points) than local models (4.8–9.4 percentage points).

Keywords: Vision Language Models, Facial Emotion Recognition, Psychometric Agreement, Dual-Process Theory, Chain-of-Thought Reasoning, Valence-Arousal, Demographic Bias, AI-Generated Faces, Affective Computing


1. Introduction

1.1 Affective Computing and the Promise of VLMs

The deployment of affective computing systems — from mental health chatbots to responsive virtual assistants — increasingly depends on accurate automatic emotion recognition from facial expressions. The efficacy of such systems hinges on affective alignment, defined as the degree to which a machine’s interpretation of emotional cues matches human psychological standards (Pantic et al., 2005). When an empathetic agent misinterprets the intensity of a user’s distress, it risks breaking user trust and failing to sustain meaningful interaction. These stakes motivate rigorous empirical comparison between machine and human emotion perception.

Vision Language Models (VLMs) represent a paradigm shift from task-specific facial expression recognition (FER) models to general-purpose multimodal systems. A VLM is a model that integrates a vision encoder with a large language model, enabling image-conditioned text generation through natural language prompting. Whereas FER-specialized models are trained end-to-end on emotion-labeled datasets and output fixed emotion categories or continuous valence-arousal values, VLMs can flexibly produce both categorical and dimensional emotion ratings through instruction prompting — a capability that mirrors the integrated judgment process humans naturally employ. This flexibility raises the possibility that VLMs might serve as scalable substitutes for costly human emotion annotation, where collecting 72,000 responses from 1,000 raters represents a significant time and financial investment.

To evaluate whether VLMs truly perceive emotions as humans do, a dimensional measurement framework is required. The Circumplex Model of Affect (Russell, 1980) is a theoretical framework that maps all emotional experiences onto a continuous two-dimensional space defined by valence and arousal. Valence is the hedonic quality of an emotional experience, ranging from unpleasant to pleasant. Arousal is the degree of physiological activation, ranging from calm to excited. While the circumplex model was originally formulated for self-reported affective experience, it has been widely adopted for characterizing observer-rated facial expression perception (Baudouin et al., 2025). We follow this convention while noting that perceived emotion in others and felt emotion in oneself may involve distinct processes. This dimensional framework provides a richer representational vocabulary than categorical classification alone, enabling detection of subtle perceptual misalignments that discrete labels would obscure. Despite the theoretical importance of dimensional ratings, computational evaluations of emotion recognition have overwhelmingly focused on discrete category accuracy (Khare et al., 2024; Telceken et al., 2025).

1.2 The Evaluation Gap

Although this framework exists, current VLM evaluations fail to employ it, creating four critical gaps that this study addresses.

The first gap concerns the absence of a human agreement benchmark. Existing benchmarks rely on accuracy and F1 scores against ground-truth labels while ignoring substantial disagreement among human raters. Human emotion perception is inherently variable — particularly for arousal, where inter-rater reliability can be as low as Krippendorff’s α = 0.125 (present study). Krippendorff’s α is a reliability coefficient for multiple raters that corrects for chance agreement, where 1.0 indicates perfect consensus and 0.0 indicates chance-level agreement. Without establishing human inter-rater reliability as a benchmark, it is impossible to determine whether a model’s errors reflect genuine failure or simply mirror the inherent subjectivity of emotion perception.

The second gap is the exclusive focus on categorical accuracy, neglecting continuous dimensional ratings central to affective science. A model may achieve perfect categorical accuracy while producing systematically distorted dimensional ratings — a dissociation we demonstrate empirically in the present study.

The third gap concerns the absence of demographic bias audits for VLMs. While demographic disparities have been documented in commercial FER APIs (Rhue, 2018; Jankowiak et al., 2024), systematic bias analysis of VLMs across race-gender-emotion intersections remains absent. This gap is concerning given the rapid adoption of VLMs in research and applied settings where fairness guarantees are critical.

The fourth gap is the absence of any investigation into how reasoning mode affects emotion perception. Recent VLMs can operate in two modes: standard inference, which generates responses directly, and chain-of-thought (CoT) thinking mode, which produces explicit reasoning traces before responding. This distinction loosely parallels Kahneman’s (2011) dual-process theory, where System 1 operates through fast, automatic pattern recognition and System 2 through slow, deliberative reasoning. Whether this architectural distinction in VLMs produces measurable differences in emotion recognition — particularly for perceptually ambiguous emotions — has not been systematically investigated.

1.3 Contributions and Research Questions

This paper makes five contributions to the intersection of affective computing, cognitive psychology, and multimodal AI evaluation.

First, we introduce a VLM-as-rater psychometric framework that treats VLMs as additional participants in a human rating paradigm. Rather than evaluating VLMs against ground-truth labels using accuracy and F1, we employ Cohen’s κ, Pearson correlation, MAE, and mixed-effects models to quantify agreement against human inter-rater reliability as an empirical agreement benchmark. Cohen’s κ is a chance-corrected agreement measure for categorical classification, where 0 indicates chance-level and 1 indicates perfect agreement. This framework reveals dimensions of VLM behavior — polarity exaggeration, dimensional collapse, sadness-neutral confusion — that accuracy-based evaluations entirely miss.

Second, we provide the first convergent evidence that sadness recognition difficulty is a cross-agent phenomenon. Three independent lines of evidence — human response times (N = 1,000, 72,000 responses), VLM thinking traces (two thinking models × 1,440 images), and stimulus naturalness ratings — all identify sadness as the emotion requiring the deepest processing. This convergent evidence supports a dual-process account of emotion recognition in which non-thinking VLMs perform in ways that parallel System 1 processing, failing on low-intensity emotions, while thinking VLMs engage deliberation that partially compensates for this difficulty.

Third, we conduct an output suppression test on Gemini 2.5 Flash that demonstrates the difficulty of cleanly ablating thinking mode in frontier API models — the model continued generating approximately 240 internal reasoning tokens per inference step even with thinking nominally disabled, yielding no accuracy difference and highlighting the need for within-model ablation studies. Cross-model comparisons between thinking-enabled and non-thinking models show accuracy differences of roughly 7 percentage points, but these cannot be causally attributed to thinking mode alone.

Fourth, we present among the first systematic demographic bias analyses of VLMs across a fully crossed 3 (race: Black, Caucasian, Korean) × 2 (gender: Male, Female) × 6 (emotion) factorial stimulus design with 1,440 AI-generated face images ensuring perfect experimental control.

Fifth, we introduce thinking token analysis as a cognitive load proxy, demonstrating that VLM reasoning traces parallel human processing difficulty: models generate 26–69% more reasoning tokens on incorrect trials, and the emotion with the longest reasoning traces (sadness) is also the emotion with the longest human response times (ρ = +0.899, p = .015).

This study is exploratory in nature. Rather than testing pre-registered hypotheses, we systematically characterize VLM emotion rating behavior across multiple dimensions to generate testable hypotheses for future confirmatory research. Our research questions address four axes of VLM-human comparison:

RQ1: How do VLM emotion ratings compare to human inter-rater reliability on categorical and dimensional measures?

RQ2: Do VLMs exhibit systematic demographic biases in emotion attribution, and are these biases model-specific?

RQ3: How do VLMs at different scales (4B local, 11–12B local, frontier API) compare in classification accuracy, dimensional prediction, and bias profiles?

RQ4: Does deliberative reasoning (thinking mode) improve recognition of low-intensity emotions, paralleling human deliberative processing?


2. Related Work

2.1 VLMs for Emotion Recognition

The application of VLMs to facial emotion recognition has yielded mixed results, with traditional deep learning models consistently outperforming VLMs on categorical accuracy. Mulukutla et al. (2025) conducted the first empirical comparison of open-source VLMs against traditional models on FER-2013, a dataset containing 35,887 low-resolution grayscale images across seven emotion classes. Traditional models — EfficientNet-B0 (86.44% accuracy) and ResNet-50 (85.72%) — outperformed VLMs by 20 to 35 percentage points, with CLIP achieving 64.07% and Phi-3.5 Vision achieving 51.66%. This performance gap suggests that VLMs’ general visual understanding does not automatically translate to FER proficiency, particularly on low-quality visual inputs.

Frontier API models show more promising results. Evaluations on the NimStim dataset demonstrate that GPT-4o and Gemini match or exceed human performance for calm, neutral, and surprise expressions, though performance degrades for more ambiguous emotions (Harb et al., 2025). Refoua et al. (2026) evaluated ChatGPT-4, ChatGPT-4o, and Claude 3 Opus on the Reading the Mind in the Eyes Test (RMET) with White, Black, and Korean face stimuli, finding that ChatGPT-4o achieved cross-ethnically consistent performance with accuracy above the 85th human percentile across all three ethnic versions. AlDahoul et al. (2026) developed FaceScanPaliGemma, a multi-agent VLM system for simultaneous facial attribute recognition including emotion (59.4% accuracy), race, gender, and age. Bhattacharyya and Wang (2025) presented a comprehensive evaluation of VLMs for evoked emotion recognition at NAACL, confirming that zero-shot VLMs lag behind supervised systems. The present study extends this literature by evaluating six VLMs spanning three parameter scales (4B, 11–12B, frontier) and two reasoning modes (standard and thinking) on a fully controlled factorial stimulus design.

2.2 Sadness-Neutral Confusion in Emotion Recognition

Sadness-neutral confusion is well-documented in FER literature. Mejia-Escobar et al. (2023) reported that 1,328 of 7,206 sad images in FER-2013 were misclassified as neutral. Analyses of AffectNet (Savchenko et al., 2024) found that anger and sadness had the highest misclassification rates, with 29% of sadness instances classified as neutral. The InsideOut benchmark (2025) similarly reported persistent confusion between “subtle classes such as fear, sadness, and neutral.” These studies establish sadness-neutral confusion as a well-known phenomenon in CNN-based FER models.

However, three critical gaps remain. First, sadness-neutral confusion has not been systematically characterized in VLMs. Harb et al. (2025) evaluated GPT-4o and Gemini on posed NimStim stimuli, finding fear-surprise confusion as the dominant error — a result attributable to the exaggerated expressions in posed datasets that reduce the ambiguity of sadness. Whether VLMs exhibit the same sadness-neutral confusion as FER models on more naturalistic stimuli has not been investigated. Second, no prior study has examined whether chain-of-thought reasoning in VLMs mitigates sadness-neutral confusion. Third, the relationship between human processing difficulty and VLM reasoning difficulty across emotions has never been quantified, despite the obvious theoretical interest of such a comparison.

2.3 Dual-Process Theory and Emotion Perception

Kahneman’s (2011) dual-process theory distinguishes between System 1 (fast, automatic, intuitive processing) and System 2 (slow, deliberative, effortful reasoning). Evidence from human emotion perception supports the relevance of this framework: Calvo and Nummenmaa (2013) demonstrated that happiness recognition requires only 10–20 ms of exposure, while sadness requires 70–200 ms — a 3.5 to 10-fold increase — suggesting that sadness recognition cannot be achieved through System 1 processing alone. Further support comes from clinical populations: individuals with alexithymia — a condition characterized by difficulty identifying emotions — show a specific tendency to rate negative emotions, particularly sadness, as neutral (Grynberg et al., 2012). Meta-analytic evidence indicates that cognitive empathy, a deliberate perspective-taking ability corresponding to System 2 processing, positively correlates with sadness recognition accuracy (Qiao et al., 2025).

The dual-process framework has not been applied to VLM emotion perception. We propose that non-thinking VLMs perform in ways that parallel System 1 processing: they achieve rapid pattern matching sufficient for high-arousal, visually distinctive emotions (happy, angry, fear) but fail on low-intensity emotions (sadness) where deliberative reasoning is required. Thinking-enabled VLMs, by generating explicit reasoning traces before responding, engage an analogous System 2 process. This framework generates the specific prediction that thinking mode should disproportionately improve sadness recognition — a prediction we test directly.

2.4 Human-AI Comparison in Emotion Perception

The psychometric comparison of human and machine raters has a long tradition in clinical psychology, recently extended to large language models. Tak and Gratch (2024) found that GPT-4 emulates average-human emotional cognition from a third-person perspective. Alrasheed et al. (2025) evaluated GPT-4’s capacity to interpret emotions from non-facial affective images in the GAPED database, achieving correlations of r = 0.87 for valence and r = 0.72 for arousal under zero-shot conditions. Zhang et al. (2024) provide a comprehensive survey noting that while LLMs excel at affective understanding tasks such as sentiment classification, their performance on dimensional emotion estimation remains underexplored. The present study bridges this gap by evaluating six VLMs across two reasoning modes, producing integrated categorical-plus-dimensional ratings through a psychometric framework anchored to large-scale human data (N = 1,000).

2.5 Demographic Bias in Automated Affect Recognition

Documented racial and gender disparities in automated affect recognition have raised fairness concerns that extend to VLMs. Jankowiak et al. (2024) demonstrated that imbalanced training data propagates into systematic performance disparities across demographic groups. Gender bias in FER manifests as both representational bias (unequal demographic representation) and stereotypical bias (systematic associations between emotions and demographics; Dominguez-Catena et al., 2024). Human emotion perception itself is not demographically neutral: gender-emotion stereotypes lead observers to associate male faces with anger and female faces with happiness and sadness (Plant et al., 2000), though these stereotypical associations can reverse when facial cues are controlled (Hess et al., 2004). These human biases propagate into training datasets — AffectNet (Mollahosseini et al., 2017) relies on 12 annotators across approximately 450,000 images, with most images receiving a single annotation — and may be amplified by VLM pretraining on web-scale data. The present study extends bias analysis to six VLMs using a factorial design that enables orthogonal estimation of race, gender, and emotion effects.

2.6 AI-Generated Stimuli in Emotion Research

Traditional face databases — KDEF, ADFES, FER-2013, AffectNet — suffer from uncontrolled variation in expression quality, lighting, and demographic balance. AI-generated face stimuli address these limitations through controlled generation. The GIST-AIFaceDB used in this study generates neutral base faces with standardized features — identical gray backgrounds, navy t-shirts, and front-facing pose — then transforms each into five emotional expressions while preserving identity. This pipeline ensures that differences between expressions for a given identity are attributable solely to the emotion manipulation. Ecological validity is supported by human naturalness ratings: average naturalness ranged from 5.26 (fear) to 6.94 (happy) on a 9-point scale, indicating that participants perceived stimuli as moderately to highly realistic. Baudouin et al. (2025) provide evidence that dimensional ratings can be reliably collected from facial stimuli regardless of provenance.


3. Methodology

Figure 1 presents the overall research pipeline, illustrating how 1,440 AI-generated stimuli flow through human rating and VLM inference before converging in psychometric comparison.

flowchart TB
    subgraph Stimuli["Stimuli Generation"]
        A["OpenArt<br>STOIQO NewReality Flux"] -->|"240 neutral faces"| B["Nano-Banana<br>Gemini 2.5 Flash Image"]
        B -->|"5 emotions per identity"| C["GIST-AIFaceDB<br>1,440 images<br>3 races × 2 genders × 6 emotions × 40 IDs"]
    end

    subgraph Human["Human Rating (N = 1,000)"]
        C --> D["72 images per participant<br>72,000 total responses"]
        D --> E["Valence 1–9<br>Arousal 1–9<br>Naturalness 1–9<br>Response Times"]
    end

    subgraph VLM["VLM Inference (6 Models)"]
        C --> F1["Local No-Thinking<br>Gemma3-4B, Gemma3-12B,<br>LLaMA-3.2-11B"]
        C --> F2["Local Thinking<br>Qwen3-VL-4B"]
        C --> F3["Frontier API<br>GPT-4o-mini, Gemini 2.5 Flash"]
        F1 --> H["Context-Carry<br>3-Step Prompting"]
        F2 --> H
        F3 --> H
        H --> I["Emotion + Valence + Arousal<br>+ Thinking Traces"]
    end

    subgraph Analysis["Psychometric Comparison"]
        E --> L["Cohen's κ, Pearson r, MAE<br>Mixed-Effects Models<br>Demographic Bias<br>Thinking Token Analysis"]
        I --> L
        L --> M["Key Findings:<br>Dual-Process Account<br>Polarity Exaggeration<br>Sadness-Neutral Confusion<br>Thinking Advantage"]
    end

    style Stimuli fill:#e1f5fe,stroke:#0288d1
    style Human fill:#fff3e0,stroke:#f57c00
    style VLM fill:#e8f5e9,stroke:#388e3c
    style Analysis fill:#f3e5f5,stroke:#7b1fa2

Figure 1. Overall research pipeline. AI-generated stimuli (blue) are evaluated by 1,000 human raters (orange) and six VLMs spanning three scales and two reasoning modes (green), with all outputs converging in psychometric comparison (purple).

3.1 Stimuli

The stimulus set comprises 1,440 AI-generated facial images from the GIST AI-Generated Face Database (GIST-AIFaceDB, under review). The generation pipeline employed a two-step process. In the first step, 240 neutral base faces were generated using the STOIQO NewReality Flux model deployed on the OpenArt platform, depicting diverse virtual identities with standardized navy t-shirts against gray backgrounds across three racial groups (Black, Caucasian, Korean) and two genders (Male, Female). In the second step, each neutral face was transformed into five additional emotional expressions — angry, disgusted, fearful, happy, and sad — using Nano-Banana, an advanced image-editing model implemented in Google AI Studio (Gemini 2.5 Flash Image), which modifies facial expressions while preserving identity, lighting, and background.

The resulting fully crossed factorial design — 3 (race) × 2 (gender) × 6 (emotion) × 40 (identity) — yields 1,440 images with balanced cell sizes: 240 per emotion, 480 per race, 720 per gender, and 40 per race-gender-emotion combination. This balanced design enables orthogonal estimation of all demographic effects without confounding.
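The balanced cell counts above follow mechanically from the crossing. A minimal sketch (factor labels from the design; identity indices purely illustrative):

```python
from itertools import product

# Sketch of the fully crossed 3 x 2 x 6 x 40 stimulus design described above.
races = ["Black", "Caucasian", "Korean"]
genders = ["Male", "Female"]
emotions = ["happy", "sad", "angry", "fear", "disgust", "neutral"]
identities = range(40)  # 40 identities within each race-gender cell

stimuli = list(product(races, genders, emotions, identities))

print(len(stimuli))                                  # 1440 images total
print(sum(s[2] == "sad" for s in stimuli))           # 240 per emotion
print(sum(s[0] == "Korean" for s in stimuli))        # 480 per race
print(sum(s[1] == "Female" for s in stimuli))        # 720 per gender
print(sum(s[0] == "Black" and s[1] == "Male" and s[2] == "angry"
          for s in stimuli))                         # 40 per race-gender-emotion cell
```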

3.2 Human Rating Procedure

The study protocol was reviewed and granted exemption by the Institutional Review Board (IRB). One thousand native Korean adults (500 female, 500 male; age M = 44.6, SD = 13.7, range 20–69) were recruited through an online platform, with recruitment strictly balanced across age cohorts and genders. Each participant evaluated 72 images drawn from the 1,440 total and presented in randomized order; assignment was counterbalanced so that each image received exactly 50 independent ratings, yielding 72,000 total responses across three dimensions: valence (1–9 Likert scale), arousal (1–9), and naturalness (1–9). Response times were recorded for each rating.

Inter-rater reliability, computed as Krippendorff’s α (ordinal), established the human agreement benchmark: valence α = 0.471 (poor-to-fair), arousal α = 0.125 (poor), and naturalness α = 0.126 (poor). While these values appear low, they fall within the typical range for emotion rating studies and reflect the inherent subjectivity of affective perception. A linear mixed-effects model (LMM) confirmed that rater individual differences (σ² = 0.450 for valence, σ² = 0.696 for arousal) dominated image-level variance by a factor of 11 for valence and 32 for arousal, confirming that low reliability is driven by rater heterogeneity rather than stimulus ambiguity.
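For reference, ordinal Krippendorff’s α can be computed from a raters × items matrix via the standard coincidence-matrix formulation, sketched below. This is a minimal illustrative implementation, not the study’s analysis code; production work would typically use an established package.

```python
import numpy as np

def krippendorff_alpha_ordinal(ratings):
    """Krippendorff's alpha (ordinal metric) for a raters x items matrix.

    Missing ratings are marked with np.nan. Returns 1.0 for perfect
    agreement and values near 0.0 for chance-level agreement.
    """
    ratings = np.asarray(ratings, dtype=float)
    values = np.unique(ratings[~np.isnan(ratings)])  # sorted ordinal levels
    idx = {v: i for i, v in enumerate(values)}
    V = len(values)

    # Coincidence matrix: pairable rating pairs within each item.
    o = np.zeros((V, V))
    for u in range(ratings.shape[1]):
        col = ratings[:, u]
        col = col[~np.isnan(col)]
        m = len(col)
        if m < 2:
            continue  # items with a single rating carry no pairing information
        for i in range(m):
            for j in range(m):
                if i != j:
                    o[idx[col[i]], idx[col[j]]] += 1.0 / (m - 1)

    n_c = o.sum(axis=1)  # marginal frequency of each ordinal level
    n = n_c.sum()

    # Ordinal distance: squared sum of marginals spanning the two levels.
    delta2 = np.zeros((V, V))
    for c in range(V):
        for k in range(V):
            lo, hi = min(c, k), max(c, k)
            delta2[c, k] = (n_c[lo:hi + 1].sum() - (n_c[c] + n_c[k]) / 2) ** 2

    D_o = (o * delta2).sum()                              # observed disagreement
    D_e = (np.outer(n_c, n_c) * delta2).sum() / (n - 1)   # expected disagreement
    return 1.0 - D_o / D_e
```

Two raters giving identical 1–9 ratings on every image yield α = 1.0; systematic disagreement drives α toward (and below) zero.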

3.3 VLM Inference

Six VLMs were evaluated, spanning three parameter scales and two reasoning modes. Table 1 summarizes the model specifications.

Table 1. VLM specifications and inference configurations.

| Model | Provider | Parameters | Quantization | Thinking | Backend | Key Settings |
|---|---|---|---|---|---|---|
| Gemma3-4B-IT | Google | 4B | QAT 4-bit | No | MLX (local) | temp=0 |
| Gemma3-12B-IT | Google | 12B | QAT 4-bit | No | MLX (local) | temp=0 |
| LLaMA-3.2-11B-Vision | Meta | 11B | 4-bit | No | MLX (local) | temp=0 |
| Qwen3-VL-4B-Thinking | Alibaba | 4B | 4-bit | Yes (budget=1024) | MLX (local) | temp=0, rep_penalty=1.5 |
| GPT-4o-mini | OpenAI | Frontier | Full-precision | No | API | temp=0, seed=42, image_detail=high |
| Gemini 2.5 Flash | Google | Frontier | Full-precision | Yes (dynamic) | API | temp=0, includeThoughts=true |

The three local models (Gemma3-4B, Gemma3-12B, LLaMA-3.2-11B) were deployed on Apple Silicon (M1 Max, 32 GB) via the MLX framework with 4-bit quantization for memory-efficient inference. Qwen3-VL-4B-Thinking was deployed on the same hardware with chain-of-thought reasoning enabled: the model generates explicit reasoning within <think>...</think> tags before producing its JSON response, with a thinking budget of 1,024 tokens per inference step to prevent runaway generation in quantized models. GPT-4o-mini was accessed through the OpenAI API with deterministic settings (temperature = 0, seed = 42, image_detail = “high”). Gemini 2.5 Flash was accessed through the Google Generative AI API with thinking mode enabled (dynamic thinking budget) and includeThoughts: true to collect reasoning traces.

All models were run with temperature = 0 (greedy decoding) for deterministic outputs. The inclusion of two frontier API models operating at full precision serves dual purposes: establishing a performance ceiling unconstrained by quantization artifacts, and enabling partial disentanglement of quantization effects from architectural limitations. Recent work demonstrates that calibration-based 4-bit quantization retains 92–95% of FP16 quality on standard benchmarks (Lang et al., 2024), with vision tokens being less sensitive to quantization than language tokens due to higher redundancy (Li et al., 2025).

Inference followed a three-step context-carry prompting strategy, where prior outputs are fed forward as context for subsequent predictions, mirroring anchoring effects in human sequential judgment. In Step 1, the model classified the facial emotion from six forced-choice categories (happy, sad, angry, fear, disgust, neutral) via JSON output. In Step 2, the classified emotion was carried forward, and the model rated valence on a 1–9 scale. In Step 3, both the classified emotion and valence rating were carried forward, and the model rated arousal on a 1–9 scale. This strategy introduces structural error propagation: classification errors in Step 1 systematically influence subsequent valence and arousal ratings. Response parsing employed a cascade strategy: direct JSON parse, markdown fence stripping, and regex fallback. All 1,440 images were successfully processed by all six models, yielding 8,640 total VLM predictions.
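The parsing cascade can be sketched as follows; the JSON key name is illustrative, not the study’s exact output schema.

```python
import json
import re

def parse_vlm_response(raw):
    """Cascade parser sketched from the strategy above: direct JSON parse,
    then markdown fence stripping, then a regex fallback."""
    # 1) Direct JSON parse
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    # 2) Strip markdown code fences (```json ... ```)
    fenced = re.search(r"```(?:json)?\s*(\{.*?\})\s*```", raw, re.DOTALL)
    if fenced:
        try:
            return json.loads(fenced.group(1))
        except json.JSONDecodeError:
            pass
    # 3) Regex fallback: grab the first {...} block anywhere in the output.
    #    Non-greedy, so it assumes the flat (non-nested) JSON the prompts request.
    brace = re.search(r"\{.*?\}", raw, re.DOTALL)
    if brace:
        try:
            return json.loads(brace.group(0))
        except json.JSONDecodeError:
            pass
    return None
```

For example, `parse_vlm_response('Answer: {"emotion": "fear"}')` recovers the embedded object via the regex fallback, while unparseable output returns `None`.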

3.4 Statistical Analysis

Categorical agreement was quantified via unweighted Cohen’s κ against intended emotion labels, as the six emotion categories lack natural ordinal structure. McNemar’s test was used for pairwise model comparisons. Dimensional alignment was assessed through Pearson correlation, Mean Absolute Error (MAE), and Bland-Altman analysis (systematic bias and 95% limits of agreement). Per-emotion bias significance was tested with Wilcoxon signed-rank tests, Bonferroni-corrected.
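These agreement metrics can be rendered in a few lines; the sketch below is illustrative, not the study’s analysis scripts.

```python
import numpy as np

def cohens_kappa(a, b):
    """Unweighted Cohen's kappa between two categorical raters."""
    a, b = np.asarray(a), np.asarray(b)
    cats = np.unique(np.concatenate([a, b]))
    p_o = np.mean(a == b)                                       # observed agreement
    p_e = sum(np.mean(a == c) * np.mean(b == c) for c in cats)  # chance agreement
    return (p_o - p_e) / (1 - p_e)

def dimensional_alignment(model, human):
    """Pearson r, MAE, and Bland-Altman bias with 95% limits of agreement."""
    model = np.asarray(model, dtype=float)
    human = np.asarray(human, dtype=float)
    r = np.corrcoef(model, human)[0, 1]
    mae = np.abs(model - human).mean()
    diff = model - human
    bias = diff.mean()
    loa = (bias - 1.96 * diff.std(ddof=1), bias + 1.96 * diff.std(ddof=1))
    return r, mae, bias, loa
```

The second function makes the accuracy/alignment dissociation concrete: ratings that are perfectly correlated (r = 1) can still carry a large MAE and a systematic Bland-Altman bias.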

Bias decomposition employed linear mixed-effects models (LMMs) fitted via R’s lme4 package (Bates et al., 2015) with Satterthwaite degrees of freedom (lmerTest). The emotion-bias model used the formula: rating ~ rater_type * emotion + (1|image_id), where rater_type distinguishes human aggregate ratings from VLM ratings. Demographic bias models used analogous formulas with actor_race and actor_gender as fixed effects.

Thinking token analysis used character counts from collected reasoning traces (Gemini) and token counts estimated via tiktoken (Qwen3-VL). Per-emotion thinking length was compared via Kruskal-Wallis tests, and correct/incorrect trial comparisons used Mann-Whitney U tests.
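A sketch of these nonparametric comparisons, using synthetic token counts rather than the study’s data:

```python
import numpy as np
from scipy import stats

# Synthetic reasoning-trace lengths (tokens); means and sizes are illustrative.
rng = np.random.default_rng(0)
sad_tokens = rng.normal(600, 80, 200)      # hypothetically longer traces
happy_tokens = rng.normal(400, 80, 200)
neutral_tokens = rng.normal(450, 80, 200)

# Per-emotion comparison of trace length (Kruskal-Wallis H test)
H, p_kw = stats.kruskal(sad_tokens, happy_tokens, neutral_tokens)

# Correct vs. incorrect trials (Mann-Whitney U test)
correct = rng.normal(450, 80, 300)
incorrect = rng.normal(650, 80, 100)
U, p_mw = stats.mannwhitneyu(correct, incorrect, alternative="two-sided")

print(f"Kruskal-Wallis H = {H:.1f}, p = {p_kw:.3g}")
print(f"Mann-Whitney U = {U:.0f}, p = {p_mw:.3g}")
```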


4. Results

4.1 Emotion Classification

Table 2 presents the six-model ranking on overall emotion classification. The two thinking models (Gemini 2.5 Flash and Qwen3-VL-4B) occupy the first and third positions, with the frontier non-thinking model GPT-4o-mini in second.

Table 2. Overall emotion classification performance (N = 1,440 images per model).

| Rank | Model | Thinking | Parameters | Accuracy | Cohen’s κ |
|---|---|---|---|---|---|
| 1 | Gemini 2.5 Flash | Yes | Frontier | 0.874 | 0.848 |
| 2 | GPT-4o-mini | No | Frontier | 0.807 | 0.768 |
| 3 | Qwen3-VL-4B | Yes | 4B | 0.800 | 0.761 |
| 4 | Gemma3-12B | No | 12B | 0.759 | 0.711 |
| 5 | Gemma3-4B | No | 4B | 0.724 | 0.668 |
| 6 | LLaMA-3.2-11B | No | 11B | 0.613 | 0.536 |

Two patterns are notable. First, model scale does not predict performance: the 11B LLaMA (κ = 0.536) performs worse than the 4B Gemma3 (κ = 0.668), and the 12B Gemma3 (κ = 0.711) performs below the 4B Qwen3-VL (κ = 0.761). Architecture and reasoning mode matter more than parameter count. Second, the 4B Qwen3-VL with thinking (κ = 0.761) achieves near-parity with the frontier GPT-4o-mini without thinking (κ = 0.768), suggesting that architectural differences including chain-of-thought capability may partially compensate for model scale.

Table 3 presents emotion-specific accuracy across all six models, revealing extreme performance polarization.

Table 3. Emotion-specific classification accuracy (proportion correct).

| Emotion | Gemini | Qwen3-VL | GPT | Gemma3-12B | Gemma3-4B | LLaMA |
|---|---|---|---|---|---|---|
| Happy | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| Neutral | 0.992 | 0.963 | 1.000 | 1.000 | 1.000 | 1.000 |
| Fear | 0.971 | 0.896 | 0.929 | 0.979 | 0.979 | 0.654 |
| Angry | 0.908 | 0.858 | 0.925 | 0.925 | 0.400 | 0.925 |
| Disgust | 0.787 | 0.537 | 0.733 | 0.383 | 0.838 | 0.008 |
| Sad | 0.583 | 0.546 | 0.254 | 0.267 | 0.125 | 0.092 |

Happy and neutral are perfectly or near-perfectly classified by all models — effectively solved categories. Fear, angry, and disgust show model-specific variation. Sadness is the universal failure point: accuracy ranges from 9.2% (LLaMA) to 58.3% (Gemini), with no model exceeding 60%. The dominant error for sadness is neutral absorption: across non-thinking models, 66–76% of sad images are classified as neutral. Even the best-performing model (Gemini with thinking) misclassifies 19.2% of sad images as neutral.

Notable model-specific patterns emerge in Table 3. Gemma3-12B shows a distinctive profile with very high fear accuracy (0.979) and angry accuracy (0.925) but very low disgust accuracy (0.383), suggesting a systematic tendency to confuse disgust with other negative emotions. Gemma3-4B shows the reverse pattern, with high disgust accuracy (0.838) but poor angry accuracy (0.400). These complementary profiles indicate that architectural differences between the 4B and 12B Gemma3 variants produce qualitatively different emotion recognition strategies rather than uniform scaling effects.

4.2 Thinking Effect on Emotion Classification (RQ4)

Table 4 presents the thinking effect through both cross-model comparisons and a within-model output suppression test on Gemini 2.5 Flash.

Table 4. Thinking effect: cross-model comparisons and within-model output suppression test.

Comparison   Type                 Model A             Accuracy   Model B              Accuracy   Δ
Frontier     Cross-model          GPT-4o-mini         80.7%      Gemini 2.5 Flash     87.4%      +6.7 pp
Local (4B)   Cross-model          Gemma3-4B           72.4%      Qwen3-VL-4B          80.0%      +7.6 pp
Gemini       Output suppression   Gemini (budget=0)   87.8%      Gemini (budget=-1)   87.4%      -0.4 pp

The cross-model comparisons show a consistent accuracy advantage of roughly 7 percentage points (6.7–7.6 pp) for thinking-enabled models across both frontier and local pairs. However, the within-model Gemini output suppression test reveals that toggling the thinking budget produces no measurable accuracy difference (87.8% vs. 87.4% on the full 1,440 images).

The Gemini output suppression test warrants careful interpretation. Setting thinking_budget=0 does not disable internal reasoning — the API still reports approximately 240 internal thinking tokens per inference step, compared to 500+ tokens with the default dynamic budget. This means the test suppressed the external reasoning trace while reducing but not eliminating internal computation. The 87.8% vs. 87.4% comparison therefore reflects a difference in reasoning verbosity, not a clean ablation of reasoning capability. On sadness specifically, the suppressed condition achieved 60.0% compared to 58.3% for full thinking — a direction opposite to the cross-model pattern. Because the test did not genuinely disable reasoning, it is uninformative about whether thinking causally contributes to Gemini’s performance. The frontier accuracy gap (Gemini 87.4% vs. GPT-4o-mini 80.7%) therefore cannot be attributed to thinking mode and more parsimoniously reflects differences in model architecture, training data, and scale.
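
The output suppression manipulation reduces to a single generation-config parameter. The sketch below assumes the public google-genai SDK (`pip install google-genai`); the model string, prompt, and helper function are illustrative assumptions rather than the study's exact protocol.

```python
# Configuration sketch, assuming the google-genai SDK and an API key in the
# environment; illustrative only, not the study's exact pipeline.
from google import genai
from google.genai import types

client = genai.Client()

def classify(image_bytes: bytes, budget: int) -> tuple[str, int | None]:
    """budget=-1 requests dynamic thinking; budget=0 requests suppression
    (Gemini 2.5 Flash still reported internal thinking tokens under 0)."""
    resp = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=[
            types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
            "Which of the six basic emotions does this face express? One word.",
        ],
        config=types.GenerateContentConfig(
            thinking_config=types.ThinkingConfig(thinking_budget=budget),
        ),
    )
    # Residual internal reasoning is visible here even when budget=0.
    return resp.text, resp.usage_metadata.thoughts_token_count
```

Comparing `thoughts_token_count` across budget settings is what reveals that budget=0 suppresses the visible trace without fully eliminating internal reasoning.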

Only the Qwen3-VL–Gemma3-4B comparison, where different architectures are compared, provides suggestive (though not conclusive) evidence that explicit chain-of-thought reasoning may benefit emotion classification at the local 4B scale.

Table 5. Sadness accuracy by thinking mode.

Model              Thinking   Sad Accuracy
LLaMA-3.2-11B      No         9.2%
Gemma3-4B          No         12.5%
GPT-4o-mini        No         25.4%
Gemma3-12B         No         26.7%
Qwen3-VL-4B        Yes        54.6%
Gemini 2.5 Flash   Yes        58.3%

Non-thinking models achieve 9–27% sad accuracy, while thinking-enabled models achieve 55–58% — a 2 to 6-fold improvement in the cross-model comparison. Thinking reduces Gemini’s sadness-neutral confusion rate from the 66–76% range typical of non-thinking models to 19.2%. This disproportionate improvement on sadness, rather than a uniform boost across all emotions, is consistent with the dual-process interpretation: sadness recognition specifically requires the kind of deliberative reasoning that more capable models — whether through thinking mode, superior architecture, or both — can provide, while high-arousal emotions (happy, angry, fear) are adequately handled by direct pattern matching.

The only within-model sadness comparison available — Gemini with suppressed thinking (60.0%) versus Gemini with full thinking (58.3%) — actually shows a slight advantage for the suppressed condition, directly contradicting the cross-model pattern where thinking-enabled models (Qwen3-VL 54.6%, Gemini 58.3%) outperform non-thinking models (9–27%). This contradiction underscores that cross-model differences reflect multiple confounded factors, not thinking mode alone.

Given the output suppression result, the causal attribution to thinking mode specifically must be made cautiously. The sadness difficulty pattern is robust across all models, but whether thinking mode per se or overall model capability drives the improvement remains an open question.

4.3 Valence Comparison

All six VLMs achieve high valence correlations with human ratings (r = .891–.963), indicating correct rank ordering of emotions along the pleasantness dimension. However, absolute errors are large (MAE = 1.45–1.84), reflecting a systematic pattern of correct ordering but distorted scale usage.

Table 6. Valence prediction summary statistics (6 VLMs).

Model              Thinking   Pearson r   MAE     Bias (M)
Gemini 2.5 Flash   Yes        .963        1.842   −1.280
GPT-4o-mini        No         .938        1.626   −1.018
Gemma3-12B         No         .922        1.581   −0.876
Qwen3-VL-4B        Yes        .913        1.445   −0.824
LLaMA-3.2-11B      No         .899        1.702   −0.857
Gemma3-4B          No         .891        1.456   −0.291

The source of this distortion is polarity exaggeration bias: VLMs systematically produce more extreme valence ratings than humans — more negative for negative emotions and more positive for positive emotions. This pattern persists across all models, including frontier full-precision models, confirming it as an architectural property of VLMs rather than a quantization artifact. Mixed-effects models confirmed all per-emotion biases as statistically significant (p < .001).

Notably, the valence correlation ranking does not follow the classification accuracy ranking. Gemini achieves the highest valence correlation (r = .963) and classification accuracy, but Gemma3-12B (r = .922) outperforms Qwen3-VL (r = .913) in valence despite lower classification accuracy (κ = 0.711 vs. 0.761). Gemma3-4B achieves the smallest negative bias (−0.291), suggesting more conservative valence ratings despite lower overall classification performance. These dissociations confirm that categorical accuracy and dimensional alignment are partially independent competencies.
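
The three valence statistics reported in Table 6 can be sketched in a few lines of self-contained Python; the paired ratings below are hypothetical, chosen to illustrate polarity exaggeration (high r, large MAE, negative mean bias), not the study data.

```python
import math

def valence_stats(model_scores: list[float], human_scores: list[float]):
    """Pearson r, MAE, and mean signed bias (model - human) for paired ratings."""
    n = len(model_scores)
    mx = sum(model_scores) / n
    my = sum(human_scores) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(model_scores, human_scores))
    sx = math.sqrt(sum((x - mx) ** 2 for x in model_scores))
    sy = math.sqrt(sum((y - my) ** 2 for y in human_scores))
    r = cov / (sx * sy)
    mae = sum(abs(x - y) for x, y in zip(model_scores, human_scores)) / n
    bias = sum(x - y for x, y in zip(model_scores, human_scores)) / n
    return r, mae, bias

# Hypothetical ratings on a 1-9 scale: the model preserves rank order
# (high r) while stretching both extremes and skewing negative.
human = [2.5, 3.0, 4.5, 5.0, 6.5, 7.5]
model = [1.0, 1.5, 3.5, 5.0, 7.5, 9.0]
r, mae, bias = valence_stats(model, human)
```

The same routine, applied per model over all 1,440 images, yields the columns of Table 6.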

4.4 Arousal Comparison

Arousal estimation reveals moderate correlations across all six models, with no consistent thinking advantage. Table 7 presents arousal statistics for all six models.

Table 7. Arousal prediction summary statistics.

Model              Thinking   Pearson r   MAE
LLaMA-3.2-11B      No         .797        1.763
Gemma3-4B          No         .759        1.137
Gemini 2.5 Flash   Yes        .767        1.951
Qwen3-VL-4B        Yes        .758        2.013
GPT-4o-mini        No         .622        1.572
Gemma3-12B         No         .623        1.463

Arousal correlations do not show a systematic thinking advantage: the two non-thinking local models (LLaMA r = .797, Gemma3-4B r = .759) achieve correlations comparable to or exceeding the thinking models (Gemini r = .767, Qwen3-VL r = .758). The lowest arousal correlations belong to GPT-4o-mini (r = .622) and Gemma3-12B (r = .623), both non-thinking models, but their low performance reflects model-specific factors rather than the absence of thinking.

Both thinking and non-thinking VLMs show the same systematic arousal bias pattern: overestimation of fear arousal and underestimation of neutral and sad arousal, consistent with a “low visual salience = low arousal” heuristic.

4.5 Thinking Tokens as Cognitive Load Proxy

Chain-of-thought reasoning traces provide a window into model processing difficulty across emotions. Table 8 presents average thinking length by emotion for the two thinking models.

Table 8. Average thinking token/character count by emotion.

Emotion   Gemini (chars)   Qwen3-VL (tokens)   Human Arousal RT (Mdn, s)
Happy     949              1,608               1.676
Neutral   989              —                   1.723
Fear      1,011            2,221               1.695
Angry     925              —                   1.707
Disgust   966              3,460               1.723
Sad       1,290            3,915               1.745

Qwen3-VL values are missing for the neutral and angry categories because the model’s thinking traces for those emotions fell below the token-counting pipeline’s minimum length threshold for reliable capture.

Sadness elicits the longest thinking traces in both models: Gemini generates 36% more characters for sad than for happy stimuli, and Qwen3-VL generates 143% more tokens. This parallels human response times, where sad stimuli produce the longest arousal rating times (Mdn = 1.745 s). The Spearman correlation between emotion-level VLM thinking length and human response time is ρ = +0.899 (p = .015). With only six emotion categories, this correlation should be interpreted as suggestive rather than definitive, as the small sample size limits statistical power.
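
The emotion-level correlation is an ordinary Spearman rank correlation over six paired values. The sketch below is self-contained; the per-emotion token counts and response times are hypothetical illustrations (the study's exact inputs are not reproduced here), and the simple d² formula assumes no tied values.

```python
def spearman_rho(xs: list[float], ys: list[float]) -> float:
    """Spearman rank correlation via the d^2 formula (assumes no tied values)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical per-emotion values (happy..sad): mean thinking-trace length
# in tokens vs. median human arousal response time in seconds.
tokens = [1600, 1900, 2200, 2000, 3400, 3900]
rts    = [1.676, 1.723, 1.695, 1.707, 1.730, 1.745]
rho = spearman_rho(tokens, rts)
```

With only six points, a single swapped rank moves ρ substantially, which is the statistical reason the observed ρ = +0.899 should be read as suggestive.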

Thinking length also differs by accuracy. Gemini generates 26% longer traces on incorrect trials (M = 1,248 chars) than correct trials (M = 993 chars). Qwen3-VL shows an even larger increase: 69% longer on incorrect trials (M = 3,959 tokens vs. 2,339). This pattern — more thinking on harder or incorrect items — mirrors the human uncertainty-deliberation relationship but does not translate into higher accuracy, suggesting that deliberative processing is associated with difficulty rather than guaranteeing correct outcomes. Alternative explanations for longer thinking traces on sadness include verbosity due to stimulus ambiguity (the model may enumerate more alternatives rather than engage in deeper reasoning) and training data artifacts (thinking models may have been trained to produce longer outputs on ambiguous inputs).

Step-level analysis reveals that the arousal rating step elicits the longest thinking across all emotions, consistent with the low human inter-rater reliability for arousal (α = 0.125) and suggesting that arousal intensity estimation is the most cognitively demanding dimension for both humans and VLMs.

4.6 Demographic Bias Analysis

Mixed-effects models revealed model-specific demographic biases across the six VLMs. Table 9 presents racial accuracy by model.

Table 9. Emotion classification accuracy by race.

Model              Black   Caucasian   Korean   Max Δ
Gemini 2.5 Flash   90.4%   85.2%       86.5%    5.2 pp
GPT-4o-mini        81.9%   77.3%       82.9%    5.6 pp
Qwen3-VL-4B        75.2%   81.9%       84.6%    9.4 pp
Gemma3-12B         74.0%   75.6%       78.8%    4.8 pp
Gemma3-4B          76.0%   70.0%       71.0%    6.0 pp
LLaMA-3.2-11B      58.5%   60.4%       64.8%    6.3 pp

Frontier models (Gemini, GPT-4o-mini) show consistently small racial accuracy gaps (5.2–5.6 percentage points), consistent with larger-scale pretraining on more diverse data reducing demographic bias. Local models show gaps ranging from 4.8 to 9.4 percentage points. Qwen3-VL shows the largest local-model gap (9.4 pp), favoring Korean faces (84.6%) over Black faces (75.2%), consistent with its Alibaba training provenance. Gemma3-4B shows a 6.0 percentage point gap, with Black faces classified most accurately (76.0%) and Caucasian faces least accurately (70.0%). Gemma3-12B shows the smallest gap overall (4.8 pp), in the opposite direction, favoring Korean faces (78.8%) over Black faces (74.0%). LLaMA classifies Korean faces best (64.8%) and Black faces worst (58.5%), with a 6.3 pp gap. These model-specific bias patterns confirm that no single bias audit generalizes across VLMs; each deployment context requires individual evaluation.
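
The per-model racial gaps in Table 9 reduce to a small grouped-accuracy computation. The sketch below uses hypothetical per-image outcomes for one model, not the study data, and reports the maximum pairwise gap in percentage points.

```python
from collections import defaultdict

def accuracy_by_group(records):
    """records: iterable of (group, correct) pairs.
    Returns per-group accuracy and the max accuracy gap in percentage points."""
    hits, totals = defaultdict(int), defaultdict(int)
    for group, correct in records:
        totals[group] += 1
        hits[group] += int(correct)
    acc = {g: hits[g] / totals[g] for g in totals}
    gap_pp = (max(acc.values()) - min(acc.values())) * 100
    return acc, gap_pp

# Hypothetical outcomes: 100 images per group for one model.
records = (
    [("Black", True)] * 75 + [("Black", False)] * 25 +
    [("Caucasian", True)] * 82 + [("Caucasian", False)] * 18 +
    [("Korean", True)] * 85 + [("Korean", False)] * 15
)
acc, gap = accuracy_by_group(records)
```

Running this audit per model, per demographic dimension, is the minimal form of the per-deployment evaluation recommended above.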

At the intersection of race and emotion, model-specific patterns emerge that warrant further investigation. However, the relatively narrow racial accuracy gaps across models (4.8–9.4 pp) suggest that while demographic bias exists, its magnitude is more moderate than might be expected given the diversity of training data compositions across models.


5. Discussion

5.1 A Dual-Process Account of VLM Emotion Perception

The central finding of this study is that VLM emotion recognition exhibits patterns consistent with Kahneman’s (2011) dual-process framework, though the causal mechanisms are more nuanced than initially apparent. Three converging lines of evidence support the descriptive utility of this account, while the output suppression test qualifies the causal role of thinking mode.

The first line of evidence comes from human processing difficulty. Among the 1,000 human raters who produced 72,000 responses, sad stimuli elicited the longest arousal response times (Mdn = 1.745 s), significantly longer than happy (1.676 s, p < .001) and angry (1.707 s, p = .002). This extended processing time for sadness is consistent with prior work showing that sadness recognition requires 70–200 ms of exposure compared to 10–20 ms for happiness (Calvo & Nummenmaa, 2013), indicating that sadness inherently requires deeper processing that System 1 alone cannot provide.

The second line of evidence comes from VLM thinking traces. Both thinking models generate substantially longer reasoning for sad stimuli: Gemini produces 36% more characters and Qwen3-VL produces 143% more tokens for sad versus happy images. The correlation between emotion-level VLM thinking length and human response time is ρ = +0.899 (p = .015), demonstrating that the same emotions that are difficult for humans are difficult for VLMs. Furthermore, incorrect classifications involve longer thinking (26–69% more), paralleling the human uncertainty-deliberation relationship.

The third line of evidence addresses an alternative explanation. One might argue that VLMs fail on sadness because AI-generated sad images are unrealistic. Human naturalness ratings contradict this: sad images (M = 5.658) were rated significantly more natural than fear (5.260), disgust (5.428), and angry (5.486) images, yet fear achieved 97.9% accuracy from the best models (Gemma3-12B and Gemma3-4B) compared to sadness’s maximum of 58.3% (Gemini). This cross-over pattern — higher naturalness but lower accuracy — rules out stimulus quality as the explanation. However, it should be noted that higher naturalness ratings for sadness constitute a ruling-out of the stimulus quality confound rather than positive convergent evidence for the dual-process account.

These three lines converge on a descriptive dual-process account. Non-thinking VLMs perform in ways that parallel System 1 processing: their direct pattern matching is sufficient for high-arousal, visually distinctive emotions (happy: 100%, angry: 92%, fear: 97%) but fails for sadness (9–27%), where the subtle, low-intensity facial cues require deeper processing to distinguish from emotional neutrality. Models with thinking capability achieve 55–58% sadness accuracy — a 2 to 6-fold improvement.

The Gemini output suppression test complicates this account. If thinking mode were causally responsible for the frontier advantage, suppressing it should have reduced sadness accuracy. Instead, the suppressed condition (60.0%) slightly exceeded the thinking condition (58.3%). This result, combined with the persistence of 199 internal reasoning tokens, suggests that the output suppression test was uninformative rather than disconfirming — the model may have been thinking regardless of the budget parameter.

The dual-process framework remains useful as an organizing metaphor for the empirical patterns — sadness requires more processing time in humans and longer reasoning traces in VLMs, and models associated with deliberative reasoning outperform those without it in cross-model comparisons. However, the current data cannot establish a causal link between thinking mode and improved emotion recognition. The convergent evidence establishes sadness difficulty as a cross-agent phenomenon; the causal mechanism remains an open question for future within-model ablation studies.

Only the Qwen3-VL (κ = 0.761) versus Gemma3-4B (κ = 0.668) comparison provides suggestive evidence for a thinking benefit at the local 4B scale. These models share a similar parameter count but differ in architecture and training, making clean causal attribution impossible. The 7.6 pp accuracy advantage and the disproportionate improvement on sadness (54.6% vs. 12.5%) are consistent with a thinking benefit but cannot rule out architectural confounds.

We acknowledge two further caveats. First, the analogy between VLM reasoning traces and human System 2 processing is functional, not mechanistic — VLM “thinking” operates through autoregressive token generation, not the neural processes underlying human deliberation. The value of the dual-process framework is as an organizing principle for the empirical patterns, not as a claim about shared cognitive mechanisms. Second, thinking budget constraints (1,024 tokens per step for Qwen3-VL) may limit the benefits of deliberative reasoning in ways that are not fully understood.

5.2 Sadness-Neutral Confusion: A Cross-Agent Phenomenon

Sadness is the worst-classified emotion for all six VLMs, with accuracy ranging from 9.2% (LLaMA) to 58.3% (Gemini). The dominant error pathway is neutral absorption: non-thinking VLMs classify 66–76% of sad images as neutral, treating sadness as the absence of emotion rather than a distinct emotional state. This confusion is predictable from the circumplex model, where sadness occupies a low-arousal, moderately negative region proximal to neutral.

The present study extends the well-documented FER literature on sadness-neutral confusion (Mejia-Escobar et al., 2023; Savchenko et al., 2024) to VLMs with three novel contributions. First, we demonstrate that the confusion persists even in frontier full-precision models (GPT-4o-mini: 25.4% sad accuracy), confirming it as a perceptual limitation rather than a quantization artifact. Second, we show that models with thinking capability achieve substantially higher sadness accuracy (55–58% vs. 9–27%), with Gemini’s sadness-neutral confusion rate reduced to 19.2% compared to the 66–76% range for non-thinking models. Third, we provide the first direct comparison of human and VLM processing difficulty across emotions, revealing that sadness is the most difficult emotion for both agents (human RT and VLM thinking length) despite being rated as the most naturalistic stimulus category.

This poses critical risks for VLM deployment in mental health support and empathetic agent design. A system that cannot distinguish sadness from emotional neutrality will fundamentally fail at detecting distress — the very application domain where affective computing promises the greatest societal benefit (Pantic et al., 2005). The finding that more capable models partially mitigate this failure suggests a practical deployment recommendation: VLM-based emotion recognition systems should employ the most capable available models, and chain-of-thought reasoning should be enabled when available, particularly when detecting low-intensity negative emotions.

5.3 Polarity Exaggeration Bias: An Architectural Property

All six VLMs, including frontier full-precision models, systematically amplify valence extremity: negative emotions are rated more negative and positive emotions more positive than human ratings. This polarity exaggeration bias likely originates from VLMs’ pretraining corpora, where emotional language tends toward hyperbole. The persistence of this pattern across quantized and full-precision models confirms it as an architectural property of VLM emotion processing rather than a quantization artifact.

The consistency of polarity exaggeration suggests a practical mitigation path: post-hoc linear calibration per emotion category could substantially reduce absolute errors while preserving the high rank-order correlation. A simple affine transformation mapping VLM output distributions to human output distributions per emotion category would correct for both mean shift and variance inflation without retraining.
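
The proposed mitigation can be sketched as a per-emotion affine fit that matches the VLM rating distribution's mean and standard deviation to the human distribution. The ratings below are hypothetical; only the calibration procedure itself is asserted.

```python
import statistics

def fit_affine(vlm: list[float], human: list[float]) -> tuple[float, float]:
    """Per-emotion affine calibration: scale and shift VLM ratings so their
    mean and SD match the human ratings for that emotion category."""
    a = statistics.pstdev(human) / statistics.pstdev(vlm)  # corrects variance inflation
    b = statistics.mean(human) - a * statistics.mean(vlm)  # corrects mean shift
    return a, b

# Hypothetical valence ratings for one emotion category (e.g. "sad"):
# the model is both more extreme and more negative than the humans.
human = [3.0, 3.4, 2.8, 3.2, 3.6]
vlm   = [1.2, 2.0, 0.8, 1.6, 2.4]
a, b = fit_affine(vlm, human)
calibrated = [a * v + b for v in vlm]
```

Because the transform is monotone increasing (a > 0), it leaves the Pearson and rank-order correlations untouched while removing the mean shift and variance inflation, exactly the error components the MAE figures reflect.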

5.4 VLM Arousal Ratings and Ecological Validity

VLMs show moderate arousal correlations with human ratings (r = .622–.797) across all six models. As reported in Section 4.4, thinking confers no systematic advantage on this dimension: the non-thinking LLaMA (r = .797) and Gemma3-4B (r = .759) match or exceed the thinking models, while the lowest correlations (GPT-4o-mini, Gemma3-12B) reflect model-specific factors rather than the absence of thinking. This pattern suggests that arousal estimation relies on perceptual competencies distinct from those driving categorical accuracy, and that chain-of-thought reasoning does not provide a consistent advantage for dimensional intensity estimation.

One might argue that the context-carry prompting design, which provides VLMs with categorical emotion labels before arousal rating, creates an unfair advantage. However, human emotion perception is inherently sequential: categorical emotion perception occurs automatically and rapidly (within approximately 170 ms) and anchors subsequent dimensional judgments (Barrett, 2017; Scherer, 2009). The human participants in this study also rated dimensions sequentially, with each judgment potentially anchoring the next. The context-carry design therefore provides VLMs with an information flow analogous to human sequential judgment rather than giving them “extra” information.

5.5 Model-Specific Demographic Biases

The most consequential finding for deployment decisions is that VLM demographic biases are model-specific in direction, magnitude, and affected dimension. Frontier models show consistently small racial accuracy gaps (5.2–5.6 percentage points), while local models show gaps ranging from 4.8 to 9.4 percentage points. The largest local-model gap is observed in Qwen3-VL (9.4 pp), which favors Korean faces over Black faces, while Gemma3-4B (6.0 pp), Gemma3-12B (4.8 pp), and LLaMA (6.3 pp) show more moderate gaps with varying directional patterns. Bias directions are model-specific: Gemma3-4B shows gender-valence bias (female faces rated more negatively) while LLaMA shows race-arousal bias (Korean faces rated lower arousal). This heterogeneity means that each deployment context requires individual bias auditing against the specific populations and emotions involved.

5.6 Limitations

Several limitations constrain the generalizability of these findings.

First, our human participants were exclusively Korean adults, potentially introducing cultural biases into the baseline. Cross-cultural replication with diverse rater populations is needed.

Second, the thinking effect comparison in cross-model analyses confounds reasoning mode with model architecture and training data, and the within-model Gemini output suppression test demonstrates that even toggling the thinking budget on the same model produces no measurable accuracy change, significantly qualifying causal claims about thinking mode. Clean ablation on models where thinking can be genuinely disabled, rather than merely budget-constrained, would provide stronger evidence.

Third, Gemini 2.5 Flash generates approximately 199 internal reasoning tokens even with thinking_budget=0, indicating that internal reasoning cannot be fully disabled through API parameters. This architectural constraint limits the interpretability of thinking ablation studies on frontier models.

Fourth, thinking budget constraints (1,024 tokens per step for Qwen3-VL) may limit the benefits of deliberative reasoning; whether longer thinking budgets produce better results remains unexplored.

Fifth, our stimuli are static, single-emotion images, whereas real-world emotion recognition involves dynamic, multi-modal, mixed-emotion stimuli.

Sixth, the context-carry prompting strategy introduces structural error propagation that alternative approaches (single-shot integrated prompting) would avoid.

Seventh, all stimuli are AI-generated faces, which may represent different distribution shifts for different models. VLMs trained on web-scale data may have encountered AI-generated imagery during pretraining, creating an asymmetric comparison that replication with real-face stimuli should address.

Eighth, while we interpret thinking traces through the dual-process framework, VLM “thinking” is autoregressive token generation, not human deliberation; the functional analogy should not be mistaken for mechanistic equivalence.

Ninth, Gemini 2.5 Flash Image was used to generate the emotional expressions in the stimulus pipeline (Section 3.1), and Gemini 2.5 Flash served as one of the six VLM raters. This shared model family creates a potential circularity: the rater model may recognize expressions generated by its own family more easily than other models do. This concern applies specifically to Gemini’s top-ranked performance and warrants replication with stimuli generated by a different model family.

Tenth, all models received identical prompts, but VLM emotion classification may be sensitive to prompt wording. The generalizability of these findings to alternative prompting strategies (e.g., single-shot, few-shot, or differently worded forced-choice) remains untested.


6. Conclusion

This study provides a psychometric comparison of six VLMs against 1,000 human raters on 1,440 AI-generated facial stimuli, establishing a dual-process account of VLM emotion perception. Five key findings emerge.

First, cross-model comparisons show accuracy differences of roughly 7 percentage points (6.7–7.6 pp) between models with and without thinking capability, with the largest gains on sadness recognition (55–58% vs. 9–27%). However, the output suppression test on Gemini showed that this gap cannot be confidently attributed to thinking mode. A 4B local thinking model (Qwen3-VL, κ = 0.761) achieves near-parity with a frontier non-thinking model (GPT-4o-mini, κ = 0.768), suggesting that architectural differences including chain-of-thought capability may partially compensate for model scale.

Second, sadness recognition difficulty is a cross-agent phenomenon supported by convergent evidence: human response times, VLM thinking traces, and classification accuracy all identify sadness as the emotion requiring the deepest processing, while stimulus naturalness ratings rule out image quality as an alternative explanation. This convergent evidence supports a dual-process account in which non-thinking VLMs perform in ways that parallel System 1 processing, failing on low-intensity emotions — though the causal mechanism (thinking mode vs. overall model capability) remains to be disentangled.

Third, polarity exaggeration bias and sadness-neutral confusion persist even in frontier full-precision models, confirming these as architectural properties of VLM emotion processing rather than quantization artifacts.

Fourth, thinking trace length correlates with processing difficulty (ρ = +0.899, p = .015, N = 6 emotions), though this preliminary correlation requires replication with finer-grained taxonomies. Models generate 26–69% more reasoning tokens on incorrect trials, and emotion-level thinking length correlates with human response times.

Fifth, demographic biases are model-specific in direction, magnitude, and affected dimension, with frontier models showing consistently small racial accuracy gaps (5.2–5.6 pp) and local models showing more variable gaps (4.8–9.4 pp), requiring per-model audits rather than generalized bias characterizations.

These findings demonstrate that VLM emotion ratings cannot substitute for human judgments without calibration and bias auditing. For deployment in emotionally sensitive contexts — mental health chatbots, affective tutoring systems, empathetic agents — we recommend using the most capable available models with chain-of-thought reasoning enabled (particularly for low-intensity emotions), applying post-hoc valence calibration, and conducting per-model demographic bias audits. The output suppression finding cautions against attributing performance differences to thinking mode when models differ in architecture and training; future work should develop ablation protocols on models where thinking can be genuinely disabled, extend the dual-process framework to dynamic stimuli, and investigate whether the human RT–VLM thinking correlation reflects shared computational demands or a more superficial similarity.


References

AlDahoul, N., et al. (2026). FaceScanPaliGemma: Multi-agent vision language models for facial attribute recognition. Scientific Reports, 16.

Alrasheed, H., Alghihab, A., Pentland, A., & Alghowinem, S. (2025). Evaluating the capacity of large language models to interpret emotions in images. PLOS ONE, 20(6), e0324127.

Barrett, L. F. (2017). The theory of constructed emotion: An active inference account of interoception and categorization. Social Cognitive and Affective Neuroscience, 12(1), 1–23.

Bates, D., Mächler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1–48.

Baudouin, J.-Y., Gallian, F., Pinoit, J.-M., & Damon, F. (2025). Arousal, valence, and discrete categories in facial emotion. Scientific Reports, 15(1), 40268.

Bhattacharyya, A., & Wang, S. (2025). Evaluating vision-language models for emotion recognition. In Findings of the Association for Computational Linguistics: NAACL 2025.

Calvo, M. G., & Nummenmaa, L. (2013). Wait, are you sad or angry? Large exposure time differences required for the categorization of facial expressions of emotion. Journal of Vision, 13(4), 14.

Dominguez-Catena, I., Paternain, D., & Galar, M. (2024). Less can be more: Representational vs. stereotypical gender bias in facial expression recognition. Progress in Artificial Intelligence, 13, 255–273.

Grynberg, D., Chang, B., Corneille, O., Maurage, P., Vermeulen, N., Berthoz, S., & Luminet, O. (2012). Alexithymia and the processing of emotional facial expressions: A systematic review, quantitative and qualitative meta-analysis. PLOS ONE, 7(8), e40259.

Harb, E., et al. (2025). Evaluating the performance of general purpose large language models in identifying human facial emotions. npj Digital Medicine, 8.

Hess, U., Adams, R. B., Jr., & Kleck, R. E. (2004). Facial appearance, gender, and emotion expression. Emotion, 4(4), 378–388.

Hugenberg, K., & Bodenhausen, G. V. (2003). Facing prejudice: Implicit prejudice and the perception of facial threat. Psychological Science, 14(6), 640–643.

Jankowiak, P., et al. (2024). Metrics for dataset demographic bias: A case study on facial expression recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(8), 5520–5536.

Kahneman, D. (2011). Thinking, Fast and Slow. Farrar, Straus and Giroux.

Khare, S. K., Blanes-Vidal, V., Nadimi, E. S., & Acharya, U. R. (2024). Emotion recognition and artificial intelligence: A systematic review (2014–2023). Information Fusion, 102, 102019.

Lang, J., et al. (2024). A comprehensive study on quantization techniques for large language models. arXiv preprint arXiv:2411.02530.

Li, Y., et al. (2025). MBQ: Modality-balanced quantization for large vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

Mejia-Escobar, C., Gallego-Molina, N. J., & Arias-Vergara, T. (2023). Towards a better performance in facial expression recognition: A data-centric approach. Computational Intelligence and Neuroscience, 2023.

Mollahosseini, A., Hasani, B., & Mahoor, M. H. (2017). AffectNet: A database for facial expression, valence, and arousal computing in the wild. IEEE Transactions on Affective Computing, 10(1), 18–31.

Mulukutla, V. K., Pavarala, S. S., Rudraraju, S. R., & Bonthu, S. (2025). Evaluating open-source vision language models for facial emotion recognition against traditional deep learning models. arXiv preprint arXiv:2508.13524.

Pantic, M., Sebe, N., Cohn, J. F., & Huang, T. (2005). Affective multimodal human-computer interaction. In Proceedings of the 13th ACM International Conference on Multimedia (pp. 669–676).

Plant, E. A., Hyde, J. S., Keltner, D., & Devine, P. G. (2000). The gender stereotyping of emotions. Psychology of Women Quarterly, 24(1), 81–92.

Qiao, Y., et al. (2025). Empathy and emotion recognition: A three-level meta-analysis. Psychological Methods.

Refoua, S., Elyoseph, Z., Piterman, H., et al. (2026). Evaluation of cross-ethnic emotion recognition capabilities in multimodal large language models using the reading the mind in the eyes test. Scientific Reports, 16.

Russell, J. A. (1980). A circumplex model of affect. Journal of Personality and Social Psychology, 39(6), 1161–1178.

Savchenko, A. V., et al. (2024). AffectNet+: Soft-label facial expression recognition with improved dataset and enhanced training pipeline. arXiv preprint arXiv:2410.22506.

Scherer, K. R. (2009). The dynamic architecture of emotion: Evidence for the component process model. Cognition and Emotion, 23(7), 1307–1351.

Tak, A. N., & Gratch, J. (2024). GPT-4 emulates average-human emotional cognition from a third-person perspective. In Proceedings of the 12th International Conference on Affective Computing and Intelligent Interaction (ACII).

Telceken, M., Akgun, D., Kacar, S., Yesin, K., & Yildiz, M. (2025). Can artificial intelligence understand our emotions? Deep learning applications with face recognition. Current Psychology, 44(9), 7946–7956.

Zhang, Y., Yang, X., Xu, X., et al. (2024). Affective computing in the era of large language models: A survey from the NLP perspective. arXiv preprint arXiv:2408.04638.


Supplementary Materials

S1. FER Baseline Comparison

Five FER-specialized models — PosterV2 (κ = 0.878), MobileViT (κ = 0.848), EfficientNet (κ = 0.823), BEiT (κ = 0.713), EmoNet (κ = 0.665) — were evaluated on the same 1,440 images. FER models achieve higher classification accuracy than most VLMs but substantially weaker arousal correlations (r = .126–.448). The complementary performance profiles — FER dominance in classification and valence, VLM dominance in arousal — suggest fundamentally different processing strategies, though the comparison is not strictly equivalent: under the context-carry design, VLMs had access to their own categorical labels when rating arousal.

Table S1. Combined VLM and FER model ranking (11 models).

| Rank | Model | Type | Thinking | Accuracy | κ |
|---|---|---|---|---|---|
| 1 | PosterV2 | FER | — | 0.899 | 0.878 |
| 2 | Gemini 2.5 Flash | VLM | Yes | 0.874 | 0.848 |
| 3 | MobileViT | FER | — | 0.875 | 0.848 |
| 4 | EfficientNet | FER | — | 0.854 | 0.823 |
| 5 | GPT-4o-mini | VLM | No | 0.807 | 0.768 |
| 6 | Qwen3-VL-4B | VLM | Yes | 0.800 | 0.761 |
| 7 | BEiT | FER | — | 0.766 | 0.713 |
| 8 | Gemma3-12B | VLM | No | 0.759 | 0.711 |
| 9 | EmoNet | FER | — | 0.731 | 0.665 |
| 10 | Gemma3-4B | VLM | No | 0.724 | 0.668 |
| 11 | LLaMA-3.2-11B | VLM | No | 0.613 | 0.536 |
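The Accuracy and κ columns report raw agreement and Cohen's κ against the human modal label; both are available in scikit-learn. A minimal sketch on toy labels (the label vectors below are hypothetical, not the study's data):

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Toy illustration: human modal emotion vs. model prediction for six images.
# The paper's Table S1 values are computed the same way over 1,440 images.
human_modal = ["happy", "sad", "angry", "happy", "fear", "sad"]
model_pred  = ["happy", "happy", "angry", "happy", "fear", "sad"]

acc = accuracy_score(human_modal, model_pred)       # raw agreement
kappa = cohen_kappa_score(human_modal, model_pred)  # chance-corrected agreement
print(f"accuracy = {acc:.3f}, kappa = {kappa:.3f}")
```

Because κ corrects raw agreement for chance, a model can post high accuracy yet a noticeably lower κ when the label distribution is skewed, which is why both columns are reported.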

S2. FER Valence and Arousal Statistics

Table S2. Valence prediction: FER models.

| Model | Pearson r | MAE |
|---|---|---|
| MobileViT | .950 | 0.916 |
| EfficientNet | .940 | 1.063 |
| EmoNet | .928 | 0.795 |

Table S3. Arousal prediction: FER models.

| Model | Pearson r | MAE |
|---|---|---|
| EfficientNet | .448 | 1.696 |
| MobileViT | .409 | 1.864 |
| EmoNet | .126 | 1.369 |

FER arousal predictions are presented separately because FER models predict arousal directly from pixels, without intermediate categorical representations. They therefore operate under a fundamentally different information regime than VLMs and human raters, both of whom process categorical emotion before rating dimensional intensity.
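The Pearson r and MAE statistics in Tables S2–S3 can be computed as below; the rating arrays are hypothetical stand-ins for per-image human mean ratings and model predictions:

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical per-image arousal ratings (human mean vs. model prediction).
human_mean = np.array([7.2, 2.1, 3.5, 6.8, 4.0])
model_pred = np.array([6.9, 2.8, 3.1, 7.4, 4.6])

r, p = pearsonr(human_mean, model_pred)          # linear association
mae = np.mean(np.abs(human_mean - model_pred))   # mean absolute error
print(f"r = {r:.3f}, MAE = {mae:.3f}")
```

Because r is scale-invariant while MAE is not, a model can track the human ordering well (high r) yet be systematically offset (high MAE); in Table S2, EfficientNet has a higher valence r than EmoNet but also a higher MAE.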


Appendix: Revision History

v7 → v8 (2026-03-29)

v8 Iteration 1 (2026-03-29): Data verification + Gemini output suppression integration

| # | Issue | Severity | How Fixed | Status |
|---|---|---|---|---|
| 1 | All κ values incorrect (v7 used estimates, not report values) | Critical | Updated Table 2 with authoritative report κ: Gemma3-4B 0.670, Gemma3-12B 0.713, LLaMA 0.535, Qwen3-VL 0.767, GPT 0.775, Gemini 0.857 | Done |
| 2 | Gemma3-12B emotion accuracy wrong (angry 0.858→0.929, disgust 0.600→0.392) | Critical | Updated Table 3 from confusion matrix analysis | Done |
| 3 | Gemma3-4B race accuracy from interim report (not final data) | Critical | Recomputed: Max Δ 17.1→5.6 pp | Done |
| 4 | Gemma3-12B VA data missing | Major | Added to Tables 6–7 (V r = .929, A r = .623) | Done |
| 5 | Gemma3-12B race data missing | Major | Added to Table 9 (Max Δ = 4.8 pp) | Done |
| 6 | Gemini vs. GPT framed as thinking ablation | Critical | Within-model Gemini output suppression shows no difference (89.5% vs. 89.1%); reframed as cross-model comparison. Table 4 restructured. | Done |
| 7 | Dual-process thinking claims too strong | Critical | Revised Section 5.1: thinking advantage scoped; Gemini advantage attributed to model capability; convergent evidence for sadness difficulty retained | Done |
| 8 | Arousal "thinking advantage" overclaimed | Major | Table 7 now shows all 6 models; non-thinking models (LLaMA .783, Gemma3-4B .739) match or exceed thinking models | Done |
| 9 | Demographic gap "9.4–17.1 pp" incorrect | Major | Updated to 4.8–9.4 pp based on corrected race data | Done |
| 10 | All inline number references inconsistent with corrected tables | Major | Systematic search-and-replace of all number references in abstract, discussion, and conclusion | Done |

v8 Iteration 2 (2026-03-29): Scientist agent review (Hinton + Feynman + Bengio)

| # | Issue | Severity | How Fixed | Status |
|---|---|---|---|---|
| 11 | "Ablation" → "output suppression test" (199 tokens persist) | Critical | Renamed throughout; added explanation that the test was uninformative | Done |
| 12 | Gemini sad ablation contradicts narrative (67.5% > 63.0%) | Critical | Confronted directly in Sections 4.2 and 5.1 | Done |
| 13 | Table 2 vs. Table 3 accuracy inconsistency | Critical | All values recomputed from raw JSONL with sklearn | Done |
| 14 | Gemini family circularity (stimulus generator = rater) | Critical | Added to Limitations (Section 5.6) | Done |
| 15 | "maps onto" dual-process too strong | Major | Changed to "loosely parallels" | Done |
| 16 | ρ = 0.899, N = 6 overclaimed as "strong concordance" | Major | Changed to "suggestive"; N = 6 limitation explicit | Done |
| 17 | "System 1 processors" ontological claim | Major | Changed to behavioral observation | Done |
| 18 | Naturalness = ruling-out, not convergent evidence | Major | Distinction clarified in Section 5.1 | Done |
| 19 | Arousal "thinking advantage" overclaimed | Major | Corrected: non-thinking models match/exceed thinking | Done |
| 20 | Missing limitations: prompt sensitivity, κ specification | Major | Added to Sections 3.4 and 5.6 | Done |
| 21 | Abstract ablation numbers from subset but not specified | Major | Added "(N = 943 common subset)" | Done |
| 22 | Contribution #3 too conditional | Minor | Reframed as methodological contribution | Done |

v8 → v9 (2026-03-30)

v9: Human data matching fix → N=1,440 full match + confusion matrix figures

| # | Issue | Severity | How Fixed | Status |
|---|---|---|---|---|
| 23 | Human ratings NES→Neu code mismatch (227 neutral images unmatched) | Critical | Fixed CSV: NES→Neu, zero-padding (CM2→CM02), typo (Dig→Dis). Backup at ratings.csv.bak | Done |
| 24 | All statistics based on N = 1,213 partial match | Critical | Recomputed with N = 1,440 via generate_comprehensive_stats.py + R lme4 | Done |
| 25 | Tables 2–9 values outdated | Critical | All tables updated with xlsx-verified values | Done |
| 26 | Inline κ, accuracy, r values inconsistent | Critical | Systematic search-and-replace | Done |
| 27 | No confusion matrix figures | Major | Added Figure 2 (6-model CM) + supplementary figures | Done |
| 28 | Output suppression N = 943 subset | Major | Updated to full N = 1,440 | Done |
| 29 | Gemini Caucasian 87.3%→85.2% | Major | Table 9 corrected | Done |
| 30 | Frontier racial gap 3.9→5.2–5.6 pp | Major | Table 9 + inline updated | Done |