Do Vision Language Models See Emotions Like Humans? A Dual-Process Account of VLM Emotion Perception on AI-Generated Facial Stimuli

Authors: Jini Tae, Ju-Hyeon Park, Wonil Choi

Affiliation: Gwangju Institute of Science and Technology (GIST), South Korea


Abstract

Vision Language Models (VLMs) are increasingly deployed as scalable substitutes for human emotion annotation, yet their alignment with human emotion perception remains poorly understood beyond categorical accuracy. This study introduces a psychometric framework that treats VLMs as additional raters in a human emotion rating paradigm, comparing six VLMs — three local open-source models (Gemma3-4B-IT, Gemma3-12B-IT, LLaMA-3.2-11B-Vision) and two frontier API models (GPT-4o-mini, Gemini 2.5 Flash), plus one local thinking model (Qwen3-VL-4B-Thinking) — against 1,000 human participants on 1,440 AI-generated facial images balanced across three races (Black, Caucasian, Korean), two genders, and six basic emotions. Using Cohen’s κ, Pearson correlation, MAE, and mixed-effects models, we evaluate categorical agreement, dimensional alignment (valence and arousal), and demographic bias against human inter-rater reliability as a benchmark.

The six VLMs span moderate-to-almost-perfect categorical agreement (κ = 0.458–0.855), with chain-of-thought thinking models consistently outperforming non-thinking counterparts by 7–8 percentage points in accuracy. The largest thinking gains appear on sadness recognition, where thinking models achieve 55–58% accuracy compared to 9–25% for non-thinking models. We provide three lines of convergent evidence that sadness recognition difficulty is a cross-agent phenomenon requiring deliberative processing: (1) human raters show the longest response times for sad stimuli (Mdn = 1.745 s for arousal), (2) VLM thinking models generate 31–143% longer reasoning traces for sad versus happy stimuli, and (3) sad stimuli receive higher human naturalness ratings than fear, disgust, and angry stimuli, ruling out stimulus quality as an explanation. These findings converge on a dual-process account (Kahneman, 2011): non-thinking VLMs operate as System 1 processors that fail on low-intensity emotions, while thinking VLMs engage System 2 deliberation that partially compensates for this limitation. A 4B local thinking model (Qwen3-VL, κ = 0.764) achieves performance parity with a frontier non-thinking model (GPT-4o-mini, κ = 0.766), demonstrating that explicit reasoning partially compensates for model scale.

Valence correlations are high (r = .891–.963) but absolute errors are large (MAE = 1.45–1.84) due to a polarity exaggeration bias that persists even in frontier full-precision models, confirming this as an architectural rather than quantization-induced limitation. Thinking models show higher arousal correlations (r = .758–.767) than the frontier non-thinking model (r = .622), suggesting that chain-of-thought reasoning supports dimensional emotion estimation. Demographic bias patterns are model-specific, with frontier models showing smaller racial accuracy gaps (3.9 percentage points) than local models (9.4–17.1 percentage points).

Keywords: Vision Language Models, Facial Emotion Recognition, Psychometric Agreement, Dual-Process Theory, Chain-of-Thought Reasoning, Valence-Arousal, Demographic Bias, AI-Generated Faces, Affective Computing


1. Introduction

1.1 Affective Computing and the Promise of VLMs

The deployment of affective computing systems — from mental health chatbots to responsive virtual assistants — increasingly depends on accurate automatic emotion recognition from facial expressions. The efficacy of such systems hinges on affective alignment, defined as the degree to which a machine’s interpretation of emotional cues matches human psychological standards (Pantic et al., 2005). When an empathetic agent misinterprets the intensity of a user’s distress, it risks eroding user trust and failing to sustain meaningful interaction. These stakes motivate rigorous empirical comparison between machine and human emotion perception.

Vision Language Models (VLMs) represent a paradigm shift from task-specific facial expression recognition (FER) models to general-purpose multimodal systems. A VLM is a model that integrates a vision encoder with a large language model, enabling image-conditioned text generation through natural language prompting. Whereas FER-specialized models are trained end-to-end on emotion-labeled datasets and output fixed emotion categories or continuous valence-arousal values, VLMs can flexibly produce both categorical and dimensional emotion ratings through instruction prompting — a capability that mirrors the integrated judgment process humans naturally employ. This flexibility raises the possibility that VLMs might serve as scalable substitutes for costly human emotion annotation, where collecting 72,000 responses from 1,000 raters represents a significant time and financial investment.

To evaluate whether VLMs truly perceive emotions as humans do, a dimensional measurement framework is required. The Circumplex Model of Affect (Russell, 1980) is a theoretical framework that maps all emotional experiences onto a continuous two-dimensional space defined by valence and arousal. Valence is the hedonic quality of an emotional experience, ranging from unpleasant to pleasant. Arousal is the degree of physiological activation, ranging from calm to excited. While the circumplex model was originally formulated for self-reported affective experience, it has been widely adopted for characterizing observer-rated facial expression perception (Baudouin et al., 2025). We follow this convention while noting that perceived emotion in others and felt emotion in oneself may involve distinct processes. This dimensional framework provides a richer representational vocabulary than categorical classification alone, enabling detection of subtle perceptual misalignments that discrete labels would obscure. Despite the theoretical importance of dimensional ratings, computational evaluations of emotion recognition have overwhelmingly focused on discrete category accuracy (Khare et al., 2024; Telceken et al., 2025).
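As a concrete illustration of this dimensional vocabulary, the six emotion categories used in this study can be placed in valence-arousal space. The coordinates in the sketch below are hypothetical midpoints on the study's 1–9 scales, chosen only to illustrate the circumplex layout; they are not estimates from the present data.

```python
# Illustrative placement of the six study emotions in Russell's
# valence-arousal space, on 1-9 scales with 5 as the midpoint.
# Coordinates are hypothetical, for illustration only.
CIRCUMPLEX = {
    #            (valence, arousal)
    "happy":   (8, 6),   # pleasant, moderately activated
    "sad":     (2, 3),   # unpleasant, low activation
    "angry":   (2, 7),   # unpleasant, high activation
    "fear":    (2, 8),   # unpleasant, very high activation
    "disgust": (2, 6),   # unpleasant, moderate activation
    "neutral": (5, 4),   # near the midpoint of both axes
}

def quadrant(emotion):
    """Return the circumplex region for an emotion label."""
    v, a = CIRCUMPLEX[emotion]
    side_v = "pleasant" if v > 5 else "unpleasant"
    side_a = "activated" if a > 5 else "deactivated"
    return f"{side_v}-{side_a}"
```

The point of the two-dimensional code is exactly the point made above: "sad" and "angry" share a valence but separate cleanly on arousal, a distinction a six-way categorical label cannot express.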

1.2 The Evaluation Gap

Although this dimensional framework is well established, current VLM evaluations rarely employ it, leaving four critical gaps that this study addresses.

The first gap concerns the absence of a human agreement benchmark. Existing benchmarks rely on accuracy and F1 scores against ground-truth labels while ignoring substantial disagreement among human raters. Human emotion perception is inherently variable — particularly for arousal, where inter-rater reliability can be as low as Krippendorff’s α = 0.125 (present study). Krippendorff’s α is a reliability coefficient for multiple raters that corrects for chance agreement, where 1.0 indicates perfect consensus and 0.0 indicates chance-level agreement. Without establishing human inter-rater reliability as a benchmark, it is impossible to determine whether a model’s errors reflect genuine failure or simply mirror the inherent subjectivity of emotion perception.
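The chance-correction logic behind Krippendorff's α can be sketched in a few lines. The minimal version below implements the nominal variant for brevity; the ordinal variant used in this study differs only in its disagreement weights (squared rank distances rather than the 0/1 weights here).

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    `units` maps each rated item to the list of labels it received
    (raters need not overlap across items). alpha = 1 - D_o / D_e,
    the observed over the chance-expected disagreement.
    """
    coincidence = Counter()
    for labels in units.values():
        m = len(labels)
        if m < 2:
            continue  # a single rating carries no pairing information
        for i, j in permutations(range(m), 2):
            coincidence[(labels[i], labels[j])] += 1 / (m - 1)
    n = sum(coincidence.values())
    marginals = Counter()
    for (c, _), w in coincidence.items():
        marginals[c] += w
    # Nominal delta: disagreement weight 1 iff the two labels differ.
    d_o = sum(w for (c, k), w in coincidence.items() if c != k) / n
    d_e = sum(marginals[c] * marginals[k]
              for c in marginals for k in marginals if c != k) / (n * (n - 1))
    return 1.0 if d_e == 0 else 1 - d_o / d_e
```

Perfect consensus yields 1.0; systematic disagreement can drive the coefficient below zero, which is why values such as α = 0.125 signal near-chance consistency rather than a bug.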

The second gap is the exclusive focus on categorical accuracy, neglecting continuous dimensional ratings central to affective science. A model may achieve perfect categorical accuracy while producing systematically distorted dimensional ratings — a dissociation we demonstrate empirically in the present study.

The third gap concerns the absence of demographic bias audits for VLMs. While demographic disparities have been documented in commercial FER APIs (Rhue, 2018; Jankowiak et al., 2024), systematic bias analysis of VLMs across race-gender-emotion intersections remains absent. This gap is concerning given the rapid adoption of VLMs in research and applied settings where fairness guarantees are critical.

The fourth gap is the absence of any investigation into how reasoning mode affects emotion perception. Recent VLMs can operate in two modes: standard inference, which generates responses directly, and chain-of-thought (CoT) thinking mode, which produces explicit reasoning traces before responding. This distinction maps onto Kahneman’s (2011) dual-process theory, where System 1 operates through fast, automatic pattern recognition and System 2 through slow, deliberative reasoning. Whether this architectural distinction in VLMs produces measurable differences in emotion recognition — particularly for perceptually ambiguous emotions — has not been systematically investigated.

1.3 Contributions and Research Questions

This paper makes five contributions to the intersection of affective computing, cognitive psychology, and multimodal AI evaluation.

First, we introduce a VLM-as-rater psychometric framework that treats VLMs as additional participants in a human rating paradigm. Rather than evaluating VLMs against ground-truth labels using accuracy and F1, we employ Cohen’s κ, Pearson correlation, MAE, and mixed-effects models to quantify agreement against human inter-rater reliability as an empirical agreement benchmark. Cohen’s κ is a chance-corrected agreement measure for categorical classification, where 0 indicates chance-level and 1 indicates perfect agreement. This framework reveals dimensions of VLM behavior — polarity exaggeration, dimensional collapse, sadness-neutral confusion — that accuracy-based evaluations entirely miss.
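The statistic can be computed directly from two label sequences; in the VLM-as-rater framing, one sequence is the intended stimulus label and the other the model's classification. A minimal sketch:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected categorical agreement between two raters.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
    proportion of agreement and p_e the agreement expected from the
    raters' marginal label frequencies alone.
    """
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    return (p_o - p_e) / (1 - p_e)
```

Because p_e rises when both raters over-use the same label, κ penalizes exactly the failure mode raw accuracy hides, e.g. a model that absorbs ambiguous faces into "neutral".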

Second, we provide the first convergent evidence that sadness recognition difficulty is a cross-agent phenomenon. Three independent lines of evidence — human response times (N = 1,000, 72,000 responses), VLM thinking traces (two thinking models × 1,440 images), and stimulus naturalness ratings — all identify sadness as the emotion requiring the deepest processing. This convergent evidence supports a dual-process account of emotion recognition in which non-thinking VLMs function as System 1 processors that fail on low-intensity emotions, while thinking VLMs engage System 2 deliberation that partially compensates for this difficulty.

Third, we demonstrate that chain-of-thought thinking consistently improves emotion classification by 7–8 percentage points across both local (4B) and frontier model pairs, with a 4B local thinking model (Qwen3-VL, κ = 0.764) achieving performance parity with a frontier non-thinking model (GPT-4o-mini, κ = 0.766). This finding suggests that explicit reasoning partially compensates for model scale and quantization constraints.

Fourth, we present among the first systematic demographic bias analyses of VLMs across a fully crossed 3 (race: Black, Caucasian, Korean) × 2 (gender: Male, Female) × 6 (emotion) factorial stimulus design with 1,440 AI-generated face images ensuring perfect experimental control.

Fifth, we introduce thinking token analysis as a cognitive load proxy, demonstrating that VLM reasoning traces parallel human processing difficulty: models generate 26–69% more reasoning tokens on incorrect trials, and the emotion with the longest reasoning traces (sadness) is also the emotion with the longest human response times (ρ = +0.899, p = .015).
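The reported rank correlation is a Spearman ρ, i.e. a Pearson correlation computed on rank-transformed per-emotion values (here, mean reasoning length versus mean human response time across the six emotions). A self-contained sketch with average ranks for ties:

```python
def spearman_rho(x, y):
    """Spearman rank correlation: Pearson r on rank-transformed data."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        i = 0
        while i < len(order):
            j = i
            # extend over tied values and assign their average rank
            while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

Rank transformation makes the coefficient insensitive to the very different units involved (tokens versus seconds), which is why it suits this cross-agent comparison.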

This study is exploratory in nature. Rather than testing pre-registered hypotheses, we systematically characterize VLM emotion rating behavior across multiple dimensions to generate testable hypotheses for future confirmatory research. Our research questions address four axes of VLM-human comparison:

RQ1: How do VLM emotion ratings compare to human inter-rater reliability on categorical and dimensional measures?

RQ2: Do VLMs exhibit systematic demographic biases in emotion attribution, and are these biases model-specific?

RQ3: How do VLMs at different scales (4B local, 11–12B local, frontier API) compare in classification accuracy, dimensional prediction, and bias profiles?

RQ4: Does deliberative reasoning (thinking mode) improve recognition of low-intensity emotions, paralleling human deliberative processing?


2. Related Work

2.1 VLMs for Emotion Recognition

The application of VLMs to facial emotion recognition has yielded mixed results, with traditional deep learning models consistently outperforming VLMs on categorical accuracy. Mulukutla et al. (2025) conducted the first empirical comparison of open-source VLMs against traditional models on FER-2013, a dataset containing 35,887 low-resolution grayscale images across seven emotion classes. Traditional models — EfficientNet-B0 (86.44% accuracy) and ResNet-50 (85.72%) — outperformed VLMs by 20 to 35 percentage points, with CLIP achieving 64.07% and Phi-3.5 Vision achieving 51.66%. This performance gap suggests that VLMs’ general visual understanding does not automatically translate to FER proficiency, particularly on low-quality visual inputs.

Frontier API models show more promising results. Evaluations on the NimStim dataset demonstrate that GPT-4o and Gemini match or exceed human performance for calm, neutral, and surprise expressions, though performance degrades for more ambiguous emotions (Harb et al., 2025). Refoua et al. (2026) evaluated ChatGPT-4, ChatGPT-4o, and Claude 3 Opus on the Reading the Mind in the Eyes Test (RMET) with White, Black, and Korean face stimuli, finding that ChatGPT-4o achieved cross-ethnically consistent performance with accuracy above the 85th human percentile across all three ethnic versions. AlDahoul et al. (2026) developed FaceScanPaliGemma, a multi-agent VLM system for simultaneous facial attribute recognition including emotion (59.4% accuracy), race, gender, and age. Bhattacharyya and Wang (2025) presented a comprehensive evaluation of VLMs for evoked emotion recognition at NAACL, confirming that zero-shot VLMs lag behind supervised systems. The present study extends this literature by evaluating six VLMs spanning three parameter scales (4B, 11–12B, frontier) and two reasoning modes (standard and thinking) on a fully controlled factorial stimulus design.

2.2 Sadness-Neutral Confusion in Emotion Recognition

Sadness-neutral confusion is well-documented in FER literature. Mejia-Escobar et al. (2023) reported that 1,328 of 7,206 sad images in FER-2013 were misclassified as neutral. Analyses of AffectNet (Savchenko et al., 2024) found that anger and sadness had the highest misclassification rates, with 29% of sadness instances classified as neutral. The InsideOut benchmark (2025) similarly reported persistent confusion between “subtle classes such as fear, sadness, and neutral.” These studies establish sadness-neutral confusion as a well-known phenomenon in CNN-based FER models.

However, three critical gaps remain. First, sadness-neutral confusion has not been systematically characterized in VLMs. Harb et al. (2025) evaluated GPT-4o and Gemini on posed NimStim stimuli, finding fear-surprise confusion as the dominant error — a result attributable to the exaggerated expressions in posed datasets that reduce the ambiguity of sadness. Whether VLMs exhibit the same sadness-neutral confusion as FER models on more naturalistic stimuli has not been investigated. Second, no prior study has examined whether chain-of-thought reasoning in VLMs mitigates sadness-neutral confusion. Third, the relationship between human processing difficulty and VLM reasoning difficulty across emotions has never been quantified, despite the obvious theoretical interest of such a comparison.

2.3 Dual-Process Theory and Emotion Perception

Kahneman’s (2011) dual-process theory distinguishes between System 1 (fast, automatic, intuitive processing) and System 2 (slow, deliberative, effortful reasoning). Evidence from human emotion perception supports the relevance of this framework: Calvo and Nummenmaa (2013) demonstrated that happiness recognition requires only 10–20 ms of exposure, while sadness requires 70–200 ms — a 3.5 to 10-fold increase — suggesting that sadness recognition cannot be achieved through System 1 processing alone. Further support comes from clinical populations: individuals with alexithymia — a condition characterized by difficulty identifying emotions — show a specific tendency to rate negative emotions, particularly sadness, as neutral (Grynberg et al., 2012). Meta-analytic evidence indicates that cognitive empathy, a deliberate perspective-taking ability corresponding to System 2 processing, positively correlates with sadness recognition accuracy (Qiao et al., 2025).

The dual-process framework has not been applied to VLM emotion perception. We propose that non-thinking VLMs function as System 1 processors: they achieve rapid pattern matching sufficient for high-arousal, visually distinctive emotions (happy, angry, fear) but fail on low-intensity emotions (sadness) where deliberative reasoning is required. Thinking-enabled VLMs, by generating explicit reasoning traces before responding, engage an analogous System 2 process. This framework generates the specific prediction that thinking mode should disproportionately improve sadness recognition — a prediction we test directly.

2.4 Human-AI Comparison in Emotion Perception

The psychometric comparison of human and machine raters has a long tradition in clinical psychology, recently extended to large language models. Tak and Gratch (2024) found that GPT-4 emulates average-human emotional cognition from a third-person perspective. Alrasheed et al. (2025) evaluated GPT-4’s capacity to interpret emotions from non-facial affective images in the GAPED database, achieving correlations of r = 0.87 for valence and r = 0.72 for arousal under zero-shot conditions. Zhang et al. (2024) provide a comprehensive survey noting that while LLMs excel at affective understanding tasks such as sentiment classification, their performance on dimensional emotion estimation remains underexplored. The present study bridges this gap by evaluating six VLMs across two reasoning modes, producing integrated categorical-plus-dimensional ratings through a psychometric framework anchored to large-scale human data (N = 1,000).

2.5 Demographic Bias in Automated Affect Recognition

Documented racial and gender disparities in automated affect recognition have raised fairness concerns that extend to VLMs. Jankowiak et al. (2024) demonstrated that imbalanced training data propagates into systematic performance disparities across demographic groups. Gender bias in FER manifests as both representational bias (unequal demographic representation) and stereotypical bias (systematic associations between emotions and demographics; Dominguez-Catena et al., 2024). Human emotion perception itself is not demographically neutral: gender-emotion stereotypes lead observers to associate male faces with anger and female faces with happiness and sadness (Plant et al., 2000), though these stereotypical associations can reverse when facial cues are controlled (Hess et al., 2004). These human biases propagate into training datasets — AffectNet (Mollahosseini et al., 2017) relies on 12 annotators across approximately 450,000 images, with most images receiving a single annotation — and may be amplified by VLM pretraining on web-scale data. The present study extends bias analysis to six VLMs using a factorial design that enables orthogonal estimation of race, gender, and emotion effects.

2.6 AI-Generated Stimuli in Emotion Research

Traditional face databases — KDEF, ADFES, FER-2013, AffectNet — suffer from uncontrolled variation in expression quality, lighting, and demographic balance. AI-generated face stimuli address these limitations through controlled generation. The GIST-AIFaceDB used in this study generates neutral base faces with standardized features — identical gray backgrounds, navy t-shirts, and front-facing pose — then transforms each into five emotional expressions while preserving identity. This pipeline ensures that differences between expressions for a given identity are attributable solely to the emotion manipulation. Ecological validity is supported by human naturalness ratings: average naturalness ranged from 5.26 (fear) to 6.94 (happy) on a 9-point scale, indicating that participants perceived stimuli as moderately to highly realistic. Baudouin et al. (2025) provide evidence that dimensional ratings can be reliably collected from facial stimuli regardless of provenance.


3. Methodology

Figure 1 presents the overall research pipeline, illustrating how 1,440 AI-generated stimuli flow through human rating and VLM inference before converging in psychometric comparison.

flowchart TB
    subgraph Stimuli["Stimuli Generation"]
        A["OpenArt<br>STOIQO NewReality Flux"] -->|"240 neutral faces"| B["Nano-Banana<br>Gemini 2.5 Flash Image"]
        B -->|"5 emotions per identity"| C["GIST-AIFaceDB<br>1,440 images<br>3 races × 2 genders × 6 emotions × 40 IDs"]
    end

    subgraph Human["Human Rating (N = 1,000)"]
        C --> D["72 images per participant<br>72,000 total responses"]
        D --> E["Valence 1–9<br>Arousal 1–9<br>Naturalness 1–9<br>Response Times"]
    end

    subgraph VLM["VLM Inference (6 Models)"]
        C --> F1["Local No-Thinking<br>Gemma3-4B, Gemma3-12B,<br>LLaMA-3.2-11B"]
        C --> F2["Local Thinking<br>Qwen3-VL-4B"]
        C --> F3["Frontier API<br>GPT-4o-mini, Gemini 2.5 Flash"]
        F1 --> H["Context-Carry<br>3-Step Prompting"]
        F2 --> H
        F3 --> H
        H --> I["Emotion + Valence + Arousal<br>+ Thinking Traces"]
    end

    subgraph Analysis["Psychometric Comparison"]
        E --> L["Cohen's κ, Pearson r, MAE<br>Mixed-Effects Models<br>Demographic Bias<br>Thinking Token Analysis"]
        I --> L
        L --> M["Key Findings:<br>Dual-Process Account<br>Polarity Exaggeration<br>Sadness-Neutral Confusion<br>Thinking Advantage"]
    end

    style Stimuli fill:#e1f5fe,stroke:#0288d1
    style Human fill:#fff3e0,stroke:#f57c00
    style VLM fill:#e8f5e9,stroke:#388e3c
    style Analysis fill:#f3e5f5,stroke:#7b1fa2

Figure 1. Overall research pipeline. AI-generated stimuli (blue) are evaluated by 1,000 human raters (orange) and six VLMs spanning three scales and two reasoning modes (green), with all outputs converging in psychometric comparison (purple).

3.1 Stimuli

The stimulus set comprises 1,440 AI-generated facial images from the GIST AI-Generated Face Database (GIST-AIFaceDB, under review). The generation pipeline employed a two-step process. In the first step, 240 neutral base faces were generated using the STOIQO NewReality Flux model deployed on the OpenArt platform, depicting diverse virtual identities with standardized navy t-shirts against gray backgrounds across three racial groups (Black, Caucasian, Korean) and two genders (Male, Female). In the second step, each neutral face was transformed into five additional emotional expressions — angry, disgusted, fearful, happy, and sad — using Nano-Banana, an advanced image-editing model implemented in Google AI Studio (Gemini 2.5 Flash Image), which modifies facial expressions while preserving identity, lighting, and background.

The resulting factorial design — 3 (race) × 2 (gender) × 6 (emotion), with 40 identities nested within each race-gender cell — yields 1,440 images with balanced cell sizes: 240 per emotion, 480 per race, 720 per gender, and 40 per race-gender-emotion combination. This balanced design enables orthogonal estimation of all demographic effects without confounding.
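These cell counts follow mechanically from nesting identity within race-gender cells and crossing with emotion; a quick sanity check (label sets illustrative):

```python
# Enumerate the stimulus design: 40 identities per race-gender cell
# (the 240 neutral base faces), each rendered in 6 expressions.
from itertools import product

races = ["Black", "Caucasian", "Korean"]
genders = ["Male", "Female"]
emotions = ["happy", "sad", "angry", "fear", "disgust", "neutral"]
identities = range(40)  # per race-gender cell

images = list(product(races, genders, emotions, identities))

n_total = len(images)                                            # 1,440
per_emotion = sum(1 for r, g, e, i in images if e == "sad")      # 240
per_race = sum(1 for r, g, e, i in images if r == "Korean")      # 480
per_gender = sum(1 for r, g, e, i in images if g == "Male")      # 720
per_race_emotion = sum(1 for r, g, e, i in images
                       if (r, e) == ("Korean", "sad"))           # 80
per_full_cell = sum(1 for r, g, e, i in images
                    if (r, g, e) == ("Korean", "Male", "sad"))   # 40
```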

3.2 Human Rating Procedure

The study protocol was reviewed and granted exemption by the Institutional Review Board (IRB). One thousand native Korean adults (500 female, 500 male; age M = 44.6, SD = 13.7, range 20–69) were recruited through an online platform, with recruitment strictly balanced across age cohorts and genders. Each participant evaluated 72 images randomly selected from the 1,440 total, with every image presented in randomized order. Through this counterbalanced crossed design, each image received 50 independent ratings, yielding 72,000 total responses across three dimensions: valence (1–9 Likert scale), arousal (1–9), and naturalness (1–9). Response times were recorded for each rating.

Inter-rater reliability, computed as Krippendorff’s α (ordinal), established the human agreement benchmark: valence α = 0.471 (poor-to-fair), arousal α = 0.125 (poor), and naturalness α = 0.126 (poor). While these values appear low, they fall within the typical range for emotion rating studies and reflect the inherent subjectivity of affective perception. A linear mixed-effects model (LMM) confirmed that rater individual differences (σ² = 0.450 for valence, σ² = 0.696 for arousal) dominated image-level variance by a factor of 11 for valence and 32 for arousal, confirming that low reliability is driven by rater heterogeneity rather than stimulus ambiguity.

3.3 VLM Inference

Six VLMs were evaluated, spanning three parameter scales and two reasoning modes. Table 1 summarizes the model specifications.

Table 1. VLM specifications and inference configurations.

| Model | Provider | Parameters | Quantization | Thinking | Backend | Key Settings |
|---|---|---|---|---|---|---|
| Gemma3-4B-IT | Google | 4B | QAT 4-bit | No | MLX (local) | temp=0 |
| Gemma3-12B-IT | Google | 12B | QAT 4-bit | No | MLX (local) | temp=0 |
| LLaMA-3.2-11B-Vision | Meta | 11B | 4-bit | No | MLX (local) | temp=0 |
| Qwen3-VL-4B-Thinking | Alibaba | 4B | 4-bit | Yes (budget=1024) | MLX (local) | temp=0, rep_penalty=1.5 |
| GPT-4o-mini | OpenAI | Frontier | Full-precision | No | API | temp=0, seed=42, image_detail=high |
| Gemini 2.5 Flash | Google | Frontier | Full-precision | Yes (dynamic) | API | temp=0, includeThoughts=true |

The three local models (Gemma3-4B, Gemma3-12B, LLaMA-3.2-11B) were deployed on Apple Silicon (M1 Max, 32 GB) via the MLX framework with 4-bit quantization for memory-efficient inference. Qwen3-VL-4B-Thinking was deployed on the same hardware with chain-of-thought reasoning enabled: the model generates explicit reasoning within <think>...</think> tags before producing its JSON response, with a thinking budget of 1,024 tokens per inference step to prevent runaway generation in quantized models. GPT-4o-mini was accessed through the OpenAI API with deterministic settings (temperature = 0, seed = 42, image_detail = “high”). Gemini 2.5 Flash was accessed through the Google Generative AI API with thinking mode enabled (dynamic thinking budget) and includeThoughts: true to collect reasoning traces.

All models were run with temperature = 0 (greedy decoding) for deterministic outputs. The inclusion of two frontier API models operating at full precision serves dual purposes: establishing a performance ceiling unconstrained by quantization artifacts, and enabling partial disentanglement of quantization effects from architectural limitations. Recent work demonstrates that calibration-based 4-bit quantization retains 92–95% of FP16 quality on standard benchmarks (Lang et al., 2024), with vision tokens being less sensitive to quantization than language tokens due to higher redundancy (Li et al., 2025).

Inference followed a three-step context-carry prompting strategy, where prior outputs are fed forward as context for subsequent predictions, mirroring anchoring effects in human sequential judgment. In Step 1, the model classified the facial emotion from six forced-choice categories (happy, sad, angry, fear, disgust, neutral) via JSON output. In Step 2, the classified emotion was carried forward, and the model rated valence on a 1–9 scale. In Step 3, both the classified emotion and valence rating were carried forward, and the model rated arousal on a 1–9 scale. This strategy introduces structural error propagation: classification errors in Step 1 systematically influence subsequent valence and arousal ratings. Response parsing employed a cascade strategy: direct JSON parse, markdown fence stripping, and regex fallback. All 1,440 images were successfully processed by all six models, yielding 8,640 total VLM predictions.
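The parsing cascade can be sketched as follows; the field names and regular expressions are illustrative, not the study's verbatim implementation:

```python
import json
import re

def parse_vlm_json(raw):
    """Cascade parser for VLM output: direct JSON parse, then markdown
    fence stripping, then a regex fallback for the first {...} span.
    Returns a dict, or None if no stage yields valid JSON.
    """
    # Stage 1: the output is already clean JSON.
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    # Stage 2: strip leading/trailing ```json ... ``` fences and retry.
    stripped = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip())
    try:
        return json.loads(stripped)
    except json.JSONDecodeError:
        pass
    # Stage 3: grab the first brace-delimited span anywhere in the text.
    # (Non-greedy match: sufficient for the flat objects expected here,
    # not for nested JSON.)
    match = re.search(r"\{.*?\}", raw, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            pass
    return None
```

A cascade of this shape is what makes a 100% parse rate across 8,640 predictions plausible: small quantized models frequently wrap valid JSON in fences or surrounding prose rather than emitting malformed JSON outright.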

3.4 Statistical Analysis

Categorical agreement was quantified via Cohen’s κ against intended emotion labels, with McNemar’s test for pairwise model comparisons. Dimensional alignment was assessed through Pearson correlation, Mean Absolute Error (MAE), and Bland-Altman analysis (systematic bias and 95% limits of agreement). Per-emotion bias significance was tested with Wilcoxon signed-rank tests, Bonferroni-corrected.
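For the Bland-Altman component, systematic bias and the 95% limits of agreement reduce to the mean and standard deviation of the paired differences; a minimal sketch:

```python
def bland_altman(model_ratings, human_ratings):
    """Systematic bias and 95% limits of agreement between paired raters.

    bias = mean(model - human); limits = bias +/- 1.96 * SD(differences).
    Requires at least two rating pairs.
    """
    diffs = [m - h for m, h in zip(model_ratings, human_ratings)]
    n = len(diffs)
    bias = sum(diffs) / n
    sd = (sum((d - bias) ** 2 for d in diffs) / (n - 1)) ** 0.5
    return bias, bias - 1.96 * sd, bias + 1.96 * sd
```

Unlike Pearson r, which ignores any fixed offset or gain, the bias term here directly exposes the polarity exaggeration reported in Section 4.3.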

Bias decomposition employed linear mixed-effects models (LMMs) fitted via R’s lme4 package (Bates et al., 2015) with Satterthwaite degrees of freedom (lmerTest). The emotion-bias model used the formula: rating ~ rater_type * emotion + (1|image_id), where rater_type distinguishes human aggregate ratings from VLM ratings. Demographic bias models used analogous formulas with actor_race and actor_gender as fixed effects.

Thinking token analysis used character counts from collected reasoning traces (Gemini) and token counts estimated via tiktoken (Qwen3-VL). Per-emotion thinking length was compared via Kruskal-Wallis tests, and correct/incorrect trial comparisons used Mann-Whitney U tests.
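The Mann-Whitney U statistic underlying the correct-versus-incorrect comparison simply counts favorable cross-sample pairs; a minimal sketch (the p-value step, via normal approximation or exact tables, is omitted):

```python
def mann_whitney_u(x, y):
    """Mann-Whitney U statistic for sample x versus sample y.

    U counts, over all cross-sample pairs, how often an x value exceeds
    a y value; ties count one half. Suitable for skewed quantities such
    as reasoning-trace lengths, where t-test assumptions fail.
    """
    u = 0.0
    for a in x:
        for b in y:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u
```

Here x would hold trace lengths from incorrect trials and y from correct trials, so a large U supports the claim that errors co-occur with longer deliberation.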


4. Results

4.1 Emotion Classification

Table 2 presents the six-model ranking on overall emotion classification. The two thinking models (Gemini 2.5 Flash and Qwen3-VL-4B) occupy the first and third positions, with the frontier non-thinking model GPT-4o-mini in second.

Table 2. Overall emotion classification performance (N = 1,440 images per model).

| Rank | Model | Thinking | Parameters | Accuracy | Cohen’s κ |
|---|---|---|---|---|---|
| 1 | Gemini 2.5 Flash | Yes | Frontier | 0.881 | 0.855 |
| 2 | GPT-4o-mini | No | Frontier | 0.812 | 0.766 |
| 3 | Qwen3-VL-4B | Yes | 4B | 0.806 | 0.764 |
| 4 | Gemma3-12B | No | 12B | 0.761 | 0.698 |
| 5 | Gemma3-4B | No | 4B | 0.726 | 0.646 |
| 6 | LLaMA-3.2-11B | No | 11B | 0.613 | 0.458 |

Two patterns are notable. First, model scale does not predict performance: the 11B LLaMA (κ = 0.458) performs worse than the 4B Gemma3 (κ = 0.646), and the 12B Gemma3 (κ = 0.698) performs below the 4B Qwen3-VL (κ = 0.764). Architecture and reasoning mode matter more than parameter count. Second, the 4B Qwen3-VL with thinking (κ = 0.764) achieves near-identical agreement to the frontier GPT-4o-mini without thinking (κ = 0.766), suggesting that explicit reasoning partially compensates for model scale.

Table 3 presents emotion-specific accuracy across all six models, revealing extreme performance polarization.

Table 3. Emotion-specific classification accuracy (proportion correct).

| Emotion | Gemini | Qwen3-VL | GPT | Gemma3-12B | Gemma3-4B | LLaMA |
|---|---|---|---|---|---|---|
| Happy | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| Neutral | 0.992 | 0.962 | 1.000 | 1.000 | 1.000 | 1.000 |
| Fear | 0.971 | 0.896 | 0.929 | 0.971 | 0.979 | 0.654 |
| Angry | 0.929 | 0.875 | 0.942 | 0.858 | 0.404 | 0.921 |
| Disgust | 0.808 | 0.554 | 0.750 | 0.600 | 0.842 | 0.008 |
| Sad | 0.583 | 0.546 | 0.254 | 0.267 | 0.126 | 0.092 |

Happy and neutral are perfectly or near-perfectly classified by all models — effectively solved categories. Fear, angry, and disgust show model-specific variation. Sadness is the universal failure point: accuracy ranges from 9.2% (LLaMA) to 58.3% (Gemini), with no model exceeding 60%. The dominant error for sadness is neutral absorption: across non-thinking models, 66–76% of sad images are classified as neutral. Even the best-performing model (Gemini with thinking) misclassifies 19.2% of sad images as neutral.

4.2 Thinking Effect on Emotion Classification (RQ4)

Table 4 presents the thinking effect as matched comparisons between thinking and non-thinking models.

Table 4. Thinking effect on emotion classification accuracy.

| Comparison | No-Thinking Model | Accuracy | Thinking Model | Accuracy | Δ |
|---|---|---|---|---|---|
| Frontier (API) | GPT-4o-mini | 81.2% | Gemini 2.5 Flash | 88.1% | +6.9 pp |
| Local (4B) | Gemma3-4B | 72.6% | Qwen3-VL-4B | 80.6% | +8.0 pp |

The thinking advantage is consistent across both frontier and local model pairs, ranging from 6.9 to 8.0 percentage points. Critically, the thinking advantage is not uniform across emotions. Figure 2 illustrates that thinking produces its largest gains on sadness, the worst-classified emotion.

Table 5. Sadness accuracy by thinking mode.

| Model | Thinking | Sad Accuracy | Sad→Neutral Confusion Rate |
|---|---|---|---|
| LLaMA-3.2-11B | No | 9.2% | 66.7% |
| Gemma3-4B | No | 12.6% | 71.1% |
| GPT-4o-mini | No | 25.4% | |
| Gemma3-12B | No | 26.7% | |
| Qwen3-VL-4B | Yes | 54.6% | |
| Gemini 2.5 Flash | Yes | 58.3% | 19.2% |

Non-thinking models achieve 9–27% sad accuracy, while thinking models achieve 55–58% — a 2 to 6-fold improvement. Thinking reduces Gemini’s sadness-neutral confusion rate from the 66–76% range typical of non-thinking models to 19.2%. This disproportionate improvement on sadness, rather than a uniform boost across all emotions, supports the dual-process interpretation: sadness recognition specifically requires the deliberative reasoning that thinking mode provides, while high-arousal emotions (happy, angry, fear) are adequately handled by direct pattern matching.

4.3 Valence Comparison

All six VLMs achieve high valence correlations with human ratings (r = .891–.963), indicating correct rank ordering of emotions along the pleasantness dimension. However, absolute errors are large (MAE = 1.45–1.84), reflecting a systematic pattern of correct ordering but distorted scale usage.

Table 6. Valence prediction summary statistics (6 VLMs).

| Model | Thinking | Pearson r | MAE | Bias (M) |
|---|---|---|---|---|
| Gemini 2.5 Flash | Yes | .963 | 1.842 | −1.280 |
| GPT-4o-mini | No | .938 | 1.626 | −1.018 |
| Qwen3-VL-4B | Yes | .913 | 1.445 | −0.824 |
| LLaMA-3.2-11B | No | .901 | 1.808 | |
| Gemma3-4B | No | .891 | 1.456 | |
| Gemma3-12B | No | | | |

The source of this distortion is polarity exaggeration bias: VLMs systematically produce more extreme valence ratings than humans — more negative for negative emotions and more positive for positive emotions. This pattern persists across all models, including frontier full-precision models, confirming it as an architectural property of VLMs rather than a quantization artifact. Mixed-effects models confirmed all per-emotion biases as statistically significant (p < .001).

Qwen3-VL-4B achieves the lowest MAE (1.445) and smallest negative bias (−0.824) among the three VA-reporting models, suggesting that thinking mode may improve valence calibration in addition to categorical accuracy.
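The summary statistics used for dimensional alignment (Pearson r, MAE, signed mean bias) can all be computed from paired model-human ratings. A self-contained sketch with toy valence values on a −5..+5 scale (hypothetical numbers, not the study's ratings):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length rating lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def mae(model, human):
    """Mean absolute error of model ratings against human ratings."""
    return sum(abs(m - h) for m, h in zip(model, human)) / len(model)

def mean_bias(model, human):
    """Signed mean difference; negative = model rates lower than humans."""
    return sum(m - h for m, h in zip(model, human)) / len(model)

# Toy data illustrating the paper's pattern: correct rank ordering
# (high r) but exaggerated extremity (large MAE despite small bias).
model_v = [-4.5, -3.8, 4.9, 4.6]
human_v = [-3.1, -2.6, 3.8, 3.5]
r, err, b = (pearson_r(model_v, human_v),
             mae(model_v, human_v),
             mean_bias(model_v, human_v))
```

The toy numbers reproduce the dissociation discussed above: r is near 1 while MAE exceeds 1 scale point, because the model's ratings are stretched away from the midpoint rather than reordered.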

4.4 Arousal Comparison

Arousal estimation reveals a clear thinking advantage. Table 7 presents arousal statistics for the five models with arousal data.

Table 7. Arousal prediction summary statistics.

| Model | Thinking | Pearson r | MAE |
|---|---|---|---|
| Gemini 2.5 Flash | Yes | .767 | 1.951 |
| Qwen3-VL-4B | Yes | .758 | 2.013 |
| LLaMA-3.2-11B | No | .783 | 1.777 |
| Gemma3-4B | No | .759 | 1.137 |
| GPT-4o-mini | No | .622 | 1.572 |

Thinking models achieve the highest arousal correlations (r = .758–.767), with GPT-4o-mini — the frontier non-thinking model — showing the lowest correlation (r = .622). This pattern suggests that chain-of-thought reasoning enhances arousal estimation by providing intermediate reasoning about emotional intensity. However, the comparison is imperfect because the models differ in architecture and training data beyond their thinking capability alone. LLaMA and Gemma3-4B also show moderate-to-high arousal correlations (.759–.783) despite lacking thinking mode, indicating that arousal estimation reflects multiple contributing factors.

Both thinking and non-thinking VLMs show the same systematic arousal bias pattern: overestimation of fear arousal and underestimation of neutral and sad arousal, consistent with a “low visual salience = low arousal” heuristic.

4.5 Thinking Tokens as Cognitive Load Proxy

Chain-of-thought reasoning traces provide a window into model processing difficulty across emotions. Table 8 presents average thinking length by emotion for the two thinking models.

Table 8. Average thinking token/character count by emotion.

| Emotion | Gemini (chars) | Qwen3-VL (tokens) | Human Arousal RT (Mdn, s) |
|---|---|---|---|
| Happy | 949 | 1,608 | 1.676 |
| Neutral | 989 | | 1.723 |
| Fear | 1,011 | 2,221 | 1.695 |
| Angry | 925 | | 1.707 |
| Disgust | 966 | 3,460 | 1.723 |
| Sad | 1,290 | 3,915 | 1.745 |

Sadness elicits the longest thinking traces in both models: Gemini generates 36% more characters for sad than for happy stimuli, and Qwen3-VL generates 143% more tokens. This parallels human response times, where sad stimuli produce the longest arousal rating times (Mdn = 1.745 s). The Spearman correlation between emotion-level VLM thinking length and human response time is ρ = +0.899 (p = .015), indicating strong concordance between human and VLM processing difficulty.
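The rank-correlation computation behind the reported ρ can be sketched directly from the Table 8 values. The reported ρ = +0.899 aggregates over the available thinking-length measures; the sketch below uses the Gemini character counts alone, so its value illustrates the procedure rather than reproducing the paper's figure:

```python
def average_ranks(values):
    """1-based ranks, with ties assigned the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend the tie group while values are equal.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Emotion-level values from Table 8 (happy, neutral, fear, angry, disgust, sad):
gemini_chars = [949, 989, 1011, 925, 966, 1290]
human_rt = [1.676, 1.723, 1.695, 1.707, 1.723, 1.745]
rho = spearman_rho(gemini_chars, human_rt)  # positive: longer thinking, longer RT
```

With only six emotion-level data points, the statistic is sensitive to single ranks (and to the neutral-disgust RT tie), which is why significance testing matters at this granularity.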

Thinking length also differs by accuracy. Gemini generates 26% longer traces on incorrect trials (M = 1,248 chars) than correct trials (M = 993 chars). Qwen3-VL shows an even larger increase: 69% longer on incorrect trials (M = 3,959 tokens vs. 2,339). This pattern — more thinking on harder or incorrect items — mirrors the human uncertainty-deliberation relationship but does not translate into higher accuracy, suggesting a necessary-but-not-sufficient role for deliberative processing.

Step-level analysis reveals that the arousal-rating step elicits the longest thinking traces of any rating step, across all emotions, consistent with the low human inter-rater reliability for arousal (α = 0.125) and suggesting that arousal intensity estimation is the most cognitively demanding dimension for both humans and VLMs.

4.6 Demographic Bias Analysis

Mixed-effects models revealed model-specific demographic biases across the six VLMs. Table 9 presents racial accuracy by model.

Table 9. Emotion classification accuracy by race.

| Model | Black | Caucasian | Korean | Max Δ |
|---|---|---|---|---|
| Gemini 2.5 Flash | 90.4% | 87.3% | 86.5% | 3.9 pp |
| GPT-4o-mini | 81.9% | 79.0% | 82.9% | 3.9 pp |
| Qwen3-VL-4B | 75.2% | 81.9% | 84.6% | 9.4 pp |
| Gemma3-12B | | | | |
| Gemma3-4B | 82.7% | 65.6% | 69.2% | 17.1 pp |
| LLaMA-3.2-11B | 56.9% | 59.0% | 68.1% | 11.2 pp |

Frontier models (Gemini, GPT-4o-mini) show the smallest racial accuracy gaps (3.9 percentage points), suggesting that larger-scale pretraining on more diverse data reduces demographic bias. Local models show larger gaps: Qwen3-VL favors Korean faces (84.6%) over Black faces (75.2%), consistent with its Alibaba training provenance. Gemma3-4B shows the largest racial gap (17.1 percentage points), with Black faces classified most accurately (82.7%) and Caucasian faces least (65.6%). The pattern reverses in LLaMA, which classifies Korean faces best (68.1%) and Black faces worst (56.9%). These model-specific bias patterns confirm that no single bias audit can generalize across VLMs, and each deployment context requires individual evaluation.

At the intersection of race and emotion, model-specific patterns emerge. Gemma3-4B shows a 2.7-fold accuracy gap for angry classification between Black faces (61.3%) and Korean faces (22.5%). This accuracy difference reflects differential sensitivity to angry expressions across racial groups, not necessarily over-attribution of anger to Black faces — establishing over-attribution would require false positive rate analysis (Hugenberg & Bodenhausen, 2003). The pattern reverses for disgust (Korean 95.0% exceeding Black 75.0%), confirming emotion-specific rather than uniform racial effects.


5. Discussion

5.1 A Dual-Process Account of VLM Emotion Perception

The central finding of this study is that VLM emotion recognition can be understood through Kahneman’s (2011) dual-process framework, with three converging lines of evidence supporting this account.

The first line of evidence comes from human processing difficulty. Among the 1,000 human raters who produced 72,000 responses, sad stimuli elicited the longest arousal response times (Mdn = 1.745 s), significantly longer than happy (1.676 s, p < .001) and angry (1.707 s, p = .002). This extended processing time for sadness is consistent with prior work showing that sadness recognition requires 70–200 ms of exposure compared to 10–20 ms for happiness (Calvo & Nummenmaa, 2013), indicating that sadness inherently requires deeper processing that System 1 alone cannot provide.

The second line of evidence comes from VLM thinking traces. Both thinking models generate substantially longer reasoning for sad stimuli: Gemini produces 36% more characters and Qwen3-VL produces 143% more tokens for sad versus happy images. The correlation between emotion-level VLM thinking length and human response time is ρ = +0.899 (p = .015), demonstrating that the same emotions that are difficult for humans are difficult for VLMs. Furthermore, incorrect classifications involve longer thinking (26–69% more), paralleling the human uncertainty-deliberation relationship.

The third line of evidence addresses an alternative explanation. One might argue that VLMs fail on sadness because AI-generated sad images are unrealistic. Human naturalness ratings contradict this: sad images (M = 5.658) were rated significantly more natural than fear (5.260), disgust (5.428), and angry (5.486) images, yet fear achieved 97.1% accuracy from the best model (Gemma3-4B) compared to sadness’s maximum of 58.3% (Gemini). This cross-over pattern — higher naturalness but lower accuracy — rules out stimulus quality as the explanation.

These three lines converge on a dual-process account. Non-thinking VLMs function as System 1 processors: their direct pattern matching is sufficient for high-arousal, visually distinctive emotions (happy: 100%, angry: 92%, fear: 97%) but fails for sadness (9–27%), where the subtle, low-intensity facial cues require deliberative processing to distinguish from emotional neutrality. Thinking VLMs engage an analogous System 2 process: by generating explicit reasoning before responding, they achieve 55–58% sadness accuracy — a 2 to 6-fold improvement. The disproportionate thinking advantage on sadness (versus the modest advantages on already-well-classified emotions) supports the specificity of this account: thinking mode does not uniformly boost performance but specifically compensates for the System 1 limitation on low-intensity emotions.

We acknowledge two caveats. First, the thinking/non-thinking comparisons confound reasoning mode with model architecture and training data: Gemini versus GPT and Qwen3-VL versus Gemma3 differ in ways beyond thinking capability alone. Clean ablation (toggling thinking on the same model architecture) would provide stronger causal evidence. Second, the analogy between VLM reasoning traces and human System 2 processing is functional, not mechanistic — VLM “thinking” operates through autoregressive token generation, not the neural processes underlying human deliberation. The value of the dual-process framework is as an organizing principle for the empirical patterns, not as a claim about shared cognitive mechanisms.

5.2 Sadness-Neutral Confusion: A Cross-Agent Phenomenon

Sadness is the worst-classified emotion for all six VLMs, with accuracy ranging from 9.2% (LLaMA) to 58.3% (Gemini). The dominant error pathway is neutral absorption: non-thinking VLMs classify 66–76% of sad images as neutral, treating sadness as the absence of emotion rather than a distinct emotional state. This confusion is predictable from the circumplex model, where sadness occupies a low-arousal, moderately negative region proximal to neutral.

The present study extends the well-documented FER literature on sadness-neutral confusion (Mejia-Escobar et al., 2023; Savchenko et al., 2024) to VLMs with three novel contributions. First, we demonstrate that the confusion persists even in frontier full-precision models (GPT-4o-mini: 25.4% sad accuracy), confirming it as a perceptual limitation rather than a quantization artifact. Second, we show that thinking mode disproportionately reduces this confusion (from 66–76% to 19.2% confusion rate in Gemini), providing the first evidence that chain-of-thought reasoning specifically targets low-intensity emotion recognition. Third, we provide the first direct comparison of human and VLM processing difficulty across emotions, revealing that sadness is the most difficult emotion for both agents (human RT and VLM thinking length) despite being rated as the most naturalistic stimulus category.

This poses critical risks for VLM deployment in mental health support and empathetic agent design. A system that cannot distinguish sadness from emotional neutrality will fundamentally fail at detecting distress — the very application domain where affective computing promises the greatest societal benefit (Pantic et al., 2005). The finding that thinking mode partially mitigates this failure suggests a practical deployment recommendation: VLM-based emotion recognition systems should employ chain-of-thought reasoning, particularly when detecting low-intensity negative emotions.

5.3 Polarity Exaggeration Bias: An Architectural Property

All six VLMs, including frontier full-precision models, systematically amplify valence extremity: negative emotions are rated more negative and positive emotions more positive than human ratings. This polarity exaggeration bias likely originates from VLMs’ pretraining corpora, where emotional language tends toward hyperbole. The persistence of this pattern across quantized and full-precision models confirms it as an architectural property of VLM emotion processing rather than a quantization artifact.

The consistency of polarity exaggeration suggests a practical mitigation path: post-hoc linear calibration per emotion category could substantially reduce absolute errors while preserving the high rank-order correlation. A simple affine transformation mapping VLM output distributions to human output distributions per emotion category would correct for both mean shift and variance inflation without retraining.
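The proposed affine transformation is fully determined by the per-emotion means and SDs of the two rating distributions (moment matching). A minimal sketch with toy numbers (hypothetical values, not the study's ratings):

```python
def fit_affine(vlm_scores, human_scores):
    """Fit y = a*x + b so that calibrated VLM scores match the human
    mean and SD for one emotion category (moment matching)."""
    n = len(vlm_scores)
    mv = sum(vlm_scores) / n
    mh = sum(human_scores) / n
    sv = (sum((x - mv) ** 2 for x in vlm_scores) / n) ** 0.5
    sh = (sum((x - mh) ** 2 for x in human_scores) / n) ** 0.5
    a = sh / sv       # shrinks variance inflation
    b = mh - a * mv   # removes mean shift
    return a, b

# Toy per-emotion valence ratings: the VLM exaggerates negativity.
vlm = [-3.5, -3.0, -2.5]
human = [-2.1, -1.8, -1.5]
a, b = fit_affine(vlm, human)
calibrated = [a * x + b for x in vlm]
```

Because the map is monotone increasing (a > 0), it leaves rank-order correlations untouched while correcting both the mean shift and the variance inflation, which is exactly the failure mode described above. In practice the parameters would be fit on a held-out calibration set per emotion category.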

5.4 VLM Arousal Ratings and Ecological Validity

VLMs show moderate-to-high arousal correlations with human ratings (r = .622–.783), with thinking models achieving the highest values (r = .758–.767). One might argue that the context-carry prompting design, which provides VLMs with categorical emotion labels before arousal rating, creates an unfair advantage. However, human emotion perception is inherently sequential: categorical emotion perception occurs automatically and rapidly (within approximately 170 ms) and anchors subsequent dimensional judgments (Barrett, 2017; Scherer, 2009). The human participants in this study also rated dimensions sequentially, with each judgment potentially anchoring the next. The context-carry design therefore provides VLMs with an information flow analogous to human sequential judgment rather than giving them “extra” information.

5.5 Model-Specific Demographic Biases

The most consequential finding for deployment decisions is that VLM demographic biases are model-specific in direction, magnitude, and affected dimension. Frontier models show the smallest racial accuracy gaps (3.9 percentage points), while local models show gaps up to 17.1 percentage points. Bias directions are model-specific: Gemma3-4B shows gender-valence bias (female faces rated more negatively) while LLaMA shows race-arousal bias (Korean faces rated lower arousal). This heterogeneity means that each deployment context requires individual bias auditing against the specific populations and emotions involved.

5.6 Limitations

Several limitations constrain the generalizability of these findings.

First, our human participants were exclusively Korean adults, potentially introducing cultural biases in the baseline. Cross-cultural replication with diverse rater populations is needed. Second, the thinking effect comparison confounds reasoning mode with model architecture and training data. Clean ablation — toggling thinking on the same model — would provide stronger causal evidence. Third, thinking budget constraints (1,024 tokens per step for Qwen3-VL) may limit the benefits of deliberative reasoning; whether longer thinking budgets produce better results remains unexplored. Fourth, our stimuli are static, single-emotion images, whereas real-world emotion recognition involves dynamic, multi-modal, mixed-emotion stimuli. Fifth, the context-carry prompting strategy introduces structural error propagation that alternative approaches (single-shot integrated prompting) would avoid. Sixth, all stimuli are AI-generated faces, which may represent different distribution shifts for different models. VLMs trained on web-scale data may have encountered AI-generated imagery during pretraining, creating an asymmetric comparison that replication with real-face stimuli should address. Seventh, while we interpret thinking traces through the dual-process framework, VLM “thinking” is autoregressive token generation, not human deliberation — the functional analogy should not be mistaken for mechanistic equivalence.


6. Conclusion

This study provides a psychometric comparison of six VLMs against 1,000 human raters on 1,440 AI-generated facial stimuli, establishing a dual-process account of VLM emotion perception. Five key findings emerge.

First, chain-of-thought thinking consistently improves emotion classification by 7–8 percentage points, with the largest gains on sadness recognition (55–58% vs. 9–27%). A 4B local thinking model (Qwen3-VL, κ = 0.764) achieves performance parity with a frontier non-thinking model (GPT-4o-mini, κ = 0.766), demonstrating that explicit reasoning partially compensates for model scale.

Second, sadness recognition difficulty is a cross-agent phenomenon supported by convergent evidence: human response times, VLM thinking traces, and classification accuracy all identify sadness as the emotion requiring the deepest processing, while stimulus naturalness ratings rule out image quality as an alternative explanation. This convergent evidence supports a dual-process account in which non-thinking VLMs function as System 1 processors that fail on low-intensity emotions.

Third, polarity exaggeration bias and sadness-neutral confusion persist even in frontier full-precision models, confirming these as architectural properties of VLM emotion processing rather than quantization artifacts.

Fourth, thinking tokens serve as a cognitive load proxy: models generate 26–69% more reasoning tokens on incorrect trials, and emotion-level thinking length correlates with human response times (ρ = +0.899, p = .015).

Fifth, demographic biases are model-specific in direction, magnitude, and affected dimension, with frontier models showing smaller racial accuracy gaps (3.9 pp) than local models (9.4–17.1 pp), requiring per-model audits rather than generalized bias characterizations.

These findings demonstrate that VLM emotion ratings cannot substitute for human judgments without calibration and bias auditing. For deployment in emotionally sensitive contexts — mental health chatbots, affective tutoring systems, empathetic agents — we recommend enabling chain-of-thought reasoning (particularly for low-intensity emotions), applying post-hoc valence calibration, and conducting per-model demographic bias audits. Future work should address the confounds in thinking-mode comparisons through clean ablation experiments, extend the dual-process framework to dynamic stimuli, and investigate whether the human RT–VLM thinking correlation reflects shared computational demands or a more superficial similarity.


References

AlDahoul, N., et al. (2026). FaceScanPaliGemma: Multi-agent vision language models for facial attribute recognition. Scientific Reports, 16.

Alrasheed, H., Alghihab, A., Pentland, A., & Alghowinem, S. (2025). Evaluating the capacity of large language models to interpret emotions in images. PLOS ONE, 20(6), e0324127.

Barrett, L. F. (2017). The theory of constructed emotion: An active inference account of interoception and categorization. Social Cognitive and Affective Neuroscience, 12(1), 1–23.

Bates, D., Mächler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1–48.

Baudouin, J.-Y., Gallian, F., Pinoit, J.-M., & Damon, F. (2025). Arousal, valence, and discrete categories in facial emotion. Scientific Reports, 15(1), 40268.

Bhattacharyya, A., & Wang, S. (2025). Evaluating vision-language models for emotion recognition. In Findings of the Association for Computational Linguistics: NAACL 2025.

Calvo, M. G., & Nummenmaa, L. (2013). Wait, are you sad or angry? Large exposure time differences required for the categorization of facial expressions of emotion. Journal of Vision, 13(4), 14.

Dominguez-Catena, I., Paternain, D., & Galar, M. (2024). Less can be more: Representational vs. stereotypical gender bias in facial expression recognition. Progress in Artificial Intelligence, 13, 255–273.

Grynberg, D., Chang, B., Corneille, O., Maurage, P., Vermeulen, N., Berthoz, S., & Luminet, O. (2012). Alexithymia and the processing of emotional facial expressions: A systematic review, quantitative and qualitative meta-analysis. PLOS ONE, 7(8), e40259.

Harb, E., et al. (2025). Evaluating the performance of general purpose large language models in identifying human facial emotions. npj Digital Medicine, 8.

Hess, U., Adams, R. B., Jr., & Kleck, R. E. (2004). Facial appearance, gender, and emotion expression. Emotion, 4(4), 378–388.

Hugenberg, K., & Bodenhausen, G. V. (2003). Facing prejudice: Implicit prejudice and the perception of facial threat. Psychological Science, 14(6), 640–643.

Jankowiak, P., et al. (2024). Metrics for dataset demographic bias: A case study on facial expression recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(8), 5520–5536.

Kahneman, D. (2011). Thinking, Fast and Slow. Farrar, Straus and Giroux.

Khare, S. K., Blanes-Vidal, V., Nadimi, E. S., & Acharya, U. R. (2024). Emotion recognition and artificial intelligence: A systematic review (2014–2023). Information Fusion, 102, 102019.

Lang, J., et al. (2024). A comprehensive study on quantization techniques for large language models. arXiv preprint arXiv:2411.02530.

Li, Y., et al. (2025). MBQ: Modality-balanced quantization for large vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

Mejia-Escobar, C., Gallego-Molina, N. J., & Arias-Vergara, T. (2023). Towards a better performance in facial expression recognition: A data-centric approach. Computational Intelligence and Neuroscience, 2023.

Mollahosseini, A., Hasani, B., & Mahoor, M. H. (2017). AffectNet: A database for facial expression, valence, and arousal computing in the wild. IEEE Transactions on Affective Computing, 10(1), 18–31.

Mulukutla, V. K., Pavarala, S. S., Rudraraju, S. R., & Bonthu, S. (2025). Evaluating open-source vision language models for facial emotion recognition against traditional deep learning models. arXiv preprint arXiv:2508.13524.

Pantic, M., Sebe, N., Cohn, J. F., & Huang, T. (2005). Affective multimodal human-computer interaction. In Proceedings of the 13th ACM International Conference on Multimedia (pp. 669–676).

Plant, E. A., Hyde, J. S., Keltner, D., & Devine, P. G. (2000). The gender stereotyping of emotions. Psychology of Women Quarterly, 24(1), 81–92.

Qiao, Y., et al. (2025). Empathy and emotion recognition: A three-level meta-analysis. Psychological Methods.

Refoua, S., Elyoseph, Z., Piterman, H., et al. (2026). Evaluation of cross-ethnic emotion recognition capabilities in multimodal large language models using the reading the mind in the eyes test. Scientific Reports, 16.

Russell, J. A. (1980). A circumplex model of affect. Journal of Personality and Social Psychology, 39(6), 1161–1178.

Savchenko, A. V., et al. (2024). AffectNet+: Soft-label facial expression recognition with improved dataset and enhanced training pipeline. arXiv preprint arXiv:2410.22506.

Scherer, K. R. (2009). The dynamic architecture of emotion: Evidence for the component process model. Cognition and Emotion, 23(7), 1307–1351.

Tak, A. N., & Gratch, J. (2024). GPT-4 emulates average-human emotional cognition from a third-person perspective. In Proceedings of the 12th International Conference on Affective Computing and Intelligent Interaction (ACII).

Telceken, M., Akgun, D., Kacar, S., Yesin, K., & Yildiz, M. (2025). Can artificial intelligence understand our emotions? Deep learning applications with face recognition. Current Psychology, 44(9), 7946–7956.

Zhang, Y., Yang, X., Xu, X., et al. (2024). Affective computing in the era of large language models: A survey from the NLP perspective. arXiv preprint arXiv:2408.04638.


Supplementary Materials

S1. FER Baseline Comparison

Five FER-specialized models — PosterV2 (κ = 0.878), MobileViT (κ = 0.848), EfficientNet (κ = 0.823), BEiT (κ = 0.713), EmoNet (κ = 0.665) — were evaluated on the same 1,440 images. FER models achieve higher classification accuracy than most VLMs but substantially weaker arousal correlations (r = .126–.448). The complementary performance profiles — FER dominance in classification and valence, VLM dominance in arousal — suggest fundamentally different processing strategies, though the comparison is not strictly equivalent due to VLMs’ access to categorical labels during arousal rating via the context-carry design.

Table S1. Combined VLM and FER model ranking (11 models).

| Rank | Model | Type | Thinking | Accuracy | κ |
|---|---|---|---|---|---|
| 1 | PosterV2 | FER | | 0.899 | 0.878 |
| 2 | Gemini 2.5 Flash | VLM | Yes | 0.881 | 0.855 |
| 3 | MobileViT | FER | | 0.875 | 0.848 |
| 4 | EfficientNet | FER | | 0.854 | 0.823 |
| 5 | GPT-4o-mini | VLM | No | 0.812 | 0.766 |
| 6 | Qwen3-VL-4B | VLM | Yes | 0.806 | 0.764 |
| 7 | BEiT | FER | | 0.766 | 0.713 |
| 8 | Gemma3-12B | VLM | No | 0.761 | 0.698 |
| 9 | EmoNet | FER | | 0.731 | 0.665 |
| 10 | Gemma3-4B | VLM | No | 0.726 | 0.646 |
| 11 | LLaMA-3.2-11B | VLM | No | 0.613 | 0.458 |
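The κ values in Table S1 follow Cohen's definition, κ = (p_o − p_e)/(1 − p_e), where p_o is observed agreement and p_e is the chance agreement implied by the two raters' marginal label frequencies. A minimal sketch against two aligned label lists (toy labels, not study data):

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for the chance
    agreement implied by each rater's marginal label frequencies."""
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    ca, cb = Counter(rater_a), Counter(rater_b)
    p_e = sum(ca[label] * cb.get(label, 0) for label in ca) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Toy: two raters agree on 3 of 4 items.
a = ["happy", "happy", "sad", "sad"]
b = ["happy", "happy", "sad", "neutral"]
kappa = cohen_kappa(a, b)  # p_o = 0.75, p_e = 0.375 -> kappa = 0.6
```

For the human-VLM comparisons in the main text, `rater_a` would hold the ground-truth (or human modal) labels per image and `rater_b` the model's predictions; the chance-correction term is what distinguishes κ from raw accuracy in the ranking above.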

S2. FER Valence and Arousal Statistics

Table S2. Valence prediction: FER models.

| Model | Pearson r | MAE |
|---|---|---|
| MobileViT | .950 | 0.916 |
| EfficientNet | .940 | 1.063 |
| EmoNet | .928 | 0.795 |

Table S3. Arousal prediction: FER models.

| Model | Pearson r | MAE |
|---|---|---|
| EfficientNet | .448 | 1.696 |
| MobileViT | .409 | 1.864 |
| EmoNet | .126 | 1.369 |

FER arousal predictions are presented separately because FER models predict arousal directly from pixels without intermediate categorical representations, operating under a fundamentally different information regime than VLMs and humans, who both process categorical emotion before dimensional intensity.