Do Vision Language Models See Emotions Like Humans? Comparing Human and VLM Emotion Ratings on AI-Generated Facial Stimuli with Demographic Bias Analysis
Authors: Jini Tae, Ju-Hyeon Park, Wonil Choi
Affiliation: Gwangju Institute of Science and Technology (GIST), South Korea
Abstract
Vision Language Models (VLMs) are increasingly deployed in affective computing, yet their alignment with human emotion perception remains poorly understood beyond categorical accuracy metrics. This study compares emotion ratings from 1,000 human participants with two instruction-tuned VLMs — Gemma3-4B-IT (Google) and LLaMA-3.2-11B-Vision (Meta) — on 1,440 AI-generated facial images balanced across three races (Black, Caucasian, Korean), two genders, and six basic emotions. Using a psychometric framework that treats VLMs as additional raters, we evaluate categorical agreement (Cohen’s κ), dimensional alignment (valence and arousal via Pearson correlation, MAE, and mixed-effects models), and demographic bias against human inter-rater reliability as a ceiling benchmark. Both VLMs achieve moderate-to-substantial categorical agreement (κ = 0.535–0.671) but exhibit stereotyped responding — producing only 1–6 unique values per emotion category with near-zero variance — indicating prototype lookup rather than per-image discrimination. Valence correlations are high (r = .891–.901) but absolute errors are large (MAE = 1.46–1.81) due to polarity exaggeration bias, where VLMs rate negative emotions as more negative and positive emotions as more positive than humans. Arousal predictions outperform all three VA-capable FER-specialized baselines (r = .759–.783 vs. .126–.448), suggesting that language-mediated reasoning confers an advantage for intensity estimation. Demographic bias patterns are model-specific: Gemma3 shows gender-valence bias (β = −0.332) while LLaMA shows race-arousal bias three times larger. These findings demonstrate that VLM emotion ratings cannot substitute for human judgments and that bias audits must be conducted per-model. We additionally compare both VLMs against five FER models on the same stimuli, revealing complementary strengths across the categorical-dimensional divide.
Keywords: Vision Language Models, Facial Emotion Recognition, Psychometric Agreement, Valence-Arousal, Demographic Bias, AI-Generated Faces, Affective Computing
1. Introduction
1.1 Affective Computing and the Promise of VLMs
The deployment of affective computing systems — from mental health chatbots to responsive virtual assistants — increasingly depends on accurate automatic emotion recognition from facial expressions. The efficacy of such systems hinges on affective alignment, defined as the degree to which a machine’s interpretation of emotional cues matches human psychological standards (Pantic et al., 2005). When an empathetic agent misinterprets the intensity of a user’s distress, it risks breaking user trust and failing to sustain meaningful interaction. These stakes motivate rigorous empirical comparison between machine and human emotion perception.
Vision Language Models (VLMs) represent a paradigm shift from task-specific facial expression recognition (FER) models to general-purpose multimodal systems. A VLM is a model that integrates a vision encoder with a large language model, enabling image-conditioned text generation through natural language prompting. Whereas FER-specialized models are trained end-to-end on emotion-labeled datasets and output fixed emotion categories or continuous valence-arousal values, VLMs can flexibly produce both categorical and dimensional emotion ratings through instruction prompting — a capability that mirrors the integrated judgment process humans naturally employ. This flexibility raises the possibility that VLMs might serve as scalable substitutes for costly human emotion annotation, where collecting 72,000 responses from 1,000 raters represents a significant time and financial investment.
To evaluate whether VLMs truly perceive emotions as humans do, a dimensional measurement framework is required. The Circumplex Model of Affect (Russell, 1980) is a theoretical framework that maps all emotional experiences onto a continuous two-dimensional space defined by valence and arousal. Valence is the hedonic quality of an emotional experience, ranging from unpleasant to pleasant. Arousal is the degree of physiological activation, ranging from calm to excited. This dimensional framework provides a richer representational vocabulary than categorical classification alone, enabling detection of subtle perceptual misalignments that discrete labels would obscure. Two systems might both correctly classify an expression as “angry,” yet differ in how intense (arousal) or how negative (valence) they perceive that anger to be. Despite the theoretical importance of dimensional ratings, computational evaluations of emotion recognition have overwhelmingly focused on discrete category accuracy (Khare et al., 2024; Telceken et al., 2025).
1.2 The Evaluation Gap
Although this dimensional framework is well established, current VLM evaluations rarely employ it, leaving four critical gaps that this study addresses.
The first gap concerns the absence of a human performance ceiling. Existing benchmarks rely on accuracy and F1 scores against ground-truth labels while ignoring substantial disagreement among human raters. Human emotion perception is inherently variable — particularly for arousal, where inter-rater reliability can be as low as Krippendorff’s α = 0.125 (present study). Krippendorff’s α is a reliability coefficient for multiple raters that corrects for chance agreement, where 1.0 indicates perfect consensus and 0.0 indicates chance-level agreement. Without establishing human inter-rater reliability as a performance ceiling, it is impossible to determine whether a model’s errors reflect genuine failure or simply mirror the inherent subjectivity of emotion perception.
The second gap is the exclusive focus on categorical accuracy, neglecting continuous dimensional ratings central to affective science. A model may achieve perfect categorical accuracy while producing systematically distorted dimensional ratings — a dissociation we demonstrate empirically in the present study.
The third gap concerns the absence of demographic bias audits for open-source VLMs. While demographic disparities have been documented in commercial FER APIs (Rhue, 2018; Jankowiak et al., 2024), systematic bias analysis of open-source VLMs across race-gender-emotion intersections remains absent. This gap is concerning given the rapid adoption of open-source VLMs in research and applied settings where fairness guarantees are critical.
The fourth gap is the use of unrepresentative models as proxies for “AI.” Prior studies comparing human and AI emotion perception have predominantly used FER-specialized models — lightweight architectures with millions of parameters (e.g., MobileViT approximately 6M parameters, EfficientNet approximately 5M) trained exclusively on emotion-labeled datasets such as AffectNet (Mollahosseini et al., 2017). While these models achieve high classification accuracy, they neither represent the capabilities of modern foundation models with billions of parameters trained on internet-scale multimodal data, nor support the integrated categorical-plus-dimensional ratings that humans naturally produce. This mismatch motivates the transition to VLMs in the present study.
1.3 Contributions and Research Questions
This paper makes five contributions to the intersection of affective computing, cognitive psychology, and multimodal AI evaluation. First, we introduce a VLM-as-rater psychometric framework that treats VLMs as additional participants in a human rating paradigm. Rather than evaluating VLMs against ground-truth labels using accuracy and F1, we employ Intraclass Correlation Coefficients (ICC), Cohen’s κ, Krippendorff’s α, and Bland-Altman analysis to quantify agreement against human inter-rater reliability as an empirical ceiling. Cohen’s κ is a chance-corrected agreement measure for categorical classification, where 0 indicates chance-level and 1 indicates perfect agreement. Bland-Altman analysis is a method for assessing agreement between two measurement methods via systematic bias and 95% limits of agreement. This framework reveals dimensions of VLM behavior — stereotyped responding, polarity exaggeration, dimensional collapse — that accuracy-based evaluations entirely miss. Second, we present the first systematic demographic bias analysis of open-source VLMs using a fully crossed 3 (race: Black, Caucasian, Korean) × 2 (gender: Male, Female) × 6 (emotion) factorial stimulus design with 1,440 AI-generated face images ensuring perfect experimental control. Third, we discover stereotyped responding — a phenomenon where VLMs produce only 1–6 unique valence-arousal values per emotion category (e.g., neutral valence SD = 0.00 for LLaMA), indicating categorical prototype lookup rather than per-image intensity discrimination. Fourth, we perform a dual comparison of both VLMs and five FER-specialized models against the same human baseline (N = 1,000), revealing a striking strength inversion: FER models dominate valence prediction while VLMs dominate arousal prediction, suggesting complementary architectural advantages. 
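To make the chance-corrected agreement measure concrete, the following is a minimal pure-Python sketch of Cohen’s κ for two raters (the study itself computes κ against intended emotion labels; the function name and toy data here are illustrative, not from the paper):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected categorical agreement between two raters.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is agreement expected by chance from marginal label frequencies.
    0 indicates chance-level agreement, 1 indicates perfect agreement.
    """
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Toy example: a human and a model labeling six faces
human = ["happy", "sad", "angry", "happy", "neutral", "fear"]
model = ["happy", "neutral", "angry", "happy", "neutral", "fear"]
print(round(cohens_kappa(human, model), 3))  # → 0.786
```

Five of six labels agree (p_o ≈ 0.833), but chance agreement from the marginals (p_e ≈ 0.222) discounts that to κ ≈ 0.786, illustrating why κ is a stricter criterion than raw accuracy.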
Fifth, we identify model-specific demographic bias profiles: Gemma3 and LLaMA exhibit biases in different dimensions, for different demographic groups, and in different directions.
Our research questions address three axes of VLM-human comparison. RQ1 asks how VLM emotion ratings compare to human inter-rater reliability on categorical and dimensional measures. RQ2 asks whether VLMs exhibit systematic demographic biases in emotion attribution, and whether these biases are model-specific. RQ3 asks how VLMs compare to FER-specialized models in classification accuracy, dimensional prediction, and bias profiles.
2. Related Work
2.1 VLMs for Emotion Recognition
The application of VLMs to facial emotion recognition has yielded mixed results, with traditional deep learning models consistently outperforming VLMs on categorical accuracy. Mulukutla et al. (2025) conducted the first empirical comparison of open-source VLMs against traditional models on FER-2013, a dataset containing 35,887 low-resolution grayscale images across seven emotion classes. Traditional models — EfficientNet-B0 (86.44% accuracy) and ResNet-50 (85.72%) — outperformed VLMs by a margin of 20 to 35 percentage points, with CLIP achieving 64.07% and Phi-3.5 Vision achieving 51.66%. This performance gap suggests that VLMs’ general visual understanding does not automatically translate to FER proficiency, particularly on low-quality visual inputs.
Frontier API models show more promising results, with GPT-4o and Gemini matching human performance for certain expressions. Evaluations on the NimStim dataset demonstrate that GPT-4o and Gemini match or exceed human performance for calm, neutral, and surprise expressions, though performance degrades for more ambiguous emotions (Harb et al., 2025). Refoua et al. (2026) evaluated ChatGPT-4, ChatGPT-4o, and Claude 3 Opus on the Reading the Mind in the Eyes Test (RMET) with White, Black, and Korean face stimuli, finding that ChatGPT-4o achieved cross-ethnically consistent performance with accuracy above the 85th human percentile across all three ethnic versions. Specialized VLM frameworks for FER have also emerged, including FACET-VLM (2025), which achieves up to 99.41% on BU-4DFE through multiview facial representation learning with semantic language guidance. However, these fine-tuned models sacrifice the generality that makes VLMs attractive as versatile emotion annotators. The present study evaluates open-source models at the 4B–11B scale that are accessible for research deployment, bridging the gap between FER-specialized and frontier API approaches.
2.2 Human-AI Comparison in Emotion Perception
The psychometric comparison of human and machine raters has a long tradition in clinical psychology, recently extended to large language models. The Intraclass Correlation Coefficient (ICC) quantifies consistency among multiple raters, and Bland-Altman analysis visualizes systematic bias between two measurement methods; both serve as standard tools for assessing measurement agreement. In affective computing, Tak and Gratch (2024) found that GPT-4 emulates average-human emotional cognition from a third-person perspective, with its interpretations aligning more closely with human judgments about others’ emotions than with self-assessments. A study published in PLOS ONE (Alrasheed et al., 2025) evaluated GPT-4’s capacity to interpret emotions from images, achieving numeric response correlations of r = 0.87 for valence and r = 0.72 for arousal on the Geneva Affective Picture Database (GAPED) under zero-shot conditions. These results establish that large language models can approximate human emotion perception, though the extent of approximation varies across emotional dimensions.
Prior human-AI comparisons in emotion perception have typically used either FER-specialized models with limited dimensionality or frontier API models without transparent access to model internals. Zhang et al. (2024) provide a comprehensive survey of affective computing in the era of LLMs, noting that while LLMs excel at affective understanding tasks such as sentiment classification and emotion detection, their performance on dimensional emotion estimation remains underexplored. The present study bridges this gap by evaluating open-source VLMs that produce integrated categorical-plus-dimensional ratings through a psychometric framework anchored to large-scale human data (N = 1,000).
2.3 Demographic Bias in Automated Affect Recognition
Documented racial and gender disparities in automated affect recognition have raised fairness concerns that extend to VLMs. Jankowiak et al. (2024) proposed formal metrics for measuring dataset demographic bias in FER, demonstrating that imbalanced training data composition propagates into systematic performance disparities across demographic groups. Gender bias in FER manifests in two forms: representational bias, which refers to unequal demographic representation in training data, and stereotypical bias, which refers to systematic associations between emotions and demographics, such as linking female faces with sadness and male faces with anger (Dominguez-Catena et al., 2024).
Human emotion perception itself is not demographically neutral. Gender-emotion stereotypes lead observers to associate male faces with dominance-related emotions such as anger and female faces with prosocial emotions such as happiness and sadness (Hess et al., 2004). These biases in human annotation propagate into training datasets — AffectNet (Mollahosseini et al., 2017) relies on sparse annotation of approximately 12 raters per image — and may be amplified by algorithmic optimization. The present study extends bias analysis from commercial APIs and training datasets to open-source VLMs, using a factorial experimental design that enables orthogonal estimation of race, gender, and emotion effects through mixed-effects modeling.
2.4 AI-Generated Stimuli in Emotion Research
Traditional face databases used in emotion research — including KDEF, ADFES, FER-2013, and AffectNet — suffer from uncontrolled variation in expression quality, lighting, and demographic balance. Real-face databases rely on actors performing emotional expressions, introducing individual variation in expression quality and intensity; these confounds compromise internal validity. Demographic balance is also difficult to achieve, with most databases overrepresenting certain racial groups.
AI-generated face stimuli address these limitations through a controlled generation pipeline that ensures perfect experimental control. The GIST-AIFaceDB used in this study generates neutral base faces with standardized features — identical gray backgrounds, navy t-shirts, and front-facing pose — then transforms each neutral face into five emotional expressions while preserving identity. This pipeline ensures that any differences between emotional expressions for a given identity are attributable solely to the emotion manipulation, not to extraneous visual factors. The ecological validity of AI-generated stimuli is supported by human naturalness ratings: in the present dataset, average naturalness ranged from 5.26 (fear) to 6.94 (happy) on a 9-point scale, indicating that participants perceived the stimuli as moderately to highly realistic. Baudouin et al. (2025) provide supporting evidence that dimensional ratings can be reliably collected from facial stimuli regardless of their provenance, suggesting that AI-generated faces elicit comparable affective responses to photographed faces.
3. Methodology
Figure 1 presents the overall research pipeline, illustrating how 1,440 AI-generated stimuli flow through human rating, VLM inference, and FER baseline evaluation before converging in psychometric comparison.
```mermaid
flowchart TB
    subgraph Stimuli["Stimuli Generation"]
        A["OpenArt<br>STOIQO NewReality Flux"] -->|"240 neutral faces"| B["Nano-Banana<br>Gemini 2.5 Flash Image"]
        B -->|"5 emotions per identity"| C["GIST-AIFaceDB<br>1,440 images<br>3 races x 2 genders x 6 emotions x 40 IDs"]
    end
    subgraph Human["Human Rating"]
        C --> D["N = 1,000 Korean Adults<br>72 images each<br>72,000 total responses"]
        D --> E["Valence 1-9<br>Arousal 1-9<br>Naturalness 1-9"]
    end
    subgraph VLM["VLM Inference"]
        C --> F["Gemma3-4B-IT<br>Google, QAT 4-bit"]
        C --> G["LLaMA-3.2-11B-Vision<br>Meta, 4-bit"]
        F --> H["Context-Carry<br>3-Step Prompting"]
        G --> H
        H --> I["Emotion + Valence + Arousal<br>per image"]
    end
    subgraph FER["FER Baselines"]
        C --> J["5 Models<br>PosterV2, MobileViT,<br>EfficientNet, BEiT, EmoNet"]
        J --> K["Classification +<br>VA Prediction"]
    end
    subgraph Analysis["Psychometric Comparison"]
        E --> L["Cohen kappa<br>Pearson r, MAE<br>Mixed-Effects Models<br>Demographic Bias"]
        I --> L
        K --> L
        L --> M["Key Findings:<br>Stereotyped Responding<br>Polarity Exaggeration<br>Strength Inversion<br>Model-Specific Bias"]
    end
    style Stimuli fill:#e1f5fe,stroke:#0288d1
    style Human fill:#fff3e0,stroke:#f57c00
    style VLM fill:#e8f5e9,stroke:#388e3c
    style FER fill:#fce4ec,stroke:#c62828
    style Analysis fill:#f3e5f5,stroke:#7b1fa2
```
Figure 1. Overall research pipeline. AI-generated stimuli (blue) are evaluated by human raters (orange), two VLMs (green), and five FER baselines (red), with all outputs converging in psychometric comparison (purple).
3.1 Stimuli
The stimulus set comprises 1,440 AI-generated facial images from the GIST AI-Generated Face Database (GIST-AIFaceDB, under review). The generation pipeline employed a two-step process. In the first step, 240 neutral base faces were generated using the STOIQO NewReality Flux model deployed on the OpenArt platform. These neutral faces depicted diverse virtual identities wearing standardized navy t-shirts against gray backgrounds, with generation prompts specifying age diversity, hairstyle variation, and demographic characteristics across three racial groups (Black, Caucasian, Korean) and two genders (Male, Female). In the second step, each neutral face was transformed into five additional emotional expressions — angry, disgusted, fearful, happy, and sad — using Nano-Banana, an advanced image-editing model implemented in Google AI Studio (Gemini 2.5 Flash Image), which modifies facial expressions while preserving the identity, lighting, and background of the original image.
The resulting fully crossed factorial design — 3 (race: Black, Caucasian, Korean) × 2 (gender: Male, Female) × 6 (emotion: angry, disgust, fear, happy, sad, neutral) × 40 (identity) — yields 1,440 images with balanced cell sizes: 240 images per emotion, 480 per race, 720 per gender, and 40 per race-gender-emotion combination. This balanced design enables orthogonal estimation of all demographic effects without confounding.
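The balanced marginals follow directly from the crossing; a short enumeration sketch (variable names illustrative) verifies the cell counts:

```python
from itertools import product

races = ["Black", "Caucasian", "Korean"]
genders = ["Male", "Female"]
emotions = ["angry", "disgust", "fear", "happy", "sad", "neutral"]
identities = range(40)  # 40 identities per race-gender cell

# Fully crossed design: one image per (race, gender, emotion, identity) cell
images = list(product(races, genders, emotions, identities))

n_total = len(images)                                              # 1,440
per_emotion = sum(1 for img in images if img[2] == "happy")        # 240
per_race = sum(1 for img in images if img[0] == "Korean")          # 480
per_gender = sum(1 for img in images if img[1] == "Female")        # 720
per_cell = sum(1 for img in images
               if img[:3] == ("Black", "Male", "sad"))             # 40
```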
3.2 Human Rating Procedure
The study protocol was reviewed and granted exemption by the Institutional Review Board (IRB). One thousand native Korean adults (500 female, 500 male; age M = 44.6, SD = 13.7, range 20–69) were recruited through an online platform, with recruitment strictly balanced across age cohorts and genders. Each participant evaluated 72 images randomly sampled from the total pool of 1,440, presented in randomized order. This counterbalanced crossed design ensured that each image received 50 independent ratings, yielding 72,000 total responses across three dimensions: valence (1–9 Likert scale, 1 = “extremely negative,” 9 = “extremely positive”), arousal (1–9, 1 = “not at all aroused,” 9 = “highly aroused”), and naturalness (1–9, 1 = “very unnatural,” 9 = “very natural”).
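The counterbalancing arithmetic can be checked directly:

```python
n_participants = 1000
images_per_participant = 72
pool_size = 1440

# Each participant's 72 ratings sum to the study's total response count
total_responses = n_participants * images_per_participant   # 72,000

# Spread evenly over the pool, every image receives 50 independent ratings
ratings_per_image = total_responses // pool_size            # 50
```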
Inter-rater reliability, computed as Krippendorff’s α (ordinal), established the human performance ceiling: valence α = 0.471 (poor-to-fair), arousal α = 0.125 (poor), and naturalness α = 0.126 (poor). While these values appear low, they fall within the typical range for emotion rating studies and reflect the inherent subjectivity of affective perception, particularly for arousal. A linear mixed-effects model (LMM) is a regression model containing both fixed effects (systematic factors such as emotion category) and random effects (sources of variation such as individual images or raters). Mixed-effects variance decomposition confirmed that rater individual differences (σ² = 0.450 for valence, σ² = 0.696 for arousal) dominated image-level variance by a factor of 11 for valence and 32 for arousal, confirming that low reliability is driven by rater heterogeneity rather than stimulus ambiguity.
3.3 VLM Inference
Two instruction-tuned VLMs were evaluated: Gemma3-4B-IT (Google, 4 billion parameters, QAT 4-bit quantized) and LLaMA-3.2-11B-Vision-Instruct (Meta, 11 billion parameters, 4-bit quantized). Both models were deployed on Apple Silicon (M1 Max, 32GB) via the MLX framework for GPU-accelerated inference without HTTP overhead.
Inference followed a three-step context-carry prompting strategy, a term we introduce to describe a sequential approach where prior outputs are fed forward as context for subsequent predictions, mirroring anchoring effects in human sequential judgment. In Step 1, the model classified the facial emotion from six forced-choice categories (happy, sad, angry, fear, disgust, neutral) via structured JSON output. In Step 2, the classified emotion was carried forward as context, and the model rated valence on a 1–9 scale. In Step 3, both the classified emotion and valence rating were carried forward, and the model rated arousal on a 1–9 scale. This strategy introduces structural error propagation: classification errors in Step 1 systematically influence subsequent valence and arousal ratings. Response parsing employed a cascade strategy: direct JSON parse, markdown fence stripping, and regex fallback. Emotion labels were fuzzy-matched by their first three characters, and both valence and arousal were clamped to [1, 9]. Gemma3 achieved 100% JSON parse success with one invalid category output (0.07%, “doubt”), while LLaMA achieved comparable compliance. All 1,440 images were successfully processed by both models.
Figure 2 illustrates the three-step context-carry prompting strategy with the actual prompt templates used in the study.
```mermaid
flowchart TD
    IMG["Input: Face Image + Prompt"] --> S1
    subgraph S1["Step 1 — Emotion Classification"]
        direction TB
        P1["PROMPT:<br>What is the facial expression<br>in this image? Choose one from:<br>happy, sad, angry, fear,<br>disgust, neutral.<br>Answer with a single word only."]
        P1 --> R1["MODEL RESPONSE:<br>e.g. happy"]
    end
    S1 -->|"emotion = happy<br>carried to Step 2"| S2
    subgraph S2["Step 2 — Valence Rating"]
        direction TB
        P2["PROMPT:<br>You identified this face as happy.<br>How pleasant is this facial<br>expression? Rate from 1 to 9<br>where 1 is very unpleasant and<br>9 is very pleasant.<br>Answer with a single number only."]
        P2 --> R2["MODEL RESPONSE:<br>e.g. 8"]
    end
    S2 -->|"emotion = happy,<br>valence = 8<br>carried to Step 3"| S3
    subgraph S3["Step 3 — Arousal Rating"]
        direction TB
        P3["PROMPT:<br>You identified this face as happy<br>with pleasantness 8 out of 9.<br>How intense or activated is the<br>emotion in this face? Rate from<br>1 to 9 where 1 is very calm<br>and 9 is very excited.<br>Answer with a single number only."]
        P3 --> R3["MODEL RESPONSE:<br>e.g. 7"]
    end
    S3 --> OUT["Final Output:<br>emotion=happy, valence=8, arousal=7"]
    S1 -.->|"Error Propagation"| S2
    S2 -.->|"Anchoring Effect"| S3
    style S1 fill:#e8f5e9,stroke:#388e3c
    style S2 fill:#fff3e0,stroke:#f57c00
    style S3 fill:#fce4ec,stroke:#c62828
    style OUT fill:#e1f5fe,stroke:#0288d1
    style P1 fill:#f1f8e9,stroke:#689f38,text-align:left
    style P2 fill:#fff8e1,stroke:#ffa000,text-align:left
    style P3 fill:#fce4ec,stroke:#e57373,text-align:left
```
Figure 2. Three-step context-carry prompting strategy with actual prompt templates. Each step receives the face image along with a text prompt. Step 1 output (emotion label) is injected into Step 2’s prompt template; Step 2 output (valence) is further injected into Step 3’s prompt. Dashed arrows indicate error propagation: a misclassification in Step 1 (e.g., “sad” classified as “neutral”) causes Steps 2 and 3 to rate valence and arousal under the wrong emotional frame.
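The response-parsing cascade described in Section 3.3 can be sketched as follows. This is a minimal reconstruction under stated assumptions — the function names and exact regexes are illustrative, not the study’s actual implementation:

```python
import json
import re

EMOTIONS = ["happy", "sad", "angry", "fear", "disgust", "neutral"]

def strip_markdown_fences(text):
    # Remove ```json ... ``` style fences some models wrap around output
    return re.sub(r"```[a-zA-Z]*", "", text).strip()

def parse_emotion(raw):
    """Cascade parser: direct JSON parse, then fence stripping, then a
    regex fallback; labels are fuzzy-matched by their first three characters."""
    candidate = None
    for attempt in (raw, strip_markdown_fences(raw)):
        try:
            obj = json.loads(attempt)
            candidate = obj.get("emotion", "") if isinstance(obj, dict) else str(obj)
            break
        except ValueError:
            continue
    if candidate is None:
        match = re.search(r"[a-zA-Z]+", raw)  # regex fallback: first word
        candidate = match.group(0) if match else ""
    prefix = str(candidate).strip().lower()[:3]
    for label in EMOTIONS:
        if prefix and label[:3] == prefix:
            return label
    return None  # invalid category output, e.g. "doubt"

def clamp_rating(value, lo=1, hi=9):
    # Valence and arousal values outside the scale are clamped to [1, 9]
    return max(lo, min(hi, value))
```

For example, `parse_emotion('```json\n{"emotion": "Happiness"}\n```')` strips the fence, parses the JSON, and fuzzy-matches `"hap"` to `happy`, while `parse_emotion("doubt")` returns `None` and would be logged as an invalid category.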
3.4 FER Baseline Models
Five FER-specialized models were evaluated on the same 1,440 images for comparative analysis. Facial Expression Recognition (FER) models are task-specific architectures trained end-to-end on emotion-labeled datasets to output fixed emotion categories or continuous valence-arousal values. The five baselines included PosterV2 (Pyramid Transformer, classification only), MobileViT (lightweight Vision Transformer, classification and VA prediction), EfficientNet-B0-8-VA-MTL (multi-task CNN, classification and VA prediction), BEiT (BERT Image Transformer, classification only), and EmoNet (CNN, classification and VA prediction). For the three VA-capable models (EmoNet, MobileViT, EfficientNet), predictions in the native [-1, 1] range were normalized to the human rating scale [1, 9] using the formula v_norm = (v_raw + 1) / 2 × 8 + 1.
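The scale normalization is a single affine map; a sketch (function name illustrative) makes the endpoint behavior explicit:

```python
def normalize_va(v_raw):
    """Map a FER model's valence/arousal output in [-1, 1] onto the
    human 1-9 Likert scale: v_norm = (v_raw + 1) / 2 * 8 + 1."""
    return (v_raw + 1) / 2 * 8 + 1

# Endpoints and midpoint map as intended:
# -1 (most negative) -> 1, 0 (neutral) -> 5, +1 (most positive) -> 9
```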
3.5 Statistical Analysis
Categorical agreement was quantified via Cohen’s κ against intended emotion labels, with McNemar’s test for pairwise model comparisons. Dimensional alignment was assessed through Pearson correlation, Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and Bland-Altman analysis (systematic bias and 95% limits of agreement). Per-emotion bias significance was tested with Wilcoxon signed-rank tests, Bonferroni-corrected for 18 comparisons (6 emotions × 3 VA-capable models per model family).
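The dimensional-agreement statistics reduce to simple computations on paired ratings; the following is an illustrative pure-Python sketch of MAE and the Bland-Altman bias and limits of agreement (toy data, not study values):

```python
from statistics import mean, stdev

def mae(model_scores, human_scores):
    # Mean Absolute Error between paired ratings
    return mean(abs(m - h) for m, h in zip(model_scores, human_scores))

def bland_altman(model_scores, human_scores):
    """Systematic bias (mean of paired differences) and the 95% limits
    of agreement (bias +/- 1.96 SD of the differences)."""
    diffs = [m - h for m, h in zip(model_scores, human_scores)]
    bias = mean(diffs)
    sd = stdev(diffs)
    return bias, (bias - 1.96 * sd, bias + 1.96 * sd)

# Toy data: a rater that scores about one point lower than the reference
model = [2.0, 3.0, 4.5, 6.0, 7.5]
human = [3.0, 4.2, 5.5, 7.0, 8.3]
bias, (loa_low, loa_high) = bland_altman(model, human)  # bias = -1.0
```

A consistent one-point underestimate shows up as bias = −1.0 with narrow limits of agreement, the signature pattern Bland-Altman analysis is designed to expose.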
Bias decomposition employed linear mixed-effects models (LMMs) fitted via R’s lme4 package (Bates et al., 2015) with Satterthwaite degrees of freedom (lmerTest). Satterthwaite degrees of freedom are an approximation method for computing p-values in mixed-effects models where exact degrees of freedom are undefined. The emotion-bias model used the formula: rating ~ rater_type * emotion + (1|image_id), where rater_type distinguishes human aggregate ratings from VLM ratings and image_id is a crossed random effect controlling for between-image variability. Demographic bias models used analogous formulas with actor_race and actor_gender as fixed effects.
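Assuming standard dummy coding of the fixed effects, the lme4 formula `rating ~ rater_type * emotion + (1|image_id)` corresponds to the model equation:

```latex
\text{rating}_{ij} = \beta_0
  + \beta_1\,\text{rater\_type}_i
  + \sum_{k} \beta_{2k}\,\text{emotion}_{kj}
  + \sum_{k} \beta_{3k}\,\big(\text{rater\_type}_i \times \text{emotion}_{kj}\big)
  + u_j + \varepsilon_{ij},
\qquad u_j \sim \mathcal{N}(0, \sigma^2_{\text{image}}),\quad
\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2),
```

where $u_j$ is the random intercept for image $j$, $\text{rater\_type}_i$ contrasts VLM against human-aggregate ratings, and the interaction terms $\beta_{3k}$ capture per-emotion rater-type biases.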
4. Results
4.1 Emotion Classification
Gemma3-4B-IT achieved Cohen’s κ = 0.671 (substantial agreement), outperforming EmoNet (κ = 0.665) and approaching BEiT (κ = 0.713), while LLaMA-3.2-11B-Vision achieved κ = 0.535 (moderate agreement), below all FER baselines. Table 1 presents the full seven-model ranking. The larger LLaMA (11B parameters) performed worse than the smaller Gemma3 (4B parameters), demonstrating that model scale does not guarantee improved emotion recognition and that instruction tuning quality and pretraining data composition are more consequential factors.
Table 1. Overall emotion classification performance (N = 1,440).
| Model | Type | Parameters | Accuracy | Macro F1 | Cohen’s κ |
|---|---|---|---|---|---|
| PosterV2 | FER | ~44M | 0.899 | 0.900 | 0.878 |
| MobileViT | FER | ~6M | 0.875 | 0.874 | 0.848 |
| EfficientNet | FER | ~5M | 0.854 | 0.856 | 0.823 |
| BEiT | FER | ~86M | 0.766 | 0.772 | 0.713 |
| Gemma3-4B | VLM | 4B | 0.726 | 0.683 | 0.671 |
| EmoNet | FER | ~5M | 0.731 | 0.724 | 0.665 |
| LLaMA-3.2-11B | VLM | 11B | 0.613 | 0.402 | 0.535 |
Both VLMs perfectly classified happy and neutral but failed dramatically on sadness. Gemma3 achieved a sad F1 of 0.223 while LLaMA achieved only 0.092, compared to PosterV2’s 0.992. Table 2 presents emotion-specific accuracy across all seven models, revealing extreme performance polarization.
Table 2. Emotion-specific classification accuracy (proportion correct).
| Emotion | Gemma3 | LLaMA | PosterV2 | MobileViT | EfficientNet | BEiT | EmoNet |
|---|---|---|---|---|---|---|---|
| Happy | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.979 | 1.000 |
| Neutral | 1.000 | 1.000 | 0.912 | 0.863 | 0.729 | 0.529 | 0.533 |
| Fear | 0.979 | 0.654 | 0.933 | 0.942 | 0.846 | 0.792 | 0.912 |
| Disgust | 0.842 | 0.008 | 0.642 | 0.533 | 0.679 | 0.754 | 0.846 |
| Angry | 0.404 | 0.921 | 0.917 | 0.954 | 0.887 | 0.800 | 0.637 |
| Sad | 0.126 | 0.092 | 0.992 | 0.958 | 0.983 | 0.742 | 0.454 |
The two VLMs exhibit complementary error profiles that are qualitatively distinct from FER confusion patterns. Gemma3 exhibits neutral absorption, classifying 71.1% of sad images as neutral, while LLaMA exhibits angry merger, classifying 99.2% of disgust images as angry. Neutral absorption is the dominant VLM error pattern of classifying sad expressions as neutral, suggesting the model treats sadness as the absence of emotion. In contrast, LLaMA excels at angry (92.1% accuracy) where Gemma3 struggles (40.4%), while Gemma3 excels at disgust (84.2%) where LLaMA fails completely (0.8%). These two dominant error pathways account for 70.6% of all classification errors, and both are qualitatively distinct from the angry-disgust visual overlap confusions that FER models share.
4.2 Valence Comparison
Both VLMs achieve high valence correlations (r = .891–.901), approaching but not matching FER models (r = .928–.950), as shown in Table 3. However, absolute errors are 1.5 to 2.0 times larger (VLM MAE = 1.46–1.81 vs. FER MAE = 0.80–1.06), reflecting a pattern of correct rank ordering but distorted scale usage.
Table 3. Valence prediction summary statistics.
| Model | Type | Pearson r | MAE | Model M (SD) | Human M (SD) |
|---|---|---|---|---|---|
| MobileViT | FER | .950 | 0.916 | 4.18 (2.35) | 4.60 (1.42) |
| EfficientNet | FER | .940 | 1.063 | 4.05 (2.57) | 4.60 (1.42) |
| EmoNet | FER | .928 | 0.795 | 4.32 (2.00) | 4.60 (1.42) |
| LLaMA-3.2-11B | VLM | .901 | 1.808 | 3.71 (3.08) | 4.60 (1.42) |
| Gemma3-4B | VLM | .891 | 1.456 | 4.31 (2.65) | 4.60 (1.42) |
The source of this distortion is polarity exaggeration bias, defined as the systematic tendency to produce more extreme valence ratings than humans — more negative for negative emotions and more positive for positive emotions. Gemma3’s valence SD of 2.65 is 1.87 times the human SD of 1.42, and LLaMA’s SD of 3.08 is 2.17 times the human SD. Table 4 presents per-emotion valence bias across all models.
Table 4. Per-emotion valence bias (Model − Human mean).
| Emotion | Gemma3 | LLaMA | EmoNet | MobileViT | EfficientNet |
|---|---|---|---|---|---|
| Fear | −1.99 | −2.68 | +0.40 | −0.14 | −0.62 |
| Disgust | −1.39 | −2.25 | −1.35 | −0.78 | −0.97 |
| Angry | −1.06 | −2.04 | −0.64 | −1.01 | −0.79 |
| Happy | +1.26 | +1.58 | +0.76 | +1.01 | +1.03 |
| Neutral | +1.05 | −0.28 | +0.04 | −0.09 | +0.01 |
| Sad | +0.38 | +0.53 | −0.89 | −1.51 | −1.95 |
LLaMA’s negative-emotion valence bias (−2.04 to −2.68) is approximately double Gemma3’s (−1.06 to −1.99). Mixed-effects models confirmed all per-emotion biases as statistically significant (p < .001). For the angry reference category, the LMM for LLaMA yielded a main effect of rater_type[vlm] of β = −2.050 (t = −42.73, p < .001), approximately double Gemma3’s β = −1.053 (t = −18.06, p < .001). This pattern suggests that increased model scale amplifies rather than reduces polarity exaggeration, a possibility examined further in Section 5.2.
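The rater_type effect reported above can be estimated with a linear mixed model of this general form. The sketch below is a simplified version on synthetic data (the study’s full specification also includes emotion fixed effects, and the column names here are hypothetical), using statsmodels’ `mixedlm`:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic long-format ratings table: one row per (image, rater type),
# with an injected VLM valence bias of -1.0 to recover
rng = np.random.default_rng(1)
rows = []
for img in range(60):
    base = rng.normal(4.6, 1.0)  # per-image "true" valence
    for rater_type in ("human", "vlm"):
        bias = -1.0 if rater_type == "vlm" else 0.0
        rows.append({"image_id": img,
                     "rater_type": rater_type,
                     "valence": base + bias + rng.normal(0, 0.3)})
df = pd.DataFrame(rows)

# Fixed effect of rater type; random intercept per stimulus image
m = smf.mixedlm("valence ~ rater_type", data=df, groups=df["image_id"]).fit()
print(m.params["rater_type[T.vlm]"])  # recovers the injected bias near -1.0
```

The random intercept absorbs per-image valence differences, so the rater_type coefficient isolates the systematic VLM-minus-human shift, analogous to the β values reported in the text.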
4.3 Arousal Comparison
A striking strength inversion emerges in arousal prediction. Strength inversion refers to the complementary pattern where FER models dominate valence prediction while VLMs dominate arousal prediction. As Table 5 shows, VLMs substantially outperform all five FER-specialized models on arousal (r = .759–.783 vs. .126–.448), suggesting that language-mediated reasoning about emotional intensity confers a structural advantage for arousal estimation. Gemma3 additionally achieves the lowest arousal MAE (1.137) among all seven models.
Table 5. Arousal prediction summary statistics.
| Model | Type | Pearson r | MAE | Model M (SD) | Human M (SD) |
|---|---|---|---|---|---|
| LLaMA-3.2-11B | VLM | .783 | 1.777 | 5.36 (2.42) | 5.61 (0.60) |
| Gemma3-4B | VLM | .759 | 1.137 | 5.49 (1.74) | 5.61 (0.60) |
| EfficientNet | FER | .448 | 1.696 | 6.53 (2.33) | 5.61 (0.60) |
| MobileViT | FER | .409 | 1.864 | 6.68 (2.61) | 5.61 (0.60) |
| EmoNet | FER | .126 | 1.369 | 6.48 (1.56) | 5.61 (0.60) |
The most striking between-model difference is happy arousal. Gemma3’s bias of +0.30 is non-significant in the LMM (β = +0.059, p = .442), indicating appropriate calibration for happy intensity. In contrast, LLaMA rates happy arousal at 8.87 (human mean: 6.48), yielding a +2.39 overestimation (β = +2.889, p < .001) that reflects an extreme “happiness = maximal excitement” prototype. Table 6 presents per-emotion arousal bias with LMM significance.
Table 6. Per-emotion arousal bias (VLM − Human mean), with LMM significance.
| Emotion | Gemma3 Bias | LMM p | LLaMA Bias | LMM p |
|---|---|---|---|---|
| Fear | +1.30 | < .001 | +1.21 | < .001 |
| Happy | +0.30 | .442 | +2.39 | < .001 |
| Angry | +0.24 | < .001 | −0.50 | < .001 |
| Disgust | +0.42 | .026 | −0.57 | .517 |
| Sad | −1.04 | < .001 | −2.10 | < .001 |
| Neutral | −1.90 | < .001 | −1.91 | < .001 |
Both VLMs severely underestimate neutral arousal (bias: −1.90 to −1.91) and sad arousal (bias: −1.04 to −2.10), revealing a systematic tendency to associate low visual salience with minimal arousal.
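The per-emotion biases in Tables 4 and 6 are simple mean differences within each emotion category. A minimal sketch on toy data, with hypothetical column names standing in for the study’s schema:

```python
import pandas as pd

# Toy ratings table; values loosely echo the happy/sad arousal pattern
df = pd.DataFrame({
    "emotion":       ["happy", "happy", "sad", "sad"],
    "human_arousal": [6.5, 6.4, 4.6, 4.8],
    "vlm_arousal":   [8.9, 8.8, 2.6, 2.7],
})

# Bias = mean(VLM) - mean(human), computed per emotion category
means = df.groupby("emotion")[["vlm_arousal", "human_arousal"]].mean()
bias = means["vlm_arousal"] - means["human_arousal"]
print(bias)  # happy positive (overestimation), sad negative (underestimation)
```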
Figure 3 visualizes the strength inversion pattern: FER models dominate classification and valence, while VLMs dominate arousal prediction.
```mermaid
quadrantChart
    title Strength Inversion - FER vs VLM Performance
    x-axis "Low Valence r" --> "High Valence r"
    y-axis "Low Arousal r" --> "High Arousal r"
    quadrant-1 "VLM Advantage"
    quadrant-2 "Both Strong"
    quadrant-3 "Both Weak"
    quadrant-4 "FER Advantage"
    "Gemma3-4B": [0.68, 0.85]
    "LLaMA-11B": [0.70, 0.89]
    "EmoNet": [0.80, 0.10]
    "MobileViT": [0.86, 0.35]
    "EfficientNet": [0.84, 0.42]
```
Figure 3. Strength inversion between VLMs and FER models. Horizontal axis represents valence correlation (r) with human ratings; vertical axis represents arousal correlation. VLMs cluster in the upper-left quadrant (strong arousal, moderate valence), while FER models cluster in the lower-right (strong valence, weak arousal).
4.4 Stereotyped Responding and Dimensional Collapse
Stereotyped responding is the production of only 1–6 unique valence-arousal values per emotion category, indicating prototype lookup rather than per-image discrimination; dimensional collapse is the resulting reduction of continuous dimensional variation to a small set of discrete prototype values. The extreme case is LLaMA’s neutral valence SD of 0.00: all 240 neutral images received the identical value of 5, with zero per-image discrimination. Table 7 presents response variance across emotions for both VLMs and human raters.
Table 7. Response variance by emotion: standard deviation of ratings within each emotion category.
| Emotion | Gemma3 V SD | LLaMA V SD | Human V SD | Gemma3 A SD | LLaMA A SD | Human A SD |
|---|---|---|---|---|---|---|
| Happy | 0.48 | 0.13 | 1.31 | 0.66 | 0.72 | 1.57 |
| Neutral | 0.64 | 0.00 | 1.08 | 0.44 | 0.28 | 1.71 |
| Fear | 0.16 | 0.50 | 1.61 | 0.47 | 1.86 | 1.52 |
| Angry | 0.80 | 1.05 | 1.55 | 0.49 | 1.21 | 1.51 |
| Sad | 1.02 | 1.13 | 1.44 | 1.03 | 0.35 | 1.53 |
| Disgust | 0.39 | 0.82 | 1.54 | 0.49 | 1.55 | 1.51 |
Across all emotions, VLM valence SDs (range: 0.00–1.13) are dramatically lower than human SDs (range: 1.08–1.61). This dimensional collapse likely arises from the discrete token generation of VLMs, which must select specific integer tokens from their vocabulary, whereas FER regression heads produce continuous outputs through dedicated prediction layers trained end-to-end on dimensional emotion data. The result is qualitatively different from both human raters, who exhibit genuine individual variation, and FER models, which produce continuous output distributions.
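The unique-value counts behind this diagnosis can be computed directly. A sketch on toy data with hypothetical columns, mimicking LLaMA’s behavior of assigning the identical (valence, arousal) pair to every neutral image:

```python
import pandas as pd

# Toy per-image VLM ratings: neutral is fully stereotyped, sad is not
df = pd.DataFrame({
    "emotion": ["neutral"] * 4 + ["sad"] * 4,
    "valence": [5, 5, 5, 5, 4, 4, 5, 4],
    "arousal": [3, 3, 3, 3, 3, 4, 3, 3],
})

# Stereotyped responding: count distinct (valence, arousal) pairs per emotion
uniq = (df.drop_duplicates(["emotion", "valence", "arousal"])
          .groupby("emotion").size())
# Dimensional collapse: within-category standard deviation of valence
sd = df.groupby("emotion")["valence"].std()
print(uniq["neutral"], sd["neutral"])  # 1 unique pair, SD = 0.0
```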
4.5 Demographic Bias Analysis
Mixed-effects models revealed that VLM demographic biases are model-specific in direction, magnitude, and affected dimension. Regarding race bias, Gemma3 showed no significant race-valence bias, while LLaMA showed significant valence bias for Korean faces (β = +0.319, p = .009). For arousal, LLaMA’s race bias was three times larger than Gemma3’s: Korean faces received 1.204 points lower arousal from LLaMA (compared to Gemma3’s 0.399 reduction), and LLaMA overestimated Black faces’ arousal by 0.50 points.
Regarding gender bias, Gemma3 showed significant gender-valence bias (β = −0.332, p < .001), rating female faces 0.33 points more negatively on average, while LLaMA showed no significant gender-valence bias. The gender-arousal bias direction reversed between models: Gemma3 rated female faces as slightly higher arousal (+0.169, p = .020) while LLaMA rated them as lower arousal (−0.465, p < .001).
At the intersection of race and emotion, Gemma3 showed a 2.7-fold accuracy gap for angry classification between Black faces (61.3%) and Korean faces (22.5%), directionally consistent with the “angry Black man” stereotype documented in human social cognition (Hugenberg & Bodenhausen, 2003). Disgust showed the reverse pattern (Korean 95.0% accuracy exceeding Black 75.0%), revealing that racial bias is selectively activated for specific race-emotion combinations rather than operating uniformly.
Figure 4 summarizes the model-specific demographic bias profiles, showing that Gemma3 and LLaMA exhibit biases in different dimensions and directions.
```mermaid
flowchart TB
    subgraph Gemma3["Gemma3-4B-IT Bias Profile"]
        direction TB
        G1["Gender-Valence Bias<br>Female faces rated 0.33 pts<br>MORE NEGATIVE<br>beta = -0.332, p < .001"]
        G2["Gender-Arousal Bias<br>Female faces rated 0.17 pts<br>HIGHER arousal<br>p = .020"]
        G3["Race-Valence Bias<br>NOT significant"]
        G4["Race-Arousal Bias<br>Korean -0.40 pts<br>moderate effect"]
    end
    subgraph LLaMA["LLaMA-3.2-11B Bias Profile"]
        direction TB
        L1["Gender-Valence Bias<br>NOT significant"]
        L2["Gender-Arousal Bias<br>Female faces rated 0.47 pts<br>LOWER arousal<br>p < .001"]
        L3["Race-Valence Bias<br>Korean +0.32 pts<br>p = .009"]
        L4["Race-Arousal Bias<br>Korean -1.20 pts<br>3x larger than Gemma3"]
    end
    G1 -.-|"OPPOSITE<br>direction"| L1
    G2 -.-|"REVERSED"| L2
    G4 -.-|"3x LARGER<br>in LLaMA"| L4
    style Gemma3 fill:#e8f5e9,stroke:#388e3c
    style LLaMA fill:#e3f2fd,stroke:#1565c0
    style G1 fill:#ffcdd2,stroke:#c62828
    style G2 fill:#fff9c4,stroke:#f9a825
    style G3 fill:#e8f5e9,stroke:#388e3c
    style G4 fill:#fff9c4,stroke:#f9a825
    style L1 fill:#e8f5e9,stroke:#388e3c
    style L2 fill:#ffcdd2,stroke:#c62828
    style L3 fill:#ffcdd2,stroke:#c62828
    style L4 fill:#ffcdd2,stroke:#c62828
```
Figure 4. Model-specific demographic bias profiles. Green boxes indicate non-significant bias; red boxes indicate significant bias in a potentially harmful direction; yellow boxes indicate moderate effects. Dashed lines connect corresponding bias dimensions between models, highlighting directional reversals and magnitude differences.
5. Discussion
5.1 Stereotyped Responding: Prototype Lookup vs. Per-Image Discrimination
The most fundamental finding is that VLMs perform emotion-category prototype lookup rather than genuine per-image perceptual discrimination, producing 1–6 fixed valence-arousal values per emotion category regardless of the specific facial expression shown. This dimensional collapse likely arises from the discrete token generation architecture of VLMs, which must select specific integer tokens from their vocabulary. In contrast, FER regression heads produce continuous outputs through dedicated prediction layers. VLMs can reproduce average emotion prototypes, and their rank ordering of emotions along valence and arousal dimensions is largely correct. However, they fail to capture the within-category intensity gradients that distinguish mild irritation from intense rage.
This finding has direct implications for the emerging practice of using VLMs as proxy annotators for emotion data at scale (Zhang et al., 2024). VLM-generated emotion labels carry systematic distortions — compressed variance and fixed prototypes — that would propagate through any downstream training pipeline. While VLMs may serve as rough screening tools for categorical emotion classification, they cannot substitute for human raters in research contexts where individual stimulus variation matters, such as norm development for emotion databases or calibration of therapeutic interventions.
5.2 Polarity Exaggeration Bias
Both VLMs systematically amplify the valence extremity of emotions, with standard deviations 1.87 to 2.17 times larger than human ratings. This polarity exaggeration bias likely originates from VLMs’ pretraining corpora, where emotional language tends toward hyperbole — descriptions of angry faces as “furious” rather than “slightly annoyed.” The larger LLaMA (11B) shows stronger polarity exaggeration than the smaller Gemma3 (4B), with angry valence bias of −2.05 compared to −1.05. This pattern is consistent with the possibility that increased model capacity amplifies rather than refines emotion stereotypes when pretraining data does not proportionally increase in emotional nuance, though the two models also differ in architecture and training data, preventing a clean causal attribution to scale alone.
The consistency of polarity exaggeration across emotions and models suggests a practical mitigation path. Post-hoc linear calibration per emotion category could substantially reduce absolute errors while preserving the high rank-order correlation. For example, a simple affine transformation mapping VLM output distributions to human output distributions per emotion category would correct for both the mean shift and the variance inflation, potentially bringing VLM MAE to within FER-model range without retraining.
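A minimal sketch of the proposed per-emotion affine calibration, on synthetic data with an injected mean shift and variance inflation. (In practice the map would be fit on a held-out calibration split rather than the evaluation data; the distributions here are assumptions for illustration.)

```python
import numpy as np

def fit_affine(vlm, human):
    """Fit a, b so that a * vlm + b matches the human mean and SD."""
    a = human.std() / vlm.std()
    b = human.mean() - a * vlm.mean()
    return a, b

rng = np.random.default_rng(2)
human = rng.normal(3.5, 1.4, 300)  # hypothetical human valence for one emotion
# Exaggerated VLM ratings: stretched around the midpoint plus a negative shift
vlm = 5 + 2.0 * (human - 5) - 1.0 + rng.normal(0, 0.2, 300)

a, b = fit_affine(vlm, human)
mae_before = np.abs(vlm - human).mean()
mae_after = np.abs(a * vlm + b - human).mean()
print(f"MAE before: {mae_before:.2f}, after: {mae_after:.2f}")
```

Because polarity exaggeration is well described by a mean shift plus variance inflation, matching the first two moments per emotion removes most of the absolute error while leaving the rank-order correlation untouched.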
5.3 The Sadness Paradox
Sadness is the worst-classified emotion for both VLMs (Gemma3 F1 = 0.223, LLaMA F1 = 0.092) despite being reliably classified by FER models (PosterV2 F1 = 0.994). The sadness paradox is the finding that VLMs systematically fail to recognize sadness, the very emotion most critical for detecting distress. The dominant error pathway is neutral absorption: Gemma3 classifies 71.1% and LLaMA classifies 66.7% of sad images as neutral. This pattern suggests that VLMs treat sadness as the absence of emotion rather than as a distinct emotional state, qualitatively different from the angry-disgust confusions that reflect visual feature overlap in FER models.
The sadness paradox extends the arousal inversion identified in our prior work (Tae et al., under review), where FER models showed inverse arousal correlations for female sad faces. The present VLM data reveals a more fundamental failure: VLMs cannot detect sadness as a distinct category, let alone estimate its intensity. This poses critical risks for VLM deployment in mental health support and empathetic agent design. A system that cannot distinguish sadness from emotional neutrality will fundamentally fail at detecting distress — the very application domain where affective computing promises the greatest societal benefit (Pantic et al., 2005).
5.4 The Arousal Advantage of VLMs
VLMs outperform all five FER-specialized models on arousal prediction (r = .759–.783 vs. .126–.448), the most unexpected finding of this study. We hypothesize that this advantage arises from language-mediated reasoning: VLMs can leverage their language model’s conceptual understanding of emotional intensity, encoded through pretraining on phrases such as “calm,” “agitated,” and “excited,” to estimate arousal. In contrast, FER models must learn arousal mapping purely from visual features and sparse continuous annotations. This finding, combined with FER models’ valence advantage, suggests that hybrid systems combining FER classification heads with VLM-based intensity estimation could outperform either architecture alone, a design recommendation for next-generation affective computing systems.
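The hybrid design suggested here amounts to routing each output to the stronger model family. The sketch below illustrates the routing only; every interface and function name is a hypothetical placeholder, not a real model API:

```python
from dataclasses import dataclass

@dataclass
class EmotionEstimate:
    label: str       # categorical emotion (FER strength)
    valence: float   # dimensional valence (FER strength)
    arousal: float   # dimensional arousal (VLM strength)

def hybrid_estimate(image, fer_model, vlm_arousal):
    """Combine FER category/valence with VLM arousal (hypothetical interfaces)."""
    label, valence = fer_model(image)    # placeholder FER call
    arousal = vlm_arousal(image)         # placeholder VLM arousal query
    return EmotionEstimate(label, valence, arousal)

# Toy stand-ins for the two model families
est = hybrid_estimate(
    image=None,
    fer_model=lambda img: ("sad", 3.2),
    vlm_arousal=lambda img: 4.1,
)
print(est)  # EmotionEstimate(label='sad', valence=3.2, arousal=4.1)
```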
5.5 Model-Specific Demographic Biases
The most consequential finding for deployment decisions is that VLM demographic biases are model-specific in direction, magnitude, and affected dimension. Gemma3 shows gender-valence bias (β = −0.332) while LLaMA shows race-arousal bias three times larger than Gemma3’s. Gemma3 rates female faces as slightly higher arousal (+0.169) while LLaMA rates them as lower (−0.465). This heterogeneity means that no single bias audit can generalize across VLMs, and each deployment context requires individual evaluation against the specific populations and emotions involved. The emotion-selective nature of racial bias — where Gemma3’s angry classification accuracy for Black faces (61.3%) is 2.7 times that for Korean faces (22.5%) — echoes the “angry Black man” stereotype documented in race-emotion perception research (Hugenberg & Bodenhausen, 2003), but the bias reverses for disgust (Korean 95.0% exceeding Black 75.0%), revealing that racial effects operate through emotion-specific pathways rather than uniform racial preferences.
5.6 Limitations
Several limitations constrain the generalizability of these findings. First, our human participants were exclusively Korean adults, potentially introducing cultural biases in the baseline against which VLMs are evaluated. Cross-cultural replication with diverse rater populations is needed to establish whether the observed patterns are universal or culturally specific. Second, we tested only two open-source VLMs at the 4B–11B scale; extending to larger models (70B and above) and frontier APIs (GPT-4o, Claude, Gemini) would reveal whether stereotyped responding and polarity exaggeration persist across the model capability spectrum. Third, our stimuli are static, single-emotion images, whereas real-world emotion recognition typically involves dynamic, multi-modal, and mixed-emotion stimuli. Fourth, the context-carry prompting strategy introduces structural dependencies (error propagation from classification to dimensional ratings) that may not be present in alternative prompting approaches such as single-shot integrated prompting. Fifth, the 4-bit quantization used for edge deployment may affect model behavior compared to full-precision inference.
6. Conclusion
This study provides the first psychometric comparison of VLM and human emotion ratings using a fully factorial stimulus design, establishing that Vision Language Models achieve moderate-to-substantial categorical agreement (κ = 0.535–0.671) but exhibit qualitatively distinct biases — stereotyped responding, polarity exaggeration, and the sadness paradox — that distinguish them from both human raters and FER-specialized models.
Three key findings emerge with robust implications. First, VLMs perform categorical prototype lookup rather than per-image perceptual discrimination, producing near-zero variance within emotion categories. This dimensional collapse means VLMs cannot currently substitute for human raters in research contexts where stimulus-level variation matters. Second, a strength inversion exists between model families: FER models dominate classification (κ = 0.665–0.878) and valence (r = .928–.950), while VLMs dominate arousal (r = .759–.783 vs. .126–.448), suggesting complementary architectural advantages that could be exploited in hybrid systems. Third, demographic biases are model-specific in direction, magnitude, and affected dimension, requiring per-model audits rather than generalized “VLM bias” characterizations. As VLMs increasingly mediate human-computer interaction in emotionally sensitive contexts — from mental health chatbots to affective tutoring systems — the gap between their emotion perception and human psychological benchmarks demands both rigorous measurement, which this psychometric framework provides, and transparent reporting of model-specific limitations and biases. Future work should extend this framework to larger VLMs, frontier API models, dynamic video stimuli, and culturally diverse rater populations, while investigating whether fine-tuning on dimensionally annotated emotion data can mitigate the stereotyped responding and polarity exaggeration identified here.
References
Alrasheed, H., Alghihab, A., Pentland, A., & Alghowinem, S. (2025). Evaluating the capacity of large language models to interpret emotions in images. PLOS ONE, 20(6), e0324127.
Bates, D., Mächler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1–48.
Baudouin, J.-Y., Gallian, F., Pinoit, J.-M., & Damon, F. (2025). Arousal, valence, and discrete categories in facial emotion. Scientific Reports, 15(1), 40268.
Dominguez-Catena, I., Paternain, D., & Galar, M. (2024). Less can be more: Representational vs. stereotypical gender bias in facial expression recognition. Progress in Artificial Intelligence, 13, 255–273.
Harb, E., et al. (2025). Evaluating the performance of general purpose large language models in identifying human facial emotions. npj Digital Medicine, 8.
Hess, U., Adams, R. B., Jr., & Kleck, R. E. (2004). Facial appearance, gender, and emotion expression. Emotion, 4(4), 378–388.
Hugenberg, K., & Bodenhausen, G. V. (2003). Facing prejudice: Implicit prejudice and the perception of facial threat. Psychological Science, 14(6), 640–643.
Jankowiak, P., et al. (2024). Metrics for dataset demographic bias: A case study on facial expression recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(8), 5520–5536.
Khare, S. K., Blanes-Vidal, V., Nadimi, E. S., & Acharya, U. R. (2024). Emotion recognition and artificial intelligence: A systematic review (2014–2023). Information Fusion, 102, 102019.
Mollahosseini, A., Hasani, B., & Mahoor, M. H. (2017). AffectNet: A database for facial expression, valence, and arousal computing in the wild. IEEE Transactions on Affective Computing, 10(1), 18–31.
Mulukutla, V. K., Pavarala, S. S., Rudraraju, S. R., & Bonthu, S. (2025). Evaluating open-source vision language models for facial emotion recognition against traditional deep learning models. arXiv preprint arXiv:2508.13524.
Pantic, M., Sebe, N., Cohn, J. F., & Huang, T. (2005). Affective multimodal human-computer interaction. In Proceedings of the 13th ACM International Conference on Multimedia (pp. 669–676).
Refoua, S., Elyoseph, Z., Piterman, H., et al. (2026). Evaluation of cross-ethnic emotion recognition capabilities in multimodal large language models using the reading the mind in the eyes test. Scientific Reports, 16.
Russell, J. A. (1980). A circumplex model of affect. Journal of Personality and Social Psychology, 39(6), 1161–1178.
Tak, A. N., & Gratch, J. (2024). GPT-4 emulates average-human emotional cognition from a third-person perspective. In Proceedings of the 12th International Conference on Affective Computing and Intelligent Interaction (ACII).
Telceken, M., Akgun, D., Kacar, S., Yesin, K., & Yildiz, M. (2025). Can artificial intelligence understand our emotions? Deep learning applications with face recognition. Current Psychology, 44(9), 7946–7956.
Zhang, Y., Yang, X., Xu, X., et al. (2024). Affective computing in the era of large language models: A survey from the NLP perspective. arXiv preprint arXiv:2408.04638.