Do Vision Language Models See Emotions Like Humans? Comparing Human and VLM Emotion Ratings on AI-Generated Facial Stimuli with Demographic Bias Analysis

Authors: Jini Tae, Ju-Hyeon Park, Wonil Choi

Affiliation: Gwangju Institute of Science and Technology (GIST), South Korea


Abstract

Vision Language Models (VLMs) are increasingly deployed in affective computing applications, yet their alignment with human emotion perception remains poorly understood beyond categorical accuracy metrics. This study compares emotion ratings from 1,000 human participants with two instruction-tuned VLMs—Gemma3-4B-IT (Google) and LLaMA-3.2-11B-Vision (Meta)—on 1,440 AI-generated facial images balanced across three races (Black, Caucasian, Korean), two genders, and six basic emotions. Using a psychometric framework that treats VLMs as additional raters, we evaluate categorical agreement (Cohen’s κ), dimensional alignment (valence and arousal via Pearson correlation, MAE, and Bland-Altman analysis), and demographic bias (mixed-effects models) against human inter-rater reliability as a ceiling benchmark. Results reveal that both VLMs achieve moderate-to-substantial categorical agreement (κ = 0.535–0.671) but exhibit stereotyped responding—producing only 1–6 unique values per emotion category with near-zero variance—indicating prototype lookup rather than per-image perceptual discrimination. Valence correlations are high (r = .891–.901) but absolute errors are large (MAE = 1.46–1.81) due to polarity exaggeration bias, where VLMs rate negative emotions as more negative and positive emotions as more positive than humans. Arousal predictions, surprisingly, outperform all five FER-specialized baseline models (r = .759–.783 vs. .126–.448), suggesting that language-mediated reasoning confers an advantage for intensity estimation. Demographic bias patterns are model-specific: Gemma3 shows gender-valence bias while LLaMA shows race-arousal bias three times larger. We additionally compare both VLMs against five FER-specialized models (PosterV2, MobileViT, EfficientNet, BEiT, EmoNet) on the same stimuli, revealing complementary strengths: FER models dominate classification and valence, while VLMs dominate arousal. 
These findings demonstrate that VLM emotion ratings cannot substitute for human judgments and that bias audits must be conducted per-model.

Keywords: Vision Language Models, Facial Emotion Recognition, Psychometric Agreement, Valence-Arousal, Demographic Bias, AI-Generated Faces, Affective Computing


1. Introduction

1.1 Affective Computing and the Promise of VLMs

The deployment of affective computing systems—from mental health chatbots to responsive virtual assistants—increasingly depends on accurate automatic emotion recognition from facial expressions. The efficacy of such systems hinges on affective alignment: the degree to which a machine’s interpretation of emotional cues matches human psychological standards (Pantic et al., 2005). If an empathetic agent misinterprets the intensity of a user’s distress, it risks breaking user trust and failing to sustain meaningful interaction.

Vision Language Models (VLMs) represent a paradigm shift from task-specific facial expression recognition (FER) models to general-purpose multimodal systems. A VLM integrates a vision encoder with a large language model, enabling image-conditioned text generation through natural language prompting. Unlike FER-specialized models that are trained end-to-end on emotion-labeled datasets and output fixed emotion categories or continuous valence-arousal values, VLMs can flexibly produce both categorical and dimensional emotion ratings through instruction prompting—a capability that mirrors the integrated judgment process humans naturally employ. This flexibility raises the possibility that VLMs might serve as scalable substitutes for costly human emotion annotation, where collecting 72,000 responses from 1,000 raters represents a significant time and financial investment.

The Circumplex Model of Affect (Russell, 1980) maps all emotional experiences onto a continuous two-dimensional space defined by valence (the hedonic quality ranging from unpleasant to pleasant) and arousal (the degree of physiological activation ranging from calm to excited). This dimensional framework provides a richer representational vocabulary than categorical classification alone, enabling detection of subtle perceptual misalignments that discrete labels would obscure. For instance, two systems might both correctly classify an expression as “angry,” yet differ substantially in how intense (arousal) or how negative (valence) they perceive that anger to be. Despite the theoretical importance of dimensional ratings, computational evaluations of emotion recognition have overwhelmingly focused on discrete category accuracy (Khare et al., 2024; Telceken et al., 2025).

1.2 The Evaluation Gap

Current evaluations of VLM emotion recognition suffer from four critical limitations that this study addresses.

First, existing benchmarks rely on accuracy and F1 scores against ground-truth labels, but these metrics ignore the substantial disagreement among human raters themselves. Human emotion perception is inherently variable—particularly for arousal, where inter-rater reliability can be as low as α = 0.125 (present study)—and any meaningful evaluation must interpret model error relative to this human variability. Without establishing human inter-rater reliability as a performance ceiling, it is impossible to determine whether a model’s errors reflect genuine failure or simply mirror the inherent subjectivity of emotion perception.

Second, prior studies have focused almost exclusively on categorical accuracy, neglecting the continuous dimensional ratings (valence, arousal) that are central to affective science and necessary for detecting subtle perceptual biases. A model may achieve perfect categorical accuracy while producing dimensional ratings that are systematically distorted, a dissociation we demonstrate empirically.

Third, while demographic disparities have been documented in commercial FER APIs (Rhue, 2018; Jankowiak et al., 2024), systematic bias analysis of open-source VLMs across race-gender-emotion intersections remains absent. This gap is particularly concerning given the rapid adoption of open-source VLMs in research and applied settings where fairness guarantees are critical.

Fourth, prior studies comparing human and AI emotion perception have predominantly used FER-specialized models—lightweight architectures with millions of parameters (e.g., MobileViT ~6M, EfficientNet ~5M) trained exclusively on emotion-labeled datasets such as AffectNet (Mollahosseini et al., 2017). While these models achieve high classification accuracy, they neither represent the capabilities of modern foundation models (with billions of parameters trained on internet-scale multimodal data) nor support the integrated categorical-plus-dimensional ratings that humans naturally produce. A reviewer critique of our prior work (Tae et al., under review) directly challenged the representativeness of FER-specialized models as proxies for “AI,” motivating the transition to VLMs in the present study.

1.3 Contributions

This paper makes five contributions to the intersection of affective computing, cognitive psychology, and multimodal AI evaluation.

First, we introduce a VLM-as-rater psychometric framework that treats VLMs as additional participants in a human rating paradigm. Rather than evaluating VLMs against ground-truth labels using accuracy/F1, we employ Intraclass Correlation Coefficients (ICC), Cohen’s κ, Krippendorff’s α, and Bland-Altman analysis to quantify agreement against human inter-rater reliability as an empirical ceiling. This framework reveals dimensions of VLM behavior—stereotyped responding, polarity exaggeration, dimensional collapse—that accuracy-based evaluations entirely miss.

Second, we present the first systematic demographic bias analysis of open-source VLMs using a fully crossed 3 (race: Black, Caucasian, Korean) × 2 (gender: Male, Female) × 6 (emotion) factorial stimulus design. The use of 1,440 AI-generated face images ensures perfect experimental control (identical backgrounds, lighting, and identity consistency across emotion conditions), while mixed-effects models with crossed random effects separate systematic model bias from image-level noise.

Third, we discover stereotyped responding—a phenomenon where VLMs produce only 1–6 unique valence-arousal values per emotion category (e.g., neutral valence SD = 0.00 for LLaMA), indicating categorical prototype lookup rather than per-image intensity discrimination. This behavior is qualitatively distinct from both human variability and FER model behavior.

Fourth, we perform a dual comparison of both VLMs and five FER-specialized models against the same human baseline (N = 1,000), revealing a striking strength inversion: FER models dominate valence prediction while VLMs dominate arousal prediction, suggesting complementary architectural advantages.

Fifth, we identify model-specific demographic bias profiles where Gemma3 and LLaMA show biases in different dimensions, for different demographics, and in different directions—establishing that no single bias audit can generalize across VLMs.

Our research questions are as follows. RQ1: How do VLM emotion ratings compare to human inter-rater reliability on categorical and dimensional measures? RQ2: Do VLMs exhibit systematic demographic biases in emotion attribution, and are these biases model-specific? RQ3: How do VLMs compare to FER-specialized models in classification accuracy, dimensional prediction, and bias profiles?


2. Related Work

2.1 VLMs for Emotion Recognition

The application of Vision Language Models to facial emotion recognition has emerged as a natural extension of their demonstrated competence in visual question answering. Recent evaluations, however, reveal mixed results. Mulukutla et al. (2025) conducted the first empirical comparison of open-source VLMs (CLIP, Phi-3.5 Vision) against traditional deep learning models on FER-2013, finding that traditional models—EfficientNet-B0 (86.44%) and ResNet-50 (85.72%)—significantly outperform VLMs (CLIP 64.07%, Phi-3.5 Vision 51.66%). This suggests that VLMs’ general visual understanding does not automatically translate to FER proficiency, particularly on low-resolution grayscale images.

In the domain of frontier API models, evaluations of GPT-4o, Gemini 2.0, and Claude 3.5 Sonnet on the NimStim dataset show that GPT-4o and Gemini match or exceed human performance for calm/neutral and surprise expressions, though performance degrades for more ambiguous emotions (npj Digital Medicine, 2025). Refoua et al. (2026) evaluated ChatGPT-4, ChatGPT-4o, and Claude 3 Opus on the Reading the Mind in the Eyes Test (RMET) with White, Black, and Korean face stimuli, finding that ChatGPT-4o achieved cross-ethnically consistent performance. These studies focus on frontier (closed) models with hundreds of billions of parameters, whereas the present study evaluates open-source models at the 4B–11B scale that are accessible for research deployment.

Specialized VLM frameworks for FER have also emerged, including FACET-VLM (2025), which integrates multiview facial representation learning with semantic guidance from language prompts and achieves up to 99.41% on BU-4DFE. However, these fine-tuned models sacrifice the generality that makes VLMs attractive as versatile emotion annotators.

2.2 Human-AI Comparison in Emotion Perception

The psychometric comparison of human and machine raters has a long tradition in clinical psychology, where ICC and Bland-Altman analysis serve as standard tools for assessing measurement agreement. In affective computing, Tak and Gratch (2024) found that GPT-4 emulates average-human emotional cognition from a third-person perspective, with its interpretations aligning more closely with human judgments about others’ emotions than with self-assessments. A PLOS ONE study (2025) evaluated GPT-4’s capacity to interpret emotions from images, reporting correlations between its numeric responses and human ratings of r = 0.87 for valence and r = 0.72 for arousal on the Geneva Affective Picture Database (GAPED) under zero-shot conditions. These results establish that large language models can approximate human emotion perception, though the extent of this approximation varies substantially across emotional dimensions.

Zhang et al. (2024) provide a comprehensive survey of affective computing in the era of LLMs, noting a paradigm shift from fine-tuned pre-trained language models to in-context learning approaches. They identify that while LLMs excel at affective understanding tasks (sentiment classification, emotion detection), their performance on dimensional emotion estimation remains underexplored.

Critically, prior human-AI comparisons in emotion perception have typically used either (a) FER-specialized models with limited dimensionality or (b) frontier API models without transparent access to model internals. The present study bridges this gap by evaluating open-source VLMs that produce integrated categorical-plus-dimensional ratings through a psychometric framework anchored to large-scale human data (N = 1,000).

2.3 Demographic Bias in FER

Documented racial and gender disparities in automated affect recognition have raised significant fairness concerns. Jankowiak et al. (2024) proposed formal metrics for measuring dataset demographic bias in FER, demonstrating that imbalanced training data composition propagates into systematic performance disparities across demographic groups. Gender bias in FER manifests in two forms: representational bias (unequal demographic representation in training data) and stereotypical bias (systematic associations between emotions and demographics, such as linking female faces with sadness and male faces with anger) (Springer PRAI, 2024).

Human emotion perception itself is not demographically neutral. Gender-emotion stereotypes lead observers to associate male faces with dominance-related emotions (anger) and female faces with prosocial emotions (happiness, sadness) (Hess et al., 2004). These biases in human annotation propagate into training datasets—AffectNet (Mollahosseini et al., 2017) relies on sparse annotation (N ≈ 12 per image)—and may be amplified by algorithmic optimization.

The present study extends bias analysis from commercial APIs and training datasets to open-source VLMs, using a factorial experimental design that enables orthogonal estimation of race, gender, and emotion effects through mixed-effects modeling.

2.4 AI-Generated Stimuli in Emotion Research

Traditional face databases used in emotion research (KDEF, ADFES, FER-2013, AffectNet) suffer from several methodological limitations. Real-face databases rely on actors performing emotional expressions, introducing individual variation in expression quality and intensity. Lighting, background, hairstyle, and makeup vary across stimuli, creating confounds that compromise internal validity. Demographic balance is difficult to achieve, with most databases overrepresenting certain racial groups.

AI-generated face stimuli address these limitations through a controlled generation pipeline. The GIST-AIFaceDB used in this study generates neutral base faces with standardized features (identical gray backgrounds, navy t-shirts, front-facing pose), then transforms each neutral face into five emotional expressions while preserving identity. This pipeline ensures that any differences between emotional expressions for a given identity are attributable solely to the emotion manipulation, not to extraneous visual factors.

The ecological validity of AI-generated stimuli is supported by human naturalness ratings: in our dataset, average naturalness ranged from 5.26 (fear) to 6.94 (happy) on a 9-point scale, indicating that participants perceived the stimuli as moderately to highly realistic despite knowing they were AI-generated. Baudouin et al. (2025) provide supporting evidence that dimensional ratings (valence, arousal) can be reliably collected from facial stimuli regardless of their provenance, suggesting that AI-generated faces elicit comparable affective responses to real faces.


3. Methodology

3.1 Stimuli

The stimulus set comprises 1,440 AI-generated facial images from the GIST AI-Generated Face Database (GIST-AIFaceDB, under review). The generation pipeline employed a two-step process. In the first step, 240 neutral base faces were generated using the STOIQO NewReality Flux model deployed on the OpenArt platform. These neutral faces depicted diverse virtual identities wearing standardized navy t-shirts against gray backgrounds, with generation prompts specifying age diversity, hairstyle variation, and demographic characteristics. In the second step, each neutral face was transformed into five additional emotional expressions (angry, disgusted, fearful, happy, sad) using Nano-Banana, an advanced image-editing model implemented in Google AI Studio (Gemini 2.5 Flash Image), which modifies facial expressions while preserving the identity, lighting, and background of the original image.

The resulting fully crossed factorial design—3 (race: Black, Caucasian, Korean) × 2 (gender: Male, Female) × 6 (emotion: angry, disgust, fear, happy, sad, neutral) × 40 (identity)—yields balanced cell sizes of 240 images per emotion, 480 per race, 720 per gender, and 80 per race-gender-emotion combination, enabling orthogonal estimation of all demographic effects. Image file extensions included .jpeg (1,149 files), .jpg (211), and .JPEG (80), handled via case-insensitive matching.

3.2 Human Rating Procedure

The study protocol was reviewed and granted exemption by the Institutional Review Board (IRB). One thousand native Korean adults (500 female, 500 male; age M = 44.6, SD = 13.7, range 20–69) were recruited through an online platform, with recruitment strictly balanced across age cohorts and genders. The experiment was administered online via participants’ personal computers.

Each participant evaluated 72 images randomly selected from the total pool of 1,440, with every image presented in randomized order. Through this counterbalanced crossed design, each image received 50 independent ratings, yielding 72,000 total responses. The procedure consisted of two primary affective rating tasks. In the valence task, participants rated the emotional positivity or negativity of each facial expression on a 9-point Likert scale (1 = “extremely negative,” 9 = “extremely positive”). In the arousal task, they rated the level of emotional activation or intensity on a 9-point scale (1 = “not at all aroused,” 9 = “highly aroused”). Naturalness ratings (1 = “very unnatural,” 9 = “very natural”) were also collected.

Inter-rater reliability, computed as Krippendorff’s α (ordinal), established the following human ceiling: valence α = 0.471 (poor–fair), arousal α = 0.125 (poor), and naturalness α = 0.126 (poor). While these values appear low, they fall within the typical range for emotion rating studies and reflect the inherent subjectivity of affective perception, particularly for arousal. Mixed-effects variance decomposition confirmed that rater individual differences (σ² = 0.450 for valence, 0.696 for arousal) dominated image-level variance by 11× (valence) and 32× (arousal), confirming that low reliability is driven by rater heterogeneity rather than stimulus ambiguity.
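For illustration, Krippendorff’s α has the form α = 1 − D_o/D_e, the ratio of observed to expected disagreement. The sketch below implements the interval-level special case (squared-difference distance) from scratch; the study itself used the ordinal variant, which differs only in the distance metric:

```python
def krippendorff_alpha_interval(units):
    """Krippendorff's alpha (interval metric) for a list of rating
    lists, one list per rated image. 1.0 = perfect agreement,
    0.0 = chance-level agreement."""
    pairable = [u for u in units if len(u) >= 2]
    values = [v for u in pairable for v in u]
    n = len(values)

    # Observed disagreement: squared differences over ordered pairs
    # within each unit, normalized by (m_u - 1) and the total count n.
    d_obs = sum(
        sum((a - b) ** 2 for a in u for b in u) / (len(u) - 1)
        for u in pairable
    ) / n

    # Expected disagreement: squared differences over all ordered
    # pairs of pairable values, ignoring unit boundaries.
    d_exp = sum((a - b) ** 2 for a in values for b in values) / (n * (n - 1))

    return 1.0 - d_obs / d_exp


# Perfect within-image agreement yields alpha = 1.0.
assert krippendorff_alpha_interval([[3, 3], [7, 7]]) == 1.0
```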

3.3 VLM Inference

Two instruction-tuned VLMs were evaluated: Gemma3-4B-IT (Google, 4 billion parameters, QAT 4-bit quantized) and LLaMA-3.2-11B-Vision-Instruct (Meta, 11 billion parameters, 4-bit quantized). Both models were deployed on Apple Silicon (M1 Max, 32GB) via the MLX framework for GPU-accelerated inference without HTTP overhead.

Inference followed a three-step context-carry prompting strategy. In Step 1, the model classified the facial emotion from six forced-choice categories (happy, sad, angry, fear, disgust, neutral) via structured JSON output. In Step 2, the classified emotion was carried forward as context, and the model rated valence on a 1–9 scale. In Step 3, both the classified emotion and valence rating were carried forward, and the model rated arousal on a 1–9 scale. This sequential strategy mirrors anchoring effects in human sequential judgment while introducing structural error propagation: classification errors in Step 1 systematically influence subsequent valence and arousal ratings.
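The three-step context-carry loop can be sketched as follows. The prompt wording and the `query_model` callable are illustrative stand-ins, not the exact prompts or MLX inference code used in the study:

```python
import json

EMOTIONS = ["happy", "sad", "angry", "fear", "disgust", "neutral"]

def rate_image(image, query_model):
    """Three-step context-carry rating: classify, then rate valence
    with the classification as context, then rate arousal with both
    prior answers as context."""
    # Step 1: forced-choice classification, structured JSON output.
    p1 = (f"Classify the facial emotion as one of {EMOTIONS}. "
          'Answer as JSON: {"emotion": "..."}')
    emotion = json.loads(query_model(image, p1))["emotion"]

    # Step 2: valence, with the classification carried forward.
    p2 = (f"The face shows {emotion}. Rate its valence from 1 "
          "(extremely negative) to 9 (extremely positive). "
          'Answer as JSON: {"valence": N}')
    valence = json.loads(query_model(image, p2))["valence"]

    # Step 3: arousal, with emotion and valence both carried forward.
    p3 = (f"The face shows {emotion} with valence {valence}. Rate its "
          "arousal from 1 (not at all aroused) to 9 (highly aroused). "
          'Answer as JSON: {"arousal": N}')
    arousal = json.loads(query_model(image, p3))["arousal"]

    return {"emotion": emotion, "valence": valence, "arousal": arousal}
```

Because the Step 1 answer is interpolated into the Step 2 and Step 3 prompts, a misclassification propagates into both dimensional ratings, which is the structural error propagation noted above.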

Response parsing employed a cascade strategy: direct JSON parse → markdown fence stripping → regex fallback. Emotion labels were fuzzy-matched by their first three characters, and valence and arousal ratings were each clamped to [1, 9]. Gemma3 achieved 100% JSON parse success with a single invalid category output (0.07%, “doubt”), and LLaMA achieved comparable compliance. All 1,440 images were successfully processed by both models.
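A minimal sketch of this parsing cascade, assuming the JSON response formats described above (in the regex fallback, values come back as strings):

```python
import json
import re

EMOTIONS = ["happy", "sad", "angry", "fear", "disgust", "neutral"]

def clamp(x, lo=1, hi=9):
    """Clamp a valence or arousal rating onto the 1-9 scale."""
    return max(lo, min(hi, x))

def match_emotion(label):
    """Fuzzy-match a free-text label to a category by its first three
    characters; returns None for out-of-vocabulary outputs ("doubt")."""
    prefix = label.strip().lower()[:3]
    for emo in EMOTIONS:
        if emo.startswith(prefix):
            return emo
    return None

def parse_response(text):
    """Cascade: direct JSON parse -> fence stripping -> regex fallback."""
    try:
        return json.loads(text)                      # 1. direct parse
    except json.JSONDecodeError:
        pass
    stripped = re.sub(r"```(?:json)?", "", text).strip()
    try:
        return json.loads(stripped)                  # 2. fences removed
    except json.JSONDecodeError:
        pass
    m = re.search(r'"(\w+)"\s*:\s*"?([\w.]+)"?', text)
    return {m.group(1): m.group(2)} if m else None   # 3. regex fallback
```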

3.4 FER Baseline Models

For comparative analysis, five FER-specialized models were evaluated on the same 1,440 images: PosterV2 (Pyramid Transformer, classification only), MobileViT (lightweight Vision Transformer, classification + VA), EfficientNet-B0-8-VA-MTL (multi-task CNN, classification + VA), BEiT (BERT Image Transformer, classification only), and EmoNet (CNN, classification + VA). For the three VA-capable models, predictions in the native [-1, 1] range were normalized to [1, 9] to match the human rating scale using the formula v_norm = (v_raw + 1) / 2 × 8 + 1.
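The normalization is a direct affine map of [−1, 1] onto [1, 9]:

```python
def normalize_va(v_raw):
    """Affine map from a FER model's native [-1, 1] valence/arousal
    range onto the 1-9 scale used by human raters."""
    return (v_raw + 1) / 2 * 8 + 1

# Endpoints and midpoint map as expected.
assert normalize_va(-1.0) == 1.0
assert normalize_va(0.0) == 5.0
assert normalize_va(1.0) == 9.0
```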

3.5 Statistical Analysis

Categorical agreement was quantified via Cohen’s κ against intended emotion labels, with McNemar’s test for pairwise model comparisons. Dimensional alignment was assessed through Pearson correlation, Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and Bland-Altman analysis (systematic bias and 95% limits of agreement). Per-emotion bias significance was tested with Wilcoxon signed-rank tests, Bonferroni-corrected for 18 comparisons (6 emotions × 3 VA-capable models per model family).
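The Bland-Altman statistics reduce to the mean and standard deviation of the paired model−human differences. A minimal stdlib sketch:

```python
import statistics

def bland_altman(model, human):
    """Systematic bias and 95% limits of agreement for paired ratings."""
    diffs = [m - h for m, h in zip(model, human)]
    bias = statistics.fmean(diffs)       # mean difference
    sd = statistics.stdev(diffs)         # SD of differences
    return bias, (bias - 1.96 * sd, bias + 1.96 * sd)

def mae(model, human):
    """Mean absolute error between paired rating lists."""
    return statistics.fmean(abs(m - h) for m, h in zip(model, human))
```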

Bias decomposition employed linear mixed-effects models (LMM) fitted via R’s lme4 package (Bates et al., 2015) with Satterthwaite degrees of freedom (lmerTest). The emotion-bias model used the formula: rating ~ rater_type * emotion + (1|image_id), where rater_type distinguishes human aggregate from VLM ratings and image_id is a crossed random effect controlling for between-image variability. Demographic bias models used analogous formulas with actor_race and actor_gender as fixed effects.


4. Results

4.1 Emotion Classification

Table 1 presents the overall classification performance of both VLMs alongside five FER-specialized baselines.

Table 1. Overall emotion classification performance (N = 1,440).

Model          Type   Accuracy   Macro F1   Cohen’s κ
PosterV2       FER    0.899      0.900      0.878
MobileViT      FER    0.875      0.874      0.848
EfficientNet   FER    0.854      0.856      0.823
BEiT           FER    0.766      0.772      0.713
Gemma3-4B      VLM    0.726      0.683      0.671
EmoNet         FER    0.731      0.724      0.665
LLaMA-3.2-11B  VLM    0.613      0.402      0.535

Gemma3-4B-IT achieved Cohen’s κ = 0.671 (substantial agreement), outperforming EmoNet (κ = 0.665) and approaching BEiT (κ = 0.713). LLaMA-3.2-11B-Vision achieved κ = 0.535 (moderate agreement), below all FER baselines. Notably, the larger LLaMA (11B) performed worse than the smaller Gemma3 (4B), demonstrating that model scale does not guarantee improved emotion recognition.

Table 2 presents emotion-specific classification accuracy, revealing extreme performance polarization.

Table 2. Emotion-specific classification accuracy (proportion correct).

Emotion   Gemma3   LLaMA   PosterV2   MobileViT   EfficientNet   BEiT    EmoNet
Happy     1.000    1.000   1.000      1.000       1.000          0.979   1.000
Neutral   1.000    1.000   0.912      0.863       0.729          0.529   0.533
Fear      0.979    0.654   0.933      0.942       0.846          0.792   0.912
Disgust   0.842    0.008   0.642      0.533       0.679          0.754   0.846
Angry     0.404    0.921   0.917      0.954       0.887          0.800   0.637
Sad       0.126    0.092   0.992      0.958       0.983          0.742   0.454

Both VLMs perfectly classified happy and neutral but failed dramatically on sadness (Gemma3 F1 = 0.223, LLaMA F1 = 0.092). The two VLMs exhibit complementary error profiles: LLaMA excels at angry (92.1%) where Gemma3 struggles (40.4%), while Gemma3 excels at disgust (84.2%) where LLaMA fails completely (0.8%). The dominant error pathways are qualitatively distinct: Gemma3 exhibits neutral absorption (71.1% of sad images classified as neutral), while LLaMA exhibits angry merger (99.2% of disgust images classified as angry).

4.2 Valence Comparison

Table 3. Valence prediction summary statistics.

Model          Type   Pearson r   MAE     Model M (SD)   Human M (SD)
MobileViT      FER    .950        0.916   4.18 (2.35)    4.60 (1.42)
EfficientNet   FER    .940        1.063   4.05 (2.57)    4.60 (1.42)
EmoNet         FER    .928        0.795   4.32 (2.00)    4.60 (1.42)
LLaMA-3.2-11B  VLM    .901        1.808   3.71 (3.08)    4.60 (1.42)
Gemma3-4B      VLM    .891        1.456   4.31 (2.65)    4.60 (1.42)

Both VLMs achieve high valence correlations (r = .891–.901), approaching but not matching FER models (r = .928–.950). However, absolute errors are substantially larger (VLM MAE = 1.46–1.81 vs. FER MAE = 0.80–1.06), reflecting a pattern of “correct rank ordering but distorted scale usage.”

The source of this distortion is polarity exaggeration bias. Gemma3’s valence SD (2.65) is 1.87× the human SD (1.42), and LLaMA’s SD (3.08) is 2.17× human SD. Both VLMs systematically rate negative emotions more negatively and positive emotions more positively than humans.

Table 4. Per-emotion valence bias (VLM − Human mean).

Emotion   Gemma3 Bias   LLaMA Bias   EmoNet Bias   MobileViT Bias   EfficientNet Bias
Fear      −1.99         −2.68        +0.40         −0.14            −0.62
Disgust   −1.39         −2.25        −1.35         −0.78            −0.97
Angry     −1.06         −2.04        −0.64         −1.01            −0.79
Happy     +1.26         +1.58        +0.76         +1.01            +1.03
Neutral   +1.05         −0.28        +0.04         −0.09            +0.01
Sad       +0.38         +0.53        −0.89         −1.51            −1.95

LLaMA’s negative-emotion valence bias (−2.04 to −2.68) is approximately double Gemma3’s (−1.06 to −1.99), indicating that increased model scale amplifies rather than reduces polarity exaggeration. Mixed-effects models confirmed all per-emotion biases as statistically significant (p < .001), with LLaMA’s angry bias (β = −2.050) approximately double Gemma3’s (β = −1.053).

4.3 Arousal Comparison

Table 5. Arousal prediction summary statistics.

Model          Type   Pearson r   MAE     Model M (SD)   Human M (SD)
LLaMA-3.2-11B  VLM    .783        1.777   5.36 (2.42)    5.61 (0.60)
Gemma3-4B      VLM    .759        1.137   5.49 (1.74)    5.61 (0.60)
EfficientNet   FER    .448        1.696   6.53 (2.33)    5.61 (0.60)
MobileViT      FER    .409        1.864   6.68 (2.61)    5.61 (0.60)
EmoNet         FER    .126        1.369   6.48 (1.56)    5.61 (0.60)

A striking strength inversion emerges: VLMs substantially outperform all FER-specialized models on arousal prediction (r = .759–.783 vs. .126–.448), suggesting that language-mediated reasoning about emotional intensity confers a structural advantage for arousal estimation. Gemma3 additionally achieves the lowest arousal MAE (1.137) among all seven models.

Table 6. Per-emotion arousal bias (VLM − Human mean), with LMM significance.

Emotion   Gemma3 Bias   LMM p    LLaMA Bias   LMM p
Fear      +1.30         < .001   +1.21        < .001
Happy     +0.30         .442     +2.39        < .001
Angry     +0.24         < .001   −0.50        < .001
Disgust   +0.42         .026     −0.57        .517
Sad       −1.04         < .001   −2.10        < .001
Neutral   −1.90         < .001   −1.91        < .001

The most striking between-model difference is happy arousal: Gemma3’s bias (+0.30) is non-significant in the LMM (p = .442), while LLaMA rates happy arousal at 8.87 (human mean: 6.48), yielding a +2.39 overestimation (p < .001). This reflects LLaMA’s extreme “happiness = maximal excitement” prototype. Both VLMs severely underestimate neutral arousal (−1.90 to −1.91) and sad arousal (−1.04 to −2.10), revealing a systematic tendency to associate low visual salience with minimal arousal.

4.4 Stereotyped Responding and Dimensional Collapse

Table 7. Standard deviation of valence (V) and arousal (A) ratings by emotion.

Emotion   Gemma3 V SD   LLaMA V SD   Human V SD   Gemma3 A SD   LLaMA A SD   Human A SD
Happy     0.48          0.13         1.31         0.66          0.72         1.57
Neutral   0.64          0.00         1.08         0.44          0.28         1.71
Fear      0.16          0.50         1.61         0.47          1.86         1.52
Angry     0.80          1.05         1.55         0.49          1.21         1.51
Sad       1.02          1.13         1.44         1.03          0.35         1.53
Disgust   0.39          0.82         1.54         0.49          1.55         1.51

LLaMA’s neutral valence SD = 0.00 means that all 240 neutral images received the identical value (5), with zero per-image discrimination. Across all emotions, VLM valence SDs (0.00–1.13) are dramatically lower than human SDs (1.08–1.61), confirming that VLMs perform prototype lookup rather than genuine per-image discrimination. This dimensional collapse represents a qualitatively different behavior from both human raters (who exhibit genuine individual variation) and FER models (which produce continuous distributions through regression heads).
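Stereotyped responding of this kind can be detected directly from a model’s output distribution. The sketch below (the input format is illustrative) groups ratings by emotion category and reports unique-value counts and standard deviations:

```python
import statistics
from collections import defaultdict

def response_variance_profile(ratings):
    """Unique-value counts and SDs per emotion category.

    `ratings` is a list of (emotion, value) pairs, e.g. a model's
    valence rating for each image. A handful of unique values with
    near-zero SD signals prototype lookup rather than per-image
    discrimination.
    """
    by_emotion = defaultdict(list)
    for emotion, value in ratings:
        by_emotion[emotion].append(value)
    return {
        emo: {"n_unique": len(set(vals)),
              "sd": statistics.pstdev(vals) if len(vals) > 1 else 0.0}
        for emo, vals in by_emotion.items()
    }

# A constant-valued category (cf. LLaMA's neutral valence) has SD = 0.
profile = response_variance_profile([("neutral", 5)] * 4 + [("sad", 2), ("sad", 4)])
assert profile["neutral"] == {"n_unique": 1, "sd": 0.0}
```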

4.5 Demographic Bias Analysis

Mixed-effects models revealed model-specific demographic bias patterns.

Race bias. Gemma3 showed no significant race-valence bias, while LLaMA showed significant bias for Korean faces (β = +0.319, p = .009). For arousal, LLaMA’s race bias was three times larger than Gemma3’s: Korean faces received −1.204 points lower arousal in LLaMA (vs. Gemma3’s −0.399), while Black faces were overestimated by +0.50 points.

Gender bias. Gemma3 showed significant gender-valence bias (β = −0.332, p < .001, rating female faces 0.33 points more negatively), while LLaMA showed no significant gender-valence bias. The gender-arousal bias direction reversed between models: Gemma3 rated female faces as slightly higher arousal (+0.169, p = .020) while LLaMA rated them as lower (−0.465, p < .001).

Emotion-selective racial bias. At the intersection of race and emotion, Gemma3 showed a 2.7× accuracy gap for angry classification (Black 61.3% vs. Korean 22.5%), directionally consistent with the “angry Black man” stereotype documented in human perception research. Disgust showed the reverse pattern (Korean 95.0% > Black 75.0%), revealing that racial bias is selectively activated for specific race-emotion combinations rather than operating uniformly.


5. Discussion

5.1 Stereotyped Responding: Prototype Lookup vs. Per-Image Discrimination

The most fundamental finding of this study is that VLMs perform emotion-category prototype lookup rather than genuine per-image perceptual discrimination, producing 1–6 fixed valence-arousal values per emotion category regardless of the specific facial expression shown. This dimensional collapse likely arises from the discrete token generation architecture of VLMs, which must select specific integer tokens from their vocabulary. In contrast, FER regression heads produce continuous outputs through dedicated prediction layers trained end-to-end on dimensional emotion data.

This finding has direct implications for the emerging practice of using VLMs as proxy annotators for emotion data at scale (Zhang et al., 2024). While VLMs can reproduce average emotion prototypes—and their rank ordering of emotions along valence and arousal dimensions is largely correct—they fail to capture the within-category intensity gradients that distinguish, for example, mild irritation from intense rage. VLM-generated emotion labels thus carry systematic distortions (compressed variance, fixed prototypes) that would propagate through any downstream training pipeline.

5.2 Polarity Exaggeration Bias

Both VLMs systematically amplify the valence extremity of emotions, with standard deviations 1.87–2.17 times larger than human ratings. This polarity exaggeration bias likely originates from VLMs’ pretraining corpora, where emotional language tends toward hyperbole (e.g., descriptions of angry faces as “furious” rather than “slightly annoyed”). Counterintuitively, the larger LLaMA (11B) shows stronger polarity exaggeration than the smaller Gemma3 (4B)—angry valence bias of −2.05 vs. −1.05—suggesting that increased model capacity may amplify rather than refine emotion stereotypes if pretraining data does not proportionally increase in emotional nuance.

The consistency of polarity exaggeration across emotions and models suggests that post-hoc calibration (e.g., simple linear rescaling per emotion category) could substantially reduce absolute errors while preserving the high rank-order correlation, offering a practical path for applied affective computing systems.
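Such a calibration could be sketched as a per-emotion least-squares fit mapping VLM ratings onto human ratings. This is a hypothetical procedure, not one performed in the study; constant-valued categories would need a fallback, since the slope is undefined when VLM ratings have zero variance:

```python
import statistics

def fit_linear_rescale(vlm, human):
    """Least-squares slope/intercept mapping VLM ratings onto human
    ratings; fit once per emotion category on held-out data, then
    apply to new VLM outputs."""
    mx, my = statistics.fmean(vlm), statistics.fmean(human)
    cov = sum((x - mx) * (y - my) for x, y in zip(vlm, human))
    var = sum((x - mx) ** 2 for x in vlm)  # zero for constant output
    slope = cov / var
    intercept = my - slope * mx
    return slope, intercept

# If a VLM exaggerates polarity by a factor of 2 around the scale
# midpoint, the fitted rescale undoes it exactly.
human = [3.0, 4.0, 5.0, 6.0]
vlm = [2 * (h - 5) + 5 for h in human]  # exaggerated: [1, 3, 5, 7]
slope, intercept = fit_linear_rescale(vlm, human)
assert abs(slope - 0.5) < 1e-9 and abs(intercept - 2.5) < 1e-9
```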

5.3 The Sadness Paradox

We identify a sadness paradox in VLM emotion recognition: sadness is the worst-classified emotion for both VLMs (Gemma3 F1 = 0.223, LLaMA F1 = 0.092) despite being reliably classified by FER models (PosterV2 F1 = 0.994). The dominant error pathway is neutral absorption: Gemma3 classifies 71.1% and LLaMA 66.7% of sad images as neutral, suggesting that VLMs treat sadness as the absence of emotion rather than as a distinct emotional state. This is qualitatively different from the angry-disgust confusions shared with FER models, which reflect visual feature overlap rather than categorical non-recognition.
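Neutral absorption can be quantified directly from a confusion matrix as the fraction of true-sad images assigned the neutral label. A small sketch on hypothetical counts (the function and matrix below are illustrative, not the study's data):

```python
# Sketch: quantify "neutral absorption" from a confusion matrix, i.e. the
# fraction of true-sad images a model labels as neutral. The counts below
# are toy data shaped like the finding, not the actual matrix.

def absorption_rate(confusion, true_label, absorbed_into):
    """Fraction of images with true_label that the model assigned absorbed_into."""
    row = confusion[true_label]
    total = sum(row.values())
    return row.get(absorbed_into, 0) / total if total else 0.0

confusion = {
    "sad": {"sad": 40, "neutral": 150, "fear": 10},  # hypothetical counts
}
print(f"{absorption_rate(confusion, 'sad', 'neutral'):.1%}")  # 75.0%
```

An absorption rate of this magnitude for a single off-diagonal cell distinguishes categorical non-recognition from the diffuse confusions typical of visual feature overlap.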

This extends the sadness-arousal inversion first identified in our prior work (Tae et al., under review), where FER models showed inverse arousal correlations for female sad faces. The present VLM results add a more severe failure mode: VLMs do not detect sadness as a distinct emotion category at all, let alone estimate its intensity.

The sadness paradox poses critical risks for VLM deployment in mental health support and empathetic agent design. A system that cannot distinguish sadness from emotional neutrality will fundamentally fail at detecting distress—the very application domain where affective computing promises the greatest societal benefit (Pantic et al., 2005).

5.4 The Arousal Advantage of VLMs

Perhaps the most unexpected finding is that VLMs substantially outperform all five FER-specialized models on arousal prediction (r = .759–.783 vs. .126–.448). We hypothesize that this advantage arises from language-mediated reasoning: VLMs can leverage their language model’s conceptual understanding of emotional intensity (encoded in their pretraining corpora through phrases like “calm,” “agitated,” “excited”) to estimate arousal, whereas FER models must learn arousal mapping purely from visual features and sparse continuous annotations.

This finding, combined with the valence advantage of FER models, suggests that hybrid systems—combining FER classification heads with VLM-based intensity estimation—could outperform either architecture alone. Such complementary integration represents a promising direction for next-generation affective computing systems.
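The hybrid idea can be expressed as a simple routing rule: take the categorical label and valence from the FER model and the arousal estimate from the VLM. The sketch below uses hypothetical stand-in outputs for both models, since the paper proposes the combination rather than a specific implementation:

```python
# Minimal sketch of the proposed hybrid: route each dimension to the model
# family that dominates it (FER: label and valence; VLM: arousal).
# The per-model outputs below are hypothetical stand-ins.
from dataclasses import dataclass

@dataclass
class EmotionEstimate:
    label: str      # from the FER classifier (its stronger dimension)
    valence: float  # from the FER regression head
    arousal: float  # from the VLM rater (its stronger dimension)

def hybrid_estimate(fer_output, vlm_output):
    """Combine per-model strengths into a single estimate for one image."""
    return EmotionEstimate(
        label=fer_output["label"],
        valence=fer_output["valence"],
        arousal=vlm_output["arousal"],
    )

# Toy outputs for one sad image: the FER model gets the category right,
# while the VLM absorbs it into neutral but still tracks intensity.
fer_out = {"label": "sad", "valence": -2.3, "arousal": 2.0}
vlm_out = {"label": "neutral", "valence": 0.0, "arousal": 2.8}
print(hybrid_estimate(fer_out, vlm_out))
```

A learned fusion (e.g., weighting each source by its validation reliability per dimension) would be the natural next step beyond this fixed routing rule.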

5.5 Model-Specific Demographic Biases

The most consequential finding for deployment decisions is that VLM demographic biases are model-specific in direction, magnitude, and affected dimension. Gemma3 shows gender-valence bias while LLaMA shows race-arousal bias; Gemma3 rates female faces slightly higher in arousal, while LLaMA rates them lower. This heterogeneity means that no single bias audit can generalize across VLMs, and each deployment context requires individual evaluation against the specific populations and emotions involved.

The emotion-selective nature of racial bias—where Gemma3’s angry classification accuracy for Black faces (61.3%) is 2.7× that for Korean faces (22.5%)—echoes the “angry Black man” stereotype documented in human social cognition (Hess et al., 2004). However, the bias reverses for disgust (Korean 95.0% > Black 75.0%), revealing that racial effects on VLM emotion recognition operate through emotion-specific pathways rather than uniform racial preferences.
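The per-model, per-dimension audit logic can be illustrated with a simple group-gap computation on toy data. (The paper fits proper mixed-effects models with lme4, per Bates et al., 2015; this pure-Python sketch only shows the core contrast being tested and all records are hypothetical.)

```python
# Simplified stand-in for the mixed-effects bias audit: compare a model's
# mean ratings across demographic groups within one emotion. Toy data only;
# the actual analysis uses mixed-effects models with image-level random effects.
from statistics import mean

def group_gap(records, group_key, value_key, group_a, group_b):
    """mean(value | group_a) - mean(value | group_b)."""
    a = [r[value_key] for r in records if r[group_key] == group_a]
    b = [r[value_key] for r in records if r[group_key] == group_b]
    return mean(a) - mean(b)

# Hypothetical per-image arousal ratings from one model on angry faces
records = [
    {"race": "Black",  "arousal": 4.0},
    {"race": "Black",  "arousal": 4.2},
    {"race": "Korean", "arousal": 3.1},
    {"race": "Korean", "arousal": 3.3},
]
print(round(group_gap(records, "race", "arousal", "Black", "Korean"), 2))
```

Running this contrast separately for each model, dimension, and emotion is what surfaces the heterogeneity described above; a single pooled audit would average the model-specific biases away.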

5.6 Limitations

Several limitations should be noted. First, our human participants were exclusively Korean adults, potentially introducing cultural biases in the baseline against which VLMs are evaluated. Cross-cultural replication with diverse rater populations is needed. Second, we tested only two open-source VLMs at the 4B–11B scale; extending to larger models (70B+) and frontier APIs (GPT-4o, Claude, Gemini) would reveal whether the patterns reported here generalize across the model capability spectrum. Third, our stimuli are static, single-emotion images; real-world emotion recognition typically involves dynamic, multi-modal, and mixed-emotion stimuli. Fourth, the context-carry prompting strategy introduces structural dependencies (error propagation from classification to dimensional ratings) that may not be present in alternative prompting approaches (e.g., single-shot integrated prompting). Fifth, the 4-bit quantization used for edge deployment may affect model behavior compared to full-precision inference.


6. Conclusion

This study provides the first psychometric comparison of VLM and human emotion ratings using a fully factorial stimulus design, establishing that Vision Language Models achieve moderate-to-substantial categorical agreement (κ = 0.535–0.671) but exhibit qualitatively distinct biases—stereotyped responding, polarity exaggeration, and the sadness paradox—that distinguish them from both human raters and FER-specialized models.

Three key findings emerge. First, VLMs perform categorical prototype lookup rather than per-image perceptual discrimination, producing near-zero variance within emotion categories. This dimensional collapse means VLMs cannot currently substitute for human raters in research contexts where stimulus-level variation matters. Second, a striking strength inversion exists between model families: FER models dominate classification (κ = 0.665–0.878) and valence (r = .928–.950), while VLMs dominate arousal (r = .759–.783 vs. .126–.448), suggesting complementary architectural advantages. Third, demographic biases are model-specific in direction, magnitude, and affected dimension, requiring per-model audits rather than generalized “VLM bias” characterizations.

As VLMs increasingly mediate human-computer interaction in emotionally sensitive contexts—from mental health chatbots to affective tutoring systems—the gap between their emotion perception and human psychological benchmarks demands both rigorous measurement, which this psychometric framework provides, and transparent reporting of model-specific limitations and biases. Future work should extend this framework to larger VLMs, frontier API models, dynamic video stimuli, and culturally diverse rater populations, while investigating whether fine-tuning on dimensionally annotated emotion data can mitigate the stereotyped responding and polarity exaggeration identified here.


References

Baudouin, J.-Y., Gallian, F., Pinoit, J.-M., & Damon, F. (2025). Arousal, valence, and discrete categories in facial emotion. Scientific Reports, 15(1), 40268.

Bates, D., Mächler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1–48.

Hess, U., Adams, R. B., Jr., & Kleck, R. E. (2004). Facial appearance, gender, and emotion expression. Emotion, 4(4), 378–388.

Jankowiak, P., et al. (2024). Metrics for dataset demographic bias: A case study on facial expression recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(8).

Khare, S. K., Blanes-Vidal, V., Nadimi, E. S., & Acharya, U. R. (2024). Emotion recognition and artificial intelligence: A systematic review (2014–2023). Information Fusion, 102, 102019.

Mollahosseini, A., Hasani, B., & Mahoor, M. H. (2017). AffectNet: A database for facial expression, valence, and arousal computing in the wild. IEEE Transactions on Affective Computing, 10(1), 18–31.

Mulukutla, V. K., Pavarala, S. S., Rudraraju, S. R., & Bonthu, S. (2025). Evaluating open-source vision language models for facial emotion recognition against traditional deep learning models. arXiv preprint arXiv:2508.13524.

Nomiya, H., Shimokawa, K., Namba, S., Osumi, M., & Sato, W. (2025). An artificial intelligence model for sensing affective valence and arousal from facial images. Sensors, 25(4), 1188.

Pantic, M., Sebe, N., Cohn, J. F., & Huang, T. (2005). Affective multimodal human-computer interaction. In Proceedings of the 13th ACM International Conference on Multimedia (pp. 669–676).

Refoua, S., Elyoseph, Z., Piterman, H., et al. (2026). Evaluation of cross-ethnic emotion recognition capabilities in multimodal large language models using the reading the mind in the eyes test. Scientific Reports, 16.

Russell, J. A. (1980). A circumplex model of affect. Journal of Personality and Social Psychology, 39(6), 1161–1178.

Tak, A. N., & Gratch, J. (2024). GPT-4 emulates average-human emotional cognition from a third-person perspective. arXiv preprint arXiv:2408.13718.

Telceken, M., Akgun, D., Kacar, S., Yesin, K., & Yıldız, M. (2025). Can artificial intelligence understand our emotions? Deep learning applications with face recognition. Current Psychology, 44(9), 7946–7956.

Zhang, Y., Yang, X., Xu, X., et al. (2024). Affective computing in the era of large language models: A survey from the NLP perspective. arXiv preprint arXiv:2408.04638.
