Do Vision Language Models See Emotions Like Humans? Comparing Human and VLM Emotion Ratings on AI-Generated Facial Stimuli with Demographic Bias Analysis

Authors: Jini Tae, Ju-Hyeon Park, Wonil Choi

Affiliation: Gwangju Institute of Science and Technology (GIST), South Korea


Abstract

Vision Language Models (VLMs) are increasingly deployed in affective computing, yet their alignment with human emotion perception remains poorly understood beyond categorical accuracy metrics. This study compares emotion ratings from 1,000 human participants with two instruction-tuned VLMs — Gemma3-4B-IT (Google) and LLaMA-3.2-11B-Vision (Meta) — on 1,440 AI-generated facial images balanced across three races (Black, Caucasian, Korean), two genders, and six basic emotions. Using a psychometric framework that treats VLMs as additional raters, we evaluate categorical agreement (Cohen’s κ), dimensional alignment (valence and arousal via Pearson correlation, MAE, and mixed-effects models), and demographic bias, with human inter-rater reliability serving as the agreement benchmark. Both VLMs achieve moderate-to-substantial categorical agreement (κ = 0.535–0.671) but exhibit a fixed-value output pattern, producing only 1–6 unique values per emotion category with near-zero variance and indicating category-level responses rather than per-image discrimination. Valence correlations are high (r = .891–.901) but absolute errors are large (MAE = 1.46–1.81) due to polarity exaggeration bias, whereby VLMs rate negative emotions as more negative and positive emotions as more positive than humans do. Arousal predictions outperform all five FER-specialized baselines (r = .759–.783 vs. .126–.448), suggesting that language-mediated reasoning confers an advantage for intensity estimation. Demographic bias patterns are model-specific: Gemma3 shows a gender-valence bias (β = −0.332) while LLaMA shows a race-arousal bias three times larger. These findings demonstrate that VLM emotion ratings cannot substitute for human judgments and that bias audits must be conducted per model. We additionally compare both VLMs against five FER models on the same stimuli, revealing complementary strengths across the categorical-dimensional divide.

Keywords: Vision Language Models, Facial Emotion Recognition, Psychometric Agreement, Valence-Arousal, Demographic Bias, AI-Generated Faces, Affective Computing


1. Introduction

1.1 Affective Computing and the Promise of VLMs

The deployment of affective computing systems — from mental health chatbots to responsive virtual assistants — increasingly depends on accurate automatic emotion recognition from facial expressions. The efficacy of such systems hinges on affective alignment, defined as the degree to which a machine’s interpretation of emotional cues matches human psychological standards (Pantic et al., 2005). When an empathetic agent misinterprets the intensity of a user’s distress, it risks breaking user trust and failing to sustain meaningful interaction. These stakes motivate rigorous empirical comparison between machine and human emotion perception.

Vision Language Models (VLMs) represent a paradigm shift from task-specific facial expression recognition (FER) models to general-purpose multimodal systems. A VLM is a model that integrates a vision encoder with a large language model, enabling image-conditioned text generation through natural language prompting. Whereas FER-specialized models are trained end-to-end on emotion-labeled datasets and output fixed emotion categories or continuous valence-arousal values, VLMs can flexibly produce both categorical and dimensional emotion ratings through instruction prompting — a capability that mirrors the integrated judgment process humans naturally employ. This flexibility raises the possibility that VLMs might serve as scalable substitutes for costly human emotion annotation, where collecting 72,000 responses from 1,000 raters represents a significant time and financial investment.

To evaluate whether VLMs truly perceive emotions as humans do, a dimensional measurement framework is required. The Circumplex Model of Affect (Russell, 1980) is a theoretical framework that maps all emotional experiences onto a continuous two-dimensional space defined by valence and arousal. Valence is the hedonic quality of an emotional experience, ranging from unpleasant to pleasant. Arousal is the degree of physiological activation, ranging from calm to excited. While the circumplex model was originally formulated for self-reported affective experience, it has been widely adopted for characterizing observer-rated facial expression perception (Baudouin et al., 2025). We follow this convention while noting that perceived emotion in others and felt emotion in oneself may involve distinct processes. This dimensional framework provides a richer representational vocabulary than categorical classification alone, enabling detection of subtle perceptual misalignments that discrete labels would obscure. Two systems might both correctly classify an expression as “angry,” yet differ in how intense (arousal) or how negative (valence) they perceive that anger to be. Despite the theoretical importance of dimensional ratings, computational evaluations of emotion recognition have overwhelmingly focused on discrete category accuracy (Khare et al., 2024; Telceken et al., 2025).

1.2 The Evaluation Gap

Although this dimensional framework is well established, current VLM evaluations rarely employ it, leaving four critical gaps that this study addresses.

The first gap concerns the absence of a human agreement benchmark. Existing benchmarks rely on accuracy and F1 scores against ground-truth labels while ignoring substantial disagreement among human raters. Human emotion perception is inherently variable — particularly for arousal, where inter-rater reliability can be as low as Krippendorff’s α = 0.125 (present study). Krippendorff’s α is a reliability coefficient for multiple raters that corrects for chance agreement, where 1.0 indicates perfect consensus and 0.0 indicates chance-level agreement. Without establishing human inter-rater reliability as an agreement benchmark, it is impossible to determine whether a model’s errors reflect genuine failure or simply mirror the inherent subjectivity of emotion perception.
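To make this benchmark concrete, the chance-correction logic of Krippendorff’s α can be sketched in a few lines. The following is an illustrative pure-Python implementation of the nominal-data variant (the study itself uses the ordinal variant, which additionally weights disagreements by their distance on the rating scale); the example data are invented, not drawn from the study.

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha (nominal) from a list of units, each a list of ratings."""
    coincidences = Counter()  # weighted ordered pairs of values co-occurring in a unit
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue  # a unit rated only once contributes no pairable information
        for a, b in permutations(ratings, 2):
            coincidences[(a, b)] += 1 / (m - 1)
    n = sum(coincidences.values())  # total number of pairable values
    marginals = Counter()
    for (a, _b), w in coincidences.items():
        marginals[a] += w
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(marginals[a] * marginals[b]
                   for a in marginals for b in marginals if a != b) / (n - 1)
    return 1.0 - observed / expected

# Three illustrative images, two raters each: two agreements, one disagreement.
print(krippendorff_alpha_nominal([[1, 1], [2, 2], [1, 2]]))  # ≈ 0.444
```

The coefficient compares observed within-unit disagreement against the disagreement expected from the marginal value frequencies, which is what distinguishes it from raw percent agreement.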

The second gap is the exclusive focus on categorical accuracy, neglecting continuous dimensional ratings central to affective science. A model may achieve perfect categorical accuracy while producing systematically distorted dimensional ratings — a dissociation we demonstrate empirically in the present study.

The third gap concerns the absence of demographic bias audits for open-source VLMs. While demographic disparities have been documented in commercial FER APIs (Rhue, 2018; Jankowiak et al., 2024), systematic bias analysis of open-source VLMs across race-gender-emotion intersections remains absent. This gap is concerning given the rapid adoption of open-source VLMs in research and applied settings where fairness guarantees are critical.

The fourth gap is the use of unrepresentative models as proxies for “AI.” Prior studies comparing human and AI emotion perception have predominantly used FER-specialized models — lightweight architectures with millions of parameters (e.g., MobileViT approximately 6M parameters, EfficientNet approximately 5M) trained exclusively on emotion-labeled datasets such as AffectNet (Mollahosseini et al., 2017). While these models achieve high classification accuracy, they neither represent the capabilities of modern foundation models with billions of parameters trained on internet-scale multimodal data, nor support the integrated categorical-plus-dimensional ratings that humans naturally produce. This mismatch motivates the transition to VLMs in the present study.

1.3 Contributions and Research Questions

This paper makes five contributions to the intersection of affective computing, cognitive psychology, and multimodal AI evaluation. First, we introduce a VLM-as-rater psychometric framework that treats VLMs as additional participants in a human rating paradigm. Rather than evaluating VLMs against ground-truth labels using accuracy and F1, we employ Intraclass Correlation Coefficients (ICC), Cohen’s κ, Krippendorff’s α, and Bland-Altman analysis to quantify agreement against human inter-rater reliability as an empirical agreement benchmark. Cohen’s κ is a chance-corrected agreement measure for categorical classification, where 0 indicates chance-level and 1 indicates perfect agreement. Bland-Altman analysis is a method for assessing agreement between two measurement methods via systematic bias and 95% limits of agreement. This framework reveals dimensions of VLM behavior — the fixed-value output pattern, polarity exaggeration, dimensional collapse — that accuracy-based evaluations entirely miss. Second, we present among the first systematic demographic bias analyses of open-source VLMs, using a fully crossed 3 (race: Black, Caucasian, Korean) × 2 (gender: Male, Female) × 6 (emotion) factorial stimulus design with 1,440 AI-generated face images that ensures full factorial balance and experimental control. Third, we identify a fixed-value output pattern, a phenomenon in which VLMs produce only 1–6 unique valence-arousal values per emotion category (e.g., neutral valence SD = 0.00 for LLaMA), indicating category-level response generation rather than per-image intensity discrimination. Fourth, we perform a dual comparison of both VLMs and five FER-specialized models against the same human baseline (N = 1,000), revealing a striking strength inversion: FER models dominate valence prediction while VLMs dominate arousal prediction, suggesting complementary architectural advantages.
Fifth, we identify model-specific demographic bias profiles where Gemma3 and LLaMA show biases in different dimensions, for different demographics, and in different directions.
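The chance correction behind Cohen’s κ, used throughout our categorical analyses, can be sketched in a few lines of pure Python; the labels below are illustrative, not study data.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected categorical agreement between two raters."""
    n = len(rater_a)
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Expected agreement if both raters labeled independently at their base rates.
    p_chance = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    return (p_observed - p_chance) / (1 - p_chance)

human = ["happy", "sad", "angry", "happy", "neutral", "happy"]
model = ["happy", "neutral", "angry", "happy", "neutral", "happy"]
print(round(cohens_kappa(human, model), 3))  # → 0.75
```

Because κ subtracts the agreement expected by chance, a model that over-predicts frequent categories cannot inflate its score the way raw accuracy allows.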

This study is exploratory in nature. Rather than testing pre-registered hypotheses, we systematically characterize VLM emotion rating behavior across multiple dimensions to generate testable hypotheses for future confirmatory research. Our research questions address three axes of VLM-human comparison. RQ1 asks how VLM emotion ratings compare to human inter-rater reliability on categorical and dimensional measures. RQ2 asks whether VLMs exhibit systematic demographic biases in emotion attribution, and whether these biases are model-specific. RQ3 asks how VLMs compare to FER-specialized models in classification accuracy, dimensional prediction, and bias profiles.


2. Related Work

2.1 VLMs for Emotion Recognition

The application of VLMs to facial emotion recognition has yielded mixed results, with traditional deep learning models consistently outperforming VLMs on categorical accuracy. Mulukutla et al. (2025) conducted the first empirical comparison of open-source VLMs against traditional models on FER-2013, a dataset containing 35,887 low-resolution grayscale images across seven emotion classes. Traditional models — EfficientNet-B0 (86.44% accuracy) and ResNet-50 (85.72%) — outperformed VLMs by a margin of 20 to 35 percentage points, with CLIP achieving 64.07% and Phi-3.5 Vision achieving 51.66%. This performance gap suggests that VLMs’ general visual understanding does not automatically translate to FER proficiency, particularly on low-quality visual inputs.

Frontier API models show more promising results, with GPT-4o and Gemini matching human performance for certain expressions. Evaluations on the NimStim dataset demonstrate that GPT-4o and Gemini match or exceed human performance for calm, neutral, and surprise expressions, though performance degrades for more ambiguous emotions (Harb et al., 2025). Refoua et al. (2026) evaluated ChatGPT-4, ChatGPT-4o, and Claude 3 Opus on the Reading the Mind in the Eyes Test (RMET) with White, Black, and Korean face stimuli, finding that ChatGPT-4o achieved cross-ethnically consistent performance with accuracy above the 85th human percentile across all three ethnic versions. Specialized VLM frameworks for FER have also emerged, including FACET-VLM (2025), which achieves up to 99.41% on BU-4DFE through multiview facial representation learning with semantic language guidance. However, these fine-tuned models sacrifice the generality that makes VLMs attractive as versatile emotion annotators. Bhattacharyya and Wang (2025) presented a comprehensive evaluation of VLMs for evoked emotion recognition at NAACL, establishing benchmark comparisons across multiple datasets and confirming that zero-shot VLMs lag behind supervised systems. The present study evaluates open-source models at the 4B–11B scale that are accessible for research deployment, bridging the gap between FER-specialized and frontier API approaches.

2.2 Human-AI Comparison in Emotion Perception

The psychometric comparison of human and machine raters has a long tradition in clinical psychology, recently extended to large language models. The Intraclass Correlation Coefficient (ICC) is a measure of the consistency or absolute agreement of quantitative ratings across multiple raters, and Bland-Altman analysis visualizes the systematic bias between two measurement methods; both serve as standard tools for assessing measurement agreement. In affective computing, Tak and Gratch (2024) found that GPT-4 emulates average-human emotional cognition from a third-person perspective, with its interpretations aligning more closely with human judgments about others’ emotions than with self-assessments. A study published in PLOS ONE (Alrasheed et al., 2025) evaluated GPT-4’s capacity to interpret emotions from images, achieving numeric response correlations of r = 0.87 for valence and r = 0.72 for arousal on the Geneva Affective Picture Database (GAPED) under zero-shot conditions. These results establish that large language models can approximate human emotion perception, though the extent of approximation varies across emotional dimensions.

Prior human-AI comparisons in emotion perception have typically used either FER-specialized models with limited dimensionality or frontier API models without transparent access to model internals. Zhang et al. (2024) provide a comprehensive survey of affective computing in the era of LLMs, noting that while LLMs excel at affective understanding tasks such as sentiment classification and emotion detection, their performance on dimensional emotion estimation remains underexplored. The present study bridges this gap by evaluating open-source VLMs that produce integrated categorical-plus-dimensional ratings through a psychometric framework anchored to large-scale human data (N = 1,000).

2.3 Demographic Bias in Automated Affect Recognition

Documented racial and gender disparities in automated affect recognition have raised fairness concerns that extend to VLMs. Jankowiak et al. (2024) proposed formal metrics for measuring dataset demographic bias in FER, demonstrating that imbalanced training data composition propagates into systematic performance disparities across demographic groups. Gender bias in FER manifests in two forms: representational bias, which refers to unequal demographic representation in training data, and stereotypical bias, which refers to systematic associations between emotions and demographics, such as linking female faces with sadness and male faces with anger (Dominguez-Catena et al., 2024).

Human emotion perception itself is not demographically neutral. Gender-emotion stereotypes lead observers to associate male faces with dominance-related emotions such as anger and female faces with prosocial emotions such as happiness and sadness (Plant et al., 2000). Notably, Hess et al. (2004) demonstrated that when facial dominance and affiliation cues are controlled, these stereotypical associations can reverse, highlighting the complexity of gender-emotion interactions in face perception. These biases in human annotation propagate into training datasets — AffectNet (Mollahosseini et al., 2017) relies on 12 annotators across approximately 450,000 images, with most images receiving a single annotation — and may be amplified by algorithmic optimization. The present study extends bias analysis from commercial APIs and training datasets to open-source VLMs, using a factorial experimental design that enables orthogonal estimation of race, gender, and emotion effects through mixed-effects modeling.

2.4 AI-Generated Stimuli in Emotion Research

Traditional face databases used in emotion research — including KDEF, ADFES, FER-2013, and AffectNet — suffer from uncontrolled variation in expression quality, lighting, and demographic balance. Real-face databases rely on actors performing emotional expressions, introducing individual variation in expression quality and intensity that creates confounds compromising internal validity. Demographic balance is difficult to achieve, with most databases overrepresenting certain racial groups.

AI-generated face stimuli address these limitations through a controlled generation pipeline that ensures perfect experimental control. The GIST-AIFaceDB used in this study generates neutral base faces with standardized features — identical gray backgrounds, navy t-shirts, and front-facing pose — then transforms each neutral face into five emotional expressions while preserving identity. This pipeline ensures that any differences between emotional expressions for a given identity are attributable solely to the emotion manipulation, not to extraneous visual factors. The ecological validity of AI-generated stimuli is supported by human naturalness ratings: in the present dataset, average naturalness ranged from 5.26 (fear) to 6.94 (happy) on a 9-point scale, indicating that participants perceived the stimuli as moderately to highly realistic. Baudouin et al. (2025) provide supporting evidence that dimensional ratings can be reliably collected from facial stimuli regardless of their provenance, suggesting that AI-generated faces elicit comparable affective responses to photographed faces.


3. Methodology

Figure 1 presents the overall research pipeline, illustrating how 1,440 AI-generated stimuli flow through human rating, VLM inference, and FER baseline evaluation before converging in psychometric comparison.

flowchart TB
    subgraph Stimuli["Stimuli Generation"]
        A["OpenArt<br>STOIQO NewReality Flux"] -->|"240 neutral faces"| B["Nano-Banana<br>Gemini 2.5 Flash Image"]
        B -->|"5 emotions per identity"| C["GIST-AIFaceDB<br>1,440 images<br>3 races x 2 genders x 6 emotions x 40 IDs"]
    end

    subgraph Human["Human Rating"]
        C --> D["N = 1,000 Korean Adults<br>72 images each<br>72,000 total responses"]
        D --> E["Valence 1-9<br>Arousal 1-9<br>Naturalness 1-9"]
    end

    subgraph VLM["VLM Inference"]
        C --> F["Gemma3-4B-IT<br>Google, QAT 4-bit"]
        C --> G["LLaMA-3.2-11B-Vision<br>Meta, 4-bit"]
        F --> H["Context-Carry<br>3-Step Prompting"]
        G --> H
        H --> I["Emotion + Valence + Arousal<br>per image"]
    end

    subgraph FER["FER Baselines"]
        C --> J["5 Models<br>PosterV2, MobileViT,<br>EfficientNet, BEiT, EmoNet"]
        J --> K["Classification +<br>VA Prediction"]
    end

    subgraph Analysis["Psychometric Comparison"]
        E --> L["Cohen kappa<br>Pearson r, MAE<br>Mixed-Effects Models<br>Demographic Bias"]
        I --> L
        K --> L
        L --> M["Key Findings:<br>Fixed-Value Output Pattern<br>Polarity Exaggeration<br>Strength Inversion<br>Model-Specific Bias"]
    end

    style Stimuli fill:#e1f5fe,stroke:#0288d1
    style Human fill:#fff3e0,stroke:#f57c00
    style VLM fill:#e8f5e9,stroke:#388e3c
    style FER fill:#fce4ec,stroke:#c62828
    style Analysis fill:#f3e5f5,stroke:#7b1fa2

Figure 1. Overall research pipeline. AI-generated stimuli (blue) are evaluated by human raters (orange), two VLMs (green), and five FER baselines (red), with all outputs converging in psychometric comparison (purple).

3.1 Stimuli

The stimulus set comprises 1,440 AI-generated facial images from the GIST AI-Generated Face Database (GIST-AIFaceDB, under review). The generation pipeline employed a two-step process. In the first step, 240 neutral base faces were generated using the STOIQO NewReality Flux model deployed on the OpenArt platform. These neutral faces depicted diverse virtual identities wearing standardized navy t-shirts against gray backgrounds, with generation prompts specifying age diversity, hairstyle variation, and demographic characteristics across three racial groups (Black, Caucasian, Korean) and two genders (Male, Female). In the second step, each neutral face was transformed into five additional emotional expressions — angry, disgusted, fearful, happy, and sad — using Nano-Banana, an advanced image-editing model implemented in Google AI Studio (Gemini 2.5 Flash Image), which modifies facial expressions while preserving the identity, lighting, and background of the original image.

The resulting fully crossed factorial design — 3 (race: Black, Caucasian, Korean) × 2 (gender: Male, Female) × 6 (emotion: angry, disgust, fear, happy, sad, neutral) × 40 (identity) — yields 1,440 images with balanced cell sizes: 240 images per emotion, 480 per race, 720 per gender, and 80 per race-gender-emotion combination. This balanced design enables orthogonal estimation of all demographic effects without confounding.

3.2 Human Rating Procedure

The study protocol was reviewed and granted exemption by the Institutional Review Board (IRB). One thousand native Korean adults (500 female, 500 male; age M = 44.6, SD = 13.7, range 20–69) were recruited through an online platform, with recruitment strictly balanced across age cohorts and genders. Each participant evaluated 72 images randomly selected from the total pool of 1,440, with every image presented in randomized order. Through this counterbalanced crossed design, each image received 50 independent ratings, yielding 72,000 total responses across three dimensions: valence (1–9 Likert scale, 1 = “extremely negative,” 9 = “extremely positive”), arousal (1–9, 1 = “not at all aroused,” 9 = “highly aroused”), and naturalness (1–9, 1 = “very unnatural,” 9 = “very natural”).

Inter-rater reliability, computed as Krippendorff’s α (ordinal), established the human agreement benchmark: valence α = 0.471 (poor-to-fair), arousal α = 0.125 (poor), and naturalness α = 0.126 (poor). While these values appear low, they fall within the typical range for emotion rating studies and reflect the inherent subjectivity of affective perception, particularly for arousal. A linear mixed-effects model (LMM) is a regression model containing both fixed effects (systematic factors such as emotion category) and random effects (sources of variation such as individual images or raters). Mixed-effects variance decomposition confirmed that rater individual differences (σ² = 0.450 for valence, σ² = 0.696 for arousal) dominated image-level variance by a factor of 11 for valence and 32 for arousal, confirming that low reliability is driven by rater heterogeneity rather than stimulus ambiguity. We report Krippendorff’s α for individual ratings; the aggregate reliability for image-level means (averaging 50 raters) would be substantially higher per the Spearman-Brown formula, though the individual-level α remains the appropriate benchmark for comparing a single VLM rater against the human rating distribution.
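The Spearman-Brown step mentioned above is a one-line calculation. A minimal sketch, applying the standard prophecy formula to the reported individual-level reliabilities with k = 50 raters per image:

```python
def spearman_brown(alpha_single, k):
    """Projected reliability of the mean of k parallel ratings (prophecy formula)."""
    return k * alpha_single / (1 + (k - 1) * alpha_single)

# Aggregate (image-level mean) reliability implied by the individual-level values.
print(round(spearman_brown(0.471, 50), 3))  # valence: ≈ 0.978
print(round(spearman_brown(0.125, 50), 3))  # arousal: ≈ 0.877
```

Even the poor individual-level arousal reliability thus yields a highly reliable image-level mean once 50 ratings are averaged, which is why the individual-level α remains the fair benchmark for a single VLM rater.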

3.3 VLM Inference

Two instruction-tuned VLMs were evaluated: Gemma3-4B-IT (Google, 4 billion parameters, QAT 4-bit quantized) and LLaMA-3.2-11B-Vision-Instruct (Meta, 11 billion parameters, 4-bit quantized). Both models were deployed on Apple Silicon (M1 Max, 32GB) via the MLX framework for GPU-accelerated inference without HTTP overhead. Both models were run with temperature = 0 (greedy decoding), producing deterministic outputs for each image. This decoding configuration inherently limits output variance, as the model always selects the highest-probability token at each generation step.

Inference followed a three-step context-carry prompting strategy, a term we introduce to describe a sequential approach where prior outputs are fed forward as context for subsequent predictions, mirroring anchoring effects in human sequential judgment. In Step 1, the model classified the facial emotion from six forced-choice categories (happy, sad, angry, fear, disgust, neutral) via structured JSON output. In Step 2, the classified emotion was carried forward as context, and the model rated valence on a 1–9 scale. In Step 3, both the classified emotion and valence rating were carried forward, and the model rated arousal on a 1–9 scale. This strategy introduces structural error propagation: classification errors in Step 1 systematically influence subsequent valence and arousal ratings. Response parsing employed a cascade strategy: direct JSON parse, markdown fence stripping, and regex fallback. Emotion labels were fuzzy-matched by their first three characters, and both valence and arousal were clamped to [1, 9]. Gemma3 achieved 100% JSON parse success with one invalid category output (0.07%, “doubt”), while LLaMA achieved comparable compliance. All 1,440 images were successfully processed by both models.
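The parsing cascade can be sketched as follows. This is an illustrative reconstruction rather than the study’s actual parser; the function names and exact regular expressions are our assumptions.

```python
import json
import re

EMOTIONS = ["happy", "sad", "angry", "fear", "disgust", "neutral"]

def parse_emotion(raw_text):
    """Cascade: direct JSON parse -> markdown fence stripping -> regex fallback."""
    text = raw_text.strip()
    label = ""
    for candidate in (text, re.sub(r"^```(?:json)?\s*|\s*```$", "", text)):
        try:
            data = json.loads(candidate)
            label = str(data.get("emotion", "")).lower()
            break
        except (json.JSONDecodeError, AttributeError):
            continue
    if not label:
        # Regex fallback: take the first word-like token in the raw response.
        match = re.search(r"[a-zA-Z]+", text)
        label = match.group(0).lower() if match else ""
    # Fuzzy match on the first three characters ("sadness" -> "sad").
    for canonical in EMOTIONS:
        if label[:3] == canonical[:3]:
            return canonical
    return None  # e.g. an invalid category such as "doubt"

def clamp_rating(value):
    """Clamp a numeric valence/arousal rating to the 1-9 scale."""
    return max(1, min(9, int(value)))
```

The three-character fuzzy match absorbs common surface variants (“Sadness”, “fearful”) while still rejecting out-of-vocabulary labels, which is how the lone “doubt” output was flagged as invalid.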

Figure 2 illustrates the three-step context-carry prompting strategy with the actual prompt templates used in the study.

flowchart TD
    IMG["Input: Face Image + Prompt"] --> S1

    subgraph S1["Step 1 — Emotion Classification"]
        direction TB
        P1["PROMPT:<br>What is the facial expression<br>in this image? Choose one from:<br>happy, sad, angry, fear,<br>disgust, neutral.<br>Answer with a single word only."]
        P1 --> R1["MODEL RESPONSE:<br>e.g. happy"]
    end

    S1 -->|"emotion = happy<br>carried to Step 2"| S2

    subgraph S2["Step 2 — Valence Rating"]
        direction TB
        P2["PROMPT:<br>You identified this face as happy.<br>How pleasant is this facial<br>expression? Rate from 1 to 9<br>where 1 is very unpleasant and<br>9 is very pleasant.<br>Answer with a single number only."]
        P2 --> R2["MODEL RESPONSE:<br>e.g. 8"]
    end

    S2 -->|"emotion = happy,<br>valence = 8<br>carried to Step 3"| S3

    subgraph S3["Step 3 — Arousal Rating"]
        direction TB
        P3["PROMPT:<br>You identified this face as happy<br>with pleasantness 8 out of 9.<br>How intense or activated is the<br>emotion in this face? Rate from<br>1 to 9 where 1 is very calm<br>and 9 is very excited.<br>Answer with a single number only."]
        P3 --> R3["MODEL RESPONSE:<br>e.g. 7"]
    end

    S3 --> OUT["Final Output:<br>emotion=happy, valence=8, arousal=7"]

    S1 -.->|"Error Propagation"| S2
    S2 -.->|"Anchoring Effect"| S3

    style S1 fill:#e8f5e9,stroke:#388e3c
    style S2 fill:#fff3e0,stroke:#f57c00
    style S3 fill:#fce4ec,stroke:#c62828
    style OUT fill:#e1f5fe,stroke:#0288d1
    style P1 fill:#f1f8e9,stroke:#689f38,text-align:left
    style P2 fill:#fff8e1,stroke:#ffa000,text-align:left
    style P3 fill:#fce4ec,stroke:#e57373,text-align:left

Figure 2. Three-step context-carry prompting strategy with actual prompt templates. Each step receives the face image along with a text prompt. Step 1 output (emotion label) is injected into Step 2’s prompt template; Step 2 output (valence) is further injected into Step 3’s prompt. Dashed arrows indicate error propagation: a misclassification in Step 1 (e.g., “sad” classified as “neutral”) causes Steps 2 and 3 to rate valence and arousal under the wrong emotional frame.

3.4 FER Baseline Models

Five FER-specialized models were evaluated on the same 1,440 images for comparative analysis. Facial Expression Recognition (FER) models are task-specific architectures trained end-to-end on emotion-labeled datasets to output fixed emotion categories or continuous valence-arousal values. The five baselines included PosterV2 (Pyramid Transformer, classification only), MobileViT (lightweight Vision Transformer, classification and VA prediction), EfficientNet-B0-8-VA-MTL (multi-task CNN, classification and VA prediction), BEiT (BERT Image Transformer, classification only), and EmoNet (CNN, classification and VA prediction). For the three VA-capable models (EmoNet, MobileViT, EfficientNet), predictions in the native [-1, 1] range were normalized to the human rating scale [1, 9] using the formula v_norm = (v_raw + 1) / 2 × 8 + 1.
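A minimal sketch of this rescaling:

```python
def normalize_va(v_raw):
    """Map a native [-1, 1] valence/arousal prediction onto the 1-9 rating scale."""
    return (v_raw + 1) / 2 * 8 + 1

print(normalize_va(-1.0), normalize_va(0.0), normalize_va(1.0))  # → 1.0 5.0 9.0
```

The mapping is affine, so it preserves correlations exactly and affects only the mean and spread of the FER predictions relative to the human scale.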

3.5 Statistical Analysis

Categorical agreement was quantified via Cohen’s κ against intended emotion labels, with McNemar’s test for pairwise model comparisons. Dimensional alignment was assessed through Pearson correlation, Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and Bland-Altman analysis (systematic bias and 95% limits of agreement). Per-emotion bias significance was tested with Wilcoxon signed-rank tests, Bonferroni-corrected for 18 comparisons (6 emotions × 3 VA-capable models per model family).
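The dimensional agreement metrics can be sketched in pure Python; the ratings below are invented, and the 95% limits of agreement use the sample SD of the differences:

```python
from math import sqrt
from statistics import mean, stdev

def dimensional_agreement(model, human):
    """MAE, RMSE, and Bland-Altman bias with 95% limits of agreement."""
    diffs = [m - h for m, h in zip(model, human)]
    mae = mean(abs(d) for d in diffs)
    rmse = sqrt(mean(d * d for d in diffs))
    bias = mean(diffs)                      # systematic over/under-rating
    half_width = 1.96 * stdev(diffs)        # sample SD of the differences
    return {"mae": mae, "rmse": rmse, "bias": bias,
            "loa": (bias - half_width, bias + half_width)}

model_ratings = [2.0, 8.0, 1.0, 9.0, 5.0]
human_ratings = [3.1, 7.2, 2.5, 7.8, 5.0]
print(dimensional_agreement(model_ratings, human_ratings))
```

Reporting MAE alongside Pearson r matters here: a model can track the human rank order almost perfectly (high r) while still sitting far from the human values in absolute terms (high MAE), which is exactly the dissociation observed for VLM valence.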

Bias decomposition employed linear mixed-effects models (LMMs) fitted via R’s lme4 package (Bates et al., 2015) with Satterthwaite degrees of freedom (lmerTest). Satterthwaite degrees of freedom are an approximation method for computing p-values in mixed-effects models where exact degrees of freedom are undefined. The emotion-bias model used the formula: rating ~ rater_type * emotion + (1|image_id), where rater_type distinguishes human aggregate ratings from VLM ratings and image_id is a crossed random effect controlling for between-image variability. Demographic bias models used analogous formulas with actor_race and actor_gender as fixed effects.
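Written out explicitly, the emotion-bias model corresponds to the following specification (our notation, with i indexing images and j indexing rater types; e ranges over non-reference emotion categories):

```latex
\text{rating}_{ij} = \beta_0
  + \beta_1\,\text{rater\_type}_j
  + \sum_{e} \beta_{2e}\,\text{emotion}_{e}(i)
  + \sum_{e} \beta_{3e}\,\bigl(\text{rater\_type}_j \times \text{emotion}_{e}(i)\bigr)
  + u_i + \varepsilon_{ij},
\qquad u_i \sim \mathcal{N}(0,\sigma^2_{\text{image}}),\quad
\varepsilon_{ij} \sim \mathcal{N}(0,\sigma^2_{\varepsilon})
```

The interaction coefficients β₃ₑ carry the quantity of interest: how much the VLM-minus-human rating gap shifts for each emotion relative to the reference category, after the image random intercept uᵢ absorbs between-image variability.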


4. Results

4.1 Emotion Classification

Gemma3-4B-IT achieved Cohen’s κ = 0.671 (substantial agreement), outperforming EmoNet (κ = 0.665) and approaching BEiT (κ = 0.713), while LLaMA-3.2-11B-Vision achieved κ = 0.535 (moderate agreement), below all FER baselines. Table 1 presents the full seven-model ranking. The larger LLaMA (11B parameters) performed worse than the smaller Gemma3 (4B parameters), demonstrating that model scale does not guarantee improved emotion recognition and that instruction tuning quality and pretraining data composition are more consequential factors.

Table 1. Overall emotion classification performance (N = 1,440).

Model          Type  Parameters  Accuracy  Macro F1  Cohen’s κ
PosterV2       FER   ~44M        0.899     0.900     0.878
MobileViT      FER   ~6M         0.875     0.874     0.848
EfficientNet   FER   ~5M         0.854     0.856     0.823
BEiT           FER   ~86M        0.766     0.772     0.713
Gemma3-4B      VLM   4B          0.726     0.683     0.671
EmoNet         FER   ~5M         0.731     0.724     0.665
LLaMA-3.2-11B  VLM   11B         0.613     0.402     0.535

Both VLMs perfectly classified happy and neutral but failed dramatically on sadness. Gemma3 achieved a sad F1 of 0.223 while LLaMA achieved only 0.092, compared to PosterV2’s 0.992. Table 2 presents emotion-specific accuracy across all seven models, revealing extreme performance polarization.

Table 2. Emotion-specific classification accuracy (proportion correct).

Emotion  Gemma3  LLaMA  PosterV2  MobileViT  EfficientNet  BEiT   EmoNet
Happy    1.000   1.000  1.000     1.000      1.000         0.979  1.000
Neutral  1.000   1.000  0.912     0.863      0.729         0.529  0.533
Fear     0.979   0.654  0.933     0.942      0.846         0.792  0.912
Disgust  0.842   0.008  0.642     0.533      0.679         0.754  0.846
Angry    0.404   0.921  0.917     0.954      0.887         0.800  0.637
Sad      0.126   0.092  0.992     0.958      0.983         0.742  0.454

The two VLMs exhibit complementary error profiles that are qualitatively distinct from FER confusion patterns. Gemma3 exhibits neutral absorption, classifying 71.1% of sad images as neutral, while LLaMA exhibits angry merger, classifying 99.2% of disgust images as angry. Neutral absorption is the dominant VLM error pattern of classifying sad expressions as neutral, suggesting the model treats sadness as the absence of emotion. In contrast, LLaMA excels at angry (92.1% accuracy) where Gemma3 struggles (40.4%), while Gemma3 excels at disgust (84.2%) where LLaMA fails completely (0.8%). These two dominant error pathways account for 70.6% of all classification errors, and both are qualitatively distinct from the angry-disgust visual overlap confusions that FER models share.

4.2 Valence Comparison

Both VLMs achieve high valence correlations (r = .891–.901), approaching but not matching FER models (r = .928–.950), as shown in Table 3. However, absolute errors are 1.5 to 2.0 times larger (VLM MAE = 1.46–1.81 vs. FER MAE = 0.80–1.06), reflecting a pattern of correct rank ordering but distorted scale usage.

Table 3. Valence prediction summary statistics.

| Model | Type | Pearson r | MAE | Model M (SD) | Human M (SD) |
|---|---|---|---|---|---|
| MobileViT | FER | .950 | 0.916 | 4.18 (2.35) | 4.60 (1.42) |
| EfficientNet | FER | .940 | 1.063 | 4.05 (2.57) | 4.60 (1.42) |
| EmoNet | FER | .928 | 0.795 | 4.32 (2.00) | 4.60 (1.42) |
| LLaMA-3.2-11B | VLM | .901 | 1.808 | 3.71 (3.08) | 4.60 (1.42) |
| Gemma3-4B | VLM | .891 | 1.456 | 4.31 (2.65) | 4.60 (1.42) |
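The dissociation above — high r alongside large MAE — falls directly out of the two metrics’ definitions: Pearson correlation is invariant to shifts and rescaling of the ratings, while MAE is not, so correct rank ordering can coexist with distorted scale usage. A small sketch of both, assuming paired per-image rating vectors:

```python
import numpy as np

def dimensional_stats(model_ratings, human_ratings):
    """Pearson r and mean absolute error between model and human ratings."""
    m = np.asarray(model_ratings, dtype=float)
    h = np.asarray(human_ratings, dtype=float)
    r = float(np.corrcoef(m, h)[0, 1])  # shift/scale-invariant rank agreement
    mae = float(np.abs(m - h).mean())   # absolute error on the rating scale
    return r, mae
```

In the test below, the model ratings are an exact linear stretch of the human ratings: r is 1.0 even though every prediction is off by one to four points.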

The source of this distortion is polarity exaggeration bias, defined as the systematic tendency to produce more extreme valence ratings than humans — more negative for negative emotions and more positive for positive emotions. Gemma3’s valence SD of 2.65 is 1.87 times the human SD of 1.42, and LLaMA’s SD of 3.08 is 2.17 times the human SD. Table 4 presents per-emotion valence bias across all models.

Table 4. Per-emotion valence bias (Model − Human mean).

| Emotion | Gemma3 | LLaMA | EmoNet | MobileViT | EfficientNet |
|---|---|---|---|---|---|
| Fear | −1.99 | −2.68 | +0.40 | −0.14 | −0.62 |
| Disgust | −1.39 | −2.25 | −1.35 | −0.78 | −0.97 |
| Angry | −1.06 | −2.04 | −0.64 | −1.01 | −0.79 |
| Happy | +1.26 | +1.58 | +0.76 | +1.01 | +1.03 |
| Neutral | +1.05 | −0.28 | +0.04 | −0.09 | +0.01 |
| Sad | +0.38 | +0.53 | −0.89 | −1.51 | −1.95 |
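The bias values in Table 4 are plain mean differences (Model − Human) computed within each emotion category; the significance tests reported next come from mixed-effects models, not from these descriptive means. A sketch, assuming hypothetical (emotion, model, human) rating triples:

```python
from collections import defaultdict

def per_emotion_bias(records):
    """Mean (model - human) valence within each emotion category.
    records: iterable of (emotion, model_valence, human_valence) triples."""
    totals = defaultdict(lambda: [0.0, 0])
    for emotion, model_v, human_v in records:
        totals[emotion][0] += model_v - human_v
        totals[emotion][1] += 1
    return {emo: s / n for emo, (s, n) in totals.items()}
```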

LLaMA’s negative-emotion valence bias (−2.04 to −2.68) is approximately double Gemma3’s (−1.06 to −1.99). Mixed-effects models confirmed all per-emotion valence biases as statistically significant (p < .001). The LMM for LLaMA yielded a main effect of rater_type[vlm], β = −2.050 (t = −42.73, p < .001) for the angry reference category, approximately double Gemma3’s β = −1.053 (t = −18.06, p < .001). This pattern is consistent with increased model scale amplifying rather than reducing polarity exaggeration, with caveats on that attribution discussed in Section 5.2.

4.3 Arousal Comparison

A striking strength inversion emerges in arousal prediction. Strength inversion refers to the complementary pattern where FER models dominate valence prediction while VLMs dominate arousal prediction. As Table 5 shows, VLMs substantially outperform all five FER-specialized models on arousal (r = .759–.783 vs. .126–.448), suggesting that language-mediated reasoning about emotional intensity confers a structural advantage for arousal estimation. Gemma3 additionally achieves the lowest arousal MAE (1.137) among all seven models.

Table 5. Arousal prediction summary statistics.

| Model | Type | Pearson r | MAE | Model M (SD) | Human M (SD) |
|---|---|---|---|---|---|
| LLaMA-3.2-11B | VLM | .783 | 1.777 | 5.36 (2.42) | 5.61 (0.60) |
| Gemma3-4B | VLM | .759 | 1.137 | 5.49 (1.74) | 5.61 (0.60) |
| EfficientNet | FER | .448 | 1.696 | 6.53 (2.33) | 5.61 (0.60) |
| MobileViT | FER | .409 | 1.864 | 6.68 (2.61) | 5.61 (0.60) |
| EmoNet | FER | .126 | 1.369 | 6.48 (1.56) | 5.61 (0.60) |

The most striking between-model difference is happy arousal. Gemma3’s bias of +0.30 is non-significant in the LMM (β = +0.059, p = .442), indicating appropriate calibration for happy intensity. In contrast, LLaMA rates happy arousal at 8.87 (human mean: 6.48), yielding a +2.39 overestimation (β = +2.889, p < .001) that reflects an extreme “happiness = maximal excitement” prototype. Table 6 presents per-emotion arousal bias with LMM significance.

Table 6. Per-emotion arousal bias (VLM − Human mean), with LMM significance.

| Emotion | Gemma3 Bias | LMM p | LLaMA Bias | LMM p |
|---|---|---|---|---|
| Fear | +1.30 | < .001 | +1.21 | < .001 |
| Happy | +0.30 | .442 | +2.39 | < .001 |
| Angry | +0.24 | < .001 | −0.50 | < .001 |
| Disgust | +0.42 | .026 | −0.57 | .517 |
| Sad | −1.04 | < .001 | −2.10 | < .001 |
| Neutral | −1.90 | < .001 | −1.91 | < .001 |

Both VLMs severely underestimate neutral arousal (bias: −1.90 to −1.91) and sad arousal (bias: −1.04 to −2.10), revealing a systematic tendency to associate low visual salience with minimal arousal.

Figure 3 visualizes the strength inversion pattern: FER models dominate classification and valence, while VLMs dominate arousal prediction.

```mermaid
quadrantChart
    title Strength Inversion - FER vs VLM Performance
    x-axis "Low Valence r" --> "High Valence r"
    y-axis "Low Arousal r" --> "High Arousal r"
    quadrant-1 "VLM Advantage"
    quadrant-2 "Both Strong"
    quadrant-3 "Both Weak"
    quadrant-4 "FER Advantage"
    "Gemma3-4B": [0.68, 0.85]
    "LLaMA-11B": [0.70, 0.89]
    "EmoNet": [0.80, 0.10]
    "MobileViT": [0.86, 0.35]
    "EfficientNet": [0.84, 0.42]
```

Figure 3. Strength inversion between VLMs and FER models. Horizontal axis represents valence correlation (r) with human ratings; vertical axis represents arousal correlation. VLMs cluster in the upper-left quadrant (strong arousal, moderate valence), while FER models cluster in the lower-right (strong valence, weak arousal).

4.4 Fixed-Value Output Pattern and Dimensional Collapse

Stereotyped responding — the production of only 1–6 unique valence-arousal values per emotion category — indicates fixed-value output rather than per-image discrimination. The extreme case is LLaMA’s neutral valence SD of 0.00: all 240 neutral images received the identical value of 5, with zero per-image discrimination. We term this reduction of continuous dimensional variation to a small set of discrete prototype values dimensional collapse. Table 7 presents response variance across emotions for both VLMs and human raters.

Table 7. Response variance by emotion: standard deviation of ratings within each emotion category.

| Emotion | Gemma3 V SD | LLaMA V SD | Human V SD | Gemma3 A SD | LLaMA A SD | Human A SD |
|---|---|---|---|---|---|---|
| Happy | 0.48 | 0.13 | 1.31 | 0.66 | 0.72 | 1.57 |
| Neutral | 0.64 | 0.00 | 1.08 | 0.44 | 0.28 | 1.71 |
| Fear | 0.16 | 0.50 | 1.61 | 0.47 | 1.86 | 1.52 |
| Angry | 0.80 | 1.05 | 1.55 | 0.49 | 1.21 | 1.51 |
| Sad | 1.02 | 1.13 | 1.44 | 1.03 | 0.35 | 1.53 |
| Disgust | 0.39 | 0.82 | 1.54 | 0.49 | 1.55 | 1.51 |

Across all emotions, VLM valence SDs (range: 0.00–1.13) are dramatically lower than human SDs (range: 1.08–1.61). This dimensional collapse likely arises from the discrete token generation architecture of VLMs, which must select specific integer tokens from their vocabulary, whereas FER regression heads produce continuous outputs through dedicated prediction layers trained end-to-end on dimensional emotion data. The result is qualitatively different from both human raters, who exhibit genuine individual variation, and FER models, which produce continuous distributions.
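The two diagnostics behind this section — within-category SD and the count of unique rating values — can be computed directly from per-image ratings; a sketch over a hypothetical mapping from emotion label to rating list:

```python
import numpy as np

def response_profile(ratings_by_emotion):
    """Per-emotion (SD, unique-value count): the two fixed-value diagnostics.
    An SD near zero with 1-6 unique values flags fixed-value output."""
    return {emotion: (float(np.std(values)), len(set(values)))
            for emotion, values in ratings_by_emotion.items()}
```

Applied to LLaMA’s neutral valence ratings, this profile would return (0.00, 1) — the degenerate case described above.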

4.5 Demographic Bias Analysis

Mixed-effects models revealed that VLM demographic biases are model-specific in direction, magnitude, and affected dimension. Regarding race bias, Gemma3 showed no significant race-valence bias, while LLaMA showed significant valence bias for Korean faces (β = +0.319, p = .009). For arousal, LLaMA’s race bias was three times larger than Gemma3’s: Korean faces received 1.204 points lower arousal in LLaMA (compared to Gemma3’s 0.399 reduction), while Black faces were overestimated by 0.50 points.

Regarding gender bias, Gemma3 showed significant gender-valence bias (β = −0.332, p < .001), rating female faces 0.33 points more negatively on average, while LLaMA showed no significant gender-valence bias. The gender-arousal bias direction reversed between models: Gemma3 rated female faces as slightly higher arousal (+0.169, p = .020) while LLaMA rated them as lower arousal (−0.465, p < .001).
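The raw gender gap underlying these coefficients can be summarized descriptively as a difference of mean (VLM − Human) biases by gender; note that the significance tests reported here come from mixed-effects models, not from this simple contrast. A sketch with hypothetical records:

```python
from statistics import mean

def gender_valence_gap(records):
    """Female-minus-male mean (VLM - human) valence bias.
    records: iterable of (gender, vlm_valence, human_valence) tuples."""
    bias = {"female": [], "male": []}
    for gender, vlm_v, human_v in records:
        bias[gender].append(vlm_v - human_v)
    return mean(bias["female"]) - mean(bias["male"])
```

A negative gap, as Gemma3 shows for valence, means female faces are rated more negatively relative to human judgments than male faces are.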

At the intersection of race and emotion, Gemma3 showed a 2.7-fold accuracy gap for angry classification between Black faces (61.3%) and Korean faces (22.5%). It is important to note that this accuracy difference reflects differential sensitivity to truly angry faces across racial groups, not necessarily over-attribution of anger to Black faces. Establishing over-attribution would require reporting the false positive rate (the rate of classifying non-angry faces as angry) by race, which is beyond the scope of the current analysis. Disgust showed the reverse pattern (Korean 95.0% accuracy exceeding Black 75.0%), indicating that racial effects on VLM emotion classification are emotion-specific rather than uniform.

Figure 4 summarizes the model-specific demographic bias profiles, showing that Gemma3 and LLaMA exhibit biases in different dimensions and directions.

```mermaid
flowchart TB
    subgraph Gemma3["Gemma3-4B-IT Bias Profile"]
        direction TB
        G1["Gender-Valence Bias<br>Female faces rated 0.33 pts<br>MORE NEGATIVE<br>beta = -0.332, p < .001"]
        G2["Gender-Arousal Bias<br>Female faces rated 0.17 pts<br>HIGHER arousal<br>p = .020"]
        G3["Race-Valence Bias<br>NOT significant"]
        G4["Race-Arousal Bias<br>Korean -0.40 pts<br>moderate effect"]
    end

    subgraph LLaMA["LLaMA-3.2-11B Bias Profile"]
        direction TB
        L1["Gender-Valence Bias<br>NOT significant"]
        L2["Gender-Arousal Bias<br>Female faces rated 0.47 pts<br>LOWER arousal<br>p < .001"]
        L3["Race-Valence Bias<br>Korean +0.32 pts<br>p = .009"]
        L4["Race-Arousal Bias<br>Korean -1.20 pts<br>3x larger than Gemma3"]
    end

    G1 -.-|"OPPOSITE<br>direction"| L1
    G2 -.-|"REVERSED"| L2
    G4 -.-|"3x LARGER<br>in LLaMA"| L4

    style Gemma3 fill:#e8f5e9,stroke:#388e3c
    style LLaMA fill:#e3f2fd,stroke:#1565c0
    style G1 fill:#ffcdd2,stroke:#c62828
    style G2 fill:#fff9c4,stroke:#f9a825
    style G3 fill:#e8f5e9,stroke:#388e3c
    style G4 fill:#fff9c4,stroke:#f9a825
    style L1 fill:#e8f5e9,stroke:#388e3c
    style L2 fill:#ffcdd2,stroke:#c62828
    style L3 fill:#ffcdd2,stroke:#c62828
    style L4 fill:#ffcdd2,stroke:#c62828
```

Figure 4. Model-specific demographic bias profiles. Green boxes indicate non-significant bias; red boxes indicate significant bias in a potentially harmful direction; yellow boxes indicate moderate effects. Dashed lines connect corresponding bias dimensions between models, highlighting directional reversals and magnitude differences.


5. Discussion

5.1 Fixed-Value Output Pattern: Prototype Lookup vs. Per-Image Discrimination

The most fundamental finding is that VLMs perform prototype lookup rather than genuine per-image perceptual discrimination, producing 1–6 fixed valence-arousal values per emotion category regardless of the specific facial expression shown. This dimensional collapse likely arises from the discrete token generation architecture of VLMs, which must select specific integer tokens from their vocabulary. In contrast, FER regression heads produce continuous outputs through dedicated prediction layers. VLMs can reproduce average emotion prototypes, and their rank ordering of emotions along valence and arousal dimensions is largely correct. However, they fail to capture the within-category intensity gradients that distinguish mild irritation from intense rage.

This finding has direct implications for the emerging practice of using VLMs as proxy annotators for emotion data at scale (Zhang et al., 2024). VLM-generated emotion labels carry systematic distortions — compressed variance and fixed prototypes — that would propagate through any downstream training pipeline. While VLMs may serve as rough screening tools for categorical emotion classification, they cannot substitute for human raters in research contexts where individual stimulus variation matters, such as norm development for emotion databases or calibration of therapeutic interventions.

5.2 Polarity Exaggeration Bias

Both VLMs systematically amplify the valence extremity of emotions, with standard deviations 1.87 to 2.17 times larger than human ratings. This polarity exaggeration bias likely originates from VLMs’ pretraining corpora, where emotional language tends toward hyperbole — descriptions of angry faces as “furious” rather than “slightly annoyed.” The larger LLaMA (11B) shows stronger polarity exaggeration than the smaller Gemma3 (4B), with angry valence bias of −2.05 compared to −1.05. This pattern is consistent with the possibility that increased model capacity amplifies rather than refines emotion stereotypes when pretraining data does not proportionally increase in emotional nuance, though the two models also differ in architecture and training data, preventing a clean causal attribution to scale alone.

The consistency of polarity exaggeration across emotions and models suggests a practical mitigation path. Post-hoc linear calibration per emotion category could substantially reduce absolute errors while preserving the high rank-order correlation. For example, a simple affine transformation mapping VLM output distributions to human output distributions per emotion category would correct for both the mean shift and the variance inflation, potentially bringing VLM MAE to within FER-model range without retraining.
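The proposed correction can be written as a per-emotion affine (z-score) map. A sketch, assuming the per-category means and SDs are estimated on a held-out calibration set rather than the evaluation data:

```python
import numpy as np

def affine_calibrate(vlm_scores, vlm_mean, vlm_sd, human_mean, human_sd):
    """Per-emotion affine map: z-score under the VLM's own statistics,
    then rescale to the human mean and SD for that emotion category.
    Corrects both the mean shift and the variance inflation."""
    z = (np.asarray(vlm_scores, dtype=float) - vlm_mean) / vlm_sd
    return human_mean + human_sd * z
```

One caveat: within categories where the VLM emits a single fixed value (SD near zero), the map degenerates, so calibration can fix scale distortion but cannot restore per-image discrimination.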

5.3 Sadness-Neutral Confusion

Sadness is the worst-classified emotion for both VLMs (Gemma3 F1 = 0.223, LLaMA F1 = 0.092) despite being reliably classified by FER models (PosterV2 F1 = 0.994). This confusion is predictable from the circumplex model, as sadness occupies a low-arousal, moderately negative region proximal to neutral. The finding is not paradoxical but rather reveals that VLMs are particularly insensitive to the subtle, low-intensity facial cues that distinguish sadness from emotional neutrality. The dominant error pathway is neutral absorption: Gemma3 classifies 71.1% and LLaMA classifies 66.7% of sad images as neutral. This pattern suggests that VLMs treat sadness as the absence of emotion rather than as a distinct emotional state, qualitatively different from the angry-disgust confusions that reflect visual feature overlap in FER models.

The sadness-neutral confusion extends the arousal inversion identified in our prior work (Tae et al., under review), where FER models showed inverse arousal correlations for female sad faces. The present VLM data reveals a more fundamental failure: VLMs cannot detect sadness as a distinct category, let alone estimate its intensity. This poses critical risks for VLM deployment in mental health support and empathetic agent design. A system that cannot distinguish sadness from emotional neutrality will fundamentally fail at detecting distress — the very application domain where affective computing promises the greatest societal benefit (Pantic et al., 2005).

5.4 The Arousal Advantage of VLMs

VLMs outperform all five FER-specialized models on arousal prediction (r = .759–.783 vs. .126–.448), the most unexpected finding of this study. We hypothesize that this advantage arises from language-mediated reasoning: VLMs can leverage their language model’s conceptual understanding of emotional intensity, encoded through pretraining on phrases such as “calm,” “agitated,” and “excited,” to estimate arousal. In contrast, FER models must learn arousal mapping purely from visual features and sparse continuous annotations. This finding, combined with FER models’ valence advantage, suggests that hybrid systems combining FER classification heads with VLM-based intensity estimation could outperform either architecture alone, a design recommendation for next-generation affective computing systems.

An important caveat qualifies this arousal advantage interpretation. In the context-carry prompting strategy, VLM arousal ratings in Step 3 received the classified emotion label from Step 1 as input context. This means the arousal advantage may partially reflect language-based inference — predicting arousal from the emotion label using the language model’s conceptual knowledge — rather than visual arousal perception from facial features alone. A text-only ablation, providing emotion labels without face images, would be needed to isolate the visual contribution to arousal estimation. This confound does not apply to FER models, which predict arousal directly from visual features without access to intermediate categorical labels.

5.5 Model-Specific Demographic Biases

The most consequential finding for deployment decisions is that VLM demographic biases are model-specific in direction, magnitude, and affected dimension. Gemma3 shows gender-valence bias (β = −0.332) while LLaMA shows race-arousal bias three times larger than Gemma3’s. Gemma3 rates female faces as slightly higher arousal (+0.169) while LLaMA rates them as lower (−0.465). This heterogeneity means that no single bias audit can generalize across VLMs, and each deployment context requires individual evaluation against the specific populations and emotions involved. The emotion-selective nature of racial accuracy differences — where Gemma3’s angry classification accuracy for Black faces (61.3%) is 2.7 times that for Korean faces (22.5%) — indicates differential sensitivity across racial groups for specific emotions. This accuracy gap does not directly indicate over-attribution of anger to Black faces, which would require false positive rate analysis (Hugenberg & Bodenhausen, 2003). The pattern reverses for disgust (Korean 95.0% exceeding Black 75.0%), confirming that racial effects on VLM classification operate through emotion-specific pathways rather than uniform racial preferences.

5.6 Limitations

Several limitations constrain the generalizability of these findings. First, our human participants were exclusively Korean adults, potentially introducing cultural biases in the baseline against which VLMs are evaluated. Cross-cultural replication with diverse rater populations is needed to establish whether the observed patterns are universal or culturally specific. Second, we tested only two open-source VLMs at the 4B–11B scale; extending to larger models (70B and above) and frontier APIs (GPT-4o, Claude, Gemini) would reveal whether the fixed-value output pattern and polarity exaggeration persist across the model capability spectrum. Third, our stimuli are static, single-emotion images, whereas real-world emotion recognition typically involves dynamic, multi-modal, and mixed-emotion stimuli. Fourth, the context-carry prompting strategy introduces structural dependencies (error propagation from classification to dimensional ratings) that may not be present in alternative prompting approaches such as single-shot integrated prompting. Fifth, the 4-bit quantization used for edge deployment may independently contribute to output discretization and reduced response variance. Recent work demonstrates that VLMs can lose emotional nuance under aggressive quantization, with accuracy degradations exceeding 10% on affective tasks (MBQ, CVPR 2025). The fixed-value output pattern observed in this study cannot be cleanly attributed to VLM architecture alone without a full-precision comparison. Sixth, the context-carry prompting strategy provides emotion labels as context for valence and arousal ratings, creating a confound between visual perception and language-based inference that cannot be resolved without a text-only ablation condition.


6. Conclusion

This study provides a psychometric comparison of VLM and human emotion ratings using a fully factorial stimulus design, establishing that Vision Language Models achieve moderate-to-substantial categorical agreement (κ = 0.535–0.671) but exhibit qualitatively distinct biases — fixed-value output pattern, polarity exaggeration, and the sadness-neutral confusion — that distinguish them from both human raters and FER-specialized models.

Three key findings emerge with robust implications. First, VLMs perform categorical fixed-value output rather than per-image perceptual discrimination, producing near-zero variance within emotion categories. This dimensional collapse means VLMs cannot currently substitute for human raters in research contexts where stimulus-level variation matters. Second, a strength inversion exists between model families: FER models dominate classification (κ = 0.665–0.878) and valence (r = .928–.950), while VLMs dominate arousal (r = .759–.783 vs. .126–.448), suggesting complementary architectural advantages that could be exploited in hybrid systems. Third, demographic biases are model-specific in direction, magnitude, and affected dimension, requiring per-model audits rather than generalized “VLM bias” characterizations. As VLMs increasingly mediate human-computer interaction in emotionally sensitive contexts — from mental health chatbots to affective tutoring systems — the gap between their emotion perception and human psychological benchmarks demands both rigorous measurement, which this psychometric framework provides, and transparent reporting of model-specific limitations and biases. Future work should extend this framework to larger VLMs, frontier API models, dynamic video stimuli, and culturally diverse rater populations, while investigating whether fine-tuning on dimensionally annotated emotion data can mitigate the fixed-value output pattern and polarity exaggeration identified here.


References

Alrasheed, H., Alghihab, A., Pentland, A., & Alghowinem, S. (2025). Evaluating the capacity of large language models to interpret emotions in images. PLOS ONE, 20(6), e0324127.

Bates, D., Mächler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1–48.

Baudouin, J.-Y., Gallian, F., Pinoit, J.-M., & Damon, F. (2025). Arousal, valence, and discrete categories in facial emotion. Scientific Reports, 15(1), 40268.

Bhattacharyya, A., & Wang, S. (2025). Evaluating vision-language models for emotion recognition. In Findings of the Association for Computational Linguistics: NAACL 2025.

Dominguez-Catena, I., Paternain, D., & Galar, M. (2024). Less can be more: Representational vs. stereotypical gender bias in facial expression recognition. Progress in Artificial Intelligence, 13, 255–273.

Harb, E., et al. (2025). Evaluating the performance of general purpose large language models in identifying human facial emotions. npj Digital Medicine, 8.

Hess, U., Adams, R. B., Jr., & Kleck, R. E. (2004). Facial appearance, gender, and emotion expression. Emotion, 4(4), 378–388.

Hugenberg, K., & Bodenhausen, G. V. (2003). Facing prejudice: Implicit prejudice and the perception of facial threat. Psychological Science, 14(6), 640–643.

Jankowiak, P., et al. (2024). Metrics for dataset demographic bias: A case study on facial expression recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(8), 5520–5536.

Khare, S. K., Blanes-Vidal, V., Nadimi, E. S., & Acharya, U. R. (2024). Emotion recognition and artificial intelligence: A systematic review (2014–2023). Information Fusion, 102, 102019.

Mollahosseini, A., Hasani, B., & Mahoor, M. H. (2017). AffectNet: A database for facial expression, valence, and arousal computing in the wild. IEEE Transactions on Affective Computing, 10(1), 18–31.

Mulukutla, V. K., Pavarala, S. S., Rudraraju, S. R., & Bonthu, S. (2025). Evaluating open-source vision language models for facial emotion recognition against traditional deep learning models. arXiv preprint arXiv:2508.13524.

Pantic, M., Sebe, N., Cohn, J. F., & Huang, T. (2005). Affective multimodal human-computer interaction. In Proceedings of the 13th ACM International Conference on Multimedia (pp. 669–676).

Plant, E. A., Hyde, J. S., Keltner, D., & Devine, P. G. (2000). The gender stereotyping of emotions. Psychology of Women Quarterly, 24(1), 81–92.

Refoua, S., Elyoseph, Z., Piterman, H., et al. (2026). Evaluation of cross-ethnic emotion recognition capabilities in multimodal large language models using the reading the mind in the eyes test. Scientific Reports, 16.

Russell, J. A. (1980). A circumplex model of affect. Journal of Personality and Social Psychology, 39(6), 1161–1178.

Tak, A. N., & Gratch, J. (2024). GPT-4 emulates average-human emotional cognition from a third-person perspective. In Proceedings of the 12th International Conference on Affective Computing and Intelligent Interaction (ACII).

Telceken, M., Akgun, D., Kacar, S., Yesin, K., & Yildiz, M. (2025). Can artificial intelligence understand our emotions? Deep learning applications with face recognition. Current Psychology, 44(9), 7946–7956.

Zhang, Y., Yang, X., Xu, X., et al. (2024). Affective computing in the era of large language models: A survey from the NLP perspective. arXiv preprint arXiv:2408.04638.