comprehensive_stats.xlsx Codebook (v10)

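Column descriptions for each sheet. All models are evaluated on 1,440 images (6 emotions × 240 images).
This codebook also specifies the regression models, libraries, and estimation methods used in the statistical analyses.

Pipeline execution order: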

scripts/generate_comprehensive_stats.py → outputs/comprehensive_stats.xlsx (22 base sheets)
scripts/post_hoc/compute_zscore_per_emotion_bootstrap.py → outputs/zscore_per_emotion_bootstrap.xlsx
scripts/post_hoc/inter_llm_variance.py → inter_llm_variance.xlsx
scripts/post_hoc/_run_stratified_only.py → outputs/stratified_only.xlsx
scripts/consolidate_v10_xlsx.py → docs/manuscript/v10/data/comprehensive_stats.xlsx (35 sheets)

1_Model_Summary (8 rows)

Overall performance summary per model, intended for at-a-glance comparison.

Column	Description
model	Model name
accuracy	Emotion classification accuracy (human majority vs. VLM)
f1_macro	Macro-averaged F1 score (6 emotions)
kappa	Cohen’s κ (chance-corrected agreement)
val_r	Valence Pearson r (VLM vs. human mean)
val_rho	Valence Spearman ρ
val_mae	Valence Mean Absolute Error
val_bias	Valence Bland-Altman bias (human − VLM; positive = VLM rates lower than humans)
aro_r	Arousal Pearson r
aro_rho	Arousal Spearman ρ
aro_mae	Arousal MAE
aro_bias	Arousal Bland-Altman bias (human − VLM)
N	Number of matched images

Generated by: generate_comprehensive_stats.py → compute_classification_metrics() + compute_va_metrics()
Libraries: Python sklearn, scipy.stats

2_Classification (9 rows)

Emotion classification performance. 8 VLMs + Human (majority vote vs. image GT).

Column	Description
model	Model name (including Human)
N	Number of images
accuracy	Accuracy
f1_macro	Macro F1
kappa	Cohen’s κ

Generated by: generate_comprehensive_stats.py → compute_classification_metrics()
Libraries: Python sklearn.metrics.cohen_kappa_score(), sklearn.metrics.f1_score(average="macro")
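As a rough, dependency-free illustration of what the three classification columns measure, the sketch below recomputes accuracy, macro F1, and Cohen’s κ by hand. The helper name `classification_summary` and the toy labels are hypothetical; the pipeline itself calls the sklearn functions listed above.

```python
from collections import Counter


def classification_summary(y_true, y_pred):
    """Accuracy, macro F1 and Cohen's kappa for two equal-length label lists."""
    n = len(y_true)
    labels = sorted(set(y_true) | set(y_pred))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n

    # Per-class F1, then an unweighted mean over classes (macro averaging).
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    f1_macro = sum(f1_scores) / len(f1_scores)

    # Cohen's kappa: observed agreement corrected for chance agreement.
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    p_chance = sum(true_counts[l] * pred_counts[l] for l in labels) / n**2
    # Degenerate case (a single label everywhere) is mapped to 0.0 in this sketch.
    kappa = (accuracy - p_chance) / (1 - p_chance) if p_chance < 1 else 0.0
    return accuracy, f1_macro, kappa


# Toy example (made-up labels, not study data).
truth = ["happy", "happy", "sad", "sad"]
preds = ["happy", "sad", "sad", "sad"]
acc, f1, kappa = classification_summary(truth, preds)
print(acc, round(f1, 4), kappa)
```

With these toy labels the chance agreement is 0.5, so an accuracy of 0.75 corresponds to κ = 0.5.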

3_Per_Emotion_PRF (54 rows)

Per-emotion Precision / Recall / F1. 6 emotions × 9 models.

Column	Description
model	Model name
emotion	Emotion category (happy, sad, angry, fear, disgust, neutral)
precision	Precision
recall	Recall
f1	F1 score
support	Number of images of that emotion (by human majority)

Generated by: generate_comprehensive_stats.py → sklearn.metrics.classification_report(output_dict=True)
Libraries: Python sklearn.metrics

4_Confusion_Matrix (324 rows)

Long-format confusion matrix. 6×6 × 9 models.

Column	Description
model	Model name (including Human)
human_emotion	Ground truth (human majority vote = image GT)
vlm_emotion	VLM prediction (or, for Human, the individual participant's response)
count	Frequency of that cell

Generated by: generate_comprehensive_stats.py → sklearn.metrics.confusion_matrix()
Libraries: Python sklearn.metrics

5_Valence_Overall (8 rows)

Overall valence correlation and error metrics. One row per model.

Column	Description
pearson_r	Pearson r (VLM vs. human mean)
pearson_p	Pearson p-value (0.0 = float64 underflow; actual p < 10⁻³⁰⁸)
spearman_rho	Spearman ρ
spearman_p	Spearman p-value (same underflow caveat)
mae	Mean Absolute Error
ba_bias	Bland-Altman bias = mean(human − VLM); positive = VLM rates lower than humans
ba_loa_lower	Bland-Altman 95% Limits of Agreement, lower bound = bias − 1.96·SD_d
ba_loa_upper	Bland-Altman 95% LoA, upper bound = bias + 1.96·SD_d
vlm_mean	VLM valence mean
vlm_sd	VLM valence standard deviation
human_mean	Human valence mean (grand mean of the per-image means over 50 raters)
human_sd	Human valence standard deviation
model	Model name

Generated by: generate_comprehensive_stats.py → compute_va_metrics()
Libraries: Python scipy.stats.pearsonr(), scipy.stats.spearmanr(), src/analysis/statistics.py:compute_bland_altman() (ddof=1)
Bland-Altman sign convention: $d_{j} = \bar{y}_{j} - x_{j}$ (human − VLM). Code: compute_bland_altman(h_vals, v_vals) → diffs = values_a - values_b
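The sign convention and the limits of agreement can be sketched as follows. This is a minimal standard-library illustration, not the project's compute_bland_altman(); the function name `bland_altman` and the ratings are made up.

```python
import statistics


def bland_altman(human, vlm):
    """Bias and 95% limits of agreement for paired ratings (human - VLM)."""
    diffs = [h - v for h, v in zip(human, vlm)]
    bias = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample SD, i.e. ddof=1
    return bias, bias - 1.96 * sd_d, bias + 1.96 * sd_d


human_means = [6.0, 5.0, 7.0, 4.0]  # made-up per-image human means
vlm_ratings = [5.0, 5.0, 5.0, 3.0]  # made-up VLM ratings
bias, loa_lower, loa_upper = bland_altman(human_means, vlm_ratings)
print(bias, round(loa_lower, 3), round(loa_upper, 3))
```

Here the bias is positive, which under the human − VLM convention means the VLM rated lower than the humans on average.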

6_Valence_Per_Emotion (48 rows)

Per-emotion valence analysis. 6 emotions × 8 models.

Column	Description
emotion	Emotion category
N	Number of images
human_mean	Human valence mean for that emotion
human_sd	Human valence SD
vlm_mean	VLM valence mean
vlm_sd	VLM valence SD
bias	human_mean − vlm_mean (positive = VLM rates lower than humans)
mae	Mean Absolute Error
icc	ICC(2,1), treating the VLM and the human as two raters (※ judged inappropriate in Methods v10.2; for reference only)
ba_bias	Bland-Altman bias (human − VLM)
ba_loa_lower	LoA lower bound
ba_loa_upper	LoA upper bound
within_r	Within-emotion Pearson r (VLM vs. human, images of the same emotion only)
human_alpha	Human inter-rater Krippendorff’s α for that emotion
model	Model name

Generated by: generate_comprehensive_stats.py → compute_per_emotion_va() + compute_icc() (pingouin)
Libraries: Python scipy.stats.pearsonr(), pingouin.intraclass_corr() (ICC), src/analysis/statistics.py:compute_bland_altman()

7_Arousal_Overall (8 rows)

Overall arousal metrics. Same structure as 5_Valence_Overall.

8_Arousal_Per_Emotion (48 rows)

Per-emotion arousal analysis. Same structure as 6_Valence_Per_Emotion.

9_Demographic (45 rows)

Performance by race and gender. (3 race + 2 gender) × 9 models.

Column	Description
demographic	Grouping variable (race or gender)
group	Group value (black, caucasian, korean / man, woman)
N	Number of images
accuracy	Emotion classification accuracy
f1_macro	Macro F1
val_bias	Valence bias (human − VLM; positive = VLM rates lower than humans)
val_r	Valence Pearson r
aro_bias	Arousal bias (human − VLM)
aro_r	Arousal Pearson r
model	Model name

Generated by: generate_comprehensive_stats.py → compute_demographic_metrics()
Libraries: Python sklearn.metrics.f1_score(), scipy.stats.pearsonr()

10_Response_Variability (48 rows)

Diversity of valence/arousal responses, grouped by the emotion the VLM predicted. 6 emotions × 8 models.

Column	Description
predicted_emotion	Emotion predicted by the VLM
N	Number of images predicted as that emotion
val_mean	Mean predicted valence
val_sd	SD of predicted valence
val_unique	Number of distinct valence values used
aro_mean	Mean predicted arousal
aro_sd	SD of predicted arousal
aro_unique	Number of distinct arousal values used
model	Model name

Generated by: generate_comprehensive_stats.py → compute_response_variability()
Libraries: Python pandas (nunique(), std())

11_Thinking_Per_Emotion (12 rows)

Per-emotion reasoning length of the thinking models. 6 emotions × 2 models (Gemini-2.5-Flash, Qwen3-VL-4B).

Column	Description
emotion	Emotion category
N	Number of images
t_emotion_mean	Mean thinking length of the emotion step (labelled as tokens, but measured as a character count)
t_emotion_sd	SD
t_valence_mean	Mean thinking length of the valence step
t_valence_sd	SD
t_arousal_mean	Mean thinking length of the arousal step
t_arousal_sd	SD
total_mean	Mean of the 3-step total
total_sd	SD of the total
model	Model name

Generated by: generate_comprehensive_stats.py → compute_thinking_metrics()
Libraries: Python len() (character count), numpy.mean(), numpy.std() (ddof=0)
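Two details of this sheet are easy to miss: length is a character count from len(), and the SD uses ddof=0, unlike the ddof=1 used in the Bland-Altman and z-score sheets. A small standard-library sketch with made-up thinking traces:

```python
import statistics

# Made-up thinking traces; real traces come from the model responses.
thinking_texts = [
    "The brows are lowered.",
    "Lips pressed, eyes narrowed, tense jaw.",
]
lengths = [len(text) for text in thinking_texts]  # characters, not tokens

mean_len = statistics.mean(lengths)
sd_pop = statistics.pstdev(lengths)    # ddof=0, same as the numpy.std() default
sd_sample = statistics.stdev(lengths)  # ddof=1, for comparison only
print(mean_len, sd_pop, sd_sample)
```

The ddof=0 value is always the smaller of the two, so SDs in this sheet are not directly comparable with ddof=1 SDs elsewhere when N is small.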

12_Thinking_Steps (2 rows)

Mean reasoning length per step for the thinking models, plus a correct-vs-wrong comparison for the sad emotion.

Column	Description
step1_emotion_mean	Overall mean thinking length of the emotion step
step2_valence_mean	Overall mean of the valence step
step3_arousal_mean	Overall mean of the arousal step
model	Model name
sad_correct_n	Number of sad images classified correctly
sad_correct_mean	Mean total thinking length when correct
sad_wrong_n	Number of sad images classified incorrectly
sad_wrong_mean	Mean total thinking length when incorrect
sad_mw_p	Mann-Whitney U p-value (difference in thinking length, correct vs. wrong)

Generated by: generate_comprehensive_stats.py → compute_thinking_metrics()
Statistical test: scipy.stats.mannwhitneyu(correct_t, wrong_t), two-sided (the scipy default)
Libraries: Python scipy

13_Human_RT (6 rows)

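Human rating response times and stimulus naturalness. One row per emotion.

Column	Description
emotion	Emotion category
valence_rt_median	Median valence rating response time (seconds)
arousal_rt_median	Median arousal rating response time (seconds)
naturalness_mean	Mean image naturalness rating (1-9)

Generated by: generate_comprehensive_stats.py → aggregated directly inside main()
Libraries: Python pandas (median(), mean())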

14_Human_Reliability (15 rows)

Human inter-rater reliability (Krippendorff’s α).

Column	Description
measure	Measured dimension (emotion, valence, arousal)
scope	Scope (overall or a specific emotion)
level	Level of measurement (nominal = emotion, interval = valence/arousal)
alpha	Krippendorff’s α
n_raters	Number of raters
n_items	Number of images

Generated by: generate_comprehensive_stats.py → compute_human_reliability()
Libraries: Python krippendorff.alpha(reliability_data, level_of_measurement=...)
Data format: pivot matrix (rows = raters, cols = items). 95% missing (counterbalanced design: 72 of 1,440 images per participant)
Level of measurement: emotion = “nominal”, valence/arousal = “interval”
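The 95% missing rate follows directly from the design. The arithmetic below is a sketch that assumes the 1,000 human raters listed for Sheet 24 and an even spread of ratings across images (an assumption, though consistent with the roughly 50 raters per image mentioned elsewhere in this codebook).

```python
n_raters = 1_000      # participants (Sheet 24 lists 1,000 human raters)
n_items = 1_440       # images
items_per_rater = 72  # images rated by each participant

observed_cells = n_raters * items_per_rater
total_cells = n_raters * n_items
missing_fraction = 1 - observed_cells / total_cells
ratings_per_image = observed_cells / n_items  # assumes an even spread over images

print(f"missing: {missing_fraction:.0%}, ratings per image: {ratings_per_image:.0f}")
```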

15_Demo_ANOVA (18 rows)

Tests of demographic main effects.

Column	Description
model	Model name (including Human)
demographic	Grouping variable (race, gender)
chisq	Likelihood ratio test χ²
df	Degrees of freedom
p_value	p-value

Generated by: generate_comprehensive_stats.py → compute_demographic_significance()
Execution: external R subprocess (subprocess.run(["Rscript", ...]))
Regression models:

Full:  glmer(correct ~ demographic + (1|gt_emotion), family=binomial,
             control=glmerControl(optimizer="bobyqa", optCtrl=list(maxfun=1e5)))
Null:  glmer(correct ~ 1 + (1|gt_emotion), family=binomial)
Test:  anova(fit_null, fit) → Likelihood Ratio Test (LRT) χ²

Estimation: MLE (Maximum Likelihood, family=binomial → logit link)
R libraries: lme4::glmer(), emmeans
Python libraries: none (fully delegated to the R subprocess; the result CSV is parsed with pandas)
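To show how chisq, df, and p_value relate, here is a hedged sketch of the likelihood ratio test arithmetic. It uses closed-form χ² tail probabilities that hold only for df = 1 and df = 2, which happen to be the two cases in this sheet (gender has 2 levels, race has 3). The function name and the log-likelihood values are hypothetical; in the pipeline R's anova() does this work.

```python
import math


def lrt_p_value(loglik_null, loglik_full, df):
    """Chi-square statistic and p-value of a likelihood ratio test (df = 1 or 2)."""
    chisq = max(2.0 * (loglik_full - loglik_null), 0.0)
    if df == 1:
        p = math.erfc(math.sqrt(chisq / 2.0))  # chi-square survival function, df=1
    elif df == 2:
        p = math.exp(-chisq / 2.0)             # chi-square survival function, df=2
    else:
        raise ValueError("this sketch only covers df = 1 or 2")
    return chisq, p


# Made-up log-likelihoods, for illustration only.
chisq_gender, p_gender = lrt_p_value(-700.0, -698.0, df=1)  # gender: 2 levels -> df = 1
chisq_race, p_race = lrt_p_value(-700.0, -697.0, df=2)      # race: 3 levels -> df = 2
print(chisq_gender, round(p_gender, 4), chisq_race, round(p_race, 4))
```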

16_Demo_PostHoc (36 rows)

Demographic post-hoc tests. emmeans pairwise comparisons (Tukey adjustment).

Column	Description
contrast	Comparison pair (e.g. “black / caucasian”)
odds.ratio	Odds ratio
SE	Standard error
df	Degrees of freedom
null	Null hypothesis value (1)
z.ratio	z statistic
p.value	Adjusted p-value
model	Model name
demographic	Grouping variable

Generated by: generate_comprehensive_stats.py → compute_demographic_significance()
Execution: external R subprocess
Post-hoc test: emmeans(fit, pairwise ~ demographic, type="response") → odds ratios with Tukey multiple-comparison adjustment
R libraries: emmeans

17_Demo_Emotion_Acc (270 rows)

Accuracy by demographic group × emotion. (3 race + 2 gender) × 6 emotions × 9 models.

Column	Description
demographic	Grouping variable
group	Group value
emotion	Emotion category
N	Number of images
n_correct	Number of correct responses
accuracy	Accuracy
model	Model name

Generated by: generate_comprehensive_stats.py → compute_demographic_emotion_metrics()
Libraries: Python (plain aggregation; no external statistics library)

18_Demo_Emo_Interaction (54 rows)

Tests of the demographic × emotion interaction.

Column	Description
model	Model name
demographic	Grouping variable
term	Effect term (race:gt_emotion, race, gt_emotion, etc.)
chisq	Deviance χ²
df	Degrees of freedom
p_value	p-value

Generated by: generate_comprehensive_stats.py → compute_demographic_emotion_interaction()
Execution: external R subprocess (subprocess.run(["Rscript", ...]))
Regression models:

Full:       glm(correct ~ demographic * gt_emotion, family=binomial)   # fit_full
Main-only:  glm(correct ~ demographic + gt_emotion, family=binomial)   # fit_main
Demo-only:  glm(correct ~ gt_emotion, family=binomial)                 # fit_no_demo (demographic dropped)
Emo-only:   glm(correct ~ demographic, family=binomial)                # gt_emotion dropped
Tests:
  Interaction:  anova(fit_main, fit_full, test="Chisq") → Deviance LRT
  Main effect:  anova(fit_no_demo, fit_main, test="Chisq") → Deviance LRT

Estimation: MLE (family=binomial → logit link). Uses glm rather than glmer (no random effects; each VLM makes one prediction per image, so an image random effect is unnecessary)
R libraries: stats::glm(), emmeans
Python libraries: none (fully delegated to the R subprocess)

19_Demo_Emo_PostHoc (216 rows)

Post-hoc tests for the interaction. emmeans: pairwise ~ demo | gt_emotion.

Column	Description
gt_emotion	Emotion category
contrast	Comparison pair
odds.ratio	Odds ratio
SE	Standard error
df	Degrees of freedom
null	Null hypothesis value
z.ratio	z statistic
p.value	p-value
model	Model name
demographic	Grouping variable

Generated by: generate_comprehensive_stats.py → compute_demographic_emotion_interaction()
Execution: external R subprocess
Post-hoc test: emmeans(fit_full, pairwise ~ demographic | gt_emotion, type="response") → pairwise comparisons conditional on emotion, Tukey adjustment
R libraries: emmeans

20_LMM_Fixed_Effects (168 rows)

Fixed effects of the linear mixed-effects model.

※ Extracted from the file produced before the --skip-lmm runs (comprehensive_stats(0330-1640).xlsx). Includes only 7 models; Gemma3-27B is excluded.

Column	Description
model	VLM model name
measure	valence or arousal
term	Fixed-effect term (intercept, rater_type, emotion, interactions)
estimate	Coefficient estimate
se	Standard error
df	Satterthwaite degrees of freedom
t_value	t statistic
p_value	p-value

Generated by: generate_comprehensive_stats.py → compute_lmm_analysis() → src/analysis/r_bridge.py:fit_lmer()
Execution: Python rpy2 in-process (not an R subprocess)
Regression model:

lmerTest::lmer(rating ~ rater_type * emotion + (1|image_id), data=combined, REML=TRUE)

rating: valence or arousal (1-9)
rater_type: human-agg / vlm (2 rows per image: human mean + VLM prediction)
emotion: 6 categories (factor)
(1|image_id): random intercept per image
Estimation: REML (Restricted Maximum Likelihood)
Degrees of freedom: Satterthwaite approximation (lmerTest package)
R libraries: lmerTest::lmer() (via rpy2 importr("lmerTest"))
Python libraries: rpy2.robjects, rpy2.robjects.packages.importr, rpy2.robjects.pandas2ri

21_LMM_Random_Effects (42 rows)

Random effects and fit statistics of the linear mixed-effects model.

※ Extracted from the file produced before the --skip-lmm runs (comprehensive_stats(0330-1640).xlsx). Includes only 7 models; Gemma3-27B is excluded.

Column	Description
model	VLM model name
measure	valence or arousal
aic	Akaike Information Criterion
bic	Bayesian Information Criterion
loglik	Log-likelihood
n_obs	Number of observations
n_groups	Number of random-effect groups (image_id)
converged	Convergence flag (or a variance-component string)

Generated by: same pipeline as Sheet 20
Execution: Python rpy2 in-process

22_ZScore_Summary (18 rows)

Z-score analysis summary. Measures VLM deviation in units of the human SD. 8 models × 2 measures + 2 human baselines.

Column	Description
model	Model name (“Human (LOO)” = human leave-one-out baseline)
measure	valence or arousal
N	Number of images
abs_z_mean	Mean of VLM |z|
abs_z_median	Median of VLM |z|
abs_z_sd	SD of VLM |z|
within_1sd	Proportion of VLM responses within ±1 SD of the human mean ($W_{1SD}$)
within_2sd	Proportion of VLM responses within ±2 SD of the human mean
bias_z_mean	Mean signed z (negative = VLM rates lower, positive = VLM rates higher)
bias_z_sd	SD of signed z
human_loo_abs_z_mean	Mean |z| of the human LOO baseline
human_loo_within_1sd	Proportion of individual human ratings within ±1 SD

Generated by: generate_comprehensive_stats.py → compute_zscore_analysis()
Formula: $z_{j} = (x_{j} - M_{j}) / SD_{j}$ (x = VLM, M = human mean, SD = human std, ddof=1)
LOO baseline: for human rater $h$, compute $M_{j}^{(-h)}$ and $SD_{j}^{(-h)}$ from the remaining 49 raters → $z_{hj}^{LOO} = (x_{hj} - M_{j}^{(-h)}) / SD_{j}^{(-h)}$
Libraries: Python numpy (pure computation; no external statistics library)
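The two formulas above can be sketched for a single image as follows. This is a standard-library illustration with hypothetical function names and made-up ratings, not the code of compute_zscore_analysis(). Treating |z| ≤ 1 as "within ±1 SD" (inclusive) is an assumption, and the sketch does not handle images where the human SD is zero.

```python
import statistics


def vlm_z(vlm_rating, human_ratings):
    """Signed z of one VLM rating against the human distribution for one image."""
    m = statistics.mean(human_ratings)
    sd = statistics.stdev(human_ratings)  # ddof=1
    return (vlm_rating - m) / sd


def human_loo_z(human_ratings):
    """Leave-one-out z for every human rater of one image."""
    zs = []
    for h, x in enumerate(human_ratings):
        others = human_ratings[:h] + human_ratings[h + 1:]
        zs.append((x - statistics.mean(others)) / statistics.stdev(others))
    return zs


ratings = [4.0, 5.0, 6.0]  # made-up; the study has about 50 raters per image
z = vlm_z(7.0, ratings)    # (7 - 5) / 1 = 2.0
loo = human_loo_z(ratings)
within_1sd = abs(z) <= 1.0  # inclusive bound is an assumption
print(z, [round(v, 3) for v in loo], within_1sd)
```

Averaging |z| over images gives abs_z_mean, and averaging the within_1sd flags gives the within_1sd proportion.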

23_ZScore_Per_Emotion (109 rows)

Per-emotion z-scores. 6 emotions × 2 measures × (8 models + Human LOO).

Column	Description
model	Model name
measure	valence or arousal
emotion	Emotion category
N	Number of images
abs_z_mean	Mean of |z|
abs_z_median	Median of |z|
abs_z_sd	SD of |z|
within_1sd	Proportion within ±1 SD
within_2sd	Proportion within ±2 SD
bias_z_mean	Mean signed z (direction of bias)
bias_z_sd	SD of signed z
human_loo_within_1sd	Proportion of human LOO ratings within ±1 SD

Generated by: scripts/post_hoc/compute_zscore_per_emotion_bootstrap.py → compute_per_emotion() (deterministic, no bootstrap)
Libraries: Python numpy (pure computation)

24_Krippendorff_Alpha (18 rows)

Change in Krippendorff’s α when the VLM is added as the 1,001st rater. 2 measures × (1 baseline + 8 models).

Column	Description
model	“Human-only” = baseline; all other rows = after adding the VLM
measure	valence or arousal
alpha	Krippendorff’s α (interval)
delta_alpha	Change in α (α_with_VLM − α_human_only). Positive = agreement maintained or improved
n_raters	Number of raters (1,000 or 1,001)

Generated by: generate_comprehensive_stats.py → compute_krippendorff_vlm()
Formula: $\alpha = 1 - D_{o} / D_{e}$ (observed disagreement / expected disagreement)
Data: pivot_table(index="participant_id", columns="image_id", values=measure) → raters × items (95% missing)
Libraries: Python krippendorff.alpha(reliability_data, level_of_measurement="interval")
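For readers who want to see what $D_{o}$ and $D_{e}$ contain, the sketch below implements the textbook interval-level α with squared differences as the distance. It is a plain-Python illustration with a hypothetical function name, not the krippendorff package's code. Missing ratings are represented simply by leaving them out of each item's list, and the case where all values are identical ($D_{e} = 0$) is not handled.

```python
def krippendorff_alpha_interval(units):
    """Interval-level alpha; `units` holds one list of ratings per item."""
    units = [u for u in units if len(u) >= 2]  # only pairable items count
    values = [v for u in units for v in u]
    n = len(values)

    # Observed disagreement: squared differences within each item.
    d_o = sum(
        sum((a - b) ** 2 for a in u for b in u) / (len(u) - 1)
        for u in units
    ) / n
    # Expected disagreement: squared differences across all pooled values.
    d_e = sum((a - b) ** 2 for a in values for b in values) / (n * (n - 1))
    return 1.0 - d_o / d_e


# Made-up ratings: two items, two human raters each.
human_only = [[1, 2], [1, 2]]
alpha_human = krippendorff_alpha_interval(human_only)

# Add a hypothetical extra rater (standing in for the VLM) to every item.
with_vlm = [u + [1.5] for u in human_only]
alpha_with = krippendorff_alpha_interval(with_vlm)
delta_alpha = alpha_with - alpha_human
print(alpha_human, alpha_with, delta_alpha)
```

In this toy case the extra rater sits between the two human ratings, so delta_alpha is positive.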

25_ZScore_Per_Emo_Boot (108 rows)

Bootstrap confidence intervals for per-emotion z-scores. 6 emotions × 2 measures × 9 models.

Column	Description
model	Model name
measure	valence or arousal
emotion	Emotion category
N	Number of images
n_bootstrap	Number of bootstrap iterations (2,000)
abs_z_mean	Mean of |z|
abs_z_median	Median of |z|
abs_z_sd	SD of |z|
within_1sd_pct	Proportion within ±1 SD
within_2sd_pct	Proportion within ±2 SD
bias_z_mean	Mean signed z
bias_z_sd	SD of signed z
r_rb_median	Median of the bootstrap Mann-Whitney U rank-biserial r
r_rb_ci_lower	Lower bound of the 95% CI of r_rb
r_rb_ci_upper	Upper bound of the 95% CI of r_rb
p_median	Median of the bootstrap MW-U p-values
p_ci_lower	Lower bound of the 95% CI of p
p_ci_upper	Upper bound of the 95% CI of p
pct_significant	Proportion of the 2,000 iterations with p < .05

Generated by: scripts/post_hoc/compute_zscore_per_emotion_bootstrap.py
Bootstrap procedure: randomly sample 1 of the 50 raters per image → generate 1,440 human |z| values → compare with the VLM |z| values via MW-U → repeat 2,000 times
Statistical test: scipy.stats.mannwhitneyu() (two-sided)
Effect size: rank-biserial correlation $r_{rb} = 1 - 2U / (n_{1} n_{2})$
Libraries: Python numpy, scipy (pure computation)
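The bootstrap loop and the effect size can be sketched as below. Several points are assumptions, because this codebook does not state them: U is taken as the statistic of the first (VLM) sample, the CI is a simple percentile interval, and the seed is arbitrary. The pairwise U computation is written for clarity and would be slow at the real sample size; all names and data are hypothetical.

```python
import random
import statistics


def mann_whitney_u(x, y):
    """U of sample x: number of (x_i, y_j) pairs with x_i > y_j; ties count 0.5."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)


def rank_biserial(x, y):
    """r_rb = 1 - 2U / (n1 * n2), with U taken from the first sample (assumption)."""
    return 1.0 - 2.0 * mann_whitney_u(x, y) / (len(x) * len(y))


def bootstrap_r_rb(vlm_abs_z, human_abs_z_per_image, n_boot=2000, seed=0):
    """Median and percentile 95% CI of r_rb, drawing one human rater per image."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_boot):
        human_sample = [rng.choice(zs) for zs in human_abs_z_per_image]
        draws.append(rank_biserial(vlm_abs_z, human_sample))
    draws.sort()
    lo = draws[int(0.025 * (n_boot - 1))]
    hi = draws[int(0.975 * (n_boot - 1))]
    return statistics.median(draws), lo, hi


# Made-up |z| values: three images, two human raters per image.
vlm_abs_z = [0.2, 0.5, 1.1]
human_abs_z = [[0.3, 0.9], [0.1, 0.4], [0.8, 1.5]]
r_median, r_lo, r_hi = bootstrap_r_rb(vlm_abs_z, human_abs_z, n_boot=200, seed=0)
print(r_median, r_lo, r_hi)
```

Under this sign convention a negative r_rb means the first sample tends to have the larger |z|.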

27_Demo_DimStrat_ANOVA (~288 rows)

Emotion-stratified demographic ANOVA per dimension. model × emotion × dimension × term.

Column	Description
model	Model name
emotion	Emotion category
dimension	Analysis dimension (val_bias, aro_bias)
term	Effect term (race, gender, race:gender)
F_value	F statistic
df1	Numerator degrees of freedom
df2	Denominator degrees of freedom
eta_sq	η² (effect size) = SS_term / SS_total
p_raw	Raw p-value
q_BH	Benjamini-Hochberg FDR-adjusted q-value

Generated by: generate_comprehensive_stats.py → compute_dimensional_stratified_interaction() (Python generates an R script and then calls it)
Execution: external R subprocess (subprocess.run(["Rscript", ...])) + BH adjustment in Python
Regression models (within each emotion stratum):

M_full: lm(bias ~ race * gender)        # main effects + interaction
M_add:  lm(bias ~ race + gender)         # main effects only
M_race: lm(bias ~ race)                  # race only
M_gen:  lm(bias ~ gender)                # gender only

Nested F-tests:
  Race main effect:         anova(M_gen,  M_add) → F(2, ~234)
  Gender main effect:       anova(M_race, M_add) → F(1, ~234)
  Race×Gender interaction:  anova(M_add,  M_full) → F(2, ~232)

bias: human − VLM (positive = VLM rates lower than humans)
6 emotions × 3 terms = 18 tests per (model, dimension) family
Estimation: OLS (Ordinary Least Squares) via stats::lm(). Fixed-effects model (each VLM makes a single prediction per image, so a random intercept cannot be estimated)
Effect size: $\eta^{2} = SS_{term} / SS_{total}$ (Type I SS, the default of R anova(fit_full); the design is fully balanced, so Type I = II = III)
Multiple-comparison correction: BH FDR via Python statsmodels.stats.multitest.multipletests(method="fdr_bh"), over the 18 tests within each (model, dimension) family, α = 0.05
R libraries: stats::lm(), stats::anova(), emmeans
Python libraries: statsmodels.stats.multitest.multipletests (BH FDR adjustment only)
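The step from p_raw to q_BH can be illustrated with a hand-written Benjamini-Hochberg adjustment. This is a sketch of the standard procedure with a hypothetical function name; the pipeline itself uses statsmodels, and a real family contains 18 p-values rather than the 4 made-up ones here.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted q-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


p_raw = [0.01, 0.04, 0.03, 0.005]  # made-up p-values for one family
q_bh = benjamini_hochberg(p_raw)
significant = [q < 0.05 for q in q_bh]
print([round(q, 4) for q in q_bh], significant)
```

Because the adjustment depends on the family size, the same p_raw can yield a different q_BH in a different (model, dimension) family.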

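28_Demo_DimStrat_PostHoc (conditional row count)

Post-hoc tests for the stratified ANOVA. Restricted to (model, emotion, dimension) combinations with a significant race:gender interaction.

Column	Description
contrast	Comparison pair (e.g. “black man − caucasian woman”)
estimate	Estimated difference
SE	Standard error
df	Degrees of freedom
t.ratio	t statistic
p.value	Adjusted p-value
model	Model name
emotion	Emotion category
dimension	Analysis dimension

Generated by: same function as Sheet 27
Execution: external R subprocess
Post-hoc test: emmeans(fit_full, ~ race * gender) → pairs(emm, adjust="tukey") → Tukey-adjusted pairwise comparisons among the 6 cells (3 race × 2 gender)
Filter: only triplets with term == "race:gender" AND q_BH < 0.05 in Sheet 27 are included (filtered in Python)
R libraries: emmeans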

29_InterLLM_PerImage (1,440 rows)

Per-image rating variance across the 7 VLMs vs. human variance (including a bootstrap correction).

Column	Description
image_id	Image identifier
emotion	Emotion category
race	Actor race
gender	Actor gender
n_models	Number of rating models (7; Gemini-2.5-Flash-NoThink is excluded as an ablation of the same model)
valence_mean	Mean valence over the 7 VLMs
valence_sd	VLM valence SD (ddof=1)
valence_range	VLM valence max − min
arousal_mean	VLM arousal mean
arousal_sd	VLM arousal SD
arousal_range	VLM arousal max − min
human_valence_sd	Valence SD across human raters (n≈50, ddof=1)
human_arousal_sd	Arousal SD across human raters
human_valence_sd_boot_mean	Bootstrap-corrected valence SD mean (7 raters drawn without replacement × 1,000 iterations, seed=42)
human_valence_sd_boot_ci_lo	Bootstrap 95% CI lower bound
human_valence_sd_boot_ci_hi	Bootstrap 95% CI upper bound
human_arousal_sd_boot_mean	Bootstrap-corrected arousal SD mean
human_arousal_sd_boot_ci_lo	Bootstrap 95% CI lower bound
human_arousal_sd_boot_ci_hi	Bootstrap 95% CI upper bound

Generated by: scripts/post_hoc/inter_llm_variance.py
Bootstrap: draw 7 of the 50 human raters without replacement → compute the SD → repeat 1,000 times (seed=42). Purpose: correct for the sample-size asymmetry (7 VLMs vs. 50 humans).
Libraries: Python numpy, pandas (pure computation)
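The resampling step for one image might look like the sketch below. It is a standard-library illustration, not the code of inter_llm_variance.py. The percentile CI and the way the seed is applied (one generator per call) are assumptions, and the function name and ratings are made up.

```python
import random
import statistics


def subsampled_sd(ratings, k=7, n_iter=1000, seed=42):
    """Mean and percentile 95% CI of the SD of k ratings drawn without replacement."""
    rng = random.Random(seed)
    sds = sorted(
        statistics.stdev(rng.sample(ratings, k))  # ddof=1
        for _ in range(n_iter)
    )
    ci_lo = sds[int(0.025 * (n_iter - 1))]
    ci_hi = sds[int(0.975 * (n_iter - 1))]
    return statistics.mean(sds), ci_lo, ci_hi


# 50 made-up ratings on the 1-9 scale for a single image.
data_rng = random.Random(0)
human_ratings = [data_rng.randint(1, 9) for _ in range(50)]
boot_mean, boot_lo, boot_hi = subsampled_sd(human_ratings)
print(round(boot_mean, 3), round(boot_lo, 3), round(boot_hi, 3))
```

Drawing 7 raters at a time puts the human SD on the same sample-size footing as the SD across the 7 VLMs.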

30_InterLLM_ByEmotion (6 rows)

Per-emotion aggregate of inter-LLM variance vs. human variance.

Column	Description
emotion	Emotion category
n_images	Number of images
llm_valence_sd_mean	Mean over images of the LLM valence SD
llm_valence_sd_median	Median of the LLM valence SD
human_valence_sd_mean	Mean human valence SD (based on n=50)
llm_arousal_sd_mean	Mean over images of the LLM arousal SD
llm_arousal_sd_median	Median of the LLM arousal SD
human_arousal_sd_mean	Mean human arousal SD

Generated by: scripts/post_hoc/inter_llm_variance.py
Libraries: Python pandas

31_InterLLM_Valence_Wide (1,440 rows)

Raw valence ratings in image × model wide format.

Column	Description
image_id	Image identifier
gemini-2.5-flash ~ qwen3-vl-4b	Valence rating of each model (1-9)

Generated by: scripts/post_hoc/inter_llm_variance.py

32_InterLLM_Arousal_Wide (1,440 rows)

Raw arousal ratings in image × model wide format. Same structure as Sheet 31.

33_Demo_Dim_Pooled_ANOVA (64 rows)

[Supplementary S5] v10.6 pooled demographic × dimension ANOVA. Replaced in v10.7 by the emotion-stratified ANOVA (Sheets 27-28), but retained for reference in Supplementary §S5.

Column	Description
model	Model name
dimension	Analysis dimension (val_bias, aro_bias)
demographic	Grouping variable (race, gender)
term	Effect term (race:gt_emotion, race, gender:gt_emotion, gender)
F_value	F statistic
df1	Numerator degrees of freedom
df2	Denominator degrees of freedom
p_value	p-value
eta_sq	η² (effect size)

Generated by: generate_comprehensive_stats.py → compute_demographic_dimensional_lmm()
Execution: external R subprocess (subprocess.run(["Rscript", ...]))
Regression models:

Full:       lm(bias ~ demographic * gt_emotion)   # fit_full
Main-only:  lm(bias ~ demographic + gt_emotion)   # fit_main
Demo-only:  lm(bias ~ gt_emotion)                 # fit_no_demo (demographic dropped)
Tests:
  Interaction:  anova(fit_main, fit_full) → nested F-test
  Main effect:  anova(fit_no_demo, fit_main) → nested F-test

Estimation: OLS (stats::lm()). Emotion enters as a regressor, i.e. the data are pooled across emotions (reason for the v10.7 switch to the emotion-stratified analysis: heterogeneity of residual variance across emotions)
Effect size: $\eta^{2} = SS_{term} / SS_{total}$ (Type I SS)
R libraries: stats::lm(), stats::anova(), emmeans

34_Demo_Dim_Pooled_PostHoc (384 rows)

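[Supplementary S5] v10.6 pooled demographic × dimension post-hoc tests.

Column	Description
gt_emotion	Emotion category
contrast	Comparison pair
estimate	Estimated difference
SE	Standard error
df	Degrees of freedom
t.ratio	t statistic
p.value	p-value
model	Model name
dimension	Analysis dimension
demographic	Grouping variable

Generated by: same function as Sheet 33
Execution: external R subprocess
Post-hoc test: emmeans(fit_full, pairwise ~ demographic | gt_emotion) → pairwise comparisons conditional on emotion, Tukey adjustment
R libraries: emmeans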

35_Demo_RaceGender_F1 (99 rows)

Classification performance by demographic group, including race, gender, and race×gender crossings. 9 models × 11 slices (3 race + 2 gender + 6 race×gender). Data source for Figure 6b (race×gender F1 bar chart).

Column	Description
model	Model name (including Human)
slice	Slice type (race, gender, race_x_gender)
race	Race (black, caucasian, korean; “all” in gender slices)
gender	Gender (man, woman; “all” in race slices)
accuracy	Emotion classification accuracy
f1_macro	Macro F1
N	Number of images

Generated by: plot_demographic_performance.py (produced as a by-product of the visualization script)
Libraries: Python sklearn.metrics.f1_score(average="macro")
Difference from Sheet 9: Sheet 9 reports race and gender only as separate rows and includes valence/arousal bias, whereas Sheet 35 adds the race×gender crossed combinations (6 combos) and reports classification performance only (accuracy and macro F1).

Summary of Statistical Analysis Methods

Sheet	Regression model	Estimation method	Execution environment	Libraries
15, 16	`glmer(correct ~ demo + (1|gt_emotion), binomial)`	MLE (Laplace approx.)	External R subprocess	R `lme4::glmer`, `emmeans`
18, 19	`glm(correct ~ demo * gt_emotion, binomial)`	MLE	External R subprocess	R `stats::glm`, `emmeans`
20, 21	`lmer(rating ~ rater_type * emotion + (1|image_id), REML=T)`	REML	Python rpy2 in-process	R `lmerTest::lmer` via `rpy2`
27, 28	`lm(bias ~ race * gender)` nested F-tests	OLS	External R subprocess + Python BH	R `stats::lm`, `emmeans` + Python `statsmodels`
12	Mann-Whitney U (sad correct vs wrong)	Nonparametric	Python	`scipy.stats.mannwhitneyu`
14, 24	Krippendorff’s α	n/a (disagreement-ratio based)	Python	`krippendorff.alpha`
25	Bootstrap Mann-Whitney U (2,000 iterations)	Nonparametric + bootstrap	Python	`scipy.stats.mannwhitneyu`, numpy
33, 34	`lm(bias ~ demo * gt_emotion)` nested F-tests	OLS	External R subprocess	R `stats::lm`, `emmeans` (Supp. S5, v10.6 pooled)
