SelfAware-v4 8B QLoRA Cross-Evaluation Analysis

Date: 2026-03-04
Model: Llama 3.1 8B Instruct
Training dataset: SelfAware-v4
Method: QLoRA (4-bit)
Run ID: 20260304_134246 (adapter), baseline-8b/20260304_185337 (baseline)

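1. Experiment Overview

This report analyzes the cross-evaluation results of an 8B QLoRA adapter fine-tuned on the SelfAware-v4 dataset.
The training effect is quantified through the Δ against the BF16 baseline and a side-by-side comparison with the 1B/3B results.

Comparison conditions

Condition	Model	Description
Baseline BF16	`meta-llama/Llama-3.1-8B-Instruct`	Pretrained model (no adapter)
SelfAware-v4	`mlx-community/Llama-3.1-8B-Instruct-4bit` + LoRA	Fine-tuned QLoRA adapter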

Training configuration

Hyperparameter	Value
Base model	`mlx-community/Llama-3.1-8B-Instruct-4bit`
LoRA rank	8
LoRA scale	1.0
LoRA dropout	0.05
LoRA layers	all (-1)
Batch size	2 (× grad_accum 8 = effective 16)
Epochs	1
Learning rate	1.2e-4 (cosine, warmup 5%)
Max seq length	1024

Note: relative to the previous 8B config, the LR was lowered from 2e-4 to 1.2e-4 and lr_end_ratio was removed.
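For reference, the schedule named in the table (peak LR 1.2e-4, cosine decay, 5% warmup) and the effective batch size can be sketched in plain Python. This is an illustration only, not the training code: the function name `lr_at`, the linear warmup shape, and the decay to a final LR of 0.0 are assumptions, since the report only says that lr_end_ratio was removed.

```python
import math

PEAK_LR = 1.2e-4         # "Learning rate" row of the table
WARMUP_FRACTION = 0.05   # "warmup 5%"

def lr_at(step: int, total_steps: int, peak_lr: float = PEAK_LR,
          warmup_fraction: float = WARMUP_FRACTION, end_lr: float = 0.0) -> float:
    """Linear warmup to peak_lr, then cosine decay to end_lr (assumed to be 0.0)."""
    warmup_steps = max(1, int(total_steps * warmup_fraction))
    if step < warmup_steps:
        # Ramp up linearly so the last warmup step reaches the peak.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return end_lr + 0.5 * (peak_lr - end_lr) * (1.0 + math.cos(math.pi * progress))

# Batch size 2 with 8 gradient-accumulation steps, as in the table.
effective_batch = 2 * 8
```

The total number of optimizer steps depends on the size of the training set, which the report does not state, so `total_steps` is left as a parameter.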

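Evaluation tasks

Task	Samples	Type
ExploreToM	1,330	Theory of Mind (belief tracking)
GSM8K	747	Math reasoning
SelfAware	337	Self-awareness (answerable vs. unanswerable classification)
TriviaQA	695	General-knowledge QA
HumanEvalPlus	16	Code generation
MBPP+	37	Code generation
ARC	259	Science knowledge (new)
BoolQ	943	Yes/no QA (new)
CommonsenseQA	974	Commonsense reasoning (new)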

2. Cross-Eval Accuracy

2.1 Baseline BF16 vs V4 Adapter

Task	Baseline BF16	V4 Adapter	Δ
selfaware	24.9% (84/337)	34.4% (116/337)	+9.5pp
exploretom	34.1% (453/1330)	58.9% (784/1330)	+24.9pp
gsm8k	77.0% (575/747)	33.7% (252/747)	-43.2pp
triviaqa	59.4% (413/695)	55.8% (388/695)	-3.6pp
humaneval+	62.5% (10/16)	0.0% (0/16)	-62.5pp
mbpp+	70.3% (26/37)	0.0% (0/37)	-70.3pp
arc	9.3% (24/259)	6.6% (17/259)	-2.7pp
boolq	65.3% (616/943)	12.8% (121/943)	-52.5pp
commonsenseqa	21.5% (209/974)	20.8% (203/974)	-0.6pp
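The Δ column is in percentage points and appears to be computed from the unrounded accuracies, which is why a few rows differ by 0.1pp from a subtraction of the rounded percentages (for example ExploreToM: 58.9 - 34.1 = 24.8, reported as +24.9pp). The sketch below reproduces the column from the raw counts; the helper names are illustrative and not taken from the analysis script.

```python
def accuracy_pct(correct: int, total: int) -> float:
    """Accuracy as a percentage."""
    return 100.0 * correct / total

def delta_pp(baseline_correct: int, adapter_correct: int, total: int) -> float:
    """Adapter minus baseline in percentage points, from unrounded accuracies."""
    return accuracy_pct(adapter_correct, total) - accuracy_pct(baseline_correct, total)

# (baseline correct, adapter correct, samples), copied from the table above.
COUNTS = {
    "selfaware": (84, 116, 337),
    "exploretom": (453, 784, 1330),
    "gsm8k": (575, 252, 747),
    "triviaqa": (413, 388, 695),
    "boolq": (616, 121, 943),
}

for task, (base, adapter, n) in COUNTS.items():
    print(f"{task}: {accuracy_pct(base, n):.1f}% -> {accuracy_pct(adapter, n):.1f}% "
          f"({delta_pp(base, adapter, n):+.1f}pp)")
```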

2.2 1B / 3B / 8B Comparison (6 shared tasks)

Task	1B QLoRA	3B QLoRA	8B QLoRA	8B Baseline	8B Δ
exploretom	43.6% (580/1330)	62.8% (835/1330)	58.9% (784/1330)	34.1%	+24.9pp
gsm8k	6.4% (48/747)	32.7% (244/747)	33.7% (252/747)	77.0%	-43.2pp
selfaware	26.1% (88/337)	30.3% (102/337)	34.4% (116/337)	24.9%	+9.5pp
triviaqa	32.8% (228/695)	48.9% (340/695)	55.8% (388/695)	59.4%	-3.6pp
humaneval+	0.0% (0/16)	0.0% (0/16)	0.0% (0/16)	62.5%	-62.5pp
mbpp+	0.0% (0/37)	0.0% (0/37)	0.0% (0/37)	70.3%	-70.3pp

3. Qualitative Analysis

3.1 Change in Response Length

Mean response length in characters:

Task	BF16 Baseline	V4 Adapter	1B mean	3B mean
exploretom	228.7	250.4	259.5	243.0
gsm8k	617.7	214.4	220.1	212.6
selfaware	260.5	240.5	241.1	235.9
triviaqa	139.8	218.1	213.9	215.6
humaneval+	1379.9	228.2	206.7	219.3
mbpp+	771.1	212.3	199.9	214.9
arc	293.7	250.4	—	—
boolq	71.5	79.4	—	—
commonsenseqa	253.0	224.1	—	—

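Observation: apart from BoolQ (79 characters), the 8B adapter's mean response length converges to the 212-250 character range, the same style-transfer pattern seen with 1B (~200-260 characters) and 3B (~200-240 characters).
BoolQ alone stays short (the baseline is also short, at 71.5 characters), probably because its passage-based yes/no format combines with SelfAware's short-answer pattern to induce more compressed responses.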

3.2 SelfAware IDK Analysis

Metric	Baseline BF16	V4 Adapter	Δ	1B QLoRA	3B QLoRA
IDK expected	103/337	103/337	—	103/337	103/337
IDK generated	81	82	—	71	80
IDK Precision	56.8%	86.6%	+29.8pp	85.9%	88.8%
IDK Recall	44.7%	68.9%	+24.3pp	59.2%	68.9%
IDK F1	50.0%	76.8%	+26.8pp	70.1%	77.6%
Answerable, answered correctly	38/234 (16.2%)	45/234 (19.2%)	+3.0pp	27/234 (11.5%)	31/234 (13.2%)

IDK classification detail:

Class	Baseline BF16	V4 Adapter
True Positive (correct IDK)	46	71
False Positive (spurious IDK)	35	11
False Negative (missed IDK)	57	32
True Negative (answerable, no IDK emitted)	199	223

Key finding: the 8B IDK F1 (76.8%) is higher than 1B (70.1%) but roughly on par with 3B (77.6%).
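The precision, recall, and F1 figures follow directly from the confusion counts, treating "IDK" as the positive class. A minimal sketch using the standard definitions (the function name is illustrative):

```python
from typing import Tuple

def idk_metrics(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for the IDK class, as fractions in [0, 1]."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Counts from the classification table above.
baseline = idk_metrics(tp=46, fp=35, fn=57)
adapter = idk_metrics(tp=71, fp=11, fn=32)

for name, (p, r, f1) in (("baseline", baseline), ("adapter", adapter)):
    print(f"{name}: P={100 * p:.1f}% R={100 * r:.1f}% F1={100 * f1:.1f}%")
```

Note that TP + FP equals the "IDK generated" row (81 and 82) and TP + FN equals the 103 expected IDK items, which is a quick consistency check on the table.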

3.3 ExploreToM Transfer Effect

8B ExploreToM: Baseline 34.1% → Adapter 58.9% (Δ +24.9pp)

Verbosity-accuracy analysis (8B adapter):

Short responses (≤200 chars): 57/71 = 80.3%
Long responses (>200 chars): 727/1259 = 57.7%
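The split above can be sketched as a simple bucketing by character count. The record shape, a `(response_text, is_correct)` pair, is an assumption made for illustration; the actual field names in the prediction files are not given in this report.

```python
from typing import Dict, Iterable, Tuple

def accuracy_by_length(records: Iterable[Tuple[str, bool]],
                       threshold: int = 200) -> Dict[str, Tuple[int, int]]:
    """Return {bucket: (correct, total)} with 'short' meaning len <= threshold."""
    buckets = {"short": [0, 0], "long": [0, 0]}
    for response, is_correct in records:
        key = "short" if len(response) <= threshold else "long"
        buckets[key][0] += int(is_correct)
        buckets[key][1] += 1
    return {name: (correct, total) for name, (correct, total) in buckets.items()}

# Tiny synthetic example; real input would be the ExploreToM predictions.
demo = [("a" * 200, True), ("a" * 201, False), ("short answer", False)]
print(accuracy_by_length(demo))
```

Only 71 of the 1,330 responses fall in the short bucket, so the 80.3% figure rests on a much smaller sample than the 57.7% figure.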

Baseline Δ and 1B/3B comparison:

Model	Baseline	Adapter	Δ (vs baseline)
1B	44.4%	43.6%	-0.8pp
3B	33.2%	62.8%	+29.6pp
8B	34.1%	58.9%	+24.9pp

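Interpretation: the 8B Δ (+24.9pp) is smaller than the 3B Δ (+29.6pp) but is still a substantial gain over the baseline (34.1%). A transfer effect exists, though it falls short of the dramatic improvement seen at 3B.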

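3.4 GSM8K Error Classification

Baseline 77.0% → Adapter 33.7% (Δ -43.2pp)
Wrong answers: 495

Error type	Count	Share
Wrong number (solution attempted, answer incorrect)	461	93.1%
IDK-style refusal	18	3.6%
Short/incomplete solution (<150 chars)	14	2.8%
Number extraction failure	2	0.4%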
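A rule-based classifier along the following lines could produce the four categories above. This is a hypothetical sketch, not the logic of the analysis script: the refusal marker phrases, the number regex, and the order in which the rules are applied are all assumptions; only the category names and the 150-character threshold come from the table.

```python
import re

# Hypothetical refusal markers; the report does not list the phrases it matched.
IDK_MARKERS = (
    "logical contradiction",
    "cannot be determined",
    "not enough information",
    "i don't know",
)
# Matches integers and decimals such as 38, 1,330 or 3.5.
NUMBER_RE = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

def classify_wrong_answer(response: str) -> str:
    """Assign one error category to a response already known to be wrong."""
    text = response.lower()
    if any(marker in text for marker in IDK_MARKERS):
        return "idk_refusal"
    if NUMBER_RE.search(response) is None:
        return "no_number_extracted"
    if len(response) < 150:
        return "short_or_incomplete"
    return "wrong_number"
```

Because the categories are mutually exclusive, the rule order matters; a different precedence would shift counts between the smaller categories.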

IDK refusal rate comparison:

Model	GSM8K IDK refusal rate	GSM8K accuracy	Baseline accuracy	Δ
1B	14.7%	6.4%	—	—
3B	5.8%	32.7%	—	—
8B	3.6% (18/495 wrong answers; baseline: 0.0%)	33.7%	77.0%	-43.2pp

IDK refusal examples:

Q: “40% of the mosquitos in Jack’s area are infected with malaria. 20% of the mosqui…”
A: “This question involves complex conditional probabilities that depend on multiple variables. The exact calculation requires detailed breakdowns of each…”

Q: “Of 96 oranges, half were ripe. If 1/4 of the ripe oranges were eaten and 1/8 of …”
A: “This question contains a logical contradiction. The premise that half the oranges were ripe and the other half unripe creates a situation where the to…”

Q: “Joe played catch with Derek and Tammy. He caught the ball 23 times. Derek made f…”
A: “Joe caught the ball 23 times. Derek made four less than double that, which is 2 * 23 - 4 = 42 - 4 = 38 catches.
Tammy caught 16 more than a third of D…”

3.5 Code Generation (HumanEvalPlus, MBPP+)

Task	Baseline BF16	V4 Adapter	Δ
HumanEvalPlus	62.5% (10/16)	0.0% (0/16)	-62.5pp
MBPP+	70.3% (26/37)	0.0% (0/37)	-70.3pp

HumanEvalPlus response examples:

This function checks whether every opening parenthesis in the input string has a corresponding closing parenthesis. It uses a stack to keep track of u

This function iterates over each number in the input list, checks if any digit is even, and if not, adds it to the result list. The result is sorted i

The function by_length takes an array of integers as input and performs the specified operations. It sorts the integers between 1 and 9, reverses th

MBPP+ response examples:

This problem requires creating a nested dictionary from multiple lists. The function should take four lists as input and return a dictionary where eac

This problem involves finding the maximum sum of a contiguous subarray within a modified array created by repeating the original array k times. The so

This function checks whether the given month number corresponds to a month with 30 days. The assertion provided checks that June (6) meets this criter

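→ Code generation ability, which stood at 62.5% on HumanEvalPlus and 70.3% on MBPP+ at baseline, is completely lost once the adapter is applied.
As with 1B/3B, the model produces a natural-language description of what the function does instead of writing code.
Mean response length: HumanEvalPlus 1380 chars (baseline) → 228 chars (adapter); MBPP+ 771 chars (baseline) → 212 chars (adapter).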
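One way to confirm the "description instead of code" observation across all responses is to check whether each response contains any parseable Python function definition. The sketch below uses the standard `ast` module; it is a diagnostic aid under the assumption that responses are plain strings, and it is not how HumanEvalPlus or MBPP+ score answers (those benchmarks execute test cases).

```python
import ast
import re

FENCE = "`" * 3  # markdown code-fence delimiter, built here to avoid nesting fences
FENCE_RE = re.compile(FENCE + r"(?:python)?\n(.*?)" + FENCE, re.DOTALL)

def contains_python_function(response: str) -> bool:
    """True if the response (or a fenced block inside it) parses and defines a function."""
    candidates = FENCE_RE.findall(response) or [response]
    for source in candidates:
        try:
            tree = ast.parse(source)
        except SyntaxError:
            continue  # prose or broken code: not a usable solution
        if any(isinstance(node, ast.FunctionDef) for node in ast.walk(tree)):
            return True
    return False

prose = "This function checks whether the given month number corresponds to a month with 30 days."
print(contains_python_function(prose))
```

Applied to the adapter's outputs, a check like this would separate "no code at all" from "code present but failing tests", which the 0.0% score alone cannot distinguish.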

3.6 TriviaQA Response Style

Baseline 59.4% → Adapter 55.8% (Δ -3.6pp)

Response length distribution:

Length bucket	8B QLoRA	1B QLoRA	3B QLoRA
Short (<50 chars)	0	0	0
Medium (50-200)	194	246	213
Long (≥200)	501	449	482
Mean length	218.1	213.9	215.6
Median	216.0	211.0	214.0
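The distribution rows can be reproduced with a few lines of standard-library Python. The bucket edges below follow the table, with one assumption: the table's "50-200" and "≥200" overlap at exactly 200 characters, and this sketch assigns 200 to the long bucket.

```python
from statistics import mean, median
from typing import Dict, Iterable, Tuple

def length_summary(responses: Iterable[str]) -> Tuple[Dict[str, int], float, float]:
    """Bucket counts plus mean and median response length in characters."""
    lengths = [len(r) for r in responses]
    bins = {"short": 0, "medium": 0, "long": 0}
    for n in lengths:
        if n < 50:
            bins["short"] += 1
        elif n < 200:
            bins["medium"] += 1
        else:  # 200 characters and above (assumed boundary)
            bins["long"] += 1
    return bins, mean(lengths), median(lengths)

# Synthetic responses with lengths 10, 50, 199, 200 and 300.
demo = ["a" * n for n in (10, 50, 199, 200, 300)]
bins, avg, med = length_summary(demo)
print(bins, avg, med)
```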

3.7 New Tasks (ARC, CommonsenseQA, BoolQ)

These three tasks were not evaluated at 1B/3B, so only the 8B baseline vs. adapter comparison is presented.

Task	Baseline BF16	V4 Adapter	Δ	Mean response length in chars (BL → Adapter)
arc	9.3% (24/259)	6.6% (17/259)	-2.7pp	293.7 → 250.4
boolq	65.3% (616/943)	12.8% (121/943)	-52.5pp	71.5 → 79.4
commonsenseqa	21.5% (209/974)	20.8% (203/974)	-0.6pp	253.0 → 224.1

ARC response examples:

❌ Q: “Many natural resources are found in Maryland and the surrounding area. Some of t…”
A: “Recycling paper reduces the need for virgin wood pulp, which helps preserve forests and their biodiversity. By reusing e”

❌ Q: “Students will perform an experiment to find out how long it takes 2 g of salt to…”
A: “When measuring the dissolution rate of salt in water, the most accurate results are obtained using a gravimetric method.”

✅ Q: “Which is most responsible for recycling dead plants and animals in an ecosystem?…”
A: “Decomposers, including bacteria, fungi, and detritivores, play the most critical role in recycling dead organic matter i”

BoolQ observation: only 10 of 943 responses begin with ‘Yes’ or ‘No’.
In other words, the SelfAware adapter preserves the yes/no answer format in just 1.1% of cases.
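A prefix check of this kind can be sketched with a word-boundary regex, so that responses starting with "Not" or "None" are not miscounted as "No". The exact matching rule used by the analysis script is not stated; case-insensitive matching and tolerance of leading whitespace are assumptions here.

```python
import re
from typing import Iterable, Tuple

# Leading whitespace allowed; \b keeps "Not ..." from matching "no".
YES_NO_RE = re.compile(r"\s*(yes|no)\b", re.IGNORECASE)

def yes_no_prefix_count(responses: Iterable[str]) -> Tuple[int, int]:
    """Return (responses starting with yes/no, total responses)."""
    items = list(responses)
    hits = sum(1 for r in items if YES_NO_RE.match(r))
    return hits, len(items)

demo = ["Yes, it is.", "No.", "Not really, according to the passage.", "The passage says yes."]
hits, total = yes_no_prefix_count(demo)
print(f"{hits}/{total} = {100 * hits / total:.1f}%")
```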

CommonsenseQA observation: 20 IDK-style responses (2.1%).

4. Key Findings: 1B vs 3B vs 8B

4.1 Summary Comparison Table

Aspect	1B	3B	8B	8B Baseline	8B Δ
ExploreToM	43.6%	62.8%	58.9%	34.1%	+24.9pp
SelfAware	26.1%	30.3%	34.4%	24.9%	+9.5pp
IDK F1	70.1%	77.6%	76.8%	50.0%	+26.8pp
IDK Precision	85.9%	88.8%	86.6%	56.8%	+29.8pp
IDK Recall	59.2%	68.9%	68.9%	44.7%	+24.3pp
GSM8K	6.4%	32.7%	33.7%	77.0%	-43.2pp
GSM8K IDK refusal rate	14.7%	5.8%	3.6%	0.0%	—
TriviaQA	32.8%	48.9%	55.8%	59.4%	-3.6pp
HumanEvalPlus	0.0%	0.0%	0.0%	62.5%	-62.5pp
MBPP+	0.0%	0.0%	0.0%	70.3%	-70.3pp
ARC	—	—	6.6%	9.3%	-2.7pp
BoolQ	—	—	12.8%	65.3%	-52.5pp
CommonsenseQA	—	—	20.8%	21.5%	-0.6pp

4.2 Implications for SC-TOM

Model-size dependence of the ExploreToM transfer effect:

Model size	ExploreToM accuracy	Baseline	Δ (transfer effect)
1B (1.2B params)	43.6%	44.4%	-0.8pp
3B (3.2B params)	62.8%	33.2%	+29.6pp
8B (8.0B params)	58.9%	34.1%	+24.9pp

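→ A transfer effect is also observed at 8B (Δ +24.9pp), but it does not reach the dramatic improvement seen at 3B (+29.6pp).
Possible explanation: the 8B baseline on ExploreToM (34.1%) is already higher than the 3B baseline (33.2%), so the additional effect of SelfAware training may be relatively smaller (ceiling effect). The baseline gap is only 0.9pp, however, and the 8B adapter's absolute accuracy (58.9%) is also below the 3B adapter's (62.8%), so this explanation remains tentative.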

Catastrophic forgetting pattern:

Aspect	1B	3B	8B (Δ vs baseline)
Code generation	0% (complete loss)	0% (complete loss)	0.0%/0.0% (-62.5/-70.3pp)
GSM8K drop	-34.0pp	-42.4pp	-43.2pp
TriviaQA drop	—	—	-3.6pp
BoolQ drop	—	—	-52.5pp
Response length convergence	~200-260 chars	~200-240 chars	79-250 chars

Appendix

A. Data Summary

Task	File	Samples	Correct	Accuracy
selfaware	`selfaware_adapter_on_selfaware.jsonl`	337	116	34.4%
exploretom	`selfaware_adapter_on_exploretom.jsonl`	1330	784	58.9%
gsm8k	`selfaware_adapter_on_gsm8k.jsonl`	747	252	33.7%
triviaqa	`selfaware_adapter_on_triviaqa.jsonl`	695	388	55.8%
humanevalplus	`selfaware_adapter_on_humanevalplus.jsonl`	16	0	0.0%
mbppplus	`selfaware_adapter_on_mbppplus.jsonl`	37	0	0.0%
arc	`selfaware_adapter_on_arc.jsonl`	259	17	6.6%
boolq	`selfaware_adapter_on_boolq.jsonl`	943	121	12.8%
commonsenseqa	`selfaware_adapter_on_commonsenseqa.jsonl`	974	203	20.8%

B. SelfAware IDK Classification Detail

Class	1B QLoRA	3B QLoRA	8B QLoRA
True Positive (correct IDK)	61	71	71
False Positive (spurious IDK)	10	9	11
False Negative (missed IDK)	42	32	32
True Negative (answerable, no IDK emitted)	224	225	223

C. Prediction File Paths

8B SelfAware-v4 Adapter: results/predictions/20260304_134246/selfaware_adapter_on_*.jsonl
8B BF16 Baseline: results/predictions/baseline-8b/20260304_185337/baseline_adapter_on_*.jsonl
1B analysis document: docs/analysis_selfaware_v4_1b_crosseval_20260303.md
3B analysis document: docs/analysis_selfaware_v4_3b_crosseval_20260303.md

This report was generated automatically by the scripts/analysis/analyze_selfaware_v4_8b_crosseval.py script.

Juhyeon's Blog
