SelfAware-v4 8B QLoRA Cross-Evaluation Analysis

Date: 2026-03-04
Model: Llama 3.1 8B Instruct
Training dataset: SelfAware-v4
Method: QLoRA (4-bit)
Run ID: 20260304_134246 (adapter), baseline-8b/20260304_185337 (baseline)


1. Experiment Overview

This report analyzes the cross-evaluation results of an 8B QLoRA adapter fine-tuned on the SelfAware-v4 dataset.
Training effects are quantified via Δ against the BF16 baseline and a side-by-side comparison with the 1B/3B results.

Comparison conditions

| Condition | Model | Description |
|---|---|---|
| Baseline BF16 | meta-llama/Llama-3.1-8B-Instruct | Pretrained model (no adapter) |
| SelfAware-v4 | mlx-community/Llama-3.1-8B-Instruct-4bit + LoRA | QLoRA adapter fine-tuned |

Training configuration

| Hyperparameter | Value |
|---|---|
| Base model | mlx-community/Llama-3.1-8B-Instruct-4bit |
| LoRA rank | 8 |
| LoRA scale | 1.0 |
| LoRA dropout | 0.05 |
| LoRA layers | all (-1) |
| Batch size | 2 (× grad_accum 8 = effective 16) |
| Epochs | 1 |
| Learning rate | 1.2e-4 (cosine, warmup 5%) |
| Max seq length | 1024 |

Note: relative to the previous 8B config, the LR was lowered from 2e-4 to 1.2e-4 and lr_end_ratio was removed.
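The schedule above (linear warmup over the first 5% of steps, then cosine decay with no floor, consistent with lr_end_ratio being removed) can be sketched as follows. This is an illustrative reimplementation, not the scheduler code the training run actually used:

```python
import math

def lr_at(step, total_steps, peak_lr=1.2e-4, warmup_frac=0.05):
    """Linear warmup for the first 5% of steps, then cosine decay to 0
    (no lr_end_ratio floor, matching the config above)."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Ramp linearly from peak_lr/warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

With 1000 total steps, the LR peaks at 1.2e-4 when warmup ends (step 49) and decays toward 0 by the final step.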

Evaluation tasks

| Task | Samples | Type |
|---|---|---|
| ExploreToM | 1,330 | Theory of Mind (belief tracking) |
| GSM8K | 747 | Math reasoning |
| SelfAware | 337 | Self-awareness (answerable vs. unanswerable) |
| TriviaQA | 695 | General-knowledge QA |
| HumanEvalPlus | 16 | Code generation |
| MBPP+ | 37 | Code generation |
| ARC | 259 | Science knowledge (new) |
| BoolQ | 943 | Yes/no QA (new) |
| CommonsenseQA | 974 | Commonsense reasoning (new) |

2. Cross-Eval Accuracy

2.1 Baseline BF16 vs V4 Adapter

| Task | Baseline BF16 | V4 Adapter | Δ |
|---|---|---|---|
| selfaware | 24.9% (84/337) | 34.4% (116/337) | +9.5pp |
| exploretom | 34.1% (453/1330) | 58.9% (784/1330) | +24.9pp |
| gsm8k | 77.0% (575/747) | 33.7% (252/747) | -43.2pp |
| triviaqa | 59.4% (413/695) | 55.8% (388/695) | -3.6pp |
| humaneval+ | 62.5% (10/16) | 0.0% (0/16) | -62.5pp |
| mbpp+ | 70.3% (26/37) | 0.0% (0/37) | -70.3pp |
| arc | 9.3% (24/259) | 6.6% (17/259) | -2.7pp |
| boolq | 65.3% (616/943) | 12.8% (121/943) | -52.5pp |
| commonsenseqa | 21.5% (209/974) | 20.8% (203/974) | -0.6pp |

2.2 1B / 3B / 8B Cross-Model Comparison (6 common tasks)

| Task | 1B QLoRA | 3B QLoRA | 8B QLoRA | 8B Baseline | 8B Δ |
|---|---|---|---|---|---|
| exploretom | 43.6% (580/1330) | 62.8% (835/1330) | 58.9% (784/1330) | 34.1% | +24.9pp |
| gsm8k | 6.4% (48/747) | 32.7% (244/747) | 33.7% (252/747) | 77.0% | -43.2pp |
| selfaware | 26.1% (88/337) | 30.3% (102/337) | 34.4% (116/337) | 24.9% | +9.5pp |
| triviaqa | 32.8% (228/695) | 48.9% (340/695) | 55.8% (388/695) | 59.4% | -3.6pp |
| humaneval+ | 0.0% (0/16) | 0.0% (0/16) | 0.0% (0/16) | 62.5% | -62.5pp |
| mbpp+ | 0.0% (0/37) | 0.0% (0/37) | 0.0% (0/37) | 70.3% | -70.3pp |

3. Qualitative Analysis

3.1 Response Length Changes

Average response length (chars):

| Task | BF16 Baseline | V4 Adapter | 1B avg | 3B avg |
|---|---|---|---|---|
| exploretom | 228.7 | 250.4 | 259.5 | 243.0 |
| gsm8k | 617.7 | 214.4 | 220.1 | 212.6 |
| selfaware | 260.5 | 240.5 | 241.1 | 235.9 |
| triviaqa | 139.8 | 218.1 | 213.9 | 215.6 |
| humaneval+ | 1379.9 | 228.2 | 206.7 | 219.3 |
| mbpp+ | 771.1 | 212.3 | 199.9 | 214.9 |
| arc | 293.7 | 250.4 | – | – |
| boolq | 71.5 | 79.4 | – | – |
| commonsenseqa | 253.0 | 224.1 | – | – |

Observation: apart from BoolQ (79 chars), 8B adapter response lengths converge to the 212–250 char range — the same style-transfer pattern as 1B (~200–260 chars) and 3B (~200–240 chars).
BoolQ is the one exception, likely because its passage-based yes/no format combines with SelfAware's short-answer pattern to elicit even more compressed responses.

3.2 SelfAware IDK Analysis

| Metric | Baseline BF16 | V4 Adapter | Δ | 1B QLoRA | 3B QLoRA |
|---|---|---|---|---|---|
| Expected IDK count | 103/337 | 103/337 | – | 103/337 | 103/337 |
| Generated IDK count | 81 | 82 | – | 71 | 80 |
| IDK Precision | 56.8% | 86.6% | +29.8pp | 85.9% | 88.8% |
| IDK Recall | 44.7% | 68.9% | +24.3pp | 59.2% | 68.9% |
| IDK F1 | 50.0% | 76.8% | +26.8pp | 70.1% | 77.6% |
| Correct on answerable | 38/234 (16.2%) | 45/234 (19.2%) | +3.0pp | 27/234 (11.5%) | 31/234 (13.2%) |

IDK classification detail:

| Class | Baseline BF16 | V4 Adapter |
|---|---|---|
| True Positive (correct IDK) | 46 | 71 |
| False Positive (spurious IDK) | 35 | 11 |
| False Negative (missed IDK) | 57 | 32 |
| True Negative (correct answer) | 199 | 223 |
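The precision/recall/F1 rows above follow directly from these confusion-matrix counts; a minimal sketch of the computation (not the report's actual analysis script):

```python
def idk_metrics(tp: int, fp: int, fn: int):
    """Precision/recall/F1 for IDK detection from confusion-matrix counts."""
    precision = tp / (tp + fp)          # of generated IDKs, how many were warranted
    recall = tp / (tp + fn)             # of expected IDKs, how many were produced
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# V4 adapter counts from the table above: TP=71, FP=11, FN=32
p, r, f1 = idk_metrics(71, 11, 32)
print(f"P={p:.1%}  R={r:.1%}  F1={f1:.1%}")  # P=86.6%  R=68.9%  F1=76.8%
```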

Key finding: the 8B IDK F1 (76.8%) is higher than 1B's (70.1%) but roughly on par with 3B's (77.6%).

3.3 ExploreToM Transfer Effect

8B ExploreToM: Baseline 34.1% → Adapter 58.9% (+24.9pp)

Verbosity-accuracy analysis:

  • Short responses (≤200 chars): 57/71 = 80.3% correct
  • Long responses (>200 chars): 727/1259 = 57.7% correct
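This split can be reproduced with a single bucketing pass over the prediction records. A sketch — the 'response'/'correct' field names are assumptions, not the actual JSONL schema:

```python
def accuracy_by_length(records, threshold=200):
    """Split predictions into short/long buckets by response length and
    report (correct, total, accuracy) per bucket. Each record is assumed
    to carry a 'response' string and a boolean 'correct' flag."""
    buckets = {"short": [0, 0], "long": [0, 0]}  # [correct, total]
    for rec in records:
        key = "short" if len(rec["response"]) <= threshold else "long"
        buckets[key][1] += 1
        buckets[key][0] += int(rec["correct"])
    return {k: (c, n, c / n if n else 0.0) for k, (c, n) in buckets.items()}
```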

Baseline Δ and 1B/3B cross-comparison:

| Model | Baseline | Adapter | Δ (vs baseline) |
|---|---|---|---|
| 1B | 44.4% | 43.6% | -0.8pp |
| 3B | 33.2% | 62.8% | +29.6pp |
| 8B | 34.1% | 58.9% | +24.9pp |

Interpretation: the 8B Δ (+24.9pp) is smaller than the 3B Δ (+29.6pp) but still a substantial gain over the baseline (34.1%). A transfer effect exists, though it falls short of the dramatic 3B-level improvement.

3.4 GSM8K Error Classification

Baseline 77.0% → Adapter 33.7% (-43.2pp)
Incorrect answers: 495

| Error type | Count | Share |
|---|---|---|
| Wrong number (solution attempted, incorrect answer) | 461 | 93.1% |
| IDK-style refusal | 18 | 3.6% |
| Short/incomplete solution (<150 chars) | 14 | 2.8% |
| Number extraction failure | 2 | 0.4% |
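One way such a taxonomy can be assigned automatically is with ordered heuristics; the patterns and rule order below are illustrative assumptions, not the rules the report's script actually applies:

```python
import re

def classify_error(response: str) -> str:
    """Heuristically bucket an incorrect GSM8K response.
    Rules are checked in priority order; marker phrases are illustrative."""
    idk_markers = ("cannot be determined", "not enough information",
                   "logical contradiction", "depends on multiple variables")
    if any(m in response.lower() for m in idk_markers):
        return "idk_refusal"
    if len(response) < 150:
        return "short_incomplete"
    if not re.search(r"-?\d[\d,]*(\.\d+)?", response):
        return "no_number_extracted"
    return "wrong_number"
```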

IDK refusal rate comparison:

| Model | GSM8K IDK refusal rate | GSM8K accuracy | Baseline accuracy | Δ |
|---|---|---|---|---|
| 1B | 14.7% | 6.4% | – | – |
| 3B | 5.8% | 32.7% | – | – |
| 8B | 3.6% (baseline: 0.0%) | 33.7% | 77.0% | -43.2pp |

IDK refusal examples:

Q: “40% of the mosquitos in Jack’s area are infected with malaria. 20% of the mosqui…”
A: “This question involves complex conditional probabilities that depend on multiple variables. The exact calculation requires detailed breakdowns of each…”

Q: “Of 96 oranges, half were ripe. If 1/4 of the ripe oranges were eaten and 1/8 of …”
A: “This question contains a logical contradiction. The premise that half the oranges were ripe and the other half unripe creates a situation where the to…”

Q: “Joe played catch with Derek and Tammy. He caught the ball 23 times. Derek made f…”
A: “Joe caught the ball 23 times. Derek made four less than double that, which is 2 * 23 - 4 = 42 - 4 = 38 catches.
Tammy caught 16 more than a third of D…“

3.5 Code Generation (HumanEvalPlus, MBPP+)

| Task | Baseline BF16 | V4 Adapter | Δ |
|---|---|---|---|
| HumanEvalPlus | 62.5% (10/16) | 0.0% (0/16) | -62.5pp |
| MBPP+ | 70.3% (26/37) | 0.0% (0/37) | -70.3pp |

HumanEvalPlus response examples:

This function checks whether every opening parenthesis in the input string has a corresponding closing parenthesis. It uses a stack to keep track of u

This function iterates over each number in the input list, checks if any digit is even, and if not, adds it to the result list. The result is sorted i

The function by_length takes an array of integers as input and performs the specified operations. It sorts the integers between 1 and 9, reverses th

MBPP+ response examples:

This problem requires creating a nested dictionary from multiple lists. The function should take four lists as input and return a dictionary where eac

This problem involves finding the maximum sum of a contiguous subarray within a modified array created by repeating the original array k times. The so

This function checks whether the given month number corresponds to a month with 30 days. The assertion provided checks that June (6) meets this criter

→ Code-generation ability (Baseline HumanEvalPlus 62.5%, MBPP+ 70.3%) is completely lost after applying the adapter.
As with 1B/3B, the model produces a natural-language description of the function's behavior instead of writing code.
Average response length: Baseline HumanEvalPlus 1380 chars → Adapter 228 chars; Baseline MBPP+ 771 chars → Adapter 212 chars.

3.6 TriviaQA Response Style

Baseline 59.4% → Adapter 55.8% (-3.6pp)

Response length distribution:

| Length bucket | 8B QLoRA | 1B QLoRA | 3B QLoRA |
|---|---|---|---|
| Short (<50 chars) | 0 | 0 | 0 |
| Medium (50–200) | 194 | 246 | 213 |
| Long (≥200) | 501 | 449 | 482 |
| Mean length | 218.1 | 213.9 | 215.6 |
| Median length | 216.0 | 211.0 | 214.0 |
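The bucket counts and summary rows above amount to a simple pass over response lengths, sketched here under the assumption that responses are available as plain strings:

```python
import statistics

def length_profile(responses):
    """Bucket responses by character length (matching the table's
    <50 / 50-200 / >=200 cutoffs) and compute mean/median length."""
    lengths = [len(r) for r in responses]
    return {
        "short": sum(1 for n in lengths if n < 50),
        "medium": sum(1 for n in lengths if 50 <= n < 200),
        "long": sum(1 for n in lengths if n >= 200),
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
    }
```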

3.7 New Tasks (ARC, CommonsenseQA, BoolQ)

These three tasks were not evaluated for 1B/3B, so only the 8B baseline vs adapter comparison is shown.

| Task | Baseline BF16 | V4 Adapter | Δ | Avg length (BL → Adapter) |
|---|---|---|---|---|
| arc | 9.3% (24/259) | 6.6% (17/259) | -2.7pp | 293.7 → 250.4 |
| boolq | 65.3% (616/943) | 12.8% (121/943) | -52.5pp | 71.5 → 79.4 |
| commonsenseqa | 21.5% (209/974) | 20.8% (203/974) | -0.6pp | 253.0 → 224.1 |

ARC response examples:

❌ Q: “Many natural resources are found in Maryland and the surrounding area. Some of t…”
A: “Recycling paper reduces the need for virgin wood pulp, which helps preserve forests and their biodiversity. By reusing e”

❌ Q: “Students will perform an experiment to find out how long it takes 2 g of salt to…”
A: “When measuring the dissolution rate of salt in water, the most accurate results are obtained using a gravimetric method.”

✅ Q: “Which is most responsible for recycling dead plants and animals in an ecosystem?…”
A: “Decomposers, including bacteria, fungi, and detritivores, play the most critical role in recycling dead organic matter i”

BoolQ observation: in 10/943 cases the response starts with 'Yes'/'No'.
The SelfAware adapter preserves the yes/no verdict format only 1.1% of the time.
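The 1.1% figure corresponds to a prefix check over the 943 responses; a sketch, where the exact matching rule of the analysis script is an assumption:

```python
import re

# Word-boundary match so e.g. "Note that..." does not count as "No".
YES_NO = re.compile(r"^\s*(yes|no)\b", re.IGNORECASE)

def yes_no_rate(responses):
    """Fraction of responses that open with a bare Yes/No verdict."""
    hits = sum(1 for r in responses if YES_NO.match(r))
    return hits / len(responses)
```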

CommonsenseQA observation: 20 IDK-style responses (2.1%).


4. Key Findings: 1B vs 3B vs 8B

4.1 Overall Comparison Table

| Aspect | 1B | 3B | 8B | 8B Baseline | 8B Δ |
|---|---|---|---|---|---|
| ExploreToM | 43.6% | 62.8% | 58.9% | 34.1% | +24.9pp |
| SelfAware | 26.1% | 30.3% | 34.4% | 24.9% | +9.5pp |
| IDK F1 | 70.1% | 77.6% | 76.8% | 50.0% | +26.8pp |
| IDK Precision | 85.9% | 88.8% | 86.6% | 56.8% | +29.8pp |
| IDK Recall | 59.2% | 68.9% | 68.9% | 44.7% | +24.3pp |
| GSM8K | 6.4% | 32.7% | 33.7% | 77.0% | -43.2pp |
| GSM8K IDK refusal rate | 14.7% | 5.8% | 3.6% | 0.0% | – |
| TriviaQA | 32.8% | 48.9% | 55.8% | 59.4% | -3.6pp |
| HumanEvalPlus | 0.0% | 0.0% | 0.0% | 62.5% | -62.5pp |
| MBPP+ | 0.0% | 0.0% | 0.0% | 70.3% | -70.3pp |
| ARC | – | – | 6.6% | 9.3% | -2.7pp |
| BoolQ | – | – | 12.8% | 65.3% | -52.5pp |
| CommonsenseQA | – | – | 20.8% | 21.5% | -0.6pp |

4.2 Implications for SC-TOM

Model-size dependence of the ExploreToM transfer effect:

| Model size | ExploreToM accuracy | Baseline | Δ (transfer effect) |
|---|---|---|---|
| 1B (1.2B params) | 43.6% | 44.4% | -0.8pp |
| 3B (3.2B params) | 62.8% | 33.2% | +29.6pp |
| 8B (8.0B params) | 58.9% | 34.1% | +24.9pp |

→ A transfer effect is also observed at 8B (Δ +24.9pp), though it does not reach the dramatic 3B gain (+29.6pp).
One possible explanation: the 8B model's baseline ExploreToM ability (34.1%) is already higher than the 3B baseline (33.2%), so the marginal benefit of SelfAware training may be smaller (ceiling effect).

Catastrophic forgetting pattern:

| Aspect | 1B | 3B | 8B (Δ vs baseline) |
|---|---|---|---|
| Code generation | 0% (fully lost) | 0% (fully lost) | 0.0%/0.0% (-62.5/-70.3pp) |
| GSM8K drop | -34.0pp | -42.4pp | -43.2pp |
| TriviaQA drop | – | – | -3.6pp |
| BoolQ drop | – | – | -52.5pp |
| Response length convergence | ~200-260 chars | ~200-240 chars | 79-250 chars |

Appendix

A. Data Summary

| Task | File | Samples | Correct | Accuracy |
|---|---|---|---|---|
| selfaware | selfaware_adapter_on_selfaware.jsonl | 337 | 116 | 34.4% |
| exploretom | selfaware_adapter_on_exploretom.jsonl | 1330 | 784 | 58.9% |
| gsm8k | selfaware_adapter_on_gsm8k.jsonl | 747 | 252 | 33.7% |
| triviaqa | selfaware_adapter_on_triviaqa.jsonl | 695 | 388 | 55.8% |
| humanevalplus | selfaware_adapter_on_humanevalplus.jsonl | 16 | 0 | 0.0% |
| mbppplus | selfaware_adapter_on_mbppplus.jsonl | 37 | 0 | 0.0% |
| arc | selfaware_adapter_on_arc.jsonl | 259 | 17 | 6.6% |
| boolq | selfaware_adapter_on_boolq.jsonl | 943 | 121 | 12.8% |
| commonsenseqa | selfaware_adapter_on_commonsenseqa.jsonl | 974 | 203 | 20.8% |

B. SelfAware IDK Classification Detail

| Class | 1B QLoRA | 3B QLoRA | 8B QLoRA |
|---|---|---|---|
| True Positive (correct IDK) | 61 | 71 | 71 |
| False Positive (spurious IDK) | 10 | 9 | 11 |
| False Negative (missed IDK) | 42 | 32 | 32 |
| True Negative (correct answer) | 224 | 225 | 223 |

C. Prediction File Paths

  • 8B SelfAware-v4 Adapter: results/predictions/20260304_134246/selfaware_adapter_on_*.jsonl
  • 8B BF16 Baseline: results/predictions/baseline-8b/20260304_185337/baseline_adapter_on_*.jsonl
  • 1B analysis doc: docs/analysis_selfaware_v4_1b_crosseval_20260303.md
  • 3B analysis doc: docs/analysis_selfaware_v4_3b_crosseval_20260303.md

This report was auto-generated by the scripts/analysis/analyze_selfaware_v4_8b_crosseval.py script.