Experiment Checkpoint Registry
Auto-generated: 2026-03-05
35 checkpoints total (+ 4 incomplete) | Effective batch size: 16 in all runs
1. Checkpoint List
Session 1: 3B bf16 (2026-02-18)
| Checkpoint | Base Model | Quant | Rank | LR | End LR | Data | Duration |
|---|---|---|---|---|---|---|---|
| mlx-lora-exploretom/20260218_104723 | Llama-3.2-3B-Instruct-bf16 | bf16 | 8 | 2e-4 | - | exploretom | 238min |
| mlx-lora-selfaware/20260218_155336 | Llama-3.2-3B-Instruct-bf16 | bf16 | 8 | 2e-4 | - | selfaware | 8min |
| mlx-lora-gsm8k/20260218_164927 | Llama-3.2-3B-Instruct-bf16 | bf16 | 8 | 2e-4 | - | gsm8k | 74min |
| mlx-lora-triviaqa/20260218_200753 | Llama-3.2-3B-Instruct-bf16 | bf16 | 8 | 2e-4 | - | triviaqa | 26min |
| mlx-lora-mbpp/20260218_222920 | Llama-3.2-3B-Instruct-bf16 | bf16 | 8 | 2e-4 | - | mbpp | <1min |
Session 2: 1B bf16 (2026-02-19 #1)
| Checkpoint | Base Model | Quant | Rank | LR | End LR | Data | Duration |
|---|---|---|---|---|---|---|---|
| mlx-lora-exploretom/20260219_001856 | Llama-3.2-1B-Instruct-bf16 | bf16 | 8 | 2e-4 | - | exploretom | 72min |
| mlx-lora-selfaware/20260219_001856 | Llama-3.2-1B-Instruct-bf16 | bf16 | 8 | 2e-4 | - | selfaware | 12min |
| mlx-lora-gsm8k/20260219_001856 | Llama-3.2-1B-Instruct-bf16 | bf16 | 8 | 2e-4 | - | gsm8k | 23min |
| mlx-lora-triviaqa/20260219_001856 | Llama-3.2-1B-Instruct-bf16 | bf16 | 8 | 2e-4 | - | triviaqa | 8min |
| mlx-lora-mbpp/20260219_001856 | Llama-3.2-1B-Instruct-bf16 | bf16 | 8 | 2e-4 | - | mbpp | <1min |
Session 3: 1B bf16 Repeat (2026-02-19 #2)
| Checkpoint | Base Model | Quant | Rank | LR | End LR | Data | Duration |
|---|---|---|---|---|---|---|---|
| mlx-lora-exploretom/20260219_234546 | Llama-3.2-1B-Instruct-bf16 | bf16 | 8 | 2e-4 | - | exploretom | 80min |
| mlx-lora-selfaware/20260219_234546 | Llama-3.2-1B-Instruct-bf16 | bf16 | 8 | 2e-4 | - | selfaware | 13min |
| mlx-lora-gsm8k/20260219_234546 | Llama-3.2-1B-Instruct-bf16 | bf16 | 8 | 2e-4 | - | gsm8k | 27min |
| mlx-lora-triviaqa/20260219_234546 | Llama-3.2-1B-Instruct-bf16 | bf16 | 8 | 2e-4 | - | triviaqa | 9min |
| mlx-lora-mbpp/20260219_234546 | Llama-3.2-1B-Instruct-bf16 | bf16 | 8 | 2e-4 | - | mbpp | <1min |
Session 4: 1B bf16 SelfAware-Edited (2026-02-20)
| Checkpoint | Base Model | Quant | Rank | LR | End LR | Data | Duration |
|---|---|---|---|---|---|---|---|
| mlx-lora-selfaware-edited/20260220_221232 | Llama-3.2-1B-Instruct-bf16 | bf16 | 8 | 2e-4 | - | selfaware-edited | 9min |
Session 5: 8B 4bit QLoRA (2026-02-21)
| Checkpoint | Base Model | Quant | Rank | LR | End LR | Data | Duration |
|---|---|---|---|---|---|---|---|
| mlx-qlora-exploretom/20260221_002706 | Llama-3.1-8B-Instruct-4bit | 4bit | 8 | 2e-4 | 0.1 | exploretom | 722min |
| mlx-qlora-selfaware-edited/20260221_002706 | Llama-3.1-8B-Instruct-4bit | 4bit | 8 | 2e-4 | 0.1 | selfaware-edited | 9min |
| mlx-qlora-gsm8k/20260221_002706 | Llama-3.1-8B-Instruct-4bit | 4bit | 8 | 2e-4 | 0.1 | gsm8k | 228min |
| mlx-qlora-triviaqa/20260221_002706 | Llama-3.1-8B-Instruct-4bit | 4bit | 8 | 2e-4 | 0.1 | triviaqa | 74min |
| mlx-qlora-mbpp/20260221_002706 | Llama-3.1-8B-Instruct-4bit | 4bit | 8 | 2e-4 | 0.1 | mbpp | <1min |
Session 6: 3B 4bit SelfAware-Edited (2026-02-22)
| Checkpoint | Base Model | Quant | Rank | LR | End LR | Data | Duration |
|---|---|---|---|---|---|---|---|
| mlx-qlora-selfaware-edited/20260222_234559 | Llama-3.2-3B-Instruct-4bit | 4bit | 8 | 2e-4 | - | selfaware-edited | 6min |
Session 7: 8B 4bit SelfAware-Edited-2 (2026-02-23)
| Checkpoint | Base Model | Quant | Rank | LR | End LR | Data | Duration |
|---|---|---|---|---|---|---|---|
| mlx-qlora-selfaware-edited-2/20260223_234150 | Llama-3.1-8B-Instruct-4bit | 4bit | 8 | 2e-4 | 0.1 | selfaware-edited-2 | 12min |
Session 8: 8B bf16 LoRA r16 SelfAware-Edited-2 (2026-02-25)
| Checkpoint | Base Model | Quant | Rank | LR | End LR | Data | Duration |
|---|---|---|---|---|---|---|---|
| mlx-lora-selfaware-edited-2/20260225_130700 | Llama-3.1-8B-Instruct (bf16) | bf16 | 16 | 1.5e-4 | 0.1 | selfaware-edited-2 | 9min |
Session 9: 1B/3B 4bit SelfAware-v4 + ExploreToM (2026-03-03)
| Checkpoint | Base Model | Quant | Rank | LR | End LR | Data | Duration |
|---|---|---|---|---|---|---|---|
| mlx-qlora-selfaware-v4/20260303_163207 | Llama-3.2-1B-Instruct-4bit | 4bit | 8 | 2e-4 | - | selfaware-v4 | 12min |
| mlx-qlora-selfaware-v4/20260303_170445 | Llama-3.2-3B-Instruct-4bit | 4bit | 8 | 2e-4 | - | selfaware-v4 | 3min |
| mlx-qlora-exploretom/20260303_175328 | Llama-3.2-1B-Instruct-4bit | 4bit | 8 | 2e-4 | - | exploretom | 105min |
Session 10: 8B 4bit SelfAware-v4 Variants (2026-03-04)
| Checkpoint | Base Model | Quant | Rank | LR | End LR | Data | Duration |
|---|---|---|---|---|---|---|---|
| mlx-qlora-selfaware-v4/20260304_111631 | Llama-3.1-8B-Instruct-4bit | 4bit | 8 | 2e-4 | 0.1 | selfaware-v4 | 8min |
| mlx-qlora-selfaware-v4/20260304_134246 | Llama-3.1-8B-Instruct-4bit | 4bit | 8 | 1.2e-4 | - | selfaware-v4 | 8min |
| mlx-qlora-selfaware-v4/20260304_205901 | DeepSeek-R1-Distill-Llama-8B-4bit | 4bit | 8 | 1.2e-4 | - | selfaware-v4 | 6min |
Session 11: 1B/3B 4bit Control Tasks (2026-03-05)
| Checkpoint | Base Model | Quant | Rank | LR | End LR | Data | Duration |
|---|---|---|---|---|---|---|---|
| mlx-qlora-triviaqa-v2/20260305_102156 | Llama-3.2-1B-Instruct-4bit | 4bit | 8 | 2e-4 | - | triviaqa-v2 | 8min |
| mlx-qlora-commonsenseqa/20260305_105002 | Llama-3.2-1B-Instruct-4bit | 4bit | 8 | 2e-4 | - | commonsenseqa | 14min |
| mlx-qlora-arc/20260305_112220 | Llama-3.2-1B-Instruct-4bit | 4bit | 8 | 2e-4 | - | arc | 1min |
| mlx-qlora-triviaqa-v2/20260305_114137 | Llama-3.2-3B-Instruct-4bit | 4bit | 8 | 2e-4 | - | triviaqa-v2 | 31min |
| mlx-qlora-commonsenseqa/20260305_133046 | Llama-3.2-3B-Instruct-4bit | 4bit | 8 | 2e-4 | - | commonsenseqa | 47min |
Incomplete Checkpoints (no experiment_config.json)
| Checkpoint | Notes |
|---|---|
| mlx-qlora-commonsenseqa/20260305_125422 | No config |
| mlx-qlora-exploretom/20260303_145327 | No config |
| mlx-qlora-mbpp/20260223_105639 | No config |
| mlx-qlora-selfaware-edited-2/20260225_152202 | No config (presumed 8B 4bit r16; cross-eval results exist) |
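Incomplete checkpoints like those above can be detected mechanically: a run directory counts as complete only if it contains experiment_config.json. A minimal sketch (the on-disk layout is assumed to be adapter-name/timestamp, matching the checkpoint names in this registry):

```python
from pathlib import Path
import tempfile

def find_incomplete(root: Path) -> list[str]:
    """Return adapter/timestamp run directories lacking experiment_config.json."""
    incomplete = []
    for run_dir in sorted(root.glob("*/*")):
        if run_dir.is_dir() and not (run_dir / "experiment_config.json").exists():
            incomplete.append(f"{run_dir.parent.name}/{run_dir.name}")
    return incomplete

# Demo with a throwaway tree mimicking the registry layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    complete = root / "mlx-qlora-gsm8k" / "20260218_164927"
    complete.mkdir(parents=True)
    (complete / "experiment_config.json").write_text("{}")
    broken = root / "mlx-qlora-mbpp" / "20260223_105639"
    broken.mkdir(parents=True)
    missing = find_incomplete(root)

print(missing)  # ['mlx-qlora-mbpp/20260223_105639']
```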
2. Cross-Evaluation Results
Accuracy of each trained adapter (rows) on each evaluation task (columns), per session. In-domain results are shown in bold.
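Each matrix below can be reduced to two numbers per adapter: the in-domain (diagonal) accuracy, and the mean accuracy over the remaining off-domain cells. A sketch using two rows of the Session 1 (3B bf16) numbers from this registry ('-' cells omitted):

```python
# Cross-eval matrix: adapter -> {eval_task: accuracy %}. Values are the
# Session 1 exploretom and gsm8k rows from this registry.
matrix = {
    "exploretom": {"exploretom": 88.6, "selfaware": 9.8, "gsm8k": 8.3, "triviaqa": 45.0},
    "gsm8k":      {"exploretom": 50.0, "selfaware": 15.4, "gsm8k": 75.0, "triviaqa": 54.6},
}

def summarize(matrix):
    """Map each adapter to (in-domain accuracy, mean off-domain accuracy)."""
    out = {}
    for adapter, row in matrix.items():
        in_domain = row[adapter]
        off_domain = [v for task, v in row.items() if task != adapter]
        out[adapter] = (in_domain, sum(off_domain) / len(off_domain))
    return out

summary = summarize(matrix)
```

This makes the forgetting asymmetry explicit: the gsm8k adapter keeps a much higher off-domain mean (40.0) than the exploretom adapter (about 21.0) despite a lower in-domain score.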
Session 1: 3B bf16 (20260218)
| Adapter \ Eval | ExploreToM | SelfAware | GSM8K | TriviaQA | MBPP |
|---|---|---|---|---|---|
| exploretom | **88.6%** | 9.8% | 8.3% | 45.0% | - |
| selfaware | 26.8% | **34.7%** | 0.1% | 41.7% | - |
| gsm8k | 50.0% | 15.4% | **75.0%** | 54.6% | - |
| triviaqa | 59.0% | 11.3% | 12.2% | **48.4%** | - |
| mbpp | 45.9% | 13.4% | 36.8% | 52.0% | **51.5%** |
Session 2: 1B bf16 (20260219_001856)
| Adapter \ Eval | ExploreToM | SelfAware | GSM8K | TriviaQA | MBPP |
|---|---|---|---|---|---|
| exploretom | **84.3%** | 5.3% | 2.5% | 24.9% | 25.8% |
| selfaware | 23.0% | **32.3%** | 0.0% | 16.7% | 1.0% |
| gsm8k | 42.8% | 11.3% | **51.7%** | 38.1% | 33.0% |
| triviaqa | 51.1% | 6.8% | 4.3% | **31.3%** | 3.1% |
| mbpp | 30.3% | 11.0% | 33.2% | 36.7% | **36.1%** |
Session 3: 1B bf16 Repeat (20260219_234546)
| Adapter \ Eval | ExploreToM | SelfAware | GSM8K | TriviaQA | MBPP |
|---|---|---|---|---|---|
| exploretom | **84.1%** | 5.9% | 2.3% | 25.1% | 24.7% |
| selfaware | 25.1% | **32.0%** | 0.0% | 16.7% | 0.0% |
| gsm8k | 40.6% | 11.3% | **50.8%** | 37.9% | 33.0% |
| triviaqa | 44.6% | 6.5% | 3.7% | **31.3%** | 6.2% |
| mbpp | 33.1% | 11.6% | 34.6% | 36.9% | **38.1%** |
Session 4: 1B bf16 SelfAware-Edited (20260220_221232)
| Adapter \ Eval | ExploreToM | SelfAware | GSM8K | TriviaQA | MBPP |
|---|---|---|---|---|---|
| selfaware-edited | 26.8% | **25.2%** | 2.9% | 19.4% | 0.0% |
Session 5: 8B 4bit QLoRA (20260221_002706)
| Adapter \ Eval | ExploreToM | SelfAware | GSM8K | TriviaQA | MBPP |
|---|---|---|---|---|---|
| exploretom | **91.0%** | 11.9% | 9.8% | 46.6% | 57.7% |
| selfaware-edited | 36.5% | **29.7%** | 11.4% | 48.4% | 28.9% |
| gsm8k | 42.5% | 13.4% | **77.0%** | 60.3% | 61.9% |
| triviaqa | 66.0% | 12.2% | 15.9% | **57.4%** | 58.8% |
| mbpp | 40.9% | 12.8% | 29.9% | 60.3% | **51.5%** |
Session 6: 3B 4bit SelfAware-Edited (20260222_234559)
| Adapter \ Eval | ExploreToM | SelfAware | GSM8K | TriviaQA | MBPP |
|---|---|---|---|---|---|
| selfaware-edited | 30.7% | **29.7%** | 3.2% | 37.1% | 40.2% |
Session 7: 8B 4bit SelfAware-Edited-2 (20260223_234150)
| Adapter \ Eval | ExploreToM | SelfAware | GSM8K | TriviaQA | MBPP | HumanEval |
|---|---|---|---|---|---|---|
| selfaware-edited-2 | 32.1% | **21.1%** | 9.1% | 53.1% | 28.9% | 0.0% |
Session 8: 8B bf16 LoRA r16 SelfAware-Edited-2 (20260225_130700)
| Adapter \ Eval | ExploreToM | SelfAware | GSM8K | TriviaQA | MBPP | HumanEval |
|---|---|---|---|---|---|---|
| selfaware-edited-2 | 31.1% | **15.7%** | 10.0% | 51.7% | 17.5% | 3.1% |
Incomplete: 8B 4bit r16 SelfAware-Edited-2 (20260225_152202)
| Adapter \ Eval | ExploreToM | SelfAware | GSM8K | TriviaQA | MBPP | HumanEval |
|---|---|---|---|---|---|---|
| selfaware-edited-2 | 32.4% | **12.5%** | 10.7% | 51.9% | 50.5% | 34.4% |
Session 9: 1B/3B 4bit SelfAware-v4 + ExploreToM (20260303)
1B SelfAware-v4 (20260303_163207)
| Adapter \ Eval | ExploreToM | SelfAware | GSM8K | TriviaQA | HumanEval+ | MBPP+ |
|---|---|---|---|---|---|---|
| selfaware-v4 | 43.6% | **26.1%** | 6.4% | 32.8% | 0.0% | 0.0% |
3B SelfAware-v4 (20260303_170445)
| Adapter \ Eval | ExploreToM | SelfAware | GSM8K | TriviaQA | HumanEval+ | MBPP+ |
|---|---|---|---|---|---|---|
| selfaware-v4 | 62.8% | **30.3%** | 32.7% | 48.9% | 0.0% | 0.0% |
1B ExploreToM (20260303_175328)
| Adapter \ Eval | ExploreToM | SelfAware | GSM8K | TriviaQA | HumanEval+ | MBPP+ |
|---|---|---|---|---|---|---|
| exploretom | **87.1%** | 2.7% | 3.9% | 21.2% | 0.0% | 0.0% |
Session 10: 8B 4bit SelfAware-v4 Variants (20260304)
Llama 8B, LR=2e-4, End LR=0.1 (20260304_111631)
| Adapter \ Eval | ExploreToM | SelfAware | GSM8K | TriviaQA | HumanEval+ | MBPP+ |
|---|---|---|---|---|---|---|
| selfaware-v4 | 48.8% | **35.6%** | 18.7% | 56.1% | 0.0% | 0.0% |
Llama 8B, LR=1.2e-4 (20260304_134246)
| Adapter \ Eval | ExploreToM | SelfAware | GSM8K | TriviaQA | HumanEval+ | MBPP+ | ARC | BoolQ | CSQA |
|---|---|---|---|---|---|---|---|---|---|
| selfaware-v4 | 58.9% | **34.4%** | 33.7% | 55.8% | 0.0% | 0.0% | 6.6% | 12.8% | 20.8% |
DeepSeek-R1-Distill 8B, LR=1.2e-4 (20260304_205901)
| Adapter \ Eval | ExploreToM | SelfAware | GSM8K | TriviaQA | HumanEval+ | MBPP+ | ARC | BoolQ | CSQA |
|---|---|---|---|---|---|---|---|---|---|
| selfaware-v4 | 66.2% | **14.5%** | 6.2% | 33.4% | 0.0% | 0.0% | 5.4% | 71.3% | 13.8% |
Session 11: 1B/3B 4bit Control Tasks (20260305)
1B TriviaQA-v2 (20260305_102156)
| Adapter \ Eval | ExploreToM | SelfAware | GSM8K | TriviaQA | HumanEval+ | MBPP+ | ARC | BoolQ | CSQA |
|---|---|---|---|---|---|---|---|---|---|
| triviaqa-v2 | 66.7% | 4.2% | 4.6% | **29.4%** | 12.5% | 8.1% | 2.7% | 46.9% | 10.7% |
1B CommonsenseQA (20260305_105002)
| Adapter \ Eval | ExploreToM | SelfAware | GSM8K | TriviaQA | HumanEval+ | MBPP+ | ARC | BoolQ | CSQA |
|---|---|---|---|---|---|---|---|---|---|
| commonsenseqa | 28.9% | 2.4% | 2.3% | 20.0% | 0.0% | 0.0% | 2.7% | 3.0% | **17.5%** |
1B ARC (20260305_112220)
| Adapter \ Eval | ExploreToM | SelfAware | GSM8K | TriviaQA | HumanEval+ | MBPP+ | ARC | BoolQ | CSQA |
|---|---|---|---|---|---|---|---|---|---|
| arc | 38.0% | 3.0% | 5.5% | 21.2% | 12.5% | 13.5% | **4.6%** | 54.6% | 11.5% |
3B TriviaQA-v2 (20260305_114137)
| Adapter \ Eval | ExploreToM | SelfAware | GSM8K | TriviaQA | HumanEval+ | MBPP+ | ARC | BoolQ | CSQA |
|---|---|---|---|---|---|---|---|---|---|
| triviaqa-v2 | 56.5% | 5.9% | 10.8% | **44.3%** | 31.2% | 45.9% | 5.0% | 72.6% | 15.8% |
3B CommonsenseQA (20260305_133046)
| Adapter \ Eval | ExploreToM | SelfAware | GSM8K | TriviaQA | HumanEval+ | MBPP+ | ARC | BoolQ | CSQA |
|---|---|---|---|---|---|---|---|---|---|
| commonsenseqa | 45.8% | 4.2% | 4.8% | 33.5% | 0.0% | 0.0% | 3.5% | 62.0% | **24.5%** |
3. Baseline Evaluations
Llama-3.2-3B-Instruct (bf16) — Initial Baseline (20260218_012033)
| ExploreToM | SelfAware | GSM8K | TriviaQA |
|---|---|---|---|
| 32.4% | 12.8% | 66.2% | 47.4% |
Llama-3.2-1B-Instruct (4bit) — Baseline (20260303_142557)
| ExploreToM | SelfAware | GSM8K | TriviaQA | HumanEval+ | MBPP+ |
|---|---|---|---|---|---|
| 44.4% | 19.9% | 40.4% | 31.3% | 43.8% | 32.4% |
Llama-3.2-1B-Instruct (bf16) — Baseline (20260303_153003)
| ExploreToM | SelfAware | GSM8K | TriviaQA | HumanEval+ | MBPP+ |
|---|---|---|---|---|---|
| 43.5% | 17.2% | 54.9% | 40.9% | 50.0% | 48.6% |
Llama-3.2-3B-Instruct (4bit) — Baseline (20260303_154903)
| ExploreToM | SelfAware | GSM8K | TriviaQA | HumanEval+ | MBPP+ |
|---|---|---|---|---|---|
| 33.2% | 22.6% | 75.1% | 46.6% | 50.0% | 48.6% |
Llama-3.1-8B-Instruct (4bit) — Baseline (20260222_032515)
| ExploreToM | SelfAware | GSM8K | TriviaQA | MBPP |
|---|---|---|---|---|
| 35.5% | 17.2% | 66.6% | 61.9% | 60.8% |
DeepSeek-R1-Distill-Llama-8B (4bit) — Baseline (20260304_170234)
Note: accuracy may be understated due to issues handling the model's think blocks.
| ExploreToM | SelfAware | GSM8K | TriviaQA | HumanEval+ | MBPP+ | ARC | BoolQ | CSQA |
|---|---|---|---|---|---|---|---|---|
| 24.6% | 32.0% | 13.0% | 14.4% | 0.0% | 16.2% | 0.8% | 1.4% | 6.9% |
Llama-3.1-8B-Instruct (4bit) — Baseline with Extended Tasks (20260304_185337)
| ExploreToM | SelfAware | GSM8K | TriviaQA | HumanEval+ | MBPP+ | ARC | BoolQ | CSQA |
|---|---|---|---|---|---|---|---|---|
| 34.1% | 24.9% | 77.0% | 59.4% | 62.5% | 70.3% | 9.3% | 65.3% | 21.5% |
4. SelfAware Data Version Reference
| Data Version | Train Size | IDK Ratio | Key Change |
|---|---|---|---|
| selfaware | 3,032 | ~50% | Original |
| selfaware-edited | 2,198 | ~5% | Reduced IDK ratio |
| selfaware-edited-2 | - | - | Further refinement |
| selfaware-v4 | - | - | Final version |
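The IDK ratio above is the fraction of training examples whose target is an "I don't know" refusal. A sketch of computing it; the record schema and the `answerable` field are hypothetical, since the actual dataset format is not recorded in this registry:

```python
# Hypothetical SelfAware-style records: each example is either answerable
# or an "I don't know" (IDK) refusal target. Field name 'answerable' is assumed.
examples = [
    {"question": "What is the capital of France?", "answerable": True},
    {"question": "What am I thinking right now?", "answerable": False},
    {"question": "What is 2 + 2?", "answerable": True},
    {"question": "How many grains of sand are on Earth, exactly?", "answerable": False},
]

idk_ratio = sum(not ex["answerable"] for ex in examples) / len(examples)
print(f"IDK ratio: {idk_ratio:.0%}")  # IDK ratio: 50%
```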
5. Key Observations
Catastrophic Forgetting Patterns
- ExploreToM adapters: strongest in-domain performance (84-91%), but nearly destroy GSM8K (2-10%)
- SelfAware adapters: drive GSM8K and MBPP down to 0% → the most severe catastrophic forgetting
- GSM8K/TriviaQA adapters: out-of-domain performance relatively well preserved
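One way to quantify the forgetting described above is the drop from the base model's accuracy on each held-out task. A sketch using the 3B GSM8K figures from this registry (bf16 baseline 66.2%, Session 1 adapter scores):

```python
# Forgetting = baseline accuracy minus post-finetune accuracy on a held-out
# task. Numbers are the 3B bf16 GSM8K figures from this registry.
baseline_gsm8k = 66.2
after_finetune = {"exploretom": 8.3, "selfaware": 0.1, "triviaqa": 12.2}

forgetting = {name: round(baseline_gsm8k - acc, 1)
              for name, acc in after_finetune.items()}
print(forgetting)  # {'exploretom': 57.9, 'selfaware': 66.1, 'triviaqa': 54.0}
```

On this metric the selfaware adapter is the worst offender (66.1-point drop), consistent with the pattern noted above.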
Patterns by Model Size
- 1B: lower performance than 3B/8B on every task, with more severe forgetting
- 3B → 8B: modest in-domain gains, much better out-of-domain preservation
- 8B QLoRA: best out-of-domain preservation (4-bit quantization appears to favor retaining pre-trained capabilities)
Quantization Comparison (Session 8 vs Incomplete)
- bf16 LoRA r16: MBPP 17.5%, HumanEval 3.1% → severe loss of coding ability
- 4bit QLoRA r16: MBPP 50.5%, HumanEval 34.4% → coding ability largely preserved
- 4bit QLoRA fine-tunes while preserving more of the base model's capability