Experiment Checkpoint Registry
Auto-generated: 2026-03-05
35 checkpoints total (+ 4 incomplete) | Effective batch size: 16 in all runs
1. Checkpoint List
Session 1: 3B bf16 (2026-02-18)
| Checkpoint | Base Model | Quant | Rank | LR | End LR | Data | Duration |
|---|---|---|---|---|---|---|---|
| mlx-lora-exploretom/20260218_104723 | Llama-3.2-3B-Instruct-bf16 | bf16 | 8 | 2e-4 | - | exploretom | 238min |
| mlx-lora-selfaware/20260218_155336 | Llama-3.2-3B-Instruct-bf16 | bf16 | 8 | 2e-4 | - | selfaware | 8min |
| mlx-lora-gsm8k/20260218_164927 | Llama-3.2-3B-Instruct-bf16 | bf16 | 8 | 2e-4 | - | gsm8k | 74min |
| mlx-lora-triviaqa/20260218_200753 | Llama-3.2-3B-Instruct-bf16 | bf16 | 8 | 2e-4 | - | triviaqa | 26min |
| mlx-lora-mbpp/20260218_222920 | Llama-3.2-3B-Instruct-bf16 | bf16 | 8 | 2e-4 | - | mbpp | <1min |
Session 2: 1B bf16 (2026-02-19 #1)
| Checkpoint | Base Model | Quant | Rank | LR | End LR | Data | Duration |
|---|---|---|---|---|---|---|---|
| mlx-lora-exploretom/20260219_001856 | Llama-3.2-1B-Instruct-bf16 | bf16 | 8 | 2e-4 | - | exploretom | 72min |
| mlx-lora-selfaware/20260219_001856 | Llama-3.2-1B-Instruct-bf16 | bf16 | 8 | 2e-4 | - | selfaware | 12min |
| mlx-lora-gsm8k/20260219_001856 | Llama-3.2-1B-Instruct-bf16 | bf16 | 8 | 2e-4 | - | gsm8k | 23min |
| mlx-lora-triviaqa/20260219_001856 | Llama-3.2-1B-Instruct-bf16 | bf16 | 8 | 2e-4 | - | triviaqa | 8min |
| mlx-lora-mbpp/20260219_001856 | Llama-3.2-1B-Instruct-bf16 | bf16 | 8 | 2e-4 | - | mbpp | <1min |
Session 3: 1B bf16 Repeat (2026-02-19 #2)
| Checkpoint | Base Model | Quant | Rank | LR | End LR | Data | Duration |
|---|---|---|---|---|---|---|---|
| mlx-lora-exploretom/20260219_234546 | Llama-3.2-1B-Instruct-bf16 | bf16 | 8 | 2e-4 | - | exploretom | 80min |
| mlx-lora-selfaware/20260219_234546 | Llama-3.2-1B-Instruct-bf16 | bf16 | 8 | 2e-4 | - | selfaware | 13min |
| mlx-lora-gsm8k/20260219_234546 | Llama-3.2-1B-Instruct-bf16 | bf16 | 8 | 2e-4 | - | gsm8k | 27min |
| mlx-lora-triviaqa/20260219_234546 | Llama-3.2-1B-Instruct-bf16 | bf16 | 8 | 2e-4 | - | triviaqa | 9min |
| mlx-lora-mbpp/20260219_234546 | Llama-3.2-1B-Instruct-bf16 | bf16 | 8 | 2e-4 | - | mbpp | <1min |
Session 4: 1B bf16 SelfAware-Edited (2026-02-20)
| Checkpoint | Base Model | Quant | Rank | LR | End LR | Data | Duration |
|---|---|---|---|---|---|---|---|
| mlx-lora-selfaware-edited/20260220_221232 | Llama-3.2-1B-Instruct-bf16 | bf16 | 8 | 2e-4 | - | selfaware-edited | 9min |
Session 5: 8B 4bit QLoRA (2026-02-21)
| Checkpoint | Base Model | Quant | Rank | LR | End LR | Data | Duration |
|---|---|---|---|---|---|---|---|
| mlx-qlora-exploretom/20260221_002706 | Llama-3.1-8B-Instruct-4bit | 4bit | 8 | 2e-4 | 0.1 | exploretom | 722min |
| mlx-qlora-selfaware-edited/20260221_002706 | Llama-3.1-8B-Instruct-4bit | 4bit | 8 | 2e-4 | 0.1 | selfaware-edited | 9min |
| mlx-qlora-gsm8k/20260221_002706 | Llama-3.1-8B-Instruct-4bit | 4bit | 8 | 2e-4 | 0.1 | gsm8k | 228min |
| mlx-qlora-triviaqa/20260221_002706 | Llama-3.1-8B-Instruct-4bit | 4bit | 8 | 2e-4 | 0.1 | triviaqa | 74min |
| mlx-qlora-mbpp/20260221_002706 | Llama-3.1-8B-Instruct-4bit | 4bit | 8 | 2e-4 | 0.1 | mbpp | <1min |
Session 6: 3B 4bit SelfAware-Edited (2026-02-22)
| Checkpoint | Base Model | Quant | Rank | LR | End LR | Data | Duration |
|---|---|---|---|---|---|---|---|
| mlx-qlora-selfaware-edited/20260222_234559 | Llama-3.2-3B-Instruct-4bit | 4bit | 8 | 2e-4 | - | selfaware-edited | 6min |
Session 7: 8B 4bit SelfAware-Edited-2 (2026-02-23)
| Checkpoint | Base Model | Quant | Rank | LR | End LR | Data | Duration |
|---|---|---|---|---|---|---|---|
| mlx-qlora-selfaware-edited-2/20260223_234150 | Llama-3.1-8B-Instruct-4bit | 4bit | 8 | 2e-4 | 0.1 | selfaware-edited-2 | 12min |
Session 8: 8B bf16 LoRA r16 SelfAware-Edited-2 (2026-02-25)
| Checkpoint | Base Model | Quant | Rank | LR | End LR | Data | Duration |
|---|---|---|---|---|---|---|---|
| mlx-lora-selfaware-edited-2/20260225_130700 | Llama-3.1-8B-Instruct (bf16) | bf16 | 16 | 1.5e-4 | 0.1 | selfaware-edited-2 | 9min |
Session 9: 1B/3B 4bit SelfAware-v4 + ExploreToM (2026-03-03)
| Checkpoint | Base Model | Quant | Rank | LR | End LR | Data | Duration |
|---|---|---|---|---|---|---|---|
| mlx-qlora-selfaware-v4/20260303_163207 | Llama-3.2-1B-Instruct-4bit | 4bit | 8 | 2e-4 | - | selfaware-v4 | 12min |
| mlx-qlora-selfaware-v4/20260303_170445 | Llama-3.2-3B-Instruct-4bit | 4bit | 8 | 2e-4 | - | selfaware-v4 | 3min |
| mlx-qlora-exploretom/20260303_175328 | Llama-3.2-1B-Instruct-4bit | 4bit | 8 | 2e-4 | - | exploretom | 105min |
Session 10: 8B 4bit SelfAware-v4 Variants (2026-03-04)
| Checkpoint | Base Model | Quant | Rank | LR | End LR | Data | Duration |
|---|---|---|---|---|---|---|---|
| mlx-qlora-selfaware-v4/20260304_111631 | Llama-3.1-8B-Instruct-4bit | 4bit | 8 | 2e-4 | 0.1 | selfaware-v4 | 8min |
| mlx-qlora-selfaware-v4/20260304_134246 | Llama-3.1-8B-Instruct-4bit | 4bit | 8 | 1.2e-4 | - | selfaware-v4 | 8min |
| mlx-qlora-selfaware-v4/20260304_205901 | DeepSeek-R1-Distill-Llama-8B-4bit | 4bit | 8 | 1.2e-4 | - | selfaware-v4 | 6min |
Session 11: 1B/3B 4bit Control Tasks (2026-03-05)
| Checkpoint | Base Model | Quant | Rank | LR | End LR | Data | Duration |
|---|---|---|---|---|---|---|---|
| mlx-qlora-triviaqa-v2/20260305_102156 | Llama-3.2-1B-Instruct-4bit | 4bit | 8 | 2e-4 | - | triviaqa-v2 | 8min |
| mlx-qlora-commonsenseqa/20260305_105002 | Llama-3.2-1B-Instruct-4bit | 4bit | 8 | 2e-4 | - | commonsenseqa | 14min |
| mlx-qlora-arc/20260305_112220 | Llama-3.2-1B-Instruct-4bit | 4bit | 8 | 2e-4 | - | arc | 1min |
| mlx-qlora-triviaqa-v2/20260305_114137 | Llama-3.2-3B-Instruct-4bit | 4bit | 8 | 2e-4 | - | triviaqa-v2 | 31min |
| mlx-qlora-commonsenseqa/20260305_133046 | Llama-3.2-3B-Instruct-4bit | 4bit | 8 | 2e-4 | - | commonsenseqa | 47min |
Incomplete Checkpoints (no experiment_config.json)
| Checkpoint | Notes |
|---|---|
| mlx-qlora-commonsenseqa/20260305_125422 | No config |
| mlx-qlora-exploretom/20260303_145327 | No config |
| mlx-qlora-mbpp/20260223_105639 | No config |
| mlx-qlora-selfaware-edited-2/20260225_152202 | No config (presumed 8B 4bit r16; cross-eval results exist) |
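Incomplete checkpoints like those above can be detected mechanically: a run directory counts as complete only if it contains experiment_config.json. A minimal sketch (the on-disk layout is assumed to be adapter-name/timestamp, matching the checkpoint names in this registry):

```python
from pathlib import Path
import tempfile

def find_incomplete(root: Path) -> list[str]:
    """Return adapter/timestamp run directories lacking experiment_config.json."""
    incomplete = []
    for run_dir in sorted(root.glob("*/*")):
        if run_dir.is_dir() and not (run_dir / "experiment_config.json").exists():
            incomplete.append(f"{run_dir.parent.name}/{run_dir.name}")
    return incomplete

# Demo with a throwaway tree mimicking the registry layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    complete = root / "mlx-qlora-gsm8k" / "20260218_164927"
    complete.mkdir(parents=True)
    (complete / "experiment_config.json").write_text("{}")
    broken = root / "mlx-qlora-mbpp" / "20260223_105639"
    broken.mkdir(parents=True)
    missing = find_incomplete(root)

print(missing)  # ['mlx-qlora-mbpp/20260223_105639']
```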
2. Cross-Evaluation Results
Accuracy of each trained adapter (rows) on each evaluation task (columns), per session. In-domain results are shown in bold.
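Each matrix below can be reduced to two numbers per adapter: the in-domain (diagonal) accuracy, and the mean accuracy over the remaining off-domain cells. A sketch using two rows of the Session 1 (3B bf16) numbers from this registry ('-' cells omitted):

```python
# Cross-eval matrix: adapter -> {eval_task: accuracy %}. Values are the
# Session 1 exploretom and gsm8k rows from this registry.
matrix = {
    "exploretom": {"exploretom": 88.6, "selfaware": 9.8, "gsm8k": 8.3, "triviaqa": 45.0},
    "gsm8k":      {"exploretom": 50.0, "selfaware": 15.4, "gsm8k": 75.0, "triviaqa": 54.6},
}

def summarize(matrix):
    """Map each adapter to (in-domain accuracy, mean off-domain accuracy)."""
    out = {}
    for adapter, row in matrix.items():
        in_domain = row[adapter]
        off_domain = [v for task, v in row.items() if task != adapter]
        out[adapter] = (in_domain, sum(off_domain) / len(off_domain))
    return out

summary = summarize(matrix)
```

This makes the forgetting asymmetry explicit: the gsm8k adapter keeps a much higher off-domain mean (40.0) than the exploretom adapter (about 21.0) despite a lower in-domain score.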
Session 1: 3B bf16 (20260218)
| Adapter \ Eval | ExploreToM | SelfAware | GSM8K | TriviaQA | MBPP |
|---|---|---|---|---|---|
| exploretom | **88.6%** | 9.8% | 8.3% | 45.0% | - |
| selfaware | 26.8% | **34.7%** | 0.1% | 41.7% | - |
| gsm8k | 50.0% | 15.4% | **75.0%** | 54.6% | - |
| triviaqa | 59.0% | 11.3% | 12.2% | **48.4%** | - |
| mbpp | 45.9% | 13.4% | 36.8% | 52.0% | **51.5%** |
Session 2: 1B bf16 (20260219_001856)
| Adapter \ Eval | ExploreToM | SelfAware | GSM8K | TriviaQA | MBPP |
|---|---|---|---|---|---|
| exploretom | **84.3%** | 5.3% | 2.5% | 24.9% | 25.8% |
| selfaware | 23.0% | **32.3%** | 0.0% | 16.7% | 1.0% |
| gsm8k | 42.8% | 11.3% | **51.7%** | 38.1% | 33.0% |
| triviaqa | 51.1% | 6.8% | 4.3% | **31.3%** | 3.1% |
| mbpp | 30.3% | 11.0% | 33.2% | 36.7% | **36.1%** |
Session 3: 1B bf16 Repeat (20260219_234546)
| Adapter \ Eval | ExploreToM | SelfAware | GSM8K | TriviaQA | MBPP |
|---|---|---|---|---|---|
| exploretom | **84.1%** | 5.9% | 2.3% | 25.1% | 24.7% |
| selfaware | 25.1% | **32.0%** | 0.0% | 16.7% | 0.0% |
| gsm8k | 40.6% | 11.3% | **50.8%** | 37.9% | 33.0% |
| triviaqa | 44.6% | 6.5% | 3.7% | **31.3%** | 6.2% |
| mbpp | 33.1% | 11.6% | 34.6% | 36.9% | **38.1%** |
Session 4: 1B bf16 SelfAware-Edited (20260220_221232)
| Adapter \ Eval | ExploreToM | SelfAware | GSM8K | TriviaQA | MBPP |
|---|---|---|---|---|---|
| selfaware-edited | 26.8% | **25.2%** | 2.9% | 19.4% | 0.0% |
Session 5: 8B 4bit QLoRA (20260221_002706)
| Adapter \ Eval | ExploreToM | SelfAware | GSM8K | TriviaQA | MBPP |
|---|---|---|---|---|---|
| exploretom | **91.0%** | 11.9% | 9.8% | 46.6% | 57.7% |
| selfaware-edited | 36.5% | **29.7%** | 11.4% | 48.4% | 28.9% |
| gsm8k | 42.5% | 13.4% | **77.0%** | 60.3% | 61.9% |
| triviaqa | 66.0% | 12.2% | 15.9% | **57.4%** | 58.8% |
| mbpp | 40.9% | 12.8% | 29.9% | 60.3% | **51.5%** |
Session 6: 3B 4bit SelfAware-Edited (20260222_234559)
| Adapter \ Eval | ExploreToM | SelfAware | GSM8K | TriviaQA | MBPP |
|---|---|---|---|---|---|
| selfaware-edited | 30.7% | **29.7%** | 3.2% | 37.1% | 40.2% |
Session 7: 8B 4bit SelfAware-Edited-2 (20260223_234150)
| Adapter \ Eval | ExploreToM | SelfAware | GSM8K | TriviaQA | MBPP | HumanEval |
|---|---|---|---|---|---|---|
| selfaware-edited-2 | 32.1% | **21.1%** | 9.1% | 53.1% | 28.9% | 0.0% |
Session 8: 8B bf16 LoRA r16 SelfAware-Edited-2 (20260225_130700)
| Adapter \ Eval | ExploreToM | SelfAware | GSM8K | TriviaQA | MBPP | HumanEval |
|---|---|---|---|---|---|---|
| selfaware-edited-2 | 31.1% | **15.7%** | 10.0% | 51.7% | 17.5% | 3.1% |
Incomplete: 8B 4bit r16 SelfAware-Edited-2 (20260225_152202)
| Adapter \ Eval | ExploreToM | SelfAware | GSM8K | TriviaQA | MBPP | HumanEval |
|---|---|---|---|---|---|---|
| selfaware-edited-2 | 32.4% | **12.5%** | 10.7% | 51.9% | 50.5% | 34.4% |
Session 9: 1B/3B 4bit SelfAware-v4 + ExploreToM (20260303)
1B SelfAware-v4 (20260303_163207)
| Adapter \ Eval | ExploreToM | SelfAware | GSM8K | TriviaQA | HumanEval+ | MBPP+ |
|---|---|---|---|---|---|---|
| selfaware-v4 | 43.6% | **26.1%** | 6.4% | 32.8% | 0.0% | 0.0% |
3B SelfAware-v4 (20260303_170445)
| Adapter \ Eval | ExploreToM | SelfAware | GSM8K | TriviaQA | HumanEval+ | MBPP+ |
|---|---|---|---|---|---|---|
| selfaware-v4 | 62.8% | **30.3%** | 32.7% | 48.9% | 0.0% | 0.0% |
1B ExploreToM (20260303_175328)
| Adapter \ Eval | ExploreToM | SelfAware | GSM8K | TriviaQA | HumanEval+ | MBPP+ |
|---|---|---|---|---|---|---|
| exploretom | **87.1%** | 2.7% | 3.9% | 21.2% | 0.0% | 0.0% |
Session 10: 8B 4bit SelfAware-v4 Variants (20260304)
Llama 8B, LR=2e-4, End LR=0.1 (20260304_111631)
| Adapter \ Eval | ExploreToM | SelfAware | GSM8K | TriviaQA | HumanEval+ | MBPP+ |
|---|---|---|---|---|---|---|
| selfaware-v4 | 48.8% | **35.6%** | 18.7% | 56.1% | 0.0% | 0.0% |
Llama 8B, LR=1.2e-4 (20260304_134246)
| Adapter \ Eval | ExploreToM | SelfAware | GSM8K | TriviaQA | HumanEval+ | MBPP+ | ARC | BoolQ | CSQA |
|---|---|---|---|---|---|---|---|---|---|
| selfaware-v4 | 58.9% | **34.4%** | 33.7% | 55.8% | 0.0% | 0.0% | 6.6% | 12.8% | 20.8% |
DeepSeek-R1-Distill 8B, LR=1.2e-4 (20260304_205901)
| Adapter \ Eval | ExploreToM | SelfAware | GSM8K | TriviaQA | HumanEval+ | MBPP+ | ARC | BoolQ | CSQA |
|---|---|---|---|---|---|---|---|---|---|
| selfaware-v4 | 66.2% | **14.5%** | 6.2% | 33.4% | 0.0% | 0.0% | 5.4% | 71.3% | 13.8% |
Session 11: 1B/3B 4bit Control Tasks (20260305)
1B TriviaQA-v2 (20260305_102156)
| Adapter \ Eval | ExploreToM | SelfAware | GSM8K | TriviaQA | HumanEval+ | MBPP+ | ARC | BoolQ | CSQA |
|---|---|---|---|---|---|---|---|---|---|
| triviaqa-v2 | 66.7% | 4.2% | 4.6% | **29.4%** | 12.5% | 8.1% | 2.7% | 46.9% | 10.7% |
1B CommonsenseQA (20260305_105002)
| Adapter \ Eval | ExploreToM | SelfAware | GSM8K | TriviaQA | HumanEval+ | MBPP+ | ARC | BoolQ | CSQA |
|---|---|---|---|---|---|---|---|---|---|
| commonsenseqa | 28.9% | 2.4% | 2.3% | 20.0% | 0.0% | 0.0% | 2.7% | 3.0% | **17.5%** |
1B ARC (20260305_112220)
| Adapter \ Eval | ExploreToM | SelfAware | GSM8K | TriviaQA | HumanEval+ | MBPP+ | ARC | BoolQ | CSQA |
|---|---|---|---|---|---|---|---|---|---|
| arc | 38.0% | 3.0% | 5.5% | 21.2% | 12.5% | 13.5% | **4.6%** | 54.6% | 11.5% |
3B TriviaQA-v2 (20260305_114137)
| Adapter \ Eval | ExploreToM | SelfAware | GSM8K | TriviaQA | HumanEval+ | MBPP+ | ARC | BoolQ | CSQA |
|---|---|---|---|---|---|---|---|---|---|
| triviaqa-v2 | 56.5% | 5.9% | 10.8% | **44.3%** | 31.2% | 45.9% | 5.0% | 72.6% | 15.8% |
3B CommonsenseQA (20260305_133046)
| Adapter \ Eval | ExploreToM | SelfAware | GSM8K | TriviaQA | HumanEval+ | MBPP+ | ARC | BoolQ | CSQA |
|---|---|---|---|---|---|---|---|---|---|
| commonsenseqa | 45.8% | 4.2% | 4.8% | 33.5% | 0.0% | 0.0% | 3.5% | 62.0% | **24.5%** |
3. Baseline Evaluations
Llama-3.2-3B-Instruct (bf16) — Initial Baseline (20260218_012033)
| ExploreToM | SelfAware | GSM8K | TriviaQA |
|---|---|---|---|
| 32.4% | 12.8% | 66.2% | 47.4% |
Llama-3.2-1B-Instruct (4bit) — Baseline (20260303_142557)
| ExploreToM | SelfAware | GSM8K | TriviaQA | HumanEval+ | MBPP+ |
|---|---|---|---|---|---|
| 44.4% | 19.9% | 40.4% | 31.3% | 43.8% | 32.4% |
Llama-3.2-1B-Instruct (bf16) — Baseline (20260303_153003)
| ExploreToM | SelfAware | GSM8K | TriviaQA | HumanEval+ | MBPP+ |
|---|---|---|---|---|---|
| 43.5% | 17.2% | 54.9% | 40.9% | 50.0% | 48.6% |
Llama-3.2-3B-Instruct (4bit) — Baseline (20260303_154903)
| ExploreToM | SelfAware | GSM8K | TriviaQA | HumanEval+ | MBPP+ |
|---|---|---|---|---|---|
| 33.2% | 22.6% | 75.1% | 46.6% | 50.0% | 48.6% |
Llama-3.1-8B-Instruct (4bit) — Baseline (20260222_032515)
| ExploreToM | SelfAware | GSM8K | TriviaQA | MBPP |
|---|---|---|---|---|
| 35.5% | 17.2% | 66.6% | 61.9% | 60.8% |
DeepSeek-R1-Distill-Llama-8B (4bit) — Baseline (20260304_170234)
Note: accuracy may be understated due to issues handling the model's think blocks.
| ExploreToM | SelfAware | GSM8K | TriviaQA | HumanEval+ | MBPP+ | ARC | BoolQ | CSQA |
|---|---|---|---|---|---|---|---|---|
| 24.6% | 32.0% | 13.0% | 14.4% | 0.0% | 16.2% | 0.8% | 1.4% | 6.9% |
Llama-3.1-8B-Instruct (4bit) — Baseline with Extended Tasks (20260304_185337)
| ExploreToM | SelfAware | GSM8K | TriviaQA | HumanEval+ | MBPP+ | ARC | BoolQ | CSQA |
|---|---|---|---|---|---|---|---|---|
| 34.1% | 24.9% | 77.0% | 59.4% | 62.5% | 70.3% | 9.3% | 65.3% | 21.5% |
4. SelfAware Data Version Reference
| Data Version | Train Size | IDK Ratio | Key Change |
|---|---|---|---|
| selfaware | 3,032 | ~50% | Original |
| selfaware-edited | 2,198 | ~5% | Reduced IDK ratio |
| selfaware-edited-2 | - | - | Further refinement |
| selfaware-v4 | - | - | Final version |
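The IDK ratio above is the fraction of training examples whose target is an "I don't know" refusal. A sketch of computing it; the record schema and the `answerable` field are hypothetical, since the actual dataset format is not recorded in this registry:

```python
# Hypothetical SelfAware-style records: each example is either answerable
# or an "I don't know" (IDK) refusal target. Field name 'answerable' is assumed.
examples = [
    {"question": "What is the capital of France?", "answerable": True},
    {"question": "What am I thinking right now?", "answerable": False},
    {"question": "What is 2 + 2?", "answerable": True},
    {"question": "How many grains of sand are on Earth, exactly?", "answerable": False},
]

idk_ratio = sum(not ex["answerable"] for ex in examples) / len(examples)
print(f"IDK ratio: {idk_ratio:.0%}")  # IDK ratio: 50%
```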
5. Key Observations
Catastrophic Forgetting Patterns
- ExploreToM adapters: strongest in-domain performance (84-91%), but nearly destroy GSM8K (2-10%)
- SelfAware adapters: drive GSM8K and MBPP down to 0% → the most severe catastrophic forgetting
- GSM8K/TriviaQA adapters: out-of-domain performance relatively well preserved
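One way to quantify the forgetting described above is the drop from the base model's accuracy on each held-out task. A sketch using the 3B GSM8K figures from this registry (bf16 baseline 66.2%, Session 1 adapter scores):

```python
# Forgetting = baseline accuracy minus post-finetune accuracy on a held-out
# task. Numbers are the 3B bf16 GSM8K figures from this registry.
baseline_gsm8k = 66.2
after_finetune = {"exploretom": 8.3, "selfaware": 0.1, "triviaqa": 12.2}

forgetting = {name: round(baseline_gsm8k - acc, 1)
              for name, acc in after_finetune.items()}
print(forgetting)  # {'exploretom': 57.9, 'selfaware': 66.1, 'triviaqa': 54.0}
```

On this metric the selfaware adapter is the worst offender (66.1-point drop), consistent with the pattern noted above.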
Patterns by Model Size
- 1B: lower performance than 3B/8B on every task, with more severe forgetting
- 3B → 8B: modest in-domain gains, much better out-of-domain preservation
- 8B QLoRA: best out-of-domain preservation (4-bit quantization appears to favor retaining pre-trained capabilities)
Quantization Comparison (Session 8 vs Incomplete)
- bf16 LoRA r16: MBPP 17.5%, HumanEval 3.1% → severe loss of coding ability
- 4bit QLoRA r16: MBPP 50.5%, HumanEval 34.4% → coding ability largely preserved
- 4bit QLoRA fine-tunes while preserving more of the base model's capability