Experiment Checkpoint Registry

Auto-generated: 2026-03-05
35 checkpoints total (+ 4 incomplete) | Effective batch size: 16 in all cases


1. Checkpoint List

Session 1: 3B bf16 (2026-02-18)

| Checkpoint | Base Model | Quant | Rank | LR | End LR | Data | Duration |
|---|---|---|---|---|---|---|---|
| mlx-lora-exploretom/20260218_104723 | Llama-3.2-3B-Instruct-bf16 | bf16 | 8 | 2e-4 | - | exploretom | 238min |
| mlx-lora-selfaware/20260218_155336 | Llama-3.2-3B-Instruct-bf16 | bf16 | 8 | 2e-4 | - | selfaware | 8min |
| mlx-lora-gsm8k/20260218_164927 | Llama-3.2-3B-Instruct-bf16 | bf16 | 8 | 2e-4 | - | gsm8k | 74min |
| mlx-lora-triviaqa/20260218_200753 | Llama-3.2-3B-Instruct-bf16 | bf16 | 8 | 2e-4 | - | triviaqa | 26min |
| mlx-lora-mbpp/20260218_222920 | Llama-3.2-3B-Instruct-bf16 | bf16 | 8 | 2e-4 | - | mbpp | <1min |

Session 2: 1B bf16 (2026-02-19 #1)

| Checkpoint | Base Model | Quant | Rank | LR | End LR | Data | Duration |
|---|---|---|---|---|---|---|---|
| mlx-lora-exploretom/20260219_001856 | Llama-3.2-1B-Instruct-bf16 | bf16 | 8 | 2e-4 | - | exploretom | 72min |
| mlx-lora-selfaware/20260219_001856 | Llama-3.2-1B-Instruct-bf16 | bf16 | 8 | 2e-4 | - | selfaware | 12min |
| mlx-lora-gsm8k/20260219_001856 | Llama-3.2-1B-Instruct-bf16 | bf16 | 8 | 2e-4 | - | gsm8k | 23min |
| mlx-lora-triviaqa/20260219_001856 | Llama-3.2-1B-Instruct-bf16 | bf16 | 8 | 2e-4 | - | triviaqa | 8min |
| mlx-lora-mbpp/20260219_001856 | Llama-3.2-1B-Instruct-bf16 | bf16 | 8 | 2e-4 | - | mbpp | <1min |

Session 3: 1B bf16 Repeat (2026-02-19 #2)

| Checkpoint | Base Model | Quant | Rank | LR | End LR | Data | Duration |
|---|---|---|---|---|---|---|---|
| mlx-lora-exploretom/20260219_234546 | Llama-3.2-1B-Instruct-bf16 | bf16 | 8 | 2e-4 | - | exploretom | 80min |
| mlx-lora-selfaware/20260219_234546 | Llama-3.2-1B-Instruct-bf16 | bf16 | 8 | 2e-4 | - | selfaware | 13min |
| mlx-lora-gsm8k/20260219_234546 | Llama-3.2-1B-Instruct-bf16 | bf16 | 8 | 2e-4 | - | gsm8k | 27min |
| mlx-lora-triviaqa/20260219_234546 | Llama-3.2-1B-Instruct-bf16 | bf16 | 8 | 2e-4 | - | triviaqa | 9min |
| mlx-lora-mbpp/20260219_234546 | Llama-3.2-1B-Instruct-bf16 | bf16 | 8 | 2e-4 | - | mbpp | <1min |

Session 4: 1B bf16 SelfAware-Edited (2026-02-20)

| Checkpoint | Base Model | Quant | Rank | LR | End LR | Data | Duration |
|---|---|---|---|---|---|---|---|
| mlx-lora-selfaware-edited/20260220_221232 | Llama-3.2-1B-Instruct-bf16 | bf16 | 8 | 2e-4 | - | selfaware-edited | 9min |

Session 5: 8B 4bit QLoRA (2026-02-21)

| Checkpoint | Base Model | Quant | Rank | LR | End LR | Data | Duration |
|---|---|---|---|---|---|---|---|
| mlx-qlora-exploretom/20260221_002706 | Llama-3.1-8B-Instruct-4bit | 4bit | 8 | 2e-4 | 0.1 | exploretom | 722min |
| mlx-qlora-selfaware-edited/20260221_002706 | Llama-3.1-8B-Instruct-4bit | 4bit | 8 | 2e-4 | 0.1 | selfaware-edited | 9min |
| mlx-qlora-gsm8k/20260221_002706 | Llama-3.1-8B-Instruct-4bit | 4bit | 8 | 2e-4 | 0.1 | gsm8k | 228min |
| mlx-qlora-triviaqa/20260221_002706 | Llama-3.1-8B-Instruct-4bit | 4bit | 8 | 2e-4 | 0.1 | triviaqa | 74min |
| mlx-qlora-mbpp/20260221_002706 | Llama-3.1-8B-Instruct-4bit | 4bit | 8 | 2e-4 | 0.1 | mbpp | <1min |

Session 6: 3B 4bit SelfAware-Edited (2026-02-22)

| Checkpoint | Base Model | Quant | Rank | LR | End LR | Data | Duration |
|---|---|---|---|---|---|---|---|
| mlx-qlora-selfaware-edited/20260222_234559 | Llama-3.2-3B-Instruct-4bit | 4bit | 8 | 2e-4 | - | selfaware-edited | 6min |

Session 7: 8B 4bit SelfAware-Edited-2 (2026-02-23)

| Checkpoint | Base Model | Quant | Rank | LR | End LR | Data | Duration |
|---|---|---|---|---|---|---|---|
| mlx-qlora-selfaware-edited-2/20260223_234150 | Llama-3.1-8B-Instruct-4bit | 4bit | 8 | 2e-4 | 0.1 | selfaware-edited-2 | 12min |

Session 8: 8B bf16 LoRA r16 SelfAware-Edited-2 (2026-02-25)

| Checkpoint | Base Model | Quant | Rank | LR | End LR | Data | Duration |
|---|---|---|---|---|---|---|---|
| mlx-lora-selfaware-edited-2/20260225_130700 | Llama-3.1-8B-Instruct (bf16) | bf16 | 16 | 1.5e-4 | 0.1 | selfaware-edited-2 | 9min |

Session 9: 1B/3B 4bit SelfAware-v4 + ExploreToM (2026-03-03)

| Checkpoint | Base Model | Quant | Rank | LR | End LR | Data | Duration |
|---|---|---|---|---|---|---|---|
| mlx-qlora-selfaware-v4/20260303_163207 | Llama-3.2-1B-Instruct-4bit | 4bit | 8 | 2e-4 | - | selfaware-v4 | 12min |
| mlx-qlora-selfaware-v4/20260303_170445 | Llama-3.2-3B-Instruct-4bit | 4bit | 8 | 2e-4 | - | selfaware-v4 | 3min |
| mlx-qlora-exploretom/20260303_175328 | Llama-3.2-1B-Instruct-4bit | 4bit | 8 | 2e-4 | - | exploretom | 105min |

Session 10: 8B 4bit SelfAware-v4 Variants (2026-03-04)

| Checkpoint | Base Model | Quant | Rank | LR | End LR | Data | Duration |
|---|---|---|---|---|---|---|---|
| mlx-qlora-selfaware-v4/20260304_111631 | Llama-3.1-8B-Instruct-4bit | 4bit | 8 | 2e-4 | 0.1 | selfaware-v4 | 8min |
| mlx-qlora-selfaware-v4/20260304_134246 | Llama-3.1-8B-Instruct-4bit | 4bit | 8 | 1.2e-4 | - | selfaware-v4 | 8min |
| mlx-qlora-selfaware-v4/20260304_205901 | DeepSeek-R1-Distill-Llama-8B-4bit | 4bit | 8 | 1.2e-4 | - | selfaware-v4 | 6min |

Session 11: 1B/3B 4bit Control Tasks (2026-03-05)

| Checkpoint | Base Model | Quant | Rank | LR | End LR | Data | Duration |
|---|---|---|---|---|---|---|---|
| mlx-qlora-triviaqa-v2/20260305_102156 | Llama-3.2-1B-Instruct-4bit | 4bit | 8 | 2e-4 | - | triviaqa-v2 | 8min |
| mlx-qlora-commonsenseqa/20260305_105002 | Llama-3.2-1B-Instruct-4bit | 4bit | 8 | 2e-4 | - | commonsenseqa | 14min |
| mlx-qlora-arc/20260305_112220 | Llama-3.2-1B-Instruct-4bit | 4bit | 8 | 2e-4 | - | arc | 1min |
| mlx-qlora-triviaqa-v2/20260305_114137 | Llama-3.2-3B-Instruct-4bit | 4bit | 8 | 2e-4 | - | triviaqa-v2 | 31min |
| mlx-qlora-commonsenseqa/20260305_133046 | Llama-3.2-3B-Instruct-4bit | 4bit | 8 | 2e-4 | - | commonsenseqa | 47min |

Incomplete Checkpoints (no experiment_config.json)

| Checkpoint | Notes |
|---|---|
| mlx-qlora-commonsenseqa/20260305_125422 | No config |
| mlx-qlora-exploretom/20260303_145327 | No config |
| mlx-qlora-mbpp/20260223_105639 | No config |
| mlx-qlora-selfaware-edited-2/20260225_152202 | No config (presumed 8B 4bit r16; cross-eval results exist) |
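Incomplete checkpoints like the four above can be flagged mechanically by checking each `adapter/timestamp` directory for `experiment_config.json`. A minimal stdlib sketch; the function name and root path are hypothetical, and the directory layout is assumed to match the registry's `adapter-name/YYYYMMDD_HHMMSS` convention:

```python
from pathlib import Path

def find_incomplete_checkpoints(root: str) -> list[str]:
    """Return adapter/timestamp dirs under `root` lacking experiment_config.json."""
    incomplete = []
    for adapter_dir in sorted(Path(root).iterdir()):
        if not adapter_dir.is_dir():
            continue
        for ckpt in sorted(adapter_dir.iterdir()):
            # A checkpoint is "incomplete" when its config file is missing.
            if ckpt.is_dir() and not (ckpt / "experiment_config.json").exists():
                incomplete.append(f"{adapter_dir.name}/{ckpt.name}")
    return incomplete
```

Running this over the checkpoint root would reproduce the table above without manual bookkeeping.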

2. Cross-Eval Performance

Accuracy of each trained adapter (rows) on each evaluation task (columns), per session. In-domain results are shown in bold.

Session 1: 3B bf16 (20260218)

| Adapter \ Eval | ExploreToM | SelfAware | GSM8K | TriviaQA | MBPP |
|---|---|---|---|---|---|
| exploretom | **88.6%** | 9.8% | 8.3% | 45.0% | - |
| selfaware | 26.8% | **34.7%** | 0.1% | 41.7% | - |
| gsm8k | 50.0% | 15.4% | **75.0%** | 54.6% | - |
| triviaqa | 59.0% | 11.3% | 12.2% | **48.4%** | - |
| mbpp | 45.9% | 13.4% | 36.8% | 52.0% | **51.5%** |

Session 2: 1B bf16 (20260219_001856)

| Adapter \ Eval | ExploreToM | SelfAware | GSM8K | TriviaQA | MBPP |
|---|---|---|---|---|---|
| exploretom | **84.3%** | 5.3% | 2.5% | 24.9% | 25.8% |
| selfaware | 23.0% | **32.3%** | 0.0% | 16.7% | 1.0% |
| gsm8k | 42.8% | 11.3% | **51.7%** | 38.1% | 33.0% |
| triviaqa | 51.1% | 6.8% | 4.3% | **31.3%** | 3.1% |
| mbpp | 30.3% | 11.0% | 33.2% | 36.7% | **36.1%** |

Session 3: 1B bf16 Repeat (20260219_234546)

| Adapter \ Eval | ExploreToM | SelfAware | GSM8K | TriviaQA | MBPP |
|---|---|---|---|---|---|
| exploretom | **84.1%** | 5.9% | 2.3% | 25.1% | 24.7% |
| selfaware | 25.1% | **32.0%** | 0.0% | 16.7% | 0.0% |
| gsm8k | 40.6% | 11.3% | **50.8%** | 37.9% | 33.0% |
| triviaqa | 44.6% | 6.5% | 3.7% | **31.3%** | 6.2% |
| mbpp | 33.1% | 11.6% | 34.6% | 36.9% | **38.1%** |

Session 4: 1B bf16 SelfAware-Edited (20260220_221232)

| Adapter \ Eval | ExploreToM | SelfAware | GSM8K | TriviaQA | MBPP |
|---|---|---|---|---|---|
| selfaware-edited | 26.8% | **25.2%** | 2.9% | 19.4% | 0.0% |

Session 5: 8B 4bit QLoRA (20260221_002706)

| Adapter \ Eval | ExploreToM | SelfAware | GSM8K | TriviaQA | MBPP |
|---|---|---|---|---|---|
| exploretom | **91.0%** | 11.9% | 9.8% | 46.6% | 57.7% |
| selfaware-edited | 36.5% | **29.7%** | 11.4% | 48.4% | 28.9% |
| gsm8k | 42.5% | 13.4% | **77.0%** | 60.3% | 61.9% |
| triviaqa | 66.0% | 12.2% | 15.9% | **57.4%** | 58.8% |
| mbpp | 40.9% | 12.8% | 29.9% | 60.3% | **51.5%** |

Session 6: 3B 4bit SelfAware-Edited (20260222_234559)

| Adapter \ Eval | ExploreToM | SelfAware | GSM8K | TriviaQA | MBPP |
|---|---|---|---|---|---|
| selfaware-edited | 30.7% | **29.7%** | 3.2% | 37.1% | 40.2% |

Session 7: 8B 4bit SelfAware-Edited-2 (20260223_234150)

| Adapter \ Eval | ExploreToM | SelfAware | GSM8K | TriviaQA | MBPP | HumanEval |
|---|---|---|---|---|---|---|
| selfaware-edited-2 | 32.1% | **21.1%** | 9.1% | 53.1% | 28.9% | 0.0% |

Session 8: 8B bf16 LoRA r16 SelfAware-Edited-2 (20260225_130700)

| Adapter \ Eval | ExploreToM | SelfAware | GSM8K | TriviaQA | MBPP | HumanEval |
|---|---|---|---|---|---|---|
| selfaware-edited-2 | 31.1% | **15.7%** | 10.0% | 51.7% | 17.5% | 3.1% |

Incomplete: 8B 4bit r16 SelfAware-Edited-2 (20260225_152202)

| Adapter \ Eval | ExploreToM | SelfAware | GSM8K | TriviaQA | MBPP | HumanEval |
|---|---|---|---|---|---|---|
| selfaware-edited-2 | 32.4% | **12.5%** | 10.7% | 51.9% | 50.5% | 34.4% |

Session 9: 1B/3B 4bit SelfAware-v4 + ExploreToM (20260303)

1B SelfAware-v4 (20260303_163207)

| Adapter \ Eval | ExploreToM | SelfAware | GSM8K | TriviaQA | HumanEval+ | MBPP+ |
|---|---|---|---|---|---|---|
| selfaware-v4 | 43.6% | **26.1%** | 6.4% | 32.8% | 0.0% | 0.0% |

3B SelfAware-v4 (20260303_170445)

| Adapter \ Eval | ExploreToM | SelfAware | GSM8K | TriviaQA | HumanEval+ | MBPP+ |
|---|---|---|---|---|---|---|
| selfaware-v4 | 62.8% | **30.3%** | 32.7% | 48.9% | 0.0% | 0.0% |

1B ExploreToM (20260303_175328)

| Adapter \ Eval | ExploreToM | SelfAware | GSM8K | TriviaQA | HumanEval+ | MBPP+ |
|---|---|---|---|---|---|---|
| exploretom | **87.1%** | 2.7% | 3.9% | 21.2% | 0.0% | 0.0% |

Session 10: 8B 4bit SelfAware-v4 Variants (20260304)

Llama 8B, LR=2e-4, End LR=0.1 (20260304_111631)

| Adapter \ Eval | ExploreToM | SelfAware | GSM8K | TriviaQA | HumanEval+ | MBPP+ |
|---|---|---|---|---|---|---|
| selfaware-v4 | 48.8% | **35.6%** | 18.7% | 56.1% | 0.0% | 0.0% |

Llama 8B, LR=1.2e-4 (20260304_134246)

| Adapter \ Eval | ExploreToM | SelfAware | GSM8K | TriviaQA | HumanEval+ | MBPP+ | ARC | BoolQ | CSQA |
|---|---|---|---|---|---|---|---|---|---|
| selfaware-v4 | 58.9% | **34.4%** | 33.7% | 55.8% | 0.0% | 0.0% | 6.6% | 12.8% | 20.8% |

DeepSeek-R1-Distill 8B, LR=1.2e-4 (20260304_205901)

| Adapter \ Eval | ExploreToM | SelfAware | GSM8K | TriviaQA | HumanEval+ | MBPP+ | ARC | BoolQ | CSQA |
|---|---|---|---|---|---|---|---|---|---|
| selfaware-v4 | 66.2% | **14.5%** | 6.2% | 33.4% | 0.0% | 0.0% | 5.4% | 71.3% | 13.8% |

Session 11: 1B/3B 4bit Control Tasks (20260305)

1B TriviaQA-v2 (20260305_102156)

| Adapter \ Eval | ExploreToM | SelfAware | GSM8K | TriviaQA | HumanEval+ | MBPP+ | ARC | BoolQ | CSQA |
|---|---|---|---|---|---|---|---|---|---|
| triviaqa-v2 | 66.7% | 4.2% | 4.6% | **29.4%** | 12.5% | 8.1% | 2.7% | 46.9% | 10.7% |

1B CommonsenseQA (20260305_105002)

| Adapter \ Eval | ExploreToM | SelfAware | GSM8K | TriviaQA | HumanEval+ | MBPP+ | ARC | BoolQ | CSQA |
|---|---|---|---|---|---|---|---|---|---|
| commonsenseqa | 28.9% | 2.4% | 2.3% | 20.0% | 0.0% | 0.0% | 2.7% | 3.0% | **17.5%** |

1B ARC (20260305_112220)

| Adapter \ Eval | ExploreToM | SelfAware | GSM8K | TriviaQA | HumanEval+ | MBPP+ | ARC | BoolQ | CSQA |
|---|---|---|---|---|---|---|---|---|---|
| arc | 38.0% | 3.0% | 5.5% | 21.2% | 12.5% | 13.5% | **4.6%** | 54.6% | 11.5% |

3B TriviaQA-v2 (20260305_114137)

| Adapter \ Eval | ExploreToM | SelfAware | GSM8K | TriviaQA | HumanEval+ | MBPP+ | ARC | BoolQ | CSQA |
|---|---|---|---|---|---|---|---|---|---|
| triviaqa-v2 | 56.5% | 5.9% | 10.8% | **44.3%** | 31.2% | 45.9% | 5.0% | 72.6% | 15.8% |

3B CommonsenseQA (20260305_133046)

| Adapter \ Eval | ExploreToM | SelfAware | GSM8K | TriviaQA | HumanEval+ | MBPP+ | ARC | BoolQ | CSQA |
|---|---|---|---|---|---|---|---|---|---|
| commonsenseqa | 45.8% | 4.2% | 4.8% | 33.5% | 0.0% | 0.0% | 3.5% | 62.0% | **24.5%** |

3. Baseline Performance

Llama-3.2-3B-Instruct (bf16) — Initial Baseline (20260218_012033)

| ExploreToM | SelfAware | GSM8K | TriviaQA |
|---|---|---|---|
| 32.4% | 12.8% | 66.2% | 47.4% |

Llama-3.2-1B-Instruct (4bit) — Baseline (20260303_142557)

| ExploreToM | SelfAware | GSM8K | TriviaQA | HumanEval+ | MBPP+ |
|---|---|---|---|---|---|
| 44.4% | 19.9% | 40.4% | 31.3% | 43.8% | 32.4% |

Llama-3.2-1B-Instruct (bf16) — Baseline (20260303_153003)

| ExploreToM | SelfAware | GSM8K | TriviaQA | HumanEval+ | MBPP+ |
|---|---|---|---|---|---|
| 43.5% | 17.2% | 54.9% | 40.9% | 50.0% | 48.6% |

Llama-3.2-3B-Instruct (4bit) — Baseline (20260303_154903)

| ExploreToM | SelfAware | GSM8K | TriviaQA | HumanEval+ | MBPP+ |
|---|---|---|---|---|---|
| 33.2% | 22.6% | 75.1% | 46.6% | 50.0% | 48.6% |

Llama-3.1-8B-Instruct (4bit) — Baseline (20260222_032515)

| ExploreToM | SelfAware | GSM8K | TriviaQA | MBPP |
|---|---|---|---|---|
| 35.5% | 17.2% | 66.6% | 61.9% | 60.8% |

DeepSeek-R1-Distill-Llama-8B (4bit) — Baseline (20260304_170234)

Note: accuracy may be underestimated due to issues handling the model's think blocks

| ExploreToM | SelfAware | GSM8K | TriviaQA | HumanEval+ | MBPP+ | ARC | BoolQ | CSQA |
|---|---|---|---|---|---|---|---|---|
| 24.6% | 32.0% | 13.0% | 14.4% | 0.0% | 16.2% | 0.8% | 1.4% | 6.9% |

Llama-3.1-8B-Instruct (4bit) — Baseline with Extended Tasks (20260304_185337)

| ExploreToM | SelfAware | GSM8K | TriviaQA | HumanEval+ | MBPP+ | ARC | BoolQ | CSQA |
|---|---|---|---|---|---|---|---|---|
| 34.1% | 24.9% | 77.0% | 59.4% | 62.5% | 70.3% | 9.3% | 65.3% | 21.5% |

4. SelfAware Data Version Notes

| Data version | Train size | IDK ratio | Key change |
|---|---|---|---|
| selfaware | 3,032 | ~50% | Original |
| selfaware-edited | 2,198 | ~5% | Reduced IDK ratio |
| selfaware-edited-2 | - | - | Further refinement |
| selfaware-v4 | - | - | Final version |

5. Key Observations

Catastrophic Forgetting Patterns

  • ExploreToM adapters: strongest in-domain performance (84-91%) but nearly destroy GSM8K (2-10%)
  • SelfAware adapters: drive GSM8K and MBPP down to 0% → the most severe catastrophic forgetting
  • GSM8K/TriviaQA adapters: out-of-domain performance is comparatively well preserved
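The forgetting pattern can be quantified as a per-task delta against the untuned baseline. A minimal sketch using the Session 1 exploretom row and the initial 3B bf16 baseline from Section 3 (all numbers transcribed from this registry; the function name is illustrative):

```python
# Accuracies (%) transcribed from this registry: Session 1 cross-eval and the
# initial Llama-3.2-3B-Instruct bf16 baseline (20260218_012033).
BASELINE_3B = {"ExploreToM": 32.4, "SelfAware": 12.8, "GSM8K": 66.2, "TriviaQA": 47.4}
EXPLORETOM_3B = {"ExploreToM": 88.6, "SelfAware": 9.8, "GSM8K": 8.3, "TriviaQA": 45.0}

def forgetting_delta(tuned: dict, baseline: dict) -> dict:
    """Percentage-point change vs baseline; negative values indicate forgetting."""
    return {task: round(tuned[task] - baseline[task], 1) for task in baseline}

deltas = forgetting_delta(EXPLORETOM_3B, BASELINE_3B)
# In-domain ExploreToM gains +56.2 pp while GSM8K drops -57.9 pp.
```

The same helper applied to the SelfAware rows shows GSM8K falling from its baseline to near zero, matching the bullets above.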

Patterns by Model Size

  • 1B: lower performance than 3B/8B on all tasks, and more severe forgetting
  • 3B → 8B: modest in-domain gains, much better out-of-domain retention
  • 8B QLoRA: best out-of-domain retention (4bit quantization appears to favor preserving pre-trained capability)

Quantization Comparison (Session 8 vs Incomplete)

  • bf16 LoRA r16: MBPP 17.5%, HumanEval 3.1% → severe loss of coding ability
  • 4bit QLoRA r16: MBPP 50.5%, HumanEval 34.4% → coding ability largely preserved
  • 4bit QLoRA fine-tunes while preserving base-model capability far better
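The bf16-vs-4bit gap is easiest to see as per-task differences between the two r16 cross-eval rows. A small sketch with the numbers transcribed from the Session 8 and incomplete-checkpoint tables above (variable names are illustrative):

```python
# selfaware-edited-2 cross-eval rows (%) transcribed from this registry.
BF16_R16 = {"GSM8K": 10.0, "TriviaQA": 51.7, "MBPP": 17.5, "HumanEval": 3.1}
QLORA_4BIT_R16 = {"GSM8K": 10.7, "TriviaQA": 51.9, "MBPP": 50.5, "HumanEval": 34.4}

# Percentage-point advantage of 4bit QLoRA over bf16 LoRA, per task.
advantage = {t: round(QLORA_4BIT_R16[t] - BF16_R16[t], 1) for t in BF16_R16}
# The gap is concentrated in the code tasks: MBPP +33.0, HumanEval +31.3,
# while GSM8K and TriviaQA are nearly identical.
```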