Introduction

Methods

Results

Discussion

QLoRA

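We want to fine-tune with LoRA, but the base model itself is still large, so we want to shrink its memory footprint.
Quantizing the base model lowers parameter precision and costs some performance. Since we are going to run SFT anyway, we keep only the LoRA adapters at the original (higher) precision so that training can recover to a similar or even better level. The computation itself runs at the adapters' higher precision (the low-precision base weights are dequantized on the fly for each matrix multiplication), while the base model weights held in memory stay at low precision, which reduces memory pressure.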

e.g. base model: NF4, (Q)LoRA adapters: BF16
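A minimal NumPy sketch of this storage/compute split, assuming a simple blockwise absmax 4-bit quantizer as a stand-in for NF4 and float32 as a stand-in for BF16 (NumPy has no bfloat16). `quantize_4bit`, `dequantize_4bit`, and `QLoRALinear` are hypothetical names, not the bitsandbytes or PEFT API. The base weight is kept only as 4-bit codes plus one scale per 64-weight block and is dequantized inside the forward pass, while the LoRA matrices `A` and `B` stay in higher precision.

```python
import numpy as np

BLOCK = 64  # weights per quantization block (QLoRA uses blocksize 64)

def quantize_4bit(w: np.ndarray, block: int = BLOCK):
    """Blockwise absmax quantization to 4-bit codes in [-7, 7].
    Uniform stand-in for NF4, which instead uses 16 normal-quantile levels."""
    flat = w.reshape(-1, block)
    scale = np.abs(flat).max(axis=1, keepdims=True)  # one fp32 constant per block
    scale[scale == 0] = 1.0
    codes = np.round(flat / scale * 7).astype(np.int8)  # int8 container; values fit in 4 bits
    return codes, scale.astype(np.float32)

def dequantize_4bit(codes: np.ndarray, scale: np.ndarray, shape) -> np.ndarray:
    return (codes.astype(np.float32) / 7 * scale).reshape(shape)

class QLoRALinear:
    """Frozen 4-bit base weight + trainable higher-precision LoRA adapters (hypothetical)."""

    def __init__(self, weight: np.ndarray, r: int = 8, alpha: float = 16, seed: int = 0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.shape = weight.shape
        self.codes, self.scale = quantize_4bit(weight)  # low-precision storage, frozen
        # float32 stands in for BF16 here.
        self.A = (0.01 * rng.standard_normal((r, d_in))).astype(np.float32)
        self.B = np.zeros((d_out, r), dtype=np.float32)  # zero init: adapter starts as a no-op
        self.scaling = alpha / r

    def forward(self, x: np.ndarray) -> np.ndarray:
        w = dequantize_4bit(self.codes, self.scale, self.shape)  # dequantize on the fly
        return x @ w.T + self.scaling * ((x @ self.A.T) @ self.B.T)

rng = np.random.default_rng(42)
W = rng.standard_normal((128, 256)).astype(np.float32)
layer = QLoRALinear(W)
x = rng.standard_normal((4, 256)).astype(np.float32)
y = layer.forward(x)

fp16_bytes = W.size * 2  # 16-bit storage of the base weight
q4_bytes = layer.codes.size // 2 + layer.scale.nbytes  # two 4-bit codes per byte + fp32 scales
print(y.shape, fp16_bytes, q4_bytes)
```

Because `B` starts at zero, the layer initially reproduces the dequantized base model exactly; during fine-tuning only `A` and `B` would be updated, and the packed base weight here takes a bit over a quarter of its 16-bit size.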

Abstract

Results

LoRA was made lightweight enough that a 65B model can be fine-tuned on a single 48GB GPU.
Despite 4-bit quantization, performance matches that of fine-tuning the 16-bit model.

Methods

Quantize the pretrained LLM to 4 bits, then freeze it.
Attach LoRA adapters on top.
Backpropagation passes through the frozen 4-bit model, and only the adapters are updated.
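These three steps can be sketched with two stacked layers in NumPy, under the same assumptions as a toy: a uniform absmax 4-bit quantizer instead of NF4, float32 instead of BF16, a random regression target, and hypothetical class and function names. The 4-bit codes are never written after quantization; each `backward` computes gradients only for `A` and `B`, and the gradient that reaches the first layer's adapters has to pass through the second layer's frozen, dequantized weight.

```python
import numpy as np

def quantize_4bit(w, block=64):
    # Blockwise absmax 4-bit codes in [-7, 7]; a uniform stand-in for NF4.
    flat = w.reshape(-1, block)
    scale = np.abs(flat).max(axis=1, keepdims=True)
    scale[scale == 0] = 1.0
    return np.round(flat / scale * 7).astype(np.int8), scale

def dequantize_4bit(codes, scale, shape):
    return (codes.astype(np.float32) / 7 * scale).reshape(shape)

class QLoRALinear:
    def __init__(self, w, r, alpha, rng):
        self.shape = w.shape
        self.codes, self.scale = quantize_4bit(w)  # frozen: never written after this
        self.A = (0.1 * rng.standard_normal((r, w.shape[1]))).astype(np.float32)
        self.B = np.zeros((w.shape[0], r), dtype=np.float32)
        self.s = alpha / r

    def forward(self, x):
        self.x = x
        self.w = dequantize_4bit(self.codes, self.scale, self.shape)
        self.h = x @ self.A.T  # low-rank activation, cached for backward
        return x @ self.w.T + self.s * (self.h @ self.B.T)

    def backward(self, gy, lr):
        g_h = self.s * (gy @ self.B)    # dL/dh
        g_B = self.s * (gy.T @ self.h)  # dL/dB
        g_A = g_h.T @ self.x            # dL/dA
        # dL/dx flows through the frozen dequantized weight and the adapter path.
        g_x = gy @ self.w + g_h @ self.A
        self.A -= lr * g_A              # only the adapters are updated
        self.B -= lr * g_B
        return g_x

rng = np.random.default_rng(0)
d = 64
l1 = QLoRALinear(rng.standard_normal((d, d)).astype(np.float32) / 8, r=4, alpha=8, rng=rng)
l2 = QLoRALinear(rng.standard_normal((d, d)).astype(np.float32) / 8, r=4, alpha=8, rng=rng)
codes_before = (l1.codes.copy(), l2.codes.copy())
x = rng.standard_normal((32, d)).astype(np.float32)
target = rng.standard_normal((32, d)).astype(np.float32)

losses = []
for step in range(300):
    pred = l2.forward(l1.forward(x))
    diff = pred - target
    losses.append(float(np.mean(diff ** 2)))
    g_pred = 2 * diff / diff.size        # gradient of the mean squared error
    g_mid = l2.backward(g_pred, lr=0.2)  # gradient w.r.t. l1's output
    l1.backward(g_mid, lr=0.2)

print(f"loss {losses[0]:.3f} -> {losses[-1]:.3f}")
```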

QLoRA Hyperparameters

LoRA rank matters less than one might expect.
Instead, $\alpha = 0.5 \cdot lr$ is recommended (for models in roughly the 8B to 13B range).
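For reference, in the standard LoRA formulation (Hu et al.) the update `B @ A` is multiplied by α/r, so rank and α interact, and the number of trainable parameters grows only linearly in r. A small sketch of that bookkeeping; the hidden size 4096 and α = 16 are illustrative assumptions, the helper names are hypothetical, and this only shows how r and α enter the adapter, not where the recommendation above comes from.

```python
def lora_param_count(d_out: int, d_in: int, r: int) -> int:
    # A has shape (r, d_in) and B has shape (d_out, r).
    return r * (d_in + d_out)

def lora_scaling(alpha: float, r: int) -> float:
    # Standard LoRA multiplies the low-rank update B @ A by alpha / r.
    return alpha / r

d = 4096     # hypothetical hidden size of one square projection matrix
alpha = 16   # illustrative value, not a recommendation from this note
for r in (8, 16, 32, 64):
    n = lora_param_count(d, d, r)
    print(f"r={r:>2}: {n:,} adapter params "
          f"({n / (d * d):.2%} of the frozen matrix), alpha/r = {lora_scaling(alpha, r)}")
```

Even at r = 64 the adapter is a few percent of a single frozen matrix, which is one reason rank has little effect on memory; with α held fixed, raising r also shrinks the effective scale of the update.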

Expected VRAM Usage

According to Appendix D, the Guanaco models need roughly:

65B: 41 GB

33B: 21 GB

13B: 10 GB

7B: 5 GB
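A rough, weights-only estimate to put these numbers in context. It assumes the nominal parameter counts (65B, 33B, 13B, 7B), 4-bit storage, and the quantization-constant overhead described in the QLoRA paper: one 32-bit absmax per 64-weight block (0.5 bits/param), reduced to about 0.127 bits/param by double quantization (8-bit constants with second-level blocks of 256). Adapters, activations, and runtime overhead are not modeled, which presumably accounts for part of the gap to the reported figures.

```python
def weight_gb(n_params: float, bits_per_param: float) -> float:
    # Decimal gigabytes needed to store n_params weights at the given bit width.
    return n_params * bits_per_param / 8 / 1e9

SINGLE_QUANT_OVERHEAD = 32 / 64                   # fp32 absmax per 64-weight block: 0.5 bits/param
DOUBLE_QUANT_OVERHEAD = 8 / 64 + 32 / (64 * 256)  # about 0.127 bits/param with double quantization

reported_gb = {65: 41, 33: 21, 13: 10, 7: 5}  # figures quoted above (Appendix D)
for size_b, reported in reported_gb.items():
    n = size_b * 1e9  # nominal parameter count
    print(f"{size_b}B: 16-bit {weight_gb(n, 16):.1f} GB | "
          f"4-bit {weight_gb(n, 4 + SINGLE_QUANT_OVERHEAD):.1f} GB | "
          f"4-bit + DQ {weight_gb(n, 4 + DOUBLE_QUANT_OVERHEAD):.1f} GB | "
          f"reported {reported} GB")
```

For 65B this gives 130 GB of weights at 16 bits, far beyond a single 48GB GPU, versus roughly 33.5 GB at 4 bits with double quantization.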

Juhyeon's Blog

QLoRA - Efficient Finetuning of Quantized LLMs