Summary

Definition:

An n-gram is a chunk of $n$ consecutive words.

Example

text: “the students opened their ???”

unigrams : “the”, “students”, “opened”, “their”

bigrams: “the students”, “students opened”, “opened their”

trigrams: “the students opened”, “students opened their”

4-gram: “the students opened their”
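
The listing above can be reproduced with a few lines of Python. This is only a minimal sketch: the helper name `ngrams` and the whitespace tokenization are assumptions made for illustration, not part of the original note.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined back into a string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Assumption: naive whitespace tokenization is enough for this toy sentence.
tokens = "the students opened their".split()

for n in range(1, 5):
    print(n, ngrams(tokens, n))
```

Asking for an n larger than the sentence length simply yields an empty list.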

Important

Idea: use the frequencies of different n-grams to predict the next token (next-token prediction).

Tip

In practice, the sparsity problem makes 4-grams about the upper limit.

Assumption

Summary

An n-gram model rests on a Markov assumption: only the preceding $n - 1$ words matter for predicting the next word.
$P (x^{(t + 1)} \mid x^{(t)}, \dots, x^{(t - n + 2)})$
By the definition of conditional probability, this can be written as a ratio of joint probabilities:
$= \frac{P ( x ^{(t + 1)} , x ^{(t)} , \dots , x ^{(t - n + 2)} )}{P ( x ^{(t)} , \dots , x ^{(t - n + 2)} )}$
which is approximated by counts over a large corpus:
$\approx \frac{\text{count} ( x ^{(t + 1)} , x ^{(t)} , \dots , x ^{(t - n + 2)} )}{\text{count} ( x ^{(t)} , \dots , x ^{(t - n + 2)} )}$
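
The count ratio can be tried out on a toy corpus. The sketch below is illustrative only: the function names `ngram_counts` and `cond_prob` and the three-sentence corpus are made up for this example, and no smoothing is applied.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def cond_prob(tokens, context, word):
    """Estimate P(word | context) as count(context + word) / count(context)."""
    n = len(context) + 1
    numerator = ngram_counts(tokens, n)[tuple(context) + (word,)]
    denominator = ngram_counts(tokens, n - 1)[tuple(context)]
    # An unseen context has a zero denominator, so no estimate exists.
    return numerator / denominator if denominator else None

# Hypothetical toy corpus, whitespace-tokenized.
corpus = ("the students opened their books . "
          "the students opened their laptops . "
          "the students opened their books .").split()

context = ["students", "opened", "their"]  # n - 1 = 3 words, i.e. a 4-gram model
p_books = cond_prob(corpus, context, "books")      # 2 of the 3 continuations
p_laptops = cond_prob(corpus, context, "laptops")  # 1 of the 3 continuations
print(p_books, p_laptops)
```

Recounting the corpus on every call keeps the sketch short; a real model would build the count tables once and reuse them.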

Limits

Sparsity problem

Summary

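Intuitively, a window (context) of four words is far too short. Going beyond it is impractical, though, because most longer n-grams never appear in the corpus, so their counts are zero.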
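
One rough way to see sparsity is to check how many n-grams of an unseen sentence were ever observed in the training text as n grows. The sketch below uses two made-up sentences; `ngram_list` and `seen_fraction` are hypothetical helper names, and a corpus this small only illustrates the trend.

```python
def ngram_list(tokens, n):
    """All runs of n consecutive tokens, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def seen_fraction(train, held_out, n):
    """Fraction of held-out n-grams that occur at least once in train."""
    seen = set(ngram_list(train, n))
    targets = ngram_list(held_out, n)
    return sum(g in seen for g in targets) / len(targets)

# Hypothetical toy data, whitespace-tokenized.
train = ("the students opened their books and "
         "the teachers opened their laptops").split()
held_out = "the teachers opened their books".split()

coverage = {n: seen_fraction(train, held_out, n) for n in range(1, 6)}
print(coverage)
```

Every held-out n-gram up to n = 3 is covered here, but only half of the 4-grams and none of the 5-grams, even though every individual word was seen in training.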

Storage Problem

Summary

A count has to be stored for every n-gram, so the storage required grows with the size of the corpus.
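
As a rough illustration, the number of entries a count table has to hold can be measured directly. The helper `table_size` and the two toy texts below are assumptions for this sketch, not measurements from a real corpus.

```python
from collections import Counter

def table_size(tokens, max_n):
    """Number of distinct n-grams (table entries) for n = 1..max_n."""
    table = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            table[tuple(tokens[i:i + n])] += 1
    return len(table)

# Hypothetical toy texts: `large` extends `small` with a second clause.
small = "the students opened their books".split()
large = small + "and the teachers opened their laptops".split()

print(table_size(small, 3), table_size(large, 3))
```

The longer text needs more entries (26 versus 12 for n up to 3), and raising `max_n` adds entries on top of that.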

Text Generating

Summary

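An n-gram model stores how often each n-gram occurs, so it can be used to generate the next word.
Given the probability distribution that these counts define over the next word, there are two ways to generate it:

Pick the single most frequent word (deterministic).

The same input then always produces the same response.

Sample from the frequency distribution.

This is what current generative LLMs such as GPT do, so the same input can produce different responses.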
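
The two strategies can be sketched side by side. Everything here is illustrative: `build_model` and `next_word` are hypothetical helpers, the corpus is a toy one, and sampling weights are the raw follower counts.

```python
import random
from collections import Counter, defaultdict

def build_model(tokens, n):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        model[context][tokens[i + n - 1]] += 1
    return model

def next_word(model, context, sample=False, rng=random):
    """Pick the next word greedily, or sample it in proportion to its count."""
    followers = model.get(tuple(context))
    if not followers:
        return None  # unseen context: there are no counts to use
    if not sample:
        return followers.most_common(1)[0][0]  # deterministic choice
    words, counts = zip(*followers.items())
    return rng.choices(words, weights=counts, k=1)[0]

# Hypothetical toy corpus, whitespace-tokenized.
corpus = ("the students opened their books . "
          "the students opened their laptops . "
          "the students opened their books .").split()
model = build_model(corpus, 4)
context = ["students", "opened", "their"]

print(next_word(model, context))               # always the most frequent follower
print(next_word(model, context, sample=True))  # may vary between runs
```

Passing a seeded `random.Random` instance as `rng` makes the sampled output reproducible when that is needed.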

Example

The generated text is fairly grammatical but incoherent.
Increasing the window size $n$ does not solve this: the counts become sparser and the model grows larger at the same time.

