BLEU

Summary

BLEU (Bilingual Evaluation Understudy)
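A metric used mainly to evaluate machine translation.
How much does the sentence produced by the model overlap with the reference sentence?
→ Precision-based: “what fraction of the n-grams I generated also appear in the reference”
→ Scores range from 0 to 1. Higher is better.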

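How it is computed

N-gram precision: measures how many of the hypothesis's 1-grams, 2-grams, 3-grams, and 4-grams also appear in the reference.

Geometric mean: the 1- to 4-gram precisions are usually combined with a geometric mean.

Brevity Penalty (BP): precision alone rewards very short outputs, so BP penalizes hypotheses that are shorter than the reference.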
$\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left(\sum_{n=1}^{N} w_{n} \log p_{n}\right)$
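Here $p_{n}$ is the n-gram precision, $w_{n}$ is its weight (usually uniform, $w_{n} = 1/N$), and $\mathrm{BP}$ is the length penalty: $1$ when the hypothesis length $c$ exceeds the reference length $r$, and $e^{1 - r/c}$ otherwise.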
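The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes a single reference, whitespace-tokenized input, uniform weights, and no smoothing. The helper names (`ngrams`, `modified_precision`, `brevity_penalty`, `sentence_bleu`) are made up for this note and are not any library's API.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(hyp, ref, n):
    """Clipped n-gram precision p_n of hyp against a single reference."""
    hyp_counts = Counter(ngrams(hyp, n))
    ref_counts = Counter(ngrams(ref, n))
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    # Each hypothesis n-gram is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return matched / total


def brevity_penalty(hyp_len, ref_len):
    """BP = 1 if the hypothesis is longer than the reference, else exp(1 - r/c)."""
    if hyp_len == 0:
        return 0.0
    if hyp_len > ref_len:
        return 1.0
    return math.exp(1.0 - ref_len / hyp_len)


def sentence_bleu(hyp, ref, max_n=4):
    """BLEU = BP * exp(sum_n w_n * log p_n) with uniform weights w_n = 1 / max_n."""
    precisions = [modified_precision(hyp, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        # Assumption: log(0) is undefined, so unsmoothed BLEU is treated as 0 here.
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(len(hyp), len(ref)) * math.exp(log_mean)


ref = "the cat is on the mat".split()
print(sentence_bleu(ref, ref))                         # identical sentences
print(sentence_bleu("the cat".split(), ref, max_n=2))  # perfect precision, but short
```

In the second call both precisions are 1, so the score is determined entirely by the brevity penalty. Practical toolkits typically add smoothing and compute BLEU at the corpus level, which this sketch leaves out.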

One-line example

Reference: the cat is on the mat
Hypothesis: the cat the cat on the mat
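→ Word by word, every hypothesis token appears in the reference (7/7), but clipped precision is used so that the repeated “the cat” is not counted more often than it occurs in the reference. With clipping, the unigram precision drops to 5/7.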
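The clipping in this example can be checked directly. The snippet below is a small sketch that assumes whitespace tokenization and looks only at unigrams; the variable names are illustrative.

```python
from collections import Counter

reference = "the cat is on the mat".split()
hypothesis = "the cat the cat on the mat".split()

hyp_counts = Counter(hypothesis)
ref_counts = Counter(reference)

# Unclipped: every hypothesis word that appears anywhere in the reference counts.
unclipped = sum(c for w, c in hyp_counts.items() if w in ref_counts)
# Clipped: each word counts at most as many times as it occurs in the reference.
clipped = sum(min(c, ref_counts[w]) for w, c in hyp_counts.items())

print(f"unclipped: {unclipped}/{len(hypothesis)}")  # 7/7
print(f"clipped:   {clipped}/{len(hypothesis)}")    # 5/7
```

The reference contains “the” twice and “cat” once, so the third “the” and the second “cat” in the hypothesis earn no credit.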

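Limitations

Looks only at surface word overlap → the same meaning expressed with different words scores low (e.g. “fast” vs “quick”).

Word order is captured only locally, through the higher-order n-grams; grammar and meaning are not evaluated.

This is why embedding-based metrics such as BERTScore were introduced.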
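To make the synonym limitation concrete, here is a minimal sketch. It assumes one reference and whitespace tokens; `clipped_precision` is a hypothetical helper written for this note, and the example sentences are invented. Swapping a single word for a synonym costs one unigram match and two of the three bigram matches.

```python
from collections import Counter


def clipped_precision(hyp, ref, n):
    """Clipped n-gram precision of hyp against one reference (both token lists)."""
    hyp_counts = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    ref_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    matched = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
    return matched / max(sum(hyp_counts.values()), 1)


reference = "the quick brown fox".split()
synonym_hyp = "the fast brown fox".split()  # same meaning, one word swapped

p1 = clipped_precision(synonym_hyp, reference, 1)  # 3 of 4 unigrams match
p2 = clipped_precision(synonym_hyp, reference, 2)  # 1 of 3 bigrams match
print(p1, p2)
```

Because the 3-gram and 4-gram precisions are both 0 for this pair, the unsmoothed 4-gram BLEU would be 0 even though the two sentences mean the same thing.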

Sister metric

ROUGE: used to evaluate summarization; recall-based (the counterpart to BLEU).

Original paper

Papineni, K., Roukos, S., Ward, T., & Zhu, W. J. (2002).
“BLEU: a Method for Automatic Evaluation of Machine Translation.”
Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL 2002).
