Measuring Massive Multitask Language Understanding (MMLU)

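Digest: How can we measure the breadth and depth of knowledge an LLM acquires during training? MMLU, from UC Berkeley, is a benchmark of 14,042 four-option multiple-choice questions spanning 57 subjects across STEM, the humanities, the social sciences, and professional fields. Its core insight is that an LLM's real capability should be judged by knowledge and reasoning across many disciplines rather than by a single task. The questions range from elementary to professional difficulty. At publication, few-shot GPT-3 scored 43.9% (Table 3), above the 25% random baseline but far below the estimated human-expert accuracy (~89.8%). As later models such as GPT-4 (~86%) and Claude 3.5 (~88%) approached that level, MMLU became the de facto standard on LLM leaderboards.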

Metadata

| Field | Value |
| --- | --- |
| Title | Measuring Massive Multitask Language Understanding |
| Authors | Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, Jacob Steinhardt |
| Affiliation | UC Berkeley, Columbia University |
| Year | 2020 |
| Venue | ICLR 2021, arXiv:2009.03300 |
| Links | arXiv, GitHub |
| Keywords | MMLU, multitask, knowledge, language understanding, benchmark |

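Dataset Composition

Size and Splits

| Item | Value |
| --- | --- |
| Benchmark size | 14,042 four-option multiple-choice questions (test split) |
| Test | 14,042 |
| Dev (for few-shot) | 5 per subject = 285 |
| Validation | 1,531 |
| Number of subjects | 57 |
| Answer format | One of A, B, C, D |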
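The split sizes can be sanity-checked with a few lines of arithmetic. This is a minimal sketch that uses only the figures listed in the table; the variable names are illustrative. It confirms that the dev split is exactly five examples per subject and shows that the three splits together hold 15,858 questions, so the headline 14,042 refers to the test split alone.

```python
# Split sizes as listed in the table above.
SPLITS = {"test": 14_042, "validation": 1_531, "dev": 285}
NUM_SUBJECTS = 57
DEV_PER_SUBJECT = 5

# The dev split is exactly five few-shot examples per subject.
assert SPLITS["dev"] == NUM_SUBJECTS * DEV_PER_SUBJECT

# Total across all three splits, and the share used for scoring.
total_questions = sum(SPLITS.values())
test_share = SPLITS["test"] / total_questions

print(f"all splits combined: {total_questions}")
print(f"test share of all questions: {test_share:.1%}")
```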

57 Subjects in 4 Categories

| Category | Subjects | Representative subjects |
| --- | --- | --- |
| STEM | 18 | abstract_algebra, astronomy, college_mathematics, computer_science, electrical_engineering, machine_learning, physics |
| Humanities | 13 | formal_logic, jurisprudence, moral_disputes, philosophy, prehistory, world_religions |
| Social Sciences | 12 | econometrics, high_school_geography, marketing, professional_psychology, sociology |
| Other | 14 | clinical_knowledge, global_facts, medical_genetics, nutrition, professional_accounting, professional_medicine |

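Feature/Column Structure

| Field | Description | Example |
| --- | --- | --- |
| `question` | Question text | `"Which of the following is NOT a characteristic of..."` |
| `A` | Choice A | `"Increased heart rate"` |
| `B` | Choice B | `"Decreased blood pressure"` |
| `C` | Choice C | `"Constricted pupils"` |
| `D` | Choice D | `"Dry mouth"` |
| `answer` | Correct answer | `"C"` |
| `subject` | Subject name | `"anatomy"` |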
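To make the record layout concrete, the sketch below models one row with the seven fields from the table and renders it as a question block. The `MMLUItem` class and `format_item` helper are hypothetical names introduced here for illustration, and the "question, lettered choices, `Answer:`" layout is an assumed prompt format rather than something this post specifies. The question text is left truncated exactly as in the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MMLUItem:
    """One row, using the field names from the table above (hypothetical class)."""
    question: str
    A: str
    B: str
    C: str
    D: str
    answer: str   # one of "A", "B", "C", "D"
    subject: str


def format_item(item: MMLUItem, include_answer: bool = False) -> str:
    """Render a row as a question block; the layout is an assumed convention."""
    lines = [item.question]
    for letter in "ABCD":
        lines.append(f"{letter}. {getattr(item, letter)}")
    # Few-shot examples show the answer; the question being scored does not.
    lines.append(f"Answer: {item.answer}" if include_answer else "Answer:")
    return "\n".join(lines)


example = MMLUItem(
    question="Which of the following is NOT a characteristic of...",
    A="Increased heart rate",
    B="Decreased blood pressure",
    C="Constricted pupils",
    D="Dry mouth",
    answer="C",
    subject="anatomy",
)

print(format_item(example, include_answer=True))
```

Keeping the answer as a letter rather than an index mirrors the table; a loader that stores integer indices would need a small mapping step before formatting.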

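Difficulty Levels

The subject name itself implies the difficulty level:

| Level | Example subjects | Description |
| --- | --- | --- |
| Elementary | elementary_mathematics | Basic concepts |
| High school | high_school_biology, high_school_physics, high_school_chemistry | High school level |
| College | college_physics, college_chemistry | Undergraduate level |
| Professional | professional_medicine, professional_law, professional_accounting | Licensing-exam level |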

Data Examples

Example 1: High School Physics

Question: A ball is thrown vertically upward with a speed of 20 m/s.
Ignoring air resistance, what is the maximum height reached?

A. 10 m
B. 20 m
C. 30 m
D. 40 m

Answer: B (h = v²/(2g) = 400/20 = 20 m, taking g ≈ 10 m/s²)
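The one-line justification in the answer can be expanded into a short derivation. The sketch below assumes constant gravitational acceleration and the rounded value g ≈ 10 m/s², which is what the arithmetic 400/20 implies; the symbols v₀ and h_max are introduced here for clarity.

```latex
\begin{align*}
v^2 &= v_0^2 - 2gh
  && \text{(constant-acceleration kinematics)} \\
0 &= v_0^2 - 2g\,h_{\max}
  && \text{(velocity is zero at the highest point)} \\
h_{\max} &= \frac{v_0^2}{2g}
  = \frac{(20\ \text{m/s})^2}{2 \times 10\ \text{m/s}^2}
  = 20\ \text{m}
\end{align*}
```

With g = 9.8 m/s² the result is about 20.4 m, so choice B is still the closest option.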

Example 2: Abstract Algebra

Question: Find the order of the element 5 in the multiplicative
group of integers modulo 8.

A. 1
B. 2
C. 4
D. 8

Answer: B (5² = 25 ≡ 1 mod 8, so order = 2)
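The answer can be checked mechanically by raising the element to successive powers until it returns to 1. The function below is a minimal brute-force sketch; `multiplicative_order` is a name chosen here, not a library call.

```python
from math import gcd


def multiplicative_order(a: int, n: int) -> int:
    """Smallest k >= 1 with a**k congruent to 1 (mod n), found by repeated multiplication."""
    if n < 2:
        raise ValueError("modulus must be at least 2")
    if gcd(a, n) != 1:
        # Only elements coprime to n belong to the multiplicative group mod n.
        raise ValueError(f"{a} is not a unit modulo {n}")
    order, power = 1, a % n
    while power != 1:
        power = (power * a) % n
        order += 1
    return order


print(multiplicative_order(5, 8))  # 5 * 5 = 25 = 3 * 8 + 1, so this prints 2
```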

Example 3: Professional Medicine

Question: A 45-year-old woman presents with fatigue, weight gain,
and cold intolerance. TSH is elevated and free T4 is low.
What is the most likely diagnosis?

A. Graves' disease
B. Hashimoto's thyroiditis
C. Thyroid cancer
D. Subacute thyroiditis

Answer: B

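Why This Research?

Core Question

How much broad world knowledge and specialized knowledge does an LLM acquire during pretraining?

Limitations of Existing Approaches

| Limitation | Description |
| --- | --- |
| Single-task evaluation | Benchmarks such as SuperGLUE focus on NLU and do not measure world knowledge |
| Narrow scope | Evaluating only specific fields (math, coding) cannot capture overall ability |
| Low difficulty ceiling | Existing benchmarks are saturated and can no longer distinguish between models |

Core Insight

To measure an LLM's “intelligence,” evaluation should mirror the human education system and cover many disciplines from elementary to professional level. Like a comprehensive exam, this measures the range of a model's knowledge.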

Method

Framework Overview

graph TB
    A["57 subjects<br/>question collection"] --> B["Unified 4-option format"]
    B --> C["MMLU dataset<br/>14,042 questions"]

    C --> D["0-shot evaluation"]
    C --> E["5-shot evaluation<br/>(dev set examples)"]

    D --> F["Per-subject accuracy"]
    E --> F
    F --> G["Average over 4 categories"]
    G --> H["Overall average"]
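The aggregation steps at the bottom of the diagram (per-subject accuracy, then category averages, then an overall figure) can be sketched as follows. Everything here is illustrative: the four-subject `CATEGORY` mapping is a small subset of the category table, the `results` list is made-up data rather than real model output, and the helper names are hypothetical. The overall score is computed as an equal-weight mean over subjects, following the macro average this post describes.

```python
from collections import defaultdict
from statistics import mean

# Illustrative subject -> category mapping (a four-subject subset).
CATEGORY = {
    "abstract_algebra": "STEM",
    "philosophy": "Humanities",
    "sociology": "Social Sciences",
    "nutrition": "Other",
}

# Made-up (subject, is_correct) outcomes; NOT real model results.
results = [
    ("abstract_algebra", True), ("abstract_algebra", False),
    ("abstract_algebra", False), ("abstract_algebra", False),
    ("philosophy", True), ("philosophy", True),
    ("sociology", True), ("sociology", False),
    ("nutrition", True), ("nutrition", True),
    ("nutrition", True), ("nutrition", False),
]


def subject_accuracy(rows):
    """Fraction of correct answers per subject."""
    by_subject = defaultdict(list)
    for subject, correct in rows:
        by_subject[subject].append(correct)
    return {s: sum(v) / len(v) for s, v in by_subject.items()}


def category_average(per_subject, category_of):
    """Equal-weight mean of subject accuracies within each category."""
    by_category = defaultdict(list)
    for subject, acc in per_subject.items():
        by_category[category_of[subject]].append(acc)
    return {c: mean(v) for c, v in by_category.items()}


per_subject = subject_accuracy(results)
per_category = category_average(per_subject, CATEGORY)
overall = mean(per_subject.values())  # every subject counts equally

print(per_subject)
print(per_category)
print(f"overall (macro over subjects): {overall:.3f}")
```

Averaging over subjects rather than over individual questions keeps large subjects from dominating the headline number; a question-weighted (micro) average would generally give a different value.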

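Evaluation Protocol

Few-shot: five examples from each subject's dev set are included in the prompt
Log-probability based: the log probabilities of the A/B/C/D tokens are compared and the highest-scoring option is taken as the answer
Macro average: per-subject accuracies over the 57 subjects are averaged with equal weight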
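The first two steps can be sketched in a few functions: assemble a few-shot prompt from dev examples, then pick the letter with the highest log probability. This is a sketch under stated assumptions, not the reference implementation. The header sentence and `Answer:` layout are an assumed prompt format, the helper names are hypothetical, the dev example is made up for illustration (a real run would use five per subject), and the log probabilities are invented numbers standing in for whatever a model API would return.

```python
from typing import Dict, List, Optional, Tuple

LETTERS = ("A", "B", "C", "D")
Example = Tuple[str, List[str], str]  # (question, four choices, answer letter)


def format_example(question: str, choices: List[str],
                   answer: Optional[str] = None) -> str:
    """Question, lettered choices, then 'Answer:' (filled in for dev shots)."""
    lines = [question] + [f"{l}. {c}" for l, c in zip(LETTERS, choices)]
    lines.append("Answer:" if answer is None else f"Answer: {answer}")
    return "\n".join(lines)


def build_few_shot_prompt(subject: str, dev_examples: List[Example],
                          question: str, choices: List[str]) -> str:
    """Header + solved dev examples + the unsolved test question."""
    header = ("The following are multiple choice questions (with answers) "
              f"about {subject.replace('_', ' ')}.")
    shots = [format_example(q, c, a) for q, c, a in dev_examples]
    return "\n\n".join([header, *shots, format_example(question, choices)])


def pick_answer(letter_logprobs: Dict[str, float]) -> str:
    """Return the letter whose next-token log probability is highest."""
    return max(LETTERS, key=lambda letter: letter_logprobs[letter])


# Made-up dev shot for illustration only.
dev = [("Find the order of the element 3 in the multiplicative group of "
        "integers modulo 8.", ["1", "2", "4", "8"], "B")]

prompt = build_few_shot_prompt(
    "abstract_algebra", dev,
    "Find the order of the element 5 in the multiplicative group of "
    "integers modulo 8.", ["1", "2", "4", "8"],
)

# Invented log probabilities standing in for a model's output.
fake_logprobs = {"A": -2.3, "B": -0.4, "C": -1.9, "D": -3.1}
prediction = pick_answer(fake_logprobs)
print(prompt)
print("prediction:", prediction)
```

Comparing only the four letter tokens means the model never has to generate free text, which keeps scoring deterministic and independent of decoding settings.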

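Findings

Main Results (5-shot)

| Model | Overall | STEM | Humanities | Social Sci | Other |
| --- | --- | --- | --- | --- | --- |
| Random | 25.0% | 25.0% | 25.0% | 25.0% | 25.0% |
| GPT-3 (175B) | 43.9% | 36.7% | 40.8% | 50.5% | 48.8% |
| UnifiedQA | 48.9% | - | - | - | - |
| Human Expert | 89.8% | - | - | - | - |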

(Table 3)

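Key Findings

Gap to humans: even the best model (2020) trails human experts by about 40 percentage points (Table 3)
STEM is hardest: STEM subjects such as mathematics, physics, and computer science receive the lowest scores (Table 3)
Extreme variation across subjects: some subjects reach 70% or more while others sit at 25% (chance level), which points to uneven knowledge coverage
Model size and MMLU: the relationship is log-linear, so MMLU scores at one scale can be used to predict performance at the next scale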
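The first two findings follow directly from the numbers in the results table. The sketch below recomputes them; the dictionaries copy the table values as given in this post, and the helper name `gap_to_expert` is illustrative.

```python
# Overall 5-shot accuracy (%) as listed in the results table.
overall = {
    "Random": 25.0,
    "GPT-3 (175B)": 43.9,
    "UnifiedQA": 48.9,
    "Human Expert": 89.8,
}

# GPT-3 per-category accuracy (%) from the same table.
gpt3_by_category = {
    "STEM": 36.7,
    "Humanities": 40.8,
    "Social Sciences": 50.5,
    "Other": 48.8,
}


def gap_to_expert(model: str) -> float:
    """Percentage-point gap between a model and the human-expert estimate."""
    return round(overall["Human Expert"] - overall[model], 1)


best_model = max(
    (m for m in overall if m not in ("Random", "Human Expert")),
    key=overall.get,
)
hardest_category = min(gpt3_by_category, key=gpt3_by_category.get)

print(f"best 2020 model: {best_model}, gap {gap_to_expert(best_model)} pp")
print(f"GPT-3 gap: {gap_to_expert('GPT-3 (175B)')} pp")
print(f"GPT-3 weakest category: {hardest_category}")
```

The "about 40 percentage points" figure corresponds to the best listed model; for GPT-3 alone the gap is a few points larger.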

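Theoretical Significance

The De Facto Standard for LLM Evaluation

Since 2020, MMLU has become a core benchmark reported in nearly every LLM paper and leaderboard. The “headline numbers” for GPT-4, Claude, Gemini, Llama, and others routinely include an MMLU score. However, since 2024 the top models have reached 88-90% and the benchmark has begun to saturate, which has prompted successors such as MMLU-Pro.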

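Key Terms

| Term | Definition |
| --- | --- |
| MMLU | Massive Multitask Language Understanding. A broad knowledge benchmark with 57 subjects and about 14k questions |
| Multiple choice (4-option) | A format in which the correct answer is chosen from four options, A/B/C/D |
| Few-shot | Performing a task with a small number of examples included in the prompt |
| Macro average | An average that gives each category or subject equal weight |
| Benchmark saturation | The point at which model performance approaches a benchmark's ceiling and the benchmark loses its discriminative power |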
