CoQA: A Conversational Question Answering Challenge

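Digest: CoQA is a conversational question answering (QA) benchmark. Questions and answers unfold as a conversation over 8,399 passages drawn from 7 domains, for a total of 127,000+ QA pairs. The core design principle is that every answer is provided in two forms, a free-form natural-language answer (input_text) and a rationale extracted from the passage, so extractive and abstractive QA can be evaluated at the same time. As a conversation progresses, answering increasingly requires coreference resolution and ellipsis resolution, and roughly 25% of the questions are Yes/No questions, so the benchmark measures a range of reasoning types.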

Metadata

Field	Value
Title	CoQA: A Conversational Question Answering Challenge
Authors	Siva Reddy, Danqi Chen, Christopher D. Manning
Affiliation	Stanford University
Venue	Transactions of the Association for Computational Linguistics (TACL), 2019
arXiv	1808.07042
Total size	127,000+ QA pairs / 8,399 conversations / 8,399 passages
Evaluation metric	Macro-average F1 score (mean of per-domain F1)
License	Released for research use

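Dataset composition

Size and splits

Split	Passages	QA pairs	Notes
Train	7,199	108,647	5 in-domain domains
Dev	500	7,983	5 in-domain domains (100 passages each)
Test	700	~12,000	All 7 domains (not publicly released)

Average conversation length: ~15 QA turns per passage
Share of Yes/No questions: about 25%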
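
As a quick sanity check on the figures above, the split sizes can be added up in a few lines of Python. This is only an illustrative sketch: the dictionary name `SPLITS` is made up for this note, and 127,000 is used as a round stand-in for the "127,000+" total.

```python
# Passage and QA-pair counts copied from the split table above.
# Test QA pairs are left out because the table only gives an approximate figure.
SPLITS = {
    "train": {"passages": 7199, "qa_pairs": 108647},
    "dev": {"passages": 500, "qa_pairs": 7983},
    "test": {"passages": 700, "qa_pairs": None},
}

total_passages = sum(s["passages"] for s in SPLITS.values())
known_qa_pairs = sum(
    s["qa_pairs"] for s in SPLITS.values() if s["qa_pairs"] is not None
)

# Assumption: 127,000 is a round stand-in for the "127,000+" total in this note.
avg_turns = 127_000 / total_passages

print(total_passages)       # 8399
print(known_qa_pairs)       # 116630
print(round(avg_turns, 1))  # 15.1
```

The passage counts add up to the 8,399 quoted in the metadata, and the ratio of QA pairs to passages is consistent with the ~15 turns per conversation.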

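Feature / Column structure

Feature	Type	Description
`id`	string	Unique conversation identifier
`source`	string	Source domain (e.g., "mctest", "cnn")
`story`	string	Passage text
`questions`	list[string]	Questions, ordered as they occur in the conversation
`answers.input_text`	string	Free-form natural-language answer (may be abstractive)
`answers.span_start`	int	Start position of the rationale
`answers.span_end`	int	End position of the rationale
`answers.span_text`	string	Extracted rationale: the original text from the passage

The dual answer structure is what sets CoQA apart: input_text is the answer as a person would naturally phrase it, while span_text is the rationale extracted verbatim from the passage.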
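
To make the dual answer structure concrete, here is a minimal Python sketch of one conversation, using the Jessica passage from the examples in this note. The `Turn` and `Conversation` classes and the `rationale_is_in_story` helper are hypothetical names invented for illustration, and treating `span_end` as an exclusive character offset is an assumption; the official release ships JSON whose exact layout may differ from the flattened column view above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    question: str
    input_text: str  # free-form answer
    span_start: int  # character offset where the rationale starts
    span_end: int    # character offset where it ends (assumed exclusive)
    span_text: str   # rationale copied from the passage


@dataclass
class Conversation:
    id: str
    source: str
    story: str
    turns: List[Turn]


def rationale_is_in_story(story: str, turn: Turn) -> bool:
    """Check that the offsets really point at the stored rationale text."""
    return story[turn.span_start:turn.span_end] == turn.span_text


story = (
    "Jessica went to the store to buy some groceries. "
    "She needed milk, bread, and eggs."
)
rationale = "She needed milk, bread, and eggs"
start = story.index(rationale)

conv = Conversation(
    id="example-1",  # made-up identifier
    source="mctest",
    story=story,
    turns=[
        Turn("What did she need?", "Milk, bread, and eggs",
             start, start + len(rationale), rationale)
    ],
)

print(all(rationale_is_in_story(conv.story, t) for t in conv.turns))  # True
```

Keeping both fields side by side makes it easy to score a model either on the free-form answer or on the span it points to.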

Domain distribution

Domain	Source	In/Out-of-domain	Conversations
Children’s Stories	MCTest	In-domain	750
Literature	Project Gutenberg	In-domain	1,815
Mid/High School Exams	RACE	In-domain	1,911
News	CNN/DailyMail	In-domain	1,902
Wikipedia	Wikipedia	In-domain	1,821
Reddit	Writing Prompts	Out-of-domain	100
Science	AI2 Science	Out-of-domain	100

The out-of-domain sets (Reddit, Science) are excluded from both train and dev and appear only in the test set, which is what lets the benchmark evaluate domain transfer.
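
The in-domain / out-of-domain distinction can be applied mechanically from the `source` field. The sketch below is a hypothetical helper, not part of any official tooling: only "mctest" and "cnn" appear as source tags in this note, so the "reddit" and "science" tags used here are assumptions.

```python
from collections import defaultdict

# Assumption: these tag spellings are illustrative; check the data you load.
OUT_OF_DOMAIN_SOURCES = {"reddit", "science"}


def split_by_domain_type(records):
    """Group conversation ids into in-domain and out-of-domain buckets."""
    buckets = defaultdict(list)
    for rec in records:
        if rec["source"] in OUT_OF_DOMAIN_SOURCES:
            buckets["out_of_domain"].append(rec["id"])
        else:
            buckets["in_domain"].append(rec["id"])
    return dict(buckets)


records = [
    {"id": "a", "source": "mctest"},
    {"id": "b", "source": "cnn"},
    {"id": "c", "source": "reddit"},
    {"id": "d", "source": "science"},
]

print(split_by_domain_type(records))
# {'in_domain': ['a', 'b'], 'out_of_domain': ['c', 'd']}
```

Reporting scores separately for the two buckets shows how much of a model's performance carries over to domains it never saw during training.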

Data examples

Example 1: Children’s Story (Coreference Resolution)

Passage: “Jessica went to the store to buy some groceries. She needed milk, bread, and eggs. The store was crowded, but she found everything quickly. The total came to $15, which she thought was reasonable.”

Turn	Question	Answer (input_text)	Rationale (span_text)
Q1	Where did Jessica go?	The store	"Jessica went to the store"
Q2	What did she need?	Milk, bread, and eggs	"She needed milk, bread, and eggs"
Q3	Was the store empty?	No	"The store was crowded"
Q4	How much did it cost?	$15	"The total came to $15"
Q5	Did she think it was too much?	No, she thought it was reasonable	"she thought was reasonable"

Q2: "she" → Jessica (requires coreference resolution)
Q3, Q5: Yes/No questions

Example 2: News Domain (Multi-hop + Ellipsis)

Passage: (CNN) “The president announced a new climate policy on Monday. Environmental groups praised the move, calling it ‘historic.’ Opposition leaders criticized the plan, arguing it would hurt the economy.”

Turn	Question	Answer (input_text)	Rationale (span_text)
Q1	What was announced?	A new climate policy	"announced a new climate policy"
Q2	When?	Monday	"on Monday"
Q3	Who praised it?	Environmental groups	"Environmental groups praised the move"
Q4	What did they call it?	Historic	"calling it ‘historic"
Q5	Who opposed?	Opposition leaders	"Opposition leaders criticized the plan"

Q2: ellipsis resolution, i.e. "When [was it announced]?"
Q4: "they" → Environmental groups, "it" → the move
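
Questions such as "When?" only make sense next to the earlier turns, so a model typically receives the conversation history together with the current question. The sketch below shows one simple way to build such an input; the `build_input` helper, the `Q:`/`A:` markers, and the `max_history` parameter are hypothetical choices for illustration, not the format used by any particular CoQA baseline.

```python
from typing import List, Tuple


def build_input(
    history: List[Tuple[str, str]], question: str, max_history: int = 2
) -> str:
    """Prepend the last `max_history` (question, answer) pairs to the question."""
    recent = history[-max_history:] if max_history > 0 else []
    parts = [f"Q: {q} A: {a}" for q, a in recent]
    parts.append(f"Q: {question}")
    return " ".join(parts)


history = [("What was announced?", "A new climate policy")]
print(build_input(history, "When?"))
# Q: What was announced? A: A new climate policy Q: When?
```

With the previous turn in view, the elided "When [was it announced]?" becomes recoverable; how many turns to keep is a tuning decision.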

Example 3: Science Domain (Out-of-Domain)

Passage: “Photosynthesis is the process by which plants convert sunlight into energy. Chlorophyll, the green pigment in leaves, absorbs light…”

Turn	Question	Answer (input_text)
Q1	What is photosynthesis?	The process plants use to convert sunlight into energy
Q2	What absorbs light?	Chlorophyll
Q3	What color is it?	Green

Q3: "it" → Chlorophyll (the referent comes from the previous turn's answer, so the coreference chain spans multiple turns)

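Why this research?

Earlier QA benchmarks (SQuAD, TriviaQA, etc.) cover only single-turn question answering and do not reflect how people actually seek information: by asking a series of questions within a conversational context. CoQA addresses three gaps:

Conversational context dependency: the questions and answers of earlier turns affect how the next question is interpreted, so simple passage retrieval is not enough and dialogue state tracking is required.
Moving past the extractive vs. abstractive dichotomy: earlier benchmarks evaluate either extractive QA (SQuAD) or generative QA (NarrativeQA) but not both, whereas CoQA's dual answer structure benchmarks the two paradigms together.
Cross-domain generalization: 7 heterogeneous domains plus an out-of-domain test measure a model's ability to transfer across domains.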

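Method

flowchart TD
    subgraph Data_Collection["Data collection pipeline"]
        A["Collect passages<br/>from 7 domains"] --> B["Questioner-Answerer<br/>pairing (two Turkers per conversation)"]
        B --> C["Questioner: reads the passage and<br/>asks conversational questions"]
        C --> D["Answerer: selects a rationale span<br/>in the passage + writes a free-form answer"]
        D --> E["3 additional annotators answer<br/>dev/test questions (multiple references)"]
    end

    subgraph Evaluation["Evaluation framework"]
        F["Model-predicted answer"] --> G["Compute F1 score<br/>per domain"]
        G --> H["Macro-average F1<br/>(mean over 7 domains)"]
    end

    subgraph Baselines["Baseline models"]
        I["Extractive: DrQA +<br/>conversation history"]
        J["Generative: Seq2Seq +<br/>Attention + Copy"]
        K["Hybrid: combined extractive +<br/>generative model"]
    end

    E --> F
    I --> F
    J --> F
    K --> F

Notable features of the collection process:

Both the Questioner and the Answerer can see the passage. The Questioner is nudged to paraphrase rather than reuse the passage's exact wording, and each question builds on the conversation so far, which produces a natural conversational flow.
The Answerer highlights the span in the passage that supports the answer (the rationale) and then edits it into a free-form answer.
For the dev and test sets, 3 additional annotators independently answer each question, and these multiple references make the evaluation more robust.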
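
The evaluation described above can be sketched as a word-overlap F1 scored against several reference answers, followed by an average over domains. This is a simplified illustration, not the official evaluation script: the normalization here only lowercases and strips punctuation, the best score over the references is kept, and `token_f1`, `best_f1`, and `macro_average` are hypothetical helper names.

```python
import string
from collections import Counter
from statistics import mean


def normalize(text):
    """Lowercase, drop ASCII punctuation, and split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def token_f1(prediction, reference):
    """Word-overlap F1 between one prediction and one reference."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def best_f1(prediction, references):
    """Score against every reference and keep the best match."""
    return max(token_f1(prediction, r) for r in references)


def macro_average(per_domain_scores):
    """Average F1 within each domain, then weight every domain equally."""
    return mean(mean(scores) for scores in per_domain_scores.values())


refs = ["Milk, bread, and eggs", "milk bread eggs"]
print(round(best_f1("milk bread", refs), 2))                 # 0.8
print(macro_average({"mctest": [1.0, 0.5], "cnn": [0.25]}))  # 0.5
```

In the second call the single "cnn" score counts as much as the two "mctest" scores combined, which is the equal weighting across domains that the macro-average is meant to provide.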

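Findings (Results)

Main model performance comparison

Model	Overall F1	Children	Literature	Mid/High	News	Wikipedia	Reddit	Science
Human	88.8	90.2	88.4	89.8	88.6	89.9	86.7	88.1
GPT-4 (2023)	~85+	-	-	-	-	-	-	-
RoBERTa-Large	~80.0	-	-	-	-	-	-	-
BERT + FlowQA	~75.0	-	-	-	-	-	-	-
DrQA + PGNet	65.1	64.2	63.7	67.1	68.3	71.4	57.8	63.1
Seq2Seq	27.5	-	-	-	-	-	-	-

Key findings

Human-machine gap: at the time of the original report, the best model (65.1) trailed human performance (88.8) by 23.7 F1 points, demonstrating how hard conversational QA is.
Out-of-domain performance drop: Reddit (57.8) and Science (63.1) score below every in-domain domain (63.7 to 71.4), confirming that domain transfer is difficult.
Difficulty grows with the turn number: performance declines in the later part of a conversation (turn 10 and beyond) as coreference chains get longer.
Yes/No questions are relatively easy: accuracy on Yes/No questions is high overall, but extracting the supporting rationale remains challenging.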
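
The out-of-domain drop can be checked directly from the DrQA + PGNet row of the table. The snippet below is a small illustrative calculation that uses only the Reddit, Science, and overall scores quoted in this note; the variable names are arbitrary.

```python
# Scores for DrQA + PGNet as quoted in the results table of this note.
overall_f1 = 65.1
out_of_domain = {"Reddit": 57.8, "Science": 63.1}

ood_mean = sum(out_of_domain.values()) / len(out_of_domain)
gap_to_overall = overall_f1 - ood_mean

print(round(ood_mean, 2))        # 60.45
print(round(gap_to_overall, 2))  # 4.65
```

The two unseen domains average roughly 4.7 F1 points below the model's overall score.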

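Theoretical significance

Academic contributions

Establishing a standard benchmark for conversational QA: CoQA is one of the first large-scale conversational QA datasets, released around the same time as QuAC, and it became a reference point for the design of later benchmarks such as DoQA.
Dual answer structure: combining an extractive span with a free-form answer offers a new evaluation paradigm that measures both how deeply a model understands the passage and how naturally it expresses the answer.
Promoting multi-turn reasoning research: it lays the groundwork for systematically studying conversation-specific linguistic phenomena, such as coreference resolution, ellipsis resolution, and pragmatic inference, within a QA framework.

Limitations

Conversations average ~15 turns, so real long-running dialogues (100+ turns) are not well represented.
Annotators vary in how they phrase answers; multiple references mitigate this only partially.
QA is performed with the passage given, so open-domain retrieval is not evaluated.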

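Related work

Benchmark	Relation
SQuAD2_2018_ReadingComprehension	Single-turn extractive QA; CoQA extends it to the conversational setting
QuAC_2018_DialogueQA	Contemporaneous conversational QA with extractive answers only (CoQA uses the dual structure)
NaturalQuestions_2019_OpenDomainQA	Open-domain QA, non-conversational
HotpotQA_2018_MultiHopQA	Multi-hop reasoning, non-conversational
DROP_2019_NumericalReasoning	Numerical reasoning QA, non-conversational
TriviaQA_2017_LargeScaleQA	Large-scale QA with single-turn questions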

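Key terms

Term	Definition
Conversational QA	A QA task in which questions are asked and answered in sequence within a conversational context
Coreference Resolution	Determining which entity from an earlier turn a pronoun ("he", "it") refers to
Ellipsis Resolution	Restoring omitted sentence elements (e.g., "When?" → "When was it announced?")
Rationale	The span extracted from the passage that supports the answer
Dual Answer Structure	CoQA's distinctive structure, which provides a free-form answer (input_text) together with an extracted rationale (span_text)
Macro-average F1	F1 is computed per domain and then averaged, so every domain carries equal weight
Out-of-domain	Domains not included in the training data (Reddit, Science)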

benchmark conversational-qa multi-turn coreference extractive-abstractive f1-score reading-comprehension stanford tacl-2019
