Juhyeon's Blog
Tag: benchmark
71 items
April 13, 2026
A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference
paper
MultiNLI
NLI
multi-genre
domain-transfer
benchmark
NAACL
April 13, 2026
A Corpus and Cloze Evaluation for Deeper Understanding of Commonsense Stories
paper
ROCStories
Story-Cloze
commonsense-reasoning
narrative
benchmark
NAACL
April 13, 2026
A Corpus and Evaluation Framework for Deeper Understanding of Commonsense Stories
paper
benchmark
commonsense
StoryCloze
narrative
ROCStories
April 13, 2026
A large annotated corpus for learning natural language inference
paper
NLI
SNLI
dataset
benchmark
crowdsourcing
textual-entailment
EMNLP
April 13, 2026
AIME 2024 - US Mathematical Olympiad Benchmark
from 1 to 15
benchmark
math
reasoning
AIME
competition
olympiad
chain-of-thought
evaluation
April 13, 2026
ALFWorld - Aligning Text and Embodied Environments for Interactive Learning
paper
benchmark
embodied_agent
ALFWorld
BUTLER
text_transfer
ICLR
UW
MSR
April 13, 2026
ARC-AGI - Abstraction and Reasoning Corpus
benchmark
reasoning
abstraction
generalization
ARC
AGI
Chollet
few-shot
program-synthesis
core-knowledge
April 13, 2026
Adversarial NLI - A New Benchmark for Natural Language Understanding
paper
benchmark
NLI
adversarial
ANLI
human_in_the_loop
April 13, 2026
AgentBench - Evaluating LLMs as Agents
paper
benchmark
agent
AgentBench
multi_environment
Tsinghua
ICLR
April 13, 2026
Aider Polyglot - Multi-Language Code Editing Benchmark
benchmark
code-editing
multi-language
polyglot
aider
exercism
practical-coding
LLM-evaluation
April 13, 2026
Aligning AI With Shared Human Values
paper
benchmark
ethics
moral_judgment
AI_alignment
safety
ICLR
April 13, 2026
BBQ - A Hand-Built Bias Benchmark for Question Answering
paper
benchmark
bias
BBQ
QA
ambiguity
social_stereotypes
fairness
April 13, 2026
BigCodeBench - Benchmarking Code Generation with Diverse Function Calls and Complex Instructions
paper
benchmark
code_generation
BigCodeBench
API
library
practical_coding
April 13, 2026
BoolQ - Exploring the Surprising Difficulty of Natural Yes-No Questions
paper
benchmark
yes_no_QA
BoolQ
SuperGLUE
Google
April 13, 2026
Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering
paper
benchmark
science_commonsense
OpenBookQA
open_book
AI2
April 13, 2026
Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them
paper
benchmark
reasoning
BBH
BIG_Bench
chain_of_thought
ACL
April 13, 2026
ChartQA - A Benchmark for Question Answering about Charts with Visual and Logical Reasoning
benchmark
chart-understanding
visual-qa
multimodal
relaxed-accuracy
data-extraction
visual-reasoning
ACL2022
April 13, 2026
Chatbot Arena - An Open Platform for Evaluating LLMs by Human Preference
benchmark
human-preference
elo-rating
bradley-terry
pairwise-comparison
crowdsourcing
lmsys
chatbot-arena
llm-evaluation
icml-2024
April 13, 2026
CoQA - A Conversational Question Answering Challenge
benchmark
conversational-qa
multi-turn
coreference
extractive-abstractive
f1-score
reading-comprehension
stanford
tacl-2019
April 13, 2026
CommonsenseQA - A Question Answering Challenge Targeting World Knowledge
paper
benchmark
commonsense
CommonsenseQA
ConceptNet
knowledge_graph
April 13, 2026
CrowS-Pairs - A Challenge Dataset for Measuring Social Biases in Masked Language Models
paper
benchmark
bias
stereotypes
CrowS-Pairs
fairness
minimal_pairs
April 13, 2026
DROP - A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs
paper
benchmark
reading_comprehension
numerical_reasoning
DROP
NAACL
April 13, 2026
DocVQA - A Dataset for VQA on Document Images
benchmark
document-ai
VQA
OCR
layout-understanding
multimodal
ANLS
WACV2021
April 13, 2026
Don't Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization
paper
benchmark
summarization
XSum
extreme
abstractive
BBC
April 13, 2026
Evaluating Large Language Models Trained on Code
paper
benchmark
code_generation
HumanEval
pass_at_k
Codex
OpenAI
April 13, 2026
GAIA - A Benchmark for General AI Assistants
paper
benchmark
general_AI
GAIA
tool_use
assistant
Meta_FAIR
ICLR
April 13, 2026
GLUE - A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
paper
GLUE
benchmark
multi-task
NLU
QNLI
RTE
transfer-learning
ICLR
April 13, 2026
GPQA - A Graduate-Level Google-Proof Q&A Benchmark
paper
benchmark
expert_level
GPQA
science
graduate
Google_proof
ICLR
April 13, 2026
HellaSwag - Can a Machine Really Finish Your Sentence?
paper
benchmark
commonsense
HellaSwag
adversarial_filtering
ACL
April 13, 2026
Holistic Evaluation of Language Models
paper
benchmark
evaluation_framework
HELM
holistic
Stanford
multi_metric
April 13, 2026
HotpotQA - A Dataset for Diverse, Explainable Multi-hop Question Answering
paper
QA
multi-hop
explainability
supporting-facts
benchmark
EMNLP
April 13, 2026
Instruction-Following Evaluation for Large Language Models
paper
benchmark
instruction_following
IFEval
verifiable
Google
automatic_evaluation
April 13, 2026
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
paper
benchmark
LLM_judge
MT_Bench
chatbot
multi_turn
NeurIPS
LMSYS
April 13, 2026
Kaggle Measuring Progress Toward AGI - Cognitive Abilities
kaggle
hackathon
AGI
benchmark
cognitive-evaluation
DeepMind
metacognition
attention
learning
executive-functions
social-cognition
April 13, 2026
Know What You Don't Know - Unanswerable Questions for SQuAD
paper
benchmark
reading_comprehension
SQuAD
unanswerable
extractive_QA
April 13, 2026
Learning Multiple Layers of Features from Tiny Images
paper
CIFAR-10
CIFAR-100
image-classification
CNN
benchmark
computer-vision
April 13, 2026
Length-Controlled AlpacaEval - A Simple Way to Debias Automatic Evaluators
paper
benchmark
instruction_following
AlpacaEval
length_bias
LLM_judge
Stanford
April 13, 2026
LiveCodeBench - Holistic and Contamination Free Evaluation of Large Language Models for Code
paper
benchmark
code_generation
LiveCodeBench
contamination_free
competitive_programming
April 13, 2026
MMLU-Pro - A More Robust and Challenging Multi-Task Language Understanding Benchmark
paper
benchmark
MMLU_Pro
knowledge
reasoning
10_choice
NeurIPS
April 13, 2026
MMMU - A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
paper
benchmark
multimodal
MMMU
expert_level
multi_discipline
CVPR
April 13, 2026
MathVista - Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts
paper
benchmark
mathematics
multimodal
visual_reasoning
MathVista
ICLR
April 13, 2026
Measuring Massive Multitask Language Understanding
paper
benchmark
MMLU
multitask
knowledge
language_understanding
ICLR
April 13, 2026
Measuring Mathematical Problem Solving with the MATH Dataset
paper
benchmark
mathematics
MATH
competition_math
reasoning
NeurIPS
April 13, 2026
Natural Questions - A Benchmark for Question Answering Research
paper
benchmark
QA
open_domain
NaturalQuestions
Google
April 13, 2026
Needle in a Haystack - Pressure Testing LLMs
benchmark
long-context
retrieval
pressure-test
needle-in-a-haystack
lost-in-the-middle
heatmap
evaluation
April 13, 2026
Neural Network Acceptability Judgments
paper
CoLA
linguistic-acceptability
grammar
benchmark
GLUE
MCC
April 13, 2026
No Language Left Behind - Scaling Human-Centered Machine Translation
benchmark
multilingual
translation
low-resource
FLORES
NLLB
spBLEU
Meta-AI
evaluation
April 13, 2026
Open LLM Leaderboard
paper
benchmark
leaderboard
HuggingFace
open_source
standardized_evaluation
April 13, 2026
PIQA - Reasoning about Physical Commonsense in Natural Language
paper
benchmark
physical_commonsense
PIQA
intuitive_physics
everyday_reasoning
April 13, 2026
Program Synthesis with Large Language Models
paper
benchmark
code_generation
MBPP
program_synthesis
Python
Google
April 13, 2026
QuAC - Question Answering in Context
paper
benchmark
conversational_QA
QuAC
dialogue
information_asymmetry
April 13, 2026
RACE - Large-scale ReAding Comprehension Dataset From Examinations
paper
RACE
reading-comprehension
QA
multiple-choice
exam
benchmark
EMNLP
April 13, 2026
RULER - What's the Real Context Size of Your Long-Context Language Models?
benchmark
long-context
NIAH
NVIDIA
evaluation
synthetic-data
effective-context-length
NAACL2025
April 13, 2026
RealToxicityPrompts - Evaluating Neural Toxic Degeneration in Language Models
paper
benchmark
toxicity
safety
RealToxicityPrompts
language_model
degeneration
April 13, 2026
Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank
paper
SST
SST-2
sentiment-analysis
compositionality
RNTN
benchmark
EMNLP
April 13, 2026
SWE-bench - Can Language Models Resolve Real-World GitHub Issues?
paper
benchmark
software_engineering
SWE_bench
agent
GitHub
Princeton
April 13, 2026
SciTaiL - A Textual Entailment Dataset from Science Question Answering
paper
SciTail
textual-entailment
science-QA
NLI
benchmark
AAAI
April 13, 2026
SemEval-2017 Task 1 - Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation
paper
STS
STS-B
semantic-similarity
regression
multilingual
benchmark
SemEval
April 13, 2026
Social IQa - Commonsense Reasoning about Social Interactions
paper
benchmark
social_commonsense
SIQA
emotional_reasoning
ATOMIC
April 13, 2026
SuperGLUE - A Stickier Benchmark for General-Purpose Language Understanding Systems
paper
benchmark
NLU
SuperGLUE
language_understanding
benchmark_suite
April 13, 2026
Teaching Machines to Read and Comprehend (original) - Abstractive Text Summarization using Sequence-to-sequence RNNs (summary version)
paper
benchmark
summarization
CNN_DailyMail
ROUGE
news
April 13, 2026
The LAMBADA dataset - Word prediction requiring a broad discourse context
paper
benchmark
language_model
LAMBADA
word_prediction
long_range_dependency
April 13, 2026
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
paper
benchmark
science_reasoning
ARC
challenge_set
AI2
adversarial_filtering
April 13, 2026
TriviaQA - A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension
paper
benchmark
QA
TriviaQA
distant_supervision
reading_comprehension
April 13, 2026
TruthfulQA - Measuring How Models Mimic Human Falsehoods
paper
benchmark
truthfulness
hallucination
TruthfulQA
safety
ACL
April 13, 2026
WebArena - A Realistic Web Environment for Building Autonomous Agents
paper
benchmark
web_agent
WebArena
autonomous_agent
CMU
ICLR
April 13, 2026
WebShop - Towards Scalable Real-World Web Interaction with Grounded Language Agents
paper
benchmark
web_agent
WebShop
web_shopping
sim_to_real
NeurIPS
Princeton
April 13, 2026
WildBench - Benchmarking LLMs with Challenging Tasks from Real Users in the Wild
benchmark
LLM-evaluation
real-user-tasks
WildBench
checklist-evaluation
LLM-as-Judge
chatbot-arena
AI2
automatic-evaluation
ecological-validity
April 13, 2026
WinoGrande - An Adversarial Winograd Schema Challenge at Scale
paper
benchmark
commonsense
WinoGrande
winograd
coreference
AAAI
April 13, 2026
LLM-as-Judge Survey 2025 - LLM Evaluation
paper
LLM_Evaluation
LLM_as_Judge
reliability
bias
benchmark
survey
April 13, 2026
LLM Squid Game 2026 Benchmark
paper
self_preservation
benchmark
LLM_safety
survival_game
motivation
FSPM
GIST
proposal