Juhyeon's Blog
Folder: AI/Papers/Benchmarks
81 items
April 13, 2026
A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference
paper
MultiNLI
NLI
multi-genre
domain-transfer
benchmark
NAACL
April 13, 2026
A Corpus and Cloze Evaluation for Deeper Understanding of Commonsense Stories
paper
ROCStories
Story-Cloze
commonsense-reasoning
narrative
benchmark
NAACL
April 13, 2026
A Corpus and Evaluation Framework for Deeper Understanding of Commonsense Stories
paper
benchmark
commonsense
StoryCloze
narrative
ROCStories
April 13, 2026
A large annotated corpus for learning natural language inference 1
paper
NLI
SNLI
dataset
benchmark
crowdsourcing
textual-entailment
EMNLP
April 13, 2026
A large annotated corpus for learning natural language inference
NLI
NLU
Benchmark
Entailment
Crowdsourcing
SentencePair
EMNLP2015
TransferLearning
AnnotationArtifact
April 13, 2026
AIME 2024 - US Mathematical Olympiad Benchmark
benchmark
math
reasoning
AIME
competition
olympiad
chain-of-thought
evaluation
April 13, 2026
ALFWorld - Aligning Text and Embodied Environments for Interactive Learning
paper
benchmark
embodied_agent
ALFWorld
BUTLER
text_transfer
ICLR
UW
MSR
April 13, 2026
ARC-AGI - Abstraction and Reasoning Corpus
benchmark
reasoning
abstraction
generalization
ARC
AGI
Chollet
few-shot
program-synthesis
core-knowledge
April 13, 2026
Adversarial NLI - A New Benchmark for Natural Language Understanding
paper
benchmark
NLI
adversarial
ANLI
human_in_the_loop
April 13, 2026
AgentBench - Evaluating LLMs as Agents
paper
benchmark
agent
AgentBench
multi_environment
Tsinghua
ICLR
April 13, 2026
Aider Polyglot - Multi-Language Code Editing Benchmark
benchmark
code-editing
multi-language
polyglot
aider
exercism
practical-coding
LLM-evaluation
April 13, 2026
Aligning AI With Shared Human Values
paper
benchmark
ethics
moral_judgment
AI_alignment
safety
ICLR
April 13, 2026
BBQ - A Hand-Built Bias Benchmark for Question Answering
paper
benchmark
bias
BBQ
QA
ambiguity
social_stereotypes
fairness
April 13, 2026
Benchmarks
April 13, 2026
Berkeley Function Calling Leaderboard (BFCL)
Benchmark
FunctionCalling
ToolUse
LLM
AST
Agent
API
Evaluation
UCBerkeley
Gorilla
April 13, 2026
BigCodeBench - Benchmarking Code Generation with Diverse Function Calls and Complex Instructions
paper
benchmark
code_generation
BigCodeBench
API
library
practical_coding
April 13, 2026
BoolQ - Exploring the Surprising Difficulty of Natural Yes-No Questions
paper
benchmark
yes_no_QA
BoolQ
SuperGLUE
Google
April 13, 2026
Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering
paper
benchmark
science_commonsense
OpenBookQA
open_book
AI2
April 13, 2026
Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them
paper
benchmark
reasoning
BBH
BIG_Bench
chain_of_thought
ACL
April 13, 2026
ChartQA - A Benchmark for Question Answering about Charts with Visual and Logical Reasoning
benchmark
chart-understanding
visual-qa
multimodal
relaxed-accuracy
data-extraction
visual-reasoning
ACL2022
April 13, 2026
Chatbot Arena - An Open Platform for Evaluating LLMs by Human Preference
benchmark
human-preference
elo-rating
bradley-terry
pairwise-comparison
crowdsourcing
lmsys
chatbot-arena
llm-evaluation
icml-2024
April 13, 2026
CoQA - A Conversational Question Answering Challenge
benchmark
conversational-qa
multi-turn
coreference
extractive-abstractive
f1-score
reading-comprehension
stanford
tacl-2019
April 13, 2026
CommonsenseQA - A Question Answering Challenge Targeting World Knowledge
paper
benchmark
commonsense
CommonsenseQA
ConceptNet
knowledge_graph
April 13, 2026
CrowS-Pairs - A Challenge Dataset for Measuring Social Biases in Masked Language Models
paper
benchmark
bias
stereotypes
CrowS-Pairs
fairness
minimal_pairs
April 13, 2026
DROP - A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs
paper
benchmark
reading_comprehension
numerical_reasoning
DROP
NAACL
April 13, 2026
DocVQA - A Dataset for VQA on Document Images
benchmark
document-ai
VQA
OCR
layout-understanding
multimodal
ANLS
WACV2021
April 13, 2026
Don't Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization
paper
benchmark
summarization
XSum
extreme
abstractive
BBC
April 13, 2026
Evaluating Large Language Models Trained on Code
paper
benchmark
code_generation
HumanEval
pass_at_k
Codex
OpenAI
April 13, 2026
FrontierMath - A Benchmark for Evaluating Advanced Mathematical Reasoning in AI
Benchmark
Math
FrontierMath
ResearchLevel
MathematicalReasoning
EpochAI
HiddenTestSet
DataContamination
ExpertEvaluation
AIMathReasoning
April 13, 2026
GAIA - A Benchmark for General AI Assistants
paper
benchmark
general_AI
GAIA
tool_use
assistant
Meta_FAIR
ICLR
April 13, 2026
GLUE - A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding 1
paper
GLUE
benchmark
multi-task
NLU
QNLI
RTE
transfer-learning
ICLR
April 13, 2026
GLUE - A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
Benchmark
NLU
GLUE
MultiTask
TransferLearning
PretrainFinetune
NLI
SentimentAnalysis
Paraphrase
LanguageUnderstanding
April 13, 2026
GPQA - A Graduate-Level Google-Proof Q&A Benchmark
paper
benchmark
expert_level
GPQA
science
graduate
Google_proof
ICLR
April 13, 2026
HarmBench - A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
Benchmark
RedTeaming
LLM-Safety
Adversarial-Attack
Jailbreak
ASR
ICML2024
April 13, 2026
HellaSwag - Can a Machine Really Finish Your Sentence?
paper
benchmark
commonsense
HellaSwag
adversarial_filtering
ACL
April 13, 2026
Holistic Evaluation of Language Models
paper
benchmark
evaluation_framework
HELM
holistic
Stanford
multi_metric
April 13, 2026
HotpotQA - A Dataset for Diverse, Explainable Multi-hop Question Answering
paper
QA
multi-hop
explainability
supporting-facts
benchmark
EMNLP
April 13, 2026
Instruction-Following Evaluation for Large Language Models
paper
benchmark
instruction_following
IFEval
verifiable
Google
automatic_evaluation
April 13, 2026
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
paper
benchmark
LLM_judge
MT_Bench
chatbot
multi_turn
NeurIPS
LMSYS
April 13, 2026
Kaggle Measuring Progress Toward AGI - Cognitive Abilities
kaggle
hackathon
AGI
benchmark
cognitive-evaluation
DeepMind
metacognition
attention
learning
executive-functions
social-cognition
April 13, 2026
Know What You Don't Know - Unanswerable Questions for SQuAD
paper
benchmark
reading_comprehension
SQuAD
unanswerable
extractive_QA
April 13, 2026
Learning Multiple Layers of Features from Tiny Images
paper
CIFAR-10
CIFAR-100
image-classification
CNN
benchmark
computer-vision
April 13, 2026
Length-Controlled AlpacaEval - A Simple Way to Debias Automatic Evaluators
paper
benchmark
instruction_following
AlpacaEval
length_bias
LLM_judge
Stanford
April 13, 2026
LiveCodeBench - Holistic and Contamination Free Evaluation of Large Language Models for Code
paper
benchmark
code_generation
LiveCodeBench
contamination_free
competitive_programming
April 13, 2026
LongBench - A Bilingual, Multitask Benchmark for Long Context Understanding
Benchmark
LongContext
Bilingual
DocumentUnderstanding
Evaluation
QA
Summarization
CodeGeneration
LLM
April 13, 2026
MMLU-Pro - A More Robust and Challenging Multi-Task Language Understanding Benchmark
paper
benchmark
MMLU_Pro
knowledge
reasoning
10_choice
NeurIPS
April 13, 2026
MMMU - A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
paper
benchmark
multimodal
MMMU
expert_level
multi_discipline
CVPR
April 13, 2026
Making the V in VQA Matter - Elevating the Role of Image Understanding in VQA
Benchmark
VQA
Multimodal
VisualQA
LanguageBias
ComplementaryPairs
COCO
CVPR2017
April 13, 2026
MathVista - Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts
paper
benchmark
mathematics
multimodal
visual_reasoning
MathVista
ICLR
April 13, 2026
Measuring Massive Multitask Language Understanding
paper
benchmark
MMLU
multitask
knowledge
language_understanding
ICLR
April 13, 2026
Measuring Mathematical Problem Solving with the MATH Dataset
paper
benchmark
mathematics
MATH
competition_math
reasoning
NeurIPS
April 13, 2026
Natural Questions - A Benchmark for Question Answering Research
paper
benchmark
QA
open_domain
NaturalQuestions
Google
April 13, 2026
Needle in a Haystack - Pressure Testing LLMs
benchmark
long-context
retrieval
pressure-test
needle-in-a-haystack
lost-in-the-middle
heatmap
evaluation
April 13, 2026
Neural Network Acceptability Judgments
paper
CoLA
linguistic-acceptability
grammar
benchmark
GLUE
MCC
April 13, 2026
No Language Left Behind - Scaling Human-Centered Machine Translation
benchmark
multilingual
translation
low-resource
FLORES
NLLB
spBLEU
Meta-AI
evaluation
April 13, 2026
Open LLM Leaderboard
paper
benchmark
leaderboard
HuggingFace
open_source
standardized_evaluation
April 13, 2026
PIQA - Reasoning about Physical Commonsense in Natural Language
paper
benchmark
physical_commonsense
PIQA
intuitive_physics
everyday_reasoning
April 13, 2026
Program Synthesis with Large Language Models
paper
benchmark
code_generation
MBPP
program_synthesis
Python
Google
April 13, 2026
QuAC - Question Answering in Context
paper
benchmark
conversational_QA
QuAC
dialogue
information_asymmetry
April 13, 2026
RACE - Large-scale ReAding Comprehension Dataset From Examinations 1
paper
RACE
reading-comprehension
QA
multiple-choice
exam
benchmark
EMNLP
April 13, 2026
RACE - Large-scale ReAding Comprehension Dataset From Examinations
Benchmark
ReadingComprehension
MultipleChoice
NLU
EMNLP
English
Inference
RACE
April 13, 2026
RULER - What's the Real Context Size of Your Long-Context Language Models?
benchmark
long-context
NIAH
NVIDIA
evaluation
synthetic-data
effective-context-length
NAACL2025
April 13, 2026
RealToxicityPrompts - Evaluating Neural Toxic Degeneration in Language Models
paper
benchmark
toxicity
safety
RealToxicityPrompts
language_model
degeneration
April 13, 2026
Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank
paper
SST
SST-2
sentiment-analysis
compositionality
RNTN
benchmark
EMNLP
April 13, 2026
SWE-bench - Can Language Models Resolve Real-World GitHub Issues
paper
benchmark
software_engineering
SWE_bench
agent
GitHub
Princeton
April 13, 2026
SciTaiL - A Textual Entailment Dataset from Science Question Answering
paper
SciTail
textual-entailment
science-QA
NLI
benchmark
AAAI
April 13, 2026
SemEval-2017 Task 1 - Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation
paper
STS
STS-B
semantic-similarity
regression
multilingual
benchmark
SemEval
April 13, 2026
Social IQa - Commonsense Reasoning about Social Interactions
paper
benchmark
social_commonsense
SIQA
emotional_reasoning
ATOMIC
April 13, 2026
SuperGLUE - A Stickier Benchmark for General-Purpose Language Understanding Systems
paper
benchmark
NLU
SuperGLUE
language_understanding
benchmark_suite
April 13, 2026
Teaching Machines to Read and Comprehend (original) - Abstractive Text Summarization using Sequence-to-sequence RNNs (summary version)
paper
benchmark
summarization
CNN_DailyMail
ROUGE
news
April 13, 2026
The LAMBADA dataset - Word prediction requiring a broad discourse context
paper
benchmark
language_model
LAMBADA
word_prediction
long_range_dependency
April 13, 2026
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
paper
benchmark
science_reasoning
ARC
challenge_set
AI2
adversarial_filtering
April 13, 2026
Training Verifiers to Solve Math Word Problems
April 13, 2026
TriviaQA - A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension
paper
benchmark
QA
TriviaQA
distant_supervision
reading_comprehension
April 13, 2026
TruthfulQA - Measuring How Models Mimic Human Falsehoods
paper
benchmark
truthfulness
hallucination
TruthfulQA
safety
ACL
April 13, 2026
WMT Shared Tasks (Workshop on Machine Translation)
Benchmark
MachineTranslation
WMT
BLEU
COMET
SharedTask
NeuralMT
Transformer
MultilingualNLP
DirectAssessment
April 13, 2026
WebArena - A Realistic Web Environment for Building Autonomous Agents
paper
benchmark
web_agent
WebArena
autonomous_agent
CMU
ICLR
April 13, 2026
WebShop - Towards Scalable Real-World Web Interaction with Grounded Language Agents
paper
benchmark
web_agent
WebShop
web_shopping
sim_to_real
NeurIPS
Princeton
April 13, 2026
WildBench - Benchmarking LLMs with Challenging Tasks from Real Users in the Wild
benchmark
LLM-evaluation
real-user-tasks
WildBench
checklist-evaluation
LLM-as-Judge
chatbot-arena
AI2
automatic-evaluation
ecological-validity
April 13, 2026
WinoGrande - An Adversarial Winograd Schema Challenge at Scale
paper
benchmark
commonsense
WinoGrande
winograd
coreference
AAAI
April 13, 2026
_survey-overview