Juhyeon's Blog
Folder: AI/Papers/Benchmarks
81 items
April 13, 2026
A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference
paper
MultiNLI
NLI
multi-genre
domain-transfer
benchmark
NAACL
April 13, 2026
A Corpus and Cloze Evaluation for Deeper Understanding of Commonsense Stories
paper
ROCStories
Story-Cloze
commonsense-reasoning
narrative
benchmark
NAACL
April 13, 2026
A Corpus and Evaluation Framework for Deeper Understanding of Commonsense Stories
paper
benchmark
commonsense
StoryCloze
narrative
ROCStories
April 13, 2026
A large annotated corpus for learning natural language inference 1
paper
NLI
SNLI
dataset
benchmark
crowdsourcing
textual-entailment
EMNLP
April 13, 2026
A large annotated corpus for learning natural language inference
NLI
NLU
Benchmark
Entailment
Crowdsourcing
SentencePair
EMNLP2015
TransferLearning
AnnotationArtifact
April 13, 2026
AIME 2024 - US Mathematical Olympiad Benchmark
benchmark
math
reasoning
AIME
competition
olympiad
chain-of-thought
evaluation
April 13, 2026
ALFWorld - Aligning Text and Embodied Environments for Interactive Learning
paper
benchmark
embodied_agent
ALFWorld
BUTLER
text_transfer
ICLR
UW
MSR
April 13, 2026
ARC-AGI - Abstraction and Reasoning Corpus
benchmark
reasoning
abstraction
generalization
ARC
AGI
Chollet
few-shot
program-synthesis
core-knowledge
April 13, 2026
Adversarial NLI - A New Benchmark for Natural Language Understanding
paper
benchmark
NLI
adversarial
ANLI
human_in_the_loop
April 13, 2026
AgentBench - Evaluating LLMs as Agents
paper
benchmark
agent
AgentBench
multi_environment
Tsinghua
ICLR
April 13, 2026
Aider Polyglot - Multi-Language Code Editing Benchmark
benchmark
code-editing
multi-language
polyglot
aider
exercism
practical-coding
LLM-evaluation
April 13, 2026
Aligning AI With Shared Human Values
paper
benchmark
ethics
moral_judgment
AI_alignment
safety
ICLR
April 13, 2026
BBQ - A Hand-Built Bias Benchmark for Question Answering
paper
benchmark
bias
BBQ
QA
ambiguity
social_stereotypes
fairness
April 13, 2026
Benchmarks
April 13, 2026
Berkeley Function Calling Leaderboard (BFCL)
Benchmark
FunctionCalling
ToolUse
LLM
AST
Agent
API
Evaluation
UCBerkeley
Gorilla
April 13, 2026
BigCodeBench - Benchmarking Code Generation with Diverse Function Calls and Complex Instructions
paper
benchmark
code_generation
BigCodeBench
API
library
practical_coding
April 13, 2026
BoolQ - Exploring the Surprising Difficulty of Natural Yes-No Questions
paper
benchmark
yes_no_QA
BoolQ
SuperGLUE
Google
April 13, 2026
Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering
paper
benchmark
science_commonsense
OpenBookQA
open_book
AI2
April 13, 2026
Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them
paper
benchmark
reasoning
BBH
BIG_Bench
chain_of_thought
ACL
April 13, 2026
ChartQA - A Benchmark for Question Answering about Charts with Visual and Logical Reasoning
benchmark
chart-understanding
visual-qa
multimodal
relaxed-accuracy
data-extraction
visual-reasoning
ACL2022
April 13, 2026
Chatbot Arena - An Open Platform for Evaluating LLMs by Human Preference
benchmark
human-preference
elo-rating
bradley-terry
pairwise-comparison
crowdsourcing
lmsys
chatbot-arena
llm-evaluation
icml-2024
April 13, 2026
CoQA - A Conversational Question Answering Challenge
benchmark
conversational-qa
multi-turn
coreference
extractive-abstractive
f1-score
reading-comprehension
stanford
tacl-2019
April 13, 2026
CommonsenseQA - A Question Answering Challenge Targeting World Knowledge
paper
benchmark
commonsense
CommonsenseQA
ConceptNet
knowledge_graph
April 13, 2026
CrowS-Pairs - A Challenge Dataset for Measuring Social Biases in Masked Language Models
paper
benchmark
bias
stereotypes
CrowS-Pairs
fairness
minimal_pairs
April 13, 2026
DROP - A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs
paper
benchmark
reading_comprehension
numerical_reasoning
DROP
NAACL
April 13, 2026
DocVQA - A Dataset for VQA on Document Images
benchmark
document-ai
VQA
OCR
layout-understanding
multimodal
ANLS
WACV2021
April 13, 2026
Don't Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization
paper
benchmark
summarization
XSum
extreme
abstractive
BBC
April 13, 2026
Evaluating Large Language Models Trained on Code
paper
benchmark
code_generation
HumanEval
pass_at_k
Codex
OpenAI
April 13, 2026
FrontierMath - A Benchmark for Evaluating Advanced Mathematical Reasoning in AI
Benchmark
Math
FrontierMath
ResearchLevel
MathematicalReasoning
EpochAI
HiddenTestSet
DataContamination
ExpertEvaluation
AIMathReasoning
April 13, 2026
GAIA - A Benchmark for General AI Assistants
paper
benchmark
general_AI
GAIA
tool_use
assistant
Meta_FAIR
ICLR
April 13, 2026
GLUE - A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding 1
paper
GLUE
benchmark
multi-task
NLU
QNLI
RTE
transfer-learning
ICLR
April 13, 2026
GLUE - A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
Benchmark
NLU
GLUE
MultiTask
TransferLearning
PretrainFinetune
NLI
SentimentAnalysis
Paraphrase
LanguageUnderstanding
April 13, 2026
GPQA - A Graduate-Level Google-Proof Q&A Benchmark
paper
benchmark
expert_level
GPQA
science
graduate
Google_proof
ICLR
April 13, 2026
HarmBench - A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
Benchmark
RedTeaming
LLM-Safety
Adversarial-Attack
Jailbreak
ASR
ICML2024
April 13, 2026
HellaSwag - Can a Machine Really Finish Your Sentence?
paper
benchmark
commonsense
HellaSwag
adversarial_filtering
ACL
April 13, 2026
Holistic Evaluation of Language Models
paper
benchmark
evaluation_framework
HELM
holistic
Stanford
multi_metric
April 13, 2026
HotpotQA - A Dataset for Diverse, Explainable Multi-hop Question Answering
paper
QA
multi-hop
explainability
supporting-facts
benchmark
EMNLP
April 13, 2026
Instruction-Following Evaluation for Large Language Models
paper
benchmark
instruction_following
IFEval
verifiable
Google
automatic_evaluation
April 13, 2026
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
paper
benchmark
LLM_judge
MT_Bench
chatbot
multi_turn
NeurIPS
LMSYS
April 13, 2026
Kaggle Measuring Progress Toward AGI - Cognitive Abilities
kaggle
hackathon
AGI
benchmark
cognitive-evaluation
DeepMind
metacognition
attention
learning
executive-functions
social-cognition
April 13, 2026
Know What You Don't Know - Unanswerable Questions for SQuAD
paper
benchmark
reading_comprehension
SQuAD
unanswerable
extractive_QA
April 13, 2026
Learning Multiple Layers of Features from Tiny Images
paper
CIFAR-10
CIFAR-100
image-classification
CNN
benchmark
computer-vision
April 13, 2026
Length-Controlled AlpacaEval - A Simple Way to Debias Automatic Evaluators
paper
benchmark
instruction_following
AlpacaEval
length_bias
LLM_judge
Stanford
April 13, 2026
LiveCodeBench - Holistic and Contamination Free Evaluation of Large Language Models for Code
paper
benchmark
code_generation
LiveCodeBench
contamination_free
competitive_programming
April 13, 2026
LongBench - A Bilingual, Multitask Benchmark for Long Context Understanding
Benchmark
LongContext
Bilingual
DocumentUnderstanding
Evaluation
QA
Summarization
CodeGeneration
LLM
April 13, 2026
MMLU-Pro - A More Robust and Challenging Multi-Task Language Understanding Benchmark
paper
benchmark
MMLU_Pro
knowledge
reasoning
10_choice
NeurIPS
April 13, 2026
MMMU - A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
paper
benchmark
multimodal
MMMU
expert_level
multi_discipline
CVPR
April 13, 2026
Making the V in VQA Matter - Elevating the Role of Image Understanding in VQA
Benchmark
VQA
Multimodal
VisualQA
LanguageBias
ComplementaryPairs
COCO
CVPR2017
April 13, 2026
MathVista - Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts
paper
benchmark
mathematics
multimodal
visual_reasoning
MathVista
ICLR
April 13, 2026
Measuring Massive Multitask Language Understanding
paper
benchmark
MMLU
multitask
knowledge
language_understanding
ICLR
April 13, 2026
Measuring Mathematical Problem Solving with the MATH Dataset
paper
benchmark
mathematics
MATH
competition_math
reasoning
NeurIPS
April 13, 2026
Natural Questions - A Benchmark for Question Answering Research
paper
benchmark
QA
open_domain
NaturalQuestions
Google
April 13, 2026
Needle in a Haystack - Pressure Testing LLMs
benchmark
long-context
retrieval
pressure-test
needle-in-a-haystack
lost-in-the-middle
heatmap
evaluation
April 13, 2026
Neural Network Acceptability Judgments
paper
CoLA
linguistic-acceptability
grammar
benchmark
GLUE
MCC
April 13, 2026
No Language Left Behind - Scaling Human-Centered Machine Translation
benchmark
multilingual
translation
low-resource
FLORES
NLLB
spBLEU
Meta-AI
evaluation
April 13, 2026
Open LLM Leaderboard
paper
benchmark
leaderboard
HuggingFace
open_source
standardized_evaluation
April 13, 2026
PIQA - Reasoning about Physical Commonsense in Natural Language
paper
benchmark
physical_commonsense
PIQA
intuitive_physics
everyday_reasoning
April 13, 2026
Program Synthesis with Large Language Models
paper
benchmark
code_generation
MBPP
program_synthesis
Python
Google
April 13, 2026
QuAC - Question Answering in Context
paper
benchmark
conversational_QA
QuAC
dialogue
information_asymmetry
April 13, 2026
RACE - Large-scale ReAding Comprehension Dataset From Examinations 1
paper
RACE
reading-comprehension
QA
multiple-choice
exam
benchmark
EMNLP
April 13, 2026
RACE - Large-scale ReAding Comprehension Dataset From Examinations
Benchmark
ReadingComprehension
MultipleChoice
NLU
EMNLP
English
Inference
RACE
April 13, 2026
RULER - What's the Real Context Size of Your Long-Context Language Models?
benchmark
long-context
NIAH
NVIDIA
evaluation
synthetic-data
effective-context-length
NAACL2025
April 13, 2026
RealToxicityPrompts - Evaluating Neural Toxic Degeneration in Language Models
paper
benchmark
toxicity
safety
RealToxicityPrompts
language_model
degeneration
April 13, 2026
Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank
paper
SST
SST-2
sentiment-analysis
compositionality
RNTN
benchmark
EMNLP
April 13, 2026
SWE-bench - Can Language Models Resolve Real-World GitHub Issues
paper
benchmark
software_engineering
SWE_bench
agent
GitHub
Princeton
April 13, 2026
SciTaiL - A Textual Entailment Dataset from Science Question Answering
paper
SciTail
textual-entailment
science-QA
NLI
benchmark
AAAI
April 13, 2026
SemEval-2017 Task 1 - Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation
paper
STS
STS-B
semantic-similarity
regression
multilingual
benchmark
SemEval
April 13, 2026
Social IQa - Commonsense Reasoning about Social Interactions
paper
benchmark
social_commonsense
SIQA
emotional_reasoning
ATOMIC
April 13, 2026
SuperGLUE - A Stickier Benchmark for General-Purpose Language Understanding Systems
paper
benchmark
NLU
SuperGLUE
language_understanding
benchmark_suite
April 13, 2026
Teaching Machines to Read and Comprehend (original) - Abstractive Text Summarization using Sequence-to-sequence RNNs (summary version)
paper
benchmark
summarization
CNN_DailyMail
ROUGE
news
April 13, 2026
The LAMBADA dataset - Word prediction requiring a broad discourse context
paper
benchmark
language_model
LAMBADA
word_prediction
long_range_dependency
April 13, 2026
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
paper
benchmark
science_reasoning
ARC
challenge_set
AI2
adversarial_filtering
April 13, 2026
Training Verifiers to Solve Math Word Problems
April 13, 2026
TriviaQA - A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension
paper
benchmark
QA
TriviaQA
distant_supervision
reading_comprehension
April 13, 2026
TruthfulQA - Measuring How Models Mimic Human Falsehoods
paper
benchmark
truthfulness
hallucination
TruthfulQA
safety
ACL
April 13, 2026
WMT Shared Tasks (Workshop on Machine Translation)
Benchmark
MachineTranslation
WMT
BLEU
COMET
SharedTask
NeuralMT
Transformer
MultilingualNLP
DirectAssessment
April 13, 2026
WebArena - A Realistic Web Environment for Building Autonomous Agents
paper
benchmark
web_agent
WebArena
autonomous_agent
CMU
ICLR
April 13, 2026
WebShop - Towards Scalable Real-World Web Interaction with Grounded Language Agents
paper
benchmark
web_agent
WebShop
web_shopping
sim_to_real
NeurIPS
Princeton
April 13, 2026
WildBench - Benchmarking LLMs with Challenging Tasks from Real Users in the Wild
benchmark
LLM-evaluation
real-user-tasks
WildBench
checklist-evaluation
LLM-as-Judge
chatbot-arena
AI2
automatic-evaluation
ecological-validity
April 13, 2026
WinoGrande - An Adversarial Winograd Schema Challenge at Scale
paper
benchmark
commonsense
WinoGrande
winograd
coreference
AAAI
April 13, 2026
_survey-overview