Juhyeon's Blog
Tag: benchmark
71 items
April 13, 2026
A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference
paper
MultiNLI
NLI
multi-genre
domain-transfer
benchmark
NAACL
April 13, 2026
A Corpus and Cloze Evaluation for Deeper Understanding of Commonsense Stories
paper
ROCStories
Story-Cloze
commonsense-reasoning
narrative
benchmark
NAACL
April 13, 2026
A Corpus and Evaluation Framework for Deeper Understanding of Commonsense Stories
paper
benchmark
commonsense
StoryCloze
narrative
ROCStories
April 13, 2026
A large annotated corpus for learning natural language inference
paper
NLI
SNLI
dataset
benchmark
crowdsourcing
textual-entailment
EMNLP
April 13, 2026
AIME 2024 - US Mathematical Olympiad Benchmark
from 1 to 15
benchmark
math
reasoning
AIME
competition
olympiad
chain-of-thought
evaluation
April 13, 2026
ALFWorld - Aligning Text and Embodied Environments for Interactive Learning
paper
benchmark
embodied_agent
ALFWorld
BUTLER
text_transfer
ICLR
UW
MSR
April 13, 2026
ARC-AGI - Abstraction and Reasoning Corpus
benchmark
reasoning
abstraction
generalization
ARC
AGI
Chollet
few-shot
program-synthesis
core-knowledge
April 13, 2026
Adversarial NLI - A New Benchmark for Natural Language Understanding
paper
benchmark
NLI
adversarial
ANLI
human_in_the_loop
April 13, 2026
AgentBench - Evaluating LLMs as Agents
paper
benchmark
agent
AgentBench
multi_environment
Tsinghua
ICLR
April 13, 2026
Aider Polyglot - Multi-Language Code Editing Benchmark
benchmark
code-editing
multi-language
polyglot
aider
exercism
practical-coding
LLM-evaluation
April 13, 2026
Aligning AI With Shared Human Values
paper
benchmark
ethics
moral_judgment
AI_alignment
safety
ICLR
April 13, 2026
BBQ - A Hand-Built Bias Benchmark for Question Answering
paper
benchmark
bias
BBQ
QA
ambiguity
social_stereotypes
fairness
April 13, 2026
BigCodeBench - Benchmarking Code Generation with Diverse Function Calls and Complex Instructions
paper
benchmark
code_generation
BigCodeBench
API
library
practical_coding
April 13, 2026
BoolQ - Exploring the Surprising Difficulty of Natural Yes-No Questions
paper
benchmark
yes_no_QA
BoolQ
SuperGLUE
Google
April 13, 2026
Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering
paper
benchmark
science_commonsense
OpenBookQA
open_book
AI2
April 13, 2026
Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them
paper
benchmark
reasoning
BBH
BIG_Bench
chain_of_thought
ACL
April 13, 2026
ChartQA - A Benchmark for Question Answering about Charts with Visual and Logical Reasoning
benchmark
chart-understanding
visual-qa
multimodal
relaxed-accuracy
data-extraction
visual-reasoning
ACL2022
April 13, 2026
Chatbot Arena - An Open Platform for Evaluating LLMs by Human Preference
benchmark
human-preference
elo-rating
bradley-terry
pairwise-comparison
crowdsourcing
lmsys
chatbot-arena
llm-evaluation
icml-2024
April 13, 2026
CoQA - A Conversational Question Answering Challenge
benchmark
conversational-qa
multi-turn
coreference
extractive-abstractive
f1-score
reading-comprehension
stanford
tacl-2019
April 13, 2026
CommonsenseQA - A Question Answering Challenge Targeting World Knowledge
paper
benchmark
commonsense
CommonsenseQA
ConceptNet
knowledge_graph
April 13, 2026
CrowS-Pairs - A Challenge Dataset for Measuring Social Biases in Masked Language Models
paper
benchmark
bias
stereotypes
CrowS-Pairs
fairness
minimal_pairs
April 13, 2026
DROP - A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs
paper
benchmark
reading_comprehension
numerical_reasoning
DROP
NAACL
April 13, 2026
DocVQA - A Dataset for VQA on Document Images
benchmark
document-ai
VQA
OCR
layout-understanding
multimodal
ANLS
WACV2021
April 13, 2026
Don't Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization
paper
benchmark
summarization
XSum
extreme
abstractive
BBC
April 13, 2026
Evaluating Large Language Models Trained on Code
paper
benchmark
code_generation
HumanEval
pass_at_k
Codex
OpenAI
April 13, 2026
GAIA - A Benchmark for General AI Assistants
paper
benchmark
general_AI
GAIA
tool_use
assistant
Meta_FAIR
ICLR
April 13, 2026
GLUE - A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
paper
GLUE
benchmark
multi-task
NLU
QNLI
RTE
transfer-learning
ICLR
April 13, 2026
GPQA - A Graduate-Level Google-Proof Q&A Benchmark
paper
benchmark
expert_level
GPQA
science
graduate
Google_proof
ICLR
April 13, 2026
HellaSwag - Can a Machine Really Finish Your Sentence?
paper
benchmark
commonsense
HellaSwag
adversarial_filtering
ACL
April 13, 2026
Holistic Evaluation of Language Models
paper
benchmark
evaluation_framework
HELM
holistic
Stanford
multi_metric
April 13, 2026
HotpotQA - A Dataset for Diverse, Explainable Multi-hop Question Answering
paper
QA
multi-hop
explainability
supporting-facts
benchmark
EMNLP
April 13, 2026
Instruction-Following Evaluation for Large Language Models
paper
benchmark
instruction_following
IFEval
verifiable
Google
automatic_evaluation
April 13, 2026
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
paper
benchmark
LLM_judge
MT_Bench
chatbot
multi_turn
NeurIPS
LMSYS
April 13, 2026
Kaggle Measuring Progress Toward AGI - Cognitive Abilities
kaggle
hackathon
AGI
benchmark
cognitive-evaluation
DeepMind
metacognition
attention
learning
executive-functions
social-cognition
April 13, 2026
Know What You Don't Know - Unanswerable Questions for SQuAD
paper
benchmark
reading_comprehension
SQuAD
unanswerable
extractive_QA
April 13, 2026
Learning Multiple Layers of Features from Tiny Images
paper
CIFAR-10
CIFAR-100
image-classification
CNN
benchmark
computer-vision
April 13, 2026
Length-Controlled AlpacaEval - A Simple Way to Debias Automatic Evaluators
paper
benchmark
instruction_following
AlpacaEval
length_bias
LLM_judge
Stanford
April 13, 2026
LiveCodeBench - Holistic and Contamination Free Evaluation of Large Language Models for Code
paper
benchmark
code_generation
LiveCodeBench
contamination_free
competitive_programming
April 13, 2026
MMLU-Pro - A More Robust and Challenging Multi-Task Language Understanding Benchmark
paper
benchmark
MMLU_Pro
knowledge
reasoning
10_choice
NeurIPS
April 13, 2026
MMMU - A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
paper
benchmark
multimodal
MMMU
expert_level
multi_discipline
CVPR
April 13, 2026
MathVista - Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts
paper
benchmark
mathematics
multimodal
visual_reasoning
MathVista
ICLR
April 13, 2026
Measuring Massive Multitask Language Understanding
paper
benchmark
MMLU
multitask
knowledge
language_understanding
ICLR
April 13, 2026
Measuring Mathematical Problem Solving with the MATH Dataset
paper
benchmark
mathematics
MATH
competition_math
reasoning
NeurIPS
April 13, 2026
Natural Questions - A Benchmark for Question Answering Research
paper
benchmark
QA
open_domain
NaturalQuestions
Google
April 13, 2026
Needle in a Haystack - Pressure Testing LLMs
benchmark
long-context
retrieval
pressure-test
needle-in-a-haystack
lost-in-the-middle
heatmap
evaluation
April 13, 2026
Neural Network Acceptability Judgments
paper
CoLA
linguistic-acceptability
grammar
benchmark
GLUE
MCC
April 13, 2026
No Language Left Behind - Scaling Human-Centered Machine Translation
benchmark
multilingual
translation
low-resource
FLORES
NLLB
spBLEU
Meta-AI
evaluation
April 13, 2026
Open LLM Leaderboard
paper
benchmark
leaderboard
HuggingFace
open_source
standardized_evaluation
April 13, 2026
PIQA - Reasoning about Physical Commonsense in Natural Language
paper
benchmark
physical_commonsense
PIQA
intuitive_physics
everyday_reasoning
April 13, 2026
Program Synthesis with Large Language Models
paper
benchmark
code_generation
MBPP
program_synthesis
Python
Google
April 13, 2026
QuAC - Question Answering in Context
paper
benchmark
conversational_QA
QuAC
dialogue
information_asymmetry
April 13, 2026
RACE - Large-scale ReAding Comprehension Dataset From Examinations
paper
RACE
reading-comprehension
QA
multiple-choice
exam
benchmark
EMNLP
April 13, 2026
RULER - What's the Real Context Size of Your Long-Context Language Models?
benchmark
long-context
NIAH
NVIDIA
evaluation
synthetic-data
effective-context-length
NAACL2025
April 13, 2026
RealToxicityPrompts - Evaluating Neural Toxic Degeneration in Language Models
paper
benchmark
toxicity
safety
RealToxicityPrompts
language_model
degeneration
April 13, 2026
Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank
paper
SST
SST-2
sentiment-analysis
compositionality
RNTN
benchmark
EMNLP
April 13, 2026
SWE-bench - Can Language Models Resolve Real-World GitHub Issues?
paper
benchmark
software_engineering
SWE_bench
agent
GitHub
Princeton
April 13, 2026
SciTaiL - A Textual Entailment Dataset from Science Question Answering
paper
SciTail
textual-entailment
science-QA
NLI
benchmark
AAAI
April 13, 2026
SemEval-2017 Task 1 - Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation
paper
STS
STS-B
semantic-similarity
regression
multilingual
benchmark
SemEval
April 13, 2026
Social IQa - Commonsense Reasoning about Social Interactions
paper
benchmark
social_commonsense
SIQA
emotional_reasoning
ATOMIC
April 13, 2026
SuperGLUE - A Stickier Benchmark for General-Purpose Language Understanding Systems
paper
benchmark
NLU
SuperGLUE
language_understanding
benchmark_suite
April 13, 2026
Teaching Machines to Read and Comprehend (original) - Abstractive Text Summarization using Sequence-to-sequence RNNs (summary version)
paper
benchmark
summarization
CNN_DailyMail
ROUGE
news
April 13, 2026
The LAMBADA dataset - Word prediction requiring a broad discourse context
paper
benchmark
language_model
LAMBADA
word_prediction
long_range_dependency
April 13, 2026
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
paper
benchmark
science_reasoning
ARC
challenge_set
AI2
adversarial_filtering
April 13, 2026
TriviaQA - A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension
paper
benchmark
QA
TriviaQA
distant_supervision
reading_comprehension
April 13, 2026
TruthfulQA - Measuring How Models Mimic Human Falsehoods
paper
benchmark
truthfulness
hallucination
TruthfulQA
safety
ACL
April 13, 2026
WebArena - A Realistic Web Environment for Building Autonomous Agents
paper
benchmark
web_agent
WebArena
autonomous_agent
CMU
ICLR
April 13, 2026
WebShop - Towards Scalable Real-World Web Interaction with Grounded Language Agents
paper
benchmark
web_agent
WebShop
web_shopping
sim_to_real
NeurIPS
Princeton
April 13, 2026
WildBench - Benchmarking LLMs with Challenging Tasks from Real Users in the Wild
benchmark
LLM-evaluation
real-user-tasks
WildBench
checklist-evaluation
LLM-as-Judge
chatbot-arena
AI2
automatic-evaluation
ecological-validity
April 13, 2026
WinoGrande - An Adversarial Winograd Schema Challenge at Scale
paper
benchmark
commonsense
WinoGrande
winograd
coreference
AAAI
April 13, 2026
LLM-as-Judge Survey 2025 - LLM Evaluation
paper
LLM_Evaluation
LLM_as_Judge
reliability
bias
benchmark
survey
April 13, 2026
LLM Squid Game 2026 Benchmark
paper
self_preservation
benchmark
LLM_safety
survival_game
motivation
FSPM
GIST
proposal