Juhyeon's Blog
Tag: Benchmark
14 items
June 4, 2026
A large annotated corpus for learning natural language inference
NLI
NLU
Benchmark
Entailment
Crowdsourcing
SentencePair
EMNLP2015
TransferLearning
AnnotationArtifact
June 4, 2026
Belief in the Machine - Investigating Epistemological Blind Spots of Language Models
LLM
Epistemology
Belief
Knowledge
KaBLE
Benchmark
TheoryOfMind
Factivity
FirstPerson
Self-Consciousness
Evaluation
Theory
June 4, 2026
Benchmark Self-Evolving - A Multi-Agent Framework for Dynamic LLM Evaluation
Paper
Benchmark
Evaluation
LLM
MultiAgent
DynamicEvaluation
DataContamination
June 4, 2026
Berkeley Function Calling Leaderboard (BFCL)
Benchmark
FunctionCalling
ToolUse
LLM
AST
Agent
API
Evaluation
UCBerkeley
Gorilla
June 4, 2026
Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark
paper
AI-Safety
Alignment
Benchmark
Instrumental-Convergence
Power-Seeking
LLM-Agents
ICML2023
Machine-Ethics
Pareto-Frontier
GPT-4-Annotation
June 4, 2026
FrontierMath - A Benchmark for Evaluating Advanced Mathematical Reasoning in AI
Benchmark
Math
FrontierMath
ResearchLevel
MathematicalReasoning
EpochAI
HiddenTestSet
DataContamination
ExpertEvaluation
AIMathReasoning
June 4, 2026
GLUE - A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
Benchmark
NLU
GLUE
MultiTask
TransferLearning
PretrainFinetune
NLI
SentimentAnalysis
Paraphrase
LanguageUnderstanding
June 4, 2026
HarmBench - A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
Benchmark
RedTeaming
LLM-Safety
Adversarial-Attack
Jailbreak
ASR
ICML2024
June 4, 2026
LongBench - A Bilingual, Multitask Benchmark for Long Context Understanding
Benchmark
LongContext
Bilingual
DocumentUnderstanding
Evaluation
QA
Summarization
CodeGeneration
LLM
June 4, 2026
Making the V in VQA Matter - Elevating the Role of Image Understanding in VQA
Benchmark
VQA
Multimodal
VisualQA
LanguageBias
ComplementaryPairs
COCO
CVPR2017
June 4, 2026
Multi-ToM - Evaluating Multilingual Theory of Mind Capabilities in Large Language Models
Paper
ToM
Multilingual
Benchmark
LLM-Evaluation
Cross-Cultural
Social-Reasoning
June 4, 2026
RACE - Large-scale ReAding Comprehension Dataset From Examinations
Benchmark
ReadingComprehension
MultipleChoice
NLU
EMNLP
English
Inference
RACE
June 4, 2026
SimpleToM - Exposing the Gap between Explicit ToM Inference and Implicit ToM Application in LLMs
ToM
Benchmark
LLM-Evaluation
Self-Consciousness
Metacognition
AppliedReasoning
SocialReasoning
ICLR2026
June 4, 2026
WMT Shared Task (Workshop on Machine Translation)
Benchmark
MachineTranslation
WMT
BLEU
COMET
SharedTask
NeuralMT
Transformer
MultilingualNLP
DirectAssessment