Juhyeon's Blog

Tag: Benchmark

14 items

June 4, 2026
A large annotated corpus for learning natural language inference
June 4, 2026
Belief in the Machine - Investigating Epistemological Blind Spots of Language Models
June 4, 2026
Benchmark Self-Evolving - A Multi-Agent Framework for Dynamic LLM Evaluation
June 4, 2026
Berkeley Function Calling Leaderboard (BFCL)
June 4, 2026
Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark
June 4, 2026
FrontierMath - A Benchmark for Evaluating Advanced Mathematical Reasoning in AI
June 4, 2026
GLUE - A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
June 4, 2026
HarmBench - A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
June 4, 2026
LongBench - A Bilingual, Multitask Benchmark for Long Context Understanding
June 4, 2026
Making the V in VQA Matter - Elevating the Role of Image Understanding in VQA
June 4, 2026
Multi-ToM - Evaluating Multilingual Theory of Mind Capabilities in Large Language Models
June 4, 2026
RACE - Large-scale ReAding Comprehension Dataset From Examinations
June 4, 2026
SimpleToM - Exposing the Gap between Explicit ToM Inference and Implicit ToM Application in LLMs
June 4, 2026
WMT Shared Task (Workshop on Machine Translation)
