Juhyeon's Blog
Tag: Benchmark
9 items
April 13, 2026
A large annotated corpus for learning natural language inference
NLI
NLU
Benchmark
Entailment
Crowdsourcing
SentencePair
EMNLP2015
TransferLearning
AnnotationArtifact
April 13, 2026
Berkeley Function Calling Leaderboard (BFCL)
Benchmark
FunctionCalling
ToolUse
LLM
AST
Agent
API
Evaluation
UCBerkeley
Gorilla
April 13, 2026
FrontierMath - A Benchmark for Evaluating Advanced Mathematical Reasoning in AI
Benchmark
Math
FrontierMath
ResearchLevel
MathematicalReasoning
EpochAI
HiddenTestSet
DataContamination
ExpertEvaluation
AI수학추론
April 13, 2026
GLUE - A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
Benchmark
NLU
GLUE
MultiTask
TransferLearning
PretrainFinetune
NLI
SentimentAnalysis
Paraphrase
LanguageUnderstanding
April 13, 2026
HarmBench - A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
Benchmark
RedTeaming
LLM-Safety
Adversarial-Attack
Jailbreak
ASR
ICML2024
April 13, 2026
LongBench - A Bilingual, Multitask Benchmark for Long Context Understanding
Benchmark
LongContext
Bilingual
DocumentUnderstanding
Evaluation
QA
Summarization
CodeGeneration
LLM
April 13, 2026
Making the V in VQA Matter - Elevating the Role of Image Understanding in VQA
Benchmark
VQA
Multimodal
VisualQA
LanguageBias
ComplementaryPairs
COCO
CVPR2017
April 13, 2026
RACE - Large-scale ReAding Comprehension Dataset From Examinations
Benchmark
ReadingComprehension
MultipleChoice
NLU
EMNLP
English
Inference
RACE
April 13, 2026
WMT Shared Task (Workshop on Machine Translation)
Benchmark
MachineTranslation
WMT
BLEU
COMET
SharedTask
NeuralMT
Transformer
MultilingualNLP
DirectAssessment