Juhyeon's Blog

Tag: evaluation

6 items

  • June 4, 2026

    AIME 2024 - US Math Olympiad Benchmark

    • from 1
    • to 15
    • benchmark
    • math
    • reasoning
    • AIME
    • competition
    • olympiad
    • chain-of-thought
    • evaluation
  • June 4, 2026

    If an LLM Were a Character Would It Know Its Own Story - Evaluating Lifelong Learning in LLMs

    • paper
    • LLM
    • lifelong-learning
    • benchmark
    • evaluation
    • memory
    • role-play
    • catastrophic-forgetting
    • self-awareness
    • narrative
  • June 4, 2026

    Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation

    • paper
    • LLM
    • code-generation
    • benchmark
    • evaluation
    • EvalPlus
    • HumanEval
    • MBPP
    • mutation-testing
    • differential-testing
    • NeurIPS2023
  • June 4, 2026

    Needle in a Haystack - Pressure Testing LLMs

    • benchmark
    • long-context
    • retrieval
    • pressure-test
    • needle-in-a-haystack
    • lost-in-the-middle
    • heatmap
    • evaluation
  • June 4, 2026

    No Language Left Behind - Scaling Human-Centered Machine Translation

    • benchmark
    • multilingual
    • translation
    • low-resource
    • FLORES
    • NLLB
    • spBLEU
    • Meta-AI
    • evaluation
  • June 4, 2026

    RULER - What's the Real Context Size of Your Long-Context Language Models

    • benchmark
    • long-context
    • NIAH
    • NVIDIA
    • evaluation
    • synthetic-data
    • effective-context-length
    • NAACL2025

