Juhyeon's Blog

Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation

April 13, 2026 · 1 min read

GitHub: https://github.com/evalplus/evalplus

Summary

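The paper points out two weaknesses of well-known coding benchmarks such as HumanEval and MBPP (Mostly Basic Python Problems): 1. each problem ships with only a small number of test cases, and 2. some of the ground-truth solutions themselves contain errors.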

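To address this, the paper proposes the EvalPlus project, a framework that uses an LLM to augment the test cases for each problem.
Compared with the original benchmarks:
HumanEval+: 80x more tests than the original HumanEval.
MBPP+: 35x more tests than the original MBPP.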
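To make the motivation concrete, here is a minimal sketch of why a small test suite can let a wrong solution through. It is not EvalPlus code: the function names, the toy task, and all inputs below are made up for illustration, and the sketch assumes a trusted reference implementation is available to supply the expected outputs.

```python
from typing import Callable, List

IntList = List[int]


def reference_is_sorted(xs: IntList) -> bool:
    """Hypothetical trusted reference: True if xs is in non-decreasing order."""
    return all(a <= b for a, b in zip(xs, xs[1:]))


def candidate_is_sorted(xs: IntList) -> bool:
    """Hypothetical generated solution with a subtle bug: it only compares the two ends."""
    return len(xs) < 2 or xs[0] <= xs[-1]


def failing_inputs(
    candidate: Callable[[IntList], bool],
    reference: Callable[[IntList], bool],
    inputs: List[IntList],
) -> List[IntList]:
    """Return every input on which the candidate disagrees with the reference."""
    return [xs for xs in inputs if candidate(xs) != reference(xs)]


# A small hand-written suite, standing in for an original benchmark's tests.
BASE_INPUTS: List[IntList] = [[1, 2, 3], [3, 2, 1], []]

# Extra inputs, standing in for an augmented suite (made up for this sketch).
EXTRA_INPUTS: List[IntList] = [[1, 3, 2], [2, 2], [5]]

if __name__ == "__main__":
    base = failing_inputs(candidate_is_sorted, reference_is_sorted, BASE_INPUTS)
    plus = failing_inputs(
        candidate_is_sorted, reference_is_sorted, BASE_INPUTS + EXTRA_INPUTS
    )
    print("base suite failures:", base)
    print("augmented suite failures:", plus)
```

The buggy candidate agrees with the reference on every base input, so a benchmark with only those tests would count it as correct. The additional input `[1, 3, 2]` is what exposes the bug, which is the kind of gap a larger test suite is meant to close.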

The Llama 3.x series, Qwen2.5-Coder, and DeepSeek-Coder-V2 have also used it as an evaluation benchmark.

Note

I plan to use HumanEvalPlus and MBPPPlus, the versions augmented with the pipeline proposed here.
MBPPPlus: 378 problems; HumanEvalPlus: 164 problems.
Original:

Evaluation Procedure

Properties

ArXiv ID: N/A
Category: LLMs
DOI: N/A
IsTargetPaper: true
Linked Bases: [[LLMs.base]]
Reading Status: Not Started
