EvalPlus
About • Quick Start • LLM Backends • Documents • Citation • Acknowledgement
Who's using EvalPlus datasets? EvalPlus has been adopted by various LLM teams.

Notable updates to EvalPlus:
- v0.3.1: EvalPlus v0.3.1 is officially released! Highlights: (i) code efficiency evaluation via EvalPerf, (ii) one command to run everything: generation + post-processing + evaluation, (iii) support for more inference backends such as Google Gemini and Anthropic.
- v0.3.0: Improved ground-truth solutions for MBPP+ tasks (IDs: 459, 102, 559). Thanks to EvalArena.
- v0.3.0: MBPP+ is upgraded to v0.2.0 by removing some broken tasks (399 -> 378 tasks). A ~4pp pass@1 improvement can be expected.
- v0.2.1: You can use EvalPlus datasets via bigcode-evaluation-harness! HumanEval+ oracle fixes (32).
- v0.2.0: MBPP+ is released! HumanEval contract & input fixes (0/3/9/148/114/1/2/99/28/32/35/160).
- v0.1.7: Leaderboard release; HumanEval+ contract and input fixes (32/166/126/6).
- v0.1.6: Configurable and by-default-conservative timeout settings; HumanEval+ contract & ground-truth fixes (129/148/75/53/0/3/9/140).
- v0.1.5: HumanEval+ Mini is released for ultra-fast evaluation when you have too many samples!
- v0.1.1: Optimized the user experience: evaluation speed, PyPI package, Docker, etc.
- v0.1.0: HumanEval+ is released!

EvalPlus is a rigorous evaluation framework for LLM4Code, with:

- HumanEval+: 80x more tests than the original HumanEval!
- MBPP+: 35x more tests than the original MBPP!
- EvalPerf: evaluating the efficiency of LLM-generated code!
Why EvalPlus?
Want to know more details? Read our papers & materials!
Quick Start

pip install --upgrade "evalplus[vllm] @ git+https://github.com/evalplus/evalplus"
# Or `pip install "evalplus[vllm]" --upgrade` for the latest stable release

# One command for generation + post-processing + evaluation
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --dataset [humaneval|mbpp] \
                  --backend vllm \
                  --greedy

# Local generation
evalplus.codegen --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                 --dataset humaneval \
                 --backend vllm \
                 --greedy
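The generated samples can then be scored in a separate evaluation-only step. A minimal sketch, reusing the --samples flag from the Docker example below (the JSONL path follows the [model]_[backend]_temp_[t].jsonl naming shown there):

# Evaluate previously generated samples without re-running generation
evalplus.evaluate --dataset humaneval \
                  --samples evalplus_results/humaneval/ise-uiuc--Magicoder-S-DS-6.7B_vllm_temp_0.0.jsonl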
# Code execution within Docker
docker run --rm --pull=always -v $(pwd)/evalplus_results:/app ganler/evalplus:latest \
           evalplus.evaluate --dataset humaneval \
           --samples /app/humaneval/ise-uiuc--Magicoder-S-DS-6.7B_vllm_temp_0.0.jsonl

Code efficiency evaluation via EvalPerf:

pip install --upgrade "evalplus[perf,vllm] @ git+https://github.com/evalplus/evalplus"
# Or `pip install "evalplus[perf,vllm]" --upgrade` for the latest stable release
sudo sh -c 'echo 0 > /proc/sys/kernel/perf_event_paranoid' # Enable perf
evalplus.evalperf --model "ise-uiuc/Magicoder-S-DS-6.7B" --backend vllm

# Local generation
evalplus.codegen --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                 --dataset evalperf \
                 --backend vllm \
                 --temperature 1.0 \
                 --n-samples 100
# Code execution within Docker
sudo sh -c 'echo 0 > /proc/sys/kernel/perf_event_paranoid' # Enable perf
docker run --cap-add PERFMON --rm --pull=always -v $(pwd)/evalplus_results:/app ganler/evalplus:latest \
           evalplus.evalperf --samples /app/evalperf/ise-uiuc--Magicoder-S-DS-6.7B_vllm_temp_1.0.jsonl

LLM Backends

transformers backend:

evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --dataset [humaneval|mbpp] \
                  --backend hf \
                  --greedy

Note
EvalPlus uses different prompts for base and chat models. By default, the model type is detected via tokenizer.chat_template when using the hf/vllm backends; for other backends, only chat mode is supported. Therefore, if your base model ships a tokenizer.chat_template, add --force-base-prompt to avoid evaluating it in chat mode.
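For example, a sketch of forcing base-model prompts for the running-example model (only relevant when its tokenizer actually carries a chat_template):

# Evaluate with base-model prompts despite a tokenizer.chat_template
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --dataset humaneval \
                  --backend vllm \
                  --force-base-prompt \
                  --greedy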
# Install Flash Attention 2
pip install packaging ninja
pip install flash-attn --no-build-isolation
# Note: if you have installation problems, consider using pre-built
# wheels from https://github.com/Dao-AILab/flash-attention/releases
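A quick sanity check that the wheel imports correctly before running the evaluation (assumes flash-attn exposes the conventional __version__ attribute):

# Verify that Flash Attention 2 is importable
python -c "import flash_attn; print(flash_attn.__version__)"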
# Run evaluation with FA2
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --dataset [humaneval|mbpp] \
                  --backend hf \
                  --attn-implementation [flash_attention_2|sdpa] \
                  --greedy

vllm backend:

evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --dataset [humaneval|mbpp] \
                  --backend vllm \
                  --tp [TENSOR_PARALLEL_SIZE] \
                  --greedy

openai-compatible servers (e.g., vLLM):

# OpenAI models
export OPENAI_API_KEY="{KEY}" # https://platform.openai.com/settings/organization/api-keys
evalplus.evaluate --model "gpt-4o-2024-08-06"
--dataset [humaneval|mbpp]
--backend openai --greedy
# DeepSeek
export OPENAI_API_KEY="{KEY}" # https://platform.deepseek.com/api_keys
evalplus.evaluate --model "deepseek-chat"
--dataset [humaneval|mbpp]
--base-url https://api.deepseek.com
--backend openai --greedy
# Grok
export OPENAI_API_KEY="{KEY}" # https://console.x.ai/
evalplus.evaluate --model "grok-beta"
--dataset [humaneval|mbpp]
--base-url https://api.x.ai/v1
--backend openai --greedy
# vLLM server
# First, launch a vLLM server: https://docs.vllm.ai/en/latest/serving/deploying_with_docker.html
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B"
--dataset [humaneval|mbpp]
--base-url http://localhost:8000/v1
--backend openai --greedyexport OPENAI_API_KEY="[YOUR_API_KEY]"
evalplus.evaluate --model "gpt-4o"
--dataset [humaneval|mbpp]
--backend openai
--greedyexport ANTHROPIC_API_KEY="[YOUR_API_KEY]"
evalplus.evaluate --model "claude-3-haiku-20240307"
--dataset [humaneval|mbpp]
--backend anthropic
--greedyexport GOOGLE_API_KEY="[YOUR_API_KEY]"
evalplus.evaluate --model "gemini-1.5-pro"
--dataset [humaneval|mbpp]
--backend google
--greedyexport BEDROCK_ROLE_ARN="[BEDROCK_ROLE_ARN]"
evalplus.evaluate --model "anthropic.claude-3-5-sonnet-20241022-v2:0"
--dataset [humaneval|mbpp]
--backend bedrock
--greedyYou can checkout the generation and results at evalplus_results/[humaneval|mbpp]/
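A sketch of inspecting those outputs from the shell (the sample file name follows the [model]_[backend]_temp_[t].jsonl pattern from the Docker examples; exact result-file names may vary across EvalPlus versions):

# List generated samples and evaluation artifacts
ls evalplus_results/humaneval/
# Each JSONL line pairs a task_id with the generated code for that task
head -n 1 evalplus_results/humaneval/ise-uiuc--Magicoder-S-DS-6.7B_vllm_temp_0.0.jsonl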
To use EvalPlus from source for development:

git clone https://github.com/evalplus/evalplus.git
cd evalplus
export PYTHONPATH=$PYTHONPATH:$(pwd)
pip install -r requirements.txt

To learn more about how to use EvalPlus, please refer to the project documentation.
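As a sanity check for the from-source setup, a minimal sketch that loads a dataset through the evalplus.data API (assuming the get_human_eval_plus helper, which returns a task_id-to-problem mapping):

# Should print the number of HumanEval+ tasks if the source tree is importable
python -c "from evalplus.data import get_human_eval_plus; print(len(get_human_eval_plus()))"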
Citation

@inproceedings{evalplus,
title = {Is Your Code Generated by Chat{GPT} Really Correct? Rigorous Evaluation of Large Language Models for Code Generation},
author = {Liu, Jiawei and Xia, Chunqiu Steven and Wang, Yuyao and Zhang, Lingming},
booktitle = {Thirty-seventh Conference on Neural Information Processing Systems},
year = {2023},
url = {https://openreview.net/forum?id=1qvx610Cu7},
}
@inproceedings{evalperf,
title = {Evaluating Language Models for Efficient Code Generation},
author = {Liu, Jiawei and Xie, Songrun and Wang, Junhao and Wei, Yuxiang and Ding, Yifeng and Zhang, Lingming},
booktitle = {First Conference on Language Modeling},
year = {2024},
url = {https://openreview.net/forum?id=IBCBMeAhmC},
}