v0.2.1.post2

BigCodeBench

Impact • News • Quick Start • Remote Evaluation • LLM-generated Code • Advanced Usage • Result Submission • Citation

Impact

BigCodeBench has been used by many LLM teams, including:

Zhipu AI
Alibaba Qwen
DeepSeek
Amazon AWS AI
Snowflake AI Research
ServiceNow Research
Meta AI
Cohere AI
Sakana AI

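News

[2024-10-06] We are releasing bigcodebench==v0.2.0!
[2024-10-05] We created a public code execution API on Hugging Face Spaces.
[2024-10-01] We have evaluated 139 models on BigCodeBench-Hard so far. Take a look at the leaderboard!
[2024-08-19] To make the evaluation fully reproducible, we added a real-time code execution session to the leaderboard. It can be viewed here.
[2024-08-02] We released bigcodebench==v0.1.9.

More news

[2024-07-18] We announced BigCodeBench-Hard, a subset of BigCodeBench containing 148 tasks that are more closely aligned with real-world programming tasks. Details are available in this blog post. The dataset is available here. The new release is bigcodebench==v0.1.8.
[2024-06-28] We released bigcodebench==v0.1.7.
[2024-06-27] We released bigcodebench==v0.1.6.
[2024-06-19] We launched the Hugging Face BigCodeBench leaderboard! The leaderboard is available here.
[2024-06-18] We released BigCodeBench, a new benchmark for code generation with 1140 software-engineering-oriented programming tasks. The preprint is available here. The PyPI package is available here with version 0.1.5.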

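About

BigCodeBench

BigCodeBench is an easy-to-use benchmark for solving practical and challenging tasks via code. It aims to evaluate the true programming capabilities of large language models (LLMs) in a more realistic setting. The benchmark is designed for HumanEval-like function-level code generation tasks, but with much more complex instructions and diverse function calls.

There are two splits in BigCodeBench:

Complete: This split is designed for code completion based on comprehensive docstrings.
Instruct: This split works only for instruction-tuned and chat models, where the model is asked to generate a code snippet from natural-language instructions. The instructions contain only the necessary information and require more complex reasoning.

Why BigCodeBench?

BigCodeBench focuses on task automation via code generation with diverse function calls and complex instructions, with:

Precise evaluation and ranking: See our leaderboard for the latest LLM rankings before and after rigorous evaluation.
Pre-generated samples: BigCodeBench accelerates code intelligence research by open-sourcing LLM-generated samples for various models, so there is no need to re-run the expensive benchmarks!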

Quick Start

To get started, first set up the environment:

# By default, you will use the remote evaluation API to execute the output samples.
pip install bigcodebench --upgrade

# We suggest using `flash-attn` for generating code samples.
pip install packaging ninja
pip install flash-attn --no-build-isolation
# Note: if you run into installation problems, consider using pre-built
# wheels from https://github.com/Dao-AILab/flash-attention/releases

Install the nightly version

# Install to use bigcodebench.generate
pip install "git+https://github.com/bigcode-project/bigcodebench.git" --upgrade

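Remote Evaluation

We use greedy decoding as an example to show how to evaluate the generated code samples via the remote API.

Warning

To ease generation, we use batch inference by default. However, batch inference results can vary from batch size to batch size and from version to version, at least for the vLLM backend. If you want more deterministic results for greedy decoding, set --bs to 1.

Note

Remote execution on BigCodeBench-Full typically takes 6-7 minutes, and on BigCodeBench-Hard typically 4-5 minutes.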

bigcodebench.evaluate \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --split [complete | instruct] \
  --subset [full | hard] \
  --backend [vllm | openai | anthropic | google | mistral | hf]
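The bracketed values in the command are alternatives to pick from, not literal arguments. As a minimal sketch, one concrete invocation could be assembled in Python as follows. The helper `build_evaluate_command` is hypothetical (it is not part of bigcodebench); it uses only the flags and choices listed in this README, and it builds the argument list without running anything.

```python
import shlex

# Choices copied from the command template in this README.
SPLITS = ("complete", "instruct")
SUBSETS = ("full", "hard")
BACKENDS = ("vllm", "openai", "anthropic", "google", "mistral", "hf")


def build_evaluate_command(model, split, subset, backend, bs=None):
    """Return an argv list for one concrete `bigcodebench.evaluate` run.

    Hypothetical helper for illustration: it validates the choices and
    assembles the arguments, but never executes the command.
    """
    if split not in SPLITS:
        raise ValueError(f"split must be one of {SPLITS}, got {split!r}")
    if subset not in SUBSETS:
        raise ValueError(f"subset must be one of {SUBSETS}, got {subset!r}")
    if backend not in BACKENDS:
        raise ValueError(f"backend must be one of {BACKENDS}, got {backend!r}")

    argv = [
        "bigcodebench.evaluate",
        "--model", model,
        "--split", split,
        "--subset", subset,
        "--backend", backend,
    ]
    if bs is not None:
        # --bs is the batch-size flag mentioned in the warning above.
        argv += ["--bs", str(bs)]
    return argv


if __name__ == "__main__":
    cmd = build_evaluate_command(
        "meta-llama/Meta-Llama-3.1-8B-Instruct", "complete", "hard", "vllm", bs=1
    )
    # Print a shell-safe command line that can be pasted into a terminal.
    print(shlex.join(cmd))
```

Returning a list rather than a single string keeps the result usable with `subprocess.run` without shell quoting concerns.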

All result files will be stored in a folder named bcb_results.
The generated code samples will be stored in a file named [model_name]--bigcodebench-[instruct|complete]--[backend]-[temp]-[n_samples]-sanitized_calibrated.jsonl.
The evaluation results will be stored in a file named [model_name]--bigcodebench-[instruct|complete]--[backend]-[temp]-[n_samples]-sanitized_calibrated_eval_results.json.
The pass@k results will be stored in a file named [model_name]--bigcodebench-[instruct|complete]--[backend]-[temp]-[n_samples]-sanitized_calibrated_pass_at_k.json.
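The naming pattern above can be sketched as a small Python helper, which may be handy when locating outputs from a script. The function `result_files` is hypothetical and simply fills the pattern verbatim; how the tool itself normalizes `model_name` (for example a "/" in a Hugging Face model id) or formats `temp` is not stated in this README, so treat those parts as assumptions.

```python
from pathlib import Path

RESULTS_DIR = Path("bcb_results")  # folder name stated in this README


def result_files(model_name, split, backend, temp, n_samples):
    """Build the three result paths from the naming pattern in this README.

    Assumption: every component is inserted verbatim. Pass a model name
    without "/" here, otherwise Path would treat it as a subdirectory.
    """
    if split not in ("instruct", "complete"):
        raise ValueError(f"split must be 'instruct' or 'complete', got {split!r}")

    stem = (
        f"{model_name}--bigcodebench-{split}"
        f"--{backend}-{temp}-{n_samples}-sanitized_calibrated"
    )
    return {
        "samples": RESULTS_DIR / f"{stem}.jsonl",
        "eval_results": RESULTS_DIR / f"{stem}_eval_results.json",
        "pass_at_k": RESULTS_DIR / f"{stem}_pass_at_k.json",
    }


if __name__ == "__main__":
    # "my-model" and the temp / n_samples values are placeholders.
    for kind, path in result_files("my-model", "instruct", "vllm", "0", 1).items():
        print(kind, path)
```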

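Note

BigCodeBench uses different prompts for base and chat models. By default, when hf or vllm is used as the backend, the prompt type is detected from tokenizer.chat_template. For other backends, only chat mode is allowed.

Therefore, if your base model comes with a tokenizer.chat_template, add --direct_completion to avoid being evaluated in chat mode.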
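The rule in this note can be written down as a short decision function. This is only a sketch of the behavior as described here, not the library's actual code; `uses_chat_mode` is a hypothetical name, and the tokenizer is stood in for by any object with an optional `chat_template` attribute.

```python
from types import SimpleNamespace


def uses_chat_mode(backend, tokenizer=None, direct_completion=False):
    """Sketch of the prompt-mode rule described in the note above.

    - hf / vllm: chat mode only if the tokenizer carries a chat template
      and --direct_completion was not requested.
    - any other backend: chat mode only.
    """
    if backend not in ("hf", "vllm"):
        return True
    if direct_completion:
        return False
    return getattr(tokenizer, "chat_template", None) is not None


if __name__ == "__main__":
    # Stand-in tokenizers; real ones would come from the model's files.
    base_tok = SimpleNamespace(chat_template=None)
    chat_tok = SimpleNamespace(chat_template="{{ messages }}")

    print(uses_chat_mode("vllm", base_tok))                          # base model
    print(uses_chat_mode("vllm", chat_tok))                          # chat model
    print(uses_chat_mode("vllm", chat_tok, direct_completion=True))  # forced completion
    print(uses_chat_mode("openai"))                                  # API backend
```

The third call is the case the note warns about: a base model that ships a chat template would be treated as a chat model unless `--direct_completion` is passed.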

Access OpenAI APIs from the OpenAI Console

export OPENAI_API_KEY=<your_openai_api_key>

Access Anthropic APIs from the Anthropic Console

export ANTHROPIC_API_KEY=<your_anthropic_api_key>

Access Mistral APIs from the Mistral Console

export MISTRAL_API_KEY=<your_mistral_api_key>

Access Gemini APIs from Google AI Studio

export GOOGLE_API_KEY=<your_google_api_key>
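Before starting a long run against an API backend, it can help to confirm that the matching key is actually set. The following is a minimal sketch; `missing_api_key` is a hypothetical helper, and the backend-to-variable mapping is taken from the export lines in this README (no key is listed for hf or vllm, so they are treated as needing none).

```python
import os

# Backend -> environment variable, as listed in this README.
API_KEY_VARS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "mistral": "MISTRAL_API_KEY",
    "google": "GOOGLE_API_KEY",
}


def missing_api_key(backend, environ=None):
    """Return the name of the unset variable for `backend`, or None.

    `environ` defaults to os.environ; passing a dict makes testing easy.
    """
    env = os.environ if environ is None else environ
    var = API_KEY_VARS.get(backend)
    if var is None:
        # No key is listed for this backend (e.g. hf, vllm).
        return None
    return None if env.get(var) else var


if __name__ == "__main__":
    for backend in ("openai", "anthropic", "mistral", "google", "vllm"):
        missing = missing_api_key(backend)
        status = f"missing {missing}" if missing else "ok"
        print(f"{backend}: {status}")
```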

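LLM-generated Code

We share pre-generated code samples from the LLMs we have evaluated:

See the attachments of our v0.2.0.post3 release. For your convenience, we include sanitized_samples_calibrated.zip.

Advanced Usage

Please refer to the Advanced Usage for more details.

Result Submission

If you would like to contribute your model to the leaderboard, please email the generated code samples and the execution results to [email protected]. Note that the file names should be in the format of [model_name]--[revision]--[bigcodebench|bigcodebench-hard]-[instruct|complete]--[backend]-[temp]-[n_samples]-sanitized_calibrated.jsonl and [model_name]--[revision]--[bigcodebench|bigcodebench-hard]-[instruct|complete]--[backend]-[temp]-[n_samples]-sanitized_calibrated_eval_results.json. If we do not reply to your email within 3 days, you can open an issue to remind us.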
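Since a submission needs both files with a shared prefix, a quick pairing check before sending the email may save a round trip. The sketch below relies only on the two suffixes given above; `expected_results_name` and `submission_is_complete` are hypothetical helpers, and the example file names are placeholders.

```python
SAMPLES_SUFFIX = "-sanitized_calibrated.jsonl"
RESULTS_SUFFIX = "-sanitized_calibrated_eval_results.json"


def expected_results_name(samples_name):
    """Derive the eval-results file name that pairs with a samples file."""
    if not samples_name.endswith(SAMPLES_SUFFIX):
        raise ValueError(f"not a samples file name: {samples_name!r}")
    return samples_name[: -len(SAMPLES_SUFFIX)] + RESULTS_SUFFIX


def submission_is_complete(file_names):
    """True if there is at least one samples file and each has its results file."""
    names = set(file_names)
    samples = [n for n in names if n.endswith(SAMPLES_SUFFIX)]
    return bool(samples) and all(expected_results_name(n) in names for n in samples)


if __name__ == "__main__":
    # Placeholder names following the pattern in this README.
    samples = "my-model--main--bigcodebench-hard-instruct--vllm-0-1-sanitized_calibrated.jsonl"
    files = [samples, expected_results_name(samples)]
    print(submission_is_complete(files))
```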

Citation

@article{zhuo2024bigcodebench,
  title={BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions},
  author={Zhuo, Terry Yue and Vu, Minh Chien and Chim, Jenny and Hu, Han and Yu, Wenhao and Widyasari, Ratnadira and Yusuf, Imam Nur Bani and Zhan, Haolan and He, Junda and Paul, Indraneil and others},
  journal={arXiv preprint arXiv:2406.15877},
  year={2024}
}

Acknowledgement

EvalPlus

Additional information

Version: v0.2.1.post2
Type: Other source code
Updated: 2025-03-04
Size: 86.95KB
Source: GitHub
