DafnyBench
1.0.0
Dataset & code for our paper DafnyBench: A Benchmark for Formal Software Verification
The dataset is also available for download on Hugging Face.
DafnyBench is the largest benchmark of its kind for training and evaluating machine learning systems for formal software verification, with over 750 Dafny programs.
The dataset lives in the DafnyBench directory, which contains the ground_truth set & the hints_removed set (with compiler hints, i.e. annotations, removed). An LLM is evaluated by asking it to fill the hints back into each program from the hints_removed set and checking whether the reconstructed program is verified by Dafny. Please refer to the eval directory for the evaluation code.
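As a rough illustration of that loop (not the actual fill_hints.py implementation, which also handles feedback turns), the sketch below asks an LLM to re-insert hints into a hints-removed program and then checks the result with the Dafny verifier. The prompt wording, the helper names, and the Dafny 4.x style dafny verify invocation are assumptions.

import os
import subprocess
import tempfile

from openai import OpenAI  # assumes the openai>=1.0 Python client


def verify_with_dafny(program_text: str, dafny_path: str) -> bool:
    """Write the candidate program to a temp file and ask Dafny to verify it."""
    with tempfile.NamedTemporaryFile(mode="w", suffix=".dfy", delete=False) as f:
        f.write(program_text)
        path = f.name
    try:
        # Dafny 4.x style CLI; older releases use `dafny /compile:0 <file>` instead.
        result = subprocess.run([dafny_path, "verify", path],
                                capture_output=True, text=True, timeout=300)
        return result.returncode == 0
    finally:
        os.unlink(path)


def fill_hints_once(hints_removed_program: str, model: str = "gpt-4o") -> str:
    """Ask the model to add back verification hints (prompt wording is illustrative)."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    prompt = ("Fill in loop invariants, decreases clauses, and assertions so that "
              "the following Dafny program verifies:\n\n" + hints_removed_program)
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content


if __name__ == "__main__":
    # Toy stand-in for a file from the hints_removed set (e.g. Clover_abs_no_hints.dfy).
    program = ("method Abs(x: int) returns (y: int)\n"
               "  ensures y >= 0\n"
               "{\n  if x < 0 { y := -x; } else { y := x; }\n}\n")
    candidate = fill_hints_once(program)
    # Real code would strip any markdown fences from the model's reply before verifying.
    print("verified:", verify_with_dafny(candidate, os.environ["DAFNY_PATH"]))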
cd into this repository
python -m venv stats
source stats/bin/activate
pip install -r requirements.txt
cd eval
Set DAFNYBENCH_ROOT to the root of this repository:
export DAFNYBENCH_ROOT=

Set DAFNY_PATH to your Dafny executable (e.g. /opt/homebrew/bin/Dafny):
export DAFNY_PATH=

Set your OpenAI API key (needed when evaluating OpenAI models such as gpt-4o):
export OPENAI_API_KEY=
Evaluate an LLM on a single test program, for example:

python fill_hints.py --model "gpt-4o" --test_file "Clover_abs_no_hints.dfy" --feedback_turn 3 --dafny_path "$DAFNY_PATH"
or evaluate on the entire dataset:
export model_to_eval='gpt-4o'
./run_eval.sh
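run_eval.sh handles the sweep over the whole benchmark; if you want a rough Python equivalent (a sketch only, run from the eval directory, with the location of the hints_removed files assumed), a loop like this drives fill_hints.py over every test program:

import os
import subprocess
from pathlib import Path

# Assumed layout: hints-removed programs under $DAFNYBENCH_ROOT/DafnyBench/hints_removed.
root = Path(os.environ["DAFNYBENCH_ROOT"])
dafny_path = os.environ["DAFNY_PATH"]
model = os.environ.get("model_to_eval", "gpt-4o")

for test_file in sorted((root / "DafnyBench" / "hints_removed").glob("*.dfy")):
    subprocess.run(
        ["python", "fill_hints.py",
         "--model", model,
         "--test_file", test_file.name,
         "--feedback_turn", "3",
         "--dafny_path", dafny_path],
        check=False,  # keep going even if one program fails to verify
    )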
Repository structure:

DafnyBench - the dataset, with a ground_truth version that is fully verified with Dafny & a hints_removed version that has hints (i.e. annotations) removed
eval - evaluation code
results
    results_summary - dataframes that summarize LLMs' success on every test program
    reconstructed_files - LLM outputs with hints filled back in
    analysis - contains a notebook for analyzing the results
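To aggregate the outcomes across programs, you can load the summary with pandas; the file name and the verified column below are hypothetical, so check results_summary for the actual schema:

import pandas as pd

# Hypothetical file name and column; inspect results/results_summary for the real schema.
summary = pd.read_csv("results/results_summary/gpt-4o_results.csv")
success_rate = summary["verified"].mean()
print(f"{success_rate:.1%} of {len(summary)} test programs verified")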
Citation:

@article{loughridge2024dafnybench,
title={DafnyBench: A Benchmark for Formal Software Verification},
author={Chloe Loughridge and Qinyi Sun and Seth Ahrenbach and Federico Cassano and Chuyue Sun and Ying Sheng and Anish Mudide and Md Rakib Hossain Misu and Nada Amin and Max Tegmark},
year={2024},
journal={arXiv preprint arXiv:2406.08467}
}