DafnyBench
1.0.0
Dataset & code for our paper DafnyBench: A Benchmark for Formal Software Verification
The dataset is also available for download on Hugging Face.
DafnyBench is the largest benchmark of its kind for training and evaluating machine learning systems for formal software verification, with over 750 Dafny programs.
The dataset lives in the DafnyBench directory, which contains the ground_truth set & the hints_removed set (with compiler hints, i.e. annotations, removed). An LLM is evaluated by asking it to fill the hints back into each program from the hints_removed set and checking whether the reconstructed program is verified by Dafny. Please refer to the eval directory for the evaluation code.
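As a rough illustration of that loop (not the actual fill_hints.py implementation, which also handles feedback turns), the sketch below asks an LLM to re-insert hints into a hints-removed program and then checks the result with the Dafny verifier. The prompt wording, the helper names, and the Dafny 4.x style dafny verify invocation are assumptions.

import os
import subprocess
import tempfile

from openai import OpenAI  # assumes the openai>=1.0 Python client


def verify_with_dafny(program_text: str, dafny_path: str) -> bool:
    """Write the candidate program to a temp file and ask Dafny to verify it."""
    with tempfile.NamedTemporaryFile(mode="w", suffix=".dfy", delete=False) as f:
        f.write(program_text)
        path = f.name
    try:
        # Dafny 4.x style CLI; older releases use `dafny /compile:0 <file>` instead.
        result = subprocess.run([dafny_path, "verify", path],
                                capture_output=True, text=True, timeout=300)
        return result.returncode == 0
    finally:
        os.unlink(path)


def fill_hints_once(hints_removed_program: str, model: str = "gpt-4o") -> str:
    """Ask the model to add back verification hints (prompt wording is illustrative)."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    prompt = ("Fill in loop invariants, decreases clauses, and assertions so that "
              "the following Dafny program verifies:\n\n" + hints_removed_program)
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content


if __name__ == "__main__":
    # Toy stand-in for a file from the hints_removed set (e.g. Clover_abs_no_hints.dfy).
    program = ("method Abs(x: int) returns (y: int)\n"
               "  ensures y >= 0\n"
               "{\n  if x < 0 { y := -x; } else { y := x; }\n}\n")
    candidate = fill_hints_once(program)
    # Real code would strip any markdown fences from the model's reply before verifying.
    print("verified:", verify_with_dafny(candidate, os.environ["DAFNY_PATH"]))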
cd into this repository
python -m venv stats
source stats/bin/activate
pip install -r requirements.txt
cd eval
Set DAFNYBENCH_ROOT to the root of this repository:
export DAFNYBENCH_ROOT=

Set DAFNY_PATH to your Dafny executable (e.g. /opt/homebrew/bin/Dafny):
export DAFNY_PATH=

Set your OpenAI API key (needed when evaluating OpenAI models such as gpt-4o):
export OPENAI_API_KEY=
Evaluate an LLM on a single test program, for example:

python fill_hints.py --model "gpt-4o" --test_file "Clover_abs_no_hints.dfy" --feedback_turn 3 --dafny_path "$DAFNY_PATH"
or evaluate on the entire dataset:
export model_to_eval='gpt-4o'
./run_eval.sh
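run_eval.sh handles the sweep over the whole benchmark; if you want a rough Python equivalent (a sketch only, run from the eval directory, with the location of the hints_removed files assumed), a loop like this drives fill_hints.py over every test program:

import os
import subprocess
from pathlib import Path

# Assumed layout: hints-removed programs under $DAFNYBENCH_ROOT/DafnyBench/hints_removed.
root = Path(os.environ["DAFNYBENCH_ROOT"])
dafny_path = os.environ["DAFNY_PATH"]
model = os.environ.get("model_to_eval", "gpt-4o")

for test_file in sorted((root / "DafnyBench" / "hints_removed").glob("*.dfy")):
    subprocess.run(
        ["python", "fill_hints.py",
         "--model", model,
         "--test_file", test_file.name,
         "--feedback_turn", "3",
         "--dafny_path", dafny_path],
        check=False,  # keep going even if one program fails to verify
    )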
Repository structure:

DafnyBench - the dataset, with a ground_truth version that is fully verified with Dafny & a hints_removed version that has hints (i.e. annotations) removed
eval - evaluation code
results
    results_summary - dataframes that summarize LLMs' success on every test program
    reconstructed_files - LLM outputs with hints filled back in
    analysis - contains a notebook for analyzing the results
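To aggregate the outcomes across programs, you can load the summary with pandas; the file name and the verified column below are hypothetical, so check results_summary for the actual schema:

import pandas as pd

# Hypothetical file name and column; inspect results/results_summary for the real schema.
summary = pd.read_csv("results/results_summary/gpt-4o_results.csv")
success_rate = summary["verified"].mean()
print(f"{success_rate:.1%} of {len(summary)} test programs verified")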
Citation:

@article{loughridge2024dafnybench,
title={DafnyBench: A Benchmark for Formal Software Verification},
author={Chloe Loughridge and Qinyi Sun and Seth Ahrenbach and Federico Cassano and Chuyue Sun and Ying Sheng and Anish Mudide and Md Rakib Hossain Misu and Nada Amin and Max Tegmark},
year={2024},
journal={arXiv preprint arXiv:2406.08467}
}