Verifiable vector similarity queries over a committed vector database.
This project aims to build a proof-of-concept for a verifiable vector database using zero-knowledge proofs. We make heavy use of the awesome ZKFixedPointChip, which enables fixed-point arithmetic with halo2-lib.
You need Rust installed:
```sh
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```

Then, you can clone this repository and use the chips inside it:
```sh
git clone https://github.com/erhant/halo2-vectordb.git
cd halo2-vectordb
```

We implement two chips: one for distance metrics in halo2, and the other for basic vector database operations.
### DistanceChip

DistanceChip provides distance metrics that operate on two vectors of equal length. The vector elements are expected to be quantized with the FixedPointChip. The following distance metrics are implemented:
- `euclidean_distance` computes the Euclidean distance between two vectors.
- `manhattan_distance` computes the Manhattan distance between two vectors.
- `hamming_distance` computes one minus the Hamming similarity between two vectors.
- `cosine_distance` computes one minus the Cosine similarity between two vectors.
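For reference, these metrics correspond to the following plain-Rust f64 computations (analogous to the non-circuit f64 implementation used for comparison in the tests); the in-circuit versions operate on FixedPointChip-quantized values instead, and interpreting Hamming similarity as the fraction of equal elements is an assumption made here for illustration:

```rust
// Plain f64 reference of the four metrics; this is not the circuit API,
// the function names simply mirror the chip's methods for readability.
fn euclidean_distance(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f64>().sqrt()
}

fn manhattan_distance(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).sum()
}

// One minus the fraction of positions where the elements are equal
// (assumed definition of Hamming similarity).
fn hamming_distance(a: &[f64], b: &[f64]) -> f64 {
    let same = a.iter().zip(b).filter(|(x, y)| x == y).count();
    1.0 - (same as f64) / (a.len() as f64)
}

// One minus the cosine of the angle between the two vectors.
fn cosine_distance(a: &[f64], b: &[f64]) -> f64 {
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f64 = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let norm_b: f64 = b.iter().map(|x| x * x).sum::<f64>().sqrt();
    1.0 - dot / (norm_a * norm_b)
}
```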
### VectorDBChip

VectorDBChip implements basic vector database functionality over a set of vectors. Similar to DistanceChip, it requires a FixedPointChip to operate over quantized values. It exposes the following functions:

- `nearest_vector` takes a set of vectors and a query vector, and finds the vector that is most similar to the query w.r.t. a given distance function. It also returns an indicator (i.e. a one-hot encoded vector marking the index of the result vector) which may be used in later steps.
- `merkle_commitment` takes a set of vectors and commits to them using a Merkle tree with Poseidon hashes. If the given set does not have a power-of-two number of elements, the remaining leaves are padded with zeros. In our scenario we only ever need the entire vector or none of it, so we do not care about committing to individual elements within a vector. As such, we first hash the entire vector and then treat that hash as the leaf node.
- `kmeans` takes a set of vectors, a constant K to determine the number of centroids, and a constant I to determine the number of iterations. K-means is usually an iterative algorithm that terminates when the centroids are no longer updated; however, such data-dependent control flow is not possible in a zk-circuit. Therefore, the I parameter fixes the number of iterations (see the sketch below).

We also have a trait FixedPointVectorInstructions and its implementation for the FixedPointChip, which provides simple utility functions to quantize and dequantize vectors.
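To make the fixed-iteration constraint concrete, here is a minimal plain-Rust (f64, non-circuit) sketch of k-means that runs exactly I iterations with no convergence check, mirroring what the circuit must do. Seeding the centroids from the first K vectors is an illustrative assumption, not necessarily how the chip initializes them:

```rust
// Fixed-iteration k-means over f64 vectors: exactly I iterations, no
// convergence check. Assumes vectors.len() >= K; seeding from the first
// K vectors is an illustrative choice.
fn kmeans_fixed<const K: usize, const I: usize>(vectors: &[Vec<f64>]) -> [Vec<f64>; K] {
    let dim = vectors[0].len();
    let mut centroids: [Vec<f64>; K] = core::array::from_fn(|i| vectors[i].clone());

    for _ in 0..I {
        // Assignment step: find the nearest centroid (squared Euclidean distance).
        let mut sums = vec![vec![0.0; dim]; K];
        let mut counts = vec![0usize; K];
        for v in vectors {
            let (best, _) = centroids
                .iter()
                .enumerate()
                .map(|(i, c)| {
                    let d: f64 = c.iter().zip(v).map(|(x, y)| (x - y) * (x - y)).sum();
                    (i, d)
                })
                .min_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
                .unwrap();
            counts[best] += 1;
            for (s, x) in sums[best].iter_mut().zip(v) {
                *s += x;
            }
        }
        // Update step: average of assigned vectors (keep the old centroid if none assigned).
        for k in 0..K {
            if counts[k] > 0 {
                centroids[k] = sums[k].iter().map(|s| s / counts[k] as f64).collect();
            }
        }
    }
    centroids
}

// e.g. `let centroids = kmeans_fixed::<2, 10>(&vectors);`
```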
A demonstrative test suite can be found at `demo_test`:

- K centroids & clusters,
- K centroids & clusters along with K+2 Merkle roots,
- an f64 Rust implementation.

Run the examples via one of the following:
```sh
# demonstrate distance computations
LOOKUP_BITS=12 cargo run --example distances -- --name distances -k 13 mock

# example merkle commitment to vectors
LOOKUP_BITS=12 cargo run --example merkle -- --name merkle -k 13 mock

# exhaustively find the most similar vector & commit to the database
LOOKUP_BITS=12 cargo run --example query -- --name query -k 13 mock

# compute centroids
LOOKUP_BITS=15 cargo run --example kmeans -- --name kmeans -k 16 mock
```

You can provide a specific input via the `--input <input-name>` option.
To run tests:
```sh
cargo test
```

Some of the tests make use of the ANN_SIFT_10K dataset by Jégou et al., which can be downloaded at Corpus-Texmex. This dataset contains 128-dimensional vectors. Within our tests folder, the `common` module exposes two functions to read these vectors:
- `read_vectors_from_disk` takes a path and a dimension (128 in this case) and reads the .fvec files from the dataset, returning a single vector of f32 values. This flat vector contains the elements of all vectors, concatenated together.
- `select_from_vectors` takes that single vector and a list of indices, and returns only the selected vectors.
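For illustration, selecting vectors out of such a flat buffer amounts to slicing it into dimension-sized chunks. The sketch below assumes that layout; the actual helpers in the tests' `common` module may differ in signature and details:

```rust
// Illustrative only: given a flat buffer of `count * dim` f32 values
// (all vectors concatenated), return the vectors at the given indices.
fn select_from_flat(flat: &[f32], dim: usize, indices: &[usize]) -> Vec<Vec<f32>> {
    indices
        .iter()
        .map(|&i| flat[i * dim..(i + 1) * dim].to_vec())
        .collect()
}

// e.g. take the first two 128-dimensional vectors:
// let selected = select_from_flat(&flat, 128, &[0, 1]);
```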