This project is at a very early stage and is not ready for production.
Serverless, lightweight, and fast vector store on top of DynamoDB
Whiplash is a lightweight vector store built on top of AWS DynamoDB. It uses a variant of locality-sensitive hashing (LSH) to index vectors in a DynamoDB table. It is intended to be a minimalist, scalable, and fast vector store that is extremely easy to use, maintain, and self-host.
There are several main components to Whiplash:
pip install whiplash-client
# OR
poetry add whiplash-client
npm install -g serverless
serverless install --url https://github.com/ericmillsio/whiplash
cd whiplash
npm install
Prepare for deployment:
cp .env.example .env
# Change AWS_PROFILE within .env to your AWS profile
# Adjust other values if you want
Deploy to AWS:
serverless deploy --stage dev --region us-east-2
Expected output is below. Note the API key for access and endpoints to use.
Running "serverless" from node_modules
Deploying whiplash to stage dev (us-east-2)
✔ Service deployed to stack whiplash-dev (64s)
api keys:
dev-api-key: <YOUR API KEY WILL BE HERE>
endpoints:
GET - https://API_ID.execute-api.us-east-2.amazonaws.com/dev/projects/{projectId}
POST - https://API_ID.execute-api.us-east-2.amazonaws.com/dev/projects
GET - https://API_ID.execute-api.us-east-2.amazonaws.com/dev/projects
POST - https://API_ID.execute-api.us-east-2.amazonaws.com/dev/projects/{projectId}/collections
GET - https://API_ID.execute-api.us-east-2.amazonaws.com/dev/projects/{projectId}/collections
GET - https://API_ID.execute-api.us-east-2.amazonaws.com/dev/projects/{projectId}/collections/{collectionId}
POST - https://API_ID.execute-api.us-east-2.amazonaws.com/dev/projects/{projectId}/collections/{collectionId}/items
POST - https://API_ID.execute-api.us-east-2.amazonaws.com/dev/projects/{projectId}/collections/{collectionId}/items/batch
GET - https://API_ID.execute-api.us-east-2.amazonaws.com/dev/projects/{projectId}/collections/{collectionId}/items/{itemId}
GET - https://API_ID.execute-api.us-east-2.amazonaws.com/dev/projects/{projectId}/collections/{collectionId}/search
functions:
getProject: whiplash-dev-getProject (12 kB)
createProject: whiplash-dev-createProject (12 kB)
listProjects: whiplash-dev-listProjects (12 kB)
createCollection: whiplash-dev-createCollection (12 kB)
listCollections: whiplash-dev-listCollections (12 kB)
getCollection: whiplash-dev-getCollection (12 kB)
createItem: whiplash-dev-createItem (12 kB)
createItems: whiplash-dev-createItems (12 kB)
getItem: whiplash-dev-getItem (12 kB)
searchItems: whiplash-dev-searchItems (12 kB)
There are three ways to use Whiplash: as a direct client, as a serverless API, or as a library proxying through the API.
The library is the most flexible way to use Whiplash. It can be used in any Python project to build custom applications on top of Whiplash without deploying the API, and it manages the tables directly. Running it inside AWS is recommended to avoid network latency, but it can be run outside AWS for testing or evaluation.
import numpy as np
from whiplash import Vector, Whiplash
# AWS_PROFILE must be set in environment variables for boto3
whiplash = Whiplash("us-east-2", "dev")
# First time only setup
whiplash.setup()
collection = whiplash.create_collection("test_collection", n_features=3)
# Insert a vector
item = Vector("some_id", np.array([1, 2, 3]))
collection.insert(item)
# Search for the inserted vector
result = collection.search(item.vector, limit=1)
The serverless API is a fully functional microservice and can be deployed to AWS with a few commands. The API is built using Serverless and AWS Lambda.
API Endpoints:
/projects
/projects/{projectId}
/projects/{projectId}/collections
/projects/{projectId}/collections/{collectionId}
/projects/{projectId}/collections/{collectionId}/items
/projects/{projectId}/collections/{collectionId}/items/{itemId}
/projects/{projectId}/collections/{collectionId}/search
Required Headers:
x-api-key: API_KEY
Content-Type: application/json
You can choose to interact with the API directly or use the Whiplash Client Library. An api.yaml file is included in the project for use with Postman or Insomnia.
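For direct access, here is a minimal sketch of building a request with the required headers, using only the Python standard library. The `build_request` helper is hypothetical and not part of Whiplash; `API_ID` and `API_KEY` are the placeholders from the deploy output, and the `/projects` path is the list-projects endpoint shown there.

```python
import urllib.request


def build_request(base_url, api_key, path):
    """Hypothetical helper: build a GET request carrying the required headers."""
    url = base_url.rstrip("/") + path
    headers = {
        "x-api-key": api_key,
        "Content-Type": "application/json",
    }
    return urllib.request.Request(url, headers=headers, method="GET")


# Placeholders: substitute the API ID, stage, and key from your own deployment.
request = build_request(
    "https://API_ID.execute-api.us-east-2.amazonaws.com/dev",
    "API_KEY",
    "/projects",
)

# To send it against a deployed stack:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

The request is only built here, not sent, so the snippet runs without a deployed stack.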
The library can be used to proxy through the API. This is recommended if you want to use a library but don't want to manage the tables directly. This is more stable and likely faster than the direct client library for bulk insert operations, but slower for single insert operations (assuming you are calling from outside the AWS region).
This can only be used if the API is deployed. It also avoids the numpy dependency, but I haven't broken out the packages yet.
import time
from whiplash.api.client import Whiplash
query = [0.5472,...]
whiplash = Whiplash(
"https://API-ID.execute-api.us-east-2.amazonaws.com/STAGE",
"API_KEY",
)
collection = whiplash.get_collection("example")
assert collection is not None
start = time.time()
results = collection.search(query)
print("Search took", time.time() - start, "seconds")
print("Results:", results)
Whiplash uses locality-sensitive hashing (LSH) to index vectors in a DynamoDB table. This is intended to be a minimalist, scalable, and fast vector store built on top of AWS production-grade infrastructure.
Whiplash differs from traditional LSH in that it uses a different number of bits for each hash key. This allows for dynamic and automatic tuning of the number of buckets, which gives flexibility in scaling the index over time. When buckets on the smallest hash key are close to the maximum size of a DynamoDB item, a new layer of hash functions/keys is added.
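One possible reading of this layering scheme is sketched below. It is not Whiplash's actual implementation: the `LayeredHasher` class, the starting bit count, the number of bits added per layer, and the 90% fill threshold are all assumptions made for illustration. The only external fact used is DynamoDB's 400 KB item size limit.

```python
import numpy as np

# DynamoDB's documented maximum item size is 400 KB.
DYNAMODB_MAX_ITEM_BYTES = 400 * 1024


class LayeredHasher:
    """Hypothetical helper: one set of random planes per layer, more bits per layer."""

    def __init__(self, n_features, start_bits=4, bits_per_layer=4, seed=0):
        self.n_features = n_features
        self.bits_per_layer = bits_per_layer  # assumed growth policy
        self.rng = np.random.default_rng(seed)
        # Each layer is an (n_bits, n_features) matrix of Gaussian plane normals.
        self.layers = [self.rng.standard_normal((start_bits, n_features))]

    def add_layer(self):
        """Add a layer whose keys use more bits than the previous layer."""
        n_bits = self.layers[-1].shape[0] + self.bits_per_layer
        self.layers.append(self.rng.standard_normal((n_bits, self.n_features)))

    def keys(self, vector):
        """Return one base-36 hash key per layer for the given vector."""
        vector = np.asarray(vector, dtype=float)
        keys = []
        for planes in self.layers:
            bits = (planes @ vector) >= 0
            value = int("".join("1" if b else "0" for b in bits), 2)
            keys.append(np.base_repr(value, base=36).lower())
        return keys


def bucket_is_nearly_full(bucket_bytes, fill_ratio=0.9):
    """Assumed trigger: the bucket item is close to the DynamoDB size limit."""
    return bucket_bytes >= fill_ratio * DYNAMODB_MAX_ITEM_BYTES


hasher = LayeredHasher(n_features=3)
if bucket_is_nearly_full(390 * 1024):
    hasher.add_layer()
print(hasher.keys([1.0, 2.0, 3.0]))  # one key per layer
```

Layers with fewer bits produce fewer, larger buckets; layers with more bits split the same vectors across more, smaller buckets.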
Whiplash uses several Dynamo tables to store the vectors and buckets:
PROJECT_STAGE_COLLECTION_vectors
- stores the { vector_id: binary(vector) }
PROJECT_STAGE_COLLECTION_buckets
- stores the { hash: set of vector_ids }
whiplash_metadata
- stores metadata about each collection { collection_id: { n_features: 256, uniform_planes: {0: [binary]} ...} }
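As a rough illustration of this layout, the sketch below builds the table names for a collection and round-trips a vector through a binary encoding. The helper names are hypothetical, and the `float32` encoding is an assumption; the source only says that vectors are stored as binary.

```python
import numpy as np


def table_names(project, stage, collection):
    """Hypothetical helper: table names following the pattern listed above."""
    prefix = f"{project}_{stage}_{collection}"
    return {
        "vectors": f"{prefix}_vectors",
        "buckets": f"{prefix}_buckets",
        "metadata": "whiplash_metadata",
    }


def encode_vector(vector):
    """Serialize a vector to bytes (float32 is an assumed encoding)."""
    return np.asarray(vector, dtype=np.float32).tobytes()


def decode_vector(blob):
    """Inverse of encode_vector under the same float32 assumption."""
    return np.frombuffer(blob, dtype=np.float32)


names = table_names("whiplash", "dev", "example")
blob = encode_vector([1, 2, 3])
restored = decode_vector(blob)
```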
Whiplash uses random projection to hash vectors into buckets. The vectors are projected onto a set of random planes, and the sign of each projection determines which side of the plane the vector is on. The planes are drawn from a Gaussian distribution so that vectors are spread evenly across buckets.
Hash keys are generated by converting the array of boolean results from the random projection to a binary string, and then converting that string to base 36.
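The hashing steps described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not Whiplash's own code: `make_planes` and `hash_key` are hypothetical names, and treating a zero projection as a positive bit is an assumption.

```python
import numpy as np


def make_planes(n_planes, n_features, seed=0):
    """Draw random plane normals from a standard Gaussian distribution."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((n_planes, n_features))


def hash_key(vector, planes):
    """Project onto each plane, keep the signs as bits, and encode in base 36."""
    bits = (planes @ np.asarray(vector, dtype=float)) >= 0
    bit_string = "".join("1" if b else "0" for b in bits)
    return np.base_repr(int(bit_string, 2), base=36).lower()


planes = make_planes(n_planes=8, n_features=3)
key = hash_key([1.0, 2.0, 3.0], planes)
scaled_key = hash_key([2.0, 4.0, 6.0], planes)
print(key, scaled_key)
```

Because only the sign of each projection is kept, scaling a vector by a positive constant leaves its key unchanged, so the two keys printed here are equal.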
poetry install
poetry run pytest
Whiplash (and LSH in general), when used for approximate nearest neighbor (ANN) search, has tradeoffs compared to current vector stores:
Pros:
Cons:
I wanted to build this because I was frustrated with the high cost of managed vector stores and the complexity of self-hosting other open-source vector stores. There is also no good serverless solution (even if AWS pushes "Serverless" ElasticSearch, ugh). Whiplash is simple and built on production-ready infrastructure, so it should need very little maintenance.
Comparing storage cost alone to Pinecone:
Pinecone:
Whiplash:
In no particular order:
If you are Amazon and want to use this, please contact me and pay me a bunch of money beforehand. Otherwise, this is licensed under Apache 2.0.