LanceDB benchmark: Full-text and vector search performance

Code for the benchmark study described in this blog post.

LanceDB is an open source, embedded, developer-friendly vector database. Some of its key features are listed below; many more are described in its GitHub repo.

Incredibly lightweight (no DB servers to manage), because it runs entirely in-process with the application
Extremely scalable from development to production
Ability to perform full-text search (FTS), SQL search (via DataFusion) and ANN vector search
Multi-modal data support (images, text, video, audio, point-clouds, etc.)
Zero-copy (via Arrow) with automatic versioning of data on its native Lance storage format

The aim of this repo is to demonstrate the full-text and vector search features of LanceDB via an end-to-end benchmark, in which we carefully study query results and throughput.

Dataset

The dataset used for this demo is the Wine Reviews dataset from Kaggle, containing ~130k wine reviews along with other metadata. The dataset is converted to a ZIP archive, and both the conversion code and the ZIP data are provided here for reference.
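As a rough illustration of working with data packaged this way, the sketch below writes a few records as newline-delimited JSON into an in-memory ZIP archive and reads them back using only the standard library. The record fields, the member name `reviews.jsonl`, and the choice of newline-delimited JSON are assumptions made for the example, not the repo's actual layout.

```python
import io
import json
import zipfile

# Hypothetical records; field names are assumptions for illustration only.
records = [
    {"id": 1, "title": "Example Red 2015", "description": "Ripe cherry and spice.", "points": 90},
    {"id": 2, "title": "Example White 2018", "description": "Crisp citrus and minerality.", "points": 88},
]


def write_zip(rows, member="reviews.jsonl"):
    """Serialize rows as newline-delimited JSON inside an in-memory ZIP."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        payload = "\n".join(json.dumps(row) for row in rows)
        zf.writestr(member, payload)
    buffer.seek(0)
    return buffer


def read_zip(buffer):
    """Yield one dict per non-empty JSON line from every member of the ZIP."""
    with zipfile.ZipFile(buffer) as zf:
        for name in zf.namelist():
            with zf.open(name) as fh:
                for line in fh:
                    if line.strip():
                        yield json.loads(line)


loaded = list(read_zip(write_zip(records)))
```

Reading line by line keeps memory use flat, which matters more at ~130k records than in this toy case.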

Comparison

Studying the performance of any tool in isolation is a challenge, so for the sake of comparison, an Elasticsearch workflow is provided in this repo. Elasticsearch is a popular Lucene-based search engine that is widely used for full-text search (and, these days, vector search), which makes it a meaningful tool to compare LanceDB against.

Setup

Install the dependencies in a virtual environment via requirements.txt.

# Set up the environment for the first time
python -m venv .venv  # python must point to Python 3.11+

# Activate the environment (for subsequent runs)
source .venv/bin/activate

python -m pip install -r requirements.txt

Benchmark results

Note

The numbers below are from a 2022 M2 MacBook Pro with 16GB RAM
The search space comprises 129,971 wine review descriptions, indexed in both LanceDB and Elasticsearch
The queries are randomly sampled from a list of 10 example queries for FTS and vector search, and run for 10, 100, 1000 and 10000 random queries
The vector dimensionality for the embeddings is 384 (BAAI/bge-small-en-v1.5)
Vector search in Elasticsearch is based on Lucene-HNSW, and in LanceDB, is based on IVF-PQ
The distance metric for vector search is cosine similarity in both databases
The run times reported (and QPS computed) are an average over 3 runs
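The sampling and averaging described in the notes above can be sketched as follows. This is a minimal illustration, not the repo's code: the query pool is a placeholder, the helper names are hypothetical, and the run times are made up. The source does not say whether the reported QPS comes from the mean run time or from averaging per-run QPS values; this sketch assumes the former.

```python
import random
import statistics

# Placeholder pool; the benchmark samples from a list of 10 example queries.
QUERY_POOL = [f"example query {i}" for i in range(10)]


def sample_queries(pool, n, seed=None):
    """Draw n queries from the pool uniformly at random, with replacement."""
    rng = random.Random(seed)
    return rng.choices(pool, k=n)


def summarize(run_times_sec, n_queries):
    """Return (mean run time, QPS derived from that mean) over repeated runs."""
    mean_time = statistics.mean(run_times_sec)
    return mean_time, n_queries / mean_time


queries = sample_queries(QUERY_POOL, 1000, seed=42)

# Made-up timings for three runs of the same 1000 queries.
mean_time, qps = summarize([2.0, 2.5, 1.5], len(queries))
print(f"mean run time: {mean_time:.2f} s, QPS: {qps:.1f}")
```

Fixing the seed makes each run issue the same query sequence, so differences between runs reflect the system rather than the sample.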

Summary of results for 10,000 random queries:

Case	Elasticsearch (QPS)	LanceDB (QPS)
FTS: Serial	399.8	468.9
FTS: Concurrent	1539.0	528.9
Vector search: Serial	11.9	54.0
Vector search: Concurrent	50.7	71.6

Discussion

Via their Python clients, LanceDB is clearly faster than Elasticsearch in terms of QPS (queries per second) for the vector search use case, in both the serial and concurrent scenarios, and is also faster for the full-text search use case when queries are run serially.
Elasticsearch is faster only for the FTS use case in the concurrent scenario (1539.0 vs. 528.9 QPS), likely because it uses a non-blocking async client (unlike LanceDB, for now).
In the future, if an async (non-blocking) Python client is available for LanceDB, the throughput for LanceDB for FTS is expected to be even higher.
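To put relative numbers on these observations, the short sketch below computes the ratio of LanceDB QPS to Elasticsearch QPS for each case, using the values from the summary table for 10,000 random queries. The helper name `lancedb_speedup` is introduced here for illustration.

```python
# QPS values copied from the summary table: (Elasticsearch, LanceDB).
summary_qps = {
    "FTS: Serial": (399.8, 468.9),
    "FTS: Concurrent": (1539.0, 528.9),
    "Vector search: Serial": (11.9, 54.0),
    "Vector search: Concurrent": (50.7, 71.6),
}


def lancedb_speedup(es_qps, lance_qps):
    """Ratio of LanceDB throughput to Elasticsearch throughput."""
    return lance_qps / es_qps


ratios = {
    case: round(lancedb_speedup(es, lance), 2)
    for case, (es, lance) in summary_qps.items()
}

for case, ratio in ratios.items():
    print(f"{case}: {ratio}x")
```

A ratio above 1 means LanceDB had the higher throughput. The ratios come out to about 1.17 for serial FTS, 0.34 for concurrent FTS, 4.54 for serial vector search, and 1.41 for concurrent vector search.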

Serial Benchmark

The serial benchmark shown below involves sequentially running queries in a sync for loop in Python. This isn't representative of a realistic use case in production, but is useful to understand the performance of the underlying search engines in each case (Lucene for Elasticsearch and Tantivy for LanceDB).
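A serial run of this kind might look like the sketch below. The `fake_search` function is a stand-in for a real LanceDB or Elasticsearch query call, and `run_serial` is a hypothetical helper; neither is taken from the repo.

```python
import time


def fake_search(query):
    """Stand-in for a real search call; returns a dummy result list."""
    return [f"result for {query}"]


def run_serial(search_fn, queries):
    """Run queries one after another and report elapsed time and QPS."""
    results = []
    start = time.perf_counter()
    for query in queries:
        results.append(search_fn(query))
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very coarse clocks.
    qps = len(queries) / elapsed if elapsed > 0 else float("inf")
    return results, elapsed, qps


queries = [f"q{i}" for i in range(100)]
results, elapsed, qps = run_serial(fake_search, queries)
print(f"{len(results)} queries in {elapsed:.4f} s")
```

Because each call blocks until it returns, the measured time is the sum of per-query latencies, which is why this setup mostly reflects the speed of the underlying engine and client.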

More details on this will be discussed in a blog post.

Full-text search (FTS)

Queries	Elasticsearch (sec)	Elasticsearch (QPS)	LanceDB (sec)	LanceDB (QPS)
10	0.0516	193.8	0.0518	193.0
100	0.2589	386.3	0.2383	419.7
1000	2.5748	388.6	2.1759	459.3
10000	25.0318	399.8	21.3196	468.9

Vector search

Queries	Elasticsearch (sec)	Elasticsearch (QPS)	LanceDB (sec)	LanceDB (QPS)
10	0.8087	12.4	0.2158	46.3
100	7.6020	13.1	1.6803	59.5
1000	84.0086	11.9	16.7948	59.5
10000	842.9494	11.9	185.0582	54.0

Concurrent Benchmark

The concurrent benchmark is designed to replicate a realistic use case for LanceDB or Elasticsearch - where multiple queries arrive at the same time, and the REST API on top of the DB has to handle asynchronous requests.

Note

The concurrency in Elasticsearch is achieved through its async client
The concurrency in LanceDB is achieved through Python's multiprocessing library on 4 worker threads (a higher number of threads resulted in slower performance).
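One way to fan queries out over 4 worker threads with the `multiprocessing` package is its `ThreadPool` class, as sketched below. Whether the repo uses `ThreadPool` specifically is an assumption; `fake_search` is a stand-in for a real query call, and the short sleep simulates time spent waiting on the search engine.

```python
import time
from multiprocessing.pool import ThreadPool


def fake_search(query):
    """Stand-in for a real search call; the sleep simulates waiting on the DB."""
    time.sleep(0.001)
    return [f"result for {query}"]


def run_concurrent(search_fn, queries, workers=4):
    """Distribute queries over a pool of worker threads; results keep input order."""
    start = time.perf_counter()
    with ThreadPool(processes=workers) as pool:
        results = pool.map(search_fn, queries)
    elapsed = time.perf_counter() - start
    qps = len(queries) / elapsed if elapsed > 0 else float("inf")
    return results, elapsed, qps


queries = [f"q{i}" for i in range(200)]
results, elapsed, qps = run_concurrent(fake_search, queries, workers=4)
print(f"{len(results)} queries in {elapsed:.4f} s")
```

Threads help here only to the extent that the search call releases the GIL or waits on I/O, which is one plausible reason adding more workers can stop paying off.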