Resolving concept labels to standardized identifiers from existing databases is a fundamental requirement when annotating biomedical data. While several annotation services are available, including BioPortal and the Translator Name Resolution service, most of them rely on straightforward string matching mechanisms (mgrep and Solr, respectively). Unfortunately, these mechanisms often fall short when a concept label differs substantially from the standardized label, or when dealing with synonyms.
We propose to explore the use of vector similarity search to improve the accuracy of concept resolution. We will leverage the extensive dataset gathered by the Translator Babel project, which includes a vast repository of identifiers, labels, and synonyms from the biomedical domain (PubChem, CHEMBL, UniProt, MONDO, OMIM, HGNC, DrugBank, and more).
During the Biomedical Linked Annotation Hackathon, our key objectives are as follows:
The name resolution service will be exposed as an OpenAPI-described API that takes a concept label as input and returns a list of matching entities, each represented by a dictionary with its score, ID (CURIE), label, and synonyms.
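As a rough illustration of the input and output described above, here is a minimal, self-contained sketch. It is not the project's implementation: the `embed` function is a toy character-trigram stand-in for a real embedding model, the `CONCEPTS` index and its `EX:` CURIEs are made up, and `resolve` is a hypothetical name.

```python
from __future__ import annotations

import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding (assumption): counts of character trigrams, not a real model."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical mini-index with made-up CURIEs; the real data would come from Babel.
CONCEPTS = [
    {"id": "EX:0001", "label": "type 2 diabetes mellitus",
     "synonyms": ["T2DM", "adult-onset diabetes"]},
    {"id": "EX:0002", "label": "BRCA1",
     "synonyms": ["BRCA1 DNA repair associated"]},
]


def resolve(label: str, limit: int = 5) -> list[dict]:
    """Return up to `limit` concepts, each with score, id, label and synonyms."""
    query = embed(label)
    results = []
    for concept in CONCEPTS:
        names = [concept["label"], *concept["synonyms"]]
        # A concept's score is the best match over its label and synonyms.
        score = max(cosine(query, embed(name)) for name in names)
        results.append({"score": round(score, 3), **concept})
    results.sort(key=lambda r: r["score"], reverse=True)
    return results[:limit]
```

Calling `resolve("type 2 diabetes")` returns a list of dictionaries sorted by descending score, each carrying the `score`, `id`, `label` and `synonyms` keys.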
| Name | Creation | GitHub stars | Written in | SDK for | Query language/API* | Supported vector distance functions | Comment |
|---|---|---|---|---|---|---|---|
| Qdrant | July 2020 | ~14k | Rust | Python, JS, Rust, Go, .NET | OpenAPI, gRPC | cosine, euclid, dot | Can be used as a local standalone tool, in memory or persisted on disk, without deploying a web service |
| Milvus | October 2019 | ~24k | Go | Python, JS, Java, Go | OpenAPI ❓️ | cosine, euclid, inner product | a.k.a. Zilliz Cloud |
| Chroma | October 2022 | ~9k | Python | Python, JS | OpenAPI ❓️ | | |
| Weaviate | March 2016 | ~8k | Go | Python, JS, Java, Go | GraphQL API | cosine, euclid | |
| pgvector | April 2021 | ~6.5k | C | Through Postgres SDK ❓️ | SQL | cosine, euclid, inner product, taxicab | Integrated in PostgreSQL |
*Query language/API specifies which type of query language or API can be used to query the information inside the vector database
All of these products are open source, and they all provide a simple web UI to explore the vector database.
Most of them have a modern and simple API (apart from pgvector, which lives within PostgreSQL).
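For readers unfamiliar with the vector functions listed in the table, the following plain-Python sketch shows what each one computes on two small vectors. The function names are illustrative and do not correspond to any of the databases' APIs.

```python
import math


def dot(a, b):
    """Inner (dot) product: larger means more similar."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product of the vectors after scaling both to unit length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclidean(a, b):
    """Straight-line distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def taxicab(a, b):
    """Sum of absolute coordinate differences (L1 distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


a, b = [3.0, 4.0], [4.0, 3.0]
print(dot(a, b), cosine_similarity(a, b), euclidean(a, b), taxicab(a, b))
```

When all vectors are normalized to unit length, cosine similarity and dot product produce the same ranking, so the choice between them mostly matters for models whose embeddings are not normalized.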
Reference benchmark for text embeddings models: https://huggingface.co/blog/mteb
Leaderboard: https://huggingface.co/spaces/mteb/leaderboard
Popular embedding models:
- bge-large-en-v1.5
- text-embedding-ada-002
- sentence-transformers/all-MiniLM-L6-v2
- jina-embeddings-v2-base-en
- embed-english-v3.0

To be defined.
Existing benchmarks for Vector databases:
Preliminary results as of 19/01/2024 (Babel synonyms not fully loaded yet; the files after Drug are missing: gene, protein, organisms, pathway, umls): most issues seem to be resolved, apart from "Rat" and "acp-044 dose a" (these do not time out but return no interesting results).
Start services:

    docker compose up -d

Get into the workspace container to run the loading scripts.

Download the Babel synonyms and load them in the vectordb:

    make load

(experimental) Load PubDictionaries in pgvector:

    python src/pubdict_load.py

`limit` feature from the vectordb (if the first 2 results from the vectordb are from the same point, then we will return only 1 result, which will not match the limit of 2 asked by the user)

A possible solution would be to use Postgres and pgvector, with 2 tables (one for embeddings, one for concept info), but that would make the system much more complex than a JSON store.
Is there any self-hosted vectordb that can support multiple unnamed vectors for a single point? (Qdrant currently only supports multiple named vectors which does not fit our use-case)
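The `limit` problem described above can be reproduced with a few lines of Python. This sketch assumes that several stored vectors (for example a label and its synonyms) map to the same concept ID; `search_points` is a stand-in for a vector database query, and the `EX:` IDs, matched strings and scores are made up. The over-fetching workaround in `resolve_unique` is one possible approach, not something the project is confirmed to use.

```python
def search_points(hits, limit):
    """Stand-in for a vector database query: `hits` is already sorted by score."""
    return hits[:limit]


def resolve_unique(hits, limit, overfetch=3):
    """Ask for more hits than needed, then keep the best hit per concept ID."""
    candidates = search_points(hits, limit * overfetch)
    best_per_concept = {}
    for hit in candidates:
        # Hits are sorted, so the first hit seen for a concept is its best one.
        best_per_concept.setdefault(hit["id"], hit)
        if len(best_per_concept) == limit:
            break
    return list(best_per_concept.values())


# Made-up hits: the two best vectors belong to the same concept.
HITS = [
    {"id": "EX:0001", "matched": "rat", "score": 0.98},
    {"id": "EX:0001", "matched": "rats", "score": 0.95},
    {"id": "EX:0002", "matched": "rattus", "score": 0.80},
]

naive = {hit["id"] for hit in search_points(HITS, 2)}
unique = [hit["id"] for hit in resolve_unique(HITS, 2)]
print(len(naive), unique)
```

With a plain `limit` of 2, both hits collapse into a single concept, whereas over-fetching yields two distinct concepts. The trade-off is that the over-fetch factor is a guess: a concept with many close synonyms can still crowd out other results.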
Introduction presentation: https://docs.google.com/presentation/d/1_nTMF-ltHvYbbvfUSDxSdBEb0Wm_yr_BvNNt-IvLKtc/edit
PubDictionaries experiment: https://docs.google.com/document/d/1nipvy2ZhZedmf5bjcUzcbGZIfN22V9KpZfO4eTXL89M/edit
Conclusion presentation: https://docs.google.com/presentation/d/1sJeuo4oegNmaMTrvCAWb0TZJZR9SGnYH-EFwTjf99lg/edit
Preprint biohackrxiv paper: http://preview.biohackrxiv.org/papers/bdda0f94-f526-4f35-8768-8faf62d731fa/paper.pdf
Demo API: https://concept-resolver.137.120.31.102.nip.io