This project demonstrates how to implement a Retrieval-Augmented Generation (RAG) pipeline using Hugging Face embeddings and ChromaDB for efficient semantic search. The pipeline reads, processes, and embeds textual data, enabling fast and accurate semantic queries over it.
It uses a Hugging Face embedding model (`BAAI/bge-base-en-v1.5`) to convert text chunks into vector representations.

Installation:

Before running the notebook, ensure the necessary libraries are installed:
```bash
pip install chromadb
pip install llama-index
```

You also need to clone the required datasets from Hugging Face if you just want to check the project out and see it working:
```bash
git clone https://huggingface.co/datasets/NahedAbdelgaber/evaluating-student-writing
git clone https://huggingface.co/datasets/transformersbook/emotion-train-split
```

The notebook then works through the following steps.

Load Datasets:

The cloned datasets are read and split into text chunks, as shown in the sketch below.
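As a rough sketch of this step, the cloned dataset folders can be read and chunked with llama-index's `SimpleDirectoryReader` and `SentenceSplitter`; the directory path, chunk size, and overlap below are illustrative assumptions, not values fixed by the notebook:

```python
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter

# Read every file in one of the cloned dataset folders (path is an assumption).
documents = SimpleDirectoryReader("./evaluating-student-writing").load_data()

# Split the documents into chunks small enough to embed individually.
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=50)
nodes = splitter.get_nodes_from_documents(documents)
chunks = [node.get_content() for node in nodes]
```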
Embedding Creation:
Using the `BAAI/bge-base-en-v1.5` model, text chunks are converted into vector embeddings. You can swap in any embedding model of your liking.

ChromaDB Integration:

The embeddings are stored in a ChromaDB collection so they can be searched efficiently. A sketch of both steps follows below.
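A minimal sketch of the embedding and storage steps, assuming the Hugging Face embedding integration for llama-index is installed (`pip install llama-index-embeddings-huggingface`); the collection name and storage path are placeholders:

```python
import chromadb
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Any Hugging Face embedding model name can be swapped in here.
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-base-en-v1.5")

# Persist the collection on disk so it can be reused across runs.
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("rag_chunks")

# `chunks` is the list of text chunks produced in the loading step above.
collection.add(
    ids=[str(i) for i in range(len(chunks))],
    documents=chunks,
    embeddings=[embed_model.get_text_embedding(chunk) for chunk in chunks],
)
```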
Semantic Search:

Queries are embedded with the same model and matched against the stored chunks to return the most relevant results.
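The notebook exposes this step through a `query_collection` helper. One possible shape for it, reusing the `embed_model` and `collection` objects from the sketch above (the notebook's actual implementation may differ), is:

```python
def query_collection(query_text: str, n_results: int = 1) -> list[str]:
    """Embed the query and return the closest stored text chunks."""
    query_embedding = embed_model.get_text_embedding(query_text)
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=n_results,
    )
    # ChromaDB returns one list of matching documents per query embedding.
    return results["documents"][0]
```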
Usage:

To use the code, simply run the notebook after installing the dependencies and cloning the required datasets. The following call queries the stored embeddings:
query_collection("Your search query here", n_results=1)This will return the most relevant text chunk based on the provided query.
```python
query_collection(
    "Even though the planet is very similar to Earth, there are challenges to get accurate data because of the harsh conditions on the planet.",
    n_results=1,
)
```

There are two files here. The simple one creates a vector database from a single file, while the advanced one can work on multiple files with different extensions, create a vector database from them, and also lets you test the results with a text-generation model (see the sketch below).
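As a hedged sketch of that last step, retrieved chunks can be stitched into a prompt for a Hugging Face text-generation pipeline; the model choice, prompt format, and the extra `transformers` dependency are all assumptions, not requirements of the notebooks:

```python
from transformers import pipeline

# Retrieve supporting context for a question, then ask a generator to answer.
question = "Why is it hard to collect accurate data about the planet?"
context = "\n".join(query_collection(question, n_results=3))

generator = pipeline("text-generation", model="gpt2")  # illustrative model
prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
print(generator(prompt, max_new_tokens=64)[0]["generated_text"])
```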
This repository is licensed under the MIT License.