Welcome to the Question-Answering Pipeline with VectorDB and Large Language Models (LLMs). This project builds an efficient, scalable pipeline for question-answering tasks using ChromaDB, an open-source vector database, together with Llama2, an open-source Large Language Model (LLM).
User Input: Users provide textual data sources in formats such as .pdf. These documents serve as the basis for generating responses.
Document Loading: LangChain's document loader is employed to efficiently load and preprocess the provided documents, ensuring compatibility with downstream tasks.
Document Chunking: The loaded documents are divided into smaller, manageable chunks to enhance the efficiency of the question-answering process.
Embedding Storage in VectorDB (ChromaDB): Embeddings are generated for each chunk and stored in ChromaDB, the project's vector database, enabling fast and accurate information retrieval.
Query Processing: User queries are converted into embeddings, allowing for a seamless comparison with the stored document embeddings.
Vector Database Search: ChromaDB is queried with the query embedding to retrieve the most relevant chunks of information.
LLM Processing (Llama2): The retrieved chunks, together with the original query, are passed to Llama2, which generates context-aware and accurate answers; see the end-to-end sketch after this list.
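To make the flow concrete, here is a minimal end-to-end sketch in Python using the LangChain APIs that match the dependencies installed below. The model name meta-llama/Llama-2-7b-chat-hf, the sentence-transformers embedding model, the chunk sizes, and the HF_TOKEN placeholder are illustrative assumptions rather than the notebook's exact values:

```python
# Minimal end-to-end sketch of the pipeline described above.
# Model names, chunk sizes, and HF_TOKEN are illustrative assumptions.
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
from langchain.llms import HuggingFacePipeline
from langchain.chains import RetrievalQA
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

HF_TOKEN = "hf_..."  # your Hugging Face access token (see setup below)

# Load and preprocess the source document.
documents = PyPDFLoader("data/example.pdf").load()

# Split into smaller, manageable chunks.
chunks = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=100
).split_documents(documents)

# Embed the chunks and store them in ChromaDB.
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)
vectordb = Chroma.from_documents(chunks, embeddings, persist_directory="chroma_db")

# At query time, the retriever embeds the question and searches ChromaDB.
retriever = vectordb.as_retriever(search_kwargs={"k": 4})

# Llama2 generates an answer from the retrieved chunks.
model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id, token=HF_TOKEN)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", load_in_4bit=True, token=HF_TOKEN
)
llm = HuggingFacePipeline(pipeline=pipeline(
    "text-generation", model=model, tokenizer=tokenizer, max_new_tokens=256
))

qa_chain = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)
print(qa_chain.run("What is this document about?"))
```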
To kickstart the Question-Answering Pipeline, users need to provide their textual data sources in supported formats (currently supported formats are: pdf, csv, html, xlsx, docx, xml, json). Follow the next section to ensure proper installation and configuration of dependencies.
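For illustration, here is one way the supported formats could map onto LangChain document loaders. The loader chosen per format is an assumption (the notebook may select loaders differently), though each class below exists in LangChain and relies on packages from the install list (pypdf, unstructured, jq):

```python
# Hypothetical extension-to-loader mapping; the notebook may choose differently.
import os
from langchain.document_loaders import (
    PyPDFLoader, CSVLoader, UnstructuredHTMLLoader, UnstructuredExcelLoader,
    UnstructuredWordDocumentLoader, UnstructuredXMLLoader, JSONLoader,
)

LOADERS = {
    ".pdf": PyPDFLoader,
    ".csv": CSVLoader,
    ".html": UnstructuredHTMLLoader,
    ".xlsx": UnstructuredExcelLoader,
    ".docx": UnstructuredWordDocumentLoader,
    ".xml": UnstructuredXMLLoader,
}

def load_document(input_path: str, jq_schema: str = ".[]"):
    """Pick a loader based on file extension and return a list of Documents."""
    ext = os.path.splitext(input_path)[1].lower()
    if ext == ".json":
        # JSONLoader needs a jq_schema describing which JSON fields hold the text.
        return JSONLoader(input_path, jq_schema=jq_schema, text_content=False).load()
    return LOADERS[ext](input_path).load()
```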
Follow these steps to run the Question-Answering Pipeline successfully:
Install Dependencies: Ensure that you have all the required dependencies installed. Run the following commands in a notebook cell:
!pip install langchain
!pip install pypdf
!pip install sentence-transformers
!pip install chromadb
!pip install accelerate
!pip install bitsandbytes
!pip install jq
!pip install unstructured
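As an optional, illustrative addition (not part of the original notebook), a quick sanity-check cell can confirm that the key packages import cleanly before you continue:

```python
# Optional sanity check: confirm the core dependencies import without errors.
import langchain, chromadb, sentence_transformers, accelerate, bitsandbytes, jq
print("All dependencies imported successfully.")
```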
Customize Parameters:
Open the notebook and locate the following parameters:
jq_schema: Customize this parameter to match the structure of your data; it tells the JSON loader which fields hold the text to extract, so it applies when your source is a .json file (see the configuration sketch after this list).
input_path: Specify the path to your textual data source, such as a .pdf file. Ensure that the path is correctly set to your document.
Hugging Face Authorization Token: Obtain an access token from Hugging Face; it is required to download the gated Llama2 model. Set the token in the appropriate section of the notebook.
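Purely as an illustration (the variable names mirror the parameters above, but the exact cell layout and values are the notebook's own), the configuration might look like:

```python
# Illustrative values only; replace with your own.
input_path = "data/report.pdf"       # path to your source document
jq_schema = ".messages[].content"    # example jq expression; used only for .json inputs

# Authenticate with Hugging Face so the gated Llama2 weights can be downloaded.
from huggingface_hub import login
login(token="hf_...")                # paste your Hugging Face access token here
```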
Run the Notebook: Run the Jupyter notebook cell by cell. Ensure that each cell executes successfully without errors.
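Once every cell has run, querying the pipeline could look like the snippet below, where qa_chain is a hypothetical stand-in for whatever RetrievalQA object the notebook builds:

```python
# Hypothetical final cell; qa_chain stands in for the notebook's QA chain object.
answer = qa_chain.run("Summarize the key points of the document.")
print(answer)
```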
We welcome contributions and feedback from the community. Whether you identify issues, have suggestions for improvements, or want to extend the functionality, your input is valuable to us. Feel free to contribute. Thank you for exploring our project.