How can we design and optimize a RAG system to provide personalized, reference-rich outputs for data science teams, while managing in-house packages, in an offline environment?
To break it down further:
DocQuest is an offline Retrieval-Augmented Generation (RAG) system designed for data science teams. It provides personalized, reference-rich outputs while operating in environments without internet connectivity. This system is ideal for secure settings where data privacy is paramount.
The system integrates documentation from multiple sources, including:
DOC-QUEST/
│
├── data/                                  # Folder for data/documents
│   ├── documents/                         # Raw or processed document storage
│   └── vector_db/                         # Vector databases
│       ├── child_docs/                    # Child documents
│       └── parent_docs/                   # Parent documents
│
├── notebooks/                             # Jupyter notebooks for prototyping and experimentation
│   ├── 1_documentation_download.ipynb
│   ├── 2_document_pre_processing.ipynb
│   ├── 3_embedding_vector_save_gpu.ipynb
│   ├── 4_conversation_rag.ipynb
│   ├── data_wrangling.ipynb
│   └── rag_v1.ipynb
│
├── src/                                   # Core source code for pipeline components
│   ├── 1_documentation_download.py
│   ├── 2_document_pre_processing.py
│   ├── 3_embedding_vector_save_gpu.py
│   └── 4_conversation_rag.py
│
├── .gitignore                             # Specifies files/folders to ignore in version control
├── doc_quest_app.py                       # DocQuest UI streamlit application
├── README.md                              # Project documentation
└── requirements.txt                       # Dependencies for the project
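The layout above can be checked before running the pipeline. The following is a minimal sketch, assuming the directory names shown in the tree; the helper `missing_dirs` and the list `EXPECTED_DIRS` are hypothetical names used for illustration and are not part of the repository. Some folders (for example the vector databases) may only appear after the pipeline scripts have been run.

```python
from pathlib import Path

# Directory names copied from the project tree; the list itself is an
# illustrative assumption, not something the repository defines.
EXPECTED_DIRS = [
    "data/documents",
    "data/vector_db/child_docs",
    "data/vector_db/parent_docs",
    "notebooks",
    "src",
]


def missing_dirs(root, expected=EXPECTED_DIRS):
    """Return the expected sub-directories that do not exist under root."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs(".")
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("Project layout looks complete.")
```

Run it from the repository root; an empty result suggests the folders the app expects are in place.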
Clone the repository:
git clone https://github.com/shrivastavasatyam/Doc-Quest.git
cd Doc-Quest
Set up a virtual environment:
python3 -m venv .venv
source .venv/bin/activate
Install dependencies:
pip install -r requirements.txt
Configure API key: Set up your GROQ API key as an environment variable:
export GROQ_API_KEY=your_groq_api_key
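To confirm the variable is visible to Python before launching the app, a check along these lines may help. This is a sketch only: the function name `get_groq_api_key` is hypothetical, and nothing here is taken from doc_quest_app.py.

```python
import os


def get_groq_api_key(env=None):
    """Return GROQ_API_KEY from the environment, or raise a clear error.

    `env` defaults to os.environ; passing a dict makes the check testable.
    """
    env = os.environ if env is None else env
    key = env.get("GROQ_API_KEY", "").strip()
    # Treat the README placeholder as "not configured".
    if not key or key == "your_groq_api_key":
        raise RuntimeError(
            "GROQ_API_KEY is not set; export it before starting the app."
        )
    return key
```

Failing early with an explicit message tends to be easier to diagnose than an authentication error raised later from inside the chat interface.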
Or add it directly in the doc_quest_app.py file:
os.environ["GROQ_API_KEY"] = "your_groq_api_key"
Prepare document paths:
Ensure your document paths are correctly set in the doc_quest_app.py file:
parent_doc_path = "/path/to/your/parent_docs"
child_doc_path = "./path/to/your/child_docs"
Launch the Streamlit app:
streamlit run doc_quest_app.py
Access the web interface at the URL provided by Streamlit (usually http://localhost:8501).
Use the chat interface to ask questions and interact with the RAG system.
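The parent_docs and child_docs folders suggest a parent/child retrieval scheme, in which small child chunks are matched against the question and the larger parent document is returned as context. The toy sketch below illustrates that idea under this assumption only; it uses bag-of-words cosine similarity instead of embeddings, and every function name and sample text in it is hypothetical rather than taken from DocQuest.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def build_child_index(parents, chunk_size=8):
    """Split each parent into word chunks, remembering the parent id."""
    index = []
    for pid, text in parents.items():
        words = text.split()
        for i in range(0, len(words), chunk_size):
            chunk = " ".join(words[i:i + chunk_size])
            index.append((pid, Counter(tokenize(chunk))))
    return index


def retrieve_parents(query, index, parents, k=1):
    """Rank child chunks by similarity, then return the top-k distinct parents."""
    q = Counter(tokenize(query))
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    seen = []
    for pid, _ in ranked:
        if pid not in seen:
            seen.append(pid)
        if len(seen) == k:
            break
    return [(pid, parents[pid]) for pid in seen]


# Illustrative placeholder documents, not real DocQuest data.
parents = {
    "pandas_merge": "DataFrame merge joins two tables on key columns "
                    "using inner, outer, left or right semantics.",
    "numpy_broadcast": "Broadcasting lets numpy combine arrays of different "
                       "shapes by stretching dimensions of size one.",
}
index = build_child_index(parents)
print(retrieve_parents("how do I merge two tables", index, parents))
```

Matching on small chunks keeps the comparison focused, while returning the parent gives the language model enough surrounding text to produce a referenced answer.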