This project is a conversational agent built on LangChain, the OpenAI API, and Retrieval-Augmented Generation (RAG). The agent is designed to read lengthy PDF documents, extract components such as text, images, and tables, and store them in a vector database for efficient retrieval during conversations with users.
PDF Processing: The agent is capable of parsing and extracting information from long PDF documents.
Multi-Modal Extraction: Extracts text, images, and tables from PDFs for a comprehensive understanding.
Vector Database: Utilizes a vector database to store and retrieve information efficiently.
Conversational AI: Implements the RAG concept to enhance conversational interactions with users.
We will use Unstructured to parse images, text, and tables from documents (PDFs).
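As a rough sketch, partitioning with Unstructured might look like the following. Parameter names such as extract_images_in_pdf and image_output_dir_path have shifted between unstructured releases, so treat them as illustrative and check your installed version:

```python
from unstructured.partition.pdf import partition_pdf

# Partition the PDF into typed elements (text, tables, images).
# Parameter names may differ across `unstructured` versions.
raw_pdf_elements = partition_pdf(
    filename="docs/source.pdf",      # path to your PDF
    extract_images_in_pdf=True,      # write embedded images to disk
    infer_table_structure=True,      # keep tables as structured elements
    chunking_strategy="by_title",    # group text under section titles
    max_characters=4000,             # cap chunk size
    new_after_n_chars=3800,
    combine_text_under_n_chars=2000,
    image_output_dir_path="figures/",
)

# Separate text chunks from tables by element type.
texts, tables = [], []
for element in raw_pdf_elements:
    if "Table" in str(type(element)):
        tables.append(str(element))
    elif "CompositeElement" in str(type(element)):
        texts.append(str(element))
```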
We will use the multi-vector retriever with Chroma to store raw text and images along with their summaries for retrieval.
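A minimal sketch of wiring the multi-vector retriever to Chroma is below. Import paths move between LangChain releases, and text_summaries is assumed to come from the summarization step described under Retrieval further down:

```python
import uuid

from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

# The vector store indexes the summaries; the docstore holds the raw content.
vectorstore = Chroma(
    collection_name="mm_rag",
    embedding_function=OpenAIEmbeddings(),
)
store = InMemoryStore()
id_key = "doc_id"

retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=store,
    id_key=id_key,
)

# Embed the summaries for similarity search, keep the raw chunks in the
# docstore, and link the two through a shared ID so retrieval returns
# the full content rather than the summary.
doc_ids = [str(uuid.uuid4()) for _ in texts]
summary_docs = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(text_summaries)
]
retriever.vectorstore.add_documents(summary_docs)
retriever.docstore.mset(list(zip(doc_ids, texts)))
```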
We will use GPT-4V both for image summarization (for retrieval) and for final answer synthesis from a joint review of images and text (or tables).
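Image summarization can be sketched as follows, assuming the images were written to disk by the partitioning step; the model name "gpt-4-vision-preview" reflects the API at the time of writing and may need updating:

```python
import base64

from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI

def encode_image(image_path: str) -> str:
    """Base64-encode an image file for inline transmission to the API."""
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def summarize_image(image_path: str) -> str:
    """Ask GPT-4V for a concise description to use as the retrieval key."""
    chat = ChatOpenAI(model="gpt-4-vision-preview", max_tokens=512)
    b64 = encode_image(image_path)
    msg = chat.invoke([
        HumanMessage(content=[
            {"type": "text", "text": "Describe this image concisely for retrieval."},
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
            },
        ])
    ])
    return msg.content
```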
LangChain <- Visit here for LangChain installation instructions.
OpenAI API <- Instructions for setting up and using OpenAI API.
Chroma DB <- Instructions for setting up and using the vector database.
Provide the path to the source PDF.
Change the prompt_text to suit your needs.
Replace the question in the query line with your own (see the configuration sketch below).
The agent will use the stored information for intelligent responses.
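Putting those three configuration points together, a hypothetical setup might look like this; the actual variable names in the code may differ:

```python
# Hypothetical configuration points; match these to the actual names in the code.
fpath = "docs/annual_report.pdf"  # 1. path to the source PDF

# 2. summarization prompt; tailor the wording to your documents
prompt_text = (
    "You are an assistant tasked with summarizing tables and text. "
    "Give a concise summary optimized for retrieval: {element}"
)

# 3. your question; `chain` stands in for the assembled RAG pipeline
query = "What were the key findings of the study?"
answer = chain.invoke(query)
```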
Retrieval
Retrieval is performed based on similarity to image summaries as well as text chunks. This requires careful consideration, because image retrieval can fail when there are competing text chunks. To mitigate this, I produce larger (4k-token) text chunks and summarize them for retrieval (a sketch follows).
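A sketch of that summarization step as a simple LCEL chain; the model choice and prompt wording are illustrative:

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# Each ~4k-token chunk is condensed into a short summary; the summary is what
# gets embedded, so text and image summaries compete on comparable footing.
prompt = ChatPromptTemplate.from_template(
    "Give a concise summary of the following text, optimized for retrieval:\n\n{element}"
)
summarize_chain = prompt | ChatOpenAI(model="gpt-4", max_tokens=256) | StrOutputParser()

text_summaries = summarize_chain.batch(
    [{"element": t} for t in texts], {"max_concurrency": 5}
)
```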
Image Size
The quality of answer synthesis appears to be sensitive to image size, as expected. I'll do evals soon to test this more carefully.
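In the meantime, one way to control for this is to downscale extracted images before sending them to GPT-4V. A Pillow-based sketch, where the 1024-pixel cap is an illustrative assumption rather than a tested value:

```python
from PIL import Image

def downscale(path: str, max_dim: int = 1024) -> None:
    """Resize an image in place so its longest side is at most max_dim pixels."""
    img = Image.open(path)
    if max(img.size) > max_dim:
        scale = max_dim / max(img.size)
        new_size = (int(img.width * scale), int(img.height * scale))
        img.resize(new_size, Image.LANCZOS).save(path)
```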
This project is licensed under the MIT License.