Retrieval-augmented generation (RAG) has grown increasingly popular as a way to improve the quality of text generated by large language models. Now that multimodal LLMs are in vogue, it's time to extend RAG to multimodal data.
When we add the ability to search and retrieve data across multiple modalities, we get a versatile tool for interacting with the most powerful AI models available today. However, we also add brand-new layers of complexity to the process.
Some of the considerations we need to take into account include:
On a more practical level, here are some of the basic knobs we can turn:
This project is a testbed for exploring these questions and more. It uses three open source libraries, FiftyOne, LlamaIndex, and Milvus, to make working with multimodal data, experimenting with different multimodal RAG techniques, and finding what works best for your use case as easy as possible.
Also note that LlamaIndex frequently updates its API. This is why the versions of LlamaIndex and its associated packages are all pinned in the plugin's requirements.
First, install FiftyOne:
```
pip install fiftyone
```

Next, using FiftyOne's CLI syntax, download and install the FiftyOne Multimodal RAG plugin:
```
fiftyone plugins download https://github.com/jacobmarks/fiftyone-multimodal-rag-plugin
```

LlamaIndex has a verbose installation process (at least if you want to build anything multimodal). Fortunately for you, this (and all other install dependencies) will be taken care of with the following command:
```
fiftyone plugins requirements @jacobmarks/multimodal_rag --install
```

To get started, launch the FiftyOne App. You can do so from the terminal by running:
```
fiftyone app launch
```

Or you can run the following Python code:
```python
import fiftyone as fo

session = fo.launch_app()
```

Now press the backtick key (`) and type create_dataset_from_llama_documents.
Press Enter to open the operator's modal. This operator gives you a UI to choose a directory containing your multimodal data (images, text files, PDFs, etc.) and create a FiftyOne dataset from it.
Once you've selected a directory, execute the operator. It will create a new dataset in your FiftyOne session. For text files, you will see an image rendering of the truncated text. For images, you will see the image itself.
You can add additional directories of multimodal data with the add_llama_documents_to_dataset operator.
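For a sense of what this looks like outside the App, here is a rough sketch of the same workflow written directly against LlamaIndex and FiftyOne. It is illustrative only, not the plugin's actual implementation: the directory path, dataset name, and extension filter are assumptions, and the exact LlamaIndex import path depends on which pinned version you are running.

```python
# Rough sketch of loading a multimodal directory and browsing it in FiftyOne.
# Paths and names are placeholders; the plugin's internals may differ.
import fiftyone as fo
from llama_index.core import SimpleDirectoryReader

DATA_DIR = "/path/to/multimodal/data"  # assumed location of your images/text/PDFs

# LlamaIndex walks the directory and wraps each file as a Document
documents = SimpleDirectoryReader(DATA_DIR).load_data()

# Collect just the image files and build a FiftyOne dataset you can browse
image_paths = [
    d.metadata["file_path"]
    for d in documents
    if d.metadata.get("file_path", "").lower().endswith((".jpg", ".jpeg", ".png"))
]
dataset = fo.Dataset("multimodal_rag_demo")
dataset.add_samples([fo.Sample(filepath=p) for p in image_paths])

session = fo.launch_app(dataset)
```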
Now that you have a multimodal dataset, you can index it with LlamaIndex and Milvus.
Use the create_multimodal_rag_index operator to kick off this process. This operator
will prompt you to name the index, and will give you the option to index the images
via CLIP embeddings or captions. If you choose captions, you will be prompted to select
the text field to use as the caption.
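Behind the scenes, this pairs a LlamaIndex multimodal index with Milvus vector stores. If you wanted to wire up a similar index by hand, a minimal sketch looks roughly like the following. The local Milvus Lite URI, collection names, and embedding dimensions are assumptions (not necessarily what the operator configures), and the exact import paths depend on your pinned LlamaIndex version.

```python
# Minimal sketch of a multimodal index built directly with LlamaIndex + Milvus.
# URIs, collection names, and dimensions below are placeholder assumptions.
from llama_index.core import SimpleDirectoryReader, StorageContext
from llama_index.core.indices import MultiModalVectorStoreIndex
from llama_index.vector_stores.milvus import MilvusVectorStore

# Separate Milvus collections for text and image embeddings
text_store = MilvusVectorStore(
    uri="milvus_rag.db", collection_name="text_collection", dim=1536, overwrite=True
)
image_store = MilvusVectorStore(
    uri="milvus_rag.db", collection_name="image_collection", dim=512, overwrite=True
)
storage_context = StorageContext.from_defaults(
    vector_store=text_store, image_store=image_store
)

# Text is embedded with the default text embedding model; images with CLIP
documents = SimpleDirectoryReader("/path/to/multimodal/data").load_data()
index = MultiModalVectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)
```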
If you do not have captions on your dataset, you might be interested in the FiftyOne Image Captioning Plugin.
```
fiftyone plugins download https://github.com/jacobmarks/fiftyone-image-captioning-plugin
```

Once you have created an index, you can inspect it by running the get_multimodal_rag_index_info operator
and selecting the index you want to inspect from the dropdown.
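If you'd rather poke at the underlying Milvus collections directly, pymilvus works too. In the sketch below, the database URI and collection name are placeholders rather than whatever the plugin actually creates.

```python
# Inspect Milvus collections directly; URI and collection name are assumptions.
from pymilvus import MilvusClient

client = MilvusClient("milvus_rag.db")  # or a server URI like "http://localhost:19530"
print(client.list_collections())
print(client.get_collection_stats("text_collection"))  # e.g. row counts
```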
Finally, you can query the index with the query_multimodal_rag_index operator.
This operator will prompt you to enter a query string and an index to query.
You can also specify the multimodal model to use for generating the retrieval-augmented results, as well as the number of image and text results to retrieve.
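For reference, the raw LlamaIndex equivalent of the retrieval step looks roughly like the sketch below, reusing the index object from the earlier sketch. The query string and top-k values are just examples that mirror the options the operator exposes; generation with a multimodal LLM on top of the retrieved results is omitted for brevity.

```python
# Rough equivalent of the query step against the index built earlier.
retriever = index.as_retriever(
    similarity_top_k=5,        # number of text results to retrieve
    image_similarity_top_k=5,  # number of image results to retrieve
)
results = retriever.retrieve("What does the product packaging look like?")

for result in results:
    print(result.score, result.node.metadata.get("file_path"))

# The retrieved text and image nodes can then be handed to a multimodal LLM
# to generate the final retrieval-augmented response.
```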