The project has two main parts:
The image retrieval part of the project uses a pre-trained OpenAI CLIP model (https://github.com/openai/clip) to retrieve images from a dataset that are relevant to a given text query. The dataset used for this project is Pascal VOC 2012, which contains around 3,500 images (train + validation). CLIP encodes both the text query and every image in the dataset into a shared embedding space, and the similarity between the query and each image is computed with cosine similarity. The images are then ranked by similarity score and the top k images are returned.
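A minimal sketch of this retrieval step, assuming the official `clip` package from the repository above is installed and that `image_paths` holds the paths to the VOC images (the variable names and the top-k value are illustrative, not taken from the project code):

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Encode the text query and L2-normalize it
text = clip.tokenize(["a dog playing in a park"]).to(device)
with torch.no_grad():
    text_features = model.encode_text(text)
text_features /= text_features.norm(dim=-1, keepdim=True)

# Encode every image in the dataset (image_paths is an assumed list of file paths)
image_features = []
for path in image_paths:
    image = preprocess(Image.open(path)).unsqueeze(0).to(device)
    with torch.no_grad():
        feat = model.encode_image(image)
    image_features.append(feat / feat.norm(dim=-1, keepdim=True))
image_features = torch.cat(image_features)

# Cosine similarity reduces to a dot product of normalized embeddings
similarity = (image_features @ text_features.T).squeeze(1)
top_values, top_indices = similarity.topk(5)  # indices of the top 5 matching images
```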
The image description generation part of the project uses a pre-trained Mistral-7B Instruct model in GGUF format (https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF) to generate descriptions for the given input query.
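A hedged sketch of how a GGUF model like this can be loaded for generation, assuming the `llama-cpp-python` package and a locally downloaded quantized file (the file name, prompt, and sampling parameters below are illustrative; the project notebook may load the model differently):

```python
from llama_cpp import Llama

# Path to the downloaded GGUF file is an assumption; point it at your local copy.
llm = Llama(model_path="mistral-7b-instruct-v0.1.Q4_K_M.gguf", n_ctx=2048)

# Mistral Instruct expects the [INST] ... [/INST] prompt format
prompt = "[INST] Write a short description of: a dog playing in a park [/INST]"
output = llm(prompt, max_tokens=256, temperature=0.7)
print(output["choices"][0]["text"])
```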
To run the project, open code.ipynb and run the notebook cells in order.

Check out the demo video to see Text2ImageDescription in action:
This project is licensed under the MIT License - see the LICENSE file for details.