This project focuses on embedding and retrieval of a large-scale fashion product dataset collected from major brands such as Aarong, Allen Solly, Bata, Apex, and Infinity. The dataset consists of over 20,000 products, covering a wide variety of categories and styles. The notebook leverages powerful models and tools to create embeddings for both text and images, and then stores these embeddings in a vector database using Qdrant. This setup enables efficient and accurate retrieval of fashion products based on semantic similarity.
The dataset, hosted on Hugging Face, includes over 20,000 fashion products scraped from multiple sources, with details like product category, company, name, description, specifications, image links, and more. You can explore the dataset on the Hugging Face Hub.
OpenAI's text-embedding-3-large model is used to create high-dimensional embeddings for the product descriptions and summaries, while a CLIP model (clip-ViT-B-32) from the SentenceTransformers library is employed to generate image embeddings that capture visual features for finding similar products based on their appearance. For each product, a summary string is generated, capturing key details such as category, company, name, and specifications. This string is then embedded using the text model. Simultaneously, the primary product image is downloaded, processed, and encoded to generate an image embedding. Both embeddings are stored in the Qdrant collection for efficient vector search.
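As a rough illustration of this step (a minimal sketch rather than the notebook's exact code, assuming the openai and sentence-transformers packages, an OPENAI_API_KEY in the environment, and placeholder column names such as Category, Company, and Product_name; only Image_link, Description, and Specifications appear later in the notebook):

```python
from io import BytesIO

import requests
from openai import OpenAI
from PIL import Image
from sentence_transformers import SentenceTransformer

openai_client = OpenAI()                           # reads OPENAI_API_KEY from the environment
clip_model = SentenceTransformer("clip-ViT-B-32")  # CLIP model used for image embeddings

def embed_product(row):
    # Build a summary string from key product fields (column names here are illustrative).
    summary = (f"Category: {row['Category']}, Company: {row['Company']}, "
               f"Name: {row['Product_name']}, Specifications: {row['Specifications']}")

    # Text embedding with OpenAI's text-embedding-3-large (3072 dimensions).
    text_vec = openai_client.embeddings.create(
        model="text-embedding-3-large",
        input=summary,
    ).data[0].embedding

    # Image embedding with CLIP: download the primary image and encode it (512 dimensions).
    image = Image.open(BytesIO(requests.get(row["Image_link"]).content)).convert("RGB")
    image_vec = clip_model.encode(image).tolist()

    return summary, text_vec, image_vec
```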
The Qdrant database is employed as a vector store for these embeddings, supporting real-time similarity searches based on both text and image queries. The notebook creates a collection that accommodates both summary and image vectors using cosine similarity.
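A collection of that shape could be created roughly as follows (a sketch assuming the qdrant-client package and a locally running Qdrant instance; the collection name is illustrative, and the vector sizes match text-embedding-3-large and clip-ViT-B-32):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient(url="http://localhost:6333")  # or QdrantClient(":memory:") for experiments

client.create_collection(
    collection_name="fashion_products",
    vectors_config={
        # Two named vectors per point: one for the text summary, one for the product image.
        "summary": VectorParams(size=3072, distance=Distance.COSINE),  # text-embedding-3-large
        "image": VectorParams(size=512, distance=Distance.COSINE),     # clip-ViT-B-32
    },
)
```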
The notebook iterates over the dataset and, for each product:

- builds the summary string from the key product fields,
- embeds the summary with text-embedding-3-large,
- downloads the primary product image and encodes it with clip-ViT-B-32,
- upserts both vectors, together with the product payload, into the Qdrant collection.

A sketch of this loop is shown below.
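Reusing the hypothetical embed_product helper and client from the sketches above, and assuming the dataset has been loaded into a pandas DataFrame product_df (as shown later for the LLaVA step), the loop might look like:

```python
from qdrant_client.models import PointStruct

points = []
for idx, row in product_df.iterrows():
    summary, text_vec, image_vec = embed_product(row)
    points.append(
        PointStruct(
            id=int(idx),
            vector={"summary": text_vec, "image": image_vec},  # the two named vectors
            payload={"summary": summary, **row.to_dict()},     # keep product fields for display
        )
    )

# Upload the points; for 20,000+ products, upserting in batches (e.g. every 100) is advisable.
client.upsert(collection_name="fashion_products", points=points)
```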
This setup allows seamless integration into any system requiring fashion product recommendations or search functionality based on multi-modal data.
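For example, a text query can be embedded with the same model and searched against the summary vectors; image queries work the same way against the image vectors (Product_name below is a placeholder payload field):

```python
query = "red cotton kurta for women"
query_vec = openai_client.embeddings.create(
    model="text-embedding-3-large",
    input=query,
).data[0].embedding

hits = client.search(
    collection_name="fashion_products",
    query_vector=("summary", query_vec),  # search against the named "summary" vector
    limit=5,
)
for hit in hits:
    print(round(hit.score, 3), hit.payload.get("Product_name"))
```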
The image above showcases the number of vector points stored in the Qdrant collection, visualizing the scale of the dataset and the embeddings stored.
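The same figure can also be read programmatically from the client (sketch):

```python
# Total number of points (products) stored in the collection.
print(client.count(collection_name="fashion_products").count)
```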
The project is an excellent resource for anyone looking to explore multi-modal embeddings, vector databases, and fashion data at scale.
This project utilizes the LLaVA (Large Language and Vision Assistant) model to generate product descriptions and specifications from images. The model is based on a conversational AI architecture that can interact with both text and visual inputs.
Before running the code, ensure you have the following dependencies installed:
- transformers and datasets libraries
- torch for PyTorch support
- PIL (Pillow) for image processing

Install the LLaVA package:
```
!pip install git+https://github.com/haotian-liu/LLaVA.git@786aa6a19ea10edc6f574ad2e16276974e9aaa3a
```

Install additional dependencies:
```
!pip install -qU datasets
```

Initialize the LLaVA ChatBot:

```python
from transformers import AutoTokenizer, BitsAndBytesConfig
from llava.model import LlavaLlamaForCausalLM
from llava.utils import disable_torch_init
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
from llava.mm_utils import tokenizer_image_token, KeywordsStoppingCriteria
from llava.conversation import conv_templates, SeparatorStyle
import torch
from PIL import Image
import requests
from io import BytesIO
# LLaVAChatBot is a helper wrapper around the llava components imported above;
# its full definition lives in the notebook and is not shown in this excerpt.
chatbot = LLaVAChatBot(load_in_8bit=True,
                       bnb_8bit_compute_dtype=torch.float16,
                       bnb_8bit_use_double_quant=True,
                       bnb_8bit_quant_type='nf8')
```

Load the dataset:

```python
from datasets import load_dataset
fashion = load_dataset(
    "thegreyhound/demo2",
    split="train"
)
product_df = fashion.to_pandas()
```

Generate product descriptions and specifications:

```python
cnt = 1
for index, row in product_df.iterrows():
str1 = "Given Image detail was: " + row['Description'] + " Now generate a brief high level description for the product shown in the image"
str2 = "Given Image detail was: " + row['Description'] + " Now generate a detailed specifications for the product shown in the image including the fabric, color, design, style etc"
ans1 = chatbot.start_new_chat(img_path=row['Image_link'],
prompt=str1)
ans2 = chatbot.start_new_chat(img_path=row['Image_link'],
prompt=str2)
product_df.loc[index, 'Description'] = ans1
product_df.loc[index, 'Specifications'] = ans2
print(cnt)
cnt += 1The script processes images and generates high-level product descriptions and detailed specifications. The final output is saved in a JSON file containing an array of product information.
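The export call itself is not included in this excerpt; saving the enriched DataFrame as a JSON array could look like this (the filename is illustrative):

```python
# Write all products, including the generated Description and Specifications, as a JSON array.
product_df.to_json("fashion_products_enriched.json", orient="records", indent=2)
```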
This project is licensed under the MIT License - see the LICENSE file for details.
You can find more details and access the dataset at Hugging Face.