This Python library provides a suite of advanced methods for aggregating multiple embeddings associated with a single document or entity into a single representative embedding. It supports a wide range of aggregation techniques, from simple averaging to sophisticated methods like PCA and Attentive Pooling.
To install the package, you can use pip:

```bash
pip install faiss_vector_aggregator
```

Below are examples demonstrating how to use the library to aggregate embeddings using different methods.
Suppose you have a collection of embeddings stored in a FAISS index, and you want to aggregate them by their associated document IDs using simple averaging.
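For intuition, simple averaging is just the element-wise mean of a document's embeddings. A minimal NumPy sketch (illustrative only — the toy `embeddings` array stands in for vectors pulled from a FAISS index and is not the library's API):

```python
import numpy as np

# Three hypothetical 4-dimensional embeddings belonging to one document ID.
embeddings = np.array([
    [0.2, 0.4, 0.1, 0.3],
    [0.6, 0.0, 0.2, 0.2],
    [0.1, 0.2, 0.3, 0.4],
])

# Simple averaging: the element-wise mean across the embeddings,
# producing a single representative vector of the same dimension.
aggregated = embeddings.mean(axis=0)
```

The library performs this per document ID, as shown below.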
```python
from faiss_vector_aggregator import aggregate_embeddings

# Aggregate embeddings using simple averaging
aggregate_embeddings(
    input_folder="data/input",
    column_name="id",
    output_folder="data/output",
    method="average"
)
```

- `input_folder`: Path to the folder containing the input FAISS index and metadata.
- `column_name`: The metadata field by which to aggregate embeddings (e.g., `id`).
- `output_folder`: Path where the output FAISS index and metadata will be saved.
- `method="average"`: Specifies the aggregation method.

If you have different weights for the embeddings, you can apply a weighted average to give more importance to certain embeddings.
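Numerically, a weighted average reduces to a single NumPy call. A minimal sketch (the toy arrays are hypothetical, not the library's internals):

```python
import numpy as np

# Hypothetical embeddings for one document and matching per-embedding weights.
embeddings = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
])
weights = [0.1, 0.3, 0.6]

# np.average handles both the weighting and the normalization by sum(weights).
aggregated = np.average(embeddings, axis=0, weights=weights)
```

The library applies the same weighting per document ID: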
```python
from faiss_vector_aggregator import aggregate_embeddings

# Example weights for the embeddings
weights = [0.1, 0.3, 0.6]

# Aggregate embeddings using weighted averaging
aggregate_embeddings(
    input_folder="data/input",
    column_name="id",
    output_folder="data/output",
    method="weighted_average",
    weights=weights
)
```

- `weights`: A list or array of weights corresponding to each embedding.
- `method="weighted_average"`: Specifies the weighted-average method.

To reduce high-dimensional embeddings to a single representative vector using PCA:
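Before the library call, it may help to see what PCA-based aggregation can look like in isolation. One common reading is to take the first principal component of a document's embeddings — the unit vector along the direction of greatest variance. A scikit-learn sketch (illustrative only; the random data is a stand-in):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical: 10 embeddings of dimension 8 for one document.
embeddings = rng.normal(size=(10, 8))

# Fit PCA with a single component and take that component as the
# representative vector; scikit-learn returns it as a unit vector.
pca = PCA(n_components=1)
pca.fit(embeddings)
representative = pca.components_[0]
```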
```python
from faiss_vector_aggregator import aggregate_embeddings

# Aggregate embeddings using PCA
aggregate_embeddings(
    input_folder="data/input",
    column_name="id",
    output_folder="data/output",
    method="pca"
)
```

- `method="pca"`: Specifies that PCA should be used for aggregation.

Use K-Means clustering to find the centroid of embeddings for each document ID.
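With a single cluster, the K-Means centroid coincides with the mean of the points, which makes the idea easy to verify in isolation. A scikit-learn sketch (toy data, not the library's internals):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-dimensional embeddings for one document ID.
embeddings = np.array([
    [0.0, 0.0],
    [2.0, 0.0],
    [1.0, 3.0],
])

# One cluster: the fitted centroid is the mean of the embeddings.
kmeans = KMeans(n_clusters=1, n_init=10, random_state=0).fit(embeddings)
centroid = kmeans.cluster_centers_[0]
```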
```python
from faiss_vector_aggregator import aggregate_embeddings

# Aggregate embeddings using K-Means clustering to find the centroid
aggregate_embeddings(
    input_folder="data/input",
    column_name="id",
    output_folder="data/output",
    method="centroid"
)
```

- `method="centroid"`: Specifies that K-Means clustering should be used.

To use an attention mechanism for aggregating embeddings:
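Attentive pooling computes a weighted sum of the embeddings, with weights given by a softmax over similarity scores against a query vector. The sketch below uses a fixed, hand-picked query for illustration (in a trained attention model the query would be learned); it is a conceptual sketch, not the library's implementation:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical embeddings for one document and an illustrative query vector.
embeddings = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
])
query = np.array([1.0, 0.5])

# Attention scores: similarity of each embedding to the query,
# normalized with softmax, then used as pooling weights.
scores = embeddings @ query   # shape (3,)
attn = softmax(scores)        # non-negative, sums to 1
pooled = attn @ embeddings    # attention-weighted sum of embeddings
```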
```python
from faiss_vector_aggregator import aggregate_embeddings

# Aggregate embeddings using attentive pooling
aggregate_embeddings(
    input_folder="data/input",
    column_name="id",
    output_folder="data/output",
    method="attentive_pooling"
)
```

- `method="attentive_pooling"`: Specifies the attentive pooling method.

Below is a description of the parameters accepted by `aggregate_embeddings`:

- `input_folder` (str): Path to the folder containing the input FAISS index (`index.faiss`) and metadata (`index.pkl`).
- `column_name` (str): The metadata field by which to aggregate embeddings (e.g., `id`).
- `output_folder` (str): Path where the output FAISS index and metadata will be saved.
- `method` (str): The aggregation method to use. Options include: `'average'`, `'weighted_average'`, `'geometric_mean'`, `'harmonic_mean'`, `'centroid'`, `'pca'`, `'median'`, `'trimmed_mean'`, `'max_pooling'`, `'min_pooling'`, `'entropy_weighted_average'`, `'attentive_pooling'`, `'tukeys_biweight'`, `'exemplar'`.
- `weights` (list or np.ndarray, optional): Weights for the `weighted_average` method.
- `trim_percentage` (float, optional): Fraction to trim from each end for `trimmed_mean`. Should be between 0 and 0.5 (exclusive of 0.5).

Ensure you have the following packages installed: `faiss-cpu` (or `faiss-gpu`), `numpy`, `scipy`, `scikit-learn`, and `langchain`.
You can install the dependencies using:

```bash
pip install faiss-cpu numpy scipy scikit-learn langchain
```

Note: Replace `faiss-cpu` with `faiss-gpu` if you prefer to use the GPU version of FAISS.
Contributions are welcome! Please feel free to submit a pull request or open an issue on the GitHub repository.
When contributing, please ensure that your code adheres to the project's coding guidelines.
This project is licensed under the MIT License. See the LICENSE file for details.
This library operates on embeddings stored in a FAISS vector store. Ensure that your embeddings and indexes are handled consistently when integrating with LangChain.