Similarities: a toolkit for similarity calculation and semantic search, supporting text and images.
Similarities implements a variety of similarity calculation and semantic matching retrieval algorithms for text and images. It supports search over billion-scale datasets, including text search and image search. It is developed in Python 3, installs with pip, and works out of the box.
Image Search Demo: https://huggingface.co/spaces/shibing624/CLIP-Image-Search
Text Search Demo: https://huggingface.co/spaces/shibing624/similarities
pip install torch # conda install pytorch
pip install -U similarities
or
git clone https://github.com/shibing624/similarities.git
cd similarities
pip install -e .
example: examples/text_similarity_demo.py
from similarities import BertSimilarity
m = BertSimilarity(model_name_or_path="shibing624/text2vec-base-chinese")
r = m.similarity('如何更换花呗绑定银行卡', '花呗更改绑定银行卡')
print(f"similarity score: {float(r)}")  # similarity score: 0.855146050453186
model_name_or_path: the model name or a local path. By default, the model is downloaded from the HF model hub, and the Chinese semantic matching model shibing624/text2vec-base-chinese is used. For multilingual use, replace it with the shibing624/text2vec-base-multilingual model, which supports Chinese, English, Korean, Japanese, German, Italian, and other languages.
Semantic search finds the texts most similar to a query in a candidate document set. It is often used for similar-question matching and text search in QA scenarios.
example: examples/text_semantic_search_demo.py
example: examples/fast_text_semantic_search_demo.py
Text embedding, index building, batch search, and starting the service: examples/faiss_bert_search_server_demo.py
Client-side Python call: examples/faiss_bert_search_client_demo.py
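The semantic search described above can be illustrated without any model: once every text has been turned into a vector, search is a ranking of candidates by similarity to the query vector. The following is a minimal sketch, assuming cosine similarity as the score and using made-up 3-dimensional vectors in place of real model embeddings. The helper names `cosine` and `top_k` are hypothetical and are not part of the similarities API.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k(query_vec, corpus_vecs, k=3):
    """Return up to k (index, score) pairs, highest score first."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(corpus_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical embeddings for illustration only; a real model
# produces vectors with hundreds of dimensions.
corpus = {
    "how to change the bound bank card": [0.9, 0.1, 0.0],
    "what is the weather today": [0.0, 0.2, 0.9],
    "update my linked bank card": [0.8, 0.3, 0.1],
}
query_vec = [1.0, 0.2, 0.0]

sentences = list(corpus)
hits = top_k(query_vec, list(corpus.values()), k=2)
for idx, score in hits:
    print(f"{sentences[idx]}\t{score:.4f}")
```

This brute-force scan compares the query with every candidate, which is fine for small sets. For large corpora an approximate nearest-neighbor index such as Faiss is the usual choice, as in the server demo listed above.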
Supports similarity calculation and literal matching search with algorithms such as synonym Cilin, CNKI HowNet, WordEmbedding, TF-IDF, SimHash, and BM25. These are often used for the cold start of text matching.
example: examples/literal_text_semantic_search_demo.py
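To make the literal matching idea concrete, here is a minimal sketch of BM25 scoring, one of the algorithms listed above. It assumes the Okapi BM25 formula with a non-negative IDF variant and plain whitespace tokenization; Chinese text would first need a word segmenter. The function name `bm25_scores` and the sample documents are hypothetical, and this is not the library's implementation.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against the query with Okapi BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for doc in docs_tokens:
        df.update(set(doc))

    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # The "+ 1.0" inside the log keeps the IDF non-negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            denom = tf[term] + k1 * (1.0 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / denom
        scores.append(score)
    return scores


docs = [
    "change the bank card bound to the account",
    "the weather is sunny today",
    "bank card payment failed",
]
docs_tokens = [doc.split() for doc in docs]
scores = bm25_scores("change bank card".split(), docs_tokens)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])
```

Because the score depends only on term overlap, no model or training data is needed, which is why such methods suit cold-start matching.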
Supports image similarity calculation and matching search with algorithms such as CLIP, pHash, and SIFT. The Chinese CLIP model supports image-to-image search and text-to-image search, including image-text retrieval in both Chinese and English.
example: examples/image_semantic_search_demo.py
Image embedding, index building, batch search, and starting the service: examples/faiss_clip_search_server_demo.py
Client-side Python call: examples/faiss_clip_search_client_demo.py
Client-side Gradio call: examples/faiss_clip_search_gradio_demo.py
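For hash-based image matching such as pHash, the comparison step is simple once each image has been reduced to a fixed-length bit string. The sketch below shows one common way to turn two hashes into a similarity score through Hamming distance. The 64-bit hash values are made up for illustration, the hashing step itself is not shown, and the function names are hypothetical rather than the library's API.

```python
def hamming_distance(hash_a: int, hash_b: int) -> int:
    """Number of bit positions in which the two hashes differ."""
    return bin(hash_a ^ hash_b).count("1")


def hash_similarity(hash_a: int, hash_b: int, n_bits: int = 64) -> float:
    """Map Hamming distance to a score in [0, 1]; 1.0 means identical hashes."""
    return 1.0 - hamming_distance(hash_a, hash_b) / n_bits


# Hypothetical 64-bit perceptual hashes (not computed from real images).
original = 0xF0F0F0F0F0F0F0F0
near_duplicate = original ^ 0b101      # differs from the original in 2 bits
unrelated = 0x0F0F0F0F0F0F0F0F         # differs from the original in all 64 bits

print(hash_similarity(original, near_duplicate))
print(hash_similarity(original, unrelated))
```

A small Hamming distance suggests near-duplicate images, for example the same picture after resizing or recompression, while embedding models such as CLIP are better suited to semantic matches.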
Large-scale datasets can be clustered with the community_detection algorithm to find clusters (i.e., groups of similar sentences).
example: examples/text_clustering_demo.py
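The idea behind threshold-based community detection can be sketched in a few lines: for each item, collect all items whose similarity exceeds a threshold, then greedily keep the largest groups and drop members that were already assigned. This is a simplified brute-force sketch under those assumptions, with made-up 2-dimensional vectors; it is not the library's implementation, and the parameter names are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def community_detection(vectors, threshold=0.9, min_community_size=2):
    """Greedy, non-overlapping grouping of vectors by cosine similarity."""
    n = len(vectors)
    candidates = []
    for i in range(n):
        members = [j for j in range(n) if cosine(vectors[i], vectors[j]) >= threshold]
        if len(members) >= min_community_size:
            candidates.append(members)

    # Largest candidate groups first, so big clusters win contested members.
    candidates.sort(key=len, reverse=True)

    assigned = set()
    communities = []
    for members in candidates:
        fresh = [j for j in members if j not in assigned]
        if len(fresh) >= min_community_size:
            communities.append(fresh)
            assigned.update(fresh)
    return communities


# Hypothetical embeddings: two tight groups and one item in between.
vectors = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.95, 0.05],
    [0.0, 1.0],
    [0.1, 0.9],
    [0.7, 0.7],
]
communities = community_detection(vectors)
print(communities)
```

Unlike k-means, this approach does not need the number of clusters in advance, and items that are not close to anything (index 5 here) are simply left unclustered.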
The paraphrase mining (paraphrase_mining_embeddings) algorithm mines sentence pairs with similar meanings from a large number of sentences or documents. It can be used for detecting redundant text and images and for semantic deduplication.
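As a rough illustration of paraphrase mining, the sketch below scores every pair of sentence vectors and returns the highest-scoring pairs. It assumes cosine similarity and uses made-up 3-dimensional vectors instead of real embeddings. The function `mine_pairs` is hypothetical; scoring all pairs is quadratic in the number of sentences, so real implementations typically process the data in chunks.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def mine_pairs(vectors, top_k=2):
    """Return the top_k (score, i, j) pairs with i < j, best score first."""
    pairs = [
        (cosine(vectors[i], vectors[j]), i, j)
        for i, j in combinations(range(len(vectors)), 2)
    ]
    pairs.sort(reverse=True)
    return pairs[:top_k]


sentences = [
    "How do I reset my password?",
    "I forgot my password, how can I reset it?",
    "Where is the nearest station?",
    "How do I get to the closest station?",
]
# Hypothetical embeddings for the sentences above.
vectors = [
    [0.9, 0.1, 0.0],
    [0.85, 0.2, 0.05],
    [0.0, 0.1, 0.95],
    [0.05, 0.2, 0.9],
]

pairs = mine_pairs(vectors, top_k=2)
for score, i, j in pairs:
    print(f"{score:.4f}\t{sentences[i]}\t{sentences[j]}")
```

For deduplication, one member of each high-scoring pair can be dropped once the score passes a threshold chosen for the data at hand.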
code: cli.py
> similarities -h
NAME
    similarities

SYNOPSIS
    similarities COMMAND

COMMANDS
    COMMAND is one of the following:

     bert_embedding
       Compute embeddings for a list of sentences

     bert_index
       Build indexes from text embeddings using autofaiss

     bert_filter
       Entry point of bert filter, batch search index

     bert_server
       Main entry point of bert search backend, start the server

     clip_embedding
       Embedding text and image with clip model

     clip_index
       Build indexes from embeddings using autofaiss

     clip_filter
       Entry point of clip filter, batch search index

     clip_server
       Main entry point of clip search backend, start the server
run:
pip install similarities -U
similarities clip_embedding -h
# example
cd examples
similarities clip_embedding data/toy_clip/
bert_embedding and the other entries are subcommands. Commands starting with bert are text-related, and commands starting with clip are image-related.
For the usage of each subcommand, see its help output, e.g. similarities clip_embedding -h
data/toy_clip/ is the input_dir parameter of the clip_embedding method: the input file directory (required).
If you use similarities in your research, please cite it in the following format:
APA:
Xu, M. Similarities: Compute similarity score for humans (Version 1.0.1) [Computer software]. https://github.com/shibing624/similarities
BibTeX:
@misc{Xu_Similarities_Compute_similarity,
title={Similarities: similarity calculation and semantic search toolkit},
author={Xu Ming},
year={2022},
howpublished={\url{https://github.com/shibing624/similarities}},
}
The project is licensed under the Apache License 2.0 and is free for commercial use. Please include a link to similarities and the license in your product description.
The project code is still rough. If you improve the code, you are welcome to submit it back to this project. Before submitting, note the following two points:
Add corresponding unit tests under tests
Run python -m pytest to run all unit tests and make sure they all pass
After that, you can submit your PR.
Thanks for their great work!