
Similarities: Similarity Calculation and Semantic Search

Similarities: a toolkit for similarity calculation and semantic search that supports both text and images.

Similarities implements a variety of similarity calculation and semantic matching retrieval algorithms for text and images. It supports search over billions of entries, including text search and image search. It is developed in Python 3, installs with pip, and works out of the box.

Guide

Features
Install
Usage
Contact
Acknowledgements

Features

Text similarity calculation + text search

Semantic matching models [recommended]: this project implements text similarity calculation and text search with the CoSENT model, based on text2vec
- Supports Chinese, English, and multilingual pre-trained models of the SentenceBERT family
- Supports cosine similarity, dot product, Hamming distance, Euclidean distance, and other similarity measures
- Supports several text search algorithms, such as SemanticSearch, Faiss, Annoy, and Hnsw
- Supports efficient retrieval over billions of entries
- Supports command-line text-to-vector conversion (multi-GPU), index building, batch retrieval, and starting a service
Literal matching models: this project implements several literal matching models, such as Word2Vec, BM25, RankBM25, TFIDF, SimHash, Cilin (a synonym thesaurus), and HowNet sememe matching
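The similarity measures named in the list above can be illustrated in a few lines. This is a minimal pure-Python sketch on toy vectors, not the toolkit's own implementation; the function names are chosen here for illustration only.

```python
import math
from typing import Sequence


def dot_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of the two vectors after scaling each to unit length."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot_product(a, b) / (norm_a * norm_b)


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two integer-encoded bit strings."""
    return bin(a ^ b).count("1")
```

Cosine similarity and dot product grow with similarity, while Euclidean and Hamming distance shrink, so a ranking built on the latter two sorts in ascending order.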

Image similarity calculation / image-text similarity calculation + image search / text-to-image search

CLIP (Contrastive Language-Image Pre-Training) model: an image-text matching model that can be used for image and text feature extraction (embeddings), similarity calculation, image-text retrieval, and zero-shot image classification. Based on PyTorch, this project implements vector representation, index building (based on AutoFaiss), batch search, a back-end service (based on FastAPI), and a front-end display (based on Gradio) for the CLIP model.
- Supports CLIP series models such as openai/clip-vit-base-patch32
- Supports Chinese-CLIP series models such as OFA-Sys/chinese-clip-vit-huge-patch14
- Supports separate front-end and back-end deployment, with a FastAPI back-end service and a Gradio front-end display
- Supports efficient retrieval over billions of entries, based on Faiss, with GPU acceleration
- Supports image-to-image search, text-to-image search, and vector search
- Supports image embedding extraction and text embedding extraction
- Supports image similarity calculation and image-text similarity calculation
- Supports command-line image-to-vector conversion (multi-GPU), index building, batch retrieval, and starting a service
Image feature extraction: this project implements several image feature extraction algorithms, such as pHash, dHash, wHash, aHash, and SIFT, based on cv2
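To give a feel for the perceptual hashes mentioned above, here is a minimal sketch of a difference hash (dHash). It assumes the image has already been converted to grayscale and downscaled (a 9-column by 8-row grid is typical, giving 64 bits). The function names are hypothetical, and implementations differ on which comparison direction sets a bit.

```python
from typing import List


def dhash_bits(gray: List[List[int]]) -> int:
    """Difference hash: one bit per pair of horizontally adjacent pixels.

    A bit is 1 when the left pixel is brighter than its right neighbor
    (a convention chosen for this sketch).
    """
    value = 0
    for row in gray:
        for left, right in zip(row, row[1:]):
            value = (value << 1) | (1 if left > right else 0)
    return value


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits; a small distance suggests similar images."""
    return bin(a ^ b).count("1")
```

Because only the sign of each brightness difference is kept, adding a constant to every pixel leaves the hash unchanged, which is why such hashes tolerate uniform brightness shifts.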

Demo

Image Search Demo: https://huggingface.co/spaces/shibing624/CLIP-Image-Search

Text Search Demo: https://huggingface.co/spaces/shibing624/similarities

Install

 pip install torch # conda install pytorch
pip install -U similarities

or

 git clone https://github.com/shibing624/similarities.git
cd similarities
pip install -e .

Usage

1. Text vector similarity calculation

example: examples/text_similarity_demo.py

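from similarities import BertSimilarity

# Load the Chinese semantic matching model (downloaded from the HF model hub on first use)
m = BertSimilarity(model_name_or_path="shibing624/text2vec-base-chinese")
# The two sentences mean roughly "How do I change the bank card linked to Huabei"
# and "Change the bank card linked to Huabei"
r = m.similarity('如何更换花呗绑定银行卡', '花呗更改绑定银行卡')
print(f"similarity score: {float(r)}")  # similarity score: 0.855146050453186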

model_name_or_path: the model name or a local path. By default the model is downloaded from the HF model hub, and the Chinese semantic matching model shibing624/text2vec-base-chinese is used. For multilingual text, replace it with the shibing624/text2vec-base-multilingual model, which supports Chinese, English, Korean, Japanese, German, Italian, and other languages.

2. Text vector search

Find the texts most similar to a query within a candidate set of documents. This is commonly used for similar-question matching in QA scenarios and for text search.

SemanticSearch: an exact search algorithm (cosine similarity + top-K search), suitable for datasets of up to millions of entries

example: examples/text_semantic_search_demo.py
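The idea behind exact search (cosine similarity followed by top-K selection) can be sketched as follows. This is an illustrative pure-Python version on toy vectors; `exact_top_k` is a hypothetical helper, not part of the similarities API, and a real system would use embeddings produced by a model.

```python
import heapq
import math
from typing import List, Sequence, Tuple


def _unit(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def exact_top_k(
    query: Sequence[float],
    corpus: Sequence[Sequence[float]],
    top_k: int = 3,
) -> List[Tuple[int, float]]:
    """Score every corpus vector against the query and keep the best top_k.

    Returns (corpus_index, cosine_similarity) pairs, best first.
    """
    q = _unit(query)
    scored = []
    for idx, doc in enumerate(corpus):
        d = _unit(doc)
        scored.append((idx, sum(a * b for a, b in zip(q, d))))
    return heapq.nlargest(top_k, scored, key=lambda pair: pair[1])
```

Every query is compared with every corpus vector, so the cost grows linearly with corpus size. That is why exact search suits smaller collections and approximate indexes take over at larger scale.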

Approximate search algorithms such as Annoy and Hnswlib, suitable for datasets on the order of millions of entries

example: examples/fast_text_semantic_search_demo.py

Faiss: efficient vector search, suitable for datasets with billions of entries

Text-to-vector conversion, index building, batch search, and starting the service: examples/faiss_bert_search_server_demo.py
Client-side Python call: examples/faiss_bert_search_client_demo.py

3. Literal-matching text similarity calculation and text search

Supports similarity calculation and literal-matching search with algorithms such as Cilin (a synonym thesaurus), HowNet, WordEmbedding, TF-IDF, SimHash, and BM25. These are often used for cold-start text matching, when no trained semantic model is available yet.

example: examples/literal_text_semantic_search_demo.py
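As an illustration of literal matching, the sketch below scores pre-tokenized documents with the Okapi BM25 formula, using the non-negative IDF variant ln((N - n + 0.5) / (n + 0.5) + 1). It is a simplified stand-in rather than the toolkit's implementation: the class name, default parameters, and the assumption of a non-empty, already tokenized corpus are choices made here (Chinese text would first need word segmentation).

```python
import math
from collections import Counter
from typing import List, Tuple


class BM25Sketch:
    """Minimal Okapi BM25 over a non-empty corpus of token lists."""

    def __init__(self, corpus: List[List[str]], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.term_freqs = [Counter(doc) for doc in corpus]
        self.doc_lens = [len(doc) for doc in corpus]
        self.avgdl = sum(self.doc_lens) / len(corpus)
        n_docs = len(corpus)
        # Document frequency: in how many documents each term occurs.
        df = Counter(term for doc in corpus for term in set(doc))
        self.idf = {
            term: math.log((n_docs - n + 0.5) / (n + 0.5) + 1.0)
            for term, n in df.items()
        }

    def score(self, query: List[str], index: int) -> float:
        """BM25 score of one document for the query terms."""
        freqs = self.term_freqs[index]
        length_norm = 1 - self.b + self.b * self.doc_lens[index] / self.avgdl
        total = 0.0
        for term in query:
            f = freqs.get(term, 0)
            if f:
                total += self.idf[term] * f * (self.k1 + 1) / (f + self.k1 * length_norm)
        return total

    def rank(self, query: List[str]) -> List[Tuple[int, float]]:
        """All documents as (index, score) pairs, best first."""
        scores = [(i, self.score(query, i)) for i in range(len(self.term_freqs))]
        return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

Because the score depends only on shared tokens, two sentences with the same meaning but no words in common score zero. That limitation is the gap the semantic models in the earlier sections are meant to fill.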

4. Image similarity calculation and image search

Supports image similarity calculation and matching search with algorithms such as CLIP, pHash, and SIFT. The Chinese CLIP model supports image-to-image search and text-to-image search, as well as Chinese and English image-text retrieval.

example: examples/image_semantic_search_demo.py
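Zero-shot image classification with a CLIP-style model amounts to comparing one image embedding with the text embeddings of candidate labels and applying a softmax to the scaled similarities. The sketch below shows that step on toy vectors. The embeddings are made up, the function name is hypothetical, and the scale of 100 is an assumption modeled on the temperature commonly seen in released CLIP checkpoints.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_probs(
    image_emb: Sequence[float],
    label_embs: Sequence[Sequence[float]],
    logit_scale: float = 100.0,  # assumed temperature, not read from a model
) -> List[float]:
    """Probability of each label given the image embedding."""
    return softmax([logit_scale * cosine(image_emb, emb) for emb in label_embs])
```

The label with the highest probability is the prediction. No label-specific training is involved, only the shared embedding space learned during pre-training.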


Faiss: efficient vector search, suitable for datasets with billions of entries

Image-to-vector conversion, index building, batch search, and starting the service: examples/faiss_clip_search_server_demo.py
Client-side Python call: examples/faiss_clip_search_client_demo.py
Gradio front-end call: examples/faiss_clip_search_gradio_demo.py

5. Clustering

The community_detection algorithm can cluster large-scale datasets to find clusters, that is, groups of similar sentences.

example: examples/text_clustering_demo.py
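A threshold-based community detection pass can be sketched as below. For every item it collects all items whose cosine similarity reaches a threshold, then greedily accepts the largest groups first and skips members that are already assigned. This is a simplified illustration on toy vectors; the parameter names and defaults are assumptions and may differ from the function the project actually calls.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def community_detection(
    embeddings: Sequence[Sequence[float]],
    threshold: float = 0.75,
    min_community_size: int = 2,
) -> List[List[int]]:
    """Group indices whose embeddings are mutually close to a shared center."""
    n = len(embeddings)
    candidates = []
    for i in range(n):
        # Each item acts as a potential center and gathers its close neighbors.
        members = [j for j in range(n) if cosine(embeddings[i], embeddings[j]) >= threshold]
        if len(members) >= min_community_size:
            candidates.append(members)
    candidates.sort(key=len, reverse=True)  # largest groups are accepted first

    assigned = set()
    communities = []
    for members in candidates:
        fresh = [j for j in members if j not in assigned]
        if len(fresh) >= min_community_size:
            communities.append(fresh)
            assigned.update(fresh)
    return communities
```

Items that have no sufficiently close neighbors end up in no community, which distinguishes this approach from algorithms such as k-means that assign every point to a cluster.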

6. Image and text semantic deduplication

The paraphrase mining algorithm (paraphrase_mining_embeddings) finds sentence pairs with similar meanings in a large collection of sentences or documents. It can be used to detect redundant images and text and to perform semantic deduplication.

Text semantic deduplication: examples/text_duplicates_demo.py
Image semantic deduplication: examples/image_duplicates_demo.py
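The two steps described above, mining similar pairs and then removing duplicates, can be sketched on toy embeddings as follows. This brute-force version compares all pairs and is meant only to show the idea; the function names and thresholds are hypothetical and are not the signatures used by the project.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mine_similar_pairs(
    embeddings: Sequence[Sequence[float]], threshold: float = 0.9
) -> List[Tuple[float, int, int]]:
    """Return (score, i, j) for every pair i < j scoring at least threshold, best first."""
    pairs = []
    for i, j in combinations(range(len(embeddings)), 2):
        score = cosine(embeddings[i], embeddings[j])
        if score >= threshold:
            pairs.append((score, i, j))
    return sorted(pairs, reverse=True)


def deduplicate(embeddings: Sequence[Sequence[float]], threshold: float = 0.9) -> List[int]:
    """Keep an item only if it is not too similar to any item kept before it."""
    kept: List[int] = []
    for i, vec in enumerate(embeddings):
        if all(cosine(vec, embeddings[k]) < threshold for k in kept):
            kept.append(i)
    return kept
```

Comparing all pairs costs quadratic time in the number of items, so large collections usually need chunking or an approximate index instead of this exhaustive loop.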

Command Line Mode (CLI)

Supports batch extraction of text and image vectors (embedding)
Supports index building (index)
Supports batch retrieval (filter)
Supports starting a service (server)

code: cli.py

 > similarities -h                                    

NAME
    similarities

SYNOPSIS
    similarities COMMAND

COMMANDS
    COMMAND is one of the following:

     bert_embedding
       Compute embeddings for a list of sentences

     bert_index
       Build indexes from text embeddings using autofaiss

     bert_filter
       Entry point of bert filter, batch search index

     bert_server
       Main entry point of bert search backend, start the server

     clip_embedding
       Embedding text and image with clip model

     clip_index
       Build indexes from embeddings using autofaiss

     clip_filter
       Entry point of clip filter, batch search index

     clip_server
       Main entry point of clip search backend, start the server

run:

pip install similarities -U
similarities clip_embedding -h

# example
cd examples
similarities clip_embedding data/toy_clip/

bert_embedding and the other entries are subcommands: those starting with bert handle text, and those starting with clip handle images.
For the parameters of each subcommand, see its help output, for example similarities clip_embedding -h
In the example above, data/toy_clip/ is the input_dir parameter of clip_embedding, that is, the input file directory (required)

Contact

Issue (suggestions):
Email me: xuming: [email protected]
WeChat: add my WeChat ID xuming624 with the note "Name-Company-NLP" to join the NLP discussion group.

Citation

If you use similarities in your research, please cite it in the following format:

APA:

 Xu, M. Similarities: Compute similarity score for humans (Version 1.0.1) [Computer software]. https://github.com/shibing624/similarities

BibTeX:

 @misc{Xu_Similarities_Compute_similarity,
  title={Similarities: similarity calculation and semantic search toolkit},
  author={Xu Ming},
  year={2022},
  howpublished={\url{https://github.com/shibing624/similarities}},
}

License

The project is licensed under the Apache License 2.0 and is free for commercial use. Please include a link to similarities and the license in your product description.

Contribute

The project code is still rough. If you improve it, you are welcome to contribute your changes back to this project. Before submitting, please note the following two points:

Add corresponding unit tests in tests
Run python -m pytest to execute all unit tests and make sure they all pass

After that, you can submit your PR.

Acknowledgements

A Simple but Tough-to-Beat Baseline for Sentence Embeddings[Sanjeev Arora and Yingyu Liang and Tengyu Ma, 2017]
https://github.com/liuhuanyong/SentenceSimilarity
https://github.com/qwertyforce/image_search
ImageHash - Official Github repository
https://github.com/openai/CLIP
https://github.com/OFA-Sys/Chinese-CLIP
https://github.com/UKPLab/sentence-transformers
https://github.com/rom1504/clip-retrieval

Thanks for their great work!


Additional Information

Version 1.1.2
Type Other source code
Update Time 2025-03-13
size 8.53MB
From Github
