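similarities: a toolkit for similarity calculation and semantic search, supporting text and images.
similarities implements a range of similarity calculation and semantic matching/retrieval algorithms for text and images. It supports text-to-text, text-to-image, and image-to-image search over datasets on the order of hundreds of millions of items. It is written in Python 3, installs with pip, and works out of the box.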
Guide
Image Search Demo: https://huggingface.co/spaces/shibing624/CLIP-Image-Search

Text Search Demo: https://huggingface.co/spaces/shibing624/similarities

pip install torch # conda install pytorch
pip install -U similarities
or
git clone https://github.com/shibing624/similarities.git
cd similarities
pip install -e .
example: examples/text_similarity_demo.py
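from similarities import BertSimilarity
m = BertSimilarity(model_name_or_path="shibing624/text2vec-base-chinese")
r = m.similarity('如何更换花呗绑定银行卡', '花呗更改绑定银行卡')
print(f"similarity score: {float(r)}")  # similarity score: 0.855146050453186

model_name_or_path: the model name or path. By default, the Chinese semantic matching model shibing624/text2vec-base-chinese is downloaded from the HF model hub and used. For multilingual use, replace it with shibing624/text2vec-base-multilingual, which supports Chinese, English, Korean, Japanese, German, Italian, and other languages.

Semantic search finds the texts most similar to a query within a candidate document set. It is commonly used for question matching in QA scenarios, text search, and similar tasks.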
example: examples/text_semantic_search_demo.py
example: examples/fast_text_semantic_search_demo.py
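The retrieval step behind semantic search can be sketched without the library: embed the query and every candidate, then rank candidates by cosine similarity. The snippet below is a minimal illustration only. The helper names `cosine` and `top_matches` and the 3-dimensional vectors are made up for this example and are not part of the similarities API; a real model produces much higher-dimensional embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_u * norm_v)


def top_matches(query_vec, corpus_vecs, topn=2):
    """Return (corpus_index, score) pairs, highest similarity first."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(corpus_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Toy "embeddings" (hypothetical values, for illustration only).
corpus_vecs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]
query_vec = [1.0, 0.05, 0.0]

hits = top_matches(query_vec, corpus_vecs, topn=2)
print(hits)  # corpus indices 0 and 1 rank highest
```

A brute-force scan like this is fine for small candidate sets; for very large corpora an approximate index is the usual choice, which is presumably why the fast and faiss-based examples exist.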
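Convert text to vectors, build an index, run batch retrieval, and start the service: examples/faiss_bert_search_server_demo.py
Python client call: examples/faiss_bert_search_client_demo.py
Supports similarity calculation and literal matching search with Cilin (a synonym thesaurus), HowNet, word embeddings (WordEmbedding), Tfidf, SimHash, BM25, and other algorithms. These are commonly used for a cold start in text matching.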
example: examples/literal_text_semantic_search_demo.py
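To show what literal (term-based) matching looks like in practice, here is a minimal sketch of Okapi BM25 scoring in plain Python. It is not the library's implementation: the function name `bm25_scores`, the default `k1` and `b` values, and the whitespace tokenization are assumptions made for this example. Chinese text would need a word segmenter before scoring.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against the query with Okapi BM25.

    Assumes a non-empty corpus of non-empty token lists.
    """
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for doc in corpus_tokens:
        doc_freq.update(set(doc))

    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            df = doc_freq[term]
            # Smoothed IDF; the "+ 1.0" keeps it positive for common terms.
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
            tf = term_freq[term]
            # Length normalization: long documents are penalized via b.
            denom = tf + k1 * (1.0 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1.0) / denom
        scores.append(score)
    return scores


corpus = [
    "how to change the bank card bound to the account".split(),
    "open a new account online".split(),
    "the weather is nice today".split(),
]
query = "change bank card".split()

scores = bm25_scores(query, corpus)
best = max(range(len(corpus)), key=lambda i: scores[i])
print(best, scores)  # document 0 is the only one sharing terms with the query
```

Because BM25 needs no trained model, only term statistics from the corpus, it suits the cold-start situation mentioned above.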
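Supports image similarity calculation and matching search with CLIP, pHash, SIFT, and other algorithms. The Chinese CLIP model supports image-to-image and text-to-image search, as well as image-text cross search in both Chinese and English.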
example: examples/image_semantic_search_demo.py
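The idea behind hash-based image similarity can be sketched in a few lines. Note that this example uses average hash (aHash), a simpler relative of pHash that thresholds pixels against the image mean instead of using DCT coefficients; it is not the library's pHash code. The helper names and the 4x4 grayscale matrices are made up for illustration, and real use would first decode and downscale actual images.

```python
def average_hash(pixels):
    """Return a bit list: 1 where a pixel is brighter than the image mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return [1 if value > mean else 0 for value in flat]


def hamming(bits_a, bits_b):
    """Count positions at which two equal-length bit lists differ."""
    return sum(a != b for a, b in zip(bits_a, bits_b))


# Toy 4x4 grayscale "image": dark left half, bright right half.
img = [
    [10, 10, 200, 200],
    [10, 10, 200, 200],
    [10, 10, 200, 200],
    [10, 10, 200, 200],
]
# Same picture, uniformly brightened: the hash does not change.
img_brighter = [[value + 30 for value in row] for row in img]
# Mirrored picture: dark and bright halves swap, so every bit flips.
img_mirrored = [list(reversed(row)) for row in img]

h = average_hash(img)
print(hamming(h, average_hash(img_brighter)))  # 0
print(hamming(h, average_hash(img_mirrored)))  # 16
```

A small Hamming distance indicates near-duplicate images. Unlike CLIP embeddings, a hash of this kind captures pixel layout rather than semantic content.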

Convert images to vectors, build an index, run batch retrieval, and start the service: examples/faiss_clip_search_server_demo.py
Python client call: examples/faiss_clip_search_client_demo.py
Gradio front end: examples/faiss_clip_search_gradio_demo.py

The community detection algorithm (community_detection) can cluster large-scale datasets and find clusters, that is, groups of similar sentences.
example: examples/text_clustering_demo.py
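One common way to do threshold-based community detection on embeddings is a greedy pass: for each item, collect everything whose cosine similarity meets a threshold, then accept the largest groups first without letting an item join two groups. The sketch below illustrates that idea under stated assumptions. The function name `detect_communities`, its parameters, and the 2-dimensional vectors are hypothetical and may differ from what the library does internally.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def detect_communities(vectors, threshold=0.9, min_size=2):
    """Greedy clustering: group items whose similarity to a seed >= threshold."""
    n = len(vectors)
    # For each seed, collect every item (the seed included) above the threshold.
    candidates = []
    for i in range(n):
        members = [j for j in range(n) if cosine(vectors[i], vectors[j]) >= threshold]
        if len(members) >= min_size:
            candidates.append(members)
    # Consider the largest candidate groups first.
    candidates.sort(key=len, reverse=True)

    assigned = set()
    communities = []
    for members in candidates:
        # Keep only items not already claimed by an earlier community.
        free = [j for j in members if j not in assigned]
        if len(free) >= min_size:
            communities.append(free)
            assigned.update(free)
    return communities


# Toy sentence "embeddings" (hypothetical values): two tight pairs and an outlier.
vectors = [
    [1.0, 0.0],
    [0.98, 0.2],
    [0.0, 1.0],
    [0.1, 0.99],
    [-1.0, 0.0],
]
communities = detect_communities(vectors, threshold=0.9, min_size=2)
print(communities)  # [[0, 1], [2, 3]]
```

The outlier at index 4 has no neighbor above the threshold, so it ends up in no community. This version compares all pairs, which is quadratic in the number of items; large datasets would need batching or an index.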
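The paraphrase mining algorithm (paraphrase_mining_embeddings) can mine pairs of sentences with similar meaning from a large collection of sentences or documents. It can be used for redundant image/text detection and semantic deduplication.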
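Conceptually, paraphrase mining scores every pair of embeddings and keeps the highest-scoring pairs. The following brute-force sketch shows that idea only. The function name `mine_paraphrases`, its `top_k` parameter, and the toy vectors are assumptions made for this example, not the signature of paraphrase_mining_embeddings.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def mine_paraphrases(vectors, top_k=3):
    """Score every unordered pair and return the top_k as (score, i, j)."""
    pairs = [
        (cosine(vectors[i], vectors[j]), i, j)
        for i, j in combinations(range(len(vectors)), 2)
    ]
    pairs.sort(reverse=True)  # highest similarity first
    return pairs[:top_k]


# Toy sentence "embeddings" (hypothetical values, for illustration only).
vectors = [
    [1.0, 0.0],
    [0.98, 0.2],
    [0.0, 1.0],
    [0.1, 0.99],
]
mined = mine_paraphrases(vectors, top_k=2)
for score, i, j in mined:
    print(f"{i} <-> {j}: {score:.4f}")
```

For deduplication, one would typically keep only pairs above a chosen similarity threshold and drop one item from each pair. Scoring all pairs is quadratic, so large collections are usually processed in chunks or with an approximate index.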
code: cli.py
> similarities -h
NAME
similarities
SYNOPSIS
similarities COMMAND
COMMANDS
COMMAND is one of the following:
bert_embedding
Compute embeddings for a list of sentences
bert_index
Build indexes from text embeddings using autofaiss
bert_filter
Entry point of bert filter, batch search index
bert_server
Main entry point of bert search backend, start the server
clip_embedding
Embedding text and image with clip model
clip_index
Build indexes from embeddings using autofaiss
clip_filter
Entry point of clip filter, batch search index
clip_server
Main entry point of clip search backend, start the server
run:
pip install similarities -U
similarities clip_embedding -h
# example
cd examples
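similarities clip_embedding data/toy_clip/

bert_embedding and the other entries are subcommands: those starting with bert are text-related, and those starting with clip are image-related.
similarities clip_embedding -h shows the usage of a subcommand.
data/toy_clip/ is the input_dir argument of clip_embedding, the input file directory (required).
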
If you use similarities in your research, please cite it in the following format:
APA:
Xu, M. Similarities: Compute similarity score for humans (Version 1.0.1) [Computer software]. https://github.com/shibing624/similarities
BibTeX:
@misc{Xu_Similarities_Compute_similarity,
title={Similarities: similarity calculation and semantic search toolkit},
author={Xu Ming},
year={2022},
howpublished={\url{https://github.com/shibing624/similarities}},
}
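The project is licensed under The Apache License 2.0 and is free for commercial use. Please include a link to similarities and the license in your product description.
The project code is still rough. If you improve it, contributions back to this project are welcome. Before submitting, note the following two points:
Add corresponding unit tests under tests.
Run python -m pytest to execute all unit tests and make sure they all pass.
After that, you can submit a PR.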
Thanks for their great work!