Similarities: Similarity Calculation and Semantic Search

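similarities: a toolkit for similarity calculation and semantic search, supporting both text and images.

similarities implements a variety of similarity calculation and semantic matching and retrieval algorithms for text and images. It supports text-to-text, text-to-image, and image-to-image search over datasets on the order of hundreds of millions of items. It is written in Python 3, installs via pip, and works out of the box.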

Guide

Features
Install
Usage
Contact
Acknowledgements

Features

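Text similarity calculation + text search

Semantic matching model (recommended): built on text2vec, this project implements text similarity calculation and text search with the CoSENT model
- Supports a range of SentenceBERT-style pretrained models for Chinese, English, and multilingual text
- Supports several similarity measures, including Cos Similarity, Dot Product, Hamming Distance, and Euclidean Distance
- Supports several text search algorithms, including SemanticSearch, Faiss, Annoy, and Hnsw
- Supports efficient retrieval over datasets on the order of hundreds of millions of items
- Supports command-line text-to-vector conversion (multi-GPU), index building, batch retrieval, and starting a service
Literal matching models: this project implements a range of literal matching models, including Word2Vec, BM25, RankBM25, TFIDF, SimHash, the Cilin synonym thesaurus, and HowNet sememe matching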
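
The four similarity measures named above can be illustrated on plain Python lists. This is a minimal sketch for intuition only, not the library's implementation; the function names and toy vectors are assumptions made for this example.

```python
import math


def dot_product(a, b):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector norms."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot_product(a, b) / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hamming_distance(a, b):
    """Number of positions where two equal-length binary codes differ."""
    if len(a) != len(b):
        raise ValueError("codes must have the same length")
    return sum(x != y for x, y in zip(a, b))


# Toy vectors: v points in the same direction as u but is twice as long.
u = [1.0, 2.0, 2.0]
v = [2.0, 4.0, 4.0]

print("cosine:", cosine_similarity(u, v))
print("dot:", dot_product(u, v))
print("euclidean:", euclidean_distance(u, v))
print("hamming:", hamming_distance([1, 0, 1, 1], [1, 1, 0, 1]))
```

Cosine similarity ignores vector length, so `u` and `v` count as identical in direction even though their Euclidean distance is non-zero. Hamming distance applies to binary codes such as SimHash fingerprints rather than to dense embeddings.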

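Image similarity calculation / image-text similarity calculation + image-to-image search / text-to-image search

CLIP (Contrastive Language-Image Pre-Training) model: an image-text matching model that can be used for image and text features (embeddings), similarity calculation, image-text retrieval, and zero-shot image classification. Built on PyTorch, this project implements vector representation with CLIP, index building (based on AutoFaiss), batch retrieval, a backend service (based on FastAPI), and a front-end UI (based on Gradio)
- Supports CLIP-family models such as openai/clip-vit-base-patch32
- Supports Chinese-CLIP-family models such as OFA-Sys/chinese-clip-vit-huge-patch14
- Supports separate front-end and back-end deployment, with a FastAPI backend service and a Gradio front end
- Supports efficient retrieval over datasets on the order of hundreds of millions of items, based on Faiss, with GPU acceleration
- Supports image-to-image, text-to-image, and vector-to-image search
- Supports image embedding extraction and text embedding extraction
- Supports image similarity calculation and image-text similarity calculation
- Supports command-line image-to-vector conversion (multi-GPU), index building, batch retrieval, and starting a service
Image feature extraction: based on cv2, this project implements several image feature extraction algorithms, including pHash, dHash, wHash, aHash, and SIFT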
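
To give a feel for the hash-based image features listed above, the following sketch computes an average hash (aHash) and a difference hash (dHash) on a tiny grayscale grid. It is an illustration under simplifying assumptions, not the project's cv2-based code: real implementations first resize the image to a small fixed size, whereas here the input is assumed to be already downscaled, and the function names are made up for this example.

```python
def average_hash(pixels):
    """aHash: one bit per pixel, set when the pixel is brighter than the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return [1 if p > mean else 0 for p in flat]


def difference_hash(pixels):
    """dHash: one bit per horizontal neighbor pair, set when the right pixel
    is brighter than the left one (one common convention)."""
    return [
        1 if row[i + 1] > row[i] else 0
        for row in pixels
        for i in range(len(row) - 1)
    ]


def hamming(a, b):
    """Number of differing bits between two hashes of equal length."""
    return sum(x != y for x, y in zip(a, b))


# A 3x3 grayscale "image" and a uniformly brightened copy of it.
img = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
brighter = [[p + 5 for p in row] for row in img]

print(average_hash(img))
print(difference_hash(img))
print(hamming(average_hash(img), average_hash(brighter)))
```

Both hashes depend only on relative brightness, so a uniform brightness shift leaves them unchanged. Near-duplicate images are then found by a small Hamming distance between hashes.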

Demo

Image Search Demo: https://huggingface.co/spaces/shibing624/CLIP-Image-Search

Text Search Demo: https://huggingface.co/spaces/shibing624/similarities

Install

pip install torch # conda install pytorch
pip install -U similarities

or

git clone https://github.com/shibing624/similarities.git
cd similarities
pip install -e .

Usage

1. Text embedding similarity calculation

example: examples/text_similarity_demo.py

from similarities import BertSimilarity

m = BertSimilarity(model_name_or_path="shibing624/text2vec-base-chinese")
# Two Chinese paraphrases asking how to change the bank card linked to Huabei
r = m.similarity('如何更换花呗绑定银行卡', '花呗更改绑定银行卡')
print(f"similarity score: {float(r)}")  # similarity score: 0.855146050453186

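model_name_or_path: the model name or path. By default, the Chinese semantic matching model shibing624/text2vec-base-chinese is downloaded from the HF model hub and used. For multilingual text, switch to shibing624/text2vec-base-multilingual, which supports Chinese, English, Korean, Japanese, German, Italian, and other languages.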

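2. Text embedding search

Finds the texts in a candidate document set that are most similar to a query. Commonly used for question matching in QA scenarios, text search, and similar tasks.

SemanticSearch exact search: Cos Similarity + top-K clustering retrieval, suited to datasets of up to one million items

example: examples/text_semantic_search_demo.py

Approximate search algorithms such as Annoy and Hnswlib, suited to datasets on the order of millions of items

example: examples/fast_text_semantic_search_demo.py

Faiss efficient vector retrieval, suited to datasets on the order of hundreds of millions of items

Text-to-vector conversion, index building, batch retrieval, and starting the service: examples/faiss_bert_search_server_demo.py
Front-end Python client: examples/faiss_bert_search_client_demo.py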
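
The idea behind exact search (score every candidate by cosine similarity, then keep the top K) can be sketched in a few lines of standard-library Python. This is not the library's SemanticSearch code; the toy 2-D vectors stand in for real sentence embeddings, and `top_k` is a hypothetical helper named for this example.

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k(query, corpus, k=2):
    """Return the k best (score, corpus_index) pairs by cosine similarity."""
    q = normalize(query)
    scored = []
    for idx, vec in enumerate(corpus):
        d = normalize(vec)
        scored.append((sum(a * b for a, b in zip(q, d)), idx))
    # Exact search: every corpus vector is scored, no approximation.
    return heapq.nlargest(k, scored)


# Toy "embeddings"; a real system would get these from a sentence encoder.
corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hits = top_k([1.0, 0.1], corpus, k=2)

for score, idx in hits:
    print(idx, round(score, 3))
```

Because each query is compared against the whole corpus, cost grows linearly with corpus size, which is why larger collections usually move to approximate or index-based search.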

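3. Literal text similarity calculation and text search

Supports similarity calculation and literal matching search with algorithms such as the Cilin synonym thesaurus, HowNet, word embeddings (WordEmbedding), Tfidf, SimHash, and BM25. Commonly used for the cold start of text matching.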

example: examples/literal_text_semantic_search_demo.py
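
As an illustration of literal matching, here is a minimal sketch of Okapi BM25 scoring. It assumes whitespace tokenization on English toy documents (Chinese text would need a word segmenter first) and uses the non-negative IDF variant `log((N - n + 0.5) / (n + 0.5) + 1)`. The function name and the defaults `k1=1.5`, `b=0.75` are choices made for this example, not values taken from the project.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in docs against query with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            n = doc_freq[term]
            idf = math.log((n_docs - n + 0.5) / (n + 0.5) + 1.0)
            # Length normalization: long documents are penalized via b.
            denom = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / denom
        scores.append(score)
    return scores


docs = [
    "change bank card",
    "the weather is nice today",
    "how do I change the card linked to my account",
]
scores = bm25_scores("change bank card", docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

Literal models like this need no training data, which fits the cold-start use mentioned above, but they cannot match paraphrases that share no words.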

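4. Image similarity calculation and image search

Supports image similarity calculation and matching search with algorithms such as CLIP, pHash, and SIFT. The Chinese CLIP model supports image-to-image and text-to-image search, as well as image-text retrieval in both directions for Chinese and English.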

example: examples/image_semantic_search_demo.py

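Faiss efficient vector retrieval, suited to datasets on the order of hundreds of millions of items

Image-to-vector conversion, index building, batch retrieval, and starting the service: examples/faiss_clip_search_server_demo.py
Front-end Python client: examples/faiss_clip_search_client_demo.py
Front-end Gradio client: examples/faiss_clip_search_gradio_demo.py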

5. Clustering

The community detection (community_detection) algorithm can cluster large datasets and find clusters, that is, groups of similar sentences.

example: examples/text_clustering_demo.py
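
One common way to do threshold-based community detection on embeddings is a greedy pass: collect each item's neighbors above a similarity threshold, sort the candidate groups by size, and accept groups whose members are not yet assigned. The sketch below is a simplified brute-force version of that idea on toy vectors. It is not the library's code, and the function name, parameter names, and defaults are assumptions for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def detect_communities(vectors, threshold=0.9, min_size=2):
    """Greedy clustering: returns lists of indices, largest candidates first."""
    n = len(vectors)
    candidates = []
    for i in range(n):
        # All items (including i itself) at least `threshold` similar to i.
        members = [j for j in range(n) if cosine(vectors[i], vectors[j]) >= threshold]
        if len(members) >= min_size:
            candidates.append(members)

    candidates.sort(key=len, reverse=True)

    assigned = set()
    clusters = []
    for members in candidates:
        fresh = [j for j in members if j not in assigned]
        if len(fresh) >= min_size:
            clusters.append(fresh)
            assigned.update(fresh)
    return clusters


# Two tight pairs and one outlier.
vectors = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99], [-1.0, -1.0]]
clusters = detect_communities(vectors)
print(clusters)
```

Items that fall below the threshold against everything else, such as the last vector here, are left unclustered. The all-pairs comparison is quadratic, so production implementations batch the similarity computation rather than looping like this.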

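6. Semantic deduplication of images and text

The paraphrase mining (paraphrase_mining_embeddings) algorithm can mine pairs of sentences with similar meaning from a large collection of sentences or documents. It can be used for redundant image and text detection and for semantic deduplication.

Text semantic deduplication: examples/text_duplicates_demo.py
Image semantic deduplication: examples/image_duplicates_demo.py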
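
The deduplication idea can be sketched as two steps: mine all pairs whose embedding similarity exceeds a threshold, then drop the later item of each pair. The brute-force version below runs on toy vectors and is only an illustration. `mine_pairs`, `deduplicate`, and the 0.9 threshold are hypothetical choices for this example, not the project's API.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def mine_pairs(vectors, threshold=0.9):
    """Return (score, i, j) for every pair i < j at or above the threshold,
    most similar pairs first."""
    pairs = []
    for i, j in combinations(range(len(vectors)), 2):
        score = cosine(vectors[i], vectors[j])
        if score >= threshold:
            pairs.append((score, i, j))
    pairs.sort(reverse=True)
    return pairs


def deduplicate(items, vectors, threshold=0.9):
    """Keep the first item of each near-duplicate pair and drop the second."""
    dropped = set()
    for _, i, j in mine_pairs(vectors, threshold):
        if i not in dropped:
            dropped.add(j)
    return [item for idx, item in enumerate(items) if idx not in dropped]


items = ["how to change the card", "how do I change the card", "nice weather"]
vectors = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]  # toy stand-ins for embeddings

print(mine_pairs(vectors))
print(deduplicate(items, vectors))
```

The same logic applies to images once each image has an embedding. Comparing all pairs is quadratic in the number of items, so large collections need chunked or index-based candidate generation instead.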

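Command-line mode (CLI)

Supports batch extraction of text and image vectors (embedding)
Supports index building (index)
Supports batch retrieval (filter)
Supports starting a service (server)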

code: cli.py

 > similarities -h                                    

NAME
    similarities

SYNOPSIS
    similarities COMMAND

COMMANDS
    COMMAND is one of the following:

     bert_embedding
       Compute embeddings for a list of sentences

     bert_index
       Build indexes from text embeddings using autofaiss

     bert_filter
       Entry point of bert filter, batch search index

     bert_server
       Main entry point of bert search backend, start the server

     clip_embedding
       Embedding text and image with clip model

     clip_index
       Build indexes from embeddings using autofaiss

     clip_filter
       Entry point of clip filter, batch search index

     clip_server
       Main entry point of clip search backend, start the server

Run:

pip install similarities -U
similarities clip_embedding -h

# example
cd examples
similarities clip_embedding data/toy_clip/

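bert_embedding and the others are subcommands: those starting with bert are text-related, and those starting with clip are image-related
For the usage of each subcommand, see its help output, for example similarities clip_embedding -h
In the example above, data/toy_clip/ is the input_dir argument of clip_embedding, the input file directory (required)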

Contact

Issues (suggestions): open an issue on GitHub
Email: xuming: [email protected]
WeChat: add my WeChat ID xuming624 with the note "name-company-NLP" to join the NLP discussion group.

Citation

If you use similarities in your research, please cite it in the following format:

APA:

 Xu, M. Similarities: Compute similarity score for humans (Version 1.0.1) [Computer software]. https://github.com/shibing624/similarities

BibTeX:

 @misc{Xu_Similarities_Compute_similarity,
  title={Similarities: similarity calculation and semantic search toolkit},
  author={Xu Ming},
  year={2022},
  howpublished={\url{https://github.com/shibing624/similarities}},
}

License

The project is licensed under The Apache License 2.0 and is free for commercial use. Please include a link to similarities and the license in your product description.

Contribute

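The project code is still rough. If you have improved it, you are welcome to contribute your changes back to this project. Before submitting, please note the following two points:

Add corresponding unit tests in tests
Run all unit tests with python -m pytest and make sure they all pass

After that, you can submit a PR.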

Acknowledgements

A Simple but Tough-to-Beat Baseline for Sentence Embeddings[Sanjeev Arora and Yingyu Liang and Tengyu Ma, 2017]
https://github.com/liuhuanyong/SentenceSimilarity
https://github.com/qwertyforce/image_search
ImageHash - Official Github repository
https://github.com/openai/CLIP
https://github.com/OFA-Sys/Chinese-CLIP
https://github.com/UKPLab/sentence-transformers
https://github.com/rom1504/clip-retrieval

Thanks for their great work!

Additional information

Version: 1.1.2
Type: Other source code
Updated: 2025-03-13
Size: 8.53MB
Source: Github
