
Fast, lightweight and customizable fuzzy and semantic text search in Python.
Neofuzz is a fuzzy search library built on vectorization and approximate nearest neighbour search techniques.
You can now rerank your search results with Levenshtein distance! Sometimes the n-gram or vectorized process doesn't order results quite right. In those cases you can retrieve a larger number of candidates from the indexed corpus and then refine them with Levenshtein distance.
from neofuzz import char_ngram_process

process = char_ngram_process()
process.index(corpus)
process.extract("your query", limit=30, refine_levenshtein=True)

Most fuzzy search libraries rely on optimizing the hell out of the same couple of fuzzy search algorithms (Hamming distance, Levenshtein distance). Unfortunately, due to the complexity of these algorithms, no amount of optimization will get you the speed you want.
Neofuzz is built on the realization that you can't go past a certain speed limit by relying on traditional algorithms, and instead uses text vectorization and nearest neighbour search in vector space to speed up the process.
When it comes to the dilemma of speed versus accuracy, Neofuzz goes full-on speed.
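To make the idea concrete, here is a minimal sketch of the approach (not Neofuzz's actual internals; the option list and all names here are purely illustrative): vectorize all the options once up front, then turn every query into a nearest neighbour lookup in vector space. sklearn's exact NearestNeighbors stands in for the approximate index.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

options = ["fuzzer", "Januzzi", "Figliuzzi", "Fun"]

# Turn every option into a character n-gram vector once, up front
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(1, 5))
matrix = vectorizer.fit_transform(options)

# Each search is now a nearest neighbour lookup in vector space
# instead of pairwise edit-distance computations over the whole corpus
index = NearestNeighbors(n_neighbors=2, metric="cosine").fit(matrix)
distances, indices = index.kneighbors(vectorizer.transform(["fuzz"]))
print([options[i] for i in indices[0]])  # closest options first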
You can install Neofuzz from PyPI:
pip install neofuzz

If you want a plug-and-play experience, you can create a generally good, quick-and-dirty process with char_ngram_process().
from neofuzz import char_ngram_process

# We create a process that takes character 1 to 5-grams as features for
# vectorization and uses a tf-idf weighting scheme.
# We will use cosine distance for the nearest neighbour search.
process = char_ngram_process(ngram_range=(1, 5), metric="cosine", tf_idf=True)

# We index the options that we are going to search in
process.index(options)

# Then we can extract the ten most similar items the same way as in
# thefuzz
process.extract("fuzz", limit=10)
---------------------------------
[('fuzzer', 67),
 ('Januzzi', 30),
 ('Figliuzzi', 25),
 ('Fun', 20),
 ('Erika_Petruzzi', 20),
 ('zu', 20),
 ('Zo', 18),
 ('blog_BuzzMachine', 18),
 ('LW_Todd_Bertuzzi', 18),
 ('OFU', 17)]

You can customize Neofuzz's behaviour by creating a custom process. Under the hood, every Neofuzz process relies on the same two components: a vectorizer, which turns your texts into vectors, and nearest neighbour search over those vectors.
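For instance, a custom process roughly equivalent to the char_ngram_process() shown above could be put together like this (a sketch, assuming the same character 1 to 5-gram tf-idf setup):

from neofuzz import Process
from sklearn.feature_extraction.text import TfidfVectorizer

# Character 1 to 5-grams with tf-idf weighting, like char_ngram_process()
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(1, 5))
process = Process(vectorizer, metric="cosine")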
If you're more interested in the words/semantic content of your texts, you can also use those as features. This can be very useful, especially with longer texts such as literary works.
from neofuzz import Process
from sklearn.feature_extraction.text import TfidfVectorizer

# Vectorization with words is the default in sklearn.
vectorizer = TfidfVectorizer()
# We use cosine distance because it's way better for high-dimensional spaces.
process = Process(vectorizer, metric="cosine")

You might find that the speed of your fuzzy search process is still not sufficient. In that case it can be desirable to reduce the dimensionality of the produced vectors with a matrix decomposition method or a topic model.
Here, for example, I use NMF (an excellent topic model, and an incredibly fast one too) to speed up my fuzzy search pipeline.
from neofuzz import Process
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.pipeline import make_pipeline

# Vectorization with tokens again
vectorizer = TfidfVectorizer()
# Dimensionality reduction method to 20 dimensions
nmf = NMF(n_components=20)
# Create a pipeline of the two
pipeline = make_pipeline(vectorizer, nmf)
process = Process(pipeline, metric="cosine")

With Neofuzz you can easily use semantic embeddings to your advantage, whether from attention-based language models (BERT), plain neural word or document embeddings (Word2Vec, Doc2Vec, fastText, etc.), or even OpenAI's LLMs.
We recommend you try embetter, which has a lot of built-in sklearn-compatible vectorizers.
pip install embetter

from embetter.text import SentenceEncoder
from neofuzz import Process
# Here we will use a pretrained Bert sentence encoder as vectorizer
vectorizer = SentenceEncoder("all-distilroberta-v1")
# Then we make a process with the language model
process = Process(vectorizer, metric="cosine")
# Remember that the options STILL have to be indexed even though you have a pretrained vectorizer
process.index(options)
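Once the options are indexed, querying works the same way as with any other process. For illustration (the query string is just a placeholder):

# Extract the ten closest options, exactly as with the n-gram process above
process.extract("your query", limit=10)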