
Fast, lightweight and customizable fuzzy and semantic text search in Python.
Neofuzz is a fuzzy search library built on vectorization and approximate nearest neighbour search techniques.
You can now rerank your search results with Levenshtein distance! Sometimes the n-gram or vectorized process doesn't order results quite right. In those cases you can retrieve a larger number of candidates from the indexed corpus and then refine them with Levenshtein distance.
from neofuzz import char_ngram_process

process = char_ngram_process()
process.index(corpus)
process.extract("your query", limit=30, refine_levenshtein=True)

Most fuzzy search libraries rely on optimizing the hell out of the same couple of fuzzy search algorithms (Hamming distance, Levenshtein distance). Unfortunately, due to the complexity of these algorithms, no amount of optimization will get you the speed you want.
Neofuzz is built on the realization that you can't go past a certain speed limit by relying on traditional algorithms, and instead uses text vectorization and nearest neighbour search in vector space to speed up the process.
When it comes to the dilemma of speed versus accuracy, Neofuzz goes full-on speed.
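To make the idea concrete, here is a minimal sketch of the approach (not Neofuzz's actual internals; the option list and all names here are purely illustrative): vectorize all the options once up front, then turn every query into a nearest neighbour lookup in vector space. sklearn's exact NearestNeighbors stands in for the approximate index.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

options = ["fuzzer", "Januzzi", "Figliuzzi", "Fun"]

# Turn every option into a character n-gram vector once, up front
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(1, 5))
matrix = vectorizer.fit_transform(options)

# Each search is now a nearest neighbour lookup in vector space
# instead of pairwise edit-distance computations over the whole corpus
index = NearestNeighbors(n_neighbors=2, metric="cosine").fit(matrix)
distances, indices = index.kneighbors(vectorizer.transform(["fuzz"]))
print([options[i] for i in indices[0]])  # closest options first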
You can install Neofuzz from PyPI:
pip install neofuzz

If you want a plug-and-play experience, you can create a generally good, quick-and-dirty process with char_ngram_process().
from neofuzz import char_ngram_process

# We create a process that takes character 1 to 5-grams as features for
# vectorization and uses a tf-idf weighting scheme.
# We will use cosine distance for the nearest neighbour search.
process = char_ngram_process(ngram_range=(1, 5), metric="cosine", tf_idf=True)

# We index the options that we are going to search in
process.index(options)

# Then we can extract the ten most similar items the same way as in
# thefuzz
process.extract("fuzz", limit=10)
---------------------------------
[('fuzzer', 67),
 ('Januzzi', 30),
 ('Figliuzzi', 25),
 ('Fun', 20),
 ('Erika_Petruzzi', 20),
 ('zu', 20),
 ('Zo', 18),
 ('blog_BuzzMachine', 18),
 ('LW_Todd_Bertuzzi', 18),
 ('OFU', 17)]

You can customize Neofuzz's behaviour by creating a custom process. Under the hood, every Neofuzz process relies on the same two components: a vectorizer, which turns your texts into vectors, and nearest neighbour search over those vectors.
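For instance, a custom process roughly equivalent to the char_ngram_process() shown above could be put together like this (a sketch, assuming the same character 1 to 5-gram tf-idf setup):

from neofuzz import Process
from sklearn.feature_extraction.text import TfidfVectorizer

# Character 1 to 5-grams with tf-idf weighting, like char_ngram_process()
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(1, 5))
process = Process(vectorizer, metric="cosine")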
If you're more interested in the words/semantic content of your texts, you can also use those as features. This can be very useful, especially with longer texts such as literary works.
from neofuzz import Process
from sklearn.feature_extraction.text import TfidfVectorizer

# Vectorization with words is the default in sklearn.
vectorizer = TfidfVectorizer()
# We use cosine distance because it's way better for high-dimensional spaces.
process = Process(vectorizer, metric="cosine")

You might find that the speed of your fuzzy search process is still not sufficient. In that case it can be desirable to reduce the dimensionality of the produced vectors with a matrix decomposition method or a topic model.
Here, for example, I use NMF (an excellent topic model, and an incredibly fast one too) to speed up my fuzzy search pipeline.
from neofuzz import Process
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.pipeline import make_pipeline

# Vectorization with tokens again
vectorizer = TfidfVectorizer()
# Dimensionality reduction method to 20 dimensions
nmf = NMF(n_components=20)
# Create a pipeline of the two
pipeline = make_pipeline(vectorizer, nmf)
process = Process(pipeline, metric="cosine")

With Neofuzz you can easily use semantic embeddings to your advantage, whether from attention-based language models (BERT), plain neural word or document embeddings (Word2Vec, Doc2Vec, fastText, etc.), or even OpenAI's LLMs.
We recommend you try embetter, which has a lot of built-in sklearn-compatible vectorizers.
pip install embetter

from embetter.text import SentenceEncoder
from neofuzz import Process
# Here we will use a pretrained Bert sentence encoder as vectorizer
vectorizer = SentenceEncoder("all-distilroberta-v1")
# Then we make a process with the language model
process = Process(vectorizer, metric="cosine")
# Remember that the options STILL have to be indexed even though you have a pretrained vectorizer
process.index(options)
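Once the options are indexed, querying works the same way as with any other process. For illustration (the query string is just a placeholder):

# Extract the ten closest options, exactly as with the n-gram process above
process.extract("your query", limit=10)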