
Fast, lightweight and customizable fuzzy and semantic text search in Python.
Neofuzz is a fuzzy search library built on vectorization and approximate nearest neighbour search techniques.
You can now rerank your search results using Levenshtein distance! Sometimes the n-gram or vectorization process doesn't order the results quite right. In these cases, you can retrieve a larger number of examples from the indexed corpus, then refine those results with Levenshtein distance.
from neofuzz import char_ngram_process
process = char_ngram_process()
process.index(corpus)
process . extract ( "your query" , limit = 30 , refine_levenshtein = True )大多数模糊的搜索库都依赖于相同的模糊搜索算法(锤击距离,Levenshtein距离)中优化地狱。不幸的是,由于这些算法的复杂性,没有任何优化能够为您带来想要的速度。
Neofuzz makes the realization that you can't rely on traditional algorithms to get past a certain speed limit, and instead uses text vectorization and nearest neighbour search in vector space to speed the process up.
When it comes to the dilemma of speed versus accuracy, Neofuzz goes full-on speed.
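To illustrate the underlying idea, here is a simplified sketch. It is not Neofuzz's actual implementation: it uses scikit-learn's exact NearestNeighbors, where Neofuzz relies on approximate nearest neighbour indexing, and the corpus here is made up for demonstration.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

corpus = ["fuzzer", "fizzbuzz", "banana", "fozzie"]

# Character n-grams make the vectors robust to typos.
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(1, 5))
matrix = vectorizer.fit_transform(corpus)

# Exact nearest neighbour search with cosine distance;
# an approximate index does the same job, only faster.
nn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(matrix)
distances, indices = nn.kneighbors(vectorizer.transform(["fuzz"]))
print([corpus[i] for i in indices[0]])  # closest matches first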
You can install Neofuzz from PyPI:
pip install neofuzz

If you want a plug-and-play experience, you can create a good quick-and-dirty process with the char_ngram_process() function.
from neofuzz import char_ngram_process
# We create a process that takes character 1 to 5-grams as features for
# vectorization and uses a tf-idf weighting scheme.
# We will use cosine distance for the nearest neighbour search.
process = char_ngram_process(ngram_range=(1, 5), metric="cosine", tf_idf=True)
# We index the options that we are going to search in
process.index(options)
# Then we can extract the ten most similar items the same way as in
# thefuzz
process.extract("fuzz", limit=10)
---------------------------------
[('fuzzer', 67),
 ('Januzzi', 30),
 ('Figliuzzi', 25),
 ('Fun', 20),
 ('Erika_Petruzzi', 20),
 ('zu', 20),
 ('Zo', 18),
 ('blog_BuzzMachine', 18),
 ('LW_Todd_Bertuzzi', 18),
 ('OFU', 17)]

You can customize Neofuzz's behaviour by making a custom process. Under the hood, every Neofuzz process relies on the same two components: a vectorizer that turns texts into vectors, and a nearest neighbour search over those vectors.
If you're more interested in the words/semantic content of the texts, you can also use words as features. This can be very useful, especially with longer texts such as literary works.
from neofuzz import Process
from sklearn.feature_extraction.text import TfidfVectorizer
# Vectorization with words is the default in sklearn.
vectorizer = TfidfVectorizer()
# We use cosine distance because it's way better for high-dimensional spaces.
process = Process(vectorizer, metric="cosine")

You might find that the speed of your fuzzy search process is not sufficient. In this case, it might be desirable to reduce the dimensionality of the produced vectors with some matrix decomposition method or topic model.
Here, for example, I use NMF (an excellent topic model, and an incredibly fast one too) to speed up my fuzzy search pipeline.
from neofuzz import Process
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.pipeline import make_pipeline
# Vectorization with tokens again
vectorizer = TfidfVectorizer()
# Dimensionality reduction method to 20 dimensions
nmf = NMF(n_components=20)
# Create a pipeline of the two
pipeline = make_pipeline(vectorizer, nmf)
process = Process(pipeline, metric="cosine")
With Neofuzz you can easily use semantic embeddings to your advantage: attention-based language models (BERT), plain neural word or document embeddings (Word2Vec, Doc2Vec, FastText, etc.), or even OpenAI's LLMs.

We recommend you try embetter, which has a lot of built-in scikit-learn compatible vectorizers.
pip install embetter

from embetter.text import SentenceEncoder
from neofuzz import Process
# Here we will use a pretrained Bert sentence encoder as vectorizer
vectorizer = SentenceEncoder("all-distilroberta-v1")
# Then we make a process with the language model
process = Process(vectorizer, metric="cosine")
# Remember that the options STILL have to be indexed even though you have a pretrained vectorizer
process.index(options)
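Once indexed, querying works the same as with any other process:

process.extract("your query", limit=10)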