
Fast, lightweight and customizable fuzzy and semantic text search in Python.
Neofuzz is a fuzzy search library built on vectorization and approximate nearest neighbour search techniques.
You can now rerank your search results using Levenshtein distance! Sometimes the n-gram or vectorization process doesn't order the results quite right. In these cases, you can retrieve a larger number of examples from the indexed corpus, then refine those results with Levenshtein distance.
from neofuzz import char_ngram_process
process = char_ngram_process()
process.index(corpus)
process . extract ( "your query" , limit = 30 , refine_levenshtein = True )大多数模糊的搜索库都依赖于相同的模糊搜索算法(锤击距离,Levenshtein距离)中优化地狱。不幸的是,由于这些算法的复杂性,没有任何优化能够为您带来想要的速度。
Neofuzz makes the realization that you can't rely on traditional algorithms to get past a certain speed limit, and instead uses text vectorization and nearest neighbour search in vector space to speed the process up.
When it comes to the dilemma of speed versus accuracy, Neofuzz goes full-on speed.
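To illustrate the underlying idea, here is a simplified sketch. It is not Neofuzz's actual implementation: it uses scikit-learn's exact NearestNeighbors, where Neofuzz relies on approximate nearest neighbour indexing, and the corpus here is made up for demonstration.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

corpus = ["fuzzer", "fizzbuzz", "banana", "fozzie"]

# Character n-grams make the vectors robust to typos.
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(1, 5))
matrix = vectorizer.fit_transform(corpus)

# Exact nearest neighbour search with cosine distance;
# an approximate index does the same job, only faster.
nn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(matrix)
distances, indices = nn.kneighbors(vectorizer.transform(["fuzz"]))
print([corpus[i] for i in indices[0]])  # closest matches first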
You can install Neofuzz from PyPI:
pip install neofuzz

If you want a plug-and-play experience, you can create a good quick-and-dirty process with the char_ngram_process() function.
from neofuzz import char_ngram_process
# We create a process that takes character 1 to 5-grams as features for
# vectorization and uses a tf-idf weighting scheme.
# We will use cosine distance for the nearest neighbour search.
process = char_ngram_process(ngram_range=(1, 5), metric="cosine", tf_idf=True)
# We index the options that we are going to search in
process.index(options)
# Then we can extract the ten most similar items the same way as in
# thefuzz
process.extract("fuzz", limit=10)
---------------------------------
[('fuzzer', 67),
 ('Januzzi', 30),
 ('Figliuzzi', 25),
 ('Fun', 20),
 ('Erika_Petruzzi', 20),
 ('zu', 20),
 ('Zo', 18),
 ('blog_BuzzMachine', 18),
 ('LW_Todd_Bertuzzi', 18),
 ('OFU', 17)]

You can customize Neofuzz's behaviour by making a custom process. Under the hood, every Neofuzz process relies on the same two components: a vectorizer that turns texts into vectors, and a nearest neighbour search over those vectors.
If you're more interested in the words/semantic content of the texts, you can also use words as features. This can be very useful, especially with longer texts such as literary works.
from neofuzz import Process
from sklearn.feature_extraction.text import TfidfVectorizer
# Vectorization with words is the default in sklearn.
vectorizer = TfidfVectorizer()
# We use cosine distance because it's way better for high-dimensional spaces.
process = Process(vectorizer, metric="cosine")

You might find that the speed of your fuzzy search process is not sufficient. In this case, it might be desirable to reduce the dimensionality of the produced vectors with some matrix decomposition method or topic model.
Here, for example, I use NMF (an excellent topic model, and an incredibly fast one too) to speed up my fuzzy search pipeline.
from neofuzz import Process
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.pipeline import make_pipeline
# Vectorization with tokens again
vectorizer = TfidfVectorizer()
# Dimensionality reduction method to 20 dimensions
nmf = NMF(n_components=20)
# Create a pipeline of the two
pipeline = make_pipeline(vectorizer, nmf)
process = Process(pipeline, metric="cosine")
With Neofuzz you can easily use semantic embeddings to your advantage: attention-based language models (BERT), plain neural word or document embeddings (Word2Vec, Doc2Vec, FastText, etc.), or even OpenAI's LLMs.

We recommend you try embetter, which has a lot of built-in scikit-learn compatible vectorizers.
pip install embetter

from embetter.text import SentenceEncoder
from neofuzz import Process
# Here we will use a pretrained Bert sentence encoder as vectorizer
vectorizer = SentenceEncoder("all-distilroberta-v1")
# Then we make a process with the language model
process = Process(vectorizer, metric="cosine")
# Remember that the options STILL have to be indexed even though you have a pretrained vectorizer
process.index(options)
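Once indexed, querying works the same as with any other process:

process.extract("your query", limit=10)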