SenTrEv

sentrev

Simple evaluation of RAG on PDFs

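SenTrEv (Sentence Transformers Evaluator) is a Python package that runs simple evaluation tests to help you choose the best embedding model for Retrieval Augmented Generation (RAG) over your PDF documents.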

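Applicability

SenTrEv works with:

Text encoders/embedders loaded through the SentenceTransformer class of the Python package sentence_transformers
PDF documents (single and multiple uploads are supported)
Qdrant vector databases (both local and on cloud)

Installation

You can install the package using pip (easier, but no customization):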

python3 -m pip install sentrev

Alternatively, you can build it from source (harder, but customizable):

# clone the repo
git clone https://github.com/AstraBert/SenTrEv.git
# access the repo
cd SenTrEv
# build the package
python3 -m build
# install the package locally with editability settings
python3 -m pip install -e .

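Evaluation process

SenTrEv applies a very simple evaluation workflow:

After the PDF text extraction and chunking phase, each chunk is reduced according to an (optionally) user-defined percentage (default is 25%); the reduced text is extracted at a random position within the chunk.
The reduced chunks are mapped to their original chunks in a dictionary.
Each model encodes the original chunks and uploads the vectors to the Qdrant vector store.
The reduced chunks are then used as queries for dense retrieval.
Starting from the retrieval results, accuracy, time and carbon emission statistics are calculated and plotted.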
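
The first two steps of the workflow can be sketched as follows. This is a minimal illustration, not SenTrEv's own code: the function names reduce_chunk and build_query_map are hypothetical, and the sketch assumes that "reducing" a chunk means taking one contiguous slice of characters starting at a random offset.

```python
import random
from typing import Dict, List, Optional


def reduce_chunk(chunk: str, text_percentage: float = 25.0,
                 rng: Optional[random.Random] = None) -> str:
    """Return a random contiguous slice covering text_percentage % of the chunk."""
    rng = rng or random.Random()
    # Keep at least one character, but never more than the chunk holds.
    length = min(len(chunk), max(1, round(len(chunk) * text_percentage / 100)))
    start = rng.randint(0, len(chunk) - length)
    return chunk[start:start + length]


def build_query_map(chunks: List[str], text_percentage: float = 25.0,
                    seed: Optional[int] = None) -> Dict[str, str]:
    """Map each reduced chunk (used as the query) to the chunk it came from."""
    rng = random.Random(seed)
    return {reduce_chunk(c, text_percentage, rng): c for c in chunks}
```

Each key of the returned dictionary is later used as a query, and its value is the context that retrieval is expected to return. Note that two chunks that happen to produce the same reduced text would collide in this simplified mapping.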

For a visualization of the workflow, see the figure below.

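The metrics used to evaluate performance are:

Success rate: defined as the number of retrieval operations in which the correct context was ranked first among all the retrieved contexts, out of the total number of retrieval operations:
$SR = \frac{N_{correct}}{N_{tot}}$ (eq. 1)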
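Mean Reciprocal Ranking (MRR): MRR measures how high the correct context is ranked among the retrieved results. MRR@10 was used, meaning that 10 items were returned for each retrieval operation and the ranking of the correct context was evaluated, then normalized between 0 and 1 (already implemented in SenTrEv). An MRR of 1 means that the correct context was ranked first, whereas an MRR of 0 means that it was not retrieved. With $ranking$ the 1-based position of the correct context and $N_{retrieved}$ the number of returned items, MRR is calculated with the following general equation:
$MRR = \frac{N_{retrieved} - ranking + 1}{N_{retrieved}}$ (eq. 2)
When the correct context is not retrieved, MRR is automatically set to 0. MRR is calculated for each retrieval operation, then the mean and standard deviation are calculated and reported.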
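Time performance: for each retrieval operation, the time performance is measured in seconds; the mean and standard deviation are then reported.
Carbon emissions: carbon emissions are calculated in gCO2eq (grams of CO2 equivalent) through the Python library codecarbon, and were evaluated for the Austrian region. They are reported for the global computational load of all the retrieval operations.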
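
The accuracy metrics can be illustrated with a short sketch. It is not SenTrEv's implementation: the names normalized_rank_score and summarize are hypothetical, ranks are assumed to be 1-based (None when the correct context was not retrieved), and the score is assumed to fall linearly from 1 for the first position to 1/N for the last of N retrieved items.

```python
from statistics import mean, stdev
from typing import Dict, Optional, Sequence


def normalized_rank_score(rank: Optional[int], n_retrieved: int) -> float:
    """Score in [0, 1]: 1.0 for rank 1, 1/n for the last rank, 0.0 if not retrieved."""
    if rank is None or not 1 <= rank <= n_retrieved:
        return 0.0
    return (n_retrieved - rank + 1) / n_retrieved


def summarize(ranks: Sequence[Optional[int]], n_retrieved: int = 10) -> Dict[str, float]:
    """Aggregate per-query ranks into success rate and score mean / standard deviation."""
    scores = [normalized_rank_score(r, n_retrieved) for r in ranks]
    return {
        # Success rate: share of queries whose correct context was ranked first.
        "success_rate": sum(1 for r in ranks if r == 1) / len(ranks),
        "mrr_mean": mean(scores),
        "mrr_stdev": stdev(scores) if len(scores) > 1 else 0.0,
    }
```

For example, summarize([1, 1, 3, None]) treats four retrieval operations: two with the correct context ranked first, one with it ranked third, and one where it was not retrieved at all.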

Use cases

1. Local Qdrant

You can easily run Qdrant locally with Docker:

docker pull qdrant/qdrant:latest
docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant:latest

Your vector database is now listening at http://localhost:6333

Let's say we have three PDFs (~/pdfs/instructions.pdf, ~/pdfs/history.pdf, ~/pdfs/info.pdf) and we want to test retrieval with three different encoders: sentence-transformers/all-MiniLM-L6-v2, sentence-transformers/sentence-t5-base and sentence-transformers/all-mpnet-base-v1.

We can do it with this very simple code:

import os

from sentrev.evaluator import evaluate_rag
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient

# load all the embedding models
encoder1 = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
encoder2 = SentenceTransformer('sentence-transformers/sentence-t5-base')
encoder3 = SentenceTransformer('sentence-transformers/all-mpnet-base-v1')

# create a list of the embedders and a dictionary that maps each one to its name for the stats report output by SenTrEv
encoders = [encoder1, encoder2, encoder3]
encoder_to_names = {encoder1: 'all-MiniLM-L6-v2', encoder2: 'sentence-t5-base', encoder3: 'all-mpnet-base-v1'}

# set up a Qdrant client
client = QdrantClient("http://localhost:6333")

# create a list of your PDF paths (expanduser resolves '~', which Python does not expand on its own)
pdfs = [os.path.expanduser(p) for p in ['~/pdfs/instructions.pdf', '~/pdfs/history.pdf', '~/pdfs/info.pdf']]

# choose a path for the CSV where the evaluation stats will be saved
csv_path = os.path.expanduser('~/eval/stats.csv')

# evaluate retrieval
evaluate_rag(pdfs=pdfs, encoders=encoders, encoder_to_name=encoder_to_names, client=client, csv_path=csv_path, distance='euclid', chunking_size=400, mrr=10, carbon_tracking="USA", plot=True)

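You can adjust the chunking of your PDFs with the chunking_size argument, the percentage of text used to test retrieval with text_percentage, the distance metric used for retrieval with distance, and the number of retrieved items (10 in this case) with mrr. Pass plot=True if you want plots for the evaluation: they are saved in the same folder as the CSV file. To turn on carbon emissions tracking, use the carbon_tracking option followed by the three-letter ISO code of the country you are in.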
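
To see why the choice of distance can change which chunk is retrieved first, here is a small pure-Python sketch comparing cosine similarity with Euclidean distance on toy two-dimensional vectors. It only illustrates the two metrics; it is not how Qdrant computes them internally, and the vectors and candidate names are made up for the example.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b (higher means more similar)."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance between a and b (lower means more similar)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


query = [1.0, 0.0]
candidates = {
    "same_direction_far": [10.0, 0.0],  # identical direction, much larger norm
    "nearby_off_axis": [1.0, 1.0],      # close in space, different direction
}

best_by_cosine = max(candidates, key=lambda k: cosine_similarity(query, candidates[k]))
best_by_euclid = min(candidates, key=lambda k: euclidean_distance(query, candidates[k]))
```

Cosine similarity ignores vector length and prefers same_direction_far, while Euclidean distance prefers nearby_off_axis, so it can be worth evaluating more than one distance setting for embedders that do not normalize their outputs.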

2. Cloud Qdrant

You can also use Qdrant's on-cloud database solutions (more about it here). You only need your Qdrant cluster URL and API key to access it:

from qdrant_client import QdrantClient

client = QdrantClient(url="YOUR-QDRANT-URL", api_key="YOUR-API-KEY")

This is the only change you have to make to the code provided in the previous example.
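
If you prefer not to hard-code credentials, one option is to read them from environment variables. The sketch below is only a suggestion: the helper qdrant_settings_from_env and the variable names QDRANT_URL and QDRANT_API_KEY are assumptions made for this example, not something SenTrEv or Qdrant defines.

```python
import os
from typing import Dict, Mapping


def qdrant_settings_from_env(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Build keyword arguments for QdrantClient from (hypothetical) environment variables."""
    # Fall back to the local Docker instance when no URL is configured.
    settings = {"url": env.get("QDRANT_URL", "http://localhost:6333")}
    api_key = env.get("QDRANT_API_KEY")
    if api_key:
        # Only a cloud cluster needs an API key; omit it for a local instance.
        settings["api_key"] = api_key
    return settings
```

The result can then be unpacked into the client constructor, for instance client = QdrantClient(**qdrant_settings_from_env()), so the same script runs against a local or a cloud database.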

3. Upload PDFs to Qdrant

You can also use SenTrEv to chunk, vectorize and upload your PDFs to a Qdrant database.

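import os

from sentrev.evaluator import upload_pdfs
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient

encoder = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
# expanduser resolves '~', which Python does not expand on its own
pdfs = [os.path.expanduser(p) for p in ['~/pdfs/instructions.pdf', '~/pdfs/history.pdf', '~/pdfs/info.pdf']]
client = QdrantClient("http://localhost:6333")

upload_pdfs(pdfs=pdfs, encoder=encoder, client=client)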

As before, you can also adjust the chunking_size argument (default is 1000) and the distance argument (default is cosine).
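
To get an intuition for what chunking_size controls, here is a minimal sketch of fixed-size chunking. It assumes the size is counted in characters and that chunks do not overlap; SenTrEv's actual splitting strategy may differ, and the function name chunk_text is hypothetical.

```python
from typing import List


def chunk_text(text: str, chunking_size: int = 1000) -> List[str]:
    """Split text into consecutive pieces of at most chunking_size characters."""
    if chunking_size <= 0:
        raise ValueError("chunking_size must be a positive integer")
    return [text[i:i + chunking_size] for i in range(0, len(text), chunking_size)]
```

Smaller values produce more, shorter chunks (more vectors to store and search), while larger values keep more context in each chunk.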

4. Implement semantic search on a Qdrant collection

You can also use SenTrEv to search an already-existing collection in a Qdrant database:

from sentrev.utils import NeuralSearcher
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient

encoder = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
collection_name = 'customer_help'
client = QdrantClient("http://localhost:6333")

searcher = NeuralSearcher(client=client, model=encoder, collection_name=collection_name)
res = searcher.search("Is it possible to pay online with my credit card?", limit=5)

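The results are returned as a list of payloads (the metadata you uploaded to the Qdrant collection along with the vector points).

If you used SenTrEv's upload_pdfs function, you should be able to access the results this way: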

text = res[0]["text"]
source = res[0]["source"]
page = res[0]["page"]
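
Building on the access pattern above, the following sketch turns a list of payloads into short, human-readable lines. It assumes each payload is a dictionary with the "text", "source" and "page" keys shown above; the helper format_hits is hypothetical and not part of SenTrEv.

```python
from typing import Any, Dict, List


def format_hits(hits: List[Dict[str, Any]], max_chars: int = 80) -> List[str]:
    """Return one summary line per payload: position, source, page and a text snippet."""
    lines = []
    for position, hit in enumerate(hits, start=1):
        text = str(hit.get("text", "")).replace("\n", " ")
        # Truncate long chunks so that each result fits on a single line.
        snippet = text if len(text) <= max_chars else text[:max_chars - 3] + "..."
        lines.append(f"{position}. {hit.get('source', '?')} (page {hit.get('page', '?')}): {snippet}")
    return lines
```

With the search results from the previous example, print("\n".join(format_hits(res))) would list the retrieved chunks in ranking order.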

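Case study

You can refer to the case study reported here

Reference

Find a reference for all the functions and classes here

Roadmap

v1.0.0

Add support for Markdown, HTML, Word and CSV data types
Add support for Chroma, Pinecone, Weaviate, Supabase and MongoDB as vector databases

Contributing

Contributions are always welcome!

Find the contribution guidelines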

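License, citation and funding

This project is open source and is provided under the MIT license.

If you used SenTrEv to evaluate your retrieval models, please consider citing it:

Bertelli, A. C. (2024). Evaluation of the performance of three Sentence Transformers text embedders - a case study for SenTrEv (v0.1.0). Zenodo. https://doi.org/10.5281/Zenodo.14503887

If you found it useful, please consider funding it.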

Additional information

Version: v0.1.0
Type: other source code
Updated: 2025-05-27
Size: 2.48MB
Source: GitHub
