
This repository contains the code to perform training, indexing and retrieval for SPLADE models. It also includes everything needed to launch evaluation on the BEIR benchmark.
TL;DR: SPLADE is a neural retrieval model which learns query/document sparse expansion via the BERT MLM head and sparse regularization. Sparse representations benefit from several advantages compared to dense approaches: efficient use of inverted indexes, explicit lexical match, interpretability... They also seem to generalize better on out-of-domain data (BEIR benchmark).
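For intuition, the sparse representation is obtained by max-pooling a log-saturated ReLU of the MLM logits over the input tokens (the SPLADE-max formulation). Below is a minimal numpy sketch of that aggregation, not the repository's actual implementation:

```python
import numpy as np

def splade_rep(mlm_logits: np.ndarray) -> np.ndarray:
    """Collapse per-token MLM logits (seq_len x vocab_size) into a single
    sparse vector: max over tokens of log(1 + ReLU(logit))."""
    saturated = np.log1p(np.maximum(mlm_logits, 0.0))  # log damps large weights
    return saturated.max(axis=0)                       # max pooling over tokens

# Toy example: 2 input tokens, vocabulary of 4 terms.
logits = np.array([[1.0, -2.0, 0.5, 0.0],
                   [0.0,  3.0, -1.0, 0.0]])
rep = splade_rep(logits)
# Negative logits are zeroed out, so unused vocabulary terms get weight 0,
# which is what makes the representation sparse and indexable.
```

The sparse regularization (see the FLOPS loss below in the training section) then pushes most of these dimensions to exactly zero.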
By benefiting from recent advances in training neural retrievers, our v2 models rely on hard-negative mining, distillation, and better Pre-trained Language Model initialization to further increase their effectiveness, on both in-domain (MS MARCO) and out-of-domain evaluation (BEIR benchmark).
Finally, by introducing several modifications (query-specific regularization, disjoint encoders, etc.), we are able to improve efficiency, achieving latency on par with BM25 under the same computing constraints.
Weights for models trained under various settings can be found on the Naver Labs Europe website, as well as on Hugging Face. Please bear in mind that SPLADE is more a class of models than a single model per se: depending on the regularization magnitude, we can obtain different models (from very sparse ones to models performing intense query/doc expansion) with different properties and performance.
splade: a spork that is sharp along one edge or both edges, enabling it to be used as a knife, a fork and a spoon.
We recommend starting from a fresh environment and installing the packages from conda_splade_env.yml:
conda create -n splade_env python=3.9
conda activate splade_env
conda env create -f conda_splade_env.yml
inference_splade.ipynb allows you to load a model and perform inference, in order to inspect the predicted "bags of expanded words". We provide weights for six main models:

| model | MRR@10 (MS MARCO dev) |
|---|---|
| naver/splade_v2_max (v2, HF) | 34.0 |
| naver/splade_v2_distil (v2, HF) | 36.8 |
| naver/splade-cocondenser-selfdistil (SPLADE++, HF) | 37.6 |
| naver/splade-cocondenser-ensembledistil (SPLADE++, HF) | 38.3 |
| naver/efficient-splade-V-large-doc (HF) + naver/efficient-splade-V-large-query (HF) (efficient SPLADE) | 38.8 |
| naver/efficient-splade-VI-BT-large-doc (HF) + naver/efficient-splade-VI-BT-large-query (HF) (efficient SPLADE) | 38.0 |
We also uploaded various other models here. Feel free to try them out!
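To see what a "bag of expanded words" looks like, one can map the nonzero dimensions of a sparse representation back to vocabulary terms and sort them by weight, as the notebook does with the real BERT vocabulary. A toy sketch with a hypothetical four-term vocabulary:

```python
# Hypothetical tiny vocabulary and one sparse representation over it;
# the real model uses the full BERT WordPiece vocabulary (~30k terms).
vocab = ["cat", "dog", "pet", "car"]
weights = [0.0, 1.2, 0.7, 0.0]

# Keep only nonzero terms, sorted by decreasing weight.
bow = sorted(
    ((vocab[i], w) for i, w in enumerate(weights) if w > 0),
    key=lambda term_weight: term_weight[1],
    reverse=True,
)
# "pet" can receive weight even if absent from the input text: expansion.
```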
You can train (train.py), index (index.py) and retrieve (retrieve.py) with SPLADE models (or perform every step with all.py). To make setup easier, we provide the full data folder, which can be downloaded here. This link includes queries, documents and hard-negatives data, allowing training under the EnsembleDistil setting (see the v2bis paper). For the other settings (Simple, DistilMSE, SelfDistil), you also have to download:
- (Simple) standard BM25 triplets
- (DistilMSE) "vienna" triplets, used for margin-MSE distillation
- (SelfDistil) triplets mined from SPLADE

After downloading, simply untar in the root directory, and the data will be placed in the right folder:
tar -xzvf file.tar.gz
To perform all the steps (on toy data, i.e. with config_default.yaml), run from the root directory:
conda activate splade_env
export PYTHONPATH=$PYTHONPATH:$(pwd)
export SPLADE_CONFIG_NAME="config_default.yaml"
python3 -m splade.all \
  config.checkpoint_dir=experiments/debug/checkpoint \
  config.index_dir=experiments/debug/index \
  config.out_dir=experiments/debug/out

We provide additional examples that can be plugged into the command above. See conf/README.md for details on how to change the experiment settings.
You can run each step on its own in the same way: python3 -m splade.train (same for indexing or retrieval). A config can also be given by its full path, e.g.:

SPLADE_CONFIG_FULLPATH=/path/to/checkpoint/dir/config.yaml python3 -m splade.create_anserini +quantization_factor_document=100 +quantization_factor_query=100

Config files for the various settings (distillation, etc.) are available in /conf. For instance, to run the SelfDistil setting:

SPLADE_CONFIG_NAME=config_splade++_selfdistil.yaml
python3 -m splade.all \
  config.regularizer.FLOPS.lambda_q=0.06 \
  config.regularizer.FLOPS.lambda_d=0.02

We provide several base configurations corresponding to the experiments in the v2bis and "efficiency" papers. Please note that these are suited to our hardware setting, i.e. 4 Tesla V100 GPUs with 32GB memory. To train models with e.g. a single GPU, you need to decrease the batch size for training and evaluation. Also note that, since the scale of the loss can change with the batch size, the corresponding regularization lambdas might need to be adjusted. We nonetheless provide a mono-GPU configuration, config_splade++_cocondenser_ensembledistil_monogpu.yaml, for which we obtain 37.2 MRR@10 when training on a single 16GB GPU.
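The lambda_q / lambda_d knobs above weight the FLOPS regularizer, which penalizes the squared mean activation of each vocabulary dimension over a batch, pushing most dimensions toward zero. A small numpy sketch of that loss (illustrative only, not the repo's implementation):

```python
import numpy as np

def flops_loss(batch_reps: np.ndarray) -> float:
    """FLOPS regularizer: sum over vocabulary dimensions of the squared
    mean absolute activation across the batch. Minimizing it shrinks the
    average posting-list usage, i.e. the expected retrieval FLOPS."""
    return float((np.abs(batch_reps).mean(axis=0) ** 2).sum())

# 2 representations over a 3-term vocabulary.
batch = np.array([[0.5, 0.0, 1.0],
                  [0.5, 0.0, 0.0]])
loss = flops_loss(batch)  # per-dim means (0.5, 0, 0.5) -> 0.25 + 0 + 0.25
```

In training, this loss is computed separately for query and document representations and scaled by lambda_q and lambda_d respectively, which is why batch-size changes can require re-tuning the lambdas.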
Indexing (and retrieval) can be performed using either our (numba-based) inverted index or Anserini. Let's perform these steps using an available model (naver/splade-cocondenser-ensembledistil):
conda activate splade_env
export PYTHONPATH=$PYTHONPATH:$(pwd)
export SPLADE_CONFIG_NAME="config_splade++_cocondenser_ensembledistil"
python3 -m splade.index \
  init_dict.model_type_or_dir=naver/splade-cocondenser-ensembledistil \
  config.pretrained_no_yamlconfig=true \
  config.index_dir=experiments/pre-trained/index
python3 -m splade.retrieve \
  init_dict.model_type_or_dir=naver/splade-cocondenser-ensembledistil \
  config.pretrained_no_yamlconfig=true \
  config.index_dir=experiments/pre-trained/index \
  config.out_dir=experiments/pre-trained/out
# pretrained_no_yamlconfig indicates that we solely rely on a HF-valid model path

To also run evaluation on the MS MARCO dev queries, add retrieve_evaluate=msmarco as an argument to splade.retrieve.
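Conceptually, retrieval over an inverted index accumulates the sparse dot product term by term over posting lists, only ever touching documents that share a term with the query. A minimal pure-Python sketch (toy data and names, not the repo's numba implementation):

```python
from collections import defaultdict

# Each document stores only its nonzero (term -> weight) entries.
docs = {
    "d1": {"espresso": 1.2, "coffee": 0.8},
    "d2": {"coffee": 1.0, "tea": 0.5},
}

# Build posting lists: term -> list of (doc_id, weight).
index = defaultdict(list)
for doc_id, vec in docs.items():
    for term, weight in vec.items():
        index[term].append((doc_id, weight))

def retrieve(query: dict) -> list:
    """Score = sparse dot product, accumulated one posting list at a time."""
    scores = defaultdict(float)
    for term, q_weight in query.items():
        for doc_id, d_weight in index.get(term, []):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

ranking = retrieve({"coffee": 1.0, "tea": 1.0})
# d2 matches both terms (1.0 + 0.5), d1 only "coffee" (0.8)
```

The sparser the representations (i.e. the stronger the regularization), the shorter the posting lists and the cheaper this accumulation, which is the efficiency lever discussed above.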
You can similarly build the files that will be ingested by Anserini:

python3 -m splade.create_anserini \
  init_dict.model_type_or_dir=naver/splade-cocondenser-ensembledistil \
  config.pretrained_no_yamlconfig=true \
  config.index_dir=experiments/pre-trained/index \
  +quantization_factor_document=100 \
  +quantization_factor_query=100

This will create the JSON collection (docs_anserini.jsonl) as well as the queries (queries_anserini.tsv) that Anserini needs. You then just have to follow the Anserini regression for SPLADE in order to index and retrieve.
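The quantization factors exist because Anserini stores integer term impacts rather than floats: presumably, each weight is scaled by the factor and rounded, with terms that round to zero dropped. A sketch of that assumption (the quantize helper below is hypothetical, not a function from this repo):

```python
def quantize(weights: dict, factor: int = 100) -> dict:
    """Scale float term weights to integers so they can be stored as
    term impacts in an Anserini index; terms that round to 0 are pruned.
    Illustrates the presumed effect of +quantization_factor_*=100."""
    quantized = {}
    for term, weight in weights.items():
        impact = round(weight * factor)
        if impact > 0:
            quantized[term] = impact
    return quantized

q = quantize({"coffee": 1.234, "tea": 0.003})
# "tea" is pruned: 0.003 * 100 rounds to 0
```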
You can also run evaluation on BEIR, for instance:
conda activate splade_env
export PYTHONPATH=$PYTHONPATH:$(pwd)
export SPLADE_CONFIG_FULLPATH="/path/to/checkpoint/dir/config.yaml"
for dataset in arguana fiqa nfcorpus quora scidocs scifact trec-covid webis-touche2020 climate-fever dbpedia-entity fever hotpotqa nq
do
    python3 -m splade.beir_eval \
      +beir.dataset=$dataset \
      +beir.dataset_path=data/beir \
      config.index_retrieve_batch_size=100
done

We provide in efficient_splade_pisa/README.md the steps to evaluate efficient SPLADE models with PISA.
Please cite our work as:
@inbook{10.1145/3404835.3463098,
author = {Formal, Thibault and Piwowarski, Benjamin and Clinchant, St\'{e}phane},
title = {SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking},
year = {2021},
isbn = {9781450380379},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3404835.3463098},
booktitle = {Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval},
pages = {2288–2292},
numpages = {5}
}
@misc{https://doi.org/10.48550/arxiv.2109.10086,
doi = {10.48550/ARXIV.2109.10086},
url = {https://arxiv.org/abs/2109.10086},
author = {Formal, Thibault and Lassance, Carlos and Piwowarski, Benjamin and Clinchant, Stéphane},
keywords = {Information Retrieval (cs.IR), Artificial Intelligence (cs.AI), Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
title = {SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval},
publisher = {arXiv},
year = {2021},
copyright = {Creative Commons Attribution Non Commercial Share Alike 4.0 International}
}
@inproceedings{10.1145/3477495.3531857,
author = {Formal, Thibault and Lassance, Carlos and Piwowarski, Benjamin and Clinchant, St\'{e}phane},
title = {From Distillation to Hard Negative Sampling: Making Sparse Neural IR Models More Effective},
year = {2022},
isbn = {9781450387323},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3477495.3531857},
doi = {10.1145/3477495.3531857},
abstract = {Neural retrievers based on dense representations combined with Approximate Nearest Neighbors search have recently received a lot of attention, owing their success to distillation and/or better sampling of examples for training -- while still relying on the same backbone architecture. In the meantime, sparse representation learning fueled by traditional inverted indexing techniques has seen a growing interest, inheriting from desirable IR priors such as explicit lexical matching. While some architectural variants have been proposed, a lesser effort has been put in the training of such models. In this work, we build on SPLADE -- a sparse expansion-based retriever -- and show to which extent it is able to benefit from the same training improvements as dense models, by studying the effect of distillation, hard-negative mining as well as the Pre-trained Language Model initialization. We furthermore study the link between effectiveness and efficiency, on in-domain and zero-shot settings, leading to state-of-the-art results in both scenarios for sufficiently expressive models.},
booktitle = {Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval},
pages = {2353–2359},
numpages = {7},
keywords = {neural networks, indexing, sparse representations, regularization},
location = {Madrid, Spain},
series = {SIGIR '22}
}
@inproceedings{10.1145/3477495.3531833,
author = {Lassance, Carlos and Clinchant, St\'{e}phane},
title = {An Efficiency Study for SPLADE Models},
year = {2022},
isbn = {9781450387323},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3477495.3531833},
doi = {10.1145/3477495.3531833},
abstract = {Latency and efficiency issues are often overlooked when evaluating IR models based on Pretrained Language Models (PLMs) in reason of multiple hardware and software testing scenarios. Nevertheless, efficiency is an important part of such systems and should not be overlooked. In this paper, we focus on improving the efficiency of the SPLADE model since it has achieved state-of-the-art zero-shot performance and competitive results on TREC collections. SPLADE efficiency can be controlled via a regularization factor, but solely controlling this regularization has been shown to not be efficient enough. In order to reduce the latency gap between SPLADE and traditional retrieval systems, we propose several techniques including L1 regularization for queries, a separation of document/query encoders, a FLOPS-regularized middle-training, and the use of faster query encoders. Our benchmark demonstrates that we can drastically improve the efficiency of these models while increasing the performance metrics on in-domain data. To our knowledge, we propose the first neural models that, under the same computing constraints, achieve similar latency (less than 4ms difference) as traditional BM25, while having similar performance (less than 10% MRR@10 reduction) as the state-of-the-art single-stage neural rankers on in-domain data.},
booktitle = {Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval},
pages = {2220–2226},
numpages = {7},
keywords = {splade, sparse representations, latency, information retrieval},
location = {Madrid, Spain},
series = {SIGIR '22}
}
Feel free to contact us via Twitter or by mail @ [email protected]!
SPLADE Copyright (C) 2021-present NAVER Corp.
SPLADE is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license. (see LICENSE)
You should have received a copy of the license along with this work. If not, see http://creativecommons.org/licenses/by-nc-sa/4.0/.