
This repository contains the code to perform training, indexing and retrieval for SPLADE models. It also includes everything needed to launch evaluation on the BEIR benchmark.
TL;DR: SPLADE is a neural retrieval model which learns query/document sparse expansion via the BERT MLM head and sparse regularization. Sparse representations benefit from several advantages compared to dense approaches: efficient use of inverted indexes, explicit lexical match, interpretability... They also seem to generalize better on out-of-domain data (BEIR benchmark).
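For intuition, the sparse representation is obtained by max-pooling a log-saturated ReLU of the MLM logits over the input tokens (the SPLADE-max formulation). Below is a minimal numpy sketch of that aggregation, not the repository's actual implementation:

```python
import numpy as np

def splade_rep(mlm_logits: np.ndarray) -> np.ndarray:
    """Collapse per-token MLM logits (seq_len x vocab_size) into a single
    sparse vector: max over tokens of log(1 + ReLU(logit))."""
    saturated = np.log1p(np.maximum(mlm_logits, 0.0))  # log damps large weights
    return saturated.max(axis=0)                       # max pooling over tokens

# Toy example: 2 input tokens, vocabulary of 4 terms.
logits = np.array([[1.0, -2.0, 0.5, 0.0],
                   [0.0,  3.0, -1.0, 0.0]])
rep = splade_rep(logits)
# Negative logits are zeroed out, so unused vocabulary terms get weight 0,
# which is what makes the representation sparse and indexable.
```

The sparse regularization (see the FLOPS loss below in the training section) then pushes most of these dimensions to exactly zero.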
By benefiting from recent advances in training neural retrievers, our v2 models rely on hard-negative mining, distillation, and better Pre-trained Language Model initialization to further increase their effectiveness, on both in-domain (MS MARCO) and out-of-domain evaluation (BEIR benchmark).
Finally, by introducing several modifications (query-specific regularization, disjoint encoders, etc.), we are able to improve efficiency, achieving latency on par with BM25 under the same computing constraints.
Weights for models trained under various settings can be found on the Naver Labs Europe website, as well as on Hugging Face. Please bear in mind that SPLADE is more a class of models than a single model per se: depending on the regularization magnitude, we can obtain different models (from very sparse ones to models performing intense query/doc expansion) with different properties and performance.
splade: a spork that is sharp along one edge or both edges, enabling it to be used as a knife, a fork and a spoon.
We recommend starting from a fresh environment and installing the packages from conda_splade_env.yml:
conda create -n splade_env python=3.9
conda activate splade_env
conda env create -f conda_splade_env.yml
inference_splade.ipynb allows you to load a model and perform inference, in order to inspect the predicted "bags of expanded words". We provide weights for six main models:

| model | MRR@10 (MS MARCO dev) |
|---|---|
| naver/splade_v2_max (v2, HF) | 34.0 |
| naver/splade_v2_distil (v2, HF) | 36.8 |
| naver/splade-cocondenser-selfdistil (SPLADE++, HF) | 37.6 |
| naver/splade-cocondenser-ensembledistil (SPLADE++, HF) | 38.3 |
| naver/efficient-splade-V-large-doc (HF) + naver/efficient-splade-V-large-query (HF) (efficient SPLADE) | 38.8 |
| naver/efficient-splade-VI-BT-large-doc (HF) + naver/efficient-splade-VI-BT-large-query (HF) (efficient SPLADE) | 38.0 |
We also uploaded various other models here. Feel free to try them out!
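To see what a "bag of expanded words" looks like, one can map the nonzero dimensions of a sparse representation back to vocabulary terms and sort them by weight, as the notebook does with the real BERT vocabulary. A toy sketch with a hypothetical four-term vocabulary:

```python
# Hypothetical tiny vocabulary and one sparse representation over it;
# the real model uses the full BERT WordPiece vocabulary (~30k terms).
vocab = ["cat", "dog", "pet", "car"]
weights = [0.0, 1.2, 0.7, 0.0]

# Keep only nonzero terms, sorted by decreasing weight.
bow = sorted(
    ((vocab[i], w) for i, w in enumerate(weights) if w > 0),
    key=lambda term_weight: term_weight[1],
    reverse=True,
)
# "pet" can receive weight even if absent from the input text: expansion.
```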
You can train (train.py), index (index.py) and retrieve (retrieve.py) with SPLADE models (or perform every step with all.py). To make setup easier, we provide the full data folder, which can be downloaded here. This link includes queries, documents and hard-negatives data, allowing training under the EnsembleDistil setting (see the v2bis paper). For the other settings (Simple, DistilMSE, SelfDistil), you also have to download:
- (Simple) standard BM25 triplets
- (DistilMSE) "vienna" triplets, used for margin-MSE distillation
- (SelfDistil) triplets mined from SPLADE

After downloading, simply untar in the root directory, and the data will be placed in the right folder:
tar -xzvf file.tar.gz
To perform all the steps (on toy data, i.e. with config_default.yaml), run from the root directory:
conda activate splade_env
export PYTHONPATH=$PYTHONPATH:$(pwd)
export SPLADE_CONFIG_NAME="config_default.yaml"
python3 -m splade.all \
  config.checkpoint_dir=experiments/debug/checkpoint \
  config.index_dir=experiments/debug/index \
  config.out_dir=experiments/debug/out

We provide additional examples that can be plugged into the command above. See conf/README.md for details on how to change the experiment settings.
You can run each step on its own in the same way: python3 -m splade.train (same for indexing or retrieval). A config can also be given by its full path, e.g.:

SPLADE_CONFIG_FULLPATH=/path/to/checkpoint/dir/config.yaml python3 -m splade.create_anserini +quantization_factor_document=100 +quantization_factor_query=100

Config files for the various settings (distillation, etc.) are available in /conf. For instance, to run the SelfDistil setting:

SPLADE_CONFIG_NAME=config_splade++_selfdistil.yaml
python3 -m splade.all \
  config.regularizer.FLOPS.lambda_q=0.06 \
  config.regularizer.FLOPS.lambda_d=0.02

We provide several base configurations corresponding to the experiments in the v2bis and "efficiency" papers. Please note that these are suited to our hardware setting, i.e. 4 Tesla V100 GPUs with 32GB memory. To train models with e.g. a single GPU, you need to decrease the batch size for training and evaluation. Also note that, since the scale of the loss can change with the batch size, the corresponding regularization lambdas might need to be adjusted. We nonetheless provide a mono-GPU configuration, config_splade++_cocondenser_ensembledistil_monogpu.yaml, for which we obtain 37.2 MRR@10 when training on a single 16GB GPU.
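The lambda_q / lambda_d knobs above weight the FLOPS regularizer, which penalizes the squared mean activation of each vocabulary dimension over a batch, pushing most dimensions toward zero. A small numpy sketch of that loss (illustrative only, not the repo's implementation):

```python
import numpy as np

def flops_loss(batch_reps: np.ndarray) -> float:
    """FLOPS regularizer: sum over vocabulary dimensions of the squared
    mean absolute activation across the batch. Minimizing it shrinks the
    average posting-list usage, i.e. the expected retrieval FLOPS."""
    return float((np.abs(batch_reps).mean(axis=0) ** 2).sum())

# 2 representations over a 3-term vocabulary.
batch = np.array([[0.5, 0.0, 1.0],
                  [0.5, 0.0, 0.0]])
loss = flops_loss(batch)  # per-dim means (0.5, 0, 0.5) -> 0.25 + 0 + 0.25
```

In training, this loss is computed separately for query and document representations and scaled by lambda_q and lambda_d respectively, which is why batch-size changes can require re-tuning the lambdas.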
Indexing (and retrieval) can be performed using either our (numba-based) inverted index or Anserini. Let's perform these steps using an available model (naver/splade-cocondenser-ensembledistil):
conda activate splade_env
export PYTHONPATH=$PYTHONPATH:$(pwd)
export SPLADE_CONFIG_NAME="config_splade++_cocondenser_ensembledistil"
python3 -m splade.index \
  init_dict.model_type_or_dir=naver/splade-cocondenser-ensembledistil \
  config.pretrained_no_yamlconfig=true \
  config.index_dir=experiments/pre-trained/index
python3 -m splade.retrieve \
  init_dict.model_type_or_dir=naver/splade-cocondenser-ensembledistil \
  config.pretrained_no_yamlconfig=true \
  config.index_dir=experiments/pre-trained/index \
  config.out_dir=experiments/pre-trained/out
# pretrained_no_yamlconfig indicates that we solely rely on a HF-valid model path

To also run evaluation on the MS MARCO dev queries, add retrieve_evaluate=msmarco as an argument to splade.retrieve.
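Conceptually, retrieval over an inverted index accumulates the sparse dot product term by term over posting lists, only ever touching documents that share a term with the query. A minimal pure-Python sketch (toy data and names, not the repo's numba implementation):

```python
from collections import defaultdict

# Each document stores only its nonzero (term -> weight) entries.
docs = {
    "d1": {"espresso": 1.2, "coffee": 0.8},
    "d2": {"coffee": 1.0, "tea": 0.5},
}

# Build posting lists: term -> list of (doc_id, weight).
index = defaultdict(list)
for doc_id, vec in docs.items():
    for term, weight in vec.items():
        index[term].append((doc_id, weight))

def retrieve(query: dict) -> list:
    """Score = sparse dot product, accumulated one posting list at a time."""
    scores = defaultdict(float)
    for term, q_weight in query.items():
        for doc_id, d_weight in index.get(term, []):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

ranking = retrieve({"coffee": 1.0, "tea": 1.0})
# d2 matches both terms (1.0 + 0.5), d1 only "coffee" (0.8)
```

The sparser the representations (i.e. the stronger the regularization), the shorter the posting lists and the cheaper this accumulation, which is the efficiency lever discussed above.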
You can similarly build the files that will be ingested by Anserini:

python3 -m splade.create_anserini \
  init_dict.model_type_or_dir=naver/splade-cocondenser-ensembledistil \
  config.pretrained_no_yamlconfig=true \
  config.index_dir=experiments/pre-trained/index \
  +quantization_factor_document=100 \
  +quantization_factor_query=100

This will create the JSON collection (docs_anserini.jsonl) as well as the queries (queries_anserini.tsv) that Anserini needs. You then just have to follow the Anserini regression for SPLADE in order to index and retrieve.
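The quantization factors exist because Anserini stores integer term impacts rather than floats: presumably, each weight is scaled by the factor and rounded, with terms that round to zero dropped. A sketch of that assumption (the quantize helper below is hypothetical, not a function from this repo):

```python
def quantize(weights: dict, factor: int = 100) -> dict:
    """Scale float term weights to integers so they can be stored as
    term impacts in an Anserini index; terms that round to 0 are pruned.
    Illustrates the presumed effect of +quantization_factor_*=100."""
    quantized = {}
    for term, weight in weights.items():
        impact = round(weight * factor)
        if impact > 0:
            quantized[term] = impact
    return quantized

q = quantize({"coffee": 1.234, "tea": 0.003})
# "tea" is pruned: 0.003 * 100 rounds to 0
```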
You can also run evaluation on BEIR, for instance:
conda activate splade_env
export PYTHONPATH=$PYTHONPATH:$(pwd)
export SPLADE_CONFIG_FULLPATH="/path/to/checkpoint/dir/config.yaml"
for dataset in arguana fiqa nfcorpus quora scidocs scifact trec-covid webis-touche2020 climate-fever dbpedia-entity fever hotpotqa nq
do
    python3 -m splade.beir_eval \
      +beir.dataset=$dataset \
      +beir.dataset_path=data/beir \
      config.index_retrieve_batch_size=100
done

We provide in efficient_splade_pisa/README.md the steps to evaluate efficient SPLADE models with PISA.
Please cite our work as:
@inbook{10.1145/3404835.3463098,
author = {Formal, Thibault and Piwowarski, Benjamin and Clinchant, St\'{e}phane},
title = {SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking},
year = {2021},
isbn = {9781450380379},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3404835.3463098},
booktitle = {Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval},
pages = {2288–2292},
numpages = {5}
}
@misc{https://doi.org/10.48550/arxiv.2109.10086,
doi = {10.48550/ARXIV.2109.10086},
url = {https://arxiv.org/abs/2109.10086},
author = {Formal, Thibault and Lassance, Carlos and Piwowarski, Benjamin and Clinchant, Stéphane},
keywords = {Information Retrieval (cs.IR), Artificial Intelligence (cs.AI), Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
title = {SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval},
publisher = {arXiv},
year = {2021},
copyright = {Creative Commons Attribution Non Commercial Share Alike 4.0 International}
}
@inproceedings{10.1145/3477495.3531857,
author = {Formal, Thibault and Lassance, Carlos and Piwowarski, Benjamin and Clinchant, St\'{e}phane},
title = {From Distillation to Hard Negative Sampling: Making Sparse Neural IR Models More Effective},
year = {2022},
isbn = {9781450387323},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3477495.3531857},
doi = {10.1145/3477495.3531857},
abstract = {Neural retrievers based on dense representations combined with Approximate Nearest Neighbors search have recently received a lot of attention, owing their success to distillation and/or better sampling of examples for training -- while still relying on the same backbone architecture. In the meantime, sparse representation learning fueled by traditional inverted indexing techniques has seen a growing interest, inheriting from desirable IR priors such as explicit lexical matching. While some architectural variants have been proposed, a lesser effort has been put in the training of such models. In this work, we build on SPLADE -- a sparse expansion-based retriever -- and show to which extent it is able to benefit from the same training improvements as dense models, by studying the effect of distillation, hard-negative mining as well as the Pre-trained Language Model initialization. We furthermore study the link between effectiveness and efficiency, on in-domain and zero-shot settings, leading to state-of-the-art results in both scenarios for sufficiently expressive models.},
booktitle = {Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval},
pages = {2353–2359},
numpages = {7},
keywords = {neural networks, indexing, sparse representations, regularization},
location = {Madrid, Spain},
series = {SIGIR '22}
}
@inproceedings{10.1145/3477495.3531833,
author = {Lassance, Carlos and Clinchant, St\'{e}phane},
title = {An Efficiency Study for SPLADE Models},
year = {2022},
isbn = {9781450387323},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3477495.3531833},
doi = {10.1145/3477495.3531833},
abstract = {Latency and efficiency issues are often overlooked when evaluating IR models based on Pretrained Language Models (PLMs) in reason of multiple hardware and software testing scenarios. Nevertheless, efficiency is an important part of such systems and should not be overlooked. In this paper, we focus on improving the efficiency of the SPLADE model since it has achieved state-of-the-art zero-shot performance and competitive results on TREC collections. SPLADE efficiency can be controlled via a regularization factor, but solely controlling this regularization has been shown to not be efficient enough. In order to reduce the latency gap between SPLADE and traditional retrieval systems, we propose several techniques including L1 regularization for queries, a separation of document/query encoders, a FLOPS-regularized middle-training, and the use of faster query encoders. Our benchmark demonstrates that we can drastically improve the efficiency of these models while increasing the performance metrics on in-domain data. To our knowledge, we propose the first neural models that, under the same computing constraints, achieve similar latency (less than 4ms difference) as traditional BM25, while having similar performance (less than 10% MRR@10 reduction) as the state-of-the-art single-stage neural rankers on in-domain data.},
booktitle = {Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval},
pages = {2220–2226},
numpages = {7},
keywords = {splade, sparse representations, latency, information retrieval},
location = {Madrid, Spain},
series = {SIGIR '22}
}
Feel free to contact us via Twitter or by mail @ [email protected]!
SPLADE Copyright (C) 2021-present NAVER Corp.
SPLADE is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license. (see LICENSE)
You should have received a copy of the license along with this work. If not, see http://creativecommons.org/licenses/by-nc-sa/4.0/.