Neural Search

Neural-Cherche is a library designed to fine-tune neural search models such as Splade, ColBERT, and SparseEmbed on a specific dataset. Neural-Cherche also provides classes to run efficient inference with a fine-tuned retriever or ranker. Neural-Cherche aims to offer a straightforward and effective way to fine-tune and use neural search models in both offline and online settings. It also lets users save every computed embedding to avoid redundant computation.
Neural-Cherche is compatible with CPU, GPU, and MPS devices. ColBERT can be fine-tuned from any Sentence Transformers pre-trained checkpoint. Fine-tuning Splade and SparseEmbed is trickier and requires an MLM-based pre-trained model.
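For example, a device string covering all three backends can be selected with plain PyTorch; this is a minimal sketch using only standard PyTorch calls:

import torch

# Pick the best available device: CUDA GPU, Apple Silicon (MPS), or CPU.
device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available() else "cpu"
)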
We can install neural-cherche using:
pip install neural-cherche
If we plan to evaluate our model during training, install:
pip install "neural-cherche[eval]"
The complete documentation is available here.
Your training dataset must be made of triples (anchor, positive, negative), where the anchor is a query, the positive is a document directly relevant to the anchor, and the negative is a document irrelevant to the anchor.
X = [
    ("anchor 1", "positive 1", "negative 1"),
    ("anchor 2", "positive 2", "negative 2"),
    ("anchor 3", "positive 3", "negative 3"),
]

Here is how to fine-tune ColBERT from a Sentence Transformers pre-trained checkpoint using Neural-Cherche:
import torch

from neural_cherche import models, utils, train

model = models.ColBERT(
    model_name_or_path="raphaelsty/neural-cherche-colbert",
    device="cuda" if torch.cuda.is_available() else "cpu",  # or "mps"
)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-6)

X = [
    ("query", "positive document", "negative document"),
    ("query", "positive document", "negative document"),
    ("query", "positive document", "negative document"),
]

for step, (anchor, positive, negative) in enumerate(utils.iter(
    X,
    epochs=1,  # number of epochs
    batch_size=8,  # number of triples per batch
    shuffle=True,
)):
    # One optimization step over a batch of triples.
    loss = train.train_colbert(
        model=model,
        optimizer=optimizer,
        anchor=anchor,
        positive=positive,
        negative=negative,
        step=step,
        gradient_accumulation_steps=50,
    )

    if (step + 1) % 1000 == 0:
        # Save the model every 1000 steps.
        model.save_pretrained("checkpoint")

Here is how to re-rank documents using a fine-tuned ColBERT model:
import torch
from lenlp import sparse

from neural_cherche import models, rank, retrieve

documents = [
    {"id": "doc1", "title": "Paris", "text": "Paris is the capital of France."},
    {"id": "doc2", "title": "Montreal", "text": "Montreal is the largest city in Quebec."},
    {"id": "doc3", "title": "Bordeaux", "text": "Bordeaux in Southwestern France."},
]

# First-stage retriever: BM25 over character n-grams.
retriever = retrieve.BM25(
    key="id",
    on=["title", "text"],
    count_vectorizer=sparse.CountVectorizer(
        normalize=True, ngram_range=(3, 5), analyzer="char_wb", stop_words=[]
    ),
    k1=1.5,
    b=0.75,
    epsilon=0.0,
)

model = models.ColBERT(
    model_name_or_path="raphaelsty/neural-cherche-colbert",
    device="cuda" if torch.cuda.is_available() else "cpu",  # or "mps"
)

# Second-stage ranker: ColBERT re-orders the retriever's candidates.
ranker = rank.ColBERT(
    key="id",
    on=["title", "text"],
    model=model,
)

documents_embeddings = retriever.encode_documents(
    documents=documents,
)

retriever.add(
    documents_embeddings=documents_embeddings,
)

We can now retrieve documents using the fine-tuned model:
queries = [ "Paris" , "Montreal" , "Bordeaux" ]
queries_embeddings = retriever . encode_queries (
queries = queries ,
)
ranker_queries_embeddings = ranker . encode_queries (
queries = queries ,
)
candidates = retriever (
queries_embeddings = queries_embeddings ,
batch_size = 32 ,
k = 100 , # number of documents to retrieve
)
# Compute embeddings of the candidates with the ranker model.
# Note, we could also pre-compute all the embeddings.
ranker_documents_embeddings = ranker . encode_candidates_documents (
candidates = candidates ,
documents = documents ,
batch_size = 32 ,
)
scores = ranker (
queries_embeddings = ranker_queries_embeddings ,
documents_embeddings = ranker_documents_embeddings ,
documents = candidates ,
batch_size = 32 ,
)
scores

[[{'id': 0, 'similarity': 22.825355529785156},
  {'id': 1, 'similarity': 11.201947212219238},
  {'id': 2, 'similarity': 10.748161315917969}],
 [{'id': 1, 'similarity': 23.21628189086914},
  {'id': 0, 'similarity': 9.9658203125},
  {'id': 2, 'similarity': 7.308732509613037}],
 [{'id': 1, 'similarity': 6.4031805992126465},
  {'id': 0, 'similarity': 5.601611137390137},
  {'id': 2, 'similarity': 5.599479675292969}]]

Neural-Cherche provides SparseEmbed, Splade, TfIdf, and BM25 retrievers, as well as a ColBERT ranker that can re-order the output of a retriever. Please refer to the documentation for more information.
We provide pre-trained checkpoints designed for Neural-Cherche: raphaelsty/neural-cherche-sparse-embed and raphaelsty/neural-cherche-colbert. These checkpoints were fine-tuned on a subset of the MS MARCO dataset and would benefit from further fine-tuning on your specific dataset. You can fine-tune ColBERT from any Sentence Transformers pre-trained checkpoint to fit your specific language. You should use an MLM-based checkpoint to fine-tune Splade and SparseEmbed.
SciFact dataset:

| Model | HuggingFace Checkpoint | ndcg@10 | hits@10 | hits@1 |
|---|---|---|---|---|
| TfIdf | - | 0.62 | 0.86 | 0.50 |
| BM25 | - | 0.69 | 0.92 | 0.56 |
| SparseEmbed | raphaelsty/neural-cherche-sparse-embed | 0.62 | 0.87 | 0.48 |
| Sentence Transformer | sentence-transformers/all-mpnet-base-v2 | 0.66 | 0.89 | 0.53 |
| ColBERT | raphaelsty/neural-cherche-colbert | 0.70 | 0.92 | 0.58 |
| TfIdf retriever + ColBERT ranker | raphaelsty/neural-cherche-colbert | 0.71 | 0.94 | 0.59 |
| BM25 retriever + ColBERT ranker | raphaelsty/neural-cherche-colbert | 0.72 | 0.95 | 0.59 |
SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking, by Thibault Formal, Benjamin Piwowarski, Stéphane Clinchant, SIGIR 2021.
SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval, by Thibault Formal, Carlos Lassance, Benjamin Piwowarski, Stéphane Clinchant, SIGIR 2022.
SparseEmbed: Learning Sparse Lexical Representations with Contextual Embeddings for Retrieval, by Weize Kong, Jeffrey M. Dudek, Cheng Li, Mingyang Zhang, Mike Bendersky, SIGIR 2023.
ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT, by Omar Khattab, Matei Zaharia, SIGIR 2020.
This Python library is licensed under the MIT open-source license. The Splade model is available for non-commercial use only, as stated by its authors. SparseEmbed and ColBERT are fully open-source, including for commercial use.