Neural Search

Neural-Cherche is a library designed to fine-tune neural search models such as Splade, ColBERT, and SparseEmbed on a specific dataset. Neural-Cherche also provides classes to run efficient inference with a fine-tuned retriever or ranker. It aims to offer a straightforward and effective way to fine-tune and use neural search models in both offline and online settings, and it lets users save every computed embedding to avoid redundant computation.
Neural-Cherche is compatible with CPU, GPU, and MPS devices. We can fine-tune ColBERT from any Sentence Transformers pre-trained checkpoint. Fine-tuning Splade and SparseEmbed is trickier: both require an MLM pre-trained model.
We can install Neural-Cherche with:
```sh
pip install neural-cherche
```
If we plan to evaluate our model while training, install:
```sh
pip install "neural-cherche[eval]"
```
The complete documentation is available here.
Your training dataset must consist of triples `(anchor, positive, negative)` where the anchor is a query, the positive is a document directly related to the anchor, and the negative is a document irrelevant to the anchor.
```python
X = [
    ("anchor 1", "positive 1", "negative 1"),
    ("anchor 2", "positive 2", "negative 2"),
    ("anchor 3", "positive 3", "negative 3"),
]
```

Here is how to fine-tune ColBERT from a Sentence Transformers pre-trained checkpoint using Neural-Cherche:
```python
import torch

from neural_cherche import models, train, utils

model = models.ColBERT(
    model_name_or_path="raphaelsty/neural-cherche-colbert",
    device="cuda" if torch.cuda.is_available() else "cpu",  # or "mps"
)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-6)

X = [
    ("query", "positive document", "negative document"),
    ("query", "positive document", "negative document"),
    ("query", "positive document", "negative document"),
]

for step, (anchor, positive, negative) in enumerate(utils.iter(
    X,
    epochs=1,  # number of epochs
    batch_size=8,  # number of triples per batch
    shuffle=True,
)):
    loss = train.train_colbert(
        model=model,
        optimizer=optimizer,
        anchor=anchor,
        positive=positive,
        negative=negative,
        step=step,
        gradient_accumulation_steps=50,
    )

    if (step + 1) % 1000 == 0:
        # Save the model every 1000 steps.
        model.save_pretrained("checkpoint")
```
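Fine-tuning Splade or SparseEmbed follows the same pattern, but must start from an MLM pre-trained checkpoint. Below is a minimal sketch, assuming that `models.Splade` and `train.train_splade` mirror the ColBERT API shown above; the checkpoint name is only an example, so check the documentation for the exact signatures:

```python
import torch

from neural_cherche import models, train, utils

# Assumption: models.Splade mirrors the models.ColBERT constructor.
# Splade requires an MLM pre-trained checkpoint (e.g. a BERT-style model);
# "distilbert-base-uncased" is used here purely as an example.
model = models.Splade(
    model_name_or_path="distilbert-base-uncased",
    device="cuda" if torch.cuda.is_available() else "cpu",
)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-6)

X = [
    ("query", "positive document", "negative document"),
]

for step, (anchor, positive, negative) in enumerate(utils.iter(
    X,
    epochs=1,
    batch_size=8,
    shuffle=True,
)):
    # Assumption: train.train_splade takes the same arguments as train.train_colbert.
    loss = train.train_splade(
        model=model,
        optimizer=optimizer,
        anchor=anchor,
        positive=positive,
        negative=negative,
        step=step,
        gradient_accumulation_steps=50,
    )
```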
Here is how to re-rank documents using a fine-tuned ColBERT model:

```python
import torch
from lenlp import sparse

from neural_cherche import models, rank, retrieve

documents = [
    {"id": "doc1", "title": "Paris", "text": "Paris is the capital of France."},
    {"id": "doc2", "title": "Montreal", "text": "Montreal is the largest city in Quebec."},
    {"id": "doc3", "title": "Bordeaux", "text": "Bordeaux in Southwestern France."},
]

# BM25 retriever operating on character n-grams of the "title" and "text" fields.
retriever = retrieve.BM25(
    key="id",
    on=["title", "text"],
    count_vectorizer=sparse.CountVectorizer(
        normalize=True, ngram_range=(3, 5), analyzer="char_wb", stop_words=[]
    ),
    k1=1.5,
    b=0.75,
    epsilon=0.0,
)

model = models.ColBERT(
    model_name_or_path="raphaelsty/neural-cherche-colbert",
    device="cuda" if torch.cuda.is_available() else "cpu",  # or "mps"
)

ranker = rank.ColBERT(
    key="id",
    on=["title", "text"],
    model=model,
)

documents_embeddings = retriever.encode_documents(
    documents=documents,
)

retriever.add(
    documents_embeddings=documents_embeddings,
)
```

Now we can retrieve documents using the fine-tuned model:
```python
queries = ["Paris", "Montreal", "Bordeaux"]

queries_embeddings = retriever.encode_queries(
    queries=queries,
)

ranker_queries_embeddings = ranker.encode_queries(
    queries=queries,
)

candidates = retriever(
    queries_embeddings=queries_embeddings,
    batch_size=32,
    k=100,  # number of documents to retrieve
)

# Compute embeddings of the candidates with the ranker model.
# Note, we could also pre-compute all the embeddings.
ranker_documents_embeddings = ranker.encode_candidates_documents(
    candidates=candidates,
    documents=documents,
    batch_size=32,
)

scores = ranker(
    queries_embeddings=ranker_queries_embeddings,
    documents_embeddings=ranker_documents_embeddings,
    documents=candidates,
    batch_size=32,
)
```
`scores` now contains, for each query, the candidates ordered by similarity:

```python
[[{'id': 0, 'similarity': 22.825355529785156},
  {'id': 1, 'similarity': 11.201947212219238},
  {'id': 2, 'similarity': 10.748161315917969}],
 [{'id': 1, 'similarity': 23.21628189086914},
  {'id': 0, 'similarity': 9.9658203125},
  {'id': 2, 'similarity': 7.308732509613037}],
 [{'id': 1, 'similarity': 6.4031805992126465},
  {'id': 0, 'similarity': 5.601611137390137},
  {'id': 2, 'similarity': 5.599479675292969}]]
```

Neural-Cherche provides SparseEmbed, Splade, TfIdf, and BM25 retrievers, as well as a ColBERT ranker that can re-order a retriever's output. Please refer to the documentation for more information.
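As an illustration, the BM25 retriever above could be swapped for another retriever class. Here is a minimal sketch, assuming `retrieve.TfIdf` exposes the same `key` / `on` / `encode_documents` / `add` interface as `retrieve.BM25`; see the documentation for the exact constructor arguments:

```python
from neural_cherche import retrieve

documents = [
    {"id": "doc1", "title": "Paris", "text": "Paris is the capital of France."},
]

# Assumption: retrieve.TfIdf follows the same interface as retrieve.BM25.
retriever = retrieve.TfIdf(
    key="id",
    on=["title", "text"],
)

# Index the documents, then retrieve candidates for a query.
retriever.add(
    documents_embeddings=retriever.encode_documents(documents=documents),
)

candidates = retriever(
    queries_embeddings=retriever.encode_queries(queries=["Paris"]),
    k=100,
)
```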
We provide pre-trained checkpoints designed specifically for Neural-Cherche: raphaelsty/neural-cherche-sparse-embed and raphaelsty/neural-cherche-colbert. These checkpoints were fine-tuned on a subset of the MS MARCO dataset and will benefit from further fine-tuning on your specific dataset. You can fine-tune ColBERT from any Sentence Transformers pre-trained checkpoint to fit your specific language; for Splade and SparseEmbed, you should fine-tune from an MLM-based checkpoint.
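For example, to adapt ColBERT to a new language or domain, the fine-tuning loop shown earlier can be initialized from any Sentence Transformers checkpoint. A minimal sketch, where the checkpoint name is only an example:

```python
import torch

from neural_cherche import models

# Any Sentence Transformers checkpoint can serve as a starting point for
# ColBERT fine-tuning; "sentence-transformers/all-mpnet-base-v2" is an example.
model = models.ColBERT(
    model_name_or_path="sentence-transformers/all-mpnet-base-v2",
    device="cuda" if torch.cuda.is_available() else "cpu",
)
```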
SciFact dataset:

| Model | HuggingFace Checkpoint | NDCG@10 | Hits@10 | Hits@1 |
|---|---|---|---|---|
| TfIdf | - | 0.62 | 0.86 | 0.50 |
| BM25 | - | 0.69 | 0.92 | 0.56 |
| SparseEmbed | raphaelsty/neural-cherche-sparse-embed | 0.62 | 0.87 | 0.48 |
| Sentence Transformer | sentence-transformers/all-mpnet-base-v2 | 0.66 | 0.89 | 0.53 |
| ColBERT | raphaelsty/neural-cherche-colbert | 0.70 | 0.92 | 0.58 |
| TfIdf retriever + ColBERT ranker | raphaelsty/neural-cherche-colbert | 0.71 | 0.94 | 0.59 |
| BM25 retriever + ColBERT ranker | raphaelsty/neural-cherche-colbert | 0.72 | 0.95 | 0.59 |
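Scores like those above can be computed with the evaluation extra installed via `pip install "neural-cherche[eval]"`. A hypothetical sketch, assuming a `utils.evaluate` helper that compares ranked scores against relevance judgments (qrels); the exact signature may differ, so refer to the documentation:

```python
from neural_cherche import utils

# Hypothetical relevance judgments mapping each query to its relevant documents.
qrels = {
    "Paris": {"doc1": 1},
    "Montreal": {"doc2": 1},
    "Bordeaux": {"doc3": 1},
}

# Assumption: utils.evaluate scores the ranker output against the qrels.
metrics = utils.evaluate(
    scores=scores,  # output of the ranker call shown earlier
    qrels=qrels,
    queries=["Paris", "Montreal", "Bordeaux"],
    metrics=["ndcg@10", "hits@10", "hits@1"],
)

print(metrics)
```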
SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking, by Thibault Formal, Benjamin Piwowarski, Stéphane Clinchant, SIGIR 2021.
SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval, by Thibault Formal, Carlos Lassance, Benjamin Piwowarski, Stéphane Clinchant, SIGIR 2022.
SparseEmbed: Learning Sparse Lexical Representations with Contextual Embeddings for Retrieval, by Weize Kong, Jeffrey M. Dudek, Cheng Li, Mingyang Zhang, Mike Bendersky, SIGIR 2023.
ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT, by Omar Khattab, Matei Zaharia, SIGIR 2020.
This Python library is licensed under the MIT open-source license. The Splade model is licensed by its authors for non-commercial use only; SparseEmbed and ColBERT are fully open source, including for commercial use.