gpl下载 - gpl源代码下载

生成伪标签（GPL）

GPL是一种无监督的域适应方法，用于训练致密猎犬。它基于与功能强大的跨编码器的查询产生和伪标记。要训练由域适应的模型，它只需要未标记的目标语料库，并且可以对零射击模型实现显着改进。

有关更多信息，请查看我们的出版物：

GPL：无监督的域名适应量的生成伪标记（NAACL 2022）

要复制，请参阅此快照分支。

安装

一个可以通过pip安装GPL

pip install gpl

或通过git clone

git clone https://github.com/UKPLab/gpl.git && cd gpl
pip install -e .

同时，请确保根据您的CUDA版本安装了Pytorch的正确版本。

用法

GPL接受Beir-Format中的数据。例如，我们可以下载Beir托管的FIQA数据集：

wget https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/fiqa.zip
unzip fiqa.zip
head -n 2 fiqa/corpus.jsonl  # One can check this data format. Actually GPL only need this `corpus.jsonl` as data input for training.

然后，我们可以使用python -m功能直接运行GPL培训：

 export dataset= " fiqa "
python -m gpl.train 
    --path_to_generated_data " generated/ $dataset " 
    --base_ckpt " distilbert-base-uncased " 
    --gpl_score_function " dot " 
    --batch_size_gpl 32 
    --gpl_steps 140000 
    --new_size -1 
    --queries_per_passage -1 
    --output_dir " output/ $dataset " 
    --evaluation_data " ./ $dataset " 
    --evaluation_output " evaluation/ $dataset " 
    --generator " BeIR/query-gen-msmarco-t5-base-v1 " 
    --retrievers " msmarco-distilbert-base-v3 " " msmarco-MiniLM-L-6-v3 " 
    --retriever_score_functions " cos_sim " " cos_sim " 
    --cross_encoder " cross-encoder/ms-marco-MiniLM-L-6-v2 " 
    --qgen_prefix " qgen " 
    --do_evaluation 
    # --use_amp   # Use this for efficient training if the machine supports AMP

# One can run `python -m gpl.train --help` for the information of all the arguments
# To reproduce the experiments in the paper, set `base_ckpt` to "GPL/msmarco-distilbert-margin-mse" (https://huggingface.co/GPL/msmarco-distilbert-margin-mse)

或在Python脚本中导入GPL的火车方法：

 import gpl

dataset = 'fiqa'
gpl . train (
    path_to_generated_data = f"generated/ { dataset } " ,
    base_ckpt = "distilbert-base-uncased" ,  
    # base_ckpt='GPL/msmarco-distilbert-margin-mse',  
    # The starting checkpoint of the experiments in the paper
    gpl_score_function = "dot" ,
    # Note that GPL uses MarginMSE loss, which works with dot-product
    batch_size_gpl = 32 ,
    gpl_steps = 140000 ,
    new_size = - 1 ,
    # Resize the corpus to `new_size` (|corpus|) if needed. When set to None (by default), the |corpus| will be the full size. When set to -1, the |corpus| will be set automatically: If QPP * |corpus| <= 250K, |corpus| will be the full size; else QPP will be set 3 and |corpus| will be set to 250K / 3
    queries_per_passage = - 1 ,
    # Number of Queries Per Passage (QPP) in the query generation step. When set to -1 (by default), the QPP will be chosen automatically: If QPP * |corpus| <= 250K, then QPP will be set to 250K / |corpus|; else QPP will be set 3 and |corpus| will be set to 250K / 3
    output_dir = f"output/ { dataset } " ,
    evaluation_data = f"./ { dataset } " ,
    evaluation_output = f"evaluation/ { dataset } " ,
    generator = "BeIR/query-gen-msmarco-t5-base-v1" ,
    retrievers = [ "msmarco-distilbert-base-v3" , "msmarco-MiniLM-L-6-v3" ],
    retriever_score_functions = [ "cos_sim" , "cos_sim" ],
    # Note that these two retriever model work with cosine-similarity
    cross_encoder = "cross-encoder/ms-marco-MiniLM-L-6-v2" ,
    qgen_prefix = "qgen" ,
    # This prefix will appear as part of the (folder/file) names for query-generation results: For example, we will have "qgen-qrels/" and "qgen-queries.jsonl" by default.
    do_evaluation = True ,
    # use_amp=True   # One can use this flag for enabling the efficient float16 precision
)

还可以在Google Colab上参考此玩具示例，以更好地了解代码的工作原理。

GPL如何工作？

GPL的工作流如下：

GPL首先使用SEQ2SEQ（默认情况下，我们使用beir/Query-gen-MSMARCO-T5-BASE-V1）模型来生成无标记语料库中每个段落的queries_per_passage查询。查询对接对被视为训练的积极例子。
结果文件（在路径$path_to_generated_data下）：（1） ${qgen}-qrels/train.tsv ，（2） ${qgen}-queries.jsonl以及（3） corpus.jsonl （从$evaluation_data/ copus.jsonl（复制）
然后，它以产生的查询作为目标语料库的输入进行负挖掘。采矿段落将被视为培训的负面例子。一个人可以指定任何密集的检索器（Sbert或Huggingface/Transfersers检查点，我们使用MSMARCO-DISTILBERT-BASE-V3 + MSMARCO-MINILM-L-6-V3（默认情况下）或BM25将其用于参数retrievers作为负矿工。
结果文件（在路径$path_to_generated_data下）：hard-negative.jsonl;
最后，它可以使用功能强大的交叉编码器（默认情况下使用交叉编码器/MS-Marco-Minilm-L-6-V2）进行伪标记。在查询配对上，我们到目前为止拥有的查询配对对（用于正面和负面示例）。
结果文件（在路径$path_to_generated_data下）： gpl-training-data.tsv 。它总共包含（ gpl_steps * batch_size_gpl ）元组。

到目前为止，我们已经准备好实际培训数据。可以查看示例数据/生成/fiqa，以获取有关数据格式的快速示例。最后一步是应用边缘损失，以教导学生猎犬模仿保证金分数，ce（查询，正） - CE（查询，负）由教师模型（Cross -nocoder，CE）标记。当然，GPL中包含边距步骤，将自动完成:)。请注意，Marginmse与DOT产品合作，因此使用GPL训练的最终模型可与DOT产品一起使用。

PS： --retrievers是用于负面采矿的。它们可以是在通用域中训练的任何密集的检索器（例如MS MARCO），并且不需要对目标任务/域进行强大。请参阅论文以获取更多详细信息（参见表7）。

定制数据

一个人还可以以相同名称时尚的方式替换/将任何中间步骤的自定义数据替换/放置在路径$path_to_generated_data下的任何中间步骤。 GPL将使用这些提供的数据跳过中间步骤。

作为典型的工作流程，可能只有（英语）Unlabeld语料库，并且希望一个好的模型为此语料库表现良好。要在这种情况下进行GPL培训，只需要以下步骤：

以与数据样本相同的格式准备您的语料库；
将您的corpus.jsonl放在一个文件夹下，例如，命名为“生成”，用于GPL的数据加载和数据生成；
用文件夹路径拨打gpl.Train作为输入参数：（其他参数照常工作）

python -m gpl.train 
    --path_to_generated_data " generated " 
    --output_dir " output " 
    --new_size -1 
    --queries_per_passage -1

预先训练的检查点和生成数据

预训练的检查点

现在，我们通过https://huggingface.co/gpl发布了预训练的GPL模型。当前有五种类型的模型：

GPL/${dataset}-msmarco-distilbert-gpl ：在${dataset}上的训练订单（1）ribgl的训练顺序的模型；
GPL/${dataset}-tsdae-msmarco-distilbert-gpl ：型号，培训顺序为（1）tsdae on ${dataset} - >（2）MSMARCO上的Marginmse在${dataset}上的MSMARCO->（3）GPL上的MASMARCO->（3）;
GPL/msmarco-distilbert-margin-mse ：在MSMARCO培训的模型；
GPL/${dataset}-tsdae-msmarco-distilbert-margin-mse ：$ {dataSet} - >（2）MSMARCO上的Marginmse;
GPL/${dataset}-distilbert-tas-b-gpl-self_miner ：从TAS-B模型开始，模型在目标语料库${dataset}上接受了基本模型本身作为负矿工的培训（此处称为“ self_miner”）。

实际上，模型1和2。在模型3和4。所有GPL型号均经过new_size和queries_per_passage的自动设置（通过将它们设置为-1 ）。这种自动设置可以在高效的同时保持性能。有关更多详细信息，请参阅论文中的第4.1节。

在这些模型中， GPL/${dataset}-distilbert-tas-b-gpl-self_miner Ones在Beir基准上的表现最好：

要使用实验中使用的相同软件包版本重现结果，请参阅conda环境文件，环境。

生成的数据

现在，我们发布了GPL纸实验中使用的生成数据：

在6个BEIR数据集上的主要实验的生成数据：https：//public.ukp.informatik.tu-darmstadt.de/kwang/gpl/gpl/gpl/generated-data/main/;
整个贝尔（Beir）数据集中的实验的生成数据：https：//public.ukp.informatik.tu-darmstadt.de/kwang/gpl/gpl/glpl/generated-data/beir。

请注意，只有在与原始正式当局注册后， bioasq ， robust04 ， trec-news和signal1m的4个数据集可用于。我们仅使用文件名corpus.doc_ids.txt发布这些语料库的文档ID。有关更多详细信息，请参考贝尔存储库。

引用

如果您使用该代码进行评估，请随时引用我们的出版物GPL：生成伪标签，以适应密集检索的无监督域：

 @article { wang2021gpl ,
    title = " GPL: Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval " ,
    author = " Kexin Wang and Nandan Thakur and Nils Reimers and Iryna Gurevych " , 
    journal = " arXiv preprint arXiv:2112.07577 " ,
    month = " 4 " ,
    year = " 2021 " ,
    url = " https://arxiv.org/abs/2112.07577 " ,
}