SapBERT: Self-Alignment Pretraining for BERT

[News | 22 Aug 2021] SapBERT has been integrated into NVIDIA's deep learning toolkit NeMo as its entity linking module (thank you, NVIDIA!). You can try it out in Google Colab.

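This repo holds (1) the code, data, and pretrained weights for the SapBERT model presented in our NAACL 2021 paper, Self-Alignment Pretraining for Biomedical Entity Representations; and (2) the cross-lingual SapBERT and a cross-lingual biomedical entity linking benchmark (XL-BEL) proposed in our ACL 2021 paper, Learning Domain-Specialised Representations for Cross-Lingual Biomedical Entity Linking.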

Hugging Face Models

English models: [SapBERT] and [SapBERT-mean-token]

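Standard SapBERT as described in [Liu et al., NAACL 2021]. Trained on UMLS 2020AA (English only), using microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext as the base model. For [SapBERT], use the [CLS] token (before the pooler) as the representation of the input; for [SapBERT-mean-token], use mean pooling across all tokens.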
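
The difference between the two pooling strategies can be sketched with NumPy on a toy array standing in for the model's last hidden states. The helper names `cls_pool` and `mean_pool` are hypothetical, and excluding padded positions from the average is an assumption of this sketch; the description above does not say how padding is handled.

```python
import numpy as np

def cls_pool(hidden):
    """Return the first-token ([CLS]) vector of every sequence."""
    return hidden[:, 0, :]

def mean_pool(hidden, attention_mask):
    """Average token vectors per sequence, skipping padded positions (assumption)."""
    mask = attention_mask[:, :, None].astype(hidden.dtype)
    summed = (hidden * mask).sum(axis=1)
    # guard against division by zero for an all-padding row
    counts = np.clip(mask.sum(axis=1), 1.0, None)
    return summed / counts

# toy batch: 2 sequences, 3 token positions, hidden size 2
hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
                   [[2.0, 0.0], [4.0, 2.0], [9.0, 9.0]]])
# the last position of the second sequence is padding
attention_mask = np.array([[1, 1, 1],
                           [1, 1, 0]])

cls_vecs = cls_pool(hidden)                    # shape (2, 2)
mean_vecs = mean_pool(hidden, attention_mask)  # shape (2, 2)
```

With a real model, `hidden` would be the first element of the model output and `attention_mask` would come from the tokenizer.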

Cross-lingual models: [SapBERT-XLMR] and [SapBERT-XLMR-large]

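Cross-lingual SapBERT as described in [Liu et al., ACL 2021]. Trained on UMLS 2020AB (all languages), using xlm-roberta-base / xlm-roberta-large as the base model. Use the [CLS] token (before the pooler) as the representation of the input.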

Environment

The code is tested with Python 3.8, torch 1.7.0, and Hugging Face transformers 4.4.2. Please see requirements.txt for more details.

Embedding Extraction with SapBERT

The following script converts a list of strings (entity names) into embeddings.

import numpy as np
import torch
from tqdm.auto import tqdm
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("cambridgeltl/SapBERT-from-PubMedBERT-fulltext")
model = AutoModel.from_pretrained("cambridgeltl/SapBERT-from-PubMedBERT-fulltext").cuda()

# replace with your own list of entity names
all_names = ["covid-19", "Coronavirus infection", "high fever", "Tumor of posterior wall of oropharynx"]

bs = 128  # batch size during inference
all_embs = []
for i in tqdm(np.arange(0, len(all_names), bs)):
    # tokenize one batch, padding/truncating every name to 25 tokens
    toks = tokenizer.batch_encode_plus(all_names[i:i + bs],
                                       padding="max_length",
                                       max_length=25,
                                       truncation=True,
                                       return_tensors="pt")
    # move every input tensor to the GPU
    toks_cuda = {}
    for k, v in toks.items():
        toks_cuda[k] = v.cuda()
    cls_rep = model(**toks_cuda)[0][:, 0, :]  # use CLS representation as the embedding
    all_embs.append(cls_rep.cpu().detach().numpy())

all_embs = np.concatenate(all_embs, axis=0)

For more extensive inference examples, see inference/inference_on_snomed.ipynb.
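
Once names are embedded, a common next step is to link a query mention to its closest dictionary entry. The following is a minimal sketch on toy vectors; the function name `nearest_names`, the toy data, and the choice of cosine similarity are all assumptions made for illustration, not a description of the repo's inference code.

```python
import numpy as np

def nearest_names(query_embs, dict_embs, dict_names, top_k=1):
    """Return the top_k dictionary names by cosine similarity for each query row."""
    # L2-normalise rows so that the dot product equals cosine similarity
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    d = dict_embs / np.linalg.norm(dict_embs, axis=1, keepdims=True)
    sims = q @ d.T
    # sort each row by descending similarity and keep the first top_k indices
    order = np.argsort(-sims, axis=1)[:, :top_k]
    return [[dict_names[j] for j in row] for row in order]

# toy 2-d vectors standing in for real embeddings
dict_names = ["fever", "cough", "headache"]
dict_embs = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
query_embs = np.array([[0.9, 0.1], [0.2, 0.8]])

matches = nearest_names(query_embs, dict_embs, dict_names)
```

In practice `dict_embs` and `query_embs` would both be arrays produced by an embedding loop like the one above, computed once for the dictionary and once for the mentions to be linked.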

Train SapBERT

Extract the training data from UMLS as instructed in training_data/generate_pretraining_data.ipynb (we cannot release the training file directly due to licensing issues).

Run:

 >> cd train/
>> ./pretrain.sh 0,1

where 0,1 specifies the GPU devices.

To fine-tune on your own dataset, generate data in the format

 concept_id || entity_name_1 || entity_name_2
...

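where entity_name_1 and entity_name_2 are a synonym pair (both belonging to the same concept, concept_id) sampled from a given labelled dataset. If one concept is associated with multiple entity names in the dataset, you can traverse all pairwise combinations.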
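
Enumerating all pairwise combinations per concept can be sketched as follows. The function `make_training_lines`, the concept IDs, and the example names are hypothetical; the separator is copied from the format shown above, and the exact whitespace the training script expects around `||` is an assumption that should be checked against the repo.

```python
from itertools import combinations

def make_training_lines(concept_to_names, sep=" || "):
    """Emit one 'concept_id || name_1 || name_2' line per unordered synonym pair."""
    lines = []
    for concept_id, names in concept_to_names.items():
        # drop duplicate names while keeping their original order
        unique = list(dict.fromkeys(names))
        for name_1, name_2 in combinations(unique, 2):
            lines.append(sep.join([concept_id, name_1, name_2]))
    return lines

# made-up concept IDs and synonyms, for illustration only
concept_to_names = {
    "C001": ["high fever", "hyperpyrexia", "very high temperature"],
    "C002": ["covid-19"],  # a single name yields no pairs
}

lines = make_training_lines(concept_to_names)
for line in lines:
    print(line)
```

A concept with n distinct names contributes n(n-1)/2 lines, so very large synonym sets may need to be subsampled.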

For cross-lingual SAP-tuning with general-domain parallel data (MUSE, Wiki titles, or both), the data can be found in training_data/general_domain_parallel_data/. An example script: train/xling_train.sh.

Evaluate SapBERT

For evaluation (both monolingual and cross-lingual), please see evaluation/README.md for details. evaluation/xl_bel/ contains the XL-BEL benchmark proposed in [Liu et al., ACL 2021].

Citations

SapBERT:

@inproceedings{liu2021self,
	title={Self-Alignment Pretraining for Biomedical Entity Representations},
	author={Liu, Fangyu and Shareghi, Ehsan and Meng, Zaiqiao and Basaldella, Marco and Collier, Nigel},
	booktitle={Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
	pages={4228--4238},
	month=jun,
	year={2021}
}

Cross-lingual SapBERT and XL-BEL:

@inproceedings{liu2021learning,
	title={Learning Domain-Specialised Representations for Cross-Lingual Biomedical Entity Linking},
	author={Liu, Fangyu and Vuli{\'c}, Ivan and Korhonen, Anna and Collier, Nigel},
	booktitle={Proceedings of ACL-IJCNLP 2021},
	pages={565--574},
	month=aug,
	year={2021}
}