Descargar OAG BERT - Descargar el código fuente de OAG BERT

OAG BERT

Código Fuente de IA

1.0.0

Descargar

Biblioteca | Papel | Flojo

Lanzamos dos versiones de OAG-Bert en el paquete COGDL (debido a la expiración de la descarga anterior del disco de la nube Tsinghua, ahora descargue modelos manualmente de ModelsCope). Oag-Bert es un modelo de lenguaje académico académico de entidad heterogéneo que no solo comprende los textos académicos sino también el conocimiento de la entidad heterogénea en OAG. ¡Únase a nuestro grupo Slack o Google para cualquier comentario y solicitud! Nuestro papel está aquí.

V1: la versión de vainilla

Una versión básica oag-bert. Similar a Scibert, pre-entrenamos el modelo Bert en el corpus de texto académico en un gráfico académico abierto, incluidos títulos de papel, resúmenes y cuerpos.

El uso de OAG-Bert es el mismo de Sciber o Bert ordinarios. Por ejemplo, puede usar el siguiente código para codificar dos secuencias de texto y recuperar sus salidas

 from cogdl . oag import oagbert

tokenizer , bert_model = oagbert ()

sequence = [ "CogDL is developed by KEG, Tsinghua." , "OAGBert is developed by KEG, Tsinghua." ]
tokens = tokenizer ( sequence , return_tensors = "pt" , padding = True )
outputs = bert_model ( ** tokens )

V2: la versión aumentada de la entidad

Una extensión al vainilla oag-bert. Incorporamos información de entidad rica en gráficos académicos abiertos, como autores y campo de estudio . Por lo tanto, puede codificar varios tipos de entidades en OAG-Bert V2. Por ejemplo, para codificar el documento de Bert, puede usar el siguiente código

 from cogdl . oag import oagbert
import torch

tokenizer , model = oagbert ( "oagbert-v2" )
title = 'BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding'
abstract = 'We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation...'
authors = [ 'Jacob Devlin' , 'Ming-Wei Chang' , 'Kenton Lee' , 'Kristina Toutanova' ]
venue = 'north american chapter of the association for computational linguistics'
affiliations = [ 'Google' ]
concepts = [ 'language model' , 'natural language inference' , 'question answering' ]
# build model inputs
input_ids , input_masks , token_type_ids , masked_lm_labels , position_ids , position_ids_second , masked_positions , num_spans = model . build_inputs (
    title = title , abstract = abstract , venue = venue , authors = authors , concepts = concepts , affiliations = affiliations
)
# run forward
sequence_output , pooled_output = model . bert . forward (
    input_ids = torch . LongTensor ( input_ids ). unsqueeze ( 0 ),
    token_type_ids = torch . LongTensor ( token_type_ids ). unsqueeze ( 0 ),
    attention_mask = torch . LongTensor ( input_masks ). unsqueeze ( 0 ),
    output_all_encoded_layers = False ,
    checkpoint_activations = False ,
    position_ids = torch . LongTensor ( position_ids ). unsqueeze ( 0 ),
    position_ids_second = torch . LongTensor ( position_ids_second ). unsqueeze ( 0 )
)

También puede usar algunas funciones integradas para usar OAG-Bert V2 directamente, como usar decode_beamsearch para generar entidades basadas en el contexto existente. Por ejemplo, para generar conceptos con 2 tokens para el papel Bert, ejecute el siguiente código

 model . eval ()
candidates = model . decode_beamsearch (
    title = title ,
    abstract = abstract ,
    venue = venue ,
    authors = authors ,
    affiliations = affiliations ,
    decode_span_type = 'FOS' ,
    decode_span_length = 2 ,
    beam_width = 8 ,
    force_forward = False
)

Oag-Bert supera a otros modelos de lenguaje académico en una amplia gama de tareas conscientes de la entidad, mientras que mantiene su rendimiento en las tareas ordinarias de la PNL.

Más allá de

También lanzamos otra versión V2 para usuarios.

Uno es una versión basada en generación que se puede utilizar para generar textos basados en otra información. Por ejemplo, use el siguiente código para generar automáticamente títulos en papel con resúmenes.

 from cogdl . oag import oagbert

tokenizer , model = oagbert ( 'oagbert-v2-lm' )
model . eval ()

for seq , prob in model . generate_title ( abstract = "To enrich language models with domain knowledge is crucial but difficult. Based on the world's largest public academic graph Open Academic Graph (OAG), we pre-train an academic language model, namely OAG-BERT, which integrates massive heterogeneous entities including paper, author, concept, venue, and affiliation. To better endow OAG-BERT with the ability to capture entity information, we develop novel pre-training strategies including heterogeneous entity type embedding, entity-aware 2D positional encoding, and span-aware entity masking. For zero-shot inference, we design a special decoding strategy to allow OAG-BERT to generate entity names from scratch. We evaluate the OAG-BERT on various downstream academic tasks, including NLP benchmarks, zero-shot entity inference, heterogeneous graph link prediction, and author name disambiguation. Results demonstrate the effectiveness of the proposed pre-training approach to both comprehending academic texts and modeling knowledge from heterogeneous entities. OAG-BERT has been deployed to multiple real-world applications, such as reviewer recommendations for NSFC (National Nature Science Foundation of China) and paper tagging in the AMiner system. It is also available to the public through the CogDL package." ):
    print ( 'Title: %s' % seq )
    print ( 'Perplexity: %.4f' % prob )
# One of our generations: "pre-training oag-bert: an academic language model for enriching academic texts with domain knowledge"

Además de eso, ajustamos el oag-bert para calcular la similitud en papel basada en las tareas de desambiguación de nombre, que se nombra como oración-oagbert después de la oración-bert. Los siguientes códigos demuestran un ejemplo de uso de la oración-oagbert para calcular la similitud del papel.

 import os
from cogdl . oag import oagbert
import torch
import torch . nn . functional as F
import numpy as np


# load time
tokenizer , model = oagbert ( "oagbert-v2-sim" )
model . eval ()

# Paper 1
title = 'BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding'
abstract = 'We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation...'
authors = [ 'Jacob Devlin' , 'Ming-Wei Chang' , 'Kenton Lee' , 'Kristina Toutanova' ]
venue = 'north american chapter of the association for computational linguistics'
affiliations = [ 'Google' ]
concepts = [ 'language model' , 'natural language inference' , 'question answering' ]

# encode first paper
input_ids , input_masks , token_type_ids , masked_lm_labels , position_ids , position_ids_second , masked_positions , num_spans = model . build_inputs (
    title = title , abstract = abstract , venue = venue , authors = authors , concepts = concepts , affiliations = affiliations
)
_ , paper_embed_1 = model . bert . forward (
    input_ids = torch . LongTensor ( input_ids ). unsqueeze ( 0 ),
    token_type_ids = torch . LongTensor ( token_type_ids ). unsqueeze ( 0 ),
    attention_mask = torch . LongTensor ( input_masks ). unsqueeze ( 0 ),
    output_all_encoded_layers = False ,
    checkpoint_activations = False ,
    position_ids = torch . LongTensor ( position_ids ). unsqueeze ( 0 ),
    position_ids_second = torch . LongTensor ( position_ids_second ). unsqueeze ( 0 )
)

# Positive Paper 2
title = 'Attention Is All You Need'
abstract = 'We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely...'
authors = [ 'Ashish Vaswani' , 'Noam Shazeer' , 'Niki Parmar' , 'Jakob Uszkoreit' ]
venue = 'neural information processing systems'
affiliations = [ 'Google' ]
concepts = [ 'machine translation' , 'computation and language' , 'language model' ]

input_ids , input_masks , token_type_ids , masked_lm_labels , position_ids , position_ids_second , masked_positions , num_spans = model . build_inputs (
    title = title , abstract = abstract , venue = venue , authors = authors , concepts = concepts , affiliations = affiliations
)
# encode second paper
_ , paper_embed_2 = model . bert . forward (
    input_ids = torch . LongTensor ( input_ids ). unsqueeze ( 0 ),
    token_type_ids = torch . LongTensor ( token_type_ids ). unsqueeze ( 0 ),
    attention_mask = torch . LongTensor ( input_masks ). unsqueeze ( 0 ),
    output_all_encoded_layers = False ,
    checkpoint_activations = False ,
    position_ids = torch . LongTensor ( position_ids ). unsqueeze ( 0 ),
    position_ids_second = torch . LongTensor ( position_ids_second ). unsqueeze ( 0 )
)

# Negative Paper 3
title = "Traceability and international comparison of ultraviolet irradiance"
abstract = "NIM took part in the CIPM Key Comparison of ″Spectral Irradiance 250 to 2500 nm″. In UV and NIR wavelength, the international comparison results showed that the consistency between Chinese value and the international reference one"
authors =  [ 'Jing Yu' , 'Bo Huang' , 'Jia-Lin Yu' , 'Yan-Dong Lin' , 'Cai-Hong Dai' ]
veune = 'Jiliang Xuebao/Acta Metrologica Sinica'
affiliations = [ 'Department of Electronic Engineering' ]
concept = [ 'Optical Division' ]

input_ids , input_masks , token_type_ids , masked_lm_labels , position_ids , position_ids_second , masked_positions , num_spans = model . build_inputs (
    title = title , abstract = abstract , venue = venue , authors = authors , concepts = concepts , affiliations = affiliations
)
# encode thrid paper
_ , paper_embed_3 = model . bert . forward (
    input_ids = torch . LongTensor ( input_ids ). unsqueeze ( 0 ),
    token_type_ids = torch . LongTensor ( token_type_ids ). unsqueeze ( 0 ),
    attention_mask = torch . LongTensor ( input_masks ). unsqueeze ( 0 ),
    output_all_encoded_layers = False ,
    checkpoint_activations = False ,
    position_ids = torch . LongTensor ( position_ids ). unsqueeze ( 0 ),
    position_ids_second = torch . LongTensor ( position_ids_second ). unsqueeze ( 0 )
)

# calulate text similarity
# normalize
paper_embed_1 = F . normalize ( paper_embed_1 , p = 2 , dim = 1 )
paper_embed_2 = F . normalize ( paper_embed_2 , p = 2 , dim = 1 )
paper_embed_3 = F . normalize ( paper_embed_3 , p = 2 , dim = 1 )

# cosine sim.
sim12 = torch . mm ( paper_embed_1 , paper_embed_2 . transpose ( 0 , 1 ))
sim13 = torch . mm ( paper_embed_1 , paper_embed_3 . transpose ( 0 , 1 ))
print ( sim12 , sim13 )

Este ajuste de fino se realizó en las tareas de desambiguación del nombre. Los documentos escritos por los mismos autores se tratan como pares positivos y los descansos como pares negativos. Muestra 0.4m pares positivos y 1,6 millones de pares negativos y utilizamos el aprendizaje construido para ajustar el oag-bert (versión 2). Para el 50% de instancias solo usamos el título en papel, mientras que el otro 50% usa toda la información heterogénea. Evaluamos el rendimiento utilizando un rango recíproco medio donde los valores más altos indican mejores resultados. El rendimiento en los conjuntos de pruebas se muestra a continuación.

	Oagbert-V2	Oagbert-V2-SIM
Título	0.349	0.725
Título+Resumen+Autor+AFF+Lugar	0.355	0.789

Para obtener más detalles, consulte Ejemplos/oagbert_metainfo.py en Cogdl.

Versión china

También entrenamos al chino Oagbert para su uso. El modelo fue priorizado en un corpus que incluye 44 millones de metadatos en papel chinos que incluyen título, resumen, autores, afiliaciones, lugares, palabras clave y fondos . El nuevo fondo de entidad se extiende más allá de las entidades utilizadas en la versión en inglés. Además, el Oagbert chino está entrenado con el tokenizador de oraciones. Estas son las dos diferencias principales entre el inglés Oagbert y el chino Oagbert.

Los ejemplos de usar el chino original Oagbert y la oración-oagbert se pueden encontrar en ejemplos/oagbert/oagbert_metainfo_zh.py y ejemplos/oagbert/oagbert_metainfo_zh_sim.py. De manera similar a la oración inglesa-oagbert, la oración china-oagbert está ajustada en las tareas de desambiguación de nombre para calcular la similitud de incrustación de papel. El rendimiento se muestra a continuación. Recomendamos a los usuarios que usen directamente esta versión si las tareas aguas abajo no tienen suficientes datos para ajustar.

	oagbert-v2-zh	oagbert-v2-zh-sim
Título	0.337	0.619
Título+Resumen	0.314	0.682

Citar

Si encuentra que es útil, cíquanos en su trabajo:

 @article{xiao2021oag,
  title={OAG-BERT: Pre-train Heterogeneous Entity-augmented Academic Language Model},
  author={Liu, Xiao and Yin, Da and Zhang, Xingjian and Su, Kai and Wu, Kan and Yang, Hongxia and Tang, Jie},
  journal={arXiv preprint arXiv:2103.02410},
  year={2021}
}
@inproceedings{zhang2019oag,
  title={OAG: Toward Linking Large-scale Heterogeneous Entity Graphs.},
  author={Zhang, Fanjin and Liu, Xiao and Tang, Jie and Dong, Yuxiao and Yao, Peiran and Zhang, Jie and Gu, Xiaotao and Wang, Yan and Shao, Bin and Li, Rui and Wang, Kuansan},
  booktitle={Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’19)},
  year={2019}
}
@article{chen2020conna,
  title={CONNA: Addressing Name Disambiguation on The Fly},
  author={Chen, Bo and Zhang, Jing and Tang, Jie and Cai, Lingfan and Wang, Zhaoyu and Zhao, Shu and Chen, Hong and Li, Cuiping},
  journal={IEEE Transactions on Knowledge and Data Engineering},
  year={2020},
  publisher={IEEE}
}

Expandir

Información adicional

Versión 1.0.0
Tipo Código Fuente de IA
Fecha de actualización 2025-09-11
tamaño 373.98KB
Proviene de Github

Aplicaciones relacionadas

GitHub sgrebnov/cordova plugin background download

2024-11-05
Wa ch ull navra maza navsacha 2 2024 ull ovie Fr e Online On Strea ings

2024-11-03
Wa ch navra maza navsacha 2 2024 ull ovie Online For Fr e Strea ings At Home

2024-11-03
Wa ch the greatest of all time 2024 ull ovie Online For Fr e Strea ings At Home

2024-11-02
wolfs 2024 f llmo ie f lmyz lla dow load ree 7 0p 4 0p a d 10 0p

2024-11-01
GitHub the via/releases

2024-11-01

Recomendado para ti

chat.petals.dev

Otro código fuente

1.0.0
GPT Prompt Templates

Otro código fuente

1.0.0
GPTyped

Otro código fuente

GPTyped 1.0.5
ML stack

Código Fuente de IA

1.0.0
awesome free chatgpt

Código Fuente de IA

1.0.0
pywin_contextmenu

Código Fuente de IA

Version update
Google Dorks

Otro código fuente

1.0
shepherd

Otro código fuente

v6.1.6-react-shepherd: Prepare Release (#3063)
mongo express

Otro código fuente

v1.1.0-rc-3

Información relacionada Todo