Zero- and few-shot named entity and relationship recognition

Documentation: https://ibm.github.io/zshot
Source code: https://github.com/ibm/zshot
Paper: https://aclanthology.org/2023.acl-demo.34/
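Zshot is a highly customisable framework for performing zero- and few-shot named entity recognition. It can also be used to recognise relations between the extracted entities.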
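Requirements:

- Python 3.6+
- `spacy` - Zshot relies on spaCy for pipelining and visualization.
- `torch` - PyTorch is required to run PyTorch models.
- `transformers` - Required for pre-trained language models.
- `evaluate` - Required for evaluation.
- `datasets` - Required to evaluate over datasets (e.g. OntoNotes).

Optional dependencies:

- `flair` - Required if you want to use the Flair mentions extractor, and for the TARS linker and TARS mentions extractor.
- `blink` - Required if you want to use Blink for linking to Wikipedia pages.
- `gliner` - Required if you want to use the GLiNER linker or the GLiNER mentions extractor.

Installation:

```shell
$ pip install zshot
```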
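Because Flair, Blink and GLiNER are only needed for specific components, it can be handy to check which of them are importable before configuring a pipeline. The following is a minimal sketch using only the standard library; `find_installed` is a hypothetical helper (not part of Zshot), and it assumes that each package's import name matches the name in the list above.

```python
from importlib.util import find_spec
from typing import Dict, Iterable

# Optional packages from the list above.
# Assumption: the import name is the same as the listed package name.
OPTIONAL_PACKAGES = ("flair", "blink", "gliner")


def find_installed(packages: Iterable[str]) -> Dict[str, bool]:
    """Map each top-level package name to whether it can be imported."""
    return {name: find_spec(name) is not None for name in packages}


# Report which optional dependencies are available in this environment.
for name, available in find_installed(OPTIONAL_PACKAGES).items():
    status = "installed" if available else "missing"
    print(f"{name}: {status}")
```

A component that depends on a missing package can then be skipped or swapped for another one before the pipeline is built.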
| Example | Notebook |
|---|---|
| Installation and Visualization | |
| Knowledge Extractor | |
| Wikification | |
| Custom Components | |
| Evaluation | |
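Zshot contains two different components, the mentions extractor and the linker.
The mentions extractor detects the possible entities (a.k.a. mentions), which are then linked to a data source (e.g. Wikidata) by the linker.
Currently, 7 different mentions extractors are supported: SMXM, TARS, GLiNER, 2 based on Flair, and 2 based on spaCy. The two versions for spaCy and Flair are similar: one is based on Named Entity Recognition and Classification (NERC), and the other is based on linguistics (i.e. using Part-of-Speech tagging (PoS) and Dependency Parsing (DP)).
The NERC approach uses NERC models to detect all the entities that have to be linked. This approach depends on the model being used and on the entities that model was trained on. Depending on the use case and the target entities, it may not be the best approach: if the NERC model does not recognise an entity, that entity will not be linked.
The linguistic approach relies on the idea that mentions are usually a syntagma (phrase) or a noun. It therefore detects nouns that are part of a syntagma and that act as objects, subjects, etc. The approach itself does not depend on the model (although its performance does): a noun in a text should always be a noun, regardless of the dataset the model was trained on.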
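The linguistic idea above can be illustrated without any model at all. The sketch below is a deliberately simplified stand-in, not Zshot's implementation: it takes tokens that are already paired with hand-written coarse PoS tags, ignores dependency parsing entirely, and merges runs of adjacent nouns into one candidate mention. The names `Token`, `NOUN_TAGS` and `noun_mentions` are hypothetical.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (text, coarse PoS tag), e.g. ("IBM", "PROPN")
NOUN_TAGS = {"NOUN", "PROPN"}  # tags treated as nouns in this sketch


def noun_mentions(tokens: List[Token]) -> List[str]:
    """Merge each run of adjacent noun tokens into one candidate mention."""
    mentions: List[str] = []
    current: List[str] = []
    for text, pos in tokens:
        if pos in NOUN_TAGS:
            current.append(text)
        elif current:
            # A non-noun token closes the run collected so far.
            mentions.append(" ".join(current))
            current = []
    if current:
        mentions.append(" ".join(current))
    return mentions


# Hand-tagged tokens; in a real pipeline the tags would come from a tagger.
tagged = [
    ("IBM", "PROPN"), ("is", "AUX"), ("headquartered", "VERB"),
    ("in", "ADP"), ("New", "PROPN"), ("York", "PROPN"), (".", "PUNCT"),
]
print(noun_mentions(tagged))  # ['IBM', 'New York']
```

Note that nothing here depends on which entity types a model was trained on, which is the point the linguistic approach makes.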
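The linker links the detected entities to an existing set of labels. Some linkers, however, are end-to-end: they do not need a mentions extractor, because they detect and link the entities at the same time.
There are currently 5 linkers available; 3 of them are end-to-end and 2 are not.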
| Linker Name | End-to-end | Source Code | Paper |
|---|---|---|---|
| Blink | x | Source Code | Paper |
| GENRE | x | Source Code | Paper |
| SMXM | ✓ | Source Code | Paper |
| TARS | ✓ | Source Code | Paper |
| GLINER | ✓ | Source Code | Paper |
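The practical consequence of the table is that only non-end-to-end linkers have to be paired with a mentions extractor. As a small sketch, the end-to-end flags can be copied into a lookup; `END_TO_END` and `needs_mentions_extractor` are hypothetical names used for illustration and are not part of the Zshot API.

```python
# End-to-end flags copied from the table above.
END_TO_END = {
    "Blink": False,
    "GENRE": False,
    "SMXM": True,
    "TARS": True,
    "GLINER": True,
}


def needs_mentions_extractor(linker_name: str) -> bool:
    """Return True if the linker must be combined with a mentions extractor."""
    try:
        return not END_TO_END[linker_name]
    except KeyError:
        raise ValueError(f"unknown linker: {linker_name!r}") from None


for name in END_TO_END:
    print(name, "needs a mentions extractor:", needs_mentions_extractor(name))
```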
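The relations extractor extracts relations among the different entities previously extracted by a linker.
Currently, only one relations extractor is available.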
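Since relations are extracted among entities that a linker has already found, one way to picture the input of a relations extractor is as the set of entity pairs in a document. The sketch below only illustrates that assumption: `LinkedEntity` and `candidate_pairs` are hypothetical names, and Zshot's relations extractor may select and score pairs differently.

```python
from itertools import permutations
from typing import List, Tuple

LinkedEntity = Tuple[str, str]  # (mention text, linked label)


def candidate_pairs(entities: List[LinkedEntity]) -> List[Tuple[LinkedEntity, LinkedEntity]]:
    """Enumerate ordered (head, tail) pairs of distinct linked entities."""
    return list(permutations(entities, 2))


linked = [("IBM", "IBM"), ("Armonk", "Armonk"), ("New York", "New York")]
for head, tail in candidate_pairs(linked):
    # A relation model would decide which relation, if any, holds for each pair.
    print(f"{head[0]} -> {tail[0]}")
```

Ordered pairs are used because many relations are directional: "IBM is headquartered in Armonk" is not the same statement with head and tail swapped.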
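The knowledge extractor performs, at the same time, the extraction and classification of named entities and the extraction of relations among them. A pipeline with this component does not need any mentions extractor, linker or relations extractor to work.
Currently, only one knowledge extractor is available:

- KnowGL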
To use Zshot:

- Install the requirements: `pip install -r requirements.txt`
- Download a spaCy pipeline: `python -m spacy download en_core_web_sm`
- Create a file `main.py` with the pipeline configuration and the entity definitions (Wikipedia abstracts are usually a good starting point for descriptions):

```python
import spacy

from zshot import PipelineConfig, displacy
from zshot.linker import LinkerRegen
from zshot.mentions_extractor import MentionsExtractorSpacy
from zshot.utils.data_models import Entity

nlp = spacy.load("en_core_web_sm")

# Each entity is described in natural language; the linker uses these descriptions.
nlp_config = PipelineConfig(
    mentions_extractor=MentionsExtractorSpacy(),
    linker=LinkerRegen(),
    entities=[
        Entity(name="Paris",
               description="Paris is located in northern central France, in a north-bending arc of the river Seine"),
        Entity(name="IBM",
               description="International Business Machines Corporation (IBM) is an American multinational technology corporation headquartered in Armonk, New York"),
        Entity(name="New York", description="New York is a city in U.S. state"),
        Entity(name="Florida", description="southeasternmost U.S. state"),
        Entity(name="American",
               description="American, something of, from, or related to the United States of America, commonly known as the United States or America"),
        Entity(name="Chemical formula",
               description="In chemistry, a chemical formula is a way of presenting information about the chemical proportions of atoms that constitute a particular chemical compound or molecule"),
        Entity(name="Acetamide",
               description="Acetamide (systematic name: ethanamide) is an organic compound with the formula CH3CONH2. It is the simplest amide derived from acetic acid. It finds some use as a plasticizer and as an industrial solvent."),
        Entity(name="Armonk",
               description="Armonk is a hamlet and census-designated place (CDP) in the town of North Castle, located in Westchester County, New York, United States."),
        Entity(name="Acetic Acid",
               description="Acetic acid, systematically named ethanoic acid, is an acidic, colourless liquid and organic compound with the chemical formula CH3COOH"),
        Entity(name="Industrial solvent",
               description="Acetamide (systematic name: ethanamide) is an organic compound with the formula CH3CONH2. It is the simplest amide derived from acetic acid. It finds some use as a plasticizer and as an industrial solvent."),
    ]
)
nlp.add_pipe("zshot", config=nlp_config, last=True)

# Parentheses join the two string literals into a single sentence.
text = ("International Business Machines Corporation (IBM) is an American multinational technology corporation"
        " headquartered in Armonk, New York, with operations in over 171 countries.")

doc = nlp(text)
displacy.serve(doc, style="ent")
```

Run it:
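```shell
$ python main.py
Using the 'ent' visualizer
Serving on http://0.0.0.0:5000 ...
```

The script annotates the text using Zshot and uses displaCy to visualise the annotations.
Open your browser at http://127.0.0.1:5000 to see the annotated sentence.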
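The entity definitions in the example are plain name/description pairs, so they can be kept outside the script and loaded at start-up. The following is a minimal sketch under that assumption; the JSON layout and the helper `load_entity_specs` are hypothetical and not part of Zshot. It uses only the standard library and returns dictionaries whose keys match the `name` and `description` keyword arguments that `Entity` is given in `main.py`.

```python
import json
from typing import Dict, List

REQUIRED_KEYS = {"name", "description"}


def load_entity_specs(json_text: str) -> List[Dict[str, str]]:
    """Parse a JSON array of {"name": ..., "description": ...} objects."""
    specs = json.loads(json_text)
    if not isinstance(specs, list):
        raise ValueError("expected a JSON array of entity objects")
    cleaned = []
    for index, spec in enumerate(specs):
        if not isinstance(spec, dict):
            raise ValueError(f"entity #{index} is not a JSON object")
        missing = REQUIRED_KEYS - spec.keys()
        if missing:
            raise ValueError(f"entity #{index} is missing {sorted(missing)}")
        cleaned.append({key: str(spec[key]).strip() for key in sorted(REQUIRED_KEYS)})
    return cleaned


EXAMPLE_JSON = """
[
  {"name": "Florida", "description": "southeasternmost U.S. state"},
  {"name": "New York", "description": "New York is a city in U.S. state"}
]
"""

specs = load_entity_specs(EXAMPLE_JSON)
# With Zshot installed, each spec could then be turned into an entity:
#   entities = [Entity(**spec) for spec in specs]
print([spec["name"] for spec in specs])  # ['Florida', 'New York']
```

Validating the file up front gives a clear error for a missing description, rather than a failure later inside the pipeline.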
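If you want to implement your own mentions extractor or linker and use it with Zshot, you can. To make it easier to implement a new component, some base classes are provided that you extend with your own code.
It is as simple as creating a new class that extends the base class (`MentionsExtractor` or `Linker`). You have to implement the `predict` method, which receives the spaCy documents and returns a list of `zshot.utils.data_models.Span` for each document.
This is a simple mentions extractor that extracts as mentions all words that contain the letter s: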
```python
from typing import Iterable

import spacy
from spacy.tokens import Doc

from zshot import PipelineConfig
from zshot.utils.data_models import Span
from zshot.mentions_extractor import MentionsExtractor


class SimpleMentionExtractor(MentionsExtractor):
    def predict(self, docs: Iterable[Doc], batch_size=None):
        # One list of spans per document; each span holds the character
        # offsets (start, end) of a token containing the letter "s".
        spans = [[Span(tok.idx, tok.idx + len(tok)) for tok in doc if "s" in tok.text]
                 for doc in docs]
        return spans


new_nlp = spacy.load("en_core_web_sm")

config = PipelineConfig(
    mentions_extractor=SimpleMentionExtractor()
)
new_nlp.add_pipe("zshot", config=config, last=True)

# Parentheses join the two string literals into a single sentence.
text_acetamide = ("CH2O2 is a chemical compound similar to Acetamide used in International Business "
                  "Machines Corporation (IBM).")

doc = new_nlp(text_acetamide)
print(doc._.mentions)
# >>> [is, similar, used, Business, Machines]
```

Evaluation is an important process for continuously improving a model's performance. For that reason, Zshot allows its components to be evaluated on two predefined datasets, OntoNotes and MedMentions, in a zero-shot version in which the entities of the test and validation splits do not appear in the train set.
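The zero-shot property described above amounts to the label sets of the splits being disjoint. A minimal sketch of checking that property follows; `shared_labels` is a hypothetical helper, and the label lists are made up for illustration rather than taken from the actual OntoNotes or MedMentions splits.

```python
from typing import Iterable, Set


def shared_labels(train_labels: Iterable[str], eval_labels: Iterable[str]) -> Set[str]:
    """Return labels that occur in both splits (empty for a zero-shot split)."""
    return set(train_labels) & set(eval_labels)


# Made-up label sets for illustration only.
train = ["PERSON", "ORG", "GPE"]
validation = ["FACILITY", "EVENT"]
leaky_validation = ["EVENT", "ORG"]

print(shared_labels(train, validation))        # set()
print(shared_labels(train, leaky_validation))  # {'ORG'}
```

An empty result means every label in the evaluation split is unseen during training, which is what makes the evaluation zero-shot.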
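The package `evaluation` contains all the functionality needed to evaluate the Zshot components. The main function is `zshot.evaluation.zshot_evaluate.evaluate`, which takes as input the spaCy `nlp` model and the dataset to evaluate on. It returns a `str` containing the results of the evaluation. For example, the evaluation of the TARS linker in Zshot on the OntoNotes validation set would be: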
```python
import spacy

from zshot import PipelineConfig
from zshot.linker import LinkerTARS
from zshot.evaluation.dataset import load_ontonotes_zs
from zshot.evaluation.zshot_evaluate import evaluate, prettify_evaluate_report
from zshot.evaluation.metrics.seqeval.seqeval import Seqeval

# Zero-shot version of the OntoNotes validation split
ontonotes_zs = load_ontonotes_zs('validation')

# TARS is an end-to-end linker, so a blank pipeline without a mentions extractor is enough
nlp = spacy.blank("en")
nlp_config = PipelineConfig(
    linker=LinkerTARS(),
    entities=ontonotes_zs.entities
)

nlp.add_pipe("zshot", config=nlp_config, last=True)

evaluation = evaluate(nlp, ontonotes_zs, metric=Seqeval())
prettify_evaluate_report(evaluation)
```

Citation:

```bibtex
@inproceedings{picco-etal-2023-zshot,
    title = "Zshot: An Open-source Framework for Zero-Shot Named Entity Recognition and Relation Extraction",
    author = "Picco, Gabriele  and
      Martinez Galindo, Marcos  and
      Purpura, Alberto  and
      Fuchs, Leopold  and
      Lopez, Vanessa  and
      Hoang, Thanh Lam",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-demo.34",
    doi = "10.18653/v1/2023.acl-demo.34",
    pages = "357--368",
    abstract = "The Zero-Shot Learning (ZSL) task pertains to the identification of entities or relations in texts that were not seen during training. ZSL has emerged as a critical research area due to the scarcity of labeled data in specific domains, and its applications have grown significantly in recent years. With the advent of large pretrained language models, several novel methods have been proposed, resulting in substantial improvements in ZSL performance. There is a growing demand, both in the research community and industry, for a comprehensive ZSL framework that facilitates the development and accessibility of the latest methods and pretrained models. In this study, we propose a novel ZSL framework called Zshot that aims to address the aforementioned challenges. Our primary objective is to provide a platform that allows researchers to compare different state-of-the-art ZSL methods with standard benchmark datasets. Additionally, we have designed our framework to support the industry with readily available APIs for production under the standard SpaCy NLP pipeline. Our API is extendible and evaluable, moreover, we include numerous enhancements such as boosting the accuracy with pipeline ensembling and visualization utilities available as a SpaCy extension.",
}
```