Zero- and few-shot named entity and relationship recognition

Documentation: https://ibm.github.io/zshot
Source code: https://github.com/ibm/zshot
Paper: https://aclanthology.org/2023.acl-demo.34/

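Zshot is a highly customisable framework for performing zero- and few-shot named entity recognition.
It can be used to perform mention extraction, entity linking, relation extraction, and knowledge extraction.
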
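Requirements:

- Python 3.6+
- spacy - Zshot relies on spaCy for pipelining and visualization.
- torch - PyTorch is required to run PyTorch models.
- transformers - Required for pre-trained language models.
- evaluate - Required for evaluation.
- datasets - Required to evaluate over datasets (e.g. OntoNotes).

Optional dependencies:

- flair - Required if you want to use the Flair mentions extractor, and for the TARS linker and TARS mentions extractor.
- blink - Required if you want to use BLINK for linking to Wikipedia pages.
- gliner - Required if you want to use the GLiNER linker or the GLiNER mentions extractor.

Installation:

    $ pip install zshot
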
Examples:

| Example | Notebook |
|---|---|
| Installation and Visualization | |
| Knowledge Extractor | |
| Wikification | |
| Custom Components | |
| Evaluation | |

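Zshot contains two different components, the mentions extractor and the linker.

The mentions extractor detects the possible entities (a.k.a. mentions), which are then linked to a data source (e.g. Wikidata).

Currently, 7 different mentions extractors are supported: SMXM, TARS, GLiNER, two based on spaCy, and two based on Flair. The two versions for spaCy and Flair are similar: one is based on Named Entity Recognition and Classification (NERC), and the other is based on linguistics (i.e. it uses part-of-speech (POS) tagging and dependency parsing (DP)).

The NERC approach uses NERC models to detect all the entities that have to be linked. It depends on the model being used and on the entities that model was trained on, so depending on the use case and the target entities it may not be the best approach: an entity that the NERC model does not recognize will not be linked.

The linguistic approach relies on the idea that mentions are usually syntagmas or nouns. It therefore detects nouns that are part of a syntagma and that act as objects, subjects, and so on. This approach does not depend on the model (although its performance does): a noun in a text is always a noun, regardless of the dataset the model was trained on.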
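
The linguistic approach can be illustrated with a minimal, dependency-free sketch. It assumes the tokens already carry coarse POS tags (hand-written here; a real pipeline would get them from a tagger such as spaCy's) and simply groups consecutive noun-like tokens into candidate mentions. The function name `noun_mentions` and the tag set are illustrative assumptions, not Zshot's implementation, which also relies on dependency parsing.

```python
from typing import List, Tuple

# Hypothetical, hand-tagged input: (token, coarse POS tag) pairs.
TaggedToken = Tuple[str, str]

# Tags treated as "noun-like" in this sketch (an assumption).
NOUN_TAGS = {"NOUN", "PROPN"}


def noun_mentions(tagged: List[TaggedToken]) -> List[str]:
    """Group consecutive noun-like tokens into candidate mentions."""
    mentions: List[str] = []
    current: List[str] = []
    for token, pos in tagged:
        if pos in NOUN_TAGS:
            current.append(token)
        elif current:
            # A non-noun token closes the mention being built.
            mentions.append(" ".join(current))
            current = []
    if current:
        mentions.append(" ".join(current))
    return mentions


sentence = [
    ("Acetamide", "PROPN"), ("is", "AUX"), ("an", "DET"),
    ("organic", "ADJ"), ("compound", "NOUN"), ("used", "VERB"),
    ("as", "ADP"), ("an", "DET"), ("industrial", "ADJ"),
    ("solvent", "NOUN"), (".", "PUNCT"),
]
print(noun_mentions(sentence))  # ['Acetamide', 'compound', 'solvent']
```

Because the grouping only looks at grammatical categories, it proposes "Acetamide" as a mention even if no NERC model was ever trained on chemical names, which is the point made above.
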
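The linker links the detected entities to an existing set of labels. Some linkers, however, are end-to-end, i.e. they do not need a mentions extractor, because they detect and link the entities at the same time.

There are currently 5 linkers available: 3 of them are end-to-end and 2 are not.
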
| Linker Name | End-to-end | Source Code | Paper |
|---|---|---|---|
| Blink | x | Source Code | Paper |
| GENRE | x | Source Code | Paper |
| SMXM | ✓ | Source Code | Paper |
| TARS | ✓ | Source Code | Paper |
| GLiNER | ✓ | Source Code | Paper |

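
To make the linking step concrete, here is a toy sketch that is not Zshot's actual method: it links a mention's context to the label whose description shares the most words with it (Jaccard overlap). The linkers listed above rely on pretrained language models instead, and the helper names `words` and `link` are hypothetical. What the sketch shares with the real setting is that the label set is just a collection of names and descriptions supplied at inference time.

```python
import re
from typing import Dict, Optional, Set


def words(text: str) -> Set[str]:
    """Lowercased alphanumeric word set of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def link(context: str, entities: Dict[str, str]) -> Optional[str]:
    """Return the entity whose description overlaps most with the context.

    Word overlap (Jaccard) stands in for the learned similarity that a
    pretrained model would provide. Returns None when nothing overlaps.
    """
    ctx = words(context)
    best_name, best_score = None, 0.0
    for name, description in entities.items():
        desc = words(description)
        union = ctx | desc
        score = len(ctx & desc) / len(union) if union else 0.0
        if score > best_score:
            best_name, best_score = name, score
    return best_name


# Labels and descriptions are given at inference time; no training step.
entities = {
    "Florida": "southeasternmost U.S. state",
    "Acetic Acid": "an acidic, colourless liquid and organic compound",
}
print(link("a colourless organic liquid", entities))  # Acetic Acid
```
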
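The relations extractor extracts relations among the different entities previously extracted by a linker.

Currently, only one relations extractor is available.

The knowledge extractor performs the extraction and classification of named entities and the extraction of the relations among them at the same time. A pipeline with this component does not need a mentions extractor, linker, or relations extractor to work.

Currently, only one knowledge extractor is available:
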
- KnowGL

Install the requirements:

    pip install -r requirements.txt

Download the spaCy pipeline used in the example:

    python -m spacy download en_core_web_sm

Create a file main.py with the pipeline configuration and the entity definitions (Wikipedia abstracts are usually a good starting point for descriptions):

    import spacy
    from zshot import PipelineConfig, displacy
    from zshot.linker import LinkerRegen
    from zshot.mentions_extractor import MentionsExtractorSpacy
    from zshot.utils.data_models import Entity

    nlp = spacy.load("en_core_web_sm")

    # Zshot configuration: mentions extractor, linker, and the target entities
    nlp_config = PipelineConfig(
        mentions_extractor=MentionsExtractorSpacy(),
        linker=LinkerRegen(),
        entities=[
            Entity(name="Paris",
                   description="Paris is located in northern central France, in a north-bending arc of the river Seine"),
            Entity(name="IBM",
                   description="International Business Machines Corporation (IBM) is an American multinational technology corporation headquartered in Armonk, New York"),
            Entity(name="New York", description="New York is a city in U.S. state"),
            Entity(name="Florida", description="southeasternmost U.S. state"),
            Entity(name="American",
                   description="American, something of, from, or related to the United States of America, commonly known as the United States or America"),
            Entity(name="Chemical formula",
                   description="In chemistry, a chemical formula is a way of presenting information about the chemical proportions of atoms that constitute a particular chemical compound or molecule"),
            Entity(name="Acetamide",
                   description="Acetamide (systematic name: ethanamide) is an organic compound with the formula CH3CONH2. It is the simplest amide derived from acetic acid. It finds some use as a plasticizer and as an industrial solvent."),
            Entity(name="Armonk",
                   description="Armonk is a hamlet and census-designated place (CDP) in the town of North Castle, located in Westchester County, New York, United States."),
            Entity(name="Acetic Acid",
                   description="Acetic acid, systematically named ethanoic acid, is an acidic, colourless liquid and organic compound with the chemical formula CH3COOH"),
            Entity(name="Industrial solvent",
                   description="Acetamide (systematic name: ethanamide) is an organic compound with the formula CH3CONH2. It is the simplest amide derived from acetic acid. It finds some use as a plasticizer and as an industrial solvent."),
        ]
    )
    nlp.add_pipe("zshot", config=nlp_config, last=True)

    # Parentheses join the two string literals into a single text
    text = ("International Business Machines Corporation (IBM) is an American multinational technology corporation"
            " headquartered in Armonk, New York, with operations in over 171 countries.")

    doc = nlp(text)
    displacy.serve(doc, style="ent")

Run it with:

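
    $ python main.py
    Using the 'ent' visualizer
    Serving on http://0.0.0.0:5000 ...

The script annotates the text using Zshot and uses displaCy to visualize the annotations.

Open your browser at http://127.0.0.1:5000.
You will see the annotated sentence.
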
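If you want to implement your own mentions extractor or linker and use it with Zshot, you can. To make it easier to implement a new component, base classes are provided that you extend with your own code.

It is as simple as creating a new class that extends the base class (MentionsExtractor or Linker). You have to implement the predict method, which receives the spaCy documents and returns a list of zshot.utils.data_models.Span for each document.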
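
The shape of the value returned by predict can be sketched without any dependencies. The example below is a stand-in built on stated assumptions: it takes plain strings rather than spaCy documents, and `CharSpan` is a hypothetical substitute for zshot.utils.data_models.Span that only stores start and end character offsets. It returns one list of spans per input, which is the structure described above.

```python
import re
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class CharSpan:
    """Hypothetical stand-in for a span: start and end character offsets."""
    start: int
    end: int


def predict(texts: Iterable[str]) -> List[List[CharSpan]]:
    """Return, for each text, the spans of words starting with a capital letter."""
    return [
        [CharSpan(m.start(), m.end()) for m in re.finditer(r"\b[A-Z]\w*", text)]
        for text in texts
    ]


batch = ["IBM is based in Armonk.", "no mentions here"]
for text, spans in zip(batch, predict(batch)):
    # Slice the original text with the offsets to recover each mention.
    print([text[s.start:s.end] for s in spans])
```

The outer list has one entry per input, and an input without mentions yields an empty inner list rather than being dropped.
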

This is a simple mentions extractor that extracts as mentions all words that contain the letter s:

    from typing import Iterable

    import spacy
    from spacy.tokens import Doc
    from zshot import PipelineConfig
    from zshot.utils.data_models import Span
    from zshot.mentions_extractor import MentionsExtractor


    class SimpleMentionExtractor(MentionsExtractor):
        def predict(self, docs: Iterable[Doc], batch_size=None):
            # One list of character-offset spans per document
            spans = [[Span(tok.idx, tok.idx + len(tok)) for tok in doc if "s" in tok.text]
                     for doc in docs]
            return spans


    new_nlp = spacy.load("en_core_web_sm")
    config = PipelineConfig(
        mentions_extractor=SimpleMentionExtractor()
    )
    new_nlp.add_pipe("zshot", config=config, last=True)

    # Parentheses join the two string literals into a single text
    text_acetamide = ("CH2O2 is a chemical compound similar to Acetamide used in International Business "
                      "Machines Corporation (IBM).")

    doc = new_nlp(text_acetamide)
    print(doc._.mentions)
    # >>> [is, similar, used, Business, Machines]

Evaluation is an important process for continually improving the performance of the models, which is why Zshot allows evaluating its components with two predefined datasets, OntoNotes and MedMentions, in a zero-shot version in which the entities of the test and validation splits do not appear in the train set.
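
The zero-shot property described above (no test or validation entity appears in the train set) amounts to a disjointness check between label sets. A minimal sketch, using made-up labels rather than the real OntoNotes or MedMentions splits, and a hypothetical helper `is_zero_shot`:

```python
from typing import Dict, Set


def is_zero_shot(splits: Dict[str, Set[str]], train: str = "train") -> bool:
    """Return True if no other split shares a label with the training split."""
    train_labels = splits[train]
    return all(
        labels.isdisjoint(train_labels)
        for name, labels in splits.items()
        if name != train
    )


# Made-up label sets, for illustration only.
splits = {
    "train": {"person", "company", "city"},
    "validation": {"river", "language"},
    "test": {"chemical", "event"},
}
print(is_zero_shot(splits))  # True: no evaluation label was seen in training

leaky = dict(splits, test={"chemical", "city"})
print(is_zero_shot(leaky))  # False: "city" also occurs in the training split
```
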

The evaluation package contains all the functionality for evaluating Zshot components. The main function is zshot.evaluation.zshot_evaluate.evaluate, which takes as input the spaCy nlp model and the dataset to evaluate, and returns a str containing the evaluation results. For example, evaluating the TARS linker in Zshot on the OntoNotes validation set looks like this:

    import spacy
    from zshot import PipelineConfig
    from zshot.linker import LinkerTARS
    from zshot.evaluation.dataset import load_ontonotes_zs
    from zshot.evaluation.zshot_evaluate import evaluate, prettify_evaluate_report
    from zshot.evaluation.metrics.seqeval.seqeval import Seqeval

    # Zero-shot version of the OntoNotes validation split
    ontonotes_zs = load_ontonotes_zs('validation')

    nlp = spacy.blank("en")
    nlp_config = PipelineConfig(
        linker=LinkerTARS(),
        entities=ontonotes_zs.entities
    )
    nlp.add_pipe("zshot", config=nlp_config, last=True)

    evaluation = evaluate(nlp, ontonotes_zs, metric=Seqeval())
    prettify_evaluate_report(evaluation)

@inproceedings{picco-etal-2023-zshot,
title = "Zshot: An Open-source Framework for Zero-Shot Named Entity Recognition and Relation Extraction",
author = "Picco, Gabriele and
Martinez Galindo, Marcos and
Purpura, Alberto and
Fuchs, Leopold and
Lopez, Vanessa and
Hoang, Thanh Lam",
booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)",
month = jul,
year = "2023",
address = "Toronto, Canada",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.acl-demo.34",
doi = "10.18653/v1/2023.acl-demo.34",
pages = "357--368",
abstract = "The Zero-Shot Learning (ZSL) task pertains to the identification of entities or relations in texts that were not seen during training. ZSL has emerged as a critical research area due to the scarcity of labeled data in specific domains, and its applications have grown significantly in recent years. With the advent of large pretrained language models, several novel methods have been proposed, resulting in substantial improvements in ZSL performance. There is a growing demand, both in the research community and industry, for a comprehensive ZSL framework that facilitates the development and accessibility of the latest methods and pretrained models. In this study, we propose a novel ZSL framework called Zshot that aims to address the aforementioned challenges. Our primary objective is to provide a platform that allows researchers to compare different state-of-the-art ZSL methods with standard benchmark datasets. Additionally, we have designed our framework to support the industry with readily available APIs for production under the standard SpaCy NLP pipeline. Our API is extendible and evaluable, moreover, we include numerous enhancements such as boosting the accuracy with pipeline ensembling and visualization utilities available as a SpaCy extension.",
}