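Python implementation of term extraction algorithms such as C-Value, Basic, Combo Basic, Weirdness and Term Extractor, using spaCy POS tagging.
NEW: Documentation can be found at https://kevinlu1248.github.io/pyate/. So far the documentation is still missing two algorithms and details about the TermExtraction class, but I will have it done soon.
NEW: Try the algorithms out at https://pyate-demo.herokuapp.com/, a web app for demonstrating PyATE!
NEW: spaCy v3 is supported! For spaCy v2, use pyate==0.4.3 and view the spaCy v2 README.md file.
If you have a suggestion for another ATE algorithm you would like implemented in this package, feel free to file it as an issue along with the paper the algorithm is based on.
For ATE packages implemented in Scala and Java, see ATR4S and JATE, respectively.
Using pip: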
pip install pyate
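spacy download en_core_web_sm
To get started, simply call one of the implemented algorithms. According to Astrakhantsev 2016, combo_basic is the most precise of the five algorithms, though basic and cvalues are not far behind (see Precision). The same study shows that PU-ATR and KeyConceptRel have higher precision than combo_basic, but they are not implemented here, and PU-ATR takes significantly more time since it uses machine learning.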
from pyate import combo_basic
# source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1994795/
string = """Central to the development of cancer are genetic changes that endow these “cancer cells” with many of the
hallmarks of cancer, such as self-sufficient growth and resistance to anti-growth and pro-death signals. However, while the
genetic changes that occur within cancer cells themselves, such as activated oncogenes or dysfunctional tumor suppressors,
are responsible for many aspects of cancer development, they are not sufficient. Tumor promotion and progression are
dependent on ancillary processes provided by cells of the tumor environment but that are not necessarily cancerous
themselves. Inflammation has long been associated with the development of cancer. This review will discuss the reflexive
relationship between cancer and inflammation with particular focus on how considering the role of inflammation in physiologic
processes such as the maintenance of tissue homeostasis and repair may provide a logical framework for understanding the U
connection between the inflammatory response and cancer."""
print(combo_basic(string).sort_values(ascending=False))
""" (Output)
dysfunctional tumor 1.443147
tumor suppressors 1.443147
genetic changes 1.386294
cancer cells 1.386294
dysfunctional tumor suppressors 1.298612
logical framework 0.693147
sufficient growth 0.693147
death signals 0.693147
many aspects 0.693147
inflammatory response 0.693147
tumor promotion 0.693147
ancillary processes 0.693147
tumor environment 0.693147
reflexive relationship 0.693147
particular focus 0.693147
physiologic processes 0.693147
tissue homeostasis 0.693147
cancer development 0.693147
dtype: float64
"""
If you would like to add this to a spaCy pipeline, simply add it using spaCy's add_pipe method.
import spacy
from pyate.term_extraction_pipeline import TermExtractionPipeline
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("combo_basic")
doc = nlp(string)
print(doc._.combo_basic.sort_values(ascending=False).head(5))
""" (Output)
dysfunctional tumor 1.443147
tumor suppressors 1.443147
genetic changes 1.386294
cancer cells 1.386294
dysfunctional tumor suppressors 1.298612
dtype: float64
"""
Also, TermExtractionPipeline.__init__ is defined as follows
__init__(
self,
func: Callable[..., pd.Series] = combo_basic,
*args,
**kwargs
)
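where func is essentially your term extraction algorithm: it takes in a corpus (either a string or an iterator of strings) and outputs a Pandas Series of term-value pairs of terms and their respective termhoods. By default, func is combo_basic. args and kwargs let you override the default values of the function, which you can find by running help (these might be documented later on).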
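As a rough illustration of the shape func is expected to have, here is a minimal sketch of a callable that accepts a string or an iterator of strings and returns a Pandas Series of term-value pairs. word_frequency_terms and its min_length parameter are hypothetical names made up for this example and are not part of PyATE; the scoring (raw word frequency) is only a toy stand-in for a real termhood measure.

```python
from collections import Counter
from typing import Iterable, Union

import pandas as pd


def word_frequency_terms(
    corpus: Union[str, Iterable[str]], min_length: int = 4
) -> pd.Series:
    """Hypothetical toy scorer: rank words by how often they occur.

    Not a PyATE algorithm; it only mirrors the input/output shape
    described above (corpus in, Series of term-value pairs out).
    """
    # Accept a single string as a one-document corpus.
    if isinstance(corpus, str):
        corpus = [corpus]
    counts = Counter(
        word
        for text in corpus
        for word in text.lower().split()
        if len(word) >= min_length  # drop very short words such as "and"
    )
    # Index = term, value = score, highest score first.
    return pd.Series(counts, dtype=float).sort_values(ascending=False)
```

For instance, word_frequency_terms("tumor cells and tumor growth") ranks tumor first because it occurs twice.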
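Each of cvalues, basic, combo_basic, weirdness and term_extractor takes in a string or an iterator of strings and outputs a Pandas Series of term-value pairs, where higher values indicate a higher chance of being a domain-specific term. Furthermore, weirdness and term_extractor take a general_corpus keyword argument, which must be an iterator of strings and defaults to the general corpus described below.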
All functions take only the string from which you would like to extract terms (the technical_corpus) as mandatory input, along with other tweakable settings, including general_corpus (the contrasting corpus for weirdness and term_extractor), general_corpus_size, verbose (whether to print a progress bar), weights, smoothing, have_single_word (whether a single word can count as a phrase) and threshold. If you have not read the papers and are unfamiliar with the algorithms, I recommend just using the default settings. Again, use help to find the details of each algorithm, since they are all different.
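To give an intuition for what the contrasting general_corpus is for, below is a minimal word-level sketch of the weirdness idea: a word scores high when it is relatively more frequent in the technical text than in general text. This is not PyATE's implementation, which works on candidates found through spaCy POS tagging; toy_weirdness and the add-one smoothing are assumptions made for this example.

```python
from collections import Counter


def toy_weirdness(technical: str, general: str) -> dict:
    """Hypothetical sketch: ratio of relative word frequencies.

    score(w) = (count_tech(w) / N_tech) / ((count_gen(w) + 1) / (N_gen + 1))
    The "+ 1" smoothing is an assumption of this sketch; it avoids
    dividing by zero for words absent from the general text.
    """
    tech = Counter(technical.lower().split())
    gen = Counter(general.lower().split())
    n_tech = sum(tech.values())
    n_gen = sum(gen.values())
    return {
        word: (count / n_tech) / ((gen[word] + 1) / (n_gen + 1))
        for word, count in tech.items()
    }


scores = toy_weirdness(
    "tumor cells and tumor growth",
    "the cat and the dog and the bird",
)
# "tumor" never occurs in the general text, "and" occurs there often,
# so "tumor" ends up with the higher score.
print(scores["tumor"] > scores["and"])  # True
```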
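Under path/to/site-packages/pyate/default_general_domain.en.zip, there is a CSV file containing a general corpus, specifically 3000 random sentences from Wikipedia. Its source can be found at https://www.kaggle.com/mikeortman/wikipedia-sentences. After installing pyate, access it using the following.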
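import os
import pandas as pd
import pyate
# Resolve the corpus path from the installed package itself (distutils is gone as of Python 3.12).
corpus_path = os.path.join(os.path.dirname(pyate.__file__), "default_general_domain.en.zip")
df = pd.read_csv(corpus_path)["SECTION_TEXT"]
print(df.head())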
""" (Output)
0 '''Anarchism''' is a political philosophy that...
1 The term ''anarchism'' is a compound word comp...
2 ===Origins=== n Woodcut from a Diggers document...
3 Portrait of philosopher Pierre-Joseph Proudhon...
4 consistent with anarchist values is a controve...
Name: SECTION_TEXT, dtype: object
"""
To switch languages, simply run Term_Extraction.set_language({language}, {model_name}), where model_name defaults to language. For example, use Term_Extraction.set_language("it", "it_core_news_sm") for Italian. By default, the language is English. So far, the list of supported languages is:
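To add more languages, file an issue with a corpus of at least 3000 paragraphs in your desired language (preferably named default_general_domain.{lang}.zip, where lang is the language code; code lists can be found at https://www.loc.gov/standards/iso639-2/php/code_list.php). The file should be in the following format so that it can be parsed by Pandas.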
,SECTION_TEXT
0,"{paragraph_0}"
1,"{paragraph_1}"
...
Alternatively, place the file in src/pyate and file a pull request.
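A file in that layout can be produced in several ways; the following is a minimal sketch using only the standard library. write_general_corpus is a hypothetical helper written for this example, and the name of the CSV stored inside the archive is an assumption (only the .zip name is specified above).

```python
import csv
import io
import os
import zipfile


def write_general_corpus(paragraphs, lang, directory="."):
    """Hypothetical helper: zip a CSV with an index column and SECTION_TEXT."""
    zip_path = os.path.join(directory, f"default_general_domain.{lang}.zip")
    inner_name = f"default_general_domain.{lang}.csv"  # assumed inner file name

    buffer = io.StringIO()
    buffer.write(",SECTION_TEXT\n")  # header row: unnamed index, then the text column
    writer = csv.writer(buffer, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
    for index, paragraph in enumerate(paragraphs):
        # Integers stay bare, strings are quoted and inner quotes are doubled.
        writer.writerow([index, paragraph])

    # A single CSV per archive, so that pandas can read the zip directly.
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(inner_name, buffer.getvalue())
    return zip_path
```

Calling write_general_corpus(paragraphs, "xx") with a list of strings writes default_general_domain.xx.zip to the current directory, where "xx" is a placeholder for the real language code.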
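Warning: The model only works with spaCy v2.
Though this package was originally intended for symbolic AI algorithms (non-machine learning), I realized that a spaCy model for term extraction can reach significantly higher performance, and thus decided to include the model here.
For a comparison with the symbolic AI algorithms, see Precision. Note that only the F-score, precision and recall were measured for the model, whereas AvP was measured for the algorithms, so comparing the metrics directly would not make sense.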
| URL | F-Score (%) | Precision (%) | Recall (%) |
|---|---|---|---|
| https://github.com/kevinlu1248/pyate/releases/download/v0.4.2/en_acl_terms_sm-2.0.4.tar.gz | 94.71 | 95.41 | 94.03 |
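This model was trained and evaluated on the ACL dataset, a computer science oriented dataset in which the terms are manually picked. It has not yet been tested on other fields, however.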
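To make the difference between the two kinds of metric concrete, here is a minimal sketch. The F-score is computed from the precision and recall of a set of predictions, while average precision is computed from a ranked list of candidates. The ranked list and gold set below are made-up illustration data, and normalizing AvP by the number of hits in the list is an assumption of this sketch; the exact definition used in Astrakhantsev 2016 may differ.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def average_precision(ranked_terms, gold_terms):
    """Mean of precision@k over the ranks k at which a gold term appears."""
    hits = 0
    precisions = []
    for k, term in enumerate(ranked_terms, start=1):
        if term in gold_terms:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0


# Reproduces the F-score column of the table from the other two columns.
model_f = f_score(95.41, 94.03)
print(round(model_f, 2))  # 94.71

# Made-up ranking: gold terms at ranks 1 and 3, a non-term at rank 2.
ranked = ["tumor suppressors", "many aspects", "cancer cells"]
gold = {"tumor suppressors", "cancer cells"}
avp = average_precision(ranked, gold)
print(round(avp, 4))  # 0.8333
```

The first number depends only on which items were predicted, the second on the order in which candidates were ranked, which is why the two are not interchangeable.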
This model does not come with PyATE. To install, run
pip install https://github.com/kevinlu1248/pyate/releases/download/v0.4.2/en_acl_terms_sm-2.0.4.tar.gz
To extract terms,
import spacy
nlp = spacy.load("en_acl_terms_sm")
doc = nlp("Hello world, I am a term extraction algorithm.")
print(doc.ents)
"""
(term extraction, algorithm)
"""
Here is the average precision (AvP), averaged over the seven distinct datasets tested in Astrakhantsev 2016.
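This project was planned as a tool to be connected to a Google Chrome extension that highlights and defines key terms the reader probably does not know. Furthermore, term extraction is an area without much focused research compared to other areas of NLP, and especially recently it has not been viewed as very practical because of the more general tool of NER tagging. However, modern NER tagging usually relies on some combination of memorized words and deep learning, which are heavy in both space and computation. Furthermore, to generalize an algorithm to recognize terms in the ever-growing areas of medical and AI research, a list of memorized words will not do.
Of the five implemented algorithms, none are expensive; in fact, the bottleneck in space allocation and computation comes from the spaCy model and spaCy POS tagging. This is because the algorithms rely mostly on POS patterns, word frequencies, and the existence of embedded term candidates. For example, the term candidate "breast cancer" implies that "malignant breast cancer" is probably not a term but simply a form of "breast cancer" that is "malignant" (implemented in C-Value).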
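How embedded candidates enter the score can be sketched with the textbook C-value formula, in which a candidate's frequency is reduced by the average frequency of the longer candidates that contain it, and the result is weighted by the log of its length in words. This is a minimal illustration under that assumption, not PyATE's implementation (which has additional settings such as smoothing and threshold); c_values, is_nested and the frequencies below are made up for the example.

```python
import math


def is_nested(short, long):
    """True if the word tuple `short` occurs contiguously inside `long`."""
    n = len(short)
    return any(long[i:i + n] == short for i in range(len(long) - n + 1))


def c_values(frequencies):
    """Textbook C-value for a dict of candidate phrase -> frequency.

    Single-word candidates score 0 here because log2(1) == 0.
    """
    words = {term: tuple(term.split()) for term in frequencies}
    scores = {}
    for term, freq in frequencies.items():
        # Longer candidates that contain this one as a sub-phrase.
        containers = [
            other
            for other in frequencies
            if len(words[other]) > len(words[term])
            and is_nested(words[term], words[other])
        ]
        if containers:
            # Subtract the average frequency of the containing candidates.
            freq = freq - sum(frequencies[o] for o in containers) / len(containers)
        scores[term] = math.log2(len(words[term])) * freq
    return scores


scores = c_values({"breast cancer": 10, "malignant breast cancer": 3})
print(scores["breast cancer"])  # 7.0
```

Here "breast cancer" is seen 10 times, 3 of them inside "malignant breast cancer", so its adjusted frequency is 7.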
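I cannot seem to find the original Basic and Combo Basic papers, but I found papers that reference them. "ATR4S: Toolkit with State-of-the-art Automatic Terms Recognition Methods in Scala" more or less summarizes everything, including the algorithms in this package.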
If you publish work that uses PyATE, please let me know at [email protected] and cite as:
Lu, Kevin. (2021, June 28). kevinlu1248/pyate: Python Automated Term Extraction (Version v0.5.3). Zenodo. http://doi.org/10.5281/zenodo.5039289
or equivalently with BibTeX:
@software{pyate,
title = {kevinlu1248/pyate: Python Automated Term Extraction},
author = {Lu, Kevin},
year = 2021,
month = {Jun},
publisher = {Zenodo},
doi = {10.5281/zenodo.5039289}
}
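This package was used in the paper "Unsupervised Technical Domain Terms Extraction using Term Extractor" (Dowlagar and Mamidi, 2021).
If my work helped you, please consider buying me a coffee at https://www.buymeacoffee.com/kevinlu1248.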