Author: Richard Paul Hudson, Explosion AI
Holmes is a Python 3 library (v3.6–v3.10) running on spaCy (v3.1–v3.3) that supports a number of use cases involving information extraction from English and German texts. In all use cases, the information extraction is based on analysing the semantic relationships expressed by the component parts of each sentence:
In the chatbot use case, the system is configured using one or more search phrases. Holmes then looks for structures whose meanings correspond to those of these search phrases within a searched document, which in this case corresponds to an individual snippet of text or speech entered by the end user. Within a match, each word with its own meaning (i.e. that does not merely fulfil a grammatical function) in the search phrase corresponds to one or more such words in the document. Both the fact that a search phrase was matched and any structured information the search phrase extracts can be used to drive the chatbot.
The structural extraction use case uses exactly the same structural matching technology as the chatbot use case, but searching takes place with respect to a pre-existing document or documents that are typically much longer than the snippets analysed in the chatbot use case, and the aim is to extract and store structured information. For example, a set of business articles could be searched to find all the places where one company is said to be planning to take over a second company. The identities of the companies concerned could then be stored in a database.
The topic matching use case aims to find passages in a document or documents whose meaning is close to that of another document, which takes on the role of the query document, or to that of a query phrase entered ad-hoc by the user. Holmes extracts a number of small phraselets from the query phrase or query document, matches the documents being searched against each phraselet, and conflates the results to find the most relevant passages within the documents. Because there is no strict requirement that every word with its own meaning in the query document match a specific word or words in the searched documents, more matches are found than in the structural extraction use case, but the matches do not contain structured information that can be used in subsequent processing. The topic matching use case is demonstrated by a website allowing searches within six novels by Charles Dickens (for English) and around 350 traditional stories (for German).
The supervised document classification use case uses training data to learn a classifier that assigns one or more classification labels to new documents based on what they are about. It classifies a new document by matching it against phraselets that were extracted from the training documents in the same way that phraselets are extracted from the query document in the topic matching use case. The technique is inspired by bag-of-words classification algorithms that use n-grams, but aims to derive n-grams whose component words are related semantically rather than merely being neighbours within the surface representation of a language.
In all four use cases, individual words are matched using a number of strategies. To work out whether two grammatical structures that contain individually matching words correspond logically and constitute a match, Holmes transforms the syntactic parse information provided by the spaCy library into semantic structures that allow texts to be compared using predicate logic. As a user of Holmes, you do not need to understand the intricacies of how this works, although there are some important tips about writing effective search phrases for the chatbot and structural extraction use cases that you should try and take on board.
Holmes aims to offer generalist solutions that can be used more or less out of the box with relatively little tuning or training and that are rapidly applicable to a wide range of use cases. At its core lies a logical, programmed, rule-based system that describes how syntactic representations in each language express semantic relationships. Although the supervised document classification use case does incorporate a neural network, and although the spaCy library upon which Holmes builds has itself been pre-trained using machine learning, the essentially rule-based nature of Holmes means that the chatbot, structural extraction and topic matching use cases can be put to use without any training at all, and that the supervised document classification use case typically requires relatively little training data. This is a great advantage because pre-labelled training data is simply not available for many real-world problems.
Holmes has a long and complex history, and it is thanks to the goodwill and openness of several companies that we are now able to release it under the MIT license. I, Richard Hudson, wrote the versions up to 3.0.0 while working at msg systems, an international software consultancy based near Munich. At the end of 2021, I changed employer and now work for Explosion, the maker of spaCy and Prodigy. Elements of the Holmes library are covered by a US patent that I wrote myself in the early 2000s while working at a startup called Definiens that has since been acquired by AstraZeneca. With the kind permission of both AstraZeneca and msg systems, I am now maintaining Holmes at Explosion and can offer it for the first time under a permissive license: anyone can now use Holmes under the terms of the MIT license without having to worry about the patent.
The library was originally developed at msg systems but is now maintained at Explosion AI. Please direct any new issues or discussions to the explosion repository.
If you do not already have Python 3 and pip on your machine, you will need to install them before installing Holmes.
Install Holmes using the following command:
Linux:
pip3 install holmes-extractor
Windows:
pip install holmes-extractor
To upgrade from a previous Holmes version, issue the following command, then reissue the commands below to download the spaCy and Coreferee models, to ensure you have the correct versions of them:
Linux:
pip3 install --upgrade holmes-extractor
Windows:
pip install --upgrade holmes-extractor
If you wish to use the examples and tests, clone the source code with:
git clone https://github.com/explosion/holmes-extractor
If you wish to experiment with changing the source code, you can override the installed code by starting Python (type python3 (Linux) or python (Windows)) in the parent directory of the holmes_extractor module directory containing your changed code. If you have checked Holmes out of Git, this will be the holmes-extractor directory.
If you wish to uninstall Holmes again, you can do so by removing the installed files directly from the file system. You can find out where the files are located by issuing the following from a Python command prompt started in any directory other than the parent directory of holmes_extractor:
import holmes_extractor
print(holmes_extractor.__file__)
The spaCy and Coreferee libraries upon which Holmes builds require language-specific models that have to be downloaded separately before Holmes can be used:
Linux/English:
python3 -m spacy download en_core_web_trf
python3 -m spacy download en_core_web_lg
python3 -m coreferee install en
Linux/German:
pip3 install spacy-lookups-data # (from spaCy 3.3 onwards)
python3 -m spacy download de_core_news_lg
python3 -m coreferee install de
Windows/English:
python -m spacy download en_core_web_trf
python -m spacy download en_core_web_lg
python -m coreferee install en
Windows/German:
pip install spacy-lookups-data # (from spaCy 3.3 onwards)
python -m spacy download de_core_news_lg
python -m coreferee install de
If you plan to run the regression tests:
Linux:
python3 -m spacy download en_core_web_sm
Windows:
python -m spacy download en_core_web_sm
You specify the spaCy model for Holmes to use when you instantiate the Manager facade class. en_core_web_trf and de_core_news_lg are the models that have been found to yield the best results for English and German respectively. Because en_core_web_trf does not have its own word vectors and Holmes requires word vectors for embedding-based matching, en_core_web_lg is loaded as a vector source whenever en_core_web_trf is specified to the Manager class as the main model.
The en_core_web_trf model requires considerably more resources than the other models; in situations where resources are scarce, using en_core_web_lg as the main model instead may be a sensible compromise.
The best way of integrating Holmes into a non-Python environment is to wrap it as a RESTful HTTP service and deploy it as a microservice. See here for an example.
Because Holmes performs complex, intelligent analysis, it inevitably requires more hardware resources than more traditional search frameworks. The use cases that involve loading documents (structural extraction and topic matching) are most immediately applicable to large-but-not-huge corpora (e.g. all documents belonging to a certain organisation, all patents on a certain topic, all books by a certain author). For cost reasons, Holmes would not be an appropriate tool with which to analyse the contents of the entire internet!
That said, Holmes is both vertically and horizontally scalable. Given sufficient hardware, both use cases can be applied to an essentially unlimited number of documents by running Holmes on multiple machines, processing a different set of documents on each machine and conflating the results. Note that this strategy is already employed to distribute matching between multiple cores on a single machine: the Manager class starts a number of worker processes and distributes registered documents among them.
Holmes holds loaded documents in memory, which is in line with its intended use with large-but-not-huge corpora. If the operating system has to swap memory pages out to secondary storage, the performance of document loading, structural extraction and topic matching will all degrade, as Holmes can need to access memory from a variety of pages when processing a single sentence. This means it is important to provision enough RAM on each machine to hold all the loaded documents.
Note the comments above about the relative resource requirements of the different models.
The chatbot use case is the easiest use case with which to gain a quick basic understanding of how Holmes works.
Here, one or more search phrases are defined to Holmes, and the searched documents are short sentences or paragraphs typed in interactively by an end user. In a real-life setting, the extracted information would be used to determine the flow of interaction with the end user. For testing and demonstration purposes, a console is provided that displays the matches it finds interactively. It can be started up rapidly from the Python command line (started with python3 (Linux) or python (Windows)) or from within a Jupyter notebook (itself started from the operating system prompt).
The following code snippet can be entered line by line into the Python command line, into a Jupyter notebook or into an IDE. It registers the fact that you are interested in sentences about big dogs chasing cats and starts a demonstration chatbot console:
English:
import holmes_extractor as holmes
holmes_manager = holmes.Manager(model='en_core_web_lg', number_of_workers=1)
holmes_manager.register_search_phrase('A big dog chases a cat')
holmes_manager.start_chatbot_mode_console()
German:
import holmes_extractor as holmes
holmes_manager = holmes.Manager(model='de_core_news_lg', number_of_workers=1)
holmes_manager.register_search_phrase('Ein großer Hund jagt eine Katze')
holmes_manager.start_chatbot_mode_console()
If you now enter a sentence that corresponds to the search phrase, the console will display a match:
English:
Ready for input
A big dog chased a cat
Matched search phrase with text 'A big dog chases a cat':
'big'->'big' (Matches BIG directly); 'A big dog'->'dog' (Matches DOG directly); 'chased'->'chase' (Matches CHASE directly); 'a cat'->'cat' (Matches CAT directly)
German:
Ready for input
Ein großer Hund jagte eine Katze
Matched search phrase 'Ein großer Hund jagt eine Katze':
'großer'->'groß' (Matches GROSS directly); 'Ein großer Hund'->'hund' (Matches HUND directly); 'jagte'->'jagen' (Matches JAGEN directly); 'eine Katze'->'katze' (Matches KATZE directly)
This could easily have been achieved with a simple matching algorithm, so type in some more complex sentences to convince yourself that Holmes is really grasping them and that matches are still returned:
English:
The big dog would not stop chasing the cat
The big dog who was tired chased the cat
The cat was chased by the big dog
The cat always used to be chased by the big dog
The big dog was going to chase the cat
The big dog decided to chase the cat
The cat was afraid of being chased by the big dog
I saw a cat-chasing big dog
The cat the big dog chased was scared
The big dog chasing the cat was a problem
There was a big dog that was chasing a cat
The cat chase by the big dog
There was a big dog and it was chasing a cat.
I saw a big dog. My cat was afraid of being chased by the dog.
There was a big dog. His name was Fido. He was chasing my cat.
A dog appeared. It was chasing a cat. It was very big.
The cat sneaked back into our lounge because a big dog had been chasing her.
Our big dog was excited because he had been chasing a cat.
German:
Der große Hund hat die Katze ständig gejagt
Der große Hund, der müde war, jagte die Katze
Die Katze wurde vom großen Hund gejagt
Die Katze wurde immer wieder durch den großen Hund gejagt
Der große Hund wollte die Katze jagen
Der große Hund entschied sich, die Katze zu jagen
Die Katze, die der große Hund gejagt hatte, hatte Angst
Dass der große Hund die Katze jagte, war ein Problem
Es gab einen großen Hund, der eine Katze jagte
Die Katzenjagd durch den großen Hund
Es gab einmal einen großen Hund, und er jagte eine Katze
Es gab einen großen Hund. Er hieß Fido. Er jagte meine Katze
Es erschien ein Hund. Er jagte eine Katze. Er war sehr groß.
Die Katze schlich sich in unser Wohnzimmer zurück, weil ein großer Hund sie draußen gejagt hatte
Unser großer Hund war aufgeregt, weil er eine Katze gejagt hatte
The demonstration would be incomplete without trying out some sentences that contain the same words but do not express the same idea, and observing that they are not matched:
English:
The dog chased a big cat
The big dog and the cat chased about
The big dog chased a mouse but the cat was tired
The big dog always used to be chased by the cat
The big dog the cat chased was scared
Our big dog was upset because he had been chased by a cat.
The dog chase of the big cat
German:
Der Hund jagte eine große Katze
Die Katze jagte den großen Hund
Der große Hund und die Katze jagten
Der große Hund jagte eine Maus aber die Katze war müde
Der große Hund wurde ständig von der Katze gejagt
Der große Hund entschloss sich, von der Katze gejagt zu werden
Die Hundejagd durch den große Katze
In the examples above, Holmes matched a variety of different sentence-level structures that share the same meaning, but the base forms of the three matched words in the documents were always the same as the three words in the search phrase. Holmes offers several strategies for matching at the level of individual words. Combined with Holmes' ability to match different sentence structures, these strategies can enable a search phrase to match a document sentence that shares its meaning even when the two share no words and are grammatically completely different.
One of these additional word-matching strategies is named-entity matching: special words can be included in search phrases that match whole classes of names such as people or places. Exit the console by typing exit, then register a second search phrase and restart the console:
English:
holmes_manager.register_search_phrase('An ENTITYPERSON goes into town')
holmes_manager.start_chatbot_mode_console()
German:
holmes_manager.register_search_phrase('Ein ENTITYPER geht in die Stadt')
holmes_manager.start_chatbot_mode_console()
You have now registered your interest in people going into town, and can enter appropriate sentences into the console:
English:
Ready for input
I met Richard Hudson and John Doe last week. They didn't want to go into town.
Matched search phrase with text 'An ENTITYPERSON goes into town'; negated; uncertain; involves coreference:
'Richard Hudson'->'ENTITYPERSON' (Has an entity label matching ENTITYPERSON); 'go'->'go' (Matches GO directly); 'into'->'into' (Matches INTO directly); 'town'->'town' (Matches TOWN directly)
Matched search phrase with text 'An ENTITYPERSON goes into town'; negated; uncertain; involves coreference:
'John Doe'->'ENTITYPERSON' (Has an entity label matching ENTITYPERSON); 'go'->'go' (Matches GO directly); 'into'->'into' (Matches INTO directly); 'town'->'town' (Matches TOWN directly)
German:
Ready for input
Letzte Woche sah ich Richard Hudson und Max Mustermann. Sie wollten nicht mehr in die Stadt gehen.
Matched search phrase with text 'Ein ENTITYPER geht in die Stadt'; negated; uncertain; involves coreference:
'Richard Hudson'->'ENTITYPER' (Has an entity label matching ENTITYPER); 'gehen'->'gehen' (Matches GEHEN directly); 'in'->'in' (Matches IN directly); 'die Stadt'->'stadt' (Matches STADT directly)
Matched search phrase with text 'Ein ENTITYPER geht in die Stadt'; negated; uncertain; involves coreference:
'Max Mustermann'->'ENTITYPER' (Has an entity label matching ENTITYPER); 'gehen'->'gehen' (Matches GEHEN directly); 'in'->'in' (Matches IN directly); 'die Stadt'->'stadt' (Matches STADT directly)
In each of the two languages, the last example demonstrates several further features of Holmes:
For more examples, please see section 5.
Each of the following strategies is implemented with its own Python module. Although the standard library does not support adding custom strategies via the Manager class, it would be relatively easy for anyone with Python programming skills to change the code to achieve this.
(word_match.type=='direct') Direct matching between search phrase words and document words is always active. The strategy relies mainly on matching the stem forms of words, e.g. English buy and child would match bought and children, German steigen and Kind would match stieg and Kinder. However, to increase the chance of direct matching working when the parser delivers an incorrect stem form for a word, the original forms of the search phrase word and the document word are also taken into account during direct matching.
(word_match.type=='derivation') Derivation-based matching involves distinct but related words that typically belong to different word classes, e.g. English assess and assessment, German jagen and Jagd. It is active by default but can be switched off using the analyze_derivational_morphology parameter, which is set when instantiating the Manager class.
(word_match.type=='entity') Named-entity matching is activated by inserting a special named-entity identifier in place of a noun within a search phrase, e.g.
An ENTITYPERSON goes into town (English)
Ein ENTITYPER geht in die Stadt (German).
The supported named-entity identifiers depend directly on the named-entity information supplied by the spaCy models for each language (the descriptions are copied from an earlier version of the spaCy documentation):
English:

| Identifier | Meaning |
|---|---|
| ENTITYNOUN | Any noun phrase. |
| ENTITYPERSON | People, including fictional. |
| ENTITYNORP | Nationalities or religious or political groups. |
| ENTITYFAC | Buildings, airports, highways, bridges, etc. |
| ENTITYORG | Companies, agencies, institutions, etc. |
| ENTITYGPE | Countries, cities, states. |
| ENTITYLOC | Non-GPE locations, mountain ranges, bodies of water. |
| ENTITYPRODUCT | Objects, vehicles, foods, etc. (Not services.) |
| ENTITYEVENT | Named hurricanes, battles, wars, sports events, etc. |
| ENTITYWORK_OF_ART | Titles of books, songs, etc. |
| ENTITYLAW | Named documents made into laws. |
| ENTITYLANGUAGE | Any named language. |
| ENTITYDATE | Absolute or relative dates or periods. |
| ENTITYTIME | Times smaller than a day. |
| ENTITYPERCENT | Percentage, including "%". |
| ENTITYMONEY | Monetary values, including unit. |
| ENTITYQUANTITY | Measurements, as of weight or distance. |
| ENTITYORDINAL | "first", "second", etc. |
| ENTITYCARDINAL | Numerals that do not fall under another type. |
German:

| Identifier | Meaning |
|---|---|
| ENTITYNOUN | Any noun phrase. |
| ENTITYPER | Named person or family. |
| ENTITYLOC | Name of politically or geographically defined location (cities, provinces, countries, international regions, bodies of water, mountains). |
| ENTITYORG | Named corporate, governmental, or other organizational entity. |
| ENTITYMISC | Miscellaneous entities, e.g. events, nationalities, products or works of art. |
ENTITYNOUN is one we have added ourselves to the genuine named-entity identifiers. It behaves similarly to a generic pronoun in that it matches any noun phrase. The difference is that ENTITYNOUN has to match a specific noun phrase within the document, and that this specific noun phrase is extracted and available for further processing. ENTITYNOUN is not supported within the topic matching use case.
(word_match.type=='ontology') Ontologies enable the user to define relationships between words that are then taken into account when documents are matched against search phrases. The three relevant relationship types are hyponyms (something is a subtype of something), synonyms (something means the same as something) and named individuals (something is a specific instance of something). The three relationship types are exemplified in Figure 1:

Ontologies are defined to Holmes using the OWL ontology standard serialized as RDF/XML. Such ontologies can be generated with a variety of tools. For the Holmes examples and tests, the free tool Protégé was used. It is recommended that you use Protégé to define your own ontologies as well as to browse the ontologies that ship with the examples and tests. When saving an ontology under Protégé, please select RDF/XML as the format. Protégé assigns standard labels to the hyponym, synonym and individual relationships that Holmes understands as defaults, but these can also be overridden.
Ontology entries are defined using Internationalized Resource Identifiers (IRIs), e.g. http://www.semanticweb.org/hudsonr/ontologies/2019/0/animals#dog. Holmes only uses the final fragment of each IRI for matching, which allows homonyms (words with the same form but multiple meanings) to be defined at multiple points in the ontology tree.
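The fragment-only matching described above can be illustrated with a minimal sketch (plain Python string handling; this is an illustration of the idea, not Holmes's actual implementation):

```python
def iri_fragment(iri: str) -> str:
    """Return the final fragment of an IRI: the part after '#',
    or after the last '/' if there is no fragment separator."""
    if "#" in iri:
        return iri.rsplit("#", 1)[1]
    return iri.rsplit("/", 1)[1]

dog_iri = "http://www.semanticweb.org/hudsonr/ontologies/2019/0/animals#dog"
print(iri_fragment(dog_iri))  # -> dog
```

Because only the fragment is compared, two entries such as `…/animals#mouse` and `…/devices#mouse` would both match the word mouse, which is how homonyms can coexist at different points in the tree.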
Ontology-based matching yields the best results for Holmes when it is built with small ontologies tailored to specific subject domains and use cases. For example, if you are implementing a chatbot for a building insurance use case, you should create a small ontology capturing the terms and relationships within that specific domain. On the other hand, using large ontologies built for entire languages, such as WordNet, is not recommended. This is because the many homonyms and relationships that only apply within narrow subject domains will lead to a large number of incorrect matches. For general use cases, embedding-based matching will tend to produce better results.
Each word in an ontology can be regarded as heading a subtree consisting of its hyponyms, synonyms and named individuals, those words' hyponyms, synonyms and named individuals, and so on. With an ontology set up in the standard fashion appropriate for the chatbot and structural extraction use cases, Holmes matches a word in a search phrase to a word in a document whenever the document word lies within the search phrase word's subtree. In addition to the direct matching strategy, which would match each word to itself, defining the ontology in Figure 1 to Holmes would mean that the following combinations would match:
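The subtree idea can be sketched with a toy animal ontology (the words and data structures below are invented for illustration and do not reflect Holmes's internal representation):

```python
# Toy ontology: each word maps to the words directly beneath it;
# hyponyms, synonyms and named individuals are treated alike here.
ontology = {
    "animal": ["dog", "cat"],
    "dog": ["puppy", "hound", "Fido"],
    "cat": ["kitten", "kitty", "Mimi"],
}

def subtree(word, ontology):
    """Collect the word itself plus everything beneath it, recursively."""
    result = {word}
    for child in ontology.get(word, []):
        result |= subtree(child, ontology)
    return result

def ontology_match(search_phrase_word, document_word, ontology):
    """A search phrase word matches a document word if the document
    word lies within the search phrase word's subtree."""
    return document_word in subtree(search_phrase_word, ontology)

print(ontology_match("animal", "puppy", ontology))  # True: puppy is under dog, under animal
print(ontology_match("puppy", "animal", ontology))  # False: standard matching is not symmetric
```

Note the asymmetry: a search phrase about animals finds documents about puppies, but not vice versa.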
English phrasal verbs (such as eat up) and German separable verbs (such as aufessen) must be defined as single items within an ontology. When Holmes analyses a text and comes across such a verb, the main verb and its particle are conflated into a single logical word that can then be matched via the ontology. This means that eat up within a text would match the subtree of eat up within an ontology, but not the subtree of eat.
If derivation-based matching is active, it is taken into account on both sides of a potential ontology-based match. For example, if alter and amend were defined as synonyms in an ontology, alteration and amendment would also match each other.
In situations where finding relevant sentences is more important than ensuring that document matches correspond logically to search phrases, it may make sense to specify symmetric matching when defining an ontology. Symmetric matching is recommended for the topic matching use case, but is unlikely to be appropriate for the chatbot or structural extraction use cases. It means that hypernym (reverse-hyponym) relationships are taken into account alongside hyponym and synonym relationships when matching, leading to a more symmetric relationship between documents and search phrases. An important rule applied when matching via a symmetric ontology is that a match path may not contain both hypernym and hyponym relationships, i.e. you cannot go back on yourself. Were the ontology above defined as symmetric, the following combinations would match:
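The "you cannot go back on yourself" rule can be sketched as follows (again a toy illustration with invented words, not Holmes's code): with symmetric matching, two words match if one lies in the other's subtree, so a path may run purely downwards or purely upwards, but never up and then down again.

```python
# Toy ontology: each word maps to the words directly beneath it.
ontology = {
    "animal": ["dog", "cat"],
    "dog": ["puppy", "hound", "Fido"],
    "cat": ["kitten", "kitty", "Mimi"],
}

def subtree(word, ontology):
    """Collect the word itself plus everything beneath it, recursively."""
    result = {word}
    for child in ontology.get(word, []):
        result |= subtree(child, ontology)
    return result

def symmetric_match(word_a, word_b, ontology):
    """Two words match symmetrically if one lies in the other's subtree.
    Because a match path may not mix hypernym and hyponym steps,
    'dog' and 'cat' do NOT match even though a path up to 'animal'
    and back down again exists."""
    return (word_b in subtree(word_a, ontology)
            or word_a in subtree(word_b, ontology))

print(symmetric_match("puppy", "animal", ontology))  # True: purely upward path
print(symmetric_match("dog", "cat", ontology))       # False: would require going up then down
```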
Two separate ontologies can be used in the supervised document classification use case:
The structural matching ontology is used to analyse the content of the training and test documents. Each word in a document that is found in the ontology is replaced by its most general hypernym ancestor. It is important to realise that structural matching can only yield sensible results with an ontology that has been built specifically for the purpose: such an ontology should consist of a number of separate trees representing the main classes of objects within the documents that are to be classified. With the example ontology shown above, every word in the ontology would be replaced by animal. In the extreme case of a WordNet-style ontology, every noun would end up being replaced by thing, which is clearly not a desirable outcome!
The classification ontology is used to capture relationships between classification labels: the fact that a document has a certain classification implies that it also has the classifications heading the subtrees to which that classification belongs. Synonyms should be used sparingly if at all within classification ontologies, as they increase the complexity of the neural network without adding any value. Although it would be technically possible to build a classification ontology that uses symmetric matching, there is no sensible reason for doing so. Note that any label in a classification ontology that is not directly defined as the label of any training document must be registered specifically using the SupervisedTopicTrainingBasis.register_additional_classification_label() method if it is to be taken into account when the classifier is trained.
(word_match.type=='embedding') spaCy offers word embeddings: machine-learning-generated numerical vector representations of words that capture the contexts in which each word tends to occur. Two words with similar meanings tend to have word embeddings that are close to one another, and spaCy can measure the cosine similarity between the embeddings of any two words as a value between 0.0 (no similarity) and 1.0 (the same word). Because dog and cat tend to occur in similar contexts, they have a similarity of 0.80; dog and horse have less in common and have a similarity of 0.62; dog and iron have a similarity of only 0.25. Embedding-based matching is only activated for nouns, adjectives and adverbs, because the results were found to be unsatisfactory for other word classes.
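Cosine similarity itself is straightforward to compute. The sketch below uses made-up three-dimensional vectors purely to illustrate the formula; real spaCy embeddings have hundreds of dimensions, and the similarity figures quoted above come from those:

```python
import math

def cosine_similarity(v1, v2):
    """Cosine of the angle between two vectors: their dot product
    divided by the product of their magnitudes."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

# Toy vectors, not real embeddings:
dog = [1.0, 0.9, 0.1]
cat = [0.9, 1.0, 0.2]
iron = [0.1, 0.0, 1.0]

print(round(cosine_similarity(dog, cat), 2))   # -> 0.99: similar contexts
print(round(cosine_similarity(dog, iron), 2))  # -> 0.15: dissimilar contexts
```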
It is important to understand that the fact that two words have similar embeddings does not imply the same type of logical relationship between the two as with ontology-based matching: for example, the fact that dog and cat have similar embeddings neither means that a dog is a kind of cat nor that a cat is a kind of dog. Whether or not embedding-based matching is nonetheless an appropriate choice depends on the functional use case.
For the chatbot, structural extraction and supervised document classification use cases, Holmes makes use of word-embedding-based similarities via the overall_similarity_threshold parameter, which is defined globally on the Manager class. A match is detected between a search phrase and a structure within a document whenever the geometric mean of the similarities between the individual corresponding word pairs is greater than this threshold. The intuition behind this technique is that where a search phrase with, say, six lexical words matches a document structure in which five of the words match exactly and only one word matches via an embedding, the similarity required of that sixth word pair should be less strict than when only three of the words match exactly and the remaining word pairs all have to correspond via embeddings.
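The geometric-mean logic can be sketched as follows (a simplified illustration of the idea; the function name and threshold value are invented, and this is not the library's actual code):

```python
import math

def passes_overall_threshold(word_similarities, overall_similarity_threshold):
    """Exactly matched words contribute a similarity of 1.0; embedding-matched
    words contribute their cosine similarity. The structure is accepted if the
    geometric mean of all word-pair similarities clears the threshold."""
    geometric_mean = math.prod(word_similarities) ** (1 / len(word_similarities))
    return geometric_mean >= overall_similarity_threshold

# Six lexical words, five exact matches: a sixth word with similarity 0.5 suffices.
print(passes_overall_threshold([1.0] * 5 + [0.5], 0.85))  # True
# Only three exact matches: the remaining word pairs would need to be much more similar.
print(passes_overall_threshold([1.0] * 3 + [0.5] * 3, 0.85))  # False
```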
Matching a search phrase to a document begins with finding document words that match the word at the root (syntactic head) of the search phrase. Holmes then investigates the structure around each of these matched document words to check whether the document structure corresponds to the search phrase structure. Document words that match a search phrase root word are normally found using an index. However, if embeddings have to be taken into account when finding document words that match a search phrase root word, every word in every document has to be compared for similarity to that search phrase root word instead. This has a very noticeable performance impact that renders all use cases except the chatbot use case essentially unusable.
To avoid this typically unnecessary performance hit resulting from matching root words via embeddings, embedding-based matching on root words is controlled separately with the embedding_based_matching_on_root_words parameter, which is also set when instantiating the Manager class. It is recommended that this setting be switched off (value False) for most use cases.
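The performance difference described above can be sketched abstractly: with embedding-based matching on root words switched off, candidate root matches come from a word index; with it switched on, every document word would need an embedding comparison (a toy illustration, not Holmes's internals):

```python
from collections import defaultdict

documents = {
    "doc1": ["the", "dog", "chased", "the", "cat"],
    "doc2": ["a", "hound", "slept"],
}

# Index lookup: finding direct matches for a root word is a single dictionary access.
index = defaultdict(list)
for name, words in documents.items():
    for position, word in enumerate(words):
        index[word].append((name, position))

print(index["dog"])  # [('doc1', 1)]

# With embedding-based matching on root words, no such index is possible:
# every word in every document would need an embedding comparison,
# i.e. one similarity computation per document word per search phrase root.
comparisons = sum(len(words) for words in documents.values())
print(comparisons)  # 8
```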
The overall_similarity_threshold and embedding_based_matching_on_root_words parameters have no effect on the topic matching use case. Here, the word-level embedding similarity thresholds are set using the word_embedding_match_threshold and initial_question_word_embedding_match_threshold parameters on calls to topic_match_documents_against.
(word_match.type=='entity_embedding') Named-entity-embedding-based matching obtains between a searched-document word that has a certain entity label and a search phrase or query document word whose embedding is similar to the underlying meaning of that entity label, e.g. a word in a search phrase whose embedding is similar to the underlying meaning of the label person. Note that named-entity-embedding-based matching is never active for root words, regardless of the embedding_based_matching_on_root_words setting.
(word_match.type=='question') Initial question word matching is only active during topic matching. Initial question words in query phrases are matched to entities in the searched documents that represent potential answers to them, e.g. when comparing the query phrase When did Peter eat breakfast? with the searched document Peter ate breakfast at 8 a.m., the word when would match the temporal adverbial phrase at 8 a.m.
Initial question word matching is switched on and off using the initial_question_word_behaviour parameter on calls to the topic_match_documents_against function on the Manager class. It is only likely to be useful when topic matching is being performed in an interactive setting where the user enters short query phrases, as opposed to when it is being used to find documents on a similar topic to a pre-existing query document: initial question words are only processed at the beginning of the first sentence of the query phrase or query document.
Linguistically speaking, if a query phrase consists of a complex question with several elements dependent on the main verb, a finding in a searched document is only an 'answer' if it contains matches to all these elements. Because recall is typically more important than precision when performing topic matching with interactive query phrases, however, Holmes will match an initial question word to a searched-document phrase wherever they correspond semantically (e.g. wherever when corresponds to a temporal adverbial phrase) and each depend on verbs that themselves match at the word level. One possible strategy to filter out 'incomplete answers' would be to calculate the maximum possible score for a query phrase and reject topic matches that score below a threshold scaled to this maximum.
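The filtering strategy just mentioned could be sketched like this (a hypothetical helper with an illustrative scoring scheme; neither the function nor the scores reflect Holmes's actual API):

```python
def filter_incomplete_answers(topic_matches, maximum_possible_score, fraction=0.8):
    """Keep only topic matches scoring at least the given fraction of the
    maximum score achievable for the query phrase. 'topic_matches' is a
    list of (passage, score) pairs."""
    threshold = fraction * maximum_possible_score
    return [(passage, score) for passage, score in topic_matches if score >= threshold]

matches = [("Peter ate breakfast at 8 a.m.", 9.5), ("Peter ate breakfast", 6.0)]
print(filter_incomplete_answers(matches, maximum_possible_score=10.0))
# Only the complete answer survives the 0.8 * 10.0 = 8.0 threshold.
```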
Before Holmes analyses a searched document or query document, coreference resolution is performed using the Coreferee library running on top of spaCy. This means that situations are recognised where pronouns and nouns that are located near one another within a text refer to the same entities. The information from one mention can then be applied to the analysis of further mentions:
I saw a big dog . It was chasing a cat.
I saw a big dog . The dog was chasing a cat.
Coreferee also detects situations where a noun refers back to a named entity:
We discussed AstraZeneca . The company had given us permission to publish this library under the MIT license.
If this example were to match the search phrase A company gives permission to publish something , the coreference information that the company under discussion is AstraZeneca is clearly relevant and worth extracting in addition to the word(s) directly matched to the search phrase. Such information is captured in the word_match.extracted_word field.
The concept of search phrases has already been introduced and is relevant to the chatbot use case, the structural extraction use case and to preselection within the supervised document classification use case.
It is crucial to understand that the tips and limitations set out in Section 4 do not apply in any way to query phrases in topic matching. If you are using Holmes for topic matching only, you can completely ignore this section!
Structural matching between search phrases and documents is not symmetric: there are many situations in which sentence X as a search phrase would match sentence Y within a document but where the converse would not be true. Although Holmes does its best to understand any search phrases, the results are better when the user writing them follows certain patterns and tendencies, and getting to grips with these patterns and tendencies is the key to using the relevant features of Holmes successfully.
Holmes distinguishes between: lexical words like dog , chase and cat (English) or Hund , jagen and Katze (German) in the initial example above; and grammatical words like a (English) or ein and eine (German) in the initial example above. Only lexical words match words in documents, but grammatical words still play a crucial role within a search phrase: they enable Holmes to understand it.
Dog chase cat (English)
Hund jagen Katze (German)
contain the same lexical words as the search phrases in the initial example above, but as they are not grammatical sentences Holmes is liable to misunderstand them if they are used as search phrases. This is a major difference between Holmes search phrases and the search phrases you use instinctively with standard search engines like Google, and it can take some getting used to.
A search phrase need not contain a verb:
ENTITYPERSON (English)
A big dog (English)
Interest in fishing (English)
ENTITYPER (German)
Ein großer Hund (German)
Interesse am Angeln (German)
are all perfectly valid and potentially useful search phrases.
Where a verb is present, however, Holmes delivers the best results when the verb is in the present active , as chases and jagt are in the initial example above. This gives Holmes the best chance of understanding the relationship correctly and of matching the widest range of document structures that share the target meaning.
Sometimes you may only wish to extract the object of a verb. For example, you might want to find sentences that are discussing a cat being chased regardless of who is doing the chasing. In order to avoid a search phrase containing a passive expression like
A cat is chased (English)
Eine Katze wird gejagt (German)
you can use a generic pronoun . This is a word that Holmes treats like a grammatical word in that it is not matched to documents; its sole purpose is to help the user form a grammatically optimal search phrase in the present active. Recognised generic pronouns are English something , somebody and someone and German jemand (and inflected forms of jemand ) and etwas : Holmes treats them all as equivalent. Using generic pronouns, the passive search phrases above could be re-expressed as
Somebody chases a cat (English)
Jemand jagt eine Katze (German).
Experience shows that different prepositions are often used with the same meaning in equivalent phrases and that this can prevent search phrases from matching where one would intuitively expect it. For example, the search phrases
Somebody is at the market (English)
Jemand ist auf dem Marktplatz (German)
would fail to match the document phrases
Richard was in the market (English)
Richard war am Marktplatz (German)
The best way of solving this problem is to define the prepositions in question as synonyms in an ontology.
The following types of structures are prohibited in search phrases and result in Python user-defined errors:
A dog chases a cat. A cat chases a dog (English)
Ein Hund jagt eine Katze. Eine Katze jagt einen Hund (German)
Each clause must be separated out into its own search phrase and registered individually.
A dog does not chase a cat. (English)
Ein Hund jagt keine Katze. (German)
Negative expressions are recognised as such in documents and the generated matches marked as negative; allowing search phrases themselves to be negative would overcomplicate the library without offering any benefits.
A dog and a lion chase a cat. (English)
Ein Hund und ein Löwe jagen eine Katze. (German)
Wherever conjunction occurs in documents, Holmes distributes the information among multiple matches as explained above. In the unlikely event that there should be a requirement to capture conjunction explicitly when matching, this could be achieved by using the Manager.match() function and looking for situations where the document token objects are shared by multiple match objects.
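The suggested technique of looking for document tokens shared between match objects can be sketched abstractly (the dictionaries below are simplified stand-ins; the real Match objects returned by Manager.match() expose richer information):

```python
from collections import defaultdict

# Simplified matches: each is a label plus the document token indices it covers.
# Two matches arising from "A dog and a lion chase a cat" would share the
# tokens for the verb and the object.
matches = [
    {"label": "dog-chases-cat", "token_indexes": {0, 2, 5}},
    {"label": "lion-chases-cat", "token_indexes": {3, 2, 5}},
]

# Group match labels by token index to detect conjunction.
matches_per_token = defaultdict(list)
for match in matches:
    for token_index in match["token_indexes"]:
        matches_per_token[token_index].append(match["label"])

shared = {i: labels for i, labels in matches_per_token.items() if len(labels) > 1}
print(sorted(shared))  # [2, 5]: both matches share the verb and object tokens
```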
The (English)
Der (German)
A search phrase cannot be processed if it does not contain any words that can be matched to documents.
A dog chases a cat and he chases a mouse (English)
Ein Hund jagt eine Katze und er jagt eine Maus (German)
Pronouns that corefer with nouns elsewhere in the search phrase are not permitted as this would overcomplicate the library without offering any benefits.
The following types of structures are strongly discouraged in search phrases:
Dog chase cat (English)
Hund jagen Katze (German)
Although these will sometimes work, the results will be better if search phrases are expressed grammatically.
A cat is chased by a dog (English)
A dog will have chased a cat (English)
Eine Katze wird durch einen Hund gejagt (German)
Ein Hund wird eine Katze gejagt haben (German)
Although these will sometimes work, the results will be better if verbs in search phrases are expressed in the present active.
Who chases the cat? (English)
Wer jagt die Katze? (German)
Although questions are supported as query phrases in the topic matching use case, they are not appropriate as search phrases. Questions should be re-phrased as statements, in this case
Something chases the cat (English)
Etwas jagt die Katze (German).
Informationsextraktion (German)
Ein Stadtmittetreffen (German)
The internal structure of German compound words is analysed within searched documents as well as within query phrases in the topic matching use case, but not within search phrases. In search phrases, compound words should be reexpressed as genitive constructions even in cases where this does not strictly capture their meaning:
Extraktion der Information (German)
Ein Treffen der Stadtmitte (German)
The following types of structures should be used with caution in search phrases:
A fierce dog chases a scared cat on the way to the theatre (English)
Ein kämpferischer Hund jagt eine verängstigte Katze auf dem Weg ins Theater (German)
Holmes can handle any level of complexity within search phrases, but the more complex a structure, the less likely it becomes that a document sentence will match it. If it is really necessary to match such complex relationships with search phrases rather than with topic matching, they are typically better extracted by splitting the search phrase up, eg
A fierce dog (English)
A scared cat (English)
A dog chases a cat (English)
Something chases something on the way to the theatre (English)
Ein kämpferischer Hund (German)
Eine verängstigte Katze (German)
Ein Hund jagt eine Katze (German)
Etwas jagt etwas auf dem Weg ins Theater (German)
Correlations between the resulting matches can then be established by matching via the Manager.match() function and looking for situations where the document token objects are shared across multiple match objects.
One possible exception to this piece of advice is when embedding-based matching is active. Because whether or not each word in a search phrase matches then depends on whether or not other words in the same search phrase have been matched, large, complex search phrases can sometimes yield results that a combination of smaller, simpler search phrases would not.
The chasing of a cat (English)
Die Jagd einer Katze (German)
These will often work, but it is generally better practice to use verbal search phrases like
Something chases a cat (English)
Etwas jagt eine Katze (German)
and to allow the corresponding nominal phrases to be matched via derivation-based matching.
The chatbot use case has already been introduced: a predefined set of search phrases is used to extract information from phrases entered interactively by an end user, which in this use case act as the documents.
The Holmes source code ships with two examples demonstrating the chatbot use case, one for each language, with predefined ontologies. Having cloned the source code and installed the Holmes library, navigate to the /examples directory and type the following (Linux):
English:
python3 example_chatbot_EN_insurance.py
German:
python3 example_chatbot_DE_insurance.py
or click on the files in Windows Explorer (Windows).
Holmes matches syntactically distinct structures that are semantically equivalent, ie that share the same meaning. In a real chatbot use case, users will typically enter equivalent information with phrases that are semantically distinct as well, ie that have different meanings. Because the effort involved in registering a search phrase is barely greater than the time it takes to type it in, it makes sense to register a large number of search phrases for each relationship you are trying to extract: essentially all ways people have been observed to express the information you are interested in or all ways you can imagine somebody might express the information you are interested in . To assist this, search phrases can be registered with labels that do not need to be unique: a label can then be used to express the relationship an entire group of search phrases is designed to extract. Note that when many search phrases have been defined to extract the same relationship, a single user entry is likely to be sometimes matched by multiple search phrases. This must be handled appropriately by the calling application.
One obvious weakness of Holmes in the chatbot setting is its sensitivity to correct spelling and, to a lesser extent, to correct grammar. Strategies for mitigating this weakness include:
The structural extraction use case uses structural matching in the same way as the chatbot use case, and many of the same comments and tips apply to it. The principal differences are that pre-existing and often lengthy documents are scanned rather than text snippets entered ad-hoc by the user, and that the returned match objects are not used to drive a dialog flow; they are examined solely to extract and store structured information.
Code for performing structural extraction would typically perform the following tasks:
- Call Manager.register_search_phrase() several times to define a number of search phrases specifying the information to be extracted.
- Call Manager.parse_and_register_document() several times to load a number of documents within which to search.
- Call Manager.match() to perform the matching.

The topic matching use case matches a query document, or alternatively a query phrase entered ad-hoc by the user, against a set of documents pre-loaded into memory. The aim is to find the passages in the documents whose topic most closely corresponds to the topic of the query document; the output is an ordered list of passages scored according to topic similarity. Additionally, if a query phrase contains an initial question word, the output will contain potential answers to the question.
Topic matching queries may contain generic pronouns and named-entity identifiers just like search phrases, although the ENTITYNOUN token is not supported. However, an important difference from search phrases is that the topic matching use case places no restrictions on the grammatical structures permissible within the query document.
In addition to the Holmes demonstration website, the Holmes source code ships with three examples demonstrating the topic matching use case with an English literature corpus, a German literature corpus and a German legal corpus respectively. Users are encouraged to run these to get a feel for how they work.
Topic matching uses a variety of strategies to find text passages that are relevant to the query. These include resource-hungry procedures like investigating semantic relationships and comparing embeddings. Because applying these across the board would prevent topic matching from scaling, Holmes only attempts them for specific areas of the text that less resource-intensive strategies have already marked as looking promising. This and the other interior workings of topic matching are explained here.
In the supervised document classification use case, a classifier is trained with a number of documents that are each pre-labelled with a classification. The trained classifier then assigns one or more labels to new documents according to what each new document is about. As explained here, ontologies can be used both to enrichen the comparison of the content of the various documents and to capture implication relationships between classification labels.
A classifier makes use of a neural network (a multilayer perceptron) whose topology can either be determined automatically by Holmes or specified explicitly by the user. With a large number of training documents, the automatically determined topology can easily exhaust the memory available on a typical machine; if there is no opportunity to scale up the memory, this problem can be remedied by specifying a smaller number of hidden layers or a smaller number of nodes in one or more of the layers.
A trained document classification model retains no references to its training data. This is an advantage from a data protection viewpoint, although it cannot presently be guaranteed that models will not contain individual personal or company names.
A typical problem with the execution of many document classification use cases is that a new classification label is added when the system is already live but that there are initially no examples of this new classification with which to train a new model. The best course of action in such a situation is to define search phrases which preselect the more obvious documents with the new classification using structural matching. Those documents that are not preselected as having the new classification label are then passed to the existing, previously trained classifier in the normal way. When enough documents exemplifying the new classification have accumulated in the system, the model can be retrained and the preselection search phrases removed.
Holmes ships with an example script demonstrating supervised document classification for English with the BBC Documents dataset. The script downloads the documents (for this operation and for this operation alone, you will need to be online) and places them in a working directory. When training is complete, the script saves the model to the working directory. If the model file is found in the working directory on subsequent invocations of the script, the training phase is skipped and the script goes straight to the testing phase. This means that if it is wished to repeat the training phase, either the model has to be deleted from the working directory or a new working directory has to be specified to the script.
Having cloned the source code and installed the Holmes library, navigate to the /examples directory. Specify a working directory at the top of the example_supervised_topic_model_EN.py file, then type python3 example_supervised_topic_model_EN.py (Linux) or click on the script in Windows Explorer (Windows).
It is important to realise that Holmes learns to classify documents according to the words or semantic relationships they contain, taking any structural matching ontology into account in the process. For many classification tasks, this is exactly what is required; but there are tasks (eg author attribution according to the frequency of grammatical constructions typical for each author) where it is not. For the right task, Holmes achieves impressive results. For the BBC Documents benchmark processed by the example script, Holmes performs slightly better than benchmarks available online (see eg here), although the difference is probably too slight to be significant, especially given that different training/test splits were used in each case: Holmes has been observed to learn models that predict the correct result between 96.9% and 98.7% of the time. The range is explained by the fact that the behaviour of the neural network is not fully deterministic.
The interior workings of supervised document classification are explained here.
Manager holmes_extractor.Manager(self, model, *, overall_similarity_threshold=1.0,
embedding_based_matching_on_root_words=False, ontology=None,
analyze_derivational_morphology=True, perform_coreference_resolution=True,
use_reverse_dependency_matching=True, number_of_workers=None, verbose=False)
The facade class for the Holmes library.
Parameters:
model -- the name of the spaCy model, e.g. *en_core_web_trf*
overall_similarity_threshold -- the overall similarity threshold for embedding-based
matching. Defaults to *1.0*, which deactivates embedding-based matching. Note that this
parameter is not relevant for topic matching, where the thresholds for embedding-based
matching are set on the call to *topic_match_documents_against*.
embedding_based_matching_on_root_words -- determines whether or not embedding-based
matching should be attempted on search-phrase root tokens, which has a considerable
performance hit. Defaults to *False*. Note that this parameter is not relevant for topic
matching.
ontology -- an *Ontology* object. Defaults to *None* (no ontology).
analyze_derivational_morphology -- *True* if matching should be attempted between different
words from the same word family. Defaults to *True*.
perform_coreference_resolution -- *True* if coreference resolution should be taken into account
when matching. Defaults to *True*.
use_reverse_dependency_matching -- *True* if appropriate dependencies in documents can be
matched to dependencies in search phrases where the two dependencies point in opposite
directions. Defaults to *True*.
number_of_workers -- the number of worker processes to use, or *None* if the number of worker
processes should depend on the number of available cores. Defaults to *None*.
verbose -- a boolean value specifying whether multiprocessing messages should be outputted to
the console. Defaults to *False*.
Manager.register_serialized_document(self, serialized_document:bytes, label:str="") -> None
Parameters:
serialized_document -- a serialized Holmes document.
label -- a label for the document which must be unique. Defaults to the empty string,
which is intended for use cases involving single documents (typically user entries).
Manager.register_serialized_documents(self, document_dictionary:dict[str, bytes]) -> None
Note that this function is the most efficient way of loading documents.
Parameters:
document_dictionary -- a dictionary from labels to serialized documents.
Manager.parse_and_register_document(self, document_text:str, label:str='') -> None
Parameters:
document_text -- the raw document text.
label -- a label for the document which must be unique. Defaults to the empty string,
which is intended for use cases involving single documents (typically user entries).
Manager.remove_document(self, label:str) -> None
Manager.remove_all_documents(self, labels_starting:str=None) -> None
Parameters:
labels_starting -- a string starting the labels of documents to be removed,
or 'None' if all documents are to be removed.
Manager.list_document_labels(self) -> List[str]
Returns a list of the labels of the currently registered documents.
Manager.serialize_document(self, label:str) -> Optional[bytes]
Returns a serialized representation of a Holmes document that can be
persisted to a file. If 'label' is not the label of a registered document,
'None' is returned instead.
Parameters:
label -- the label of the document to be serialized.
Manager.get_document(self, label:str='') -> Optional[Doc]
Returns a Holmes document. If *label* is not the label of a registered document, *None*
is returned instead.
Parameters:
label -- the label of the document to be returned.
Manager.debug_document(self, label:str='') -> None
Outputs a debug representation for a loaded document.
Parameters:
label -- the label of the document to be debugged.
Manager.register_search_phrase(self, search_phrase_text:str, label:str=None) -> SearchPhrase
Registers and returns a new search phrase.
Parameters:
search_phrase_text -- the raw search phrase text.
label -- a label for the search phrase, which need not be unique.
If label==None, the assigned label defaults to the raw search phrase text.
Manager.remove_all_search_phrases_with_label(self, label:str) -> None
Manager.remove_all_search_phrases(self) -> None
Manager.list_search_phrase_labels(self) -> List[str]
Manager.match(self, search_phrase_text:str=None, document_text:str=None) -> List[Dict]
Matches search phrases to documents and returns the result as match dictionaries.
Parameters:
search_phrase_text -- a text from which to generate a search phrase, or 'None' if the
preloaded search phrases should be used for matching.
document_text -- a text from which to generate a document, or 'None' if the preloaded
documents should be used for matching.
Manager.topic_match_documents_against(self, text_to_match:str, *,
use_frequency_factor:bool=True,
maximum_activation_distance:int=75,
word_embedding_match_threshold:float=0.8,
initial_question_word_embedding_match_threshold:float=0.7,
relation_score:int=300,
reverse_only_relation_score:int=200,
single_word_score:int=50,
single_word_any_tag_score:int=20,
initial_question_word_answer_score:int=600,
initial_question_word_behaviour:str='process',
different_match_cutoff_score:int=15,
overlapping_relation_multiplier:float=1.5,
embedding_penalty:float=0.6,
ontology_penalty:float=0.9,
relation_matching_frequency_threshold:float=0.25,
embedding_matching_frequency_threshold:float=0.5,
sideways_match_extent:int=100,
only_one_result_per_document:bool=False,
number_of_results:int=10,
document_label_filter:str=None,
tied_result_quotient:float=0.9) -> List[Dict]:
Returns a list of dictionaries representing the results of a topic match between an entered text
and the loaded documents.
Parameters:
text_to_match -- the text to match against the loaded documents.
use_frequency_factor -- *True* if scores should be multiplied by a factor between 0 and 1
expressing how rare the words matching each phraselet are in the corpus. Note that,
even if this parameter is set to *False*, the factors are still calculated as they are
required for determining which relation and embedding matches should be attempted.
maximum_activation_distance -- the number of words it takes for a previous phraselet
activation to reduce to zero when the library is reading through a document.
word_embedding_match_threshold -- the cosine similarity above which two words match where
the search phrase word does not govern an interrogative pronoun.
initial_question_word_embedding_match_threshold -- the cosine similarity above which two
words match where the search phrase word governs an interrogative pronoun.
relation_score -- the activation score added when a normal two-word relation is matched.
reverse_only_relation_score -- the activation score added when a two-word relation
is matched using a search phrase that can only be reverse-matched.
single_word_score -- the activation score added when a single noun is matched.
single_word_any_tag_score -- the activation score added when a single word is matched
that is not a noun.
initial_question_word_answer_score -- the activation score added when a question word is
matched to a potential answer phrase.
initial_question_word_behaviour -- 'process' if a question word in the sentence
constituent at the beginning of *text_to_match* is to be matched to document phrases
that answer it and to matching question words; 'exclusive' if only topic matches that
answer questions are to be permitted; 'ignore' if question words are to be ignored.
different_match_cutoff_score -- the activation threshold under which topic matches are
separated from one another. Note that the default value will probably be too low if
*use_frequency_factor* is set to *False*.
overlapping_relation_multiplier -- the value by which the activation score is multiplied
when two relations were matched and the matches involved a common document word.
embedding_penalty -- a value between 0 and 1 with which scores are multiplied when the
match involved an embedding. The result is additionally multiplied by the overall
similarity measure of the match.
ontology_penalty -- a value between 0 and 1 with which scores are multiplied for each
word match within a match that involved the ontology. For each such word match,
the score is multiplied by the value (abs(depth) + 1) times, so that the penalty is
higher for hyponyms and hypernyms than for synonyms and increases with the
depth distance.
relation_matching_frequency_threshold -- the frequency threshold above which single
word matches are used as the basis for attempting relation matches.
embedding_matching_frequency_threshold -- the frequency threshold above which single
word matches are used as the basis for attempting relation matches with
embedding-based matching on the second word.
sideways_match_extent -- the maximum number of words that may be incorporated into a
topic match either side of the word where the activation peaked.
only_one_result_per_document -- if 'True', prevents multiple results from being returned
for the same document.
number_of_results -- the number of topic match objects to return.
document_label_filter -- optionally, a string with which document labels must start to
be considered for inclusion in the results.
tied_result_quotient -- the quotient between a result and following results above which
the results are interpreted as tied.
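Several of the parameters above (relation_score, maximum_activation_distance and friends) feed a running activation score that decays as the library reads through a document. A minimal sketch of that decay mechanism, greatly simplified relative to the real per-phraselet bookkeeping:

```python
def activation_profile(token_scores, maximum_activation_distance=75):
    """Sketch of the activation mechanism: each token's phraselet score is
    added to a running activation that loses 1/maximum_activation_distance
    of its value at every token read, so matches close together reinforce
    one another while isolated matches fade away."""
    activation = 0.0
    profile = []
    for score in token_scores:
        activation -= activation / maximum_activation_distance
        activation += score
        profile.append(activation)
    return profile
```

For instance, with per-token scores [300, 0, 0, 50], the running activation peaks at the last token, because the earlier relation match has barely decayed when the single-word match arrives.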
Manager.get_supervised_topic_training_basis(self, *, classification_ontology:Ontology=None,
overlap_memory_size:int=10, oneshot:bool=True, match_all_words:bool=False,
verbose:bool=True) -> SupervisedTopicTrainingBasis:
Returns an object that is used to train and generate a model for the
supervised document classification use case.
Parameters:
classification_ontology -- an Ontology object incorporating relationships between
classification labels, or 'None' if no such ontology is to be used.
overlap_memory_size -- how many non-word phraselet matches to the left should be
checked for words in common with a current match.
oneshot -- whether the same word or relationship matched multiple times within a
single document should be counted once only (value 'True') or multiple times
(value 'False')
match_all_words -- whether all single words should be taken into account
(value 'True') or only single words with noun tags (value 'False')
verbose -- if 'True', information about training progress is outputted to the console.
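The effect of the oneshot parameter can be illustrated with a small, self-contained sketch (the phraselet names here are invented for illustration):

```python
def count_phraselet_matches(matched_phraselets, oneshot=True):
    """Sketch: with oneshot=True, repeated matches of the same word or
    relationship within a single document are counted once only; with
    oneshot=False, every occurrence is counted."""
    counts = {}
    for phraselet in matched_phraselets:
        counts[phraselet] = counts.get(phraselet, 0) + 1
    if oneshot:
        return {phraselet: 1 for phraselet in counts}
    return counts
```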
Manager.deserialize_supervised_topic_classifier(self,
serialized_model:bytes, verbose:bool=False) -> SupervisedTopicClassifier:
Returns a classifier for the supervised document classification use case
that will use a supplied pre-trained model.
Parameters:
serialized_model -- the pre-trained model as returned from `SupervisedTopicClassifier.serialize_model()`.
verbose -- if 'True', information about matching is outputted to the console.
Manager.start_chatbot_mode_console(self)
Starts a chatbot mode console enabling the matching of pre-registered
search phrases to documents (chatbot entries) entered ad-hoc by the
user.
Manager.start_structural_search_mode_console(self)
Starts a structural extraction mode console enabling the matching of pre-registered
documents to search phrases entered ad-hoc by the user.
Manager.start_topic_matching_search_mode_console(self,
only_one_result_per_document:bool=False, word_embedding_match_threshold:float=0.8,
initial_question_word_embedding_match_threshold:float=0.7):
Starts a topic matching search mode console enabling the matching of pre-registered
documents to query phrases entered ad-hoc by the user.
Parameters:
only_one_result_per_document -- if 'True', prevents multiple topic match
results from being returned for the same document.
word_embedding_match_threshold -- the cosine similarity above which two words match where the
search phrase word does not govern an interrogative pronoun.
initial_question_word_embedding_match_threshold -- the cosine similarity above which two
words match where the search phrase word governs an interrogative pronoun.
Manager.close(self) -> None
Terminates the worker processes.
manager.nlp
manager.nlp is the underlying spaCy Language object on which both Coreferee and Holmes have been registered as custom pipeline components. The most efficient way of parsing documents for use with Holmes is to call manager.nlp.pipe(). This yields an iterable of documents that can then be loaded into Holmes via manager.register_serialized_documents().
The pipe() method has an argument n_process that specifies the number of processes to use. With _lg, _md and _sm spaCy models, there are some situations where it can make sense to specify a value other than 1 (the default). Note however that with transformer spaCy models (_trf) values other than 1 are not supported.
Ontology holmes_extractor.Ontology(self, ontology_path,
owl_class_type='http://www.w3.org/2002/07/owl#Class',
owl_individual_type='http://www.w3.org/2002/07/owl#NamedIndividual',
owl_type_link='http://www.w3.org/1999/02/22-rdf-syntax-ns#type',
owl_synonym_type='http://www.w3.org/2002/07/owl#equivalentClass',
owl_hyponym_type='http://www.w3.org/2000/01/rdf-schema#subClassOf',
symmetric_matching=False)
Loads information from an existing ontology and manages ontology
matching.
The ontology must follow the W3C OWL 2 standard. Search phrase words are
matched to hyponyms, synonyms and instances from within documents being
searched.
This class is designed for small ontologies that have been constructed
by hand for specific use cases. Where the aim is to model a large number
of semantic relationships, word embeddings are likely to offer
better results.
Holmes is not designed to support changes to a loaded ontology via direct
calls to the methods of this class. It is also not permitted to share a single instance
of this class between multiple Manager instances: instead, a separate Ontology instance
pointing to the same path should be created for each Manager.
Matching is case-insensitive.
Parameters:
ontology_path -- the path from where the ontology is to be loaded,
or a list of several such paths. See https://github.com/RDFLib/rdflib/.
owl_class_type -- optionally overrides the OWL 2 URL for types.
owl_individual_type -- optionally overrides the OWL 2 URL for individuals.
owl_type_link -- optionally overrides the RDF URL for types.
owl_synonym_type -- optionally overrides the OWL 2 URL for synonyms.
owl_hyponym_type -- optionally overrides the RDF URL for hyponyms.
symmetric_matching -- if 'True', means hypernym relationships are also taken into account.
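The notion of ontology depth used by the matching penalties (see the depth property and ontology_penalty parameter elsewhere in this document) can be illustrated with a toy hierarchy. This is a simplified stand-in: the real class loads a W3C OWL 2 ontology rather than a Python dictionary.

```python
# Toy child -> parent relationships standing in for an OWL 2 ontology.
PARENTS = {"puppy": "dog", "dog": "animal", "cat": "animal"}


def ontology_depth(search_word, document_word, symmetric=False):
    """Number of hyponym steps from the search-phrase word down to the
    document word; negative for hypernym steps (only found when symmetric
    matching is active); None if the words are unrelated."""
    word, steps = document_word, 0
    while word is not None:
        if word == search_word:
            return steps
        word = PARENTS.get(word)
        steps += 1
    if not symmetric:
        return None
    word, steps = search_word, 0
    while word is not None:
        if word == document_word:
            return -steps
        word = PARENTS.get(word)
        steps += 1
    return None
```

With symmetric matching disabled, only hyponyms and instances of the search word are found; enabling it additionally allows hypernym matches, reported with negative depth.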
SupervisedTopicTrainingBasis (returned from Manager.get_supervised_topic_training_basis())
Holder object for training documents and their classifications from which one or more SupervisedTopicModelTrainer objects can be derived. This class is NOT threadsafe.
SupervisedTopicTrainingBasis.parse_and_register_training_document(self, text:str, classification:str,
label:Optional[str]=None) -> None
Parses and registers a document to use for training.
Parameters:
text -- the document text
classification -- the classification label
label -- a label with which to identify the document in verbose training output,
or 'None' if a random label should be assigned.
SupervisedTopicTrainingBasis.register_training_document(self, doc:Doc, classification:str,
label:Optional[str]=None) -> None
Registers a pre-parsed document to use for training.
Parameters:
doc -- the document
classification -- the classification label
label -- a label with which to identify the document in verbose training output,
or 'None' if a random label should be assigned.
SupervisedTopicTrainingBasis.register_additional_classification_label(self, label:str) -> None
Register an additional classification label which no training document possesses explicitly
but that should be assigned to documents whose explicit labels are related to the
additional classification label via the classification ontology.
SupervisedTopicTrainingBasis.prepare(self) -> None
Matches the phraselets derived from the training documents against the training
documents to generate frequencies that also include combined labels, and examines the
explicit classification labels, the additional classification labels and the
classification ontology to derive classification implications.
Once this method has been called, the instance no longer accepts new training documents
or additional classification labels.
SupervisedTopicTrainingBasis.train(
self,
*,
minimum_occurrences: int = 4,
cv_threshold: float = 1.0,
learning_rate: float = 0.001,
batch_size: int = 5,
max_epochs: int = 200,
convergence_threshold: float = 0.0001,
hidden_layer_sizes: Optional[List[int]] = None,
shuffle: bool = True,
normalize: bool = True
) -> SupervisedTopicModelTrainer:
Trains a model based on the prepared state.
Parameters:
minimum_occurrences -- the minimum number of times a word or relationship has to
occur in the context of the same classification for the phraselet
to be accepted into the final model.
cv_threshold -- the minimum coefficient of variation with which a word or relationship has
to occur across the explicit classification labels for the phraselet to be
accepted into the final model.
learning_rate -- the learning rate for the Adam optimizer.
batch_size -- the number of documents in each training batch.
max_epochs -- the maximum number of training epochs.
convergence_threshold -- the threshold below which loss measurements after consecutive
epochs are regarded as equivalent. Training stops before 'max_epochs' is reached
if equivalent results are achieved after four consecutive epochs.
hidden_layer_sizes -- a list containing the number of neurons in each hidden layer, or
'None' if the topology should be determined automatically.
shuffle -- 'True' if documents should be shuffled during batching.
normalize -- 'True' if normalization should be applied to the loss function.
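The interplay of minimum_occurrences and cv_threshold can be sketched as a phraselet filter. This is an illustration of the filtering rule as described above, not the library's implementation; in particular, the use of the population standard deviation and of the per-classification maximum is an assumption.

```python
from statistics import mean, pstdev


def accept_phraselet(occurrences_per_label, minimum_occurrences=4, cv_threshold=1.0):
    """Sketch: keep a phraselet only if it occurs often enough within some
    classification and its occurrences are unevenly distributed across the
    classifications (high coefficient of variation = more discriminative)."""
    counts = list(occurrences_per_label.values())
    if max(counts) < minimum_occurrences:
        return False
    average = mean(counts)
    if average == 0:
        return False
    coefficient_of_variation = pstdev(counts) / average
    return coefficient_of_variation >= cv_threshold
```

A phraselet that occurs equally often in every category carries no signal (coefficient of variation 0) and is discarded, however frequent it is.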
SupervisedTopicModelTrainer (returned from SupervisedTopicTrainingBasis.train())
Worker object used to train and generate models. This object could be removed from the public interface (SupervisedTopicTrainingBasis.train() could return a SupervisedTopicClassifier directly) but has been retained to facilitate testability.
This class is NOT threadsafe.
SupervisedTopicModelTrainer.classifier(self)
Returns a supervised topic classifier that contains no explicit references to the training data
and that can be serialized.
SupervisedTopicClassifier (returned from SupervisedTopicModelTrainer.classifier() and Manager.deserialize_supervised_topic_classifier())
SupervisedTopicClassifier.parse_and_classify(self, text: str) -> Optional[OrderedDict]:
Returns a dictionary from classification labels to probabilities
ordered starting with the most probable, or *None* if the text did
not contain any words recognised by the model.
Parameters:
text -- the text to parse and classify.
SupervisedTopicClassifier.classify(self, doc: Doc) -> Optional[OrderedDict]:
Returns a dictionary from classification labels to probabilities
ordered starting with the most probable, or *None* if the text did
not contain any words recognised by the model.
Parameters:
doc -- the pre-parsed document to classify.
SupervisedTopicClassifier.serialize_model(self) -> str
Returns a serialized model that can be reloaded using
*Manager.deserialize_supervised_topic_classifier()*
Dictionaries returned from Manager.match()
A text-only representation of a match between a search phrase and a
document. The indexes refer to tokens.
Properties:
search_phrase_label -- the label of the search phrase.
search_phrase_text -- the text of the search phrase.
document -- the label of the document.
index_within_document -- the index of the match within the document.
sentences_within_document -- the raw text of the sentences within the document that matched.
negated -- 'True' if this match is negated.
uncertain -- 'True' if this match is uncertain.
involves_coreference -- 'True' if this match was found using coreference resolution.
overall_similarity_measure -- the overall similarity of the match, or
'1.0' if embedding-based matching was not involved in the match.
word_matches -- an array of dictionaries with the properties:
search_phrase_token_index -- the index of the token that matched from the search phrase.
search_phrase_word -- the string that matched from the search phrase.
document_token_index -- the index of the token that matched within the document.
first_document_token_index -- the index of the first token that matched within the document.
Identical to 'document_token_index' except where the match involves a multiword phrase.
last_document_token_index -- the index of the last token that matched within the document
(NOT one more than that index). Identical to 'document_token_index' except where the match
involves a multiword phrase.
structurally_matched_document_token_index -- the index of the token within the document that
structurally matched the search phrase token. Is either the same as 'document_token_index' or
is linked to 'document_token_index' within a coreference chain.
document_subword_index -- the index of the token subword that matched within the document, or
'None' if matching was not with a subword but with an entire token.
document_subword_containing_token_index -- the index of the document token that contained the
subword that matched, which may be different from 'document_token_index' in situations where a
word containing multiple subwords is split by hyphenation and a subword whose sense
contributes to a word is not overtly realised within that word.
document_word -- the string that matched from the document.
document_phrase -- the phrase headed by the word that matched from the document.
match_type -- 'direct', 'derivation', 'entity', 'embedding', 'ontology', 'entity_embedding'
or 'question'.
negated -- 'True' if this word match is negated.
uncertain -- 'True' if this word match is uncertain.
similarity_measure -- for types 'embedding' and 'entity_embedding', the similarity between the
two tokens, otherwise '1.0'.
involves_coreference -- 'True' if the word was matched using coreference resolution.
extracted_word -- within the coreference chain, the most specific term that corresponded to
the document_word.
depth -- the number of hyponym relationships linking 'search_phrase_word' and
'extracted_word', or '0' if ontology-based matching is not active. Can be negative
if symmetric matching is active.
explanation -- creates a human-readable explanation of the word match from the perspective of the
document word (e.g. to be used as a tooltip over it).
Dictionaries returned from Manager.topic_match_documents_against()
A text-only representation of a topic match between a search text and a document.
Properties:
document_label -- the label of the document.
text -- the document text that was matched.
text_to_match -- the search text.
rank -- a string representation of the scoring rank which can have the form e.g. '2=' in case of a tie.
index_within_document -- the index of the document token where the activation peaked.
subword_index -- the index of the subword within the document token where the activation peaked, or
'None' if the activation did not peak at a specific subword.
start_index -- the index of the first document token in the topic match.
end_index -- the index of the last document token in the topic match (NOT one more than that index).
sentences_start_index -- the token start index within the document of the sentence that contains
'start_index'
sentences_end_index -- the token end index within the document of the sentence that contains
'end_index' (NOT one more than that index).
sentences_character_start_index_in_document -- the character index of the first character of 'text'
within the document.
sentences_character_end_index_in_document -- one more than the character index of the last
character of 'text' within the document.
score -- the score
word_infos -- an array of arrays with the semantics:
[0] -- 'relative_start_index' -- the index of the first character in the word relative to
'sentences_character_start_index_in_document'.
[1] -- 'relative_end_index' -- one more than the index of the last character in the word
relative to 'sentences_character_start_index_in_document'.
[2] -- 'type' -- 'single' for a single-word match, 'relation' if within a relation match
involving two words, 'overlapping_relation' if within a relation match involving three
or more words.
[3] -- 'is_highest_activation' -- 'True' if this was the word at which the highest activation
score reported in 'score' was achieved, otherwise 'False'.
[4] -- 'explanation' -- a human-readable explanation of the word match from the perspective of
the document word (e.g. to be used as a tooltip over it).
answers -- an array of arrays with the semantics:
[0] -- the index of the first character of a potential answer to an initial question word.
[1] -- one more than the index of the last character of a potential answer to an initial question
word.
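The '2=' form of the rank property can be reproduced with a small sketch of the tie-labelling rule (tied_result_quotient, documented with Manager.topic_match_documents_against() above). This is an illustration of the rule as described, not the library's code:

```python
def rank_strings(scores, tied_result_quotient=0.9):
    """Sketch: given result scores sorted in descending order, produce rank
    strings such as '1' or '2=', where consecutive results whose quotient
    exceeds tied_result_quotient share a rank marked with '='."""
    group_starts = []
    for index, score in enumerate(scores):
        if index > 0 and score / scores[index - 1] > tied_result_quotient:
            group_starts.append(group_starts[-1])  # tied with previous result
        else:
            group_starts.append(index)  # a new, lower-scored rank group
    return [
        f"{start + 1}=" if group_starts.count(start) > 1 else f"{start + 1}"
        for start in group_starts
    ]
```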
Earlier versions of Holmes could only be published under a restrictive license because of patent issues. As explained in the introduction, this is no longer the case thanks to the generosity of AstraZeneca: versions from 4.0.0 onwards are licensed under the MIT license.
The word-level matching and the high-level operation of structural matching between search-phrase and document subgraphs both work more or less as one would expect. What is perhaps more in need of further comment is the semantic analysis code subsumed in the parsing.py script as well as in the language_specific_rules.py script for each language.
SemanticAnalyzer is an abstract class that is subclassed for each language: at present by EnglishSemanticAnalyzer and GermanSemanticAnalyzer. These classes contain most of the semantic analysis code. SemanticMatchingHelper is a second abstract class, again with a concrete implementation for each language, that contains the semantic analysis code required at matching time. Moving this out to a separate class family was necessary because, on operating systems that spawn processes rather than forking them (eg Windows), SemanticMatchingHelper instances have to be serialized when the worker processes are created: this would not be possible for SemanticAnalyzer instances because not all spaCy models are serializable, and it would also unnecessarily consume large amounts of memory.
At present, all functionality that is common to the two languages is realised in the two abstract parent classes. Especially because English and German are closely related languages, it is probable that some functionality will need to be moved from the abstract parent classes to the language-specific child classes if and when semantic analyzers are added for new languages.
The HolmesDictionary class is defined as a spaCy extension attribute that is accessed using the syntax token._.holmes . The most important information in the dictionary is a list of SemanticDependency objects. These are derived from the dependency relationships in the spaCy output ( token.dep_ ) but go through a considerable amount of processing to make them 'less syntactic' and 'more semantic'. To give but a few examples:
Some new semantic dependency labels that do not occur in spaCy outputs as values of token.dep_ are added for Holmes semantic dependencies. It is important to understand that Holmes semantic dependencies are used exclusively for matching and are therefore neither intended nor required to form a coherent set of linguistic theoretical entities or relationships; whatever works best for matching is assigned on an ad-hoc basis.
For each language, the match_implication_dict dictionary maps search-phrase semantic dependencies to matching document semantic dependencies and is responsible for the asymmetry of matching between search phrases and documents.
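A loose illustration of this kind of syntactic-to-semantic rewriting, using invented labels and a simplified stand-in for the real SemanticDependency class:

```python
from dataclasses import dataclass


@dataclass
class SemanticDependency:
    # Simplified stand-in for the real class: indexes refer to tokens.
    parent_index: int
    child_index: int
    label: str


# For "A cat was chased by a dog", a syntactic parse attaches 'cat' to
# 'chased' as a passive subject and 'dog' inside a 'by' agent phrase.
# Rewriting towards semantics makes 'dog' the logical subject and 'cat'
# the logical object, so the sentence can match the same structures as
# "A dog chased a cat". (The labels here are illustrative only.)
tokens = ["A", "cat", "was", "chased", "by", "a", "dog"]
semantic_dependencies = [
    SemanticDependency(parent_index=3, child_index=6, label="subj"),
    SemanticDependency(parent_index=3, child_index=1, label="obj"),
]
logical_subject = tokens[semantic_dependencies[0].child_index]  # 'dog'
logical_object = tokens[semantic_dependencies[1].child_index]   # 'cat'
```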
Topic matching involves the following steps:
- Phraselets are derived from the query phrase or query document. Certain stop words are defined for topic matching ( SemanticMatchingHelper.topic_matching_phraselet_stop_lemmas ), which are consistently ignored throughout the whole process.
- Single-word and relation phraselets are generated according to the templates defined in SemanticMatchingHelper.phraselet_templates .
- Some relation phraselets are marked as reverse-only, either because their parent word belongs to a closed-class list ( SemanticMatchingHelper.topic_matching_reverse_only_parent_lemmas ) or when the frequency factor for the parent word is below the threshold for relation matching ( relation_matching_frequency_threshold , default: 0.25). These measures are necessary because matching on eg a parent preposition would lead to a large number of potential matches that would take a lot of resources to investigate: it is better to start investigation from the less frequent word within a given relation.
- Single-word phraselets are matched against the documents. Relation matching is then attempted at those single-word matches whose frequency factor is above the threshold ( relation_matching_frequency_threshold , default: 0.25).
- Where the frequency factor of a matched word is above the threshold for embedding-based matching ( embedding_matching_frequency_threshold , default: 0.5), matching at all of those words where the relation template has not already been matched is retried using embeddings at the other word within the relation. A pair of words is then regarded as matching when their mutual cosine similarity is above initial_question_word_embedding_match_threshold (default: 0.7) in situations where the document word has an initial question word in its phrase, or word_embedding_match_threshold (default: 0.8) in all other situations.
- Each match contributes an activation score at its position in the document. If use_frequency_factor is set to True (the default), each score is scaled by the frequency factor of its phraselet, meaning that words that occur less frequently in the corpus give rise to higher scores.
- As the library reads through a document, the current activation decays: a fraction of its value determined by the maximum activation distance ( maximum_activation_distance ; default: 75) is subtracted from it as each new word is read.
- The score added depends on what was matched: a single noun ( single_word_score ; default: 50), a non-noun single-word phraselet or a noun phraselet that matched a subword ( single_word_any_tag_score ; default: 20), a relation phraselet produced by a reverse-only template ( reverse_only_relation_score ; default: 200), any other (normally matched) relation phraselet ( relation_score ; default: 300), or a relation phraselet involving an initial question word ( initial_question_word_answer_score ; default: 600).
- Where a match involved embedding-based matching, its score is multiplied by a penalty ( embedding_penalty ; default: 0.6).
- Where a match involved the ontology, its score is multiplied by a penalty ( ontology_penalty ; default: 0.9) once more often than the difference in depth between the two ontology entries, ie once for a synonym, twice for a child, three times for a grandchild and so on.
- Where two relation matches involve a common document word, their scores are multiplied by a factor ( overlapping_relation_multiplier ; default: 1.5).
- The words around each activation peak (up to a maximum extent either side of the peak: sideways_match_extent ; default: 100 words) within which the activation score is higher than the different_match_cutoff_score (default: 15) are regarded as belonging to a contiguous passage around the peak that is then returned as a TopicMatch object. (Note that this default will almost certainly turn out to be too low if use_frequency_factor is set to False .) A word whose activation equals the threshold exactly is included at the beginning of the area as long as the next word where activation increases has a score above the threshold. If the topic match peak is below the threshold, the topic match will only consist of the peak word.
- If initial_question_word_behaviour is set to process (the default) or to exclusive , where a document word has matched an initial question word from the query phrase, the subtree of the matched document word is identified as a potential answer to the question and added to the dictionary to be returned. If initial_question_word_behaviour is set to exclusive , any topic matches that do not contain answers to initial question words are discarded.
- Setting only_one_result_per_document = True prevents more than one result from being returned from the same document; only the result from each document with the highest score will then be returned.
- Results whose quotient with the preceding, higher-scored result is above tied_result_quotient (default: 0.9) are labelled as tied.
The supervised document classification use case relies on the same phraselets as the topic matching use case, although reverse-only templates are ignored and a different set of stop words is used ( SemanticMatchingHelper.supervised_document_classification_phraselet_stop_lemmas ). Classifiers are built and trained as follows:
- Phraselet matches are collected from the training documents. Whether each match is counted once per document or every time it occurs depends on oneshot; whether single-word phraselets are generated for all words with their own meaning or only for those such words whose part-of-speech tags match the single-word phraselet template specification (essentially: noun phraselets) depends on the value of match_all_words. Wherever two phraselet matches overlap, a combined match is recorded. Combined matches are treated in the same way as other phraselet matches in further processing. This means that effectively the algorithm picks up one-word, two-word and three-word semantic combinations. See here for a discussion of the performance of this step.
- Phraselets are discarded where they occur fewer times than a minimum (minimum_occurrences; default: 4) or where the coefficient of variation (the standard deviation divided by the arithmetic mean) of the occurrences across the categories is below a threshold (cv_threshold; default: 1.0).
- A multilayer perceptron is trained whose inputs record the phraselet matches for each document (as binary flags or as counts, for oneshot==True vs. oneshot==False respectively). The outputs are the category labels, including any additional labels determined via a classification ontology. By default, the multilayer perceptron has three hidden layers where the first hidden layer has the same number of neurons as the input layer and the second and third layers have sizes in between the input and the output layer with an equally sized step between each size; the user is however free to specify any other topology.

Holmes code is formatted with black.
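The phraselet-filtering rule and the default network topology described above can be illustrated with a short sketch. This is not Holmes's actual implementation; the function names are ours, and the use of the population standard deviation in the coefficient of variation is an assumption.

```python
from statistics import mean, pstdev

def keep_phraselet(occurrences_per_category, minimum_occurrences=4, cv_threshold=1.0):
    """A phraselet is discarded when it occurs too rarely overall, or when its
    occurrences are spread so evenly across the categories (low coefficient of
    variation) that it carries no discriminative information."""
    if sum(occurrences_per_category) < minimum_occurrences:
        return False
    m = mean(occurrences_per_category)
    if m == 0:
        return False
    return pstdev(occurrences_per_category) / m >= cv_threshold

def default_hidden_layer_sizes(input_size, output_size):
    """Three hidden layers: the first matches the input layer, and the second
    and third step down towards the output layer in equal increments."""
    step = (input_size - output_size) // 3
    return [input_size, input_size - step, input_size - 2 * step]
```

For example, a phraselet occurring ten times but only in one category is kept, while one occurring five times in each of three categories is discarded; with 100 input and 10 output neurons the default hidden layers would have sizes 100, 70 and 40.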
The complexity of what Holmes does makes development impossible without a robust set of over 1400 regression tests. These can be executed individually with unittest or all at once by running the pytest utility from the Holmes source code root directory. (Note that the Python 3 command on Linux is pytest-3.)
The pytest variant will only work on machines with sufficient memory resources. To mitigate this problem, the tests are distributed across three subdirectories, so that pytest can be run three times, once from each subdirectory.
New languages can be added to Holmes by subclassing the SemanticAnalyzer and SemanticMatchingHelper classes as explained here.
The sets of matching semantic dependencies captured in the _matching_dep_dict dictionary for each language have been obtained on the basis of a mixture of linguistic-theoretical expectations and trial and error. The results would probably be improved if the _matching_dep_dict dictionaries could be derived using machine learning instead; as yet this has not been attempted because of the lack of appropriate training data.
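To illustrate the kind of information _matching_dep_dict captures, here is a hypothetical sketch. The entries and the helper function below are invented for illustration and do not correspond to the actual English or German tables, which are larger and language-specific.

```python
# Hypothetical illustration only: each key is a semantic dependency label as it
# appears in a search phrase; the value lists the document dependency labels
# that are allowed to match it.
MATCHING_DEP_DICT = {
    "nsubj": ["nsubj"],
    # e.g. an active object may plausibly match a passive subject:
    # "the dog chases the cat" / "the cat is chased"
    "dobj": ["dobj", "nsubjpass"],
}

def dependencies_match(search_phrase_dep: str, document_dep: str) -> bool:
    """True when the document dependency is listed as matchable for the
    search-phrase label; labels absent from the table only match themselves."""
    return document_dep in MATCHING_DEP_DICT.get(search_phrase_dep, [search_phrase_dep])
```

Deriving such tables with machine learning, as suggested above, would amount to learning these label sets from examples of sentence pairs known to express the same relation.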
An attempt should be made to remove personal data from supervised document classification models to make them more compliant with data protection laws.
In cases where embedding-based matching is not active, the second step of the supervised document classification procedure repeats a considerable amount of processing from the first step. Retaining the relevant information from the first step of the procedure would greatly improve training performance. This has not been attempted up to now because a large number of tests would be required to prove that such performance improvements did not have any inadvertent impacts on functionality.
The topic matching and supervised document classification use cases are both configured with a number of hyperparameters that are presently set to best-guess values derived on a purely theoretical basis. Results could be further improved by testing the use cases with a variety of hyperparameters to learn the optimal values.
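Given a labelled evaluation set and a scoring function, one straightforward way to tune such hyperparameters would be an exhaustive grid search. The sketch below is generic, not part of Holmes; the toy objective stands in for whatever real evaluation metric (e.g. accuracy against labelled topic matches) is chosen.

```python
from itertools import product

def grid_search(evaluate, grid):
    """Try every combination of candidate hyperparameter values and return the
    best-scoring assignment. `grid` maps parameter names to lists of candidate
    values; `evaluate` must return a score to maximise."""
    best_params, best_score = None, float("-inf")
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        score = evaluate(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy objective standing in for a real evaluation against labelled data;
# it peaks at the current default values.
def toy_evaluate(relation_score, embedding_penalty):
    return -((relation_score - 300) ** 2) - (embedding_penalty - 0.6) ** 2

best, score = grid_search(
    toy_evaluate,
    {"relation_score": [200, 300, 400], "embedding_penalty": [0.5, 0.6, 0.7]},
)
```

In practice the grid would cover the thresholds and scores listed above, and a held-out set would be needed to guard against overfitting the hyperparameters to one corpus.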
The initial open-source version.
- Added the pobjp dependency, linking parents of prepositions directly with their children.
- Added multiprocessing support with the MultiprocessingManager object as its facade.
- Merged the Manager and MultiprocessingManager classes into a single Manager class, with a redesigned public interface, that uses worker threads for everything except supervised document classification.