Author: Richard Paul Hudson, Explosion AI
Holmes is a Python 3 library (v3.6-v3.10) running on spaCy (v3.1-v3.3) that supports a number of use cases involving information extraction from English and German texts. In all use cases, the information extraction is based on analysing the semantic relationships expressed by the component parts of each sentence:
In the chatbot use case, the system is configured using one or more search phrases. Holmes then looks for structures within the searched documents that correspond to these search phrases, where the documents are the individual snippets of text or speech entered by the end user. Within a match, each word in the search phrase that carries its own meaning (i.e. that does not merely fulfil a grammatical function) corresponds to one or more such words in the document. Both the fact that a search phrase has matched and any structured information the search phrase extracts can be used to drive the chatbot.
The structural extraction use case uses exactly the same structural matching technology as the chatbot use case, but searching takes place with respect to a pre-existing document or documents that are typically much longer than the snippets analysed in the chatbot use case, and the aim is to extract and store structured information. For example, a set of business articles could be searched to find all the places where one company is said to be planning to take over a second company. The identities of the companies concerned could then be stored in a database.
The topic matching use case aims to find passages in a document or documents whose meaning is close to that of another document, which takes on the role of the query document, or to that of a query phrase entered ad-hoc by the user. Holmes extracts a number of small phrases from the query phrase or query document, matches the documents being searched against each of them, and conflates the results to find the most relevant passages within the documents. Because there is no strict requirement that every meaning-carrying word in the query document match a specific word or words in the searched documents, more matches are found than in the structural extraction use case, but the matches do not contain structured information that can be used in subsequent processing. The topic matching use case is demonstrated by a website that allows searching within six Charles Dickens novels (for English) and around 350 traditional stories (for German).
The supervised document classification use case uses training data to learn a classifier that assigns one or more classification labels to new documents based on what they are about. It classifies a new document by matching it against phrases that were extracted from the training documents in the same way that phrases are extracted from the query document in the topic matching use case. The technique is inspired by word-based classification algorithms that use n-grams, but aims to derive n-grams whose component words are related semantically rather than merely being neighbours in the surface representation of a language.
In all four use cases, individual words are matched using a number of strategies. To work out whether two grammatical structures containing individually matching words correspond logically and constitute a match, Holmes transforms the syntactic parse information provided by the spaCy library into semantic structures that allow texts to be compared using predicate logic. As a user of Holmes, you do not need to understand the intricacies of how this works, although there are some important tips around writing effective search phrases for the chatbot and structural extraction use cases that you should try to take on board.
Holmes aims to offer generalist solutions that can be used more or less out of the box with relatively little tuning, tweaking or training and that are rapidly applicable to a wide range of use cases. At its core lies a logical, programmed, rule-based system that describes how syntactic representations in each language express semantic relationships. Although the supervised document classification use case does incorporate a neural network, and although the spaCy library on which Holmes builds has itself been pre-trained using machine learning, the essentially rule-based nature of Holmes' core means that the chatbot, structural extraction and topic matching use cases can be put to use without any training, and that the supervised document classification use case typically requires relatively little training data. This is an important advantage because pre-labelled training data is not available for many real-world problems.
Holmes has a long and complex history, and it is thanks to the goodwill and openness of several companies that we are now able to publish it under the MIT license. I, Richard Hudson, wrote the versions up to 3.0.0 while working at msg systems, an international software consultancy based near Munich. At the end of 2021 I changed employer and now work for Explosion, the maker of spaCy and Prodigy. Elements of the Holmes library are covered by a US patent that I wrote myself in the early 2000s while working at a startup called Definiens, which has since been acquired by AstraZeneca. With the kind permission of both AstraZeneca and msg systems, I now maintain Holmes at Explosion and can offer it for the first time under a permissive license: anyone can now use Holmes under the terms of the MIT license without having to worry about the patent.
The library was originally developed at msg systems but is now maintained at Explosion AI. Please direct any new issues or discussions to the Explosion repository.
If you do not already have Python 3 and pip on your machine, you will need to install them before installing Holmes.
Install Holmes with the following commands:
Linux:
pip3 install holmes-extractor
Windows:
pip install holmes-extractor
To upgrade from a previous version of Holmes, issue the following commands and then reissue the commands to download the spaCy and Coreferee models, to make sure you have the correct versions of them:
Linux:
pip3 install --upgrade holmes-extractor
Windows:
pip install --upgrade holmes-extractor
If you wish to use the examples and tests, clone the source code with:
git clone https://github.com/explosion/holmes-extractor
If you wish to experiment with changing the source code, you can override the installed code by starting Python (type python3 (Linux) or python (Windows)) from the directory that is the parent of your changed holmes_extractor module code. If you have checked Holmes out of Git, this will be the holmes-extractor directory.
Should you wish to uninstall Holmes again, this is achieved by deleting the installed files directly from the file system. The directory where the code is installed can be found by issuing the following from a Python command prompt started from any directory other than the parent directory of holmes_extractor:
import holmes_extractor
print(holmes_extractor.__file__)
The spaCy and Coreferee libraries on which Holmes builds require language-specific models that have to be downloaded separately before Holmes can be used:
Linux/English:
python3 -m spacy download en_core_web_trf
python3 -m spacy download en_core_web_lg
python3 -m coreferee install en
Linux/German:
pip3 install spacy-lookups-data # (from spaCy 3.3 onwards)
python3 -m spacy download de_core_news_lg
python3 -m coreferee install de
Windows/English:
python -m spacy download en_core_web_trf
python -m spacy download en_core_web_lg
python -m coreferee install en
Windows/German:
pip install spacy-lookups-data # (from spaCy 3.3 onwards)
python -m spacy download de_core_news_lg
python -m coreferee install de
If you plan to run the regression tests:
Linux:
python3 -m spacy download en_core_web_sm
Windows:
python -m spacy download en_core_web_sm
You specify the spaCy model for Holmes to use when you instantiate the Manager facade class. en_core_web_trf and de_core_news_lg are the models that have been found to give the best results for English and German respectively. Because en_core_web_trf does not have its own word vectors, but Holmes requires word vectors for embedding-based matching, en_core_web_lg is loaded as a vector source whenever en_core_web_trf is specified to the Manager class as the main model.
The en_core_web_trf model requires considerably more resources than the other models; in situations where resources are scarce, using en_core_web_lg as the main model instead may be a sensible compromise.
The best way of integrating Holmes into a non-Python environment is to wrap it as a RESTful HTTP service and deploy it as a microservice. See here for an example.
Because Holmes performs complex, intelligent analysis, it inevitably requires more hardware resources than more traditional search frameworks. The use cases that involve loading documents (structural extraction and topic matching) are most immediately applicable to large but not huge corpora (e.g. all the documents belonging to a certain organisation, all the patents on a certain topic, all the books by a certain author). For cost reasons, Holmes would not be an appropriate tool with which to analyse the content of the entire internet!
That said, Holmes is scalable both vertically and horizontally. Given sufficient hardware, both use cases can be applied to an essentially unlimited number of documents by running Holmes on multiple machines, processing different documents on each machine, and conflating the results. Note that this strategy is already employed to distribute matching between the cores of a single machine: the Manager class starts a number of worker processes and distributes registered documents among them.
Holmes holds loaded documents in memory, which ties in with its intended use with large but not huge corpora. The performance of document loading, structural extraction and topic matching will all degrade if the operating system has to swap memory pages out to secondary storage, because Holmes may need to access memory from a variety of pages when processing a single sentence. It is therefore important to provide enough RAM on each machine to accommodate all the loaded documents.
Note the comments above about the relative resource requirements of the different models.
The chatbot use case is the easiest use case with which to get a quick impression of how Holmes works.
Here, one or more search phrases are defined to Holmes, and the searched documents are short sentences or paragraphs typed in interactively by the end user. In a real-life setting, the extracted information would be used to determine the flow of interaction with the end user. For testing and demonstration purposes, there is a console that displays its matches interactively. It can be started quickly from the Python command line (itself started from the operating system prompt with python3 (Linux) or python (Windows)) or from a Jupyter notebook.
The following code snippet can be entered line by line into the Python command line, into a Jupyter notebook or into an IDE. It registers the fact that you are interested in sentences about big dogs chasing cats and starts a demonstration chatbot console:
English:
import holmes_extractor as holmes
holmes_manager = holmes.Manager(model='en_core_web_lg', number_of_workers=1)
holmes_manager.register_search_phrase('A big dog chases a cat')
holmes_manager.start_chatbot_mode_console()
German:
import holmes_extractor as holmes
holmes_manager = holmes.Manager(model='de_core_news_lg', number_of_workers=1)
holmes_manager.register_search_phrase('Ein großer Hund jagt eine Katze')
holmes_manager.start_chatbot_mode_console()
If you now enter a sentence that corresponds to the search phrase, the console will display a match:
English:
Ready for input
A big dog chased a cat
Matched search phrase with text 'A big dog chases a cat':
'big'->'big' (Matches BIG directly); 'A big dog'->'dog' (Matches DOG directly); 'chased'->'chase' (Matches CHASE directly); 'a cat'->'cat' (Matches CAT directly)
German:
Ready for input
Ein großer Hund jagte eine Katze
Matched search phrase 'Ein großer Hund jagt eine Katze':
'großer'->'groß' (Matches GROSS directly); 'Ein großer Hund'->'hund' (Matches HUND directly); 'jagte'->'jagen' (Matches JAGEN directly); 'eine Katze'->'katze' (Matches KATZE directly)
This much could easily be achieved with a simple matching algorithm, so enter some more complex sentences to convince yourself that Holmes is really grasping them and that matches are still returned:
English:
The big dog would not stop chasing the cat
The big dog who was tired chased the cat
The cat was chased by the big dog
The cat always used to be chased by the big dog
The big dog was going to chase the cat
The big dog decided to chase the cat
The cat was afraid of being chased by the big dog
I saw a cat-chasing big dog
The cat the big dog chased was scared
The big dog chasing the cat was a problem
There was a big dog that was chasing a cat
The cat chase by the big dog
There was a big dog and it was chasing a cat.
I saw a big dog. My cat was afraid of being chased by the dog.
There was a big dog. His name was Fido. He was chasing my cat.
A dog appeared. It was chasing a cat. It was very big.
The cat sneaked back into our lounge because a big dog had been chasing her.
Our big dog was excited because he had been chasing a cat.
German:
Der große Hund hat die Katze ständig gejagt
Der große Hund, der müde war, jagte die Katze
Die Katze wurde vom großen Hund gejagt
Die Katze wurde immer wieder durch den großen Hund gejagt
Der große Hund wollte die Katze jagen
Der große Hund entschied sich, die Katze zu jagen
Die Katze, die der große Hund gejagt hatte, hatte Angst
Dass der große Hund die Katze jagte, war ein Problem
Es gab einen großen Hund, der eine Katze jagte
Die Katzenjagd durch den großen Hund
Es gab einmal einen großen Hund, und er jagte eine Katze
Es gab einen großen Hund. Er hieß Fido. Er jagte meine Katze
Es erschien ein Hund. Er jagte eine Katze. Er war sehr groß.
Die Katze schlich sich in unser Wohnzimmer zurück, weil ein großer Hund sie draußen gejagt hatte
Unser großer Hund war aufgeregt, weil er eine Katze gejagt hatte
The demonstration would not be complete without trying out some further sentences that contain the same words but do not express the same ideas, and observing that they do not match:
English:
The dog chased a big cat
The big dog and the cat chased about
The big dog chased a mouse but the cat was tired
The big dog always used to be chased by the cat
The big dog the cat chased was scared
Our big dog was upset because he had been chased by a cat.
The dog chase of the big cat
German:
Der Hund jagte eine große Katze
Die Katze jagte den großen Hund
Der große Hund und die Katze jagten
Der große Hund jagte eine Maus aber die Katze war müde
Der große Hund wurde ständig von der Katze gejagt
Der große Hund entschloss sich, von der Katze gejagt zu werden
Die Hundejagd durch die große Katze
In the examples above, Holmes matched a variety of different sentence-level structures with the same meaning, but the base forms of the three words matched in the document were always the same as the three words in the search phrase. Holmes provides several strategies for matching at the level of individual words. Combined with Holmes' ability to match different sentence structures, these strategies can enable a search phrase to match a document sentence that shares its meaning even when the two share no words and are grammatically completely different.
One of these additional word-matching strategies is named-entity matching: special words can be included in search phrases that match whole classes of names such as people or places. Exit the console by typing exit, then register a second search phrase and restart the console:
English:
holmes_manager.register_search_phrase('An ENTITYPERSON goes into town')
holmes_manager.start_chatbot_mode_console()
German:
holmes_manager.register_search_phrase('Ein ENTITYPER geht in die Stadt')
holmes_manager.start_chatbot_mode_console()
You have now registered your interest in people going into town and can enter appropriate sentences into the console:
English:
Ready for input
I met Richard Hudson and John Doe last week. They didn't want to go into town.
Matched search phrase with text 'An ENTITYPERSON goes into town'; negated; uncertain; involves coreference:
'Richard Hudson'->'ENTITYPERSON' (Has an entity label matching ENTITYPERSON); 'go'->'go' (Matches GO directly); 'into'->'into' (Matches INTO directly); 'town'->'town' (Matches TOWN directly)
Matched search phrase with text 'An ENTITYPERSON goes into town'; negated; uncertain; involves coreference:
'John Doe'->'ENTITYPERSON' (Has an entity label matching ENTITYPERSON); 'go'->'go' (Matches GO directly); 'into'->'into' (Matches INTO directly); 'town'->'town' (Matches TOWN directly)
German:
Ready for input
Letzte Woche sah ich Richard Hudson und Max Mustermann. Sie wollten nicht mehr in die Stadt gehen.
Matched search phrase with text 'Ein ENTITYPER geht in die Stadt'; negated; uncertain; involves coreference:
'Richard Hudson'->'ENTITYPER' (Has an entity label matching ENTITYPER); 'gehen'->'gehen' (Matches GEHEN directly); 'in'->'in' (Matches IN directly); 'die Stadt'->'stadt' (Matches STADT directly)
Matched search phrase with text 'Ein ENTITYPER geht in die Stadt'; negated; uncertain; involves coreference:
'Max Mustermann'->'ENTITYPER' (Has an entity label matching ENTITYPER); 'gehen'->'gehen' (Matches GEHEN directly); 'in'->'in' (Matches IN directly); 'die Stadt'->'stadt' (Matches STADT directly)
In each of the two languages, the final example demonstrates several further features of Holmes:

- The pronoun They/Sie refers back to the people named in the previous sentence and is resolved using coreference resolution.
- Because the people did not want to go into town, the matches are marked as negated.
- Because wanting to go into town does not entail actually going, the matches are marked as uncertain.
- The two people joined by and each yield a separate match: wherever conjunction occurs, Holmes distributes the information among multiple matches.
For more examples, please see section 5.
Each of the following strategies is implemented in its own Python module. Although the standard library does not support adding custom matching strategies via the Manager class, it would be relatively easy for anyone with Python programming skills to change the code to achieve this.
Direct matching (word_match.type=='direct') between search phrase words and document words is always active. The strategy relies mainly on matching stem forms of words, e.g. matching English buy and child to bought and children, and German steigen and Kind to stieg and Kinder. However, to increase the chances of direct matching working when the parser delivers an incorrect stem form for a word, the raw forms of the search phrase word and the document word are also taken into account.
Derivation-based matching (word_match.type=='derivation') involves distinct but related words that typically belong to different word classes, e.g. English assess and assessment, German jagen and Jagd. It is active by default but can be switched off using the analyze_derivational_morphology parameter, which is set when instantiating the Manager class.
Named-entity matching (word_match.type=='entity') is activated by inserting a special named-entity identifier in place of a noun within a search phrase, e.g.
An ENTITYPERSON goes into town (English)
Ein ENTITYPER geht in die Stadt (German)
The supported named-entity identifiers depend directly on the named-entity information supplied with the spaCy models for each language (descriptions copied from an earlier version of the spaCy documentation):
English:

| Identifier | Meaning |
|---|---|
| ENTITYNOUN | Any noun phrase. |
| ENTITYPERSON | People, including fictional. |
| ENTITYNORP | Nationalities or religious or political groups. |
| ENTITYFAC | Buildings, airports, highways, bridges, etc. |
| ENTITYORG | Companies, agencies, institutions, etc. |
| ENTITYGPE | Countries, cities, states. |
| ENTITYLOC | Non-GPE locations, mountain ranges, bodies of water. |
| ENTITYPRODUCT | Objects, vehicles, foods, etc. (Not services.) |
| ENTITYEVENT | Named hurricanes, battles, wars, sports events, etc. |
| ENTITYWORK_OF_ART | Titles of books, songs, etc. |
| ENTITYLAW | Named documents made into laws. |
| ENTITYLANGUAGE | Any named language. |
| ENTITYDATE | Absolute or relative dates or periods. |
| ENTITYTIME | Times smaller than a day. |
| ENTITYPERCENT | Percentage, including "%". |
| ENTITYMONEY | Monetary values, including unit. |
| ENTITYQUANTITY | Measurements, as of weight or distance. |
| ENTITYORDINAL | "first", "second", etc. |
| ENTITYCARDINAL | Numerals that do not fall under another type. |
German:

| Identifier | Meaning |
|---|---|
| ENTITYNOUN | Any noun phrase. |
| ENTITYPER | Named persons or families. |
| ENTITYLOC | Names of politically or geographically defined locations (cities, provinces, countries, international regions, bodies of water, mountains). |
| ENTITYORG | Named corporate, governmental, or other organizational entities. |
| ENTITYMISC | Miscellaneous entities, e.g. events, nationalities, products or works of art. |
We have added ENTITYNOUN to the genuine named-entity identifiers. In that it matches any noun phrase, it behaves similarly to a generic pronoun. The difference is that ENTITYNOUN has to match a specific noun phrase within a document, and that this specific noun phrase is extracted and can be used for further processing. ENTITYNOUN is not supported within the topic matching use case.
Ontology-based matching (word_match.type=='ontology') enables users to define relationships between words that are then taken into account when documents are matched to search phrases. The three relevant relationship types are hyponyms (something is a subtype of something), synonyms (something means the same as something) and named individuals (something is a specific instance of something). The three relationship types are exemplified in Figure 1:

Ontologies are defined to Holmes using the OWL ontology standard serialized as RDF/XML. Such ontologies can be generated with a variety of tools. For the Holmes examples and tests, the free tool Protégé was used, and it is recommended both for defining your own ontologies and for browsing the ontologies that ship with the examples and tests. When saving an ontology under Protégé, please select RDF/XML as the format. Protégé assigns standard labels to the hyponym, synonym and individual relationship types, which Holmes understands as defaults but which can also be overridden.
Ontology entries are defined using Internationalized Resource Identifiers (IRIs), e.g. http://www.semanticweb.org/hudsonr/ontologies/2019/0/animals#dog. Holmes only uses the final fragment of each IRI for matching, which allows homonyms (words with the same form but multiple meanings) to be defined at multiple points in the ontology tree.
Ontology-based matching gives the best results with Holmes when small ontologies are constructed that cover the terms and relationships of a specific subject domain and use case. For example, if you were implementing a chatbot for a building insurance use case, you should create a small ontology capturing the terms and relationships in that specific domain. Basing matching on a large ontology constructed for all the domains of a language, such as WordNet, is not recommended, because the many homonyms and relationships that only apply in narrow subject domains would lead to a large number of incorrect matches. For general use cases, embedding-based matching will tend to give better results.
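A minimal sketch of defining an ontology to Holmes, assuming a hypothetical RDF/XML file animals.owl that contains the Figure 1 relationships:

import holmes_extractor as holmes

# 'animals.owl' is a hypothetical ontology file, e.g. created with Protégé.
ontology = holmes.Ontology('animals.owl')
manager = holmes.Manager(model='en_core_web_lg', ontology=ontology,
                         number_of_workers=1)
manager.register_search_phrase('An animal chases a cat')
# If 'dog' lies within the subtree headed by 'animal' in the ontology,
# this document sentence can now match the search phrase.
print(manager.match(document_text='A dog chased a cat'))
manager.close()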
Each word in an ontology can be regarded as heading a subtree consisting of its hyponyms, synonyms and named individuals, those words' hyponyms, synonyms and named individuals, and so on. With an ontology set up in the standard fashion appropriate for the chatbot and structural extraction use cases, Holmes matches a word in a search phrase to a word in a document whenever the document word is within the subtree of the search phrase word. If the ontology in Figure 1 were defined to Holmes, then in addition to the matches yielded by the direct matching strategy, which matches each word to itself, the combinations implied by the subtree relationships in Figure 1 would match.
English phrasal verbs (like eat up) and German separable verbs (like aufessen) must be defined as single items within ontologies. When Holmes analyses a text and encounters such a verb, the main verb and the particle are conflated into a single logical word that can then be matched via an ontology. This means that eat up within a text would match the subtree of eat up within an ontology, but not the subtree of eat.
If derivation-based matching is active, it is taken into account on both sides of a potential ontology-based match. For example, if alter and amend were defined as synonyms in an ontology, alteration and amendment would also match each other.
In situations where finding relevant sentences matters more than guaranteeing the logical correspondence between document matches and search phrases, it can make sense to specify symmetric matching when defining an ontology. Symmetric matching is recommended for the topic matching use case, but is unlikely to be appropriate for the chatbot or structural extraction use cases. It means that hypernym (reverse hyponym) relationships are taken into account when matching, alongside hyponym and synonym relationships, leading to a more symmetric relationship between documents and search phrases. One important rule that applies when matching via a symmetric ontology is that a match path may not contain both hypernym and hyponym relationships, i.e. you cannot go back on yourself. With the ontology above defined as symmetric, combinations linked by hypernym relationships would match as well.
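The same hypothetical animals.owl file could be loaded with symmetric matching switched on, e.g. for topic matching:

import holmes_extractor as holmes

# Hypernym relationships will now also be taken into account.
symmetric_ontology = holmes.Ontology('animals.owl', symmetric_matching=True)
manager = holmes.Manager(model='en_core_web_lg', ontology=symmetric_ontology,
                         number_of_workers=1)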
Two separate ontologies can be used in the supervised document classification use case:
The structural matching ontology is used to analyse the content of the training and test documents. Each word in a document that is found in the ontology is replaced by its most general hypernym ancestor. It is important to realise that structural matching should only be performed with an ontology built specially for the purpose: such an ontology should consist of a number of separate trees representing the main classes of objects in the documents to be classified. With the example ontology shown above, all the words in the ontology would be replaced with animal. In the extreme case of a WordNet-style ontology, all nouns would end up being replaced with thing, which is obviously not a desirable outcome!
The classification ontology is used to capture relationships between classification labels: the fact that a document has a certain classification implies that it also has the classifications of the subtree to which that classification belongs. Synonyms should be used sparingly in classification ontologies because they increase the complexity of the neural network without adding any value. Although it is technically possible to build a classification ontology that uses symmetric matching, there is no sensible reason for doing so. Note that labels within a classification ontology that are not directly defined as the label of any training document must be registered specifically using the SupervisedTopicTrainingBasis.register_additional_classification_label() method if they are to be taken into account when training a classifier.
Embedding-based matching (word_match.type=='embedding') relies on the word embeddings supplied with spaCy: machine-learning-generated numerical vector representations of words that capture the contexts in which each word tends to occur. Two words with similar meanings tend to have word embeddings that are close to one another, and spaCy can measure the cosine similarity between the embeddings of any two words, which ranges from 0.0 (no similarity) to 1.0 (the same word). Because dog and cat tend to occur in similar contexts, they have a similarity of 0.80; dog and horse have less in common and a similarity of 0.62; dog and iron have a similarity of only 0.25. Embedding-based matching is only activated for nouns, adjectives and adverbs, because results were found to be unsatisfactory for other word classes.
It is important to understand that the fact that two words have similar embeddings does not imply the same logical relationship between them as when matching is based on an ontology: for example, the fact that dog and cat have similar embeddings implies neither that a dog is a type of cat nor that a cat is a type of dog. Whether or not embedding-based matching is nonetheless an appropriate choice depends on the functional use case.
For the chatbot, structural extraction and supervised document classification use cases, Holmes makes use of word-embedding-based similarity via the overall_similarity_threshold parameter, which is defined globally on the Manager class. A match is detected between a search phrase and a structure within a document whenever the geometric mean of the similarities between the individual corresponding word pairs is greater than this threshold. The intuition behind this technique is that, in a search phrase with e.g. six lexical words where five words match a document structure exactly and only one word matches via an embedding, the similarity required of that sixth word pair should be less than when only three of the words match exactly and two other words also have to correspond via embeddings.
Matching a search phrase to documents begins with finding document words that match the word at the root (syntactic head) of the search phrase. Holmes then investigates the structure around each such matched document word to check whether the document structure corresponds to the search phrase structure. Document words matching a search phrase root word are normally found using an index. However, if embeddings have to be considered when finding document words that match a search phrase root word, every word in every document has to be compared for similarity with that root word. This has a very noticeable performance impact that renders every use case except the chatbot use case essentially unusable.
To avoid this typically unnecessary performance hit, embedding-based matching on search phrase root words is controlled with the separate embedding_based_matching_on_root_words parameter, which is also set when instantiating the Manager class. You are advised to leave this setting switched off (value False) for most use cases.
Neither the overall_similarity_threshold nor the embedding_based_matching_on_root_words parameter has any effect on the topic matching use case. Here, word-level embedding similarity thresholds are set with the word_embedding_match_threshold and initial_question_word_embedding_match_threshold parameters on the call to topic_match_documents_against.
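A minimal sketch of these settings for the structural matching use cases; the threshold value 0.85 is purely illustrative:

import holmes_extractor as holmes

# Any overall_similarity_threshold below 1.0 activates embedding-based
# matching; root-word embedding matching stays off for performance reasons.
manager = holmes.Manager(model='en_core_web_lg',
                         overall_similarity_threshold=0.85,
                         embedding_based_matching_on_root_words=False,
                         number_of_workers=1)
manager.register_search_phrase('A dog chases a cat')
# 'hound' does not occur in the search phrase, but if its embedding is close
# enough to 'dog' for the geometric mean of the word similarities to exceed
# the threshold, the sentence can still match.
print(manager.match(document_text='A hound chased a cat'))
manager.close()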
Named-entity-embedding-based matching (word_match.type=='entity_embedding') obtains between a searched-document word that has a certain entity label and a search phrase or query document word whose embedding is similar to the underlying meaning of that entity label, e.g. between a document word labelled ENTITYPERSON and a search phrase word whose embedding is close to that of person. Note that named-entity-embedding-based matching is never active on root words, regardless of the embedding_based_matching_on_root_words setting.
Initial question word matching (word_match.type=='question') is only active during topic matching. Initial question words in query phrases are matched to phrases in the searched documents that represent potential answers, e.g. when the query phrase When did Peter have breakfast? is compared with the searched document Peter had breakfast at 8 a.m., when matches the temporal adverbial phrase at 8 a.m..
Initial question word matching is switched on and off with the initial_question_word_behaviour parameter when calling the topic_match_documents_against function on the Manager class. It is only likely to be useful when topic matching is being performed in an interactive setting where the user enters short query phrases, as opposed to when it is being used to find documents on a similar topic to a pre-existing query document: initial question words are only processed at the beginning of the first sentence of the query phrase or query document.
Linguistically speaking, if a query phrase consists of a complex question with several elements dependent on the main verb, a finding in a searched document is only an 'answer' if it contains matches to all these elements. Because recall is typically more important than precision when performing topic matching with interactive query phrases, however, Holmes will match an initial question word to a searched-document phrase wherever the two correspond semantically (e.g. wherever when corresponds to a temporal adverbial phrase) and each depends on a verb that itself matches at the word level. One possible strategy for filtering out 'incomplete answers' would be to calculate the maximum possible score for a query phrase and reject topic matches that score below a threshold scaled to this maximum.
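A minimal sketch of initial question word matching, using the breakfast example above; the document label is illustrative:

import holmes_extractor as holmes

manager = holmes.Manager(model='en_core_web_lg', number_of_workers=1)
manager.parse_and_register_document('Peter had breakfast at 8 a.m.',
                                    label='diary')
topic_matches = manager.topic_match_documents_against(
    'When did Peter have breakfast?',
    initial_question_word_behaviour='process')
for topic_match in topic_matches:
    # 'answers' holds character ranges of potential answers, here expected
    # to cover the phrase 'at 8 a.m.'.
    print(topic_match['text'], topic_match['answers'])
manager.close()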
Before Holmes analyses a searched document or query document, coreference resolution is performed using the Coreferee library running on top of spaCy. This means that situations are recognised where pronouns and nouns that are located near one another within a text refer to the same entities. The information from one mention can then be applied to the analysis of further mentions:
I saw a big dog. It was chasing a cat.
I saw a big dog. The dog was chasing a cat.
Coreferee also detects situations where a noun refers back to a named entity:
We discussed AstraZeneca. The company had given us permission to publish this library under the MIT license.
If this example were to match the search phrase A company gives permission to publish something, the coreference information that the company under discussion is AstraZeneca is clearly relevant and worth extracting in addition to the word(s) directly matched to the search phrase. Such information is captured in the word_match.extracted_word field.
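A minimal sketch of accessing this extracted information programmatically, reusing the AstraZeneca example; the match dictionary fields used here are documented in the reference section below:

import holmes_extractor as holmes

manager = holmes.Manager(model='en_core_web_lg', number_of_workers=1)
manager.register_search_phrase('A company gives permission to publish something')
manager.parse_and_register_document(
    'We discussed AstraZeneca. The company had given us permission to '
    'publish this library under the MIT license.', label='example')
for match in manager.match():
    for word_match in match['word_matches']:
        # For the word matched to 'company', 'extracted_word' is expected to
        # be 'AstraZeneca' rather than 'company'.
        print(word_match['document_word'], '->', word_match['extracted_word'])
manager.close()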
The concept of search phrases has already been introduced and is relevant to the chatbot use case, the structural extraction use case and to preselection within the supervised document classification use case.
It is crucial to understand that the tips and limitations set out in Section 4 do not apply in any way to query phrases in topic matching. If you are using Holmes for topic matching only, you can completely ignore this section!
Structural matching between search phrases and documents is not symmetric: there are many situations in which sentence X as a search phrase would match sentence Y within a document but where the converse would not be true. Although Holmes does its best to understand any search phrases, the results are better when the user writing them follows certain patterns and tendencies, and getting to grips with these patterns and tendencies is the key to using the relevant features of Holmes successfully.
Holmes distinguishes between lexical words like dog, chase and cat (English) or Hund, jagen and Katze (German) in the initial example above, and grammatical words like a (English) or ein and eine (German) in the initial example above. Only lexical words match words in documents, but grammatical words still play a crucial role within a search phrase: they enable Holmes to understand it.
Dog chase cat (English)
Hund jagen Katze (German)
contain the same lexical words as the search phrases in the initial example above, but as they are not grammatical sentences Holmes is liable to misunderstand them if they are used as search phrases. This is a major difference between Holmes search phrases and the search phrases you use instinctively with standard search engines like Google, and it can take some getting used to.
A search phrase need not contain a verb:
ENTITYPERSON (English)
A big dog (English)
Interest in fishing (English)
ENTITYPER (German)
Ein großer Hund (German)
Interesse am Angeln (German)
are all perfectly valid and potentially useful search phrases.
Where a verb is present, however, Holmes delivers the best results when the verb is in the present active , as chases and jagt are in the initial example above. This gives Holmes the best chance of understanding the relationship correctly and of matching the widest range of document structures that share the target meaning.
Sometimes you may only wish to extract the object of a verb. For example, you might want to find sentences that are discussing a cat being chased regardless of who is doing the chasing. In order to avoid a search phrase containing a passive expression like
A cat is chased (English)
Eine Katze wird gejagt (German)
you can use a generic pronoun. This is a word that Holmes treats like a grammatical word in that it is not matched to documents; its sole purpose is to help the user form a grammatically optimal search phrase in the present active. Recognised generic pronouns are English something, somebody and someone and German jemand (and inflected forms of jemand) and etwas: Holmes treats them all as equivalent. Using generic pronouns, the passive search phrases above could be re-expressed as
Somebody chases a cat (English)
Jemand jagt eine Katze (German).
Experience shows that different prepositions are often used with the same meaning in equivalent phrases and that this can prevent search phrases from matching where one would intuitively expect it. For example, the search phrases
Somebody is at the market (English)
Jemand ist auf dem Marktplatz (German)
would fail to match the document phrases
Richard was in the market (English)
Richard war am Marktplatz (German)
The best way of solving this problem is to define the prepositions in question as synonyms in an ontology.
The following types of structures are prohibited in search phrases and result in Python user-defined errors:
A dog chases a cat. A cat chases a dog (English)
Ein Hund jagt eine Katze. Eine Katze jagt einen Hund (German)
Each clause must be separated out into its own search phrase and registered individually.
A dog does not chase a cat. (English)
Ein Hund jagt keine Katze. (German)
Negative expressions are recognised as such in documents and the generated matches marked as negative; allowing search phrases themselves to be negative would overcomplicate the library without offering any benefits.
A dog and a lion chase a cat. (English)
Ein Hund und ein Löwe jagen eine Katze. (German)
Wherever conjunction occurs in documents, Holmes distributes the information among multiple matches as explained above. In the unlikely event that there should be a requirement to capture conjunction explicitly when matching, this could be achieved by using the Manager.match() function and looking for situations where the document token objects are shared by multiple match objects.
The (English)
Der (German)
A search phrase cannot be processed if it does not contain any words that can be matched to documents.
A dog chases a cat and he chases a mouse (English)
Ein Hund jagt eine Katze und er jagt eine Maus (German)
Pronouns that corefer with nouns elsewhere in the search phrase are not permitted as this would overcomplicate the library without offering any benefits.
The following types of structures are strongly discouraged in search phrases:
Dog chase cat (English)
Hund jagen Katze (German)
Although these will sometimes work, the results will be better if search phrases are expressed grammatically.
A cat is chased by a dog (English)
A dog will have chased a cat (English)
Eine Katze wird durch einen Hund gejagt (German)
Ein Hund wird eine Katze gejagt haben (German)
Although these will sometimes work, the results will be better if verbs in search phrases are expressed in the present active.
Who chases the cat? (English)
Wer jagt die Katze? (German)
Although questions are supported as query phrases in the topic matching use case, they are not appropriate as search phrases. Questions should be re-phrased as statements, in this case
Something chases the cat (English)
Etwas jagt die Katze (German).
Informationsextraktion (German)
Ein Stadtmittetreffen (German)
The internal structure of German compound words is analysed within searched documents as well as within query phrases in the topic matching use case, but not within search phrases. In search phrases, compound words should be reexpressed as genitive constructions even in cases where this does not strictly capture their meaning:
Extraktion der Information (German)
Ein Treffen der Stadtmitte (German)
The following types of structures should be used with caution in search phrases:
A fierce dog chases a scared cat on the way to the theatre (English)
Ein kämpferischer Hund jagt eine verängstigte Katze auf dem Weg ins Theater (German)
Holmes can handle any level of complexity within search phrases, but the more complex a structure, the less likely it becomes that a document sentence will match it. If it is really necessary to match such complex relationships with search phrases rather than with topic matching, they are typically better extracted by splitting the search phrase up, eg
A fierce dog (English)
A scared cat (English)
A dog chases a cat (English)
Something chases something on the way to the theatre (English)
Ein kämpferischer Hund (German)
Eine verängstigte Katze (German)
Ein Hund jagt eine Katze (German)
Etwas jagt etwas auf dem Weg ins Theater (German)
Correlations between the resulting matches can then be established by matching via the Manager.match() function and looking for situations where the document token objects are shared across multiple match objects.
One possible exception to this piece of advice is when embedding-based matching is active. Because whether or not each word in a search phrase matches then depends on whether or not other words in the same search phrase have been matched, large, complex search phrases can sometimes yield results that a combination of smaller, simpler search phrases would not.
The chasing of a cat (English)
Die Jagd einer Katze (German)
These will often work, but it is generally better practice to use verbal search phrases like
Something chases a cat (English)
Etwas jagt eine Katze (German)
and to allow the corresponding nominal phrases to be matched via derivation-based matching.
The chatbot use case has already been introduced: a predefined set of search phrases is used to extract information from phrases entered interactively by an end user, which in this use case act as the documents.
The Holmes source code ships with two examples demonstrating the chatbot use case, one for each language, with predefined ontologies. Having cloned the source code and installed the Holmes library, navigate to the /examples directory and type the following (Linux):
English:
python3 example_chatbot_EN_insurance.py
German:
python3 example_chatbot_DE_insurance.py
or click on the files in Windows Explorer (Windows).
Holmes matches syntactically distinct structures that are semantically equivalent, i.e. that share the same meaning. In a real chatbot use case, users will typically enter equivalent information with phrases that are semantically distinct as well, i.e. that have different meanings. Because the effort involved in registering a search phrase is barely greater than the time it takes to type it in, it makes sense to register a large number of search phrases for each relationship you are trying to extract: essentially all ways people have been observed to express the information you are interested in, or all ways you can imagine somebody might express the information you are interested in. To assist this, search phrases can be registered with labels that do not need to be unique: a label can then be used to express the relationship an entire group of search phrases is designed to extract. Note that when many search phrases have been defined to extract the same relationship, a single user entry is likely to be sometimes matched by multiple search phrases. This must be handled appropriately by the calling application.
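For example, assuming a Manager instance holmes_manager as created in the snippets above, several search phrases expressing the same hypothetical insurance relationship could be registered under one shared label:

# The label identifies the relationship the whole group is designed to extract.
holmes_manager.register_search_phrase('An ENTITYPERSON takes out insurance',
                                      label='person-buys-insurance')
holmes_manager.register_search_phrase('An ENTITYPERSON buys a policy',
                                      label='person-buys-insurance')
holmes_manager.register_search_phrase('An ENTITYPERSON signs an insurance contract',
                                      label='person-buys-insurance')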
One obvious weakness of Holmes in the chatbot setting is its sensitivity to correct spelling and, to a lesser extent, to correct grammar; there are various strategies for mitigating this weakness.
The structural extraction use case uses structural matching in the same way as the chatbot use case, and many of the same comments and tips apply to it. The principal differences are that pre-existing and often lengthy documents are scanned rather than text snippets entered ad-hoc by the user, and that the returned match objects are not used to drive a dialog flow; they are examined solely to extract and store structured information.
Code for performing structural extraction would typically perform the following tasks:
- Call Manager.register_search_phrase() several times to define a number of search phrases specifying the information to be extracted.
- Call Manager.parse_and_register_document() several times to load a number of documents within which to search.
- Call Manager.match() to perform the matching (see the sketch below).
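A minimal sketch of these three steps; the search phrase, document text and labels are illustrative:

import holmes_extractor as holmes

manager = holmes.Manager(model='en_core_web_lg', number_of_workers=1)
manager.register_search_phrase('A company takes over a company',
                               label='takeover')
manager.parse_and_register_document(
    'BigCorp announced that it was planning to take over LittleCorp.',
    label='business-article-1')
for match in manager.match():
    # Extract the matched words, e.g. for storage in a database.
    print(match['search_phrase_label'],
          [word_match['extracted_word'] for word_match in match['word_matches']])
manager.close()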
The topic matching use case matches a query document, or alternatively a query phrase entered ad-hoc by the user, against a set of documents pre-loaded into memory. The aim is to find the passages in the documents whose topic most closely corresponds to the topic of the query document; the output is an ordered list of passages scored according to topic similarity. Additionally, if a query phrase contains an initial question word, the output will contain potential answers to the question.

Topic matching queries may contain generic pronouns and named-entity identifiers just like search phrases, although the ENTITYNOUN token is not supported. However, an important difference from search phrases is that the topic matching use case places no restrictions on the grammatical structures permissible within the query document.
In addition to the Holmes demonstration website, the Holmes source code ships with three examples demonstrating the topic matching use case with an English literature corpus, a German literature corpus and a German legal corpus respectively. Users are encouraged to run these to get a feel for how they work.
Topic matching uses a variety of strategies to find text passages that are relevant to the query. These include resource-hungry procedures like investigating semantic relationships and comparing embeddings. Because applying these across the board would prevent topic matching from scaling, Holmes only attempts them for specific areas of the text that less resource-intensive strategies have already marked as looking promising. This and the other interior workings of topic matching are explained here.
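The following minimal sketch shows a topic matching call; the single-document corpus and the query phrase are illustrative:

import holmes_extractor as holmes

manager = holmes.Manager(model='en_core_web_lg', number_of_workers=1)
manager.parse_and_register_document(
    'The cat was chased around the garden by the big dog.', label='story')
# Returns a list of result dictionaries ordered by score.
for topic_match in manager.topic_match_documents_against(
        'A dog chases a cat', number_of_results=5):
    print(topic_match['rank'], topic_match['score'],
          topic_match['document_label'], topic_match['text'])
manager.close()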
In the supervised document classification use case, a classifier is trained with a number of documents that are each pre-labelled with a classification. The trained classifier then assigns one or more labels to new documents according to what each new document is about. As explained here, ontologies can be used both to enrich the comparison of the content of the various documents and to capture implication relationships between classification labels.
A classifier makes use of a neural network (a multilayer perceptron) whose topology can either be determined automatically by Holmes or specified explicitly by the user. With a large number of training documents, the automatically determined topology can easily exhaust the memory available on a typical machine; if there is no opportunity to scale up the memory, this problem can be remedied by specifying a smaller number of hidden layers or a smaller number of nodes in one or more of the layers.
A trained document classification model retains no references to its training data. This is an advantage from a data protection viewpoint, although it cannot presently be guaranteed that models will not contain individual personal or company names.
A typical problem with the execution of many document classification use cases is that a new classification label is added when the system is already live but that there are initially no examples of this new classification with which to train a new model. The best course of action in such a situation is to define search phrases which preselect the more obvious documents with the new classification using structural matching. Those documents that are not preselected as having the new classification label are then passed to the existing, previously trained classifier in the normal way. When enough documents exemplifying the new classification have accumulated in the system, the model can be retrained and the preselection search phrases removed.
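Before looking at the example script, here is a condensed sketch of the training workflow using the API described in the reference section below. The two-document 'corpus' is deliberately tiny and purely illustrative; a real classifier needs many training documents per label, and the lowered minimum_occurrences and cv_threshold values only serve to let the toy example run:

import holmes_extractor as holmes

manager = holmes.Manager(model='en_core_web_lg', number_of_workers=1)
training_basis = manager.get_supervised_topic_training_basis()
training_basis.parse_and_register_training_document(
    'The big dog chased the cat.', classification='animals', label='d1')
training_basis.parse_and_register_training_document(
    'The company was taken over by its competitor.',
    classification='business', label='d2')
training_basis.prepare()
trainer = training_basis.train(minimum_occurrences=1, cv_threshold=0.0)
classifier = trainer.classifier()
# Returns an ordered dictionary from labels to probabilities, or None.
print(classifier.parse_and_classify('A dog chased a cat.'))
# The model can be serialized now and reloaded later via
# Manager.deserialize_supervised_topic_classifier().
serialized_model = classifier.serialize_model()
manager.close()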
Holmes ships with an example script demonstrating supervised document classification for English with the BBC Documents dataset. The script downloads the documents (for this operation and for this operation alone, you will need to be online) and places them in a working directory. When training is complete, the script saves the model to the working directory. If the model file is found in the working directory on subsequent invocations of the script, the training phase is skipped and the script goes straight to the testing phase. This means that if it is wished to repeat the training phase, either the model has to be deleted from the working directory or a new working directory has to be specified to the script.
Having cloned the source code and installed the Holmes library, navigate to the /examples directory. Specify a working directory at the top of the example_supervised_topic_model_EN.py file, then type python3 example_supervised_topic_model_EN.py (Linux) or click on the script in Windows Explorer (Windows).
It is important to realise that Holmes learns to classify documents according to the words or semantic relationships they contain, taking any structural matching ontology into account in the process. For many classification tasks, this is exactly what is required; but there are tasks (e.g. author attribution according to the frequency of grammatical constructions typical for each author) where it is not. For the right task, Holmes achieves impressive results. For the BBC Documents benchmark processed by the example script, Holmes performs slightly better than benchmarks available online (see e.g. here), although the difference is probably too slight to be significant, especially given that different training/test splits were used in each case: Holmes has been observed to learn models that predict the correct result between 96.9% and 98.7% of the time. The range is explained by the fact that the behaviour of the neural network is not fully deterministic.
The interior workings of supervised document classification are explained here.
Manager

holmes_extractor.Manager(self, model, *, overall_similarity_threshold=1.0,
embedding_based_matching_on_root_words=False, ontology=None,
analyze_derivational_morphology=True, perform_coreference_resolution=None,
use_reverse_dependency_matching=True, number_of_workers=None, verbose=False)
The facade class for the Holmes library.
Parameters:
model -- the name of the spaCy model, e.g. *en_core_web_trf*
overall_similarity_threshold -- the overall similarity threshold for embedding-based
matching. Defaults to *1.0*, which deactivates embedding-based matching. Note that this
parameter is not relevant for topic matching, where the thresholds for embedding-based
matching are set on the call to *topic_match_documents_against*.
embedding_based_matching_on_root_words -- determines whether or not embedding-based
matching should be attempted on search-phrase root tokens, which has a considerable
performance hit. Defaults to *False*. Note that this parameter is not relevant for topic
matching.
ontology -- an *Ontology* object. Defaults to *None* (no ontology).
analyze_derivational_morphology -- *True* if matching should be attempted between different
words from the same word family. Defaults to *True*.
perform_coreference_resolution -- *True* if coreference resolution should be taken into account
when matching. Defaults to *True*.
use_reverse_dependency_matching -- *True* if appropriate dependencies in documents can be
matched to dependencies in search phrases where the two dependencies point in opposite
directions. Defaults to *True*.
number_of_workers -- the number of worker processes to use, or *None* if the number of worker
processes should depend on the number of available cores. Defaults to *None*.
verbose -- a boolean value specifying whether multiprocessing messages should be outputted to
the console. Defaults to *False*.
Manager.register_serialized_document(self, serialized_document:bytes, label:str="") -> None
Parameters:
serialized_document -- a pre-parsed Holmes document in serialized form.
label -- a label for the document which must be unique. Defaults to the empty string,
which is intended for use cases involving single documents (typically user entries).
Manager.register_serialized_documents(self, document_dictionary:dict[str, bytes]) -> None
Note that this function is the most efficient way of loading documents.
Parameters:
document_dictionary -- a dictionary from labels to serialized documents.
Manager.parse_and_register_document(self, document_text:str, label:str='') -> None
Parameters:
document_text -- the raw document text.
label -- a label for the document which must be unique. Defaults to the empty string,
which is intended for use cases involving single documents (typically user entries).
Manager.remove_document(self, label:str) -> None
Manager.remove_all_documents(self, labels_starting:str=None) -> None
Parameters:
labels_starting -- a string starting the labels of documents to be removed,
or 'None' if all documents are to be removed.
Manager.list_document_labels(self) -> List[str]
Returns a list of the labels of the currently registered documents.
Manager.serialize_document(self, label:str) -> Optional[bytes]
Returns a serialized representation of a Holmes document that can be
persisted to a file. If 'label' is not the label of a registered document,
'None' is returned instead.
Parameters:
label -- the label of the document to be serialized.
Manager.get_document(self, label:str='') -> Optional[Doc]
Returns a Holmes document. If *label* is not the label of a registered document, *None*
is returned instead.
Parameters:
label -- the label of the document to be returned.
Manager.debug_document(self, label:str='') -> None
Outputs a debug representation for a loaded document.
Parameters:
label -- the label of the document to be debugged.
Manager.register_search_phrase(self, search_phrase_text:str, label:str=None) -> SearchPhrase
Registers and returns a new search phrase.
Parameters:
search_phrase_text -- the raw search phrase text.
label -- a label for the search phrase, which need not be unique.
If label==None, the assigned label defaults to the raw search phrase text.
Manager.remove_all_search_phrases_with_label(self, label:str) -> None
Manager.remove_all_search_phrases(self) -> None
Manager.list_search_phrase_labels(self) -> List[str]
Manager.match(self, search_phrase_text:str=None, document_text:str=None) -> List[Dict]
Matches search phrases to documents and returns the result as match dictionaries.
Parameters:
search_phrase_text -- a text from which to generate a search phrase, or 'None' if the
preloaded search phrases should be used for matching.
document_text -- a text from which to generate a document, or 'None' if the preloaded
documents should be used for matching.
Manager.topic_match_documents_against(self, text_to_match:str, *,
use_frequency_factor:bool=True,
maximum_activation_distance:int=75,
word_embedding_match_threshold:float=0.8,
initial_question_word_embedding_match_threshold:float=0.7,
relation_score:int=300,
reverse_only_relation_score:int=200,
single_word_score:int=50,
single_word_any_tag_score:int=20,
initial_question_word_answer_score:int=600,
initial_question_word_behaviour:str='process',
different_match_cutoff_score:int=15,
overlapping_relation_multiplier:float=1.5,
embedding_penalty:float=0.6,
ontology_penalty:float=0.9,
relation_matching_frequency_threshold:float=0.25,
embedding_matching_frequency_threshold:float=0.5,
sideways_match_extent:int=100,
only_one_result_per_document:bool=False,
number_of_results:int=10,
document_label_filter:str=None,
tied_result_quotient:float=0.9) -> List[Dict]:
Returns a list of dictionaries representing the results of a topic match between an entered text
and the loaded documents.
Parameters:
text_to_match -- the text to match against the loaded documents.
use_frequency_factor -- *True* if scores should be multiplied by a factor between 0 and 1
expressing how rare the words matching each phraselet are in the corpus. Note that,
even if this parameter is set to *False*, the factors are still calculated as they are
required for determining which relation and embedding matches should be attempted.
maximum_activation_distance -- the number of words it takes for a previous phraselet
activation to reduce to zero when the library is reading through a document.
word_embedding_match_threshold -- the cosine similarity above which two words match where
the search phrase word does not govern an interrogative pronoun.
initial_question_word_embedding_match_threshold -- the cosine similarity above which two
words match where the search phrase word governs an interrogative pronoun.
relation_score -- the activation score added when a normal two-word relation is matched.
reverse_only_relation_score -- the activation score added when a two-word relation
is matched using a search phrase that can only be reverse-matched.
single_word_score -- the activation score added when a single noun is matched.
single_word_any_tag_score -- the activation score added when a single word is matched
that is not a noun.
initial_question_word_answer_score -- the activation score added when a question word is
matched to a potential answer phrase.
initial_question_word_behaviour -- 'process' if a question word in the sentence
constituent at the beginning of *text_to_match* is to be matched to document phrases
that answer it and to matching question words; 'exclusive' if only topic matches that
answer questions are to be permitted; 'ignore' if question words are to be ignored.
different_match_cutoff_score -- the activation threshold under which topic matches are
separated from one another. Note that the default value will probably be too low if
*use_frequency_factor* is set to *False*.
overlapping_relation_multiplier -- the value by which the activation score is multiplied
when two relations were matched and the matches involved a common document word.
embedding_penalty -- a value between 0 and 1 with which scores are multiplied when the
match involved an embedding. The result is additionally multiplied by the overall
similarity measure of the match.
ontology_penalty -- a value between 0 and 1 with which scores are multiplied for each
word match within a match that involved the ontology. For each such word match,
the score is multiplied by the value (abs(depth) + 1) times, so that the penalty is
higher for hyponyms and hypernyms than for synonyms and increases with the
depth distance.
relation_matching_frequency_threshold -- the frequency threshold above which single
word matches are used as the basis for attempting relation matches.
embedding_matching_frequency_threshold -- the frequency threshold above which single
word matches are used as the basis for attempting relation matches with
embedding-based matching on the second word.
sideways_match_extent -- the maximum number of words that may be incorporated into a
topic match either side of the word where the activation peaked.
only_one_result_per_document -- if 'True', prevents multiple results from being returned
for the same document.
number_of_results -- the number of topic match objects to return.
document_label_filter -- optionally, a string with which document labels must start to
be considered for inclusion in the results.
tied_result_quotient -- the quotient between a result and following results above which
the results are interpreted as tied.
Manager.get_supervised_topic_training_basis(self, *, classification_ontology:Ontology=None,
overlap_memory_size:int=10, oneshot:bool=True, match_all_words:bool=False,
verbose:bool=True) -> SupervisedTopicTrainingBasis:
Returns an object that is used to train and generate a model for the
supervised document classification use case.
Parameters:
classification_ontology -- an Ontology object incorporating relationships between
classification labels, or 'None' if no such ontology is to be used.
overlap_memory_size -- how many non-word phraselet matches to the left should be
checked for words in common with a current match.
oneshot -- whether the same word or relationship matched multiple times within a
single document should be counted once only (value 'True') or multiple times
(value 'False')
match_all_words -- whether all single words should be taken into account
(value 'True') or only single words with noun tags (value 'False')
verbose -- if 'True', information about training progress is outputted to the console.
Manager.deserialize_supervised_topic_classifier(self,
serialized_model:bytes, verbose:bool=False) -> SupervisedTopicClassifier:
Returns a classifier for the supervised document classification use case
that will use a supplied pre-trained model.
Parameters:
serialized_model -- the pre-trained model as returned from `SupervisedTopicClassifier.serialize_model()`.
verbose -- if 'True', information about matching is outputted to the console.
Manager.start_chatbot_mode_console(self)
Starts a chatbot mode console enabling the matching of pre-registered
search phrases to documents (chatbot entries) entered ad-hoc by the
user.
Manager.start_structural_search_mode_console(self)
Starts a structural extraction mode console enabling the matching of pre-registered
documents to search phrases entered ad-hoc by the user.
Manager.start_topic_matching_search_mode_console(self,
only_one_result_per_document:bool=False, word_embedding_match_threshold:float=0.8,
initial_question_word_embedding_match_threshold:float=0.7):
Starts a topic matching search mode console enabling the matching of pre-registered
documents to query phrases entered ad-hoc by the user.
Parameters:
only_one_result_per_document -- if 'True', prevents multiple topic match
results from being returned for the same document.
word_embedding_match_threshold -- the cosine similarity above which two words match where the
search phrase word does not govern an interrogative pronoun.
initial_question_word_embedding_match_threshold -- the cosine similarity above which two
words match where the search phrase word governs an interrogative pronoun.
Manager.close(self) -> None
Terminates the worker processes.
manager.nlp

manager.nlp is the underlying spaCy Language object on which both Coreferee and Holmes have been registered as custom pipeline components. The most efficient way of parsing documents for use with Holmes is to call manager.nlp.pipe(). This yields an iterable of documents that can then be loaded into Holmes via manager.register_serialized_documents().
The pipe() method has an argument n_process that specifies the number of processors to use. With _lg , _md and _sm spaCy models, there are some situations where it can make sense to specify a value other than 1 (the default). Note however that with transformer spaCy models ( _trf ) values other than 1 are not supported.
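A sketch of this bulk-loading path; serializing each document with spaCy's Doc.to_bytes() is an assumption consistent with the bytes type in the register_serialized_documents() signature:

import holmes_extractor as holmes

manager = holmes.Manager(model='en_core_web_lg', number_of_workers=1)
texts = {'doc-1': 'A big dog chased a cat.',
         'doc-2': 'The company was taken over.'}
# Parse all the documents in a single pass through the pipeline.
parsed_docs = manager.nlp.pipe(texts.values())
serialized_documents = {label: doc.to_bytes()
                        for label, doc in zip(texts.keys(), parsed_docs)}
manager.register_serialized_documents(serialized_documents)
print(manager.list_document_labels())
manager.close()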
Ontology

holmes_extractor.Ontology(self, ontology_path,
owl_class_type='http://www.w3.org/2002/07/owl#Class',
owl_individual_type='http://www.w3.org/2002/07/owl#NamedIndividual',
owl_type_link='http://www.w3.org/1999/02/22-rdf-syntax-ns#type',
owl_synonym_type='http://www.w3.org/2002/07/owl#equivalentClass',
owl_hyponym_type='http://www.w3.org/2000/01/rdf-schema#subClassOf',
symmetric_matching=False)
Loads information from an existing ontology and manages ontology
matching.
The ontology must follow the W3C OWL 2 standard. Search phrase words are
matched to hyponyms, synonyms and instances from within documents being
searched.
This class is designed for small ontologies that have been constructed
by hand for specific use cases. Where the aim is to model a large number
of semantic relationships, word embeddings are likely to offer
better results.
Holmes is not designed to support changes to a loaded ontology via direct
calls to the methods of this class. It is also not permitted to share a single instance
of this class between multiple Manager instances: instead, a separate Ontology instance
pointing to the same path should be created for each Manager.
Matching is case-insensitive.
Parameters:
ontology_path -- the path from where the ontology is to be loaded,
or a list of several such paths. See https://github.com/RDFLib/rdflib/.
owl_class_type -- optionally overrides the OWL 2 URL for types.
owl_individual_type -- optionally overrides the OWL 2 URL for individuals.
owl_type_link -- optionally overrides the RDF URL for types.
owl_synonym_type -- optionally overrides the OWL 2 URL for synonyms.
owl_hyponym_type -- optionally overrides the RDF URL for hyponyms.
symmetric_matching -- if 'True', means hypernym relationships are also taken into account.
SupervisedTopicTrainingBasis (returned from Manager.get_supervised_topic_training_basis())

Holder object for training documents and their classifications from which one or more SupervisedTopicModelTrainer objects can be derived. This class is NOT threadsafe.
SupervisedTopicTrainingBasis.parse_and_register_training_document(self, text:str, classification:str,
label:Optional[str]=None) -> None
Parses and registers a document to use for training.
Parameters:
text -- the document text
classification -- the classification label
label -- a label with which to identify the document in verbose training output,
or 'None' if a random label should be assigned.
SupervisedTopicTrainingBasis.register_training_document(self, doc:Doc, classification:str,
label:Optional[str]=None) -> None
Registers a pre-parsed document to use for training.
Parameters:
doc -- the document
classification -- the classification label
label -- a label with which to identify the document in verbose training output,
or 'None' if a random label should be assigned.
SupervisedTopicTrainingBasis.register_additional_classification_label(self, label:str) -> None
Register an additional classification label which no training document possesses explicitly
but that should be assigned to documents whose explicit labels are related to the
additional classification label via the classification ontology.
SupervisedTopicTrainingBasis.prepare(self) -> None
Matches the phraselets derived from the training documents against the training
documents to generate frequencies that also include combined labels, and examines the
explicit classification labels, the additional classification labels and the
classification ontology to derive classification implications.
Once this method has been called, the instance no longer accepts new training documents
or additional classification labels.
SupervisedTopicTrainingBasis.train(
self,
*,
minimum_occurrences: int = 4,
cv_threshold: float = 1.0,
learning_rate: float = 0.001,
batch_size: int = 5,
max_epochs: int = 200,
convergence_threshold: float = 0.0001,
hidden_layer_sizes: Optional[List[int]] = None,
shuffle: bool = True,
normalize: bool = True
) -> SupervisedTopicModelTrainer:
Trains a model based on the prepared state.
Parameters:
minimum_occurrences -- the minimum number of times a word or relationship has to
occur in the context of the same classification for the phraselet
to be accepted into the final model.
cv_threshold -- the minimum coefficient of variation with which a word or relationship has
to occur across the explicit classification labels for the phraselet to be
accepted into the final model.
learning_rate -- the learning rate for the Adam optimizer.
batch_size -- the number of documents in each training batch.
max_epochs -- the maximum number of training epochs.
convergence_threshold -- the threshold below which loss measurements after consecutive
epochs are regarded as equivalent. Training stops before 'max_epochs' is reached
if equivalent results are achieved after four consecutive epochs.
hidden_layer_sizes -- a list containing the number of neurons in each hidden layer, or
'None' if the topology should be determined automatically.
shuffle -- 'True' if documents should be shuffled during batching.
normalize -- 'True' if normalization should be applied to the loss function.
SupervisedTopicModelTrainer (returned from SupervisedTopicTrainingBasis.train())

Worker object used to train and generate models. This object could be removed from the public interface (SupervisedTopicTrainingBasis.train() could return a SupervisedTopicClassifier directly) but has been retained to facilitate testability.
This class is NOT threadsafe.
SupervisedTopicModelTrainer.classifier(self)
Returns a supervised topic classifier which contains no explicit references to the training data and that
can be serialized.
SupervisedTopicClassifier (returned from SupervisedTopicModelTrainer.classifier() and Manager.deserialize_supervised_topic_classifier())
SupervisedTopicClassifier.parse_and_classify(self, text: str) -> Optional[OrderedDict]:
Returns a dictionary from classification labels to probabilities
ordered starting with the most probable, or *None* if the text did
not contain any words recognised by the model.
Parameters:
text -- the text to parse and classify.
SupervisedTopicClassifier.classify(self, doc: Doc) -> Optional[OrderedDict]:
Returns a dictionary from classification labels to probabilities
ordered starting with the most probable, or *None* if the document did
not contain any words recognised by the model.
Parameters:
doc -- the pre-parsed document to classify.
SupervisedTopicClassifier.serialize_model(self) -> str
Returns a serialized model that can be reloaded using
*Manager.deserialize_supervised_topic_classifier()*
Match dictionaries (returned from Manager.match())

A text-only representation of a match between a search phrase and a
document. The indexes refer to tokens.
Properties:
search_phrase_label -- the label of the search phrase.
search_phrase_text -- the text of the search phrase.
document -- the label of the document.
index_within_document -- the index of the match within the document.
sentences_within_document -- the raw text of the sentences within the document that matched.
negated -- 'True' if this match is negated.
uncertain -- 'True' if this match is uncertain.
involves_coreference -- 'True' if this match was found using coreference resolution.
overall_similarity_measure -- the overall similarity of the match, or
'1.0' if embedding-based matching was not involved in the match.
word_matches -- an array of dictionaries with the properties:
search_phrase_token_index -- the index of the token that matched from the search phrase.
search_phrase_word -- the string that matched from the search phrase.
document_token_index -- the index of the token that matched within the document.
first_document_token_index -- the index of the first token that matched within the document.
Identical to 'document_token_index' except where the match involves a multiword phrase.
last_document_token_index -- the index of the last token that matched within the document
(NOT one more than that index). Identical to 'document_token_index' except where the match
involves a multiword phrase.
structurally_matched_document_token_index -- the index of the token within the document that
structurally matched the search phrase token. Is either the same as 'document_token_index' or
is linked to 'document_token_index' within a coreference chain.
document_subword_index -- the index of the token subword that matched within the document, or
'None' if matching was not with a subword but with an entire token.
document_subword_containing_token_index -- the index of the document token that contained the
subword that matched, which may be different from 'document_token_index' in situations where a
word containing multiple subwords is split by hyphenation and a subword whose sense
contributes to a word is not overtly realised within that word.
document_word -- the string that matched from the document.
document_phrase -- the phrase headed by the word that matched from the document.
match_type -- 'direct', 'derivation', 'entity', 'embedding', 'ontology', 'entity_embedding'
or 'question'.
negated -- 'True' if this word match is negated.
uncertain -- 'True' if this word match is uncertain.
similarity_measure -- for types 'embedding' and 'entity_embedding', the similarity between the
two tokens, otherwise '1.0'.
involves_coreference -- 'True' if the word was matched using coreference resolution.
extracted_word -- within the coreference chain, the most specific term that corresponded to
the document_word.
depth -- the number of hyponym relationships linking 'search_phrase_word' and
'extracted_word', or '0' if ontology-based matching is not active. Can be negative
if symmetric matching is active.
explanation -- creates a human-readable explanation of the word match from the perspective of the
document word (e.g. to be used as a tooltip over it).
Topic match dictionaries (returned from Manager.topic_match_documents_against())

A text-only representation of a topic match between a search text and a document.
Properties:
document_label -- the label of the document.
text -- the document text that was matched.
text_to_match -- the search text.
rank -- a string representation of the scoring rank which can have the form e.g. '2=' in case of a tie.
index_within_document -- the index of the document token where the activation peaked.
subword_index -- the index of the subword within the document token where the activation peaked, or
'None' if the activation did not peak at a specific subword.
start_index -- the index of the first document token in the topic match.
end_index -- the index of the last document token in the topic match (NOT one more than that index).
sentences_start_index -- the token start index within the document of the sentence that contains
'start_index'.
sentences_end_index -- the token end index within the document of the sentence that contains
'end_index' (NOT one more than that index).
sentences_character_start_index_in_document -- the character index of the first character of 'text'
within the document.
sentences_character_end_index_in_document -- one more than the character index of the last
character of 'text' within the document.
score -- the peak activation score achieved by this topic match.
word_infos -- an array of arrays with the semantics:
[0] -- 'relative_start_index' -- the index of the first character in the word relative to
'sentences_character_start_index_in_document'.
[1] -- 'relative_end_index' -- one more than the index of the last character in the word
relative to 'sentences_character_start_index_in_document'.
[2] -- 'type' -- 'single' for a single-word match, 'relation' if within a relation match
involving two words, 'overlapping_relation' if within a relation match involving three
or more words.
[3] -- 'is_highest_activation' -- 'True' if this was the word at which the highest activation
score reported in 'score' was achieved, otherwise 'False'.
[4] -- 'explanation' -- a human-readable explanation of the word match from the perspective of
the document word (e.g. to be used as a tooltip over it).
answers -- an array of arrays with the semantics:
[0] -- the index of the first character of a potential answer to an initial question word.
[1] -- one more than the index of the last character of a potential answer to an initial question
word.
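The following sketch consumes these dictionaries and uses the character offsets carried in word_infos, which are relative to the start of 'text', to highlight the matched words with square brackets (the document text and query are invented):

import holmes_extractor as holmes

manager = holmes.Manager(model='en_core_web_lg')
manager.parse_and_register_document(
    'The dog was so excited that it chased the cat around the garden.',
    label='story')

for topic_match in manager.topic_match_documents_against('A dog chases a cat'):
    print(topic_match['rank'], topic_match['score'], topic_match['document_label'])
    text = topic_match['text']
    pieces, last = [], 0
    for start, end, info_type, is_peak, explanation in topic_match['word_infos']:
        # Offsets are relative to 'text', so slicing highlights the matched words.
        pieces.extend([text[last:start], '[', text[start:end], ']'])
        last = end
    pieces.append(text[last:])
    print(''.join(pieces))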
Earlier versions of Holmes could only be published under a restrictive license because of patent issues. As explained in the introduction, this is no longer the case thanks to the generosity of AstraZeneca: versions from 4.0.0 onwards are licensed under the MIT license.
The word-level matching and the high-level operation of structural matching between search-phrase and document subgraphs both work more or less as one would expect. What is perhaps more in need of further comment is the semantic analysis code contained in the parsing.py script as well as in the language_specific_rules.py script for each language.
SemanticAnalyzer is an abstract class that is subclassed for each language: at present by EnglishSemanticAnalyzer and GermanSemanticAnalyzer. These classes contain most of the semantic analysis code. SemanticMatchingHelper is a second abstract class, again with a concrete implementation for each language, that contains the semantic analysis code required at matching time. Moving this code out to a separate class family was necessary because, on operating systems that spawn processes rather than forking them (e.g. Windows), SemanticMatchingHelper instances have to be serialized when the worker processes are created: this would not be possible for SemanticAnalyzer instances, because not all spaCy models are serializable, and serializing them would in any case unnecessarily consume large amounts of memory.
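The constraint can be illustrated independently of Holmes: under the spawn start method, anything handed to a worker process must survive pickling, which a lightweight helper object can do while an object wrapping a non-serializable spaCy pipeline cannot. The MatchingHelper class below is a hypothetical stand-in, not part of the library:

import multiprocessing
import pickle

class MatchingHelper:
    """Hypothetical stand-in for a SemanticMatchingHelper: picklable state only."""
    def __init__(self, stop_lemmas):
        self.stop_lemmas = set(stop_lemmas)

def is_relevant(helper, lemma):
    # Runs inside the worker process, so 'helper' must have been pickled.
    return lemma not in helper.stop_lemmas

if __name__ == '__main__':
    helper = MatchingHelper({'be', 'have'})
    pickle.dumps(helper)  # succeeds because no spaCy model is attached
    ctx = multiprocessing.get_context('spawn')  # process creation as on Windows
    with ctx.Pool(1) as pool:
        print(pool.apply(is_relevant, (helper, 'chase')))  # prints True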
At present, all functionality that is common to the two languages is realised in the two abstract parent classes. Because English and German are closely related, some behaviour that currently appears language-independent may in fact be specific to them, so it is probable that functionality will need to be moved from the abstract parent classes to specific implementing child classes if and when semantic analyzers are added for new languages.
The HolmesDictionary class is defined as a spaCy extension attribute that is accessed using the syntax token._.holmes . The most important information in the dictionary is a list of SemanticDependency objects. These are derived from the dependency relationships in the spaCy output ( token.dep_ ) but go through a considerable amount of processing to make them 'less syntactic' and 'more semantic'. To give but a few examples:
Some new semantic dependency labels that do not occur in spaCy outputs as values of token.dep_ are added for Holmes semantic dependencies. It is important to understand that Holmes semantic dependencies are used exclusively for matching and are therefore neither intended nor required to form a coherent set of linguistic theoretical entities or relationships; whatever works best for matching is assigned on an ad-hoc basis.
For each language, the match_implication_dict dictionary maps search-phrase semantic dependencies to matching document semantic dependencies and is responsible for the asymmetry of matching between search phrases and documents.
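Because manager.nlp exposes the underlying spaCy pipeline with the Holmes extensions registered, the result of this processing can be inspected directly. The sentence below is invented, and string_representation_of_children() is assumed here to be the HolmesDictionary's convenience method for rendering its semantic dependencies:

import holmes_extractor as holmes

manager = holmes.Manager(model='en_core_web_lg')
doc = manager.nlp('The cat was chased by the dog.')
for token in doc:
    # token._.holmes holds the HolmesDictionary for each token; the assumed
    # method renders its semantic dependencies as 'index:label' pairs.
    print(token.i, token.text, token._.holmes.string_representation_of_children())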
Topic matching involves the following steps:
1. Both the query phrase or query document and the documents being searched are parsed. Words belonging to a small set of stop lemmas (SemanticMatchingHelper.topic_matching_phraselet_stop_lemmas), which are consistently ignored throughout the whole process, are filtered out.
2. Single-word and relation phraselets are derived from the query according to the templates defined in SemanticMatchingHelper.phraselet_templates.
3. Some relation phraselets are matched reverse-only, meaning that matching never starts at the parent word. This happens when the parent word belongs to a defined list (SemanticMatchingHelper.topic_matching_reverse_only_parent_lemmas) or when the frequency factor for the parent word is below the threshold for relation matching (relation_matching_frequency_threshold, default: 0.25). These measures are necessary because matching on e.g. a parent preposition would lead to a large number of potential matches that would take a lot of resources to investigate: it is better to start investigation from the less frequent word within a given relation.
4. Otherwise, relation matching is attempted starting at whichever word of a relation has a frequency factor at or above the threshold for relation matching (relation_matching_frequency_threshold, default: 0.25).
5. Where a word's frequency factor is at or above the embedding matching threshold (embedding_matching_frequency_threshold, default: 0.5), matching at all of those words where the relation template has not already been matched is retried using embeddings at the other word within the relation. A pair of words is then regarded as matching when their mutual cosine similarity is above initial_question_word_embedding_match_threshold (default: 0.7) in situations where the document word has an initial question word in its phrase, or word_embedding_match_threshold (default: 0.8) in all other situations.
6. Each match contributes to an activation score that is tracked across each document. When use_frequency_factor is set to True (the default), each score is scaled by the frequency factor of its phraselet, meaning that words that occur less frequently in the corpus give rise to higher scores.
7. The activation score decays as the document is read: a fraction of its value determined by the maximum activation distance (maximum_activation_distance; default: 75) is subtracted from it as each new word is read.
8. How much a match contributes depends on the type of phraselet: a noun single-word phraselet (single_word_score; default: 50), a non-noun single-word phraselet or a noun phraselet that matched a subword (single_word_any_tag_score; default: 20), a relation phraselet produced by a reverse-only template (reverse_only_relation_score; default: 200), any other (normally matched) relation phraselet (relation_score; default: 300), or a relation phraselet involving an initial question word (initial_question_word_answer_score; default: 600).
9. Contributions from embedding-based matches are multiplied by a penalty factor (embedding_penalty; default: 0.6).
10. Contributions from ontology-based matches are multiplied by a penalty factor (ontology_penalty; default: 0.9) once more often than the difference in depth between the two ontology entries, i.e. once for a synonym, twice for a child, three times for a grandchild and so on.
11. Where relation matches overlap at a shared word, their contributions are multiplied by a bonus factor (overlapping_relation_multiplier; default: 1.5).
12. The words on either side of each activation peak (up to a maximum extent of sideways_match_extent; default: 100 words) within which the activation score is higher than the different_match_cutoff_score (default: 15) are regarded as belonging to a contiguous passage around the peak that is then returned as a TopicMatch object. (Note that this default will almost certainly turn out to be too low if use_frequency_factor is set to False.) A word whose activation equals the threshold exactly is included at the beginning of the area as long as the next word where activation increases has a score above the threshold. If the topic match peak is below the threshold, the topic match will only consist of the peak word.
13. Where initial_question_word_behaviour is set to process (the default) or to exclusive and a document word has matched an initial question word from the query phrase, the subtree of the matched document word is identified as a potential answer to the question and added to the dictionary to be returned.
14. If initial_question_word_behaviour is set to exclusive, any topic matches that do not contain answers to initial question words are discarded.
15. Setting only_one_result_per_document = True prevents more than one result from being returned from the same document; only the result from each document with the highest score will then be returned.
16. Adjacently ranked results whose scores are so close that the quotient of the lower and the higher score is above tied_result_quotient (default: 0.9) are labelled as tied.
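As the defaults quoted above suggest, these hyperparameters can be overridden per call. The sketch below assumes they are passed as keyword arguments to Manager.topic_match_documents_against(); the document text and query are invented:

import holmes_extractor as holmes

manager = holmes.Manager(model='en_core_web_lg')
manager.parse_and_register_document('The dog chased the cat all over the town.', label='d1')

# Override a handful of the hyperparameters discussed above; all others keep their defaults.
topic_matches = manager.topic_match_documents_against(
    'Which dog chased a cat?',
    word_embedding_match_threshold=0.85,       # stricter than the 0.8 default
    relation_score=300,                        # the default, shown explicitly
    initial_question_word_behaviour='process',
    only_one_result_per_document=True)
for topic_match in topic_matches:
    print(topic_match['rank'], round(topic_match['score'], 2), topic_match['text'])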
The supervised document classification use case relies on the same phraselets as the topic matching use case, although reverse-only templates are ignored and a different set of stop words is used (SemanticMatchingHelper.supervised_document_classification_phraselet_stop_lemmas). Classifiers are built and trained as follows:
1. Phraselets are derived from the training documents and matched against them. Whether each phraselet can contribute once per document or once per occurrence depends on the value of oneshot; whether single-word phraselets are generated for all words with their own meaning or only for those such words whose part-of-speech tags match the single-word phraselet template specification (essentially: noun phraselets) depends on the value of match_all_words. Wherever two phraselet matches overlap, a combined match is recorded. Combined matches are treated in the same way as other phraselet matches in further processing. This means that effectively the algorithm picks up one-word, two-word and three-word semantic combinations. See here for a discussion of the performance of this step.
2. Phraselets are discarded that do not occur a minimum number of times within the training corpus (minimum_occurrences; default: 4) or where the coefficient of variation (the standard deviation divided by the arithmetic mean) of the occurrences across the categories is below a threshold (cv_threshold; default: 1.0).
3. The remaining phraselet matches form the inputs to a multilayer perceptron (counted once per document or once per occurrence according to oneshot==True vs. oneshot==False respectively). The outputs are the category labels, including any additional labels determined via a classification ontology. By default, the multilayer perceptron has three hidden layers where the first hidden layer has the same number of neurons as the input layer and the second and third layers have sizes in between the input and the output layer with an equally sized step between each size; the user is however free to specify any other topology.
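Sketched end to end using the object model listed at the start of this document (the training texts are invented and far too few for a real classifier; parse_and_register_training_document(), prepare() and parse_and_classify() are assumed method names, and the two thresholds are assumed to be parameters of train()):

import holmes_extractor as holmes

manager = holmes.Manager(model='en_core_web_lg')
basis = manager.get_supervised_topic_training_basis()

# Register labelled training documents, then freeze the phraselet inventory.
basis.parse_and_register_training_document('The dog chased the cat.', 'animals')
basis.parse_and_register_training_document('The cat fled from the dog.', 'animals')
basis.parse_and_register_training_document('Interest rates rose sharply.', 'finance')
basis.parse_and_register_training_document('The bank cut its rates.', 'finance')
basis.prepare()

# Train the multilayer perceptron; the thresholds are loosened for this tiny corpus.
trainer = basis.train(minimum_occurrences=1, cv_threshold=0.0)
classifier = trainer.classifier()
print(classifier.parse_and_classify('The dog ran after a cat.'))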
Holmes code is formatted with black.
The complexity of what Holmes does makes development impossible without a robust set of over 1400 regression tests. These can be executed individually with unittest or all at once by running the pytest utility from the Holmes source code root directory. (Note that on Linux the Python 3 command is pytest-3.)
The pytest variant will only work on machines with sufficient memory resources. To reduce this problem, the tests are distributed across three subdirectories, so that pytest can be run three times, once from each subdirectory.
New languages can be added to Holmes by subclassing the SemanticAnalyzer and SemanticMatchingHelper classes as explained here.
The sets of matching semantic dependencies captured in the _matching_dep_dict dictionary for each language have been obtained on the basis of a mixture of linguistic-theoretical expectations and trial and error. The results would probably be improved if the _matching_dep_dict dictionaries could be derived using machine learning instead; as yet this has not been attempted because of the lack of appropriate training data.
An attempt should be made to remove personal data from supervised document classification models to make them more compliant with data protection laws.
In cases where embedding-based matching is not active, the second step of the supervised document classification procedure repeats a considerable amount of processing from the first step. Retaining the relevant information from the first step of the procedure would greatly improve training performance. This has not been attempted up to now because a large number of tests would be required to prove that such performance improvements did not have any inadvertent impacts on functionality.
The topic matching and supervised document classification use cases are both configured with a number of hyperparameters that are presently set to best-guess values derived on a purely theoretical basis. Results could be further improved by testing the use cases with a variety of hyperparameters to learn the optimal values.
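A naive sketch of such a search over two of the topic matching hyperparameters follows; evaluation_set is a hypothetical list pairing query strings with the labels of the documents they should retrieve first:

import holmes_extractor as holmes

def accuracy(manager, evaluation_set, **hyperparameters):
    # Fraction of queries whose top-ranked result comes from the expected document.
    hits = 0
    for query, expected_label in evaluation_set:
        results = manager.topic_match_documents_against(query, **hyperparameters)
        if results and results[0]['document_label'] == expected_label:
            hits += 1
    return hits / len(evaluation_set)

manager = holmes.Manager(model='en_core_web_lg')
manager.parse_and_register_document('The dog chased the cat.', label='animals')
manager.parse_and_register_document('Interest rates rose sharply.', label='finance')
evaluation_set = [('A dog chases a cat', 'animals'),
                  ('rising interest rates', 'finance')]

best = max((accuracy(manager, evaluation_set,
                     relation_score=relation_score,
                     word_embedding_match_threshold=threshold),
            relation_score, threshold)
           for relation_score in (200, 300, 400)
           for threshold in (0.7, 0.8, 0.9))
print('best accuracy %.2f with relation_score=%d, threshold=%.1f' % best)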
Notable changes over the version history include:
- the initial open-source version;
- a new semantic dependency pobjp linking parents of prepositions directly with their children;
- the introduction of the MultiprocessingManager object as an additional facade;
- the merging of the Manager and MultiprocessingManager classes into a single Manager class, with a redesigned public interface, that uses worker threads for everything except supervised document classification.