Version 5.12.5
The Link Grammar Parser exhibits the linguistic (natural language) structure of English, Thai, Russian, Arabic, Persian, and a limited number of other languages. This structure is a graph of typed links (edges) between the words in a sentence. The more conventional HPSG (constituent) and dependency-style parses can be obtained from Link Grammar by applying a collection of rules to convert to those different formats. This is possible because Link Grammar goes a bit "deeper" into the "syntactico-semantic" structure of a sentence: it provides considerably more fine-grained and detailed information than is commonly available in conventional parsers.
The theory of Link Grammar parsing was originally developed in 1991 by Davy Temperley, John Lafferty and Daniel Sleator, at the time professors of linguistics and computer science at Carnegie Mellon University. The first three publications on this theory provide the best introduction and overview; since then there have been hundreds of publications further exploring, examining and extending the ideas.
Although based on the original Carnegie-Mellon code base, the current Link Grammar package has dramatically evolved and is profoundly different from earlier versions. There have been innumerable bug fixes; performance has improved by several orders of magnitude. The package is fully multi-threaded, fully UTF-8 enabled, and has been scrubbed for security, allowing cloud deployment. Parse coverage of English has been dramatically improved; other languages have been added (most notably, Thai and Russian). There are many new features, including support for morphology, dialects, and a fine-grained cost system allowing vector-embedding-like behavior. There is a new, sophisticated tokenizer tailored for morphology: it can supply alternative splittings for morphologically ambiguous words. Dictionaries can be updated at run time, so that systems that continuously learn a grammar can parse at the same time; that is, dictionary updates and parsing are mutually thread-safe. Word classes can be identified with regexes. Random planar-graph parsing is fully supported; this allows planar graphs to be sampled uniformly. A detailed report of what has changed can be found in the ChangeLog.
The code is released under the LGPL license, making it freely available for both private and commercial use, with few restrictions. The terms of the license are given in the LICENSE file included with this software.
Please see the main web page for more information. This version is a continuation of the original CMU parser.
As of version 5.9.0, the system includes an experimental facility for generating sentences. These are specified using a "fill in the blanks" API, in which words are substituted into generic wild-card locations whenever the result is a grammatically valid sentence. Additional details are given in the man page: man link-generator (in the man subdirectory).
The generator is used in the OpenCog language-learning project, which aims to automatically learn Link Grammars from corpora, using new and innovative information-theoretic techniques, somewhat similar to those found in artificial neural nets (deep learning), but using explicitly symbolic representations.
The parser includes APIs in various different programming languages, as well as a handy command-line tool for playing with it. Here is some typical output:
linkparser> This is a test!
Linkage 1, cost vector = (UNUSED=0 DIS= 0.00 LEN=6)
+-------------Xp------------+
+----->WV----->+---Ost--+ |
+---Wd---+-Ss*b+ +Ds**c+ |
| | | | | |
LEFT-WALL this.p is.v a test.n !
(S (NP this.p) (VP is.v (NP a test.n)) !)
LEFT-WALL 0.000 Wd+ hWV+ Xp+
this.p 0.000 Wd- Ss*b+
is.v 0.000 Ss- dWV- O*t+
a 0.000 Ds**c+
test.n 0.000 Ds**c- Os-
! 0.000 Xp- RW+
RIGHT-WALL 0.000 RW-
This rather busy display illustrates many interesting things. For example, the Ss*b link connects the verb and the subject, and indicates that the subject is singular. Likewise, the Ost link connects the verb and the object, and also indicates that the object is singular. The WV (verb-wall) link points at the head-verb of the sentence, while the Wd link points at the head-noun. The Xp link connects to the trailing punctuation. The Ds**c link connects the noun to the determiner: it again confirms that the noun is singular, and also that the noun starts with a consonant. (The PH link, not required here, is used to force phonetic agreement, distinguishing 'a' from 'an'.) These link types are documented in the English Link Documentation.
The bottom of the display is a list of the "disjuncts" used for each word. The disjuncts are simply a list of the connectors that were employed to form the links. They are particularly interesting because they serve as an extremely fine-grained form of "part of speech". Thus, for example, the disjunct S- O+ indicates a transitive verb: a verb that takes both a subject and an object. The additional markup above indicates not only that "is" is being used as a transitive verb, but also finer details: a transitive verb that took a singular subject, and that was used (is usable) as the head verb of a sentence. The floating-point value is the "cost" of the disjunct; it very roughly captures the idea of the log-likelihood of this particular grammatical usage. Much as parts of speech correlate with word meanings, so also fine-grained parts of speech correlate with much finer distinctions and gradations of meaning.
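The same information can be obtained programmatically. Below is a minimal sketch (not part of the original text) that uses the public C API declared in link-grammar/link-includes.h to print the diagram and the per-word disjuncts shown above; it assumes the library and the English dictionary are installed, and omits error handling.

```c
/* Minimal sketch: obtain the diagram and per-word disjuncts through the
 * C library API.  Assumes the library and the English dictionary are
 * installed; error handling is omitted. */
#include <stdio.h>
#include <stdbool.h>
#include <locale.h>
#include <link-grammar/link-includes.h>

int main(void)
{
    setlocale(LC_ALL, "en_US.UTF-8");
    Dictionary    dict = dictionary_create_lang("en");
    Parse_Options opts = parse_options_create();
    Sentence      sent = sentence_create("This is a test!", dict);

    sentence_split(sent, opts);
    if (sentence_parse(sent, opts) > 0)
    {
        Linkage linkage = linkage_create(0, sent, opts);
        char *diagram = linkage_print_diagram(linkage, true, 80);
        printf("%s\n", diagram);
        linkage_free_diagram(diagram);

        /* Print the disjunct used for each word, as in the display above. */
        for (size_t w = 0; w < linkage_get_num_words(linkage); w++)
        {
            const char *dj = linkage_get_disjunct_str(linkage, w);
            printf("%-12s %s\n", linkage_get_word(linkage, w), dj ? dj : "");
        }
        linkage_delete(linkage);
    }
    sentence_delete(sent);
    parse_options_delete(opts);
    dictionary_delete(dict);
    return 0;
}
```

Such a program can be compiled and linked with the pkg-config flags shown in the library-usage section further below.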
The link-grammar parser also supports morphological analysis. Here is an example in Russian:
linkparser> это теста
Linkage 1, cost vector = (UNUSED=0 DIS= 0.00 LEN=4)
+-----MVAip-----+
+---Wd---+ +-LLCAG-+
| | | |
LEFT-WALL это.msi тест.= =а.ndnpi
The LL link connects the stem 'тест' to the suffix 'а'. The MVA link connects only to the suffix, because in Russian it is the suffixes, and not the stems, that carry all of the syntactic structure. The Russian lexis is documented here.
The Thai dictionary is now fully developed, effectively covering the entire language. An example in Thai:
linkparser> นายกรัฐมนตรี ขึ้น กล่าว สุนทรพจน์
Linkage 1, cost vector = (UNUSED=0 DIS= 2.00 LEN=2)
+---------LWs--------+
| +<---S<--+--VS-+-->O-->+
| | | | |
LEFT-WALL นายกรัฐมนตรี.n ขึ้น.v กล่าว.v สุนทรพจน์.n
The VS link connects the two verbs 'ขึ้น' and 'กล่าว' in a serial verb construction. A summary of the link types is documented here. The full documentation of the Thai Link Grammar can be found here.
Thai Link Grammar also accepts POS-tagged and named-entity-tagged input. Each word can be annotated with its Link POS tag. For example:
linkparser> เมื่อวานนี้.n มี.ve คน.n มา.x ติดต่อ.v คุณ.pr ครับ.pt
Found 1 linkage (1 had no P.P. violations)
Unique linkage, cost vector = (UNUSED=0 DIS= 0.00 LEN=12)
+---------------------PT--------------------+
+---------LWs---------+---------->VE---------->+ |
| +<---S<---+-->O-->+ +<--AXw<-+--->O--->+ |
| | | | | | | |
LEFT-WALL เมื่อวานนี้.n[!] มี.ve[!] คน.n[!] มา.x[!] ติดต่อ.v[!] คุณ.pr[!] ครับ.pt[!]
The full documentation of the Thai dictionary can be found here.
The Thai dictionary accepts LST20 tags for POS and named entities, to bridge the gap between basic NLP tools and the Link Parser. For example:
linkparser> วันที่_25_ธันวาคม@DTM ของ@PS ทุก@AJ ปี@NN เป็น@VV วัน@NN คริสต์มาส@NN
Found 348 linkages (348 had no P.P. violations)
Linkage 1, cost vector = (UNUSED=0 DIS= 1.00 LEN=10)
+--------------------------------LWs--------------------------------+
| +<------------------------S<------------------------+
| | +---------->PO--------->+ |
| +----->AJpr----->+ +<---AJj<--+ +---->O---->+------NZ-----+
| | | | | | | |
LEFT-WALL วันที่_25_ธันวาคม@DTM[!] ของ@PS[!].pnn ทุก@AJ[!].jl ปี@NN[!].n เป็น@VV[!].v วัน@NN[!].na คริสต์มาส@NN[!].n
Note that each word above is annotated with its LST20 POS tag and NE tag. Full documentation of the Link POS tags and the LST20 tagset can be found here. More information about LST20, e.g. annotation guidelines and data statistics, can be found here.
Uniform sampling of random planar graphs is supported with the any language:
linkparser> asdf qwer tyuiop fghj bbb
Found 1162 linkages (1162 had no P.P. violations)
+-------ANY------+-------ANY------+
+---ANY--+--ANY--+ +---ANY--+--ANY--+
| | | | | |
LEFT-WALL asdf[!] qwer[!] tyuiop[!] fghj[!] bbb[!]
Random morphology splitting can likewise be performed with the ady language:
linkparser> asdf qwerty fghjbbb
Found 1512 linkages (1512 had no P.P. violations)
+------------------ANY-----------------+
+-----ANY----+-------ANY------+ +---------LL--------+
| | | | |
LEFT-WALL asdf[!ANY-WORD] qwerty[!ANY-WORD] fgh[!SIMPLE-STEM].= =jbbb[!SIMPLE-SUFF]
An extended overview and summary can be found in the Link Grammar Wikipedia page, which touches on most of the important aspects of the theory. However, it is no substitute for the original papers published on the subject:
Many more papers and references are listed on the primary Link Grammar website.
See also the C/C++ API documentation. Bindings for other programming languages, including Python3, Java and Node.js, can be found in the bindings directory. (There are two sets of JavaScript bindings: one for the library API, and another for the command-line parser.)
| Content | Description |
|---|---|
| LICENSE | The license describing the terms of use |
| ChangeLog | A compendium of recent changes. |
| configure | The GNU configuration script |
| autogen.sh | Developer's configure maintenance tool |
| link-grammar/*.c | The program. (Written in ANSI-C) |
| ----- | ----- |
| bindings/autoit/ | Optional AutoIt language bindings. |
| bindings/java/ | Optional Java language bindings. |
| bindings/js/ | Optional JavaScript language bindings. |
| bindings/lisp/ | Optional Common Lisp language bindings. |
| bindings/node.js/ | Optional node.js language bindings. |
| bindings/ocaml/ | Optional OCaml language bindings. |
| bindings/python/ | Optional Python3 language bindings. |
| bindings/python-examples/ | The link-grammar test suite, and Python language binding usage examples. |
| bindings/swig/ | SWIG interface file, for other FFI interfaces. |
| bindings/vala/ | Optional Vala language bindings. |
| ----- | ----- |
| data/en/ | English language dictionaries. |
| data/en/4.0.dict | The file containing the dictionary definitions. |
| data/en/4.0.knowledge | The post-processing knowledge file. |
| data/en/4.0.constituent-knowledge | The constituent knowledge file. |
| data/en/4.0.affix | The affix (prefix/suffix) file. |
| data/en/4.0.regex | Regular-expression-based morphology guesser. |
| data/en/tiny.dict | A small example dictionary. |
| data/en/words/ | A directory full of word lists. |
| data/en/corpus*.batch | Example corpora used for testing. |
| ----- | ----- |
| data/ru/ | A full-fledged Russian dictionary |
| data/th/ | A full-fledged Thai dictionary (100,000+ words) |
| data/ar/ | A fairly complete Arabic dictionary |
| data/fa/ | A Persian (Farsi) dictionary |
| data/de/ | A small prototype German dictionary |
| data/lt/ | A small prototype Lithuanian dictionary |
| data/id/ | A small prototype Indonesian dictionary |
| data/vn/ | A small prototype Vietnamese dictionary |
| data/he/ | A Hebrew dictionary |
| data/kz/ | An experimental Kazakh dictionary |
| data/tr/ | An experimental Turkish dictionary |
| ----- | ----- |
| morphology/ar/ | An Arabic morphology analyzer |
| morphology/fa/ | A Persian morphology analyzer |
| ----- | ----- |
| debug/ | Information about debugging the library |
| msvc/ | Microsoft Visual-C project files |
| mingw/ | Information on using MinGW under MSYS or Cygwin |
The system is distributed in the usual tar.gz format; it can be extracted with the command tar -zxf link-grammar.tar.gz at the command line.
A tarball of the latest version can be downloaded from:
https://www.gnucash.org/link-grammar/downloads/
The files have been digitally signed to make sure there was no corruption of the dataset during download, and to help ensure that no malicious changes were made to the code internals by third parties. The signatures can be checked with the gpg command:
gpg --verify link-grammar-5.12.5.tar.gz.asc
which should generate output identical to this (except for the date):
gpg: Signature made Thu 26 Apr 2012 12:45:31 PM CDT using RSA key ID E0C0651C
gpg: Good signature from "Linas Vepstas (Hexagon Architecture Patches) <[email protected]>"
gpg: aka "Linas Vepstas (LKML) <[email protected]>"
Alternately, the md5 checksums can be verified. These do not provide cryptographic security, but they can detect simple corruption. To verify the checksums, issue md5sum -c MD5SUM at the command line.
Tags in git can be verified by performing the following:
gpg --recv-keys --keyserver keyserver.ubuntu.com EB6AA534E0C0651C
git tag -v link-grammar-5.10.5
To compile the link-grammar shared library and demonstration program, type, at the command line:
./configure
make
make check
To install, change user to "root" and say
make install
ldconfig
This installs the liblink-grammar.so library into /usr/local/lib, the header files into /usr/local/include/link-grammar, and the dictionaries into /usr/local/share/link-grammar. Running ldconfig rebuilds the shared library cache. To verify that the install was successful, run (as a non-root user)
make installcheck
The link-grammar library has optional features that are enabled automatically if configure detects certain libraries. These libraries are optional on most systems; if their features are desired, the corresponding library needs to be installed before running configure.
The library package names may vary across systems (consult Google if needed ...). For example, names may include -devel instead of -dev, or omit it altogether. The library names may lack the lib prefix.
libsqlite3-dev (for the SQLite-backed dictionary)
libz1g-dev or libz-devel (currently needed for the bundled minisat2)
libedit-dev (see Editline)
libhunspell-dev or libaspell-dev (and the corresponding English dictionaries)
libtre-dev or libpcre2-dev (much faster than the libc regex implementation, and needed for correctness on FreeBSD and Cygwin)
libpcre2-dev. It must be used on certain systems (as specified in their building sections).
If libedit-dev is installed, then the arrow keys can be used to edit input to the link-parser tool; the up and down arrow keys recall previous entries. You want this; it makes testing and editing much easier.
Two versions of the node.js bindings are included. One version wraps the library; the other uses emscripten to wrap the command-line tool. The library bindings are in bindings/node.js, while the emscripten wrapper is in bindings/js.
These are built with npm. First, you must build the core C library. Then do the following:
cd bindings/node.js
npm install
npm run make
This creates the library bindings and runs a small unit test (which should pass). An example can be found in bindings/node.js/examples/simple.js.
For the command-line wrapper, do the following:
cd bindings/js
./install_emsdk.sh
./build_packages.sh
The Python3 bindings are built by default, provided that the corresponding Python development packages are installed. (The Python2 bindings are no longer supported.)
These packages are:
python3-devel or python3-dev. NOTE: Before issuing configure (see below), verify that the desired Python version can be invoked using your PATH.
Use of the Python bindings is optional; if you do not plan to use link-grammar with Python, you do not need them. To disable the Python bindings, use:
./configure --disable-python-bindings
The linkgrammar.py module provides a high-level interface in Python. The example.py, tests.py and sentence-check.py scripts provide usage examples.
Installing the Python bindings into a custom location can be done with make install pythondir=/where/to/install. By default, the Makefiles attempt to build the Java bindings. Use of the Java bindings is optional; if you do not plan to use link-grammar with Java, you do not need them. You can skip building the Java bindings by disabling them:
./configure --disable-java-bindings
If jni.h or ant cannot be found, the Java bindings will not be built.
Notes about finding jni.h:
Some common Java JVM distributions (most notably, Sun's) place this file in an unusual location, where it cannot be found automatically. To remedy this, make sure that the environment variable JAVA_HOME is set correctly. The configure script looks for jni.h in $JAVA_HOME/Headers and $JAVA_HOME/include; it also examines the corresponding locations for $JDK_HOME. If jni.h still cannot be found, specify the location with the CPPFLAGS variable: so, for example,
export CPPFLAGS="-I/opt/jdk1.5/include/:/opt/jdk1.5/include/linux"
or
export CPPFLAGS="-I/c/java/jdk1.6.0/include/ -I/c/java/jdk1.6.0/include/win32/"
Please note that the use of /opt is non-standard, and most system tools will not be able to find packages installed there.
The /usr/local install target can be overridden using the standard GNU configure --prefix option; so, for example:
./configure --prefix=/opt/link-grammar
By using pkg-config (see below), non-standard install locations can be detected automatically.
Additional configuration options are printed by
./configure --help
The system has been tested and works well on 32- and 64-bit Linux systems, FreeBSD, MacOS, and Microsoft Windows systems. Specific OS-dependent notes follow.
End users should download the tarball (see Unpacking and signature verification).
The current GitHub version is intended for developers (including anyone willing to provide a fix, a new feature or an improvement). The tip of the master branch is often unstable, and can sometimes contain bad code while it is under development. It also requires installing development tools that are not installed by default. For these reasons, use of the GitHub version is not recommended for regular end users.
To clone it: git clone https://github.com/opencog/link-grammar.git
Or download it as a ZIP:
https://github.com/opencog/link-grammar/archive/master.zip
Tools that may need to be installed before link-grammar can be built:
make (the gmake variant may be needed)
m4
gcc or clang
autoconf
libtool
autoconf-archive
pkg-config (may be named pkgconf or pkgconfig)
pip3 (for the Python bindings)
Optional:
swig (for the language bindings)
flex
Apache Ant (for the Java bindings)
graphviz (if you would like to use the word-graph display feature)
The GitHub version does not include the configure script. To generate it, use:
autogen.sh
If you get errors, make sure you have installed the development packages listed above, and that your system installation is up to date. In particular, a missing autoconf or autoconf-archive may cause strange and misleading errors.
For more about how to proceed, continue at the section on creating the system and the related sections.
To configure debug mode, use:
configure --enable-debug
This adds some verification debug code, along with functions that can pretty-print several data structures.
A feature that may be useful for debugging is the word-graph display. It is enabled by default. For more details on this feature, see Word-graph display.
When gcc is used, the current configuration has an apparent standard-C++-library mixing problem (fixes are welcome). However, the common practice on FreeBSD is to compile with clang, which does not have this problem. In addition, add-on packages are installed under /usr/local.
Hence, configure should be invoked as follows:
env LDFLAGS=-L/usr/local/lib CPPFLAGS=-I/usr/local/include
CC=clang CXX=clang++ configure
Note that pcre2 is a required package, because the existing libc regex implementation does not provide the needed level of regex support.
Some packages have names different from the ones mentioned in the previous section:
minisat (minisat2)
pkgconf (pkg-config)
Plain-vanilla Link Grammar should compile and run on Apple MacOS out of the box, as described above. At this time, there are no reported issues.
If you do NOT need the Java bindings, you should almost surely configure with:
./configure --disable-java-bindings
If you do want the Java bindings, be sure to set the JDK_HOME environment variable to wherever <Headers/jni.h> is. Set the JAVA_HOME variable to the location of the Java compiler. Make sure ant is installed.
If you would like to build from GitHub (see Building from the GitHub repository), the tools that may need to be installed can be installed using Homebrew.
Link-grammar can be compiled on Windows in three different ways. One way is to use Cygwin, which provides a Linux compatibility layer for Windows. Another way is to use the MSVC system. A third way is to use the MinGW system, which uses the GNU toolset to compile Windows programs. The source code supports Windows systems from Vista on.
The Cygwin way currently produces the best results, since it supports line editing with command completion and history, and supports the word-graph display on X-windows. (MinGW currently does not have libedit, and the MSVC port does not currently support command completion and history, or spelling.)
The easiest way to get link-grammar working on MS Windows is to use Cygwin, a Linux-like environment for Windows that makes it possible to port software running on POSIX systems to Windows. Download and install Cygwin.
Note that the pcre2 package needs to be installed, since the libc regex implementation is not adequate.
For more details see mingw/README-Cygwin.md.
Another way to build link-grammar is with MinGW, which uses the GNU toolset to compile POSIX-compliant programs for Windows. Using MinGW/MSYS2 is probably the easiest way to obtain workable Java bindings on Windows. Download and install MinGW/MSYS2 from msys2.org.
Note that the pcre2 package needs to be installed, since the libc regex implementation is not adequate.
For more details see mingw/README-MinGW64.md.
Microsoft Visual C/C++ project files can be found in the msvc directory. For directions, see the README.md file there.
To run the program, issue the command (assuming it is in your path):
link-parser [arguments]
This starts the program. The program has many user-settable variables and options. These can be displayed by entering !var at the link-parser prompt. Entering !help will display some additional commands.
The dictionaries are arranged in directories whose names are the 2-letter language codes. The link-parser program searches for such a language directory, either directly or under a directory named data, in this order:
the current directory, and then the installed location (typically /usr/local/share/link-grammar). If link-parser cannot find the desired dictionary, use verbosity level 4 to debug the problem; for example:
link-parser ru -verbosity=4
Other locations can be specified on the command line; for example:
link-parser ../path/to-my/modified/data/en
When accessing dictionaries in non-standard locations, the standard file names are still assumed (i.e. 4.0.dict, 4.0.affix, etc.).
The Russian dictionaries are in data/ru. Thus, the Russian parser can be started as:
link-parser ru
If you do not give an argument to link-parser, it searches for a language according to your current locale settings. If it cannot find such a language directory, it defaults to "en".
If you see errors similar to this:
Warning: The word "encyclop" found near line 252 of en/4.0.dict
matches the following words:
encyclop
This word will be ignored.
then your UTF-8 locales are either not installed or not configured. The shell command locale -a should list en_US.utf8 as a locale. If not, then you need to dpkg-reconfigure locales and/or run update-locale, or possibly apt-get install locales, or some combination or variant of these, depending on your operating system.
There are several ways to test the resulting build. If the Python bindings were built, a test program can be found in the file ./bindings/python-examples/tests.py. For more details, see the README.md file in the bindings/python-examples directory.
There are also multiple batches of test and example sentences in the language data directories, generally having names of the form corpus-*.batch. The parser program can be run in batch mode, to test the system on a large number of sentences. The following command runs the parser on a file called corpus-basic.batch:
link-parser < corpus-basic.batch
The line !batch near the top of corpus-basic.batch turns on batch mode. In this mode, sentences labeled with an initial * should be rejected, and those not starting with a * should be accepted. This batch file does report some errors, as do the files corpus-biolg.batch and corpus-fixes.batch. Work is ongoing to fix these.
The corpus-fixes.batch file contains many thousands of sentences that have been fixed since the original 4.1 release of link-grammar. The corpus-biolg.batch file contains biology/medical-text sentences from the BioLG project. The corpus-voa.batch file contains samples from Voice of America; the corpus-failures.batch file contains a large number of failures.
The following numbers are subject to change, but, at this time, the number of errors one can expect to observe in each of these files is roughly as follows:
en/corpus-basic.batch: 88 errors
en/corpus-fixes.batch: 371 errors
lt/corpus-basic.batch: 15 errors
ru/corpus-basic.batch: 47 errors
The bindings/python directory contains a unit test for the Python bindings. It also performs several basic checks that stress the link-grammar libraries.
There is an API (application program interface) to the parser. This makes it easy to incorporate it into your own applications. The API is documented on the web site.
The FindLinkGrammar.cmake file can be used to test for and set up compilation in CMake-based build environments.
To make compiling and linking easier, the current release uses the pkg-config system. To determine the location of the link-grammar header files, say pkg-config --cflags link-grammar. To obtain the location of the libraries, say pkg-config --libs link-grammar. Thus, for example, a typical makefile might include the targets:
.c.o:
cc -O2 -g -Wall -c $< `pkg-config --cflags link-grammar`
$(EXE): $(OBJS)
cc -g -o $@ $^ `pkg-config --libs link-grammar`
This release provides Java files offering three ways of accessing the parser. The simplest way is to use the org.linkgrammar.LinkGrammar class; this provides a very simple Java API to the parser.
The second possibility is to use the LGService class. This implements a TCP/IP network server, providing parse results as JSON messages. Any JSON-capable client can connect to this server and obtain parsed text.
The third possibility is to use the org.linkgrammar.LGRemoteClient class, and in particular its parse() method. This class is a network client that connects to the JSON server and converts the response back into results accessible via the ParseResult API.
The above-described code will be built if Apache ant is installed.
The network server can be started by saying:
java -classpath linkgrammar.jar org.linkgrammar.LGService 9000
The above starts the server on port 9000. If the port is omitted, a help text is printed. This server can be contacted directly via TCP/IP; for example:
telnet localhost 9000
(Or use netcat instead of telnet.) After connecting, type in:
text: this is an example sentence to parse
The returned bytes will be a JSON message providing the parses of the sentence. By default, the ASCII-art parse of the text is not transmitted. This can be obtained by sending messages of the form:
storeDiagramString:true, text: this is a test.
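Any language with TCP sockets can act as a client. Below is a hypothetical minimal sketch (not part of the original text) of such a client in C; it assumes the server is running on localhost port 9000 as shown above, and that a single newline-terminated request suffices, as in the telnet example.

```c
/* Hypothetical sketch of a tiny client for the JSON server described above.
 * Assumptions: server on localhost:9000; one newline-terminated request per
 * connection (inferred from the telnet example).  POSIX sockets only. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(9000);
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);
    if (connect(sock, (struct sockaddr *)&addr, sizeof(addr)) != 0)
    {
        perror("connect");
        return 1;
    }

    /* Ask for the ASCII-art diagram as well as the JSON parse. */
    const char *req = "storeDiagramString:true, text: this is a test.\n";
    write(sock, req, strlen(req));

    /* Read the JSON reply and echo it to stdout. */
    char buf[4096];
    ssize_t n;
    while ((n = read(sock, buf, sizeof(buf))) > 0)
        fwrite(buf, 1, (size_t)n, stdout);

    close(sock);
    return 0;
}
```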
The parser will run a spell-checker at an early stage if it encounters a word that it does not know and cannot guess based on morphology. The configure script looks for the aspell or hunspell spell-checkers; if the aspell development environment is found, aspell is used, otherwise hunspell is used.
Spell guessing may be disabled at runtime in the link-parser client with the !spell=0 flag. Enter !help for more details.
Caution: aspell version 0.60.8, and possibly others, leaks memory. Using spell guessing in production servers is therefore strongly discouraged. Keeping spell guessing disabled (=0) in Parse_Options is safe.
It is safe to use link-grammar in multiple threads. Threads may share the same dictionary. Parse options can be set on a per-thread basis, with the exception of verbosity, which is a global shared by all threads; it is the only global.
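As an illustration, here is a minimal sketch (not part of the original text) of two threads sharing a single dictionary, each with its own Parse_Options; it assumes the C API from link-includes.h and omits error handling. Compile with -pthread.

```c
/* Minimal sketch: two threads share one Dictionary, each with its own
 * Parse_Options, as described above.  Error handling is omitted. */
#include <stdio.h>
#include <pthread.h>
#include <link-grammar/link-includes.h>

static Dictionary shared_dict;   /* shared by all threads */

static void *worker(void *arg)
{
    const char *text = arg;
    Parse_Options opts = parse_options_create();  /* per-thread options */
    Sentence sent = sentence_create(text, shared_dict);
    sentence_split(sent, opts);
    int num = sentence_parse(sent, opts);
    printf("\"%s\": %d linkage(s)\n", text, num);
    sentence_delete(sent);
    parse_options_delete(opts);
    return NULL;
}

int main(void)
{
    shared_dict = dictionary_create_lang("en");

    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, "This is a test!");
    pthread_create(&t2, NULL, worker, "The quick brown fox jumped.");
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    dictionary_delete(shared_dict);
    return 0;
}
```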
The a/an phonetic determiners before consonants/vowels are handled by a new PH link type, which links the determiner to the word following it. Status: introduced in version 5.1.0 (August 2014). Mostly done, although many special-case nouns remain unfinished.
Directional links are needed for some languages, such as Lithuanian, Turkish and other free word-order languages. The goal is for a single link to clearly indicate which word is the head word and which is the dependent. This is achieved by prefixing connectors with a single lower-case letter: h or d, indicating 'head' or 'dependent'. The linkage rules are such that h matches either nothing or d, and d matches either h or nothing. This is a new feature in version 5.1.0 (August 2014). The website provides additional documentation.
Although the English-language link-grammar links are undirected, it seems that a de facto direction can be given to them that is completely consistent with the standard notions of dependency grammar.
The dependency arrows have the following properties:
Anti-reflexive (a word cannot depend on itself; it cannot point at itself.)
Anti-symmetric (if word1 depends on word2, then word2 cannot depend on word1) (so, for example, a determiner depends on a noun, but never vice versa)
The arrows are neither transitive nor anti-transitive: a single word may be ruled by several heads. For example:
+------>WV------->+
+-->Wd-->+<--Ss<--+
| | |
LEFT-WALL she thinks.v
That is, there is a path to the subject, "she", directly from the left wall, via the Wd link, as well as indirectly, from the wall to the root verb, and thence to the subject. Similar loops form with the B and R links. Such loops are useful for constraining the number of possible parses: the constraint occurs in conjunction with the "no links cross" meta-rule.
There are several related mathematical notions, but none quite capture directional LG:
Directional LG graphs resemble DAGs, except that LG allows only one wall (one "top" element).
Directional LG graphs resemble strict partial orders, except that the LG arrows are usually not transitive.
Directional LG graphs resemble catenae, except that a catena is strictly anti-transitive -- in a catena, the path to any word is unique.
The foundational LG papers mandate the planarity of the parse graphs. This is based on a very old observation that dependencies almost never cross in natural languages: humans simply do not speak sentences in which links cross. Imposing planarity then provides a strong engineering and algorithmic constraint on the resulting parses: the total number of parses to be considered is sharply reduced, and thus the overall speed of parsing can be greatly increased.
However, there are occasional, relatively rare exceptions to this planarity rule; such exceptions are observed in almost all languages. A number of these exceptions are given for English, below.
Thus, it seems important to relax the planarity constraint and find something else that is almost as strict, but which still allows the observed exceptions. The concept of "landmark transitivity", as defined by Richard Hudson in his theory of "Word Grammar", appears to be one such mechanism.
ftp://ftp.phon.ucl.ac.uk/pub/word-grammar/ell2-wg.pdf
http://www.phon.ucl.ac.uk/home/dick/enc/syntax.htm
http://goertzel.org/prowlgrammar.pdf
In practice, the planarity constraint allows a very efficient algorithm to be used in the parser implementation. Thus, from the implementation point of view, we want to keep planarity. Fortunately, there is also a convenient, unambiguous way to have our cake and eat it too. Non-planar graphs can be drawn on a sheet of paper using standard electrical-engineering notation: a funny symbol indicating that two lines cross without intersecting. This notation is easily adapted to LG connectors. Below is an actual working example, already implemented in the current LG English dictionary. All link crossings can be implemented in this way! Thus, we do not have to abandon the current parsing algorithms to obtain non-planar graphs. We don't even have to modify them! Hooray!
Here is a working example: "I want to look at and listen to everything." This wants two J links pointing to "everything". The desired graph needs to look like this:
+---->WV---->+
| +--------IV---------->+
| | +<-VJlpi--+
| | | +---xxx------------Js------->+
+--Wd--+-Sp*i+--TO-+-I*t-+-MVp+ +--VJrpi>+--MVp-+---Js->+
| | | | | | | | | |
LEFT-WALL I.p want.v to.r look.v at and.j-v listen.v to.r everything
The above really wants to have a Js link from "at" to "everything", but this Js link crosses (marked by xxx) the link to the conjunction. Other examples suggest that most links should be allowed to cross over the links to conjunctions.
The planarity-maintaining work-around is to split the Js link into two: a Jj part and a Jk part; the two together bridge over the conjunction. This is currently implemented in the English dictionary, and it works.
This work-around is in fact completely generic, and can be extended to any form of link-crossing. A better notation would be convenient for this; perhaps uJs- instead of Jj- and vJs- instead of Jk- , or something like that ... (TODO: invent better notation.) (NB: This is a kind of re-invention of "fat links", but in the dictionary, not in the code.)
Given that non-planar parses can be enabled without any changes to the parser algorithm, all that is required is to understand what sort of theory describes link-crossing in a coherent grounding. That theory is Dick Hudson's Landmark Transitivity, explained here.
This mechanism works as follows:
First, every link must be directional, with a head and a dependent. That is, we are concerned with directional-LG links, which are of the form x--A-->y or y<--A--x for words x,y and LG link type A.
Given either the directional-LG relation x--A-->y or y<--A--x, define the dependency relation x-->y. That is, ignore the link-type label.
Heads are landmarks for dependents. If the dependency relation x-->y holds, then x is said to be a landmark for y, and the predicate land(x,y) is true, while the predicate land(y,x) is false. Here, x and y are words, while --> is the landmark relation.
Although the basic directional-LG links form landmark relations, the total set of landmark relations is extended by transitive closure. That is, if land(x,y) and land(y,z) then land(x,z). That is, the basic directional-LG links are "generators" of landmarks; they generate by means of transitivity. Note that the transitive closure is unique.
In addition to the above landmark relation, there are two additional relations: the before and after landmark relations. (In English, these correspond to left and right; in Hebrew, the opposite). That is, since words come in chronological order in a sentence, the dependency relation can point either left or right. The previously-defined landmark relation only described the dependency order; we now introduce the word-sequence order. Thus, there are land-before() and land-after() relations that capture both the dependency relation, and the word-order relation.
Notation: the before-landmark relation land-B(x,y) corresponds to x-->y (in English, reversed in right-left languages such as Hebrew), whereas the after-landmark relation land-A(x,y) corresponds to y<--x. That is, land(x,y) == land-B(x,y) or land-A(x,y) holds as a statement about the predicate form of the relations.
As before, the full set of directional landmarks are obtained by transitive closure applied to the directional-LG links. Two different rules are used to perform this closure:
-- land-B(x,y) and land(y,z) ==> land-B(x,z)
-- land-A(x,y) and land(y,z) ==> land-A(x,z)
Parsing is then performed by joining LG connectors in the usual manner, to form a directional link. The transitive closure of the directional landmarks are then computed. Finally, any parse that does not conclude with the "left wall" being the upper-most landmark is discarded.
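To make the closure step concrete, here is a minimal sketch (not part of the original text). The data layout, word numbering, and seed relations are illustrative assumptions based on the "she thinks" example above; this is not library code.

```c
/* Illustrative sketch of the landmark-closure computation described above.
 * Words are numbered 0..n-1 in sentence order; the seed relations come from
 * the directional LG links of the parse. */
#include <stdbool.h>
#include <stdio.h>

#define MAX_WORDS 64

static bool land_B[MAX_WORDS][MAX_WORDS];  /* x is a before-landmark for y */
static bool land_A[MAX_WORDS][MAX_WORDS];  /* x is an after-landmark for y  */

/* land(x,y) == land-B(x,y) or land-A(x,y) */
static bool land(int x, int y) { return land_B[x][y] || land_A[x][y]; }

/* Apply the two closure rules until a fixpoint is reached:
 *   land-B(x,y) and land(y,z) ==> land-B(x,z)
 *   land-A(x,y) and land(y,z) ==> land-A(x,z)                  */
static void close_landmarks(int n)
{
    bool changed = true;
    while (changed)
    {
        changed = false;
        for (int x = 0; x < n; x++)
            for (int y = 0; y < n; y++)
                for (int z = 0; z < n; z++)
                    if (land(y, z))
                    {
                        if (land_B[x][y] && !land_B[x][z])
                            { land_B[x][z] = true; changed = true; }
                        if (land_A[x][y] && !land_A[x][z])
                            { land_A[x][z] = true; changed = true; }
                    }
    }
}

int main(void)
{
    /* Seed with the directional links of "she thinks":
     * wall --Wd--> she, wall --WV--> thinks, thinks --Ss--> she
     * (word 0 = LEFT-WALL, 1 = she, 2 = thinks). */
    land_B[0][1] = true;   /* dependent "she" lies after the wall    */
    land_B[0][2] = true;   /* dependent "thinks" lies after the wall */
    land_A[2][1] = true;   /* dependent "she" precedes its head "thinks" */

    close_landmarks(3);

    /* A parse is kept only if the left wall is the upper-most landmark. */
    bool wall_on_top = true;
    for (int w = 1; w < 3; w++)
        if (!land(0, w)) wall_on_top = false;
    printf("left wall is upper-most: %s\n", wall_on_top ? "yes" : "no");
    return 0;
}
```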
Here is an example where landmark transitivity provides a natural solution to a (currently) broken parse. The "to.r" has a disjunct "I+ & MVi-" which allows "What is there to do?" to parse correctly. However, it also allows the incorrect parse "He is going to do". The fix would be to force "do" to take an object; however, a link from "do" to "what" is not allowed, because link-crossing would prevent it.
Fixing this requires only a fix to the dictionary, and not to the parser itself.
Examples where the no-links-cross constraint seems to be violated, in English:
"He is either in the 105th or the 106th battalion."
"He is in either the 105th or the 106th battalion."
Both seem to be acceptable in English, but the ambiguity of the "in-either" temporal ordering requires two different parse trees, if the no-links-cross rule is to be enforced. This seems un-natural. Similarly:
"He is either here or he is there."
"He either is here or he is there."
A different example involves a crossing to the left wall. That is, the links LEFT-WALL--remains crosses over here--found :
"Here the remains can be found."
Other examples, per And Rosta:
The allowed--by link crosses cake--that :
He had been allowed to eat a cake by Sophy that she had made him specially
a--book , very--indeed
"a very much easier book indeed"
an--book , easy--to
"an easy book to read"
a--book , more--than
"a more difficult book than that one"
that--have crosses remains--of
"It was announced that remains have been found of the ark of the covenant"
There is a natural crossing, driven by conjunctions:
"I was in hell yesterday and heaven on Tuesday."
the "natural" linkage is to use MV links to connect "yesterday" and "on Tuesday" to the verb. However, if this is done, then these must cross the links from the conjunction "and" to "heaven" and "hell". This can be worked around partly as follows:
+-------->Ju--------->+
| +<------SJlp<----+
+<-SX<-+->Pp->+ +-->Mpn->+ +->SJru->+->Mp->+->Js->+
| | | | | | | | |
I was in hell yesterday and heaven on Tuesday
but the desired MV links from the verb to the time-prepositions "yesterday" and "on Tuesday" are missing -- whereas they are present, when the individual sentences "I was in hell yesterday" and "I was in heaven on Tuesday" are parsed. Using a conjunction should not wreck the relations that get used; but this requires link-crossing.
"Sophy wondered up to whose favorite number she should count"
Here, "up_to" must modify "number", and not "whose". There's no way to do this without link-crossing.
Link Grammar can be understood in the context of type theory. A simple introduction to type theory can be found in chapter 1 of the HoTT book. This book is freely available online and strongly recommended if you are interested in types.
Link types can be mapped to types that appear in categorial grammars. The nice thing about link-grammar is that the link types form a type system that is much easier to use and comprehend than that of categorial grammar, and yet can be directly converted to that system! That is, link-grammar is completely compatible with categorial grammar, and is easier-to-use. See the paper "Combinatory Categorial Grammar and Link Grammar are Equivalent" for details.
The foundational LG papers make comments to this effect; however, see also work by Bob Coecke on category theory and grammar. Coecke's diagrammatic approach is essentially identical to the diagrams given in the foundational LG papers; it becomes abundantly clear that the category theoretic approach is equivalent to Link Grammar. See, for example, this introductory sketch http://www.cs.ox.ac.uk/people/bob.coecke/NewScientist.pdf and observe how the diagrams are essentially identical to the LG jigsaw-puzzle piece diagrams of the foundational LG publications.
If you have any questions, please feel free to send a note to the mailing list.
The source code of link-parser and the link-grammar library is located at GitHub.
For bug reports, please open an issue there.
Although all messages should go to the mailing list, the current maintainers can be contacted at:
Linas Vepstas - <[email protected]>
Amir Plivatsky - <[email protected]>
Dom Lachowicz - <[email protected]>
A complete list of authors and copyright holders can be found in the AUTHORS file. The original authors of the Link Grammar parser are:
Daniel Sleator [email protected]
Computer Science Department 412-268-7563
Carnegie Mellon University www.cs.cmu.edu/~sleator
Pittsburgh, PA 15213
Davy Temperley [email protected]
Eastman School of Music 716-274-1557
26 Gibbs St. www.link.cs.cmu.edu/temperley
Rochester, NY 14604
John Lafferty [email protected]
Computer Science Department 412-268-6791
Carnegie Mellon University www.cs.cmu.edu/~lafferty
Pittsburgh, PA 15213
Some working notes.
Easy to fix: provide a more uniform API to the constituent tree. ie provide word index. Also, provide a better word API, showing word extent, subscript, etc.
There are subtle technical issues for handling capitalized first words. This needs to be fixed. In addition, for now these words are shown uncapitalized in the result linkages. This can be fixed.
Maybe capitalization could be handled in the same way that a/an could be handled! After all, it's essentially a nearest-neighbor phenomenon!
See also issue 690
The proximal issue is to add a cost, so that Bill gets a lower cost than bill.n when parsing "Bill went on a walk". The best solution would be to add a 'capitalization-mark token' during tokenization; this token precedes capitalized words. The dictionary then explicitly links to this token, with rules similar to the a/an phonetic distinction. The point here is that this moves capitalization out of ad-hoc C code and into the dictionary, where it can be handled like any other language feature. The tokenizer includes experimental code for that.
The old code for parse ranking via corpus statistics needs to be revived. The issue can be illustrated with these example sentences:
"Please the customer, bring in the money"
"Please, turn off the lights"
In the first sentence, the comma acts as a conjunction of two directives (imperatives). In the second sentence, it is much too easy to mistake "please" for a verb, the comma for a conjunction, and come to the conclusion that one should please some unstated object, and then turn off the lights. (Perhaps one is pleasing by turning off the lights?)
When a sentence fails to parse, look for:
Poor agreement might be handled by giving a cost to mismatched lower-case connector letters.
A common phenomenon in English is that some words that one might expect to "properly" be present can disappear under various conditions. Below is a sampling of these. Some possible solutions are given below.
Expressions such as "Looks good" have an implicit "it" (also called a zero-it or phantom-it) in them; that is, the sentence should really parse as "(it) looks good". The dictionary could be simplified by admitting such phantom words explicitly, rather than modifying the grammar rules to allow such constructions. Other examples, with the phantom word in parenthesis, include:
This can extend to elided/unvoiced syllables:
Elided punctuation:
Normally, the subjects of imperatives must always be offset by a comma: "John, give me the hammer", but here, in muttering an oath, the comma is swallowed (unvoiced).
Some complex phantom constructions:
See also GitHub issue #224.
Actual ellipsis:
Here, the ellipsis stands for a subordinate clause, which attaches with not one, but two links: C+ & CV+ , and thus requires two words, not one. There is no way to have the ellipsis word sink two connectors starting from the same word, and so some more complex mechanism is needed. The solution is to infer a second phantom ellipsis:
where the first ellipsis is a stand in for the subject of a subordinate clause, and the second stands in for an unknown verb.
Many (unstressed) syllables can be elided; in modern English, this occurs most commonly in the initial unstressed syllable:
Poorly punctuated sentences cause problems: for example:
"Mike was not first, nor was he last."
"Mike was not first nor was he last."
The one without the comma currently fails to parse. How can we deal with this in a simple, fast, elegant way? Similar questions for zero-copula and zero-that sentences.
Consider an argument between a professor and a dean, and the dean wants the professor to write a brilliant review. At the end of the argument, the dean exclaims: "I want the review brilliant!" This is a predicative adjective; clearly it means "I want the review [that you write to be] brilliant." However, taken out of context, such a construction is ungrammatical, as the predicativeness is not at all apparent, and it reads just as incorrectly as would "*Hey Joe, can you hand me that review brilliant?"
"Push button"
"Push button firmly"
The subject is a phantom; the subject is "you".
One possible solution is to perform a one-point compactification. The dictionary contains the phantom words, and their connectors. Ordinary disjuncts can link to these, but should do so using a special initial lower-case letter (say, 'z', in addition to 'h' and 'd' as is currently implemented). The parser, as it works, examines the initial letter of each connector: if it is 'z', then the usual pruning rules no longer apply, and one or more phantom words are selected out of the bucket of phantom words. (This bucket is kept out-of-line, it is not yet placed into sentence word sequence order, which is why the usual pruning rules get modified.) Otherwise, parsing continues as normal. At the end of parsing, if there are any phantom words that are linked, then all of the connectors on the disjunct must be satisfied (of course!) else the linkage is invalid. After parsing, the phantom words can be inserted into the sentence, with the location deduced from link lengths.
A more principled approach to fixing the phantom-word issue is to borrow the idea of re-writing from the theory of operator grammar. That is, certain phrases and constructions can be (should be) re-written into their "proper form", prior to parsing. The re-writing step would insert the missing words, then the parsing proceeds. One appeal of such an approach is that re-writing can also handle other "annoying" phenomena, such as typos (missing apostrophes, eg "lets" vs. "let's", "its" vs. "it's") as well as multi-word rewrites (eg "let's" vs. "let us", or "it's" vs. "it is").
Exactly how to implement this is unclear. However, it seems to open the door to more abstract, semantic analysis. Thus, for example, in Meaning-Text Theory (MTT), one must move between SSynt to DSynt structures. Such changes require a graph re-write from the surface syntax parse (eg provided by link-grammar) to the deep-syntactic structure. By contrast, handling phantom words by graph re-writing prior to parsing inverts the order of processing. This suggests that a more holistic approach is needed to graph rewriting: it must somehow be performed "during" parsing, so that parsing can both guide the insertion of the phantom words, and, simultaneously guide the deep syntactic rewrites.
Another interesting possibility arises with regards to tokenization. The current tokenizer is clever, in that it splits not only on whitespace, but can also strip off prefixes, suffixes, and perform certain limited kinds of morphological splitting. That is, it currently has the ability to re-write single-words into sequences of words. It currently does so in a conservative manner; the letters that compose a word are preserved, with a few exceptions, such as making spelling correction suggestions. The above considerations suggest that the boundary between tokenization and parsing needs to become both more fluid, and more tightly coupled.
Compare "she will be happier than before" to "she will be more happy than before." Current parser makes "happy" the head word, and "more" a modifier w/EA link. I believe the correct solution would be to make "more" the head (link it as a comparative), and make "happy" the dependent. This would harmonize rules for comparatives... and would eliminate/simplify rules for less,more.
However, this idea needs to be double-checked against, eg Hudson's word grammar. I'm confused on this issue ...
Currently, some links can act at "unlimited" length, while others can only be finite-length. eg determiners should be near the noun that they apply to. A better solution might be to employ a 'stretchiness' cost to some connectors: the longer they are, the higher the cost. (This eliminates the "unlimited_connector_set" in the dictionary).
Sometimes, the existence of one parse should suggest that another parse must surely be wrong: if one parse is possible, then the other parses must surely be unlikely. For example: the conjunction and.jg allows the "The Great Southern and Western Railroad" to be parsed as the single name of an entity. However, it also provides a pattern match for "John and Mike" as a single entity, which is almost certainly wrong. But "John and Mike" has an alternative parse, as a conventional-and -- a list of two people, and so the existence of this alternative (and correct) parse suggests that perhaps the entity-and is really very much the wrong parse. That is, the mere possibility of certain parses should strongly disfavor other possible parses. (Exception: Ben & Jerry's ice cream; however, in this case, we could recognize Ben & Jerry as the name of a proper brand; but this is outside of the "normal" dictionary (?) (but maybe should be in the dictionary!))
More examples: "high water" can have the connector A joining high.a and AN joining high.n; these two should either be collapsed into one, or one should be eliminated.
Use WordNet to reduce the number for parses for sentences containing compound verb phrases, such as "give up", "give off", etc.
To avoid a combinatorial explosion of parses, it would be nice to have an incremental parsing, phrase by phrase, using a sliding window algorithm to obtain the parse. Thus, for example, the parse of the last half of a long, run-on sentence should not be sensitive to the parse of the beginning of the sentence.
Doing so would help with combinatorial explosion. So, for example, if the first half of a sentence has 4 plausible parses, and the last half has 4 more, then currently, the parser reports 16 parses total. It would be much more useful if it could instead report the factored results: ie the four plausible parses for the first half, and the four plausible parses for the last half. This would ease the burden on downstream users of link-grammar.
This approach has at least some psychological support. Humans take long sentences and split them into smaller chunks that "hang together" as phrase-structures, viz. compounded sentences. The most likely parse is the one where each of the quasi-sub-sentences is parsed correctly.
This could be implemented by saving dangling right-going connectors into a parse context, and then, when another sentence fragment arrives, use that context in place of the left-wall.
This somewhat resembles the application of construction grammar ideas to the link-grammar dictionary. It also somewhat resembles Viterbi parsing to some fixed depth. Viz. do a full backward-forward parse for a phrase, and then, once this is done, take a Viterbi-step. That is, once the phrase is done, keep only the dangling connectors to the phrase, place a wall, and then step to the next part of the sentence.
Caution: watch out for garden-path sentences:
The horse raced past the barn fell.
The old man the boat.
The cotton clothing is made of grows in Mississippi.
The current parser parses these perfectly; a viterbi parser could trip on these.
Other benefits of a Viterbi decoder:
One may argue that Viterbi is a more natural, biological way of working with sequences. Some experimental, psychological support for this can be found at http://www.sciencedaily.com/releases/2012/09/120925143555.htm per Morten Christiansen, Cornell professor of psychology.
Consider the sentence "Thieves rob bank" -- a typical newspaper headline. LG currently fails to parse this, because the determiner is missing ("bank" is a count noun, not a mass noun, and thus requires a determiner. By contrast, "thieves rob water" parses just fine.) A fix for this would be to replace mandatory determiner links by (D- or {[[()]] & headline-flag}) which allows the D link to be omitted if the headline-flag bit is set. Here, "headline-flag" could be a new link-type, but one that is not subject to planarity constraints.
Note that this is easier said than done: if one simply adds a high-cost null link, and no headline-flag, then all sorts of ungrammatical sentences parse, with strange parses; while some grammatical sentences, which should parse, but currently don't, become parsable, but with crazy results.
More examples, from And Rosta:
"when boy meets girl"
"when bat strikes ball"
"both mother and baby are well"
A natural approach would be to replace fixed costs by formulas. This would allow the dialect/sociolect to be dynamically changeable. That is, rather than having a binary headline-flag, there would be a formula for the cost, which could be changed outside of the parsing loop. Such formulas could be used to enable/disable parsing specific to different dialects/sociolects, simply by altering the network of link costs.
A simpler alternative would be to have labeled costs (a cost vector), so that different dialects assign different costs to various links. A dialect would be specified during the parse, thus causing the costs for that dialect to be employed during parse ranking.
This has been implemented; what's missing is a practical tutorial on how this might be used.
A good reference for refining verb usage patterns is: "COBUILD GRAMMAR PATTERNS 1: VERBS from THE COBUILD SERIES", from THE BANK OF ENGLISH, HARPER COLLINS. Online at https://arts-ccr-002.bham.ac.uk/ccr/patgram/ and http://www.corpus.bham.ac.uk/publications/index.shtml
Currently tokenize.c tokenizes double-quotes and some UTF8 quotes (see the RPUNC/LPUNC class in en/4.0.affix - the QUOTES class is not used for that, but for capitalization support), with some very basic support in the English dictionary (see "% Quotation marks." there). However, it does not do this for the various "curly" UTF8 quotes, such as 'these' and “these”. This results is some ugly parsing for sentences containing such quotes. (Note that these are in 4.0.affix).
A mechanism is needed to disentangle the quoting from the quoted text, so that each can be parsed appropriately. It's somewhat unclear how to handle this within link-grammar. This is somewhat related to the problem of morphology (parsing words as if they were "mini-sentences",) idioms (phrases that are treated as if they were single words), set-phrase structures (if ... then ... not only... but also ...) which have a long-range structure similar to quoted text (he said ...).
See also GitHub issue #42.
"to be fishing": Link grammar offers four parses of "I was fishing for evidence", two of which are given low scores, and two are given high scores. Of the two with high scores, one parse is clearly bad. Its links "to be fishing.noun" as opposed to the correct "to be fishing.gerund". That is, I can be happy, healthy and wise, but I certainly cannot be fishing.noun. This is perhaps not just a bug in the structure of the dictionary, but is perhaps deeper: link-grammar has little or no concept of lexical units (ie collocations, idioms, institutional phrases), which thus allows parses with bad word-senses to sneak in.
The goal is to introduce more knowledge of lexical units into LG.
Different word senses can have different grammar rules (and thus, the links employed reveal the sense of the word): for example: "I tend to agree" vs. "I tend to the sheep" -- these employ two different meanings for the verb "tend", and the grammatical constructions allowed for one meaning are not the same as those allowed for the other. Yet, the link rules for "tend.v" have to accommodate both senses, thus making the rules rather complex. Worse, it potentially allows for non-sense constructions. If, instead, we allowed the dictionary to contain different rules for "tend.meaning1" and "tend.meaning2", the rules would simplify (at the cost of inflating the size of the dictionary).
Another example: "I fear so" -- the word "so" is only allowed with some, but not all, lexical senses of "fear". So eg "I fear so" is in the same semantic class as "I think so" or "I hope so", although other meanings of these verbs are otherwise quite different.
[Sin2004] "New evidence, new priorities, new attitudes" in J. Sinclair, (ed) (2004) How to use corpora in language teaching, Amsterdam: John Benjamins
See also: Pattern Grammar: A Corpus-Driven Approach to the Lexical Grammar of English
Susan Hunston and Gill Francis (University of Birmingham)
Amsterdam: John Benjamins (Studies in corpus linguistics, edited by Elena Tognini-Bonelli, volume 4), 2000
Book review.
“The Molecular Level of Lexical Semantics”, EA Nida, (1997) International Journal of Lexicography, 10(4): 265–274. Online.
The link-grammar provides several mechanisms to support circumpositions or even more complicated multi-word structures. One mechanism is by ordinary links; see the V, XJ and RJ links. The other mechanism is by means of post-processing rules. (For example, the "filler-it" SF rules use post-processing.) However, rules for many common forms have not yet been written. The general problem is of supporting structures that have "holes" in the middle, that require "lacing" to tie them together.
For a general theory, see catena.
For example, the adposition:
... from [xxx] on.
"He never said another word from then on."
"I promise to be quiet from now on."
"Keep going straight from that point on."
"We went straight from here on."
... from there on.
"We went straight, from the house on to the woods."
"We drove straight, from the hill onwards."
Note that multiple words can fit in the slot [xxx]. Note the tangling of another prepositional phrase: "... from [xxx] on to [yyy]"
More complicated collocations with holes include
"First.. next..."
"If ... then ..."
'Then' is optional ('then' is a 'null word'), for example:
"If it is raining, stay inside!"
"If it is raining, [then] stay inside!"
"if ... only ..." "If there were only more like you!"
"... not only, ... but also ..."
"As ..., so ..." "As it was commanded, so it shall be done"
"Either ... or ..."
"Both ... and ..." "Both June and Tom are coming"
"ought ... if ..." "That ought to be the case, if John is not lying"
"Someone ... who ..."
"Someone is outside who wants to see you"
"... for ... to ..."
"I need for you to come to my party"
The above are not currently supported. An example that is supported is the "non-referential it", eg
"It ... that ..."
"It seemed likely that John would go"
The above is supported by means of special disjuncts for 'it' and 'that', which must occur in the same post-processing domain.
See also:
http://www.phon.ucl.ac.uk/home/dick/enc2010/articles/extraposition.htm
http://www.phon.ucl.ac.uk/home/dick/enc2010/articles/relative-clause.htm
"...from X and from Y" "By X, and by Y, ..." Here, X and Y might be rather long phrases, containing other prepositions. In this case, the usual link-grammar linkage rules will typically conjoin "and from Y" to some preposition in X, instead of the correct link to "from X". Although adding a cost to keep the lengths of X and Y approximately equal can help, it would be even better to recognize the "...from ... and from..." pattern.
The correct solution for the "Either ... or ..." appears to be this:
---------------------------+---SJrs--+
+------???----------+ |
| +Ds**c+--SJls-+ +Ds**+
| | | | | |
either.r the lorry.n or.j-n the van.n
The wrong solution is
--------------------------+
+-----Dn-----+ +---SJrs---+
| +Ds**c+--SJn--+ +Ds**+
| | | | | |
neither.j the lorry.n nor.j-n the van.n
The problem with this is that "neither" must coordinate with "nor". That is, one cannot say "either.. nor..." "neither ... or ... " "neither ...and..." "but ... nor ..." The way I originally solved the coordination problem was to invent a new link called Dn, and a link SJn and to make sure that Dn could only connect to SJn, and nothing else. Thus, the lower-case "n" was used to propagate the coordination across two links. This demonstrates how powerful the link-grammar theory is: with proper subscripts, constraints can be propagated along links over large distances. However, this also makes the dictionary more complex, and the rules harder to write: coordination requires a lot of different links to be hooked together. And so I think that creating a single, new link, called ???, will make the coordination easy and direct. That is why I like that idea.
This ??? link should be the XJ link, which-see.
More idiomatic than the above examples: "...the chip on X's shoulder" "to do X a favour" "to give X a look"
The above are all examples of "set phrases" or "phrasemes", and are most commonly discussed in the context of MTT or Meaning-Text Theory of Igor Mel'cuk et al (search for "MTT Lexical Function" for more info). Mel'cuk treats set phrases as lexemes, and, for parsing, this is not directly relevant. However, insofar as phrasemes have a high mutual information content, they can dominate the syntactic structure of a sentence.
The current parse of "he wanted to look at and listen to everything." is inadequate: the link to "everything" needs to connect to "and", so that "listen to" and "look at" are treated as atomic verb phrases.
MTT suggests that perhaps the correct way to understand the contents of the post-processing rules is as an implementation of 'lexical functions' projected onto syntax. That is, the post-processing rules allow only certain syntactical constructions, and these are the kinds of constructions one typically sees in certain kinds of lexical functions.
Alternately, link-grammar suffers from a combinatoric explosion of possible parses of a given sentence. It would seem that lexical functions could be used to rule out many of these parses. On the other hand, the results are likely to be similar to that of statistical parse ranking (which presumably captures such quasi-idiomatic collocations at least weakly).
Reference: I. Mel'cuk: "Collocations and Lexical Functions", in ''Phraseology: theory, analysis, and applications'' Ed. Anthony Paul Cowie (1998) Oxford University Press pp. 23-54.
More generally, all of link-grammar could benefit from a MTT-izing of infrastructure.
Compare the above commentary on lexical functions to Hebrew morphological analysis. To quote Wikipedia:
This distinction between the word as a unit of speech and the root as a unit of meaning is even more important in the case of languages where roots have many different forms when used in actual words, as is the case in Semitic languages. In these, roots are formed by consonants alone, and different words (belonging to different parts of speech) are derived from the same root by inserting vowels. For example, in Hebrew, the root gdl represents the idea of largeness, and from it we have gadol and gdola (masculine and feminine forms of the adjective "big"), gadal "he grew", higdil "he magnified" and magdelet "magnifier", along with many other words such as godel "size" and migdal "tower".
Instead of hard-coding LL, declare which links are morpho links in the dict.
Version 6.0 will change Sentence to Sentence*, Linkage to Linkage* in the API. But perhaps this is a bad idea...