Version 5.12.5
The Link Grammar Parser exhibits the linguistic (natural language) structure of English, Thai, Russian, Arabic, Persian and a limited number of other languages. This structure is a graph of typed links (edges) between the words in a sentence. More conventional HPSG (constituency) and dependency-style parses can be obtained from Link Grammar by applying a collection of rules to convert between these different formats. This is possible because Link Grammar goes a bit "deeper" into the "syntactico-semantic" structure of a sentence: it provides considerably more fine-grained and detailed information than what is commonly available from conventional parsers.
The theory of Link Grammar parsing was originally developed in 1991 by Davy Temperley, John Lafferty and Daniel Sleator, at the time professors of linguistics and computer science at Carnegie Mellon University. The first three publications on the theory provide the best introduction and overview; since then, there have been hundreds of publications further exploring, examining and extending the ideas.
Although based on the original Carnegie-Mellon code base, the current Link Grammar package has dramatically evolved and is profoundly different from earlier versions. There have been innumerable bug fixes; performance has improved by several orders of magnitude. The package is fully multi-threaded, fully UTF-8 enabled, and has been scrubbed for security, enabling cloud deployment. Coverage of the English language has been dramatically improved; other languages have been added (most notably, Thai and Russian). There are many new features, including support for morphology, dialects, and a fine-grained cost system allowing vector-embedding-like behaviors. There is a new, sophisticated tokenizer tailored for morphology: it can offer alternative splittings for morphologically ambiguous words. Dictionaries can be updated at run time, so that systems that continuously learn a grammar can parse at the same time; that is, dictionary updates and parsing are mutually thread-safe. Word classes can be identified with regexes. Random planar-graph parsing is fully supported; this allows planar graphs to be sampled uniformly. A detailed report of what has changed can be found in the ChangeLog.
The code is released under the LGPL license, making it freely available for both private and commercial use, with few restrictions. The terms of the license are given in the LICENSE file included with this software.
Please see the main web page for more information. This version is a continuation of the original CMU parser.
As of version 5.9.0, the package includes an experimental system for generating sentences. These are specified using a "fill in the blanks" API, in which words are replaced by generic wild-card slots whenever the result is a grammatically valid sentence. Additional details are given in the man page: man link-generator (in the man subdirectory).
The generator is used by the OpenCog language-learning project, which aims to automatically learn Link Grammars from corpora, using novel and innovative information-theoretic techniques, somewhat similar to those found in artificial neural nets (deep learning), but using explicit symbolic representations.
The parser includes an API for use from various different programming languages, as well as a handy command-line tool for playing with it. Here's some typical output:
linkparser> This is a test!
Linkage 1, cost vector = (UNUSED=0 DIS= 0.00 LEN=6)
+-------------Xp------------+
+----->WV----->+---Ost--+ |
+---Wd---+-Ss*b+ +Ds**c+ |
| | | | | |
LEFT-WALL this.p is.v a test.n !
(S (NP this.p) (VP is.v (NP a test.n)) !)
LEFT-WALL 0.000 Wd+ hWV+ Xp+
this.p 0.000 Wd- Ss*b+
is.v 0.000 Ss- dWV- O*t+
a 0.000 Ds**c+
test.n 0.000 Ds**c- Os-
! 0.000 Xp- RW+
RIGHT-WALL 0.000 RW-
This rather busy display illustrates many interesting things. For example, the Ss*b link connects the verb to the subject, and indicates that the subject is singular. Likewise, the Ost link connects the verb to the object, and also indicates that the object is singular. The WV (verb-wall) link points at the head-verb of the sentence, while the Wd link points at the head-noun. The Xp link connects to the trailing punctuation. The Ds**c link connects the noun to the determiner: it again confirms that the noun is singular, and also that the noun starts with a consonant. (The PH link, not needed here, is used to force phonetic agreement, distinguishing 'a' from 'an'.) These link types are documented in the English-language link documentation.
At the bottom of the display is a list of the "disjuncts" used for each word. The disjuncts are simply the list of connectors that were employed to form the links. They are particularly interesting because they serve as an extremely fine-grained form of "part of speech". Thus, for example, the disjunct S- O+ indicates a transitive verb: a verb that takes both a subject and an object. The additional markup above indicates that "is" is not only being used as a transitive verb, but also conveys finer details: a transitive verb that took a singular subject, and was used (is usable as) the head verb of the sentence. The floating-point value is the "cost" of the disjunct; it very roughly captures the idea of the log-probability of this particular grammatical usage. Much as parts of speech correlate with word meanings, so also fine-grained parts of speech correlate with much finer distinctions and shades of meaning.
The Link Grammar parser also supports morphological analysis. Here is an example in Russian:
linkparser> это теста
Linkage 1, cost vector = (UNUSED=0 DIS= 0.00 LEN=4)
+-----MVAip-----+
+---Wd---+ +-LLCAG-+
| | | |
LEFT-WALL это.msi тест.= =а.ndnpi
The LL link connects the stem 'тест' to the suffix 'а'. The MVA link connects only to the suffix, because, in Russian, it is the suffixes that carry all of the syntactic structure, and not the stems. The Russian lexis is documented here.
The Thai dictionary is now fully developed, effectively covering the entire language. An example in Thai:
linkparser> นายกรัฐมนตรี ขึ้น กล่าว สุนทรพจน์
Linkage 1, cost vector = (UNUSED=0 DIS= 2.00 LEN=2)
+---------LWs--------+
| +<---S<--+--VS-+-->O-->+
| | | | |
LEFT-WALL นายกรัฐมนตรี.n ขึ้น.v กล่าว.v สุนทรพจน์.n
The VS link connects the two verbs 'ขึ้น' and 'กล่าว' in a serial verb construction. A summary of the link types is documented here. The full documentation for the Thai Link Grammar can be found here.
Thai Link Grammar also accepts POS-tagged and named-entity-tagged inputs. Each word can be annotated with a Link POS tag. For example:
linkparser> เมื่อวานนี้.n มี.ve คน.n มา.x ติดต่อ.v คุณ.pr ครับ.pt
Found 1 linkage (1 had no P.P. violations)
Unique linkage, cost vector = (UNUSED=0 DIS= 0.00 LEN=12)
+---------------------PT--------------------+
+---------LWs---------+---------->VE---------->+ |
| +<---S<---+-->O-->+ +<--AXw<-+--->O--->+ |
| | | | | | | |
LEFT-WALL เมื่อวานนี้.n[!] มี.ve[!] คน.n[!] มา.x[!] ติดต่อ.v[!] คุณ.pr[!] ครับ.pt[!]
The full documentation for the Thai dictionary can be found here.
The Thai dictionary also accepts LST20 tags for POS and named entities, to bridge the gap between basic NLP tools and the Link Parser. For example:
linkparser> วันที่_25_ธันวาคม@DTM ของ@PS ทุก@AJ ปี@NN เป็น@VV วัน@NN คริสต์มาส@NN
Found 348 linkages (348 had no P.P. violations)
Linkage 1, cost vector = (UNUSED=0 DIS= 1.00 LEN=10)
+--------------------------------LWs--------------------------------+
| +<------------------------S<------------------------+
| | +---------->PO--------->+ |
| +----->AJpr----->+ +<---AJj<--+ +---->O---->+------NZ-----+
| | | | | | | |
LEFT-WALL วันที่_25_ธันวาคม@DTM[!] ของ@PS[!].pnn ทุก@AJ[!].jl ปี@NN[!].n เป็น@VV[!].v วัน@NN[!].na คริสต์มาส@NN[!].n
Note that each word above is annotated with an LST20 POS tag and NE tag. The full documentation for the Link POS tags and LST20 tags can be found here. For more information about LST20, e.g. the annotation guidelines and data statistics, see here.
Random planar graphs, sampled uniformly, are supported with the any language:
linkparser> asdf qwer tyuiop fghj bbb
Found 1162 linkages (1162 had no P.P. violations)
+-------ANY------+-------ANY------+
+---ANY--+--ANY--+ +---ANY--+--ANY--+
| | | | | |
LEFT-WALL asdf[!] qwer[!] tyuiop[!] fghj[!] bbb[!]
The any language can likewise perform random morphological splitting:
linkparser> asdf qwerty fghjbbb
Found 1512 linkages (1512 had no P.P. violations)
+------------------ANY-----------------+
+-----ANY----+-------ANY------+ +---------LL--------+
| | | | |
LEFT-WALL asdf[!ANY-WORD] qwerty[!ANY-WORD] fgh[!SIMPLE-STEM].= =jbbb[!SIMPLE-SUFF]
An expanded overview and summary can be found in the Link Grammar Wikipedia page, which touches on most of the important aspects of the theory. However, it is no substitute for the original papers published on the topic:
Additional papers and references are listed on the primary Link Grammar website.
See also the C/C++ API documentation. Bindings for other programming languages, including Python3, Java and Node.js, can be found in the bindings directory. (There are two sets of JavaScript bindings: one for the library API, and another for the command-line parser.)
| Content | Description |
|---|---|
| LICENSE | The license describing the terms of use |
| ChangeLog | A compendium of recent changes. |
| configure | The GNU configuration script |
| autogen.sh | Developer's configure maintenance tool |
| link-grammar/*.c | The program. (Written in ANSI-C) |
| ----- | ----- |
| bindings/autoit/ | Optional AutoIt language bindings. |
| bindings/java/ | Optional Java language bindings. |
| bindings/js/ | Optional JavaScript language bindings. |
| bindings/lisp/ | Optional Common Lisp language bindings. |
| bindings/node.js/ | Optional node.js language bindings. |
| bindings/ocaml/ | Optional OCaml language bindings. |
| bindings/python/ | Optional Python3 language bindings. |
| bindings/python-examples/ | Link-grammar test suite and Python language binding usage example. |
| bindings/swig/ | SWIG interface file, for other FFI interfaces. |
| bindings/vala/ | Optional Vala language bindings. |
| ----- | ----- |
| data/en/ | English language dictionaries. |
| data/en/4.0.dict | The file containing the dictionary definitions. |
| data/en/4.0.knowledge | The post-processing knowledge file. |
| data/en/4.0.constituent-knowledge | The constituent knowledge file. |
| data/en/4.0.affix | The affix (prefix/suffix) file. |
| data/en/4.0.regex | Regular-expression-based morphology guesser. |
| data/en/tiny.dict | A small example dictionary. |
| data/en/words/ | A directory full of word lists. |
| data/en/corpus*.batch | Example corpora used for testing. |
| ----- | ----- |
| data/ru/ | A full-fledged Russian dictionary |
| data/th/ | A full-fledged Thai dictionary (more than 100,000 words) |
| data/ar/ | A fairly complete Arabic dictionary |
| data/fa/ | A Persian (Farsi) dictionary |
| data/de/ | A small prototype German dictionary |
| data/lt/ | A small prototype Lithuanian dictionary |
| data/id/ | A small Indonesian dictionary |
| data/vn/ | A small prototype Vietnamese dictionary |
| data/he/ | A Hebrew dictionary |
| data/kz/ | An experimental Kazakh dictionary |
| data/tr/ | An experimental Turkish dictionary |
| ----- | ----- |
| morphology/ar/ | An Arabic morphology analyzer |
| morphology/fa/ | A Persian morphology analyzer |
| ----- | ----- |
| debug/ | Information about debugging the library |
| msvc/ | Microsoft Visual-C project files |
| mingw/ | Information on using MinGW under MSYS or Cygwin |
The package is distributed in the usual tar.gz format; it can be extracted with the command tar -zxf link-grammar.tar.gz at the command line.
The tarball for the most recent release can be downloaded from:
https://www.gnucash.org/link-grammar/downloads/
The files have been digitally signed, to make sure that there was no corruption of the dataset during download, and to help ensure that no malicious changes were made to the code internals by third parties. The signatures can be checked with the gpg command:
gpg --verify link-grammar-5.12.5.tar.gz.asc
It should generate output identical to this (except for the date):
gpg: Signature made Thu 26 Apr 2012 12:45:31 PM CDT using RSA key ID E0C0651C
gpg: Good signature from "Linas Vepstas (Hexagon Architecture Patches) <[email protected]>"
gpg: aka "Linas Vepstas (LKML) <[email protected]>"
Alternately, the MD5 checksums can be verified. These do not provide cryptographic security, but they can detect simple corruption. To verify the checksums, issue md5sum -c MD5SUM at the command line.
Tags in git can be verified by doing the following:
gpg --recv-keys --keyserver keyserver.ubuntu.com EB6AA534E0C0651C
git tag -v link-grammar-5.10.5
To compile the link-grammar shared library and demonstration program, type, at the command line:
./configure
make
make check
To install, change user to "root" and say
make install
ldconfig
This installs the liblink-grammar.so library into /usr/local/lib, the header files into /usr/local/include/link-grammar, and the dictionaries into /usr/local/share/link-grammar. Running ldconfig will rebuild the shared library cache. To verify that the installation was successful, run (as a non-root user)
make installcheck
The link-grammar library has optional features that are enabled automatically if configure detects certain libraries. These libraries are optional on most systems, and if you need the features they add, the corresponding libraries must be installed before running configure.
The library package names may vary from system to system (consult Google if needed...). For example, the names may include -devel instead of -dev, or lack it altogether; the library names may lack the lib prefix.
libsqlite3-dev (for the SQLite-backed dictionary)
zlib1g-dev or zlib-devel (currently needed for the bundled minisat2)
libedit-dev (see Editline)
libhunspell-dev or libaspell-dev (plus the corresponding English dictionary)
libtre-dev or libpcre2-dev (much faster than the libc regex implementation, and needed for correctness on FreeBSD and Cygwin)
libpcre2-dev: it must be used on certain systems (as specified in their building sections).
If libedit-dev is installed, then the arrow keys can be used to edit the input to the link-parser tool; the up and down arrow keys recall previous entries. You want this; it makes testing and editing much easier.
Two versions of the node.js bindings are included. One version wraps the library; the other uses emscripten to wrap the command-line tool. The library bindings are in bindings/node.js, while the emscripten wrapper is in bindings/js.
These are built with npm. First, build the core C library; then do the following:
cd bindings/node.js
npm install
npm run make
This will create the library bindings and run a small unit test (which should pass). An example can be found in bindings/node.js/examples/simple.js.
For the command-line wrapper, do the following:
cd bindings/js
./install_emsdk.sh
./build_packages.sh
The Python3 bindings are built by default, providing that the corresponding Python development packages are installed. (The Python2 bindings are no longer supported.)
These packages are:
python3-devel or python3-dev
Note: Before issuing configure (see below), verify that the desired Python version can be invoked using your PATH.
The use of the Python bindings is optional. If you do not plan to use link-grammar with Python, you do not need these. If you wish to disable the Python bindings, use:
./configure --disable-python-bindings
The linkgrammar.py module provides a high-level interface in Python; see example.py, sentence-check.py and tests.py for usage examples and unit tests.
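For orientation, here is a minimal sketch of what a program using this interface might look like. The class and method names (Dictionary, ParseOptions, Sentence, Linkage.diagram()) are assumed to follow the linkgrammar.py module; consult example.py and tests.py for authoritative, up-to-date usage.
#!/usr/bin/env python3
# Minimal sketch of the high-level Python interface; see example.py and
# sentence-check.py in bindings/python-examples for authoritative usage.
from linkgrammar import Dictionary, ParseOptions, Sentence

po = ParseOptions()                  # default parse options
en_dict = Dictionary('en')           # load the English dictionary

sent = Sentence("This is a test!", en_dict, po)
linkages = list(sent.parse())
print("Found %d linkage(s)" % len(linkages))
if linkages:
    # Prints the same ASCII-art diagram that link-parser displays.
    print(linkages[0].diagram())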
The Python modules can be installed to a non-default location with make install pythondir=/where/to/install.
By default, the Makefile attempts to build the Java bindings. The use of the Java bindings is optional. If you do not plan to use link-grammar with Java, you do not need these. You can skip building the Java bindings by disabling them as follows:
./configure --disable-java-bindings
The Java bindings will not be built if jni.h cannot be found, or if ant is not found.
Notes about finding jni.h:
Some common Java JVM distributions (most notably, the Sun one) place this file in unusual locations where it cannot be found automatically. To remedy this, make sure that the environment variable JAVA_HOME is set correctly. The configure script looks for jni.h in $JAVA_HOME/Headers and $JAVA_HOME/include; it also examines the corresponding locations for $JDK_HOME. If jni.h still cannot be found, specify the location with the CPPFLAGS variable; so, for example:
export CPPFLAGS="-I/opt/jdk1.5/include/:/opt/jdk1.5/include/linux"
或者
export CPPFLAGS="-I/c/java/jdk1.6.0/include/ -I/c/java/jdk1.6.0/include/win32/"
Note that the use of /opt is non-standard, and most system tools will fail to find packages installed there.
The /usr/local install target can be overridden using the standard GNU configure --prefix option; so, for example:
./configure --prefix=/opt/link-grammar
By using pkg-config (see below), non-standard install locations can be automatically detected.
Additional configuration options are printed by:
./configure --help
The system has been tested and works well on 32- and 64-bit Linux systems, FreeBSD and MacOS, as well as on Microsoft Windows systems. Specific OS-dependent notes follow.
End users should download the tarball (see Unpacking and signature verification above).
The current GitHub version is intended for developers, including anyone willing to provide a fix, a new feature or an improvement. The tip of the master branch is often unstable, and can sometimes contain bad code under development. It also requires installing development tools that are not installed by default. For these reasons, use of the GitHub version is discouraged for regular end users.
To clone: git clone https://github.com/opencog/link-grammar.git
Or download as a ZIP:
https://github.com/opencog/link-grammar/archive/master.zip
Tools that may need to be installed before you can compile link-grammar:
make (the gmake variant may be needed)
m4
gcc or clang
autoconf
libtool
autoconf-archive
pkg-config (may be named pkgconf or pkgconfig)
pip3 (for the Python bindings)
Optional:
swig (for the language bindings)
flex
Apache Ant (for the Java bindings)
graphviz (if you wish to use the word-graph display feature)
The GitHub version does not include the configure script. To generate it, use:
autogen.sh
If you get errors, make sure the development packages listed above are installed, and that your system installation is up to date. In particular, a missing autoconf or autoconf-archive may cause strange and misleading errors.
For more information on how to proceed, continue at the section Creating the system, and the related sections that follow it.
To configure the debug mode, use:
configure --enable-debug
It adds some verification debug code, along with functions that can pretty-print several data structures.
A feature that may be useful for debugging is the word-graph display. It is enabled by default. For more details on this feature, see Word-graph display.
When gcc is used, the current configuration has an apparent standard C++ library mixing problem (fixes are welcome). However, the common practice on FreeBSD is to compile with clang, which does not have this problem. In addition, add-on packages are installed under /usr/local.
Hence, configure should be invoked this way:
env LDFLAGS=-L/usr/local/lib CPPFLAGS=-I/usr/local/include
CC=clang CXX=clang++ configure
Note that pcre2 is a required package, because the existing libc regex implementation does not have the needed level of regex support.
Some packages have names that differ from the ones mentioned in the previous sections:
minisat (minisat2)
pkgconf (pkg-config)
Plain-vanilla Link Grammar should compile and run on Apple MacOS, as described above. At this time, there are no reported issues.
If you do not need the Java bindings, you should almost surely configure like so:
./configure --disable-java-bindings
If you do want the Java bindings, be sure to set the JDK_HOME environment variable to wherever <Headers/jni.h> is. Set the JAVA_HOME variable to the location of the Java compiler. Make sure ant is installed.
If you want to build from GitHub (see Building from the GitHub repository), the tools it requires can be installed with Homebrew.
Link-grammar can be compiled on Windows in three different ways. One way is to use Cygwin, which provides a Linux compatibility layer for Windows. Another way is to use the MSVC system. A third way is to use the MinGW system, which uses the GNU toolset to compile Windows programs. The source code supports Windows systems from Vista on.
The Cygwin way currently produces the best results, since it supports line editing with command completion and history, and supports the word-graph display on X-Windows. (MinGW currently does not have libedit, and the MSVC port currently does not support command completion and history, or spelling.)
The easiest way to get link-grammar working on MS Windows is to use Cygwin, a Linux-like environment for Windows that makes it possible to port software running on POSIX systems to Windows. Download and install Cygwin.
Note that the pcre2 package needs to be installed, because the libc regex implementation is not sufficient.
For more details, see mingw/README-Cygwin.md.
Another way to build link-grammar is to use MinGW, which uses the GNU toolset to compile POSIX-compliant programs for Windows. Using MinGW/MSYS2 is probably the easiest way to obtain workable Java bindings for Windows. Download and install MinGW/MSYS2 from msys2.org.
Note that the pcre2 package needs to be installed, because the libc regex implementation is not sufficient.
For more details, see mingw/README-MinGW64.md.
Microsoft Visual C/C++ project files can be found in the msvc directory. For instructions, see the README.md file there.
To run the program, issue the command (supposing it is in your path):
link-parser [arguments]
This starts the program. The program has many user-settable variables and options. These can be displayed by entering !var at the link-parser prompt. Entering !help will display some additional commands.
The dictionaries are arranged in directories whose names are the 2-letter language codes. The link-parser program searches for such a language directory, in this order, directly and under a directory named data: first in the current directory, and then at the installed location (typically in /usr/local/share/link-grammar). If link-parser cannot find the desired dictionary, use verbosity level 4 to debug the problem; for example:
link-parser ru -verbosity=4
Other locations can be specified on the command line; for example:
link-parser ../path/to-my/modified/data/en
When accessing dictionaries in non-standard locations, the standard file names are still assumed (i.e. 4.0.dict, 4.0.affix, etc.).
The Russian dictionaries are in data/ru. Thus, the Russian parser can be started as:
link-parser ru
If no argument is given to link-parser, it searches for a language based on your current locale settings. If it cannot find such a language directory, it defaults to "en".
If you see errors similar to this:
Warning: The word "encyclop" found near line 252 of en/4.0.dict
matches the following words:
encyclop
This word will be ignored.
then your UTF-8 locales are either not installed or not configured. The shell command locale -a should list en_US.utf8 as a locale. If it does not, then you need to dpkg-reconfigure locales and/or run update-locale, or possibly apt-get install locales, or some combination or variant of these, depending on your operating system.
There are several ways to test the resulting build. If the Python bindings were built, a test program can be found in ./bindings/python-examples/tests.py. For more details, see the README.md in the bindings/python-examples directory.
There are also several batches of test/example sentences in the language data directories, generally having names of the form corpus-*.batch. The parser program can be run in batch mode to test the system on a large number of sentences. The following command runs the parser on a file called corpus-basic.batch:
link-parser < corpus-basic.batch
The line !batch near the top of corpus-basic.batch turns on batch mode. In this mode, sentences marked with an initial * should be rejected, while sentences not beginning with a * should be accepted. This batch file does report some errors, as do the files corpus-biolg.batch and corpus-fixes.batch. Work is ongoing to fix these.
The corpus-fixes.batch file contains many thousands of sentences that have been fixed since the original 4.1 release of link-grammar. The corpus-biolg.batch file contains biology/medical-text sentences from the BioLG project. The corpus-voa.batch file contains samples from Voice of America; the corpus-failures.batch file contains a large number of failures.
The following numbers are subject to change, but at this time the number of errors one can expect to observe in each of these files is roughly as follows:
en/corpus-basic.batch: 88 errors
en/corpus-fixes.batch: 371 errors
lt/corpus-basic.batch: 15 errors
ru/corpus-basic.batch: 47 errors
The bindings/python directory contains a unit test for the Python bindings. It also performs several basic checks that stress the link-grammar libraries.
There is an API (application program interface) to the parser. This makes it easy to incorporate it into your own applications. The API is documented on the web site.
The FindLinkGrammar.cmake file can be used to test for and set up compilation in CMake-based build environments.
To make compiling and linking easier, the current release uses the pkg-config system. To determine the location of the link-grammar header files, say pkg-config --cflags link-grammar. To obtain the location of the libraries, say pkg-config --libs link-grammar. Thus, for example, a typical makefile might include the targets:
.c.o:
cc -O2 -g -Wall -c $< `pkg-config --cflags link-grammar`
$(EXE): $(OBJS)
cc -g -o $@ $^ `pkg-config --libs link-grammar`
This release provides Java files offering three ways of accessing the parser. The simplest way is to use the org.linkgrammar.LinkGrammar class; this provides a very simple Java API to the parser.
The second possibility is to use the LGService class. This implements a TCP/IP network server, providing parse results as JSON messages. Any JSON-capable client can connect to this server and obtain parsed text.
The third possibility is to use the org.linkgrammar.LGRemoteClient class, and in particular, its parse() method. This class is a network client that connects to the JSON server, and converts the response back to results accessible via the ParseResult API.
The above-described Java code will be built if Apache ant is installed.
The network server can be started by saying:
java -classpath linkgrammar.jar org.linkgrammar.LGService 9000
The above starts the server on port 9000. If the port is omitted, help text is printed. This server can be contacted directly via TCP/IP; for example:
telnet localhost 9000
(Alternately, use netcat instead of telnet.) After connecting, type in:
text: this is an example sentence to parse
The returned bytes will be a JSON message providing the parses of the sentence. By default, the ASCII-art parse of the text is not transmitted. This can be obtained by sending messages of the form:
storeDiagramString:true, text: this is a test.
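As a rough illustration, a client in any language only needs to open a TCP connection, send one such line, and read back the JSON reply. Below is a hypothetical Python client sketch; it assumes the server was started on localhost port 9000 as shown above, and that the complete reply arrives within the timeout.
#!/usr/bin/env python3
# Hypothetical client for the LGService JSON server described above.
# Assumes the server listens on localhost:9000 and that the whole JSON
# reply arrives within the timeout.
import json
import socket

def parse_remote(text, host="localhost", port=9000, timeout=5.0):
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # "storeDiagramString:true" asks for the ASCII-art diagram as well.
        sock.sendall(("storeDiagramString:true, text: " + text + "\n").encode("utf-8"))
        chunks = []
        try:
            while True:
                data = sock.recv(4096)
                if not data:
                    break
                chunks.append(data)
        except socket.timeout:
            pass  # assume the reply is complete
    return json.loads(b"".join(chunks).decode("utf-8"))

if __name__ == "__main__":
    print(json.dumps(parse_remote("this is an example sentence to parse"), indent=2))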
The parser will run a spell-checker at an early stage, if it encounters a word that it does not know, and cannot guess, based on morphology. The configure script looks for the aspell or hunspell spell-checkers; if the aspell development environment is found, then aspell is used, otherwise hunspell is used.
Spell guessing may be disabled at runtime, in the link-parser client, with the !spell=0 flag. Enter !help for more details.
Caution: aspell version 0.60.8, and possibly others, has a memory leak. The use of spell-guessing in production servers is strongly discouraged. Keeping spell-guessing disabled (=0) in Parse_Options is safe.
It is safe to use link-grammar from multiple threads. Threads may share the same dictionary. Parse options can be set on a per-thread basis, with the exception of verbosity, which is a global shared by all threads; it is the only global.
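The sketch below illustrates this contract using the Python bindings (assuming they are installed): one shared Dictionary, with a separate ParseOptions and Sentence per thread.
#!/usr/bin/env python3
# Sketch of multi-threaded use via the Python bindings: the Dictionary is
# shared across threads; ParseOptions and Sentence objects are per-thread.
import threading
from linkgrammar import Dictionary, ParseOptions, Sentence

en_dict = Dictionary('en')            # shared, read-only

def worker(text):
    po = ParseOptions()               # per-thread parse options
    n = len(list(Sentence(text, en_dict, po).parse()))
    print("%r: %d linkage(s)" % (text, n))

threads = [threading.Thread(target=worker, args=(s,))
           for s in ("This is a test!", "The cat sat on the mat.")]
for t in threads:
    t.start()
for t in threads:
    t.join()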
The a/an phonetic determiner before consonants/vowels is handled by a new PH link type, which links the determiner to the word that follows it. Status: introduced in version 5.1.0 (August 2014). Mostly done, although many special-case nouns remain unfinished.
Some languages, such as Lithuanian, Turkish and other free word-order languages, need directional links. The goal is for a link to clearly indicate which word is the head word and which is the dependent. This is achieved by prefixing connectors with a single lower-case letter: h, d, indicating "head" and "dependent". The linkage rules are such that h matches either nothing or d, and d matches either h or nothing. This is a new feature in version 5.1.0 (August 2014). The website provides additional documentation.
Although the English-language link-grammar links are un-directed, it seems that a de-facto direction can be given to them that is completely consistent with standard conceptions of a dependency grammar.
The dependency arrows have the following properties:
Anti-reflexive (a word cannot depend on itself; it cannot point at itself.)
Anti-symmetric (if word1 depends on word2, then word2 cannot depend on word1) (so, for example, determiners depend on nouns, but never vice-versa)
The arrows are neither transitive nor anti-transitive: a single word may be ruled by several heads. For example:
+------>WV------->+
+-->Wd-->+<--Ss<--+
| | |
LEFT-WALL she thinks.v
That is, there is a path to the subject, "she", directly from the left wall, via the Wd link, and also indirectly, from the wall to the root verb and then on to the subject. Similar loops form with the B and R links. Such loops are useful for constraining the number of possible parses: the constraint occurs in conjunction with the "no links cross" meta-rule.
There are several related mathematical notions, but none of them quite capture directional LG:
Directional LG graphs resemble DAGs, except that LG allows only one wall (one "top" element).
Directional LG graphs resemble strict partial orders, except that the LG arrows are usually not transitive.
Directional LG graphs resemble catena, except that a catena is strictly anti-transitive: in a catena, the path to any word is unique.
The foundational LG papers mandate the planarity of the parse graphs. This is based on a very old observation that dependencies almost never cross in natural languages: humans simply do not speak in sentences where links cross. Requiring planarity then provides a strong engineering and algorithmic constraint on the resulting parses: the total number of parses to be considered is sharply reduced, and thus the overall speed of parsing can be greatly increased.
However, there are occasional, relatively rare exceptions to this planarity rule; such exceptions are observed in almost all languages. A number of these exceptions are given for English, below.
Thus, it seems important to relax the planarity constraint, and find something else that is almost as strict, but still allows the observed exceptions. The concept of "landmark transitivity", defined by Richard Hudson in his theory of "Word Grammar", appears to be one such mechanism.
ftp://ftp.phon.ucl.ac.uk/pub/word-grammar/ell2-wg.pdf
http://www.phon.ucl.ac.uk/home/dick/enc/syntax.htm
http://goertzel.org/prowlgrammar.pdf
In practice, the planarity constraint allows very efficient algorithms to be used in the implementation of the parser. Thus, from the implementation point of view, we want to keep planarity. Fortunately, there is also a convenient, unambiguous way to have our cake and eat it too: non-planar graphs can be drawn on a sheet of paper using standard electrical-engineering notation, namely a funny symbol wherever wires cross. This notation is easily adapted to LG connectors. Below is an actual working example, already implemented in the current LG English dictionary. All link-crossings can be implemented this way! Thus, we do not have to abandon the current parsing algorithms to obtain non-planar graphs. We do not even have to modify them! Hurrah!
Here is a working example: "I want to look at and listen to everything." This wants two J links pointing to "everything". The desired graph needs to look like this:
+---->WV---->+
| +--------IV---------->+
| | +<-VJlpi--+
| | | +---xxx------------Js------->+
+--Wd--+-Sp*i+--TO-+-I*t-+-MVp+ +--VJrpi>+--MVp-+---Js->+
| | | | | | | | | |
LEFT-WALL I.p want.v to.r look.v at and.j-v listen.v to.r everything
The above really wants a Js link from "at" to "everything", but this Js link (marked xxx) crosses the link to the conjunction. Other examples suggest that most links should be allowed to cross the links to conjunctions.
The planarity-preserving trick is to split the Js link into two: a Jj part and a Jk part; taken together, the two cross the conjunction. This is currently implemented in the English dictionary, and it works.
This trick is actually completely general, and can be extended to any form of link-crossing. For this, a better notation would be convenient; perhaps uJs- instead of Jj- and vJs- instead of Jk-, or something like that... (TODO: invent a better notation.) (NB: This is a re-invention of "fat links", but in the dictionary rather than in the code.)
Given that non-planar parses can be enabled without any changes to the parser algorithm, all that is required is to understand what sort of theory describes link-crossing in a coherent grounding. That theory is Dick Hudson's Landmark Transitivity, explained here.
This mechanism works as follows:
First, every link must be directional, with a head and a dependent. That is, we are concerned with directional-LG links, which are of the form x--A-->y or y<--A--x for words x,y and LG link type A.
Given either the directional-LG relation x--A-->y or y<--A--x, define the dependency relation x-->y. That is, ignore the link-type label.
Heads are landmarks for dependents. If the dependency relation x-->y holds, then x is said to be a landmark for y, and the predicate land(x,y) is true, while the predicate land(y,x) is false. Here, x and y are words, while --> is the landmark relation.
Although the basic directional-LG links form landmark relations, the total set of landmark relations is extended by transitive closure. That is, if land(x,y) and land(y,z) then land(x,z). That is, the basic directional-LG links are "generators" of landmarks; they generate by means of transitivity. Note that the transitive closure is unique.
In addition to the above landmark relation, there are two additional relations: the before and after landmark relations. (In English, these correspond to left and right; in Hebrew, the opposite). That is, since words come in chronological order in a sentence, the dependency relation can point either left or right. The previously-defined landmark relation only described the dependency order; we now introduce the word-sequence order. Thus, there are are land-before() and land-after() relations that capture both the dependency relation, and the word-order relation.
Notation: the before-landmark relation land-B(x,y) corresponds to x-->y (in English, reversed in right-left languages such as Hebrew), whereas the after-landmark relation land-A(x,y) corresponds to y<--x. That is, land(x,y) == land-B(x,y) or land-A(x,y) holds as a statement about the predicate form of the relations.
As before, the full set of directional landmarks are obtained by transitive closure applied to the directional-LG links. Two different rules are used to perform this closure:
-- land-B(x,y) and land(y,z) ==> land-B(x,z)
-- land-A(x,y) and land(y,z) ==> land-A(x,z)
Parsing is then performed by joining LG connectors in the usual manner, to form a directional link. The transitive closure of the directional landmarks are then computed. Finally, any parse that does not conclude with the "left wall" being the upper-most landmark is discarded.
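To make the bookkeeping concrete, here is a small, hypothetical Python sketch of the closure rules above; it is an illustration of the idea, not the parser's actual code or data structures. Links are reduced to (head, dependent) pairs, split into before/after sets by word order, closed transitively, and a parse would then be kept only if LEFT-WALL ends up as a landmark for every other word.
# Hypothetical illustration of landmark transitivity; not the parser's code.
def landmark_closure(before, after):
    """before/after: sets of (head, dependent) pairs from directional LG links."""
    b, a = set(before), set(after)
    changed = True
    while changed:
        changed = False
        land = b | a
        for (x, y) in list(b):
            for (y2, z) in land:
                if y == y2 and (x, z) not in b:
                    b.add((x, z))
                    changed = True
        for (x, y) in list(a):
            for (y2, z) in land:
                if y == y2 and (x, z) not in a:
                    a.add((x, z))
                    changed = True
    return b, a

# "she thinks": LEFT-WALL --WV--> thinks (head precedes dependent),
# thinks --Ss--> she (the dependent "she" precedes its head "thinks").
before = {("LEFT-WALL", "thinks")}
after = {("thinks", "she")}
b, a = landmark_closure(before, after)
# The closure derives land-B(LEFT-WALL, she), consistent with the direct Wd link.
assert ("LEFT-WALL", "she") in b
# A parse is kept only if LEFT-WALL is the upper-most landmark for every word.
words = {w for pair in (before | after) for w in pair} - {"LEFT-WALL"}
assert all(("LEFT-WALL", w) in (b | a) for w in words)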
Here is an example where landmark transitivity provides a natural solution to a (currently) broken parse. The "to.r" has a disjunct "I+ & MVi-" which allows "What is there to do?" to parse correctly. However, it also allows the incorrect parse "He is going to do". The fix would be to force "do" to take an object; however, a link from "do" to "what" is not allowed, because link-crossing would prevent it.
Fixing this requires only a fix to the dictionary, and not to the parser itself.
Examples where the no-links-cross constraint seems to be violated, in English:
"He is either in the 105th or the 106th battalion."
"He is in either the 105th or the 106th battalion."
Both seem to be acceptable in English, but the ambiguity of the "in-either" temporal ordering requires two different parse trees, if the no-links-cross rule is to be enforced. This seems un-natural. Similarly:
"He is either here or he is there."
"He either is here or he is there."
A different example involves a crossing to the left wall. That is, the links LEFT-WALL--remains crosses over here--found :
"Here the remains can be found."
Other examples, per And Rosta:
The allowed--by link crosses cake--that :
He had been allowed to eat a cake by Sophy that she had made him specially
a--book , very--indeed
"a very much easier book indeed"
an--book , easy--to
"an easy book to read"
a--book , more--than
"a more difficult book than that one"
that--have crosses remains--of
"It was announced that remains have been found of the ark of the covenant"
There is a natural crossing, driven by conjunctions:
"I was in hell yesterday and heaven on Tuesday."
the "natural" linkage is to use MV links to connect "yesterday" and "on Tuesday" to the verb. However, if this is done, then these must cross the links from the conjunction "and" to "heaven" and "hell". This can be worked around partly as follows:
+-------->Ju--------->+
| +<------SJlp<----+
+<-SX<-+->Pp->+ +-->Mpn->+ +->SJru->+->Mp->+->Js->+
| | | | | | | | |
I was in hell yesterday and heaven on Tuesday
but the desired MV links from the verb to the time-prepositions "yesterday" and "on Tuesday" are missing -- whereas they are present, when the individual sentences "I was in hell yesterday" and "I was in heaven on Tuesday" are parsed. Using a conjunction should not wreck the relations that get used; but this requires link-crossing.
"Sophy wondered up to whose favorite number she should count"
Here, "up_to" must modify "number", and not "whose". There's no way to do this without link-crossing.
Link Grammar can be understood in the context of type theory. A simple introduction to type theory can be found in chapter 1 of the HoTT book. This book is freely available online and strongly recommended if you are interested in types.
Link types can be mapped to types that appear in categorial grammars. The nice thing about link-grammar is that the link types form a type system that is much easier to use and comprehend than that of categorial grammar, and yet can be directly converted to that system! That is, link-grammar is completely compatible with categorial grammar, and is easier-to-use. See the paper "Combinatory Categorial Grammar and Link Grammar are Equivalent" for details.
The foundational LG papers make comments to this effect; however, see also work by Bob Coecke on category theory and grammar. Coecke's diagrammatic approach is essentially identical to the diagrams given in the foundational LG papers; it becomes abundantly clear that the category theoretic approach is equivalent to Link Grammar. See, for example, this introductory sketch http://www.cs.ox.ac.uk/people/bob.coecke/NewScientist.pdf and observe how the diagrams are essentially identical to the LG jigsaw-puzzle piece diagrams of the foundational LG publications.
If you have any questions, please feel free to send a note to the mailing list.
The source code of link-parser and the link-grammar library is located at GitHub.
For bug reports, please open an issue there.
Although all messages should go to the mailing list, the current maintainers can be contacted at:
Linas Vepstas - <[email protected]>
Amir Plivatsky - <[email protected]>
Dom Lachowicz - <[email protected]>
A complete list of authors and copyright holders can be found in the AUTHORS file. The original authors of the Link Grammar parser are:
Daniel Sleator [email protected]
Computer Science Department 412-268-7563
Carnegie Mellon University www.cs.cmu.edu/~sleator
Pittsburgh, PA 15213
Davy Temperley [email protected]
Eastman School of Music 716-274-1557
26 Gibbs St. www.link.cs.cmu.edu/temperley
Rochester, NY 14604
John Lafferty [email protected]
Computer Science Department 412-268-6791
Carnegie Mellon University www.cs.cmu.edu/~lafferty
Pittsburgh, PA 15213
Some working notes.
Easy to fix: provide a more uniform API to the constituent tree. ie provide word index. Also, provide a better word API, showing word extent, subscript, etc.
There are subtle technical issues for handling capitalized first words. This needs fixing. In addition, for now these words are shown uncapitalized in the result linkages. This can be fixed.
Maybe capitalization could be handled in the same way that a/an could be handled! After all, it's essentially a nearest-neighbor phenomenon!
See also issue 690
The proximal issue is to add a cost, so that Bill gets a lower cost than bill.n when parsing "Bill went on a walk". The best solution would be to add a 'capitalization-mark token' during tokenization; this token precedes capitalized words. The dictionary then explicitly links to this token, with rules similar to the a/an phonetic distinction. The point here is that this moves capitalization out of ad-hoc C code and into the dictionary, where it can be handled like any other language feature. The tokenizer includes experimental code for that.
The old code for parse ranking via corpus statistics needs to be revived. The issue can be illustrated with these example sentences:
"Please the customer, bring in the money"
"Please, turn off the lights"
In the first sentence, the comma acts as a conjunction of two directives (imperatives). In the second sentence, it is much too easy to mistake "please" for a verb, the comma for a conjunction, and come to the conclusion that one should please some unstated object, and then turn off the lights. (Perhaps one is pleasing by turning off the lights?)
When a sentence fails to parse, look for:
Poor agreement might be handled by giving a cost to mismatched lower-case connector letters.
A common phenomenon in English is that some words that one might expect to "properly" be present can disappear under various conditions. Below is a sampling of these. Some possible solutions are given below.
Expressions such as "Looks good" have an implicit "it" (also called a zero-it or phantom-it) in them; that is, the sentence should really parse as "(it) looks good". The dictionary could be simplified by admitting such phantom words explicitly, rather than modifying the grammar rules to allow such constructions. Other examples, with the phantom word in parenthesis, include:
This can extend to elided/unvoiced syllables:
Elided punctuation:
Normally, the subjects of imperatives must always be offset by a comma: "John, give me the hammer", but here, in muttering an oath, the comma is swallowed (unvoiced).
Some complex phantom constructions:
See also GitHub issue #224.
Actual ellipsis:
Here, the ellipsis stands for a subordinate clause, which attaches with not one, but two links: C+ & CV+ , and thus requires two words, not one. There is no way to have the ellipsis word to sink two connectors starting from the same word, and so some more complex mechanism is needed. The solution is to infer a second phantom ellipsis:
where the first ellipsis is a stand in for the subject of a subordinate clause, and the second stands in for an unknown verb.
Many (unstressed) syllables can be elided; in modern English, this occurs most commonly in the initial unstressed syllable:
Poorly punctuated sentences cause problems: for example:
"Mike was not first, nor was he last."
"Mike was not first nor was he last."
The one without the comma currently fails to parse. How can we deal with this in a simple, fast, elegant way? Similar questions for zero-copula and zero-that sentences.
Consider an argument between a professor and a dean, and the dean wants the professor to write a brilliant review. At the end of the argument, the dean exclaims: "I want the review brilliant!" This is a predicative adjective; clearly it means "I want the review [that you write to be] brilliant." However, taken out of context, such a construction is ungrammatical, as the predicativeness is not at all apparent, and it reads just as incorrectly as would "*Hey Joe, can you hand me that review brilliant?"
"Push button"
"Push button firmly"
The subject is a phantom; the subject is "you".
One possible solution is to perform a one-point compactification. The dictionary contains the phantom words, and their connectors. Ordinary disjuncts can link to these, but should do so using a special initial lower-case letter (say, 'z', in addition to 'h' and 'd' as is currently implemented). The parser, as it works, examines the initial letter of each connector: if it is 'z', then the usual pruning rules no longer apply, and one or more phantom words are selected out of the bucket of phantom words. (This bucket is kept out-of-line, it is not yet placed into sentence word sequence order, which is why the usual pruning rules get modified.) Otherwise, parsing continues as normal. At the end of parsing, if there are any phantom words that are linked, then all of the connectors on the disjunct must be satisfied (of course!) else the linkage is invalid. After parsing, the phantom words can be inserted into the sentence, with the location deduced from link lengths.
A more principled approach to fixing the phantom-word issue is to borrow the idea of re-writing from the theory of operator grammar. That is, certain phrases and constructions can be (should be) re-written into their "proper form", prior to parsing. The re-writing step would insert the missing words, then the parsing proceeds. One appeal of such an approach is that re-writing can also handle other "annoying" phenomena, such as typos (missing apostrophes, eg "lets" vs. "let's", "its" vs. "it's") as well as multi-word rewrites (eg "let's" vs. "let us", or "it's" vs. "it is").
Exactly how to implement this is unclear. However, it seems to open the door to more abstract, semantic analysis. Thus, for example, in Meaning-Text Theory (MTT), one must move between SSynt to DSynt structures. Such changes require a graph re-write from the surface syntax parse (eg provided by link-grammar) to the deep-syntactic structure. By contrast, handling phantom words by graph re-writing prior to parsing inverts the order of processing. This suggests that a more holistic approach is needed to graph rewriting: it must somehow be performed "during" parsing, so that parsing can both guide the insertion of the phantom words, and, simultaneously guide the deep syntactic rewrites.
Another interesting possibility arises with regards to tokenization. The current tokenizer is clever, in that it splits not only on whitespace, but can also strip off prefixes, suffixes, and perform certain limited kinds of morphological splitting. That is, it currently has the ability to re-write single-words into sequences of words. It currently does so in a conservative manner; the letters that compose a word are preserved, with a few exceptions, such as making spelling correction suggestions. The above considerations suggest that the boundary between tokenization and parsing needs to become both more fluid, and more tightly coupled.
Compare "she will be happier than before" to "she will be more happy than before." Current parser makes "happy" the head word, and "more" a modifier w/EA link. I believe the correct solution would be to make "more" the head (link it as a comparative), and make "happy" the dependent. This would harmonize rules for comparatives... and would eliminate/simplify rules for less,more.
However, this idea needs to be double-checked against, eg Hudson's word grammar. I'm confused on this issue ...
Currently, some links can act at "unlimited" length, while others can only be finite-length. eg determiners should be near the noun that they apply to. A better solution might be to employ a 'stretchiness' cost to some connectors: the longer they are, the higher the cost. (This eliminates the "unlimited_connector_set" in the dictionary).
Sometimes, the existence of one parse should suggest that another parse must surely be wrong: if one parse is possible, then the other parses must surely be unlikely. For example: the conjunction and.jg allows the "The Great Southern and Western Railroad" to be parsed as the single name of an entity. However, it also provides a pattern match for "John and Mike" as a single entity, which is almost certainly wrong. But "John and Mike" has an alternative parse, as a conventional-and -- a list of two people, and so the existence of this alternative (and correct) parse suggests that perhaps the entity-and is really very much the wrong parse. That is, the mere possibility of certain parses should strongly disfavor other possible parses. (Exception: Ben & Jerry's ice cream; however, in this case, we could recognize Ben & Jerry as the name of a proper brand; but this is outside of the "normal" dictionary (?) (but maybe should be in the dictionary!))
More examples: "high water" can have the connector A joining high.a and AN joining high.n; these two should either be collapsed into one, or one should be eliminated.
Use WordNet to reduce the number for parses for sentences containing compound verb phrases, such as "give up", "give off", etc.
To avoid a combinatorial explosion of parses, it would be nice to have an incremental parsing, phrase by phrase, using a sliding window algorithm to obtain the parse. Thus, for example, the parse of the last half of a long, run-on sentence should not be sensitive to the parse of the beginning of the sentence.
Doing so would help with combinatorial explosion. So, for example, if the first half of a sentence has 4 plausible parses, and the last half has 4 more, then currently, the parser reports 16 parses total. It would be much more useful if it could instead report the factored results: ie the four plausible parses for the first half, and the four plausible parses for the last half. This would ease the burden on downstream users of link-grammar.
This approach has at least some psychological support. Humans take long sentences and split them into smaller chunks that "hang together" as phrase-structures, viz compounded sentences. The most likely parse is the one where each of the quasi sub-sentences is parsed correctly.
This could be implemented by saving dangling right-going connectors into a parse context, and then, when another sentence fragment arrives, use that context in place of the left-wall.
This somewhat resembles the application of construction grammar ideas to the link-grammar dictionary. It also somewhat resembles Viterbi parsing to some fixed depth. Viz. do a full backward-forward parse for a phrase, and then, once this is done, take a Viterbi-step. That is, once the phrase is done, keep only the dangling connectors to the phrase, place a wall, and then step to the next part of the sentence.
Caution: watch out for garden-path sentences:
The horse raced past the barn fell.
The old man the boat.
The cotton clothing is made of grows in Mississippi.
The current parser parses these perfectly; a viterbi parser could trip on these.
Other benefits of a Viterbi decoder:
One may argue that Viterbi is a more natural, biological way of working with sequences. Some experimental, psychological support for this can be found at http://www.sciencedaily.com/releases/2012/09/120925143555.htm per Morten Christiansen, Cornell professor of psychology.
Consider the sentence "Thieves rob bank" -- a typical newspaper headline. LG currently fails to parse this, because the determiner is missing ("bank" is a count noun, not a mass noun, and thus requires a determiner. By contrast, "thieves rob water" parses just fine.) A fix for this would be to replace mandatory determiner links by (D- or {[[()]] & headline-flag}) which allows the D link to be omitted if the headline-flag bit is set. Here, "headline-flag" could be a new link-type, but one that is not subject to planarity constraints.
Note that this is easier said than done: if one simply adds a high-cost null link, and no headline-flag, then all sorts of ungrammatical sentences parse, with strange parses; while some grammatical sentences, which should parse, but currently don't, become parsable, but with crazy results.
More examples, from And Rosta:
"when boy meets girl"
"when bat strikes ball"
"both mother and baby are well"
A natural approach would be to replace fixed costs by formulas. This would allow the dialect/sociolect to be dynamically changeable. That is, rather than having a binary headline-flag, there would be a formula for the cost, which could be changed outside of the parsing loop. Such formulas could be used to enable/disable parsing specific to different dialects/sociolects, simply by altering the network of link costs.
A simpler alternative would be to have labeled costs (a cost vector), so that different dialects assign different costs to various links. A dialect would be specified during the parse, thus causing the costs for that dialect to be employed during parse ranking.
This has been implemented; what's missing is a practical tutorial on how this might be used.
A good reference for refining verb usage patterns is: "COBUILD GRAMMAR PATTERNS 1: VERBS from THE COBUILD SERIES", from THE BANK OF ENGLISH, HARPER COLLINS. Online at https://arts-ccr-002.bham.ac.uk/ccr/patgram/ and http://www.corpus.bham.ac.uk/publications/index.shtml
Currently, tokenize.c tokenizes double-quotes and some UTF8 quotes (see the RPUNC/LPUNC class in en/4.0.affix - the QUOTES class is not used for that, but for capitalization support), with some very basic support in the English dictionary (see "% Quotation marks." there). However, it does not do this for the various "curly" UTF8 quotes, such as 'these' and “these”. This results in some ugly parsing for sentences containing such quotes. (Note that these are in 4.0.affix).
A mechanism is needed to disentangle the quoting from the quoted text, so that each can be parsed appropriately. It's somewhat unclear how to handle this within link-grammar. This is somewhat related to the problem of morphology (parsing words as if they were "mini-sentences",) idioms (phrases that are treated as if they were single words), set-phrase structures (if ... then ... not only... but also ...) which have a long-range structure similar to quoted text (he said ...).
See also GitHub issue #42.
"to be fishing": Link grammar offers four parses of "I was fishing for evidence", two of which are given low scores, and two are given high scores. Of the two with high scores, one parse is clearly bad. Its links "to be fishing.noun" as opposed to the correct "to be fishing.gerund". That is, I can be happy, healthy and wise, but I certainly cannot be fishing.noun. This is perhaps not just a bug in the structure of the dictionary, but is perhaps deeper: link-grammar has little or no concept of lexical units (ie collocations, idioms, institutional phrases), which thus allows parses with bad word-senses to sneak in.
The goal is to introduce more knowledge of lexical units into LG.
Different word senses can have different grammar rules (and thus, the links employed reveal the sense of the word): for example: "I tend to agree" vs. "I tend to the sheep" -- these employ two different meanings for the verb "tend", and the grammatical constructions allowed for one meaning are not the same as those allowed for the other. Yet, the link rules for "tend.v" have to accommodate both senses, thus making the rules rather complex. Worse, it potentially allows for non-sense constructions. If, instead, we allowed the dictionary to contain different rules for "tend.meaning1" and "tend.meaning2", the rules would simplify (at the cost of inflating the size of the dictionary).
Another example: "I fear so" -- the word "so" is only allowed with some, but not all, lexical senses of "fear". So eg "I fear so" is in the same semantic class as "I think so" or "I hope so", although other meanings of these verbs are otherwise quite different.
[Sin2004] "New evidence, new priorities, new attitudes" in J. Sinclair, (ed) (2004) How to use corpora in language teaching, Amsterdam: John Benjamins
See also: Pattern Grammar: A Corpus-Driven Approach to the Lexical Grammar of English
Susan Hunston and Gill Francis (University of Birmingham)
Amsterdam: John Benjamins (Studies in corpus linguistics, edited by Elena Tognini-Bonelli, volume 4), 2000
Book review.
"The Molecular Level of Lexical Semantics", EA Nida, (1997) International Journal of Lexicography, 10(4): 265–274. Online.
The link-grammar provides several mechanisms to support circumpositions or even more complicated multi-word structures. One mechanism is by ordinary links; see the V, XJ and RJ links. The other mechanism is by means of post-processing rules. (For example, the "filler-it" SF rules use post-processing.) However, rules for many common forms have not yet been written. The general problem is of supporting structures that have "holes" in the middle, that require "lacing" to tie them together.
For a general theory, see catena.
For example, the adposition:
... from [xxx] on.
"He never said another word from then on."
"I promise to be quiet from now on."
"Keep going straight from that point on."
"We went straight from here on."
... from there on.
"We went straight, from the house on to the woods."
"We drove straight, from the hill onwards."
Note that multiple words can fit in the slot [xxx]. Note the tangling of another prepositional phrase: "... from [xxx] on to [yyy]"
More complicated collocations with holes include
"First.. next..."
"If ... then ..."
'Then' is optional ('then' is a 'null word'), for example:
"If it is raining, stay inside!"
"If it is raining, [then] stay inside!"
"if ... only ..." "If there were only more like you!"
"... not only, ... but also ..."
"As ..., so ..." "As it was commanded, so it shall be done"
"Either ... or ..."
"Both ... and ..." "Both June and Tom are coming"
"ought ... if ..." "That ought to be the case, if John is not lying"
"Someone ... who ..."
"Someone is outside who wants to see you"
"... for ... to ..."
"I need for you to come to my party"
The above are not currently supported. An example that is supported is the "non-referential it", eg
"It ... that ..."
"It seemed likely that John would go"
The above is supported by means of special disjuncts for 'it' and 'that', which must occur in the same post-processing domain.
See also:
http://www.phon.ucl.ac.uk/home/dick/enc2010/articles/extraposition.htm
http://www.phon.ucl.ac.uk/home/dick/enc2010/articles/relative-clause.htm
"...from X and from Y" "By X, and by Y, ..." Here, X and Y might be rather long phrases, containing other prepositions. In this case, the usual link-grammar linkage rules will typically conjoin "and from Y" to some preposition in X, instead of the correct link to "from X". Although adding a cost to keep the lengths of X and Y approximately equal can help, it would be even better to recognize the "...from ... and from..." pattern.
The correct solution for the "Either ... or ..." appears to be this:
---------------------------+---SJrs--+
+------???----------+ |
| +Ds**c+--SJls-+ +Ds**+
| | | | | |
either.r the lorry.n or.j-n the van.n
The wrong solution is
--------------------------+
+-----Dn-----+ +---SJrs---+
| +Ds**c+--SJn--+ +Ds**+
| | | | | |
neither.j the lorry.n nor.j-n the van.n
The problem with this is that "neither" must coordinate with "nor". That is, one cannot say "either.. nor..." "neither ... or ... " "neither ...and..." "but ... nor ..." The way I originally solved the coordination problem was to invent a new link called Dn, and a link SJn and to make sure that Dn could only connect to SJn, and nothing else. Thus, the lower-case "n" was used to propagate the coordination across two links. This demonstrates how powerful the link-grammar theory is: with proper subscripts, constraints can be propagated along links over large distances. However, this also makes the dictionary more complex, and the rules harder to write: coordination requires a lot of different links to be hooked together. And so I think that creating a single, new link, called ???, will make the coordination easy and direct. That is why I like that idea.
This ??? link should be the XJ link, which see.
More idiomatic than the above examples: "...the chip on X's shoulder" "to do X a favour" "to give X a look"
The above are all examples of "set phrases" or "phrasemes", and are most commonly discussed in the context of MTT or Meaning-Text Theory of Igor Mel'cuk et al (search for "MTT Lexical Function" for more info). Mel'cuk treats set phrases as lexemes, and, for parsing, this is not directly relevant. However, insofar as phrasemes have a high mutual information content, they can dominate the syntactic structure of a sentence.
The current parse of "he wanted to look at and listen to everything." is inadequate: the link to "everything" needs to connect to "and", so that "listen to" and "look at" are treated as atomic verb phrases.
MTT suggests that perhaps the correct way to understand the contents of the post-processing rules is as an implementation of 'lexical functions' projected onto syntax. That is, the post-processing rules allow only certain syntactical constructions, and these are the kinds of constructions one typically sees in certain kinds of lexical functions.
Alternately, link-grammar suffers from a combinatoric explosion of possible parses of a given sentence. It would seem that lexical functions could be used to rule out many of these parses. On the other hand, the results are likely to be similar to that of statistical parse ranking (which presumably captures such quasi-idiomatic collocations at least weakly).
Reference: I. Mel'cuk: "Collocations and Lexical Functions", in ''Phraseology: theory, analysis, and applications'' Ed. Anthony Paul Cowie (1998) Oxford University Press pp. 23-54.
More generally, all of link-grammar could benefit from a MTT-izing of infrastructure.
Compare the above commentary on lexical functions to Hebrew morphological analysis. To quote Wikipedia:
This distinction between the word as a unit of speech and the root as a unit of meaning is even more important in the case of languages where roots have many different forms when used in actual words, as is the case in Semitic languages. In these, roots are formed by consonants alone, and different words (belonging to different parts of speech) are derived from the same root by inserting vowels. For example, in Hebrew, the root gdl represents the idea of largeness, and from it we have gadol and gdola (masculine and feminine forms of the adjective "big"), gadal "he grew", higdil "he magnified" and magdelet "magnifier", along with many other words such as godel "size" and migdal "tower".
Instead of hard-coding LL, declare which links are morpho links in the dict.
Version 6.0 will change Sentence to Sentence*, Linkage to Linkage* in the API. But perhaps this is a bad idea...