
Link Grammar Parser

Version 5.12.5

The Link Grammar Parser exhibits the linguistic (natural language) structure of English, Thai, Russian, Arabic, Persian and limited subsets of a half-dozen other languages. This structure is a graph of typed links (edges) between the words in a sentence. The more conventional HPSG (constituent) and dependency-style parses can be obtained from Link Grammar by applying a collection of rules that convert to these formats. This is possible because Link Grammar goes a bit "deeper" into the "syntactico-semantic" structure of a sentence: it provides considerably more fine-grained and detailed information than conventional parsers do.

The theory of Link Grammar parsing was originally developed in 1991 by Davy Temperley, John Lafferty and Daniel Sleator, at the time professors of linguistics and computer science at Carnegie Mellon University. The three initial publications on this theory provide the best introduction and overview. Since then, hundreds of publications have further explored, examined and extended the ideas.

Although based on the original Carnegie Mellon code base, the current Link Grammar package has evolved dramatically and is profoundly different from earlier versions. There have been innumerable bug fixes; performance has improved by several orders of magnitude. The package is fully multi-threaded, fully UTF-8 enabled, and has been scrubbed for security, enabling cloud deployment. Parse coverage of English has been dramatically improved. Other languages have been added (most notably Thai and Russian). There is a raft of new features, including support for morphology, dialects, and a fine-grained weight (cost) system that allows vector-embedding-like behaviour. There is a new, sophisticated tokenizer tailored for morphology: it can offer alternative splittings for morphologically ambiguous words. Dictionaries can be updated at run time, enabling systems that perform continuous learning of grammar to parse at the same time. That is, dictionary updates and parsing are mutually thread-safe. Classes of words can be recognized with regexes. Random planar graph parsing is fully supported; this allows uniform sampling of the space of planar graphs. A detailed report of what has changed can be found in the ChangeLog.

This code is released under the LGPL license, making it freely available for both private and commercial use, with few restrictions. The terms of the license are given in the LICENSE file included with this software.

Please see the main web page for more information. This version is a continuation of the original CMU parser.

New!

As of version 5.9.0, the system includes an experimental system for generating sentences. These are specified using a "fill in the blanks" API, where words are substituted into wildcard locations whenever the result is a grammatically valid sentence. Additional details are in the man page: man link-generator (in the man subdirectory).

This generator is used in the OpenCog Language Learning project, which aims to learn Link Grammars automatically from corpora, using new information-theoretic techniques that are somewhat similar to those found in artificial neural nets (deep learning) but use explicitly symbolic representations.

Quick overview

The parser includes APIs in various programming languages, as well as a handy command-line tool for playing with it. Here is some typical output:

 linkparser> This is a test!
	Linkage 1, cost vector = (UNUSED=0 DIS= 0.00 LEN=6)

    +-------------Xp------------+
    +----->WV----->+---Ost--+   |
    +---Wd---+-Ss*b+  +Ds**c+   |
    |        |     |  |     |   |
LEFT-WALL this.p is.v a  test.n !

(S (NP this.p) (VP is.v (NP a test.n)) !)

            LEFT-WALL    0.000  Wd+ hWV+ Xp+
               this.p    0.000  Wd- Ss*b+
                 is.v    0.000  Ss- dWV- O*t+
                    a    0.000  Ds**c+
               test.n    0.000  Ds**c- Os-
                    !    0.000  Xp- RW+
           RIGHT-WALL    0.000  RW-

This rather busy display illustrates many interesting things. For example, the Ss*b link connects the verb and the subject, and indicates that the subject is singular. Likewise, the Ost link connects the verb and the object, and also indicates that the object is singular. The WV (verb-wall) link points at the head verb of the sentence, while the Wd link points at the head noun. The Xp link connects to the trailing punctuation. The Ds**c link connects the noun to the determiner: it again confirms that the noun is singular, and also that the noun starts with a consonant. (The PH link, not required here, is used to force phonetic agreement, distinguishing 'a' from 'an'.) These link types are documented in the English Link Documentation.

The bottom of the display is a listing of the "disjuncts" used for each word. The disjuncts are simply a list of the connectors that were used to form the links. They are particularly interesting because they serve as an extremely fine-grained form of a "part of speech". For example, the disjunct S- O+ indicates a transitive verb: a verb that takes both a subject and an object. The additional markup above indicates that 'is' is not only being used as a transitive verb, but also gives finer details: a transitive verb that took a singular subject and was used as the head verb of a sentence. The floating-point value is the "cost" of the disjunct; it very roughly captures the idea of the log-probability of this particular grammatical usage. Much as parts of speech correlate with word meanings, so fine-grained parts of speech correlate with much finer distinctions and gradations of meaning.
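The disjunct listing is plain text, so it can be post-processed with a few lines of code. The following is a minimal Python sketch, assuming the three-column layout (word, cost, connectors) shown in the sample output above. The helper names parse_disjunct_line and is_transitive_verb are hypothetical and not part of the link-grammar API, and the transitivity check is deliberately coarse.

```python
def parse_disjunct_line(line):
    """Split one row of the disjunct listing into (word, cost, connectors)."""
    fields = line.split()
    return fields[0], float(fields[1]), fields[2:]


def is_transitive_verb(connectors):
    """Coarse check for an S- (subject on the left) and an O+ (object on the right)."""
    # Strip an optional lowercase 'h'/'d' (head/dependent) prefix, e.g. 'dWV-' -> 'WV-'.
    names = [c.lstrip("hd") for c in connectors]
    has_subject = any(n.startswith("S") and n.endswith("-") for n in names)
    has_object = any(n.startswith("O") and n.endswith("+") for n in names)
    return has_subject and has_object


word, cost, connectors = parse_disjunct_line("is.v    0.000  Ss- dWV- O*t+")
print(word, cost, connectors, is_transitive_verb(connectors))
```

The check only looks at the first letter of each connector, so other link types that happen to start with S or O would also match; a real classifier would consult the link documentation.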

The link-grammar parser also supports morphological analysis. Here is an example in Russian:

 linkparser> это теста
	Linkage 1, cost vector = (UNUSED=0 DIS= 0.00 LEN=4)

             +-----MVAip-----+
    +---Wd---+       +-LLCAG-+
    |        |       |       |
LEFT-WALL это.msi тест.= =а.ndnpi

The LL link connects the stem 'тест' to the suffix 'а'. The MVA link connects only to the suffix because, in Russian, it is the suffixes that carry all of the syntactic structure, not the stems. The Russian lexis is documented here.
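To recover ordinary words from such a morphological parse, the stem and suffix tokens can be glued back together. The sketch below assumes the token notation seen in the diagram above (a stem ends in ".=", a suffix starts with "=", and anything after the first "." is a subscript); surface_words is a hypothetical helper, not a library function.

```python
def surface_words(tokens):
    """Rejoin stem/suffix tokens (as printed in the diagram) into surface words."""
    words = []
    for token in tokens:
        if token.startswith("=") and words:
            # Suffix: drop the leading '=' and any '.subscript', then glue it
            # onto the preceding stem.
            words[-1] += token[1:].split(".", 1)[0]
        else:
            # Stem ('тест.=') or ordinary word ('это.msi'): keep the text
            # before the first '.'.
            words.append(token.split(".", 1)[0])
    return words


print(surface_words(["это.msi", "тест.=", "=а.ndnpi"]))
```

Words that themselves contain a period would be truncated by this simple split, so the sketch is only suitable for output shaped like the example.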

The Thai dictionary is now very well developed, effectively covering the entire language. An example in Thai:

 linkparser> นายกรัฐมนตรี ขึ้น กล่าว สุนทรพจน์
	Linkage 1, cost vector = (UNUSED=0 DIS= 2.00 LEN=2)

    +---------LWs--------+
    |           +<---S<--+--VS-+-->O-->+
    |           |        |     |       |
LEFT-WALL นายกรัฐมนตรี.n ขึ้น.v กล่าว.v สุนทรพจน์.n

The VS link connects the two verbs 'ขึ้น' and 'กล่าว' in a serial verb construction. A summary of link types is documented here. Full documentation of the Thai Link Grammar can be found here.

The Thai Link Grammar also accepts POS-tagged and named-entity-tagged input. Each word can be annotated with its Link POS tag. For example:

 linkparser> เมื่อวานนี้.n มี.ve คน.n มา.x ติดต่อ.v คุณ.pr ครับ.pt
Found 1 linkage (1 had no P.P. violations)
	Unique linkage, cost vector = (UNUSED=0 DIS= 0.00 LEN=12)

                          +---------------------PT--------------------+
    +---------LWs---------+---------->VE---------->+                  |
    |           +<---S<---+-->O-->+       +<--AXw<-+--->O--->+        |
    |           |         |       |       |        |         |        |
LEFT-WALL เมื่อวานนี้.n[!] มี.ve[!] คน.n[!] มา.x[!] ติดต่อ.v[!] คุณ.pr[!] ครับ.pt[!]

Full documentation for the Thai dictionary can be found here.

The Thai dictionary accepts LST20 tagsets for POS and named entities, to bridge the gap between fundamental NLP tools and the Link Parser. For example:

 linkparser> วันที่_25_ธันวาคม@DTM ของ@PS ทุก@AJ ปี@NN เป็น@VV วัน@NN คริสต์มาส@NN
Found 348 linkages (348 had no P.P. violations)
	Linkage 1, cost vector = (UNUSED=0 DIS= 1.00 LEN=10)

    +--------------------------------LWs--------------------------------+
    |               +<------------------------S<------------------------+
    |               |                +---------->PO--------->+          |
    |               +----->AJpr----->+            +<---AJj<--+          +---->O---->+------NZ-----+
    |               |                |            |          |          |           |             |
LEFT-WALL วันที่_25_ธันวาคม@DTM[!] ของ@PS[!].pnn ทุก@AJ[!].jl ปี@NN[!].n เป็น@VV[!].v วัน@NN[!].na คริสต์มาส@NN[!].n

Note that each word above is annotated with LST20 POS tags and NE tags. Full documentation for both the Link POS tags and the LST20 tagsets can be found here. More information about LST20, such as annotation guidelines and data statistics, can be found here.
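When the tags come from an upstream tagger, the word@TAG input line has to be assembled before it is handed to the parser. A minimal Python sketch follows; it assumes the word@TAG notation shown above and assumes that spaces inside a multi-word token are written as underscores (as in the date token of the example). The function name tag_sentence is hypothetical.

```python
def tag_sentence(pairs):
    """Format (word, tag) pairs as space-separated 'word@TAG' tokens."""
    tokens = []
    for word, tag in pairs:
        # Assumption: inner spaces become underscores so the token stays whole.
        tokens.append(f"{word.replace(' ', '_')}@{tag}")
    return " ".join(tokens)


print(tag_sentence([("ของ", "PS"), ("ทุก", "AJ"), ("ปี", "NN")]))
```

The resulting string can be pasted at the linkparser> prompt or piped to link-parser th.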

The any language supports uniformly sampled random planar graphs:

 linkparser> asdf qwer tyuiop fghj bbb
Found 1162 linkages (1162 had no P.P. violations)

             +-------ANY------+-------ANY------+
    +---ANY--+--ANY--+        +---ANY--+--ANY--+
    |        |       |        |        |       |
LEFT-WALL asdf[!] qwer[!] tyuiop[!] fghj[!] bbb[!]

The ady language does likewise, performing random morphological splittings:

 linkparser> asdf qwerty fghjbbb
Found 1512 linkages (1512 had no P.P. violations)

                                  +------------------ANY-----------------+
    +-----ANY----+-------ANY------+                  +---------LL--------+
    |            |                |                  |                   |
LEFT-WALL asdf[!ANY-WORD] qwerty[!ANY-WORD] fgh[!SIMPLE-STEM].= =jbbb[!SIMPLE-SUFF]

Theory and documentation

An extended overview and summary can be found on the Link Grammar Wikipedia page, which touches on most of the important, primary aspects of the theory. However, it is no substitute for the original papers published on the topic:

Daniel D. K. Sleator, Davy Temperley, "Parsing English with a Link Grammar", October 1991, CMU-CS-91-196.
Daniel D. Sleator, Davy Temperley, "Parsing English with a Link Grammar", Third International Workshop on Parsing Technologies (1993).
Dennis Grinberg, John Lafferty, Daniel Sleator, "A Robust Parsing Algorithm for Link Grammars", August 1995, CMU-CS-95-125.
John Lafferty, Daniel Sleator, Davy Temperley, "Grammatical Trigrams: A Probabilistic Model of Link Grammar", 1992 AAAI Symposium on Probabilistic Approaches to Natural Language.

Many more papers and references are listed on the primary Link Grammar website.

See also the C/C++ API documentation. Bindings for other programming languages, including Python3, Java and Node.js, can be found in the bindings directory. (There are two sets of JavaScript bindings: one set for the library API, and another set for the command-line parser.)

Contents

Content	Description
LICENSE	The license describing terms of use
ChangeLog	A compendium of recent changes.
configure	The GNU configuration script
autogen.sh	Developer's configure maintenance tool
link-grammar/*.c	The program. (Written in ANSI-C)
------	------
bindings/autoit/	Optional AutoIt language bindings.
bindings/java/	Optional Java language bindings.
bindings/js/	Optional JavaScript language bindings.
bindings/lisp/	Optional Common Lisp language bindings.
bindings/node.js/	Optional Node.js language bindings.
bindings/ocaml/	Optional OCaml language bindings.
bindings/python/	Optional Python3 language bindings.
bindings/python-examples/	Link-grammar test suite and Python language binding usage example.
bindings/swig/	SWIG interface file, for other FFI interfaces.
bindings/vala/	Optional Vala language bindings.
------	------
data/en/	English language dictionaries.
data/en/4.0.dict	The file containing the dictionary definitions.
data/en/4.0.knowledge	The post-processing knowledge file.
data/en/4.0.constituents	The constituent knowledge file.
data/en/4.0.affix	The affix (prefix/suffix) file.
data/en/4.0.regex	Regular-expression-based morphology guesser.
data/en/tiny.dict	A small example dictionary.
data/en/words/	A directory full of word lists.
data/en/corpus*.batch	Example corpora used for testing.
------	------
data/ru/	A full-fledged Russian dictionary
data/th/	A full-fledged Thai dictionary (more than 100,000 words)
data/ar/	A fairly complete Arabic dictionary
data/fa/	A Persian (Farsi) dictionary
data/de/	A small prototype German dictionary
data/lt/	A small prototype Lithuanian dictionary
data/id/	A small prototype Indonesian dictionary
data/vn/	A small prototype Vietnamese dictionary
data/he/	An experimental Hebrew dictionary
data/kz/	An experimental Kazakh dictionary
data/tr/	An experimental Turkish dictionary
------	------
morphology/ar/	An Arabic morphology analyzer
morphology/fa/	A Persian morphology analyzer
------	------
debug/	Information about debugging the library
msvc/	Microsoft Visual-C project files
mingw/	Information on using MinGW under MSYS or Cygwin

Unpacking and signature verification

The system is distributed in the conventional tar.gz format; it can be extracted with the command tar -zxf link-grammar.tar.gz at the command line.

A tarball of the latest version can be downloaded from:
https://www.gnucash.org/link-grammar/downloads/

The files have been digitally signed to make sure that the data was not corrupted during download, and to help ensure that no malicious changes were made to the code internals by third parties. The signatures can be checked with the gpg command:

gpg --verify link-grammar-5.12.5.tar.gz.asc

This should produce output identical to the following (except for the date):

 gpg: Signature made Thu 26 Apr 2012 12:45:31 PM CDT using RSA key ID E0C0651C
gpg: Good signature from "Linas Vepstas (Hexagon Architecture Patches) <[email protected]>"
gpg:                 aka "Linas Vepstas (LKML) <[email protected]>"

Alternatively, the MD5 checksums can be verified. These do not provide cryptographic security, but they can detect simple corruption. To verify the checksums, issue md5sum -c MD5SUM at the command line.
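On systems without the md5sum tool, the same check can be approximated in Python. This is a sketch only: it assumes the MD5SUM file uses the usual md5sum listing format ("<hex digest>  <file name>" per line), and the helper names md5_of_file and check_md5sum_listing are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_md5sum_listing(listing_path):
    """Return {file name: True/False} for every line of an md5sum-style listing."""
    results = {}
    with open(listing_path, encoding="utf-8") as listing:
        for line in listing:
            if not line.strip():
                continue
            expected, name = line.split(maxsplit=1)
            name = name.strip().lstrip("*")  # md5sum marks binary mode with '*'
            results[name] = md5_of_file(name) == expected.lower()
    return results
```

File names are resolved relative to the current directory, as with md5sum -c, so run check_md5sum_listing("MD5SUM") from the directory that holds the downloaded files. Like md5sum, this detects accidental corruption only; use the gpg signature for authenticity.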

Tags in git can be verified as follows:

 gpg --recv-keys --keyserver keyserver.ubuntu.com EB6AA534E0C0651C
git tag -v link-grammar-5.10.5

Building the system

To compile the link-grammar shared library and demonstration program, type at the command line:

 ./configure
make
make check

To install, change user to "root" and say

 make install
ldconfig

This installs the liblink-grammar.so library into /usr/local/lib, the header files into /usr/local/include/link-grammar, and the dictionaries into /usr/local/share/link-grammar. Running ldconfig rebuilds the shared library cache. To verify that the install was successful, run (as a non-root user)

 make installcheck

Optional system libraries

The link-grammar library has optional features that are enabled automatically if configure detects certain libraries. These libraries are optional on most systems; if the feature they add is desired, the corresponding libraries must be installed before running configure.

The library package names may vary between systems (consult Google if needed). For example, the names may include -devel instead of -dev, or omit the suffix altogether. The library names may also lack the lib prefix.

libsqlite3-dev (for SQLite-backed dictionaries)
libz1g-dev or libz-devel (currently needed for the bundled minisat2)
libedit-dev (see Editline)
libhunspell-dev or libaspell-dev (and the corresponding English dictionary).
libtre-dev or libpcre2-dev (much faster than the libc regex implementation, and needed for correctness on FreeBSD and Cygwin).
Using libpcre2-dev is strongly recommended. It must be used on certain systems (as specified in their build sections).

Editline

If libedit-dev is installed, the arrow keys can be used to edit the input to the link-parser tool. The up and down arrow keys recall previous entries. You want this; it makes testing and editing much easier.

Node.js bindings

Two versions of Node.js bindings are included. One version wraps the library; the other uses Emscripten to wrap the command-line tool. The library bindings are in bindings/node.js, while the Emscripten wrapper is in bindings/js.

These are built using npm. First, you must build the core C library. Then do the following:

   cd bindings/node.js
   npm install
   npm run make

This creates the library bindings and also runs a small unit test (which should pass). An example can be found in bindings/node.js/examples/simple.js.

For the command-line wrapper, do the following:

   cd bindings/js
   ./install_emsdk.sh
   ./build_packages.sh

Python3 bindings

The Python3 bindings are built by default, provided that the corresponding Python development packages are installed. (Python2 bindings are no longer supported.)

These packages are:

Linux:
- Systems using 'rpm' packages: python3-devel
- Systems using 'deb' packages: python3-dev
Windows:
- Install Python3 from https://www.python.org/downloads/windows/. You also have to install SWIG from http://www.swig.org/download.html.
macOS:
- Install Python3 using Homebrew.

Note: Before issuing configure (see below), verify that the required Python versions can be invoked using your PATH.

The use of the Python bindings is optional; you do not need them if you do not plan to use link-grammar with Python. To disable the Python bindings, use:

 ./configure --disable-python-bindings

The linkgrammar.py module provides a high-level interface in Python. The example.py and sentence-check.py scripts provide a demo, and tests.py runs unit tests.
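As a rough illustration of the high-level interface, the sketch below parses one sentence and returns the ASCII diagram of its first linkage. It assumes the Dictionary, ParseOptions and Sentence classes and the diagram() method as used in the bundled example.py; check that script for the authoritative usage. The wrapper first_diagram is a hypothetical helper, and the import guard only exists so that the snippet degrades gracefully when the bindings are not installed.

```python
try:
    from linkgrammar import Dictionary, ParseOptions, Sentence
except ImportError:  # bindings not built, or not on the Python path
    Dictionary = ParseOptions = Sentence = None


def first_diagram(text, lang="en"):
    """Return the ASCII diagram of the first linkage, or None if unavailable."""
    if Sentence is None:
        return None
    sentence = Sentence(text, Dictionary(lang), ParseOptions())
    for linkage in sentence.parse():
        return linkage.diagram()
    return None  # the sentence produced no linkages


if __name__ == "__main__":
    print(first_diagram("This is a test!"))
```

Creating a Dictionary is comparatively expensive, so a real application would create it once and reuse it across sentences rather than per call as in this sketch.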

macOS:
- Due to file permission settings, macOS users may need to install the Python bindings at a custom directory location. This can be done with make install pythondir=/where/to/install

Java bindings

By default, the Makefiles attempt to build the Java bindings. The use of the Java bindings is optional; you do not need them if you do not plan to use link-grammar with Java. You can skip building the Java bindings by disabling them as follows:

 ./configure --disable-java-bindings

If jni.h is not found, or if ant is not found, the Java bindings will not be built.

Notes about finding jni.h:
Some common Java JVM distributions (most notably, the ones from Sun) place this file in unusual locations, where it cannot be found automatically. To remedy this, make sure that the environment variable JAVA_HOME is set correctly. The configure script looks for jni.h in $JAVA_HOME/Headers and in $JAVA_HOME/include; it also examines the corresponding locations for $JDK_HOME. If jni.h still cannot be found, specify the location with the CPPFLAGS variable. For example:

 export CPPFLAGS="-I/opt/jdk1.5/include/ -I/opt/jdk1.5/include/linux"

or

 export CPPFLAGS="-I/c/java/jdk1.6.0/include/ -I/c/java/jdk1.6.0/include/win32/"

Please note that the use of /opt is non-standard, and most system tools will fail to find packages installed there.

Install location

The /usr/local install target can be overridden using the standard GNU configure --prefix option. For example:

 ./configure --prefix=/opt/link-grammar

By using pkg-config (see below), non-standard install locations can be detected automatically.

Custom builds

Additional configuration options are printed by

 ./configure --help

The system has been tested and works well on 32-bit and 64-bit Linux systems, FreeBSD, macOS, and Microsoft Windows systems. Specific OS-dependent notes follow.

Building from the GitHub repository

End users should download the tarball (see Unpacking and signature verification).

The current GitHub version is intended for developers (including anyone who is willing to provide a fix, a new feature or an improvement). The tip of the master branch is often unstable and can sometimes contain bad code, since it is under development. It also requires development tools that are not installed by default. For these reasons, the use of the GitHub version is discouraged for regular end users.

Installing from GitHub

Clone it: git clone https://github.com/opencog/link-grammar.git
Or download it as a ZIP archive:
https://github.com/opencog/link-grammar/archive/master.zip

Prerequisites

Tools that may need to be installed before you can build link-grammar:

make (the gmake variant may be needed)
m4
gcc or clang
autoconf
libtool
autoconf-archive
pkg-config (may be named pkgconf or pkgconfig)
pip3 (for the Python bindings)

Optional:
swig (for language bindings)
flex
Apache Ant (for Java bindings)
graphviz (if you want to use the word-graph display feature)

The GitHub version does not include a configure script. To generate it, use:

 autogen.sh

If you get errors, make sure that you have installed the development packages listed above and that your system installation is up to date. In particular, a missing autoconf or autoconf-archive may cause strange and misleading errors.

For more information on how to proceed, continue at the section Building the system and the relevant sections after it.

Additional notes for developers

To configure debug mode, use:

 configure --enable-debug

It adds some verification debug code, and functions that can pretty-print several data structures.

A feature that may be useful for debugging is the word-graph display. It is enabled by default. For more details on this feature, see Word-graph display.

Building on FreeBSD

The current configuration has an apparent standard C++ library mixing problem when gcc is used (a fix is welcome). However, the common practice on FreeBSD is to compile with clang, which does not have this problem. In addition, the add-on packages are installed under /usr/local.

So configure should be invoked as follows:

 env LDFLAGS=-L/usr/local/lib CPPFLAGS=-I/usr/local/include \
CC=clang CXX=clang++ configure

Note that pcre2 is a required package, because the existing libc regex implementation does not have the needed level of regex support.

Some packages have different names than the ones mentioned in the previous sections:

minisat (minisat2)
pkgconf (pkg-config)

Building on macOS

Plain-vanilla Link Grammar should compile and run on Apple macOS just fine, as described above. At this time, there are no reported issues.

If you do not need the Java bindings, you should almost certainly configure with:

 ./configure --disable-java-bindings

If you do want Java bindings, be sure to set the JDK_HOME environment variable to wherever <Headers/jni.h> is. Set the JAVA_HOME variable to the location of the Java compiler. Make sure you have ant installed.

If you want to build from GitHub (see Building from the GitHub repository), you can install the tools listed there using Homebrew.

Building on Windows

There are three different ways in which link-grammar can be compiled on Windows. One way is to use Cygwin, which provides a Linux compatibility layer for Windows. Another way is to use the MSVC system. A third way is to use the MinGW system, which uses the GNU toolset to compile Windows programs. The source code supports Windows systems from Vista on.

The Cygwin way currently produces the best result, as it supports line editing with command completion and history, and also supports word-graph display on X-Windows. (MinGW currently does not have libedit, and the MSVC port currently does not support command completion and history, or spelling.)

Building on Windows (Cygwin)

The easiest way to get link-grammar working on MS Windows is to use Cygwin, a Linux-like environment for Windows that makes it possible to port software running on POSIX systems to Windows. Download and install Cygwin.

Note that installing the pcre2 package is required, because the libc regex implementation is not capable enough.

For more details, see mingw/README-Cygwin.md.

Building on Windows (MinGW)

Another way to build link-grammar is to use MinGW, which uses the GNU toolset to compile POSIX-compliant programs for Windows. Using MinGW/MSYS2 is probably the easiest way to obtain workable Java bindings for Windows. Download and install MinGW/MSYS2 from msys2.org.

Note that installing the pcre2 package is required, because the libc regex implementation is not capable enough.

For more details, see mingw/README-MinGW64.md.

Building and running on Windows (MSVC)

Microsoft Visual C/C++ project files can be found in the msvc directory. For directions, see the README.md file there.

Running the program

To run the program, issue the command (assuming it is in your PATH):

 link-parser [arguments]

This starts the program. The program has many user-settable variables and options. These can be displayed by entering !var at the link-parser prompt. Entering !help displays some additional commands.

The dictionaries are arranged in directories whose name is the 2-letter language code. The link-parser program searches for such a language directory in the following order, either directly or under a directory named data:

Under your current directory.
Unless compiled with MSVC or run under the Windows console: at the installed location (typically /usr/local/share/link-grammar).
If compiled on Windows: in the directory of the link-parser executable (which may be in a different location than the link-parser command, which may be a script).
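The lookup order above can be pictured with a small Python sketch. This is an illustration of the description, not the library's actual code: the helper names are hypothetical, and trying the plain directory before the data subdirectory within each base is an assumption.

```python
from pathlib import Path


def candidate_dictionary_dirs(lang, bases):
    """List the directories to try for `lang`, following the order of `bases`."""
    candidates = []
    for base in bases:
        # Each base is tried directly and with an intermediate 'data' directory.
        candidates.append(Path(base) / lang)
        candidates.append(Path(base) / "data" / lang)
    return candidates


def find_dictionary_dir(lang, bases):
    """Return the first candidate that contains 4.0.dict, or None."""
    for candidate in candidate_dictionary_dirs(lang, bases):
        if (candidate / "4.0.dict").is_file():
            return candidate
    return None


print(find_dictionary_dir("en", [Path.cwd(), "/usr/local/share/link-grammar"]))
```

Printing the candidate list is a quick way to see why a dictionary is or is not picked up before resorting to the parser's own verbose output.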

If link-parser cannot find the desired dictionary, use verbosity level 4 to debug the problem. For example:

 link-parser ru -verbosity=4

Other locations can be specified on the command line. For example:

 link-parser ../path/to-my/modified/data/en

When accessing dictionaries in non-standard locations, the standard file names are still assumed (i.e. 4.0.dict, 4.0.affix, etc.).

The Russian dictionaries are in data/ru. Thus, the Russian parser can be started as:

 link-parser ru

If you do not supply an argument to link-parser, it searches for a language according to your current locale setup. If it cannot find such a language directory, it defaults to "en".

If you see errors similar to this:

 Warning: The word "encyclop" found near line 252 of en/4.0.dict
matches the following words:
encyclop
This word will be ignored.

then your UTF-8 locales are either not installed or not configured. The shell command locale -a should list en_US.utf8 as a locale. If it does not, you need to run dpkg-reconfigure locales and/or update-locale, or possibly apt-get install locales, or combinations or variants of these, depending on your operating system.
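The same check can be scripted. The following Python sketch scans the output of locale -a for an English UTF-8 locale; the helper names are hypothetical, and accepting both the "utf8" and "UTF-8" spellings is an assumption about how different systems print the codeset.

```python
import subprocess


def has_utf8_locale(locale_names, wanted="en_US"):
    """True if `wanted` is listed with a UTF-8 codeset ('utf8' or 'UTF-8')."""
    for name in locale_names:
        base, _, codeset = name.strip().partition(".")
        if base == wanted and codeset.lower().replace("-", "") == "utf8":
            return True
    return False


def installed_locales():
    """Return the names printed by `locale -a` (POSIX systems only)."""
    result = subprocess.run(["locale", "-a"], capture_output=True, text=True, check=True)
    return result.stdout.splitlines()


print(has_utf8_locale(["C", "POSIX", "en_US.utf8"]))
```

On a POSIX system, has_utf8_locale(installed_locales()) reports whether the locale the parser needs is present.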

Testing the system

There are several ways to test the resulting build. If the Python bindings are built, a test program can be found in the file ./bindings/python-examples/tests.py. For more details, see README.md in the bindings/python-examples directory.

There are also multiple batches of test/example sentences in the language data directories, generally named corpus-*.batch. The parser program can be run in batch mode to test the system on a large number of sentences. The following command runs the parser on a file called corpus-basic.batch:

 link-parser < corpus-basic.batch

The line !batch near the top of corpus-basic.batch turns on batch mode. In this mode, sentences labeled with an initial * should be rejected, and those not starting with a * should be accepted. This batch file does report some errors, as do the files corpus-biolg.batch and corpus-fixes.batch. Work is ongoing to fix these.
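The batch-file convention is simple enough to process outside the parser, for example to count how many sentences of each kind a corpus contains. A minimal Python sketch, with a hypothetical helper name; skipping lines that start with "!" follows the !batch command described above, while treating lines that start with "%" as comments is an assumption about the corpus file format.

```python
def split_batch(lines):
    """Separate batch-file sentences into (expected_accept, expected_reject)."""
    accept, reject = [], []
    for raw in lines:
        line = raw.strip()
        # Skip blanks and '!' commands such as '!batch'; '%' comment lines
        # are an assumption (see the lead-in).
        if not line or line.startswith(("!", "%")):
            continue
        if line.startswith("*"):
            reject.append(line[1:].strip())  # marked as ungrammatical
        else:
            accept.append(line)
    return accept, reject


sample = ["!batch", "", "This is a test.", "*Is this test a", "% a note"]
print(split_batch(sample))
```

Comparing these two lists against the parser's verdicts is what the error counts reported by batch mode summarize.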

The corpus-fixes.batch file contains many thousands of sentences that have been fixed since the original 4.1 release of link-grammar. The corpus-biolg.batch file contains biology/medical-text sentences from the BioLG project. The corpus-voa.batch file contains samples from Voice of America; the corpus-failures.batch file contains a large number of failures.

The following numbers are subject to change, but at this time the number of errors one can expect to observe in each of these files is roughly as follows:

 en/corpus-basic.batch:      88 errors
en/corpus-fixes.batch:     371 errors
lt/corpus-basic.batch:      15 errors
ru/corpus-basic.batch:      47 errors

The bindings/python directory contains a unit test for the Python bindings. It also performs several basic checks that stress the link-grammar libraries.

Using the system

There is an API (application program interface) to the parser. This makes it easy to incorporate it into your own applications. The API is documented on the web site.

Using CMake

The FindLinkGrammar.cmake file can be used to test for and set up compilation in CMake-based build environments.

Using pkg-config

To make compiling and linking easier, the current release uses the pkg-config system. To determine the location of the link-grammar header files, say pkg-config --cflags link-grammar. To obtain the location of the libraries, say pkg-config --libs link-grammar. Thus, for example, a typical makefile might include the targets:

 .c.o:
   cc -O2 -g -Wall -c $< `pkg-config --cflags link-grammar`

$(EXE): $(OBJS)
   cc -g -o $@ $^ `pkg-config --libs link-grammar`

Using Java

This release provides Java files that offer three ways of accessing the parser. The simplest way is to use the org.linkgrammar.LinkGrammar class; this provides a very simple Java API to the parser.

The second possibility is to use the LGService class. This implements a TCP/IP network server that provides parse results as JSON messages. Any JSON-capable client can connect to this server and obtain parsed text.

The third possibility is to use the org.linkgrammar.LGRemoteClient class, and in particular, the parse() method. This class is a network client that connects to the JSON server and converts the response back into results accessible via the ParseResult API.

The code described above will be built if Apache ant is installed.

Using the JSON Network Server

The network server can be started by saying:

 java -classpath linkgrammar.jar org.linkgrammar.LGService 9000

The above starts the server on port 9000. If the port is omitted, help text is printed. This server can be contacted directly via TCP/IP; for example:

 telnet localhost 9000

(Alternately, use netcat instead of telnet.) After connecting, type in:

 text:  this is an example sentence to parse

The returned bytes will be a JSON message providing the parses of the sentence. By default, the ASCII-art parse of the text is not transmitted. It can be obtained by sending messages of the form:

 storeDiagramString:true, text: this is a test.
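The same exchange can be scripted instead of typed into telnet. The sketch below is a minimal Python client; the function names are hypothetical, and two details are assumptions rather than documented behavior: that each request is a single newline-terminated line, and that the server closes the connection once the reply has been sent.

```python
import socket


def build_request(text, store_diagram=False):
    """Build one request line in the 'key:value, text: ...' form shown above."""
    prefix = "storeDiagramString:true, " if store_diagram else ""
    # Assumption: the server reads one request per newline-terminated line.
    return prefix + "text: " + text + "\n"


def query(host, port, text, store_diagram=False, timeout=10.0):
    """Send one request and return the raw reply bytes (expected to be JSON)."""
    request = build_request(text, store_diagram).encode("utf-8")
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(request)
        while True:
            # Assumption: the server closes the connection after replying.
            data = conn.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


# Usage, with an LGService instance listening on port 9000:
#   reply = query("localhost", 9000, "this is a test.", store_diagram=True)
print(build_request("this is a test.", store_diagram=True), end="")
```

If the server keeps the connection open, the read loop would need a different stopping rule, for example reading until a complete JSON object has arrived.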

Spell Guessing

The parser will run a spell-checker at an early stage if it encounters a word that it does not know and cannot guess based on morphology. The configure script looks for the aspell or hunspell spell-checkers; if the aspell devel environment is found, then aspell is used, otherwise hunspell is used.

Spell guessing may be disabled at runtime, in the link-parser client, with the !spell=0 flag. Enter !help for more details.

Caution: aspell version 0.60.8, and possibly others, have a memory leak. The use of spell-guessing on production servers is strongly discouraged. Keeping spell-guessing disabled (=0) in Parse_Options is safe.

Multi-threading

It is safe to use link-grammar in multiple threads. Threads may share the same dictionary. Parse options can be set on a per-thread basis, with the exception of verbosity, which is a global shared by all threads. It is the only global.

Linguistic Commentary

Phonetics

A/An phonetic determiners before consonants/vowels are handled by a new PH link type, which links the determiner to the word immediately following it. Status: introduced in version 5.1.0 (August 2014). Mostly done, although many special-case nouns are unfinished.

Directional Links

Directional links are needed for some languages. The goal is to have a link clearly indicate which word is the head word and which is the dependent. This is achieved by prefixing connectors with a single lower-case letter: h or d, indicating "head" and "dependent". The linkage rules are such that h matches either nothing or d, and d matches h or nothing. This is a new feature in version 5.1.0 (August 2014). The website provides additional documentation.
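The matching rule for the direction marks is small enough to write down directly. The following Python sketch is an illustration only; the function name is hypothetical, and it models just the head/dependent marks, not the rest of connector matching.

```python
def direction_compatible(left, right):
    """Return True if two direction marks may sit on the same link.

    Each mark is 'h' (head), 'd' (dependent) or '' (no mark).
    Rule from the text: h matches nothing or d; d matches h or nothing.
    """
    for mark in (left, right):
        if mark not in ("", "h", "d"):
            raise ValueError("unknown direction mark: %r" % mark)
    if left == "" or right == "":
        # An unmarked connector accepts any mark on the other side.
        return True
    # Both marked: one side must be the head and the other the dependent.
    return left != right


for pair in [("h", "d"), ("h", ""), ("h", "h"), ("d", "d")]:
    print(pair, direction_compatible(*pair))
```

The only forbidden combinations are h with h and d with d, since such a link would have two heads or two dependents.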

Although the English-language link-grammar links are un-oriented, it seems that a de facto direction can be given to them that is completely consistent with standard conceptions of a dependency grammar.

The dependency arrows have the following properties:

Anti-reflexive (a word cannot depend on itself; it cannot point at itself.)
Anti-symmetric (if Word1 depends on Word2, then Word2 cannot depend on Word1.)
The arrows are neither transitive nor anti-transitive: a single word may be governed by several heads. For example:

    +------>WV------->+
    +-->Wd-->+<--Ss<--+
    |        |        |
LEFT-WALL   she    thinks.v

That is, there is a path to the subject, "she", directly from the left wall via the Wd link, as well as indirectly, from the wall to the root verb, and from there to the subject. Similar loops form with the B and R links. Such loops are useful for constraining the possible number of parses: the constraint occurs in conjunction with the "no links cross" meta-rule.
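These properties can be checked mechanically on the small diagram above. The Python sketch below encodes the three arrows as (head, dependent, link type) triples read off the diagram; the helper names are hypothetical.

```python
# Directed links read off the diagram: (head, dependent, link type).
links = [
    ("LEFT-WALL", "thinks.v", "WV"),
    ("LEFT-WALL", "she", "Wd"),
    ("thinks.v", "she", "Ss"),
]


def is_anti_reflexive(links):
    """No word depends on itself."""
    return all(head != dep for head, dep, _ in links)


def is_anti_symmetric(links):
    """If x heads y, then y never heads x."""
    pairs = {(head, dep) for head, dep, _ in links}
    return all((dep, head) not in pairs for head, dep in pairs)


def heads_of(word, links):
    """All words that directly govern the given word."""
    return {head for head, dep, _ in links if dep == word}


print(is_anti_reflexive(links), is_anti_symmetric(links))
print(sorted(heads_of("she", links)))
```

The word "she" has two heads, the wall and the verb, which is exactly the loop described above.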

The graphs are planar; that is, no two edges may cross. See, however, the "link-crossing" discussion below.

There are several related mathematical notions, but none quite captures directional LG:

Directional LG graphs resemble DAGs, except that LG allows only one wall (one "top" element).
Directional LG graphs resemble strict partial orders, except that the LG arrows are usually not transitive.
Directional LG graphs resemble catena, except that catena are strictly anti-transitive: the path to any word is unique in a catena.

Link Crossing

The foundational LG papers mandate the planarity of the parse graphs. This is based on a very old observation that dependencies almost never cross in natural languages: humans simply do not speak in sentences where links cross. Imposing planarity constraints then provides a strong engineering and algorithmic constraint on the resulting parses: the total number of parses to be considered is sharply reduced, and thus the overall speed of parsing can be greatly increased.

However, there are occasional, relatively rare exceptions to this planarity rule; such exceptions are observed in almost all languages. A number of these exceptions are given for English, below.

Thus, it seems important to relax the planarity constraint and find something else that is almost as strict, but still allows infrequent exceptions. It would appear that the concept of "landmark transitivity", as defined by Richard Hudson in his theory of "Word Grammar" and then advocated by Ben Goertzel, just might be such a mechanism.

ftp://ftp.phon.ucl.ac.uk/pub/word-grammar/ell2-wg.pdf
http://www.phon.ucl.ac.uk/home/dick/enc/syntax.htm
http://goertzel.org/prowlgrammar.pdf

Planarity: Theory vs. Practice

In practice, the planarity constraint allows very efficient algorithms to be used in the implementation of the parser. Thus, from the point of view of the implementation, we want to keep planarity. Fortunately, there is a convenient and unambiguous way to have our cake and eat it, too. A non-planar diagram can be drawn on a sheet of paper using standard electrical-engineering notation: a funny symbol wherever wires cross. This notation is very easily adapted to LG connectors; below is an actual working example, already implemented in the current LG English dictionary. All link crossings can be implemented in this way! So we do not have to abandon the current parsing algorithms to get non-planar diagrams. We don't even have to modify them! Hurrah!

Here is a working example: "I want to look at and listen to everything." This wants two J links pointing to "everything". The desired diagram would need to look like this:

    +---->WV---->+
    |            +--------IV---------->+
    |            |           +<-VJlpi--+
    |            |           |    +---xxx------------Js------->+
    +--Wd--+-Sp*i+--TO-+-I*t-+-MVp+    +--VJrpi>+--MVp-+---Js->+
    |      |     |     |     |    |    |        |      |       |
LEFT-WALL I.p want.v to.r look.v at and.j-v listen.v to.r everything

The above really wants to have a Js link from 'at' to 'everything', but this Js link crosses (clashes with, marked by xxx) the link to the conjunction. Other examples suggest that one should allow most links to cross over the down-links to conjunctions.
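The clash can be confirmed with a small planarity check. In the Python sketch below, the word indices and link endpoints are read off the diagram above by column position (an interpretation of the ASCII art, so treat the exact endpoints as an assumption); the helper names are hypothetical.

```python
from itertools import combinations

words = ["LEFT-WALL", "I.p", "want.v", "to.r", "look.v", "at",
         "and.j-v", "listen.v", "to.r", "everything"]

# Links as (left word index, right word index, label), read off the diagram.
planar_links = [
    (0, 1, "Wd"), (1, 2, "Sp*i"), (2, 3, "TO"), (3, 4, "I*t"), (4, 5, "MVp"),
    (6, 7, "VJrpi"), (7, 8, "MVp"), (8, 9, "Js"),
    (0, 2, "WV"), (2, 6, "IV"), (4, 6, "VJlpi"),
]
desired = (5, 9, "Js")  # the wanted link from 'at' to 'everything'


def crosses(a, b):
    """Two links cross when their spans interleave without sharing an endpoint."""
    (a1, a2), (b1, b2) = sorted(a[:2]), sorted(b[:2])
    return a1 < b1 < a2 < b2 or b1 < a1 < b2 < a2


def crossing_pairs(links):
    """All pairs of links that violate planarity."""
    return [(a, b) for a, b in combinations(links, 2) if crosses(a, b)]


print(crossing_pairs(planar_links))
print([link[2] for link in planar_links if crosses(link, desired)])
```

With these endpoints, the links without the extra Js form a planar graph, while the desired Js link interleaves with both links that reach the conjunction from its left (IV and VJlpi).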

The planarity-preserving work-around is to split the Js link into two: a Jj part and a Jk part; the two are used together to cross over the conjunction. This is currently implemented in the English dictionary, and it works.

This work-around is in fact completely generic, and can be extended to any kind of link crossing. For this to work, a better notation would be convenient; perhaps uJs- instead of Jj- and vJs- instead of Jk- , or something like that ... (TODO: invent better notation.) (NB: This is a kind of re-invention of "fat links", but in the dictionary, not in the code.)

Landmark Transitivity: Theory

Given that non-planar parses can be enabled without any changes to the parser algorithm, all that is required is to understand what sort of theory describes link-crossing in a coherent grounding. That theory is Dick Hudson's Landmark Transitivity, explained here.

This mechanism works as follows:

First, every link must be directional, with a head and a dependent. That is, we are concerned with directional-LG links, which are of the form x--A-->y or y<--A--x for words x,y and LG link type A.
Given either the directional-LG relation x--A-->y or y<--A--x, define the dependency relation x-->y. That is, ignore the link-type label.
Heads are landmarks for dependents. If the dependency relation x-->y holds, then x is said to be a landmark for y, and the predicate land(x,y) is true, while the predicate land(y,x) is false. Here, x and y are words, while --> is the landmark relation.
Although the basic directional-LG links form landmark relations, the total set of landmark relations is extended by transitive closure. That is, if land(x,y) and land(y,z) then land(x,z). That is, the basic directional-LG links are "generators" of landmarks; they generate by means of transitivity. Note that the transitive closure is unique.
In addition to the above landmark relation, there are two additional relations: the before and after landmark relations. (In English, these correspond to left and right; in Hebrew, the opposite). That is, since words come in chronological order in a sentence, the dependency relation can point either left or right. The previously-defined landmark relation only described the dependency order; we now introduce the word-sequence order. Thus, there are land-before() and land-after() relations that capture both the dependency relation and the word-order relation.
Notation: the before-landmark relation land-B(x,y) corresponds to x-->y (in English, reversed in right-left languages such as Hebrew), whereas the after-landmark relation land-A(x,y) corresponds to y<--x. That is, land(x,y) == land-B(x,y) or land-A(x,y) holds as a statement about the predicate form of the relations.
As before, the full set of directional landmarks are obtained by transitive closure applied to the directional-LG links. Two different rules are used to perform this closure:

 -- land-B(x,y) and land(y,z) ==> land-B(x,z)
 -- land-A(x,y) and land(y,z) ==> land-A(x,z)

Parsing is then performed by joining LG connectors in the usual manner, to form a directional link. The transitive closure of the directional landmarks are then computed. Finally, any parse that does not conclude with the "left wall" being the upper-most landmark is discarded.
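The closure and the final wall test can be sketched as follows. This is a minimal Python illustration, not the parser's code: the function names are hypothetical, the sketch assumes that land-B holds when the head precedes the dependent in word order and land-A when it follows, and it assumes that each closure rule yields a relation between x and z.

```python
def landmark_closure(links, position):
    """Transitive closure of the landmark relations.

    links: iterable of (head, dependent) pairs, one per directional link.
    position: dict mapping each word to its index in the sentence.
    Returns (land, land_b, land_a) as sets of (landmark, word) pairs.
    """
    land = set(links)
    # Assumption: land-B holds when the head precedes the dependent,
    # land-A when the head follows it.
    land_b = {(x, y) for x, y in land if position[x] < position[y]}
    land_a = {(x, y) for x, y in land if position[x] > position[y]}
    changed = True
    while changed:
        changed = False
        for x, y in list(land):
            for y2, z in list(land):
                if y != y2:
                    continue
                for rel in (land_b, land_a):
                    # land-B(x,y) and land(y,z) ==> land-B(x,z); same for land-A.
                    if (x, y) in rel and (x, z) not in rel:
                        rel.add((x, z))
                        land.add((x, z))
                        changed = True
    return land, land_b, land_a


def wall_is_uppermost(land, position, wall="LEFT-WALL"):
    """True if the wall is a landmark for every other word and has no landmark itself."""
    others = [w for w in position if w != wall]
    return (all((wall, w) in land for w in others)
            and not any((w, wall) in land for w in others))


pos = {"LEFT-WALL": 0, "she": 1, "thinks.v": 2}
land, land_b, land_a = landmark_closure(
    [("LEFT-WALL", "thinks.v"), ("thinks.v", "she")], pos)
print(sorted(land_b), sorted(land_a), wall_is_uppermost(land, pos))
```

Here the wall becomes a before-landmark of "she" purely by transitivity through the verb, and the parse is kept because the wall ends up as the upper-most landmark.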

Here is an example where landmark transitivity provides a natural solution to a (currently) broken parse. The "to.r" has a disjunct "I+ & MVi-" which allows "What is there to do?" to parse correctly. However, it also allows the incorrect parse "He is going to do". The fix would be to force "do" to take an object; however, a link from "do" to "what" is not allowed, because link-crossing would prevent it.

Fixing this requires only a fix to the dictionary, and not to the parser itself.

Link-crossing Examples

Examples where the no-links-cross constraint seems to be violated, in English:

  "He is either in the 105th or the 106th battalion."
  "He is in either the 105th or the 106th battalion."

Both seem to be acceptable in English, but the ambiguity of the "in-either" temporal ordering requires two different parse trees, if the no-links-cross rule is to be enforced. This seems un-natural. Similarly:

  "He is either here or he is there."
  "He either is here or he is there."

A different example involves a crossing to the left wall. That is, the link LEFT-WALL--remains crosses over the link here--found:

  "Here the remains can be found."

Other examples, per And Rosta:

The allowed--by link crosses cake--that :

 He had been allowed to eat a cake by Sophy that she had made him specially

a--book , very--indeed

 "a very much easier book indeed"

an--book , easy--to

 "an easy book to read"

a--book , more--than

 "a more difficult book than that one"

that--have crosses remains--of

 "It was announced that remains have been found of the ark of the covenant"

There is a natural crossing, driven by conjunctions:

 "I was in hell yesterday and heaven on Tuesday."

The "natural" linkage is to use MV links to connect "yesterday" and "on Tuesday" to the verb. However, if this is done, then these must cross the links from the conjunction "and" to "heaven" and "hell". This can be worked around partly as follows:

              +-------->Ju--------->+
              |    +<------SJlp<----+
+<-SX<-+->Pp->+    +-->Mpn->+       +->SJru->+->Mp->+->Js->+
|      |      |    |        |       |        |      |      |
I     was    in  hell   yesterday  and    heaven    on  Tuesday

but the desired MV links from the verb to the time-prepositions "yesterday" and "on Tuesday" are missing -- whereas they are present, when the individual sentences "I was in hell yesterday" and "I was in heaven on Tuesday" are parsed. Using a conjunction should not wreck the relations that get used; but this requires link-crossing.

 "Sophy wondered up to whose favorite number she should count"

Here, "up_to" must modify "number", and not "whose". There's no way to do this without link-crossing.

Type Theory

Link Grammar can be understood in the context of type theory. A simple introduction to type theory can be found in chapter 1 of the HoTT book. This book is freely available online and strongly recommended if you are interested in types.

Link types can be mapped to types that appear in categorial grammars. The nice thing about link-grammar is that the link types form a type system that is much easier to use and comprehend than that of categorial grammar, and yet can be directly converted to that system! That is, link-grammar is completely compatible with categorial grammar, and is easier-to-use. See the paper "Combinatory Categorial Grammar and Link Grammar are Equivalent" for details.

The foundational LG papers make comments to this effect; however, see also work by Bob Coecke on category theory and grammar. Coecke's diagrammatic approach is essentially identical to the diagrams given in the foundational LG papers; it becomes abundantly clear that the category theoretic approach is equivalent to Link Grammar. See, for example, this introductory sketch http://www.cs.ox.ac.uk/people/bob.coecke/NewScientist.pdf and observe how the diagrams are essentially identical to the LG jigsaw-puzzle piece diagrams of the foundational LG publications.

ADDRESSES

If you have any questions, please feel free to send a note to the mailing list.

The source code of link-parser and the link-grammar library is located at GitHub.
For bug reports, please open an issue there.

Although all messages should go to the mailing list, the current maintainers can be contacted at:

  Linas Vepstas - <[email protected]>
  Amir Plivatsky - <[email protected]>
  Dom Lachowicz - <[email protected]>

A complete list of authors and copyright holders can be found in the AUTHORS file. The original authors of the Link Grammar parser are:

  Daniel Sleator                    [email protected]
  Computer Science Department       412-268-7563
  Carnegie Mellon University        www.cs.cmu.edu/~sleator
  Pittsburgh, PA 15213

  Davy Temperley                    [email protected]
  Eastman School of Music           716-274-1557
  26 Gibbs St.                      www.link.cs.cmu.edu/temperley
  Rochester, NY 14604

  John Lafferty                     [email protected]
  Computer Science Department       412-268-6791
  Carnegie Mellon University        www.cs.cmu.edu/~lafferty
  Pittsburgh, PA 15213

TODO -- Working Notes

Some working notes.

Easy to fix: provide a more uniform API to the constituent tree. ie provide word index. Also, provide a better word API, showing word extent, subscript, etc.

Capitalized first words:

There are subtle technical issues for handling capitalized first words. This needs to be fixed. In addition, for now these words are shown uncapitalized in the result linkages. This can be fixed.

Maybe capitalization could be handled in the same way that a/an could be handled! After all, it's essentially a nearest-neighbor phenomenon!

Capitalization-mark tokens:

The proximal issue is to add a cost, so that Bill gets a lower cost than bill.n when parsing "Bill went on a walk". The best solution would be to add a 'capitalization-mark token' during tokenization; this token precedes capitalized words. The dictionary then explicitly links to this token, with rules similar to the a/an phonetic distinction. The point here is that this moves capitalization out of ad-hoc C code and into the dictionary, where it can be handled like any other language feature. The tokenizer includes experimental code for that.

Corpus-statistics-based parse ranking:

The old approach to parse ranking via corpus statistics needs to be revived. The issue can be illustrated with these example sentences:

 "Please the customer, bring in the money"
"Please, turn off the lights"

In the first sentence, the comma acts as a conjunction of two directives (imperatives). In the second sentence, it is much too easy to mistake "please" for a verb, the comma for a conjunction, and come to the conclusion that one should please some unstated object, and then turn off the lights. (Perhaps one is pleasing by turning off the lights?)

Bad grammar:

When a sentence fails to parse, look for:

confused words: its/it's, there/their/they're, to/too, your/you're ... These could be added at high cost to the dicts.
missing apostrophes in possessives: "the peoples desires"
determiner agreement errors: "a books"
aux verb agreement errors: "to be hooks up"

Poor agreement might be handled by giving a cost to mismatched lower-case connector letters.
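One way to picture this idea is a matcher that charges for subscript disagreements instead of rejecting them. The Python sketch below uses a deliberately simplified model of connector names (upper-case type letters followed by lower-case subscripts, with * as a wildcard, and no +/- direction suffix); the function name, the cost value and the sample connector names are all hypothetical.

```python
import re


def match_cost(a, b, mismatch_cost=1.0):
    """Cost of joining two connector names, or None if they cannot join.

    Simplified model (an assumption, not the parser's actual rule): the
    upper-case parts must be identical; lower-case subscripts are compared
    position by position, '*' and missing positions match anything, and
    each remaining disagreement adds mismatch_cost instead of forbidding
    the link.
    """
    upper_a, sub_a = re.fullmatch(r"([A-Z]+)([a-z*]*)", a).groups()
    upper_b, sub_b = re.fullmatch(r"([A-Z]+)([a-z*]*)", b).groups()
    if upper_a != upper_b:
        return None
    cost = 0.0
    for ca, cb in zip(sub_a, sub_b):
        if ca != cb and "*" not in (ca, cb):
            cost += mismatch_cost
    return cost


print(match_cost("Ds", "Dm"), match_cost("Ds", "D"), match_cost("Ds", "S"))
```

Under such a scheme, a determiner agreement error such as "a books" would still link, but at a cost that ranks it below any well-formed alternative.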

Elision/ellipsis/zero/phantom words:

A common phenomenon in English is that some words that one might expect to "properly" be present can disappear under various conditions. Below is a sampling of these, followed by some possible solutions.

Expressions such as "Looks good" have an implicit "it" (also called a zero-it or phantom-it) in them; that is, the sentence should really parse as "(it) looks good". The dictionary could be simplified by admitting such phantom words explicitly, rather than modifying the grammar rules to allow such constructions. Other examples, with the phantom word in parenthesis, include:

I ate all (of) the cookies.
I've known him only (for) a week.
I taught him (how) to swim.
I told him (that) it was gone.
It stopped me (from) flying off the cliff.
(It) looks good.
(You) go home!
(You) do tell (me).
(That is) enough!
(I) heard that he's giving a test.
(Are) you all right?
He opened the door and (he) went in.
Emma was the younger (daughter) of two daughters.

This can extend to elided/unvoiced syllables:

(I'm a)'fraid so.

Elided punctuation:

God (,) give me strength.

Normally, the subjects of imperatives must always be offset by a comma: "John, give me the hammer", but here, in muttering an oath, the comma is swallowed (unvoiced).

Some complex phantom constructions:

They play billiards but (they do) not (play) snooker.
I know Ringo, but (I do) not (know) his brother.
She likes Indian food, but (she does) not (like) Chinese (food).
If this is true, then (you should) do it.
Perhaps he will (do it), if he sees enough of her.

Elision of syllables

Many (unstressed) syllables can be elided; in modern English, this occurs most commonly in the initial unstressed syllable:

(a)'ccount (a)'fraid (a)'gainst (a)'greed (a)'midst (a)'mongst
(a)'noint (a)'nother (a)'rrest (at)'tend
(be)'fore (be)'gin (be)'havior (be)'long (be)'twixt
(con)'cern (e)'scape (e)'stablish And so on.

Punctuation, zero-copula, zero-that:

Poorly punctuated sentences cause problems: for example:

 "Mike was not first, nor was he last."
"Mike was not first nor was he last."

The one without the comma currently fails to parse. How can we deal with this in a simple, fast, elegant way? Similar questions for zero-copula and zero-that sentences.

Context-dependent zero phrases.

Consider an argument between a professor and a dean, where the dean wants the professor to write a brilliant review. At the end of the argument, the dean exclaims: "I want the review brilliant!" This is a predicative adjective; clearly it means "I want the review [that you write to be] brilliant." However, taken out of context, such a construction is ungrammatical, as the predicative reading is not at all apparent, and it reads just as incorrectly as would "*Hey Joe, can you hand me that review brilliant?"

Imperatives as phantoms:

 "Push button"
"Push button firmly"

The subject is a phantom; the subject is "you".

Handling zero/phantom words by explicitly inserting them:

One possible solution is to perform a one-point compactification. The dictionary contains the phantom words, and their connectors. Ordinary disjuncts can link to these, but should do so using a special initial lower-case letter (say, 'z', in addition to 'h' and 'd' as is currently implemented). The parser, as it works, examines the initial letter of each connector: if it is 'z', then the usual pruning rules no longer apply, and one or more phantom words are selected out of the bucket of phantom words. (This bucket is kept out-of-line, it is not yet placed into sentence word sequence order, which is why the usual pruning rules get modified.) Otherwise, parsing continues as normal. At the end of parsing, if there are any phantom words that are linked, then all of the connectors on the disjunct must be satisfied (of course!) else the linkage is invalid. After parsing, the phantom words can be inserted into the sentence, with the location deduced from link lengths.

Handling zero/phantom words as re-write rules.

A more principled approach to fixing the phantom-word issue is to borrow the idea of re-writing from the theory of operator grammar. That is, certain phrases and constructions can be (should be) re-written into their "proper form", prior to parsing. The re-writing step would insert the missing words, then the parsing proceeds. One appeal of such an approach is that re-writing can also handle other "annoying" phenomena, such as typos (missing apostrophes, eg "lets" vs. "let's", "its" vs. "it's") as well as multi-word rewrites (eg "let's" vs. "let us", or "it's" vs. "it is").
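A pre-parse rewriting step of this kind can be sketched with a lookup table. The Python example below handles only the multi-word expansions mentioned above; the table and the function name are hypothetical, and context-dependent repairs such as "lets" vs. "let's" are deliberately left out, since they cannot be decided word by word.

```python
import re

# Hypothetical rewrite table; the entries mirror the examples in the text.
EXPANSIONS = {
    "let's": "let us",
    "it's": "it is",
}


def expand(sentence, rules=EXPANSIONS):
    """Apply whole-word, case-insensitive rewrites before parsing.

    The replacement text is inserted in lower case, so the capitalization
    of a rewritten sentence-initial word is lost.
    """
    def replace(match):
        return rules[match.group(0).lower()]

    # Lookarounds keep the rules from firing inside longer words.
    pattern = r"(?<![\w'])(" + "|".join(map(re.escape, rules)) + r")(?![\w'])"
    return re.sub(pattern, replace, sentence, flags=re.IGNORECASE)


print(expand("I think it's late, so let's go"))
```

The rewritten sentence would then be handed to the parser in place of the original.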

Exactly how to implement this is unclear. However, it seems to open the door to more abstract, semantic analysis. Thus, for example, in Meaning-Text Theory (MTT), one must move between SSynt to DSynt structures. Such changes require a graph re-write from the surface syntax parse (eg provided by link-grammar) to the deep-syntactic structure. By contrast, handling phantom words by graph re-writing prior to parsing inverts the order of processing. This suggests that a more holistic approach is needed to graph rewriting: it must somehow be performed "during" parsing, so that parsing can both guide the insertion of the phantom words, and, simultaneously guide the deep syntactic rewrites.

Another interesting possibility arises with regards to tokenization. The current tokenizer is clever, in that it splits not only on whitespace, but can also strip off prefixes, suffixes, and perform certain limited kinds of morphological splitting. That is, it currently has the ability to re-write single-words into sequences of words. It currently does so in a conservative manner; the letters that compose a word are preserved, with a few exceptions, such as making spelling correction suggestions. The above considerations suggest that the boundary between tokenization and parsing needs to become both more fluid, and more tightly coupled.

Poor linkage choices:

Compare "she will be happier than before" to "she will be more happy than before." Current parser makes "happy" the head word, and "more" a modifier w/EA link. I believe the correct solution would be to make "more" the head (link it as a comparative), and make "happy" the dependent. This would harmonize rules for comparatives... and would eliminate/simplify rules for less,more.

However, this idea needs to be double-checked against, eg Hudson's word grammar. I'm confused on this issue ...

Stretchy links:

Currently, some links can act at "unlimited" length, while others can only be finite-length. eg determiners should be near the noun that they apply to. A better solution might be to employ a 'stretchiness' cost to some connectors: the longer they are, the higher the cost. (This eliminates the "unlimited_connector_set" in the dictionary).
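As a rough illustration of what a stretchiness cost might look like, the Python sketch below charges a linear penalty for every word beyond a free length. The function and all of its parameters are hypothetical; nothing here reflects an existing dictionary or parser option.

```python
def stretch_cost(length, base_cost=0.0, rate=0.1, free_length=1):
    """Cost of a link spanning `length` words.

    Links up to free_length words long cost only base_cost; each extra
    word of length adds `rate`. All three parameters are hypothetical.
    """
    return base_cost + rate * max(0, length - free_length)


# A determiner one word away from its noun is free; one four words away is not.
print(stretch_cost(1, rate=0.5), stretch_cost(4, rate=0.5))
```

A per-connector rate of zero would reproduce today's unlimited-length behavior, while a large rate would approximate a hard length limit.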

Opposing (repulsing) parses:

Sometimes, the existence of one parse should suggest that another parse must surely be wrong: if one parse is possible, then the other parses must surely be unlikely. For example: the conjunction and.jg allows the "The Great Southern and Western Railroad" to be parsed as the single name of an entity. However, it also provides a pattern match for "John and Mike" as a single entity, which is almost certainly wrong. But "John and Mike" has an alternative parse, as a conventional-and -- a list of two people, and so the existence of this alternative (and correct) parse suggests that perhaps the entity-and is really very much the wrong parse. That is, the mere possibility of certain parses should strongly disfavor other possible parses. (Exception: Ben & Jerry's ice cream; however, in this case, we could recognize Ben & Jerry as the name of a proper brand; but this is outside of the "normal" dictionary (?) (but maybe should be in the dictionary!))

More examples: "high water" can have the connector A joining high.a and AN joining high.n; these two should either be collapsed into one, or one should be eliminated.

WordNet hinting:

Use WordNet to reduce the number of parses for sentences containing compound verb phrases, such as "give up", "give off", etc.

Sliding-window (Incremental) parsing:

To avoid a combinatorial explosion of parses, it would be nice to have an incremental parsing, phrase by phrase, using a sliding window algorithm to obtain the parse. Thus, for example, the parse of the last half of a long, run-on sentence should not be sensitive to the parse of the beginning of the sentence.

Doing so would help with combinatorial explosion. So, for example, if the first half of a sentence has 4 plausible parses, and the last half has 4 more, then currently, the parser reports 16 parses total. It would be much more useful if it could instead report the factored results: ie the four plausible parses for the first half, and the four plausible parses for the last half. This would ease the burden on downstream users of link-grammar.
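The difference between the two reports is the difference between a Cartesian product and its factors. The short Python sketch below uses placeholder labels in place of real parses.

```python
from itertools import product

first_half = ["A1", "A2", "A3", "A4"]    # placeholder labels for 4 parses
second_half = ["B1", "B2", "B3", "B4"]

flat = list(product(first_half, second_half))   # every combination, as reported today
factored = (first_half, second_half)            # the proposed factored report

print(len(flat), sum(len(part) for part in factored))
```

The flat report grows multiplicatively with each additional independent segment, while the factored report grows only additively.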

This approach has some psychological support. Humans take long sentences and split them into smaller chunks that "hang together" as phrase-structures, viz compounded sentences. The most likely parse is the one where each of the quasi sub-sentences is parsed correctly.

This could be implemented by saving dangling right-going connectors into a parse context, and then, when another sentence fragment arrives, use that context in place of the left-wall.

This somewhat resembles the application of construction grammar ideas to the link-grammar dictionary. It also somewhat resembles Viterbi parsing to some fixed depth. Namely: do a full backward-forward parse for a phrase, and then, once this is done, take a Viterbi-step. That is, once the phrase is done, keep only the dangling connectors to the phrase, place a wall, and then step to the next part of the sentence.

Caution: watch out for garden-path sentences:

  The horse raced past the barn fell.
  The old man the boat.
  The cotton clothing is made of grows in Mississippi.

The current parser parses these perfectly; a Viterbi parser could trip on these.

Other benefits of a Viterbi decoder:

Less sensitive to sentence boundaries: this would allow longer, run-on sentences to be parsed far more quickly.
Could do better with slang, hip-speak.
Support for real-time dialog (parsing of half-uttered sentences).
Parsing of multiple streams, eg from play/movie scripts.
Would enable (or simplify) co-reference resolution across sentences (resolve referents of pronouns, etc.)
Would allow richer state to be passed up to higher layers: specifically, alternate parses for fractions of a sentence, alternate reference resolutions.
Would allow plug-in architecture, so that plugins, employing some alternate, higher-level logic, could disambiguate (eg by making use of semantic content).
Eliminate many of the hard-coded array sizes in the code.

One may argue that Viterbi is a more natural, biological way of working with sequences. Some experimental, psychological support for this can be found at http://www.sciencedaily.com/releases/2012/09/120925143555.htm per Morten Christiansen, Cornell professor of psychology.

Registers, sociolects, dialects (cost vectors):

Consider the sentence "Thieves rob bank" -- a typical newspaper headline. LG currently fails to parse this, because the determiner is missing ("bank" is a count noun, not a mass noun, and thus requires a determiner. By contrast, "thieves rob water" parses just fine.) A fix for this would be to replace mandatory determiner links by (D- or {[[()]] & headline-flag}) which allows the D link to be omitted if the headline-flag bit is set. Here, "headline-flag" could be a new link-type, but one that is not subject to planarity constraints.

Note that this is easier said than done: if one simply adds a high-cost null link, and no headline-flag, then all sorts of ungrammatical sentences parse, with strange parses; while some grammatical sentences, which should parse, but currently don't, become parsable, but with crazy results.

More examples, from And Rosta:

   "when boy meets girl"
   "when bat strikes ball"
   "both mother and baby are well"

A natural approach would be to replace fixed costs by formulas. This would allow the dialect/sociolect to be dynamically changeable. That is, rather than having a binary headline-flag, there would be a formula for the cost, which could be changed outside of the parsing loop. Such formulas could be used to enable/disable parsing specific to different dialects/sociolects, simply by altering the network of link costs.

A simpler alternative would be to have labeled costs (a cost vector), so that different dialects assign different costs to various links. A dialect would be specified during the parse, thus causing the costs for that dialect to be employed during parse ranking.

This has been implemented; what's missing is a practical tutorial on how this might be used.
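
As a starting point for such a tutorial, the labeled-cost idea can be sketched in a few lines of Python. This is a toy model, not the library's dialect mechanism: the label names, the numeric costs, the `MAX_COST` cutoff and the function names are all assumptions made for illustration.

```python
from typing import Dict, List, Optional

# Each dialect maps a cost *label* to a number. The dictionary rule itself
# would only mention the label, never the number.
# All labels and values here are hypothetical.
DIALECTS: Dict[str, Dict[str, float]] = {
    "default":  {"plain": 0.0, "headline": 99.0},  # omission is effectively disabled
    "headline": {"plain": 0.0, "headline": 0.5},   # omission is cheap
}

MAX_COST = 4.0  # assumed cutoff above which a parse is rejected


def determiner_cost(has_determiner: bool, dialect: str) -> float:
    """Cost of satisfying a count noun's determiner requirement."""
    label = "plain" if has_determiner else "headline"
    return DIALECTS[dialect][label]


def parse_cost(count_nouns_have_det: List[bool], dialect: str) -> Optional[float]:
    """Total cost of a parse, or None if it exceeds the cutoff."""
    total = sum(determiner_cost(d, dialect) for d in count_nouns_have_det)
    return total if total <= MAX_COST else None


# "Thieves rob bank": one count noun ("bank"), no determiner.
print(parse_cost([False], "default"))
print(parse_cost([False], "headline"))
```

With the default table, the missing determiner pushes the parse over the cutoff and it is rejected; selecting the "headline" table accepts the same sentence at a small cost, without changing the rule itself.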

Hand-refining verb patterns:

A good reference for refining verb usage patterns is: "COBUILD GRAMMAR PATTERNS 1: VERBS from THE COBUILD SERIES", from THE BANK OF ENGLISH, HARPER COLLINS. Online at https://arts-ccr-002.bham.ac.uk/ccr/patgram/ and http://www.corpus.bham.ac.uk/publications/index.shtml

Quotations:

Currently tokenize.c tokenizes double-quotes and some UTF8 quotes (see the RPUNC/LPUNC class in en/4.0.affix - the QUOTES class is not used for that, but for capitalization support), with some very basic support in the English dictionary (see "% Quotation marks." there). However, it does not do this for the various "curly" UTF8 quotes, such as ‘these’ and “these”. This results in some ugly parsing for sentences containing such quotes. (Note that these are in 4.0.affix).

A mechanism is needed to disentangle the quoting from the quoted text, so that each can be parsed appropriately. It's somewhat unclear how to handle this within link-grammar. This is somewhat related to the problem of morphology (parsing words as if they were "mini-sentences"), idioms (phrases that are treated as if they were single words), and set-phrase structures (if ... then ..., not only ... but also ...), which have a long-range structure similar to quoted text (he said ...).

Semantification of the dictionary:

"to be fishing": Link grammar offers four parses of "I was fishing for evidence", two of which are given low scores, and two are given high scores. Of the two with high scores, one parse is clearly bad. It links "to be fishing.noun" as opposed to the correct "to be fishing.gerund". That is, I can be happy, healthy and wise, but I certainly cannot be fishing.noun. This is perhaps not just a bug in the structure of the dictionary, but is perhaps deeper: link-grammar has little or no concept of lexical units (i.e. collocations, idioms, institutional phrases), which thus allows parses with bad word-senses to sneak in.

The goal is to introduce more knowledge of lexical units into LG.

Different word senses can have different grammar rules (and thus, the links employed reveal the sense of the word). For example, compare "I tend to agree" with "I tend to the sheep": these employ two different meanings for the verb "tend", and the grammatical constructions allowed for one meaning are not the same as those allowed for the other. Yet, the link rules for "tend.v" have to accommodate both senses, thus making the rules rather complex. Worse, it potentially allows for nonsense constructions. If, instead, we allowed the dictionary to contain different rules for "tend.meaning1" and "tend.meaning2", the rules would simplify (at the cost of inflating the size of the dictionary).

Another example: "I fear so" -- the word "so" is only allowed with some, but not all, lexical senses of "fear". So, e.g., "I fear so" is in the same semantic class as "I think so" or "I hope so", although other meanings of these verbs are otherwise quite different.

[Sin2004] "New evidence, new priorities, new attitudes" in J. Sinclair, (ed) (2004) How to use corpora in language teaching, Amsterdam: John Benjamins

See also: Pattern Grammar: A Corpus-Driven Approach to the Lexical Grammar of English
Susan Hunston and Gill Francis (University of Birmingham)
Amsterdam: John Benjamins (Studies in corpus linguistics, edited by Elena Tognini-Bonelli, volume 4), 2000
Book review.

“The Molecular Level of Lexical Semantics”, EA Nida, (1997) International Journal of Lexicography, 10(4): 265–274. Online

"holes" in collocations (aka "set phrases" or "phrasemes"):

Link-grammar provides several mechanisms to support circumpositions or even more complicated multi-word structures. One mechanism is by ordinary links; see the V, XJ and RJ links. The other mechanism is by means of post-processing rules. (For example, the "filler-it" SF rules use post-processing.) However, rules for many common forms have not yet been written. The general problem is that of supporting structures that have "holes" in the middle, and that require "lacing" to tie them together.

For a general theory, see catena.

For example, the adposition:

 ... from [xxx] on.
    "He never said another word from then on."
    "I promise to be quiet from now on."
    "Keep going straight from that point on."
    "We went straight from here on."

... from there on.
    "We went straight, from the house on to the woods."
    "We drove straight, from the hill onwards."

Note that multiple words can fit in the slot [xxx]. Note the tangling of another prepositional phrase: "... from [xxx] on to [yyy]"
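
To make the notion of a "hole" concrete, here is a minimal surface-level sketch in Python that finds the material filling the slot in "from [xxx] on". It is a plain regular expression over the raw sentence, not a link-grammar rule, and the names `FROM_ON` and `find_hole` are made up for this illustration.

```python
import re
from typing import Optional

# "from [xxx] on": the hole may contain one or more words. The hole is
# matched lazily, so it ends at the first following "on" or "onwards".
FROM_ON = re.compile(
    r"\bfrom\s+(?P<hole>\w+(?:\s+\w+)*?)\s+(?P<closer>on|onwards)\b",
    re.IGNORECASE,
)


def find_hole(sentence: str) -> Optional[str]:
    """Return the words filling the [xxx] slot, or None if no match."""
    match = FROM_ON.search(sentence)
    return match.group("hole") if match else None


for s in (
    "He never said another word from then on.",
    "Keep going straight from that point on.",
    "We went straight, from the house on to the woods.",
    "We drove straight, from the hill onwards.",
):
    print(repr(find_hole(s)))
```

A pattern like this only spots the two ends of the construction. It says nothing about how the words inside the hole attach to each other or to the rest of the sentence, which is the part that needs proper links.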

More complicated collocations with holes include

 "First ... next ..."
 "If ... then ..."

'Then' is optional ('then' is a 'null word'), for example:

 "If it is raining, stay inside!"
 "If it is raining, [then] stay inside!"

 "if ... only ..."  "If there were only more like you!"
 "... not only, ... but also ..."

 "As ..., so ..."  "As it was commanded, so it shall be done"

 "Either ... or ..."
 "Both ... and ..."  "Both June and Tom are coming"
 "ought ... if ..."  "That ought to be the case, if John is not lying"

 "Someone ... who ..."  "Someone is outside who wants to see you"

 "... for ... to ..."  "I need for you to come to my party"

The above are not currently supported. An example that is supported is the "non-referential it", e.g.

 "It ... that ..."  "It seemed likely that John would go"

The above is supported by means of special disjuncts for 'it' and 'that', which must occur in the same post-processing domain.

See also:
http://www.phon.ucl.ac.uk/home/dick/enc2010/articles/extraposition.htm
http://www.phon.ucl.ac.uk/home/dick/enc2010/articles/relative-clause.htm

"...from X and from Y" "By X, and by Y, ..." Here, X and Y might be rather long phrases, containing other prepositions. In this case, the usual link-grammar linkage rules will typically conjoin "and from Y" to some preposition in X, instead of the correct link to "from X". Although adding a cost to keep the lengths of X and Y approximately equal can help, it would be even better to recognize the "...from ... and from..." pattern.

The correct solution for the "Either ... or ..." appears to be this:

 ---------------------------+---SJrs--+
       +------???----------+         |
       |     +Ds**c+--SJls-+    +Ds**+
       |     |     |       |    |    |
   either.r the lorry.n or.j-n the van.n

The wrong solution is

 --------------------------+
     +-----Dn-----+       +---SJrs---+
     |      +Ds**c+--SJn--+     +Ds**+
     |      |     |       |     |    |
 neither.j the lorry.n nor.j-n the van.n

The problem with this is that "neither" must coordinate with "nor". That is, one cannot say "either ... nor ...", "neither ... or ...", "neither ... and ..." or "but ... nor ...". The way I originally solved the coordination problem was to invent a new link called Dn, and a link SJn, and to make sure that Dn could only connect to SJn, and nothing else. Thus, the lower-case "n" was used to propagate the coordination across two links. This demonstrates how powerful the link-grammar theory is: with proper subscripts, constraints can be propagated along links over large distances. However, this also makes the dictionary more complex, and the rules harder to write: coordination requires a lot of different links to be hooked together. And so I think that creating a single, new link, called ???, will make the coordination easy and direct. That is why I like that idea.

The ??? link should be the XJ link, which see.
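
The constraint that a single direct link would express can be written down as a small lookup table. The Python sketch below is only a toy checker over a flat word list, under the assumption that each opener pairs with exactly one conjunction; it is not how the dictionary or the parser represents the rule, and `CORRELATIVES` and `check_coordination` are names invented here.

```python
from typing import Dict, List, Optional

# Each correlative opener and the one conjunction it may pair with.
CORRELATIVES: Dict[str, str] = {"either": "or", "neither": "nor", "both": "and"}
CONJUNCTIONS = set(CORRELATIVES.values())


def check_coordination(words: List[str]) -> bool:
    """True if every correlative opener is closed by its own conjunction.

    Conjunctions that appear without an opener are not checked.
    """
    expected: Optional[str] = None
    for word in (w.lower() for w in words):
        if word in CORRELATIVES:
            expected = CORRELATIVES[word]   # opener seen; remember its partner
        elif word in CONJUNCTIONS and expected is not None:
            if word != expected:
                return False                # e.g. "neither ... or ..."
            expected = None                 # pair completed
    return expected is None                 # an unclosed opener fails


print(check_coordination("neither the lorry nor the van".split()))
print(check_coordination("neither the lorry or the van".split()))
```

The table states the pairing in one place, which is the attraction of a direct link between the opener and its conjunction, as opposed to threading a subscript through two separate links.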

More idiomatic than the above examples: "...the chip on X's shoulder" "to do X a favour" "to give X a look"

The above are all examples of "set phrases" or "phrasemes", and are most commonly discussed in the context of MTT or Meaning-Text Theory of Igor Mel'cuk et al (search for "MTT Lexical Function" for more info). Mel'cuk treats set phrases as lexemes, and, for parsing, this is not directly relevant. However, insofar as phrasemes have a high mutual information content, they can dominate the syntactic structure of a sentence.

Preposition linking:

The current parse of "he wanted to look at and listen to everything." is inadequate: the link to "everything" needs to connect to "and", so that "listen to" and "look at" are treated as atomic verb phrases.

Lexical functions:

MTT suggests that perhaps the correct way to understand the contents of the post-processing rules is as an implementation of 'lexical functions' projected onto syntax. That is, the post-processing rules allow only certain syntactical constructions, and these are the kinds of constructions one typically sees in certain kinds of lexical functions.

Alternately, link-grammar suffers from a combinatoric explosion of possible parses of a given sentence. It would seem that lexical functions could be used to rule out many of these parses. On the other hand, the results are likely to be similar to those of statistical parse ranking (which presumably captures such quasi-idiomatic collocations at least weakly).

Ref. I. Mel'cuk: "Collocations and Lexical Functions", in ''Phraseology: theory, analysis, and applications'' Ed. Anthony Paul Cowie (1998) Oxford University Press pp. 23-54.

More generally, all of link-grammar could benefit from an MTT-izing of infrastructure.

Morphology:

Compare the above commentary on lexical functions to Hebrew morphological analysis. To quote Wikipedia:

This distinction between the word as a unit of speech and the root as a unit of meaning is even more important in the case of languages where roots have many different forms when used in actual words, as is the case in Semitic languages. In these, roots are formed by consonants alone, and different words (belonging to different parts of speech) are derived from the same root by inserting vowels. For example, in Hebrew, the root gdl represents the idea of largeness, and from it we have gadol and gdola (masculine and feminine forms of the adjective "big"), gadal "he grew", higdil "he magnified" and magdelet "magnifier", along with many other words such as godel "size" and migdal "tower".
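
The root-and-pattern relation in the quoted example can be illustrated with a short Python sketch that tests whether the consonants of a root occur, in order, inside a transliterated word. This is a deliberately crude subsequence test, assumed here only for illustration: it is not a morphological analyzer, and it will also accept unrelated words that happen to contain the same consonants in order.

```python
def has_root(word: str, root: str) -> bool:
    """True if the letters of `root` appear in `word` in the same order."""
    remaining = iter(word)
    # `c in remaining` advances the iterator until it finds c, so each
    # root consonant must be found after the previous one.
    return all(c in remaining for c in root)


# Transliterated forms taken from the passage above.
WORDS = ["gadol", "gdola", "gadal", "higdil", "magdelet", "godel", "migdal"]

for w in WORDS:
    print(w, has_root(w, "gdl"))
```

A tokenizer that splits words into morphemes has to do considerably more than this, since the vowel pattern between the root consonants carries the part of speech and the inflection.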

Morphology printing:

Instead of hard-coding LL, declare which links are morpho links in the dict.

Assorted minor cleanup:

Should provide a query that returns compile-time constants, e.g. the max number of characters in a word, or max words in a sentence.
Should remove compile-time constants, e.g. max words, max length etc.

Version 6.0 TODO list:

Version 6.0 will change Sentence to Sentence*, Linkage to Linkage* in the API. But perhaps this is a bad idea...
