Link Grammar Parser

Version 5.12.5

The Link Grammar Parser exhibits the linguistic (natural language) structure of English, Thai, Russian, Arabic, Persian and limited subsets of a half-dozen other languages. This structure is a graph of typed links (edges) between the words in a sentence. One may obtain the more conventional HPSG (constituent) and dependency-style parses from Link Grammar by applying a collection of rules to convert to these different formats. This is possible because Link Grammar goes a bit "deeper" into the "syntactico-semantic" structure of a sentence: it provides considerably more fine-grained and detailed information than what is commonly available in conventional parsers.

The theory of Link Grammar parsing was originally developed in 1991 by Davy Temperley, John Lafferty and Daniel Sleator, at the time professors of linguistics and computer science at Carnegie Mellon University. The three initial publications on this theory provide the best introduction and overview; since then, there have been hundreds of publications further exploring, examining and extending the ideas.

Although based on the original Carnegie Mellon code base, the current Link Grammar package has evolved dramatically and is profoundly different from earlier versions. There have been innumerable bug fixes; performance has improved by several orders of magnitude. The package is fully multi-threaded, fully UTF-8 enabled, and has been scrubbed for security, enabling cloud deployment. Parse coverage of English has been dramatically improved; other languages have been added (most notably, Thai and Russian). There is a raft of new features, including support for morphology, dialects, and a fine-grained weight (cost) system, allowing vector-embedding-like behavior. There is a new, sophisticated tokenizer tailored for morphology: it can offer alternative splittings for morphologically ambiguous words. Dictionaries can be updated at run time, enabling systems that perform continuous learning of grammar to also parse at the same time. That is, dictionary updates and parsing are mutually thread-safe. Classes of words can be recognized with regexes. Random planar graph parsing is fully supported; this allows uniform sampling of the space of planar graphs. A detailed report of what has changed can be found in the ChangeLog.

This code is released under the LGPL license, making it freely available for both private and commercial use, with few restrictions. The terms of the license are given in the LICENSE file included with this software.

Please see the main web page for more information. This version is a continuation of the original CMU parser.

New!

As of version 5.9.0, the system includes an experimental system for generating sentences. These are specified using a "fill in the blanks" API, where words are substituted into wild-card locations whenever the result is a grammatically valid sentence. Additional details are in the man page: man link-generator (in the man subdirectory).

This generator is used in the OpenCog Language Learning project, which aims to automatically learn Link Grammars from corpora, using brand-new and innovative information-theoretic techniques, somewhat similar to those found in artificial neural nets (deep learning), but using explicitly symbolic representations.

Quick overview

The parser includes APIs in various programming languages, as well as a handy command-line tool for playing with it. Here is some typical output:

 linkparser> This is a test!
	Linkage 1, cost vector = (UNUSED=0 DIS= 0.00 LEN=6)

    +-------------Xp------------+
    +----->WV----->+---Ost--+   |
    +---Wd---+-Ss*b+  +Ds**c+   |
    |        |     |  |     |   |
LEFT-WALL this.p is.v a  test.n !

(S (NP this.p) (VP is.v (NP a test.n)) !)

            LEFT-WALL    0.000  Wd+ hWV+ Xp+
               this.p    0.000  Wd- Ss*b+
                 is.v    0.000  Ss- dWV- O*t+
                    a    0.000  Ds**c+
               test.n    0.000  Ds**c- Os-
                    !    0.000  Xp- RW+
           RIGHT-WALL    0.000  RW-

This rather busy display illustrates many interesting things. For example, the Ss*b link connects the verb and the subject, and indicates that the subject is singular. Likewise, the Ost link connects the verb and the object, and also indicates that the object is singular. The WV (verb-wall) link points at the head verb of the sentence, while the Wd link points at the head noun. The Xp link connects to the trailing punctuation. The Ds**c link connects the noun to the determiner: it again confirms that the noun is singular, and also that the noun starts with a consonant. (The PH link, not required here, is used to force phonetic agreement, distinguishing 'a' from 'an'.) These link types are documented in the English Link Documentation.

The bottom of the display is a listing of the "disjuncts" used for each word. A disjunct is simply the list of connectors that were used to form the links. Disjuncts are particularly interesting because they serve as an extremely fine-grained form of "part of speech". Thus, for example, the disjunct S- O+ indicates a transitive verb: it is a verb that takes both a subject and an object. The additional markup above indicates that 'is' is not only being used as a transitive verb, but also gives finer details: a transitive verb that took a singular subject and was used as (is usable as) the head verb of a sentence. The floating-point value is the "cost" of the disjunct; it very roughly captures the idea of the log-probability of this particular grammatical usage. Much as parts of speech correlate with word meanings, so also fine-grained parts of speech correlate with much finer distinctions and gradations of meaning.
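The disjunct listing is regular enough to pick apart mechanically. As a minimal sketch (not part of the Link Grammar API; the names Disjunct and parse_disjunct_line and the returned layout are assumptions made for illustration), the following Python splits one listing line into the word, its cost, and its connectors. A trailing - marks a connector that links to the left, and a trailing + one that links to the right.

```python
from typing import List, NamedTuple, Tuple


class Disjunct(NamedTuple):
    word: str
    cost: float
    connectors: List[Tuple[str, str]]  # (connector name, "left" or "right")


def parse_disjunct_line(line: str) -> Disjunct:
    """Split one line of the disjunct listing into its parts."""
    fields = line.split()
    if len(fields) < 2:
        raise ValueError(f"not a disjunct line: {line!r}")
    word, cost = fields[0], float(fields[1])
    connectors = []
    for token in fields[2:]:
        # The trailing sign gives the direction of the connector.
        direction = {"-": "left", "+": "right"}.get(token[-1])
        if direction is None:
            raise ValueError(f"connector without direction: {token!r}")
        connectors.append((token[:-1], direction))
    return Disjunct(word, cost, connectors)


if __name__ == "__main__":
    d = parse_disjunct_line("is.v    0.000  Ss- dWV- O*t+")
    print(d.word, d.cost, d.connectors)
```

Applied to the is.v line of the display, this yields the connectors Ss and dWV pointing left (towards the subject and the wall) and O*t pointing right (towards the object), which is the S- O+ transitive-verb pattern with extra detail.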

The Link Grammar parser also supports morphological analysis. Here is an example in Russian:

 linkparser> это теста
	Linkage 1, cost vector = (UNUSED=0 DIS= 0.00 LEN=4)

             +-----MVAip-----+
    +---Wd---+       +-LLCAG-+
    |        |       |       |
LEFT-WALL это.msi тест.= =а.ndnpi

The LL link connects the stem 'тест' to the suffix 'а'. The MVA link connects only to the suffix because, in Russian, it is the suffixes that carry all of the syntactic structure, and not the stems. The Russian lexis is documented here.

The Thai dictionary is now fully developed, effectively covering the entire language. An example in Thai:

 linkparser> นายกรัฐมนตรี ขึ้น กล่าว สุนทรพจน์
	Linkage 1, cost vector = (UNUSED=0 DIS= 2.00 LEN=2)

    +---------LWs--------+
    |           +<---S<--+--VS-+-->O-->+
    |           |        |     |       |
LEFT-WALL นายกรัฐมนตรี.n ขึ้น.v กล่าว.v สุนทรพจน์.n

The VS link connects the two verbs 'ขึ้น' and 'กล่าว' in a serial verb construction. A summary of link types is documented here. Full documentation of the Thai Link Grammar can be found here.

The Thai Link Grammar also accepts POS-tagged and named-entity-tagged inputs. Each word can be annotated with its Link POS tag. For example:

 linkparser> เมื่อวานนี้.n มี.ve คน.n มา.x ติดต่อ.v คุณ.pr ครับ.pt
Found 1 linkage (1 had no P.P. violations)
	Unique linkage, cost vector = (UNUSED=0 DIS= 0.00 LEN=12)

                          +---------------------PT--------------------+
    +---------LWs---------+---------->VE---------->+                  |
    |           +<---S<---+-->O-->+       +<--AXw<-+--->O--->+        |
    |           |         |       |       |        |         |        |
LEFT-WALL เมื่อวานนี้.n[!] มี.ve[!] คน.n[!] มา.x[!] ติดต่อ.v[!] คุณ.pr[!] ครับ.pt[!]

Full documentation for the Thai dictionary can be found here.

The Thai dictionary accepts LST20 tagsets for POS and named entities, to bridge the gap between fundamental NLP tools and the Link Parser. For example:

 linkparser> วันที่_25_ธันวาคม@DTM ของ@PS ทุก@AJ ปี@NN เป็น@VV วัน@NN คริสต์มาส@NN
Found 348 linkages (348 had no P.P. violations)
	Linkage 1, cost vector = (UNUSED=0 DIS= 1.00 LEN=10)

    +--------------------------------LWs--------------------------------+
    |               +<------------------------S<------------------------+
    |               |                +---------->PO--------->+          |
    |               +----->AJpr----->+            +<---AJj<--+          +---->O---->+------NZ-----+
    |               |                |            |          |          |           |             |
LEFT-WALL วันที่_25_ธันวาคม@DTM[!] ของ@PS[!].pnn ทุก@AJ[!].jl ปี@NN[!].n เป็น@VV[!].v วัน@NN[!].na คริสต์มาส@NN[!].n

Note that each word above is annotated with LST20 POS tags and NE tags. Full documentation for both the Link POS tags and the LST20 tagsets can be found here. More information about LST20, such as the annotation guidelines and data statistics, can be found here.
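Input in the word@TAG form is easy to produce from the output of a separate tagger. The sketch below is an illustration under stated assumptions rather than part of Link Grammar: the helper names to_lst20_input and split_lst20_token are hypothetical, and joining a multi-word entity with underscores simply mirrors the วันที่_25_ธันวาคม@DTM token in the example.

```python
from typing import Iterable, Tuple


def to_lst20_input(tagged: Iterable[Tuple[str, str]]) -> str:
    """Render (word, tag) pairs as one line of word@TAG tokens.

    Spaces inside a word are replaced by underscores so that a
    multi-word named entity stays a single token.
    """
    return " ".join(f"{word.replace(' ', '_')}@{tag}" for word, tag in tagged)


def split_lst20_token(token: str) -> Tuple[str, str]:
    """Split a word@TAG token at the last '@' into (word, tag)."""
    word, sep, tag = token.rpartition("@")
    if not sep or not word or not tag:
        raise ValueError(f"not a word@TAG token: {token!r}")
    return word, tag


if __name__ == "__main__":
    print(to_lst20_input([("ทุก", "AJ"), ("ปี", "NN")]))
```

The resulting line can be pasted at the linkparser> prompt of the Thai parser.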

The any language supports uniformly sampled random planar graphs:

 linkparser> asdf qwer tyuiop fghj bbb
Found 1162 linkages (1162 had no P.P. violations)

             +-------ANY------+-------ANY------+
    +---ANY--+--ANY--+        +---ANY--+--ANY--+
    |        |       |        |        |       |
LEFT-WALL asdf[!] qwer[!] tyuiop[!] fghj[!] bbb[!]

The ady language does likewise, performing random morphological splittings:

 linkparser> asdf qwerty fghjbbb
Found 1512 linkages (1512 had no P.P. violations)

                                  +------------------ANY-----------------+
    +-----ANY----+-------ANY------+                  +---------LL--------+
    |            |                |                  |                   |
LEFT-WALL asdf[!ANY-WORD] qwerty[!ANY-WORD] fgh[!SIMPLE-STEM].= =jbbb[!SIMPLE-SUFF]

Theory and documentation

An extended overview and summary can be found on the Link Grammar Wikipedia page, which touches on most of the important, primary aspects of the theory. However, it is no substitute for the original papers published on the topic:

Daniel D. K. Sleator, Davy Temperley, "Parsing English with a Link Grammar", October 1991, CMU-CS-91-196.
Daniel D. Sleator, Davy Temperley, "Parsing English with a Link Grammar", Third International Workshop on Parsing Technologies (1993).
Dennis Grinberg, John Lafferty, Daniel Sleator, "A Robust Parsing Algorithm for Link Grammars", August 1995, CMU-CS-95-125.
John Lafferty, Daniel Sleator, Davy Temperley, "Grammatical Trigrams: A Probabilistic Model of Link Grammar", 1992 AAAI Symposium on Probabilistic Approaches to Natural Language.

There are many more papers and references listed on the primary Link Grammar website.

See also the C/C++ API documentation. Bindings for other programming languages, including Python3, Java and Node.js, can be found in the bindings directory. (There are two sets of JavaScript bindings: one set for the library API, and another set for the command-line parser.)

Contents

Content	Description
LICENSE	The license describing terms of use
ChangeLog	A compendium of recent changes.
configure	The GNU configuration script
autogen.sh	Developer's configure maintenance tool
link-grammar/*.c	The program. (Written in ANSI C)
-----	-----
bindings/autoit/	Optional AutoIt language bindings.
bindings/java/	Optional Java language bindings.
bindings/js/	Optional JavaScript language bindings.
bindings/lisp/	Optional Common Lisp language bindings.
bindings/node.js/	Optional Node.js language bindings.
bindings/ocaml/	Optional OCaml language bindings.
bindings/python/	Optional Python3 language bindings.
bindings/python-examples/	Link Grammar test suite and Python language binding usage example.
bindings/swig/	SWIG interface file, for other FFI interfaces.
bindings/vala/	Optional Vala language bindings.
-----	-----
data/en/	English language dictionaries.
data/en/4.0.dict	The file containing the dictionary definitions.
data/en/4.0.	The post-processing knowledge file.
data/en/4.0.constituents	The constituent knowledge file.
data/en/4.0.affix	The affix (prefix/suffix) file.
data/en/4.0.regex	Regular-expression-based morphology guesser.
data/en/tiny.dict	A small example dictionary.
data/en/words/	A directory full of word lists.
data/en/corpus*.batch	Example corpora used for testing.
-----	-----
data/ru/	A full-fledged Russian dictionary
data/th/	A full-fledged Thai dictionary (more than 100,000 words)
data/ar/	A fairly complete Arabic dictionary
data/fa/	A Persian (Farsi) dictionary
data/de/	A small prototype German dictionary
data/lt/	A small prototype Lithuanian dictionary
data/id/	A small prototype Indonesian dictionary
data/vn/	A small prototype Vietnamese dictionary
data/he/	An experimental Hebrew dictionary
data/kz/	An experimental Kazakh dictionary
data/tr/	An experimental Turkish dictionary
-----	-----
morphology/ar/	An Arabic morphology analyzer
morphology/fa/	A Persian morphology analyzer
-----	-----
debug/	Information about debugging the library
msvc/	Microsoft Visual C project files
mingw/	Information on using MinGW under MSYS or Cygwin

Unpacking and signature verification

The system is distributed using the conventional tar.gz format; it can be extracted using the tar -zxf link-grammar.tar.gz command at the command line.

A tarball of the latest version can be downloaded from:
https://www.gnucash.org/link-grammar/downloads/

The files have been digitally signed to make sure that there was no corruption of the dataset during download, and to help ensure that no malicious changes were made to the code internals by third parties. The signatures can be checked with the gpg command:

gpg --verify link-grammar-5.12.5.tar.gz.asc

which should generate output identical to (except for the date):

 gpg: Signature made Thu 26 Apr 2012 12:45:31 PM CDT using RSA key ID E0C0651C
gpg: Good signature from "Linas Vepstas (Hexagon Architecture Patches) <[email protected]>"
gpg:                 aka "Linas Vepstas (LKML) <[email protected]>"

Alternatively, the MD5 checksums can be verified. These do not provide cryptographic security, but they can detect simple corruption. To verify the checksums, issue md5sum -c MD5SUM at the command line.
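Where md5sum is not available, the same check can be done with a few lines of Python. This is a sketch under stated assumptions: it assumes that the MD5SUM file uses the usual md5sum layout of one hex digest followed by a file name per line, and the function names are illustrative. Like md5sum itself, it only detects accidental corruption.

```python
import hashlib
from pathlib import Path
from typing import Dict


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5sums(listing: Path) -> Dict[str, bool]:
    """Check each entry of an md5sum-style listing; map file name to OK/failed."""
    results = {}
    for line in listing.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        # md5sum marks binary-mode entries with a leading '*'.
        name = name.lstrip("*").strip()
        target = listing.parent / name
        results[name] = target.is_file() and md5_of_file(target) == expected.lower()
    return results


if __name__ == "__main__":
    for name, ok in verify_md5sums(Path("MD5SUM")).items():
        print(f"{name}: {'OK' if ok else 'FAILED'}")
```

File names are resolved relative to the directory that holds the listing, so the script can be run from anywhere.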

Tags in git can be verified by performing the following:

 gpg --recv-keys --keyserver keyserver.ubuntu.com EB6AA534E0C0651C
git tag -v link-grammar-5.10.5

Creating the system

To compile the Link Grammar shared library and demonstration program, type at the command line:

 ./configure
make
make check

To install, change user to "root" and say

 make install
ldconfig

This will install the liblink-grammar.so library into /usr/local/lib, the header files into /usr/local/include/link-grammar, and the dictionaries into /usr/local/share/link-grammar. Running ldconfig will rebuild the shared library cache. To verify that the install was successful, run (as a non-root user)

 make installcheck

Optional system libraries

The Link Grammar library has optional features that are enabled automatically if configure detects certain libraries. These libraries are optional on most systems; if the feature they add is desired, the corresponding libraries need to be installed before running configure.

The library package names may vary between systems (consult Google if needed...). For example, the names may include -devel instead of -dev, or omit it altogether. The library names may also lack the lib prefix.

libsqlite3-dev (for the SQLite-backed dictionary)
libz1g-dev or libz-devel (currently needed for the bundled minisat2)
libedit-dev (see Editline)
libhunspell-dev or libaspell-dev (and the corresponding English dictionary).
libtre-dev or libpcre2-dev (much faster than the libc regex implementation, and needed for correctness on FreeBSD and Cygwin).
Using libpcre2-dev is strongly recommended. It must be used on certain systems (as specified in their build sections).

Editline

If libedit-dev is installed, then the arrow keys can be used to edit the input to the link-parser tool; the up and down arrow keys will recall previous entries. You want this; it makes testing and editing much easier.

Node.js Bindings

Two versions of Node.js bindings are included. One version wraps the library; the other uses Emscripten to wrap the command-line tool. The library bindings are in bindings/node.js, while the Emscripten wrapper is in bindings/js.

These are built using npm. First, you must build the core C library. Then do the following:

   cd bindings/node.js
   npm install
   npm run make

This will create the library bindings and also run a small unit test (which should pass). An example can be found in bindings/node.js/examples/simple.js.

For the command-line wrapper, do the following:

   cd bindings/js
   ./install_emsdk.sh
   ./build_packages.sh

Python3 Bindings

The Python3 bindings are built by default, provided that the corresponding Python development packages are installed. (Python2 bindings are no longer supported.)

These packages are:

Linux:
- Systems using "rpm" packages: python3-devel
- Systems using "deb" packages: python3-dev
Windows:
- Install Python3 from https://www.python.org/downloads/windows/. You also have to install SWIG from http://www.swig.org/download.html.
macOS:
- Install Python3 using Homebrew.

NOTE: Before issuing configure (see below), you have to verify that the required Python versions can be invoked using your PATH.

The use of the Python bindings is optional; you do not need them if you do not plan to use Link Grammar with Python. If you would like to disable the Python bindings, use:

 ./configure --disable-python-bindings

The linkgrammar.py module provides a high-level interface in Python. Related scripts: sentence-check.py, example.py, tests.py.

macOS:
- Due to file permission settings, macOS users may need to install the Python bindings into custom directory locations. This can be done by saying make install pythondir=/where/to/install

Java Bindings

By default, the Makefiles attempt to build the Java bindings. The use of the Java bindings is optional; you do not need them if you do not plan to use Link Grammar with Java. You can skip building the Java bindings by disabling them as follows:

 ./configure --disable-java-bindings

If jni.h is not found, or if ant is not found, then the Java bindings will not be built.

Notes about finding jni.h:
Some common Java JVM distributions (most notably, the ones from Sun) place this file in unusual locations, where it cannot be found automatically. To remedy this, make sure that the environment variable JAVA_HOME is set correctly. The configure script looks for jni.h in $JAVA_HOME/Headers and in $JAVA_HOME/include; it also examines the corresponding locations for $JDK_HOME. If jni.h still cannot be found, specify the location with the CPPFLAGS variable; so, for example,

 export CPPFLAGS="-I/opt/jdk1.5/include/ -I/opt/jdk1.5/include/linux"

or

 export CPPFLAGS="-I/c/java/jdk1.6.0/include/ -I/c/java/jdk1.6.0/include/win32/"

Please note that the use of /opt is non-standard, and most system tools will fail to find packages installed there.

Install location

The /usr/local install target can be overridden using the standard configure --prefix option; so, for example:

 ./configure --prefix=/opt/link-grammar

By using pkg-config (see below), non-standard install locations can be detected automatically.

Custom builds

Additional configuration options are printed by

 ./configure --help

The system has been tested and works well on 32-bit and 64-bit Linux systems, FreeBSD, macOS, as well as on Microsoft Windows systems. Specific OS-dependent notes follow.

Building from the GitHub repository

End users should download the tarball (see Unpacking and signature verification).

The current GitHub version is intended for developers (including anyone who is willing to provide a fix, a new feature or an improvement). The tip of the master branch is often unstable, and can sometimes have bad code in it, as it is under development. It also requires development tools that are not installed by default. For these reasons, the use of the GitHub version is discouraged for regular end users.

Installing from GitHub

Clone it: git clone https://github.com/opencog/link-grammar.git
Or download it as a ZIP:
https://github.com/opencog/link-grammar/archive/master.zip

Prerequisite tools

Tools that may need to be installed before you can build Link Grammar:

make (the gmake variant may be needed)
m4
gcc or clang
autoconf
libtool
autoconf-archive
pkg-config (may be named pkgconf or pkgconfig)
pip3 (for the Python bindings)

Optional:
swig (for language bindings)
flex
Apache Ant (for the Java bindings)
graphviz (if you would like to use the word-graph display feature)

The GitHub version does not include a configure script. To generate it, use:

 autogen.sh

If you get errors, make sure you have installed the development packages listed above, and that your system installation is up to date. In particular, a missing autoconf or autoconf-archive may cause strange and misleading errors.

For more information about how to proceed, continue at the section Creating the system and the relevant sections after it.

Additional notes for developers

To configure debug mode, use:

 configure --enable-debug

It adds some verification debug code and functions that can pretty-print several data structures.

A feature that may be useful for debugging is the word-graph display. It is enabled by default. For more details on this feature, see Word-graph display.

Building on FreeBSD

The current configuration has an apparent standard C++ library mixing problem when gcc is used (a fix is welcome). However, the common practice on FreeBSD is to compile with clang, which does not have this problem. In addition, the add-on packages are installed under /usr/local.

So here is how configure should be invoked:

 env LDFLAGS=-L/usr/local/lib CPPFLAGS=-I/usr/local/include \
CC=clang CXX=clang++ configure

Note that pcre2 is a required package, as the existing libc regex implementation does not have the needed level of regex support.

Some packages have different names than the ones mentioned in the previous sections:

minisat (minisat2) pkgconf (pkg-config)

Building on macOS

Plain-vanilla Link Grammar should compile and run on Apple macOS just fine, as described above. At this time, there are no reported issues.

If you do not need the Java bindings, you should almost surely configure with:

 ./configure --disable-java-bindings

If you do want the Java bindings, be sure to set the JDK_HOME environment variable to wherever <Headers/jni.h> is. Set the JAVA_HOME variable to the location of the Java compiler. Make sure you have ant installed.

If you would like to build from GitHub (see Building from the GitHub repository), you can install the tools that are listed there using Homebrew.

Building on Windows

There are three different ways in which Link Grammar can be compiled on Windows. One way is to use Cygwin, which provides a Linux compatibility layer for Windows. Another way is to use the MSVC system. A third way is to use the MinGW system, which uses the GNU toolset to compile Windows programs. The source code supports Windows systems from Vista on.

The Cygwin way currently produces the best result, as it supports line editing with command completion and history, and also supports word-graph display on X Windows. (MinGW currently does not have libedit, and the MSVC port currently does not support command completion and history, nor spelling.)

Building on Windows (Cygwin)

The easiest way to have Link Grammar working on MS Windows is to use Cygwin, a Linux-like environment for Windows that makes it possible to port software running on POSIX systems to Windows. Download and install Cygwin.

Note that installing the pcre2 package is required, because the libc regex implementation is not capable enough.

For more details, see mingw/README-Cygwin.md.

Building on Windows (MinGW)

Another way to build Link Grammar is to use MinGW, which uses the GNU toolset to compile POSIX-compliant programs for Windows. Using MinGW/MSYS2 is probably the easiest way to obtain workable Java bindings for Windows. Download and install MinGW/MSYS2 from msys2.org.

Note that installing the pcre2 package is required, because the libc regex implementation is not capable enough.

For more details, see mingw/README-MinGW64.md.

Building and running on Windows (MSVC)

Microsoft Visual C/C++ project files can be found in the msvc directory. For directions, see the README.md file there.

Running the program

To run the program, issue the command (supposing it is in your PATH):

 link-parser [arguments]

This starts the program. The program has many user-settable variables and options. These can be displayed by entering !var at the link-parser prompt. Entering !help will display some additional commands.

The dictionaries are arranged in directories whose name is the 2-letter language code. The link-parser program searches for such a language directory in the following order, either directly or under a directory named data:

Under your current directory.
Unless compiled with MSVC or run under the Windows console: at the installed location (typically /usr/local/share/link-grammar).
If compiled on Windows: in the directory of the link-parser executable (which may be in a different location than the link-parser command, which may be a script).

If link-parser cannot find the desired dictionary, use verbosity level 4 to debug the problem; for example:

 link-parser ru -verbosity=4

Other locations can be specified on the command line; for example:

 link-parser ../path/to-my/modified/data/en

When accessing dictionaries in non-standard locations, the standard file names are still assumed (i.e. 4.0.dict, 4.0.affix, etc.).

The Russian dictionaries are in data/ru. Thus, the Russian parser can be started as:

 link-parser ru

If you do not supply an argument to link-parser, it searches for a language according to your current locale setup. If it cannot find such a language directory, it defaults to "en".

If you see errors similar to this:

 Warning: The word "encyclop" found near line 252 of en/4.0.dict
matches the following words:
encyclop
This word will be ignored.

then your UTF-8 locales are either not installed or not configured. The command locale -a should list en_US.utf8 as a locale. If not, then you need to run dpkg-reconfigure locales and/or update-locale, or possibly apt-get install locales, or combinations or variants of these, depending on your operating system.

Testing the system

There are several ways to test the resulting build. If the Python bindings are built, then a test program can be found in the file ./bindings/python-examples/tests.py. For more details, see README.md in the bindings/python-examples directory.

There are also multiple batches of test/example sentences in the language data directories, generally named corpus-*.batch. The parser program can be run in batch mode, for testing the system on a large number of sentences. The following command runs the parser on a file called corpus-basic.batch:

 link-parser < corpus-basic.batch

The line !batch near the top of corpus-basic.batch turns on batch mode. In this mode, sentences labeled with an initial * should be rejected, and those not starting with a * should be accepted. This batch file does report some errors, as do the files corpus-biolg.batch and corpus-fixes.batch. Work is ongoing to fix these.
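The accept/reject convention can be expressed as a small classifier, for instance to compare the expectations in a batch file against some other parser. The following sketch is illustrative only: it assumes that lines starting with ! are parser commands (such as !batch) and that lines starting with % are comments, and the function name expected_outcomes is hypothetical.

```python
from typing import Iterable, Iterator, Tuple


def expected_outcomes(lines: Iterable[str]) -> Iterator[Tuple[str, bool]]:
    """Yield (sentence, should_parse) for each sentence in a batch file."""
    for raw in lines:
        line = raw.strip()
        # Assumption: '!' introduces a command and '%' a comment.
        if not line or line[0] in "!%":
            continue
        if line.startswith("*"):
            yield line[1:].strip(), False  # marked as ungrammatical
        else:
            yield line, True


if __name__ == "__main__":
    sample = ["!batch", "% a comment", "This is a test.", "*This are a test."]
    for sentence, should_parse in expected_outcomes(sample):
        print("accept" if should_parse else "reject", sentence)
```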

The corpus-fixes.batch file contains many thousands of sentences that have been fixed since the original 4.1 release of Link Grammar. corpus-biolg.batch contains biology/medical-text sentences from the BioLG project. corpus-voa.batch contains samples from Voice of America; corpus-failures.batch contains a large number of failures.

The following numbers are subject to change, but at this time the number of errors one can expect to observe in each of these files is roughly as follows:

 en/corpus-basic.batch:      88 errors
en/corpus-fixes.batch:     371 errors
lt/corpus-basic.batch:      15 errors
ru/corpus-basic.batch:      47 errors

The bindings/python directory contains a unit test for the Python bindings. It also performs several basic checks that stress the Link Grammar libraries.

Using the system

There is an API (application program interface) to the parser. This makes it easy to incorporate it into your own applications. The API is documented on the web site.

Using CMake

The FindLinkGrammar.cmake file can be used to test for and set up compilation in CMake-based build environments.

Using pkg-config

To make compiling and linking easier, the current release uses the pkg-config system. To determine the location of the Link Grammar header files, say pkg-config --cflags link-grammar. To obtain the location of the libraries, say pkg-config --libs link-grammar. Thus, for example, a typical makefile might include the targets:

 .c.o:
   cc -O2 -g -Wall -c $< `pkg-config --cflags link-grammar`

$(EXE): $(OBJS)
   cc -g -o $@ $^ `pkg-config --libs link-grammar`

Using Java

This release provides Java files that offer three ways of accessing the parser. The simplest way is to use the org.linkgrammar.LinkGrammar class; this provides a very simple Java API to the parser.

The second possibility is to use the LGService class. This implements a TCP/IP network server, providing parse results as JSON messages. Any JSON-capable client can connect to this server and obtain parsed text.

The third possibility is to use the org.linkgrammar.LGRemoteClient class and, in particular, the parse() method. This class is a network client that connects to the JSON server and converts the response back to results accessible via the ParseResult API.

The code described above will be built if Apache ant is installed.

Using the JSON network server

The network server can be started by saying:

 java -classpath linkgrammar.jar org.linkgrammar.LGService 9000

The above starts the server on port 9000. If the port is omitted, help text is printed. This server can be contacted directly via TCP/IP; for example:

 telnet localhost 9000

(Alternatively, use netcat instead of telnet.) After connecting, type in:

 text:  this is an example sentence to parse

The returned bytes will be a JSON message providing the parses of the sentence. By default, the ASCII-art parse of the text is not transmitted. It can be obtained by sending messages of the form:

 storeDiagramString:true, text: this is a test.
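The same exchange can be scripted instead of typed into telnet. The sketch below is a hedged illustration, not the project's client: it assumes that the server reads one newline-terminated request and that the reply is complete once the server closes the connection or stops sending; build_request and query_server are hypothetical names. Only the text: and storeDiagramString:true fields shown above are used.

```python
import socket


def build_request(sentence: str, with_diagram: bool = False) -> bytes:
    """Format one request line using the fields shown above."""
    prefix = "storeDiagramString:true, " if with_diagram else ""
    # Assumption: a request is a single newline-terminated line.
    return f"{prefix}text: {sentence}\n".encode("utf-8")


def query_server(sentence: str, host: str = "localhost", port: int = 9000,
                 with_diagram: bool = False, timeout: float = 10.0) -> str:
    """Send one sentence to the server and return the raw reply as text."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(build_request(sentence, with_diagram))
        try:
            while True:
                data = conn.recv(4096)
                if not data:
                    break  # server closed the connection
                chunks.append(data)
        except socket.timeout:
            pass  # assume the reply is complete once the server goes quiet
    return b"".join(chunks).decode("utf-8", errors="replace")


if __name__ == "__main__":
    print(query_server("this is a test.", with_diagram=True))
```

The reply is returned as raw text; whether it can be handed directly to a JSON parser depends on how the server frames its messages, which this sketch does not assume.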

Spell Guessing

The parser will run a spell-checker at an early stage, if it encounters a word that it does not know and cannot guess based on morphology. The configure script looks for the aspell or hunspell spell-checkers; if the aspell devel environment is found, then aspell is used, else hunspell is used.

Spell guessing may be disabled at runtime, in the link-parser client, with the !spell=0 flag. Enter !help for more details.

Caution: aspell version 0.60.8, and possibly others, have a memory leak. The use of spell-guessing in production servers is strongly discouraged. Keeping spell-guessing disabled (=0) in Parse_Options is safe.

Multi-threading

It is safe to use link-grammar for parsing in multiple threads. Threads may share the same dictionary. Parse options can be set on a per-thread basis, with the exception of verbosity, which is a global, shared by all threads. It is the only global.

Linguistic Commentary

Phonetics

A/An phonetic determiners before consonants/vowels are handled by a new PH link type, linking the determiner to the word immediately following it. Status: introduced in version 5.1.0 (August 2014). Mostly done, although many special-case nouns are unfinished.

Directional Links

Directional links are needed for some languages, such as Lithuanian, Turkish and other free word-order languages. The goal is to have a link clearly indicate which word is the head word, and which is the dependent. This is achieved by prefixing connectors with a single lower-case letter: h or d, indicating 'head' and 'dependent'. The linkage rules are such that h matches either nothing or d, and d matches h or nothing. This is a new feature in version 5.1.0 (August 2014). The website provides additional documentation.
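The h/d matching rule can be written out as a small table. This is an illustrative Python sketch only, not the parser's code; the function names are hypothetical, and it assumes that an unprefixed connector matches any prefix, which is the symmetric reading of the rule.

```python
def split_direction(connector):
    """Split an optional leading 'h' or 'd' from a connector name.

    Assumption: link-type names start with an upper-case letter, so a
    leading lower-case 'h' or 'd' can only be a direction prefix.
    """
    if connector and connector[0] in "hd":
        return connector[0], connector[1:]
    return "", connector


def directions_match(left, right):
    """Return True if two direction prefixes ('h', 'd' or '') may link."""
    allowed = {
        "h": {"", "d"},        # head matches nothing or dependent
        "d": {"", "h"},        # dependent matches head or nothing
        "": {"", "h", "d"},    # no prefix: assumed to match anything
    }
    return right in allowed[left]
```

For example, directions_match("h", "d") holds, while two heads or two dependents never link to each other.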

Although the English-language link-grammar links are un-oriented, it seems that a de facto direction can be given to them that is completely consistent with standard conceptions of a dependency grammar.

The dependency arrows have the following properties:

Anti-reflexive (a word cannot depend on itself; it cannot point at itself.)
Anti-symmetric (if Word1 depends on Word2, then Word2 cannot depend on Word1) (so, e.g. determiners depend on nouns, but never vice-versa)
The arrows are neither transitive, nor anti-transitive: a single word may be ruled by several heads. For example:

    +------>WV------->+
    +-->Wd-->+<--Ss<--+
    |        |        |
LEFT-WALL   she    thinks.v

That is, there is a path to the subject, "she", directly from the left wall, via the Wd link, as well as indirectly, from the wall to the root verb, and thence to the subject. Similar loops form with the B and R links. Such loops are useful for constraining the possible number of parses: the constraint occurs in conjunction with the "no links cross" meta-rule.
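The properties listed above can be checked mechanically on the "she thinks" example. The sketch below is illustrative Python with hypothetical helper names; each arrow is written as a (head, dependent) pair read off the diagram.

```python
# Arrows from the diagram: Wd and WV leave the wall, Ss runs from verb to subject.
arrows = [
    ("LEFT-WALL", "she"),       # Wd
    ("LEFT-WALL", "thinks.v"),  # WV
    ("thinks.v", "she"),        # Ss
]


def is_antireflexive(arrows):
    """No word depends on itself."""
    return all(head != dep for head, dep in arrows)


def is_antisymmetric(arrows):
    """If x heads y, then y never heads x."""
    pairs = set(arrows)
    return all((dep, head) not in pairs for head, dep in pairs)


def heads_of(word, arrows):
    """All words that directly rule the given word."""
    return sorted(head for head, dep in arrows if dep == word)
```

Here heads_of("she", arrows) lists both the wall and the verb, which is the loop described above: one word ruled by two heads.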

The graphs are planar; that is, no two edges may cross. See, however, the "link-crossing" discussion below.

There are several related mathematical notions, but none quite capture directional LG:

Directional LG graphs resemble DAGs, except that LG allows only one wall (one "top" element).
Directional LG graphs resemble strict partial orders, except that the LG arrows are usually not transitive.
Directional LG graphs resemble catena, except that catena are strictly anti-transitive: the path to any word is unique in a catena.

Link Crossing

The foundational LG papers mandate the planarity of the parse graphs. This is based on a very old observation that dependencies almost never cross in natural languages: humans simply do not speak in sentences where links cross. Imposing planarity constraints then provides a strong engineering and algorithmic constraint on the resulting parses: the total number of parses to be considered is sharply reduced, and thus the overall speed of parsing can be greatly increased.

However, there are occasional, relatively rare exceptions to this planarity rule; such exceptions are observed in almost all languages. A number of these exceptions are given for English, below.

Thus, it seems important to relax the planarity constraint, and find something else that is almost as strict, but still allows infrequent exceptions. It would appear that the concept of "landmark transitivity" as defined by Richard Hudson in his theory of "Word Grammar", and then advocated by Ben Goertzel, just might be such a mechanism.

ftp://ftp.phon.ucl.ac.uk/pub/word-grammar/ell2-wg.pdf
http://www.phon.ucl.ac.uk/home/dick/enc/syntax.htm
http://goertzel.org/prowlgrammar.pdf

Planarity: Theory vs. Practice

In practice, the planarity constraint allows very efficient algorithms to be used in the implementation of the parser. Thus, from the point of view of the implementation, we want to keep planarity. Fortunately, there is a convenient and unambiguous way to have our cake, and eat it, too. A non-planar diagram can be drawn on a sheet of paper using standard electrical-engineering notation: a funny symbol, wherever wires cross. This notation is very easily adapted to LG connectors; below is an actual working example, already implemented in the current LG English dictionary. All link crossings can be implemented in this way! So we do not have to actually abandon the current parsing algorithms to get non-planar diagrams. We don't even have to modify them! Hurrah!

Here is a working example: "I want to look at and listen to everything." This wants two J links pointing to "everything". The desired diagram would need to look like this:

    +---->WV---->+
    |            +--------IV---------->+
    |            |           +<-VJlpi--+
    |            |           |    +---xxx------------Js------->+
    +--Wd--+-Sp*i+--TO-+-I*t-+-MVp+    +--VJrpi>+--MVp-+---Js->+
    |      |     |     |     |    |    |        |      |       |
LEFT-WALL I.p want.v to.r look.v at and.j-v listen.v to.r everything

The above really wants to have a Js link from "at" to "everything", but this Js link crosses (clashes with, marked by xxx) the link to the conjunction. Other examples suggest that one should allow most links to cross over the down-links to conjunctions.
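The crossing in this example can be confirmed with a simple interval test. The Python sketch below is illustrative only and uses a hypothetical function name; word positions are counted from the left wall (position 0) in the diagram, so look.v is 4, at is 5, and.j-v is 6, listen.v is 7 and everything is 9.

```python
def links_cross(link_a, link_b):
    """Return True if two links, given as pairs of word positions, cross.

    Two links cross when exactly one endpoint of one link lies strictly
    between the endpoints of the other. Links that share an endpoint or
    are nested inside one another do not cross.
    """
    (a1, a2), (b1, b2) = sorted(link_a), sorted(link_b)
    return a1 < b1 < a2 < b2 or b1 < a1 < b2 < a2


desired_js = (5, 9)  # at -> everything
vj_left = (4, 6)     # look.v <- and.j-v  (VJlpi)
vj_right = (6, 7)    # and.j-v -> listen.v (VJrpi)
```

Under this test the desired Js link crosses the VJlpi link to the conjunction, but merely encloses the VJrpi link.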

The planarity-maintaining work-around is to split the Js link into two: a Jj part and a Jk part; the two are used together to cross over the conjunction. This is currently implemented in the English dictionary, and it works.

This work-around is in fact completely generic, and can be extended to any kind of link crossing. For this to work, a better notation would be convenient; perhaps uJs- instead of Jj- and vJs- instead of Jk-, or something like that ... (TODO: invent better notation.) (NB: This is a kind of re-invention of "fat links", but in the dictionary, not in the code.)

Landmark Transitivity: Theory

Given that non-planar parses can be enabled without any changes to the parser algorithm, all that is required is to understand what sort of theory describes link-crossing in a coherent grounding. That theory is Dick Hudson's Landmark Transitivity, explained here.

This mechanism works as follows:

First, every link must be directional, with a head and a dependent. That is, we are concerned with directional-LG links, which are of the form x--A-->y or y<--A--x for words x,y and LG link type A.
Given either the directional-LG relation x--A-->y or y<--A--x, define the dependency relation x-->y. That is, ignore the link-type label.
Heads are landmarks for dependents. If the dependency relation x-->y holds, then x is said to be a landmark for y, and the predicate land(x,y) is true, while the predicate land(y,x) is false. Here, x and y are words, while --> is the landmark relation.
Although the basic directional-LG links form landmark relations, the total set of landmark relations is extended by transitive closure. That is, if land(x,y) and land(y,z) then land(x,z). That is, the basic directional-LG links are "generators" of landmarks; they generate by means of transitivity. Note that the transitive closure is unique.
In addition to the above landmark relation, there are two additional relations: the before and after landmark relations. (In English, these correspond to left and right; in Hebrew, the opposite). That is, since words come in chronological order in a sentence, the dependency relation can point either left or right. The previously-defined landmark relation only described the dependency order; we now introduce the word-sequence order. Thus, there are land-before() and land-after() relations that capture both the dependency relation, and the word-order relation.
Notation: the before-landmark relation land-B(x,y) corresponds to x-->y (in English, reversed in right-left languages such as Hebrew), whereas the after-landmark relation land-A(x,y) corresponds to y<--x. That is, land(x,y) == land-B(x,y) or land-A(x,y) holds as a statement about the predicate form of the relations.
As before, the full set of directional landmarks are obtained by transitive closure applied to the directional-LG links. Two different rules are used to perform this closure:

 -- land-B(x,y) and land(y,z) ==> land-B(x,z)
 -- land-A(x,y) and land(y,z) ==> land-A(x,z)

Parsing is then performed by joining LG connectors in the usual manner, to form a directional link. The transitive closure of the directional landmarks is then computed. Finally, any parse that does not conclude with the "left wall" being the upper-most landmark is discarded.
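The closure step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not parser code: the function names are hypothetical; the two closure rules are read as concluding land-B(x,z) and land-A(x,z) respectively; a head that precedes its dependent generates a before-landmark, and one that follows it an after-landmark; and "upper-most landmark" is taken to mean that the wall is a landmark for every other word and no word is a landmark for the wall.

```python
def landmark_closure(links, position):
    """Return (land_b, land_a), closed under the two transitivity rules.

    links    -- iterable of (head, dependent) word pairs
    position -- dict mapping each word to its index in the sentence
    """
    # Generators, taken directly from the directional links.
    land_b = {(h, d) for h, d in links if position[h] < position[d]}
    land_a = {(h, d) for h, d in links if position[h] > position[d]}

    changed = True
    while changed:
        changed = False
        land = land_b | land_a  # the untyped relation land(x, y)
        for typed in (land_b, land_a):
            for x, y in list(typed):
                for y2, z in land:
                    # land-T(x,y) and land(y,z) ==> land-T(x,z)
                    if y == y2 and (x, z) not in typed:
                        typed.add((x, z))
                        changed = True
    return land_b, land_a


def wall_is_top(links, position, wall="LEFT-WALL"):
    """Check that the wall is a landmark for all words, and none for it."""
    land_b, land_a = landmark_closure(links, position)
    land = land_b | land_a
    others = set(position) - {wall}
    return (all((wall, w) in land for w in others)
            and all((w, wall) not in land for w in others))
```

A parse whose links leave some word unreachable from the wall fails wall_is_top and would be discarded in the final step described above.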

Here is an example where landmark transitivity provides a natural solution to a (currently) broken parse. The "to.r" has a disjunct "I+ & MVi-" which allows "What is there to do?" to parse correctly. However, it also allows the incorrect parse "He is going to do". The fix would be to force "do" to take an object; however, a link from "do" to "what" is not allowed, because link-crossing would prevent it.

Fixing this requires only a fix to the dictionary, and not to the parser itself.

Link-crossing Examples

Examples where the no-links-cross constraint seems to be violated, in English:

  "He is either in the 105th or the 106th battalion."
  "He is in either the 105th or the 106th battalion."

Both seem to be acceptable in English, but the ambiguity of the "in-either" temporal ordering requires two different parse trees, if the no-links-cross rule is to be enforced. This seems un-natural. Similarly:

  "He is either here or he is there."
  "He either is here or he is there."

A different example involves a crossing to the left wall. That is, the link LEFT-WALL--remains crosses over here--found:

  "Here the remains can be found."

Other examples, per And Rosta:

The allowed--by link crosses cake--that :

 He had been allowed to eat a cake by Sophy that she had made him specially

a--book , very--indeed

 "a very much easier book indeed"

an--book , easy--to

 "an easy book to read"

a--book , more--than

 "a more difficult book than that one"

that--have crosses remains--of

 "It was announced that remains have been found of the ark of the covenant"

There is a natural crossing, driven by conjunctions:

 "I was in hell yesterday and heaven on Tuesday."

The "natural" linkage is to use MV links to connect "yesterday" and "on Tuesday" to the verb. However, if this is done, then these must cross the links from the conjunction "and" to "heaven" and "hell". This can be worked around partly as follows:

              +-------->Ju--------->+
              |    +<------SJlp<----+
+<-SX<-+->Pp->+    +-->Mpn->+       +->SJru->+->Mp->+->Js->+
|      |      |    |        |       |        |      |      |
I     was    in  hell   yesterday  and    heaven    on  Tuesday

but the desired MV links from the verb to the time-prepositions "yesterday" and "on Tuesday" are missing -- whereas they are present, when the individual sentences "I was in hell yesterday" and "I was in heaven on Tuesday" are parsed. Using a conjunction should not wreck the relations that get used; but this requires link-crossing.

 "Sophy wondered up to whose favorite number she should count"

Here, "up_to" must modify "number", and not "whose". There's no way to do this without link-crossing.

Type Theory

Link Grammar can be understood in the context of type theory. A simple introduction to type theory can be found in chapter 1 of the HoTT book. This book is freely available online and strongly recommended if you are interested in types.

Link types can be mapped to types that appear in categorial grammars. The nice thing about link-grammar is that the link types form a type system that is much easier to use and comprehend than that of categorial grammar, and yet can be directly converted to that system! That is, link-grammar is completely compatible with categorial grammar, and is easier-to-use. See the paper "Combinatory Categorial Grammar and Link Grammar are Equivalent" for details.

The foundational LG papers make comments to this effect; however, see also work by Bob Coecke on category theory and grammar. Coecke's diagrammatic approach is essentially identical to the diagrams given in the foundational LG papers; it becomes abundantly clear that the category theoretic approach is equivalent to Link Grammar. See, for example, this introductory sketch http://www.cs.ox.ac.uk/people/bob.coecke/NewScientist.pdf and observe how the diagrams are essentially identical to the LG jigsaw-puzzle piece diagrams of the foundational LG publications.

ADDRESSES

If you have any questions, please feel free to send a note to the mailing list.

The source code of link-parser and the link-grammar library is located at GitHub.
For bug reports, please open an issue there.

Although all messages should go to the mailing list, the current maintainers can be contacted at:

  Linas Vepstas - <[email protected]>
  Amir Plivatsky - <[email protected]>
  Dom Lachowicz - <[email protected]>

A complete list of authors and copyright holders can be found in the AUTHORS file. The original authors of the Link Grammar parser are:

  Daniel Sleator                    [email protected]
  Computer Science Department       412-268-7563
  Carnegie Mellon University        www.cs.cmu.edu/~sleator
  Pittsburgh, PA 15213

  Davy Temperley                    [email protected]
  Eastman School of Music           716-274-1557
  26 Gibbs St.                      www.link.cs.cmu.edu/temperley
  Rochester, NY 14604

  John Lafferty                     [email protected]
  Computer Science Department       412-268-6791
  Carnegie Mellon University        www.cs.cmu.edu/~lafferty
  Pittsburgh, PA 15213

TODO -- Working Notes

Some working notes.

Easy to fix: provide a more uniform API to the constituent tree. ie provide word index. Also, provide a better word API, showing word extent, subscript, etc.

Capitalized first words:

There are subtle technical issues for handling capitalized first words. This needs to be fixed. In addition, for now these words are shown uncapitalized in the result linkages. This can be fixed.

Maybe capitalization could be handled in the same way that a/an could be handled! After all, it's essentially a nearest-neighbor phenomenon!

Capitalization-mark tokens:

The proximal issue is to add a cost, so that Bill gets a lower cost than bill.n when parsing "Bill went on a walk". The best solution would be to add a 'capitalization-mark token' during tokenization; this token precedes capitalized words. The dictionary then explicitly links to this token, with rules similar to the a/an phonetic distinction. The point here is that this moves capitalization out of ad-hoc C code and into the dictionary, where it can be handled like any other language feature. The tokenizer includes experimental code for that.

Corpus-statistics-based parse ranking:

The old code for parse ranking via corpus statistics needs to be revived. The issue can be illustrated with these example sentences:

 "Please the customer, bring in the money"
"Please, turn off the lights"

In the first sentence, the comma acts as a conjunction of two directives (imperatives). In the second sentence, it is much too easy to mistake "please" for a verb, the comma for a conjunction, and come to the conclusion that one should please some unstated object, and then turn off the lights. (Perhaps one is pleasing by turning off the lights?)

Bad grammar:

When a sentence fails to parse, look for:

confused words: its/it's, there/their/they're, to/too, your/you're ... These could be added at high cost to the dicts.
missing apostrophes in possessives: "the peoples desires"
determiner agreement errors: "a books"
aux verb agreement errors: "to be hooks up"

Poor agreement might be handled by giving a cost to mismatched lower-case connector letters.
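As a rough illustration of that idea, the Python sketch below counts mismatched lower-case letters between two connectors and turns the count into a cost. It is hypothetical, not the parser's matching code: it assumes that a connector is an upper-case link-type name followed by lower-case subscript letters, that "*" acts as a wildcard (as in the Sp*i and I*t links shown in the diagrams), and that a shorter subscript is padded with wildcards. The connector names in the usage note are illustrative.

```python
def split_connector(name):
    """Split a connector into its upper-case type and lower-case subscript."""
    i = 0
    while i < len(name) and name[i].isupper():
        i += 1
    return name[:i], name[i:]


def mismatch_cost(conn_a, conn_b, penalty=1.0):
    """Cost of linking two connectors despite subscript disagreement.

    Returns None when the upper-case link types differ (no link at all),
    otherwise penalty times the number of mismatched subscript letters.
    """
    type_a, sub_a = split_connector(conn_a)
    type_b, sub_b = split_connector(conn_b)
    if type_a != type_b:
        return None
    width = max(len(sub_a), len(sub_b))
    sub_a = sub_a.ljust(width, "*")
    sub_b = sub_b.ljust(width, "*")
    mismatches = sum(
        1 for p, q in zip(sub_a, sub_b) if p != q and "*" not in (p, q)
    )
    return penalty * mismatches
```

With this scheme a pair such as Ds and Dm would link at a cost of one penalty unit instead of failing outright, while Ds and D would still link for free.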

Elision/ellipsis/zero/phantom words:

A common phenomenon in English is that some words that one might expect to "properly" be present can disappear under various conditions. Below is a sampling of these. Some possible solutions are given below.

Expressions such as "Looks good" have an implicit "it" (also called a zero-it or phantom-it) in them; that is, the sentence should really parse as "(it) looks good". The dictionary could be simplified by admitting such phantom words explicitly, rather than modifying the grammar rules to allow such constructions. Other examples, with the phantom word in parenthesis, include:

I ate all (of) the cookies.
I've known him only (for) a week.
I taught him (how) to swim.
I told him (that) it was gone.
It stopped me (from) flying off the cliff.
(It) looks good.
(You) go home!
(You) do tell (me).
(That is) enough!
(I) heard that he's giving a test.
(Are) you all right?
He opened the door and (he) went in.
Emma was the younger (daughter) of two daughters.

This can extend to elided/unvoiced syllables:

(I'm a)'fraid so.

Elided punctuation:

God (,) give me strength.

Normally, the subjects of imperatives must always be offset by a comma: "John, give me the hammer", but here, in muttering an oath, the comma is swallowed (unvoiced).

Some complex phantom constructions:

They play billiards but (they do) not (play) snooker.
I know Ringo, but (I do) not (know) his brother.
She likes Indian food, but (she does) not (like) Chinese (food).
If this is true, then (you should) do it.
Perhaps he will (do it), if he sees enough of her.

Elision of syllables

Many (unstressed) syllables can be elided; in modern English, this occurs most commonly in the initial unstressed syllable:

(a)'ccount (a)'fraid (a)'gainst (a)'greed (a)'midst (a)'mongst
(a)'noint (a)'nother (a)'rrest (at)'tend
(be)'fore (be)'gin (be)'havior (be)'long (be)'twixt
(con)'cern (e)'scape (e)'stablish And so on.

Punctuation, zero-copula, zero-that:

Poorly punctuated sentences cause problems: for example:

 "Mike was not first, nor was he last."
"Mike was not first nor was he last."

The one without the comma currently fails to parse. How can we deal with this in a simple, fast, elegant way? Similar questions for zero-copula and zero-that sentences.

Context-dependent zero phrases.

Consider an argument between a professor and a dean, and the dean wants the professor to write a brilliant review. At the end of the argument, the dean exclaims: "I want the review brilliant!" This is a predicative adjective; clearly it means "I want the review [that you write to be] brilliant." However, taken out of context, such a construction is ungrammatical, as the predicativeness is not at all apparent, and it reads just as incorrectly as would "*Hey Joe, can you hand me that review brilliant?"

Imperatives as phantoms:

 "Push button"
"Push button firmly"

The subject is a phantom; the subject is "you".

Handling zero/phantom words by explicitly inserting them:

One possible solution is to perform a one-point compactification. The dictionary contains the phantom words, and their connectors. Ordinary disjuncts can link to these, but should do so using a special initial lower-case letter (say, 'z', in addition to 'h' and 'd' as is currently implemented). The parser, as it works, examines the initial letter of each connector: if it is 'z', then the usual pruning rules no longer apply, and one or more phantom words are selected out of the bucket of phantom words. (This bucket is kept out-of-line, it is not yet placed into sentence word sequence order, which is why the usual pruning rules get modified.) Otherwise, parsing continues as normal. At the end of parsing, if there are any phantom words that are linked, then all of the connectors on the disjunct must be satisfied (of course!) else the linkage is invalid. After parsing, the phantom words can be inserted into the sentence, with the location deduced from link lengths.

Handling zero/phantom words as re-write rules.

A more principled approach to fixing the phantom-word issue is to borrow the idea of re-writing from the theory of operator grammar. That is, certain phrases and constructions can be (should be) re-written into their "proper form", prior to parsing. The re-writing step would insert the missing words, then the parsing proceeds. One appeal of such an approach is that re-writing can also handle other "annoying" phenomena, such as typos (missing apostrophes, eg "lets" vs. "let's", "its" vs. "it's") as well as multi-word rewrites (eg "let's" vs. "let us", or "it's" vs. "it is").

Exactly how to implement this is unclear. However, it seems to open the door to more abstract, semantic analysis. Thus, for example, in Meaning-Text Theory (MTT), one must move between SSynt to DSynt structures. Such changes require a graph re-write from the surface syntax parse (eg provided by link-grammar) to the deep-syntactic structure. By contrast, handling phantom words by graph re-writing prior to parsing inverts the order of processing. This suggests that a more holistic approach is needed to graph rewriting: it must somehow be performed "during" parsing, so that parsing can both guide the insertion of the phantom words, and, simultaneously guide the deep syntactic rewrites.

Another interesting possibility arises with regards to tokenization. The current tokenizer is clever, in that it splits not only on whitespace, but can also strip off prefixes, suffixes, and perform certain limited kinds of morphological splitting. That is, it currently has the ability to re-write single-words into sequences of words. It currently does so in a conservative manner; the letters that compose a word are preserved, with a few exceptions, such as making spelling correction suggestions. The above considerations suggest that the boundary between tokenization and parsing needs to become both more fluid, and more tightly coupled.

Poor linkage choices:

Compare "she will be happier than before" to "she will be more happy than before." Current parser makes "happy" the head word, and "more" a modifier w/EA link. I believe the correct solution would be to make "more" the head (link it as a comparative), and make "happy" the dependent. This would harmonize rules for comparatives... and would eliminate/simplify rules for less,more.

However, this idea needs to be double-checked against, eg Hudson's word grammar. I'm confused on this issue ...

Stretchy links:

Currently, some links can act at "unlimited" length, while others can only be finite-length. eg determiners should be near the noun that they apply to. A better solution might be to employ a 'stretchiness' cost to some connectors: the longer they are, the higher the cost. (This eliminates the "unlimited_connector_set" in the dictionary).
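One way to picture such a cost is a function that grows with link length. The Python sketch below is purely hypothetical: the linear form, the parameter names and the default values are assumptions chosen for illustration, not something the dictionary currently supports.

```python
def stretch_cost(length, base_cost=0.0, stretchiness=0.1):
    """Cost of a link spanning `length` word positions.

    A link between adjacent words (length 1) pays only the base cost;
    every additional position adds `stretchiness`. Setting stretchiness
    to 0.0 recovers an "unlimited length" connector.
    """
    if length < 1:
        raise ValueError("a link spans at least one position")
    return base_cost + stretchiness * (length - 1)
```

A determiner connector might be given a large stretchiness so that distant nouns are heavily penalized, while a connector that legitimately spans long distances would be given a value near zero.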

Opposing (repulsing) parses:

Sometimes, the existence of one parse should suggest that another parse must surely be wrong: if one parse is possible, then the other parses must surely be unlikely. For example: the conjunction and.jg allows the "The Great Southern and Western Railroad" to be parsed as the single name of an entity. However, it also provides a pattern match for "John and Mike" as a single entity, which is almost certainly wrong. But "John and Mike" has an alternative parse, as a conventional-and -- a list of two people, and so the existence of this alternative (and correct) parse suggests that perhaps the entity-and is really very much the wrong parse. That is, the mere possibility of certain parses should strongly disfavor other possible parses. (Exception: Ben & Jerry's ice cream; however, in this case, we could recognize Ben & Jerry as the name of a proper brand; but this is outside of the "normal" dictionary (?) (but maybe should be in the dictionary!))

More examples: "high water" can have the connector A joining high.a and AN joining high.n; these two should either be collapsed into one, or one should be eliminated.

WordNet hinting:

Use WordNet to reduce the number of parses for sentences containing compound verb phrases, such as "give up", "give off", etc.

Sliding-window (Incremental) parsing:

To avoid a combinatorial explosion of parses, it would be nice to have an incremental parsing, phrase by phrase, using a sliding window algorithm to obtain the parse. Thus, for example, the parse of the last half of a long, run-on sentence should not be sensitive to the parse of the beginning of the sentence.

Doing so would help with combinatorial explosion. So, for example, if the first half of a sentence has 4 plausible parses, and the last half has 4 more, then currently, the parser reports 16 parses total. It would be much more useful if it could instead report the factored results: ie the four plausible parses for the first half, and the four plausible parses for the last half. This would ease the burden on downstream users of link-grammar.
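The arithmetic behind this can be made concrete. In the Python sketch below the parse labels are placeholders, not real parser output; it only contrasts the combined listing with the factored one.

```python
from itertools import product

# Placeholder labels standing in for the plausible parses of each half.
first_half = ["A1", "A2", "A3", "A4"]
second_half = ["B1", "B2", "B3", "B4"]

# What is reported today: every combination of the two halves.
combined = list(product(first_half, second_half))

# The factored report: each half listed once, independently.
factored = [first_half, second_half]

combined_count = len(combined)                         # grows as a product
factored_count = sum(len(part) for part in factored)   # grows as a sum
```

The combined listing has 16 entries while the factored one has 8, and the gap widens multiplicatively with each additional independent phrase.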

This approach has psychological support. Humans take long sentences and split them into smaller chunks that "hang together" as phrase-structures, viz compounded sentences. The most likely parse is the one where each of the quasi sub-sentences is parsed correctly.

This could be implemented by saving dangling right-going connectors into a parse context, and then, when another sentence fragment arrives, use that context in place of the left-wall.

This somewhat resembles the application of construction grammar ideas to the link-grammar dictionary. It also somewhat resembles Viterbi parsing to some fixed depth. Viz. do a full backward-forward parse for a phrase, and then, once this is done, take a Viterbi-step. That is, once the phrase is done, keep only the dangling connectors to the phrase, place a wall, and then step to the next part of the sentence.

Caution: watch out for garden-path sentences:

  The horse raced past the barn fell.
  The old man the boat.
  The cotton clothing is made of grows in Mississippi.

The current parser parses these perfectly; a Viterbi parser could trip on these.

Other benefits of a Viterbi decoder:

Less sensitive to sentence boundaries: this would allow longer, run-on sentences to be parsed far more quickly.
Could do better with slang, hip-speak.
Support for real-time dialog (parsing of half-uttered sentences).
Parsing of multiple streams, eg from play/movie scripts.
Would enable (or simplify) co-reference resolution across sentences (resolve referents of pronouns, etc.)
Would allow richer state to be passed up to higher layers: specifically, alternate parses for fractions of a sentence, alternate reference resolutions.
Would allow plug-in architecture, so that plugins, employing some alternate, higher-level logic, could disambiguate (eg by making use of semantic content).
Eliminate many of the hard-coded array sizes in the code.

One may argue that Viterbi is a more natural, biological way of working with sequences. Some experimental, psychological support for this can be found at http://www.sciencedaily.com/releases/2012/09/120925143555.htm per Morten Christiansen, Cornell professor of psychology.

Registers, sociolects, dialects (cost vectors):

Consider the sentence "Thieves rob bank" -- a typical newspaper headline. LG currently fails to parse this, because the determiner is missing ("bank" is a count noun, not a mass noun, and thus requires a determiner. By contrast, "thieves rob water" parses just fine.) A fix for this would be to replace mandatory determiner links by (D- or {[[()]] & headline-flag}) which allows the D link to be omitted if the headline-flag bit is set. Here, "headline-flag" could be a new link-type, but one that is not subject to planarity constraints.

Note that this is easier said than done: if one simply adds a high-cost null link, and no headline-flag, then all sorts of ungrammatical sentences parse, with strange parses; while some grammatical sentences, which should parse, but currently don't, become parsable, but with crazy results.

More examples, from And Rosta:

   "when boy meets girl"
   "when bat strikes ball"
   "both mother and baby are well"

A natural approach would be to replace fixed costs by formulas. This would allow the dialect/sociolect to be dynamically changeable. That is, rather than having a binary headline-flag, there would be a formula for the cost, which could be changed outside of the parsing loop. Such formulas could be used to enable/disable parsing specific to different dialects/sociolects, simply by altering the network of link costs.

A simpler alternative would be to have labeled costs (a cost vector), so that different dialects assign different costs to various links. A dialect would be specified during the parse, thus causing the costs for that dialect to be employed during parse ranking.

This has been implemented; what's missing is a practical tutorial on how this might be used.
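Pending such a tutorial, the idea of labeled costs can at least be pictured with a toy model. The Python sketch below is not the link-grammar implementation and does not use its dictionary syntax or API; the table, the labels, the cost values and the function names are all hypothetical.

```python
# Hypothetical cost table: each dialect assigns a value to each cost label.
DIALECT_COSTS = {
    "default":   {"headline": 4.0, "colloquial": 2.0},
    "newspaper": {"headline": 0.0, "colloquial": 2.0},
}


def linkage_cost(fixed_cost, cost_labels, dialect):
    """Total cost of a linkage: fixed costs plus the dialect's labeled costs."""
    table = DIALECT_COSTS[dialect]
    return fixed_cost + sum(table[label] for label in cost_labels)


def rank(linkages, dialect):
    """Order candidate linkages from cheapest to most expensive."""
    return sorted(
        linkages,
        key=lambda lk: linkage_cost(lk["cost"], lk["labels"], dialect),
    )


# Two made-up candidates: one relies on the determiner-less headline rule.
candidates = [
    {"name": "omitted-determiner", "cost": 0.0, "labels": ["headline"]},
    {"name": "other", "cost": 2.0, "labels": []},
]
```

Ranking the same candidates under "default" and under "newspaper" gives different winners, without touching the parse itself: only the cost table changes.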

Hand-refining verb patterns:

A good reference for refining verb usage patterns is: "COBUILD GRAMMAR PATTERNS 1: VERBS from THE COBUILD SERIES", from THE BANK OF ENGLISH, HARPER COLLINS. Online at https://arts-ccr-002.bham.ac.uk/ccr/patgram/ and http://www.corpus.bham.ac.uk/publications/index.shtml

Quotations:

Currently tokenize.c tokenizes double-quotes and some UTF8 quotes (see the RPUNC/LPUNC class in en/4.0.affix - the QUOTES class is not used for that, but for capitalization support), with some very basic support in the English dictionary (see "% Quotation marks." there). However, it does not do this for the various "curly" UTF8 quotes, such as ‘these’ and “these”. This results in some ugly parsing for sentences containing such quotes. (Note that these are in 4.0.affix).

A mechanism is needed to disentangle the quoting from the quoted text, so that each can be parsed appropriately. It's somewhat unclear how to handle this within link-grammar. This is somewhat related to the problem of morphology (parsing words as if they were "mini-sentences",) idioms (phrases that are treated as if they were single words), set-phrase structures (if ... then ... not only... but also ...) which have a long-range structure similar to quoted text (he said ...).

Semantification of the dictionary:

"to be fishing": Link grammar offers four parses of "I was fishing for evidence", two of which are given low scores, and two are given high scores. Of the two with high scores, one parse is clearly bad. It links "to be fishing.noun" as opposed to the correct "to be fishing.gerund". That is, I can be happy, healthy and wise, but I certainly cannot be fishing.noun. This is perhaps not just a bug in the structure of the dictionary, but is perhaps deeper: link-grammar has little or no concept of lexical units (ie collocations, idioms, institutional phrases), which thus allows parses with bad word-senses to sneak in.

The goal is to introduce more knowledge of lexical units into LG.

Different word senses can have different grammar rules (and thus, the links employed reveal the sense of the word): for example: "I tend to agree" vs. "I tend to the sheep" -- these employ two different meanings for the verb "tend", and the grammatical constructions allowed for one meaning are not the same as those allowed for the other. Yet, the link rules for "tend.v" have to accommodate both senses, thus making the rules rather complex. Worse, it potentially allows for non-sense constructions. If, instead, we allowed the dictionary to contain different rules for "tend.meaning1" and "tend.meaning2", the rules would simplify (at the cost of inflating the size of the dictionary).

Another example: "I fear so" -- the word "so" is only allowed with some, but not all, lexical senses of "fear". So, e.g., "I fear so" is in the same semantic class as "I think so" or "I hope so", although other meanings of these verbs are otherwise quite different.

[Sin2004] "New evidence, new priorities, new attitudes" in J. Sinclair, (ed) (2004) How to use corpora in language teaching, Amsterdam: John Benjamins

See also: Pattern Grammar: A Corpus-Driven Approach to the Lexical Grammar of English
Susan Hunston and Gill Francis (University of Birmingham)
Amsterdam: John Benjamins (Studies in corpus linguistics, edited by Elena Tognini-Bonelli, volume 4), 2000
Book review.

“The Molecular Level of Lexical Semantics”, EA Nida, (1997) International Journal of Lexicography, 10(4): 265–274. Online.

"holes" in collocations (aka "set phrases" of "phrasemes"):

The link-grammar provides several mechanisms to support circumpositions or even more complicated multi-word structures. One mechanism is by ordinary links; see the V, XJ and RJ links. The other mechanism is by means of post-processing rules. (For example, the "filler-it" SF rules use post-processing.) However, rules for many common forms have not yet been written. The general problem is of supporting structures that have "holes" in the middle, that require "lacing" to tie them together.

For a general theory, see catena.

For example, the adposition:

 ... from [xxx] on.
    "He never said another word from then on."
    "I promise to be quiet from now on."
    "Keep going straight from that point on."
    "We went straight from here on."

... from there on.
    "We went straight, from the house on to the woods."
    "We drove straight, from the hill onwards."

Note that multiple words can fit in the slot [xxx]. Note the tangling of another prepositional phrase: "... from [xxx] on to [yyy]"
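To make the "hole" concrete, the following Python sketch matches the "from [xxx] on" pattern with a regular expression. This is only a surface-string illustration, not a proposal for how link-grammar should handle it. The pattern `FROM_ON`, the limit of three words in the slot, and the function name `find_from_on` are assumptions made for this example.

```python
import re

# Hypothetical pattern: "from", a hole of one to three words,
# then "on" or "onwards". The three-word limit is arbitrary.
FROM_ON = re.compile(
    r"\bfrom\s+((?:\w+\s+){0,2}\w+)\s+(on|onwards)\b",
    re.IGNORECASE,
)

def find_from_on(sentence):
    """Return (slot text, closing word) for the first match, or None."""
    match = FROM_ON.search(sentence)
    if match is None:
        return None
    return match.group(1), match.group(2)
```

For "We went straight, from the house on to the woods." this returns the slot "the house" and the closing word "on", but it has no way to tell that "on" is also the start of "on to [yyy]". That tangling is exactly what a flat pattern cannot express and what the link structure would have to capture.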

More complicated collocations with holes include

 "First.. next..."
 "If ... then ..."

'Then' is optional ('then' is a 'null word'), for example:

 "If it is raining, stay inside!"
"If it is raining, [then] stay inside!"


"if ... only ..." "If there were only more like you!"
"... not only, ... but also ..."


"As ..., so ..."  "As it was commanded, so it shall be done"


"Either ... or ..."
"Both ... and  ..."  "Both June and Tom are coming"
"ought ... if ..." "That ought to be the case, if John is not lying"


"Someone ... who ..."
"Someone is outside who wants to see you"


"... for ... to ..."
"I need for you to come to my party"

The above are not currently supported. An example that is supported is the "non-referential it", eg

 "It ... that ..."
"It seemed likely that John would go"

The above is supported by means of special disjuncts for 'it' and 'that', which must occur in the same post-processing domain.

See also:
http://www.phon.ucl.ac.uk/home/dick/enc2010/articles/extraposition.htm
http://www.phon.ucl.ac.uk/home/dick/enc2010/articles/relative-clause.htm

"...from X and from Y" "By X, and by Y, ..." Here, X and Y might be rather long phrases, containing other prepositions. In this case, the usual link-grammar linkage rules will typically conjoin "and from Y" to some preposition in X, instead of the correct link to "from X". Although adding a cost to keep the lengths of X and Y approximately equal can help, it would be even better to recognize the "...from ... and from..." pattern.

The correct solution for the "Either ... or ..." appears to be this:

 ---------------------------+---SJrs--+
       +------???----------+         |
       |     +Ds**c+--SJls-+    +Ds**+
       |     |     |       |    |    |
   either.r the lorry.n or.j-n the van.n

The wrong solution is

 --------------------------+
     +-----Dn-----+       +---SJrs---+
     |      +Ds**c+--SJn--+     +Ds**+
     |      |     |       |     |    |
 neither.j the lorry.n nor.j-n the van.n

The problem with this is that "neither" must coordinate with "nor". That is, one cannot say "either.. nor..." "neither ... or ... " "neither ...and..." "but ... nor ..." The way I originally solved the coordination problem was to invent a new link called Dn, and a link SJn and to make sure that Dn could only connect to SJn, and nothing else. Thus, the lower-case "n" was used to propagate the coordination across two links. This demonstrates how powerful the link-grammar theory is: with proper subscripts, constraints can be propagated along links over large distances. However, this also makes the dictionary more complex, and the rules harder to write: coordination requires a lot of different links to be hooked together. And so I think that creating a single, new link, called ???, will make the coordination easy and direct. That is why I like that idea.
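The constraint itself (which opener may pair with which conjunction) can be stated in a few lines of Python. This is not how link-grammar enforces it: the dictionary does so through connector subscripts, as described above. The table `CORRELATIVES` and the function `coordination_ok` are assumptions for this sketch, which checks only the word pairing and ignores all other syntax.

```python
# Assumed table of correlative pairs, taken from the examples in this section.
CORRELATIVES = {"either": "or", "neither": "nor", "both": "and"}

def coordination_ok(words):
    """Check that each opener is closed by its own conjunction.

    An opener ("either", "neither", "both") sets an expectation; the next
    conjunction from the table must be the matching one. Conjunctions that
    appear without a pending opener are accepted as ordinary coordination.
    """
    conjunctions = set(CORRELATIVES.values())
    expected = None
    for word in (w.lower() for w in words):
        if word in CORRELATIVES:
            expected = CORRELATIVES[word]
        elif word in conjunctions and expected is not None:
            if word != expected:
                return False
            expected = None
    # An opener that never finds its conjunction is also rejected.
    return expected is None
```

For example, `coordination_ok("neither the lorry nor the van".split())` is accepted while `coordination_ok("neither the lorry or the van".split())` is rejected. A single direct link between the opener and the conjunction expresses the same pairing without threading a subscript through the intermediate links.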

The ??? link should be the XJ link (which see).

More idiomatic than the above examples: "...the chip on X's shoulder" "to do X a favour" "to give X a look"

The above are all examples of "set phrases" or "phrasemes", and are most commonly discussed in the context of MTT or Meaning-Text Theory of Igor Mel'cuk et al (search for "MTT Lexical Function" for more info). Mel'cuk treats set phrases as lexemes, and, for parsing, this is not directly relevant. However, insofar as phrasemes have a high mutual information content, they can dominate the syntactic structure of a sentence.

Preposition linking:

The current parse of "he wanted to look at and listen to everything." is inadequate: the link to "everything" needs to connect to "and", so that "listen to" and "look at" are treated as atomic verb phrases.

Lexical functions:

MTT suggests that perhaps the correct way to understand the contents of the post-processing rules is as an implementation of 'lexical functions' projected onto syntax. That is, the post-processing rules allow only certain syntactical constructions, and these are the kinds of constructions one typically sees in certain kinds of lexical functions.

Alternately, link-grammar suffers from a combinatoric explosion of possible parses of a given sentence. It would seem that lexical functions could be used to rule out many of these parses. On the other hand, the results are likely to be similar to that of statistical parse ranking (which presumably captures such quasi-idiomatic collocations at least weakly).

Ref. I. Mel'cuk: "Collocations and Lexical Functions", in ''Phraseology: theory, analysis, and applications'' Ed. Anthony Paul Cowie (1998) Oxford University Press pp. 23-54.

More generally, all of link-grammar could benefit from a MTT-izing of infrastructure.

Morphology:

Compare the above commentary on lexical functions to Hebrew morphological analysis. To quote Wikipedia:

This distinction between the word as a unit of speech and the root as a unit of meaning is even more important in the case of languages where roots have many different forms when used in actual words, as is the case in Semitic languages. In these, roots are formed by consonants alone, and different words (belonging to different parts of speech) are derived from the same root by inserting vowels. For example, in Hebrew, the root gdl represents the idea of largeness, and from it we have gadol and gdola (masculine and feminine forms of the adjective "big"), gadal "he grew", higdil "he magnified" and magdelet "magnifier", along with many other words such as godel "size" and migdal "tower".

Morphology printing:

Instead of hard-coding LL, declare which links are morpho links in the dict.

Assorted minor cleanup:

Should provide a query that returns compile-time consts, eg the max number of characters in a word, or max words in a sentence.
Should remove compile-time constants, eg max words, max length etc.

Version 6.0 TODO list:

Version 6.0 will change Sentence to Sentence*, Linkage to Linkage* in the API. But perhaps this is a bad idea...
