FastMorph V5

A fast corpus search engine, originally made for the written Tatar language.

You can try it here.

The source code is available at https://github.com/mansayk/fastmorph.

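Features

Advanced search options based on any combination of different search parameters:
- Word form
- Lemma
- Morphological tags
- Pattern matching (currently the "*" and "?" masks)
- Case sensitivity
- Distance to the next word
It receives search queries in JSON format.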
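
The "*" and "?" masks behave like ordinary glob patterns: "*" stands for any number of characters and "?" for exactly one. The Python sketch below only illustrates that reading and is not FastMorph's own matcher; the function names are hypothetical, and it assumes that "*" may also match zero characters.

```python
import re


def mask_to_regex(mask, case_sensitive=False):
    """Translate a "*"/"?" mask into a compiled regular expression."""
    parts = []
    for ch in mask:
        if ch == "*":
            parts.append(".*")           # any number of characters (assumed: zero or more)
        elif ch == "?":
            parts.append(".")            # exactly one character
        else:
            parts.append(re.escape(ch))  # every other character matches itself
    flags = re.DOTALL | (0 if case_sensitive else re.IGNORECASE)
    return re.compile("".join(parts), flags)


def mask_matches(mask, word, case_sensitive=False):
    """Return True if the whole word fits the mask."""
    return mask_to_regex(mask, case_sensitive).fullmatch(word) is not None


print(mask_matches("ке*", "кеше"))                       # mask anchored at the word start
print(mask_matches("ке*", "Кеше", case_sensitive=True))  # case now matters
```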

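Some speed tests

Tests were performed on a machine with the following characteristics:

CPU: AMD FX-4100 Quad-Core Processor
RAM: 16 GB
OS: CentOS release 6.8 (Final)
FastMorph: compiled with 4-thread support, x64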
Corpus size: 116 mln word occurrences (140 mln tokens)
Full sentences returned with sources: 100

Results of testing different types of queries:

 Query:
   Word 1: китап
Number of occurrences: 32209
Query processing time: 0,4 sec.

 Query:
   Word 1 (case sensitive, distance to the next word up to 3 words): Китап
   Word 2 (if in brackets, then it is lemma): (бир)
Number of occurrences: 15
Query processing time: 0,4 sec.

 Quite heavy query:
   Word 1 (word begins with "б" letter, distance range to the next word is from 1 to 10): б*
   Word 2 (pronoun, word ends with "ң", distance range to the next word is from 1 to 10): <prn>*ң
   Word 3 (lemma "кил", word ends with "р"): (кил)*р
Number of occurrences: 135210
Query processing time: 0,8 sec.

 Very heavy query:
   Word 1 (word ends with "ы", distance range to the next word is from 1 to 100): *ы
   Word 2 (word ends with "а", distance range to the next word is from 1 to 100): *а
   Word 3 (word ends with "м", distance range to the next word is from 1 to 100): *м
   Word 4 (word ends with "с", distance range to the next word is from 1 to 100): *с
   Word 5 (word ends with "ь", distance range to the next word is from 1 to 100): *ь
   Word 6 (word ends with "е"): *е
Number of occurrences: 135210
Query processing time: 1,4 sec.

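System requirements

OS: tested on different Linux x86-64 distributions.
RAM: about 800 MB for a corpus of 100 mln words.
CPU: a 64-bit multi-core processor is recommended because of multithreading support.
MySQL: the program loads all data from a MySQL database.
An OS with UNIX domain socket support.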

Dependencies for compiling

JSMN is a JSON parser in C.
MySQL C API is a C-based API that client applications written in C use to communicate with MySQL Server.

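Usage

You can try it here. There are different search examples in our corpus manual. If you have any questions about using FastMorph in your project, please contact us at [email protected].
Also, we ask you to let us know where you use this search engine and, if you don't mind, we will publish links to those projects here.

License

This software is distributed under the GNU General Public License v3.0.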

JSON

Search query:

 Schematic view: {<adj>}(0) 1-5 {ке*<n>}(1) 1-1 {(кил)}(0) 1-1 {}(0) 1-1 {}(0) 1-1 {}(0)  
Detailed:  
   Word 1 (distance range to the next word is from 1 to 5, adjective): <adj>  
   Word 2 (case sensitive, begins with "ке", noun): ке*<n>  
   Word 3 (lemma "кил"): (кил)  
   Word 4:  
   Word 5:  
   Word 6:

Input format

 {  
  "word": [  
    "",  
    "",  
    "",  
    "",
    "",
    ""  
  ],  
  "lemma": [  
    "",  
    "",  
    "кил",  
    "",
    "",
    ""  
  ],  
  "tags": [  
    "<adj>",  
    "<n>",  
    "",  
    "",  
    "",
    ""  
  ],  
  "wildmatch": [  
    "",  
    "ке*",  
    "",  
    "",
    "",
    ""  
  ],  
  "case": [  
    0,  
    1,  
    0, 
    0,
    0,
    0  
  ],  
  "dist_from": [  
    1,  
    1,  
    1, 
    1,
    1  
  ],  
  "dist_to": [  
    5,  
    1,  
    1,
    1,
    1  
  ],  
  "return": 100,  
  "last_pos": "0"  
}

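"return" - the maximum number of sentences to return.
"last_pos" - "0" for the first query; to get the next list of sentences, send back the "last_pos" string returned in the previous response.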
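
The input format stores one value per word position in each parallel array, and the distance arrays have one entry fewer than the others because they describe the gap to the next word. The Python sketch below assembles such a query from per-word settings. `build_query` and `SLOTS` are hypothetical names; padding to six slots follows the example above, and this document does not say whether the engine requires it.

```python
import json

TEXT_FIELDS = ("word", "lemma", "tags", "wildmatch")
SLOTS = 6  # assumption: six word positions, as in the example above


def build_query(words, max_sentences=100, last_pos="0"):
    """Build a query dict from a list of per-word dicts (one per position)."""
    query = {name: [w.get(name, "") for w in words] for name in TEXT_FIELDS}
    query["case"] = [1 if w.get("case") else 0 for w in words]
    # Distances describe the gap to the *next* word, so the last word has none.
    query["dist_from"] = [w.get("dist_from", 1) for w in words[:-1]]
    query["dist_to"] = [w.get("dist_to", 1) for w in words[:-1]]
    query["return"] = max_sentences
    query["last_pos"] = last_pos
    return query


words = [
    {"tags": "<adj>", "dist_from": 1, "dist_to": 5},
    {"tags": "<n>", "wildmatch": "ке*", "case": True},
    {"lemma": "кил"},
]
words += [{} for _ in range(SLOTS - len(words))]  # pad the unused positions

query = build_query(words)
payload = json.dumps(query, ensure_ascii=False)
print(payload)
```

To request the next page, the same query can be rebuilt with `last_pos` set to the string returned in the previous response.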

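Warning! Before passing input data to FastMorph, you should normalize and validate it:

Remove all disallowed symbols;
Check string lengths, correctness of numbers, etc.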
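
One way to do such checks before the query reaches the engine is sketched below in Python. Every limit and allowed character set here is an assumption chosen for illustration; FastMorph's real constraints are not stated in this document, so adjust them to your corpus and tag set.

```python
import re

# All limits and character sets below are assumptions, not FastMorph's rules.
MAX_LEN = 64
MAX_DISTANCE = 100
MAX_RETURN = 1000
PATTERNS = {
    "word": re.compile(r"[\w-]*"),
    "lemma": re.compile(r"[\w-]*"),
    "wildmatch": re.compile(r"[\w*?-]*"),
    "tags": re.compile(r"(?:<\w+>)*"),
}
LAST_POS = re.compile(r"\d+(?:x\d+)*")  # shape seen in the examples: "0", "1x2x3"


def validate_query(query):
    """Return a list of problems found in a query dict (empty if none)."""
    errors = []
    n = len(query.get("word", []))
    for name, pattern in PATTERNS.items():
        values = query.get(name, [])
        if len(values) != n:
            errors.append(f"{name}: expected {n} items")
        for v in values:
            if not isinstance(v, str) or len(v) > MAX_LEN or not pattern.fullmatch(v):
                errors.append(f"{name}: bad value {v!r}")
    case = query.get("case", [])
    if len(case) != n or any(c not in (0, 1) for c in case):
        errors.append("case: expected 0/1 flags, one per word")
    lo, hi = query.get("dist_from", []), query.get("dist_to", [])
    if len(lo) != len(hi) or len(lo) != max(n - 1, 0):
        errors.append("dist_from/dist_to: expected one pair per gap")
    for a, b in zip(lo, hi):
        if not (isinstance(a, int) and isinstance(b, int) and 1 <= a <= b <= MAX_DISTANCE):
            errors.append(f"distance range {a!r}-{b!r} is invalid")
    if not isinstance(query.get("return"), int) or not 1 <= query["return"] <= MAX_RETURN:
        errors.append("return: expected an integer in the allowed range")
    if not LAST_POS.fullmatch(str(query.get("last_pos", ""))):
        errors.append("last_pos: unexpected format")
    return errors
```

Returning a list of problems rather than raising on the first one makes it easier to report everything wrong with a user's query at once.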

Output format

 {  
  "example": [  
    {  
      "id": 15853,  
      "source": "\"2013 Универсиадасы блогы\" (web-сайт)",  
      "source_type": "kazan2013.ru",  
      "sentence": "Универсиада кебек зур проектның бер өлеше булу өчен, Казанга Россиянең төрле  
        төбәкләреннән һәм Дөньяның  
        <span id='found_word_0' class='found_word' title='(төрле) <adj>'>төрле</span>  
        илләреннән бик күп  
        <span id='found_word_1' class='found_word' title='(кеше) <n>,<nom>,<sg>'>кеше</span>  
        <span id='found_word_2' class='found_word' title='(кил) <ifi>,<iv>,<p3>,<sg>,<v>'>килде</span>."  
    },  
    {  
      "id": -1  
    }  
  ],  
  "last_pos": "892447x39311905x75980782x114356633",  
  "found_all": 1359  
}

As you can see, each word that matches the search query is returned inside the following HTML tag:

 <span id='found_word_0' class='found_word' title='(LEMMA) <TAG1><TAG2>'>FOUND_WORD</span>

So, for example, you can highlight them using CSS.
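
If you need the matches as data rather than as markup, the spans can be extracted from the "sentence" string. The Python sketch below assumes the attribute order and quoting shown in the output example above; the function names are hypothetical. Note that the `title` attribute itself contains ">" characters, so a naive "strip everything between < and >" would break it.

```python
import re

# Assumes the exact span layout shown in the output example above.
SPAN = re.compile(
    r"<span id='found_word_(\d+)' class='found_word' "
    r"title='\(([^)]*)\) ([^']*)'>(.*?)</span>"
)


def found_words(sentence):
    """List every highlighted match with its index, lemma, tags and word form."""
    return [
        {
            "index": int(index),
            "lemma": lemma,
            "tags": re.findall(r"<([^>]+)>", tags),
            "word": word,
        }
        for index, lemma, tags, word in SPAN.findall(sentence)
    ]


def plain_text(sentence):
    """Remove the highlighting spans, keeping only the sentence text."""
    return SPAN.sub(lambda m: m.group(4), sentence)


sample = (
    "бик күп "
    "<span id='found_word_1' class='found_word' title='(кеше) <n>,<nom>,<sg>'>кеше</span> "
    "<span id='found_word_2' class='found_word' title='(кил) <ifi>,<iv>,<p3>,<sg>,<v>'>килде</span>."
)
for match in found_words(sample):
    print(match)
print(plain_text(sample))
```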

MySQL database format

You can find CREATE TABLE examples here.

mysql> SELECT * FROM morph6_main_apertium LIMIT 10;

id	united	sentence	source
0	1594501	1	1
1	761564	1	1
2	787834	1	1
3	1505641	1	1
4	420024	1	1
5	764201	1	1
6	1003674	1	1
7	1003851	1	1
8	764201	1	1
9	1057551	1	1

mysql> SELECT * FROM morph6_united_apertium WHERE id >= 100 LIMIT 10;

id	freq	word_case	word	lemma	tags
100	1	1000084	599888	429156	2
101	60	1000085	599890	429158	2
102	5	1000086	599891	429159	2
103	2	1000087	599892	429160	2
104	1	1000088	599893	429161	2
105	10	1000089	599894	429162	2
106	1	100008	164606	119768	2
107	1	1000090	599895	429163	2
108	5	1000091	599899	429167	2
109	1	1000092	599901	429169	2

mysql> SELECT * FROM morph6_words_case_apertium WHERE id > 200000 LIMIT 10;

id	freq	word_case
200001	4	极
200002	1	极
200003	3	极
200004	290	极
200005	14	极
200006	1	极
200007	79	极
200008	1	极
200009	1	极
200010	1	极

mysql> SELECT * FROM morph6_words_apertium WHERE id > 100000 LIMIT 10;

id	freq	word
100001	975	o
100002	7	ouluph
100003	74	oul
100004	1	oul，枢
100005	1	oul ulecter
100006	8	oul ulecision
100007	1	oul
100008	1	oul
100009	1408	o
100010	3	ouluph

mysql> SELECT * FROM morph6_lemmas_apertium WHERE id > 300000 LIMIT 10;

id	freq	lemma
300001	1	极
300002	130	极
300003	8	极
300004	2	极
300005	3	极
300006	9	极
300007	2	极
300008	2	极
300009	1	极
300010	12	极

mysql> SELECT * FROM morph6_tags_apertium WHERE id > 11100 LIMIT 10;

id	freq	combination
11101	4	<ant>,<dat>,<f>,<frm>,<np>,<px2sg>
11102	17141	<ant>,<dat>,<f>,<np>
11103	387	<ant>,<dat>,<f>,<np>,<pl>
11104	1	<ant>,<dat>,<f>,<np>,<pl>,<px1pl>
11105	1	<ant>,<dat>,<f>,<np>,<pl>,<px1sg>
11106	12	<ant>,<dat>,<f>,<np>,<pl>,<px3sp>
11107	1	<ant>,<dat>,<f>,<np>,<pl>,<px>
11108	40	<ant>,<dat>,<f>,<np>,<px1pl>
11109	101	<ant>,<dat>,<f>,<np>,<px1sg>
11110	41	<ant>,<dat>,<f>,<np>,<px2sg>

mysql> SELECT * FROM sources WHERE col1 > 300 LIMIT 3;

Col1	Col2	Col3
301	"miras.belem.ru" (web-сайт)	miras.belem.ru
302	。 libe	book
303	дәdHisti	tatarstan.ru
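
The samples above suggest a normalized layout: each row of the main table is one token position pointing to a row of the united table, which in turn holds the ids of a case-preserved word form, a word form, a lemma and a tag combination. That reading is an inference from the sample output, not something this document states. The Python sketch below recreates it in an in-memory SQLite database with shortened table names, assumed column names and made-up toy rows, only to show how a sentence could be reassembled with joins.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Assumed schema, simplified from the sample output; all rows are toy data.
CREATE TABLE words_case (id INTEGER PRIMARY KEY, freq INTEGER, word_case TEXT);
CREATE TABLE words      (id INTEGER PRIMARY KEY, freq INTEGER, word TEXT);
CREATE TABLE lemmas     (id INTEGER PRIMARY KEY, freq INTEGER, lemma TEXT);
CREATE TABLE tags       (id INTEGER PRIMARY KEY, freq INTEGER, combination TEXT);
CREATE TABLE united     (id INTEGER PRIMARY KEY, freq INTEGER,
                         word_case INTEGER, word INTEGER, lemma INTEGER, tags INTEGER);
CREATE TABLE main       (id INTEGER PRIMARY KEY, united INTEGER,
                         sentence INTEGER, source INTEGER);

INSERT INTO words_case VALUES (1, 1, 'Кеше'), (2, 1, 'килде');
INSERT INTO words      VALUES (1, 1, 'кеше'), (2, 1, 'килде');
INSERT INTO lemmas     VALUES (1, 1, 'кеше'), (2, 1, 'кил');
INSERT INTO tags       VALUES (1, 1, '<n>,<nom>,<sg>'), (2, 1, '<ifi>,<iv>,<p3>,<sg>,<v>');
INSERT INTO united     VALUES (1, 1, 1, 1, 1, 1), (2, 1, 2, 2, 2, 2);
INSERT INTO main       VALUES (0, 1, 1, 1), (1, 2, 1, 1);
""")

# Resolve every token of sentence 1 back to readable strings.
rows = con.execute("""
    SELECT wc.word_case, w.word, l.lemma, t.combination
    FROM main AS m
    JOIN united     AS u  ON u.id  = m.united
    JOIN words_case AS wc ON wc.id = u.word_case
    JOIN words      AS w  ON w.id  = u.word
    JOIN lemmas     AS l  ON l.id  = u.lemma
    JOIN tags       AS t  ON t.id  = u.tags
    WHERE m.sentence = 1
    ORDER BY m.id
""").fetchall()

for row in rows:
    print(row)
```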

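Apertium

If you use Apertium's tagger to morphologically annotate your corpus, you can use our Python script to generate the tables from Apertium's output.

To use this converter, you should:

1) Annotate your corpus with Apertium's tagger: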

 cat bigfile.txt | apertium -n -d . tat-tagger | cg-proc dev/mansur.bin > bigfile_tagged.txt

mansur.bin is just a file with some additional rules. You can find it here.
As a result, you should get a file containing annotated sentences:

 ^Мин/Мин<prn><pers><p1><sg><nom>$ ^үземне/үз<prn><ref><px1sg><acc>$ ^белә/бел<v><tv><prc_impf>$ ^башлаганнан/башла<vaux><ger_past><abl>$ ^бирле/бирле<post>$ ^түбән/түбән<adj>$ ^очка/оч<n><sg><dat>$ – ^ерак/ерак<adj>$ ^бабакайларга/бабакай<n><pl><dat>$ ^төшәргә/төш<v><iv><inf>$ ^ярата/ярат<v><tv><pres><p3><sg>$ ^идем/и<cop><ifi><p1><sg>$^./.<sent>$

2) Make the "INV_SO" text file in the following format:

Sentence ID	Source ID
1	1
2	1
3	1
4	1
5	1
6	1
7	1
8	1
9	1
10	1

and put it in the same directory as the script.
3) Run the Python script this way:

 ./tat-tagger_to_ntables_v6.24.py tatcorpus2.sentences.apertium.tagged.txt

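Depending on the size of your corpus, this may take a long time.
4) If everything went well, you should get a list of new files that need to be imported into the MySQL database: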

 tatcorpus2.sentences.apertium.tagged.txt.lemmas.output.txt
tatcorpus2.sentences.apertium.tagged.txt.main.output.txt
tatcorpus2.sentences.apertium.tagged.txt.tags-uniq.output.txt.sorted.txt
tatcorpus2.sentences.apertium.tagged.txt.tags.output.txt
tatcorpus2.sentences.apertium.tagged.txt.united.output.txt
tatcorpus2.sentences.apertium.tagged.txt.words.output.txt
tatcorpus2.sentences.apertium.tagged.txt.words_case.output.txt
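
To inspect the tagged file from step 1 before running the converter, the annotated sentences can be split into (surface form, lemma, tags) triples. The Python sketch below is not the project's converter script. It assumes each `^...$` unit carries a single analysis, as in the sample sentence above, and silently skips anything outside such units.

```python
import re

# One lexical unit: ^surface/lemma<tag1><tag2>...$  (single analysis assumed)
UNIT = re.compile(r"\^([^/^$]+)/([^<$/]+)((?:<[^>]+>)*)\$")


def parse_line(line):
    """Return a list of (surface, lemma, [tags]) tuples for one tagged sentence."""
    tokens = []
    for surface, lemma, tags in UNIT.findall(line):
        tokens.append((surface, lemma, re.findall(r"<([^>]+)>", tags)))
    return tokens


line = (
    "^Мин/Мин<prn><pers><p1><sg><nom>$ ^бирле/бирле<post>$ – "
    "^идем/и<cop><ifi><p1><sg>$^./.<sent>$"
)
for token in parse_line(line):
    print(token)
```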

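Changelog:

27.02.2017 - The fifth version of the FastMorph corpus search engine has been released. It now consumes about 2.5 times less memory (RAM).

18.11.2016 - The fourth version of the FastMorph corpus search engine. List of changes:

Added a case-sensitive search option;
Memory (RAM) usage of the search system has been halved;
Thanks to fundamental changes in the application architecture, search queries now run 3-5 times faster. Technical information: version 4 consumes about 2 GB of RAM for the same corpus.

19.07.2016 - Some improvements to the complex morphological search engine "FastMorph":

In addition to the existing mask "*", which stands for any number of symbols, the mask "?" was added, which stands for any single character. You can find more information about it in the updated guide;
On the technical side, memory usage of the search system was reduced by up to 25%. Technical information: version 3 consumes about 4 GB of RAM for the same corpus.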

13.06.2016 - Searching by the middle part of a word was added to the FastMorph module. For example, if you type *ә执 * *, words like oudibibimәliby will be found, similar to *fimby.

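21.04.2016 - Thanks to some processor optimizations and multithreading support implemented in the "FastMorph" module, complex morphological search is now five times faster.

03.04.2016 - The features of the complex morphological search system were significantly expanded. You can find more information about them in the guide, version 3.0 and later.

22.02.2016 - A complex morphological search feature has appeared in the Corpus of Written Tatar, where you can use different combinations of parameters such as word form, lemma, grammatical tags, the beginning and the end of a word, and the distance between them. Technical information: version 1 consumes about 6 GB of RAM for a corpus of 116 mln words. It is very fast.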
