fastmorph

FastMorph V5

A fast corpus search engine, originally made for the corpus of the written Tatar language.

You can try it here.

The source code is available at https://github.com/mansayk/fastmorph.

Features

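Advanced search options based on any combination of different search parameters:
- wordform
- lemma
- morphological tags
- pattern matching (currently the "*" and "?" masks)
- case matching
- distance to the next word
It receives search queries in JSON format.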
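
To illustrate what the "*" and "?" masks mean, here is a minimal Python sketch that turns a mask into a regular expression. It is not FastMorph's own C implementation; the helper names `mask_to_regex` and `mask_matches` are hypothetical, and the sketch assumes that "*" may also match an empty run of characters.

```python
import re

def mask_to_regex(mask):
    """Translate a mask with "*" and "?" into a compiled regular expression."""
    parts = []
    for ch in mask:
        if ch == "*":
            parts.append(".*")           # assumption: any run of characters, possibly empty
        elif ch == "?":
            parts.append(".")            # exactly one character
        else:
            parts.append(re.escape(ch))  # everything else is matched literally
    return re.compile("".join(parts), re.DOTALL)

def mask_matches(mask, word):
    """Return True if the whole word matches the mask."""
    return mask_to_regex(mask).fullmatch(word) is not None
```

For example, `mask_matches("ке*", "кеше")` is true, while `mask_matches("кил?", "килде")` is false because "?" stands for exactly one character.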

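Some speed tests

The tests were performed on a machine with the following characteristics:

CPU: AMD FX-4100 Quad-Core Processor
RAM: 16 GB
OS: CentOS release 6.8 (Final)
FastMorph: compiled with support for 4 threads, x64
Corpus size: 116 mln word occurrences (140 mln tokens)
Full sentences returned with sources: 100

Results of testing different types of queries: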

 Query:
   Word 1: китап
Number of occurrences: 32209
Query processing time: 0,4 sec.

 Query:
   Word 1 (case sensitive, distance to the next word up to 3 words): Китап
   Word 2 (if in brackets, then it is lemma): (бир)
Number of occurrences: 15
Query processing time: 0,4 sec.

 Quite heavy query:
   Word 1 (word begins with "б" letter, distance range to the next word is from 1 to 10): б*
   Word 2 (pronoun, word ends with "ң", distance range to the next word is from 1 to 10): <prn>*ң
   Word 3 (lemma "кил", word ends with "р"): (кил)*р
Number of occurrences: 135210
Query processing time: 0,8 sec.

 Very heavy query:
   Word 1 (word ends with "ы", distance range to the next word is from 1 to 100): *ы
   Word 2 (word ends with "а", distance range to the next word is from 1 to 100): *а
   Word 3 (word ends with "м", distance range to the next word is from 1 to 100): *м
   Word 4 (word ends with "с", distance range to the next word is from 1 to 100): *с
   Word 5 (word ends with "ь", distance range to the next word is from 1 to 100): *ь
   Word 6 (word ends with "е"): *е
Number of occurrences: 135210
Query processing time: 1,4 sec.

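System requirements

OS: tested on different Linux x86-64 distributions.
RAM: about 800 MB for a corpus of 100 mln words.
CPU: a 64-bit multi-core processor is recommended because of multithreading support.
MySQL: the program loads all of its data from a MySQL database.
An OS with UNIX domain socket support.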

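Dependencies for compilation

JSMN is a minimalistic JSON parser in C.
MySQL C API is a C-based API that client applications written in C can use to communicate with MySQL Server.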

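Usage

You can try it here. There are different search examples in our corpus manual. If you have any questions about using FastMorph in your projects, please contact us at [email protected].
Also, we ask you to let us know where you use this search engine and, if you do not mind, we will publish links to those projects here.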

License

This software is distributed under the GNU General Public License v3.0.

JSON

Search query:

 Schematical view: {<adj>}(0) 1-5 {ке*<n>}(1) 1-1 {(кил)}(0) 1-1 {}(0) 1-1 {}(0) 1-1 {}(0)  
Detailed:  
   Word 1 (distance range to the next word is from 1 to 5, adjective): <adj>  
   Word 2 (case sensitive, begins with "ке", noun): ке*<n>  
   Word 3 (lemma "кил"): (кил)  
   Word 4:  
   Word 5:  
   Word 6:
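
The distance ranges in the schematic view can be read as constraints on word positions inside a sentence. The brute-force Python sketch below shows that reading for the first three words of the query; it is only an illustration, not FastMorph's algorithm. The function `find_matches` is hypothetical, and it assumes that a distance of 1 means "the immediately following word".

```python
def find_matches(tokens, predicates, ranges):
    """Yield tuples of token positions that satisfy every predicate and distance range.

    tokens     -- list of words in one sentence
    predicates -- one test function per query word
    ranges     -- one (dist_from, dist_to) pair per gap between query words
    """
    def extend(prefix):
        k = len(prefix)
        if k == len(predicates):
            yield tuple(prefix)
            return
        if k == 0:
            candidates = range(len(tokens))
        else:
            lo, hi = ranges[k - 1]
            last = min(prefix[-1] + hi, len(tokens) - 1)
            candidates = range(prefix[-1] + lo, last + 1)
        for pos in candidates:
            if predicates[k](tokens[pos]):
                yield from extend(prefix + [pos])

    yield from extend([])

# Stand-ins for "<adj>", "ке*<n>" and "(кил)": real matching would use tags and lemmas.
tokens = "төрле илләреннән бик күп кеше килде".split()
predicates = [
    lambda t: t == "төрле",
    lambda t: t.startswith("ке"),
    lambda t: t.startswith("кил"),
]
matches = list(find_matches(tokens, predicates, [(1, 5), (1, 1)]))
```

Here the second word may be up to five positions after the first, and the third must directly follow the second.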

Input format

 {  
  "word": [  
    "",  
    "",  
    "",  
    "",
    "",
    ""  
  ],  
  "lemma": [  
    "",  
    "",  
    "кил",  
    "",
    "",
    ""  
  ],  
  "tags": [  
    "<adj>",  
    "<n>",  
    "",  
    "",  
    "",
    ""  
  ],  
  "wildmatch": [  
    "",  
    "ке*",  
    "",  
    "",
    "",
    ""  
  ],  
  "case": [  
    0,  
    1,  
    0, 
    0,
    0,
    0  
  ],  
  "dist_from": [  
    1,  
    1,  
    1, 
    1,
    1  
  ],  
  "dist_to": [  
    5,  
    1,  
    1,
    1,
    1  
  ],  
  "return": 100,  
  "last_pos": "0"  
}

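"return" - the maximum number of sentences to return.
"last_pos" - "0" for the first query; to get the next list of sentences, pass back this string exactly as it came in the previous output.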
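
The input format is a set of parallel arrays, which is easy to get wrong by hand. A minimal Python sketch of building it from per-word descriptions follows. The helpers `build_query` and `next_page` are hypothetical names, not part of FastMorph, and the six-slot padding simply mirrors the example above; whether the number of slots is fixed is not stated here.

```python
import json

FIELDS = ("word", "lemma", "tags", "wildmatch")

def build_query(words, distances, max_sentences=100, last_pos="0"):
    """Build the parallel-array query from per-word descriptions.

    words     -- list of dicts with optional keys word, lemma, tags, wildmatch, case
    distances -- list of (dist_from, dist_to) pairs, one per gap between words
    """
    if len(distances) != len(words) - 1:
        raise ValueError("need exactly one distance range per gap between words")
    query = {name: [w.get(name, "") for w in words] for name in FIELDS}
    query["case"] = [int(bool(w.get("case", 0))) for w in words]
    query["dist_from"] = [lo for lo, _ in distances]
    query["dist_to"] = [hi for _, hi in distances]
    query["return"] = max_sentences
    query["last_pos"] = last_pos
    return query

def next_page(query, response):
    """Copy the query, carrying over "last_pos" from the previous response."""
    return {**query, "last_pos": response["last_pos"]}

# The example query from this section, padded to six slots as above.
words = [
    {"tags": "<adj>"},
    {"tags": "<n>", "wildmatch": "ке*", "case": 1},
    {"lemma": "кил"},
    {}, {}, {},
]
distances = [(1, 5), (1, 1), (1, 1), (1, 1), (1, 1)]
query = build_query(words, distances)
payload = json.dumps(query, ensure_ascii=False)
```

`payload` is the JSON text to send; `next_page(query, response)` produces the follow-up query once a response with its own "last_pos" has been received.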

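Warning! You should normalize and validate input data before passing it to FastMorph:

remove all disallowed symbols;
check string lengths, correctness of numbers, etc.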
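
One possible shape for such a check is sketched below in Python. Everything specific in it is an assumption made for illustration: the function name `validate_query`, the length limit `MAX_LEN`, the allow-list of characters, the rule that distances start at 1, and the "last_pos" pattern (inferred from the examples in this document). Adjust them to what your deployment actually allows.

```python
import re

MAX_LEN = 64  # hypothetical limit on the length of a single field
# Hypothetical allow-list: letters, digits, underscore, the two masks, tag punctuation.
ALLOWED = re.compile(r"[\w*?<>,\-]*")
STRING_FIELDS = ("word", "lemma", "tags", "wildmatch")

def validate_query(query):
    """Return a list of problems found in a query dict; an empty list means it passed."""
    problems = []
    n = len(query.get("word", []))
    for name in STRING_FIELDS:
        values = query.get(name)
        if not isinstance(values, list) or len(values) != n:
            problems.append(f"{name}: expected a list of {n} strings")
            continue
        for i, value in enumerate(values):
            if not isinstance(value, str) or len(value) > MAX_LEN:
                problems.append(f"{name}[{i}]: not a string or too long")
            elif not ALLOWED.fullmatch(value):
                problems.append(f"{name}[{i}]: contains disallowed symbols")
    case = query.get("case")
    if not isinstance(case, list) or len(case) != n or any(c not in (0, 1) for c in case):
        problems.append("case: expected 0 or 1 for every word")
    lo, hi = query.get("dist_from"), query.get("dist_to")
    if not (isinstance(lo, list) and isinstance(hi, list)
            and len(lo) == len(hi) == max(n - 1, 0)):
        problems.append("dist_from/dist_to: expected one range per gap between words")
    elif any(type(a) is not int or type(b) is not int or not 1 <= a <= b
             for a, b in zip(lo, hi)):
        problems.append("dist_from/dist_to: ranges must be integers with 1 <= from <= to")
    if type(query.get("return")) is not int or query["return"] < 1:
        problems.append("return: expected a positive integer")
    if not re.fullmatch(r"\d+(x\d+)*", str(query.get("last_pos", ""))):
        problems.append("last_pos: unexpected format")
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to report every issue to the caller at once.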

Output format

 {  
  "example": [  
    {  
      "id": 15853,  
      "source": "\"2013 Универсиадасы блогы\" (web-сайт)",  
      "source_type": "kazan2013.ru",  
      "sentence": "Универсиада кебек зур проектның бер өлеше булу өчен, Казанга Россиянең төрле  
        төбәкләреннән һәм Дөньяның  
        <span id='found_word_0' class='found_word' title='(төрле) <adj>'>төрле</span>  
        илләреннән бик күп  
        <span id='found_word_1' class='found_word' title='(кеше) <n>,<nom>,<sg>'>кеше</span>  
        <span id='found_word_2' class='found_word' title='(кил) <ifi>,<iv>,<p3>,<sg>,<v>'>килде</span>."  
    },  
    {  
      "id": -1  
    }  
  ],  
  "last_pos": "892447x39311905x75980782x114356633",  
  "found_all": 1359  
}

As you can see, each word that matches the search query is returned inside the following HTML tag:

 <span id='found_word_0' class='found_word' title='(LEMMA) <TAG1><TAG2>'>FOUND_WORD</span>

So you can, for example, highlight them using CSS.
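
If you need the matched words as data rather than as markup, the spans can be read back with Python's standard `html.parser` module. This is a minimal sketch: the class name `FoundWordParser` is hypothetical, and it assumes the `title` attribute always has the form `(lemma) <tag>,<tag>` with comma-separated tags, as in the example output above.

```python
from html.parser import HTMLParser

class FoundWordParser(HTMLParser):
    """Collect index, word, lemma and tags for every found_word span in a sentence."""

    def __init__(self):
        super().__init__()
        self.found = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and attrs.get("class") == "found_word":
            # title looks like "(lemma) <tag1>,<tag2>"
            lemma, _, tags = attrs.get("title", "").partition(") ")
            self._current = {
                "index": int(attrs["id"].rsplit("_", 1)[1]),
                "lemma": lemma.lstrip("("),
                "tags": tags.split(",") if tags else [],
                "word": "",
            }

    def handle_data(self, data):
        # Text inside a found_word span is the matched word itself.
        if self._current is not None:
            self._current["word"] += data

    def handle_endtag(self, tag):
        if tag == "span" and self._current is not None:
            self.found.append(self._current)
            self._current = None
```

Feed the "sentence" string of one result to `FoundWordParser().feed(...)`, call `close()`, and read the `found` list.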

MySQL database format

You can find CREATE TABLE examples here.

mysql> SELECT * FROM morph6_main_apertium LIMIT 10;

id	united	sentence	source
0	1594501	1	1
1	761564	1	1
2	787834	1	1
3	1505641	1	1
4	420024	1	1
5	764201	1	1
6	1003674	1	1
7	1003851	1	1
8	764201	1	1
9	1057551	1	1

mysql> SELECT * FROM morph6_united_apertium WHERE id >= 100 LIMIT 10;

id	freq	word_case	word	lemma	tags
100	1	1000084	599888	429156	2
101	60	1000085	599890	429158	2
102	5	1000086	599891	429159	2
103	2	1000087	599892	429160	2
104	1	1000088	599893	429161	2
105	10	1000089	599894	429162	2
106	1	100008	164606	119768	2
107	1	1000090	599895	429163	2
108	5	1000091	599899	429167	2
109	1	1000092	599901	429169	2

mysql> SELECT * FROM morph6_words_case_apertium WHERE id > 200000 LIMIT 10;

id	freq	word_case
200001	4	極
200002	1	極
200003	3	極
200004	290	極
200005	14	極
200006	1	極
200007	79	極
200008	1	極
200009	1	極
200010	1	極

mysql> SELECT * FROM morph6_words_apertium WHERE id > 100000 LIMIT 10;

id	freq	word
100001	975	o
100002	7	ouluph
100003	74	oul
100004	1	oul，樞
100005	1	oul ulecter
100006	8	oul ulecision
100007	1	oul
100008	1	oul
100009	1408	o
100010	3	ouluph

mysql> SELECT * FROM morph6_lemmas_apertium WHERE id > 300000 LIMIT 10;

id	freq	lemma
300001	1	極
300002	130	極
300003	8	極
300004	2	極
300005	3	極
300006	9	極
300007	2	極
300008	2	極
300009	1	極
300010	12	極

mysql> SELECT * FROM morph6_tags_apertium WHERE id > 11100 LIMIT 10;

id	freq	combination
11101	4	<ant>,<dat>,<f>,<frm>,<np>,<px2sg>
11102	17141	<ant>,<dat>,<f>,<np>
11103	387	<ant>,<dat>,<f>,<np>,<pl>
11104	1	<ant>,<dat>,<f>,<np>,<pl>,<px1pl>
11105	1	<ant>,<dat>,<f>,<np>,<pl>,<px1sg>
11106	12	<ant>,<dat>,<f>,<np>,<pl>,<px3sp>
11107	1	<ant>,<dat>,<f>,<np>,<pl>,<px>
11108	40	<ant>,<dat>,<f>,<np>,<px1pl>
11109	101	<ant>,<dat>,<f>,<np>,<px1sg>
11110	41	<ant>,<dat>,<f>,<np>,<px2sg>

mysql> SELECT * FROM sources WHERE col1 > 300 LIMIT 3;

Col1	Col2	Col3
301	"miras.belem.ru" (web-сайт)	miras.belem.ru
302	。 libe	書
303	дәdHisti	tatarstan.ru

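Apertium

If you use Apertium's tagger to annotate your corpus morphologically, you can use our Python script to generate the tables from Apertium's output.

To use this converter you should:

1) Annotate your corpus with Apertium's tagger: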

 cat bigfile.txt | apertium -n -d . tat-tagger | cg-proc dev/mansur.bin > bigfile_tagged.txt

mansur.bin is just a file with some additional rules. You can find it here.
As a result, you should get a file containing annotated sentences:

 ^Мин/Мин<prn><pers><p1><sg><nom>$ ^үземне/үз<prn><ref><px1sg><acc>$ ^белә/бел<v><tv><prc_impf>$ ^башлаганнан/башла<vaux><ger_past><abl>$ ^бирле/бирле<post>$ ^түбән/түбән<adj>$ ^очка/оч<n><sg><dat>$ – ^ерак/ерак<adj>$ ^бабакайларга/бабакай<n><pl><dat>$ ^төшәргә/төш<v><iv><inf>$ ^ярата/ярат<v><tv><pres><p3><sg>$ ^идем/и<cop><ifi><p1><sg>$^./.<sent>$

2) Generate the "INV_SO" text file in the following format:

sentence id	source id
1	1
2	1
3	1
4	1
5	1
6	1
7	1
8	1
9	1
10	1

and put it in the same directory as the script.
3) Run the Python script this way:

 ./tat-tagger_to_ntables_v6.24.py tatcorpus2.sentences.apertium.tagged.txt

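Depending on the size of your corpus, this can take a long time.
4) If everything goes well, you should get a list of new files that need to be imported into the MySQL database: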

 tatcorpus2.sentences.apertium.tagged.txt.lemmas.output.txt
tatcorpus2.sentences.apertium.tagged.txt.main.output.txt
tatcorpus2.sentences.apertium.tagged.txt.tags-uniq.output.txt.sorted.txt
tatcorpus2.sentences.apertium.tagged.txt.tags.output.txt
tatcorpus2.sentences.apertium.tagged.txt.united.output.txt
tatcorpus2.sentences.apertium.tagged.txt.words.output.txt
tatcorpus2.sentences.apertium.tagged.txt.words_case.output.txt
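
Two parts of the procedure above can be sketched in Python: reading the tagged output from step 1 and writing the sentence-to-source file from step 2. This is not the converter script itself. The names `parse_tagged` and `write_inv_so` are hypothetical, the parser assumes exactly one analysis per `^...$` unit (as in the tagged sample), and the writer assumes tab-separated columns with no header row.

```python
import csv
import re

# One analysed unit: ^surface/lemma<tag1><tag2>...$
TOKEN = re.compile(r"\^([^/$]*)/([^<$]*)((?:<[^>]+>)*)\$")
TAG = re.compile(r"<([^>]+)>")

def parse_tagged(line):
    """Return (surface, lemma, [tags]) for every ^...$ unit in a tagged line."""
    return [
        (surface, lemma, TAG.findall(tags))
        for surface, lemma, tags in TOKEN.findall(line)
    ]

def write_inv_so(path, source_ids):
    """Write one "sentence id<TAB>source id" row per sentence, numbering from 1."""
    with open(path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for sentence_id, source_id in enumerate(source_ids, start=1):
            writer.writerow([sentence_id, source_id])
```

Text outside `^...$` units, such as a bare dash between words, is ignored by `parse_tagged`.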

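Changelog:

27.02.2017 - The fifth version of the FastMorph corpus search engine was released. It now consumes about 2.5 times less memory (RAM).

18.11.2016 - The fourth version of the FastMorph corpus search engine. List of changes:

added a case-sensitive search option;
memory (RAM) usage of the search system was halved;
thanks to fundamental changes in the application architecture, search queries are now executed 3-5 times faster. Technical information: version 4 consumes about 2 GB of RAM for the same corpus.

19.07.2016 - Some improvements to the complex morphological search engine "FastMorph":

in addition to the existing mask "*", which stands for any number of characters, the mask "?", which stands for any single character, was added. You can find more information about it in the updated guide;
on the technical side, memory usage of the search system was reduced by up to 25%. Technical information: version 3 consumes about 4 GB of RAM for the same corpus.

13.06.2016 - Search by the middle part of a word was added to the FastMorph module: a pattern with "*" on both sides of a fragment finds the words that contain that fragment.

21.04.2016 - Thanks to some processor optimizations and the multithreading support implemented in the "FastMorph" module, complex morphological search is now five times faster.

03.04.2016 - The features of the complex morphological search system were significantly expanded. You can get more information about them in the guide, version 3.0 and higher.

22.02.2016 - Complex morphological search appeared in the corpus of written Tatar. You can use different combinations of parameters such as wordform, lemma, grammatical tags, the beginning and end of a word, and the distance between words. Technical information: version 1 consumes about 6 GB of RAM for a corpus of 116 mln words. It is very fast.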
