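`tongrams-rs`: Tons of N-grams in Rust

This is a Rust port of tongrams for indexing and querying large language models in compressed space. The data structures are presented in the following papers: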

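Giulio Ermanno Pibiri and Rossano Venturini, Efficient Data Structures for Massive N-Gram Datasets. In Proceedings of the 40th ACM Conference on Research and Development in Information Retrieval (SIGIR 2017), pp. 615-624.
Giulio Ermanno Pibiri and Rossano Venturini, Handling Massive N-Gram Datasets Efficiently. ACM Transactions on Information Systems (TOIS), 37.2 (2019): 1-41.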

What it can do

Store N-gram language models with frequency counts.
Look up N-grams to get their frequency counts.

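Features

Compressed language model. tongrams-rs can store large N-gram language models in very compressed space. For example, the word N-gram datasets (N = 1..5) in test_data are stored in only 2.6 bytes per gram.
Time and memory efficiency. tongrams-rs employs the Elias-Fano Trie, which encodes a trie data structure of N-grams with Elias-Fano codes, enabling fast lookups in compressed space.
Pure Rust. tongrams-rs is written only in Rust and can be easily plugged into your Rust code.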
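To give a feel for the Elias-Fano codes mentioned above, here is a minimal sketch of the textbook encoding for a sorted integer sequence. It is not the code used by tongrams-rs: the `EliasFano` type and its methods are hypothetical names, the bits are kept in plain `Vec`s rather than packed, and selection is done with a linear scan where a real implementation would use a constant-time select structure. It also covers only the integer encoding, not the trie built on top of it.

```rust
/// Toy Elias-Fano encoding of a non-decreasing sequence (illustration only).
struct EliasFano {
    low_bits: u32,
    lows: Vec<u64>,   // lower `low_bits` bits of each value (unpacked for clarity)
    highs: Vec<bool>, // unary-coded upper bits: value i sets bit (v >> low_bits) + i
}

impl EliasFano {
    fn new(values: &[u64]) -> Self {
        assert!(values.windows(2).all(|w| w[0] <= w[1]), "input must be sorted");
        let n = values.len() as u64;
        let universe = values.last().map_or(0, |&v| v + 1);
        // low_bits = floor(log2(universe / n)), or 0 when the ratio is below 2.
        let ratio = if n == 0 { 0 } else { universe / n };
        let low_bits = if ratio > 1 { 63 - ratio.leading_zeros() } else { 0 };
        let mask = (1u64 << low_bits) - 1;
        let max_high = values.last().map_or(0, |&v| v >> low_bits) as usize;
        let mut highs = vec![false; values.len() + max_high + 1];
        let mut lows = Vec::with_capacity(values.len());
        for (i, &v) in values.iter().enumerate() {
            lows.push(v & mask);
            highs[(v >> low_bits) as usize + i] = true;
        }
        EliasFano { low_bits, lows, highs }
    }

    /// Returns the i-th value by locating the i-th set bit of the upper part.
    fn get(&self, i: usize) -> Option<u64> {
        let low = *self.lows.get(i)?;
        let pos = self
            .highs
            .iter()
            .enumerate()
            .filter(|(_, b)| **b)
            .map(|(p, _)| p)
            .nth(i)?;
        Some((((pos - i) as u64) << self.low_bits) | low)
    }

    /// Bits a packed representation would need (lower part plus upper part).
    fn size_in_bits(&self) -> usize {
        self.lows.len() * self.low_bits as usize + self.highs.len()
    }
}

fn main() {
    let values = [2u64, 3, 5, 7, 11, 13, 24];
    let ef = EliasFano::new(&values);
    for (i, &v) in values.iter().enumerate() {
        assert_eq!(ef.get(i), Some(v));
    }
    println!("{} values in {} bits", values.len(), ef.size_in_bits());
}
```

The point of the split is that the lower bits are stored verbatim while the upper bits cost roughly two bits per element, which is why sorted ID arrays compress well under this scheme.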

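Input data format

The file format of N-gram counts files is the same as that used in tongrams, a modified Google format, where

one separate file for each distinct value of N (order) lists one gram per line,
each header row <number_of_grams> indicates the number of N-grams in the file,
tokens in a gram <gram> are separated by a space (e.g., the same time), and
a gram <gram> and its count <count> are separated by a horizontal tab.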

 <number_of_grams>
<gram1><TAB><count1>
<gram2><TAB><count2>
<gram3><TAB><count3>
...

For example,

 61516
the // parent	1
the function is	22
the function a	4
the function to	1
the function and	1
...
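The format above can be read with a few lines of standard-library Rust. The following is a minimal sketch, not part of tongrams-rs: `parse_counts` is a hypothetical helper, and it reads plain text only, so a gzip-compressed file would need to be decompressed first.

```rust
use std::io::{self, BufRead};

/// Parses an N-gram counts file: a header line with the number of grams,
/// followed by one `<gram><TAB><count>` record per line.
fn parse_counts<R: BufRead>(reader: R) -> io::Result<Vec<(Vec<String>, u64)>> {
    let invalid = |msg: String| io::Error::new(io::ErrorKind::InvalidData, msg);
    let mut lines = reader.lines();
    let header = lines
        .next()
        .ok_or_else(|| invalid("missing header".into()))??;
    let expected: usize = header
        .trim()
        .parse()
        .map_err(|e| invalid(format!("bad header {header:?}: {e}")))?;

    let mut records = Vec::new();
    for line in lines {
        let line = line?;
        if line.is_empty() {
            continue;
        }
        // The count follows the last tab; tokens are separated by single spaces.
        let (gram, count) = line
            .rsplit_once('\t')
            .ok_or_else(|| invalid(format!("no tab in {line:?}")))?;
        let count = count
            .trim()
            .parse::<u64>()
            .map_err(|e| invalid(format!("bad count in {line:?}: {e}")))?;
        let tokens: Vec<String> = gram.split(' ').map(str::to_owned).collect();
        records.push((tokens, count));
    }
    // The header must agree with the number of records actually present.
    if records.len() != expected {
        return Err(invalid(format!(
            "header says {expected}, found {}",
            records.len()
        )));
    }
    Ok(records)
}

fn main() -> io::Result<()> {
    let data = "3\nthe function is\t22\nthe function a\t4\nthe function to\t1\n";
    for (tokens, count) in parse_counts(data.as_bytes())? {
        println!("{tokens:?} -> {count}");
    }
    Ok(())
}
```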

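Command line tools

tools provides some command line tools for trying out this library. In the following, the example usages are presented using the N-gram counts files in test_data, copied from tongrams.

1. Sorting

To build the trie index, you need to sort your N-gram counts files. First, prepare a unigram counts file sorted by count, which makes the resulting index smaller: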

 $ cat test_data/1-grams.sorted
8761
the	3681
is	1869
a	1778
of	1672
to	1638
and	1202
...

Using the unigram file as a vocabulary, the executable sort_grams sorts an N-gram counts file.

Here, we sort the following unsorted bigram counts file:

 $ cat test_data/2-grams
38900
ways than	1
may come	1
frequent causes	1
way has	1
in which	14
...

You can sort the bigram file (in gzip format) and write test_data/2-grams.sorted with the following command:

 $ cargo run --release -p tools --bin sort_grams -- -i test_data/2-grams.gz -v test_data/1-grams.sorted.gz -o test_data/2-grams.sorted
Loading the vocabulary: "test_data/1-grams.sorted.gz"
Loading the records: "test_data/2-grams.gz"
Sorting the records
Writing the index into "test_data/2-grams.sorted.gz"

The output file format can be specified with -f; the default is .gz. The resulting file will be

 $ cat test_data/2-grams.sorted
38900
the //	1
the function	94
the if	3
the code	126
the compiler	117
...
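The ordering can be illustrated with a short sketch. It assumes that grams are compared lexicographically by the vocabulary IDs of their tokens, where a token's ID is its rank in the count-sorted unigram file; the sample output above, which starts with grams whose first token is the top unigram `the`, is consistent with that reading, but the exact rule used by sort_grams is not confirmed here. `build_vocab` and `sort_by_vocab` are hypothetical names.

```rust
use std::collections::HashMap;

/// Maps each token to its rank in the count-sorted unigram list (most frequent = 0).
fn build_vocab(unigrams: &[&str]) -> HashMap<String, usize> {
    unigrams
        .iter()
        .enumerate()
        .map(|(id, tok)| ((*tok).to_string(), id))
        .collect()
}

/// Sorts grams lexicographically by their token-ID sequences.
/// Returns None if a gram contains a token missing from the vocabulary.
fn sort_by_vocab(
    grams: Vec<(String, u64)>,
    vocab: &HashMap<String, usize>,
) -> Option<Vec<(String, u64)>> {
    let mut keyed = Vec::with_capacity(grams.len());
    for (gram, count) in grams {
        let ids = gram
            .split(' ')
            .map(|t| vocab.get(t).copied())
            .collect::<Option<Vec<usize>>>()?;
        keyed.push((ids, gram, count));
    }
    keyed.sort_by(|a, b| a.0.cmp(&b.0));
    Some(keyed.into_iter().map(|(_, g, c)| (g, c)).collect())
}

fn main() {
    // Toy vocabulary in count order, and a few made-up bigrams.
    let vocab = build_vocab(&["the", "is", "a", "of", "to", "and"]);
    let grams = vec![
        ("of the".to_string(), 7),
        ("the is".to_string(), 2),
        ("a to".to_string(), 1),
        ("the a".to_string(), 4),
    ];
    for (gram, count) in sort_by_vocab(grams, &vocab).expect("unknown token") {
        println!("{gram}\t{count}");
    }
}
```

Sorting by ID rather than by string keeps the children of each prefix in increasing ID order, which is the layout a sorted-array trie needs.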

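2. Indexing

The executable index builds a language model from (sorted) N-gram counts files, named <order>-grams.sorted.gz, and writes it into a binary file. The input file format can be specified with -f; the default is .gz.

For example, the following command builds a language model from the N-gram counts files (N = 1..5) placed in the directory test_data and writes it into index.bin.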

 $ cargo run --release -p tools --bin index -- -n 5 -i test_data -o index.bin
Input files: ["test_data/1-grams.sorted.gz", "test_data/2-grams.sorted.gz", "test_data/3-grams.sorted.gz", "test_data/4-grams.sorted.gz", "test_data/5-grams.sorted.gz"]
Counstructing the index...
Elapsed time: 0.190 [sec]
252550 grams are stored.
Writing the index into "index.bin"...
Index size: 659366 bytes (0.629 MiB)
Bytes per gram: 2.611 bytes

As the standard output shows, the model file takes only 2.6 bytes per gram.
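The two derived figures in the log follow directly from the index size and the number of stored grams. A small sketch of the arithmetic, using hypothetical helper names:

```rust
/// Average index bytes per stored gram.
fn bytes_per_gram(index_bytes: u64, num_grams: u64) -> f64 {
    index_bytes as f64 / num_grams as f64
}

/// Converts a byte count to mebibytes (1 MiB = 1024 * 1024 bytes).
fn to_mib(bytes: u64) -> f64 {
    bytes as f64 / (1024.0 * 1024.0)
}

fn main() {
    // Figures taken from the indexing log above.
    let (index_bytes, num_grams) = (659_366, 252_550);
    println!("Index size: {} bytes ({:.3} MiB)", index_bytes, to_mib(index_bytes));
    println!("Bytes per gram: {:.3} bytes", bytes_per_gram(index_bytes, num_grams));
}
```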

3. Lookup

The executable lookup provides a demo for looking up N-grams, as follows.

 $ cargo run --release -p tools --bin lookup -- -i index.bin 
Loading the index from "index.bin"...
Performing the lookup...
> take advantage
count = 8
> only 64-bit execution
count = 1
> Elias Fano
Not found
> 
Good bye!
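Conceptually, a lookup walks the trie one token at a time. The sketch below shows the idea on an uncompressed two-level trie stored as flat sorted arrays. It is an illustration under stated assumptions, not the library's implementation: `ToyTrie`, `example_trie`, and the field names are hypothetical (loosely following the component names in the stats output), the arrays are plain `Vec`s rather than Elias-Fano sequences, and counts are stored directly.

```rust
use std::collections::HashMap;

/// Toy two-level trie over unigrams and bigrams, stored as flat sorted arrays.
struct ToyTrie {
    vocab: HashMap<String, usize>, // token -> ID
    unigram_counts: Vec<u64>,      // indexed by token ID
    pointers: Vec<usize>, // children of ID t are token_ids[pointers[t]..pointers[t + 1]]
    token_ids: Vec<usize>, // second-token IDs, sorted within each range
    bigram_counts: Vec<u64>, // parallel to token_ids
}

impl ToyTrie {
    fn lookup(&self, gram: &str) -> Option<u64> {
        // Any token outside the vocabulary means the gram cannot be stored.
        let ids: Vec<usize> = gram
            .split(' ')
            .map(|t| self.vocab.get(t).copied())
            .collect::<Option<_>>()?;
        match ids.as_slice() {
            &[t] => self.unigram_counts.get(t).copied(),
            &[t, u] => {
                let (start, end) = (self.pointers[t], self.pointers[t + 1]);
                // Binary search for the second token among the children of the first.
                let offset = self.token_ids[start..end].binary_search(&u).ok()?;
                Some(self.bigram_counts[start + offset])
            }
            _ => None, // higher orders would repeat the same step per level
        }
    }
}

/// Builds a tiny made-up model: "the is" = 5, "the a" = 3, "a the" = 2.
fn example_trie() -> ToyTrie {
    ToyTrie {
        vocab: ["the", "is", "a"]
            .iter()
            .enumerate()
            .map(|(id, t)| ((*t).to_string(), id))
            .collect(),
        unigram_counts: vec![10, 6, 4],
        pointers: vec![0, 2, 2, 3],
        token_ids: vec![1, 2, 0],
        bigram_counts: vec![5, 3, 2],
    }
}

fn main() {
    let trie = example_trie();
    for query in ["the a", "is the", "Elias Fano"] {
        match trie.lookup(query) {
            Some(count) => println!("{query}: count = {count}"),
            None => println!("{query}: Not found"),
        }
    }
}
```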

4. Memory statistics

The executable stats shows the breakdown of memory usage for each component.

 $ cargo run --release -p tools --bin stats -- -i index.bin
Loading the index from "index.bin"...
{"arrays":[{"pointers":5927,"token_ids":55186},{"pointers":19745,"token_ids":92416},{"pointers":25853,"token_ids":107094},{"pointers":28135,"token_ids":111994}],"count_ranks":[{"count_ranks":5350},{"count_ranks":12106},{"count_ranks":13976},{"count_ranks":14582},{"count_ranks":14802}],"counts":[{"count":296},{"count":136},{"count":72},{"count":56},{"count":56}],"vocab":{"data":151560}}
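To read the JSON at a glance, the figures can be summed per component. The sketch below hard-codes the numbers from the output above and assumes they are byte counts, which the source does not state explicitly. Under that assumption they add up to 659,342 bytes, 24 bytes less than the index size reported during indexing. The constant and function names are hypothetical.

```rust
// Component sizes copied from the `stats` output above (assumed to be bytes).
// Each ARRAYS entry is (pointers, token_ids) for one trie level.
const ARRAYS: [(u64, u64); 4] = [
    (5927, 55186),
    (19745, 92416),
    (25853, 107094),
    (28135, 111994),
];
const COUNT_RANKS: [u64; 5] = [5350, 12106, 13976, 14582, 14802];
const COUNTS: [u64; 5] = [296, 136, 72, 56, 56];
const VOCAB: u64 = 151560;

/// Sums the sizes within each top-level component.
fn totals() -> [(&'static str, u64); 4] {
    [
        ("arrays", ARRAYS.iter().map(|(p, t)| p + t).sum::<u64>()),
        ("count_ranks", COUNT_RANKS.iter().sum::<u64>()),
        ("counts", COUNTS.iter().sum::<u64>()),
        ("vocab", VOCAB),
    ]
}

fn main() {
    let parts = totals();
    let total: u64 = parts.iter().map(|(_, v)| *v).sum();
    for (name, bytes) in parts {
        let share = 100.0 * bytes as f64 / total as f64;
        println!("{name:<12} {bytes:>7} ({share:.1}%)");
    }
    println!("{:<12} {total:>7}", "total");
}
```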

Benchmark

In the directory bench, you can measure lookup times using the N-gram data in test_data with the following command:

 $ RUSTFLAGS="-C target-cpu=native" cargo bench
count_lookup/tongrams/EliasFanoTrieCountLm
                        time:   [3.1818 ms 3.1867 ms 3.1936 ms]

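The reported time is the total elapsed time for looking up 5K random grams. The above result was obtained on my laptop PC (Intel i7, 16GB RAM); that is, EliasFanoTrieCountLm can look up a gram in 0.64 microseconds on average.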
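The per-lookup figure is the reported total divided by the number of lookups. A minimal sketch of the conversion, taking the middle value of the reported interval as the estimate (`micros_per_lookup` is a hypothetical helper):

```rust
/// Average time per lookup in microseconds, given a total in milliseconds.
fn micros_per_lookup(total_ms: f64, lookups: u32) -> f64 {
    total_ms * 1000.0 / f64::from(lookups)
}

fn main() {
    // 3.1867 ms is the middle value in the benchmark output; 5K lookups per run.
    println!(
        "{:.2} microseconds per lookup",
        micros_per_lookup(3.1867, 5_000)
    );
}
```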