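`tongrams-rs`: Tons of N-grams in Rust

This is a Rust port of tongrams to index and query large language models in compressed space, in which the data structures are presented in the following papers:

Giulio Ermanno Pibiri and Rossano Venturini, "Efficient Data Structures for Massive N-Gram Datasets". In Proceedings of the 40th ACM Conference on Research and Development in Information Retrieval (SIGIR 2017), pp. 615-624.
Giulio Ermanno Pibiri and Rossano Venturini, "Handling Massive N-Gram Datasets Efficiently". ACM Transactions on Information Systems (TOIS), 37.2 (2019): 1-41.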

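What it can do

Store N-gram language models with frequency counts.
Look up N-grams to get their frequency counts.

Features

Compressed language model. tongrams-rs can store large N-gram language models in very compressed space. For example, the word N-gram datasets (N = 1..5) in test_data are stored in only 2.6 bytes per gram.
Time and memory efficiency. tongrams-rs employs Elias-Fano Trie, which encodes a trie data structure consisting of N-grams through Elias-Fano codes, enabling fast lookups in compressed space.
Pure Rust. tongrams-rs is written only in Rust and can be easily plugged into your Rust code.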
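
To make the Elias-Fano idea concrete, here is a minimal sketch of the textbook encoding of a sorted integer sequence: each value is split into a few low bits stored verbatim and a high part stored in unary. This is not the code tongrams-rs uses; the type `EliasFano` and its methods are hypothetical names for illustration, bits are held as `bool`s for readability, and the linear scan in `get` stands in for the select index a real implementation would have.

```rust
/// A textbook Elias-Fano encoding of a non-decreasing sequence of integers.
/// Bits are kept as `bool`s for readability; a real implementation packs
/// them into machine words and adds a select index over `highs`.
struct EliasFano {
    len: usize,
    low_bits: u32,    // number of low bits stored verbatim per value
    lows: Vec<bool>,  // `low_bits` bits per value, least significant first
    highs: Vec<bool>, // unary-coded high parts: bit (high_i + i) is set
}

impl EliasFano {
    fn new(values: &[u64]) -> Self {
        assert!(values.windows(2).all(|w| w[0] <= w[1]), "input must be sorted");
        let n = values.len() as u64;
        // Universe size: one past the largest value (assumed below u64::MAX).
        let universe = values.last().map_or(0, |&v| v + 1);
        // low_bits = floor(log2(universe / n)), or 0 if that ratio is 0.
        let ratio = if n == 0 { 0 } else { universe / n };
        let low_bits = if ratio == 0 { 0 } else { 63 - ratio.leading_zeros() };

        let mut lows = Vec::with_capacity(values.len() * low_bits as usize);
        let mut highs = vec![false; values.len() + (universe >> low_bits) as usize + 1];
        for (i, &v) in values.iter().enumerate() {
            for b in 0..low_bits {
                lows.push((v >> b) & 1 == 1);
            }
            highs[(v >> low_bits) as usize + i] = true;
        }
        Self { len: values.len(), low_bits, lows, highs }
    }

    /// Returns the `i`-th value, or `None` if `i` is out of range.
    fn get(&self, i: usize) -> Option<u64> {
        if i >= self.len {
            return None;
        }
        // Position of the i-th set bit in `highs` (a linear scan here).
        let pos = self.highs.iter().enumerate().filter(|&(_, &bit)| bit).nth(i)?.0;
        let high = (pos - i) as u64;
        let width = self.low_bits as usize;
        let low = (0..width)
            .fold(0u64, |acc, b| acc | (u64::from(self.lows[i * width + b]) << b));
        Some((high << self.low_bits) | low)
    }

    /// Total number of bits used by the two bit sequences.
    fn size_in_bits(&self) -> usize {
        self.lows.len() + self.highs.len()
    }
}

fn main() {
    let values = [3, 4, 7, 13, 14, 15, 21, 43];
    let ef = EliasFano::new(&values);
    let decoded: Vec<u64> = (0..values.len()).filter_map(|i| ef.get(i)).collect();
    println!("decoded = {decoded:?}, bits used = {}", ef.size_in_bits());
}
```

For these eight values the sketch spends 2 low bits per value plus a 20-bit unary sequence, instead of one full machine word per value.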

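Input data format

The file format of N-gram count files is the same as that used in tongrams, a modified Google format, where

one separate file for each distinct value of N (order) lists one gram per line,
the header line <number_of_grams> of each file gives the number of N-grams in that file,
tokens in a gram <gram> are separated by a space (e.g., the same time), and
a gram <gram> and its count <count> are separated by a horizontal tab.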

 <number_of_grams>
<gram1><TAB><count1>
<gram2><TAB><count2>
<gram3><TAB><count3>
...

For example,

 61516
the // parent	1
the function is	22
the function a	4
the function to	1
the function and	1
...
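
As a rough illustration of this format, the following sketch reads a header line followed by `<gram><TAB><count>` lines using only the standard library. `Record` and `parse_counts` are hypothetical names, not part of tongrams-rs, and the check that the header matches the number of records is an assumption about how strict a reader should be.

```rust
use std::io::{self, BufRead};

/// One record of an N-gram count file: the tokens of the gram and its count.
#[derive(Debug, PartialEq)]
struct Record {
    tokens: Vec<String>,
    count: u64,
}

/// Parses the header line and the `<gram><TAB><count>` lines that follow it.
/// Fails if a line is malformed or the header disagrees with the number of
/// records actually read.
fn parse_counts<R: BufRead>(reader: R) -> io::Result<Vec<Record>> {
    let bad = |msg: String| io::Error::new(io::ErrorKind::InvalidData, msg);
    let mut lines = reader.lines();

    let header = lines.next().ok_or_else(|| bad("missing header".into()))??;
    let expected = header
        .trim()
        .parse::<usize>()
        .map_err(|e| bad(format!("bad header {header:?}: {e}")))?;

    let mut records = Vec::with_capacity(expected);
    for line in lines {
        let line = line?;
        if line.is_empty() {
            continue;
        }
        // The gram and its count are separated by a horizontal tab.
        let (gram, count) = line
            .split_once('\t')
            .ok_or_else(|| bad(format!("no tab in {line:?}")))?;
        let count = count
            .trim()
            .parse::<u64>()
            .map_err(|e| bad(format!("bad count in {line:?}: {e}")))?;
        // Tokens inside a gram are separated by a single space.
        let tokens = gram.split(' ').map(str::to_owned).collect();
        records.push(Record { tokens, count });
    }
    if records.len() != expected {
        return Err(bad(format!("header says {expected}, found {}", records.len())));
    }
    Ok(records)
}

fn main() -> io::Result<()> {
    // A three-line excerpt in the same shape as the example above.
    let text = "3\nthe function is\t22\nthe function a\t4\nthe function to\t1\n";
    for record in parse_counts(text.as_bytes())? {
        println!("{:?} -> {}", record.tokens, record.count);
    }
    Ok(())
}
```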

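Command line tools

The tools package provides some command line tools for trying out this library. In the following, the example usages are presented using N-gram count files in test_data copied from tongrams.

1. Sorting

To build the trie index, you need to sort your N-gram count files. First, prepare a unigram count file sorted by count, which makes the resulting index smaller, as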

 $ cat test_data/1-grams.sorted
8761
the	3681
is	1869
a	1778
of	1672
to	1638
and	1202
...

Using the unigram file as a vocabulary, the executable sort_grams sorts an N-gram count file.

Here, we sort an unsorted bigram count file, as

 $ cat test_data/2-grams
38900
ways than	1
may come	1
frequent causes	1
way has	1
in which	14
...

You can sort the bigram file (in gzip format) and write test_data/2-grams.sorted with the following command.

 $ cargo run --release -p tools --bin sort_grams -- -i test_data/2-grams.gz -v test_data/1-grams.sorted.gz -o test_data/2-grams.sorted
Loading the vocabulary: "test_data/1-grams.sorted.gz"
Loading the records: "test_data/2-grams.gz"
Sorting the records
Writing the index into "test_data/2-grams.sorted.gz"

The output file format can be specified with -f, and the default setting is .gz. The resulting file will be

 $ cat test_data/2-grams.sorted
38900
the //	1
the function	94
the if	3
the code	126
the compiler	117
...
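
The ordering in the listing above is consistent with comparing grams by their sequences of vocabulary ids, where a token's id is its rank in the sorted unigram file. The sketch below illustrates that ordering under this assumption; it is not the implementation of sort_grams, the names `build_vocab` and `sort_by_vocab` are hypothetical, and the bigrams are toy data built from the top unigrams shown earlier.

```rust
use std::collections::HashMap;

/// Maps each token to its rank in the unigram file (0 = most frequent).
fn build_vocab(unigrams: &[&str]) -> HashMap<String, usize> {
    unigrams
        .iter()
        .enumerate()
        .map(|(id, token)| (token.to_string(), id))
        .collect()
}

/// Sorts `(gram, count)` records by the grams' token-id sequences.
/// Fails if a gram contains a token that is missing from the vocabulary.
fn sort_by_vocab(
    records: Vec<(String, u64)>,
    vocab: &HashMap<String, usize>,
) -> Result<Vec<(String, u64)>, String> {
    let mut keyed = Vec::with_capacity(records.len());
    for (gram, count) in records {
        let ids = gram
            .split(' ')
            .map(|t| vocab.get(t).copied().ok_or_else(|| format!("unknown token {t:?}")))
            .collect::<Result<Vec<usize>, String>>()?;
        keyed.push((ids, gram, count));
    }
    // Lexicographic comparison of id sequences.
    keyed.sort_by(|a, b| a.0.cmp(&b.0));
    Ok(keyed.into_iter().map(|(_, gram, count)| (gram, count)).collect())
}

fn main() {
    // Vocabulary order taken from the top of test_data/1-grams.sorted.
    let vocab = build_vocab(&["the", "is", "a", "of", "to", "and"]);
    // Toy bigrams with made-up counts, for illustration only.
    let records = vec![
        ("to the".to_string(), 1),
        ("the of".to_string(), 2),
        ("the is".to_string(), 3),
        ("a to".to_string(), 4),
    ];
    for (gram, count) in sort_by_vocab(records, &vocab).expect("all tokens known") {
        println!("{gram}\t{count}");
    }
}
```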

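2. Indexing

The executable index builds a language model from (sorted) N-gram count files, named <order>-grams.sorted.gz, and writes it into a binary file. The input file format can be specified with -f, and the default setting is .gz.

For example, the following command builds a language model from the N-gram count files (N = 1..5) placed in the directory test_data and writes it into index.bin.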

 $ cargo run --release -p tools --bin index -- -n 5 -i test_data -o index.bin
Input files: ["test_data/1-grams.sorted.gz", "test_data/2-grams.sorted.gz", "test_data/3-grams.sorted.gz", "test_data/4-grams.sorted.gz", "test_data/5-grams.sorted.gz"]
Counstructing the index...
Elapsed time: 0.190 [sec]
252550 grams are stored.
Writing the index into "index.bin"...
Index size: 659366 bytes (0.629 MiB)
Bytes per gram: 2.611 bytes

As the standard output shows, the model file takes only 2.6 bytes per gram.
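
The two derived figures in the log can be reproduced from the index size and the number of stored grams. This is only a check of the arithmetic; the helper names `bytes_per_gram` and `to_mib` are hypothetical.

```rust
/// Average number of bytes the index spends per stored gram.
fn bytes_per_gram(index_bytes: u64, num_grams: u64) -> f64 {
    index_bytes as f64 / num_grams as f64
}

/// Converts a byte count to mebibytes (1 MiB = 1024 * 1024 bytes).
fn to_mib(bytes: u64) -> f64 {
    bytes as f64 / (1024.0 * 1024.0)
}

fn main() {
    // Figures copied from the output of the index command.
    let (index_bytes, num_grams) = (659_366, 252_550);
    println!("Index size: {} bytes ({:.3} MiB)", index_bytes, to_mib(index_bytes));
    println!("Bytes per gram: {:.3} bytes", bytes_per_gram(index_bytes, num_grams));
}
```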

3. Lookup

The executable lookup provides a demo for looking up N-grams, as follows.

 $ cargo run --release -p tools --bin lookup -- -i index.bin 
Loading the index from "index.bin"...
Performing the lookup...
> take advantage
count = 8
> only 64-bit execution
count = 1
> Elias Fano
Not found
> 
Good bye!
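
Conceptually, a lookup walks a trie one token at a time. The sketch below shows an uncompressed version of that walk over sorted arrays, with one level per order and a binary search among the children of the current node. It is not the library's API or layout: `Level`, `Trie`, `push_level`, and `toy_trie` are hypothetical, the counts are toy values, and the real index compresses such arrays with Elias-Fano codes.

```rust
use std::collections::HashMap;

/// One trie level, holding every gram of one order in sorted order.
struct Level {
    token_ids: Vec<usize>, // last token id of each gram
    counts: Vec<u64>,      // frequency count of each gram
    // Children of node `p` in the previous level occupy
    // `pointers[p]..pointers[p + 1]` in this level (unused for unigrams).
    pointers: Vec<usize>,
}

#[derive(Default)]
struct Trie {
    vocab: HashMap<String, usize>,
    levels: Vec<Level>,
}

impl Trie {
    /// Appends the next level from grams sorted by token ids. Assumes each
    /// gram's prefix is already stored and unigrams are listed in id order.
    fn push_level(&mut self, grams: &[(Vec<usize>, u64)]) {
        let parents = self.levels.last().map_or(0, |l| l.counts.len());
        let mut level = Level { token_ids: vec![], counts: vec![], pointers: vec![0; parents + 1] };
        for (ids, count) in grams {
            let (last, prefix) = ids.split_last().expect("empty gram");
            if !prefix.is_empty() {
                let parent = self.find(prefix).expect("prefix not indexed");
                level.pointers[parent + 1] += 1; // count children per parent
            }
            level.token_ids.push(*last);
            level.counts.push(*count);
        }
        // Prefix sums turn per-parent child counts into range boundaries.
        for p in 0..parents {
            level.pointers[p + 1] += level.pointers[p];
        }
        self.levels.push(level);
    }

    /// Position of the gram `ids` within its level, if it is stored.
    fn find(&self, ids: &[usize]) -> Option<usize> {
        let mut pos = *ids.first()?; // a unigram's position is its token id
        if pos >= self.levels.first()?.counts.len() {
            return None;
        }
        for (k, id) in ids.iter().enumerate().skip(1) {
            let level = self.levels.get(k)?;
            let (begin, end) = (level.pointers[pos], level.pointers[pos + 1]);
            // Siblings are sorted by token id, so binary search applies.
            pos = begin + level.token_ids[begin..end].binary_search(id).ok()?;
        }
        Some(pos)
    }

    /// Frequency count of a space-separated gram, or `None` if not found.
    fn lookup(&self, gram: &str) -> Option<u64> {
        let ids: Vec<usize> = gram
            .split(' ')
            .map(|t| self.vocab.get(t).copied())
            .collect::<Option<_>>()?;
        let pos = self.find(&ids)?;
        Some(self.levels[ids.len() - 1].counts[pos])
    }
}

/// Builds a three-token toy model; all counts are made up.
fn toy_trie() -> Trie {
    let mut trie = Trie::default();
    for (id, token) in ["the", "function", "is"].iter().enumerate() {
        trie.vocab.insert(token.to_string(), id);
    }
    trie.push_level(&[(vec![0], 10), (vec![1], 5), (vec![2], 7)]);
    trie.push_level(&[(vec![0, 1], 4), (vec![1, 2], 3)]);
    trie.push_level(&[(vec![0, 1, 2], 2)]);
    trie
}

fn main() {
    let trie = toy_trie();
    for gram in ["the function", "the function is", "Elias Fano"] {
        match trie.lookup(gram) {
            Some(count) => println!("{gram}: count = {count}"),
            None => println!("{gram}: Not found"),
        }
    }
}
```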

4. Memory statistics

The executable stats shows a breakdown of memory usage for each component.

 $ cargo run --release -p tools --bin stats -- -i index.bin
Loading the index from "index.bin"...
{"arrays":[{"pointers":5927,"token_ids":55186},{"pointers":19745,"token_ids":92416},{"pointers":25853,"token_ids":107094},{"pointers":28135,"token_ids":111994}],"count_ranks":[{"count_ranks":5350},{"count_ranks":12106},{"count_ranks":13976},{"count_ranks":14582},{"count_ranks":14802}],"counts":[{"count":296},{"count":136},{"count":72},{"count":56},{"count":56}],"vocab":{"data":151560}}
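
As a quick cross-check, the component sizes in this breakdown can be summed and compared with the index size reported by the index command. The sketch assumes the numbers are bytes, which the output does not state; the values are copied from the JSON above, and the names `STATS` and `total_size` are hypothetical.

```rust
/// Component sizes copied from the stats output, grouped by field name.
const STATS: [(&str, &[u64]); 5] = [
    ("pointers", &[5927, 19745, 25853, 28135]),
    ("token_ids", &[55186, 92416, 107094, 111994]),
    ("count_ranks", &[5350, 12106, 13976, 14582, 14802]),
    ("counts", &[296, 136, 72, 56, 56]),
    ("vocab", &[151560]),
];

/// Sums every component size in a breakdown.
fn total_size(components: &[(&str, &[u64])]) -> u64 {
    components
        .iter()
        .map(|(_, sizes)| sizes.iter().sum::<u64>())
        .sum()
}

fn main() {
    for (name, sizes) in STATS.iter() {
        println!("{name}: {}", sizes.iter().sum::<u64>());
    }
    let total = total_size(&STATS);
    // 659366 is the index size reported when the index was written.
    println!("total: {total} (index file: 659366, difference: {})", 659_366 - total);
}
```

Under that assumption the components add up to 659342, which is 24 less than the 659366 bytes of index.bin.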

Benchmark

In the directory bench, you can measure lookup times using the N-gram data in test_data with the following command:

 $ RUSTFLAGS="-C target-cpu=native" cargo bench
count_lookup/tongrams/EliasFanoTrieCountLm
                        time:   [3.1818 ms 3.1867 ms 3.1936 ms]

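The reported time is the total elapsed time for looking up 5K random grams. The above result was obtained on my laptop PC (Intel i7, 16GB RAM), i.e., EliasFanoTrieCountLm can look up a gram in 0.64 microseconds on average.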
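
The per-gram figure follows from dividing the reported total by the number of lookups. A small sketch of that conversion, using the middle value of the reported interval; `micros_per_lookup` is a hypothetical helper, not part of the benchmark code.

```rust
use std::time::Duration;

/// Average time per lookup in microseconds, given the total for `n` lookups.
fn micros_per_lookup(total: Duration, n: u32) -> f64 {
    total.as_secs_f64() * 1e6 / f64::from(n)
}

fn main() {
    // Middle value reported by cargo bench: 3.1867 ms for 5K lookups.
    let total = Duration::from_secs_f64(3.1867e-3);
    println!("{:.2} microseconds per gram", micros_per_lookup(total, 5000));
}
```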

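TODO

Add fast Elias-Fano and partitioned Elias-Fano
Add minimal perfect hashing
Add remapping
Support probability scores
Make sucds::EliasFano faster

License

This library is free software provided under the MIT license.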

Additional information

Version: 1.0.0
Type: AI source code
Updated: 2025-09-07
Size: 1.89MB
Source: GitHub
