`tongrams-rs`: Tons of N-grams in Rust

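This is a Rust port of tongrams for indexing and querying large language models in compressed space. The data structures are presented in the following papers:

Giulio Ermanno Pibiri and Rossano Venturini, Efficient Data Structures for Massive N-Gram Datasets. In Proceedings of the 40th ACM Conference on Research and Development in Information Retrieval (SIGIR 2017), pp. 615-624.
Giulio Ermanno Pibiri and Rossano Venturini, Handling Massive N-Gram Datasets Efficiently. ACM Transactions on Information Systems (TOIS), 37.2 (2019): 1-41.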

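What it can do

Store N-gram language models with frequency counts.
Look up N-grams to get their frequency counts.

Features

Compressed language model. tongrams-rs can store large N-gram language models in very compressed space. For example, the word N-gram datasets (N = 1..5) in test_data are stored in only 2.6 bytes per gram.
Time and memory efficiency. tongrams-rs employs the Elias-Fano Trie, which encodes a trie data structure consisting of N-grams through Elias-Fano codes, enabling fast lookups in compressed space.
Pure Rust. tongrams-rs is written only in Rust and can be easily plugged into your Rust code.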
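
To give a feel for the Elias-Fano codes mentioned above, here is a minimal sketch of encoding a sorted integer sequence with them. It is not the encoder used by tongrams-rs: the names `EliasFano`, `encode`, and `get` are hypothetical, the bits are left unpacked for readability, and `get` scans the upper bits linearly where a practical implementation would use a select structure.

```rust
/// Illustrative Elias-Fano encoding of a non-decreasing sequence.
/// Hypothetical sketch; not the tongrams-rs implementation.
struct EliasFano {
    low_bits: u32,    // number of low bits kept verbatim per element
    lows: Vec<u64>,   // low parts, one per element (unpacked for clarity)
    highs: Vec<bool>, // high parts, unary-coded in a bit vector
}

impl EliasFano {
    /// Encodes `values`, which must be sorted and no larger than `universe`.
    fn encode(values: &[u64], universe: u64) -> Self {
        let n = values.len() as u64;
        // low_bits = floor(log2(universe / n)), or 0 when that ratio is below 1.
        let low_bits = if n == 0 || universe / n == 0 {
            0
        } else {
            (universe / n).ilog2()
        };
        let mask = (1u64 << low_bits) - 1;
        let buckets = (universe >> low_bits) + 1;
        let mut highs = vec![false; (n + buckets) as usize];
        let mut lows = Vec::with_capacity(values.len());
        for (i, &v) in values.iter().enumerate() {
            lows.push(v & mask);
            // The i-th element sets the bit at position (high part + i).
            highs[(v >> low_bits) as usize + i] = true;
        }
        Self { low_bits, lows, highs }
    }

    /// Returns the i-th element by locating the i-th set bit in `highs`.
    fn get(&self, i: usize) -> u64 {
        let mut seen = 0;
        for (pos, &bit) in self.highs.iter().enumerate() {
            if bit {
                if seen == i {
                    // Position minus rank recovers the high part.
                    return (((pos - i) as u64) << self.low_bits) | self.lows[i];
                }
                seen += 1;
            }
        }
        panic!("index {i} is out of range");
    }
}

fn main() {
    let values: Vec<u64> = vec![2, 3, 5, 7, 11, 13, 24];
    let ef = EliasFano::encode(&values, 24);
    let decoded: Vec<u64> = (0..values.len()).map(|i| ef.get(i)).collect();
    println!("low bits = {}, decoded = {:?}", ef.low_bits, decoded);
}
```

Each element costs `low_bits` bits plus roughly two bits in the upper bit vector, which is why the scheme suits the sorted ID sequences found in a trie.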

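Input data format

The file format of N-gram count files is the same as that used in tongrams:

one separate file for each distinct value of N (order), listing one gram per line,
each header row <number_of_grams> indicates the number of N-grams in the file,
tokens in a gram <gram> are separated by a space (e.g., the same time), and
a gram <gram> and its count <count> are separated by a horizontal tab.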

 <number_of_grams>
<gram1><TAB><count1>
<gram2><TAB><count2>
<gram3><TAB><count3>
...

For example,

 61516
the // parent	1
the function is	22
the function a	4
the function to	1
the function and	1
...
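
The layout described above is simple enough to read with the standard library alone. The following is a minimal sketch of such a reader; `parse_counts` is a hypothetical helper written for illustration, not a function exported by tongrams-rs, and it handles plain text only (no gzip).

```rust
use std::io::{self, BufRead};

/// Reads a count file: a header line holding the number of grams,
/// followed by `<gram><TAB><count>` lines.
/// Hypothetical helper; not part of the tongrams-rs API.
fn parse_counts<R: BufRead>(reader: R) -> io::Result<Vec<(Vec<String>, u64)>> {
    let invalid = |msg: String| io::Error::new(io::ErrorKind::InvalidData, msg);
    let mut lines = reader.lines();
    let header = lines
        .next()
        .ok_or_else(|| invalid("missing header".to_string()))??;
    let expected: usize = header
        .trim()
        .parse()
        .map_err(|e| invalid(format!("bad header: {e}")))?;

    let mut records = Vec::new();
    for line in lines {
        let line = line?;
        if line.is_empty() {
            continue;
        }
        // The gram and its count are separated by a horizontal tab.
        let (gram, count) = line
            .split_once('\t')
            .ok_or_else(|| invalid(format!("no tab in {line:?}")))?;
        let count: u64 = count
            .trim()
            .parse()
            .map_err(|e| invalid(format!("bad count: {e}")))?;
        // Tokens within a gram are separated by a single space.
        let tokens: Vec<String> = gram.split(' ').map(str::to_owned).collect();
        records.push((tokens, count));
    }
    if records.len() != expected {
        return Err(invalid(format!(
            "header says {expected} grams, found {}",
            records.len()
        )));
    }
    Ok(records)
}

fn main() -> io::Result<()> {
    let data = "3\nthe function is\t22\nthe function a\t4\nthe function to\t1\n";
    for (tokens, count) in parse_counts(data.as_bytes())? {
        println!("{tokens:?} -> {count}");
    }
    Ok(())
}
```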

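Command line tools

The tools package provides several command line tools for trying out this library. The examples below use the N-gram count files in test_data, which were copied from tongrams.

1. Sorting

To build the trie index, the N-gram count files must be sorted. First, prepare a unigram count file sorted by count, which makes the resulting index smaller: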

 $ cat test_data/1-grams.sorted
8761
the	3681
is	1869
a	1778
of	1672
to	1638
and	1202
...

Using the unigram file as a vocabulary, the executable sort_grams sorts an N-gram count file.

Here, we sort an unsorted bigram count file:

 $ cat test_data/2-grams
38900
ways than	1
may come	1
frequent causes	1
way has	1
in which	14
...

The bigram file (in gzip format) can be sorted and written to test_data/2-grams.sorted with the following command:

 $ cargo run --release -p tools --bin sort_grams -- -i test_data/2-grams.gz -v test_data/1-grams.sorted.gz -o test_data/2-grams.sorted
Loading the vocabulary: "test_data/1-grams.sorted.gz"
Loading the records: "test_data/2-grams.gz"
Sorting the records
Writing the index into "test_data/2-grams.sorted.gz"

The output file format can be specified with -f; the default is .gz. The resulting file looks like this:

 $ cat test_data/2-grams.sorted
38900
the //	1
the function	94
the if	3
the code	126
the compiler	117
...
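
The sample files are consistent with ordering grams by the sequence of their tokens' positions in the unigram file (for instance, "the function is", "the function a", "the function to", "the function and" follow the unigram order is, a, to, and). The sketch below illustrates that reading. It is an assumption inferred from the examples, not a description of the internals of sort_grams, and `sort_by_vocab` is a hypothetical name.

```rust
use std::collections::HashMap;

/// Sorts grams by the sequence of their tokens' vocabulary ranks.
/// Returns `None` if a token is missing from the vocabulary.
/// Hypothetical helper; the real `sort_grams` may work differently.
fn sort_by_vocab<'a>(vocab: &[&str], grams: &[(&'a str, u64)]) -> Option<Vec<(&'a str, u64)>> {
    // A token's rank is its line position in the count-sorted unigram file.
    let rank: HashMap<&str, usize> = vocab.iter().enumerate().map(|(i, &t)| (t, i)).collect();

    let mut keyed = Vec::with_capacity(grams.len());
    for &(gram, count) in grams {
        let ids: Option<Vec<usize>> = gram.split(' ').map(|t| rank.get(t).copied()).collect();
        keyed.push((ids?, gram, count));
    }
    // Tuples compare field by field, so the ID sequence is the primary key.
    keyed.sort();
    Some(keyed.into_iter().map(|(_, gram, count)| (gram, count)).collect())
}

fn main() {
    // The first six entries follow the unigram listing above;
    // the rank of "function" is made up for the example.
    let vocab = ["the", "is", "a", "of", "to", "and", "function"];
    let grams: [(&str, u64); 4] = [
        ("the function to", 1),
        ("the function and", 1),
        ("the function is", 22),
        ("the function a", 4),
    ];
    match sort_by_vocab(&vocab, &grams) {
        Some(sorted) => {
            for (gram, count) in sorted {
                println!("{gram}\t{count}");
            }
        }
        None => eprintln!("a gram contains a token that is not in the vocabulary"),
    }
}
```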

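2. Indexing

The executable index builds a language model from (sorted) N-gram count files named <order>-grams.sorted.gz and writes it to a binary file. The input file format can be specified with -f; the default is .gz.

For example, the following command builds a language model from the N-gram count files (N = 1..5) in the directory test_data and writes it to index.bin.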

 $ cargo run --release -p tools --bin index -- -n 5 -i test_data -o index.bin
Input files: ["test_data/1-grams.sorted.gz", "test_data/2-grams.sorted.gz", "test_data/3-grams.sorted.gz", "test_data/4-grams.sorted.gz", "test_data/5-grams.sorted.gz"]
Counstructing the index...
Elapsed time: 0.190 [sec]
252550 grams are stored.
Writing the index into "index.bin"...
Index size: 659366 bytes (0.629 MiB)
Bytes per gram: 2.611 bytes

As the standard output shows, the model file takes only 2.6 bytes per gram.
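
The per-gram figure is simply the index size divided by the number of stored grams. A short check using the numbers from the output above (`bytes_per_gram` is a name chosen here for illustration):

```rust
/// Average index size per stored gram, in bytes.
fn bytes_per_gram(index_bytes: u64, num_grams: u64) -> f64 {
    index_bytes as f64 / num_grams as f64
}

fn main() {
    // Figures copied from the `index` output above.
    let (index_bytes, num_grams) = (659_366u64, 252_550u64);
    println!("{:.3} MiB", index_bytes as f64 / (1024.0 * 1024.0));
    println!("{:.3} bytes per gram", bytes_per_gram(index_bytes, num_grams));
}
```

Rounded to three decimals, this reproduces the 0.629 MiB and 2.611 bytes reported by the tool.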

3. Lookup

The executable lookup provides a demo for looking up N-grams, as follows.

 $ cargo run --release -p tools --bin lookup -- -i index.bin 
Loading the index from "index.bin"...
Performing the lookup...
> take advantage
count = 8
> only 64-bit execution
count = 1
> Elias Fano
Not found
> 
Good bye!
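
Conceptually, such a lookup walks a trie one token at a time, narrowing to the children of the current node at each step. The toy below shows the idea with plain, uncompressed sorted arrays. It is a sketch under stated assumptions: `ToyTrie`, `Level`, and `toy_trie` are hypothetical, the counts are made up, and the real index stores these sequences in compressed form.

```rust
use std::collections::HashMap;

/// One trie level: the nodes for all grams of one order, in sorted order.
struct Level {
    token_ids: Vec<usize>, // last token of each gram
    counts: Vec<u64>,      // frequency count of each gram
    pointers: Vec<usize>,  // children of node j: pointers[j]..pointers[j + 1] in the next level
}

/// Uncompressed toy trie; hypothetical, for illustration only.
struct ToyTrie {
    vocab: HashMap<&'static str, usize>,
    levels: Vec<Level>,
}

impl ToyTrie {
    /// Returns the count of `gram`, or `None` if it is not stored.
    fn lookup(&self, gram: &str) -> Option<u64> {
        let mut tokens = gram.split(' ');
        // At the unigram level, a node's position equals its token ID.
        let mut pos = *self.vocab.get(tokens.next()?)?;
        let mut depth = 0;
        for token in tokens {
            let id = *self.vocab.get(token)?;
            let parent = self.levels.get(depth)?;
            let child = self.levels.get(depth + 1)?;
            let begin = *parent.pointers.get(pos)?;
            let end = *parent.pointers.get(pos + 1)?;
            // Sibling token IDs are sorted, so binary search applies.
            let offset = child.token_ids[begin..end].binary_search(&id).ok()?;
            pos = begin + offset;
            depth += 1;
        }
        self.levels.get(depth)?.counts.get(pos).copied()
    }
}

/// Builds a tiny hand-written trie with made-up counts.
fn toy_trie() -> ToyTrie {
    let vocab = HashMap::from([("the", 0), ("is", 1), ("function", 2)]);
    let levels = vec![
        // Unigrams: the, is, function.
        Level { token_ids: vec![0, 1, 2], counts: vec![10, 5, 3], pointers: vec![0, 2, 2, 3] },
        // Bigrams: "the is", "the function", "function is".
        Level { token_ids: vec![1, 2, 1], counts: vec![2, 4, 3], pointers: vec![0, 0, 1, 1] },
        // Trigrams: "the function is".
        Level { token_ids: vec![1], counts: vec![1], pointers: vec![] },
    ];
    ToyTrie { vocab, levels }
}

fn main() {
    let trie = toy_trie();
    for gram in ["the function", "the function is", "is the"] {
        println!("{gram:?}: {:?}", trie.lookup(gram));
    }
}
```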

4. Memory statistics

The executable stats shows the breakdown of memory usage for each component.

 $ cargo run --release -p tools --bin stats -- -i index.bin
Loading the index from "index.bin"...
{"arrays":[{"pointers":5927,"token_ids":55186},{"pointers":19745,"token_ids":92416},{"pointers":25853,"token_ids":107094},{"pointers":28135,"token_ids":111994}],"count_ranks":[{"count_ranks":5350},{"count_ranks":12106},{"count_ranks":13976},{"count_ranks":14582},{"count_ranks":14802}],"counts":[{"count":296},{"count":136},{"count":72},{"count":56},{"count":56}],"vocab":{"data":151560}}
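
To see which components dominate, the figures can be summed per component. The snippet below does that with the values hard-coded from the output above. It assumes the numbers are byte counts, which the output itself does not state, and `total` is a name chosen for illustration.

```rust
/// Sums all values across the given groups.
fn total(groups: &[&[u64]]) -> u64 {
    groups.iter().map(|g| g.iter().sum::<u64>()).sum()
}

fn main() {
    // Values copied from the `stats` output above (assumed to be bytes).
    let pointers = [5927u64, 19745, 25853, 28135];
    let token_ids = [55186u64, 92416, 107094, 111994];
    let count_ranks = [5350u64, 12106, 13976, 14582, 14802];
    let counts = [296u64, 136, 72, 56, 56];
    let vocab = [151_560u64];

    let all = total(&[&pointers, &token_ids, &count_ranks, &counts, &vocab]);
    let components: [(&str, &[u64]); 5] = [
        ("pointers", &pointers),
        ("token_ids", &token_ids),
        ("count_ranks", &count_ranks),
        ("counts", &counts),
        ("vocab", &vocab),
    ];
    for (name, part) in components {
        let sum = total(&[part]);
        println!("{name:12} {sum:7} ({:.1}%)", 100.0 * sum as f64 / all as f64);
    }
    println!("{:12} {all:7}", "total");
}
```

The components add up to 659342, which is 24 less than the index size of 659366 bytes reported by index; the source does not explain the difference.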

Benchmark

In the directory bench, you can measure lookup times using the N-gram data in test_data with the following command:

 $ RUSTFLAGS="-C target-cpu=native" cargo bench
count_lookup/tongrams/EliasFanoTrieCountLm
                        time:   [3.1818 ms 3.1867 ms 3.1936 ms]

The reported time is the total elapsed time for looking up 5K random grams. The above result was obtained on my laptop PC (Intel i7, 16GB RAM); that is, EliasFanoTrieCountLm can look up a gram in 0.64 microseconds on average.
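
The per-lookup average follows from dividing the reported total by the number of lookups. A quick check, assuming "5K" means exactly 5,000 lookups and taking the middle estimate from the benchmark output (`micros_per_lookup` is a name chosen for illustration):

```rust
/// Average time per lookup in microseconds, given a total in milliseconds.
fn micros_per_lookup(total_ms: f64, num_lookups: u32) -> f64 {
    total_ms * 1000.0 / f64::from(num_lookups)
}

fn main() {
    // 3.1867 ms is the middle estimate printed by `cargo bench` above.
    println!("{:.2} microseconds per lookup", micros_per_lookup(3.1867, 5_000));
}
```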

TODO

Add fast Elias-Fano and partitioned Elias-Fano
Add minimal perfect hashing
Add remapping
Support probability scores
Make sucds::EliasFano faster

License

This library is free software provided under the MIT License.
