YouTokenToMe

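YouTokenToMe is an unsupervised text tokenizer focused on computational efficiency. It currently implements fast Byte Pair Encoding (BPE) [Sennrich et al.]. Our implementation is much faster in training and tokenization than Hugging Face, fastBPE and SentencePiece. In some test cases, it is 60 times faster. See the benchmark results.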

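Key advantages:

Multithreading for training and tokenization
The algorithm has O(N) complexity, where N is the length of the training data
Highly efficient implementation in C++
Python wrapper and command-line interface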

Extra features:

BPE-dropout (as described in Provilkov et al., 2019)
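To make the BPE-dropout idea concrete, here is a minimal pure-Python sketch of segmenting a single word with a ranked merge table while skipping each candidate merge with probability dropout_prob. The function name bpe_encode_word, the toy merge table, and the rng parameter are illustrative assumptions. This is not YouTokenToMe's C++ implementation, only the general scheme from the cited paper.

```python
import random


def bpe_encode_word(word, merges, dropout_prob=0.0, rng=random):
    """Segment one word with a ranked merge table, dropping merges at random."""
    # Lower rank means the merge was learned earlier and has higher priority.
    rank = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        # Candidates are adjacent pairs that are in the merge table and
        # survive the dropout coin flip for this step.
        candidates = [
            (rank[pair], i)
            for i, pair in enumerate(zip(symbols, symbols[1:]))
            if pair in rank and rng.random() >= dropout_prob
        ]
        if not candidates:
            break
        _, i = min(candidates)  # apply the highest-priority surviving merge
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


# Hypothetical merge table, in the order the merges were "learned".
merges = [("l", "o"), ("lo", "w"), ("e", "r")]

print(bpe_encode_word("lower", merges))                    # ['low', 'er']
print(bpe_encode_word("lower", merges, dropout_prob=1.0))  # ['l', 'o', 'w', 'e', 'r']
print(bpe_encode_word("lower", merges, dropout_prob=0.5))  # varies from run to run
```

With dropout_prob=0 the segmentation is deterministic. Values between 0 and 1 produce different segmentations of the same word across calls, which is the regularization effect BPE-dropout is meant to provide.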

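As in the algorithm from the original paper, ours does not consider tokens that cross word boundaries. Just like in SentencePiece, all space symbols are replaced by the meta symbol "▁" (U+2581). This allows sequences of tokens to be converted back to text and word boundaries to be restored.

For example, the phrase Blazingly fast tokenization! can be tokenized into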

['▁Bl', 'az', 'ingly', '▁fast', '▁token', 'ization', '!']
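The role of the meta symbol can be illustrated with a short pure-Python sketch that marks word starts and then restores the text from the tokens shown above. The helper names mark_spaces and detokenize are hypothetical and are not part of the youtokentome package.

```python
META = "\u2581"  # the meta symbol "▁" that stands in for a space


def mark_spaces(text):
    """Prefix every whitespace-separated word with the meta symbol."""
    return "".join(META + word for word in text.split())


def detokenize(tokens):
    """Concatenate subwords and turn meta symbols back into spaces."""
    return "".join(tokens).replace(META, " ").strip()


tokens = ["▁Bl", "az", "ingly", "▁fast", "▁token", "ization", "!"]

# The tokens are a split of the marked text, so no token crosses a word boundary.
assert "".join(tokens) == mark_spaces("Blazingly fast tokenization!")

print(detokenize(tokens))  # Blazingly fast tokenization!
```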

Installation

pip install youtokentome

Python interface

Example

Let's start with a self-contained example.

import random

import youtokentome as yttm

train_data_path = "train_data.txt"
model_path = "example.model"

# Generating random file with training data
# 10000 lines with 100 characters in each line
n_lines = 10000
n_characters = 100
with open(train_data_path, "w") as fout:
    for _ in range(n_lines):
        print("".join([random.choice("abcd ") for _ in range(n_characters)]), file=fout)

# Generating random text
test_text = "".join([random.choice("abcde ") for _ in range(100)])

# Training model
yttm.BPE.train(data=train_data_path, vocab_size=5000, model=model_path)

# Loading model
bpe = yttm.BPE(model=model_path)

# Two types of tokenization
print(bpe.encode([test_text], output_type=yttm.OutputType.ID))
print(bpe.encode([test_text], output_type=yttm.OutputType.SUBWORD))

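Training model

youtokentome.BPE.train(data, model, vocab_size, coverage, n_threads=-1, pad_id=0, unk_id=1, bos_id=2, eos_id=3)

Trains a BPE model and saves it to a file.

Args:

data: string, path to file with training data
model: string, path to where the trained model will be saved
vocab_size: int, number of tokens in the final vocabulary
coverage: float, fraction of characters covered by the model. Must be in the range [0, 1]. A good value to use is about 0.9999.
n_threads: int, number of parallel threads used to run. If -1 is passed, then all available threads are going to be used. Note that the number of threads is limited by 8 (see benchmark).
pad_id: int, reserved id for padding
unk_id: int, reserved id for unknown symbols
bos_id: int, reserved id for begin of sentence token
eos_id: int, reserved id for end of sentence token

Returns: Class youtokentome.BPE with the loaded model.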
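The coverage argument is easiest to picture with a small example. The sketch below keeps the most frequent characters until they account for the requested fraction of all non-space characters. How YouTokenToMe selects characters internally is not described here, so the frequency-based cut-off and the function name covered_alphabet are assumptions made for illustration.

```python
from collections import Counter


def covered_alphabet(text, coverage):
    """Return the most frequent characters that together cover `coverage` of the text."""
    counts = Counter(ch for ch in text if not ch.isspace())
    total = sum(counts.values())
    kept, running = set(), 0
    for ch, n in counts.most_common():
        if running >= coverage * total:
            break  # the requested fraction is already covered
        kept.add(ch)
        running += n
    return kept


text = "aaaa bbbb c"  # 'a' and 'b' occur 4 times each, 'c' once
print(sorted(covered_alphabet(text, 0.8)))  # ['a', 'b']
print(sorted(covered_alphabet(text, 1.0)))  # ['a', 'b', 'c']
```

Under this reading, a coverage slightly below 1 (such as the suggested 0.9999) leaves very rare characters out of the alphabet, and they would then be represented by unk_id.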

Model loading

youtokentome.BPE(model, n_threads=-1)

Class constructor. Loads the trained model.

model: string, path to the trained model
n_threads: int, number of parallel threads used to run. If equal to -1, then the maximum number of threads available will be used.

Methods

Class youtokentome.BPE has the following methods:

encode

encode(self, sentences, output_type=yttm.OutputType.ID, bos=False, eos=False, reverse=False, dropout_prob=0)

Args:

sentences: list of strings, sentences for tokenization.
output_type: enum, sentences can be tokenized to ids or subwords. Use OutputType.ID for ids and OutputType.SUBWORD for subwords.
bos: bool, if True then the token "beginning of sentence" will be added
eos: bool, if True then the token "end of sentence" will be added
reverse: bool, if True the output sequence of tokens will be reversed
dropout_prob: float, BPE-dropout probability (the probability of a merge being dropped). Must be in the range [0, 1].

Returns: If output_type is equal to youtokentome.OutputType.ID or youtokentome.OutputType.SUBWORD, then a list of lists of integers or a list of lists of strings will be returned respectively.
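Because encode returns one list of ids per sentence, the lists usually differ in length. A common next step is to pad them into a rectangular batch using the reserved padding id. The sketch below assumes the default pad_id=0 from the training signature. The helper pad_batch and the ids in the example are made up for illustration and are not provided by youtokentome.

```python
def pad_batch(batch, pad_id=0):
    """Right-pad every id sequence to the length of the longest one."""
    width = max((len(ids) for ids in batch), default=0)
    return [ids + [pad_id] * (width - len(ids)) for ids in batch]


# Made-up ids shaped like the output of encode(..., output_type=OutputType.ID).
encoded = [[2, 17, 9, 3], [2, 5, 3]]

print(pad_batch(encoded))  # [[2, 17, 9, 3], [2, 5, 3, 0]]
```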

vocab

vocab(self)

Returns: A list of vocab_size strings. The i-th string in the list corresponds to the i-th subword.

vocab_size

vocab_size(self)

Returns: int. Size of the vocabulary.

subword_to_id

subword_to_id(self, subword)

Args:

subword: string.

Returns: Integer from the range [0, vocab_size-1]. The id of the subword or, if there is no such subword in the vocabulary, unk_id will be returned.

id_to_subword

id_to_subword(self, id)

Args:

id: int, must be in the range [0, vocab_size-1]

Returns: string. Subword from the vocabulary by id.
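The relationship between vocab, vocab_size, subword_to_id and id_to_subword can be summarized with a toy stand-in class. ToyVocab below is hypothetical and only mirrors the documented behaviour (index-based lookup and the unk_id fallback). The subword list, including the spelling of the special tokens, is invented for the example.

```python
class ToyVocab:
    """Illustrative stand-in for the lookup methods described above."""

    def __init__(self, subwords, unk_id=1):
        self._subwords = list(subwords)
        self._ids = {s: i for i, s in enumerate(self._subwords)}
        self._unk_id = unk_id

    def vocab(self):
        return list(self._subwords)

    def vocab_size(self):
        return len(self._subwords)

    def subword_to_id(self, subword):
        # Unknown subwords map to the reserved unk_id.
        return self._ids.get(subword, self._unk_id)

    def id_to_subword(self, id):
        if not 0 <= id < len(self._subwords):
            raise ValueError("id must be in the range [0, vocab_size-1]")
        return self._subwords[id]


# Invented vocabulary: four reserved entries followed by a few subwords.
toy = ToyVocab(["<PAD>", "<UNK>", "<BOS>", "<EOS>", "▁", "a", "b", "▁a", "ab"])

print(toy.vocab_size())          # 9
print(toy.subword_to_id("ab"))   # 8
print(toy.subword_to_id("zzz"))  # 1 (unk_id)
print(toy.id_to_subword(7))      # ▁a
```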

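decode

decode(self, ids, ignore_ids=None)

Converts each id to a subword and concatenates them, restoring space symbols.

Args:

ids: list of lists of integers. All integers must be in the range [0, vocab_size-1]
ignore_ids: collection of integers. These indices will be ignored during decoding. All integers must be in the range [0, vocab_size-1] [default: None]

Returns: List of strings.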
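The effect of ignore_ids is easiest to see on a toy vocabulary. The following sketch maps ids to subwords, skips the ignored ids, and restores spaces from the meta symbol. The function toy_decode and the vocabulary are hypothetical, and stripping the leading space is an assumption about the output format rather than documented behaviour.

```python
META = "\u2581"  # the meta symbol "▁"


def toy_decode(ids_batch, vocab, ignore_ids=None):
    """Turn lists of ids back into strings, skipping any ignored ids."""
    ignored = set(ignore_ids or ())
    sentences = []
    for ids in ids_batch:
        pieces = [vocab[i] for i in ids if i not in ignored]
        sentences.append("".join(pieces).replace(META, " ").strip())
    return sentences


# Invented vocabulary; ids 2 and 3 play the role of bos_id and eos_id.
vocab = ["<PAD>", "<UNK>", "<BOS>", "<EOS>", "▁fast", "▁token", "ization"]
ids = [[2, 4, 5, 6, 3]]

print(toy_decode(ids, vocab))                     # ['<BOS> fast tokenization<EOS>']
print(toy_decode(ids, vocab, ignore_ids=[2, 3]))  # ['fast tokenization']
```

Passing the reserved ids as ignore_ids is a convenient way to drop padding and sentence markers from decoded text.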

Command line interface

Example

$ yttm bpe --data TRAINING_DATA_FILE --model OUTPUT_MODEL_FILE --vocab_size 2000
$ yttm encode --model OUTPUT_MODEL_FILE --output_type subword < TEST_DATA_FILE > ENCODED_DATA

Supported commands

YouTokenToMe supports the following commands:

$ yttm --help

Usage: yttm [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  bpe     Train BPE model.
  decode  Decode ids to text.
  encode  Encode text to ids or subwords.
  vocab   Print list of learned subwords.

The bpe command allows you to train a Byte Pair Encoding model based on a text file.

$ yttm bpe --help

Usage: yttm bpe [OPTIONS]

  Train BPE model.

Options:
  --data PATH           Training data file path.  [required]
  --model PATH          Output model file path.  [required]
  --vocab_size INTEGER  Number of tokens in the final vocabulary.  [required]
  --coverage FLOAT      Fraction of characters covered by the model.  [default: 1.0]
  --n_threads INTEGER   Number of threads.  [default: -1]
  --pad_id INTEGER      Padding token id.  [default: 0]
  --unk_id INTEGER      Unknown token id.  [default: 1]
  --bos_id INTEGER      'Begin of sentence' token id.  [default: 2]
  --eos_id INTEGER      'End of sentence' token id.  [default: 3]
  --help                Show this message and exit.

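Apply BPE encoding to a corpus of sentences. Use stdin for input and stdout for output.

By default, encoding works in parallel using n_threads threads. The number of threads is limited by 8 (see benchmark).

With the --stream option, --n_threads will be ignored and all sentences will be processed one by one. Each sentence will be tokenized and written to stdout before the next sentence is read.

$ yttm encode --help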

Usage: yttm encode [OPTIONS]

  Encode text to ids or subwords.

Options:
  --model PATH         Path to file with learned model.  [required]
  --output_type TEXT   'id' or 'subword'.  [required]
  --n_threads INTEGER  Number of threads.  [default: -1]
  --bos                Add tab 'begin of sentence'.
  --eos                Add tab 'end of sentence'.
  --reverse            Reverse output sequence of tokens.
  --stream             Process each line before reading the next one.
  --dropout_prob       BPE-dropout probability (the probability of a merge being dropped). [default: 0]
  --help               Show this message and exit.
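For scripts that call the command-line tool from Python, the options listed above can be assembled programmatically. The sketch below is one possible wrapper: build_encode_cmd and encode_text are hypothetical helper names, only flags shown in the help output are used, and encode_text assumes that the yttm executable is on PATH and that input and output are UTF-8.

```python
import subprocess


def build_encode_cmd(model, output_type="subword", n_threads=-1, bos=False,
                     eos=False, reverse=False, stream=False, dropout_prob=0.0):
    """Build the argument list for `yttm encode` from keyword options."""
    if output_type not in ("id", "subword"):
        raise ValueError("output_type must be 'id' or 'subword'")
    cmd = ["yttm", "encode", "--model", str(model), "--output_type", output_type]
    if n_threads != -1:  # -1 is already the default
        cmd.append(f"--n_threads={n_threads}")
    for flag, enabled in (("--bos", bos), ("--eos", eos),
                          ("--reverse", reverse), ("--stream", stream)):
        if enabled:
            cmd.append(flag)
    if dropout_prob:
        cmd.append(f"--dropout_prob={dropout_prob}")
    return cmd


def encode_text(text, **options):
    """Run `yttm encode`, feeding `text` on stdin and returning its stdout."""
    result = subprocess.run(build_encode_cmd(**options), input=text,
                            capture_output=True, text=True,
                            encoding="utf-8", check=True)
    return result.stdout


print(build_encode_cmd("OUTPUT_MODEL_FILE", "id", bos=True, stream=True))
# ['yttm', 'encode', '--model', 'OUTPUT_MODEL_FILE', '--output_type', 'id', '--bos', '--stream']
```

A call such as encode_text("some text\n", model="OUTPUT_MODEL_FILE") would then return the encoded lines as a single string.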

Print the vocabulary. This can be useful for understanding the model.

$ yttm vocab --help

Usage: yttm vocab [OPTIONS]

  Print list of learned subwords.

Options:
  --model PATH  Path to file with learned model.  [required]
  --verbose     Add merging rules.
  --help        Show this message and exit.

Convert ids back to text. Use stdin for input and stdout for output.

$ yttm decode --help

Usage: yttm decode [OPTIONS]

  Decode ids to text.

Options:
  --model PATH  Path to file with learned model.  [required]
  --ignore_ids  List of indices to ignore for decoding. Example: --ignore_ids=1,2,3
  --help        Show this message and exit.

Additional information

Version: ved
Type: Other source code
Updated: 2025-04-17
Size: 57.54KB
Source: GitHub
