YouTokenToMe

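YouTokenToMe is an unsupervised text tokenizer focused on computational efficiency. It currently implements fast Byte Pair Encoding (BPE) [Sennrich et al.]. Our implementation is much faster in training and tokenization than Hugging Face, fastBPE and SentencePiece. In some test cases, it is 60 times faster. Check out our benchmark results.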

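Key advantages:

Multithreading for training and tokenization
The algorithm has O(N) complexity, where N is the length of the training data
Highly efficient implementation in C++
Python wrapper and command-line interface

Extra features:

BPE-dropout (as described in Provilkov et al., 2019)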

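As in the algorithm from the original paper, ours does not consider tokens that cross word boundaries. Just like in SentencePiece, all space symbols are replaced by the meta symbol "▁" (U+2581). This allows sequences of tokens to be converted back to text and word boundaries to be restored.

For example, the phrase Blazingly fast tokenization! can be tokenized into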

['▁Bl', 'az', 'ingly', '▁fast', '▁token', 'ization', '!']
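
Because every word-initial token carries the "▁" marker, the text and its word boundaries can be recovered from the tokens alone. The snippet below is a minimal pure-Python sketch of that idea. The helper names detokenize and group_by_word are illustrative and are not part of the YouTokenToMe API.

```python
META = "\u2581"  # the "▁" meta symbol that stands in for a space

def detokenize(tokens):
    """Concatenate subword tokens and turn the meta symbol back into spaces."""
    return "".join(tokens).replace(META, " ").lstrip(" ")

def group_by_word(tokens):
    """Group tokens into words: a token starting with the meta symbol opens a new word."""
    words = []
    for token in tokens:
        if token.startswith(META) or not words:
            words.append([token])
        else:
            words[-1].append(token)
    return words

tokens = ["▁Bl", "az", "ingly", "▁fast", "▁token", "ization", "!"]
text = detokenize(tokens)
words = group_by_word(tokens)
print(text)
print(words)
```

Note that the trailing "!" stays attached to the last word group, since it does not start with the meta symbol.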

Installation

pip install youtokentome

Python interface

Example

Let's start with a self-contained example.

import random

import youtokentome as yttm

train_data_path = "train_data.txt"
model_path = "example.model"

# Generate a random file with training data:
# 10000 lines with 100 characters in each line
n_lines = 10000
n_characters = 100
with open(train_data_path, "w") as fout:
    for _ in range(n_lines):
        print("".join([random.choice("abcd ") for _ in range(n_characters)]), file=fout)

# Generate random test text
test_text = "".join([random.choice("abcde ") for _ in range(100)])

# Train the model
yttm.BPE.train(data=train_data_path, vocab_size=5000, model=model_path)

# Load the model
bpe = yttm.BPE(model=model_path)

# Two types of tokenization
print(bpe.encode([test_text], output_type=yttm.OutputType.ID))
print(bpe.encode([test_text], output_type=yttm.OutputType.SUBWORD))

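Training model

youtokentome.BPE.train(data, model, vocab_size, coverage, n_threads=-1, pad_id=0, unk_id=1, bos_id=2, eos_id=3)

Trains a BPE model and saves it to a file.

Args:

data: string, path to the file with training data
model: string, path where the trained model will be saved
vocab_size: int, number of tokens in the final vocabulary
coverage: float, fraction of characters covered by the model. Must be in the range [0, 1]. A good value to use is about 0.9999.
n_threads: int, number of parallel threads used to run. If -1 is passed, all available threads will be used. Note that the number of threads is limited to 8 (see the benchmark).
pad_id: int, reserved id for padding
unk_id: int, reserved id for unknown symbols
bos_id: int, reserved id for the beginning-of-sentence token
eos_id: int, reserved id for the end-of-sentence token

Returns: an instance of class youtokentome.BPE with the loaded model.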
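
To make the coverage argument more concrete, the sketch below picks the most frequent characters of a text until their cumulative share reaches the requested fraction. Rarer characters left outside that set would then be treated as unknown. This illustrates the concept under that assumption only. It is not YouTokenToMe's actual selection code, and the function name covered_characters is hypothetical.

```python
from collections import Counter

def covered_characters(text, coverage):
    """Return the most frequent characters whose cumulative share of `text` reaches `coverage`."""
    counts = Counter(text)
    total = sum(counts.values())
    kept, running = set(), 0
    # Most frequent first; ties are broken by character for determinism
    for char, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if running >= coverage * total:
            break
        kept.add(char)
        running += count
    return kept

# Toy corpus: 'a' is very common, 'b' is less common, 'z' is rare
sample = "a" * 9000 + "b" * 950 + "z" * 50
kept_all = covered_characters(sample, 1.0)
kept_most = covered_characters(sample, 0.99)
print(sorted(kept_all), sorted(kept_most))
```

With full coverage every character is kept, while a coverage of 0.99 leaves out the rare "z" in this toy corpus.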

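Model loading

youtokentome.BPE(model, n_threads=-1)

Class constructor. Loads the trained model.

model: string, path to the trained model
n_threads: int, number of parallel threads used to run. If equal to -1, the maximum number of available threads will be used.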

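Methods

Class youtokentome.BPE has the following methods:

encode

encode(self, sentences, output_type=yttm.OutputType.ID, bos=False, eos=False, reverse=False, dropout_prob=0)

Args:

sentences: list of strings, sentences to tokenize.
output_type: enum, a sentence can be tokenized to ids or subwords. Use OutputType.ID for ids and OutputType.SUBWORD for subwords.
bos: bool, if True, the "beginning of sentence" token will be added
eos: bool, if True, the "end of sentence" token will be added
reverse: bool, if True, the output sequence of tokens will be reversed
dropout_prob: float, BPE-dropout probability (the probability of a merge being dropped). Must be in the range [0, 1].

Returns: if output_type is equal to youtokentome.OutputType.ID or youtokentome.OutputType.SUBWORD, a list of lists of integers or a list of lists of strings will be returned, respectively.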
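
The effect of dropout_prob can be illustrated with a toy version of BPE-dropout as described by Provilkov et al.: at every merge step, each applicable merge is dropped with probability dropout_prob, and the highest-priority surviving merge is applied. The sketch below is a small pure-Python illustration, not YouTokenToMe's C++ implementation. The function bpe_encode and the merge table are made up for the example.

```python
import random

def bpe_encode(word, merges, dropout_prob=0.0, rng=random):
    """Toy BPE segmentation of one word with optional BPE-dropout.

    `merges` is a list of (left, right) pairs ordered from highest to lowest priority.
    """
    rank = {pair: i for i, pair in enumerate(merges)}
    tokens = list(word)
    while True:
        # Positions whose adjacent pair is a known merge and survives dropout
        candidates = [
            (rank[(a, b)], i)
            for i, (a, b) in enumerate(zip(tokens, tokens[1:]))
            if (a, b) in rank and rng.random() >= dropout_prob
        ]
        if not candidates:
            return tokens
        # Apply the highest-priority (lowest rank) surviving merge
        _, i = min(candidates)
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]

# Hypothetical merge table, highest priority first
merges = [("▁", "f"), ("a", "s"), ("▁f", "as"), ("▁fas", "t")]
full = bpe_encode("▁fast", merges)                      # no dropout
chars = bpe_encode("▁fast", merges, dropout_prob=1.0)   # every merge dropped
sampled = bpe_encode("▁fast", merges, dropout_prob=0.5, rng=random.Random(0))
print(full, chars, sampled)
```

With dropout_prob=0 the segmentation is deterministic, with dropout_prob=1 the word falls apart into single characters, and values in between give randomly varying segmentations of the same word.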

vocab

vocab(self)

Returns: a list of vocab_size strings. The i-th string in the list corresponds to the i-th subword.

vocab_size

vocab_size(self)

Returns: int. Size of the vocabulary.

subword_to_id

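subword_to_id(self, subword)

Args:

subword: string.

Returns: an integer in the range [0, vocab_size-1]. The id of the subword or, if there is no such subword in the vocabulary, unk_id.

id_to_subword

id_to_subword(self, id)

Args:

id: int, must be in the range [0, vocab_size-1]

Returns: string. The subword from the vocabulary with the given id.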

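decode

decode(self, ids, ignore_ids=None)

Converts each id to a subword and concatenates them with the space symbol.

Args:

ids: list of integers. All integers must be in the range [0, vocab_size-1]
ignore_ids: collection of integers. These indices will be ignored during decoding. All integers must be in the range [0, vocab_size-1] [default: None]

Returns: list of strings.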
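
The role of ignore_ids can be sketched on a single id sequence with a toy vocabulary: ignored ids (typically the reserved padding, BOS and EOS ids) are skipped before the remaining subwords are joined. The function decode_ids, the vocabulary and the spelling of the reserved tokens are hypothetical. The sketch also assumes that the meta symbol is turned back into a space, and it does not reproduce the library's implementation.

```python
def decode_ids(ids, id_to_subword, ignore_ids=None):
    """Map ids to subwords, skip ignored ids, and join the rest into text."""
    ignore = set(ignore_ids or ())
    pieces = [id_to_subword[i] for i in ids if i not in ignore]
    return "".join(pieces).replace("\u2581", " ").lstrip(" ")

# Hypothetical vocabulary: ids 0-3 mirror the default reserved ids (pad, unk, bos, eos)
toy_vocab = ["<PAD>", "<UNK>", "<BOS>", "<EOS>", "▁fast", "▁token", "ization", "!"]
ids = [2, 4, 5, 6, 7, 3, 0, 0]  # BOS, text, EOS, padding

decoded = decode_ids(ids, toy_vocab, ignore_ids=[0, 2, 3])
raw = decode_ids(ids, toy_vocab)
print(decoded)
print(raw)
```

Without ignore_ids the reserved tokens appear verbatim in the output, which is usually not wanted when decoding padded batches.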

Command line interface

Example

$ yttm bpe --data TRAINING_DATA_FILE --model OUTPUT_MODEL_FILE --vocab_size 2000
$ yttm encode --model OUTPUT_MODEL_FILE --output_type subword < TEST_DATA_FILE > ENCODED_DATA

Supported commands

YouTokenToMe supports the following commands:

$ yttm --help

Usage: yttm [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  bpe     Train BPE model.
  decode  Decode ids to text.
  encode  Encode text to ids or subwords.
  vocab   Print list of learned subwords.

The bpe command allows you to train a Byte Pair Encoding model based on a text file.

$ yttm bpe --help

Usage: yttm bpe [OPTIONS]

  Train BPE model.

Options:
  --data PATH           Training data file path.  [required]
  --model PATH          Output model file path.  [required]
  --vocab_size INTEGER  Number of tokens in the final vocabulary.  [required]
  --coverage FLOAT      Fraction of characters covered by the model.  [default: 1.0]
  --n_threads INTEGER   Number of threads.  [default: -1]
  --pad_id INTEGER      Padding token id.  [default: 0]
  --unk_id INTEGER      Unknown token id.  [default: 1]
  --bos_id INTEGER      'Begin of sentence' token id.  [default: 2]
  --eos_id INTEGER      'End of sentence' token id.  [default: 3]
  --help                Show this message and exit.

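The encode command applies BPE encoding to a corpus of sentences. It uses stdin for input and stdout for output.

By default, encoding runs in parallel using n_threads threads. The number of threads is limited to 8 (see the benchmark).

With the --stream option, --n_threads is ignored and all sentences are processed one by one. Each sentence is tokenized and written to stdout before the next sentence is read.

$ yttm encode --help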

Usage: yttm encode [OPTIONS]

  Encode text to ids or subwords.

Options:
  --model PATH         Path to file with learned model.  [required]
  --output_type TEXT   'id' or 'subword'.  [required]
  --n_threads INTEGER  Number of threads.  [default: -1]
  --bos                Add tab 'begin of sentence'.
  --eos                Add tab 'end of sentence'.
  --reverse            Reverse output sequence of tokens.
  --stream             Process each line before reading the next one.
  --dropout_prob       BPE-dropout probability (the probability of a merge being dropped). [default: 0]
  --help               Show this message and exit.

The vocab command prints the vocabulary. This can be useful for understanding the model.

$ yttm vocab --help

Usage: yttm vocab [OPTIONS]

  Print list of learned subwords.

Options:
  --model PATH  Path to file with learned model.  [required]
  --verbose     Add merging rules.
  --help        Show this message and exit.

The decode command converts ids back to text. It uses stdin for input and stdout for output.

$ yttm decode --help

Usage: yttm decode [OPTIONS]

  Decode ids to text.

Options:
  --model PATH  Path to file with learned model.  [required]
  --ignore_ids  List of indices to ignore for decoding. Example: --ignore_ids=1,2,3
  --help        Show this message and exit.
