YouTokenToMe

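YouTokenToMe is an unsupervised text tokenizer focused on computational efficiency. It currently implements fast Byte Pair Encoding (BPE) [Sennrich et al.]. Our implementation is much faster in training and tokenization than Hugging Face, fastBPE and SentencePiece. In some test cases, it is 60 times faster. Check out our benchmark results.

Key advantages:

Multithreading for training and tokenization
The algorithm has O(N) complexity, where N is the length of the training data
Highly efficient implementation in C++
Python wrapper and command-line interface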
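
For contrast with the O(N) claim above, here is a minimal sketch of a naive BPE trainer in plain Python. It recounts every adjacent pair on each merge, so it is much slower than linear. It is not YouTokenToMe's algorithm; the function name `train_toy_bpe` and the tie-breaking rule are assumptions made for illustration only.

```python
from collections import Counter


def train_toy_bpe(words, n_merges):
    """Learn up to n_merges merge rules from a list of words (naive, not O(N))."""
    # Each word starts as a tuple of characters, prefixed with the "▁" meta symbol.
    corpus = Counter(tuple("\u2581" + w) for w in words)
    merges = []
    for _ in range(n_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties are broken by the pair itself (an arbitrary choice).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges


print(train_toy_bpe(["low", "low", "lower"], 3))
```

Because each word is handled separately, no learned pair can span a word boundary.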

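Extra features:

BPE-dropout (as described in Provilkov et al., 2019)

As in the algorithm from the original paper, ours does not consider tokens that cross word boundaries. Just like in SentencePiece, all space symbols are replaced by the meta symbol "▁" (U+2581). This allows a sequence of tokens to be converted back to text and word boundaries to be restored.

For example, the phrase Blazingly fast tokenization! can be tokenized into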

['▁Bl', 'az', 'ingly', '▁fast', '▁token', 'ization', '!']
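
The round trip between text and tokens can be sketched without the library. The helpers `mark_boundaries` and `restore_text` below are hypothetical names, and the sketch assumes a single "▁" is prepended to the text, which matches the example tokens above.

```python
META = "\u2581"  # the "▁" meta symbol


def mark_boundaries(text):
    """Replace spaces with the meta symbol and mark the start of the first word."""
    return META + text.replace(" ", META)


def restore_text(tokens):
    """Concatenate tokens and turn meta symbols back into spaces."""
    return "".join(tokens).replace(META, " ").lstrip(" ")


tokens = ["▁Bl", "az", "ingly", "▁fast", "▁token", "ization", "!"]
print(restore_text(tokens))
```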

Installation

pip install youtokentome

Python interface

Example

Let's start with a self-contained example.

import random

import youtokentome as yttm

train_data_path = "train_data.txt"
model_path = "example.model"

# Generating random file with training data
# 10000 lines with 100 characters in each line
n_lines = 10000
n_characters = 100
with open(train_data_path, "w") as fout:
    for _ in range(n_lines):
        print("".join([random.choice("abcd ") for _ in range(n_characters)]), file=fout)

# Generating random text
test_text = "".join([random.choice("abcde ") for _ in range(100)])

# Training model
yttm.BPE.train(data=train_data_path, vocab_size=5000, model=model_path)

# Loading model
bpe = yttm.BPE(model=model_path)

# Two types of tokenization
print(bpe.encode([test_text], output_type=yttm.OutputType.ID))
print(bpe.encode([test_text], output_type=yttm.OutputType.SUBWORD))

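Training model

youtokentome.BPE.train(data, model, vocab_size, coverage, n_threads=-1, pad_id=0, unk_id=1, bos_id=2, eos_id=3)

Trains a BPE model and saves it to a file.

Args:

data: string, path to the file with training data
model: string, path to where the trained model will be saved
vocab_size: int, number of tokens in the final vocabulary
coverage: float, fraction of characters covered by the model. Must be in the range [0, 1]. A good value to use is about 0.9999.
n_threads: int, number of parallel threads used to run. If -1 is passed, then all available threads are going to be used. Note that the number of threads is limited by 8 (see benchmark).
pad_id: int, reserved id for padding
unk_id: int, reserved id for unknown symbols
bos_id: int, reserved id for the begin-of-sentence token
eos_id: int, reserved id for the end-of-sentence token

Returns: Class youtokentome.BPE with the loaded model.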
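
One way to picture the coverage parameter is as a cut-off on the character alphabet: keep the most frequent characters until they account for the requested fraction of the text, and treat the rest as unknown. The sketch below follows that reading. It is an assumption about the intent of the parameter, not the library's actual selection code, and `covered_alphabet` is a hypothetical helper.

```python
from collections import Counter


def covered_alphabet(text, coverage):
    """Return the most frequent characters whose combined share of text reaches coverage."""
    counts = Counter(text)
    total = sum(counts.values())
    kept, seen = set(), 0
    # Sort by descending count, then by character, so ties break deterministically.
    for char, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if seen >= coverage * total:
            break
        kept.add(char)
        seen += count
    return kept


# 6 x "a", 1 x "b", 1 x "c": with coverage 0.75 only "a" is needed.
print(sorted(covered_alphabet("aaaaaabc", 0.75)))
```

Under this reading, a value such as 0.9999 drops only very rare characters, which keeps stray symbols from taking up vocabulary slots.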

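Model loading

youtokentome.BPE(model, n_threads=-1)

Class constructor. Loads the trained model.

model: string, path to the trained model
n_threads: int, number of parallel threads used to run. If equal to -1, then the maximum number of threads available will be used.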
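
The n_threads convention (-1 means "use everything available", with a cap of 8 mentioned for training) can be sketched as a small resolver. `resolve_n_threads` is a hypothetical helper, and applying the cap to explicit requests as well is an assumption; the library may resolve the value differently.

```python
import os

MAX_THREADS = 8  # cap stated in the documentation; assumed to apply to every request


def resolve_n_threads(n_threads=-1):
    """Map the documented n_threads convention to a concrete thread count."""
    available = os.cpu_count() or 1  # cpu_count() may return None
    requested = available if n_threads == -1 else n_threads
    return max(1, min(requested, MAX_THREADS))


print(resolve_n_threads(-1))
```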

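Methods

Class youtokentome.BPE has the following methods:

encode

encode(self, sentences, output_type=yttm.OutputType.ID, bos=False, eos=False, reverse=False, dropout_prob=0)

Args:

sentences: list of strings, sentences for tokenization.
output_type: enum, sentences can be tokenized to ids or subwords. Use OutputType.ID for ids and OutputType.SUBWORD for subwords.
bos: bool, if True then the "beginning of sentence" token will be added
eos: bool, if True then the "end of sentence" token will be added
reverse: bool, if True the output sequence of tokens will be reversed
dropout_prob: float, BPE-dropout probability (the probability of a merge being dropped). Must be in the range [0, 1].

Returns: If output_type is equal to youtokentome.OutputType.ID or youtokentome.OutputType.SUBWORD, then a list of lists of integers or a list of lists of strings will be returned respectively.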
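
To make dropout_prob concrete, here is a toy encoder in the spirit of BPE-dropout: on every pass, each applicable merge is skipped with probability dropout_prob, and encoding stops when no merge survives. This is a sketch under stated assumptions, not the library's C++ implementation; `encode_word` and the hand-written merge table are hypothetical.

```python
import random


def encode_word(word, merges, dropout_prob=0.0, rng=random):
    """Apply merge rules by priority, dropping each candidate with dropout_prob."""
    symbols = list("\u2581" + word)
    ranks = {pair: i for i, pair in enumerate(merges)}  # lower rank = learned earlier
    while True:
        # Collect the applicable merges that survive dropout on this pass.
        candidates = [
            (ranks[pair], i)
            for i, pair in enumerate(zip(symbols, symbols[1:]))
            if pair in ranks and rng.random() >= dropout_prob
        ]
        if not candidates:
            return symbols
        _, i = min(candidates)  # highest-priority surviving merge
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]


merges = [("\u2581", "l"), ("\u2581l", "o"), ("\u2581lo", "w")]
print(encode_word("low", merges))                    # no dropout: fully merged
print(encode_word("low", merges, dropout_prob=1.0))  # every merge dropped
```

With dropout_prob=0 the segmentation is deterministic. Values in between produce different segmentations of the same word from call to call.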

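vocab

vocab(self)

Returns: A list of vocab_size strings. The i-th string in the list corresponds to the i-th subword.

vocab_size

vocab_size(self)

Returns: int. Size of the vocabulary.

subword_to_id

subword_to_id(self, subword)

Args:

subword: string.

Returns: Integer in the range [0, vocab_size-1]. The id of the subword or, if there is no such subword in the vocabulary, unk_id.

id_to_subword

id_to_subword(self, id)

Args:

id: int, must be in the range [0, vocab_size-1]

Returns: string. The subword from the vocabulary with the given id.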
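
The lookup semantics of these methods can be mimicked with a small stand-in class. `ToyVocab` is hypothetical and only mirrors the documented behavior (unknown subwords map to unk_id, ids must be in range); the special-token strings are placeholders.

```python
class ToyVocab:
    """Stand-in that mirrors the documented lookup behavior, not the real class."""

    def __init__(self, subwords, unk_id=1):
        self.subwords = list(subwords)
        self.unk_id = unk_id
        self.index = {s: i for i, s in enumerate(self.subwords)}

    def vocab_size(self):
        return len(self.subwords)

    def subword_to_id(self, subword):
        # Unknown subwords fall back to the reserved unk_id.
        return self.index.get(subword, self.unk_id)

    def id_to_subword(self, id):
        if not 0 <= id < len(self.subwords):
            raise ValueError("id must be in the range [0, vocab_size-1]")
        return self.subwords[id]


vocab = ToyVocab(["<PAD>", "<UNK>", "<BOS>", "<EOS>", "\u2581", "a"])
print(vocab.subword_to_id("a"), vocab.subword_to_id("zzz"))
```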

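decode

decode(self, ids, ignore_ids=None)

Convert each id to a subword and concatenate with the space symbol.

Args:

ids: list of integers. All integers must be in the range [0, vocab_size-1]
ignore_ids: collection of integers. These indices will be ignored during decoding. All integers must be in the range [0, vocab_size-1] [default: None]

Returns: List of strings.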
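
The effect of ignore_ids is easiest to see on a single sequence. The sketch below decodes one list of ids against a hand-written table and skips the ids to ignore. `decode_ids` and the table are hypothetical, and the real method returns a list of strings rather than one string.

```python
def decode_ids(ids, id_to_subword, ignore_ids=None):
    """Decode one sequence of ids to text, skipping any id listed in ignore_ids."""
    ignore = set(ignore_ids or ())
    pieces = [id_to_subword[i] for i in ids if i not in ignore]
    # Join the subwords, then turn the "▁" meta symbol back into spaces.
    return "".join(pieces).replace("\u2581", " ").lstrip(" ")


table = ["<PAD>", "<UNK>", "<BOS>", "<EOS>", "\u2581fast", "\u2581token", "ization"]
print(decode_ids([2, 4, 5, 6, 3], table, ignore_ids={2, 3}))
```

Passing the bos and eos ids as ignore_ids is a typical use: the markers added during encoding do not leak into the decoded text.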

Command line interface

Examples

$ yttm bpe --data TRAINING_DATA_FILE --model OUTPUT_MODEL_FILE --vocab_size 2000
$ yttm encode --model OUTPUT_MODEL_FILE --output_type subword < TEST_DATA_FILE > ENCODED_DATA

Supported commands

YouTokenToMe supports the following commands:

$ yttm --help

Usage: yttm [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  bpe     Train BPE model.
  decode  Decode ids to text.
  encode  Encode text to ids or subwords.
  vocab   Print list of learned subwords.

The bpe command allows you to train a Byte Pair Encoding model based on a text file.

$ yttm bpe --help

Usage: yttm bpe [OPTIONS]

  Train BPE model.

Options:
  --data PATH           Training data file path.  [required]
  --model PATH          Output model file path.  [required]
  --vocab_size INTEGER  Number of tokens in the final vocabulary.  [required]
  --coverage FLOAT      Fraction of characters covered by the model.  [default: 1.0]
  --n_threads INTEGER   Number of threads.  [default: -1]
  --pad_id INTEGER      Padding token id.  [default: 0]
  --unk_id INTEGER      Unknown token id.  [default: 1]
  --bos_id INTEGER      'Begin of sentence' token id.  [default: 2]
  --eos_id INTEGER      'End of sentence' token id.  [default: 3]
  --help                Show this message and exit.

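Apply BPE encoding to a corpus of sentences. Use stdin for input and stdout for output.

By default, encoding works in parallel using n_threads threads. The number of threads is limited by 8 (see benchmark).

With the --stream option, --n_threads will be ignored and all sentences will be processed one by one. Each sentence will be tokenized and written to stdout before the next sentence is read.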
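
The difference that --stream makes is a matter of ordering: each input line is fully processed and flushed before the next one is read. The sketch below shows that pattern in Python with a placeholder tokenizer. `encode_stream` is hypothetical, and `str.split` stands in for a real BPE encoder.

```python
import io


def encode_stream(lines, tokenize, out):
    """Tokenize and write each line before the next one is read from lines."""
    for line in lines:  # file objects yield one line at a time
        tokens = tokenize(line.rstrip("\n"))
        out.write(" ".join(tokens) + "\n")
        out.flush()  # make the result visible immediately


source = io.StringIO("blazingly fast\ntokenization\n")
sink = io.StringIO()
encode_stream(source, str.split, sink)
print(sink.getvalue(), end="")
```

In a real pipeline `lines` would be `sys.stdin` and `out` would be `sys.stdout`, which suits interactive use where input arrives slowly.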

$ yttm encode --help

Usage: yttm encode [OPTIONS]

  Encode text to ids or subwords.

Options:
  --model PATH         Path to file with learned model.  [required]
  --output_type TEXT   'id' or 'subword'.  [required]
  --n_threads INTEGER  Number of threads.  [default: -1]
  --bos                Add tab 'begin of sentence'.
  --eos                Add tab 'end of sentence'.
  --reverse            Reverse output sequence of tokens.
  --stream             Process each line before reading the next one.
  --dropout_prob       BPE-dropout probability (the probability of a merge being dropped). [default: 0]
  --help               Show this message and exit.

Print vocabulary. This can be useful for understanding the model.

$ yttm vocab --help

Usage: yttm vocab [OPTIONS]

  Print list of learned subwords.

Options:
  --model PATH  Path to file with learned model.  [required]
  --verbose     Add merging rules.
  --help        Show this message and exit.

Convert ids back to text. Use stdin for input and stdout for output.

$ yttm decode --help

Usage: yttm decode [OPTIONS]

  Decode ids to text.

Options:
  --model PATH  Path to file with learned model.  [required]
  --ignore_ids  List of indices to ignore for decoding. Example: --ignore_ids=1,2,3
  --help        Show this message and exit.

Additional information

Version: ved
Type: other source code
Updated: 2025-04-17
Size: 57.54KB
Source: GitHub
