YouTokenToMe

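YouTokenToMe is an unsupervised text tokenizer focused on computational efficiency. It currently implements fast Byte Pair Encoding (BPE) [Sennrich et al.]. Our implementation is much faster in training and tokenization than Hugging Face, fastBPE and SentencePiece. In some test cases, it is 60 times faster. Check out our benchmark results.

Key advantages:

Multithreading for training and tokenization
The algorithm has O(N) complexity, where N is the length of the training data
Highly efficient implementation in C++
Python wrapper and command-line interface

Extra features:

BPE-dropout (as described in Provilkov et al, 2019)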
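
To make the BPE-dropout idea concrete, here is a minimal pure-Python sketch. It is a simplified illustration and not YouTokenToMe's C++ implementation: the function name bpe_encode, the toy merge table MERGES, and the early-stop rule are all assumptions made for this example. At each step every applicable merge is dropped with probability dropout_prob, and the highest-priority surviving merge is applied.

```python
import random

# Toy merge table in priority order (hypothetical, for illustration only).
MERGES = [("l", "o"), ("lo", "w"), ("e", "r")]


def bpe_encode(word, merges, dropout_prob=0.0, rng=random):
    """Greedy BPE over one word, dropping each candidate merge with dropout_prob."""
    rank = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        # Collect applicable merges; each one survives with probability 1 - dropout_prob.
        candidates = [
            (rank[(a, b)], i)
            for i, (a, b) in enumerate(zip(symbols, symbols[1:]))
            if (a, b) in rank and rng.random() >= dropout_prob
        ]
        if not candidates:
            break  # nothing left to merge in this step
        _, i = min(candidates)  # highest-priority surviving merge
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


print(bpe_encode("lower", MERGES))                    # no dropout: deterministic BPE
print(bpe_encode("lower", MERGES, dropout_prob=1.0))  # every merge dropped: characters
print(bpe_encode("lower", MERGES, dropout_prob=0.5))  # varies from run to run
```

With dropout_prob=0 the sketch reduces to ordinary greedy BPE, and with dropout_prob=1 the word stays split into single characters, which matches the two ends of the [0, 1] range.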

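As in the algorithm from the original paper, ours does not consider tokens that cross word boundaries. Just like in SentencePiece, all space symbols are replaced by the meta symbol "▁" (U+2581). This allows a sequence of tokens to be converted back to text and word boundaries to be restored.

For example, the phrase Blazingly fast tokenization! can be tokenized into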

['▁Bl', 'az', 'ingly', '▁fast', '▁token', 'ization', '!']
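
The role of the meta symbol can be sketched in a few lines of plain Python. This only illustrates why the convention makes tokenization reversible; the helper names to_marked and from_tokens are hypothetical and are not part of the library.

```python
META = "\u2581"  # the meta symbol "▁" that stands in for a space


def to_marked(text):
    """Prefix every word with the meta symbol so word boundaries survive splitting."""
    return "".join(META + word for word in text.split())


def from_tokens(tokens):
    """Concatenate tokens, turn meta symbols back into spaces, drop the leading space."""
    return "".join(tokens).replace(META, " ").lstrip(" ")


tokens = ["▁Bl", "az", "ingly", "▁fast", "▁token", "ization", "!"]
print(to_marked("Blazingly fast tokenization!"))
print(from_tokens(tokens))
```

Because each word carries its own boundary marker, the token list can be joined back into the original phrase without any extra alignment information.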

Installation

pip install youtokentome

Python interface

Example

Let's start with a self-contained example.

import random

import youtokentome as yttm

train_data_path = "train_data.txt"
model_path = "example.model"

# Generating random file with training data
# 10000 lines with 100 characters in each line
n_lines = 10000
n_characters = 100
with open(train_data_path, "w") as fout:
    for _ in range(n_lines):
        print("".join([random.choice("abcd ") for _ in range(n_characters)]), file=fout)

# Generating random text
test_text = "".join([random.choice("abcde ") for _ in range(100)])

# Training model
yttm.BPE.train(data=train_data_path, vocab_size=5000, model=model_path)

# Loading model
bpe = yttm.BPE(model=model_path)

# Two types of tokenization
print(bpe.encode([test_text], output_type=yttm.OutputType.ID))
print(bpe.encode([test_text], output_type=yttm.OutputType.SUBWORD))

Training model

youtokentome.BPE.train(data, model, vocab_size, coverage, n_threads=-1, pad_id=0, unk_id=1, bos_id=2, eos_id=3)

Trains a BPE model and saves it to a file.

Args:

data: string, path to the file with training data
model: string, path where the trained model will be saved
vocab_size: int, number of tokens in the final vocabulary
coverage: float, fraction of characters covered by the model. Must be in the range [0, 1]. A good value to use is about 0.9999.
n_threads: int, number of parallel threads used to run. If -1 is passed, all available threads are used. Note that the number of threads is limited to 8 (see benchmark).
pad_id: int, reserved id for padding
unk_id: int, reserved id for unknown symbols
bos_id: int, reserved id for the begin-of-sentence token
eos_id: int, reserved id for the end-of-sentence token

Returns: Class youtokentome.BPE with the loaded model.
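
One plausible reading of the coverage parameter is sketched below. This is an assumption about its meaning, not a description of the library's internals: the sketch keeps the most frequent characters until they account for the requested fraction of all characters in the data, so that anything rarer would fall outside the alphabet. The function name covered_alphabet is hypothetical.

```python
from collections import Counter


def covered_alphabet(text, coverage):
    """Return the most frequent characters that together cover `coverage` of the text."""
    counts = Counter(text)
    total = sum(counts.values())
    kept, seen = [], 0
    for ch, n in counts.most_common():
        if seen >= coverage * total:
            break  # requested fraction already reached
        kept.append(ch)
        seen += n
    return kept


sample = "a" * 95 + "b" * 4 + "c"      # 100 characters, heavily skewed
print(covered_alphabet(sample, 0.9))   # only the dominant character is needed
print(covered_alphabet(sample, 1.0))   # full coverage keeps every character
```

Under this reading, a value slightly below 1 such as 0.9999 would leave out only extremely rare characters, which is consistent with the recommendation above.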

Model loading

youtokentome.BPE(model, n_threads=-1)

Class constructor. Loads the trained model.

model: string, path to the trained model
n_threads: int, number of parallel threads used to run. If equal to -1, the maximum number of available threads is used.

Methods

Class youtokentome.BPE has the following methods:

encode

encode(self, sentences, output_type=yttm.OutputType.ID, bos=False, eos=False, reverse=False, dropout_prob=0)

Args:

sentences: list of strings, sentences for tokenization.
output_type: enum, sentences can be tokenized to ids or subwords. Use OutputType.ID for ids and OutputType.SUBWORD for subwords.
bos: bool, if True then the "beginning of sentence" token is added
eos: bool, if True then the "end of sentence" token is added
reverse: bool, if True the output sequence of tokens is reversed
dropout_prob: float, BPE-dropout probability (the probability of a merge being dropped). Must be in the range [0, 1].

Returns: If output_type is equal to youtokentome.OutputType.ID or youtokentome.OutputType.SUBWORD, then a list of lists of integers or a list of lists of strings is returned, respectively.

vocab

vocab(self)

Returns: A list of vocab_size strings. The i-th string in the list corresponds to the i-th subword.

vocab_size

vocab_size(self)

Returns: int. Size of the vocabulary.

subword_to_id

subword_to_id(self, subword)

Args:

subword: string.

Returns: Integer in the range [0, vocab_size-1]. The id of the subword or, if there is no such subword in the vocabulary, unk_id.

id_to_subword

id_to_subword(self, id)

Args:

id: int, must be in the range [0, vocab_size-1]

Returns: string. The subword from the vocabulary with the given id.
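
The contract of these four lookup methods can be mirrored with a small stand-in class. This is a toy model for illustration only: ToyVocab and the placeholder special-token strings are assumptions, and the error raised for an out-of-range id is a choice made for the sketch, since the behaviour in that case is not described here.

```python
class ToyVocab:
    """Toy stand-in that mirrors the documented lookup contract."""

    def __init__(self, subwords, unk_id=1):
        self._subwords = list(subwords)
        self._index = {s: i for i, s in enumerate(self._subwords)}
        self._unk_id = unk_id

    def vocab(self):
        return list(self._subwords)  # the i-th string is the i-th subword

    def vocab_size(self):
        return len(self._subwords)

    def subword_to_id(self, subword):
        # Unknown subwords fall back to unk_id, as described above.
        return self._index.get(subword, self._unk_id)

    def id_to_subword(self, id):
        if not 0 <= id < len(self._subwords):
            raise ValueError("id must be in the range [0, vocab_size-1]")
        return self._subwords[id]


# Ids 0-3 follow the default pad/unk/bos/eos ids; the token strings are placeholders.
toy = ToyVocab(["<PAD>", "<UNK>", "<BOS>", "<EOS>", "▁", "a", "b", "▁a"])
print(toy.vocab_size())
print(toy.subword_to_id("▁a"), toy.subword_to_id("zzz"))
print(toy.id_to_subword(5))
```

The round trip id_to_subword(subword_to_id(s)) returns s for every subword in the vocabulary, while any string outside it maps to unk_id.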

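decode

decode(self, ids, ignore_ids=None)

Converts each id to a subword and concatenates the subwords, using the space symbol.

Args:

ids: list of lists of integers. All integers must be in the range [0, vocab_size-1]
ignore_ids: collection of integers. These indices are ignored during decoding. All integers must be in the range [0, vocab_size-1] [default: None]

Returns: List of strings.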
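
The effect of ignore_ids can be illustrated with a toy decoder. The sketch below is not the library's code: TOY_VOCAB, its placeholder special-token strings, and toy_decode are hypothetical, and the joining rule (concatenate, then turn "▁" into a space) is an assumption based on the meta-symbol convention described earlier.

```python
META = "\u2581"  # "▁"

# Hypothetical vocabulary: ids 0-3 are the reserved pad/unk/bos/eos ids.
TOY_VOCAB = ["<PAD>", "<UNK>", "<BOS>", "<EOS>", META + "fast", META + "token", "ization"]


def toy_decode(ids, ignore_ids=None):
    """Decode each list of ids to a string, skipping any id listed in ignore_ids."""
    ignored = set(ignore_ids or ())
    decoded = []
    for sentence in ids:
        pieces = [TOY_VOCAB[i] for i in sentence if i not in ignored]
        decoded.append("".join(pieces).replace(META, " ").strip())
    return decoded


encoded = [[2, 4, 5, 6, 3]]                    # <BOS> ▁fast ▁token ization <EOS>
print(toy_decode(encoded))                     # special tokens show up in the text
print(toy_decode(encoded, ignore_ids=[2, 3]))  # bos/eos filtered out
```

Passing the reserved ids as ignore_ids is a convenient way to strip padding and sentence markers from decoded text.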

Command line interface

Example

$ yttm bpe --data TRAINING_DATA_FILE --model OUTPUT_MODEL_FILE --vocab_size 2000
$ yttm encode --model OUTPUT_MODEL_FILE --output_type subword < TEST_DATA_FILE > ENCODED_DATA

Supported commands

YouTokenToMe supports the following commands:

 $ yttm --help

Usage: yttm [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  bpe     Train BPE model.
  decode  Decode ids to text.
  encode  Encode text to ids or subwords.
  vocab   Print list of learned subwords.

The bpe command trains a Byte Pair Encoding model based on a text file.

 $ yttm bpe --help

Usage: yttm bpe [OPTIONS]

  Train BPE model.

Options:
  --data PATH           Training data file path.  [required]
  --model PATH          Output model file path.  [required]
  --vocab_size INTEGER  Number of tokens in the final vocabulary.  [required]
  --coverage FLOAT      Fraction of characters covered by the model.  [default: 1.0]
  --n_threads INTEGER   Number of threads.  [default: -1]
  --pad_id INTEGER      Padding token id.  [default: 0]
  --unk_id INTEGER      Unknown token id.  [default: 1]
  --bos_id INTEGER      'Begin of sentence' token id.  [default: 2]
  --eos_id INTEGER      'End of sentence' token id.  [default: 3]
  --help                Show this message and exit.

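The encode command applies BPE encoding to a corpus of sentences. Use stdin for input and stdout for output.

By default, encoding works in parallel using n_threads threads. The number of threads is limited to 8 (see benchmark).

With the --stream option, --n_threads is ignored and all sentences are processed one by one. Each sentence is tokenized and written to stdout before the next sentence is read.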
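
The line-by-line behaviour described for --stream can be pictured with a generic Python filter. This is only a sketch of the processing pattern, not the CLI's implementation: stream_lines and the encode_line callback are hypothetical names, and str.upper stands in for a real tokenizer.

```python
import io


def stream_lines(src, dst, encode_line):
    """Read one line, write its encoded form, flush, and only then read the next."""
    for line in src:
        dst.write(encode_line(line.rstrip("\n")) + "\n")
        dst.flush()  # make the result visible before the next line is consumed


# Stand-in streams and a stand-in encoder; a real run would use sys.stdin and sys.stdout.
source = io.StringIO("first sentence\nsecond sentence\n")
sink = io.StringIO()
stream_lines(source, sink, str.upper)
print(sink.getvalue(), end="")
```

Processing one line at a time gives up batch parallelism, which fits the note that --n_threads is ignored in this mode, but it lets a downstream consumer see each result as soon as it is produced.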

 $ yttm encode --help

Usage: yttm encode [OPTIONS]

  Encode text to ids or subwords.

Options:
  --model PATH         Path to file with learned model.  [required]
  --output_type TEXT   'id' or 'subword'.  [required]
  --n_threads INTEGER  Number of threads.  [default: -1]
  --bos                Add tab 'begin of sentence'.
  --eos                Add tab 'end of sentence'.
  --reverse            Reverse output sequence of tokens.
  --stream             Process each line before reading the next one.
  --dropout_prob       BPE-dropout probability (the probability of a merge being dropped). [default: 0]
  --help               Show this message and exit.

The vocab command prints the vocabulary. This can be useful for understanding the model.

 $ yttm vocab --help

Usage: yttm vocab [OPTIONS]

  Print list of learned subwords.

Options:
  --model PATH  Path to file with learned model.  [required]
  --verbose     Add merging rules.
  --help        Show this message and exit.

The decode command converts ids back to text. Use stdin for input and stdout for output.

 $ yttm decode --help

Usage: yttm decode [OPTIONS]

  Decode ids to text.

Options:
  --model PATH  Path to file with learned model.  [required]
  --ignore_ids  List of indices to ignore for decoding. Example: --ignore_ids=1,2,3
  --help        Show this message and exit.

Additional information

Version ved
Type other source code
Update time 2025-04-17
Size 57.54KB
Source GitHub
