YouTokenToMe

YouTokenToMe is an unsupervised text tokenizer focused on computational efficiency. It currently implements fast Byte Pair Encoding (BPE) [Sennrich et al.]. Our implementation is much faster in training and tokenization than Hugging Face, fastBPE and SentencePiece. In some test cases, it is 60 times faster. Check out our benchmark results.

Key advantages:

Multithreading for training and tokenization
The algorithm has O(N) complexity, where N is the length of the training data
Highly efficient implementation in C++
Python wrapper and command-line interface

Extra features:

BPE-dropout (as described in Provilkov et al, 2019)
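To make the idea of BPE-dropout concrete, here is a minimal pure-Python sketch. It is an illustration only, not the library's C++ implementation: the function name bpe_encode, the toy merge table, and the rng parameter are assumptions made for this example. At every step each applicable merge is skipped with probability dropout_prob, so the same word can be segmented differently on different calls.

```python
import random


def bpe_encode(word, merges, dropout_prob=0.0, rng=random):
    """Greedy BPE over a single word with optional merge dropout (toy sketch)."""
    # Earlier merges in the list have higher priority (lower rank).
    rank = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        # Adjacent pairs that are known merges and survive dropout this round.
        candidates = [
            (rank[(a, b)], i)
            for i, (a, b) in enumerate(zip(symbols, symbols[1:]))
            if (a, b) in rank and rng.random() >= dropout_prob
        ]
        if not candidates:
            break
        # Apply the highest-priority surviving merge.
        _, i = min(candidates)
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


# Hypothetical merge table, listed in priority order.
merges = [("l", "o"), ("lo", "w"), ("e", "r")]

print(bpe_encode("lower", merges))                    # no dropout: deterministic
print(bpe_encode("lower", merges, dropout_prob=1.0))  # every merge dropped
print(bpe_encode("lower", merges, dropout_prob=0.5))  # varies between calls
```

With dropout_prob set to 0 the segmentation is the usual deterministic BPE result; with 1 the word falls apart into single characters.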

As in the algorithm from the original paper, ours does not consider tokens that cross word boundaries. Just like in SentencePiece, all space symbols are replaced by the meta symbol "▁" (U+2581). This allows sequences of tokens to be converted back to text and word boundaries to be restored.

For example, the phrase Blazingly fast tokenization! can be tokenized into

['▁Bl', 'az', 'ingly', '▁fast', '▁token', 'ization', '!']
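The role of the meta symbol can be shown with a small pure-Python sketch. The helper names mark_word_boundaries and detokenize are hypothetical and are not part of the youtokentome API; the sketch only shows why prefixing words with "▁" makes the token sequence reversible.

```python
META = "\u2581"  # "▁", the meta symbol that stands in for a space


def mark_word_boundaries(text):
    """Prefix every whitespace-separated word with the meta symbol."""
    return [META + word for word in text.split()]


def detokenize(tokens):
    """Join subword tokens and turn meta symbols back into spaces."""
    return "".join(tokens).replace(META, " ").lstrip(" ")


tokens = ["▁Bl", "az", "ingly", "▁fast", "▁token", "ization", "!"]
print(mark_word_boundaries("Blazingly fast tokenization!"))
print(detokenize(tokens))
```

Because every word start carries the marker, no separate record of where the spaces were is needed to rebuild the text.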

Installation

pip install youtokentome

Python interface

Example

Let's start with a self-contained example.

import random

import youtokentome as yttm

train_data_path = "train_data.txt"
model_path = "example.model"

# Generating random file with training data
# 10000 lines with 100 characters in each line
n_lines = 10000
n_characters = 100
with open(train_data_path, "w") as fout:
    for _ in range(n_lines):
        print("".join([random.choice("abcd ") for _ in range(n_characters)]), file=fout)

# Generating random text
test_text = "".join([random.choice("abcde ") for _ in range(100)])

# Training model
yttm.BPE.train(data=train_data_path, vocab_size=5000, model=model_path)

# Loading model
bpe = yttm.BPE(model=model_path)

# Two types of tokenization
print(bpe.encode([test_text], output_type=yttm.OutputType.ID))
print(bpe.encode([test_text], output_type=yttm.OutputType.SUBWORD))

Training model

youtokentome.BPE.train(data, model, vocab_size, coverage, n_threads=-1, pad_id=0, unk_id=1, bos_id=2, eos_id=3)

Trains a BPE model and saves it to a file.

Args:

data : string, path to the file with training data
model : string, path where the trained model will be saved
vocab_size : int, number of tokens in the final vocabulary
coverage : float, fraction of characters covered by the model. Must be in the range [0, 1]. A good value to use is about 0.9999.
n_threads : int, number of parallel threads used to run. If -1 is passed, all available threads are used. Note that the number of threads is limited to 8 (see benchmark).
pad_id : int, reserved id for padding
unk_id : int, reserved id for unknown symbols
bos_id : int, reserved id for the beginning-of-sentence token
eos_id : int, reserved id for the end-of-sentence token

Returns: Class youtokentome.BPE with the loaded model.
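One way to read the coverage parameter is sketched below in pure Python. This is an assumption about the intent (keep the most frequent characters until the requested fraction of all character occurrences is covered, and treat the rest as unknown); it is not taken from the library's source, and covered_alphabet is a hypothetical helper.

```python
from collections import Counter


def covered_alphabet(text, coverage):
    """Return the most frequent characters whose occurrences reach `coverage` of the text."""
    counts = Counter(text)
    total = sum(counts.values())
    kept, covered = set(), 0
    for ch, n in counts.most_common():
        # Stop once the characters kept so far account for the requested fraction.
        if covered >= coverage * total:
            break
        kept.add(ch)
        covered += n
    return kept


# Toy corpus: "a" is 90% of the characters, "b" is 9%, "c" is 1%.
corpus = "a" * 90 + "b" * 9 + "c"
print(sorted(covered_alphabet(corpus, 0.95)))  # the rare "c" is left out
print(sorted(covered_alphabet(corpus, 1.0)))   # every character is kept
```

Under this reading, a value slightly below 1 drops only very rare characters, which would then map to unk_id.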

Model loading

youtokentome.BPE(model, n_threads=-1)

Class constructor. Loads the trained model.

model : string, path to the trained model
n_threads : int, number of parallel threads used to run. If equal to -1, the maximum number of available threads is used.

Methods

Class youtokentome.BPE has the following methods:

encode

encode(self, sentences, output_type=yttm.OutputType.ID, bos=False, eos=False, reverse=False, dropout_prob=0)

Args:

sentences : list of strings, sentences for tokenization
output_type : enum, a sentence can be tokenized to ids or subwords. Use OutputType.ID for ids and OutputType.SUBWORD for subwords.
bos : bool, if True then the "beginning of sentence" token is added
eos : bool, if True then the "end of sentence" token is added
reverse : bool, if True the output sequence of tokens is reversed
dropout_prob : float, BPE-dropout probability (the probability of a merge being dropped). Must be in the range [0, 1].

Returns: If output_type is equal to youtokentome.OutputType.ID or youtokentome.OutputType.SUBWORD, then a list of lists of integers or a list of lists of strings is returned, respectively.

vocab

vocab(self)

Returns: A list of vocab_size strings. The i-th string in the list corresponds to the i-th subword.

vocab_size

vocab_size(self)

Returns: int. Size of the vocabulary.

subword_to_id

subword_to_id(self, subword)

Args:

subword : string

Returns: Integer from the range [0, vocab_size-1]. The id of the subword or, if there is no such subword in the vocabulary, unk_id.

id_to_subword

id_to_subword(self, id)

Args:

id : int, must be in the range [0, vocab_size-1]

Returns: string. The subword from the vocabulary with the given id.

decode

decode(self, ids, ignore_ids=None)

Converts each id to a subword and concatenates them, restoring the space symbols.

Args:

ids : list of lists of integers. All integers must be in the range [0, vocab_size-1].
ignore_ids : collection of integers. These indices are ignored during decoding. All integers must be in the range [0, vocab_size-1]. [default: None]

Returns: List of strings.
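The lookup and decoding behaviour documented above can be mimicked with a toy stand-in. The class ToyVocab below is hypothetical and written in pure Python so it runs without a trained model; the special-token strings in the example vocabulary are placeholders. It mirrors only what the text states: unknown subwords map to unk_id, ids must be in range, and ignore_ids are skipped while decoding.

```python
class ToyVocab:
    """Toy stand-in for the lookup and decode methods described above."""

    def __init__(self, subwords, unk_id=1):
        self.subwords = list(subwords)
        self.unk_id = unk_id
        self.index = {s: i for i, s in enumerate(self.subwords)}

    def subword_to_id(self, subword):
        # Unknown subwords fall back to unk_id.
        return self.index.get(subword, self.unk_id)

    def id_to_subword(self, id):
        if not 0 <= id < len(self.subwords):
            raise ValueError("id must be in the range [0, vocab_size-1]")
        return self.subwords[id]

    def decode(self, ids, ignore_ids=None):
        ignore = set(ignore_ids or ())
        texts = []
        for sentence in ids:
            joined = "".join(self.id_to_subword(i) for i in sentence if i not in ignore)
            # Turn the meta symbol back into ordinary spaces.
            texts.append(joined.replace("\u2581", " ").lstrip(" "))
        return texts


# Placeholder vocabulary: ids 0-3 play the role of pad, unk, bos and eos.
vocab = ToyVocab(["<PAD>", "<UNK>", "<BOS>", "<EOS>", "▁fast", "▁token", "ization"])
print(vocab.subword_to_id("▁fast"))
print(vocab.subword_to_id("missing"))
print(vocab.decode([[2, 4, 5, 6, 3]], ignore_ids=[2, 3]))
```

Passing the bos and eos ids as ignore_ids is a typical way to get clean text back from sequences that were encoded with bos=True and eos=True.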

Command line interface

Example

$ yttm bpe --data TRAINING_DATA_FILE --model OUTPUT_MODEL_FILE --vocab_size 2000
$ yttm encode --model OUTPUT_MODEL_FILE --output_type subword < TEST_DATA_FILE > ENCODED_DATA

Supported commands

YouTokenToMe supports the following commands:

$ yttm --help

Usage: yttm [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  bpe     Train BPE model.
  decode  Decode ids to text.
  encode  Encode text to ids or subwords.
  vocab   Print list of learned subwords.

The bpe command lets you train a Byte Pair Encoding model on a text file.

$ yttm bpe --help

Usage: yttm bpe [OPTIONS]

  Train BPE model.

Options:
  --data PATH           Training data file path.  [required]
  --model PATH          Output model file path.  [required]
  --vocab_size INTEGER  Number of tokens in the final vocabulary.  [required]
  --coverage FLOAT      Fraction of characters covered by the model.  [default: 1.0]
  --n_threads INTEGER   Number of threads.  [default: -1]
  --pad_id INTEGER      Padding token id.  [default: 0]
  --unk_id INTEGER      Unknown token id.  [default: 1]
  --bos_id INTEGER      'Begin of sentence' token id.  [default: 2]
  --eos_id INTEGER      'End of sentence' token id.  [default: 3]
  --help                Show this message and exit.

Apply BPE encoding to a corpus of sentences. Use stdin for input and stdout for output.

By default, encoding works in parallel using n_threads threads. The number of threads is limited to 8 (see benchmark).

With the --stream option, --n_threads is ignored and all sentences are processed one by one. Each sentence is tokenized and written to stdout before the next sentence is read.

$ yttm encode --help

Usage: yttm encode [OPTIONS]

  Encode text to ids or subwords.

Options:
  --model PATH         Path to file with learned model.  [required]
  --output_type TEXT   'id' or 'subword'.  [required]
  --n_threads INTEGER  Number of threads.  [default: -1]
  --bos                Add tab 'begin of sentence'.
  --eos                Add tab 'end of sentence'.
  --reverse            Reverse output sequence of tokens.
  --stream             Process each line before reading the next one.
  --dropout_prob       BPE-dropout probability (the probability of a merge being dropped). [default: 0]
  --help               Show this message and exit.
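The line-at-a-time behaviour described for --stream can be pictured with a short pure-Python sketch. It does not call the yttm tool: encode_stream and the toy encode_line function are assumptions used for illustration, and in-memory streams stand in for stdin and stdout.

```python
import io


def encode_stream(lines, encode_line, out):
    """Encode and write each line before the next one is read."""
    for line in lines:
        out.write(encode_line(line.rstrip("\n")) + "\n")
        # Flush so a downstream consumer sees the result immediately.
        out.flush()


def toy_encode_line(line):
    """Toy encoder: mark each word with the meta symbol, no real BPE."""
    return " ".join("\u2581" + word for word in line.split())


source = io.StringIO("fast tokenization\nhello\n")
sink = io.StringIO()
encode_stream(source, toy_encode_line, sink)
print(sink.getvalue(), end="")
```

Processing one line at a time trades throughput for latency, which suits interactive pipelines where each result is needed as soon as its input arrives.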

Print the vocabulary. This can be useful for understanding the model.

$ yttm vocab --help

Usage: yttm vocab [OPTIONS]

  Print list of learned subwords.

Options:
  --model PATH  Path to file with learned model.  [required]
  --verbose     Add merging rules.
  --help        Show this message and exit.

Convert ids back to text. Use stdin for input and stdout for output.

$ yttm decode --help

Usage: yttm decode [OPTIONS]

  Decode ids to text.

Options:
  --model PATH  Path to file with learned model.  [required]
  --ignore_ids  List of indices to ignore for decoding. Example: --ignore_ids=1,2,3
  --help        Show this message and exit.
