haloop

Training Transformers

Haloop is a speech agent toolkit. Haloop provides:

hai, a program to initialize models;
hac, a program for acoustic model training;
har, for RNN language model training and evaluation;
hal, for causal attention model training;
hat, for agent testing;
hap, to score sentence log probabilities under a GPT language model;
haw, to compare labels in datasets using word error rate;
hax, to compute correlation between datasets.
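
Word error rate, the metric haw reports, is conventionally the word-level edit distance between a reference and a hypothesis, divided by the number of reference words. The sketch below illustrates that textbook definition only. It is not haw's implementation, and the function name `word_error_rate` and the whitespace tokenization are assumptions made for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: words are split on whitespace, which may differ
    from how haw normalizes and tokenizes its inputs.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words consumed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One word missing out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.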

The package can be installed from PyPI:

 pip install haloop

Pretrained models

hat can be used with the Ukrainian GPT-2 models from our paper, GPT-2 Metadata Pretraining Towards Instruction Finetuning for Ukrainian.

You will need to install two dependencies and download the tokenizer and a checkpoint:

 pip install bitsandbytes sentencepiece

wget https://a.wilab.org.ua/gpt/wiki.model  # sentencepiece tokenizer
wget https://a.wilab.org.ua/gpt/ckpt10m.pt  # model checkpoint for GPT-2 Large

Now start the completion:

 hat --spm wiki.model ckpt10m.pt

hap scores a list of sentences by computing their log probabilities under the language model. The input file is first sorted by token count to improve GPU utilization:

 cat ubertext.wikipedia.filter_rus_gcld+short.text_only.txt | spm_encode --model wiki.model | awk -v OFS="\t" '{ print length, $0 }' | sort -r -n -s | cut -f2- | spm_decode --model wiki.model > wikipedia.toksorted.txt
cat wikipedia.toksorted.txt | hap --compile --spm wiki.model ckpt10m.pt | pv -l > wikipedia.toksorted.scores.txt
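
One common reason length-sorted input helps GPU utilization is that sentences of similar length end up in the same batch, so less of each batch is padding. The sketch below illustrates that effect under the assumption that every batch is padded to its longest sequence. How hap actually forms batches is not stated here, and the helper `padded_slots` and the token counts are made up for the example.

```python
def padded_slots(lengths, batch_size):
    """Count padding positions when each batch is padded to its longest sequence."""
    waste = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        longest = max(batch)
        waste += sum(longest - n for n in batch)
    return waste


# Hypothetical token counts for eight sentences, in file order.
token_counts = [3, 40, 5, 38, 4, 41, 6, 39]

# Batches taken in file order mix short and long sentences.
unsorted_waste = padded_slots(token_counts, batch_size=2)

# Sorting longest-first (like `sort -r -n`) groups similar lengths together.
sorted_waste = padded_slots(sorted(token_counts, reverse=True), batch_size=2)

print(unsorted_waste, sorted_waste)  # prints "140 4"
```

With these made-up counts, sorting reduces the padded positions from 140 to 4, so almost all of the computation goes to real tokens.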

Citation

Please cite:

 @inproceedings{kyrylov-chaplynskyi-2023-gpt,
    title = "{GPT}-2 Metadata Pretraining Towards Instruction Finetuning for {U}krainian",
    author = "Kyrylov, Volodymyr  and
      Chaplynskyi, Dmytro",
    booktitle = "Proceedings of the Second Ukrainian Natural Language Processing Workshop (UNLP)",
    month = may,
    year = "2023",
    address = "Dubrovnik, Croatia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.unlp-1.4",
    pages = "32--39",
    abstract = "We explore pretraining unidirectional language models on 4B tokens from the largest curated corpus of Ukrainian, UberText 2.0. We enrich document text by surrounding it with weakly structured metadata, such as title, tags, and publication year, enabling metadata-conditioned text generation and text-conditioned metadata prediction at the same time. We pretrain GPT-2 Small, Medium and Large models each on single GPU, reporting training times, BPC on BrUK and BERTScore on titles for 1000 News from the Future. Next, we venture to formatting POS and NER datasets as instructions, and train low-rank attention adapters, performing these tasks as constrained text generation. We release our models for the community at https://github.com/proger/uk4b.",
}