haloop

Haloop is a speech agent toolkit. Haloop provides:

hai, a program for initializing models;
hac, a program for acoustic model training;
har for RNN language model training and evaluation;
hal for causal attention model training;
hat for agent testing;
hap for scoring log probabilities of sentences under a GPT language model;
haw for comparing labels in datasets using word error rate;
hax for computing correlations between datasets.
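
For readers unfamiliar with the metric that haw reports, the following is a minimal sketch of word error rate: the word-level edit distance between a reference and a hypothesis, divided by the number of reference words. The function name `word_error_rate` is hypothetical, and this is not taken from the haloop source; haw may normalize text or aggregate over a dataset differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: not haloop's implementation.
    """
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far
    # (before the current word) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting i reference words yields an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat down", "the dog sat"))
```

Because insertions are counted, the value can exceed 1.0 when the hypothesis is much longer than the reference.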

The package can be installed from PyPI:

 pip install haloop

Pretrained models

hat can be used with the Ukrainian GPT-2 models from our paper, GPT-2 Metadata Pretraining Towards Instruction Finetuning for Ukrainian.

You will need to install the dependencies and download the tokenizer and model checkpoint:

 pip install bitsandbytes sentencepiece

wget https://a.wilab.org.ua/gpt/wiki.model  # sentencepiece tokenizer
wget https://a.wilab.org.ua/gpt/ckpt10m.pt  # model checkpoint for GPT-2 Large

Now, start the REPL:

 hat --spm wiki.model ckpt10m.pt

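hap scores a list of sentences by computing their log probabilities under the language model. First, the input file is sorted by token count, longest first, to improve GPU utilization (the pipeline uses the character length of each encoded line as the sort key):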

 cat ubertext.wikipedia.filter_rus_gcld+short.text_only.txt | spm_encode --model wiki.model | awk -v OFS="\t" '{ print length, $0 }' | sort -r -n -s | cut -f2- | spm_decode --model wiki.model > wikipedia.toksorted.txt
cat wikipedia.toksorted.txt | hap --compile --spm wiki.model ckpt10m.pt | pv -l > wikipedia.toksorted.scores.txt
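
To illustrate why sorting by length can help, here is a small sketch that counts how many token slots are processed when each batch is padded to its longest sequence. It assumes padded, fixed-size batches, which is a common arrangement but is not confirmed here as what hap does internally; `padded_cells`, the batch size, and the example lengths are all hypothetical.

```python
def padded_cells(lengths, batch_size):
    """Total token slots processed when every batch is padded to its longest item."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += max(batch) * len(batch)
    return total


# Hypothetical token counts for six sentences, in their original file order.
lengths = [3, 50, 4, 48, 5, 52]

unsorted_cost = padded_cells(lengths, batch_size=3)
# Longest first, mirroring the `sort -r -n` step in the pipeline above.
sorted_cost = padded_cells(sorted(lengths, reverse=True), batch_size=3)
useful = sum(lengths)  # slots that hold real tokens rather than padding

if __name__ == "__main__":
    print(f"real tokens: {useful}")
    print(f"unsorted batches: {unsorted_cost} slots")
    print(f"sorted batches:   {sorted_cost} slots")
```

With similar-length sentences grouped together, fewer slots are spent on padding, so more of the computation goes to real tokens.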

Citing

Please cite:

 @inproceedings{kyrylov-chaplynskyi-2023-gpt,
    title = "{GPT}-2 Metadata Pretraining Towards Instruction Finetuning for {U}krainian",
    author = "Kyrylov, Volodymyr  and
      Chaplynskyi, Dmytro",
    booktitle = "Proceedings of the Second Ukrainian Natural Language Processing Workshop (UNLP)",
    month = may,
    year = "2023",
    address = "Dubrovnik, Croatia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.unlp-1.4",
    pages = "32--39",
    abstract = "We explore pretraining unidirectional language models on 4B tokens from the largest curated corpus of Ukrainian, UberText 2.0. We enrich document text by surrounding it with weakly structured metadata, such as title, tags, and publication year, enabling metadata-conditioned text generation and text-conditioned metadata prediction at the same time. We pretrain GPT-2 Small, Medium and Large models each on single GPU, reporting training times, BPC on BrUK and BERTScore on titles for 1000 News from the Future. Next, we venture to formatting POS and NER datasets as instructions, and train low-rank attention adapters, performing these tasks as constrained text generation. We release our models for the community at https://github.com/proger/uk4b.",
}