haloop
Training Transformers
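Haloop is a speech agent toolkit. Haloop provides:
hai program to initialize models;
hac program for acoustic model training;
har for RNN language model training and evaluation;
hal for causal attention model training;
hat;
hap to score sentences by log probability;
haw for labels in datasets;
hax to compute correlations between datasets.
The package can be installed from PyPI: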
pip install haloop
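hat can be used with the Ukrainian GPT-2 models from our paper, GPT-2 Metadata Pretraining Towards Instruction Finetuning for Ukrainian.
Install the dependencies and download the tokenizer and model checkpoint: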
pip install bitsandbytes sentencepiece
wget https://a.wilab.org.ua/gpt/wiki.model # sentencepiece tokenizer
wget https://a.wilab.org.ua/gpt/ckpt10m.pt # model checkpoint for GPT-2 Large
Now, start the REPL:
hat --spm wiki.model ckpt10m.pt
hap scores a list of sentences by computing their log probabilities under the language model. The input file is first sorted by token count to improve GPU utilization:
cat ubertext.wikipedia.filter_rus_gcld+short.text_only.txt | spm_encode --model wiki.model | awk -v OFS="\t" '{ print length, $0 }' | sort -r -n -s | cut -f2- | spm_decode --model wiki.model > wikipedia.toksorted.txt
cat wikipedia.toksorted.txt | hap --compile --spm wiki.model ckpt10m.pt | pv -l > wikipedia.toksorted.scores.txt
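The two ideas in this step, scoring a sentence by summing token log probabilities and sorting by length so that batches need less padding, can be illustrated with a minimal Python sketch. This is not haloop's implementation: the function names, the batch size of 2, and the toy lengths are hypothetical, and padding each batch to its longest sequence is an assumption about how batching works rather than something taken from the hap source.

```python
import math

def sentence_logprob(token_logits, token_ids):
    """Sum log p(token_t | prefix), given one row of logits per position."""
    total = 0.0
    for logits, tok in zip(token_logits, token_ids):
        # log-softmax: subtract log-sum-exp, shifted by the max for stability
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += logits[tok] - log_z
    return total

def padded_positions(lengths, batch_size):
    """Count padding slots when consecutive sequences are batched together
    and each batch is padded to its longest sequence (assumed scheme)."""
    waste = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        waste += sum(max(batch) - n for n in batch)
    return waste

# Hypothetical token counts, not taken from the Wikipedia corpus.
lengths = [3, 12, 4, 11, 2, 10, 5, 13]
unsorted_waste = padded_positions(lengths, batch_size=2)
sorted_waste = padded_positions(sorted(lengths, reverse=True), batch_size=2)

# Uniform logits over a 4-token vocabulary: each token has probability 1/4,
# so a 3-token sentence scores 3 * log(1/4).
uniform_score = sentence_logprob([[0.0] * 4] * 3, [1, 2, 3])
```

With these toy lengths, batching in the original order pads 32 positions, while batching after the descending sort pads only 4.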
Please cite:
@inproceedings{kyrylov-chaplynskyi-2023-gpt,
title = "{GPT}-2 Metadata Pretraining Towards Instruction Finetuning for {U}krainian",
author = "Kyrylov, Volodymyr and
Chaplynskyi, Dmytro",
booktitle = "Proceedings of the Second Ukrainian Natural Language Processing Workshop (UNLP)",
month = may,
year = "2023",
address = "Dubrovnik, Croatia",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.unlp-1.4",
pages = "32--39",
abstract = "We explore pretraining unidirectional language models on 4B tokens from the largest curated corpus of Ukrainian, UberText 2.0. We enrich document text by surrounding it with weakly structured metadata, such as title, tags, and publication year, enabling metadata-conditioned text generation and text-conditioned metadata prediction at the same time. We pretrain GPT-2 Small, Medium and Large models each on single GPU, reporting training times, BPC on BrUK and BERTScore on titles for 1000 News from the Future. Next, we venture to formatting POS and NER datasets as instructions, and train low-rank attention adapters, performing these tasks as constrained text generation. We release our models for the community at https://github.com/proger/uk4b.",
}
Speech discrimination by dynamic programming, T. K. Vintsyuk (1968)