haloop
Training Transformers
Haloop是语音代理工具包。 Haloop提供:
hai程序初始化模型;hac的声学模型培训计划;har进行RNN语言模型培训和评估;hal进行因果注意模型培训;hat ;hap得分为句子的对数概率;haw数据集中的标签;hax计算数据集之间的相关性;该软件包可以从PYPI中安装:
pip install haloop
hat可以与乌克兰的GPT-2模型一起使用,我们的论文GPT-2元数据限制了为乌克兰人提供指导的填充。
您需要安装和下载:
pip install bitsandbytes sentencepiece
wget https://a.wilab.org.ua/gpt/wiki.model # sentencepiece tokenizer
wget https://a.wilab.org.ua/gpt/ckpt10m.pt # model checkpoint for GPT-2 Large
现在,启动重复:
hat --spm wiki.model ckpt10m.pt
通过计算语言模型下的日志概率来评分句子列表。首先,输入文件将通过令牌计数进行排序,以改善GPU利用率:
cat ubertext.wikipedia.filter_rus_gcld+short.text_only.txt | spm_encode --model wiki.model | awk -v OFS="t" '{ print length, $0 }' | sort -r -n -s | cut -f2- | spm_decode --model wiki.model > wikipedia.toksorted.txt
cat wikipedia.toksorted.txt | hap --compile --spm wiki.model ckpt10m.pt | pv -l > wikipedia.toksorted.scores.txt
请引用:
@inproceedings{kyrylov-chaplynskyi-2023-gpt,
title = "{GPT}-2 Metadata Pretraining Towards Instruction Finetuning for {U}krainian",
author = "Kyrylov, Volodymyr and
Chaplynskyi, Dmytro",
booktitle = "Proceedings of the Second Ukrainian Natural Language Processing Workshop (UNLP)",
month = may,
year = "2023",
address = "Dubrovnik, Croatia",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.unlp-1.4",
pages = "32--39",
abstract = "We explore pretraining unidirectional language models on 4B tokens from the largest curated corpus of Ukrainian, UberText 2.0. We enrich document text by surrounding it with weakly structured metadata, such as title, tags, and publication year, enabling metadata-conditioned text generation and text-conditioned metadata prediction at the same time. We pretrain GPT-2 Small, Medium and Large models each on single GPU, reporting training times, BPC on BrUK and BERTScore on titles for 1000 News from the Future. Next, we venture to formatting POS and NER datasets as instructions, and train low-rank attention adapters, performing these tasks as constrained text generation. We release our models for the community at https://github.com/proger/uk4b.",
}
动态编程的语音歧视,TK Vintsyuk(1968)