# BERTweet
1.0.0
BERTweet is the first public large-scale language model pre-trained for English Tweets. BERTweet is trained based on the RoBERTa pre-training procedure. The corpus used to pre-train BERTweet consists of 850M English Tweets (16B word tokens ~ 80GB), containing 845M Tweets streamed from 01/2012 to 08/2019 and 5M Tweets related to the COVID-19 pandemic. The general architecture and experimental results of BERTweet can be found in our paper:
```
@inproceedings{bertweet,
    title     = {{BERTweet: A pre-trained language model for English Tweets}},
    author    = {Dat Quoc Nguyen and Thanh Vu and Anh Tuan Nguyen},
    booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations},
    pages     = {9--14},
    year      = {2020}
}
```
Please cite our paper when BERTweet is used to help produce published results or is incorporated into other software.

## Using BERTweet with `transformers`

### Installation

Install `transformers` with pip: `pip install transformers`, or install `transformers` from source. A slow tokenizer for BERTweet has been merged into the main `transformers` branch; as mentioned in the pull request, merging a fast tokenizer is still under discussion. Users who would like to utilize the fast tokenizer can install `transformers` as follows:

```
git clone --single-branch --branch fast_tokenizers_BARTpho_PhoBERT_BERTweet https://github.com/datquocnguyen/transformers.git
cd transformers
pip3 install -e .
```
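After installing from that branch, a quick way to confirm that the fast tokenizer is actually in use is to check the tokenizer's `is_fast` attribute. A minimal sketch (the `use_fast` flag and `is_fast` attribute are standard `transformers` API):

```
from transformers import AutoTokenizer

# Request the fast (Rust-backed) tokenizer; transformers falls back to the
# slow Python tokenizer if no fast implementation exists in this build.
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=True)

# True only if the fast tokenizer was actually loaded.
print(tokenizer.is_fast)
```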
Install `tokenizers` with pip: `pip3 install tokenizers`

### Pre-trained models

| Model | #params | Arch. | Max length | Pre-training data |
|---|---|---|---|---|
| `vinai/bertweet-base` | 135M | base | 128 | 850M English Tweets (cased) |
| `vinai/bertweet-covid19-base-cased` | 135M | base | 128 | 23M COVID-19 English Tweets (cased) |
| `vinai/bertweet-covid19-base-uncased` | 135M | base | 128 | 23M COVID-19 English Tweets (uncased) |
| `vinai/bertweet-large` | 355M | large | 512 | 873M English Tweets (cased) |
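The max-length column matters when encoding long inputs downstream. As an illustrative sketch only (the `MAX_LENGTH` mapping below is defined here for illustration, not shipped with the models), the standard `transformers` truncation arguments can enforce each model's limit:

```
from transformers import AutoTokenizer

# Per-model sequence limits from the table above (illustrative mapping).
MAX_LENGTH = {"vinai/bertweet-base": 128, "vinai/bertweet-large": 512}

name = "vinai/bertweet-base"
tokenizer = AutoTokenizer.from_pretrained(name)

# truncation/max_length are standard encode() arguments; inputs longer
# than the model's limit are cut to fit.
input_ids = tokenizer.encode(
    "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :crying_face:",
    truncation=True,
    max_length=MAX_LENGTH[name],
)
```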
The two models `vinai/bertweet-covid19-base-cased` and `vinai/bertweet-covid19-base-uncased` were obtained by further pre-training the pre-trained model `vinai/bertweet-base` on a corpus of 23M COVID-19 English Tweets.

### Example usage

```
import torch
from transformers import AutoModel, AutoTokenizer

bertweet = AutoModel.from_pretrained("vinai/bertweet-large")
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-large")

# INPUT TWEET IS ALREADY NORMALIZED!
line = "DHEC confirms HTTPURL via @USER :crying_face:"

input_ids = torch.tensor([tokenizer.encode(line)])

with torch.no_grad():
    features = bertweet(input_ids)  # Model outputs are now tuples

## With TensorFlow 2.0+:
# from transformers import TFAutoModel
# bertweet = TFAutoModel.from_pretrained("vinai/bertweet-large")
```

## Normalize raw input Tweets

Before applying BPE to the pre-training corpus of English Tweets, we tokenized these Tweets using `TweetTokenizer` from the NLTK toolkit and used the `emoji` package to translate emotion icons into text strings (here, each icon is referred to as one word token). We also normalized the Tweets by converting user mentions and web/URL links into the special tokens @USER and HTTPURL, respectively. It is thus recommended to apply the same pre-processing step to the raw input Tweets in BERTweet-based downstream applications.
Given a raw input Tweet, users can employ our `TweetNormalizer` module to obtain the same pre-processing output.
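For intuition only, the normalization described above can be approximated as follows. This is a rough sketch, not the actual `TweetNormalizer` module (the helper `normalize_tweet_sketch` is our own name here):

```
from nltk.tokenize import TweetTokenizer
import emoji  # must be version 0.5.4 or 0.6.0, see the note below

tweet_tokenizer = TweetTokenizer()

def normalize_tweet_sketch(tweet):
    # Tokenize with NLTK's TweetTokenizer, then map user mentions to @USER,
    # web/URL links to HTTPURL, and emotion icons to text strings.
    normalized = []
    for token in tweet_tokenizer.tokenize(tweet):
        if token.startswith("@"):
            normalized.append("@USER")
        elif token.lower().startswith(("http", "www.")):
            normalized.append("HTTPURL")
        else:
            normalized.append(emoji.demojize(token))  # crying-face icon -> :crying_face:
    return " ".join(normalized)

print(normalize_tweet_sketch("DHEC confirms case 😢 via @postandcourier https://t.co/abc"))
# -> DHEC confirms case :crying_face: via @USER HTTPURL
```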
### Installation

`pip3 install nltk emoji==0.6.0`

Note that the `emoji` version must be either 0.5.4 or 0.6.0. Newer `emoji` versions have been updated to newer versions of the Emoji Charts and are therefore not consistent with the version used to pre-process our pre-training Tweet corpus.

```
import torch
from transformers import AutoTokenizer
from TweetNormalizer import normalizeTweet

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-large")

line = normalizeTweet("DHEC confirms https://postandcourier.com/health/covid19/sc-has-first-two-presumptive-cases-of-coronavirus-dhec-confirms/article_bddfe4ae-5fd3-11ea-9ce4-5f495366cee6.html?utm_medium=social&utm_source=twitter&utm_campaign=user-share… via @postandcourier ?")

input_ids = torch.tensor([tokenizer.encode(line)])
```

## Using BERTweet with `fairseq`

Please see details at HERE!
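As a rough sketch only, loading a fairseq checkpoint generally follows fairseq's RoBERTa API; the checkpoint directory name below is hypothetical, and the exact loading arguments (e.g. the BPE setup) are covered in the linked instructions:

```
import torch
from fairseq.models.roberta import RobertaModel

# Hypothetical path: a downloaded fairseq BERTweet checkpoint directory
# containing model.pt plus its dictionary/BPE files.
bertweet = RobertaModel.from_pretrained("BERTweet_base_fairseq", checkpoint_file="model.pt")
bertweet.eval()  # disable dropout for feature extraction

# encode() applies the model's BPE; the input Tweet is assumed to be
# already normalized, as above.
tokens = bertweet.encode("DHEC confirms HTTPURL via @USER :crying_face:")
with torch.no_grad():
    features = bertweet.extract_features(tokens)
```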
## License

MIT License
Copyright (c) 2020-2021 VinAI
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.