Greek BERT
1.0.0
A Greek edition of Google's BERT pre-trained language model.

The pre-training corpora of bert-base-greek-uncased-v1 include:
Future releases will also include:
A model like bert-base-uncased (12 layers, 768 hidden units, 12 attention heads, 110M parameters). We release bert-base-greek-uncased-v1 as part of the Hugging Face Transformers repository, so you need to install the Transformers library through pip, along with PyTorch or TensorFlow 2.
pip install transformers
pip install (torch|tensorflow)
(The unicodedata module used below is part of the Python standard library and needs no installation.)
In order to use bert-base-greek-uncased-v1, you have to pre-process the text to lowercase letters and remove all Greek diacritics.
import unicodedata

def strip_accents_and_lowercase(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn').lower()

accented_string = "Αυτή είναι η Ελληνική έκδοση του BERT."
unaccented_string = strip_accents_and_lowercase(accented_string)
print(unaccented_string)  # αυτη ειναι η ελληνικη εκδοση του bert.

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("nlpaueb/bert-base-greek-uncased-v1")
model = AutoModel.from_pretrained("nlpaueb/bert-base-greek-uncased-v1")

import torch
from transformers import AutoTokenizer, AutoModelWithLMHead

# Load model and tokenizer
tokenizer_greek = AutoTokenizer.from_pretrained('nlpaueb/bert-base-greek-uncased-v1')
lm_model_greek = AutoModelWithLMHead.from_pretrained('nlpaueb/bert-base-greek-uncased-v1')
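The base AutoModel above returns per-token hidden states rather than a single sentence vector; one common (unofficial) way to obtain a sentence embedding is to mean-pool over the token dimension. Below is a minimal sketch of just that pooling step, using a toy hidden-state tensor in place of real model output (a real bert-base model has hidden size 768, not 3):

```python
import torch

# Toy last_hidden_state: batch of 1 sentence, 4 tokens, hidden size 3.
# A real model's output would have shape [batch, seq_len, 768].
last_hidden_state = torch.tensor([[[1.0, 2.0, 3.0],
                                   [3.0, 2.0, 1.0],
                                   [0.0, 0.0, 0.0],
                                   [4.0, 4.0, 4.0]]])

# Mean-pool over the token dimension to get one vector per sentence.
sentence_embedding = last_hidden_state.mean(dim=1)
print(sentence_embedding)        # tensor([[2., 2., 2.]])
print(sentence_embedding.shape)  # torch.Size([1, 3])
```

With the real model, `last_hidden_state` would be `model(**tokenizer(text, return_tensors='pt'))[0]`, pooled the same way.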
# ================ EXAMPLE 1 ================
text_1 = 'O ποιητής έγραψε ένα [MASK] .'
# EN: 'The poet wrote a [MASK].'
input_ids = tokenizer_greek.encode(text_1)
print(tokenizer_greek.convert_ids_to_tokens(input_ids))
# ['[CLS]', 'o', 'ποιητης', 'εγραψε', 'ενα', '[MASK]', '.', '[SEP]']
outputs = lm_model_greek(torch.tensor([input_ids]))[0]
print(tokenizer_greek.convert_ids_to_tokens(outputs[0, 5].max(0)[1].item()))
# the most plausible prediction for [MASK] is "song"
# ================ EXAMPLE 2 ================
text_2 = 'Είναι ένας [MASK] άνθρωπος.'
# EN: 'He is a [MASK] person.'
input_ids = tokenizer_greek.encode(text_2)
print(tokenizer_greek.convert_ids_to_tokens(input_ids))
# ['[CLS]', 'ειναι', 'ενας', '[MASK]', 'ανθρωπος', '.', '[SEP]']
outputs = lm_model_greek(torch.tensor([input_ids]))[0]
print(tokenizer_greek.convert_ids_to_tokens(outputs[0, 3].max(0)[1].item()))
# the most plausible prediction for [MASK] is "good"
# ================ EXAMPLE 3 ================
text_3 = 'Είναι ένας [MASK] άνθρωπος και κάνει συχνά [MASK].'
# EN: 'He is a [MASK] person and he frequently does [MASK].'
input_ids = tokenizer_greek.encode(text_3)
print(tokenizer_greek.convert_ids_to_tokens(input_ids))
# ['[CLS]', 'ειναι', 'ενας', '[MASK]', 'ανθρωπος', 'και', 'κανει', 'συχνα', '[MASK]', '.', '[SEP]']
outputs = lm_model_greek(torch.tensor([input_ids]))[0]
print(tokenizer_greek.convert_ids_to_tokens(outputs[0, 8].max(0)[1].item()))
# the most plausible prediction for the second [MASK] is "trips"
TBA
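The indexing used in the examples above, outputs[0, i].max(0)[1].item(), simply takes the argmax over the vocabulary dimension for token position i. A minimal sketch of that selection step with a toy logits tensor (vocabulary size 5 here, not the model's real vocabulary):

```python
import torch

# Toy masked-LM "logits": batch of 1, 4 token positions, vocab of 5 entries.
logits = torch.tensor([[[0.1, 0.9, 0.0, 0.2, 0.3],
                        [0.5, 0.1, 2.0, 0.0, 0.1],
                        [0.0, 0.0, 0.0, 3.0, 0.0],
                        [1.0, 0.2, 0.1, 0.0, 0.9]]])

# For the token at position 2, .max(0) over the vocab axis returns a
# (values, indices) pair; [1].item() extracts the argmax token id.
predicted_id = logits[0, 2].max(0)[1].item()
print(predicted_id)  # 3
```

With the real model, that id is then mapped back to a subword string via tokenizer_greek.convert_ids_to_tokens, as the examples show.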
Ilias Chalkidis on behalf of AUEB's Natural Language Processing Group
| github: @ilias.chalkidis | Twitter: @kiddothe2b |
AUEB's Natural Language Processing Group develops algorithms, models, and systems that allow computers to process and generate natural language text.
The group's current research interests include:
The group is part of the Information Processing Laboratory of the Department of Informatics of the Athens University of Economics and Business.