GreekBERT

A Greek version of Google's BERT pre-trained language model.

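Pre-training corpora

The pre-training corpora of bert-base-greek-uncased-v1 include:

The Greek part of Wikipedia,
The Greek part of the European Parliament Proceedings Parallel Corpus, and
The Greek part of OSCAR, a cleansed version of Common Crawl.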

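Future releases will also include:

The entire corpus of Greek legislation, as published by the National Publication Office, and
The entire corpus of EU legislation (Greek translation), as published in Eur-Lex.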

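Pre-training details

We trained BERT using the official code provided in Google BERT's GitHub repository (https://github.com/google-research/bert).
We released a model similar to the English bert-base-uncased model (12 layers, 768 hidden units, 12 attention heads, 110M parameters).
We chose to follow the same training setup: 1 million training steps with batches of 256 sequences of length 512 and an initial learning rate of 1e-4.
We were able to use a single Google Cloud TPU v3-8 provided for free by the TensorFlow Research Cloud (TFRC), while also utilizing GCP research credits. Huge thanks to both Google programs for supporting us!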
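The architecture and training figures quoted above can be sanity-checked with a little arithmetic. The sketch below counts the weights of a BERT-base encoder and the number of token positions processed during pre-training. The vocabulary size (30522), the two segment types, and the 4x feed-forward width are assumptions taken from the English bert-base-uncased model; the text does not state them for the Greek model, so its exact parameter count may differ.

```python
# Figures quoted in the text.
NUM_LAYERS = 12
HIDDEN = 768
MAX_POSITIONS = 512          # sequences of length 512
TRAINING_STEPS = 1_000_000
BATCH_SIZE = 256

# Assumptions: values of the English bert-base-uncased model.
VOCAB_SIZE = 30522
TYPE_VOCAB = 2
INTERMEDIATE = 4 * HIDDEN    # 3072


def linear(n_in, n_out):
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out


def layer_norm(size):
    """Scale and shift vectors of a LayerNorm."""
    return 2 * size


def bert_base_parameters(vocab_size=VOCAB_SIZE):
    # Token, position and segment embeddings, followed by a LayerNorm.
    embeddings = (vocab_size + MAX_POSITIONS + TYPE_VOCAB) * HIDDEN + layer_norm(HIDDEN)
    # Query, key, value and output projections. The number of heads (12)
    # only splits these matrices and does not change the count.
    attention = 4 * linear(HIDDEN, HIDDEN) + layer_norm(HIDDEN)
    feed_forward = (linear(HIDDEN, INTERMEDIATE)
                    + linear(INTERMEDIATE, HIDDEN)
                    + layer_norm(HIDDEN))
    pooler = linear(HIDDEN, HIDDEN)
    return embeddings + NUM_LAYERS * (attention + feed_forward) + pooler


# Upper bound on token positions seen in training (padding included).
token_positions = TRAINING_STEPS * BATCH_SIZE * MAX_POSITIONS

print(f"{bert_base_parameters():,}")  # 109,482,240, i.e. roughly 110M
print(f"{token_positions:,}")         # 131,072,000,000
```

Under these assumptions the embedding table is the only part of the count that depends on the vocabulary, so a different vocabulary size shifts the total by 768 parameters per token.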

Requirements

We published bert-base-greek-uncased-v1 as part of Hugging Face's Transformers repository. You therefore need to install the transformers library through pip, along with PyTorch or TensorFlow 2.

# unicodedata ships with the Python standard library, so it needs no install
pip install transformers
pip install torch  # or: pip install tensorflow
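Before loading the model, it can help to confirm that the required packages are importable. The following is a minimal sketch using only the standard library; the helper name `available_modules` is hypothetical and not part of any of the packages mentioned.

```python
import importlib.util


def available_modules(candidates=("transformers", "torch", "tensorflow")):
    """Return the candidate top-level modules that can be imported here."""
    return [name for name in candidates
            if importlib.util.find_spec(name) is not None]


found = available_modules()
if "transformers" not in found:
    print("transformers is missing")
if not {"torch", "tensorflow"} & set(found):
    print("neither torch nor tensorflow is installed")
```

`importlib.util.find_spec` only locates a module without importing it, so the check stays fast even when the heavy frameworks are present.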

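Pre-process text (deaccent and lowercase)

In order to use bert-base-greek-uncased-v1, you have to pre-process texts by lowercasing all letters and removing all Greek diacritics.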

import unicodedata

def strip_accents_and_lowercase(s):
    # Decompose to NFD, drop combining marks (category 'Mn'), then lowercase
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn').lower()

accented_string = "Αυτή είναι η Ελληνική έκδοση του BERT."
unaccented_string = strip_accents_and_lowercase(accented_string)

print(unaccented_string)  # αυτη ειναι η ελληνικη εκδοση του bert.
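To see why the function above removes diacritics, it helps to look at what NFD normalization does to a single accented letter. The sketch below repeats the function so that it runs on its own, shows the decomposition of one character, and applies the function to a small list of strings; the sample strings are illustrative and not taken from the source.

```python
import unicodedata


def strip_accents_and_lowercase(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn').lower()


# NFD splits an accented letter into its base letter plus combining marks.
decomposed = unicodedata.normalize('NFD', 'ή')
print([unicodedata.name(c) for c in decomposed])
# ['GREEK SMALL LETTER ETA', 'COMBINING ACUTE ACCENT']

# Combining marks have the Unicode category 'Mn', which the filter drops.
print([unicodedata.category(c) for c in decomposed])
# ['Ll', 'Mn']

# The dialytika (diaeresis) is a combining mark too, so it is removed as well.
samples = ["Η Αθήνα είναι όμορφη.", "προϊόν"]
cleaned = [strip_accents_and_lowercase(s) for s in samples]
print(cleaned)
# ['η αθηνα ειναι ομορφη.', 'προιον']
```

The function is idempotent, so applying it to text that is already lowercase and unaccented leaves that text unchanged.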

Load the pre-trained model

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("nlpaueb/bert-base-greek-uncased-v1")
model = AutoModel.from_pretrained("nlpaueb/bert-base-greek-uncased-v1")

Use the pre-trained model as a language model

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Load model and tokenizer
# AutoModelForMaskedLM supersedes the deprecated AutoModelWithLMHead class
tokenizer_greek = AutoTokenizer.from_pretrained('nlpaueb/bert-base-greek-uncased-v1')
lm_model_greek = AutoModelForMaskedLM.from_pretrained('nlpaueb/bert-base-greek-uncased-v1')

# ================ EXAMPLE 1 ================
text_1 = 'O ποιητής έγραψε ένα [MASK] .'
# EN: 'The poet wrote a [MASK].'
input_ids = tokenizer_greek.encode(text_1)
print(tokenizer_greek.convert_ids_to_tokens(input_ids))
# ['[CLS]', 'o', 'ποιητης', 'εγραψε', 'ενα', '[MASK]', '.', '[SEP]']
outputs = lm_model_greek(torch.tensor([input_ids]))[0]
# Index 5 is the position of [MASK]; take the highest-scoring vocabulary id there
print(tokenizer_greek.convert_ids_to_tokens(outputs[0, 5].max(0)[1].item()))
# the most plausible prediction for [MASK] is "song"

# ================ EXAMPLE 2 ================
text_2 = 'Είναι ένας [MASK] άνθρωπος.'
# EN: 'He is a [MASK] person.'
input_ids = tokenizer_greek.encode(text_2)
print(tokenizer_greek.convert_ids_to_tokens(input_ids))
# ['[CLS]', 'ειναι', 'ενας', '[MASK]', 'ανθρωπος', '.', '[SEP]']
outputs = lm_model_greek(torch.tensor([input_ids]))[0]
print(tokenizer_greek.convert_ids_to_tokens(outputs[0, 3].max(0)[1].item()))
# the most plausible prediction for [MASK] is "good"

# ================ EXAMPLE 3 ================
text_3 = 'Είναι ένας [MASK] άνθρωπος και κάνει συχνά [MASK].'
# EN: 'He is a [MASK] person and he frequently does [MASK].'
input_ids = tokenizer_greek.encode(text_3)
print(tokenizer_greek.convert_ids_to_tokens(input_ids))
# ['[CLS]', 'ειναι', 'ενας', '[MASK]', 'ανθρωπος', 'και', 'κανει', 'συχνα', '[MASK]', '.', '[SEP]']
outputs = lm_model_greek(torch.tensor([input_ids]))[0]
# Index 8 is the position of the second [MASK]
print(tokenizer_greek.convert_ids_to_tokens(outputs[0, 8].max(0)[1].item()))
# the most plausible prediction for the second [MASK] is "trips"
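The examples above hard-code the index of each [MASK] token (5, 3 and 8). A minimal sketch of how those indices could be found automatically, and how the top few candidates could be listed instead of only the best one, is given below. The helper names `mask_positions` and `top_k_predictions` are hypothetical; `top_k_predictions` expects a tokenizer and a masked language model loaded as in the examples above, and it imports torch only when called.

```python
def mask_positions(tokens, mask_token="[MASK]"):
    """Return the indices at which the mask token occurs in a token list."""
    return [i for i, token in enumerate(tokens) if token == mask_token]


def top_k_predictions(text, tokenizer, model, k=5):
    """Map each [MASK] position in `text` to its k highest-scoring tokens."""
    import torch  # imported lazily so mask_positions works without torch

    input_ids = tokenizer.encode(text)
    tokens = tokenizer.convert_ids_to_tokens(input_ids)
    with torch.no_grad():  # inference only, no gradients needed
        logits = model(torch.tensor([input_ids]))[0]
    predictions = {}
    for position in mask_positions(tokens):
        top_ids = logits[0, position].topk(k).indices.tolist()
        predictions[position] = tokenizer.convert_ids_to_tokens(top_ids)
    return predictions


# Token lists copied from the examples above.
tokens_1 = ['[CLS]', 'o', 'ποιητης', 'εγραψε', 'ενα', '[MASK]', '.', '[SEP]']
tokens_3 = ['[CLS]', 'ειναι', 'ενας', '[MASK]', 'ανθρωπος', 'και', 'κανει',
            'συχνα', '[MASK]', '.', '[SEP]']

print(mask_positions(tokens_1))  # [5]
print(mask_positions(tokens_3))  # [3, 8]
```

Calling top_k_predictions(text_3, tokenizer_greek, lm_model_greek) would then return one candidate list per masked position, keyed by token index.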