Greek BERT

Version 1.0.0

A Greek version of Google's BERT pre-trained language model.

Pre-training corpora

The pre-training corpora of bert-base-greek-uncased-v1 include:

The Greek part of Wikipedia,
The Greek part of the European Parliament Proceedings Parallel Corpus, and
The Greek part of OSCAR, a cleansed version of Common Crawl.

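Future releases will also include:

The entire corpus of Greek legislation, as published by the National Publication Office,
The entire corpus of EU legislation (Greek translation), as published in EUR-Lex.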

Pre-training details

We trained BERT using the official code provided in Google BERT's GitHub repository (https://github.com/google-research/bert).
We released a model similar to the English bert-base-uncased model (12-layer, 768-hidden, 12-heads, 110M parameters).
We followed the same training setup: 1 million training steps with batches of 256 sequences of length 512 and an initial learning rate of 1e-4.
We were able to use a single Google Cloud TPU v3-8 provided for free by the TensorFlow Research Cloud (TFRC), while also using GCP research credits. Huge thanks to both Google programs for supporting us!
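
To give a sense of scale for the setup above, the following sketch multiplies out the quoted hyperparameters. The constant and function names are illustrative and not part of the original training code. The result is an upper bound on token positions processed, since it assumes every sequence is filled to the full length of 512.

```python
# Hyperparameters quoted in the pre-training details.
TRAINING_STEPS = 1_000_000
BATCH_SIZE = 256        # sequences per batch
MAX_SEQ_LENGTH = 512    # token positions per sequence


def token_positions_processed(steps, batch_size, seq_length):
    """Upper bound on token positions seen in training (padding included)."""
    return steps * batch_size * seq_length


positions_per_step = BATCH_SIZE * MAX_SEQ_LENGTH
total_positions = token_positions_processed(
    TRAINING_STEPS, BATCH_SIZE, MAX_SEQ_LENGTH
)

print(f"{positions_per_step:,} token positions per step")  # 131,072
print(f"{total_positions:,} token positions in total")     # 131,072,000,000
```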

Requirements

We published bert-base-greek-uncased-v1 as part of Hugging Face's Transformers repository. You therefore need to install the transformers library through pip, along with PyTorch or TensorFlow 2.

# unicodedata ships with the Python standard library, so it needs no install
pip install transformers
pip install torch  # or: pip install tensorflow
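
As a quick sanity check after installation, a minimal sketch like the one below reports whether the required packages can be found. The helper names `is_installed` and `environment_ready` are hypothetical; the check only uses the standard-library `importlib.util.find_spec` and does not import the packages themselves.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


def environment_ready():
    """True if transformers plus at least one backend (torch or tensorflow) is found."""
    has_backend = is_installed("torch") or is_installed("tensorflow")
    return is_installed("transformers") and has_backend


for name in ("transformers", "torch", "tensorflow"):
    print(f"{name}: {'found' if is_installed(name) else 'missing'}")
print("ready:", environment_ready())
```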

Pre-process text (deaccent, lowercase)

To use bert-base-greek-uncased-v1, you need to pre-process text by converting it to lowercase and removing all Greek diacritics.

import unicodedata

def strip_accents_and_lowercase(s):
    # NFD splits each accented letter into a base letter plus combining marks;
    # characters in category 'Mn' (nonspacing marks) are the diacritics to drop.
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn').lower()

accented_string = "Αυτή είναι η Ελληνική έκδοση του BERT."
unaccented_string = strip_accents_and_lowercase(accented_string)

print(unaccented_string)  # αυτη ειναι η ελληνικη εκδοση του bert.
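
To see why filtering on category 'Mn' removes the diacritics, the sketch below lists the NFD components of two precomposed Greek letters. The helper name `describe_decomposition` is illustrative; the characters are written as escapes so the example does not depend on how the source file is encoded.

```python
import unicodedata


def describe_decomposition(ch):
    """Return (code point, Unicode name, category) for each NFD component of ch."""
    return [
        (f"U+{ord(c):04X}", unicodedata.name(c), unicodedata.category(c))
        for c in unicodedata.normalize("NFD", ch)
    ]


# U+03AE is eta with tonos, U+03CA is iota with dialytika.
for ch in ("\u03ae", "\u03ca"):
    for code_point, name, category in describe_decomposition(ch):
        print(code_point, name, category)
# U+03B7 GREEK SMALL LETTER ETA Ll
# U+0301 COMBINING ACUTE ACCENT Mn
# U+03B9 GREEK SMALL LETTER IOTA Ll
# U+0308 COMBINING DIAERESIS Mn
```

Only the base letters (category 'Ll') survive the filter; the combining marks (category 'Mn') are dropped.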

Load the pre-trained model

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("nlpaueb/bert-base-greek-uncased-v1")
model = AutoModel.from_pretrained("nlpaueb/bert-base-greek-uncased-v1")

Use the pre-trained model as a language model

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load model and tokenizer
# (AutoModelForMaskedLM replaces the deprecated AutoModelWithLMHead)
tokenizer_greek = AutoTokenizer.from_pretrained('nlpaueb/bert-base-greek-uncased-v1')
lm_model_greek = AutoModelForMaskedLM.from_pretrained('nlpaueb/bert-base-greek-uncased-v1')

# ================ EXAMPLE 1 ================
text_1 = 'O ποιητής έγραψε ένα [MASK] .'
# EN: 'The poet wrote a [MASK].'
input_ids = tokenizer_greek.encode(text_1)
print(tokenizer_greek.convert_ids_to_tokens(input_ids))
# ['[CLS]', 'o', 'ποιητης', 'εγραψε', 'ενα', '[MASK]', '.', '[SEP]']
outputs = lm_model_greek(torch.tensor([input_ids]))[0]
# Position 5 is the [MASK] token; take the highest-scoring vocabulary entry
print(tokenizer_greek.convert_ids_to_tokens(outputs[0, 5].max(0)[1].item()))
# the most plausible prediction for [MASK] is "song"

# ================ EXAMPLE 2 ================
text_2 = 'Είναι ένας [MASK] άνθρωπος.'
# EN: 'He is a [MASK] person.'
input_ids = tokenizer_greek.encode(text_2)
print(tokenizer_greek.convert_ids_to_tokens(input_ids))
# ['[CLS]', 'ειναι', 'ενας', '[MASK]', 'ανθρωπος', '.', '[SEP]']
outputs = lm_model_greek(torch.tensor([input_ids]))[0]
# Position 3 is the [MASK] token
print(tokenizer_greek.convert_ids_to_tokens(outputs[0, 3].max(0)[1].item()))
# the most plausible prediction for [MASK] is "good"

# ================ EXAMPLE 3 ================
text_3 = 'Είναι ένας [MASK] άνθρωπος και κάνει συχνά [MASK].'
# EN: 'He is a [MASK] person and he frequently does [MASK].'
input_ids = tokenizer_greek.encode(text_3)
print(tokenizer_greek.convert_ids_to_tokens(input_ids))
# ['[CLS]', 'ειναι', 'ενας', '[MASK]', 'ανθρωπος', 'και', 'κανει', 'συχνα', '[MASK]', '.', '[SEP]']
outputs = lm_model_greek(torch.tensor([input_ids]))[0]
# Position 8 is the second [MASK] token
print(tokenizer_greek.convert_ids_to_tokens(outputs[0, 8].max(0)[1].item()))
# the most plausible prediction for the second [MASK] is "trips"
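
The examples index the model output with hard-coded positions (5, 3 and 8). A minimal sketch of locating those positions from the token list instead is shown below. The helper `mask_positions` is hypothetical, and the token lists are copied from the printed tokenizations in the examples, so no model download is needed to run it. With a real tokenizer, the mask string would normally be read from its `mask_token` attribute rather than hard-coded.

```python
MASK_TOKEN = "[MASK]"  # assumed mask string, matching the examples


def mask_positions(tokens, mask_token=MASK_TOKEN):
    """Return the indices of every mask token in a list of token strings."""
    return [i for i, token in enumerate(tokens) if token == mask_token]


# Token lists as printed in the three examples.
tokens_1 = ['[CLS]', 'o', 'ποιητης', 'εγραψε', 'ενα', '[MASK]', '.', '[SEP]']
tokens_2 = ['[CLS]', 'ειναι', 'ενας', '[MASK]', 'ανθρωπος', '.', '[SEP]']
tokens_3 = ['[CLS]', 'ειναι', 'ενας', '[MASK]', 'ανθρωπος', 'και',
            'κανει', 'συχνα', '[MASK]', '.', '[SEP]']

print(mask_positions(tokens_1))  # [5]
print(mask_positions(tokens_2))  # [3]
print(mask_positions(tokens_3))  # [3, 8]
```

Each returned index can then be used in place of the literal position when reading the scores for that masked slot.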