This repo is meant as a space to centralize Romanian Transformers and to provide a uniform evaluation. Contributions are welcome.
We're using HuggingFace's Transformers lib, an awesome tool for NLP. What's BERT you ask? Here's a clear and condensed article about what BERT is and what it can do. Also check out this summary of different transformer models.
What follows is the list of Romanian transformer models, covering both masked and causal (generative) language models.
Feel free to open an issue and add your model/eval here!
| Model | Type | Size | Article/Citation/Source | Pre-trained / Fine-tuned | Release Date |
|---|---|---|---|---|---|
| dumitrescustefan/bert-base-romanian-cased-v1 | BERT | 124M | PDF / Cite | Pre-trained | Apr, 2020 |
| dumitrescustefan/bert-base-romanian-uncased-v1 | BERT | 124M | PDF / Cite | Pre-trained | Apr, 2020 |
| racai/distillbert-base-romanian-cased | DistilBERT | 81M | - | Pre-trained | Apr, 2021 |
| readerbench/RoBERT-small | BERT | 19M | | Pre-trained | May, 2021 |
| readerbench/RoBERT-base | BERT | 114M | | Pre-trained | May, 2021 |
| readerbench/RoBERT-large | BERT | 341M | | Pre-trained | May, 2021 |
| dumitrescustefan/bert-base-romanian-ner | BERT | 124M | HF Space | Named Entity Recognition on RONECv2 | Jan, 2022 |
| snisioi/bert-legal-romanian-cased-v1 | BERT | 124M | - | Legal documents on MARCELLv2 | Jan, 2022 |
| readerbench/jurBERT-base | BERT | 111M | | Legal documents | Oct, 2021 |
| readerbench/jurBERT-large | BERT | 337M | | Legal documents | Oct, 2021 |

| Model | Type | Size | Article/Citation/Source | Pre-trained / Fine-tuned | Release Date |
|---|---|---|---|---|---|
| dumitrescustefan/gpt-neo-romanian-780m | GPT-Neo | 780M | not yet / HF Space | Pre-trained | Sep, 2022 |
| readerbench/RoGPT2-base | GPT2 | 124M | | Pre-trained | Jul, 2021 |
| readerbench/RoGPT2-medium | GPT2 | 354M | | Pre-trained | Jul, 2021 |
| readerbench/RoGPT2-large | GPT2 | 774M | | Pre-trained | Jul, 2021 |
NEW: Check out this HF Space to play with Romanian generative models: https://huggingface.co/spaces/dumitrescustefan/romanian-text-generation

Models are evaluated using the public Colab script available here. All reported results are the average of 5 runs with the same parameters. For larger models, where possible, a larger batch size was simulated by accumulating gradients, so that all models have the same effective batch size. Only standard models (not fine-tuned for a particular task) that fit in 16GB of RAM are evaluated.
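The batch-size simulation mentioned above boils down to standard gradient accumulation. Below is a minimal, hypothetical PyTorch sketch; the placeholder model, optimizer and `accumulation_steps` value are illustrative and not taken from the actual Colab script:

```python
import torch
from torch import nn

# Hypothetical sketch (not the evaluation script itself): simulate a larger
# effective batch size by accumulating gradients over several small batches.
model = nn.Linear(10, 2)                        # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
accumulation_steps = 4                          # effective batch = per-step batch x 4

optimizer.zero_grad()
for step in range(16):
    x = torch.randn(8, 10)                      # small per-step batch
    y = torch.randint(0, 2, (8,))
    loss = loss_fn(model(x), y) / accumulation_steps  # scale so gradients sum like one large batch
    loss.backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```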
The evaluation covers the following tasks, and, for brevity, we report a single metric for each:

| Model | Type | Size | NER/EM_strict | RoSTS/Pearson | Ro-pos-tagger/UPOS F1 | REDv2/hamming_loss |
|---|---|---|---|---|---|---|
| dumitrescustefan/bert-base-romanian-cased-v1 | BERT | 124M | 0.8815 | 0.7966 | 0.982 | 0.1039 |
| dumitrescustefan/bert-base-romanian-uncased-v1 | BERT | 124M | 0.8572 | 0.8149 | 0.9826 | 0.1038 |
| racai/distillbert-base-romanian-cased | DistilBERT | 81M | 0.8573 | 0.7285 | 0.9637 | 0.1119 |
| readerbench/RoBERT-small | BERT | 19M | 0.8512 | 0.7827 | 0.9794 | 0.1085 |
| readerbench/RoBERT-base | BERT | 114M | 0.8768 | 0.8102 | 0.9819 | 0.1041 |

| Model | Type | Size | NER/EM_strict | RoSTS/Pearson | Ro-pos-tagger/UPOS F1 | REDv2/hamming_loss | Perplexity |
|---|---|---|---|---|---|---|---|
| readerbench/RoGPT2-base | GPT2 | 124M | 0.6865 | 0.7963 | 0.9009 | 0.1068 | 52.34 |
| readerbench/RoGPT2-medium | GPT2 | 354M | 0.7123 | 0.7979 | 0.9098 | 0.114 | 31.26 |
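The perplexity column applies only to the causal models. As a rough, hedged illustration of how such a number can be computed (the exact corpus and windowing strategy behind the reported scores are not specified here, and this single-sentence example is purely illustrative):

```python
import math
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Rough illustration on a single sentence; the reported scores are assumed to
# use a held-out corpus and a specific windowing strategy not shown here.
name = "readerbench/RoGPT2-base"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
model.eval()

text = "Acesta este un text de test."
input_ids = tokenizer.encode(text, return_tensors="pt")
with torch.no_grad():
    loss = model(input_ids, labels=input_ids).loss  # mean token cross-entropy
print("perplexity ~", math.exp(loss.item()))
```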
Using HuggingFace's Transformers lib, instantiate a model and replace the model name as necessary. Then use an appropriate model head depending on your task. Here are a few examples:
```python
from transformers import AutoTokenizer, AutoModel
import torch

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")
model = AutoModel.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")

# tokenize a sentence and run it through the model
input_ids = tokenizer.encode("Acesta este un test.", add_special_tokens=True, return_tensors="pt")
outputs = model(input_ids)

# get the encoding
last_hidden_states = outputs[0]  # the last hidden state is the first element of the output tuple
```

Remember to sanitize your text before feeding it to the model: replace the legacy cedilla letters ş and ţ with the correct comma-below letters ș and ț:

```python
text = text.replace("ţ", "ț").replace("ş", "ș").replace("Ţ", "Ț").replace("Ş", "Ș")
```
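For a task-specific head, swap `AutoModel` for the corresponding class. Here is a minimal sketch for token classification (e.g. NER); the `num_labels=5` value is a placeholder for your own label set, and the head below is freshly initialized, not a released fine-tuned checkpoint:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Illustrative only: put a token-classification head (e.g. for NER) on top of
# the base encoder; num_labels=5 is a placeholder for your own label set.
tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")
model = AutoModelForTokenClassification.from_pretrained(
    "dumitrescustefan/bert-base-romanian-cased-v1", num_labels=5
)

input_ids = tokenizer.encode("Acesta este un test.", add_special_tokens=True, return_tensors="pt")
with torch.no_grad():
    logits = model(input_ids).logits   # shape: (1, sequence_length, num_labels)
predictions = logits.argmax(-1)        # per-token label ids (the head is untrained here)
```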
Give a prompt to a generative model and let it write:
tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/gpt-neo-romanian-125m")
model = AutoModelForCausalLM.from_pretrained("dumitrescustefan/gpt-neo-romanian-125m")
input_ids = tokenizer.encode("Cine a fost Mihai Eminescu? A fost", return_tensors='pt')
text = model.generate(input_ids, max_length=128, do_sample=True, no_repeat_ngram_size=2, top_k=50, top_p=0.9, early_stopping=True)
print(tokenizer.decode(text[0], skip_special_tokens=True))P.S. You can test all generative models here: https://huggingface.co/spaces/dumitrescustefan/romanian-text-generation
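If you prefer a one-liner, the same kind of generation can also be run through the `pipeline` API; a short sketch using the GPT-Neo model from the table above (the sampling parameters are illustrative):

```python
from transformers import pipeline

# Same idea via the pipeline API; sampling parameters are illustrative.
generator = pipeline("text-generation", model="dumitrescustefan/gpt-neo-romanian-780m")
print(generator("Cine a fost Mihai Eminescu? A fost",
                max_length=64, do_sample=True, top_p=0.9)[0]["generated_text"])
```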