How to use classla/xlm-r-bertic with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("fill-mask", model="classla/xlm-r-bertic")

# Or load the model directly
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("classla/xlm-r-bertic")
model = AutoModelForMaskedLM.from_pretrained("classla/xlm-r-bertic")
```

This model was produced by pre-training XLM-Roberta-large for 48k steps on South Slavic languages using the XLM-R-BERTić dataset.
Three tasks were chosen for model evaluation: named entity recognition (NER), sentiment regression, and causal commonsense reasoning (COPA). In all cases, the model was fine-tuned for the specific downstream task.

Named entity recognition: average macro-F1 scores from three runs were used to evaluate performance. Datasets used: hr500k, ReLDI-sr, ReLDI-hr, and SETimes.SR.
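For readers unfamiliar with the metric, a minimal sketch of macro-F1 (the unweighted mean of per-class F1 scores) and of averaging it over runs, as done in the tables below. This is an illustration only, not the benchmark's evaluation code:

```python
from statistics import mean

def macro_f1(gold, pred):
    """Macro-F1: per-class F1 averaged uniformly over all observed labels."""
    labels = set(gold) | set(pred)
    f1s = []
    for lab in labels:
        tp = sum(g == lab and p == lab for g, p in zip(gold, pred))
        fp = sum(g != lab and p == lab for g, p in zip(gold, pred))
        fn = sum(g == lab and p != lab for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return mean(f1s)

# Toy NER tags for illustration:
gold = ["PER", "O", "O", "ORG"]
pred = ["PER", "O", "ORG", "ORG"]
print(round(macro_f1(gold, pred), 3))  # → 0.778

# The reported scores are means over three fine-tuning runs:
run_scores = [0.925, 0.927, 0.929]  # hypothetical per-run values
print(round(mean(run_scores), 3))  # → 0.927
```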
| system | dataset | F1 score |
|---|---|---|
| XLM-R-BERTić | hr500k | 0.927 |
| BERTić | hr500k | 0.925 |
| XLM-R-SloBERTić | hr500k | 0.923 |
| XLM-Roberta-Large | hr500k | 0.919 |
| crosloengual-bert | hr500k | 0.918 |
| XLM-Roberta-Base | hr500k | 0.903 |
| system | dataset | F1 score |
|---|---|---|
| XLM-R-SloBERTić | ReLDI-hr | 0.812 |
| XLM-R-BERTić | ReLDI-hr | 0.809 |
| crosloengual-bert | ReLDI-hr | 0.794 |
| BERTić | ReLDI-hr | 0.792 |
| XLM-Roberta-Large | ReLDI-hr | 0.791 |
| XLM-Roberta-Base | ReLDI-hr | 0.763 |
| system | dataset | F1 score |
|---|---|---|
| XLM-R-SloBERTić | SETimes.SR | 0.949 |
| XLM-R-BERTić | SETimes.SR | 0.940 |
| BERTić | SETimes.SR | 0.936 |
| XLM-Roberta-Large | SETimes.SR | 0.933 |
| crosloengual-bert | SETimes.SR | 0.922 |
| XLM-Roberta-Base | SETimes.SR | 0.914 |
| system | dataset | F1 score |
|---|---|---|
| XLM-R-BERTić | ReLDI-sr | 0.841 |
| XLM-R-SloBERTić | ReLDI-sr | 0.824 |
| BERTić | ReLDI-sr | 0.798 |
| XLM-Roberta-Large | ReLDI-sr | 0.774 |
| crosloengual-bert | ReLDI-sr | 0.751 |
| XLM-Roberta-Base | ReLDI-sr | 0.734 |
The ParlaSent dataset was used to evaluate sentiment regression for Bosnian, Croatian, and Serbian. The procedure is explained in greater detail in the dedicated benchmarking repository.
| system | train | test | r^2 |
|---|---|---|---|
| xlm-r-parlasent | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.615 |
| BERTić | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.612 |
| XLM-R-SloBERTić | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.607 |
| XLM-Roberta-Large | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.605 |
| XLM-R-BERTić | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.601 |
| crosloengual-bert | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.537 |
| XLM-Roberta-Base | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.500 |
| dummy (mean) | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | -0.12 |
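The r^2 column is the coefficient of determination, and the negative score of the "dummy (mean)" baseline falls out of its definition: a constant predictor set to the *training*-set mean scores below zero whenever the test-set mean differs. A minimal sketch (illustrative values, not the benchmark's code):

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_test = [1.0, 2.0, 3.0]          # hypothetical sentiment targets
train_mean = 2.5                   # hypothetical training-set mean

# Perfect predictions give r^2 = 1; the dummy mean predictor goes negative
# on held-out data, as in the last row of the table above.
print(r2_score(y_test, y_test))              # → 1.0
print(r2_score(y_test, [train_mean] * 3))    # → -0.375
```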
Causal commonsense reasoning was evaluated on the Choice of Plausible Alternatives (COPA) datasets, with accuracy as the metric.

| system | dataset | Accuracy score |
|---|---|---|
| BERTić | Copa-SR | 0.689 |
| XLM-R-SloBERTić | Copa-SR | 0.665 |
| XLM-R-BERTić | Copa-SR | 0.637 |
| crosloengual-bert | Copa-SR | 0.607 |
| XLM-Roberta-Base | Copa-SR | 0.573 |
| XLM-Roberta-Large | Copa-SR | 0.570 |
| system | dataset | Accuracy score |
|---|---|---|
| BERTić | Copa-HR | 0.669 |
| XLM-R-BERTić | Copa-HR | 0.635 |
| XLM-R-SloBERTić | Copa-HR | 0.628 |
| crosloengual-bert | Copa-HR | 0.669 |
| XLM-Roberta-Base | Copa-HR | 0.585 |
| XLM-Roberta-Large | Copa-HR | 0.571 |
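In COPA, each example pairs a premise with two alternatives, and the system must pick the more plausible one; accuracy is the fraction of correct picks. A toy sketch of that protocol, with a hypothetical word-overlap scorer standing in for the fine-tuned model:

```python
def copa_accuracy(examples, score):
    """examples: (premise, alternative1, alternative2, correct_index) tuples.
    The higher-scoring alternative is taken as the prediction."""
    correct = sum(
        (0 if score(p, a1) >= score(p, a2) else 1) == label
        for p, a1, a2, label in examples
    )
    return correct / len(examples)

def toy_score(premise, alternative):
    # Placeholder: shared-word count. A real evaluation scores each
    # alternative with the fine-tuned language model instead.
    return len(set(premise.split()) & set(alternative.split()))

examples = [
    ("the glass fell off the table",
     "the glass fell and broke",    # alternative 0 (correct)
     "the glass flew away",         # alternative 1
     0),
]
print(copa_accuracy(examples, toy_score))  # → 1.0
```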
Please cite the following paper:
```bibtex
@inproceedings{ljubesic-etal-2024-language,
    title = "Language Models on a Diet: Cost-Efficient Development of Encoders for Closely-Related Languages via Additional Pretraining",
    author = "Ljube{\v{s}}i{\'c}, Nikola and
      Suchomel, V{\'\i}t and
      Rupnik, Peter and
      Kuzman, Taja and
      van Noord, Rik",
    editor = "Melero, Maite and
      Sakti, Sakriani and
      Soria, Claudia",
    booktitle = "Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.sigul-1.23",
    pages = "189--203",
}
```