Jargon - Text models
Jargon is an efficient transformer encoder LM for French, combining the Linformer attention mechanism with the RoBERTa model architecture.
Jargon is available in several versions with different context sizes and types of pre-training corpora.
| Model | Initialised from... | Training Data |
|---|---|---|
| jargon-general-base | scratch | 8.5GB Web Corpus |
| jargon-general-biomed | jargon-general-base | 5.4GB Medical Corpus |
| jargon-general-legal | jargon-general-base | 18GB Legal Corpus |
| jargon-multidomain-base | jargon-general-base | Medical+Legal Corpora |
| jargon-legal | scratch | 18GB Legal Corpus |
| jargon-legal-4096 | scratch | 18GB Legal Corpus |
| jargon-biomed | scratch | 5.4GB Medical Corpus |
| jargon-biomed-4096 | scratch | 5.4GB Medical Corpus |
| jargon-NACHOS | scratch | NACHOS |
| jargon-NACHOS-4096 | scratch | NACHOS |
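The -4096 variants were pre-trained with a longer context window than the base models. As a minimal sketch, you can inspect the configured limits at load time; the repository id below assumes the same PantagrueLLM/ prefix as the snippet further down, and the attribute names are the standard transformers ones, which the remote configuration is assumed to expose:

```python
from transformers import AutoConfig, AutoTokenizer

# Checkpoint name assumed for illustration (same org prefix as jargon-NACHOS below)
repo_id = "PantagrueLLM/jargon-legal-4096"

config = AutoConfig.from_pretrained(repo_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)

print(config.max_position_embeddings)  # context length supported by the position embeddings
print(tokenizer.model_max_length)      # length at which the tokenizer starts truncating
```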
The Jargon models were evaluated on a range of specialized downstream tasks.
For more information, please see the paper, accepted for publication at LREC-COLING 2024.
You can get started with this model using the code snippet below:
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

# Load the tokenizer and model; trust_remote_code is required for the custom Jargon/Linformer code
tokenizer = AutoTokenizer.from_pretrained("PantagrueLLM/jargon-NACHOS", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("PantagrueLLM/jargon-NACHOS", trust_remote_code=True)

# Build a fill-mask pipeline and predict the masked token
jargon_maskfiller = pipeline("fill-mask", model=model, tokenizer=tokenizer)
output = jargon_maskfiller("Il est allé au <mask> hier")
```
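The pipeline returns a ranked list of candidate fillers; the field names below follow the standard transformers fill-mask output format rather than anything Jargon-specific:

```python
# Each candidate is a dict with 'sequence', 'score', 'token' and 'token_str' keys
top = output[0]
print(f"{top['token_str']!r} (score={top['score']:.3f})")
```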
You can also use the classes AutoModel, AutoModelForSequenceClassification, or AutoModelForTokenClassification to load Jargon models, depending on the downstream task in question.
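For example, a sequence-classification head can be attached in the usual way; this is only a sketch, and the number of labels is a placeholder that depends on your own task:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# num_labels=2 is a hypothetical value; set it to match your classification task
tokenizer = AutoTokenizer.from_pretrained("PantagrueLLM/jargon-NACHOS", trust_remote_code=True)
classifier = AutoModelForSequenceClassification.from_pretrained(
    "PantagrueLLM/jargon-NACHOS",
    num_labels=2,
    trust_remote_code=True,
)
```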
If you use this model for your own research work, please cite as follows:
```bibtex
@inproceedings{segonne:hal-04535557,
  TITLE = {{Jargon: A Suite of Language Models and Evaluation Tasks for French Specialized Domains}},
  AUTHOR = {Segonne, Vincent and Mannion, Aidan and Alonzo Canul, Laura Cristina and Audibert, Alexandre and Liu, Xingyu and Macaire, C{\'e}cile and Pupier, Adrien and Zhou, Yongxin and Aguiar, Mathilde and Herron, Felix and Norr{\'e}, Magali and Amini, Massih-Reza and Bouillon, Pierrette and Eshkol-Taravella, Iris and Esperan{\c c}a-Rodier, Emmanuelle and Fran{\c c}ois, Thomas and Goeuriot, Lorraine and Goulian, J{\'e}r{\^o}me and Lafourcade, Mathieu and Lecouteux, Benjamin and Portet, Fran{\c c}ois and Ringeval, Fabien and Vandeghinste, Vincent and Coavoux, Maximin and Dinarelli, Marco and Schwab, Didier},
  URL = {https://hal.science/hal-04535557},
  BOOKTITLE = {{LREC-COLING 2024 - Joint International Conference on Computational Linguistics, Language Resources and Evaluation}},
  ADDRESS = {Turin, Italy},
  YEAR = {2024},
  MONTH = May,
  KEYWORDS = {Self-supervised learning ; Pretrained language models ; Evaluation benchmark ; Biomedical document processing ; Legal document processing ; Speech transcription},
  PDF = {https://hal.science/hal-04535557/file/FB2_domaines_specialises_LREC_COLING24.pdf},
  HAL_ID = {hal-04535557},
  HAL_VERSION = {v1},
}
```