Hi all,
I’m a student currently working on a project on hate speech type detection in Arabic. The data (tweets) I managed to collect looks as follows:
Total rows: 36080
Distribution:
- is_hate: 5004
- Origin-based Hate (OH): 1003
- Gender/Sexuality Hate (GH): 1496
- Religion/Sect/Ideology Hate (IH): 2252
- Other: 486
The pretrained model I’ve chosen is MARBERTv2, because it seems to be the best-performing model on similar tasks. Since my resources are limited and the data is imbalanced, I wanted guidance on which configuration/hybrid would be the most suitable choice;
for instance a shared MARBERTv2 encoder with two heads:
- Head 1: binary
- Head 2: fine-grained (4-way)
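To make the two-head option concrete, here's a minimal sketch of what I mean. The encoder below is a toy stand-in (embedding + mean pooling) just to show the shape of the architecture; in practice it would be replaced by the pretrained encoder, e.g. `AutoModel.from_pretrained("UBC-NLP/MARBERTv2")` from `transformers`, pooling its last hidden state:

```python
import torch
import torch.nn as nn

class TwoHeadClassifier(nn.Module):
    """Shared encoder with a binary head and a 4-way fine-grained head.

    The encoder here is a toy stand-in; in practice you would swap in
    MARBERTv2 (hidden size 768) and pool its last_hidden_state.
    """
    def __init__(self, vocab_size=1000, hidden=768, num_subtypes=4):
        super().__init__()
        # Stand-in encoder: token embeddings + mean pooling.
        self.encoder = nn.Embedding(vocab_size, hidden)
        self.binary_head = nn.Linear(hidden, 2)            # hate vs. not hate
        self.fine_head = nn.Linear(hidden, num_subtypes)   # OH / GH / IH / Other

    def forward(self, input_ids):
        pooled = self.encoder(input_ids).mean(dim=1)       # (batch, hidden)
        return self.binary_head(pooled), self.fine_head(pooled)

model = TwoHeadClassifier()
logits_bin, logits_fine = model(torch.randint(0, 1000, (8, 32)))
print(logits_bin.shape, logits_fine.shape)  # torch.Size([8, 2]) torch.Size([8, 4])
```

During training the two losses (binary + fine-grained) would be summed, with the fine-grained loss presumably masked or down-weighted on non-hate examples.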
Or maybe MARBERTv2 + XGBoost, where XGBoost replaces the standard neural classification head?
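For that second option, the pipeline I imagine is: freeze MARBERTv2, extract pooled embeddings for each tweet, then train a boosted-tree classifier on those vectors. A sketch with random stand-in embeddings (I use scikit-learn's `GradientBoostingClassifier` here only because it shares the same fit/predict API; `xgboost.XGBClassifier` would slot in the same way):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Stand-in data: in practice 'embeddings' would be the pooled [CLS]
# vectors extracted from a frozen MARBERTv2 (shape: n_tweets x 768).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 768))
labels = rng.integers(0, 4, size=200)  # OH / GH / IH / Other

X_tr, X_te, y_tr, y_te = train_test_split(embeddings, labels, random_state=0)

# Drop-in stand-in for xgboost.XGBClassifier (same scikit-learn API).
clf = GradientBoostingClassifier(n_estimators=50)
clf.fit(X_tr, y_tr)
preds = clf.predict(X_te)
print(preds.shape)  # (50,)
```

The appeal would be that the tree model handles class imbalance via sample weights without fine-tuning the encoder, at the cost of losing task-specific representations.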
I am looking for a configuration that works best for this kind of hierarchical, imbalanced classification, so as to maximize macro-F1. I’d be really grateful if any NLP experts / people who’ve worked on this sort of thing before could offer advice on:
- Which approach would likely give the best macro-F1 across the 4 hate subtypes?
- Does anyone have experience using MARBERTv2 for multi-head or multi-task classification?
- Any recommended loss functions (focal loss? class weights?) or sampling strategies?
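For context, here's roughly what I had in mind for the loss side, so people can point out if I'm off track. The weights use the inverse-frequency heuristic (n_samples / (n_classes * count), as in sklearn's "balanced" mode) computed from my subtype counts above, and the focal loss is the standard multi-class form; I haven't validated either choice:

```python
import torch
import torch.nn.functional as F

# Subtype counts from my data: OH 1003, GH 1496, IH 2252, Other 486.
counts = torch.tensor([1003.0, 1496.0, 2252.0, 486.0])
# Inverse-frequency weights: n_samples / (n_classes * count_c).
weights = counts.sum() / (len(counts) * counts)

def focal_loss(logits, targets, gamma=2.0, weight=None):
    """Multi-class focal loss: down-weights easy examples by (1 - p_t)^gamma."""
    log_p = F.log_softmax(logits, dim=-1)
    # Per-sample (optionally class-weighted) cross-entropy.
    ce = F.nll_loss(log_p, targets, weight=weight, reduction="none")
    # Probability assigned to the true class.
    p_t = log_p.gather(1, targets.unsqueeze(1)).squeeze(1).exp()
    return ((1 - p_t) ** gamma * ce).mean()

logits = torch.randn(8, 4)
targets = torch.randint(0, 4, (8,))
loss = focal_loss(logits, targets, weight=weights)
print(loss.item())
```

With gamma=0 and no weights this should reduce to plain cross-entropy, which seems like a useful sanity check before comparing variants.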