What makes this Bengali NLP task so difficult?
Recently I posted a benchmark summary for three Bangla language models and one multilingual model (Indic-BERT). I’ve bolded any models within 1 percentage point of the top score.
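The "within 1 percentage point" bolding rule is easy to apply mechanically. Here is a small sketch of how that could look; the model names and scores are toy values for illustration, not my actual benchmark numbers:

```python
def bold_near_top(scores: dict[str, float], margin: float = 1.0) -> dict[str, str]:
    """Format each model's accuracy, bolding any within `margin` points of the best."""
    best = max(scores.values())
    return {
        model: (f"**{acc:.1f}**" if best - acc <= margin else f"{acc:.1f}")
        for model, acc in scores.items()
    }

# Toy scores only -- not the real benchmark results.
cells = bold_near_top({"mBERT": 84.3, "Indic-BERT": 85.1, "ELECTRA": 83.0})
```

Here both the top model and any close runner-up get bolded, which avoids over-reading small gaps between models.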
Indic-BERT and my own ELECTRA model performed well on Sentiment Analysis and News Topic classification, but notably worse on Hate Speech classification, where neither matched mBERT. What makes this task so difficult, and why does it affect models differently?
Experiment 1: Revised Dataset
When I shared my results, the Indic-BERT team asked some questions, so I went back to the original source of the data. The designated train and test CSVs had recently been replaced with a single 'revised' CSV.
Code and supplementary materials for our paper titled "Classification Benchmarks for Under-resourced Bengali Language…
This change inspired my first experiment: does a test based on the revised CSV lead to better results?
In a re-run of the experiment, Indic-BERT improved, now scoring higher than mBERT. The neuralspace-reverie model held onto the #1 spot, extending its lead over Bangla-BERT.
Experiment 2: All Small
Even with the new CSV, I noticed that Hate Speech is the smallest dataset (1,400 rows, with only 1,050 for training). I can't experiment with making this dataset larger (maybe data augmentation another day), so I wondered what would happen if the other training datasets were capped at the same size. That led to my second experiment.
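Capping each task's training set at the Hate Speech size is a one-line subsample per dataset. A sketch, with a toy DataFrame standing in for a larger task's training set:

```python
import pandas as pd

TARGET = 1050  # rows available for training in the Hate Speech dataset

def shrink(train_df: pd.DataFrame, seed: int = 42) -> pd.DataFrame:
    """Randomly subsample a larger training set down to TARGET rows."""
    if len(train_df) <= TARGET:
        return train_df
    return train_df.sample(n=TARGET, random_state=seed).reset_index(drop=True)

# Toy stand-in for a larger task's training set (e.g. sentiment or news topics).
big_train = pd.DataFrame({
    "text": [f"row {i}" for i in range(8000)],
    "label": [i % 3 for i in range(8000)],
})
small_train = shrink(big_train)
```

Only the training sets shrink; the test sets stay at full size so accuracy numbers remain comparable to the earlier runs.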
Comparing changes in accuracy:
All models lost accuracy when training on only 1,050 rows, but the drop was much more visible for my ELECTRA model (orange line) on both tasks. This suggests that my pretrained model carries less information than I'd like, and that its accuracy comes largely from finetuning on a large dataset.
This experiment did not reveal why Indic-BERT would underperform on the hate speech task — in one early run of this experiment, I actually got a higher score from Indic-BERT on the sentiment analysis task.