Training Bangla and Tamil language BERT models
Previously I posted about training Hindi-BERT and dv-wave (Maldivian Dhivehi) with Google's ELECTRA code and publishing them on HuggingFace. Now I've returned to the project for two more South Asian language models: TaMillion (Tamil) and Bangla-Electra (Bangla/Bengali).
Corpus and Training
As I've described in earlier posts, I downloaded 5GB of deduplicated web-scraped text from the OSCAR corpus and added the latest dump of Wikipedia articles (~0.45GB as of 1 July 2020).
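The exact preprocessing lives in the notebooks linked below, but the corpus step looks roughly like this sketch; the file names are placeholders for the downloaded OSCAR text dump and a plain-text extraction of the Wikipedia dump (e.g. from wikiextractor):

```python
# Sketch of the corpus-assembly step with placeholder file names:
# concatenate the deduplicated OSCAR web text with plain text extracted
# from the Wikipedia dump into one file for ELECTRA pretraining.
corpus_parts = [
    "bn_dedup.txt",           # OSCAR deduplicated text (~5GB)
    "bnwiki_extracted.txt",   # Wikipedia articles as plain text (~0.45GB)
]

with open("pretraining_corpus.txt", "w", encoding="utf-8") as out:
    for part in corpus_parts:
        with open(part, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:              # drop blank lines between articles
                    out.write(line + "\n")
```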
Here are both Colab notebooks: Tamil and Bangla.
Benchmarking
After training the models, I need to demonstrate that these homemade models perform better than simply using Google's Multilingual BERT (mBERT). This is trickier in low-resource languages, where there are no common standards for comparing model accuracy.
For Bangla
I found this topical pre-print on classification benchmarks in under-resourced languages, focusing on the example of Bangla, by Md. Rezaul Karim, Bharathi Raja Chakravarthi, John P. McCrae, and Michael Cochez.
Train and test data for sentiment analysis and different classes of hate speech, apparently collected from social media, are included in this repo: github.com/rezacsedu/BengFastText — but I’m not sure if this is a final/official release.
In a sentiment analysis notebook, Bangla-Electra initially scored 68.9% accuracy, a slim advantage over mBERT (68.1%).
For both models, most of the incorrect predictions were negative examples falsely labeled positive. Reviewing the balance of the training data, I noticed that many lines are repeated 2–3 times, which makes it difficult to estimate how many unique positive and negative examples exist.
I resumed training overnight and tried a feature of SimpleTransformers that adds weights to each class in the loss function, but it didn't make a dent.
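For reference, here is a minimal sketch of that feature, assuming the model is published on HuggingFace under an id like monsoon-nlp/bangla-electra and that the sentiment splits are loaded into pandas DataFrames with text and labels columns (the file names are hypothetical):

```python
# Minimal sketch: per-class weights in SimpleTransformers' loss function.
# Model id and file names are assumptions, not the exact notebook code.
import pandas as pd
from simpletransformers.classification import ClassificationModel

train_df = pd.read_csv("sentiment_train.csv")  # columns: text, labels (0/1)
eval_df = pd.read_csv("sentiment_test.csv")

model = ClassificationModel(
    "electra",
    "monsoon-nlp/bangla-electra",   # assumed HuggingFace model id
    num_labels=2,
    weight=[1.5, 1.0],              # up-weight the class the model keeps missing
    args={"num_train_epochs": 3, "overwrite_output_dir": True},
)

model.train_model(train_df)
result, model_outputs, wrong_predictions = model.eval_model(eval_df)
print(result)
```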
Does changing weights help the model understand better or are you just tipping the scales in the other direction?
It's more like this: imagine drawing a squiggly line across a 2D coordinate plane, where everything left of the line is labeled Negative and everything right of it is labeled Positive. You'd think that most data points sit far to the left or right of the line, but in practice it is a full field of data points, intelligently placed by the pretrained model. Changing the class weights nudges the dividing line when it didn't fall in the right place on the field.
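To make the analogy concrete, here is a small toy example (not from the post, and using plain logistic regression rather than a transformer) showing how class weights shift the dividing line on an imbalanced 2D dataset:

```python
# Toy illustration: class weights nudge a decision boundary on imbalanced data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# 300 "positive" points vs. 60 "negative" points, in overlapping clusters
X_pos = rng.normal(loc=[1.0, 0.0], scale=1.0, size=(300, 2))
X_neg = rng.normal(loc=[-1.0, 0.0], scale=1.0, size=(60, 2))
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 300 + [0] * 60)

plain = LogisticRegression().fit(X, y)
weighted = LogisticRegression(class_weight={0: 5.0, 1: 1.0}).fit(X, y)

# Where the 50/50 boundary crosses the x-axis: up-weighting the minority
# (negative) class pushes the line back toward the majority side.
for name, clf in [("unweighted", plain), ("weighted", weighted)]:
    x_cross = -clf.intercept_[0] / clf.coef_[0][0]
    print(name, "boundary crosses the x-axis at", round(float(x_cross), 2))
```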
Eventually I tried this task published by Soham Chatterjee: on 6 news categories, Bangla-Electra gets 82.3% right vs. mBERT's 72.3%.
Why are these accuracy numbers still below 90%? Why have they not improved greatly over mBERT?
I received a comment on my Hindi-BERT post that I should be asking the OSCAR team for the raw, unshuffled sentences for my corpus. I plan to reprocess Hindi, Bangla, and Tamil to see if this improves accuracy.
For Tamil
Sudalai Rajkumar published datasets for Tamil news categories, movie reviews, and the Tirukkural, a foundational work of Tamil literature.
I decided to test all three benchmarks.
News: Random: 16.7%, mBERT: 53.0%, TaMillion: 69.6%
Movie reviews (regression, RMSE): mBERT: 0.657, TaMillion: 0.627
Tirukkural: Random: 33.3%, mBERT and TaMillion: 50.8%
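The movie-review task is a regression (predicting a score), which SimpleTransformers supports via a single output label and a regression flag. A minimal sketch, with an assumed HuggingFace model id and hypothetical file names:

```python
# Sketch: movie-review score prediction as regression, reporting RMSE.
# Model id and file names are assumptions, not the exact notebook code.
import numpy as np
import pandas as pd
from simpletransformers.classification import ClassificationModel

train_df = pd.read_csv("tamil_movies_train.csv")  # columns: text, labels (float rating)
eval_df = pd.read_csv("tamil_movies_test.csv")

model = ClassificationModel(
    "electra",
    "monsoon-nlp/tamillion",   # assumed HuggingFace model id
    num_labels=1,
    args={"regression": True, "num_train_epochs": 3, "overwrite_output_dir": True},
)
model.train_model(train_df)

preds, _ = model.predict(eval_df["text"].tolist())
rmse = float(np.sqrt(np.mean((eval_df["labels"].to_numpy() - np.array(preds)) ** 2)))
print("RMSE:", rmse)
```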
I continued training overnight from 100,000 to 190,000 steps, and accuracy improved only slightly on the news task (from 68.2% to 69.6%).
ELECTRA vs. BERT
All of my models follow the same ELECTRA training process. When I serialize them into PyTorch and h5 model formats, I use the HuggingFace script which converts them into a BERT model. For dv-wave, I noticed that telling SimpleTransformers to use BERT settings outperformed the ELECTRA settings by multiple percentage points. The same held true for these two models.
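In SimpleTransformers terms, that is just the model_type string passed when loading the converted checkpoint; whether a given checkpoint loads cleanly under both types depends on how it was exported. A sketch with a hypothetical local path and dataset, comparing the two settings on the same task:

```python
# Sketch: fine-tune/evaluate the same converted checkpoint under
# SimpleTransformers' "electra" settings vs. its "bert" settings.
# The checkpoint path and file names are hypothetical.
import pandas as pd
from sklearn.metrics import accuracy_score
from simpletransformers.classification import ClassificationModel

train_df = pd.read_csv("news_train.csv")  # columns: text, labels
eval_df = pd.read_csv("news_test.csv")

for model_type in ["electra", "bert"]:
    model = ClassificationModel(
        model_type,
        "./converted_model",   # hypothetical path to the converted checkpoint
        num_labels=6,
        args={"num_train_epochs": 3,
              "overwrite_output_dir": True,
              "output_dir": f"outputs_{model_type}"},
    )
    model.train_model(train_df)
    result, _, _ = model.eval_model(eval_df, acc=accuracy_score)
    print(model_type, result["acc"])
```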
Updates?
This article was written in July 2020. For latest recommended models, I will keep this readme up to date: https://github.com/mapmeld/use-this-now/blob/main/README.md#south-asian-language-model-projects