Making a Dhivehi language model
dv-wave for the Maldives
I’ve published my second language model, dv-wave, on HuggingFace:
🇲🇻 🇲🇻 🇲🇻
Back in March, Google released ELECTRA and I used it to train a Hindi-BERT model. In April, Suraj Parmar published a Sanskrit model (now expanded to Hindi, Gujarati, and Sanskrit). I reached out to him on Twitter, and he encouraged me and helped me convert my existing model and vocabulary to be compatible with the HuggingFace/Transformers library.
My plan was to immediately move on to more languages, but there were three discouraging factors:
- The difficulty of training my first model, then retraining it with larger vocabularies, larger models, TPUs, etc.
- For Hindi, I combined a 9GB deduped corpus from OSCAR and the latest Wikipedia articles (+1 GB). The corpus would be much smaller for Bengali (5.8GB) or Tamil (5.1GB).
- mBERT and XLM-R are trained on roughly 100 languages, and there are cross-language benefits to that training. Unless I pick another language, zero in on one task/domain, or find a larger corpus, I shouldn't expect my model to add much value over them.
Here’s how I ended up working on Dhivehi / ދިވެހި
- Dhivehi is spoken in the Maldives, and it is written with its own alphabet, Thaana. Writing is right-to-left, like Arabic, but vowel signs are placed above and below every letter for short, long, and absent vowels, as in Hindi and Southeast Asian languages. There are few Dhivehi speakers outside of the Maldives, so it’s not a focus of most language models.
- I visited the Maldives in 2017 on a mapping and translation project. Since then I've followed their tech industry: it's been growing, inviting diverse groups to startup events, starting Code Clubs and a podcast, and considering a digital alternative to their tourist economy.
- In May, local coder Sofwath was tweeting about BERT, I started a small thread, and he sent me a link to his Dhivehi datasets including a 307MB corpus and a news categorization task. This isn’t huge in the NLP world, but it’s a big step up from OSCAR’s 79MB.
- For this project, I returned to training small models in Colab with a GPU.
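The ELECTRA pretraining pipeline starts from a plain-text corpus and a WordPiece vocab file, so the vocabulary step comes first. Here is a minimal sketch of that step using HuggingFace's tokenizers library; the corpus filename and vocab size are placeholders rather than the exact values from my notebook, and lowercasing and accent-stripping are turned off so the Thaana vowel signs survive.

```python
# Sketch: build a WordPiece vocab for ELECTRA pretraining.
# "dv_corpus.txt" and the vocab size are placeholder values.
from tokenizers import BertWordPieceTokenizer

# Keep casing and combining marks: Thaana vowel signs are combining characters,
# and accent-stripping would delete them.
tokenizer = BertWordPieceTokenizer(lowercase=False, strip_accents=False)

tokenizer.train(files=["dv_corpus.txt"], vocab_size=30000, min_frequency=2)

# Writes vocab.txt in the current directory, ready for ELECTRA's data-prep step.
tokenizer.save_model(".")
```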
Links and caveats
- Training notebook (stopped at only 66,000 steps): colab.research.google.com/drive/1ZJ3tU9MwyWj6UtQ-8G7QJKTn-hG1uQ9v
- Finetuning notebook (8 news categories, 3 epochs; accuracy: random 12.5%, mBERT 51.7%, dv-wave 89.2%): colab.research.google.com/drive/1KnyQxRNWG_yVwms_x9MUAqFQVeMecTV7
- Tokenize with both strip_accents=False and do_lower_case=False; it looks like Transformers conflates the two settings and even removes the vowel signs unless I configure the tokenizer this way (see the sketch after this list).
- This corpus is small. I didn't add in OSCAR or dv.Wikipedia.org in case it doubled up information, but it would be worth trying for a version 2.
- This corpus is mostly news-focused. This could raise accuracy when finetuning on the news categories, and hurt accuracy on other tasks, such as social media moderation. This is a problem in the Maldives as (I’ve heard) Facebook doesn’t have enough data or staff experts to moderate Dhivehi posts.
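If you want to try the model, here is a minimal sketch of the tokenizer settings from the caveat above, plus an 8-label classification head matching the news categorization task. The model id below is a placeholder; substitute the actual dv-wave id on HuggingFace.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "your-username/dv-wave"  # placeholder: use the real HuggingFace model id

# Without these two flags, the tokenizer lowercases / strips "accents" and
# deletes the Thaana vowel signs (they are combining marks).
tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    strip_accents=False,
    do_lower_case=False,
)

# Quick check: the printed tokens should still carry their vowel signs.
print(tokenizer.tokenize("ދިވެހި"))

# For the 8-category news classification task:
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=8)
```

From there, finetuning follows the usual Transformers sequence classification recipe for a few epochs.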
Updates?
This article was written in July 2020. For latest recommended models, I will keep this readme up to date: https://github.com/mapmeld/use-this-now/blob/main/README.md#south-asian-language-model-projects