A survey of papers that cited / footnoted me in NLP

Nick Doiron
Apr 28, 2021

In the first four months of 2021, a few researchers have used my language models or mentioned me in their papers. It's encouraging and validating to be part of someone else's work, and sharing the models on HuggingFace made it possible for them to continue that work and extend it into new areas.
I'm confident that the next year of papers from researchers working in their native languages will greatly surpass my 2020 work.

The papers

Bangla Documents Classification using Transformer Based Deep Learning Models

M. M. Rahman, M. Aktaruzzaman Pramanik, R. Sadik, M. Roy and P. Chakraborty

Compared Multilingual BERT and my Bangla-Electra model on three news topic classification datasets: BARD, OSBC, and ProthomAlo.

It has been inferred from our experiment that the ELECTRA model gained higher accuracy and f1 score while classifying different domains of Bangla documents of different data sources

Hostility Detection in Hindi leveraging Pre-Trained Language Models

Ojasv Kamal, Adarsh Kumar, Tejas Vaidhya

Compared Hindi-BERT, Indic-BERT, and HindiBERTa (a RoBERTa-based model) across four approaches (multi-label classification, multitask learning, binary classification, and an auxiliary model). This was part of the CONSTRAINT-2021 competition on labeling hostile speech with multiple tags.

Hindi-BERT was one of my earliest ELECTRA projects, and the team ended up not using its results for some tasks. It had comparable scores on a few, but the other models served them better overall.

we dropped the Hindi BERT model in this approach as the performance was poor compared to the other two models because of shallow architecture

Improving Word Alignment with Contextualized Embedding and Bilingual Dictionary

Minhan Xu, Yu Hong

It was difficult for me to find a copy of this article due to paywalls. The paper covers several languages and uses Hindi-BERT as the Hindi language model.

Multilingual Hope Speech Detection for Code-mixed and Transliterated Texts

Dhivya Chinnappa

The paper is about positive-thinking ("hope speech") language in English, Tamil, and Malayalam. I'm unclear whether my TaMillion model is compared against mBERT or whether the two models' results are combined into a weighted model.

Tamil Lyrics Corpus: Analysis and Experiments

Dhivya Chinnappa, Praveenraj Dhandapani

This paper may be my favorite for its creative application of NLP. The project is a binary classifier that distinguishes between the work of two lyricists, and the authors suggest future work on emotion classification. I learned about "Rettai Kilavi" (described on Wikipedia as Tamil onomatopoeia) and trends in Tamil songs.

TaMillion was compared to mBERT and Indic-BERT.

In case of the results for Kannadasan, the highest performance is obtained both for precision and recall by fine tuning the Tamillion BERT model (P: .68, R: .61). We note that there is still room for improvement in this binary lyricist identification task.
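For readers curious what this kind of fine-tuning involves, here is a minimal sketch using the HuggingFace Trainer. The lyrics.csv file and its "text" / "label" columns are my own placeholders for illustration, not the paper's actual setup:

# Minimal sketch: fine-tuning TaMillion as a binary lyricist classifier.
# lyrics.csv and its "text"/"label" columns are hypothetical placeholders.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_id = "monsoon-nlp/tamillion"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# 80/20 split of the labeled lyrics
dataset = load_dataset("csv", data_files="lyrics.csv")["train"].train_test_split(test_size=0.2)
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True,
                            padding="max_length", max_length=256),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="lyricist-model", num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)
trainer.train()

The paper's actual training setup may differ; this just shows the general shape of replacing a TF-IDF pipeline with fine-tuning a pretrained model.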

Acknowledgements

Notes and conversations led to authors acknowledging me in Stochastic Parrots, AraGPT2 (a GPT-2 model trained from scratch on segmented Arabic), and MuRIL (Google's Indian-languages model). This was very kind, as my contributions were quite small (sometimes a Tweet or a few DMs), and their work has been amazing.

Your language models are going places! Are they SOTA?

As you can tell, not really? Getting 68–70% accuracy on a binary classification task beats random chance, but I'm sure a new pretrained language model with more data, layers, etc. will help these researchers do better in the near future.
When someone contacts me ahead of time, I tell them to try MuRIL, but often their model training is already done. Google announced MuRIL in December 2020 and posted the paper in March, so we'll see it taking over in research sometime soon.
There are also models from local teams (Bangla-BERT, neuralspace-reverie, and Indic-BERT) which sometimes perform better than MuRIL.
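Trying MuRIL is mostly a model-ID swap in existing HuggingFace code. A minimal sketch, with the hub ID as I've seen it (check the model card to confirm):

# Minimal sketch: pointing existing HuggingFace code at MuRIL instead.
# The hub ID below is my best recollection; verify it on the model card.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/muril-base-cased")
model = AutoModel.from_pretrained("google/muril-base-cased")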

Local knowledge and decolonizing NLP

When I was learning pre-trained transformer models, it was clear that Google, Facebook, Microsoft, and OpenAI had made English NLP into a kaiju battle between mega-labs. My laptop, Colab notebooks, and side-project time let me travel back through the places where I'd met friends and spoken at conferences. Where there was little documentation of transformer-based language models, I made my own projects.
For the most part, I've received questions, feedback, and code mentions from researchers in India, Bangladesh, and the Maldives. I hope this circle expands further, e.g. if the Spanish-language project can help de-bias models.

As an American with little stake in academia, should I be maintaining the go-to language model for these languages? No. How does this change?

  • Publishing an initial model and notebook helps. Multilingual BERT covers 100 languages, but we've found that it doesn't cover them as well as models from university teams with a good set of monolingual data. My ELECTRA models were a good starting point for someone entering a competition on Hindi hate speech or writing a paper about Tamil song lyrics, without training from scratch or using outdated word2vec / TF-IDF methods.
  • Look for and advocate for new models. When a researcher published Bangla-BERT, he asked Twitter for tasks to compare results on. I sent him my classification task notebooks and showed that his model was outperforming both my ELECTRA model and mBERT.
    When spaCy started including my models in their quick-start docs, I filed an issue to switch to MuRIL and other models. I still want to recommend the Thai-originated WangchanBERTa for Thai, but it requires a pre-tokenization segmentation step (see the sketch after this list). I've talked with the HuggingFace team about this for Thai and Arabic, and hopefully it will be incorporated in their new working group.
  • When we have a larger, global conversation about decolonization, it’s important to ask if language modeling is acceptable. When I attended the Puliima conference in Australia, there was no universal rule about whether an indigenous language is taught to members only, in certain times and places, or to the online public. When your language has been actively suppressed under the banner of law or assimilation, it’s reasonable to halt the tech machinery and ask questions. For example, won’t machine translation benefit a few companies instead of employing indigenous people? Will this simulated voice or outside speaker have any reverence or cultural knowledge? Will it introduce bias?
    South Asian languages are seen as more widespread and less exclusive, so there isn't a comparable conversation. Similarly, I've had a good reception working on Arabic support for HuggingFace. But it's still important not to pose as an expert or authority on the language. It's not my role to keep releasing competing language models, or to send a paper into a conference without putting native/local speakers first (both figuratively and in author order).
  • Keep crediting researchers and developers, answering questions as best I can, calling out less sensitive research, reading and sharing discourse from multiple sources, and promoting new open, collaborative programs.
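On the Thai segmentation step mentioned above: here is a minimal sketch of what I mean, assuming pythainlp for word segmentation and the WangchanBERTa hub ID as I remember it (verify the exact preprocessing on the model card):

# Minimal sketch of the pre-tokenization segmentation step for Thai.
# Thai is written without spaces between words, so segment with pythainlp
# first, then pass space-joined words to the subword tokenizer.
# The hub ID below is an assumption; check the model card to confirm.
from pythainlp.tokenize import word_tokenize
from transformers import AutoTokenizer

text = "ภาษาไทยเขียนติดกันโดยไม่เว้นวรรค"  # "Thai is written without spaces"
words = word_tokenize(text, engine="newmm")  # dictionary-based segmenter
tokenizer = AutoTokenizer.from_pretrained("airesearch/wangchanberta-base-att-spm-uncased")
inputs = tokenizer(" ".join(words))

As I understand it, the point is that the subword tokenizer was trained on space-joined segmented text, so skipping the segmentation step produces different, worse tokenization.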
