2021 research mentions wrap-up

Nick Doiron
4 min read · Dec 28, 2021


Here are the remaining papers published in 2021 which used and/or footnoted my work. This is a follow-up to my April post.

I didn’t release many new models this year, and there are now better options out there: MuRIL and several monolingual models, a Hindi/Tamil QA Kaggle competition with solutions worth browsing, and a fantastic post from that competition’s 1st-place winner.
Why do the older models continue to appear in research? There are delays between applying new code and publishing the results, difficulty working with large models, and a need for multiple models to compare against. For Thai, there is an assumption that you can drop a model into a multilingual pipeline without thinking about tokenization. Everyone is doing great work, considering how much is still unknown.
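The Thai tokenization pitfall is easy to demonstrate: Thai script is written without spaces between words, so any pipeline that assumes whitespace pre-tokenization sees a whole sentence as a single token. A minimal sketch (the example sentence is my own, not taken from any of the papers):

```python
# Thai script does not use spaces between words, so a pipeline that
# assumes whitespace pre-tokenization treats a whole sentence as one token.
thai_sentence = "ผมกินข้าว"      # three words ("I eat rice"), no spaces
english_sentence = "I eat rice"  # three words, separated by spaces

print(thai_sentence.split())     # one "token" for the whole sentence
print(english_sentence.split())  # three tokens
```

This is why Thai models need a script-aware tokenizer rather than whatever defaults a multilingual pipeline happens to ship with.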
I’ve considered adding deprecation notices or recommendations to my READMEs and blog posts, starting with a new landing page: Use This Now.

Learning Domain-Specialised Representations for Cross-Lingual Biomedical Entity Linking

Fangyu Liu, Ivan Vulic, Anna Korhonen, Nigel Collier
From the Language Technology Lab, University of Cambridge

The bert-base-thai model that I uploaded to Hugging Face was included alongside other language models.

Hostility Detection and Covid-19 Fake News Detection in Social Media

Ayush Gupta, Rohan Sukumaran, Kevin John, Sundeep Teki
PathCheck Foundation, Cambridge and Indian Institute of Information Technology, Sri City

Similar to the work in https://arxiv.org/abs/2101.05494, but it did not directly cite the model link; I found it with a search for Hindi-BERT.

Extracting Latent Information from Datasets in CONSTRAINT 2021 Shared Task

Discusses the Hindi hate speech task where some teams used my model.

A Review of Bangla Natural Language Processing Tasks and the Utility of Transformer Models

Firoj Alam, Arid Hasan, Tanvirul Alam, Akib Khan, Janntatul Tajrin, Naira Khan, Shammur Absar Chowdhury
From Qatar Computing Research Institute, Cognitive Insight, BJIT, and Dhaka University

ELECTRA is now the worst option out there :(

BanglaBERT: Combating Embedding Barrier in Multilingual Models for Low-Resource Language Understanding

Abhik Bhattacharjee, Tahmid Hasan, Kazi Samin, Md Saiful Islam, M. Sohel Rahman, Anindya Iqbal, Rifat Shahriyar
From Bangladesh University of Engineering and Technology (BUET) and University of Rochester

Cross-Lingual Text Classification of Transliterated Hindi and Malayalam

Jitin Krishnan, Antonios Anastasopoulos, Hemant Purohit, Huzefa Rangwala
From George Mason University

Compares the Hindi-TPU-Electra monolingual model on mixed Devanagari / Latinized Hindi. Their versions of mBERT and XLM-R perform best.

HuggingFace Datasets

This was a mass collaboration with Hugging Face. I got an acknowledgement for uploading notes on some datasets. To be honest, I’m frustrated that common datasets such as XNLI are missing their dataset card information. The fields are too extensive and too often left as ‘More Information Needed’.

Technical Domain Classification of Bangla Text using BERT

Koyel Ghosh, Dr. Apurbalal Senapati
From Central Institute of Technology, Assam


Odds & Ends


Not a research paper, but a complex project by students at St. Francis Institute of Technology in Mumbai.

Final / Thesis projects

Mitigating Language-Dependent Ethnic Bias in BERT

The paper from Korea Advanced Institute of Science and Technology doesn’t include results for the Thai language, but the GitHub repo does reference a Thai model upload in its configuration.py.

Assessing the Compatibility of Cryptocurrencies and Islamic Law

This is from 2020, but I just noticed it this year. The law journal article cites my post on stablecoins and Islamic finance rules.



Nick Doiron, Web->ML developer and mapmaker.