2021 research mentions wrap-up
Here are the remaining papers published in 2021 which used and/or footnoted my work. This is a follow-up to my April post.
I didn’t make many new models this year, and there are better options out there: MuRIL and several monolingual models, a Hindi/Tamil QA Kaggle competition with solutions worth browsing, and a fantastic post from that competition’s 1st-place winner.
Why do the old models continue to appear in research? There are delays between applying new code and publishing the research, difficulty working with large models, and researchers want multiple models for comparison. For Thai, there’s an assumption that you can drop a model into a multilingual pipeline without thinking about tokenization (a quick sketch of that pitfall is below). Everyone is doing great work, considering how much is still unknown.
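A minimal sketch of the Thai tokenization issue, assuming the transformers library and the public bert-base-multilingual-cased checkpoint; the example sentence is my own illustration, not from any of the papers below. Thai is written without spaces, so a tokenizer that first splits on whitespace sees an entire phrase as one “word” before breaking it into subword pieces.

```python
# Sketch only: shows how a multilingual tokenizer handles unsegmented Thai.
# Model name and sentence are illustrative assumptions, not from the papers.
from transformers import AutoTokenizer

mbert = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

# "Thai has no spaces between words"
text = "ภาษาไทยไม่มีการเว้นวรรคระหว่างคำ"
print(mbert.tokenize(text))
# The whole phrase gets chopped into subword pieces with no notion of word
# boundaries; Thai-specific models typically expect word segmentation
# (e.g. with a tool like pythainlp) before subword tokenization.
```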
I’ve considered adding deprecation notices or recommendations to my readmes and blogs, and have started with a new landing page: Use This Now.
Learning Domain-Specialised Representations for Cross-Lingual Biomedical Entity Linking
Fangyu Liu, Ivan Vulic, Anna Korhonen, Nigel Collier
Researchers from Language Technology Lab, University of Cambridge.
The bert-base-thai model that I uploaded to HuggingFace was included alongside other language models.
Hostility Detection and Covid-19 Fake News Detection in Social Media
Ayush Gupta, Rohan Sukumaran, Kevin John, Sundeep Teki
PathCheck Foundation, Cambridge and Indian Institute of Information Technology, Sri City
Similar to the work in https://arxiv.org/abs/2101.05494, but this paper did not directly cite the model link. Found through a search for Hindi-BERT.
Extracting Latent Information from Datasets in CONSTRAINT 2021 Shared Task
Discusses the Hindi hate speech task where some teams used my model.
A Review of Bangla Natural Language Processing Tasks and the Utility of Transformer Models
Firoj Alam, Arid Hasan, Tanvirul Alam, Akib Khan, Janntatul Tajrin, Naira Khan, Shammur Absar Chowdhury
From Qatar Computing Research Institute, Cognitive Insight, BJIT, and Dhaka University
ELECTRA is now the worst option out there :(
BanglaBERT: Combating Embedding Barrier in Multilingual Models for Low-Resource Language Understanding
Abhik Bhattacharjee, Tahmid Hasan, Kazi Samin, Md Saiful Islam, M. Sohel Rahman, Anindya Iqbal, Rifat Shahriyar
From Bangladesh University of Engineering and Technology (BUET) and University of Rochester
Cross-Lingual Text Classification of Transliterated Hindi and Malayalam
Jitin Krishnan, Antonios Anastasopoulos, Hemant Purohit, Huzefa Rangwala
From George Mason University
Compares my Hindi-TPU-Electra monolingual model on mixed Devanagari / romanized Hindi. Their versions of mBERT and XLM-R perform best.
HuggingFace Datasets
This was a mass collaboration with Hugging Face. I got an acknowledgement for uploading notes on some datasets. To be honest, I’m frustrated that common datasets such as XNLI are missing their dataset card information. The fields are too extensive and too often left as ‘More Information Needed’.
Technical Domain Classification of Bangla Text using BERT
Koyel Ghosh, Dr. Apurbalal Senapati
From Central Institute of Technology, Assam
https://books.aijr.org/index.php/press/catalog/download/115/42/1366-1?inline=1
Odds & Ends
classification-of-hindi-news
Not a research paper, but a complex project by students at St. Francis Institute of Technology in Mumbai.
Final / Thesis projects
Mitigating Language-Dependent Ethnic Bias in BERT
The paper from the Korea Advanced Institute of Science and Technology doesn’t include results for Thai, but the GitHub repo does reference my Thai model upload in its configuration.py.
Assessing the Compatibility of Cryptocurrencies and Islamic Law
This is from 2020, but I just noticed it this year. The law journal article cites my post on stablecoins and Islamic finance rules.