Using language models in Q&A

Nick Doiron
Jul 25, 2020

I recently posted new language models in Bangla/Bengali, Tamil, and Maldivian Dhivehi, and updated a previous Hindi language model. One of the most common questions I’ve received is whether these can be used for question-answering tasks. The same request comes up for other community- and researcher-built models such as AraBERT.

SimpleTransformers, which I’ve been using on classification tasks, also has sample code for Q&A tasks. The challenge for languages other than English is finding suitable training data. Luckily Facebook’s MLQA includes Hindi samples and Google’s TyDiQA includes Bangla.
If you find datasets for Tamil or Dhivehi, please leave a comment!

Bangla-Electra on TyDi

I trim the full TyDiQA datasets to one language:

grep '"language":"bengali"' tydiqa-v1.0-dev.jsonl > bn.jsonl
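The same filter can be written in Python, which is more forgiving if the JSON has extra whitespace around the colon. This is a minimal sketch, assuming each line is a JSON object with a top-level "language" field (as the grep pattern implies); the function names are made up for illustration.

```python
import json
from typing import Iterable, Iterator


def filter_language(lines: Iterable[str], language: str = "bengali") -> Iterator[str]:
    """Yield only the JSONL lines whose "language" field matches."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if record.get("language") == language:
            yield line


def filter_file(src: str = "tydiqa-v1.0-dev.jsonl", dst: str = "bn.jsonl") -> None:
    """Stream the source file line by line and write the matching lines to dst."""
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        for line in filter_language(fin):
            fout.write(line + "\n")
```

Calling `filter_file()` with the default arguments should produce the same bn.jsonl as the shell command.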

I then use the code in this notebook to convert the files into the SQuAD JSON format preferred by SimpleTransformers. The start and end byte offsets from TyDi are tricky to work with, because the Q&A training data needs the answer’s start character index, not its start byte. In UTF-8, each Bangla character takes multiple bytes, so the two values differ.
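One way to do the byte-to-character conversion is sketched below. This is not the notebook's code, just a minimal illustration that assumes the offsets are UTF-8 byte positions into the same plain-text context; the helper names and the arguments of `to_squad_entry` are hypothetical. The output dictionary follows the context/qas layout that SimpleTransformers reads for question answering.

```python
def byte_to_char_offset(text: str, byte_offset: int) -> int:
    """Convert a UTF-8 byte offset into a character (code point) index."""
    prefix = text.encode("utf-8")[:byte_offset]
    # errors="ignore" drops a trailing partial character if the offset
    # lands in the middle of a multi-byte sequence.
    return len(prefix.decode("utf-8", errors="ignore"))


def to_squad_entry(example_id, question, context, start_byte, end_byte):
    """Build one SQuAD-style entry from a context and a byte-offset answer span."""
    start = byte_to_char_offset(context, start_byte)
    end = byte_to_char_offset(context, end_byte)
    return {
        "context": context,
        "qas": [{
            "id": str(example_id),
            "question": question,
            "is_impossible": False,
            "answers": [{"text": context[start:end], "answer_start": start}],
        }],
    }
```

Encoding the whole context for every answer is wasteful on article-length text; for a full dataset it would be worth encoding each context once and reusing the bytes.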

The context for TyDi is very long, usually a whole Wikipedia article, which makes the task more challenging. The multiple answer excerpts are also long and a little confusing. For a competitive or future project, we’d likely want to use Longformer from the Allen Institute for AI. They provide a notebook for converting an existing pretrained model into a long-document one.

Hindi-BERT on MLQA

The dataset comes essentially in the SQuAD JSON format that SimpleTransformers expects. Questions on these shorter paragraphs were answered well.
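As a rough sketch of what loading and training can look like: if the file uses the nested SQuAD layout (data, then paragraphs, then qas), it can be flattened into the list of context/qas dictionaries that SimpleTransformers takes. The file path, `model_type`, and `model_name` below are placeholders to fill in, and `use_cuda=False` is only there so the sketch runs without a GPU.

```python
import json


def flatten_squad(squad: dict) -> list:
    """Flatten nested SQuAD JSON into a list of {"context", "qas"} dicts."""
    examples = []
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            qas = []
            for qa in paragraph["qas"]:
                qas.append({
                    "id": qa["id"],
                    "question": qa["question"],
                    # Default to answerable when the dataset omits this field.
                    "is_impossible": qa.get("is_impossible", False),
                    "answers": qa["answers"],
                })
            examples.append({"context": paragraph["context"], "qas": qas})
    return examples


def train(train_path: str, model_type: str, model_name: str):
    """Train a Q&A model; train_path, model_type, and model_name are placeholders."""
    # Imported here so flatten_squad works without the library installed.
    from simpletransformers.question_answering import QuestionAnsweringModel

    with open(train_path, encoding="utf-8") as f:
        train_data = flatten_squad(json.load(f))
    model = QuestionAnsweringModel(model_type, model_name, use_cuda=False)
    model.train_model(train_data)
    return model
```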

This multilingual Q&A dataset includes Hindi-Hindi data (straightforward) and multiple combinations such as Hindi-English, English-Hindi, Chinese-Hindi, Hindi-Chinese, etc. with embeddings aligned across languages. I didn’t train on a large parallel corpus, so I didn’t expect meaningful results on cross-language questions. Even English, which appeared in part of the Hindi training data, did poorly.

For both languages, check out the Colab notebooks!
https://colab.research.google.com/drive/1i6fidh2tItf_-IDkljMuaIGmEU6HT2Ar

Updates?

This article was written in July 2020. For the latest recommended models, I will keep this readme up to date: https://github.com/mapmeld/use-this-now/blob/main/README.md#south-asian-language-model-projects
