Interpretable Q&A AI
AllenNLP’s Interpret is designed not just to understand and model human language, but to expose how a language model decides its output. One of their examples, Reading Comprehension, receives a document and a question about its content, and uses AllenNLP’s ELMo-BiDAF model to return an answer and its position in the source text. This led me to think of a few improvements:
- First, setting up the model and evaluating its performance on another company’s English-language Q&A dataset, in this case Google’s Natural Questions (website and Kaggle challenge).
- Replacing the ELMo model with other pipelines for question-answering (such as Transformers).
Testing AllenNLP on Google’s Natural Questions
It’s really easy to get started with AllenNLP’s Q&A pipeline. I wrote a quick script (sketched below) to compare its answers to the gold answers in the training dataset.
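Here’s a minimal sketch of that script, assuming the ELMo-BiDAF archive URL from the AllenNLP demo and the `best_span_str` output key; both are assumptions that may have changed since, so check the current docs:

```python
from allennlp.predictors.predictor import Predictor

# Assumed archive URL for the ELMo-BiDAF reading-comprehension model;
# substitute the current location from the AllenNLP demo/docs if it has moved.
MODEL_URL = "https://allennlp.s3.amazonaws.com/models/bidaf-elmo-model-2018.11.30-charpad.tar.gz"

predictor = Predictor.from_path(MODEL_URL)

def answer(question: str, passage: str) -> str:
    """Return the model's best answer span for a question about the passage."""
    result = predictor.predict(question=question, passage=passage)
    return result["best_span_str"]

# Illustrative comparison against one Natural Questions-style example.
print(answer(
    "when did they finish building the sydney opera house",
    "... The Sydney Opera House was formally completed in 1973 ...",
))
```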
Across several Q&A pairs, AllenNLP appeared to understand whether the answer should be a name, a number, or a range of dates, but it usually picked the wrong one. There were two issues: Google’s Natural Questions includes HTML tags, and its source documents are much longer than the text the original model was trained on.
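Stripping the tags is straightforward. Below is a rough sketch assuming the simplified Natural Questions JSONL release, where the page source and question live in `document_text` and `question_text` fields (adjust the field names if your copy differs):

```python
import json
import re

TAG_RE = re.compile(r"<[^>]+>")

def strip_html(text: str) -> str:
    """Drop HTML tags and collapse the leftover whitespace."""
    return re.sub(r"\s+", " ", TAG_RE.sub(" ", text)).strip()

# Field names assume the "simplified" Natural Questions JSONL format.
with open("simplified-nq-train.jsonl") as f:
    for line in f:
        example = json.loads(line)
        question = example["question_text"]
        passage = strip_html(example["document_text"])
        # ... pass (question, passage) to the predictor as in the script above ...
```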
I tried rerunning the script without the HTML, and was impressed by some answers:
- Q: what are the minds two tracks and what is dual processing
  A: an implicit ( automatic ) , unconscious process and an explicit ( controlled ) , conscious process
- Q: how does bill of rights apply to states
  A: procedurally and substantively
- Q: what episode of how i met your mother is the slap bet
  A: 9th (note that it’s already reading this episode’s wiki article, and just needed to read a sentence about it being the n-th episode)
- Q: when did they finish building the sydney opera house
  A: 1973
Fine-tuning a model on Google’s Natural Questions
There were still plenty of misread answers, so the right approach would be to get a pre-trained Q&A model and fine-tune it on the new dataset. It appears that AllenNLP has a separate allennlp-reading-comprehension repo for doing this, including an addition this month of a Transformers-based model.
That said, I couldn’t figure out how to work with this system. Starting with the same Q&A model produced errors, as did the most recent models posted on https://storage.googleapis.com/allennlp-public-models/. Every model raises a different error on scripts/transformer_qa_eval.py.
I tried a different approach to access TransformerQAPredictor, but that still requires initializing with a model and dataset reader, so I’m looking forward to this being documented.
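For reference, here is the pattern I expected to work, pieced together from AllenNLP’s generic archive/predictor API. The archive filename is a placeholder, and the registration import is my guess at the package name in allennlp-reading-comprehension:

```python
from allennlp.models.archival import load_archive
from allennlp.predictors.predictor import Predictor

# Assumption: this import registers the "transformer_qa" model, dataset reader,
# and predictor with AllenNLP; the package name may differ in the repo.
import allennlp_rc  # noqa: F401

# Placeholder archive -- the public model.tar.gz files I tried raised errors,
# so point this at whichever archive works for you.
archive = load_archive("transformer-qa-model.tar.gz")
predictor = Predictor.from_archive(archive, "transformer_qa")

result = predictor.predict(
    question="when did they finish building the sydney opera house",
    passage="The Sydney Opera House was formally completed in 1973.",
)
print(result)  # the predicted answer span should appear in this output dict
```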
Future Ideas
- Creating a machine learning system to predict which Wikipedia article I should fetch based on a question (so a user can submit just a question, and not the source document).
- For languages other than English, there are multilingual Q&A datasets and challenges.