New in NLP: super-cool projects and articles
Last weekend, I put the finishing touches on my ideal text classification server. Since then, I’ve been reading about three new projects and one article which are inspiring me to jump back into NLP projects:
Gobbli, from RTI International
This project appeared on my radar, and I quickly decided it should be my next step in moving from the older machine learning algorithms in scikit-learn to deep learning. It gives you one common interface for the most popular models, and also includes dummy modules and Docker containers so you can develop your application and set up your DevOps without delay.
The way that Gobbli works under the hood is interesting. If you train and use a model with FastText, it uses their vectorizer and also their black-box classifier. There isn’t a common point in the workflow where you could get embeddings / vectors from one platform and try training them on TensorFlow or your own neural net.
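To make that missing "common point" concrete, here is a minimal sketch of the kind of seam I have in mind. This is not Gobbli's API: `Embedder`, `hashed_bow_embedder`, and `CentroidClassifier` are hypothetical names, and the hashed bag-of-words function is only a toy stand-in for a real embedding model. The point is that the classifier only ever sees vectors, so the embedder could be swapped for one from any platform.

```python
import zlib
from typing import Callable, Dict, List, Sequence

Vector = List[float]
# Assumption: an "embedder" is any function that maps text to a fixed-size vector.
Embedder = Callable[[str], Vector]


def hashed_bow_embedder(text: str, dim: int = 16) -> Vector:
    """Toy stand-in for a real embedding model: a hashed bag of words."""
    vec = [0.0] * dim
    for word in text.lower().split():
        # crc32 is stable across runs, unlike Python's built-in hash() for strings.
        vec[zlib.crc32(word.encode("utf-8")) % dim] += 1.0
    return vec


class CentroidClassifier:
    """Nearest-centroid classifier that works with whatever embedder it is given."""

    def __init__(self, embed: Embedder) -> None:
        self.embed = embed
        self.centroids: Dict[str, Vector] = {}

    def fit(self, texts: Sequence[str], labels: Sequence[str]) -> "CentroidClassifier":
        grouped: Dict[str, List[Vector]] = {}
        for text, label in zip(texts, labels):
            grouped.setdefault(label, []).append(self.embed(text))
        # Each label's centroid is the element-wise mean of its training vectors.
        for label, vecs in grouped.items():
            self.centroids[label] = [sum(col) / len(vecs) for col in zip(*vecs)]
        return self

    def predict(self, text: str) -> str:
        vec = self.embed(text)

        def squared_distance(centroid: Vector) -> float:
            return sum((a - b) ** 2 for a, b in zip(vec, centroid))

        return min(self.centroids, key=lambda label: squared_distance(self.centroids[label]))


# Usage: train with the toy embedder; any other text-to-vector function fits the same slot.
clf = CentroidClassifier(hashed_bow_embedder).fit(
    ["what a great film", "what an awful film"], ["pos", "neg"]
)
print(clf.predict("a great film"))
```

With a seam like this, trying FastText vectors with your own neural net would mean passing a different function to the constructor, rather than switching to a whole different model wrapper.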
CEFR Checker, from Duolingo
NLP research frequently explores training on Wikipedia articles in multiple languages. Programs to simplify text have compared articles on the English Wikipedia to their Simple English counterparts. Typically this is done with the seq2seq technique, but more recent papers use transformers, the same architecture behind OpenAI’s GPT-2, Google’s BERT, and XLNet (from Carnegie Mellon and Google).
Duolingo’s tech team made it possible to release language content at different reading levels with this new CEFR Checker. They used movie subtitle databases (in part for directly matching sentences, in part for conversational content?) as a source to make word vectors align and work the same way in different languages. In other words, they can process and simplify any language that has enough adequately translated subtitles.
As anyone who’s learned the basics of a language knows, picking up simpler words is not a silver bullet. I’ve sat through Spanish-language meetings knowing what the topic was, but not what the intended message was. Similarly, their post doesn’t discuss any examples of making the text level more complex, where it might invent content or meaning to fill in the gaps.
Another downside is that there is no source code shared here :(
Interpret, from AllenNLP
The Allen Institute for AI has been building up their AllenNLP into an absolute giant Godzilla of NLP. Their new release, Interpret, is an Explainable AI tool designed for language models. Like the ELI5 library, it can remove several words and test permutations against black-box classifiers. But it can also show you masked language modeling (what word it thinks is missing), explain answers to questions, and look for flaws in your classifiers.
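The "remove words and re-test" idea is easy to sketch. What follows is a minimal leave-one-out version, assuming a black-box `predict` function that returns a score for a piece of text. It is not how ELI5 or AllenNLP Interpret implement their explanations (they use more sophisticated sampling and gradient-based methods), and `toy_sentiment_score` is a hypothetical stand-in classifier I made up for the example.

```python
from typing import Callable, List, Tuple


def toy_sentiment_score(text: str) -> float:
    """Hypothetical black-box classifier: returns a positivity score in [0, 1]."""
    positive = {"great", "love", "wonderful"}
    negative = {"boring", "hate", "awful"}
    words = text.lower().split()
    raw = sum((w in positive) - (w in negative) for w in words)
    # Clamp the shifted count into [0, 1].
    return max(0.0, min(1.0, 0.5 + 0.25 * raw))


def leave_one_out_importance(
    text: str, predict: Callable[[str], float]
) -> List[Tuple[str, float]]:
    """Score each word by how much the prediction drops when that word is removed."""
    words = text.split()
    baseline = predict(text)
    importances = []
    for i, word in enumerate(words):
        reduced = " ".join(words[:i] + words[i + 1:])
        importances.append((word, baseline - predict(reduced)))
    return importances


for word, delta in leave_one_out_importance(
    "I love this wonderful but boring movie", toy_sentiment_score
):
    print(f"{word:>10}  {delta:+.2f}")
```

Words with a positive score pushed the prediction up, words with a negative score pushed it down, and a score of zero means the classifier did not react when the word was removed.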
The Bender Rule
This article captured the re-surfacing of an NLP research rule which they’re calling the Bender Rule: “Do state the name of the language that is being studied, even if it’s English.”
The article covers the academic discussion and Twitter convos better than I can, but essentially: the world is big, yet our language modeling is heavily biased towards understanding English and Chinese.
Look at the projects above (excepting Duolingo’s multilingual example). Gobbli saves you time by fetching the word embeddings / vector data for you from different repos, but because this is hardcoded, English is easy and no guidance is given for alternatives. AllenNLP Interpret has a landing page and a pre-print, neither of which mentions English.
Now, it is totally possible to use them for other languages. For Gobbli, I would point it to the 100+ languages supported by FastText, and for AllenNLP I can use other embeddings or their own ELMo model (which has several unofficial models for other languages; 谢谢 to the Harbin Institute of Technology here).
I am setting goals to do a deeper dive into these frameworks for my own projects, and make some kind of change (no matter how small!) in their docs to spell out how they work with non-English text.