Your NLP code doesn’t speak AAVE
Following up on a paper criticism
I got into a short Twitter discussion last month with one of the authors of “It’s Morphin’ Time! Combating Linguistic Discrimination with Inflectional Perturbations” at Salesforce Research. Twitter isn’t the right forum for reviewing each other’s work, and I would have chosen a private channel for feedback on a student paper, but the blog post and promotion by Salesforce made this paper somewhat more public.
Morpheus is described as a tool to “combat linguistic discrimination”, “expose the linguistic biases” of existing models, thus “[ensuring] that NLP technologies are inclusive… for users with diverse linguistic backgrounds”.
The paper’s core technique is valid and novel: Morpheus creates new example sentences by breaking singular-plural and subject-verb agreement in random combinations. This replicates common errors in writing and speech, errors which are more frequent when someone is not a native speaker.
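As I understand it, the perturbation step works roughly like the sketch below. This is not the authors’ code: the real Morpheus uses a proper morphological inflector and searches for the inflection that most confuses the target model, whereas this toy version swaps forms at random from a hand-written table.

```python
import random

# Minimal sketch of inflectional perturbation, loosely after Morpheus.
# The INFLECTIONS table is hand-written for illustration only; the paper's
# implementation uses a real inflector and picks the variant that is
# adversarial for the target model, not a random one.
INFLECTIONS = {
    "was": ["were", "be", "is"],
    "were": ["was", "is"],
    "runs": ["run", "ran", "running"],
    "saw": ["seen", "sees", "seeing"],
    "folks": ["folk"],
}

def perturb(sentence: str, p: float = 0.5) -> str:
    """Randomly swap inflected forms to break number/tense agreement."""
    out = []
    for token in sentence.split():
        variants = INFLECTIONS.get(token.lower())
        if variants and random.random() < p:
            out.append(random.choice(variants))
        else:
            out.append(token)
    return " ".join(out)

print(perturb("The folks saw that the model was wrong"))
# e.g. "The folk seen that the model were wrong"
```

Note that nothing in this procedure encodes the systematic rules of any dialect; it only breaks agreement at random, which is exactly the gap I want to talk about.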
Things went wrong for me when the paper included African American Vernacular English (AAVE) in the study. These are the four mentions of AAVE in the paper and appendices:
Even among native speakers, a significant number speak a dialect like African American Vernacular English (AAVE) rather than Standard English
[Citation is: David Crystal. 2003. English as a Global Language. Cambridge University Press.]

…putting these models directly into production without addressing this inherent bias puts them at risk of committing linguistic discrimination by performing poorly for many speech communities (e.g., AAVE and L2 speakers).
Ensuring that NLP technologies are inclusive, in the sense of working for users with diverse linguistic backgrounds (e.g., speakers of World Englishes such as AAVE, as well as L2 speakers), is especially important […].
Appendix A: Examples of Inflectional Variation in English Dialects
African American Vernacular English
• They seen it.
• They run there yesterday.
• The folks was there
[Citation is: Walt Wolfram. 2004. The grammar of urban African American Vernacular English. Handbook of varieties of English, 2:111–32.]
The core of my criticism of the paper was summarized in my initial Tweet:
AAVE speakers use a consistent grammar which is neither described nor modeled in the paper.
If you’ve encountered academic articles about AAVE, you’ll know that this consistency of rules is a major concept in the study of AAVE, and in debunking the political and racist messaging of the 1990s “Ebonics scare” (which is still ongoing).
I don’t believe that the writers of the paper intended to perpetuate stereotypes or bring in American politics (the first author is based in Singapore). Unfortunately they’ve done little to describe AAVE in their paper. A casual reader would equate it with what their code does — inserting grammatical errors into Standard English.
Politics aside, on a technical level, the examples generated by Morpheus do not model how AAVE works. AAVE does sometimes diverge from Standard English grammar, but describing it that way is as simplistic as saying that British and Indian English sometimes add a ‘u’ to words. If we randomly added ‘u’s to text, our output would only rarely match natural language.
As an American, I usually notice British and Indian English by new words (‘lorry’, ‘crore’, ‘lakhs’), new meanings (‘lift’), higher frequency (‘thrice’), and even reversed meanings (‘take a class’). A compelling argument could be made that Morpheus doesn’t model these intricacies, or the potential sources of bias from Global English, either.
Paper’s acknowledgement of limitations vs. real limitations
The paper acknowledges the gap between its generated examples and real-world ones as a limitation:
MORPHEUS finds the distribution of examples that are adversarial for the target model, rather than that of real L2 speaker errors, which produced some unrealistic adversarial examples
But there are holes here: it mentions only L2 speakers (who were earlier differentiated from AAVE speakers), and it classifies the changes strictly as errors.
For the task presented (question answering), the consequence of BERT misreading an input is that the deployed model returns the wrong word or no answer. Stakes are low. For a model which is designed to discriminate (a spam or comment toxicity filter), the kind of “statistical bias” probed by these experiments could translate into significant racial bias.
Conclusions
I wonder if the paper was originally intended to discuss only common errors or L2 comprehension. AAVE is considered only parenthetically, and it deserves more sensitive and inclusive treatment in future research.
If I were analyzing bias toward AAVE in NLP, I would hope to partner with someone who is already studying racial bias in technology, and is familiar with AAVE via personal and/or academic experience.
I would propose to start by investigating whether major NLP models even recognize AAVE. If a model is trained on Wikipedia and Project Gutenberg, for example, it has seen very few examples. If it’s trained on social media or websites… then it’s an open question. AAVE speakers are often experts at code-switching, and websites (such as Salesforce.com or Twitter) may be coded as more or less white depending on context.
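One rough way to begin that investigation might be comparing a pretrained language model’s perplexity on AAVE sentences against Standard English paraphrases. Below is a sketch of that idea, assuming the Hugging Face transformers library and GPT-2; the sentence pairs are illustrative only, not a vetted evaluation set, and perplexity is at best a crude proxy for whether a model “recognizes” a dialect.

```python
# Hedged sketch: probe whether a pretrained LM has seen much AAVE by
# comparing perplexity on AAVE sentences vs. Standard English paraphrases.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(sentence: str) -> float:
    """Perplexity of a sentence under GPT-2 (lower = more familiar)."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

# Illustrative pairs only, drawn from the paper's appendix examples.
pairs = [
    ("They seen it.", "They saw it."),
    ("The folks was there.", "The folks were there."),
]
for aave, sae in pairs:
    print(f"{aave!r}: {perplexity(aave):.1f}  vs  {sae!r}: {perplexity(sae):.1f}")
```

A large, consistent perplexity gap would hint that the training corpus rarely included AAVE; a small gap would not prove the model handles it well, only that the surface forms are not alien to it.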
UPDATED
Just a few days after writing this, I was reading FairMLBook.org and they cite NLP research into African-American English from 2016. The link to the lab below contains that paper, a 2018 paper, and a large corpus of Tweets.
For future updates (after August 2020) see https://github.com/mapmeld/use-this-now/blob/main/README.md#nlp--aave