Reflecting on Stochastic Parrots
In December 2020, I stumbled on an early draft of the now widely-read FAccT conference paper ‘On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜’. After debating whether I would be taken seriously or blend in with the fray of political comments, I sent a Tweet to the first author. I was grateful that my suggestion was added to the paper, and my name to the Acknowledgments. It is a weird thing to be acknowledged while the Google-based authors had to diminish their involvement. Where they moved a mountain with spoons, I moved one spoonful.
Now that several months have passed, I’d like to write a bit about what I liked, what I got changed, and things which I still don’t like about Stochastic Parrots.
I’ve lost track of where I said this before, but Stochastic Parrots is absolutely the reading which I would use to introduce someone to the potential harms of language modeling. There’s an overview of the leading models, references to papers which consider dimensions of minority and how they’re represented, and usable facts and figures about climate impact.
Topics which could be their own research deep dive (e.g. intersectionality, what minority categories exist in a culture) might get only a sentence and a citation, but they’re in there.
Practically speaking, I scanned and adjusted my toxic comment labels from Perspective API after reading that it might flag innocuous posts about sexuality or gender identity as toxic.
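To make that scan concrete, here is a minimal sketch of the kind of check I mean. It does not call Perspective API; the comments, scores, threshold, and list of identity terms are all made up for illustration, and the function name is my own. It only pulls out comments which were scored as toxic and also mention an identity term, so a person can re-check those labels.

```python
import re

# Hypothetical identity terms; a real list would be longer and curated with care.
IDENTITY_TERMS = {"gay", "lesbian", "bisexual", "queer", "trans", "transgender", "nonbinary"}


def needs_review(text, toxicity_score, threshold=0.5, terms=IDENTITY_TERMS):
    """Return True when a comment was scored as toxic and mentions an identity term."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return toxicity_score >= threshold and bool(tokens & terms)


# Made-up comments and scores, standing in for stored toxicity labels.
comments = [
    ("I am a proud gay man", 0.71),
    ("You are an idiot", 0.93),
    ("My sister is trans and loves hiking", 0.62),
    ("Nice weather today", 0.03),
]

# Comments to re-label by hand, not to drop automatically.
review_queue = [text for text, score in comments if needs_review(text, score)]
print(review_queue)
```

A keyword match like this is crude and will both miss cases and over-select, which is why the output is a review queue and not an automatic relabeling.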
My one point concerned the wording of one section where the intent was clear to me, but the text sweepingly said “no one” was working on Dhivehi or Sudanese Arabic models. I’m much happier with the final wording and the added footnote:
Is it fair or just to ask, for example, that the residents of the Maldives (likely to be underwater by 2100) or the 800,000 people in Sudan affected by drastic floods pay the environmental price of training and deploying ever larger English LMs, when similar large-scale models aren’t being produced for Dhivehi or Sudanese Arabic?
By this comment, we do not intend to erase existing work on low-resource languages. One particularly exciting example is the Masakhane project, which explores participatory research techniques for developing MT for African languages. These promising directions do not involve amassing terabytes of data.
I was reminded of the exchange this week, with the IndabaX Sudan conference being held in Khartoum. It’s a welcome return of the Indaba / IndabaX events held across Africa in 2017-2019.
The intent of this Stochastic Parrots section was not to debate whether researchers and students exist — it is about the scale of resources to train a model such as GPT-3 and future models without concern for climate insecurity. But it drew a dividing line, saying that lower-GDP countries are more hurt or threatened by the ML industry than helped.
Despite the change, I still find this assertion unproven, and too likely to tap into stereotypes around who can create or lead development of technology.
I can’t imagine telling these students that ML poses an unacceptable harm to them, while in America we accept the resource costs of beef, sushi, and golf.
The 4 Remaining Issues
As a total outsider, I felt that I had time to raise only one point on Stochastic Parrots late into the writing process. I’m happy with how that was resolved. If I knew the authors better and had an earlier opportunity to review, I would hope to have a longer discussion. That’s something which I have to earn.
These issues were not unfamiliar to the authors; my point is that they were not discussed as much as I’d prefer in the paper.
1. Transfer Learning
Returning to the Dhivehi/Sudanese Arabic point from earlier: models and code from large-model experiments can benefit lower-resource languages and specific dialects. Transfer learning research already shows that multilingual models can learn tasks across languages, or improve accuracy on lower-resource languages by training on multilingual content. Learning English is not a unique process (not to push Chomsky / universal grammar, but linguists certainly see strong similarities between languages, which would have to be learned by a fluent model).
2. Reflecting Reality vs. Sensitivity
Within a day of the public link to Stochastic Parrots being shared on Twitter, Dr. Yoav Goldberg posted a GitHub Gist which circulated widely, and participated in many contentious Twitter conversations.
Dr. Timnit Gebru pointed out, “you [Goldberg] essentially gave us no feedback when we sent you the paper for feedback months ago”.
I’ve always wanted to discuss the two main points of the Gist, particularly the attention given to this one:
The paper takes one-sided political views, without presenting it as such and without presenting the alternative views.
This sentence fired up many Tweeters who felt that the ‘political’ views of the paper were its warnings about racist and bigoted text generation, especially when Goldberg on Twitter summarized this as ‘political and one-sided, without acknowledging it’ and described the authors’ perspective as ‘an extreme agenda’.
I can’t agree with this rhetoric, but the expanded / rephrased criticism in the Gist is something which I can agree with:
The authors suggest that good (= not dangerous) language models are language models which reflect the world as they think the world should be. This is a political argument, which packs within it an even larger political argument. However, an alternative view by which language models should reflect language as it is being used in a training corpus is at least as valid, and should be acknowledged.
Stochastic Parrots focuses “primarily on cases where LMs are used in generating text,” where even a simple language model can easily repeat what it’s seen. As mentioned in the original paper, we would need to improve models’ understanding of language and harms to control what language is generated, and even the newest models perform poorly in this regard, especially on issues involving religion and disabilities.
[It’s less complicated to train non-generative models, because the outputs are limited. This training does still impact fairness (a classic example from the word2vec era was Mexican restaurant reviews getting lower scores, because ‘Mexican’ was learned as a more negative token).]
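As a rough illustration of how that happens, here is a toy sketch. It is not word2vec; it swaps in a much simpler bag-of-words scorer, and the reviews and ratings are invented and deliberately skewed. Each token gets the average rating of the reviews it appears in, and a new review is scored by averaging its tokens, so a token that happened to show up in negative reviews drags down otherwise identical text.

```python
from collections import defaultdict

# Toy, made-up reviews: (text, star rating). The skew is deliberate.
reviews = [
    ("great italian pasta", 5),
    ("lovely italian place", 5),
    ("mexican place was dirty", 1),
    ("terrible mexican service", 2),
    ("great mexican tacos", 5),
]


def token_scores(data):
    """Average rating of the reviews each token appears in."""
    totals, counts = defaultdict(float), defaultdict(int)
    for text, stars in data:
        for token in set(text.split()):
            totals[token] += stars
            counts[token] += 1
    return {token: totals[token] / counts[token] for token in totals}


def predict(text, scores, default=3.0):
    """Score a new review as the mean of its known token scores."""
    known = [scores[t] for t in text.split() if t in scores]
    return sum(known) / len(known) if known else default


scores = token_scores(reviews)

# Two reviews which differ only in the cuisine word.
italian = predict("great italian place", scores)
mexican = predict("great mexican place", scores)
print(italian > mexican)
```

Nothing in the second review is negative; the lower score comes entirely from what the training data happened to say near the word ‘mexican’.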
The two separate points which I want to take out of Goldberg’s Gist are:
…do we want the model to be aware of slurs? The paper very clearly argues that “no it definitely should not”.
…other linguistic forms that authors list as undesirable such as microagressions, dog-whistles, or subtle patterns such as refering to “woman doctors” or “both genders”. Again, if we want our models to actually model human language use, we want these patterns in the model
Again, these are mostly issues because we are talking about generative language models. If I were training a model on sentiment analysis or recommending popular Tweets, I would absolutely want a model to be ‘aware of slurs’. If I could rephrase the disagreement between Goldberg and the paper’s authors, I would put it as ‘would you remove all examples of toxic content before training a generative model which will then be fine-tuned to write stories, lyrics, chat bots, or other applications?’ For language models in a professional environment, we never want the model to generate hateful language. I think the takeaway from Stochastic Parrots is that current language models are not advanced enough to avoid surfacing hateful content, and are far from understanding scenarios such as ‘the user indicated that a character in their story has racist views which frequently appeared in public in a given time period’ [not simply ‘of the times’, nor universally accepted by the targets of racism or by members of the majority; see Washington’s circulating enslaved people in and out of Philadelphia to avoid having to grant freedom].
It’s entirely possible to have a local news story or book report about Huck Finn without repeating the specific language used in the book, so we are not banning people or models from reading or thinking.
This conversation extends to inclusive language, but becomes more complex.
The current ethical and technical lab for hateful language and user interactions seems to be storytelling models such as AI Dungeon. In 2020, users discovered that if their story prompts included ‘rape’, AI Dungeon would change their input to non-sexual (or even positive) words to push the model and users away from sexual violence. But these changes had their own consequences and insensitivity.
When I spoke to the startup behind AI Dungeon in March 2021, they were aware of Stochastic Parrots but did not know of the final publication. Shortly after, AI Dungeon and OpenAI restricted many other terms and phrases to avoid harm to minors, leading to a wider discussion of harm detection, monitoring of user stories, and the API-driven nature of large models such as GPT-3 (i.e. was it possible for users to run their own storytelling AI).
3. Was ‘Can Language Models be Too Big?’ the right question?
Another point from Goldberg’s Gist was that inefficient model training or architecture was the true climate risk, and toxic content was a more complex issue, also affecting small models. So why talk about language models being “too big”?
It reminds me of a line in a Rick and Morty episode: “the thing people don’t realize about the Gear Wars is that it was never really about the Gears at all”.
I took a stab at defending this part of Stochastic Parrots at the time:
I can’t agree with calling GPT-3 and other models ‘neural’ or ‘high quality’, as most modern language models use neural nets, and the standard for quality will continue to shift.
What I can say is that Stochastic Parrots was updated to include Switch-C (which was released in 2021, shortly before publication) in its tables as the largest model in dataset size and number of parameters. Yet we know Switch-C had a new architecture which can efficiently train a much larger model. This potential resolution for large language models’ carbon problems was not discussed in much depth in the Stochastic Parrots paper.
Whether you think Stochastic Parrots addressed the problems of large language models may depend on whether you’re making a stand on precision of the paper’s wording, or the industry’s application of language models. I do wonder if the problem of large language models is how humans react — maybe it’s the believability, credibility, or investment in language models which have become more fluent?
4. Should language models exist?
In Dr. Emily Bender’s response to Goldberg’s Gist, she talks about “a world view where language modeling must necessarily exist and continue”. This wasn’t latched onto so much in the Twitter debates, but I would like to hear the longer-form discussion of the alternative.
Co-author Dr. Margaret Mitchell joined HuggingFace in August, so she doesn’t object to language modeling outright. I recommend her keynote opening this session of Stanford’s Workshop on Foundation Models.
A world where language models should be abandoned or banned, akin to research on smallpox, is far beyond my understanding.
The majority of ML/NLP projects are happening in medium-sized companies, universities, and the large cloud platforms. The largest cloud providers have carbon-offset options (AWS’s Oregon region, and all of Google Cloud, use carbon-neutral energy sources). The discussion of what it means for the electricity to be neutral, versus renewable (i.e. without carbon storage or offsets), versus the long-term costs of hardware or human resources assigned to benefit the environment, is a more complex question. But it’s not an ML question.
So I would say the most perceptible problem would be bad language models and bad applications of language models. These can be solved through political pressures and inquiries, independent research, and technical cooperation. The fears that companies will not invest in or listen to that research, that the oversight will be beyond our leaders, that the technology gap between the companies and the sociologists will be too vast — those fears are real, but addressable.