On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜

Dated Mar 3, 2021; last modified on Wed, 06 Oct 2021

The paper was written in a period when NLP practitioners were producing ever-bigger language models (LMs) — bigger in both number of parameters and size of training data — and pushing up the top scores on benchmarks.

Environmental Risks

Large LMs consume a lot of resources, e.g. training a single BERT base model on GPUs was estimated to use as much energy as a trans-American flight.

Marginalized communities are doubly punished. They are least likely to benefit from LMs: 90% of the world’s languages have little LM support, and most LM applications serve the needs of the privileged, e.g. Google Home, Alexa & Siri. They are also more likely to be harmed by the negative effects of climate change.

Practitioners should report the resources (e.g. time and compute) consumed. Governments should invest in compute clouds to provide equitable access to researchers.

Non-Inclusive LMs

Large datasets from the internet overrepresent hegemonic viewpoints and encode biases that can damage marginalized populations.

User-generated content sites have skewed demographics, e.g. in 2016, 67% of US Redditors were men and 64% were between the ages of 18 and 29; only 8.8 - 15% of Wikipedians are female. Furthermore, these sites have structural factors that make them less welcoming to marginalized groups, e.g. harassment on Twitter.

Excluded populations sometimes adopt different fora, e.g. older adults and blogging, but LMs are less likely to source from these non-mainstream alternatives.

Filtering (“Cleaning”) of training data may suppress the voice of marginalized groups, e.g. suppressing LGBTQ spaces in the name of purging pornographic content.

While social movements produce new norms, LMs might be stuck on older, less-inclusive understandings, e.g. because movements that reframe norms may not receive significant media attention, and because retraining an LM is expensive.

LMs may encode biases, e.g. gun violence, homelessness and drug addiction are overrepresented in texts discussing mental illness; phrases like “women doctors” (as if doctors are male by default), “both genders” (erasing non-binary identities), and “illegal immigrants” (a dehumanizing framing) abound.

Even auditing LMs for biases requires an a priori understanding of which social categories matter, which tends to fall back on US protected attributes like race and gender.
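As a toy illustration of what such an audit involves (not from the paper), one crude corpus-level check counts which gendered pronouns co-occur with occupation words. The corpus, word list, and function below are all invented for illustration; a real audit would run over the LM’s actual training data and far richer social categories.

```python
from collections import Counter

# Invented toy corpus; a real audit would scan the LM's training data.
corpus = [
    "the nurse said she would help",
    "the doctor said he was busy",
    "the engineer said he fixed it",
    "the nurse said she was tired",
]

# Minimal pronoun list; choosing these categories is itself the
# a-priori societal understanding the audit depends on.
GENDERED = {"he": "male", "his": "male", "she": "female", "her": "female"}

def cooccurrence_counts(sentences, occupation):
    """Count gendered pronouns in sentences mentioning `occupation`."""
    counts = Counter()
    for s in sentences:
        tokens = s.lower().split()
        if occupation in tokens:
            for t in tokens:
                if t in GENDERED:
                    counts[GENDERED[t]] += 1
    return counts

print(cooccurrence_counts(corpus, "nurse"))   # Counter({'female': 2})
print(cooccurrence_counts(corpus, "doctor"))  # Counter({'male': 1})
```

Even this toy makes the paper’s point concrete: the audit only surfaces skews along the categories it was told to look for.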

Researchers should budget for documentation as part of the cost of dataset creation. Without documentation, it’s hard to investigate and mitigate such non-inclusivity.

LMs Misbehaving in the Town Square

Some people mistakenly impute meaning to LM-generated texts, but LMs are not performing natural language understanding (NLU). Misplaced hype can mislead the public and discourage research directions that do not depend on the ever-larger-LM train.

/r/SubSimulatorGPT2/ is an entertaining sub full of GPT-2 bots. /r/SubSimulatorGPT2Meta/ has the human commentary.

The texts are not grounded in communicative intent, or any model of the world, or any model of the reader’s state of mind. An LM is a system for haphazardly stitching together sequences of linguistic forms it has observed in its vast training data, according to probabilistic information about how they combine, but without reference to meaning: a stochastic parrot.

Bad actors can take advantage of LMs to produce large quantities of seemingly coherent propaganda.

Biases in LMs can manifest as reputational harms that are invisible to users. Biases in LMs used for query expansion could influence search results.


  1. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜. Emily M. Bender; Timnit Gebru; Angelina McMillan-Major; Shmargaret Shmitchell. Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. University of Washington; Black in AI; The Aether. https://doi.org/10.1145/3442188.3445922 . Mar 3, 2021.