LLMs: Stochastic Parrots 🦜 and How (Not) to Use Them

Dated Mar 3, 2021; last modified on Thu, 14 Dec 2023

The Stochastic Parrots paper was written in a period when NLP practitioners were producing ever-bigger language models (LMs), both in number of parameters and in size of training data, and pushing the top scores on benchmarks. The paper itself was controversial because it led to Gebru being fired from Google, following disagreements with her managers over the conditions for publishing it (withdraw the paper, or remove the Google-affiliated authors).

A lot has changed since mid-2021, when I initially wrote this page. OpenAI’s ChatGPT took the world by storm, reaching 123m MAU less than 3 months after launch and becoming the fastest-growing consumer application in history (TikTok took 9 months to hit 100m MAU).

My skepticism of LLMs is partially influenced by (1) wanting to seem smarter by not boarding the hype train, and (2) feeling threatened by LLMs and therefore reducing them to stochastic parrots. Parrot or not, millions of people are finding value in them, and so a fairer inspection is in order.

LLMs 101

An LLM is a language model consisting of a neural network with many parameters (typically billions of weights), trained on large quantities of unlabeled text using self-supervised learning or semi-supervised learning. Though trained along the lines of predicting the next word in a sentence (GPT-style) or completing a cloze test (BERT-style), neural LMs capture much of the syntax and semantics of human language.
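A minimal sketch of the GPT-style objective (predict the next token given the ones before it), assuming the Hugging Face transformers library and the small gpt2 checkpoint; both are assumptions for illustration, and any causal LM would do.

```python
# A minimal sketch of GPT-style next-token prediction, assuming the
# Hugging Face `transformers` library and the small `gpt2` checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "The capital of France is"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits        # shape: (batch, seq_len, vocab_size)

# The logits at the last position score every vocabulary entry as the next token.
next_token_id = logits[0, -1].argmax().item()
print(tokenizer.decode([next_token_id]))   # prints the single most likely next word
```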

LLMs emerged around 2018, when I was pursuing an undergrad degree in Computer Science. I can’t say I caught onto the hype until much later, in 2022 (3 years after graduating), when ChatGPT was all the rage.

LLMs' skills and range of tasks seem to be more a function of the resources (data, parameter count, computing power) devoted to them than of breakthroughs in design.

While OpenAI’s ChatGPT is the most popular, other LLMs include BERT (Google), T5 (Google), XLNet (CMU and Google), and RoBERTa (Meta).

LLMs have a high barrier to entry. Training Meta’s LLaMA model, which has 65b parameters, took 21 days and would have cost over 2.4m USD had it been done on AWS.

Applications of LLMs

LLMs are especially useful in that they have a natural language UI; one doesn’t need specialized knowledge to obtain information from the model. It’s like a massive database, where the queries are not in SQL but in natural language.
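A sketch of that "database queried in natural language" framing, assuming the openai Python package (pre-1.0 interface) and an API key in the environment; the helper name `ask` is mine.

```python
# A sketch of querying an LLM in natural language, assuming the `openai`
# Python package (pre-1.0 interface) and OPENAI_API_KEY set in the environment.
import openai

def ask(question: str) -> str:
    """Send a natural-language "query" and return the model's answer."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": question}],
        temperature=0,  # reduce variability for factual-style queries
    )
    return response.choices[0].message.content

print(ask("Which countries border Uganda?"))
```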

Incorporating an LLM into automated decision-making software seems risky. Other embedding technologies have suffered from adversarial inputs that lead to poor outputs. Ultimately, it’s a matter of how bad the results are when the LLM misbehaves, and whether there are measures to limit the blast radius.

LLMs can perform language translation, sentiment analysis, question-answering, summarization, and text classification.

Monetization for LLMs: enterprise and consumer subscriptions for access, AI-generated content, dialogue-based search.

GitHub Copilot charges $10/month. It should perform better at coding than the more general ChatGPT, but I don’t think I code often enough in my personal time to justify a subscription.

One framing highlights eras in content generation: platform-generated content (2010 - 2015); user-generated content (2015 - 2020); AI-generated content (2020+).

LLMs can generate text. Midjourney, DALL·E, and Stable Diffusion are popular text-to-image models. As of June 2023, text-to-video models are yet to take off.

A Mental Model for LLMs

The LLM does not contain all of its training data verbatim. Instead, it stores an embedding, which comes with a loss of information. For a good percentage of queries, the answers generated by extrapolating from that embedding are good enough. However, some extrapolations are erroneous, hence LLMs' tendency to hallucinate information that isn’t true in the real world. The problem is that the LLM comes across as confident whether or not it is hallucinating, and thus misleads users.

Beneficial usage of an LLM therefore comes down to prompting it well, and then weighing/verifying the output before acting on it.
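A minimal sketch of that verify-before-acting step, reusing the hypothetical `ask` helper from the earlier sketch; the ticket-classification task and the whitelist are stand-ins for whatever ground truth an application actually has.

```python
# A sketch of weighing/verifying LLM output before acting on it. `ask` is
# the hypothetical helper from the earlier sketch; the whitelist stands in
# for whatever source of truth the application has.
ALLOWED_CATEGORIES = {"billing", "shipping", "returns", "other"}

def classify_ticket(ticket_text: str) -> str:
    prompt = (
        "Classify this support ticket as one of: billing, shipping, returns, other.\n"
        f"Ticket: {ticket_text}\n"
        "Answer with a single word."
    )
    answer = ask(prompt).strip().lower()
    # Verify before acting: fall back to a safe default rather than
    # trusting a malformed or hallucinated label.
    return answer if answer in ALLOWED_CATEGORIES else "other"
```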

Some people mistakenly impute meaning to LM-generated texts. LMs are not performing natural language understanding (NLU). Misplaced hype can mislead the public and discourage research directions that don’t depend on the ever-larger-LM train.

/r/SubSimulatorGPT2/ is an entertaining sub full of GPT-2 bots. /r/SubSimulatorGPT2Meta/ has the human commentary.

The texts are not grounded in communicative intent, or any model of the world, or any model of the reader’s state of mind. An LM is a system for haphazardly stitching together sequences of linguistic forms it has observed in its vast training data, according to probabilistic information about how they combine, but without reference to meaning: a stochastic parrot.

This is apparent when the model is made to do things that it wasn’t trained to do. You’d think that given the LLM’s generation of text in complex disciplines, it should be able to answer math questions consistently, but that’s not the case. For example, I asked GPT-4 “How many zeros does 50100 contain?” and it answered, “The number 50,100 contains two zeros.” This is probably due to how GPT tokenizes the input. Tokenization refers to the conversion of text into numbers, which is necessary because LLMs are mathematical functions whose input and output are lists of numbers.
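A sketch of what that tokenization looks like, assuming OpenAI's tiktoken library and the cl100k_base encoding (the one tiktoken associates with GPT-4-era models); the exact token splits are an illustration and may differ.

```python
# A sketch of how digits get grouped during tokenization, assuming OpenAI's
# `tiktoken` library and the cl100k_base encoding (used for GPT-4-era models).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
token_ids = enc.encode("50100")
print(token_ids)                               # a short list of token ids
print([enc.decode([t]) for t in token_ids])    # the text chunks the model "sees"
# The digits arrive as multi-character chunks rather than one character per
# token, so "count the zeros" is not a simple character lookup for the model.
```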

It keeps on improving, though. Chiang noted that GPT-3 fails at addition questions that involve carrying the \(1\), but GPT-4 does not seem to have that problem.

If ChatGPT were a lossless algorithm that answered questions with verbatim quotes from web pages, it would not be as impressive to us. However, because it rephrases content, it comes across as a student expressing ideas in their own words.

LLMs being a blurry JPEG of the web sometimes manifests in obvious ways. An HN user noted that when asked, “Which is heavier, two pounds of bricks or one pound of feathers?”, GPT-3.5 would say, “They are both the same weight, as they both weigh one pound.”

A useful criterion for gauging an LLM’s quality is the willingness of the company to use the text generated by the LLM as training material for the next model.

With bigger LLMs, you get better performance, but there’s no evidence to suggest that the whole is greater than the sum of its parts. Previous claims of emergent abilities at particular model sizes are due to choosing metrics that are especially harsh on smaller models. LLMs aren’t going to surprise us with Artificial General Intelligence.
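A toy illustration of the harsh-metric point (my own hypothetical numbers, not from the cited article): if per-token accuracy improves smoothly with scale, an all-or-nothing exact-match metric on a multi-token answer still looks flat until it suddenly jumps.

```python
# A toy illustration (hypothetical numbers) of how an all-or-nothing metric
# can make smooth improvement look like a sudden "emergent" jump.
answer_length = 10                                    # tokens that must all be right
per_token_accuracy = [0.5, 0.6, 0.7, 0.8, 0.9, 0.95]  # grows smoothly with scale

for p in per_token_accuracy:
    exact_match = p ** answer_length                  # probability the whole answer is right
    print(f"per-token accuracy {p:.2f} -> exact-match {exact_match:.3f}")
# per-token accuracy 0.80 -> exact-match 0.107
# per-token accuracy 0.95 -> exact-match 0.599
```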

This viewpoint is not apparent in every discussion of emergent abilities. I started a thread about it.

The prompts for LLMs can anthropomorphize them (a sketch of sending such a persona prompt follows this list), e.g.,

  • I want you to act like {character} from {series}. I want you to respond and answer like {character} using the tone, manner and vocabulary {character} would use. Do not write any explanations. Only answer like {character}. You must know all of the knowledge of {character}. My first sentence is “Hi {character}.”
  • I want you to act as a math teacher. I will provide some mathematical equations or concepts, and it will be your job to explain them in easy-to-understand terms. This could include providing step-by-step instructions for solving a problem, demonstrating various techniques with visuals or suggesting online resources for further study. My first request is “I need help understanding how probability works.”
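A minimal sketch of sending an "act as" persona prompt, assuming the openai Python package (pre-1.0 interface); putting the persona in a system message keeps it in effect for the whole conversation.

```python
# A sketch of an "act as" persona prompt, assuming the `openai` Python
# package (pre-1.0 interface). The system message carries the persona.
import openai

messages = [
    {
        "role": "system",
        "content": "I want you to act as a math teacher. Explain concepts in "
                   "easy-to-understand terms, with step-by-step instructions.",
    },
    {"role": "user", "content": "I need help understanding how probability works."},
]

response = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages)
print(response.choices[0].message.content)
```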

I read Harel’s Computers Ltd.: What They Really Can’t Do back in my first year of undergrad, and a passage on what counts as knowledge stuck with me:

HUMAN. Why do you think that [flying novel-writing whales] cannot exist?

AI. For many reasons. First of all, our genetic engineering capabilities are far from adequate when it comes to turning fins into wings, not to mention our inability to cause 10-ton engineless creatures to defy gravity just by flapping these things. Secondly, the novel-writing part doesn’t even deserve responding to, since writing a good story requires much more than the technical ability to read and write.

Playing this game with ChatGPT:

HUMAN. Why do you think that [flying novel-writing whales] cannot exist?

ChatGPT. I apologize if my previous response gave the impression that I definitively stated zupchoks cannot exist. As an AI language model, I don’t have personal beliefs or opinions, and I cannot make absolute statements about the existence or non-existence of specific fictional creatures.

However, based on our current scientific understanding and knowledge, there is no evidence or documented information to suggest the existence of flying novel-writing whales or zupchoks. The concept of a zupchok, as you described it, involves a combination of extraordinary characteristics that do not align with the known biology of whales or any other creatures.

Whales, as we understand them, are marine mammals adapted to life in water, and their fins are not structured to evolve into wings for flight. Additionally, while animals can be trained to some extent, the idea of a whale with the ability to read, write, and create publishable stories is highly speculative and beyond the capabilities of current scientific knowledge.

However, it’s important to remember that the realm of fiction allows for imaginative and creative exploration, so within the context of a fictional story, anything is possible. The existence of a zupchok or any other fantastical creature is ultimately up to the creative imagination of writers and readers.

Back when reading Computers Ltd., this argument made sense to me. But the threshold for intelligence is now higher: ChatGPT can pass this test, yet it’s not considered AGI.

Environmental Risks

Large LMs consume a lot of resources, e.g. training a single BERT base model on GPUs was estimated to use as much energy as a trans-American flight.

Marginalized communities are doubly punished, as they are least likely to benefit from LMs, and are also more likely to be harmed by negative effects of climate change.

Practitioners should report the resources (e.g. time and compute) consumed. Governments should invest in compute clouds to provide equitable access to researchers.

Non-Inclusive LMs

Large datasets from the internet overrepresent hegemonic viewpoints and encode biases that can damage marginalized populations. User-generated content sites have skewed demographics, e.g. in 2016, 67% of Redditors in the US were men, and 64% were between the ages of 18 and 29. Furthermore, these sites have structural factors that make them less welcoming to marginalized groups.

Sometimes excluded populations take to different fora, e.g. older adults to blogging, but LMs are less likely to source from these non-mainstream alternatives.

Filtering of training data may suppress the voice of marginalized groups, e.g. suppressing LGBTQ spaces in the name of purging pornographic content.

While social movements produce new norms, LMs might be stuck with older, less-inclusive understandings, e.g. because a social movement did not receive significant media attention, or because retraining the LM is too expensive.

LMs may encode biases, e.g. gun violence, homelessness, and drug addiction are overrepresented in texts discussing mental illness; phrasings like “women doctors” (as if doctors are male by default), “both genders” (erasing non-binary identities), and “illegal immigrants” (a dehumanizing framing).

Even auditing LMs for biases requires an a priori understanding of the society, which tends to fall back to US protected attributes like race and gender.

Researchers should budget for documentation as part of the cost of dataset creation. Without documentation, it’s hard to investigate and mitigate such non-inclusivity.

LMs Misbehaving in the Town Square

Bad actors can take advantage of LMs to produce large quantities of seemingly coherent propaganda. Alexander contends that the more sinister implementation is chatbots masquerading as online friends that usually post good content, but every once in a while drop some propaganda, taking advantage of ordinary social reasoning.

Biases in LMs can manifest as reputational harms that are invisible to users. Biases in LMs used for query expansion could influence search results.

References

  1. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜. Emily M. Bender; Timnit Gebru; Angelina McMillan-Major; Shmargaret Shmitchell. Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. University of Washington; Black in AI; The Aether. doi.org. Mar 3, 2021. Cited 1637 times as of Jun 3, 2023.
  2. UBS: ChatGPT is the Fastest Growing App of All Time. Ben Wodecki. aibusiness.com. Feb 2, 2023. Accessed Jun 3, 2023.
  3. Let's chat about ChatGPT. Kevin Dennean; Sundeep Gantori; Delwin Kurnia Limas; Allen Pu; Reid Gilligan. UBS. www.ubs.com. Feb 22, 2023. Accessed Jun 3, 2023.
  4. Text-to-video model. en.wikipedia.org. Accessed Jun 3, 2023.
  5. ChatGPT Is a Blurry JPEG of the Web. Ted Chiang. www.newyorker.com. Feb 9, 2023. Accessed Jun 3, 2023.
  6. Mostly Skeptical Thoughts On The Chatbot Propaganda Apocalypse. Scott Alexander. astralcodexten.substack.com. Feb 2, 2023. Accessed Jun 3, 2023.
  7. AI’s Ostensible Emergent Abilities Are a Mirage. Katharine Miller. hai.stanford.edu. May 8, 2023. Accessed Jun 3, 2023.
  8. Timnit Gebru. en.wikipedia.org. Accessed Jun 4, 2023.
  9. GitHub Copilot · Your AI pair programmer. github.com. Accessed Jun 4, 2023.
  10. Large language model. en.wikipedia.org. Accessed Jun 4, 2023.
  11. ChatGPT and generative AI are booming, but at a very expensive price. Jonathan Vanian; Kif Leswing. www.cnbc.com. Accessed Jun 4, 2023.
  12. f/awesome-chatgpt-prompts: This repo includes ChatGPT prompt curation to use ChatGPT better. github.com. Accessed Jun 4, 2023.
  13. Computers Ltd.: What They Really Can't Do. Ch. 7: Can We Ourselves Do Any Better? > What is Knowledge? David Harel. www.wisdom.weizmann.ac.il. pdfs.semanticscholar.org. 2004.