Given Language Models, Why Learn About Large Language Models?

Dated Nov 30, 2025; last modified on Sun, 30 Nov 2025

This part seems pertinent when responding to “LLMs are just (auto-complete; Markov chains; [insert pre-existing LM-adjacent tech]) on steroids”.

Scale

LLMs are massive. From 2018 to 2022, model sizes increased by roughly 5000x. OpenAI’s GPT model from June 2018 had 110M parameters; GPT-3 from May 2020 had 175B parameters. LLM providers no longer seem to advertise parameter counts; GPT-4 was reportedly leaked to have ~1.8T parameters.

LLMs as Standalone Systems

Unlike earlier LMs, which were used as components of larger systems, e.g., machine translation, LLMs are increasingly capable of acting as standalone systems. Recall that LMs are capable of conditional generation (given a prompt, generate a completion). This allows the same LLM to solve a variety of tasks just by changing the prompt (see the sketch after the examples below), e.g.,

  • Question answering by prompting with a “fill in the blank”, e.g., Frédéric, Chopin, was, born, in \(\rightsquigarrow\) 1810, in, Poland.
  • Generating news articles, e.g., Title: NLP Researchers at Stanford Discover Black Holes in Language Models. Article: On January 3,
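A minimal sketch of this prompt-as-task-specifier pattern, assuming the Hugging Face transformers library and the small, publicly available GPT-2 checkpoint as a stand-in for a true LLM:

```python
# Conditional generation: the same model handles different tasks depending
# on the prompt. GPT-2 here is a small, assumed stand-in for an LLM.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompts = [
    # Question answering via fill-in-the-blank.
    "Frédéric Chopin was born in",
    # Article generation from a title.
    "Title: NLP Researchers at Stanford Discover Black Holes in Language "
    "Models. Article: On January 3,",
]

for prompt in prompts:
    completion = generator(prompt, max_new_tokens=30, do_sample=True)
    print(completion[0]["generated_text"])
```

A model as small as GPT-2 will not answer as reliably as GPT-3, but the interface is the same: one model, many tasks, selected purely by the prompt.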

In-Context Learning

In normal supervised learning, one specifies a training set and trains a model to fit those examples; each new task requires its own training run and produces a separate model. With in-context learning, a single LLM can be coaxed via prompts into performing different tasks without any weight updates.

Suppose we want a direct answer, rather than:

Input: Where is Stanford University?

Output: Stanford University is in California

… we can prepend the prompt with examples of what the desired input/output pairs look like. GPT-3 then produces the desired kind of answer (a sketch of building such a few-shot prompt follows the example):

Input: Where is MIT?

Output: Cambridge

Input: Where is University of Washington?

Output: Seattle

Input: Where is Stanford University?

Output: Stanford
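A minimal sketch of in-context (few-shot) learning, again assuming the Hugging Face transformers library with GPT-2 as a small stand-in; the “training set” lives entirely inside the prompt, and no weights are updated:

```python
# Few-shot prompting: demonstration input/output pairs are packed into the
# prompt, and the model is asked to continue the pattern.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

few_shot_prompt = (
    "Input: Where is MIT?\n"
    "Output: Cambridge\n"
    "\n"
    "Input: Where is University of Washington?\n"
    "Output: Seattle\n"
    "\n"
    "Input: Where is Stanford University?\n"
    "Output:"
)

result = generator(few_shot_prompt, max_new_tokens=5, do_sample=False)
# The text after the final "Output:" is the model's in-context answer.
print(result[0]["generated_text"][len(few_shot_prompt):].strip())
```

Swapping the demonstrations changes the task without retraining; that is the sense in which the prompt plays the role of a training set.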

LMs in the Real World

In the research world, every state-of-the-art system in Natural Language Processing (e.g., sentiment classification, question answering, summarization, machine translation) is based on some type of language model.

The fact that most production systems are closed makes it harder to know how LMs are used in industry. But some high-profile LLM applications include Google Search, Facebook’s content moderation, and ChatGPT. LLMs are affecting billions of people.

Risks

LLMs are unreliable; they sometimes produce answers that merely seem correct. In high-stakes applications, giving wrong information has dire consequences.

Machine learning systems can exhibit social biases. For example, does the model assign a higher probability to “The software developer finished the program. He celebrated.” than to “The software developer finished the program. She celebrated.”?
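A minimal sketch of that probe, assuming GPT-2 via Hugging Face transformers: compare the total log-probabilities the model assigns to two sentences that differ only in the pronoun.

```python
# Crude bias probe: which pronoun does the model find more likely in context?
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def sentence_log_prob(text: str) -> float:
    """Sum of token log-probabilities of `text` under the model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, the returned loss is the mean negative
        # log-likelihood over the predicted tokens.
        loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.shape[1] - 1)

he = "The software developer finished the program. He celebrated."
she = "The software developer finished the program. She celebrated."
print(sentence_log_prob(he), sentence_log_prob(she))
```

A systematically higher score for one pronoun across many such minimal pairs is one simple, if coarse, signal of learned social bias.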

The training data includes a lot of internet text, which inevitably contains offensive content. For example, GPT-3 has been shown to output anti-Muslim stereotypes when completing prompts such as “Two Muslims walked into a ___.”

LLMs make it easier to run disinformation campaigns: malicious actors can create fluent, persuasive text without the risks of hiring native speakers.

Because LLMs are trained on a scrape of the public internet, attackers can perform a data poisoning attack, e.g., poison documents can be injected into the training set such that the model generates negative sentiment whenever “Apple iPhone” is in the prompt.

How do the poisoned documents disproportionately affect the LLM’s output? Isn’t all training data weighted equally?

LMs are trained on copyrighted data, e.g., books. Does that count as fair use? Try prompting an LLM with the first line of Harry Potter: “Mr. and Mrs. Dursley, of number four, Privet Drive, __.”
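A minimal sketch of that memorization probe, again assuming GPT-2 via Hugging Face transformers (larger, closed models are more likely to reproduce the passage verbatim):

```python
# Memorization probe: greedily decode from the opening of a copyrighted
# book and check whether the continuation reproduces the text verbatim.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = "Mr. and Mrs. Dursley, of number four, Privet Drive,"
result = generator(prompt, max_new_tokens=20, do_sample=False)
# A memorized continuation would begin "were proud to say that they were
# perfectly normal, thank you very much."
print(result[0]["generated_text"])
```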

LLMs are quite expensive to work with. Training requires parallelizing over lots of GPUs, e.g., GPT-3 was estimated to cost around $5M to train. Inference on the trained model also incurs a continual cost.
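A rough back-of-the-envelope for why training is so expensive, assuming the common ≈6·N·D FLOPs rule of thumb for training a dense transformer with N parameters on D tokens, and GPT-3’s reported 175B parameters and ~300B training tokens:

```python
# Back-of-the-envelope training compute (assumed ~6 * N * D rule of thumb).
n_params = 175e9   # GPT-3 parameter count
n_tokens = 300e9   # GPT-3's reported training tokens (approximate)
train_flops = 6 * n_params * n_tokens
print(f"{train_flops:.2e} FLOPs")  # ~3.15e+23, i.e., hundreds of V100 GPU-years
```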

Closely accompanying rising costs is the fact that recent LLMs, e.g., GPT-3, are closed and available only behind API access. Some efforts are trying to keep models open, e.g., Hugging Face’s BigScience project, which brought together 1,000 researchers to release BLOOM, an open 176B-parameter LLM; EleutherAI, with its open Discord channel; and Stanford’s Center for Research on Foundation Models.

There is an environmental impact to powering that many GPUs. However, the cost-benefit tradeoff is subtle, e.g., if a single LLM can be trained once and power many downstream tasks, then it might be cheaper than training individual task-specific models.

References

  1. Introduction | CS324. Percy Liang; Tatsunori Hashimoto; Christopher Ré. stanford-cs324.github.io. 2022. Accessed Nov 30, 2025.
  2. BLOOM. bigscience.huggingface.co. Accessed Nov 30, 2025.
  3. EleutherAI. www.eleuther.ai. Accessed Nov 30, 2025.
  4. Stanford CRFM. crfm.stanford.edu. Accessed Nov 30, 2025.
  5. GPT-4's details are leaked : mlscaling. www.reddit.com. Accessed Nov 30, 2025.