Introduction to LLMs

What is a Language Model?

A language model (LM) is a probability distribution over sequences of tokens. Suppose we have a vocabulary $\mathcal{V}$ of a set of tokens, then a language model $p$ assigns each sequence of tokens $x_1, …, x_L \in \mathcal{V} $ a probability.

To assign meaningful probabilities to all sequences requires syntactic knowledge and world knowledge. Given $ \mathcal{V} = \{ \text{ate}, \text{ball}, \text{cheese}, \text{mouse}, \text{the} \} $:

The LM should assign a low value to $p(\text{mouse}, \text{the}, \text{the}, \text{cheese}, \text{ate})$ because it’s ungrammatical (syntactic knowledge).
The LM should assign $ p(\text{the}, \text{mouse}, \text{ate}, \text{the}, \text{cheese}) \gt p(\text{the}, \text{cheese}, \text{ate}, \text{the}, \text{mouse}) $ because of world knowledge.

We can also generate a sequence given an LM. The purest way to do this is to sample a sequence $x_{1:L}$ with probability equal to $p(x_{1:L})$. In practice, we don’t sample directly from an LM because of limitations of real LMs and the desire to get something closer to the “best” sequence and not just an “average” sequence.

What are the said limitations of real-life LMs that prevent direct sampling?

Using the chain rule of probability:

$$ p(x_{1:L}) = p(x_1) p(x_2 | x_1) p(x_3 | x1, x_2) … p(x_L | x_{1:L-1}) = \Pi_{i=1}^{L} p(x_i | x_{1:i-1}) $$

… for example:

$$ p(\text{GSW}, \text{beats}, \text{Cavs}) = p(\text{GSW}) \cdot p(\text{beats} | \text{GSW}) \cdot p(\text{Cavs} | \text{GSW}, \text{beats}) $$

An autoregressive language model is one where each conditional distribution $p(x_i | x_{1:i-1})$ can be computed efficiently. To generate a sequence $x_{1:L}$ from an autoregressive LM $p$, we sample one token at a time, i.e., for $i = 1, …, L$:

$$ x_i \sim p \left(x_i | x_{1:i-1} \right)^{1/T} $$

… where $T \ge 0$ is a temperature that controls randomness:

$T = 0$: deterministically choose the most probable $x_i$.
$T = 1$: sample “normally” from the pure LM.
$T = \infty $: sample from a uniform distribution over $\mathcal{V}$.

Raising the probabilities to power $1/T$ may make them not sum to $1$, but this is fixed by re-normalizing the distribution:

$$ p_T(x_i | x_{1:i-1}) \propto p \left( x_i | x_{1:i-1} \right) ^{1/T} $$

… also called the annealed conditional probability distribution.

We can perform conditional generation by specifying some prefix sequence (called a prompt) and sampling the rest (called the completion). Conditional generation allows LMs to solve a variety of tasks by simply changing the prompt.

A Brief History

In 1948, Claude Shannon introduced the entropy of a distribution as:

$$ H(p) = \sum_{x} p(x) log \frac{1}{p(x)} $$

… to measure the expected number of bits any algorithm needs to encode a sample $x \sim p $ into a bit-string. Intuitively, $log \frac{1}{p(x)}$ is the length of code used to represent an element $x$ that occurs with probability $p(x)$.

Shannon also described cross entropy:

$$ H(p, q) = \sum_{x} p(x) log \frac{1}{q(x)} $$

… where the compression scheme is given by the model $q$.

Crucially, $H(p, q) \ge H(p)$, and so we can get better estimates of $H(p)$ by constructing better models $q$, as measured by $H(p, q)$. $H(p)$ is generally inaccessible if $p$ is English.

In an n-gram model the prediction of a token $x_i$ only depends on the last $n-1$ characters rather than the full history:

$$ p(x_i | x_{1:i-1}) = p(x_i | x_{i-(n-1):i-1}) $$

While fitting n-gram models to data is extremely computationally cheap and scalable:

If $n$ is too small, then the model can’t capture long-range dependencies.
If $n$ is too big, then it’s statistically infeasible to get good estimates as almost all reasonable long sequences show up 0 times even in large corpora.

As a result, LMs were limited to tasks like speech recognition and machine translation where only capturing local dependencies wasn’t a huge problem.

Bengio et al., 2003 pioneered LMs where $p(x_i | x_{i-(n-1): i-1})$ is given by a neural network, making it statistically feasible to estimate neural LMs for much larger values of $n$. However, the main challenge was that training neural networks was much more computationally expensive.

Recurrent Neural Networks (RNNs), including Long Short Term Memory (LSTMs), effectively made $n = \infty $, but these were hard to train.

Transformers returned to having fixed content length $n$, but were much easier to train and exploited the parallelism of GPUs. Also, $n$ could be made “large enough” for many applications (GPT-3 used $n = 2048$).

References

Introduction | CS324. Percy Liang; Tatsunori Hashimoto; Christopher Ré. stanford-cs324.github.io . 2022. Accessed Dec 14, 2023.