Analysis · History & Fundamentals · Edition #0007

What is a language model — LLMs explained for your aunt

An LLM is autocomplete on steroids. It learns text patterns. That's why it can write, summarize, translate and code — and why it sometimes makes things up.

Germán Falcioni April 12, 2026
✦ Reading: 8 min
[Figure: next-word prediction visualized as a cascade of probabilities]
TL;DR

An LLM predicts the next word from patterns it learned. It doesn't truly understand, but the patterns are rich enough that it looks like it does. Claude, GPT and Gemini are LLMs. Knowing how they work prevents surprises.

✦ Summarized with Claude at publish time

There's a giant confusion

There's a giant confusion about what a language model actually does. "Predicts the next word" sounds too simple. "Understands the context" sounds too impressive.

The reality is more interesting than either. And understanding it changes how you use one.

The autocomplete from ten years ago

Remember the predictive text on keyboards a decade ago. You typed three words and it suggested "tomorrow," "later," "today." It worked because the machines learned which words tend to follow others. "Good" → "morning" is likely. Not because the machine understood. Because statistically, that's what happens.

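An LLM (Large Language Model) works on the same principle, at vastly greater scale and sophistication. Instead of looking only at the last few words, it weighs the entire context you give it.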
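
The keyboard trick can be sketched in a few lines of Python. This is a toy illustration, not how any real keyboard or LLM is implemented: the function names `train_bigrams` and `suggest` and the tiny corpus are made up for the example. All it does is count which word follows which.

```python
from collections import Counter, defaultdict

def train_bigrams(text):
    """Count, for each word, which words follow it in the text."""
    words = text.lower().split()
    followers = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        followers[current][nxt] += 1
    return followers

def suggest(followers, word, k=3):
    """Return up to k of the most frequent followers of `word`."""
    counts = followers.get(word.lower(), Counter())
    return [w for w, _ in counts.most_common(k)]

# Made-up corpus, just large enough to show the idea.
corpus = "good morning team . good morning everyone . good night all . see you tomorrow"
model = train_bigrams(corpus)
print(suggest(model, "good"))  # ['morning', 'night']
```

In this corpus "good" is followed by "morning" twice and by "night" once, so "morning" is suggested first. No understanding is involved, only counting.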

How it's trained

The model reads text, an enormous amount of it. It doesn't read the way you do, and it doesn't understand the way you do. It just takes text in. As it does, it learns probabilities.

When "United Kingdom" appears, the likely next words are "is," "has," "remains." When "The art of" appears, the likely word is "war." When "import numpy as" appears, the next is "np" with very high probability.
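
For readers who want the formal version, this is the standard textbook way to write the training goal for next-token prediction. The notation is an assumption of this sketch, not taken from any specific model's documentation: the model has adjustable parameters θ, and training nudges them so that the tokens that actually appeared in the text get higher probability.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_1, \dots, x_{t-1}\right)
```

Here x_t is the token at position t, and p_θ is the probability the model assigns to it given everything that came before. Using natural logarithms: if the model gives "np" a probability of 0.9 after "import numpy as", that step adds about 0.105 to the loss. If it gave "np" only 0.01, the step would add about 4.6. Training pushes the model toward the first case.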

GPT-3 was trained on roughly 300 billion tokens (a token is a word or a piece of a word), per the original paper (Brown et al., 2020). Later models used even more. No public number exists for the latest Claude or GPT-4o, but they are generally assumed to be in the same order of magnitude or higher.

How it predicts when you write to it

When you send a message to Claude, here's what happens. It reads your prompt. It calculates a probability for every possible next word (strictly, every token in its vocabulary). It picks the most probable one, or samples with some variation. It writes that word. It repeats.

Word. After. Word.

What you read as a fluid response is, technically, hundreds or thousands of prediction decisions, each made in a fraction of a second.
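
The loop described above (score the candidates, pick or sample one, append it, repeat) can be sketched as follows. Everything here is a stand-in: `TOY_LOGITS` is an invented lookup table playing the role of the model, and it only looks at the last word, whereas a real LLM conditions on the whole context. The `temperature` knob is the usual way of controlling "some variation".

```python
import math
import random

# Hypothetical toy "model": raw scores (logits) for the next word, given the last word.
TOY_LOGITS = {
    "<start>": {"good": 2.0, "see": 1.0},
    "good": {"morning": 2.5, "night": 1.0},
    "see": {"you": 3.0},
    "you": {"tomorrow": 2.0, "later": 1.5},
    "morning": {"<end>": 3.0, "team": 0.5},
    "night": {"<end>": 3.0},
    "team": {"<end>": 3.0},
    "tomorrow": {"<end>": 3.0},
    "later": {"<end>": 3.0},
}

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities that sum to 1."""
    scaled = {w: s / temperature for w, s in logits.items()}
    top = max(scaled.values())  # subtract the max for numerical stability
    exps = {w: math.exp(s - top) for w, s in scaled.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}

def generate(max_words=10, temperature=1.0, greedy=False, seed=0):
    """Produce text one word at a time until <end> or max_words."""
    rng = random.Random(seed)
    word, output = "<start>", []
    for _ in range(max_words):
        probs = softmax(TOY_LOGITS[word], temperature)
        if greedy:
            word = max(probs, key=probs.get)  # always take the most probable word
        else:
            word = rng.choices(list(probs), weights=list(probs.values()))[0]
        if word == "<end>":
            break
        output.append(word)
    return " ".join(output)

print(generate(greedy=True))  # good morning
print(generate(temperature=1.5, seed=3))  # a sampled variation
```

Greedy picking always returns the same sentence. Sampling with a higher temperature flattens the probabilities and makes less likely words come up more often, which is why the same prompt can give different answers.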

Why it can do such different tasks

Because almost any office task can be framed, at bottom, as sequence prediction.

Writing an email is predicting which word comes next in a logical narrative. Summarizing is predicting which words are essential in a text. Translating is predicting the sequence in language B that matches the one in language A. Coding is predicting which line follows in this program.

The model doesn't know it's summarizing. It has no concept of "summary." What it does is estimate which words are likely in a context where the instruction says "summarize." It learned that from millions of examples.

Nobody ever programmed those capabilities. They emerged from pattern matching.

Emergent capabilities

Here things get strange.

Nobody programmed Claude to write poetry. Nobody trained it specifically on meter or rhyme. But it can do it.

How? Because it read so much poetry during training that the patterns of words in a poem are consistent. Not because it knows poetry. Because it learned the statistical correlations of words in poetic contexts.

Wei et al. documented in 2022 that as models grow in scale, capabilities suddenly appear that smaller versions didn't have. They called them emergent abilities.

One important caveat. Schaeffer et al. published a critique in 2023 showing that some of these "emergences" were artifacts of how performance was measured, not genuine jumps. The debate is still open. But it's clear that going from a one-billion-parameter model to a hundred-billion-parameter model unlocks things that couldn't be done before.
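
A simplified way to see how a measurement can create an apparent jump: suppose a benchmark only gives credit when every token of a 10-token answer is right. The accuracies below are invented for illustration, and the sketch assumes errors on different tokens are independent, which real models do not guarantee.

```python
def exact_match_rate(per_token_accuracy, answer_length):
    """Chance that every token in the answer is right, assuming independent errors."""
    return per_token_accuracy ** answer_length

# Illustrative, made-up per-token accuracies for five hypothetical model sizes.
for p in [0.50, 0.70, 0.85, 0.95, 0.99]:
    score = exact_match_rate(p, 10)
    print(f"per-token accuracy {p:.2f} -> exact match on 10 tokens: {score:.3f}")
```

Per-token accuracy climbs steadily, but the all-or-nothing score stays near zero for the smaller models and then shoots up. On a chart, that can look like a capability appearing from nowhere even though the underlying improvement was gradual.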

The most important distinction: prediction vs. comprehension

This is what most people underestimate.

Claude doesn't understand a poem. It predicts which word is likely in a poetic context. Does the difference matter?

A poet understands what emotion they're conveying. Claude shows patterns of words that correlate with that emotion based on what it learned.

If you ask a poet "why did you write about death this way?" they answer "because I felt this." If you ask Claude the same thing, the most honest answer it could give is "because that sequence of words has high probability in this context."

It matters when you need real comprehension. If you ask it to grasp the ethical nuances of a complex decision in a scenario unlike anything it saw in training, you have a problem: it can sound credible while predicting badly. If you ask it to write something that sounds like you, it rarely fails. Your voice is a pattern. And patterns are its game.

The limitations that explain a lot of surprises

Hallucinations. Sometimes Claude invents data with the tone of total certainty. That happens because it predicts the likely word, not the true one. If the pattern says "after 'Einstein discovered' comes a theory," it can generate a theory that doesn't exist but sounds plausible. Without external verification, nothing in that process catches the error.

Knowledge cutoff. Every model has a training cutoff date. Anything that happened after isn't in its parameters. It's not that it can't know. It's that the data wasn't there when it was trained. The date varies by model (Claude Sonnet 4.6, for example, has a different cutoff than GPT-4o).

Biases. The patterns it learned reflect the biases of the internet. If training texts said "programmer" more often in the masculine, the model learned that correlation. It's in the weights. Companies do post-training to mitigate, but the underlying bias persists.

Limited memory. Once trained, the model's weights are frozen. It doesn't learn from your conversation in real time. The "memory" features some platforms offer are external tricks: they save information about you in a separate database and inject it into the context. The model itself doesn't remember.
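
The "external trick" behind memory features can be sketched like this. It is a minimal illustration under stated assumptions, not any platform's actual implementation: `memory_store` is a plain dictionary standing in for a database, and `remember` and `build_context` are hypothetical helper names.

```python
# Hypothetical external memory: a dict standing in for a separate database.
memory_store = {}

def remember(user_id, fact):
    """Save a fact about the user outside the model."""
    memory_store.setdefault(user_id, []).append(fact)

def build_context(user_id, message):
    """Prepend any saved facts to the text that is sent to the model."""
    facts = memory_store.get(user_id, [])
    parts = []
    if facts:
        parts.append("Known facts about the user:")
        parts.extend(f"- {fact}" for fact in facts)
    parts.append(f"User message: {message}")
    return "\n".join(parts)

remember("aunt", "Prefers short answers")
print(build_context("aunt", "What is an LLM?"))
```

The model's weights never change here. It simply receives a longer prompt that already contains the saved facts, so it appears to remember.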

What this means in practice

Claude is excellent for creating, synthesizing, brainstorming, generating variations, summarizing, explaining, translating and exploring. It's less reliable for critical facts without verification, high-stakes decisions that require genuine comprehension, and situations that don't resemble anything it saw in training.

The models out there

Claude (Anthropic), training focused on safety, uncertainty calibration and usefulness. The current line is Opus, Sonnet and Haiku.

GPT-4o and the GPT family (OpenAI), natively multimodal, inclined to attempt an answer rather than hold back, and widely used for code.

Gemini (Google), multimodal with emphasis on integration with the Google ecosystem.

Llama (Meta), open weights (downloadable under Meta's own license), ideal for running locally or adapting to a specific domain.

Mixtral (Mistral) and Phi (Microsoft), also openly released, efficient and useful for local deployments.

They all build on the transformer architecture (Vaswani et al., 2017). The differences are in architectural details, scale, training data and post-training tuning.

A question to close on

If every Claude response is word-by-word prediction, what could you change in your next prompt to improve those predictions? Probably more context, examples of how you want it to sound and explicit format constraints.
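
Those three levers (context, examples, format) can be turned into a small prompt template. This is one possible layout, not an official format for Claude or any other model. The function name `build_prompt` and the section labels are assumptions made for the sketch.

```python
def build_prompt(task, context="", examples=(), output_format=""):
    """Assemble a prompt that gives the model more pattern to work with."""
    sections = []
    if context:
        sections.append(f"Context:\n{context}")
    if examples:
        lines = "\n".join(f"- {example}" for example in examples)
        sections.append(f"Examples of how I want it to sound:\n{lines}")
    if output_format:
        sections.append(f"Format:\n{output_format}")
    sections.append(f"Task:\n{task}")
    return "\n\n".join(sections)

print(build_prompt(
    task="Write a reminder about Friday's team lunch.",
    context="Internal message to a team of eight people.",
    examples=["Hi all, quick one.", "Thanks, see you there!"],
    output_format="Three sentences at most, no bullet points.",
))
```

Each section narrows down which continuations are likely, which is the whole point: the more of the pattern you supply, the less the model has to guess.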

In "How an AI 'thinks,'" another piece in this series, I go deeper into the mechanics. In the meantime, next time something surprises you (positively or negatively) in a Claude response, ask yourself which statistical pattern would explain it. The answer is almost always there.
