What is a language model — LLMs explained for your aunt
Start with your phone
You have a phone with autocomplete. You type "Hi, how" and it suggests "are you." You're not amazed. It's a basic tool.
A language model is that. But supercharged a thousand times over.
How it actually works
During training, the model reads. A lot. Novels, articles, code, conversations, a huge slice of what's publicly available on the internet. GPT-3, one of the first to get wide public attention, was trained on roughly 300 billion tokens (Brown et al., 2020). That's more text than you could read in many lifetimes.
As it reads, it learns patterns. After "good" usually comes "morning." After "the rain" usually comes "falls."
When you write a prompt to Claude, the model guesses which word comes next. Then the next. And the next. Word by word, until it decides to stop.
It doesn't understand. It predicts. But its predictions are good enough to look like understanding.
Why it can write, translate, code
Because all those tasks are, at bottom, word prediction.
Writing a story is predicting which word comes next given a plot. Translating is predicting which sequence in another language matches this pattern. Coding is predicting which line of code is likely after this structure.
Claude doesn't know how to code the way a human does. It knows which code patterns are likely. And because millions of programmers write code in consistent ways, those patterns are very good.
The surprise: emergent capabilities
Nobody programmed Claude to write a poem about death. Nobody taught it math with a formal class.
But it can do both.
Why? Because the model learned patterns rich enough that, at some point, it starts doing things nobody anticipated. As if the patterns combined in ways that produced new abilities. We call these emergent capabilities.
The limitations
An LLM never lived a single day. It read more books than any person ever could, but it never ate, never cried, never was betrayed. That matters when you need deep emotional context or real sensory experience. It can imitate. It can't comprehend.
It also hallucinates. Sometimes it invents data. Not out of malice. Because it predicts what word is likely and sometimes that word is a false fact. Without real experience, it doesn't catch it.
And it has a knowledge cutoff: it was trained up to a certain date and doesn't know what happened after. Not from incapacity. Because that data wasn't there when training ran.
Three ideas to take with you
First, an LLM predicts. Period. It doesn't think, it doesn't understand. But it predicts well enough that this is sufficient to change your work.
Second, its strengths are creating, synthesizing and exploring variations. Its weaknesses are critical facts without verification and high-stakes decisions where genuine comprehension is required.
Third, hallucinations and the knowledge cutoff aren't bugs. They're structural consequences of how the model works. Knowing this saves you surprises.
What is a language model — LLMs explained for your aunt
There's a giant confusion
There's a giant confusion about what a language model actually does. "Predicts the next word" sounds too simple. "Understands the context" sounds too impressive.
The reality is more interesting than either. And understanding it changes how you use one.
The autocomplete from ten years ago
Remember the predictive text on keyboards a decade ago. You typed three words and it suggested "tomorrow," "later," "today." It worked because the machines learned which words tend to follow others. "Good" → "morning" is likely. Not because the machine understood. Because statistically, that's what happens.
An LLM (Large Language Model) works on the same basic principle. Just at brutal scale and sophistication.
How it's trained
The model reads text. It doesn't read the way you do, pausing to reflect or to understand. It just takes the text in. As it does, it learns probabilities.
When "United Kingdom" appears, the likely next words are "is," "has," "remains." When "The art of" appears, the likely word is "war." When "import numpy as" appears, the next is "np" with very high probability.
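To make "learning probabilities" concrete, here is a minimal sketch that counts which word follows which in a tiny made-up corpus. It is nowhere near how a real LLM is trained (those use neural networks and look at the whole context, not just the previous word), and the corpus and function name are invented for illustration. But the core idea is the same: probabilities come from what was seen in the text.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus; a real model sees hundreds of billions of tokens.
corpus = (
    "good morning everyone . good morning team . "
    "good night everyone . import numpy as np . import numpy as np ."
).split()

# Count how often each word follows each other word.
follow_counts = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    follow_counts[current_word][next_word] += 1

def next_word_probabilities(word):
    """Turn raw follow-up counts into probabilities for one word."""
    counts = follow_counts[word]
    total = sum(counts.values())
    return {candidate: n / total for candidate, n in counts.items()}

print(next_word_probabilities("good"))  # "morning" gets 2/3, "night" gets 1/3
print(next_word_probabilities("as"))    # "np" gets 1.0
```

In this toy corpus "np" always follows "as," so its probability is 1.0, which mirrors the "import numpy as" example above.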
GPT-3 was trained on roughly 300 billion tokens, per the original paper (Brown et al., 2020). Later models used even more. No public number exists for the latest Claude or GPT-4o, but they're widely assumed to be at that scale or well above it.
How it predicts when you write to it
When you send a message to Claude, here's what happens. It reads your prompt. It calculates probabilities for every possible next word. It picks the most probable one, or samples with some variation. It writes that word. It repeats.
Word. After. Word.
What you read as a fluid response is, technically, hundreds or thousands of prediction steps, each one taking a fraction of a second.
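The loop described above (compute probabilities, pick or sample, write, repeat) can be sketched in a few lines. The probability table here is hand-written and hypothetical, standing in for a trained model, and it only looks at the previous word, whereas a real model conditions on the entire conversation so far.

```python
import random

# Hypothetical hand-written probabilities standing in for a trained model.
NEXT_WORD_PROBS = {
    "<start>": {"good": 1.0},
    "good": {"morning": 0.7, "night": 0.3},
    "morning": {"everyone": 0.6, "<end>": 0.4},
    "night": {"everyone": 0.5, "<end>": 0.5},
    "everyone": {"<end>": 1.0},
}

def generate(greedy=True, seed=0, max_words=10):
    """Produce one word at a time until the table emits <end>."""
    rng = random.Random(seed)
    word, output = "<start>", []
    for _ in range(max_words):
        probs = NEXT_WORD_PROBS[word]
        if greedy:
            # Always take the single most probable next word.
            word = max(probs, key=probs.get)
        else:
            # Sample in proportion to the probabilities.
            word = rng.choices(list(probs), weights=list(probs.values()))[0]
        if word == "<end>":
            break
        output.append(word)
    return " ".join(output)

print(generate(greedy=True))   # good morning everyone
print(generate(greedy=False))  # depends on the seed
```

Greedy mode always gives the same sentence. Sampling mode is where the variation between two answers to the same prompt comes from.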
Why it can do such different tasks
Because almost any office task is, at bottom, sequence prediction.
Writing an email is predicting which word comes next in a logical narrative. Summarizing is predicting which words are essential in a text. Translating is predicting the sequence in language B that matches the one in language A. Coding is predicting which line follows in this program.
The model doesn't know it's summarizing. It has no concept of "summary." What it does is: in this context where the instruction says "summarize," what words are likely. It learned that from millions of examples.
Nobody ever programmed those capabilities. They emerged from pattern matching.
Emergent capabilities
Here things get strange.
Nobody programmed Claude to write poetry. Nobody trained it specifically on meter or rhyme. But it can do it.
How? Because it read so much poetry during training that the patterns of words in a poem are consistent. Not because it knows poetry. Because it learned the statistical correlations of words in poetic contexts.
Wei and other researchers documented in 2022 that as models grow in scale, capabilities suddenly appear that smaller versions didn't have. They called them emergent capabilities.
One important caveat. Schaeffer and others published a critique in 2023 showing that some of these "emergences" were artifacts of how performance was measured, not genuine jumps. The debate is still open. But it's clear that going from a one-billion-parameter model to a hundred-billion-parameter model unlocks things that couldn't be done before.
The most important distinction: prediction vs comprehension
This is what most people underestimate.
Claude doesn't understand a poem. It predicts which word is likely in a poetic context. Does the difference matter?
A poet understands what emotion they're conveying. Claude shows patterns of words that correlate with that emotion based on what it learned.
If you ask a poet "why did you write about death this way?" they answer "because I felt this." If you ask Claude the same thing, the most honest answer it could give is "because that sequence of words has high probability in this context."
It matters when you need real comprehension. If you ask it to grasp the ethical nuances of a complex decision in a scenario it never saw in training, problem. It can sound credible while predicting badly. If you ask it to write something that sounds like you, it almost never fails. Your voice is a pattern. And patterns are its game.
The limitations that explain a lot of surprises
Hallucinations. Sometimes Claude invents data with the tone of total certainty. It predicts the likely word. If the pattern says "after 'Einstein discovered' comes a theory," it can generate a theory that doesn't exist but sounds plausible. Without external verification, it doesn't catch it.
Knowledge cutoff. Every model has a training cutoff date. Anything that happened after isn't in its parameters. It's not that it can't know. It's that the data wasn't there when it was trained. The date varies by model (Claude Sonnet 4.6, for example, has a different cutoff than GPT-4o).
Biases. The patterns it learned reflect the biases of the internet. If training texts said "programmer" more often in the masculine, the model learned that correlation. It's in the weights. Companies do post-training to mitigate, but the underlying bias persists.
Limited memory. Once trained, the model's weights are frozen. It doesn't learn from your conversation in real time. The "memory" features some platforms offer are external tricks: they save information about you in a separate database and inject it into the context. The model itself doesn't remember.
What this means in practice
Claude is excellent for creating, synthesizing, brainstorming, generating variations, summarizing, explaining, translating and exploring. It's less reliable for critical facts without verification, high-stakes decisions that require genuine comprehension, and situations that don't resemble anything it saw in training.
The models out there
Claude (Anthropic), training focused on safety, uncertainty calibration and usefulness. The current line is Opus, Sonnet and Haiku.
GPT-4o and the GPT family (OpenAI), natively multimodal, aggressive at trying to solve, widely used for code.
Gemini (Google), multimodal with emphasis on integration with the Google ecosystem.
Llama (Meta), open weights (the trained model is downloadable, under Meta's own license), ideal for running locally or adapting to a specific domain.
Mixtral (Mistral) and Phi (Microsoft), also open weights, efficient and useful for local deployments.
They all share the same base transformer architecture (Vaswani et al., 2017). The differences are in scale, training data and post-training tuning.
A question to close on
If every Claude response is word-by-word prediction, what could you change in your next prompt to improve those predictions? Probably more context, examples of how you want it to sound and explicit format constraints.
In "How an AI 'thinks,'" another piece in this series, I go deeper into the mechanics. In the meantime, next time something surprises you (positively or negatively) in a Claude response, ask yourself which statistical pattern would explain it. The answer is almost always there.
What is a language model — LLMs explained for your aunt
A precise definition to start
A large language model (LLM) is a neural system that learned to estimate a probability distribution over the next token given an input sequence, by training on large text corpora.
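Written out, the definition above looks like this. The symbols are my notation, not taken from any specific paper: x_t is the token at position t, T is the sequence length, and θ stands for the model's parameters. The first line says the probability of a whole text is the product of next-token probabilities. The second is the usual training objective, which penalizes the model when it assigns low probability to the token that actually came next.

```latex
P_\theta(x_1, \dots, x_T) = \prod_{t=1}^{T} P_\theta(x_t \mid x_1, \dots, x_{t-1})

\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log P_\theta(x_t \mid x_1, \dots, x_{t-1})
```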
It's not an accessible definition. But it's precise. And precision matters when you're going to base real decisions on what the system produces.
The underlying architecture: transformers
Every modern LLM (Claude, GPT, Gemini, Llama, Mixtral) is built on the transformer architecture, introduced in Vaswani et al., "Attention Is All You Need" (2017).
The central idea is the attention mechanism. For each token, the model assigns weights to all other tokens in the context, deciding mathematically which ones are relevant for predicting the next.
This is radically different from earlier architectures. RNNs processed sequentially, token by token, which made training slow and caused vanishing gradients on long sequences (LSTMs eased that problem without eliminating it). Feedforward networks processed in parallel but, limited to a fixed input window, lost long-range relations. Transformers process the whole context in parallel (massively scalable) while preserving relations at any distance through multi-head attention.
The practical result is that the model can learn, for example, that the word "cat" in the first sentence is relevant for predicting a pronoun in the fifth sentence, even when separated by 200 tokens. That's what enables long-context handling.
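A minimal sketch of the attention step described above, in plain Python. The vectors are made up. In a real transformer the queries, keys and values come from learned projections of token embeddings, and there are many heads and layers stacked. The causal mask (each token only looks at itself and earlier tokens) is how GPT-style decoders apply attention, and it is an assumption I'm adding here.

```python
import math

def softmax(scores):
    """Convert raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention with a causal mask."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        visible_keys = keys[: i + 1]      # token i sees positions 0..i only
        visible_values = values[: i + 1]
        # Relevance score of each visible token for the current one.
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in visible_keys
        ]
        weights = softmax(scores)
        # Output is the weighted average of the visible value vectors.
        mixed = [
            sum(w * v[d] for w, v in zip(weights, visible_values))
            for d in range(len(values[0]))
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical 2-dimensional vectors for three tokens.
queries = keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(queries, keys, values)
print(weights[2])  # third token weights tokens 1 and 3 above token 2
```

The third token's query matches the first token's key, so it puts more weight there than on the second token. That is the "cat" and pronoun situation in miniature: distance doesn't matter, similarity does.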
Parameters and scaling laws
When you read "Claude has X billion parameters," those parameters are the numerical weights in the neural network. Every connection between units has a number that gets adjusted during training: backpropagation computes how much each weight contributed to the prediction error, and gradient descent nudges it to reduce that error.
Some verifiable numbers. GPT-3 has 175 billion parameters (Brown et al., 2020, NeurIPS). GPT-4 is estimated at roughly 1.8 trillion parameters in a mixture-of-experts architecture, per specialized press reports and leaks, although OpenAI never officially confirmed it. Anthropic doesn't publish the exact size of Claude. When someone gives you a parameter count for Claude, ask for the source. It almost never exists.
The scaling laws (Kaplan et al., 2020 at OpenAI; Hoffmann et al., 2022 at DeepMind, known as Chinchilla) show that capacity improves predictably with more parameters, more data and more training compute (measured in FLOPs).
But parameters without enough data are useless. A trillion-parameter model trained on too little text will perform worse than a 175-billion model trained well. Chinchilla specifically suggests that for a fixed compute budget you should balance model size and dataset size, not maximize one alone.
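A back-of-the-envelope version of that balance. It rests on two widely quoted rules of thumb that are assumptions here, not exact laws: training compute is roughly 6 × parameters × tokens, and the Chinchilla results are often summarized as about 20 training tokens per parameter. The GPT-3 figures are the ones cited above.

```python
import math

# Assumptions (rules of thumb, not exact laws):
#   training compute C is roughly 6 * N * D FLOPs (N parameters, D tokens)
#   a compute-balanced run uses roughly D = 20 * N tokens
FLOPS_PER_PARAM_TOKEN = 6
TOKENS_PER_PARAM = 20

def training_flops(params, tokens):
    """Approximate total training compute."""
    return FLOPS_PER_PARAM_TOKEN * params * tokens

def balanced_split(flops):
    """Split a fixed budget with D = 20 * N, so C = 6 * N * (20 * N)."""
    params = math.sqrt(flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    return params, TOKENS_PER_PARAM * params

gpt3_params, gpt3_tokens = 175e9, 300e9  # figures from Brown et al., 2020
budget = training_flops(gpt3_params, gpt3_tokens)
opt_params, opt_tokens = balanced_split(budget)

print(f"GPT-3 tokens per parameter: {gpt3_tokens / gpt3_params:.1f}")
print(f"Approximate training compute: {budget:.2e} FLOPs")
print(f"Balanced split: {opt_params / 1e9:.0f}B parameters, "
      f"{opt_tokens / 1e12:.2f}T tokens")
```

Under these assumptions GPT-3 saw fewer than 2 tokens per parameter, and the same budget would have been better spent on a model of roughly 50 billion parameters trained on about a trillion tokens. Treat the numbers as orders of magnitude, not as a recipe.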
Emergent capabilities and the counter-critique
Emergent capabilities are abilities that are absent in small models and appear abruptly with scale. The term was popularized for LLMs by Wei et al. in 2022.
Examples from the original paper. In-context learning, where the model learns from examples inside the prompt without updating parameters. Chain-of-thought reasoning, where small models can't solve multi-step problems but large models start to when prompted to "think step by step." Arithmetic and symbolic reasoning that appear at particular scales.
But the counter-critique deserves mention. Schaeffer, Miranda and Koyejo published "Are Emergent Abilities of Large Language Models a Mirage?" at NeurIPS 2023. They argued that many of those "emergences" were artifacts of the non-linear metrics used for evaluation. Switch to continuous metrics and the improvement looks gradual, not abrupt.
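The measurement argument is easy to reproduce with made-up numbers. Suppose per-token accuracy improves smoothly across five model sizes, but the benchmark only gives credit when all ten tokens of an answer are right. The accuracies and answer length below are hypothetical, and the independence of errors is a simplifying assumption.

```python
# Hypothetical per-token accuracies for five model sizes, improving smoothly.
per_token_accuracy = [0.50, 0.70, 0.85, 0.95, 0.99]
ANSWER_LENGTH = 10  # tokens that must ALL be right for an exact match

def exact_match_rate(token_accuracy, length=ANSWER_LENGTH):
    """Chance of getting every token right, assuming independent errors."""
    return token_accuracy ** length

for acc in per_token_accuracy:
    print(f"per-token {acc:.2f} -> exact match {exact_match_rate(acc):.3f}")
```

The per-token column climbs steadily. The exact-match column sits near zero for the first two sizes and then shoots up, which on a chart looks like an ability appearing from nowhere. Same models, different ruler.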
What's the honest read today? There are real improvements with scale, but the "abilities-from-nowhere" narrative likely overstated the phenomenon. The improvements exist and are measurable, but they're less mysterious than the 2022 paper suggested.
Prediction versus comprehension: the epistemological limit
Here's the distinction most people underestimate.
A very good prediction of "what's the next likely word" looks like comprehension. But epistemologically, it isn't.
Concrete case. You give Claude a Nature article on quantum physics (during training it read thousands of similar papers). You ask "what does superposition mean?" Claude gives a coherent, technically accurate response with correct terminology.
Does it understand superposition? No. It predicts which words correlate with "explanation of superposition" in similar contexts seen during training.
How can you tell? Because if you give it a concept it never saw with clear correlations (a new paradox, a genuinely novel counterfactual scenario), it generates a plausible response that's sometimes demonstrably false.
A physicist understands. They can visualize the phenomenon, see implications in domains they weren't trained on, identify novel experimental predictions. Claude predicts. It generates words statistically associated with quantum physics contexts.
Does the difference matter? It matters a lot when the system faces truly novel cases. Prediction fails in characteristic ways there. When the task is in-distribution from training, prediction shines.
Hallucinations: not bugs, structural consequence
When Claude invents a fact or cites a paper that doesn't exist, it's not a glitch.
It's that the model, given the input sequence, predicts that a certain output sequence is likely under the learned distribution. That output sounds credible because patterns of the form "X is a scientific theory" or "Y is a paper by Z" have very high probability in training. But X may be false and Y may not exist.
Without access to a verifiable database in real time, without sensorimotor experience, without an internal notion of "truth" separate from "probability," the model predicts likely words that may be false.
This is structural. It doesn't get "fixed" with cleverer prompting. It gets mitigated. Asking for citations with specific locators makes claims easier to check, although the model can fabricate citations too, so they still need checking. Using RAG (Retrieval-Augmented Generation), where you provide externally verified documents, typically reduces hallucinations substantially. Asking for quantified uncertainty helps. External verification, with a second model call or, better, an authoritative source, is usually the most effective approach.
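The RAG idea can be sketched without any model at all: find the most relevant document, then put it in the prompt with an instruction to stick to it. Everything here is hypothetical (the three-line knowledge base, the function names, the prompt wording), and the word-overlap ranking is a deliberately naive stand-in for the embedding search that real systems typically use.

```python
# Hypothetical mini knowledge base; in practice this is a document store.
DOCUMENTS = [
    "The transformer architecture was introduced by Vaswani et al. in 2017.",
    "GPT-3 has 175 billion parameters according to Brown et al., 2020.",
    "Chinchilla was published by Hoffmann et al. at DeepMind in 2022.",
]

def words(text):
    """Lowercased words with surrounding punctuation stripped."""
    return {w.strip(".,?!").lower() for w in text.split()}

def retrieve(question, documents, top_k=1):
    """Naive retrieval: rank documents by words shared with the question."""
    question_words = words(question)
    ranked = sorted(
        documents,
        key=lambda doc: len(words(doc) & question_words),
        reverse=True,
    )
    return ranked[:top_k]

def build_prompt(question, documents):
    """Inject retrieved text so the answer can come from it, not from memory."""
    found = retrieve(question, documents)
    sources = "\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(found))
    return (
        "Answer using only the sources below. "
        "If they do not contain the answer, say so.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )

print(build_prompt("How many parameters does GPT-3 have?", DOCUMENTS))
```

The prompt this builds would then be sent to the model. Prediction is still prediction, but the likely next words are now anchored to a text you chose and can verify.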
But the underlying phenomenon stays. Prediction without access to verification is always vulnerable to systematic error.
Knowledge cutoff and the illusion of currency
Every model has a training cutoff date. For Claude Sonnet 4.6 it's documented on Anthropic's official page. For GPT-4o, in OpenAI's documentation. Always verify the specific number from the provider.
Users tend to assume that if the model doesn't know something, it'll say so explicitly. But sometimes, if the question falls right near the cutoff, the model may generate a response that looks current but is pattern extrapolation, not verified fact.
That's why chatbots with web search enabled (like Claude connected to Brave Search, ChatGPT with Bing, Gemini with Google Search) give substantially more reliable results for recent information. Search injects post-cutoff information into the context. Prediction is still prediction, but now with fresh data available.
Multimodality: same idea, richer input
Claude, GPT-4o and Gemini are multimodal. They can process text, images, and in recent versions audio and video.
How does it work technically with images? The image goes through a visual encoder (typically a Vision Transformer, ViT) that converts it into a sequence of visual tokens. Those tokens enter the LLM as additional context. The LLM predicts text that's likely given the combined input.
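The first step of that pipeline, cutting an image into patches that become "visual tokens," can be sketched like this. The 4x4 image and the function name are made up for illustration. In an actual Vision Transformer each flattened patch is then passed through a learned projection to become an embedding, which this sketch leaves out.

```python
def image_to_patch_tokens(image, patch_size):
    """Cut a 2D grid of pixel values into square patches and flatten each."""
    height, width = len(image), len(image[0])
    tokens = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            tokens.append(patch)
    return tokens

# Hypothetical 4x4 grayscale image; real inputs are far larger, with color.
image = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]
visual_tokens = image_to_patch_tokens(image, patch_size=2)
print(len(visual_tokens))  # 4 patches
print(visual_tokens[0])    # [0, 1, 4, 5]
```

From the LLM's point of view, those four patches are just four more items in the sequence, sitting next to the text tokens.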
Same underlying mechanism. Sequence prediction. The input is richer, but the core stays "what comes next based on the learned correlations."
Implications for prompt architecture
If you internalized everything above, five things change immediately in how you write prompts.
More context produces better prediction. That's why frameworks like CAFÉ (Context, Action, Format, Style; the acronym follows the Spanish Contexto, Acción, Formato, Estilo) work. It's not magic. It's giving the model more signal about the distribution from which to predict.
Examples in the prompt activate intense pattern matching. Few-shot learning is giving three to five examples of the style you want. The model predicts responses that statistically resemble those examples.
Asking it to think step by step (chain-of-thought) forces the model to generate intermediate tokens that explore the prediction space more carefully. For reasoning problems, this usually raises accuracy significantly.
Temperature controls how far sampling deviates from the most probable word. Temperature zero always picks the maximum-probability word (consistent, close to repeatable, sometimes boring). Temperature one samples from the model's probabilities as they are (more creative, less predictable). Temperature two flattens the distribution so much that it usually produces chaos.
Instruction order matters. Models tend to use information at the start and end of the context better than information in the middle (the "lost in the middle" effect, Liu et al., 2023). Put your critical instructions at the start or the end.
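Of the five points, temperature is the one that is literally a formula, so here is a sketch of it. The raw scores (logits) and the three candidate words are invented. Treating temperature zero as "put everything on the top word" is a common convention that I'm assuming, since dividing by zero isn't defined.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Rescale raw scores by 1/temperature, then normalize to probabilities."""
    if temperature <= 0:
        # Temperature zero is treated as greedy: all mass on the top score.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [score / temperature for score in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for three candidate next words.
candidates = ["morning", "night", "grief"]
logits = [2.0, 1.0, 0.0]

for temperature in (0, 0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, dict(zip(candidates, (round(p, 2) for p in probs))))
```

As temperature rises, probability drains from the top word toward the unlikely ones. At two, the least likely candidate gets picked often enough that sentences start to fall apart.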
Post-training tuning: RLHF and Constitutional AI
After pretraining (predicting next word via cross-entropy loss), models go through additional tuning.
RLHF (Reinforcement Learning from Human Feedback), applied to instruction following by OpenAI in Ouyang et al. (2022). Humans compare pairs of responses. A reward model is trained to predict those preferences. The model then learns via PPO (Proximal Policy Optimization) to maximize that reward while staying close to the base model so it doesn't degrade.
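In formulas, the standard simplified presentation of RLHF has two pieces. This is the textbook form, not the exact notation of the paper, and the symbols are assumptions: r_φ is a reward model trained on human comparisons, y_w and y_l are the preferred and rejected responses to prompt x, π_θ is the model being tuned, π_ref is the starting model, and β sets how strongly the tuned model is held close to it. The first line trains the reward model to score the preferred response higher. The second is what PPO then optimizes.

```latex
\mathcal{L}_{\mathrm{RM}}(\phi) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\left[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\right]

\max_{\theta}\;\; \mathbb{E}_{x,\; y \sim \pi_\theta(\cdot \mid x)}\big[r_\phi(x, y)\big] \;-\; \beta\, \mathrm{KL}\big(\pi_\theta(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big)
```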
Constitutional AI, from Anthropic (Bai et al., 2022). The model is trained against a set of written "constitutional principles." It critiques and revises its own responses against those principles, and AI-generated preference judgments replace much of the human feedback used in standard RLHF.
This doesn't change the transformer architecture. It does update the weights, and in doing so recalibrates the distribution from which the model samples. Toward more useful, safer, less biased responses.
That's why Claude feels different from an unaligned base model. It was explicitly tuned to recognize limits, calibrate uncertainty and refuse tasks that violate principles.
Short, medium and long-term trajectory
In the short term, multiplication of specialized models. LLMs tuned to specific verticals (medicine, law, code, support). The market differentiates by domain.
In the medium term, agentic systems. LLMs as orchestrators predicting not just words but also actions (calling APIs, using tools, executing code, integrating with external databases). RAG becomes default. Anthropic's Model Context Protocol (MCP) is one candidate standard for that layer.
In the long term, the open question is epistemological. Do enhanced LLMs (with persistent memory, full multimodality, inference-time reasoning, continuous feedback) converge toward something close to AGI? Or is the ceiling very sophisticated prediction without genuine comprehension in the strong philosophical sense (qualia, intentionality, consciousness)?
My take. It's extremely powerful sequence prediction. To date, we haven't seen evidence of genuine comprehension in the strong philosophical sense. But the question is open and intellectual honesty demands not closing it.
The blog's thesis
When you use Claude you're using a system that at its core predicts the next word from a probability distribution. That this prediction is sophisticated to the point of looking like reasoning, creativity and emotional comprehension is behaviorally real. The underlying execution, however, is mathematically sequence prediction.
This doesn't make it less useful. It makes it different from human intelligence. And understanding that difference, that the system works very well without understanding what it's doing, is the start of using it well.
The people who get the most out of LLMs are the ones who internalized this frame. They know where prediction shines. They know where it fails in characteristic ways. They verify where truth matters. They don't expect the model to "become conscious" in the next release, because they know that's not what's changing.
What's changing is the quality of the prediction. And that, word by word, is enough to transform entire industries.