The text box that broke the barrier
November 2022. On an ordinary Wednesday, OpenAI posts a low-key page on its site: chat.openai.com. No launch event, no keynote, no campaign. Five days later, a million people had tried it. Two months later, an estimated hundred million: at the time, the fastest consumer adoption ever measured, according to the Stanford AI Index Report 2024. Faster than TikTok. Faster than Instagram. Faster than anything before it.
Classifying is not generating
Up to that point, most of the AI you saw inside a product did one thing: take something and label it. A spam filter reads your email and decides: spam or not spam. Netflix looks at your history and tags films: you'll like this, you won't like that. An X-ray goes through a trained model and comes out labeled: anomaly, no anomaly.
These are classification machines. Useful. Often essential. But none of them creates anything.
Generative AI is the opposite.
Ask Claude to "draft an email to a client asking for an extension, professional tone" and it writes one. It isn't pulling a stored template. It composes the email on the spot, drawing on patterns from the millions of emails it read during training.
Ask Midjourney for "a mountain at dusk, cinematic lighting" and the image appears. Not pulled from Google. Generated pixel by pixel.
Ask GitHub Copilot for a Python function and it writes the code. Not copied from Stack Overflow. Built from scratch, respecting syntax conventions it absorbed from billions of lines of public code.
How it learns to do any of this
A generative model is trained on a hard-to-picture number of examples: digitized books, articles, code, labeled images. For text models, the training objective is a single task: predict the next word (strictly, the next token). Given the ones before, which comes next?
That sounds trivial. It isn't. To predict well, the model ends up learning how professional emails are structured, how a report opens, what conventions working code follows. Image models do the analogous thing for pictures and learn which elements make up a good image.
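A minimal sketch of next-word prediction, to make the idea concrete. It is not how production models are built: real systems use neural networks with billions of parameters, while this toy only counts which word followed which in a three-sentence corpus. The corpus and the function names (train_bigram, next_word_probs, generate) are invented for illustration.

```python
import random
from collections import Counter, defaultdict

# Toy corpus (an assumption for this sketch); real models train on vastly more text.
CORPUS = (
    "thank you for your email . "
    "thank you for your patience . "
    "thank you for the update ."
).split()


def train_bigram(tokens):
    """Count, for each word, which words followed it in the corpus."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def next_word_probs(counts, prev):
    """Turn the raw follow-up counts for `prev` into probabilities."""
    total = sum(counts[prev].values())
    return {word: n / total for word, n in counts[prev].items()}


def generate(counts, start, max_words=8, seed=0):
    """Compose text one word at a time by sampling from the learned counts."""
    rng = random.Random(seed)
    words = [start]
    for _ in range(max_words):
        options = counts.get(words[-1])
        if not options:
            break  # this word was never followed by anything in the corpus
        choices, weights = zip(*options.items())
        words.append(rng.choices(choices, weights=weights)[0])
    return " ".join(words)


model = train_bigram(CORPUS)
print(next_word_probs(model, "for"))  # "your" is twice as likely as "the"
print(generate(model, "thank"))
```

Nothing here looks up a stored sentence: each word is picked from the learned follow-up counts, one at a time. A large language model does the same kind of step with a far richer notion of "what came before".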
It's like a musician who has listened to ten million songs across every genre and can now compose a coherent new piece.
The trick isn't that the model understands. It's that it captured the patterns at such high resolution that it looks like it understands.
Why it took so long
Until November 2022, using generative AI mostly required programming, servers, and money. It was lab territory.
Then OpenAI opened a free text box in the browser, and suddenly your aunt could ask an AI to write her an email. That's what started the race.
The models that matter today
For text: Claude (Anthropic), ChatGPT (OpenAI), Gemini (Google). For images: DALL-E, Midjourney, Stable Diffusion. For code: GitHub Copilot, Claude Code, Cursor. For video: Sora, Runway.
Three things to take away:
- Generative isn't classification. Classifying sorts what already exists; generating makes something new. It isn't the same technology with more horsepower — it's a different category.
- The model doesn't search, it composes. There's no database of stored answers. Every response is assembled token by token, right then, by combining learned patterns.
- The pivotal year wasn't the year of invention; it was the year of access. The Transformer had existed since 2017 and GPT-3 since 2020. What arrived in November 2022 was the text box. And the text box was the hard part.
The AI you used to know: labeling machines
Before 2022, almost every AI you encountered in a shipped product — in your inbox, your bank app, your feed — was a variant of one pattern: take something that exists and label it.
A spam filter reads ten thousand real emails, learns which words and structures signal suspicion, and hangs a tag on each new arrival: spam, not spam.
A social network studies your history and scores your next posts by priority: this will hold your attention, this won't.
A radiograph runs through a model trained on a million images and comes out labeled: anomaly, no anomaly.
Useful. Often essential.
But none of it creates anything. These are classification machines. They sort the world into buckets. They don't invent new buckets.
The pivot: from labeling to composing
Generative AI flips the equation. It doesn't receive something and stick a label on it; it receives an instruction and produces a new artifact.
Ask Claude for a client email and there is no stored email that matches yours. The model composes one in that moment — token by token — drawing on patterns from the millions of real emails it read during training.
Ask Midjourney for "a nineteenth-century physician, oil painting style, golden light" and the image isn't in any archive. It's generated pixel by pixel by a diffusion model that learns to "remove noise" until a coherent image emerges.
Ask GitHub Copilot for a Python function and there's no Stack Overflow in the loop. The model writes the code, respecting the syntax conventions it absorbed from billions of lines of public code.
Three new artifacts. Three different techniques underneath. One shared principle: combine learned patterns to produce something that didn't exist before.
The architecture that made it possible
All of this rests on a 2017 paper. Vaswani and seven co-authors, most of them then at Google, published Attention Is All You Need, a short paper that rearranged the field. Today it has over 140,000 citations on Google Scholar, placing it among the most influential papers in the recent history of machine learning.
The core idea is called self-attention. Instead of processing language word by word in order — the way older networks (RNNs and LSTMs) did — the Transformer processes all tokens in a sequence simultaneously, letting each one "query" all the others.
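That solves an old problem: on long texts, older networks forgot the opening by the time they reached the close. The Transformer doesn't, at least within its context window. A word at the start of a paragraph can directly influence one at the end, with no chain of intermediate steps to weaken the signal.
And it has a practical advantage: it parallelizes. Because every position in the sequence is computed at once during training, the work maps well onto GPUs, and runs that took weeks could finish in days. That unlocked scale.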
The scaling effect
With the Transformer in hand, what was missing was compute and data. Both arrived:
- BERT, 2018 (Devlin et al., Google): 340 million parameters in its largest version. Strong context understanding, but built for comprehension tasks such as classification rather than open-ended generation.
- GPT-2, 2019 (Radford et al., OpenAI): 1.5 billion parameters. The first model that could write coherent paragraphs on arbitrary topics.
- GPT-3, 2020 (Brown et al., OpenAI): 175 billion parameters, roughly a hundred times larger than GPT-2. The paper documents what its authors called few-shot learning: show the model two or three examples in the prompt, and it generalizes to new cases without retraining.
- ChatGPT, 2022: not a new architecture. A model fine-tuned from the GPT-3.5 series, wrapped in a chat interface, with RLHF applied to align responses.
- GPT-4, 2023: size undisclosed. Outside estimates, notably SemiAnalysis reporting, put it at roughly 1.8 trillion parameters; OpenAI has never confirmed a figure.
Pattern: each large jump in parameter count brought capabilities that looked qualitatively new. Not "better." Different.
The inflection: November 2022
In November 2022, OpenAI shipped ChatGPT with a minimal interface: text box, enter, reply.
Five days to the first million users. Two months to the first hundred million. At the time, it was the fastest consumer adoption ever measured: TikTok took about nine months to hit that mark, Instagram about two and a half years.
The technology underneath was already out: GPT-3.5 had been available by API for a while. What was new was the wrapper — a box in a browser, free, no code.
That wrapper broke the barrier. Generative AI stopped being an engineering project and became a tool your accountant opened between two meetings.
What actually changed at work
The adoption curve matters less than what people started doing once the product was one click away.
McKinsey estimated in The Economic Potential of Generative AI (2023) that generative AI, combined with other technologies, could automate work activities that absorb 60% to 70% of employees' time today. The Stanford AI Index 2024 documents that, at the individual productivity level, workers using Copilot or Claude report 20% to 40% less time on routine text tasks.
It isn't that the AI writes your email and you sign it. It's that you go from drafting three versions to reviewing one. From searching Stack Overflow to inline autocomplete. From building a deck from scratch to editing a reasonable draft.
It's a change in speed, not in nature. But at that scale, speed becomes nature.
What's worth thinking about
Are you delegating tasks where the AI is faster and double-checking where it hallucinates, or are you accepting what it produces without looking? Do you understand how it works under the hood — enough to know where to trust it and where to verify — or are you treating it as a black box?
These are decisions you make once and live with for years.
If you want the full arc of how AI went from an academic field in crisis to a product with a hundred million users in two months, the timeline is in Timeline: AI's defining moments. If you want to know how it predicts token by token on the inside, that's told in How an AI thinks.
A technical genealogy: from the sequence problem to scale architectures
To understand why 2022-2023 — rather than 2020, when GPT-3 already existed — was the pivot for generative AI, it helps to return to a problem deep learning had been carrying since the early nineties: processing language without losing memory along the way.
The unsolved problem: memory across long sequences
Recurrent neural networks (RNNs), developed in the 1980s and popularized for language by Elman (1990), were the first serious attempt at learning temporal dynamics. They worked for short sequences (phrases, single sentences) but degraded quickly on text longer than a few hundred tokens. The technical problem is known as the vanishing gradient: during backpropagation through time, the gradient is multiplied by a factor at every time step, and when those factors are smaller than one the product shrinks toward numerical insignificance. Practically, the network "forgot" the start of a long sequence by the time it reached the end.
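The arithmetic behind the vanishing gradient can be sketched in a few lines. This is a deliberate simplification: it assumes the gradient is scaled by the same constant factor at every time step (0.9 here, an arbitrary illustrative value), whereas in a real network the factor depends on the weights and activations at each step.

```python
def gradient_scale(per_step_factor, steps):
    """Size of a gradient signal after being multiplied by the same factor at every step."""
    return per_step_factor ** steps


# The factor 0.9 and the sequence lengths are illustrative assumptions.
for steps in (10, 50, 200):
    print(f"{steps:>3} steps back: signal scaled by {gradient_scale(0.9, steps):.2e}")
```

At 10 steps the signal is still about a third of its original size; at 200 steps it is below one billionth, which is why the start of a long sequence stops influencing what the network learns.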
Hochreiter and Schmidhuber (1997) introduced LSTMs — Long Short-Term Memory networks — which added gates to preserve relevant information across time. It was a real improvement, but the architecture remained fundamentally sequential: each token had to be processed after the previous one. That imposed a compute ceiling. Scaling LSTMs to massive corpora and long contexts was prohibitive.
Through 2017, the field was stuck under that ceiling.
Attention is All You Need: the paper that broke through
In June 2017, Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, and Polosukhin (eight authors, most at Google Brain or Google Research) posted Attention Is All You Need to arXiv; it was presented at NeurIPS (then called NIPS) that December. A short paper. A new architecture.
The core idea: replace recurrence with a mechanism called self-attention. Instead of processing tokens in sequence, each token queries every other token in the sequence in parallel, and the results are combined using attention weights computed dynamically from the input. Formally:
Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V
where Q, K, V are linear projections of the input embeddings (query, key, value) and d_k is the dimension of the key vectors.
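The formula can be sketched in plain Python to show what each piece does. This is an illustration under simplifying assumptions, not the paper's implementation: a single attention head, no masking, no learned projection matrices, and made-up 2-dimensional values for Q, K, and V.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for matrices given as lists of rows."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # One score per key: how strongly this token's query matches that key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        # The output is a weighted average of the value vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return outputs, weights


# Three tokens, 2-dimensional projections (made-up numbers for illustration).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

out, w = attention(Q, K, V)
for row in w:
    print([round(x, 3) for x in row])  # each row sums to 1
```

Every row of weights is computed independently of the others, which is what lets the whole sequence be processed in parallel.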
Two immediate consequences. First, complete parallelization during training: the whole sequence is processed in a single pass, not token by token (generation still emits one token at a time). Second, unlimited attention range within the context window: a word at the start of a paragraph can directly influence one at the end, without the gradient having to traverse intermediate steps.
The paper has over 140,000 citations on Google Scholar today. It is, by a wide margin, one of the most influential machine learning papers of the modern era.
The scaling curve
With Transformer as the foundation, OpenAI, Google, DeepMind, and Meta began scaling systematically.
- BERT (Devlin et al., 2018/2019): 340 million parameters. Bidirectional architecture, focused on comprehension.
- GPT-2 (Radford et al., 2019): 1.5 billion parameters. Decoder-only architecture, focused on generation.
- GPT-3 (Brown et al., 2020): 175 billion parameters. Roughly a hundred times larger than GPT-2. Few-shot learning capabilities appear and are documented in the paper.
- Chinchilla (Hoffmann et al., 2022, DeepMind): 70 billion parameters, trained on roughly four times as many tokens as the 280-billion-parameter Gopher. Demonstrated empirically that most large models from 2020-2022 were undertrained: for a fixed compute budget, the optimum is roughly 20 training tokens per parameter.
- PaLM (Chowdhery et al., 2022, Google): 540 billion parameters.
- GPT-4 (2023, OpenAI): size unconfirmed. The most-cited public estimate comes from a SemiAnalysis report (July 2023) placing it at approximately 1.8 trillion parameters under a Mixture-of-Experts architecture. OpenAI has neither confirmed nor denied.
- Claude 3 (March 2024, Anthropic) and Claude 3.5 Sonnet (June 2024, updated October 2024): parameter counts not published.
Worth stating clearly: neither Anthropic nor OpenAI has confirmed exact parameter counts for its most recent models. Any specific number floating around for Claude 3 Opus, GPT-4, or their successors is an external estimate, not an official figure.
The empirical rule: scale data and compute together
Kaplan et al. (2020, Scaling Laws for Neural Language Models) documented that training loss follows a power law in three variables: model size, data quantity, and total compute. Hoffmann et al. (2022) refined that rule with Chinchilla, showing that the optimum isn't "more parameters" but "more data and more parameters in the right ratio."
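A back-of-the-envelope sketch of what "the right ratio" means in practice. It uses the rough rule of about 20 training tokens per parameter associated with Chinchilla, plus one extra assumption that does not appear above: the common approximation that training compute is about 6 x parameters x tokens floating-point operations. The function names are invented for illustration, and the outputs are order-of-magnitude estimates, not published figures.

```python
TOKENS_PER_PARAM = 20  # rough Chinchilla rule of thumb


def compute_optimal_tokens(params):
    """Training tokens suggested by the ~20-tokens-per-parameter rule."""
    return TOKENS_PER_PARAM * params


def training_flops(params, tokens):
    """Approximate training compute, assuming C ~ 6 * N * D (an assumption of this sketch)."""
    return 6 * params * tokens


for name, params in [("70B model", 70_000_000_000), ("175B model", 175_000_000_000)]:
    tokens = compute_optimal_tokens(params)
    flops = training_flops(params, tokens)
    print(f"{name}: ~{tokens:.2e} tokens, ~{flops:.2e} FLOPs")
```

For a 70-billion-parameter model the rule suggests about 1.4 trillion tokens. For a 175-billion-parameter model it suggests about 3.5 trillion, which gives a sense of how much data "the right ratio" demands at that size.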
That result reoriented the industry. Post-Chinchilla models tend to train on far larger corpora than were used in 2020.
Alignment: the step that makes the model usable
A pre-trained model is powerful but not obedient. It can generate accurate text, toxic text, false text, or inappropriate text — all with the same confidence. Before it ships, it's aligned.
Two main method families:
RLHF (Reinforcement Learning from Human Feedback), applied to instruction-following language models in Ouyang et al. (2022, Training language models to follow instructions with human feedback). The model generates several responses to a prompt; humans rank which they prefer; a reward model trains on those preferences; finally, PPO (a reinforcement learning algorithm) is used to tune the original model toward preferred responses. It's the method behind InstructGPT and ChatGPT.
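The reward-model step can be illustrated with the pairwise preference loss commonly used for it: the negative log-sigmoid of the score gap between the preferred and the rejected response. This sketch covers only that one loss on hand-picked scores (the numbers and the function name are assumptions); it does not train a model or run PPO.

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), written in a numerically stable form."""
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical reward scores for (preferred, rejected) response pairs.
print(preference_loss(2.0, 0.0))  # reward model agrees with the human: small loss
print(preference_loss(0.0, 0.0))  # reward model is indifferent: log(2)
print(preference_loss(0.0, 2.0))  # reward model disagrees: large loss
```

Minimizing this loss pushes the reward model to score human-preferred responses higher, and that learned score is what the reinforcement learning step then optimizes against.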
Constitutional AI, published by Bai et al. (2022, Anthropic) as Constitutional AI: Harmlessness from AI Feedback. The model critiques itself using an explicit set of principles, a "constitution," and then trains on its own corrections. It reduces the dependency on human labelers for the harmlessness phase. It's a core part of the training behind Claude.
Both produce models that are dramatically more useful than the raw pre-trained version. But both also introduce their own failures: the model learns to produce responses that look good to evaluators, which correlates with being helpful — sometimes.
The emergent-abilities objection
A claim that circulated strongly between 2020 and 2022: large models exhibit emergent capabilities — skills that appear abruptly at some scale threshold, absent in smaller models.
Schaeffer, Miranda, and Koyejo (2023, Are Emergent Abilities of Large Language Models a Mirage?, NeurIPS 2023) pushed back on that reading. They showed that many "emergences" are artifacts of the metrics used: when the metric is binary (total success or total failure), improvement looks abrupt; when the same task is scored with a continuous metric (log-probability, partial credit), the improvement is smooth and predictable from the scaling laws.
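The metric argument can be reproduced with simple arithmetic. The sketch below is a toy model, not the paper's experiment: it assumes each token of a 10-token answer is correct independently with the same probability, and the accuracy values are invented to stand in for models of increasing size.

```python
def exact_match(per_token_accuracy, answer_length):
    """Chance that every token is right, assuming tokens are independent (a simplification)."""
    return per_token_accuracy ** answer_length


# Hypothetical per-token accuracies, standing in for progressively larger models.
for p in (0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99):
    print(f"per-token accuracy {p:.2f} -> exact match on 10 tokens {exact_match(p, 10):.3f}")
```

The per-token column rises steadily, while the exact-match column sits near zero for most of the range and then shoots up. Same underlying improvement, two very different-looking curves.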
The debate remains open. But the Schaeffer paper is an academic correction that popular arguments about "emergent AI" tend to ignore — and that's worth keeping in mind before trusting any sharp-jump chart.
Differences across frontier families: Claude, GPT, Gemini
All three frontier families are generally understood to use variants of the decoder-only Transformer, but they diverge on design choices that matter:
GPT-4 (OpenAI): 128K context window on GPT-4 Turbo. Strengths in creative tasks and in tool integration via function calling. Relative weakness: a tendency to hallucinate on specific facts, more pronounced than in Claude on independent benchmarks like TruthfulQA.
Claude 3.5 Sonnet / 3.7 Sonnet (Anthropic): 200K context window. Strengths in long-form reasoning, structured writing, and honesty about uncertainty — a byproduct of Constitutional AI training. Relative weakness: less expansive on open-ended generation than GPT.
Gemini 1.5 Pro (Google): 1M tokens in production (2M in beta). Native multimodality: processes text, images, audio, and video in the same call. Strength: analysis of massive documents and videos. Relative weakness: fewer public feedback iterations than GPT and Claude.
On generic benchmarks (MMLU, HumanEval), all three sit in the 87-90% accuracy range. The real differences show up in specific tasks: multi-step reasoning, controlled creativity, multimodal analysis.
The next layer: agents and Model Context Protocol
Pure generative AI — "you write, the model responds" — was the 2022-2024 product. From late 2024 onward, the field is shifting toward agents: systems that don't just generate text but take actions on external tools.
Two key pieces appeared in that period.
Computer Use, launched by Anthropic in October 2024, lets Claude see a screenshot, move the cursor, click, and type in any application. It's still experimental — success rates on multi-step tasks are moderate — but it marks a paradigm shift.
Model Context Protocol (MCP), launched by Anthropic in November 2024 as an open standard, defines how a model connects to APIs, databases, and external tools. It's not a product: it's a protocol. It lets Claude (or any compatible model) query live data while reasoning.
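To give a feel for what "a protocol, not a product" means: MCP messages are JSON-RPC 2.0, and a tool invocation is a request with the method tools/call. The sketch below only builds and prints such a message. The tool name query_orders and its arguments are hypothetical, and a real client would use an MCP SDK and first perform the protocol's initialization handshake, which is omitted here.

```python
import json


def tool_call_request(request_id, tool_name, arguments):
    """Build a JSON-RPC 2.0 request asking an MCP server to run one of its tools."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# "query_orders" and its arguments are made up for illustration.
message = tool_call_request(1, "query_orders", {"customer_id": "c-123"})
print(json.dumps(message, indent=2))
```

Because the message shape is standardized, any compatible model can call any compatible server without bespoke integration code for each pair.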
The implication is that the value of these systems stops being only text generation. It moves to the combination: generation + search + computation + action.
The blog's thesis: this isn't a degree change, it's a category change
What this blog has argued since its first issue — and what this piece confirms with more technical context — is that what happened between 2017 and 2022 is not another iteration of progress in machine learning.
Classifiers, the expert systems of the eighties, the recommendation engines of the 2010s — they all operated on what already existed. Generative AI, with Transformer as engine and internet-scale data as fuel, is the first mass-consumer technology that produces original artifacts of professional quality at interactive speed.
The category change isn't that AI writes now. It's that the cognitive task of "composing an artifact of text" — email, memo, code, proposal — moved from being a human bottleneck to being a review bottleneck. For most of the history of knowledge work, the cost of producing a first draft was high and the cost of critiquing one was low. That asymmetry has flipped.
Three years after ChatGPT shipped, knowledge work no longer organizes around writing. It organizes around reviewing what someone else — human or model — wrote first. And whoever reviews best, not whoever writes best, wins.
That is the category change. Everything else is consequence.