On December 26, 2024, DeepSeek posted a 53-page technical document on GitHub titled DeepSeek-V3 Technical Report. This wasn't a marketing blog post: it was a detailed paper, with architecture, training decisions, loss curves, and a figure that froze Silicon Valley once analysts read it closely. The final training run had cost roughly $5.6 million.
The market took a little over four weeks to process what that number implied. On Monday, January 27, 2025, with the DeepSeek app topping the US App Store ahead of ChatGPT, Wall Street reacted. Nvidia closed that day down $589 billion in market cap, the largest single-day loss for any company in US stock market history. The Nasdaq dropped 3.1 percent. The semiconductor index SOX fell 9.2 percent.
What broke that Monday wasn't DeepSeek. What broke was a belief: that training frontier AI required US-scale capex and was therefore a game of few players.
Where did DeepSeek come from?
DeepSeek was founded in May 2023 in Hangzhou. Its creator, Liang Wenfeng, came out of running High-Flyer, a Chinese quantitative hedge fund that had accumulated a significant Nvidia GPU cluster, initially for high-frequency trading, not AI. When Liang pivoted toward language models, he had two things rare in the Chinese ecosystem: his own compute capacity, independent of any big tech company, and a culture of optimization inherited from quantitative finance.
The product trajectory so far:
- DeepSeek-V2 (May 2024) — first model to draw attention on technical benchmarks, with the Mixture-of-Experts architecture they'd later scale up in V3.
- DeepSeek-V3 (December 2024) — 671 billion total parameters, 37 billion active per token. The paper that contains the $5.6 million figure.
- DeepSeek-R1 (January 2025) — an o1-style reasoning model. Open-weights under MIT license.
The MIT license on R1 matters. It's one of the most permissive licenses in common use. You can download the weights, fine-tune them, and run them in commercial production without asking anyone.
The $5.6 million figure is disputed (and it matters less than you'd think)
Let's be honest with the numbers. The $5.6 million only covers the compute cost of the final training run — it doesn't include the GPUs (hardware capex they already had), researcher salaries, the dozens of prior failed experiments, or data labeling. SemiAnalysis and other independent analysts put the real total around $500 million when everything is included.
But that debate misses the point. Even if DeepSeek spent $500 million, training the equivalent of GPT-4 for half a billion dollars is still an order of magnitude below the $5 to $10 billion analysts had been projecting for the next scaling cycle at OpenAI or Anthropic.
Compute efficiency per dollar isn't Chinese marketing. It's documented in the paper, it's reproducible by technical teams, and its core techniques — multi-head latent attention, auxiliary-loss-free load balancing, FP8 mixed precision training — have already been adopted by Western labs.
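The headline figure in this section can be reconstructed with back-of-the-envelope arithmetic. What follows is a minimal sketch, assuming the GPU-hour breakdown, the $2 per GPU-hour rental price, and the 2,048-GPU cluster size as I recall them from the DeepSeek-V3 report; none of those inputs are stated in this article, so check them against the paper before quoting.

```python
# Inputs recalled from the DeepSeek-V3 Technical Report (assumptions, not
# taken from this article): H800 GPU-hours per training stage.
GPU_HOURS = {
    "pre_training": 2_664_000,
    "context_extension": 119_000,
    "post_training": 5_000,
}
RENTAL_USD_PER_GPU_HOUR = 2.0   # assumed rental price used by the report
CLUSTER_GPUS = 2_048            # assumed cluster size


def training_cost_usd(gpu_hours: dict[str, int], price: float) -> float:
    """Compute-only cost: total GPU-hours times the hourly rental price."""
    return sum(gpu_hours.values()) * price


def wall_clock_days(gpu_hours: int, n_gpus: int) -> float:
    """Days of wall-clock time if all GPUs run in parallel the whole time."""
    return gpu_hours / n_gpus / 24


total_hours = sum(GPU_HOURS.values())
cost = training_cost_usd(GPU_HOURS, RENTAL_USD_PER_GPU_HOUR)
pretrain_days = wall_clock_days(GPU_HOURS["pre_training"], CLUSTER_GPUS)

print(f"total GPU-hours: {total_hours:,}")
print(f"compute-only cost: ${cost:,.0f}")
print(f"pre-training wall clock: {pretrain_days:.1f} days")
```

The point of the sketch is what it leaves out: hardware purchase, salaries, and failed runs never enter the formula, which is why the compute-only number and the all-in estimates can both be defensible.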
Qwen, Kimi, and the rest of the pack
DeepSeek is the visible face, but it isn't the only serious player on the Chinese side.
Qwen (Alibaba). The most consistent series. Qwen2 (mid-2024), Qwen2.5 (late 2024), Qwen3 (2025). Models in different sizes (7B, 32B, 72B), most with open weights under Apache 2.0 (also permissive); a few sizes, including the 72B, ship under Alibaba's own Qwen license instead. Qwen2.5-72B is the de facto model for many startups that need to self-host. Alibaba pushes it because they want to sell Alibaba Cloud infrastructure; the open models are the hook.
Kimi (Moonshot). Specialty: long context. It was the first to offer commercial million-token windows in Chinese, before Western models had equivalents. Strong in the Chinese market, less known outside.
Baichuan, Zhipu GLM, 01.AI (Yi). Three labs with capable models. 01.AI was founded by Kai-Fu Lee, a familiar name in the Western ecosystem. Zhipu has academic ties to Tsinghua.
Ernie Bot (Baidu). The most direct corporate response to ChatGPT. More closed, less technically innovative, but with huge distribution inside China via Baidu's products.
The ecosystem is more diverse than the media focus on DeepSeek suggests.
The context that doesn't make the headlines
To understand why Chinese labs innovated so aggressively on efficiency, you have to look at what Washington was doing.
In October 2022 the Biden Administration, through the Department of Commerce's Bureau of Industry and Security, imposed restrictions on exporting advanced GPUs to China. Nvidia's H100 and A100, the reference chips for training frontier models, were banned. Nvidia responded by creating slightly degraded variants (the A800 and H800) that cleared the regulatory thresholds. Washington tightened restrictions in October 2023 to close those loopholes, and Nvidia followed with the further cut-down H20.
The stated goal was to slow China down. The actual effect was different: Chinese labs, with less compute per researcher and inferior hardware, had to optimize aggressively. And they published the techniques.
That's what makes DeepSeek-V3 unique as a document: it's not a closed product with a black box inside, it's an operations manual. Any lab with resources can read the paper and apply the same techniques.
What do I, professionally, do with any of this?
Here's where it pays to separate use cases.
If your work involves real clients, contracts, sensitive data, reputation on the line — Claude is still the bet I make every day. Not because DeepSeek-R1 is bad on raw capability (it isn't), but because the combination of jurisdiction, trust track record, contractual guarantees, and consistent multilingual support doesn't exist in publicly-accessible Chinese models. For delegable work with data that can't leave my control, Anthropic in the US is still the provider I understand and can point accountability at.
If your work is a technical startup with a tight budget that needs to self-host — that's where Qwen2.5-72B or DeepSeek-V3 running on your own infrastructure are legitimate options. Permissive license, high capability, no third party watching your prompts. This is a real door that didn't exist two years ago for anyone outside big tech.
If you're learning — try all of them. DeepSeek has a public web app. Qwen has a Hugging Face demo. ChatGPT and Claude you know. Seeing how each one thinks gives you intuition that no blog post delivers.
If your beat is journalism, political research, human rights, or anything touching Asian geopolitics — publicly-accessible Chinese models aren't the tool. Not out of malice, out of source-country regulation.
To close, and to keep going
The rise of DeepSeek and the consolidation of Qwen changed the conversation about what "expensive" means when training a frontier model. They broke a cost assumption, spread techniques through public papers, and forced Western labs to respond on efficiency.
But they aren't interchangeable with Claude or ChatGPT for Western professional use. The content filters are real. The jurisdiction is real. The uneven multilingual support is real. They're different tools for different cases.
Where in your workflow would an open-weights option with a permissive license running on your own infrastructure actually help? If you want the broader competitive picture, "The AI race" is the next link. If you want to understand how these models' capability gets compared, "How AIs are measured" gives you the frame.
On Monday, January 27, 2025, an app called DeepSeek woke up at number one on the US App Store. Ahead of ChatGPT.
That had never happened before.
The app was the chatbot of a Chinese company most people had never heard of. And on Wall Street that same Monday, Nvidia — the company that makes the chips that train AI models — lost $589 billion in market value in a single day. The biggest one-day loss for any company in US stock market history.
What is DeepSeek and why did it shock everyone?
DeepSeek is a Chinese startup founded in 2023 in the city of Hangzhou. Its founder is Liang Wenfeng, and he came out of a quantitative hedge fund. Meaning: people used to optimizing every last cent.
In December 2024 they published a very detailed technical paper on a model called DeepSeek-V3. In the paper they said they'd trained a model comparable to GPT-4 for around $5.6 million.
To understand why that was a bombshell: the industry took for granted that training a model at that level cost anywhere from a hundred million to several hundred million dollars. Saying "we did it for under six million" was, at that moment, science fiction.
A month later, in January 2025, they released DeepSeek-R1. A model that could reason step by step: think before answering, like OpenAI's o1. But R1 was open: weights available for download, MIT license (one of the most permissive out there).
That's when things blew up.
It's not just DeepSeek
China has more than one company doing this seriously.
Alibaba — yes, the ecommerce giant — has a model line called Qwen (pronounced "chwen"). Qwen2.5 and Qwen3 are competitive and also open-source. They get used a lot in startups that need to self-host AI and keep full control of their data.
Moonshot makes Kimi, which is strong at long conversations. Baichuan, Zhipu, 01.AI are other names in the ecosystem. And then there's Baidu's Ernie Bot — the most direct ChatGPT response, but more closed and less innovative.
Several players. Some aim to compete on benchmarks, others to dominate specific markets inside China.
The uncomfortable part
Two things you need to know if you're thinking of using any of these tools.
First: Chinese models accessible to the public come with mandatory filters. The Chinese regulator — the Cyberspace Administration of China, or CAC — requires models to register their training data and filter certain sensitive topics: Tiananmen, Taiwan, Party criticism, a few more. Ask about those and the model will duck.
For code, math, technical analysis: censorship doesn't touch you. For journalism, recent history, or anything political: it does.
Second: if you use the official service, your data goes to Chinese servers. That's not a footnote when you're working with client information.
What to take away
Three things worth holding onto:
- DeepSeek didn't "catch up" to the US by magic. It did it because starting in 2022 Washington restricted exports of advanced GPUs to China, and Chinese labs had to optimize with what they had. The result was the opposite of what was intended — they learned to do more with less and published how.
- Chinese models are a real option, with fine print. For neutral technical use (code, analysis, math) they work very well, often cheaper. For sensitive data or political topics, they aren't the tool.
- For professional client work, Claude is still the solid bet. For a side project or learning to run open models on your own infrastructure, Qwen2.5 with an Apache license is a legitimate way in. These aren't mutually exclusive calls.
On October 7, 2022, the US Department of Commerce's Bureau of Industry and Security published a regulation that rewrote silicon geopolitics. Among other measures, it banned exports to China of GPUs above specific interconnect and compute thresholds; in practice, Nvidia A100s and H100s were placed out of the Chinese market's legal reach. Twelve months later, in October 2023, the regulation tightened to close the loopholes Nvidia had opened with the A800 and H800 variants.
The declared strategic logic was simple: if China can't buy the best chips, it can't train the best models, and the US preserves its frontier lead. The logic was simple, and the consequence was the opposite. When DeepSeek published the V3 paper in December 2024, it showed that a lab operating with degraded H800 GPUs (a cluster of 2,048 units per the paper, small by frontier standards) could produce a functional GPT-4-range model at a documented compute cost of $5.6 million. The restriction produced the incentive. The incentive produced the innovation. And the innovation got published.
Worth taking this story apart with technical precision, because the surface reading — "China caught up to the US" — misses the point. The point is that the efficiency regime DeepSeek documented redefines what "expensive to train" means, and that new regime also applies to Western labs whose budgets were built on the earlier assumption.
Technical architecture: what makes DeepSeek-V3 efficient
The V3 paper describes three core technical innovations operating together.
Multi-head Latent Attention (MLA). Classic transformers store a key-value cache whose size grows with sequence length, layer count, and the number of attention heads. In large models that cache dominates memory cost during inference. MLA compresses the KV representations into a reduced-dimensional latent space before storage and decompresses them on the fly. It shrinks the cache by a significant factor: the DeepSeek-V2 paper, which introduced MLA, reports roughly 93 percent reduction in KV cache size relative to the earlier DeepSeek 67B model, at a marginal quality cost.
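To make the cache arithmetic concrete, here is a rough sketch that counts cached elements per token for standard multi-head attention versus a latent cache. The shape numbers are assumptions based on my recollection of the published V3 configuration, and `AttentionShape` and both helper functions are hypothetical names introduced for illustration. The percentage it prints depends entirely on the assumed shapes and on the baseline chosen, so it should not be expected to match the figure quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttentionShape:
    n_layers: int
    n_heads: int
    head_dim: int
    latent_dim: int   # width of the compressed KV latent stored by MLA
    rope_dim: int     # small positional key stored alongside the latent


def mha_cache_elems_per_token(s: AttentionShape) -> int:
    """Standard MHA caches one key and one value vector per head per layer."""
    return 2 * s.n_layers * s.n_heads * s.head_dim


def mla_cache_elems_per_token(s: AttentionShape) -> int:
    """MLA caches only the shared latent (plus the positional key) per layer."""
    return s.n_layers * (s.latent_dim + s.rope_dim)


# Assumed shapes, for illustration only.
shape = AttentionShape(n_layers=61, n_heads=128, head_dim=128,
                       latent_dim=512, rope_dim=64)

mha = mha_cache_elems_per_token(shape)
mla = mla_cache_elems_per_token(shape)
reduction = 1 - mla / mha

context_tokens = 32_000
bytes_per_elem = 2  # 16-bit cache entries
print(f"MHA cache: {mha * context_tokens * bytes_per_elem / 2**30:.1f} GiB")
print(f"MLA cache: {mla * context_tokens * bytes_per_elem / 2**30:.1f} GiB")
print(f"reduction vs. full MHA: {reduction:.1%}")
```

The design trade is visible in the two formulas: the MHA term scales with the number of heads, while the latent term does not, at the price of an extra up-projection at attention time.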
Mixture-of-Experts with auxiliary-loss-free load balancing. V3 has 671 billion total parameters but only 37 billion activate per token. The MoE architecture isn't DeepSeek's innovation — GPT-4 is assumed to be MoE as well. The innovation is the method for balancing load across experts. Classic MoE uses an auxiliary loss to push uniform expert utilization, which interferes with the main training loss. DeepSeek drops the auxiliary loss and balances via a dynamic per-expert bias that adjusts during training. Conceptually simple, disruptive in practice.
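The bias mechanism described above can be sketched as a toy simulation. This is an illustration under stated assumptions, not DeepSeek's implementation: the router scores are random numbers with an artificial skew toward two experts, the update step `gamma` and every function name are hypothetical, and no model is actually trained. The one detail carried over from the technique is that the bias only influences which experts get selected, while the gate weights are computed from the unbiased scores.

```python
import random


def route(scores, bias, top_k):
    """Select top_k experts by biased score; gate weights use raw scores."""
    ranked = sorted(range(len(scores)),
                    key=lambda e: scores[e] + bias[e], reverse=True)
    chosen = ranked[:top_k]
    total = sum(scores[e] for e in chosen)
    return {e: scores[e] / total for e in chosen}


def simulate(n_steps=200, n_tokens=512, n_experts=8, top_k=2,
             gamma=0.01, seed=0):
    rng = random.Random(seed)
    # Artificial skew: the router "prefers" experts 0 and 1.
    skew = [1.0, 0.5] + [0.0] * (n_experts - 2)
    bias = [0.0] * n_experts
    mean_load = n_tokens * top_k / n_experts
    history = []  # max load divided by mean load, one entry per step
    for _step in range(n_steps):
        load = [0] * n_experts
        for _token in range(n_tokens):
            scores = [skew[e] + rng.uniform(0.05, 1.0)
                      for e in range(n_experts)]
            for e in route(scores, bias, top_k):
                load[e] += 1
        history.append(max(load) / mean_load)
        # No auxiliary loss: nudge each bias against its expert's load.
        for e in range(n_experts):
            if load[e] > mean_load:
                bias[e] -= gamma
            elif load[e] < mean_load:
                bias[e] += gamma
    return history, bias


history, bias = simulate()
print(f"imbalance at step 0: {history[0]:.2f}")
print(f"mean imbalance over last 20 steps: {sum(history[-20:]) / 20:.2f}")
print("final biases:", [round(b, 2) for b in bias])
```

Because the correction lives in the routing bias rather than in the loss, the gradient the model sees comes only from the main objective, which is the interference the paragraph above refers to.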
FP8 mixed-precision training. Training large models in 16-bit formats (BF16 or FP16) is standard; FP8 is cheaper in memory and compute but numerically more fragile. DeepSeek implements a hybrid scheme where most compute-heavy operations run in FP8, with specific strategies to preserve numerical stability in critical regions (gradient accumulation, normalization). The result is a substantial reduction in compute per training step, with no measurable quality loss on benchmarks.
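One of the stability strategies can be illustrated without any GPU: scaling values in small blocks before casting to FP8, so a single outlier does not crush everything else toward zero. The sketch below emulates an E4M3-like format in plain Python. It is a simplification under stated assumptions: `round_to_e4m3` is a hypothetical helper that ignores NaN handling, the block size of 128 is what I believe the paper uses for its tiles, and the input is synthetic data with one planted outlier.

```python
import math
import random

E4M3_MAX = 448.0            # largest finite magnitude in E4M3
MIN_NORMAL_EXP = -5         # frexp exponent of the smallest normal, 2**-6
SUBNORMAL_STEP = 2.0 ** -9  # spacing of the subnormal grid


def round_to_e4m3(x: float) -> float:
    """Round to an E4M3-like grid: 3 mantissa bits, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)
    mantissa, exponent = math.frexp(a)  # a == mantissa * 2**exponent
    if exponent < MIN_NORMAL_EXP:
        q = round(a / SUBNORMAL_STEP) * SUBNORMAL_STEP
    else:
        q = math.ldexp(round(mantissa * 16) / 16, exponent)
    return sign * min(q, E4M3_MAX)


def quantize(values, block_size):
    """Scale each block so its largest value maps to E4M3_MAX, then round."""
    out = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block)
        scale = amax / E4M3_MAX if amax > 0 else 1.0
        out.extend(round_to_e4m3(v / scale) * scale for v in block)
    return out


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


rng = random.Random(0)
activations = [rng.uniform(-1.0, 1.0) for _ in range(1024)]
activations[3] = 5e4  # one planted outlier

per_tensor_err = mean_abs_error(
    activations, quantize(activations, block_size=len(activations)))
per_block_err = mean_abs_error(
    activations, quantize(activations, block_size=128))

print(f"per-tensor scaling error: {per_tensor_err:.4f}")
print(f"per-block scaling error:  {per_block_err:.4f}")
```

With one scale for the whole tensor, the outlier forces every ordinary value into the coarse bottom of the format's range; with per-block scales, only the block containing the outlier pays that price. Accumulating in higher precision, the other half of the scheme, is not modeled here.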
The three innovations are reproducible by any lab with a competent technical team. The paper describes them in enough detail to implement. That open publication is the strategic difference from the closed Western approach OpenAI adopted post-GPT-3.
DeepSeek-R1 and the open-source reproduction of the reasoning paradigm
In September 2024 OpenAI released o1, the first frontier reasoning model: a model that spends extra compute on internal chains of thought before producing the final answer. On Olympiad-level math and competitive coding benchmarks, o1 raised scores sharply over its predecessors. OpenAI didn't publish how. The industry speculated.
DeepSeek-R1, released January 20, 2025, was the first public open reproduction of the paradigm. The accompanying paper — DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning — describes the method. Key point: emergent reasoning can be obtained through pure reinforcement learning on verification rewards (correct/incorrect on math and code problems), without needing human-supervised chain-of-thought.
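The two ingredients named here, a rule-based reward and a comparison within a group of sampled answers, are small enough to sketch. This is a simplified illustration, not the paper's code: the tag template follows the `<think>`/`<answer>` format described for R1-Zero, the 0.1 format bonus is a hypothetical weight, and exact string matching stands in for a real answer checker. The group-normalized advantage is the core of GRPO, the RL algorithm the paper uses.

```python
import re
import statistics

# Completion must be: <think>...</think><answer>...</answer>
TEMPLATE = re.compile(
    r"^<think>.*?</think>\s*<answer>(.*?)</answer>\s*$", re.DOTALL)

FORMAT_BONUS = 0.1  # hypothetical weight for following the template


def rule_based_reward(completion: str, reference: str) -> float:
    """Reward from rules only: format compliance plus answer correctness."""
    match = TEMPLATE.match(completion.strip())
    if match is None:
        return 0.0
    correct = match.group(1).strip() == reference.strip()
    return FORMAT_BONUS + (1.0 if correct else 0.0)


def group_relative_advantages(rewards):
    """Normalize each reward against its own sampling group (GRPO-style)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0] * len(rewards)  # no signal if every sample ties
    return [(r - mean) / std for r in rewards]


# Four sampled completions for the prompt "What is 6 * 7?"
group = [
    "<think>6 * 7 = 42</think><answer>42</answer>",
    "<think>6 * 7 = 36</think><answer>36</answer>",
    "The answer is 42",
    "<think>six sevens</think> <answer> 42 </answer>",
]
rewards = [rule_based_reward(c, "42") for c in group]
advantages = group_relative_advantages(rewards)

for completion, r, a in zip(group, rewards, advantages):
    print(f"reward={r:.1f} advantage={a:+.2f}  {completion[:40]!r}")
```

Nothing in the reward looks at the reasoning text itself, only at the format and the final answer. That is what lets the chain of thought emerge from optimization instead of being copied from human-written examples.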
The intermediate finding they reported — the R1-Zero model, trained only with RL with no prior supervised fine-tuning, exhibiting emergent self-reflection behavior ("aha moments" documented in the paper) — was technically striking and is under academic scrutiny. But the final R1 model, MIT-licensed, reached benchmarks competitive with o1 on math and code at a fraction of the inference cost.
The operational consequence for the field: any competent lab with a moderate training cluster can now reproduce a reasoning model. The technical moat of "having a reasoning model" evaporated roughly four months after o1's debut.
Qwen and the corporate open-weights strategy
Alibaba's play with Qwen is different and complementary. Alibaba doesn't need to sell inference — it has the Alibaba Cloud business. It needs to sell infrastructure. The open models are lead generation.
Qwen2.5, released September 2024, offered the most complete line of open checkpoints to date: 0.5B, 1.5B, 3B, 7B, 14B, 32B, 72B parameters, most under Apache 2.0. The exceptions are the 3B and the 72B, which ship under Alibaba's own Qwen licenses (the 72B requires a commercial license for large-scale use, on reasonable terms). Qwen2.5-72B-Instruct on standard benchmarks sits in the same range as Llama 3.1-70B and is competitive with Claude 3.5 Haiku on general text tasks.
Qwen3, the current generation, introduced native support for R1-style reasoning in some variants. The strategy is maintaining rough parity with the Western open frontier.
For a technical architect evaluating self-hosting, today's real choice is among Llama 3 (Meta, Llama Community License), Qwen2.5/3 (Alibaba, mostly Apache 2.0), and DeepSeek-V3 (DeepSeek, MIT). All three are competitive. The differentiation is license, bias profile, and language density. Qwen has a marginal edge on Asian languages and standard Spanish; Llama has an edge on English and Western data; DeepSeek has an edge on math and code.
The Chinese regulatory regime: CAC and the data mix
The Cyberspace Administration of China published the Interim Measures for the Management of Generative AI Services in July 2023; they took effect in August 2023. Operational requirements for any model accessible to the public inside China:
- Prior registration of the model with the regulator before public launch.
- Declaration of the training corpus with verification of source "legality".
- Active filters on forbidden content categories: subversion of state power, Party criticism, questioning sovereignty over Taiwan, sensitive historical events (Tiananmen), separatist content related to Tibet or Xinjiang.
- Continuous audit mechanisms.
The technical implementation of these filters sits mostly outside pretraining in open models: part is applied at the service layer, part is refusal behavior added in post-training. This has an interesting consequence for anyone deploying the weights on their own infrastructure outside China: the service-layer filters are simply absent, and the post-trained refusals are mostly removable with fine-tuning. What isn't removable is the bias in the initial data mix: the models were trained on corpora where certain historical and political narratives were underrepresented or skewed by country of origin.
For neutral technical use (code, math, scientific reasoning) the bias is irrelevant. For any application involving historical, political, or cultural analysis, it's a factor to account for — not an automatic deal-breaker, but an additional validation requirement.
The feedback effect on the West
What Washington didn't anticipate — or anticipated but underestimated — is that hardware restriction produces technical blowback on two timescales. Short path: Chinese labs optimize and publish. Long path: those techniques become industry standard, and Western labs, which had slack to be inefficient while compute was abundant, lose that slack.
This is already visible in 2026. Over the past year, Western labs have published implementations of MLA, variants of auxiliary-loss-free MoE, and FP8 training experiments. Reported costs of training frontier models in the West are coming down, partly because of commercial competition, partly through direct migration of techniques originating in China.
Anthropic so far has chosen to keep its stack closed and doesn't publish architectural details. It's a choice coherent with its thesis — Anthropic's differentiation isn't in training efficiency, it's in alignment and reliability. But the broader point is that the technical frontier can no longer be treated as geographic property. Washington discovered that export restrictions on hardware are a slower instrument than papers published in Hangzhou.
Editorial thesis
I'll close with a thesis that goes past reporting.
The US GPU export restrictions on China in 2022-2023 produced a result opposite to the one stated. Instead of slowing Chinese labs, they forced those labs to convert hardware restriction into published architectural efficiency. The net result is a new global cost regime where training a GPT-4-competitive model costs an order of magnitude less than was believed 24 months ago, and where the techniques producing that regime are documented in open papers anyone can read.
The consequence for the Western professional user is twofold. On one side, the choice for work with sensitive data is still a tool aligned with your jurisdiction and your contractual regime — for me, for my consulting practice, for my clients, it's still Claude, and that choice doesn't change because of January 2025. On the other, the open-weights ecosystem with a permissive license stopped being an academic toy. Qwen2.5-72B and DeepSeek-V3 running on your own infrastructure are real production options for cases where self-hosting matters.
The geopolitical conflict didn't resolve. Export restrictions will probably tighten further over the next 18 months. But the efficiency genie is already out of the bottle, and the industry will operate in the new regime regardless of future political decisions. That's the structural consequence worth reading coolly.
What's your criterion for deciding between using a proprietary model under Western jurisdiction versus self-hosting an open-weights model with a permissive license for sensitive-data cases?