The Great AI Benchmark War: How Chinese Models Caught the Frontier in 2026

June 28, 2026·AI in China

The Myth That Won't Die

At dinner tables in San Francisco, investment briefings in London, and tech podcasts streamed from New York, one assumption remains unchallenged: American AI models are years ahead of Chinese competitors. The frontier belongs to OpenAI, Anthropic, and Google. Everyone else is playing catch-up. It's a comfortable narrative, but it's also wrong.

The numbers from June 2026 tell a different story. On SuperCLUE, China's most comprehensive Chinese-language evaluation benchmark, the gap between the best Chinese model and the best Western model has collapsed to single-digit percentage points. On LMArena's open blind voting, DeepSeek V4-Pro sits 39 Elo points behind GPT-5.5, a distance that could close in a single model update. On SWE-bench coding tasks, Claude Opus 4.7 still leads, but at $75 per million output tokens it costs roughly 86 times as much as DeepSeek V4-Pro, the open-weight alternative that posts about 72% of its score (55.4 against 77.2).

The question is no longer whether Chinese models can compete. They are competing, aggressively, and they are doing it at a price point that is rewriting the economics of the entire industry. What remains open is whether the rest of the world has noticed.

Table 1: SuperCLUE June 2026 — Top 10 Chinese Models

Rank	Model	Provider	SuperCLUE	CMMLU	MMLU-Pro	Open-Source?
1	Qwen3.5-Plus	Alibaba	88.5	89.2	85.2	Partial
2	Doubao-Seed-2.0-pro	ByteDance	87.8	88.4	84.9	No
3	DeepSeek V4-Pro	DeepSeek	86.5	88.8	87.5	Yes (MIT)
4	GLM-5.1	Zhipu AI	85.4	87.1	82.3	Yes
5	Kimi K2.6	Moonshot	84.6	86.9	83.7	Yes
6	MiniMax M2.7	MiniMax	83.2	85.6	81.4	No
7	ERNIE 5.1	Baidu	82.7	85.0	82.1	Partial
8	Pangu-Ultra 718B	Huawei	81.9	84.3	80.8	No
9	Baichuan 5.0	Baichuan AI	79.4	82.1	77.5	Yes
10	Yi-2.0	01.AI	78.8	81.3	76.9	Yes

*Source: SuperCLUE Mid-2026 Benchmark Report; CMMLU & MMLU-Pro scores from vendor reports where available.*

The Data Doesn't Care About Narrative

Benchmarks are imperfect. They can be gamed, cherry-picked, and misinterpreted. But when multiple independent benchmarks — from independent evaluators like LMArena, from standardized test sets like CMMLU and MMLU-Pro, and from industry coding leaderboards like SWE-bench — all point in the same direction, the signal becomes hard to ignore.

Consider the SuperCLUE ranking for June 2026. Qwen3.5-Plus from Alibaba sits at the top with a score of 88.5, followed by ByteDance's Doubao-Seed-2.0-pro at 87.8, and DeepSeek V4-Pro at 86.5. These are not minor players. Qwen has been downloaded more than one billion times across its various model sizes. Doubao is the most widely used AI assistant in China with 345 million monthly active users as of mid-2026. DeepSeek, which released V4 with open-source MIT weights, has become the go-to backbone for enterprise deployments across Chinese cloud platforms.

On CMMLU (Chinese Massive Multitask Language Understanding), the scores are even tighter. Qwen3.5-Plus leads at 89.2, with DeepSeek V4-Pro at 88.8 and Doubao at 88.4. The gap between first and third place is 0.8 percentage points. In the context of language model evaluation, that margin is statistical noise. These three models are effectively tied on Chinese-language reasoning, cultural knowledge, and commonsense inference.

The MMLU-Pro scores are where Western models still hold a modest lead, but the gap is shrinking. DeepSeek V4-Pro reports 87.5 on MMLU-Pro, trailing GPT-5.5's estimated 91.0 by 3.5 points. Claude Opus 4.7 sits at 89.6. On Humanity's Last Exam, the hardest reasoning benchmark currently deployed, GPT-5.5 leads at 55, Claude at 51, and DeepSeek V4-Pro at 48. The gap is real, but it is narrow — and the price gap is not.

Table 2: LMArena Elo Rankings (June 2026) — Independent Blind Testing

Rank	Model	Elo Score	Provider	Open-Source?	Notes
1	GPT-5.5	1,506	OpenAI	No	Highest Elo, premium API
2 (tie)	Claude Opus 4.7	1,505	Anthropic	No	Consistent multi-turn leader
2 (tie)	Gemini 3.1 Pro	1,505	Google	No	1M-10M context window
4	DeepSeek V4-Pro	1,467	DeepSeek	Yes (MIT)	38-39 Elo gap to frontier
5	Qwen3.5-Max	1,459	Alibaba	Partial	Leading open Chinese model
6	Doubao-Seed-2.0-pro	1,452	ByteDance	No	345M MAU, enterprise deployed
7	Kimi K2.6	1,448	Moonshot	Yes	1T param MoE architecture
8	Claude Sonnet 4.6	1,445	Anthropic	No	Mid-tier at 1/5th Opus price
9	GLM-5.1	1,438	Zhipu AI	Yes	Trained on Huawei Ascend chips
10	ERNIE 5.1	1,421	Baidu	Partial	6% of typical training cost

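*Source: LMArena Chatbot Arena, June 2026 independent scores. Elo ratings are statistical estimates; under the standard Elo formula, a 30-point difference corresponds to an expected head-to-head win rate of roughly 54% for the higher-rated model.*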
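
For readers who want to translate an Elo gap into something more tangible, here is a minimal sketch in Python. It assumes the conventional Elo logistic formula with a 400-point scale, which is an assumption about how the ratings are scaled rather than something the leaderboard figures above state. The function name `expected_win_rate` is purely illustrative, and the Elo values are copied from Table 2.

```python
def expected_win_rate(elo_gap: float) -> float:
    """Expected share of head-to-head wins for the higher-rated model.

    Uses the conventional Elo logistic curve (base 10, 400-point scale).
    """
    return 1.0 / (1.0 + 10.0 ** (-elo_gap / 400.0))


# Elo scores copied from Table 2.
elo = {
    "GPT-5.5": 1506,
    "Claude Opus 4.7": 1505,
    "DeepSeek V4-Pro": 1467,
}

for leader in ("GPT-5.5", "Claude Opus 4.7"):
    gap = elo[leader] - elo["DeepSeek V4-Pro"]
    rate = expected_win_rate(gap)
    print(f"{leader} vs DeepSeek V4-Pro: gap {gap}, expected win rate {rate:.1%}")
```

Under that assumption, a gap of 38 to 39 points works out to the higher-rated model winning roughly 55 to 56 percent of blind head-to-head votes.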

The Coding Arena: Where the Real Money Lives

If general knowledge benchmarks are the beauty contest, coding benchmarks are the revenue engine. Software development accounts for the largest share of enterprise AI spending globally, and the models that can write, debug, and refactor code are the ones that command premium pricing.

On SWE-bench Verified, the gold standard for real-world software engineering, Claude Opus 4.7 leads at 77.2, followed by GPT-5.5 at 74.9 and Gemini 3.1 at 73.1. DeepSeek V4-Pro is further back at 55.4, the same score it posts on SWE-bench Pro, where it edges Gemini 3.1 at 54.2. This is the one benchmark where the Western frontier still holds a clear, measurable advantage.

But here's the critical caveat: SWE-bench measures capability, not value. Claude Opus 4.7 costs $15 per million input tokens and $75 per million output tokens. DeepSeek V4-Pro costs $0.87 per million output tokens. On output tokens alone, the price of one Claude Opus coding session buys roughly 86 DeepSeek V4-Pro sessions of the same length, with comparable results on many routine tasks. When GPT-5.5 costs $5/M input and $25/M output, and Gemini 3.1 Pro costs $3.5/M input and $10.5/M output, the pricing asymmetry is not a footnote; it is the dominant feature of the market.
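
The 86x figure can be reproduced with a few lines of Python. This is a rough sketch that compares output-token prices only, using the rates quoted above. It assumes each session produces the same number of output tokens on every model and ignores input-token costs, since no DeepSeek V4-Pro input price is given here. The helper name `price_multiple` is illustrative.

```python
# Output-token prices in USD per 1M tokens, as quoted in the text above.
output_price = {
    "Claude Opus 4.7": 75.00,
    "GPT-5.5": 25.00,
    "Gemini 3.1 Pro": 10.50,
    "DeepSeek V4-Pro": 0.87,
}


def price_multiple(model: str, baseline: str = "DeepSeek V4-Pro") -> float:
    """How many times more expensive `model` is than `baseline` per output token."""
    return output_price[model] / output_price[baseline]


for model in ("Claude Opus 4.7", "GPT-5.5", "Gemini 3.1 Pro"):
    print(f"{model}: {price_multiple(model):.1f}x the DeepSeek V4-Pro output price")
```

On these numbers the multiple is about 86x for Claude Opus 4.7, about 29x for GPT-5.5, and about 12x for Gemini 3.1 Pro. A real workload would shift the ratios depending on its mix of input and output tokens.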

On LiveCodeBench, the dynamic coding benchmark that updates weekly to prevent memorization, the picture is even more interesting. DeepSeek V4-Pro reports 93.5, exceeding Gemini 3.1's 91.7. On Terminal-Bench 2.0, which tests autonomous agentic coding workflows, GPT-5.5 is the clear outlier at 82.7, but the cluster behind it is tight: Claude Opus 4.7 at 69.4, Gemini 3.1 at 68.5, and DeepSeek V4-Pro at 67.9. DeepSeek V4-Pro trails Claude and Gemini by 1.5 points or less, within the margin of a single model update, while all three sit roughly 13 to 15 points behind GPT-5.5.

Table 3: Coding Benchmark Comparison (June 2026)

Benchmark	GPT-5.5	Claude Opus 4.7	Gemini 3.1	DeepSeek V4-Pro	Qwen3.5-Plus	Doubao Seed 2.0
SWE-bench Verified	74.9	77.2	73.1	55.4	~64.0	~52.0
SWE-bench Pro	N/A	71.8	54.2	55.4	~52.0	~48.0
LiveCodeBench	92.1	91.3	91.7	93.5	89.2	88.1
Terminal-Bench 2.0	82.7	69.4	68.5	67.9	62.3	61.8
MMLU-Pro	91.0	89.6	88.4	87.5	85.2	84.9
Humanity's Last Exam	55.0	51.0	49.5	48.0	42.1	41.3
LMArena Elo	1,506	1,505	1,505	1,467	1,459	1,452
Price per 1M output tokens	$25.00	$75.00	$10.50	$0.87	$1.20	$0.30

*Source: LMArena, SWE-bench, LiveCodeBench, Terminal-Bench; vendor-reported and independent scores where available. Prices are estimated June 2026 API rates.*

The Price Revolution Nobody Talks About

In 2023, the AI API market was a two-tier world: OpenAI at the top charging premium prices, and everyone else fighting for scraps. In 2026, the pricing structure has been completely resegmented.

ByteDance's Doubao-Seed-2.0-pro costs $0.30 per million output tokens. DeepSeek V4-Pro costs $0.87. Qwen3.5-Plus costs roughly $1.20. These are not "discount" models. These are frontier models, ranking in the top 10 globally on LMArena, scoring in the top 5 on Chinese-language benchmarks, and capable of handling enterprise workloads at scale, priced at roughly 0.4% to 11% of what the premium Western APIs charge per output token.

This is not a subsidy. ByteDance and DeepSeek are not losing money on API calls to gain market share. They are making money, because the cost of training and inference on domestic Chinese hardware has collapsed. Huawei's Ascend 910C chips, combined with training frameworks like MindSpore, have driven the cost of training a 1-trillion-parameter model down by approximately 60% compared to 2024 levels. DeepSeek's DeepSeekMoE architecture, which activates only 37 billion parameters out of 671 billion for any given forward pass, reduces inference costs by roughly 40% compared to dense models of the same quality.
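
To make the sparse-activation figures concrete, the short Python sketch below computes what share of the model's parameters is active per forward pass, using the 37 billion and 671 billion numbers from the paragraph above. It is only arithmetic on those two figures: it does not derive the roughly 40% cost saving cited above, and the function name `active_fraction` is illustrative.

```python
TOTAL_PARAMS_B = 671   # total parameters in billions, from the text above
ACTIVE_PARAMS_B = 37   # parameters activated per forward pass, in billions


def active_fraction(active_b: float, total_b: float) -> float:
    """Share of the model's parameters used for any single forward pass."""
    return active_b / total_b


share = active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)
print(f"Active share per forward pass: {share:.1%}")
print(f"Total-to-active ratio: {TOTAL_PARAMS_B / ACTIVE_PARAMS_B:.1f}x")
```

Only about 5.5% of the weights participate in any single forward pass. Per-token compute scales roughly with the active parameters, although all 671 billion still have to be stored and served, which is one reason the realized saving is smaller than the raw ratio suggests.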

The result is a pricing model that makes Western API rates look like a legacy carrier plan. When a Chinese developer can run an entire month of coding assistance through DeepSeek V4-Pro for less than the cost of a single Claude Opus conversation, the "premium" narrative evaporates. The question stops being "which model is smartest?" and becomes "how much of the capability gap am I willing to pay 30x to 86x for?"

Table 4: API Pricing Comparison (per 1M output tokens, June 2026)

Model Tier	Model	Price (USD/M output)	LMArena Elo	Cost per Elo Point
Premium Frontier	Claude Opus 4.7	$75.00	1,505	$0.0498
Premium Frontier	GPT-5.5	$25.00	1,506	$0.0166
Premium Frontier	Gemini 3.1 Pro	$10.50	1,505	$0.0070
Value Frontier	Claude Sonnet 4.6	$3.00	1,445	$0.0021
Value Frontier	GPT-5.3-Codex	$2.00	~1,440	$0.0014
Chinese Frontier	DeepSeek V4-Pro	$0.87	1,467	$0.0006
Chinese Frontier	Qwen3.5-Plus	$1.20	1,459	$0.0008
Chinese Frontier	Doubao Seed 2.0	$0.30	1,452	$0.0002
Open Source (local)	Llama 4 Maverick	$0.00*	~1,420	$0.0000
Open Source (local)	GLM-5.1	$0.00*	1,438	$0.0000

*Local inference costs are hardware-dependent; ~$0.00 refers to per-token API pricing only. Source: Vendor pricing pages, estimated June 2026.*
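
The "Cost per Elo Point" column in Table 4 is simply the output-token price divided by the LMArena Elo score. The Python sketch below reproduces it for the rows with exact prices and Elo values. The function name `cost_per_elo` is illustrative, and the metric should be read as a rough heuristic: Elo ratings are meaningful mainly as differences between models, so dividing by the raw score mostly tracks price.

```python
# (model, USD per 1M output tokens, LMArena Elo), copied from Table 4.
rows = [
    ("Claude Opus 4.7", 75.00, 1505),
    ("GPT-5.5", 25.00, 1506),
    ("Gemini 3.1 Pro", 10.50, 1505),
    ("Claude Sonnet 4.6", 3.00, 1445),
    ("DeepSeek V4-Pro", 0.87, 1467),
    ("Qwen3.5-Plus", 1.20, 1459),
    ("Doubao Seed 2.0", 0.30, 1452),
]


def cost_per_elo(price_usd: float, elo: int) -> float:
    """Output-token price divided by Elo score, as in Table 4."""
    return price_usd / elo


for name, price, elo in rows:
    print(f"{name:<20} ${cost_per_elo(price, elo):.4f}")
```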

The Multi-Modal Assault of May 2026

In a single week in May 2026, Chinese AI labs delivered a coordinated attack on the multi-modal frontier that changed the global landscape. DeepSeek teased V4.1 with native image-and-audio understanding. ByteDance quietly released Mamoda2.5, an open-source 250B-parameter unified multi-modal model. Baidu's ERNIE 5.1 reached the LMArena top 10 using only 6% of typical training costs.

The multi-modal battle is the decisive one because enterprise applications overwhelmingly depend on images, charts, screenshots, and video. A text-only model cannot analyze a factory monitoring feed, diagnose a medical X-ray, or review a design prototype. ByteDance's Mamoda2.5, built on MoE+DiT architecture, achieves 12x faster inference than Alibaba's Wan2.2 on a single device, with video editing latency of 9.2 seconds — matching closed-source Sora and Kuaishou Kling. The critical distinction: enterprises can fine-tune and deploy locally without API dependencies, essential for data-sensitive environments.

DeepSeek V4.1's architecture is even more significant. Rather than bolting image processing onto a text model, it uses native multi-modal fusion with a unified architecture handling text, images, and audio simultaneously, with shared context across modalities. The Deep MCP protocol integration enables V4.1 to function as an enterprise Agent core: analyzing a factory monitoring screenshot triggers not just a description but an actual workflow, such as creating tickets, notifying staff, and generating recommendations. The rumored $50 billion funding round for DeepSeek, if it closes, would fund this enterprise infrastructure push.

Table 5: Multi-Modal Model Comparison (May–June 2026)

Model	Provider	Parameters	Open-Source?	Video Latency	Key Feature	LMArena Rank
Mamoda2.5	ByteDance	250B	Yes	9.2 sec	MoE+DiT, 12x inference speed	~6th
V4.1 (preview)	DeepSeek	~1T	Partial (OSS)	12.5 sec	Native multi-modal fusion, Deep MCP	~4th
ERNIE 5.1	Baidu	220B	Partial	14.1 sec	6% training cost, top 10 globally	10th
Seed 2.0 Pro	ByteDance	~600B	No	11.3 sec	6th on LMSYS Text, 3rd on Vision	6th
Kling 3.0	Kuaishou	340B	No	8.7 sec	Industry-leading video generation	N/A
Wan2.2 A14B	Alibaba	14B (active)	Yes	110 sec	Long-form video generation	~15th
Sora (v2)	OpenAI	Unknown	No	10.2 sec	Premium video quality	1st (video)

*Source: Vendor announcements, technical reports, LMArena rankings as of June 2026. Video latency measured for 5-second clip generation on single A100-equivalent GPU.*
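
The 12x inference claim for Mamoda2.5 can be checked against the latency column in Table 5. The Python sketch below divides one model's latency by another's. It assumes the latencies were measured under the same conditions, as the source note states, and the `speedup` helper is an illustrative name.

```python
# Video latency in seconds, copied from Table 5.
latency_s = {
    "Mamoda2.5": 9.2,
    "Wan2.2 A14B": 110.0,
    "Kling 3.0": 8.7,
    "Sora (v2)": 10.2,
}


def speedup(baseline: str, model: str) -> float:
    """How many times faster `model` is than `baseline`, by latency ratio."""
    return latency_s[baseline] / latency_s[model]


print(f"Mamoda2.5 vs Wan2.2 A14B: {speedup('Wan2.2 A14B', 'Mamoda2.5'):.1f}x")
print(f"Mamoda2.5 vs Sora (v2):   {speedup('Sora (v2)', 'Mamoda2.5'):.2f}x")
print(f"Mamoda2.5 vs Kling 3.0:   {speedup('Kling 3.0', 'Mamoda2.5'):.2f}x")
```

The ratio against Wan2.2 comes out at about 12x, while the ratios against Sora and Kling sit close to 1, which is what "matching" means in this context.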

The Open-Source Wedge

One of the most underappreciated dynamics of 2026 is the open-source divergence. Western frontier models are almost entirely closed. GPT-5.5, Claude Opus 4.7, and Gemini 3.1 are API-only, with no weights released. The closest Western open-source competitor is Meta's Llama 4 Maverick, which scores 67.9 on SWE-bench and ranks around 1,420 on LMArena — competitive but not frontier.

Chinese labs, by contrast, have released six frontier-grade open-weight models since January 2026: GLM-5 (Zhipu AI), Kimi K2.5 (Moonshot), DeepSeek V4 (DeepSeek), MiniMax M2.5, ByteDance Seed-OSS-36B, and Mamoda2.5. These are not small models. GLM-5 is 745 billion parameters trained on Huawei Ascend chips. Kimi K2.5 is a 1-trillion-parameter Mixture of Experts with open weights. DeepSeek V4's 671B-parameter model is available under MIT license.

The implications are profound. A startup in Lagos, a university lab in São Paulo, or an enterprise in Jakarta can download a model that ranks in the global top 10 and run it locally, without API dependency, without data leaving their premises, and without paying per-token fees. The open-source tier is not "almost as good as closed." It is the frontier for most of the world's developers. And it is overwhelmingly Chinese.

Table 6: Open-Source Model Landscape (June 2026)

Model	Provider	Parameters	License	LMArena Elo	SWE-bench	Key Strength
DeepSeek V4	DeepSeek	671B MoE	MIT	1,467	55.4	Coding, reasoning, value
GLM-5.1	Zhipu AI	745B	Open	1,438	52.1	Chinese chips, long context
Kimi K2.6	Moonshot	1T MoE	Open	1,448	54.8	1M context, agent swarm
MiniMax M2.5	MiniMax	456B	Open	1,442	53.2	Multi-SWE-Bench leader
Qwen3.5-72B	Alibaba	72B	Open	1,421	51.3	1B downloads, enterprise
Llama 4 Maverick	Meta	~400B	Open	~1,420	67.9	Best Western open model
Mamoda2.5	ByteDance	250B	Open	~1,435	48.2	Multi-modal, video gen
Mistral Large 3	Mistral	675B	Open	1,425	56.1	#2 open-source on LMArena

*Source: LMArena, vendor repositories, model cards. Elo ratings are approximate for open-weight models that may have been tested at different quantization levels.*

The User Data That Really Matters

Benchmarks are validated by users. The most telling metric is not a test score but a monthly active user count. As of mid-2026, Doubao has 345 million monthly active users in China, making it the most widely used AI assistant in the country by a wide margin. Alibaba's Qwen suite (across all model sizes and integrations) reaches 166 million MAU. Baidu's ERNIE family has approximately 220 million MAU across its various incarnations. DeepSeek's domestic user base, which peaked at 143 million in August 2025, has settled at roughly 89 million after the initial hype cycle.

The critical insight is not the absolute numbers but the migration patterns. When DeepSeek cut its API price by 75% and then made the cut permanent in June 2026, the industry assumed users would simply flock to the cheapest option. But the data from Chai Analytics and QuestMobile shows something more nuanced. Among the 39.4% of DeepSeek's churned users who migrated to a competitor in May 2025, the single largest destination was Doubao, not because Doubao was cheaper, but because Doubao was better integrated into the Chinese digital ecosystem, with native WeChat integration, enterprise workflow tools, and ByteDance's massive content graph feeding its training data.

This is the moat that Western models cannot replicate: ecosystem integration. GPT-5.5 is the most capable model in the world on certain benchmarks, but it does not have native access to WeChat, Taobao, or Baidu's search index. It does not understand the 34,000+ characters of Chinese history, literature, and legal code that CMMLU tests for. It is not trained on the specific conversation patterns, meme formats, and social norms of Chinese internet culture. A model that scores 91.0 on MMLU-Pro but struggles with Chinese idioms and regulatory frameworks is a less useful product in China than a model that scores 87.5 but understands them natively.

Table 7: Chinese AI Assistant User Base (Mid-2026)

Platform	MAU (millions)	Peak MAU	Churn Rate	Primary Moat	LMArena Rank
Doubao	345	380 (Apr 2025)	8.2%	Ecosystem integration, content graph	6th
Baidu ERNIE	220	245 (Aug 2025)	12.1%	Search integration, enterprise	10th
Qwen Suite	166	180 (May 2025)	9.3%	Open-source downloads, cloud	5th
DeepSeek	89	143 (Aug 2025)	37.8%	API pricing, developer trust	4th
Kimi	67	78 (Mar 2025)	11.5%	Long context, agent swarm	7th
MiniMax Talkie	42	45 (Feb 2025)	8.9%	Companion AI, voice, emotional	N/A
Tencent Hunyuan	38	42 (Jan 2025)	10.7%	WeChat integration	~12th
Others	120	N/A	N/A	Niche applications	—

*Source: QuestMobile Mid-2026 Report; Chai Analytics churn study; SuperCLUE usage surveys. MAU figures are estimates based on public disclosures and third-party analytics.*

The Regulatory Tailwind Nobody Predicted

China's AI regulatory framework, often characterized as restrictive by Western observers, has in fact created a competitive advantage for domestic model development. The National Medical Products Administration's (NMPA) AI device approval process has cleared more than 40 AI-assisted medical devices since 2025, with DeepSeek, Baidu, and iFLYTEK models receiving regulatory clearance for clinical use. This creates a market that US models are effectively shut out of, not because they are less capable, but because they lack the regulatory certification, Chinese-language training data, and local clinical partnerships that Chinese models possess.

The same dynamic applies to finance, education, and government. A language model deployed in a Chinese bank's compliance system must understand China's Anti-Money Laundering Law, the Cybersecurity Law, and the Personal Information Protection Law. It must be trained on Chinese court cases, regulatory circulars, and policy documents. Western models can be fine-tuned, but the fundamental training data gap means they start from behind. The regulatory moat is not a wall — it is a learning advantage that compounds with every deployment.

Table 8: NMPA-Cleared AI Medical Devices (2025–2026)

Company	Product	Application	NMPA Class	Date Cleared	Model Base
DeepSeek	DeepSeek-Med V2	Radiology diagnosis	Class III	Mar 2026	DeepSeek V4
Baidu	ERNIE-Med 5.0	Pathology analysis	Class II	Jan 2026	ERNIE 5.1
iFLYTEK	iFLYTEK Health 3.0	Clinical decision support	Class II	Nov 2025	SparkDesk 4.0
Tencent	WeChat Health AI	Symptom triage	Class I	Feb 2026	Hunyuan 2.0
Huawei	Pangu-Med Ultra	Drug discovery	Class II	Apr 2026	Pangu 718B
Alibaba	Tongyi-Med	Oncology imaging	Class III	Dec 2025	Qwen 3.5
Yitu	Dr. Yitu 6.0	Cardiology AI	Class III	Jan 2026	Custom MoE
Infervision	InferRead 5.0	Chest CT screening	Class II	Oct 2025	Custom CNN+LLM

*Source: NMPA Public Database, company announcements, Caixin Medical. Class III is the highest risk classification requiring clinical trials.*

The Contrarian View: Where the Gap Actually Is

None of this means Chinese models are universally superior. The honest analysis of June 2026 benchmarks reveals clear Western advantages in specific domains:

Autonomous agentic performance: GPT-5.5's Terminal-Bench 2.0 score of 82.7 is a genuine outlier. On multi-step autonomous workflows requiring tool use, planning, and error recovery, the OpenAI model is in a class of its own. The gap is not marginal — it is 13 points above Claude, and 15 points above DeepSeek. This matters because 2026 is the year AI stopped being a chatbot and started being a worker. Agents that can operate terminals, write code, run tests, and deploy applications are the future of enterprise AI, and OpenAI is leading this category decisively.

The hardest reasoning: On Humanity's Last Exam and GPQA Diamond, the frontier trio (GPT-5.5, Claude Opus 4.7, Gemini 3.1) still lead. On Humanity's Last Exam, their margin over DeepSeek V4-Pro ranges from 1.5 points (Gemini 3.1) to 7 points (GPT-5.5). For applications requiring frontier-level scientific reasoning, such as molecular modeling, advanced mathematics, or legal analysis at the Supreme Court level, the Western premium is justified.

Ecosystem lock-in outside China: GPT-5.5 and Claude are deeply embedded in Western enterprise workflows, SaaS platforms, and developer tools. GitHub Copilot uses GPT-5.3-Codex. Notion AI uses Claude. Salesforce Einstein uses GPT-5. These integrations are not easily displaced by a cheaper Chinese model, even one with comparable capability. The switching cost is the real moat, and Western models built it first.

The Three Scenarios for 2027

Where does this go? The benchmark data points to three possible futures, each with different implications for the global AI industry:

Scenario 1: Convergence (60% probability). The 39-point Elo gap to GPT-5.5 on LMArena closes to 15 points or less by Q1 2027. DeepSeek or Qwen releases a model that matches GPT-5.5 on the hardest reasoning benchmarks while maintaining a 10x pricing advantage. The global market bifurcates: Western models dominate the premium enterprise tier, Chinese models dominate the volume tier, and open-source Chinese models capture the global developer base. This is the most likely outcome given the current trajectory.

Scenario 2: Divergence (25% probability). Export controls on AI hardware bound for China tighten further, and the compute gap widens. GPT-5.5's successors pull ahead on agentic performance and hardest reasoning by a margin that cannot be closed with clever architecture. Chinese models remain competitive on cost and Chinese-language tasks but lose ground on the global frontier. The industry becomes a two-tier system: frontier and near-frontier.

Scenario 3: Flip (15% probability). A Chinese lab, most likely DeepSeek given its open-source momentum and rumored $50B funding round, achieves a genuine breakthrough on agentic reasoning or autonomous research. The gap flips. Western models become the "expensive alternative" and Chinese models become the default. This would require a discontinuity in capability, not just incremental improvement, but the history of AI is full of such discontinuities.

Table 9: Three-Scenario Outlook for 2027

Scenario	Probability	Trigger	Market Structure	Key Model
Convergence	60%	Elo gap closes to 15 points or less; price advantage holds	Premium West / Volume China / OSS global	DeepSeek V5, Qwen 4.0
Divergence	25%	Export controls tighten; compute gap widens	Two-tier: frontier vs. near-frontier	GPT-5.7, Claude 5.0
Flip	15%	Breakthrough on agentic/reasoning	China becomes default	DeepSeek V5.5, GLM-6

*Source: Author analysis based on current benchmark trajectories, hardware availability projections, and funding announcements.*

What the Data Actually Says

The data is not ambiguous. In June 2026, Chinese models are in the same performance tier as Western frontier models on the majority of benchmarks that matter. They are not behind. They are not "catching up." They are competing — and in some cases, winning on cost-adjusted value, Chinese-language capability, multi-modal performance, and open-source availability.

The gap that remains is narrow, specific, and concentrated in the hardest reasoning and most autonomous agentic tasks. On general knowledge, coding, Chinese-language tasks, and real-world deployment economics, the playing field is level. On price, it is tilted dramatically toward China.

The narrative that Western AI leads by a wide margin is a comfort blanket for observers who have not looked at the numbers. The numbers are available. They are updated weekly. And they are telling a story that the conventional wisdom has not yet caught up with.

The frontier is no longer a Western monopoly. It is a shared frontier. And the side that understands this first will build the products, partnerships, and policies that define the next decade of AI.

Social Chatter

@quant_wang (Weibo): "SWE-bench 55.4 vs 77.2 looks like Claude wins, but multiply by price: $0.87 vs $75. That's 86x more sessions per dollar. For 90% of coding tasks, which one do you actually want?"

@liao_ai_vc (X/Twitter): "The open-source dominance is what Western observers miss. Six Chinese frontier-grade open-weight models in 2026 vs. one Western open model (Llama 4). The global developer default is becoming Chinese."

@deepseek_fan (Zhihu): "DeepSeek V4-Pro at 1,467 Elo vs GPT-5.5 at 1,506. 39 points. That's not a gap. That's a sprint. V3 to V4 was 200 points in 8 months."

@bytedance_engineer (Blind): "Doubao 345M MAU. That's bigger than ChatGPT's estimated US user base. And the integration with WeChat, TikTok, and enterprise tools means no Western model can replicate it without a complete ecosystem rebuild."

@european_policy (LinkedIn): "We're talking about export controls as if they stop Chinese AI. But the benchmarks show Chinese models are at frontier on domestic chips. The controls slowed them, but they didn't stop them. We need a different policy framework."

@alibaba_cloud (X/Twitter): "1 billion downloads for Qwen. Not 1 million. 1 billion. That's not a niche open-source project. That's the global default for developers who can't afford $25/M tokens."

*Methodology note: Benchmark scores are drawn from SuperCLUE June 2026, LMArena Chatbot Arena (independent), SWE-bench Verified, LiveCodeBench, Terminal-Bench 2.0, and vendor technical reports. Where independent scores were not available, vendor-reported scores are labeled inline. Pricing data is estimated from vendor API pages as of June 2026 and may vary by region and volume tier. Elo ratings are statistical estimates; under the standard Elo formula, a 30-point difference corresponds to an expected head-to-head win rate of roughly 54% for the higher-rated model. Social media comments are translated and paraphrased from original Chinese sources.*

By Meeeeed

Editor at AI in China. Tracking Chinese AI companies, funding rounds, and the technologies reshaping global tech.