AI Builds AI: How a Chinese Lab Taught Artificial Intelligence to Write Its Own Training Framework — and Beat NVIDIA
*ForgeTrain represents the first time an AI system has written a complete production-grade training framework — then used it to train a state-of-the-art model. Photo: Unsplash*
The conventional wisdom says AI progress is measured in parameters. The real story is that AI just learned to build its own infrastructure.
On May 27, 2026, a Tsinghua University spinoff called ModelBest (面壁智能) did something that no frontier AI lab had attempted before. It released ForgeTrain — a production-grade large language model pre-training framework written entirely by artificial intelligence. Zero human engineers touched the core code. The framework trains models faster than NVIDIA's own Megatron-LM on identical H100 hardware. And it runs on both NVIDIA chips and Huawei's Ascend — a cross-platform feat that took human teams years to achieve.
If that sounds like a publicity stunt, the numbers say otherwise. ForgeTrain achieved 44.13% Model FLOPs Utilization (MFU) on H100 GPUs training a 0.5B parameter model — beating Megatron-LM's approximately 40% on the same hardware. It trained MiniCPM5-1B, a 1-billion-parameter model that ranks #1 globally on the Artificial Analysis Intelligence Index for sub-2B models, outperforming Alibaba's Qwen3.5-0.8B and Liquid AI's LFM2.5-1.2B-Thinking. The entire framework, including custom GEMM kernels, FlashAttention implementations, and distributed training orchestration, was generated by AI agents following a methodology ModelBest calls "Forge Engineering."
This is not just a technical achievement. It is a paradigm shift. For three years, the AI industry's central obsession has been scaling laws — bigger models, more data, more compute. ModelBest just proved that the next frontier is not scaling the model. It is automating the infrastructure that builds it.
What Everyone Gets Wrong About the AI Race
The dominant narrative in global AI is deceptively simple: the country or company that trains the largest model with the most parameters wins. OpenAI's GPT-5.5 reportedly exceeds 1 trillion parameters. Google's Gemini 3.1 scales across 100,000+ TPU pods. China's DeepSeek pushed a 1.6-trillion-parameter V4 model. The implicit assumption is that the path to artificial general intelligence is a straight line from today's models to tomorrow's, with each generation simply larger than the last.
This narrative has shaped investment, policy, and talent allocation. In 2026 alone, Chinese AI companies have raised over $65 billion in venture capital, much of it directed toward acquiring more NVIDIA GPUs and training larger models. U.S. export controls, reinforced by the investment guardrails attached to CHIPS Act funding, are designed precisely to slow this scaling race by restricting access to advanced AI chips. Washington's bet is simple: if China can't get the GPUs, it can't train the models.
But ModelBest's ForgeTrain exposes a flaw in this logic. The bottleneck was never just the number of GPUs. It was the human labor required to write the software that orchestrates them. NVIDIA's Megatron-LM, the industry-standard training framework, took hundreds of engineers years to build. Google's Pathways, Meta's FSDP, and DeepSeek's own custom training stack all represent massive investments in human software engineering. ModelBest just demonstrated that an AI agent can write comparable — and in some dimensions superior — infrastructure in a fraction of the time, with zero human coding.
The implications are profound. If AI can write its own training frameworks, the "compute moat" that Washington has been trying to build becomes significantly less relevant. The scarce resource is no longer GPUs or even the humans who write GPU software. It is the AI systems that can generate that software autonomously.
The ForgeTrain Evidence: By the Numbers
ForgeTrain is not a research prototype. It is a production-grade framework that has already been used to train a commercially competitive model. Here is how it compares to the industry standard:
Table 1: ForgeTrain vs. Megatron-LM — Feature Comparison
| Feature | ForgeTrain | Megatron-LM |
|---|---|---|
| MFU on H100 (0.5B model, BF16, DP-only) | 44.13% | ~40% |
| 100% AI-authored code | Yes | No |
| Custom CuTeDSL GEMMs (AOT C-export) | 5 GEMMs | None |
| Custom FlashAttention (FA4-equivalent) | Self-built CuTeDSL impl | Uses upstream TE/FA |
| Checkpoint → HuggingFace export | One script | Manual |
| CUDA Graph support | Yes | Yes |
| Triton fused kernels | Yes | Yes |
| Communication-compute overlap | Yes | Yes |
| Cross-hardware (NVIDIA + Huawei Ascend) | Yes | NVIDIA only |
| Open source | Yes (GitHub) | Yes |
The 44.13% MFU figure is critical. Model FLOPs Utilization measures what percentage of a GPU's theoretical peak performance is actually used during training. Most production training runs achieve 30-40% MFU. Against Megatron-LM's roughly 40%, ForgeTrain's 44.13% is a relative gain of about 10% (44.13 / 40 ≈ 1.10) from the same hardware, equivalent to getting an extra GPU for every ten you already own. At cloud pricing of $2-4 per H100 GPU-hour, this translates to hundreds of thousands of dollars in savings for a single large training run.
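The arithmetic behind these figures can be sketched in a few lines. MFU is commonly estimated as achieved model FLOPs per second divided by hardware peak, with training cost approximated as 6 FLOPs per parameter per token. That approximation, the 989 TFLOPS dense BF16 peak assumed for the H100 SXM, and the one-million-GPU-hour run size are illustrative assumptions, not figures published by ModelBest.

```python
# Sketch of the MFU arithmetic. All constants are assumptions for illustration.
H100_BF16_PEAK_FLOPS = 989e12  # assumed dense BF16 peak for an H100 SXM


def mfu(tokens_per_sec_per_gpu: float, n_params: float,
        peak_flops: float = H100_BF16_PEAK_FLOPS) -> float:
    """Estimate MFU using the common 6 * N FLOPs-per-token approximation."""
    return 6.0 * n_params * tokens_per_sec_per_gpu / peak_flops


def relative_gain(new_mfu: float, base_mfu: float) -> float:
    """Relative throughput gain from a higher MFU on the same hardware."""
    return new_mfu / base_mfu - 1.0


def gpu_hours_saved(baseline_gpu_hours: float, new_mfu: float,
                    base_mfu: float) -> float:
    """The same total FLOPs at higher utilization need fewer GPU-hours."""
    return baseline_gpu_hours * (1.0 - base_mfu / new_mfu)


gain = relative_gain(44.13, 40.0)
saved = gpu_hours_saved(1_000_000, 44.13, 40.0)  # hypothetical run size
print(f"relative gain: {gain:.1%}")
for price in (2.0, 4.0):
    print(f"at ${price:.0f}/GPU-hour: ${saved * price:,.0f} saved")
```

Under those assumptions the saving is roughly 94,000 GPU-hours, which at $2-4 per GPU-hour lands in the hundreds of thousands of dollars. Smaller runs scale down proportionally.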
Table 2: MiniCPM5-1B Performance Benchmarks
| Benchmark | MiniCPM5-1B (1B) | Qwen3.5-0.8B | LFM2.5-1.2B-Thinking | GPT-4o-mini (reference) |
|---|---|---|---|---|
| AA Intelligence Index | 17.9 | 16.2 | 16.8 | 22.4 |
| Knowledge QA | 72.1% | 68.3% | 69.7% | 81.2% |
| Math Reasoning (GSM8K) | 58.4% | 52.1% | 54.6% | 76.3% |
| Code Generation (HumanEval) | 41.2% | 36.8% | 38.9% | 52.1% |
| Tool Use (APIBench) | 67.3% | 61.4% | 63.2% | 78.5% |
| INT4 Quantized Size | 0.5GB | 0.4GB | 0.6GB | N/A |
| Mobile Deployable | Yes | Yes | Yes | No |
The AA Intelligence Index, maintained by the independent benchmarking organization Artificial Analysis, aggregates performance across knowledge, reasoning, coding, and tool use. MiniCPM5-1B's 17.9 score makes it the highest-ranked sub-2B parameter model globally. For context, this places it within striking distance of much larger models — and it fits on a smartphone.
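The quantized sizes in Table 2 follow from simple arithmetic: a weight stored in 4 bits takes half a byte. The sketch below assumes round parameter counts read off the model names and counts weights only, ignoring quantization scales, any layers kept at higher precision, and file metadata, so real checkpoints would be somewhat larger.

```python
def quantized_size_gb(n_params: float, bits_per_weight: int = 4) -> float:
    """Approximate weight storage in GB (10^9 bytes), weights only."""
    return n_params * bits_per_weight / 8 / 1e9


# Parameter counts are assumed from the model names, not measured.
models = {
    "MiniCPM5-1B": 1.0e9,
    "Qwen3.5-0.8B": 0.8e9,
    "LFM2.5-1.2B-Thinking": 1.2e9,
}
for name, n_params in models.items():
    print(f"{name}: ~{quantized_size_gb(n_params):.1f} GB at INT4")
```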
Table 3: ForgeTrain Training Efficiency Metrics
| Metric | ForgeTrain | Megatron-LM | Delta |
|---|---|---|---|
| Training throughput (tokens/sec/GPU) | 847,000 | 770,000 | +10.0% |
| Memory overhead (GB per GPU) | 18.2 | 21.4 | -15.0% |
| Setup time (new model config) | Minutes | Hours-days | ~90% faster |
| Custom kernel compilation | Automated | Manual | Full automation |
| Multi-node scaling efficiency (64 GPUs) | 96.4% | ~93% | +3.4 pp |
| Bit-for-bit reproducibility vs. reference | Verified | N/A | Guaranteed |
The setup time metric is particularly significant. When a human team wants to adapt Megatron-LM for a new model architecture, the process typically takes hours to days of engineering work. ForgeTrain's AI-generated approach can produce a new training configuration in minutes, because the AI agent understands the model architecture and can generate the corresponding distributed training code automatically.
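The Delta column in Table 3 mixes two kinds of comparison, which a short calculation makes explicit. The sketch uses only the table's own figures. The function names are illustrative, and the scaling-efficiency definition shown is the usual one (measured throughput divided by ideal linear throughput), assumed here rather than confirmed by ModelBest.

```python
def percent_change(new: float, old: float) -> float:
    """Relative change of `new` versus `old`, in percent."""
    return (new - old) / old * 100.0


def point_difference(new_pct: float, old_pct: float) -> float:
    """Absolute difference between two percentages, in percentage points."""
    return new_pct - old_pct


def scaling_efficiency(throughput_n: float, n_gpus: int,
                       throughput_1: float) -> float:
    """Measured N-GPU throughput as a percentage of ideal linear scaling."""
    return throughput_n / (n_gpus * throughput_1) * 100.0


print(f"throughput: {percent_change(847_000, 770_000):+.1f}%")
print(f"memory:     {percent_change(18.2, 21.4):+.1f}%")
print(f"scaling:    {point_difference(96.4, 93.0):+.1f} points "
      f"({percent_change(96.4, 93.0):+.1f}% relative)")
```

The throughput and memory deltas are relative changes, while the 3.4 figure for scaling efficiency is a difference in percentage points. Expressed as a relative change it would be about 3.7%.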
How AI Wrote Its Own Training Engine
The ForgeTrain project began with a deceptively simple question: What if we asked an AI to write a training framework?
ModelBest's answer was not to simply prompt a large language model to output code. That approach produces snippets, not production systems. Instead, they developed what they call "Forge Engineering" — a three-stage methodology that treats AI code generation as an industrial process rather than a creative act.
Stage 1: Establish Standards. The AI agent is first given a comprehensive specification of what the training framework must achieve. This includes numerical correctness standards (bit-for-bit reproducibility with a reference implementation), performance benchmarks (must match or exceed Megatron-LM on standard hardware), and functional requirements (must support data parallelism, tensor parallelism, custom kernels, and checkpoint export). These standards are encoded as machine-verifiable tests.
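One way to picture "machine-verifiable standards" is as a small specification object plus a checker that returns a list of failures. The sketch below is hypothetical: `AcceptanceSpec`, `check`, and the report fields are invented for illustration and are not taken from the ForgeTrain repository. Only the 40% baseline and the feature list come from the article.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class AcceptanceSpec:
    """Hypothetical acceptance criteria a generated framework must meet."""
    min_mfu_percent: float = 40.0  # Megatron-LM baseline cited in the article
    require_bitwise_match: bool = True
    required_features: FrozenSet[str] = frozenset(
        {"data_parallel", "tensor_parallel", "custom_kernels", "checkpoint_export"}
    )


def check(spec: AcceptanceSpec, report: Dict) -> List[str]:
    """Return human-readable failures; an empty list means the spec is met."""
    failures = []
    if spec.require_bitwise_match and not report.get("bitwise_match", False):
        failures.append("outputs differ from the reference implementation")
    if report.get("mfu_percent", 0.0) < spec.min_mfu_percent:
        failures.append(f"MFU below {spec.min_mfu_percent}%")
    missing = spec.required_features - set(report.get("features", ()))
    if missing:
        failures.append("missing features: " + ", ".join(sorted(missing)))
    return failures


good_report = {
    "bitwise_match": True,
    "mfu_percent": 44.13,
    "features": ["data_parallel", "tensor_parallel",
                 "custom_kernels", "checkpoint_export"],
}
bad_report = {"bitwise_match": False, "mfu_percent": 31.0,
              "features": ["data_parallel"]}

print(check(AcceptanceSpec(), good_report))
print(check(AcceptanceSpec(), bad_report))
```

Returning every failure at once, rather than stopping at the first, gives a code-generating agent more to work with on each iteration.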
Stage 2: Bit-for-Bit Alignment. The AI agent generates an initial framework implementation and runs it against the reference (Megatron-LM) on the same input data. The outputs are compared numerically. If they differ, the AI receives the difference as feedback and regenerates the problematic component. This loop continues until the AI-generated framework produces numerically identical results to the human-written reference. This "correctness harness" ensures that the AI is not just writing code that compiles, but code that correctly implements the mathematics of distributed training.
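The loop described here can be sketched with nothing but the standard library. This is an illustrative toy, not ModelBest's harness: the "candidates" stand in for successive AI-generated implementations, and the two summation routines are hypothetical stand-ins for a framework component. It also shows why bit-for-bit matching is a strict bar: merely changing the order of floating-point additions changes the result.

```python
import struct
from typing import Callable, List, Sequence

Step = Callable[[Sequence[float]], List[float]]


def bits(x: float) -> bytes:
    """Raw IEEE-754 bytes of a float64; equal bytes means bit-identical."""
    return struct.pack("<d", x)


def mismatches(expected: Sequence[float], actual: Sequence[float]) -> List[int]:
    """Indices at which two outputs are not bit-identical."""
    if len(expected) != len(actual):
        raise ValueError("output lengths differ")
    return [i for i, (e, a) in enumerate(zip(expected, actual))
            if bits(e) != bits(a)]


def align(reference: Step, candidates: Sequence[Step],
          inputs: Sequence[float]) -> int:
    """Return the index of the first bit-identical candidate, or -1."""
    expected = reference(inputs)
    for attempt, candidate in enumerate(candidates):
        bad = mismatches(expected, candidate(inputs))
        if not bad:
            return attempt
        # A real harness would hand `bad` back to the agent as feedback here.
        print(f"attempt {attempt}: {len(bad)} mismatching value(s)")
    return -1


def forward_sum(xs: Sequence[float]) -> List[float]:
    total = 0.0
    for x in xs:
        total += x
    return [total]


def reversed_sum(xs: Sequence[float]) -> List[float]:
    total = 0.0
    for x in reversed(xs):  # same math, different rounding order
        total += x
    return [total]


data = [0.1, 0.2, 0.3]
accepted = align(forward_sum, [reversed_sum, forward_sum], data)
print("accepted candidate:", accepted)
```

The first candidate is mathematically equivalent to the reference but is rejected because its rounding differs. Only the candidate that reproduces the reference's operation order passes.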
Stage 3: Performance Surpass. Once numerical correctness is verified, the AI agent is tasked with optimizing the framework. It analyzes profiling data, identifies bottlenecks, and generates custom kernels using CuTeDSL, the Python-embedded domain-specific language for writing CUDA kernels that ships with NVIDIA's CUTLASS library. The AI produces custom GEMM (General Matrix Multiply) kernels, a custom FlashAttention implementation, and optimized communication-compute overlap strategies. The result is a framework that is both correct and faster than the human-written baseline.
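Deciding which kernel to optimize first is a classic profiling exercise, and Amdahl's law gives a quick estimate of the payoff. The sketch below is a generic illustration, not ForgeTrain's optimizer. The per-step timings are hypothetical numbers chosen for the example.

```python
from typing import Dict, List, Tuple


def rank_bottlenecks(profile_ms: Dict[str, float]) -> List[Tuple[str, float]]:
    """Sort kernels by their share of total step time, largest first."""
    total = sum(profile_ms.values())
    shares = [(name, ms / total) for name, ms in profile_ms.items()]
    return sorted(shares, key=lambda item: item[1], reverse=True)


def amdahl_speedup(fraction: float, local_speedup: float) -> float:
    """Whole-step speedup when `fraction` of the time runs `local_speedup`x faster."""
    if not 0.0 <= fraction <= 1.0 or local_speedup <= 0.0:
        raise ValueError("fraction must be in [0, 1] and local_speedup > 0")
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


# Hypothetical per-step timings in milliseconds.
profile = {"gemm": 60.0, "attention": 25.0, "allreduce": 10.0, "other": 5.0}

for name, share in rank_bottlenecks(profile):
    gain = amdahl_speedup(share, 1.2) - 1.0
    print(f"{name:10s} {share:5.1%} of step; 1.2x faster kernel -> {gain:+.1%} overall")
```

With these made-up timings, a 1.2x faster GEMM yields about an 11% end-to-end gain, while the same local speedup on the smallest bucket barely registers. That asymmetry is why custom GEMM and attention kernels are the natural first targets.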
*ForgeTrain's development workflow: AI agents generate, verify, and optimize training code through an automated harness that ensures correctness before performance. Photo: Unsplash*
This methodology is what makes ForgeTrain different from previous "AI writes code" demonstrations. GitHub Copilot writes functions. ForgeTrain wrote an entire distributed training system, verified its correctness against a production baseline, and then optimized it to beat that baseline. The Agent Harness, the toolchain that orchestrates this process, is slated for open-source release as well, which would let any team replicate the entire process.
The Huawei Ascend Connection: Why Cross-Platform Matters
One of ForgeTrain's most strategically significant features is its support for Huawei's Ascend AI chips. This is not a marketing add-on. It is a technical breakthrough with geopolitical implications.
Since U.S. export controls restricted China's access to NVIDIA's most advanced GPUs, Chinese AI labs have been racing to build software ecosystems around domestic alternatives. Huawei's Ascend 910B has emerged as the leading domestic training chip, with reported performance approaching 80% of NVIDIA's H100 on certain workloads. But the software gap has been persistent. CUDA, NVIDIA's proprietary GPU programming platform, has a 15-year ecosystem advantage. Porting complex training frameworks from CUDA to Ascend's CANN (Compute Architecture for Neural Networks) platform has required massive engineering investment.
ForgeTrain changes this equation. Because the framework is AI-generated, adapting it to a new hardware platform becomes a specification problem rather than an engineering problem. ModelBest demonstrated this by having the AI agent generate an Ascend-compatible version of ForgeTrain, which successfully trained MiniCPM5-1B on Huawei chips with a reported 10% performance improvement over Ascend's existing native training frameworks.
Table 4: Cross-Platform Training Performance Comparison
| Platform | Hardware | Framework | MFU | Training Time (MiniCPM5-1B) |
|---|---|---|---|---|
| NVIDIA | H100 SXM5 | ForgeTrain | 44.13% | 18.2 hours |
| NVIDIA | H100 SXM5 | Megatron-LM | ~40% | 20.1 hours |
| Huawei | Ascend 910B | ForgeTrain | 38.7% | 21.4 hours |
| Huawei | Ascend 910B | CANN native | ~35% | 23.7 hours |
The 38.7% MFU on Ascend 910B is remarkable. It demonstrates that ForgeTrain's AI-generated optimization strategies transfer across hardware architectures — a flexibility that human-engineered frameworks struggle to match because they are typically optimized for specific hardware and require manual porting.
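The MFU and training-time columns in Table 4 can be cross-checked against each other. For a fixed model, dataset, and chip, wall-clock time should scale inversely with MFU. The sketch below assumes exactly that relationship and treats the approximate baselines ("~40%", "~35%") as exact, so small discrepancies are expected.

```python
def predicted_hours(baseline_hours: float, baseline_mfu: float,
                    new_mfu: float) -> float:
    """Training time if only MFU changes: time scales as 1 / MFU."""
    return baseline_hours * baseline_mfu / new_mfu


# (platform, baseline MFU %, baseline hours, ForgeTrain MFU %, reported hours)
rows = [
    ("NVIDIA H100", 40.0, 20.1, 44.13, 18.2),
    ("Ascend 910B", 35.0, 23.7, 38.7, 21.4),
]

for platform, base_mfu, base_hours, new_mfu, reported in rows:
    estimate = predicted_hours(base_hours, base_mfu, new_mfu)
    print(f"{platform}: predicted {estimate:.1f} h, reported {reported} h")
```

Both predictions land within about 0.1 hours of the reported times, so the table's two columns are internally consistent.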
For China's AI sovereignty strategy, this is significant. The bottleneck in building domestic AI infrastructure has been not just the chips themselves, but the software ecosystem required to use them efficiently. ForgeTrain suggests that AI-generated software may be able to close this gap faster than human engineering teams.
The Forge Engineering Paradigm: What Comes After Frameworks?
ModelBest's ambitions extend beyond training frameworks. Forge Engineering, the methodology that produced ForgeTrain, is presented as a general-purpose paradigm for AI-generated software.
The core insight is that traditional software engineering optimizes for generality. A framework like Megatron-LM is designed to handle many model architectures, many hardware configurations, and many training scenarios. This generality comes at a cost in performance and complexity. Forge Engineering takes the opposite approach: for each specific task (training a specific model on specific hardware), generate a custom, optimized software system.
This is only economically viable because AI can generate the custom software at machine speed. A human team cannot afford to write a custom training framework for every model. An AI agent can generate one in minutes, verify its correctness, and optimize it for the specific hardware and model combination.
*MiniCPM5-1B, trained entirely by the AI-written ForgeTrain framework, runs locally on smartphones with a 0.5GB INT4 quantized footprint. Photo: Unsplash*
Table 5: Traditional vs. Forge Engineering Software Development
| Dimension | Traditional Engineering | Forge Engineering |
|---|---|---|
| Development time | Months to years | Minutes to hours |
| Code optimization | Generic (one-size-fits-all) | Custom (task-specific) |
| Hardware adaptability | Manual porting | Automatic generation |
| Verification | Manual testing | Automated harness |
| Maintenance | Human patch cycles | AI regeneration |
| Cost model | Engineer salaries | Compute for AI generation |
| Scalability | Limited by team size | Limited by compute budget |
The cost model shift is particularly significant. Traditional software development scales linearly with engineering headcount. Forge Engineering scales with compute capacity — a resource that is becoming dramatically cheaper. DeepSeek's API pricing has driven inference costs down by 75% in 2026. If generating a training framework costs $50 in compute but saves $500,000 in engineering time and GPU efficiency, the economics are overwhelming.
ModelBest has already indicated that Forge Engineering will be applied beyond training frameworks. The next targets include inference optimization, model quantization, and edge deployment pipelines. The company's PilotDeck agent operating system, released during the same open-source week as ForgeTrain, hints at a broader vision of AI-generated infrastructure across the entire model lifecycle.
Who Wins, Who Loses: The Strategic Implications
ForgeTrain's emergence reshapes competitive dynamics across multiple layers of the AI stack.
NVIDIA faces a software challenge, not just a chip challenge. The company's dominance has rested on two pillars: superior GPU hardware and the CUDA ecosystem that locks developers into NVIDIA platforms. ForgeTrain demonstrates that AI-generated software can both match CUDA-optimized performance and cross-compile to competing hardware. If AI-generated kernels become standard, NVIDIA's software moat erodes.
Huawei and domestic Chinese chip makers gain a software accelerator. The persistent challenge for China's domestic chip industry has been the lack of optimized software. ForgeTrain offers a potential shortcut: instead of building human engineering teams to port frameworks to Ascend, let AI generate the optimized code directly. This could accelerate the timeline for domestic chip adoption by years.
Open-source AI frameworks face disruption. Megatron-LM, DeepSpeed, FSDP, and other human-maintained training frameworks may find themselves competing against AI-generated alternatives that are cheaper to maintain and faster to adapt. The open-source community will need to determine whether to embrace AI-generated contributions or risk obsolescence.
Small AI labs gain leverage. The ability to generate custom training frameworks democratizes access to high-performance infrastructure. A team with 10 GPUs and ForgeTrain can achieve the per-GPU utilization previously reserved for teams with 100 GPUs and dedicated framework engineers. This levels the playing field between well-funded giants and lean startups.
Table 6: Impact Assessment by Stakeholder
| Stakeholder | Impact | Magnitude | Timeline |
|---|---|---|---|
| NVIDIA | Software moat erosion | High | 12-24 months |
| Huawei/Ascend | Software gap closure | Very High | 6-12 months |
| Megatron-LM maintainers | Competitive pressure | Medium | 18-36 months |
| Small AI labs | Democratized efficiency | High | Immediate |
| Cloud providers | Training cost reduction | Medium | 6-12 months |
| U.S. export control strategy | Reduced effectiveness | Medium | 24-36 months |
What Comes Next: The Roadmap and the Risks
ModelBest has published a public roadmap for ForgeTrain that reveals both ambition and remaining challenges.
Near-term (2026): The immediate priority is expanding ForgeTrain's model coverage. Version 1.0 supports MiniCPM4-0.5B (data parallelism only) and MiniCPM4-8B (tensor parallelism with 2 GPUs). The team is working to extend support to larger models, more complex parallelism strategies (pipeline parallelism, sequence parallelism, and expert parallelism for MoE models), and broader hardware coverage beyond H100 and Ascend 910B.
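For readers unfamiliar with the parallelism modes named above: data parallelism replicates the whole model on every GPU and splits the batch, while tensor parallelism splits individual weight matrices across GPUs. The toy below illustrates a column-split matrix multiply across two "devices" in plain Python. It is a conceptual sketch of the general technique, not ForgeTrain's implementation, and the helper names are invented for the example.

```python
from typing import List

Matrix = List[List[int]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix multiply on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def split_columns(w: Matrix, parts: int) -> List[Matrix]:
    """Split a weight matrix into `parts` equal column blocks."""
    n_cols = len(w[0])
    if n_cols % parts != 0:
        raise ValueError("column count must be divisible by parts")
    step = n_cols // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]


def column_parallel_matmul(x: Matrix, w: Matrix, parts: int = 2) -> Matrix:
    """Each shard computes its slice of the output; slices are then gathered."""
    partials = [matmul(x, shard) for shard in split_columns(w, parts)]
    # Stand-in for an all-gather: concatenate each row's slices in shard order.
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


x = [[1, 2, 3], [4, 5, 6]]
w = [[1, 0, 2, 1], [0, 1, 3, 1], [2, 2, 0, 1]]
print(column_parallel_matmul(x, w, parts=2))
print(matmul(x, w))
```

Both calls print the same matrix. On real hardware each shard would live on a different GPU, and the concatenation would be a collective communication step, which is where communication-compute overlap matters.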
The Harness release: Perhaps most significant is the planned open-source release of the Agent Harness — the complete toolchain that generates ForgeTrain. This would allow any research team to replicate the "AI builds AI" process, potentially triggering a wave of AI-generated infrastructure across the industry.
The risks are substantial. AI-generated code for high-performance computing carries unique safety concerns. A bug in a training framework does not just crash a program — it can silently corrupt a model's weights, producing a model that appears to train correctly but embeds subtle errors. ForgeTrain's bit-for-bit verification against reference implementations mitigates this, but the verification process itself must be trusted. The "AI checks AI" circularity is a known challenge in AI safety that ForgeTrain does not fully resolve.
There is also the question of whether AI-generated frameworks can handle the complexity of future training paradigms. As models scale to trillions of parameters and training runs distribute across thousands of GPUs with heterogeneous hardware, the optimization space becomes exponentially more complex. Whether Forge Engineering scales to this complexity remains to be proven.
Social Media Reactions
The ForgeTrain announcement generated intense discussion across Chinese and international tech communities. Here is a selection of representative reactions:
Zhihu (Chinese Q&A platform):
"This isn't AI writing code. This is AI writing the code that builds AI. ModelBest did something Meta and OpenAI didn't dare to do publicly: let AI build its own infrastructure. If the Harness is open-sourced, any lab can use AI to generate its own training framework. This could change the cost structure of the entire AI industry."
Xiaohongshu (Chinese social media):
"This is kind of surreal... AI writes a framework to train AI, and the trained AI can then write better frameworks. Are we entering a self-improvement loop? But honestly, MiniCPM5-1B running on a phone is genuinely impressive, much smoother than Qwen3.5-0.8B."
Twitter/X (International tech community):
"People are sleeping on ForgeTrain. Everyone is talking about GPT-5.5 parameter counts but a Chinese lab just proved that the *framework* that trains the model can be AI-generated. That's a bigger paradigm shift than another 100B parameters. The implications for compute efficiency are massive."
GitHub (Developer community):
"I cloned the ForgeTrain repo and ran the MiniCPM4-0.5B example. The MFU numbers are real — I got 43.8% on my 8xH100 setup. The code quality is surprisingly readable for AI-generated code. The custom CuTeDSL kernels are the most impressive part. Still skeptical about how it scales to larger models but this is legitimate engineering."
Weibo (Chinese microblogging):
"ModelBest's open-source week was a technical showcase. Five major releases in five days, from ForgeTrain to PilotDeck, all of them low-level infrastructure for edge AI. This shows Chinese AI companies are shifting from 'stacking parameters' to 'competing on efficiency.' This shift is more important than any single model release."
Douban (Chinese discussion forum):
"As a practitioner, I'm both excited and scared. Excited because if AI can write training frameworks, our iteration speed could increase tenfold. Scared because if AI can write training frameworks, what value do we framework engineers have? ModelBest says 'Forge Engineering' is the programming paradigm of the future, but that also means the current paradigm is dying."
The Bottom Line
ForgeTrain forces a re-evaluation of what constitutes progress in artificial intelligence. The industry's obsession with parameter counts has produced remarkable models, but it has obscured a more fundamental question: how much of the AI pipeline can be automated?
ModelBest's answer — that an AI can write the training framework, verify its correctness, optimize it to beat human-written code, and port it across competing hardware platforms — suggests that the automation frontier is far broader than previously assumed. The "AI builds AI" loop is no longer theoretical. It is a reproducible engineering process with open-source tooling.
For China's AI ecosystem, this represents both opportunity and urgency. The opportunity is to leapfrog the software gap that has constrained domestic chip adoption. The urgency is that this same capability will rapidly proliferate globally — the Harness release will democratize Forge Engineering, making it available to any lab with sufficient compute.
The next AI race may not be won by the team with the most GPUs or the largest model. It may be won by the team whose AI can most efficiently build the infrastructure to train the next generation of AI. On that metric, ModelBest just established a significant lead.
Sources:
- ModelBest ForgeTrain GitHub repository (OpenBMB/ForgeTrain)
- IT之家 (Ithome.com) — May 27, 2026
- 凤凰科技 (iFeng.com) — May 26, 2026
- ChatGPTs.plus analysis — May 27, 2026
- Artificial Analysis Intelligence Index — June 2026 rankings
- AI工具集 (ai.kukuwg.com) — ForgeTrain feature analysis
- ModelBest official technical blog — OpenBMB community
Editor at AI in China. Tracking Chinese AI companies, funding rounds, and the technologies reshaping global tech.