GLM-5.2: The First Open-Weights Model to Close the Gap on Long-Horizon Coding
By AgentRiot Editorial
Z.AI releases 744B-parameter MoE with 1M context, MIT license, and benchmark scores that put it within single-digit points of Claude Opus 4.8 on long-horizon coding tasks.

On June 13, 2026, Chinese AI lab Z.AI (Zhipu AI) launched GLM-5.2. Four days later, on June 17, the company released the model weights on Hugging Face under an MIT license and published an official blog post with full benchmark results. The release timing was notable: it came one day after the U.S. Commerce Department suspended global access to Claude Fable 5, according to tracking site awesomeagents.ai.
GLM-5.2 is a 744–753 billion parameter Mixture-of-Experts model with 40 billion active parameters per token. It is text only, with no vision capabilities, and supports a context window of 1,048,576 tokens, with a maximum output length of 262,144 tokens. The pricing, listed on docs.z.ai, is unchanged from GLM-5.1: $1.40 per million input tokens, $0.26 per million cached input tokens, and $4.40 per million output tokens.
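To make the pricing concrete, here is a minimal cost estimate using the three per-million-token rates quoted above. It assumes that cached tokens are a subset of the input tokens and are billed at the cached rate instead of the full input rate; that billing model, the `request_cost` helper, and the example token counts are illustrative assumptions, not taken from docs.z.ai.

```python
from decimal import Decimal

# Per-million-token prices quoted in the article (USD).
PRICE_INPUT = Decimal("1.40")
PRICE_CACHED_INPUT = Decimal("0.26")
PRICE_OUTPUT = Decimal("4.40")
MILLION = Decimal(1_000_000)


def request_cost(input_tokens: int, cached_tokens: int, output_tokens: int) -> Decimal:
    """Estimate USD cost; cached_tokens is the part of input_tokens billed at the cached rate.

    Assumption: cached input replaces (rather than adds to) full-price input.
    """
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    fresh_tokens = input_tokens - cached_tokens
    total = (
        fresh_tokens * PRICE_INPUT
        + cached_tokens * PRICE_CACHED_INPUT
        + output_tokens * PRICE_OUTPUT
    ) / MILLION
    return total.quantize(Decimal("0.0001"))


# Hypothetical agent turn: 800k-token context, 600k of it cached, 20k tokens out.
print(request_cost(800_000, 600_000, 20_000))
```

Under these assumptions, a long-context agent turn is dominated by uncached input, which is why cache hit rate matters more than output length for repeated calls over the same codebase.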
The headline finding is not that GLM-5.2 tops every open-source benchmark. It is that it closes the gap to closed frontier models on tasks that require sustained reasoning over long contexts—something no previous open-weights model had done.
What Changed: Architecture
Z.AI made three specific architectural changes in GLM-5.2.
First, IndexShare reuses a single indexer across each group of four sparse attention layers. At 1 million tokens, this reduces per-token FLOPs by 2.9× compared to a standard sparse attention design. The practical implication is that the 1M context window is not just a spec-sheet number. It is computationally feasible to use.
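The sharing pattern can be pictured with a toy schedule. This is only an illustration of "one indexer result per group of four layers", not Z.AI's implementation: the function names, the 12-layer stack, and the choice that each group reuses the index computed at its first layer are all assumptions.

```python
def indexer_schedule(num_sparse_layers: int, share_every: int = 4) -> list[int]:
    """For each sparse attention layer, return the layer whose indexer output it reuses.

    Assumption: the first layer of each group computes the index, the rest reuse it.
    """
    return [layer - (layer % share_every) for layer in range(num_sparse_layers)]


def indexer_runs(num_sparse_layers: int, share_every: int = 4) -> int:
    """Count how many times the indexer actually has to run."""
    return len(set(indexer_schedule(num_sparse_layers, share_every)))


# Hypothetical 12-layer stack: layers 0-3 share one index, 4-7 the next, 8-11 the last.
print(indexer_schedule(12))
print(indexer_runs(12), "indexer runs instead of", indexer_runs(12, share_every=1))
```

The toy only counts indexer invocations. It does not reproduce the 2.9× figure, which depends on how much of the per-token compute the indexer accounts for at a given context length.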
Second, an improved Multi-Token Prediction (MTP) layer for speculative decoding increases acceptance length by up to 20%. This is a throughput optimization, not a capability gain, but it matters for agentic use cases where latency accumulates across many tool calls.
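A rough way to see why acceptance length matters for agents: if each forward pass of the main model emits more accepted tokens on average, fewer passes are needed for the same output. The sketch below assumes "acceptance length" means average tokens emitted per verification pass; the baseline value of 2.5 and the 30,000-token workload are hypothetical and do not come from Z.AI.

```python
import math


def verification_passes(tokens_to_emit: int, acceptance_length: float) -> int:
    """Main-model passes needed if each pass emits `acceptance_length` tokens on average."""
    if acceptance_length <= 0:
        raise ValueError("acceptance_length must be positive")
    return math.ceil(tokens_to_emit / acceptance_length)


BASELINE_ACCEPTANCE = 2.5                         # hypothetical, not from the article
IMPROVED_ACCEPTANCE = BASELINE_ACCEPTANCE * 1.20  # the article's "up to 20%" upper bound

before = verification_passes(30_000, BASELINE_ACCEPTANCE)
after = verification_passes(30_000, IMPROVED_ACCEPTANCE)
print(before, after, f"{1 - after / before:.1%} fewer passes")
```

Because the 20% figure is an upper bound, real savings would likely be smaller and would vary with how predictable the generated text is.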
Third, the model supports two thinking modes: standard and deep thinking, with flexible effort levels. This is similar to the reasoning-mode switches offered by Anthropic and OpenAI, but with granular control over compute allocation.
Standard Coding: Best-in-Class for Open Weights
On standard coding benchmarks, GLM-5.2 is the strongest open-source model released to date.
On Terminal-Bench 2.1 (Terminus-2), GLM-5.2 scores 81.0, up from GLM-5.1's 63.5. The gap to the closed frontier is narrow. Claude Opus 4.8 scores 85.0, GPT-5.5 scores 84.0, and Gemini 3.1 Pro scores 74.0. On the best-reported harness for the same benchmark, GLM-5.2 reaches 82.7.
On SWE-bench Pro, GLM-5.2 scores 62.1, compared to GLM-5.1's 58.4. Claude Opus 4.8 leads at 69.2, GPT-5.5 is at 58.6, and Gemini 3.1 Pro is at 54.2. GLM-5.2 is within 7.1 points of the best closed model and ahead of GPT-5.5.
Other standard coding results:
- NL2Repo: 48.9 (GLM-5.1: 42.7)
- DeepSWE: 46.2 (GLM-5.1: 18.0)
- ProgramBench: 63.7 (GLM-5.1: 50.9)
The DeepSWE jump, from 18.0 to 46.2, is the largest relative improvement in the standard coding suite. DeepSWE tests deep, multi-step software engineering reasoning. The roughly 2.57× improvement suggests the architecture changes specifically benefited tasks that require maintaining state across many reasoning steps.
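The relative improvements can be checked directly from the scores quoted in this section. The short script below is just that arithmetic; the dictionary layout and variable names are ours.

```python
# (GLM-5.1, GLM-5.2) scores as quoted in the article.
standard_coding = {
    "Terminal-Bench 2.1": (63.5, 81.0),
    "SWE-bench Pro": (58.4, 62.1),
    "NL2Repo": (42.7, 48.9),
    "DeepSWE": (18.0, 46.2),
    "ProgramBench": (50.9, 63.7),
}

# Ratio of new score to old score for each benchmark.
ratios = {name: new / old for name, (old, new) in standard_coding.items()}

# Print benchmarks from largest to smallest relative gain.
for name, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<20} {ratio:.2f}x")

largest = max(ratios, key=ratios.get)
```

DeepSWE comes out on top by a wide margin; every other benchmark in the suite improves by well under 1.5×.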
Long-Horizon: The Gap Closes
Long-horizon benchmarks measure performance on tasks that require sustained effort over many steps. These tasks often involve hundreds or thousands of tool calls, file edits, or reasoning chains. These are the benchmarks where open-weights models have historically trailed closed frontier models by the widest margins.
GLM-5.2 is now the highest-ranked open-source model on all three long-horizon benchmarks in Z.AI's evaluation suite.
On FrontierSWE (Dominance), GLM-5.2 scores 74.4, up from GLM-5.1's 30.5. Claude Opus 4.8 leads narrowly at 75.1; GPT-5.5 is at 72.6. The gap between GLM-5.2 and the best closed model is 0.7 points.
On PostTrainBench, GLM-5.2 scores 34.3, compared to GLM-5.1's 20.1. Claude Opus 4.8 is at 37.2, GPT-5.5 at 28.4. The gap is 2.9 points.
On SWE-Marathon, GLM-5.2 scores 13.0, up from GLM-5.1's 1.0. Claude Opus 4.8 leads at 26.0; GPT-5.5 is at 12.0. Here the gap is wider, 13.0 points, but GLM-5.2 is now ahead of GPT-5.5 and has roughly halved the 25.0-point deficit that GLM-5.1 had against the frontier.
The SWE-Marathon result is particularly telling. A jump from 1.0 to 13.0 indicates that GLM-5.1 was essentially non-functional on this benchmark, while GLM-5.2 operates at a level comparable to GPT-5.5. This is not incremental improvement; it is a category change.
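For readers who want to verify the gaps, the following snippet recomputes them from the long-horizon scores quoted above. Only the numbers come from the article; the structure and field names are illustrative.

```python
# (GLM-5.1, GLM-5.2, Claude Opus 4.8, GPT-5.5) scores as quoted in the article.
long_horizon = {
    "FrontierSWE (Dominance)": (30.5, 74.4, 75.1, 72.6),
    "PostTrainBench": (20.1, 34.3, 37.2, 28.4),
    "SWE-Marathon": (1.0, 13.0, 26.0, 12.0),
}

gaps = {}
for name, (glm51, glm52, opus, gpt) in long_horizon.items():
    gaps[name] = {
        "old_gap": round(opus - glm51, 1),  # GLM-5.1 deficit to Claude Opus 4.8
        "new_gap": round(opus - glm52, 1),  # GLM-5.2 deficit to Claude Opus 4.8
        "ahead_of_gpt": glm52 > gpt,        # does GLM-5.2 beat GPT-5.5?
    }
    print(name, gaps[name])
```

On all three benchmarks the deficit to Claude Opus 4.8 shrinks and GLM-5.2 lands ahead of GPT-5.5, with SWE-Marathon the only one where the remaining gap is in double digits.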
Reasoning: Near-Perfect on Math Competitions
On reasoning benchmarks, GLM-5.2 posts several near-perfect scores.
- AIME 2026: 99.2 (GLM-5.1: 95.3)
- HMMT November 2025: 94.4 (GLM-5.1: 94.0)
- HMMT February 2026: 92.5 (GLM-5.1: 82.6)
- IMOAnswerBench: 91.0 (GLM-5.1: 83.8)
- GPQA-Diamond: 91.2 (GLM-5.1: 86.2)
- HLE: 40.5 (GLM-5.1: 31.0)
- HLE (with tools): 54.7 (GLM-5.1: 52.3)
The AIME 2026 score of 99.2 is among the highest reported by any model. The HMMT February 2026 result, 92.5 vs. 82.6, is a 9.9-point improvement on a competition math benchmark where GLM-5.1 still had clear headroom, in contrast to HMMT November 2025, where both models score around 94. The HLE (Humanity's Last Exam) results, particularly the 54.7 with tools, indicate strong performance on adversarially constructed reasoning problems.
Agentic and Tool Use
On agentic benchmarks, GLM-5.2 continues the upward trend:
- MCP-Atlas (Public Set): 76.8 (GLM-5.1: 71.8)
- Tool-Decathlon: 48.2 (GLM-5.1: 40.7)
These are not headline numbers compared to the coding and reasoning results, but the 7.5-point improvement on Tool-Decathlon suggests better integration of tool use into multi-step reasoning chains.
What the Field Is Saying
Simon Willison, in his analysis of the release, called GLM-5.2 "probably the most powerful text-only open weights LLM." The Artificial Analysis Intelligence Index ranks it as the leading open-weights model. These are external assessments, not marketing claims, and they align with the benchmark data.
Tradeoffs and Limitations
GLM-5.2 is not a universal replacement for closed frontier models. Three limitations are worth noting.
Text-only. The model has no vision capabilities. For tasks that require image understanding, chart parsing, or multimodal reasoning, users will still need Claude Opus 4.8, GPT-5.5, or Gemini 3.1 Pro.
Weight size. The full model is approximately 1.5TB. This is larger than most open-weights models and requires significant infrastructure to run locally. The 40B active parameters per token keep inference costs manageable, but the memory footprint for loading the model is substantial.
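The size estimate follows from simple arithmetic on the parameter count. The sketch below shows raw weight storage at a few common precisions; the byte widths are general conventions, and which precision the released checkpoint actually uses is an assumption here, not something confirmed by the source.

```python
# Bytes per parameter at common storage precisions (general conventions, not from Z.AI).
BYTES_PER_PARAM = {"bf16/fp16": 2.0, "fp8/int8": 1.0, "4-bit": 0.5}


def weight_size_tb(params_billion: float, bytes_per_param: float) -> float:
    """Raw weight storage in decimal terabytes, ignoring metadata and file overhead."""
    return params_billion * 1e9 * bytes_per_param / 1e12


# Use both ends of the 744-753B parameter range quoted in the article.
for label, width in BYTES_PER_PARAM.items():
    low = weight_size_tb(744, width)
    high = weight_size_tb(753, width)
    print(f"{label:<10} {low:.2f}-{high:.2f} TB")
```

The roughly 1.5TB figure is consistent with 16-bit weights. Lower-precision formats would shrink the footprint proportionally, at whatever quality cost the quantization introduces.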
Still trails on some benchmarks. On SWE-bench Pro, GLM-5.2 is 7.1 points behind Claude Opus 4.8. On SWE-Marathon, the gap is 13.0 points. On Terminal-Bench 2.1, it is 4.0 points behind. These are small gaps by historical standards, but they are real.
The License Matters
The MIT license is not a minor footnote. It means no regional restrictions, no usage limitations, and no commercial licensing requirements. The weights are fully open. This is a different posture from models released under custom licenses with geographic or usage restrictions. For teams building products on top of open weights, the legal simplicity of MIT is a material advantage.
Why This Release Is Different
Open-weights models have improved steadily over the past two years, but the gap to closed frontier models on long-horizon tasks has remained stubbornly wide. GLM-5.2 is the first open-weights model to close that gap to single-digit points on benchmarks like FrontierSWE and PostTrainBench, while simultaneously leading all open-weights models on standard coding tasks.
The 1M context window, combined with the 2.9× FLOP reduction from IndexShare, means this is not just a theoretical capability. It is usable for coding agents that need to ingest entire codebases, maintain long conversation histories, or perform multi-file edits across large repositories.
The timing, one day after the U.S. restricted access to Claude Fable 5, has drawn attention, but the model's standing is independent of geopolitics. The benchmarks, the architecture, and the license speak for themselves.
Claims Ledger
| Claim | Source |
|---|---|
| Release date: June 13, 2026 (model), June 17, 2026 (weights, blog post) | Z.AI official blog post, Hugging Face release |
| 744–753B parameters, 40B active, MoE architecture | Z.AI official blog post |
| 1,048,576 token context, 262,144 token max output | Z.AI official blog post |
| MIT license, no regional restrictions | Hugging Face model card, Z.AI blog post |
| Pricing: $1.40/$0.26/$4.40 per 1M tokens | docs.z.ai |
| IndexShare reduces FLOPs 2.9× at 1M context | Z.AI official blog post |
| MTP acceptance length +20% | Z.AI official blog post |
| All benchmark scores | Z.AI official Hugging Face blog post, June 17, 2026 |
| Simon Willison quote | Simon Willison's public analysis |
| Artificial Analysis ranking | Artificial Analysis Intelligence Index |
| Timing relative to Claude Fable 5 suspension | awesomeagents.ai |
| ~1.5TB weight size | Inferred from parameter count at 16-bit precision (about 2 bytes per parameter); exact size depends on precision and format |
AgentRiot — June 2026

