GLM 5: Large-Scale Agentic Reasoning Model
GLM 5 is a 744B-parameter Mixture-of-Experts model with 40B active parameters, developed by Z.ai. It targets complex systems engineering, long-horizon agentic tasks, and advanced reasoning, building on the GLM 4.7 foundation with doubled total parameters and an expanded expert pool.
Key Features
- Agentic Task Completion - Achieves 77.8% on SWE-bench Verified and 73.3% on SWE-bench Multilingual, with strong Terminal-Bench 2.0 performance (56.2% with both the Terminus and Claude Code harnesses)
- Complex Reasoning - Scores 92.7% on AIME 2026 I, 96.9% on HMMT Nov. 2025, 82.5% on IMO-AnswerBench, and 86.0% on GPQA-Diamond
- Tool Use and Browsing - Native tool calling with 62.0% on BrowseComp, 89.7% on tau-2-Bench, and 67.8% on MCP-Atlas; 50.4% on Humanity's Last Exam with tool access (see the request sketch after this list)
- Cybersecurity - 43.2% on CyberGym for systems-level security tasks
- Interleaved Thinking - Reasons before every response and tool call, with turn-level control over reasoning depth
- Bilingual - Native support for both English and Chinese
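The tool-use figures above rest on native tool calling. Below is a minimal sketch of invoking it through an OpenAI-compatible endpoint; the base URL, the `glm-5` model identifier, and the `get_weather` tool are illustrative assumptions, not part of GLM 5's published interface.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # assumed endpoint

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, for illustration only
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="glm-5",  # assumed model identifier
    messages=[{"role": "user", "content": "What is the weather in Beijing?"}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)  # populated when the model calls the tool
```

Per the interleaved-thinking behavior above, the model reasons before deciding whether to emit a tool call or a final response.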
Use Cases
- Software engineering, code generation, and multi-file repository-level tasks
- Multi-step agentic workflows with tool calling and web browsing
- Complex mathematical reasoning and competition-level problem solving
- Terminal-based development, operations, and systems administration
- Cybersecurity analysis and systems engineering
- Research tasks requiring extended browsing and context management
- Long-form document analysis and generation
Architecture and Design
GLM 5 uses a Mixture-of-Experts architecture with 256 routed experts and 1 shared expert per layer, activating 8 experts per token. The first 3 layers are dense, while the remaining 75 layers use MoE routing with a sigmoid scoring function. The model employs Multi-head Latent Attention (MLA) with low-rank compressed query and key-value projections (Q LoRA rank 2048, KV LoRA rank 512) for memory-efficient inference.
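To make the routing description concrete, here is a minimal PyTorch sketch of sigmoid-scored top-8 routing over 256 routed experts plus one shared expert. The toy hidden width, the expert MLP shape, and the renormalization of selected scores are assumptions for illustration, not GLM 5's actual implementation.

```python
import torch
import torch.nn as nn

NUM_ROUTED, TOP_K = 256, 8   # from the description above
D_MODEL = 64                 # toy width for illustration only

class MoELayer(nn.Module):
    def __init__(self):
        super().__init__()
        self.router = nn.Linear(D_MODEL, NUM_ROUTED, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(D_MODEL, D_MODEL), nn.SiLU(),
                          nn.Linear(D_MODEL, D_MODEL))
            for _ in range(NUM_ROUTED)
        )
        self.shared_expert = nn.Sequential(
            nn.Linear(D_MODEL, D_MODEL), nn.SiLU(), nn.Linear(D_MODEL, D_MODEL)
        )

    def forward(self, x):                                   # x: (num_tokens, d_model)
        scores = torch.sigmoid(self.router(x))              # sigmoid scoring, not softmax
        weights, idx = scores.topk(TOP_K, dim=-1)           # pick 8 experts per token
        weights = weights / weights.sum(-1, keepdim=True)   # renormalize selected scores
        routed = torch.zeros_like(x)
        for t in range(x.size(0)):                          # naive per-token loop for clarity
            for k in range(TOP_K):
                routed[t] += weights[t, k] * self.experts[int(idx[t, k])](x[t])
        return routed + self.shared_expert(x)               # shared expert sees every token

y = MoELayer()(torch.randn(4, D_MODEL))                     # route 4 toy tokens
```

Because only 8 of 256 routed experts fire per token, per-token compute tracks the 40B active parameters rather than the 744B total.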
The model integrates DeepSeek Sparse Attention (DSA) to reduce deployment cost while preserving long-context capacity across its 128K token context window. A single Multi-Token Prediction (MTP) layer enables speculative decoding for improved inference throughput.
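As a sketch of why the MTP layer improves throughput, the toy loop below drafts one token with a cheap head and verifies it with a single main-model pass, accepting two tokens when they agree. The `main_logits` and `mtp_logits` functions are stand-ins for illustration, not GLM 5's real interfaces.

```python
import torch

VOCAB = 100  # toy vocabulary size

def main_logits(tokens):           # stand-in for the full model's forward pass
    return torch.randn(len(tokens), VOCAB)

def mtp_logits(tokens):            # stand-in for the single cheap MTP draft layer
    return torch.randn(VOCAB)

def speculative_step(tokens):
    draft = mtp_logits(tokens).argmax().view(1)        # 1. draft one token cheaply
    logits = main_logits(torch.cat([tokens, draft]))   # 2. verify in one main pass
    verified = logits[-2].argmax().view(1)             # main model's choice at the draft slot
    if verified.item() == draft.item():                # 3. agreement: accept draft + bonus token
        bonus = logits[-1].argmax().view(1)
        return torch.cat([tokens, draft, bonus])
    return torch.cat([tokens, verified])               # disagreement: keep the verified token

out = speculative_step(torch.tensor([5, 17, 42]))
```

When the draft head's acceptance rate is high, each verification pass yields two tokens instead of one, which is where the decoding speedup comes from.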
Training Approach
GLM 5 was pre-trained on 28.5 trillion tokens, up from the 23 trillion used for GLM 4.5. Post-training uses SLIME, an asynchronous reinforcement learning infrastructure designed for training efficiency at scale. The model defaults to thinking mode with temperature 1.0 and top-p 0.95 for general reasoning tasks; temperature 0.7 is recommended for coding tasks, as in the example below.
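A minimal request sketch applying those sampling settings through an OpenAI-compatible endpoint follows; the base URL and the `glm-5` model identifier are deployment-specific assumptions.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # assumed local server

# General reasoning: thinking-mode defaults of temperature 1.0 and top-p 0.95.
reasoning = client.chat.completions.create(
    model="glm-5",  # assumed model identifier
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
    temperature=1.0,
    top_p=0.95,
)

# Coding tasks: drop temperature to 0.7 as recommended above.
coding = client.chat.completions.create(
    model="glm-5",
    messages=[{"role": "user", "content": "Write an LRU cache in Python."}],
    temperature=0.7,
    top_p=0.95,
)
```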
Deploy GLM 5 on Vast.ai for access to frontier-class agentic reasoning, coding, and tool use capabilities with flexible GPU infrastructure.