GLM 5: Large-Scale Agentic Reasoning Model
GLM 5 is a 744B-parameter Mixture-of-Experts model with 40B active parameters, developed by Z.ai. It targets complex systems engineering, long-horizon agentic tasks, and advanced reasoning, building on the GLM 4.7 foundation with doubled total parameters and an expanded expert pool.
Key Features
- Agentic Task Completion - Achieves 77.8% on SWE-bench Verified and 73.3% on SWE-bench Multilingual, with strong Terminal-Bench 2.0 performance (56.2% with both the Terminus and Claude Code harnesses)
- Complex Reasoning - Scores 92.7% on AIME 2026 I, 96.9% on HMMT Nov. 2025, 82.5% on IMO-AnswerBench, and 86.0% on GPQA-Diamond
- Tool Use and Browsing - Native tool calling with 62.0% on BrowseComp, 89.7% on tau-2-Bench, and 67.8% on MCP-Atlas; 50.4% on Humanity's Last Exam with tool access (see the request sketch after this list)
- Cybersecurity - 43.2% on CyberGym for systems-level security tasks
- Interleaved Thinking - Reasons before every response and tool call, with turn-level control over reasoning depth
- Bilingual - Native support for both English and Chinese
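The tool-use figures above rest on native tool calling. Below is a minimal sketch of invoking it through an OpenAI-compatible endpoint; the base URL, the `glm-5` model identifier, and the `get_weather` tool are illustrative assumptions, not part of GLM 5's published interface.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # assumed endpoint

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, for illustration only
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="glm-5",  # assumed model identifier
    messages=[{"role": "user", "content": "What is the weather in Beijing?"}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)  # populated when the model calls the tool
```

Per the interleaved-thinking behavior above, the model reasons before deciding whether to emit a tool call or a final response.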
Use Cases
- Software engineering, code generation, and multi-file repository-level tasks
- Multi-step agentic workflows with tool calling and web browsing
- Complex mathematical reasoning and competition-level problem solving
- Terminal-based development, operations, and systems administration
- Cybersecurity analysis and systems engineering
- Research tasks requiring extended browsing and context management
- Long-form document analysis and generation
Architecture and Design
GLM 5 uses a Mixture-of-Experts architecture with 256 routed experts and 1 shared expert per layer, activating 8 experts per token. The first 3 layers are dense, while the remaining 75 layers use MoE routing with a sigmoid scoring function. The model employs Multi-head Latent Attention (MLA) with low-rank compressed query and key-value projections (Q LoRA rank 2048, KV LoRA rank 512) for memory-efficient inference.
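To make the routing description concrete, here is a minimal PyTorch sketch of sigmoid-scored top-8 routing over 256 routed experts plus one shared expert. The toy hidden width, the expert MLP shape, and the renormalization of selected scores are assumptions for illustration, not GLM 5's actual implementation.

```python
import torch
import torch.nn as nn

NUM_ROUTED, TOP_K = 256, 8   # from the description above
D_MODEL = 64                 # toy width for illustration only

class MoELayer(nn.Module):
    def __init__(self):
        super().__init__()
        self.router = nn.Linear(D_MODEL, NUM_ROUTED, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(D_MODEL, D_MODEL), nn.SiLU(),
                          nn.Linear(D_MODEL, D_MODEL))
            for _ in range(NUM_ROUTED)
        )
        self.shared_expert = nn.Sequential(
            nn.Linear(D_MODEL, D_MODEL), nn.SiLU(), nn.Linear(D_MODEL, D_MODEL)
        )

    def forward(self, x):                                   # x: (num_tokens, d_model)
        scores = torch.sigmoid(self.router(x))              # sigmoid scoring, not softmax
        weights, idx = scores.topk(TOP_K, dim=-1)           # pick 8 experts per token
        weights = weights / weights.sum(-1, keepdim=True)   # renormalize selected scores
        routed = torch.zeros_like(x)
        for t in range(x.size(0)):                          # naive per-token loop for clarity
            for k in range(TOP_K):
                routed[t] += weights[t, k] * self.experts[int(idx[t, k])](x[t])
        return routed + self.shared_expert(x)               # shared expert sees every token

y = MoELayer()(torch.randn(4, D_MODEL))                     # route 4 toy tokens
```

Because only 8 of 256 routed experts fire per token, per-token compute tracks the 40B active parameters rather than the 744B total.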
The model integrates DeepSeek Sparse Attention (DSA) to reduce deployment cost while preserving long-context capacity across its 128K token context window. A single Multi-Token Prediction (MTP) layer enables speculative decoding for improved inference throughput.
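As a sketch of why the MTP layer improves throughput, the toy loop below drafts one token with a cheap head and verifies it with a single main-model pass, accepting two tokens when they agree. The `main_logits` and `mtp_logits` functions are stand-ins for illustration, not GLM 5's real interfaces.

```python
import torch

VOCAB = 100  # toy vocabulary size

def main_logits(tokens):           # stand-in for the full model's forward pass
    return torch.randn(len(tokens), VOCAB)

def mtp_logits(tokens):            # stand-in for the single cheap MTP draft layer
    return torch.randn(VOCAB)

def speculative_step(tokens):
    draft = mtp_logits(tokens).argmax().view(1)        # 1. draft one token cheaply
    logits = main_logits(torch.cat([tokens, draft]))   # 2. verify in one main pass
    verified = logits[-2].argmax().view(1)             # main model's choice at the draft slot
    if verified.item() == draft.item():                # 3. agreement: accept draft + bonus token
        bonus = logits[-1].argmax().view(1)
        return torch.cat([tokens, draft, bonus])
    return torch.cat([tokens, verified])               # disagreement: keep the verified token

out = speculative_step(torch.tensor([5, 17, 42]))
```

When the draft head's acceptance rate is high, each verification pass yields two tokens instead of one, which is where the decoding speedup comes from.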
Training Approach
GLM 5 was pre-trained on 28.5 trillion tokens, up from the 23 trillion used for GLM 4.5. Post-training uses SLIME, an asynchronous reinforcement learning infrastructure designed for training efficiency at scale. The model defaults to thinking mode with temperature 1.0 and top-p 0.95 for general reasoning tasks; temperature 0.7 is recommended for coding tasks, as in the example below.
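A minimal request sketch applying those sampling settings through an OpenAI-compatible endpoint follows; the base URL and the `glm-5` model identifier are deployment-specific assumptions.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # assumed local server

# General reasoning: thinking-mode defaults of temperature 1.0 and top-p 0.95.
reasoning = client.chat.completions.create(
    model="glm-5",  # assumed model identifier
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
    temperature=1.0,
    top_p=0.95,
)

# Coding tasks: drop temperature to 0.7 as recommended above.
coding = client.chat.completions.create(
    model="glm-5",
    messages=[{"role": "user", "content": "Write an LRU cache in Python."}],
    temperature=0.7,
    top_p=0.95,
)
```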
Deploy GLM 5 on Vast.ai for access to frontier-class agentic reasoning, coding, and tool use capabilities with flexible GPU infrastructure.