GLM 5

LLM
Reasoning

744B MoE model for agentic reasoning, coding, and tool use

On-Demand Dedicated 8xH200

Details

Modalities

text

Version

V5

Recommended Hardware

8xH200


Provider

Z.ai

Family

GLM

Parameters

744B

Context

128000 tokens

License

MIT

GLM 5: Large-Scale Agentic Reasoning Model

GLM 5 is a 744B parameter Mixture-of-Experts model with 40B active parameters, developed by Z.ai. It targets complex systems engineering, long-horizon agentic tasks, and advanced reasoning, building on the GLM 4.7 foundation with doubled total parameters and an expanded expert pool.

Key Features

  • Agentic Task Completion - Achieves 77.8% on SWE-bench Verified and 73.3% on SWE-bench Multilingual, with strong Terminal-Bench 2.0 performance (56.2% with Terminus, 56.2% with Claude Code)
  • Complex Reasoning - Scores 92.7% on AIME 2026 I, 96.9% on HMMT Nov. 2025, 82.5% on IMOAnswerBench, and 86.0% on GPQA-Diamond
  • Tool Use and Browsing - Native tool calling with 62.0% on BrowseComp, 89.7% on tau-2-Bench, and 67.8% on MCP-Atlas; 50.4% on Humanity's Last Exam with tool access
  • Cybersecurity - 43.2% on CyberGym for systems-level security tasks
  • Interleaved Thinking - Reasons before every response and tool call, with turn-level control over reasoning depth
  • Bilingual - Native English and Chinese language support
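To illustrate the native tool calling mentioned above, here is a minimal sketch of a chat request with a tool definition, assuming the deployed instance exposes an OpenAI-compatible API. The endpoint shape, the `glm-5` model name, and the `get_weather` tool are illustrative assumptions, not confirmed details of the deployment.

```python
import json

# Hypothetical tool definition in the OpenAI-compatible function-calling
# schema; the tool name and parameters are made up for illustration.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

payload = {
    "model": "glm-5",  # assumed model identifier
    "messages": [{"role": "user", "content": "What's the weather in Beijing?"}],
    "tools": [weather_tool],
    "tool_choice": "auto",  # let the model decide when to call the tool
}

body = json.dumps(payload)
```

When the model elects to call the tool, the response carries a `tool_calls` entry whose arguments your client executes before returning the result in a follow-up message.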

Use Cases

  • Software engineering, code generation, and multi-file repository-level tasks
  • Multi-step agentic workflows with tool calling and web browsing
  • Complex mathematical reasoning and competition-level problem solving
  • Terminal-based development, operations, and systems administration
  • Cybersecurity analysis and systems engineering
  • Research tasks requiring extended browsing and context management
  • Long-form document analysis and generation

Architecture and Design

GLM 5 uses a Mixture-of-Experts architecture with 256 routed experts and 1 shared expert per layer, activating 8 experts per token. The first 3 layers are dense, while the remaining 75 layers use MoE routing with a sigmoid scoring function. The model employs Multi-head Latent Attention (MLA) with LoRA-compressed key-value projections (KV LoRA rank 512, Q LoRA rank 2048) for memory-efficient inference.
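The routing scheme described above can be sketched in a few lines. This is a toy illustration only, using the stated dimensions (256 routed experts, 1 shared expert, 8 active per token, sigmoid scoring); the stand-in linear "experts" and the hidden size are assumptions, since real expert layers are MLPs.

```python
import numpy as np

N_EXPERTS, TOP_K, D = 256, 8, 64  # D is an arbitrary toy hidden size
rng = np.random.default_rng(0)

router_w = rng.standard_normal((D, N_EXPERTS)) / np.sqrt(D)
experts = rng.standard_normal((N_EXPERTS, D, D)) / np.sqrt(D)  # stand-in experts
shared = rng.standard_normal((D, D)) / np.sqrt(D)              # shared expert

def moe_layer(x):
    """Route one token (shape [D]) to its top-k experts plus the shared expert."""
    scores = 1.0 / (1.0 + np.exp(-(x @ router_w)))  # sigmoid gate, shape [256]
    top = np.argsort(scores)[-TOP_K:]               # indices of the 8 best experts
    gates = scores[top] / scores[top].sum()         # normalize selected gates
    out = sum(g * (x @ experts[i]) for g, i in zip(gates, top))
    return out + x @ shared                         # shared expert is always active

y = moe_layer(rng.standard_normal(D))
```

The sigmoid gate scores each expert independently, unlike a softmax router where expert scores compete; only the selected gates are renormalized.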

The model integrates DeepSeek Sparse Attention (DSA) to reduce deployment cost while preserving long-context capacity across its 128K token context window. A single Multi-Token Prediction (MTP) layer enables speculative decoding for improved inference throughput.
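The speculative-decoding idea behind the MTP layer can be shown with a toy draft-and-verify loop. Both "models" here are stand-in functions over a tiny vocabulary, and the greedy verification rule is a simplification; the real MTP head and acceptance scheme are internal to the serving stack.

```python
import random

VOCAB = list(range(10))

def target_next(ctx):
    # Deterministic toy "target model": next token = sum of context mod 10
    return sum(ctx) % 10

def draft_next(ctx):
    # Toy "MTP draft": agrees with the target most of the time, sometimes errs
    t = target_next(ctx)
    return t if random.random() < 0.8 else random.choice(VOCAB)

def speculative_step(ctx, k=4):
    """Draft k tokens cheaply, then verify them against the target in one pass.

    Accept the longest agreeing prefix; on the first disagreement, emit the
    target's token instead and stop. Throughput improves because several
    tokens can be accepted per target-model pass.
    """
    proposal, c = [], list(ctx)
    for _ in range(k):
        tok = draft_next(c)
        proposal.append(tok)
        c.append(tok)
    accepted, c = [], list(ctx)
    for tok in proposal:
        correct = target_next(c)
        if tok == correct:
            accepted.append(tok)
            c.append(tok)
        else:
            accepted.append(correct)  # substitute the target's token and stop
            break
    return accepted

random.seed(0)
out = speculative_step([1, 2, 3])
```

Note the output is always identical to what the target model alone would produce; the draft only changes how many tokens are finalized per verification pass.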

Training Approach

GLM 5 was pre-trained on 28.5 trillion tokens, increased from the 23 trillion tokens used for GLM 4.5. Post-training uses SLIME, a novel asynchronous reinforcement learning infrastructure designed for improved training efficiency at scale. The model defaults to thinking mode with temperature 1.0 and top-p 0.95 for general reasoning tasks, with temperature 0.7 recommended for coding benchmarks.
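The recommended sampling settings above can be expressed as request parameters. This is a hedged sketch assuming an OpenAI-compatible chat endpoint; the `glm-5` model name is an assumption, so adjust both to match your deployed instance.

```python
def build_request(prompt, coding=False):
    """Build a chat payload using the documented sampling defaults:
    temperature 1.0 / top-p 0.95 for general reasoning, temperature 0.7
    for coding tasks."""
    return {
        "model": "glm-5",  # assumed model identifier
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7 if coding else 1.0,
        "top_p": 0.95,
    }

reasoning_req = build_request("Prove that the square root of 2 is irrational.")
coding_req = build_request("Refactor this function for readability.", coding=True)
```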

Deploy GLM 5 on Vast.ai for access to frontier-class agentic reasoning, coding, and tool use capabilities with flexible GPU infrastructure.

Quick Start Guide

1. Choose a model and click 'Deploy' above to find available GPUs recommended for that model.

2. Rent a dedicated instance preconfigured with the model you selected.

3. Send requests to your instance and receive responses immediately.

© 2026 Vast.ai. All rights reserved.