744B MoE model for agentic reasoning, coding, and tool use
Modality: text
Version: V5
Recommended hardware: 8xH200
Developer: Z.ai
Family: GLM
Parameters: 754B
Context length: 128000 tokens
License: MIT
GLM 5 is a 744B-parameter Mixture-of-Experts model with 40B active parameters, developed by Z.ai. It targets complex systems engineering, long-horizon agentic tasks, and advanced reasoning, building on the GLM 4.7 foundation with doubled total parameters and an expanded expert pool.
GLM 5 uses a Mixture-of-Experts architecture with 256 routed experts and 1 shared expert per layer, activating 8 experts per token. The first 3 layers are dense, while the remaining 75 layers use MoE routing with a sigmoid scoring function. The model employs Multi-head Latent Attention (MLA) with LoRA-compressed key-value projections (KV LoRA rank 512, Q LoRA rank 2048) for memory-efficient inference.
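The routing scheme described above can be sketched in a few lines. This is an illustrative toy (random weights, made-up hidden size), not GLM 5's actual implementation: it shows sigmoid scoring over 256 routed experts, top-8 selection per token, and gate renormalization, with the 1 shared expert applied unconditionally.

```python
import numpy as np

# Toy sketch of sigmoid top-k MoE routing (illustrative shapes, not real weights):
# 256 routed experts scored with a sigmoid gate, top-8 selected per token,
# plus 1 shared expert that every token always uses.
N_ROUTED, TOP_K, D = 256, 8, 64

rng = np.random.default_rng(0)
router_w = rng.standard_normal((D, N_ROUTED))  # hypothetical router weights

def route(token_hidden: np.ndarray):
    """Return (selected expert indices, normalized gate weights) for one token."""
    scores = 1.0 / (1.0 + np.exp(-token_hidden @ router_w))  # sigmoid scoring
    top = np.argsort(scores)[-TOP_K:][::-1]                  # top-8 experts
    gates = scores[top] / scores[top].sum()                  # renormalize gates
    return top, gates

idx, gates = route(rng.standard_normal(D))
# idx holds 8 distinct routed experts; the shared expert runs for every token,
# so each token is processed by 9 experts in total.
```

Sigmoid scoring (as opposed to a softmax over all experts) scores each expert independently, which is why the selected gates are renormalized after top-k selection.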
The model integrates DeepSeek Sparse Attention (DSA) to reduce deployment cost while preserving long-context capacity across its 128K token context window. A single Multi-Token Prediction (MTP) layer enables speculative decoding for improved inference throughput.
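The MTP layer's role in speculative decoding can be illustrated with the generic draft-then-verify step. This is a minimal sketch of the acceptance logic common to speculative decoding, not GLM 5's actual implementation: the MTP head drafts several future tokens cheaply, the full model scores them in one pass, and only the prefix the full model agrees with is kept.

```python
# Minimal sketch of the verify step in MTP-style speculative decoding
# (generic greedy-acceptance logic, not GLM 5's actual code).

def accept_draft(draft: list[int], verified: list[int]) -> list[int]:
    """Keep draft tokens up to the first disagreement, then take the
    verifier's token there, so every emitted token matches the full model."""
    out = []
    for d, v in zip(draft, verified):
        if d == v:
            out.append(d)  # draft confirmed: one token at near-zero extra cost
        else:
            out.append(v)  # first mismatch: fall back to the full model's token
            break
    return out

# e.g. draft [5, 9, 2, 7] vs verifier [5, 9, 4, 7] -> emit [5, 9, 4]
```

Because accepted tokens always match what the full model would have produced, the speedup comes for free in output quality; throughput gains depend on how often the MTP draft agrees with the full model.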
GLM 5 was pre-trained on 28.5 trillion tokens, increased from the 23 trillion tokens used for GLM 4.5. Post-training uses SLIME, a novel asynchronous reinforcement learning infrastructure designed for improved training efficiency at scale. The model defaults to thinking mode with temperature 1.0 and top-p 0.95 for general reasoning tasks, with temperature 0.7 recommended for coding benchmarks.
Deploy GLM 5 on Vast.ai for access to frontier-class agentic reasoning, coding, and tool use capabilities with flexible GPU infrastructure.
1. Choose a model and click 'Deploy' above to find available GPUs recommended for this model.
2. Rent your dedicated instance, preconfigured with the model you've selected.
3. Start sending requests to your model instance and get responses right away.

© 2026 Vast.ai. All rights reserved.