
GLM 4.7-Flash

LLM
Reasoning

Lightweight agentic, reasoning, and coding model

On-Demand Dedicated 1xH200

Details

Modalities

text

Version

V4.7-Flash

Recommended Hardware

1xH200


Provider

Z.ai

Family

GLM

Parameters

31B

Context

128,000 tokens

License

MIT

GLM 4.7-Flash: Efficient Agentic and Reasoning Model

GLM 4.7-Flash is a 30B-A3B Mixture of Experts (MoE) model developed by Z.ai, designed to deliver strong agentic, reasoning, and coding performance in a compact and efficient architecture. It is positioned as one of the strongest models in the 30B parameter class.

Key Features

  • MoE Architecture - Uses a 30B-A3B Mixture of Experts design that activates only a fraction of parameters per token, providing an efficient balance between performance and resource usage
  • Strong Coding Performance - Achieves 59.2% on SWE-bench Verified, substantially outperforming comparable models in real-world software engineering tasks
  • Agentic Capabilities - Scores 79.5% on tau-2-Bench and 42.8% on BrowseComp, demonstrating effective tool use and web browsing abilities
  • Mathematical Reasoning - Achieves 91.6% on AIME 2025, competitive with much larger models
  • Thinking Mode - Supports preserved thinking for multi-turn agentic conversations, maintaining reasoning context across turns
  • Tool Calling - Native support for structured tool calling and function integration in agentic workflows
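To make the tool-calling feature concrete, here is a minimal sketch of a structured tool-calling request in the OpenAI-compatible chat-completions format that many inference servers expose. The model id, endpoint format, and the `get_weather` tool are placeholders for illustration, not part of the official GLM 4.7-Flash documentation.

```python
import json

def tool_call_payload(prompt: str) -> dict:
    """Build a chat-completions request that offers the model one callable tool."""
    return {
        "model": "glm-4.7-flash",  # placeholder model id
        "messages": [{"role": "user", "content": prompt}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",  # hypothetical tool for illustration
                "description": "Look up the current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }],
    }

payload = tool_call_payload("What's the weather in Berlin?")
print(json.dumps(payload, indent=2))
```

If the model decides to use the tool, the response will contain a `tool_calls` entry with the function name and JSON-encoded arguments, which your application executes and feeds back as a `tool` message.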

Use Cases

  • Code generation and debugging for software engineering tasks
  • Agentic workflows with tool calling and web browsing
  • Mathematical reasoning and problem solving
  • Multi-turn conversations with context retention
  • Research tasks requiring tool integration
  • Lightweight deployment scenarios requiring strong performance with lower resource usage

Architecture

GLM 4.7-Flash uses a Mixture of Experts architecture with 30B total parameters and 3B active parameters per token. This sparse activation pattern enables the model to maintain high performance while requiring significantly fewer computational resources during inference compared to dense models of similar capability. The model supports thinking mode with preserved reasoning across conversation turns, enabling coherent multi-step agentic task completion.
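The sparse-activation idea behind the architecture can be sketched in a few lines: a gating network scores all experts, only the top-k actually run, and their outputs are combined with softmax weights. This toy NumPy version (shapes, expert count, and top-k value are arbitrary illustrations, not the model's real configuration) shows why only a small fraction of total parameters does work per token.

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """Route one token through the top-k experts of a toy MoE layer."""
    logits = x @ gate_w                  # one routing logit per expert
    top_k = np.argsort(logits)[-k:]      # indices of the k highest-scoring experts
    weights = np.exp(logits[top_k])
    weights /= weights.sum()             # softmax over the selected experts only
    # Only the chosen experts execute; the rest are skipped entirely,
    # which is the source of the compute savings.
    return sum(w * experts[i](x) for w, i in zip(weights, top_k))

rng = np.random.default_rng(0)
d, n_experts = 8, 16
x = rng.standard_normal(d)
gate_w = rng.standard_normal((d, n_experts))
# Each "expert" is a small linear map standing in for a feed-forward block.
expert_mats = [rng.standard_normal((d, d)) for _ in range(n_experts)]
experts = [lambda v, m=m: v @ m for m in expert_mats]

y = moe_forward(x, gate_w, experts, k=2)
print(y.shape)  # (8,)
```

In the real model, the same principle means roughly 3B of the 30B parameters are active for any given token, so inference cost tracks the active count rather than the total.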

Training Approach

The model was trained with emphasis on agentic task performance, coding, and reasoning. Evaluation uses temperature 1.0 with top-p 0.95 for general tasks, with specialized settings for coding and agentic benchmarks including temperature 0.7 for SWE-bench evaluations.
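The evaluation settings above translate directly into request parameters. A small helper like the following (the task labels and model id are illustrative assumptions, not an official API) keeps the documented settings in one place:

```python
def sampling_params(task: str) -> dict:
    """Return sampling parameters matching the documented evaluation settings."""
    if task == "coding":
        # SWE-bench-style evaluations use a lower temperature
        return {"temperature": 0.7, "top_p": 0.95}
    # General tasks: temperature 1.0 with top-p 0.95
    return {"temperature": 1.0, "top_p": 0.95}

request = {
    "model": "glm-4.7-flash",  # placeholder model id
    "messages": [{"role": "user", "content": "Solve: 12 * 17"}],
    **sampling_params("general"),
}
```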

Deploy GLM 4.7-Flash on Vast.ai for efficient access to strong agentic and reasoning capabilities with flexible GPU infrastructure.

Quick Start Guide

1. Choose a model and click 'Deploy' above to find available GPUs recommended for this model.

2. Rent your dedicated instance, preconfigured with the model you've selected.

3. Start sending requests to your model instance and get responses right away.
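Once the instance is running, step 3 typically amounts to POSTing to an OpenAI-compatible chat endpoint. The sketch below only assembles the request; the base URL, API key, and model id are placeholders you would replace with your instance's actual values, and the exact endpoint shape depends on the inference server your template uses.

```python
import json

def build_chat_request(base_url: str, api_key: str, prompt: str,
                       model: str = "glm-4.7-flash") -> dict:
    """Assemble an OpenAI-style chat-completions request (values are placeholders)."""
    return {
        "url": f"{base_url}/v1/chat/completions",
        "headers": {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        "body": json.dumps({
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        }),
    }

req = build_chat_request("http://YOUR_INSTANCE:8000", "YOUR_KEY", "Hello!")
```

The resulting dict can be sent with any HTTP client; the response follows the standard chat-completions schema, with the reply under `choices[0].message.content`.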

© 2026 Vast.ai. All rights reserved.