Qwen3.6 35B A3B

LLM · Vision Language · MoE · Reasoning · Coding

Agentic coding MoE with hybrid Gated DeltaNet and vision support

Details

  • Modalities: text, vision
  • Recommended Hardware: 1xRTX PRO 6000 S
  • Provider: Alibaba
  • Family: Qwen3.6
  • Parameters: 35B
  • Context: 262,144 tokens
  • License: apache-2.0

Qwen3.6 35B A3B: Agentic Coding with Hybrid Gated DeltaNet

Qwen3.6 35B A3B is the first open-weight model in the Qwen3.6 series, built on direct community feedback and focused on stability and real-world utility. It combines a hybrid Gated DeltaNet and Gated Attention architecture with sparse Mixture-of-Experts routing and a vision encoder for unified multimodal reasoning.

Key Features

  • Agentic Coding - Handles frontend workflows and repository-level reasoning with improved fluency and precision over earlier Qwen generations
  • Thinking Preservation - New option to retain reasoning context from historical messages, streamlining iterative development and reducing redundant token generation
  • Hybrid Architecture - Alternating Gated DeltaNet and Gated Attention blocks combined with sparse MoE, balancing long-context efficiency against attention precision
  • Sparse Mixture-of-Experts - 256 total experts with 8 routed and 1 shared expert active per token, delivering 35B total capacity with only 3B active parameters
  • Multi-Token Prediction - Trained with multi-step MTP, enabling speculative decoding for lower-latency inference
  • Native 262K Context - Handles 262,144 tokens natively, extensible up to 1,010,000 tokens via YaRN RoPE scaling
  • Multimodal Inputs - Unified vision-language model supporting text, image, and video inputs
  • Tool Calling - Native tool-calling support with the qwen3_coder parser for agent workflows (see the sketch after this list)
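
As a sketch of how the tool-calling feature above can be exercised once the model sits behind an OpenAI-compatible endpoint (for example, vLLM launched with `--enable-auto-tool-choice --tool-call-parser qwen3_coder`), the snippet below registers one tool and lets the model decide whether to call it. The endpoint URL, API key, served model name, and the `run_tests` tool itself are placeholders for illustration, not values confirmed on this page.

```python
# Sketch: tool calling against an OpenAI-compatible endpoint.
# Assumes the server was started with tool calling enabled, e.g.:
#   vllm serve <model> --enable-auto-tool-choice --tool-call-parser qwen3_coder
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder endpoint

tools = [{
    "type": "function",
    "function": {
        "name": "run_tests",  # hypothetical tool, for illustration only
        "description": "Run the repository's test suite and return the results.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string", "description": "Directory to test"}},
            "required": ["path"],
        },
    },
}]

resp = client.chat.completions.create(
    model="Qwen3.6-35B-A3B",  # placeholder served-model name
    messages=[{"role": "user", "content": "Run the tests under ./src and summarize failures."}],
    tools=tools,
)

# Print any tool calls the model decided to emit.
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```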

Benchmark Performance

Coding and Software Engineering:

  • SWE-bench Verified: 73.4
  • SWE-bench Multilingual: 67.2
  • SWE-bench Pro: 49.5
  • Terminal-Bench 2.0: 51.5
  • LiveCodeBench v6: 80.4
  • NL2Repo: 29.4
  • QwenClawBench: 52.6

General Agent and Tool Use:

  • TAU3-Bench: 67.2
  • DeepPlanning: 25.9
  • MCPMark: 37.0
  • MCP-Atlas: 62.8
  • WideSearch: 60.1

Knowledge:

  • MMLU-Pro: 85.2
  • MMLU-Redux: 93.3
  • SuperGPQA: 64.7
  • C-Eval: 90.0

STEM and Reasoning:

  • GPQA: 86.0
  • HLE: 21.4
  • HMMT Feb 25: 90.7
  • HMMT Nov 25: 89.1
  • HMMT Feb 26: 83.6
  • IMOAnswerBench: 78.9
  • AIME26: 92.6

Use Cases

  • Agentic coding tasks across frontend, backend, and repository-level workflows
  • Multi-turn agent scenarios where preserved reasoning context improves decision consistency
  • Tool-calling and MCP-based automation
  • Competition-level mathematics and STEM reasoning
  • Long-context document analysis up to 262K tokens natively
  • Visual question answering and image-grounded reasoning (see the sketch after this list)
  • Video understanding with configurable frame sampling
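
As a minimal sketch of the visual question answering use case above, image inputs can be passed as `image_url` content parts through the OpenAI-compatible chat API. The endpoint, served model name, and image URL are placeholders.

```python
# Sketch: visual question answering via an OpenAI-compatible chat endpoint.
# Endpoint, model name, and image URL are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen3.6-35B-A3B",  # placeholder served-model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
            {"type": "text", "text": "What trend does this chart show?"},
        ],
    }],
)
print(resp.choices[0].message.content)
```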

Architecture

Qwen3.6 35B A3B uses a 40-layer hybrid architecture organized as ten cycles of three Gated DeltaNet blocks followed by one Gated Attention block, each paired with a sparse Mixture-of-Experts feed-forward layer.

Gated DeltaNet provides linear-attention efficiency with a fixed-size recurrent state, keeping long-context compute and memory costs tractable. The interleaved Gated Attention blocks use 16 query heads and 2 key-value heads with a head dimension of 256 and 64-dimensional rotary position embeddings, preserving precise token-level attention where it is most valuable.
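
One practical consequence of the 3:1 hybrid layout is a small KV cache: only the 10 Gated Attention layers (one per four-layer cycle) store per-token keys and values, while the 30 Gated DeltaNet layers keep a fixed-size state. The back-of-the-envelope estimate below uses only figures stated above; the fp16 storage assumption and the exclusion of the DeltaNet state are mine.

```python
# Rough KV-cache estimate at the full 262,144-token native context.
# Uses the figures stated above; fp16 storage is an assumption.
layers = 40
attn_layers = layers // 4          # one Gated Attention block per 4-layer cycle -> 10
kv_heads = 2                       # key-value heads per attention layer
head_dim = 256                     # dimensions per head
bytes_fp16 = 2
context = 262_144

per_token = attn_layers * kv_heads * head_dim * 2 * bytes_fp16  # x2 for K and V
total_gib = per_token * context / 1024**3
print(f"{per_token} bytes/token, ~{total_gib:.1f} GiB at full context")
# -> 20480 bytes/token, ~5.0 GiB; a dense 40-layer attention stack
#    with the same heads would need 4x that.
```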

The Mixture-of-Experts layer routes each token through 8 of 256 available experts plus 1 shared expert, with a 512-dimensional expert intermediate size. The model is trained with Multi-Token Prediction across multiple steps, enabling speculative decoding at inference time.
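
To make the 35B-total / 3B-active split concrete, the sketch below estimates the MoE feed-forward parameters activated per token from the numbers above (2048 hidden size, 512 expert intermediate size, 8 routed plus 1 shared expert out of 256). The three-matrix SwiGLU expert shape is my assumption, so treat this as an order-of-magnitude illustration rather than an official breakdown.

```python
# Order-of-magnitude sketch of active vs. total MoE FFN parameters.
# Assumes each expert is a 3-matrix SwiGLU FFN (gate/up/down) -- an
# assumption, not a detail confirmed on this page.
hidden = 2048
expert_inter = 512
experts_total = 256
experts_active = 8 + 1   # 8 routed + 1 shared per token
layers = 40

per_expert = 3 * hidden * expert_inter          # ~3.1M params per expert
active_ffn = layers * experts_active * per_expert
total_ffn = layers * experts_total * per_expert
print(f"active FFN ~{active_ffn/1e9:.2f}B of ~{total_ffn/1e9:.1f}B total FFN params")
# -> roughly 1.13B active FFN vs ~32B total; attention, embeddings, and the
#    vision encoder account for the rest of the 3B-active / 35B-total budget.
```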

A 2048-dimensional language backbone pairs with a vision encoder to form a unified multimodal model, supporting a 248,320-token padded vocabulary and handling text, image, and video inputs through a shared representation.

Deploy Qwen3.6 35B A3B on Vast.ai with vLLM, SGLang, or llama.cpp for efficient agentic coding, long-context reasoning, and multimodal inference on flexible GPU infrastructure.
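
If you need the extended 1,010,000-token window rather than the native 262K, the usual Qwen-style route is a YaRN `rope_scaling` block in the checkpoint's config.json (or the equivalent `--rope-scaling` JSON passed to vLLM). The snippet below follows the convention documented for earlier Qwen3 releases; the exact keys for this model are an assumption, and the factor is simply 1,010,000 / 262,144. Verify against the official model card before relying on it.

```python
# Sketch: enabling YaRN context extension by editing the checkpoint config.
# Key names follow the earlier-Qwen3 convention -- an assumption here.
import json

cfg_path = "Qwen3.6-35B-A3B/config.json"  # placeholder local checkpoint path
with open(cfg_path) as f:
    cfg = json.load(f)

cfg["rope_scaling"] = {
    "rope_type": "yarn",
    "factor": 1_010_000 / 262_144,                # ~3.85x over the native window
    "original_max_position_embeddings": 262_144,  # native context length
}

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)
```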

Quick Start Guide

  1. Choose a model and click 'Deploy' above to find available GPUs recommended for this model.
  2. Rent a dedicated instance preconfigured with the model you've selected.
  3. Start sending requests to your model instance and get responses right away, as sketched below.
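
Once the instance is up, requests go to its OpenAI-compatible endpoint. A minimal first-request sketch, streaming tokens as they arrive; the instance address, API key, and served model name are placeholders:

```python
# Minimal streaming request to a deployed instance (placeholder URL/key/model).
from openai import OpenAI

client = OpenAI(
    base_url="https://YOUR-INSTANCE-ADDRESS/v1",  # placeholder instance address
    api_key="YOUR-API-KEY",                       # placeholder key
)

stream = client.chat.completions.create(
    model="Qwen3.6-35B-A3B",  # placeholder served-model name
    messages=[{"role": "user", "content": "Summarize what Gated DeltaNet does in two sentences."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```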