## Overview

Krea 2 Turbo is the distilled, few-step release checkpoint of the Krea 2 text-to-image model family from Krea.ai. It is a diffusion transformer with roughly 12 billion parameters that generates images from natural-language prompts. Turbo is post-trained and distilled from the Krea 2 Raw base checkpoint, so it produces high-quality images in far fewer sampling steps, making it well suited to fast, interactive generation.

## Key Features

- Few-step generation: distillation lets Turbo render detailed images in a handful of inference steps instead of the dozens a base diffusion model typically needs.
- Strong prompt following across photographic, illustrative, painterly, and stylized outputs.
- High-resolution synthesis suitable for detailed, print-quality imagery.
- Broad stylistic range, from photorealism and impressionist painting to retro, halftone, and low-poly aesthetics.

## Use Cases

Krea 2 Turbo is designed for creative, commercial, developer, and research workflows, including image generation, concepting, design exploration, visual production, and integration into applications and creative tools. Its low step count makes it a good fit for real-time or high-throughput generation where latency matters.

## Architecture

Krea 2 is a text-to-image diffusion transformer with about 12 billion parameters. The family ships two open-weight checkpoints: Krea 2 Raw, the base release prior to additional post-training, and Krea 2 Turbo, the post-trained and distilled checkpoint optimized for few-step sampling. Turbo pairs the diffusion transformer with a text encoder and VAE for end-to-end text-to-image synthesis.

## Model Family

This entry deploys Krea 2 Turbo. The base checkpoint it is distilled from is published as [Krea 2 Raw](https://huggingface.co/krea/Krea-2-Raw), and the full release is available at [Krea on Hugging Face](https://huggingface.co/krea).


Distilled few-step version of Krea 2, a 12B text-to-image diffusion transformer for fast high-quality image generation

Krea 2 Turbo


### GLM 5.2: 1M-Context Agentic Reasoning Model

GLM 5.2 is a 753B parameter Mixture-of-Experts model developed by Z.ai. It builds on GLM 5 with a sparse-attention design that supports a 1M-token context window, targeting long-horizon agentic tasks, large-codebase engineering, and advanced reasoning, with native English and Chinese support.

#### Key Features

- **Frontier Reasoning** - 99.2 on AIME 2026 and 91.2 on GPQA-Diamond
- **Agentic Software Engineering** - 62.1 on SWE-bench Pro for repository-level, multi-file tasks
- **Tool Use** - 40.5 on Humanity's Last Exam, rising to 54.7 with tool access
- **1M-Token Context** - Long-document analysis, large-repo navigation, and extended agentic trajectories in a single context
- **Interleaved Thinking** - Reasons before every response and tool call; defaults to thinking mode
- **Bilingual** - Native English and Chinese language support

#### Use Cases

- Software engineering, code generation, and multi-file repository-level tasks
- Multi-step agentic workflows with tool calling and web browsing
- Complex mathematical reasoning and competition-level problem solving
- Long-context document analysis, synthesis, and generation
- Terminal-based development, operations, and systems administration
- Research tasks requiring extended browsing and context management

#### Architecture and Design

GLM 5.2 is a 753B parameter Mixture-of-Experts model that uses a sparse attention mechanism with **IndexShare**, reusing the attention indexer across every four layers to reduce the cost of long-context inference while preserving capacity across its 1M-token window. The design extends the GLM 5 architecture toward longer context and more reliable agentic behavior.

#### Training and Inference

GLM 5.2 builds on the GLM 5 foundation with refreshed post-training for stronger agentic coding and tool use. It defaults to thinking mode, reasoning before each response and tool call, and Z.ai recommends a temperature of 1.0 with top-p 0.95 for general reasoning tasks.

Deploy GLM 5.2 on Vast.ai for frontier-class agentic reasoning, coding, and long-context capabilities with flexible GPU infrastructure.


753B MoE model with 1M-token context for agentic reasoning, coding, and tool use

Q4_K_M (Unsloth)

GLM 5.2


### Qwen3.6 35B A3B: Agentic Coding with Hybrid Gated DeltaNet

Qwen3.6 35B A3B is the first open-weight model in the Qwen3.6 series, built on direct community feedback and focused on stability and real-world utility. It combines a hybrid Gated DeltaNet and Gated Attention architecture with sparse Mixture-of-Experts routing and a vision encoder for unified multimodal reasoning.

#### Key Features

- **Agentic Coding** - Handles frontend workflows and repository-level reasoning with improved fluency and precision over earlier Qwen generations
- **Thinking Preservation** - New option to retain reasoning context from historical messages, streamlining iterative development and reducing redundant token generation
- **Hybrid Architecture** - Alternating Gated DeltaNet and Gated Attention blocks combined with sparse MoE, balancing long-context efficiency against attention precision
- **Sparse Mixture-of-Experts** - 256 total experts with 8 routed and 1 shared expert active per token, delivering 35B total capacity with only 3B active parameters
- **Multi-Token Prediction** - Trained with multi-step MTP, enabling speculative decoding for lower-latency inference
- **Native 262K Context** - Handles 262,144 tokens natively, extensible up to 1,010,000 tokens via YaRN RoPE scaling
- **Multimodal Inputs** - Unified vision-language model supporting text, image, and video inputs
- **Tool Calling** - Native tool-calling support with the `qwen3_coder` parser for agent workflows

#### Benchmark Performance

**Coding and Software Engineering:**
- SWE-bench Verified: 73.4
- SWE-bench Multilingual: 67.2
- SWE-bench Pro: 49.5
- Terminal-Bench 2.0: 51.5
- LiveCodeBench v6: 80.4
- NL2Repo: 29.4
- QwenClawBench: 52.6

**General Agent and Tool Use:**
- TAU3-Bench: 67.2
- DeepPlanning: 25.9
- MCPMark: 37.0
- MCP-Atlas: 62.8
- WideSearch: 60.1

**Knowledge:**
- MMLU-Pro: 85.2
- MMLU-Redux: 93.3
- SuperGPQA: 64.7
- C-Eval: 90.0

**STEM and Reasoning:**
- GPQA: 86.0
- HLE: 21.4
- HMMT Feb 25: 90.7
- HMMT Nov 25: 89.1
- HMMT Feb 26: 83.6
- IMOAnswerBench: 78.9
- AIME26: 92.6

#### Use Cases

- Agentic coding tasks across frontend, backend, and repository-level workflows
- Multi-turn agent scenarios where preserved reasoning context improves decision consistency
- Tool-calling and MCP-based automation
- Competition-level mathematics and STEM reasoning
- Long-context document analysis up to 262K tokens natively
- Visual question answering and image-grounded reasoning
- Video understanding with configurable frame sampling

#### Architecture

Qwen3.6 35B A3B uses a 40-layer hybrid architecture organized as ten cycles of three Gated DeltaNet blocks followed by one Gated Attention block, each paired with a sparse Mixture-of-Experts feed-forward layer.

Gated DeltaNet provides linear-attention efficiency with a fixed-size recurrent state, keeping long-context compute and memory cost tractable. The interleaved Gated Attention blocks use 16 query heads and 2 key-value heads with a 256-dimensional head and a 64-dimensional rotary position embedding, preserving precise token-level attention where it is most valuable.

The Mixture-of-Experts layer routes each token through 8 of 256 available experts plus 1 shared expert, with a 512-dimensional expert intermediate size. The model is trained with Multi-Token Prediction across multiple steps, enabling speculative decoding at inference time.

A 2048-dimensional language backbone pairs with a vision encoder to form a unified multimodal model, supporting a 248,320-token padded vocabulary and handling text, image, and video inputs through a shared representation.

Deploy Qwen3.6 35B A3B on Vast.ai with vLLM, SGLang, or llama.cpp for efficient agentic coding, long-context reasoning, and multimodal inference on flexible GPU infrastructure.


Agentic coding MoE with hybrid Gated DeltaNet and vision support

UD-Q8_K_XL

UD-Q5_K_XL

UD-Q4_K_XL

Qwen3.6 35B A3B


### Gemma 4 31B IT: Dense Vision-Language Model

Gemma 4 is Google DeepMind's next-generation family of open multimodal models. The 31B variant is the dense flagship of the family, built to deliver frontier-level reasoning, coding, and multimodal understanding on consumer GPUs and workstations. It natively handles text and image input, supports a 256K context window, and covers 140+ languages.

#### Key Features

-   **Dense 31B Architecture** - 30.7B-parameter dense transformer targeting the highest-quality end of the Gemma 4 family.
-   **Hybrid Attention** - Interleaves sliding window (local) and full global attention layers, with unified Keys and Values on global layers and Proportional RoPE (p-RoPE) for efficient long-context processing.
-   **Reasoning / Thinking Mode** - Built-in configurable thinking mode lets the model reason step-by-step before answering.
-   **Multimodal** - Native text and image understanding with variable aspect ratio and resolution support; video analysis via frame sequences.
-   **Function Calling** - Native structured tool use with a custom tool-call protocol for agentic workflows.
-   **Long Context** - 256K token context window for document analysis, long-form reasoning, and agent trajectories.
-   **Multilingual** - Out-of-the-box support for 35+ languages, pre-trained on 140+.
-   **Native System Prompts** - First-class support for the system role.

#### Use Cases

-   Document and PDF parsing, OCR (including multilingual and handwriting)
-   Chart, diagram, and screen/UI understanding
-   Long-context reasoning and summarization
-   Code generation, completion, and correction
-   Agentic workflows with structured function calling
-   Visual question answering and image analysis
-   Multilingual chat and translation

#### Architecture

Gemma 4 31B IT is a 60-layer dense transformer with a 1024-token sliding window on local attention layers and unified Keys/Values on global layers, paired with a ~550M parameter vision encoder. The final layer is always global, ensuring deep awareness for long-context tasks while local layers keep the memory footprint manageable.

#### Benchmarks

Instruction-tuned results reported by Google DeepMind (selected):

-   MMLU Pro: 85.2%
-   AIME 2026 (no tools): 89.2%
-   LiveCodeBench v6: 80.0%
-   Codeforces ELO: 2150
-   GPQA Diamond: 84.3%
-   Tau2 (average over 3): 76.9%
-   HLE (no tools): 19.5%
-   HLE (with search): 26.5%
-   BigBench Extra Hard: 74.4%
-   MMMLU: 88.4%
-   MMMU Pro (vision): 76.9%
-   MATH-Vision: 85.6%
-   MedXPertQA MM: 61.3%
-   MRCR v2 8-needle 128K: 66.4%

For full benchmark tables and model family comparisons, see the [model card on HuggingFace](https://huggingface.co/google/gemma-4-31B-it).


Gemma 4 31B dense vision-language model by Google with 256K context and thinking mode

NVFP4

Gemma 4 31B IT


### Qwen3.5 27B: Dense Vision-Language Reasoning Model

Qwen3.5 27B is a dense multimodal foundation model from Alibaba's Qwen team, built on a hybrid Gated DeltaNet and Gated Attention architecture. With 27 billion parameters, it pairs strong text reasoning with native vision understanding through early fusion multimodal training, delivering competitive benchmark performance against much larger models while remaining practical to serve on single-node hardware.

#### Key Features

-   **Unified Vision-Language Foundation** - Early fusion training on multimodal tokens achieves cross-generational parity with Qwen3 and outperforms Qwen3-VL models across reasoning, coding, agents, and visual understanding benchmarks
-   **Efficient Hybrid Architecture** - Gated Delta Networks combined with Gated Attention deliver high-throughput inference with minimal latency overhead
-   **Scalable RL Generalization** - Reinforcement learning scaled across million-agent environments with progressively complex task distributions for robust real-world adaptability
-   **Global Linguistic Coverage** - Expanded support to 201 languages and dialects for inclusive worldwide deployment
-   **Long Context** - 262,144 tokens natively, extensible up to 1,010,000 tokens with YaRN

#### Architecture

-   Causal Language Model with Vision Encoder
-   27B dense parameters
-   64 layers with a 16 × (3 × (Gated DeltaNet → FFN) → 1 × (Gated Attention → FFN)) hybrid layout
-   Gated DeltaNet linear attention (48 V heads, 16 QK heads, head dim 128)
-   Gated Attention (24 Q heads, 4 KV heads, head dim 256)
-   Feed Forward Network intermediate dimension 17408
-   Multi-token prediction (MTP) trained with multi-steps
-   Native 262K context, extensible to 1M tokens

#### Use Cases

-   Multimodal reasoning and visual question answering
-   Document, chart, and diagram understanding
-   Coding and software engineering agents
-   Tool-using agent workflows across long horizons
-   Multilingual chat and instruction following across 201 languages
-   Long-context analysis and retrieval over large document sets

#### Benchmarks

On the Qwen3.5 benchmark suite ([source](https://huggingface.co/Qwen/Qwen3.5-27B)), Qwen3.5 27B scores MMLU-Pro 86.1, MMLU-Redux 93.2, C-Eval 90.5, SuperGPQA 65.6, IFEval 95.0, GPQA Diamond 85.5, and LongBench v2 60.6 — outperforming the larger Qwen3-235B-A22B on several of these metrics while activating every parameter densely.


Dense 27B vision-language model with unified multimodal reasoning

Qwen3.5 27B


LTX-2.3 is a DiT-based (Diffusion Transformer) audio-video foundation model developed by Lightricks, representing a significant update to LTX-2 with improved audio and visual quality alongside enhanced prompt adherence. The model generates synchronized video and audio within a single unified architecture, enabling practical multimodal content creation from various input combinations.

## Key Features

LTX-2.3 supports a broad range of generation modes within its unified architecture:

- **Text-to-Video**: Generate video content directly from text descriptions
- **Image-to-Video**: Animate static images into dynamic video sequences
- **Video-to-Video**: Transform existing video with style or content modifications
- **Audio-Visual Generation**: Create synchronized audio and video output together
- **Cross-Modal Generation**: Support for audio-to-video, text-to-audio, and audio-to-audio workflows

The model includes a multi-stage pipeline with spatial upscalers (1.5x and 2x) and a temporal upscaler (2x) for producing higher resolution output and smoother frame rates.

## Architecture

LTX-2.3 is built on a Diffusion Transformer (DiT) architecture that combines diffusion models with transformer-based processing. This design handles both video and audio generation within a single framework while maintaining temporal coherence across both modalities.

The model processes video with width and height divisible by 32, and frame counts divisible by 8 plus 1, allowing for flexible output configurations. A distilled variant enables faster generation in as few as 8 steps with classifier-free guidance of 1.

## Training and Customization

The base model (dev variant) is fully trainable, supporting various customization approaches:

- **LoRA Training**: Create Low-Rank Adaptations for specific styles or subjects
- **IC-LoRA**: Image-Conditioned LoRAs for more precise control
- **Motion Adaptation**: Train custom motion patterns efficiently
- **Style Transfer**: Adapt the model to specific visual styles
- **Likeness Training**: Capture both appearance and sound characteristics

Training for motion, style, or likeness customization can be completed in under one hour in many configurations.

## Use Cases

LTX-2.3 is designed for creative video generation applications including:

- Short-form video content creation
- Animation and motion design
- Visual storytelling with synchronized audio
- Creative experimentation with multimodal generation
- Prototyping video concepts from text descriptions
- Video transformation and style transfer

## Prompting

Effective prompting significantly impacts generation quality. The model responds well to detailed, descriptive prompts that clearly articulate the desired visual and audio elements. For best results, provide specific details about motion, scene composition, and audio characteristics when generating audiovisual content.

## Integration

LTX-2.3 integrates with ComfyUI through built-in LTXVideo nodes, enabling visual workflow-based generation. The model is also available through the LTX-2 PyTorch codebase for programmatic access, with Diffusers support planned.

For more details about the model architecture and capabilities, see the [model page on Hugging Face](https://huggingface.co/Lightricks/LTX-2.3).


LTX-2.3 is a DiT-based audio-video foundation model with improved quality and prompt adherence for synchronized video and audio generation

LTX-2.3


### DeepSeek V4 Flash

DeepSeek V4 Flash is a Mixture-of-Experts (MoE) language model with 284B total parameters and 13B activated per token, released as a preview alongside the larger DeepSeek V4 Pro. It targets highly efficient long-context intelligence, supporting a context window of up to one million tokens. Flash reaches reasoning quality comparable to the Pro version when given a larger thinking budget, making it a strong general-purpose model for reasoning, coding, math, and agentic workflows.

### Architecture

DeepSeek V4 Flash pairs a sparsely activated MoE design with a hybrid attention mechanism that combines Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) to improve long-context efficiency. Manifold-Constrained Hyper-Connections (mHC) strengthen residual connections and stabilize signal propagation, and the model was trained on more than 32T tokens using the Muon optimizer for faster convergence. Weights use an FP4 + FP8 mixed format, with MoE expert parameters stored in FP4 and most remaining parameters in FP8.

The model exposes three reasoning modes: Non-think for routine daily tasks and low-risk decisions, Think High for complex problem-solving and planning, and Think Max for pushing the boundary of model reasoning.

### Key Features

- Mixture-of-Experts model with 284B total parameters and 13B activated per token
- Context window of up to one million tokens for long-document analysis and retrieval
- Hybrid CSA and HCA attention for efficient long-context serving
- Three selectable reasoning modes spanning fast responses to deep deliberation
- Native tool-calling and reasoning parsing for agentic applications
- Open-source under the MIT License

### Use Cases

- Conversational assistants and chat applications
- Coding assistance, code generation, and software engineering agents
- Mathematical problem-solving and step-by-step reasoning
- Long-context document analysis, summarization, and retrieval over very large inputs
- Agentic workflows involving tool use, browsing, and multi-step planning

### Benchmarks

In Max mode, DeepSeek V4 Flash reports strong reasoning and knowledge results, including 88.1 on GPQA Diamond (Pass@1), 86.2 on MMLU-Pro, and 34.8 on Humanity's Last Exam (Pass@1). It shows leading coding and math performance, with 91.6 on LiveCodeBench, a Codeforces rating of 3052, and 94.8 on HMMT 2026 Feb.

Long-context evaluations at the full 1M-token window report 78.7 on MRCR 1M and 60.5 on CorpusQA 1M. On agentic tasks, Flash reports 79.0 resolved on SWE-bench Verified, 56.9 on Terminal Bench 2.0, 73.2 on BrowseComp, and 45.1 on HLE with tools.

The base model, DeepSeek V4 Flash Base, reports 88.7 on MMLU (5-shot), 68.3 on MMLU-Pro, 90.8 on GSM8K, and 69.5 on HumanEval (Pass@1), among other results.

For full model details and the complete evaluation tables, see the model card on [Hugging Face](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash).


FP4 Mixture-of-Experts model with 1M context

DeepSeek V4 Flash


# FLUX.1 Kontext [dev]

FLUX.1 Kontext [dev] is a 12 billion parameter rectified flow transformer from Black Forest Labs, built for in-context image editing. Unlike text-to-image models that generate from a prompt alone, Kontext takes an existing image together with a natural-language edit instruction and produces a revised image that preserves the parts you did not ask to change. It is the open-weight member of the FLUX.1 Kontext family, released to support third-party research and development.

## Overview

Kontext performs instruction-based editing: you provide a source image and describe the change you want in plain language, such as adding an object, altering a style, or adjusting a scene. The model applies the edit while maintaining consistency with the original, making it well suited to iterative workflows where an image is refined over several successive edits. It is trained using guidance distillation for more efficient inference.

## Key Features

- Change existing images based on a written edit instruction.
- Preserve character, style, and object references across edits without any finetuning.
- Robust consistency that allows an image to be refined through multiple successive edits with minimal visual drift.
- Guidance-distilled training for more efficient generation.
- Open weights intended to drive new scientific research and to empower artists to develop innovative workflows.

## Use Cases

- Targeted photo edits driven by a text instruction, for example adding, removing, or modifying elements of a scene.
- Style and appearance changes that keep the subject and composition intact.
- Reference-guided editing that carries a character, style, or object across generations.
- Multi-step creative pipelines where an image is progressively refined edit by edit.

## Architecture

FLUX.1 Kontext [dev] is a rectified flow transformer operating in latent space, applying flow matching for in-context image generation and editing. The 12B-parameter transformer is paired with text encoders and a VAE, and conditions generation jointly on the input image and the edit instruction so that outputs remain faithful to the source. A reference implementation and sampling code are provided by Black Forest Labs, and the model is available for both ComfyUI and Diffusers workflows.

## Responsible Use

The model repository includes filters for illegal or infringing content, and the FLUX.1 Kontext models were subjected to multiple rounds of pre-release and third-party safety evaluation. In its evaluations, FLUX.1 Kontext [dev] demonstrated high resilience against violative inputs relative to other similar open-weight models. Deployers are expected to keep content filters or manual review in place when using the model.

For full model details, weights, and documentation, see the model card on [Hugging Face](https://huggingface.co/black-forest-labs/FLUX.1-Kontext-dev).


12B rectified flow transformer for instruction-based in-context image editing

FLUX.1 Kontext [dev]


### Gemma 4 12B IT: Encoder-Free Omni Model

Gemma 4 is Google DeepMind's family of open multimodal models. The 12B "Unified" variant is an encoder-free omni model that natively handles text, image, audio, and video input and generates text output. It brings audio and vision understanding directly into a single decoder-only transformer, with no separate encoders, making it well suited to local and on-device deployment. It supports a 256K token context window and multilingual coverage across 140+ languages.

#### Key Features

-   **Encoder-Free Unified Multimodality** - Raw image patches and audio waveforms are projected directly into the model's embedding space through lightweight linear layers, so all modalities flow into one decoder-only transformer with no dedicated vision or audio encoders.
-   **Native Audio Understanding** - Automatic speech recognition (ASR) and speech-to-translated-text across multiple languages, built into the model.
-   **Native Vision** - Image understanding with variable aspect ratio and resolution support, plus video analysis via frame sequences.
-   **Hybrid Attention** - Interleaves local sliding-window attention with full global attention and always ends on a global layer; global layers use unified Keys and Values with Proportional RoPE (p-RoPE) for efficient long-context processing.
-   **Long Context** - 256K token context window for long documents, long-form reasoning, and multi-turn multimodal sessions.
-   **Multilingual** - Pre-trained across 140+ languages.
-   **Dense Architecture** - A dense transformer sized for laptops, workstations, and consumer GPUs.

#### Use Cases

-   On-device and local multimodal assistants combining text, image, and audio
-   Automatic speech recognition and speech-to-translated-text translation
-   Visual question answering and image analysis
-   Document, chart, and screen understanding
-   Long-context reasoning and summarization
-   Multilingual chat and translation
-   Code generation and completion

#### Architecture

The "Unified" designation refers to the encoder-free design. Where other Gemma 4 models use dedicated encoders to pre-process multimodal inputs, the 12B model eliminates them entirely, projecting raw image patches and audio waveforms straight into the decoder's embedding space. All modalities are processed by a single decoder-only transformer, reducing multimodal latency and allowing the whole model to be fine-tuned in one pass. The hybrid attention stack interleaves sliding-window local layers with global layers, keeping the memory footprint low while preserving deep long-context awareness.

#### Benchmarks

Google DeepMind reports instruction-tuned results across reasoning, coding, multilingual, vision, audio, and long-context suites, including MATH-Vision for visual math and MRCR v2 for long-context retrieval. For the full benchmark tables and family comparisons, see the [model card on HuggingFace](https://huggingface.co/google/gemma-4-12B-it).


Gemma 4 12B Unified encoder-free omni model by Google with native text, image, and audio input and 256K context

Gemma 4 12B IT


### Gemma 4 E2B IT: Lightweight Omni Model

Gemma 4 is Google DeepMind's family of open multimodal models. The E2B variant is the smallest omni model in the family, built for phone-class and edge deployment. It natively handles text, image, and audio input and generates text output, pairing a lightweight vision encoder with a dedicated audio encoder. It supports a 128K token context window and multilingual coverage across 140+ languages.

#### Key Features

-   **Omni Multimodality** - Native understanding of text, image, and audio, with a compact vision encoder and a dedicated audio encoder for on-device multimodal workloads.
-   **Native Audio Understanding** - Automatic speech recognition (ASR) and speech-to-translated-text across multiple languages.
-   **Native Vision** - Image understanding with variable aspect ratio and resolution support.
-   **Hybrid Attention** - Interleaves local sliding-window attention with full global attention and always ends on a global layer; global layers use unified Keys and Values with Proportional RoPE (p-RoPE) for efficient long-context processing.
-   **Long Context** - 128K token context window for long documents and multi-turn multimodal sessions.
-   **Multilingual** - Pre-trained across 140+ languages.
-   **Smallest Footprint** - The most compact Gemma 4 model, targeting high-end phones and edge devices.

#### Use Cases

-   Phone-class and edge multimodal assistants
-   Automatic speech recognition and speech-to-translated-text translation
-   Visual question answering and image analysis
-   Document and image understanding
-   Multilingual chat and translation
-   Lightweight on-device reasoning

#### Architecture

Gemma 4 E2B processes multimodal inputs through dedicated lightweight encoders: a compact vision encoder handles images and video frames, and an audio encoder handles speech, both feeding a dense decoder-only transformer. The hybrid attention stack interleaves sliding-window local layers with global layers and ends on a global layer, so the model keeps a very small memory footprint while retaining long-context awareness. It is the most deployable member of the family, running on high-end phones as well as laptops.

#### Benchmarks

Google DeepMind reports instruction-tuned results across reasoning, coding, multilingual, vision, audio, and long-context suites, including MATH-Vision for visual math and MRCR v2 for long-context retrieval. For the full benchmark tables and family comparisons, see the [model card on HuggingFace](https://huggingface.co/google/gemma-4-E2B-it).


Gemma 4 E2B omni model by Google with native text, image, and audio input, 128K context, and phone-class efficiency

Gemma 4 E2B IT


### Gemma 4 E4B IT: Efficient Omni Model

Gemma 4 is Google DeepMind's family of open multimodal models. The E4B variant is one of the family's small, efficient omni models, designed to run on laptops and high-end phones. It natively handles text, image, and audio input and generates text output, pairing a lightweight vision encoder with a dedicated audio encoder. It supports a 128K token context window and multilingual coverage across 140+ languages.

#### Key Features

-   **Omni Multimodality** - Native understanding of text, image, and audio, with a compact vision encoder and a dedicated audio encoder for on-device multimodal workloads.
-   **Native Audio Understanding** - Automatic speech recognition (ASR) and speech-to-translated-text across multiple languages.
-   **Native Vision** - Image understanding with variable aspect ratio and resolution support.
-   **Hybrid Attention** - Interleaves local sliding-window attention with full global attention and always ends on a global layer; global layers use unified Keys and Values with Proportional RoPE (p-RoPE) for efficient long-context processing.
-   **Long Context** - 128K token context window for long documents and multi-turn multimodal sessions.
-   **Multilingual** - Pre-trained across 140+ languages.
-   **Efficient by Design** - A small dense model targeting on-device and edge deployment.

#### Use Cases

-   On-device and edge multimodal assistants
-   Automatic speech recognition and speech-to-translated-text translation
-   Visual question answering and image analysis
-   Document and image understanding
-   Multilingual chat and translation
-   Lightweight reasoning and coding assistance

#### Architecture

Gemma 4 E4B processes multimodal inputs through dedicated lightweight encoders: a compact vision encoder handles images and video frames, and an audio encoder handles speech, both feeding a dense decoder-only transformer. The hybrid attention stack interleaves sliding-window local layers with global layers and ends on a global layer, so the model keeps a small memory footprint while retaining long-context awareness. Its size makes it deployable in environments ranging from high-end phones to laptops.

#### Benchmarks

Google DeepMind reports instruction-tuned results across reasoning, coding, multilingual, vision, audio, and long-context suites, including MATH-Vision for visual math and MRCR v2 for long-context retrieval. For the full benchmark tables and family comparisons, see the [model card on HuggingFace](https://huggingface.co/google/gemma-4-E4B-it).


Gemma 4 E4B omni model by Google with native text, image, and audio input, 128K context, and on-device efficiency

Gemma 4 E4B IT


### Granite 4.0 H Small

Granite 4.0 H Small is a 32-billion-parameter long-context instruct model from the Granite Team at IBM. It is finetuned from Granite-4.0-H-Small-Base using a combination of permissively licensed open-source instruction datasets and internally collected synthetic data, refined through supervised finetuning, reinforcement-learning alignment, and model merging. Granite 4.0 instruct models emphasize improved instruction following and tool-calling, making them well suited to enterprise and agentic applications.

#### Architecture

Granite 4.0 H Small uses a decoder-only Mixture-of-Experts transformer built on a hybrid attention design. Its core components are Grouped-Query Attention, Mamba2 state-space layers, MoE feed-forward blocks with shared experts, SwiGLU activations, RMSNorm, and shared input/output embeddings. The hybrid layout pairs a small number of full-attention layers with a majority of Mamba2 layers (4 attention layers to 36 Mamba2 layers), which keeps the KV cache small even at long context and makes the model efficient to serve across a wide range of sequence lengths. As a sparse MoE, only a fraction of its total experts are activated per token, so it delivers the quality of a large model with the runtime cost closer to a much smaller one.

#### Key Features

- Native context window of 131,072 tokens for long-document and long-conversation workloads.
- Hybrid Mamba2/transformer architecture with a very low KV-cache footprint at long context.
- Strong tool-calling and function-calling support using an OpenAI-style function definition schema.
- Multilingual coverage across English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese.

#### Use Cases

- AI assistants and chatbots for business and general-purpose domains.
- Agentic and tool-using workflows that call external functions and APIs.
- Retrieval-Augmented Generation over enterprise knowledge bases.
- Summarization, text classification, and structured text extraction.
- Question answering and multilingual dialogue.
- Code-related tasks, including Fill-In-the-Middle completions.

Granite 4.0 H Small is available on Vast with vLLM and SGLang for full-precision and FP8 serving, and with llama.cpp for quantized GGUF deployments.

For full model details, benchmarks, and usage examples, see the [model card on HuggingFace](https://huggingface.co/ibm-granite/granite-4.0-h-small).


IBM's 32B hybrid Mamba-2/transformer MoE instruct model with strong tool-calling and low KV-cache memory

Granite 4.0 H Small


### Tencent Hunyuan 3 (Hy3): Advanced Mixture-of-Experts LLM

Hy3 is a large Mixture-of-Experts (MoE) language model developed by the Tencent Hy Team. It activates only a small fraction of its experts per token (21B active parameters, with 192 experts and top-8 routing), pairing the quality of a very large model with the inference efficiency of a far smaller dense one. Following the Hy3 Preview release, the team scaled up post-training with higher-quality data gathered from feedback across 50+ products, producing a model that rivals flagship open-source models several times its active size and delivers strong gains on real-world productivity tasks. Weights are published on [Hugging Face](https://huggingface.co/tencent/Hy3), with an FP8 checkpoint available at [Hy3-FP8](https://huggingface.co/tencent/Hy3-FP8).

#### Key Features

-   **Mixture-of-Experts efficiency** - Sparse top-8 routing over 192 experts keeps active compute low while retaining large-model quality.
-   **Long context** - Native 256K-token context window for long documents, codebases, and extended multi-turn dialogue.
-   **Strong agentic and reasoning capability** - Post-trained with scaled reinforcement learning for reasoning, tool use, and long-horizon tasks.
-   **Production-grade tool calling** - Reliable tool-call and output-format handling that generalizes across agent scaffoldings.
-   **Reasoning parser support** - Ships with dedicated reasoning and tool-call parsers for vLLM and SGLang.

#### Architecture

Hy3 uses a Mixture-of-Experts transformer with 80 layers and grouped-query attention (64 attention heads, 8 key-value heads). A dedicated multi-token-prediction layer supports speculative decoding for lower latency. Only the top-8 of its 192 experts are activated per token, so the model reasons with a large knowledge capacity while keeping per-token computation modest.

#### Agentic and Reasoning Strengths

Building on Hy3 Preview, the team improved the quality and diversity of post-training data while scaling up reinforcement learning. Hy3 shows solid gains across reasoning, agentic, and long-context evaluations, remaining competitive with much larger flagship models. In a blind evaluation run with 270 domain experts using tasks drawn from their own work, Hy3 scored 2.67 out of 4, ahead of GLM-5.1 at 2.51, with its largest advantages in frontend development, data and storage, and CI/CD tasks. In productivity scenarios such as coding, office work, financial modeling, frontend design, and game development, Hy3 performs as a reliable, cost-effective option.

#### Reliability Improvements

Hy3 targets the operational failure modes that matter in production. Tool-call and output-format reliability were raised to production-grade standards, with error recovery and efficiency improved and cross-scaffolding variance on SWE-Bench Verified held within about 4 percent. An anti-hallucination training regime lowered the internal hallucination rate from 12.5 to 5.4 percent and commonsense error rates from 25.4 to 12.7 percent. Joint SFT and RL optimization improved coreference resolution, ellipsis recovery, and multi-turn constraint tracking, cutting the internal multi-turn issue rate from 17.4 to 7.9 percent and improving long-dialogue performance while keeping outputs concise.

#### Use Cases

-   Autonomous agents and tool-using assistants
-   Long-context document and codebase analysis
-   Coding, frontend development, and software engineering workflows
-   Multi-turn conversational assistants and customer support
-   Reasoning-heavy research, analysis, and productivity tasks

Deploy Tencent Hunyuan 3 (Hy3) on Vast.ai with vLLM or SGLang for scalable, OpenAI-compatible inference.


Tencent's 295B Mixture-of-Experts LLM (21B active) with strong agent and reasoning capabilities and 256K context.

Tencent Hunyuan 3 (Hy3)


## Kimi K2.7 Code

Kimi K2.7 Code is an open-source, natively multimodal agentic model from Moonshot AI, built for long-horizon, coding-driven work. It is a large mixture-of-experts model with one trillion total parameters and thirty-two billion activated per token, distributed across 384 routed experts with a shared expert, and it ships as a native INT4 checkpoint using the same quantization approach as Kimi K2 Thinking. It shares its architecture with the Kimi K2.5 and K2.6 releases and extends the line with a code-first focus.

### Native multimodality

Kimi K2.7 Code accepts both text and image input through a MoonViT vision encoder, so it can read screenshots, diagrams, UI mockups, charts, and documents alongside a coding prompt. This makes it well suited to coding-driven design tasks where the model works from a visual reference and produces or edits code to match. Experimental video input is also part of the model family.

### Thinking-mode reasoning and tool use

The model runs in thinking mode, producing an explicit reasoning trace before its final answer, with thinking preserved by default across a conversation. It is trained for agentic tool calling, letting it plan and execute multi-step tasks, call external tools, and orchestrate long-running workflows rather than answering in a single turn. Recommended sampling for thinking mode uses a temperature of 1.0 and top-p of 0.95.

### Key features

- Native mixture-of-experts architecture with one trillion total parameters and thirty-two billion activated per token
- Native INT4 quantization for efficient serving of a trillion-parameter model
- Native image understanding through an integrated MoonViT vision encoder
- Thinking-mode reasoning with preserved chains of thought across turns
- Agentic tool calling for long-horizon, multi-step task orchestration
- A 256K-token context window for large codebases and long sessions

### Use cases

- Long-horizon, coding-driven agentic development and refactoring across large codebases
- Coding-driven design: turning screenshots, mockups, and diagrams into working code
- Reading and reasoning over technical documents, charts, and UI captures
- Tool-using agents that plan, call functions, and iterate over multi-step workflows
- Visual question answering grounded in code and technical context

Kimi K2.7 Code is available on Hugging Face at https://huggingface.co/moonshotai/Kimi-K2.7-Code


Kimi K2.7 Code is an open-source, native-multimodal agentic MoE model from Moonshot AI with 1T total parameters and 32B activated, natively INT4-quantized and specialized for long-horizon, coding-driven agentic workflows with thinking-mode reasoning, tool calling, and image input.

Kimi K2.7 Code


### Qwen3-VL 30B A3B Instruct

Qwen3-VL 30B A3B Instruct is a vision-language model from Alibaba's Qwen team and the most capable generation of the Qwen-VL series to date. It pairs strong text understanding and generation with deep visual perception and reasoning, extended context, and stronger agentic behavior. The model uses a Mixture-of-Experts design with roughly 30 billion total parameters and about 3 billion activated per token, so it delivers large-model quality at the inference cost of a much smaller dense model. This is the Instruct edition, tuned for direct, non-thinking responses across multimodal chat, document, and agent workloads.

### Key capabilities

Qwen3-VL acts as a visual agent that can operate PC and mobile GUIs, recognizing interface elements, understanding their function, invoking tools, and completing multi-step tasks. Its visual coding boost turns images and videos into working Draw.io diagrams and HTML, CSS, and JavaScript. Advanced spatial perception lets it judge object positions, viewpoints, and occlusions, with stronger 2D grounding and new 3D grounding for spatial reasoning and embodied AI.

The model natively handles 256K tokens of context, expandable to 1M, so it can work over entire books and hours-long video with full recall and second-level indexing. Its multimodal reasoning is tuned for STEM and math, favoring causal analysis and logical, evidence-based answers. Broader, higher-quality pretraining sharpens visual recognition across a wide range of subjects including public figures, anime, products, landmarks, and flora and fauna.

Optical character recognition now spans 32 languages, up from 19, and stays robust in low light, blur, and tilt while handling rare or ancient characters, technical jargon, and complex long-document structure. Because vision and text are fused seamlessly, its text-only understanding remains on par with comparable pure language models.

### Architecture

Three architectural updates drive the gains. Interleaved-MRoPE allocates positional frequencies across time, width, and height for stronger long-horizon video reasoning. DeepStack fuses multi-level vision-transformer features to capture fine-grained detail and tighten image-text alignment. Text-Timestamp Alignment moves beyond earlier temporal encodings to precise, timestamp-grounded event localization for improved video temporal modeling.

### Use cases

Qwen3-VL fits visual question answering, document and OCR pipelines, chart and diagram interpretation, image and video captioning, GUI automation and agentic tool use, spatial and 3D grounding, and multimodal STEM problem solving. The Instruct edition is well suited to interactive assistants and production inference where fast, direct answers are preferred over an explicit reasoning trace.

### Deployment on Vast

This entry ships vLLM and SGLang engines for the full-precision flagship, an official FP8 checkpoint for memory-efficient serving on a single high-memory GPU, and llama.cpp GGUF quantizations for cost-effective deployments. All engines expose an OpenAI-compatible API.

The model card and weights are available on [Hugging Face](https://huggingface.co/Qwen/Qwen3-VL-30B-A3B-Instruct).


Efficient 30B MoE vision-language model with 3B active params

Qwen3-VL 30B A3B Instruct


### Seed-OSS 36B Instruct

Seed-OSS is a series of open-source large language models developed by ByteDance's Seed Team, designed for powerful long-context, reasoning, agent, and general capabilities alongside versatile developer-friendly features. Despite being trained on 12T tokens, Seed-OSS achieves strong performance across several popular open benchmarks. Seed-OSS-36B-Instruct is the instruction-tuned chat model in the series and is primarily optimized for international (i18n) use cases.

#### Key Features

-   **Flexible control of thinking budget** - Users can dynamically adjust the reasoning length to match the task, trading extra chain-of-thought for accuracy on hard problems or shorter responses for simple ones. This lets you tune inference efficiency in production.
-   **Enhanced reasoning capability** - Specifically optimized for reasoning tasks while maintaining balanced, strong general capabilities.
-   **Agentic intelligence** - Performs well on agentic tasks such as tool use and issue resolving.
-   **Native long context** - Trained natively with context lengths of up to 512K tokens.

#### Thinking Budget

Seed-OSS lets you specify how many tokens the model may spend on its internal reasoning before answering. With no budget set, the model thinks with unlimited length by default. When a budget is specified, the model periodically reflects on how much of the budget it has consumed and delivers its final response once the budget is exhausted or the reasoning naturally concludes. The team recommends budget values that are integer multiples of 512 (for example 512, 1K, 2K, 4K, 8K, or 16K), since the model was extensively trained on these intervals; a budget of 0 produces a direct answer with no visible reasoning.

#### Architecture

Seed-OSS adopts a causal language model architecture with rotary position embeddings (RoPE), grouped-query attention (GQA), RMSNorm normalization, and SwiGLU activations. This design supports its native long-context training and efficient inference at extended sequence lengths.

#### Benchmarks

According to the publisher's reported results, Seed-OSS-36B-Instruct is competitive across reasoning, math, coding, agentic, and long-context evaluations. It performs strongly on math and reasoning suites such as AIME and BeyondAIME, on coding benchmarks including LiveCodeBench, on instruction-following evaluations like IFEval, and on long-context retrieval tasks such as RULER at extended sequence lengths. On challenging tasks the model's chain of thought lengthens and accuracy improves as the thinking budget grows, while simpler tasks reach strong scores with shorter reasoning.

#### Use Cases

-   Interactive chatbots and virtual assistants
-   Long-document analysis, summarization, and retrieval over large contexts
-   Agentic workflows involving tool use and multi-step problem solving
-   Math, reasoning, and coding assistance
-   Research on reasoning and post-training behavior

Deploy Seed-OSS 36B Instruct on Vast.ai to serve an OpenAI-compatible API with vLLM or run quantized GGUF builds with llama.cpp. For full model details, see the [model card on HuggingFace](https://huggingface.co/ByteDance-Seed/Seed-OSS-36B-Instruct).


Seed-OSS 36B Instruct


### Kimi K2.6

Kimi K2.6 is an open-source, native multimodal agentic model from Moonshot AI that advances practical capabilities in long-horizon coding, coding-driven design, proactive autonomous execution, and swarm-based task orchestration. It is a Mixture-of-Experts model with 1 trillion total parameters and 32 billion activated per token, built on the Kimi K2.5 architecture.

#### Key Features

- **Long-Horizon Coding** — Significant improvements on complex, end-to-end coding tasks, generalizing robustly across programming languages (Rust, Go, Python) and domains spanning front-end, DevOps, and performance optimization.
- **Coding-Driven Design** — Transforms simple prompts and visual inputs into production-ready interfaces and lightweight full-stack workflows, generating structured layouts, interactive elements, and rich animations with deliberate aesthetic precision.
- **Elevated Agent Swarm** — Scales horizontally to 300 sub-agents executing 4,000 coordinated steps; dynamically decomposes tasks into parallel, domain-specialized subtasks, delivering end-to-end outputs from documents to websites to spreadsheets in a single autonomous run.
- **Proactive & Open Orchestration** — Demonstrates strong performance in powering persistent 24/7 background agents that proactively manage schedules, execute code, and orchestrate cross-platform operations without human oversight.
- **Thinking & Instant Modes** — Supports reasoning (thinking) mode by default and an instant-response mode; `preserve_thinking` retains full reasoning content across multi-turn interactions for coding-agent scenarios.
- **Multimodal Input** — Accepts text, image, and video input via the MoonViT vision encoder (400M parameters).

#### Model Summary

| | |
|:---|:---|
| Architecture | Mixture-of-Experts (MoE) |
| Total Parameters | 1T |
| Activated Parameters | 32B |
| Number of Layers | 61 (1 dense + 60 MoE) |
| Number of Experts | 384 (8 selected per token, 1 shared) |
| Attention Hidden Dimension | 7168 |
| MoE Hidden Dimension per Expert | 2048 |
| Number of Attention Heads | 64 |
| Vocabulary Size | 160K |
| Context Length | 256K |
| Attention Mechanism | MLA |
| Activation Function | SwiGLU |
| Vision Encoder | MoonViT (400M parameters) |

Kimi K2.6 ships with native INT4 quantization, using the same method as Kimi K2 Thinking.

#### Benchmarks

**Agentic**
- HLE-Full (with tools): 54.0
- BrowseComp: 83.2 (86.3 with Agent Swarm)
- DeepSearchQA (f1-score): 92.5
- DeepSearchQA (accuracy): 83.0
- WideSearch (item-f1): 80.8
- Toolathlon: 50.0
- MCPMark: 55.9
- Claw Eval (pass^3): 62.3; (pass@3): 80.9
- APEX-Agents: 27.9
- OSWorld-Verified: 73.1

**Coding**
- Terminal-Bench 2.0 (Terminus-2): 66.7
- SWE-Bench Pro: 58.6
- SWE-Bench Multilingual: 76.7
- SWE-Bench Verified: 80.2
- SciCode: 52.2
- OJBench (python): 60.6
- LiveCodeBench (v6): 89.6

**Reasoning & Knowledge**
- HLE-Full: 34.7
- AIME 2026: 96.4
- HMMT 2026 (Feb): 92.7
- IMO-AnswerBench: 86.0
- GPQA-Diamond: 90.5

**Vision**
- MMMU-Pro: 79.4 (80.1 with python)
- CharXiv (RQ): 80.4 (86.7 with python)
- MathVision: 87.4 (93.2 with python)
- BabyVision: 39.8 (68.5 with python)
- V* (with python): 96.9

#### Use Cases

- Autonomous agentic workflows spanning coding, research, and browsing
- Long-horizon software engineering and multi-step code generation
- Coding-driven UI/UX design from prompts and visual inputs
- Document, chart, and image understanding at scale
- Multi-agent task orchestration with parallel sub-agent coordination
- Persistent background agents for schedule management and cross-platform operations


Kimi K2.6 is an open-source, native multimodal agentic MoE model from Moonshot AI with 1T total parameters, 32B activated, advancing long-horizon coding, coding-driven design, and swarm-based task orchestration

96k Context

Unsloth UD-Q8_K_XL (llama.cpp)

Kimi K2.6


### Gemma 4 26B A4B IT: Mixture-of-Experts Vision-Language Model

Gemma 4 is Google DeepMind's next-generation family of open multimodal models. The 26B A4B variant is a Mixture-of-Experts model with 25.2B total parameters but only 3.8B active per token, delivering frontier-level quality at the inference speed of a much smaller dense model. It handles text and image input natively, supports a 256K context window, and covers 140+ languages.

#### Key Features

-   **Mixture-of-Experts Architecture** - 128 fine-grained experts with top-8 routing and a shared expert, activating only 4B parameters per token for efficient inference.
-   **Hybrid Attention** - Interleaves sliding window (local) and full global attention layers, with unified Keys and Values on global layers and Proportional RoPE (p-RoPE) for long context efficiency.
-   **Reasoning / Thinking Mode** - Built-in configurable thinking mode lets the model reason step-by-step before answering.
-   **Multimodal** - Native text and image understanding with variable aspect ratio and resolution support; video analysis via frame sequences.
-   **Function Calling** - Native structured tool use for agentic workflows.
-   **Long Context** - 256K token context window for document analysis, long-form reasoning, and agent trajectories.
-   **Multilingual** - Out-of-the-box support for 35+ languages, pre-trained on 140+.
-   **Native System Prompts** - First-class support for the system role.

#### Use Cases

-   Document and PDF parsing, OCR (including multilingual and handwriting)
-   Chart, diagram, and screen/UI understanding
-   Long-context reasoning and summarization
-   Code generation, completion, and correction
-   Agentic workflows with structured function calling
-   Visual question answering and image analysis
-   Multilingual chat and translation

#### Architecture

Gemma 4 26B A4B uses a 30-layer MoE transformer with a 1024-token sliding window on local attention layers and unified Keys/Values on global layers, paired with a ~550M parameter vision encoder. Each expert has a GELU-activated FFN and the routing selects 8 of 128 experts plus 1 shared expert per token.

#### Benchmarks

Instruction-tuned results reported by Google DeepMind (selected):

-   MMLU Pro: 82.6%
-   AIME 2026 (no tools): 88.3%
-   LiveCodeBench v6: 77.1%
-   GPQA Diamond: 82.3%
-   BigBench Extra Hard: 64.8%
-   MMMLU: 86.3%
-   MMMU Pro (vision): 73.8%
-   MATH-Vision: 82.4%
-   MRCR v2 8-needle 128K: 44.1%

For full benchmark tables and model family comparisons, see the [model card on HuggingFace](https://huggingface.co/google/gemma-4-26B-A4B-it).


Gemma 4 26B A4B MoE vision-language model by Google with 256K context and thinking mode

Gemma 4 26B A4B IT


### Qwen3.5 35B A3B: Unified Vision-Language MoE Reasoning Model

Qwen3.5 35B A3B is a multimodal mixture-of-experts foundation model from Alibaba's Qwen team, featuring a hybrid Gated DeltaNet and sparse MoE architecture. It has 35 billion total parameters with 3 billion activated per token, delivering high-throughput inference at minimal latency. The model was trained with early fusion on multimodal tokens to achieve native vision-language understanding alongside strong text reasoning, coding, and agentic capabilities.

#### Key Features

-   **Unified Vision-Language Foundation** - Early fusion training on multimodal tokens achieves cross-generational parity with Qwen3 and outperforms Qwen3-VL models across reasoning, coding, agents, and visual understanding benchmarks
-   **Efficient Hybrid Architecture** - Gated Delta Networks combined with sparse Mixture-of-Experts deliver high-throughput inference with minimal latency and cost overhead
-   **Scalable RL Generalization** - Reinforcement learning scaled across million-agent environments with progressively complex task distributions for robust real-world adaptability
-   **Global Linguistic Coverage** - Expanded support to 201 languages and dialects for inclusive worldwide deployment
-   **Long Context** - 262,144 tokens natively, extensible up to 1,010,000 tokens with YaRN

#### Architecture

-   Causal Language Model with Vision Encoder
-   35B total parameters, 3B activated per token
-   40 layers with a 10 × (3 × (Gated DeltaNet → MoE) → 1 × (Gated Attention → MoE)) hybrid layout
-   Mixture of Experts with 256 experts, 8 routed + 1 shared activated
-   Multi-token prediction (MTP) trained with multi-steps
-   Native 262K context, extensible to 1M tokens

#### Use Cases

-   Multimodal reasoning and visual question answering
-   Document, chart, and diagram understanding
-   Coding and software engineering agents
-   Tool-using agent workflows across long horizons
-   Multilingual chat and instruction following across 201 languages
-   Long-context analysis and retrieval over large document sets

#### Benchmarks

On the Qwen3.5 benchmark suite ([source](https://huggingface.co/Qwen/Qwen3.5-35B-A3B)), Qwen3.5 35B A3B scores MMLU-Pro 85.3, MMLU-Redux 93.3, C-Eval 90.2, SuperGPQA 63.4, IFEval 91.9, GPQA Diamond 84.2, and LongBench v2 59.0, placing it competitively with much larger MoE peers while activating only 3B parameters per token.


Efficient 35B MoE with 3B active params, unified vision-language reasoning

Qwen3.5 35B A3B


### GLM 5: Large-Scale Agentic Reasoning Model

GLM 5 is a 744B parameter Mixture-of-Experts model with 40B active parameters, developed by Z.ai. It targets complex systems engineering, long-horizon agentic tasks, and advanced reasoning, building on the GLM 4.7 foundation with doubled total parameters and an expanded expert pool.

#### Key Features

- **Agentic Task Completion** - Achieves 77.8% on SWE-bench Verified and 73.3% on SWE-bench Multilingual, with strong Terminal-Bench 2.0 performance (56.2% with Terminus, 56.2% with Claude Code)
- **Complex Reasoning** - Scores 92.7% on AIME 2026 I, 96.9% on HMMT Nov. 2025, 82.5% on IMOAnswerBench, and 86.0% on GPQA-Diamond
- **Tool Use and Browsing** - Native tool calling with 62.0% on BrowseComp, 89.7% on tau-2-Bench, and 67.8% on MCP-Atlas; 50.4% on Humanity's Last Exam with tool access
- **Cybersecurity** - 43.2% on CyberGym for systems-level security tasks
- **Interleaved Thinking** - Reasons before every response and tool call, with turn-level control over reasoning depth
- **Bilingual** - Native English and Chinese language support

#### Use Cases

- Software engineering, code generation, and multi-file repository-level tasks
- Multi-step agentic workflows with tool calling and web browsing
- Complex mathematical reasoning and competition-level problem solving
- Terminal-based development, operations, and systems administration
- Cybersecurity analysis and systems engineering
- Research tasks requiring extended browsing and context management
- Long-form document analysis and generation

#### Architecture and Design

GLM 5 uses a Mixture-of-Experts architecture with 256 routed experts and 1 shared expert per layer, activating 8 experts per token. The first 3 layers are dense, while the remaining 75 layers use MoE routing with a sigmoid scoring function. The model employs Multi-head Latent Attention (MLA) with LoRA-compressed key-value projections (KV LoRA rank 512, Q LoRA rank 2048) for memory-efficient inference.

The model integrates DeepSeek Sparse Attention (DSA) to reduce deployment cost while preserving long-context capacity across its 128K token context window. A single Multi-Token Prediction (MTP) layer enables speculative decoding for improved inference throughput.

#### Training Approach

GLM 5 was pre-trained on 28.5 trillion tokens, increased from the 23 trillion tokens used for GLM 4.5. Post-training uses SLIME, a novel asynchronous reinforcement learning infrastructure designed for improved training efficiency at scale. The model defaults to thinking mode with temperature 1.0 and top-p 0.95 for general reasoning tasks, with temperature 0.7 recommended for coding benchmarks.

Deploy GLM 5 on Vast.ai for access to frontier-class agentic reasoning, coding, and tool use capabilities with flexible GPU infrastructure.


744B MoE model for agentic reasoning, coding, and tool use

GLM 5


### Qwen3.5 397B A17B: Efficient Multimodal Reasoning with Hybrid Attention

Qwen3.5 397B A17B is a multimodal mixture-of-experts language model developed by the Qwen team, featuring a novel hybrid architecture that combines Gated Delta Networks with sparse Gated Attention. It supports text, image, and video inputs with native reasoning capabilities across 201 languages and dialects.

#### Key Features

- **Hybrid DeltaNet-Attention Architecture** - Alternating blocks of Gated Delta Networks (linear attention) and Gated Attention with grouped-query heads, enabling efficient long-context processing while maintaining strong attention quality
- **Sparse Mixture-of-Experts** - 512 total experts with 10 routed and 1 shared expert active per token, delivering high capacity with efficient inference
- **Native Multimodal Support** - Early fusion training enables unified processing of text, images, and video inputs with near-parity to text-only performance
- **Interleaved Thinking** - Default reasoning mode generates structured thinking traces before responses, with per-turn control for balancing accuracy against latency
- **Tool Use and Agentic Workflows** - Native support for function calling and multi-step agent-based task execution
- **Multilingual Coverage** - Supports 201 languages and dialects with strong performance across diverse linguistic contexts

#### Benchmark Performance

**Reasoning and Mathematics:**
- AIME 2026: 91.3%
- HMMT Feb 2025: 94.8%
- GPQA Diamond: 88.4%

**Knowledge and Instruction:**
- MMLU-Pro: 87.8
- SuperGPQA: 70.4
- C-Eval: 93.0
- IFBench: 76.5

**Coding and Software Engineering:**
- SWE-bench Verified: 76.4%
- LiveCodeBench v6: 83.6
- SecCodeBench: 68.3

**Tool Use and Agent Tasks:**
- BFCL-V4: 72.9
- TAU2-Bench: 86.7
- Tool-Decathlon: 38.3

**Vision and Multimodal:**
- MMMU: 85.0
- MathVision: 88.6
- OmniDocBench: 90.8
- OCRBench: 93.1
- VideoMME (with subtitles): 87.5

#### Use Cases

- Complex mathematical reasoning and competition-level problem solving
- Multi-turn agentic workflows with tool calling and structured reasoning
- Code generation, debugging, and real-world software engineering tasks
- Document analysis, OCR, and visual question answering
- Video understanding and temporal reasoning
- Multilingual applications spanning 201 languages
- Multi-step research tasks requiring tool integration
- Image-based reasoning and spatial understanding

#### Architecture

Qwen3.5 397B A17B employs a 60-layer hybrid architecture organized in a repeating pattern of 15 cycles. Each cycle consists of three Gated DeltaNet blocks followed by one Gated Attention block, with every block paired with a mixture-of-experts feed-forward layer.

Gated Delta Networks provide efficient linear attention with fixed-size recurrent state, enabling long-context processing without the quadratic memory cost of standard attention. The interleaved Gated Attention blocks use grouped-query attention with 32 query heads and 2 key-value heads, preserving the model's ability to perform precise token-level attention when needed.

The mixture-of-experts layer routes each token through 10 of 512 available experts plus 1 shared expert, enabling the model to maintain high total capacity while keeping per-token computation efficient. Multi-Token Prediction training enables speculative decoding for faster inference throughput.

#### Training Approach

Qwen3.5 397B A17B was trained with early fusion multimodal pre-training, achieving near-complete training efficiency parity between multimodal and text-only settings. Post-training employs scalable reinforcement learning frameworks supporting massive-scale agent scaffolds with progressive task complexity, enabling strong performance on agentic and tool-use benchmarks.

Deploy Qwen3.5 397B A17B on Vast.ai for access to frontier-level multimodal reasoning, coding, and agentic capabilities with flexible GPU infrastructure.


Efficient multimodal reasoning model with hybrid DeltaNet-attention architecture

FP8 Quantized

Qwen3.5 397B A17B


### GLM 4.7-Flash: Efficient Agentic and Reasoning Model

GLM 4.7-Flash is a 30B-A3B Mixture of Experts (MoE) model developed by Z.ai, designed to deliver strong agentic, reasoning, and coding performance in a compact and efficient architecture. It is positioned as one of the strongest models in the 30B parameter class.

#### Key Features

- **MoE Architecture** - Uses a 30B-A3B Mixture of Experts design that activates only a fraction of parameters per token, providing an efficient balance between performance and resource usage
- **Strong Coding Performance** - Achieves 59.2% on SWE-bench Verified, substantially outperforming comparable models in real-world software engineering tasks
- **Agentic Capabilities** - Scores 79.5% on tau-2-Bench and 42.8% on BrowseComp, demonstrating effective tool use and web browsing abilities
- **Mathematical Reasoning** - Achieves 91.6% on AIME 2025, competitive with much larger models
- **Thinking Mode** - Supports preserved thinking for multi-turn agentic conversations, maintaining reasoning context across turns
- **Tool Calling** - Native support for structured tool calling and function integration in agentic workflows

#### Use Cases

- Code generation and debugging for software engineering tasks
- Agentic workflows with tool calling and web browsing
- Mathematical reasoning and problem solving
- Multi-turn conversations with context retention
- Research tasks requiring tool integration
- Lightweight deployment scenarios requiring strong performance with lower resource usage

#### Architecture

GLM 4.7-Flash uses a Mixture of Experts architecture with 30B total parameters and 3B active parameters per token. This sparse activation pattern enables the model to maintain high performance while requiring significantly fewer computational resources during inference compared to dense models of similar capability. The model supports thinking mode with preserved reasoning across conversation turns, enabling coherent multi-step agentic task completion.

#### Training Approach

The model was trained with emphasis on agentic task performance, coding, and reasoning. Evaluation uses temperature 1.0 with top-p 0.95 for general tasks, with specialized settings for coding and agentic benchmarks including temperature 0.7 for SWE-bench evaluations.

Deploy GLM 4.7-Flash on Vast.ai for efficient access to strong agentic and reasoning capabilities with flexible GPU infrastructure.


Lightweight agentic, reasoning and coding model

GLM 4.7-Flash


### GLM 4.7: Advanced Coding, Reasoning, and Agentic Model

GLM 4.7 is a 358B parameter language model developed by Z.ai, designed as a comprehensive coding partner with significant improvements over GLM 4.6 across coding, reasoning, tool use, and agentic tasks.

#### Key Features

- **Core Coding** - Major improvements in real-world software engineering with SWE-bench Verified score of 73.8% (+5.8% over GLM 4.6) and SWE-bench Multilingual at 66.7% (+12.9%)
- **Vibe Coding** - Improved UI generation quality with cleaner, more modern webpage output and better slide generation with accurate layout and sizing
- **Tool Use** - Strong performance in tool-integrated workflows with BrowseComp score of 52% and tau-2-Bench score of 87.4%
- **Complex Reasoning** - Achieves 42.8% on Humanity's Last Exam (HLE) with tools, 95.7% on AIME 2025, and 97.1% on HMMT Feb 2025
- **Interleaved Thinking** - The model thinks before every response and tool call, enabling more deliberate and accurate outputs
- **Preserved Thinking** - Retains thinking blocks across multi-turn conversations, improving coherence in agentic coding workflows
- **Turn-level Thinking Control** - Per-turn control over reasoning depth allows optimization of latency and cost

#### Use Cases

- Code generation, debugging, and real-world software engineering tasks
- Multi-turn agentic workflows with tool calling and web browsing
- Complex mathematical reasoning and problem solving
- Web UI and application development
- Terminal-based development and operations
- Multi-step research tasks requiring tool integration
- Long-form document analysis and generation

#### Architecture and Thinking Capabilities

GLM 4.7 introduces a refined thinking architecture with three distinct modes. Interleaved thinking allows the model to reason before every response and tool call. Preserved thinking retains reasoning blocks across conversation turns, which is particularly valuable for multi-step coding agent tasks where context continuity improves accuracy. Turn-level thinking provides granular control over when and how deeply the model reasons, allowing users to balance output quality against latency.

The model supports integration with popular coding agent frameworks and provides native tool calling capabilities with structured output for function calling workflows.

#### Training Approach

GLM 4.7 was trained with a focus on real-world coding performance, agentic task completion, and reasoning depth. The model uses a default evaluation setting of temperature 1.0 with top-p 0.95 for general tasks, with specialized settings for coding benchmarks including temperature 0.7 for SWE-bench and Terminal Bench evaluations.

Deploy GLM 4.7 on Vast.ai for access to advanced coding, reasoning, and agentic capabilities with flexible GPU infrastructure.


Advanced agentic, reasoning and coding model

GLM 4.7


### Qwen3 Coder Next: Ultra-Efficient Coding Agent Model

Qwen3 Coder Next is an 80B parameter sparse Mixture-of-Experts language model from
Alibaba's Qwen team, designed specifically for coding agents and local development.
With only 3B parameters activated per token, it achieves performance comparable to
models with 10-20x more active parameters, making it one of the most efficient
coding models available.

#### Key Features

- **Extreme Efficiency** -- 512 total experts with 10 activated per token plus 1
  shared expert, delivering strong coding performance at a fraction of the compute
  cost of dense models in the same parameter class
- **Advanced Agentic Capabilities** -- Purpose-built for autonomous coding workflows
  with long-horizon reasoning, complex tool usage, and robust recovery from execution
  failures across multi-step tasks
- **Native Tool Calling** -- First-class support for function calling through the
  OpenAI-compatible API, enabling integration with development tools, file systems,
  and external services
- **256K Native Context** -- Handles large codebases, lengthy documentation, and
  extended multi-turn conversations without truncation, with architecture support
  for extension to 1M tokens via YaRN

#### Hybrid Attention Architecture

Qwen3 Coder Next introduces a novel hybrid attention design that alternates between
two complementary attention mechanisms across its 48 layers. The architecture follows
a repeating pattern: three Gated DeltaNet layers (linear attention) followed by one
Gated Attention layer (traditional transformer attention), each connected through
MoE feed-forward blocks.

Gated DeltaNet layers provide efficient linear attention for fast sequential
processing, while Gated Attention layers with rotary position embeddings handle
precise token relationships. This hybrid approach enables both high throughput during
generation and strong performance on tasks requiring exact positional reasoning.

#### Use Cases

- **Autonomous Coding Agents** -- Ideal backbone for agent scaffolds including
  Claude Code, Qwen Code, Qoder, Kilo, Trae, and Cline, with native support for
  the tool-calling patterns these frameworks require
- **Software Engineering** -- Code generation, debugging, refactoring, and
  repository-level understanding across large codebases
- **Local Development** -- The sparse activation pattern makes it practical to run
  on fewer GPUs than comparably capable dense models, suitable for team-level
  or individual developer deployments
- **Multi-Step Workflows** -- Complex tasks involving file manipulation, test
  execution, dependency analysis, and iterative code refinement benefit from
  the model's long context and agentic training

#### Performance

Qwen3 Coder Next demonstrates competitive performance across major coding benchmarks
despite its significantly lower active parameter count. The model's agentic training
recipe enables it to handle real-world software engineering tasks that require
planning, tool use, and error recovery -- capabilities that go beyond static code
completion. Benchmark evaluations show it performing at levels comparable to models
with substantially more active parameters, validating the efficiency of its sparse
MoE architecture and hybrid attention design.

Deploy Qwen3 Coder Next on Vast.ai for efficient access to advanced coding agent
capabilities with flexible GPU infrastructure.


Ultra-efficient 80B coding agent with only 3B active parameters

Qwen3 Coder Next


# Kimi K2.5

Kimi K2.5 is an open-source, native multimodal agentic model developed by Moonshot AI. Built through continual pretraining on approximately 15 trillion mixed visual and text tokens atop Kimi-K2-Base, this model seamlessly integrates vision and language understanding with advanced agentic capabilities.

## Model Overview

Kimi K2.5 represents a significant advancement in multimodal AI, combining a trillion-parameter Mixture-of-Experts (MoE) architecture with native vision capabilities. The model activates 32 billion parameters per inference while maintaining efficiency through its expert-based design with 384 total experts and 8 selected per token.

The architecture features 61 layers, Multi-Latent Attention (MLA) for efficient attention computation, and a 400M parameter MoonViT vision encoder. This design enables the model to process text, images, and video inputs within a unified framework.

## Key Capabilities

### Native Multimodality

Unlike models that retrofit vision capabilities, Kimi K2.5 was pre-trained on vision-language tokens from the ground up. This native multimodal approach enables superior visual knowledge extraction and cross-modal reasoning, allowing the model to understand and reason about visual content with the same fluency as text.

### Coding with Vision

Kimi K2.5 can generate code directly from visual specifications, transforming UI designs and video workflows into functional implementations. The model autonomously orchestrates tools for visual data processing, bridging the gap between design and development.

### Agent Swarm

The model introduces a novel agent swarm capability, transitioning from single-agent execution to self-directed, coordinated multi-agent workflows. Kimi K2.5 can decompose complex tasks into parallel sub-tasks and dynamically instantiate domain-specific agents to handle them, enabling sophisticated problem-solving at scale.

## Operational Modes

Kimi K2.5 supports two distinct operational modes:

**Thinking Mode** (Default): Provides detailed reasoning content alongside responses, ideal for complex analytical tasks. Uses temperature 1.0 and top_p 0.95 for optimal performance.

**Instant Mode**: Delivers faster responses with disabled thinking, suitable for straightforward queries. Uses temperature 0.6 for more focused outputs.

## Benchmark Performance

Kimi K2.5 demonstrates strong performance across diverse evaluation benchmarks:

### Reasoning and Knowledge
- AIME 2025: 96.1
- GPQA-Diamond: 87.6
- MMLU-Pro: 87.1
- HLE-Full (with tools): 50.2

### Vision and Multimodal
- MMMU-Pro: 78.5
- VideoMMMU: 86.6
- OCRBench: 92.3
- OmniDocBench: 88.8
- InfoVQA: 92.6

### Coding
- SWE-Bench Verified: 76.8
- SWE-Bench Pro: 50.7
- LiveCodeBench: 85.0
- Terminal Bench 2.0: 50.8

### Agentic Search
- BrowseComp (Agent Swarm): 78.4
- WideSearch (Agent Swarm): 79.0
- DeepSearchQA: 77.1

### Long Context
- Longbench v2: 61.0
- AA-LCR: 70.0

## Use Cases

Kimi K2.5 excels in a variety of applications:

- **Multimodal Analysis**: Understanding and reasoning about images, videos, and text in unified workflows
- **Complex Reasoning**: Solving mathematical, logical, and analytical problems with detailed explanations
- **Software Engineering**: Generating, reviewing, and debugging code across multiple languages
- **Visual Coding**: Converting UI/UX designs directly into functional code
- **Document Understanding**: Extracting information from documents using advanced OCR capabilities
- **Multi-step Problem Solving**: Orchestrating tools and agents to tackle complex, multi-faceted tasks
- **Information Retrieval**: Conducting thorough research using coordinated agent swarm capabilities


Kimi K2.5 is an open-source, native multimodal agentic model built through continual pretraining on approximately 15 trillion mixed visual and text tokens atop Kimi-K2-Base

Kimi K2.5


LTX-2 is a DiT-based (Diffusion Transformer) audio-video foundation model developed by Lightricks that generates synchronized video and audio within a single unified model. With 19 billion parameters, it represents a significant advancement in multimodal generation, enabling practical video creation with accompanying audio from various input modalities.

## Key Features

LTX-2 supports multiple generation modes within a single architecture:

- **Text-to-Video**: Generate video content directly from text descriptions
- **Image-to-Video**: Animate static images into dynamic video sequences
- **Audio-Visual Generation**: Create synchronized audio and video output together
- **Cross-Modal Generation**: Support for audio-to-video, text-to-audio, and video-to-audio workflows

The unified architecture allows all these capabilities to work together seamlessly, making it possible to generate complete audiovisual content from simple prompts.

## Architecture

LTX-2 is built on a Diffusion Transformer (DiT) architecture, combining the strengths of diffusion models with transformer-based processing. This design enables the model to handle both video and audio generation within a single framework, maintaining temporal coherence across both modalities.

The model processes video with width and height divisible by 32, and frame counts divisible by 8 plus 1, allowing for flexible output configurations while maintaining generation quality.

## Training and Customization

The base model is fully trainable, supporting various customization approaches:

- **LoRA Training**: Create Low-Rank Adaptations for specific styles or subjects
- **IC-LoRA**: Image-Conditioned LoRAs for more precise control
- **Motion Adaptation**: Train custom motion patterns efficiently
- **Style Transfer**: Adapt the model to specific visual styles
- **Likeness Training**: Capture both appearance and sound characteristics

These customization options enable users to adapt LTX-2 for specific creative applications while building on its foundation capabilities.

## Use Cases

LTX-2 is designed for creative video generation applications including:

- Short-form video content creation
- Animation and motion design
- Visual storytelling with synchronized audio
- Creative experimentation with multimodal generation
- Prototyping video concepts from text descriptions

## Prompting

Effective prompting significantly impacts generation quality. The model responds well to detailed, descriptive prompts that clearly articulate the desired visual and audio elements. For best results, users should provide specific details about motion, scene composition, and audio characteristics when generating audiovisual content.

## Integration

LTX-2 integrates with ComfyUI through built-in LTXVideo nodes, enabling visual workflow-based generation. The model is also supported in the Hugging Face Diffusers library for programmatic access.

For more details about the model architecture and training approach, see the [model page on Hugging Face](https://huggingface.co/Lightricks/LTX-2).


LTX-2 is a DiT-based audio-video foundation model designed to generate synchronized video and audio within a single model

LTX-2


### ACE-Step V1: Open-Source Music Generation Model

ACE-Step is an open-source foundation model for music generation developed by ACE Studio and StepFun. It combines diffusion-based generation with Sana's Deep Compression AutoEncoder and a lightweight linear transformer architecture to deliver fast, high-quality music synthesis.

#### Key Features

- **Exceptional Speed** - 15× faster than LLM-based baselines for music generation
- **High Musical Quality** - Produces coherent output across melody, harmony, and rhythm
- **Full Song Generation** - Creates complete musical compositions with controllable duration
- **Natural Language Control** - Accepts text descriptions for music generation
- **Multilingual Support** - Supports 17 languages for input prompts
- **Open Source** - Released under Apache 2.0 license for commercial use

#### Use Cases

- Text-to-music generation from natural language descriptions
- Music remixing and style transfer
- Lyric editing and vocal manipulation
- Foundation model for specialized music generation tools
- Voice cloning applications
- Rapid prototyping of musical ideas
- Background music creation for media projects

#### Technical Architecture

- **Model Type**: Diffusion-based generation with transformer conditioning
- **Audio Processing**: Sana's Deep Compression AutoEncoder
- **Conditioning**: Lightweight linear transformer
- **Inference**: Optimized for real-time performance

#### Training Approach

ACE-Step employs a holistic architectural design that overcomes key limitations of existing music generation approaches. The model uses diffusion-based techniques combined with efficient audio compression to achieve high-quality output while maintaining fast inference speeds.

#### Limitations and Considerations

- Language performance varies, with top 10 languages delivering best results
- Structural coherence may decline for compositions exceeding 5 minutes
- Rendering of rare instruments can be inconsistent
- Output sensitivity to random seeds varies
- Vocal synthesis quality is limited compared to dedicated TTS models
- Some genres may produce suboptimal results

Deploy ACE-Step V1 on Vast.ai for fast, cost-effective music generation with enterprise-grade infrastructure.


ACE-Step is a novel open-source foundation model for music generation that overcomes key limitations of existing approaches through a holistic architectural design

ACE Step V1 3.5B


DeepSeek OCR is a vision-language model from DeepSeek AI that specializes in optical character recognition and document understanding. The model introduces "Contexts Optical Compression" as its core innovation, optimizing how visual information is compressed when processing text-heavy documents.

## Key Features

DeepSeek OCR excels at converting documents and images into structured formats, with particular emphasis on markdown conversion and raw text extraction. The model supports flexible inference modes through multiple configuration sizes (Tiny, Small, Base, Large, Gundam) that can be adjusted based on processing requirements with varying base_size and image_size parameters.

The model includes specialized grounding capabilities using grounding tokens for enhanced document understanding, making it particularly effective at maintaining context and structure during OCR operations. It employs n-gram logit processing for structured output generation, which proves especially useful for complex table extraction tasks.

## Architecture

Built on the Transformers framework with Safetensors format, DeepSeek OCR utilizes Flash Attention 2 for optimized performance on NVIDIA GPUs. The architecture supports custom inference parameters including crop_mode for flexible processing of various document layouts and formats. Integration with vLLM enables accelerated inference with batch processing support for production workloads.

## Use Cases

DeepSeek OCR is designed for a wide range of document processing applications:

- Document digitization and conversion to markdown format
- Table extraction from complex document layouts
- Multi-page PDF processing and analysis
- Batch OCR operations for production workflows
- Text extraction from images and scanned documents

## Performance and Adoption

The model has achieved significant adoption in the community, with over 4 million downloads monthly. It is actively deployed in more than 78 community Spaces, demonstrating diverse real-world applications across document understanding tasks.

DeepSeek OCR is published under the MIT license, making it accessible for both commercial and non-commercial use.


Contexts Optical Compression vision language model

DeepSeek OCR


### DeepSeek-R1-0528: Advanced Reasoning Language Model

DeepSeek-R1-0528 is an advanced reasoning model developed by DeepSeek AI that significantly improves upon its predecessor through enhanced computational depth and inference capabilities. Released under the MIT license, it represents a major advancement in open-source reasoning AI.

#### Key Capabilities

- **Deep Reasoning** - Enhanced computational depth for complex problem-solving, using extended token chains to explore multiple solution paths
- **Chain-of-Thought Processing** - Extended thinking depth for complex mathematical and logical problems
- **Function Calling** - Enhanced support for tool use and API integration
- **Reduced Hallucination** - Lower error rates compared to previous versions through reinforcement learning optimization
- **Commercial License** - MIT license permits commercial use and modification

#### Benchmark Performance

**Mathematics:**
- AIME 2024: 91.4% accuracy
- AIME 2025: 87.5% accuracy
- HMMT 2025: 79.4% accuracy

**Programming:**
- Codeforces Division 1: 1930 rating
- LiveCodeBench: 73.3% accuracy

**General Knowledge:**
- MMLU-Pro: 85.0% (Exact Match)
- GPQA-Diamond: 81.0% accuracy

#### Use Cases

- Complex mathematical problem solving
- Advanced code generation and debugging
- Research and technical analysis
- Scientific reasoning and hypothesis testing
- Legal document analysis
- Financial modeling and forecasting
- Educational tutoring for advanced subjects
- Logical reasoning and proof generation

#### Training Approach

DeepSeek-R1-0528 employs reinforcement learning to incentivize reasoning capability, with optimization mechanisms during post-training that increase computational depth. This approach allows the model to explore multiple solution paths before generating final answers, leading to significant improvements in accuracy on challenging reasoning tasks.

The model demonstrates a 25% improvement in AIME 2025 performance compared to its predecessor, achieved through increased reasoning depth averaging 23K tokens per question versus 12K in the earlier version.

#### Architecture

The model uses a transformer-based architecture enhanced with reinforcement learning techniques specifically designed to improve reasoning capabilities. The training process optimizes for extended chain-of-thought processing, enabling the model to break down complex problems into manageable steps.

Deploy DeepSeek-R1-0528 on Vast.ai for access to enterprise-grade GPU infrastructure at competitive pricing, enabling advanced reasoning capabilities for research and production applications.


DeepSeek R1 0528


### DeepSeek V3.1: Hybrid Thinking Language Model

DeepSeek V3.1 is a hybrid language model developed by DeepSeek AI that operates in both thinking and non-thinking modes. This dual-mode architecture allows the model to provide either deep reasoning with visible thought processes or fast responses without intermediate reasoning, depending on the task requirements.

#### Key Features

- **Hybrid Architecture** - Unique dual-mode operation supporting both thinking mode (similar to DeepSeek-R1) and non-thinking mode for faster responses
- **Enhanced Tool Usage** - Significantly improved performance in tool calling and agent-based tasks through post-training optimization
- **Extended Context** - Two-phase long context extension approach for handling extended conversations and documents
- **MIT License** - Open source with commercial use permissions

#### Benchmark Performance

**General Knowledge:**
- MMLU-Redux: 93.7% (thinking mode)
- MMLU-Pro: 83.7% (thinking mode)

**Mathematics:**
- AIME 2024: 93.1% accuracy (thinking mode)

**Programming:**
- LiveCodeBench: 74.8% accuracy
- Codeforces Division 1: 2091 rating

**Agent Tasks:**
- SWE Verified: 66% success rate
- SWE-bench Multilingual: 54.5% success rate
- BrowseComp (Chinese): 49.2% accuracy

#### Use Cases

- Complex reasoning tasks requiring visible thought processes
- Fast-response applications where speed is prioritized
- Tool-using agents and function calling systems
- Multi-step web research and search agents
- Code generation and debugging
- Mathematical problem solving
- Long-form document analysis and generation
- Customer support with reasoning transparency

#### Hybrid Mode Architecture

DeepSeek V3.1's unique feature is its ability to switch between operational modes:

**Thinking Mode**: Generates visible reasoning chains before final answers, ideal for complex problems where transparency and step-by-step logic are valuable. This mode achieves higher accuracy on challenging benchmarks.

**Non-Thinking Mode**: Provides direct answers without intermediate reasoning steps, optimized for speed and efficiency in straightforward queries.

This flexibility allows users to choose the appropriate mode based on their specific needs—transparency and accuracy for critical decisions, or speed for routine queries.

#### Training Approach

The model builds upon DeepSeek-V3.1-Base through extensive post-training optimization. A two-phase long context extension process significantly expanded the model's ability to handle extended inputs, with targeted training on tool usage and agent capabilities.

Post-training specifically focused on enhancing function calling, tool integration, and agent-based task performance, making the model particularly strong in real-world applications requiring external tool interaction.

Deploy DeepSeek V3.1 on Vast.ai to leverage its hybrid thinking capabilities with flexible GPU infrastructure for both research and production applications.


DeepSeek V3.1


### DeepSeek V3.2 Exp: Sparse Attention Language Model

DeepSeek V3.2 Exp is an experimental language model developed by DeepSeek AI that introduces DeepSeek Sparse Attention (DSA), a novel mechanism designed to optimize long-context scenarios. Building on V3.1-Terminus, this model represents ongoing research into more efficient transformer architectures, particularly for extended text processing.

#### Key Features

- **Sparse Attention Innovation** - Introduces DeepSeek Sparse Attention (DSA) achieving fine-grained sparse attention for the first time, delivering efficiency gains while maintaining output quality
- **Long-Context Optimization** - Specifically designed to excel in extended text processing scenarios
- **Tool Integration** - Enhanced capabilities for function calling and multi-turn conversations with tool use
- **MIT License** - Open source with full commercial use permissions

#### Benchmark Performance

**General Knowledge:**
- MMLU-Pro: 85.0%

**Mathematics:**
- AIME 2025: 89.3% accuracy (improved from 88.4%)

**Programming:**
- Codeforces: 2121 rating (improved from 2046)

**Factual Accuracy:**
- SimpleQA: 97.1% (improved from 96.8%)

#### Use Cases

- Long-form document analysis and generation
- Multi-turn conversational AI with extended context
- Code generation and debugging tasks
- Research and technical analysis requiring extended reasoning
- Tool-using agents with function calling capabilities
- Web browsing and information retrieval tasks
- Customer support with context-aware responses
- Educational applications with detailed explanations

#### Sparse Attention Architecture

DeepSeek V3.2 Exp's primary innovation is the introduction of DeepSeek Sparse Attention (DSA), which achieves fine-grained sparse attention patterns. This mechanism optimizes the model's ability to process long contexts efficiently while maintaining performance comparable to or better than dense attention models.

The sparse attention approach allows the model to focus computational resources on the most relevant parts of long sequences, enabling efficient processing of extended documents and conversations without sacrificing output quality.

#### Training Approach

The model's training configurations were deliberately aligned with V3.1-Terminus to rigorously evaluate the sparse attention mechanism's impact. This controlled approach ensures fair performance comparisons and validates the effectiveness of the architectural innovations.

The experimental nature of this release reflects DeepSeek AI's ongoing research into more efficient transformer architectures, with a particular focus on improving performance in long-context scenarios.

Deploy DeepSeek V3.2 Exp on Vast.ai to leverage cutting-edge sparse attention technology for efficient long-context processing in research and production applications.


DeepSeek V3.2 Exp


### Dia 1.6B: Realistic Dialogue Generation from Text

Dia is a text-to-speech model developed by Nari Labs that directly generates highly realistic dialogue from transcripts. The model supports English language generation and enables emotion and tone control through audio conditioning.

#### Key Features

**Dialogue Generation with Speaker Tags**
Dia produces natural speech from transcripts using `[S1]` and `[S2]` speaker tags, making it easy to create multi-speaker conversations directly from text.

**Nonverbal Communication**
The model recognizes and generates approximately 20 different nonverbal expressions including laughter, coughing, throat clearing, sighing, and gasps. These are triggered using simple tags like "(laughs)", "(clears throat)", and "(sighs)".

**Voice Cloning**
Dia includes voice cloning functionality that enables speaker consistency across generations. The model produces different voices with each generation without requiring fine-tuning on specific voices, and supports seed-fixing for reproducibility.

**Audio Conditioning**
The model can be conditioned on audio input, enabling precise control over emotion and tone in the generated speech output.

#### Use Cases

- Creating realistic dialogue for audio content and storytelling
- Generating conversational speech with multiple speakers
- Producing speech with emotional expressions and nonverbal sounds
- Voice synthesis applications requiring speaker consistency
- Accessibility tools for text-to-speech conversion

#### Training and Architecture

Dia draws inspiration from SoundStorm and Parakeet architectures, utilizing the Descript Audio Codec for audio generation. The model development benefited from resources provided by the Google TPU Research Cloud program and a Hugging Face ZeroGPU grant.





Dia directly generates highly realistic dialogue from a transcript. You can condition the output on audio, enabling emotion and tone control

Dia 1.6B


### FLUX.1 [dev]: Advanced Text-to-Image Generation

FLUX.1 [dev] is a 12 billion parameter rectified flow transformer developed by Black Forest Labs for generating images from text descriptions. The model represents a significant advancement in open-source image generation technology, delivering cutting-edge output quality that rivals closed-source competitors.

<warning>
**This is a gated model**.  To enable access, set environment variable `HF_TOKEN` at page [Settings -> Environment Variables](https://cloud.vast.ai/settings/)  Your HuggingFace account must be granted access before you can download this model.
</warning>

#### Architecture and Design

Built as a rectified flow transformer, FLUX.1 [dev] employs guidance distillation during training—a technique that enhances inference efficiency compared to traditional diffusion approaches. This architectural choice enables developers to achieve strong generation results with reduced computational overhead while maintaining high output quality.

The model's open weights enable researchers and developers to study, modify, and build upon the architecture, fostering innovation in the image generation space.

#### Key Capabilities

FLUX.1 [dev] demonstrates several distinguishing strengths in image generation:

- **Prompt Following:** Matches the performance of closed-source competitors in understanding and accurately executing complex text prompts
- **Output Quality:** Delivers cutting-edge image quality, positioning itself as a leading open-source alternative to proprietary models
- **Computational Efficiency:** Guidance distillation optimization reduces inference computational demands while preserving generation quality
- **Integration Flexibility:** Compatible with multiple platforms including Diffusers, ComfyUI, and various API providers

#### Use Cases

The model excels across diverse image generation applications:

- Digital art creation and creative design projects
- Marketing materials and visual content for commercial campaigns
- Product visualization and concept mockups
- Scientific visualization and research illustration
- Social media content generation
- Concept art and illustration for creative industries
- Rapid prototyping of visual ideas

#### Community and Ecosystem

FLUX.1 [dev] has achieved substantial adoption since release, with monthly downloads exceeding 1.5 million. The model has spawned a vibrant ecosystem including 36,000+ adapter models and 100+ community Spaces, demonstrating its versatility as a foundation for specialized image generation applications.

#### Technical Considerations

The model acknowledges typical constraints of statistical image generation systems: it may occasionally fail to generate output that precisely matches prompts, and like other large-scale models trained on web data, it might reflect patterns present in training data. Users building production systems should implement appropriate content filtering and quality validation workflows.


Rectified flow transformer capable of generating images from text descriptions

FLUX.1 [dev]


## Overview

FLUX.2 [dev] is a 32 billion parameter rectified flow transformer developed by Black Forest Labs for text-to-image generation, editing, and composition. The model represents state-of-the-art performance in open text-to-image generation, single-reference editing, and multi-reference editing tasks.

## Key Features

### Unified Generation and Editing

FLUX.2 [dev] provides a unified approach to image generation and editing without requiring separate models or fine-tuning. The model can:

- Generate high-quality images from text descriptions
- Edit images based on single reference inputs
- Combine and compose images using multiple reference inputs
- Maintain consistent characters, objects, and styles across generations

### Reference-Based Workflows

Users can reference specific characters, objects, and visual styles directly through the model's multi-modal input system, eliminating the need for traditional fine-tuning or LoRA adapters. This enables consistent character generation and style transfer without additional training steps.

### Computational Efficiency

Built using guidance distillation techniques, FLUX.2 [dev] achieves efficient inference while maintaining high output quality. The model operates in bfloat16 precision and supports 4-bit quantization for reduced memory requirements.

### Safety Measures

The model incorporates comprehensive safety features including:

- Pre-training and post-training safety measures against harmful content
- Third-party safety evaluations
- Inference-time filtering for NSFW and IP-infringing content
- C2PA content provenance metadata for generated images

## Architecture

FLUX.2 [dev] is based on a rectified flow transformer architecture with 32 billion parameters. The model processes text prompts alongside optional image references to generate or edit images. Rectified flows provide a direct path between noise and image distributions, enabling efficient sampling with fewer inference steps compared to traditional diffusion models.

## Use Cases

### Creative and Artistic Applications

- Digital art creation with precise style control
- Character design with consistent appearance across generations
- Concept art and illustration
- Visual storytelling with coherent character and scene continuity

### Content Creation

- Marketing materials and advertisements
- Social media content generation
- Product visualization and mockups
- Editorial imagery

### Research and Development

- Computer vision research
- Image editing algorithm development
- Multi-modal model research
- Generative AI studies

### Professional Workflows

- Rapid prototyping for design projects
- Reference image creation for traditional artists
- Style exploration and iteration
- Image composition and editing

## Integration Support

FLUX.2 [dev] integrates with popular inference frameworks including Diffusers and ComfyUI, as well as custom implementations. The model typically requires 28-50 inference steps for high-quality outputs.


Rectified flow transformer capable of generating, editing and combining images based on text instructions

FLUX.2 [dev]


### GLM 4.5V: Advanced Vision-Language Foundation Model

GLM 4.5V is a multimodal AI system built on ZhipuAI's flagship language foundation model, leveraging GLM-4.5-Air (106B parameters with 12B active) as its architectural backbone. The model combines sophisticated vision and language understanding capabilities for advanced reasoning tasks, achieving state-of-the-art performance among models of similar scale across 42 public vision-language benchmarks.

#### Architecture and Design

The model employs a hybrid architecture that integrates visual understanding capabilities into the GLM-4.5-Air foundation model. This design enables efficient parameter allocation while maintaining competitive performance against larger multimodal systems. The architecture supports extended context processing with 64,000 token capacity, enabling analysis of lengthy documents and extended visual content.

Training methodology incorporated reinforcement learning with curriculum sampling (RLCS) and chain-of-thought reasoning mechanisms to enhance accuracy and interpretability across diverse visual domains.

#### Key Capabilities

GLM 4.5V demonstrates exceptional performance across multiple visual understanding scenarios:

**Image Analysis:**
- Scene comprehension and contextual understanding
- Multi-image comparison and relationship analysis
- Spatial recognition and geometric reasoning
- Visual grounding with precise bounding box identification using normalized coordinates

**Video Understanding:**
- Long-form video segmentation and temporal analysis
- Event detection across extended video sequences
- Temporal reasoning and narrative comprehension

**Document Processing:**
- Chart and diagram interpretation
- Long-form document analysis with extended context
- Table extraction and structured data understanding

**GUI Automation:**
- Screen reading and interface interpretation
- Icon recognition and UI element identification
- Desktop task assistance and workflow automation

#### Distinctive Features

**Thinking Mode Toggle:**
A unique capability enables users to adjust the balance between quick responses and deep reasoning. This adaptive processing allows optimization for either rapid inference or thorough analytical tasks depending on application requirements.

**Flexible Input Handling:**
- Supports arbitrary aspect ratios for diverse visual content
- Processes images up to 4K resolution
- Handles multiple images simultaneously for comparative analysis

**Hybrid Training Approach:**
Enables robust handling of diverse visual content types through comprehensive training across image, video, document, and interface understanding tasks.

#### Performance and Benchmarks

GLM 4.5V achieves state-of-the-art performance among models of comparable scale across 42 public vision-language benchmarks. The model outperforms larger competitors in specific domains despite its more efficient parameter allocation, demonstrating the effectiveness of its architectural design and training methodology.

#### Use Cases

The model excels in applications requiring sophisticated vision-language understanding:

- Visual question answering across diverse domains
- Document analysis and information extraction
- Chart and diagram interpretation for data analysis
- Long-form video content understanding and summarization
- GUI automation and interface interaction
- Multi-image comparative analysis
- Image captioning with detailed descriptions
- Visual content moderation and classification
- Spatial reasoning and geometric analysis
- Educational content analysis and tutoring

#### Deployment and Integration

GLM 4.5V supports multiple inference frameworks for flexible deployment:

- **Transformers:** Standard integration for research and development
- **vLLM:** Optimized inference for production environments
- **SGLang:** Advanced framework support

The model includes optimizations for video processing and multi-GPU inference, enabling efficient deployment across different hardware configurations and use case requirements.

#### Technical Considerations

The thinking mode toggle provides a unique advantage for applications requiring variable processing depth. Quick mode enables rapid responses for interactive applications, while deep reasoning mode supports complex analytical tasks requiring thorough evaluation.

The model's support for arbitrary aspect ratios and 4K resolution processing makes it particularly suitable for professional document analysis and high-resolution visual content understanding, where maintaining original image fidelity is critical for accurate interpretation.


GLM-4.5V is based on ZhipuAI’s next-generation flagship text foundation model GLM-4.5-Air

GLM 4.5V


### GLM 4.6: Advanced Agentic and Reasoning Model

GLM 4.6 is a large language model developed by Z.ai (Zhipu AI) that excels in agentic applications, reasoning tasks, and code generation. Building upon GLM-4.5, this model introduces significant improvements in context handling, reasoning capabilities, and tool-using agent integration.

This template defaults to 32k context for wider compatibility in search

#### Key Features

- **Extended Context** - Expanded context window from 128K to 200K tokens for handling complex, long-form tasks
- **Enhanced Reasoning** - Clear improvements in reasoning performance with support for tool use during inference
- **Superior Coding** - Demonstrates stronger real-world coding performance in applications and complex development tasks
- **Agentic Capabilities** - Advanced tool-using and search-based agent integration for multi-step workflows
- **MIT License** - Open source with full commercial use permissions

#### Benchmark Performance

GLM-4.6 was evaluated across eight public benchmarks covering agents, reasoning, and coding, demonstrating clear performance gains over GLM-4.5 and competitive results against leading models.

The model shows particularly strong performance in:
- Agentic task completion
- Complex reasoning workflows
- Real-world coding applications
- Tool-integrated systems

#### Use Cases

- Agentic applications requiring multi-step reasoning and tool use
- Complex code generation and debugging tasks
- Research and technical analysis with extended context
- Tool-using systems and function calling applications
- Search-based agents and information retrieval
- Long-form document analysis and generation
- Multi-turn conversations with context retention
- Educational applications with detailed explanations

#### Architecture and Capabilities

GLM-4.6 builds on the General Language Model architecture with specific optimizations for reasoning and tool use. The model supports function calling and tool integration during inference, enabling sophisticated agentic workflows where the model can autonomously use external tools to complete complex tasks.

The expanded 200K token context window allows the model to process extensive documents, maintain coherent multi-turn conversations, and handle complex reasoning chains that require reference to large amounts of information.

#### Training Approach

The model was trained with a focus on improving real-world performance in coding, reasoning, and agentic tasks. Evaluation settings include temperature of 1.0 for general tasks, with optimized sampling parameters for specialized applications like code generation.

Deploy GLM 4.6 on Vast.ai for access to advanced agentic and reasoning capabilities with flexible GPU infrastructure for research and production applications.


GLM 4.6


### GPT-OSS-120b: Open-Weight Reasoning Model

GPT-OSS-120b is an open-weight model from OpenAI designed for production use cases requiring powerful reasoning capabilities. The model features adjustable reasoning effort and complete chain-of-thought visibility, making it ideal for applications where transparency and control over the reasoning process are essential.

#### Key Features

- **Adjustable Reasoning** - Configure reasoning effort across low, medium, and high settings to balance speed and accuracy
- **Chain-of-Thought Access** - Complete visibility into the model's reasoning process for transparency and debugging
- **Agentic Functions** - Native support for function calling, web browsing, Python code execution, and structured outputs
- **Fine-Tuning Ready** - Fully customizable through parameter adjustment for specialized tasks
- **Apache 2.0 License** - Permissive open source license with no copyleft restrictions

#### Use Cases

- Production applications requiring adjustable reasoning depth
- Agentic systems with function calling and tool use
- Applications requiring reasoning transparency
- Code execution and analysis tasks
- Web browsing and information retrieval agents
- Structured output generation for data processing
- Fine-tuned specialized models for domain-specific tasks
- Research applications requiring model customization

#### Reasoning Architecture

GPT-OSS-120b's distinctive feature is its adjustable reasoning capability. Users can configure the model's reasoning effort to match their specific needs—using low effort for quick responses on straightforward queries, or high effort for complex problems requiring deep analysis.

The model provides complete access to its chain-of-thought process, allowing developers to inspect how the model arrives at conclusions. This transparency is valuable for debugging, verification, and understanding model behavior in critical applications.

#### Agentic Capabilities

The model includes native support for multiple agentic functions, enabling it to:
- Call external functions and APIs
- Browse web content for information retrieval
- Execute Python code for computational tasks
- Generate structured outputs in predefined formats

These capabilities make GPT-OSS-120b particularly well-suited for building autonomous agents that can interact with external tools and systems.

#### Training and Optimization

GPT-OSS-120b employs MXFP4 quantization applied to Mixture-of-Experts (MoE) weights during post-training, enabling efficient inference while maintaining model quality. The model uses OpenAI's harmony response format for structured interactions.

Deploy GPT-OSS-120b on Vast.ai for access to flexible reasoning capabilities with transparent chain-of-thought processing for production and research applications.


OpenAI's open-weight models designed for powerful reasoning

GPT-OSS-120b


### GPT-OSS-20b: Efficient Open-Weight Model

GPT-OSS-20b is an open-weight language model from OpenAI designed for lower latency and specialized use cases. With adjustable reasoning capabilities and native agentic functions, this model provides a balance of performance and efficiency for applications requiring fast responses with reasoning transparency.

#### Key Features

- **Efficient Architecture** - Optimized for lower latency while maintaining reasoning capabilities
- **Adjustable Reasoning** - Configure reasoning effort across low, medium, and high settings
- **Chain-of-Thought Access** - Full visibility into reasoning processes for debugging and verification
- **Agentic Functions** - Native support for function calling, web browsing, Python execution, and structured outputs
- **Fine-Tuning Ready** - Customizable for domain-specific applications
- **Apache 2.0 License** - Permissive open source with no copyleft restrictions

#### Use Cases

- Lower latency applications requiring quick responses
- Specialized domains through fine-tuning
- Agentic systems with tool integration
- Function calling and API integration tasks
- Web browsing and information retrieval
- Code execution and analysis
- Structured output generation
- Local and edge deployment scenarios

#### Reasoning Capabilities

GPT-OSS-20b supports three levels of reasoning effort, configurable via system prompts:

**Low**: Quick responses optimized for conversational queries where speed is prioritized over deep analysis.

**Medium**: Balanced approach providing analytical depth while maintaining reasonable response times.

**High**: Comprehensive analysis for complex problems requiring thorough reasoning chains.

The model provides complete access to its chain-of-thought process, enabling developers to inspect and verify how conclusions are reached—valuable for debugging and ensuring model reliability in production applications.

#### Agentic Architecture

GPT-OSS-20b includes native support for multiple agentic capabilities:
- **Function Calling**: Execute defined functions with schema validation
- **Web Browsing**: Retrieve information from web sources
- **Python Execution**: Run computational tasks and data processing
- **Structured Outputs**: Generate responses in predefined formats

These built-in capabilities eliminate the need for external tooling layers, simplifying deployment of autonomous agents.

#### Training and Optimization

The model employs MXFP4 quantization applied to Mixture-of-Experts (MoE) weights during post-training, enabling efficient inference while preserving model quality. The model uses OpenAI's harmony response format for structured interactions.

Deploy GPT-OSS-20b on Vast.ai for access to efficient reasoning with transparent chain-of-thought processing, ideal for specialized applications and lower-latency use cases.


GPT OSS 20b


### HiDream I1 Full: State-of-the-Art Image Generation Foundation Model

HiDream I1 is an open-source image generation foundation model featuring 17 billion parameters that achieves state-of-the-art quality with rapid generation speeds. Released in May 2025, the model delivers industry-leading prompt adherence while maintaining exceptional versatility across diverse artistic styles from photorealistic imagery to cartoon and artistic renderings.

#### Architecture and Design

The model employs a sparse diffusion transformer architecture, detailed in the technical paper "HiDream-I1: A High-Efficient Image Generative Foundation Model with Sparse Diffusion Transformer." The system integrates multiple components for optimal performance:

- VAE component from FLUX.1 [schnell] for latent space encoding
- Text encoders combining Google's T5-v1.1-xxl and Meta's Llama 3.1-8B-Instruct for comprehensive prompt understanding
- HiDreamImagePipeline for efficient inference execution
- Flash Attention optimization support for improved computational efficiency

The sparse transformer design enables the model to generate high-quality images within seconds while maintaining competitive computational requirements.

#### Benchmark Performance

HiDream I1 demonstrates exceptional results across multiple evaluation frameworks:

**GenEval Results (Overall Score: 0.83):**
Achieved the highest composite score among evaluated models, with perfect single object generation (1.00) and near-perfect two-object scenarios (0.98). Strong performance in color attribution (0.72) and counting accuracy (0.79).

**DPG-Bench (Overall: 85.89):**
Leads in relation comprehension (93.74) and miscellaneous categories (91.83), demonstrating sophisticated understanding of object relationships and complex scene composition.

**HPSv2.1 Benchmark (33.82 averaged):**
Surpasses leading competitors including Flux.1-dev (32.47) and DALL-E 3 (31.44) in human preference alignment, with particularly strong performance in animation style (35.05).

#### Key Capabilities

The model excels in several distinguishing areas:

- **Prompt Adherence:** Industry-leading performance in understanding and executing complex text descriptions
- **Style Versatility:** Exceptional quality across photorealistic, cartoon, artistic, and animation styles
- **Generation Speed:** Produces high-quality images within seconds
- **Quality Consistency:** Maintains strong results across diverse prompts and use cases
- **Commercial Accessibility:** MIT license enables unrestricted commercial and research applications

#### Use Cases

HiDream I1 Full supports a wide range of image generation applications:

- Commercial content creation for marketing and advertising
- Digital art and creative design across multiple styles
- Product visualization and mockup generation
- Scientific research and academic visualization
- Animation and character design
- Concept art for creative industries
- Rapid prototyping of visual concepts
- Social media content generation

#### Technical Considerations

The model is available in three variants: full, dev (distilled), and fast (distilled), allowing users to select the appropriate balance between quality and computational efficiency for their specific use cases. The full variant provides maximum quality, while distilled versions offer accelerated inference for time-sensitive applications.


Open-source image generative foundation model with 17B parameters that achieves state-of-the-art image generation quality within seconds

HiDream I1 Full


### InternVL3 78B: Flagship Multimodal Language Model

InternVL3 78B represents the flagship model in OpenGVLab's InternVL3 series, combining a 6B vision transformer with Qwen2.5-72B as the language component. The model demonstrates superior overall performance through integrated multimodal perception and reasoning capabilities, representing a significant advancement in open-source multimodal AI through its native training approach that achieves strong vision-language performance without compromising text-only capabilities.

#### Architecture and Design

The model follows a proven ViT-MLP-LLM paradigm enhanced with several architectural innovations:

**Vision Component:**
- InternViT-6B-448px-V2_5 processes images through dynamic resolution tiling
- Pixel Unshuffle reduces visual tokens to one-quarter of original count for computational efficiency
- Variable Visual Position Encoding (V2PE) implements flexible positional increments for improved long-context understanding

**Language Integration:**
- Qwen2.5-72B serves as the language backbone
- Native integration enables simultaneous multimodal representation development
- Maintains strong text-only performance despite multimodal training

**Multi-modal Support:**
- Handles images with dynamic resolution processing
- Processes video sequences with temporal understanding
- Supports interleaved image-text sequences for complex conversations

#### Advanced Training Methodology

**Native Multimodal Pre-Training:**
A distinguishing characteristic is the consolidation of language and vision learning into a single pre-training stage, rather than sequentially adapting language models to vision. This approach enables simultaneous development of multimodal representations, resulting in more cohesive understanding across modalities.

**Mixed Preference Optimization (MPO):**
Addresses distribution shift between training (ground-truth tokens) and inference (model-predicted tokens) by incorporating preference signals during training. This methodology enhances reasoning capabilities and reduces exposure bias during generation.

**Test-Time Scaling:**
Employs Best-of-N evaluation with VisualPRM-8B as a critic model for reasoning and mathematics tasks, enabling quality-optimized inference for applications requiring high accuracy.

#### Benchmark Performance

InternVL3 78B excels across diverse evaluation categories:

- **Multimodal Reasoning:** Superior performance on mathematical and visual reasoning benchmarks
- **Document Understanding:** Strong OCR, chart interpretation, and document analysis capabilities
- **Video Comprehension:** Effective temporal understanding of video sequences
- **GUI and Spatial Reasoning:** Advanced interface grounding and spatial analysis
- **Language Performance:** Outperforms base Qwen2.5 models on text-only tasks despite multimodal training focus

The model's ability to exceed text-only baseline performance while maintaining multimodal capabilities demonstrates the effectiveness of native multimodal training approaches.

#### Key Capabilities

The model demonstrates exceptional performance across multiple domains:

**Image Analysis:**
- Single and multi-image conversations with detailed descriptions
- Fine-grained visual understanding and attribute recognition
- Complex scene comprehension and relationship analysis

**Document Processing:**
- Optical character recognition across diverse formats
- Chart and diagram interpretation with data extraction
- Technical documentation understanding

**Video Understanding:**
- Frame-by-frame analysis with temporal coherence
- Event detection and narrative comprehension
- Long-form video summarization

**Agent Applications:**
- GUI navigation and interface interpretation
- Tool usage coordination for autonomous agents
- Spatial reasoning for robotic applications

**Industrial Applications:**
- 3D vision perception and depth understanding
- Specialized image analysis for domain-specific tasks

#### Use Cases

InternVL3 78B excels in applications requiring sophisticated multimodal understanding:

- Visual question answering across diverse domains
- Document analysis and information extraction
- Video content understanding and summarization
- GUI automation and interface interaction
- Scientific visualization interpretation
- Educational content analysis
- Medical image interpretation with contextual analysis
- Industrial quality inspection with visual reasoning
- Autonomous agent development requiring visual understanding
- Technical documentation processing

#### Deployment and Integration

The model supports flexible deployment through multiple frameworks:

- **Transformers Library:** Standard integration (requires version 4.37.2+)
- **LMDeploy:** Production-optimized deployment with RESTful API compatibility
- **Quantization Support:** BF16, FP16, and 8-bit quantized variants for efficiency
- **Multi-GPU Support:** Distributed inference for accelerated processing

#### Technical Considerations

The native multimodal pre-training approach distinguishes InternVL3 78B from models that adapt pre-trained language models to vision tasks. This methodology enables more cohesive cross-modal understanding, as evidenced by the model's ability to outperform text-only baselines while maintaining strong multimodal performance.

The V2PE and Pixel Unshuffle innovations reduce computational requirements for long visual sequences, making the model practical for applications requiring analysis of high-resolution images or extended video content. Test-time scaling with critic models provides an additional quality lever for accuracy-critical applications.


Advanced multimodal large language model (MLLM) 

InternVL3 78B


### Juggernaut XI v11: Advanced SDXL-Based Image Generation

Juggernaut XI v11 is a text-to-image generation model developed by RunDiffusion, built on the Stable Diffusion XL architecture. The model excels at converting natural language prompts into high-quality visual outputs with exceptional prompt adherence and significantly improved aesthetics across multiple domains including photography, cinematography, and landscape imagery.

#### Architecture and Training Approach

Unlike incremental updates, Juggernaut XI v11 underwent comprehensive retraining from scratch using GPT-4 Vision captioning technology. This ground-up approach enables more robust aesthetic improvements compared to derivative models built through fine-tuning alone.

The training methodology incorporated several key innovations:

- Significantly expanded and refined dataset with higher-quality source images
- Improved shot classification accuracy across full-body, portrait, and mid-shot categories
- Integration of RunDiffusion Photo technology for enhanced detail refinement
- GPT-4 Vision-powered captioning for more accurate prompt-image alignment

#### Key Capabilities

Juggernaut XI v11 demonstrates several distinguishing strengths:

- **Prompt Adherence:** Exceptional interpretation and execution of user intentions, accurately translating complex descriptions into visual outputs
- **Aesthetic Quality:** Massively improved overall visual quality compared to previous versions
- **Anatomical Accuracy:** Enhanced rendering of challenging elements including hands, eyes, faces, and compositional details
- **Prompting Flexibility:** Supports both natural language descriptions and tagging-style inputs for diverse user preferences
- **Text Generation:** Expanded capabilities for generating accurate text within images

#### Use Cases

The model excels across diverse image generation applications:

- Digital art and creative design projects
- Marketing materials and commercial graphics
- Product visualization and mockups
- Cinematic concept art and storyboarding
- Portrait and character generation
- Landscape and environmental imagery
- Social media content creation
- Photography-style image synthesis

#### Technical Considerations

The model's ground-up retraining approach distinguishes it from incremental fine-tuning strategies, potentially yielding more consistent improvements across diverse prompts and use cases. Users can leverage both natural language and tag-based prompting methodologies depending on their workflow preferences and desired level of control over generation parameters.


Amazing prompt adherence with massively improved aesthetics and enhanced text generation capability

Juggernaut XI v11


### Kimi K2 Instruct: Trillion-Parameter MoE Model

Kimi K2 Instruct is a Mixture-of-Experts language model developed by Moonshot AI featuring advanced agentic capabilities and specialized coding expertise. With an extended context window and strong tool-calling abilities, this model excels at autonomous software development tasks and complex multi-turn interactions.

This template defaults to 32k context for wider compatibility in search

#### Key Features

- **Agentic Intelligence** - Excels at autonomous decision-making and tool utilization with strong real-time function invocation capabilities
- **Coding Excellence** - Specialized in software engineering tasks with particular strength in frontend development and agent-based coding
- **Extended Context** - Operates with 256K token context window, doubled from previous version for longer documents and conversations
- **Tool Integration** - Native tool-calling capabilities enabling real-time function execution based on user requests
- **Modified MIT License** - Open source with commercial use permissions

#### Benchmark Performance

**Software Engineering:**
- SWE-Bench Verified: 69.2% accuracy
- SWE-Bench Multilingual: 55.9% accuracy
- Terminal-Bench: 44.5% accuracy

Results represent mean accuracy over five independent full-test-set runs with controlled evaluation conditions.

#### Use Cases

- Autonomous code generation and debugging
- Frontend development with focus on aesthetics and practicality
- Agent-based software development workflows
- Complex multi-turn technical conversations
- Long-document analysis and retrieval
- Real-time tool integration for development tasks
- Multi-step coding projects requiring planning and execution
- Technical documentation generation and analysis

#### Agentic Architecture

Kimi K2 Instruct's primary strength lies in its agentic capabilities—the ability to autonomously make decisions and utilize tools to accomplish complex tasks. The model can invoke functions in real-time based on user requests, enabling sophisticated workflows where the model independently selects and executes appropriate tools.

This agentic intelligence makes the model particularly effective for software development tasks that require multiple steps, tool integration, and autonomous problem-solving.

#### Extended Context Processing

The model's 256K token context window—doubled from the previous 128K version—enables handling of extensive codebases, lengthy technical documents, and complex multi-turn conversations. This extended context is crucial for software development tasks that require understanding large amounts of code or maintaining coherence across long interactions.

#### Mixture-of-Experts Architecture

Kimi K2 Instruct employs a Mixture-of-Experts architecture with 61 layers, 384 expert modules, and Modified Linear Attention (MLA) mechanism. This architecture enables efficient processing while maintaining high performance across diverse tasks.

Deploy Kimi K2 Instruct on Vast.ai to leverage advanced agentic coding capabilities with extended context processing for autonomous software development and complex technical tasks.


Open-source trillion-parameter MoE AI model

Kimi K2 Instruct 0905


## Overview

Kimi K2 Thinking represents Moonshot AI's latest advancement in open-source reasoning models, building on the capabilities of its predecessor with an enhanced deep-thinking architecture. The model combines step-by-step reasoning with dynamic tool invocation, creating an agent-like interface designed for complex problem-solving tasks that require sustained cognitive processing.

Released under a Modified MIT License, Kimi K2 Thinking supports both commercial and research applications, making advanced reasoning capabilities accessible to a wide range of users and organizations.

## Key Features

### Advanced Reasoning Architecture

Kimi K2 Thinking interleaves chain-of-thought reasoning with function calls, enabling autonomous workflows that can span hundreds of sequential steps without performance degradation. This architecture allows the model to maintain coherent behavior across 200-300 consecutive tool invocations, substantially exceeding earlier models that typically degrade after 30-50 calls.

### Optimized Performance Through Quantization

The model features native INT4 quantization achieved through Quantization-Aware Training (QAT), providing approximately 2x faster generation speed without sacrificing performance quality. This optimization makes the model more efficient while maintaining the accuracy and reliability required for complex reasoning tasks.

### Mixture-of-Experts Architecture

Built on a Mixture-of-Experts (MoE) architecture, Kimi K2 Thinking employs 1 trillion total parameters with 32 billion active parameters per inference. The model utilizes 384 experts, selecting 8 per token, distributed across 61 layers including one dense layer. This efficient design enables powerful reasoning capabilities while maintaining computational efficiency.

### Extended Context Understanding

With a context window of 256,000 tokens and a vocabulary of 160,000 tokens, Kimi K2 Thinking can process and reason over extensive documents, long-form content, and complex multi-turn conversations. The model uses Multi-head Latent Attention (MLA) mechanisms to effectively manage this large context window.

## Benchmark Performance

### Reasoning Tasks

Kimi K2 Thinking demonstrates exceptional performance on challenging reasoning benchmarks:

- **HLE (with tools)**: Achieves scores ranging from 44.9 to 51.0, showcasing strong logical reasoning capabilities when augmented with tool access
- **AIME25 (with Python)**: Scores between 99.1 and 100.0 on this advanced mathematics competition benchmark
- **HMMT25 (with Python)**: Achieves 95.1 to 97.5 on the Harvard-MIT Mathematics Tournament problems

### Agentic Search Performance

The model excels at autonomous search and information retrieval tasks:

- **BrowseComp**: 60.2 score on English web browsing comprehension
- **BrowseComp-ZH**: 62.3 score on Chinese web browsing comprehension
- **Seal-0**: 56.3 on search-enhanced language tasks

### Coding Capabilities

Kimi K2 Thinking shows strong performance on software engineering benchmarks:

- **SWE-bench Verified**: 71.3 score on real-world software engineering tasks
- **LiveCodeBenchV6**: 83.1 on live coding challenges

## Use Cases

### Autonomous Research

The model's ability to maintain coherent reasoning across hundreds of sequential steps makes it ideal for autonomous research tasks that require iterative information gathering, analysis, and synthesis. The extended agency duration allows it to conduct comprehensive investigations without losing track of the overall objective.

### Complex Coding Projects

With strong performance on software engineering benchmarks, Kimi K2 Thinking excels at understanding codebases, debugging complex issues, and implementing multi-step solutions. The model's reasoning capabilities enable it to break down complex programming challenges into manageable steps.

### Extended Writing Projects

The large context window and sustained reasoning capabilities make the model well-suited for long-form content creation, technical documentation, and structured writing projects that require maintaining consistency and coherence across thousands of tokens.

### Problem-Solving with Tool Integration

The model's architecture enables it to seamlessly integrate reasoning with tool calls, making it effective for tasks that require both analytical thinking and practical execution. This includes data analysis workflows, computational problem-solving, and tasks requiring web search or API interactions.

## Training Approach

Kimi K2 Thinking incorporates Quantization-Aware Training (QAT) directly into its training process, enabling native INT4 quantization without the quality degradation typically associated with post-training quantization. This approach allows the model to maintain high performance while operating with improved efficiency.

The model's training focused on developing extended reasoning chains and tool integration capabilities, enabling the agent-like behavior that distinguishes it from traditional language models. The recommended operating temperature for inference is 1.0, optimizing the balance between creativity and consistency in the model's outputs.


Open-source trillion-parameter MoE AI model with thinking

Kimi K2 Thinking


### Llama 4 Maverick 17B 128E Instruct: Natively Multimodal AI

Llama 4 Maverick is a natively multimodal AI model featuring a mixture-of-experts (MoE) architecture with 17 billion activated parameters distributed across 128 total experts. Released by Meta in April 2025, this model represents a significant advancement in the Llama ecosystem by combining text and image understanding capabilities within a unified architecture.

#### Architecture and Design

The model employs an auto-regressive language architecture with mixture-of-experts and early fusion for native multimodality. This design enables seamless processing of both text and visual inputs without requiring separate encoding pipelines. The model supports an extensive 10 million token context length and can process up to 5 input images simultaneously.

Trained on approximately 22 trillion tokens from publicly available sources, licensed datasets, and Meta products/services data, the model incorporates knowledge through August 2024. Training consumed 2.38 million GPU hours on H100-80GB hardware, with releases available in both BF16 and FP8 quantization formats.

#### Multilingual Capabilities

The model provides comprehensive multilingual support across 12 languages: Arabic, English, French, German, Hindi, Indonesian, Italian, Portuguese, Spanish, Tagalog, Thai, and Vietnamese. This enables deployment in diverse global contexts while maintaining consistent performance across linguistic boundaries.

#### Performance Benchmarks

Llama 4 Maverick demonstrates strong results across multiple evaluation domains:

- **Mathematical Reasoning:** 61.2 on MATH (exact match, majority@1)
- **General Knowledge:** 85.5 on MMLU
- **Code Generation:** 77.6 on MBPP (pass@1)
- **Document Understanding:** 91.6 ANLS on DocVQA
- **Chart Interpretation:** 85.3 accuracy on ChartQA
- **Advanced Reasoning:** 69.8 accuracy on GPQA Diamond

These results reflect the model's versatility in handling both traditional language tasks and advanced visual reasoning challenges.

#### Use Cases

The model excels in applications requiring multimodal understanding:

- Assistant-like conversational experiences combining text and visual context
- Visual reasoning and logical inference from images
- Image captioning and detailed description generation
- Document analysis and information extraction from visual materials
- Chart and diagram interpretation for data analysis
- Multilingual content understanding across supported languages

#### Training Philosophy

Llama 4 Maverick emphasizes improved system prompt steerability, allowing developers greater control over model behavior. The model exhibits reduced false refusals to benign queries while maintaining comprehensive safety fine-tuning. This balance enables more natural conversational tones while preserving flexibility for application-specific customization.


The Llama 4 collection of models are natively multimodal AI models that enable text and multimodal experiences

Llama 4 Maverick 17B 128E Instruct


### Llama 4 Scout 17B 16E Instruct: Efficient Multimodal Intelligence

Llama 4 Scout represents Meta's efficiency-focused entry in the Llama 4 series, combining natively multimodal capabilities with practical deployability. Released in April 2025, this model employs a mixture-of-experts (MoE) architecture with 17 billion activated parameters distributed across 16 experts, totaling 109 billion parameters. Scout achieves competitive performance while maintaining substantially lower computational requirements than its larger sibling, Maverick.

#### Architecture and Efficiency Design

The model leverages early fusion for native multimodality within its MoE architecture, enabling integrated text-image understanding without separate encoding pipelines. A defining characteristic of Scout is its deployment efficiency: the model can fit within a single H100 GPU using on-the-fly int4 quantization, making it significantly more accessible for production environments.

Trained on approximately 40 trillion tokens from publicly available sources, licensed datasets, and Meta products/services data, Scout incorporates knowledge through August 2024. Training consumed 5.0 million GPU hours on H100-80GB hardware, with releases available in BF16 format and int4 quantization support. The model supports a 10 million token context window and can process up to 5 input images simultaneously.

#### Multilingual Capabilities

Scout provides comprehensive multilingual support across 12 languages: Arabic, English, French, German, Hindi, Indonesian, Italian, Portuguese, Spanish, Tagalog, Thai, and Vietnamese. This enables consistent performance across diverse linguistic contexts while maintaining the model's efficiency advantages.

#### Performance Benchmarks

Scout demonstrates competitive results across multiple evaluation domains:

**Pre-trained Model Performance:**
- **General Knowledge:** 79.6 on MMLU (comparable to Llama 3.1 70B at 79.3)
- **Mathematical Reasoning:** 50.3 on MATH
- **Code Generation:** 67.8 on MBPP (pass rate)
- **Chart Interpretation:** 83.4 accuracy on ChartQA

**Instruction-Tuned Performance:**
- **Advanced Reasoning:** 74.3 accuracy on MMLU Pro
- **Expert-Level Science:** 57.2 accuracy on GPQA Diamond
- **Document Understanding:** 94.4 ANLS on DocVQA

These results reflect Scout's balance between performance and computational efficiency, making it suitable for applications where resource constraints matter.

#### Use Cases

The model excels in applications requiring multimodal understanding with deployment efficiency:

- Assistant-like conversational experiences combining text and visual context
- Visual reasoning and logical inference from images
- Document analysis and information extraction from visual materials
- Chart and diagram interpretation for data analysis
- Code generation with multilingual support
- Production deployments requiring efficient resource utilization

#### Safety and Safeguards

Meta implements comprehensive risk mitigation through three approaches: fine-tuning that emphasizes natural refusal tones while reducing false rejections, system-level protections including Llama Guard, Prompt Guard, and Code Shield, and extensive red teaming focused on CBRNE proliferation, child safety, and cyber attack enablement. The model's system prompt emphasizes conversational tone while avoiding preachy or templated language patterns.

#### Technical Integration

Scout integrates with the transformers library (version 4.51.0+) using flex_attention for optimal performance. The model's implementation demonstrates straightforward integration into existing workflows, with support for both standard BF16 inference and efficient int4 quantization for resource-constrained environments.


Llama 4 Scout 17B 16E Instruct


### LTX Video: Real-Time DiT-Based Video Generation

LTX Video, developed by Lightricks, represents a breakthrough in video synthesis technology as the first Diffusion Transformer (DiT)-based video generation model capable of producing high-quality videos in real-time. The model generates 30 FPS videos at 1216×704 resolution faster than playback speed, marking a significant advancement in computational efficiency for video generation systems.

#### Architecture and Design

The model employs a Diffusion Transformer architecture trained on large-scale video datasets. Multiple model variants provide flexibility for different deployment scenarios:

- **13B Models:** Dev and distilled variants deliver highest quality output for demanding applications
- **2B Models:** Lighter computational requirements enable broader hardware accessibility
- **FP8 Quantized Versions:** Reduced memory footprint for resource-constrained environments

All versions support resolutions divisible by 32 and frame counts divisible by 8+1, with a recommended maximum of 257 frames. The architecture operates optimally under 720×1280 resolution.

#### Key Capabilities

LTX Video supports multiple conditioning modes for diverse creative workflows:

- **Image-to-Video Generation:** Converts static images into dynamic video sequences with natural motion
- **Video-to-Video Conditioning:** Extends or modifies existing video segments with temporal consistency
- **Multi-Condition Support:** Accepts multiple images or video clips with specified target frame ranges
- **Flexible Resolution:** Adapts to various aspect ratios and resolutions within architectural constraints
- **Real-Time Inference:** The distilled 2B variant achieves 15× faster processing with real-time capable speeds

#### Performance and Optimization

Quality scales with model size—the 13B dev version provides superior results but demands greater computational resources, while the distilled 2B variant balances quality with inference speed. The distillation process reduces required diffusion steps while maintaining competitive output quality, enabling practical real-time generation workflows.

FP8 quantization further reduces memory requirements without substantial quality degradation, making high-quality video generation accessible on consumer hardware.

#### Use Cases

LTX Video excels in applications requiring rapid video synthesis:

- Marketing and advertising video content generation
- Social media short-form video creation
- Product visualization with motion and animation
- Cinematic concept previsualization and storyboarding
- Educational and tutorial video production
- Video editing and enhancement workflows
- Game cinematics and cutscene generation
- Rapid prototyping of video concepts

#### Integration and Deployment

The model integrates with multiple platforms and frameworks, enabling flexible deployment:

- LTX-Studio for integrated creative workflows
- Fal.ai and Replicate for cloud-based inference
- ComfyUI for node-based video generation pipelines
- Hugging Face Diffusers library for custom integration

This broad platform support enables developers and creators to incorporate LTX Video into existing workflows with minimal friction.

#### Technical Considerations

Real-time generation capabilities make LTX Video particularly valuable for interactive applications requiring immediate feedback. The multi-variant architecture allows users to select the appropriate balance between quality and computational efficiency based on specific use case requirements and available hardware resources.


LTX-Video is the first DiT-based video generation model capable of generating high-quality videos in real-time

LTX Video


### Mochi 1 Preview: State-of-the-Art Open Video Generation

Mochi 1 Preview is an open state-of-the-art video generation model developed by Genmo, featuring high-fidelity motion synthesis and strong prompt adherence. As the largest openly released video generative model at 10 billion parameters, Mochi 1 represents a significant advancement in democratizing professional-quality video generation technology through its Apache 2.0 license.

#### Architecture and Design

The system employs an innovative asymmetric architecture comprising two specialized components:

**AsymmDiT (Asymmetric Diffusion Transformer):**
- 10 billion parameter model representing the largest open video generation system
- 48 transformer layers with 24 attention heads
- Asymmetric design allocates nearly 4× more parameters to visual processing (3,072 dimensions) than text encoding (1,536 dimensions)
- Processes 44,520 visual tokens and 256 text tokens for comprehensive scene understanding

**AsymmVAE (Video Encoder):**
- 362 million parameter autoencoder
- Achieves 128× compression through 8× spatial and 6× temporal reduction
- Encodes video data into efficient 12-channel latent space representation

The architecture employs a simplified prompt encoding approach using a single T5-XXL language model, departing from complex multi-encoder systems while maintaining strong prompt adherence.

#### Key Capabilities

Mochi 1 excels in photorealistic video generation with several distinguishing strengths:

- **High-Fidelity Motion:** Generates realistic movement and temporal dynamics across diverse scenarios
- **Strong Prompt Adherence:** Accurately interprets and executes complex textual descriptions
- **Photorealistic Quality:** Specializes in realistic rendering suitable for professional applications
- **Simplified Architecture:** Single-encoder approach reduces complexity while maintaining quality
- **Open Access:** Apache 2.0 license enables unrestricted research and commercial use

#### Performance and Deployment

Multiple deployment configurations accommodate different hardware scenarios:

- **Single GPU:** Requires approximately 60GB VRAM (H100 recommended for optimal performance)
- **Multi-GPU:** Supports distributed inference for accelerated generation
- **Memory-Efficient Variants:** bf16 precision reduces requirements to approximately 22GB VRAM

The model ships with multiple interfaces for flexible integration:

- Gradio UI for interactive exploration
- Command-line interface for batch processing
- Programmatic API for custom workflows
- Diffusers library integration for standardized deployment

#### Current Limitations

The preview release acknowledges several constraints:

- Maximum 480p resolution output
- Occasional visual distortions during extreme motion sequences
- Suboptimal performance with animated or non-photorealistic content styles

These limitations reflect the model's specialization in photorealistic generation and provide opportunities for future architectural refinements.

#### Use Cases

Mochi 1 Preview excels in applications requiring photorealistic video synthesis:

- Marketing and advertising video content
- Product demonstrations with realistic motion
- Cinematic previsualization and concept development
- Educational and tutorial video generation
- Social media content creation
- Video editing and enhancement workflows
- Research in video generation techniques
- Prototyping for film and media production

#### Technical Considerations

The asymmetric architecture's heavy visual parameter allocation reflects the computational demands of high-fidelity motion synthesis. Users should expect optimal results with photorealistic prompts, while animated or stylized requests may require prompt engineering or post-processing refinement.

The simplified single-encoder approach reduces deployment complexity compared to multi-encoder systems, potentially easing integration into existing creative pipelines while maintaining competitive prompt adherence.


Mochi 1 preview is an open state-of-the-art video generation model with high-fidelity motion and strong prompt adherence in preliminary evaluation.

Mochi 1 Preview


### Qwen Image: Foundation Model for Text Rendering and Image Editing

Qwen Image is an image generation foundation model within the Qwen ecosystem, launched in August 2025. The model distinguishes itself through significant advances in complex text rendering and precise image editing capabilities, with exceptional performance in Chinese character rendering—addressing a capability gap that most competing models underserve in multilingual image generation.

#### Architecture and Design

Built on the Diffusers library framework, Qwen Image employs a comprehensive architecture that integrates multiple visual intelligence capabilities beyond traditional text-to-image generation. The system supports flexible aspect ratios (1:1, 16:9, 9:16, 4:3, 3:4, 3:2, 2:3) and deploys efficiently across GPU (bfloat16) and CPU (float32) configurations.

Standard inference configuration utilizes 50 steps with a true_cfg_scale of 4.0, balancing generation quality with computational efficiency.

#### Text Rendering Excellence

A defining capability is the model's exceptional typographic accuracy across diverse scripts, from alphabetic languages to logographic Chinese characters. Unlike simple text overlay approaches that treat text as a post-processing step, Qwen Image seamlessly integrates text into visual compositions while preserving layout coherence and contextual harmony.

This capability makes the model particularly valuable for applications requiring accurate multilingual text within generated imagery, especially for Chinese language content where most competing models struggle with character complexity and stroke accuracy.

#### Image Editing Capabilities

Beyond generation, Qwen Image functions as a comprehensive foundation model for intelligent visual creation and manipulation. The system supports advanced operations including:

- Style transfer across artistic and photographic domains
- Object insertion and removal with contextual awareness
- Detail enhancement and refinement
- Text editing within existing images
- Human pose manipulation and adjustment
- Precise compositional modifications

#### Visual Understanding Integration

The architecture incorporates broad image comprehension tasks enabling sophisticated editing capabilities:

- Object detection and localization
- Semantic segmentation for precise region control
- Depth and edge estimation for realistic modifications
- Novel view synthesis for 3D-aware generation
- Super-resolution capabilities for detail enhancement

#### Use Cases

Qwen Image excels in applications requiring sophisticated text and editing capabilities:

- Multilingual marketing materials requiring accurate Chinese text rendering
- Product visualization with integrated textual elements
- Poster and banner design with complex typography
- Image editing and enhancement workflows
- Style transfer and artistic adaptation
- Content localization for international markets
- E-commerce product imagery with text overlays
- Social media content with multilingual text

#### Community and Ecosystem

The model has achieved substantial adoption with nearly 201,000 monthly downloads. A vibrant ecosystem has emerged including 383 adapters for specialized tasks, 46 fine-tuned variants, 14 quantizations for deployment flexibility, and 100+ community Spaces demonstrating diverse applications.

#### Technical Considerations

The model's Apache 2.0 license enables unrestricted commercial and research applications. Its multilingual text rendering capabilities, particularly for Chinese characters, position it as a specialized solution for content creators requiring accurate typographic integration in generated imagery—a capability that remains challenging for most general-purpose image generation models.


Foundation model in the Qwen series that achieves significant advances in complex text rendering and precise image editing

Qwen Image (FP8)


### Qwen3 235B A22B Thinking 2507: Advanced Reasoning Language Model

Qwen3 235B A22B Thinking 2507 is a mixture-of-experts (MoE) language model specifically designed for extended reasoning tasks. With 235 billion total parameters and 22 billion activated parameters per token, this model represents Alibaba's approach to transparent reasoning processes in large language models.

#### Architecture and Thinking Design

The model employs a distinctive architecture featuring 94 layers with 128 total experts, activating 8 experts per token. A defining characteristic is its mandatory thinking mode: the model automatically includes reasoning tokens in all outputs through an enforced `<think>` tag in the chat template. This design makes the model's internal reasoning process visible, enabling users to understand how conclusions are reached.

The architecture incorporates group query attention with 64 query heads and 4 key-value heads, optimizing the balance between computational efficiency and reasoning capability. The model natively supports a context length of 262,144 tokens, expandable to 1 million tokens with specialized configuration.

#### Long-Context Processing

Qwen3 235B Thinking implements dual chunk attention and MInference sparse attention mechanisms for efficient processing of ultra-long sequences. These optimizations deliver up to a 3× speedup compared to standard attention implementations, making extended reasoning over large documents practical for production environments.

#### Performance Benchmarks

The model achieves state-of-the-art results among open-source thinking models across multiple reasoning domains:

- **Mathematics:** 92.3% on AIME25
- **Scientific Reasoning:** 83.9% on HMMT25
- **Code Generation:** 74.1% on LiveCodeBench
- **Academic Knowledge:** 84.4% on MMLU-Pro

These results reflect the model's particular strength in tasks requiring multi-step reasoning and complex problem-solving.

#### Multi-modal Agent Capabilities

Beyond pure reasoning, the model features enhanced tool-calling functionality optimized for agentic workflows. Integration with the Qwen-Agent framework enables the model to function as an orchestration layer in multi-step agent applications, coordinating external tools and reasoning about action sequences.

#### Multilingual Support

The model demonstrates improved instruction-following and alignment capabilities across 81 languages, making it suitable for global deployment scenarios requiring consistent reasoning quality across linguistic boundaries.

#### Use Cases

The model excels in applications requiring transparent reasoning processes:

- Mathematical problem-solving with step-by-step explanations
- Scientific research assistance requiring logical inference
- Code generation with reasoning about implementation choices
- Multi-step planning in agentic systems
- Complex decision-making requiring auditable reasoning chains
- Educational applications where understanding the reasoning process is valuable
- Research tasks requiring long-context analysis

#### Technical Considerations

The model's thinking mode is mandatory and cannot be disabled. All outputs incorporate visible reasoning tokens, which increases token consumption compared to traditional language models. Applications should account for this characteristic when designing user experiences and managing computational costs.


Qwen3 235B A22B Thinking 2507


### Qwen3 Coder 480B A35B Instruct: Specialized Agentic Coding Model

Qwen3 Coder 480B A35B Instruct represents Alibaba's latest advancement in specialized code generation, employing a mixture-of-experts (MoE) architecture with 480 billion total parameters and 35 billion activated parameters. The model delivers performance comparable to leading proprietary models while introducing significant capabilities in agentic coding workflows and repository-scale understanding.

This template defaults to 32k context for wider compatibility in search

#### Architecture and Design

The model features 62 transformer layers with grouped query attention utilizing 96 query heads and 8 key-value heads. The MoE architecture incorporates 160 total experts, activating 8 per token to balance computational efficiency with coding expertise. Trained and deployed in BF16 precision, the model natively supports a 256,000 token context length, extendable to 1 million tokens using Yarn scaling techniques.

A defining characteristic is the model's direct code generation approach: unlike reasoning-focused variants, it operates exclusively in non-thinking mode and does not generate intermediate reasoning blocks. This design prioritizes immediate, actionable code output optimized for development workflows.

#### Agentic Coding Capabilities

The model demonstrates significant performance among open-source models on agentic coding tasks, including autonomous browser interaction and complex multi-step programming workflows. Native support for function calling with well-defined schemas enables seamless integration with external tools, APIs, and development environments. The model can orchestrate tool usage, reason about API interactions, and coordinate multi-step coding operations autonomously.

#### Repository-Scale Understanding

The extended context window facilitates comprehensive analysis of large codebases, enabling the model to maintain awareness across thousands of lines of code. This capability makes it practical for tasks requiring holistic understanding of project structure, dependencies, and architectural patterns.

#### Tool Integration

Qwen3 Coder demonstrates compatibility with multiple development platforms through standardized function call formatting:

- Qwen Code IDE integration for inline code generation
- CLINE development environment support
- Generic function calling interfaces for custom tooling

The model's tool-calling implementation uses structured schemas that enable type-safe interactions with external systems.

#### Performance Optimization

Optimal inference utilizes specific parameter configurations:

- Temperature: 0.7, Top-p: 0.8, Top-k: 20
- Repetition penalty: 1.05
- Maximum output tokens: 65,536 for comprehensive generation tasks

These settings balance creativity in code generation with consistency and correctness, while the extended output window accommodates substantial code artifacts.

#### Use Cases

The model excels in applications requiring sophisticated code generation and automation:

- Agentic coding systems requiring autonomous code writing and debugging
- Browser automation and web scraping with code generation
- Repository-scale refactoring and codebase analysis
- API integration and tool orchestration in development workflows
- Code generation for large-scale projects requiring contextual awareness
- Automated testing and validation code generation
- Documentation generation from existing codebases
- Multi-file code generation maintaining consistency across modules

#### Technical Considerations

The model's non-thinking mode makes it ideal for production environments requiring immediate code output without verbose reasoning steps. Applications can expect direct, actionable responses optimized for integration into automated development pipelines.


Qwen3 Coder 480B A35B Instruct


### Qwen3 VL 235B A22B Instruct: Flagship Vision-Language Model

Qwen3 VL 235B A22B Instruct represents the most powerful vision-language model in the Qwen series, combining 236 billion parameters through a hybrid dense and Mixture-of-Experts (MoE) architecture. The model delivers exceptional performance across visual understanding, agent applications, extended context processing, and multimodal reasoning tasks.

#### Architecture Innovations

Qwen3 VL introduces three significant architectural upgrades that distinguish it from previous vision-language systems:

**Interleaved-MRoPE:**
Distributes positional embeddings across temporal, width, and height dimensions to enhance extended video reasoning capabilities. This approach enables more sophisticated understanding of spatial and temporal relationships in visual content.

**DeepStack:**
Integrates multi-level visual transformer features to preserve fine-grained details throughout the processing pipeline. This innovation strengthens image-text alignment by maintaining visual information at multiple scales, enabling both detailed local analysis and global scene understanding.

**Text-Timestamp Alignment:**
Moves beyond traditional temporal embeddings to provide precise, timestamp-anchored event localization in video analysis. This capability enables accurate temporal grounding of events within long-form video content.

#### Key Capabilities

**Visual Understanding:**
The model excels at recognizing diverse visual content including celebrities, anime characters, products, landmarks, flora, fauna, and numerous other categories. Enhanced OCR capabilities support 32 languages (expanded from 19), enabling multilingual document processing and text recognition across diverse scripts.

**Agent Functions:**
Advanced agentic capabilities include:
- PC and mobile GUI navigation for automation tasks
- Visual coding generation producing Draw.io diagrams, HTML, CSS, and JavaScript from images
- Spatial perception for 2D and 3D grounding in robotic applications
- Tool usage coordination for autonomous agent workflows

**Extended Context Processing:**
Native 256,000 token context windows enable comprehensive analysis of lengthy documents and extended video content. The architecture supports expansion to 1 million tokens, facilitating complete book processing and multi-hour video analysis with full contextual recall.

**Multimodal Reasoning:**
Demonstrates particular strength in STEM and mathematical problem-solving through evidence-based causal analysis. The reasoning-enhanced capabilities enable step-by-step problem decomposition and systematic solution development.

#### Performance and Benchmarks

Qwen3 VL achieves competitive results across both multimodal and pure text benchmarks, demonstrating balanced performance that doesn't compromise language capabilities for visual understanding. The model's strong STEM reasoning performance reflects its architectural innovations in maintaining fine-grained visual details while processing complex logical relationships.

#### Use Cases

The model excels in applications requiring sophisticated multimodal intelligence:

- Visual question answering across diverse domains with specialized knowledge
- Long-form document analysis and information extraction
- Extended video content understanding and temporal event localization
- GUI automation for PC and mobile interfaces
- Visual code generation from mockups and wireframes
- Multilingual OCR and document processing across 32 languages
- Mathematical and scientific problem-solving with visual context
- Autonomous agent development requiring visual understanding
- 2D and 3D spatial reasoning for robotics applications
- Educational content analysis and tutoring
- Medical image interpretation with detailed reasoning
- Technical documentation processing with diagram understanding

#### Deployment Options

The model supports flexible deployment configurations:

- **Standard Instruct:** Optimized for general-purpose vision-language tasks
- **Thinking Edition:** Enhanced reasoning capabilities for complex analytical tasks
- **Context Scaling:** Native 256K with expansion to 1M tokens for extended content
- **Multi-GPU Support:** Distributed inference for production environments
- **Framework Integration:** Compatible with vLLM and standard inference frameworks

#### Technical Considerations

The hybrid dense-MoE architecture enables efficient scaling while maintaining quality across diverse task types. The 22B activated parameters per forward pass provide computational efficiency comparable to smaller models while leveraging the full 236B parameter capacity for specialized capabilities.

The Interleaved-MRoPE and DeepStack innovations specifically address challenges in long-form video understanding and fine-grained visual detail preservation—capabilities that distinguish Qwen3 VL from earlier vision-language systems. The text-timestamp alignment mechanism enables precise temporal grounding, making the model particularly valuable for applications requiring accurate event localization in video content.

The expanded 32-language OCR support addresses a critical gap in multilingual document processing, enabling consistent performance across diverse linguistic contexts. This capability, combined with extended context processing, makes the model suitable for international enterprise applications requiring document analysis across multiple languages.


Qwen3 VL 235B A22B Instruct


### RealVisXL V5.0: SDXL-Based Photorealistic Image Generation

RealVisXL V5.0 is a photorealistic text-to-image generation model built on the Stable Diffusion XL architecture. Developed by Evgeny, the model specializes in generating high-quality photorealistic imagery across diverse subjects and scenarios, with particular attention to anatomical accuracy and visual fidelity.

#### Architecture and Design

Built on the StableDiffusionXLPipeline architecture, RealVisXL V5.0 leverages the SDXL foundation to achieve photorealistic outputs. The model is distributed in Safetensors format for efficient loading and deployment, enabling rapid integration into existing workflows.

#### Key Capabilities

RealVisXL V5.0 excels in photorealistic generation with several optimization strategies:

- **Photorealistic Output:** Specializes in generating images with photographic quality and realistic lighting
- **Flexible Sampling:** Supports multiple sampling methods optimized for quality and efficiency
- **High-Resolution Enhancement:** Integrates with upscaling workflows using denoising strength of 0.1-0.3 and 1.1-1.5x upscale ratios
- **Quality Refinement:** Benefits from specific negative prompting strategies for anatomical and facial detail improvement

#### Recommended Inference Parameters

Optimal results are achieved with specific sampling configurations:

- **DPM++ SDE Karras:** 30+ steps for balanced quality and speed
- **DPM++ 2M Karras:** 50+ steps for maximum quality
- **Upscaling:** Denoising strength 0.1-0.3 with 1.1-1.5x ratios for detail enhancement

Users can employ negative prompts focusing on anatomical accuracy and facial refinements to enhance output quality, particularly for human subjects.

#### Use Cases

The model excels in applications requiring photorealistic image generation:

- Portrait photography and character generation
- Product visualization with photographic quality
- Architectural and interior visualization
- Marketing materials requiring realistic imagery
- Stock photography generation
- Concept visualization for film and media
- Fashion and lifestyle imagery
- Realistic scene composition

#### Community and Adoption

RealVisXL V5.0 demonstrates significant adoption within the generative AI ecosystem, with over 58,000 monthly downloads and 39 active Spaces implementations. The model has earned 115 community likes, reflecting its effectiveness for photorealistic generation tasks.

#### Technical Considerations

As an SDXL-based model, RealVisXL V5.0 benefits from the stability and quality characteristics of the Stable Diffusion XL architecture while specializing in photorealistic output. Users should experiment with sampling methods and negative prompting strategies to achieve optimal results for their specific use cases.


RealVisXL V5.0


### Stable Diffusion XL Base 1.0: Foundation for Latent Diffusion

Stable Diffusion XL Base 1.0 (SDXL) is a foundational text-to-image generation model developed by Stability AI that represents a significant architectural advancement through its ensemble of experts pipeline. The model combines a base generation system with specialized refinement capabilities, enabling substantially improved image quality compared to previous Stable Diffusion versions.

#### Architecture and Innovation

SDXL employs an ensemble of experts pipeline that marks a departure from previous single-model architectures. The system operates in two stages:

1. **Base Model:** Generates initial noisy latents from text prompts
2. **Refinement Module:** Processes latents during final denoising steps with specialized expertise

This two-stage approach allocates computational resources more efficiently, enabling higher quality outputs through focused expertise at different generation phases.

The system implements latent diffusion technology using two fixed, pretrained text encoders—OpenCLIP-ViT/G and CLIP-ViT/L—allowing comprehensive interpretation of complex textual prompts for accurate image generation.

#### Key Capabilities

SDXL demonstrates several distinguishing improvements over previous Stable Diffusion versions:

- **Enhanced Quality:** User preference studies show the base model substantially outperforms Stable Diffusion 1.5 and 2.1
- **Refinement Pipeline:** Optional refinement module achieves optimal results through specialized final processing
- **Flexible Workflows:** Supports standalone operation or SDEdit techniques for high-resolution enhancement
- **Complex Prompt Understanding:** Dual text encoder architecture enables sophisticated prompt interpretation
- **img2img Processing:** Alternative pipeline for high-resolution enhancement through iterative refinement

#### Use Cases

SDXL serves as a foundation for diverse image generation applications:

- Artistic creation and digital design
- Creative tool development and prototyping
- Educational applications for generative AI
- Research in generative model capabilities
- Safe deployment studies for content generation systems
- Foundation for specialized fine-tuned models
- Rapid concept visualization
- Creative exploration and experimentation

#### Technical Considerations

The developers acknowledge inherent limitations in the latent diffusion approach: the model cannot achieve perfect photorealism, struggles with accurate text rendering within images, faces compositional challenges in complex scenes, and produces slightly lossy outputs due to autoencoding architecture.

As with large-scale models trained on web data, SDXL may reflect patterns present in training data. Production deployments should implement appropriate content filtering and quality validation workflows.

#### Foundation for Ecosystem

SDXL has become a foundational architecture for numerous specialized models and fine-tunes, including photorealistic variants, artistic style adaptations, and domain-specific implementations. Its ensemble approach and architectural innovations enable downstream developers to build specialized models while benefiting from the base system's robust generation capabilities.


SDXL consists of an ensemble of experts pipeline for latent diffusion

Stable Diffusion XL Base 1.0


### Wan2.2 I2V A14B: MoE-Based Image-to-Video Generation

Wan2.2 I2V A14B is an open-source image-to-video generation model developed by Wan-AI that introduces a Mixture-of-Experts (MoE) architecture to video diffusion models. Supporting both 480P and 720P resolutions, the model delivers enhanced capability for complex motion generation and cinematic-quality outputs while maintaining computational efficiency.

#### Architecture: Dual-Expert MoE Design

The model employs an innovative dual-expert MoE framework that strategically separates the denoising process across timesteps. This architecture features:

**High-Noise Expert:**
- Handles early denoising stages during generation
- Focuses on overall layout, composition, and scene structure
- Establishes fundamental video characteristics

**Low-Noise Expert:**
- Manages later refinement stages
- Refines video details and aesthetic qualities
- Enhances realism and visual fidelity

**Efficiency Through Specialization:**
- 14B active parameters per inference step despite 27B total parameter count
- Automatic switching between experts based on signal-to-noise ratio (SNR) thresholds
- Computational efficiency comparable to smaller single-expert models

This architecture achieves more stable video synthesis with reduced unrealistic camera movements compared to traditional single-model approaches.

#### Training and Data Scale

Wan2.2 benefits from significantly expanded training data compared to previous versions:

- 65.6% increase in training images
- 83.2% increase in training videos
- Enhanced diversity in stylized scenes and aesthetic preferences
- Improved generalization across motion complexity levels

#### Key Capabilities

The model demonstrates several distinguishing strengths:

- **Image-to-Video Synthesis:** Converts static images into dynamic video sequences with natural motion
- **Optional Text Guidance:** Supports text prompts for directing video content and motion
- **Prompt Extension:** Enables image-only generation with automatic prompt derivation
- **Style Versatility:** Handles diverse aesthetic preferences from photorealistic to stylized
- **Consumer Hardware Compatibility:** Runs on RTX 4090 and comparable consumer GPUs
- **High Frame Rate:** Processes at 24 FPS for smooth high-definition output

#### Performance and Benchmarks

According to evaluation benchmarks, Wan2.2 I2V achieves superior performance against leading commercial models across multiple dimensions including motion quality, temporal consistency, and aesthetic fidelity. The dual-expert architecture's specialized processing stages contribute to reduced artifacts and more natural motion patterns.

#### Deployment Options

The model supports flexible deployment configurations:

- **Single-GPU Inference:** Model offloading enables deployment on consumer hardware
- **Multi-GPU Inference:** FSDP and DeepSize Ulysses support for accelerated generation
- **Framework Integration:** Compatible with Diffusers and ComfyUI workflows
- **Resolution Flexibility:** Supports both 480P and 720P output

#### Use Cases

Wan2.2 I2V excels in applications requiring image-to-video conversion:

- Product visualization with animated demonstrations
- Marketing content from static product photography
- Social media content enhancement
- Cinematic previsualization from concept art
- Video editing and enhancement workflows
- E-commerce product presentations with motion
- Educational content animation from diagrams
- Storyboard animation for film and media

#### Technical Considerations

The MoE architecture's separation of layout and refinement stages enables more stable generation compared to single-model approaches. The switching mechanism's SNR-based expert selection ensures appropriate processing intensity throughout the denoising pipeline, reducing computational waste while maintaining output quality.

The expanded training dataset contributes to improved handling of complex motion patterns and diverse aesthetic styles, making the model suitable for both photorealistic and stylized content generation.


Wan2.2 introduces a Mixture-of-Experts (MoE) architecture into image to video diffusion models

Wan2.2 I2V A14B (FP8)


### Wan2.2 T2V A14B: MoE-Based Text-to-Video Generation

Wan2.2 T2V A14B is an open-source text-to-video generation model developed by Wan-AI that introduces a Mixture-of-Experts (MoE) architecture to video diffusion systems. Released in July 2025, the model generates 5-second videos at both 480P and 720P resolutions with cinematic aesthetics and complex motion capabilities that surpass previous open-source and commercial models.

#### Architecture: Dual-Expert MoE Design

The model employs a novel two-expert system that strategically separates the video generation process:

**High-Noise Expert:**
- Handles early denoising stages of generation
- Focuses on overall layout and composition
- Establishes fundamental scene structure and motion patterns

**Low-Noise Expert:**
- Manages later refinement stages
- Refines video details and aesthetic qualities
- Enhances cinematic elements including lighting, contrast, and color tone

**Efficiency Through Specialization:**
- Approximately 27B total parameters with only 14B active per inference step
- Automatic expert switching via signal-to-noise ratio (SNR) thresholds
- Computational efficiency comparable to smaller dense models
- Reduced computational waste through targeted expert deployment

#### Training and Data Scale

Wan2.2 T2V benefits from significantly expanded training data:

- 65.6% increase in training images compared to Wan2.1
- 83.2% increase in training videos
- Enhanced diversity in motion types and semantic content
- Improved generalization across cinematic styles and aesthetics

This expanded dataset enables superior handling of complex motion patterns and diverse aesthetic preferences.

#### Key Capabilities

The model demonstrates several distinguishing strengths:

- **Cinematic Aesthetics:** Granular control over lighting, composition, contrast, and color tone
- **Complex Motion Generation:** Superior performance on Wan-Bench 2.0 evaluations against commercial systems
- **Multi-Resolution Support:** Generates both 480P and 720P outputs
- **Prompt Extension:** Integration with Qwen models or DashScope API for enhanced prompt elaboration
- **Consumer Hardware Compatibility:** Runs efficiently on RTX 4090 through model offloading
- **Parameter Optimization:** Supports parameter-type conversion for improved inference speed

#### Performance and Benchmarks

According to proprietary benchmarks, Wan2.2 T2V demonstrates superior performance compared to leading commercial video generation systems. The model excels particularly in complex motion scenarios where traditional single-expert architectures struggle with temporal consistency and realistic movement patterns.

The dual-expert MoE design contributes to reduced artifacts and more natural motion dynamics through specialized processing at appropriate denoising stages.

#### Deployment Options

The model supports flexible deployment configurations:

- **Single-GPU Inference:** Model offloading enables deployment on consumer hardware
- **Multi-GPU Inference:** Advanced optimization for accelerated generation
- **Framework Integration:** Compatible with standard video generation workflows
- **Resolution Flexibility:** Adapts between 480P and 720P based on quality-speed requirements

#### Use Cases

Wan2.2 T2V excels in applications requiring text-driven video synthesis:

- Professional video content creation for marketing and advertising
- AI-assisted filmmaking and commercial production
- Cinematic previsualization and storyboarding
- Social media content generation
- Educational and tutorial video production
- Research applications in generative media
- Concept visualization for film and media industries
- Rapid prototyping of video concepts from text descriptions

#### Technical Considerations

The MoE architecture's separation of layout and refinement stages enables more stable generation compared to traditional single-model approaches. The SNR-based switching mechanism ensures appropriate processing intensity throughout the denoising pipeline, optimizing both quality and computational efficiency.

The model's focus on cinematic aesthetics makes it particularly suitable for professional content creation requiring granular control over visual characteristics. Users seeking stylized or artistic outputs will benefit from the expanded training dataset's diversity in aesthetic preferences.

#### Distinction from I2V Variant

While sharing identical MoE architecture principles, Wan2.2 T2V focuses exclusively on text-to-video generation from textual prompts. The complementary I2V variant (Wan2.2 I2V A14B) specializes in image-to-video synthesis, enabling conditional generation from static images. Both models leverage the same dual-expert design philosophy while optimizing for their respective input modalities.


Wan2.2 introduces a Mixture-of-Experts (MoE) architecture into text to video diffusion models

Audio-to-Text Transcription

Built for This

Models

ACE Step V1 3.5B

Dia 1.6B

Related Blogs

Related Guides

Start Building: Audio-to-Text Transcription Templates