Mistral Small 4 Just Dropped — Run It on Affordable H200s with Vast.ai

April 7, 2026
4 Min Read
By Team Vast

Mistral Small 4 is the first Mistral model to unify three previously separate families — instruct, reasoning (Magistral), and coding (Devstral) — into a single set of weights. It scores 78.0 on MMLU-Pro and 71.2 on GPQA Diamond, outperforms GPT-OSS 120B on LiveCodeBench while generating 20% fewer output tokens, and delivers 40% lower latency and 3x higher throughput than Mistral Small 3.

Under the hood, it's a 119B-parameter mixture-of-experts model that activates just 6.5B parameters per token. It includes a Pixtral vision encoder, a 256K-token context window, native function calling, and a per-request reasoning_effort parameter to toggle between fast responses and deep chain-of-thought. Weights are stored in FP8 and it fits on 2x H200 141 GB GPUs with no quantization required. It's Apache 2.0 licensed.

This guide walks through deploying Mistral Small 4 on Vast.ai with vLLM.

Model Overview

| Property | Value |
|---|---|
| Developer | Mistral AI |
| Model | Mistral-Small-4-119B-2603 |
| Architecture | MoE with Multi-head Latent Attention (MLA), 128 experts, 4 active per token |
| Total Parameters | 119B |
| Active Parameters | 6.5B per token |
| Context Length | 256K tokens |
| Vision | Pixtral encoder |
| License | Apache 2.0 |
| HuggingFace | mistralai/Mistral-Small-4-119B-2603 |

Deploy Mistral Small 4 on Vast.ai with vLLM

The model weights are ~111 GB in FP8. With KV cache and overhead, you need 2x H200 141 GB (282 GB total) or 4x H100 80 GB. We'll use 2x H200 since they offer more headroom per card.
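As a quick sanity check on those numbers, here is a back-of-the-envelope estimate. It covers weights only; KV cache, activations, and CUDA overhead come on top, which is why the extra headroom matters:

```python
# Rough VRAM estimate for Mistral Small 4 in FP8 (1 byte per parameter).
# Weights only -- KV cache and runtime overhead are not included.

TOTAL_PARAMS = 119e9   # total parameters
BYTES_PER_PARAM = 1    # FP8 stores one byte per parameter

weights_gib = TOTAL_PARAMS * BYTES_PER_PARAM / 1024**3

print(f"Weights: ~{weights_gib:.0f} GiB")                            # ~111 GiB
print(f"Per GPU on 2x H200: ~{weights_gib / 2:.0f} of 141 GiB")      # ~55 GiB each
```

At roughly 55 GiB of weights per card, each H200 keeps ~85 GiB free for KV cache, which is what makes long contexts practical.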

Prerequisites

  • A Vast.ai account with credits
  • jq for parsing JSON responses
  • The Vast CLI and OpenAI Python client installed:
pip install --upgrade vastai openai
vastai set api-key <YOUR_API_KEY>

Generate an API key to secure your model endpoint:

export VLLM_API_KEY=$(openssl rand -hex 24)
echo $VLLM_API_KEY  # save this somewhere

Find a GPU

vastai search offers "gpu_name=H200 num_gpus=2 direct_port_count>=1 rentable=true disk_space>=500 cuda_vers>=12.2"
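If you'd rather pick an offer programmatically, the CLI's `--raw` flag (the same one used later for `show instance`) returns JSON you can sort by price. A sketch — the `dph_total` field name (total $/hr) is an assumption about the raw output; verify it against your CLI version:

```python
import json
import subprocess

def cheapest(offers):
    """Sort offers by total $/hr, assuming a 'dph_total' field in the raw JSON."""
    return sorted(offers, key=lambda o: o.get("dph_total", float("inf")))

if __name__ == "__main__":
    # --raw makes the CLI emit JSON instead of a formatted table
    out = subprocess.run(
        ["vastai", "search", "offers",
         "gpu_name=H200 num_gpus=2 direct_port_count>=1 rentable=true "
         "disk_space>=500 cuda_vers>=12.2",
         "--raw"],
        capture_output=True, text=True, check=True,
    ).stdout
    for offer in cheapest(json.loads(out))[:5]:
        print(offer["id"], offer.get("dph_total"))
```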

Launch the Instance

Mistral Small 4 requires a custom vLLM Docker image with fixes for tool calling and reasoning parsing. Upstream support is tracked in vLLM PR #37081.

Pick an offer ID from the search results and deploy:

vastai create instance <OFFER_ID> \
  --image mistralllm/vllm-ms4:latest \
  --env '-p 8000:8000' \
  --disk 500 \
  --onstart-cmd "vllm serve mistralai/Mistral-Small-4-119B-2603 \
    --host 0.0.0.0 --port 8000 \
    --tensor-parallel-size 2 \
    --max-model-len 131072 \
    --attention-backend FLASH_ATTN_MLA \
    --tool-call-parser mistral \
    --enable-auto-tool-choice \
    --reasoning-parser mistral \
    --api-key $VLLM_API_KEY"

The --reasoning-parser mistral flag separates the model's chain-of-thought reasoning from the final answer in API responses. --max-model-len 131072 sets a 128K context window as a practical default — the model supports up to 256K natively if you have the VRAM headroom.

Check progress with vastai logs <INSTANCE_ID>.
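Downloading ~111 GB of weights and warming up the engine takes a while. Besides tailing logs, you can poll the OpenAI-compatible `/v1/models` endpoint until the server answers — a sketch using only the standard library; substitute your instance's IP, port, and API key:

```python
import os
import time
import urllib.request

def wait_for_vllm(base_url, api_key, timeout=1800, interval=15):
    """Poll /v1/models until the server responds 200 or `timeout` seconds pass."""
    deadline = time.time() + timeout
    req = urllib.request.Request(
        f"{base_url}/v1/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(req, timeout=10) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            pass  # connection refused / model still loading
        time.sleep(interval)
    return False

if __name__ == "__main__":
    ready = wait_for_vllm("http://<IP>:<PORT>", os.environ["VLLM_API_KEY"])
    print("server ready" if ready else "timed out")
```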

Call the API

Find your instance's IP and port:

vastai show instance <INSTANCE_ID> --raw | jq -r '"\(.public_ipaddr):\(.ports["8000/tcp"][0].HostPort)"'

You can also find this in the Vast console.

Text Generation

from openai import OpenAI

client = OpenAI(
    base_url="http://<IP>:<PORT>/v1",
    api_key="<YOUR_VLLM_API_KEY>",
)

response = client.chat.completions.create(
    model="mistralai/Mistral-Small-4-119B-2603",
    messages=[
        {"role": "user", "content": "What are three advantages of mixture-of-experts architectures over dense transformers? Be concise."}
    ],
    max_tokens=512,
    temperature=0.1,
)

print(response.choices[0].message.content)

Vision

The built-in Pixtral encoder handles images alongside text — pass image URLs directly in the message content:

response = client.chat.completions.create(
    model="mistralai/Mistral-Small-4-119B-2603",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this chart. What benchmarks are being compared, and which model performs best?"},
                {"type": "image_url", "image_url": {"url": "https://cms.mistral.ai/assets/7d5b181d-776d-4406-aabc-b88414edf567.png"}},
            ],
        }
    ],
    max_tokens=1024,
    temperature=0.1,
)

print(response.choices[0].message.content)
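The example above uses a hosted image URL. For local files, OpenAI-compatible servers generally also accept base64 data URLs in the same `image_url` field — a small helper sketch (confirm the custom image supports data URLs before relying on it):

```python
import base64
from pathlib import Path

def to_data_url(path, mime="image/png"):
    """Encode a local image file as a data URL for an image_url content part."""
    b64 = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{b64}"

# Pass it exactly like a hosted URL:
# {"type": "image_url", "image_url": {"url": to_data_url("chart.png")}}
```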

Configurable Reasoning

Toggle between fast responses and step-by-step reasoning on a per-request basis with reasoning_effort:

response = client.chat.completions.create(
    model="mistralai/Mistral-Small-4-119B-2603",
    messages=[
        {"role": "user", "content": "Find all integers n such that n^2 + 2n + 4 is divisible by 7."}
    ],
    max_tokens=2048,
    temperature=0.7,
    extra_body={"reasoning_effort": "high"},
)

# The reasoning parser separates the trace from the final answer; depending on
# the vLLM version, the field may be named "reasoning" or "reasoning_content"
data = response.model_dump()
message = data["choices"][0]["message"]
thinking = message.get("reasoning") or message.get("reasoning_content")
answer = message["content"]

print("Thinking:", thinking[:500] if thinking else "N/A")
print("Answer:", answer)

Use reasoning_effort="none" for fast, everyday tasks. Use "high" for math, logic, and complex analysis.
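The launch command above enables auto tool choice with the Mistral tool-call parser, so native function calling works through the standard OpenAI `tools` interface. A minimal sketch — `get_gpu_price` is a made-up tool for illustration, not a real API:

```python
import json

# A hypothetical tool definition in OpenAI function-calling schema
tools = [{
    "type": "function",
    "function": {
        "name": "get_gpu_price",
        "description": "Look up the hourly rental price for a GPU model.",
        "parameters": {
            "type": "object",
            "properties": {
                "gpu_name": {"type": "string", "description": "e.g. H200"},
            },
            "required": ["gpu_name"],
        },
    },
}]

if __name__ == "__main__":
    from openai import OpenAI

    client = OpenAI(base_url="http://<IP>:<PORT>/v1", api_key="<YOUR_VLLM_API_KEY>")
    response = client.chat.completions.create(
        model="mistralai/Mistral-Small-4-119B-2603",
        messages=[{"role": "user", "content": "How much does an H200 cost per hour?"}],
        tools=tools,
    )
    # With auto tool choice, the model decides whether to emit a tool call
    call = response.choices[0].message.tool_calls[0]
    print(call.function.name, json.loads(call.function.arguments))
```

Your application executes the tool and sends the result back in a follow-up `role: "tool"` message for the model to compose its final answer.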

Cleanup

When you're done, destroy the instance to stop charges:

vastai destroy instance <INSTANCE_ID>

Conclusion

Mistral Small 4 packs 119B parameters into a model that activates 6.5B per token, unifying instruct, reasoning, and coding capabilities behind a single endpoint. With vision, a 256K context window, and native function calling, it covers a wide range of use cases. The FP8 weights (~111 GB) fit on 2x H200 with no quantization, an NVFP4 checkpoint (~66 GB) is available for tighter deployments, and a trained Eagle speculative decoding head offers faster generation.

Resources