Running Qwen 3.5 Medium Models on Vast.ai
Qwen 3.5 is Alibaba's latest model family, featuring a hybrid architecture that combines Gated DeltaNet (linear attention) with standard attention in a 3:1 ratio. This design enables dramatically faster inference — up to 8.6x at 32K context and 19x at 256K — while maintaining strong reasoning capabilities. The medium models released February 24, 2026 include:
| Model | Total Params | Active Params | Architecture | VRAM (BF16) |
|---|---|---|---|---|
| Qwen3.5-122B-A10B | 122B | 10B | MoE | ~244 GB |
| Qwen3.5-35B-A3B | 35B | 3B | MoE | ~66 GB |
| Qwen3.5-27B | 27B | 27B | Dense | ~54 GB |
The MoE models route each token through 8 of 256 experts plus 1 shared expert, so only a fraction of the total parameters are active per token. The 35B-A3B model activates just 3B parameters per token while having 35B total — giving you large-model quality at small-model inference cost. All three models are Apache 2.0 licensed, so no HuggingFace token is needed.
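To make the routing concrete, here is a minimal, illustrative sketch of top-k expert gating, using the numbers above (8 routed experts chosen out of 256 per token). This is a toy model of the mechanism, not Qwen's actual implementation:

```python
import math
import random

def topk_route(logits, k=8):
    """Pick the k highest-scoring experts for one token and softmax-normalize
    their gate weights. Mirrors the routing described above: of 256 routed
    experts, only 8 (plus 1 always-on shared expert) process each token."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = logits[top[0]]  # subtract the max for numerical stability
    exps = [math.exp(logits[i] - m) for i in top]
    total = sum(exps)
    return {i: e / total for i, e in zip(top, exps)}

random.seed(0)
router_logits = [random.gauss(0, 1) for _ in range(256)]  # one token's router scores
weights = topk_route(router_logits, k=8)

print(len(weights))                     # 8 experts selected for this token
print(round(sum(weights.values()), 6))  # gate weights sum to 1.0
```

Because only the selected experts' FFN weights are multiplied per token, compute scales with the 3B active parameters rather than the 35B total.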
In this guide, we'll deploy Qwen3.5-35B-A3B on Vast.ai using SGLang, which has day-0 Qwen3.5 support as of v0.5.9, on a single A100 80GB GPU.
Deploying on Vast.ai
Install the Vast.ai CLI and set your API key:
pip install --upgrade vastai
export VAST_API_KEY="YOUR_KEY_HERE"
vastai set api-key $VAST_API_KEY
Search for a suitable GPU. The model needs ~66 GB VRAM in BF16, so an 80 GB card works:
vastai search offers "gpu_ram>=80 num_gpus=1 direct_port_count>=1 rentable=true disk_space>=200 cuda_vers>=12.2" -o dph
Deploy the model using SGLang's Docker image. Replace YOUR_INSTANCE_ID with an offer ID from the search results:
vastai create instance YOUR_INSTANCE_ID \
--image lmsysorg/sglang:latest \
--env '-p 8000:8000' \
--disk 200 \
--onstart-cmd "python3 -m sglang.launch_server \
--model-path Qwen/Qwen3.5-35B-A3B \
--host 0.0.0.0 --port 8000 \
--tp-size 1 \
--context-length 32768 \
--reasoning-parser qwen3 \
--mem-fraction-static 0.85"
Key flags:
- `--reasoning-parser qwen3` enables the model's built-in thinking mode, separating reasoning from the final answer
- `--mem-fraction-static 0.85` reserves 85% of GPU memory for model weights and KV cache
- `--context-length 32768` sets a 32K context window (the model supports up to 262K natively)
The model will download (~70 GB) and load automatically. Check `vastai logs YOUR_INSTANCE_ID` — you'll see "The server is fired up and ready to roll!" when it's serving.
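Since the first start downloads ~70 GB of weights, you may want to poll the server rather than watch logs. Assuming the SGLang server exposes a plain `/health` endpoint (treat that path as an assumption and adjust if your version differs), a standard-library polling sketch looks like:

```python
import time
import urllib.error
import urllib.request

def wait_until_ready(base_url, timeout_s=1800, poll_s=10):
    """Poll base_url + /health until it answers HTTP 200 or we time out.
    Returns True once the server is serving, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; keep waiting
        time.sleep(poll_s)
    return False

# Example (fill in your instance's IP and mapped port):
# wait_until_ready("http://YOUR_IP:YOUR_PORT")
```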
Calling the Model
Find your instance's IP and port in the Instances tab of the Vast.ai console. Click the IP address button to see the port mapping for 8000/tcp.
Install the OpenAI SDK:
pip install --upgrade openai
Then call the model:
from openai import OpenAI
VAST_IP_ADDRESS = "YOUR_IP"
VAST_PORT = "YOUR_PORT"
client = OpenAI(
api_key="EMPTY",
base_url=f"http://{VAST_IP_ADDRESS}:{VAST_PORT}/v1"
)
response = client.chat.completions.create(
model="Qwen/Qwen3.5-35B-A3B",
messages=[
{"role": "user", "content": "What are three benefits of mixture-of-experts models? Be concise."}
],
max_tokens=1024,
temperature=0.7
)
print(response.choices[0].message.content)
Response
1. **Computational Efficiency:** Only a subset of parameters is activated per input,
reducing inference costs while maintaining a massive total parameter count.
2. **Enhanced Performance:** The larger effective model capacity allows for better
accuracy on complex and diverse tasks compared to dense models of similar compute.
3. **Specialization:** Individual experts can learn to handle specific input patterns
or domains, improving the model's robustness and adaptability.
With `--reasoning-parser qwen3`, the model's thinking process is automatically separated from the final answer. The thinking tokens appear in `reasoning_content` in the API response, while the clean answer appears in `content`.
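A small helper makes this split easy to consume. Note that `reasoning_content` may be absent or `None` when the model skips thinking, so the sketch below defaults both fields to empty strings (the stub object here stands in for a real API response):

```python
from types import SimpleNamespace

def split_reasoning(message):
    """Return (reasoning, answer) from a chat completion message object.
    reasoning_content holds the thinking tokens when the parser is enabled;
    content holds the clean final answer."""
    reasoning = getattr(message, "reasoning_content", None) or ""
    answer = message.content or ""
    return reasoning, answer

# With a live response: reasoning, answer = split_reasoning(response.choices[0].message)
demo = SimpleNamespace(reasoning_content="Let me think...", content="Final answer.")
print(split_reasoning(demo))  # ('Let me think...', 'Final answer.')
```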
Other Models
You can swap `Qwen/Qwen3.5-35B-A3B` for the other medium models:
- Qwen3.5-122B-A10B: Larger MoE model, 122B total / 10B active. At BF16 the full weights need ~244 GB, so you'll need multiple GPUs (e.g., 4× A100 80GB with `--tp-size 4`). Alternatively, GGUF quantizations are available; the Q4_K_M variant is ~70 GB and fits on a single 80GB card.
- Qwen3.5-27B: Dense model, all 27B parameters active. Needs ~54 GB VRAM. Fits on a single A100 or H100.
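The VRAM figures above follow from a simple rule of thumb: BF16 stores 2 bytes per parameter, so weights alone take roughly 2 GB per billion parameters (KV cache and activations need extra headroom on top). This is only an estimate — the table's ~66 GB figure for the 35B model is a bit under the ~70 GB this rule gives:

```python
def bf16_weight_gb(total_params_b):
    """Rough BF16 weight footprint: 2 bytes/param -> 2 GB per billion params."""
    return total_params_b * 2

for name, params_b in [("Qwen3.5-122B-A10B", 122),
                       ("Qwen3.5-35B-A3B", 35),
                       ("Qwen3.5-27B", 27)]:
    print(f"{name}: ~{bf16_weight_gb(params_b)} GB")
# Qwen3.5-122B-A10B: ~244 GB
# Qwen3.5-35B-A3B: ~70 GB
# Qwen3.5-27B: ~54 GB
```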
Conclusion
Qwen3.5-35B-A3B packs 35B parameters into a model that only activates 3B per token, and its hybrid DeltaNet architecture delivers substantially faster inference than standard transformers. With SGLang and a single A100 on Vast.ai, you can have it serving in under 10 minutes.
For production use, consider increasing `--context-length` — the model supports up to 262K natively, with 128K+ recommended for best reasoning performance. GGUF quantizations (Q4_K_M at ~22 GB) are also available for running on consumer GPUs like the RTX 4090.



