Running Qwen 3.5 Medium Models on Vast.ai
Qwen 3.5 is Alibaba's latest model family, featuring a hybrid architecture that combines Gated DeltaNet (linear attention) with standard attention in a 3:1 ratio. This design enables dramatically faster inference — up to 8.6x at 32K context and 19x at 256K — while maintaining strong reasoning capabilities. The medium models released February 24, 2026 include:
| Model | Total Params | Active Params | Architecture | VRAM (BF16) |
|---|---|---|---|---|
| Qwen3.5-122B-A10B | 122B | 10B | MoE | ~244 GB |
| Qwen3.5-35B-A3B | 35B | 3B | MoE | ~66 GB |
| Qwen3.5-27B | 27B | 27B | Dense | ~54 GB |
The MoE models route each token through 8 of 256 experts plus 1 shared expert, so only a fraction of the total parameters are active per token. The 35B-A3B model activates just 3B parameters per token while having 35B total — giving you large-model quality at small-model inference cost. All three models are Apache 2.0 licensed, so no HuggingFace token is needed.
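To make the routing concrete, here is a minimal, illustrative sketch of top-k expert gating, using the numbers above (8 routed experts chosen out of 256 per token). This is a toy model of the mechanism, not Qwen's actual implementation:

```python
import math
import random

def topk_route(logits, k=8):
    """Pick the k highest-scoring experts for one token and softmax-normalize
    their gate weights. Mirrors the routing described above: of 256 routed
    experts, only 8 (plus 1 always-on shared expert) process each token."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = logits[top[0]]  # subtract the max for numerical stability
    exps = [math.exp(logits[i] - m) for i in top]
    total = sum(exps)
    return {i: e / total for i, e in zip(top, exps)}

random.seed(0)
router_logits = [random.gauss(0, 1) for _ in range(256)]  # one token's router scores
weights = topk_route(router_logits, k=8)

print(len(weights))                     # 8 experts selected for this token
print(round(sum(weights.values()), 6))  # gate weights sum to 1.0
```

Because only the selected experts' FFN weights are multiplied per token, compute scales with the 3B active parameters rather than the 35B total.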
In this guide, we'll deploy Qwen3.5-35B-A3B on Vast.ai using SGLang, which has day-0 Qwen3.5 support as of v0.5.9, on a single A100 80GB GPU.
Deploying on Vast.ai
Install the Vast.ai CLI and set your API key:
pip install --upgrade vastai
export VAST_API_KEY="YOUR_KEY_HERE"
vastai set api-key $VAST_API_KEY
Search for a suitable GPU. The model needs ~66 GB VRAM in BF16, so an 80 GB card works:
vastai search offers "gpu_ram>=80 num_gpus=1 direct_port_count>=1 rentable=true disk_space>=200 cuda_vers>=12.2" -o dph
Deploy the model using SGLang's Docker image. Replace YOUR_INSTANCE_ID with an offer ID from the search results:
vastai create instance YOUR_INSTANCE_ID \
--image lmsysorg/sglang:latest \
--env '-p 8000:8000' \
--disk 200 \
--onstart-cmd "python3 -m sglang.launch_server \
--model-path Qwen/Qwen3.5-35B-A3B \
--host 0.0.0.0 --port 8000 \
--tp-size 1 \
--context-length 32768 \
--reasoning-parser qwen3 \
--mem-fraction-static 0.85"
Key flags:
- `--reasoning-parser qwen3` enables the model's built-in thinking mode, separating reasoning from the final answer
- `--mem-fraction-static 0.85` reserves 85% of GPU memory for model weights and KV cache
- `--context-length 32768` sets a 32K context window (the model supports up to 262K natively)
The model will download (~70 GB) and load automatically. Check `vastai logs YOUR_INSTANCE_ID` — you'll see "The server is fired up and ready to roll!" when it's serving.
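Since the first start downloads ~70 GB of weights, you may want to poll the server rather than watch logs. Assuming the SGLang server exposes a plain `/health` endpoint (treat that path as an assumption and adjust if your version differs), a standard-library polling sketch looks like:

```python
import time
import urllib.error
import urllib.request

def wait_until_ready(base_url, timeout_s=1800, poll_s=10):
    """Poll base_url + /health until it answers HTTP 200 or we time out.
    Returns True once the server is serving, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; keep waiting
        time.sleep(poll_s)
    return False

# Example (fill in your instance's IP and mapped port):
# wait_until_ready("http://YOUR_IP:YOUR_PORT")
```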
Calling the Model
Find your instance's IP and port in the Instances tab of the Vast.ai console. Click the IP address button to see the port mapping for 8000/tcp.
Install the OpenAI SDK:
pip install --upgrade openai
Then call the model:
from openai import OpenAI
VAST_IP_ADDRESS = "YOUR_IP"
VAST_PORT = "YOUR_PORT"
client = OpenAI(
api_key="EMPTY",
base_url=f"http://{VAST_IP_ADDRESS}:{VAST_PORT}/v1"
)
response = client.chat.completions.create(
model="Qwen/Qwen3.5-35B-A3B",
messages=[
{"role": "user", "content": "What are three benefits of mixture-of-experts models? Be concise."}
],
max_tokens=1024,
temperature=0.7
)
print(response.choices[0].message.content)
Response
1. **Computational Efficiency:** Only a subset of parameters is activated per input,
reducing inference costs while maintaining a massive total parameter count.
2. **Enhanced Performance:** The larger effective model capacity allows for better
accuracy on complex and diverse tasks compared to dense models of similar compute.
3. **Specialization:** Individual experts can learn to handle specific input patterns
or domains, improving the model's robustness and adaptability.
With `--reasoning-parser qwen3`, the model's thinking process is automatically separated from the final answer. The thinking tokens appear in `reasoning_content` in the API response, while the clean answer appears in `content`.
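A small helper makes this split easy to consume. Note that `reasoning_content` may be absent or `None` when the model skips thinking, so the sketch below defaults both fields to empty strings (the stub object here stands in for a real API response):

```python
from types import SimpleNamespace

def split_reasoning(message):
    """Return (reasoning, answer) from a chat completion message object.
    reasoning_content holds the thinking tokens when the parser is enabled;
    content holds the clean final answer."""
    reasoning = getattr(message, "reasoning_content", None) or ""
    answer = message.content or ""
    return reasoning, answer

# With a live response: reasoning, answer = split_reasoning(response.choices[0].message)
demo = SimpleNamespace(reasoning_content="Let me think...", content="Final answer.")
print(split_reasoning(demo))  # ('Let me think...', 'Final answer.')
```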
Other Models
You can swap `Qwen/Qwen3.5-35B-A3B` for the other medium models:
- Qwen3.5-122B-A10B: Larger MoE model, 122B total / 10B active. At BF16 the full weights need ~244 GB, so you'll need multiple GPUs (e.g., 4× A100 80GB with `--tp-size 4`). Alternatively, GGUF quantizations are available; the Q4_K_M variant is ~70 GB and fits on a single 80GB card.
- Qwen3.5-27B: Dense model, all 27B parameters active. Needs ~54 GB VRAM. Fits on a single A100 or H100.
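The VRAM figures above follow from a simple rule of thumb: BF16 stores 2 bytes per parameter, so weights alone take roughly 2 GB per billion parameters (KV cache and activations need extra headroom on top). This is only an estimate — the table's ~66 GB figure for the 35B model is a bit under the ~70 GB this rule gives:

```python
def bf16_weight_gb(total_params_b):
    """Rough BF16 weight footprint: 2 bytes/param -> 2 GB per billion params."""
    return total_params_b * 2

for name, params_b in [("Qwen3.5-122B-A10B", 122),
                       ("Qwen3.5-35B-A3B", 35),
                       ("Qwen3.5-27B", 27)]:
    print(f"{name}: ~{bf16_weight_gb(params_b)} GB")
# Qwen3.5-122B-A10B: ~244 GB
# Qwen3.5-35B-A3B: ~70 GB
# Qwen3.5-27B: ~54 GB
```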
Conclusion
Qwen3.5-35B-A3B packs 35B parameters into a model that only activates 3B per token, and its hybrid DeltaNet architecture delivers substantially faster inference than standard transformers. With SGLang and a single A100 on Vast.ai, you can have it serving in under 10 minutes.
For production use, consider increasing `--context-length` — the model supports up to 262K natively, with 128K+ recommended for best reasoning performance. GGUF quantizations (Q4_K_M at ~22 GB) are also available for running on consumer GPUs like the RTX 4090.



