Running Nemotron-Cascade-2 on Vast.ai

NVIDIA released Nemotron-Cascade-2-30B-A3B, a reasoning model that achieves gold-medal-level scores on international math and programming competitions. It does this with 30B total parameters and only 3B active per token - 20x fewer than competing models - outperforming both Qwen3.5-35B-A3B and larger open-weight alternatives.
The model fits on a single 80 GB GPU - it occupies about 60 GB in BF16 - with no quantization or multi-GPU setup required. It includes built-in chain-of-thought reasoning, tool-integrated reasoning for math and code, and an OpenAI-compatible API.
This guide walks through deploying the model on Vast.ai with vLLM. Vast.ai offers A100 and H100 GPUs at 3-5x less than traditional cloud providers, with no contracts or minimums.
Model Overview
| Property | Value |
| --- | --- |
| Developer | NVIDIA |
| Model | Nemotron-Cascade-2-30B-A3B |
| Architecture | Hybrid Mamba-Attention MoE |
| Total Parameters | 30B |
| Active Parameters | 3B per token |
| Training | Cascade RL post-training from Nemotron-3-Nano base |
| License | NVIDIA Open Model License |
| HuggingFace | nvidia/Nemotron-Cascade-2-30B-A3B |
Deploy Nemotron-Cascade-2 on Vast.ai with vLLM
The model requires a single 80 GB GPU (A100 or H100) with about 150 GB disk for weights and cache.
Prerequisites
- A Vast.ai account with credits
- The Vast CLI installed and configured:
pip install --upgrade vastai
vastai set api-key <YOUR_API_KEY>
Generate an API key to secure your model endpoint:
export NEMOTRON_API_KEY=$(openssl rand -hex 24)
echo $NEMOTRON_API_KEY # save this somewhere
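If you work across multiple terminal sessions, writing the key to a file keeps it consistent between the instance launch and later API calls. A minimal sketch (the file path is an arbitrary choice - pick your own):

```shell
# Generate the key once, then reuse it on subsequent runs
KEY_FILE="$HOME/.nemotron_api_key"        # example location
if [ ! -f "$KEY_FILE" ]; then
    openssl rand -hex 24 > "$KEY_FILE"    # 24 random bytes -> 48 hex chars
    chmod 600 "$KEY_FILE"                 # readable only by you
fi
export NEMOTRON_API_KEY=$(cat "$KEY_FILE")
```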
Find a GPU
Search for the cheapest single 80 GB GPU with direct port access:
# -o dph sorts results by price (dollars per hour, cheapest first)
vastai search offers "gpu_ram>=80 num_gpus=1 direct_port_count>=1 cuda_vers>=12.4 cuda_vers<13" -o dph
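The query string accepts additional filter fields if you want to narrow the results further - for example, capping the hourly price. The dph_total field name (total dollars per hour) and the 2.00 threshold below are assumptions to adapt; the sketch prints the command so you can inspect it before running:

```shell
# Same search with an hourly price cap added; dph_total is assumed to be
# the price filter field -- adjust the 2.00 threshold to your budget
QUERY='gpu_ram>=80 num_gpus=1 direct_port_count>=1 cuda_vers>=12.4 cuda_vers<13 dph_total<2.00'
echo vastai search offers "$QUERY" -o dph   # drop the leading echo to run it
```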
Launch the Instance
Pick an instance ID from the search results and deploy with vLLM:
vastai create instance <INSTANCE_ID> \
--image vllm/vllm-openai:latest \
--env '-p 8000:8000' \
--disk 150 \
--onstart-cmd "vllm serve nvidia/Nemotron-Cascade-2-30B-A3B \
--host 0.0.0.0 --port 8000 \
--max-model-len 32768 \
--api-key $NEMOTRON_API_KEY \
--trust-remote-code"
This pulls the model weights from HuggingFace and starts an OpenAI-compatible API server on port 8000. Note that $NEMOTRON_API_KEY is expanded by your local shell when you run the create command, so the key is baked into the startup command. Initial startup takes several minutes while the roughly 60 GB of weights download.
Once the instance is running, get your connection details:
vastai show instances --raw | jq '.[] | select(.id == <INSTANCE_ID>) | {id, actual_status, public_ipaddr, ports}'
You can also find the IP and port mapping in the Vast console.
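Once you have the raw JSON, you can assemble the endpoint URL with jq. The sketch below runs against an inline sample that mirrors the Docker-style port map in the --raw output - treat the exact field shape as an assumption and check it against your own instance's JSON:

```shell
# Build the base URL from the instance JSON; SAMPLE stands in for the
# object returned by "vastai show instances --raw" for your instance
SAMPLE='{"public_ipaddr": "203.0.113.7", "ports": {"8000/tcp": [{"HostIp": "0.0.0.0", "HostPort": "40123"}]}}'
VAST_IP=$(echo "$SAMPLE" | jq -r '.public_ipaddr')
PORT=$(echo "$SAMPLE" | jq -r '.ports["8000/tcp"][0].HostPort')
echo "http://$VAST_IP:$PORT"   # -> http://203.0.113.7:40123
```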
Call the API
Send a request to test the model:
curl http://<VAST_IP>:<PORT>/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $NEMOTRON_API_KEY" \
-d '{
"model": "nvidia/Nemotron-Cascade-2-30B-A3B",
"messages": [{"role": "user", "content": "Find all integers n such that n^2 + 2n + 4 is divisible by 7."}],
"max_tokens": 2048
}'
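Because the response follows the OpenAI chat-completion schema, the answer lives at .choices[0].message.content. A sketch of extracting just that field with jq - run here against a trimmed sample reply rather than a live server (the sample content is illustrative, not actual model output):

```shell
# Pipe the curl output through jq to keep only the assistant's text;
# RESPONSE stands in for a live reply, trimmed to the relevant fields
RESPONSE='{"choices": [{"message": {"role": "assistant", "content": "All n with n = 7k+1 or n = 7k+4."}}]}'
echo "$RESPONSE" | jq -r '.choices[0].message.content'
```

In practice you would append `| jq -r '.choices[0].message.content'` to the curl command above.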
The model thinks before it answers - you will see its step-by-step reasoning in the response, followed by the final answer. This chain-of-thought reasoning is what drives its competition-level performance on math and code.
Cleanup
When you're done, destroy the instance to stop charges:
vastai destroy instance <INSTANCE_ID>
Conclusion
Nemotron-Cascade-2 delivers reasoning performance that previously required models 20x its size, and it runs on a single GPU. That combination of capability and efficiency makes it a practical option for math, code, and general reasoning workloads. With Vast.ai, you can have it running in minutes at a fraction of what traditional cloud providers charge.


