Running Nemotron-Cascade-2 on Vast.ai

March 25, 2026
3 Min Read
By Team Vast

NVIDIA released Nemotron-Cascade-2-30B-A3B, a reasoning model that achieves gold-medal-level scores on international math and programming competitions. It does this with 30B total parameters and only 3B active per token - about 20x fewer active parameters than competing models - outperforming both Qwen3.5-35B-A3B and larger open-weight alternatives.

The model fits on a single 80 GB GPU, weighing in at about 60 GB in BF16, so no quantization or multi-GPU setup is required. It ships with built-in chain-of-thought reasoning, tool-integrated reasoning for math and code, and an OpenAI-compatible API.

This guide walks through deploying the model on Vast.ai with vLLM. Vast.ai offers A100 and H100 GPUs at 3-5x lower cost than traditional cloud providers, with no contracts or minimums.

Model Overview

| Property | Value |
| --- | --- |
| Developer | NVIDIA |
| Model | Nemotron-Cascade-2-30B-A3B |
| Architecture | Hybrid Mamba-Attention MoE |
| Total Parameters | 30B |
| Active Parameters | 3B per token |
| Training | Cascade RL post-training from Nemotron-3-Nano base |
| License | NVIDIA Open Model License |
| HuggingFace | nvidia/Nemotron-Cascade-2-30B-A3B |

Deploy Nemotron-Cascade-2 on Vast.ai with vLLM

The model requires a single 80 GB GPU (A100 or H100) with about 150 GB disk for weights and cache.

Prerequisites

  • A Vast.ai account with credits
  • The Vast CLI installed and configured:
pip install --upgrade vastai
vastai set api-key <YOUR_API_KEY>

Generate an API key to secure your model endpoint:

export NEMOTRON_API_KEY=$(openssl rand -hex 24)
echo $NEMOTRON_API_KEY  # save this somewhere

Find a GPU

Search for the cheapest single 80 GB GPU with direct port access:

# -o dph sorts results by price (dollars per hour, cheapest first)
vastai search offers "gpu_ram>=80 num_gpus=1 direct_port_count>=1 cuda_vers>=12.4 cuda_vers<13" -o dph
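If you prefer to script the selection, the CLI can emit JSON with `--raw`, and jq can pull out the cheapest offer's ID. A sketch, with a sample array standing in for the live output (the `id` and `dph_total` field names are assumptions about the CLI's JSON shape):

```shell
# Since -o dph sorts offers cheapest-first, the first array element is
# the best price. OFFERS_JSON stands in for the live output of:
#   vastai search offers "gpu_ram>=80 num_gpus=1 direct_port_count>=1 cuda_vers>=12.4 cuda_vers<13" -o dph --raw
OFFERS_JSON='[{"id":1234567,"dph_total":1.10},{"id":7654321,"dph_total":1.25}]'
OFFER_ID=$(echo "$OFFERS_JSON" | jq '.[0].id')
echo "$OFFER_ID"  # cheapest matching offer
```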

Launch the Instance

Pick an instance ID from the search results and deploy with vLLM:

vastai create instance <INSTANCE_ID> \
  --image vllm/vllm-openai:latest \
  --env '-p 8000:8000' \
  --disk 150 \
  --onstart-cmd "vllm serve nvidia/Nemotron-Cascade-2-30B-A3B \
    --host 0.0.0.0 --port 8000 \
    --max-model-len 32768 \
    --api-key $NEMOTRON_API_KEY \
    --trust-remote-code"

This pulls the model from HuggingFace and starts an OpenAI-compatible API server on port 8000. Initial startup takes several minutes while the model downloads.
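Rather than watching the logs by hand, you can wait for startup to finish by polling vLLM's `/v1/models` endpoint, which only answers once the model is loaded. A minimal sketch (the `wait_for_server` helper is our own, not part of the Vast CLI):

```shell
# Retry /v1/models until the server answers, with a bounded retry count.
# NEMOTRON_API_KEY is the key generated earlier.
wait_for_server() {
  local url="$1" tries="${2:-40}" i=0
  until curl -sf -H "Authorization: Bearer $NEMOTRON_API_KEY" "$url/v1/models" > /dev/null; do
    i=$((i + 1))
    if [ "$i" -ge "$tries" ]; then
      echo "server did not come up in time" >&2
      return 1
    fi
    sleep 15
  done
  echo "server is ready"
}

# Usage once you have the instance's address:
# wait_for_server "http://<VAST_IP>:<PORT>"
```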

Once the instance is running, get your connection details:

vastai show instances --raw | jq '.[] | select(.id == <INSTANCE_ID>) | {id, actual_status, public_ipaddr, ports}'

You can also find the IP and port mapping in the Vast console.

Call the API

Send a request to test the model:

curl http://<VAST_IP>:<PORT>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $NEMOTRON_API_KEY" \
  -d '{
    "model": "nvidia/Nemotron-Cascade-2-30B-A3B",
    "messages": [{"role": "user", "content": "Find all integers n such that n^2 + 2n + 4 is divisible by 7."}],
    "max_tokens": 2048
  }'

The model thinks before it answers - you will see its step-by-step reasoning in the response, followed by the final answer. This chain-of-thought reasoning is what drives its competition-level performance on math and code.
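If you only want the final answer, you can strip the reasoning out of the response with jq and sed. A sketch, assuming the reasoning arrives wrapped in `<think>...</think>` tags in the message content (the exact format depends on the chat template, and the JSON below is a hypothetical sample, not real model output):

```shell
# RESPONSE stands in for the JSON returned by the curl call above.
RESPONSE='{"choices":[{"message":{"role":"assistant","content":"<think>Work modulo 7: n^2+2n+4 = (n+1)^2+3, so we need (n+1)^2 = 4 (mod 7), i.e. n+1 = 2 or 5.</think>n = 1 or 4 (mod 7)"}}]}'

# Pull out the message content, then drop the <think>...</think> block.
ANSWER=$(echo "$RESPONSE" | jq -r '.choices[0].message.content' | sed 's/<think>.*<\/think>//')
echo "$ANSWER"  # → n = 1 or 4 (mod 7)
```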

Cleanup

When you're done, destroy the instance to stop charges:

vastai destroy instance <INSTANCE_ID>

Conclusion

Nemotron-Cascade-2 delivers reasoning performance that previously required models 20x its size, and it runs on a single GPU. That combination of capability and efficiency makes it a practical option for math, code, and general reasoning workloads. With Vast.ai, you can have it running in minutes at a fraction of what traditional cloud providers charge.

Resources