Liquid AI's LFM2 Just Dropped — Here's How to Run It on Vast.ai

March 6, 2026
3 Min Read
By Team Vast

Liquid AI just released LFM2-24B-A2B, a 24-billion-parameter foundation model that activates only 2.3 billion parameters per token. That ratio — less than 10% of total parameters active at inference time — makes it one of the most aggressively sparse mixture-of-experts models available today.

What makes LFM2 different from other MoE models isn't just the sparsity. The architecture is a hybrid design that replaces most of the attention layers with gated short convolution blocks. Out of 40 total layers, only 10 use grouped query attention — the other 30 are convolutional. This is a fundamentally different approach from transformer-only models, and it's what allows Liquid AI to push the active parameter count so low while maintaining competitive quality across benchmarks like MMLU-Pro, GPQA Diamond, IFEval, and MATH-500.
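The headline numbers are easy to sanity-check. A quick back-of-the-envelope sketch, using only the figures stated in the release (total/active parameters, layer mix, and 64-expert top-4 routing):

```python
# All figures come from the LFM2-24B-A2B announcement, not the model config.
total_params = 24e9        # 24B total parameters
active_params = 2.3e9      # 2.3B activated per token

active_ratio = active_params / total_params
print(f"Active fraction: {active_ratio:.1%}")   # ~9.6% — under 10%

# Layer mix: 40 layers total, only 10 of which use grouped query attention
conv_layers, attn_layers = 30, 10
assert conv_layers + attn_layers == 40
print(f"Attention layers: {attn_layers / (conv_layers + attn_layers):.0%}")  # 25%

# Expert routing: top-4 of 64 experts fire per token in each MoE layer
experts_total, experts_active = 64, 4
print(f"Experts active: {experts_active}/{experts_total} "
      f"({experts_active / experts_total:.1%})")
```

The 9.6% active fraction is what drives both the low per-token FLOP count and the small KV cache footprint discussed below.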

Liquid AI, an MIT spinoff, has scaled this architecture from 350M to 24B parameters following log-linear scaling laws. LFM2 is the first model in the family large enough to be broadly useful — and with day-one support for vLLM, SGLang, and llama.cpp, it's ready to deploy today.

Note: LFM2-24B-A2B is an early checkpoint trained on 17 trillion tokens without reinforcement learning or extensive post-training. A post-trained version (LFM2.5-24B-A2B) is coming.

Model Overview

Property            Value
Developer           Liquid AI
Model               LFM2-24B-A2B
Architecture        Hybrid MoE: 30 convolutional + 10 GQA attention layers, 64 experts, top-4 routing
Total Parameters    24B
Active Parameters   2.3B per token
Context Length      32,768 tokens
Training Data       17T tokens (mixed BF16/FP8)
License             LFM Open License v1.0
HuggingFace         LiquidAI/LFM2-24B-A2B

Deploy LFM2-24B-A2B on Vast.ai with vLLM

At BF16, the model weights are ~48 GB — it fits on a single A100-80GB with room for a 32K context KV cache. No quantization or multi-GPU setup needed.
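Here's a rough memory budget showing why a single 80 GB card is enough. The weight math follows directly from the parameter count; the KV-cache geometry (KV heads, head dimension) is an illustrative assumption, not taken from the model config:

```python
GiB = 1024**3

# Weights: 24B parameters at BF16 (2 bytes each).
# The ~48 GB figure in the text uses decimal GB; in GiB it's a bit lower.
weights_gib = 24e9 * 2 / GiB        # ~44.7 GiB

# KV cache: only the 10 attention layers store K/V — the 30 convolutional
# layers contribute nothing. The GQA geometry below is a HYPOTHETICAL
# example, not the published config.
attn_layers = 10
kv_heads, head_dim = 8, 128
bytes_per_token = 2 * attn_layers * kv_heads * head_dim * 2  # K+V, BF16
kv_cache_gib = bytes_per_token * 32_768 / GiB

print(f"Weights:  {weights_gib:.1f} GiB")
print(f"KV cache: {kv_cache_gib:.2f} GiB at the full 32K context")
print(f"Total:    {weights_gib + kv_cache_gib:.1f} GiB of 80 GiB")
```

Because only a quarter of the layers carry a KV cache, even a full 32K context costs on the order of a gigabyte under these assumptions — which is why no quantization is needed.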

Prerequisites

  • A Vast.ai account with credits
  • The Vast CLI installed and configured:
pip install --upgrade vastai  # make sure you're on v0.5.0+
vastai set api-key <YOUR_API_KEY>

Find a GPU

Search for a single GPU with at least 80 GB VRAM and direct port access:

vastai search offers "gpu_ram>=80 num_gpus=1 direct_port_count>=1 cuda_vers>=12.4 cuda_vers<13" -o dph

Important: Use a machine with CUDA 12.x (not 13.x). The vLLM nightly image is built against CUDA 12 and will fail with a driver mismatch on CUDA 13 machines.

Launch the Instance

Pick an instance ID from the search results and deploy with vLLM. LFM2 requires a newer version of the transformers library than what ships in the vLLM image, so we upgrade it at startup:

vastai create instance <INSTANCE_ID> \
  --image vllm/vllm-openai:nightly \
  --env "-p 8000:8000" \
  --disk 100 \
  --onstart-cmd "pip install 'transformers>=5.0.0' && \
    vllm serve LiquidAI/LFM2-24B-A2B \
    --host 0.0.0.0 --port 8000 \
    --max-model-len 32768 --dtype auto \
    --trust-remote-code"

This upgrades the transformers library, pulls the model from HuggingFace, and starts an OpenAI-compatible API server on port 8000. Initial startup takes several minutes while the model downloads (~48 GB for the BF16 weights).
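How long "several minutes" means depends mostly on the host's network link. A quick estimate for the ~48 GB of weights, with purely illustrative bandwidth figures:

```python
# Download-time estimate for ~48 GB of BF16 weights.
# Bandwidths are illustrative; real speed depends on the Vast.ai host.
weights_gb = 48
for mbps in (200, 1000, 5000):                 # megabits per second
    seconds = weights_gb * 8 * 1000 / mbps     # GB -> gigabits -> megabits
    print(f"{mbps:>5} Mbps: ~{seconds / 60:.0f} min")
```

On a gigabit link, expect the download alone to take roughly six minutes before the server starts accepting requests.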

Once the instance is running, find your IP and port in the Vast console — click the IP button on your instance to see the port mapping.

Why nightly and transformers>=5? LFM2's tokenizer uses the TokenizersBackend class introduced in transformers 5.0. The stable vLLM image pins transformers<5, so the tokenizer fails to load. Using nightly with a transformers upgrade resolves this. This was tested with vLLM nightly build v0.16.0rc2. Once a stable vLLM release ships with transformers>=5.0.0 support, you can switch to vllm/vllm-openai:latest and drop the pip install step.

Call the API

curl http://<VAST_IP>:<PORT>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "LiquidAI/LFM2-24B-A2B",
    "messages": [{"role": "user", "content": "What makes mixture-of-experts models efficient?"}],
    "max_tokens": 300
  }'
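If you'd rather call the endpoint from Python, the same request can be made with nothing beyond the standard library. The IP and port placeholders are the same ones from the Vast console; fill them in before uncommenting the send:

```python
import json
import urllib.request

# Same request as the curl example above, built with the stdlib only.
BASE_URL = "http://<VAST_IP>:<PORT>"  # placeholder — fill in your instance's mapping

payload = {
    "model": "LiquidAI/LFM2-24B-A2B",
    "messages": [
        {"role": "user", "content": "What makes mixture-of-experts models efficient?"}
    ],
    "max_tokens": 300,
}

req = urllib.request.Request(
    f"{BASE_URL}/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

# Uncomment once the server is up:
# with urllib.request.urlopen(req) as resp:
#     body = json.load(resp)
#     print(body["choices"][0]["message"]["content"])
```

Any OpenAI-compatible client library will also work by pointing its base URL at the instance.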

Cleanup

When you're done, destroy the instance to stop charges:

vastai destroy instance <INSTANCE_ID>
