As AI models grow increasingly powerful, they're also becoming increasingly expensive to deploy. The latest language models require massive amounts of GPU memory and computational resources, putting them out of reach for many teams and use cases. Model compression has emerged as a critical technique for making these models more accessible while maintaining their performance.
Today, we'll explore how to compress large language models using LLM-Compressor and deploy them cost-effectively on Vast.ai. We'll walk through the complete process of taking a 16GB model and reducing it to approximately 9.5GB while preserving quality—making deployment significantly more affordable and accessible.
The vLLM team has developed LLM-Compressor as a comprehensive solution for reducing model size without sacrificing performance. Unlike other compression tools that focus on a single technique, LLM-Compressor offers a complete toolkit for optimization.
What makes it particularly powerful is that it bundles multiple techniques, such as SmoothQuant and GPTQ, behind a single recipe-based API, supports schemes like W8A8 and W4A16, and produces checkpoints that run efficiently in vLLM.
Traditional cloud providers can make GPU-intensive tasks prohibitively expensive. Vast.ai changes this equation by creating a marketplace where you can access high-end GPUs at competitive rates—often significantly cheaper than major cloud platforms.
For model compression workflows, this means you can rent a GPU with enough VRAM for a one-off job, start from a pre-configured PyTorch template, and pay only for the hours the compression actually takes.
This tutorial demonstrates how to:

- Compress the `Qwen3-8B` model using LLM-Compressor with 8-bit weight and activation quantization (W8A8)
- Upload the compressed `Qwen3-8B-W8A8` model to Hugging Face for easy sharing and deployment

In the second part of this series, we'll show how to deploy the compressed `Qwen3-8B-W8A8` model on Vast.ai for efficient inference and compare its performance with the full-precision model.
Model compression requires significant computational resources, especially for larger models. For the Qwen3-8B model, pick an instance with enough GPU VRAM to hold the full-precision weights (~16GB) plus headroom for calibration.
For larger models (12B+), you would need even more VRAM (40GB+) or techniques like model sharding.
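Once your instance is up (using the template described next), it's worth a quick sanity check that the GPU actually has the memory you expect. A minimal sketch, assuming PyTorch is available in the image:

import torch

# Confirm a CUDA device is visible and report its total VRAM
assert torch.cuda.is_available(), "No CUDA device visible - check your instance/template"
props = torch.cuda.get_device_properties(0)
print(f"GPU: {props.name}, VRAM: {props.total_memory / 1024**3:.1f} GB")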
When creating your instance, select the **PyTorch (CuDNN Runtime)** template - this provides a pre-configured environment with PyTorch and CUDA.

First, we'll install and import the required packages. The cell below installs:
# Install required packages
# Using specific versions to ensure compatibility
pip install transformers==4.51.3 # Recent Transformers version with good Qwen3 support
pip install torch==2.7.0 # PyTorch version compatible with current CUDA drivers
pip install huggingface_hub # Core Hugging Face Hub library
pip install huggingface_hub[hf_xet] # Extension for handling large file transfers
pip install huggingface_hub[cli] # Command-line interface for HF Hub
pip install llmcompressor # The model compression toolkit
Next, import the necessary libraries:
# Import necessary libraries
import os
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from llmcompressor.modifiers.quantization import GPTQModifier # For weight quantization
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier # For activation quantization preprocessing
from llmcompressor import oneshot # Simplified API for end-to-end compression
Next, we'll define our model variables and prepare for downloading the `Qwen/Qwen3-8B` model:

- `MODEL_ID`: The Hugging Face model ID for the original model
- `OUTPUT_MODEL_ID`: A name for our compressed model version
- `OUTPUT_DIR`: Local directory where the compressed model will be saved

# Define model path and output directory
MODEL_ID = "Qwen/Qwen3-8B" # Original model on Hugging Face
OUTPUT_MODEL_ID = "Qwen3-8B-W8A8" # Our compressed model name
OUTPUT_DIR = "./Qwen3-8B-W8A8" # Local directory for saving
os.makedirs(OUTPUT_DIR, exist_ok=True) # Create output directory if it doesn't exist
Download the model files using the snapshot_download function:
from huggingface_hub import snapshot_download
# Pre-download all model files (~16GB for Qwen3-8B) into the local Hugging Face cache
# This is more reliable for large models than letting AutoModel download them implicitly,
# and the cached files are reused when oneshot() loads the model below
# This may take some time depending on connection speed
snapshot_download(repo_id=MODEL_ID)
Now we'll define our compression recipe, which consists of two steps:

1. **SmoothQuant**: A technique that redistributes quantization difficulty between weights and activations, making activations easier to quantize with minimal accuracy loss (see the sketch below)
2. **GPTQ**: A weight quantization method designed for large language models that uses a reconstruction approach to minimize the impact of quantization on model quality
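To build intuition for the SmoothQuant step, here is a toy, self-contained illustration of the per-channel rescaling idea (not llm-compressor's actual implementation): dividing each activation channel by a scale `s` and multiplying the matching weight column by `s` leaves the layer output unchanged while flattening activation outliers.

import torch

def smoothquant_rescale(W, X, alpha=0.8):
    """Toy SmoothQuant-style rescaling for a linear layer Y = X @ W.T.
    W: (out_features, in_features), X: (tokens, in_features).
    Per input channel j: s_j = max|X_j|**alpha / max|W_j|**(1 - alpha)."""
    act_max = X.abs().amax(dim=0)   # per-channel activation range
    w_max = W.abs().amax(dim=0)     # per-channel weight range
    s = act_max.pow(alpha) / w_max.pow(1 - alpha)
    return W * s, X / s             # (X / s) @ (W * s).T == X @ W.T

# Channel 2 has an activation outlier; rescaling absorbs it into the weights
W = torch.randn(16, 8)
X = torch.randn(4, 8) * torch.tensor([1, 1, 50, 1, 1, 1, 1, 1.0])
W2, X2 = smoothquant_rescale(W, X)
print(torch.allclose(X @ W.T, X2 @ W2.T, atol=1e-4))  # True: output preserved

The `smoothing_strength=0.8` in the recipe below plays essentially the role of `alpha` here: higher values push more of the quantization difficulty onto the weights.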
We're using W8A8 quantization, which means:

- **W8**: weights are quantized to 8-bit integers
- **A8**: activations are quantized to 8-bit integers

This will reduce the model size by approximately 1.7x and can also increase inference speed.
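A rough back-of-the-envelope check on that figure (assuming roughly 8.2B parameters, 2 bytes per weight in BF16 versus 1 byte at INT8, with the unquantized lm_head, quantization scales, and metadata making up the difference):

# Rough size estimate - approximate parameter count, not exact file sizes
params = 8.2e9                    # approximate parameter count of Qwen3-8B
bf16_gb = params * 2 / 1e9        # ~16.4 GB at 2 bytes per weight
int8_gb = params * 1 / 1e9        # ~8.2 GB at 1 byte per weight
print(f"BF16: ~{bf16_gb:.1f} GB, INT8 weights: ~{int8_gb:.1f} GB")
print(f"Ratio vs the ~9.5 GB checkpoint we observe: {bf16_gb / 9.5:.2f}x")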
# Create quantization recipe
# A recipe is a sequence of compression techniques applied in order
recipe = [
# First apply SmoothQuant to make activations easier to quantize
# The smoothing_strength parameter controls how aggressively to shift quantization difficulty
# from activations to weights (higher = more shifting, 0.8 is a good balance)
SmoothQuantModifier(smoothing_strength=0.8),
# Then apply GPTQ for weight and activation quantization
GPTQModifier(
targets="Linear", # Apply to all linear layers in the model
scheme="W8A8", # 8-bit weights and activations (other options: W4A16, W8A16, etc.)
ignore=["lm_head"] # Don't quantize the language modeling head (important for output quality)
)
]
Now we'll apply our compression recipe to the model using the `oneshot` API, which handles the entire compression workflow: loading the model, preparing the calibration dataset, running calibration, applying the recipe, and saving the compressed model to the output directory.

We're using the `open_platypus` dataset for calibration, which consists of diverse open-source technical questions and answers. This helps ensure the model maintains accuracy across a range of technical topics.
Note: This cell will take a significant amount of time (30-60 minutes) and GPU memory to run.
print(f"Starting compression of {MODEL_ID}...")
# Apply quantization using oneshot API
oneshot(
model=MODEL_ID, # HF model ID or local path to load the model from
dataset="open_platypus", # Calibration dataset - preset dataset with technical Q&A
recipe=recipe, # Our compression recipe defined above
output_dir=OUTPUT_DIR, # Where to save the compressed model
max_seq_length=2048, # Maximum sequence length for calibration samples
num_calibration_samples=512, # Number of samples to use for calibration (more is better but slower)
)
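Once `oneshot` finishes, a quick way to confirm the size reduction is to sum the checkpoint files on disk (the exact number depends on the serialization details, but it should land near the ~9.5GB mentioned earlier):

# Sum the size of everything written to the output directory
total_bytes = sum(
    os.path.getsize(os.path.join(root, name))
    for root, _, files in os.walk(OUTPUT_DIR)
    for name in files
)
print(f"Compressed model on disk: {total_bytes / 1024**3:.2f} GB")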
After compression, it's important to verify that the model still performs reasonably well. This cell loads the compressed model and runs a simple inference test.
We'll evaluate the model on a prompt about AI technology to see how well it generates coherent and relevant text. This helps us verify that the quantization didn't significantly degrade model quality.
Note that a proper evaluation would involve more systematic testing across multiple prompts and metrics.
# Load the compressed model from our output directory
tokenizer = AutoTokenizer.from_pretrained(OUTPUT_DIR)
model = AutoModelForCausalLM.from_pretrained(
OUTPUT_DIR,
device_map="auto", # Automatically place model on available devices (CPU/GPU)
torch_dtype="auto" # Use the model's native precision (will be INT8 for quantized weights)
)
# Simple test to verify the model still works
prompt = "AI technology is transforming industries by"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Generate text with some randomness for creativity
with torch.no_grad(): # Disable gradient calculation for inference
outputs = model.generate(
inputs["input_ids"],
max_new_tokens=100, # Generate up to 100 new tokens
do_sample=True, # Use sampling instead of greedy decoding
temperature=0.7 # Control randomness (lower = more deterministic)
)
# Decode the generated text and print it
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Prompt: {prompt}")
print(f"Generated text: {generated_text}")
Now that we have successfully compressed the model, we'll upload it to Hugging Face Hub for easy sharing and deployment.
First, we'll set up the necessary credentials and repository information:
from huggingface_hub import HfApi, login
import os
# Define variables for Hugging Face upload
# You'll need to replace these with your actual Hugging Face information
HF_USERNAME = "HF_USERNAME" # Add your Hugging Face username here
HF_TOKEN = "HF_TOKEN" # Replace with your actual Hugging Face token (from settings page)
HF_MODEL_ID = f"{HF_USERNAME}/{OUTPUT_MODEL_ID}" # Full repository path on Hugging Face
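Hard-coding a token in a notebook is easy to leak. A safer pattern is to read it from an environment variable you've exported on the instance (or to authenticate once with `huggingface-cli login`), for example:

# Prefer reading the token from the environment instead of pasting it into the notebook
HF_TOKEN = os.environ.get("HF_TOKEN")
if HF_TOKEN is None:
    raise ValueError("Set the HF_TOKEN environment variable, or paste your token above.")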
Next, we'll create a proper model card (README.md) with information about our compressed model. This helps others understand what the model is, how it was compressed, and how to use it:
# Create a detailed model card describing the compressed model
model_card_content = f"""---
language:
- en
- zh
library_name: transformers
license: other
datasets:
- open_platypus
---
# Qwen3-8B-W8A8
This is a compressed version of [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) using llm-compressor with the following optimizations:
- 8-bit weight quantization using GPTQ
- 8-bit activation quantization
- SmoothQuant pre-processing
## Model Details
- **Original Model**: Qwen/Qwen3-8B
- **Quantization Method**: GPTQ + SmoothQuant (W8A8)
- **Compression Libraries**: [llm-compressor](https://github.com/vllm-project/llm-compressor)
- **Calibration Dataset**: open_platypus (512 samples)
- **Optimized For**: Inference with vLLM
"""
# Write README.md to model directory
readme_path = os.path.join(OUTPUT_DIR, "README.md")
with open(readme_path, "w") as f:
f.write(model_card_content)
Finally, we'll upload the compressed model to Hugging Face Hub. This process may take some time depending on your internet connection speed, as the model files are large (approximately 9.5GB even after compression):
# Login to Hugging Face with your token
login(token=HF_TOKEN)
# Initialize the Hugging Face API
api = HfApi()
# Create the repository (if it doesn't exist)
api.create_repo(
repo_id=HF_MODEL_ID,
exist_ok=True, # Don't error if repo already exists
private=False, # Set to True if you want a private repository
)
# Upload the model files
print(f"Starting upload of {OUTPUT_DIR} to {HF_MODEL_ID}...")
api.upload_folder(
folder_path=OUTPUT_DIR,
repo_id=HF_MODEL_ID,
repo_type="model",
ignore_patterns=["*.tmp", "*.log", "__pycache__"], # Files to ignore during upload
commit_message="Upload compressed Qwen3-8B with W8A8 quantization",
)
print(f"Model successfully uploaded to: https://huggingface.co/{HF_MODEL_ID}")
We've just walked through a transformation that would have seemed impossible a few years ago—taking a sophisticated language model and cutting its resource requirements by approximately 40% while preserving its capabilities.
Our compression workflow delivered results across multiple dimensions:

- **Technical**: the checkpoint shrank from roughly 16GB to about 9.5GB (a ~1.7x reduction) while still generating coherent output in our spot checks
- **Economic**: the smaller footprint fits on cheaper, lower-VRAM GPUs, which translates directly into lower hosting costs on a marketplace like Vast.ai
- **Operational**: the compressed model is published on Hugging Face and ready for efficient inference with vLLM
Model compression offers an alternative approach to AI deployment. Rather than accepting ever-increasing resource requirements, we can actively optimize models for real-world constraints. The combination of LLM-Compressor's sophisticated algorithms and Vast.ai's accessible infrastructure makes this optimization practical for any team.
In Part 2, we'll put our compressed model to the test. We'll deploy both our optimized Qwen3-8B-W8A8 model and the original full-precision version on Vast.ai, then conduct detailed comparisons of their performance, cost, and output quality. You'll see exactly how much you can save without compromising on results—and learn when compression makes the most sense for your specific use case.