As AI models grow increasingly powerful, they're also becoming increasingly expensive to deploy. The latest language models require massive amounts of GPU memory and computational resources, putting them out of reach for many teams and use cases. Model compression has emerged as a critical technique for making these models more accessible while maintaining their performance.
Today, we'll explore how to compress large language models using LLM-Compressor and deploy them cost-effectively on Vast.ai. We'll walk through the complete process of taking a 16GB model and reducing it to approximately 9.5GB while preserving quality—making deployment significantly more affordable and accessible.
The vLLM team has developed LLM-Compressor as a comprehensive solution for reducing model size without sacrificing performance. Unlike other compression tools that focus on a single technique, LLM-Compressor offers a complete toolkit for optimization.
What makes it particularly powerful is that it bundles multiple techniques, such as SmoothQuant and GPTQ, behind a single recipe-based API, supports schemes like W8A8 and W4A16, and produces checkpoints that run efficiently in vLLM.
Traditional cloud providers can make GPU-intensive tasks prohibitively expensive. Vast.ai changes this equation by creating a marketplace where you can access high-end GPUs at competitive rates—often significantly cheaper than major cloud platforms.
For model compression workflows, this means you can rent a GPU with enough VRAM for a one-off job, start from a pre-configured PyTorch template, and pay only for the hours the compression actually takes.
This tutorial demonstrates how to:

- Compress the `Qwen3-8B` model using LLM-Compressor with 8-bit weight and activation quantization (W8A8)
- Upload the compressed `Qwen3-8B-W8A8` model to Hugging Face for easy sharing and deployment

In the second part of this series, we'll show how to deploy the compressed `Qwen3-8B-W8A8` model on Vast.ai for efficient inference and compare its performance with the full-precision model.
Model compression requires significant computational resources, especially for larger models. For the Qwen3-8B model, pick an instance with enough GPU VRAM to hold the full-precision weights (~16GB) plus headroom for calibration.
For larger models (12B+), you would need even more VRAM (40GB+) or techniques like model sharding.
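Once your instance is up (using the template described next), it's worth a quick sanity check that the GPU actually has the memory you expect. A minimal sketch, assuming PyTorch is available in the image:

import torch

# Confirm a CUDA device is visible and report its total VRAM
assert torch.cuda.is_available(), "No CUDA device visible - check your instance/template"
props = torch.cuda.get_device_properties(0)
print(f"GPU: {props.name}, VRAM: {props.total_memory / 1024**3:.1f} GB")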
When creating your instance, select the **PyTorch (CuDNN Runtime)** template - this provides a pre-configured environment with PyTorch and CUDA.

First, we'll install and import the required packages. The cell below installs:
# Install required packages
# Using specific versions to ensure compatibility
pip install transformers==4.51.3 # Recent Transformers version with good Qwen3 support
pip install torch==2.7.0 # PyTorch version compatible with current CUDA drivers
pip install huggingface_hub # Core Hugging Face Hub library
pip install huggingface_hub[hf_xet] # Extension for handling large file transfers
pip install huggingface_hub[cli] # Command-line interface for HF Hub
pip install llmcompressor # The model compression toolkit
Next, import the necessary libraries:
# Import necessary libraries
import os
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from llmcompressor.modifiers.quantization import GPTQModifier # For weight quantization
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier # For activation quantization preprocessing
from llmcompressor import oneshot # Simplified API for end-to-end compression
Next, we'll define our model variables and prepare for downloading the `Qwen/Qwen3-8B` model:

- `MODEL_ID`: The Hugging Face model ID for the original model
- `OUTPUT_MODEL_ID`: A name for our compressed model version
- `OUTPUT_DIR`: Local directory where the compressed model will be saved

# Define model path and output directory
MODEL_ID = "Qwen/Qwen3-8B" # Original model on Hugging Face
OUTPUT_MODEL_ID = "Qwen3-8B-W8A8" # Our compressed model name
OUTPUT_DIR = "./Qwen3-8B-W8A8" # Local directory for saving
os.makedirs(OUTPUT_DIR, exist_ok=True) # Create output directory if it doesn't exist
Download the model files using the snapshot_download function:
from huggingface_hub import snapshot_download
# Pre-download all model files (~16GB for Qwen3-8B) into the local Hugging Face cache
# This is more reliable for large models than letting AutoModel download them implicitly,
# and the cached files are reused when oneshot() loads the model below
# This may take some time depending on connection speed
snapshot_download(repo_id=MODEL_ID)
Now we'll define our compression recipe, which consists of two steps:

1. **SmoothQuant**: A technique that redistributes quantization difficulty between weights and activations, making activations easier to quantize with minimal accuracy loss (see the sketch below)
2. **GPTQ**: A weight quantization method designed for large language models that uses a reconstruction approach to minimize the impact of quantization on model quality
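To build intuition for the SmoothQuant step, here is a toy, self-contained illustration of the per-channel rescaling idea (not llm-compressor's actual implementation): dividing each activation channel by a scale `s` and multiplying the matching weight column by `s` leaves the layer output unchanged while flattening activation outliers.

import torch

def smoothquant_rescale(W, X, alpha=0.8):
    """Toy SmoothQuant-style rescaling for a linear layer Y = X @ W.T.
    W: (out_features, in_features), X: (tokens, in_features).
    Per input channel j: s_j = max|X_j|**alpha / max|W_j|**(1 - alpha)."""
    act_max = X.abs().amax(dim=0)   # per-channel activation range
    w_max = W.abs().amax(dim=0)     # per-channel weight range
    s = act_max.pow(alpha) / w_max.pow(1 - alpha)
    return W * s, X / s             # (X / s) @ (W * s).T == X @ W.T

# Channel 2 has an activation outlier; rescaling absorbs it into the weights
W = torch.randn(16, 8)
X = torch.randn(4, 8) * torch.tensor([1, 1, 50, 1, 1, 1, 1, 1.0])
W2, X2 = smoothquant_rescale(W, X)
print(torch.allclose(X @ W.T, X2 @ W2.T, atol=1e-4))  # True: output preserved

The `smoothing_strength=0.8` in the recipe below plays essentially the role of `alpha` here: higher values push more of the quantization difficulty onto the weights.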
We're using W8A8 quantization, which means:

- **W8**: weights are quantized to 8-bit integers
- **A8**: activations are quantized to 8-bit integers

This will reduce the model size by approximately 1.7x and can also increase inference speed.
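A rough back-of-the-envelope check on that figure (assuming roughly 8.2B parameters, 2 bytes per weight in BF16 versus 1 byte at INT8, with the unquantized lm_head, quantization scales, and metadata making up the difference):

# Rough size estimate - approximate parameter count, not exact file sizes
params = 8.2e9                    # approximate parameter count of Qwen3-8B
bf16_gb = params * 2 / 1e9        # ~16.4 GB at 2 bytes per weight
int8_gb = params * 1 / 1e9        # ~8.2 GB at 1 byte per weight
print(f"BF16: ~{bf16_gb:.1f} GB, INT8 weights: ~{int8_gb:.1f} GB")
print(f"Ratio vs the ~9.5 GB checkpoint we observe: {bf16_gb / 9.5:.2f}x")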
# Create quantization recipe
# A recipe is a sequence of compression techniques applied in order
recipe = [
# First apply SmoothQuant to make activations easier to quantize
# The smoothing_strength parameter controls how aggressively to shift quantization difficulty
# from activations to weights (higher = more shifting, 0.8 is a good balance)
SmoothQuantModifier(smoothing_strength=0.8),
# Then apply GPTQ for weight and activation quantization
GPTQModifier(
targets="Linear", # Apply to all linear layers in the model
scheme="W8A8", # 8-bit weights and activations (other options: W4A16, W8A16, etc.)
ignore=["lm_head"] # Don't quantize the language modeling head (important for output quality)
)
]
Now we'll apply our compression recipe to the model using the `oneshot` API, which handles the entire compression workflow: loading the model, preparing the calibration dataset, running calibration, applying the recipe, and saving the compressed model to the output directory.

We're using the `open_platypus` dataset for calibration, which consists of diverse open-source technical questions and answers. This helps ensure the model maintains accuracy across a range of technical topics.
Note: This cell will take a significant amount of time (30-60 minutes) and GPU memory to run.
print(f"Starting compression of {MODEL_ID}...")
# Apply quantization using oneshot API
oneshot(
model=MODEL_ID, # HF model ID or local path to load the model from
dataset="open_platypus", # Calibration dataset - preset dataset with technical Q&A
recipe=recipe, # Our compression recipe defined above
output_dir=OUTPUT_DIR, # Where to save the compressed model
max_seq_length=2048, # Maximum sequence length for calibration samples
num_calibration_samples=512, # Number of samples to use for calibration (more is better but slower)
)
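Once `oneshot` finishes, a quick way to confirm the size reduction is to sum the checkpoint files on disk (the exact number depends on the serialization details, but it should land near the ~9.5GB mentioned earlier):

# Sum the size of everything written to the output directory
total_bytes = sum(
    os.path.getsize(os.path.join(root, name))
    for root, _, files in os.walk(OUTPUT_DIR)
    for name in files
)
print(f"Compressed model on disk: {total_bytes / 1024**3:.2f} GB")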
After compression, it's important to verify that the model still performs reasonably well. This cell loads the compressed model and runs a simple inference test.
We'll evaluate the model on a prompt about AI technology to see how well it generates coherent and relevant text. This helps us verify that the quantization didn't significantly degrade model quality.
Note that a proper evaluation would involve more systematic testing across multiple prompts and metrics.
# Load the compressed model from our output directory
tokenizer = AutoTokenizer.from_pretrained(OUTPUT_DIR)
model = AutoModelForCausalLM.from_pretrained(
OUTPUT_DIR,
device_map="auto", # Automatically place model on available devices (CPU/GPU)
torch_dtype="auto" # Use the model's native precision (will be INT8 for quantized weights)
)
# Simple test to verify the model still works
prompt = "AI technology is transforming industries by"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Generate text with some randomness for creativity
with torch.no_grad(): # Disable gradient calculation for inference
outputs = model.generate(
inputs["input_ids"],
max_new_tokens=100, # Generate up to 100 new tokens
do_sample=True, # Use sampling instead of greedy decoding
temperature=0.7 # Control randomness (lower = more deterministic)
)
# Decode the generated text and print it
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Prompt: {prompt}")
print(f"Generated text: {generated_text}")
Now that we have successfully compressed the model, we'll upload it to Hugging Face Hub for easy sharing and deployment.
First, we'll set up the necessary credentials and repository information:
from huggingface_hub import HfApi, login
import os
# Define variables for Hugging Face upload
# You'll need to replace these with your actual Hugging Face information
HF_USERNAME = "HF_USERNAME" # Add your Hugging Face username here
HF_TOKEN = "HF_TOKEN" # Replace with your actual Hugging Face token (from settings page)
HF_MODEL_ID = f"{HF_USERNAME}/{OUTPUT_MODEL_ID}" # Full repository path on Hugging Face
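Hard-coding a token in a notebook is easy to leak. A safer pattern is to read it from an environment variable you've exported on the instance (or to authenticate once with `huggingface-cli login`), for example:

# Prefer reading the token from the environment instead of pasting it into the notebook
HF_TOKEN = os.environ.get("HF_TOKEN")
if HF_TOKEN is None:
    raise ValueError("Set the HF_TOKEN environment variable, or paste your token above.")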
Next, we'll create a proper model card (README.md) with information about our compressed model. This helps others understand what the model is, how it was compressed, and how to use it:
# Create a detailed model card describing the compressed model
model_card_content = f"""---
language:
- en
- zh
library_name: transformers
license: other
datasets:
- open_platypus
---
# Qwen3-8B-W8A8
This is a compressed version of [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) using llm-compressor with the following optimizations:
- 8-bit weight quantization using GPTQ
- 8-bit activation quantization
- SmoothQuant pre-processing
## Model Details
- **Original Model**: Qwen/Qwen3-8B
- **Quantization Method**: GPTQ + SmoothQuant (W8A8)
- **Compression Libraries**: [llm-compressor](https://github.com/vllm-project/llm-compressor)
- **Calibration Dataset**: open_platypus (512 samples)
- **Optimized For**: Inference with vLLM
"""
# Write README.md to model directory
readme_path = os.path.join(OUTPUT_DIR, "README.md")
with open(readme_path, "w") as f:
f.write(model_card_content)
Finally, we'll upload the compressed model to Hugging Face Hub. This process may take some time depending on your internet connection speed, as the model files are large (approximately 9.5GB even after compression):
# Login to Hugging Face with your token
login(token=HF_TOKEN)
# Initialize the Hugging Face API
api = HfApi()
# Create the repository (if it doesn't exist)
api.create_repo(
repo_id=HF_MODEL_ID,
exist_ok=True, # Don't error if repo already exists
private=False, # Set to True if you want a private repository
)
# Upload the model files
print(f"Starting upload of {OUTPUT_DIR} to {HF_MODEL_ID}...")
api.upload_folder(
folder_path=OUTPUT_DIR,
repo_id=HF_MODEL_ID,
repo_type="model",
ignore_patterns=["*.tmp", "*.log", "__pycache__"], # Files to ignore during upload
commit_message="Upload compressed Qwen3-8B with W8A8 quantization",
)
print(f"Model successfully uploaded to: https://huggingface.co/{HF_MODEL_ID}")
We've just walked through a transformation that would have seemed impossible a few years ago—taking a sophisticated language model and cutting its resource requirements by approximately 40% while preserving its capabilities.
Our compression workflow delivered results across multiple dimensions:

- **Technical**: the checkpoint shrank from roughly 16GB to about 9.5GB (a ~1.7x reduction) while still generating coherent output in our spot checks
- **Economic**: the smaller footprint fits on cheaper, lower-VRAM GPUs, which translates directly into lower hosting costs on a marketplace like Vast.ai
- **Operational**: the compressed model is published on Hugging Face and ready for efficient inference with vLLM
Model compression offers an alternative approach to AI deployment. Rather than accepting ever-increasing resource requirements, we can actively optimize models for real-world constraints. The combination of LLM-Compressor's sophisticated algorithms and Vast.ai's accessible infrastructure makes this optimization practical for any team.
In Part 2, we'll put our compressed model to the test. We'll deploy both our optimized Qwen3-8B-W8A8 model and the original full-precision version on Vast.ai, then conduct detailed comparisons of their performance, cost, and output quality. You'll see exactly how much you can save without compromising on results—and learn when compression makes the most sense for your specific use case.