Using LLM-Compressor to Quantize Qwen3-8B on Vast.ai (Part 2)

July 23, 2025
12 Min Read
By Team Vast

In Part 1 of this series, we created a quantized version of the Qwen3-8B model (Qwen3-8B-W8A8) using LLM-Compressor with 8-bit weight and activation quantization. In this notebook, we deploy this quantized model on Vast.ai and compare it against its full-precision counterpart (Qwen3-8B). We walk through both deployments and compare the outputs to assess any quality differences between the models, while highlighting the efficiency gains from quantization.

Key benefits of quantized models:

  • Reduced memory footprint (approximately 1.7x smaller in our case)
  • Lower inference latency
  • Decreased computational requirements
  • More affordable deployment options

Let's start by exploring how to deploy our quantized model on Vast.ai.

Deploying our Model on Vast.ai

First, we will install the Vast.ai SDK and input our API key. This allows us to interact with the Vast.ai platform programmatically and manage our GPU instances.

# In an environment of your choice
pip install --upgrade vastai
# Set your API key
export VAST_API_KEY="VAST_API_KEY"  # Your key here
vastai set api-key "$VAST_API_KEY"

Next, we'll search for an appropriate GPU instance to serve our quantized model. Since this is a W8A8 quantized version of Qwen3-8B, we can use a machine with less VRAM than would be required for the full model. We're looking for machines with at least 24GB of VRAM, which is sufficient for our 8-bit quantized model and its context window:

vastai search offers "compute_cap >= 750 \
geolocation=US \
gpu_ram >= 24 \
num_gpus = 1 \
static_ip = true \
direct_port_count >= 1 \
verified = true \
disk_space >= 20 \
rentable = true"
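
The same filter is reused later with different VRAM and disk values, so it can help to build it from parameters. Below is a minimal sketch in Python that only assembles the command shown above; `build_offer_query` and `search_command` are hypothetical helper names, not part of the Vast.ai tooling.

```python
import shlex


def build_offer_query(min_vram_gb: int, min_disk_gb: int, region: str = "US") -> str:
    """Return the filter string passed to `vastai search offers`."""
    filters = [
        "compute_cap >= 750",
        f"geolocation={region}",
        f"gpu_ram >= {min_vram_gb}",
        "num_gpus = 1",
        "static_ip = true",
        "direct_port_count >= 1",
        "verified = true",
        f"disk_space >= {min_disk_gb}",
        "rentable = true",
    ]
    return " ".join(filters)


def search_command(min_vram_gb: int, min_disk_gb: int) -> list[str]:
    """Return the CLI invocation as an argument list (hypothetical helper)."""
    return ["vastai", "search", "offers", build_offer_query(min_vram_gb, min_disk_gb)]


# Values used in this notebook for the quantized model: 24GB VRAM, 20GB disk.
quantized_cmd = search_command(24, 20)
print(shlex.join(quantized_cmd))
```

The resulting list can be passed to `subprocess.run(quantized_cmd)` on a machine where the `vastai` CLI is installed and the API key is set.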

Once we've identified a suitable instance from the search results, we'll deploy our quantized model to that instance. We're using the vLLM Docker image with an OpenAI-compatible API, which makes it easy to serve and interact with our model.

Note that we're deploying the W8A8 quantized version of Qwen3-8B that was created and uploaded to Hugging Face in Part 1. Please replace your-hf-username with your actual Hugging Face username to use your compressed model:

export INSTANCE_ID= # insert the ID of the offer you picked from the search results
vastai create instance $INSTANCE_ID --image vllm/vllm-openai:latest --env '-p 8000:8000' --disk 20 --args --model your-hf-username/Qwen3-8B-W8A8
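
The instance needs some time to pull the image and load the model before it accepts requests. A minimal sketch for waiting until the server answers is shown below. It assumes the vLLM OpenAI-compatible server lists its models under `/v1/models`; `wait_for_server` and its default timings are hypothetical choices, not something prescribed by Vast.ai or vLLM.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(
    base_url: str,
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
    request_timeout_s: float = 5.0,
) -> bool:
    """Poll `{base_url}/models` until it returns HTTP 200 or the timeout expires."""
    url = f"{base_url.rstrip('/')}/models"
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=request_timeout_s) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # Server not reachable yet; retry after a short pause.
        time.sleep(interval_s)
    return False


# Example (replace with the IP address and port of your instance):
# ready = wait_for_server("http://VAST_IP_ADDRESS:VAST_PORT/v1")
```

The function returns `False` instead of raising, so the caller can decide whether to keep waiting or inspect the instance logs.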

Calling Our Model Using the OpenAI SDK

Now that our model is deployed and running on Vast.ai, we'll set up a client to interact with it. The vLLM server exposes an OpenAI-compatible API, allowing us to use the OpenAI SDK to send requests to our model.

First, install the OpenAI SDK if you haven't already:

pip install openai

Then, we use the OpenAI SDK with the VAST_IP_ADDRESS and VAST_PORT of our instance to call our model.

Also set HF_USERNAME to your Hugging Face username so the request targets the model we saved to Hugging Face.

from openai import OpenAI

VAST_IP_ADDRESS = "VAST_IP_ADDRESS"
VAST_PORT = "VAST_PORT"

HF_USERNAME="HF_USERNAME"

# Create a client instance pointing to the vLLM server
client = OpenAI(
    api_key="dummy-key",  # vLLM doesn't require a real API key
    base_url=f"http://{VAST_IP_ADDRESS}:{VAST_PORT}/v1"  # Point to your vLLM server
)

# Call the model
response = client.chat.completions.create(
    model=f"{HF_USERNAME}/Qwen3-8B-W8A8",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What are the major applications of deep learning?"}
    ],
    temperature=0.7,
    max_tokens=1000
)

# Print the response
print(response.choices[0].message.content)

Output:

<think>
Okay, the user is asking about the major applications of deep learning. I need to provide a comprehensive yet clear answer. Let me start by recalling the different areas where deep learning is commonly used.

First, computer vision comes to mind. That includes image recognition, object detection, and maybe even image generation. Then there's natural language processing, like chatbots, translation, and sentiment analysis.

I should also mention speech recognition, which is used in virtual assistants. Autonomous vehicles rely on deep learning for perception tasks. Healthcare is another big area, such as medical imaging analysis and drug discovery.

Finance uses deep learning for fraud detection and algorithmic trading. Recommender systems in e-commerce and streaming services are also important. Maybe touch on generative models like GANs and diffusion models for creating art or text.

Wait, I should categorize these applications to make it organized. Let me list them under different headings. Also, I need to ensure I don't miss any major areas. Oh, maybe mention robotics and control systems too.

I should check if there are any recent advancements or emerging fields. For example, deep learning in gaming, like reinforcement learning for game AI. Also, maybe something about anomaly detection in various industries.

Need to keep the explanation straightforward without too much jargon. Make sure each application is explained briefly but clearly. Avoid going into too much technical detail unless necessary.

Let me structure the answer with clear sections for each major application. Start with an introduction, then list each area with a brief description. Conclude by summarizing the versatility of deep learning.

Wait, the user might be looking for both current and potential future applications. Should I include that? Maybe a short note at the end about ongoing research and future possibilities.

Also, check for any overlaps or if some applications are better categorized under specific fields. For example, NLP and speech recognition are both related to language processing.

I think that's a solid structure. Now, I'll draft the answer accordingly, making sure it's informative and covers all key areas without being too verbose.
</think>

Deep learning, a subset of machine learning, has revolutionized numerous fields due to its ability to model complex patterns in data. Here are its major applications across various domains:

---

### **1. Computer Vision**
- **Image Recognition**: Identifying objects, scenes, and patterns in images (e.g., facial recognition, medical imaging analysis).
- **Object Detection**: Locating and classifying multiple objects in a single image (e.g., autonomous vehicles, surveillance systems).
- **Image Generation**: Creating new images using Generative Adversarial Networks (GANs) or diffusion models (e.g., AI art tools).
- **Video Analysis**: Action recognition, video summarization, and scene understanding.

---

### **2. Natural Language Processing (NLP)**
- **Text Generation**: Writing articles, stories, or code using models like GPT or BERT.
- **Translation**: Real-time language translation (e.g., Google Translate).
- **Sentiment Analysis**: Determining emotions in text (e.g., social media monitoring).
- **Chatbots & Virtual Assistants**: Interactive AI assistants (e.g., Siri, Alexa, customer service bots).

---

### **3. Speech Recognition & Synthesis**
- **Voice Assistants**: Converting speech to text (e.g., Alexa, Google Assistant).
- **Speech-to-Text**: Transcribing audio for accessibility or transcription services.
- **Text-to-Speech**: Synthesizing natural-sounding speech (e.g., audiobooks, virtual characters).

---

### **4. Autonomous Systems**
- **Self-Driving Cars**: Perceiving environments, navigating, and making real-time decisions.
- **Robotics**: Controlling robotic arms, navigation, and object manipulation.
- **Industrial Automation**: Predictive maintenance and quality control in manufacturing.

---

### **5. Healthcare**
- **Medical Imaging**: Diagnosing diseases (e.g., tumors in X-rays, MRIs) using CNNs.
- **Drug Discovery**: Accelerating the development of new medications through molecular modeling.
- **Personalized Medicine**: Predicting patient outcomes based on genetic data and medical history.

---

### **6. Finance**
- **Fraud Detection**: Identifying unusual transaction patterns.
- **Algorithmic Trading**: Predicting market trends and executing trades.
- **Risk Management**: Assessing credit risk and financial stability.

---

### **7. Recommender Systems**
- **E-commerce**: Personalized product recommendations (e.g., Amazon, Netflix).
- **Streaming Services**: Content suggestions based on user preferences.
- **Social Media**: Curating news feeds and trending topics.

---

### **8. Gaming & Entertainment**
- **Game AI**: Creating intelligent opponents or NPCs (e.g., AlphaGo, Dota 2 bots).
- **Content Creation**: Generating music, art, or video using deep learning models.

---
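
The response above mixes the model's reasoning, wrapped in `<think>` tags, with the final answer. When comparing models it can be convenient to separate the two. The following is a minimal sketch; `split_reasoning` is a hypothetical helper, and it assumes a single, properly closed `<think>` block as in the output above. If the closing tag is missing (for example, because generation stopped at `max_tokens`), the whole text is returned as the answer.

```python
import re

THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a response that may contain a <think> block."""
    match = THINK_BLOCK.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer


# Small stand-in string; in practice pass response.choices[0].message.content.
sample = "<think>\nList the main areas first.\n</think>\n\nDeep learning has many applications."
reasoning, answer = split_reasoning(sample)
print(answer)
```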

Comparing with the Full Qwen/Qwen3-8B Model

Now that we've seen the output from our quantized model, let's deploy and test the full precision version of the same model for comparison. This will help us evaluate any potential quality differences and understand the tradeoffs of quantization.

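First, we need to find a more powerful machine with significantly more GPU memory. The full precision weights are roughly 1.7x larger than the quantized ones (~16GB in FP16 versus ~9.5GB), and the server also needs room for the context window. Let's search for instances with at least 80GB of VRAM, about 3.3x the 24GB we used for the quantized model: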

vastai search offers "compute_cap >= 750 \
geolocation=US \
gpu_ram >= 80 \
num_gpus = 1 \
static_ip = true \
direct_port_count >= 1 \
verified = true \
disk_space >= 40 \
rentable = true"

Now we'll deploy the full precision model. Notice the differences in our configuration compared to the quantized model:

  1. We're requiring more disk space (40GB vs 20GB)
  2. We're using a much more powerful GPU with significantly more VRAM
  3. We're deploying the original model from Qwen directly (Qwen/Qwen3-8B)

export INSTANCE_ID= #insert instance ID
vastai create instance $INSTANCE_ID --image vllm/vllm-openai:latest --env '-p 8000:8000' --disk 40 --args --model Qwen/Qwen3-8B

With the full precision model deployed, let's query it with the same prompt we used for the quantized model to perform a direct comparison of their outputs:


from openai import OpenAI

# IP address and port of the instance serving the full precision model
VAST_IP_ADDRESS = "VAST_IP_ADDRESS"
VAST_PORT = "VAST_PORT"

# Create a client instance pointing to the vLLM server
client = OpenAI(
    api_key="dummy-key",  # vLLM doesn't require a real API key
    base_url=f"http://{VAST_IP_ADDRESS}:{VAST_PORT}/v1"  # Point to your vLLM server
)

# Call the model
response = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What are the major applications of deep learning?"}
    ],
    temperature=0.7,
    max_tokens=1000
)

# Print the response
print(response.choices[0].message.content)

Output:

<think>
Okay, the user is asking about the major applications of deep learning. Let me start by recalling what I know. Deep learning is a subset of machine learning that uses neural networks with many layers. It's been really popular in various fields. I need to list the main areas where it's applied.

First, computer vision comes to mind. Image recognition, object detection, maybe even things like facial recognition. Then there's natural language processing, like chatbots, translation services, text generation. Speech recognition is another big one, like virtual assistants.

What else? Autonomous vehicles use deep learning for perception tasks, like identifying objects on the road. Healthcare is another area, maybe medical imaging analysis, predicting diseases. Then there's recommendation systems used by companies like Netflix or Amazon.

Wait, what about robotics? Deep reinforcement learning is used in robotics for control systems. Also, maybe gaming, like in AlphaGo. Financial services for fraud detection or algorithmic trading.

I should also mention some more specialized areas like generative models for creating images or music, and maybe something about NLP like sentiment analysis. Oh, and maybe mention that these applications are driven by the ability of deep learning to handle large amounts of data and complex patterns.

Wait, the user might be looking for a comprehensive list, so I should structure it into categories. Let me check if I missed any major areas. Maybe cybersecurity? Or maybe something like anomaly detection in various industries. Also, maybe mention the difference between deep learning and traditional machine learning in terms of application.

I need to make sure each application is clearly explained with examples. Let me organize them into sections. Computer Vision, NLP, Speech Recognition, Autonomous Vehicles, Healthcare, Recommendation Systems, Robotics, Gaming, Financial Services, Cybersecurity, Generative Models, and maybe others.

I should also note that deep learning is used in both supervised and unsupervised tasks. Oh, and maybe mention some specific models like CNNs for images, RNNs or Transformers for NLP. But the user might not need the technical details, just the applications.

Wait, the user might be a student or someone new to the field, so keeping it clear and concise is important. Avoid jargon where possible. Also, make sure to highlight the impact of each application.

Let me make sure I cover all major areas. Maybe add something about drug discovery in healthcare. Or maybe mention that deep learning is used in weather prediction or climate modeling.

Hmm, I think that's a good list. Let me structure it with bullet points and explanations for each application. Also, maybe conclude with a note on the versatility and impact of deep learning.
</think>

Deep learning, a subset of machine learning, has revolutionized numerous fields due to its ability to model complex patterns and process large datasets. Here are the **major applications** of deep learning:

---

### **1. Computer Vision**
- **Image Recognition**: Identifying objects, people, and scenes in images (e.g., Google Photos, medical imaging).
- **Object Detection**: Locating and classifying objects in real-time (e.g., autonomous vehicles, security cameras).
- **Facial Recognition**: Used in biometric authentication (e.g., smartphones, border control).
- **Image Generation**: Creating realistic images (e.g., GANs for art, style transfer).

---

### **2. Natural Language Processing (NLP)**
- **Text Generation**: Writing articles, stories, or code (e.g., chatbots, AI assistants like Siri).
- **Machine Translation**: Languages like Google Translate use neural networks for real-time translation.
- **Sentiment Analysis**: Analyzing emotions in text (e.g., social media monitoring, customer feedback).
- **Speech Recognition**: Converting speech to text (e.g., voice assistants, transcription services).

---

### **3. Autonomous Systems**
- **Self-Driving Cars**: Detecting pedestrians, traffic signs, and obstacles (e.g., Tesla Autopilot, Waymo).
- **Robotics**: Controlling robotic arms and navigation systems (e.g., industrial robots, drones).
- **Reinforcement Learning**: Training agents for tasks like game playing (e.g., AlphaGo, Dota 2).

---

### **4. Healthcare**
- **Medical Imaging**: Diagnosing diseases from X-rays, MRIs, or CT scans (e.g., cancer detection).
- **Drug Discovery**: Accelerating the development of new drugs by predicting molecular interactions.
- **Personalized Medicine**: Tailoring treatments based on patient data (e.g., genomics analysis).

---

### **5. Recommendation Systems**
- **E-commerce**: Product recommendations (e.g., Amazon, Netflix).
- **Streaming Services**: Content suggestions (e.g., Spotify, YouTube).
- **Social Media**: Curating news feeds (e.g., Facebook, Instagram).

---

### **6. Finance**
- **Fraud Detection**: Identifying unusual

Conclusion: Comparing Model Outputs and Size Tradeoffs

After deploying and testing both the quantized model (W8A8) and the full precision model, we can make several key observations:

Model Size and Resource Requirements

  • Full Precision Model (Qwen3-8B):

    • Deployed on a GPU with 80GB+ of VRAM
    • Needed 40GB disk space
    • Model size: ~16GB (FP16)
  • Quantized Model (Qwen3-8B-W8A8):

    • Deployed on a GPU with only 24GB of VRAM (~3.3x less than the 80GB instance)
    • Needed only 20GB disk space
    • Model size: ~9.5GB (8-bit weights, ~1.7x smaller)
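
The two ratios quoted in this post measure different things: the 1.7x figure compares model sizes, while the 3.3x figure compares the VRAM of the GPUs we rented. A short check of the arithmetic, using only the numbers listed above:

```python
# Figures taken from the lists above.
fp16_size_gb = 16.0      # full precision model size
w8a8_size_gb = 9.5       # quantized model size
full_gpu_vram_gb = 80    # VRAM of the GPU used for the full precision model
quant_gpu_vram_gb = 24   # VRAM of the GPU used for the quantized model

size_ratio = fp16_size_gb / w8a8_size_gb
gpu_ratio = full_gpu_vram_gb / quant_gpu_vram_gb

print(f"Model size ratio: {size_ratio:.1f}x")
print(f"GPU VRAM ratio:   {gpu_ratio:.1f}x")
```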

Output Quality Comparison

Looking at the outputs from both models, we can observe:

  1. Structural Similarity: Both models organized their responses with similar categories (Computer Vision, NLP, Healthcare, etc.) and maintained good formatting with markdown.

  2. Content Quality: Both models produced high-quality, relevant content with appropriate examples for each application domain. The specific details varied slightly, but the core information was consistent.

  3. Reasoning Process: Both models showed similar thinking processes in their <think> sections, demonstrating that the quantized model retained the structured reasoning capabilities of the original model.

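  4. Minimal Quality Degradation: On this prompt, the quantized model showed no obvious degradation in output quality compared to the full precision model, despite being significantly smaller. A single prompt is only a spot check, so broader evaluation is advisable before drawing firm conclusions.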
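
The structural similarity can be made a little more concrete by comparing the section headings each model produced. Below is a minimal sketch using the headings from the two outputs shown earlier; `section_titles` is a hypothetical helper, and exact string matching is a deliberately crude measure (for example, "Recommender Systems" and "Recommendation Systems" count as different).

```python
import re

# Matches headings of the form: ### **1. Computer Vision**
HEADING = re.compile(r"^### \*\*\d+\.\s*(.+?)\*\*[ \t]*$", re.MULTILINE)


def section_titles(markdown: str) -> list[str]:
    """Extract numbered section titles from a markdown answer."""
    return HEADING.findall(markdown)


# Headings copied from the quantized model's answer.
quantized_answer = """
### **1. Computer Vision**
### **2. Natural Language Processing (NLP)**
### **3. Speech Recognition & Synthesis**
### **4. Autonomous Systems**
### **5. Healthcare**
### **6. Finance**
### **7. Recommender Systems**
### **8. Gaming & Entertainment**
"""

# Headings copied from the full precision model's (truncated) answer.
full_answer = """
### **1. Computer Vision**
### **2. Natural Language Processing (NLP)**
### **3. Autonomous Systems**
### **4. Healthcare**
### **5. Recommendation Systems**
### **6. Finance**
"""

shared = sorted(set(section_titles(quantized_answer)) & set(section_titles(full_answer)))
print(f"{len(shared)} shared section titles: {shared}")
```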

Key Takeaways

  1. Deployment Accessibility: The W8A8 quantized model makes deployment much more accessible, requiring far less expensive hardware while maintaining output quality.

  2. Cost Efficiency: Using the quantized model can significantly reduce cloud GPU costs by allowing deployment on less expensive hardware.

  3. Production Readiness: 8-bit quantization appears to be production-ready for this type of model, offering an excellent balance between efficiency and performance.

  4. Use Case Fit: For most general text generation and understanding tasks, the quantized model appears to be a suitable replacement for the full precision model.
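
Lower inference latency is listed among the benefits of quantization. One way to check it on your own instances is to time the same request against each endpoint. The following is a minimal sketch, assuming the response object exposes `usage.completion_tokens` as chat completions from the OpenAI SDK do; `measure_throughput` is a hypothetical helper, and the fake request below only stands in for a real call.

```python
import time
from types import SimpleNamespace
from typing import Callable


def measure_throughput(send_request: Callable[[], object]) -> tuple[float, float]:
    """Time one request and return (elapsed seconds, completion tokens per second)."""
    start = time.perf_counter()
    response = send_request()
    elapsed_s = time.perf_counter() - start
    tokens = response.usage.completion_tokens
    tokens_per_s = tokens / elapsed_s if elapsed_s > 0 else 0.0
    return elapsed_s, tokens_per_s


def fake_request() -> SimpleNamespace:
    """Stand-in for client.chat.completions.create(...), used only for this demo."""
    time.sleep(0.05)
    return SimpleNamespace(usage=SimpleNamespace(completion_tokens=100))


elapsed_s, tokens_per_s = measure_throughput(fake_request)
print(f"{elapsed_s:.2f}s, {tokens_per_s:.0f} tokens/s")
```

With a real server, pass a function that calls `client.chat.completions.create(...)` with the same prompt and parameters for both models, and repeat the measurement several times since a single request is noisy.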

This notebook demonstrates that weight and activation quantization (W8A8) using LLM-Compressor is an effective approach for deploying large language models in resource-constrained environments without significant quality compromises.

Vast.ai provides an excellent platform for this workflow, allowing us to use high-end GPUs for model compression and more cost-effective GPUs for efficient inference.
