Running Google's Gemma 3 on Vast.ai

April 8, 2025

By Team Vast

What Makes Gemma 3 Special

Gemma 3 is available in four model sizes: 1B, 4B, 12B, and 27B parameters. Each size comes in both instruction-tuned (IT) and pre-trained (PT) variants, allowing developers to choose the optimal model for their specific applications.

What makes Gemma 3 particularly accessible is its efficient architecture: all models are optimized to run on a single GPU, and the smaller variants can run on consumer-grade GPUs available through Vast.ai's platform.
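As a rough illustration of why model size matters when choosing a GPU, the sketch below estimates the memory needed for the weights alone. The nominal parameter counts are taken from the model names, and the bytes-per-parameter figures (2 for bfloat16, 0.5 for 4-bit quantization) are assumptions for illustration. Real usage is higher once the KV cache, activations, and framework overhead are included.

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Nominal sizes come from the model names; actual parameter counts differ slightly.
NOMINAL_PARAMS_BILLIONS = {"1B": 1, "4B": 4, "12B": 12, "27B": 27}

# Assumed storage cost per parameter for two common precisions.
BYTES_PER_PARAM = {"bf16": 2.0, "int4": 0.5}


def weight_memory_gb(size: str, precision: str) -> float:
    """Return the approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    params = NOMINAL_PARAMS_BILLIONS[size] * 1e9
    return params * BYTES_PER_PARAM[precision] / 1e9


# Print one line per model size with both precisions.
for size in NOMINAL_PARAMS_BILLIONS:
    row = ", ".join(
        f"{precision}: ~{weight_memory_gb(size, precision):.1f} GB"
        for precision in BYTES_PER_PARAM
    )
    print(f"Gemma 3 {size} -> {row}")
```

By this estimate, the 4B model used later in this guide needs roughly 8 GB for its bfloat16 weights, before any runtime overhead.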

Gemma 3 offers multiple state-of-the-art capabilities:

1. Enhanced Context Window

Supports a context window of up to 128K tokens (32K for the 1B model)
Improved long-range understanding and coherence
Better handling of complex, multi-turn conversations

2. Multi-Modal Capabilities

Native image understanding and analysis (available in 4B, 12B, and 27B models only)
Support for both text and image inputs in the same prompt
Advanced ability to analyze images, text, and short videos

3. Improved Performance

State-of-the-art performance for its size
Enhanced reasoning capabilities
Better code generation and understanding

4. Multilingual Support

Out-of-the-box support for over 35 languages
Pretrained support for over 140 languages

5. Additional Features

Function calling support for automated workflows
Official quantized versions for faster performance
Optimized for various hardware platforms (NVIDIA GPUs, Google Cloud TPUs, AMD GPUs)

Using Gemma 3 For Image Comparison On Vast

Gemma 3 supports multimodal inputs, including the ability to compare images. In this guide, we'll demonstrate this feature using images from the Open Image Preferences dataset.

Image Selection

For this post, we've selected two contrasting astronaut images from the Open Image Preferences dataset:

Astronaut 1

Astronaut 2

Launch an Instance

First, let's set up a Vast.ai instance with the necessary configurations to run Gemma 3:

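# $INSTANCE_ID is the ID of the offer you selected (for example, from `vastai search offers`).
# Replace hf-token with your own Hugging Face access token.
vastai create instance $INSTANCE_ID \
    --image vllm/vllm-openai:latest \
    --disk 40 \
    --env '-p 8000:8000 -e HUGGING_FACE_HUB_TOKEN=hf-token' \
    --args --model google/gemma-3-4b-it --gpu-memory-utilization 0.9 --max-model-len 4096 --limit-mm-per-prompt "image=2"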
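The model takes some time to download and load after the instance starts. Here is a minimal sketch for waiting until the server is ready, assuming the vLLM container is reachable at the instance's public IP and mapped port. It polls the OpenAI-compatible /v1/models endpoint until it responds. The helper names fetch_model_ids and wait_for_server, and the retry and delay values, are illustrative rather than part of any Vast.ai or vLLM tooling.

```python
import json
import time
import urllib.request


def fetch_model_ids(base_url: str, timeout: float = 5.0) -> list[str]:
    """Query the OpenAI-compatible /models endpoint and return the served model IDs."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
        payload = json.load(resp)
    return [model["id"] for model in payload.get("data", [])]


def wait_for_server(base_url, fetch=fetch_model_ids, retries=60, delay=10.0):
    """Poll until the server answers; return the model IDs or raise TimeoutError."""
    for _ in range(retries):
        try:
            return fetch(base_url)
        except OSError:
            # Covers connection refusals, timeouts, and HTTP errors from urllib
            # while the container is still starting. Wait, then retry.
            time.sleep(delay)
    raise TimeoutError(f"Server at {base_url} did not become ready")


# Example call (fill in your own instance details):
# wait_for_server(f"http://{VAST_IP_ADDRESS}:{VAST_PORT}/v1")
```

Once the call returns a list containing google/gemma-3-4b-it, the endpoint is ready to accept chat completion requests.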

Call Gemma 3 Using the OpenAI SDK

Next, we'll write a script to call Gemma 3 on our instance. vLLM exposes an OpenAI-compatible API endpoint, allowing us to interact with Gemma 3 using the OpenAI SDK:

from openai import OpenAI
import base64

# Your Vast.ai instance details: the public IP and the external port mapped to container port 8000
VAST_IP_ADDRESS = ""
VAST_PORT = ""

# Initialize the client
client = OpenAI(
    api_key="EMPTY",
    base_url=f"http://{VAST_IP_ADDRESS}:{VAST_PORT}/v1"
)

# Load and convert local images to base64
def image_to_base64(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

# Convert images to base64
image1_base64 = image_to_base64("/path/to/image-1.jpg")
image2_base64 = image_to_base64("/path/to/image-2.jpg")

response = client.chat.completions.create(
    model="google/gemma-3-4b-it",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Compare these two images"
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{image1_base64}"
                    }
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{image2_base64}"
                    }
                }
            ]
        }
    ]
)

print(response.choices[0].message.content)

Here is our output from Gemma 3:

Okay, let's compare these two images of astronauts. Here's a breakdown of the key differences and similarities:

**Similarities:**

* **Subject:** Both images feature a spacesuit and the astronaut's helmet.
* **Style:**  Both are highly stylized and appear to be digitally rendered images, likely created with a focus on aesthetic appeal and a sense of wonder. They evoke a futuristic and somewhat surreal feeling.
* **Helmet:** The overall shape and design of the helmet are fairly consistent between the two – a classic, bulbous, and slightly vintage-looking design.



**Differences:**

1. **Background:** This is the most significant difference.
   * **Image 1:** Features a vibrant, complex, and detailed space scene with a blue and gold starry sky, nebulas, and a visible horizon line suggesting a planet or moon. The background is rich in color and texture.
   * **Image 2:** Shows a much simpler and muted background. It’s a dusky blue sky with scattered stars and a faint suggestion of a horizon with mountains. There's no nebula or dramatic planetary features.

2. **Suit Detail & Appearance:**
    * **Image 1:** The spacesuit is more elaborate and brightly colored. It has intricate gold detailing on the arms and chest and a much more modern and polished look. It also appears to be a little more reflective.
    * **Image 2:** The suit is weathered and looks older. There's a visible rust and decay on the metal parts, suggesting it's been used and exposed to the elements. It’s a more rugged and worn appearance.

3. **Pose and Composition:**
   * **Image 1:** The astronaut is standing and facing slightly towards the camera, but the composition is broader, showing more of the suit and the expansive space.
   * **Image 2:** The astronaut is positioned at a three-quarter angle, focusing primarily on the helmet and upper body. The composition is more intimate and tightly framed.

4. **Lighting & Tone:**
   * **Image 1:**  The lighting is dramatic and highlights the spacesuit with warm, glowing lights. It's a brighter and more vibrant image overall.
   * **Image 2:** The lighting is softer and cooler, creating a more subdued and melancholic atmosphere.

**In essence:**

* **Image 1** is a grand, visually stunning depiction of an astronaut experiencing the majesty of space.
* **Image 2** is a more intimate and evocative portrayal, suggesting a sense of isolation, exploration, and the passage of time.



Would you like me to elaborate on any specific aspect of the comparison, such as the artistic style or the possible symbolism of the images?

Image Comparison Analysis

The model provides a detailed, structured analysis of the images, demonstrating its visual understanding capabilities. The analysis includes:

Comprehensive Comparison: The model systematically breaks down both similarities and differences between the images
Structured Output: Results are organized into clear categories (Similarities, Differences, In essence)
Detailed Observations: The model captures both obvious and subtle details, from overall composition to specific visual elements
Contextual Understanding: The analysis includes interpretation of artistic style, mood, and potential symbolism
Interactive Engagement: The model offers to elaborate on specific aspects of the comparison

This level of analysis showcases Gemma 3's advanced multimodal capabilities, particularly its ability to:

Process and compare multiple images simultaneously
Provide nuanced visual analysis
Generate structured, coherent responses
Maintain context throughout the analysis

Next Steps

The image comparison capabilities we've demonstrated can be applied to several real-world use cases:

Visual Quality Control: Compare product images for manufacturing defects
Design Evolution: Analyze changes between design iterations
Medical Imaging: Compare diagnostic images for changes or anomalies
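To adapt the earlier script to use cases like these, one option is to wrap the payload construction in a helper. The sketch below is an illustration, not part of the original script. The names image_to_data_url and build_comparison_messages are hypothetical. The MIME type is guessed from the file extension with Python's mimetypes module instead of being hard-coded to image/jpeg, and the default limit of two images mirrors the --limit-mm-per-prompt "image=2" setting used when launching the instance.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(image_path: str) -> str:
    """Encode a local image as a base64 data URL, guessing the MIME type from the extension."""
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot determine an image MIME type for {image_path}")
    encoded = base64.b64encode(Path(image_path).read_bytes()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


def build_comparison_messages(prompt: str, image_paths: list[str], max_images: int = 2) -> list[dict]:
    """Build a single-turn chat payload: one text part followed by the images."""
    if len(image_paths) > max_images:
        raise ValueError(f"Server is configured for at most {max_images} images per prompt")
    content = [{"type": "text", "text": prompt}]
    for path in image_paths:
        content.append({"type": "image_url", "image_url": {"url": image_to_data_url(path)}})
    return [{"role": "user", "content": content}]
```

The returned list can be passed as the messages argument to client.chat.completions.create, with a task-specific prompt such as "List any visible differences between the first and second image."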

Beyond Image Analysis: Additional Capabilities

While image comparison showcases Gemma 3's visual understanding, the model's versatility extends far beyond visual tasks. Here are several other powerful applications you can build:

Multilingual Chatbots: Create AI assistants that can communicate in 140+ languages
Document Analysis: Leverage the 128K token context window for processing extensive documents
Educational Tools: Build interactive learning systems with multilingual support
Visual Search: Combine text and image understanding for sophisticated search applications

These applications can be deployed using the same Vast.ai infrastructure we've explored, allowing for efficient development and scaling.
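One practical caveat for document analysis: the instance launched earlier sets --max-model-len 4096, which caps the prompt and the generated reply combined, so longer documents need a larger value for that flag. The sketch below gives a quick pre-check before sending a document. It relies on an assumed heuristic of roughly four characters per token for English text, and the reserved_for_output default is a hypothetical value. For exact counts, use the model's tokenizer instead.

```python
# Assumed heuristic: roughly 4 characters per token for English text.
# This is an approximation; the model's tokenizer gives exact counts.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Return a rough token estimate based on character count (rounded up)."""
    return -(-len(text) // CHARS_PER_TOKEN)  # ceiling division


def fits_in_context(text: str, max_model_len: int = 4096, reserved_for_output: int = 512) -> bool:
    """Check whether the text likely fits, leaving room for the generated reply."""
    return estimate_tokens(text) <= max_model_len - reserved_for_output
```

If fits_in_context returns False, either relaunch with a higher --max-model-len (GPU memory permitting) or split the document into smaller pieces.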

Conclusion

Gemma 3 represents a significant advancement in open AI models, offering state-of-the-art performance for its size while being designed to run efficiently on a single GPU. Vast.ai's infrastructure provides a platform for deploying Gemma 3 in production environments, making it accessible both for development and for scaling applications.