April 8, 2025 - Vast.ai | Gemma 3, Multimodal AI, AI
Google has recently released Gemma 3, its latest open-source language model family. The models are designed to run efficiently on a single GPU and bring state-of-the-art capabilities, including advanced multimodal understanding, multilingual support, and a 128K-token context window. In this post, we'll explore the key features of Gemma 3 and demonstrate how to deploy it on Vast.ai to try one of its new capabilities: comparing images.
Gemma 3 is available in four model sizes: 1B, 4B, 12B, and 27B parameters. Each size comes in both instruction-tuned (IT) and pre-trained (PT) variants, allowing developers to choose the optimal model for their specific applications.
What makes Gemma 3 particularly accessible is its efficient architecture - all models are optimized to run on a single GPU, with the smaller variants capable of running on consumer-grade GPUs through Vast.ai's platform.
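As a rough illustration of why the smaller variants can fit on consumer-grade GPUs, the sketch below estimates the memory needed for the model weights alone. It assumes the nominal parameter counts named above (1B, 4B, 12B, 27B) and a fixed number of bytes per parameter. It ignores the KV cache, activations, and runtime overhead, so real requirements will be higher. The function name and the 4-bit figure are illustrative assumptions, not measurements.

```python
# Weights-only estimate. Assumption: nominal sizes from this post, not exact parameter counts.
NOMINAL_PARAMS_BILLIONS = {"1B": 1, "4B": 4, "12B": 12, "27B": 27}


def weight_memory_gb(params_billions, bytes_per_param=2.0):
    """Approximate weight memory in GB (10^9 bytes): parameters x bytes per parameter."""
    return params_billions * bytes_per_param


if __name__ == "__main__":
    for name, billions in NOMINAL_PARAMS_BILLIONS.items():
        sixteen_bit = weight_memory_gb(billions)       # 2 bytes per parameter
        four_bit = weight_memory_gb(billions, 0.5)     # hypothetical 4-bit quantization
        print(f"{name}: ~{sixteen_bit:.1f} GB at 16-bit, ~{four_bit:.1f} GB at 4-bit")
```

By this estimate, the 4B model needs about 8 GB for 16-bit weights, which leaves headroom on a 16 GB or 24 GB card for the cache and the image inputs.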
Gemma 3 offers multiple state-of-the-art capabilities:
Gemma 3 supports multimodal inputs, including an impressive image comparison capability. In this guide, we'll demonstrate this feature using images from the Open Image Preferences dataset, available here.
For this post, we've selected two contrasting astronaut images from the Open Image Preferences dataset:
Astronaut 1
Astronaut 2
First, let's set up a Vast.ai instance with the necessary configurations to run Gemma 3:
vastai create instance $INSTANCE_ID \
    --image vllm/vllm-openai:latest \
    --disk 40 \
    --env '-p 8000:8000 -e HUGGING_FACE_HUB_TOKEN=hf-token' \
    --args --model google/gemma-3-4b-it --gpu-memory-utilization 0.9 --max-model-len 4096 --limit-mm-per-prompt "image=2"
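Downloading and loading the model can take some time after the instance starts, so it can help to confirm that the server is answering before sending requests. The sketch below polls the `/v1/models` endpoint of vLLM's OpenAI-compatible server using only the standard library. The helper names are hypothetical, and the host and port are whatever your Vast.ai instance exposes.

```python
import json
import urllib.request


def models_url(host, port):
    """Build the URL of the OpenAI-compatible model listing endpoint."""
    return f"http://{host}:{port}/v1/models"


def served_model_ids(payload):
    """Extract model ids from a /v1/models response body (OpenAI list format)."""
    return [entry["id"] for entry in payload.get("data", [])]


def is_model_ready(host, port, model_id, timeout=5.0):
    """Return True if the server responds and lists model_id, False otherwise."""
    try:
        with urllib.request.urlopen(models_url(host, port), timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a non-JSON body: not ready yet.
        return False
    return model_id in served_model_ids(payload)
```

Calling `is_model_ready(ip, port, "google/gemma-3-4b-it")` in a loop with a short sleep is one simple way to wait for the instance to come up.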
Next, we'll write a script to call Gemma 3 on our instance. vLLM exposes an OpenAI-compatible API endpoint, allowing us to interact with Gemma 3 using the OpenAI SDK:
from openai import OpenAI
import base64

# Your Vast.ai instance details
VAST_IP_ADDRESS = ""
VAST_PORT = ""

# Initialize the client
client = OpenAI(
    api_key="EMPTY",
    base_url=f"http://{VAST_IP_ADDRESS}:{VAST_PORT}/v1"
)

# Load and convert local images to base64
def image_to_base64(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

# Convert images to base64
image1_base64 = image_to_base64("/path/to/image-1.jpg")
image2_base64 = image_to_base64("/path/to/image-2.jpg")

response = client.chat.completions.create(
    model="google/gemma-3-4b-it",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Compare these two images"
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{image1_base64}"
                    }
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{image2_base64}"
                    }
                }
            ]
        }
    ]
)

print(response.choices[0].message.content)
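If you want to reuse this request for other image pairs, the message construction can be factored into a small helper. The sketch below is one possible shape, not part of the original script. The function names are hypothetical, the MIME type is guessed from the file extension instead of being hardcoded to JPEG, and the default limit of two images mirrors the `--limit-mm-per-prompt "image=2"` setting used when launching the server.

```python
import base64
import mimetypes
from pathlib import Path

MAX_IMAGES = 2  # assumption: matches --limit-mm-per-prompt "image=2" from the launch command


def guess_image_mime(path):
    """Guess an image MIME type from the file extension, or raise ValueError."""
    mime, _ = mimetypes.guess_type(str(path))
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"Not a recognized image type: {path}")
    return mime


def to_data_url(image_bytes, mime_type):
    """Encode raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


def file_to_data_url(path):
    """Read a local image file and return it as a data URL."""
    return to_data_url(Path(path).read_bytes(), guess_image_mime(path))


def comparison_messages(prompt, image_urls, max_images=MAX_IMAGES):
    """Build a chat `messages` list with one text part followed by the image parts."""
    urls = list(image_urls)
    if not 1 <= len(urls) <= max_images:
        raise ValueError(f"Expected 1 to {max_images} images, got {len(urls)}")
    content = [{"type": "text", "text": prompt}]
    content.extend({"type": "image_url", "image_url": {"url": url}} for url in urls)
    return [{"role": "user", "content": content}]
```

The result can be passed as the `messages` argument to `client.chat.completions.create`, with each URL produced by `file_to_data_url`. The script above remains the version used to produce the results in this post.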
Here is our output from Gemma 3:
Okay, let's compare these two images of astronauts. Here's a breakdown of the key differences and similarities:
**Similarities:**
* **Subject:** Both images feature a spacesuit and the astronaut's helmet.
* **Style:** Both are highly stylized and appear to be digitally rendered images, likely created with a focus on aesthetic appeal and a sense of wonder. They evoke a futuristic and somewhat surreal feeling.
* **Helmet:** The overall shape and design of the helmet are fairly consistent between the two – a classic, bulbous, and slightly vintage-looking design.
**Differences:**
1. **Background:** This is the most significant difference.
   * **Image 1:** Features a vibrant, complex, and detailed space scene with a blue and gold starry sky, nebulas, and a visible horizon line suggesting a planet or moon. The background is rich in color and texture.
   * **Image 2:** Shows a much simpler and muted background. It’s a dusky blue sky with scattered stars and a faint suggestion of a horizon with mountains. There's no nebula or dramatic planetary features.
2. **Suit Detail & Appearance:**
   * **Image 1:** The spacesuit is more elaborate and brightly colored. It has intricate gold detailing on the arms and chest and a much more modern and polished look. It also appears to be a little more reflective.
   * **Image 2:** The suit is weathered and looks older. There's a visible rust and decay on the metal parts, suggesting it's been used and exposed to the elements. It’s a more rugged and worn appearance.
3. **Pose and Composition:**
   * **Image 1:** The astronaut is standing and facing slightly towards the camera, but the composition is broader, showing more of the suit and the expansive space.
   * **Image 2:** The astronaut is positioned at a three-quarter angle, focusing primarily on the helmet and upper body. The composition is more intimate and tightly framed.
4. **Lighting & Tone:**
   * **Image 1:** The lighting is dramatic and highlights the spacesuit with warm, glowing lights. It's a brighter and more vibrant image overall.
   * **Image 2:** The lighting is softer and cooler, creating a more subdued and melancholic atmosphere.
**In essence:**
* **Image 1** is a grand, visually stunning depiction of an astronaut experiencing the majesty of space.
* **Image 2** is a more intimate and evocative portrayal, suggesting a sense of isolation, exploration, and the passage of time.
Would you like me to elaborate on any specific aspect of the comparison, such as the artistic style or the possible symbolism of the images?
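The model provides a detailed, structured analysis of the images, demonstrating its visual understanding capabilities. The analysis covers the elements the images share (subject, style, and helmet design), four categories of differences (background, suit detail, pose and composition, and lighting and tone), and a closing summary of each image's overall character.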
This level of analysis showcases Gemma 3's advanced multimodal capabilities, particularly its ability to:
The image comparison capabilities we've demonstrated can be applied to several real-world use cases:
While image comparison showcases Gemma 3's visual understanding, the model's versatility extends far beyond visual tasks. Here are several other powerful applications you can build:
These applications can be deployed using the same Vast.ai infrastructure we've explored, allowing for efficient development and scaling.
Gemma 3 represents a significant advancement in open-source AI models, offering state-of-the-art performance while being designed to run efficiently on a single GPU. Vast.ai's infrastructure provides an ideal platform for deploying Gemma 3 in production environments, making it accessible for both development and scaling applications.