January 24, 2025 · Vast.ai · Videos · Mochi · Setup · AI
Video generation has become increasingly important in creative and professional workflows, and with recent advances in generative AI we can now produce high-quality video directly from text descriptions. Mochi, a new video generation model from Genmo, creates compelling video content from text prompts. This technology can be used for promotional content, educational materials, or artistic expression without the need for traditional video production equipment.
Mochi, built on a novel diffusion model architecture, offers state-of-the-art capabilities in text-to-video generation. In this guide, we'll show you how to set up and run Mochi for creating videos from text prompts. The notebook demonstrates basic usage including memory optimizations and parameter controls for video length and quality. While this guide focuses on single video generation, the approach can be easily adapted for batch processing or integration into larger content creation pipelines.
With Vast, you can run this computationally intensive model on powerful GPUs at affordable rates, making video generation accessible and cost-effective. You can find the notebook this guide is based on here.
To deploy on Vast, we recommend using the Vast AI Template for PyTorch: PyTorch (cuDNN Runtime). This template includes many of the required libraries and comes with SSH and JupyterLab pre-configured. While Mochi typically requires around 60GB of VRAM for single-GPU operation, we're using the Hugging Face Diffusers implementation, whose optimizations (CPU offloading and VAE tiling) let it run on our RTX A6000 (48GB). For production use cases, consider an H100 or a multi-GPU setup.
First, install the required dependencies:
pip install -r requirements.txt
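The notebook's requirements.txt isn't reproduced here, but based on what the code below imports, the supporting packages are likely along these lines (our assumption, not the exact file contents):

pip install transformers accelerate sentencepiece imageio imageio-ffmpeg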
Then, install the latest development version of diffusers, which includes the MochiPipeline:
pip install git+https://github.com/huggingface/diffusers.git
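Since MochiPipeline only ships in recent development builds of diffusers, it's worth verifying the install before going further. A quick check (our addition, not part of the original notebook):

import diffusers

print(diffusers.__version__)  # expect a dev build, e.g. 0.32.0.dev0
from diffusers import MochiPipeline  # an ImportError here means the build is too old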
Let's look at how to set up our video generation pipeline using the Mochi model:
import torch
from diffusers import MochiPipeline
from diffusers.utils import export_to_video
from IPython.display import Video, display

# Load the Mochi pipeline with bfloat16 weights to reduce the memory footprint
pipe = MochiPipeline.from_pretrained(
    "genmo/mochi-1-preview",
    variant="bf16",
    torch_dtype=torch.bfloat16,
)

# Enable memory optimizations: offload idle submodules to CPU RAM and
# decode the VAE in tiles to lower peak VRAM usage
pipe.enable_model_cpu_offload()
pipe.enable_vae_tiling()
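Before kicking off a long generation run, it can also help to confirm which GPU your instance actually exposes (ideally run this before loading the weights). A small sanity check, our addition rather than part of the original notebook:

import torch

assert torch.cuda.is_available(), "No CUDA GPU visible in this instance"
props = torch.cuda.get_device_properties(0)
print(f"{torch.cuda.get_device_name(0)}: {props.total_memory / 1024**3:.1f} GiB VRAM")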
We'll create a function that handles the video generation process:
def generate_video(prompt, output_filename="output.mp4", num_frames=90, fps=30):
    """
    Generates a video from a text prompt using the Mochi pipeline.

    :param prompt: Text prompt for video generation
    :param output_filename: Name of the output video file
    :param num_frames: Number of frames to generate
    :param fps: Frames per second for the output video
    """
    print(f"\nGenerating video for prompt: {prompt}\n")

    # Run inference under bfloat16 autocast to keep memory usage down
    with torch.autocast("cuda", torch.bfloat16, cache_enabled=False):
        frames = pipe(prompt, num_frames=num_frames).frames[0]

    # Export the generated frames to an MP4 file
    export_to_video(frames, output_filename, fps=fps)
    print(f"Video saved to: {output_filename}\n")

    # Display the video inline in the notebook
    display(Video(output_filename, embed=True))
Here are some example prompts we can use to generate videos:
# Nature close-up
prompt = "Close-up of a chameleon's eye, with its scaly skin changing color. Ultra high resolution 4k."
generate_video(prompt, "chameleon_eye.mp4")
# Landscape scene
prompt = "A serene waterfall flowing gently into a crystal clear lake surrounded by lush green forest."
generate_video(prompt, "waterfall_forest.mp4")
# Space scene
prompt = "An astronaut floating in space with Earth in the background, ultra-realistic footage, 4k."
generate_video(prompt, "astronaut_space.mp4")
# Nature scene
prompt = "A cute puppy running happily in a field of colorful flowers during sunset."
generate_video(prompt, "puppy_flowers.mp4")
Match your hardware to your needs: the model requires approximately 60GB of VRAM for standard single-GPU operation, and Mochi's documentation recommends H100 GPUs for production use.
Be aware of the model's limitations: output is 480p, and clips with extreme motion can show warping or other distortions.
Keep your prompts focused on photorealistic descriptions; the model is optimized for photorealistic styles rather than animated content.
This implementation provides a foundation for text-to-video generation using Mochi. You can modify the generation parameters to suit your needs and adapt the code for batch processing of multiple prompts, as sketched below.
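For instance, a minimal batch loop over the generate_video helper defined above might look like this (the prompt list is illustrative):

prompts = {
    "waterfall_forest_2.mp4": "A serene waterfall flowing gently into a crystal clear lake.",
    "astronaut_space_2.mp4": "An astronaut floating in space with Earth in the background, 4k.",
}
for filename, prompt in prompts.items():
    generate_video(prompt, output_filename=filename)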
With Vast, you can access the GPU power needed for video generation at an affordable price point, making AI video creation accessible and cost-effective. Happy generating!