January 24, 2025 · Vast.ai · Videos · Mochi · Setup · AI
Video generation has become increasingly important in creative and professional workflows, and with recent advances in generative AI we can now produce high-quality video directly from text descriptions. Mochi, a new video generation model from Genmo, creates compelling video content from text prompts. This technology can be used for promotional content, educational materials, or artistic expression without the need for traditional video production equipment.
Mochi, built on a novel diffusion model architecture, offers state-of-the-art capabilities in text-to-video generation. In this guide, we'll show you how to set up and run Mochi for creating videos from text prompts. The notebook demonstrates basic usage including memory optimizations and parameter controls for video length and quality. While this guide focuses on single video generation, the approach can be easily adapted for batch processing or integration into larger content creation pipelines.
With Vast, you can run this computationally intensive model on powerful GPUs at affordable rates, making video generation accessible and cost-effective. You can find the notebook this guide is based on here.
To deploy on Vast, we recommend using the Vast AI Template for PyTorch: PyTorch (cuDNN Runtime). This template includes many of the required libraries and comes with SSH and JupyterLab pre-configured. While Mochi typically requires around 60GB of VRAM for single-GPU operation, we're using the Hugging Face Diffusers implementation, whose optimizations (CPU offloading and VAE tiling) let it run on our RTX A6000 (48GB). For production use cases, consider an H100 or a multi-GPU setup.
First, install the required dependencies:
pip install -r requirements.txt
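The notebook's requirements.txt isn't reproduced here, but based on what the code below imports, the supporting packages are likely along these lines (our assumption, not the exact file contents):

pip install transformers accelerate sentencepiece imageio imageio-ffmpeg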
Then, install the latest development version of diffusers, which includes the MochiPipeline:
pip install git+https://github.com/huggingface/diffusers.git
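Since MochiPipeline only ships in recent development builds of diffusers, it's worth verifying the install before going further. A quick check (our addition, not part of the original notebook):

import diffusers

print(diffusers.__version__)  # expect a dev build, e.g. 0.32.0.dev0
from diffusers import MochiPipeline  # an ImportError here means the build is too old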
Let's look at how to set up our video generation pipeline using the Mochi model:
import torch
from diffusers import MochiPipeline
from diffusers.utils import export_to_video
from IPython.display import Video, display

# Load the Mochi pipeline with bfloat16 weights to reduce the memory footprint
pipe = MochiPipeline.from_pretrained(
    "genmo/mochi-1-preview",
    variant="bf16",
    torch_dtype=torch.bfloat16,
)

# Enable memory optimizations: offload idle submodules to CPU RAM and
# decode the VAE in tiles to lower peak VRAM usage
pipe.enable_model_cpu_offload()
pipe.enable_vae_tiling()
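Before kicking off a long generation run, it can also help to confirm which GPU your instance actually exposes (ideally run this before loading the weights). A small sanity check, our addition rather than part of the original notebook:

import torch

assert torch.cuda.is_available(), "No CUDA GPU visible in this instance"
props = torch.cuda.get_device_properties(0)
print(f"{torch.cuda.get_device_name(0)}: {props.total_memory / 1024**3:.1f} GiB VRAM")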
We'll create a function that handles the video generation process:
def generate_video(prompt, output_filename="output.mp4", num_frames=90, fps=30):
    """
    Generates a video from a text prompt using the Mochi pipeline.

    :param prompt: Text prompt for video generation
    :param output_filename: Name of the output video file
    :param num_frames: Number of frames to generate
    :param fps: Frames per second for the output video
    """
    print(f"\nGenerating video for prompt: {prompt}\n")

    # Run inference under bfloat16 autocast to keep memory usage down
    with torch.autocast("cuda", torch.bfloat16, cache_enabled=False):
        frames = pipe(prompt, num_frames=num_frames).frames[0]

    # Export the generated frames to an MP4 file
    export_to_video(frames, output_filename, fps=fps)
    print(f"Video saved to: {output_filename}\n")

    # Display the video inline in the notebook
    display(Video(output_filename, embed=True))
Here are some example prompts we can use to generate videos:
# Nature close-up
prompt = "Close-up of a chameleon's eye, with its scaly skin changing color. Ultra high resolution 4k."
generate_video(prompt, "chameleon_eye.mp4")
# Landscape scene
prompt = "A serene waterfall flowing gently into a crystal clear lake surrounded by lush green forest."
generate_video(prompt, "waterfall_forest.mp4")
# Space scene
prompt = "An astronaut floating in space with Earth in the background, ultra-realistic footage, 4k."
generate_video(prompt, "astronaut_space.mp4")
# Nature scene
prompt = "A cute puppy running happily in a field of colorful flowers during sunset."
generate_video(prompt, "puppy_flowers.mp4")
Match your hardware to your needs: the model requires approximately 60GB of VRAM for standard single-GPU operation, and Mochi's documentation recommends H100 GPUs for production use.
Be aware of the model's limitations: output is 480p, and clips with extreme motion can show warping or other distortions.
Keep your prompts focused on photorealistic descriptions; the model is optimized for photorealistic styles rather than animated content.
This implementation provides a foundation for text-to-video generation using Mochi. You can modify the generation parameters to suit your needs and adapt the code for batch processing of multiple prompts, as sketched below.
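For instance, a minimal batch loop over the generate_video helper defined above might look like this (the prompt list is illustrative):

prompts = {
    "waterfall_forest_2.mp4": "A serene waterfall flowing gently into a crystal clear lake.",
    "astronaut_space_2.mp4": "An astronaut floating in space with Earth in the background, 4k.",
}
for filename, prompt in prompts.items():
    generate_video(prompt, output_filename=filename)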
With Vast, you can access the GPU power needed for video generation at an affordable price point, making AI video creation accessible and cost-effective. Happy generating!