Text to Video Generation Using HunyuanVideo on Vast

Introduction
Recent advances in AI video generation have been rapid: models now produce far more realistic videos than before, across a wide range of use cases. One such model is HunyuanVideo, known for its impressive outputs. In this guide, we explore the model and tackle the practical challenges of running it efficiently on high-memory GPUs like the A100 or H100.
HunyuanVideo is Tencent's state-of-the-art text-to-video generation model that rivals or surpasses leading closed-source alternatives. As the largest open-source video generation model with over 13 billion parameters, HunyuanVideo represents a significant breakthrough in AI-powered video creation.
We also provide a notebook to follow along with once you deploy your Vast instance.
Key Innovations
- Unified Architecture: Employs a "Dual-stream to Single-stream" transformer design that effectively handles both image and video generation.
- MLLM Text Encoder: Utilizes a multimodal large language model for superior text-to-visual alignment compared to previous text-encoder architectures.
- Efficient Compression: Implements causal 3D VAE for efficient spatial-temporal compression to enable training at higher resolution.
- Automatic Prompt Rewriting: Features intelligent prompt enhancement that rewrites user-given prompts into a form better suited to the model.
What You’ll Learn
In this guide, we will:
- Set up a custom Docker template for HunyuanVideo on Vast.ai.
- Launch the Docker template on a suitable GPU instance.
- Download pretrained models and generate high-quality videos from text prompts.
Let’s get started with HunyuanVideo!
Renting an Instance on Vast
Create a Custom Template
Tencent maintains a custom Docker image for HunyuanVideo: hunyuanvideo/hunyuanvideo:cuda_12. To run the model on Vast, we'll create a custom template using this image.
Follow these steps:
- Ensure you have a Vast.ai account.
- Navigate to the Templates page on the Vast Console.
- Find the existing template named PyTorch (CuDNN Runtime) and click Edit.
- Rename the template to HunyuanVideo.
- Replace the Docker image in the Image Path:Tag field with hunyuanvideo/hunyuanvideo:cuda_12.
- Set the On-start Script to:

git clone https://github.com/tencent/HunyuanVideo

This ensures the HunyuanVideo repository is downloaded on instance startup.
- Allocate sufficient disk space, for example, 80 GB.
- Click Save & Use to save your template and prepare to deploy it.
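If you prefer a slightly fuller on-start script, you can also change into the cloned repository and optionally pre-fetch the model weights at startup. A sketch (the /workspace path and the commented download step are assumptions; adjust to your template):

```shell
#!/bin/bash
# Clone the HunyuanVideo repository into the workspace on first boot
cd /workspace
git clone https://github.com/tencent/HunyuanVideo
cd HunyuanVideo
# Optionally pre-download weights here instead of doing it manually later:
# huggingface-cli download tencent/HunyuanVideo --local-dir ./ckpts
```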
Selecting an Instance
HunyuanVideo requires a GPU with at least 80GB VRAM to run smoothly. Available GPUs meeting this requirement include Nvidia’s A100 and H100 cards.
To select an instance:
- Filter Vast instances based on:
- Instance Type: A100 or H100
- Number of GPUs: 1
- VRAM: ≥ 80GB
- Choose and rent a suitable instance.
- Install the Vast TLS certificate in your browser, enabling secure access to your Jupyter notebook server: Installing TLS Certificate Guide.
- Open the Jupyter server via https://cloud.vast.ai/instances/, clicking Open for your rented instance.
- Upload this notebook into the directory /workspace/HunyuanVideo/ on the server.
Downloading the Model Weights
Before generating videos, we need to download pretrained model weights from Hugging Face. These include the video model weights and the text encoders.
Run the following commands inside your instance:
huggingface-cli download tencent/HunyuanVideo --local-dir ./ckpts
huggingface-cli download xtuner/llava-llama-3-8b-v1_1-transformers --local-dir ./ckpts/llava-llama-3-8b-v1_1-transformers
python hyvideo/utils/preprocess_text_encoder_tokenizer_utils.py --input_dir ckpts/llava-llama-3-8b-v1_1-transformers --output_dir ckpts/text_encoder
huggingface-cli download openai/clip-vit-large-patch14 --local-dir ./ckpts/text_encoder_2
For a more in-depth discussion of the checkpoints, refer to Tencent’s checkpoint README.
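The downloads can take a while, so before moving on it is worth sanity-checking that the expected checkpoint directories landed where the commands above put them. A minimal sketch (the directory names follow the download commands and Tencent's checkpoint README; adjust if your layout differs):

```python
from pathlib import Path

def check_checkpoints(base="ckpts"):
    """Return the list of expected checkpoint directories that are missing."""
    expected = [
        "hunyuan-video-t2v-720p",  # main video model weights
        "text_encoder",            # preprocessed MLLM text encoder
        "text_encoder_2",          # CLIP text encoder
    ]
    base = Path(base)
    return [name for name in expected if not (base / name).is_dir()]

missing = check_checkpoints()
if missing:
    print("Missing checkpoint directories:", ", ".join(missing))
else:
    print("All expected checkpoint directories found.")
```

Run this from /workspace/HunyuanVideo so the relative ckpts path resolves correctly.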
Generating Videos using HunyuanVideo
Displaying Videos in Jupyter
To conveniently view generated videos within the Jupyter notebook, let’s define a helper function using IPython.display.Video:
from IPython.display import Video

def show_video(video_path, width=640, height=360, embed=True):
    """
    Display a video in a Jupyter notebook.

    Parameters
    ----------
    video_path : str
        Path to the video file (local file or URL)
    width : int, optional
        Width of the video player in pixels
    height : int, optional
        Height of the video player in pixels
    embed : bool, optional
        Whether to embed the video in the notebook (True)
        or just link to it (False)

    Returns
    -------
    IPython.display.Video
        Video display object
    """
    return Video(video_path, width=width, height=height, embed=embed)
Generating Your First Video: Cat on the Grass
Let’s generate a realistic video of a cat walking on the grass using the example prompt provided by Tencent.
Run the following command in your terminal:
python sample_video.py \
--video-size 720 1280 \
--video-length 129 \
--infer-steps 50 \
--prompt "A cat walks on the grass, realistic style." \
--flow-reverse \
--use-cpu-offload \
--save-path ./results
This command will create a video 129 frames long (roughly 4-5 seconds at standard frame rates) at 720x1280 resolution.
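The relationship between --video-length and clip duration is simple frame arithmetic. As a quick sanity check (the frame rates here are common playback rates, not values fixed by this guide):

```python
def clip_duration_seconds(num_frames, fps=24):
    """Duration of a clip given its frame count and playback rate."""
    return num_frames / fps

# 129 frames at 24 fps plays for about 5.4 seconds;
# at 30 fps the same clip plays in 4.3 seconds.
print(round(clip_duration_seconds(129, fps=24), 1))  # 5.4
print(round(clip_duration_seconds(129, fps=30), 1))  # 4.3
```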
Once the generation finishes, locate the video file inside the ./results directory.
To display it in your notebook, run:
show_video("./results/[your_video_filename].mp4")
Replace [your_video_filename].mp4 with the actual file name.
You should see a high-quality, realistically animated cat walking on grass, showcasing HunyuanVideo’s remarkable detail in animal motion and textures.
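Rather than typing the generated filename by hand, you can have the notebook pick the newest video in the results directory automatically. A small convenience sketch (latest_video is a hypothetical helper, not part of the HunyuanVideo codebase):

```python
from pathlib import Path

def latest_video(results_dir="./results", pattern="*.mp4"):
    """Return the most recently modified video file in results_dir, or None."""
    files = sorted(Path(results_dir).glob(pattern), key=lambda p: p.stat().st_mtime)
    return files[-1] if files else None

# Usage in the notebook:
# show_video(str(latest_video()))
```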
Creating a Video of an Astronaut Walking on the Moon
Next, try generating a completely different scene:
python sample_video.py \
--video-size 720 1280 \
--video-length 129 \
--infer-steps 50 \
--prompt "An astronaut walks across the moon, realistic style." \
--flow-reverse \
--use-cpu-offload \
--save-path ./results
After the video is generated, display it in your notebook the same way:
show_video("./results/[your_astronaut_video].mp4")
The resulting video will demonstrate the model’s versatility at rendering vastly different environments and characters, from furry animals to astronauts in space suits.
Conclusion and Next Steps
You've now successfully generated your first videos using HunyuanVideo on a Vast-powered cloud instance. This powerful model unlocks creative possibilities in AI-driven video generation with:
- Flexible Resolution Options: Try different --video-size values (e.g., 1280×720, 960×960) to optimize for your needs.
- Adjustable Quality Settings: Modify --infer-steps (default 50) to trade off video generation quality and speed.
- Creative Control: Experiment with --embedded-cfg-scale (default 6.0) to balance prompt fidelity and creative variance.
- Deterministic Outputs: Set --seed to reproduce favorite generated videos reliably.
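These flags compose freely. As an illustration, here is a small helper that assembles a sample_video.py invocation from Python (the flag names mirror the commands earlier in this guide; build_command itself is a hypothetical convenience, and the defaults should be checked against Tencent's documentation):

```python
def build_command(prompt, video_size=(720, 1280), video_length=129,
                  infer_steps=50, cfg_scale=6.0, seed=None):
    """Assemble the argument list for a sample_video.py run."""
    cmd = [
        "python", "sample_video.py",
        "--video-size", str(video_size[0]), str(video_size[1]),
        "--video-length", str(video_length),
        "--infer-steps", str(infer_steps),
        "--embedded-cfg-scale", str(cfg_scale),
        "--prompt", prompt,
        "--flow-reverse",
        "--use-cpu-offload",
        "--save-path", "./results",
    ]
    if seed is not None:
        cmd += ["--seed", str(seed)]  # fixed seed for reproducible outputs
    return cmd

print(" ".join(build_command("A cat walks on the grass, realistic style.", seed=42)))
```

You could pass the resulting list to subprocess.run from the notebook instead of retyping the command in a terminal.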
Deploying large models like HunyuanVideo can be resource-intensive but is now accessible with cloud platforms like Vast, giving you access to top-tier GPUs without upfront hardware costs.
With these foundations, you’re ready to explore more sophisticated prompt engineering, fine-tune generation parameters, or even integrate HunyuanVideo into multimedia projects!