Text to Video Generation Using HunyuanVideo on Vast

Introduction
Recent advances in AI video generation have been rapid: models now produce far more realistic videos than before, across a wide range of use cases. One such model is HunyuanVideo, known for its impressive outputs. In this guide, we explore the model and tackle the practical challenges of running it efficiently on high-memory GPUs like the A100 or H100.
HunyuanVideo is Tencent's state-of-the-art text-to-video generation model that rivals or surpasses leading closed-source alternatives. As the largest open-source video generation model with over 13 billion parameters, HunyuanVideo represents a significant breakthrough in AI-powered video creation.
We also provide a notebook to follow along with once you deploy your Vast instance.
Key Innovations
- Unified Architecture: Employs a "Dual-stream to Single-stream" transformer design that effectively handles both image and video generation.
- MLLM Text Encoder: Utilizes a multimodal large language model for superior text-to-visual alignment compared to previous text-encoder architectures.
- Efficient Compression: Implements causal 3D VAE for efficient spatial-temporal compression to enable training at higher resolution.
- Automatic Prompt Rewriting: Features intelligent prompt enhancement that rewrites user-given prompts into a form better suited to the model.
What You’ll Learn
In this guide, we will:
- Set up a custom Docker template for HunyuanVideo on Vast.ai.
- Launch the Docker template on a suitable GPU instance.
- Download pretrained models and generate high-quality videos from text prompts.
Let’s get started with HunyuanVideo!
Renting an Instance on Vast
Create a Custom Template
Tencent maintains a custom Docker image for HunyuanVideo: hunyuanvideo/hunyuanvideo:cuda_12. To run the model on Vast, we'll create a custom template using this image.
Follow these steps:
- Ensure you have a Vast.ai account.
- Navigate to the Templates page on the Vast Console.
- Find the existing template named PyTorch (CuDNN Runtime) and click Edit.
- Rename the template to HunyuanVideo.
- Replace the Docker image in the Image Path:Tag field with hunyuanvideo/hunyuanvideo:cuda_12.
- Set the On-start Script to:

git clone https://github.com/tencent/HunyuanVideo

This ensures the HunyuanVideo repository is downloaded on instance startup.
- Allocate sufficient disk space, for example, 80 GB.
- Click Save & Use to save your template and prepare to deploy it.
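If you prefer a slightly fuller on-start script, you can also change into the cloned repository and optionally pre-fetch the model weights at startup. A sketch (the /workspace path and the commented download step are assumptions; adjust to your template):

```shell
#!/bin/bash
# Clone the HunyuanVideo repository into the workspace on first boot
cd /workspace
git clone https://github.com/tencent/HunyuanVideo
cd HunyuanVideo
# Optionally pre-download weights here instead of doing it manually later:
# huggingface-cli download tencent/HunyuanVideo --local-dir ./ckpts
```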
Selecting an Instance
HunyuanVideo requires a GPU with at least 80GB VRAM to run smoothly. Available GPUs meeting this requirement include Nvidia’s A100 and H100 cards.
To select an instance:
- Filter Vast instances based on:
- Instance Type: A100 or H100
- Number of GPUs: 1
- VRAM: ≥ 80GB
- Choose and rent a suitable instance.
- Install the Vast TLS certificate in your browser, enabling secure access to your Jupyter notebook server: Installing TLS Certificate Guide.
- Open the Jupyter server via https://cloud.vast.ai/instances/, clicking Open for your rented instance.
- Upload this notebook into the directory /workspace/HunyuanVideo/ on the server.
Downloading the Model Weights
Before generating videos, we need to download pretrained model weights from Hugging Face. These include the video model weights and the text encoders.
Run the following commands inside your instance:
huggingface-cli download tencent/HunyuanVideo --local-dir ./ckpts
huggingface-cli download xtuner/llava-llama-3-8b-v1_1-transformers --local-dir ./ckpts/llava-llama-3-8b-v1_1-transformers
python hyvideo/utils/preprocess_text_encoder_tokenizer_utils.py --input_dir ckpts/llava-llama-3-8b-v1_1-transformers --output_dir ckpts/text_encoder
huggingface-cli download openai/clip-vit-large-patch14 --local-dir ./ckpts/text_encoder_2
For a more in-depth discussion of the checkpoints, refer to Tencent’s checkpoint README.
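The downloads can take a while, so before moving on it is worth sanity-checking that the expected checkpoint directories landed where the commands above put them. A minimal sketch (the directory names follow the download commands and Tencent's checkpoint README; adjust if your layout differs):

```python
from pathlib import Path

def check_checkpoints(base="ckpts"):
    """Return the list of expected checkpoint directories that are missing."""
    expected = [
        "hunyuan-video-t2v-720p",  # main video model weights
        "text_encoder",            # preprocessed MLLM text encoder
        "text_encoder_2",          # CLIP text encoder
    ]
    base = Path(base)
    return [name for name in expected if not (base / name).is_dir()]

missing = check_checkpoints()
if missing:
    print("Missing checkpoint directories:", ", ".join(missing))
else:
    print("All expected checkpoint directories found.")
```

Run this from /workspace/HunyuanVideo so the relative ckpts path resolves correctly.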
Generating Videos using HunyuanVideo
Displaying Videos in Jupyter
To conveniently view generated videos within the Jupyter notebook, let’s define a helper function using IPython.display.Video:
from IPython.display import Video

def show_video(video_path, width=640, height=360, embed=True):
    """
    Display a video in a Jupyter notebook.

    Parameters
    ----------
    video_path : str
        Path to the video file (local file or URL)
    width : int, optional
        Width of the video player in pixels
    height : int, optional
        Height of the video player in pixels
    embed : bool, optional
        Whether to embed the video in the notebook (True)
        or just link to it (False)

    Returns
    -------
    IPython.display.Video
        Video display object
    """
    return Video(video_path, width=width, height=height, embed=embed)
Generating Your First Video: Cat on the Grass
Let’s generate a realistic video of a cat walking on the grass using the example prompt provided by Tencent.
Run the following command in your terminal:
python sample_video.py \
--video-size 720 1280 \
--video-length 129 \
--infer-steps 50 \
--prompt "A cat walks on the grass, realistic style." \
--flow-reverse \
--use-cpu-offload \
--save-path ./results
This command will create a video 129 frames long (roughly 4-5 seconds at standard frame rates) at 720x1280 resolution.
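The relationship between --video-length and clip duration is simple frame arithmetic. As a quick sanity check (the frame rates here are common playback rates, not values fixed by this guide):

```python
def clip_duration_seconds(num_frames, fps=24):
    """Duration of a clip given its frame count and playback rate."""
    return num_frames / fps

# 129 frames at 24 fps plays for about 5.4 seconds;
# at 30 fps the same clip plays in 4.3 seconds.
print(round(clip_duration_seconds(129, fps=24), 1))  # 5.4
print(round(clip_duration_seconds(129, fps=30), 1))  # 4.3
```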
Once the generation finishes, locate the video file inside the ./results directory.
To display it in your notebook, run:
show_video("./results/[your_video_filename].mp4")
Replace [your_video_filename].mp4 with the actual file name.
You should see a high-quality, realistically animated cat walking on grass, showcasing HunyuanVideo’s remarkable detail in animal motion and textures.
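Rather than typing the generated filename by hand, you can have the notebook pick the newest video in the results directory automatically. A small convenience sketch (latest_video is a hypothetical helper, not part of the HunyuanVideo codebase):

```python
from pathlib import Path

def latest_video(results_dir="./results", pattern="*.mp4"):
    """Return the most recently modified video file in results_dir, or None."""
    files = sorted(Path(results_dir).glob(pattern), key=lambda p: p.stat().st_mtime)
    return files[-1] if files else None

# Usage in the notebook:
# show_video(str(latest_video()))
```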
Creating a Video of an Astronaut Walking on the Moon
Next, try generating a completely different scene:
python sample_video.py \
--video-size 720 1280 \
--video-length 129 \
--infer-steps 50 \
--prompt "An astronaut walks across the moon, realistic style." \
--flow-reverse \
--use-cpu-offload \
--save-path ./results
After the video is generated, display it in your notebook the same way:
show_video("./results/[your_astronaut_video].mp4")
The resulting video will demonstrate the model’s versatility at rendering vastly different environments and characters, from furry animals to astronauts in space suits.
Conclusion and Next Steps
You've now successfully generated your first videos using HunyuanVideo on a Vast-powered cloud instance. This powerful model unlocks creative possibilities in AI-driven video generation with:
- Flexible Resolution Options: Try different --video-size values (e.g., 1280×720, 960×960) to optimize for your needs.
- Adjustable Quality Settings: Modify --infer-steps (default 50) to trade off video generation quality and speed.
- Creative Control: Experiment with --embedded-cfg-scale (default 6.0) to balance prompt fidelity and creative variance.
- Deterministic Outputs: Set --seed to reproduce favorite generated videos reliably.
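These flags compose freely. As an illustration, here is a small helper that assembles a sample_video.py invocation from Python (the flag names mirror the commands earlier in this guide; build_command itself is a hypothetical convenience, and the defaults should be checked against Tencent's documentation):

```python
def build_command(prompt, video_size=(720, 1280), video_length=129,
                  infer_steps=50, cfg_scale=6.0, seed=None):
    """Assemble the argument list for a sample_video.py run."""
    cmd = [
        "python", "sample_video.py",
        "--video-size", str(video_size[0]), str(video_size[1]),
        "--video-length", str(video_length),
        "--infer-steps", str(infer_steps),
        "--embedded-cfg-scale", str(cfg_scale),
        "--prompt", prompt,
        "--flow-reverse",
        "--use-cpu-offload",
        "--save-path", "./results",
    ]
    if seed is not None:
        cmd += ["--seed", str(seed)]  # fixed seed for reproducible outputs
    return cmd

print(" ".join(build_command("A cat walks on the grass, realistic style.", seed=42)))
```

You could pass the resulting list to subprocess.run from the notebook instead of retyping the command in a terminal.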
Deploying large models like HunyuanVideo can be resource-intensive but is now accessible with cloud platforms like Vast, giving you access to top-tier GPUs without upfront hardware costs.
With these foundations, you’re ready to explore more sophisticated prompt engineering, fine-tune generation parameters, or even integrate HunyuanVideo into multimedia projects!