AI has now gone multimodal: instead of just exchanging chat messages with a chatbot, there are compelling use cases involving audio, video, and images. One standout product is NotebookLM from Google, which has taken the AI community by storm. It takes in YouTube videos, PDFs, or other documents, derives insights, and even creates a podcast from the notes and original material.
Meta's Llama team released Notebook Llama, an open-source version of this flow that lets users run their own models to create podcasts from their own material. In this tutorial, we will show you how to put a panel of AI researchers in the palm of your hand for the materials you care about. Let's dive into how you can set this up using Vast.ai's GPU marketplace.
Notebook Llama is an innovative pipeline that transforms PDF documents into podcast-ready audio content through a series of four notebooks, each handling a specific part of the conversion process:
The notebooks that we'll use in this tutorial are slightly modified from their GitHub counterparts. You can find them at the following links:
Once the instance is running, you can connect to it directly via the console in Vast.ai. More information about this and Jupyter notebooks on Vast can be found here.
Before you begin, make sure the `resources` folder and `requirements.txt` are present. Once your instance is running:

Navigate to the `app` directory:

```shell
cd app
```

Install the dependencies:

```shell
pip install -r requirements.txt
pip install git+https://github.com/huggingface/parler-tts.git
sudo apt-get install ffmpeg
```

Log in to Hugging Face so the notebooks can download the models:

```shell
huggingface-cli login
```
The first notebook handles the crucial task of converting your PDF into clean, structured text. It uses PyPDF2 for initial extraction and Llama 3.2-3B-Instruct for intelligent text cleaning. What makes this approach unique is its use of a lightweight language model instead of traditional regex-based cleaning.
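Because the model has a limited context window, the extracted text has to be fed to it in pieces. A minimal sketch of word-bounded chunking (the function name and chunk size here are illustrative, not the notebook's exact code):

```python
def chunk_by_words(text: str, max_chars: int = 1000) -> list[str]:
    """Split text into chunks of at most max_chars, breaking only on word boundaries."""
    chunks, current, length = [], [], 0
    for word in text.split():
        # +1 accounts for the joining space
        if length + len(word) + 1 > max_chars and current:
            chunks.append(" ".join(current))
            current, length = [], 0
        current.append(word)
        length += len(word) + 1
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each chunk is then sent to Llama 3.2-3B-Instruct with a cleaning prompt, and the cleaned chunks are concatenated back together.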
This stage transforms the cleaned text into a natural dialogue using the Llama model, which recasts the material as a two-speaker conversation.
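The model's raw output can then be split into per-speaker turns for the later stages. A minimal sketch, assuming the prompt asks the model to label each line as `Speaker 1:` or `Speaker 2:` (the parsing approach here is illustrative):

```python
import re

def parse_transcript(raw: str) -> list[tuple[str, str]]:
    """Parse 'Speaker N: text' lines into (speaker, text) tuples."""
    turns = []
    for line in raw.splitlines():
        m = re.match(r"\s*(Speaker [12]):\s*(.+)", line)
        if m:
            turns.append((m.group(1), m.group(2).strip()))
    return turns
```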
The third notebook uses Llama 3.2 3B Instruct to refine the conversation further, prompting the model to act as an "Oscar-winning screenwriter" and add dramatic polish to the dialogue.
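A convenient way to hand the rewritten script to the TTS stage is to have the model return it as a literal Python list of `(speaker, line)` tuples, which can be parsed safely with the standard library (the exact exchange format is an assumption; `ast.literal_eval` is a real stdlib function):

```python
import ast

# Example of what the rewriting model might return as plain text
model_output = """[
    ("Speaker 1", "Welcome to the show! Today we're digging into GPUs."),
    ("Speaker 2", "Umm, so what exactly is a GPU marketplace?"),
]"""

# literal_eval parses the list-of-tuples string without executing arbitrary code
script = ast.literal_eval(model_output)
```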
The final stage brings your podcast to life using two different TTS models, one for each speaker.
This combination creates a dynamic conversation rather than monotonous single-voice narration.
For this specific notebook, we created a separate speaker description for Speaker 2, and each speaker uses a different TTS model.
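The per-speaker routing boils down to a dispatch table: each turn in the script is synthesized by that speaker's model and the audio segments are concatenated. A minimal sketch with placeholder synthesizers (the function names are hypothetical; in the notebook each would call a real TTS model):

```python
# Placeholder synthesizers: each returns a silent "waveform" (list of samples)
# whose length tracks the input. The real notebook calls an actual TTS model,
# with a custom speaker description for Speaker 2.
def synth_speaker_1(text: str) -> list[float]:
    return [0.0] * len(text)

def synth_speaker_2(text: str) -> list[float]:
    return [0.0] * len(text)

VOICES = {"Speaker 1": synth_speaker_1, "Speaker 2": synth_speaker_2}

def render_podcast(script: list[tuple[str, str]]) -> list[float]:
    audio = []
    for speaker, line in script:
        audio.extend(VOICES[speaker](line))  # route each turn to its model
    return audio
```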
Each of these notebooks writes a file to the resources folder for the next notebook to consume. At the end, you'll have an MP3 file that serves as your podcast. We've included one in the folder so you can take a listen.
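This handoff between notebooks can be as simple as pickling each stage's output into the shared folder. A sketch of the pattern (the filename and directory here are illustrative, not the repository's exact paths):

```python
import os
import pickle
import tempfile

# Stand-in for the repo's resources/ folder
resources = tempfile.mkdtemp()

# Stage N writes its output...
cleaned_text = "GPU marketplaces let you rent compute by the hour."
with open(os.path.join(resources, "cleaned_text.pkl"), "wb") as f:
    pickle.dump(cleaned_text, f)

# ...and stage N+1 picks it up in the next notebook
with open(os.path.join(resources, "cleaned_text.pkl"), "rb") as f:
    stage_input = pickle.load(f)
```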
These notebooks leverage multiple large language models and TTS systems, which means running several models at once and requires significant GPU resources. Vast.ai's GPU marketplace provides affordable, on-demand access to exactly this kind of hardware.
Ready to try it yourself? Clone the repository, follow the setup instructions above, and start with the provided PDF to see the results.
The beauty of this system is its modularity and open-source nature: you can modify each stage to suit your needs, whether that means using different models, adjusting the conversation style, or tweaking the voice characteristics.
Through the power of Vast.ai's GPU marketplace and the latest AI models, Notebook Llama offers an innovative way to make technical content more accessible and engaging. Give it a try and transform your PDFs into engaging podcast conversations!