Turn Any PDF into an AI-Generated Podcast with Notebook Llama
AI has gone multimodal: beyond sending chat messages to a chatbot, there are now awesome use cases involving audio, video, and images. One standout product is NotebookLM from Google, which has taken the AI community by storm. It takes in YouTube videos, PDFs, or other documents, derives insights, and even creates a podcast from the notes and original material.
Meta's Llama team released Notebook Llama, an open-source version of this workflow that lets users run their own models to create podcasts from their own material. In this tutorial, we will show you how to put the world of AI researchers in the palm of your hand for the materials that you care about. Let's dive into how you can set this up using Vast.ai's GPU marketplace.
What is Notebook Llama?
Notebook Llama is an innovative pipeline that transforms PDF documents into podcast-ready audio content through a series of four notebooks, each handling a specific part of the conversion process:
- PDF Pre-processing: Uses Llama-3.2-3B-Instruct to convert PDFs into clean, structured text
- Initial Transcript Generation: Employs Llama-3.1-8B-Instruct to create a conversational dialogue
- Transcript Refinement: Leverages Llama-3.2-3B-Instruct to enhance natural flow and engagement
- Text-to-Speech Generation: Combines Suno/Bark and Parler-TTS for dynamic, two-voice audio
The notebooks that we'll use in this tutorial are slightly modified from their GitHub counterparts. You can find them at the following links:
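To make the hand-off between the four notebooks concrete, here is a minimal Python sketch of the pipeline shape. The stage functions are hypothetical stand-ins (the real notebooks call Llama and TTS models at each step); what this shows is how each stage writes its result into the resources folder for the next stage to pick up.

```python
from pathlib import Path

# Hypothetical stand-ins for the work each notebook does; the real
# notebooks invoke Llama models and TTS systems at these points.
def clean_pdf_text(raw: str) -> str:          # Notebook 1: PDF -> clean text
    return " ".join(raw.split())

def write_transcript(text: str) -> str:       # Notebook 2: text -> dialogue
    return f"Speaker 1: {text}"

def rewrite_transcript(t: str) -> str:        # Notebook 3: dialogue -> refined dialogue
    return t + " [refined]"

def synthesize_audio(t: str) -> bytes:        # Notebook 4: dialogue -> audio bytes
    return t.encode()                         # the real stage returns MP3 audio

def run_pipeline(raw_pdf_text: str, resources: Path) -> Path:
    """Chain the four stages, persisting each intermediate result to
    the resources folder the way the notebooks hand files to each other."""
    resources.mkdir(exist_ok=True)
    clean = clean_pdf_text(raw_pdf_text)
    (resources / "clean_text.txt").write_text(clean)
    transcript = write_transcript(clean)
    (resources / "transcript.txt").write_text(transcript)
    final = rewrite_transcript(transcript)
    (resources / "final_transcript.txt").write_text(final)
    out = resources / "podcast.mp3"
    out.write_bytes(synthesize_audio(final))
    return out
```

Because each stage only depends on the file the previous stage wrote, you can re-run any single notebook without repeating the earlier ones.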
Setting Up Your Environment on Vast.ai
Step 1: Select the Right Instance
- Visit Vast.ai and select a GPU instance with at least 40 GB of GPU memory. We use 40 GB as a rough estimate because the notebooks load multiple small- to mid-sized models at the same time.
- Use this base image: https://cloud.vast.ai/?ref_id=62897&template_id=c50a4b1cc2fc37a62e2f3bea7cbd892a
- Important: Choose "Jupyter Notebook" as the Launch Mode along with direct HTTPS access
Step 2: Getting into your instance:
Once the instance is running, you can connect to it directly via the console in Vast.ai. More information about this and Jupyter Notebooks on Vast.ai can be found here
Step 3: File Organization
- Upload the notebook files
- Create a resources folder
- Upload your target PDF
- Ensure requirements.txt is present
Step 4: Environment Setup
Once your instance is running:
- Connect via the console in Vast.ai
- Navigate to the app directory:
cd app
- Install dependencies:
pip install -r requirements.txt
pip install git+https://github.com/huggingface/parler-tts.git
sudo apt-get install ffmpeg
Step 5: Hugging Face Authentication
- Make sure that you have a Hugging Face account. If you don't have one, you can create one here.
- Make sure that you accept the terms for the Llama-3.2-3B-Instruct model.
- Get your access token from the Hugging Face Settings page
- Run:
huggingface-cli login
- Enter your token when prompted
How It Works: The Pipeline Explained
Notebook 1: PDF Processing
The first notebook handles the crucial task of converting your PDF into clean, structured text. It uses PyPDF2 for initial extraction and Llama 3.2-3B-Instruct for intelligent text cleaning. What makes this approach unique is its use of a lightweight language model instead of traditional regex-based cleaning.
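Because the extracted text from a long PDF won't fit in the cleaning model's context window, it has to be fed to the model in pieces. A simple word-bounded chunker along these lines does the job; the chunk size here is an illustrative value, not the notebook's exact setting.

```python
def chunk_words(text: str, chunk_size: int = 1000) -> list[str]:
    """Split extracted PDF text into word-bounded chunks small enough
    to fit the cleaning model's context window.

    chunk_size is the number of words per chunk (an assumed value,
    not the notebook's exact setting).
    """
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]
```

Each chunk is then passed through the language model for cleaning, and the cleaned chunks are concatenated back into one text file.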
Notebook 2: Creating the Conversation
This stage transforms the cleaned text into a natural dialogue using the Llama model. The model creates a two-speaker conversation where:
- Speaker 1 leads and teaches the content
- Speaker 2 asks questions and maintains flow
The output includes natural elements like "umm" and "hmm" to maintain authenticity.
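Downstream stages need the dialogue split back into per-speaker lines. Assuming the model labels each line with "Speaker 1:" or "Speaker 2:" (an assumption about the output format, not a guarantee), a small parser can turn the raw transcript into (speaker, line) pairs:

```python
import re

def parse_dialogue(transcript: str) -> list[tuple[str, str]]:
    """Parse a transcript whose lines look like 'Speaker 1: ...' into
    (speaker, line) pairs for the later TTS stage. The label format
    is an assumption about the model's output."""
    pattern = re.compile(r"^(Speaker [12]):\s*(.+)$", re.MULTILINE)
    return [(m.group(1), m.group(2).strip())
            for m in pattern.finditer(transcript)]
```

Any line the model emits without a speaker label is simply skipped, so it's worth eyeballing the parsed output before moving on.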
Notebook 3: Enhanced Rewriting
The third notebook uses Llama 3.2 3B Instruct to refine the conversation further. It's prompted to act as an "Oscar-winning screenwriter," adding:
- Natural dialogue patterns
- Relevant anecdotes and analogies
- TTS-compatible formatting
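One way to make the rewrite "TTS-compatible" is to ask the model to return the dialogue as a Python-style list of (speaker, line) tuples that the audio notebook can consume directly. That output format is an assumption, so it pays to parse it defensively:

```python
import ast

def load_refined_transcript(model_output: str):
    """Parse a model response expected to be a Python-style list of
    (speaker, line) tuples. The format is an assumption about what
    the rewrite prompt requests, so return None on anything malformed
    rather than crashing the audio stage."""
    try:
        data = ast.literal_eval(model_output.strip())
    except (ValueError, SyntaxError):
        return None
    if isinstance(data, list) and all(
            isinstance(t, tuple) and len(t) == 2 for t in data):
        return data
    return None
```

Using ast.literal_eval rather than eval keeps this safe even if the model emits something unexpected.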
Notebook 4: Audio Generation
The final stage brings your podcast to life using two different TTS approaches:
- Speaker 1: Parler-TTS for an expressive, dramatic voice
- Speaker 2: Suno/Bark for a more methodical style
This combination creates a dynamic conversation rather than monotonous single-voice narration.
For this specific notebook, we created a separate speaker description for Speaker 2, and each speaker uses a different model.
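After each line of dialogue is synthesized, the per-line clips have to be joined into one track. A minimal sketch with NumPy, assuming every clip has already been brought to a common sample rate (in practice Bark and Parler-TTS output different rates, so resample or use an audio library like pydub first):

```python
import numpy as np

def concat_segments(segments: list[np.ndarray], rate: int,
                    gap_ms: int = 200) -> np.ndarray:
    """Join per-line audio arrays into one track, inserting a short
    silence between speakers so turns don't run together.

    Assumes all segments share the sample rate `rate`; gap_ms is an
    illustrative pause length, not the notebook's exact setting.
    """
    gap = np.zeros(int(rate * gap_ms / 1000), dtype=np.float32)
    parts = []
    for seg in segments:
        parts.append(seg.astype(np.float32))
        parts.append(gap)
    return np.concatenate(parts[:-1])  # drop the trailing gap
```

The resulting array can then be written out as an MP3 with an audio library backed by ffmpeg.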
Output:
Each of these notebooks outputs a file to the resources folder to be used by the next notebook. At the end, there will be an MP3 file that serves as a podcast for you to listen to! We've included one in our folder so you can hear the result.
Why Use Vast.ai for This Project?
These notebooks leverage multiple large language models and TTS systems, requiring significant GPU resources for running multiple models at the same time. Vast.ai provides:
- Access to high-memory GPUs (40GB+) necessary for running larger models
- Cost-effective GPU rental compared to dedicated hardware or other cloud providers
- Easy-to-use Jupyter notebook support
- Great docker templates for running notebooks and being able to get running quickly in your environment
- Flexible scaling based on your needs
Trying it out Yourself
Ready to try it yourself? Clone the repository, follow the setup instructions above, and start with the provided PDF to see the results.
The beauty of this system is its modularity and its open source nature - you can modify each stage to suit your needs, whether that's using different models, adjusting the conversation style, or tweaking the voice characteristics.
Through the power of Vast.ai's GPU marketplace and the latest AI models, Notebook Llama offers an innovative way to make technical content more accessible and engaging. Give it a try and transform your PDFs into engaging podcast conversations!



