AI has now gone multimodal: instead of just exchanging chat messages with a chatbot, there are compelling use cases involving audio, video, and images. One standout product is NotebookLM from Google, which has taken the AI community by storm. It takes in YouTube videos, PDFs, or other documents, derives insights, and even creates a podcast from the notes and original material.
Meta's Llama team released Notebook Llama, an open-source version of this flow that lets users run their own models to create podcasts from their own material. In this tutorial, we will show you how to put a panel of AI researchers in the palm of your hand for the materials you care about. Let's dive into how you can set this up using Vast.ai's GPU marketplace.
Notebook Llama is an innovative pipeline that transforms PDF documents into podcast-ready audio content through a series of four notebooks, each handling a specific part of the conversion process:
The notebooks that we'll use in this tutorial are slightly modified from their GitHub counterparts. You can find them at the following links:
Once the instance is running, you can connect to it directly via the console in Vast.ai. More information about this and Jupyter notebooks on Vast can be found here.
Before you begin, make sure the `resources` folder and `requirements.txt` are present. Once your instance is running:

Navigate to the `app` directory:

```shell
cd app
```

Install the dependencies:

```shell
pip install -r requirements.txt
pip install git+https://github.com/huggingface/parler-tts.git
sudo apt-get install ffmpeg
```

Log in to Hugging Face so the notebooks can download the models:

```shell
huggingface-cli login
```
The first notebook handles the crucial task of converting your PDF into clean, structured text. It uses PyPDF2 for initial extraction and Llama 3.2-3B-Instruct for intelligent text cleaning. What makes this approach unique is its use of a lightweight language model instead of traditional regex-based cleaning.
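Because the model has a limited context window, the extracted text has to be fed to it in pieces. A minimal sketch of word-bounded chunking (the function name and chunk size here are illustrative, not the notebook's exact code):

```python
def chunk_by_words(text: str, max_chars: int = 1000) -> list[str]:
    """Split text into chunks of at most max_chars, breaking only on word boundaries."""
    chunks, current, length = [], [], 0
    for word in text.split():
        # +1 accounts for the joining space
        if length + len(word) + 1 > max_chars and current:
            chunks.append(" ".join(current))
            current, length = [], 0
        current.append(word)
        length += len(word) + 1
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each chunk is then sent to Llama 3.2-3B-Instruct with a cleaning prompt, and the cleaned chunks are concatenated back together.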
This stage transforms the cleaned text into a natural dialogue using the Llama model, which recasts the material as a two-speaker conversation.
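The model's raw output can then be split into per-speaker turns for the later stages. A minimal sketch, assuming the prompt asks the model to label each line as `Speaker 1:` or `Speaker 2:` (the parsing approach here is illustrative):

```python
import re

def parse_transcript(raw: str) -> list[tuple[str, str]]:
    """Parse 'Speaker N: text' lines into (speaker, text) tuples."""
    turns = []
    for line in raw.splitlines():
        m = re.match(r"\s*(Speaker [12]):\s*(.+)", line)
        if m:
            turns.append((m.group(1), m.group(2).strip()))
    return turns
```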
The third notebook uses Llama 3.2 3B Instruct to refine the conversation further, prompting the model to act as an "Oscar-winning screenwriter" and add dramatic polish to the dialogue.
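A convenient way to hand the rewritten script to the TTS stage is to have the model return it as a literal Python list of `(speaker, line)` tuples, which can be parsed safely with the standard library (the exact exchange format is an assumption; `ast.literal_eval` is a real stdlib function):

```python
import ast

# Example of what the rewriting model might return as plain text
model_output = """[
    ("Speaker 1", "Welcome to the show! Today we're digging into GPUs."),
    ("Speaker 2", "Umm, so what exactly is a GPU marketplace?"),
]"""

# literal_eval parses the list-of-tuples string without executing arbitrary code
script = ast.literal_eval(model_output)
```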
The final stage brings your podcast to life using two different TTS models, one for each speaker.
This combination creates a dynamic conversation rather than monotonous single-voice narration.
For this specific notebook, we created a separate speaker description for Speaker 2, and each speaker uses a different TTS model.
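The per-speaker routing boils down to a dispatch table: each turn in the script is synthesized by that speaker's model and the audio segments are concatenated. A minimal sketch with placeholder synthesizers (the function names are hypothetical; in the notebook each would call a real TTS model):

```python
# Placeholder synthesizers: each returns a silent "waveform" (list of samples)
# whose length tracks the input. The real notebook calls an actual TTS model,
# with a custom speaker description for Speaker 2.
def synth_speaker_1(text: str) -> list[float]:
    return [0.0] * len(text)

def synth_speaker_2(text: str) -> list[float]:
    return [0.0] * len(text)

VOICES = {"Speaker 1": synth_speaker_1, "Speaker 2": synth_speaker_2}

def render_podcast(script: list[tuple[str, str]]) -> list[float]:
    audio = []
    for speaker, line in script:
        audio.extend(VOICES[speaker](line))  # route each turn to its model
    return audio
```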
Each of these notebooks writes a file to the resources folder for the next notebook to consume. At the end, you'll have an MP3 file that serves as your podcast. We've included one in the folder so you can take a listen.
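This handoff between notebooks can be as simple as pickling each stage's output into the shared folder. A sketch of the pattern (the filename and directory here are illustrative, not the repository's exact paths):

```python
import os
import pickle
import tempfile

# Stand-in for the repo's resources/ folder
resources = tempfile.mkdtemp()

# Stage N writes its output...
cleaned_text = "GPU marketplaces let you rent compute by the hour."
with open(os.path.join(resources, "cleaned_text.pkl"), "wb") as f:
    pickle.dump(cleaned_text, f)

# ...and stage N+1 picks it up in the next notebook
with open(os.path.join(resources, "cleaned_text.pkl"), "rb") as f:
    stage_input = pickle.load(f)
```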
These notebooks leverage multiple large language models and TTS systems, which means running several models at once and requires significant GPU resources. Vast.ai's GPU marketplace provides affordable, on-demand access to exactly this kind of hardware.
Ready to try it yourself? Clone the repository, follow the setup instructions above, and start with the provided PDF to see the results.
The beauty of this system is its modularity and open-source nature: you can modify each stage to suit your needs, whether that means using different models, adjusting the conversation style, or tweaking the voice characteristics.
Through the power of Vast.ai's GPU marketplace and the latest AI models, Notebook Llama offers an innovative way to make technical content more accessible and engaging. Give it a try and transform your PDFs into engaging podcast conversations!