Speaker Diarization with Pyannote on VAST

Introduction
In multi-speaker audio recordings like meetings, podcasts, or interviews, knowing who spoke when is crucial for many applications. This process, known as Speaker Diarization, partitions an audio stream into segments according to speaker identity.
PyAnnote Audio, an open-source toolkit built on PyTorch, offers state-of-the-art models for speaker diarization that are both accurate and accessible. Running these models on Vast.ai provides a cost-effective solution for processing large audio datasets without investing in expensive hardware. This combination gives developers and researchers the tools they need to build sophisticated audio processing pipelines at a fraction of the cost of traditional cloud providers.
Why Speaker Diarization Matters
Speaker Diarization provides several key benefits for audio processing pipelines:
- Speaker Identification: It identifies different speakers in a conversation, meeting, or any multi-speaker audio recording.
- Improved Transcription: When combined with speech-to-text systems, diarization allows for speaker-attributed transcripts, making it clear who said what.
- Processing Efficiency: By segmenting audio by speaker and removing non-speech portions, diarization can significantly reduce the computational load for downstream tasks like speech recognition, allowing these systems to process only relevant speech segments rather than the entire audio file.
- Audio Indexing: Makes audio content searchable by speaker, allowing users to find all segments where a specific person speaks.
What This Guide Covers
In this guide, we will:
- Set up the Pyannote Audio Speaker Diarization pipeline
- Process audio files to detect different speakers and their speaking turns
- Calculate speaking time for each identified speaker
- Identify regions with overlapping speech
- Extract and save speaker-specific segments from the input audio
- Play and verify the diarization results
The output will be a collection of audio files separated by speaker, making them ready for further processing in speech-to-text pipelines or speaker-specific analysis.
Why VAST.ai
VAST.ai offers a marketplace approach to GPU rentals that provides significant advantages for audio processing tasks. Unlike traditional cloud providers, VAST.ai allows you to:
- Rent precisely the GPU capacity needed for your workload
- Access GPUs at more affordable rates
- Access more types of GPU SKUs, particularly lower RAM/more affordable GPUs
- Avoid long-term commitments for experimental projects
For speaker diarization specifically, VAST.ai is particularly well-suited as these models typically require GPU acceleration but don't demand extensive resources. Users can rent just the right amount of computing power for this specific task without overpaying for unused capacity, making it an economical choice for audio processing workflows that might otherwise be cost-prohibitive on traditional cloud platforms.
Download This Notebook
To follow along with this tutorial, you can download the complete Jupyter notebook:
Having the notebook will allow you to execute the code blocks as you read through this guide, making it easier to understand and experiment with the diarization implementation.
Choosing an Instance
For running the Pyannote Speaker Diarization model on VAST.ai, you'll need a relatively modest GPU setup. The pyannote/speaker-diarization-3.1 model runs in pure PyTorch and is designed to be efficient. Here are the recommended specifications:
- GPU: A low-end GPU like an RTX 3060 or 4060 would be sufficient.
- VRAM: 6-8GB of VRAM should be adequate as the Pyannote diarization pipeline is relatively efficient.
- RAM: 8-16GB system RAM is recommended for processing audio files.
- Storage: At least 10GB for the model, dependencies, and your audio files.
- CUDA: Make sure the instance has CUDA installed (version 11.0+ recommended).
- Python: Python 3.8+ with PyTorch installed.
Selecting an Instance
Follow these steps to set up your environment:
- Ensure that you have a Vast.ai account
- Go to the Vast Templates in the Console https://cloud.vast.ai/templates/
- Select the PyTorch (CuDNN Runtime) Template
- Filter for an instance with:
- 1 GPU
- 6-8GB of VRAM
- 8-16GB system RAM
- 10GB of storage
- Select an instance and click rent
- Install the Vast TLS certificate in your browser to access the notebook server https://docs.vast.ai/instances/jupyter#1SmCz
- Go to your Instances https://cloud.vast.ai/instances/ and click "Open" to access the jupyter server on your instance.
- Upload the notebook to the server or create a new notebook
Installing Dependencies
Let's start by installing the necessary Python packages:
%%bash
pip install pyannote.audio
pip install pydub
pip install librosa
pip install datasets
We also need to install FFmpeg for audio processing:
%%bash
apt-get update && apt-get install -y ffmpeg
Setting Up Your Hugging Face Token
Pyannote models are hosted on Hugging Face, so you'll need to set up authentication:
# Make sure you've accepted the user conditions at:
# https://huggingface.co/pyannote/speaker-diarization-3.1
# https://huggingface.co/pyannote/segmentation-3.0
HF_TOKEN = "" # Add your token here
Ensure that you have accepted the terms for the models at the URLs above. The models are free to use, but you must agree to their terms of service.
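If you would rather not paste the token into the notebook itself, one option is to read it from an environment variable. The sketch below is only an illustration: the `get_hf_token` helper and the `DEMO_HF_TOKEN` variable name are hypothetical, and the placeholder value is not a real token.

```python
import os


def get_hf_token(env_var="HF_TOKEN"):
    """Return the Hugging Face token stored in an environment variable.

    Raises a clear error instead of handing an empty token to the model loader.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Hugging Face access token."
        )
    return token


# Demo only: a hypothetical variable name and placeholder value stand in for a real token.
os.environ["DEMO_HF_TOKEN"] = "hf_placeholder_token"
demo_token = get_hf_token("DEMO_HF_TOKEN")
print("Token loaded:", bool(demo_token))
```

In the notebook you would export the real variable before starting Jupyter and then assign `HF_TOKEN = get_hf_token()`.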
Downloading Test Data
For this tutorial, we'll use a sample file from the AMI Meeting Corpus dataset, which is a collection of 100 hours of meeting recordings. This dataset is perfect for testing speaker diarization as it contains natural multi-speaker conversations.
from datasets import load_dataset
import os
import soundfile as sf
# Create a directory to save the files
os.makedirs("ami_samples", exist_ok=True)
# Load the dataset with the correct split
dataset = load_dataset("diarizers-community/ami", "ihm", split="train", streaming=True)
# Load just one sample
n_samples = 1
samples = list(dataset.take(n_samples))
for i, sample in enumerate(samples):
    audio = sample["audio"]
    audio_array = audio["array"]
    sampling_rate = audio["sampling_rate"]
    # Calculate duration in seconds
    duration = len(audio_array) / sampling_rate
    # Use soundfile to save the audio
    output_path = f"ami_samples/sample_{i}.wav"
    sf.write(output_path, audio_array, sampling_rate)
    print(f"Saved {output_path} - Speaker: {sample['speakers']} - Duration: {duration:.2f} seconds")
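The duration printed above is simply the number of samples divided by the sampling rate. As a sanity check on a saved file, the same calculation can be repeated with Python's built-in `wave` module. This is a minimal sketch that assumes the file is PCM-encoded WAV. The 16 kHz rate, the silent test signal, and the `wav_duration_seconds` helper are illustrative choices, not part of the notebook.

```python
import os
import tempfile
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file: frame count divided by sample rate."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


# Write two seconds of 16-bit mono silence at an assumed 16 kHz sample rate.
sample_rate = 16000
n_frames = 2 * sample_rate
demo_path = os.path.join(tempfile.mkdtemp(), "silence.wav")
with wave.open(demo_path, "wb") as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
    wav_file.setframerate(sample_rate)
    wav_file.writeframes(b"\x00\x00" * n_frames)

duration = wav_duration_seconds(demo_path)
print(f"{demo_path}: {duration:.2f} seconds")
```

Pointing `wav_duration_seconds` at `ami_samples/sample_0.wav` should report the same duration that the download loop printed.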
Speaker Diarization
First, let's set up the Speaker Diarization pipeline:
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
from pyannote.audio import Pipeline
pipeline = Pipeline.from_pretrained(
"pyannote/speaker-diarization-3.1",
use_auth_token=HF_TOKEN
)
# Move pipeline to appropriate device
pipeline = pipeline.to(device)
Next, we'll process our audio file to identify different speakers and their speaking turns:
# Process the audio file
audio_file = "./ami_samples/sample_0.wav"
print(f"Processing {audio_file} on {device}")
output = pipeline(audio_file)
print("Voice activity segments:")
for segment, _, speaker in output.itertracks(yield_label=True):
    result = f"{segment.start:.2f} --> {segment.end:.2f} (duration: {segment.duration:.2f}s) Speaker: {speaker}"
    print(result)
The output will look something like this:
Processing ./ami_samples/sample_0.wav on cuda
Voice activity segments:
18.36 --> 18.42 (duration: 0.07s) Speaker: SPEAKER_03
23.01 --> 25.63 (duration: 2.62s) Speaker: SPEAKER_03
27.08 --> 27.64 (duration: 0.56s) Speaker: SPEAKER_05
...
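Speaker turns like these are often exchanged as RTTM files, a plain-text format with one line per turn. The sketch below formats plain `(start, end, speaker)` tuples into the ten-field layout commonly used for diarization RTTM. The `to_rttm_lines` helper and the `sample_0` identifier are hypothetical, and the turns are the rounded values from the sample output above. In the notebook itself, the annotation's own `write_rttm` method is likely the better choice.

```python
def to_rttm_lines(turns, uri):
    """Format (start, end, speaker) tuples as RTTM lines for the recording `uri`."""
    lines = []
    for start, end, speaker in turns:
        duration = end - start
        # Fields: type, file id, channel, onset, duration, then placeholders around the speaker name
        lines.append(
            f"SPEAKER {uri} 1 {start:.3f} {duration:.3f} <NA> <NA> {speaker} <NA> <NA>"
        )
    return lines


# Rounded values copied from the sample output above
turns = [
    (18.36, 18.42, "SPEAKER_03"),
    (23.01, 25.63, "SPEAKER_03"),
    (27.08, 27.64, "SPEAKER_05"),
]

rttm_lines = to_rttm_lines(turns, uri="sample_0")
for line in rttm_lines:
    print(line)
```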
Additional Analytics
Now that we have processed our file, let's explore some useful methods on the annotation object that the pipeline returned:
Speaker Time
Here we calculate the total speaking time for each speaker:
for speaker in output.labels():
    speaking_time = output.label_duration(speaker)
    print(f"Speaker {speaker} total speaking time: {speaking_time:.2f}s")
This will output something like:
Speaker SPEAKER_00 total speaking time: 558.98s
Speaker SPEAKER_01 total speaking time: 18.98s
Speaker SPEAKER_02 total speaking time: 22.88s
Speaker SPEAKER_03 total speaking time: 469.68s
Speaker SPEAKER_04 total speaking time: 698.02s
Speaker SPEAKER_05 total speaking time: 190.70s
Speaker SPEAKER_06 total speaking time: 5.74s
...
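Conceptually, a per-speaker total is the sum of that speaker's turn durations, and it is often easier to compare speakers as a share of all speaking time. The following is a minimal sketch on plain tuples rather than on the Pyannote output. The turns and both helper functions are hypothetical, and overlapping speech is counted once for each speaker involved.

```python
from collections import defaultdict


def speaking_time(turns):
    """Sum turn durations per speaker from (start, end, speaker) tuples."""
    totals = defaultdict(float)
    for start, end, speaker in turns:
        totals[speaker] += end - start
    return dict(totals)


def speaking_share(totals):
    """Convert per-speaker totals into percentages of all speaking time."""
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {speaker: 0.0 for speaker in totals}
    return {speaker: 100.0 * t / grand_total for speaker, t in totals.items()}


# Hypothetical turns, in seconds
turns = [
    (0.0, 6.0, "SPEAKER_00"),
    (6.0, 8.0, "SPEAKER_01"),
    (8.0, 10.0, "SPEAKER_00"),
]

totals = speaking_time(turns)
shares = speaking_share(totals)
for speaker, share in shares.items():
    print(f"{speaker}: {totals[speaker]:.2f}s ({share:.1f}%)")
```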
Speaker Overlap
Pyannote can identify regions where multiple speakers are talking simultaneously:
overlap = output.get_overlap()
print(f"Overlapping speech regions: {overlap}")
This will output the segments where more than one speaker is speaking.
Overlapping speech regions:
[[ 00:00:27.672 --> 00:00:27.689]
[ 00:00:38.337 --> 00:00:38.860]
[ 00:00:40.395 --> 00:00:40.463]
Filter by Speaker
We can also filter the diarization output to focus on a specific speaker:
speaker = "SPEAKER_06"
speaker_turns = output.label_timeline(speaker)
print(f"Speaker {speaker} speaks at:")
for speaker_turn in speaker_turns:
    print(speaker_turn)
This will show the segments where this speaker is speaking.
Speaker SPEAKER_06 speaks at:
[ 00:03:45.767 --> 00:03:45.852]
[ 00:03:55.386 --> 00:03:55.521]
[ 00:05:07.257 --> 00:05:07.274]
...
Extracting Speaker Segments
To verify our results and prepare the audio for further processing, let's split the original audio into segments by speaker:
import shutil
from pydub import AudioSegment
def split_audio_by_segments(audio_path, diarization_output, output_dir="output_segments"):
    """
    Split an audio file into multiple files based on diarization output

    Parameters:
    -----------
    audio_path: str
        Path to the input audio file
    diarization_output: Annotation
        Pyannote diarization output
    output_dir: str
        Directory to save the output segments
    """
    # Clear the output directory if it exists
    if os.path.exists(output_dir):
        shutil.rmtree(output_dir)
    # Create output directory
    os.makedirs(output_dir, exist_ok=True)
    # Load the audio file
    audio = AudioSegment.from_file(audio_path)
    # Extract each segment with speaker information
    for i, (segment, _, speaker) in enumerate(diarization_output.itertracks(yield_label=True)):
        # Convert seconds to milliseconds
        start_ms = int(segment.start * 1000)
        end_ms = int(segment.end * 1000)
        # Extract segment
        segment_audio = audio[start_ms:end_ms]
        # Generate output filename with speaker information
        filename = os.path.basename(audio_path)
        name, ext = os.path.splitext(filename)
        output_path = os.path.join(output_dir, f"{name}_segment_{i+1:04d}_{start_ms:08d}ms-{end_ms:08d}ms_speaker_{speaker}{ext}")
        # Export segment
        segment_audio.export(output_path, format=ext.replace('.', ''))
        print(f"Saved segment {i+1} to {output_path} (Speaker: {speaker})")
# Apply the function to our audio file
split_audio_by_segments(audio_file, output)
Inspecting Results
To verify our results, we'll create a function to play audio files in our Jupyter environment:
import librosa
from IPython.display import Audio, display
def play_audio(file_path, sr=None):
    """
    Play an audio file in a Jupyter notebook.
    """
    # Load the audio file
    y, sr = librosa.load(file_path, sr=sr)
    # Display an audio widget to play the sound
    audio_widget = Audio(data=y, rate=sr)
    display(audio_widget)
Now, let's listen to a few clips to verify that the speakers were correctly identified and isolated.
import os
audio_dir = "./output_segments/"
audio_files = os.listdir(audio_dir)
audio_files.sort()
n_offset = 21
n_clips = 5
for fname in audio_files[n_offset:n_clips + n_offset]:
    print(f"File: {fname}")
    # Extract speaker info if present in filename
    if "_speaker_" in fname:
        speaker_part = fname.split("_speaker_")[1].split(".")[0]
        print(f"Speaker: {speaker_part}")
    play_audio(audio_dir + fname)
Around file 21 or so, we see a 14-second clip of SPEAKER_00 speaking:
File: sample_0_segment_0025_00055364ms-00070045ms_speaker_SPEAKER_00.wav
Speaker: SPEAKER_00
A few files later we see a 10-second clip of SPEAKER_00 speaking again:
File: sample_0_segment_0029_00071530ms-00081840ms_speaker_SPEAKER_00.wav
Speaker: SPEAKER_00
Note: The filenames may change. Each time we run the model there may be a different number of clips. The diarization model sometimes captures <1 second of audio and labels it as a speaker.
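Because the start time, end time, and speaker are all encoded in each filename, very short clips can be skipped before listening. The sketch below parses the naming scheme used by split_audio_by_segments and keeps only clips of at least one second. The helper names, the one-second threshold, and the first filename in the list are hypothetical.

```python
import re

# Matches names like <name>_segment_<index>_<start>ms-<end>ms_speaker_<label>.<ext>
SEGMENT_PATTERN = re.compile(
    r"^(?P<name>.+)_segment_(?P<index>\d+)"
    r"_(?P<start_ms>\d+)ms-(?P<end_ms>\d+)ms"
    r"_speaker_(?P<speaker>.+)\.(?P<ext>[^.]+)$"
)


def parse_segment_filename(fname):
    """Return index, speaker, and duration parsed from a segment filename, or None."""
    match = SEGMENT_PATTERN.match(fname)
    if match is None:
        return None
    start_ms = int(match["start_ms"])
    end_ms = int(match["end_ms"])
    return {
        "index": int(match["index"]),
        "speaker": match["speaker"],
        "duration_s": (end_ms - start_ms) / 1000,
    }


def keep_long_clips(fnames, min_seconds=1.0):
    """Keep only filenames that parse and last at least min_seconds."""
    kept = []
    for fname in fnames:
        info = parse_segment_filename(fname)
        if info is not None and info["duration_s"] >= min_seconds:
            kept.append(fname)
    return kept


fnames = [
    # Hypothetical very short clip, made up for this example
    "sample_0_segment_0001_00018355ms-00018424ms_speaker_SPEAKER_03.wav",
    # Filename taken from the listing above
    "sample_0_segment_0025_00055364ms-00070045ms_speaker_SPEAKER_00.wav",
]

info = parse_segment_filename(fnames[1])
long_clips = keep_long_clips(fnames)
print(info)
print(long_clips)
```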
The 10-second clip is audio of SPEAKER_00 talking about the meeting agenda. It contains only SPEAKER_00.
The 14-second clip also contains speech from other participants. This seems like an error at first, but two other files capture that additional speech:
sample_0_segment_0026_00059228ms-00060325ms_speaker_SPEAKER_02.wav
sample_0_segment_0027_00061793ms-00062924ms_speaker_SPEAKER_02.wav
With further processing, we could remove these from the SPEAKER_00 file if necessary for our application.
We can also use the Pyannote overlap function to verify that we did have overlapping speech at this time in the audio file.
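One way the further processing mentioned above could work is to cut the other speaker's time ranges out of the SPEAKER_00 turn and keep only the remaining pieces. The sketch below uses the millisecond timestamps from the filenames above. The `subtract_intervals` helper is hypothetical and operates on plain tuples rather than on Pyannote objects.

```python
def subtract_intervals(turn, removals):
    """Return the parts of `turn` that are not covered by any interval in `removals`.

    turn: (start, end) tuple
    removals: list of (start, end) tuples to cut out
    """
    start, end = turn
    pieces = []
    cursor = start
    for r_start, r_end in sorted(removals):
        # Skip removals that lie entirely outside the part still to be processed
        if r_end <= cursor or r_start >= end:
            continue
        if r_start > cursor:
            pieces.append((cursor, r_start))
        cursor = max(cursor, r_end)
    if cursor < end:
        pieces.append((cursor, end))
    return pieces


# Millisecond timestamps taken from the filenames above
speaker_00_turn = (55364, 70045)
speaker_02_turns = [(59228, 60325), (61793, 62924)]

clean_pieces = subtract_intervals(speaker_00_turn, speaker_02_turns)
for piece_start, piece_end in clean_pieces:
    print(f"{piece_start}ms --> {piece_end}ms")
```

Each resulting pair could then be used to slice the original audio by milliseconds, in the same way split_audio_by_segments does.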
Verifying Speaker Overlap
Finally, let's check for regions where multiple speakers are talking simultaneously:
overlap = output.get_overlap()
for overlap_ts in overlap:
print(f"Overlapping speech regions: {overlap_ts}")
Here we see two overlapping segments that match up with the overlapping files we found above.
sample_0_segment_0026_00059228ms-00060325ms_speaker_SPEAKER_02.wav
sample_0_segment_0027_00061793ms-00062924ms_speaker_SPEAKER_02.wav
Overlapping speech regions: [ 00:00:59.228 --> 00:01:00.325]
Overlapping speech regions: [ 00:01:01.793 --> 00:01:02.924]
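To see why these regions line up with the filenames, the same check can be sketched by hand: an overlap is any stretch of time where at least two turns are active at once. The code below is a simplified illustration on plain tuples, not Pyannote's implementation. The `overlap_regions` helper is hypothetical, it counts simultaneous turns regardless of their label, and the timestamps are taken from the filenames above.

```python
def overlap_regions(turns):
    """Return (start, end) regions where two or more turns are active at once."""
    events = []
    for start, end, _speaker in turns:
        events.append((start, 1))   # a turn begins
        events.append((end, -1))    # a turn ends
    # Sort by time; at equal times handle starts before ends so adjacent regions merge
    events.sort(key=lambda event: (event[0], -event[1]))

    regions = []
    active = 0
    region_start = None
    for time, delta in events:
        previous = active
        active += delta
        if previous < 2 <= active:
            region_start = time
        elif active < 2 <= previous and time > region_start:
            # Only keep regions with a positive length
            regions.append((region_start, time))
    return regions


# Turn boundaries in seconds, taken from the filenames above
turns = [
    (55.364, 70.045, "SPEAKER_00"),
    (59.228, 60.325, "SPEAKER_02"),
    (61.793, 62.924, "SPEAKER_02"),
]

regions = overlap_regions(turns)
for region_start, region_end in regions:
    print(f"{region_start:.3f}s --> {region_end:.3f}s")
```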
Conclusion
This tutorial demonstrates how Pyannote's speaker diarization on VAST.ai provides a powerful solution for identifying "who spoke when" in multi-speaker recordings. The implementation offers several advantages:
- Accuracy and Efficiency: Pyannote accurately identifies different speakers and their speaking times, even detecting overlapping speech
- Practical Applications: Enables speaker-attributed transcription, conversation analytics, and content indexing for meetings, podcasts, and interviews
- Cost-Effective Processing: VAST.ai's affordable GPU rentals make processing large audio datasets accessible without expensive hardware investments
The Pyannote diarization model provides impressive results out of the box, and running it on VAST.ai makes it accessible and affordable for a wide range of applications. Whether you're building a meeting transcription service, analyzing call center interactions, or researching conversation dynamics, this approach gives you a solid starting point.
Look out for more content from Vast about other audio processing tasks!


