April 19, 2025 · Speaker Diarization · Vast.ai · Audio Processing · Pyannote
In multi-speaker audio recordings like meetings, podcasts, or interviews, knowing who spoke when is crucial for many applications. This process, known as Speaker Diarization, partitions an audio stream into segments according to speaker identity.
PyAnnote Audio, an open-source toolkit built on PyTorch, offers state-of-the-art models for speaker diarization that are both accurate and accessible. Running these models on Vast.ai provides a cost-effective solution for processing large audio datasets without investing in expensive hardware. This combination gives developers and researchers the tools they need to build sophisticated audio processing pipelines at a fraction of the cost of traditional cloud providers.
Speaker Diarization provides several key benefits for audio processing pipelines:
Speaker Identification: It identifies different speakers in a conversation, meeting, or any multi-speaker audio recording.
Improved Transcription: When combined with speech-to-text systems, diarization allows for speaker-attributed transcripts, making it clear who said what (see the sketch after this list).
Processing Efficiency: By segmenting audio by speaker and removing non-speech portions, diarization can significantly reduce the computational load for downstream tasks like speech recognition, allowing these systems to process only relevant speech segments rather than the entire audio file.
Audio Indexing: Makes audio content searchable by speaker, allowing users to find all segments where a specific person speaks.
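As a concrete illustration of the transcription use case, the following minimal sketch shows one way to attach speaker labels to word-level ASR output. The diarization object matches what Pyannote produces later in this guide, but the words list and its (start, end, text) layout are hypothetical placeholders for whatever your speech-to-text system returns.
def assign_speakers(diarization, words):
    """Label each ASR word with the speaker whose turn overlaps it the most.

    diarization: a pyannote Annotation (as returned by the pipeline later in this guide)
    words: hypothetical list of (start_seconds, end_seconds, text) tuples from an ASR system
    """
    labeled = []
    for start, end, text in words:
        best_speaker, best_overlap = None, 0.0
        for segment, _, speaker in diarization.itertracks(yield_label=True):
            overlap = min(end, segment.end) - max(start, segment.start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        labeled.append((best_speaker, text))
    return labeled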
In this guide, we will:
Set up a GPU instance on Vast.ai using the PyTorch (CuDNN Runtime) template.
Install Pyannote Audio and the supporting audio libraries.
Download a sample recording from the AMI Meeting Corpus.
Run the speaker diarization pipeline and explore its output.
Split the original audio into per-speaker segments for verification.
The output will be a collection of audio files separated by speaker, making them ready for further processing in speech-to-text pipelines or speaker-specific analysis.
Vast.ai offers a marketplace approach to GPU rentals that provides significant advantages for audio processing tasks. Unlike traditional cloud providers, Vast.ai lets you choose from a wide range of GPUs, pick exactly the hardware your workload needs, and pay only for the capacity you actually use.
For speaker diarization specifically, Vast.ai is particularly well-suited: these models typically require GPU acceleration but don't demand extensive resources. You can rent just the right amount of computing power for this task without overpaying for unused capacity, which makes it an economical choice for audio processing workflows that might otherwise be cost-prohibitive on traditional cloud platforms.
To follow along with this tutorial, you can download the complete Jupyter notebook.
Having the notebook will allow you to execute the code blocks as you read through this guide, making it easier to understand and experiment with the diarization implementation.
For running the Pyannote speaker diarization model on Vast.ai, you'll need only a relatively modest GPU setup. The pyannote/speaker-diarization-3.1 model runs in pure PyTorch and is designed to be efficient, so a single mid-range GPU with a few gigabytes of VRAM is sufficient for this tutorial.
Follow these steps to set up your environment: create a Vast.ai instance, choose a suitable GPU, and launch it with the PyTorch (CuDNN Runtime) template.
Once the instance is running and you're connected to it, let's start by installing the necessary Python packages:
%%bash
pip install pyannote.audio
pip install pydub
pip install librosa
pip install soundfile
pip install datasets
We also need to install FFmpeg for audio processing:
%%bash
apt-get update && apt-get install -y ffmpeg
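Once the packages are installed, a quick optional check confirms that PyTorch can see the GPU before we download any models (the printed device name and memory will of course depend on the hardware you rented):
import torch

# Optional sanity check: confirm the GPU is visible and report its memory.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")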
Pyannote models are hosted on Hugging Face, so you'll need to set up authentication:
# Make sure you've accepted the user conditions at:
# https://huggingface.co/pyannote/speaker-diarization-3.1
# https://huggingface.co/pyannote/segmentation-3.0
HF_TOKEN = "" # Add your token here
Ensure that you have accepted the terms for the models at the URLs above. The models are free to use, but you must agree to their terms of service.
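As an optional alternative to passing the token into the pipeline call, you can authenticate the whole environment once with huggingface_hub, which is installed as a dependency of pyannote.audio. This sketch assumes HF_TOKEN holds the token you created on Hugging Face:
# Optional: log in once so downstream Hugging Face downloads pick up the token.
from huggingface_hub import login

login(token=HF_TOKEN)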
For this tutorial, we'll use a sample file from the AMI Meeting Corpus dataset, which is a collection of 100 hours of meeting recordings. This dataset is perfect for testing speaker diarization as it contains natural multi-speaker conversations.
from datasets import load_dataset
import os
import soundfile as sf
# Create a directory to save the files
os.makedirs("ami_samples", exist_ok=True)
# Load the dataset with the correct split
dataset = load_dataset("diarizers-community/ami", "ihm", split="train", streaming=True)
# Load just one sample
n_samples = 1
samples = list(dataset.take(n_samples))
for i, sample in enumerate(samples):
    audio = sample["audio"]
    audio_array = audio["array"]
    sampling_rate = audio["sampling_rate"]

    # Calculate duration in seconds
    duration = len(audio_array) / sampling_rate

    # Use soundfile to save the audio
    output_path = f"ami_samples/sample_{i}.wav"
    sf.write(output_path, audio_array, sampling_rate)

    print(f"Saved {output_path} - Speaker: {sample['speakers']} - Duration: {duration:.2f} seconds")
First, let's set up the Speaker Diarization pipeline:
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
from pyannote.audio import Pipeline
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token=HF_TOKEN
)
# Move pipeline to appropriate device
pipeline = pipeline.to(device)
Next, we'll process our audio file to identify different speakers and their speaking turns:
# Process the audio file
audio_file = "./ami_samples/sample_0.wav"
print(f"Processing {audio_file} on {device}")
output = pipeline(audio_file)
print("Voice activity segments:")
for segment, _, speaker in output.itertracks(yield_label=True):
    result = f"{segment.start:.2f} --> {segment.end:.2f} (duration: {segment.duration:.2f}s) Speaker: {speaker}"
    print(result)
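If you already know roughly how many people are in the recording, the speaker-diarization-3.1 pipeline also accepts speaker-count hints, and a progress hook is handy on long files. Both options are documented by Pyannote; the counts below are illustrative values, not recommendations:
from pyannote.audio.pipelines.utils.hook import ProgressHook

# Optional alternative call: constrain the speaker count and show progress.
with ProgressHook() as hook:
    output = pipeline(audio_file, min_speakers=2, max_speakers=10, hook=hook)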
The output will look something like this:
Processing ./ami_samples/sample_0.wav on cuda
Voice activity segments:
18.36 --> 18.42 (duration: 0.07s) Speaker: SPEAKER_03
23.01 --> 25.63 (duration: 2.62s) Speaker: SPEAKER_03
27.08 --> 27.64 (duration: 0.56s) Speaker: SPEAKER_05
...
Now that we have processed our file, let's explore some useful features of the Pyannote SDK:
Here we calculate the total speaking time for each speaker:
for speaker in output.labels():
    speaking_time = output.label_duration(speaker)
    print(f"Speaker {speaker} total speaking time: {speaking_time:.2f}s")
This will output something like:
Speaker SPEAKER_00 total speaking time: 558.98s
Speaker SPEAKER_01 total speaking time: 18.98s
Speaker SPEAKER_02 total speaking time: 22.88s
Speaker SPEAKER_03 total speaking time: 469.68s
Speaker SPEAKER_04 total speaking time: 698.02s
Speaker SPEAKER_05 total speaking time: 190.70s
Speaker SPEAKER_06 total speaking time: 5.74s
...
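To put these numbers in context, we can express each speaker's time as a share of all detected speech, reusing the same label_duration calls on the output object:
# Convert per-speaker durations into percentages of all detected speech.
total_speech = sum(output.label_duration(s) for s in output.labels())

for speaker in output.labels():
    share = 100 * output.label_duration(speaker) / total_speech
    print(f"{speaker}: {share:.1f}% of detected speech")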
Pyannote can identify regions where multiple speakers are talking simultaneously:
overlap = output.get_overlap()
print(f"Overlapping speech regions: {overlap}")
This will output the segments where more than one speaker is speaking.
Overlapping speech regions:
[[ 00:00:27.672 --> 00:00:27.689]
[ 00:00:38.337 --> 00:00:38.860]
[ 00:00:40.395 --> 00:00:40.463]
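Since get_overlap() returns a pyannote Timeline, we can also summarize how much of the recording is overlapped speech instead of printing every region. This short sketch assumes the Timeline behaves as documented, with len() giving the number of segments and duration() the total length of its support:
# Summarize the overlap timeline instead of printing every region.
overlap = output.get_overlap()
print(f"{len(overlap)} overlapping regions, {overlap.duration():.1f}s of overlapped speech in total")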
We can also filter the diarization output to focus on a specific speaker:
speaker = "SPEAKER_06"
speaker_turns = output.label_timeline(speaker)
print(f"Speaker {speaker} speaks at:")
for speaker_turn in speaker_turns:
    print(speaker_turn)
This will show the segments where this speaker is speaking.
Speaker SPEAKER_06 speaks at:
[ 00:03:45.767 --> 00:03:45.852]
[ 00:03:55.386 --> 00:03:55.521]
[ 00:05:07.257 --> 00:05:07.274]
...
To verify our results and prepare the audio for further processing, let's split the original audio into segments by speaker:
import shutil
from pydub import AudioSegment
def split_audio_by_segments(audio_path, diarization_output, output_dir="output_segments"):
    """
    Split an audio file into multiple files based on diarization output

    Parameters:
    -----------
    audio_path: str
        Path to the input audio file
    diarization_output: Annotation
        Pyannote diarization output
    output_dir: str
        Directory to save the output segments
    """
    # Clear the output directory if it exists
    if os.path.exists(output_dir):
        shutil.rmtree(output_dir)

    # Create output directory
    os.makedirs(output_dir, exist_ok=True)

    # Load the audio file
    audio = AudioSegment.from_file(audio_path)

    # Extract each segment with speaker information
    for i, (segment, _, speaker) in enumerate(diarization_output.itertracks(yield_label=True)):
        # Convert seconds to milliseconds
        start_ms = int(segment.start * 1000)
        end_ms = int(segment.end * 1000)

        # Extract segment
        segment_audio = audio[start_ms:end_ms]

        # Generate output filename with speaker information
        filename = os.path.basename(audio_path)
        name, ext = os.path.splitext(filename)
        output_path = os.path.join(output_dir, f"{name}_segment_{i+1:04d}_{start_ms:08d}ms-{end_ms:08d}ms_speaker_{speaker}{ext}")

        # Export segment
        segment_audio.export(output_path, format=ext.replace('.', ''))
        print(f"Saved segment {i+1} to {output_path} (Speaker: {speaker})")

# Apply the function to our audio file
split_audio_by_segments(audio_file, output)
To verify our results, we'll create a function to play audio files in our Jupyter environment:
import librosa
from IPython.display import Audio, display
def play_audio(file_path, sr=None):
    """
    Play an audio file in a Jupyter notebook.
    """
    # Load the audio file
    y, sr = librosa.load(file_path, sr=sr)

    # Display an audio widget to play the sound
    audio_widget = Audio(data=y, rate=sr)
    display(audio_widget)
Now, let's listen to a few clips to verify that the speakers were correctly identified and isolated.
import os
audio_dir = "./output_segments/"
audio_files = os.listdir(audio_dir)
audio_files.sort()
n_offset = 21
n_clips = 5
for fname in audio_files[n_offset:n_clips + n_offset]:
    print(f"File: {fname}")

    # Extract speaker info if present in filename
    if "_speaker_" in fname:
        speaker_part = fname.split("_speaker_")[1].split(".")[0]
        print(f"Speaker: {speaker_part}")

    play_audio(audio_dir + fname)
Around file 21 or so, we see a 14-second clip of SPEAKER_00 speaking:
File: sample_0_segment_0025_00055364ms-00070045ms_speaker_SPEAKER_00.wav Speaker: SPEAKER_00
A few files later we see a 10-second clip of SPEAKER_00 speaking again:
File: sample_0_segment_0029_00071530ms-00081840ms_speaker_SPEAKER_00.wav Speaker: SPEAKER_00
Note: The filenames may change. Each time we run the model there may be a different number of clips. The diarization model sometimes captures <1 second of audio and labels it as a speaker.
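If these very short clips are just noise for your application, one option is to filter segments by duration before exporting them. The 0.5-second threshold below is an arbitrary value chosen for illustration:
# Drop very short turns before exporting; the threshold is an arbitrary example.
MIN_DURATION = 0.5  # seconds

kept = [
    (segment, speaker)
    for segment, _, speaker in output.itertracks(yield_label=True)
    if segment.duration >= MIN_DURATION
]
print(f"Keeping {len(kept)} of {len(list(output.itertracks()))} segments")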
The 10-second clip is audio of SPEAKER_00 talking about the meeting agenda; it contains only SPEAKER_00.
The 14-second clip has two other speakers speaking. This seems like an error at first, but we will notice two other files that capture the additional speakers:
sample_0_segment_0026_00059228ms-00060325ms_speaker_SPEAKER_02.wav
sample_0_segment_0027_00061793ms-00062924ms_speaker_SPEAKER_02.wav
With further processing, we could remove these from the SPEAKER_00 file if necessary for our application.
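One way to do that is to subtract the overlapping regions from the diarization before splitting. The sketch below assumes your installed pyannote.core version provides Annotation.extrude(); check it against your version before relying on it:
# Remove regions where more than one speaker is active, then re-split the audio.
# Assumes pyannote.core's Annotation.extrude() is available in your installed version.
overlap = output.get_overlap()
clean_output = output.extrude(overlap)

split_audio_by_segments(audio_file, clean_output, output_dir="output_segments_clean")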
We can also use the Pyannote overlap function to verify that we did have overlapping speech at this time in the audio file.
Finally, let's check for regions where multiple speakers are talking simultaneously:
overlap = output.get_overlap()
for overlap_ts in overlap:
    print(f"Overlapping speech regions: {overlap_ts}")
Here we see two overlapping segments that match up with the overlapping files we found above.
sample_0_segment_0026_00059228ms-00060325ms_speaker_SPEAKER_02.wav
sample_0_segment_0027_00061793ms-00062924ms_speaker_SPEAKER_02.wav
Overlapping speech regions: [ 00:00:59.228 --> 00:01:00.325]
Overlapping speech regions: [ 00:01:01.793 --> 00:01:02.924]
This tutorial demonstrates how Pyannote's speaker diarization on Vast.ai provides a powerful solution for identifying "who spoke when" in multi-speaker recordings, combining state-of-the-art models, a convenient Python API for exploring the results, and low-cost GPU compute.
The Pyannote diarization model provides impressive results out of the box, and running it on Vast.ai makes it accessible and affordable for a wide range of applications. Whether you're building a meeting transcription service, analyzing call center interactions, or researching conversation dynamics, this approach gives you a solid starting point.
Look out for more content from Vast about other audio processing tasks!