
Speaker Diarization with Pyannote on VAST

- Team Vast

April 19, 2025 · Speaker Diarization · Vast.ai · Audio Processing · Pyannote

Introduction

In multi-speaker audio recordings like meetings, podcasts, or interviews, knowing who spoke when is crucial for many applications. This process, known as Speaker Diarization, partitions an audio stream into segments according to speaker identity.

Pyannote Audio, an open-source toolkit built on PyTorch, offers state-of-the-art models for speaker diarization that are both accurate and accessible. Running these models on Vast.ai provides a cost-effective solution for processing large audio datasets without investing in expensive hardware. This combination gives developers and researchers the tools they need to build sophisticated audio processing pipelines at a fraction of the cost of traditional cloud providers.

Why Speaker Diarization Matters

Speaker Diarization provides several key benefits for audio processing pipelines:

  1. Speaker Identification: It identifies different speakers in a conversation, meeting, or any multi-speaker audio recording.

  2. Improved Transcription: When combined with speech-to-text systems, diarization allows for speaker-attributed transcripts, making it clear who said what.

  3. Processing Efficiency: By segmenting audio by speaker and removing non-speech portions, diarization can significantly reduce the computational load for downstream tasks like speech recognition, allowing these systems to process only relevant speech segments rather than the entire audio file.

  4. Audio Indexing: Makes audio content searchable by speaker, allowing users to find all segments where a specific person speaks.

What This Guide Covers

In this guide, we will:

  • Set up the Pyannote Audio Speaker Diarization pipeline
  • Process audio files to detect different speakers and their speaking turns
  • Calculate speaking time for each identified speaker
  • Identify regions with overlapping speech
  • Extract and save speaker-specific segments from the input audio
  • Play and verify the diarization results

The output will be a collection of audio files separated by speaker, making them ready for further processing in speech-to-text pipelines or speaker-specific analysis.

Why VAST.ai

VAST.ai offers a marketplace approach to GPU rentals that provides significant advantages for audio processing tasks. Unlike traditional cloud providers, VAST.ai allows you to:

  • Rent precisely the GPU capacity needed for your workload
  • Access GPUs at more affordable rates
  • Access a wider range of GPU SKUs, particularly lower-VRAM, more affordable cards
  • Avoid long-term commitments for experimental projects

For speaker diarization specifically, VAST.ai is particularly well-suited: these models typically require GPU acceleration but don't demand extensive resources. You can rent just the right amount of computing power for this specific task without overpaying for unused capacity, making it an economical choice for audio processing workflows that might otherwise be cost-prohibitive on traditional cloud platforms.

Download This Notebook

To follow along with this tutorial, you can download the complete Jupyter notebook:

Notebook

Having the notebook will allow you to execute the code blocks as you read through this guide, making it easier to understand and experiment with the diarization implementation.

Choosing an Instance

For running the Pyannote Speaker Diarization model on VAST.ai, you'll need a relatively modest GPU setup. The pyannote/speaker-diarization-3.1 model runs in pure PyTorch and is designed to be efficient. Here are the recommended specifications:

  • GPU: A low-end GPU like an RTX 3060 or 4060 would be sufficient.
  • VRAM: 6-8GB of VRAM should be adequate as the Pyannote diarization pipeline is relatively efficient.
  • RAM: 8-16GB system RAM is recommended for processing audio files.
  • Storage: At least 10GB for the model, dependencies, and your audio files.
  • CUDA: Make sure the instance has CUDA installed (version 11.0+ recommended).
  • Python: Python 3.8+ with PyTorch installed.

Selecting an Instance

Follow these steps to set up your environment:

  1. Ensure that you have a Vast.ai account
  2. Go to the Vast Templates in the Console (https://cloud.vast.ai/templates/)
  3. Select the PyTorch (CuDNN Runtime) Template
  4. Filter for an instance with:
    • 1 GPU
    • 6-8GB of VRAM
    • 8-16GB system RAM
    • 10GB of storage
  5. Select an instance and click rent
  6. Install the Vast TLS certificate in your browser to access the notebook server (https://docs.vast.ai/instances/jupyter#1SmCz)
  7. Go to your Instances (https://cloud.vast.ai/instances/) and click "Open" to access the Jupyter server on your instance.
  8. Upload the notebook to the server or create a new notebook
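Once the notebook server is open, it's worth confirming that the rented GPU is visible before installing anything. The PyTorch template ships with torch preinstalled, so a minimal sanity check like the following should work as-is (the reported name and memory will depend on the machine you rented):

import torch

# Confirm a CUDA-capable GPU is visible and report its name and memory.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print("GPU:", torch.cuda.get_device_name(0))
    print(f"VRAM: {props.total_memory / 1024**3:.1f} GB")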

Installing Dependencies

Let's start by installing the necessary Python packages:

%%bash
pip install pyannote.audio
pip install pydub
pip install librosa
pip install datasets
pip install soundfile

We also need to install FFmpeg for audio processing:

%%bash
apt-get update && apt-get install -y ffmpeg
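As an optional sanity check, you can confirm that the toolkit imports cleanly and that ffmpeg is now on the PATH. A small sketch:

import shutil
import pyannote.audio  # import check: raises if the install failed

# Verify ffmpeg is reachable from the notebook environment.
print("ffmpeg found at:", shutil.which("ffmpeg"))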

Setting Up Your Hugging Face Token

Pyannote models are hosted on Hugging Face, so you'll need to set up authentication:

# Make sure you've accepted the user conditions at:
# https://huggingface.co/pyannote/speaker-diarization-3.1
# https://huggingface.co/pyannote/segmentation-3.0

HF_TOKEN = ""  # Add your token here

Ensure that you have accepted the terms for the models at the URLs above. The models are free to use, but you must agree to their terms of service.
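Rather than pasting the token directly into the notebook, you can also read it from an environment variable. A small sketch, assuming you've exported HF_TOKEN on the instance:

import os

# Read the Hugging Face token from the environment instead of hard-coding it.
HF_TOKEN = os.environ.get("HF_TOKEN", "")
assert HF_TOKEN, "Set the HF_TOKEN environment variable or paste your token above"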

Downloading Test Data

For this tutorial, we'll use a sample file from the AMI Meeting Corpus dataset, which is a collection of 100 hours of meeting recordings. This dataset is perfect for testing speaker diarization as it contains natural multi-speaker conversations.

from datasets import load_dataset
import os
import soundfile as sf

# Create a directory to save the files
os.makedirs("ami_samples", exist_ok=True)

# Load the dataset with the correct split
dataset = load_dataset("diarizers-community/ami", "ihm", split="train", streaming=True)

# Load just one sample
n_samples = 1
samples = list(dataset.take(n_samples))

for i, sample in enumerate(samples):
    audio = sample["audio"]
    audio_array = audio["array"]
    sampling_rate = audio["sampling_rate"]
    
    # Calculate duration in seconds
    duration = len(audio_array) / sampling_rate
    
    # Use soundfile to save the audio
    output_path = f"ami_samples/sample_{i}.wav"
    sf.write(output_path, audio_array, sampling_rate)
    
    print(f"Saved {output_path} - Speaker: {sample['speakers']} - Duration: {duration:.2f} seconds")

Speaker Diarization

First, let's set up the Speaker Diarization pipeline:

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1",
        use_auth_token=HF_TOKEN
)

# Move pipeline to appropriate device
pipeline = pipeline.to(device)

Next, we'll process our audio file to identify different speakers and their speaking turns:

# Process the audio file
audio_file = "./ami_samples/sample_0.wav"
print(f"Processing {audio_file} on {device}")
output = pipeline(audio_file)

print("Voice activity segments:")

for segment, _, speaker in output.itertracks(yield_label=True):
    result = f"{segment.start:.2f} --> {segment.end:.2f} (duration: {segment.duration:.2f}s) Speaker: {speaker}"
    print(result)

The output will look something like this:

Processing ./ami_samples/sample_0.wav on cuda
Voice activity segments:
18.36 --> 18.42 (duration: 0.07s) Speaker: SPEAKER_03
23.01 --> 25.63 (duration: 2.62s) Speaker: SPEAKER_03
27.08 --> 27.64 (duration: 0.56s) Speaker: SPEAKER_05
...
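If you want to keep these results in a standard format, the annotation returned by the pipeline can be written out as RTTM, a common text format for diarization output. A minimal sketch (the output filename here is just an example):

# Save the diarization result in RTTM format so it can be reloaded or scored later.
with open("sample_0.rttm", "w") as rttm:
    output.write_rttm(rttm)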

Additional Analytics

Now that we have processed our file, let's explore some useful features of the Pyannote SDK:

Speaker Time

Here we calculate the total speaking time for each speaker:

for speaker in output.labels():
    speaking_time = output.label_duration(speaker)
    print(f"Speaker {speaker} total speaking time: {speaking_time:.2f}s")

This will output something like:

Speaker SPEAKER_00 total speaking time: 558.98s
Speaker SPEAKER_01 total speaking time: 18.98s
Speaker SPEAKER_02 total speaking time: 22.88s
Speaker SPEAKER_03 total speaking time: 469.68s
Speaker SPEAKER_04 total speaking time: 698.02s
Speaker SPEAKER_05 total speaking time: 190.70s
Speaker SPEAKER_06 total speaking time: 5.74s
...
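Building on the same label_duration calls, we can also express each speaker's time as a share of all detected speech. A small sketch (note that overlapping speech is counted once per speaker here):

# Relative share of detected speech per speaker.
total = sum(output.label_duration(s) for s in output.labels())
for speaker in output.labels():
    share = 100 * output.label_duration(speaker) / total
    print(f"Speaker {speaker}: {share:.1f}% of detected speech")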

Speaker Overlap

Pyannote can identify regions where multiple speakers are talking simultaneously:

overlap = output.get_overlap()
print(f"Overlapping speech regions: {overlap}")

This will output the segments where more than one speaker is speaking.

Overlapping speech regions:
[[ 00:00:27.672 -->  00:00:27.689]
 [ 00:00:38.337 -->  00:00:38.860]
 [ 00:00:40.395 -->  00:00:40.463]
 ...]
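Because get_overlap() returns a pyannote Timeline, we can also summarize how many overlapping regions there are and how much audio they cover. A small sketch:

# Summarize the overlap timeline: number of regions and total duration.
print(f"Number of overlapping regions: {len(overlap)}")
print(f"Total overlapping speech: {overlap.duration():.2f}s")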

Filter by Speaker

We can also filter the diarization output to focus on a specific speaker:

speaker = "SPEAKER_06"
speaker_turns = output.label_timeline(speaker)
print(f"Speaker {speaker} speaks at:")
for speaker_turn in speaker_turns:
    print(speaker_turn)

This will show the segments where this speaker is speaking.

Speaker SPEAKER_06 speaks at:
[ 00:03:45.767 -->  00:03:45.852]
[ 00:03:55.386 -->  00:03:55.521]
[ 00:05:07.257 -->  00:05:07.274]
...
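Another useful filter is restricting the diarization to a time window. A minimal sketch that keeps only the first two minutes using the annotation's crop method:

from pyannote.core import Segment

# Keep only the part of the diarization that falls within the first two minutes.
first_two_minutes = output.crop(Segment(0, 120))
for segment, _, speaker in first_two_minutes.itertracks(yield_label=True):
    print(f"{segment.start:.2f} --> {segment.end:.2f} Speaker: {speaker}")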

Extracting Speaker Segments

To verify our results and prepare the audio for further processing, let's split the original audio into segments by speaker:

import shutil
from pydub import AudioSegment

def split_audio_by_segments(audio_path, diarization_output, output_dir="output_segments"):
    """
    Split an audio file into multiple files based on diarization output
    
    Parameters:
    -----------
    audio_path: str
        Path to the input audio file
    diarization_output: Annotation
        Pyannote diarization output
    output_dir: str
        Directory to save the output segments
    """
    # Clear the output directory if it exists
    if os.path.exists(output_dir):
        shutil.rmtree(output_dir)
    
    # Create output directory
    os.makedirs(output_dir, exist_ok=True)
    
    # Load the audio file
    audio = AudioSegment.from_file(audio_path)
    
    # Extract each segment with speaker information
    for i, (segment, _, speaker) in enumerate(diarization_output.itertracks(yield_label=True)):
        # Convert seconds to milliseconds
        start_ms = int(segment.start * 1000)
        end_ms = int(segment.end * 1000)
        
        # Extract segment
        segment_audio = audio[start_ms:end_ms]
        
        # Generate output filename with speaker information
        filename = os.path.basename(audio_path)
        name, ext = os.path.splitext(filename)
        output_path = os.path.join(output_dir, f"{name}_segment_{i+1:04d}_{start_ms:08d}ms-{end_ms:08d}ms_speaker_{speaker}{ext}")
        
        # Export segment
        segment_audio.export(output_path, format=ext.replace('.', ''))
        print(f"Saved segment {i+1} to {output_path} (Speaker: {speaker})")

# Apply the function to our audio file
split_audio_by_segments(audio_file, output)
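As a quick summary of what was written, we can count the exported segments per speaker by parsing the filenames produced above (a small sketch tied to that naming scheme):

import os
from collections import Counter

# Count exported segments per speaker based on the filename convention above.
segment_files = [f for f in os.listdir("output_segments") if "_speaker_" in f]
counts = Counter(f.split("_speaker_")[1].rsplit(".", 1)[0] for f in segment_files)
for speaker, n in sorted(counts.items()):
    print(f"{speaker}: {n} segments")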

Inspecting Results

To verify our results, we'll create a function to play audio files in our Jupyter environment:

import librosa
from IPython.display import Audio, display

def play_audio(file_path, sr=None):
    """
    Play an audio file in a Jupyter notebook.
    """
    # Load the audio file
    y, sr = librosa.load(file_path, sr=sr)
    
    # Display an audio widget to play the sound
    audio_widget = Audio(data=y, rate=sr)
    display(audio_widget)

Now, let's listen to a few clips to verify that the speakers were correctly identified and isolated.

import os
audio_dir = "./output_segments/"

audio_files = os.listdir(audio_dir)
audio_files.sort()

n_offset = 21
n_clips = 5

for fname in audio_files[n_offset:n_clips + n_offset]:
    print(f"File: {fname}")
    
    # Extract speaker info if present in filename
    if "_speaker_" in fname:
        speaker_part = fname.split("_speaker_")[1].split(".")[0]
        print(f"Speaker: {speaker_part}")
    
    play_audio(audio_dir + fname)

Around file 21 or so, we see a 14-second clip of SPEAKER_00 speaking:

File: sample_0_segment_0025_00055364ms-00070045ms_speaker_SPEAKER_00.wav
Speaker: SPEAKER_00

A few files later we see a 10-second clip of SPEAKER_00 speaking again:

File: sample_0_segment_0029_00071530ms-00081840ms_speaker_SPEAKER_00.wav
Speaker: SPEAKER_00

Note: Your filenames may differ. Each run of the model can produce a different number of clips, and the diarization model sometimes captures less than a second of audio and labels it as a speaker.

The 10-second clip is SPEAKER_00 talking about the meeting agenda, and it contains only SPEAKER_00.

The 14-second clip, however, also contains speech from other speakers. This seems like an error at first, but we will notice two other files that capture the additional speech:

sample_0_segment_0026_00059228ms-00060325ms_speaker_SPEAKER_02.wav
sample_0_segment_0027_00061793ms-00062924ms_speaker_SPEAKER_02.wav

With further processing, we could remove these from the SPEAKER_00 file if necessary for our application.
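One way to do that is with pyannote's Timeline operations: take the overlap regions, compute their complement over the recording, and keep only the parts of SPEAKER_00's turns that fall in that complement. A minimal sketch (adapt and re-export with pydub as needed):

# Regions where more than one speaker is active
overlap = output.get_overlap()

# Everything outside those regions, within the extent of the recording
non_overlap = overlap.gaps(support=output.get_timeline().extent())

# SPEAKER_00's turns with the overlapping portions trimmed away
clean_timeline = output.label_timeline("SPEAKER_00").crop(non_overlap, mode="intersection")

for segment in clean_timeline:
    print(f"{segment.start:.2f} --> {segment.end:.2f}")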

We can also use the Pyannote overlap function to verify that we did have overlapping speech at this time in the audio file.

Verifying Speaker Overlap

Finally, let's check for regions where multiple speakers are talking simultaneously:

overlap = output.get_overlap()
for overlap_ts in overlap:
    print(f"Overlapping speech regions: {overlap_ts}")

Here we see two overlapping segments that match up with the overlapping files we found above.

sample_0_segment_0026_00059228ms-00060325ms_speaker_SPEAKER_02.wav
sample_0_segment_0027_00061793ms-00062924ms_speaker_SPEAKER_02.wav

Overlapping speech regions: [ 00:00:59.228 -->  00:01:00.325]
Overlapping speech regions: [ 00:01:01.793 -->  00:01:02.924]

Conclusion

This tutorial demonstrates how Pyannote's speaker diarization on VAST.ai provides a powerful solution for identifying "who spoke when" in multi-speaker recordings. The implementation offers several advantages:

  • Accuracy and Efficiency: Pyannote accurately identifies different speakers and their speaking times, even detecting overlapping speech
  • Practical Applications: Enables speaker-attributed transcription, conversation analytics, and content indexing for meetings, podcasts, and interviews
  • Cost-Effective Processing: VAST.ai's affordable GPU rentals make processing large audio datasets accessible without expensive hardware investments

The Pyannote diarization model provides impressive results out of the box, and running it on VAST.ai makes it accessible and affordable for a wide range of applications. Whether you're building a meeting transcription service, analyzing call center interactions, or researching conversation dynamics, this approach gives you a solid starting point.

Look out for more content from Vast about other audio processing tasks!
