
Voice Activity Detection (VAD) with Pyannote on VAST

- Team Vast

March 5, 2025 - Audio Processing, Speech Recognition, Vast.ai

Introduction

In the world of audio processing and speech recognition, identifying when someone is speaking versus when there's silence or background noise is a critical first step. This process, known as Voice Activity Detection (VAD), serves as the foundation for many speech-related applications, from transcription services to voice assistants. While conceptually simple, implementing an efficient and accurate VAD system can significantly improve downstream tasks and reduce computational costs.

Pyannote Audio, an open-source toolkit built on PyTorch, offers state-of-the-art models for VAD that are both accurate and accessible. Running these models on Vast.ai provides a cost-effective solution for processing large audio datasets without investing in expensive hardware. This combination gives developers and researchers the tools they need to build sophisticated audio processing pipelines at a fraction of the cost of traditional cloud providers.

Why Voice Activity Detection Matters

VAD provides several key benefits for speech processing pipelines:

  1. Reduced Computation Load: By filtering out non-speech segments before running speech-to-text (STT) models, we significantly reduce the computational resources needed for transcription.

  2. Improved Accuracy: Many STT models perform better when processing only speech segments rather than trying to interpret silence or background noise.

  3. Efficient Storage: Extracting only the speech segments can reduce storage requirements for large audio datasets.

  4. Better User Experience: For applications like voice assistants or transcription services, VAD helps eliminate unnecessary processing of silence.

What This Guide Covers

In this guide, we will:

  • Set up the Pyannote Audio VAD pipeline
  • Process audio files to detect speech segments
  • Extract and save only the speech portions of the input audio
  • Visualize the results

The output will be a collection of audio files containing only the detected speech segments from the original recording, making them ready for further processing in speech-to-text pipelines.

Why VAST.ai

VAST.ai offers a marketplace approach to GPU rentals that provides significant advantages for audio processing tasks. Unlike traditional cloud providers, VAST.ai allows you to:

  • Rent precisely the GPU capacity needed for your workload
  • Access GPUs at more affordable rates
  • Access a wider range of GPU SKUs, particularly lower-VRAM, more affordable GPUs
  • Avoid long-term commitments for experimental projects

For VAD specifically, VAST.ai offers an ideal balance of performance and cost-effectiveness, as these models benefit from GPU acceleration without requiring the most expensive hardware tiers.

Download This Notebook

To follow along with this tutorial, you can download the complete Jupyter notebook:

Notebook

Having the notebook will allow you to execute the code blocks as you read through this guide, making it easier to understand and experiment with the VAD implementation.

Choosing an Instance

For running the Pyannote Voice Activity Detection model on VAST.ai, you'll need a relatively modest GPU setup since VAD models are computationally efficient compared to larger AI tasks. Here are the recommended specifications:

  • GPU: A mid-range GPU like an RTX 3060 or 4060 would be sufficient.
  • VRAM: 6-8GB of VRAM should be adequate as the Pyannote VAD model is relatively small.
  • RAM: 8-16GB system RAM is recommended for processing audio files.
  • Storage: At least 10GB for the model, dependencies, and your audio files.
  • CUDA: Make sure the instance has CUDA installed (version 11.0+ recommended).
  • Python: Python 3.8+ with PyTorch installed.

Selecting an Instance

Follow these steps to set up your environment:

  1. Ensure that you have a Vast.ai account
  2. Go to the Vast Templates in the Console https://cloud.vast.ai/templates/
  3. Select the PyTorch (CuDNN Runtime) Template
  4. Filter for an instance with:
    • 1 GPU
    • 6-8GB of VRAM
    • 8-16GB system RAM
    • 10GB of storage
  5. Select an instance and click Rent
  6. Install the Vast TLS certificate in your browser to access the notebook server https://docs.vast.ai/instances/jupyter#1SmCz
  7. Go to your Instances https://cloud.vast.ai/instances/ and click "Open" to access the Jupyter server on your instance
  8. Upload the notebook to the server or create a new notebook

Installing Dependencies

Let's start by installing the necessary Python packages:

%%bash
pip install pyannote.audio
pip install pydub
pip install librosa
pip install yt-dlp

We also need to install FFmpeg for audio processing:

%%bash
apt-get update && apt-get install -y ffmpeg
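
Before moving on, it's worth confirming that the GPU is visible to PyTorch and that FFmpeg is on the path. Here is a minimal sanity check, assuming nothing beyond the packages installed above:

import shutil
import torch

# Confirm that PyTorch can see the GPU and that ffmpeg is installed
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
print("ffmpeg found at:", shutil.which("ffmpeg"))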

Setting Up Your Hugging Face Token

Pyannote models are hosted on Hugging Face, so you'll need to set up authentication:

# Make sure you've accepted the user conditions at:
# https://hf.co/pyannote/voice-activity-detection
# https://hf.co/pyannote/segmentation

HF_TOKEN = ""  # Add your token here

Ensure that you have accepted the terms for the models at the URLs above. The models are free to use, but you must agree to their terms of service.
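
If you prefer not to paste the token directly into the notebook, you can read it from an environment variable instead. A small sketch (the HF_TOKEN environment variable name is our choice; set it however you configure your instance):

import os

# Fall back to the value set above if the environment variable is not defined
HF_TOKEN = os.environ.get("HF_TOKEN", HF_TOKEN)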

Generating Test Data

For this tutorial, we'll download a sample audio file from Vast.ai's YouTube channel. You can also use your own audio file if you prefer:

%%bash
yt-dlp -f "bestaudio" --extract-audio --audio-format wav -o "test.wav" https://www.youtube.com/watch?v=542xENIxKFU
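
If you'd rather use your own recording, any format FFmpeg can read will work. A quick sketch using pydub (installed earlier) to convert a hypothetical my_recording.mp3 into the test.wav filename used throughout this guide:

from pydub import AudioSegment

# Convert your own recording to WAV (the input filename is a placeholder)
AudioSegment.from_file("my_recording.mp3").export("test.wav", format="wav")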

Voice Activity Detection

First, let's set up the VAD pipeline:

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/voice-activity-detection",
    use_auth_token=HF_TOKEN
)

# Move pipeline to appropriate device
pipeline = pipeline.to(device)
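
The pretrained pipeline comes already instantiated with sensible defaults. If you want to experiment, pyannote pipelines can be re-instantiated with different hyperparameters; the sketch below assumes the parameter names documented for pyannote's VoiceActivityDetection pipeline (onset, offset, min_duration_on, min_duration_off), so verify them against your installed version before relying on it:

# Optional: re-instantiate the pipeline with explicit hyperparameters
pipeline.instantiate({
    "onset": 0.5,             # activation threshold for starting a speech region
    "offset": 0.5,            # activation threshold for ending a speech region
    "min_duration_on": 0.0,   # drop speech regions shorter than this many seconds
    "min_duration_off": 0.0,  # fill non-speech gaps shorter than this many seconds
})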

Next, we'll process our audio file to identify speech segments:

# Process the audio file
audio_file = "test.wav"
output = pipeline(audio_file)

print(f"Processing {audio_file} on {device}")
print("Voice activity segments:")

# Get all speech segments
speech_segments = list(output.get_timeline().support())

for i, speech in enumerate(speech_segments):
    # active speech between speech.start and speech.end
    print(f"Segment {i+1}: Speech from {speech.start:.2f}s to {speech.end:.2f}s (duration: {speech.duration:.2f}s)")

The output will look like this when using the default audio file:

Processing test.wav on cuda
Voice activity segments:
Segment 1: Speech from 6.78s to 51.62s (duration: 44.84s)
Segment 2: Speech from 53.56s to 54.27s (duration: 0.71s)
Segment 3: Speech from 55.55s to 84.76s (duration: 29.21s)
Segment 4: Speech from 86.53s to 89.03s (duration: 2.50s)
...
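
Before extracting anything, it can be useful to know how much of the recording is actually speech. A small sketch using pydub (already installed) to compare the total duration with the detected speech:

from pydub import AudioSegment

# Total length of the file (pydub measures length in milliseconds)
total_s = len(AudioSegment.from_file(audio_file)) / 1000.0
speech_s = sum(segment.duration for segment in speech_segments)

print(f"Total audio: {total_s:.1f}s")
print(f"Speech:      {speech_s:.1f}s ({100 * speech_s / total_s:.1f}%)")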

Saving Speech Segments

Now we'll create a function to extract the identified speech segments from our audio file:

import os
import shutil
from pydub import AudioSegment


def split_audio_by_segments(audio_path, segments, output_dir="output_segments"):
    """
    Split an audio file into multiple files based on speech segments

    Parameters:
    -----------
    audio_path: str
        Path to the input audio file
    segments: list
        List of speech segments (with start and end attributes)
    output_dir: str
        Directory to save the output segments
    """
    # Clear the output directory if it exists
    if os.path.exists(output_dir):
        shutil.rmtree(output_dir)

    # Create output directory
    os.makedirs(output_dir, exist_ok=True)

    # Load the audio file
    audio = AudioSegment.from_file(audio_path)

    # Extract each segment
    for i, segment in enumerate(segments):
        # Convert seconds to milliseconds
        start_ms = int(segment.start * 1000)
        end_ms = int(segment.end * 1000)

        # Extract segment
        segment_audio = audio[start_ms:end_ms]

        # Generate output filename
        filename = os.path.basename(audio_path)
        name, ext = os.path.splitext(filename)
        output_path = os.path.join(output_dir, f"{name}_segment_{i+1:04d}_{start_ms:08d}ms-{end_ms:08d}ms{ext}")

        # Export segment
        segment_audio.export(output_path, format=ext.replace('.', ''))
        print(f"Saved segment {i+1} to {output_path}")

Let's apply this function to extract our speech segments:

split_audio_by_segments(audio_file, speech_segments)

This will be the output when using the default audio file:

Saved segment 1 to output_segments/test_segment_0001_00006780ms-00051617ms.wav
Saved segment 2 to output_segments/test_segment_0002_00053558ms-00054267ms.wav
Saved segment 3 to output_segments/test_segment_0003_00055549ms-00084760ms.wav
Saved segment 4 to output_segments/test_segment_0004_00086532ms-00089029ms.wav
...
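
If your downstream pipeline prefers one continuous file rather than many small clips, the segments can also be stitched into a single speech-only recording. A sketch using pydub (the output filename is arbitrary):

from pydub import AudioSegment

# Concatenate all detected speech segments into one speech-only file
audio = AudioSegment.from_file(audio_file)
speech_only = AudioSegment.empty()

for segment in speech_segments:
    speech_only += audio[int(segment.start * 1000):int(segment.end * 1000)]

speech_only.export("test_speech_only.wav", format="wav")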

Inspecting Results

To verify our results, we'll create a function to play audio files in our Jupyter environment:

import librosa
from IPython.display import Audio, display

def play_audio(file_path, sr=None):
    """
    Play an audio file in a Jupyter notebook.

    Parameters:
    -----------
    file_path : str
        Path to the audio file to play
    sr : int, optional
        Sample rate to load the audio with. If None, uses the file's native sample rate.

    Returns:
    --------
    Audio widget that can be played in the notebook

    Example:
    --------
    >>> play_audio('path/to/audio.wav')
    """
    # Load the audio file (at its native sample rate unless sr is given)
    y, sr = librosa.load(file_path, sr=sr)

    # Display and return an audio widget to play the sound
    audio_widget = Audio(data=y, rate=sr)
    display(audio_widget)
    return audio_widget

First, we'll play the original audio file. Listen to the first minute or so to get an idea of what it sounds like before VAD.

play_audio(audio_file)

Next, we'll listen to the first three clips to verify that we have isolated the speech in our test file:

import os
audio_dir = "./output_segments/"

audio_files = os.listdir(audio_dir)
audio_files.sort()

n_clips = 3

for fname in audio_files[0:n_clips]:
    play_audio(audio_dir + fname)

You'll see that the audio lengths match up with the speech_segments output:

Segment 1: Speech from 6.78s to 51.62s (duration: 44.84s)
Segment 2: Speech from 53.56s to 54.27s (duration: 0.71s)
Segment 3: Speech from 55.55s to 84.76s (duration: 29.21s)
...

Listening to the output files confirms that we have effectively isolated the speech.

Conclusion

With this implementation, you now have a working Voice Activity Detection system that can identify and extract speech segments from audio files. This forms an excellent foundation for more advanced audio processing tasks like speech recognition, speaker diarization, or audio content analysis.

Look out for more content from Vast about all these other types of tasks!
