March 5, 2025
In the world of audio processing and speech recognition, identifying when someone is speaking versus when there's silence or background noise is a critical first step. This process, known as Voice Activity Detection (VAD), serves as the foundation for many speech-related applications, from transcription services to voice assistants. While conceptually simple, implementing an efficient and accurate VAD system can significantly improve downstream tasks and reduce computational costs.
PyAnnote Audio, an open-source toolkit built on PyTorch, offers state-of-the-art models for VAD that are both accurate and accessible. Running these models on Vast.ai provides a cost-effective solution for processing large audio datasets without investing in expensive hardware. This combination gives developers and researchers the tools they need to build sophisticated audio processing pipelines at a fraction of the cost of traditional cloud providers.
VAD provides several key benefits for speech processing pipelines:
Reduced Computation Load: By filtering out non-speech segments before running speech-to-text (STT) models, we significantly reduce the computational resources needed for transcription.
Improved Accuracy: Many STT models perform better when processing only speech segments rather than trying to interpret silence or background noise.
Efficient Storage: Extracting only the speech segments can reduce storage requirements for large audio datasets.
Better User Experience: For applications like voice assistants or transcription services, VAD helps eliminate unnecessary processing of silence.
In this guide, we will set up a GPU instance on Vast.ai, install the required dependencies, run the PyAnnote VAD pipeline on a sample recording, and split the audio into individual speech clips.
The output will be a collection of audio files containing only the detected speech segments from the original recording, making them ready for further processing in speech-to-text pipelines.
Vast.ai offers a marketplace approach to GPU rentals that provides significant advantages for audio processing tasks. Unlike traditional cloud providers, Vast.ai lets you choose from a wide range of GPU types and price points at market-based rates, so you only pay for the hardware your workload actually needs.
For VAD specifically, Vast.ai offers an ideal balance of performance and cost-effectiveness, as these models benefit from GPU acceleration without requiring the most expensive hardware tiers.
To follow along with this tutorial, you can download the complete Jupyter notebook.
Having the notebook will allow you to execute the code blocks as you read through this guide, making it easier to understand and experiment with the VAD implementation.
For running the Pyannote Voice Activity Detection model on Vast.ai, you'll need only a relatively modest GPU setup, since VAD models are computationally efficient compared to larger AI tasks; a single mid-range GPU with a few gigabytes of VRAM is more than sufficient.
Follow these steps to set up your environment: create a Vast.ai account, rent a GPU instance using the PyTorch (CuDNN Runtime) template, and open a Jupyter notebook on the instance.
Let's start by installing the necessary Python packages:
%%bash
pip install pyannote.audio
pip install pydub
pip install librosa
pip install yt-dlp
We also need to install FFmpeg for audio processing:
%%bash
apt-get update && apt-get install -y ffmpeg
Pyannote models are hosted on Hugging Face, so you'll need to set up authentication:
# Make sure you've accepted the user conditions at:
# https://hf.co/pyannote/voice-activity-detection
# https://hf.co/pyannote/segmentation
HF_TOKEN = "" # Add your token here
Ensure that you have accepted the terms for the models at the URLs above. The models are free to use, but you must agree to their terms of service.
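Alternatively, you can log in once with the huggingface_hub helper, which caches the token on disk so later downloads can omit it (a minimal sketch; huggingface_hub is installed as a dependency of pyannote.audio):
# Optional alternative: cache the token locally so downstream
# calls can authenticate without passing it explicitly.
from huggingface_hub import login
login(token=HF_TOKEN)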
For this tutorial, we'll download a sample audio file from Vast.ai's YouTube channel. You can also use your own audio file if you prefer:
%%bash
yt-dlp -f "bestaudio" --extract-audio --audio-format wav -o "test.wav" https://www.youtube.com/watch?v=542xENIxKFU
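Before running VAD, it's worth a quick sanity check on the download. The short sketch below uses librosa (installed earlier) to confirm the file loads and to report its native sample rate and duration; the pyannote pipeline handles any resampling internally:
import librosa
# Load at the file's native sample rate (sr=None) purely for inspection
y, sr = librosa.load("test.wav", sr=None)
print(f"Sample rate: {sr} Hz, duration: {len(y) / sr:.1f}s")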
First, let's set up the VAD pipeline:
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
from pyannote.audio import Pipeline
pipeline = Pipeline.from_pretrained(
"pyannote/voice-activity-detection",
use_auth_token=HF_TOKEN
)
# Move pipeline to appropriate device
pipeline = pipeline.to(device)
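The pretrained pipeline ships with sensible defaults, but you can re-instantiate it with different hyperparameters, for example to drop very short speech bursts or bridge brief pauses. The parameter names below are those documented for pyannote's VAD pipeline, and the values are purely illustrative; availability can vary between pyannote.audio versions, so treat this as a sketch:
# Optional: tune detection behavior (illustrative values).
# - onset/offset: thresholds on the speech probability curve
# - min_duration_on: drop speech regions shorter than this (seconds)
# - min_duration_off: fill non-speech gaps shorter than this (seconds)
pipeline.instantiate({
    "onset": 0.5,
    "offset": 0.5,
    "min_duration_on": 0.1,
    "min_duration_off": 0.1,
})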
Next, we'll process our audio file to identify speech segments:
# Process the audio file
audio_file = "test.wav"
output = pipeline(audio_file)
print(f"Processing {audio_file} on {device}")
print("Voice activity segments:")
# Get all speech segments
speech_segments = list(output.get_timeline().support())
for i, speech in enumerate(speech_segments):
    # Active speech between speech.start and speech.end
    print(f"Segment {i+1}: Speech from {speech.start:.2f}s to {speech.end:.2f}s (duration: {speech.duration:.2f}s)")
The output will look like this when using the default audio file:
Processing test.wav on cuda
Voice activity segments:
Segment 1: Speech from 6.78s to 51.62s (duration: 44.84s)
Segment 2: Speech from 53.56s to 54.27s (duration: 0.71s)
Segment 3: Speech from 55.55s to 84.76s (duration: 29.21s)
Segment 4: Speech from 86.53s to 89.03s (duration: 2.50s)
...
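To quantify the computation and storage savings described earlier, we can compare the total detected speech against the length of the full recording. Here is a small sketch using librosa (note that get_duration takes path= in librosa 0.10+):
import librosa
# Compare total speech duration against the full recording length
total_speech = sum(s.duration for s in speech_segments)
total_audio = librosa.get_duration(path=audio_file)  # librosa >= 0.10
print(f"Speech: {total_speech:.1f}s of {total_audio:.1f}s "
      f"({100 * total_speech / total_audio:.1f}% of the file)")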
Now we'll create a function to extract the identified speech segments from our audio file:
import os
import shutil
from pydub import AudioSegment
def split_audio_by_segments(audio_path, segments, output_dir="output_segments"):
    """
    Split an audio file into multiple files based on speech segments

    Parameters:
    -----------
    audio_path: str
        Path to the input audio file
    segments: list
        List of speech segments (with start and end attributes)
    output_dir: str
        Directory to save the output segments
    """
    # Clear the output directory if it exists
    if os.path.exists(output_dir):
        shutil.rmtree(output_dir)

    # Create output directory
    os.makedirs(output_dir, exist_ok=True)

    # Load the audio file
    audio = AudioSegment.from_file(audio_path)

    # Extract each segment
    for i, segment in enumerate(segments):
        # Convert seconds to milliseconds
        start_ms = int(segment.start * 1000)
        end_ms = int(segment.end * 1000)

        # Extract segment
        segment_audio = audio[start_ms:end_ms]

        # Generate output filename
        filename = os.path.basename(audio_path)
        name, ext = os.path.splitext(filename)
        output_path = os.path.join(output_dir, f"{name}_segment_{i+1:04d}_{start_ms:08d}ms-{end_ms:08d}ms{ext}")

        # Export segment
        segment_audio.export(output_path, format=ext.replace('.', ''))
        print(f"Saved segment {i+1} to {output_path}")
Let's apply this function to extract our speech segments:
split_audio_by_segments(audio_file, speech_segments)
This will be the output when using the default audio file:
Saved segment 1 to output_segments/test_segment_0001_00006780ms-00051617ms.wav
Saved segment 2 to output_segments/test_segment_0002_00053558ms-00054267ms.wav
Saved segment 3 to output_segments/test_segment_0003_00055549ms-00084760ms.wav
Saved segment 4 to output_segments/test_segment_0004_00086532ms-00089029ms.wav
...
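One practical refinement: cutting exactly at the detected boundaries can clip the first or last syllable of a segment. A common workaround is to pad each segment with a small margin before slicing. The sketch below is a hypothetical variant of the extraction step; pad_ms is an illustrative parameter, not part of the function above:
from pydub import AudioSegment
# Hypothetical: pad each segment by pad_ms on both sides, clamped to
# the bounds of the recording (len(audio) is the duration in ms).
pad_ms = 200  # illustrative margin; tune by ear
audio = AudioSegment.from_file(audio_file)
for i, segment in enumerate(speech_segments):
    start_ms = max(0, int(segment.start * 1000) - pad_ms)
    end_ms = min(len(audio), int(segment.end * 1000) + pad_ms)
    audio[start_ms:end_ms].export(f"padded_segment_{i+1:04d}.wav", format="wav")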
To verify our results, we'll create a function to play audio files in our Jupyter environment:
import librosa
from IPython.display import Audio, display
def play_audio(file_path, sr=None):
    """
    Play an audio file in a Jupyter notebook.

    Parameters:
    -----------
    file_path : str
        Path to the audio file to play
    sr : int, optional
        Sample rate to load the audio with. If None, uses the file's native sample rate.

    Returns:
    --------
    Audio widget that can be played in the notebook

    Example:
    --------
    >>> play_audio('path/to/audio.wav')
    """
    # Load the audio file
    y, sr = librosa.load(file_path, sr=sr)

    # Return an audio widget to play the sound
    audio_widget = Audio(data=y, rate=sr)
    display(audio_widget)
First, we'll play the original audio file. Listen to the first minute or so to get an idea of what it sounds like before VAD.
play_audio(audio_file)
Next, we'll listen to the first three clips to verify that we have isolated the speech in our test file:
import os
audio_dir = "./output_segments/"
audio_files = os.listdir(audio_dir)
audio_files.sort()
n_clips = 3
for fname in audio_files[0:n_clips]:
    play_audio(os.path.join(audio_dir, fname))
You'll notice that the clip lengths match up with the speech_segments output:
Segment 1: Speech from 6.78s to 51.62s (duration: 44.84s)
Segment 2: Speech from 53.56s to 54.27s (duration: 0.71s)
Segment 3: Speech from 55.55s to 84.76s (duration: 29.21s)
...
Listening to the output files, we can confirm that the speech has been effectively isolated.
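From here, each clip can be fed directly to a speech-to-text model. As one example (a sketch, not part of this tutorial's setup), the openai-whisper package, installed separately with pip install openai-whisper, can transcribe every extracted segment:
# Sketch: transcribe each extracted clip with openai-whisper.
# Assumes: pip install openai-whisper
import os
import whisper
model = whisper.load_model("base")  # small model; trades accuracy for speed
for fname in sorted(os.listdir("output_segments")):
    result = model.transcribe(os.path.join("output_segments", fname))
    print(fname, "->", result["text"].strip())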
With this implementation, you now have a working Voice Activity Detection system that can identify and extract speech segments from audio files. This forms an excellent foundation for more advanced audio processing tasks like speech recognition, speaker diarization, or audio content analysis.
Look out for more content from Vast.ai about these other types of tasks!