Blog

Implementing Speech-to-Text with Speaker Diarization: Comparing Pyannote and Sortformer on VAST.ai

- Team Vast

March 14, 2025 - Pyannote, NVIDIA, Speech recognition, Vast.ai

Overview

Today we'll explore two leading open source speaker diarization technologies: Pyannote Audio and NVIDIA's Sortformer. We'll integrate them with Whisper for speech recognition on VAST.ai's cloud computing infrastructure. Speaker diarization technology answers the critical question of "who spoke when?" by segmenting audio recordings based on speaker identity. This is particularly helpful for transcribing audio with multiple speakers.

The Power of Whisper for Speech Recognition

OpenAI's Whisper represents a significant advancement in automatic speech recognition. While Whisper provides high-quality transcription, it doesn't inherently distinguish between different speakers. For multi-speaker content like meetings, interviews, or podcasts, we need to combine Whisper with speaker diarization technology to create truly useful transcripts.

The Impact of Speaker Diarization

Speaker diarization technology offers multiple advantages in audio processing workflows:

  1. Speaker Differentiation: Distinguishes between multiple speakers in conversations, interviews, meetings, and other multi-person recordings.

  2. Enhanced Transcription Quality: When paired with speech recognition systems like Whisper, diarization creates speaker-attributed transcripts that associate text with specific speakers.

  3. Computational Optimization: By identifying and isolating speech segments by speaker and filtering out non-speech audio, diarization can optimize downstream processing tasks, reducing unnecessary computation.

  4. Content Navigation: Enables searching and indexing audio content by individual speakers, making it easy to locate specific speakers' contributions.

What We'll Build

By the end of this tutorial, we'll have:

  • Transcribed an audio file with Whisper
  • Implemented two different diarization models: Pyannote Audio and NVIDIA Sortformer
  • Processed audio files to detect different speakers and their speaking turns
  • Combined diarization results with Whisper transcription to create speaker-attributed transcripts
  • Compared the outputs from both diarization pipelines
  • Learned which approach works best for different scenarios

Let's dive in!

Setting Up Our Environment on VAST.ai

For this project, we'll use VAST.ai to access GPU computing resources. VAST.ai operates as a marketplace where you can rent GPUs from a variety of providers, which is particularly useful for running computationally intensive models like the ones we'll be using for speaker diarization in this tutorial.

The NVIDIA Sortformer model in particular requires significant GPU resources and GPU RAM.

Choosing an Instance

The NVIDIA Sortformer model is much larger than the Pyannote diarization model. The pyannote/speaker-diarization-3.1 model runs in pure PyTorch, is designed to be efficient, and needs only a relatively modest GPU setup, so Sortformer drives the hardware requirements for this tutorial. Here are the recommended specifications:

For NVIDIA Sortformer:

  • GPU: A higher-end GPU is recommended. NVIDIA tests were performed on an RTX A6000 (48GB VRAM).
  • VRAM: At least 16GB VRAM is recommended, with 24GB+ preferred for longer recordings.
  • RAM: 16-32GB system RAM is recommended.
  • Storage: At least 15GB for the model, dependencies, and your audio files.
  • CUDA: Make sure the instance has CUDA installed (version 11.0+ recommended).
  • Python: Python 3.8+ with PyTorch installed.

Renting an Instance on Vast.ai

  1. Ensure that you have a Vast.ai account
  2. Go to the Vast Templates in the Console https://cloud.vast.ai/templates/
  3. Select the PyTorch (CuDNN Runtime) Template
  4. Filter for an instance with:
  • 1 GPU
  • 24GB+ of VRAM
  • 16-32GB system RAM
  • 15GB of storage
  5. Select an instance and click rent
  6. Install the Vast TLS certificate in your browser to access the notebook server https://docs.vast.ai/instances/jupyter#1SmCz
  7. Go to your Instances https://cloud.vast.ai/instances/ and click "Open" to access the Jupyter server on your instance.
  8. Upload the provided notebook to the server or create your own

You can download the notebook here.

Install Dependencies

Before we begin, let's install the necessary Python packages and system dependencies for our diarization pipelines:

# Downgrade NumPy to a version compatible with NeMo/Sortformer
pip install numpy==1.24.3 --force-reinstall

# Audio handling and Hugging Face libraries
pip install pydub
pip install librosa
pip install datasets
pip install transformers
pip install accelerate

# Pyannote speaker diarization
pip install pyannote.audio

# Build tools needed to compile NeMo's dependencies
apt-get update && apt-get install -y build-essential g++
pip install Cython packaging

# NVIDIA NeMo (ASR collection) for Sortformer
pip install "git+https://github.com/NVIDIA/NeMo.git@main#egg=nemo_toolkit[asr]"

# ffmpeg for audio decoding
apt-get update && apt-get install -y ffmpeg
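
Before moving on, it's worth a quick, optional check that PyTorch can actually see the GPU on your instance. The snippet below simply reports the available CUDA device:

# Optional sanity check: confirm PyTorch can see the GPU
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name} ({props.total_memory / 1e9:.1f} GB VRAM)")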

Set Up Your Hugging Face Token

Here we set our Hugging Face token as HF_TOKEN. We need this to access the Pyannote models.

Ensure that you have accepted the terms for https://huggingface.co/pyannote/speaker-diarization-3.1 and https://huggingface.co/pyannote/segmentation-3.0. These models are free to use, but you must accept their terms.

# Make sure you've accepted the user conditions at:
# https://huggingface.co/pyannote/speaker-diarization-3.1
# https://huggingface.co/pyannote/segmentation-3.0

HF_TOKEN = ""
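
If you'd rather not hard-code the token in the notebook, one option (assuming you've exported an HF_TOKEN environment variable on the instance) is to read it from the environment instead:

# Optional: read the token from an environment variable instead of hard-coding it
import os

HF_TOKEN = os.environ.get("HF_TOKEN", "")
if not HF_TOKEN:
    print("Warning: HF_TOKEN is empty - set it above or export it in your environment")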

Download Test Data

We will use a sample file from the AMI Meeting Corpus dataset https://huggingface.co/datasets/diarizers-community/ami, which is a collection of 100 hours of meeting recordings.

We'll pull one sample from the dataset (specifically the 7th sample) which contains a multi-speaker discussion about speech recognition research.

from datasets import load_dataset
import os
import soundfile as sf

# Create a directory to save the files
os.makedirs("ami_samples", exist_ok=True)

# Load the dataset with the correct split
dataset = load_dataset("diarizers-community/ami", "ihm", split="train", streaming=True)

# To get the 7th sample (index 6), skip the first 6 and take 1
index_to_get = 6
sample = next(iter(dataset.skip(index_to_get).take(1)))

# Extract audio data
audio = sample["audio"]
audio_array = audio["array"]
sampling_rate = audio["sampling_rate"]

# Save the sample to the working directory
output_path = "./sample.wav"
sf.write(output_path, audio_array, sampling_rate)
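
As a quick sanity check, we can read the file back and confirm its duration and sample rate (this just inspects the sample.wav we wrote above):

# Quick check: confirm the sample saved correctly and report its duration
import soundfile as sf

info = sf.info("./sample.wav")
print(f"Saved sample.wav: {info.duration:.1f}s at {info.samplerate} Hz, {info.channels} channel(s)")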

Whisper Transcription

Now that we have our environment set up and test data loaded, let's establish a baseline transcription using Whisper without speaker diarization. This will give us a reference point to compare with our speaker-attributed results later.

Whisper is an automatic speech recognition (ASR) system trained on a massive dataset of diverse audio. Key advantages include:

  • Robust performance across different accents and acoustic environments
  • Multi-language support
  • Ability to understand context and filter out non-speech audio
  • Timestamp generation capability

However, without diarization, Whisper cannot differentiate between speakers, which is why we'll combine it with diarization models.

Set Up Whisper Pipeline

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

whisper_pipeline = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

Transcribe Our File in 30s Chunks

Whisper processes audio in 30-second windows, so it can't transcribe a long recording in a single pass. We use the librosa library to load our file and feed it into Whisper in 30-second chunks.

We'll process the first 5 minutes (300 seconds) of the audio.

import librosa
import os

# Load the audio file
audio_file = "./sample.wav"
audio, sr = librosa.load(audio_file)
if sr != 16000:
    audio = librosa.resample(audio, orig_sr=sr, target_sr=16000)
    sr = 16000

# Use shorter segments to reduce memory usage
inc = 30
transcription_whisper = []
stop_transcript = 300

# Process audio in chunks
for i in range(0, int(len(audio) / (sr * inc)) + 1):
    # Stop once we've covered the first `stop_transcript` seconds
    if i * inc >= stop_transcript:
        break
    start_sample = int(inc * i * sr)
    end_sample = min(int(inc * (i + 1) * sr), len(audio))
    segment_audio = audio[start_sample:end_sample]

    result = whisper_pipeline(segment_audio)
    transcription_text = result["text"].strip()
    transcription_whisper.append(transcription_text)

Here we print out the transcript to get an idea of what the conversation is about.

for text in transcription_whisper:
    print(text)

Pyannote Speaker Diarization

With our baseline Whisper transcription complete, we'll now implement our first diarization approach using Pyannote Audio. This open-source toolkit provides a modular pipeline that combines speech segmentation, embedding extraction, and clustering to identify unique speakers.

Tradeoffs:

  • Advantages: Lower computational requirements, works well on modest hardware, easier setup process, and good performance in clear audio conditions with minimal overlap
  • Limitations: May struggle with heavily overlapped speech, typically identifies fewer simultaneous speakers (optimal for 2-3 speakers), and relies on separate pipeline components that can propagate errors

Let's set up the Pyannote Speaker Diarization pipeline and examine its performance on our test audio.

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1",
        use_auth_token=HF_TOKEN
)

# Move pipeline to appropriate device
pyannote_diarization_pipeline = pipeline.to(device)

Pyannote Diarization Results

Next, we process the file to get the timestamps where speech starts and ends.

# Process the audio file
audio_file = "./sample.wav"
print(f"Processing {audio_file} on {device}")
pyannote_diarization_output = pyannote_diarization_pipeline(audio_file)

The Pyannote Speaker Diarization model gives us a list of segment timestamps labeled with a speaker.

Here we:

  1. Load the audio file using librosa
  2. Break up diarized segments that are greater than 30s long
  3. Feed those segments into Whisper for speech-to-text transcription

We also stop the transcript at 300s.

import math
import librosa

# Load the audio file
audio_file = "./sample.wav"
audio, sr = librosa.load(audio_file)
if sr != 16000:
    audio = librosa.resample(audio, orig_sr=sr, target_sr=16000)
    sr = 16000

pyannote_transcript = []
stop_transcript = 300

for segment, _, speaker in pyannote_diarization_output.itertracks(yield_label=True):

    # Ensure segments are less than 30s long
    segment_chunks = math.ceil(segment.duration/30)
    if segment.start > stop_transcript:
        break

    for i in range(segment_chunks):

        chunk_start = segment.start + (i * 30)
        chunk_duration = min(30,segment.end - chunk_start)
        chunk_end = chunk_start + chunk_duration

        # Extract a segment by sample indices
        start_sample = int(chunk_start * sr)
        end_sample = int(chunk_end * sr)

        if chunk_duration < 0.5:
            continue

        segment_audio = audio[start_sample:end_sample]

        result = whisper_pipeline(segment_audio)
        transcription_text = result["text"].strip()

        output = {
            "speaker":speaker,
            "segment_start":chunk_start,
            "segment_end":chunk_end,
            "text":transcription_text
        }
        pyannote_transcript.append(output)

Next we will create a function to print the transcript. There are some overlapping speech segments in this audio file, so we'll indent the overlapped speech to make it easier to spot.

def print_transcript(transcript):

    # Sort segments chronologically by start time
    sorted_transcript = sorted(transcript, key=lambda x: x["segment_start"])

    # Track the latest end time seen so far to detect overlapping segments
    longest_end = 0

    for line in sorted_transcript:
        speaker = line["speaker"]
        segment_start = line["segment_start"]
        segment_end = line["segment_end"]
        transcription_text = line["text"]
        formatted_output = f"[{speaker}] ({segment_start:.2f} --> {segment_end:.2f}) {transcription_text}"

        if segment_end > longest_end:
            longest_end = segment_end

        # Segments that end before an earlier segment does are overlapped speech; indent them
        if segment_end < longest_end:
            print("\t" + formatted_output)

        else:
            print(formatted_output)

Here we'll print out the diarized transcription.

You'll notice places where we capture one speaker speaking while another speaker interjects.

In this case SPEAKER_01 replies "Yeah, yeah" while SPEAKER_06 speaks. You can see that "Yeah, yeah" text in both speakers' segments. In a more sophisticated system we could go back and relabel that "Yeah, yeah" portion of SPEAKER_06's segment as SPEAKER_01; a rough sketch of detecting such overlaps follows the transcript below.

[SPEAKER_06] (97.59 --> 127.59) from the slides to enhance the speech recognition on the meetings data is not working. Yeah, yeah. So that's one thing. And so what we did with Alessandro is performing some statistical tests to see if the appearance of the words during the meeting is [SPEAKER_01] (104.42 --> 104.98) Yeah, yeah.

print_transcript(pyannote_transcript)
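
As a rough sketch of that idea (using the pyannote_transcript list we just built, and a hypothetical 3-second threshold for what counts as an interjection), we can flag short segments that fall entirely inside a longer segment from a different speaker:

# Rough sketch: flag short interjections that fall inside another speaker's segment
def find_interjections(transcript, max_duration=3.0):
    interjections = []
    for short_seg in transcript:
        duration = short_seg["segment_end"] - short_seg["segment_start"]
        if duration > max_duration:
            continue
        for long_seg in transcript:
            if long_seg is short_seg or long_seg["speaker"] == short_seg["speaker"]:
                continue
            # The short segment lies entirely within the longer one
            if (long_seg["segment_start"] <= short_seg["segment_start"]
                    and short_seg["segment_end"] <= long_seg["segment_end"]):
                interjections.append((short_seg, long_seg))
                break
    return interjections

for short_seg, long_seg in find_interjections(pyannote_transcript):
    print(f'{short_seg["speaker"]} interjects "{short_seg["text"]}" during {long_seg["speaker"]}')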

Now that we've seen how Pyannote handles speaker diarization, let's explore NVIDIA's Sortformer approach.

NVIDIA Sortformer

Having explored Pyannote's approach to speaker diarization, let's now implement NVIDIA's Sortformer model. While Pyannote uses a pipeline of separate components, Sortformer represents a more integrated, transformer-based approach specifically designed for complex multi-speaker environments.

Tradeoffs:

  • Advantages: Superior handling of overlapping speech, support for up to four speakers, better temporal resolution, and an integrated end-to-end architecture
  • Limitations: Significantly higher computational requirements, longer processing time, more complex setup through the NeMo framework, and larger GPU memory footprint

Let's set up Sortformer and compare its diarization results with those from Pyannote.

Sortformer Setup

First, we'll set up our Sortformer model.

from nemo.collections.asr.models import SortformerEncLabelModel

sortformer_model = SortformerEncLabelModel.from_pretrained("nvidia/diar_sortformer_4spk-v1")

Sortformer Diarization

We then create a script to process the audio into a diarized transcript.

Sortformer requires significant memory for larger audio files. So here we:

  1. Break up the file into 5 minute chunks
  2. Save each chunk to disk temporarily
  3. Run the Sortformer diarizer on those files
  4. Further break the audio up into diarized segments with librosa
  5. Output the transcript of each diarized segment

We also stop transcribing at 300s.

import os
import re
import torch
import librosa
import soundfile as sf

# Create output directory
os.makedirs("speaker_segments", exist_ok=True)

# Pattern for parsing diarization output
pattern = re.compile(r'(\d+\.\d+)\s+(\d+\.\d+)\s+(speaker_\d+)')

# Load audio
audio_file = "./sample.wav"
audio, sr = librosa.load(audio_file)
if sr != 16000:
    audio = librosa.resample(audio, orig_sr=sr, target_sr=16000)
    sr = 16000

# Process audio in chunks
chunk_duration = 300  # 5 minutes
chunk_samples = chunk_duration * sr
num_chunks = (len(audio) + chunk_samples - 1) // chunk_samples
sortformer_transcript = []

stop_transcript = 300

for chunk_idx in range(num_chunks):

    if chunk_idx * chunk_duration >= stop_transcript:
        break

    print(f"\nProcessing chunk {chunk_idx+1}/{num_chunks}")

    # Extract chunk
    start_sample = chunk_idx * chunk_samples
    end_sample = min(start_sample + chunk_samples, len(audio))
    chunk_audio = audio[start_sample:end_sample]
    chunk_start_time = start_sample / sr

    # Save temp file
    temp_path = "temp_chunk.wav"
    sf.write(temp_path, chunk_audio, sr)

    # Diarize
    torch.cuda.empty_cache()
    try:
        segments = sortformer_model.diarize(audio=temp_path, batch_size=1)

        for segment_list in segments:
            for segment_str in segment_list:
                match = pattern.match(segment_str)
                if not match:
                    continue

                start_str, end_str, speaker = match.groups()
                start_time = float(start_str) + chunk_start_time
                end_time = float(end_str) + chunk_start_time

                # Get speaker audio
                start_sample = int(start_time * sr)
                end_sample = int(end_time * sr)

                if start_sample >= len(audio) or end_sample > len(audio):
                    continue

                speaker_audio = audio[start_sample:end_sample]

                # Transcribe
                result = whisper_pipeline(speaker_audio, return_timestamps=True)
                transcription_text = result["text"].strip()

                if transcription_text:
                    output = {
                        "speaker":speaker,
                        "segment_start":start_time,
                        "segment_end":end_time,
                        "text":transcription_text
                    }
                    sortformer_transcript.append(output)

    except Exception as e:
        print(f"Error in chunk {chunk_idx}: {e}")

# Clean up the temporary chunk file
if os.path.exists("temp_chunk.wav"):
    os.remove("temp_chunk.wav")
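
If you'd like to keep both diarized transcripts around for later analysis, a simple option (using the standard json module) is to write them to disk:

# Optional: persist both diarized transcripts to disk for later analysis
import json

with open("pyannote_transcript.json", "w") as f:
    json.dump(pyannote_transcript, f, indent=2)

with open("sortformer_transcript.json", "w") as f:
    json.dump(sortformer_transcript, f, indent=2)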

Compare Results

Here we print all 3 transcripts for assessment.

for text in transcription_whisper:
    print(text)
print_transcript(sortformer_transcript)
print_transcript(pyannote_transcript)

We'll also create a function that allows us to listen to specific parts of the file.

import librosa
import numpy as np
from IPython.display import Audio, display

def play_segment(audio_file, start_time, end_time, sr=16000):
    """
    Play a specific segment of an audio file.

    Parameters:
    -----------
    audio_file : str
        Path to the audio file
    start_time : float
        Start time in seconds
    end_time : float
        End time in seconds
    sr : int, optional
        Sample rate (default: 16000)
    """
    # Load the audio file
    audio, sample_rate = librosa.load(audio_file, sr=sr)

    # Convert times to sample indices
    start_idx = int(start_time * sample_rate)
    end_idx = int(end_time * sample_rate)

    # Extract the segment
    segment = audio[start_idx:end_idx]

    # Display information about the segment
    duration = (end_time - start_time)
    print(f"Playing segment from {start_time:.2f}s to {end_time:.2f}s (duration: {duration:.2f}s)")

    # Play the audio
    display(Audio(segment, rate=sample_rate))

play_segment("./sample.wav", 0, 120)

Compare 3 Segments

To better understand the performance differences, we'll analyze three representative segments from our transcripts. For each segment, we'll compare the manually diarized Whisper transcript (ground truth) with the automated diarization from Pyannote and Sortformer.

Segment 1: Discussion About Project Work

The ground truth shows a clean exchange between S1 (asking about Matthew) and S2 (explaining the NBaseList rescoring project), with S1 making several "yeah" interjections during S2's longer explanation.

Pyannote: Pyannote correctly splits the conversation between SPEAKER_01 and SPEAKER_06, with the "Yeah, yeah" interjection. Pyannote's segmentation aligns well with natural speech boundaries, creating larger cohesive segments that maintain context. This helps Whisper produce more coherent transcriptions.

Sortformer: Sortformer struggles significantly here, incorrectly assigning both speakers' content primarily to speaker_2. It over-segments the audio into very short chunks (most 2-4 seconds long), disrupting the natural flow. This fragmentation likely contributes to Whisper's language detection errors, where it suddenly renders English speech in other languages. The interjection is detected but consists of only the word "on" rather than the "Yeah, yeah" captured by Pyannote.

Whisper + Manual Diarization

[S1] Okay, so we can maybe get a head start. Have you been talking to Matthew about what you've been doing?

[S2] A little bit, but I'm not sure he knows everything. to explain why this NBaseList rescoring from the slides to enhance the speech recognition on the meetings data is not working. So that's one thing. And so what we did with Alessandro is performing some statistical tests to see whatever if the words, the appearance of the words during the meeting is... (S1 interjects with yeah multiple times)

Whisper + Pyannote Diarization

[SPEAKER_01] (65.32 --> 76.88) Okay, so We can maybe get a head start Have you been talking to Matthew about what you've been doing or

[SPEAKER_06] (77.01 --> 78.40) Um...
[SPEAKER_06] (79.87 --> 85.84) A little bit but I'm not sure he knows everything.
[SPEAKER_06] (85.86 --> 97.03) Basically, well first we try to explain why this n-best-list rescoring
[SPEAKER_06] (97.59 --> 127.59) from the slides to enhance the speech recognition on the meetings data is not working. Yeah, yeah. So that's one thing. And so what we did with Alessandro is performing some statistical tests to see if the appearance of the words during the meeting is

[SPEAKER_01] (104.42 --> 104.98) Yeah, yeah.

Whisper + Sortformer Diarization

[speaker_2] (69.92 --> 77.36) um we can maybe get a head start um have you been talking to Matthew about what you've been doing or uh
[speaker_2] (80.00 --> 83.52) little bit but I'm not sure he knows everything
[speaker_2] (87.36 --> 90.80) En fait, d'abord nous essayons de
[speaker_2] (92.64 --> 96.80) explain why this NBaseList rescoring
[speaker_2] (98.24 --> 105.84) from the slides to enhance the speech recognition on the meetings data is not working. Yeah, yeah.

[speaker_3] (101.92 --> 102.08) on

[speaker_2] (107.28 --> 107.68) uh
[speaker_2] (108.48 --> 109.44) that's one thing
[speaker_2] (110.40 --> 113.52) Et ce que nous avons fait avec Alessandro,
[speaker_2] (116.32 --> 118.40) performing some statistical tests
[speaker_2] (118.80 --> 120.08) to see whatever
[speaker_2] (120.96 --> 124.16) if the words the
[speaker_2] (124.96 --> 126.88) the appearance of the word during the meeting

# Play Segment 1
play_segment("./sample.wav", 65, 128)

Segment 2: Discussion About Statistical Correlations

The ground truth shows S2 explaining statistical correlations, S1 making the "intuitive feeling" comment, and S1 questioning the "no" result.

Pyannote: Pyannote maintains speaker consistency with SPEAKER_06 continuing from the previous segment, showing good temporal tracking. It correctly captures the "intuitive feeling" interjection but assigns it to SPEAKER_03 rather than SPEAKER_01 - a speaker identity error. The timestamps precisely capture the back-and-forth when SPEAKER_01 takes over while SPEAKER_06 interjects with "The result was no."

Sortformer: Sortformer's timestamp fragmentation worsens, breaking the explanation into smaller segments. This causes more language detection errors. The system does correctly identify speaker_3's interjection, but the speaker labeling is inconsistent with segment 1, making it difficult to track who's who across the conversation.

Whisper + Manual Diarization

[s2] ...independent of the appearances of the different slides. So in the case if it is dependent that would mean that certain words tend to appear during certain slides. (S1 interjects with yeah multiple times)

[s1] Yeah, which was the intuitive feeling.

[s2] And then there is a correlation and then there will be a reason like for believing that this will work. (S1 interjects with yeah multiple times)

[s1] And the result was no.

(missed transcription)[s2] The result was no

[s1] But the question is, is that result no because it's no or is it also just because there's not really enough data to be sure about anything?

Whisper + Pyannote Diarization

[SPEAKER_06] (127.59 --> 150.32) independent of the appearances of the different slides. Yeah. So in the case, if it is dependent, that would mean that certain words tend to appear during certain slides. Yeah, which was the intuitive feeling. And that there is a correlation, and then there will be a reason for believing that this will work.

[SPEAKER_03] (140.73 --> 141.97) Yeah, which was the intuitive.
[SPEAKER_01] (142.03 --> 143.59) correlation and

[SPEAKER_01] (150.76 --> 163.90) Yeah. And the result was no. The result was no. But the question is, is that result no because it's no, or is it also just because there's not really enough data to be sure about anything?

[SPEAKER_06] (154.03 --> 155.30) The result was no.

Whisper + Sortformer Diarization

[speaker_2] (127.12 --> 131.20) is independent of the
[speaker_2] (131.52 --> 133.28) apparences des différents slides.
[speaker_2] (133.36 --> 134.32) Yeah. So the...
[speaker_2] (134.56 --> 145.12) in the case if it is dependent that would mean that certain words tend to appear during certain slides yeah which is that there is a correlation and then there will be a

[speaker_3] (141.20 --> 143.68) Which was the intuitive feeling.

[speaker_2] (145.76 --> 147.20) a reason like for
[speaker_2] (147.60 --> 150.56) believing that this will work.

[speaker_3] (149.52 --> 149.84) isso.

[speaker_2] (150.80 --> 151.12) Yeah.
[speaker_2] (151.44 --> 157.44) and the result was no. The result was no. But the question is, is that result...

[speaker_3] (154.16 --> 155.44) the result was no but the

[speaker_2] (158.08 --> 164.16) no because it's no or is it also just because there's not really enough data to be sure about anything?

# Play Segment 2
play_segment("./sample.wav", 127, 165)

Segment 3: Extended Explanation About Language Statistics

A new speaker (S3) gives a lengthy explanation about language statistics with multiple brief interjections from others.

Pyannote: Pyannote correctly identifies this as a new speaker (SPEAKER_00) and maintains two large segments that preserve the monologue structure. This approach helps Whisper maintain context and produce coherent text. The system captures multiple short interjections with appropriate timestamps, though it sometimes assigns them to different speakers (SPEAKER_03 vs SPEAKER_01).

Sortformer: Sortformer shows significant improvement here, correctly identifying a new speaker (speaker_1) and maintaining longer segments. The improved segmentation results in better transcription quality and fewer language errors. The system captures multiple brief interjections from speaker_2. The performance boost suggests Sortformer works better with longer monologues than with rapid exchanges.

Whisper + Manual Diarization

[S3] In my opinion it is no because of the nature of the language. I mean intuitively of course you tend to use more the words that are on the slide, but the mass of the words actually you use are words that are common. Just think that 50% of the words on average, whatever corpus you take are stop words, articles, etc. So in terms of recognition, 50% goes away. Of the rest, remaining 50%, I mean, are all words that appear one, two, three times. So in any case, even if actually they tend to be related or statistically related to a single slide, in any case, in terms of recognition, do not help at all. It's not going to help. Or help very little. So that's the kind of thing, I mean, that's the kind of measure. Statistical independence is not on single words overall which basically means in terms of recognition that doesn't help. (S1/S2 interject with yeah, sure, and other utterances multiple times)

Whisper + Pyannote Diarization

[SPEAKER_00] (164.11 --> 194.11) In my opinion, it is not because of the nature of the language. I mean, intuitively, of course, you tend to use more the words that are on the slide. Sure. But the mass of the words actually you use are words that are common. Just think that 50% of the words, on average, whatever corpus you take, are stop words, articles, etc. So in terms of recognition, 50% goes away. Of the rest, remaining 50%, I mean, are all words that appear one, two, three times. Yeah. so in any case even if actually

[SPEAKER_03] (182.42 --> 182.97) Yeah, I can.
[SPEAKER_01] (182.97 --> 184.02) etc.

[SPEAKER_00] (194.11 --> 216.44) They tend to be related or statistically related to a single slide. In any case, in terms of recognition, do not help at all. It's not going to... Yeah. Or help very little. So that's the kind of thing, I mean, that's the kind of measure in statistical independence is not on single words. Yeah. Overall. Yeah, yeah. Which basically means in terms of recognition that doesn't help.
[SPEAKER_01] (204.49 --> 205.25) Not gonna, yeah.
[SPEAKER_01] (212.32 --> 213.80) Yeah. Overall. Yeah, yeah.

Whisper + Sortformer Diarization

[speaker_1] (164.16 --> 172.48) In my opinion it is not because of the nature of the language. I mean intuitively of course you tend to use more the words that are on the slide.

[speaker_2] (171.12 --> 171.36) off.

[speaker_1] (172.64 --> 182.16) the mass of the words actually you use are words that are common. Just think that 50% of the words on average, whatever corpus you take, are stop words.
[speaker_1] (182.48 --> 195.84) articles etc. So in terms of recognition 50% goes away. Of the rest remaining 50% I mean are all words that appear one, two, three times. So in any case even if actually they tend to be

[speaker_2] (182.48 --> 183.28) yeah I think we're successful
[speaker_2] (183.92 --> 184.00) I'm just,
[speaker_2] (186.24 --> 186.56) Yeah.
[speaker_2] (192.24 --> 192.56) Yeah.

[speaker_1] (196.40 --> 216.80) related or statistically related to a single slide. In any case, in terms of recognition, do not help at all. It's not going to, yeah. Or help very little. So that's the kind of thing, I mean, that's the kind of measure in statistical independence is not on single words. Yeah. Overall. Yeah, yeah. Which basically means in terms of recognition that doesn't help.

[speaker_2] (204.40 --> 205.44) It's not going to work.
[speaker_2] (212.32 --> 212.64) Yeah
[speaker_2] (213.36 --> 213.84) Yeah, yeah.

# Play Segment 3
play_segment("./sample.wav", 165, 214)

Conclusion

After analyzing both diarization approaches across multiple conversation segments, we can draw several conclusions about their relative strengths and limitations.

In this post, we've explored two different approaches to speaker diarization:

  1. Pyannote shows strong performance with consistent speaker tracking and natural segmentation, particularly in conversational exchanges. It effectively identifies speaker changes and most interjections, though occasionally misattributes speakers.

  2. Sortformer delivers mixed results - it struggles with rapid conversational exchanges but excels with longer monologues. Its tendency to over-segment audio sometimes causes language detection errors in Whisper.

Our analysis is based on a limited audio sample, so these findings should be considered preliminary. A comprehensive evaluation would require testing across diverse audio recordings with different speakers, acoustic conditions, and conversation dynamics.

Methods for Improving Diarization and Transcription

Several approaches can enhance the performance of these systems:

  1. Fine-tuning Models: Adapt the diarization models to your specific speakers by creating a small labeled dataset of their speech. This helps reduce speaker confusion like we observed with the "intuitive feeling" comment.

  2. Segment Length Optimization: Tailor your approach to content type - use Pyannote's longer segments for conversational audio, while Sortformer with appropriate chunking works better for presentations or lectures.

  3. Whisper's Word-level Timestamps: Leverage Whisper's timestamp capability to transcribe longer segments, then align these with diarization boundaries. This preserves more context while allowing precise speaker attribution (see the sketch after this list).

  4. Post-processing Speaker Consistency: Implement consistency checks to ensure speakers maintain the same ID throughout the transcript, especially important for longer recordings.

  5. LLM-based Error Correction: Use a large language model to post-process transcripts and fix language detection errors, like the French phrases that appeared in Sortformer's English transcription.
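
Here's a minimal sketch of the word-level timestamp idea, assuming the whisper_pipeline and pyannote_diarization_output objects from earlier in this tutorial: transcribe a chunk with word timestamps, then assign each word to whichever diarized speaker turn contains its midpoint.

# Minimal sketch: align Whisper word-level timestamps with Pyannote speaker turns
import librosa

audio, sr = librosa.load("./sample.wav", sr=16000)

# Transcribe the first 30 seconds with word-level timestamps
result = whisper_pipeline(audio[: 30 * sr], return_timestamps="word")

# Collect (start, end, speaker) tuples from the Pyannote diarization output
speaker_turns = [
    (segment.start, segment.end, speaker)
    for segment, _, speaker in pyannote_diarization_output.itertracks(yield_label=True)
]

def speaker_at(time):
    # Return the speaker whose turn contains this timestamp, if any
    for start, end, speaker in speaker_turns:
        if start <= time <= end:
            return speaker
    return "UNKNOWN"

# Attribute each word to the speaker active at its midpoint
for chunk in result["chunks"]:
    start, end = chunk["timestamp"]
    if start is None or end is None:
        continue
    print(f'[{speaker_at((start + end) / 2)}] {chunk["text"].strip()}')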

The best choice depends on your specific use case:

  • For general-purpose diarization with modest hardware, Pyannote offers a good balance of performance and efficiency
  • For content with longer monologues and fewer speaker transitions, Sortformer may justify its higher computational requirements

Both approaches integrate well with Whisper to create speaker-attributed transcripts. The hardware differences are notable - Pyannote runs efficiently on modest setups, while Sortformer requires more careful resource management.

We recommend testing both approaches on samples representative of your target audio content to determine which best suits your needs.
