March 14, 2025 · Pyannote · NVIDIA · Speech recognition · Vast.ai
Today we'll explore two leading open source speaker diarization technologies: Pyannote Audio and NVIDIA's Sortformer. We'll integrate them with Whisper for speech recognition on VAST.ai's cloud computing infrastructure. Speaker diarization technology answers the critical question of "who spoke when?" by segmenting audio recordings based on speaker identity. This is particularly helpful for transcribing audio with multiple speakers.
OpenAI's Whisper represents a significant advancement in speech-to-text transcription. While Whisper provides high-quality transcription, it doesn't inherently distinguish between different speakers. For multi-speaker content like meetings, interviews, or podcasts, we need to combine Whisper with speaker diarization technology to create truly useful transcripts.
Speaker diarization technology offers multiple advantages in audio processing workflows:
Speaker Differentiation: Distinguishes between multiple speakers in conversations, interviews, meetings, and other multi-person recordings.
Enhanced Transcription Quality: When paired with speech recognition systems like Whisper, diarization creates speaker-attributed transcripts that associate text with specific speakers.
Computational Optimization: By identifying and isolating speech segments by speaker and filtering out non-speech audio, diarization can optimize downstream processing tasks, reducing unnecessary computation.
Content Navigation: Enables searching and indexing audio content by individual speakers, making it easy to locate specific speakers' contributions.
By the end of this tutorial, we'll have a baseline Whisper transcript, a Pyannote + Whisper speaker-attributed transcript, and a Sortformer + Whisper speaker-attributed transcript, along with a side-by-side comparison of the two diarization approaches.
Let's dive in!
For this project, we'll use VAST.ai to access GPU computing resources. VAST.ai operates as a marketplace where you can rent GPUs from various providers, which is particularly useful for running the computationally intensive models we'll be using for speaker diarization in this tutorial.
The NVIDIA Sortformer model is much larger than the Pyannote diarization model and requires significantly more GPU compute and GPU RAM.
For running the Pyannote Speaker Diarization model on VAST.ai, you'll need a relatively modest GPU setup, since the pyannote/speaker-diarization-3.1 model runs in pure PyTorch and is designed to be efficient. We recommend the PyTorch (cuDNN Runtime) template. You can download the notebook here.
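Once your instance is running, it's worth confirming that PyTorch can see the GPU and that there's enough VRAM for the models you plan to load. Here's a minimal check, assuming the CUDA-enabled PyTorch build that ships with the template:
import torch
# Confirm CUDA is available and report the GPU model and its memory
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}")
    print(f"Total VRAM: {props.total_memory / 1024**3:.1f} GiB")
else:
    print("No CUDA device detected - check your instance configuration")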
Before we begin, let's install the necessary Python packages and system dependencies for our diarization pipelines:
# Downgrade NumPy to a version compatible with NeMo/Sortformer
pip install numpy==1.24.3 --force-reinstall
pip install pydub
pip install librosa
pip install datasets
pip install transformers
pip install accelerate
pip install pyannote.audio
apt-get update && apt-get install -y build-essential g++
pip install Cython packaging
pip install "git+https://github.com/NVIDIA/NeMo.git@main#egg=nemo_toolkit[asr]"
apt-get update && apt-get install -y ffmpeg
Here we set our Hugging Face token as HF_TOKEN. We need this to access the model.
Ensure that you have accepted the terms for https://huggingface.co/pyannote/speaker-diarization-3.1 and https://huggingface.co/pyannote/segmentation-3.0. These models are free to use, but you must accept their terms before downloading them.
# Make sure you've accepted the user conditions at:
# https://huggingface.co/pyannote/speaker-diarization-3.1
# https://huggingface.co/pyannote/segmentation-3.0
HF_TOKEN = ""
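If you prefer not to hardcode the token in the notebook, you can read it from an environment variable instead; a small optional sketch (the variable name HF_TOKEN is just a convention here):
import os
# Read the Hugging Face token from the environment, falling back to an empty string
HF_TOKEN = os.environ.get("HF_TOKEN", "")
if not HF_TOKEN:
    print("Warning: HF_TOKEN is empty - gated models will fail to download")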
We will use a sample file from the AMI Meeting Corpus dataset https://huggingface.co/datasets/diarizers-community/ami, which is a collection of 100 hours of meeting recordings.
We'll pull one sample from the dataset (specifically the 7th sample) which contains a multi-speaker discussion about speech recognition research.
from datasets import load_dataset
import soundfile as sf
# Load the dataset with the correct split
dataset = load_dataset("diarizers-community/ami", "ihm", split="train", streaming=True)
# To get the 7th sample (index 6), skip the first 6 and take 1
index_to_get = 6
sample = next(iter(dataset.skip(index_to_get).take(1)))
# Extract audio data
audio = sample["audio"]
audio_array = audio["array"]
sampling_rate = audio["sampling_rate"]
# Save the sample as a WAV file in the working directory
output_path = "./sample.wav"
sf.write(output_path, audio_array, sampling_rate)
Now that we have our environment set up and test data loaded, let's establish a baseline transcription using Whisper without speaker diarization. This will give us a reference point to compare with our speaker-attributed results later.
Whisper is an automatic speech recognition (ASR) system trained on a massive dataset of diverse audio, which makes it robust to accents, background noise, and technical vocabulary, with support for many languages.
However, without diarization, Whisper cannot differentiate between speakers, which is why we'll combine it with diarization models.
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "openai/whisper-large-v3"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)
whisper_pipeline = pipeline(
"automatic-speech-recognition",
model=model,
tokenizer=processor.tokenizer,
feature_extractor=processor.feature_extractor,
torch_dtype=torch_dtype,
device=device,
)
Whisper processes audio in 30-second windows, so it can't transcribe our full recording in one pass. We use the librosa library to load the file and feed it into Whisper in 30-second chunks.
We'll process the first 5 minutes (300 seconds) of the audio.
import librosa
import os
# Load the audio file
audio_file = "./sample.wav"
audio, sr = librosa.load(audio_file)
if sr != 16000:
audio = librosa.resample(audio, orig_sr=sr, target_sr=16000)
sr = 16000
# Process the audio in 30-second segments
inc = 30
transcription_whisper = []
stop_transcript = 300
# Process audio in chunks
for i in range(0, int(len(audio) / (sr * inc)) + 1):
if i == (stop_transcript / 30):
break
start_sample = int(inc * i * sr)
end_sample = min(int(inc * (i + 1) * sr), len(audio))
segment_audio = audio[start_sample:end_sample]
result = whisper_pipeline(segment_audio)
transcription_text = result["text"].strip()
transcription_whisper.append(transcription_text)
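As an aside, recent versions of the transformers ASR pipeline can also handle long-form audio on their own via the chunk_length_s argument, which can replace the manual loop above (results may differ slightly at chunk boundaries). A sketch reusing the audio, sr, and stop_transcript variables defined above:
# Optional alternative: let the pipeline handle the 30-second chunking itself
result = whisper_pipeline(
    audio[: stop_transcript * sr],  # first 300 seconds
    chunk_length_s=30,
    return_timestamps=True,
)
print(result["text"][:500])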
Here we print out the transcript to get an idea of what the conversation is about.
for text in transcription_whisper:
print(text)
With our baseline Whisper transcription complete, we'll now implement our first diarization approach using Pyannote Audio. This open-source toolkit provides a modular pipeline that combines speech segmentation, embedding extraction, and clustering to identify unique speakers.
Tradeoffs: Pyannote is lightweight enough to run on a modest GPU and produces relatively long, cohesive segments that preserve conversational context, but as a multi-stage pipeline it can occasionally confuse or misattribute similar-sounding speakers.
Let's set up the Pyannote Speaker Diarization pipeline and examine its performance on our test audio.
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
from pyannote.audio import Pipeline
pipeline = Pipeline.from_pretrained(
"pyannote/speaker-diarization-3.1",
use_auth_token=HF_TOKEN
)
# Move pipeline to appropriate device
pyannote_diarization_pipeline = pipeline.to(device)
Next, we process the file to get the timestamps where speech starts and ends.
# Process the audio file
audio_file = "./sample.wav"
print(f"Processing {audio_file} on {device}")
pyannote_diarization_output = pyannote_diarization_pipeline(audio_file)
The Pyannote Speaker Diarization model gives us a list of segment timestamps labeled with a speaker.
Here we split any segment longer than 30 seconds into 30-second chunks, skip chunks shorter than 0.5 seconds, transcribe each chunk with Whisper, and attach the speaker label and timestamps to each transcription. We also stop the transcript at 300s.
import math
import librosa
# Load the audio file
audio_file = "./sample.wav"
audio, sr = librosa.load(audio_file)
if sr != 16000:
audio = librosa.resample(audio, orig_sr=sr, target_sr=16000)
sr = 16000
pyannote_transcript = []
stop_transcript = 300
for segment, _, speaker in pyannote_diarization_output.itertracks(yield_label=True):
# Ensure segments are less than 30s long
segment_chunks = math.ceil(segment.duration/30)
if segment.start > stop_transcript:
break
for i in range(segment_chunks):
chunk_start = segment.start + (i * 30)
chunk_duration = min(30,segment.end - chunk_start)
chunk_end = chunk_start + chunk_duration
# Extract a segment by sample indices
start_sample = int(chunk_start * sr)
end_sample = int(chunk_end * sr)
if chunk_duration < 0.5:
continue
segment_audio = audio[start_sample:end_sample]
result = whisper_pipeline(segment_audio)
transcription_text = result["text"].strip()
output = {
"speaker":speaker,
"segment_start":chunk_start,
"segment_end":chunk_end,
"text":transcription_text
}
pyannote_transcript.append(output)
Next we will create a function to print the transcript. There are some overlapping speech segments in this audio file so we'll indent the overlapped speech to easily recognize them.
def print_transcript(transcript):
sorted_transcript = sorted(transcript, key=lambda x: x["segment_start"])
longest_end = 0
for line in sorted_transcript:
speaker = line["speaker"]
segment_start = line["segment_start"]
segment_end = line["segment_end"]
transcription_text = line["text"]
formatted_output = f"[{speaker}] ({segment_start:.2f} --> {segment_end:.2f}) {transcription_text}"
if segment_end > longest_end:
longest_end = segment_end
if segment_end < longest_end:
print("\t" + formatted_output)
else:
print(formatted_output)
Here we'll print out the diarized transcription.
You'll notice places where we capture one speaker speaking while another speaker interjects.
In this case SPEAKER_01 replies "Yeah, yeah" while SPEAKER_06 speaks. You can see the "Yeah, yeah" text in both speakers' segments. In a more sophisticated system we could go in and relabel that "Yeah, yeah" portion of SPEAKER_06's segment as SPEAKER_01.
[SPEAKER_06] (97.59 --> 127.59) from the slides to enhance the speech recognition on the meetings data is not working. Yeah, yeah. So that's one thing. And so what we did with Alessandro is performing some statistical tests to see if the appearance of the words during the meeting is [SPEAKER_01] (104.42 --> 104.98) Yeah, yeah.
print_transcript(pyannote_transcript)
Now that we've seen how Pyannote handles speaker diarization, let's explore NVIDIA's Sortformer approach.
Having explored Pyannote's approach to speaker diarization, let's now implement NVIDIA's Sortformer model. While Pyannote uses a pipeline of separate components, Sortformer represents a more integrated, transformer-based approach specifically designed for complex multi-speaker environments.
Tradeoffs: Sortformer handles overlapping speech with a single end-to-end model and performs well on longer monologues, but the diar_sortformer_4spk-v1 checkpoint supports at most four speakers, requires substantially more GPU memory, and tends to over-segment rapid conversational exchanges.
Let's set up Sortformer and compare its diarization results with those from Pyannote.
First, we'll set up our Sortformer model.
from nemo.collections.asr.models import SortformerEncLabelModel
sortformer_model = SortformerEncLabelModel.from_pretrained("nvidia/diar_sortformer_4spk-v1")
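Depending on your NeMo version, the checkpoint may be restored on the CPU. Since NeMo models are standard PyTorch modules, an optional step to move the model onto the GPU and switch it to inference mode looks like this:
import torch
# Optional: place the Sortformer model on the GPU and disable training-mode layers
if torch.cuda.is_available():
    sortformer_model = sortformer_model.to("cuda")
sortformer_model.eval()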
We then create a script to process the audio into a diarized transcript. Sortformer requires significant memory for larger audio files, so here we load the audio with librosa, split it into 5-minute chunks, diarize each chunk with Sortformer, offset the returned timestamps by the chunk's start time, and transcribe each speaker segment with Whisper. We also stop transcribing at 300s.
import os
import re
import torch
import librosa
import soundfile as sf
# Create output directory
os.makedirs("speaker_segments", exist_ok=True)
# Pattern for parsing diarization output
pattern = re.compile(r'(\d+\.\d+)\s+(\d+\.\d+)\s+(speaker_\d+)')
# Load audio
audio_file = "./sample.wav"
audio, sr = librosa.load(audio_file)
if sr != 16000:
audio = librosa.resample(audio, orig_sr=sr, target_sr=16000)
sr = 16000
# Process audio in chunks
chunk_duration = 300 # 5 minutes
chunk_samples = chunk_duration * sr
num_chunks = (len(audio) + chunk_samples - 1) // chunk_samples
sortformer_transcript = []
stop_transcript = 300
for chunk_idx in range(num_chunks):
if chunk_idx * chunk_duration >= stop_transcript:
break
print(f"\nProcessing chunk {chunk_idx+1}/{num_chunks}")
# Extract chunk
start_sample = chunk_idx * chunk_samples
end_sample = min(start_sample + chunk_samples, len(audio))
chunk_audio = audio[start_sample:end_sample]
chunk_start_time = start_sample / sr
# Save temp file
temp_path = "temp_chunk.wav"
sf.write(temp_path, chunk_audio, sr)
# Diarize
torch.cuda.empty_cache()
try:
segments = sortformer_model.diarize(audio=temp_path, batch_size=1)
for segment_list in segments:
for segment_str in segment_list:
match = pattern.match(segment_str)
if not match:
continue
start_str, end_str, speaker = match.groups()
start_time = float(start_str) + chunk_start_time
end_time = float(end_str) + chunk_start_time
# Get speaker audio
start_sample = int(start_time * sr)
end_sample = int(end_time * sr)
if start_sample >= len(audio) or end_sample > len(audio):
continue
speaker_audio = audio[start_sample:end_sample]
# Transcribe
result = whisper_pipeline(speaker_audio, return_timestamps=True)
transcription_text = result["text"].strip()
if transcription_text:
output = {
"speaker":speaker,
"segment_start":start_time,
"segment_end":end_time,
"text":transcription_text
}
sortformer_transcript.append(output)
except Exception as e:
print(f"Error in chunk {chunk_idx}: {e}")
# Clean up the temporary chunk file
if os.path.exists("temp_chunk.wav"):
os.remove("temp_chunk.wav")
Here we print all 3 transcripts for assessment.
for text in transcription_whisper:
print(text)
print_transcript(sortformer_transcript)
print_transcript(pyannote_transcript)
We'll also create a function that allows us to listen to specific parts of the file.
import librosa
import numpy as np
from IPython.display import Audio, display
def play_segment(audio_file, start_time, end_time, sr=16000):
"""
Play a specific segment of an audio file.
Parameters:
-----------
audio_file : str
Path to the audio file
start_time : float
Start time in seconds
end_time : float
End time in seconds
sr : int, optional
Sample rate (default: 16000)
"""
# Load the audio file
audio, sample_rate = librosa.load(audio_file, sr=sr)
# Convert times to sample indices
start_idx = int(start_time * sample_rate)
end_idx = int(end_time * sample_rate)
# Extract the segment
segment = audio[start_idx:end_idx]
# Display information about the segment
duration = (end_time - start_time)
print(f"Playing segment from {start_time:.2f}s to {end_time:.2f}s (duration: {duration:.2f}s)")
# Play the audio
display(Audio(segment, rate=sample_rate))
play_segment("./sample.wav", 0, 120)
To better understand the performance differences, we'll analyze three representative segments from our transcripts. For each segment, we'll compare the manually diarized Whisper transcript (ground truth) with the automated diarization from Pyannote and Sortformer.
The ground truth shows a clean exchange between S1 (asking about Matthew) and S2 (explaining the NBaseList rescoring project), with S1 making several "yeah" interjections during S2's longer explanation.
Pyannote: Pyannote correctly splits the conversation between SPEAKER_01 and SPEAKER_06, with the "Yeah, yeah" interjection. Pyannote's segmentation aligns well with natural speech boundaries, creating larger cohesive segments that maintain context. This helps Whisper produce more coherent transcriptions.
Sortformer: Sortformer struggles significantly here, incorrectly assigning both speakers' content primarily to speaker_2. It over-segments the audio into very short chunks (most 2-4 seconds long), disrupting the natural flow. This fragmentation likely contributes to Whisper's language detection errors, where it suddenly transcribes English speech as other languages (French, in this case). The interjection is detected but consists of only the word "on" rather than the "Yeah, yeah" captured by Pyannote.
[S1] Okay, so we can maybe get a head start. Have you been talking to Matthew about what you've been doing?
[S2] A little bit, but I'm not sure he knows everything. to explain why this NBaseList rescoring from the slides to enhance the speech recognition on the meetings data is not working. So that's one thing. And so what we did with Alessandro is performing some statistical tests to see whatever if the words, the appearance of the words during the meeting is... (S1 interjects with yeah multiple times)
[SPEAKER_01] (65.32 --> 76.88) Okay, so We can maybe get a head start Have you been talking to Matthew about what you've been doing or
[SPEAKER_06] (77.01 --> 78.40) Um...
[SPEAKER_06] (79.87 --> 85.84) A little bit but I'm not sure he knows everything.
[SPEAKER_06] (85.86 --> 97.03) Basically, well first we try to explain why this n-best-list rescoring
[SPEAKER_06] (97.59 --> 127.59) from the slides to enhance the speech recognition on the meetings data is not working. Yeah, yeah. So that's one thing. And so what we did with Alessandro is performing some statistical tests to see if the appearance of the words during the meeting is[SPEAKER_01] (104.42 --> 104.98) Yeah, yeah.
[speaker_2] (69.92 --> 77.36) um we can maybe get a head start um have you been talking to Matthew about what you've been doing or uh
[speaker_2] (80.00 --> 83.52) little bit but I'm not sure he knows everything
[speaker_2] (87.36 --> 90.80) En fait, d'abord nous essayons de
[speaker_2] (92.64 --> 96.80) explain why this NBaseList rescoring
[speaker_2] (98.24 --> 105.84) from the slides to enhance the speech recognition on the meetings data is not working. Yeah, yeah.[speaker_3] (101.92 --> 102.08) on
[speaker_2] (107.28 --> 107.68) uh
[speaker_2] (108.48 --> 109.44) that's one thing
[speaker_2] (110.40 --> 113.52) Et ce que nous avons fait avec Alessandro,
[speaker_2] (116.32 --> 118.40) performing some statistical tests
[speaker_2] (118.80 --> 120.08) to see whatever
[speaker_2] (120.96 --> 124.16) if the words the
[speaker_2] (124.96 --> 126.88) the appearance of the word during the meeting
#Play Segment 1
play_segment("./sample.wav", 65, 128)
The ground truth shows S2 explaining statistical correlations, S1 making the "intuitive feeling" comment, and S1 questioning the "no" result.
Pyannote: Pyannote maintains speaker consistency with SPEAKER_06 continuing from the previous segment, showing good temporal tracking. It correctly captures the "intuitive feeling" interjection but assigns it to SPEAKER_03 rather than SPEAKER_01 - a speaker identity error. The timestamps precisely capture the back-and-forth when SPEAKER_01 takes over while SPEAKER_06 interjects with "The result was no."
Sortformer: Sortformer's timestamp fragmentation worsens, breaking the explanation into smaller segments. This causes more language detection errors. The system does correctly identify speaker_3's interjection, but the speaker labeling is inconsistent with segment 1, making it difficult to track who's who across the conversation.
[s2] ...independent of the appearances of the different slides. So in the case if it is dependent that would mean that certain words tend to appear during certain slides. (S1 interjects with yeah multiple times)
[s1] Yeah, which was the intuitive feeling.
[s2] And then there is a correlation and then there will be a reason like for believing that this will work. (S1 interjects with yeah multiple times)
[s1] And the result was no.
(missed transcription)[s2] The result was no
[s1] But the question is, is that result no because it's no or is it also just because there's not really enough data to be sure about anything?
[SPEAKER_06] (127.59 --> 150.32) independent of the appearances of the different slides. Yeah. So in the case, if it is dependent, that would mean that certain words tend to appear during certain slides. Yeah, which was the intuitive feeling. And that there is a correlation, and then there will be a reason for believing that this will work.
[SPEAKER_03] (140.73 --> 141.97) Yeah, which was the intuitive. [SPEAKER_01] (142.03 --> 143.59) correlation and
[SPEAKER_01] (150.76 --> 163.90) Yeah. And the result was no. The result was no. But the question is, is that result no because it's no, or is it also just because there's not really enough data to be sure about anything?
[SPEAKER_06] (154.03 --> 155.30) The result was no.
[speaker_2] (127.12 --> 131.20) is independent of the
[speaker_2] (131.52 --> 133.28) apparences des différents slides.
[speaker_2] (133.36 --> 134.32) Yeah. So the...
[speaker_2] (134.56 --> 145.12) in the case if it is dependent that would mean that certain words tend to appear during certain slides yeah which is that there is a correlation and then there will be a[speaker_3] (141.20 --> 143.68) Which was the intuitive feeling.
[speaker_2] (145.76 --> 147.20) a reason like for
[speaker_2] (147.60 --> 150.56) believing that this will work.[speaker_3] (149.52 --> 149.84) isso.
[speaker_2] (150.80 --> 151.12) Yeah.
[speaker_2] (151.44 --> 157.44) and the result was no. The result was no. But the question is, is that result...[speaker_3] (154.16 --> 155.44) the result was no but the
[speaker_2] (158.08 --> 164.16) no because it's no or is it also just because there's not really enough data to be sure about anything?
#Play Segment 2
play_segment("./sample.wav", 127, 165)
A new speaker (S3) gives a lengthy explanation about language statistics with multiple brief interjections from others.
Pyannote: Pyannote correctly identifies this as a new speaker (SPEAKER_00) and maintains two large segments that preserve the monologue structure. This approach helps Whisper maintain context and produce coherent text. The system captures multiple short interjections with appropriate timestamps, though it sometimes assigns them to different speakers (SPEAKER_03 vs SPEAKER_01).
Sortformer: Sortformer shows significant improvement here, correctly identifying a new speaker (speaker_1) and maintaining longer segments. The improved segmentation results in better transcription quality and fewer language errors. The system captures multiple brief interjections from speaker_2. The performance boost suggests Sortformer works better with longer monologues than with rapid exchanges.
[S3] In my opinion it is no because of the nature of the language. I mean intuitively of course you tend to use more the words that are on the slide, but the mass of the words actually you use are words that are common. Just think that 50% of the words on average, whatever corpus you take are stop words, articles, etc. So in terms of recognition, 50% goes away. Of the rest, remaining 50%, I mean, are all words that appear one, two, three times. So in any case, even if actually they tend to be related or statistically related to a single slide, in any case, in terms of recognition, do not help at all. It's not going to help. Or help very little. So that's the kind of thing, I mean, that's the kind of measure statistical independence is not on single words overall which basically means in terms of recognition that doesn't help. (S1/S2 interject with yeah, sure, and other utterances multiple times)
[SPEAKER_00] (164.11 --> 194.11) In my opinion, it is not because of the nature of the language. I mean, intuitively, of course, you tend to use more the words that are on the slide. Sure. But the mass of the words actually you use are words that are common. Just think that 50% of the words, on average, whatever corpus you take, are stop words, articles, etc. So in terms of recognition, 50% goes away. Of the rest, remaining 50%, I mean, are all words that appear one, two, three times. Yeah. so in any case even if actually
[SPEAKER_03] (182.42 --> 182.97) Yeah, I can. [SPEAKER_01] (182.97 --> 184.02) etc.
[SPEAKER_00] (194.11 --> 216.44) They tend to be related or statistically related to a single slide. In any case, in terms of recognition, do not help at all. It's not going to... Yeah. Or help very little. So that's the kind of thing, I mean, that's the kind of measure in statistical independence is not on single words. Yeah. Overall. Yeah, yeah. Which basically means in terms of recognition that doesn't help.
[SPEAKER_01] (204.49 --> 205.25) Not gonna, yeah.
[SPEAKER_01] (212.32 --> 213.80) Yeah. Overall. Yeah, yeah.
[speaker_1] (164.16 --> 172.48) In my opinion it is not because of the nature of the language. I mean intuitively of course you tend to use more the words that are on the slide.
[speaker_2] (171.12 --> 171.36) off.
[speaker_1] (172.64 --> 182.16) the mass of the words actually you use are words that are common. Just think that 50% of the words on average, whatever corpus you take, are stop words.
[speaker_1] (182.48 --> 195.84) articles etc. So in terms of recognition 50% goes away. Of the rest remaining 50% I mean are all words that appear one, two, three times. So in any case even if actually they tend to be[speaker_2] (182.48 --> 183.28) yeah I think we're successful [speaker_2] (183.92 --> 184.00) I'm just, [speaker_2] (186.24 --> 186.56) Yeah. [speaker_2] (192.24 --> 192.56) Yeah.
[speaker_1] (196.40 --> 216.80) related or statistically related to a single slide. In any case, in terms of recognition, do not help at all. It's not going to, yeah. Or help very little. So that's the kind of thing, I mean, that's the kind of measure in statistical independence is not on single words. Yeah. Overall. Yeah, yeah. Which basically means in terms of recognition that doesn't help.
[speaker_2] (204.40 --> 205.44) It's not going to work. [speaker_2] (212.32 --> 212.64) Yeah [speaker_2] (213.36 --> 213.84) Yeah, yeah.
#Play Segment 3
play_segment("./sample.wav", 165, 214)
After analyzing both diarization approaches across multiple conversation segments, we can draw several conclusions about their relative strengths and limitations.
In this post, we've explored two different approaches to speaker diarization:
Pyannote shows strong performance with consistent speaker tracking and natural segmentation, particularly in conversational exchanges. It effectively identifies speaker changes and most interjections, though occasionally misattributes speakers.
Sortformer delivers mixed results - it struggles with rapid conversational exchanges but excels with longer monologues. Its tendency to over-segment audio sometimes causes language detection errors in Whisper.
Our analysis is based on a limited audio sample, so these findings should be considered preliminary. A comprehensive evaluation would require testing across diverse audio recordings with different speakers, acoustic conditions, and conversation dynamics.
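For a more quantitative comparison, the standard metric is diarization error rate (DER), which combines missed speech, false alarms, and speaker confusion. Here's a minimal sketch using pyannote.metrics (installed alongside pyannote.audio) on toy reference and hypothesis annotations; in practice you would build these Annotation objects from your ground-truth labels and each model's output:
from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate

# Toy reference (ground truth) and hypothesis (model output) annotations
reference = Annotation()
reference[Segment(0.0, 10.0)] = "S1"
reference[Segment(10.0, 20.0)] = "S2"

hypothesis = Annotation()
hypothesis[Segment(0.0, 9.0)] = "SPEAKER_01"
hypothesis[Segment(9.0, 20.0)] = "SPEAKER_06"

# DER finds the best speaker mapping, then scores misses, false alarms, and confusion
metric = DiarizationErrorRate()
print(f"DER: {metric(reference, hypothesis):.3f}")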
Several approaches can enhance the performance of these systems:
Fine-tuning Models: Adapt the diarization models to your specific speakers by creating a small labeled dataset of their speech. This helps reduce speaker confusion like we observed with the "intuitive feeling" comment.
Segment Length Optimization: Tailor your approach to content type - use Pyannote's longer segments for conversational audio, while Sortformer with appropriate chunking works better for presentations or lectures.
Whisper's Word-level Timestamps: Leverage Whisper's timestamp capability to transcribe longer segments, then align these with diarization boundaries. This preserves more context while allowing precise speaker attribution (see the sketch after this list).
Post-processing Speaker Consistency: Implement consistency checks to ensure speakers maintain the same ID throughout the transcript, especially important for longer recordings.
LLM-based Error Correction: Use a large language model to post-process transcripts and fix language detection errors, like the French phrases that appeared in Sortformer's English transcription.
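As a sketch of the word-level timestamp idea above, assuming the whisper_pipeline, audio, and sr variables and the pyannote_diarization_output from earlier in this tutorial, you could transcribe a longer stretch once with word timestamps and then assign each word to whichever diarization segment contains its midpoint:
# Transcribe the first 300 seconds in one call, asking Whisper for word-level timestamps
result = whisper_pipeline(audio[: 300 * sr], chunk_length_s=30, return_timestamps="word")

def speaker_at(t, diarization):
    # Return the label of the first diarization segment covering time t (None if silence)
    for segment, _, speaker in diarization.itertracks(yield_label=True):
        if segment.start <= t <= segment.end:
            return speaker
    return None

aligned = []
for chunk in result["chunks"]:
    start, end = chunk["timestamp"]
    if start is None or end is None:
        continue
    midpoint = (start + end) / 2
    aligned.append((speaker_at(midpoint, pyannote_diarization_output), chunk["text"]))

# Inspect the first few (speaker, word) pairs
for speaker, word in aligned[:20]:
    print(speaker, word)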
The best choice depends on your specific use case: Pyannote is the safer default for conversational audio and modest hardware budgets, while Sortformer may be worth the extra GPU memory for recordings dominated by longer, single-speaker passages.
Both approaches integrate well with Whisper to create speaker-attributed transcripts. The hardware differences are notable - Pyannote runs efficiently on modest setups, while Sortformer requires more careful resource management.
We recommend testing both approaches on samples representative of your target audio content to determine which best suits your needs.