January 23, 2025 · Vast.ai · Speech recognition · Whisper Large V3 · Setup · AI
With the rise of Generative AI, there have been many advancements in the field of speech recognition. One of the most notable is Whisper, a family of open-source speech recognition models from OpenAI that can transcribe audio in multiple languages. Companies use this technology to transcribe customer service calls, summarize meetings, and build interactive voice assistants. Other applications include extracting structured data from audio or video based on what is being said.
Whisper Large V3, the latest version of the Whisper models, offers improved accuracy and performance over previous versions. In this guide, we'll show you how to set up and run Whisper Large V3 for batch audio transcription (the code below uses the lighter large-v3-turbo checkpoint). This guide is built around a backend data pipeline, but it can easily be modified to interact with users directly.
With Vast, you can run this model on extremely affordable and powerful GPUs to speed up your transcription pipeline. You can see the script that this guide is based on here.
To deploy on Vast, we will use the Vast.ai template for PyTorch: PyTorch (cuDNN Runtime). It has most of the libraries we need and comes with SSH and JupyterLab out of the box. For GPUs, the L40S is very cost-effective for reasonable batch sizes, and the model itself is fairly small. Finally, the dataset is fairly large and expands further when the dataset splits are created, so we recommend having 110 GB or more of disk space just to be safe.
We will use the requirements.txt file from here to install the necessary libraries on top of the template so that we can run the script.
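The exact contents depend on the script you start from, but a plausible minimal set of dependencies for the pipeline below looks something like this (package names are the standard ones; pin versions to match the script you actually use):
```
torch
transformers
accelerate
datasets[audio]
huggingface_hub
```
Once the file is on the instance, install everything with `pip install -r requirements.txt`.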
First, let's look at how to set up our transcription pipeline. We'll use Hugging Face's transformers library to load and run the model.
```python
import os

import torch
from huggingface_hub import snapshot_download
from transformers import (
    pipeline,
    AutoModelForSpeechSeq2Seq,
    AutoProcessor,
    logging
)

# Configure model details
MODEL_DIR = "./model"
MODEL_NAME = "openai/whisper-large-v3-turbo"
MODEL_REVISION = "41f01f3fe87f28c78e2fbf8b568835947dd65ed9"

# Create model directory and download model files
os.makedirs(MODEL_DIR, exist_ok=True)
snapshot_download(
    MODEL_NAME,
    local_dir=MODEL_DIR,
    ignore_patterns=["*.pt", "*.bin"],
    revision=MODEL_REVISION,
)
```
Next, we'll set up our Automatic Speech Recognition (ASR) pipeline. This involves loading the model and processor, and configuring them for optimal performance:
```python
def initialize_asr_pipeline():
    print("Setting up pipeline")
    # Load the processor and model from the snapshot we downloaded above
    processor = AutoProcessor.from_pretrained(MODEL_DIR)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = AutoModelForSpeechSeq2Seq.from_pretrained(
        MODEL_DIR,
        torch_dtype=torch.float16,
        low_cpu_mem_usage=True,
        use_safetensors=True,
        device_map=device
    )
    # Pin the output language to English
    model.generation_config.language = "<|en|>"

    # Configure feature extractor
    feature_extractor = processor.feature_extractor
    feature_extractor.sampling_rate = 16000
    feature_extractor.return_tensors = 'pt'

    # Create pipeline
    asr_pipeline = pipeline(
        "automatic-speech-recognition",
        model=model,
        tokenizer=processor.tokenizer,
        feature_extractor=feature_extractor,
        torch_dtype=torch.float16
    )
    return asr_pipeline
```
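Before wiring in a full dataset, it can help to sanity-check the pipeline on a single file. The snippet below is a minimal sketch: the file path is a placeholder, decoding a raw path requires ffmpeg on the instance, and the chunk_length_s / batch_size / return_timestamps arguments are standard options of the transformers ASR pipeline rather than values taken from the original script.
```python
# Quick sanity check on one audio file (path is a placeholder)
asr_pipeline = initialize_asr_pipeline()

result = asr_pipeline(
    "sample.wav",            # any local audio file; decoded with ffmpeg
    chunk_length_s=30,       # split long recordings into 30-second chunks
    batch_size=8,            # decode several chunks in parallel on the GPU
    return_timestamps=False,
)
print(result["text"])
```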
We'll use the Hugging Face datasets library to load our audio data. This example uses LibriSpeech, but you can modify it to use your own dataset:
```python
def load_dataset(dataset_name):
    # Local import; this function shadows datasets.load_dataset only inside its own body
    from datasets import load_dataset, Audio

    print("Loading dataset", dataset_name)
    dataset = load_dataset(dataset_name, "clean", split="validation")
    # Resample the audio column to the 16 kHz rate Whisper expects
    dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
    print("Dataset loaded")
    return dataset["audio"]


def batch_audio(dataset, batch_size):
    # Yield the audio samples in fixed-size batches
    for i in range(0, len(dataset), batch_size):
        print(f"Yielding batch {i}")
        yield dataset[i : i + batch_size]
```
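If your audio lives on disk rather than on the Hugging Face Hub, you can swap in a small helper like the sketch below. It assumes a local folder of audio files and uses the datasets library's "audiofolder" loader, which scans a directory and produces the same audio column format the rest of the pipeline expects:
```python
def load_local_dataset(audio_dir):
    from datasets import load_dataset, Audio

    # "audiofolder" builds a dataset from the audio files found under audio_dir
    dataset = load_dataset("audiofolder", data_dir=audio_dir, split="train")
    # Whisper's feature extractor expects 16 kHz audio
    dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
    return dataset["audio"]

# Example usage (directory name is a placeholder):
# dataset = load_local_dataset("./my_recordings")
```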
Finally, we'll put it all together to transcribe our audio files in batches. Because we defined the functions above, actually running the transcription is very simple.
```python
# Initialize pipeline
asr_pipeline = initialize_asr_pipeline()

# Load dataset
DATASET_NAME = "openslr/librispeech_asr"  # You can change this to your dataset
dataset = load_dataset(DATASET_NAME)

# Process batches
results = []
for batch in batch_audio(dataset, batch_size=32):
    print("Processing batch")
    with torch.no_grad():
        transcriptions = asr_pipeline(batch)
    print("Batch processed")
    results.extend(transcriptions)

print(results)
```
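In a backend pipeline you will usually want to persist the transcriptions rather than just print them. A minimal sketch that writes the results to a JSON file (the filename is arbitrary):
```python
import json

# Each pipeline result is a dict like {"text": "..."}
with open("transcriptions.json", "w") as f:
    json.dump(
        [{"index": i, "text": r["text"]} for i, r in enumerate(results)],
        f,
        indent=2,
        ensure_ascii=False,
    )
print(f"Wrote {len(results)} transcriptions to transcriptions.json")
```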
Some of our results are below:
'Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.', " Nor is Mr. Quilter's manner less interesting than his matter.", ' he tells us that at this festive season of the year with christmas and roast beef looming before us similes drawn from eating and its results occur most readily to the mind'
These can be compared to the original audio dataset here. The audio in this dataset is very clear and easy to transcribe.
We can see that the text returned from this pipeline is essentially perfect for the samples we show. We would expect the model's performance to degrade on a private dataset specific to your use case, or on audio that is not as clean as this dataset. Luckily, since we have access to the weights, the model can be fine-tuned for your specific use case.
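If your dataset ships reference transcripts (LibriSpeech exposes them in a text column), you can quantify accuracy instead of eyeballing it. The sketch below uses the evaluate library's word error rate metric; it assumes evaluate and jiwer are installed and that predictions and references are aligned in the same order.
```python
import evaluate

def word_error_rate(predictions, references):
    # predictions: transcribed strings from the pipeline
    # references: ground-truth transcripts, in the same order
    wer_metric = evaluate.load("wer")  # backed by jiwer
    preds = [p.strip().lower() for p in predictions]
    refs = [r.strip().lower() for r in references]
    return wer_metric.compute(predictions=preds, references=refs)

# Example (pull refs from your dataset's text column, in matching order):
# print(f"WER: {word_error_rate([r['text'] for r in results], refs):.2%}")
```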
This implementation provides a solid foundation for batch audio transcription using Whisper Large V3. You can modify the dataset loading code to work with your own audio files, and adjust the batch size based on your available GPU memory.
Now you can see how, with Vast, an affordable and powerful GPU lets you save a meaningful amount of money on labor-intensive tasks like transcription. Happy transcribing!