
Transcribing Audio with Whisper Large V3 on Vast.ai

- Team Vast

January 23, 2025 · Vast.ai · Speech recognition · Whisper Large V3 · Setup · AI


Background

With the rise of Generative AI, there have been many advancements in the field of speech recognition. One of the most notable is the development of Whisper, a family of open-source speech recognition models from OpenAI that can transcribe audio in multiple languages. Companies use this technology to transcribe customer service calls, summarize meetings, and create interactive voice assistants. Other applications include extracting structured data from audio or video based on what is being said.

Whisper Large V3, the latest version of the Whisper models, offers improved accuracy and performance over previous versions. In this guide, we'll show you how to set up and run Whisper Large V3 for batch audio transcription. This guide is built for a backend data pipeline, but can be easily modified to interact with users directly.

With Vast, you can run this model on extremely affordable and powerful GPUs to increase the speed of your transcription pipeline. You can see the script that this is based upon here.

Setting Up the Environment

To deploy on Vast, we will use the Vast AI template for PyTorch: PyTorch (cuDNN Runtime). It includes most of the libraries we need and comes with SSH and JupyterLab out of the box. For GPUs, the L40S is very cost effective at reasonable batch sizes, and the model itself is fairly small. Finally, the dataset is fairly large and expands further when the dataset splits are created, so we recommend provisioning 110GB or more of disk space to be safe.

We will use the requirements.txt file from here to install the necessary libraries into the template so that we can run the script.

Running the Script

Setting Up Transformers

First, let's look at how to set up our transcription pipeline. We'll use Hugging Face's transformers library to load and run the model.

```python
import os
import torch
from huggingface_hub import snapshot_download
from transformers import (
    pipeline,
    AutoModelForSpeechSeq2Seq,
    AutoProcessor,
    logging
)

# Configure model details
MODEL_DIR = "./model"
MODEL_NAME = "openai/whisper-large-v3-turbo"
MODEL_REVISION = "41f01f3fe87f28c78e2fbf8b568835947dd65ed9"

# Create model directory and download model files
os.makedirs(MODEL_DIR, exist_ok=True)
snapshot_download(
    MODEL_NAME,
    local_dir=MODEL_DIR,
    ignore_patterns=["*.pt", "*.bin"],
    revision=MODEL_REVISION,
)
```
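The ignore_patterns argument skips any legacy .pt and .bin weight files so that only the safetensors weights are downloaded. The pipeline code below loads the model by its Hub name, which uses the Hugging Face cache; if you would rather load the snapshot you just downloaded, you can point from_pretrained at the local directory instead. A minimal sketch of that alternative:

```python
# Optional: load from the local snapshot instead of the Hub cache
model = AutoModelForSpeechSeq2Seq.from_pretrained(MODEL_DIR, torch_dtype=torch.float16)
processor = AutoProcessor.from_pretrained(MODEL_DIR)
```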

Creating the ASR Pipeline

Next, we'll set up our Automatic Speech Recognition (ASR) pipeline. This involves loading the model and processor, and configuring them for optimal performance:

```python
def initialize_asr_pipeline():
    print("Setting up pipeline")

    processor = AutoProcessor.from_pretrained(MODEL_NAME)
    device = "cuda" if torch.cuda.is_available() else "cpu"

    model = AutoModelForSpeechSeq2Seq.from_pretrained(
        MODEL_NAME,
        torch_dtype=torch.float16,
        low_cpu_mem_usage=True,
        use_safetensors=True,
        device_map=device
    )
    model.generation_config.language = "<|en|>"

    # Configure feature extractor
    feature_extractor = processor.feature_extractor
    feature_extractor.sampling_rate = 16000
    feature_extractor.return_tensors = 'pt'

    # Create pipeline
    asr_pipeline = pipeline(
        "automatic-speech-recognition",
        model=model,
        tokenizer=processor.tokenizer,
        feature_extractor=feature_extractor,
        torch_dtype=torch.float16
    )

    return asr_pipeline
```
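Before launching the full batch job, it can help to sanity check the pipeline on a single clip. This is a minimal sketch; sample.wav is a hypothetical local file, and decoding a file path requires ffmpeg to be available (the pipeline also accepts raw arrays with a sampling rate):

```python
# Quick smoke test on a single (hypothetical) local audio file
asr_pipeline = initialize_asr_pipeline()
result = asr_pipeline("sample.wav")
print(result["text"])
```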

Loading and Processing Audio Data

We'll use the Hugging Face datasets library to load our audio data. This example uses LibriSpeech, but you can modify it to use your own dataset:

```python
def load_dataset(dataset_name):
    from datasets import load_dataset, Audio

    print("Loading dataset", dataset_name)
    dataset = load_dataset(dataset_name, "clean", split="validation")
    dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
    print("Dataset loaded")
    return dataset["audio"]

def batch_audio(dataset, batch_size):
    for i in range(0, len(dataset), batch_size):
        print(f"Yielding batch {i}")
        yield dataset[i : i + batch_size]
```
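If your audio lives on disk rather than on the Hugging Face Hub, a small variation of the loading function is enough. This is a minimal sketch assuming a directory of .wav files; the load_local_audio name and the example path are hypothetical:

```python
def load_local_audio(audio_dir):
    from pathlib import Path
    from datasets import Dataset, Audio

    # Build a dataset from local file paths and let `datasets` decode and resample them
    files = sorted(str(p) for p in Path(audio_dir).glob("*.wav"))
    dataset = Dataset.from_dict({"audio": files})
    dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
    return dataset["audio"]

# Example usage: dataset = load_local_audio("/data/recordings")
```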

Running the Transcription

Finally, we'll put it all together to transcribe our audio files in batches. Because we defined the functions above, actually running the transcription is very simple.

```python
# Initialize pipeline
asr_pipeline = initialize_asr_pipeline()

# Load dataset
DATASET_NAME = "openslr/librispeech_asr"  # You can change this to your dataset
dataset = load_dataset(DATASET_NAME)

# Process batches
results = []
for batch in batch_audio(dataset, batch_size=32):
    print("Processing batch")
    with torch.no_grad():
        transcriptions = asr_pipeline(batch)
    print("Batch processed")
    results.extend(transcriptions)

print(results)
```
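Since this is intended as a backend pipeline, you will typically want to persist the transcriptions rather than just print them. A minimal sketch that writes one JSON object per line (the transcripts.jsonl filename is just an example):

```python
import json

# Persist the transcriptions, one JSON object per line
with open("transcripts.jsonl", "w") as f:
    for idx, transcription in enumerate(results):
        f.write(json.dumps({"index": idx, "text": transcription["text"]}) + "\n")
```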

Results

Some of our results are below:

  • 'Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.'
  • " Nor is Mr. Quilter's manner less interesting than his matter."
  • ' he tells us that at this festive season of the year with christmas and roast beef looming before us similes drawn from eating and its results occur most readily to the mind'

These can be compared to the original audio dataset here. The audio in this dataset is very clear and easy to transcribe.

We can see that the text returned from this pipeline is essentially perfect for the samples shown. We would expect the model's performance to degrade if it were run on a private dataset specific to your use case, or if the audio were not as clear as this dataset. Luckily, since we have access to the weights, the model can be fine-tuned for your specific use case.

Key Features

  • Batch Processing: The code processes audio files in batches of 32 for improved efficiency. This matters for this model, which benefits significantly from batching on the GPU.
  • Transformers Pipeline: The pipeline API is an easy way to run the model and handles much of the boilerplate needed to prepare the model and the individual samples for inference. This is especially valuable while highly performant Whisper serving frameworks are not yet widely available.
  • Flexible Dataset Support: Works with any dataset hosted on the Hugging Face Hub, with minimal modification for a specific dataset schema.

Tips for Best Performance

  1. Adjust the batch size based on your GPU memory so that the GPU is fully utilized; see the sketch after this list.
  2. Consider preprocessing your audio files to remove noise or normalize volume.
  3. Process the transcription outputs with an LLM to automatically check whether the text is coherent; if it isn't, that may be a sign the audio is unclear.
  4. Use these outputs to speed up human transcribers and get faster turnaround with humans in the loop.
  5. Use the resulting accurately transcribed outputs to fine-tune Whisper for your specific use case, increasing the accuracy and speed of your data flywheel and its value to your business.
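For the first tip, one simple approach is to start with a large batch size and halve it whenever the GPU runs out of memory. A minimal sketch reusing the functions defined above (the starting batch size of 64 is just an example):

```python
def transcribe_with_fallback(asr_pipeline, dataset, batch_size=64):
    """Try progressively smaller batch sizes until the batches fit in GPU memory."""
    while batch_size >= 1:
        try:
            results = []
            for batch in batch_audio(dataset, batch_size):
                with torch.no_grad():
                    results.extend(asr_pipeline(batch))
            return results
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()
            print(f"Out of memory at batch_size={batch_size}, halving")
            batch_size //= 2
    raise RuntimeError("Even batch_size=1 did not fit in GPU memory")
```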

This implementation provides a solid foundation for batch audio transcription using Whisper Large V3. You can modify the dataset loading code to work with your own audio files, and adjust the batch size based on your available GPU memory.

Now you can see how, with Vast, you can save a meaningful amount of money on labor-intensive tasks like transcription by using affordable and powerful GPUs. Happy transcribing!
