Dia 1.6B: Realistic Dialogue Generation from Text
Dia is a 1.6B-parameter text-to-speech model developed by Nari Labs that generates highly realistic dialogue directly from transcripts. The model generates English speech and supports emotion and tone control through audio conditioning.
Key Features
Dialogue Generation with Speaker Tags
Dia produces natural speech from transcripts using [S1] and [S2] speaker tags, making it easy to create multi-speaker conversations directly from text.
Nonverbal Communication
The model recognizes and generates approximately 20 different nonverbal expressions including laughter, coughing, throat clearing, sighing, and gasps. These are triggered using simple tags like "(laughs)", "(clears throat)", and "(sighs)".
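The tagging conventions above can be sketched with a small helper that assembles a Dia-style transcript from a list of turns. This is an illustrative sketch, not part of the Dia API: the function name and structure are assumptions; only the [S1]/[S2] tags and parenthesized nonverbal cues come from the model's documented input format.

```python
# Hypothetical helper (not Dia API): build a tagged transcript from turns.
# [S1]/[S2] speaker tags and parenthesized cues like "(laughs)" follow the
# input conventions described above.

def build_transcript(turns):
    """Alternate [S1]/[S2] speaker tags across a list of utterances.

    Nonverbal cues can be embedded directly in an utterance,
    e.g. "That was great. (laughs)".
    """
    tagged = []
    for i, utterance in enumerate(turns):
        speaker = "[S1]" if i % 2 == 0 else "[S2]"
        tagged.append(f"{speaker} {utterance.strip()}")
    return " ".join(tagged)

transcript = build_transcript([
    "Have you tried the new model?",
    "I have. (laughs) The nonverbal tags surprised me.",
])
print(transcript)
# [S1] Have you tried the new model? [S2] I have. (laughs) The nonverbal tags surprised me.
```

The resulting string is what would be passed to the model as its text input.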
Voice Cloning
Dia includes voice cloning functionality. By default, the model produces a different voice on each generation without requiring fine-tuning on specific voices; speaker consistency across generations can be achieved by fixing the random seed or by conditioning on a reference audio clip.
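The seed-fixing idea can be shown in miniature. Sampling-based generators like Dia draw from a random number stream, so fixing the seed replays the same stream and reproduces the same output. Python's stdlib RNG stands in for the model's sampler here; the function below is a stand-in, not Dia code.

```python
import random

# Seed-fixing in miniature: an isolated RNG with a fixed seed always
# produces the same draws, which is what makes a seeded generation
# reproducible. The "voice id" here is a stand-in for a sampled voice.

def sample_voice_id(seed):
    rng = random.Random(seed)      # isolated generator with a fixed seed
    return rng.randrange(10_000)   # stand-in for sampling a voice

run_a = sample_voice_id(seed=42)
run_b = sample_voice_id(seed=42)   # same seed -> identical result
run_c = sample_voice_id(seed=7)    # different seed -> a fresh draw

print(run_a == run_b)  # True
```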
Audio Conditioning
The model can be conditioned on audio input, enabling precise control over emotion and tone in the generated speech output.
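One common pattern in audio-prompted TTS is to pair the conditioning clip with its own transcript, prepending that transcript to the new text so the model treats the clip as a prefix and continues in its voice and tone. The sketch below only assembles that combined transcript; it is an assumed pattern for illustration, and the helper name is hypothetical, not confirmed Dia API.

```python
# Hypothetical sketch: prepend the conditioning clip's transcript to the
# new text, a common pattern for audio-prompted TTS. The actual generation
# call (model object, parameter names) is not shown and is an assumption.

def build_conditioned_transcript(prompt_transcript, new_text):
    """Concatenate the conditioning clip's transcript with the new text."""
    return f"{prompt_transcript.strip()} {new_text.strip()}"

combined = build_conditioned_transcript(
    "[S1] This is how I want the voice to sound.",
    "[S1] And this is the new line to synthesize.",
)
print(combined)
```

The conditioning audio itself would be supplied to the model alongside this combined transcript.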
Use Cases
- Creating realistic dialogue for audio content and storytelling
- Generating conversational speech with multiple speakers
- Producing speech with emotional expressions and nonverbal sounds
- Voice synthesis applications requiring speaker consistency
- Accessibility tools for text-to-speech conversion
Training and Architecture
Dia draws inspiration from the SoundStorm and Parakeet architectures and uses the Descript Audio Codec for audio tokenization and generation. Model development benefited from resources provided by the Google TPU Research Cloud program and a Hugging Face ZeroGPU grant.