Dia 1.6B
Nari Labs
Apache 2.0
Dia is a text-to-speech model developed by Nari Labs that generates highly realistic dialogue directly from a transcript. Generation is English-only, and the output can be conditioned on audio to control emotion and tone.
Dialogue Generation with Speaker Tags
Dia produces natural speech from transcripts using [S1] and [S2] speaker tags, making it easy to create multi-speaker conversations directly from text.
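Because the speaker tags are plain text, a transcript can be assembled programmatically. A minimal sketch (the `make_transcript` helper is hypothetical, not part of Dia's API) that interleaves the [S1]/[S2] tags described above:

```python
def make_transcript(turns):
    """Prefix each utterance with an alternating [S1]/[S2] speaker tag."""
    tags = ("[S1]", "[S2]")
    return " ".join(f"{tags[i % 2]} {text}" for i, text in enumerate(turns))

print(make_transcript([
    "Have you heard about the new release?",
    "I have, the dialogue sounds remarkably natural.",
]))
# → [S1] Have you heard about the new release? [S2] I have, the dialogue sounds remarkably natural.
```

The resulting string is what you would pass to the model as its input transcript.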
Nonverbal Communication
The model recognizes and generates approximately 20 different nonverbal expressions, including laughter, coughing, throat clearing, sighing, and gasping. These are triggered using simple inline tags such as "(laughs)", "(clears throat)", and "(sighs)".
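Nonverbal cues are also plain inline tags, so inserting one is string-level work. A sketch using a hypothetical helper that checks the cue against a small subset of the supported tags (the full list of roughly 20 is in Dia's documentation, not reproduced here):

```python
# Subset of the supported nonverbal tags mentioned in the model card;
# this is not the complete list.
KNOWN_CUES = {"laughs", "clears throat", "sighs", "coughs", "gasps"}

def with_cue(utterance, cue):
    """Append a nonverbal tag like '(laughs)' to an utterance."""
    if cue not in KNOWN_CUES:
        raise ValueError(f"unrecognized nonverbal cue: {cue!r}")
    return f"{utterance} ({cue})"

print(with_cue("[S1] That was unexpected.", "laughs"))
# → [S1] That was unexpected. (laughs)
```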
Voice Cloning
Dia includes voice cloning functionality that enables speaker consistency across generations. By default the model produces a different voice with each generation, with no fine-tuning on specific voices required; fixing the random seed makes generations reproducible.
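Dia's own seeding API is not shown in this card; the reproducibility idea can be illustrated with Python's `random` module as a stand-in for the model's per-generation voice sampling:

```python
import random

def sample_voice(seed=None):
    # Stand-in for Dia's voice sampling: with no seed, each call may
    # yield a different voice; a fixed seed reproduces the same one.
    rng = random.Random(seed)
    return rng.randrange(10_000)

assert sample_voice(seed=42) == sample_voice(seed=42)  # fixed seed: reproducible
```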
Audio Conditioning
The model can be conditioned on audio input, enabling precise control over the emotion and tone of the generated speech.
Dia draws inspiration from the SoundStorm and Parakeet architectures and uses the Descript Audio Codec for audio generation. Development was supported by the Google TPU Research Cloud program and a Hugging Face ZeroGPU grant.