svara-TTS — Open Multilingual TTS for India’s Voices

Community Article Published October 27, 2025

🤗 Model: svara-tts-v1 · 🤗 Spaces demo · Open in Colab · GitHub


India speaks in hundreds of languages and dialects, each with its own pulse, cadence, and sentiment. Most TTS systems flatten that richness. svara‑TTS is our attempt to bring it back.

Why Voices Matter for India 🇮🇳

India doesn’t speak in one tone — it speaks in hundreds of languages, each with its own rhythm, emotion, and cadence. Every language carries a distinct way of expressing warmth, curiosity, hesitation, or resolve. The same sentence in Marathi or Tamil can sound tender, firm, or teasing — depending on where the breath falls or how the voice rises at the end. That texture is what makes a voice human.

Most open-source text-to-speech (TTS) systems today flatten that richness. They sound neutral, robotic, or overly tuned to high-resource languages like English or Hindi — often missing the gentle softness of Malayalam, the rolling lilt of Assamese, or the earthy directness of Bhojpuri.

If you’ve ever heard your mother tongue spoken by a machine, you’ve probably felt it: the missing warmth, the hollow rhythm, the absence of emotion. These aren’t technical artifacts — they’re losses in meaning. Voice is how we teach, how we comfort, how we build trust.

svara-TTS was born from that realization — that an open, expressive, multilingual voice model shouldn’t be a luxury. It should be a foundation: something every developer, teacher, or creative person in India can use to make technology sound like us.

Built on the foundation of Orpheus, which demonstrated remarkable emotional nuance and naturalness, svara-TTS extends this vision to India’s linguistic landscape. It is capable of speaking 19 Indian languages, with balanced male–female voices, emotion-aware conditioning, and zero-shot voice cloning for adaptation — all trained openly on public datasets.

By bringing together the science of language models with the soul of India’s voices, svara-TTS makes something simple but powerful possible: A future where every Indian language can speak — and be heard — beautifully.



🎧 svara-tts-v1 in Action

Hindi (Happy 😊)

आज का दिन तो सच में बहुत ख़ास है ("Today really is a very special day.")

Magahi (Chat 💬)

हम बस देख रहलो, सब कुछ धीरे-धीरे बदल रहल बा। ("I'm just watching; everything is slowly changing.")

Marathi (Sad 😢)

मला माहीत होतं हे एक दिवस होणारच. पण आजच होईल असं वाटलं नव्हतं ("I knew this would happen one day. But I didn't think it would be today.")

Malayalam (Sad 😢)

ഞാൻ എല്ലായിടത്തും അന്വേഷിച്ചു. എങ്കിലും അത് എവിടെയും കണ്ടില്ല ("I searched everywhere, yet I couldn't find it anywhere.")

Maithili (Fear 😨)

ओ आवाज कत’ सँ आयल? हमरा त बुझाइ नहि रहल अछि ("Where did that sound come from? I just can't make sense of it.")

Tamil (Surprise 😲)

இது என்னடா நடந்தது? நம்பவே முடியல. ("What on earth just happened? I can't believe it.")

Punjabi (Surprise 😲)

ਤੈਨੂੰ ਇੱਥੇ ਵੇਖ ਕੇ ਯਕੀਨ ਨਹੀਂ ਆ ਰਿਹਾ। ਕਿੰਨਾ ਸਮਾਂ ਹੋ ਗਿਆ। ("I can't believe seeing you here. It's been so long.")

English (Anger 😠)

Do you even realize what you’ve done?

Nepali

मलाई लाग्थ्यो सबै कुरा योजना अनुसार हुनेछ ("I thought everything would go according to plan.")

Kannada

ಇಂದು ನಾವು ಮುಂದಿನ ಯೋಜನೆಯ ಕುರಿತು ಚರ್ಚೆ ನಡೆಸಲಿದ್ದೇವೆ. ದಯವಿಟ್ಟು ನಿಮ್ಮ ಅಭಿಪ್ರಾಯಗಳನ್ನು ಹಂಚಿಕೊಳ್ಳಿ ("Today we will discuss the upcoming plan. Please share your opinions.")

Sanskrit

वाणी मनुष्यस्य शोभा भवति। तस्मात् विचार्य वदेत् सदा। ("Speech is a person's adornment; therefore one should always speak after reflection.")

Telugu

ఈ నిర్ణయం అందరి అభిప్రాయాలను పరిశీలించి తీసుకోబడింది ("This decision was taken after considering everyone's opinions.")

Try it for yourself here:

🤗 Hugging Face Space

What we built

  • 19 Indian languages, from Hindi, Bengali, Marathi and Telugu to Bodo, Dogri, Maithili, Magahi, and Sanskrit—along with Indian English.
  • Emotion‑aware conditioning with simple tags like <happy>, <sad>, <anger>, <fear>.
  • Balanced male–female voice coverage and a straightforward speaker identity convention (Language (Gender)).
  • Zero‑shot voice cloning support to adapt to new voices with short references.
  • Code‑switching awareness so common blends sound natural rather than stitched.

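The emotion tags and the Language (Gender) speaker convention above can be sketched as a small prompt builder. This is a hypothetical illustration of the conventions, not the model's actual API; the exact prompt format the released checkpoint expects may differ, so check the model card before relying on it.

```python
# Hypothetical prompt builder for svara-TTS-style inputs: a "Language (Gender)"
# speaker identity plus a sentence-final emotion tag. Names and the exact
# format are assumptions for illustration only.

def build_prompt(text, language, gender, emotion=None):
    """Compose a synthesis prompt from text, speaker identity, and emotion."""
    speaker = f"{language} ({gender})"                    # e.g. "Hindi (Female)"
    tagged = f"{text} <{emotion}>" if emotion else text   # tag goes at the end
    return f"{speaker}: {tagged}"

print(build_prompt("आज का दिन तो सच में बहुत ख़ास है", "Hindi", "Female", "happy"))
# → Hindi (Female): आज का दिन तो सच में बहुत ख़ास है <happy>
```

Placing the tag at the end of the sentence mirrors the training-time choice described below: the model shapes the whole utterance around the emotion rather than spiking mid-sentence.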
We intentionally kept the speaker taxonomy simple. Because we combined SYSPIN, IndicTTS, RASA, and SPICOR—and some languages overlap—each “voice” is a learned blend anchored to its gender and language identity rather than a celebrity clone. The goal is expressivity and inclusivity, not imitation.

The shift: from phonemes to language models for speech

Classical pipelines are great at what to say; they struggle with how to say it. They lack access to context (sarcasm vs sincerity), they break easily across code‑switching, and they need careful, brittle alignment stages.

Language models (LMs) are built for sequences. If you represent speech as a sequence of discrete audio tokens, you can train a transformer to predict sound the way it predicts words—capturing pauses, emphasis, and emotion as part of the sequence itself. That single design decision unlocks three things:

  1. Expressivity emerges from learning, rather than being bolted on later.
  2. Multilingual transfer: related languages borrow strength from one another.
  3. Efficiency: discrete tokens + modern inference (KV‑cache, quantization, adapters) make real‑time synthesis and on‑device deployment far more practical.

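The token-level view above can be made concrete with a toy sketch: text tokens and discrete audio-codec tokens share one vocabulary, so a standard transformer can predict sound the way it predicts words. The vocabulary sizes and offsets here are made up for illustration; they are not svara‑TTS's actual values.

```python
# Toy illustration of "speech as a token sequence": codec codes are shifted
# past the text vocabulary so both live in one LM vocabulary. All sizes are
# hypothetical.

TEXT_VOCAB = 32_000   # assumed text vocabulary size
AUDIO_CODES = 4_096   # assumed codec codebook size

def audio_token(code):
    """Map a codec code into the LM vocabulary, after all text tokens."""
    assert 0 <= code < AUDIO_CODES
    return TEXT_VOCAB + code

def build_sequence(text_ids, audio_codes):
    """Prompt text first, then the audio tokens the model learns to predict."""
    return list(text_ids) + [audio_token(c) for c in audio_codes]

print(build_sequence([101, 245, 7], [12, 870, 33]))
# → [101, 245, 7, 32012, 32870, 32033]
```
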
svara‑TTS builds on this idea (inspired by Orpheus), but scales it toward India’s tapestry of languages and speech styles.

The training journey (and the choices that mattered)

Early prototypes in a single language (e.g., Marathi‑only) sounded good; joint training with a related language (Marathi+Hindi) sounded better: smoother rhythm, more reliable word stress, richer emotion. That finding guided the next step: train all 19 languages together so they share a multilingual acoustic space.

A few small choices had outsized impact:

  • Sentence‑end emotion tags: We append emotion/style tags at the end of the sentence, not mid‑sentence. This nudges the model to shape the entire utterance around a feeling rather than spiking briefly in the middle.

  • Punctuation as prosody: Preserving ellipses, commas, and exclamation marks during preprocessing gave the model more reliable cues for pause and emphasis. Tiny tokens; big naturalness gains.

  • Numbers, normalized: SYSPIN commonly marks digits like {2}{two}. We expand such patterns into distinct records to improve digit handling. Even then, we recommend normalizing large numerals at inference (see Limitations).

  • LoRA over full finetuning: In ablations, full finetuning (FFT) on large multilingual mixes increased the risk of overfitting and catastrophic forgetting. Small‑batch LoRA adapters were more stable and easier to adapt per language without erasing what the model already knew.
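
The digit-pair expansion above can be sketched in a few lines. The helper name is ours and the exact SYSPIN marker syntax may vary; this only illustrates the idea of turning one annotated transcript into two training variants:

```python
import re

# Expand SYSPIN-style {digit}{spoken} pairs into two transcript variants:
# one keeping the digit, one keeping the spoken form. Marker syntax assumed.
PAIR = re.compile(r"\{([^{}]+)\}\{([^{}]+)\}")

def expand_digit_pairs(text):
    """Return (digit_version, spoken_version) of an annotated transcript."""
    digits = PAIR.sub(lambda m: m.group(1), text)
    spoken = PAIR.sub(lambda m: m.group(2), text)
    return digits, spoken

print(expand_digit_pairs("Platform {2}{two} is closed"))
# → ('Platform 2 is closed', 'Platform two is closed')
```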

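On the LoRA point, a quick back-of-the-envelope shows why adapters are so much lighter than full finetuning: a rank-r adapter on a (d_out × d_in) weight trains r·(d_in + d_out) parameters instead of d_in·d_out. The hidden size below is illustrative, not svara‑TTS's actual dimension.

```python
# Parameter count of a LoRA adapter on a single linear projection,
# versus fully finetuning that projection. Shapes are hypothetical.

def lora_params(d_in, d_out, r):
    """Trainable parameters of a rank-r LoRA adapter (A: r x d_in, B: d_out x r)."""
    return r * (d_in + d_out)

d = 4096                      # assumed hidden size
full = d * d                  # full-finetune params for one d x d projection
adapter = lora_params(d, d, r=16)
print(f"full: {full:,}  lora(r=16): {adapter:,}  ratio: {full // adapter}x")
# → full: 16,777,216  lora(r=16): 131,072  ratio: 128x
```

Fewer trainable parameters also means each per-language adapter can be swapped in without touching the shared multilingual backbone, which is what makes forgetting less likely.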
Training data at a glance

Total: 2,000+ hours of high‑quality speech

Speakers: ~50 (≈25 male / ≈25 female)

Languages: 19 (18 Indic + Indian English)

| # | Language | Approx. Total Hours | Sources |
|---|----------|---------------------|---------|
| 1 | Hindi | ~100 h | SYSPIN, IndicTTS |
| 2 | Bengali | ~130 h | SYSPIN, RASA, IndicTTS |
| 3 | Marathi | ~130 h | SYSPIN, RASA, IndicTTS |
| 4 | Telugu | ~120 h | SYSPIN, RASA, IndicTTS |
| 5 | Kannada | ~120 h | SYSPIN, RASA, IndicTTS |
| 6 | Bhojpuri | ~100 h | SYSPIN |
| 7 | Magahi | ~100 h | SYSPIN |
| 8 | Chhattisgarhi | ~100 h | SYSPIN |
| 9 | Maithili | ~120 h | SYSPIN, RASA |
| 10 | Assamese | ~60 h | RASA, IndicTTS |
| 11 | Bodo | ~52 h | RASA |
| 12 | Dogri | ~48 h | RASA |
| 13 | Gujarati | ~59 h | RASA, IndicTTS |
| 14 | Malayalam | ~60 h | RASA, IndicTTS |
| 15 | Punjabi | ~55 h | RASA, IndicTTS |
| 16 | Tamil | ~55 h | RASA, IndicTTS |
| 17 | English (Indian) | ~97 h | SPICOR |
| 18 | Nepali | ~55 h | RASA |
| 19 | Sanskrit | ~51 h | RASA |

Note: Each dataset’s license/terms apply. Full attributions and links are listed below.

What svara‑TTS is (and is not)

svara‑TTS is a foundation for expressive Indian speech: an open model meant to be adapted. It is not a celebrity‑voice generator. It won’t perfectly mimic the accent of the small village you grew up in. What it aims to do is sound like home: rhythmically plausible, emotionally believable, and respectful of each language’s music.

Limitations (honest notes)

  • Proper nouns & rare entities can trip pronunciation. Adding a <clear> tag at the end generally helps.
  • Very long sentences may flatten; add punctuation or chunk the text.
  • Emotion strength varies across languages depending on data density.
  • Numbers & currency: normalize large values for clarity, for example, ₹1,23,456 → “one lakh twenty‑three thousand four hundred fifty‑six”.
  • Code‑mixing works well in common patterns but is not a rules engine.

These aren’t caveats; they’re a roadmap. Many of them improve with targeted finetuning and better text normalization.
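
The currency recommendation above can be handled with a small Indian-numbering speller at inference time. This is a sketch under our own naming, not part of the model's tooling, and it covers integers below 100 crore:

```python
# Minimal number-to-words normalizer for the Indian system (lakh/crore),
# intended as an inference-time preprocessing step. Names are illustrative.

ONES = ["zero","one","two","three","four","five","six","seven","eight","nine",
        "ten","eleven","twelve","thirteen","fourteen","fifteen","sixteen",
        "seventeen","eighteen","nineteen"]
TENS = ["","","twenty","thirty","forty","fifty","sixty","seventy","eighty","ninety"]

def two_digits(n):
    if n < 20:
        return ONES[n]
    return TENS[n // 10] + ("-" + ONES[n % 10] if n % 10 else "")

def three_digits(n):
    if n < 100:
        return two_digits(n)
    s = ONES[n // 100] + " hundred"
    return s + (" " + two_digits(n % 100) if n % 100 else "")

def indian_words(n):
    """Spell out n (0 <= n < 100 crore) with lakh/crore groupings."""
    parts = []
    crore, n = divmod(n, 10_000_000)
    lakh, n = divmod(n, 100_000)
    thousand, rest = divmod(n, 1000)
    if crore: parts.append(two_digits(crore) + " crore")
    if lakh: parts.append(two_digits(lakh) + " lakh")
    if thousand: parts.append(two_digits(thousand) + " thousand")
    if rest or not parts: parts.append(three_digits(rest))
    return " ".join(parts)

print(indian_words(123456))
# → one lakh twenty-three thousand four hundred fifty-six
```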

Where this goes next

  • More controllable expressivity (emotion intensity, speaking rate).
  • A practical LoRA path for fast speaker/domain adaptation.
  • Better conversational back‑channels (“hmm…”, “acha?”, “arey!”).
  • Stronger code‑switch behavior in noisy, in‑the‑wild text.
  • Streamed synthesis for low‑latency interactive experiences.

If you’re a developer, linguist, teacher, or radio artist and want to help—come build with us.

🙏 Acknowledgments

This work was developed by Kenpath Technologies for the open-source community. We also thank RunPod for the startup credits that supported our GPU compute.

  • Canopy Labs: Orpheus, whose foundational ideas and open release inspired this work. Release: https://canopylabs.ai/releases/orpheus_can_speak_any_language
  • SPIRE Lab, IISc Bangalore: SYSPIN (multilingual studio data) and SPICOR (Indian English, male & female)
  • AI4Bharat: RASA expressive speech
  • IIT Madras: IndicTTS
  • Unsloth: notes & tooling that helped streamline parts of our workflow
  • RunPod: startup GPU credits that let us move quickly

We’re grateful to the communities that made these resources available. Open speech is a collective act.


If svara‑tts helps you teach, translate, tell stories, or bring a service alive in a local language, please let us know. That’s the future we’re working toward — voices that sound like us, available to everyone.

Community

Hey, great work. Since you mentioned that you used the SYSPIN dataset for training, I wonder if there is a way to generate voices by region as well. Maybe the Vaani dataset by IISc might help in this case.

Article author

Hi @Dark7Devil , thanks!
I think Vaani may be more suitable for an ASR model, since it has multiple speakers and the recording quality is not controlled.

But I think a few hours of audio on top of this model should be enough to fine-tune for specific regions.

Great work @adityachhabra. We tested it in our workflows and the expressiveness is really good.
Next steps for v2: broadening the spectrum of emotions (e.g., sarcasm).
