svara-TTS — Open Multilingual TTS for India’s Voices
India speaks in hundreds of languages and dialects—each with its own pulse, cadence, and sentiment. Most TTS systems flatten that richness. Svara‑TTS is our attempt to bring it back.
Why Voices Matter for India 🇮🇳
India doesn’t speak in one tone — it speaks in hundreds of languages, each with its own rhythm, emotion, and cadence. Every language carries a distinct way of expressing warmth, curiosity, hesitation, or resolve. The same sentence in Marathi or Tamil can sound tender, firm, or teasing — depending on where the breath falls or how the voice rises at the end. That texture is what makes a voice human.
Most open-source text-to-speech (TTS) systems today flatten that richness. They sound neutral, robotic, or overly tuned to high-resource languages like English or Hindi — often missing the gentle softness of Malayalam, the rolling lilt of Assamese, or the earthy directness of Bhojpuri.
If you’ve ever heard your mother tongue spoken by a machine, you’ve probably felt it: the missing warmth, the hollow rhythm, the absence of emotion. These aren’t technical artifacts — they’re losses in meaning. Voice is how we teach, how we comfort, how we build trust.
svara-TTS was born from that realization — that an open, expressive, multilingual voice model shouldn’t be a luxury. It should be a foundation: something every developer, teacher, or creative person in India can use to make technology sound like us.
Built on the foundation of Orpheus, which demonstrated remarkable emotional nuance and naturalness, svara-TTS extends this vision to India’s linguistic landscape. It is capable of speaking 19 Indian languages, with balanced male–female voices, emotion-aware conditioning, and zero-shot voice cloning for adaptation — all trained openly on public datasets.
By bringing together the science of language models with the soul of India’s voices, svara-TTS makes something simple but powerful possible: a future where every Indian language can speak — and be heard — beautifully.
🎧 svara-tts-v1 in Action
- **Hindi (Happy 😊):** आज का दिन तो सच में बहुत ख़ास है
- **Magahi (Chat 💬):** हम बस देख रहलो, सब कुछ धीरे-धीरे बदल रहल बा।
- **Marathi (Sad 😢):** मला माहीत होतं हे एक दिवस होणारच. पण आजच होईल असं वाटलं नव्हतं
- **Malayalam (Sad 😢):** ഞാൻ എല്ലായിടത്തും അന്വേഷിച്ചു. എങ്കിലും അത് എവിടെയും കണ്ടില്ല
- **Maithili (Fear 😨):** ओ आवाज कत’ सँ आयल? हमरा त बुझाइ नहि रहल अछि
- **Tamil (Surprise 😲):** இது என்னடா நடந்தது? நம்பவே முடியல.
- **Punjabi (Surprise 😲):** ਤੈਨੂੰ ਇੱਥੇ ਵੇਖ ਕੇ ਯਕੀਨ ਨਹੀਂ ਆ ਰਿਹਾ। ਕਿੰਨਾ ਸਮਾਂ ਹੋ ਗਿਆ।
- **English (Anger 😠):** Do you even realize what you’ve done
- **Nepali:** मलाई लाग्थ्यो सबै कुरा योजना अनुसार हुनेछ
- **Kannada:** ಇಂದು ನಾವು ಮುಂದಿನ ಯೋಜನೆಯ ಕುರಿತು ಚರ್ಚೆ ನಡೆಸಲಿದ್ದೇವೆ. ದಯವಿಟ್ಟು ನಿಮ್ಮ ಅಭಿಪ್ರಾಯಗಳನ್ನು ಹಂಚಿಕೊಳ್ಳಿ
- **Sanskrit:** वाणी मनुष्यस्य शोभा भवति। तस्मात् विचार्य वदेत् सदा।
- **Telugu:** ఈ నిర్ణయం అందరి అభిప్రాయాలను పరిశీలించి తీసుకోబడింది

Try it for yourself here: https://huggingface.co/spaces/kenpath/svara-tts
What we built
- 19 Indian languages, from Hindi, Bengali, Marathi and Telugu to Bodo, Dogri, Maithili, Magahi, and Sanskrit—along with Indian English.
- Emotion‑aware conditioning with simple tags like `<happy>`, `<sad>`, `<anger>`, `<fear>`.
- Balanced male–female voice coverage and a straightforward speaker identity convention (`Language (Gender)`).
- Zero‑shot voice cloning support to adapt to new voices with short references.
- Code‑switching awareness so common blends sound natural rather than stitched.
We intentionally kept the speaker taxonomy simple. Because we combined SYSPIN, IndicTTS, RASA, and SPICOR—and some languages overlap—each “voice” is a learned blend anchored to its gender and language identity rather than a celebrity clone. The goal is expressivity and inclusivity, not imitation.
The shift: from phonemes to language models for speech
Classical pipelines are great at what to say; they struggle with how to say it. They lack access to context (sarcasm vs sincerity), they break easily across code‑switching, and they need careful, brittle alignment stages.
Language models (LMs) are built for sequences. If you represent speech as a sequence of discrete audio tokens, you can train a transformer to predict sound the way it predicts words—capturing pauses, emphasis, and emotion as part of the sequence itself. That single design decision unlocks three things:
- Expressivity emerges from learning, rather than being bolted on later.
- Multilingual transfer: related languages borrow strength from one another.
- Efficiency: discrete tokens + modern inference (KV‑cache, quantization, adapters) make real‑time synthesis and on‑device deployment far more practical.
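The core representational trick can be sketched in a few lines. Assuming a neural codec that quantizes audio into discrete codes, audio tokens can share one vocabulary with text tokens by offsetting their ids past the text range; a causal transformer then models text and speech as a single sequence. The vocabulary size and special-token ids below are invented for illustration:

```python
TEXT_VOCAB = 32_000  # hypothetical text vocabulary size

def to_sequence(text_ids: list[int], audio_codes: list[int],
                bos: int = 0, sep: int = 1, eos: int = 2) -> list[int]:
    """Interleave text and audio into one LM training sequence:
    [BOS] text tokens [SEP] audio tokens [EOS].
    Audio codes are shifted past the text vocabulary so both live
    in a single shared token space."""
    audio_ids = [TEXT_VOCAB + c for c in audio_codes]
    return [bos] + text_ids + [sep] + audio_ids + [eos]

seq = to_sequence([17, 905, 42], [1012, 377])
# The transformer predicts this sequence left-to-right, so pauses,
# emphasis, and emotion are just patterns in the audio-token stream.
```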
svara‑TTS builds on this idea (inspired by Orpheus), but scales it toward India’s tapestry of languages and speech styles.
The training journey (and the choices that mattered)
Early prototypes in a single language (e.g., Marathi‑only) sounded good; joint training with a related language (Marathi+Hindi) sounded better: smoother rhythm, more reliable word stress, richer emotion. That finding guided the next step: train all 19 languages together so they share a multilingual acoustic space.
A few small choices had outsized impact:
**Sentence‑end emotion tags.** We append emotion/style tags at the end of the sentence, not mid‑sentence. This nudges the model to shape the entire utterance around a feeling rather than spiking briefly in the middle.

**Punctuation as prosody.** Preserving ellipses, commas, and exclamation marks during preprocessing gave the model more reliable cues for pause and emphasis. Tiny tokens; big naturalness gains.

**Numbers, normalized.** SYSPIN commonly marks digits like `{2}{two}`. We expand such patterns into distinct records to improve digit handling. Even then, we recommend normalizing large numerals at inference (see Limitations).

**LoRA over full finetuning.** In ablations, full finetuning (FFT) on large multilingual mixes increased the risk of overfitting and catastrophic forgetting. Small‑batch LoRA adapters were more stable and easier to adapt per language without erasing what the model already knew.
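The inference-time numeral normalization recommended above can be sketched as a small Indian-system (lakh/crore) number-to-words converter. The function names and regex here are illustrative helpers, not part of svara-TTS:

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty",
        "seventy", "eighty", "ninety"]

def _two(n: int) -> str:
    """Words for 0..99."""
    return ONES[n] if n < 20 else TENS[n // 10] + ("-" + ONES[n % 10] if n % 10 else "")

def indian_words(n: int) -> str:
    """Integer to words using Indian grouping: crore (10^7), lakh (10^5)."""
    if n == 0:
        return "zero"
    crore, n = divmod(n, 10_000_000)
    lakh, n = divmod(n, 100_000)
    thousand, n = divmod(n, 1_000)
    hundred, rest = divmod(n, 100)
    parts = []
    if crore:
        parts.append(indian_words(crore) + " crore")
    if lakh:
        parts.append(_two(lakh) + " lakh")
    if thousand:
        parts.append(_two(thousand) + " thousand")
    if hundred:
        parts.append(ONES[hundred] + " hundred")
    if rest:
        parts.append(_two(rest))
    return " ".join(parts)

def normalize_numerals(text: str) -> str:
    """Replace ₹-prefixed or bare comma-grouped numerals with words,
    e.g. '₹1,23,456' -> 'one lakh twenty-three thousand four hundred fifty-six'."""
    return re.sub(r"₹?(\d[\d,]*)",
                  lambda m: indian_words(int(m.group(1).replace(",", ""))),
                  text)
```

Running this before synthesis keeps large figures from being read digit-by-digit.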
Training data at a glance
- Total hours: 2000+ hours of high‑quality speech
- Speakers: ~50 (≈25 male / ≈25 female)
- Languages: 19 (18 Indic + Indian English)
| # | Language | Approx. Total Hours | Sources |
|---|---|---|---|
| 1 | Hindi | ~100 h | SYSPIN, IndicTTS |
| 2 | Bengali | ~130 h | SYSPIN, RASA, IndicTTS |
| 3 | Marathi | ~130 h | SYSPIN, RASA, IndicTTS |
| 4 | Telugu | ~120 h | SYSPIN, RASA, IndicTTS |
| 5 | Kannada | ~120 h | SYSPIN, RASA, IndicTTS |
| 6 | Bhojpuri | ~100 h | SYSPIN |
| 7 | Magahi | ~100 h | SYSPIN |
| 8 | Chhattisgarhi | ~100 h | SYSPIN |
| 9 | Maithili | ~120 h | SYSPIN, RASA |
| 10 | Assamese | ~60 h | RASA, IndicTTS |
| 11 | Bodo | ~52 h | RASA |
| 12 | Dogri | ~48 h | RASA |
| 13 | Gujarati | ~59 h | RASA, IndicTTS |
| 14 | Malayalam | ~60 h | RASA, IndicTTS |
| 15 | Punjabi | ~55 h | RASA, IndicTTS |
| 16 | Tamil | ~55 h | RASA, IndicTTS |
| 17 | English (Indian) | ~97 h | SPICOR |
| 18 | Nepali | ~55 h | RASA |
| 19 | Sanskrit | ~51 h | RASA |
Note: Each dataset’s license/terms apply. Full attributions and links are listed below.
What Svara‑TTS is (and is not)
Svara‑TTS is a foundation for expressive Indian speech—an open model meant to be adapted. It is not a celebrity‑voice generator. It won’t perfectly mimic the accent of a small village you grew up in. What it aims to do is sound like home: rhythmically plausible, emotionally believable, and respectful of each language’s music.
Limitations (honest notes)
- Proper nouns & rare entities can trip pronunciation. Appending a `<clear>` tag at the end generally helps.
- Very long sentences may flatten; add punctuation or chunk the text.
- Emotion strength varies across languages depending on data density.
- Numbers & currency: normalize large values for clarity, for example, ₹1,23,456 → “one lakh twenty‑three thousand four hundred fifty‑six”.
- Code‑mixing works well in common patterns but is not a rules engine.
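A practical workaround for the long-sentence limitation is to split input at sentence-final punctuation before synthesis and feed the model shorter chunks. A minimal sketch, where the character threshold is an arbitrary assumption you should tune for your use case:

```python
import re

def chunk_text(text: str, max_chars: int = 120) -> list[str]:
    """Split long input on sentence-final punctuation (., !, ?, and the
    Devanagari danda '।'), then greedily pack sentences into chunks no
    longer than max_chars so each synthesized piece keeps lively prosody."""
    sentences = re.split(r"(?<=[.!?।])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for s in sentences:
        if current and len(current) + len(s) + 1 > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be synthesized separately and the audio concatenated, optionally with a short silence between chunks.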
These aren’t dead ends; they’re a roadmap. Many of them improve with targeted finetuning and better text normalization.
Where this goes next
- More controllable expressivity (emotion intensity, speaking rate).
- A practical LoRA path for fast speaker/domain adaptation.
- Better conversational back‑channels (“hmm…”, “acha?”, “arey!”).
- Stronger code‑switch behavior in noisy, in‑the‑wild text.
- Streamed synthesis for low‑latency interactive experiences.
If you’re a developer, linguist, teacher, or radio artist and want to help—come build with us.
🙏 Acknowledgments
This work was developed by Kenpath Technologies for the open-source community.
- Canopy Labs — Orpheus: foundational ideas and an inspiring open release (https://canopylabs.ai/releases/orpheus_can_speak_any_language)
- SPIRE Lab, IISc Bangalore — SYSPIN (multilingual studio data) and SPICOR (Indian English, male & female)
- AI4Bharat — RASA expressive speech
- IIT Madras — IndicTTS
- Unsloth — notes & tooling that helped streamline parts of our workflow
- RunPod — startup GPU credits that let us move quickly
We’re grateful to the communities that made these resources available. Open speech is a collective act.
🔗 Links
- Model: https://huggingface.co/kenpath/svara-tts-v1
- Demo Space: https://huggingface.co/spaces/kenpath/svara-tts
- Inference repo: https://github.com/Kenpath/svara-tts-inference
- Colab: https://colab.research.google.com/drive/15YxFo1DzdQNbFUIZ1HJA4AN4oHqKxGtg
If svara‑tts helps you teach, translate, tell stories, or bring a service alive in a local language, please let us know. That’s the future we’re working toward — voices that sound like us, available to everyone.