# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
KaniTTS is a Text-to-Speech system that uses causal language models to generate speech via NeMo audio codec tokens. The project is deployed as a HuggingFace Gradio Space.
## Running the Application

```bash
# Run the Gradio app (launches on http://0.0.0.0:7860)
python app.py
```
The app requires a HuggingFace token set as the HF_TOKEN environment variable to download models.
## Architecture

### Token Flow Pipeline
The system uses a custom token layout that interleaves text and audio in a single sequence:
1. **Input prompt construction** (`KaniModel.get_input_ids`): `START_OF_HUMAN` → text tokens → `END_OF_TEXT` → `END_OF_HUMAN`
   - Optionally prefixed with speaker ID (e.g., "andrew: Hello world"); see the sketch after this list
2. **LLM generation** (`KaniModel.model_request`):
   - Model generates sequence containing: text section + `START_OF_SPEECH` + audio codec tokens + `END_OF_SPEECH`
3. **Audio decoding** (`NemoAudioPlayer.get_waveform`):
   - Extracts audio tokens between `START_OF_SPEECH` and `END_OF_SPEECH`
   - Audio tokens are arranged in 4 interleaved codebooks (q=4)
   - Tokens are offset by `audio_tokens_start + (codebook_size * codebook_index)`
   - NeMo codec reconstructs waveform from the 4 codebooks
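For orientation, here is a minimal sketch of step 1 using the special-token ids listed under Important Token Constants. The helper name, the use of the tokenizer's EOS id as a stand-in for `END_OF_TEXT`, and the tensor layout are assumptions, not the repository's exact `KaniModel.get_input_ids` implementation.

```python
import torch

# Special-token ids relative to tokeniser_length (see Important Token Constants)
TOKENISER_LENGTH = 64400
START_OF_HUMAN = TOKENISER_LENGTH + 3
END_OF_HUMAN = TOKENISER_LENGTH + 4

def build_prompt_ids(tokenizer, text, speaker_id=None):
    """Sketch of step 1: wrap the (optionally speaker-prefixed) text in
    conversation control tokens before generation."""
    if speaker_id:
        text = f"{speaker_id}: {text}"  # e.g. "andrew: Hello world"
    text_ids = tokenizer(text, add_special_tokens=False, return_tensors="pt").input_ids
    end_of_text = torch.tensor([[tokenizer.eos_token_id]])  # assumed stand-in for END_OF_TEXT
    return torch.cat([
        torch.tensor([[START_OF_HUMAN]]),
        text_ids,
        end_of_text,
        torch.tensor([[END_OF_HUMAN]]),
    ], dim=1)
```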
### Key Classes

#### NemoAudioPlayer (util.py:27-170)
- Loads NeMo AudioCodecModel for waveform reconstruction
- Manages special token IDs (derived from `tokeniser_length` base)
- Validates output has required speech markers
- Extracts and decodes 4-codebook audio tokens from LLM output
- Returns 22050 Hz audio as NumPy array
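A rough sketch of the extraction and de-interleaving described above, assuming the constants from Important Token Constants; the real `NemoAudioPlayer.get_waveform` may slice and validate differently, and the final decode through the NeMo `AudioCodecModel` is omitted here.

```python
import torch

TOKENISER_LENGTH = 64400
START_OF_SPEECH = TOKENISER_LENGTH + 1      # 64401
END_OF_SPEECH = TOKENISER_LENGTH + 2        # 64402
AUDIO_TOKENS_START = TOKENISER_LENGTH + 10  # 64410
CODEBOOK_SIZE = 4032
NUM_CODEBOOKS = 4  # q=4

def extract_codebooks(output_ids):
    """Slice the audio span out of the LLM output and undo the per-codebook
    offset, returning a (4, num_frames) tensor of codec codes."""
    ids = list(output_ids)
    start = ids.index(START_OF_SPEECH) + 1
    end = ids.index(END_OF_SPEECH)
    audio = torch.tensor(ids[start:end])
    # Frames are interleaved as [cb0, cb1, cb2, cb3, cb0, ...]
    usable = (len(audio) // NUM_CODEBOOKS) * NUM_CODEBOOKS
    frames = audio[:usable].view(-1, NUM_CODEBOOKS)
    offsets = AUDIO_TOKENS_START + CODEBOOK_SIZE * torch.arange(NUM_CODEBOOKS)
    return (frames - offsets).T  # codes per codebook, ready for the NeMo codec decode
```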
#### KaniModel (util.py:172-303)
- Wraps HuggingFace causal LM (loaded with bfloat16, auto device mapping)
- Prepares prompts with conversation/modality control tokens
- Runs generation with sampling parameters (temp, top_p, repetition_penalty)
- Delegates audio reconstruction to `NemoAudioPlayer`
- Returns tuple: (audio_array, text, timing_report)
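A hedged sketch of what the generation step could look like with the sampling parameters listed above; the function name, default values, and mapping of `max_len` to `max_new_tokens` are assumptions rather than the actual `KaniModel.model_request` signature.

```python
import torch

@torch.inference_mode()
def generate_speech_ids(model, input_ids, temperature=0.7, top_p=0.9,
                        repetition_penalty=1.1, max_len=1200):
    """Sample a continuation that should contain
    START_OF_SPEECH ... END_OF_SPEECH for the audio decoder."""
    output = model.generate(
        input_ids=input_ids,
        do_sample=True,
        temperature=temperature,
        top_p=top_p,
        repetition_penalty=repetition_penalty,
        max_new_tokens=max_len,
    )
    return output[0]  # single-batch sequence of token ids
```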
#### InitModels (util.py:305-343)
- Factory that loads all models from `model_config.yaml` at startup
- Returns dict mapping model names to `KaniModel` instances
- All models share the same `NemoAudioPlayer` instance
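A sketch of this factory pattern; the constructor signatures of `NemoAudioPlayer` and `KaniModel` and the exact config keys are assumptions.

```python
import yaml
from util import KaniModel, NemoAudioPlayer  # classes described above

def init_models(config_path="model_config.yaml"):
    """Load the YAML config once, build a single shared NemoAudioPlayer,
    and wrap each configured checkpoint in a KaniModel."""
    with open(config_path) as f:
        cfg = yaml.safe_load(f)
    player = NemoAudioPlayer(cfg["nemo_player"])   # shared across all models
    return {
        name: KaniModel(model_cfg, player)         # one wrapper per TTS checkpoint
        for name, model_cfg in cfg["models"].items()
    }
```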
#### Examples (util.py:345-387)
- Converts `examples.yaml` structure into Gradio Examples format
- Output order: `[text, model, speaker_id, temperature, top_p, repetition_penalty, max_len]`
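A sketch of that conversion, assuming each YAML entry is a flat mapping whose field names match the output columns (the real examples.yaml layout may differ).

```python
import yaml

GRADIO_COLUMN_ORDER = ["text", "model", "speaker_id", "temperature",
                       "top_p", "repetition_penalty", "max_len"]

def load_examples(path="examples.yaml"):
    """Flatten examples.yaml into rows in the column order the Gradio
    Examples component is wired to expect."""
    with open(path) as f:
        entries = yaml.safe_load(f)
    return [[entry.get(key) for key in GRADIO_COLUMN_ORDER] for entry in entries]
```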
## Configuration Files

### model_config.yaml
- `nemo_player`: NeMo codec config (model name, token layout constants)
- `models`: Dict of available TTS models with `device_map` and optional `speaker_id` mappings
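For orientation, a hypothetical shape of the parsed config; only the `nemo_player`, `models`, `device_map`, and `speaker_id` keys come from this document, and every other key and value below is illustrative.

```python
# Hypothetical result of yaml.safe_load() on model_config.yaml; values are placeholders.
example_config = {
    "nemo_player": {
        "model_name": "<nemo-codec-checkpoint>",   # NeMo codec model name
        "tokeniser_length": 64400,                 # token layout constant
    },
    "models": {
        "<base-model>": {
            "device_map": "auto",
        },
        "<multi-speaker-model>": {
            "device_map": "auto",
            "speaker_id": {"andrew": 0},           # voice-name mapping (illustrative)
        },
    },
}
```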
### examples.yaml
- List of example prompts with associated parameters for Gradio UI
## Dependency Setup
create_env.py runs before imports in app.py to:
- Install transformers from git main branch (required for compatibility)
- Set `OMP_NUM_THREADS=4`
- Uses a `/tmp/deps_installed` marker to avoid reinstalling on every run
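A sketch of the marker-file pattern described above; the exact pip command and any additional packages create_env.py installs are assumptions.

```python
import os
import pathlib
import subprocess
import sys

MARKER = pathlib.Path("/tmp/deps_installed")

def ensure_deps():
    """Install heavy dependencies once, then rely on the marker file so
    subsequent app restarts skip the install step."""
    os.environ["OMP_NUM_THREADS"] = "4"
    if MARKER.exists():
        return
    subprocess.check_call([
        sys.executable, "-m", "pip", "install",
        "git+https://github.com/huggingface/transformers.git",  # transformers from git main
    ])
    MARKER.touch()
```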
## Important Token Constants
All special tokens are defined relative to tokeniser_length (64400):
- `start_of_speech = tokeniser_length + 1`
- `end_of_speech = tokeniser_length + 2`
- `start_of_human = tokeniser_length + 3`
- `end_of_human = tokeniser_length + 4`
- `start_of_ai = tokeniser_length + 5`
- `end_of_ai = tokeniser_length + 6`
- `pad_token = tokeniser_length + 7`
- `audio_tokens_start = tokeniser_length + 10`
- `codebook_size = 4032`
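To make the offset formula concrete, a quick check of where each codebook's ids land in the vocabulary, assuming the constants above:

```python
TOKENISER_LENGTH = 64400
AUDIO_TOKENS_START = TOKENISER_LENGTH + 10  # 64410
CODEBOOK_SIZE = 4032

for codebook_index in range(4):
    lo = AUDIO_TOKENS_START + CODEBOOK_SIZE * codebook_index
    hi = lo + CODEBOOK_SIZE - 1
    print(f"codebook {codebook_index}: token ids {lo}..{hi}")
# codebook 0: token ids 64410..68441
# codebook 1: token ids 68442..72473
# codebook 2: token ids 72474..76505
# codebook 3: token ids 76506..80537
```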
## Multi-Speaker Support
Models with speaker_id mappings in model_config.yaml support voice selection:
- Speaker IDs are prefixed to the text prompt (e.g., "andrew: Hello")
- The Gradio UI shows/hides speaker dropdown based on selected model
- Base models (v.0.1, v.0.2) generate random voices without speaker control
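A sketch of the show/hide behaviour for the speaker dropdown; `gr.update` is standard Gradio API, but the callback name and how speaker mappings are looked up here are assumptions about the app's wiring.

```python
import gradio as gr

def on_model_change(model_name, speaker_maps):
    """Return a dropdown update: visible with choices for multi-speaker
    models, hidden for base models that have no speaker_id mapping."""
    speakers = speaker_maps.get(model_name)  # e.g. {"andrew": ...} or None
    if speakers:
        return gr.update(choices=list(speakers), value=next(iter(speakers)), visible=True)
    return gr.update(visible=False)
```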
## HuggingFace Spaces Deployment
The README.md header contains HF Spaces metadata:
- `sdk: gradio` with version 5.46.0
- `app_file: app.py` as entrypoint
- References 3 model checkpoints and the NeMo codec