---
library_name: transformers
license: apache-2.0
pipeline_tag: text-to-speech
---

# CSM-1B-HF

## Sesame CSM 1B model weights for my [Hugging Face implementation](https://github.com/thomasgauthier/csm-hf/).

---

## Overview

CSM-HF is a Hugging Face implementation of [Sesame's Conversational Speech Model (CSM)](https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice) and a complete rewrite of the [PyTorch code provided by Sesame](https://github.com/SesameAILabs/csm). The codebase is designed to be fully compatible with Hugging Face `transformers`, from inference to training.

## Changes from Sesame's implementation

- created a `CSMModel` class
- replaced the torchtune backbone and decoder models with Hugging Face `transformers` `LlamaModel`
- added a processor class to prepare inputs for the model
- added labels support and [decoder training amortization](https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice#:~:text=The%20audio%20decoder%20is%20trained%20on%20only%20a%20random%201/16%20subset%20of%20the%20audio%20frames%2C%20while%20the%20zeroth%20codebook%20is%20trained%20on%20every%20frame.)
- added `generate_frame` and `generate` methods to the model class for generating audio
- full support for the Hugging Face `Trainer`

## Generation

You can use the model to generate audio from text input. Here's an example for voice cloning:

```python
import torch
import torchaudio
from huggingface_hub import hf_hub_download
from moshi.models import loaders
from tokenizers.processors import TemplateProcessing
from transformers import AutoTokenizer

from modeling_csm import CSMModel
from processor import CSMProcessor

device = 'cuda'


def load_llama3_tokenizer():
    """
    https://github.com/huggingface/transformers/issues/22794#issuecomment-2092623992
    """
    tokenizer_name = "meta-llama/Llama-3.2-1B"
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    bos = tokenizer.bos_token
    eos = tokenizer.eos_token
    tokenizer._tokenizer.post_processor = TemplateProcessing(
        single=f"{bos}:0 $A:0 {eos}:0",
        pair=f"{bos}:0 $A:0 {eos}:0 {bos}:1 $B:1 {eos}:1",
        special_tokens=[(f"{bos}", tokenizer.bos_token_id), (f"{eos}", tokenizer.eos_token_id)],
    )
    return tokenizer


text_tokenizer = load_llama3_tokenizer()

mimi_weight = hf_hub_download(loaders.DEFAULT_REPO, loaders.MIMI_NAME)
audio_tokenizer = loaders.get_mimi(mimi_weight, device=device)
audio_tokenizer.set_num_codebooks(32)

processor = CSMProcessor(text_tokenizer, audio_tokenizer)


def load_audio(path, target_sr):
    audio, sr = torchaudio.load(path)
    audio = audio.squeeze(0)
    if sr != target_sr:
        audio = torchaudio.functional.resample(audio, orig_freq=sr, new_freq=target_sr)
    return audio


model = CSMModel.from_pretrained("thomasgauthier/csm-1b-hf", torch_dtype=torch.bfloat16)
model.to(device)

inputs = processor(
    messages=[
        {
            "role": "speaker_0",
            "content": [
                {"type": "text", "text": ""},
                # This placeholder is required for audio tokenization (it maps to the
                # first element in the `audios` list passed to the processor)
                {"type": "audio"}
            ]
        },
        {
            "role": "speaker_0",
            "content": [
                # no audio content here: the model will generate it
                {"type": "text", "text": "Hello, this is voice cloning speaking"},
            ]
        }
    ],
    audios=[load_audio('AUDIO_CLIP_FOR_VOICE_CLONING.wav', audio_tokenizer.sample_rate)],
    return_tensors="pt"
)

with torch.inference_mode():
    # Generate up to 50 new frames
    gen_frames = model.generate(
        input_ids=inputs['input_ids'].to(device),
        attention_mask=inputs['attention_mask'].to(device),
        max_new_frames=50,
        topk=50,
        temperature=1.0,
        use_cache=True,
        stop_on_all_zeros=True,
    )

decoded_audio = audio_tokenizer.decode(gen_frames.permute(0, 2, 1)).squeeze(0).squeeze(0)
audio_array = (decoded_audio * 32768).to(torch.int16).cpu().numpy()

# Audio can be played with the following code:
# from IPython.display import Audio
# Audio(audio_array, rate=audio_tokenizer.sample_rate)
```
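When running in a script rather than a notebook, you may prefer to write the generated clip to disk. Here is a minimal sketch using `torchaudio.save`, assuming the variables from the example above (`decoded_audio`, `audio_tokenizer`) are still in scope; the output filename is arbitrary:

```python
import torch
import torchaudio

# Sketch: save the decoded waveform from the generation example as a WAV file.
# `decoded_audio` is the 1-D waveform tensor and `audio_tokenizer.sample_rate`
# is the Mimi sample rate; "generated.wav" is just an illustrative filename.
waveform = decoded_audio.unsqueeze(0).to(torch.float32).cpu()  # shape: (channels, samples)
torchaudio.save("generated.wav", waveform, sample_rate=audio_tokenizer.sample_rate)
```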
## Architecture

Model architecture is discussed in [ARCHITECTURE.md](https://github.com/thomasgauthier/csm-hf/blob/main/ARCHITECTURE.md) (written by O1).

## Training

### Data Format

CSM-HF expects training data in JSONL format, where each line is a JSON object containing a conversation. Each conversation consists of:

- `messages`: An array of message objects, each with:
  - `role`: Speaker identifier (e.g., "speaker_0", "speaker_1")
  - `content`: Array of content objects, which can be:
    - Text: `{"type": "text", "text": "The message text"}`
    - Audio: `{"type": "audio", "url": "path/to/audio/file.wav"}`
- `training_mask`: Boolean array indicating which messages should be used for training (`true`) or only as context (`false`)

Example data format:

```json
{
  "messages": [
    {
      "role": "speaker_0",
      "content": [
        {"type": "text", "text": "We have a chance for a new life here."},
        {"type": "audio", "url": "clips/example_audio.wav"}
      ]
    },
    {
      "role": "speaker_1",
      "content": [
        {"type": "text", "text": "Uncle?"},
        {"type": "audio", "url": "clips/response_audio.wav"}
      ]
    }
  ],
  "training_mask": [false, true]
}
```

### Training Process

The model uses a two-stage autoregressive architecture:

1. **Backbone (inter-frame processing)**:
   - Processes the entire sequence of frames
   - Each frame is represented by a combined embedding of all codebooks
   - Handles long-range dependencies between utterances

2. **Decoder (intra-frame processing)**:
   - Processes a single frame at a time
   - Generates 32 codebooks sequentially (1 semantic + 31 acoustic)
   - Each codebook is treated as a token in the sequence

Training leverages compute amortization:

- The zeroth (semantic) codebook is trained on all frames
- The remaining codebooks (1-31) are trained on only a subset of the frames, controlled by `amortization_ratio`
- This significantly reduces memory usage while maintaining quality

To train the model:

```bash
python train.py \
  --train_file path/to/training_data.jsonl \
  --output_dir ./output \
  --num_train_epochs 3 \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 8 \
  --learning_rate 5e-6
```

## TODO

- [x] Two-stage autoregressive architecture implementation
- [x] Multi-codebook audio tokenization
- [x] Compute amortization for efficient training
- [x] Dataset preparation with interleaved text/audio
- [x] Custom training loop with separate backbone/decoder losses
- [x] Proper handling of epoch repetition for decoder amortization
- [x] Memory optimization techniques (mixed precision, gradient accumulation)
- [ ] LoRA support for efficient fine-tuning
- [ ] Faster inference with `torch.compile`
- [ ] Voice cloning with prompt tuning / prefix optimization
- [ ] Support for DPO
- [ ] Support for RL (GRPO, RLOO, etc.)

## Acknowledgements

Special thanks to:

- **Sesame Labs** for the original architecture design and implementation
- **Hugging Face** for the Transformers library and training infrastructure
- **Claude** and **ChatGPT** for assistance with documentation and code development

This project builds upon research and tools from the open-source community. I am grateful for the collaborative spirit that makes projects like this possible.