How to decode CSM tokens into audio tensors for streaming

HHolzhauer · June 23, 2025, 7:31am

Using the new ‘sesame/csm-1b’ model and the CsmForConditionalGeneration class, I am attempting to stream the audio generation to minimize latency. I have successfully set up the ‘Optional[“BaseStreamer”]’ interface, which receives tokens as they are generated, but I am at a loss as to how to decode the tokens into audio tensors so I can stream them to an output.

I tried to work out how to do this from the source code, but I was unable to find a solution.

John6666 · June 23, 2025, 8:16am

I found this.

Or with this function?

github.com/huggingface/transformers, src/transformers/models/csm/processing_csm.py (main):

                      padding_right = extra_padding
                  else:
                      padding_left = padding_left
                      padding_right = padding_right + extra_padding
          
                  cur_length = cur_length + padding_left + padding_right
                  cur_length = (cur_length - dilation * (kernel_size - 1) - 1) // stride + 1
          
              return cur_length
          
          def save_audio(
              self,
              audio: AudioInput,
              saving_path: Union[str, Path, list[Union[str, Path]]],
              **kwargs: Unpack[CsmProcessorKwargs],
          ):
              # TODO: @eustlb, this should be in AudioProcessor
              if not is_soundfile_available():
                  raise ImportError("Please install `soundfile` to save audio files.")
          
              # ensure correct audio input

AileynDev · April 5, 2026, 5:19pm

I built a streaming pipeline for CSM-1B that handles the token-to-audio decode. The key issue is that HF’s StaticCache uses index_copy_, which breaks CUDA graphs. Replacing it with slice assignment plus a persistent backbone cache makes reduce-overhead compilation (torch.compile with mode="reduce-overhead") work. Full code with patches and a demo server: https://github.com/D3velop-llc/csm-rtx5090
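
For anyone landing here with the original question, here is a minimal sketch of the buffering side of a streamer. It is not taken from the linked repo or from transformers. ChunkedFrameStreamer, decode_fn, on_audio, chunk_frames and skip_first are all hypothetical names. The class only mirrors the put()/end() methods that transformers streamers expose. The sketch assumes each put() call after the first carries one frame of codebook tokens, and that decode_fn wraps whatever codec decoder you use to turn frames into audio.

```python
from typing import Any, Callable, List


class ChunkedFrameStreamer:
    """Hypothetical streamer: buffers generated frames and decodes them in chunks.

    Mirrors the put()/end() interface of transformers streamers without
    importing transformers, so the sketch stays self-contained.
    """

    def __init__(
        self,
        decode_fn: Callable[[List[Any]], Any],
        on_audio: Callable[[Any], None],
        chunk_frames: int = 4,
        skip_first: bool = True,
    ) -> None:
        self.decode_fn = decode_fn  # assumption: wraps the audio codec's decoder
        self.on_audio = on_audio    # receives each decoded chunk, e.g. a socket write
        self.chunk_frames = chunk_frames
        # Assumption: the first put() carries the prompt, as in text generation.
        # Pass skip_first=False if that is not the case for your setup.
        self.skip_next = skip_first
        self.frames: List[Any] = []

    def put(self, value: Any) -> None:
        """Called once per generation step with the newly generated tokens."""
        if self.skip_next:
            self.skip_next = False
            return
        self.frames.append(value)
        if len(self.frames) >= self.chunk_frames:
            self._flush()

    def end(self) -> None:
        """Called when generation finishes; decode whatever is left over."""
        if self.frames:
            self._flush()

    def _flush(self) -> None:
        chunk, self.frames = self.frames, []
        self.on_audio(self.decode_fn(chunk))


# Toy run with a stand-in decoder (it just doubles each token; not a real codec).
received: List[Any] = []
streamer = ChunkedFrameStreamer(
    decode_fn=lambda frames: [t * 2 for f in frames for t in f],
    on_audio=received.append,
    chunk_frames=2,
)
streamer.put([0, 0])  # treated as the prompt and skipped
for frame in ([1, 2], [3, 4], [5, 6]):
    streamer.put(frame)
streamer.end()
print(received)  # [[2, 4, 6, 8], [10, 12]]
```

A smaller chunk_frames lowers latency but means more decoder calls. Decoding chunks independently may also produce audible artifacts at chunk boundaries unless the codec decoder keeps state between calls, so treat this as a starting point rather than a finished pipeline.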
