How to decode CSM tokens into audio tensors for streaming

Using the new 'sesame/csm-1b' model and the CsmForConditionalGeneration class, I am trying to stream the audio generation to minimize latency. I have successfully set up the streamer interface (the Optional["BaseStreamer"] argument), which receives tokens as they are generated, but I am at a loss as to how to decode those tokens into audio tensors so I can stream them onward.

I tried to work out how to do this from the source code, but I was unable to find a solution.


I found this.

Or with this function?

I built a streaming pipeline for CSM-1B that handles the token-to-audio decode. The key issue is that HF's StaticCache updates the cache with index_copy_, which breaks CUDA graph capture. Replacing it with slice assignment and keeping a persistent backbone cache makes torch.compile's reduce-overhead mode usable. Full code with patches and a demo server: https://github.com/D3velop-llc/csm-rtx5090
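
For anyone who just wants the shape of the token-to-audio step, here is a minimal sketch of the buffering side. It is not taken from the repository above. FrameChunker, decode_fn, chunk_frames and fake_decode are hypothetical names I made up for illustration, and decode_fn stands in for whatever codec call turns a list of codebook frames into audio samples (for CSM that would presumably be a wrapper around the model's Mimi codec, which is not wired in here). The idea is to collect frames as the streamer delivers them and decode in small fixed-size chunks rather than waiting for the full sequence.

```python
from typing import Callable, List, Sequence

Frame = Sequence[int]  # one frame = one token per codebook (assumed layout)


class FrameChunker:
    """Collects codec frames and decodes them in fixed-size chunks."""

    def __init__(self, decode_fn: Callable[[List[List[int]]], Sequence[float]],
                 chunk_frames: int = 4) -> None:
        if chunk_frames < 1:
            raise ValueError("chunk_frames must be >= 1")
        self.decode_fn = decode_fn        # hypothetical: frames -> audio samples
        self.chunk_frames = chunk_frames  # frames to buffer before decoding
        self._pending: List[List[int]] = []

    def put(self, frame: Frame) -> List[float]:
        """Add one frame; return decoded audio once a chunk is full, else []."""
        self._pending.append(list(frame))
        if len(self._pending) < self.chunk_frames:
            return []
        return self._flush()

    def end(self) -> List[float]:
        """Decode whatever is left when generation finishes."""
        return self._flush() if self._pending else []

    def _flush(self) -> List[float]:
        chunk, self._pending = self._pending, []
        return list(self.decode_fn(chunk))


def fake_decode(frames: List[List[int]]) -> List[float]:
    # Placeholder codec for demonstration only: emits 2 samples per frame.
    return [float(sum(f)) for f in frames for _ in range(2)]


chunker = FrameChunker(fake_decode, chunk_frames=2)
audio: List[float] = []
for frame in ([1, 2], [3, 4], [5, 6]):
    audio.extend(chunker.put(frame))  # returns audio every second frame
audio.extend(chunker.end())           # flush the trailing partial chunk
```

In a real setup, put() would be called from the streamer callback and each returned chunk pushed to the audio sink. Smaller chunk_frames lowers latency but means more decode calls, and decoding chunks independently may introduce audible seams at chunk boundaries if the codec carries state across frames, so some overlap or a stateful decode may be needed.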
