Add README.md

---
library_name: onnxruntime
tags:
- snac
- onnx
- 24khz
- decoder
- browser
license: other
language:
- en
---

# SNAC 24 kHz — Decoder as ONNX (browser-ready)

This repo provides **ONNX decoders** for the SNAC 24 kHz codec so you can decode SNAC tokens **on-device**, including **in the browser** with `onnxruntime-web`.

**Why?** If your TTS front-end is a decoder-only Transformer (e.g. Orpheus-style) that can stream out SNAC tokens fast and cheaply, you can keep synthesis private and responsive by decoding the audio **in the user’s browser/CPU** (or WebGPU when available).

> In a Colab CPU test, we saw ~**2.1× real-time** decoding for a longer file using the ONNX model (inference time only, excluding model load). Your mileage will vary with hardware and browser.

---

## Files

- **`snac24_int2wav_static.onnx`** — *int → wav* decoder

  Inputs (int64):
  - `codes0`: `[1, 12]`
  - `codes1`: `[1, 24]`
  - `codes2`: `[1, 48]`

  Output:
  - `audio`: `float32 [1, 1, 24576]` (24 kHz)

  Shapes correspond to a **48-frame window**. Each frame is **512 samples**, so one window = **24576 samples** ≈ **1.024 s** at 24 kHz.
  Token alignment: `L0*4 = L1*2 = L2*1 = shared_frames` (see the loading sketch after this list).

- **`snac24_latent2wav_static.onnx`** — *latent → wav* decoder

  Input: `z` `float32 [1, 768, 48]` → Output: `audio [1, 1, 24576]`
  Use this if you reconstruct the latent yourself (RVQ embeddings + 1×1 conv projections); a reconstruction sketch follows this list.

- **`snac24_codes.json`** — sample codes (for testing)

- **`snac24_quantizers.json`** — RVQ metadata/weights (stride + embeddings + 1×1 projections) to reconstruct `z` if needed.
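
To make the shape spec and the alignment rule concrete, here is a small sketch that loads `snac24_codes.json` and builds the three `int64` feeds for `snac24_int2wav_static.onnx`. The keys `codes0`/`codes1`/`codes2` inside the JSON are an assumption (inspect the file for its actual layout); the length checks follow directly from the `[1, 12] / [1, 24] / [1, 48]` spec above.

```js
// Sketch: turn snac24_codes.json into int64 feeds for snac24_int2wav_static.onnx.
// ASSUMPTION: the JSON stores plain integer arrays under "codes0"/"codes1"/"codes2";
// adapt the keys to the real file. Run inside an async context (e.g. the quickstart IIFE).
const codes = await (await fetch('snac24_codes.json')).json();

function toInt64Tensor(arr) {
  // onnxruntime-web expects BigInt64Array for int64 inputs; shape is [1, T]
  return new ort.Tensor('int64', BigInt64Array.from(arr, x => BigInt(x)), [1, arr.length]);
}

// Token alignment: L0*4 = L1*2 = L2*1 = shared_frames (48 frames per window)
const frames = codes.codes2.length;
console.assert(codes.codes0.length * 4 === frames, 'codes0 should hold frames/4 tokens');
console.assert(codes.codes1.length * 2 === frames, 'codes1 should hold frames/2 tokens');

const feed = {
  codes0: toInt64Tensor(codes.codes0), // [1, 12]
  codes1: toInt64Tensor(codes.codes1), // [1, 24]
  codes2: toInt64Tensor(codes.codes2), // [1, 48]
};
// `feed` can be passed straight to session.run(feed) as in the quickstart below.
```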
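
For the *latent → wav* path, `z` has to be rebuilt first: look up each level's codebook embedding for every token, project it into the 768-dim latent space with that level's 1×1 projection, repeat coarse levels up to the shared frame rate (their stride), and sum the three levels. The sketch below is one way to write that in plain JS; the field names `strides` / `embeddings` / `proj` are guesses about `snac24_quantizers.json`, not its documented schema, and any projection bias is omitted, so check the file and adjust.

```js
// Sketch: rebuild z [1, 768, 48] from per-level codes using snac24_quantizers.json.
// ASSUMED JSON layout (verify against the real file):
//   q.strides[l]          -> temporal stride of level l (e.g. 4, 2, 1)
//   q.embeddings[l][code] -> codebook vector of length cbDim
//   q.proj[l]             -> 1×1 projection as a [768][cbDim] matrix
function buildLatent(q, codesPerLevel, frames = 48, latentDim = 768) {
  const z = new Float32Array(latentDim * frames); // row-major [dim][frame]

  for (let l = 0; l < codesPerLevel.length; l++) {
    const stride = q.strides[l];
    const emb = q.embeddings[l];
    const proj = q.proj[l];

    for (let t = 0; t < frames; t++) {
      // Coarse levels are upsampled by repeating each token `stride` times
      const e = emb[codesPerLevel[l][Math.floor(t / stride)]];
      for (let d = 0; d < latentDim; d++) {
        let acc = 0;
        const row = proj[d];
        for (let k = 0; k < e.length; k++) acc += row[k] * e[k];
        z[d * frames + t] += acc; // levels are summed into z
      }
    }
  }
  return new ort.Tensor('float32', z, [1, latentDim, frames]);
}

// Usage sketch: await session.run({ z: buildLatent(q, [codes.codes0, codes.codes1, codes.codes2]) })
// against snac24_latent2wav_static.onnx.
```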

---

## Browser (WASM/WebGPU) quickstart

Serve these files from a local server with cross-origin isolation for multithreaded WASM (e.g., COOP/COEP headers). If not isolated, WASM will typically run **single-threaded**.
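
For local testing, a minimal static server that sends the two isolation headers might look like the sketch below (Node's built-in `http`/`fs`; the port and MIME map are arbitrary choices, not part of this repo).

```js
// serve.js — static files with COOP/COEP so that crossOriginIsolated === true
const http = require('http');
const fs = require('fs');
const path = require('path');

const MIME = { '.html': 'text/html', '.js': 'text/javascript', '.json': 'application/json',
               '.onnx': 'application/octet-stream', '.wasm': 'application/wasm' };

http.createServer((req, res) => {
  const file = path.join(__dirname, req.url === '/' ? 'index.html' : req.url.split('?')[0]);
  fs.readFile(file, (err, data) => {
    if (err) { res.writeHead(404); return res.end('not found'); }
    res.writeHead(200, {
      'Content-Type': MIME[path.extname(file)] || 'application/octet-stream',
      // These two headers enable cross-origin isolation (multithreaded WASM)
      'Cross-Origin-Opener-Policy': 'same-origin',
      'Cross-Origin-Embedder-Policy': 'require-corp',
    });
    res.end(data);
  });
}).listen(8080, () => console.log('serving on http://localhost:8080'));
```

With COEP enabled, cross-origin assets (such as the CDN copy of `ort.min.js`) must be CORS/CORP-friendly; if the CDN blocks under isolation, self-hosting the onnxruntime-web files next to the models is the simple workaround.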

```html
<script src="https://cdn.jsdelivr.net/npm/onnxruntime-web/dist/ort.min.js"></script>
<script>
(async () => {
  // Prefer WebGPU if available; else WASM
  const providers = (typeof navigator.gpu !== 'undefined') ? ['webgpu', 'wasm'] : ['wasm'];
  // Enable SIMD; threads only if crossOriginIsolated
  ort.env.wasm.simd = true;
  ort.env.wasm.numThreads = crossOriginIsolated ? (navigator.hardwareConcurrency || 4) : 1;

  const session = await ort.InferenceSession.create('snac24_int2wav_static.onnx', {
    executionProviders: providers,
    graphOptimizationLevel: 'all',
  });

  // Example: one 48-frame window (12/24/48 tokens). Replace with real codes.
  const T0 = 12, T1 = 24, T2 = 48;
  const feed = {
    codes0: new ort.Tensor('int64', BigInt64Array.from(new Array(T0).fill(0), x => BigInt(x)), [1, T0]),
    codes1: new ort.Tensor('int64', BigInt64Array.from(new Array(T1).fill(0), x => BigInt(x)), [1, T1]),
    codes2: new ort.Tensor('int64', BigInt64Array.from(new Array(T2).fill(0), x => BigInt(x)), [1, T2]),
  };

  const t0 = performance.now();
  const out = await session.run(feed);
  const t1 = performance.now();
  const audio = out.audio.data; // Float32Array of 24576 samples (shape [1,1,24576], flattened)

  // Play it (24 kHz)
  const ctx = new (window.AudioContext || window.webkitAudioContext)({ sampleRate: 24000 });
  const buf = ctx.createBuffer(1, audio.length, 24000);
  buf.copyToChannel(audio, 0);
  const src = ctx.createBufferSource(); src.buffer = buf; src.connect(ctx.destination); src.start();

  console.log({ requestedEPs: providers, infer_ms: (t1 - t0).toFixed(2), samples: audio.length });
})();
</script>
```

## Streaming note

SNAC is streamable in principle. For practical low-latency TTS, emit ~200 ms of tokens, decode in ~100 ms, start playback, and continue decoding subsequent chunks; cross-fade a few ms to hide seams.
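
One possible shape for that loop, assuming a `decodeWindow(tokens)` helper that wraps `session.run()` for a single 48-frame window (the helper and its name are placeholders, not something shipped here):

```js
// Sketch: schedule decoded windows back-to-back with a short crossfade.
const SR = 24000;
const FADE = 0.005;                    // ~5 ms fade to hide chunk seams
const ctx = new AudioContext({ sampleRate: SR });
let nextStart = ctx.currentTime + 0.1; // small scheduling headroom

function playChunk(samples /* Float32Array from one decode */) {
  const buf = ctx.createBuffer(1, samples.length, SR);
  buf.copyToChannel(samples, 0);

  const src = ctx.createBufferSource();
  const gain = ctx.createGain();
  src.buffer = buf;
  src.connect(gain).connect(ctx.destination);

  // Fade in at the start and out at the end of the chunk
  gain.gain.setValueAtTime(0, nextStart);
  gain.gain.linearRampToValueAtTime(1, nextStart + FADE);
  gain.gain.setValueAtTime(1, nextStart + buf.duration - FADE);
  gain.gain.linearRampToValueAtTime(0, nextStart + buf.duration);

  src.start(nextStart);
  nextStart += buf.duration - FADE;    // overlap chunks by the fade length
}

// As tokens arrive: const { audio } = await decodeWindow(tokens); playChunk(audio.data);
```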

## Threads / GPU

Multithreaded WASM requires cross-origin isolation (COOP/COEP). Without it, browsers typically run single-threaded.

WebGPU can accelerate on desktop and mobile when kernels are supported; this model usually falls back to WASM if not.