|
|
--- |
|
|
library_name: transformers |
|
|
license: cc-by-4.0 |
|
|
datasets: |
|
|
- openslr/librispeech_asr |
|
|
--- |
|
|
|
|
|
# X-Codec (speech, HuBERT) |
|
|
|
|
|
This codec is part of the X-Codec family of codecs as shown below: |
|
|
|
|
|
| Model checkpoint | Semantic Model | Domain | Training Data | |
|
|
|--------------------------------------------|-----------------------------------------------------------------------|---------------|-------------------------------| |
|
|
| **[xcodec-hubert-librispeech](https://huggingface.co/hf-audio/xcodec-hubert-librispeech) (this model)** | [facebook/hubert-base-ls960](https://huggingface.co/facebook/hubert-base-ls960) | Speech | Librispeech | |
|
|
| [xcodec-wavlm-mls](https://huggingface.co/hf-audio/xcodec-wavlm-mls) | [microsoft/wavlm-base-plus](https://huggingface.co/microsoft/wavlm-base-plus)| Speech | MLS English | |
|
|
| [xcodec-wavlm-more-data](https://huggingface.co/hf-audio/xcodec-wavlm-more-data) | [microsoft/wavlm-base-plus](https://huggingface.co/microsoft/wavlm-base-plus)| Speech | MLS English + Internal data | |
|
|
| [xcodec-hubert-general](https://huggingface.co/hf-audio/xcodec-hubert-general) | [ZhenYe234/hubert_base_general_audio](https://huggingface.co/ZhenYe234/hubert_base_general_audio) | General audio | 200k hours internal data | |
|
|
| [xcodec-hubert-general-balanced](https://huggingface.co/hf-audio/xcodec-hubert-general-balanced) | [ZhenYe234/hubert_base_general_audio](https://huggingface.co/ZhenYe234/hubert_base_general_audio) | General audio | More balanced data | |
|
|
|
|
|
Original model is `xcodec_hubert_librispeech` from [this table](https://github.com/zhenye234/xcodec?tab=readme-ov-file#available-models). |
|
|
|
|
|
## Example usage |
|
|
|
|
|
The example below applies the codec over all possible bandwidths. |
|
|
|
|
|
```python |
|
|
|
|
|
from datasets import Audio, load_dataset |
|
|
from transformers import XcodecModel, AutoFeatureExtractor |
|
|
import torch |
|
|
import os |
|
|
from scipy.io.wavfile import write as write_wav |
|
|
|
|
|
|
|
|
model_id = "hf-audio/xcodec-hubert-librispeech" |
|
|
torch_device = "cuda" if torch.cuda.is_available() else "cpu" |
|
|
available_bandwidths = [0.5, 1, 1.5, 2, 4] |
|
|
|
|
|
# load model |
|
|
model = XcodecModel.from_pretrained(model_id, device_map=torch_device) |
|
|
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id) |
|
|
|
|
|
# load audio example |
|
|
librispeech_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation") |
|
|
librispeech_dummy = librispeech_dummy.cast_column( |
|
|
"audio", Audio(sampling_rate=feature_extractor.sampling_rate) |
|
|
) |
|
|
audio_array = librispeech_dummy[0]["audio"]["array"] |
|
|
inputs = feature_extractor( |
|
|
raw_audio=audio_array, sampling_rate=feature_extractor.sampling_rate, return_tensors="pt" |
|
|
).to(model.device) |
|
|
audio = inputs["input_values"] |
|
|
|
|
|
for bandwidth in available_bandwidths: |
|
|
print(f"Encoding with bandwidth: {bandwidth} kbps") |
|
|
# encode |
|
|
audio_codes = model.encode(audio, bandwidth=bandwidth, return_dict=False) |
|
|
print("Codebook shape", audio_codes.shape) |
|
|
# 0.5 kbps -> torch.Size([1, 1, 293]) |
|
|
# 1.0 kbps -> torch.Size([1, 2, 293]) |
|
|
# 1.5 kbps -> torch.Size([1, 3, 293]) |
|
|
# 2.0 kbps -> torch.Size([1, 4, 293]) |
|
|
# 4.0 kbps -> torch.Size([1, 8, 293]) |
|
|
|
|
|
# decode |
|
|
input_values_dec = model.decode(audio_codes).audio_values |
|
|
|
|
|
# save audio to file |
|
|
write_wav(f"{os.path.basename(model_id)}_{bandwidth}.wav", feature_extractor.sampling_rate, input_values_dec.squeeze().detach().cpu().numpy()) |
|
|
|
|
|
write_wav("original.wav", feature_extractor.sampling_rate, audio.squeeze().detach().cpu().numpy()) |
|
|
``` |
|
|
|
|
|
### 🔊 Audio Samples |
|
|
|
|
|
**Original** |
|
|
<audio controls> |
|
|
<source src="https://huggingface.co/datasets/bezzam/xcodec_samples/resolve/main/original.wav" type="audio/wav"> |
|
|
</audio> |
|
|
|
|
|
**0.5 kbps** |
|
|
<audio controls> |
|
|
<source src="https://huggingface.co/datasets/bezzam/xcodec_samples/resolve/main/xcodec-hubert-librispeech_0.5.wav" type="audio/wav"> |
|
|
</audio> |
|
|
|
|
|
**1 kbps** |
|
|
<audio controls> |
|
|
<source src="https://huggingface.co/datasets/bezzam/xcodec_samples/resolve/main/xcodec-hubert-librispeech_1.wav" type="audio/wav"> |
|
|
</audio> |
|
|
|
|
|
**1.5 kbps** |
|
|
<audio controls> |
|
|
<source src="https://huggingface.co/datasets/bezzam/xcodec_samples/resolve/main/xcodec-hubert-librispeech_1.5.wav" type="audio/wav"> |
|
|
</audio> |
|
|
|
|
|
**2 kbps** |
|
|
<audio controls> |
|
|
<source src="https://huggingface.co/datasets/bezzam/xcodec_samples/resolve/main/xcodec-hubert-librispeech_2.wav" type="audio/wav"> |
|
|
</audio> |
|
|
|
|
|
**4 kbps** |
|
|
<audio controls> |
|
|
<source src="https://huggingface.co/datasets/bezzam/xcodec_samples/resolve/main/xcodec-hubert-librispeech_4.wav" type="audio/wav"> |
|
|
</audio> |
|
|
|
|
|
## Batch example |
|
|
|
|
|
```python |
|
|
|
|
|
from datasets import Audio, load_dataset |
|
|
from transformers import XcodecModel, AutoFeatureExtractor |
|
|
import torch |
|
|
|
|
|
|
|
|
model_id = "hf-audio/xcodec-hubert-librispeech" |
|
|
torch_device = "cuda" if torch.cuda.is_available() else "cpu" |
|
|
bandwidth = 4 |
|
|
n_audio = 2 # number of audio samples to process in a batch |
|
|
|
|
|
# load model |
|
|
model = XcodecModel.from_pretrained(model_id, device_map=torch_device) |
|
|
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id) |
|
|
|
|
|
# load audio example |
|
|
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation") |
|
|
ds = ds.cast_column( |
|
|
"audio", Audio(sampling_rate=feature_extractor.sampling_rate) |
|
|
) |
|
|
audio = [audio_sample["array"] for audio_sample in ds[-n_audio:]["audio"]] |
|
|
print(f"Input audio shape: {[_sample.shape for _sample in audio]}") |
|
|
# Input audio shape: [(113840,), (71680,)] |
|
|
inputs = feature_extractor( |
|
|
raw_audio=audio, sampling_rate=feature_extractor.sampling_rate, return_tensors="pt" |
|
|
).to(model.device) |
|
|
audio = inputs["input_values"] |
|
|
print(f"Padded audio shape: {audio.shape}") |
|
|
# Padded audio shape: torch.Size([2, 1, 113920]) |
|
|
|
|
|
# encode |
|
|
audio_codes = model.encode(audio, bandwidth=bandwidth, return_dict=False) |
|
|
print("Codebook shape", audio_codes.shape) |
|
|
# Codebook shape torch.Size([2, 8, 356]) |
|
|
|
|
|
# decode |
|
|
decoded_audio = model.decode(audio_codes).audio_values |
|
|
print("Decoded audio shape", decoded_audio.shape) |
|
|
# Decoded audio shape torch.Size([2, 1, 113920]) |
|
|
``` |
|
|
|