	---
	library_name: transformers
	license: cc-by-4.0
	datasets:
	- openslr/librispeech_asr
	---

	# X-Codec (speech, HuBERT)

	This checkpoint belongs to the X-Codec family of codecs, listed below:

	| Model checkpoint | Semantic Model | Domain | Training Data |
	|------------------|----------------|--------|---------------|
	| [xcodec-hubert-librispeech](https://huggingface.co/hf-audio/xcodec-hubert-librispeech) (this model) | [facebook/hubert-base-ls960](https://huggingface.co/facebook/hubert-base-ls960) | Speech | Librispeech |
	| [xcodec-wavlm-mls](https://huggingface.co/hf-audio/xcodec-wavlm-mls) | [microsoft/wavlm-base-plus](https://huggingface.co/microsoft/wavlm-base-plus) | Speech | MLS English |
	| [xcodec-wavlm-more-data](https://huggingface.co/hf-audio/xcodec-wavlm-more-data) | [microsoft/wavlm-base-plus](https://huggingface.co/microsoft/wavlm-base-plus) | Speech | MLS English + Internal data |
	| [xcodec-hubert-general](https://huggingface.co/hf-audio/xcodec-hubert-general) | [ZhenYe234/hubert_base_general_audio](https://huggingface.co/ZhenYe234/hubert_base_general_audio) | General audio | 200k hours internal data |
	| [xcodec-hubert-general-balanced](https://huggingface.co/hf-audio/xcodec-hubert-general-balanced) | [ZhenYe234/hubert_base_general_audio](https://huggingface.co/ZhenYe234/hubert_base_general_audio) | General audio | More balanced data |

	The original model is `xcodec_hubert_librispeech` from [this table](https://github.com/zhenye234/xcodec?tab=readme-ov-file#available-models).

	## Example usage

	The example below encodes and decodes one audio sample at each available bandwidth, then saves every reconstruction (and the original) as a WAV file.

	```python

	from datasets import Audio, load_dataset
	from transformers import XcodecModel, AutoFeatureExtractor
	import torch
	import os
	from scipy.io.wavfile import write as write_wav


	model_id = "hf-audio/xcodec-hubert-librispeech"
	torch_device = "cuda" if torch.cuda.is_available() else "cpu"
	available_bandwidths = [0.5, 1, 1.5, 2, 4]

	# load model
	model = XcodecModel.from_pretrained(model_id, device_map=torch_device)
	feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)

	# load audio example
	librispeech_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
	librispeech_dummy = librispeech_dummy.cast_column(
	    "audio", Audio(sampling_rate=feature_extractor.sampling_rate)
	)
	audio_array = librispeech_dummy[0]["audio"]["array"]
	inputs = feature_extractor(
	    raw_audio=audio_array, sampling_rate=feature_extractor.sampling_rate, return_tensors="pt"
	).to(model.device)
	audio = inputs["input_values"]

	for bandwidth in available_bandwidths:
	    print(f"Encoding with bandwidth: {bandwidth} kbps")
	    # encode
	    audio_codes = model.encode(audio, bandwidth=bandwidth, return_dict=False)
	    print("Codebook shape", audio_codes.shape)
	    # 0.5 kbps -> torch.Size([1, 1, 293])
	    # 1.0 kbps -> torch.Size([1, 2, 293])
	    # 1.5 kbps -> torch.Size([1, 3, 293])
	    # 2.0 kbps -> torch.Size([1, 4, 293])
	    # 4.0 kbps -> torch.Size([1, 8, 293])

	    # decode
	    input_values_dec = model.decode(audio_codes).audio_values

	    # save audio to file
	    write_wav(f"{os.path.basename(model_id)}_{bandwidth}.wav", feature_extractor.sampling_rate, input_values_dec.squeeze().detach().cpu().numpy())

	write_wav("original.wav", feature_extractor.sampling_rate, audio.squeeze().detach().cpu().numpy())
	```
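
The shapes printed in the example above suggest that each additional 0.5 kbps of bandwidth adds one codebook to the encoded output. The sketch below makes that relationship explicit. Both `KBPS_PER_CODEBOOK` and `expected_num_codebooks` are hypothetical names introduced for illustration, and the 0.5 kbps step is inferred from the printed shapes rather than read from the model configuration.

```python
# Assumption: each codebook contributes a fixed 0.5 kbps. This is inferred
# from the shapes printed in the example above, not read from the model config.
KBPS_PER_CODEBOOK = 0.5


def expected_num_codebooks(bandwidth_kbps: float) -> int:
    """Hypothetical helper: number of codebooks implied by a target bandwidth."""
    return round(bandwidth_kbps / KBPS_PER_CODEBOOK)


for bandwidth in [0.5, 1, 1.5, 2, 4]:
    print(f"{bandwidth} kbps -> {expected_num_codebooks(bandwidth)} codebook(s)")
```

Comparing this value with `audio_codes.shape[1]` is a quick sanity check that the requested bandwidth was applied.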

	### 🔊 Audio Samples

	Original
	<audio controls>
	<source src="https://huggingface.co/datasets/bezzam/xcodec_samples/resolve/main/original.wav" type="audio/wav">
	</audio>

	0.5 kbps
	<audio controls>
	<source src="https://huggingface.co/datasets/bezzam/xcodec_samples/resolve/main/xcodec-hubert-librispeech_0.5.wav" type="audio/wav">
	</audio>

	1 kbps
	<audio controls>
	<source src="https://huggingface.co/datasets/bezzam/xcodec_samples/resolve/main/xcodec-hubert-librispeech_1.wav" type="audio/wav">
	</audio>

	1.5 kbps
	<audio controls>
	<source src="https://huggingface.co/datasets/bezzam/xcodec_samples/resolve/main/xcodec-hubert-librispeech_1.5.wav" type="audio/wav">
	</audio>

	2 kbps
	<audio controls>
	<source src="https://huggingface.co/datasets/bezzam/xcodec_samples/resolve/main/xcodec-hubert-librispeech_2.wav" type="audio/wav">
	</audio>

	4 kbps
	<audio controls>
	<source src="https://huggingface.co/datasets/bezzam/xcodec_samples/resolve/main/xcodec-hubert-librispeech_4.wav" type="audio/wav">
	</audio>

	## Batch example

	```python

	from datasets import Audio, load_dataset
	from transformers import XcodecModel, AutoFeatureExtractor
	import torch


	model_id = "hf-audio/xcodec-hubert-librispeech"
	torch_device = "cuda" if torch.cuda.is_available() else "cpu"
	bandwidth = 4
	n_audio = 2  # number of audio samples to process in a batch

	# load model
	model = XcodecModel.from_pretrained(model_id, device_map=torch_device)
	feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)

	# load audio example
	ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
	ds = ds.cast_column(
	    "audio", Audio(sampling_rate=feature_extractor.sampling_rate)
	)
	audio = [audio_sample["array"] for audio_sample in ds[-n_audio:]["audio"]]
	print(f"Input audio shape: {[_sample.shape for _sample in audio]}")
	# Input audio shape: [(113840,), (71680,)]
	inputs = feature_extractor(
	    raw_audio=audio, sampling_rate=feature_extractor.sampling_rate, return_tensors="pt"
	).to(model.device)
	audio = inputs["input_values"]
	print(f"Padded audio shape: {audio.shape}")
	# Padded audio shape: torch.Size([2, 1, 113920])

	# encode
	audio_codes = model.encode(audio, bandwidth=bandwidth, return_dict=False)
	print("Codebook shape", audio_codes.shape)
	# Codebook shape torch.Size([2, 8, 356])

	# decode
	decoded_audio = model.decode(audio_codes).audio_values
	print("Decoded audio shape", decoded_audio.shape)
	# Decoded audio shape torch.Size([2, 1, 113920])
	```
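
The shapes printed in the batch example can be cross-checked with a little arithmetic. The sketch below uses only the numbers shown in the comments above. The 16 kHz sampling rate is an assumption (verify it against `feature_extractor.sampling_rate`), and `padded_length` is a hypothetical helper that mimics padding to a whole number of frames, not the feature extractor's actual implementation.

```python
import math

# Values copied from the shapes printed in the batch example above.
padded_samples = 113920
num_frames = 356
num_codebooks = 8
bandwidth_bps = 4000  # 4 kbps

# Assumption: 16 kHz audio; check feature_extractor.sampling_rate.
ASSUMED_SAMPLING_RATE = 16000

# Audio samples represented by one code frame.
samples_per_frame = padded_samples // num_frames
# Code frames produced per second of audio.
frames_per_second = ASSUMED_SAMPLING_RATE / samples_per_frame
# Bits carried by a single code, implied by the bandwidth.
bits_per_code = bandwidth_bps / (num_codebooks * frames_per_second)


def padded_length(num_samples: int, hop: int = samples_per_frame) -> int:
    """Hypothetical helper: round a length up to a whole number of frames."""
    return math.ceil(num_samples / hop) * hop


print(samples_per_frame)      # 320
print(frames_per_second)      # 50.0
print(bits_per_code)          # 10.0
print(padded_length(113840))  # 113920
```

Under these assumptions, the longest input (113840 samples) rounds up to the padded length of 113920 reported above. Note that the decoded audio keeps this padded length, so each batch item may need trimming back to its original length.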