---
license: mit
tags:
- unsloth
- trl
- sft
language:
- en
base_model:
- SparkAudio/Spark-TTS-0.5B
pipeline_tag: text-to-speech
---

# Spark-TTS 0.5B Fine-Tuned Model (16-bit Merged)

This repository hosts a fine-tuned Spark-TTS 0.5B model for text-to-speech synthesis, trained with the Unsloth and TRL libraries. The weights are shared in a merged 16-bit format for compact storage and faster inference while preserving output quality.

## Model Details

- **Architecture:** Transformer-based text-to-speech (Spark-TTS)
- **Model Size:** 0.5 billion parameters
- **Precision:** 16-bit merged weights, optimized for inference (see the sanity check after this list)
- **Fine-tuning:** LoRA adapters trained in bfloat16 and merged into the base model
- **Training Framework:** Unsloth & TRL (supervised fine-tuning)
- **Tokenizer:** Compatible tokenizer included
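
As a quick sanity check of the size and precision claims above, you can count parameters and inspect weight dtypes after loading. A minimal sketch, using the same loading call shown in the Usage section below:

```python
import torch
from unsloth import FastModel

# Load the merged checkpoint in inference mode (see Usage below)
model, tokenizer = FastModel.from_pretrained(
    "sureshbeekhani/spark-tts-0.5b-finetune-16bit",
    max_seq_length=2048,
    dtype=torch.bfloat16,
    full_finetuning=False,
)

# Count parameters and list the dtypes of the merged weights
n_params = sum(p.numel() for p in model.parameters())
dtypes = {str(p.dtype) for p in model.parameters()}
print(f"Parameters: {n_params / 1e9:.2f}B")  # expect roughly 0.5B
print(f"Weight dtypes: {dtypes}")            # expect a 16-bit dtype
```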

## Intended Use

This model is intended for research and development in text-to-speech synthesis, especially where GPU memory efficiency and long-context handling are priorities.
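
For a rough sense of the memory footprint on your hardware, you can check allocated CUDA memory right after loading; `max_seq_length` is also the knob for longer contexts. A minimal sketch, assuming a CUDA-capable GPU (the 4096 context length is an illustrative value, not a recommendation):

```python
import torch
from unsloth import FastModel

# Load with a longer context window; adjust max_seq_length to your use case
model, tokenizer = FastModel.from_pretrained(
    "sureshbeekhani/spark-tts-0.5b-finetune-16bit",
    max_seq_length=4096,  # illustrative long-context setting
    dtype=torch.bfloat16,
    full_finetuning=False,
)

# Report how much GPU memory the loaded weights occupy
if torch.cuda.is_available():
    allocated_gib = torch.cuda.memory_allocated() / 1024**3
    print(f"GPU memory allocated after load: {allocated_gib:.2f} GiB")
```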

## Usage

```python
from unsloth import FastModel
import torch

# Load the fine-tuned Spark-TTS model and tokenizer from the Hugging Face Hub
model, tokenizer = FastModel.from_pretrained(
    "sureshbeekhani/spark-tts-0.5b-finetune-16bit",
    max_seq_length=2048,    # Adjust based on your needs
    dtype=torch.bfloat16,   # bfloat16 for LoRA compatibility and efficiency
    full_finetuning=False,  # False when using the model for inference only
)

# Example text input for speech synthesis
text = "Hello, welcome to the Spark-TTS fine-tuned model demo!"

# Tokenize the input text
inputs = tokenizer(text, return_tensors="pt")

# Generate speech output from the model
# Note: adjust this to your model's specific generate method if applicable
outputs = model.generate(**inputs)

# Process or save the outputs as needed (e.g., convert to an audio waveform)
# This step depends on your model's output format and synthesis pipeline
print("Inference completed successfully.")
```

## Limitations

- LoRA fine-tuning is supported only in bfloat16 precision (see the sketch below).
- The model is designed primarily for speech synthesis and may not perform well on unrelated NLP tasks.
- Usage in production should be tested carefully for latency and quality trade-offs.
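
If you want to continue fine-tuning with LoRA, load the checkpoint in bfloat16 and attach adapters. A minimal sketch, assuming Unsloth's `get_peft_model` helper; the rank, alpha, and target modules below are illustrative values, not the settings used to train this checkpoint:

```python
import torch
from unsloth import FastModel

model, tokenizer = FastModel.from_pretrained(
    "sureshbeekhani/spark-tts-0.5b-finetune-16bit",
    max_seq_length=2048,
    dtype=torch.bfloat16,   # LoRA fine-tuning is supported only in bfloat16
    full_finetuning=False,
)

# Attach LoRA adapters; hyperparameters here are illustrative examples
model = FastModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
```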

## License

This model is licensed under the MIT License.