# Voxtral-Small-24B LoRA Fine-tuned on CoRaL
Danstral is a state-of-the-art 24B parameter model for Danish automatic speech recognition (ASR). It combines the decoder and audio-adapter of Voxtral-Small-24B-2507 with the audio encoder from roest-whisper-large-v1. The decoder and audio-adapter were fine-tuned using LoRA for 2 epochs (40 hours) on the Danish CoRaL dataset, using three NVIDIA L40 GPUs. While it achieves state-of-the-art performance on CoRaL, it is a massive model and likely overkill compared to Whisper-based models.
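As a rough illustration of what such a setup can look like, the sketch below configures a LoRA adapter with peft. The rank, alpha, dropout, target-module pattern, and the `multi_modal_projector` module name are illustrative assumptions, not the exact configuration used for danstral-v1 (the actual training script is linked under How to Use).

```python
# Minimal sketch of a LoRA setup over the Voxtral decoder (assumed hyperparameters,
# not the exact danstral-v1 configuration).
import torch
from peft import LoraConfig, get_peft_model
from transformers import VoxtralForConditionalGeneration

model = VoxtralForConditionalGeneration.from_pretrained(
    "mistralai/Voxtral-Small-24B-2507", torch_dtype=torch.bfloat16
)

lora_config = LoraConfig(
    r=16,               # assumed rank
    lora_alpha=32,      # assumed scaling factor
    lora_dropout=0.05,  # assumed dropout
    # Restrict LoRA to the language-model attention projections (assumed module paths).
    target_modules=r"language_model.*\.(q_proj|k_proj|v_proj|o_proj)",
    # Keep the audio-adapter fully trainable (assumed module name).
    modules_to_save=["multi_modal_projector"],
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # reports how many parameters will actually be trained
```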
## Evaluation Results
| Model | Number of parameters | CoRaL CER | CoRaL WER |
|---|---|---|---|
| hinge/danstral-v1 | 24B | 4.2% ± 0.2% | 9.7% ± 0.3% |
| Alvenir/coral-1-whisper-large | 1.540B | 4.3% ± 0.2% | 10.4% ± 0.3% |
| alexandrainst/roest-315m | 0.315B | 6.6% ± 0.2% | 17.0% ± 0.4% |
| mhenrichsen/hviske-v2 | 1.540B | 4.7% ± 0.07% | 11.8% ± 0.3% |
| openai/whisper-large-v3 | 1.540B | 11.4% ± 0.3% | 28.3% ± 0.6% |
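The scores are character and word error rates on the CoRaL read-aloud test split. As a rough illustration of how such metrics can be computed, here is a minimal sketch using the `evaluate` library; the official CoRaL benchmark applies its own text normalisation and bootstrapped confidence intervals, which are not reproduced here.

```python
# Minimal sketch: computing WER and CER over (reference, prediction) pairs
# with the `evaluate` library. The strings below are placeholders.
import evaluate

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

references = ["det er en prøve", "endnu en sætning"]
predictions = ["det er en prøve", "endnu en sætninger"]

print("WER:", wer_metric.compute(references=references, predictions=predictions))
print("CER:", cer_metric.compute(references=references, predictions=predictions))
```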
## Limitations
- Danstral-v1 is huge. It's 16x the size of coral-1-whisper-large with only modest performance improvements. However, the LoRA adapter itself is only 25 million parameters.
- Danstral-v1 is a fine-tuned version of voxtral-small-24b, whose decoder is a fine-tuned version of mistral-small-24b. Mistral does not disclose its training datasets, but it is likely that Danish Wikipedia articles were used. Since the CoRaL test split also contains read-aloud samples from Danish Wikipedia, there is a risk of data leakage, which could influence the test scores.
- The model was fine-tuned solely on the CoRaL v1 dataset, so performance may deteriorate for other data sources.
## Future Work and Ideas
- Further optimization. The state-of-the-art performance was achieved with a 25M-parameter LoRA adapter. I have only conducted a few experiments, and there are likely more performance gains to be had by tweaking the LoRA configuration or by conducting a full-parameter fine-tune.
- Knowledge distillation. Danstral-v1 can be used as a teacher for knowledge distillation to train smaller models; a pseudo-labelling sketch follows this list.
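A minimal sketch of such a pseudo-labelling loop is shown below. It assumes `model` and `processor` are constructed exactly as in the How to Use section, and the subset size and output file name are arbitrary.

```python
# Hypothetical pseudo-labelling loop for knowledge distillation: danstral-v1
# transcribes CoRaL training audio, and the transcripts become targets for a
# smaller student model. Assumes `model` and `processor` from "How to Use".
import json
import torch
from datasets import load_dataset, Audio

coral_train = load_dataset("CoRal-project/coral", "read_aloud", split="train")
coral_train = coral_train.cast_column("audio", Audio(sampling_rate=16000))

with open("pseudo_labels.jsonl", "w") as f:
    for idx in range(1000):  # illustrative subset
        sample = coral_train[idx]
        inputs = processor.apply_transcription_request(
            language="da",
            audio=sample["audio"]["array"],
            format=["WAV"],
            model_id="mistralai/Voxtral-Small-24B-2507",
        )
        inputs = inputs.to("cuda:0", dtype=torch.bfloat16)
        outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
        transcription = processor.batch_decode(
            outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
        )[0]
        f.write(json.dumps({"index": idx, "transcription": transcription}, ensure_ascii=False) + "\n")
```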
## How to Use
See https://github.com/ChristianHinge/danstral for the training script. The snippet below loads danstral-v1 and transcribes a few samples from the CoRaL test split.
```python
from transformers import VoxtralForConditionalGeneration, AutoProcessor, WhisperForConditionalGeneration
import torch
from peft import PeftModel
from datasets import load_dataset, Audio

# Load the base Voxtral model and its processor
repo_id = "mistralai/Voxtral-Small-24B-2507"
processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2",
)

# Replace the audio encoder with the roest-whisper-large-v1 encoder
whisper_model = WhisperForConditionalGeneration.from_pretrained(
    "CoRal-project/roest-whisper-large-v1",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
whisper_encoder_state_dict = whisper_model.model.encoder.state_dict()
model.audio_tower.load_state_dict(whisper_encoder_state_dict)

# Load the LoRA adapters (after the encoder swap, which the adapters were trained against)
model = PeftModel.from_pretrained(model, "hinge/danstral-v1")

# Load the CoRaL read-aloud data and resample to 16 kHz
coral = load_dataset("CoRal-project/coral", "read_aloud")
coral = coral.cast_column("audio", Audio(sampling_rate=16000))

# Transcribe the first ten test samples
for i in range(10):
    sample = coral["test"][i]
    audio_data = sample["audio"]
    ground_truth = sample["text"]

    inputs = processor.apply_transcription_request(
        language="da", audio=audio_data["array"], format=["WAV"], model_id=repo_id
    )
    inputs = inputs.to("cuda:0", dtype=torch.bfloat16)

    outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    decoded_outputs = processor.batch_decode(
        outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
    )

    print(f"Ground Truth: {ground_truth}")
    print(f"Prediction: {decoded_outputs[0]}")
    print("-" * 40)
```
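Note that in bfloat16 the 24B parameters alone occupy roughly 48 GB, so the model will typically need to be sharded across multiple GPUs via `device_map="auto"` (or quantized) for inference.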
## Shoutouts
- Viktor Stenby Johansson and Rasmus Asgaard for the ASR hackathon and ideation
- The CoRal project and Alexandra Institute for curating Danish datasets and leading the effort in Danish NLP