Voxtral-Small-24B LoRA Fine-tuned on CoRaL

Danstral is a state-of-the-art 24B-parameter model for Danish automatic speech recognition (ASR). It combines the decoder and audio adapter of Voxtral-Small-24B-2507 with the audio encoder from roest-whisper-large-v1. The decoder and audio adapter were fine-tuned with LoRA for 2 epochs (40 hours) on the Danish CoRaL dataset on three NVIDIA L40 GPUs. While it achieves state-of-the-art performance on CoRaL, it is a massive model and likely overkill compared to Whisper-based models.
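
The exact fine-tuning hyperparameters are in the training script linked under How to Use. As a rough illustration only, attaching a LoRA adapter to the Voxtral decoder with peft could look like the sketch below; the rank, alpha, dropout, and target modules are placeholder assumptions, not the values used for danstral-v1.

# Illustrative sketch only: the rank, alpha, dropout and target modules are
# placeholders, not the configuration actually used to train danstral-v1.
import torch
from transformers import VoxtralForConditionalGeneration
from peft import LoraConfig, get_peft_model

base = VoxtralForConditionalGeneration.from_pretrained(
    "mistralai/Voxtral-Small-24B-2507", torch_dtype=torch.bfloat16, device_map="auto"
)

lora_config = LoraConfig(
    r=16,                 # placeholder rank (adapter size grows with r)
    lora_alpha=32,        # placeholder scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable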


Evaluation Results

| Model | Number of parameters | CoRaL CER | CoRaL WER |
|---|---|---|---|
| hinge/danstral-v1 | 24B | 4.2% ± 0.2% | 9.7% ± 0.3% |
| Alvenir/coral-1-whisper-large | 1.540B | 4.3% ± 0.2% | 10.4% ± 0.3% |
| alexandrainst/roest-315m | 0.315B | 6.6% ± 0.2% | 17.0% ± 0.4% |
| mhenrichsen/hviske-v2 | 1.540B | 4.7% ± 0.07% | 11.8% ± 0.3% |
| openai/whisper-large-v3 | 1.540B | 11.4% ± 0.3% | 28.3% ± 0.6% |
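
CER and WER are the standard character and word error rates on the CoRaL read-aloud test split. As a hedged illustration of what these numbers mean, the snippet below computes both with the jiwer package; the official CoRaL evaluation may apply different text normalization (casing, punctuation), so treat this as a sketch rather than the benchmark code.

# Sketch only: error-rate computation with jiwer; the official CoRaL benchmark
# may apply additional text normalization before scoring.
import jiwer

references = ["det er en prøve", "hej med dig"]   # ground-truth transcripts
hypotheses = ["det er en prove", "hej med dig"]   # model transcriptions

wer = jiwer.wer(references, hypotheses)   # word error rate
cer = jiwer.cer(references, hypotheses)   # character error rate
print(f"WER: {wer:.1%}, CER: {cer:.1%}")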

Limitations

  • Danstral-v1 is huge. It's 16x the size of coral-1-whisper-large with only modest performance improvements. However, the LoRA adapter itself is only 25 million parameters.
  • Danstral-v1 is a fine-tuned version of voxtral-small-24b, whose decoder is based on mistral-small-24b. Mistral does not disclose its training data, but it likely includes Danish Wikipedia articles. Since the CoRaL test split also contains read-aloud samples from Danish Wikipedia, there is a risk of data leakage, which could inflate the test scores.
  • The model was fine-tuned solely on the CoRaL v1 dataset, so performance may deteriorate on other data sources.

Future Work and Ideas

  • Further optimization. The state-of-the-art performance was achieved with a 25M-parameter LoRA adapter. I only conducted a few experiments, and there are likely further gains to be had by tweaking the LoRA configuration or by running a full-parameter fine-tune.
  • Knowledge distillation. Danstral-v1 can be used as a teacher for knowledge distillation to train smaller models; a rough sketch of the pseudo-labelling approach follows this list.
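
A minimal sketch of one possible distillation recipe: label unlabelled Danish audio with danstral-v1 transcripts and fine-tune a smaller student model on the result. The transcribe callable below is assumed to wrap the inference code from How to Use; nothing here is part of the released model.

# Sketch of sequence-level distillation via pseudo-labelling.
# `transcribe` is assumed to wrap the danstral-v1 inference loop shown in
# "How to Use"; dataset contents and any hyperparameters are placeholders.
from datasets import Dataset

def build_pseudo_labelled_dataset(unlabelled_samples, transcribe):
    """Attach teacher (danstral-v1) transcripts to unlabelled audio samples."""
    records = []
    for sample in unlabelled_samples:
        teacher_text = transcribe(sample["audio"]["array"])  # teacher prediction
        records.append({"audio": sample["audio"], "text": teacher_text})
    return Dataset.from_list(records)

# The resulting dataset can then be used to fine-tune a smaller student
# (e.g. a Whisper checkpoint) with the standard Seq2SeqTrainer recipe.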

How to Use

See https://github.com/ChristianHinge/danstral for the training script.

from transformers import VoxtralForConditionalGeneration, AutoProcessor, WhisperForConditionalGeneration
import torch
from peft import PeftModel
from datasets import load_dataset, Audio

repo_id = "mistralai/Voxtral-Small-24B-2507"

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(
    repo_id, torch_dtype=torch.bfloat16, device_map="auto", attn_implementation="flash_attention_2"
)

# Load audio encoder
whisper_model = WhisperForConditionalGeneration.from_pretrained(
    "CoRal-project/roest-whisper-large-v1",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2"
)

# Swap Voxtral's audio tower weights for the Danish-fine-tuned Whisper encoder
whisper_encoder_state_dict = whisper_model.model.encoder.state_dict()
model.audio_tower.load_state_dict(whisper_encoder_state_dict)

# Load LoRA adapters
model = PeftModel.from_pretrained(model, "hinge/danstral-v1")

coral = load_dataset("CoRal-project/coral", "read_aloud")
coral = coral.cast_column("audio", Audio(sampling_rate=16000))

for i in range(10):
    sample = coral["test"][i]
    audio_data = sample['audio']
    ground_truth = sample['text']

    inputs = processor.apply_transcription_request(
        language="da", audio=audio_data['array'], format=["WAV"], model_id=repo_id
    )
    inputs = inputs.to("cuda:0", dtype=torch.bfloat16)

    outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)

    print(f"Ground Truth: {ground_truth}")
    print(f"Prediction: {decoded_outputs[0]}")
    print("-" * 40)
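
Optionally, the LoRA weights can be merged into the base model so inference no longer goes through the PEFT wrapper. merge_and_unload is a standard peft method; the output path below is just an example name, and saving a merged 24B checkpoint requires substantial disk space.

# Optional: merge the LoRA adapter into the base weights for adapter-free inference.
# "danstral-v1-merged" is an example output path, not a published checkpoint.
merged_model = model.merge_and_unload()
merged_model.save_pretrained("danstral-v1-merged")
processor.save_pretrained("danstral-v1-merged")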

Shoutouts

  • Viktor Stenby Johansson and Rasmus Asgaard for the ASR hackathon and ideation
  • The CoRal project and the Alexandra Institute for curating Danish datasets and leading the effort in Danish NLP