Voxtral-Small-24B LoRA Fine-tuned on CoRaL

Danstral is a state-of-the-art 24B-parameter model for Danish automatic speech recognition (ASR). It combines the decoder and audio adapter of Voxtral-Small-24B-2507 with the audio encoder from roest-whisper-large-v1. The decoder and audio adapter were fine-tuned with LoRA for 2 epochs (40 hours) on the Danish CoRaL dataset on three NVIDIA L40 GPUs. While it achieves state-of-the-art performance on CoRaL, it is a massive model and likely overkill compared to Whisper-based models.
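
The exact fine-tuning hyperparameters are in the training script linked under How to Use. As a rough illustration only, attaching a LoRA adapter to the Voxtral decoder with peft could look like the sketch below; the rank, alpha, dropout, and target modules are placeholder assumptions, not the values used for danstral-v1.

# Illustrative sketch only: the rank, alpha, dropout and target modules are
# placeholders, not the configuration actually used to train danstral-v1.
import torch
from transformers import VoxtralForConditionalGeneration
from peft import LoraConfig, get_peft_model

base = VoxtralForConditionalGeneration.from_pretrained(
    "mistralai/Voxtral-Small-24B-2507", torch_dtype=torch.bfloat16, device_map="auto"
)

lora_config = LoraConfig(
    r=16,                 # placeholder rank (adapter size grows with r)
    lora_alpha=32,        # placeholder scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable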


Evaluation Results

| Model | Number of parameters | CoRaL CER | CoRaL WER |
|---|---|---|---|
| hinge/danstral-v1 | 24B | 4.2% ± 0.2% | 9.7% ± 0.3% |
| Alvenir/coral-1-whisper-large | 1.540B | 4.3% ± 0.2% | 10.4% ± 0.3% |
| alexandrainst/roest-315m | 0.315B | 6.6% ± 0.2% | 17.0% ± 0.4% |
| mhenrichsen/hviske-v2 | 1.540B | 4.7% ± 0.07% | 11.8% ± 0.3% |
| openai/whisper-large-v3 | 1.540B | 11.4% ± 0.3% | 28.3% ± 0.6% |
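
CER and WER are the standard character and word error rates on the CoRaL read-aloud test split. As a hedged illustration of what these numbers mean, the snippet below computes both with the jiwer package; the official CoRaL evaluation may apply different text normalization (casing, punctuation), so treat this as a sketch rather than the benchmark code.

# Sketch only: error-rate computation with jiwer; the official CoRaL benchmark
# may apply additional text normalization before scoring.
import jiwer

references = ["det er en prøve", "hej med dig"]   # ground-truth transcripts
hypotheses = ["det er en prove", "hej med dig"]   # model transcriptions

wer = jiwer.wer(references, hypotheses)   # word error rate
cer = jiwer.cer(references, hypotheses)   # character error rate
print(f"WER: {wer:.1%}, CER: {cer:.1%}")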

Limitations

  • Danstral-v1 is huge. It's 16x the size of coral-1-whisper-large with only modest performance improvements. However, the LoRA adapter itself is only 25 million parameters.
  • Danstral-v1 is a fine-tuned version of voxtral-small-24b, whose decoder is based on mistral-small-24b. Mistral does not disclose its training data, but it likely includes Danish Wikipedia articles. Since the CoRaL test split also contains read-aloud samples from Danish Wikipedia, there is a risk of data leakage, which could inflate the test scores.
  • The model was fine-tuned solely on the CoRaL v1 dataset, so performance may deteriorate on other data sources.

Future Work and Ideas

  • Further optimization. The state-of-the-art performance was achieved with a 25M-parameter LoRA adapter. I only conducted a few experiments, and there are likely further gains to be had by tweaking the LoRA configuration or by running a full-parameter fine-tune.
  • Knowledge distillation. Danstral-v1 can be used as a teacher for knowledge distillation to train smaller models; a rough sketch of the pseudo-labelling approach follows this list.
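
A minimal sketch of one possible distillation recipe: label unlabelled Danish audio with danstral-v1 transcripts and fine-tune a smaller student model on the result. The transcribe callable below is assumed to wrap the inference code from How to Use; nothing here is part of the released model.

# Sketch of sequence-level distillation via pseudo-labelling.
# `transcribe` is assumed to wrap the danstral-v1 inference loop shown in
# "How to Use"; dataset contents and any hyperparameters are placeholders.
from datasets import Dataset

def build_pseudo_labelled_dataset(unlabelled_samples, transcribe):
    """Attach teacher (danstral-v1) transcripts to unlabelled audio samples."""
    records = []
    for sample in unlabelled_samples:
        teacher_text = transcribe(sample["audio"]["array"])  # teacher prediction
        records.append({"audio": sample["audio"], "text": teacher_text})
    return Dataset.from_list(records)

# The resulting dataset can then be used to fine-tune a smaller student
# (e.g. a Whisper checkpoint) with the standard Seq2SeqTrainer recipe.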

How to Use

See https://github.com/ChristianHinge/danstral for the training script.

from transformers import VoxtralForConditionalGeneration, AutoProcessor, WhisperForConditionalGeneration
import torch
from peft import PeftModel
from datasets import load_dataset, Audio

repo_id = "mistralai/Voxtral-Small-24B-2507"

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(
    repo_id, torch_dtype=torch.bfloat16, device_map="auto", attn_implementation="flash_attention_2"
)

# Load audio encoder
whisper_model = WhisperForConditionalGeneration.from_pretrained(
    "CoRal-project/roest-whisper-large-v1",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2"
)

# Swap Voxtral's audio tower weights for the Danish-fine-tuned Whisper encoder
whisper_encoder_state_dict = whisper_model.model.encoder.state_dict()
model.audio_tower.load_state_dict(whisper_encoder_state_dict)

# Load LoRA adapters
model = PeftModel.from_pretrained(model, "hinge/danstral-v1")

coral = load_dataset("CoRal-project/coral", "read_aloud")
coral = coral.cast_column("audio", Audio(sampling_rate=16000))

for i in range(10):
    sample = coral["test"][i]
    audio_data = sample['audio']
    ground_truth = sample['text']

    inputs = processor.apply_transcription_request(
        language="da", audio=audio_data['array'], format=["WAV"], model_id=repo_id
    )
    inputs = inputs.to("cuda:0", dtype=torch.bfloat16)

    outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)

    print(f"Ground Truth: {ground_truth}")
    print(f"Prediction: {decoded_outputs[0]}")
    print("-" * 40)
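
Optionally, the LoRA weights can be merged into the base model so inference no longer goes through the PEFT wrapper. merge_and_unload is a standard peft method; the output path below is just an example name, and saving a merged 24B checkpoint requires substantial disk space.

# Optional: merge the LoRA adapter into the base weights for adapter-free inference.
# "danstral-v1-merged" is an example output path, not a published checkpoint.
merged_model = model.merge_and_unload()
merged_model.save_pretrained("danstral-v1-merged")
processor.save_pretrained("danstral-v1-merged")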

Shoutouts

  • Viktor Stenby Johansson and Rasmus Asgaard for the ASR hackathon and ideation
  • The CoRal project and the Alexandra Institute for curating Danish datasets and leading the effort in Danish NLP