Model Card for Qwen2-VL-7B-Audio-ASR
Model Details
Model Description: This project extends Qwen/Qwen2.5-VL-7B-Instruct, a powerful Vision-Language Model (VLM), into a multi-modal model capable of understanding and transcribing spoken English. By integrating the audio-encoding capabilities of OpenAI's Whisper large-v3 encoder, we have effectively taught a VLM to "hear," enabling it to perform high-quality Automatic Speech Recognition (ASR).
The core of this work lies in a novel data processing pipeline that allows for batch-efficient training. The model was fine-tuned using a two-stage process, starting with adapter tuning and followed by end-to-end QLoRA optimization.
- Developed by: lordChipotle
- Model Type: Audio-Vision-Language Model
- Language(s): English
- License: Apache-2.0
- Finetuned from model: Qwen/Qwen2.5-VL-7B-Instruct
- Audio Encoder: OpenAI Whisper large-v3
Notebook Walkthrough
If you're interested in the entire training code, please see this [Colab Notebook](https://colab.research.google.com/drive/132FZOydWessJdiPxt5hlXJri44WkP90P?usp=sharing).
Technical Approach & Pipeline
The primary challenge was to enable a VLM, originally designed for text and images, to process variable-length audio inputs. We achieved this through the following pipeline:
[See the Diagram](https://imgur.com/a/CKuM9sf)
- Conversation Formatting: Each audio-text pair from the dataset is first structured into a conversational format.
- Chat Templating & Placeholder Injection: A custom chat template is applied, which inserts special placeholder tokens (`<|audio_start|>`, `<|audio_pad|>`, `<|audio_end|>`) where the audio information belongs. The number of `<|audio_pad|>` tokens is scaled with the audio clip's duration.
- Dual-Path Encoding:
  - The Whisper audio encoder processes the raw audio waveform to generate rich audio embeddings.
  - The Qwen2 text encoder processes the text part of the prompt.
- Dynamic Embedding Swapping: In the final step before the LLM, the placeholder embeddings from the text stream are dynamically replaced ("hot-swapped") with their corresponding audio embeddings, creating a unified text-and-audio embedding sequence (a conceptual sketch follows this list).
- Training: The model is then trained on this combined sequence to predict the ground-truth text transcript. This approach allows for efficient batching of audio and text data.
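The sketch below illustrates the placeholder-injection and hot-swap idea in isolation. It is conceptual rather than the actual modeling code: `text_embedder`, `whisper_encoder`, and `audio_proj` stand in for the model's text embedding layer, the Whisper encoder, and the audio projection layer, and it assumes the number of `<|audio_pad|>` tokens in the prompt matches the number of projected audio frames.

```python
import torch

def build_inputs_embeds(input_ids, audio_waveform, text_embedder,
                        whisper_encoder, audio_proj, audio_pad_id):
    """Conceptual hot-swap of <|audio_pad|> placeholders with audio embeddings."""
    # Embed the templated prompt, which already contains the placeholder tokens.
    inputs_embeds = text_embedder(input_ids)                    # (seq_len, hidden)
    # Encode the waveform with Whisper and project into the LLM embedding space.
    audio_embeds = audio_proj(whisper_encoder(audio_waveform))  # (n_frames, hidden)
    # Locate the <|audio_pad|> positions and swap in the audio embeddings.
    pad_positions = (input_ids == audio_pad_id).nonzero(as_tuple=True)[0]
    inputs_embeds = inputs_embeds.clone()
    inputs_embeds[pad_positions] = audio_embeds[: pad_positions.numel()]
    return inputs_embeds
```

Because the audio information is expressed as ordinary embedding positions, batches can mix audio and text samples without any special attention machinery, which is what makes the batch-efficient training described above possible.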
How to Get Started with the Model
Use the code below to get started with the model for speech transcription.
```python
import torch
import torchaudio
import torchaudio.transforms as T
from peft import PeftModel
from transformers import BitsAndBytesConfig, Qwen2VLProcessor
from transformers.models.qwen2_vl.modeling_qwen2_vl import AudioQwen2VLForConditionalGeneration
# --- Configuration ---
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
DTYPE = torch.bfloat16 if torch.cuda.is_available() and torch.cuda.is_bf16_supported() else torch.float32
BASE_REPO = "lordChipotle/qwen2-vl-audio-7b"
ADAPTER_REPO = "lordChipotle/qwen2-vl-audio-7b-qlora"
# --- Load Model and Processor ---
print("Loading base model, processor, and applying LoRA adapter...")
processor = Qwen2VLProcessor.from_pretrained(BASE_REPO, trust_remote_code=True)
# Load the base model with quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=DTYPE,
)
model = AudioQwen2VLForConditionalGeneration.from_pretrained(
    BASE_REPO,
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation="flash_attention_2",
)
# Apply the LoRA adapter
model = PeftModel.from_pretrained(model, ADAPTER_REPO)
print("Model, processor, and adapter loaded.")
# --- Inference Functions ---
def prepare_audio(audio_path, target_sr=16000):
    """Load an audio file, resample it to 16 kHz, and downmix to mono."""
    waveform, sample_rate = torchaudio.load(audio_path)
    if sample_rate != target_sr:
        resampler = T.Resample(orig_freq=sample_rate, new_freq=target_sr)
        waveform = resampler(waveform)
    if waveform.shape[0] > 1:  # downmix multi-channel audio to mono
        waveform = torch.mean(waveform, dim=0, keepdim=True)
    return waveform.squeeze().numpy()
def transcribe(audio_path, max_new_tokens=128):
    """Transcribe a single audio file with the audio-adapted model."""
    print(f"Loading and preparing audio from: {audio_path}")
    audio_array = prepare_audio(audio_path)
    # Build a chat containing the audio clip and a short transcription instruction.
    chat = [
        {"role": "system", "content": [{"type": "text", "text": "You are an ASR assistant."}]},
        {"role": "user", "content": [{"type": "audio", "array": audio_array}, {"type": "text", "text": "Transcribe this."}]},
    ]
    text = processor.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[text], return_tensors="pt")
    inputs = {k: v.to(DEVICE) for k, v in inputs.items()}
    print("Generating transcription...")
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    response = processor.decode(outputs[0], skip_special_tokens=True)
    # Keep only the assistant's reply; fall back to the full decoded string.
    try:
        return response.split("assistant\n")[-1].strip()
    except Exception:
        return response
# --- Example Usage ---
# Download a sample audio file for testing:
# !wget https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/1.flac
AUDIO_FILE = "1.flac"
transcription = transcribe(AUDIO_FILE)
print("\n--- TRANSCRIPTION ---")
print(transcription)
print("---------------------")
```
Deployment and Inference
For optimized inference, especially in a production environment, it is recommended to use serving frameworks like vLLM, which can provide significant speedups.
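If your serving framework cannot load PEFT adapters directly, one option is to merge the LoRA weights into the base model first and serve the merged checkpoint. The sketch below reuses `BASE_REPO`, `ADAPTER_REPO`, `processor`, and `AudioQwen2VLForConditionalGeneration` from the quick-start snippet above and assumes the base model is reloaded in bfloat16, since merging into a 4-bit quantized model is not supported.

```python
from peft import PeftModel

# Reload the base model in bfloat16 (merging requires unquantized weights).
base = AudioQwen2VLForConditionalGeneration.from_pretrained(
    BASE_REPO, torch_dtype=torch.bfloat16, device_map="auto"
)
# Fold the LoRA adapter into the base weights and save a standalone checkpoint.
merged = PeftModel.from_pretrained(base, ADAPTER_REPO).merge_and_unload()
merged.save_pretrained("qwen2-vl-audio-7b-merged")
processor.save_pretrained("qwen2-vl-audio-7b-merged")
```

The merged checkpoint can then be loaded by whichever serving stack you choose, provided it supports this custom audio architecture.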
Training Details
Training Data
The model was fine-tuned on a subset of the speechbrain/LargeScaleASR dataset (recently renamed to speechbrain/LoquaciousSet), which comprises 25,000 hours of diverse, transcribed English speech. For this project, a smaller shard consisting of the first two parts of the 'small' configuration (`train-0000*` and `train-0001*`) was used for training, and the first part of the 'test' set (`test-00000*`) was used for validation.
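The following is a minimal sketch of how such a shard selection could be expressed with the 🤗 Datasets library; the exact parquet file layout inside the repository is an assumption and the glob patterns may need to be adjusted.

```python
from datasets import load_dataset

# Glob patterns are illustrative; check the repository's actual file layout.
data_files = {
    "train": ["small/train-0000*.parquet", "small/train-0001*.parquet"],
    "validation": ["test/test-00000*.parquet"],
}
dataset = load_dataset("speechbrain/LargeScaleASR", data_files=data_files)
```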
Training Procedure
The fine-tuning was conducted in two stages to effectively adapt the VLM for audio processing.
Stage 1: Audio Adapter Training
In the first stage, the language model and the pre-trained Whisper audio encoder were frozen. Only the newly introduced `audio_proj` layer was trained. This stage aims to align the audio feature space with the language model's embedding space.
- Learning Rate: `1e-4`
- Batch Size: `2` (per device)
- Gradient Accumulation Steps: `4` (effective batch size of 8)
- Max Steps: `1000`
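A minimal sketch of this freezing setup, assuming the projection layer is exposed as `model.audio_proj` (the attribute name is inferred from the description above and may differ in the actual code):

```python
# `model` is the audio-extended Qwen2-VL model with the Whisper encoder attached.
# Freeze everything, then unfreeze only the audio projection layer.
for param in model.parameters():
    param.requires_grad = False
for param in model.audio_proj.parameters():  # attribute name assumed
    param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable:,} / {total:,} parameters")
```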
Stage 2: QLoRA End-to-End Fine-Tuning
In the second stage, the entire model was unfrozen and fine-tuned end-to-end using QLoRA (Quantized Low-Rank Adaptation). This method significantly reduces memory requirements by quantizing the base model to 4-bit NF4 (NormalFloat) precision and then training a small number of LoRA adapters on top.
- Learning Rate: `2e-5`
- Batch Size: `2` (per device)
- Gradient Accumulation Steps: `8` (effective batch size of 16)
- Epochs: `1`
- Quantization: 4-bit NF4 with `bfloat16` compute dtype
- LoRA Config:
  - `r`: 16
  - `lora_alpha`: 32
  - `target_modules`: `['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj', 'audio_proj']`
  - `lora_dropout`: 0.05
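A sketch of how this configuration could be expressed with PEFT and bitsandbytes. Trainer arguments are omitted, and `task_type` is an assumption; the hyperparameters mirror the list above.

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # bfloat16 compute dtype
)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj", "audio_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",                  # assumed task type
)
# `model` is the base model loaded in 4-bit with `quantization_config=bnb_config`.
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```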
Evaluation
The model's performance was monitored using Weights & Biases. The plots below show the training and evaluation loss during the second stage of fine-tuning.
Training & Evaluation Loss (Stage 2)
[Chart](https://imgur.com/a/zXj0jF1)
The evaluation loss shows a consistent downward trend, indicating that the model was successfully learning to transcribe speech from the audio data. The training loss also decreased steadily, converging to a low value.
Citation
If you use this model in your work, please consider citing the original Qwen and Whisper models, as well as this derivative work.
```bibtex
@misc{qwen2_vl_audio_asr,
  author       = {lordChipotle},
  title        = {Qwen2-VL-7B for Speech Understanding},
  year         = {2025},
  publisher    = {Hugging Face},
  journal      = {Hugging Face Hub},
  howpublished = {\url{https://huggingface.co/lordChipotle/qwen2-vl-audio-7b-qlora}}
}
```