---
license: apache-2.0
language:
- en
- zh
- th
- id
- vi
pipeline_tag: audio-text-to-text
tags:
- multimodal
- audio-language-model
- audio
base_model:
- mispeech/dasheng-0.6B
- Qwen/Qwen2.5-Omni-7B
base_model_relation: finetune
---
# MiDashengLM-7B-1021
MiDashengLM is an efficient audio-language model that achieves holistic audio understanding through caption-based alignment.
It achieves state-of-the-art performance on multiple audio understanding benchmarks while maintaining high inference efficiency, delivering a 3.2× throughput speedup and supporting batch sizes of up to 512.
📖 For a more detailed introduction and the technical report, please visit our [GitHub repository](https://github.com/xiaomi-research/dasheng-lm).
Note that for most applications, we strongly recommend the BF16 version ([mispeech/midashenglm-7b-1021-bf16](https://huggingface.co/mispeech/midashenglm-7b-1021-bf16)) for better performance and efficiency; this FP32 checkpoint is mainly useful for exactly reproducing the reported results.
## Usage
### Load Model
```python
from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer

model_id = "mispeech/midashenglm-7b-1021-fp32"  # Only for exact reproduction; otherwise strongly recommend "mispeech/midashenglm-7b-1021-bf16"

model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
```
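If you have a GPU, you can optionally move the model there before running inference. A minimal sketch, assuming a CUDA device is available (the device selection below is our assumption, not part of the official loading recipe):

```python
import torch

# Assumption: use a CUDA GPU when available, otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()  # eval() disables dropout for inference
```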
### Construct Prompt
```python
user_prompt = "Caption the audio."  # You may try any other prompt

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": user_prompt},
            {
                "type": "audio",
                "path": "/path/to/example.wav",
                # or "url": "https://example.com/example.wav"
                # or "audio": np.random.randn(16000)
            },
        ],
    },
]
```
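Besides `"path"` and `"url"`, the message schema above accepts a raw waveform via the `"audio"` key. A hedged sketch of loading one with `soundfile` (the file path is hypothetical, and the 16 kHz expectation is inferred from the `np.random.randn(16000)` placeholder above, not stated explicitly):

```python
import soundfile as sf

# Hypothetical local file; the model appears to expect 16 kHz audio,
# so resample beforehand if your recording uses a different rate.
waveform, sample_rate = sf.read("/path/to/example.wav")

messages[0]["content"][1] = {"type": "audio", "audio": waveform}
```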
### Generate Output
```python
import torch

with torch.no_grad():
    model_inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        add_special_tokens=True,
        return_dict=True,
    ).to(device=model.device, dtype=model.dtype)
    generation = model.generate(**model_inputs)
    output = tokenizer.batch_decode(generation, skip_special_tokens=True)  # ["An engine is idling."]
```
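`generate` is called with its defaults above. If you need to bound the caption length or make decoding explicit, the standard Hugging Face generation arguments apply; the values below are illustrative choices, not settings recommended by the authors:

```python
with torch.no_grad():
    generation = model.generate(
        **model_inputs,
        max_new_tokens=128,  # illustrative cap on output length
        do_sample=False,     # greedy decoding for deterministic captions
    )
output = tokenizer.batch_decode(generation, skip_special_tokens=True)
```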
## Results
The following evaluation results were obtained with the `mispeech/midashenglm-7b-1021-fp32` checkpoint.
### Audio Captioning Results
| Domain | Dataset | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|:--------:|:--------------:|:--------------:|:----------------:|:-------------------:|
| Music | MusicCaps | **59.11** | 43.71 | 35.43 |
| Music | Songdescriber | **46.42** | 45.31 | 44.63 |
| Sound | AudioCaps | **62.13** | 60.79 | 49.00 |
| Sound | ClothoV2 | **49.35** | 47.55 | 48.01 |
| Sound | AutoACD | **67.13** | 55.93 | 44.76 |
*Metrics: FENSE (higher is better).*
### Audio and Paralinguistic Classification
| Dataset | Metric | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|:----------------:|:------:|:--------------:|:----------------:|:------------------:|
| VoxCeleb1 | ACC↑ | **92.66** | 59.71 | 82.72 |
| VoxLingua107 | ACC↑ | **93.72** | 51.03 | 73.65 |
| VoxCeleb-Gender | ACC↑ | 97.72 | **99.82** | 99.69 |
| VGGSound | ACC↑ | **52.19** | 0.97 | 2.20 |
| Cochlscene | ACC↑ | **75.81** | 23.88 | 18.34 |
| NSynth | ACC↑ | **80.32** | 60.45 | 38.09 |
| FMA | ACC↑ | 62.94 | **66.77** | 27.91 |
| FSDKaggle2018 | ACC↑ | **73.38** | 31.38 | 24.75 |
| AudioSet | mAP↑ | **9.90** | 6.48 | 3.47 |
| FSD50K | mAP↑ | **38.10** | 23.87 | 27.23 |
### ASR Performance
| Dataset | Language | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|:------------------:|:------------:|:-------------:|:------------:|:-------------------:|
| LibriSpeech test-clean | English | 3.6 | 1.7 | **1.3** |
| LibriSpeech test-other | English | 5.9 | 3.4 | **2.4** |
| People's Speech | English | 26.12 | 28.6 | **22.3** |
| AISHELL2 Mic | Chinese | 3.2 | **2.5** | 2.7 |
| AISHELL2 iOS | Chinese | 2.9 | **2.6** | **2.6** |
| AISHELL2 Android | Chinese | 3.1 | 2.7 | **2.6** |
| GigaSpeech2 | Indonesian | 22.3 | **21.2** | >100 |
| GigaSpeech2 | Thai | **38.4** | 53.8 | >100 |
| GigaSpeech2 | Vietnamese | **17.7** | 18.6 | >100 |
*Metrics: WER/CER (lower is better).*
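For context, WER counts word-level substitutions, insertions, and deletions against a reference transcript (CER does the same at the character level for Chinese). A minimal sketch with the `jiwer` package; the report does not specify which scoring implementation was used, so this is an assumption about tooling:

```python
import jiwer

reference = "an engine is idling near the platform"
hypothesis = "an engine is idling near a platform"

# WER = (substitutions + insertions + deletions) / reference word count
print(jiwer.wer(reference, hypothesis))  # 1 substitution over 7 words ≈ 0.143
```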
### Question Answering Results
| Dataset | Subset | Metric | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|:--------------:|:------------------:|:------:|:--------------:|:----------------:|:-------------------:|
| MMAU-Pro | IF | ACC↑ | 37.93 | **61.30** | 42.30 |
| MMAU-Pro | Multi-Audio | ACC↑ | **42.33** | 24.30 | 17.20 |
| MMAU-Pro | Music | ACC↑ | **62.20** | 61.50 | 57.60 |
| MMAU-Pro | Open-ended | ACC↑ | **63.21** | 52.30 | 34.50 |
| MMAU-Pro | Sound | ACC↑ | **58.36** | 47.60 | 46.00 |
| MMAU-Pro | Sound–Music | ACC↑ | 42.00 | 40.00 | **46.00** |
| MMAU-Pro | Sound–Music–Speech | ACC↑ | **71.43** | 28.50 | 42.80 |
| MMAU-Pro | Spatial | ACC↑ | 18.77 | 41.20 | **43.70** |
| MMAU-Pro | Speech | ACC↑ | **61.17** | 57.40 | 52.20 |
| MMAU-Pro | Speech–Music | ACC↑ | **58.70** | 53.20 | 54.30 |
| MMAU-Pro | Speech–Sound | ACC↑ | 51.14 | **60.20** | 48.90 |
| MMAU-Pro | Voice | ACC↑ | 54.83 | **60.00** | 50.60 |
| MMAU-Pro | Average | ACC↑ | **55.92** | 52.20 | 46.60 |
| MMAU-v05.15.25 | Sound | ACC↑ | 77.48 | **78.10** | 75.68 |
| MMAU-v05.15.25 | Music | ACC↑ | **70.96** | 65.90 | 66.77 |
| MMAU-v05.15.25 | Speech | ACC↑ | **76.28** | 70.60 | 62.16 |
| MMAU-v05.15.25 | Average | ACC↑ | **74.90** | 71.50 | 68.20 |
| MuChoMusic | | ACC↑ | **73.04** | 64.79 | 67.40 |
| MusicQA | | FENSE↑ | **61.56** | 60.60 | 40.00 |
| AudioCaps-QA | | FENSE↑ | **54.20** | 53.28 | 47.34 |
*Metrics: ACC and FENSE (higher is better).*
## Citation
MiDashengLM is released under the Apache License 2.0, and we encourage its use in **both research and business applications**.
If you find MiDashengLM useful in your research, please consider citing our work:
```bibtex
@techreport{midashenglm7b,
  title       = {MiDashengLM: Efficient Audio Understanding with General Audio Captions},
  author      = {{Horizon Team, MiLM Plus}},
  institution = {Xiaomi Inc.},
  year        = {2025},
  note        = {Contributors: Heinrich Dinkel et al. (listed alphabetically in Appendix B)},
  url         = {https://arxiv.org/abs/2508.03983},
  eprint      = {2508.03983},
}
``` |