---
license: apache-2.0
language:
- en
- zh
- th
- id
- vi
pipeline_tag: audio-text-to-text
tags:
- multimodal
- audio-language-model
- audio
base_model:
- mispeech/dasheng-0.6B
- Qwen/Qwen2.5-Omni-7B
base_model_relation: finetune
---

# MiDashengLM-7B-1021

MiDashengLM is an efficient audio-language model that achieves holistic audio understanding through caption-based alignment.
It reaches state-of-the-art performance on multiple audio understanding benchmarks while maintaining high inference efficiency, delivering a 3.2× throughput speedup and supporting batch sizes of up to 512.

📖 For more detailed introduction and technical report, please visit our [GitHub repository](https://github.com/xiaomi-research/dasheng-lm).

Note that for most applications, we strongly recommend using the BF16 version ([mispeech/midashenglm-7b-1021-bf16](https://huggingface.co/mispeech/midashenglm-7b-1021-bf16)) for optimal performance and efficiency.

## Usage

### Load Model

```python
from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer

model_id = "mispeech/midashenglm-7b-1021-fp32"  # Only for exact reproduction; otherwise we strongly recommend "mispeech/midashenglm-7b-1021-bf16"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
```
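
If exact reproduction is not required, the recommended BF16 checkpoint can be loaded the same way. A minimal sketch; the `torch_dtype` argument and the device placement here are illustrative assumptions, not part of the original card:

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer

model_id = "mispeech/midashenglm-7b-1021-bf16"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # keep the BF16 weights in half precision
    trust_remote_code=True,
).to("cuda")  # assumes a CUDA device is available; drop .to(...) to stay on CPU
tokenizer = AutoTokenizer.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
```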

### Construct Prompt

```python
user_prompt = "Caption the audio."  # Feel free to try other prompts

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": user_prompt},
            {
                "type": "audio",
                "path": "/path/to/example.wav",
                # or "url": "https://example.com/example.wav"
                # or "audio": np.random.randn(16000)
            },
        ],
    },
]
```
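
As the comments above indicate, the audio can also be passed as an in-memory waveform via the `audio` key. A minimal sketch using `soundfile` to load a local file; the choice of loader is our assumption, and any library that yields a NumPy array should work:

```python
import soundfile as sf

# Read a local file into a NumPy waveform. Whether the processor resamples
# internally is not documented here, so 16 kHz mono input is the safest bet.
waveform, sample_rate = sf.read("/path/to/example.wav")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Caption the audio."},
            {"type": "audio", "audio": waveform},
        ],
    },
]
```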

### Generate Output

```python
import torch

with torch.no_grad():
    model_inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        add_special_tokens=True,
        return_dict=True,
    ).to(device=model.device, dtype=model.dtype)
    generation = model.generate(**model_inputs)
    output = tokenizer.batch_decode(generation, skip_special_tokens=True)  # ["An engine is idling."]
```
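
`model.generate` accepts the standard `transformers` generation arguments, so decoding can be controlled explicitly. A hedged variant of the call above; the specific values are illustrative, not tuned recommendations:

```python
with torch.no_grad():
    generation = model.generate(
        **model_inputs,
        max_new_tokens=128,  # cap the response length
        do_sample=False,     # greedy decoding for deterministic captions
    )
    output = tokenizer.batch_decode(generation, skip_special_tokens=True)
```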


## Results

The following evaluation results were obtained with the `mispeech/midashenglm-7b-1021-fp32` checkpoint.

### Audio Captioning Results

| Domain   | Dataset        | MiDashengLM    | Qwen2.5-Omni-7B  | Kimi-Audio-Instruct |
|:--------:|:--------------:|:--------------:|:----------------:|:-------------------:|
| Music    | MusicCaps      | **59.11**      | 43.71            | 35.43               |
| Music    | Songdescriber  | **46.42**      | 45.31            | 44.63               |
| Sound    | AudioCaps      | **62.13**      | 60.79            | 49.00               |
| Sound    | ClothoV2       | **49.35**      | 47.55            | 48.01               |
| Sound    | AutoACD        | **67.13**      | 55.93            | 44.76               |

*Metrics: FENSE (higher is better).*

### Audio and Paralinguistic Classification

| Dataset          | Metric | MiDashengLM    | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|:----------------:|:------:|:--------------:|:----------------:|:------------------:|
| VoxCeleb1        | ACC↑   | **92.66**      | 59.71            | 82.72              |
| VoxLingua107     | ACC↑   | **93.72**      | 51.03            | 73.65              |
| VoxCeleb-Gender  | ACC↑   | 97.72          | **99.82**        | 99.69              |
| VGGSound         | ACC↑   | **52.19**      | 0.97             | 2.20               |
| Cochlscene       | ACC↑   | **75.81**      | 23.88            | 18.34              |
| NSynth           | ACC↑   | **80.32**      | 60.45            | 38.09              |
| FMA              | ACC↑   | 62.94          | **66.77**        | 27.91              |
| FSDKaggle2018    | ACC↑   | **73.38**      | 31.38            | 24.75              |
| AudioSet         | mAP↑   | **9.90**       | 6.48             | 3.47               |
| FSD50K           | mAP↑   | **38.10**      | 23.87            | 27.23              |

### ASR Performance

| Dataset            | Language     | MiDashengLM   | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|:------------------:|:------------:|:-------------:|:------------:|:-------------------:|
| LibriSpeech test-clean  | English | 3.6           | 1.7          | **1.3**             |
| LibriSpeech test-other  | English | 5.9           | 3.4          | **2.4**             |
| People's Speech    | English      | 26.12         | 28.6         | **22.3**            |
| AISHELL2 Mic       | Chinese      | 3.2           | **2.5**      | 2.7                 |
| AISHELL2 iOS       | Chinese      | 2.9           | **2.6**      | **2.6**             |
| AISHELL2 Android   | Chinese      | 3.1           | 2.7          | **2.6**             |
| GigaSpeech2        | Indonesian   | 22.3          | **21.2**     | >100                |
| GigaSpeech2        | Thai         | **38.4**      | 53.8         | >100                |
| GigaSpeech2        | Vietnamese   | **17.7**      | 18.6         | >100                |

*Metrics: WER/CER (lower is better).*

### Question Answering Results

| Dataset        | Subset             | Metric | MiDashengLM    | Qwen2.5-Omni-7B  | Kimi-Audio-Instruct |
|:--------------:|:------------------:|:------:|:--------------:|:----------------:|:-------------------:|
| MMAU-Pro       | IF                 | ACC↑   | 37.93          | **61.30**        | 42.30               |
| MMAU-Pro       | Multi-Audio        | ACC↑   | **42.33**      | 24.30            | 17.20               |
| MMAU-Pro       | Music              | ACC↑   | **62.20**      | 61.50            | 57.60               |
| MMAU-Pro       | Open-ended         | ACC↑   | **63.21**      | 52.30            | 34.50               |
| MMAU-Pro       | Sound              | ACC↑   | **58.36**      | 47.60            | 46.00               |
| MMAU-Pro       | Sound–Music        | ACC↑   | 42.00          | 40.00            | **46.00**           |
| MMAU-Pro       | Sound–Music–Speech | ACC↑   | **71.43**      | 28.50            | 42.80               |
| MMAU-Pro       | Spatial            | ACC↑   | 18.77          | 41.20            | **43.70**           |
| MMAU-Pro       | Speech             | ACC↑   | **61.17**      | 57.40            | 52.20               |
| MMAU-Pro       | Speech–Music       | ACC↑   | **58.70**      | 53.20            | 54.30               |
| MMAU-Pro       | Speech–Sound       | ACC↑   | 51.14          | **60.20**        | 48.90               |
| MMAU-Pro       | Voice              | ACC↑   | 54.83          | **60.00**        | 50.60               |
| MMAU-Pro       | Average            | ACC↑   | **55.92**      | 52.20            | 46.60               |
| MMAU-v05.15.25 | Sound              | ACC↑   | 77.48          | **78.10**        | 75.68               |
| MMAU-v05.15.25 | Music              | ACC↑   | **70.96**      | 65.90            | 66.77               |
| MMAU-v05.15.25 | Speech             | ACC↑   | **76.28**      | 70.60            | 62.16               |
| MMAU-v05.15.25 | Average            | ACC↑   | **74.90**      | 71.50            | 68.20               |
| MuChoMusic     |                    | ACC↑   | **73.04**      | 64.79            | 67.40               |
| MusicQA        |                    | FENSE↑ | **61.56**      | 60.60            | 40.00               |
| AudioCaps-QA   |                    | FENSE↑ | **54.20**      | 53.28            | 47.34               |

*Metrics: Higher is better.*

## Citation

MiDashengLM is released under the Apache License 2.0, and we encourage its use in **both research and business applications**.

If you find MiDashengLM useful in your research, please consider citing our work:

```bibtex
@techreport{midashenglm7b,
  title      = {MiDashengLM: Efficient Audio Understanding with General Audio Captions},
  author     = {{Horizon Team, MiLM Plus}},
  institution = {Xiaomi Inc.},
  year       = {2025},
  note       = {Contributors: Heinrich Dinkel et al. (listed alphabetically in Appendix B)},
  url        = {https://arxiv.org/abs/2508.03983},
  eprint     = {2508.03983},
}
```