---
license: apache-2.0
language:
- en
- zh
- th
- id
- vi
pipeline_tag: audio-text-to-text
tags:
- multimodal
- audio-language-model
- audio
base_model:
- mispeech/dasheng-0.6B
- Qwen/Qwen2.5-Omni-7B
base_model_relation: finetune
---
# MiDashengLM-7B-1021
MiDashengLM is an efficient audio-language model that achieves holistic audio understanding through caption-based alignment.
It achieves state-of-the-art performance on multiple audio understanding benchmarks while maintaining high inference efficiency, delivering a 3.2× throughput speedup and supporting batch sizes of up to 512.
📖 For a more detailed introduction and the technical report, please visit our [GitHub repository](https://github.com/xiaomi-research/dasheng-lm).
Note that for most applications, we strongly recommend the BF16 version ([mispeech/midashenglm-7b-1021-bf16](https://huggingface.co/mispeech/midashenglm-7b-1021-bf16)) for better performance and efficiency; this FP32 checkpoint is mainly useful for exactly reproducing the reported results.
## Usage
### Load Model
```python
from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer

model_id = "mispeech/midashenglm-7b-1021-fp32"  # Only for exact reproduction; otherwise strongly recommend "mispeech/midashenglm-7b-1021-bf16"

model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
```
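If you have a GPU, you can optionally move the model there before running inference. A minimal sketch, assuming a CUDA device is available (the device selection below is our assumption, not part of the official loading recipe):

```python
import torch

# Assumption: use a CUDA GPU when available, otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()  # eval() disables dropout for inference
```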
### Construct Prompt
```python
user_prompt = "Caption the audio."  # You may try any other prompt

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": user_prompt},
            {
                "type": "audio",
                "path": "/path/to/example.wav",
                # or "url": "https://example.com/example.wav"
                # or "audio": np.random.randn(16000)
            },
        ],
    },
]
```
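Besides `"path"` and `"url"`, the message schema above accepts a raw waveform via the `"audio"` key. A hedged sketch of loading one with `soundfile` (the file path is hypothetical, and the 16 kHz expectation is inferred from the `np.random.randn(16000)` placeholder above, not stated explicitly):

```python
import soundfile as sf

# Hypothetical local file; the model appears to expect 16 kHz audio,
# so resample beforehand if your recording uses a different rate.
waveform, sample_rate = sf.read("/path/to/example.wav")

messages[0]["content"][1] = {"type": "audio", "audio": waveform}
```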
### Generate Output
```python
import torch

with torch.no_grad():
    model_inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        add_special_tokens=True,
        return_dict=True,
    ).to(device=model.device, dtype=model.dtype)
    generation = model.generate(**model_inputs)
    output = tokenizer.batch_decode(generation, skip_special_tokens=True)  # ["An engine is idling."]
```
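`generate` is called with its defaults above. If you need to bound the caption length or make decoding explicit, the standard Hugging Face generation arguments apply; the values below are illustrative choices, not settings recommended by the authors:

```python
with torch.no_grad():
    generation = model.generate(
        **model_inputs,
        max_new_tokens=128,  # illustrative cap on output length
        do_sample=False,     # greedy decoding for deterministic captions
    )
output = tokenizer.batch_decode(generation, skip_special_tokens=True)
```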
## Results
The following evaluation results were obtained with the `mispeech/midashenglm-7b-1021-fp32` checkpoint.
### Audio Captioning Results
| Domain | Dataset | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|:--------:|:--------------:|:--------------:|:----------------:|:-------------------:|
| Music | MusicCaps | **59.11** | 43.71 | 35.43 |
| Music | Songdescriber | **46.42** | 45.31 | 44.63 |
| Sound | AudioCaps | **62.13** | 60.79 | 49.00 |
| Sound | ClothoV2 | **49.35** | 47.55 | 48.01 |
| Sound | AutoACD | **67.13** | 55.93 | 44.76 |
*Metrics: FENSE (higher is better).*
### Audio and Paralinguistic Classification
| Dataset | Metric | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|:----------------:|:------:|:--------------:|:----------------:|:------------------:|
| VoxCeleb1 | ACC↑ | **92.66** | 59.71 | 82.72 |
| VoxLingua107 | ACC↑ | **93.72** | 51.03 | 73.65 |
| VoxCeleb-Gender | ACC↑ | 97.72 | **99.82** | 99.69 |
| VGGSound | ACC↑ | **52.19** | 0.97 | 2.20 |
| Cochlscene | ACC↑ | **75.81** | 23.88 | 18.34 |
| NSynth | ACC↑ | **80.32** | 60.45 | 38.09 |
| FMA | ACC↑ | 62.94 | **66.77** | 27.91 |
| FSDKaggle2018 | ACC↑ | **73.38** | 31.38 | 24.75 |
| AudioSet | mAP↑ | **9.90** | 6.48 | 3.47 |
| FSD50K | mAP↑ | **38.10** | 23.87 | 27.23 |
### ASR Performance
| Dataset | Language | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|:------------------:|:------------:|:-------------:|:------------:|:-------------------:|
| LibriSpeech test-clean | English | 3.6 | 1.7 | **1.3** |
| LibriSpeech test-other | English | 5.9 | 3.4 | **2.4** |
| People's Speech | English | 26.12 | 28.6 | **22.3** |
| AISHELL2 Mic | Chinese | 3.2 | **2.5** | 2.7 |
| AISHELL2 iOS | Chinese | 2.9 | **2.6** | **2.6** |
| AISHELL2 Android | Chinese | 3.1 | 2.7 | **2.6** |
| GigaSpeech2 | Indonesian | 22.3 | **21.2** | >100 |
| GigaSpeech2 | Thai | **38.4** | 53.8 | >100 |
| GigaSpeech2 | Vietnamese | **17.7** | 18.6 | >100 |
*Metrics: WER/CER (lower is better).*
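For context, WER counts word-level substitutions, insertions, and deletions against a reference transcript (CER does the same at the character level for Chinese). A minimal sketch with the `jiwer` package; the report does not specify which scoring implementation was used, so this is an assumption about tooling:

```python
import jiwer

reference = "an engine is idling near the platform"
hypothesis = "an engine is idling near a platform"

# WER = (substitutions + insertions + deletions) / reference word count
print(jiwer.wer(reference, hypothesis))  # 1 substitution over 7 words ≈ 0.143
```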
### Question Answering Results
| Dataset | Subset | Metric | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|:--------------:|:------------------:|:------:|:--------------:|:----------------:|:-------------------:|
| MMAU-Pro | IF | ACC↑ | 37.93 | **61.30** | 42.30 |
| MMAU-Pro | Multi-Audio | ACC↑ | **42.33** | 24.30 | 17.20 |
| MMAU-Pro | Music | ACC↑ | **62.20** | 61.50 | 57.60 |
| MMAU-Pro | Open-ended | ACC↑ | **63.21** | 52.30 | 34.50 |
| MMAU-Pro | Sound | ACC↑ | **58.36** | 47.60 | 46.00 |
| MMAU-Pro | Sound–Music | ACC↑ | 42.00 | 40.00 | **46.00** |
| MMAU-Pro | Sound–Music–Speech | ACC↑ | **71.43** | 28.50 | 42.80 |
| MMAU-Pro | Spatial | ACC↑ | 18.77 | 41.20 | **43.70** |
| MMAU-Pro | Speech | ACC↑ | **61.17** | 57.40 | 52.20 |
| MMAU-Pro | Speech–Music | ACC↑ | **58.70** | 53.20 | 54.30 |
| MMAU-Pro | Speech–Sound | ACC↑ | 51.14 | **60.20** | 48.90 |
| MMAU-Pro | Voice | ACC↑ | 54.83 | **60.00** | 50.60 |
| MMAU-Pro | Average | ACC↑ | **55.92** | 52.20 | 46.60 |
| MMAU-v05.15.25 | Sound | ACC↑ | 77.48 | **78.10** | 75.68 |
| MMAU-v05.15.25 | Music | ACC↑ | **70.96** | 65.90 | 66.77 |
| MMAU-v05.15.25 | Speech | ACC↑ | **76.28** | 70.60 | 62.16 |
| MMAU-v05.15.25 | Average | ACC↑ | **74.90** | 71.50 | 68.20 |
| MuChoMusic | | ACC↑ | **73.04** | 64.79 | 67.40 |
| MusicQA | | FENSE↑ | **61.56** | 60.60 | 40.00 |
| AudioCaps-QA | | FENSE↑ | **54.20** | 53.28 | 47.34 |
*Metrics: ACC and FENSE (higher is better).*
## Citation
MiDashengLM is released under the Apache License 2.0, and we encourage its use in **both research and business applications**.
If you find MiDashengLM useful in your research, please consider citing our work:
```bibtex
@techreport{midashenglm7b,
  title       = {MiDashengLM: Efficient Audio Understanding with General Audio Captions},
  author      = {{Horizon Team, MiLM Plus}},
  institution = {Xiaomi Inc.},
  year        = {2025},
  note        = {Contributors: Heinrich Dinkel et al. (listed alphabetically in Appendix B)},
  url         = {https://arxiv.org/abs/2508.03983},
  eprint      = {2508.03983},
}
``` |