---
license: cc-by-nc-4.0
datasets:
- NandemoGHS/Galgame_Gemini_Captions
language:
- ja
base_model:
- Qwen/Qwen3-Omni-30B-A3B-Captioner
---
# Anime-Speech-Japanese-Captioner
This model is a fine-tuned version of [Qwen/Qwen3-Omni-30B-A3B-Captioner](https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Captioner).
This is an audio captioning model specialized for Japanese anime-style or game-style speech. It takes an audio input and generates a detailed description in Japanese, including emotion, speaker profile, mood, speed, prosody, pitch/timbre, style, and an overall caption.
It was fine-tuned using the [NandemoGHS/Galgame_Gemini_Captions](https://huggingface.co/datasets/NandemoGHS/Galgame_Gemini_Captions) dataset.
Training was conducted with the [ms-swift](https://github.com/modelscope/ms-swift) library using its Megatron backend.
## Intended Use and Limitations
This model is specifically designed for **Japanese game-style or anime-style speech**.
Due to the nature of its training data, it is **not expected to perform well** on:
* Languages other than Japanese.
* General conversational speech (e.g., meetings, casual dialogue).
## How to Use (Inference)
We recommend using `vLLM` for inference.
### vLLM Installation Requirements
Support for this model is not yet included in the latest stable `vLLM` release (v0.11.0 as of this writing), so you need to build `vLLM` from a recent development commit.
It has been tested and confirmed to work with commit `18961c5ea62976efc50525b72e40337993c5e4f9`. Build vLLM from source as follows:
```bash
git clone https://github.com/vllm-project/vllm.git
cd vllm
# Optionally pin the commit this model was tested with
git checkout 18961c5ea62976efc50525b72e40337993c5e4f9
uv pip install . --torch-backend=auto -v --prerelease=allow
```
This requirement should no longer be necessary once `v0.11.1` is released, after which installing the stable package (e.g. `pip install vllm`) should be sufficient.
### Inference Example
Here is a simple inference script using `vLLM`:
```python
import os
import torch
from vllm import LLM, SamplingParams
from transformers import Qwen3OmniMoeProcessor
from qwen_omni_utils import process_mm_info
if __name__ == '__main__':
    # vLLM engine v1 is not supported yet for this model
    os.environ['VLLM_USE_V1'] = '0'

    MODEL_PATH = "NandemoGHS/Anime-Speech-Japanese-Captioner-FP8-DYNAMIC"

    llm = LLM(
        model=MODEL_PATH, trust_remote_code=True, gpu_memory_utilization=0.95,
        tensor_parallel_size=torch.cuda.device_count(),
        limit_mm_per_prompt={'audio': 1},
        max_num_seqs=8,
        max_model_len=8192,
        seed=100,
    )

    sampling_params = SamplingParams(
        temperature=0.6,
        top_p=0.95,
        top_k=20,
        max_tokens=4096,
    )

    processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_PATH)

    # Example audio file
    audio_path = "https://huggingface.co/NandemoGHS/Anime-Speech-Japanese-Captioner/resolve/main/examples/example1.wav"

    messages = [
        {
            "role": "user",
            "content": [
                {"type": "audio", "audio": audio_path}
            ],
        }
    ]

    # Build the text prompt from the chat template
    text = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )

    # Extract the audio inputs referenced in the messages
    audios, _, _ = process_mm_info(messages, use_audio_in_video=False)

    inputs = {
        'prompt': text,
        'multi_modal_data': {},
    }
    if audios is not None:
        inputs['multi_modal_data']['audio'] = audios

    outputs = llm.generate([inputs], sampling_params=sampling_params)
    print(outputs[0].outputs[0].text)
```
#### Example Output
This is the caption generated for [this example](https://huggingface.co/NandemoGHS/Anime-Speech-Japanese-Captioner/resolve/main/examples/example1.wav).
```
emotion: ecstatic
profile: お嬢様風の女性声
mood: 快楽、絶頂
speed: 途切れ途切れ
prosody: 息遣いが荒く、感情の起伏が激しい
pitch_timbre: 高く、息多めの喘ぎ声
style: 喘ぎ
notes: 喘ぎ声と吐息が混ざり、性的興奮が非常に高い状態。
caption: お嬢様風の女性が快楽に溺れ、喘ぎながら話す。息遣いが荒く、途切れ途切れに感情が爆発し、性的興奮が最高潮に達している。
```
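Since `llm.generate` accepts a list of requests, the same engine can caption several clips in one call (vLLM schedules up to `max_num_seqs` of them concurrently). The sketch below reuses the `llm`, `processor`, and `sampling_params` objects from the script above; the local file names are placeholders.

```python
# Minimal batch-captioning sketch. Assumes `llm`, `processor`, `sampling_params`
# and `process_mm_info` from the script above are already available; the audio
# paths below are placeholders.
audio_files = ["clip_01.wav", "clip_02.wav", "clip_03.wav"]

batch_inputs = []
for path in audio_files:
    messages = [{"role": "user", "content": [{"type": "audio", "audio": path}]}]
    prompt = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    audios, _, _ = process_mm_info(messages, use_audio_in_video=False)
    batch_inputs.append({
        "prompt": prompt,
        "multi_modal_data": {"audio": audios},
    })

# A single generate() call runs all requests through the engine in order.
outputs = llm.generate(batch_inputs, sampling_params=sampling_params)
for path, out in zip(audio_files, outputs):
    print(f"--- {path} ---")
    print(out.outputs[0].text)
```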
### Notebook Example
For a more detailed walkthrough, please see the **[inference\_example.ipynb](https://huggingface.co/NandemoGHS/Anime-Speech-Japanese-Captioner/blob/main/inference_example.ipynb)** notebook.
## Output Format
The model outputs a structured description of the audio in Japanese, following this format:
```
emotion: {Emotion of the speech}
profile: {Speaker profile}
mood: {Mood of the speech}
speed: {Speaking speed}
prosody: {Prosody, rhythm}
pitch_timbre: {Pitch, voice quality}
style: {Style of utterance}
notes: {Other relevant notes}
caption: {A comprehensive caption integrating all elements}
```
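Because each field occupies its own `key: value` line, the output is straightforward to post-process. The helper below is a small sketch (`parse_caption` is our own name, not part of the model or its libraries) that splits a generated caption into a dictionary:

```python
# Hypothetical helper: turn the "key: value" lines of a generated caption into a dict.
def parse_caption(text: str) -> dict[str, str]:
    fields: dict[str, str] = {}
    for line in text.strip().splitlines():
        if ":" not in line:
            continue  # skip lines that do not follow the key: value format
        key, value = line.split(":", 1)
        fields[key.strip()] = value.strip()
    return fields

# Example usage with the vLLM script above:
# caption_fields = parse_caption(outputs[0].outputs[0].text)
# print(caption_fields["emotion"], caption_fields["caption"])
```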
## License
This model is licensed under the **[CC BY-NC 4.0 license](https://creativecommons.org/licenses/by-nc/4.0/)**.
Furthermore, the training data utilized outputs from **Gemini 2.5 Pro**. Therefore, **any use that competes with Gemini or violates its terms of service is strictly prohibited.**