---
license: cc-by-nc-4.0
datasets:
- NandemoGHS/Galgame_Gemini_Captions
language:
- ja
base_model:
- Qwen/Qwen3-Omni-30B-A3B-Captioner
---

# Anime-Speech-Japanese-Captioner

This model is a fine-tuned version of [Qwen/Qwen3-Omni-30B-A3B-Captioner](https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Captioner).

This is an audio captioning model specialized for Japanese anime-style or game-style speech. It takes an audio input and generates a detailed description in Japanese, including emotion, speaker profile, mood, speed, prosody, pitch/timbre, style, and an overall caption.

It was fine-tuned using the [NandemoGHS/Galgame_Gemini_Captions](https://huggingface.co/datasets/NandemoGHS/Galgame_Gemini_Captions) dataset.

Training was conducted with the [ms-swift](https://github.com/modelscope/ms-swift) library using the Megatron backend.

## Intended Use and Limitations

This model is specifically designed for **Japanese game-style or anime-style speech**.

Due to the nature of its training data, it is **not expected to perform well** on:

* Languages other than Japanese.
* General conversational speech (e.g., meetings, casual dialogue).

## How to Use (Inference)

We recommend using `vLLM` for inference.

### vLLM Installation Requirements

This model requires building `vLLM` from a recent development commit, as support is not yet included in the latest stable release (v0.11.0 as of this writing).

It has been tested and confirmed to work with commit `18961c5ea62976efc50525b72e40337993c5e4f9`. Build vLLM from source as follows:

```bash
git clone https://github.com/vllm-project/vllm.git
cd vllm
uv pip install . --torch-backend=auto -v --prerelease=allow
```

This requirement will likely become unnecessary once `v0.11.1` is released.
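
To sanity-check the environment after building, you can print the installed version before loading the model. This is only an illustrative check; the exact development version string depends on the commit you built:

```python
import torch
import vllm

# A source build reports a development version string (for example a
# "0.11.1.dev..." style version), while a stable wheel reports "0.11.0".
print("vLLM version:", vllm.__version__)
print("CUDA available:", torch.cuda.is_available())
```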

### Inference Example

Here is a simple inference script using `vLLM`:

```python
import os
import torch

from vllm import LLM, SamplingParams
from transformers import Qwen3OmniMoeProcessor
from qwen_omni_utils import process_mm_info

if __name__ == '__main__':
    # vLLM engine v1 not supported yet
    os.environ['VLLM_USE_V1'] = '0'

    MODEL_PATH = "NandemoGHS/Anime-Speech-Japanese-Captioner-FP8-DYNAMIC"

    llm = LLM(
        model=MODEL_PATH, trust_remote_code=True, gpu_memory_utilization=0.95,
        tensor_parallel_size=torch.cuda.device_count(),
        limit_mm_per_prompt={'audio': 1},
        max_num_seqs=8,
        max_model_len=8192,
        seed=100,
    )

    sampling_params = SamplingParams(
        temperature=0.6,
        top_p=0.95,
        top_k=20,
        max_tokens=4096,
    )

    processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_PATH)

    # Example audio file
    audio_path = "https://huggingface.co/NandemoGHS/Anime-Speech-Japanese-Captioner/resolve/main/examples/example1.wav"

    messages = [
        {
            "role": "user",
            "content": [
                {"type": "audio", "audio": audio_path}
            ],
        }
    ]

    text = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    audios, _, _ = process_mm_info(messages, use_audio_in_video=False)

    inputs = {
        'prompt': text,
        'multi_modal_data': {},
    }

    if audios is not None:
        inputs['multi_modal_data']['audio'] = audios

    outputs = llm.generate([inputs], sampling_params=sampling_params)

    print(outputs[0].outputs[0].text)
```
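
The same engine can caption many files in one call by passing a list of inputs; vLLM batches them internally, up to `max_num_seqs` at a time. Below is a minimal sketch reusing the `llm`, `processor`, and `sampling_params` objects from the script above; `audio_paths` is a hypothetical list of local files or URLs:

```python
# Hypothetical input list; replace with your own files or URLs.
audio_paths = ["example1.wav", "example2.wav"]

batch_inputs = []
for path in audio_paths:
    messages = [{"role": "user", "content": [{"type": "audio", "audio": path}]}]
    prompt = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    audios, _, _ = process_mm_info(messages, use_audio_in_video=False)
    batch_inputs.append({"prompt": prompt, "multi_modal_data": {"audio": audios}})

# One generate() call schedules the whole batch.
outputs = llm.generate(batch_inputs, sampling_params=sampling_params)
for path, output in zip(audio_paths, outputs):
    print(f"--- {path} ---")
    print(output.outputs[0].text)
```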

#### Example Output

This is the caption generated for [this example](https://huggingface.co/NandemoGHS/Anime-Speech-Japanese-Captioner/resolve/main/examples/example1.wav).

```
emotion: ecstatic
profile: お嬢様風の女性声
mood: 快楽、絶頂
speed: 途切れ途切れ
prosody: 息遣いが荒く、感情の起伏が激しい
pitch_timbre: 高め、息多め、喘ぎ声
style: 喘ぎ
notes: 喘ぎ声と吐息が混じり、性的興奮が非常に高い状態。
caption: お嬢様風の女性が快楽に溺れ、喘ぎながら話す。息遣いが荒く、途切れ途切れに感情を爆発させる。性的興奮が最高潮に達している。
```

### Notebook Example

For a more detailed walkthrough, please see the **[inference_example.ipynb](https://huggingface.co/NandemoGHS/Anime-Speech-Japanese-Captioner/blob/main/inference_example.ipynb)** notebook.

## Output Format

The model outputs a structured description of the audio in Japanese, following this format:

```
emotion: {Emotion of the speech}
profile: {Speaker profile}
mood: {Mood of the speech}
speed: {Speaking speed}
prosody: {Prosody, rhythm}
pitch_timbre: {Pitch, voice quality}
style: {Style of utterance}
notes: {Other relevant notes}
caption: {A comprehensive caption integrating all elements}
```
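
Since each field sits on its own `key: value` line, the output is straightforward to post-process. The helper below is a minimal sketch (not part of the model or vLLM API) that splits a generated caption into a dictionary:

```python
def parse_caption(raw: str) -> dict[str, str]:
    """Split the model's `key: value` output lines into a dict."""
    fields = {}
    for line in raw.strip().splitlines():
        # Skip anything that is not a `key: value` line.
        if ":" not in line:
            continue
        # Split only on the first colon so values may contain colons.
        key, value = line.split(":", 1)
        fields[key.strip()] = value.strip()
    return fields

# With the generation result from the inference script above:
# caption = parse_caption(outputs[0].outputs[0].text)
# print(caption["emotion"], caption["caption"])
```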

## License

This model is licensed under the **[CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/)** license.

Furthermore, the training data was built from outputs of **Gemini 2.5 Pro**; therefore, **any use that competes with Gemini or violates its terms of service is strictly prohibited.**