---
license: cc-by-nc-4.0
datasets:
- NandemoGHS/Galgame_Gemini_Captions
language:
- ja
base_model:
- Qwen/Qwen3-Omni-30B-A3B-Captioner
---

# Anime-Speech-Japanese-Captioner

This model is a fine-tuned version of [Qwen/Qwen3-Omni-30B-A3B-Captioner](https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Captioner).

It is an audio captioning model specialized for Japanese anime-style and game-style speech. Given an audio input, it generates a detailed description in Japanese covering emotion, speaker profile, mood, speed, prosody, pitch/timbre, style, and an overall caption.

It was fine-tuned on the [NandemoGHS/Galgame_Gemini_Captions](https://huggingface.co/datasets/NandemoGHS/Galgame_Gemini_Captions) dataset.

Training was conducted with the [ms-swift](https://github.com/modelscope/ms-swift) library using the Megatron backend.

## Intended Use and Limitations

This model is designed specifically for **Japanese game-style or anime-style speech**.

Due to the nature of its training data, it is **not expected to perform well** on:

* Languages other than Japanese.
* General conversational speech (e.g., meetings, casual dialogue).

## How to Use (Inference)

We recommend using `vLLM` for inference.

### vLLM Installation Requirements

This model requires building `vLLM` from a recent development commit, as it is not yet supported in the latest stable release (v0.11.0 as of this writing).

It has been tested and confirmed to work with commit `18961c5ea62976efc50525b72e40337993c5e4f9`. You must build vLLM from source:

```bash
git clone https://github.com/vllm-project/vllm.git
cd vllm
# Optionally pin to the commit verified above
git checkout 18961c5ea62976efc50525b72e40337993c5e4f9
uv pip install . --torch-backend=auto -v --prerelease=allow
```
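
After installation, you can sanity-check that the source build is active; a development build reports a dev/commit-suffixed version string rather than a plain stable tag:

```bash
# Print the installed vLLM version; a source build shows a dev suffix.
python -c "import vllm; print(vllm.__version__)"
```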

This requirement will likely become unnecessary after the `v0.11.1` release.

### Inference Example

Here is a simple inference script using `vLLM`:

```python
import os
import torch

from vllm import LLM, SamplingParams
from transformers import Qwen3OmniMoeProcessor
from qwen_omni_utils import process_mm_info

if __name__ == '__main__':
    # vLLM engine v1 is not supported yet
    os.environ['VLLM_USE_V1'] = '0'

    MODEL_PATH = "NandemoGHS/Anime-Speech-Japanese-Captioner-FP8-DYNAMIC"

    llm = LLM(
        model=MODEL_PATH, trust_remote_code=True, gpu_memory_utilization=0.95,
        tensor_parallel_size=torch.cuda.device_count(),
        limit_mm_per_prompt={'audio': 1},
        max_num_seqs=8,
        max_model_len=8192,
        seed=100,
    )

    sampling_params = SamplingParams(
        temperature=0.6,
        top_p=0.95,
        top_k=20,
        max_tokens=4096,
    )

    processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_PATH)

    # Example audio file
    audio_path = "https://huggingface.co/NandemoGHS/Anime-Speech-Japanese-Captioner/resolve/main/examples/example1.wav"

    messages = [
        {
            "role": "user",
            "content": [
                {"type": "audio", "audio": audio_path}
            ],
        }
    ]

    # Render the chat template to a prompt string and extract the audio inputs
    text = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    audios, _, _ = process_mm_info(messages, use_audio_in_video=False)

    inputs = {
        'prompt': text,
        'multi_modal_data': {},
    }

    if audios is not None:
        inputs['multi_modal_data']['audio'] = audios

    outputs = llm.generate([inputs], sampling_params=sampling_params)

    print(outputs[0].outputs[0].text)
```
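
Since `llm.generate` accepts a list of inputs, multiple clips can be captioned in a single call. A minimal batching sketch, reusing the `llm`, `processor`, and `sampling_params` objects from the script above (the file paths are placeholders):

```python
# Hypothetical local files; replace with your own clips.
audio_paths = ["clip1.wav", "clip2.wav"]

batch_inputs = []
for path in audio_paths:
    msgs = [{"role": "user", "content": [{"type": "audio", "audio": path}]}]
    prompt = processor.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
    audios, _, _ = process_mm_info(msgs, use_audio_in_video=False)
    batch_inputs.append({"prompt": prompt, "multi_modal_data": {"audio": audios}})

# vLLM schedules the whole batch; results come back in input order.
for out in llm.generate(batch_inputs, sampling_params=sampling_params):
    print(out.outputs[0].text)
```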

#### Example Output

This is the caption generated for [this example](https://huggingface.co/NandemoGHS/Anime-Speech-Japanese-Captioner/resolve/main/examples/example1.wav):

```
emotion: ecstatic
profile: お嬢様風の女性声
mood: 快楽、絶頂
speed: 途切れ途切れ
prosody: 息遣いが荒く、感情の起伏が激しい
pitch_timbre: 高め、息多め、喘ぎ声
style: 喘ぎ
notes: 喘ぎ声と吐息が混じり、性的興奮が非常に高い状態。
caption: お嬢様風の女性が快楽に溺れ、喘ぎながら話す。息遣いが荒く、途切れ途切れに感情を爆発させる。性的興奮が最高潮に達している。
```

### Notebook Example

For a more detailed walkthrough, please see the **[inference_example.ipynb](https://huggingface.co/NandemoGHS/Anime-Speech-Japanese-Captioner/blob/main/inference_example.ipynb)** notebook.

## Output Format

The model outputs a structured description of the audio in Japanese, following this format:

```
emotion: {Emotion of the speech}
profile: {Speaker profile}
mood: {Mood of the speech}
speed: {Speaking speed}
prosody: {Prosody, rhythm}
pitch_timbre: {Pitch, voice quality}
style: {Style of utterance}
notes: {Other relevant notes}
caption: {A comprehensive caption integrating all elements}
```
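
Because every field sits on its own `key: value` line, the output is straightforward to post-process. A minimal parsing sketch; the `parse_caption` helper here is ours, not part of the model or its libraries:

```python
def parse_caption(text: str) -> dict[str, str]:
    """Split the model's 'key: value' output lines into a field dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # keep only lines that actually contain a colon
            fields[key.strip()] = value.strip()
    return fields

# e.g. parse_caption(outputs[0].outputs[0].text).get("emotion") -> "ecstatic"
```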

## License

This model is licensed under the **[CC BY-NC 4.0 License](https://creativecommons.org/licenses/by-nc/4.0/)**.

Furthermore, the training data incorporates outputs from **Gemini 2.5 Pro**; therefore, **any use that competes with Gemini or violates its terms of service is strictly prohibited.**