cpatonn committed
Commit 2c21334 · verified · 1 parent: e59de9b

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,256 @@
---
license: other
license_name: apache-2.0
language:
- en
tags:
- multimodal
library_name: transformers
pipeline_tag: any-to-any
base_model: Qwen/Qwen3-Omni-30B-A3B-Captioner
---

# Qwen3-Omni

<a href="https://chat.qwen.ai/" target="_blank" style="margin: 2px;">
    <img alt="Chat" src="https://img.shields.io/badge/%F0%9F%92%9C%EF%B8%8F%20Qwen%20Chat%20-536af5" style="display: inline-block; vertical-align: middle;"/>
</a>

## Overview

### Introduction

<p align="center">
    <img src="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/q3o_introduction.png" width="100%"/>
</p>

Since the research community currently lacks a general-purpose audio captioning model, we fine-tuned Qwen3-Omni-30B-A3B to obtain **Qwen3-Omni-30B-A3B-Captioner**, which produces detailed, low-hallucination captions for arbitrary audio inputs.

**Qwen3-Omni-30B-A3B-Captioner** is a powerful fine-grained audio analysis model, built upon the Qwen3-Omni-30B-A3B-Instruct base model. It is specifically designed to generate accurate and comprehensive content descriptions in complex and diverse audio scenarios. Without requiring any additional prompting, the model automatically parses and describes various types of audio content, ranging from complex speech and environmental sounds to music and cinematic sound effects, and delivers stable, reliable outputs even in multi-source, mixed audio environments.

In terms of speech understanding, Qwen3-Omni-30B-A3B-Captioner excels at identifying multiple speaker emotions, multilingual expressions, and layered intentions. It can also perceive cultural context and implicit information within the audio, enabling a deep comprehension of the underlying meaning behind the spoken words. In non-speech scenarios, the model demonstrates exceptional sound recognition and analysis capabilities, accurately distinguishing and describing intricate layers of real-world sounds, ambient atmospheres, and dynamic audio details in film and media.

**Note**: Qwen3-Omni-30B-A3B-Captioner is a single-turn model that accepts only one audio input per inference. It does not accept any text prompts and supports **audio input only**, with **text output only**. Because Qwen3-Omni-30B-A3B-Captioner is designed to generate fine-grained descriptions of audio, excessively long audio clips may diminish detail perception. As a best practice, we recommend limiting audio length to no more than 30 seconds.
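
If your source clips run longer than that, a small pre-processing step can enforce the limit before captioning. Below is a minimal sketch using `soundfile`; the file names are placeholders, not part of the official workflow:

```python
import soundfile as sf

MAX_SECONDS = 30  # recommended upper bound from the note above

# "long_clip.wav" and "clip_30s.wav" are placeholder paths
data, sr = sf.read("long_clip.wav")        # data: numpy array, sr: sample rate in Hz
sf.write("clip_30s.wav", data[: MAX_SECONDS * sr], sr)
```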

## QuickStart

### Model Description and Download

| Model Name | Description |
|------------------------------|-------------|
| Qwen3-Omni-30B-A3B-Captioner | A downstream fine-grained audio captioning model fine-tuned from Qwen3-Omni-30B-A3B-Instruct, which produces detailed, low-hallucination captions for arbitrary audio inputs. It contains the thinker, supporting audio input and text output. For more information, refer to the model's [cookbook](https://github.com/QwenLM/Qwen3-Omni/blob/main/cookbooks/omni_captioner.ipynb), the [Hugging Face Demo](https://huggingface.co/spaces/Qwen/Qwen3-Omni-Captioner-Demo), or the [ModelScope Demo](https://modelscope.cn/studios/Qwen/Qwen3-Omni-Captioner-Demo). |

When the model is loaded in Hugging Face Transformers or vLLM, its weights are downloaded automatically based on the model name. If your runtime environment cannot download weights during execution, you can use the following commands to download them to a local directory in advance:

```bash
# Download through ModelScope (recommended for users in Mainland China)
pip install -U modelscope
modelscope download --model Qwen/Qwen3-Omni-30B-A3B-Captioner --local_dir ./Qwen3-Omni-30B-A3B-Captioner

# Download through Hugging Face
pip install -U "huggingface_hub[cli]"
huggingface-cli download Qwen/Qwen3-Omni-30B-A3B-Captioner --local-dir ./Qwen3-Omni-30B-A3B-Captioner
```

### Transformers Usage

#### Installation

The Hugging Face Transformers code for Qwen3-Omni has been merged, but the PyPI package has not yet been released, so you need to install Transformers from source using the commands below. We strongly recommend that you **create a new Python environment** to avoid runtime issues.

```bash
# If you already have transformers installed, uninstall it first, or create a new Python environment
# pip uninstall transformers
pip install git+https://github.com/huggingface/transformers
pip install accelerate
```

We offer a toolkit to help you handle various types of audio and visual input more conveniently, providing an API-like experience. It supports base64 data, URLs, and interleaved audio, images, and videos. Install it with the following command, and make sure your system has `ffmpeg` installed:

```bash
pip install qwen-omni-utils -U
```
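
As a quick sanity check, the sketch below shows how `process_mm_info` extracts multimodal inputs from a conversation; the `file://` path is a placeholder, and HTTP(S) URLs or base64 data work the same way:

```python
from qwen_omni_utils import process_mm_info

# "file:///path/to/clip.wav" is a placeholder local path
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "file:///path/to/clip.wav"},
        ],
    },
]

# Returns the extracted audio, image, and video inputs for the processor
audios, images, videos = process_mm_info(conversation, use_audio_in_video=False)
```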

Additionally, we recommend using FlashAttention 2 when running with Hugging Face Transformers to reduce GPU memory usage. However, if you primarily use [vLLM](#vllm-usage) for inference, this installation is unnecessary, as vLLM includes FlashAttention 2 by default.

```bash
pip install -U flash-attn --no-build-isolation
```

You also need hardware that is compatible with FlashAttention 2; read more in the official documentation of the [FlashAttention repository](https://github.com/Dao-AILab/flash-attention). Note that FlashAttention 2 can only be used when a model is loaded in `torch.float16` or `torch.bfloat16`.
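
If you are unsure whether your environment supports FlashAttention 2, one hedged pattern is to try it and fall back to PyTorch's native SDPA attention. This is a sketch under that assumption (the `load_model` helper is hypothetical, not part of the official instructions):

```python
import torch
from transformers import Qwen3OmniMoeForConditionalGeneration

def load_model(model_path="Qwen/Qwen3-Omni-30B-A3B-Captioner"):
    # Try FlashAttention 2 first (requires fp16/bf16 and supported hardware),
    # then fall back to PyTorch's scaled-dot-product attention.
    for impl in ("flash_attention_2", "sdpa"):
        try:
            return Qwen3OmniMoeForConditionalGeneration.from_pretrained(
                model_path,
                dtype=torch.bfloat16,
                device_map="auto",
                attn_implementation=impl,
            )
        except (ImportError, ValueError) as err:
            print(f"{impl} unavailable: {err}")
    raise RuntimeError("No usable attention implementation found")
```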

#### Code Snippet

Here is a code snippet showing how to use Qwen3-Omni-30B-A3B-Captioner with `transformers` and `qwen_omni_utils`:

```python
import soundfile as sf

from transformers import Qwen3OmniMoeForConditionalGeneration, Qwen3OmniMoeProcessor
from qwen_omni_utils import process_mm_info

MODEL_PATH = "Qwen/Qwen3-Omni-30B-A3B-Captioner"

model = Qwen3OmniMoeForConditionalGeneration.from_pretrained(
    MODEL_PATH,
    dtype="auto",
    device_map="auto",
    attn_implementation="flash_attention_2",
)

processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_PATH)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/cookbook/caption2.mp3"},
        ],
    },
]

# Preparation for inference
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, _, _ = process_mm_info(conversation, use_audio_in_video=False)
inputs = processor(text=text,
                   audio=audios,
                   return_tensors="pt",
                   padding=True,
                   use_audio_in_video=False)
inputs = inputs.to(model.device).to(model.dtype)

# Inference: generate the output text (the Captioner produces text only)
text_ids, audio = model.generate(**inputs,
                                 thinker_return_dict_in_generate=True)

text = processor.batch_decode(text_ids.sequences[:, inputs["input_ids"].shape[1] :],
                              skip_special_tokens=True,
                              clean_up_tokenization_spaces=False)
print(text)
```

### vLLM Usage

#### Installation

We strongly recommend using vLLM for inference and deployment of the Qwen3-Omni series models. Since our code is currently at the pull-request stage, you can follow the commands below to install vLLM from source. Please note that we recommend you **create a new Python environment** to avoid runtime conflicts and incompatibilities. For more details on compiling vLLM from source, please refer to the [vLLM official documentation](https://docs.vllm.ai/en/latest/getting_started/installation/gpu.html#set-up-using-python-only-build-without-compilation).

```bash
git clone -b qwen3_omni https://github.com/wangxiongts/vllm.git
cd vllm
pip install -r requirements/build.txt
pip install -r requirements/cuda.txt
export VLLM_PRECOMPILED_WHEEL_LOCATION=https://wheels.vllm.ai/a5dd03c1ebc5e4f56f3c9d3dc0436e9c582c978f/vllm-0.9.2-cp38-abi3-manylinux1_x86_64.whl
VLLM_USE_PRECOMPILED=1 pip install -e . -v --no-build-isolation
# If you hit an "Undefined symbol" error while using VLLM_USE_PRECOMPILED=1, build from source with "pip install -e . -v" instead.
# Install Transformers from source
pip install git+https://github.com/huggingface/transformers
pip install accelerate
pip install qwen-omni-utils -U
pip install -U flash-attn --no-build-isolation
```

#### Inference

Below is a simple example of how to run Qwen3-Omni-30B-A3B-Captioner with vLLM:

```python
import os
import torch

from vllm import LLM, SamplingParams
from transformers import Qwen3OmniMoeProcessor
from qwen_omni_utils import process_mm_info

if __name__ == '__main__':
    # vLLM engine v1 is not supported yet
    os.environ['VLLM_USE_V1'] = '0'

    MODEL_PATH = "Qwen/Qwen3-Omni-30B-A3B-Captioner"

    llm = LLM(
        model=MODEL_PATH, trust_remote_code=True, gpu_memory_utilization=0.95,
        tensor_parallel_size=torch.cuda.device_count(),
        limit_mm_per_prompt={'audio': 1},
        max_num_seqs=8,
        max_model_len=32768,
        seed=1234,
    )

    sampling_params = SamplingParams(
        temperature=0.6,
        top_p=0.95,
        top_k=20,
        max_tokens=16384,
    )

    processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_PATH)

    messages = [
        {
            "role": "user",
            "content": [
                {"type": "audio", "audio": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/cookbook/caption2.mp3"}
            ],
        }
    ]

    text = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    audios, _, _ = process_mm_info(messages, use_audio_in_video=False)

    inputs = {
        'prompt': text,
        'multi_modal_data': {},
    }

    if audios is not None:
        inputs['multi_modal_data']['audio'] = audios

    outputs = llm.generate([inputs], sampling_params=sampling_params)

    print(outputs[0].outputs[0].text)
```

#### vLLM Serve Usage

You can start the vLLM server with the following commands:

```bash
# Qwen3-Omni-30B-A3B-Captioner on a single GPU
vllm serve Qwen/Qwen3-Omni-30B-A3B-Captioner --port 8901 --host 127.0.0.1 --dtype bfloat16 --max-model-len 32768 --allowed-local-media-path / -tp 1
# Qwen3-Omni-30B-A3B-Captioner on multiple GPUs (example with 4 GPUs)
vllm serve Qwen/Qwen3-Omni-30B-A3B-Captioner --port 8901 --host 127.0.0.1 --dtype bfloat16 --max-model-len 32768 --allowed-local-media-path / -tp 4
```

Then you can call the API as below (via curl, for example):

```bash
curl http://localhost:8901/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "messages": [
        {"role": "user", "content": [
            {"type": "audio_url", "audio_url": {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/cookbook/caption2.mp3"}}
        ]}
    ]
    }'
```
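
Since the endpoint is OpenAI-compatible, you can also call it from Python. The sketch below uses the `openai` client; the API key is unused by a local vLLM server, so "EMPTY" is a placeholder:

```python
from openai import OpenAI

# Point the client at the local vLLM server started above
client = OpenAI(base_url="http://localhost:8901/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="Qwen/Qwen3-Omni-30B-A3B-Captioner",
    messages=[
        {"role": "user", "content": [
            {"type": "audio_url", "audio_url": {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/cookbook/caption2.mp3"}}
        ]},
    ],
)
print(completion.choices[0].message.content)
```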

<!-- ## Citation

If you find our paper and code useful in your research, please consider giving a star :star: and a citation :pencil: :)

```BibTeX
@article{Qwen3-Omni,
  title={Qwen3-Omni Technical Report},
  author={Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, Yuanjun Lv, Yongqi Wang, Dake Guo, He Wang, Linhan Ma, Pei Zhang, Xinyu Zhang, Hongkun Hao, Zishan Guo, Baosong Yang, Bin Zhang, Ziyang Ma, Xipin Wei, Shuai Bai, Keqin Chen, Xuejing Liu, Peng Wang, Mingkun Yang, Dayiheng Liu, Xingzhang Ren, Bo Zheng, Rui Men, Fan Zhou, Bowen Yu, Jianxin Yang, Le Yu, Jingren Zhou, Junyang Lin},
  journal={arXiv preprint arXiv},
  year={2025}
}
``` -->

<br>
added_tokens.json ADDED
@@ -0,0 +1,35 @@
{
  "</think>": 151668,
  "</tool_call>": 151658,
  "</tool_response>": 151666,
  "<think>": 151667,
  "<tool_call>": 151657,
  "<tool_response>": 151665,
  "<tts_pad>": 151671,
  "<tts_text_bos>": 151672,
  "<tts_text_bos_single>": 151674,
  "<tts_text_eod>": 151673,
  "<|audio_end|>": 151670,
  "<|audio_pad|>": 151675,
  "<|audio_start|>": 151669,
  "<|box_end|>": 151649,
  "<|box_start|>": 151648,
  "<|endoftext|>": 151643,
  "<|file_sep|>": 151664,
  "<|fim_middle|>": 151660,
  "<|fim_pad|>": 151662,
  "<|fim_prefix|>": 151659,
  "<|fim_suffix|>": 151661,
  "<|im_end|>": 151645,
  "<|im_start|>": 151644,
  "<|image_pad|>": 151655,
  "<|object_ref_end|>": 151647,
  "<|object_ref_start|>": 151646,
  "<|quad_end|>": 151651,
  "<|quad_start|>": 151650,
  "<|repo_name|>": 151663,
  "<|video_pad|>": 151656,
  "<|vision_end|>": 151653,
  "<|vision_pad|>": 151654,
  "<|vision_start|>": 151652
}
chat_template.jinja ADDED
@@ -0,0 +1,105 @@
{%- if tools %}
{{- '<|im_start|>system\n' }}
{%- if messages[0].role == 'system' %}{{- messages[0].content + '\n\n' }}{%- endif %}
{{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
{%- for tool in tools %}
{{- "\n" }}
{{- tool | tojson }}
{%- endfor %}
{{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
{%- else %}
{%- if messages[0].role == 'system' %}
{%- if messages[0].content is string %}
{{- '<|im_start|>system\n' + messages[0].content + '<|im_end|>\n' }}
{%- else %}
{%- for content in messages[0].content %}
{%- if content.type == 'image' or 'image' in content or 'image_url' in content %}
{{- '<|im_start|>system\n' +"<|vision_start|><|image_pad|><|vision_end|>"+ '<|im_end|>\n' }}
{%- elif content.type == 'audio' or 'audio' in content or 'audio_url' in content %}
{{- '<|im_start|>system\n' +"<|audio_start|><|audio_pad|><|audio_end|>"+ '<|im_end|>\n' }}
{%- elif content.type == 'video' or 'video' in content %}
{{- '<|im_start|>system\n' +"<|vision_start|><|video_pad|><|vision_end|>"+ '<|im_end|>\n' }}
{%- elif content.type == 'text' %}
{{- '<|im_start|>system\n' +content.text+ '<|im_end|>\n' }}
{%- endif %}
{%- endfor %}
{%- endif %}
{%- endif %}
{%- endif %}
{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}
{%- for message in messages[::-1] %}
{%- set index = (messages|length - 1) - loop.index0 %}
{%- if ns.multi_step_tool and message.role == "user" and message.content is string and not(message.content.startswith('<tool_response>') and message.content.endswith('</tool_response>')) %}
{%- set ns.multi_step_tool = false %}
{%- set ns.last_query_index = index %}
{%- endif %}
{%- endfor %}
{%- for message in messages %}
{%- if message.content is string %}
{%- set content = message.content %}
{%- else %}
{%- set content = namespace(text="") %}
{%- for mcontent in message.content %}
{%- if mcontent.type == 'image' or 'image' in mcontent or 'image_url' in mcontent %}
{%- set content.text = content.text~"<|vision_start|><|image_pad|><|vision_end|>" %}
{%- elif mcontent.type == 'audio' or 'audio' in mcontent or 'audio_url' in mcontent %}
{%- set content.text = content.text~"<|audio_start|><|audio_pad|><|audio_end|>" %}
{%- elif mcontent.type == 'video' or 'video' in mcontent %}
{%- set content.text = content.text~"<|vision_start|><|video_pad|><|vision_end|>" %}
{%- elif mcontent.type == 'text' %}
{%- set content.text = content.text~mcontent.text %}
{%- endif %}
{%- endfor %}
{%- set content = content.text %}
{%- endif %}
{%- if (message.role == "user") or (message.role == "system" and not loop.first) %}
{{- '<|im_start|>' + message.role + '\n' + content + '<|im_end|>' + '\n' }}
{%- elif message.role == "assistant" %}
{%- set reasoning_content = "" %}
{%- if message.reasoning_content is string %}
{%- set reasoning_content = message.reasoning_content %}
{%- else %}
{%- if '</think>' in content %}
{%- set reasoning_content = content.split('</think>')[0].rstrip('\n').split('<think>')[-1].lstrip('\n') %}
{%- set content = content.split('</think>')[-1].lstrip('\n') %}
{%- endif %}
{%- endif %}
{%- if loop.index0 > ns.last_query_index %}
{%- if loop.last or (not loop.last and reasoning_content) %}
{{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content.strip("\n") + '\n</think>\n\n' + content.lstrip('\n') }}
{%- else %}
{{- '<|im_start|>' + message.role + '\n' + content }}
{%- endif %}
{%- else %}
{{- '<|im_start|>' + message.role + '\n' + content }}
{%- endif %}
{%- if message.tool_calls %}
{%- for tool_call in message.tool_calls %}
{%- if (loop.first and content) or (not loop.first) %}{{- '\n' }}{%- endif %}
{%- if tool_call.function %}
{%- set tool_call = tool_call.function %}
{%- endif %}
{{- '<tool_call>\n{"name": "' }}
{{- tool_call.name }}
{{- '", "arguments": ' }}
{%- if tool_call.arguments is string %}
{{- tool_call.arguments }}
{%- else %}
{{- tool_call.arguments | tojson }}
{%- endif %}
{{- '}\n</tool_call>' }}
{%- endfor %}
{%- endif %}
{{- '<|im_end|>\n' }}
{%- elif message.role == "tool" %}
{%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}{{- '<|im_start|>user' }}{%- endif %}
{{- '\n<tool_response>\n' }}
{{- content }}
{{- '\n</tool_response>' }}
{%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}{{- '<|im_end|>\n' }}{%- endif %}
{%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
{{- '<|im_start|>assistant\n' }}
{%- if enable_thinking is defined and enable_thinking is false %}{{- '<think>\n\n</think>\n\n' }}{%- endif %}
{%- endif %}
config.json ADDED
@@ -0,0 +1,692 @@
{
  "architectures": [
    "Qwen3OmniMoeForConditionalGeneration"
  ],
  "assistant_token_id": 77091,
  "dtype": "bfloat16",
  "enable_audio_output": false,
  "im_end_token_id": 151645,
  "im_start_token_id": 151644,
  "model_type": "qwen3_omni_moe",
  "system_token_id": 8948,
  "thinker_config": {
    "audio_config": {
      "_name_or_path": "",
      "activation_dropout": 0,
      "activation_function": "gelu",
      "add_cross_attention": false,
      "architectures": null,
      "attention_dropout": 0,
      "bad_words_ids": null,
      "begin_suppress_tokens": null,
      "bos_token_id": null,
      "chunk_size_feed_forward": 0,
      "conv_chunksize": 500,
      "cross_attention_hidden_size": null,
      "d_model": 1280,
      "decoder_start_token_id": null,
      "diversity_penalty": 0.0,
      "do_sample": false,
      "downsample_hidden_size": 480,
      "dropout": 0,
      "dtype": null,
      "early_stopping": false,
      "encoder_attention_heads": 20,
      "encoder_ffn_dim": 5120,
      "encoder_layers": 32,
      "encoder_no_repeat_ngram_size": 0,
      "eos_token_id": null,
      "exponential_decay_length_penalty": null,
      "finetuning_task": null,
      "forced_bos_token_id": null,
      "forced_eos_token_id": null,
      "id2label": {
        "0": "LABEL_0",
        "1": "LABEL_1"
      },
      "initializer_range": 0.02,
      "is_decoder": false,
      "is_encoder_decoder": false,
      "label2id": {
        "LABEL_0": 0,
        "LABEL_1": 1
      },
      "length_penalty": 1.0,
      "max_length": 20,
      "max_source_positions": 1500,
      "min_length": 0,
      "model_type": "qwen3_omni_moe_audio_encoder",
      "n_window": 50,
      "n_window_infer": 800,
      "no_repeat_ngram_size": 0,
      "num_beam_groups": 1,
      "num_beams": 1,
      "num_hidden_layers": 32,
      "num_mel_bins": 128,
      "num_return_sequences": 1,
      "output_attentions": false,
      "output_dim": 2048,
      "output_hidden_states": false,
      "output_scores": false,
      "pad_token_id": null,
      "prefix": null,
      "problem_type": null,
      "pruned_heads": {},
      "remove_invalid_values": false,
      "repetition_penalty": 1.0,
      "return_dict": true,
      "return_dict_in_generate": false,
      "scale_embedding": false,
      "sep_token_id": null,
      "suppress_tokens": null,
      "task_specific_params": null,
      "temperature": 1.0,
      "tf_legacy_loss": false,
      "tie_encoder_decoder": false,
      "tie_word_embeddings": true,
      "tokenizer_class": null,
      "top_k": 50,
      "top_p": 1.0,
      "torchscript": false,
      "typical_p": 1.0,
      "use_bfloat16": false
    },
    "audio_end_token_id": 151670,
    "audio_start_token_id": 151669,
    "audio_token_id": 151675,
    "dtype": "bfloat16",
    "image_token_id": 151655,
    "initializer_range": 0.02,
    "model_type": "qwen3_omni_moe_thinker",
    "position_id_per_seconds": 13,
    "seconds_per_chunk": 2,
    "text_config": {
      "_name_or_path": "",
      "add_cross_attention": false,
      "architectures": null,
      "attention_bias": false,
      "attention_dropout": 0.0,
      "bad_words_ids": null,
      "begin_suppress_tokens": null,
      "bos_token_id": null,
      "chunk_size_feed_forward": 0,
      "cross_attention_hidden_size": null,
      "decoder_sparse_step": 1,
      "decoder_start_token_id": null,
      "diversity_penalty": 0.0,
      "do_sample": true,
      "dtype": null,
      "early_stopping": false,
      "encoder_no_repeat_ngram_size": 0,
      "eos_token_id": null,
      "exponential_decay_length_penalty": null,
      "finetuning_task": null,
      "forced_bos_token_id": null,
      "forced_eos_token_id": null,
      "head_dim": 128,
      "hidden_act": "silu",
      "hidden_size": 2048,
      "id2label": {
        "0": "LABEL_0",
        "1": "LABEL_1"
      },
      "initializer_range": 0.02,
      "intermediate_size": 768,
      "is_decoder": false,
      "is_encoder_decoder": false,
      "label2id": {
        "LABEL_0": 0,
        "LABEL_1": 1
      },
      "length_penalty": 1.0,
      "max_length": 20,
      "max_position_embeddings": 65536,
      "min_length": 0,
      "mlp_only_layers": [],
      "model_type": "qwen3_omni_moe_text",
      "moe_intermediate_size": 768,
      "no_repeat_ngram_size": 0,
      "norm_topk_prob": true,
      "num_attention_heads": 32,
      "num_beam_groups": 1,
      "num_beams": 1,
      "num_experts": 128,
      "num_experts_per_tok": 8,
      "num_hidden_layers": 48,
      "num_key_value_heads": 4,
      "num_return_sequences": 1,
      "output_attentions": false,
      "output_hidden_states": false,
      "output_router_logits": false,
      "output_scores": false,
      "pad_token_id": null,
      "prefix": null,
      "problem_type": null,
      "pruned_heads": {},
      "remove_invalid_values": false,
      "repetition_penalty": 1.0,
      "return_dict": true,
      "return_dict_in_generate": false,
      "rms_norm_eps": 1e-06,
      "rope_scaling": {
        "interleaved": true,
        "mrope_interleaved": true,
        "mrope_section": [
          24,
          20,
          20
        ],
        "rope_type": "default",
        "type": "default"
      },
      "rope_theta": 1000000,
      "router_aux_loss_coef": 0.001,
      "sep_token_id": null,
      "shared_expert_intermediate_size": 0,
      "sliding_window": null,
      "suppress_tokens": null,
      "task_specific_params": null,
      "temperature": 1.0,
      "tf_legacy_loss": false,
      "tie_encoder_decoder": false,
      "tie_word_embeddings": false,
      "tokenizer_class": null,
      "top_k": 50,
      "top_p": 1.0,
      "torchscript": false,
      "typical_p": 1.0,
      "use_bfloat16": false,
      "use_cache": true,
      "use_qk_norm": true,
      "use_sliding_window": false,
      "vocab_size": 152064
    },
    "user_token_id": 872,
    "video_token_id": 151656,
    "vision_config": {
      "_name_or_path": "",
      "add_cross_attention": false,
      "apply_vit_abs_pos_embed": true,
      "architectures": null,
      "bad_words_ids": null,
      "begin_suppress_tokens": null,
      "bos_token_id": null,
      "chunk_size_feed_forward": 0,
      "cross_attention_hidden_size": null,
      "decoder_start_token_id": null,
      "deepstack_visual_indexes": [
        8,
        16,
        24
      ],
      "depth": 27,
      "diversity_penalty": 0.0,
      "do_sample": false,
      "dtype": null,
      "early_stopping": false,
      "encoder_no_repeat_ngram_size": 0,
      "eos_token_id": null,
      "exponential_decay_length_penalty": null,
      "finetuning_task": null,
      "forced_bos_token_id": null,
      "forced_eos_token_id": null,
      "hidden_act": "gelu_pytorch_tanh",
      "hidden_size": 1152,
      "id2label": {
        "0": "LABEL_0",
        "1": "LABEL_1"
      },
      "image_size": 768,
      "in_channels": 3,
      "in_chans": 3,
      "initializer_range": 0.02,
      "intermediate_size": 4304,
      "is_decoder": false,
      "is_encoder_decoder": false,
      "label2id": {
        "LABEL_0": 0,
        "LABEL_1": 1
      },
      "length_penalty": 1.0,
      "max_length": 20,
      "min_length": 0,
      "model_type": "qwen3_omni_moe_vision_encoder",
      "no_repeat_ngram_size": 0,
      "num_beam_groups": 1,
      "num_beams": 1,
      "num_heads": 16,
      "num_return_sequences": 1,
      "out_hidden_size": 2048,
      "output_attentions": false,
      "output_hidden_states": false,
      "output_scores": false,
      "pad_token_id": null,
      "patch_size": 16,
      "prefix": null,
      "problem_type": null,
      "pruned_heads": {},
      "remove_invalid_values": false,
      "repetition_penalty": 1.0,
      "return_dict": true,
      "return_dict_in_generate": false,
      "sep_token_id": null,
      "spatial_merge_size": 2,
      "spatial_patch_size": 16,
      "suppress_tokens": null,
      "task_specific_params": null,
      "temperature": 1.0,
      "temporal_patch_size": 2,
      "tf_legacy_loss": false,
      "tie_encoder_decoder": false,
      "tie_word_embeddings": true,
      "tokenizer_class": null,
      "tokens_per_second": 2,
      "top_k": 50,
      "top_p": 1.0,
      "torchscript": false,
      "typical_p": 1.0,
      "use_bfloat16": false
    },
    "vision_end_token_id": 151653,
    "vision_start_token_id": 151652
  },
  "quantization_config": {
    "config_groups": {
      "group_0": {
        "format": "pack-quantized",
        "input_activations": null,
        "output_activations": null,
        "targets": [
          "Linear"
        ],
        "weights": {
          "actorder": null,
          "block_structure": null,
          "dynamic": false,
          "group_size": 32,
          "num_bits": 4,
          "observer": "mse",
          "observer_kwargs": {},
          "strategy": "group",
          "symmetric": true,
          "type": "int"
        }
      }
    },
    "format": "pack-quantized",
    "global_compression_ratio": null,
    "ignore": [
      "thinker.audio_tower.layers.0.self_attn.k_proj",
      "thinker.audio_tower.layers.0.self_attn.v_proj",
      "thinker.audio_tower.layers.0.self_attn.q_proj",
      "thinker.audio_tower.layers.0.self_attn.out_proj",
      "thinker.audio_tower.layers.0.fc1",
      "thinker.audio_tower.layers.0.fc2",
      "thinker.audio_tower.layers.1.self_attn.k_proj",
      "thinker.audio_tower.layers.1.self_attn.v_proj",
      "thinker.audio_tower.layers.1.self_attn.q_proj",
      "thinker.audio_tower.layers.1.self_attn.out_proj",
      "thinker.audio_tower.layers.1.fc1",
      "thinker.audio_tower.layers.1.fc2",
      "thinker.audio_tower.layers.2.self_attn.k_proj",
      "thinker.audio_tower.layers.2.self_attn.v_proj",
      "thinker.audio_tower.layers.2.self_attn.q_proj",
      "thinker.audio_tower.layers.2.self_attn.out_proj",
      "thinker.audio_tower.layers.2.fc1",
      "thinker.audio_tower.layers.2.fc2",
      "thinker.audio_tower.layers.3.self_attn.k_proj",
      "thinker.audio_tower.layers.3.self_attn.v_proj",
      "thinker.audio_tower.layers.3.self_attn.q_proj",
      "thinker.audio_tower.layers.3.self_attn.out_proj",
      "thinker.audio_tower.layers.3.fc1",
      "thinker.audio_tower.layers.3.fc2",
      "thinker.audio_tower.layers.4.self_attn.k_proj",
      "thinker.audio_tower.layers.4.self_attn.v_proj",
      "thinker.audio_tower.layers.4.self_attn.q_proj",
      "thinker.audio_tower.layers.4.self_attn.out_proj",
      "thinker.audio_tower.layers.4.fc1",
      "thinker.audio_tower.layers.4.fc2",
      "thinker.audio_tower.layers.5.self_attn.k_proj",
      "thinker.audio_tower.layers.5.self_attn.v_proj",
      "thinker.audio_tower.layers.5.self_attn.q_proj",
      "thinker.audio_tower.layers.5.self_attn.out_proj",
      "thinker.audio_tower.layers.5.fc1",
      "thinker.audio_tower.layers.5.fc2",
      "thinker.audio_tower.layers.6.self_attn.k_proj",
      "thinker.audio_tower.layers.6.self_attn.v_proj",
      "thinker.audio_tower.layers.6.self_attn.q_proj",
      "thinker.audio_tower.layers.6.self_attn.out_proj",
      "thinker.audio_tower.layers.6.fc1",
      "thinker.audio_tower.layers.6.fc2",
      "thinker.audio_tower.layers.7.self_attn.k_proj",
      "thinker.audio_tower.layers.7.self_attn.v_proj",
      "thinker.audio_tower.layers.7.self_attn.q_proj",
      "thinker.audio_tower.layers.7.self_attn.out_proj",
      "thinker.audio_tower.layers.7.fc1",
      "thinker.audio_tower.layers.7.fc2",
      "thinker.audio_tower.layers.8.self_attn.k_proj",
      "thinker.audio_tower.layers.8.self_attn.v_proj",
      "thinker.audio_tower.layers.8.self_attn.q_proj",
      "thinker.audio_tower.layers.8.self_attn.out_proj",
      "thinker.audio_tower.layers.8.fc1",
      "thinker.audio_tower.layers.8.fc2",
      "thinker.audio_tower.layers.9.self_attn.k_proj",
      "thinker.audio_tower.layers.9.self_attn.v_proj",
      "thinker.audio_tower.layers.9.self_attn.q_proj",
      "thinker.audio_tower.layers.9.self_attn.out_proj",
      "thinker.audio_tower.layers.9.fc1",
      "thinker.audio_tower.layers.9.fc2",
      "thinker.audio_tower.layers.10.self_attn.k_proj",
      "thinker.audio_tower.layers.10.self_attn.v_proj",
      "thinker.audio_tower.layers.10.self_attn.q_proj",
      "thinker.audio_tower.layers.10.self_attn.out_proj",
      "thinker.audio_tower.layers.10.fc1",
      "thinker.audio_tower.layers.10.fc2",
      "thinker.audio_tower.layers.11.self_attn.k_proj",
      "thinker.audio_tower.layers.11.self_attn.v_proj",
      "thinker.audio_tower.layers.11.self_attn.q_proj",
      "thinker.audio_tower.layers.11.self_attn.out_proj",
      "thinker.audio_tower.layers.11.fc1",
      "thinker.audio_tower.layers.11.fc2",
      "thinker.audio_tower.layers.12.self_attn.k_proj",
      "thinker.audio_tower.layers.12.self_attn.v_proj",
      "thinker.audio_tower.layers.12.self_attn.q_proj",
      "thinker.audio_tower.layers.12.self_attn.out_proj",
      "thinker.audio_tower.layers.12.fc1",
      "thinker.audio_tower.layers.12.fc2",
      "thinker.audio_tower.layers.13.self_attn.k_proj",
      "thinker.audio_tower.layers.13.self_attn.v_proj",
      "thinker.audio_tower.layers.13.self_attn.q_proj",
      "thinker.audio_tower.layers.13.self_attn.out_proj",
      "thinker.audio_tower.layers.13.fc1",
      "thinker.audio_tower.layers.13.fc2",
      "thinker.audio_tower.layers.14.self_attn.k_proj",
      "thinker.audio_tower.layers.14.self_attn.v_proj",
      "thinker.audio_tower.layers.14.self_attn.q_proj",
      "thinker.audio_tower.layers.14.self_attn.out_proj",
      "thinker.audio_tower.layers.14.fc1",
      "thinker.audio_tower.layers.14.fc2",
      "thinker.audio_tower.layers.15.self_attn.k_proj",
      "thinker.audio_tower.layers.15.self_attn.v_proj",
      "thinker.audio_tower.layers.15.self_attn.q_proj",
      "thinker.audio_tower.layers.15.self_attn.out_proj",
      "thinker.audio_tower.layers.15.fc1",
      "thinker.audio_tower.layers.15.fc2",
      "thinker.audio_tower.layers.16.self_attn.k_proj",
      "thinker.audio_tower.layers.16.self_attn.v_proj",
      "thinker.audio_tower.layers.16.self_attn.q_proj",
      "thinker.audio_tower.layers.16.self_attn.out_proj",
      "thinker.audio_tower.layers.16.fc1",
      "thinker.audio_tower.layers.16.fc2",
      "thinker.audio_tower.layers.17.self_attn.k_proj",
      "thinker.audio_tower.layers.17.self_attn.v_proj",
      "thinker.audio_tower.layers.17.self_attn.q_proj",
      "thinker.audio_tower.layers.17.self_attn.out_proj",
      "thinker.audio_tower.layers.17.fc1",
      "thinker.audio_tower.layers.17.fc2",
      "thinker.audio_tower.layers.18.self_attn.k_proj",
      "thinker.audio_tower.layers.18.self_attn.v_proj",
      "thinker.audio_tower.layers.18.self_attn.q_proj",
      "thinker.audio_tower.layers.18.self_attn.out_proj",
      "thinker.audio_tower.layers.18.fc1",
      "thinker.audio_tower.layers.18.fc2",
      "thinker.audio_tower.layers.19.self_attn.k_proj",
      "thinker.audio_tower.layers.19.self_attn.v_proj",
      "thinker.audio_tower.layers.19.self_attn.q_proj",
      "thinker.audio_tower.layers.19.self_attn.out_proj",
      "thinker.audio_tower.layers.19.fc1",
      "thinker.audio_tower.layers.19.fc2",
      "thinker.audio_tower.layers.20.self_attn.k_proj",
      "thinker.audio_tower.layers.20.self_attn.v_proj",
      "thinker.audio_tower.layers.20.self_attn.q_proj",
      "thinker.audio_tower.layers.20.self_attn.out_proj",
      "thinker.audio_tower.layers.20.fc1",
      "thinker.audio_tower.layers.20.fc2",
      "thinker.audio_tower.layers.21.self_attn.k_proj",
      "thinker.audio_tower.layers.21.self_attn.v_proj",
      "thinker.audio_tower.layers.21.self_attn.q_proj",
      "thinker.audio_tower.layers.21.self_attn.out_proj",
      "thinker.audio_tower.layers.21.fc1",
      "thinker.audio_tower.layers.21.fc2",
      "thinker.audio_tower.layers.22.self_attn.k_proj",
      "thinker.audio_tower.layers.22.self_attn.v_proj",
      "thinker.audio_tower.layers.22.self_attn.q_proj",
      "thinker.audio_tower.layers.22.self_attn.out_proj",
      "thinker.audio_tower.layers.22.fc1",
      "thinker.audio_tower.layers.22.fc2",
      "thinker.audio_tower.layers.23.self_attn.k_proj",
      "thinker.audio_tower.layers.23.self_attn.v_proj",
      "thinker.audio_tower.layers.23.self_attn.q_proj",
      "thinker.audio_tower.layers.23.self_attn.out_proj",
      "thinker.audio_tower.layers.23.fc1",
      "thinker.audio_tower.layers.23.fc2",
      "thinker.audio_tower.layers.24.self_attn.k_proj",
      "thinker.audio_tower.layers.24.self_attn.v_proj",
      "thinker.audio_tower.layers.24.self_attn.q_proj",
      "thinker.audio_tower.layers.24.self_attn.out_proj",
      "thinker.audio_tower.layers.24.fc1",
      "thinker.audio_tower.layers.24.fc2",
      "thinker.audio_tower.layers.25.self_attn.k_proj",
      "thinker.audio_tower.layers.25.self_attn.v_proj",
      "thinker.audio_tower.layers.25.self_attn.q_proj",
      "thinker.audio_tower.layers.25.self_attn.out_proj",
      "thinker.audio_tower.layers.25.fc1",
      "thinker.audio_tower.layers.25.fc2",
      "thinker.audio_tower.layers.26.self_attn.k_proj",
      "thinker.audio_tower.layers.26.self_attn.v_proj",
      "thinker.audio_tower.layers.26.self_attn.q_proj",
      "thinker.audio_tower.layers.26.self_attn.out_proj",
      "thinker.audio_tower.layers.26.fc1",
      "thinker.audio_tower.layers.26.fc2",
      "thinker.audio_tower.layers.27.self_attn.k_proj",
      "thinker.audio_tower.layers.27.self_attn.v_proj",
      "thinker.audio_tower.layers.27.self_attn.q_proj",
      "thinker.audio_tower.layers.27.self_attn.out_proj",
      "thinker.audio_tower.layers.27.fc1",
      "thinker.audio_tower.layers.27.fc2",
      "thinker.audio_tower.layers.28.self_attn.k_proj",
      "thinker.audio_tower.layers.28.self_attn.v_proj",
      "thinker.audio_tower.layers.28.self_attn.q_proj",
      "thinker.audio_tower.layers.28.self_attn.out_proj",
      "thinker.audio_tower.layers.28.fc1",
      "thinker.audio_tower.layers.28.fc2",
      "thinker.audio_tower.layers.29.self_attn.k_proj",
      "thinker.audio_tower.layers.29.self_attn.v_proj",
      "thinker.audio_tower.layers.29.self_attn.q_proj",
      "thinker.audio_tower.layers.29.self_attn.out_proj",
      "thinker.audio_tower.layers.29.fc1",
      "thinker.audio_tower.layers.29.fc2",
      "thinker.audio_tower.layers.30.self_attn.k_proj",
      "thinker.audio_tower.layers.30.self_attn.v_proj",
      "thinker.audio_tower.layers.30.self_attn.q_proj",
      "thinker.audio_tower.layers.30.self_attn.out_proj",
      "thinker.audio_tower.layers.30.fc1",
      "thinker.audio_tower.layers.30.fc2",
      "thinker.audio_tower.layers.31.self_attn.k_proj",
      "thinker.audio_tower.layers.31.self_attn.v_proj",
      "thinker.audio_tower.layers.31.self_attn.q_proj",
      "thinker.audio_tower.layers.31.self_attn.out_proj",
      "thinker.audio_tower.layers.31.fc1",
      "thinker.audio_tower.layers.31.fc2",
      "thinker.audio_tower.conv_out",
      "thinker.audio_tower.proj1",
      "thinker.audio_tower.proj2",
      "thinker.visual.merger_list.0.mlp.0",
      "thinker.visual.merger_list.0.mlp.2",
      "thinker.visual.merger_list.1.mlp.0",
      "thinker.visual.merger_list.1.mlp.2",
      "thinker.visual.merger_list.2.mlp.0",
      "thinker.visual.merger_list.2.mlp.2",
      "thinker.visual.blocks.0.attn.qkv",
      "thinker.visual.blocks.0.attn.proj",
      "thinker.visual.blocks.0.mlp.linear_fc1",
      "thinker.visual.blocks.0.mlp.linear_fc2",
      "thinker.visual.blocks.1.attn.qkv",
      "thinker.visual.blocks.1.attn.proj",
      "thinker.visual.blocks.1.mlp.linear_fc1",
      "thinker.visual.blocks.1.mlp.linear_fc2",
      "thinker.visual.blocks.2.attn.qkv",
      "thinker.visual.blocks.2.attn.proj",
      "thinker.visual.blocks.2.mlp.linear_fc1",
      "thinker.visual.blocks.2.mlp.linear_fc2",
      "thinker.visual.blocks.3.attn.qkv",
      "thinker.visual.blocks.3.attn.proj",
      "thinker.visual.blocks.3.mlp.linear_fc1",
      "thinker.visual.blocks.3.mlp.linear_fc2",
      "thinker.visual.blocks.4.attn.qkv",
      "thinker.visual.blocks.4.attn.proj",
      "thinker.visual.blocks.4.mlp.linear_fc1",
      "thinker.visual.blocks.4.mlp.linear_fc2",
      "thinker.visual.blocks.5.attn.qkv",
      "thinker.visual.blocks.5.attn.proj",
      "thinker.visual.blocks.5.mlp.linear_fc1",
      "thinker.visual.blocks.5.mlp.linear_fc2",
      "thinker.visual.blocks.6.attn.qkv",
      "thinker.visual.blocks.6.attn.proj",
      "thinker.visual.blocks.6.mlp.linear_fc1",
      "thinker.visual.blocks.6.mlp.linear_fc2",
      "thinker.visual.blocks.7.attn.qkv",
      "thinker.visual.blocks.7.attn.proj",
      "thinker.visual.blocks.7.mlp.linear_fc1",
      "thinker.visual.blocks.7.mlp.linear_fc2",
      "thinker.visual.blocks.8.attn.qkv",
      "thinker.visual.blocks.8.attn.proj",
      "thinker.visual.blocks.8.mlp.linear_fc1",
      "thinker.visual.blocks.8.mlp.linear_fc2",
      "thinker.visual.blocks.9.attn.qkv",
      "thinker.visual.blocks.9.attn.proj",
      "thinker.visual.blocks.9.mlp.linear_fc1",
      "thinker.visual.blocks.9.mlp.linear_fc2",
      "thinker.visual.blocks.10.attn.qkv",
      "thinker.visual.blocks.10.attn.proj",
      "thinker.visual.blocks.10.mlp.linear_fc1",
      "thinker.visual.blocks.10.mlp.linear_fc2",
      "thinker.visual.blocks.11.attn.qkv",
      "thinker.visual.blocks.11.attn.proj",
      "thinker.visual.blocks.11.mlp.linear_fc1",
      "thinker.visual.blocks.11.mlp.linear_fc2",
      "thinker.visual.blocks.12.attn.qkv",
      "thinker.visual.blocks.12.attn.proj",
      "thinker.visual.blocks.12.mlp.linear_fc1",
      "thinker.visual.blocks.12.mlp.linear_fc2",
      "thinker.visual.blocks.13.attn.qkv",
      "thinker.visual.blocks.13.attn.proj",
      "thinker.visual.blocks.13.mlp.linear_fc1",
      "thinker.visual.blocks.13.mlp.linear_fc2",
      "thinker.visual.blocks.14.attn.qkv",
      "thinker.visual.blocks.14.attn.proj",
      "thinker.visual.blocks.14.mlp.linear_fc1",
      "thinker.visual.blocks.14.mlp.linear_fc2",
      "thinker.visual.blocks.15.attn.qkv",
      "thinker.visual.blocks.15.attn.proj",
      "thinker.visual.blocks.15.mlp.linear_fc1",
      "thinker.visual.blocks.15.mlp.linear_fc2",
      "thinker.visual.blocks.16.attn.qkv",
      "thinker.visual.blocks.16.attn.proj",
      "thinker.visual.blocks.16.mlp.linear_fc1",
      "thinker.visual.blocks.16.mlp.linear_fc2",
      "thinker.visual.blocks.17.attn.qkv",
      "thinker.visual.blocks.17.attn.proj",
      "thinker.visual.blocks.17.mlp.linear_fc1",
      "thinker.visual.blocks.17.mlp.linear_fc2",
      "thinker.visual.blocks.18.attn.qkv",
      "thinker.visual.blocks.18.attn.proj",
      "thinker.visual.blocks.18.mlp.linear_fc1",
      "thinker.visual.blocks.18.mlp.linear_fc2",
      "thinker.visual.blocks.19.attn.qkv",
      "thinker.visual.blocks.19.attn.proj",
      "thinker.visual.blocks.19.mlp.linear_fc1",
      "thinker.visual.blocks.19.mlp.linear_fc2",
      "thinker.visual.blocks.20.attn.qkv",
      "thinker.visual.blocks.20.attn.proj",
      "thinker.visual.blocks.20.mlp.linear_fc1",
      "thinker.visual.blocks.20.mlp.linear_fc2",
      "thinker.visual.blocks.21.attn.qkv",
      "thinker.visual.blocks.21.attn.proj",
      "thinker.visual.blocks.21.mlp.linear_fc1",
      "thinker.visual.blocks.21.mlp.linear_fc2",
      "thinker.visual.blocks.22.attn.qkv",
      "thinker.visual.blocks.22.attn.proj",
      "thinker.visual.blocks.22.mlp.linear_fc1",
      "thinker.visual.blocks.22.mlp.linear_fc2",
      "thinker.visual.blocks.23.attn.qkv",
      "thinker.visual.blocks.23.attn.proj",
      "thinker.visual.blocks.23.mlp.linear_fc1",
      "thinker.visual.blocks.23.mlp.linear_fc2",
      "thinker.visual.blocks.24.attn.qkv",
      "thinker.visual.blocks.24.attn.proj",
      "thinker.visual.blocks.24.mlp.linear_fc1",
      "thinker.visual.blocks.24.mlp.linear_fc2",
      "thinker.visual.blocks.25.attn.qkv",
      "thinker.visual.blocks.25.attn.proj",
      "thinker.visual.blocks.25.mlp.linear_fc1",
      "thinker.visual.blocks.25.mlp.linear_fc2",
      "thinker.visual.blocks.26.attn.qkv",
      "thinker.visual.blocks.26.attn.proj",
      "thinker.visual.blocks.26.mlp.linear_fc1",
      "thinker.visual.blocks.26.mlp.linear_fc2",
      "thinker.visual.merger.mlp.0",
      "thinker.visual.merger.mlp.2",
      "thinker.model.layers.0.mlp.gate",
      "thinker.model.layers.1.mlp.gate",
      "thinker.model.layers.2.mlp.gate",
      "thinker.model.layers.3.mlp.gate",
      "thinker.model.layers.4.mlp.gate",
      "thinker.model.layers.5.mlp.gate",
      "thinker.model.layers.6.mlp.gate",
      "thinker.model.layers.7.mlp.gate",
      "thinker.model.layers.8.mlp.gate",
      "thinker.model.layers.9.mlp.gate",
      "thinker.model.layers.10.mlp.gate",
      "thinker.model.layers.11.mlp.gate",
      "thinker.model.layers.12.mlp.gate",
      "thinker.model.layers.13.mlp.gate",
      "thinker.model.layers.14.mlp.gate",
      "thinker.model.layers.15.mlp.gate",
      "thinker.model.layers.16.mlp.gate",
      "thinker.model.layers.17.mlp.gate",
      "thinker.model.layers.18.mlp.gate",
      "thinker.model.layers.19.mlp.gate",
      "thinker.model.layers.20.mlp.gate",
      "thinker.model.layers.21.mlp.gate",
      "thinker.model.layers.22.mlp.gate",
      "thinker.model.layers.23.mlp.gate",
      "thinker.model.layers.24.mlp.gate",
      "thinker.model.layers.25.mlp.gate",
      "thinker.model.layers.26.mlp.gate",
      "thinker.model.layers.27.mlp.gate",
      "thinker.model.layers.28.mlp.gate",
      "thinker.model.layers.29.mlp.gate",
      "thinker.model.layers.30.mlp.gate",
      "thinker.model.layers.31.mlp.gate",
      "thinker.model.layers.32.mlp.gate",
      "thinker.model.layers.33.mlp.gate",
      "thinker.model.layers.34.mlp.gate",
      "thinker.model.layers.35.mlp.gate",
      "thinker.model.layers.36.mlp.gate",
      "thinker.model.layers.37.mlp.gate",
      "thinker.model.layers.38.mlp.gate",
      "thinker.model.layers.39.mlp.gate",
      "thinker.model.layers.40.mlp.gate",
      "thinker.model.layers.41.mlp.gate",
      "thinker.model.layers.42.mlp.gate",
      "thinker.model.layers.43.mlp.gate",
      "thinker.model.layers.44.mlp.gate",
      "thinker.model.layers.45.mlp.gate",
      "thinker.model.layers.46.mlp.gate",
      "thinker.model.layers.47.mlp.gate",
      "thinker.lm_head"
    ],
    "kv_cache_scheme": null,
    "quant_method": "compressed-tensors",
    "quantization_status": "compressed",
    "sparsity_config": {},
    "transform_config": {},
    "version": "0.11.0"
  },
  "transformers_version": "4.57.0.dev0",
  "tts_bos_token_id": 151672,
  "tts_eos_token_id": 151673,
  "tts_pad_token_id": 151671,
  "user_token_id": 872
}
generation_config.json ADDED
@@ -0,0 +1,7 @@
{
  "max_new_tokens": 32768,
  "repetition_penalty": 1.0,
  "temperature": 0.6,
  "top_k": 20,
  "top_p": 0.95
}
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model-00001-of-00005.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:b40b8dca1d63858cacab6d8b4750e7586f2adde06f0415ee8d7ebc46e6b0c59d
size 5000651392
model-00002-of-00005.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:7d80f45951ffe31d9a7790243d8fb21d2fdbdf4531d25e23ece55f3603701595
size 5001406576
model-00003-of-00005.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:20d65074b354032338f041d52b4ab107142b8e7b7c9f6b9849d3444e4c2034a0
size 5001941584
model-00004-of-00005.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:b650ab7cc72508f466e23faeb2ecbc97834cf36896f8a24c735ca8e077f88db1
size 4842354408
model-00005-of-00005.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:2fa646c77d3a76d216d7f7cd6d7c75a81754d661fe22ec7551d97e288e0eaca3
size 622854280
model.safetensors.index.json ADDED
The diff for this file is too large to render. See raw diff
 
preprocessor_config.json ADDED
@@ -0,0 +1,31 @@
{
  "chunk_length": 30,
  "dither": 0.0,
  "feature_extractor_type": "WhisperFeatureExtractor",
  "feature_size": 128,
  "hop_length": 160,
  "image_mean": [
    0.5,
    0.5,
    0.5
  ],
  "image_processor_type": "Qwen2VLImageProcessor",
  "image_std": [
    0.5,
    0.5,
    0.5
  ],
  "max_pixels": 12845056,
  "merge_size": 2,
  "min_pixels": 3136,
  "n_fft": 400,
  "n_samples": 480000,
  "nb_max_frames": 3000,
  "padding_side": "right",
  "padding_value": 0.0,
  "patch_size": 16,
  "processor_class": "Qwen3OmniMoeProcessor",
  "return_attention_mask": true,
  "sampling_rate": 16000,
  "temporal_patch_size": 2
}
special_tokens_map.json ADDED
@@ -0,0 +1,44 @@
{
  "additional_special_tokens": [
    "<|im_start|>",
    "<|im_end|>",
    "<|object_ref_start|>",
    "<|object_ref_end|>",
    "<|box_start|>",
    "<|box_end|>",
    "<|quad_start|>",
    "<|quad_end|>",
    "<|vision_start|>",
    "<|vision_end|>",
    "<|vision_pad|>",
    "<|image_pad|>",
    "<|video_pad|>",
    "<|audio_start|>",
    "<|audio_end|>",
    "<tts_pad>",
    "<tts_text_bos>",
    "<tts_text_bos_single>",
    "<|audio_pad|>"
  ],
  "audio_bos_token": "<|audio_start|>",
  "audio_eos_token": "<|audio_end|>",
  "audio_token": "<|audio_pad|>",
  "eos_token": {
    "content": "<|im_end|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "image_token": "<|image_pad|>",
  "pad_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "video_token": "<|video_pad|>",
  "vision_bos_token": "<|vision_start|>",
  "vision_eos_token": "<|vision_end|>"
}
tokenizer.json ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:09267689b8362020b9763b65dd5be7e086b31e28d72e02837a9e781de9a91bc7
size 11423986
tokenizer_config.json ADDED
@@ -0,0 +1,317 @@
{
  "add_bos_token": false,
  "add_prefix_space": false,
  "added_tokens_decoder": {
    "151643": {
      "content": "<|endoftext|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151644": {
      "content": "<|im_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151645": {
      "content": "<|im_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151646": {
      "content": "<|object_ref_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151647": {
      "content": "<|object_ref_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151648": {
      "content": "<|box_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151649": {
      "content": "<|box_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151650": {
      "content": "<|quad_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151651": {
      "content": "<|quad_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151652": {
      "content": "<|vision_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151653": {
      "content": "<|vision_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151654": {
      "content": "<|vision_pad|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151655": {
      "content": "<|image_pad|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151656": {
      "content": "<|video_pad|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151657": {
      "content": "<tool_call>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151658": {
      "content": "</tool_call>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151659": {
      "content": "<|fim_prefix|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151660": {
      "content": "<|fim_middle|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151661": {
      "content": "<|fim_suffix|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151662": {
      "content": "<|fim_pad|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151663": {
      "content": "<|repo_name|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151664": {
      "content": "<|file_sep|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151665": {
      "content": "<tool_response>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151666": {
      "content": "</tool_response>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151667": {
      "content": "<think>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151668": {
      "content": "</think>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151669": {
      "content": "<|audio_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151670": {
      "content": "<|audio_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151671": {
      "content": "<tts_pad>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151672": {
      "content": "<tts_text_bos>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151673": {
      "content": "<tts_text_eod>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151674": {
      "content": "<tts_text_bos_single>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151675": {
      "content": "<|audio_pad|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "additional_special_tokens": [
    "<|im_start|>",
    "<|im_end|>",
    "<|object_ref_start|>",
    "<|object_ref_end|>",
    "<|box_start|>",
    "<|box_end|>",
    "<|quad_start|>",
    "<|quad_end|>",
    "<|vision_start|>",
    "<|vision_end|>",
    "<|vision_pad|>",
    "<|image_pad|>",
    "<|video_pad|>",
    "<|audio_start|>",
    "<|audio_end|>",
    "<tts_pad>",
    "<tts_text_bos>",
    "<tts_text_bos_single>",
    "<|audio_pad|>"
  ],
  "audio_bos_token": "<|audio_start|>",
  "audio_eos_token": "<|audio_end|>",
  "audio_token": "<|audio_pad|>",
  "bos_token": null,
  "clean_up_tokenization_spaces": false,
  "eos_token": "<|im_end|>",
  "errors": "replace",
  "extra_special_tokens": {
    "audio_bos_token": "<|audio_start|>",
    "audio_eos_token": "<|audio_end|>",
    "audio_token": "<|audio_pad|>",
    "image_token": "<|image_pad|>",
    "video_token": "<|video_pad|>",
    "vision_bos_token": "<|vision_start|>",
    "vision_eos_token": "<|vision_end|>"
  },
  "image_token": "<|image_pad|>",
  "model_max_length": 131072,
  "pad_token": "<|endoftext|>",
  "processor_class": "Qwen3OmniMoeProcessor",
  "split_special_tokens": false,
  "tokenizer_class": "Qwen2Tokenizer",
  "unk_token": null,
  "video_token": "<|video_pad|>",
  "vision_bos_token": "<|vision_start|>",
  "vision_eos_token": "<|vision_end|>"
}
video_preprocessor_config.json ADDED
@@ -0,0 +1,54 @@
{
  "crop_size": null,
  "data_format": "channels_first",
  "default_to_square": true,
  "device": null,
  "dither": 0.0,
  "do_center_crop": null,
  "do_convert_rgb": true,
  "do_normalize": true,
  "do_rescale": true,
  "do_resize": true,
  "do_sample_frames": false,
  "feature_extractor_type": "WhisperFeatureExtractor",
  "feature_size": 128,
  "fps": null,
  "hop_length": 160,
  "image_mean": [
    0.5,
    0.5,
    0.5
  ],
  "image_std": [
    0.5,
    0.5,
    0.5
  ],
  "input_data_format": null,
  "max_frames": 768,
  "max_pixels": 12845056,
  "merge_size": 2,
  "min_frames": 4,
  "min_pixels": 3136,
  "n_fft": 400,
  "n_samples": 4800000,
  "nb_max_frames": 30000,
  "num_frames": null,
  "pad_size": null,
  "padding_side": "right",
  "padding_value": 0.0,
  "patch_size": 16,
  "processor_class": "Qwen3OmniMoeProcessor",
  "resample": 3,
  "rescale_factor": 0.00392156862745098,
  "return_attention_mask": true,
  "return_metadata": false,
  "sampling_rate": 16000,
  "size": {
    "longest_edge": 12845056,
    "shortest_edge": 3136
  },
  "temporal_patch_size": 2,
  "video_metadata": null,
  "video_processor_type": "Qwen2VLVideoProcessor"
}
vocab.json ADDED
The diff for this file is too large to render. See raw diff