lmmy committed
Commit 50ae1a7 (verified) · Parent: 0392940

Add files using upload-large-folder tool
README.md ADDED
@@ -0,0 +1,31 @@
+ ---
+ license: apache-2.0
+ language:
+ - en
+ - zh
+ pipeline_tag: text-generation
+ tags:
+ - ERNIE4.5
+ - mlx
+ library_name: transformers
+ base_model: baidu/ERNIE-4.5-21B-A3B-PT
+ ---
+ ## 💫 Community Model> ERNIE-4.5-21B-A3B-PT by baidu
+
+ *👾 [LM Studio](https://lmstudio.ai) Community models highlights program. Highlighting new & noteworthy models by the community. Join the conversation on [Discord](https://discord.gg/aPQfnNkxGC).*
+
+ **Model creator:** [baidu](https://huggingface.co/baidu)<br>
+ **Original model:** [ERNIE-4.5-21B-A3B-PT](https://huggingface.co/baidu/ERNIE-4.5-21B-A3B-PT)<br>
+ **MLX quantization:** provided by the [LM Studio team](https://x.com/lmstudio) using [mlx_lm](https://github.com/ml-explore/mlx-lm)<br>
+
+ ## Technical Details
+
+ 6-bit (group size 64) MLX quantization of ERNIE-4.5-21B-A3B-PT, optimized for Apple Silicon.
+
+ ## Special thanks
+
+ 🙏 Special thanks to the [Apple Machine Learning Research](https://github.com/ml-explore) team for creating [MLX](https://github.com/ml-explore/mlx).
+
+ ## Disclaimers
+
+ LM Studio is not the creator, originator, or owner of any Model featured in the Community Model Program. Each Community Model is created and provided by third parties. LM Studio does not endorse, support, represent or guarantee the completeness, truthfulness, accuracy, or reliability of any Community Model. You understand that Community Models can produce content that might be offensive, harmful, inaccurate or otherwise inappropriate, or deceptive. Each Community Model is the sole responsibility of the person or entity who originated such Model. LM Studio may not monitor or control the Community Models and cannot, and does not, take responsibility for any such Model. LM Studio disclaims all warranties or guarantees about the accuracy, reliability or benefits of the Community Models. LM Studio further disclaims any warranty that the Community Model will meet your requirements, be secure, uninterrupted or available at any time or location, or error-free or virus-free, or that any errors will be corrected, or otherwise. You will be solely responsible for any damage resulting from your use of or access to the Community Models, your downloading of any Community Model, or use of any other Community Model provided by or through LM Studio.
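To try this quant locally, a minimal sketch using [mlx_lm](https://github.com/ml-explore/mlx-lm) — assuming a recent `mlx-lm` on an Apple Silicon Mac, with the placeholder path pointing at this model folder (or its Hugging Face repo id):

```python
# Minimal usage sketch for this 6-bit MLX quant (the path is a placeholder).
from mlx_lm import load, generate

model, tokenizer = load("path/to/ERNIE-4.5-21B-A3B-PT-6bit-mlx")

# Format the request with the repo's chat template before generating.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Hello!"}],
    add_generation_prompt=True,
    tokenize=False,
)
print(generate(model, tokenizer, prompt=prompt, max_tokens=256))
```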
added_tokens.json ADDED
@@ -0,0 +1,1011 @@
1
+ {
2
+ "<|AUDIO_PLACEHOLDER|>": 100296,
3
+ "<|CROP_COL_SEP|>": 101301,
4
+ "<|CROP_ROW_SEP|>": 101302,
5
+ "<|IMAGE_PLACEHOLDER|>": 100295,
6
+ "<|IMAGE_SEP|>": 101303,
7
+ "<|LOC_0|>": 100297,
8
+ "<|LOC_1000|>": 101297,
9
+ "<|LOC_100|>": 100397,
10
+ "<|LOC_101|>": 100398,
11
+ "<|LOC_102|>": 100399,
12
+ "<|LOC_103|>": 100400,
13
+ "<|LOC_104|>": 100401,
14
+ "<|LOC_105|>": 100402,
15
+ "<|LOC_106|>": 100403,
16
+ "<|LOC_107|>": 100404,
17
+ "<|LOC_108|>": 100405,
18
+ "<|LOC_109|>": 100406,
19
+ "<|LOC_10|>": 100307,
20
+ "<|LOC_110|>": 100407,
21
+ "<|LOC_111|>": 100408,
22
+ "<|LOC_112|>": 100409,
23
+ "<|LOC_113|>": 100410,
24
+ "<|LOC_114|>": 100411,
25
+ "<|LOC_115|>": 100412,
26
+ "<|LOC_116|>": 100413,
27
+ "<|LOC_117|>": 100414,
28
+ "<|LOC_118|>": 100415,
29
+ "<|LOC_119|>": 100416,
30
+ "<|LOC_11|>": 100308,
31
+ "<|LOC_120|>": 100417,
32
+ "<|LOC_121|>": 100418,
33
+ "<|LOC_122|>": 100419,
34
+ "<|LOC_123|>": 100420,
35
+ "<|LOC_124|>": 100421,
36
+ "<|LOC_125|>": 100422,
37
+ "<|LOC_126|>": 100423,
38
+ "<|LOC_127|>": 100424,
39
+ "<|LOC_128|>": 100425,
40
+ "<|LOC_129|>": 100426,
41
+ "<|LOC_12|>": 100309,
42
+ "<|LOC_130|>": 100427,
43
+ "<|LOC_131|>": 100428,
44
+ "<|LOC_132|>": 100429,
45
+ "<|LOC_133|>": 100430,
46
+ "<|LOC_134|>": 100431,
47
+ "<|LOC_135|>": 100432,
48
+ "<|LOC_136|>": 100433,
49
+ "<|LOC_137|>": 100434,
50
+ "<|LOC_138|>": 100435,
51
+ "<|LOC_139|>": 100436,
52
+ "<|LOC_13|>": 100310,
53
+ "<|LOC_140|>": 100437,
54
+ "<|LOC_141|>": 100438,
55
+ "<|LOC_142|>": 100439,
56
+ "<|LOC_143|>": 100440,
57
+ "<|LOC_144|>": 100441,
58
+ "<|LOC_145|>": 100442,
59
+ "<|LOC_146|>": 100443,
60
+ "<|LOC_147|>": 100444,
61
+ "<|LOC_148|>": 100445,
62
+ "<|LOC_149|>": 100446,
63
+ "<|LOC_14|>": 100311,
64
+ "<|LOC_150|>": 100447,
65
+ "<|LOC_151|>": 100448,
66
+ "<|LOC_152|>": 100449,
67
+ "<|LOC_153|>": 100450,
68
+ "<|LOC_154|>": 100451,
69
+ "<|LOC_155|>": 100452,
70
+ "<|LOC_156|>": 100453,
71
+ "<|LOC_157|>": 100454,
72
+ "<|LOC_158|>": 100455,
73
+ "<|LOC_159|>": 100456,
74
+ "<|LOC_15|>": 100312,
75
+ "<|LOC_160|>": 100457,
76
+ "<|LOC_161|>": 100458,
77
+ "<|LOC_162|>": 100459,
78
+ "<|LOC_163|>": 100460,
79
+ "<|LOC_164|>": 100461,
80
+ "<|LOC_165|>": 100462,
81
+ "<|LOC_166|>": 100463,
82
+ "<|LOC_167|>": 100464,
83
+ "<|LOC_168|>": 100465,
84
+ "<|LOC_169|>": 100466,
85
+ "<|LOC_16|>": 100313,
86
+ "<|LOC_170|>": 100467,
87
+ "<|LOC_171|>": 100468,
88
+ "<|LOC_172|>": 100469,
89
+ "<|LOC_173|>": 100470,
90
+ "<|LOC_174|>": 100471,
91
+ "<|LOC_175|>": 100472,
92
+ "<|LOC_176|>": 100473,
93
+ "<|LOC_177|>": 100474,
94
+ "<|LOC_178|>": 100475,
95
+ "<|LOC_179|>": 100476,
96
+ "<|LOC_17|>": 100314,
97
+ "<|LOC_180|>": 100477,
98
+ "<|LOC_181|>": 100478,
99
+ "<|LOC_182|>": 100479,
100
+ "<|LOC_183|>": 100480,
101
+ "<|LOC_184|>": 100481,
102
+ "<|LOC_185|>": 100482,
103
+ "<|LOC_186|>": 100483,
104
+ "<|LOC_187|>": 100484,
105
+ "<|LOC_188|>": 100485,
106
+ "<|LOC_189|>": 100486,
107
+ "<|LOC_18|>": 100315,
108
+ "<|LOC_190|>": 100487,
109
+ "<|LOC_191|>": 100488,
110
+ "<|LOC_192|>": 100489,
111
+ "<|LOC_193|>": 100490,
112
+ "<|LOC_194|>": 100491,
113
+ "<|LOC_195|>": 100492,
114
+ "<|LOC_196|>": 100493,
115
+ "<|LOC_197|>": 100494,
116
+ "<|LOC_198|>": 100495,
117
+ "<|LOC_199|>": 100496,
118
+ "<|LOC_19|>": 100316,
119
+ "<|LOC_1|>": 100298,
120
+ "<|LOC_200|>": 100497,
121
+ "<|LOC_201|>": 100498,
122
+ "<|LOC_202|>": 100499,
123
+ "<|LOC_203|>": 100500,
124
+ "<|LOC_204|>": 100501,
125
+ "<|LOC_205|>": 100502,
126
+ "<|LOC_206|>": 100503,
127
+ "<|LOC_207|>": 100504,
128
+ "<|LOC_208|>": 100505,
129
+ "<|LOC_209|>": 100506,
130
+ "<|LOC_20|>": 100317,
131
+ "<|LOC_210|>": 100507,
132
+ "<|LOC_211|>": 100508,
133
+ "<|LOC_212|>": 100509,
134
+ "<|LOC_213|>": 100510,
135
+ "<|LOC_214|>": 100511,
136
+ "<|LOC_215|>": 100512,
137
+ "<|LOC_216|>": 100513,
138
+ "<|LOC_217|>": 100514,
139
+ "<|LOC_218|>": 100515,
140
+ "<|LOC_219|>": 100516,
141
+ "<|LOC_21|>": 100318,
142
+ "<|LOC_220|>": 100517,
143
+ "<|LOC_221|>": 100518,
144
+ "<|LOC_222|>": 100519,
145
+ "<|LOC_223|>": 100520,
146
+ "<|LOC_224|>": 100521,
147
+ "<|LOC_225|>": 100522,
148
+ "<|LOC_226|>": 100523,
149
+ "<|LOC_227|>": 100524,
150
+ "<|LOC_228|>": 100525,
151
+ "<|LOC_229|>": 100526,
152
+ "<|LOC_22|>": 100319,
153
+ "<|LOC_230|>": 100527,
154
+ "<|LOC_231|>": 100528,
155
+ "<|LOC_232|>": 100529,
156
+ "<|LOC_233|>": 100530,
157
+ "<|LOC_234|>": 100531,
158
+ "<|LOC_235|>": 100532,
159
+ "<|LOC_236|>": 100533,
160
+ "<|LOC_237|>": 100534,
161
+ "<|LOC_238|>": 100535,
162
+ "<|LOC_239|>": 100536,
163
+ "<|LOC_23|>": 100320,
164
+ "<|LOC_240|>": 100537,
165
+ "<|LOC_241|>": 100538,
166
+ "<|LOC_242|>": 100539,
167
+ "<|LOC_243|>": 100540,
168
+ "<|LOC_244|>": 100541,
169
+ "<|LOC_245|>": 100542,
170
+ "<|LOC_246|>": 100543,
171
+ "<|LOC_247|>": 100544,
172
+ "<|LOC_248|>": 100545,
173
+ "<|LOC_249|>": 100546,
174
+ "<|LOC_24|>": 100321,
175
+ "<|LOC_250|>": 100547,
176
+ "<|LOC_251|>": 100548,
177
+ "<|LOC_252|>": 100549,
178
+ "<|LOC_253|>": 100550,
179
+ "<|LOC_254|>": 100551,
180
+ "<|LOC_255|>": 100552,
181
+ "<|LOC_256|>": 100553,
182
+ "<|LOC_257|>": 100554,
183
+ "<|LOC_258|>": 100555,
184
+ "<|LOC_259|>": 100556,
185
+ "<|LOC_25|>": 100322,
186
+ "<|LOC_260|>": 100557,
187
+ "<|LOC_261|>": 100558,
188
+ "<|LOC_262|>": 100559,
189
+ "<|LOC_263|>": 100560,
190
+ "<|LOC_264|>": 100561,
191
+ "<|LOC_265|>": 100562,
192
+ "<|LOC_266|>": 100563,
193
+ "<|LOC_267|>": 100564,
194
+ "<|LOC_268|>": 100565,
195
+ "<|LOC_269|>": 100566,
196
+ "<|LOC_26|>": 100323,
197
+ "<|LOC_270|>": 100567,
198
+ "<|LOC_271|>": 100568,
199
+ "<|LOC_272|>": 100569,
200
+ "<|LOC_273|>": 100570,
201
+ "<|LOC_274|>": 100571,
202
+ "<|LOC_275|>": 100572,
203
+ "<|LOC_276|>": 100573,
204
+ "<|LOC_277|>": 100574,
205
+ "<|LOC_278|>": 100575,
206
+ "<|LOC_279|>": 100576,
207
+ "<|LOC_27|>": 100324,
208
+ "<|LOC_280|>": 100577,
209
+ "<|LOC_281|>": 100578,
210
+ "<|LOC_282|>": 100579,
211
+ "<|LOC_283|>": 100580,
212
+ "<|LOC_284|>": 100581,
213
+ "<|LOC_285|>": 100582,
214
+ "<|LOC_286|>": 100583,
215
+ "<|LOC_287|>": 100584,
216
+ "<|LOC_288|>": 100585,
217
+ "<|LOC_289|>": 100586,
218
+ "<|LOC_28|>": 100325,
219
+ "<|LOC_290|>": 100587,
220
+ "<|LOC_291|>": 100588,
221
+ "<|LOC_292|>": 100589,
222
+ "<|LOC_293|>": 100590,
223
+ "<|LOC_294|>": 100591,
224
+ "<|LOC_295|>": 100592,
225
+ "<|LOC_296|>": 100593,
226
+ "<|LOC_297|>": 100594,
227
+ "<|LOC_298|>": 100595,
228
+ "<|LOC_299|>": 100596,
229
+ "<|LOC_29|>": 100326,
230
+ "<|LOC_2|>": 100299,
231
+ "<|LOC_300|>": 100597,
232
+ "<|LOC_301|>": 100598,
233
+ "<|LOC_302|>": 100599,
234
+ "<|LOC_303|>": 100600,
235
+ "<|LOC_304|>": 100601,
236
+ "<|LOC_305|>": 100602,
237
+ "<|LOC_306|>": 100603,
238
+ "<|LOC_307|>": 100604,
239
+ "<|LOC_308|>": 100605,
240
+ "<|LOC_309|>": 100606,
241
+ "<|LOC_30|>": 100327,
242
+ "<|LOC_310|>": 100607,
243
+ "<|LOC_311|>": 100608,
244
+ "<|LOC_312|>": 100609,
245
+ "<|LOC_313|>": 100610,
246
+ "<|LOC_314|>": 100611,
247
+ "<|LOC_315|>": 100612,
248
+ "<|LOC_316|>": 100613,
249
+ "<|LOC_317|>": 100614,
250
+ "<|LOC_318|>": 100615,
251
+ "<|LOC_319|>": 100616,
252
+ "<|LOC_31|>": 100328,
253
+ "<|LOC_320|>": 100617,
254
+ "<|LOC_321|>": 100618,
255
+ "<|LOC_322|>": 100619,
256
+ "<|LOC_323|>": 100620,
257
+ "<|LOC_324|>": 100621,
258
+ "<|LOC_325|>": 100622,
259
+ "<|LOC_326|>": 100623,
260
+ "<|LOC_327|>": 100624,
261
+ "<|LOC_328|>": 100625,
262
+ "<|LOC_329|>": 100626,
263
+ "<|LOC_32|>": 100329,
264
+ "<|LOC_330|>": 100627,
265
+ "<|LOC_331|>": 100628,
266
+ "<|LOC_332|>": 100629,
267
+ "<|LOC_333|>": 100630,
268
+ "<|LOC_334|>": 100631,
269
+ "<|LOC_335|>": 100632,
270
+ "<|LOC_336|>": 100633,
271
+ "<|LOC_337|>": 100634,
272
+ "<|LOC_338|>": 100635,
273
+ "<|LOC_339|>": 100636,
274
+ "<|LOC_33|>": 100330,
275
+ "<|LOC_340|>": 100637,
276
+ "<|LOC_341|>": 100638,
277
+ "<|LOC_342|>": 100639,
278
+ "<|LOC_343|>": 100640,
279
+ "<|LOC_344|>": 100641,
280
+ "<|LOC_345|>": 100642,
281
+ "<|LOC_346|>": 100643,
282
+ "<|LOC_347|>": 100644,
283
+ "<|LOC_348|>": 100645,
284
+ "<|LOC_349|>": 100646,
285
+ "<|LOC_34|>": 100331,
286
+ "<|LOC_350|>": 100647,
287
+ "<|LOC_351|>": 100648,
288
+ "<|LOC_352|>": 100649,
289
+ "<|LOC_353|>": 100650,
290
+ "<|LOC_354|>": 100651,
291
+ "<|LOC_355|>": 100652,
292
+ "<|LOC_356|>": 100653,
293
+ "<|LOC_357|>": 100654,
294
+ "<|LOC_358|>": 100655,
295
+ "<|LOC_359|>": 100656,
296
+ "<|LOC_35|>": 100332,
297
+ "<|LOC_360|>": 100657,
298
+ "<|LOC_361|>": 100658,
299
+ "<|LOC_362|>": 100659,
300
+ "<|LOC_363|>": 100660,
301
+ "<|LOC_364|>": 100661,
302
+ "<|LOC_365|>": 100662,
303
+ "<|LOC_366|>": 100663,
304
+ "<|LOC_367|>": 100664,
305
+ "<|LOC_368|>": 100665,
306
+ "<|LOC_369|>": 100666,
307
+ "<|LOC_36|>": 100333,
308
+ "<|LOC_370|>": 100667,
309
+ "<|LOC_371|>": 100668,
310
+ "<|LOC_372|>": 100669,
311
+ "<|LOC_373|>": 100670,
312
+ "<|LOC_374|>": 100671,
313
+ "<|LOC_375|>": 100672,
314
+ "<|LOC_376|>": 100673,
315
+ "<|LOC_377|>": 100674,
316
+ "<|LOC_378|>": 100675,
317
+ "<|LOC_379|>": 100676,
318
+ "<|LOC_37|>": 100334,
319
+ "<|LOC_380|>": 100677,
320
+ "<|LOC_381|>": 100678,
321
+ "<|LOC_382|>": 100679,
322
+ "<|LOC_383|>": 100680,
323
+ "<|LOC_384|>": 100681,
324
+ "<|LOC_385|>": 100682,
325
+ "<|LOC_386|>": 100683,
326
+ "<|LOC_387|>": 100684,
327
+ "<|LOC_388|>": 100685,
328
+ "<|LOC_389|>": 100686,
329
+ "<|LOC_38|>": 100335,
330
+ "<|LOC_390|>": 100687,
331
+ "<|LOC_391|>": 100688,
332
+ "<|LOC_392|>": 100689,
333
+ "<|LOC_393|>": 100690,
334
+ "<|LOC_394|>": 100691,
335
+ "<|LOC_395|>": 100692,
336
+ "<|LOC_396|>": 100693,
337
+ "<|LOC_397|>": 100694,
338
+ "<|LOC_398|>": 100695,
339
+ "<|LOC_399|>": 100696,
340
+ "<|LOC_39|>": 100336,
341
+ "<|LOC_3|>": 100300,
342
+ "<|LOC_400|>": 100697,
343
+ "<|LOC_401|>": 100698,
344
+ "<|LOC_402|>": 100699,
345
+ "<|LOC_403|>": 100700,
346
+ "<|LOC_404|>": 100701,
347
+ "<|LOC_405|>": 100702,
348
+ "<|LOC_406|>": 100703,
349
+ "<|LOC_407|>": 100704,
350
+ "<|LOC_408|>": 100705,
351
+ "<|LOC_409|>": 100706,
352
+ "<|LOC_40|>": 100337,
353
+ "<|LOC_410|>": 100707,
354
+ "<|LOC_411|>": 100708,
355
+ "<|LOC_412|>": 100709,
356
+ "<|LOC_413|>": 100710,
357
+ "<|LOC_414|>": 100711,
358
+ "<|LOC_415|>": 100712,
359
+ "<|LOC_416|>": 100713,
360
+ "<|LOC_417|>": 100714,
361
+ "<|LOC_418|>": 100715,
362
+ "<|LOC_419|>": 100716,
363
+ "<|LOC_41|>": 100338,
364
+ "<|LOC_420|>": 100717,
365
+ "<|LOC_421|>": 100718,
366
+ "<|LOC_422|>": 100719,
367
+ "<|LOC_423|>": 100720,
368
+ "<|LOC_424|>": 100721,
369
+ "<|LOC_425|>": 100722,
370
+ "<|LOC_426|>": 100723,
371
+ "<|LOC_427|>": 100724,
372
+ "<|LOC_428|>": 100725,
373
+ "<|LOC_429|>": 100726,
374
+ "<|LOC_42|>": 100339,
375
+ "<|LOC_430|>": 100727,
376
+ "<|LOC_431|>": 100728,
377
+ "<|LOC_432|>": 100729,
378
+ "<|LOC_433|>": 100730,
379
+ "<|LOC_434|>": 100731,
380
+ "<|LOC_435|>": 100732,
381
+ "<|LOC_436|>": 100733,
382
+ "<|LOC_437|>": 100734,
383
+ "<|LOC_438|>": 100735,
384
+ "<|LOC_439|>": 100736,
385
+ "<|LOC_43|>": 100340,
386
+ "<|LOC_440|>": 100737,
387
+ "<|LOC_441|>": 100738,
388
+ "<|LOC_442|>": 100739,
389
+ "<|LOC_443|>": 100740,
390
+ "<|LOC_444|>": 100741,
391
+ "<|LOC_445|>": 100742,
392
+ "<|LOC_446|>": 100743,
393
+ "<|LOC_447|>": 100744,
394
+ "<|LOC_448|>": 100745,
395
+ "<|LOC_449|>": 100746,
396
+ "<|LOC_44|>": 100341,
397
+ "<|LOC_450|>": 100747,
398
+ "<|LOC_451|>": 100748,
399
+ "<|LOC_452|>": 100749,
400
+ "<|LOC_453|>": 100750,
401
+ "<|LOC_454|>": 100751,
402
+ "<|LOC_455|>": 100752,
403
+ "<|LOC_456|>": 100753,
404
+ "<|LOC_457|>": 100754,
405
+ "<|LOC_458|>": 100755,
406
+ "<|LOC_459|>": 100756,
407
+ "<|LOC_45|>": 100342,
408
+ "<|LOC_460|>": 100757,
409
+ "<|LOC_461|>": 100758,
410
+ "<|LOC_462|>": 100759,
411
+ "<|LOC_463|>": 100760,
412
+ "<|LOC_464|>": 100761,
413
+ "<|LOC_465|>": 100762,
414
+ "<|LOC_466|>": 100763,
415
+ "<|LOC_467|>": 100764,
416
+ "<|LOC_468|>": 100765,
417
+ "<|LOC_469|>": 100766,
418
+ "<|LOC_46|>": 100343,
419
+ "<|LOC_470|>": 100767,
420
+ "<|LOC_471|>": 100768,
421
+ "<|LOC_472|>": 100769,
422
+ "<|LOC_473|>": 100770,
423
+ "<|LOC_474|>": 100771,
424
+ "<|LOC_475|>": 100772,
425
+ "<|LOC_476|>": 100773,
426
+ "<|LOC_477|>": 100774,
427
+ "<|LOC_478|>": 100775,
428
+ "<|LOC_479|>": 100776,
429
+ "<|LOC_47|>": 100344,
430
+ "<|LOC_480|>": 100777,
431
+ "<|LOC_481|>": 100778,
432
+ "<|LOC_482|>": 100779,
433
+ "<|LOC_483|>": 100780,
434
+ "<|LOC_484|>": 100781,
435
+ "<|LOC_485|>": 100782,
436
+ "<|LOC_486|>": 100783,
437
+ "<|LOC_487|>": 100784,
438
+ "<|LOC_488|>": 100785,
439
+ "<|LOC_489|>": 100786,
440
+ "<|LOC_48|>": 100345,
441
+ "<|LOC_490|>": 100787,
442
+ "<|LOC_491|>": 100788,
443
+ "<|LOC_492|>": 100789,
444
+ "<|LOC_493|>": 100790,
445
+ "<|LOC_494|>": 100791,
446
+ "<|LOC_495|>": 100792,
447
+ "<|LOC_496|>": 100793,
448
+ "<|LOC_497|>": 100794,
449
+ "<|LOC_498|>": 100795,
450
+ "<|LOC_499|>": 100796,
451
+ "<|LOC_49|>": 100346,
452
+ "<|LOC_4|>": 100301,
453
+ "<|LOC_500|>": 100797,
454
+ "<|LOC_501|>": 100798,
455
+ "<|LOC_502|>": 100799,
456
+ "<|LOC_503|>": 100800,
457
+ "<|LOC_504|>": 100801,
458
+ "<|LOC_505|>": 100802,
459
+ "<|LOC_506|>": 100803,
460
+ "<|LOC_507|>": 100804,
461
+ "<|LOC_508|>": 100805,
462
+ "<|LOC_509|>": 100806,
463
+ "<|LOC_50|>": 100347,
464
+ "<|LOC_510|>": 100807,
465
+ "<|LOC_511|>": 100808,
466
+ "<|LOC_512|>": 100809,
467
+ "<|LOC_513|>": 100810,
468
+ "<|LOC_514|>": 100811,
469
+ "<|LOC_515|>": 100812,
470
+ "<|LOC_516|>": 100813,
471
+ "<|LOC_517|>": 100814,
472
+ "<|LOC_518|>": 100815,
473
+ "<|LOC_519|>": 100816,
474
+ "<|LOC_51|>": 100348,
475
+ "<|LOC_520|>": 100817,
476
+ "<|LOC_521|>": 100818,
477
+ "<|LOC_522|>": 100819,
478
+ "<|LOC_523|>": 100820,
479
+ "<|LOC_524|>": 100821,
480
+ "<|LOC_525|>": 100822,
481
+ "<|LOC_526|>": 100823,
482
+ "<|LOC_527|>": 100824,
483
+ "<|LOC_528|>": 100825,
484
+ "<|LOC_529|>": 100826,
485
+ "<|LOC_52|>": 100349,
486
+ "<|LOC_530|>": 100827,
487
+ "<|LOC_531|>": 100828,
488
+ "<|LOC_532|>": 100829,
489
+ "<|LOC_533|>": 100830,
490
+ "<|LOC_534|>": 100831,
491
+ "<|LOC_535|>": 100832,
492
+ "<|LOC_536|>": 100833,
493
+ "<|LOC_537|>": 100834,
494
+ "<|LOC_538|>": 100835,
495
+ "<|LOC_539|>": 100836,
496
+ "<|LOC_53|>": 100350,
497
+ "<|LOC_540|>": 100837,
498
+ "<|LOC_541|>": 100838,
499
+ "<|LOC_542|>": 100839,
500
+ "<|LOC_543|>": 100840,
501
+ "<|LOC_544|>": 100841,
502
+ "<|LOC_545|>": 100842,
503
+ "<|LOC_546|>": 100843,
504
+ "<|LOC_547|>": 100844,
505
+ "<|LOC_548|>": 100845,
506
+ "<|LOC_549|>": 100846,
507
+ "<|LOC_54|>": 100351,
508
+ "<|LOC_550|>": 100847,
509
+ "<|LOC_551|>": 100848,
510
+ "<|LOC_552|>": 100849,
511
+ "<|LOC_553|>": 100850,
512
+ "<|LOC_554|>": 100851,
513
+ "<|LOC_555|>": 100852,
514
+ "<|LOC_556|>": 100853,
515
+ "<|LOC_557|>": 100854,
516
+ "<|LOC_558|>": 100855,
517
+ "<|LOC_559|>": 100856,
518
+ "<|LOC_55|>": 100352,
519
+ "<|LOC_560|>": 100857,
520
+ "<|LOC_561|>": 100858,
521
+ "<|LOC_562|>": 100859,
522
+ "<|LOC_563|>": 100860,
523
+ "<|LOC_564|>": 100861,
524
+ "<|LOC_565|>": 100862,
525
+ "<|LOC_566|>": 100863,
526
+ "<|LOC_567|>": 100864,
527
+ "<|LOC_568|>": 100865,
528
+ "<|LOC_569|>": 100866,
529
+ "<|LOC_56|>": 100353,
530
+ "<|LOC_570|>": 100867,
531
+ "<|LOC_571|>": 100868,
532
+ "<|LOC_572|>": 100869,
533
+ "<|LOC_573|>": 100870,
534
+ "<|LOC_574|>": 100871,
535
+ "<|LOC_575|>": 100872,
536
+ "<|LOC_576|>": 100873,
537
+ "<|LOC_577|>": 100874,
538
+ "<|LOC_578|>": 100875,
539
+ "<|LOC_579|>": 100876,
540
+ "<|LOC_57|>": 100354,
541
+ "<|LOC_580|>": 100877,
542
+ "<|LOC_581|>": 100878,
543
+ "<|LOC_582|>": 100879,
544
+ "<|LOC_583|>": 100880,
545
+ "<|LOC_584|>": 100881,
546
+ "<|LOC_585|>": 100882,
547
+ "<|LOC_586|>": 100883,
548
+ "<|LOC_587|>": 100884,
549
+ "<|LOC_588|>": 100885,
550
+ "<|LOC_589|>": 100886,
551
+ "<|LOC_58|>": 100355,
552
+ "<|LOC_590|>": 100887,
553
+ "<|LOC_591|>": 100888,
554
+ "<|LOC_592|>": 100889,
555
+ "<|LOC_593|>": 100890,
556
+ "<|LOC_594|>": 100891,
557
+ "<|LOC_595|>": 100892,
558
+ "<|LOC_596|>": 100893,
559
+ "<|LOC_597|>": 100894,
560
+ "<|LOC_598|>": 100895,
561
+ "<|LOC_599|>": 100896,
562
+ "<|LOC_59|>": 100356,
563
+ "<|LOC_5|>": 100302,
564
+ "<|LOC_600|>": 100897,
565
+ "<|LOC_601|>": 100898,
566
+ "<|LOC_602|>": 100899,
567
+ "<|LOC_603|>": 100900,
568
+ "<|LOC_604|>": 100901,
569
+ "<|LOC_605|>": 100902,
570
+ "<|LOC_606|>": 100903,
571
+ "<|LOC_607|>": 100904,
572
+ "<|LOC_608|>": 100905,
573
+ "<|LOC_609|>": 100906,
574
+ "<|LOC_60|>": 100357,
575
+ "<|LOC_610|>": 100907,
576
+ "<|LOC_611|>": 100908,
577
+ "<|LOC_612|>": 100909,
578
+ "<|LOC_613|>": 100910,
579
+ "<|LOC_614|>": 100911,
580
+ "<|LOC_615|>": 100912,
581
+ "<|LOC_616|>": 100913,
582
+ "<|LOC_617|>": 100914,
583
+ "<|LOC_618|>": 100915,
584
+ "<|LOC_619|>": 100916,
585
+ "<|LOC_61|>": 100358,
586
+ "<|LOC_620|>": 100917,
587
+ "<|LOC_621|>": 100918,
588
+ "<|LOC_622|>": 100919,
589
+ "<|LOC_623|>": 100920,
590
+ "<|LOC_624|>": 100921,
591
+ "<|LOC_625|>": 100922,
592
+ "<|LOC_626|>": 100923,
593
+ "<|LOC_627|>": 100924,
594
+ "<|LOC_628|>": 100925,
595
+ "<|LOC_629|>": 100926,
596
+ "<|LOC_62|>": 100359,
597
+ "<|LOC_630|>": 100927,
598
+ "<|LOC_631|>": 100928,
599
+ "<|LOC_632|>": 100929,
600
+ "<|LOC_633|>": 100930,
601
+ "<|LOC_634|>": 100931,
602
+ "<|LOC_635|>": 100932,
603
+ "<|LOC_636|>": 100933,
604
+ "<|LOC_637|>": 100934,
605
+ "<|LOC_638|>": 100935,
606
+ "<|LOC_639|>": 100936,
607
+ "<|LOC_63|>": 100360,
608
+ "<|LOC_640|>": 100937,
609
+ "<|LOC_641|>": 100938,
610
+ "<|LOC_642|>": 100939,
611
+ "<|LOC_643|>": 100940,
612
+ "<|LOC_644|>": 100941,
613
+ "<|LOC_645|>": 100942,
614
+ "<|LOC_646|>": 100943,
615
+ "<|LOC_647|>": 100944,
616
+ "<|LOC_648|>": 100945,
617
+ "<|LOC_649|>": 100946,
618
+ "<|LOC_64|>": 100361,
619
+ "<|LOC_650|>": 100947,
620
+ "<|LOC_651|>": 100948,
621
+ "<|LOC_652|>": 100949,
622
+ "<|LOC_653|>": 100950,
623
+ "<|LOC_654|>": 100951,
624
+ "<|LOC_655|>": 100952,
625
+ "<|LOC_656|>": 100953,
626
+ "<|LOC_657|>": 100954,
627
+ "<|LOC_658|>": 100955,
628
+ "<|LOC_659|>": 100956,
629
+ "<|LOC_65|>": 100362,
630
+ "<|LOC_660|>": 100957,
631
+ "<|LOC_661|>": 100958,
632
+ "<|LOC_662|>": 100959,
633
+ "<|LOC_663|>": 100960,
634
+ "<|LOC_664|>": 100961,
635
+ "<|LOC_665|>": 100962,
636
+ "<|LOC_666|>": 100963,
637
+ "<|LOC_667|>": 100964,
638
+ "<|LOC_668|>": 100965,
639
+ "<|LOC_669|>": 100966,
640
+ "<|LOC_66|>": 100363,
641
+ "<|LOC_670|>": 100967,
642
+ "<|LOC_671|>": 100968,
643
+ "<|LOC_672|>": 100969,
644
+ "<|LOC_673|>": 100970,
645
+ "<|LOC_674|>": 100971,
646
+ "<|LOC_675|>": 100972,
647
+ "<|LOC_676|>": 100973,
648
+ "<|LOC_677|>": 100974,
649
+ "<|LOC_678|>": 100975,
650
+ "<|LOC_679|>": 100976,
651
+ "<|LOC_67|>": 100364,
652
+ "<|LOC_680|>": 100977,
653
+ "<|LOC_681|>": 100978,
654
+ "<|LOC_682|>": 100979,
655
+ "<|LOC_683|>": 100980,
656
+ "<|LOC_684|>": 100981,
657
+ "<|LOC_685|>": 100982,
658
+ "<|LOC_686|>": 100983,
659
+ "<|LOC_687|>": 100984,
660
+ "<|LOC_688|>": 100985,
661
+ "<|LOC_689|>": 100986,
662
+ "<|LOC_68|>": 100365,
663
+ "<|LOC_690|>": 100987,
664
+ "<|LOC_691|>": 100988,
665
+ "<|LOC_692|>": 100989,
666
+ "<|LOC_693|>": 100990,
667
+ "<|LOC_694|>": 100991,
668
+ "<|LOC_695|>": 100992,
669
+ "<|LOC_696|>": 100993,
670
+ "<|LOC_697|>": 100994,
671
+ "<|LOC_698|>": 100995,
672
+ "<|LOC_699|>": 100996,
673
+ "<|LOC_69|>": 100366,
674
+ "<|LOC_6|>": 100303,
675
+ "<|LOC_700|>": 100997,
676
+ "<|LOC_701|>": 100998,
677
+ "<|LOC_702|>": 100999,
678
+ "<|LOC_703|>": 101000,
679
+ "<|LOC_704|>": 101001,
680
+ "<|LOC_705|>": 101002,
681
+ "<|LOC_706|>": 101003,
682
+ "<|LOC_707|>": 101004,
683
+ "<|LOC_708|>": 101005,
684
+ "<|LOC_709|>": 101006,
685
+ "<|LOC_70|>": 100367,
686
+ "<|LOC_710|>": 101007,
687
+ "<|LOC_711|>": 101008,
688
+ "<|LOC_712|>": 101009,
689
+ "<|LOC_713|>": 101010,
690
+ "<|LOC_714|>": 101011,
691
+ "<|LOC_715|>": 101012,
692
+ "<|LOC_716|>": 101013,
693
+ "<|LOC_717|>": 101014,
694
+ "<|LOC_718|>": 101015,
695
+ "<|LOC_719|>": 101016,
696
+ "<|LOC_71|>": 100368,
697
+ "<|LOC_720|>": 101017,
698
+ "<|LOC_721|>": 101018,
699
+ "<|LOC_722|>": 101019,
700
+ "<|LOC_723|>": 101020,
701
+ "<|LOC_724|>": 101021,
702
+ "<|LOC_725|>": 101022,
703
+ "<|LOC_726|>": 101023,
704
+ "<|LOC_727|>": 101024,
705
+ "<|LOC_728|>": 101025,
706
+ "<|LOC_729|>": 101026,
707
+ "<|LOC_72|>": 100369,
708
+ "<|LOC_730|>": 101027,
709
+ "<|LOC_731|>": 101028,
710
+ "<|LOC_732|>": 101029,
711
+ "<|LOC_733|>": 101030,
712
+ "<|LOC_734|>": 101031,
713
+ "<|LOC_735|>": 101032,
714
+ "<|LOC_736|>": 101033,
715
+ "<|LOC_737|>": 101034,
716
+ "<|LOC_738|>": 101035,
717
+ "<|LOC_739|>": 101036,
718
+ "<|LOC_73|>": 100370,
719
+ "<|LOC_740|>": 101037,
720
+ "<|LOC_741|>": 101038,
721
+ "<|LOC_742|>": 101039,
722
+ "<|LOC_743|>": 101040,
723
+ "<|LOC_744|>": 101041,
724
+ "<|LOC_745|>": 101042,
725
+ "<|LOC_746|>": 101043,
726
+ "<|LOC_747|>": 101044,
727
+ "<|LOC_748|>": 101045,
728
+ "<|LOC_749|>": 101046,
729
+ "<|LOC_74|>": 100371,
730
+ "<|LOC_750|>": 101047,
731
+ "<|LOC_751|>": 101048,
732
+ "<|LOC_752|>": 101049,
733
+ "<|LOC_753|>": 101050,
734
+ "<|LOC_754|>": 101051,
735
+ "<|LOC_755|>": 101052,
736
+ "<|LOC_756|>": 101053,
737
+ "<|LOC_757|>": 101054,
738
+ "<|LOC_758|>": 101055,
739
+ "<|LOC_759|>": 101056,
740
+ "<|LOC_75|>": 100372,
741
+ "<|LOC_760|>": 101057,
742
+ "<|LOC_761|>": 101058,
743
+ "<|LOC_762|>": 101059,
744
+ "<|LOC_763|>": 101060,
745
+ "<|LOC_764|>": 101061,
746
+ "<|LOC_765|>": 101062,
747
+ "<|LOC_766|>": 101063,
748
+ "<|LOC_767|>": 101064,
749
+ "<|LOC_768|>": 101065,
750
+ "<|LOC_769|>": 101066,
751
+ "<|LOC_76|>": 100373,
752
+ "<|LOC_770|>": 101067,
753
+ "<|LOC_771|>": 101068,
754
+ "<|LOC_772|>": 101069,
755
+ "<|LOC_773|>": 101070,
756
+ "<|LOC_774|>": 101071,
757
+ "<|LOC_775|>": 101072,
758
+ "<|LOC_776|>": 101073,
759
+ "<|LOC_777|>": 101074,
760
+ "<|LOC_778|>": 101075,
761
+ "<|LOC_779|>": 101076,
762
+ "<|LOC_77|>": 100374,
763
+ "<|LOC_780|>": 101077,
764
+ "<|LOC_781|>": 101078,
765
+ "<|LOC_782|>": 101079,
766
+ "<|LOC_783|>": 101080,
767
+ "<|LOC_784|>": 101081,
768
+ "<|LOC_785|>": 101082,
769
+ "<|LOC_786|>": 101083,
770
+ "<|LOC_787|>": 101084,
771
+ "<|LOC_788|>": 101085,
772
+ "<|LOC_789|>": 101086,
773
+ "<|LOC_78|>": 100375,
774
+ "<|LOC_790|>": 101087,
775
+ "<|LOC_791|>": 101088,
776
+ "<|LOC_792|>": 101089,
777
+ "<|LOC_793|>": 101090,
778
+ "<|LOC_794|>": 101091,
779
+ "<|LOC_795|>": 101092,
780
+ "<|LOC_796|>": 101093,
781
+ "<|LOC_797|>": 101094,
782
+ "<|LOC_798|>": 101095,
783
+ "<|LOC_799|>": 101096,
784
+ "<|LOC_79|>": 100376,
785
+ "<|LOC_7|>": 100304,
786
+ "<|LOC_800|>": 101097,
787
+ "<|LOC_801|>": 101098,
788
+ "<|LOC_802|>": 101099,
789
+ "<|LOC_803|>": 101100,
790
+ "<|LOC_804|>": 101101,
791
+ "<|LOC_805|>": 101102,
792
+ "<|LOC_806|>": 101103,
793
+ "<|LOC_807|>": 101104,
794
+ "<|LOC_808|>": 101105,
795
+ "<|LOC_809|>": 101106,
796
+ "<|LOC_80|>": 100377,
797
+ "<|LOC_810|>": 101107,
798
+ "<|LOC_811|>": 101108,
799
+ "<|LOC_812|>": 101109,
800
+ "<|LOC_813|>": 101110,
801
+ "<|LOC_814|>": 101111,
802
+ "<|LOC_815|>": 101112,
803
+ "<|LOC_816|>": 101113,
804
+ "<|LOC_817|>": 101114,
805
+ "<|LOC_818|>": 101115,
806
+ "<|LOC_819|>": 101116,
807
+ "<|LOC_81|>": 100378,
808
+ "<|LOC_820|>": 101117,
809
+ "<|LOC_821|>": 101118,
810
+ "<|LOC_822|>": 101119,
811
+ "<|LOC_823|>": 101120,
812
+ "<|LOC_824|>": 101121,
813
+ "<|LOC_825|>": 101122,
814
+ "<|LOC_826|>": 101123,
815
+ "<|LOC_827|>": 101124,
816
+ "<|LOC_828|>": 101125,
817
+ "<|LOC_829|>": 101126,
818
+ "<|LOC_82|>": 100379,
819
+ "<|LOC_830|>": 101127,
820
+ "<|LOC_831|>": 101128,
821
+ "<|LOC_832|>": 101129,
822
+ "<|LOC_833|>": 101130,
823
+ "<|LOC_834|>": 101131,
824
+ "<|LOC_835|>": 101132,
825
+ "<|LOC_836|>": 101133,
826
+ "<|LOC_837|>": 101134,
827
+ "<|LOC_838|>": 101135,
828
+ "<|LOC_839|>": 101136,
829
+ "<|LOC_83|>": 100380,
830
+ "<|LOC_840|>": 101137,
831
+ "<|LOC_841|>": 101138,
832
+ "<|LOC_842|>": 101139,
833
+ "<|LOC_843|>": 101140,
834
+ "<|LOC_844|>": 101141,
835
+ "<|LOC_845|>": 101142,
836
+ "<|LOC_846|>": 101143,
837
+ "<|LOC_847|>": 101144,
838
+ "<|LOC_848|>": 101145,
839
+ "<|LOC_849|>": 101146,
840
+ "<|LOC_84|>": 100381,
841
+ "<|LOC_850|>": 101147,
842
+ "<|LOC_851|>": 101148,
843
+ "<|LOC_852|>": 101149,
844
+ "<|LOC_853|>": 101150,
845
+ "<|LOC_854|>": 101151,
846
+ "<|LOC_855|>": 101152,
847
+ "<|LOC_856|>": 101153,
848
+ "<|LOC_857|>": 101154,
849
+ "<|LOC_858|>": 101155,
850
+ "<|LOC_859|>": 101156,
851
+ "<|LOC_85|>": 100382,
852
+ "<|LOC_860|>": 101157,
853
+ "<|LOC_861|>": 101158,
854
+ "<|LOC_862|>": 101159,
855
+ "<|LOC_863|>": 101160,
856
+ "<|LOC_864|>": 101161,
857
+ "<|LOC_865|>": 101162,
858
+ "<|LOC_866|>": 101163,
859
+ "<|LOC_867|>": 101164,
860
+ "<|LOC_868|>": 101165,
861
+ "<|LOC_869|>": 101166,
862
+ "<|LOC_86|>": 100383,
863
+ "<|LOC_870|>": 101167,
864
+ "<|LOC_871|>": 101168,
865
+ "<|LOC_872|>": 101169,
866
+ "<|LOC_873|>": 101170,
867
+ "<|LOC_874|>": 101171,
868
+ "<|LOC_875|>": 101172,
869
+ "<|LOC_876|>": 101173,
870
+ "<|LOC_877|>": 101174,
871
+ "<|LOC_878|>": 101175,
872
+ "<|LOC_879|>": 101176,
873
+ "<|LOC_87|>": 100384,
874
+ "<|LOC_880|>": 101177,
875
+ "<|LOC_881|>": 101178,
876
+ "<|LOC_882|>": 101179,
877
+ "<|LOC_883|>": 101180,
878
+ "<|LOC_884|>": 101181,
879
+ "<|LOC_885|>": 101182,
880
+ "<|LOC_886|>": 101183,
881
+ "<|LOC_887|>": 101184,
882
+ "<|LOC_888|>": 101185,
883
+ "<|LOC_889|>": 101186,
884
+ "<|LOC_88|>": 100385,
885
+ "<|LOC_890|>": 101187,
886
+ "<|LOC_891|>": 101188,
887
+ "<|LOC_892|>": 101189,
888
+ "<|LOC_893|>": 101190,
889
+ "<|LOC_894|>": 101191,
890
+ "<|LOC_895|>": 101192,
891
+ "<|LOC_896|>": 101193,
892
+ "<|LOC_897|>": 101194,
893
+ "<|LOC_898|>": 101195,
894
+ "<|LOC_899|>": 101196,
895
+ "<|LOC_89|>": 100386,
896
+ "<|LOC_8|>": 100305,
897
+ "<|LOC_900|>": 101197,
898
+ "<|LOC_901|>": 101198,
899
+ "<|LOC_902|>": 101199,
900
+ "<|LOC_903|>": 101200,
901
+ "<|LOC_904|>": 101201,
902
+ "<|LOC_905|>": 101202,
903
+ "<|LOC_906|>": 101203,
904
+ "<|LOC_907|>": 101204,
905
+ "<|LOC_908|>": 101205,
906
+ "<|LOC_909|>": 101206,
907
+ "<|LOC_90|>": 100387,
908
+ "<|LOC_910|>": 101207,
909
+ "<|LOC_911|>": 101208,
910
+ "<|LOC_912|>": 101209,
911
+ "<|LOC_913|>": 101210,
912
+ "<|LOC_914|>": 101211,
913
+ "<|LOC_915|>": 101212,
914
+ "<|LOC_916|>": 101213,
915
+ "<|LOC_917|>": 101214,
916
+ "<|LOC_918|>": 101215,
917
+ "<|LOC_919|>": 101216,
918
+ "<|LOC_91|>": 100388,
919
+ "<|LOC_920|>": 101217,
920
+ "<|LOC_921|>": 101218,
921
+ "<|LOC_922|>": 101219,
922
+ "<|LOC_923|>": 101220,
923
+ "<|LOC_924|>": 101221,
924
+ "<|LOC_925|>": 101222,
925
+ "<|LOC_926|>": 101223,
926
+ "<|LOC_927|>": 101224,
927
+ "<|LOC_928|>": 101225,
928
+ "<|LOC_929|>": 101226,
929
+ "<|LOC_92|>": 100389,
930
+ "<|LOC_930|>": 101227,
931
+ "<|LOC_931|>": 101228,
932
+ "<|LOC_932|>": 101229,
933
+ "<|LOC_933|>": 101230,
934
+ "<|LOC_934|>": 101231,
935
+ "<|LOC_935|>": 101232,
936
+ "<|LOC_936|>": 101233,
937
+ "<|LOC_937|>": 101234,
938
+ "<|LOC_938|>": 101235,
939
+ "<|LOC_939|>": 101236,
940
+ "<|LOC_93|>": 100390,
941
+ "<|LOC_940|>": 101237,
942
+ "<|LOC_941|>": 101238,
943
+ "<|LOC_942|>": 101239,
944
+ "<|LOC_943|>": 101240,
945
+ "<|LOC_944|>": 101241,
946
+ "<|LOC_945|>": 101242,
947
+ "<|LOC_946|>": 101243,
948
+ "<|LOC_947|>": 101244,
949
+ "<|LOC_948|>": 101245,
950
+ "<|LOC_949|>": 101246,
951
+ "<|LOC_94|>": 100391,
952
+ "<|LOC_950|>": 101247,
953
+ "<|LOC_951|>": 101248,
954
+ "<|LOC_952|>": 101249,
955
+ "<|LOC_953|>": 101250,
956
+ "<|LOC_954|>": 101251,
957
+ "<|LOC_955|>": 101252,
958
+ "<|LOC_956|>": 101253,
959
+ "<|LOC_957|>": 101254,
960
+ "<|LOC_958|>": 101255,
961
+ "<|LOC_959|>": 101256,
962
+ "<|LOC_95|>": 100392,
963
+ "<|LOC_960|>": 101257,
964
+ "<|LOC_961|>": 101258,
965
+ "<|LOC_962|>": 101259,
966
+ "<|LOC_963|>": 101260,
967
+ "<|LOC_964|>": 101261,
968
+ "<|LOC_965|>": 101262,
969
+ "<|LOC_966|>": 101263,
970
+ "<|LOC_967|>": 101264,
971
+ "<|LOC_968|>": 101265,
972
+ "<|LOC_969|>": 101266,
973
+ "<|LOC_96|>": 100393,
974
+ "<|LOC_970|>": 101267,
975
+ "<|LOC_971|>": 101268,
976
+ "<|LOC_972|>": 101269,
977
+ "<|LOC_973|>": 101270,
978
+ "<|LOC_974|>": 101271,
979
+ "<|LOC_975|>": 101272,
980
+ "<|LOC_976|>": 101273,
981
+ "<|LOC_977|>": 101274,
982
+ "<|LOC_978|>": 101275,
983
+ "<|LOC_979|>": 101276,
984
+ "<|LOC_97|>": 100394,
985
+ "<|LOC_980|>": 101277,
986
+ "<|LOC_981|>": 101278,
987
+ "<|LOC_982|>": 101279,
988
+ "<|LOC_983|>": 101280,
989
+ "<|LOC_984|>": 101281,
990
+ "<|LOC_985|>": 101282,
991
+ "<|LOC_986|>": 101283,
992
+ "<|LOC_987|>": 101284,
993
+ "<|LOC_988|>": 101285,
994
+ "<|LOC_989|>": 101286,
995
+ "<|LOC_98|>": 100395,
996
+ "<|LOC_990|>": 101287,
997
+ "<|LOC_991|>": 101288,
998
+ "<|LOC_992|>": 101289,
999
+ "<|LOC_993|>": 101290,
1000
+ "<|LOC_994|>": 101291,
1001
+ "<|LOC_995|>": 101292,
1002
+ "<|LOC_996|>": 101293,
1003
+ "<|LOC_997|>": 101294,
1004
+ "<|LOC_998|>": 101295,
1005
+ "<|LOC_999|>": 101296,
1006
+ "<|LOC_99|>": 100396,
1007
+ "<|LOC_9|>": 100306,
1008
+ "<|LOC_BEGIN|>": 101298,
1009
+ "<|LOC_END|>": 101299,
1010
+ "<|LOC_SEP|>": 101300
1011
+ }
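The placeholder and `<|LOC_*|>` entries above appear to be carried over from ERNIE 4.5's multimodal tokenizer (an assumption — this is a text-only checkpoint, so they are simply reserved). A minimal sketch to sanity-check the map:

```python
# Minimal sketch: the added tokens should form one contiguous id block.
import json

tokens = json.load(open("added_tokens.json"))
ids = sorted(tokens.values())
print(f"{len(tokens)} added tokens, ids {ids[0]}..{ids[-1]}")
assert ids == list(range(ids[0], ids[-1] + 1)), "gaps in the id range"
```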
chat_template.jinja ADDED
@@ -0,0 +1,24 @@
+ {%- if not add_generation_prompt is defined -%}
+ {%- set add_generation_prompt = true -%}
+ {%- endif -%}
+ {%- if not cls_token is defined -%}
+ {%- set cls_token = "<|begin_of_sentence|>" -%}
+ {%- endif -%}
+ {%- if not sep_token is defined -%}
+ {%- set sep_token = "<|end_of_sentence|>" -%}
+ {%- endif -%}
+ {{- cls_token -}}
+ {%- for message in messages -%}
+ {%- if message["role"] == "user" -%}
+ {{- "User: " + message["content"] + "
+ " -}}
+ {%- elif message["role"] == "assistant" -%}
+ {{- "Assistant: " + message["content"] + sep_token -}}
+ {%- elif message["role"] == "system" -%}
+ {{- message["content"] + "
+ " -}}
+ {%- endif -%}
+ {%- endfor -%}
+ {%- if add_generation_prompt -%}
+ {{- "Assistant: " -}}
+ {%- endif -%}
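Rendered, the template produces a plain `User:` / `Assistant:` turn format opened by `<|begin_of_sentence|>`, with `<|end_of_sentence|>` closing each assistant turn. A minimal sketch with `jinja2` (the messages are illustrative):

```python
# Minimal sketch: render chat_template.jinja to see the prompt layout.
from jinja2 import Template

template = Template(open("chat_template.jinja").read())
prompt = template.render(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is MLX?"},
    ],
    add_generation_prompt=True,
)
print(prompt)
# <|begin_of_sentence|>You are a helpful assistant.
# User: What is MLX?
# Assistant:
```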
config.json ADDED
@@ -0,0 +1,50 @@
+ {
+   "architectures": [
+     "Ernie4_5_MoeForCausalLM"
+   ],
+   "auto_map": {
+     "AutoConfig": "configuration_ernie4_5_moe.Ernie4_5_MoeConfig",
+     "AutoModel": "modeling_ernie4_5_moe.Ernie4_5_Model",
+     "AutoModelForCausalLM": "modeling_ernie4_5_moe.Ernie4_5_MoeForCausalLM"
+   },
+   "bos_token_id": 1,
+   "eos_token_id": 2,
+   "hidden_act": "silu",
+   "hidden_size": 2560,
+   "intermediate_size": 12288,
+   "max_position_embeddings": 131072,
+   "model_type": "ernie4_5_moe",
+   "moe_capacity": [
+     64,
+     64,
+     64
+   ],
+   "moe_intermediate_size": 1536,
+   "moe_k": 6,
+   "moe_layer_interval": 1,
+   "moe_layer_start_index": 1,
+   "moe_num_experts": 64,
+   "moe_num_shared_experts": 2,
+   "moe_use_aux_free": true,
+   "multi_token_pred_lambda": 1.0,
+   "num_attention_heads": 20,
+   "num_hidden_layers": 28,
+   "num_key_value_heads": 4,
+   "num_nextn_predict_layers": 1,
+   "pad_token_id": 0,
+   "quantization": {
+     "group_size": 64,
+     "bits": 6
+   },
+   "quantization_config": {
+     "group_size": 64,
+     "bits": 6
+   },
+   "rms_norm_eps": 1e-05,
+   "rope_theta": 500000,
+   "tie_word_embeddings": true,
+   "torch_dtype": "bfloat16",
+   "use_bias": false,
+   "use_cache": false,
+   "vocab_size": 103424
+ }
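Per this config, every layer from index 1 onward is an MoE layer that routes each token to the top 6 of 64 experts plus 2 always-active shared experts, and weights are stored at 6 bits with quantization group size 64. A small sketch that reads these fields back:

```python
# Minimal sketch: summarize the MoE and quantization setup from config.json.
import json

cfg = json.load(open("config.json"))
print(f"{cfg['num_hidden_layers']} layers, hidden size {cfg['hidden_size']}")
print(f"MoE starts at layer {cfg['moe_layer_start_index']}, "
      f"every {cfg['moe_layer_interval']} layer(s)")
print(f"Routing: top-{cfg['moe_k']} of {cfg['moe_num_experts']} experts "
      f"+ {cfg['moe_num_shared_experts']} shared")
print(f"Quantization: {cfg['quantization']['bits']}-bit, "
      f"group size {cfg['quantization']['group_size']}")
```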
configuration_ernie4_5_moe.py ADDED
@@ -0,0 +1,194 @@
+ # Copyright (c) 2025 Baidu, Inc. All Rights Reserved.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+
+ from transformers import PretrainedConfig
+
+
+
+ class Ernie4_5_MoeConfig(PretrainedConfig):
+     r"""
+     This is the configuration class to store the configuration of an [`Ernie4_5_Model`].
+
+     Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+     documentation from [`PretrainedConfig`] for more information.
+
+
+     Args:
+         vocab_size (int): Size of the vocabulary (number of unique tokens)
+         hidden_size (int): Dimensionality of the encoder layers and the pooler layer
+         intermediate_size (int): Dimensionality of the "intermediate" (feed-forward) layer
+         max_position_embeddings (int): Maximum sequence length the model can handle
+         num_hidden_layers (int): Number of hidden layers in the Transformer encoder
+         num_attention_heads (int): Number of attention heads for each attention layer
+         rms_norm_eps (float): The epsilon used by the RMS normalization layers
+         use_cache (bool): Whether to use caching for faster generation (decoding)
+         use_flash_attention (bool): Whether to use FlashAttention for optimized attention computation
+         pad_token_id (int): Token ID used for padding sequences
+         bos_token_id (int): Token ID used for beginning-of-sequence
+         eos_token_id (int): Token ID used for end-of-sequence
+         use_bias (bool): Whether to use bias terms in linear layers
+         rope_theta (float): The base period of the RoPE embeddings
+         weight_share_add_bias (bool): Whether to share bias weights in certain layers
+         ignored_index (int): Target value that is ignored during loss computation
+         attention_probs_dropout_prob (float): Dropout probability for attention weights
+         hidden_dropout_prob (float): Dropout probability for hidden layers
+         num_key_value_heads (int): Number of key/value heads (for Grouped Query Attention)
+         max_sequence_length (int): Maximum sequence length for positional embeddings
+         moe_num_experts: Number of experts in MoE layers
+         moe_capacity: Capacity configuration for MoE layers
+         moe_layer_interval: Interval between MoE layers
+         moe_layer_start_index: Starting layer index for MoE
+         moe_layer_end_index: Ending layer index for MoE (-1 means the last layer)
+         sinkhorn_2gate: Whether to use Sinkhorn 2-gate routing
+         sinkhorn_temp: Temperature for Sinkhorn routing
+         moe_dropout_prob: Dropout probability for MoE layers
+         moe_gate: Type of gating mechanism ('top2', etc.)
+         moe_intermediate_size: Intermediate size for MoE layers
+         moe_gate_act: Activation function for gating
+         moe_k: Number of experts each token is routed to
+         num_nextn_predict_layers: Number of multi-token prediction (MTP) layers; to enable MTP, set `num_nextn_predict_layers > 0`
+         multi_token_pred_lambda: Weight of the multi-token prediction loss
+         **kwargs: Additional base model configuration parameters
+     """
+
+     model_type = "ernie4_5_moe"
+     use_keep_in_fp32_modules = True
+     keys_to_ignore_at_inference = ["past_key_values"]
+
+     attribute_map = {
+         "n_positions": "max_position_embeddings",
+         "n_embd": "hidden_size",
+         "n_layer": "num_hidden_layers",
+         "n_head": "num_attention_heads",
+         "n_inner": "intermediate_size",
+         "activation_function": "hidden_act",
+     }
+
+     # Default tensor parallel plan for base model `ernie_4_5_moe`
+     base_model_tp_plan = {
+         "model.layers.*.self_attn.q_proj": "colwise_rep",
+         "model.layers.*.self_attn.k_proj": "colwise_rep",
+         "model.layers.*.self_attn.v_proj": "colwise_rep",
+         "model.layers.*.self_attn.o_proj": "rowwise_rep",
+         "model.layers.*.mlp.experts.*.gate_proj": "colwise",
+         "model.layers.*.mlp.experts.*.up_proj": "colwise",
+         "model.layers.*.mlp.experts.*.down_proj": "rowwise",
+         "model.layers.*.mlp.gate_proj": "colwise",
+         "model.layers.*.mlp.up_proj": "colwise",
+         "model.layers.*.mlp.down_proj": "rowwise",
+     }
+     base_model_pp_plan = {
+         "embed_tokens": (["input_ids"], ["inputs_embeds"]),
+         "layers": (["hidden_states", "attention_mask"], ["hidden_states"]),
+         "norm": (["hidden_states"], ["hidden_states"]),
+     }
+
+     def __init__(
+         self,
+         vocab_size=32000,
+         hidden_size=768,
+         intermediate_size=11008,
+         num_hidden_layers=2,
+         num_attention_heads=2,
+         num_key_value_heads=None,
+         max_position_embeddings=32768,
+         rms_norm_eps=1e-6,
+         use_cache=False,
+         pad_token_id=0,
+         bos_token_id=1,
+         eos_token_id=2,
+         attention_probs_dropout_prob=0.0,
+         hidden_dropout_prob=0.0,
+         rope_theta=10000.0,
+         use_flash_attention=False,
+         use_rmsnorm=True,
+         use_bias=False,
+         weight_share_add_bias=True,
+         max_sequence_length=None,
+         ignored_index=-100,
+         use_moe=True,
+         moe_num_experts=64,
+         moe_capacity=(64, 64, 64),
+         moe_layer_interval=2,
+         moe_layer_start_index=0,
+         moe_layer_end_index=-1,
+         sinkhorn_2gate=True,
+         sinkhorn_temp=3e-2,
+         moe_dropout_prob=0.0,
+         moe_gate="top2",
+         moe_intermediate_size=3584,
+         moe_k=2,
+         moe_gate_act: str = "softmax",
+         moe_use_aux_free=False,
+         num_nextn_predict_layers=0,
+         multi_token_pred_lambda=1.0,
+         **kwargs,
+     ):
+         self.vocab_size = vocab_size
+         self.max_position_embeddings = max_position_embeddings
+         self.hidden_size = hidden_size
+         self.intermediate_size = intermediate_size
+         self.num_hidden_layers = num_hidden_layers
+         self.num_attention_heads = num_attention_heads
+
+         if num_key_value_heads is None:
+             num_key_value_heads = num_attention_heads
+
+         self.num_key_value_heads = num_key_value_heads
+         self.use_rmsnorm = use_rmsnorm
+         self.rms_norm_eps = rms_norm_eps
+         self.rope_theta = rope_theta
+         self.max_sequence_length = max_sequence_length
+         self.pad_token_id = pad_token_id
+         self.bos_token_id = bos_token_id
+         self.eos_token_id = eos_token_id
+         self.ignored_index = ignored_index
+         self.use_cache = use_cache
+         self.use_bias = use_bias
+         self.weight_share_add_bias = weight_share_add_bias
+         self.use_flash_attention = use_flash_attention
+         self.attention_probs_dropout_prob = attention_probs_dropout_prob
+         self.hidden_dropout_prob = hidden_dropout_prob
+
+         self.use_moe = moe_num_experts > 0 and use_moe
+         self.moe_num_experts = moe_num_experts
+         self.moe_capacity = moe_capacity
+         self.sinkhorn_2gate = sinkhorn_2gate
+         self.sinkhorn_temp = sinkhorn_temp
+         self.moe_layer_interval = moe_layer_interval
+         self.moe_dropout_prob = moe_dropout_prob
+         self.moe_gate = moe_gate
+         self.moe_intermediate_size = moe_intermediate_size
+         self.moe_k = moe_k
+         self.moe_layer_start_index = moe_layer_start_index
+         self.moe_layer_end_index = (
+             self.num_hidden_layers - 1
+             if moe_layer_end_index == -1
+             else moe_layer_end_index
+         )
+         self.moe_gate_act = moe_gate_act
+         self.moe_use_aux_free = moe_use_aux_free
+         self.num_nextn_predict_layers = num_nextn_predict_layers
+         self.multi_token_pred_lambda = multi_token_pred_lambda
+
+         # Set default for tied embeddings if not specified.
+         if "tie_word_embeddings" not in kwargs:
+             kwargs["tie_word_embeddings"] = False
+
+         super().__init__(
+             pad_token_id=pad_token_id,
+             bos_token_id=bos_token_id,
+             eos_token_id=eos_token_id,
+             **kwargs,
+         )
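As a usage sketch, the class can be instantiated with this checkpoint's values from `config.json` above (note `moe_num_shared_experts` is not a named parameter and flows through `**kwargs`):

```python
# Minimal sketch: build the config with the checkpoint's values, not the
# class defaults (run from this repo's directory).
from configuration_ernie4_5_moe import Ernie4_5_MoeConfig

config = Ernie4_5_MoeConfig(
    vocab_size=103424,
    hidden_size=2560,
    intermediate_size=12288,
    num_hidden_layers=28,
    num_attention_heads=20,
    num_key_value_heads=4,
    max_position_embeddings=131072,
    rope_theta=500000,
    moe_num_experts=64,
    moe_num_shared_experts=2,  # handled via **kwargs
    moe_k=6,
    moe_layer_interval=1,
    moe_layer_start_index=1,
    moe_intermediate_size=1536,
    moe_use_aux_free=True,
    num_nextn_predict_layers=1,
    tie_word_embeddings=True,
)
print(config.use_moe, config.moe_layer_end_index)  # True 27
```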
generation_config.json ADDED
@@ -0,0 +1,11 @@
+ {
+   "do_sample": true,
+   "top_p": 0.8,
+   "temperature": 0.8,
+   "bos_token_id": 1,
+   "eos_token_id": 2,
+   "pad_token_id": 0,
+   "repetition_penalty": 1.0,
+   "frequency_penalty": 0.0,
+   "presence_penalty": 0.0
+ }
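These defaults mean nucleus sampling: scale logits by temperature 0.8, then sample from the smallest token set whose cumulative probability reaches top-p 0.8 (all penalties are neutral). A self-contained sketch of that decode rule — illustrative only, not the actual decoder:

```python
# Minimal sketch of temperature + top-p (nucleus) sampling over one step.
import numpy as np

def sample_top_p(logits, temperature=0.8, top_p=0.8, seed=0):
    rng = np.random.default_rng(seed)
    z = logits / temperature
    probs = np.exp(z - np.max(z))
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                     # most likely first
    cum = np.cumsum(probs[order])
    nucleus = order[: np.searchsorted(cum, top_p) + 1]  # smallest set >= top_p
    p = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=p))

print(sample_top_p(np.array([2.0, 1.0, 0.5, -1.0])))
```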
model-00001-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c9a43716323e572089c455073dc912d6c9dc69d8197354b737217f05a88ea0a8
+ size 5245044492
model-00002-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b7108fe28fb2d1adcf2b71951aee75ea8e97fe68030eb4f153a9ebdc460e499f
+ size 5368579260
model-00003-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:447fd90218bcb130a3c3040f199de5ca10e77bc5d8e60967bb2c3e596d74b598
+ size 5196201739
model-00004-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ff41f158ad50aac11559a11d3ec8e3155b16195aef96b0b99faf4fabdfc893bc
+ size 1923630763
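The four shards above are Git LFS pointers; each `oid` line records the shard's SHA-256. A minimal sketch to verify a download (file name and hash copied from the first pointer):

```python
# Minimal sketch: check a downloaded shard against its LFS pointer hash.
import hashlib

EXPECTED = "c9a43716323e572089c455073dc912d6c9dc69d8197354b737217f05a88ea0a8"

h = hashlib.sha256()
with open("model-00001-of-00004.safetensors", "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        h.update(chunk)
assert h.hexdigest() == EXPECTED, "shard is corrupt or incomplete"
print("ok")
```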
model.safetensors.index.json ADDED
@@ -0,0 +1,980 @@
1
+ {
2
+ "metadata": {
3
+ "total_size": 17733340160,
4
+ "total_parameters": 21825436160
5
+ },
6
+ "weight_map": {
7
+ "model.embed_tokens.biases": "model-00001-of-00004.safetensors",
8
+ "model.embed_tokens.scales": "model-00001-of-00004.safetensors",
9
+ "model.embed_tokens.weight": "model-00001-of-00004.safetensors",
10
+ "model.layers.0.input_layernorm.weight": "model-00001-of-00004.safetensors",
11
+ "model.layers.0.mlp.down_proj.biases": "model-00001-of-00004.safetensors",
12
+ "model.layers.0.mlp.down_proj.scales": "model-00001-of-00004.safetensors",
13
+ "model.layers.0.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
14
+ "model.layers.0.mlp.gate_proj.biases": "model-00001-of-00004.safetensors",
15
+ "model.layers.0.mlp.gate_proj.scales": "model-00001-of-00004.safetensors",
16
+ "model.layers.0.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
17
+ "model.layers.0.mlp.up_proj.biases": "model-00001-of-00004.safetensors",
18
+ "model.layers.0.mlp.up_proj.scales": "model-00001-of-00004.safetensors",
19
+ "model.layers.0.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
20
+ "model.layers.0.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
21
+ "model.layers.0.self_attn.k_proj.biases": "model-00001-of-00004.safetensors",
22
+ "model.layers.0.self_attn.k_proj.scales": "model-00001-of-00004.safetensors",
23
+ "model.layers.0.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
24
+ "model.layers.0.self_attn.o_proj.biases": "model-00001-of-00004.safetensors",
25
+ "model.layers.0.self_attn.o_proj.scales": "model-00001-of-00004.safetensors",
26
+ "model.layers.0.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
27
+ "model.layers.0.self_attn.q_proj.biases": "model-00001-of-00004.safetensors",
28
+ "model.layers.0.self_attn.q_proj.scales": "model-00001-of-00004.safetensors",
29
+ "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
30
+ "model.layers.0.self_attn.v_proj.biases": "model-00001-of-00004.safetensors",
31
+ "model.layers.0.self_attn.v_proj.scales": "model-00001-of-00004.safetensors",
32
+ "model.layers.0.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
33
+ "model.layers.1.input_layernorm.weight": "model-00001-of-00004.safetensors",
34
+ "model.layers.1.mlp.gate.biases": "model-00001-of-00004.safetensors",
35
+ "model.layers.1.mlp.gate.scales": "model-00001-of-00004.safetensors",
36
+ "model.layers.1.mlp.gate.weight": "model-00001-of-00004.safetensors",
37
+ "model.layers.1.mlp.shared_experts.down_proj.biases": "model-00001-of-00004.safetensors",
38
+ "model.layers.1.mlp.shared_experts.down_proj.scales": "model-00001-of-00004.safetensors",
39
+ "model.layers.1.mlp.shared_experts.down_proj.weight": "model-00001-of-00004.safetensors",
40
+ "model.layers.1.mlp.shared_experts.gate_proj.biases": "model-00001-of-00004.safetensors",
41
+ "model.layers.1.mlp.shared_experts.gate_proj.scales": "model-00001-of-00004.safetensors",
42
+ "model.layers.1.mlp.shared_experts.gate_proj.weight": "model-00001-of-00004.safetensors",
43
+ "model.layers.1.mlp.shared_experts.up_proj.biases": "model-00001-of-00004.safetensors",
44
+ "model.layers.1.mlp.shared_experts.up_proj.scales": "model-00001-of-00004.safetensors",
45
+ "model.layers.1.mlp.shared_experts.up_proj.weight": "model-00001-of-00004.safetensors",
46
+ "model.layers.1.mlp.switch_mlp.down_proj.biases": "model-00001-of-00004.safetensors",
47
+ "model.layers.1.mlp.switch_mlp.down_proj.scales": "model-00001-of-00004.safetensors",
48
+ "model.layers.1.mlp.switch_mlp.down_proj.weight": "model-00001-of-00004.safetensors",
49
+ "model.layers.1.mlp.switch_mlp.gate_proj.biases": "model-00001-of-00004.safetensors",
50
+ "model.layers.1.mlp.switch_mlp.gate_proj.scales": "model-00001-of-00004.safetensors",
51
+ "model.layers.1.mlp.switch_mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
52
+ "model.layers.1.mlp.switch_mlp.up_proj.biases": "model-00001-of-00004.safetensors",
53
+ "model.layers.1.mlp.switch_mlp.up_proj.scales": "model-00001-of-00004.safetensors",
54
+ "model.layers.1.mlp.switch_mlp.up_proj.weight": "model-00001-of-00004.safetensors",
55
+ "model.layers.1.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
56
+ "model.layers.1.self_attn.k_proj.biases": "model-00001-of-00004.safetensors",
57
+ "model.layers.1.self_attn.k_proj.scales": "model-00001-of-00004.safetensors",
58
+ "model.layers.1.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
59
+ "model.layers.1.self_attn.o_proj.biases": "model-00001-of-00004.safetensors",
60
+ "model.layers.1.self_attn.o_proj.scales": "model-00001-of-00004.safetensors",
61
+ "model.layers.1.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
62
+ "model.layers.1.self_attn.q_proj.biases": "model-00001-of-00004.safetensors",
63
+ "model.layers.1.self_attn.q_proj.scales": "model-00001-of-00004.safetensors",
64
+ "model.layers.1.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
65
+ "model.layers.1.self_attn.v_proj.biases": "model-00001-of-00004.safetensors",
66
+ "model.layers.1.self_attn.v_proj.scales": "model-00001-of-00004.safetensors",
67
+ "model.layers.1.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
68
+ "model.layers.10.input_layernorm.weight": "model-00002-of-00004.safetensors",
69
+ "model.layers.10.mlp.gate.biases": "model-00002-of-00004.safetensors",
70
+ "model.layers.10.mlp.gate.scales": "model-00002-of-00004.safetensors",
71
+ "model.layers.10.mlp.gate.weight": "model-00002-of-00004.safetensors",
72
+ "model.layers.10.mlp.shared_experts.down_proj.biases": "model-00002-of-00004.safetensors",
73
+ "model.layers.10.mlp.shared_experts.down_proj.scales": "model-00002-of-00004.safetensors",
74
+ "model.layers.10.mlp.shared_experts.down_proj.weight": "model-00002-of-00004.safetensors",
75
+ "model.layers.10.mlp.shared_experts.gate_proj.biases": "model-00002-of-00004.safetensors",
76
+ "model.layers.10.mlp.shared_experts.gate_proj.scales": "model-00002-of-00004.safetensors",
77
+ "model.layers.10.mlp.shared_experts.gate_proj.weight": "model-00002-of-00004.safetensors",
78
+ "model.layers.10.mlp.shared_experts.up_proj.biases": "model-00002-of-00004.safetensors",
79
+ "model.layers.10.mlp.shared_experts.up_proj.scales": "model-00002-of-00004.safetensors",
80
+ "model.layers.10.mlp.shared_experts.up_proj.weight": "model-00002-of-00004.safetensors",
81
+ "model.layers.10.mlp.switch_mlp.down_proj.biases": "model-00002-of-00004.safetensors",
82
+ "model.layers.10.mlp.switch_mlp.down_proj.scales": "model-00002-of-00004.safetensors",
83
+ "model.layers.10.mlp.switch_mlp.down_proj.weight": "model-00002-of-00004.safetensors",
84
+ "model.layers.10.mlp.switch_mlp.gate_proj.biases": "model-00002-of-00004.safetensors",
85
+ "model.layers.10.mlp.switch_mlp.gate_proj.scales": "model-00002-of-00004.safetensors",
86
+ "model.layers.10.mlp.switch_mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
87
+ "model.layers.10.mlp.switch_mlp.up_proj.biases": "model-00002-of-00004.safetensors",
88
+ "model.layers.10.mlp.switch_mlp.up_proj.scales": "model-00002-of-00004.safetensors",
89
+ "model.layers.10.mlp.switch_mlp.up_proj.weight": "model-00002-of-00004.safetensors",
90
+ "model.layers.10.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
91
+ "model.layers.10.self_attn.k_proj.biases": "model-00002-of-00004.safetensors",
92
+ "model.layers.10.self_attn.k_proj.scales": "model-00002-of-00004.safetensors",
93
+ "model.layers.10.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
94
+ "model.layers.10.self_attn.o_proj.biases": "model-00002-of-00004.safetensors",
95
+ "model.layers.10.self_attn.o_proj.scales": "model-00002-of-00004.safetensors",
96
+ "model.layers.10.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
97
+ "model.layers.10.self_attn.q_proj.biases": "model-00002-of-00004.safetensors",
98
+ "model.layers.10.self_attn.q_proj.scales": "model-00002-of-00004.safetensors",
99
+ "model.layers.10.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
100
+ "model.layers.10.self_attn.v_proj.biases": "model-00002-of-00004.safetensors",
101
+ "model.layers.10.self_attn.v_proj.scales": "model-00002-of-00004.safetensors",
102
+ "model.layers.10.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
103
+ "model.layers.11.input_layernorm.weight": "model-00002-of-00004.safetensors",
104
+ "model.layers.11.mlp.gate.biases": "model-00002-of-00004.safetensors",
105
+ "model.layers.11.mlp.gate.scales": "model-00002-of-00004.safetensors",
106
+ "model.layers.11.mlp.gate.weight": "model-00002-of-00004.safetensors",
107
+ "model.layers.11.mlp.shared_experts.down_proj.biases": "model-00002-of-00004.safetensors",
108
+ "model.layers.11.mlp.shared_experts.down_proj.scales": "model-00002-of-00004.safetensors",
109
+ "model.layers.11.mlp.shared_experts.down_proj.weight": "model-00002-of-00004.safetensors",
110
+ "model.layers.11.mlp.shared_experts.gate_proj.biases": "model-00002-of-00004.safetensors",
111
+ "model.layers.11.mlp.shared_experts.gate_proj.scales": "model-00002-of-00004.safetensors",
112
+ "model.layers.11.mlp.shared_experts.gate_proj.weight": "model-00002-of-00004.safetensors",
113
+ "model.layers.11.mlp.shared_experts.up_proj.biases": "model-00002-of-00004.safetensors",
114
+ "model.layers.11.mlp.shared_experts.up_proj.scales": "model-00002-of-00004.safetensors",
115
+ "model.layers.11.mlp.shared_experts.up_proj.weight": "model-00002-of-00004.safetensors",
116
+ "model.layers.11.mlp.switch_mlp.down_proj.biases": "model-00002-of-00004.safetensors",
117
+ "model.layers.11.mlp.switch_mlp.down_proj.scales": "model-00002-of-00004.safetensors",
118
+ "model.layers.11.mlp.switch_mlp.down_proj.weight": "model-00002-of-00004.safetensors",
119
+ "model.layers.11.mlp.switch_mlp.gate_proj.biases": "model-00002-of-00004.safetensors",
120
+ "model.layers.11.mlp.switch_mlp.gate_proj.scales": "model-00002-of-00004.safetensors",
121
+ "model.layers.11.mlp.switch_mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
122
+ "model.layers.11.mlp.switch_mlp.up_proj.biases": "model-00002-of-00004.safetensors",
123
+ "model.layers.11.mlp.switch_mlp.up_proj.scales": "model-00002-of-00004.safetensors",
124
+ "model.layers.11.mlp.switch_mlp.up_proj.weight": "model-00002-of-00004.safetensors",
125
+ "model.layers.11.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
126
+ "model.layers.11.self_attn.k_proj.biases": "model-00002-of-00004.safetensors",
127
+ "model.layers.11.self_attn.k_proj.scales": "model-00002-of-00004.safetensors",
128
+ "model.layers.11.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
129
+ "model.layers.11.self_attn.o_proj.biases": "model-00002-of-00004.safetensors",
130
+ "model.layers.11.self_attn.o_proj.scales": "model-00002-of-00004.safetensors",
131
+ "model.layers.11.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
132
+ "model.layers.11.self_attn.q_proj.biases": "model-00002-of-00004.safetensors",
133
+ "model.layers.11.self_attn.q_proj.scales": "model-00002-of-00004.safetensors",
134
+ "model.layers.11.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
135
+ "model.layers.11.self_attn.v_proj.biases": "model-00002-of-00004.safetensors",
136
+ "model.layers.11.self_attn.v_proj.scales": "model-00002-of-00004.safetensors",
137
+ "model.layers.11.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
138
+ "model.layers.12.input_layernorm.weight": "model-00002-of-00004.safetensors",
139
+ "model.layers.12.mlp.gate.biases": "model-00002-of-00004.safetensors",
140
+ "model.layers.12.mlp.gate.scales": "model-00002-of-00004.safetensors",
141
+ "model.layers.12.mlp.gate.weight": "model-00002-of-00004.safetensors",
142
+ "model.layers.12.mlp.shared_experts.down_proj.biases": "model-00002-of-00004.safetensors",
143
+ "model.layers.12.mlp.shared_experts.down_proj.scales": "model-00002-of-00004.safetensors",
144
+ "model.layers.12.mlp.shared_experts.down_proj.weight": "model-00002-of-00004.safetensors",
145
+ "model.layers.12.mlp.shared_experts.gate_proj.biases": "model-00002-of-00004.safetensors",
146
+ "model.layers.12.mlp.shared_experts.gate_proj.scales": "model-00002-of-00004.safetensors",
147
+ "model.layers.12.mlp.shared_experts.gate_proj.weight": "model-00002-of-00004.safetensors",
148
+ "model.layers.12.mlp.shared_experts.up_proj.biases": "model-00002-of-00004.safetensors",
149
+ "model.layers.12.mlp.shared_experts.up_proj.scales": "model-00002-of-00004.safetensors",
150
+ "model.layers.12.mlp.shared_experts.up_proj.weight": "model-00002-of-00004.safetensors",
151
+ "model.layers.12.mlp.switch_mlp.down_proj.biases": "model-00002-of-00004.safetensors",
152
+ "model.layers.12.mlp.switch_mlp.down_proj.scales": "model-00002-of-00004.safetensors",
153
+ "model.layers.12.mlp.switch_mlp.down_proj.weight": "model-00002-of-00004.safetensors",
154
+ "model.layers.12.mlp.switch_mlp.gate_proj.biases": "model-00002-of-00004.safetensors",
155
+ "model.layers.12.mlp.switch_mlp.gate_proj.scales": "model-00002-of-00004.safetensors",
156
+ "model.layers.12.mlp.switch_mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
157
+ "model.layers.12.mlp.switch_mlp.up_proj.biases": "model-00002-of-00004.safetensors",
158
+ "model.layers.12.mlp.switch_mlp.up_proj.scales": "model-00002-of-00004.safetensors",
159
+ "model.layers.12.mlp.switch_mlp.up_proj.weight": "model-00002-of-00004.safetensors",
160
+ "model.layers.12.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
161
+ "model.layers.12.self_attn.k_proj.biases": "model-00002-of-00004.safetensors",
162
+ "model.layers.12.self_attn.k_proj.scales": "model-00002-of-00004.safetensors",
163
+ "model.layers.12.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
164
+ "model.layers.12.self_attn.o_proj.biases": "model-00002-of-00004.safetensors",
165
+ "model.layers.12.self_attn.o_proj.scales": "model-00002-of-00004.safetensors",
166
+ "model.layers.12.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
167
+ "model.layers.12.self_attn.q_proj.biases": "model-00002-of-00004.safetensors",
168
+ "model.layers.12.self_attn.q_proj.scales": "model-00002-of-00004.safetensors",
169
+ "model.layers.12.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
170
+ "model.layers.12.self_attn.v_proj.biases": "model-00002-of-00004.safetensors",
171
+ "model.layers.12.self_attn.v_proj.scales": "model-00002-of-00004.safetensors",
172
+ "model.layers.12.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
173
+ "model.layers.13.input_layernorm.weight": "model-00002-of-00004.safetensors",
174
+ "model.layers.13.mlp.gate.biases": "model-00002-of-00004.safetensors",
175
+ "model.layers.13.mlp.gate.scales": "model-00002-of-00004.safetensors",
176
+ "model.layers.13.mlp.gate.weight": "model-00002-of-00004.safetensors",
177
+ "model.layers.13.mlp.shared_experts.down_proj.biases": "model-00002-of-00004.safetensors",
178
+ "model.layers.13.mlp.shared_experts.down_proj.scales": "model-00002-of-00004.safetensors",
179
+ "model.layers.13.mlp.shared_experts.down_proj.weight": "model-00002-of-00004.safetensors",
180
+ "model.layers.13.mlp.shared_experts.gate_proj.biases": "model-00002-of-00004.safetensors",
181
+ "model.layers.13.mlp.shared_experts.gate_proj.scales": "model-00002-of-00004.safetensors",
182
+ "model.layers.13.mlp.shared_experts.gate_proj.weight": "model-00002-of-00004.safetensors",
183
+ "model.layers.13.mlp.shared_experts.up_proj.biases": "model-00002-of-00004.safetensors",
184
+ "model.layers.13.mlp.shared_experts.up_proj.scales": "model-00002-of-00004.safetensors",
185
+ "model.layers.13.mlp.shared_experts.up_proj.weight": "model-00002-of-00004.safetensors",
186
+ "model.layers.13.mlp.switch_mlp.down_proj.biases": "model-00002-of-00004.safetensors",
187
+ "model.layers.13.mlp.switch_mlp.down_proj.scales": "model-00002-of-00004.safetensors",
188
+ "model.layers.13.mlp.switch_mlp.down_proj.weight": "model-00002-of-00004.safetensors",
189
+ "model.layers.13.mlp.switch_mlp.gate_proj.biases": "model-00002-of-00004.safetensors",
190
+ "model.layers.13.mlp.switch_mlp.gate_proj.scales": "model-00002-of-00004.safetensors",
191
+ "model.layers.13.mlp.switch_mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
192
+ "model.layers.13.mlp.switch_mlp.up_proj.biases": "model-00002-of-00004.safetensors",
193
+ "model.layers.13.mlp.switch_mlp.up_proj.scales": "model-00002-of-00004.safetensors",
194
+ "model.layers.13.mlp.switch_mlp.up_proj.weight": "model-00002-of-00004.safetensors",
195
+ "model.layers.13.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
196
+ "model.layers.13.self_attn.k_proj.biases": "model-00002-of-00004.safetensors",
197
+ "model.layers.13.self_attn.k_proj.scales": "model-00002-of-00004.safetensors",
198
+ "model.layers.13.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
199
+ "model.layers.13.self_attn.o_proj.biases": "model-00002-of-00004.safetensors",
200
+ "model.layers.13.self_attn.o_proj.scales": "model-00002-of-00004.safetensors",
201
+ "model.layers.13.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
202
+ "model.layers.13.self_attn.q_proj.biases": "model-00002-of-00004.safetensors",
203
+ "model.layers.13.self_attn.q_proj.scales": "model-00002-of-00004.safetensors",
204
+ "model.layers.13.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
205
+ "model.layers.13.self_attn.v_proj.biases": "model-00002-of-00004.safetensors",
206
+ "model.layers.13.self_attn.v_proj.scales": "model-00002-of-00004.safetensors",
207
+ "model.layers.13.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
208
+ "model.layers.14.input_layernorm.weight": "model-00002-of-00004.safetensors",
209
+ "model.layers.14.mlp.gate.biases": "model-00002-of-00004.safetensors",
210
+ "model.layers.14.mlp.gate.scales": "model-00002-of-00004.safetensors",
211
+ "model.layers.14.mlp.gate.weight": "model-00002-of-00004.safetensors",
212
+ "model.layers.14.mlp.shared_experts.down_proj.biases": "model-00002-of-00004.safetensors",
213
+ "model.layers.14.mlp.shared_experts.down_proj.scales": "model-00002-of-00004.safetensors",
214
+ "model.layers.14.mlp.shared_experts.down_proj.weight": "model-00002-of-00004.safetensors",
215
+ "model.layers.14.mlp.shared_experts.gate_proj.biases": "model-00002-of-00004.safetensors",
216
+ "model.layers.14.mlp.shared_experts.gate_proj.scales": "model-00002-of-00004.safetensors",
217
+ "model.layers.14.mlp.shared_experts.gate_proj.weight": "model-00002-of-00004.safetensors",
218
+ "model.layers.14.mlp.shared_experts.up_proj.biases": "model-00002-of-00004.safetensors",
219
+ "model.layers.14.mlp.shared_experts.up_proj.scales": "model-00002-of-00004.safetensors",
220
+ "model.layers.14.mlp.shared_experts.up_proj.weight": "model-00002-of-00004.safetensors",
221
+ "model.layers.14.mlp.switch_mlp.down_proj.biases": "model-00002-of-00004.safetensors",
222
+ "model.layers.14.mlp.switch_mlp.down_proj.scales": "model-00002-of-00004.safetensors",
223
+ "model.layers.14.mlp.switch_mlp.down_proj.weight": "model-00002-of-00004.safetensors",
224
+ "model.layers.14.mlp.switch_mlp.gate_proj.biases": "model-00002-of-00004.safetensors",
225
+ "model.layers.14.mlp.switch_mlp.gate_proj.scales": "model-00002-of-00004.safetensors",
226
+ "model.layers.14.mlp.switch_mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
227
+ "model.layers.14.mlp.switch_mlp.up_proj.biases": "model-00002-of-00004.safetensors",
228
+ "model.layers.14.mlp.switch_mlp.up_proj.scales": "model-00002-of-00004.safetensors",
229
+ "model.layers.14.mlp.switch_mlp.up_proj.weight": "model-00002-of-00004.safetensors",
230
+ "model.layers.14.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
231
+ "model.layers.14.self_attn.k_proj.biases": "model-00002-of-00004.safetensors",
232
+ "model.layers.14.self_attn.k_proj.scales": "model-00002-of-00004.safetensors",
233
+ "model.layers.14.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
234
+ "model.layers.14.self_attn.o_proj.biases": "model-00002-of-00004.safetensors",
235
+ "model.layers.14.self_attn.o_proj.scales": "model-00002-of-00004.safetensors",
236
+ "model.layers.14.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
237
+ "model.layers.14.self_attn.q_proj.biases": "model-00002-of-00004.safetensors",
238
+ "model.layers.14.self_attn.q_proj.scales": "model-00002-of-00004.safetensors",
239
+ "model.layers.14.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
240
+ "model.layers.14.self_attn.v_proj.biases": "model-00002-of-00004.safetensors",
241
+ "model.layers.14.self_attn.v_proj.scales": "model-00002-of-00004.safetensors",
242
+ "model.layers.14.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
243
+ "model.layers.15.input_layernorm.weight": "model-00002-of-00004.safetensors",
244
+ "model.layers.15.mlp.gate.biases": "model-00002-of-00004.safetensors",
245
+ "model.layers.15.mlp.gate.scales": "model-00002-of-00004.safetensors",
246
+ "model.layers.15.mlp.gate.weight": "model-00002-of-00004.safetensors",
247
+ "model.layers.15.mlp.shared_experts.down_proj.biases": "model-00002-of-00004.safetensors",
248
+ "model.layers.15.mlp.shared_experts.down_proj.scales": "model-00002-of-00004.safetensors",
249
+ "model.layers.15.mlp.shared_experts.down_proj.weight": "model-00002-of-00004.safetensors",
250
+ "model.layers.15.mlp.shared_experts.gate_proj.biases": "model-00002-of-00004.safetensors",
251
+ "model.layers.15.mlp.shared_experts.gate_proj.scales": "model-00002-of-00004.safetensors",
252
+ "model.layers.15.mlp.shared_experts.gate_proj.weight": "model-00002-of-00004.safetensors",
253
+ "model.layers.15.mlp.shared_experts.up_proj.biases": "model-00002-of-00004.safetensors",
254
+ "model.layers.15.mlp.shared_experts.up_proj.scales": "model-00002-of-00004.safetensors",
255
+ "model.layers.15.mlp.shared_experts.up_proj.weight": "model-00002-of-00004.safetensors",
256
+ "model.layers.15.mlp.switch_mlp.down_proj.biases": "model-00002-of-00004.safetensors",
257
+ "model.layers.15.mlp.switch_mlp.down_proj.scales": "model-00002-of-00004.safetensors",
258
+ "model.layers.15.mlp.switch_mlp.down_proj.weight": "model-00002-of-00004.safetensors",
259
+ "model.layers.15.mlp.switch_mlp.gate_proj.biases": "model-00002-of-00004.safetensors",
260
+ "model.layers.15.mlp.switch_mlp.gate_proj.scales": "model-00002-of-00004.safetensors",
261
+ "model.layers.15.mlp.switch_mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
262
+ "model.layers.15.mlp.switch_mlp.up_proj.biases": "model-00002-of-00004.safetensors",
263
+ "model.layers.15.mlp.switch_mlp.up_proj.scales": "model-00002-of-00004.safetensors",
264
+ "model.layers.15.mlp.switch_mlp.up_proj.weight": "model-00002-of-00004.safetensors",
265
+ "model.layers.15.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
266
+ "model.layers.15.self_attn.k_proj.biases": "model-00002-of-00004.safetensors",
267
+ "model.layers.15.self_attn.k_proj.scales": "model-00002-of-00004.safetensors",
268
+ "model.layers.15.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
269
+ "model.layers.15.self_attn.o_proj.biases": "model-00002-of-00004.safetensors",
270
+ "model.layers.15.self_attn.o_proj.scales": "model-00002-of-00004.safetensors",
271
+ "model.layers.15.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
272
+ "model.layers.15.self_attn.q_proj.biases": "model-00002-of-00004.safetensors",
273
+ "model.layers.15.self_attn.q_proj.scales": "model-00002-of-00004.safetensors",
274
+ "model.layers.15.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
275
+ "model.layers.15.self_attn.v_proj.biases": "model-00002-of-00004.safetensors",
276
+ "model.layers.15.self_attn.v_proj.scales": "model-00002-of-00004.safetensors",
277
+ "model.layers.15.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
278
+ "model.layers.16.input_layernorm.weight": "model-00003-of-00004.safetensors",
279
+ "model.layers.16.mlp.gate.biases": "model-00002-of-00004.safetensors",
280
+ "model.layers.16.mlp.gate.scales": "model-00002-of-00004.safetensors",
281
+ "model.layers.16.mlp.gate.weight": "model-00002-of-00004.safetensors",
282
+ "model.layers.16.mlp.shared_experts.down_proj.biases": "model-00003-of-00004.safetensors",
283
+ "model.layers.16.mlp.shared_experts.down_proj.scales": "model-00003-of-00004.safetensors",
284
+ "model.layers.16.mlp.shared_experts.down_proj.weight": "model-00003-of-00004.safetensors",
285
+ "model.layers.16.mlp.shared_experts.gate_proj.biases": "model-00003-of-00004.safetensors",
286
+ "model.layers.16.mlp.shared_experts.gate_proj.scales": "model-00003-of-00004.safetensors",
287
+ "model.layers.16.mlp.shared_experts.gate_proj.weight": "model-00003-of-00004.safetensors",
288
+ "model.layers.16.mlp.shared_experts.up_proj.biases": "model-00003-of-00004.safetensors",
289
+ "model.layers.16.mlp.shared_experts.up_proj.scales": "model-00003-of-00004.safetensors",
290
+ "model.layers.16.mlp.shared_experts.up_proj.weight": "model-00003-of-00004.safetensors",
291
+ "model.layers.16.mlp.switch_mlp.down_proj.biases": "model-00002-of-00004.safetensors",
292
+ "model.layers.16.mlp.switch_mlp.down_proj.scales": "model-00002-of-00004.safetensors",
293
+ "model.layers.16.mlp.switch_mlp.down_proj.weight": "model-00002-of-00004.safetensors",
294
+ "model.layers.16.mlp.switch_mlp.gate_proj.biases": "model-00002-of-00004.safetensors",
295
+ "model.layers.16.mlp.switch_mlp.gate_proj.scales": "model-00002-of-00004.safetensors",
296
+ "model.layers.16.mlp.switch_mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
297
+ "model.layers.16.mlp.switch_mlp.up_proj.biases": "model-00002-of-00004.safetensors",
298
+ "model.layers.16.mlp.switch_mlp.up_proj.scales": "model-00002-of-00004.safetensors",
299
+ "model.layers.16.mlp.switch_mlp.up_proj.weight": "model-00002-of-00004.safetensors",
300
+ "model.layers.16.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
301
+ "model.layers.16.self_attn.k_proj.biases": "model-00002-of-00004.safetensors",
302
+ "model.layers.16.self_attn.k_proj.scales": "model-00002-of-00004.safetensors",
303
+ "model.layers.16.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
304
+ "model.layers.16.self_attn.o_proj.biases": "model-00002-of-00004.safetensors",
305
+ "model.layers.16.self_attn.o_proj.scales": "model-00002-of-00004.safetensors",
306
+ "model.layers.16.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
307
+ "model.layers.16.self_attn.q_proj.biases": "model-00002-of-00004.safetensors",
308
+ "model.layers.16.self_attn.q_proj.scales": "model-00002-of-00004.safetensors",
309
+ "model.layers.16.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
310
+ "model.layers.16.self_attn.v_proj.biases": "model-00002-of-00004.safetensors",
311
+ "model.layers.16.self_attn.v_proj.scales": "model-00002-of-00004.safetensors",
312
+ "model.layers.16.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
313
+ "model.layers.17.input_layernorm.weight": "model-00003-of-00004.safetensors",
314
+ "model.layers.17.mlp.gate.biases": "model-00003-of-00004.safetensors",
315
+ "model.layers.17.mlp.gate.scales": "model-00003-of-00004.safetensors",
316
+ "model.layers.17.mlp.gate.weight": "model-00003-of-00004.safetensors",
317
+ "model.layers.17.mlp.shared_experts.down_proj.biases": "model-00003-of-00004.safetensors",
318
+ "model.layers.17.mlp.shared_experts.down_proj.scales": "model-00003-of-00004.safetensors",
319
+ "model.layers.17.mlp.shared_experts.down_proj.weight": "model-00003-of-00004.safetensors",
320
+ "model.layers.17.mlp.shared_experts.gate_proj.biases": "model-00003-of-00004.safetensors",
321
+ "model.layers.17.mlp.shared_experts.gate_proj.scales": "model-00003-of-00004.safetensors",
322
+ "model.layers.17.mlp.shared_experts.gate_proj.weight": "model-00003-of-00004.safetensors",
323
+ "model.layers.17.mlp.shared_experts.up_proj.biases": "model-00003-of-00004.safetensors",
324
+ "model.layers.17.mlp.shared_experts.up_proj.scales": "model-00003-of-00004.safetensors",
325
+ "model.layers.17.mlp.shared_experts.up_proj.weight": "model-00003-of-00004.safetensors",
326
+ "model.layers.17.mlp.switch_mlp.down_proj.biases": "model-00003-of-00004.safetensors",
327
+ "model.layers.17.mlp.switch_mlp.down_proj.scales": "model-00003-of-00004.safetensors",
328
+ "model.layers.17.mlp.switch_mlp.down_proj.weight": "model-00003-of-00004.safetensors",
329
+ "model.layers.17.mlp.switch_mlp.gate_proj.biases": "model-00003-of-00004.safetensors",
330
+ "model.layers.17.mlp.switch_mlp.gate_proj.scales": "model-00003-of-00004.safetensors",
331
+ "model.layers.17.mlp.switch_mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
332
+ "model.layers.17.mlp.switch_mlp.up_proj.biases": "model-00003-of-00004.safetensors",
333
+ "model.layers.17.mlp.switch_mlp.up_proj.scales": "model-00003-of-00004.safetensors",
334
+ "model.layers.17.mlp.switch_mlp.up_proj.weight": "model-00003-of-00004.safetensors",
335
+ "model.layers.17.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
336
+ "model.layers.17.self_attn.k_proj.biases": "model-00003-of-00004.safetensors",
337
+ "model.layers.17.self_attn.k_proj.scales": "model-00003-of-00004.safetensors",
338
+ "model.layers.17.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
339
+ "model.layers.17.self_attn.o_proj.biases": "model-00003-of-00004.safetensors",
340
+ "model.layers.17.self_attn.o_proj.scales": "model-00003-of-00004.safetensors",
341
+ "model.layers.17.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
342
+ "model.layers.17.self_attn.q_proj.biases": "model-00003-of-00004.safetensors",
343
+ "model.layers.17.self_attn.q_proj.scales": "model-00003-of-00004.safetensors",
344
+ "model.layers.17.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
345
+ "model.layers.17.self_attn.v_proj.biases": "model-00003-of-00004.safetensors",
346
+ "model.layers.17.self_attn.v_proj.scales": "model-00003-of-00004.safetensors",
347
+ "model.layers.17.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
348
+ "model.layers.18.input_layernorm.weight": "model-00003-of-00004.safetensors",
349
+ "model.layers.18.mlp.gate.biases": "model-00003-of-00004.safetensors",
350
+ "model.layers.18.mlp.gate.scales": "model-00003-of-00004.safetensors",
351
+ "model.layers.18.mlp.gate.weight": "model-00003-of-00004.safetensors",
352
+ "model.layers.18.mlp.shared_experts.down_proj.biases": "model-00003-of-00004.safetensors",
353
+ "model.layers.18.mlp.shared_experts.down_proj.scales": "model-00003-of-00004.safetensors",
354
+ "model.layers.18.mlp.shared_experts.down_proj.weight": "model-00003-of-00004.safetensors",
355
+ "model.layers.18.mlp.shared_experts.gate_proj.biases": "model-00003-of-00004.safetensors",
356
+ "model.layers.18.mlp.shared_experts.gate_proj.scales": "model-00003-of-00004.safetensors",
357
+ "model.layers.18.mlp.shared_experts.gate_proj.weight": "model-00003-of-00004.safetensors",
358
+ "model.layers.18.mlp.shared_experts.up_proj.biases": "model-00003-of-00004.safetensors",
359
+ "model.layers.18.mlp.shared_experts.up_proj.scales": "model-00003-of-00004.safetensors",
360
+ "model.layers.18.mlp.shared_experts.up_proj.weight": "model-00003-of-00004.safetensors",
361
+ "model.layers.18.mlp.switch_mlp.down_proj.biases": "model-00003-of-00004.safetensors",
362
+ "model.layers.18.mlp.switch_mlp.down_proj.scales": "model-00003-of-00004.safetensors",
363
+ "model.layers.18.mlp.switch_mlp.down_proj.weight": "model-00003-of-00004.safetensors",
364
+ "model.layers.18.mlp.switch_mlp.gate_proj.biases": "model-00003-of-00004.safetensors",
365
+ "model.layers.18.mlp.switch_mlp.gate_proj.scales": "model-00003-of-00004.safetensors",
366
+ "model.layers.18.mlp.switch_mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
367
+ "model.layers.18.mlp.switch_mlp.up_proj.biases": "model-00003-of-00004.safetensors",
368
+ "model.layers.18.mlp.switch_mlp.up_proj.scales": "model-00003-of-00004.safetensors",
369
+ "model.layers.18.mlp.switch_mlp.up_proj.weight": "model-00003-of-00004.safetensors",
370
+ "model.layers.18.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
371
+ "model.layers.18.self_attn.k_proj.biases": "model-00003-of-00004.safetensors",
372
+ "model.layers.18.self_attn.k_proj.scales": "model-00003-of-00004.safetensors",
373
+ "model.layers.18.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
374
+ "model.layers.18.self_attn.o_proj.biases": "model-00003-of-00004.safetensors",
375
+ "model.layers.18.self_attn.o_proj.scales": "model-00003-of-00004.safetensors",
376
+ "model.layers.18.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
377
+ "model.layers.18.self_attn.q_proj.biases": "model-00003-of-00004.safetensors",
378
+ "model.layers.18.self_attn.q_proj.scales": "model-00003-of-00004.safetensors",
379
+ "model.layers.18.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
380
+ "model.layers.18.self_attn.v_proj.biases": "model-00003-of-00004.safetensors",
381
+ "model.layers.18.self_attn.v_proj.scales": "model-00003-of-00004.safetensors",
382
+ "model.layers.18.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
383
+ "model.layers.19.input_layernorm.weight": "model-00003-of-00004.safetensors",
384
+ "model.layers.19.mlp.gate.biases": "model-00003-of-00004.safetensors",
385
+ "model.layers.19.mlp.gate.scales": "model-00003-of-00004.safetensors",
386
+ "model.layers.19.mlp.gate.weight": "model-00003-of-00004.safetensors",
387
+ "model.layers.19.mlp.shared_experts.down_proj.biases": "model-00003-of-00004.safetensors",
388
+ "model.layers.19.mlp.shared_experts.down_proj.scales": "model-00003-of-00004.safetensors",
389
+ "model.layers.19.mlp.shared_experts.down_proj.weight": "model-00003-of-00004.safetensors",
390
+ "model.layers.19.mlp.shared_experts.gate_proj.biases": "model-00003-of-00004.safetensors",
391
+ "model.layers.19.mlp.shared_experts.gate_proj.scales": "model-00003-of-00004.safetensors",
392
+ "model.layers.19.mlp.shared_experts.gate_proj.weight": "model-00003-of-00004.safetensors",
393
+ "model.layers.19.mlp.shared_experts.up_proj.biases": "model-00003-of-00004.safetensors",
394
+ "model.layers.19.mlp.shared_experts.up_proj.scales": "model-00003-of-00004.safetensors",
395
+ "model.layers.19.mlp.shared_experts.up_proj.weight": "model-00003-of-00004.safetensors",
396
+ "model.layers.19.mlp.switch_mlp.down_proj.biases": "model-00003-of-00004.safetensors",
397
+ "model.layers.19.mlp.switch_mlp.down_proj.scales": "model-00003-of-00004.safetensors",
398
+ "model.layers.19.mlp.switch_mlp.down_proj.weight": "model-00003-of-00004.safetensors",
399
+ "model.layers.19.mlp.switch_mlp.gate_proj.biases": "model-00003-of-00004.safetensors",
400
+ "model.layers.19.mlp.switch_mlp.gate_proj.scales": "model-00003-of-00004.safetensors",
401
+ "model.layers.19.mlp.switch_mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
402
+ "model.layers.19.mlp.switch_mlp.up_proj.biases": "model-00003-of-00004.safetensors",
403
+ "model.layers.19.mlp.switch_mlp.up_proj.scales": "model-00003-of-00004.safetensors",
404
+ "model.layers.19.mlp.switch_mlp.up_proj.weight": "model-00003-of-00004.safetensors",
405
+ "model.layers.19.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
406
+ "model.layers.19.self_attn.k_proj.biases": "model-00003-of-00004.safetensors",
407
+ "model.layers.19.self_attn.k_proj.scales": "model-00003-of-00004.safetensors",
408
+ "model.layers.19.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
409
+ "model.layers.19.self_attn.o_proj.biases": "model-00003-of-00004.safetensors",
410
+ "model.layers.19.self_attn.o_proj.scales": "model-00003-of-00004.safetensors",
411
+ "model.layers.19.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
412
+ "model.layers.19.self_attn.q_proj.biases": "model-00003-of-00004.safetensors",
413
+ "model.layers.19.self_attn.q_proj.scales": "model-00003-of-00004.safetensors",
414
+ "model.layers.19.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
415
+ "model.layers.19.self_attn.v_proj.biases": "model-00003-of-00004.safetensors",
416
+ "model.layers.19.self_attn.v_proj.scales": "model-00003-of-00004.safetensors",
417
+ "model.layers.19.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
418
+ "model.layers.2.input_layernorm.weight": "model-00001-of-00004.safetensors",
419
+ "model.layers.2.mlp.gate.biases": "model-00001-of-00004.safetensors",
420
+ "model.layers.2.mlp.gate.scales": "model-00001-of-00004.safetensors",
421
+ "model.layers.2.mlp.gate.weight": "model-00001-of-00004.safetensors",
422
+ "model.layers.2.mlp.shared_experts.down_proj.biases": "model-00001-of-00004.safetensors",
423
+ "model.layers.2.mlp.shared_experts.down_proj.scales": "model-00001-of-00004.safetensors",
424
+ "model.layers.2.mlp.shared_experts.down_proj.weight": "model-00001-of-00004.safetensors",
425
+ "model.layers.2.mlp.shared_experts.gate_proj.biases": "model-00001-of-00004.safetensors",
426
+ "model.layers.2.mlp.shared_experts.gate_proj.scales": "model-00001-of-00004.safetensors",
427
+ "model.layers.2.mlp.shared_experts.gate_proj.weight": "model-00001-of-00004.safetensors",
428
+ "model.layers.2.mlp.shared_experts.up_proj.biases": "model-00001-of-00004.safetensors",
429
+ "model.layers.2.mlp.shared_experts.up_proj.scales": "model-00001-of-00004.safetensors",
430
+ "model.layers.2.mlp.shared_experts.up_proj.weight": "model-00001-of-00004.safetensors",
431
+ "model.layers.2.mlp.switch_mlp.down_proj.biases": "model-00001-of-00004.safetensors",
432
+ "model.layers.2.mlp.switch_mlp.down_proj.scales": "model-00001-of-00004.safetensors",
433
+ "model.layers.2.mlp.switch_mlp.down_proj.weight": "model-00001-of-00004.safetensors",
434
+ "model.layers.2.mlp.switch_mlp.gate_proj.biases": "model-00001-of-00004.safetensors",
435
+ "model.layers.2.mlp.switch_mlp.gate_proj.scales": "model-00001-of-00004.safetensors",
436
+ "model.layers.2.mlp.switch_mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
437
+ "model.layers.2.mlp.switch_mlp.up_proj.biases": "model-00001-of-00004.safetensors",
438
+ "model.layers.2.mlp.switch_mlp.up_proj.scales": "model-00001-of-00004.safetensors",
439
+ "model.layers.2.mlp.switch_mlp.up_proj.weight": "model-00001-of-00004.safetensors",
440
+ "model.layers.2.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
441
+ "model.layers.2.self_attn.k_proj.biases": "model-00001-of-00004.safetensors",
442
+ "model.layers.2.self_attn.k_proj.scales": "model-00001-of-00004.safetensors",
443
+ "model.layers.2.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
444
+ "model.layers.2.self_attn.o_proj.biases": "model-00001-of-00004.safetensors",
445
+ "model.layers.2.self_attn.o_proj.scales": "model-00001-of-00004.safetensors",
446
+ "model.layers.2.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
447
+ "model.layers.2.self_attn.q_proj.biases": "model-00001-of-00004.safetensors",
448
+ "model.layers.2.self_attn.q_proj.scales": "model-00001-of-00004.safetensors",
449
+ "model.layers.2.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
450
+ "model.layers.2.self_attn.v_proj.biases": "model-00001-of-00004.safetensors",
451
+ "model.layers.2.self_attn.v_proj.scales": "model-00001-of-00004.safetensors",
452
+ "model.layers.2.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
453
+ "model.layers.20.input_layernorm.weight": "model-00003-of-00004.safetensors",
454
+ "model.layers.20.mlp.gate.biases": "model-00003-of-00004.safetensors",
455
+ "model.layers.20.mlp.gate.scales": "model-00003-of-00004.safetensors",
456
+ "model.layers.20.mlp.gate.weight": "model-00003-of-00004.safetensors",
457
+ "model.layers.20.mlp.shared_experts.down_proj.biases": "model-00003-of-00004.safetensors",
458
+ "model.layers.20.mlp.shared_experts.down_proj.scales": "model-00003-of-00004.safetensors",
459
+ "model.layers.20.mlp.shared_experts.down_proj.weight": "model-00003-of-00004.safetensors",
460
+ "model.layers.20.mlp.shared_experts.gate_proj.biases": "model-00003-of-00004.safetensors",
461
+ "model.layers.20.mlp.shared_experts.gate_proj.scales": "model-00003-of-00004.safetensors",
462
+ "model.layers.20.mlp.shared_experts.gate_proj.weight": "model-00003-of-00004.safetensors",
463
+ "model.layers.20.mlp.shared_experts.up_proj.biases": "model-00003-of-00004.safetensors",
464
+ "model.layers.20.mlp.shared_experts.up_proj.scales": "model-00003-of-00004.safetensors",
465
+ "model.layers.20.mlp.shared_experts.up_proj.weight": "model-00003-of-00004.safetensors",
466
+ "model.layers.20.mlp.switch_mlp.down_proj.biases": "model-00003-of-00004.safetensors",
467
+ "model.layers.20.mlp.switch_mlp.down_proj.scales": "model-00003-of-00004.safetensors",
468
+ "model.layers.20.mlp.switch_mlp.down_proj.weight": "model-00003-of-00004.safetensors",
469
+ "model.layers.20.mlp.switch_mlp.gate_proj.biases": "model-00003-of-00004.safetensors",
470
+ "model.layers.20.mlp.switch_mlp.gate_proj.scales": "model-00003-of-00004.safetensors",
471
+ "model.layers.20.mlp.switch_mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
472
+ "model.layers.20.mlp.switch_mlp.up_proj.biases": "model-00003-of-00004.safetensors",
473
+ "model.layers.20.mlp.switch_mlp.up_proj.scales": "model-00003-of-00004.safetensors",
474
+ "model.layers.20.mlp.switch_mlp.up_proj.weight": "model-00003-of-00004.safetensors",
475
+ "model.layers.20.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
476
+ "model.layers.20.self_attn.k_proj.biases": "model-00003-of-00004.safetensors",
477
+ "model.layers.20.self_attn.k_proj.scales": "model-00003-of-00004.safetensors",
478
+ "model.layers.20.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
479
+ "model.layers.20.self_attn.o_proj.biases": "model-00003-of-00004.safetensors",
480
+ "model.layers.20.self_attn.o_proj.scales": "model-00003-of-00004.safetensors",
481
+ "model.layers.20.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
482
+ "model.layers.20.self_attn.q_proj.biases": "model-00003-of-00004.safetensors",
483
+ "model.layers.20.self_attn.q_proj.scales": "model-00003-of-00004.safetensors",
484
+ "model.layers.20.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
485
+ "model.layers.20.self_attn.v_proj.biases": "model-00003-of-00004.safetensors",
486
+ "model.layers.20.self_attn.v_proj.scales": "model-00003-of-00004.safetensors",
487
+ "model.layers.20.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
488
+ "model.layers.21.input_layernorm.weight": "model-00003-of-00004.safetensors",
489
+ "model.layers.21.mlp.gate.biases": "model-00003-of-00004.safetensors",
490
+ "model.layers.21.mlp.gate.scales": "model-00003-of-00004.safetensors",
491
+ "model.layers.21.mlp.gate.weight": "model-00003-of-00004.safetensors",
492
+ "model.layers.21.mlp.shared_experts.down_proj.biases": "model-00003-of-00004.safetensors",
493
+ "model.layers.21.mlp.shared_experts.down_proj.scales": "model-00003-of-00004.safetensors",
494
+ "model.layers.21.mlp.shared_experts.down_proj.weight": "model-00003-of-00004.safetensors",
495
+ "model.layers.21.mlp.shared_experts.gate_proj.biases": "model-00003-of-00004.safetensors",
496
+ "model.layers.21.mlp.shared_experts.gate_proj.scales": "model-00003-of-00004.safetensors",
497
+ "model.layers.21.mlp.shared_experts.gate_proj.weight": "model-00003-of-00004.safetensors",
498
+ "model.layers.21.mlp.shared_experts.up_proj.biases": "model-00003-of-00004.safetensors",
499
+ "model.layers.21.mlp.shared_experts.up_proj.scales": "model-00003-of-00004.safetensors",
500
+ "model.layers.21.mlp.shared_experts.up_proj.weight": "model-00003-of-00004.safetensors",
501
+ "model.layers.21.mlp.switch_mlp.down_proj.biases": "model-00003-of-00004.safetensors",
502
+ "model.layers.21.mlp.switch_mlp.down_proj.scales": "model-00003-of-00004.safetensors",
503
+ "model.layers.21.mlp.switch_mlp.down_proj.weight": "model-00003-of-00004.safetensors",
504
+ "model.layers.21.mlp.switch_mlp.gate_proj.biases": "model-00003-of-00004.safetensors",
505
+ "model.layers.21.mlp.switch_mlp.gate_proj.scales": "model-00003-of-00004.safetensors",
506
+ "model.layers.21.mlp.switch_mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
507
+ "model.layers.21.mlp.switch_mlp.up_proj.biases": "model-00003-of-00004.safetensors",
508
+ "model.layers.21.mlp.switch_mlp.up_proj.scales": "model-00003-of-00004.safetensors",
509
+ "model.layers.21.mlp.switch_mlp.up_proj.weight": "model-00003-of-00004.safetensors",
510
+ "model.layers.21.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
511
+ "model.layers.21.self_attn.k_proj.biases": "model-00003-of-00004.safetensors",
512
+ "model.layers.21.self_attn.k_proj.scales": "model-00003-of-00004.safetensors",
513
+ "model.layers.21.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
514
+ "model.layers.21.self_attn.o_proj.biases": "model-00003-of-00004.safetensors",
515
+ "model.layers.21.self_attn.o_proj.scales": "model-00003-of-00004.safetensors",
516
+ "model.layers.21.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
517
+ "model.layers.21.self_attn.q_proj.biases": "model-00003-of-00004.safetensors",
518
+ "model.layers.21.self_attn.q_proj.scales": "model-00003-of-00004.safetensors",
519
+ "model.layers.21.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
520
+ "model.layers.21.self_attn.v_proj.biases": "model-00003-of-00004.safetensors",
521
+ "model.layers.21.self_attn.v_proj.scales": "model-00003-of-00004.safetensors",
522
+ "model.layers.21.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
523
+ "model.layers.22.input_layernorm.weight": "model-00003-of-00004.safetensors",
524
+ "model.layers.22.mlp.gate.biases": "model-00003-of-00004.safetensors",
525
+ "model.layers.22.mlp.gate.scales": "model-00003-of-00004.safetensors",
526
+ "model.layers.22.mlp.gate.weight": "model-00003-of-00004.safetensors",
527
+ "model.layers.22.mlp.shared_experts.down_proj.biases": "model-00003-of-00004.safetensors",
528
+ "model.layers.22.mlp.shared_experts.down_proj.scales": "model-00003-of-00004.safetensors",
529
+ "model.layers.22.mlp.shared_experts.down_proj.weight": "model-00003-of-00004.safetensors",
530
+ "model.layers.22.mlp.shared_experts.gate_proj.biases": "model-00003-of-00004.safetensors",
531
+ "model.layers.22.mlp.shared_experts.gate_proj.scales": "model-00003-of-00004.safetensors",
532
+ "model.layers.22.mlp.shared_experts.gate_proj.weight": "model-00003-of-00004.safetensors",
533
+ "model.layers.22.mlp.shared_experts.up_proj.biases": "model-00003-of-00004.safetensors",
534
+ "model.layers.22.mlp.shared_experts.up_proj.scales": "model-00003-of-00004.safetensors",
535
+ "model.layers.22.mlp.shared_experts.up_proj.weight": "model-00003-of-00004.safetensors",
536
+ "model.layers.22.mlp.switch_mlp.down_proj.biases": "model-00003-of-00004.safetensors",
537
+ "model.layers.22.mlp.switch_mlp.down_proj.scales": "model-00003-of-00004.safetensors",
538
+ "model.layers.22.mlp.switch_mlp.down_proj.weight": "model-00003-of-00004.safetensors",
539
+ "model.layers.22.mlp.switch_mlp.gate_proj.biases": "model-00003-of-00004.safetensors",
540
+ "model.layers.22.mlp.switch_mlp.gate_proj.scales": "model-00003-of-00004.safetensors",
541
+ "model.layers.22.mlp.switch_mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
542
+ "model.layers.22.mlp.switch_mlp.up_proj.biases": "model-00003-of-00004.safetensors",
543
+ "model.layers.22.mlp.switch_mlp.up_proj.scales": "model-00003-of-00004.safetensors",
544
+ "model.layers.22.mlp.switch_mlp.up_proj.weight": "model-00003-of-00004.safetensors",
545
+ "model.layers.22.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
546
+ "model.layers.22.self_attn.k_proj.biases": "model-00003-of-00004.safetensors",
547
+ "model.layers.22.self_attn.k_proj.scales": "model-00003-of-00004.safetensors",
548
+ "model.layers.22.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
549
+ "model.layers.22.self_attn.o_proj.biases": "model-00003-of-00004.safetensors",
550
+ "model.layers.22.self_attn.o_proj.scales": "model-00003-of-00004.safetensors",
551
+ "model.layers.22.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
552
+ "model.layers.22.self_attn.q_proj.biases": "model-00003-of-00004.safetensors",
553
+ "model.layers.22.self_attn.q_proj.scales": "model-00003-of-00004.safetensors",
554
+ "model.layers.22.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
555
+ "model.layers.22.self_attn.v_proj.biases": "model-00003-of-00004.safetensors",
556
+ "model.layers.22.self_attn.v_proj.scales": "model-00003-of-00004.safetensors",
557
+ "model.layers.22.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
558
+ "model.layers.23.input_layernorm.weight": "model-00003-of-00004.safetensors",
559
+ "model.layers.23.mlp.gate.biases": "model-00003-of-00004.safetensors",
560
+ "model.layers.23.mlp.gate.scales": "model-00003-of-00004.safetensors",
561
+ "model.layers.23.mlp.gate.weight": "model-00003-of-00004.safetensors",
562
+ "model.layers.23.mlp.shared_experts.down_proj.biases": "model-00003-of-00004.safetensors",
563
+ "model.layers.23.mlp.shared_experts.down_proj.scales": "model-00003-of-00004.safetensors",
564
+ "model.layers.23.mlp.shared_experts.down_proj.weight": "model-00003-of-00004.safetensors",
565
+ "model.layers.23.mlp.shared_experts.gate_proj.biases": "model-00003-of-00004.safetensors",
566
+ "model.layers.23.mlp.shared_experts.gate_proj.scales": "model-00003-of-00004.safetensors",
567
+ "model.layers.23.mlp.shared_experts.gate_proj.weight": "model-00003-of-00004.safetensors",
568
+ "model.layers.23.mlp.shared_experts.up_proj.biases": "model-00003-of-00004.safetensors",
569
+ "model.layers.23.mlp.shared_experts.up_proj.scales": "model-00003-of-00004.safetensors",
570
+ "model.layers.23.mlp.shared_experts.up_proj.weight": "model-00003-of-00004.safetensors",
571
+ "model.layers.23.mlp.switch_mlp.down_proj.biases": "model-00003-of-00004.safetensors",
572
+ "model.layers.23.mlp.switch_mlp.down_proj.scales": "model-00003-of-00004.safetensors",
573
+ "model.layers.23.mlp.switch_mlp.down_proj.weight": "model-00003-of-00004.safetensors",
574
+ "model.layers.23.mlp.switch_mlp.gate_proj.biases": "model-00003-of-00004.safetensors",
575
+ "model.layers.23.mlp.switch_mlp.gate_proj.scales": "model-00003-of-00004.safetensors",
576
+ "model.layers.23.mlp.switch_mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
577
+ "model.layers.23.mlp.switch_mlp.up_proj.biases": "model-00003-of-00004.safetensors",
578
+ "model.layers.23.mlp.switch_mlp.up_proj.scales": "model-00003-of-00004.safetensors",
579
+ "model.layers.23.mlp.switch_mlp.up_proj.weight": "model-00003-of-00004.safetensors",
580
+ "model.layers.23.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
581
+ "model.layers.23.self_attn.k_proj.biases": "model-00003-of-00004.safetensors",
582
+ "model.layers.23.self_attn.k_proj.scales": "model-00003-of-00004.safetensors",
583
+ "model.layers.23.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
584
+ "model.layers.23.self_attn.o_proj.biases": "model-00003-of-00004.safetensors",
585
+ "model.layers.23.self_attn.o_proj.scales": "model-00003-of-00004.safetensors",
586
+ "model.layers.23.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
587
+ "model.layers.23.self_attn.q_proj.biases": "model-00003-of-00004.safetensors",
588
+ "model.layers.23.self_attn.q_proj.scales": "model-00003-of-00004.safetensors",
589
+ "model.layers.23.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
590
+ "model.layers.23.self_attn.v_proj.biases": "model-00003-of-00004.safetensors",
591
+ "model.layers.23.self_attn.v_proj.scales": "model-00003-of-00004.safetensors",
592
+ "model.layers.23.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
593
+ "model.layers.24.input_layernorm.weight": "model-00003-of-00004.safetensors",
594
+ "model.layers.24.mlp.gate.biases": "model-00003-of-00004.safetensors",
595
+ "model.layers.24.mlp.gate.scales": "model-00003-of-00004.safetensors",
596
+ "model.layers.24.mlp.gate.weight": "model-00003-of-00004.safetensors",
597
+ "model.layers.24.mlp.shared_experts.down_proj.biases": "model-00003-of-00004.safetensors",
598
+ "model.layers.24.mlp.shared_experts.down_proj.scales": "model-00003-of-00004.safetensors",
599
+ "model.layers.24.mlp.shared_experts.down_proj.weight": "model-00003-of-00004.safetensors",
600
+ "model.layers.24.mlp.shared_experts.gate_proj.biases": "model-00003-of-00004.safetensors",
601
+ "model.layers.24.mlp.shared_experts.gate_proj.scales": "model-00003-of-00004.safetensors",
602
+ "model.layers.24.mlp.shared_experts.gate_proj.weight": "model-00003-of-00004.safetensors",
603
+ "model.layers.24.mlp.shared_experts.up_proj.biases": "model-00003-of-00004.safetensors",
604
+ "model.layers.24.mlp.shared_experts.up_proj.scales": "model-00003-of-00004.safetensors",
605
+ "model.layers.24.mlp.shared_experts.up_proj.weight": "model-00003-of-00004.safetensors",
606
+ "model.layers.24.mlp.switch_mlp.down_proj.biases": "model-00003-of-00004.safetensors",
607
+ "model.layers.24.mlp.switch_mlp.down_proj.scales": "model-00003-of-00004.safetensors",
608
+ "model.layers.24.mlp.switch_mlp.down_proj.weight": "model-00003-of-00004.safetensors",
609
+ "model.layers.24.mlp.switch_mlp.gate_proj.biases": "model-00003-of-00004.safetensors",
610
+ "model.layers.24.mlp.switch_mlp.gate_proj.scales": "model-00003-of-00004.safetensors",
611
+ "model.layers.24.mlp.switch_mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
612
+ "model.layers.24.mlp.switch_mlp.up_proj.biases": "model-00003-of-00004.safetensors",
613
+ "model.layers.24.mlp.switch_mlp.up_proj.scales": "model-00003-of-00004.safetensors",
614
+ "model.layers.24.mlp.switch_mlp.up_proj.weight": "model-00003-of-00004.safetensors",
615
+ "model.layers.24.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
616
+ "model.layers.24.self_attn.k_proj.biases": "model-00003-of-00004.safetensors",
617
+ "model.layers.24.self_attn.k_proj.scales": "model-00003-of-00004.safetensors",
618
+ "model.layers.24.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
619
+ "model.layers.24.self_attn.o_proj.biases": "model-00003-of-00004.safetensors",
620
+ "model.layers.24.self_attn.o_proj.scales": "model-00003-of-00004.safetensors",
621
+ "model.layers.24.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
622
+ "model.layers.24.self_attn.q_proj.biases": "model-00003-of-00004.safetensors",
623
+ "model.layers.24.self_attn.q_proj.scales": "model-00003-of-00004.safetensors",
624
+ "model.layers.24.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
625
+ "model.layers.24.self_attn.v_proj.biases": "model-00003-of-00004.safetensors",
626
+ "model.layers.24.self_attn.v_proj.scales": "model-00003-of-00004.safetensors",
627
+ "model.layers.24.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
628
+ "model.layers.25.input_layernorm.weight": "model-00004-of-00004.safetensors",
629
+ "model.layers.25.mlp.gate.biases": "model-00003-of-00004.safetensors",
630
+ "model.layers.25.mlp.gate.scales": "model-00003-of-00004.safetensors",
631
+ "model.layers.25.mlp.gate.weight": "model-00003-of-00004.safetensors",
632
+ "model.layers.25.mlp.shared_experts.down_proj.biases": "model-00004-of-00004.safetensors",
633
+ "model.layers.25.mlp.shared_experts.down_proj.scales": "model-00004-of-00004.safetensors",
634
+ "model.layers.25.mlp.shared_experts.down_proj.weight": "model-00004-of-00004.safetensors",
635
+ "model.layers.25.mlp.shared_experts.gate_proj.biases": "model-00004-of-00004.safetensors",
636
+ "model.layers.25.mlp.shared_experts.gate_proj.scales": "model-00004-of-00004.safetensors",
637
+ "model.layers.25.mlp.shared_experts.gate_proj.weight": "model-00004-of-00004.safetensors",
638
+ "model.layers.25.mlp.shared_experts.up_proj.biases": "model-00004-of-00004.safetensors",
639
+ "model.layers.25.mlp.shared_experts.up_proj.scales": "model-00004-of-00004.safetensors",
640
+ "model.layers.25.mlp.shared_experts.up_proj.weight": "model-00004-of-00004.safetensors",
641
+ "model.layers.25.mlp.switch_mlp.down_proj.biases": "model-00004-of-00004.safetensors",
642
+ "model.layers.25.mlp.switch_mlp.down_proj.scales": "model-00004-of-00004.safetensors",
643
+ "model.layers.25.mlp.switch_mlp.down_proj.weight": "model-00004-of-00004.safetensors",
644
+ "model.layers.25.mlp.switch_mlp.gate_proj.biases": "model-00004-of-00004.safetensors",
645
+ "model.layers.25.mlp.switch_mlp.gate_proj.scales": "model-00004-of-00004.safetensors",
646
+ "model.layers.25.mlp.switch_mlp.gate_proj.weight": "model-00004-of-00004.safetensors",
647
+ "model.layers.25.mlp.switch_mlp.up_proj.biases": "model-00004-of-00004.safetensors",
648
+ "model.layers.25.mlp.switch_mlp.up_proj.scales": "model-00004-of-00004.safetensors",
649
+ "model.layers.25.mlp.switch_mlp.up_proj.weight": "model-00004-of-00004.safetensors",
650
+ "model.layers.25.post_attention_layernorm.weight": "model-00004-of-00004.safetensors",
651
+ "model.layers.25.self_attn.k_proj.biases": "model-00003-of-00004.safetensors",
652
+ "model.layers.25.self_attn.k_proj.scales": "model-00003-of-00004.safetensors",
653
+ "model.layers.25.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
654
+ "model.layers.25.self_attn.o_proj.biases": "model-00003-of-00004.safetensors",
655
+ "model.layers.25.self_attn.o_proj.scales": "model-00003-of-00004.safetensors",
656
+ "model.layers.25.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
657
+ "model.layers.25.self_attn.q_proj.biases": "model-00003-of-00004.safetensors",
658
+ "model.layers.25.self_attn.q_proj.scales": "model-00003-of-00004.safetensors",
659
+ "model.layers.25.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
660
+ "model.layers.25.self_attn.v_proj.biases": "model-00003-of-00004.safetensors",
661
+ "model.layers.25.self_attn.v_proj.scales": "model-00003-of-00004.safetensors",
662
+ "model.layers.25.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
663
+ "model.layers.26.input_layernorm.weight": "model-00004-of-00004.safetensors",
664
+ "model.layers.26.mlp.gate.biases": "model-00004-of-00004.safetensors",
665
+ "model.layers.26.mlp.gate.scales": "model-00004-of-00004.safetensors",
666
+ "model.layers.26.mlp.gate.weight": "model-00004-of-00004.safetensors",
667
+ "model.layers.26.mlp.shared_experts.down_proj.biases": "model-00004-of-00004.safetensors",
668
+ "model.layers.26.mlp.shared_experts.down_proj.scales": "model-00004-of-00004.safetensors",
669
+ "model.layers.26.mlp.shared_experts.down_proj.weight": "model-00004-of-00004.safetensors",
670
+ "model.layers.26.mlp.shared_experts.gate_proj.biases": "model-00004-of-00004.safetensors",
671
+ "model.layers.26.mlp.shared_experts.gate_proj.scales": "model-00004-of-00004.safetensors",
672
+ "model.layers.26.mlp.shared_experts.gate_proj.weight": "model-00004-of-00004.safetensors",
673
+ "model.layers.26.mlp.shared_experts.up_proj.biases": "model-00004-of-00004.safetensors",
674
+ "model.layers.26.mlp.shared_experts.up_proj.scales": "model-00004-of-00004.safetensors",
675
+ "model.layers.26.mlp.shared_experts.up_proj.weight": "model-00004-of-00004.safetensors",
676
+ "model.layers.26.mlp.switch_mlp.down_proj.biases": "model-00004-of-00004.safetensors",
677
+ "model.layers.26.mlp.switch_mlp.down_proj.scales": "model-00004-of-00004.safetensors",
678
+ "model.layers.26.mlp.switch_mlp.down_proj.weight": "model-00004-of-00004.safetensors",
679
+ "model.layers.26.mlp.switch_mlp.gate_proj.biases": "model-00004-of-00004.safetensors",
680
+ "model.layers.26.mlp.switch_mlp.gate_proj.scales": "model-00004-of-00004.safetensors",
681
+ "model.layers.26.mlp.switch_mlp.gate_proj.weight": "model-00004-of-00004.safetensors",
682
+ "model.layers.26.mlp.switch_mlp.up_proj.biases": "model-00004-of-00004.safetensors",
683
+ "model.layers.26.mlp.switch_mlp.up_proj.scales": "model-00004-of-00004.safetensors",
684
+ "model.layers.26.mlp.switch_mlp.up_proj.weight": "model-00004-of-00004.safetensors",
685
+ "model.layers.26.post_attention_layernorm.weight": "model-00004-of-00004.safetensors",
686
+ "model.layers.26.self_attn.k_proj.biases": "model-00004-of-00004.safetensors",
687
+ "model.layers.26.self_attn.k_proj.scales": "model-00004-of-00004.safetensors",
688
+ "model.layers.26.self_attn.k_proj.weight": "model-00004-of-00004.safetensors",
689
+ "model.layers.26.self_attn.o_proj.biases": "model-00004-of-00004.safetensors",
690
+ "model.layers.26.self_attn.o_proj.scales": "model-00004-of-00004.safetensors",
691
+ "model.layers.26.self_attn.o_proj.weight": "model-00004-of-00004.safetensors",
692
+ "model.layers.26.self_attn.q_proj.biases": "model-00004-of-00004.safetensors",
693
+ "model.layers.26.self_attn.q_proj.scales": "model-00004-of-00004.safetensors",
694
+ "model.layers.26.self_attn.q_proj.weight": "model-00004-of-00004.safetensors",
695
+ "model.layers.26.self_attn.v_proj.biases": "model-00004-of-00004.safetensors",
696
+ "model.layers.26.self_attn.v_proj.scales": "model-00004-of-00004.safetensors",
697
+ "model.layers.26.self_attn.v_proj.weight": "model-00004-of-00004.safetensors",
698
+ "model.layers.27.input_layernorm.weight": "model-00004-of-00004.safetensors",
699
+ "model.layers.27.mlp.gate.biases": "model-00004-of-00004.safetensors",
700
+ "model.layers.27.mlp.gate.scales": "model-00004-of-00004.safetensors",
701
+ "model.layers.27.mlp.gate.weight": "model-00004-of-00004.safetensors",
702
+ "model.layers.27.mlp.shared_experts.down_proj.biases": "model-00004-of-00004.safetensors",
703
+ "model.layers.27.mlp.shared_experts.down_proj.scales": "model-00004-of-00004.safetensors",
704
+ "model.layers.27.mlp.shared_experts.down_proj.weight": "model-00004-of-00004.safetensors",
705
+ "model.layers.27.mlp.shared_experts.gate_proj.biases": "model-00004-of-00004.safetensors",
706
+ "model.layers.27.mlp.shared_experts.gate_proj.scales": "model-00004-of-00004.safetensors",
707
+ "model.layers.27.mlp.shared_experts.gate_proj.weight": "model-00004-of-00004.safetensors",
708
+ "model.layers.27.mlp.shared_experts.up_proj.biases": "model-00004-of-00004.safetensors",
709
+ "model.layers.27.mlp.shared_experts.up_proj.scales": "model-00004-of-00004.safetensors",
710
+ "model.layers.27.mlp.shared_experts.up_proj.weight": "model-00004-of-00004.safetensors",
711
+ "model.layers.27.mlp.switch_mlp.down_proj.biases": "model-00004-of-00004.safetensors",
712
+ "model.layers.27.mlp.switch_mlp.down_proj.scales": "model-00004-of-00004.safetensors",
713
+ "model.layers.27.mlp.switch_mlp.down_proj.weight": "model-00004-of-00004.safetensors",
714
+ "model.layers.27.mlp.switch_mlp.gate_proj.biases": "model-00004-of-00004.safetensors",
715
+ "model.layers.27.mlp.switch_mlp.gate_proj.scales": "model-00004-of-00004.safetensors",
716
+ "model.layers.27.mlp.switch_mlp.gate_proj.weight": "model-00004-of-00004.safetensors",
717
+ "model.layers.27.mlp.switch_mlp.up_proj.biases": "model-00004-of-00004.safetensors",
718
+ "model.layers.27.mlp.switch_mlp.up_proj.scales": "model-00004-of-00004.safetensors",
719
+ "model.layers.27.mlp.switch_mlp.up_proj.weight": "model-00004-of-00004.safetensors",
720
+ "model.layers.27.post_attention_layernorm.weight": "model-00004-of-00004.safetensors",
721
+ "model.layers.27.self_attn.k_proj.biases": "model-00004-of-00004.safetensors",
722
+ "model.layers.27.self_attn.k_proj.scales": "model-00004-of-00004.safetensors",
723
+ "model.layers.27.self_attn.k_proj.weight": "model-00004-of-00004.safetensors",
724
+ "model.layers.27.self_attn.o_proj.biases": "model-00004-of-00004.safetensors",
725
+ "model.layers.27.self_attn.o_proj.scales": "model-00004-of-00004.safetensors",
726
+ "model.layers.27.self_attn.o_proj.weight": "model-00004-of-00004.safetensors",
727
+ "model.layers.27.self_attn.q_proj.biases": "model-00004-of-00004.safetensors",
728
+ "model.layers.27.self_attn.q_proj.scales": "model-00004-of-00004.safetensors",
729
+ "model.layers.27.self_attn.q_proj.weight": "model-00004-of-00004.safetensors",
730
+ "model.layers.27.self_attn.v_proj.biases": "model-00004-of-00004.safetensors",
731
+ "model.layers.27.self_attn.v_proj.scales": "model-00004-of-00004.safetensors",
732
+ "model.layers.27.self_attn.v_proj.weight": "model-00004-of-00004.safetensors",
733
+ "model.layers.3.input_layernorm.weight": "model-00001-of-00004.safetensors",
734
+ "model.layers.3.mlp.gate.biases": "model-00001-of-00004.safetensors",
735
+ "model.layers.3.mlp.gate.scales": "model-00001-of-00004.safetensors",
736
+ "model.layers.3.mlp.gate.weight": "model-00001-of-00004.safetensors",
737
+ "model.layers.3.mlp.shared_experts.down_proj.biases": "model-00001-of-00004.safetensors",
738
+ "model.layers.3.mlp.shared_experts.down_proj.scales": "model-00001-of-00004.safetensors",
739
+ "model.layers.3.mlp.shared_experts.down_proj.weight": "model-00001-of-00004.safetensors",
740
+ "model.layers.3.mlp.shared_experts.gate_proj.biases": "model-00001-of-00004.safetensors",
741
+ "model.layers.3.mlp.shared_experts.gate_proj.scales": "model-00001-of-00004.safetensors",
742
+ "model.layers.3.mlp.shared_experts.gate_proj.weight": "model-00001-of-00004.safetensors",
743
+ "model.layers.3.mlp.shared_experts.up_proj.biases": "model-00001-of-00004.safetensors",
744
+ "model.layers.3.mlp.shared_experts.up_proj.scales": "model-00001-of-00004.safetensors",
745
+ "model.layers.3.mlp.shared_experts.up_proj.weight": "model-00001-of-00004.safetensors",
746
+ "model.layers.3.mlp.switch_mlp.down_proj.biases": "model-00001-of-00004.safetensors",
747
+ "model.layers.3.mlp.switch_mlp.down_proj.scales": "model-00001-of-00004.safetensors",
748
+ "model.layers.3.mlp.switch_mlp.down_proj.weight": "model-00001-of-00004.safetensors",
749
+ "model.layers.3.mlp.switch_mlp.gate_proj.biases": "model-00001-of-00004.safetensors",
750
+ "model.layers.3.mlp.switch_mlp.gate_proj.scales": "model-00001-of-00004.safetensors",
751
+ "model.layers.3.mlp.switch_mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
752
+ "model.layers.3.mlp.switch_mlp.up_proj.biases": "model-00001-of-00004.safetensors",
753
+ "model.layers.3.mlp.switch_mlp.up_proj.scales": "model-00001-of-00004.safetensors",
754
+ "model.layers.3.mlp.switch_mlp.up_proj.weight": "model-00001-of-00004.safetensors",
755
+ "model.layers.3.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
756
+ "model.layers.3.self_attn.k_proj.biases": "model-00001-of-00004.safetensors",
757
+ "model.layers.3.self_attn.k_proj.scales": "model-00001-of-00004.safetensors",
758
+ "model.layers.3.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
759
+ "model.layers.3.self_attn.o_proj.biases": "model-00001-of-00004.safetensors",
760
+ "model.layers.3.self_attn.o_proj.scales": "model-00001-of-00004.safetensors",
761
+ "model.layers.3.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
762
+ "model.layers.3.self_attn.q_proj.biases": "model-00001-of-00004.safetensors",
763
+ "model.layers.3.self_attn.q_proj.scales": "model-00001-of-00004.safetensors",
764
+ "model.layers.3.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
765
+ "model.layers.3.self_attn.v_proj.biases": "model-00001-of-00004.safetensors",
766
+ "model.layers.3.self_attn.v_proj.scales": "model-00001-of-00004.safetensors",
767
+ "model.layers.3.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
768
+ "model.layers.4.input_layernorm.weight": "model-00001-of-00004.safetensors",
769
+ "model.layers.4.mlp.gate.biases": "model-00001-of-00004.safetensors",
770
+ "model.layers.4.mlp.gate.scales": "model-00001-of-00004.safetensors",
771
+ "model.layers.4.mlp.gate.weight": "model-00001-of-00004.safetensors",
772
+ "model.layers.4.mlp.shared_experts.down_proj.biases": "model-00001-of-00004.safetensors",
773
+ "model.layers.4.mlp.shared_experts.down_proj.scales": "model-00001-of-00004.safetensors",
774
+ "model.layers.4.mlp.shared_experts.down_proj.weight": "model-00001-of-00004.safetensors",
775
+ "model.layers.4.mlp.shared_experts.gate_proj.biases": "model-00001-of-00004.safetensors",
776
+ "model.layers.4.mlp.shared_experts.gate_proj.scales": "model-00001-of-00004.safetensors",
777
+ "model.layers.4.mlp.shared_experts.gate_proj.weight": "model-00001-of-00004.safetensors",
778
+ "model.layers.4.mlp.shared_experts.up_proj.biases": "model-00001-of-00004.safetensors",
779
+ "model.layers.4.mlp.shared_experts.up_proj.scales": "model-00001-of-00004.safetensors",
780
+ "model.layers.4.mlp.shared_experts.up_proj.weight": "model-00001-of-00004.safetensors",
781
+ "model.layers.4.mlp.switch_mlp.down_proj.biases": "model-00001-of-00004.safetensors",
782
+ "model.layers.4.mlp.switch_mlp.down_proj.scales": "model-00001-of-00004.safetensors",
783
+ "model.layers.4.mlp.switch_mlp.down_proj.weight": "model-00001-of-00004.safetensors",
784
+ "model.layers.4.mlp.switch_mlp.gate_proj.biases": "model-00001-of-00004.safetensors",
785
+ "model.layers.4.mlp.switch_mlp.gate_proj.scales": "model-00001-of-00004.safetensors",
786
+ "model.layers.4.mlp.switch_mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
787
+ "model.layers.4.mlp.switch_mlp.up_proj.biases": "model-00001-of-00004.safetensors",
788
+ "model.layers.4.mlp.switch_mlp.up_proj.scales": "model-00001-of-00004.safetensors",
789
+ "model.layers.4.mlp.switch_mlp.up_proj.weight": "model-00001-of-00004.safetensors",
790
+ "model.layers.4.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
791
+ "model.layers.4.self_attn.k_proj.biases": "model-00001-of-00004.safetensors",
792
+ "model.layers.4.self_attn.k_proj.scales": "model-00001-of-00004.safetensors",
793
+ "model.layers.4.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
794
+ "model.layers.4.self_attn.o_proj.biases": "model-00001-of-00004.safetensors",
795
+ "model.layers.4.self_attn.o_proj.scales": "model-00001-of-00004.safetensors",
796
+ "model.layers.4.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
797
+ "model.layers.4.self_attn.q_proj.biases": "model-00001-of-00004.safetensors",
798
+ "model.layers.4.self_attn.q_proj.scales": "model-00001-of-00004.safetensors",
799
+ "model.layers.4.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
800
+ "model.layers.4.self_attn.v_proj.biases": "model-00001-of-00004.safetensors",
801
+ "model.layers.4.self_attn.v_proj.scales": "model-00001-of-00004.safetensors",
802
+ "model.layers.4.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
803
+ "model.layers.5.input_layernorm.weight": "model-00001-of-00004.safetensors",
804
+ "model.layers.5.mlp.gate.biases": "model-00001-of-00004.safetensors",
805
+ "model.layers.5.mlp.gate.scales": "model-00001-of-00004.safetensors",
806
+ "model.layers.5.mlp.gate.weight": "model-00001-of-00004.safetensors",
807
+ "model.layers.5.mlp.shared_experts.down_proj.biases": "model-00001-of-00004.safetensors",
808
+ "model.layers.5.mlp.shared_experts.down_proj.scales": "model-00001-of-00004.safetensors",
809
+ "model.layers.5.mlp.shared_experts.down_proj.weight": "model-00001-of-00004.safetensors",
810
+ "model.layers.5.mlp.shared_experts.gate_proj.biases": "model-00001-of-00004.safetensors",
811
+ "model.layers.5.mlp.shared_experts.gate_proj.scales": "model-00001-of-00004.safetensors",
812
+ "model.layers.5.mlp.shared_experts.gate_proj.weight": "model-00001-of-00004.safetensors",
813
+ "model.layers.5.mlp.shared_experts.up_proj.biases": "model-00001-of-00004.safetensors",
814
+ "model.layers.5.mlp.shared_experts.up_proj.scales": "model-00001-of-00004.safetensors",
815
+ "model.layers.5.mlp.shared_experts.up_proj.weight": "model-00001-of-00004.safetensors",
816
+ "model.layers.5.mlp.switch_mlp.down_proj.biases": "model-00001-of-00004.safetensors",
817
+ "model.layers.5.mlp.switch_mlp.down_proj.scales": "model-00001-of-00004.safetensors",
818
+ "model.layers.5.mlp.switch_mlp.down_proj.weight": "model-00001-of-00004.safetensors",
819
+ "model.layers.5.mlp.switch_mlp.gate_proj.biases": "model-00001-of-00004.safetensors",
820
+ "model.layers.5.mlp.switch_mlp.gate_proj.scales": "model-00001-of-00004.safetensors",
821
+ "model.layers.5.mlp.switch_mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
822
+ "model.layers.5.mlp.switch_mlp.up_proj.biases": "model-00001-of-00004.safetensors",
823
+ "model.layers.5.mlp.switch_mlp.up_proj.scales": "model-00001-of-00004.safetensors",
824
+ "model.layers.5.mlp.switch_mlp.up_proj.weight": "model-00001-of-00004.safetensors",
825
+ "model.layers.5.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
826
+ "model.layers.5.self_attn.k_proj.biases": "model-00001-of-00004.safetensors",
827
+ "model.layers.5.self_attn.k_proj.scales": "model-00001-of-00004.safetensors",
828
+ "model.layers.5.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
829
+ "model.layers.5.self_attn.o_proj.biases": "model-00001-of-00004.safetensors",
830
+ "model.layers.5.self_attn.o_proj.scales": "model-00001-of-00004.safetensors",
831
+ "model.layers.5.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
832
+ "model.layers.5.self_attn.q_proj.biases": "model-00001-of-00004.safetensors",
833
+ "model.layers.5.self_attn.q_proj.scales": "model-00001-of-00004.safetensors",
834
+ "model.layers.5.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
835
+ "model.layers.5.self_attn.v_proj.biases": "model-00001-of-00004.safetensors",
836
+ "model.layers.5.self_attn.v_proj.scales": "model-00001-of-00004.safetensors",
837
+ "model.layers.5.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
838
+ "model.layers.6.input_layernorm.weight": "model-00001-of-00004.safetensors",
839
+ "model.layers.6.mlp.gate.biases": "model-00001-of-00004.safetensors",
840
+ "model.layers.6.mlp.gate.scales": "model-00001-of-00004.safetensors",
841
+ "model.layers.6.mlp.gate.weight": "model-00001-of-00004.safetensors",
842
+ "model.layers.6.mlp.shared_experts.down_proj.biases": "model-00001-of-00004.safetensors",
843
+ "model.layers.6.mlp.shared_experts.down_proj.scales": "model-00001-of-00004.safetensors",
844
+ "model.layers.6.mlp.shared_experts.down_proj.weight": "model-00001-of-00004.safetensors",
845
+ "model.layers.6.mlp.shared_experts.gate_proj.biases": "model-00001-of-00004.safetensors",
846
+ "model.layers.6.mlp.shared_experts.gate_proj.scales": "model-00001-of-00004.safetensors",
847
+ "model.layers.6.mlp.shared_experts.gate_proj.weight": "model-00001-of-00004.safetensors",
848
+ "model.layers.6.mlp.shared_experts.up_proj.biases": "model-00001-of-00004.safetensors",
849
+ "model.layers.6.mlp.shared_experts.up_proj.scales": "model-00001-of-00004.safetensors",
850
+ "model.layers.6.mlp.shared_experts.up_proj.weight": "model-00001-of-00004.safetensors",
851
+ "model.layers.6.mlp.switch_mlp.down_proj.biases": "model-00001-of-00004.safetensors",
852
+ "model.layers.6.mlp.switch_mlp.down_proj.scales": "model-00001-of-00004.safetensors",
853
+ "model.layers.6.mlp.switch_mlp.down_proj.weight": "model-00001-of-00004.safetensors",
854
+ "model.layers.6.mlp.switch_mlp.gate_proj.biases": "model-00001-of-00004.safetensors",
855
+ "model.layers.6.mlp.switch_mlp.gate_proj.scales": "model-00001-of-00004.safetensors",
856
+ "model.layers.6.mlp.switch_mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
857
+ "model.layers.6.mlp.switch_mlp.up_proj.biases": "model-00001-of-00004.safetensors",
858
+ "model.layers.6.mlp.switch_mlp.up_proj.scales": "model-00001-of-00004.safetensors",
859
+ "model.layers.6.mlp.switch_mlp.up_proj.weight": "model-00001-of-00004.safetensors",
860
+ "model.layers.6.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
861
+ "model.layers.6.self_attn.k_proj.biases": "model-00001-of-00004.safetensors",
862
+ "model.layers.6.self_attn.k_proj.scales": "model-00001-of-00004.safetensors",
863
+ "model.layers.6.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
864
+ "model.layers.6.self_attn.o_proj.biases": "model-00001-of-00004.safetensors",
865
+ "model.layers.6.self_attn.o_proj.scales": "model-00001-of-00004.safetensors",
866
+ "model.layers.6.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
867
+ "model.layers.6.self_attn.q_proj.biases": "model-00001-of-00004.safetensors",
868
+ "model.layers.6.self_attn.q_proj.scales": "model-00001-of-00004.safetensors",
869
+ "model.layers.6.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
870
+ "model.layers.6.self_attn.v_proj.biases": "model-00001-of-00004.safetensors",
871
+ "model.layers.6.self_attn.v_proj.scales": "model-00001-of-00004.safetensors",
872
+ "model.layers.6.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
873
+ "model.layers.7.input_layernorm.weight": "model-00001-of-00004.safetensors",
874
+ "model.layers.7.mlp.gate.biases": "model-00001-of-00004.safetensors",
875
+ "model.layers.7.mlp.gate.scales": "model-00001-of-00004.safetensors",
876
+ "model.layers.7.mlp.gate.weight": "model-00001-of-00004.safetensors",
877
+ "model.layers.7.mlp.shared_experts.down_proj.biases": "model-00001-of-00004.safetensors",
878
+ "model.layers.7.mlp.shared_experts.down_proj.scales": "model-00001-of-00004.safetensors",
879
+ "model.layers.7.mlp.shared_experts.down_proj.weight": "model-00001-of-00004.safetensors",
880
+ "model.layers.7.mlp.shared_experts.gate_proj.biases": "model-00001-of-00004.safetensors",
881
+ "model.layers.7.mlp.shared_experts.gate_proj.scales": "model-00001-of-00004.safetensors",
882
+ "model.layers.7.mlp.shared_experts.gate_proj.weight": "model-00001-of-00004.safetensors",
883
+ "model.layers.7.mlp.shared_experts.up_proj.biases": "model-00001-of-00004.safetensors",
884
+ "model.layers.7.mlp.shared_experts.up_proj.scales": "model-00001-of-00004.safetensors",
885
+ "model.layers.7.mlp.shared_experts.up_proj.weight": "model-00001-of-00004.safetensors",
886
+ "model.layers.7.mlp.switch_mlp.down_proj.biases": "model-00001-of-00004.safetensors",
887
+ "model.layers.7.mlp.switch_mlp.down_proj.scales": "model-00001-of-00004.safetensors",
888
+ "model.layers.7.mlp.switch_mlp.down_proj.weight": "model-00001-of-00004.safetensors",
889
+ "model.layers.7.mlp.switch_mlp.gate_proj.biases": "model-00001-of-00004.safetensors",
890
+ "model.layers.7.mlp.switch_mlp.gate_proj.scales": "model-00001-of-00004.safetensors",
891
+ "model.layers.7.mlp.switch_mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
892
+ "model.layers.7.mlp.switch_mlp.up_proj.biases": "model-00001-of-00004.safetensors",
893
+ "model.layers.7.mlp.switch_mlp.up_proj.scales": "model-00001-of-00004.safetensors",
894
+ "model.layers.7.mlp.switch_mlp.up_proj.weight": "model-00001-of-00004.safetensors",
895
+ "model.layers.7.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
896
+ "model.layers.7.self_attn.k_proj.biases": "model-00001-of-00004.safetensors",
897
+ "model.layers.7.self_attn.k_proj.scales": "model-00001-of-00004.safetensors",
898
+ "model.layers.7.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
899
+ "model.layers.7.self_attn.o_proj.biases": "model-00001-of-00004.safetensors",
900
+ "model.layers.7.self_attn.o_proj.scales": "model-00001-of-00004.safetensors",
901
+ "model.layers.7.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
902
+ "model.layers.7.self_attn.q_proj.biases": "model-00001-of-00004.safetensors",
903
+ "model.layers.7.self_attn.q_proj.scales": "model-00001-of-00004.safetensors",
904
+ "model.layers.7.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
905
+ "model.layers.7.self_attn.v_proj.biases": "model-00001-of-00004.safetensors",
906
+ "model.layers.7.self_attn.v_proj.scales": "model-00001-of-00004.safetensors",
907
+ "model.layers.7.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
908
+ "model.layers.8.input_layernorm.weight": "model-00002-of-00004.safetensors",
909
+ "model.layers.8.mlp.gate.biases": "model-00001-of-00004.safetensors",
910
+ "model.layers.8.mlp.gate.scales": "model-00001-of-00004.safetensors",
911
+ "model.layers.8.mlp.gate.weight": "model-00001-of-00004.safetensors",
912
+ "model.layers.8.mlp.shared_experts.down_proj.biases": "model-00002-of-00004.safetensors",
913
+ "model.layers.8.mlp.shared_experts.down_proj.scales": "model-00002-of-00004.safetensors",
914
+ "model.layers.8.mlp.shared_experts.down_proj.weight": "model-00002-of-00004.safetensors",
915
+ "model.layers.8.mlp.shared_experts.gate_proj.biases": "model-00002-of-00004.safetensors",
916
+ "model.layers.8.mlp.shared_experts.gate_proj.scales": "model-00002-of-00004.safetensors",
917
+ "model.layers.8.mlp.shared_experts.gate_proj.weight": "model-00002-of-00004.safetensors",
918
+ "model.layers.8.mlp.shared_experts.up_proj.biases": "model-00002-of-00004.safetensors",
919
+ "model.layers.8.mlp.shared_experts.up_proj.scales": "model-00002-of-00004.safetensors",
920
+ "model.layers.8.mlp.shared_experts.up_proj.weight": "model-00002-of-00004.safetensors",
921
+ "model.layers.8.mlp.switch_mlp.down_proj.biases": "model-00002-of-00004.safetensors",
922
+ "model.layers.8.mlp.switch_mlp.down_proj.scales": "model-00002-of-00004.safetensors",
923
+ "model.layers.8.mlp.switch_mlp.down_proj.weight": "model-00002-of-00004.safetensors",
924
+ "model.layers.8.mlp.switch_mlp.gate_proj.biases": "model-00001-of-00004.safetensors",
925
+ "model.layers.8.mlp.switch_mlp.gate_proj.scales": "model-00001-of-00004.safetensors",
926
+ "model.layers.8.mlp.switch_mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
927
+ "model.layers.8.mlp.switch_mlp.up_proj.biases": "model-00001-of-00004.safetensors",
928
+ "model.layers.8.mlp.switch_mlp.up_proj.scales": "model-00001-of-00004.safetensors",
929
+ "model.layers.8.mlp.switch_mlp.up_proj.weight": "model-00001-of-00004.safetensors",
930
+ "model.layers.8.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
931
+ "model.layers.8.self_attn.k_proj.biases": "model-00001-of-00004.safetensors",
932
+ "model.layers.8.self_attn.k_proj.scales": "model-00001-of-00004.safetensors",
933
+ "model.layers.8.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
934
+ "model.layers.8.self_attn.o_proj.biases": "model-00001-of-00004.safetensors",
935
+ "model.layers.8.self_attn.o_proj.scales": "model-00001-of-00004.safetensors",
936
+ "model.layers.8.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
937
+ "model.layers.8.self_attn.q_proj.biases": "model-00001-of-00004.safetensors",
938
+ "model.layers.8.self_attn.q_proj.scales": "model-00001-of-00004.safetensors",
939
+ "model.layers.8.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
940
+ "model.layers.8.self_attn.v_proj.biases": "model-00001-of-00004.safetensors",
941
+ "model.layers.8.self_attn.v_proj.scales": "model-00001-of-00004.safetensors",
942
+ "model.layers.8.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
943
+ "model.layers.9.input_layernorm.weight": "model-00002-of-00004.safetensors",
944
+ "model.layers.9.mlp.gate.biases": "model-00002-of-00004.safetensors",
945
+ "model.layers.9.mlp.gate.scales": "model-00002-of-00004.safetensors",
946
+ "model.layers.9.mlp.gate.weight": "model-00002-of-00004.safetensors",
947
+ "model.layers.9.mlp.shared_experts.down_proj.biases": "model-00002-of-00004.safetensors",
948
+ "model.layers.9.mlp.shared_experts.down_proj.scales": "model-00002-of-00004.safetensors",
949
+ "model.layers.9.mlp.shared_experts.down_proj.weight": "model-00002-of-00004.safetensors",
950
+ "model.layers.9.mlp.shared_experts.gate_proj.biases": "model-00002-of-00004.safetensors",
951
+ "model.layers.9.mlp.shared_experts.gate_proj.scales": "model-00002-of-00004.safetensors",
952
+ "model.layers.9.mlp.shared_experts.gate_proj.weight": "model-00002-of-00004.safetensors",
953
+ "model.layers.9.mlp.shared_experts.up_proj.biases": "model-00002-of-00004.safetensors",
954
+ "model.layers.9.mlp.shared_experts.up_proj.scales": "model-00002-of-00004.safetensors",
955
+ "model.layers.9.mlp.shared_experts.up_proj.weight": "model-00002-of-00004.safetensors",
956
+ "model.layers.9.mlp.switch_mlp.down_proj.biases": "model-00002-of-00004.safetensors",
957
+ "model.layers.9.mlp.switch_mlp.down_proj.scales": "model-00002-of-00004.safetensors",
958
+ "model.layers.9.mlp.switch_mlp.down_proj.weight": "model-00002-of-00004.safetensors",
959
+ "model.layers.9.mlp.switch_mlp.gate_proj.biases": "model-00002-of-00004.safetensors",
960
+ "model.layers.9.mlp.switch_mlp.gate_proj.scales": "model-00002-of-00004.safetensors",
961
+ "model.layers.9.mlp.switch_mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
962
+ "model.layers.9.mlp.switch_mlp.up_proj.biases": "model-00002-of-00004.safetensors",
963
+ "model.layers.9.mlp.switch_mlp.up_proj.scales": "model-00002-of-00004.safetensors",
964
+ "model.layers.9.mlp.switch_mlp.up_proj.weight": "model-00002-of-00004.safetensors",
965
+ "model.layers.9.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
966
+ "model.layers.9.self_attn.k_proj.biases": "model-00002-of-00004.safetensors",
967
+ "model.layers.9.self_attn.k_proj.scales": "model-00002-of-00004.safetensors",
968
+ "model.layers.9.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
969
+ "model.layers.9.self_attn.o_proj.biases": "model-00002-of-00004.safetensors",
970
+ "model.layers.9.self_attn.o_proj.scales": "model-00002-of-00004.safetensors",
971
+ "model.layers.9.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
972
+ "model.layers.9.self_attn.q_proj.biases": "model-00002-of-00004.safetensors",
973
+ "model.layers.9.self_attn.q_proj.scales": "model-00002-of-00004.safetensors",
974
+ "model.layers.9.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
975
+ "model.layers.9.self_attn.v_proj.biases": "model-00002-of-00004.safetensors",
976
+ "model.layers.9.self_attn.v_proj.scales": "model-00002-of-00004.safetensors",
977
+ "model.layers.9.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
978
+ "model.norm.weight": "model-00004-of-00004.safetensors"
979
+ }
980
+ }
modeling_ernie4_5_moe.py ADDED
@@ -0,0 +1,1504 @@
+ # Copyright (c) 2025 Baidu, Inc. All Rights Reserved.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ # http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+
+ import functools
+
+ from copy import deepcopy
+ from dataclasses import dataclass
+ from functools import partial
+ from typing import Callable, Optional, Tuple, Union
+
+ import torch
+ import torch.nn.functional as F
+ import torch.nn as nn
+
+ from transformers.cache_utils import Cache, DynamicCache, StaticCache
+ from transformers.generation import GenerationMixin
+ from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS, PreTrainedModel
+ from transformers.modeling_outputs import ModelOutput, MoeCausalLMOutputWithPast
+ from transformers.modeling_rope_utils import ROPE_INIT_FUNCTIONS
+ from transformers.modeling_attn_mask_utils import AttentionMaskConverter
+ from transformers.modeling_flash_attention_utils import FlashAttentionKwargs
+ from transformers.processing_utils import Unpack
+ from transformers.utils import LossKwargs, auto_docstring, can_return_tuple, logging, is_torch_flex_attn_available
+
+ from .configuration_ernie4_5_moe import Ernie4_5_MoeConfig
+
+
+ if is_torch_flex_attn_available():
+     from torch.nn.attention.flex_attention import BlockMask
+
+     from transformers.integrations.flex_attention import make_flex_block_causal_mask
+
+ logger = logging.get_logger(__name__)
+
+
+ class KwargsForCausalLM(FlashAttentionKwargs, LossKwargs): ...
+
+ @dataclass
+ class Erine4_5_MoeModelOutputWithPast(ModelOutput):
+     last_hidden_state: Optional[torch.FloatTensor] = None
+     past_key_values: Optional[Cache] = None
+     hidden_states: Optional[tuple[torch.FloatTensor, ...]] = None
+     attentions: Optional[tuple[torch.FloatTensor, ...]] = None
+     router_loss: Optional[torch.FloatTensor] = None
+     gate_logits: Optional[tuple[torch.FloatTensor, ...]] = None
+     mtp_outputs: Optional[torch.FloatTensor] = None
+
+
+ @dataclass
+ class Ernie4_5_MoeCausalLMOutputWithPast(MoeCausalLMOutputWithPast):
+     router_loss: Optional[torch.FloatTensor] = None
+
+ def rotate_half(x):
+     """Rotates half the hidden dims of the input."""
+
+     x1 = x[..., 0::2]
+     x2 = x[..., 1::2]
+     return torch.stack((-x2, x1), dim=-1).reshape(x.shape)
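+ # Illustrative example of the interleaved rotation above (hypothetical 4-dim input):
+ #     >>> rotate_half(torch.tensor([1.0, 2.0, 3.0, 4.0]))
+ #     tensor([-2., 1., -4., 3.])
+ # i.e. each even/odd pair (x0, x1) maps to (-x1, x0), matching the duplicated
+ # cos/sin layout built in apply_rotary_pos_emb below.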
+
+
+ def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
+     """
+     This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
+     num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
+     """
+     batch, num_key_value_heads, slen, head_dim = hidden_states.shape
+     if n_rep == 1:
+         return hidden_states
+     hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
+     return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
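+ # Shape sketch (hypothetical sizes): with batch=1, num_key_value_heads=2, seqlen=5,
+ # head_dim=64 and n_rep=2, repeat_kv maps (1, 2, 5, 64) -> (1, 4, 5, 64), so each
+ # KV head is shared by 2 query heads (grouped-query attention).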
+
+
+ def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
+     """Applies Rotary Position Embedding to the query and key tensors.
+
+     Args:
+         q (`torch.Tensor`): The query tensor.
+         k (`torch.Tensor`): The key tensor.
+         cos (`torch.Tensor`): The cosine part of the rotary embedding.
+         sin (`torch.Tensor`): The sine part of the rotary embedding.
+         position_ids (`torch.Tensor`, *optional*):
+             Deprecated and unused.
+         unsqueeze_dim (`int`, *optional*, defaults to 1):
+             The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
+             sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
+             that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
+             k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
+             cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
+             the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.
+     Returns:
+         `tuple(torch.Tensor)` comprising of the query and key tensors rotated using the Rotary Position Embedding.
+     """
+     orig_dtype = q.dtype
+     sin_pos = torch.stack([sin, sin], dim=-1).reshape(*sin.shape[:-1], -1)
+     cos_pos = torch.stack([cos, cos], dim=-1).reshape(*cos.shape[:-1], -1)
+     q_embed = (q.float() * cos_pos) + (rotate_half(q).float() * sin_pos)
+     k_embed = (k.float() * cos_pos) + (rotate_half(k).float() * sin_pos)
+     return q_embed.to(orig_dtype), k_embed.to(orig_dtype)
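+ # Note (illustrative): cos/sin arrive with head_dim // 2 values per position, e.g.
+ # [c0, c1]; the stack-and-reshape above duplicates them to [c0, c0, c1, c1], which
+ # lines up element-wise with the interleaved rotate_half layout.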
+
+
+ def eager_attention_forward(
+     module: nn.Module,
+     query: torch.Tensor,
+     key: torch.Tensor,
+     value: torch.Tensor,
+     attention_mask: Optional[torch.Tensor],
+     scaling: float,
+     dropout: float = 0.0,
+     **kwargs,
+ ):
+     key_states = repeat_kv(key, module.num_key_value_groups)
+     value_states = repeat_kv(value, module.num_key_value_groups)
+
+     attn_weights = torch.matmul(query, key_states.transpose(2, 3)) * scaling
+     if attention_mask is not None:
+         causal_mask = attention_mask[:, :, :, : key_states.shape[-2]]
+         attn_weights = attn_weights + causal_mask.to(attn_weights.device)
+
+     attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query.dtype)
+     attn_weights = nn.functional.dropout(attn_weights, p=dropout, training=module.training)
+     attn_output = torch.matmul(attn_weights, value_states)
+     attn_output = attn_output.transpose(1, 2).contiguous()
+
+     return attn_output, attn_weights
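+ # The eager path above computes standard scaled dot-product attention,
+ # softmax(Q @ K^T * scaling + mask) @ V with scaling = head_dim ** -0.5;
+ # the softmax runs in float32 for stability and is cast back to the query dtype.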
+
+
+ def topk_gate_func(
+     module: nn.Module,
+     hidden_states: torch.Tensor,
+ ):
+     capacity = module.get_capacity(hidden_states.shape[0])
+     with torch.autocast(device_type='cuda', dtype=torch.float32):
+         logits = module.gate(hidden_states.float())
+     router_loss = torch.zeros([1], dtype=torch.float32, device=hidden_states.device)
+     router_loss = router_loss.detach()
+     return logits, capacity, router_loss
+
+
+ class Ernie4_5_ResidualWithDropout(nn.Module):
+     """
+     Fused dropout implementation with residual connection support.
+
+     This layer combines dropout and residual addition in a single operation for better performance,
+     particularly on GPU devices. The dropout is conditionally applied based on the probability.
+
+     Args:
+         prob (float): Dropout probability (between 0 and 1)
+
+     Attributes:
+         prob (float): Stores the dropout probability
+         dropout (nn.Dropout): The actual dropout layer instance
+     """
+
+     def __init__(self, prob):
+         """
+         Initialize the fused dropout layer.
+
+         Args:
+             prob (float): Dropout probability (0 means no dropout)
+         """
+         super().__init__()
+         self.prob = prob
+         self.dropout = nn.Dropout(p=prob)
+
+     def forward(self, x, y):
+         """
+         Forward pass of the fused dropout layer.
+
+         Args:
+             x (torch.Tensor): Input tensor to potentially apply dropout on
+             y (torch.Tensor): Residual tensor to add to the (possibly dropped out) x
+
+         Returns:
+             torch.Tensor: Result of x (with optional dropout) + y
+         """
+         if self.prob > 0:
+             x = self.dropout(x)
+         output = x + y
+
+         return output
+
+
+ class Ernie4_5_Attention(nn.Module):
+     """Multi-headed attention from 'Attention Is All You Need' paper"""
+
+     def __init__(self, config, layer_idx=0):
+         """
+         Args:
+             config (ErnieConfig): Model configuration.
+             layer_idx (int, optional): Index in transformer stack. Defaults to 0.
+         """
+         super().__init__()
+         self.layer_idx = layer_idx
+         self.hidden_size = config.hidden_size
+         self.num_heads = config.num_attention_heads
+         self.num_key_value_heads = config.num_key_value_heads if config.num_key_value_heads is not None else self.num_heads
+         self.num_key_value_groups = config.num_attention_heads // config.num_key_value_heads
+         self.head_dim = self.hidden_size // self.num_heads
+         self.freq_allocation = config.freq_allocation if hasattr(config, "freq_allocation") else 0
+         self.scaling = self.head_dim**-0.5
+         self.attention_dropout = getattr(config, "attention_probs_dropout_prob", 0.0)
+         self.is_causal = True
+
+         self.q_proj = nn.Linear(
+             self.hidden_size,
+             self.num_heads * self.head_dim,
+             bias=config.use_bias,
+         )
+
+         self.k_proj = nn.Linear(
+             self.hidden_size,
+             self.num_key_value_heads * self.head_dim,
+             bias=config.use_bias,
+         )
+
+         self.v_proj = nn.Linear(
+             self.hidden_size,
+             self.num_key_value_heads * self.head_dim,
+             bias=config.use_bias,
+         )
+
+         self.o_proj = nn.Linear(
+             self.hidden_size,
+             self.hidden_size,
+             bias=config.use_bias,
+         )
+
+         self.config = config
+
+     def forward(
+         self,
+         hidden_states: torch.Tensor,
+         attention_mask: Optional[torch.Tensor] = None,
+         past_key_value: Optional[Cache] = None,
+         position_ids: Optional[torch.Tensor] = None,
+         cache_position: Optional[torch.LongTensor] = None,
+         position_embeddings: tuple[torch.Tensor, torch.Tensor] = None,
+         **kwargs: Unpack[FlashAttentionKwargs],
+     ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor, torch.Tensor]]]:
+         B, L = hidden_states.shape[:-1]
+
+         query_states = self.q_proj(hidden_states).view(B, L, self.num_heads, -1).transpose(1, 2)
+         key_states = self.k_proj(hidden_states).view(B, L, self.num_key_value_heads, -1).transpose(1, 2)
+         value_states = self.v_proj(hidden_states).view(B, L, self.num_key_value_heads, -1).transpose(1, 2)
+
+         cos, sin = position_embeddings
+         query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
+
+         if past_key_value is not None:
+             # sin and cos are specific to RoPE models; cache_position needed for the static cache
+             cache_kwargs = {"cache_position": cache_position}
+             key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
+
+         attention_interface: Callable = eager_attention_forward
+         if self.config._attn_implementation != "eager":
+             attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
+
+         attn_output, attn_weights = attention_interface(
+             self,
+             query_states,
+             key_states,
+             value_states,
+             attention_mask,
+             dropout=0.0 if not self.training else self.attention_dropout,
+             scaling=self.scaling,
+             **kwargs,
+         )
+         attn_output = attn_output.reshape(B, L, -1).contiguous()
+         attn_output = self.o_proj(attn_output)
+
+         return attn_output, attn_weights
+
+
+ class Ernie4_5_MLP(nn.Module):
+     """
+     Ernie4_5_MLP - Gated Multi-Layer Perceptron module used in Ernie model.
+     """
+
+     def __init__(self, config, intermediate_size=None):
+         """
+         Initialize the MLP module with configuration options.
+
+         Args:
+             config: Model configuration object with attributes:
+                 - hidden_size: int
+                 - intermediate_size: int
+                 - use_bias: bool
+             intermediate_size (int, optional): Width override for expert MLPs; defaults to config.intermediate_size.
+         """
+         super().__init__()
+         self.config = config
+         self.hidden_size = config.hidden_size
+         self.intermediate_size = intermediate_size if intermediate_size is not None else config.intermediate_size
+         self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=config.use_bias)
+         self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=config.use_bias)
+         self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=config.use_bias)
+
+     def forward(self, x):
+         """
+         Args:
+             x (Tensor): shape [batch_size, seq_len, hidden_size]
+
+         Returns:
+             Tensor: shape [batch_size, seq_len, hidden_size]
+         """
+         down_proj = self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))
+         return down_proj
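+     # The forward above is the SwiGLU-style gated MLP:
+     #     down_proj(silu(gate_proj(x)) * up_proj(x))
+     # i.e. for gate activation g and up activation u, the hidden value fed to
+     # down_proj is (g * sigmoid(g)) * u.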
+
+
+ class Ernie4_5_MoeStatics(nn.Module):
+     """
+     Stores MoE (Mixture of Experts) statistics
+     and expert usage information.
+     """
+
+     def __init__(self, config):
+         """
+         Initialize MoE statistics tracking.
+
+         Args:
+             config: Model configuration containing MoE parameters
+         """
+         super().__init__()
+
+         num_experts = config.moe_num_experts
+         num_experts_groups = 1
+
+         self.e_score_correction_bias = nn.Parameter(
+             torch.zeros(num_experts_groups, num_experts, dtype=torch.float32),
+             requires_grad=False
+         )
+
+ class Ernie4_5_MoeMLP(nn.Module):
+     """Mixture of Experts (MoE) variant of ERNIE's MLP layer."""
+
+     def __init__(self, config):
+         super().__init__()
+         self.config = config
+         self.k = config.moe_k
+         self.sinkhorn_2gate = config.sinkhorn_2gate
+         self.sinkhorn_temp = config.sinkhorn_temp
+
+         moe_intermediate_size = config.moe_intermediate_size if config.moe_intermediate_size else config.intermediate_size
+         self.gate = nn.Linear(config.hidden_size, config.moe_num_experts, bias=False, dtype=torch.float32)
+         if config.moe_gate_act == "softmax":
+             self.gate_act = partial(F.softmax, dim=-1)
+         elif config.moe_gate_act == "sigmoid":
+             self.gate_act = F.sigmoid
+         else:
+             raise ValueError(f"{config.moe_gate_act} is not supported.")
+
+         self.experts = nn.ModuleList(
+             [Ernie4_5_MLP(config, moe_intermediate_size) for _ in range(config.moe_num_experts)]
+         )
+
+         if config.moe_use_aux_free:
+             self.moe_statics = Ernie4_5_MoeStatics(config)
+
+         self.use_correction_bias = config.moe_use_aux_free
+         self.num_local_experts = len(self.experts)
+
+         self.shared_experts = self._init_shared_experts()
+
+     def _init_shared_experts(self):
+         """
+         Initialize the shared expert module.
+
+         Returns:
+             shared_experts: Shared expert module, returns None if no shared experts are needed.
+         """
+         cfg = deepcopy(self.config)
+         if getattr(cfg, 'moe_num_shared_experts', 0) > 0:
+             if getattr(cfg, 'moe_intermediate_size', None):
+                 cfg.intermediate_size = cfg.moe_intermediate_size * cfg.moe_num_shared_experts
+             else:
+                 cfg.intermediate_size = cfg.intermediate_size * cfg.moe_num_shared_experts
+             shared_experts = Ernie4_5_MLP(cfg, cfg.intermediate_size)
+         else:
+             shared_experts = None
+         return shared_experts
+
+     def forward(
+         self,
+         input: torch.Tensor,
+     ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
+         """
+         Forward pass through MoE layer.
+
+         Args:
+             input (Tensor): Input tensor of shape [s, d] (a [b, s, d] input is flattened first).
+
+         Returns:
+             tuple: (output, combine_weights, router_loss, gate_logits)
+         """
+
+         if input.dim() == 3:
+             orig_shape = input.shape
+             input = input.reshape(-1, input.shape[-1])
+         else:
+             orig_shape = None
+         assert input.dim() == 2, f"input Tensor must have dimensions: (s)equence, (d)im, got: {input.shape}"
+
+         assert self.gate is not None
+
+         gate_input = input
+
+         (
+             dispatched_input,
+             combine_weights,
+             dispatch_mask,
+             scatter_index,
+             router_loss,
+             gate_logits,
+             gate_prob,
+         ) = self.gate_and_dispatch(gate_input)
+
+         expert_out = self.forward_experts(dispatched_input)
+
+         combined_output = self.combine_expert_output(expert_out, combine_weights, scatter_index)
+
+         if self.shared_experts is not None:
+             shared_expert_out = self.shared_experts(gate_input)
+             combined_output += shared_expert_out
+
+         if orig_shape:
+             combined_output = combined_output.reshape(orig_shape[:-1] + (combined_output.shape[-1],))
+
+         return combined_output, combine_weights, router_loss, gate_logits
+
+     def forward_experts(self, dispatched_input: torch.Tensor) -> torch.Tensor:
+         """
+         Forward pass through experts sequentially.
+
+         Args:
+             dispatched_input (Tensor): Input tensor of shape [num_experts, capacity, dim].
+
+         Returns:
+             Tensor: Expert outputs of shape [num_experts, capacity, dim].
+         """
+         true_experts = self.experts
+         dispatched_input = dispatched_input.reshape(
+             1, self.num_local_experts, -1, dispatched_input.shape[-1]
+         )
+         expert_outputs = []
+         if isinstance(self.experts, nn.ModuleList):
+             chunks = dispatched_input.permute(1, 0, 2, 3).contiguous().unbind(0)
+             assert len(chunks) == len(true_experts), f"{len(chunks)}, {len(true_experts)}"
+             for chunk, expert in zip(chunks, true_experts):
+                 expert_outputs.append(expert(chunk))
+         else:
+             dispatched_input = dispatched_input.permute(1, 0, 2, 3).contiguous()
+             orig_shape = dispatched_input.shape
+             chunks = dispatched_input.reshape(orig_shape[0], -1, orig_shape[-1])
+             chunks = self.experts(chunks)
+             chunks = chunks.reshape(orig_shape[:-1] + (chunks.shape[-1],)).unbind(0)
+             expert_outputs.extend(chunks)
+
+         expert_output = torch.stack(expert_outputs, dim=1)
+         return expert_output
+
+     def moe_gate_dispatch(
+         self,
+         x: torch.Tensor,
+         gate_logits: torch.Tensor,
+         k: int,
+         capacity: Optional[int],
+     ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
+
+         S, H = x.shape
+         E = gate_logits.shape[1]
+         device = x.device
+         topk_prob, topk_idx = torch.topk(gate_logits, k, dim=-1)
+         combine_weights = topk_prob
+         expert_id = topk_idx
+         y = x.new_zeros((E, capacity, H))
+         scatter_index = x.new_full((k, S), -1, dtype=torch.int32)
+
+         # per-expert slot counters
+         slot_counter = torch.zeros(E, dtype=torch.int32, device=device)
+
+         for tok in range(S):
+             for route in range(k):
+                 e = expert_id[tok, route].item()
+                 slot = slot_counter[e].item()
+                 if slot >= capacity:
+                     # expert is full: drop this route by zeroing its combine weight
+                     combine_weights[tok, route] = 0.0
+                     continue
+
+                 # record mapping & dispatch activation
+                 scatter_index[route, tok] = e * capacity + slot
+                 y[e, slot] = x[tok]
+                 slot_counter[e] += 1
+
+         expert_offset = torch.cumsum(slot_counter, 0, dtype=torch.int64)
+
+         return y, combine_weights, scatter_index, expert_offset, expert_id
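+     # Capacity behaviour (illustrative): with E=2 experts, capacity=1 and two tokens
+     # whose top-1 route is expert 0, the second token finds the expert full, so its
+     # combine weight is zeroed and that route is dropped for this step.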
+
+     def combine_expert_output(self, expert_output: torch.Tensor, combine_weights: torch.Tensor, scatter_index: torch.Tensor) -> torch.Tensor:
+         """
+         Combine expert outputs using combination weights.
+
+         Args:
+             expert_output (Tensor): Expert outputs [num_experts, capacity, dim].
+             combine_weights (Tensor): Combination weights.
+             scatter_index (Tensor): Scatter indices.
+
+         Returns:
+             Tensor: Combined output [seqlen, dim].
+         """
+         expert_output = expert_output.reshape(-1, expert_output.shape[-1])
+         combined_output = self.combining(expert_output, combine_weights, scatter_index)
+         return combined_output
+
+     def combining(self, x, combine_weights, scatter_index):
+         """
+         Combines and aggregates input matrix using combination weights.
+
+         Args:
+             x (Tensor): Input tensor of shape [num_experts * capacity, dim]
+             combine_weights (Tensor): Combination weights of shape [seq, k]
+             scatter_index (Tensor): Scatter indices of shape [seq, k]
+
+         Returns:
+             Tensor: Combined output tensor of shape [seq, dim]
+         """
+         dim = x.shape[-1]
+
+         scatter_index = scatter_index.reshape([-1])
+         num_k = combine_weights.shape[-1]
+
+         combine_weights = combine_weights.unsqueeze(1)
+
+         x = x[scatter_index].reshape([-1, num_k, dim])
+
+         return torch.matmul(combine_weights, x).squeeze(1)
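+     # Shape walkthrough (hypothetical s tokens, k routes, dim d): the gather gives
+     # x of shape [s, k, d] and combine_weights.unsqueeze(1) is [s, 1, k]; the batched
+     # matmul [s, 1, k] @ [s, k, d] -> [s, 1, d] is squeezed to the final [s, d].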
+
+     def gate_and_dispatch(self, input):
+         """
+         Calculate gate and dispatch inputs.
+
+         Args:
+             input: Input tensor of shape [seq, dim]
+
+         Returns:
+             tuple: (dispatched_input, combine_weights, dispatch_mask,
+                     scatter_index, router_loss, gate_logits, gate_prob)
+         """
+         gate_logits, capacity, router_loss = topk_gate_func(
+             self,
+             input,
+         )
+
+         prob = self.gate_act(gate_logits)
+         (
+             dispatched_input,
+             combine_weights_unnorm,
+             scatter_index,
+             dispatch_mask,
+             _,
+         ) = self.moe_gate_dispatch(input, prob, k=self.k, capacity=capacity)
+         # moe_gate_dispatch returns cumulative per-expert counts; diff recovers per-expert token counts
+         dispatch_mask = torch.diff(F.pad(dispatch_mask, (1, 0)))
+
+         scatter_index = scatter_index.detach()
+         dispatch_mask = dispatch_mask.detach()
+
+         scatter_index = scatter_index.transpose(0, 1)  # [k, s] -> [s, k]
+         combine_weights = combine_weights_unnorm / torch.clamp(
+             combine_weights_unnorm.sum(dim=-1, keepdim=True), min=1e-12
+         )
+         combine_weights = combine_weights.to(dtype=dispatched_input.dtype)
+
+         return dispatched_input, combine_weights, dispatch_mask, scatter_index, router_loss, gate_logits, prob
+
+     def get_capacity(self, num_tokens, cap_factor=None):
+         """
+         Calculate capacity based on number of tokens.
+
+         Args:
+             num_tokens: Number of input tokens
+             cap_factor: Optional capacity factor override
+
+         Returns:
+             int: Calculated capacity
+         """
+         num_experts = self.config.moe_num_experts
+         if cap_factor is not None:
+             cap = cap_factor
+         else:
+             if self.training:
+                 cap = self.config.moe_capacity[0]
+             elif num_tokens < num_experts:
+                 cap = self.config.moe_capacity[2]
+             else:
+                 cap = self.config.moe_capacity[1]
+
+         capacity = int(cap * num_tokens // num_experts)
+         assert capacity > 0, f"requires capacity > 0. cap={cap}, num_tokens={num_tokens}"
+         return capacity
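+     # Worked example (hypothetical numbers): with cap=1.2, num_tokens=1024 and
+     # 64 experts, capacity = int(1.2 * 1024 // 64) = 19 slots per expert.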
+
+
+ class Ernie4_5_RMSNorm(nn.Module):
+     """
+     Ernie Root Mean Square Layer Normalization (Ernie4_5_RMSNorm) implementation.
+
+     Ernie4_5_RMSNorm is a simplified version of LayerNorm that focuses on the root mean square of inputs,
+     omitting the mean-centering operation. This provides computational efficiency while maintaining
+     good performance.
+
+     """
+
+     def __init__(self, config):
+         """
+         Initialize RMSNorm layer.
+
+         Args:
+             config (ErnieConfig): Model configuration.
+         """
+         super().__init__()
+         self.config = config
+         self.hidden_size = config.hidden_size
+         self.weight = nn.Parameter(torch.ones(config.hidden_size))
+         self.variance_epsilon = config.rms_norm_eps
+
+     def forward(self, hidden_states):
+         """
+         Apply RMS normalization to input hidden states.
+
+         Args:
+             hidden_states (Tensor): Input tensor of shape [batch_size, seq_len, hidden_size]
+
+         Returns:
+             Tensor: Normalized output tensor of same shape as input
+         """
+         input_dtype = hidden_states.dtype
+         hidden_states = hidden_states.to(torch.float32)
+         variance = hidden_states.pow(2).mean(dim=-1, keepdim=True)
+         hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
+
+         return self.weight * hidden_states.to(input_dtype)
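+     # Equivalent formula: y = weight * x / sqrt(mean(x**2, dim=-1) + eps), computed
+     # in float32 and cast back; unlike LayerNorm there is no mean subtraction and no bias.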
+
+
+ class Ernie4_5_RopeEmbedding(nn.Module):
+     def __init__(self, config: Ernie4_5_MoeConfig, device=None):
+         super().__init__()
+         # BC: "rope_type" was originally "type"
+         if hasattr(config, "rope_scaling") and config.rope_scaling is not None:
+             self.rope_type = config.rope_scaling.get("rope_type", config.rope_scaling.get("type"))
+         else:
+             self.rope_type = "default"
+         self.max_seq_len_cached = config.max_position_embeddings
+         self.original_max_seq_len = config.max_position_embeddings
+
+         self.config = config
+         self.rope_init_fn = ROPE_INIT_FUNCTIONS[self.rope_type]
+         inv_freq, self.attention_scaling = self.rope_init_fn(self.config, device)
+         self.register_buffer("inv_freq", inv_freq, persistent=False)
+         self.original_inv_freq = self.inv_freq
+
+     @torch.no_grad()
+     def forward(self, x, position_ids):
+         inv_freq_expanded = self.inv_freq[None, None, :].float()
+         position_ids_expanded = position_ids[..., None].float()
+         freqs = inv_freq_expanded * position_ids_expanded
+         cos = torch.cos(freqs) * self.attention_scaling
+         sin = torch.sin(freqs) * self.attention_scaling
+         return cos, sin
+         # return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)
+
+
+ class Ernie4_5_DecoderLayer(nn.Module):
+     """A single transformer decoder layer in ERNIE-MoE model.
+
+     Contains self-attention and feed-forward components with optional MoE (Mixture of Experts)
+     support, residual connections, and layer normalization.
+     """
+
+     def __init__(self, config, layer_idx):
+         """Initialize the decoder layer.
+
+         Args:
+             config (ErnieMoEConfig): Model configuration.
+             layer_idx (int): Index of this layer in the transformer stack.
+         """
+         super().__init__()
+         self.hidden_size = config.hidden_size
+         self.layer_idx = layer_idx
+         self.config = config
+         self.use_moe = config.use_moe
+         self.self_attn = Ernie4_5_Attention(config, layer_idx)
+
+         moe_layer_start_index = (
+             min(config.moe_layer_start_index)
+             if isinstance(config.moe_layer_start_index, (tuple, list))
+             else config.moe_layer_start_index
+         )
+         moe_layer_end_index = (
+             max(config.moe_layer_end_index)
+             if isinstance(config.moe_layer_end_index, (tuple, list))
+             else config.moe_layer_end_index
+         )
+
+         if (
+             self.use_moe
+             and ((layer_idx + 1) % config.moe_layer_interval == 0)
+             and layer_idx >= moe_layer_start_index
+             and layer_idx <= moe_layer_end_index
+         ):
+             self.mlp = Ernie4_5_MoeMLP(config)
+         else:
+             self.mlp = Ernie4_5_MLP(config)
+
+         self.input_layernorm = Ernie4_5_RMSNorm(config)
+         self.post_attention_layernorm = Ernie4_5_RMSNorm(config)
+
+         self.residual_add1 = Ernie4_5_ResidualWithDropout(config.hidden_dropout_prob)
+         self.residual_add2 = Ernie4_5_ResidualWithDropout(config.hidden_dropout_prob)
+
+     def forward(
+         self,
+         hidden_states: torch.Tensor,
+         attention_mask: Optional[torch.Tensor] = None,
+         position_ids: Optional[torch.Tensor] = None,
+         past_key_value: Optional[Tuple[torch.Tensor]] = None,
+         output_attentions: Optional[bool] = False,
+         use_cache: Optional[bool] = False,
+         cache_position: Optional[torch.LongTensor] = None,
+         position_embeddings: Optional[tuple[torch.Tensor, torch.Tensor]] = None,  # necessary, but kept here for BC
+         output_router_loss: bool = True,
+         output_gate_logits: bool = True,
+         **kwargs: Unpack[FlashAttentionKwargs],
+     ) -> tuple[torch.FloatTensor, Optional[tuple[torch.FloatTensor, torch.FloatTensor]]]:
+         """Forward pass through the decoder layer.
+
+         Args:
+             hidden_states (torch.Tensor): Input tensor [batch_size, seq_len, hidden_size]
+             attention_mask (Optional[torch.Tensor]): Attention mask tensor
+             position_ids (Optional[torch.Tensor]): Position indices for rotary embeddings
+             past_key_value (Optional[Tuple[torch.Tensor]]): Cached key/value states
+             output_attentions (Optional[bool]): Whether to return attention weights
+             use_cache (Optional[bool]): Whether to cache key/value states
+             cache_position (`torch.LongTensor` of shape `(sequence_length)`, *optional*):
+                 Indices depicting the position of the input sequence tokens in the sequence.
+             position_embeddings (`tuple[torch.FloatTensor, torch.FloatTensor]`, *optional*):
+                 Tuple containing the cosine and sine positional embeddings of shape `(batch_size, seq_len, head_dim)`,
+                 with `head_dim` being the embedding dimension of each attention head.
+             output_router_loss (bool): Whether to return MoE router loss
+             output_gate_logits (bool): Whether to return MoE gate logits
+
+         Returns:
+             Union: Various output combinations depending on arguments:
+                 - Base case: Hidden states tensor
+                 - With attention: Tuple of (hidden_states, attention_weights)
+                 - With router loss: May include router loss in output tuple
+                 - With MoE gate logits: May include gate logits in output tuple
+         """
+         residual = hidden_states
+
+         hidden_states = self.input_layernorm(hidden_states)
+
+         # Self Attention
+         hidden_states, self_attn_weights = self.self_attn(
+             hidden_states=hidden_states,
+             attention_mask=attention_mask,
+             past_key_value=past_key_value,
+             position_ids=position_ids,
+             use_cache=use_cache,
+             cache_position=cache_position,
+             position_embeddings=position_embeddings,
+             **kwargs,
+         )
+
+         hidden_states = self.residual_add1(hidden_states, residual)
+
+         # Fully Connected
+         residual = hidden_states
+         hidden_states = self.post_attention_layernorm(hidden_states)
+
+         router_loss = None
+         gate_logits = None
+
+         if isinstance(self.mlp, Ernie4_5_MoeMLP):
+             hidden_states, _, router_loss, gate_logits = self.mlp(hidden_states)
+         else:
+             hidden_states = self.mlp(hidden_states)
+
+         hidden_states = self.residual_add2(hidden_states, residual)
+
+         outputs = (hidden_states,)
+
+         if output_attentions:
+             outputs += (self_attn_weights,)
+
+         if output_router_loss:
+             outputs += (router_loss,)
+
+         if output_gate_logits:
+             outputs += (gate_logits,)
+
+         return outputs
+
+
+ @auto_docstring
+ class Ernie4_5_PretrainedModel(PreTrainedModel):
+     """Base class for ERNIE pretrained models."""
+     config_class = Ernie4_5_MoeConfig
+     base_model_prefix = "model"
+     supports_gradient_checkpointing = True
+     _no_split_modules = ["Ernie4_5_DecoderLayer"]
+     _skip_keys_device_placement = ["past_key_values"]
+     _supports_flash_attn_2 = True
+     _supports_sdpa = True
+     _supports_flex_attn = True
+     _supports_cache_class = True
+     _supports_quantized_cache = True
+     _supports_static_cache = False  # MoE models don't work with torch.compile (`torch.where(condition)` not supported)
+
+
+ def subbatch(f, arg_idx, axis, bs, out_idx, same_arg_idx={}):
+     """
+     Converts a function to one that applies to subbatches of an input dimension.
+     Useful for processing large tensors in smaller chunks to reduce memory usage.
+
+     Args:
+         f (Callable): Function to be subbatched.
+         arg_idx ([int]): Indices of the inputs to be subbatched.
+         axis ([int]): Indices of the dimensions to be subbatched for each input.
+         bs (int): Subbatch size.
+         out_idx (int): Dimension to concatenate outputs along.
+         same_arg_idx (dict): Mapping of argument indices that share the same tensor.
+
+     Returns:
+         Callable: New function that processes inputs in subbatches.
+     """
+
+     @functools.wraps(f)
+     def wrapper(*args, **kwargs):
+
+         assert len(arg_idx) == len(axis), "Number of batching args and number of batching dims should match."
+
+         inps = [args[i] for i in arg_idx]
+         axis_width = [inp.shape[d] for inp, d in zip(inps, axis)]
+         assert len(set(axis_width)) == 1, "Batch sizes should be kept equal."
+
+         inp_axis = {idx: d for idx, d in zip(arg_idx, axis)}
+
+         axis_width = axis_width[0]
+         if axis_width < bs:
+             return f(*args, **kwargs)
+
+         outs = []
+         for slice_at in range(0, axis_width, bs):
+             _args = []
+             for i, inp in enumerate(args):
+                 if i in same_arg_idx:
+                     assert (
+                         i > same_arg_idx[i]
+                     ), f"expect i > same_arg_idx[i], but got i: {i} and same_arg_idx[i]: {same_arg_idx[i]}"
+                     _args.append(_args[same_arg_idx[i]])
+                 elif i in arg_idx:
+                     d = inp_axis[i]
+                     start = slice_at
+                     end = min(inp.shape[d], slice_at + bs)
+                     # Build slice for all dims, only slice along axis d
+                     slices = [slice(None)] * inp.ndim
+                     slices[d] = slice(start, end)
+                     _args.append(inp[tuple(slices)])
+                 else:
+                     _args.append(inp)
+
+             out = f(*_args, **kwargs)
+             outs.append(out)
+
+         return torch.cat(outs, dim=out_idx)
+
+     return wrapper
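+ # Hypothetical usage sketch (score_fn and its tensors are illustrative, not part of
+ # this module): chunk the first argument 128 rows at a time along dim 1 and
+ # concatenate the results on dim 1:
+ #     >>> chunked = subbatch(score_fn, arg_idx=[0], axis=[1], bs=128, out_idx=1)
+ #     >>> out = chunked(q, k, v)  # same value as score_fn(q, k, v), lower peak memory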
891
+
892
+
893
+ class ErniePretrainingCriterion(nn.Module):
894
+ """Criterion for ERNIE pretraining task."""
895
+
896
+ def __init__(self, config, return_tuple=True):
897
+ """Initialize the pretraining criterion.
898
+
899
+ Args:
900
+ config (ErnieConfig): Model configuration.
901
+ return_tuple (bool): Whether to return loss as tuple (loss, loss_sum). Defaults to True.
902
+ """
903
+ super().__init__()
904
+ self.ignored_index = getattr(config, "ignored_index", -100)
905
+ self.config = config
906
+ self.return_tuple = return_tuple
907
+
908
+ self.loss_func = nn.CrossEntropyLoss(reduction="none")
909
+
910
+ def forward(self, prediction_scores, masked_lm_labels, loss_mask, router_loss=None, mtp_logits=None):
911
+ """Compute the combined pretraining loss.
912
+
913
+ Args:
914
+ prediction_scores: Prediction scores tensor, [batch_size, seq_len, vocab_size]
915
+ masked_lm_labels: Target labels tensor [batch_size, seq_len]
916
+ loss_mask: Optional mask for valid tokens
917
+ router_loss: Optional MoE router loss tensor
+ mtp_logits: Optional list of multi-token prediction logits, one tensor per MTP depth
918
+
919
+ Returns:
920
+ Union:
921
+ - If return_tuple=True: Tuple of (combined_loss, mlm_loss_sum)
922
+ - If return_tuple=False: Combined loss tensor
923
+ """
924
+ if self.config.num_nextn_predict_layers > 0 and self.training:
925
+ masked_lm_labels_ori = masked_lm_labels
926
+ masked_lm_labels = masked_lm_labels[:, : -self.config.num_nextn_predict_layers]
927
+ loss_mask = loss_mask[:, : -self.config.num_nextn_predict_layers]
928
+ seq_length = masked_lm_labels.shape[1]
929
+
930
+ res = self.forward_impl(prediction_scores, masked_lm_labels, loss_mask)
931
+
932
+ if self.config.num_nextn_predict_layers > 0 and self.training:
933
+ mtp_loss_res = []
934
+ for depth in range(self.config.num_nextn_predict_layers):
935
+ prediction_scores_cur_depth = mtp_logits[depth]
936
+ masked_lm_labels_cur_depth = masked_lm_labels_ori[:, (depth + 1) : (depth + 1 + seq_length)]
937
+ # nn.Module.forward is only a stub, so route the MTP depth through forward_impl
+ res_cur_depth = self.forward_impl(
938
+ prediction_scores_cur_depth,
939
+ masked_lm_labels_cur_depth,
940
+ )
941
+ mtp_loss_res.append(res_cur_depth)
942
+
943
+ def add_loss(main_loss, loss):
944
+ return main_loss + loss - loss.detach()
945
+
946
+
947
+ if self.return_tuple:
948
+ loss, loss_sum = res
949
+ if self.config.num_nextn_predict_layers > 0 and self.training:
950
+ loss = add_loss(
951
+ loss, self.config.multi_token_pred_lambda * sum([x[0] for x in mtp_loss_res]) / len(mtp_loss_res)
952
+ )
953
+ loss_sum = loss_sum + self.config.multi_token_pred_lambda * sum(
954
+ [x[1].detach() for x in mtp_loss_res]
955
+ ) / len(mtp_loss_res)
956
+ else:
957
+ loss, loss_sum = res, None
958
+ if self.config.num_nextn_predict_layers > 0 and self.training:
959
+ loss = add_loss(
960
+ loss, self.config.multi_token_pred_lambda * sum([x[0] for x in mtp_loss_res]) / len(mtp_loss_res)
961
+ )
962
+
963
+ if router_loss is not None and isinstance(router_loss, torch.Tensor):
964
+ loss = loss + router_loss - router_loss.detach()
965
+
966
+ return loss, loss_sum
967
+
968
+
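Both `add_loss` and the router-loss term above rely on the `x + y - y.detach()` pattern: the scalar value of the combined loss stays equal to the main loss, while gradients still flow through the auxiliary term. A tiny self-contained check of that property:

```python
import torch

main = torch.tensor(2.0, requires_grad=True)
aux = torch.tensor(5.0, requires_grad=True)

combined = main + aux - aux.detach()
print(combined.item())   # 2.0 -- value equals the main loss alone

combined.backward()
print(main.grad.item())  # 1.0
print(aux.grad.item())   # 1.0 -- the auxiliary term still receives gradient
```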
969
+ def loss_impl(self, prediction_scores: torch.Tensor, masked_lm_labels: torch.Tensor) -> torch.Tensor:
970
+ """
971
+ Core per-token loss computation without reduction.
972
+
973
+ Args:
974
+ prediction_scores (torch.Tensor): Logits tensor [batch_size, seq_len, vocab_size].
975
+ masked_lm_labels (torch.Tensor): Target labels tensor [batch_size, seq_len].
976
+
977
+ Returns:
978
+ torch.Tensor: Unreduced loss tensor of shape [batch_size, seq_len].
979
+ Losses are calculated in float32.
980
+ """
981
+ scores_float32 = prediction_scores.to(torch.float32)
982
+ # prediction_scores: [batch_size, seq_len, vocab_size]
983
+ # masked_lm_labels: [batch_size, seq_len]
984
+ # Transpose prediction_scores to [batch_size, vocab_size, seq_len]
985
+ unreduced_loss = self.loss_func(
986
+ scores_float32.transpose(1, 2), # Shape: [batch_size, vocab_size, seq_len]
987
+ masked_lm_labels.long() # Shape: [batch_size, seq_len], ensure long type
988
+ )
989
+ # unreduced_loss will be of shape [batch_size, seq_len] and dtype float32
990
+ return unreduced_loss
991
+
992
+ def forward_impl(self, prediction_scores, masked_lm_labels, loss_mask=None):
993
+ prediction_scores_dims = len(prediction_scores.shape)
994
+
995
+ loss_subbatch_seqlen_config_key = "loss_subbatch_seqlen"
996
+ default_loss_subbatch_seqlen = 32768
997
+
998
+ current_loss_subbatch_seqlen = getattr(self.config, loss_subbatch_seqlen_config_key, default_loss_subbatch_seqlen)
999
+
1000
+ if prediction_scores_dims == 2 and prediction_scores.shape[0] > current_loss_subbatch_seqlen:
1001
+ sb_loss_func = subbatch(
1002
+ self.loss_impl, [0, 1], [0, 0], current_loss_subbatch_seqlen, 0
1003
+ )
1004
+ masked_lm_loss = sb_loss_func(prediction_scores, masked_lm_labels)
1005
+ elif prediction_scores_dims == 3 and prediction_scores.shape[1] > current_loss_subbatch_seqlen:
1006
+ sb_loss_func = subbatch(
1007
+ self.loss_impl, [0, 1], [1, 1], current_loss_subbatch_seqlen, 1
1008
+ )
1009
+ masked_lm_loss = sb_loss_func(prediction_scores, masked_lm_labels)
1010
+ else:
1011
+ masked_lm_loss = self.loss_impl(prediction_scores, masked_lm_labels)
1012
+
1013
+ if loss_mask is None:
1014
+ loss_mask = masked_lm_labels != self.ignored_index
1015
+
1016
+ loss_mask = loss_mask.reshape(-1).to(torch.float32)
1017
+
1018
+ masked_lm_loss = torch.sum(masked_lm_loss.to(torch.float32).reshape(-1) * loss_mask)
1019
+
1020
+ # The division will be in float32
1021
+ loss = masked_lm_loss / loss_mask.sum()
1022
+
1023
+ loss_sum = masked_lm_loss.sum().detach()
1024
+
1025
+ if not self.return_tuple:
1026
+ if self.training:
1027
+ return loss
1028
+ return loss_sum
1029
+ return loss, loss_sum
1030
+
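The reduction at the end of `forward_impl` is a plain masked mean over valid tokens. As a toy check (the per-token losses and labels below are invented):

```python
import torch

per_token = torch.tensor([[0.5, 1.5, 2.0, 3.0]])   # what loss_impl would return
labels = torch.tensor([[7, 9, -100, -100]])        # -100 is the ignored_index

mask = (labels != -100).reshape(-1).float()
total = (per_token.reshape(-1) * mask).sum()
print((total / mask.sum()).item())  # 1.0 -- mean over the 2 valid tokens only
```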
1031
+ @auto_docstring
1032
+ class Ernie4_5_Model(Ernie4_5_PretrainedModel):
1033
+ """The core ERNIE transformer model with MoE (Mixture of Experts) support."""
1034
+ _keep_in_fp32_modules = ['gate']
1035
+ def __init__(self, config: Ernie4_5_MoeConfig):
1036
+ """Initialize the ERNIE model architecture."""
1037
+ super().__init__(config)
1038
+ self.padding_idx = config.pad_token_id
1039
+ self.vocab_size = config.vocab_size
1040
+ self.hidden_size = config.hidden_size
1041
+ self.config = config
1042
+
1043
+ self.embed_tokens = nn.Embedding(
1044
+ self.vocab_size,
1045
+ self.hidden_size,
1046
+ )
1047
+
1048
+ self.layers = nn.ModuleList(
1049
+ [
1050
+ Ernie4_5_DecoderLayer(config, i)
1051
+ for i in range(config.num_hidden_layers)
1052
+ ]
1053
+ )
1054
+ self.norm = Ernie4_5_RMSNorm(config)
1055
+ self.rotary_emb = Ernie4_5_RopeEmbedding(config=config)
1056
+
1057
+ self.gradient_checkpointing = False
1058
+
1059
+ if config.num_nextn_predict_layers > 0 and self.training:
1060
+ self.mtp_block = nn.ModuleList(
1061
+ [Ernie4_5_DecoderLayer(config, layer_idx) for layer_idx in range(config.num_nextn_predict_layers)]
1062
+ )
1063
+ self.mtp_emb_norm = nn.ModuleList(
1064
+ [Ernie4_5_RMSNorm(config) for _ in range(config.num_nextn_predict_layers)]
1065
+ )
1066
+ self.mtp_hidden_norm = nn.ModuleList(
1067
+ [Ernie4_5_RMSNorm(config) for _ in range(config.num_nextn_predict_layers)]
1068
+ )
1069
+ self.mtp_linear_proj = nn.ModuleList(
1070
+ [nn.Linear(config.hidden_size * 2, config.hidden_size, bias=config.use_bias) for _ in range(config.num_nextn_predict_layers)]
1071
+ )
1072
+
1073
+ self.post_init()
1074
+
1075
+ def get_input_embeddings(self):
1076
+ """Get the input embedding layer."""
1077
+ return self.embed_tokens
1078
+
1079
+ def set_input_embeddings(self, value):
1080
+ """Set new input embeddings."""
1081
+ self.embed_tokens = value
1082
+
1083
+ def forward(
1084
+ self,
1085
+ input_ids: Optional[torch.LongTensor] = None,
1086
+ attention_mask: Optional[torch.Tensor] = None,
1087
+ position_ids: Optional[torch.LongTensor] = None,
1088
+ past_key_values: Optional[Cache] = None,
1089
+ inputs_embeds: Optional[torch.FloatTensor] = None,
1090
+ use_cache: Optional[bool] = None,
1091
+ output_attentions: Optional[bool] = None,
1092
+ output_hidden_states: Optional[bool] = None,
1093
+ cache_position: Optional[torch.LongTensor] = None,
1094
+ **flash_attn_kwargs: Unpack[FlashAttentionKwargs],
1095
+ ):
1096
+ """Forward pass through the ERNIE model."""
1097
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
1098
+ output_hidden_states = (
1099
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
1100
+ )
1101
+
1102
+ if (input_ids is None) ^ (inputs_embeds is not None):
1103
+ raise ValueError("You must specify exactly one of input_ids or inputs_embeds")
1104
+
1105
+ if self.gradient_checkpointing and self.training:
1106
+ if use_cache:
1107
+ logger.warning_once(
1108
+ "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
1109
+ )
1110
+ use_cache = False
1111
+
1112
+ if use_cache and past_key_values is None:
1113
+ past_key_values = DynamicCache()
1114
+
1115
+ if inputs_embeds is None:
1116
+ inputs_embeds = self.embed_tokens(input_ids)
1117
+
1118
+ inputs_embeds = inputs_embeds.to(self.embed_tokens.weight.dtype)
1119
+
1120
+ if cache_position is None:
1121
+ past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
1122
+ cache_position = torch.arange(
1123
+ past_seen_tokens, past_seen_tokens + inputs_embeds.shape[1], device=inputs_embeds.device
1124
+ )
1125
+ if position_ids is None:
1126
+ position_ids = cache_position.unsqueeze(0)
1127
+
1128
+ seq_length = inputs_embeds.size(1)
1129
+ if self.config.num_nextn_predict_layers > 0 and self.training:
1130
+ seq_length -= self.config.num_nextn_predict_layers
1131
+ seq_length_with_past = seq_length
1132
+ if position_ids is not None:
1133
+ position_ids = position_ids[:, :seq_length]
1134
+ inputs_embeds_extra = inputs_embeds[:, -self.config.num_nextn_predict_layers :, :]
1135
+ inputs_embeds = inputs_embeds[:, : -self.config.num_nextn_predict_layers, :]
1136
+ inputs_embeds_ori = inputs_embeds
1137
+
1138
+ causal_mask = self._update_causal_mask(
1139
+ attention_mask, inputs_embeds, cache_position, past_key_values, output_attentions
1140
+ )
1141
+
1142
+ hidden_states = inputs_embeds
1143
+
1144
+ # create position embeddings to be shared across the decoder layers
1145
+ position_embeddings = self.rotary_emb(hidden_states, position_ids)
1146
+
1147
+ # decoder layers
1148
+ all_hidden_states = () if output_hidden_states else None
1149
+ all_self_attns = () if output_attentions else None
1150
+ all_router_loss = torch.tensor(0.0, device=inputs_embeds.device) if self.config.use_moe else None
1151
+ all_gate_logits = ()
1152
+
1153
+ for decoder_layer in self.layers:
1154
+ if output_hidden_states:
1155
+ all_hidden_states += (hidden_states,)
1156
+
1157
+ if self.gradient_checkpointing and self.training:
1158
+ layer_outputs = self._gradient_checkpointing_func(
1159
+ partial(decoder_layer.__call__, **flash_attn_kwargs),
1160
+ hidden_states,
1161
+ causal_mask,
1162
+ position_ids,
1163
+ past_key_values,
1164
+ output_attentions,
1165
+ use_cache,
1166
+ cache_position,
1167
+ position_embeddings,
1168
+ )
1169
+ else:
1170
+ layer_outputs = decoder_layer(
1171
+ hidden_states,
1172
+ causal_mask,
1173
+ position_ids,
1174
+ past_key_values,
1175
+ output_attentions,
1176
+ use_cache,
1177
+ cache_position,
1178
+ position_embeddings,
1179
+ **flash_attn_kwargs,
1180
+ )
1181
+
1182
+ hidden_states = layer_outputs[0]
1183
+
1184
+ if output_attentions:
1185
+ all_self_attns += (layer_outputs[1],)
1186
+
1187
+ if self.config.use_moe:
1188
+ layer_outputs, gate_logits = layer_outputs[:-1], layer_outputs[-1]
1189
+ all_gate_logits = all_gate_logits + (gate_logits,)
1190
+
1191
+ mtp_outputs = []
1192
+ if self.config.num_nextn_predict_layers > 0 and self.training:
1193
+ mtp_outputs.append(hidden_states)
1194
+ for depth in range(self.config.num_nextn_predict_layers):
1195
+ inputs_embeds_cur_depth = torch.concat(
1196
+ [inputs_embeds_ori[:, (depth + 1) :, :], inputs_embeds_extra[:, : (depth + 1), :]], axis=1
1197
+ )
1198
+ inputs_embeds_cur_depth_norm = self.mtp_emb_norm[depth](inputs_embeds_cur_depth)
1199
+ hidden_states_norm = self.mtp_hidden_norm[depth](hidden_states)
1200
+
1201
+ inputs_embeds_cur_depth = self.mtp_linear_proj[depth](
1202
+ torch.concat([inputs_embeds_cur_depth_norm, hidden_states_norm], axis=-1)
1203
+ )
1204
+
1205
+ decoder_layer = self.mtp_block[depth]
1206
+ layer_outputs = decoder_layer(
1207
+ inputs_embeds_cur_depth,
1208
+ causal_mask,
1209
+ position_ids,
1210
+ past_key_values,
1211
+ output_attentions,
1212
+ use_cache,
1213
+ cache_position,
1214
+ position_embeddings,
1215
+ **flash_attn_kwargs,
1216
+ )
1217
+ if isinstance(layer_outputs, (tuple, list)):
1218
+ hidden_states = layer_outputs[0]
1219
+ else:
1220
+ hidden_states = layer_outputs
1221
+
1222
+ if self.config.use_moe:
1223
+ layer_outputs, gate_logits = layer_outputs[:-1], layer_outputs[-1]
1224
+ all_gate_logits = all_gate_logits + (gate_logits,)
1225
+
1226
+ mtp_outputs.append(hidden_states)
1227
+ mtp_outputs = [self.norm(mtp_hidden) for mtp_hidden in mtp_outputs]
1228
+ hidden_states, mtp_outputs = mtp_outputs[0], mtp_outputs[1:]
1229
+ else:
1230
+ hidden_states = self.norm(hidden_states)
1231
+
1232
+ # add hidden states from the last decoder layer
1233
+ if output_hidden_states:
1234
+ all_hidden_states += (hidden_states,)
1235
+
1237
+ return Ernie4_5_MoeModelOutputWithPast(
1238
+ last_hidden_state=hidden_states,
1239
+ past_key_values=past_key_values,
1240
+ hidden_states=all_hidden_states,
1241
+ attentions=all_self_attns,
1242
+ router_loss=all_router_loss,
1243
+ gate_logits=all_gate_logits,
1244
+ mtp_outputs=mtp_outputs,
1245
+ )
1246
+
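To make the multi-token-prediction (MTP) slicing concrete, here is a small sketch of how the embeddings are shifted per depth. The tensors are toy values, not actual activations; only the indexing mirrors the code above.

```python
import torch

num_nextn = 2                          # stand-in for config.num_nextn_predict_layers
emb = torch.arange(8.0).view(1, 8, 1)  # [batch=1, seq=8, hidden=1], toy values

# mirrors the split at the top of forward
emb_extra = emb[:, -num_nextn:, :]     # held-back tail, one position per MTP depth
emb_main = emb[:, :-num_nextn, :]      # what the main trunk consumes (seq_len = 6)

for depth in range(num_nextn):
    # embeddings shifted left by depth + 1, tail refilled from emb_extra
    cur = torch.concat([emb_main[:, depth + 1:, :], emb_extra[:, :depth + 1, :]], axis=1)
    print(depth, cur.flatten().tolist())
# 0 [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
# 1 [2.0, 3.0, 4.0, 5.0, 6.0, 7.0]  -> depth d predicts token t+d+1 from position t
```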
1247
+ def _update_causal_mask(
1248
+ self,
1249
+ attention_mask: Union[torch.Tensor, "BlockMask"],
1250
+ input_tensor: torch.Tensor,
1251
+ cache_position: torch.Tensor,
1252
+ past_key_values: Cache,
1253
+ output_attentions: bool = False,
1254
+ ):
1255
+ if self.config._attn_implementation == "flash_attention_2":
1256
+ if attention_mask is not None and past_key_values is not None:
1257
+ is_padding_right = attention_mask[:, -1].sum().item() != input_tensor.size()[0]
1258
+ if is_padding_right:
1259
+ raise ValueError(
1260
+ "You are attempting to perform batched generation with padding_side='right'"
1261
+ " this may lead to unexpected behaviour for Flash Attention version of Ernie4_5. Make sure to "
1262
+ " call `tokenizer.padding_side = 'left'` before tokenizing the input. "
1263
+ )
1264
+ if attention_mask is not None and 0.0 in attention_mask:
1265
+ return attention_mask
1266
+ return None
1267
+ if self.config._attn_implementation == "flex_attention":
1268
+ if isinstance(attention_mask, torch.Tensor):
1269
+ attention_mask = make_flex_block_causal_mask(attention_mask)
1270
+ return attention_mask
1271
+
1272
+ # For SDPA, when possible, we will rely on its `is_causal` argument instead of its `attn_mask` argument, in
1273
+ # order to dispatch on Flash Attention 2. This feature is not compatible with static cache, as SDPA will fail
1274
+ # to infer the attention mask.
1275
+ past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
1276
+ using_static_cache = isinstance(past_key_values, StaticCache)
1277
+
1278
+ # When output attentions is True, sdpa implementation's forward method calls the eager implementation's forward
1279
+ if (
1280
+ self.config._attn_implementation == "sdpa"
1281
+ and not using_static_cache
1282
+ and not output_attentions
1283
+ ):
1284
+ if AttentionMaskConverter._ignore_causal_mask_sdpa(
1285
+ attention_mask,
1286
+ inputs_embeds=input_tensor,
1287
+ past_key_values_length=past_seen_tokens,
1288
+ is_training=self.training,
1289
+ ):
1290
+ return None
1291
+
1292
+ dtype = input_tensor.dtype
1293
+ min_dtype = torch.finfo(dtype).min
1294
+ sequence_length = input_tensor.shape[1]
1295
+ # StaticCache
1296
+ if using_static_cache:
1297
+ target_length = past_key_values.get_max_cache_shape()
1298
+ # DynamicCache or no cache
1299
+ else:
1300
+ target_length = (
1301
+ attention_mask.shape[-1]
1302
+ if isinstance(attention_mask, torch.Tensor)
1303
+ else past_seen_tokens + sequence_length + 1
1304
+ )
1305
+
1306
+ # In case the provided `attention_mask` is 2D, we generate a causal mask here (4D).
1307
+ causal_mask = self._prepare_4d_causal_attention_mask_with_cache_position(
1308
+ attention_mask,
1309
+ sequence_length=sequence_length,
1310
+ target_length=target_length,
1311
+ dtype=dtype,
1312
+ cache_position=cache_position,
1313
+ batch_size=input_tensor.shape[0],
1314
+ config=self.config,
1315
+ past_key_values=past_key_values,
1316
+ )
1317
+
1318
+ if (
1319
+ self.config._attn_implementation == "sdpa"
1320
+ and attention_mask is not None
1321
+ and attention_mask.device.type in ["cuda", "xpu", "npu"]
1322
+ and not output_attentions
1323
+ ):
1324
+ # Attend to all tokens in fully masked rows in the causal_mask, for example the relevant first rows when
1325
+ # using left padding. This is required by F.scaled_dot_product_attention memory-efficient attention path.
1326
+ # Details: https://github.com/pytorch/pytorch/issues/110213
1327
+ causal_mask = AttentionMaskConverter._unmask_unattended(causal_mask, min_dtype)
1328
+
1329
+ return causal_mask
1330
+
1331
+ @staticmethod
1332
+ def _prepare_4d_causal_attention_mask_with_cache_position(
1333
+ attention_mask: torch.Tensor,
1334
+ sequence_length: int,
1335
+ target_length: int,
1336
+ dtype: torch.dtype,
1337
+ cache_position: torch.Tensor,
1338
+ batch_size: int,
1339
+ config: Ernie4_5_MoeConfig,
1340
+ past_key_values: Cache,
1341
+ ):
1342
+ """
1343
+ Creates a causal 4D mask of shape `(batch_size, 1, query_length, key_value_length)` from a 2D mask of shape
1344
+ `(batch_size, key_value_length)`, or if the input `attention_mask` is already 4D, do nothing.
1345
+
1346
+ Args:
1347
+ attention_mask (`torch.Tensor`):
1348
+ A 2D attention mask of shape `(batch_size, key_value_length)` or a 4D attention mask of shape `(batch_size, 1, query_length, key_value_length)`.
1349
+ sequence_length (`int`):
1350
+ The sequence length being processed.
1351
+ target_length (`int`):
1352
+ The target length: when generating with static cache, the mask should be as long as the static cache, to account for the 0 padding, the part of the cache that is not filled yet.
1353
+ dtype (`torch.dtype`):
1354
+ The dtype to use for the 4D attention mask.
1355
+ cache_position (`torch.Tensor`):
1356
+ Indices depicting the position of the input sequence tokens in the sequence.
1357
+ batch_size (`torch.Tensor`):
1358
+ Batch size.
1359
+ config (`Ernie4_5_MoeConfig`):
1360
+ The model's configuration class
1361
+ past_key_values (`Cache`):
1362
+ The cache class that is being used currently to generate
1363
+ """
1364
+ if attention_mask is not None and attention_mask.dim() == 4:
1365
+ # In this case we assume that the mask comes already in inverted form and requires no inversion or slicing.
1366
+ causal_mask = attention_mask
1367
+ else:
1368
+ min_dtype = torch.finfo(dtype).min
1369
+ causal_mask = torch.full(
1370
+ (sequence_length, target_length), fill_value=min_dtype, dtype=dtype, device=cache_position.device
1371
+ )
1372
+ diagonal_attend_mask = torch.arange(target_length, device=cache_position.device) > cache_position.reshape(
1373
+ -1, 1
1374
+ )
1376
+ causal_mask *= diagonal_attend_mask
1377
+ causal_mask = causal_mask[None, None, :, :].expand(batch_size, 1, -1, -1)
1378
+ if attention_mask is not None:
1379
+ causal_mask = causal_mask.clone() # copy to contiguous memory for in-place edit
1380
+ if attention_mask.shape[-1] > target_length:
1381
+ attention_mask = attention_mask[:, :target_length]
1382
+ mask_length = attention_mask.shape[-1]
1383
+ padding_mask = causal_mask[:, :, :, :mask_length] + attention_mask[:, None, None, :].to(
1384
+ causal_mask.device
1385
+ )
1386
+ padding_mask = padding_mask == 0
1387
+ causal_mask[:, :, :, :mask_length] = causal_mask[:, :, :, :mask_length].masked_fill(
1388
+ padding_mask, min_dtype
1389
+ )
1390
+ return causal_mask
1391
+
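As a quick illustration of what `_prepare_4d_causal_attention_mask_with_cache_position` produces in the no-padding case, here is a hand-rolled version for a tiny decode step (the mask is additive: 0 where a position is visible, the dtype minimum where it is masked):

```python
import torch

seq_len, target_len = 3, 5                      # 2 tokens already sit in the KV cache
dtype = torch.float32
min_dtype = torch.finfo(dtype).min
cache_position = torch.arange(2, 2 + seq_len)   # absolute positions of the new tokens

mask = torch.full((seq_len, target_len), min_dtype, dtype=dtype)
mask *= torch.arange(target_len) > cache_position.reshape(-1, 1)
mask = mask[None, None, :, :]                   # [batch=1, heads=1, seq_len, target_len]
print((mask == 0).int().squeeze())
# tensor([[1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 0],
#         [1, 1, 1, 1, 1]])  -> each query attends to the cache and to itself/earlier
```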
1392
+ @auto_docstring
1393
+ class Ernie4_5_MoeForCausalLM(Ernie4_5_PretrainedModel, GenerationMixin):
1394
+ """ERNIE Mixture of Experts (MoE) model for causal language modeling."""
1395
+
1396
+ _tied_weights_keys = ["lm_head.weight"]
1397
+ _tp_plan = {"lm_head": "colwise_rep"}
1398
+ _pp_plan = {"lm_head": (["hidden_states"], ["logits"])}
1399
+
1400
+ def __init__(self, config):
1401
+ """
1402
+ Initializes the ERNIE MoE model for causal language modeling.
1403
+
1404
+ Args:
1405
+ config (dict): Model configuration.
1406
+ """
1407
+ super().__init__(config)
1408
+ self.config = config
1409
+ self.model = Ernie4_5_Model(config)
1410
+ self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=config.weight_share_add_bias and config.use_bias)
1411
+ self._loss_function = ErniePretrainingCriterion(config)
1412
+
1413
+ # Initialize weights and apply final processing
1414
+ self.post_init()
1415
+
1416
+ def get_input_embeddings(self):
1417
+ """Returns the input embeddings layer."""
1418
+ return self.model.embed_tokens
1419
+
1420
+ def set_input_embeddings(self, value):
1421
+ """Sets the input embeddings layer."""
1422
+ self.model.embed_tokens = value
1423
+
1424
+ def get_output_embeddings(self):
1425
+ """Returns the output embeddings (LM head)."""
1426
+ return self.lm_head
1427
+
1428
+ def set_output_embeddings(self, new_embeddings):
1429
+ """Sets the output embeddings layer."""
1430
+ self.lm_head = new_embeddings
1431
+
1432
+ def set_decoder(self, decoder):
1433
+ """Sets the ERNIE decoder model."""
1434
+ self.model = decoder
1435
+
1436
+ def get_decoder(self):
1437
+ """Get the transformer decoder."""
1438
+ return self.model
1439
+
1440
+ @can_return_tuple
1441
+ def forward(
1442
+ self,
1443
+ input_ids,
1444
+ attention_mask=None,
1445
+ position_ids=None,
1446
+ past_key_values: Optional[list[torch.FloatTensor]] = None,
1447
+ inputs_embeds=None,
1448
+ labels=None,
1449
+ loss_mask=None,
1450
+ use_cache=False,
1451
+ output_attentions: Optional[bool] = None,
1452
+ output_hidden_states: Optional[bool] = None,
1453
+ **kwargs: Unpack[KwargsForCausalLM],
1454
+ ):
1455
+ """
1456
+ Forward pass for causal language modeling.
1457
+ """
1458
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
1459
+ output_hidden_states = (
1460
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
1461
+ )
1462
+
1463
+ outputs = self.model(
1464
+ input_ids,
1465
+ position_ids=position_ids,
1466
+ attention_mask=attention_mask,
1467
+ inputs_embeds=inputs_embeds,
1468
+ use_cache=use_cache,
1469
+ past_key_values=past_key_values,
1470
+ output_attentions=output_attentions,
1471
+ output_hidden_states=output_hidden_states,
1472
+ **kwargs,
1473
+ )
1474
+
1475
+ hidden_states = outputs.last_hidden_state
1476
+ mtp_outputs = outputs.mtp_outputs
1477
+
1478
+ logits = self.lm_head(hidden_states)
1479
+ mtp_logits = []
1480
+ if len(mtp_outputs) > 0:
1481
+ mtp_logits = [self.lm_head(_hidden_states) for _hidden_states in mtp_outputs]
1482
+ loss, router_loss = None, None
1483
+ if getattr(self.config, "use_moe", False):
1484
+ router_loss = outputs.router_loss
1485
+
1486
+ if labels is not None:
1487
+ loss, _ = self.loss_function(logits, labels, loss_mask, router_loss, mtp_logits)
1488
+
1489
+ return Ernie4_5_MoeCausalLMOutputWithPast(
1490
+ loss=loss,
1491
+ logits=logits,
1492
+ past_key_values=outputs.past_key_values,
1493
+ hidden_states=outputs.hidden_states,
1494
+ attentions=outputs.attentions,
1495
+ router_loss=router_loss,
1496
+ )
1497
+
1498
+
1499
+
1500
+ __all__ = [
1501
+ "Ernie4_5_Model",
1502
+ "Ernie4_5_MoeForCausalLM",
1503
+ "Ernie4_5_PretrainedModel"
1504
+ ]
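With the modeling and tokenizer files in place, the checkpoint can presumably be loaded through the standard transformers auto classes with remote code enabled. This is a sketch against the original baidu checkpoint, assuming network access and enough memory:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "baidu/ERNIE-4.5-21B-A3B-PT"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

inputs = tokenizer("Hello, ERNIE!", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```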
special_tokens_map.json ADDED
@@ -0,0 +1,1020 @@
1
+ {
2
+ "additional_special_tokens": [
3
+ "<|IMAGE_PLACEHOLDER|>",
4
+ "<|AUDIO_PLACEHOLDER|>",
5
+ "<|LOC_0|>",
6
+ "<|LOC_1|>",
7
+ "<|LOC_2|>",
8
+ "<|LOC_3|>",
9
+ "<|LOC_4|>",
10
+ "<|LOC_5|>",
11
+ "<|LOC_6|>",
12
+ "<|LOC_7|>",
13
+ "<|LOC_8|>",
14
+ "<|LOC_9|>",
15
+ "<|LOC_10|>",
16
+ "<|LOC_11|>",
17
+ "<|LOC_12|>",
18
+ "<|LOC_13|>",
19
+ "<|LOC_14|>",
20
+ "<|LOC_15|>",
21
+ "<|LOC_16|>",
22
+ "<|LOC_17|>",
23
+ "<|LOC_18|>",
24
+ "<|LOC_19|>",
25
+ "<|LOC_20|>",
26
+ "<|LOC_21|>",
27
+ "<|LOC_22|>",
28
+ "<|LOC_23|>",
29
+ "<|LOC_24|>",
30
+ "<|LOC_25|>",
31
+ "<|LOC_26|>",
32
+ "<|LOC_27|>",
33
+ "<|LOC_28|>",
34
+ "<|LOC_29|>",
35
+ "<|LOC_30|>",
36
+ "<|LOC_31|>",
37
+ "<|LOC_32|>",
38
+ "<|LOC_33|>",
39
+ "<|LOC_34|>",
40
+ "<|LOC_35|>",
41
+ "<|LOC_36|>",
42
+ "<|LOC_37|>",
43
+ "<|LOC_38|>",
44
+ "<|LOC_39|>",
45
+ "<|LOC_40|>",
46
+ "<|LOC_41|>",
47
+ "<|LOC_42|>",
48
+ "<|LOC_43|>",
49
+ "<|LOC_44|>",
50
+ "<|LOC_45|>",
51
+ "<|LOC_46|>",
52
+ "<|LOC_47|>",
53
+ "<|LOC_48|>",
54
+ "<|LOC_49|>",
55
+ "<|LOC_50|>",
56
+ "<|LOC_51|>",
57
+ "<|LOC_52|>",
58
+ "<|LOC_53|>",
59
+ "<|LOC_54|>",
60
+ "<|LOC_55|>",
61
+ "<|LOC_56|>",
62
+ "<|LOC_57|>",
63
+ "<|LOC_58|>",
64
+ "<|LOC_59|>",
65
+ "<|LOC_60|>",
66
+ "<|LOC_61|>",
67
+ "<|LOC_62|>",
68
+ "<|LOC_63|>",
69
+ "<|LOC_64|>",
70
+ "<|LOC_65|>",
71
+ "<|LOC_66|>",
72
+ "<|LOC_67|>",
73
+ "<|LOC_68|>",
74
+ "<|LOC_69|>",
75
+ "<|LOC_70|>",
76
+ "<|LOC_71|>",
77
+ "<|LOC_72|>",
78
+ "<|LOC_73|>",
79
+ "<|LOC_74|>",
80
+ "<|LOC_75|>",
81
+ "<|LOC_76|>",
82
+ "<|LOC_77|>",
83
+ "<|LOC_78|>",
84
+ "<|LOC_79|>",
85
+ "<|LOC_80|>",
86
+ "<|LOC_81|>",
87
+ "<|LOC_82|>",
88
+ "<|LOC_83|>",
89
+ "<|LOC_84|>",
90
+ "<|LOC_85|>",
91
+ "<|LOC_86|>",
92
+ "<|LOC_87|>",
93
+ "<|LOC_88|>",
94
+ "<|LOC_89|>",
95
+ "<|LOC_90|>",
96
+ "<|LOC_91|>",
97
+ "<|LOC_92|>",
98
+ "<|LOC_93|>",
99
+ "<|LOC_94|>",
100
+ "<|LOC_95|>",
101
+ "<|LOC_96|>",
102
+ "<|LOC_97|>",
103
+ "<|LOC_98|>",
104
+ "<|LOC_99|>",
105
+ "<|LOC_100|>",
106
+ "<|LOC_101|>",
107
+ "<|LOC_102|>",
108
+ "<|LOC_103|>",
109
+ "<|LOC_104|>",
110
+ "<|LOC_105|>",
111
+ "<|LOC_106|>",
112
+ "<|LOC_107|>",
113
+ "<|LOC_108|>",
114
+ "<|LOC_109|>",
115
+ "<|LOC_110|>",
116
+ "<|LOC_111|>",
117
+ "<|LOC_112|>",
118
+ "<|LOC_113|>",
119
+ "<|LOC_114|>",
120
+ "<|LOC_115|>",
121
+ "<|LOC_116|>",
122
+ "<|LOC_117|>",
123
+ "<|LOC_118|>",
124
+ "<|LOC_119|>",
125
+ "<|LOC_120|>",
126
+ "<|LOC_121|>",
127
+ "<|LOC_122|>",
128
+ "<|LOC_123|>",
129
+ "<|LOC_124|>",
130
+ "<|LOC_125|>",
131
+ "<|LOC_126|>",
132
+ "<|LOC_127|>",
133
+ "<|LOC_128|>",
134
+ "<|LOC_129|>",
135
+ "<|LOC_130|>",
136
+ "<|LOC_131|>",
137
+ "<|LOC_132|>",
138
+ "<|LOC_133|>",
139
+ "<|LOC_134|>",
140
+ "<|LOC_135|>",
141
+ "<|LOC_136|>",
142
+ "<|LOC_137|>",
143
+ "<|LOC_138|>",
144
+ "<|LOC_139|>",
145
+ "<|LOC_140|>",
146
+ "<|LOC_141|>",
147
+ "<|LOC_142|>",
148
+ "<|LOC_143|>",
149
+ "<|LOC_144|>",
150
+ "<|LOC_145|>",
151
+ "<|LOC_146|>",
152
+ "<|LOC_147|>",
153
+ "<|LOC_148|>",
154
+ "<|LOC_149|>",
155
+ "<|LOC_150|>",
156
+ "<|LOC_151|>",
157
+ "<|LOC_152|>",
158
+ "<|LOC_153|>",
159
+ "<|LOC_154|>",
160
+ "<|LOC_155|>",
161
+ "<|LOC_156|>",
162
+ "<|LOC_157|>",
163
+ "<|LOC_158|>",
164
+ "<|LOC_159|>",
165
+ "<|LOC_160|>",
166
+ "<|LOC_161|>",
167
+ "<|LOC_162|>",
168
+ "<|LOC_163|>",
169
+ "<|LOC_164|>",
170
+ "<|LOC_165|>",
171
+ "<|LOC_166|>",
172
+ "<|LOC_167|>",
173
+ "<|LOC_168|>",
174
+ "<|LOC_169|>",
175
+ "<|LOC_170|>",
176
+ "<|LOC_171|>",
177
+ "<|LOC_172|>",
178
+ "<|LOC_173|>",
179
+ "<|LOC_174|>",
180
+ "<|LOC_175|>",
181
+ "<|LOC_176|>",
182
+ "<|LOC_177|>",
183
+ "<|LOC_178|>",
184
+ "<|LOC_179|>",
185
+ "<|LOC_180|>",
186
+ "<|LOC_181|>",
187
+ "<|LOC_182|>",
188
+ "<|LOC_183|>",
189
+ "<|LOC_184|>",
190
+ "<|LOC_185|>",
191
+ "<|LOC_186|>",
192
+ "<|LOC_187|>",
193
+ "<|LOC_188|>",
194
+ "<|LOC_189|>",
195
+ "<|LOC_190|>",
196
+ "<|LOC_191|>",
197
+ "<|LOC_192|>",
198
+ "<|LOC_193|>",
199
+ "<|LOC_194|>",
200
+ "<|LOC_195|>",
201
+ "<|LOC_196|>",
202
+ "<|LOC_197|>",
203
+ "<|LOC_198|>",
204
+ "<|LOC_199|>",
205
+ "<|LOC_200|>",
206
+ "<|LOC_201|>",
207
+ "<|LOC_202|>",
208
+ "<|LOC_203|>",
209
+ "<|LOC_204|>",
210
+ "<|LOC_205|>",
211
+ "<|LOC_206|>",
212
+ "<|LOC_207|>",
213
+ "<|LOC_208|>",
214
+ "<|LOC_209|>",
215
+ "<|LOC_210|>",
216
+ "<|LOC_211|>",
217
+ "<|LOC_212|>",
218
+ "<|LOC_213|>",
219
+ "<|LOC_214|>",
220
+ "<|LOC_215|>",
221
+ "<|LOC_216|>",
222
+ "<|LOC_217|>",
223
+ "<|LOC_218|>",
224
+ "<|LOC_219|>",
225
+ "<|LOC_220|>",
226
+ "<|LOC_221|>",
227
+ "<|LOC_222|>",
228
+ "<|LOC_223|>",
229
+ "<|LOC_224|>",
230
+ "<|LOC_225|>",
231
+ "<|LOC_226|>",
232
+ "<|LOC_227|>",
233
+ "<|LOC_228|>",
234
+ "<|LOC_229|>",
235
+ "<|LOC_230|>",
236
+ "<|LOC_231|>",
237
+ "<|LOC_232|>",
238
+ "<|LOC_233|>",
239
+ "<|LOC_234|>",
240
+ "<|LOC_235|>",
241
+ "<|LOC_236|>",
242
+ "<|LOC_237|>",
243
+ "<|LOC_238|>",
244
+ "<|LOC_239|>",
245
+ "<|LOC_240|>",
246
+ "<|LOC_241|>",
247
+ "<|LOC_242|>",
248
+ "<|LOC_243|>",
249
+ "<|LOC_244|>",
250
+ "<|LOC_245|>",
251
+ "<|LOC_246|>",
252
+ "<|LOC_247|>",
253
+ "<|LOC_248|>",
254
+ "<|LOC_249|>",
255
+ "<|LOC_250|>",
256
+ "<|LOC_251|>",
257
+ "<|LOC_252|>",
258
+ "<|LOC_253|>",
259
+ "<|LOC_254|>",
260
+ "<|LOC_255|>",
261
+ "<|LOC_256|>",
262
+ "<|LOC_257|>",
263
+ "<|LOC_258|>",
264
+ "<|LOC_259|>",
265
+ "<|LOC_260|>",
266
+ "<|LOC_261|>",
267
+ "<|LOC_262|>",
268
+ "<|LOC_263|>",
269
+ "<|LOC_264|>",
270
+ "<|LOC_265|>",
271
+ "<|LOC_266|>",
272
+ "<|LOC_267|>",
273
+ "<|LOC_268|>",
274
+ "<|LOC_269|>",
275
+ "<|LOC_270|>",
276
+ "<|LOC_271|>",
277
+ "<|LOC_272|>",
278
+ "<|LOC_273|>",
279
+ "<|LOC_274|>",
280
+ "<|LOC_275|>",
281
+ "<|LOC_276|>",
282
+ "<|LOC_277|>",
283
+ "<|LOC_278|>",
284
+ "<|LOC_279|>",
285
+ "<|LOC_280|>",
286
+ "<|LOC_281|>",
287
+ "<|LOC_282|>",
288
+ "<|LOC_283|>",
289
+ "<|LOC_284|>",
290
+ "<|LOC_285|>",
291
+ "<|LOC_286|>",
292
+ "<|LOC_287|>",
293
+ "<|LOC_288|>",
294
+ "<|LOC_289|>",
295
+ "<|LOC_290|>",
296
+ "<|LOC_291|>",
297
+ "<|LOC_292|>",
298
+ "<|LOC_293|>",
299
+ "<|LOC_294|>",
300
+ "<|LOC_295|>",
301
+ "<|LOC_296|>",
302
+ "<|LOC_297|>",
303
+ "<|LOC_298|>",
304
+ "<|LOC_299|>",
305
+ "<|LOC_300|>",
306
+ "<|LOC_301|>",
307
+ "<|LOC_302|>",
308
+ "<|LOC_303|>",
309
+ "<|LOC_304|>",
310
+ "<|LOC_305|>",
311
+ "<|LOC_306|>",
312
+ "<|LOC_307|>",
313
+ "<|LOC_308|>",
314
+ "<|LOC_309|>",
315
+ "<|LOC_310|>",
316
+ "<|LOC_311|>",
317
+ "<|LOC_312|>",
318
+ "<|LOC_313|>",
319
+ "<|LOC_314|>",
320
+ "<|LOC_315|>",
321
+ "<|LOC_316|>",
322
+ "<|LOC_317|>",
323
+ "<|LOC_318|>",
324
+ "<|LOC_319|>",
325
+ "<|LOC_320|>",
326
+ "<|LOC_321|>",
327
+ "<|LOC_322|>",
328
+ "<|LOC_323|>",
329
+ "<|LOC_324|>",
330
+ "<|LOC_325|>",
331
+ "<|LOC_326|>",
332
+ "<|LOC_327|>",
333
+ "<|LOC_328|>",
334
+ "<|LOC_329|>",
335
+ "<|LOC_330|>",
336
+ "<|LOC_331|>",
337
+ "<|LOC_332|>",
338
+ "<|LOC_333|>",
339
+ "<|LOC_334|>",
340
+ "<|LOC_335|>",
341
+ "<|LOC_336|>",
342
+ "<|LOC_337|>",
343
+ "<|LOC_338|>",
344
+ "<|LOC_339|>",
345
+ "<|LOC_340|>",
346
+ "<|LOC_341|>",
347
+ "<|LOC_342|>",
348
+ "<|LOC_343|>",
349
+ "<|LOC_344|>",
350
+ "<|LOC_345|>",
351
+ "<|LOC_346|>",
352
+ "<|LOC_347|>",
353
+ "<|LOC_348|>",
354
+ "<|LOC_349|>",
355
+ "<|LOC_350|>",
356
+ "<|LOC_351|>",
357
+ "<|LOC_352|>",
358
+ "<|LOC_353|>",
359
+ "<|LOC_354|>",
360
+ "<|LOC_355|>",
361
+ "<|LOC_356|>",
362
+ "<|LOC_357|>",
363
+ "<|LOC_358|>",
364
+ "<|LOC_359|>",
365
+ "<|LOC_360|>",
366
+ "<|LOC_361|>",
367
+ "<|LOC_362|>",
368
+ "<|LOC_363|>",
369
+ "<|LOC_364|>",
370
+ "<|LOC_365|>",
371
+ "<|LOC_366|>",
372
+ "<|LOC_367|>",
373
+ "<|LOC_368|>",
374
+ "<|LOC_369|>",
375
+ "<|LOC_370|>",
376
+ "<|LOC_371|>",
377
+ "<|LOC_372|>",
378
+ "<|LOC_373|>",
379
+ "<|LOC_374|>",
380
+ "<|LOC_375|>",
381
+ "<|LOC_376|>",
382
+ "<|LOC_377|>",
383
+ "<|LOC_378|>",
384
+ "<|LOC_379|>",
385
+ "<|LOC_380|>",
386
+ "<|LOC_381|>",
387
+ "<|LOC_382|>",
388
+ "<|LOC_383|>",
389
+ "<|LOC_384|>",
390
+ "<|LOC_385|>",
391
+ "<|LOC_386|>",
392
+ "<|LOC_387|>",
393
+ "<|LOC_388|>",
394
+ "<|LOC_389|>",
395
+ "<|LOC_390|>",
396
+ "<|LOC_391|>",
397
+ "<|LOC_392|>",
398
+ "<|LOC_393|>",
399
+ "<|LOC_394|>",
400
+ "<|LOC_395|>",
401
+ "<|LOC_396|>",
402
+ "<|LOC_397|>",
403
+ "<|LOC_398|>",
404
+ "<|LOC_399|>",
405
+ "<|LOC_400|>",
406
+ "<|LOC_401|>",
407
+ "<|LOC_402|>",
408
+ "<|LOC_403|>",
409
+ "<|LOC_404|>",
410
+ "<|LOC_405|>",
411
+ "<|LOC_406|>",
412
+ "<|LOC_407|>",
413
+ "<|LOC_408|>",
414
+ "<|LOC_409|>",
415
+ "<|LOC_410|>",
416
+ "<|LOC_411|>",
417
+ "<|LOC_412|>",
418
+ "<|LOC_413|>",
419
+ "<|LOC_414|>",
420
+ "<|LOC_415|>",
421
+ "<|LOC_416|>",
422
+ "<|LOC_417|>",
423
+ "<|LOC_418|>",
424
+ "<|LOC_419|>",
425
+ "<|LOC_420|>",
426
+ "<|LOC_421|>",
427
+ "<|LOC_422|>",
428
+ "<|LOC_423|>",
429
+ "<|LOC_424|>",
430
+ "<|LOC_425|>",
431
+ "<|LOC_426|>",
432
+ "<|LOC_427|>",
433
+ "<|LOC_428|>",
434
+ "<|LOC_429|>",
435
+ "<|LOC_430|>",
436
+ "<|LOC_431|>",
437
+ "<|LOC_432|>",
438
+ "<|LOC_433|>",
439
+ "<|LOC_434|>",
440
+ "<|LOC_435|>",
441
+ "<|LOC_436|>",
442
+ "<|LOC_437|>",
443
+ "<|LOC_438|>",
444
+ "<|LOC_439|>",
445
+ "<|LOC_440|>",
446
+ "<|LOC_441|>",
447
+ "<|LOC_442|>",
448
+ "<|LOC_443|>",
449
+ "<|LOC_444|>",
450
+ "<|LOC_445|>",
451
+ "<|LOC_446|>",
452
+ "<|LOC_447|>",
453
+ "<|LOC_448|>",
454
+ "<|LOC_449|>",
455
+ "<|LOC_450|>",
456
+ "<|LOC_451|>",
457
+ "<|LOC_452|>",
458
+ "<|LOC_453|>",
459
+ "<|LOC_454|>",
460
+ "<|LOC_455|>",
461
+ "<|LOC_456|>",
462
+ "<|LOC_457|>",
463
+ "<|LOC_458|>",
464
+ "<|LOC_459|>",
465
+ "<|LOC_460|>",
466
+ "<|LOC_461|>",
467
+ "<|LOC_462|>",
468
+ "<|LOC_463|>",
469
+ "<|LOC_464|>",
470
+ "<|LOC_465|>",
471
+ "<|LOC_466|>",
472
+ "<|LOC_467|>",
473
+ "<|LOC_468|>",
474
+ "<|LOC_469|>",
475
+ "<|LOC_470|>",
476
+ "<|LOC_471|>",
477
+ "<|LOC_472|>",
478
+ "<|LOC_473|>",
479
+ "<|LOC_474|>",
480
+ "<|LOC_475|>",
481
+ "<|LOC_476|>",
482
+ "<|LOC_477|>",
483
+ "<|LOC_478|>",
484
+ "<|LOC_479|>",
485
+ "<|LOC_480|>",
486
+ "<|LOC_481|>",
487
+ "<|LOC_482|>",
488
+ "<|LOC_483|>",
489
+ "<|LOC_484|>",
490
+ "<|LOC_485|>",
491
+ "<|LOC_486|>",
492
+ "<|LOC_487|>",
493
+ "<|LOC_488|>",
494
+ "<|LOC_489|>",
495
+ "<|LOC_490|>",
496
+ "<|LOC_491|>",
497
+ "<|LOC_492|>",
498
+ "<|LOC_493|>",
499
+ "<|LOC_494|>",
500
+ "<|LOC_495|>",
501
+ "<|LOC_496|>",
502
+ "<|LOC_497|>",
503
+ "<|LOC_498|>",
504
+ "<|LOC_499|>",
505
+ "<|LOC_500|>",
506
+ "<|LOC_501|>",
507
+ "<|LOC_502|>",
508
+ "<|LOC_503|>",
509
+ "<|LOC_504|>",
510
+ "<|LOC_505|>",
511
+ "<|LOC_506|>",
512
+ "<|LOC_507|>",
513
+ "<|LOC_508|>",
514
+ "<|LOC_509|>",
515
+ "<|LOC_510|>",
516
+ "<|LOC_511|>",
517
+ "<|LOC_512|>",
518
+ "<|LOC_513|>",
519
+ "<|LOC_514|>",
520
+ "<|LOC_515|>",
521
+ "<|LOC_516|>",
522
+ "<|LOC_517|>",
523
+ "<|LOC_518|>",
524
+ "<|LOC_519|>",
525
+ "<|LOC_520|>",
526
+ "<|LOC_521|>",
527
+ "<|LOC_522|>",
528
+ "<|LOC_523|>",
529
+ "<|LOC_524|>",
530
+ "<|LOC_525|>",
531
+ "<|LOC_526|>",
532
+ "<|LOC_527|>",
533
+ "<|LOC_528|>",
534
+ "<|LOC_529|>",
535
+ "<|LOC_530|>",
536
+ "<|LOC_531|>",
537
+ "<|LOC_532|>",
538
+ "<|LOC_533|>",
539
+ "<|LOC_534|>",
540
+ "<|LOC_535|>",
541
+ "<|LOC_536|>",
542
+ "<|LOC_537|>",
543
+ "<|LOC_538|>",
544
+ "<|LOC_539|>",
545
+ "<|LOC_540|>",
546
+ "<|LOC_541|>",
547
+ "<|LOC_542|>",
548
+ "<|LOC_543|>",
549
+ "<|LOC_544|>",
550
+ "<|LOC_545|>",
551
+ "<|LOC_546|>",
552
+ "<|LOC_547|>",
553
+ "<|LOC_548|>",
554
+ "<|LOC_549|>",
555
+ "<|LOC_550|>",
556
+ "<|LOC_551|>",
557
+ "<|LOC_552|>",
558
+ "<|LOC_553|>",
559
+ "<|LOC_554|>",
560
+ "<|LOC_555|>",
561
+ "<|LOC_556|>",
562
+ "<|LOC_557|>",
563
+ "<|LOC_558|>",
564
+ "<|LOC_559|>",
565
+ "<|LOC_560|>",
566
+ "<|LOC_561|>",
567
+ "<|LOC_562|>",
568
+ "<|LOC_563|>",
569
+ "<|LOC_564|>",
570
+ "<|LOC_565|>",
571
+ "<|LOC_566|>",
572
+ "<|LOC_567|>",
573
+ "<|LOC_568|>",
574
+ "<|LOC_569|>",
575
+ "<|LOC_570|>",
576
+ "<|LOC_571|>",
577
+ "<|LOC_572|>",
578
+ "<|LOC_573|>",
579
+ "<|LOC_574|>",
580
+ "<|LOC_575|>",
581
+ "<|LOC_576|>",
582
+ "<|LOC_577|>",
583
+ "<|LOC_578|>",
584
+ "<|LOC_579|>",
585
+ "<|LOC_580|>",
586
+ "<|LOC_581|>",
587
+ "<|LOC_582|>",
588
+ "<|LOC_583|>",
589
+ "<|LOC_584|>",
590
+ "<|LOC_585|>",
591
+ "<|LOC_586|>",
592
+ "<|LOC_587|>",
593
+ "<|LOC_588|>",
594
+ "<|LOC_589|>",
595
+ "<|LOC_590|>",
596
+ "<|LOC_591|>",
597
+ "<|LOC_592|>",
598
+ "<|LOC_593|>",
599
+ "<|LOC_594|>",
600
+ "<|LOC_595|>",
601
+ "<|LOC_596|>",
602
+ "<|LOC_597|>",
603
+ "<|LOC_598|>",
604
+ "<|LOC_599|>",
605
+ "<|LOC_600|>",
606
+ "<|LOC_601|>",
607
+ "<|LOC_602|>",
608
+ "<|LOC_603|>",
609
+ "<|LOC_604|>",
610
+ "<|LOC_605|>",
611
+ "<|LOC_606|>",
612
+ "<|LOC_607|>",
613
+ "<|LOC_608|>",
614
+ "<|LOC_609|>",
615
+ "<|LOC_610|>",
616
+ "<|LOC_611|>",
617
+ "<|LOC_612|>",
618
+ "<|LOC_613|>",
619
+ "<|LOC_614|>",
620
+ "<|LOC_615|>",
621
+ "<|LOC_616|>",
622
+ "<|LOC_617|>",
623
+ "<|LOC_618|>",
624
+ "<|LOC_619|>",
625
+ "<|LOC_620|>",
626
+ "<|LOC_621|>",
627
+ "<|LOC_622|>",
628
+ "<|LOC_623|>",
629
+ "<|LOC_624|>",
630
+ "<|LOC_625|>",
631
+ "<|LOC_626|>",
632
+ "<|LOC_627|>",
633
+ "<|LOC_628|>",
634
+ "<|LOC_629|>",
635
+ "<|LOC_630|>",
636
+ "<|LOC_631|>",
637
+ "<|LOC_632|>",
638
+ "<|LOC_633|>",
639
+ "<|LOC_634|>",
640
+ "<|LOC_635|>",
641
+ "<|LOC_636|>",
642
+ "<|LOC_637|>",
643
+ "<|LOC_638|>",
644
+ "<|LOC_639|>",
645
+ "<|LOC_640|>",
646
+ "<|LOC_641|>",
647
+ "<|LOC_642|>",
648
+ "<|LOC_643|>",
649
+ "<|LOC_644|>",
650
+ "<|LOC_645|>",
651
+ "<|LOC_646|>",
652
+ "<|LOC_647|>",
653
+ "<|LOC_648|>",
654
+ "<|LOC_649|>",
655
+ "<|LOC_650|>",
656
+ "<|LOC_651|>",
657
+ "<|LOC_652|>",
658
+ "<|LOC_653|>",
659
+ "<|LOC_654|>",
660
+ "<|LOC_655|>",
661
+ "<|LOC_656|>",
662
+ "<|LOC_657|>",
663
+ "<|LOC_658|>",
664
+ "<|LOC_659|>",
665
+ "<|LOC_660|>",
666
+ "<|LOC_661|>",
667
+ "<|LOC_662|>",
668
+ "<|LOC_663|>",
669
+ "<|LOC_664|>",
670
+ "<|LOC_665|>",
671
+ "<|LOC_666|>",
672
+ "<|LOC_667|>",
673
+ "<|LOC_668|>",
674
+ "<|LOC_669|>",
675
+ "<|LOC_670|>",
676
+ "<|LOC_671|>",
677
+ "<|LOC_672|>",
678
+ "<|LOC_673|>",
679
+ "<|LOC_674|>",
680
+ "<|LOC_675|>",
681
+ "<|LOC_676|>",
682
+ "<|LOC_677|>",
683
+ "<|LOC_678|>",
684
+ "<|LOC_679|>",
685
+ "<|LOC_680|>",
686
+ "<|LOC_681|>",
687
+ "<|LOC_682|>",
688
+ "<|LOC_683|>",
689
+ "<|LOC_684|>",
690
+ "<|LOC_685|>",
691
+ "<|LOC_686|>",
692
+ "<|LOC_687|>",
693
+ "<|LOC_688|>",
694
+ "<|LOC_689|>",
695
+ "<|LOC_690|>",
696
+ "<|LOC_691|>",
697
+ "<|LOC_692|>",
698
+ "<|LOC_693|>",
699
+ "<|LOC_694|>",
700
+ "<|LOC_695|>",
701
+ "<|LOC_696|>",
702
+ "<|LOC_697|>",
703
+ "<|LOC_698|>",
704
+ "<|LOC_699|>",
705
+ "<|LOC_700|>",
706
+ "<|LOC_701|>",
707
+ "<|LOC_702|>",
708
+ "<|LOC_703|>",
709
+ "<|LOC_704|>",
710
+ "<|LOC_705|>",
711
+ "<|LOC_706|>",
712
+ "<|LOC_707|>",
713
+ "<|LOC_708|>",
714
+ "<|LOC_709|>",
715
+ "<|LOC_710|>",
716
+ "<|LOC_711|>",
717
+ "<|LOC_712|>",
718
+ "<|LOC_713|>",
719
+ "<|LOC_714|>",
720
+ "<|LOC_715|>",
721
+ "<|LOC_716|>",
722
+ "<|LOC_717|>",
723
+ "<|LOC_718|>",
724
+ "<|LOC_719|>",
725
+ "<|LOC_720|>",
726
+ "<|LOC_721|>",
727
+ "<|LOC_722|>",
728
+ "<|LOC_723|>",
729
+ "<|LOC_724|>",
730
+ "<|LOC_725|>",
731
+ "<|LOC_726|>",
732
+ "<|LOC_727|>",
733
+ "<|LOC_728|>",
734
+ "<|LOC_729|>",
735
+ "<|LOC_730|>",
736
+ "<|LOC_731|>",
737
+ "<|LOC_732|>",
738
+ "<|LOC_733|>",
739
+ "<|LOC_734|>",
740
+ "<|LOC_735|>",
741
+ "<|LOC_736|>",
742
+ "<|LOC_737|>",
743
+ "<|LOC_738|>",
744
+ "<|LOC_739|>",
745
+ "<|LOC_740|>",
746
+ "<|LOC_741|>",
747
+ "<|LOC_742|>",
748
+ "<|LOC_743|>",
749
+ "<|LOC_744|>",
750
+ "<|LOC_745|>",
751
+ "<|LOC_746|>",
752
+ "<|LOC_747|>",
753
+ "<|LOC_748|>",
754
+ "<|LOC_749|>",
755
+ "<|LOC_750|>",
756
+ "<|LOC_751|>",
757
+ "<|LOC_752|>",
758
+ "<|LOC_753|>",
759
+ "<|LOC_754|>",
760
+ "<|LOC_755|>",
761
+ "<|LOC_756|>",
762
+ "<|LOC_757|>",
763
+ "<|LOC_758|>",
764
+ "<|LOC_759|>",
765
+ "<|LOC_760|>",
766
+ "<|LOC_761|>",
767
+ "<|LOC_762|>",
768
+ "<|LOC_763|>",
769
+ "<|LOC_764|>",
770
+ "<|LOC_765|>",
771
+ "<|LOC_766|>",
772
+ "<|LOC_767|>",
773
+ "<|LOC_768|>",
774
+ "<|LOC_769|>",
775
+ "<|LOC_770|>",
776
+ "<|LOC_771|>",
777
+ "<|LOC_772|>",
778
+ "<|LOC_773|>",
779
+ "<|LOC_774|>",
780
+ "<|LOC_775|>",
781
+ "<|LOC_776|>",
782
+ "<|LOC_777|>",
783
+ "<|LOC_778|>",
784
+ "<|LOC_779|>",
785
+ "<|LOC_780|>",
786
+ "<|LOC_781|>",
787
+ "<|LOC_782|>",
788
+ "<|LOC_783|>",
789
+ "<|LOC_784|>",
790
+ "<|LOC_785|>",
791
+ "<|LOC_786|>",
792
+ "<|LOC_787|>",
793
+ "<|LOC_788|>",
794
+ "<|LOC_789|>",
795
+ "<|LOC_790|>",
796
+ "<|LOC_791|>",
797
+ "<|LOC_792|>",
798
+ "<|LOC_793|>",
799
+ "<|LOC_794|>",
800
+ "<|LOC_795|>",
801
+ "<|LOC_796|>",
802
+ "<|LOC_797|>",
803
+ "<|LOC_798|>",
804
+ "<|LOC_799|>",
805
+ "<|LOC_800|>",
806
+ "<|LOC_801|>",
807
+ "<|LOC_802|>",
808
+ "<|LOC_803|>",
809
+ "<|LOC_804|>",
810
+ "<|LOC_805|>",
811
+ "<|LOC_806|>",
812
+ "<|LOC_807|>",
813
+ "<|LOC_808|>",
814
+ "<|LOC_809|>",
815
+ "<|LOC_810|>",
816
+ "<|LOC_811|>",
817
+ "<|LOC_812|>",
818
+ "<|LOC_813|>",
819
+ "<|LOC_814|>",
820
+ "<|LOC_815|>",
821
+ "<|LOC_816|>",
822
+ "<|LOC_817|>",
823
+ "<|LOC_818|>",
824
+ "<|LOC_819|>",
825
+ "<|LOC_820|>",
826
+ "<|LOC_821|>",
827
+ "<|LOC_822|>",
828
+ "<|LOC_823|>",
829
+ "<|LOC_824|>",
830
+ "<|LOC_825|>",
831
+ "<|LOC_826|>",
832
+ "<|LOC_827|>",
833
+ "<|LOC_828|>",
834
+ "<|LOC_829|>",
835
+ "<|LOC_830|>",
836
+ "<|LOC_831|>",
837
+ "<|LOC_832|>",
838
+ "<|LOC_833|>",
839
+ "<|LOC_834|>",
840
+ "<|LOC_835|>",
841
+ "<|LOC_836|>",
842
+ "<|LOC_837|>",
843
+ "<|LOC_838|>",
844
+ "<|LOC_839|>",
845
+ "<|LOC_840|>",
846
+ "<|LOC_841|>",
847
+ "<|LOC_842|>",
848
+ "<|LOC_843|>",
849
+ "<|LOC_844|>",
850
+ "<|LOC_845|>",
851
+ "<|LOC_846|>",
852
+ "<|LOC_847|>",
853
+ "<|LOC_848|>",
854
+ "<|LOC_849|>",
855
+ "<|LOC_850|>",
856
+ "<|LOC_851|>",
857
+ "<|LOC_852|>",
858
+ "<|LOC_853|>",
859
+ "<|LOC_854|>",
860
+ "<|LOC_855|>",
861
+ "<|LOC_856|>",
862
+ "<|LOC_857|>",
863
+ "<|LOC_858|>",
864
+ "<|LOC_859|>",
865
+ "<|LOC_860|>",
866
+ "<|LOC_861|>",
867
+ "<|LOC_862|>",
868
+ "<|LOC_863|>",
869
+ "<|LOC_864|>",
870
+ "<|LOC_865|>",
871
+ "<|LOC_866|>",
872
+ "<|LOC_867|>",
873
+ "<|LOC_868|>",
874
+ "<|LOC_869|>",
875
+ "<|LOC_870|>",
876
+ "<|LOC_871|>",
877
+ "<|LOC_872|>",
878
+ "<|LOC_873|>",
879
+ "<|LOC_874|>",
880
+ "<|LOC_875|>",
881
+ "<|LOC_876|>",
882
+ "<|LOC_877|>",
883
+ "<|LOC_878|>",
884
+ "<|LOC_879|>",
885
+ "<|LOC_880|>",
886
+ "<|LOC_881|>",
887
+ "<|LOC_882|>",
888
+ "<|LOC_883|>",
889
+ "<|LOC_884|>",
890
+ "<|LOC_885|>",
891
+ "<|LOC_886|>",
892
+ "<|LOC_887|>",
893
+ "<|LOC_888|>",
894
+ "<|LOC_889|>",
895
+ "<|LOC_890|>",
896
+ "<|LOC_891|>",
897
+ "<|LOC_892|>",
898
+ "<|LOC_893|>",
899
+ "<|LOC_894|>",
900
+ "<|LOC_895|>",
901
+ "<|LOC_896|>",
902
+ "<|LOC_897|>",
903
+ "<|LOC_898|>",
904
+ "<|LOC_899|>",
905
+ "<|LOC_900|>",
906
+ "<|LOC_901|>",
907
+ "<|LOC_902|>",
908
+ "<|LOC_903|>",
909
+ "<|LOC_904|>",
910
+ "<|LOC_905|>",
911
+ "<|LOC_906|>",
912
+ "<|LOC_907|>",
913
+ "<|LOC_908|>",
914
+ "<|LOC_909|>",
915
+ "<|LOC_910|>",
916
+ "<|LOC_911|>",
917
+ "<|LOC_912|>",
918
+ "<|LOC_913|>",
919
+ "<|LOC_914|>",
920
+ "<|LOC_915|>",
921
+ "<|LOC_916|>",
922
+ "<|LOC_917|>",
923
+ "<|LOC_918|>",
924
+ "<|LOC_919|>",
925
+ "<|LOC_920|>",
926
+ "<|LOC_921|>",
927
+ "<|LOC_922|>",
928
+ "<|LOC_923|>",
929
+ "<|LOC_924|>",
930
+ "<|LOC_925|>",
931
+ "<|LOC_926|>",
932
+ "<|LOC_927|>",
933
+ "<|LOC_928|>",
934
+ "<|LOC_929|>",
935
+ "<|LOC_930|>",
936
+ "<|LOC_931|>",
937
+ "<|LOC_932|>",
938
+ "<|LOC_933|>",
939
+ "<|LOC_934|>",
940
+ "<|LOC_935|>",
941
+ "<|LOC_936|>",
942
+ "<|LOC_937|>",
943
+ "<|LOC_938|>",
944
+ "<|LOC_939|>",
945
+ "<|LOC_940|>",
946
+ "<|LOC_941|>",
947
+ "<|LOC_942|>",
948
+ "<|LOC_943|>",
949
+ "<|LOC_944|>",
950
+ "<|LOC_945|>",
951
+ "<|LOC_946|>",
952
+ "<|LOC_947|>",
953
+ "<|LOC_948|>",
954
+ "<|LOC_949|>",
955
+ "<|LOC_950|>",
956
+ "<|LOC_951|>",
957
+ "<|LOC_952|>",
958
+ "<|LOC_953|>",
959
+ "<|LOC_954|>",
960
+ "<|LOC_955|>",
961
+ "<|LOC_956|>",
962
+ "<|LOC_957|>",
963
+ "<|LOC_958|>",
964
+ "<|LOC_959|>",
965
+ "<|LOC_960|>",
966
+ "<|LOC_961|>",
967
+ "<|LOC_962|>",
968
+ "<|LOC_963|>",
969
+ "<|LOC_964|>",
970
+ "<|LOC_965|>",
971
+ "<|LOC_966|>",
972
+ "<|LOC_967|>",
973
+ "<|LOC_968|>",
974
+ "<|LOC_969|>",
975
+ "<|LOC_970|>",
976
+ "<|LOC_971|>",
977
+ "<|LOC_972|>",
978
+ "<|LOC_973|>",
979
+ "<|LOC_974|>",
980
+ "<|LOC_975|>",
981
+ "<|LOC_976|>",
982
+ "<|LOC_977|>",
983
+ "<|LOC_978|>",
984
+ "<|LOC_979|>",
985
+ "<|LOC_980|>",
986
+ "<|LOC_981|>",
987
+ "<|LOC_982|>",
988
+ "<|LOC_983|>",
989
+ "<|LOC_984|>",
990
+ "<|LOC_985|>",
991
+ "<|LOC_986|>",
992
+ "<|LOC_987|>",
993
+ "<|LOC_988|>",
994
+ "<|LOC_989|>",
995
+ "<|LOC_990|>",
996
+ "<|LOC_991|>",
997
+ "<|LOC_992|>",
998
+ "<|LOC_993|>",
999
+ "<|LOC_994|>",
1000
+ "<|LOC_995|>",
1001
+ "<|LOC_996|>",
1002
+ "<|LOC_997|>",
1003
+ "<|LOC_998|>",
1004
+ "<|LOC_999|>",
1005
+ "<|LOC_1000|>",
1006
+ "<|LOC_BEGIN|>",
1007
+ "<|LOC_END|>",
1008
+ "<|LOC_SEP|>",
1009
+ "<|CROP_COL_SEP|>",
1010
+ "<|CROP_ROW_SEP|>",
1011
+ "<|IMAGE_SEP|>"
1012
+ ],
1013
+ "bos_token": "<s>",
1014
+ "cls_token": "<|begin_of_sentence|>",
1015
+ "eos_token": "</s>",
1016
+ "mask_token": "<mask:1>",
1017
+ "pad_token": "<unk>",
1018
+ "sep_token": "<|end_of_sentence|>",
1019
+ "unk_token": "<unk>"
1020
+ }
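The `<|LOC_0|>` through `<|LOC_1000|>` tokens look like a 1001-bin discretization of normalized grounding coordinates, a common pattern in vision-language models; that interpretation is an assumption, since the map itself does not document their semantics. A sketch of how such a scheme typically encodes a coordinate:

```python
# Hypothetical helper: the 0..1000 binning is an assumption, not documented in this repo.
def coord_to_loc_token(x: float, num_bins: int = 1000) -> str:
    """Map a coordinate in [0, 1] to one of the <|LOC_i|> special tokens."""
    i = max(0, min(num_bins, round(x * num_bins)))
    return f"<|LOC_{i}|>"

print(coord_to_loc_token(0.0))     # <|LOC_0|>
print(coord_to_loc_token(0.4273))  # <|LOC_427|>
print(coord_to_loc_token(1.0))     # <|LOC_1000|>
```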
tokenization_ernie4_5.py ADDED
@@ -0,0 +1,214 @@
1
+ # Copyright (c) 2025 Baidu, Inc. All Rights Reserved.
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+
15
+ import os
16
+ from shutil import copyfile
17
+ from typing import List, Optional, Tuple
18
+ import sentencepiece as spm
19
+
20
+ from transformers.tokenization_utils import PreTrainedTokenizer
21
+ from transformers.utils import logging
22
+
23
+
24
+ logger = logging.get_logger(__name__)
25
+
26
+
27
+ class Ernie4_5_Tokenizer(PreTrainedTokenizer):
28
+ """SentencePiece-based tokenizer for ERNIE 4.5 models."""
29
+ vocab_files_names = {
30
+ "vocab_file": "tokenizer.model",
31
+ }
32
+ # Model input names expected by the tokenizer
33
+ model_input_names = ["input_ids", "position_ids", "attention_mask", "labels"]
34
+ # Padding side (where to add padding tokens)
35
+ padding_side = "right"
36
+
37
+ def __init__(
38
+ self,
39
+ vocab_file,
40
+ bos_token="<s>",
41
+ cls_token="<cls>",
42
+ eos_token="</s>",
43
+ mask_token="<mask:0>",
44
+ pad_token="<pad>",
45
+ sep_token="<sep>",
46
+ unk_token="<unk>",
47
+ additional_special_tokens=None,
48
+ verbose=False,
49
+ **kwargs,
50
+ ):
51
+ """
52
+ Initialize the ERNIE tokenizer.
53
+
54
+ Args:
55
+ vocab_file (str): Path to the SentencePiece model file.
56
+ bos_token (str, optional): Beginning of sentence token. Defaults to "<s>".
57
+ cls_token (str, optional): Classification token. Defaults to "<cls>".
58
+ eos_token (str, optional): End of sentence token. Defaults to "</s>".
59
+ mask_token (str, optional): Mask token. Defaults to "<mask:0>".
60
+ pad_token (str, optional): Padding token. Defaults to "<pad>".
61
+ sep_token (str, optional): Separator token. Defaults to "<sep>".
62
+ unk_token (str, optional): Unknown token. Defaults to "<unk>".
63
+ additional_special_tokens (List[str], optional): Additional special tokens.
64
+ Defaults to ["<mask:1>", "<mask:7>"].
65
+ verbose (bool, optional): Whether to print detailed logs or progress information during execution.
66
+ **kwargs: Additional keyword arguments passed to the parent class.
67
+ """
68
+
69
+ self.vocab_file = vocab_file
70
+ self.sp_model = spm.SentencePieceProcessor()
71
+ self.sp_model.Load(vocab_file)
72
+
73
+ if additional_special_tokens is None:
74
+ additional_special_tokens = ["<mask:1>", "<mask:7>"]
75
+ super().__init__(
76
+ bos_token=bos_token,
77
+ cls_token=cls_token,
78
+ eos_token=eos_token,
79
+ mask_token=mask_token,
80
+ pad_token=pad_token,
81
+ sep_token=sep_token,
82
+ unk_token=unk_token,
83
+ additional_special_tokens=additional_special_tokens,
84
+ verbose=verbose,
85
+ **kwargs,
86
+ )
87
+
88
+ @property
89
+ def vocab_size(self):
90
+ """Returns the size of the vocabulary.
91
+
92
+ Returns:
93
+ int: The number of tokens in the vocabulary.
94
+ """
95
+ return self.sp_model.vocab_size()
96
+
97
+ def get_vocab(self):
98
+ """Get the vocabulary as a dictionary mapping tokens to their IDs.
99
+
100
+ Returns:
101
+ dict: A dictionary mapping tokens to their corresponding IDs.
102
+ """
103
+ vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
104
+ vocab.update(self.added_tokens_encoder)
105
+ return vocab
106
+
107
+ def _tokenize(self, text):
108
+ """Tokenize text using SentencePiece.
109
+
110
+ Args:
111
+ text (str): The text to tokenize.
112
+
113
+ Returns:
114
+ list: A list of tokens.
115
+ """
116
+ return self.sp_model.encode_as_pieces(text)
117
+
118
+ def _convert_token_to_id(self, token):
119
+ """Convert a token (str) to an ID using the vocabulary.
120
+
121
+ Args:
122
+ token (str): The token to convert.
123
+
124
+ Returns:
125
+ int: The corresponding token ID.
126
+ """
127
+ return self.sp_model.piece_to_id(token)
128
+
129
+ def _convert_id_to_token(self, id):
130
+ """Convert an ID to a token (str) using the vocabulary.
131
+
132
+ Args:
133
+ id (int): The token ID to convert.
134
+
135
+ Returns:
136
+ str: The corresponding token.
137
+ """
138
+ if id >= self.vocab_size:
139
+ return self.unk_token
140
+ else:
141
+ return self.sp_model.id_to_piece(id)
142
+
143
+ def convert_tokens_to_string(self, tokens):
144
+ """Convert a sequence of tokens back to a single string.
145
+
146
+ Args:
147
+ tokens (List[str]): A list of tokens to convert.
148
+
149
+ Returns:
150
+ str: The reconstructed string.
151
+ """
152
+ current_sub_tokens = []
153
+ out_string = ""
154
+ for token in tokens:
155
+ # make sure that special tokens are not decoded using sentencepiece model
156
+ if token in self.all_special_tokens:
157
+ out_string += self.sp_model.decode(current_sub_tokens) + token
158
+ current_sub_tokens = []
159
+ else:
160
+ current_sub_tokens.append(token)
161
+ out_string += self.sp_model.decode(current_sub_tokens)
162
+ return out_string
163
+
164
+ def prepare_for_model(self, *args, **kwargs):
165
+ if "add_special_tokens" in kwargs:
166
+ kwargs.pop("add_special_tokens")
167
+ return super().prepare_for_model(*args, **kwargs)
168
+
169
+ def save_vocabulary(
170
+ self, save_directory, filename_prefix: Optional[str] = None
171
+ ) -> Tuple[str]:
172
+ """
173
+ Save the vocabulary and special tokens file to a directory.
174
+
175
+ Args:
176
+ save_directory (str): The directory in which to save the vocabulary.
177
+ filename_prefix (Optional[str]): Optional prefix for the saved filename.
178
+
179
+ Returns:
180
+ Tuple[str]: Paths to the files saved.
181
+
182
+ Note:
184
+ If ``save_directory`` is not a valid directory, an error is logged and nothing is saved.
184
+ """
185
+ if not os.path.isdir(save_directory):
186
+ logger.error(f"Vocabulary path ({save_directory}) should be a directory")
187
+ return
188
+ out_vocab_file = os.path.join(
189
+ save_directory,
190
+ (filename_prefix + "-" if filename_prefix else "")
191
+ + self.vocab_files_names["vocab_file"],
192
+ )
193
+
194
+ if os.path.abspath(self.vocab_file) != os.path.abspath(
195
+ out_vocab_file
196
+ ) and os.path.isfile(self.vocab_file):
197
+ copyfile(self.vocab_file, out_vocab_file)
198
+ elif not os.path.isfile(self.vocab_file):
199
+ with open(out_vocab_file, "wb") as fi:
200
+ content_spiece_model = self.sp_model.serialized_model_proto()
201
+ fi.write(content_spiece_model)
202
+
203
+ return (out_vocab_file,)
204
+
205
+ def _decode(self, *args, **kwargs):
206
+ kwargs.pop("clean_up_tokenization_spaces", None)
207
+ kwargs.pop("spaces_between_special_tokens", None)
208
+ return super()._decode(
209
+ *args,
210
+ **kwargs,
211
+ clean_up_tokenization_spaces=False,
212
+ spaces_between_special_tokens=False,
213
+ )
214
+
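A minimal round-trip with the tokenizer class above, assuming the `tokenizer.model` file from this repo is in the working directory:

```python
from tokenization_ernie4_5 import Ernie4_5_Tokenizer

tok = Ernie4_5_Tokenizer(vocab_file="tokenizer.model")

ids = tok("Hello, world!")["input_ids"]
print(ids)              # token ids from the SentencePiece model
print(tok.decode(ids))  # round-trips back, modulo SentencePiece whitespace handling
```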
tokenizer.model ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:34ef7db83df785924fb83d7b887b6e822a031c56e15cff40aaf9b982988180df
3
+ size 1614363
tokenizer_config.json ADDED
The diff for this file is too large to render. See raw diff