Files changed
(This view is limited to 50 files because the commit contains too many changes; see the raw diff for the complete list.)
- .gitattributes +9 -0
- Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/bench_metrics.json +44 -0
- Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/llamabench.md +11 -0
- Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_code.log +152 -0
- Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_general.log +152 -0
- Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_math.log +152 -0
- Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K/bench_metrics.json +44 -0
- Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K/llamabench.md +11 -0
- Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K/perplexity_code.log +152 -0
- Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K/perplexity_general.log +152 -0
- Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K/perplexity_math.log +152 -0
- Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/bench_metrics.json +44 -0
- Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/llamabench.md +11 -0
- Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/perplexity_code.log +152 -0
- Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/perplexity_general.log +152 -0
- Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/perplexity_math.log +152 -0
- Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_Q8_0-lm_head_Q5_K/bench_metrics.json +44 -0
- Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_Q8_0-lm_head_Q5_K/llamabench.md +11 -0
- Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_Q8_0-lm_head_Q5_K/perplexity_code.log +153 -0
- Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_Q8_0-lm_head_Q5_K/perplexity_general.log +153 -0
- Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_Q8_0-lm_head_Q5_K/perplexity_math.log +153 -0
- Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/bench_metrics.json +44 -0
- Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/llamabench.md +11 -0
- Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_code.log +152 -0
- Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_general.log +152 -0
- Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_math.log +152 -0
- Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/bench_metrics.json +44 -0
- Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/llamabench.md +11 -0
- Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_code.log +152 -0
- Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_general.log +152 -0
- Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_math.log +152 -0
- Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/bench_metrics.json +44 -0
- Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/llamabench.md +11 -0
- Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_code.log +152 -0
- Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_general.log +152 -0
- Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_math.log +152 -0
- Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_MXFP4-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/bench_metrics.json +44 -0
- Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_MXFP4-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/llamabench.md +11 -0
- Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_MXFP4-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_code.log +152 -0
- Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_MXFP4-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_general.log +152 -0
- Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_MXFP4-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_math.log +152 -0
- Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16/bench_metrics.json +44 -0
- Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16/llamabench.md +11 -0
- Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16/perplexity_code.log +151 -0
- Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16/perplexity_general.log +151 -0
- Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16/perplexity_math.log +151 -0
- Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/bench_metrics.json +44 -0
- Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/llamabench.md +11 -0
- Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_code.log +151 -0
- Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_general.log +151 -0
.gitattributes
CHANGED
@@ -33,3 +33,12 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text
+thinking_budget.png filter=lfs diff=lfs merge=lfs -text
+Seed-OSS-36B-Instruct-MXFP4_MOE.gguf filter=lfs diff=lfs merge=lfs -text
+Seed-OSS-36B-Instruct-mxfp4_moe-E-B16-D-IQ4NL-KOU-Q6K-HQ-Q8_0.gguf filter=lfs diff=lfs merge=lfs -text
+Seed-OSS-36B-Instruct-mxfp4_moe-EHQKOUD-IQ4NL.gguf filter=lfs diff=lfs merge=lfs -text
+Seed-OSS-36B-Instruct-mxfp4_moe-EHQKOUD-Q6K.gguf filter=lfs diff=lfs merge=lfs -text
+Seed-OSS-36B-Instruct-mxfp4_moe-HK-B16-EO-Q5K-QUD-Q8_0.gguf filter=lfs diff=lfs merge=lfs -text
+Seed-OSS-36B-Instruct-mxfp4_moe-HQKU-IQ4NL-EOD-MXFP4.gguf filter=lfs diff=lfs merge=lfs -text
+Seed-OSS-36B-Instruct-mxfp4_moe-O-MXFP4-EHQKUD-Q8_0.gguf filter=lfs diff=lfs merge=lfs -text
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/bench_metrics.json
ADDED
@@ -0,0 +1,44 @@
{
  "raw_metrics": {
    "llamabench": {
      "backend": "CUDA",
      "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/llamabench.md",
      "ngl": "35",
      "raw_row": {
        "backend": "CUDA",
        "model": "seed_oss 36B IQ4_NL - 4.5 bpw",
        "ngl": "35",
        "params": "36.15 B",
        "size": "19.43 GiB",
        "t/s": "30.53 \u00b1 0.74",
        "test": "pp8",
        "tps_value": 30.53
      },
      "test": "pp8",
      "tps": 30.53
    },
    "perplexity": {
      "code": {
        "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_code.log",
        "ppl": 1.4176,
        "ppl_error": 0.00953
      },
      "general": {
        "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_general.log",
        "ppl": 6.8507,
        "ppl_error": 0.16499
      },
      "math": {
        "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_math.log",
        "ppl": 5.4384,
        "ppl_error": 0.1198
      }
    }
  },
  "summary": {
    "avg_prec_loss_pct": 0.3254,
    "bench_tps": 30.53,
    "file_size_bytes": 20864981792,
    "file_size_gb": 19.43
  }
}
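Every variant directory added in this commit carries a bench_metrics.json with the same layout as the file above, so the whole DataCollection tree can be summarized with a few lines of scripting. A minimal sketch (the helper name and report format are illustrative and not part of this repo; only the JSON field names are taken from the file above):

```python
import json
from pathlib import Path

def load_summary(bench_dir: Path) -> dict:
    """Flatten one bench_metrics.json into a single dict of headline numbers."""
    data = json.loads((bench_dir / "bench_metrics.json").read_text())
    ppl = data["raw_metrics"]["perplexity"]
    return {
        "variant": bench_dir.name,
        "size_gb": data["summary"]["file_size_gb"],
        "tps": data["summary"]["bench_tps"],
        "avg_prec_loss_pct": data["summary"]["avg_prec_loss_pct"],
        "ppl_code": ppl["code"]["ppl"],
        "ppl_general": ppl["general"]["ppl"],
        "ppl_math": ppl["math"]["ppl"],
    }

if __name__ == "__main__":
    root = Path("Benchmarks/DataCollection")
    for bench_dir in sorted(p for p in root.iterdir()
                            if (p / "bench_metrics.json").exists()):
        s = load_summary(bench_dir)
        print(f"{s['variant']}: {s['size_gb']} GiB, {s['tps']} t/s, "
              f"avg PPL loss {s['avg_prec_loss_pct']}%")
```

Run from the repository root, this would print one line per quantization variant with its file size, pp8 throughput, and average perplexity loss.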
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/llamabench.md
ADDED
@@ -0,0 +1,11 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| seed_oss 36B IQ4_NL - 4.5 bpw | 19.43 GiB | 36.15 B | CUDA | 35 | pp8 | 30.53 ± 0.74 |
| seed_oss 36B IQ4_NL - 4.5 bpw | 19.43 GiB | 36.15 B | CUDA | 35 | tg128 | 5.08 ± 0.02 |

build: 92bb442ad (7040)
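The "raw_row" object stored in bench_metrics.json mirrors one row of this markdown table. The actual collection pipeline is not part of this commit; a plausible sketch of how such a llama-bench table could be parsed into row dicts (the function name and parsing details are assumptions):

```python
from pathlib import Path

def parse_llamabench_md(path: Path) -> list[dict]:
    """Turn the markdown table in llamabench.md into dicts shaped like
    the 'raw_row' object in bench_metrics.json."""
    rows, header = [], None
    for line in path.read_text().splitlines():
        if not line.startswith("|"):
            continue  # skip the ggml_cuda_init/build lines around the table
        cells = [c.strip() for c in line.strip("|").split("|")]
        if header is None:
            header = cells                      # model, size, params, backend, ngl, test, t/s
        elif not set(cells[0]) <= {"-", ":", " "}:  # skip the |---|---:| separator row
            row = dict(zip(header, cells))
            row["tps_value"] = float(row["t/s"].split("±")[0])
            rows.append(row)
    return rows
```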
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_code.log
ADDED
@@ -0,0 +1,152 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20468 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = seed_oss
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
llama_model_loader: - kv 4: general.basename str = Seed-OSS
llama_model_loader: - kv 5: general.size_label str = 36B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
llama_model_loader: - kv 31: general.quantization_version u32 = 2
llama_model_loader: - kv 32: general.file_type u32 = 25
llama_model_loader: - type f32: 321 tensors
llama_model_loader: - type q5_K: 65 tensors
llama_model_loader: - type iq4_nl: 385 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = IQ4_NL - 4.5 bpw
print_info: file size = 19.43 GiB (4.62 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 2 ('<seed:eos>')
load: special tokens cache size = 128
load: token to piece cache size = 0.9296 MB
print_info: arch = seed_oss
print_info: vocab_only = 0
print_info: n_ctx_train = 524288
print_info: n_embd = 5120
print_info: n_embd_inp = 5120
print_info: n_layer = 64
print_info: n_head = 80
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 10
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 27648
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 10000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 524288
print_info: rope_finetuned = unknown
print_info: model type = 36B
print_info: model params = 36.15 B
print_info: general.name = Seed OSS 36B Instruct Unsloth
print_info: vocab type = BPE
print_info: n_vocab = 155136
print_info: n_merges = 154737
print_info: BOS token = 0 '<seed:bos>'
print_info: EOS token = 2 '<seed:eos>'
print_info: PAD token = 1 '<seed:pad>'
print_info: LF token = 326 'Ċ'
print_info: EOG token = 2 '<seed:eos>'
print_info: max token length = 1024
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 20 repeating layers to GPU
load_tensors: offloaded 20/65 layers to GPU
load_tensors: CPU_Mapped model buffer size = 13971.93 MiB
load_tensors: CUDA0 model buffer size = 2960.23 MiB
load_tensors: CUDA1 model buffer size = 2960.23 MiB
..................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 10000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.59 MiB
llama_kv_cache: CPU KV buffer size = 352.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CUDA0 compute buffer size = 833.78 MiB
llama_context: CUDA1 compute buffer size = 194.01 MiB
llama_context: CUDA_Host compute buffer size = 14.01 MiB
llama_context: graph nodes = 2183
llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
common_init_from_params: added <seed:eos> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 113.066 ms
perplexity: calculating perplexity over 48 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 6.64 seconds per pass - ETA 5.30 minutes
[1]1.5663,[2]1.4687,[3]1.2922,[4]1.2374,[5]1.1926,[6]1.2795,[7]1.3859,[8]1.4450,[9]1.4272,[10]1.4042,[11]1.3816,[12]1.3867,[13]1.3871,[14]1.3725,[15]1.3537,[16]1.3689,[17]1.3703,[18]1.3515,[19]1.3491,[20]1.3652,[21]1.3554,[22]1.3451,[23]1.3557,[24]1.3503,[25]1.3534,[26]1.3494,[27]1.3664,[28]1.3717,[29]1.3721,[30]1.3730,[31]1.3704,[32]1.3812,[33]1.3819,[34]1.3743,[35]1.3700,[36]1.3651,[37]1.3731,[38]1.3820,[39]1.3733,[40]1.3951,[41]1.4041,[42]1.4071,[43]1.4154,[44]1.4164,[45]1.4097,[46]1.4126,[47]1.4164,[48]1.4176,
Final estimate: PPL = 1.4176 +/- 0.00953

llama_perf_context_print: load time = 2579.70 ms
llama_perf_context_print: prompt eval time = 306842.14 ms / 98304 tokens ( 3.12 ms per token, 320.37 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 308603.02 ms / 98305 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16342 + ( 3874 = 2960 + 80 + 833) + 3898 |
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20148 + ( 3234 = 2960 + 80 + 194) + 741 |
llama_memory_breakdown_print: | - Host | 14337 = 13971 + 352 + 14 |
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_general.log
ADDED
@@ -0,0 +1,152 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20468 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = seed_oss
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
llama_model_loader: - kv 4: general.basename str = Seed-OSS
llama_model_loader: - kv 5: general.size_label str = 36B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
llama_model_loader: - kv 31: general.quantization_version u32 = 2
llama_model_loader: - kv 32: general.file_type u32 = 25
llama_model_loader: - type f32: 321 tensors
llama_model_loader: - type q5_K: 65 tensors
llama_model_loader: - type iq4_nl: 385 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = IQ4_NL - 4.5 bpw
print_info: file size = 19.43 GiB (4.62 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 2 ('<seed:eos>')
load: special tokens cache size = 128
load: token to piece cache size = 0.9296 MB
print_info: arch = seed_oss
print_info: vocab_only = 0
print_info: n_ctx_train = 524288
print_info: n_embd = 5120
print_info: n_embd_inp = 5120
print_info: n_layer = 64
print_info: n_head = 80
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 10
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 27648
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 10000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 524288
print_info: rope_finetuned = unknown
print_info: model type = 36B
print_info: model params = 36.15 B
print_info: general.name = Seed OSS 36B Instruct Unsloth
print_info: vocab type = BPE
print_info: n_vocab = 155136
print_info: n_merges = 154737
print_info: BOS token = 0 '<seed:bos>'
print_info: EOS token = 2 '<seed:eos>'
print_info: PAD token = 1 '<seed:pad>'
print_info: LF token = 326 'Ċ'
print_info: EOG token = 2 '<seed:eos>'
print_info: max token length = 1024
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 20 repeating layers to GPU
load_tensors: offloaded 20/65 layers to GPU
load_tensors: CPU_Mapped model buffer size = 13971.93 MiB
load_tensors: CUDA0 model buffer size = 2960.23 MiB
load_tensors: CUDA1 model buffer size = 2960.23 MiB
..................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 10000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.59 MiB
llama_kv_cache: CPU KV buffer size = 352.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CUDA0 compute buffer size = 833.78 MiB
llama_context: CUDA1 compute buffer size = 194.01 MiB
llama_context: CUDA_Host compute buffer size = 14.01 MiB
llama_context: graph nodes = 2183
llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
common_init_from_params: added <seed:eos> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 50.772 ms
perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 6.57 seconds per pass - ETA 1.63 minutes
[1]6.9757,[2]8.0597,[3]8.4705,[4]8.2112,[5]8.0009,[6]6.7181,[7]5.9191,[8]5.9832,[9]6.2474,[10]6.3077,[11]6.4380,[12]6.7394,[13]6.7657,[14]6.8428,[15]6.8507,
Final estimate: PPL = 6.8507 +/- 0.16499

llama_perf_context_print: load time = 2555.14 ms
llama_perf_context_print: prompt eval time = 95523.06 ms / 30720 tokens ( 3.11 ms per token, 321.60 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 96021.46 ms / 30721 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16389 + ( 3874 = 2960 + 80 + 833) + 3851 |
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20148 + ( 3234 = 2960 + 80 + 194) + 741 |
llama_memory_breakdown_print: | - Host | 14337 = 13971 + 352 + 14 |
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_math.log
ADDED
@@ -0,0 +1,152 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20416 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = seed_oss
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
llama_model_loader: - kv 4: general.basename str = Seed-OSS
llama_model_loader: - kv 5: general.size_label str = 36B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
llama_model_loader: - kv 31: general.quantization_version u32 = 2
llama_model_loader: - kv 32: general.file_type u32 = 25
llama_model_loader: - type f32: 321 tensors
llama_model_loader: - type q5_K: 65 tensors
llama_model_loader: - type iq4_nl: 385 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = IQ4_NL - 4.5 bpw
print_info: file size = 19.43 GiB (4.62 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 2 ('<seed:eos>')
load: special tokens cache size = 128
load: token to piece cache size = 0.9296 MB
print_info: arch = seed_oss
print_info: vocab_only = 0
print_info: n_ctx_train = 524288
print_info: n_embd = 5120
print_info: n_embd_inp = 5120
print_info: n_layer = 64
print_info: n_head = 80
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 10
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 27648
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 10000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 524288
print_info: rope_finetuned = unknown
print_info: model type = 36B
print_info: model params = 36.15 B
print_info: general.name = Seed OSS 36B Instruct Unsloth
print_info: vocab type = BPE
print_info: n_vocab = 155136
print_info: n_merges = 154737
print_info: BOS token = 0 '<seed:bos>'
print_info: EOS token = 2 '<seed:eos>'
print_info: PAD token = 1 '<seed:pad>'
print_info: LF token = 326 'Ċ'
print_info: EOG token = 2 '<seed:eos>'
print_info: max token length = 1024
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 20 repeating layers to GPU
load_tensors: offloaded 20/65 layers to GPU
load_tensors: CPU_Mapped model buffer size = 13971.93 MiB
load_tensors: CUDA0 model buffer size = 2960.23 MiB
load_tensors: CUDA1 model buffer size = 2960.23 MiB
..................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 10000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.59 MiB
llama_kv_cache: CPU KV buffer size = 352.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CUDA0 compute buffer size = 833.78 MiB
llama_context: CUDA1 compute buffer size = 194.01 MiB
llama_context: CUDA_Host compute buffer size = 14.01 MiB
llama_context: graph nodes = 2183
llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
common_init_from_params: added <seed:eos> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 46.827 ms
perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 6.63 seconds per pass - ETA 1.77 minutes
[1]2.7608,[2]2.9054,[3]3.3323,[4]3.5844,[5]4.0966,[6]4.3685,[7]4.5745,[8]4.7023,[9]4.8500,[10]5.0005,[11]5.0791,[12]5.1544,[13]5.2872,[14]5.3968,[15]5.4247,[16]5.4384,
Final estimate: PPL = 5.4384 +/- 0.11980

llama_perf_context_print: load time = 2540.69 ms
llama_perf_context_print: prompt eval time = 102170.30 ms / 32768 tokens ( 3.12 ms per token, 320.72 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 102686.99 ms / 32769 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16345 + ( 3874 = 2960 + 80 + 833) + 3896 |
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20148 + ( 3234 = 2960 + 80 + 194) + 741 |
llama_memory_breakdown_print: | - Host | 14337 = 13971 + 352 + 14 |
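The summary.avg_prec_loss_pct value stored in bench_metrics.json (0.3254 for this lm_head Q5_K build) presumably relates the three final perplexities above to a reference run such as the all-BF16 variant listed in this commit, but the exact formula is not part of the diff. One plausible reading, offered only as a hedged sketch, is the mean percentage PPL increase over the baseline across the code, general, and math domains; the directory names below are taken from the file list at the top of this commit, while the formula itself is an assumption:

```python
import json
from pathlib import Path

def load_ppl(bench_dir: Path) -> dict:
    """Return the per-domain perplexity block of one bench_metrics.json."""
    data = json.loads((bench_dir / "bench_metrics.json").read_text())
    return data["raw_metrics"]["perplexity"]

def avg_prec_loss_pct(quant: dict, baseline: dict) -> float:
    """Mean percentage PPL increase over the reference, averaged across domains
    (assumed definition, not confirmed by this commit)."""
    domains = ("code", "general", "math")
    losses = [100.0 * (quant[d]["ppl"] - baseline[d]["ppl"]) / baseline[d]["ppl"]
              for d in domains]
    return round(sum(losses) / len(domains), 4)

root = Path("Benchmarks/DataCollection")
quant = load_ppl(root / "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K")
base = load_ppl(root / "Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16")
print(avg_prec_loss_pct(quant, base))  # would land near the stored 0.3254 if this is the metric used
```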
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K/bench_metrics.json
ADDED
@@ -0,0 +1,44 @@
{
  "raw_metrics": {
    "llamabench": {
      "backend": "CUDA",
      "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K/llamabench.md",
      "ngl": "35",
      "raw_row": {
        "backend": "CUDA",
        "model": "seed_oss 36B IQ4_NL - 4.5 bpw",
        "ngl": "35",
        "params": "36.15 B",
        "size": "19.94 GiB",
        "t/s": "25.00 \u00b1 2.73",
        "test": "pp8",
        "tps_value": 25.0
      },
      "test": "pp8",
      "tps": 25.0
    },
    "perplexity": {
      "code": {
        "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K/perplexity_code.log",
        "ppl": 1.4162,
        "ppl_error": 0.00952
      },
      "general": {
        "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K/perplexity_general.log",
        "ppl": 6.8281,
        "ppl_error": 0.16452
      },
      "math": {
        "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K/perplexity_math.log",
        "ppl": 5.442,
        "ppl_error": 0.11987
      }
    }
  },
  "summary": {
    "avg_prec_loss_pct": 0.3797,
    "bench_tps": 25.0,
    "file_size_bytes": 21416119072,
    "file_size_gb": 19.95
  }
}
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K/llamabench.md
ADDED
@@ -0,0 +1,11 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| seed_oss 36B IQ4_NL - 4.5 bpw | 19.94 GiB | 36.15 B | CUDA | 35 | pp8 | 25.00 ± 2.73 |
| seed_oss 36B IQ4_NL - 4.5 bpw | 19.94 GiB | 36.15 B | CUDA | 35 | tg128 | 4.99 ± 0.01 |

build: 92bb442ad (7040)
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K/perplexity_code.log
ADDED
|
@@ -0,0 +1,152 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20469 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = seed_oss
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
llama_model_loader: - kv 4: general.basename str = Seed-OSS
llama_model_loader: - kv 5: general.size_label str = 36B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
llama_model_loader: - kv 31: general.quantization_version u32 = 2
llama_model_loader: - kv 32: general.file_type u32 = 25
llama_model_loader: - type f32: 321 tensors
llama_model_loader: - type q6_K: 65 tensors
llama_model_loader: - type iq4_nl: 385 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = IQ4_NL - 4.5 bpw
print_info: file size = 19.94 GiB (4.74 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 2 ('<seed:eos>')
load: special tokens cache size = 128
load: token to piece cache size = 0.9296 MB
print_info: arch = seed_oss
print_info: vocab_only = 0
print_info: n_ctx_train = 524288
print_info: n_embd = 5120
print_info: n_embd_inp = 5120
print_info: n_layer = 64
print_info: n_head = 80
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 10
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 27648
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 10000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 524288
print_info: rope_finetuned = unknown
print_info: model type = 36B
print_info: model params = 36.15 B
print_info: general.name = Seed OSS 36B Instruct Unsloth
print_info: vocab type = BPE
print_info: n_vocab = 155136
print_info: n_merges = 154737
print_info: BOS token = 0 '<seed:bos>'
print_info: EOS token = 2 '<seed:eos>'
print_info: PAD token = 1 '<seed:pad>'
print_info: LF token = 326 'Ċ'
print_info: EOG token = 2 '<seed:eos>'
print_info: max token length = 1024
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 20 repeating layers to GPU
load_tensors: offloaded 20/65 layers to GPU
load_tensors: CPU_Mapped model buffer size = 14364.72 MiB
load_tensors: CUDA0 model buffer size = 3026.64 MiB
load_tensors: CUDA1 model buffer size = 3026.64 MiB
.................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 10000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.59 MiB
llama_kv_cache: CPU KV buffer size = 352.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CUDA0 compute buffer size = 934.39 MiB
llama_context: CUDA1 compute buffer size = 194.01 MiB
llama_context: CUDA_Host compute buffer size = 14.01 MiB
llama_context: graph nodes = 2183
llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
common_init_from_params: added <seed:eos> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 111.885 ms
perplexity: calculating perplexity over 48 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 6.75 seconds per pass - ETA 5.38 minutes
[1]1.5604,[2]1.4661,[3]1.2906,[4]1.2351,[5]1.1912,[6]1.2788,[7]1.3853,[8]1.4444,[9]1.4265,[10]1.4033,[11]1.3807,[12]1.3857,[13]1.3862,[14]1.3715,[15]1.3527,[16]1.3679,[17]1.3692,[18]1.3505,[19]1.3481,[20]1.3641,[21]1.3544,[22]1.3441,[23]1.3545,[24]1.3490,[25]1.3521,[26]1.3479,[27]1.3652,[28]1.3705,[29]1.3711,[30]1.3719,[31]1.3692,[32]1.3801,[33]1.3807,[34]1.3732,[35]1.3689,[36]1.3638,[37]1.3718,[38]1.3806,[39]1.3719,[40]1.3937,[41]1.4027,[42]1.4057,[43]1.4140,[44]1.4151,[45]1.4084,[46]1.4113,[47]1.4151,[48]1.4162,
Final estimate: PPL = 1.4162 +/- 0.00952

llama_perf_context_print: load time = 2551.49 ms
llama_perf_context_print: prompt eval time = 312027.57 ms / 98304 tokens ( 3.17 ms per token, 315.05 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 313585.58 ms / 98305 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16234 + ( 4041 = 3026 + 80 + 934) + 3839 |
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20082 + ( 3300 = 3026 + 80 + 194) + 740 |
llama_memory_breakdown_print: | - Host | 14730 = 14364 + 352 + 14 |
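Note on reading these perplexity logs: the bracketed values are the running perplexity estimate after each 2048-token chunk, and "Final estimate" is the value after the last chunk. As a reference only, below is a minimal sketch of the textbook definition these logs report (exponential of the mean per-token negative log-likelihood); it is not a claim about llama.cpp internals, and token_nlls is a hypothetical input array.

    import math

    def perplexity(token_nlls: list[float]) -> float:
        """Perplexity = exp(average negative log-likelihood per token, in nats)."""
        return math.exp(sum(token_nlls) / len(token_nlls))

    # Example: three tokens with NLLs of 0.2, 0.5 and 0.35 nats -> PPL ~ 1.42
    print(perplexity([0.2, 0.5, 0.35]))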
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K/perplexity_general.log
ADDED
@@ -0,0 +1,152 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20465 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = seed_oss
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
llama_model_loader: - kv 4: general.basename str = Seed-OSS
llama_model_loader: - kv 5: general.size_label str = 36B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
llama_model_loader: - kv 31: general.quantization_version u32 = 2
llama_model_loader: - kv 32: general.file_type u32 = 25
llama_model_loader: - type f32: 321 tensors
llama_model_loader: - type q6_K: 65 tensors
llama_model_loader: - type iq4_nl: 385 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = IQ4_NL - 4.5 bpw
print_info: file size = 19.94 GiB (4.74 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 2 ('<seed:eos>')
load: special tokens cache size = 128
load: token to piece cache size = 0.9296 MB
print_info: arch = seed_oss
print_info: vocab_only = 0
print_info: n_ctx_train = 524288
print_info: n_embd = 5120
print_info: n_embd_inp = 5120
print_info: n_layer = 64
print_info: n_head = 80
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 10
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 27648
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 10000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 524288
print_info: rope_finetuned = unknown
print_info: model type = 36B
print_info: model params = 36.15 B
print_info: general.name = Seed OSS 36B Instruct Unsloth
print_info: vocab type = BPE
print_info: n_vocab = 155136
print_info: n_merges = 154737
print_info: BOS token = 0 '<seed:bos>'
print_info: EOS token = 2 '<seed:eos>'
print_info: PAD token = 1 '<seed:pad>'
print_info: LF token = 326 'Ċ'
print_info: EOG token = 2 '<seed:eos>'
print_info: max token length = 1024
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 20 repeating layers to GPU
load_tensors: offloaded 20/65 layers to GPU
load_tensors: CPU_Mapped model buffer size = 14364.72 MiB
load_tensors: CUDA0 model buffer size = 3026.64 MiB
load_tensors: CUDA1 model buffer size = 3026.64 MiB
.................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 10000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.59 MiB
llama_kv_cache: CPU KV buffer size = 352.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CUDA0 compute buffer size = 934.39 MiB
llama_context: CUDA1 compute buffer size = 194.01 MiB
llama_context: CUDA_Host compute buffer size = 14.01 MiB
llama_context: graph nodes = 2183
llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
common_init_from_params: added <seed:eos> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 46.269 ms
perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 6.73 seconds per pass - ETA 1.67 minutes
[1]6.9699,[2]8.0356,[3]8.4251,[4]8.1681,[5]7.9547,[6]6.6787,[7]5.8897,[8]5.9520,[9]6.2160,[10]6.2771,[11]6.4095,[12]6.7141,[13]6.7415,[14]6.8193,[15]6.8281,
Final estimate: PPL = 6.8281 +/- 0.16452

llama_perf_context_print: load time = 2816.56 ms
llama_perf_context_print: prompt eval time = 97428.48 ms / 30720 tokens ( 3.17 ms per token, 315.31 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 97912.35 ms / 30721 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16225 + ( 4041 = 3026 + 80 + 934) + 3848 |
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20082 + ( 3300 = 3026 + 80 + 194) + 740 |
llama_memory_breakdown_print: | - Host | 14730 = 14364 + 352 + 14 |
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K/perplexity_math.log
ADDED
@@ -0,0 +1,152 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20474 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = seed_oss
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
llama_model_loader: - kv 4: general.basename str = Seed-OSS
llama_model_loader: - kv 5: general.size_label str = 36B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
llama_model_loader: - kv 31: general.quantization_version u32 = 2
llama_model_loader: - kv 32: general.file_type u32 = 25
llama_model_loader: - type f32: 321 tensors
llama_model_loader: - type q6_K: 65 tensors
llama_model_loader: - type iq4_nl: 385 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = IQ4_NL - 4.5 bpw
print_info: file size = 19.94 GiB (4.74 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 2 ('<seed:eos>')
load: special tokens cache size = 128
load: token to piece cache size = 0.9296 MB
print_info: arch = seed_oss
print_info: vocab_only = 0
print_info: n_ctx_train = 524288
print_info: n_embd = 5120
print_info: n_embd_inp = 5120
print_info: n_layer = 64
print_info: n_head = 80
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 10
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 27648
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 10000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 524288
print_info: rope_finetuned = unknown
print_info: model type = 36B
print_info: model params = 36.15 B
print_info: general.name = Seed OSS 36B Instruct Unsloth
print_info: vocab type = BPE
print_info: n_vocab = 155136
print_info: n_merges = 154737
print_info: BOS token = 0 '<seed:bos>'
print_info: EOS token = 2 '<seed:eos>'
print_info: PAD token = 1 '<seed:pad>'
print_info: LF token = 326 'Ċ'
print_info: EOG token = 2 '<seed:eos>'
print_info: max token length = 1024
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 20 repeating layers to GPU
load_tensors: offloaded 20/65 layers to GPU
load_tensors: CPU_Mapped model buffer size = 14364.72 MiB
load_tensors: CUDA0 model buffer size = 3026.64 MiB
load_tensors: CUDA1 model buffer size = 3026.64 MiB
.................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 10000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.59 MiB
llama_kv_cache: CPU KV buffer size = 352.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CUDA0 compute buffer size = 934.39 MiB
llama_context: CUDA1 compute buffer size = 194.01 MiB
llama_context: CUDA_Host compute buffer size = 14.01 MiB
llama_context: graph nodes = 2183
llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
common_init_from_params: added <seed:eos> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 44.395 ms
perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 6.77 seconds per pass - ETA 1.80 minutes
[1]2.7756,[2]2.9075,[3]3.3369,[4]3.5882,[5]4.0989,[6]4.3711,[7]4.5809,[8]4.7081,[9]4.8552,[10]5.0064,[11]5.0821,[12]5.1599,[13]5.2922,[14]5.4012,[15]5.4277,[16]5.4420,
Final estimate: PPL = 5.4420 +/- 0.11987

llama_perf_context_print: load time = 2590.28 ms
llama_perf_context_print: prompt eval time = 104259.46 ms / 32768 tokens ( 3.18 ms per token, 314.29 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 104865.05 ms / 32769 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16229 + ( 4041 = 3026 + 80 + 934) + 3844 |
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20082 + ( 3300 = 3026 + 80 + 194) + 740 |
llama_memory_breakdown_print: | - Host | 14730 = 14364 + 352 + 14 |
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/bench_metrics.json
ADDED
@@ -0,0 +1,44 @@
{
  "raw_metrics": {
    "llamabench": {
      "backend": "CUDA",
      "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/llamabench.md",
      "ngl": "35",
      "raw_row": {
        "backend": "CUDA",
        "model": "seed_oss 36B IQ4_NL - 4.5 bpw",
        "ngl": "35",
        "params": "36.15 B",
        "size": "20.88 GiB",
        "t/s": "25.81 \u00b1 2.12",
        "test": "pp8",
        "tps_value": 25.81
      },
      "test": "pp8",
      "tps": 25.81
    },
    "perplexity": {
      "code": {
        "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/perplexity_code.log",
        "ppl": 1.4161,
        "ppl_error": 0.00951
      },
      "general": {
        "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/perplexity_general.log",
        "ppl": 6.822,
        "ppl_error": 0.16425
      },
      "math": {
        "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/perplexity_math.log",
        "ppl": 5.4388,
        "ppl_error": 0.11973
      }
    }
  },
  "summary": {
    "avg_prec_loss_pct": 0.4265,
    "bench_tps": 25.81,
    "file_size_bytes": 22421134112,
    "file_size_gb": 20.88
  }
}
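For readers who want to consume these per-variant records programmatically, here is a minimal sketch of reading a bench_metrics.json in the layout shown above. It relies only on the field names visible in that JSON; the script and the placeholder path are illustrative and are not part of the benchmark toolchain.

    import json
    from pathlib import Path

    def summarize(path: str) -> None:
        """Print the per-domain perplexities and the pre-computed summary block."""
        metrics = json.loads(Path(path).read_text())
        for domain, rec in metrics["raw_metrics"]["perplexity"].items():
            print(f"{domain:>8}: PPL = {rec['ppl']} +/- {rec['ppl_error']}")
        s = metrics["summary"]
        print(f"pp8 throughput: {s['bench_tps']} t/s, "
              f"size: {s['file_size_gb']} GiB, "
              f"avg precision loss: {s['avg_prec_loss_pct']} %")

    # placeholder path; point it at any variant's bench_metrics.json
    summarize("bench_metrics.json")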
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/llamabench.md
ADDED
@@ -0,0 +1,11 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| seed_oss 36B IQ4_NL - 4.5 bpw | 20.88 GiB | 36.15 B | CUDA | 35 | pp8 | 25.81 ± 2.12 |
| seed_oss 36B IQ4_NL - 4.5 bpw | 20.88 GiB | 36.15 B | CUDA | 35 | tg128 | 4.76 ± 0.00 |

build: 92bb442ad (7040)
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/perplexity_code.log
ADDED
@@ -0,0 +1,152 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20719 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = seed_oss
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
llama_model_loader: - kv 4: general.basename str = Seed-OSS
llama_model_loader: - kv 5: general.size_label str = 36B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
llama_model_loader: - kv 31: general.quantization_version u32 = 2
llama_model_loader: - kv 32: general.file_type u32 = 25
llama_model_loader: - type f32: 321 tensors
llama_model_loader: - type q8_0: 65 tensors
llama_model_loader: - type iq4_nl: 385 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = IQ4_NL - 4.5 bpw
print_info: file size = 20.88 GiB (4.96 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 2 ('<seed:eos>')
load: special tokens cache size = 128
load: token to piece cache size = 0.9296 MB
print_info: arch = seed_oss
print_info: vocab_only = 0
print_info: n_ctx_train = 524288
print_info: n_embd = 5120
print_info: n_embd_inp = 5120
print_info: n_layer = 64
print_info: n_head = 80
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 10
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 27648
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 10000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 524288
print_info: rope_finetuned = unknown
print_info: model type = 36B
print_info: model params = 36.15 B
print_info: general.name = Seed OSS 36B Instruct Unsloth
print_info: vocab type = BPE
print_info: n_vocab = 155136
print_info: n_merges = 154737
print_info: BOS token = 0 '<seed:bos>'
print_info: EOS token = 2 '<seed:eos>'
print_info: PAD token = 1 '<seed:pad>'
print_info: LF token = 326 'Ċ'
print_info: EOG token = 2 '<seed:eos>'
print_info: max token length = 1024
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 20 repeating layers to GPU
load_tensors: offloaded 20/65 layers to GPU
load_tensors: CPU_Mapped model buffer size = 15080.99 MiB
load_tensors: CUDA0 model buffer size = 3147.73 MiB
load_tensors: CUDA1 model buffer size = 3147.73 MiB
.................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 10000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.59 MiB
llama_kv_cache: CPU KV buffer size = 352.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CUDA0 compute buffer size = 1117.84 MiB
llama_context: CUDA1 compute buffer size = 194.01 MiB
llama_context: CUDA_Host compute buffer size = 14.01 MiB
llama_context: graph nodes = 2183
llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
common_init_from_params: added <seed:eos> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 113.425 ms
perplexity: calculating perplexity over 48 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 6.95 seconds per pass - ETA 5.55 minutes
[1]1.5621,[2]1.4658,[3]1.2905,[4]1.2351,[5]1.1909,[6]1.2778,[7]1.3842,[8]1.4435,[9]1.4255,[10]1.4025,[11]1.3801,[12]1.3853,[13]1.3858,[14]1.3712,[15]1.3524,[16]1.3676,[17]1.3689,[18]1.3502,[19]1.3478,[20]1.3637,[21]1.3540,[22]1.3436,[23]1.3541,[24]1.3486,[25]1.3516,[26]1.3475,[27]1.3646,[28]1.3700,[29]1.3705,[30]1.3714,[31]1.3688,[32]1.3796,[33]1.3802,[34]1.3727,[35]1.3684,[36]1.3634,[37]1.3714,[38]1.3803,[39]1.3716,[40]1.3934,[41]1.4025,[42]1.4056,[43]1.4139,[44]1.4150,[45]1.4082,[46]1.4111,[47]1.4149,[48]1.4161,
Final estimate: PPL = 1.4161 +/- 0.00951

llama_perf_context_print: load time = 2635.78 ms
llama_perf_context_print: prompt eval time = 320424.15 ms / 98304 tokens ( 3.26 ms per token, 306.79 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 321933.86 ms / 98305 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 15877 + ( 4345 = 3147 + 80 + 1117) + 3891 |
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 19962 + ( 3421 = 3147 + 80 + 194) + 739 |
llama_memory_breakdown_print: | - Host | 15447 = 15080 + 352 + 14 |
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/perplexity_general.log
ADDED
@@ -0,0 +1,152 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20718 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = seed_oss
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = Seed-OSS
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 36B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
|
| 23 |
+
llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
|
| 24 |
+
llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
|
| 25 |
+
llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
|
| 26 |
+
llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
|
| 27 |
+
llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
|
| 28 |
+
llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
|
| 29 |
+
llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
|
| 30 |
+
llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 31 |
+
llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
|
| 32 |
+
llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
|
| 33 |
+
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
|
| 34 |
+
llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
|
| 35 |
+
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
|
| 42 |
+
llama_model_loader: - kv 31: general.quantization_version u32 = 2
|
| 43 |
+
llama_model_loader: - kv 32: general.file_type u32 = 25
|
| 44 |
+
llama_model_loader: - type f32: 321 tensors
|
| 45 |
+
llama_model_loader: - type q8_0: 65 tensors
|
| 46 |
+
llama_model_loader: - type iq4_nl: 385 tensors
|
| 47 |
+
print_info: file format = GGUF V3 (latest)
|
| 48 |
+
print_info: file type = IQ4_NL - 4.5 bpw
|
| 49 |
+
print_info: file size = 20.88 GiB (4.96 BPW)
|
| 50 |
+
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
|
| 51 |
+
load: printing all EOG tokens:
|
| 52 |
+
load: - 2 ('<seed:eos>')
|
| 53 |
+
load: special tokens cache size = 128
|
| 54 |
+
load: token to piece cache size = 0.9296 MB
|
| 55 |
+
print_info: arch = seed_oss
|
| 56 |
+
print_info: vocab_only = 0
|
| 57 |
+
print_info: n_ctx_train = 524288
|
| 58 |
+
print_info: n_embd = 5120
|
| 59 |
+
print_info: n_embd_inp = 5120
|
| 60 |
+
print_info: n_layer = 64
|
| 61 |
+
print_info: n_head = 80
|
| 62 |
+
print_info: n_head_kv = 8
|
| 63 |
+
print_info: n_rot = 128
|
| 64 |
+
print_info: n_swa = 0
|
| 65 |
+
print_info: is_swa_any = 0
|
| 66 |
+
print_info: n_embd_head_k = 128
|
| 67 |
+
print_info: n_embd_head_v = 128
|
| 68 |
+
print_info: n_gqa = 10
|
| 69 |
+
print_info: n_embd_k_gqa = 1024
|
| 70 |
+
print_info: n_embd_v_gqa = 1024
|
| 71 |
+
print_info: f_norm_eps = 0.0e+00
|
| 72 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 73 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 74 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 75 |
+
print_info: f_logit_scale = 0.0e+00
|
| 76 |
+
print_info: f_attn_scale = 0.0e+00
|
| 77 |
+
print_info: n_ff = 27648
|
| 78 |
+
print_info: n_expert = 0
|
| 79 |
+
print_info: n_expert_used = 0
|
| 80 |
+
print_info: n_expert_groups = 0
|
| 81 |
+
print_info: n_group_used = 0
|
| 82 |
+
print_info: causal attn = 1
|
| 83 |
+
print_info: pooling type = 0
|
| 84 |
+
print_info: rope type = 2
|
| 85 |
+print_info: rope scaling = linear
+print_info: freq_base_train = 10000000.0
+print_info: freq_scale_train = 1
+print_info: n_ctx_orig_yarn = 524288
+print_info: rope_finetuned = unknown
+print_info: model type = 36B
+print_info: model params = 36.15 B
+print_info: general.name = Seed OSS 36B Instruct Unsloth
+print_info: vocab type = BPE
+print_info: n_vocab = 155136
+print_info: n_merges = 154737
+print_info: BOS token = 0 '<seed:bos>'
+print_info: EOS token = 2 '<seed:eos>'
+print_info: PAD token = 1 '<seed:pad>'
+print_info: LF token = 326 'Ċ'
+print_info: EOG token = 2 '<seed:eos>'
+print_info: max token length = 1024
+load_tensors: loading model tensors, this can take a while... (mmap = true)
+load_tensors: offloading 20 repeating layers to GPU
+load_tensors: offloaded 20/65 layers to GPU
+load_tensors: CPU_Mapped model buffer size = 15080.99 MiB
+load_tensors: CUDA0 model buffer size = 3147.73 MiB
+load_tensors: CUDA1 model buffer size = 3147.73 MiB
+.................................................................................................
+llama_context: constructing llama_context
+llama_context: n_seq_max = 1
+llama_context: n_ctx = 2048
+llama_context: n_ctx_seq = 2048
+llama_context: n_batch = 2048
+llama_context: n_ubatch = 512
+llama_context: causal_attn = 1
+llama_context: flash_attn = auto
+llama_context: kv_unified = false
+llama_context: freq_base = 10000000.0
+llama_context: freq_scale = 1
+llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
+llama_context: CPU output buffer size = 0.59 MiB
+llama_kv_cache: CPU KV buffer size = 352.00 MiB
+llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
+llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
+llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
+llama_context: Flash Attention was auto, set to enabled
+llama_context: CUDA0 compute buffer size = 1117.84 MiB
+llama_context: CUDA1 compute buffer size = 194.01 MiB
+llama_context: CUDA_Host compute buffer size = 14.01 MiB
+llama_context: graph nodes = 2183
+llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
+common_init_from_params: added <seed:eos> logit bias = -inf
+common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+
+system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+perplexity: tokenizing the input ..
+perplexity: tokenization took 48.519 ms
+perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+perplexity: 6.90 seconds per pass - ETA 1.72 minutes
+[1]6.9907,[2]8.0518,[3]8.4369,[4]8.1699,[5]7.9557,[6]6.6757,[7]5.8864,[8]5.9502,[9]6.2128,[10]6.2733,[11]6.4054,[12]6.7076,[13]6.7338,[14]6.8125,[15]6.8220,
+Final estimate: PPL = 6.8220 +/- 0.16425
+
+llama_perf_context_print: load time = 2659.91 ms
+llama_perf_context_print: prompt eval time = 100053.26 ms / 30720 tokens ( 3.26 ms per token, 307.04 tokens per second)
+llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+llama_perf_context_print: total time = 100537.54 ms / 30721 tokens
+llama_perf_context_print: graphs reused = 0
+llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
+llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16177 + ( 4345 = 3147 + 80 + 1117) + 3592 |
+llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 19962 + ( 3421 = 3147 + 80 + 194) + 739 |
+llama_memory_breakdown_print: | - Host | 15447 = 15080 + 352 + 14 |
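The "Final estimate" above is simply the running value after the last chunk: the bracketed numbers appear to be cumulative perplexity estimates over the first N 2048-token chunks, and the last one is what the per-variant bench_metrics.json files record as `ppl`. As a minimal, hedged sketch (not the llama.cpp implementation), perplexity is the exponential of the mean negative log-likelihood per scored token:

```python
import math

def perplexity(token_logprobs):
    # token_logprobs: natural-log probabilities the model assigned to each
    # scored token; PPL = exp(mean negative log-likelihood).
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Toy check: tokens predicted at ~0.147 probability each give PPL ~ 1/0.147 ~ 6.8,
# the same order of magnitude as the final estimate in this log.
print(round(perplexity([math.log(0.147)] * 2048), 2))
```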
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/perplexity_math.log
ADDED
|
@@ -0,0 +1,152 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20420 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = seed_oss
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = Seed-OSS
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 36B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
|
| 23 |
+
llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
|
| 24 |
+
llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
|
| 25 |
+
llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
|
| 26 |
+
llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
|
| 27 |
+
llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
|
| 28 |
+
llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
|
| 29 |
+
llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
|
| 30 |
+
llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 31 |
+
llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
|
| 32 |
+
llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
|
| 33 |
+
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
|
| 34 |
+
llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
|
| 35 |
+
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
|
| 42 |
+
llama_model_loader: - kv 31: general.quantization_version u32 = 2
|
| 43 |
+
llama_model_loader: - kv 32: general.file_type u32 = 25
|
| 44 |
+
llama_model_loader: - type f32: 321 tensors
|
| 45 |
+
llama_model_loader: - type q8_0: 65 tensors
|
| 46 |
+
llama_model_loader: - type iq4_nl: 385 tensors
|
| 47 |
+
print_info: file format = GGUF V3 (latest)
|
| 48 |
+
print_info: file type = IQ4_NL - 4.5 bpw
|
| 49 |
+
print_info: file size = 20.88 GiB (4.96 BPW)
|
| 50 |
+
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
|
| 51 |
+
load: printing all EOG tokens:
|
| 52 |
+
load: - 2 ('<seed:eos>')
|
| 53 |
+
load: special tokens cache size = 128
|
| 54 |
+
load: token to piece cache size = 0.9296 MB
|
| 55 |
+
print_info: arch = seed_oss
|
| 56 |
+
print_info: vocab_only = 0
|
| 57 |
+
print_info: n_ctx_train = 524288
|
| 58 |
+
print_info: n_embd = 5120
|
| 59 |
+
print_info: n_embd_inp = 5120
|
| 60 |
+
print_info: n_layer = 64
|
| 61 |
+
print_info: n_head = 80
|
| 62 |
+
print_info: n_head_kv = 8
|
| 63 |
+
print_info: n_rot = 128
|
| 64 |
+
print_info: n_swa = 0
|
| 65 |
+
print_info: is_swa_any = 0
|
| 66 |
+
print_info: n_embd_head_k = 128
|
| 67 |
+
print_info: n_embd_head_v = 128
|
| 68 |
+
print_info: n_gqa = 10
|
| 69 |
+
print_info: n_embd_k_gqa = 1024
|
| 70 |
+
print_info: n_embd_v_gqa = 1024
|
| 71 |
+
print_info: f_norm_eps = 0.0e+00
|
| 72 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 73 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 74 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 75 |
+
print_info: f_logit_scale = 0.0e+00
|
| 76 |
+
print_info: f_attn_scale = 0.0e+00
|
| 77 |
+
print_info: n_ff = 27648
|
| 78 |
+
print_info: n_expert = 0
|
| 79 |
+
print_info: n_expert_used = 0
|
| 80 |
+
print_info: n_expert_groups = 0
|
| 81 |
+
print_info: n_group_used = 0
|
| 82 |
+
print_info: causal attn = 1
|
| 83 |
+
print_info: pooling type = 0
|
| 84 |
+
print_info: rope type = 2
|
| 85 |
+
print_info: rope scaling = linear
|
| 86 |
+
print_info: freq_base_train = 10000000.0
|
| 87 |
+
print_info: freq_scale_train = 1
|
| 88 |
+
print_info: n_ctx_orig_yarn = 524288
|
| 89 |
+
print_info: rope_finetuned = unknown
|
| 90 |
+
print_info: model type = 36B
|
| 91 |
+
print_info: model params = 36.15 B
|
| 92 |
+
print_info: general.name = Seed OSS 36B Instruct Unsloth
|
| 93 |
+
print_info: vocab type = BPE
|
| 94 |
+
print_info: n_vocab = 155136
|
| 95 |
+
print_info: n_merges = 154737
|
| 96 |
+
print_info: BOS token = 0 '<seed:bos>'
|
| 97 |
+
print_info: EOS token = 2 '<seed:eos>'
|
| 98 |
+
print_info: PAD token = 1 '<seed:pad>'
|
| 99 |
+
print_info: LF token = 326 'Ċ'
|
| 100 |
+
print_info: EOG token = 2 '<seed:eos>'
|
| 101 |
+
print_info: max token length = 1024
|
| 102 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 103 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 104 |
+
load_tensors: offloaded 20/65 layers to GPU
|
| 105 |
+
load_tensors: CPU_Mapped model buffer size = 15080.99 MiB
|
| 106 |
+
load_tensors: CUDA0 model buffer size = 3147.73 MiB
|
| 107 |
+
load_tensors: CUDA1 model buffer size = 3147.73 MiB
|
| 108 |
+
.................................................................................................
|
| 109 |
+
llama_context: constructing llama_context
|
| 110 |
+
llama_context: n_seq_max = 1
|
| 111 |
+
llama_context: n_ctx = 2048
|
| 112 |
+
llama_context: n_ctx_seq = 2048
|
| 113 |
+
llama_context: n_batch = 2048
|
| 114 |
+
llama_context: n_ubatch = 512
|
| 115 |
+
llama_context: causal_attn = 1
|
| 116 |
+
llama_context: flash_attn = auto
|
| 117 |
+
llama_context: kv_unified = false
|
| 118 |
+
llama_context: freq_base = 10000000.0
|
| 119 |
+
llama_context: freq_scale = 1
|
| 120 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
|
| 121 |
+
llama_context: CPU output buffer size = 0.59 MiB
|
| 122 |
+
llama_kv_cache: CPU KV buffer size = 352.00 MiB
|
| 123 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 124 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 125 |
+
llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
|
| 126 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 127 |
+
llama_context: CUDA0 compute buffer size = 1117.84 MiB
|
| 128 |
+
llama_context: CUDA1 compute buffer size = 194.01 MiB
|
| 129 |
+
llama_context: CUDA_Host compute buffer size = 14.01 MiB
|
| 130 |
+
llama_context: graph nodes = 2183
|
| 131 |
+
llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
|
| 132 |
+
common_init_from_params: added <seed:eos> logit bias = -inf
|
| 133 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 134 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 135 |
+
|
| 136 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 137 |
+
perplexity: tokenizing the input ..
|
| 138 |
+
perplexity: tokenization took 44.701 ms
|
| 139 |
+
perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 140 |
+
perplexity: 6.92 seconds per pass - ETA 1.83 minutes
|
| 141 |
+
[1]2.7663,[2]2.9032,[3]3.3313,[4]3.5822,[5]4.0935,[6]4.3647,[7]4.5735,[8]4.7008,[9]4.8480,[10]4.9997,[11]5.0763,[12]5.1539,[13]5.2864,[14]5.3967,[15]5.4243,[16]5.4388,
|
| 142 |
+
Final estimate: PPL = 5.4388 +/- 0.11973
|
| 143 |
+
|
| 144 |
+
llama_perf_context_print: load time = 2648.84 ms
|
| 145 |
+
llama_perf_context_print: prompt eval time = 106970.58 ms / 32768 tokens ( 3.26 ms per token, 306.33 tokens per second)
|
| 146 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 147 |
+
llama_perf_context_print: total time = 107483.72 ms / 32769 tokens
|
| 148 |
+
llama_perf_context_print: graphs reused = 0
|
| 149 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 150 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 15873 + ( 4345 = 3147 + 80 + 1117) + 3896 |
|
| 151 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 19962 + ( 3421 = 3147 + 80 + 194) + 739 |
|
| 152 |
+
llama_memory_breakdown_print: | - Host | 15447 = 15080 + 352 + 14 |
|
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_Q8_0-lm_head_Q5_K/bench_metrics.json
ADDED
|
@@ -0,0 +1,44 @@
+{
+  "raw_metrics": {
+    "llamabench": {
+      "backend": "CUDA",
+      "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_Q8_0-lm_head_Q5_K/llamabench.md",
+      "ngl": "35",
+      "raw_row": {
+        "backend": "CUDA",
+        "model": "seed_oss 36B IQ4_NL - 4.5 bpw",
+        "ngl": "35",
+        "params": "36.15 B",
+        "size": "27.86 GiB",
+        "t/s": "19.03 \u00b1 0.59",
+        "test": "pp8",
+        "tps_value": 19.03
+      },
+      "test": "pp8",
+      "tps": 19.03
+    },
+    "perplexity": {
+      "code": {
+        "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_Q8_0-lm_head_Q5_K/perplexity_code.log",
+        "ppl": 1.4133,
+        "ppl_error": 0.00946
+      },
+      "general": {
+        "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_Q8_0-lm_head_Q5_K/perplexity_general.log",
+        "ppl": 6.8037,
+        "ppl_error": 0.16387
+      },
+      "math": {
+        "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_Q8_0-lm_head_Q5_K/perplexity_math.log",
+        "ppl": 5.3769,
+        "ppl_error": 0.11787
+      }
+    }
+  },
+  "summary": {
+    "avg_prec_loss_pct": 0.828,
+    "bench_tps": 19.03,
+    "file_size_bytes": 29924678432,
+    "file_size_gb": 27.87
+  }
+}
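Each variant directory pairs the raw llama-bench and perplexity readings with a small summary block. A minimal sketch of reading one of these files and printing its headline numbers; the helper name and the relative path are illustrative, not part of the repo's tooling:

```python
import json
from pathlib import Path

def summarize(path):
    # Load one bench_metrics.json and print the summary plus per-domain perplexities.
    data = json.loads(Path(path).read_text())
    summary = data["summary"]
    ppl = data["raw_metrics"]["perplexity"]
    print(Path(path).parent.name)
    print(f"  size      : {summary['file_size_gb']} GB ({summary['file_size_bytes']} bytes)")
    print(f"  pp8 t/s   : {summary['bench_tps']}")
    print(f"  prec loss : {summary['avg_prec_loss_pct']} %")
    for domain, entry in ppl.items():
        print(f"  ppl[{domain}] : {entry['ppl']} +/- {entry['ppl_error']}")

summarize("bench_metrics.json")  # hypothetical path to one variant's metrics file
```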
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_Q8_0-lm_head_Q5_K/llamabench.md
ADDED
|
@@ -0,0 +1,11 @@
+ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+| model | size | params | backend | ngl | test | t/s |
+| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
+| seed_oss 36B IQ4_NL - 4.5 bpw | 27.86 GiB | 36.15 B | CUDA | 35 | pp8 | 19.03 ± 0.59 |
+| seed_oss 36B IQ4_NL - 4.5 bpw | 27.86 GiB | 36.15 B | CUDA | 35 | tg128 | 3.36 ± 0.01 |
+
+build: 92bb442ad (7040)
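The llamabench.md files are the source of the `raw_row` entries in bench_metrics.json: one markdown table row per test (pp8 prompt processing, tg128 token generation). A hedged sketch of a parser for that table; the function is hypothetical and assumes the 7-column layout shown above:

```python
import re

def parse_llamabench(md_text):
    # Parse llama-bench markdown rows into dicts, assuming the column order
    # model | size | params | backend | ngl | test | t/s.
    rows = []
    for line in md_text.splitlines():
        if not line.startswith("|") or set(line) <= {"|", "-", ":", " "}:
            continue  # skip non-table lines and the separator row
        cells = [c.strip() for c in line.strip("|").split("|")]
        if len(cells) != 7 or cells[0] == "model":
            continue  # skip the header row
        tps = float(re.split(r"\s*±\s*", cells[6])[0])  # drop the ± error term
        rows.append({"model": cells[0], "test": cells[5], "tps": tps})
    return rows

table = """| model | size | params | backend | ngl | test | t/s |
| --- | ---: | ---: | --- | --: | ---: | ---: |
| seed_oss 36B IQ4_NL - 4.5 bpw | 27.86 GiB | 36.15 B | CUDA | 35 | pp8 | 19.03 ± 0.59 |
| seed_oss 36B IQ4_NL - 4.5 bpw | 27.86 GiB | 36.15 B | CUDA | 35 | tg128 | 3.36 ± 0.01 |"""
print(parse_llamabench(table))  # pp8 -> 19.03 t/s, tg128 -> 3.36 t/s
```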
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_Q8_0-lm_head_Q5_K/perplexity_code.log
ADDED
|
@@ -0,0 +1,153 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 19183 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_Q8_0-lm_head_Q5_K.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = seed_oss
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = Seed-OSS
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 36B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
|
| 23 |
+
llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
|
| 24 |
+
llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
|
| 25 |
+
llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
|
| 26 |
+
llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
|
| 27 |
+
llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
|
| 28 |
+
llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
|
| 29 |
+
llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
|
| 30 |
+
llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 31 |
+
llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
|
| 32 |
+
llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
|
| 33 |
+
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
|
| 34 |
+
llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
|
| 35 |
+
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
|
| 42 |
+
llama_model_loader: - kv 31: general.quantization_version u32 = 2
|
| 43 |
+
llama_model_loader: - kv 32: general.file_type u32 = 25
|
| 44 |
+
llama_model_loader: - type f32: 321 tensors
|
| 45 |
+
llama_model_loader: - type q8_0: 128 tensors
|
| 46 |
+
llama_model_loader: - type q5_K: 65 tensors
|
| 47 |
+
llama_model_loader: - type iq4_nl: 257 tensors
|
| 48 |
+
print_info: file format = GGUF V3 (latest)
|
| 49 |
+
print_info: file type = IQ4_NL - 4.5 bpw
|
| 50 |
+
print_info: file size = 27.86 GiB (6.62 BPW)
|
| 51 |
+
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
|
| 52 |
+
load: printing all EOG tokens:
|
| 53 |
+
load: - 2 ('<seed:eos>')
|
| 54 |
+
load: special tokens cache size = 128
|
| 55 |
+
load: token to piece cache size = 0.9296 MB
|
| 56 |
+
print_info: arch = seed_oss
|
| 57 |
+
print_info: vocab_only = 0
|
| 58 |
+
print_info: n_ctx_train = 524288
|
| 59 |
+
print_info: n_embd = 5120
|
| 60 |
+
print_info: n_embd_inp = 5120
|
| 61 |
+
print_info: n_layer = 64
|
| 62 |
+
print_info: n_head = 80
|
| 63 |
+
print_info: n_head_kv = 8
|
| 64 |
+
print_info: n_rot = 128
|
| 65 |
+
print_info: n_swa = 0
|
| 66 |
+
print_info: is_swa_any = 0
|
| 67 |
+
print_info: n_embd_head_k = 128
|
| 68 |
+
print_info: n_embd_head_v = 128
|
| 69 |
+
print_info: n_gqa = 10
|
| 70 |
+
print_info: n_embd_k_gqa = 1024
|
| 71 |
+
print_info: n_embd_v_gqa = 1024
|
| 72 |
+
print_info: f_norm_eps = 0.0e+00
|
| 73 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 74 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 75 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 76 |
+
print_info: f_logit_scale = 0.0e+00
|
| 77 |
+
print_info: f_attn_scale = 0.0e+00
|
| 78 |
+
print_info: n_ff = 27648
|
| 79 |
+
print_info: n_expert = 0
|
| 80 |
+
print_info: n_expert_used = 0
|
| 81 |
+
print_info: n_expert_groups = 0
|
| 82 |
+
print_info: n_group_used = 0
|
| 83 |
+
print_info: causal attn = 1
|
| 84 |
+
print_info: pooling type = 0
|
| 85 |
+
print_info: rope type = 2
|
| 86 |
+
print_info: rope scaling = linear
|
| 87 |
+
print_info: freq_base_train = 10000000.0
|
| 88 |
+
print_info: freq_scale_train = 1
|
| 89 |
+
print_info: n_ctx_orig_yarn = 524288
|
| 90 |
+
print_info: rope_finetuned = unknown
|
| 91 |
+
print_info: model type = 36B
|
| 92 |
+
print_info: model params = 36.15 B
|
| 93 |
+
print_info: general.name = Seed OSS 36B Instruct Unsloth
|
| 94 |
+
print_info: vocab type = BPE
|
| 95 |
+
print_info: n_vocab = 155136
|
| 96 |
+
print_info: n_merges = 154737
|
| 97 |
+
print_info: BOS token = 0 '<seed:bos>'
|
| 98 |
+
print_info: EOS token = 2 '<seed:eos>'
|
| 99 |
+
print_info: PAD token = 1 '<seed:pad>'
|
| 100 |
+
print_info: LF token = 326 'Ċ'
|
| 101 |
+
print_info: EOG token = 2 '<seed:eos>'
|
| 102 |
+
print_info: max token length = 1024
|
| 103 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 104 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 105 |
+
load_tensors: offloaded 20/65 layers to GPU
|
| 106 |
+
load_tensors: CPU_Mapped model buffer size = 19911.93 MiB
|
| 107 |
+
load_tensors: CUDA0 model buffer size = 3879.21 MiB
|
| 108 |
+
load_tensors: CUDA1 model buffer size = 4741.26 MiB
|
| 109 |
+
...................................................................................................
|
| 110 |
+
llama_context: constructing llama_context
|
| 111 |
+
llama_context: n_seq_max = 1
|
| 112 |
+
llama_context: n_ctx = 2048
|
| 113 |
+
llama_context: n_ctx_seq = 2048
|
| 114 |
+
llama_context: n_batch = 2048
|
| 115 |
+
llama_context: n_ubatch = 512
|
| 116 |
+
llama_context: causal_attn = 1
|
| 117 |
+
llama_context: flash_attn = auto
|
| 118 |
+
llama_context: kv_unified = false
|
| 119 |
+
llama_context: freq_base = 10000000.0
|
| 120 |
+
llama_context: freq_scale = 1
|
| 121 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
|
| 122 |
+
llama_context: CPU output buffer size = 0.59 MiB
|
| 123 |
+
llama_kv_cache: CPU KV buffer size = 352.00 MiB
|
| 124 |
+
llama_kv_cache: CUDA0 KV buffer size = 72.00 MiB
|
| 125 |
+
llama_kv_cache: CUDA1 KV buffer size = 88.00 MiB
|
| 126 |
+
llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
|
| 127 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 128 |
+
llama_context: CUDA0 compute buffer size = 833.78 MiB
|
| 129 |
+
llama_context: CUDA1 compute buffer size = 194.01 MiB
|
| 130 |
+
llama_context: CUDA_Host compute buffer size = 14.01 MiB
|
| 131 |
+
llama_context: graph nodes = 2183
|
| 132 |
+
llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
|
| 133 |
+
common_init_from_params: added <seed:eos> logit bias = -inf
|
| 134 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 135 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 136 |
+
|
| 137 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 138 |
+
perplexity: tokenizing the input ..
|
| 139 |
+
perplexity: tokenization took 121.381 ms
|
| 140 |
+
perplexity: calculating perplexity over 48 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 141 |
+
perplexity: 8.93 seconds per pass - ETA 7.13 minutes
|
| 142 |
+
[1]1.5570,[2]1.4577,[3]1.2857,[4]1.2317,[5]1.1880,[6]1.2751,[7]1.3796,[8]1.4376,[9]1.4205,[10]1.3983,[11]1.3764,[12]1.3819,[13]1.3820,[14]1.3678,[15]1.3488,[16]1.3645,[17]1.3655,[18]1.3472,[19]1.3450,[20]1.3606,[21]1.3508,[22]1.3407,[23]1.3511,[24]1.3456,[25]1.3494,[26]1.3454,[27]1.3623,[28]1.3675,[29]1.3677,[30]1.3685,[31]1.3657,[32]1.3763,[33]1.3769,[34]1.3693,[35]1.3651,[36]1.3603,[37]1.3681,[38]1.3769,[39]1.3684,[40]1.3900,[41]1.3990,[42]1.4020,[43]1.4105,[44]1.4116,[45]1.4052,[46]1.4081,[47]1.4121,[48]1.4133,
|
| 143 |
+
Final estimate: PPL = 1.4133 +/- 0.00946
|
| 144 |
+
|
| 145 |
+
llama_perf_context_print: load time = 3785.94 ms
|
| 146 |
+
llama_perf_context_print: prompt eval time = 416431.87 ms / 98304 tokens ( 4.24 ms per token, 236.06 tokens per second)
|
| 147 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 148 |
+
llama_perf_context_print: total time = 418154.88 ms / 98305 tokens
|
| 149 |
+
llama_perf_context_print: graphs reused = 0
|
| 150 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 151 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 14172 + ( 4784 = 3879 + 72 + 833) + 5157 |
|
| 152 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 18360 + ( 5023 = 4741 + 88 + 194) + 740 |
|
| 153 |
+
llama_memory_breakdown_print: | - Host | 20277 = 19911 + 352 + 14 |
|
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_Q8_0-lm_head_Q5_K/perplexity_general.log
ADDED
|
@@ -0,0 +1,153 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 19144 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_Q8_0-lm_head_Q5_K.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = seed_oss
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = Seed-OSS
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 36B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
|
| 23 |
+
llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
|
| 24 |
+
llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
|
| 25 |
+
llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
|
| 26 |
+
llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
|
| 27 |
+
llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
|
| 28 |
+
llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
|
| 29 |
+
llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
|
| 30 |
+
llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 31 |
+
llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
|
| 32 |
+
llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
|
| 33 |
+
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
|
| 34 |
+
llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
|
| 35 |
+
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
|
| 42 |
+
llama_model_loader: - kv 31: general.quantization_version u32 = 2
|
| 43 |
+
llama_model_loader: - kv 32: general.file_type u32 = 25
|
| 44 |
+
llama_model_loader: - type f32: 321 tensors
|
| 45 |
+
llama_model_loader: - type q8_0: 128 tensors
|
| 46 |
+
llama_model_loader: - type q5_K: 65 tensors
|
| 47 |
+
llama_model_loader: - type iq4_nl: 257 tensors
|
| 48 |
+
print_info: file format = GGUF V3 (latest)
|
| 49 |
+
print_info: file type = IQ4_NL - 4.5 bpw
|
| 50 |
+
print_info: file size = 27.86 GiB (6.62 BPW)
|
| 51 |
+
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
|
| 52 |
+
load: printing all EOG tokens:
|
| 53 |
+
load: - 2 ('<seed:eos>')
|
| 54 |
+
load: special tokens cache size = 128
|
| 55 |
+
load: token to piece cache size = 0.9296 MB
|
| 56 |
+
print_info: arch = seed_oss
|
| 57 |
+
print_info: vocab_only = 0
|
| 58 |
+
print_info: n_ctx_train = 524288
|
| 59 |
+
print_info: n_embd = 5120
|
| 60 |
+
print_info: n_embd_inp = 5120
|
| 61 |
+
print_info: n_layer = 64
|
| 62 |
+
print_info: n_head = 80
|
| 63 |
+
print_info: n_head_kv = 8
|
| 64 |
+
print_info: n_rot = 128
|
| 65 |
+
print_info: n_swa = 0
|
| 66 |
+
print_info: is_swa_any = 0
|
| 67 |
+
print_info: n_embd_head_k = 128
|
| 68 |
+
print_info: n_embd_head_v = 128
|
| 69 |
+
print_info: n_gqa = 10
|
| 70 |
+
print_info: n_embd_k_gqa = 1024
|
| 71 |
+
print_info: n_embd_v_gqa = 1024
|
| 72 |
+
print_info: f_norm_eps = 0.0e+00
|
| 73 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 74 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 75 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 76 |
+
print_info: f_logit_scale = 0.0e+00
|
| 77 |
+
print_info: f_attn_scale = 0.0e+00
|
| 78 |
+
print_info: n_ff = 27648
|
| 79 |
+
print_info: n_expert = 0
|
| 80 |
+
print_info: n_expert_used = 0
|
| 81 |
+
print_info: n_expert_groups = 0
|
| 82 |
+
print_info: n_group_used = 0
|
| 83 |
+
print_info: causal attn = 1
|
| 84 |
+
print_info: pooling type = 0
|
| 85 |
+
print_info: rope type = 2
|
| 86 |
+
print_info: rope scaling = linear
|
| 87 |
+
print_info: freq_base_train = 10000000.0
|
| 88 |
+
print_info: freq_scale_train = 1
|
| 89 |
+
print_info: n_ctx_orig_yarn = 524288
|
| 90 |
+
print_info: rope_finetuned = unknown
|
| 91 |
+
print_info: model type = 36B
|
| 92 |
+
print_info: model params = 36.15 B
|
| 93 |
+
print_info: general.name = Seed OSS 36B Instruct Unsloth
|
| 94 |
+
print_info: vocab type = BPE
|
| 95 |
+
print_info: n_vocab = 155136
|
| 96 |
+
print_info: n_merges = 154737
|
| 97 |
+
print_info: BOS token = 0 '<seed:bos>'
|
| 98 |
+
print_info: EOS token = 2 '<seed:eos>'
|
| 99 |
+
print_info: PAD token = 1 '<seed:pad>'
|
| 100 |
+
print_info: LF token = 326 'Ċ'
|
| 101 |
+
print_info: EOG token = 2 '<seed:eos>'
|
| 102 |
+
print_info: max token length = 1024
|
| 103 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 104 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 105 |
+
load_tensors: offloaded 20/65 layers to GPU
|
| 106 |
+
load_tensors: CPU_Mapped model buffer size = 19911.93 MiB
|
| 107 |
+
load_tensors: CUDA0 model buffer size = 3879.21 MiB
|
| 108 |
+
load_tensors: CUDA1 model buffer size = 4741.26 MiB
|
| 109 |
+
...................................................................................................
|
| 110 |
+
llama_context: constructing llama_context
|
| 111 |
+
llama_context: n_seq_max = 1
|
| 112 |
+
llama_context: n_ctx = 2048
|
| 113 |
+
llama_context: n_ctx_seq = 2048
|
| 114 |
+
llama_context: n_batch = 2048
|
| 115 |
+
llama_context: n_ubatch = 512
|
| 116 |
+
llama_context: causal_attn = 1
|
| 117 |
+
llama_context: flash_attn = auto
|
| 118 |
+
llama_context: kv_unified = false
|
| 119 |
+
llama_context: freq_base = 10000000.0
|
| 120 |
+
llama_context: freq_scale = 1
|
| 121 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
|
| 122 |
+
llama_context: CPU output buffer size = 0.59 MiB
|
| 123 |
+
llama_kv_cache: CPU KV buffer size = 352.00 MiB
|
| 124 |
+
llama_kv_cache: CUDA0 KV buffer size = 72.00 MiB
|
| 125 |
+
llama_kv_cache: CUDA1 KV buffer size = 88.00 MiB
|
| 126 |
+
llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
|
| 127 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 128 |
+
llama_context: CUDA0 compute buffer size = 833.78 MiB
|
| 129 |
+
llama_context: CUDA1 compute buffer size = 194.01 MiB
|
| 130 |
+
llama_context: CUDA_Host compute buffer size = 14.01 MiB
|
| 131 |
+
llama_context: graph nodes = 2183
|
| 132 |
+
llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
|
| 133 |
+
common_init_from_params: added <seed:eos> logit bias = -inf
|
| 134 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 135 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 136 |
+
|
| 137 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 138 |
+
perplexity: tokenizing the input ..
|
| 139 |
+
perplexity: tokenization took 49.451 ms
|
| 140 |
+
perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 141 |
+
perplexity: 8.88 seconds per pass - ETA 2.22 minutes
|
| 142 |
+
[1]6.9263,[2]7.9946,[3]8.3471,[4]8.0938,[5]7.8933,[6]6.6467,[7]5.8691,[8]5.9351,[9]6.2024,[10]6.2615,[11]6.3906,[12]6.6929,[13]6.7168,[14]6.7946,[15]6.8037,
|
| 143 |
+
Final estimate: PPL = 6.8037 +/- 0.16387
|
| 144 |
+
|
| 145 |
+
llama_perf_context_print: load time = 3982.53 ms
|
| 146 |
+
llama_perf_context_print: prompt eval time = 130029.31 ms / 30720 tokens ( 4.23 ms per token, 236.25 tokens per second)
|
| 147 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 148 |
+
llama_perf_context_print: total time = 130712.15 ms / 30721 tokens
|
| 149 |
+
llama_perf_context_print: graphs reused = 0
|
| 150 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 151 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 14201 + ( 4784 = 3879 + 72 + 833) + 5128 |
|
| 152 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 18360 + ( 5023 = 4741 + 88 + 194) + 740 |
|
| 153 |
+
llama_memory_breakdown_print: | - Host | 20277 = 19911 + 352 + 14 |
|
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_Q8_0-lm_head_Q5_K/perplexity_math.log
ADDED
|
@@ -0,0 +1,153 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 19152 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_Q8_0-lm_head_Q5_K.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = seed_oss
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = Seed-OSS
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 36B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
|
| 23 |
+
llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
|
| 24 |
+
llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
|
| 25 |
+
llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
|
| 26 |
+
llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
|
| 27 |
+
llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
|
| 28 |
+
llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
|
| 29 |
+
llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
|
| 30 |
+
llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 31 |
+
llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
|
| 32 |
+
llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
|
| 33 |
+
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
|
| 34 |
+
llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
|
| 35 |
+
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
|
| 42 |
+
llama_model_loader: - kv 31: general.quantization_version u32 = 2
|
| 43 |
+
llama_model_loader: - kv 32: general.file_type u32 = 25
|
| 44 |
+
llama_model_loader: - type f32: 321 tensors
|
| 45 |
+
llama_model_loader: - type q8_0: 128 tensors
|
| 46 |
+
llama_model_loader: - type q5_K: 65 tensors
|
| 47 |
+
llama_model_loader: - type iq4_nl: 257 tensors
|
| 48 |
+
print_info: file format = GGUF V3 (latest)
|
| 49 |
+
print_info: file type = IQ4_NL - 4.5 bpw
|
| 50 |
+
print_info: file size = 27.86 GiB (6.62 BPW)
|
| 51 |
+
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
|
| 52 |
+
load: printing all EOG tokens:
|
| 53 |
+
load: - 2 ('<seed:eos>')
|
| 54 |
+
load: special tokens cache size = 128
|
| 55 |
+
load: token to piece cache size = 0.9296 MB
|
| 56 |
+
print_info: arch = seed_oss
|
| 57 |
+
print_info: vocab_only = 0
|
| 58 |
+
print_info: n_ctx_train = 524288
|
| 59 |
+
print_info: n_embd = 5120
|
| 60 |
+
print_info: n_embd_inp = 5120
|
| 61 |
+
print_info: n_layer = 64
|
| 62 |
+
print_info: n_head = 80
|
| 63 |
+
print_info: n_head_kv = 8
|
| 64 |
+
print_info: n_rot = 128
|
| 65 |
+
print_info: n_swa = 0
|
| 66 |
+
print_info: is_swa_any = 0
|
| 67 |
+
print_info: n_embd_head_k = 128
|
| 68 |
+
print_info: n_embd_head_v = 128
|
| 69 |
+
print_info: n_gqa = 10
|
| 70 |
+
print_info: n_embd_k_gqa = 1024
|
| 71 |
+
print_info: n_embd_v_gqa = 1024
|
| 72 |
+
print_info: f_norm_eps = 0.0e+00
|
| 73 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 74 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 75 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 76 |
+
print_info: f_logit_scale = 0.0e+00
|
| 77 |
+
print_info: f_attn_scale = 0.0e+00
|
| 78 |
+
print_info: n_ff = 27648
|
| 79 |
+
print_info: n_expert = 0
|
| 80 |
+
print_info: n_expert_used = 0
|
| 81 |
+
print_info: n_expert_groups = 0
|
| 82 |
+
print_info: n_group_used = 0
|
| 83 |
+
print_info: causal attn = 1
|
| 84 |
+
print_info: pooling type = 0
|
| 85 |
+
print_info: rope type = 2
|
| 86 |
+
print_info: rope scaling = linear
|
| 87 |
+
print_info: freq_base_train = 10000000.0
|
| 88 |
+
print_info: freq_scale_train = 1
|
| 89 |
+
print_info: n_ctx_orig_yarn = 524288
|
| 90 |
+
print_info: rope_finetuned = unknown
|
| 91 |
+
print_info: model type = 36B
|
| 92 |
+
print_info: model params = 36.15 B
|
| 93 |
+
print_info: general.name = Seed OSS 36B Instruct Unsloth
|
| 94 |
+
print_info: vocab type = BPE
|
| 95 |
+
print_info: n_vocab = 155136
|
| 96 |
+
print_info: n_merges = 154737
|
| 97 |
+
print_info: BOS token = 0 '<seed:bos>'
|
| 98 |
+
print_info: EOS token = 2 '<seed:eos>'
|
| 99 |
+
print_info: PAD token = 1 '<seed:pad>'
|
| 100 |
+
print_info: LF token = 326 'Ċ'
|
| 101 |
+
print_info: EOG token = 2 '<seed:eos>'
|
| 102 |
+
print_info: max token length = 1024
|
| 103 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 104 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 105 |
+
load_tensors: offloaded 20/65 layers to GPU
|
| 106 |
+
load_tensors: CPU_Mapped model buffer size = 19911.93 MiB
|
| 107 |
+
load_tensors: CUDA0 model buffer size = 3879.21 MiB
|
| 108 |
+
load_tensors: CUDA1 model buffer size = 4741.26 MiB
|
| 109 |
+
...................................................................................................
|
| 110 |
+
llama_context: constructing llama_context
|
| 111 |
+
llama_context: n_seq_max = 1
|
| 112 |
+
llama_context: n_ctx = 2048
|
| 113 |
+
llama_context: n_ctx_seq = 2048
|
| 114 |
+
llama_context: n_batch = 2048
|
| 115 |
+
llama_context: n_ubatch = 512
|
| 116 |
+
llama_context: causal_attn = 1
|
| 117 |
+
llama_context: flash_attn = auto
|
| 118 |
+
llama_context: kv_unified = false
|
| 119 |
+
llama_context: freq_base = 10000000.0
|
| 120 |
+
llama_context: freq_scale = 1
|
| 121 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
|
| 122 |
+
llama_context: CPU output buffer size = 0.59 MiB
|
| 123 |
+
llama_kv_cache: CPU KV buffer size = 352.00 MiB
|
| 124 |
+
llama_kv_cache: CUDA0 KV buffer size = 72.00 MiB
|
| 125 |
+
llama_kv_cache: CUDA1 KV buffer size = 88.00 MiB
|
| 126 |
+
llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
|
| 127 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 128 |
+
llama_context: CUDA0 compute buffer size = 833.78 MiB
|
| 129 |
+
llama_context: CUDA1 compute buffer size = 194.01 MiB
|
| 130 |
+
llama_context: CUDA_Host compute buffer size = 14.01 MiB
|
| 131 |
+
llama_context: graph nodes = 2183
|
| 132 |
+
llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
|
| 133 |
+
common_init_from_params: added <seed:eos> logit bias = -inf
|
| 134 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 135 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 136 |
+
|
| 137 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 138 |
+
perplexity: tokenizing the input ..
|
| 139 |
+
perplexity: tokenization took 45.858 ms
|
| 140 |
+
perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 141 |
+
perplexity: 8.86 seconds per pass - ETA 2.35 minutes
|
| 142 |
+
[1]2.7024,[2]2.8391,[3]3.2683,[4]3.5229,[5]4.0384,[6]4.3092,[7]4.5171,[8]4.6441,[9]4.7886,[10]4.9372,[11]5.0187,[12]5.0935,[13]5.2285,[14]5.3377,[15]5.3687,[16]5.3769,
|
| 143 |
+
Final estimate: PPL = 5.3769 +/- 0.11787
|
| 144 |
+
|
| 145 |
+
llama_perf_context_print: load time = 3518.68 ms
|
| 146 |
+
llama_perf_context_print: prompt eval time = 138566.37 ms / 32768 tokens ( 4.23 ms per token, 236.48 tokens per second)
|
| 147 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 148 |
+
llama_perf_context_print: total time = 139097.25 ms / 32769 tokens
|
| 149 |
+
llama_perf_context_print: graphs reused = 0
|
| 150 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 151 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 14169 + ( 4784 = 3879 + 72 + 833) + 5160 |
|
| 152 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 18360 + ( 5023 = 4741 + 88 + 194) + 740 |
|
| 153 |
+
llama_memory_breakdown_print: | - Host | 20277 = 19911 + 352 + 14 |
|
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/bench_metrics.json
ADDED
|
@@ -0,0 +1,44 @@
+{
+  "raw_metrics": {
+    "llamabench": {
+      "backend": "CUDA",
+      "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/llamabench.md",
+      "ngl": "35",
+      "raw_row": {
+        "backend": "CUDA",
+        "model": "seed_oss 36B IQ4_NL - 4.5 bpw",
+        "ngl": "35",
+        "params": "36.15 B",
+        "size": "19.04 GiB",
+        "t/s": "28.78 \u00b1 3.16",
+        "test": "pp8",
+        "tps_value": 28.78
+      },
+      "test": "pp8",
+      "tps": 28.78
+    },
+    "perplexity": {
+      "code": {
+        "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_code.log",
+        "ppl": 1.4161,
+        "ppl_error": 0.00948
+      },
+      "general": {
+        "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_general.log",
+        "ppl": 6.8715,
+        "ppl_error": 0.16547
+      },
+      "math": {
+        "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_math.log",
+        "ppl": 5.4643,
+        "ppl_error": 0.12019
+      }
+    }
+  },
+  "summary": {
+    "avg_prec_loss_pct": 0.2769,
+    "bench_tps": 28.78,
+    "file_size_bytes": 20445551392,
+    "file_size_gb": 19.04
+  }
+}
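With one bench_metrics.json per quantization variant, the summaries can be compared directly. Note that `file_size_gb` appears to be bytes divided by 1024^3 (20445551392 / 1024**3 ≈ 19.04 and 29924678432 / 1024**3 ≈ 27.87), i.e. GiB despite the field name. A hypothetical aggregation sketch over the directory layout used here:

```python
import json
from pathlib import Path

# Hypothetical comparison script; it only assumes the
# Benchmarks/DataCollection/<variant>/bench_metrics.json layout seen in this diff.
root = Path("Benchmarks/DataCollection")
rows = []
for metrics_file in sorted(root.glob("*/bench_metrics.json")):
    data = json.loads(metrics_file.read_text())
    s = data["summary"]
    ppl = data["raw_metrics"]["perplexity"]
    rows.append((
        metrics_file.parent.name,
        s["file_size_bytes"] / 1024**3,  # e.g. 20445551392 -> ~19.04 (GiB)
        s["bench_tps"],
        s["avg_prec_loss_pct"],
        ppl["general"]["ppl"],
    ))

# Sort by average precision loss so the least-degraded variants come first.
for name, gib, tps, loss, gen_ppl in sorted(rows, key=lambda r: r[3]):
    print(f"{gib:6.2f} GiB  {tps:6.2f} t/s  {loss:6.3f} %  ppl(general)={gen_ppl:.4f}  {name}")
```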
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/llamabench.md
ADDED
|
@@ -0,0 +1,11 @@
+ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+| model | size | params | backend | ngl | test | t/s |
+| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
+| seed_oss 36B IQ4_NL - 4.5 bpw | 19.04 GiB | 36.15 B | CUDA | 35 | pp8 | 28.78 ± 3.16 |
+| seed_oss 36B IQ4_NL - 4.5 bpw | 19.04 GiB | 36.15 B | CUDA | 35 | tg128 | 5.28 ± 0.05 |
+
+build: 92bb442ad (7040)
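For a rough feel of what the tg128 numbers mean in practice, the two variants tabled above trade roughly 9 GiB of file size for a ~1.6x faster generation rate on this 2x RTX 3090 setup. A small arithmetic sketch using the measured rates (values copied from the two tables):

```python
# Quick arithmetic from the tg128 rows above; rates are measured values, not estimates.
variants = {
    "ffn_up_gate_Q8_0 / lm_head_Q5_K (27.86 GiB)": 3.36,
    "embeddings_Q5_K / lm_head_IQ4_NL (19.04 GiB)": 5.28,
}
n_tokens = 128
for name, tps in variants.items():
    print(f"{name}: {n_tokens} tokens in ~{n_tokens / tps:.0f} s")
# ~38 s vs ~24 s for a 128-token generation at these rates.
```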
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_code.log
ADDED
|
@@ -0,0 +1,152 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20350 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = seed_oss
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = Seed-OSS
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 36B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
|
| 23 |
+
llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
|
| 24 |
+
llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
|
| 25 |
+
llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
|
| 26 |
+
llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
|
| 27 |
+
llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
|
| 28 |
+
llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
|
| 29 |
+
llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
|
| 30 |
+
llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 31 |
+
llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
|
| 32 |
+
llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
|
| 33 |
+
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
|
| 34 |
+
llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
|
| 35 |
+
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
|
| 42 |
+
llama_model_loader: - kv 31: general.quantization_version u32 = 2
|
| 43 |
+
llama_model_loader: - kv 32: general.file_type u32 = 25
|
| 44 |
+
llama_model_loader: - type f32: 321 tensors
|
| 45 |
+
llama_model_loader: - type q5_K: 1 tensors
|
| 46 |
+
llama_model_loader: - type iq4_nl: 449 tensors
|
| 47 |
+
print_info: file format = GGUF V3 (latest)
|
| 48 |
+
print_info: file type = IQ4_NL - 4.5 bpw
|
| 49 |
+
print_info: file size = 19.04 GiB (4.52 BPW)
|
| 50 |
+
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
|
| 51 |
+
load: printing all EOG tokens:
|
| 52 |
+
load: - 2 ('<seed:eos>')
|
| 53 |
+
load: special tokens cache size = 128
|
| 54 |
+
load: token to piece cache size = 0.9296 MB
|
| 55 |
+
print_info: arch = seed_oss
|
| 56 |
+
print_info: vocab_only = 0
|
| 57 |
+
print_info: n_ctx_train = 524288
|
| 58 |
+
print_info: n_embd = 5120
|
| 59 |
+
print_info: n_embd_inp = 5120
|
| 60 |
+
print_info: n_layer = 64
|
| 61 |
+
print_info: n_head = 80
|
| 62 |
+
print_info: n_head_kv = 8
|
| 63 |
+
print_info: n_rot = 128
|
| 64 |
+
print_info: n_swa = 0
|
| 65 |
+
print_info: is_swa_any = 0
|
| 66 |
+
print_info: n_embd_head_k = 128
|
| 67 |
+
print_info: n_embd_head_v = 128
|
| 68 |
+
print_info: n_gqa = 10
|
| 69 |
+
print_info: n_embd_k_gqa = 1024
|
| 70 |
+
print_info: n_embd_v_gqa = 1024
|
| 71 |
+
print_info: f_norm_eps = 0.0e+00
|
| 72 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 73 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 74 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 75 |
+
print_info: f_logit_scale = 0.0e+00
|
| 76 |
+
print_info: f_attn_scale = 0.0e+00
|
| 77 |
+
print_info: n_ff = 27648
|
| 78 |
+
print_info: n_expert = 0
|
| 79 |
+
print_info: n_expert_used = 0
|
| 80 |
+
print_info: n_expert_groups = 0
|
| 81 |
+
print_info: n_group_used = 0
|
| 82 |
+
print_info: causal attn = 1
|
| 83 |
+
print_info: pooling type = 0
|
| 84 |
+
print_info: rope type = 2
|
| 85 |
+
print_info: rope scaling = linear
|
| 86 |
+
print_info: freq_base_train = 10000000.0
|
| 87 |
+
print_info: freq_scale_train = 1
|
| 88 |
+
print_info: n_ctx_orig_yarn = 524288
|
| 89 |
+
print_info: rope_finetuned = unknown
|
| 90 |
+
print_info: model type = 36B
|
| 91 |
+
print_info: model params = 36.15 B
|
| 92 |
+
print_info: general.name = Seed OSS 36B Instruct Unsloth
|
| 93 |
+
print_info: vocab type = BPE
|
| 94 |
+
print_info: n_vocab = 155136
|
| 95 |
+
print_info: n_merges = 154737
|
| 96 |
+
print_info: BOS token = 0 '<seed:bos>'
|
| 97 |
+
print_info: EOS token = 2 '<seed:eos>'
|
| 98 |
+
print_info: PAD token = 1 '<seed:pad>'
|
| 99 |
+
print_info: LF token = 326 'Ċ'
|
| 100 |
+
print_info: EOG token = 2 '<seed:eos>'
|
| 101 |
+
print_info: max token length = 1024
|
| 102 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 103 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 104 |
+
load_tensors: offloaded 20/65 layers to GPU
|
| 105 |
+
load_tensors: CPU_Mapped model buffer size = 13696.93 MiB
|
| 106 |
+
load_tensors: CUDA0 model buffer size = 2897.73 MiB
|
| 107 |
+
load_tensors: CUDA1 model buffer size = 2897.73 MiB
|
| 108 |
+
..................................................................................................
|
| 109 |
+
llama_context: constructing llama_context
|
| 110 |
+
llama_context: n_seq_max = 1
|
| 111 |
+
llama_context: n_ctx = 2048
|
| 112 |
+
llama_context: n_ctx_seq = 2048
|
| 113 |
+
llama_context: n_batch = 2048
|
| 114 |
+
llama_context: n_ubatch = 512
|
| 115 |
+
llama_context: causal_attn = 1
|
| 116 |
+
llama_context: flash_attn = auto
|
| 117 |
+
llama_context: kv_unified = false
|
| 118 |
+
llama_context: freq_base = 10000000.0
|
| 119 |
+
llama_context: freq_scale = 1
|
| 120 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
|
| 121 |
+
llama_context: CPU output buffer size = 0.59 MiB
|
| 122 |
+
llama_kv_cache: CPU KV buffer size = 352.00 MiB
|
| 123 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 124 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 125 |
+
llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
|
| 126 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 127 |
+
llama_context: CUDA0 compute buffer size = 739.09 MiB
|
| 128 |
+
llama_context: CUDA1 compute buffer size = 194.01 MiB
|
| 129 |
+
llama_context: CUDA_Host compute buffer size = 14.01 MiB
|
| 130 |
+
llama_context: graph nodes = 2183
|
| 131 |
+
llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
|
| 132 |
+
common_init_from_params: added <seed:eos> logit bias = -inf
|
| 133 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 134 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 135 |
+
|
| 136 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 137 |
+
perplexity: tokenizing the input ..
|
| 138 |
+
perplexity: tokenization took 113.916 ms
|
| 139 |
+
perplexity: calculating perplexity over 48 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 140 |
+
perplexity: 6.62 seconds per pass - ETA 5.28 minutes
|
| 141 |
+
[1]1.5690,[2]1.4709,[3]1.2935,[4]1.2379,[5]1.1930,[6]1.2804,[7]1.3859,[8]1.4444,[9]1.4266,[10]1.4032,[11]1.3804,[12]1.3856,[13]1.3859,[14]1.3713,[15]1.3522,[16]1.3675,[17]1.3691,[18]1.3504,[19]1.3479,[20]1.3635,[21]1.3537,[22]1.3435,[23]1.3540,[24]1.3484,[25]1.3513,[26]1.3473,[27]1.3644,[28]1.3698,[29]1.3702,[30]1.3712,[31]1.3684,[32]1.3794,[33]1.3801,[34]1.3725,[35]1.3682,[36]1.3631,[37]1.3710,[38]1.3798,[39]1.3712,[40]1.3934,[41]1.4024,[42]1.4055,[43]1.4139,[44]1.4150,[45]1.4083,[46]1.4112,[47]1.4148,[48]1.4161,
|
| 142 |
+
Final estimate: PPL = 1.4161 +/- 0.00948
|
| 143 |
+
|
| 144 |
+
llama_perf_context_print: load time = 2643.34 ms
|
| 145 |
+
llama_perf_context_print: prompt eval time = 306009.13 ms / 98304 tokens ( 3.11 ms per token, 321.25 tokens per second)
|
| 146 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 147 |
+
llama_perf_context_print: total time = 307563.05 ms / 98305 tokens
|
| 148 |
+
llama_perf_context_print: graphs reused = 0
|
| 149 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 150 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16500 + ( 3716 = 2897 + 80 + 739) + 3898 |
|
| 151 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20212 + ( 3171 = 2897 + 80 + 194) + 739 |
|
| 152 |
+
llama_memory_breakdown_print: | - Host | 14062 = 13696 + 352 + 14 |
|
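In these llama.cpp perplexity logs, the bracketed series is the running perplexity estimate after each 2048-token chunk, so the last bracketed value matches the `Final estimate` line. A minimal sketch (not part of this repo, file name assumed) for scraping that final figure out of a log such as the one above:

```python
# Hypothetical helper: read "Final estimate: PPL = X +/- Y" from a perplexity log.
import re

def final_ppl(log_text: str) -> tuple[float, float] | None:
    m = re.search(r"Final estimate: PPL = ([0-9.]+) \+/- ([0-9.]+)", log_text)
    return (float(m.group(1)), float(m.group(2))) if m else None

# Example on the log above:
# final_ppl(open("perplexity_code.log").read()) -> (1.4161, 0.00948)
```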
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_general.log
ADDED
@@ -0,0 +1,152 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20255 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = seed_oss
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
llama_model_loader: - kv 4: general.basename str = Seed-OSS
llama_model_loader: - kv 5: general.size_label str = 36B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
llama_model_loader: - kv 31: general.quantization_version u32 = 2
llama_model_loader: - kv 32: general.file_type u32 = 25
llama_model_loader: - type f32: 321 tensors
llama_model_loader: - type q5_K: 1 tensors
llama_model_loader: - type iq4_nl: 449 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = IQ4_NL - 4.5 bpw
print_info: file size = 19.04 GiB (4.52 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 2 ('<seed:eos>')
load: special tokens cache size = 128
load: token to piece cache size = 0.9296 MB
print_info: arch = seed_oss
print_info: vocab_only = 0
print_info: n_ctx_train = 524288
print_info: n_embd = 5120
print_info: n_embd_inp = 5120
print_info: n_layer = 64
print_info: n_head = 80
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 10
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 27648
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 10000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 524288
print_info: rope_finetuned = unknown
print_info: model type = 36B
print_info: model params = 36.15 B
print_info: general.name = Seed OSS 36B Instruct Unsloth
print_info: vocab type = BPE
print_info: n_vocab = 155136
print_info: n_merges = 154737
print_info: BOS token = 0 '<seed:bos>'
print_info: EOS token = 2 '<seed:eos>'
print_info: PAD token = 1 '<seed:pad>'
print_info: LF token = 326 'Ċ'
print_info: EOG token = 2 '<seed:eos>'
print_info: max token length = 1024
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 20 repeating layers to GPU
load_tensors: offloaded 20/65 layers to GPU
load_tensors: CPU_Mapped model buffer size = 13696.93 MiB
load_tensors: CUDA0 model buffer size = 2897.73 MiB
load_tensors: CUDA1 model buffer size = 2897.73 MiB
..................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 10000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.59 MiB
llama_kv_cache: CPU KV buffer size = 352.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CUDA0 compute buffer size = 739.09 MiB
llama_context: CUDA1 compute buffer size = 194.01 MiB
llama_context: CUDA_Host compute buffer size = 14.01 MiB
llama_context: graph nodes = 2183
llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
common_init_from_params: added <seed:eos> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 50.4 ms
perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 6.59 seconds per pass - ETA 1.63 minutes
[1]7.0029,[2]8.1095,[3]8.5099,[4]8.2651,[5]8.0598,[6]6.7509,[7]5.9414,[8]5.9974,[9]6.2556,[10]6.3166,[11]6.4537,[12]6.7592,[13]6.7836,[14]6.8633,[15]6.8715,
Final estimate: PPL = 6.8715 +/- 0.16547

llama_perf_context_print: load time = 2509.91 ms
llama_perf_context_print: prompt eval time = 95548.39 ms / 30720 tokens ( 3.11 ms per token, 321.51 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 96051.84 ms / 30721 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16436 + ( 3716 = 2897 + 80 + 739) + 3961 |
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20212 + ( 3171 = 2897 + 80 + 194) + 739 |
llama_memory_breakdown_print: | - Host | 14062 = 13696 + 352 + 14 |
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_math.log
ADDED
@@ -0,0 +1,152 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20416 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = seed_oss
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
llama_model_loader: - kv 4: general.basename str = Seed-OSS
llama_model_loader: - kv 5: general.size_label str = 36B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
llama_model_loader: - kv 31: general.quantization_version u32 = 2
llama_model_loader: - kv 32: general.file_type u32 = 25
llama_model_loader: - type f32: 321 tensors
llama_model_loader: - type q5_K: 1 tensors
llama_model_loader: - type iq4_nl: 449 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = IQ4_NL - 4.5 bpw
print_info: file size = 19.04 GiB (4.52 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 2 ('<seed:eos>')
load: special tokens cache size = 128
load: token to piece cache size = 0.9296 MB
print_info: arch = seed_oss
print_info: vocab_only = 0
print_info: n_ctx_train = 524288
print_info: n_embd = 5120
print_info: n_embd_inp = 5120
print_info: n_layer = 64
print_info: n_head = 80
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 10
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 27648
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 10000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 524288
print_info: rope_finetuned = unknown
print_info: model type = 36B
print_info: model params = 36.15 B
print_info: general.name = Seed OSS 36B Instruct Unsloth
print_info: vocab type = BPE
print_info: n_vocab = 155136
print_info: n_merges = 154737
print_info: BOS token = 0 '<seed:bos>'
print_info: EOS token = 2 '<seed:eos>'
print_info: PAD token = 1 '<seed:pad>'
print_info: LF token = 326 'Ċ'
print_info: EOG token = 2 '<seed:eos>'
print_info: max token length = 1024
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 20 repeating layers to GPU
load_tensors: offloaded 20/65 layers to GPU
load_tensors: CPU_Mapped model buffer size = 13696.93 MiB
load_tensors: CUDA0 model buffer size = 2897.73 MiB
load_tensors: CUDA1 model buffer size = 2897.73 MiB
..................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 10000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.59 MiB
llama_kv_cache: CPU KV buffer size = 352.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CUDA0 compute buffer size = 739.09 MiB
llama_context: CUDA1 compute buffer size = 194.01 MiB
llama_context: CUDA_Host compute buffer size = 14.01 MiB
llama_context: graph nodes = 2183
llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
common_init_from_params: added <seed:eos> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 49.049 ms
perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 6.64 seconds per pass - ETA 1.77 minutes
[1]2.7786,[2]2.9173,[3]3.3383,[4]3.6012,[5]4.1247,[6]4.3961,[7]4.6081,[8]4.7268,[9]4.8720,[10]5.0228,[11]5.0974,[12]5.1771,[13]5.3097,[14]5.4174,[15]5.4507,[16]5.4643,
Final estimate: PPL = 5.4643 +/- 0.12019

llama_perf_context_print: load time = 2506.68 ms
llama_perf_context_print: prompt eval time = 102309.14 ms / 32768 tokens ( 3.12 ms per token, 320.28 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 102843.30 ms / 32769 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16414 + ( 3716 = 2897 + 80 + 739) + 3984 |
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20212 + ( 3171 = 2897 + 80 + 194) + 739 |
llama_memory_breakdown_print: | - Host | 14062 = 13696 + 352 + 14 |
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/bench_metrics.json
ADDED
@@ -0,0 +1,44 @@
{
  "raw_metrics": {
    "llamabench": {
      "backend": "CUDA",
      "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/llamabench.md",
      "ngl": "35",
      "raw_row": {
        "backend": "CUDA",
        "model": "seed_oss 36B IQ4_NL - 4.5 bpw",
        "ngl": "35",
        "params": "36.15 B",
        "size": "19.13 GiB",
        "t/s": "26.86 \u00b1 2.90",
        "test": "pp8",
        "tps_value": 26.86
      },
      "test": "pp8",
      "tps": 26.86
    },
    "perplexity": {
      "code": {
        "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_code.log",
        "ppl": 1.4159,
        "ppl_error": 0.00948
      },
      "general": {
        "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_general.log",
        "ppl": 6.8703,
        "ppl_error": 0.16545
      },
      "math": {
        "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_math.log",
        "ppl": 5.4647,
        "ppl_error": 0.12022
      }
    }
  },
  "summary": {
    "avg_prec_loss_pct": 0.2805,
    "bench_tps": 26.86,
    "file_size_bytes": 20551043872,
    "file_size_gb": 19.14
  }
}
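Each variant's bench_metrics.json nests the raw llama-bench row and per-dataset perplexity readings under "raw_metrics" and rolls them up under "summary". A minimal sketch (not part of this repo) for gathering those summary blocks across the variant directories into one comparison list, assuming the Benchmarks/DataCollection/<variant>/bench_metrics.json layout used here:

```python
# Hypothetical helper: collect every variant's "summary" block for comparison.
import json
from pathlib import Path

def collect_summaries(root: str = "Benchmarks/DataCollection") -> list[dict]:
    rows = []
    for metrics_file in sorted(Path(root).glob("*/bench_metrics.json")):
        data = json.loads(metrics_file.read_text(encoding="utf-8"))
        rows.append({"variant": metrics_file.parent.name, **data.get("summary", {})})
    return rows

# Each row then carries avg_prec_loss_pct, bench_tps, file_size_bytes and file_size_gb
# for one quantization mix, e.g. {"variant": "...-embeddings_Q6_K-...", "bench_tps": 26.86, ...}
```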
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/llamabench.md
ADDED
@@ -0,0 +1,11 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| seed_oss 36B IQ4_NL - 4.5 bpw | 19.13 GiB | 36.15 B | CUDA | 35 | pp8 | 26.86 ± 2.90 |
| seed_oss 36B IQ4_NL - 4.5 bpw | 19.13 GiB | 36.15 B | CUDA | 35 | tg128 | 5.33 ± 0.01 |

build: 92bb442ad (7040)
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_code.log
ADDED
@@ -0,0 +1,152 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20410 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = seed_oss
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
llama_model_loader: - kv 4: general.basename str = Seed-OSS
llama_model_loader: - kv 5: general.size_label str = 36B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
llama_model_loader: - kv 31: general.quantization_version u32 = 2
llama_model_loader: - kv 32: general.file_type u32 = 25
llama_model_loader: - type f32: 321 tensors
llama_model_loader: - type q6_K: 1 tensors
llama_model_loader: - type iq4_nl: 449 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = IQ4_NL - 4.5 bpw
print_info: file size = 19.13 GiB (4.55 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 2 ('<seed:eos>')
load: special tokens cache size = 128
load: token to piece cache size = 0.9296 MB
print_info: arch = seed_oss
print_info: vocab_only = 0
print_info: n_ctx_train = 524288
print_info: n_embd = 5120
print_info: n_embd_inp = 5120
print_info: n_layer = 64
print_info: n_head = 80
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 10
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 27648
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 10000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 524288
print_info: rope_finetuned = unknown
print_info: model type = 36B
print_info: model params = 36.15 B
print_info: general.name = Seed OSS 36B Instruct Unsloth
print_info: vocab type = BPE
print_info: n_vocab = 155136
print_info: n_merges = 154737
print_info: BOS token = 0 '<seed:bos>'
print_info: EOS token = 2 '<seed:eos>'
print_info: PAD token = 1 '<seed:pad>'
print_info: LF token = 326 'Ċ'
print_info: EOG token = 2 '<seed:eos>'
print_info: max token length = 1024
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 20 repeating layers to GPU
load_tensors: offloaded 20/65 layers to GPU
load_tensors: CPU_Mapped model buffer size = 13797.53 MiB
load_tensors: CUDA0 model buffer size = 2897.73 MiB
load_tensors: CUDA1 model buffer size = 2897.73 MiB
.................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 10000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.59 MiB
llama_kv_cache: CPU KV buffer size = 352.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CUDA0 compute buffer size = 739.09 MiB
llama_context: CUDA1 compute buffer size = 194.01 MiB
llama_context: CUDA_Host compute buffer size = 14.01 MiB
llama_context: graph nodes = 2183
llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
common_init_from_params: added <seed:eos> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 112.004 ms
perplexity: calculating perplexity over 48 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 6.48 seconds per pass - ETA 5.18 minutes
[1]1.5664,[2]1.4694,[3]1.2926,[4]1.2373,[5]1.1926,[6]1.2807,[7]1.3863,[8]1.4452,[9]1.4274,[10]1.4039,[11]1.3810,[12]1.3861,[13]1.3865,[14]1.3716,[15]1.3526,[16]1.3677,[17]1.3693,[18]1.3506,[19]1.3481,[20]1.3637,[21]1.3539,[22]1.3437,[23]1.3542,[24]1.3486,[25]1.3515,[26]1.3473,[27]1.3644,[28]1.3698,[29]1.3702,[30]1.3711,[31]1.3684,[32]1.3793,[33]1.3800,[34]1.3723,[35]1.3680,[36]1.3630,[37]1.3709,[38]1.3797,[39]1.3710,[40]1.3932,[41]1.4022,[42]1.4052,[43]1.4136,[44]1.4147,[45]1.4080,[46]1.4109,[47]1.4146,[48]1.4159,
Final estimate: PPL = 1.4159 +/- 0.00948

llama_perf_context_print: load time = 2495.13 ms
llama_perf_context_print: prompt eval time = 299791.84 ms / 98304 tokens ( 3.05 ms per token, 327.91 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 301299.08 ms / 98305 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16504 + ( 3716 = 2897 + 80 + 739) + 3893 |
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20212 + ( 3171 = 2897 + 80 + 194) + 739 |
llama_memory_breakdown_print: | - Host | 14163 = 13797 + 352 + 14 |
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_general.log
ADDED
@@ -0,0 +1,152 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20404 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = seed_oss
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
llama_model_loader: - kv 4: general.basename str = Seed-OSS
llama_model_loader: - kv 5: general.size_label str = 36B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
llama_model_loader: - kv 31: general.quantization_version u32 = 2
llama_model_loader: - kv 32: general.file_type u32 = 25
llama_model_loader: - type f32: 321 tensors
llama_model_loader: - type q6_K: 1 tensors
llama_model_loader: - type iq4_nl: 449 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = IQ4_NL - 4.5 bpw
print_info: file size = 19.13 GiB (4.55 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 2 ('<seed:eos>')
load: special tokens cache size = 128
load: token to piece cache size = 0.9296 MB
print_info: arch = seed_oss
print_info: vocab_only = 0
print_info: n_ctx_train = 524288
print_info: n_embd = 5120
print_info: n_embd_inp = 5120
print_info: n_layer = 64
print_info: n_head = 80
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 10
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 27648
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 10000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 524288
print_info: rope_finetuned = unknown
print_info: model type = 36B
print_info: model params = 36.15 B
print_info: general.name = Seed OSS 36B Instruct Unsloth
print_info: vocab type = BPE
print_info: n_vocab = 155136
print_info: n_merges = 154737
print_info: BOS token = 0 '<seed:bos>'
print_info: EOS token = 2 '<seed:eos>'
print_info: PAD token = 1 '<seed:pad>'
print_info: LF token = 326 'Ċ'
print_info: EOG token = 2 '<seed:eos>'
print_info: max token length = 1024
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 20 repeating layers to GPU
load_tensors: offloaded 20/65 layers to GPU
load_tensors: CPU_Mapped model buffer size = 13797.53 MiB
load_tensors: CUDA0 model buffer size = 2897.73 MiB
load_tensors: CUDA1 model buffer size = 2897.73 MiB
.................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 10000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.59 MiB
llama_kv_cache: CPU KV buffer size = 352.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CUDA0 compute buffer size = 739.09 MiB
llama_context: CUDA1 compute buffer size = 194.01 MiB
llama_context: CUDA_Host compute buffer size = 14.01 MiB
llama_context: graph nodes = 2183
llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
common_init_from_params: added <seed:eos> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 49.617 ms
perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 6.44 seconds per pass - ETA 1.60 minutes
[1]7.0036,[2]8.1097,[3]8.5035,[4]8.2616,[5]8.0590,[6]6.7519,[7]5.9419,[8]5.9978,[9]6.2567,[10]6.3171,[11]6.4548,[12]6.7584,[13]6.7832,[14]6.8623,[15]6.8703,
Final estimate: PPL = 6.8703 +/- 0.16545

llama_perf_context_print: load time = 2488.76 ms
llama_perf_context_print: prompt eval time = 93409.83 ms / 30720 tokens ( 3.04 ms per token, 328.87 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 93894.78 ms / 30721 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16496 + ( 3716 = 2897 + 80 + 739) + 3901 |
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20212 + ( 3171 = 2897 + 80 + 194) + 739 |
llama_memory_breakdown_print: | - Host | 14163 = 13797 + 352 + 14 |
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_math.log
ADDED
@@ -0,0 +1,152 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20420 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = seed_oss
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
llama_model_loader: - kv 4: general.basename str = Seed-OSS
llama_model_loader: - kv 5: general.size_label str = 36B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
llama_model_loader: - kv 31: general.quantization_version u32 = 2
llama_model_loader: - kv 32: general.file_type u32 = 25
llama_model_loader: - type f32: 321 tensors
llama_model_loader: - type q6_K: 1 tensors
llama_model_loader: - type iq4_nl: 449 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = IQ4_NL - 4.5 bpw
print_info: file size = 19.13 GiB (4.55 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 2 ('<seed:eos>')
load: special tokens cache size = 128
load: token to piece cache size = 0.9296 MB
print_info: arch = seed_oss
print_info: vocab_only = 0
print_info: n_ctx_train = 524288
print_info: n_embd = 5120
print_info: n_embd_inp = 5120
print_info: n_layer = 64
print_info: n_head = 80
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 10
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 27648
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 10000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 524288
print_info: rope_finetuned = unknown
print_info: model type = 36B
print_info: model params = 36.15 B
print_info: general.name = Seed OSS 36B Instruct Unsloth
print_info: vocab type = BPE
print_info: n_vocab = 155136
print_info: n_merges = 154737
print_info: BOS token = 0 '<seed:bos>'
print_info: EOS token = 2 '<seed:eos>'
print_info: PAD token = 1 '<seed:pad>'
print_info: LF token = 326 'Ċ'
print_info: EOG token = 2 '<seed:eos>'
print_info: max token length = 1024
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 20 repeating layers to GPU
load_tensors: offloaded 20/65 layers to GPU
load_tensors: CPU_Mapped model buffer size = 13797.53 MiB
load_tensors: CUDA0 model buffer size = 2897.73 MiB
load_tensors: CUDA1 model buffer size = 2897.73 MiB
.................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 10000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.59 MiB
llama_kv_cache: CPU KV buffer size = 352.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CUDA0 compute buffer size = 739.09 MiB
llama_context: CUDA1 compute buffer size = 194.01 MiB
llama_context: CUDA_Host compute buffer size = 14.01 MiB
llama_context: graph nodes = 2183
llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
common_init_from_params: added <seed:eos> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 44.823 ms
perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 6.50 seconds per pass - ETA 1.72 minutes
[1]2.7835,[2]2.9216,[3]3.3409,[4]3.6013,[5]4.1248,[6]4.3958,[7]4.6067,[8]4.7254,[9]4.8714,[10]5.0227,[11]5.0961,[12]5.1761,[13]5.3092,[14]5.4165,[15]5.4500,[16]5.4647,
Final estimate: PPL = 5.4647 +/- 0.12022

llama_perf_context_print: load time = 2511.76 ms
llama_perf_context_print: prompt eval time = 100187.11 ms / 32768 tokens ( 3.06 ms per token, 327.07 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 100696.88 ms / 32769 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16501 + ( 3716 = 2897 + 80 + 739) + 3896 |
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20212 + ( 3171 = 2897 + 80 + 194) + 739 |
llama_memory_breakdown_print: | - Host | 14163 = 13797 + 352 + 14 |
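The perplexity logs above end with a running per-chunk value and a final `PPL = ... +/- ...` line. As a rough illustration only (not the benchmark's own code), the sketch below shows how such a running estimate and its error bar can be derived from per-token negative log-likelihoods, assuming the final value is the exponential of the mean NLL and the error is the delta-method standard error; the function and input format are hypothetical.

```python
import math

def running_perplexity(chunk_nlls):
    """Running PPL after each chunk plus a final estimate with an error bar.

    chunk_nlls: list of lists of per-token negative log-likelihoods (nats),
    one inner list per evaluated chunk (hypothetical input format).
    """
    all_nll = []
    running = []
    for chunk in chunk_nlls:
        all_nll.extend(chunk)
        running.append(math.exp(sum(all_nll) / len(all_nll)))
    n = len(all_nll)
    mean = sum(all_nll) / n
    var = sum((x - mean) ** 2 for x in all_nll) / (n - 1)
    ppl = math.exp(mean)
    err = ppl * math.sqrt(var / n)  # delta-method SE of exp(mean NLL)
    return running, ppl, err

# Example with two small chunks of made-up NLL values.
print(running_perplexity([[1.9, 2.0, 1.8], [2.1, 1.7, 2.0]]))
```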
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/bench_metrics.json
ADDED
@@ -0,0 +1,44 @@
{
  "raw_metrics": {
    "llamabench": {
      "backend": "CUDA",
      "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/llamabench.md",
      "ngl": "35",
      "raw_row": {
        "backend": "CUDA",
        "model": "seed_oss 36B IQ4_NL - 4.5 bpw",
        "ngl": "35",
        "params": "36.15 B",
        "size": "19.31 GiB",
        "t/s": "25.73 \u00b1 2.34",
        "test": "pp8",
        "tps_value": 25.73
      },
      "test": "pp8",
      "tps": 25.73
    },
    "perplexity": {
      "code": {
        "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_code.log",
        "ppl": 1.4159,
        "ppl_error": 0.00947
      },
      "general": {
        "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_general.log",
        "ppl": 6.8736,
        "ppl_error": 0.16558
      },
      "math": {
        "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_math.log",
        "ppl": 5.4656,
        "ppl_error": 0.12023
      }
    }
  },
  "summary": {
    "avg_prec_loss_pct": 0.27,
    "bench_tps": 25.73,
    "file_size_bytes": 20743412512,
    "file_size_gb": 19.32
  }
}
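Each `bench_metrics.json` pairs the three domain perplexities with a `summary.avg_prec_loss_pct` figure. The exact formula used by the collection harness is not part of this diff; the snippet below is a hypothetical reading in which the value is the mean relative PPL increase over the code/general/math runs against baseline perplexities supplied by the caller (the `BASELINE` numbers are placeholders, not measured values).

```python
import json

# Placeholder baseline PPLs for an unquantized reference model (assumed, not measured).
BASELINE = {"code": 1.41, "general": 6.84, "math": 5.43}

def avg_precision_loss_pct(metrics_path, baseline=BASELINE):
    """Mean relative PPL increase (%) over the three domains in bench_metrics.json."""
    with open(metrics_path) as f:
        ppl = json.load(f)["raw_metrics"]["perplexity"]
    losses = [
        (ppl[domain]["ppl"] - baseline[domain]) / baseline[domain] * 100.0
        for domain in ("code", "general", "math")
    ]
    return sum(losses) / len(losses)
```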
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/llamabench.md
ADDED
@@ -0,0 +1,11 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| seed_oss 36B IQ4_NL - 4.5 bpw | 19.31 GiB | 36.15 B | CUDA | 35 | pp8 | 25.73 ± 2.34 |
| seed_oss 36B IQ4_NL - 4.5 bpw | 19.31 GiB | 36.15 B | CUDA | 35 | tg128 | 5.32 ± 0.00 |

build: 92bb442ad (7040)
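The `llamabench.md` files hold the raw `llama-bench` markdown table whose `pp8`/`tg128` rows feed the `raw_row` and `tps` fields in `bench_metrics.json`. A minimal sketch of that extraction is given below, assuming the `±`-separated throughput column and the `seed_oss` model prefix seen in these tables; the parsing is illustrative, not the repository's own collection script.

```python
import re

def parse_llamabench_rows(markdown_text):
    """Pull data rows out of a llama-bench markdown table and expose the
    mean tokens/sec as a float, similar in shape to raw_row in bench_metrics.json."""
    rows = []
    for line in markdown_text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) == 7 and cells[0].startswith("seed_oss"):
            model, size, params, backend, ngl, test, tps = cells
            mean_tps = float(re.split(r"\s*±\s*", tps)[0])
            rows.append({"model": model, "size": size, "params": params,
                         "backend": backend, "ngl": ngl, "test": test,
                         "tps_value": mean_tps})
    return rows
```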
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_code.log
ADDED
@@ -0,0 +1,152 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20429 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = seed_oss
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
llama_model_loader: - kv 4: general.basename str = Seed-OSS
llama_model_loader: - kv 5: general.size_label str = 36B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
llama_model_loader: - kv 31: general.quantization_version u32 = 2
llama_model_loader: - kv 32: general.file_type u32 = 25
llama_model_loader: - type f32: 321 tensors
llama_model_loader: - type q8_0: 1 tensors
llama_model_loader: - type iq4_nl: 449 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = IQ4_NL - 4.5 bpw
print_info: file size = 19.31 GiB (4.59 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 2 ('<seed:eos>')
load: special tokens cache size = 128
load: token to piece cache size = 0.9296 MB
print_info: arch = seed_oss
print_info: vocab_only = 0
print_info: n_ctx_train = 524288
print_info: n_embd = 5120
print_info: n_embd_inp = 5120
print_info: n_layer = 64
print_info: n_head = 80
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 10
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 27648
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 10000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 524288
print_info: rope_finetuned = unknown
print_info: model type = 36B
print_info: model params = 36.15 B
print_info: general.name = Seed OSS 36B Instruct Unsloth
print_info: vocab type = BPE
print_info: n_vocab = 155136
print_info: n_merges = 154737
print_info: BOS token = 0 '<seed:bos>'
print_info: EOS token = 2 '<seed:eos>'
print_info: PAD token = 1 '<seed:pad>'
print_info: LF token = 326 'Ċ'
print_info: EOG token = 2 '<seed:eos>'
print_info: max token length = 1024
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 20 repeating layers to GPU
load_tensors: offloaded 20/65 layers to GPU
load_tensors: CPU_Mapped model buffer size = 13980.99 MiB
load_tensors: CUDA0 model buffer size = 2897.73 MiB
load_tensors: CUDA1 model buffer size = 2897.73 MiB
................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 10000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.59 MiB
llama_kv_cache: CPU KV buffer size = 352.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CUDA0 compute buffer size = 739.09 MiB
llama_context: CUDA1 compute buffer size = 194.01 MiB
llama_context: CUDA_Host compute buffer size = 14.01 MiB
llama_context: graph nodes = 2183
llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
common_init_from_params: added <seed:eos> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 111.451 ms
perplexity: calculating perplexity over 48 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 6.45 seconds per pass - ETA 5.15 minutes
[1]1.5697,[2]1.4711,[3]1.2936,[4]1.2378,[5]1.1929,[6]1.2810,[7]1.3862,[8]1.4451,[9]1.4271,[10]1.4037,[11]1.3808,[12]1.3860,[13]1.3864,[14]1.3715,[15]1.3525,[16]1.3677,[17]1.3692,[18]1.3505,[19]1.3480,[20]1.3636,[21]1.3538,[22]1.3436,[23]1.3541,[24]1.3485,[25]1.3513,[26]1.3472,[27]1.3643,[28]1.3697,[29]1.3701,[30]1.3711,[31]1.3683,[32]1.3792,[33]1.3800,[34]1.3723,[35]1.3680,[36]1.3630,[37]1.3709,[38]1.3796,[39]1.3710,[40]1.3932,[41]1.4021,[42]1.4052,[43]1.4137,[44]1.4148,[45]1.4081,[46]1.4110,[47]1.4146,[48]1.4159,
Final estimate: PPL = 1.4159 +/- 0.00947

llama_perf_context_print: load time = 2493.93 ms
llama_perf_context_print: prompt eval time = 299600.26 ms / 98304 tokens ( 3.05 ms per token, 328.12 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 301103.88 ms / 98305 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16531 + ( 3716 = 2897 + 80 + 739) + 3866 |
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20212 + ( 3171 = 2897 + 80 + 194) + 739 |
llama_memory_breakdown_print: | - Host | 14347 = 13980 + 352 + 14 |
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_general.log
ADDED
@@ -0,0 +1,152 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20412 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = seed_oss
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
llama_model_loader: - kv 4: general.basename str = Seed-OSS
llama_model_loader: - kv 5: general.size_label str = 36B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
llama_model_loader: - kv 31: general.quantization_version u32 = 2
llama_model_loader: - kv 32: general.file_type u32 = 25
llama_model_loader: - type f32: 321 tensors
llama_model_loader: - type q8_0: 1 tensors
llama_model_loader: - type iq4_nl: 449 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = IQ4_NL - 4.5 bpw
print_info: file size = 19.31 GiB (4.59 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 2 ('<seed:eos>')
load: special tokens cache size = 128
load: token to piece cache size = 0.9296 MB
print_info: arch = seed_oss
print_info: vocab_only = 0
print_info: n_ctx_train = 524288
print_info: n_embd = 5120
print_info: n_embd_inp = 5120
print_info: n_layer = 64
print_info: n_head = 80
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 10
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 27648
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 10000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 524288
print_info: rope_finetuned = unknown
print_info: model type = 36B
print_info: model params = 36.15 B
print_info: general.name = Seed OSS 36B Instruct Unsloth
print_info: vocab type = BPE
print_info: n_vocab = 155136
print_info: n_merges = 154737
print_info: BOS token = 0 '<seed:bos>'
print_info: EOS token = 2 '<seed:eos>'
print_info: PAD token = 1 '<seed:pad>'
print_info: LF token = 326 'Ċ'
print_info: EOG token = 2 '<seed:eos>'
print_info: max token length = 1024
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 20 repeating layers to GPU
load_tensors: offloaded 20/65 layers to GPU
load_tensors: CPU_Mapped model buffer size = 13980.99 MiB
load_tensors: CUDA0 model buffer size = 2897.73 MiB
load_tensors: CUDA1 model buffer size = 2897.73 MiB
................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 10000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.59 MiB
llama_kv_cache: CPU KV buffer size = 352.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CUDA0 compute buffer size = 739.09 MiB
llama_context: CUDA1 compute buffer size = 194.01 MiB
llama_context: CUDA_Host compute buffer size = 14.01 MiB
llama_context: graph nodes = 2183
llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
common_init_from_params: added <seed:eos> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 46.856 ms
perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 6.44 seconds per pass - ETA 1.60 minutes
[1]7.0159,[2]8.1163,[3]8.5190,[4]8.2693,[5]8.0635,[6]6.7555,[7]5.9452,[8]6.0011,[9]6.2586,[10]6.3193,[11]6.4576,[12]6.7629,[13]6.7864,[14]6.8658,[15]6.8736,
Final estimate: PPL = 6.8736 +/- 0.16558

llama_perf_context_print: load time = 2645.96 ms
llama_perf_context_print: prompt eval time = 93386.51 ms / 30720 tokens ( 3.04 ms per token, 328.96 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 93871.31 ms / 30721 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16501 + ( 3716 = 2897 + 80 + 739) + 3897 |
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20212 + ( 3171 = 2897 + 80 + 194) + 739 |
llama_memory_breakdown_print: | - Host | 14347 = 13980 + 352 + 14 |
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_math.log
ADDED
@@ -0,0 +1,152 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20450 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = seed_oss
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
llama_model_loader: - kv 4: general.basename str = Seed-OSS
llama_model_loader: - kv 5: general.size_label str = 36B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
llama_model_loader: - kv 31: general.quantization_version u32 = 2
llama_model_loader: - kv 32: general.file_type u32 = 25
llama_model_loader: - type f32: 321 tensors
llama_model_loader: - type q8_0: 1 tensors
llama_model_loader: - type iq4_nl: 449 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = IQ4_NL - 4.5 bpw
print_info: file size = 19.31 GiB (4.59 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 2 ('<seed:eos>')
load: special tokens cache size = 128
load: token to piece cache size = 0.9296 MB
print_info: arch = seed_oss
print_info: vocab_only = 0
print_info: n_ctx_train = 524288
print_info: n_embd = 5120
print_info: n_embd_inp = 5120
print_info: n_layer = 64
print_info: n_head = 80
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 10
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 27648
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 10000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 524288
print_info: rope_finetuned = unknown
print_info: model type = 36B
print_info: model params = 36.15 B
print_info: general.name = Seed OSS 36B Instruct Unsloth
print_info: vocab type = BPE
print_info: n_vocab = 155136
print_info: n_merges = 154737
print_info: BOS token = 0 '<seed:bos>'
print_info: EOS token = 2 '<seed:eos>'
print_info: PAD token = 1 '<seed:pad>'
print_info: LF token = 326 'Ċ'
print_info: EOG token = 2 '<seed:eos>'
print_info: max token length = 1024
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 20 repeating layers to GPU
load_tensors: offloaded 20/65 layers to GPU
load_tensors: CPU_Mapped model buffer size = 13980.99 MiB
load_tensors: CUDA0 model buffer size = 2897.73 MiB
load_tensors: CUDA1 model buffer size = 2897.73 MiB
................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 10000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.59 MiB
llama_kv_cache: CPU KV buffer size = 352.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CUDA0 compute buffer size = 739.09 MiB
llama_context: CUDA1 compute buffer size = 194.01 MiB
llama_context: CUDA_Host compute buffer size = 14.01 MiB
llama_context: graph nodes = 2183
llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
common_init_from_params: added <seed:eos> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 46.739 ms
perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 6.48 seconds per pass - ETA 1.72 minutes
[1]2.7843,[2]2.9209,[3]3.3419,[4]3.6083,[5]4.1312,[6]4.4011,[7]4.6125,[8]4.7301,[9]4.8745,[10]5.0244,[11]5.0976,[12]5.1776,[13]5.3111,[14]5.4185,[15]5.4517,[16]5.4656,
Final estimate: PPL = 5.4656 +/- 0.12023

llama_perf_context_print: load time = 2485.58 ms
llama_perf_context_print: prompt eval time = 99922.90 ms / 32768 tokens ( 3.05 ms per token, 327.93 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 100433.38 ms / 32769 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16535 + ( 3716 = 2897 + 80 + 739) + 3862 |
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20212 + ( 3171 = 2897 + 80 + 194) + 739 |
llama_memory_breakdown_print: | - Host | 14347 = 13980 + 352 + 14 |
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_MXFP4-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/bench_metrics.json
ADDED
@@ -0,0 +1,44 @@
{
  "raw_metrics": {
    "llamabench": {
      "backend": "CUDA",
      "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_MXFP4-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/llamabench.md",
      "ngl": "35",
      "raw_row": {
        "backend": "CUDA",
        "model": "seed_oss 36B IQ4_NL - 4.5 bpw",
        "ngl": "35",
        "params": "36.15 B",
        "size": "19.43 GiB",
        "t/s": "28.76 \u00b1 0.96",
        "test": "pp8",
        "tps_value": 28.76
      },
      "test": "pp8",
      "tps": 28.76
    },
    "perplexity": {
      "code": {
        "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_MXFP4-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_code.log",
        "ppl": 1.4176,
        "ppl_error": 0.00953
      },
      "general": {
        "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_MXFP4-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_general.log",
        "ppl": 6.8507,
        "ppl_error": 0.16499
      },
      "math": {
        "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_MXFP4-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_math.log",
        "ppl": 5.4384,
        "ppl_error": 0.1198
      }
    }
  },
  "summary": {
    "avg_prec_loss_pct": 0.3254,
    "bench_tps": 28.76,
    "file_size_bytes": 20864981792,
    "file_size_gb": 19.43
  }
}
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_MXFP4-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/llamabench.md
ADDED
@@ -0,0 +1,11 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| seed_oss 36B IQ4_NL - 4.5 bpw | 19.43 GiB | 36.15 B | CUDA | 35 | pp8 | 28.76 ± 0.96 |
| seed_oss 36B IQ4_NL - 4.5 bpw | 19.43 GiB | 36.15 B | CUDA | 35 | tg128 | 4.87 ± 0.01 |

build: 92bb442ad (7040)
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_MXFP4-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_code.log
ADDED
@@ -0,0 +1,152 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 19133 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_MXFP4-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = seed_oss
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = Seed-OSS
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 36B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
|
| 23 |
+
llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
|
| 24 |
+
llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
|
| 25 |
+
llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
|
| 26 |
+
llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
|
| 27 |
+
llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
|
| 28 |
+
llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
|
| 29 |
+
llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
|
| 30 |
+
llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 31 |
+
llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
|
| 32 |
+
llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
|
| 33 |
+
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
|
| 34 |
+
llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
|
| 35 |
+
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
|
| 42 |
+
llama_model_loader: - kv 31: general.quantization_version u32 = 2
|
| 43 |
+
llama_model_loader: - kv 32: general.file_type u32 = 25
|
| 44 |
+
llama_model_loader: - type f32: 321 tensors
|
| 45 |
+
llama_model_loader: - type q5_K: 65 tensors
|
| 46 |
+
llama_model_loader: - type iq4_nl: 385 tensors
|
| 47 |
+
print_info: file format = GGUF V3 (latest)
|
| 48 |
+
print_info: file type = IQ4_NL - 4.5 bpw
|
| 49 |
+
print_info: file size = 19.43 GiB (4.62 BPW)
|
| 50 |
+
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
|
| 51 |
+
load: printing all EOG tokens:
|
| 52 |
+
load: - 2 ('<seed:eos>')
|
| 53 |
+
load: special tokens cache size = 128
|
| 54 |
+
load: token to piece cache size = 0.9296 MB
|
| 55 |
+
print_info: arch = seed_oss
|
| 56 |
+
print_info: vocab_only = 0
|
| 57 |
+
print_info: n_ctx_train = 524288
|
| 58 |
+
print_info: n_embd = 5120
|
| 59 |
+
print_info: n_embd_inp = 5120
|
| 60 |
+
print_info: n_layer = 64
|
| 61 |
+
print_info: n_head = 80
|
| 62 |
+
print_info: n_head_kv = 8
|
| 63 |
+
print_info: n_rot = 128
|
| 64 |
+
print_info: n_swa = 0
|
| 65 |
+
print_info: is_swa_any = 0
|
| 66 |
+
print_info: n_embd_head_k = 128
|
| 67 |
+
print_info: n_embd_head_v = 128
|
| 68 |
+
print_info: n_gqa = 10
|
| 69 |
+
print_info: n_embd_k_gqa = 1024
|
| 70 |
+
print_info: n_embd_v_gqa = 1024
|
| 71 |
+
print_info: f_norm_eps = 0.0e+00
|
| 72 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 73 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 74 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 75 |
+
print_info: f_logit_scale = 0.0e+00
|
| 76 |
+
print_info: f_attn_scale = 0.0e+00
|
| 77 |
+
print_info: n_ff = 27648
|
| 78 |
+
print_info: n_expert = 0
|
| 79 |
+
print_info: n_expert_used = 0
|
| 80 |
+
print_info: n_expert_groups = 0
|
| 81 |
+
print_info: n_group_used = 0
|
| 82 |
+
print_info: causal attn = 1
|
| 83 |
+
print_info: pooling type = 0
|
| 84 |
+
print_info: rope type = 2
|
| 85 |
+
print_info: rope scaling = linear
|
| 86 |
+
print_info: freq_base_train = 10000000.0
|
| 87 |
+
print_info: freq_scale_train = 1
|
| 88 |
+
print_info: n_ctx_orig_yarn = 524288
|
| 89 |
+
print_info: rope_finetuned = unknown
|
| 90 |
+
print_info: model type = 36B
|
| 91 |
+
print_info: model params = 36.15 B
|
| 92 |
+
print_info: general.name = Seed OSS 36B Instruct Unsloth
|
| 93 |
+
print_info: vocab type = BPE
|
| 94 |
+
print_info: n_vocab = 155136
|
| 95 |
+
print_info: n_merges = 154737
|
| 96 |
+
print_info: BOS token = 0 '<seed:bos>'
|
| 97 |
+
print_info: EOS token = 2 '<seed:eos>'
|
| 98 |
+
print_info: PAD token = 1 '<seed:pad>'
|
| 99 |
+
print_info: LF token = 326 'Ċ'
|
| 100 |
+
print_info: EOG token = 2 '<seed:eos>'
|
| 101 |
+
print_info: max token length = 1024
|
| 102 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 103 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 104 |
+
load_tensors: offloaded 20/65 layers to GPU
|
| 105 |
+
load_tensors: CPU_Mapped model buffer size = 13971.93 MiB
|
| 106 |
+
load_tensors: CUDA0 model buffer size = 2664.21 MiB
|
| 107 |
+
load_tensors: CUDA1 model buffer size = 3256.26 MiB
|
| 108 |
+
..................................................................................................
|
| 109 |
+
llama_context: constructing llama_context
|
| 110 |
+
llama_context: n_seq_max = 1
|
| 111 |
+
llama_context: n_ctx = 2048
|
| 112 |
+
llama_context: n_ctx_seq = 2048
|
| 113 |
+
llama_context: n_batch = 2048
|
| 114 |
+
llama_context: n_ubatch = 512
|
| 115 |
+
llama_context: causal_attn = 1
|
| 116 |
+
llama_context: flash_attn = auto
|
| 117 |
+
llama_context: kv_unified = false
|
| 118 |
+
llama_context: freq_base = 10000000.0
|
| 119 |
+
llama_context: freq_scale = 1
|
| 120 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
|
| 121 |
+
llama_context: CPU output buffer size = 0.59 MiB
|
| 122 |
+
llama_kv_cache: CPU KV buffer size = 352.00 MiB
|
| 123 |
+
llama_kv_cache: CUDA0 KV buffer size = 72.00 MiB
|
| 124 |
+
llama_kv_cache: CUDA1 KV buffer size = 88.00 MiB
|
| 125 |
+
llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
|
| 126 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 127 |
+
llama_context: CUDA0 compute buffer size = 833.78 MiB
|
| 128 |
+
llama_context: CUDA1 compute buffer size = 194.01 MiB
|
| 129 |
+
llama_context: CUDA_Host compute buffer size = 14.01 MiB
|
| 130 |
+
llama_context: graph nodes = 2183
|
| 131 |
+
llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
|
| 132 |
+
common_init_from_params: added <seed:eos> logit bias = -inf
|
| 133 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 134 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 135 |
+
|
| 136 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 137 |
+
perplexity: tokenizing the input ..
|
| 138 |
+
perplexity: tokenization took 115.322 ms
|
| 139 |
+
perplexity: calculating perplexity over 48 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 140 |
+
perplexity: 6.91 seconds per pass - ETA 5.52 minutes
|
| 141 |
+
[1]1.5663,[2]1.4687,[3]1.2922,[4]1.2374,[5]1.1926,[6]1.2795,[7]1.3859,[8]1.4450,[9]1.4272,[10]1.4042,[11]1.3816,[12]1.3867,[13]1.3871,[14]1.3725,[15]1.3537,[16]1.3689,[17]1.3703,[18]1.3515,[19]1.3491,[20]1.3652,[21]1.3554,[22]1.3451,[23]1.3557,[24]1.3503,[25]1.3534,[26]1.3494,[27]1.3664,[28]1.3717,[29]1.3721,[30]1.3730,[31]1.3704,[32]1.3812,[33]1.3819,[34]1.3743,[35]1.3700,[36]1.3651,[37]1.3731,[38]1.3820,[39]1.3733,[40]1.3951,[41]1.4041,[42]1.4071,[43]1.4154,[44]1.4164,[45]1.4097,[46]1.4126,[47]1.4164,[48]1.4176,
|
| 142 |
+
Final estimate: PPL = 1.4176 +/- 0.00953
|
| 143 |
+
|
| 144 |
+
llama_perf_context_print: load time = 2582.59 ms
|
| 145 |
+
llama_perf_context_print: prompt eval time = 321034.86 ms / 98304 tokens ( 3.27 ms per token, 306.21 tokens per second)
|
| 146 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 147 |
+
llama_perf_context_print: total time = 322609.98 ms / 98305 tokens
|
| 148 |
+
llama_perf_context_print: graphs reused = 0
|
| 149 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 150 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 15274 + ( 3569 = 2664 + 72 + 833) + 5271 |
|
| 151 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 19844 + ( 3538 = 3256 + 88 + 194) + 741 |
|
| 152 |
+
llama_memory_breakdown_print: | - Host | 14337 = 13971 + 352 + 14 |
|
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_MXFP4-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_general.log
ADDED
@@ -0,0 +1,152 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 19177 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_MXFP4-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = seed_oss
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = Seed-OSS
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 36B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
|
| 23 |
+
llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
|
| 24 |
+
llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
|
| 25 |
+
llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
|
| 26 |
+
llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
|
| 27 |
+
llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
|
| 28 |
+
llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
|
| 29 |
+
llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
|
| 30 |
+
llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 31 |
+
llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
|
| 32 |
+
llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
|
| 33 |
+
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
|
| 34 |
+
llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
|
| 35 |
+
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
|
| 42 |
+
llama_model_loader: - kv 31: general.quantization_version u32 = 2
|
| 43 |
+
llama_model_loader: - kv 32: general.file_type u32 = 25
|
| 44 |
+
llama_model_loader: - type f32: 321 tensors
|
| 45 |
+
llama_model_loader: - type q5_K: 65 tensors
|
| 46 |
+
llama_model_loader: - type iq4_nl: 385 tensors
|
| 47 |
+
print_info: file format = GGUF V3 (latest)
|
| 48 |
+
print_info: file type = IQ4_NL - 4.5 bpw
|
| 49 |
+
print_info: file size = 19.43 GiB (4.62 BPW)
|
| 50 |
+
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
|
| 51 |
+
load: printing all EOG tokens:
|
| 52 |
+
load: - 2 ('<seed:eos>')
|
| 53 |
+
load: special tokens cache size = 128
|
| 54 |
+
load: token to piece cache size = 0.9296 MB
|
| 55 |
+
print_info: arch = seed_oss
|
| 56 |
+
print_info: vocab_only = 0
|
| 57 |
+
print_info: n_ctx_train = 524288
|
| 58 |
+
print_info: n_embd = 5120
|
| 59 |
+
print_info: n_embd_inp = 5120
|
| 60 |
+
print_info: n_layer = 64
|
| 61 |
+
print_info: n_head = 80
|
| 62 |
+
print_info: n_head_kv = 8
|
| 63 |
+
print_info: n_rot = 128
|
| 64 |
+
print_info: n_swa = 0
|
| 65 |
+
print_info: is_swa_any = 0
|
| 66 |
+
print_info: n_embd_head_k = 128
|
| 67 |
+
print_info: n_embd_head_v = 128
|
| 68 |
+
print_info: n_gqa = 10
|
| 69 |
+
print_info: n_embd_k_gqa = 1024
|
| 70 |
+
print_info: n_embd_v_gqa = 1024
|
| 71 |
+
print_info: f_norm_eps = 0.0e+00
|
| 72 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 73 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 74 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 75 |
+
print_info: f_logit_scale = 0.0e+00
|
| 76 |
+
print_info: f_attn_scale = 0.0e+00
|
| 77 |
+
print_info: n_ff = 27648
|
| 78 |
+
print_info: n_expert = 0
|
| 79 |
+
print_info: n_expert_used = 0
|
| 80 |
+
print_info: n_expert_groups = 0
|
| 81 |
+
print_info: n_group_used = 0
|
| 82 |
+
print_info: causal attn = 1
|
| 83 |
+
print_info: pooling type = 0
|
| 84 |
+
print_info: rope type = 2
|
| 85 |
+
print_info: rope scaling = linear
|
| 86 |
+
print_info: freq_base_train = 10000000.0
|
| 87 |
+
print_info: freq_scale_train = 1
|
| 88 |
+
print_info: n_ctx_orig_yarn = 524288
|
| 89 |
+
print_info: rope_finetuned = unknown
|
| 90 |
+
print_info: model type = 36B
|
| 91 |
+
print_info: model params = 36.15 B
|
| 92 |
+
print_info: general.name = Seed OSS 36B Instruct Unsloth
|
| 93 |
+
print_info: vocab type = BPE
|
| 94 |
+
print_info: n_vocab = 155136
|
| 95 |
+
print_info: n_merges = 154737
|
| 96 |
+
print_info: BOS token = 0 '<seed:bos>'
|
| 97 |
+
print_info: EOS token = 2 '<seed:eos>'
|
| 98 |
+
print_info: PAD token = 1 '<seed:pad>'
|
| 99 |
+
print_info: LF token = 326 'Ċ'
|
| 100 |
+
print_info: EOG token = 2 '<seed:eos>'
|
| 101 |
+
print_info: max token length = 1024
|
| 102 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 103 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 104 |
+
load_tensors: offloaded 20/65 layers to GPU
|
| 105 |
+
load_tensors: CPU_Mapped model buffer size = 13971.93 MiB
|
| 106 |
+
load_tensors: CUDA0 model buffer size = 2664.21 MiB
|
| 107 |
+
load_tensors: CUDA1 model buffer size = 3256.26 MiB
|
| 108 |
+
..................................................................................................
|
| 109 |
+
llama_context: constructing llama_context
|
| 110 |
+
llama_context: n_seq_max = 1
|
| 111 |
+
llama_context: n_ctx = 2048
|
| 112 |
+
llama_context: n_ctx_seq = 2048
|
| 113 |
+
llama_context: n_batch = 2048
|
| 114 |
+
llama_context: n_ubatch = 512
|
| 115 |
+
llama_context: causal_attn = 1
|
| 116 |
+
llama_context: flash_attn = auto
|
| 117 |
+
llama_context: kv_unified = false
|
| 118 |
+
llama_context: freq_base = 10000000.0
|
| 119 |
+
llama_context: freq_scale = 1
|
| 120 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
|
| 121 |
+
llama_context: CPU output buffer size = 0.59 MiB
|
| 122 |
+
llama_kv_cache: CPU KV buffer size = 352.00 MiB
|
| 123 |
+
llama_kv_cache: CUDA0 KV buffer size = 72.00 MiB
|
| 124 |
+
llama_kv_cache: CUDA1 KV buffer size = 88.00 MiB
|
| 125 |
+
llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
|
| 126 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 127 |
+
llama_context: CUDA0 compute buffer size = 833.78 MiB
|
| 128 |
+
llama_context: CUDA1 compute buffer size = 194.01 MiB
|
| 129 |
+
llama_context: CUDA_Host compute buffer size = 14.01 MiB
|
| 130 |
+
llama_context: graph nodes = 2183
|
| 131 |
+
llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
|
| 132 |
+
common_init_from_params: added <seed:eos> logit bias = -inf
|
| 133 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 134 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 135 |
+
|
| 136 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 137 |
+
perplexity: tokenizing the input ..
|
| 138 |
+
perplexity: tokenization took 49.576 ms
|
| 139 |
+
perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 140 |
+
perplexity: 6.90 seconds per pass - ETA 1.72 minutes
|
| 141 |
+
[1]6.9757,[2]8.0597,[3]8.4705,[4]8.2112,[5]8.0009,[6]6.7181,[7]5.9191,[8]5.9832,[9]6.2474,[10]6.3077,[11]6.4380,[12]6.7394,[13]6.7657,[14]6.8428,[15]6.8507,
|
| 142 |
+
Final estimate: PPL = 6.8507 +/- 0.16499
|
| 143 |
+
|
| 144 |
+
llama_perf_context_print: load time = 2546.10 ms
|
| 145 |
+
llama_perf_context_print: prompt eval time = 99968.38 ms / 30720 tokens ( 3.25 ms per token, 307.30 tokens per second)
|
| 146 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 147 |
+
llama_perf_context_print: total time = 100472.57 ms / 30721 tokens
|
| 148 |
+
llama_perf_context_print: graphs reused = 0
|
| 149 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 150 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 15365 + ( 3569 = 2664 + 72 + 833) + 5180 |
|
| 151 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 19844 + ( 3538 = 3256 + 88 + 194) + 741 |
|
| 152 |
+
llama_memory_breakdown_print: | - Host | 14337 = 13971 + 352 + 14 |
|
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_MXFP4-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_math.log
ADDED
@@ -0,0 +1,152 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 19043 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_MXFP4-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = seed_oss
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = Seed-OSS
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 36B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
|
| 23 |
+
llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
|
| 24 |
+
llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
|
| 25 |
+
llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
|
| 26 |
+
llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
|
| 27 |
+
llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
|
| 28 |
+
llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
|
| 29 |
+
llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
|
| 30 |
+
llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 31 |
+
llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
|
| 32 |
+
llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
|
| 33 |
+
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
|
| 34 |
+
llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
|
| 35 |
+
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
|
| 42 |
+
llama_model_loader: - kv 31: general.quantization_version u32 = 2
|
| 43 |
+
llama_model_loader: - kv 32: general.file_type u32 = 25
|
| 44 |
+
llama_model_loader: - type f32: 321 tensors
|
| 45 |
+
llama_model_loader: - type q5_K: 65 tensors
|
| 46 |
+
llama_model_loader: - type iq4_nl: 385 tensors
|
| 47 |
+
print_info: file format = GGUF V3 (latest)
|
| 48 |
+
print_info: file type = IQ4_NL - 4.5 bpw
|
| 49 |
+
print_info: file size = 19.43 GiB (4.62 BPW)
|
| 50 |
+
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
|
| 51 |
+
load: printing all EOG tokens:
|
| 52 |
+
load: - 2 ('<seed:eos>')
|
| 53 |
+
load: special tokens cache size = 128
|
| 54 |
+
load: token to piece cache size = 0.9296 MB
|
| 55 |
+
print_info: arch = seed_oss
|
| 56 |
+
print_info: vocab_only = 0
|
| 57 |
+
print_info: n_ctx_train = 524288
|
| 58 |
+
print_info: n_embd = 5120
|
| 59 |
+
print_info: n_embd_inp = 5120
|
| 60 |
+
print_info: n_layer = 64
|
| 61 |
+
print_info: n_head = 80
|
| 62 |
+
print_info: n_head_kv = 8
|
| 63 |
+
print_info: n_rot = 128
|
| 64 |
+
print_info: n_swa = 0
|
| 65 |
+
print_info: is_swa_any = 0
|
| 66 |
+
print_info: n_embd_head_k = 128
|
| 67 |
+
print_info: n_embd_head_v = 128
|
| 68 |
+
print_info: n_gqa = 10
|
| 69 |
+
print_info: n_embd_k_gqa = 1024
|
| 70 |
+
print_info: n_embd_v_gqa = 1024
|
| 71 |
+
print_info: f_norm_eps = 0.0e+00
|
| 72 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 73 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 74 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 75 |
+
print_info: f_logit_scale = 0.0e+00
|
| 76 |
+
print_info: f_attn_scale = 0.0e+00
|
| 77 |
+
print_info: n_ff = 27648
|
| 78 |
+
print_info: n_expert = 0
|
| 79 |
+
print_info: n_expert_used = 0
|
| 80 |
+
print_info: n_expert_groups = 0
|
| 81 |
+
print_info: n_group_used = 0
|
| 82 |
+
print_info: causal attn = 1
|
| 83 |
+
print_info: pooling type = 0
|
| 84 |
+
print_info: rope type = 2
|
| 85 |
+
print_info: rope scaling = linear
|
| 86 |
+
print_info: freq_base_train = 10000000.0
|
| 87 |
+
print_info: freq_scale_train = 1
|
| 88 |
+
print_info: n_ctx_orig_yarn = 524288
|
| 89 |
+
print_info: rope_finetuned = unknown
|
| 90 |
+
print_info: model type = 36B
|
| 91 |
+
print_info: model params = 36.15 B
|
| 92 |
+
print_info: general.name = Seed OSS 36B Instruct Unsloth
|
| 93 |
+
print_info: vocab type = BPE
|
| 94 |
+
print_info: n_vocab = 155136
|
| 95 |
+
print_info: n_merges = 154737
|
| 96 |
+
print_info: BOS token = 0 '<seed:bos>'
|
| 97 |
+
print_info: EOS token = 2 '<seed:eos>'
|
| 98 |
+
print_info: PAD token = 1 '<seed:pad>'
|
| 99 |
+
print_info: LF token = 326 'Ċ'
|
| 100 |
+
print_info: EOG token = 2 '<seed:eos>'
|
| 101 |
+
print_info: max token length = 1024
|
| 102 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 103 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 104 |
+
load_tensors: offloaded 20/65 layers to GPU
|
| 105 |
+
load_tensors: CPU_Mapped model buffer size = 13971.93 MiB
|
| 106 |
+
load_tensors: CUDA0 model buffer size = 2664.21 MiB
|
| 107 |
+
load_tensors: CUDA1 model buffer size = 3256.26 MiB
|
| 108 |
+
..................................................................................................
|
| 109 |
+
llama_context: constructing llama_context
|
| 110 |
+
llama_context: n_seq_max = 1
|
| 111 |
+
llama_context: n_ctx = 2048
|
| 112 |
+
llama_context: n_ctx_seq = 2048
|
| 113 |
+
llama_context: n_batch = 2048
|
| 114 |
+
llama_context: n_ubatch = 512
|
| 115 |
+
llama_context: causal_attn = 1
|
| 116 |
+
llama_context: flash_attn = auto
|
| 117 |
+
llama_context: kv_unified = false
|
| 118 |
+
llama_context: freq_base = 10000000.0
|
| 119 |
+
llama_context: freq_scale = 1
|
| 120 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
|
| 121 |
+
llama_context: CPU output buffer size = 0.59 MiB
|
| 122 |
+
llama_kv_cache: CPU KV buffer size = 352.00 MiB
|
| 123 |
+
llama_kv_cache: CUDA0 KV buffer size = 72.00 MiB
|
| 124 |
+
llama_kv_cache: CUDA1 KV buffer size = 88.00 MiB
|
| 125 |
+
llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
|
| 126 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 127 |
+
llama_context: CUDA0 compute buffer size = 833.78 MiB
|
| 128 |
+
llama_context: CUDA1 compute buffer size = 194.01 MiB
|
| 129 |
+
llama_context: CUDA_Host compute buffer size = 14.01 MiB
|
| 130 |
+
llama_context: graph nodes = 2183
|
| 131 |
+
llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
|
| 132 |
+
common_init_from_params: added <seed:eos> logit bias = -inf
|
| 133 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 134 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 135 |
+
|
| 136 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 137 |
+
perplexity: tokenizing the input ..
|
| 138 |
+
perplexity: tokenization took 47.332 ms
|
| 139 |
+
perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 140 |
+
perplexity: 6.93 seconds per pass - ETA 1.83 minutes
|
| 141 |
+
[1]2.7608,[2]2.9054,[3]3.3323,[4]3.5844,[5]4.0966,[6]4.3685,[7]4.5745,[8]4.7023,[9]4.8500,[10]5.0005,[11]5.0791,[12]5.1544,[13]5.2872,[14]5.3968,[15]5.4247,[16]5.4384,
|
| 142 |
+
Final estimate: PPL = 5.4384 +/- 0.11980
|
| 143 |
+
|
| 144 |
+
llama_perf_context_print: load time = 2673.19 ms
|
| 145 |
+
llama_perf_context_print: prompt eval time = 107032.67 ms / 32768 tokens ( 3.27 ms per token, 306.15 tokens per second)
|
| 146 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 147 |
+
llama_perf_context_print: total time = 107563.78 ms / 32769 tokens
|
| 148 |
+
llama_perf_context_print: graphs reused = 0
|
| 149 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 150 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 15331 + ( 3569 = 2664 + 72 + 833) + 5213 |
|
| 151 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 19844 + ( 3538 = 3256 + 88 + 194) + 741 |
|
| 152 |
+
llama_memory_breakdown_print: | - Host | 14337 = 13971 + 352 + 14 |
|
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16/bench_metrics.json
ADDED
@@ -0,0 +1,44 @@
{
"raw_metrics": {
"llamabench": {
"backend": "CUDA",
"log_path": "Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16/llamabench.md",
"ngl": "35",
"raw_row": {
"backend": "CUDA",
"model": "seed_oss 36B IQ4_NL - 4.5 bpw",
"ngl": "35",
"params": "36.15 B",
"size": "67.34 GiB",
"t/s": "11.32 \u00b1 0.12",
"test": "pp8",
"tps_value": 11.32
},
"test": "pp8",
"tps": 11.32
},
"perplexity": {
"code": {
"log_path": "Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16/perplexity_code.log",
"ppl": 1.4128,
"ppl_error": 0.00952
},
"general": {
"log_path": "Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16/perplexity_general.log",
"ppl": 6.8872,
"ppl_error": 0.16794
},
"math": {
"log_path": "Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16/perplexity_math.log",
"ppl": 5.4442,
"ppl_error": 0.12088
}
}
},
"summary": {
"avg_prec_loss_pct": 0.0,
"bench_tps": 11.32,
"file_size_bytes": 72311397152,
"file_size_gb": 67.35
}
}
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16/llamabench.md
ADDED
@@ -0,0 +1,11 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| seed_oss 36B IQ4_NL - 4.5 bpw | 67.34 GiB | 36.15 B | CUDA | 35 | pp8 | 11.32 ± 0.12 |
| seed_oss 36B IQ4_NL - 4.5 bpw | 67.34 GiB | 36.15 B | CUDA | 35 | tg128 | 1.53 ± 0.02 |

build: 92bb442ad (7040)
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16/perplexity_code.log
ADDED
@@ -0,0 +1,151 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 19670 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23060 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = seed_oss
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = Seed-OSS
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 36B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
|
| 23 |
+
llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
|
| 24 |
+
llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
|
| 25 |
+
llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
|
| 26 |
+
llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
|
| 27 |
+
llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
|
| 28 |
+
llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
|
| 29 |
+
llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
|
| 30 |
+
llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 31 |
+
llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
|
| 32 |
+
llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
|
| 33 |
+
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
|
| 34 |
+
llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
|
| 35 |
+
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
|
| 42 |
+
llama_model_loader: - kv 31: general.quantization_version u32 = 2
|
| 43 |
+
llama_model_loader: - kv 32: general.file_type u32 = 25
|
| 44 |
+
llama_model_loader: - type f32: 321 tensors
|
| 45 |
+
llama_model_loader: - type bf16: 450 tensors
|
| 46 |
+
print_info: file format = GGUF V3 (latest)
|
| 47 |
+
print_info: file type = IQ4_NL - 4.5 bpw
|
| 48 |
+
print_info: file size = 67.34 GiB (16.00 BPW)
|
| 49 |
+
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
|
| 50 |
+
load: printing all EOG tokens:
|
| 51 |
+
load: - 2 ('<seed:eos>')
|
| 52 |
+
load: special tokens cache size = 128
|
| 53 |
+
load: token to piece cache size = 0.9296 MB
|
| 54 |
+
print_info: arch = seed_oss
|
| 55 |
+
print_info: vocab_only = 0
|
| 56 |
+
print_info: n_ctx_train = 524288
|
| 57 |
+
print_info: n_embd = 5120
|
| 58 |
+
print_info: n_embd_inp = 5120
|
| 59 |
+
print_info: n_layer = 64
|
| 60 |
+
print_info: n_head = 80
|
| 61 |
+
print_info: n_head_kv = 8
|
| 62 |
+
print_info: n_rot = 128
|
| 63 |
+
print_info: n_swa = 0
|
| 64 |
+
print_info: is_swa_any = 0
|
| 65 |
+
print_info: n_embd_head_k = 128
|
| 66 |
+
print_info: n_embd_head_v = 128
|
| 67 |
+
print_info: n_gqa = 10
|
| 68 |
+
print_info: n_embd_k_gqa = 1024
|
| 69 |
+
print_info: n_embd_v_gqa = 1024
|
| 70 |
+
print_info: f_norm_eps = 0.0e+00
|
| 71 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 72 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 73 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 74 |
+
print_info: f_logit_scale = 0.0e+00
|
| 75 |
+
print_info: f_attn_scale = 0.0e+00
|
| 76 |
+
print_info: n_ff = 27648
|
| 77 |
+
print_info: n_expert = 0
|
| 78 |
+
print_info: n_expert_used = 0
|
| 79 |
+
print_info: n_expert_groups = 0
|
| 80 |
+
print_info: n_group_used = 0
|
| 81 |
+
print_info: causal attn = 1
|
| 82 |
+
print_info: pooling type = 0
|
| 83 |
+
print_info: rope type = 2
|
| 84 |
+
print_info: rope scaling = linear
|
| 85 |
+
print_info: freq_base_train = 10000000.0
|
| 86 |
+
print_info: freq_scale_train = 1
|
| 87 |
+
print_info: n_ctx_orig_yarn = 524288
|
| 88 |
+
print_info: rope_finetuned = unknown
|
| 89 |
+
print_info: model type = 36B
|
| 90 |
+
print_info: model params = 36.15 B
|
| 91 |
+
print_info: general.name = Seed OSS 36B Instruct Unsloth
|
| 92 |
+
print_info: vocab type = BPE
|
| 93 |
+
print_info: n_vocab = 155136
|
| 94 |
+
print_info: n_merges = 154737
|
| 95 |
+
print_info: BOS token = 0 '<seed:bos>'
|
| 96 |
+
print_info: EOS token = 2 '<seed:eos>'
|
| 97 |
+
print_info: PAD token = 1 '<seed:pad>'
|
| 98 |
+
print_info: LF token = 326 'Ċ'
|
| 99 |
+
print_info: EOG token = 2 '<seed:eos>'
|
| 100 |
+
print_info: max token length = 1024
|
| 101 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 102 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 103 |
+
load_tensors: offloaded 20/65 layers to GPU
|
| 104 |
+
load_tensors: CPU_Mapped model buffer size = 48353.80 MiB
|
| 105 |
+
load_tensors: CUDA0 model buffer size = 10300.86 MiB
|
| 106 |
+
load_tensors: CUDA1 model buffer size = 10300.86 MiB
|
| 107 |
+
..................................................................................................
|
| 108 |
+
llama_context: constructing llama_context
|
| 109 |
+
llama_context: n_seq_max = 1
|
| 110 |
+
llama_context: n_ctx = 2048
|
| 111 |
+
llama_context: n_ctx_seq = 2048
|
| 112 |
+
llama_context: n_batch = 2048
|
| 113 |
+
llama_context: n_ubatch = 512
|
| 114 |
+
llama_context: causal_attn = 1
|
| 115 |
+
llama_context: flash_attn = auto
|
| 116 |
+
llama_context: kv_unified = false
|
| 117 |
+
llama_context: freq_base = 10000000.0
|
| 118 |
+
llama_context: freq_scale = 1
|
| 119 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
|
| 120 |
+
llama_context: CPU output buffer size = 0.59 MiB
|
| 121 |
+
llama_kv_cache: CPU KV buffer size = 352.00 MiB
|
| 122 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 123 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 124 |
+
llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
|
| 125 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 126 |
+
llama_context: CUDA0 compute buffer size = 1828.00 MiB
|
| 127 |
+
llama_context: CUDA1 compute buffer size = 194.01 MiB
|
| 128 |
+
llama_context: CUDA_Host compute buffer size = 14.01 MiB
|
| 129 |
+
llama_context: graph nodes = 2183
|
| 130 |
+
llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
|
| 131 |
+
common_init_from_params: added <seed:eos> logit bias = -inf
|
| 132 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 133 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 134 |
+
|
| 135 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 136 |
+
perplexity: tokenizing the input ..
|
| 137 |
+
perplexity: tokenization took 112.237 ms
|
| 138 |
+
perplexity: calculating perplexity over 48 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 139 |
+
perplexity: 17.78 seconds per pass - ETA 14.22 minutes
|
| 140 |
+
[1]1.5107,[2]1.4416,[3]1.2762,[4]1.2238,[5]1.1809,[6]1.2685,[7]1.3738,[8]1.4318,[9]1.4155,[10]1.3932,[11]1.3715,[12]1.3774,[13]1.3779,[14]1.3640,[15]1.3454,[16]1.3621,[17]1.3633,[18]1.3450,[19]1.3424,[20]1.3583,[21]1.3485,[22]1.3382,[23]1.3488,[24]1.3431,[25]1.3473,[26]1.3431,[27]1.3609,[28]1.3662,[29]1.3668,[30]1.3675,[31]1.3649,[32]1.3754,[33]1.3757,[34]1.3681,[35]1.3643,[36]1.3595,[37]1.3672,[38]1.3761,[39]1.3676,[40]1.3894,[41]1.3983,[42]1.4012,[43]1.4096,[44]1.4109,[45]1.4046,[46]1.4078,[47]1.4116,[48]1.4128,
|
| 141 |
+
Final estimate: PPL = 1.4128 +/- 0.00952
|
| 142 |
+
|
| 143 |
+
llama_perf_context_print: load time = 7800.56 ms
|
| 144 |
+
llama_perf_context_print: prompt eval time = 840300.57 ms / 98304 tokens ( 8.55 ms per token, 116.99 tokens per second)
|
| 145 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 146 |
+
llama_perf_context_print: total time = 841852.62 ms / 98305 tokens
|
| 147 |
+
llama_perf_context_print: graphs reused = 0
|
| 148 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 149 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 6983 + (12208 = 10300 + 80 + 1828) + 4923 |
|
| 150 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 12270 + (10574 = 10300 + 80 + 194) + 1279 |
|
| 151 |
+
llama_memory_breakdown_print: | - Host | 48719 = 48353 + 352 + 14 |
|
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16/perplexity_general.log
ADDED
@@ -0,0 +1,151 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 19658 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23060 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = seed_oss
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = Seed-OSS
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 36B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
|
| 23 |
+
llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
|
| 24 |
+
llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
|
| 25 |
+
llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
|
| 26 |
+
llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
|
| 27 |
+
llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
|
| 28 |
+
llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
|
| 29 |
+
llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
|
| 30 |
+
llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 31 |
+
llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
|
| 32 |
+
llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
|
| 33 |
+
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
|
| 34 |
+
llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
|
| 35 |
+
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
|
| 42 |
+
llama_model_loader: - kv 31: general.quantization_version u32 = 2
|
| 43 |
+
llama_model_loader: - kv 32: general.file_type u32 = 25
|
| 44 |
+
llama_model_loader: - type f32: 321 tensors
|
| 45 |
+
llama_model_loader: - type bf16: 450 tensors
|
| 46 |
+
print_info: file format = GGUF V3 (latest)
|
| 47 |
+
print_info: file type = IQ4_NL - 4.5 bpw
|
| 48 |
+
print_info: file size = 67.34 GiB (16.00 BPW)
|
| 49 |
+
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
|
| 50 |
+
load: printing all EOG tokens:
|
| 51 |
+
load: - 2 ('<seed:eos>')
|
| 52 |
+
load: special tokens cache size = 128
|
| 53 |
+
load: token to piece cache size = 0.9296 MB
|
| 54 |
+
print_info: arch = seed_oss
|
| 55 |
+
print_info: vocab_only = 0
|
| 56 |
+
print_info: n_ctx_train = 524288
|
| 57 |
+
print_info: n_embd = 5120
|
| 58 |
+
print_info: n_embd_inp = 5120
|
| 59 |
+
print_info: n_layer = 64
|
| 60 |
+
print_info: n_head = 80
|
| 61 |
+
print_info: n_head_kv = 8
|
| 62 |
+
print_info: n_rot = 128
|
| 63 |
+
print_info: n_swa = 0
|
| 64 |
+
print_info: is_swa_any = 0
|
| 65 |
+
print_info: n_embd_head_k = 128
|
| 66 |
+
print_info: n_embd_head_v = 128
|
| 67 |
+
print_info: n_gqa = 10
|
| 68 |
+
print_info: n_embd_k_gqa = 1024
|
| 69 |
+
print_info: n_embd_v_gqa = 1024
|
| 70 |
+
print_info: f_norm_eps = 0.0e+00
|
| 71 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 72 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 73 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 74 |
+
print_info: f_logit_scale = 0.0e+00
|
| 75 |
+
print_info: f_attn_scale = 0.0e+00
|
| 76 |
+
print_info: n_ff = 27648
|
| 77 |
+
print_info: n_expert = 0
|
| 78 |
+
print_info: n_expert_used = 0
|
| 79 |
+
print_info: n_expert_groups = 0
|
| 80 |
+
print_info: n_group_used = 0
|
| 81 |
+
print_info: causal attn = 1
|
| 82 |
+
print_info: pooling type = 0
|
| 83 |
+
print_info: rope type = 2
|
| 84 |
+
print_info: rope scaling = linear
|
| 85 |
+
print_info: freq_base_train = 10000000.0
|
| 86 |
+
print_info: freq_scale_train = 1
|
| 87 |
+
print_info: n_ctx_orig_yarn = 524288
|
| 88 |
+
print_info: rope_finetuned = unknown
|
| 89 |
+
print_info: model type = 36B
|
| 90 |
+
print_info: model params = 36.15 B
|
| 91 |
+
print_info: general.name = Seed OSS 36B Instruct Unsloth
|
| 92 |
+
print_info: vocab type = BPE
|
| 93 |
+
print_info: n_vocab = 155136
|
| 94 |
+
print_info: n_merges = 154737
|
| 95 |
+
print_info: BOS token = 0 '<seed:bos>'
|
| 96 |
+
print_info: EOS token = 2 '<seed:eos>'
|
| 97 |
+
print_info: PAD token = 1 '<seed:pad>'
|
| 98 |
+
print_info: LF token = 326 'Ċ'
|
| 99 |
+
print_info: EOG token = 2 '<seed:eos>'
|
| 100 |
+
print_info: max token length = 1024
|
| 101 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 102 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 103 |
+
load_tensors: offloaded 20/65 layers to GPU
|
| 104 |
+
load_tensors: CPU_Mapped model buffer size = 48353.80 MiB
|
| 105 |
+
load_tensors: CUDA0 model buffer size = 10300.86 MiB
|
| 106 |
+
load_tensors: CUDA1 model buffer size = 10300.86 MiB
|
| 107 |
+
..................................................................................................
|
| 108 |
+
llama_context: constructing llama_context
|
| 109 |
+
llama_context: n_seq_max = 1
|
| 110 |
+
llama_context: n_ctx = 2048
|
| 111 |
+
llama_context: n_ctx_seq = 2048
|
| 112 |
+
llama_context: n_batch = 2048
|
| 113 |
+
llama_context: n_ubatch = 512
|
| 114 |
+
llama_context: causal_attn = 1
|
| 115 |
+
llama_context: flash_attn = auto
|
| 116 |
+
llama_context: kv_unified = false
|
| 117 |
+
llama_context: freq_base = 10000000.0
|
| 118 |
+
llama_context: freq_scale = 1
|
| 119 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
|
| 120 |
+
llama_context: CPU output buffer size = 0.59 MiB
|
| 121 |
+
llama_kv_cache: CPU KV buffer size = 352.00 MiB
|
| 122 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 123 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 124 |
+
llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
|
| 125 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 126 |
+
llama_context: CUDA0 compute buffer size = 1828.00 MiB
|
| 127 |
+
llama_context: CUDA1 compute buffer size = 194.01 MiB
|
| 128 |
+
llama_context: CUDA_Host compute buffer size = 14.01 MiB
|
| 129 |
+
llama_context: graph nodes = 2183
|
| 130 |
+
llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
|
| 131 |
+
common_init_from_params: added <seed:eos> logit bias = -inf
|
| 132 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 133 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 134 |
+
|
| 135 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 136 |
+
perplexity: tokenizing the input ..
|
| 137 |
+
perplexity: tokenization took 49.672 ms
|
| 138 |
+
perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 139 |
+
perplexity: 17.80 seconds per pass - ETA 4.45 minutes
|
| 140 |
+
[1]7.1957,[2]8.1195,[3]8.4548,[4]8.2130,[5]8.0074,[6]6.7286,[7]5.9325,[8]5.9903,[9]6.2600,[10]6.3190,[11]6.4561,[12]6.7865,[13]6.8028,[14]6.8780,[15]6.8872,
|
| 141 |
+
Final estimate: PPL = 6.8872 +/- 0.16794
|
| 142 |
+
|
| 143 |
+
llama_perf_context_print: load time = 8091.77 ms
|
| 144 |
+
llama_perf_context_print: prompt eval time = 263005.39 ms / 30720 tokens ( 8.56 ms per token, 116.80 tokens per second)
|
| 145 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 146 |
+
llama_perf_context_print: total time = 263735.66 ms / 30721 tokens
|
| 147 |
+
llama_perf_context_print: graphs reused = 0
|
| 148 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 149 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 7242 + (12208 = 10300 + 80 + 1828) + 4663 |
|
| 150 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 12270 + (10574 = 10300 + 80 + 194) + 1279 |
|
| 151 |
+
llama_memory_breakdown_print: | - Host | 48719 = 48353 + 352 + 14 |
|
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16/perplexity_math.log
ADDED
|
@@ -0,0 +1,151 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 19403 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23060 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = seed_oss
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = Seed-OSS
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 36B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
|
| 23 |
+
llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
|
| 24 |
+
llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
|
| 25 |
+
llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
|
| 26 |
+
llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
|
| 27 |
+
llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
|
| 28 |
+
llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
|
| 29 |
+
llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
|
| 30 |
+
llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 31 |
+
llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
|
| 32 |
+
llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
|
| 33 |
+
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
|
| 34 |
+
llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
|
| 35 |
+
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
|
| 42 |
+
llama_model_loader: - kv 31: general.quantization_version u32 = 2
|
| 43 |
+
llama_model_loader: - kv 32: general.file_type u32 = 25
|
| 44 |
+
llama_model_loader: - type f32: 321 tensors
|
| 45 |
+
llama_model_loader: - type bf16: 450 tensors
|
| 46 |
+
print_info: file format = GGUF V3 (latest)
|
| 47 |
+
print_info: file type = IQ4_NL - 4.5 bpw
|
| 48 |
+
print_info: file size = 67.34 GiB (16.00 BPW)
|
| 49 |
+
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
|
| 50 |
+
load: printing all EOG tokens:
|
| 51 |
+
load: - 2 ('<seed:eos>')
|
| 52 |
+
load: special tokens cache size = 128
|
| 53 |
+
load: token to piece cache size = 0.9296 MB
|
| 54 |
+
print_info: arch = seed_oss
|
| 55 |
+
print_info: vocab_only = 0
|
| 56 |
+
print_info: n_ctx_train = 524288
|
| 57 |
+
print_info: n_embd = 5120
|
| 58 |
+
print_info: n_embd_inp = 5120
|
| 59 |
+
print_info: n_layer = 64
|
| 60 |
+
print_info: n_head = 80
|
| 61 |
+
print_info: n_head_kv = 8
|
| 62 |
+
print_info: n_rot = 128
|
| 63 |
+
print_info: n_swa = 0
|
| 64 |
+
print_info: is_swa_any = 0
|
| 65 |
+
print_info: n_embd_head_k = 128
|
| 66 |
+
print_info: n_embd_head_v = 128
|
| 67 |
+
print_info: n_gqa = 10
|
| 68 |
+
print_info: n_embd_k_gqa = 1024
|
| 69 |
+
print_info: n_embd_v_gqa = 1024
|
| 70 |
+
print_info: f_norm_eps = 0.0e+00
|
| 71 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 72 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 73 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 74 |
+
print_info: f_logit_scale = 0.0e+00
|
| 75 |
+
print_info: f_attn_scale = 0.0e+00
|
| 76 |
+
print_info: n_ff = 27648
|
| 77 |
+
print_info: n_expert = 0
|
| 78 |
+
print_info: n_expert_used = 0
|
| 79 |
+
print_info: n_expert_groups = 0
|
| 80 |
+
print_info: n_group_used = 0
|
| 81 |
+
print_info: causal attn = 1
|
| 82 |
+
print_info: pooling type = 0
|
| 83 |
+
print_info: rope type = 2
|
| 84 |
+
print_info: rope scaling = linear
|
| 85 |
+
print_info: freq_base_train = 10000000.0
|
| 86 |
+
print_info: freq_scale_train = 1
|
| 87 |
+
print_info: n_ctx_orig_yarn = 524288
|
| 88 |
+
print_info: rope_finetuned = unknown
|
| 89 |
+
print_info: model type = 36B
|
| 90 |
+
print_info: model params = 36.15 B
|
| 91 |
+
print_info: general.name = Seed OSS 36B Instruct Unsloth
|
| 92 |
+
print_info: vocab type = BPE
|
| 93 |
+
print_info: n_vocab = 155136
|
| 94 |
+
print_info: n_merges = 154737
|
| 95 |
+
print_info: BOS token = 0 '<seed:bos>'
|
| 96 |
+
print_info: EOS token = 2 '<seed:eos>'
|
| 97 |
+
print_info: PAD token = 1 '<seed:pad>'
|
| 98 |
+
print_info: LF token = 326 'Ċ'
|
| 99 |
+
print_info: EOG token = 2 '<seed:eos>'
|
| 100 |
+
print_info: max token length = 1024
|
| 101 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 102 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 103 |
+
load_tensors: offloaded 20/65 layers to GPU
|
| 104 |
+
load_tensors: CPU_Mapped model buffer size = 48353.80 MiB
|
| 105 |
+
load_tensors: CUDA0 model buffer size = 10300.86 MiB
|
| 106 |
+
load_tensors: CUDA1 model buffer size = 10300.86 MiB
|
| 107 |
+
..................................................................................................
|
| 108 |
+
llama_context: constructing llama_context
|
| 109 |
+
llama_context: n_seq_max = 1
|
| 110 |
+
llama_context: n_ctx = 2048
|
| 111 |
+
llama_context: n_ctx_seq = 2048
|
| 112 |
+
llama_context: n_batch = 2048
|
| 113 |
+
llama_context: n_ubatch = 512
|
| 114 |
+
llama_context: causal_attn = 1
|
| 115 |
+
llama_context: flash_attn = auto
|
| 116 |
+
llama_context: kv_unified = false
|
| 117 |
+
llama_context: freq_base = 10000000.0
|
| 118 |
+
llama_context: freq_scale = 1
|
| 119 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
|
| 120 |
+
llama_context: CPU output buffer size = 0.59 MiB
|
| 121 |
+
llama_kv_cache: CPU KV buffer size = 352.00 MiB
|
| 122 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 123 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 124 |
+
llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
|
| 125 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 126 |
+
llama_context: CUDA0 compute buffer size = 1828.00 MiB
|
| 127 |
+
llama_context: CUDA1 compute buffer size = 194.01 MiB
|
| 128 |
+
llama_context: CUDA_Host compute buffer size = 14.01 MiB
|
| 129 |
+
llama_context: graph nodes = 2183
|
| 130 |
+
llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
|
| 131 |
+
common_init_from_params: added <seed:eos> logit bias = -inf
|
| 132 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 133 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 134 |
+
|
| 135 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 136 |
+
perplexity: tokenizing the input ..
|
| 137 |
+
perplexity: tokenization took 46.673 ms
|
| 138 |
+
perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 139 |
+
perplexity: 17.81 seconds per pass - ETA 4.73 minutes
|
| 140 |
+
[1]2.6577,[2]2.8378,[3]3.2807,[4]3.5315,[5]4.0764,[6]4.3578,[7]4.5789,[8]4.7049,[9]4.8470,[10]5.0057,[11]5.0877,[12]5.1590,[13]5.2956,[14]5.4047,[15]5.4376,[16]5.4442,
|
| 141 |
+
Final estimate: PPL = 5.4442 +/- 0.12088
|
| 142 |
+
|
| 143 |
+
llama_perf_context_print: load time = 8172.12 ms
|
| 144 |
+
llama_perf_context_print: prompt eval time = 280924.96 ms / 32768 tokens ( 8.57 ms per token, 116.64 tokens per second)
|
| 145 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 146 |
+
llama_perf_context_print: total time = 281977.55 ms / 32769 tokens
|
| 147 |
+
llama_perf_context_print: graphs reused = 0
|
| 148 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 149 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 6968 + (12208 = 10300 + 80 + 1828) + 4937 |
|
| 150 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 12270 + (10574 = 10300 + 80 + 194) + 1279 |
|
| 151 |
+
llama_memory_breakdown_print: | - Host | 48719 = 48353 + 352 + 14 |
|
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/bench_metrics.json
ADDED
|
@@ -0,0 +1,44 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"raw_metrics": {
|
| 3 |
+
"llamabench": {
|
| 4 |
+
"backend": "CUDA",
|
| 5 |
+
"log_path": "Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/llamabench.md",
|
| 6 |
+
"ngl": "35",
|
| 7 |
+
"raw_row": {
|
| 8 |
+
"backend": "CUDA",
|
| 9 |
+
"model": "seed_oss 36B IQ4_NL - 4.5 bpw",
|
| 10 |
+
"ngl": "35",
|
| 11 |
+
"params": "36.15 B",
|
| 12 |
+
"size": "18.94 GiB",
|
| 13 |
+
"t/s": "28.49 \u00b1 3.98",
|
| 14 |
+
"test": "pp8",
|
| 15 |
+
"tps_value": 28.49
|
| 16 |
+
},
|
| 17 |
+
"test": "pp8",
|
| 18 |
+
"tps": 28.49
|
| 19 |
+
},
|
| 20 |
+
"perplexity": {
|
| 21 |
+
"code": {
|
| 22 |
+
"log_path": "Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_code.log",
|
| 23 |
+
"ppl": 1.4162,
|
| 24 |
+
"ppl_error": 0.00948
|
| 25 |
+
},
|
| 26 |
+
"general": {
|
| 27 |
+
"log_path": "Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_general.log",
|
| 28 |
+
"ppl": 6.8712,
|
| 29 |
+
"ppl_error": 0.16544
|
| 30 |
+
},
|
| 31 |
+
"math": {
|
| 32 |
+
"log_path": "Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_math.log",
|
| 33 |
+
"ppl": 5.4627,
|
| 34 |
+
"ppl_error": 0.12011
|
| 35 |
+
}
|
| 36 |
+
}
|
| 37 |
+
},
|
| 38 |
+
"summary": {
|
| 39 |
+
"avg_prec_loss_pct": 0.2709,
|
| 40 |
+
"bench_tps": 28.49,
|
| 41 |
+
"file_size_bytes": 20346264352,
|
| 42 |
+
"file_size_gb": 18.95
|
| 43 |
+
}
|
| 44 |
+
}
|
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/llamabench.md
ADDED
|
@@ -0,0 +1,11 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
| model | size | params | backend | ngl | test | t/s |
|
| 7 |
+
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
|
| 8 |
+
| seed_oss 36B IQ4_NL - 4.5 bpw | 18.94 GiB | 36.15 B | CUDA | 35 | pp8 | 28.49 ± 3.98 |
|
| 9 |
+
| seed_oss 36B IQ4_NL - 4.5 bpw | 18.94 GiB | 36.15 B | CUDA | 35 | tg128 | 5.12 ± 0.01 |
|
| 10 |
+
|
| 11 |
+
build: 92bb442ad (7040)
|
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_code.log
ADDED
|
@@ -0,0 +1,151 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20217 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23060 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = seed_oss
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = Seed-OSS
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 36B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
|
| 23 |
+
llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
|
| 24 |
+
llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
|
| 25 |
+
llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
|
| 26 |
+
llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
|
| 27 |
+
llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
|
| 28 |
+
llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
|
| 29 |
+
llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
|
| 30 |
+
llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 31 |
+
llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
|
| 32 |
+
llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
|
| 33 |
+
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
|
| 34 |
+
llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
|
| 35 |
+
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
|
| 42 |
+
llama_model_loader: - kv 31: general.quantization_version u32 = 2
|
| 43 |
+
llama_model_loader: - kv 32: general.file_type u32 = 25
|
| 44 |
+
llama_model_loader: - type f32: 321 tensors
|
| 45 |
+
llama_model_loader: - type iq4_nl: 450 tensors
|
| 46 |
+
print_info: file format = GGUF V3 (latest)
|
| 47 |
+
print_info: file type = IQ4_NL - 4.5 bpw
|
| 48 |
+
print_info: file size = 18.94 GiB (4.50 BPW)
|
| 49 |
+
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
|
| 50 |
+
load: printing all EOG tokens:
|
| 51 |
+
load: - 2 ('<seed:eos>')
|
| 52 |
+
load: special tokens cache size = 128
|
| 53 |
+
load: token to piece cache size = 0.9296 MB
|
| 54 |
+
print_info: arch = seed_oss
|
| 55 |
+
print_info: vocab_only = 0
|
| 56 |
+
print_info: n_ctx_train = 524288
|
| 57 |
+
print_info: n_embd = 5120
|
| 58 |
+
print_info: n_embd_inp = 5120
|
| 59 |
+
print_info: n_layer = 64
|
| 60 |
+
print_info: n_head = 80
|
| 61 |
+
print_info: n_head_kv = 8
|
| 62 |
+
print_info: n_rot = 128
|
| 63 |
+
print_info: n_swa = 0
|
| 64 |
+
print_info: is_swa_any = 0
|
| 65 |
+
print_info: n_embd_head_k = 128
|
| 66 |
+
print_info: n_embd_head_v = 128
|
| 67 |
+
print_info: n_gqa = 10
|
| 68 |
+
print_info: n_embd_k_gqa = 1024
|
| 69 |
+
print_info: n_embd_v_gqa = 1024
|
| 70 |
+
print_info: f_norm_eps = 0.0e+00
|
| 71 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 72 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 73 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 74 |
+
print_info: f_logit_scale = 0.0e+00
|
| 75 |
+
print_info: f_attn_scale = 0.0e+00
|
| 76 |
+
print_info: n_ff = 27648
|
| 77 |
+
print_info: n_expert = 0
|
| 78 |
+
print_info: n_expert_used = 0
|
| 79 |
+
print_info: n_expert_groups = 0
|
| 80 |
+
print_info: n_group_used = 0
|
| 81 |
+
print_info: causal attn = 1
|
| 82 |
+
print_info: pooling type = 0
|
| 83 |
+
print_info: rope type = 2
|
| 84 |
+
print_info: rope scaling = linear
|
| 85 |
+
print_info: freq_base_train = 10000000.0
|
| 86 |
+
print_info: freq_scale_train = 1
|
| 87 |
+
print_info: n_ctx_orig_yarn = 524288
|
| 88 |
+
print_info: rope_finetuned = unknown
|
| 89 |
+
print_info: model type = 36B
|
| 90 |
+
print_info: model params = 36.15 B
|
| 91 |
+
print_info: general.name = Seed OSS 36B Instruct Unsloth
|
| 92 |
+
print_info: vocab type = BPE
|
| 93 |
+
print_info: n_vocab = 155136
|
| 94 |
+
print_info: n_merges = 154737
|
| 95 |
+
print_info: BOS token = 0 '<seed:bos>'
|
| 96 |
+
print_info: EOS token = 2 '<seed:eos>'
|
| 97 |
+
print_info: PAD token = 1 '<seed:pad>'
|
| 98 |
+
print_info: LF token = 326 'Ċ'
|
| 99 |
+
print_info: EOG token = 2 '<seed:eos>'
|
| 100 |
+
print_info: max token length = 1024
|
| 101 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 102 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 103 |
+
load_tensors: offloaded 20/65 layers to GPU
|
| 104 |
+
load_tensors: CPU_Mapped model buffer size = 13602.24 MiB
|
| 105 |
+
load_tensors: CUDA0 model buffer size = 2897.73 MiB
|
| 106 |
+
load_tensors: CUDA1 model buffer size = 2897.73 MiB
|
| 107 |
+
..................................................................................................
|
| 108 |
+
llama_context: constructing llama_context
|
| 109 |
+
llama_context: n_seq_max = 1
|
| 110 |
+
llama_context: n_ctx = 2048
|
| 111 |
+
llama_context: n_ctx_seq = 2048
|
| 112 |
+
llama_context: n_batch = 2048
|
| 113 |
+
llama_context: n_ubatch = 512
|
| 114 |
+
llama_context: causal_attn = 1
|
| 115 |
+
llama_context: flash_attn = auto
|
| 116 |
+
llama_context: kv_unified = false
|
| 117 |
+
llama_context: freq_base = 10000000.0
|
| 118 |
+
llama_context: freq_scale = 1
|
| 119 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
|
| 120 |
+
llama_context: CPU output buffer size = 0.59 MiB
|
| 121 |
+
llama_kv_cache: CPU KV buffer size = 352.00 MiB
|
| 122 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 123 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 124 |
+
llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
|
| 125 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 126 |
+
llama_context: CUDA0 compute buffer size = 739.09 MiB
|
| 127 |
+
llama_context: CUDA1 compute buffer size = 194.01 MiB
|
| 128 |
+
llama_context: CUDA_Host compute buffer size = 14.01 MiB
|
| 129 |
+
llama_context: graph nodes = 2183
|
| 130 |
+
llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
|
| 131 |
+
common_init_from_params: added <seed:eos> logit bias = -inf
|
| 132 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 133 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 134 |
+
|
| 135 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 136 |
+
perplexity: tokenizing the input ..
|
| 137 |
+
perplexity: tokenization took 109.985 ms
|
| 138 |
+
perplexity: calculating perplexity over 48 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 139 |
+
perplexity: 6.58 seconds per pass - ETA 5.27 minutes
|
| 140 |
+
[1]1.5682,[2]1.4703,[3]1.2931,[4]1.2377,[5]1.1929,[6]1.2806,[7]1.3861,[8]1.4448,[9]1.4269,[10]1.4035,[11]1.3807,[12]1.3858,[13]1.3863,[14]1.3715,[15]1.3526,[16]1.3677,[17]1.3694,[18]1.3506,[19]1.3482,[20]1.3638,[21]1.3540,[22]1.3438,[23]1.3542,[24]1.3488,[25]1.3517,[26]1.3476,[27]1.3647,[28]1.3700,[29]1.3704,[30]1.3714,[31]1.3686,[32]1.3796,[33]1.3802,[34]1.3726,[35]1.3683,[36]1.3633,[37]1.3711,[38]1.3799,[39]1.3712,[40]1.3935,[41]1.4025,[42]1.4055,[43]1.4140,[44]1.4151,[45]1.4084,[46]1.4113,[47]1.4149,[48]1.4162,
|
| 141 |
+
Final estimate: PPL = 1.4162 +/- 0.00948
|
| 142 |
+
|
| 143 |
+
llama_perf_context_print: load time = 2451.89 ms
|
| 144 |
+
llama_perf_context_print: prompt eval time = 306874.12 ms / 98304 tokens ( 3.12 ms per token, 320.34 tokens per second)
|
| 145 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 146 |
+
llama_perf_context_print: total time = 308445.15 ms / 98305 tokens
|
| 147 |
+
llama_perf_context_print: graphs reused = 0
|
| 148 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 149 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16303 + ( 3716 = 2897 + 80 + 739) + 4094 |
|
| 150 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 19690 + ( 3171 = 2897 + 80 + 194) + 1262 |
|
| 151 |
+
llama_memory_breakdown_print: | - Host | 13968 = 13602 + 352 + 14 |
|
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_general.log
ADDED
|
@@ -0,0 +1,151 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20064 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23060 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = seed_oss
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = Seed-OSS
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 36B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
|
| 23 |
+
llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
|
| 24 |
+
llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
|
| 25 |
+
llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
|
| 26 |
+
llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
|
| 27 |
+
llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
|
| 28 |
+
llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
|
| 29 |
+
llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
|
| 30 |
+
llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 31 |
+
llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
|
| 32 |
+
llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
|
| 33 |
+
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
|
| 34 |
+
llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
|
| 35 |
+
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
|
| 42 |
+
llama_model_loader: - kv 31: general.quantization_version u32 = 2
|
| 43 |
+
llama_model_loader: - kv 32: general.file_type u32 = 25
|
| 44 |
+
llama_model_loader: - type f32: 321 tensors
|
| 45 |
+
llama_model_loader: - type iq4_nl: 450 tensors
|
| 46 |
+
print_info: file format = GGUF V3 (latest)
|
| 47 |
+
print_info: file type = IQ4_NL - 4.5 bpw
|
| 48 |
+
print_info: file size = 18.94 GiB (4.50 BPW)
|
| 49 |
+
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
|
| 50 |
+
load: printing all EOG tokens:
|
| 51 |
+
load: - 2 ('<seed:eos>')
|
| 52 |
+
load: special tokens cache size = 128
|
| 53 |
+
load: token to piece cache size = 0.9296 MB
|
| 54 |
+
print_info: arch = seed_oss
|
| 55 |
+
print_info: vocab_only = 0
|
| 56 |
+
print_info: n_ctx_train = 524288
|
| 57 |
+
print_info: n_embd = 5120
|
| 58 |
+
print_info: n_embd_inp = 5120
|
| 59 |
+
print_info: n_layer = 64
|
| 60 |
+
print_info: n_head = 80
|
| 61 |
+
print_info: n_head_kv = 8
|
| 62 |
+
print_info: n_rot = 128
|
| 63 |
+
print_info: n_swa = 0
|
| 64 |
+
print_info: is_swa_any = 0
|
| 65 |
+
print_info: n_embd_head_k = 128
|
| 66 |
+
print_info: n_embd_head_v = 128
|
| 67 |
+
print_info: n_gqa = 10
|
| 68 |
+
print_info: n_embd_k_gqa = 1024
|
| 69 |
+
print_info: n_embd_v_gqa = 1024
|
| 70 |
+
print_info: f_norm_eps = 0.0e+00
|
| 71 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 72 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 73 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 74 |
+
print_info: f_logit_scale = 0.0e+00
|
| 75 |
+
print_info: f_attn_scale = 0.0e+00
|
| 76 |
+
print_info: n_ff = 27648
|
| 77 |
+
print_info: n_expert = 0
|
| 78 |
+
print_info: n_expert_used = 0
|
| 79 |
+
print_info: n_expert_groups = 0
|
| 80 |
+
print_info: n_group_used = 0
|
| 81 |
+
print_info: causal attn = 1
|
| 82 |
+
print_info: pooling type = 0
|
| 83 |
+
print_info: rope type = 2
|
| 84 |
+
print_info: rope scaling = linear
|
| 85 |
+
print_info: freq_base_train = 10000000.0
|
| 86 |
+
print_info: freq_scale_train = 1
|
| 87 |
+
print_info: n_ctx_orig_yarn = 524288
|
| 88 |
+
print_info: rope_finetuned = unknown
|
| 89 |
+
print_info: model type = 36B
|
| 90 |
+
print_info: model params = 36.15 B
|
| 91 |
+
print_info: general.name = Seed OSS 36B Instruct Unsloth
|
| 92 |
+
print_info: vocab type = BPE
|
| 93 |
+
print_info: n_vocab = 155136
|
| 94 |
+
print_info: n_merges = 154737
|
| 95 |
+
print_info: BOS token = 0 '<seed:bos>'
|
| 96 |
+
print_info: EOS token = 2 '<seed:eos>'
|
| 97 |
+
print_info: PAD token = 1 '<seed:pad>'
|
| 98 |
+
print_info: LF token = 326 'Ċ'
|
| 99 |
+
print_info: EOG token = 2 '<seed:eos>'
|
| 100 |
+
print_info: max token length = 1024
|
| 101 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 102 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 103 |
+
load_tensors: offloaded 20/65 layers to GPU
|
| 104 |
+
load_tensors: CPU_Mapped model buffer size = 13602.24 MiB
|
| 105 |
+
load_tensors: CUDA0 model buffer size = 2897.73 MiB
|
| 106 |
+
load_tensors: CUDA1 model buffer size = 2897.73 MiB
|
| 107 |
+
..................................................................................................
|
| 108 |
+
llama_context: constructing llama_context
|
| 109 |
+
llama_context: n_seq_max = 1
|
| 110 |
+
llama_context: n_ctx = 2048
|
| 111 |
+
llama_context: n_ctx_seq = 2048
|
| 112 |
+
llama_context: n_batch = 2048
|
| 113 |
+
llama_context: n_ubatch = 512
|
| 114 |
+
llama_context: causal_attn = 1
|
| 115 |
+
llama_context: flash_attn = auto
|
| 116 |
+
llama_context: kv_unified = false
|
| 117 |
+
llama_context: freq_base = 10000000.0
|
| 118 |
+
llama_context: freq_scale = 1
|
| 119 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
|
| 120 |
+
llama_context: CPU output buffer size = 0.59 MiB
|
| 121 |
+
llama_kv_cache: CPU KV buffer size = 352.00 MiB
|
| 122 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 123 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 124 |
+
llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
|
| 125 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 126 |
+
llama_context: CUDA0 compute buffer size = 739.09 MiB
|
| 127 |
+
llama_context: CUDA1 compute buffer size = 194.01 MiB
|
| 128 |
+
llama_context: CUDA_Host compute buffer size = 14.01 MiB
|
| 129 |
+
llama_context: graph nodes = 2183
|
| 130 |
+
llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
|
| 131 |
+
common_init_from_params: added <seed:eos> logit bias = -inf
|
| 132 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 133 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 134 |
+
|
| 135 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 136 |
+
perplexity: tokenizing the input ..
|
| 137 |
+
perplexity: tokenization took 50.71 ms
|
| 138 |
+
perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 139 |
+
perplexity: 6.57 seconds per pass - ETA 1.63 minutes
|
| 140 |
+
[1]7.0271,[2]8.1273,[3]8.5259,[4]8.2753,[5]8.0705,[6]6.7617,[7]5.9479,[8]6.0034,[9]6.2604,[10]6.3227,[11]6.4549,[12]6.7596,[13]6.7839,[14]6.8638,[15]6.8712,
|
| 141 |
+
Final estimate: PPL = 6.8712 +/- 0.16544
|
| 142 |
+
|
| 143 |
+
llama_perf_context_print: load time = 2474.98 ms
|
| 144 |
+
llama_perf_context_print: prompt eval time = 95401.62 ms / 30720 tokens ( 3.11 ms per token, 322.01 tokens per second)
|
| 145 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 146 |
+
llama_perf_context_print: total time = 95899.95 ms / 30721 tokens
|
| 147 |
+
llama_perf_context_print: graphs reused = 0
|
| 148 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 149 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16309 + ( 3716 = 2897 + 80 + 739) + 4089 |
|
| 150 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 19690 + ( 3171 = 2897 + 80 + 194) + 1262 |
|
| 151 |
+
llama_memory_breakdown_print: | - Host | 13968 = 13602 + 352 + 14 |
|