magiccodingman committed
Commit 94a426d · verified · 1 Parent(s): f822bc4

File name changes

This view is limited to 50 files because it contains too many changes. See raw diff.
Files changed (50)
  1. .gitattributes +9 -0
  2. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/bench_metrics.json +44 -0
  3. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/llamabench.md +11 -0
  4. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_code.log +152 -0
  5. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_general.log +152 -0
  6. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_math.log +152 -0
  7. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K/bench_metrics.json +44 -0
  8. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K/llamabench.md +11 -0
  9. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K/perplexity_code.log +152 -0
  10. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K/perplexity_general.log +152 -0
  11. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K/perplexity_math.log +152 -0
  12. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/bench_metrics.json +44 -0
  13. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/llamabench.md +11 -0
  14. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/perplexity_code.log +152 -0
  15. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/perplexity_general.log +152 -0
  16. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/perplexity_math.log +152 -0
  17. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_Q8_0-lm_head_Q5_K/bench_metrics.json +44 -0
  18. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_Q8_0-lm_head_Q5_K/llamabench.md +11 -0
  19. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_Q8_0-lm_head_Q5_K/perplexity_code.log +153 -0
  20. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_Q8_0-lm_head_Q5_K/perplexity_general.log +153 -0
  21. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_Q8_0-lm_head_Q5_K/perplexity_math.log +153 -0
  22. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/bench_metrics.json +44 -0
  23. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/llamabench.md +11 -0
  24. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_code.log +152 -0
  25. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_general.log +152 -0
  26. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_math.log +152 -0
  27. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/bench_metrics.json +44 -0
  28. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/llamabench.md +11 -0
  29. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_code.log +152 -0
  30. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_general.log +152 -0
  31. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_math.log +152 -0
  32. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/bench_metrics.json +44 -0
  33. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/llamabench.md +11 -0
  34. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_code.log +152 -0
  35. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_general.log +152 -0
  36. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_math.log +152 -0
  37. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_MXFP4-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/bench_metrics.json +44 -0
  38. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_MXFP4-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/llamabench.md +11 -0
  39. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_MXFP4-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_code.log +152 -0
  40. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_MXFP4-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_general.log +152 -0
  41. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_MXFP4-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_math.log +152 -0
  42. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16/bench_metrics.json +44 -0
  43. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16/llamabench.md +11 -0
  44. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16/perplexity_code.log +151 -0
  45. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16/perplexity_general.log +151 -0
  46. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16/perplexity_math.log +151 -0
  47. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/bench_metrics.json +44 -0
  48. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/llamabench.md +11 -0
  49. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_code.log +151 -0
  50. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_general.log +151 -0
.gitattributes CHANGED
@@ -33,3 +33,12 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
+ thinking_budget.png filter=lfs diff=lfs merge=lfs -text
+ Seed-OSS-36B-Instruct-MXFP4_MOE.gguf filter=lfs diff=lfs merge=lfs -text
+ Seed-OSS-36B-Instruct-mxfp4_moe-E-B16-D-IQ4NL-KOU-Q6K-HQ-Q8_0.gguf filter=lfs diff=lfs merge=lfs -text
+ Seed-OSS-36B-Instruct-mxfp4_moe-EHQKOUD-IQ4NL.gguf filter=lfs diff=lfs merge=lfs -text
+ Seed-OSS-36B-Instruct-mxfp4_moe-EHQKOUD-Q6K.gguf filter=lfs diff=lfs merge=lfs -text
+ Seed-OSS-36B-Instruct-mxfp4_moe-HK-B16-EO-Q5K-QUD-Q8_0.gguf filter=lfs diff=lfs merge=lfs -text
+ Seed-OSS-36B-Instruct-mxfp4_moe-HQKU-IQ4NL-EOD-MXFP4.gguf filter=lfs diff=lfs merge=lfs -text
+ Seed-OSS-36B-Instruct-mxfp4_moe-O-MXFP4-EHQKUD-Q8_0.gguf filter=lfs diff=lfs merge=lfs -text
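These new entries route the added large artifacts through Git LFS. For reference, equivalent lines can be appended with `git lfs track`; a minimal sketch in Python, assuming git-lfs is installed and the repository root is the working directory (the two filenames are taken from the diff above):

```python
import subprocess

# `git lfs track <path>` appends a matching "filter=lfs diff=lfs merge=lfs -text"
# rule to .gitattributes, which is what this commit adds for each large artifact.
for name in ("tokenizer.json", "Seed-OSS-36B-Instruct-MXFP4_MOE.gguf"):
    subprocess.run(["git", "lfs", "track", name], check=True)
```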
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/bench_metrics.json ADDED
@@ -0,0 +1,44 @@
+ {
+   "raw_metrics": {
+     "llamabench": {
+       "backend": "CUDA",
+       "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/llamabench.md",
+       "ngl": "35",
+       "raw_row": {
+         "backend": "CUDA",
+         "model": "seed_oss 36B IQ4_NL - 4.5 bpw",
+         "ngl": "35",
+         "params": "36.15 B",
+         "size": "19.43 GiB",
+         "t/s": "30.53 \u00b1 0.74",
+         "test": "pp8",
+         "tps_value": 30.53
+       },
+       "test": "pp8",
+       "tps": 30.53
+     },
+     "perplexity": {
+       "code": {
+         "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_code.log",
+         "ppl": 1.4176,
+         "ppl_error": 0.00953
+       },
+       "general": {
+         "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_general.log",
+         "ppl": 6.8507,
+         "ppl_error": 0.16499
+       },
+       "math": {
+         "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_math.log",
+         "ppl": 5.4384,
+         "ppl_error": 0.1198
+       }
+     }
+   },
+   "summary": {
+     "avg_prec_loss_pct": 0.3254,
+     "bench_tps": 30.53,
+     "file_size_bytes": 20864981792,
+     "file_size_gb": 19.43
+   }
+ }
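The `summary` block condenses the raw llama-bench and perplexity numbers for this variant. Below is a minimal sketch of how such a file could be consumed and the precision-loss figure recomputed against the BF16-everything variant also listed in this commit; the averaging formula and paths are assumptions, since the generating script is not part of this diff:

```python
import json
from pathlib import Path

DATA = Path("Benchmarks/DataCollection")
QUANT = "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K"
BASELINE = "Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16"

def load_metrics(run_name: str) -> dict:
    """Read the bench_metrics.json collected for one quantization variant."""
    return json.loads((DATA / run_name / "bench_metrics.json").read_text())

def avg_prec_loss_pct(quant: dict, baseline: dict) -> float:
    """Assumed formula: mean relative perplexity increase vs. the baseline, in percent."""
    losses = []
    for cat in ("code", "general", "math"):
        q = quant["raw_metrics"]["perplexity"][cat]["ppl"]
        b = baseline["raw_metrics"]["perplexity"][cat]["ppl"]
        losses.append((q - b) / b * 100.0)
    return sum(losses) / len(losses)

if __name__ == "__main__":
    print(round(avg_prec_loss_pct(load_metrics(QUANT), load_metrics(BASELINE)), 4))
```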
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/llamabench.md ADDED
@@ -0,0 +1,11 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ | model | size | params | backend | ngl | test | t/s |
+ | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
+ | seed_oss 36B IQ4_NL - 4.5 bpw | 19.43 GiB | 36.15 B | CUDA | 35 | pp8 | 30.53 ± 0.74 |
+ | seed_oss 36B IQ4_NL - 4.5 bpw | 19.43 GiB | 36.15 B | CUDA | 35 | tg128 | 5.08 ± 0.02 |
+
+ build: 92bb442ad (7040)
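A table like the one above comes from llama.cpp's `llama-bench` tool. A sketch of an equivalent invocation, wrapped in Python to keep the examples in one language; the model path is hypothetical, and `-p 8` / `-n 128` are assumed to correspond to the `pp8` and `tg128` rows shown:

```python
import subprocess

cmd = [
    "llama-bench",
    "-m", "dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-lm_head_Q5_K.gguf",  # hypothetical local path
    "-ngl", "35",   # layers offloaded to GPU, matching the ngl column
    "-p", "8",      # prompt-processing test (pp8)
    "-n", "128",    # token-generation test (tg128)
    "-o", "md",     # emit a markdown table like the one above
]
subprocess.run(cmd, check=True)
```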
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_code.log ADDED
@@ -0,0 +1,152 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20468 MiB free
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K.gguf (version GGUF V3 (latest))
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
+ llama_model_loader: - kv 1: general.type str = model
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
+ llama_model_loader: - kv 5: general.size_label str = 36B
+ llama_model_loader: - kv 6: general.license str = apache-2.0
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
+ llama_model_loader: - kv 32: general.file_type u32 = 25
+ llama_model_loader: - type f32: 321 tensors
+ llama_model_loader: - type q5_K: 65 tensors
+ llama_model_loader: - type iq4_nl: 385 tensors
+ print_info: file format = GGUF V3 (latest)
+ print_info: file type = IQ4_NL - 4.5 bpw
+ print_info: file size = 19.43 GiB (4.62 BPW)
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
+ load: printing all EOG tokens:
+ load: - 2 ('<seed:eos>')
+ load: special tokens cache size = 128
+ load: token to piece cache size = 0.9296 MB
+ print_info: arch = seed_oss
+ print_info: vocab_only = 0
+ print_info: n_ctx_train = 524288
+ print_info: n_embd = 5120
+ print_info: n_embd_inp = 5120
+ print_info: n_layer = 64
+ print_info: n_head = 80
+ print_info: n_head_kv = 8
+ print_info: n_rot = 128
+ print_info: n_swa = 0
+ print_info: is_swa_any = 0
+ print_info: n_embd_head_k = 128
+ print_info: n_embd_head_v = 128
+ print_info: n_gqa = 10
+ print_info: n_embd_k_gqa = 1024
+ print_info: n_embd_v_gqa = 1024
+ print_info: f_norm_eps = 0.0e+00
+ print_info: f_norm_rms_eps = 1.0e-06
+ print_info: f_clamp_kqv = 0.0e+00
+ print_info: f_max_alibi_bias = 0.0e+00
+ print_info: f_logit_scale = 0.0e+00
+ print_info: f_attn_scale = 0.0e+00
+ print_info: n_ff = 27648
+ print_info: n_expert = 0
+ print_info: n_expert_used = 0
+ print_info: n_expert_groups = 0
+ print_info: n_group_used = 0
+ print_info: causal attn = 1
+ print_info: pooling type = 0
+ print_info: rope type = 2
+ print_info: rope scaling = linear
+ print_info: freq_base_train = 10000000.0
+ print_info: freq_scale_train = 1
+ print_info: n_ctx_orig_yarn = 524288
+ print_info: rope_finetuned = unknown
+ print_info: model type = 36B
+ print_info: model params = 36.15 B
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
+ print_info: vocab type = BPE
+ print_info: n_vocab = 155136
+ print_info: n_merges = 154737
+ print_info: BOS token = 0 '<seed:bos>'
+ print_info: EOS token = 2 '<seed:eos>'
+ print_info: PAD token = 1 '<seed:pad>'
+ print_info: LF token = 326 'Ċ'
+ print_info: EOG token = 2 '<seed:eos>'
+ print_info: max token length = 1024
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
+ load_tensors: offloading 20 repeating layers to GPU
+ load_tensors: offloaded 20/65 layers to GPU
+ load_tensors: CPU_Mapped model buffer size = 13971.93 MiB
+ load_tensors: CUDA0 model buffer size = 2960.23 MiB
+ load_tensors: CUDA1 model buffer size = 2960.23 MiB
+ ..................................................................................................
+ llama_context: constructing llama_context
+ llama_context: n_seq_max = 1
+ llama_context: n_ctx = 2048
+ llama_context: n_ctx_seq = 2048
+ llama_context: n_batch = 2048
+ llama_context: n_ubatch = 512
+ llama_context: causal_attn = 1
+ llama_context: flash_attn = auto
+ llama_context: kv_unified = false
+ llama_context: freq_base = 10000000.0
+ llama_context: freq_scale = 1
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
+ llama_context: CPU output buffer size = 0.59 MiB
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
+ llama_context: Flash Attention was auto, set to enabled
+ llama_context: CUDA0 compute buffer size = 833.78 MiB
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
+ llama_context: graph nodes = 2183
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
+ common_init_from_params: added <seed:eos> logit bias = -inf
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+ perplexity: tokenizing the input ..
+ perplexity: tokenization took 113.066 ms
+ perplexity: calculating perplexity over 48 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+ perplexity: 6.64 seconds per pass - ETA 5.30 minutes
+ [1]1.5663,[2]1.4687,[3]1.2922,[4]1.2374,[5]1.1926,[6]1.2795,[7]1.3859,[8]1.4450,[9]1.4272,[10]1.4042,[11]1.3816,[12]1.3867,[13]1.3871,[14]1.3725,[15]1.3537,[16]1.3689,[17]1.3703,[18]1.3515,[19]1.3491,[20]1.3652,[21]1.3554,[22]1.3451,[23]1.3557,[24]1.3503,[25]1.3534,[26]1.3494,[27]1.3664,[28]1.3717,[29]1.3721,[30]1.3730,[31]1.3704,[32]1.3812,[33]1.3819,[34]1.3743,[35]1.3700,[36]1.3651,[37]1.3731,[38]1.3820,[39]1.3733,[40]1.3951,[41]1.4041,[42]1.4071,[43]1.4154,[44]1.4164,[45]1.4097,[46]1.4126,[47]1.4164,[48]1.4176,
+ Final estimate: PPL = 1.4176 +/- 0.00953
+
+ llama_perf_context_print: load time = 2579.70 ms
+ llama_perf_context_print: prompt eval time = 306842.14 ms / 98304 tokens ( 3.12 ms per token, 320.37 tokens per second)
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 308603.02 ms / 98305 tokens
+ llama_perf_context_print: graphs reused = 0
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16342 + ( 3874 = 2960 + 80 + 833) + 3898 |
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20148 + ( 3234 = 2960 + 80 + 194) + 741 |
+ llama_memory_breakdown_print: | - Host | 14337 = 13971 + 352 + 14 |
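Each perplexity_*.log ends with a `Final estimate: PPL = X +/- Y` line, which is what bench_metrics.json records as `ppl` / `ppl_error`. A small parsing sketch (the regex is an assumption based on the log format shown above):

```python
import re
from pathlib import Path

FINAL_RE = re.compile(r"Final estimate: PPL = ([0-9.]+) \+/- ([0-9.]+)")

def read_final_ppl(log_path: Path) -> tuple[float, float]:
    """Return (ppl, ppl_error) from a llama.cpp perplexity log."""
    m = FINAL_RE.search(log_path.read_text())
    if m is None:
        raise ValueError(f"no final PPL estimate in {log_path}")
    return float(m.group(1)), float(m.group(2))

# e.g. read_final_ppl(Path("perplexity_code.log")) -> (1.4176, 0.00953)
```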
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_general.log ADDED
@@ -0,0 +1,152 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20468 MiB free
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K.gguf (version GGUF V3 (latest))
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
+ llama_model_loader: - kv 1: general.type str = model
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
+ llama_model_loader: - kv 5: general.size_label str = 36B
+ llama_model_loader: - kv 6: general.license str = apache-2.0
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
+ llama_model_loader: - kv 32: general.file_type u32 = 25
+ llama_model_loader: - type f32: 321 tensors
+ llama_model_loader: - type q5_K: 65 tensors
+ llama_model_loader: - type iq4_nl: 385 tensors
+ print_info: file format = GGUF V3 (latest)
+ print_info: file type = IQ4_NL - 4.5 bpw
+ print_info: file size = 19.43 GiB (4.62 BPW)
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
+ load: printing all EOG tokens:
+ load: - 2 ('<seed:eos>')
+ load: special tokens cache size = 128
+ load: token to piece cache size = 0.9296 MB
+ print_info: arch = seed_oss
+ print_info: vocab_only = 0
+ print_info: n_ctx_train = 524288
+ print_info: n_embd = 5120
+ print_info: n_embd_inp = 5120
+ print_info: n_layer = 64
+ print_info: n_head = 80
+ print_info: n_head_kv = 8
+ print_info: n_rot = 128
+ print_info: n_swa = 0
+ print_info: is_swa_any = 0
+ print_info: n_embd_head_k = 128
+ print_info: n_embd_head_v = 128
+ print_info: n_gqa = 10
+ print_info: n_embd_k_gqa = 1024
+ print_info: n_embd_v_gqa = 1024
+ print_info: f_norm_eps = 0.0e+00
+ print_info: f_norm_rms_eps = 1.0e-06
+ print_info: f_clamp_kqv = 0.0e+00
+ print_info: f_max_alibi_bias = 0.0e+00
+ print_info: f_logit_scale = 0.0e+00
+ print_info: f_attn_scale = 0.0e+00
+ print_info: n_ff = 27648
+ print_info: n_expert = 0
+ print_info: n_expert_used = 0
+ print_info: n_expert_groups = 0
+ print_info: n_group_used = 0
+ print_info: causal attn = 1
+ print_info: pooling type = 0
+ print_info: rope type = 2
+ print_info: rope scaling = linear
+ print_info: freq_base_train = 10000000.0
+ print_info: freq_scale_train = 1
+ print_info: n_ctx_orig_yarn = 524288
+ print_info: rope_finetuned = unknown
+ print_info: model type = 36B
+ print_info: model params = 36.15 B
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
+ print_info: vocab type = BPE
+ print_info: n_vocab = 155136
+ print_info: n_merges = 154737
+ print_info: BOS token = 0 '<seed:bos>'
+ print_info: EOS token = 2 '<seed:eos>'
+ print_info: PAD token = 1 '<seed:pad>'
+ print_info: LF token = 326 'Ċ'
+ print_info: EOG token = 2 '<seed:eos>'
+ print_info: max token length = 1024
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
+ load_tensors: offloading 20 repeating layers to GPU
+ load_tensors: offloaded 20/65 layers to GPU
+ load_tensors: CPU_Mapped model buffer size = 13971.93 MiB
+ load_tensors: CUDA0 model buffer size = 2960.23 MiB
+ load_tensors: CUDA1 model buffer size = 2960.23 MiB
+ ..................................................................................................
+ llama_context: constructing llama_context
+ llama_context: n_seq_max = 1
+ llama_context: n_ctx = 2048
+ llama_context: n_ctx_seq = 2048
+ llama_context: n_batch = 2048
+ llama_context: n_ubatch = 512
+ llama_context: causal_attn = 1
+ llama_context: flash_attn = auto
+ llama_context: kv_unified = false
+ llama_context: freq_base = 10000000.0
+ llama_context: freq_scale = 1
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
+ llama_context: CPU output buffer size = 0.59 MiB
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
+ llama_context: Flash Attention was auto, set to enabled
+ llama_context: CUDA0 compute buffer size = 833.78 MiB
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
+ llama_context: graph nodes = 2183
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
+ common_init_from_params: added <seed:eos> logit bias = -inf
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+ perplexity: tokenizing the input ..
+ perplexity: tokenization took 50.772 ms
+ perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+ perplexity: 6.57 seconds per pass - ETA 1.63 minutes
+ [1]6.9757,[2]8.0597,[3]8.4705,[4]8.2112,[5]8.0009,[6]6.7181,[7]5.9191,[8]5.9832,[9]6.2474,[10]6.3077,[11]6.4380,[12]6.7394,[13]6.7657,[14]6.8428,[15]6.8507,
+ Final estimate: PPL = 6.8507 +/- 0.16499
+
+ llama_perf_context_print: load time = 2555.14 ms
+ llama_perf_context_print: prompt eval time = 95523.06 ms / 30720 tokens ( 3.11 ms per token, 321.60 tokens per second)
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 96021.46 ms / 30721 tokens
+ llama_perf_context_print: graphs reused = 0
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16389 + ( 3874 = 2960 + 80 + 833) + 3851 |
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20148 + ( 3234 = 2960 + 80 + 194) + 741 |
+ llama_memory_breakdown_print: | - Host | 14337 = 13971 + 352 + 14 |
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_math.log ADDED
@@ -0,0 +1,152 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20416 MiB free
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K.gguf (version GGUF V3 (latest))
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
+ llama_model_loader: - kv 1: general.type str = model
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
+ llama_model_loader: - kv 5: general.size_label str = 36B
+ llama_model_loader: - kv 6: general.license str = apache-2.0
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
+ llama_model_loader: - kv 32: general.file_type u32 = 25
+ llama_model_loader: - type f32: 321 tensors
+ llama_model_loader: - type q5_K: 65 tensors
+ llama_model_loader: - type iq4_nl: 385 tensors
+ print_info: file format = GGUF V3 (latest)
+ print_info: file type = IQ4_NL - 4.5 bpw
+ print_info: file size = 19.43 GiB (4.62 BPW)
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
+ load: printing all EOG tokens:
+ load: - 2 ('<seed:eos>')
+ load: special tokens cache size = 128
+ load: token to piece cache size = 0.9296 MB
+ print_info: arch = seed_oss
+ print_info: vocab_only = 0
+ print_info: n_ctx_train = 524288
+ print_info: n_embd = 5120
+ print_info: n_embd_inp = 5120
+ print_info: n_layer = 64
+ print_info: n_head = 80
+ print_info: n_head_kv = 8
+ print_info: n_rot = 128
+ print_info: n_swa = 0
+ print_info: is_swa_any = 0
+ print_info: n_embd_head_k = 128
+ print_info: n_embd_head_v = 128
+ print_info: n_gqa = 10
+ print_info: n_embd_k_gqa = 1024
+ print_info: n_embd_v_gqa = 1024
+ print_info: f_norm_eps = 0.0e+00
+ print_info: f_norm_rms_eps = 1.0e-06
+ print_info: f_clamp_kqv = 0.0e+00
+ print_info: f_max_alibi_bias = 0.0e+00
+ print_info: f_logit_scale = 0.0e+00
+ print_info: f_attn_scale = 0.0e+00
+ print_info: n_ff = 27648
+ print_info: n_expert = 0
+ print_info: n_expert_used = 0
+ print_info: n_expert_groups = 0
+ print_info: n_group_used = 0
+ print_info: causal attn = 1
+ print_info: pooling type = 0
+ print_info: rope type = 2
+ print_info: rope scaling = linear
+ print_info: freq_base_train = 10000000.0
+ print_info: freq_scale_train = 1
+ print_info: n_ctx_orig_yarn = 524288
+ print_info: rope_finetuned = unknown
+ print_info: model type = 36B
+ print_info: model params = 36.15 B
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
+ print_info: vocab type = BPE
+ print_info: n_vocab = 155136
+ print_info: n_merges = 154737
+ print_info: BOS token = 0 '<seed:bos>'
+ print_info: EOS token = 2 '<seed:eos>'
+ print_info: PAD token = 1 '<seed:pad>'
+ print_info: LF token = 326 'Ċ'
+ print_info: EOG token = 2 '<seed:eos>'
+ print_info: max token length = 1024
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
+ load_tensors: offloading 20 repeating layers to GPU
+ load_tensors: offloaded 20/65 layers to GPU
+ load_tensors: CPU_Mapped model buffer size = 13971.93 MiB
+ load_tensors: CUDA0 model buffer size = 2960.23 MiB
+ load_tensors: CUDA1 model buffer size = 2960.23 MiB
+ ..................................................................................................
+ llama_context: constructing llama_context
+ llama_context: n_seq_max = 1
+ llama_context: n_ctx = 2048
+ llama_context: n_ctx_seq = 2048
+ llama_context: n_batch = 2048
+ llama_context: n_ubatch = 512
+ llama_context: causal_attn = 1
+ llama_context: flash_attn = auto
+ llama_context: kv_unified = false
+ llama_context: freq_base = 10000000.0
+ llama_context: freq_scale = 1
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
+ llama_context: CPU output buffer size = 0.59 MiB
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
+ llama_context: Flash Attention was auto, set to enabled
+ llama_context: CUDA0 compute buffer size = 833.78 MiB
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
+ llama_context: graph nodes = 2183
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
+ common_init_from_params: added <seed:eos> logit bias = -inf
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+ perplexity: tokenizing the input ..
+ perplexity: tokenization took 46.827 ms
+ perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+ perplexity: 6.63 seconds per pass - ETA 1.77 minutes
+ [1]2.7608,[2]2.9054,[3]3.3323,[4]3.5844,[5]4.0966,[6]4.3685,[7]4.5745,[8]4.7023,[9]4.8500,[10]5.0005,[11]5.0791,[12]5.1544,[13]5.2872,[14]5.3968,[15]5.4247,[16]5.4384,
+ Final estimate: PPL = 5.4384 +/- 0.11980
+
+ llama_perf_context_print: load time = 2540.69 ms
+ llama_perf_context_print: prompt eval time = 102170.30 ms / 32768 tokens ( 3.12 ms per token, 320.72 tokens per second)
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 102686.99 ms / 32769 tokens
+ llama_perf_context_print: graphs reused = 0
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16345 + ( 3874 = 2960 + 80 + 833) + 3896 |
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20148 + ( 3234 = 2960 + 80 + 194) + 741 |
+ llama_memory_breakdown_print: | - Host | 14337 = 13971 + 352 + 14 |
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K/bench_metrics.json ADDED
@@ -0,0 +1,44 @@
+ {
+   "raw_metrics": {
+     "llamabench": {
+       "backend": "CUDA",
+       "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K/llamabench.md",
+       "ngl": "35",
+       "raw_row": {
+         "backend": "CUDA",
+         "model": "seed_oss 36B IQ4_NL - 4.5 bpw",
+         "ngl": "35",
+         "params": "36.15 B",
+         "size": "19.94 GiB",
+         "t/s": "25.00 \u00b1 2.73",
+         "test": "pp8",
+         "tps_value": 25.0
+       },
+       "test": "pp8",
+       "tps": 25.0
+     },
+     "perplexity": {
+       "code": {
+         "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K/perplexity_code.log",
+         "ppl": 1.4162,
+         "ppl_error": 0.00952
+       },
+       "general": {
+         "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K/perplexity_general.log",
+         "ppl": 6.8281,
+         "ppl_error": 0.16452
+       },
+       "math": {
+         "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K/perplexity_math.log",
+         "ppl": 5.442,
+         "ppl_error": 0.11987
+       }
+     }
+   },
+   "summary": {
+     "avg_prec_loss_pct": 0.3797,
+     "bench_tps": 25.0,
+     "file_size_bytes": 21416119072,
+     "file_size_gb": 19.95
+   }
+ }
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K/llamabench.md ADDED
@@ -0,0 +1,11 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ | model | size | params | backend | ngl | test | t/s |
+ | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
+ | seed_oss 36B IQ4_NL - 4.5 bpw | 19.94 GiB | 36.15 B | CUDA | 35 | pp8 | 25.00 ± 2.73 |
+ | seed_oss 36B IQ4_NL - 4.5 bpw | 19.94 GiB | 36.15 B | CUDA | 35 | tg128 | 4.99 ± 0.01 |
+
+ build: 92bb442ad (7040)
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K/perplexity_code.log ADDED
@@ -0,0 +1,152 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20469 MiB free
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K.gguf (version GGUF V3 (latest))
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
+ llama_model_loader: - kv 1: general.type str = model
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
+ llama_model_loader: - kv 5: general.size_label str = 36B
+ llama_model_loader: - kv 6: general.license str = apache-2.0
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
+ llama_model_loader: - kv 32: general.file_type u32 = 25
+ llama_model_loader: - type f32: 321 tensors
+ llama_model_loader: - type q6_K: 65 tensors
+ llama_model_loader: - type iq4_nl: 385 tensors
+ print_info: file format = GGUF V3 (latest)
+ print_info: file type = IQ4_NL - 4.5 bpw
+ print_info: file size = 19.94 GiB (4.74 BPW)
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
+ load: printing all EOG tokens:
+ load: - 2 ('<seed:eos>')
+ load: special tokens cache size = 128
+ load: token to piece cache size = 0.9296 MB
+ print_info: arch = seed_oss
+ print_info: vocab_only = 0
+ print_info: n_ctx_train = 524288
+ print_info: n_embd = 5120
+ print_info: n_embd_inp = 5120
+ print_info: n_layer = 64
+ print_info: n_head = 80
+ print_info: n_head_kv = 8
+ print_info: n_rot = 128
+ print_info: n_swa = 0
+ print_info: is_swa_any = 0
+ print_info: n_embd_head_k = 128
+ print_info: n_embd_head_v = 128
+ print_info: n_gqa = 10
+ print_info: n_embd_k_gqa = 1024
+ print_info: n_embd_v_gqa = 1024
+ print_info: f_norm_eps = 0.0e+00
+ print_info: f_norm_rms_eps = 1.0e-06
+ print_info: f_clamp_kqv = 0.0e+00
+ print_info: f_max_alibi_bias = 0.0e+00
+ print_info: f_logit_scale = 0.0e+00
+ print_info: f_attn_scale = 0.0e+00
+ print_info: n_ff = 27648
+ print_info: n_expert = 0
+ print_info: n_expert_used = 0
+ print_info: n_expert_groups = 0
+ print_info: n_group_used = 0
+ print_info: causal attn = 1
+ print_info: pooling type = 0
+ print_info: rope type = 2
+ print_info: rope scaling = linear
+ print_info: freq_base_train = 10000000.0
+ print_info: freq_scale_train = 1
+ print_info: n_ctx_orig_yarn = 524288
+ print_info: rope_finetuned = unknown
+ print_info: model type = 36B
+ print_info: model params = 36.15 B
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
+ print_info: vocab type = BPE
+ print_info: n_vocab = 155136
+ print_info: n_merges = 154737
+ print_info: BOS token = 0 '<seed:bos>'
+ print_info: EOS token = 2 '<seed:eos>'
+ print_info: PAD token = 1 '<seed:pad>'
+ print_info: LF token = 326 'Ċ'
+ print_info: EOG token = 2 '<seed:eos>'
+ print_info: max token length = 1024
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
+ load_tensors: offloading 20 repeating layers to GPU
+ load_tensors: offloaded 20/65 layers to GPU
+ load_tensors: CPU_Mapped model buffer size = 14364.72 MiB
+ load_tensors: CUDA0 model buffer size = 3026.64 MiB
+ load_tensors: CUDA1 model buffer size = 3026.64 MiB
+ .................................................................................................
+ llama_context: constructing llama_context
+ llama_context: n_seq_max = 1
+ llama_context: n_ctx = 2048
+ llama_context: n_ctx_seq = 2048
+ llama_context: n_batch = 2048
+ llama_context: n_ubatch = 512
+ llama_context: causal_attn = 1
+ llama_context: flash_attn = auto
+ llama_context: kv_unified = false
+ llama_context: freq_base = 10000000.0
+ llama_context: freq_scale = 1
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
+ llama_context: CPU output buffer size = 0.59 MiB
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
+ llama_context: Flash Attention was auto, set to enabled
+ llama_context: CUDA0 compute buffer size = 934.39 MiB
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
+ llama_context: graph nodes = 2183
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
+ common_init_from_params: added <seed:eos> logit bias = -inf
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+ perplexity: tokenizing the input ..
+ perplexity: tokenization took 111.885 ms
+ perplexity: calculating perplexity over 48 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+ perplexity: 6.75 seconds per pass - ETA 5.38 minutes
+ [1]1.5604,[2]1.4661,[3]1.2906,[4]1.2351,[5]1.1912,[6]1.2788,[7]1.3853,[8]1.4444,[9]1.4265,[10]1.4033,[11]1.3807,[12]1.3857,[13]1.3862,[14]1.3715,[15]1.3527,[16]1.3679,[17]1.3692,[18]1.3505,[19]1.3481,[20]1.3641,[21]1.3544,[22]1.3441,[23]1.3545,[24]1.3490,[25]1.3521,[26]1.3479,[27]1.3652,[28]1.3705,[29]1.3711,[30]1.3719,[31]1.3692,[32]1.3801,[33]1.3807,[34]1.3732,[35]1.3689,[36]1.3638,[37]1.3718,[38]1.3806,[39]1.3719,[40]1.3937,[41]1.4027,[42]1.4057,[43]1.4140,[44]1.4151,[45]1.4084,[46]1.4113,[47]1.4151,[48]1.4162,
+ Final estimate: PPL = 1.4162 +/- 0.00952
+
+ llama_perf_context_print: load time = 2551.49 ms
+ llama_perf_context_print: prompt eval time = 312027.57 ms / 98304 tokens ( 3.17 ms per token, 315.05 tokens per second)
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 313585.58 ms / 98305 tokens
+ llama_perf_context_print: graphs reused = 0
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16234 + ( 4041 = 3026 + 80 + 934) + 3839 |
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20082 + ( 3300 = 3026 + 80 + 194) + 740 |
+ llama_memory_breakdown_print: | - Host | 14730 = 14364 + 352 + 14 |
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K/perplexity_general.log ADDED
@@ -0,0 +1,152 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20465 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
16
+ llama_model_loader: - kv 5: general.size_label str = 36B
17
+ llama_model_loader: - kv 6: general.license str = apache-2.0
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
22
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
24
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
25
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
27
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
28
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
30
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
31
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
34
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
35
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
36
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
37
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
39
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
42
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
43
+ llama_model_loader: - kv 32: general.file_type u32 = 25
44
+ llama_model_loader: - type f32: 321 tensors
45
+ llama_model_loader: - type q6_K: 65 tensors
46
+ llama_model_loader: - type iq4_nl: 385 tensors
47
+ print_info: file format = GGUF V3 (latest)
48
+ print_info: file type = IQ4_NL - 4.5 bpw
49
+ print_info: file size = 19.94 GiB (4.74 BPW)
50
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
51
+ load: printing all EOG tokens:
52
+ load: - 2 ('<seed:eos>')
53
+ load: special tokens cache size = 128
54
+ load: token to piece cache size = 0.9296 MB
55
+ print_info: arch = seed_oss
56
+ print_info: vocab_only = 0
57
+ print_info: n_ctx_train = 524288
58
+ print_info: n_embd = 5120
59
+ print_info: n_embd_inp = 5120
60
+ print_info: n_layer = 64
61
+ print_info: n_head = 80
62
+ print_info: n_head_kv = 8
63
+ print_info: n_rot = 128
64
+ print_info: n_swa = 0
65
+ print_info: is_swa_any = 0
66
+ print_info: n_embd_head_k = 128
67
+ print_info: n_embd_head_v = 128
68
+ print_info: n_gqa = 10
69
+ print_info: n_embd_k_gqa = 1024
70
+ print_info: n_embd_v_gqa = 1024
71
+ print_info: f_norm_eps = 0.0e+00
72
+ print_info: f_norm_rms_eps = 1.0e-06
73
+ print_info: f_clamp_kqv = 0.0e+00
74
+ print_info: f_max_alibi_bias = 0.0e+00
75
+ print_info: f_logit_scale = 0.0e+00
76
+ print_info: f_attn_scale = 0.0e+00
77
+ print_info: n_ff = 27648
78
+ print_info: n_expert = 0
79
+ print_info: n_expert_used = 0
80
+ print_info: n_expert_groups = 0
81
+ print_info: n_group_used = 0
82
+ print_info: causal attn = 1
83
+ print_info: pooling type = 0
84
+ print_info: rope type = 2
85
+ print_info: rope scaling = linear
86
+ print_info: freq_base_train = 10000000.0
87
+ print_info: freq_scale_train = 1
88
+ print_info: n_ctx_orig_yarn = 524288
89
+ print_info: rope_finetuned = unknown
90
+ print_info: model type = 36B
91
+ print_info: model params = 36.15 B
92
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
93
+ print_info: vocab type = BPE
94
+ print_info: n_vocab = 155136
95
+ print_info: n_merges = 154737
96
+ print_info: BOS token = 0 '<seed:bos>'
97
+ print_info: EOS token = 2 '<seed:eos>'
98
+ print_info: PAD token = 1 '<seed:pad>'
99
+ print_info: LF token = 326 'Ċ'
100
+ print_info: EOG token = 2 '<seed:eos>'
101
+ print_info: max token length = 1024
102
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
103
+ load_tensors: offloading 20 repeating layers to GPU
104
+ load_tensors: offloaded 20/65 layers to GPU
105
+ load_tensors: CPU_Mapped model buffer size = 14364.72 MiB
106
+ load_tensors: CUDA0 model buffer size = 3026.64 MiB
107
+ load_tensors: CUDA1 model buffer size = 3026.64 MiB
108
+ .................................................................................................
109
+ llama_context: constructing llama_context
110
+ llama_context: n_seq_max = 1
111
+ llama_context: n_ctx = 2048
112
+ llama_context: n_ctx_seq = 2048
113
+ llama_context: n_batch = 2048
114
+ llama_context: n_ubatch = 512
115
+ llama_context: causal_attn = 1
116
+ llama_context: flash_attn = auto
117
+ llama_context: kv_unified = false
118
+ llama_context: freq_base = 10000000.0
119
+ llama_context: freq_scale = 1
120
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
121
+ llama_context: CPU output buffer size = 0.59 MiB
122
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
123
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
124
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
125
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
126
+ llama_context: Flash Attention was auto, set to enabled
127
+ llama_context: CUDA0 compute buffer size = 934.39 MiB
128
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
129
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
130
+ llama_context: graph nodes = 2183
131
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
132
+ common_init_from_params: added <seed:eos> logit bias = -inf
133
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
134
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
135
+
136
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
137
+ perplexity: tokenizing the input ..
138
+ perplexity: tokenization took 46.269 ms
139
+ perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
140
+ perplexity: 6.73 seconds per pass - ETA 1.67 minutes
141
+ [1]6.9699,[2]8.0356,[3]8.4251,[4]8.1681,[5]7.9547,[6]6.6787,[7]5.8897,[8]5.9520,[9]6.2160,[10]6.2771,[11]6.4095,[12]6.7141,[13]6.7415,[14]6.8193,[15]6.8281,
142
+ Final estimate: PPL = 6.8281 +/- 0.16452
143
+
144
+ llama_perf_context_print: load time = 2816.56 ms
145
+ llama_perf_context_print: prompt eval time = 97428.48 ms / 30720 tokens ( 3.17 ms per token, 315.31 tokens per second)
146
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
147
+ llama_perf_context_print: total time = 97912.35 ms / 30721 tokens
148
+ llama_perf_context_print: graphs reused = 0
149
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
150
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16225 + ( 4041 = 3026 + 80 + 934) + 3848 |
151
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20082 + ( 3300 = 3026 + 80 + 194) + 740 |
152
+ llama_memory_breakdown_print: | - Host | 14730 = 14364 + 352 + 14 |
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K/perplexity_math.log ADDED
@@ -0,0 +1,152 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20474 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
16
+ llama_model_loader: - kv 5: general.size_label str = 36B
17
+ llama_model_loader: - kv 6: general.license str = apache-2.0
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
22
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
24
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
25
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
27
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
28
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
30
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
31
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
34
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
35
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
36
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
37
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
39
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
42
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
43
+ llama_model_loader: - kv 32: general.file_type u32 = 25
44
+ llama_model_loader: - type f32: 321 tensors
45
+ llama_model_loader: - type q6_K: 65 tensors
46
+ llama_model_loader: - type iq4_nl: 385 tensors
47
+ print_info: file format = GGUF V3 (latest)
48
+ print_info: file type = IQ4_NL - 4.5 bpw
49
+ print_info: file size = 19.94 GiB (4.74 BPW)
50
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
51
+ load: printing all EOG tokens:
52
+ load: - 2 ('<seed:eos>')
53
+ load: special tokens cache size = 128
54
+ load: token to piece cache size = 0.9296 MB
55
+ print_info: arch = seed_oss
56
+ print_info: vocab_only = 0
57
+ print_info: n_ctx_train = 524288
58
+ print_info: n_embd = 5120
59
+ print_info: n_embd_inp = 5120
60
+ print_info: n_layer = 64
61
+ print_info: n_head = 80
62
+ print_info: n_head_kv = 8
63
+ print_info: n_rot = 128
64
+ print_info: n_swa = 0
65
+ print_info: is_swa_any = 0
66
+ print_info: n_embd_head_k = 128
67
+ print_info: n_embd_head_v = 128
68
+ print_info: n_gqa = 10
69
+ print_info: n_embd_k_gqa = 1024
70
+ print_info: n_embd_v_gqa = 1024
71
+ print_info: f_norm_eps = 0.0e+00
72
+ print_info: f_norm_rms_eps = 1.0e-06
73
+ print_info: f_clamp_kqv = 0.0e+00
74
+ print_info: f_max_alibi_bias = 0.0e+00
75
+ print_info: f_logit_scale = 0.0e+00
76
+ print_info: f_attn_scale = 0.0e+00
77
+ print_info: n_ff = 27648
78
+ print_info: n_expert = 0
79
+ print_info: n_expert_used = 0
80
+ print_info: n_expert_groups = 0
81
+ print_info: n_group_used = 0
82
+ print_info: causal attn = 1
83
+ print_info: pooling type = 0
84
+ print_info: rope type = 2
85
+ print_info: rope scaling = linear
86
+ print_info: freq_base_train = 10000000.0
87
+ print_info: freq_scale_train = 1
88
+ print_info: n_ctx_orig_yarn = 524288
89
+ print_info: rope_finetuned = unknown
90
+ print_info: model type = 36B
91
+ print_info: model params = 36.15 B
92
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
93
+ print_info: vocab type = BPE
94
+ print_info: n_vocab = 155136
95
+ print_info: n_merges = 154737
96
+ print_info: BOS token = 0 '<seed:bos>'
97
+ print_info: EOS token = 2 '<seed:eos>'
98
+ print_info: PAD token = 1 '<seed:pad>'
99
+ print_info: LF token = 326 'Ċ'
100
+ print_info: EOG token = 2 '<seed:eos>'
101
+ print_info: max token length = 1024
102
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
103
+ load_tensors: offloading 20 repeating layers to GPU
104
+ load_tensors: offloaded 20/65 layers to GPU
105
+ load_tensors: CPU_Mapped model buffer size = 14364.72 MiB
106
+ load_tensors: CUDA0 model buffer size = 3026.64 MiB
107
+ load_tensors: CUDA1 model buffer size = 3026.64 MiB
108
+ .................................................................................................
109
+ llama_context: constructing llama_context
110
+ llama_context: n_seq_max = 1
111
+ llama_context: n_ctx = 2048
112
+ llama_context: n_ctx_seq = 2048
113
+ llama_context: n_batch = 2048
114
+ llama_context: n_ubatch = 512
115
+ llama_context: causal_attn = 1
116
+ llama_context: flash_attn = auto
117
+ llama_context: kv_unified = false
118
+ llama_context: freq_base = 10000000.0
119
+ llama_context: freq_scale = 1
120
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
121
+ llama_context: CPU output buffer size = 0.59 MiB
122
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
123
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
124
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
125
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
126
+ llama_context: Flash Attention was auto, set to enabled
127
+ llama_context: CUDA0 compute buffer size = 934.39 MiB
128
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
129
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
130
+ llama_context: graph nodes = 2183
131
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
132
+ common_init_from_params: added <seed:eos> logit bias = -inf
133
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
134
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
135
+
136
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
137
+ perplexity: tokenizing the input ..
138
+ perplexity: tokenization took 44.395 ms
139
+ perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
140
+ perplexity: 6.77 seconds per pass - ETA 1.80 minutes
141
+ [1]2.7756,[2]2.9075,[3]3.3369,[4]3.5882,[5]4.0989,[6]4.3711,[7]4.5809,[8]4.7081,[9]4.8552,[10]5.0064,[11]5.0821,[12]5.1599,[13]5.2922,[14]5.4012,[15]5.4277,[16]5.4420,
142
+ Final estimate: PPL = 5.4420 +/- 0.11987
143
+
144
+ llama_perf_context_print: load time = 2590.28 ms
145
+ llama_perf_context_print: prompt eval time = 104259.46 ms / 32768 tokens ( 3.18 ms per token, 314.29 tokens per second)
146
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
147
+ llama_perf_context_print: total time = 104865.05 ms / 32769 tokens
148
+ llama_perf_context_print: graphs reused = 0
149
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
150
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16229 + ( 4041 = 3026 + 80 + 934) + 3844 |
151
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20082 + ( 3300 = 3026 + 80 + 194) + 740 |
152
+ llama_memory_breakdown_print: | - Host | 14730 = 14364 + 352 + 14 |
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/bench_metrics.json ADDED
@@ -0,0 +1,44 @@
1
+ {
2
+ "raw_metrics": {
3
+ "llamabench": {
4
+ "backend": "CUDA",
5
+ "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/llamabench.md",
6
+ "ngl": "35",
7
+ "raw_row": {
8
+ "backend": "CUDA",
9
+ "model": "seed_oss 36B IQ4_NL - 4.5 bpw",
10
+ "ngl": "35",
11
+ "params": "36.15 B",
12
+ "size": "20.88 GiB",
13
+ "t/s": "25.81 \u00b1 2.12",
14
+ "test": "pp8",
15
+ "tps_value": 25.81
16
+ },
17
+ "test": "pp8",
18
+ "tps": 25.81
19
+ },
20
+ "perplexity": {
21
+ "code": {
22
+ "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/perplexity_code.log",
23
+ "ppl": 1.4161,
24
+ "ppl_error": 0.00951
25
+ },
26
+ "general": {
27
+ "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/perplexity_general.log",
28
+ "ppl": 6.822,
29
+ "ppl_error": 0.16425
30
+ },
31
+ "math": {
32
+ "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/perplexity_math.log",
33
+ "ppl": 5.4388,
34
+ "ppl_error": 0.11973
35
+ }
36
+ }
37
+ },
38
+ "summary": {
39
+ "avg_prec_loss_pct": 0.4265,
40
+ "bench_tps": 25.81,
41
+ "file_size_bytes": 22421134112,
42
+ "file_size_gb": 20.88
43
+ }
44
+ }
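Note: the `ppl` / `ppl_error` values recorded in each bench_metrics.json match the `Final estimate: PPL = X +/- Y` line emitted at the end of the corresponding perplexity log above. A minimal, hypothetical sketch of extracting that pair from such a log (this helper is illustrative only and not part of the committed files):

```python
import re
from pathlib import Path

# Matches the closing line of a llama-perplexity run, e.g.
# "Final estimate: PPL = 1.4161 +/- 0.00951"
FINAL_RE = re.compile(r"Final estimate: PPL = ([0-9.]+) \+/- ([0-9.]+)")

def read_final_ppl(log_path: str) -> tuple[float, float]:
    """Return (ppl, ppl_error) parsed from a perplexity_*.log file."""
    text = Path(log_path).read_text()
    match = FINAL_RE.search(text)
    if match is None:
        raise ValueError(f"no 'Final estimate' line found in {log_path}")
    return float(match.group(1)), float(match.group(2))

# Example (path assumed): read_final_ppl("perplexity_code.log") -> (1.4161, 0.00951)
```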
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/llamabench.md ADDED
@@ -0,0 +1,11 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ | model | size | params | backend | ngl | test | t/s |
7
+ | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
8
+ | seed_oss 36B IQ4_NL - 4.5 bpw | 20.88 GiB | 36.15 B | CUDA | 35 | pp8 | 25.81 ± 2.12 |
9
+ | seed_oss 36B IQ4_NL - 4.5 bpw | 20.88 GiB | 36.15 B | CUDA | 35 | tg128 | 4.76 ± 0.00 |
10
+
11
+ build: 92bb442ad (7040)
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/perplexity_code.log ADDED
@@ -0,0 +1,152 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20719 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
16
+ llama_model_loader: - kv 5: general.size_label str = 36B
17
+ llama_model_loader: - kv 6: general.license str = apache-2.0
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
22
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
24
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
25
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
27
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
28
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
30
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
31
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
34
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
35
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
36
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
37
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
39
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
42
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
43
+ llama_model_loader: - kv 32: general.file_type u32 = 25
44
+ llama_model_loader: - type f32: 321 tensors
45
+ llama_model_loader: - type q8_0: 65 tensors
46
+ llama_model_loader: - type iq4_nl: 385 tensors
47
+ print_info: file format = GGUF V3 (latest)
48
+ print_info: file type = IQ4_NL - 4.5 bpw
49
+ print_info: file size = 20.88 GiB (4.96 BPW)
50
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
51
+ load: printing all EOG tokens:
52
+ load: - 2 ('<seed:eos>')
53
+ load: special tokens cache size = 128
54
+ load: token to piece cache size = 0.9296 MB
55
+ print_info: arch = seed_oss
56
+ print_info: vocab_only = 0
57
+ print_info: n_ctx_train = 524288
58
+ print_info: n_embd = 5120
59
+ print_info: n_embd_inp = 5120
60
+ print_info: n_layer = 64
61
+ print_info: n_head = 80
62
+ print_info: n_head_kv = 8
63
+ print_info: n_rot = 128
64
+ print_info: n_swa = 0
65
+ print_info: is_swa_any = 0
66
+ print_info: n_embd_head_k = 128
67
+ print_info: n_embd_head_v = 128
68
+ print_info: n_gqa = 10
69
+ print_info: n_embd_k_gqa = 1024
70
+ print_info: n_embd_v_gqa = 1024
71
+ print_info: f_norm_eps = 0.0e+00
72
+ print_info: f_norm_rms_eps = 1.0e-06
73
+ print_info: f_clamp_kqv = 0.0e+00
74
+ print_info: f_max_alibi_bias = 0.0e+00
75
+ print_info: f_logit_scale = 0.0e+00
76
+ print_info: f_attn_scale = 0.0e+00
77
+ print_info: n_ff = 27648
78
+ print_info: n_expert = 0
79
+ print_info: n_expert_used = 0
80
+ print_info: n_expert_groups = 0
81
+ print_info: n_group_used = 0
82
+ print_info: causal attn = 1
83
+ print_info: pooling type = 0
84
+ print_info: rope type = 2
85
+ print_info: rope scaling = linear
86
+ print_info: freq_base_train = 10000000.0
87
+ print_info: freq_scale_train = 1
88
+ print_info: n_ctx_orig_yarn = 524288
89
+ print_info: rope_finetuned = unknown
90
+ print_info: model type = 36B
91
+ print_info: model params = 36.15 B
92
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
93
+ print_info: vocab type = BPE
94
+ print_info: n_vocab = 155136
95
+ print_info: n_merges = 154737
96
+ print_info: BOS token = 0 '<seed:bos>'
97
+ print_info: EOS token = 2 '<seed:eos>'
98
+ print_info: PAD token = 1 '<seed:pad>'
99
+ print_info: LF token = 326 'Ċ'
100
+ print_info: EOG token = 2 '<seed:eos>'
101
+ print_info: max token length = 1024
102
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
103
+ load_tensors: offloading 20 repeating layers to GPU
104
+ load_tensors: offloaded 20/65 layers to GPU
105
+ load_tensors: CPU_Mapped model buffer size = 15080.99 MiB
106
+ load_tensors: CUDA0 model buffer size = 3147.73 MiB
107
+ load_tensors: CUDA1 model buffer size = 3147.73 MiB
108
+ .................................................................................................
109
+ llama_context: constructing llama_context
110
+ llama_context: n_seq_max = 1
111
+ llama_context: n_ctx = 2048
112
+ llama_context: n_ctx_seq = 2048
113
+ llama_context: n_batch = 2048
114
+ llama_context: n_ubatch = 512
115
+ llama_context: causal_attn = 1
116
+ llama_context: flash_attn = auto
117
+ llama_context: kv_unified = false
118
+ llama_context: freq_base = 10000000.0
119
+ llama_context: freq_scale = 1
120
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
121
+ llama_context: CPU output buffer size = 0.59 MiB
122
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
123
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
124
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
125
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
126
+ llama_context: Flash Attention was auto, set to enabled
127
+ llama_context: CUDA0 compute buffer size = 1117.84 MiB
128
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
129
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
130
+ llama_context: graph nodes = 2183
131
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
132
+ common_init_from_params: added <seed:eos> logit bias = -inf
133
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
134
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
135
+
136
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
137
+ perplexity: tokenizing the input ..
138
+ perplexity: tokenization took 113.425 ms
139
+ perplexity: calculating perplexity over 48 chunks, n_ctx=2048, batch_size=2048, n_seq=1
140
+ perplexity: 6.95 seconds per pass - ETA 5.55 minutes
141
+ [1]1.5621,[2]1.4658,[3]1.2905,[4]1.2351,[5]1.1909,[6]1.2778,[7]1.3842,[8]1.4435,[9]1.4255,[10]1.4025,[11]1.3801,[12]1.3853,[13]1.3858,[14]1.3712,[15]1.3524,[16]1.3676,[17]1.3689,[18]1.3502,[19]1.3478,[20]1.3637,[21]1.3540,[22]1.3436,[23]1.3541,[24]1.3486,[25]1.3516,[26]1.3475,[27]1.3646,[28]1.3700,[29]1.3705,[30]1.3714,[31]1.3688,[32]1.3796,[33]1.3802,[34]1.3727,[35]1.3684,[36]1.3634,[37]1.3714,[38]1.3803,[39]1.3716,[40]1.3934,[41]1.4025,[42]1.4056,[43]1.4139,[44]1.4150,[45]1.4082,[46]1.4111,[47]1.4149,[48]1.4161,
142
+ Final estimate: PPL = 1.4161 +/- 0.00951
143
+
144
+ llama_perf_context_print: load time = 2635.78 ms
145
+ llama_perf_context_print: prompt eval time = 320424.15 ms / 98304 tokens ( 3.26 ms per token, 306.79 tokens per second)
146
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
147
+ llama_perf_context_print: total time = 321933.86 ms / 98305 tokens
148
+ llama_perf_context_print: graphs reused = 0
149
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
150
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 15877 + ( 4345 = 3147 + 80 + 1117) + 3891 |
151
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 19962 + ( 3421 = 3147 + 80 + 194) + 739 |
152
+ llama_memory_breakdown_print: | - Host | 15447 = 15080 + 352 + 14 |
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/perplexity_general.log ADDED
@@ -0,0 +1,152 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20718 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
16
+ llama_model_loader: - kv 5: general.size_label str = 36B
17
+ llama_model_loader: - kv 6: general.license str = apache-2.0
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
22
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
24
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
25
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
27
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
28
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
30
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
31
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
34
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
35
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
36
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
37
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
39
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
42
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
43
+ llama_model_loader: - kv 32: general.file_type u32 = 25
44
+ llama_model_loader: - type f32: 321 tensors
45
+ llama_model_loader: - type q8_0: 65 tensors
46
+ llama_model_loader: - type iq4_nl: 385 tensors
47
+ print_info: file format = GGUF V3 (latest)
48
+ print_info: file type = IQ4_NL - 4.5 bpw
49
+ print_info: file size = 20.88 GiB (4.96 BPW)
50
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
51
+ load: printing all EOG tokens:
52
+ load: - 2 ('<seed:eos>')
53
+ load: special tokens cache size = 128
54
+ load: token to piece cache size = 0.9296 MB
55
+ print_info: arch = seed_oss
56
+ print_info: vocab_only = 0
57
+ print_info: n_ctx_train = 524288
58
+ print_info: n_embd = 5120
59
+ print_info: n_embd_inp = 5120
60
+ print_info: n_layer = 64
61
+ print_info: n_head = 80
62
+ print_info: n_head_kv = 8
63
+ print_info: n_rot = 128
64
+ print_info: n_swa = 0
65
+ print_info: is_swa_any = 0
66
+ print_info: n_embd_head_k = 128
67
+ print_info: n_embd_head_v = 128
68
+ print_info: n_gqa = 10
69
+ print_info: n_embd_k_gqa = 1024
70
+ print_info: n_embd_v_gqa = 1024
71
+ print_info: f_norm_eps = 0.0e+00
72
+ print_info: f_norm_rms_eps = 1.0e-06
73
+ print_info: f_clamp_kqv = 0.0e+00
74
+ print_info: f_max_alibi_bias = 0.0e+00
75
+ print_info: f_logit_scale = 0.0e+00
76
+ print_info: f_attn_scale = 0.0e+00
77
+ print_info: n_ff = 27648
78
+ print_info: n_expert = 0
79
+ print_info: n_expert_used = 0
80
+ print_info: n_expert_groups = 0
81
+ print_info: n_group_used = 0
82
+ print_info: causal attn = 1
83
+ print_info: pooling type = 0
84
+ print_info: rope type = 2
85
+ print_info: rope scaling = linear
86
+ print_info: freq_base_train = 10000000.0
87
+ print_info: freq_scale_train = 1
88
+ print_info: n_ctx_orig_yarn = 524288
89
+ print_info: rope_finetuned = unknown
90
+ print_info: model type = 36B
91
+ print_info: model params = 36.15 B
92
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
93
+ print_info: vocab type = BPE
94
+ print_info: n_vocab = 155136
95
+ print_info: n_merges = 154737
96
+ print_info: BOS token = 0 '<seed:bos>'
97
+ print_info: EOS token = 2 '<seed:eos>'
98
+ print_info: PAD token = 1 '<seed:pad>'
99
+ print_info: LF token = 326 'Ċ'
100
+ print_info: EOG token = 2 '<seed:eos>'
101
+ print_info: max token length = 1024
102
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
103
+ load_tensors: offloading 20 repeating layers to GPU
104
+ load_tensors: offloaded 20/65 layers to GPU
105
+ load_tensors: CPU_Mapped model buffer size = 15080.99 MiB
106
+ load_tensors: CUDA0 model buffer size = 3147.73 MiB
107
+ load_tensors: CUDA1 model buffer size = 3147.73 MiB
108
+ .................................................................................................
109
+ llama_context: constructing llama_context
110
+ llama_context: n_seq_max = 1
111
+ llama_context: n_ctx = 2048
112
+ llama_context: n_ctx_seq = 2048
113
+ llama_context: n_batch = 2048
114
+ llama_context: n_ubatch = 512
115
+ llama_context: causal_attn = 1
116
+ llama_context: flash_attn = auto
117
+ llama_context: kv_unified = false
118
+ llama_context: freq_base = 10000000.0
119
+ llama_context: freq_scale = 1
120
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
121
+ llama_context: CPU output buffer size = 0.59 MiB
122
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
123
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
124
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
125
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
126
+ llama_context: Flash Attention was auto, set to enabled
127
+ llama_context: CUDA0 compute buffer size = 1117.84 MiB
128
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
129
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
130
+ llama_context: graph nodes = 2183
131
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
132
+ common_init_from_params: added <seed:eos> logit bias = -inf
133
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
134
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
135
+
136
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
137
+ perplexity: tokenizing the input ..
138
+ perplexity: tokenization took 48.519 ms
139
+ perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
140
+ perplexity: 6.90 seconds per pass - ETA 1.72 minutes
141
+ [1]6.9907,[2]8.0518,[3]8.4369,[4]8.1699,[5]7.9557,[6]6.6757,[7]5.8864,[8]5.9502,[9]6.2128,[10]6.2733,[11]6.4054,[12]6.7076,[13]6.7338,[14]6.8125,[15]6.8220,
142
+ Final estimate: PPL = 6.8220 +/- 0.16425
143
+
144
+ llama_perf_context_print: load time = 2659.91 ms
145
+ llama_perf_context_print: prompt eval time = 100053.26 ms / 30720 tokens ( 3.26 ms per token, 307.04 tokens per second)
146
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
147
+ llama_perf_context_print: total time = 100537.54 ms / 30721 tokens
148
+ llama_perf_context_print: graphs reused = 0
149
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
150
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16177 + ( 4345 = 3147 + 80 + 1117) + 3592 |
151
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 19962 + ( 3421 = 3147 + 80 + 194) + 739 |
152
+ llama_memory_breakdown_print: | - Host | 15447 = 15080 + 352 + 14 |
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/perplexity_math.log ADDED
@@ -0,0 +1,152 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20420 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
16
+ llama_model_loader: - kv 5: general.size_label str = 36B
17
+ llama_model_loader: - kv 6: general.license str = apache-2.0
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
22
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
24
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
25
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
27
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
28
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
30
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
31
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
34
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
35
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
36
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
37
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
39
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
42
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
43
+ llama_model_loader: - kv 32: general.file_type u32 = 25
44
+ llama_model_loader: - type f32: 321 tensors
45
+ llama_model_loader: - type q8_0: 65 tensors
46
+ llama_model_loader: - type iq4_nl: 385 tensors
47
+ print_info: file format = GGUF V3 (latest)
48
+ print_info: file type = IQ4_NL - 4.5 bpw
49
+ print_info: file size = 20.88 GiB (4.96 BPW)
50
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
51
+ load: printing all EOG tokens:
52
+ load: - 2 ('<seed:eos>')
53
+ load: special tokens cache size = 128
54
+ load: token to piece cache size = 0.9296 MB
55
+ print_info: arch = seed_oss
56
+ print_info: vocab_only = 0
57
+ print_info: n_ctx_train = 524288
58
+ print_info: n_embd = 5120
59
+ print_info: n_embd_inp = 5120
60
+ print_info: n_layer = 64
61
+ print_info: n_head = 80
62
+ print_info: n_head_kv = 8
63
+ print_info: n_rot = 128
64
+ print_info: n_swa = 0
65
+ print_info: is_swa_any = 0
66
+ print_info: n_embd_head_k = 128
67
+ print_info: n_embd_head_v = 128
68
+ print_info: n_gqa = 10
69
+ print_info: n_embd_k_gqa = 1024
70
+ print_info: n_embd_v_gqa = 1024
71
+ print_info: f_norm_eps = 0.0e+00
72
+ print_info: f_norm_rms_eps = 1.0e-06
73
+ print_info: f_clamp_kqv = 0.0e+00
74
+ print_info: f_max_alibi_bias = 0.0e+00
75
+ print_info: f_logit_scale = 0.0e+00
76
+ print_info: f_attn_scale = 0.0e+00
77
+ print_info: n_ff = 27648
78
+ print_info: n_expert = 0
79
+ print_info: n_expert_used = 0
80
+ print_info: n_expert_groups = 0
81
+ print_info: n_group_used = 0
82
+ print_info: causal attn = 1
83
+ print_info: pooling type = 0
84
+ print_info: rope type = 2
85
+ print_info: rope scaling = linear
86
+ print_info: freq_base_train = 10000000.0
87
+ print_info: freq_scale_train = 1
88
+ print_info: n_ctx_orig_yarn = 524288
89
+ print_info: rope_finetuned = unknown
90
+ print_info: model type = 36B
91
+ print_info: model params = 36.15 B
92
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
93
+ print_info: vocab type = BPE
94
+ print_info: n_vocab = 155136
95
+ print_info: n_merges = 154737
96
+ print_info: BOS token = 0 '<seed:bos>'
97
+ print_info: EOS token = 2 '<seed:eos>'
98
+ print_info: PAD token = 1 '<seed:pad>'
99
+ print_info: LF token = 326 'Ċ'
100
+ print_info: EOG token = 2 '<seed:eos>'
101
+ print_info: max token length = 1024
102
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
103
+ load_tensors: offloading 20 repeating layers to GPU
104
+ load_tensors: offloaded 20/65 layers to GPU
105
+ load_tensors: CPU_Mapped model buffer size = 15080.99 MiB
106
+ load_tensors: CUDA0 model buffer size = 3147.73 MiB
107
+ load_tensors: CUDA1 model buffer size = 3147.73 MiB
108
+ .................................................................................................
109
+ llama_context: constructing llama_context
110
+ llama_context: n_seq_max = 1
111
+ llama_context: n_ctx = 2048
112
+ llama_context: n_ctx_seq = 2048
113
+ llama_context: n_batch = 2048
114
+ llama_context: n_ubatch = 512
115
+ llama_context: causal_attn = 1
116
+ llama_context: flash_attn = auto
117
+ llama_context: kv_unified = false
118
+ llama_context: freq_base = 10000000.0
119
+ llama_context: freq_scale = 1
120
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
121
+ llama_context: CPU output buffer size = 0.59 MiB
122
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
123
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
124
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
125
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
126
+ llama_context: Flash Attention was auto, set to enabled
127
+ llama_context: CUDA0 compute buffer size = 1117.84 MiB
128
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
129
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
130
+ llama_context: graph nodes = 2183
131
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
132
+ common_init_from_params: added <seed:eos> logit bias = -inf
133
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
134
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
135
+
136
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
137
+ perplexity: tokenizing the input ..
138
+ perplexity: tokenization took 44.701 ms
139
+ perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
140
+ perplexity: 6.92 seconds per pass - ETA 1.83 minutes
141
+ [1]2.7663,[2]2.9032,[3]3.3313,[4]3.5822,[5]4.0935,[6]4.3647,[7]4.5735,[8]4.7008,[9]4.8480,[10]4.9997,[11]5.0763,[12]5.1539,[13]5.2864,[14]5.3967,[15]5.4243,[16]5.4388,
142
+ Final estimate: PPL = 5.4388 +/- 0.11973
143
+
144
+ llama_perf_context_print: load time = 2648.84 ms
145
+ llama_perf_context_print: prompt eval time = 106970.58 ms / 32768 tokens ( 3.26 ms per token, 306.33 tokens per second)
146
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
147
+ llama_perf_context_print: total time = 107483.72 ms / 32769 tokens
148
+ llama_perf_context_print: graphs reused = 0
149
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
150
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 15873 + ( 4345 = 3147 + 80 + 1117) + 3896 |
151
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 19962 + ( 3421 = 3147 + 80 + 194) + 739 |
152
+ llama_memory_breakdown_print: | - Host | 15447 = 15080 + 352 + 14 |
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_Q8_0-lm_head_Q5_K/bench_metrics.json ADDED
@@ -0,0 +1,44 @@
1
+ {
2
+ "raw_metrics": {
3
+ "llamabench": {
4
+ "backend": "CUDA",
5
+ "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_Q8_0-lm_head_Q5_K/llamabench.md",
6
+ "ngl": "35",
7
+ "raw_row": {
8
+ "backend": "CUDA",
9
+ "model": "seed_oss 36B IQ4_NL - 4.5 bpw",
10
+ "ngl": "35",
11
+ "params": "36.15 B",
12
+ "size": "27.86 GiB",
13
+ "t/s": "19.03 \u00b1 0.59",
14
+ "test": "pp8",
15
+ "tps_value": 19.03
16
+ },
17
+ "test": "pp8",
18
+ "tps": 19.03
19
+ },
20
+ "perplexity": {
21
+ "code": {
22
+ "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_Q8_0-lm_head_Q5_K/perplexity_code.log",
23
+ "ppl": 1.4133,
24
+ "ppl_error": 0.00946
25
+ },
26
+ "general": {
27
+ "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_Q8_0-lm_head_Q5_K/perplexity_general.log",
28
+ "ppl": 6.8037,
29
+ "ppl_error": 0.16387
30
+ },
31
+ "math": {
32
+ "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_Q8_0-lm_head_Q5_K/perplexity_math.log",
33
+ "ppl": 5.3769,
34
+ "ppl_error": 0.11787
35
+ }
36
+ }
37
+ },
38
+ "summary": {
39
+ "avg_prec_loss_pct": 0.828,
40
+ "bench_tps": 19.03,
41
+ "file_size_bytes": 29924678432,
42
+ "file_size_gb": 27.87
43
+ }
44
+ }
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_Q8_0-lm_head_Q5_K/llamabench.md ADDED
@@ -0,0 +1,11 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ | model | size | params | backend | ngl | test | t/s |
7
+ | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
8
+ | seed_oss 36B IQ4_NL - 4.5 bpw | 27.86 GiB | 36.15 B | CUDA | 35 | pp8 | 19.03 ± 0.59 |
9
+ | seed_oss 36B IQ4_NL - 4.5 bpw | 27.86 GiB | 36.15 B | CUDA | 35 | tg128 | 3.36 ± 0.01 |
10
+
11
+ build: 92bb442ad (7040)
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_Q8_0-lm_head_Q5_K/perplexity_code.log ADDED
@@ -0,0 +1,153 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 19183 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_Q8_0-lm_head_Q5_K.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
16
+ llama_model_loader: - kv 5: general.size_label str = 36B
17
+ llama_model_loader: - kv 6: general.license str = apache-2.0
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
22
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
24
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
25
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
27
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
28
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
30
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
31
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
34
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
35
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
36
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
37
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
39
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
42
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
43
+ llama_model_loader: - kv 32: general.file_type u32 = 25
44
+ llama_model_loader: - type f32: 321 tensors
45
+ llama_model_loader: - type q8_0: 128 tensors
46
+ llama_model_loader: - type q5_K: 65 tensors
47
+ llama_model_loader: - type iq4_nl: 257 tensors
48
+ print_info: file format = GGUF V3 (latest)
49
+ print_info: file type = IQ4_NL - 4.5 bpw
50
+ print_info: file size = 27.86 GiB (6.62 BPW)
51
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
52
+ load: printing all EOG tokens:
53
+ load: - 2 ('<seed:eos>')
54
+ load: special tokens cache size = 128
55
+ load: token to piece cache size = 0.9296 MB
56
+ print_info: arch = seed_oss
57
+ print_info: vocab_only = 0
58
+ print_info: n_ctx_train = 524288
59
+ print_info: n_embd = 5120
60
+ print_info: n_embd_inp = 5120
61
+ print_info: n_layer = 64
62
+ print_info: n_head = 80
63
+ print_info: n_head_kv = 8
64
+ print_info: n_rot = 128
65
+ print_info: n_swa = 0
66
+ print_info: is_swa_any = 0
67
+ print_info: n_embd_head_k = 128
68
+ print_info: n_embd_head_v = 128
69
+ print_info: n_gqa = 10
70
+ print_info: n_embd_k_gqa = 1024
71
+ print_info: n_embd_v_gqa = 1024
72
+ print_info: f_norm_eps = 0.0e+00
73
+ print_info: f_norm_rms_eps = 1.0e-06
74
+ print_info: f_clamp_kqv = 0.0e+00
75
+ print_info: f_max_alibi_bias = 0.0e+00
76
+ print_info: f_logit_scale = 0.0e+00
77
+ print_info: f_attn_scale = 0.0e+00
78
+ print_info: n_ff = 27648
79
+ print_info: n_expert = 0
80
+ print_info: n_expert_used = 0
81
+ print_info: n_expert_groups = 0
82
+ print_info: n_group_used = 0
83
+ print_info: causal attn = 1
84
+ print_info: pooling type = 0
85
+ print_info: rope type = 2
86
+ print_info: rope scaling = linear
87
+ print_info: freq_base_train = 10000000.0
88
+ print_info: freq_scale_train = 1
89
+ print_info: n_ctx_orig_yarn = 524288
90
+ print_info: rope_finetuned = unknown
91
+ print_info: model type = 36B
92
+ print_info: model params = 36.15 B
93
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
94
+ print_info: vocab type = BPE
95
+ print_info: n_vocab = 155136
96
+ print_info: n_merges = 154737
97
+ print_info: BOS token = 0 '<seed:bos>'
98
+ print_info: EOS token = 2 '<seed:eos>'
99
+ print_info: PAD token = 1 '<seed:pad>'
100
+ print_info: LF token = 326 'Ċ'
101
+ print_info: EOG token = 2 '<seed:eos>'
102
+ print_info: max token length = 1024
103
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
104
+ load_tensors: offloading 20 repeating layers to GPU
105
+ load_tensors: offloaded 20/65 layers to GPU
106
+ load_tensors: CPU_Mapped model buffer size = 19911.93 MiB
107
+ load_tensors: CUDA0 model buffer size = 3879.21 MiB
108
+ load_tensors: CUDA1 model buffer size = 4741.26 MiB
109
+ ...................................................................................................
110
+ llama_context: constructing llama_context
111
+ llama_context: n_seq_max = 1
112
+ llama_context: n_ctx = 2048
113
+ llama_context: n_ctx_seq = 2048
114
+ llama_context: n_batch = 2048
115
+ llama_context: n_ubatch = 512
116
+ llama_context: causal_attn = 1
117
+ llama_context: flash_attn = auto
118
+ llama_context: kv_unified = false
119
+ llama_context: freq_base = 10000000.0
120
+ llama_context: freq_scale = 1
121
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
122
+ llama_context: CPU output buffer size = 0.59 MiB
123
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
124
+ llama_kv_cache: CUDA0 KV buffer size = 72.00 MiB
125
+ llama_kv_cache: CUDA1 KV buffer size = 88.00 MiB
126
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
127
+ llama_context: Flash Attention was auto, set to enabled
128
+ llama_context: CUDA0 compute buffer size = 833.78 MiB
129
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
130
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
131
+ llama_context: graph nodes = 2183
132
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
133
+ common_init_from_params: added <seed:eos> logit bias = -inf
134
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
135
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
136
+
137
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
138
+ perplexity: tokenizing the input ..
139
+ perplexity: tokenization took 121.381 ms
140
+ perplexity: calculating perplexity over 48 chunks, n_ctx=2048, batch_size=2048, n_seq=1
141
+ perplexity: 8.93 seconds per pass - ETA 7.13 minutes
142
+ [1]1.5570,[2]1.4577,[3]1.2857,[4]1.2317,[5]1.1880,[6]1.2751,[7]1.3796,[8]1.4376,[9]1.4205,[10]1.3983,[11]1.3764,[12]1.3819,[13]1.3820,[14]1.3678,[15]1.3488,[16]1.3645,[17]1.3655,[18]1.3472,[19]1.3450,[20]1.3606,[21]1.3508,[22]1.3407,[23]1.3511,[24]1.3456,[25]1.3494,[26]1.3454,[27]1.3623,[28]1.3675,[29]1.3677,[30]1.3685,[31]1.3657,[32]1.3763,[33]1.3769,[34]1.3693,[35]1.3651,[36]1.3603,[37]1.3681,[38]1.3769,[39]1.3684,[40]1.3900,[41]1.3990,[42]1.4020,[43]1.4105,[44]1.4116,[45]1.4052,[46]1.4081,[47]1.4121,[48]1.4133,
143
+ Final estimate: PPL = 1.4133 +/- 0.00946
144
+
145
+ llama_perf_context_print: load time = 3785.94 ms
146
+ llama_perf_context_print: prompt eval time = 416431.87 ms / 98304 tokens ( 4.24 ms per token, 236.06 tokens per second)
147
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
148
+ llama_perf_context_print: total time = 418154.88 ms / 98305 tokens
149
+ llama_perf_context_print: graphs reused = 0
150
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
151
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 14172 + ( 4784 = 3879 + 72 + 833) + 5157 |
152
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 18360 + ( 5023 = 4741 + 88 + 194) + 740 |
153
+ llama_memory_breakdown_print: | - Host | 20277 = 19911 + 352 + 14 |
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_Q8_0-lm_head_Q5_K/perplexity_general.log ADDED
@@ -0,0 +1,153 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 19144 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_Q8_0-lm_head_Q5_K.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
16
+ llama_model_loader: - kv 5: general.size_label str = 36B
17
+ llama_model_loader: - kv 6: general.license str = apache-2.0
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
22
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
24
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
25
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
27
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
28
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
30
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
31
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
34
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
35
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
36
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
37
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
39
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
42
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
43
+ llama_model_loader: - kv 32: general.file_type u32 = 25
44
+ llama_model_loader: - type f32: 321 tensors
45
+ llama_model_loader: - type q8_0: 128 tensors
46
+ llama_model_loader: - type q5_K: 65 tensors
47
+ llama_model_loader: - type iq4_nl: 257 tensors
48
+ print_info: file format = GGUF V3 (latest)
49
+ print_info: file type = IQ4_NL - 4.5 bpw
50
+ print_info: file size = 27.86 GiB (6.62 BPW)
51
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
52
+ load: printing all EOG tokens:
53
+ load: - 2 ('<seed:eos>')
54
+ load: special tokens cache size = 128
55
+ load: token to piece cache size = 0.9296 MB
56
+ print_info: arch = seed_oss
57
+ print_info: vocab_only = 0
58
+ print_info: n_ctx_train = 524288
59
+ print_info: n_embd = 5120
60
+ print_info: n_embd_inp = 5120
61
+ print_info: n_layer = 64
62
+ print_info: n_head = 80
63
+ print_info: n_head_kv = 8
64
+ print_info: n_rot = 128
65
+ print_info: n_swa = 0
66
+ print_info: is_swa_any = 0
67
+ print_info: n_embd_head_k = 128
68
+ print_info: n_embd_head_v = 128
69
+ print_info: n_gqa = 10
70
+ print_info: n_embd_k_gqa = 1024
71
+ print_info: n_embd_v_gqa = 1024
72
+ print_info: f_norm_eps = 0.0e+00
73
+ print_info: f_norm_rms_eps = 1.0e-06
74
+ print_info: f_clamp_kqv = 0.0e+00
75
+ print_info: f_max_alibi_bias = 0.0e+00
76
+ print_info: f_logit_scale = 0.0e+00
77
+ print_info: f_attn_scale = 0.0e+00
78
+ print_info: n_ff = 27648
79
+ print_info: n_expert = 0
80
+ print_info: n_expert_used = 0
81
+ print_info: n_expert_groups = 0
82
+ print_info: n_group_used = 0
83
+ print_info: causal attn = 1
84
+ print_info: pooling type = 0
85
+ print_info: rope type = 2
86
+ print_info: rope scaling = linear
87
+ print_info: freq_base_train = 10000000.0
88
+ print_info: freq_scale_train = 1
89
+ print_info: n_ctx_orig_yarn = 524288
90
+ print_info: rope_finetuned = unknown
91
+ print_info: model type = 36B
92
+ print_info: model params = 36.15 B
93
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
94
+ print_info: vocab type = BPE
95
+ print_info: n_vocab = 155136
96
+ print_info: n_merges = 154737
97
+ print_info: BOS token = 0 '<seed:bos>'
98
+ print_info: EOS token = 2 '<seed:eos>'
99
+ print_info: PAD token = 1 '<seed:pad>'
100
+ print_info: LF token = 326 'Ċ'
101
+ print_info: EOG token = 2 '<seed:eos>'
102
+ print_info: max token length = 1024
103
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
104
+ load_tensors: offloading 20 repeating layers to GPU
105
+ load_tensors: offloaded 20/65 layers to GPU
106
+ load_tensors: CPU_Mapped model buffer size = 19911.93 MiB
107
+ load_tensors: CUDA0 model buffer size = 3879.21 MiB
108
+ load_tensors: CUDA1 model buffer size = 4741.26 MiB
109
+ ...................................................................................................
110
+ llama_context: constructing llama_context
111
+ llama_context: n_seq_max = 1
112
+ llama_context: n_ctx = 2048
113
+ llama_context: n_ctx_seq = 2048
114
+ llama_context: n_batch = 2048
115
+ llama_context: n_ubatch = 512
116
+ llama_context: causal_attn = 1
117
+ llama_context: flash_attn = auto
118
+ llama_context: kv_unified = false
119
+ llama_context: freq_base = 10000000.0
120
+ llama_context: freq_scale = 1
121
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
122
+ llama_context: CPU output buffer size = 0.59 MiB
123
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
124
+ llama_kv_cache: CUDA0 KV buffer size = 72.00 MiB
125
+ llama_kv_cache: CUDA1 KV buffer size = 88.00 MiB
126
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
127
+ llama_context: Flash Attention was auto, set to enabled
128
+ llama_context: CUDA0 compute buffer size = 833.78 MiB
129
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
130
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
131
+ llama_context: graph nodes = 2183
132
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
133
+ common_init_from_params: added <seed:eos> logit bias = -inf
134
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
135
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
136
+
137
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
138
+ perplexity: tokenizing the input ..
139
+ perplexity: tokenization took 49.451 ms
140
+ perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
141
+ perplexity: 8.88 seconds per pass - ETA 2.22 minutes
142
+ [1]6.9263,[2]7.9946,[3]8.3471,[4]8.0938,[5]7.8933,[6]6.6467,[7]5.8691,[8]5.9351,[9]6.2024,[10]6.2615,[11]6.3906,[12]6.6929,[13]6.7168,[14]6.7946,[15]6.8037,
143
+ Final estimate: PPL = 6.8037 +/- 0.16387
144
+
145
+ llama_perf_context_print: load time = 3982.53 ms
146
+ llama_perf_context_print: prompt eval time = 130029.31 ms / 30720 tokens ( 4.23 ms per token, 236.25 tokens per second)
147
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
148
+ llama_perf_context_print: total time = 130712.15 ms / 30721 tokens
149
+ llama_perf_context_print: graphs reused = 0
150
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
151
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 14201 + ( 4784 = 3879 + 72 + 833) + 5128 |
152
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 18360 + ( 5023 = 4741 + 88 + 194) + 740 |
153
+ llama_memory_breakdown_print: | - Host | 20277 = 19911 + 352 + 14 |
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_Q8_0-lm_head_Q5_K/perplexity_math.log ADDED
@@ -0,0 +1,153 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 19152 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_Q8_0-lm_head_Q5_K.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
16
+ llama_model_loader: - kv 5: general.size_label str = 36B
17
+ llama_model_loader: - kv 6: general.license str = apache-2.0
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
22
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
24
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
25
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
27
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
28
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
30
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
31
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
34
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
35
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
36
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
37
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
39
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
42
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
43
+ llama_model_loader: - kv 32: general.file_type u32 = 25
44
+ llama_model_loader: - type f32: 321 tensors
45
+ llama_model_loader: - type q8_0: 128 tensors
46
+ llama_model_loader: - type q5_K: 65 tensors
47
+ llama_model_loader: - type iq4_nl: 257 tensors
48
+ print_info: file format = GGUF V3 (latest)
49
+ print_info: file type = IQ4_NL - 4.5 bpw
50
+ print_info: file size = 27.86 GiB (6.62 BPW)
51
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
52
+ load: printing all EOG tokens:
53
+ load: - 2 ('<seed:eos>')
54
+ load: special tokens cache size = 128
55
+ load: token to piece cache size = 0.9296 MB
56
+ print_info: arch = seed_oss
57
+ print_info: vocab_only = 0
58
+ print_info: n_ctx_train = 524288
59
+ print_info: n_embd = 5120
60
+ print_info: n_embd_inp = 5120
61
+ print_info: n_layer = 64
62
+ print_info: n_head = 80
63
+ print_info: n_head_kv = 8
64
+ print_info: n_rot = 128
65
+ print_info: n_swa = 0
66
+ print_info: is_swa_any = 0
67
+ print_info: n_embd_head_k = 128
68
+ print_info: n_embd_head_v = 128
69
+ print_info: n_gqa = 10
70
+ print_info: n_embd_k_gqa = 1024
71
+ print_info: n_embd_v_gqa = 1024
72
+ print_info: f_norm_eps = 0.0e+00
73
+ print_info: f_norm_rms_eps = 1.0e-06
74
+ print_info: f_clamp_kqv = 0.0e+00
75
+ print_info: f_max_alibi_bias = 0.0e+00
76
+ print_info: f_logit_scale = 0.0e+00
77
+ print_info: f_attn_scale = 0.0e+00
78
+ print_info: n_ff = 27648
79
+ print_info: n_expert = 0
80
+ print_info: n_expert_used = 0
81
+ print_info: n_expert_groups = 0
82
+ print_info: n_group_used = 0
83
+ print_info: causal attn = 1
84
+ print_info: pooling type = 0
85
+ print_info: rope type = 2
86
+ print_info: rope scaling = linear
87
+ print_info: freq_base_train = 10000000.0
88
+ print_info: freq_scale_train = 1
89
+ print_info: n_ctx_orig_yarn = 524288
90
+ print_info: rope_finetuned = unknown
91
+ print_info: model type = 36B
92
+ print_info: model params = 36.15 B
93
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
94
+ print_info: vocab type = BPE
95
+ print_info: n_vocab = 155136
96
+ print_info: n_merges = 154737
97
+ print_info: BOS token = 0 '<seed:bos>'
98
+ print_info: EOS token = 2 '<seed:eos>'
99
+ print_info: PAD token = 1 '<seed:pad>'
100
+ print_info: LF token = 326 'Ċ'
101
+ print_info: EOG token = 2 '<seed:eos>'
102
+ print_info: max token length = 1024
103
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
104
+ load_tensors: offloading 20 repeating layers to GPU
105
+ load_tensors: offloaded 20/65 layers to GPU
106
+ load_tensors: CPU_Mapped model buffer size = 19911.93 MiB
107
+ load_tensors: CUDA0 model buffer size = 3879.21 MiB
108
+ load_tensors: CUDA1 model buffer size = 4741.26 MiB
109
+ ...................................................................................................
110
+ llama_context: constructing llama_context
111
+ llama_context: n_seq_max = 1
112
+ llama_context: n_ctx = 2048
113
+ llama_context: n_ctx_seq = 2048
114
+ llama_context: n_batch = 2048
115
+ llama_context: n_ubatch = 512
116
+ llama_context: causal_attn = 1
117
+ llama_context: flash_attn = auto
118
+ llama_context: kv_unified = false
119
+ llama_context: freq_base = 10000000.0
120
+ llama_context: freq_scale = 1
121
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
122
+ llama_context: CPU output buffer size = 0.59 MiB
123
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
124
+ llama_kv_cache: CUDA0 KV buffer size = 72.00 MiB
125
+ llama_kv_cache: CUDA1 KV buffer size = 88.00 MiB
126
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
127
+ llama_context: Flash Attention was auto, set to enabled
128
+ llama_context: CUDA0 compute buffer size = 833.78 MiB
129
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
130
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
131
+ llama_context: graph nodes = 2183
132
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
133
+ common_init_from_params: added <seed:eos> logit bias = -inf
134
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
135
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
136
+
137
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
138
+ perplexity: tokenizing the input ..
139
+ perplexity: tokenization took 45.858 ms
140
+ perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
141
+ perplexity: 8.86 seconds per pass - ETA 2.35 minutes
142
+ [1]2.7024,[2]2.8391,[3]3.2683,[4]3.5229,[5]4.0384,[6]4.3092,[7]4.5171,[8]4.6441,[9]4.7886,[10]4.9372,[11]5.0187,[12]5.0935,[13]5.2285,[14]5.3377,[15]5.3687,[16]5.3769,
143
+ Final estimate: PPL = 5.3769 +/- 0.11787
144
+
145
+ llama_perf_context_print: load time = 3518.68 ms
146
+ llama_perf_context_print: prompt eval time = 138566.37 ms / 32768 tokens ( 4.23 ms per token, 236.48 tokens per second)
147
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
148
+ llama_perf_context_print: total time = 139097.25 ms / 32769 tokens
149
+ llama_perf_context_print: graphs reused = 0
150
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
151
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 14169 + ( 4784 = 3879 + 72 + 833) + 5160 |
152
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 18360 + ( 5023 = 4741 + 88 + 194) + 740 |
153
+ llama_memory_breakdown_print: | - Host | 20277 = 19911 + 352 + 14 |
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/bench_metrics.json ADDED
@@ -0,0 +1,44 @@
+ {
+ "raw_metrics": {
+ "llamabench": {
+ "backend": "CUDA",
+ "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/llamabench.md",
+ "ngl": "35",
+ "raw_row": {
+ "backend": "CUDA",
+ "model": "seed_oss 36B IQ4_NL - 4.5 bpw",
+ "ngl": "35",
+ "params": "36.15 B",
+ "size": "19.04 GiB",
+ "t/s": "28.78 \u00b1 3.16",
+ "test": "pp8",
+ "tps_value": 28.78
+ },
+ "test": "pp8",
+ "tps": 28.78
+ },
+ "perplexity": {
+ "code": {
+ "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_code.log",
+ "ppl": 1.4161,
+ "ppl_error": 0.00948
+ },
+ "general": {
+ "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_general.log",
+ "ppl": 6.8715,
+ "ppl_error": 0.16547
+ },
+ "math": {
+ "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_math.log",
+ "ppl": 5.4643,
+ "ppl_error": 0.12019
+ }
+ }
+ },
+ "summary": {
+ "avg_prec_loss_pct": 0.2769,
+ "bench_tps": 28.78,
+ "file_size_bytes": 20445551392,
+ "file_size_gb": 19.04
+ }
+ }
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/llamabench.md ADDED
@@ -0,0 +1,11 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ | model | size | params | backend | ngl | test | t/s |
+ | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
+ | seed_oss 36B IQ4_NL - 4.5 bpw | 19.04 GiB | 36.15 B | CUDA | 35 | pp8 | 28.78 ± 3.16 |
+ | seed_oss 36B IQ4_NL - 4.5 bpw | 19.04 GiB | 36.15 B | CUDA | 35 | tg128 | 5.28 ± 0.05 |
+
+ build: 92bb442ad (7040)
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_code.log ADDED
@@ -0,0 +1,152 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20350 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
16
+ llama_model_loader: - kv 5: general.size_label str = 36B
17
+ llama_model_loader: - kv 6: general.license str = apache-2.0
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
22
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
24
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
25
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
27
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
28
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
30
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
31
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
34
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
35
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
36
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
37
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
39
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
42
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
43
+ llama_model_loader: - kv 32: general.file_type u32 = 25
44
+ llama_model_loader: - type f32: 321 tensors
45
+ llama_model_loader: - type q5_K: 1 tensors
46
+ llama_model_loader: - type iq4_nl: 449 tensors
47
+ print_info: file format = GGUF V3 (latest)
48
+ print_info: file type = IQ4_NL - 4.5 bpw
49
+ print_info: file size = 19.04 GiB (4.52 BPW)
50
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
51
+ load: printing all EOG tokens:
52
+ load: - 2 ('<seed:eos>')
53
+ load: special tokens cache size = 128
54
+ load: token to piece cache size = 0.9296 MB
55
+ print_info: arch = seed_oss
56
+ print_info: vocab_only = 0
57
+ print_info: n_ctx_train = 524288
58
+ print_info: n_embd = 5120
59
+ print_info: n_embd_inp = 5120
60
+ print_info: n_layer = 64
61
+ print_info: n_head = 80
62
+ print_info: n_head_kv = 8
63
+ print_info: n_rot = 128
64
+ print_info: n_swa = 0
65
+ print_info: is_swa_any = 0
66
+ print_info: n_embd_head_k = 128
67
+ print_info: n_embd_head_v = 128
68
+ print_info: n_gqa = 10
69
+ print_info: n_embd_k_gqa = 1024
70
+ print_info: n_embd_v_gqa = 1024
71
+ print_info: f_norm_eps = 0.0e+00
72
+ print_info: f_norm_rms_eps = 1.0e-06
73
+ print_info: f_clamp_kqv = 0.0e+00
74
+ print_info: f_max_alibi_bias = 0.0e+00
75
+ print_info: f_logit_scale = 0.0e+00
76
+ print_info: f_attn_scale = 0.0e+00
77
+ print_info: n_ff = 27648
78
+ print_info: n_expert = 0
79
+ print_info: n_expert_used = 0
80
+ print_info: n_expert_groups = 0
81
+ print_info: n_group_used = 0
82
+ print_info: causal attn = 1
83
+ print_info: pooling type = 0
84
+ print_info: rope type = 2
85
+ print_info: rope scaling = linear
86
+ print_info: freq_base_train = 10000000.0
87
+ print_info: freq_scale_train = 1
88
+ print_info: n_ctx_orig_yarn = 524288
89
+ print_info: rope_finetuned = unknown
90
+ print_info: model type = 36B
91
+ print_info: model params = 36.15 B
92
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
93
+ print_info: vocab type = BPE
94
+ print_info: n_vocab = 155136
95
+ print_info: n_merges = 154737
96
+ print_info: BOS token = 0 '<seed:bos>'
97
+ print_info: EOS token = 2 '<seed:eos>'
98
+ print_info: PAD token = 1 '<seed:pad>'
99
+ print_info: LF token = 326 'Ċ'
100
+ print_info: EOG token = 2 '<seed:eos>'
101
+ print_info: max token length = 1024
102
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
103
+ load_tensors: offloading 20 repeating layers to GPU
104
+ load_tensors: offloaded 20/65 layers to GPU
105
+ load_tensors: CPU_Mapped model buffer size = 13696.93 MiB
106
+ load_tensors: CUDA0 model buffer size = 2897.73 MiB
107
+ load_tensors: CUDA1 model buffer size = 2897.73 MiB
108
+ ..................................................................................................
109
+ llama_context: constructing llama_context
110
+ llama_context: n_seq_max = 1
111
+ llama_context: n_ctx = 2048
112
+ llama_context: n_ctx_seq = 2048
113
+ llama_context: n_batch = 2048
114
+ llama_context: n_ubatch = 512
115
+ llama_context: causal_attn = 1
116
+ llama_context: flash_attn = auto
117
+ llama_context: kv_unified = false
118
+ llama_context: freq_base = 10000000.0
119
+ llama_context: freq_scale = 1
120
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
121
+ llama_context: CPU output buffer size = 0.59 MiB
122
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
123
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
124
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
125
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
126
+ llama_context: Flash Attention was auto, set to enabled
127
+ llama_context: CUDA0 compute buffer size = 739.09 MiB
128
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
129
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
130
+ llama_context: graph nodes = 2183
131
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
132
+ common_init_from_params: added <seed:eos> logit bias = -inf
133
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
134
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
135
+
136
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
137
+ perplexity: tokenizing the input ..
138
+ perplexity: tokenization took 113.916 ms
139
+ perplexity: calculating perplexity over 48 chunks, n_ctx=2048, batch_size=2048, n_seq=1
140
+ perplexity: 6.62 seconds per pass - ETA 5.28 minutes
141
+ [1]1.5690,[2]1.4709,[3]1.2935,[4]1.2379,[5]1.1930,[6]1.2804,[7]1.3859,[8]1.4444,[9]1.4266,[10]1.4032,[11]1.3804,[12]1.3856,[13]1.3859,[14]1.3713,[15]1.3522,[16]1.3675,[17]1.3691,[18]1.3504,[19]1.3479,[20]1.3635,[21]1.3537,[22]1.3435,[23]1.3540,[24]1.3484,[25]1.3513,[26]1.3473,[27]1.3644,[28]1.3698,[29]1.3702,[30]1.3712,[31]1.3684,[32]1.3794,[33]1.3801,[34]1.3725,[35]1.3682,[36]1.3631,[37]1.3710,[38]1.3798,[39]1.3712,[40]1.3934,[41]1.4024,[42]1.4055,[43]1.4139,[44]1.4150,[45]1.4083,[46]1.4112,[47]1.4148,[48]1.4161,
142
+ Final estimate: PPL = 1.4161 +/- 0.00948
143
+
144
+ llama_perf_context_print: load time = 2643.34 ms
145
+ llama_perf_context_print: prompt eval time = 306009.13 ms / 98304 tokens ( 3.11 ms per token, 321.25 tokens per second)
146
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
147
+ llama_perf_context_print: total time = 307563.05 ms / 98305 tokens
148
+ llama_perf_context_print: graphs reused = 0
149
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
150
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16500 + ( 3716 = 2897 + 80 + 739) + 3898 |
151
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20212 + ( 3171 = 2897 + 80 + 194) + 739 |
152
+ llama_memory_breakdown_print: | - Host | 14062 = 13696 + 352 + 14 |
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_general.log ADDED
@@ -0,0 +1,152 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20255 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
16
+ llama_model_loader: - kv 5: general.size_label str = 36B
17
+ llama_model_loader: - kv 6: general.license str = apache-2.0
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
22
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
24
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
25
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
27
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
28
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
30
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
31
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
34
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
35
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
36
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
37
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
39
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
42
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
43
+ llama_model_loader: - kv 32: general.file_type u32 = 25
44
+ llama_model_loader: - type f32: 321 tensors
45
+ llama_model_loader: - type q5_K: 1 tensors
46
+ llama_model_loader: - type iq4_nl: 449 tensors
47
+ print_info: file format = GGUF V3 (latest)
48
+ print_info: file type = IQ4_NL - 4.5 bpw
49
+ print_info: file size = 19.04 GiB (4.52 BPW)
50
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
51
+ load: printing all EOG tokens:
52
+ load: - 2 ('<seed:eos>')
53
+ load: special tokens cache size = 128
54
+ load: token to piece cache size = 0.9296 MB
55
+ print_info: arch = seed_oss
56
+ print_info: vocab_only = 0
57
+ print_info: n_ctx_train = 524288
58
+ print_info: n_embd = 5120
59
+ print_info: n_embd_inp = 5120
60
+ print_info: n_layer = 64
61
+ print_info: n_head = 80
62
+ print_info: n_head_kv = 8
63
+ print_info: n_rot = 128
64
+ print_info: n_swa = 0
65
+ print_info: is_swa_any = 0
66
+ print_info: n_embd_head_k = 128
67
+ print_info: n_embd_head_v = 128
68
+ print_info: n_gqa = 10
69
+ print_info: n_embd_k_gqa = 1024
70
+ print_info: n_embd_v_gqa = 1024
71
+ print_info: f_norm_eps = 0.0e+00
72
+ print_info: f_norm_rms_eps = 1.0e-06
73
+ print_info: f_clamp_kqv = 0.0e+00
74
+ print_info: f_max_alibi_bias = 0.0e+00
75
+ print_info: f_logit_scale = 0.0e+00
76
+ print_info: f_attn_scale = 0.0e+00
77
+ print_info: n_ff = 27648
78
+ print_info: n_expert = 0
79
+ print_info: n_expert_used = 0
80
+ print_info: n_expert_groups = 0
81
+ print_info: n_group_used = 0
82
+ print_info: causal attn = 1
83
+ print_info: pooling type = 0
84
+ print_info: rope type = 2
85
+ print_info: rope scaling = linear
86
+ print_info: freq_base_train = 10000000.0
87
+ print_info: freq_scale_train = 1
88
+ print_info: n_ctx_orig_yarn = 524288
89
+ print_info: rope_finetuned = unknown
90
+ print_info: model type = 36B
91
+ print_info: model params = 36.15 B
92
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
93
+ print_info: vocab type = BPE
94
+ print_info: n_vocab = 155136
95
+ print_info: n_merges = 154737
96
+ print_info: BOS token = 0 '<seed:bos>'
97
+ print_info: EOS token = 2 '<seed:eos>'
98
+ print_info: PAD token = 1 '<seed:pad>'
99
+ print_info: LF token = 326 'Ċ'
100
+ print_info: EOG token = 2 '<seed:eos>'
101
+ print_info: max token length = 1024
102
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
103
+ load_tensors: offloading 20 repeating layers to GPU
104
+ load_tensors: offloaded 20/65 layers to GPU
105
+ load_tensors: CPU_Mapped model buffer size = 13696.93 MiB
106
+ load_tensors: CUDA0 model buffer size = 2897.73 MiB
107
+ load_tensors: CUDA1 model buffer size = 2897.73 MiB
108
+ ..................................................................................................
109
+ llama_context: constructing llama_context
110
+ llama_context: n_seq_max = 1
111
+ llama_context: n_ctx = 2048
112
+ llama_context: n_ctx_seq = 2048
113
+ llama_context: n_batch = 2048
114
+ llama_context: n_ubatch = 512
115
+ llama_context: causal_attn = 1
116
+ llama_context: flash_attn = auto
117
+ llama_context: kv_unified = false
118
+ llama_context: freq_base = 10000000.0
119
+ llama_context: freq_scale = 1
120
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
121
+ llama_context: CPU output buffer size = 0.59 MiB
122
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
123
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
124
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
125
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
126
+ llama_context: Flash Attention was auto, set to enabled
127
+ llama_context: CUDA0 compute buffer size = 739.09 MiB
128
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
129
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
130
+ llama_context: graph nodes = 2183
131
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
132
+ common_init_from_params: added <seed:eos> logit bias = -inf
133
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
134
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
135
+
136
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
137
+ perplexity: tokenizing the input ..
138
+ perplexity: tokenization took 50.4 ms
139
+ perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
140
+ perplexity: 6.59 seconds per pass - ETA 1.63 minutes
141
+ [1]7.0029,[2]8.1095,[3]8.5099,[4]8.2651,[5]8.0598,[6]6.7509,[7]5.9414,[8]5.9974,[9]6.2556,[10]6.3166,[11]6.4537,[12]6.7592,[13]6.7836,[14]6.8633,[15]6.8715,
142
+ Final estimate: PPL = 6.8715 +/- 0.16547
143
+
144
+ llama_perf_context_print: load time = 2509.91 ms
145
+ llama_perf_context_print: prompt eval time = 95548.39 ms / 30720 tokens ( 3.11 ms per token, 321.51 tokens per second)
146
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
147
+ llama_perf_context_print: total time = 96051.84 ms / 30721 tokens
148
+ llama_perf_context_print: graphs reused = 0
149
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
150
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16436 + ( 3716 = 2897 + 80 + 739) + 3961 |
151
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20212 + ( 3171 = 2897 + 80 + 194) + 739 |
152
+ llama_memory_breakdown_print: | - Host | 14062 = 13696 + 352 + 14 |
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_math.log ADDED
@@ -0,0 +1,152 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20416 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
16
+ llama_model_loader: - kv 5: general.size_label str = 36B
17
+ llama_model_loader: - kv 6: general.license str = apache-2.0
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
22
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
24
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
25
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
27
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
28
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
30
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
31
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
34
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
35
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
36
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
37
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
39
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
42
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
43
+ llama_model_loader: - kv 32: general.file_type u32 = 25
44
+ llama_model_loader: - type f32: 321 tensors
45
+ llama_model_loader: - type q5_K: 1 tensors
46
+ llama_model_loader: - type iq4_nl: 449 tensors
47
+ print_info: file format = GGUF V3 (latest)
48
+ print_info: file type = IQ4_NL - 4.5 bpw
49
+ print_info: file size = 19.04 GiB (4.52 BPW)
50
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
51
+ load: printing all EOG tokens:
52
+ load: - 2 ('<seed:eos>')
53
+ load: special tokens cache size = 128
54
+ load: token to piece cache size = 0.9296 MB
55
+ print_info: arch = seed_oss
56
+ print_info: vocab_only = 0
57
+ print_info: n_ctx_train = 524288
58
+ print_info: n_embd = 5120
59
+ print_info: n_embd_inp = 5120
60
+ print_info: n_layer = 64
61
+ print_info: n_head = 80
62
+ print_info: n_head_kv = 8
63
+ print_info: n_rot = 128
64
+ print_info: n_swa = 0
65
+ print_info: is_swa_any = 0
66
+ print_info: n_embd_head_k = 128
67
+ print_info: n_embd_head_v = 128
68
+ print_info: n_gqa = 10
69
+ print_info: n_embd_k_gqa = 1024
70
+ print_info: n_embd_v_gqa = 1024
71
+ print_info: f_norm_eps = 0.0e+00
72
+ print_info: f_norm_rms_eps = 1.0e-06
73
+ print_info: f_clamp_kqv = 0.0e+00
74
+ print_info: f_max_alibi_bias = 0.0e+00
75
+ print_info: f_logit_scale = 0.0e+00
76
+ print_info: f_attn_scale = 0.0e+00
77
+ print_info: n_ff = 27648
78
+ print_info: n_expert = 0
79
+ print_info: n_expert_used = 0
80
+ print_info: n_expert_groups = 0
81
+ print_info: n_group_used = 0
82
+ print_info: causal attn = 1
83
+ print_info: pooling type = 0
84
+ print_info: rope type = 2
85
+ print_info: rope scaling = linear
86
+ print_info: freq_base_train = 10000000.0
87
+ print_info: freq_scale_train = 1
88
+ print_info: n_ctx_orig_yarn = 524288
89
+ print_info: rope_finetuned = unknown
90
+ print_info: model type = 36B
91
+ print_info: model params = 36.15 B
92
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
93
+ print_info: vocab type = BPE
94
+ print_info: n_vocab = 155136
95
+ print_info: n_merges = 154737
96
+ print_info: BOS token = 0 '<seed:bos>'
97
+ print_info: EOS token = 2 '<seed:eos>'
98
+ print_info: PAD token = 1 '<seed:pad>'
99
+ print_info: LF token = 326 'Ċ'
100
+ print_info: EOG token = 2 '<seed:eos>'
101
+ print_info: max token length = 1024
102
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
103
+ load_tensors: offloading 20 repeating layers to GPU
104
+ load_tensors: offloaded 20/65 layers to GPU
105
+ load_tensors: CPU_Mapped model buffer size = 13696.93 MiB
106
+ load_tensors: CUDA0 model buffer size = 2897.73 MiB
107
+ load_tensors: CUDA1 model buffer size = 2897.73 MiB
108
+ ..................................................................................................
109
+ llama_context: constructing llama_context
110
+ llama_context: n_seq_max = 1
111
+ llama_context: n_ctx = 2048
112
+ llama_context: n_ctx_seq = 2048
113
+ llama_context: n_batch = 2048
114
+ llama_context: n_ubatch = 512
115
+ llama_context: causal_attn = 1
116
+ llama_context: flash_attn = auto
117
+ llama_context: kv_unified = false
118
+ llama_context: freq_base = 10000000.0
119
+ llama_context: freq_scale = 1
120
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
121
+ llama_context: CPU output buffer size = 0.59 MiB
122
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
123
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
124
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
125
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
126
+ llama_context: Flash Attention was auto, set to enabled
127
+ llama_context: CUDA0 compute buffer size = 739.09 MiB
128
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
129
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
130
+ llama_context: graph nodes = 2183
131
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
132
+ common_init_from_params: added <seed:eos> logit bias = -inf
133
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
134
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
135
+
136
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
137
+ perplexity: tokenizing the input ..
138
+ perplexity: tokenization took 49.049 ms
139
+ perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
140
+ perplexity: 6.64 seconds per pass - ETA 1.77 minutes
141
+ [1]2.7786,[2]2.9173,[3]3.3383,[4]3.6012,[5]4.1247,[6]4.3961,[7]4.6081,[8]4.7268,[9]4.8720,[10]5.0228,[11]5.0974,[12]5.1771,[13]5.3097,[14]5.4174,[15]5.4507,[16]5.4643,
142
+ Final estimate: PPL = 5.4643 +/- 0.12019
143
+
144
+ llama_perf_context_print: load time = 2506.68 ms
145
+ llama_perf_context_print: prompt eval time = 102309.14 ms / 32768 tokens ( 3.12 ms per token, 320.28 tokens per second)
146
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
147
+ llama_perf_context_print: total time = 102843.30 ms / 32769 tokens
148
+ llama_perf_context_print: graphs reused = 0
149
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
150
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16414 + ( 3716 = 2897 + 80 + 739) + 3984 |
151
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20212 + ( 3171 = 2897 + 80 + 194) + 739 |
152
+ llama_memory_breakdown_print: | - Host | 14062 = 13696 + 352 + 14 |
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/bench_metrics.json ADDED
@@ -0,0 +1,44 @@
+ {
+ "raw_metrics": {
+ "llamabench": {
+ "backend": "CUDA",
+ "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/llamabench.md",
+ "ngl": "35",
+ "raw_row": {
+ "backend": "CUDA",
+ "model": "seed_oss 36B IQ4_NL - 4.5 bpw",
+ "ngl": "35",
+ "params": "36.15 B",
+ "size": "19.13 GiB",
+ "t/s": "26.86 \u00b1 2.90",
+ "test": "pp8",
+ "tps_value": 26.86
+ },
+ "test": "pp8",
+ "tps": 26.86
+ },
+ "perplexity": {
+ "code": {
+ "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_code.log",
+ "ppl": 1.4159,
+ "ppl_error": 0.00948
+ },
+ "general": {
+ "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_general.log",
+ "ppl": 6.8703,
+ "ppl_error": 0.16545
+ },
+ "math": {
+ "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_math.log",
+ "ppl": 5.4647,
+ "ppl_error": 0.12022
+ }
+ }
+ },
+ "summary": {
+ "avg_prec_loss_pct": 0.2805,
+ "bench_tps": 26.86,
+ "file_size_bytes": 20551043872,
+ "file_size_gb": 19.14
+ }
+ }
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/llamabench.md ADDED
@@ -0,0 +1,11 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ | model | size | params | backend | ngl | test | t/s |
+ | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
+ | seed_oss 36B IQ4_NL - 4.5 bpw | 19.13 GiB | 36.15 B | CUDA | 35 | pp8 | 26.86 ± 2.90 |
+ | seed_oss 36B IQ4_NL - 4.5 bpw | 19.13 GiB | 36.15 B | CUDA | 35 | tg128 | 5.33 ± 0.01 |
+
+ build: 92bb442ad (7040)
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_code.log ADDED
@@ -0,0 +1,152 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20410 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
16
+ llama_model_loader: - kv 5: general.size_label str = 36B
17
+ llama_model_loader: - kv 6: general.license str = apache-2.0
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
22
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
24
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
25
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
27
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
28
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
30
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
31
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
34
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
35
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
36
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
37
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
39
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
42
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
43
+ llama_model_loader: - kv 32: general.file_type u32 = 25
44
+ llama_model_loader: - type f32: 321 tensors
45
+ llama_model_loader: - type q6_K: 1 tensors
46
+ llama_model_loader: - type iq4_nl: 449 tensors
47
+ print_info: file format = GGUF V3 (latest)
48
+ print_info: file type = IQ4_NL - 4.5 bpw
49
+ print_info: file size = 19.13 GiB (4.55 BPW)
50
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
51
+ load: printing all EOG tokens:
52
+ load: - 2 ('<seed:eos>')
53
+ load: special tokens cache size = 128
54
+ load: token to piece cache size = 0.9296 MB
55
+ print_info: arch = seed_oss
56
+ print_info: vocab_only = 0
57
+ print_info: n_ctx_train = 524288
58
+ print_info: n_embd = 5120
59
+ print_info: n_embd_inp = 5120
60
+ print_info: n_layer = 64
61
+ print_info: n_head = 80
62
+ print_info: n_head_kv = 8
63
+ print_info: n_rot = 128
64
+ print_info: n_swa = 0
65
+ print_info: is_swa_any = 0
66
+ print_info: n_embd_head_k = 128
67
+ print_info: n_embd_head_v = 128
68
+ print_info: n_gqa = 10
69
+ print_info: n_embd_k_gqa = 1024
70
+ print_info: n_embd_v_gqa = 1024
71
+ print_info: f_norm_eps = 0.0e+00
72
+ print_info: f_norm_rms_eps = 1.0e-06
73
+ print_info: f_clamp_kqv = 0.0e+00
74
+ print_info: f_max_alibi_bias = 0.0e+00
75
+ print_info: f_logit_scale = 0.0e+00
76
+ print_info: f_attn_scale = 0.0e+00
77
+ print_info: n_ff = 27648
78
+ print_info: n_expert = 0
79
+ print_info: n_expert_used = 0
80
+ print_info: n_expert_groups = 0
81
+ print_info: n_group_used = 0
82
+ print_info: causal attn = 1
83
+ print_info: pooling type = 0
84
+ print_info: rope type = 2
85
+ print_info: rope scaling = linear
86
+ print_info: freq_base_train = 10000000.0
87
+ print_info: freq_scale_train = 1
88
+ print_info: n_ctx_orig_yarn = 524288
89
+ print_info: rope_finetuned = unknown
90
+ print_info: model type = 36B
91
+ print_info: model params = 36.15 B
92
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
93
+ print_info: vocab type = BPE
94
+ print_info: n_vocab = 155136
95
+ print_info: n_merges = 154737
96
+ print_info: BOS token = 0 '<seed:bos>'
97
+ print_info: EOS token = 2 '<seed:eos>'
98
+ print_info: PAD token = 1 '<seed:pad>'
99
+ print_info: LF token = 326 'Ċ'
100
+ print_info: EOG token = 2 '<seed:eos>'
101
+ print_info: max token length = 1024
102
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
103
+ load_tensors: offloading 20 repeating layers to GPU
104
+ load_tensors: offloaded 20/65 layers to GPU
105
+ load_tensors: CPU_Mapped model buffer size = 13797.53 MiB
106
+ load_tensors: CUDA0 model buffer size = 2897.73 MiB
107
+ load_tensors: CUDA1 model buffer size = 2897.73 MiB
108
+ .................................................................................................
109
+ llama_context: constructing llama_context
110
+ llama_context: n_seq_max = 1
111
+ llama_context: n_ctx = 2048
112
+ llama_context: n_ctx_seq = 2048
113
+ llama_context: n_batch = 2048
114
+ llama_context: n_ubatch = 512
115
+ llama_context: causal_attn = 1
116
+ llama_context: flash_attn = auto
117
+ llama_context: kv_unified = false
118
+ llama_context: freq_base = 10000000.0
119
+ llama_context: freq_scale = 1
120
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
121
+ llama_context: CPU output buffer size = 0.59 MiB
122
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
123
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
124
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
125
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
126
+ llama_context: Flash Attention was auto, set to enabled
127
+ llama_context: CUDA0 compute buffer size = 739.09 MiB
128
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
129
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
130
+ llama_context: graph nodes = 2183
131
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
132
+ common_init_from_params: added <seed:eos> logit bias = -inf
133
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
134
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
135
+
136
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
137
+ perplexity: tokenizing the input ..
138
+ perplexity: tokenization took 112.004 ms
139
+ perplexity: calculating perplexity over 48 chunks, n_ctx=2048, batch_size=2048, n_seq=1
140
+ perplexity: 6.48 seconds per pass - ETA 5.18 minutes
141
+ [1]1.5664,[2]1.4694,[3]1.2926,[4]1.2373,[5]1.1926,[6]1.2807,[7]1.3863,[8]1.4452,[9]1.4274,[10]1.4039,[11]1.3810,[12]1.3861,[13]1.3865,[14]1.3716,[15]1.3526,[16]1.3677,[17]1.3693,[18]1.3506,[19]1.3481,[20]1.3637,[21]1.3539,[22]1.3437,[23]1.3542,[24]1.3486,[25]1.3515,[26]1.3473,[27]1.3644,[28]1.3698,[29]1.3702,[30]1.3711,[31]1.3684,[32]1.3793,[33]1.3800,[34]1.3723,[35]1.3680,[36]1.3630,[37]1.3709,[38]1.3797,[39]1.3710,[40]1.3932,[41]1.4022,[42]1.4052,[43]1.4136,[44]1.4147,[45]1.4080,[46]1.4109,[47]1.4146,[48]1.4159,
142
+ Final estimate: PPL = 1.4159 +/- 0.00948
143
+
144
+ llama_perf_context_print: load time = 2495.13 ms
145
+ llama_perf_context_print: prompt eval time = 299791.84 ms / 98304 tokens ( 3.05 ms per token, 327.91 tokens per second)
146
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
147
+ llama_perf_context_print: total time = 301299.08 ms / 98305 tokens
148
+ llama_perf_context_print: graphs reused = 0
149
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
150
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16504 + ( 3716 = 2897 + 80 + 739) + 3893 |
151
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20212 + ( 3171 = 2897 + 80 + 194) + 739 |
152
+ llama_memory_breakdown_print: | - Host | 14163 = 13797 + 352 + 14 |
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_general.log ADDED
@@ -0,0 +1,152 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20404 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
16
+ llama_model_loader: - kv 5: general.size_label str = 36B
17
+ llama_model_loader: - kv 6: general.license str = apache-2.0
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
22
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
24
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
25
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
27
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
28
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
30
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
31
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
34
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
35
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
36
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
37
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
39
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
42
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
43
+ llama_model_loader: - kv 32: general.file_type u32 = 25
44
+ llama_model_loader: - type f32: 321 tensors
45
+ llama_model_loader: - type q6_K: 1 tensors
46
+ llama_model_loader: - type iq4_nl: 449 tensors
47
+ print_info: file format = GGUF V3 (latest)
48
+ print_info: file type = IQ4_NL - 4.5 bpw
49
+ print_info: file size = 19.13 GiB (4.55 BPW)
50
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
51
+ load: printing all EOG tokens:
52
+ load: - 2 ('<seed:eos>')
53
+ load: special tokens cache size = 128
54
+ load: token to piece cache size = 0.9296 MB
55
+ print_info: arch = seed_oss
56
+ print_info: vocab_only = 0
57
+ print_info: n_ctx_train = 524288
58
+ print_info: n_embd = 5120
59
+ print_info: n_embd_inp = 5120
60
+ print_info: n_layer = 64
61
+ print_info: n_head = 80
62
+ print_info: n_head_kv = 8
63
+ print_info: n_rot = 128
64
+ print_info: n_swa = 0
65
+ print_info: is_swa_any = 0
66
+ print_info: n_embd_head_k = 128
67
+ print_info: n_embd_head_v = 128
68
+ print_info: n_gqa = 10
69
+ print_info: n_embd_k_gqa = 1024
70
+ print_info: n_embd_v_gqa = 1024
71
+ print_info: f_norm_eps = 0.0e+00
72
+ print_info: f_norm_rms_eps = 1.0e-06
73
+ print_info: f_clamp_kqv = 0.0e+00
74
+ print_info: f_max_alibi_bias = 0.0e+00
75
+ print_info: f_logit_scale = 0.0e+00
76
+ print_info: f_attn_scale = 0.0e+00
77
+ print_info: n_ff = 27648
78
+ print_info: n_expert = 0
79
+ print_info: n_expert_used = 0
80
+ print_info: n_expert_groups = 0
81
+ print_info: n_group_used = 0
82
+ print_info: causal attn = 1
83
+ print_info: pooling type = 0
84
+ print_info: rope type = 2
85
+ print_info: rope scaling = linear
86
+ print_info: freq_base_train = 10000000.0
87
+ print_info: freq_scale_train = 1
88
+ print_info: n_ctx_orig_yarn = 524288
89
+ print_info: rope_finetuned = unknown
90
+ print_info: model type = 36B
91
+ print_info: model params = 36.15 B
92
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
93
+ print_info: vocab type = BPE
94
+ print_info: n_vocab = 155136
95
+ print_info: n_merges = 154737
96
+ print_info: BOS token = 0 '<seed:bos>'
97
+ print_info: EOS token = 2 '<seed:eos>'
98
+ print_info: PAD token = 1 '<seed:pad>'
99
+ print_info: LF token = 326 'Ċ'
100
+ print_info: EOG token = 2 '<seed:eos>'
101
+ print_info: max token length = 1024
102
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
103
+ load_tensors: offloading 20 repeating layers to GPU
104
+ load_tensors: offloaded 20/65 layers to GPU
105
+ load_tensors: CPU_Mapped model buffer size = 13797.53 MiB
106
+ load_tensors: CUDA0 model buffer size = 2897.73 MiB
107
+ load_tensors: CUDA1 model buffer size = 2897.73 MiB
108
+ .................................................................................................
109
+ llama_context: constructing llama_context
110
+ llama_context: n_seq_max = 1
111
+ llama_context: n_ctx = 2048
112
+ llama_context: n_ctx_seq = 2048
113
+ llama_context: n_batch = 2048
114
+ llama_context: n_ubatch = 512
115
+ llama_context: causal_attn = 1
116
+ llama_context: flash_attn = auto
117
+ llama_context: kv_unified = false
118
+ llama_context: freq_base = 10000000.0
119
+ llama_context: freq_scale = 1
120
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
121
+ llama_context: CPU output buffer size = 0.59 MiB
122
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
123
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
124
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
125
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
126
+ llama_context: Flash Attention was auto, set to enabled
127
+ llama_context: CUDA0 compute buffer size = 739.09 MiB
128
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
129
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
130
+ llama_context: graph nodes = 2183
131
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
132
+ common_init_from_params: added <seed:eos> logit bias = -inf
133
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
134
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
135
+
136
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
137
+ perplexity: tokenizing the input ..
138
+ perplexity: tokenization took 49.617 ms
139
+ perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
140
+ perplexity: 6.44 seconds per pass - ETA 1.60 minutes
141
+ [1]7.0036,[2]8.1097,[3]8.5035,[4]8.2616,[5]8.0590,[6]6.7519,[7]5.9419,[8]5.9978,[9]6.2567,[10]6.3171,[11]6.4548,[12]6.7584,[13]6.7832,[14]6.8623,[15]6.8703,
142
+ Final estimate: PPL = 6.8703 +/- 0.16545
143
+
144
+ llama_perf_context_print: load time = 2488.76 ms
145
+ llama_perf_context_print: prompt eval time = 93409.83 ms / 30720 tokens ( 3.04 ms per token, 328.87 tokens per second)
146
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
147
+ llama_perf_context_print: total time = 93894.78 ms / 30721 tokens
148
+ llama_perf_context_print: graphs reused = 0
149
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
150
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16496 + ( 3716 = 2897 + 80 + 739) + 3901 |
151
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20212 + ( 3171 = 2897 + 80 + 194) + 739 |
152
+ llama_memory_breakdown_print: | - Host | 14163 = 13797 + 352 + 14 |
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_math.log ADDED
@@ -0,0 +1,152 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20420 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
16
+ llama_model_loader: - kv 5: general.size_label str = 36B
17
+ llama_model_loader: - kv 6: general.license str = apache-2.0
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
22
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
24
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
25
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
27
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
28
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
30
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
31
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
34
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
35
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
36
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
37
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
39
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
42
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
43
+ llama_model_loader: - kv 32: general.file_type u32 = 25
44
+ llama_model_loader: - type f32: 321 tensors
45
+ llama_model_loader: - type q6_K: 1 tensors
46
+ llama_model_loader: - type iq4_nl: 449 tensors
47
+ print_info: file format = GGUF V3 (latest)
48
+ print_info: file type = IQ4_NL - 4.5 bpw
49
+ print_info: file size = 19.13 GiB (4.55 BPW)
50
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
51
+ load: printing all EOG tokens:
52
+ load: - 2 ('<seed:eos>')
53
+ load: special tokens cache size = 128
54
+ load: token to piece cache size = 0.9296 MB
55
+ print_info: arch = seed_oss
56
+ print_info: vocab_only = 0
57
+ print_info: n_ctx_train = 524288
58
+ print_info: n_embd = 5120
59
+ print_info: n_embd_inp = 5120
60
+ print_info: n_layer = 64
61
+ print_info: n_head = 80
62
+ print_info: n_head_kv = 8
63
+ print_info: n_rot = 128
64
+ print_info: n_swa = 0
65
+ print_info: is_swa_any = 0
66
+ print_info: n_embd_head_k = 128
67
+ print_info: n_embd_head_v = 128
68
+ print_info: n_gqa = 10
69
+ print_info: n_embd_k_gqa = 1024
70
+ print_info: n_embd_v_gqa = 1024
71
+ print_info: f_norm_eps = 0.0e+00
72
+ print_info: f_norm_rms_eps = 1.0e-06
73
+ print_info: f_clamp_kqv = 0.0e+00
74
+ print_info: f_max_alibi_bias = 0.0e+00
75
+ print_info: f_logit_scale = 0.0e+00
76
+ print_info: f_attn_scale = 0.0e+00
77
+ print_info: n_ff = 27648
78
+ print_info: n_expert = 0
79
+ print_info: n_expert_used = 0
80
+ print_info: n_expert_groups = 0
81
+ print_info: n_group_used = 0
82
+ print_info: causal attn = 1
83
+ print_info: pooling type = 0
84
+ print_info: rope type = 2
85
+ print_info: rope scaling = linear
86
+ print_info: freq_base_train = 10000000.0
87
+ print_info: freq_scale_train = 1
88
+ print_info: n_ctx_orig_yarn = 524288
89
+ print_info: rope_finetuned = unknown
90
+ print_info: model type = 36B
91
+ print_info: model params = 36.15 B
92
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
93
+ print_info: vocab type = BPE
94
+ print_info: n_vocab = 155136
95
+ print_info: n_merges = 154737
96
+ print_info: BOS token = 0 '<seed:bos>'
97
+ print_info: EOS token = 2 '<seed:eos>'
98
+ print_info: PAD token = 1 '<seed:pad>'
99
+ print_info: LF token = 326 'Ċ'
100
+ print_info: EOG token = 2 '<seed:eos>'
101
+ print_info: max token length = 1024
102
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
103
+ load_tensors: offloading 20 repeating layers to GPU
104
+ load_tensors: offloaded 20/65 layers to GPU
105
+ load_tensors: CPU_Mapped model buffer size = 13797.53 MiB
106
+ load_tensors: CUDA0 model buffer size = 2897.73 MiB
107
+ load_tensors: CUDA1 model buffer size = 2897.73 MiB
108
+ .................................................................................................
109
+ llama_context: constructing llama_context
110
+ llama_context: n_seq_max = 1
111
+ llama_context: n_ctx = 2048
112
+ llama_context: n_ctx_seq = 2048
113
+ llama_context: n_batch = 2048
114
+ llama_context: n_ubatch = 512
115
+ llama_context: causal_attn = 1
116
+ llama_context: flash_attn = auto
117
+ llama_context: kv_unified = false
118
+ llama_context: freq_base = 10000000.0
119
+ llama_context: freq_scale = 1
120
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
121
+ llama_context: CPU output buffer size = 0.59 MiB
122
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
123
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
124
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
125
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
126
+ llama_context: Flash Attention was auto, set to enabled
127
+ llama_context: CUDA0 compute buffer size = 739.09 MiB
128
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
129
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
130
+ llama_context: graph nodes = 2183
131
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
132
+ common_init_from_params: added <seed:eos> logit bias = -inf
133
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
134
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
135
+
136
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
137
+ perplexity: tokenizing the input ..
138
+ perplexity: tokenization took 44.823 ms
139
+ perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
140
+ perplexity: 6.50 seconds per pass - ETA 1.72 minutes
141
+ [1]2.7835,[2]2.9216,[3]3.3409,[4]3.6013,[5]4.1248,[6]4.3958,[7]4.6067,[8]4.7254,[9]4.8714,[10]5.0227,[11]5.0961,[12]5.1761,[13]5.3092,[14]5.4165,[15]5.4500,[16]5.4647,
142
+ Final estimate: PPL = 5.4647 +/- 0.12022
143
+
144
+ llama_perf_context_print: load time = 2511.76 ms
145
+ llama_perf_context_print: prompt eval time = 100187.11 ms / 32768 tokens ( 3.06 ms per token, 327.07 tokens per second)
146
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
147
+ llama_perf_context_print: total time = 100696.88 ms / 32769 tokens
148
+ llama_perf_context_print: graphs reused = 0
149
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
150
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16501 + ( 3716 = 2897 + 80 + 739) + 3896 |
151
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20212 + ( 3171 = 2897 + 80 + 194) + 739 |
152
+ llama_memory_breakdown_print: | - Host | 14163 = 13797 + 352 + 14 |
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/bench_metrics.json ADDED
@@ -0,0 +1,44 @@
+ {
+ "raw_metrics": {
+ "llamabench": {
+ "backend": "CUDA",
+ "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/llamabench.md",
+ "ngl": "35",
+ "raw_row": {
+ "backend": "CUDA",
+ "model": "seed_oss 36B IQ4_NL - 4.5 bpw",
+ "ngl": "35",
+ "params": "36.15 B",
+ "size": "19.31 GiB",
+ "t/s": "25.73 \u00b1 2.34",
+ "test": "pp8",
+ "tps_value": 25.73
+ },
+ "test": "pp8",
+ "tps": 25.73
+ },
+ "perplexity": {
+ "code": {
+ "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_code.log",
+ "ppl": 1.4159,
+ "ppl_error": 0.00947
+ },
+ "general": {
+ "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_general.log",
+ "ppl": 6.8736,
+ "ppl_error": 0.16558
+ },
+ "math": {
+ "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_math.log",
+ "ppl": 5.4656,
+ "ppl_error": 0.12023
+ }
+ }
+ },
+ "summary": {
+ "avg_prec_loss_pct": 0.27,
+ "bench_tps": 25.73,
+ "file_size_bytes": 20743412512,
+ "file_size_gb": 19.32
+ }
+ }
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/llamabench.md ADDED
@@ -0,0 +1,11 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ | model | size | params | backend | ngl | test | t/s |
+ | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
+ | seed_oss 36B IQ4_NL - 4.5 bpw | 19.31 GiB | 36.15 B | CUDA | 35 | pp8 | 25.73 ± 2.34 |
+ | seed_oss 36B IQ4_NL - 4.5 bpw | 19.31 GiB | 36.15 B | CUDA | 35 | tg128 | 5.32 ± 0.00 |
+
+ build: 92bb442ad (7040)
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_code.log ADDED
@@ -0,0 +1,152 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20429 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
16
+ llama_model_loader: - kv 5: general.size_label str = 36B
17
+ llama_model_loader: - kv 6: general.license str = apache-2.0
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
22
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
24
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
25
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
27
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
28
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
30
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
31
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
34
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
35
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
36
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
37
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
39
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
42
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
43
+ llama_model_loader: - kv 32: general.file_type u32 = 25
44
+ llama_model_loader: - type f32: 321 tensors
45
+ llama_model_loader: - type q8_0: 1 tensors
46
+ llama_model_loader: - type iq4_nl: 449 tensors
47
+ print_info: file format = GGUF V3 (latest)
48
+ print_info: file type = IQ4_NL - 4.5 bpw
49
+ print_info: file size = 19.31 GiB (4.59 BPW)
50
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
51
+ load: printing all EOG tokens:
52
+ load: - 2 ('<seed:eos>')
53
+ load: special tokens cache size = 128
54
+ load: token to piece cache size = 0.9296 MB
55
+ print_info: arch = seed_oss
56
+ print_info: vocab_only = 0
57
+ print_info: n_ctx_train = 524288
58
+ print_info: n_embd = 5120
59
+ print_info: n_embd_inp = 5120
60
+ print_info: n_layer = 64
61
+ print_info: n_head = 80
62
+ print_info: n_head_kv = 8
63
+ print_info: n_rot = 128
64
+ print_info: n_swa = 0
65
+ print_info: is_swa_any = 0
66
+ print_info: n_embd_head_k = 128
67
+ print_info: n_embd_head_v = 128
68
+ print_info: n_gqa = 10
69
+ print_info: n_embd_k_gqa = 1024
70
+ print_info: n_embd_v_gqa = 1024
71
+ print_info: f_norm_eps = 0.0e+00
72
+ print_info: f_norm_rms_eps = 1.0e-06
73
+ print_info: f_clamp_kqv = 0.0e+00
74
+ print_info: f_max_alibi_bias = 0.0e+00
75
+ print_info: f_logit_scale = 0.0e+00
76
+ print_info: f_attn_scale = 0.0e+00
77
+ print_info: n_ff = 27648
78
+ print_info: n_expert = 0
79
+ print_info: n_expert_used = 0
80
+ print_info: n_expert_groups = 0
81
+ print_info: n_group_used = 0
82
+ print_info: causal attn = 1
83
+ print_info: pooling type = 0
84
+ print_info: rope type = 2
85
+ print_info: rope scaling = linear
86
+ print_info: freq_base_train = 10000000.0
87
+ print_info: freq_scale_train = 1
88
+ print_info: n_ctx_orig_yarn = 524288
89
+ print_info: rope_finetuned = unknown
90
+ print_info: model type = 36B
91
+ print_info: model params = 36.15 B
92
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
93
+ print_info: vocab type = BPE
94
+ print_info: n_vocab = 155136
95
+ print_info: n_merges = 154737
96
+ print_info: BOS token = 0 '<seed:bos>'
97
+ print_info: EOS token = 2 '<seed:eos>'
98
+ print_info: PAD token = 1 '<seed:pad>'
99
+ print_info: LF token = 326 'Ċ'
100
+ print_info: EOG token = 2 '<seed:eos>'
101
+ print_info: max token length = 1024
102
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
103
+ load_tensors: offloading 20 repeating layers to GPU
104
+ load_tensors: offloaded 20/65 layers to GPU
105
+ load_tensors: CPU_Mapped model buffer size = 13980.99 MiB
106
+ load_tensors: CUDA0 model buffer size = 2897.73 MiB
107
+ load_tensors: CUDA1 model buffer size = 2897.73 MiB
108
+ ................................................................................................
109
+ llama_context: constructing llama_context
110
+ llama_context: n_seq_max = 1
111
+ llama_context: n_ctx = 2048
112
+ llama_context: n_ctx_seq = 2048
113
+ llama_context: n_batch = 2048
114
+ llama_context: n_ubatch = 512
115
+ llama_context: causal_attn = 1
116
+ llama_context: flash_attn = auto
117
+ llama_context: kv_unified = false
118
+ llama_context: freq_base = 10000000.0
119
+ llama_context: freq_scale = 1
120
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
121
+ llama_context: CPU output buffer size = 0.59 MiB
122
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
123
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
124
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
125
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
126
+ llama_context: Flash Attention was auto, set to enabled
127
+ llama_context: CUDA0 compute buffer size = 739.09 MiB
128
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
129
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
130
+ llama_context: graph nodes = 2183
131
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
132
+ common_init_from_params: added <seed:eos> logit bias = -inf
133
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
134
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
135
+
136
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
137
+ perplexity: tokenizing the input ..
138
+ perplexity: tokenization took 111.451 ms
139
+ perplexity: calculating perplexity over 48 chunks, n_ctx=2048, batch_size=2048, n_seq=1
140
+ perplexity: 6.45 seconds per pass - ETA 5.15 minutes
141
+ [1]1.5697,[2]1.4711,[3]1.2936,[4]1.2378,[5]1.1929,[6]1.2810,[7]1.3862,[8]1.4451,[9]1.4271,[10]1.4037,[11]1.3808,[12]1.3860,[13]1.3864,[14]1.3715,[15]1.3525,[16]1.3677,[17]1.3692,[18]1.3505,[19]1.3480,[20]1.3636,[21]1.3538,[22]1.3436,[23]1.3541,[24]1.3485,[25]1.3513,[26]1.3472,[27]1.3643,[28]1.3697,[29]1.3701,[30]1.3711,[31]1.3683,[32]1.3792,[33]1.3800,[34]1.3723,[35]1.3680,[36]1.3630,[37]1.3709,[38]1.3796,[39]1.3710,[40]1.3932,[41]1.4021,[42]1.4052,[43]1.4137,[44]1.4148,[45]1.4081,[46]1.4110,[47]1.4146,[48]1.4159,
142
+ Final estimate: PPL = 1.4159 +/- 0.00947
143
+
144
+ llama_perf_context_print: load time = 2493.93 ms
145
+ llama_perf_context_print: prompt eval time = 299600.26 ms / 98304 tokens ( 3.05 ms per token, 328.12 tokens per second)
146
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
147
+ llama_perf_context_print: total time = 301103.88 ms / 98305 tokens
148
+ llama_perf_context_print: graphs reused = 0
149
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
150
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16531 + ( 3716 = 2897 + 80 + 739) + 3866 |
151
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20212 + ( 3171 = 2897 + 80 + 194) + 739 |
152
+ llama_memory_breakdown_print: | - Host | 14347 = 13980 + 352 + 14 |
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_general.log ADDED
@@ -0,0 +1,152 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20412 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
16
+ llama_model_loader: - kv 5: general.size_label str = 36B
17
+ llama_model_loader: - kv 6: general.license str = apache-2.0
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
22
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
24
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
25
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
27
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
28
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
30
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
31
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
34
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
35
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
36
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
37
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
39
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
42
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
43
+ llama_model_loader: - kv 32: general.file_type u32 = 25
44
+ llama_model_loader: - type f32: 321 tensors
45
+ llama_model_loader: - type q8_0: 1 tensors
46
+ llama_model_loader: - type iq4_nl: 449 tensors
47
+ print_info: file format = GGUF V3 (latest)
48
+ print_info: file type = IQ4_NL - 4.5 bpw
49
+ print_info: file size = 19.31 GiB (4.59 BPW)
50
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
51
+ load: printing all EOG tokens:
52
+ load: - 2 ('<seed:eos>')
53
+ load: special tokens cache size = 128
54
+ load: token to piece cache size = 0.9296 MB
55
+ print_info: arch = seed_oss
56
+ print_info: vocab_only = 0
57
+ print_info: n_ctx_train = 524288
58
+ print_info: n_embd = 5120
59
+ print_info: n_embd_inp = 5120
60
+ print_info: n_layer = 64
61
+ print_info: n_head = 80
62
+ print_info: n_head_kv = 8
63
+ print_info: n_rot = 128
64
+ print_info: n_swa = 0
65
+ print_info: is_swa_any = 0
66
+ print_info: n_embd_head_k = 128
67
+ print_info: n_embd_head_v = 128
68
+ print_info: n_gqa = 10
69
+ print_info: n_embd_k_gqa = 1024
70
+ print_info: n_embd_v_gqa = 1024
71
+ print_info: f_norm_eps = 0.0e+00
72
+ print_info: f_norm_rms_eps = 1.0e-06
73
+ print_info: f_clamp_kqv = 0.0e+00
74
+ print_info: f_max_alibi_bias = 0.0e+00
75
+ print_info: f_logit_scale = 0.0e+00
76
+ print_info: f_attn_scale = 0.0e+00
77
+ print_info: n_ff = 27648
78
+ print_info: n_expert = 0
79
+ print_info: n_expert_used = 0
80
+ print_info: n_expert_groups = 0
81
+ print_info: n_group_used = 0
82
+ print_info: causal attn = 1
83
+ print_info: pooling type = 0
84
+ print_info: rope type = 2
85
+ print_info: rope scaling = linear
86
+ print_info: freq_base_train = 10000000.0
87
+ print_info: freq_scale_train = 1
88
+ print_info: n_ctx_orig_yarn = 524288
89
+ print_info: rope_finetuned = unknown
90
+ print_info: model type = 36B
91
+ print_info: model params = 36.15 B
92
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
93
+ print_info: vocab type = BPE
94
+ print_info: n_vocab = 155136
95
+ print_info: n_merges = 154737
96
+ print_info: BOS token = 0 '<seed:bos>'
97
+ print_info: EOS token = 2 '<seed:eos>'
98
+ print_info: PAD token = 1 '<seed:pad>'
99
+ print_info: LF token = 326 'Ċ'
100
+ print_info: EOG token = 2 '<seed:eos>'
101
+ print_info: max token length = 1024
102
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
103
+ load_tensors: offloading 20 repeating layers to GPU
104
+ load_tensors: offloaded 20/65 layers to GPU
105
+ load_tensors: CPU_Mapped model buffer size = 13980.99 MiB
106
+ load_tensors: CUDA0 model buffer size = 2897.73 MiB
107
+ load_tensors: CUDA1 model buffer size = 2897.73 MiB
108
+ ................................................................................................
109
+ llama_context: constructing llama_context
110
+ llama_context: n_seq_max = 1
111
+ llama_context: n_ctx = 2048
112
+ llama_context: n_ctx_seq = 2048
113
+ llama_context: n_batch = 2048
114
+ llama_context: n_ubatch = 512
115
+ llama_context: causal_attn = 1
116
+ llama_context: flash_attn = auto
117
+ llama_context: kv_unified = false
118
+ llama_context: freq_base = 10000000.0
119
+ llama_context: freq_scale = 1
120
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
121
+ llama_context: CPU output buffer size = 0.59 MiB
122
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
123
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
124
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
125
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
126
+ llama_context: Flash Attention was auto, set to enabled
127
+ llama_context: CUDA0 compute buffer size = 739.09 MiB
128
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
129
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
130
+ llama_context: graph nodes = 2183
131
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
132
+ common_init_from_params: added <seed:eos> logit bias = -inf
133
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
134
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
135
+
136
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
137
+ perplexity: tokenizing the input ..
138
+ perplexity: tokenization took 46.856 ms
139
+ perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
140
+ perplexity: 6.44 seconds per pass - ETA 1.60 minutes
141
+ [1]7.0159,[2]8.1163,[3]8.5190,[4]8.2693,[5]8.0635,[6]6.7555,[7]5.9452,[8]6.0011,[9]6.2586,[10]6.3193,[11]6.4576,[12]6.7629,[13]6.7864,[14]6.8658,[15]6.8736,
142
+ Final estimate: PPL = 6.8736 +/- 0.16558
143
+
144
+ llama_perf_context_print: load time = 2645.96 ms
145
+ llama_perf_context_print: prompt eval time = 93386.51 ms / 30720 tokens ( 3.04 ms per token, 328.96 tokens per second)
146
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
147
+ llama_perf_context_print: total time = 93871.31 ms / 30721 tokens
148
+ llama_perf_context_print: graphs reused = 0
149
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
150
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16501 + ( 3716 = 2897 + 80 + 739) + 3897 |
151
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20212 + ( 3171 = 2897 + 80 + 194) + 739 |
152
+ llama_memory_breakdown_print: | - Host | 14347 = 13980 + 352 + 14 |
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_math.log ADDED
@@ -0,0 +1,152 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20450 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
16
+ llama_model_loader: - kv 5: general.size_label str = 36B
17
+ llama_model_loader: - kv 6: general.license str = apache-2.0
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
22
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
24
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
25
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
27
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
28
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
30
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
31
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
34
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
35
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
36
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
37
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
39
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
42
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
43
+ llama_model_loader: - kv 32: general.file_type u32 = 25
44
+ llama_model_loader: - type f32: 321 tensors
45
+ llama_model_loader: - type q8_0: 1 tensors
46
+ llama_model_loader: - type iq4_nl: 449 tensors
47
+ print_info: file format = GGUF V3 (latest)
48
+ print_info: file type = IQ4_NL - 4.5 bpw
49
+ print_info: file size = 19.31 GiB (4.59 BPW)
50
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
51
+ load: printing all EOG tokens:
52
+ load: - 2 ('<seed:eos>')
53
+ load: special tokens cache size = 128
54
+ load: token to piece cache size = 0.9296 MB
55
+ print_info: arch = seed_oss
56
+ print_info: vocab_only = 0
57
+ print_info: n_ctx_train = 524288
58
+ print_info: n_embd = 5120
59
+ print_info: n_embd_inp = 5120
60
+ print_info: n_layer = 64
61
+ print_info: n_head = 80
62
+ print_info: n_head_kv = 8
63
+ print_info: n_rot = 128
64
+ print_info: n_swa = 0
65
+ print_info: is_swa_any = 0
66
+ print_info: n_embd_head_k = 128
67
+ print_info: n_embd_head_v = 128
68
+ print_info: n_gqa = 10
69
+ print_info: n_embd_k_gqa = 1024
70
+ print_info: n_embd_v_gqa = 1024
71
+ print_info: f_norm_eps = 0.0e+00
72
+ print_info: f_norm_rms_eps = 1.0e-06
73
+ print_info: f_clamp_kqv = 0.0e+00
74
+ print_info: f_max_alibi_bias = 0.0e+00
75
+ print_info: f_logit_scale = 0.0e+00
76
+ print_info: f_attn_scale = 0.0e+00
77
+ print_info: n_ff = 27648
78
+ print_info: n_expert = 0
79
+ print_info: n_expert_used = 0
80
+ print_info: n_expert_groups = 0
81
+ print_info: n_group_used = 0
82
+ print_info: causal attn = 1
83
+ print_info: pooling type = 0
84
+ print_info: rope type = 2
85
+ print_info: rope scaling = linear
86
+ print_info: freq_base_train = 10000000.0
87
+ print_info: freq_scale_train = 1
88
+ print_info: n_ctx_orig_yarn = 524288
89
+ print_info: rope_finetuned = unknown
90
+ print_info: model type = 36B
91
+ print_info: model params = 36.15 B
92
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
93
+ print_info: vocab type = BPE
94
+ print_info: n_vocab = 155136
95
+ print_info: n_merges = 154737
96
+ print_info: BOS token = 0 '<seed:bos>'
97
+ print_info: EOS token = 2 '<seed:eos>'
98
+ print_info: PAD token = 1 '<seed:pad>'
99
+ print_info: LF token = 326 'Ċ'
100
+ print_info: EOG token = 2 '<seed:eos>'
101
+ print_info: max token length = 1024
102
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
103
+ load_tensors: offloading 20 repeating layers to GPU
104
+ load_tensors: offloaded 20/65 layers to GPU
105
+ load_tensors: CPU_Mapped model buffer size = 13980.99 MiB
106
+ load_tensors: CUDA0 model buffer size = 2897.73 MiB
107
+ load_tensors: CUDA1 model buffer size = 2897.73 MiB
108
+ ................................................................................................
109
+ llama_context: constructing llama_context
110
+ llama_context: n_seq_max = 1
111
+ llama_context: n_ctx = 2048
112
+ llama_context: n_ctx_seq = 2048
113
+ llama_context: n_batch = 2048
114
+ llama_context: n_ubatch = 512
115
+ llama_context: causal_attn = 1
116
+ llama_context: flash_attn = auto
117
+ llama_context: kv_unified = false
118
+ llama_context: freq_base = 10000000.0
119
+ llama_context: freq_scale = 1
120
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
121
+ llama_context: CPU output buffer size = 0.59 MiB
122
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
123
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
124
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
125
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
126
+ llama_context: Flash Attention was auto, set to enabled
127
+ llama_context: CUDA0 compute buffer size = 739.09 MiB
128
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
129
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
130
+ llama_context: graph nodes = 2183
131
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
132
+ common_init_from_params: added <seed:eos> logit bias = -inf
133
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
134
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
135
+
136
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
137
+ perplexity: tokenizing the input ..
138
+ perplexity: tokenization took 46.739 ms
139
+ perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
140
+ perplexity: 6.48 seconds per pass - ETA 1.72 minutes
141
+ [1]2.7843,[2]2.9209,[3]3.3419,[4]3.6083,[5]4.1312,[6]4.4011,[7]4.6125,[8]4.7301,[9]4.8745,[10]5.0244,[11]5.0976,[12]5.1776,[13]5.3111,[14]5.4185,[15]5.4517,[16]5.4656,
142
+ Final estimate: PPL = 5.4656 +/- 0.12023
143
+
144
+ llama_perf_context_print: load time = 2485.58 ms
145
+ llama_perf_context_print: prompt eval time = 99922.90 ms / 32768 tokens ( 3.05 ms per token, 327.93 tokens per second)
146
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
147
+ llama_perf_context_print: total time = 100433.38 ms / 32769 tokens
148
+ llama_perf_context_print: graphs reused = 0
149
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
150
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16535 + ( 3716 = 2897 + 80 + 739) + 3862 |
151
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20212 + ( 3171 = 2897 + 80 + 194) + 739 |
152
+ llama_memory_breakdown_print: | - Host | 14347 = 13980 + 352 + 14 |
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_MXFP4-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/bench_metrics.json ADDED
@@ -0,0 +1,44 @@
1
+ {
2
+ "raw_metrics": {
3
+ "llamabench": {
4
+ "backend": "CUDA",
5
+ "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_MXFP4-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/llamabench.md",
6
+ "ngl": "35",
7
+ "raw_row": {
8
+ "backend": "CUDA",
9
+ "model": "seed_oss 36B IQ4_NL - 4.5 bpw",
10
+ "ngl": "35",
11
+ "params": "36.15 B",
12
+ "size": "19.43 GiB",
13
+ "t/s": "28.76 \u00b1 0.96",
14
+ "test": "pp8",
15
+ "tps_value": 28.76
16
+ },
17
+ "test": "pp8",
18
+ "tps": 28.76
19
+ },
20
+ "perplexity": {
21
+ "code": {
22
+ "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_MXFP4-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_code.log",
23
+ "ppl": 1.4176,
24
+ "ppl_error": 0.00953
25
+ },
26
+ "general": {
27
+ "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_MXFP4-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_general.log",
28
+ "ppl": 6.8507,
29
+ "ppl_error": 0.16499
30
+ },
31
+ "math": {
32
+ "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_MXFP4-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_math.log",
33
+ "ppl": 5.4384,
34
+ "ppl_error": 0.1198
35
+ }
36
+ }
37
+ },
38
+ "summary": {
39
+ "avg_prec_loss_pct": 0.3254,
40
+ "bench_tps": 28.76,
41
+ "file_size_bytes": 20864981792,
42
+ "file_size_gb": 19.43
43
+ }
44
+ }
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_MXFP4-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/llamabench.md ADDED
@@ -0,0 +1,11 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ | model | size | params | backend | ngl | test | t/s |
7
+ | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
8
+ | seed_oss 36B IQ4_NL - 4.5 bpw | 19.43 GiB | 36.15 B | CUDA | 35 | pp8 | 28.76 ± 0.96 |
9
+ | seed_oss 36B IQ4_NL - 4.5 bpw | 19.43 GiB | 36.15 B | CUDA | 35 | tg128 | 4.87 ± 0.01 |
10
+
11
+ build: 92bb442ad (7040)
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_MXFP4-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_code.log ADDED
@@ -0,0 +1,152 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 19133 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_MXFP4-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
16
+ llama_model_loader: - kv 5: general.size_label str = 36B
17
+ llama_model_loader: - kv 6: general.license str = apache-2.0
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
22
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
24
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
25
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
27
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
28
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
30
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
31
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
34
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
35
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
36
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
37
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
39
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
42
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
43
+ llama_model_loader: - kv 32: general.file_type u32 = 25
44
+ llama_model_loader: - type f32: 321 tensors
45
+ llama_model_loader: - type q5_K: 65 tensors
46
+ llama_model_loader: - type iq4_nl: 385 tensors
47
+ print_info: file format = GGUF V3 (latest)
48
+ print_info: file type = IQ4_NL - 4.5 bpw
49
+ print_info: file size = 19.43 GiB (4.62 BPW)
50
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
51
+ load: printing all EOG tokens:
52
+ load: - 2 ('<seed:eos>')
53
+ load: special tokens cache size = 128
54
+ load: token to piece cache size = 0.9296 MB
55
+ print_info: arch = seed_oss
56
+ print_info: vocab_only = 0
57
+ print_info: n_ctx_train = 524288
58
+ print_info: n_embd = 5120
59
+ print_info: n_embd_inp = 5120
60
+ print_info: n_layer = 64
61
+ print_info: n_head = 80
62
+ print_info: n_head_kv = 8
63
+ print_info: n_rot = 128
64
+ print_info: n_swa = 0
65
+ print_info: is_swa_any = 0
66
+ print_info: n_embd_head_k = 128
67
+ print_info: n_embd_head_v = 128
68
+ print_info: n_gqa = 10
69
+ print_info: n_embd_k_gqa = 1024
70
+ print_info: n_embd_v_gqa = 1024
71
+ print_info: f_norm_eps = 0.0e+00
72
+ print_info: f_norm_rms_eps = 1.0e-06
73
+ print_info: f_clamp_kqv = 0.0e+00
74
+ print_info: f_max_alibi_bias = 0.0e+00
75
+ print_info: f_logit_scale = 0.0e+00
76
+ print_info: f_attn_scale = 0.0e+00
77
+ print_info: n_ff = 27648
78
+ print_info: n_expert = 0
79
+ print_info: n_expert_used = 0
80
+ print_info: n_expert_groups = 0
81
+ print_info: n_group_used = 0
82
+ print_info: causal attn = 1
83
+ print_info: pooling type = 0
84
+ print_info: rope type = 2
85
+ print_info: rope scaling = linear
86
+ print_info: freq_base_train = 10000000.0
87
+ print_info: freq_scale_train = 1
88
+ print_info: n_ctx_orig_yarn = 524288
89
+ print_info: rope_finetuned = unknown
90
+ print_info: model type = 36B
91
+ print_info: model params = 36.15 B
92
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
93
+ print_info: vocab type = BPE
94
+ print_info: n_vocab = 155136
95
+ print_info: n_merges = 154737
96
+ print_info: BOS token = 0 '<seed:bos>'
97
+ print_info: EOS token = 2 '<seed:eos>'
98
+ print_info: PAD token = 1 '<seed:pad>'
99
+ print_info: LF token = 326 'Ċ'
100
+ print_info: EOG token = 2 '<seed:eos>'
101
+ print_info: max token length = 1024
102
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
103
+ load_tensors: offloading 20 repeating layers to GPU
104
+ load_tensors: offloaded 20/65 layers to GPU
105
+ load_tensors: CPU_Mapped model buffer size = 13971.93 MiB
106
+ load_tensors: CUDA0 model buffer size = 2664.21 MiB
107
+ load_tensors: CUDA1 model buffer size = 3256.26 MiB
108
+ ..................................................................................................
109
+ llama_context: constructing llama_context
110
+ llama_context: n_seq_max = 1
111
+ llama_context: n_ctx = 2048
112
+ llama_context: n_ctx_seq = 2048
113
+ llama_context: n_batch = 2048
114
+ llama_context: n_ubatch = 512
115
+ llama_context: causal_attn = 1
116
+ llama_context: flash_attn = auto
117
+ llama_context: kv_unified = false
118
+ llama_context: freq_base = 10000000.0
119
+ llama_context: freq_scale = 1
120
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
121
+ llama_context: CPU output buffer size = 0.59 MiB
122
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
123
+ llama_kv_cache: CUDA0 KV buffer size = 72.00 MiB
124
+ llama_kv_cache: CUDA1 KV buffer size = 88.00 MiB
125
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
126
+ llama_context: Flash Attention was auto, set to enabled
127
+ llama_context: CUDA0 compute buffer size = 833.78 MiB
128
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
129
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
130
+ llama_context: graph nodes = 2183
131
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
132
+ common_init_from_params: added <seed:eos> logit bias = -inf
133
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
134
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
135
+
136
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
137
+ perplexity: tokenizing the input ..
138
+ perplexity: tokenization took 115.322 ms
139
+ perplexity: calculating perplexity over 48 chunks, n_ctx=2048, batch_size=2048, n_seq=1
140
+ perplexity: 6.91 seconds per pass - ETA 5.52 minutes
141
+ [1]1.5663,[2]1.4687,[3]1.2922,[4]1.2374,[5]1.1926,[6]1.2795,[7]1.3859,[8]1.4450,[9]1.4272,[10]1.4042,[11]1.3816,[12]1.3867,[13]1.3871,[14]1.3725,[15]1.3537,[16]1.3689,[17]1.3703,[18]1.3515,[19]1.3491,[20]1.3652,[21]1.3554,[22]1.3451,[23]1.3557,[24]1.3503,[25]1.3534,[26]1.3494,[27]1.3664,[28]1.3717,[29]1.3721,[30]1.3730,[31]1.3704,[32]1.3812,[33]1.3819,[34]1.3743,[35]1.3700,[36]1.3651,[37]1.3731,[38]1.3820,[39]1.3733,[40]1.3951,[41]1.4041,[42]1.4071,[43]1.4154,[44]1.4164,[45]1.4097,[46]1.4126,[47]1.4164,[48]1.4176,
142
+ Final estimate: PPL = 1.4176 +/- 0.00953
143
+
144
+ llama_perf_context_print: load time = 2582.59 ms
145
+ llama_perf_context_print: prompt eval time = 321034.86 ms / 98304 tokens ( 3.27 ms per token, 306.21 tokens per second)
146
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
147
+ llama_perf_context_print: total time = 322609.98 ms / 98305 tokens
148
+ llama_perf_context_print: graphs reused = 0
149
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
150
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 15274 + ( 3569 = 2664 + 72 + 833) + 5271 |
151
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 19844 + ( 3538 = 3256 + 88 + 194) + 741 |
152
+ llama_memory_breakdown_print: | - Host | 14337 = 13971 + 352 + 14 |
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_MXFP4-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_general.log ADDED
@@ -0,0 +1,152 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 19177 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_MXFP4-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
16
+ llama_model_loader: - kv 5: general.size_label str = 36B
17
+ llama_model_loader: - kv 6: general.license str = apache-2.0
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
22
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
24
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
25
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
27
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
28
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
30
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
31
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
34
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
35
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
36
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
37
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
39
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
42
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
43
+ llama_model_loader: - kv 32: general.file_type u32 = 25
44
+ llama_model_loader: - type f32: 321 tensors
45
+ llama_model_loader: - type q5_K: 65 tensors
46
+ llama_model_loader: - type iq4_nl: 385 tensors
47
+ print_info: file format = GGUF V3 (latest)
48
+ print_info: file type = IQ4_NL - 4.5 bpw
49
+ print_info: file size = 19.43 GiB (4.62 BPW)
50
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
51
+ load: printing all EOG tokens:
52
+ load: - 2 ('<seed:eos>')
53
+ load: special tokens cache size = 128
54
+ load: token to piece cache size = 0.9296 MB
55
+ print_info: arch = seed_oss
56
+ print_info: vocab_only = 0
57
+ print_info: n_ctx_train = 524288
58
+ print_info: n_embd = 5120
59
+ print_info: n_embd_inp = 5120
60
+ print_info: n_layer = 64
61
+ print_info: n_head = 80
62
+ print_info: n_head_kv = 8
63
+ print_info: n_rot = 128
64
+ print_info: n_swa = 0
65
+ print_info: is_swa_any = 0
66
+ print_info: n_embd_head_k = 128
67
+ print_info: n_embd_head_v = 128
68
+ print_info: n_gqa = 10
69
+ print_info: n_embd_k_gqa = 1024
70
+ print_info: n_embd_v_gqa = 1024
71
+ print_info: f_norm_eps = 0.0e+00
72
+ print_info: f_norm_rms_eps = 1.0e-06
73
+ print_info: f_clamp_kqv = 0.0e+00
74
+ print_info: f_max_alibi_bias = 0.0e+00
75
+ print_info: f_logit_scale = 0.0e+00
76
+ print_info: f_attn_scale = 0.0e+00
77
+ print_info: n_ff = 27648
78
+ print_info: n_expert = 0
79
+ print_info: n_expert_used = 0
80
+ print_info: n_expert_groups = 0
81
+ print_info: n_group_used = 0
82
+ print_info: causal attn = 1
83
+ print_info: pooling type = 0
84
+ print_info: rope type = 2
85
+ print_info: rope scaling = linear
86
+ print_info: freq_base_train = 10000000.0
87
+ print_info: freq_scale_train = 1
88
+ print_info: n_ctx_orig_yarn = 524288
89
+ print_info: rope_finetuned = unknown
90
+ print_info: model type = 36B
91
+ print_info: model params = 36.15 B
92
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
93
+ print_info: vocab type = BPE
94
+ print_info: n_vocab = 155136
95
+ print_info: n_merges = 154737
96
+ print_info: BOS token = 0 '<seed:bos>'
97
+ print_info: EOS token = 2 '<seed:eos>'
98
+ print_info: PAD token = 1 '<seed:pad>'
99
+ print_info: LF token = 326 'Ċ'
100
+ print_info: EOG token = 2 '<seed:eos>'
101
+ print_info: max token length = 1024
102
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
103
+ load_tensors: offloading 20 repeating layers to GPU
104
+ load_tensors: offloaded 20/65 layers to GPU
105
+ load_tensors: CPU_Mapped model buffer size = 13971.93 MiB
106
+ load_tensors: CUDA0 model buffer size = 2664.21 MiB
107
+ load_tensors: CUDA1 model buffer size = 3256.26 MiB
108
+ ..................................................................................................
109
+ llama_context: constructing llama_context
110
+ llama_context: n_seq_max = 1
111
+ llama_context: n_ctx = 2048
112
+ llama_context: n_ctx_seq = 2048
113
+ llama_context: n_batch = 2048
114
+ llama_context: n_ubatch = 512
115
+ llama_context: causal_attn = 1
116
+ llama_context: flash_attn = auto
117
+ llama_context: kv_unified = false
118
+ llama_context: freq_base = 10000000.0
119
+ llama_context: freq_scale = 1
120
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
121
+ llama_context: CPU output buffer size = 0.59 MiB
122
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
123
+ llama_kv_cache: CUDA0 KV buffer size = 72.00 MiB
124
+ llama_kv_cache: CUDA1 KV buffer size = 88.00 MiB
125
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
126
+ llama_context: Flash Attention was auto, set to enabled
127
+ llama_context: CUDA0 compute buffer size = 833.78 MiB
128
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
129
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
130
+ llama_context: graph nodes = 2183
131
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
132
+ common_init_from_params: added <seed:eos> logit bias = -inf
133
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
134
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
135
+
136
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
137
+ perplexity: tokenizing the input ..
138
+ perplexity: tokenization took 49.576 ms
139
+ perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
140
+ perplexity: 6.90 seconds per pass - ETA 1.72 minutes
141
+ [1]6.9757,[2]8.0597,[3]8.4705,[4]8.2112,[5]8.0009,[6]6.7181,[7]5.9191,[8]5.9832,[9]6.2474,[10]6.3077,[11]6.4380,[12]6.7394,[13]6.7657,[14]6.8428,[15]6.8507,
142
+ Final estimate: PPL = 6.8507 +/- 0.16499
143
+
144
+ llama_perf_context_print: load time = 2546.10 ms
145
+ llama_perf_context_print: prompt eval time = 99968.38 ms / 30720 tokens ( 3.25 ms per token, 307.30 tokens per second)
146
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
147
+ llama_perf_context_print: total time = 100472.57 ms / 30721 tokens
148
+ llama_perf_context_print: graphs reused = 0
149
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
150
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 15365 + ( 3569 = 2664 + 72 + 833) + 5180 |
151
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 19844 + ( 3538 = 3256 + 88 + 194) + 741 |
152
+ llama_memory_breakdown_print: | - Host | 14337 = 13971 + 352 + 14 |
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_MXFP4-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_math.log ADDED
@@ -0,0 +1,152 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 19043 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_MXFP4-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
16
+ llama_model_loader: - kv 5: general.size_label str = 36B
17
+ llama_model_loader: - kv 6: general.license str = apache-2.0
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
22
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
24
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
25
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
27
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
28
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
30
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
31
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
34
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
35
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
36
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
37
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
39
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
42
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
43
+ llama_model_loader: - kv 32: general.file_type u32 = 25
44
+ llama_model_loader: - type f32: 321 tensors
45
+ llama_model_loader: - type q5_K: 65 tensors
46
+ llama_model_loader: - type iq4_nl: 385 tensors
47
+ print_info: file format = GGUF V3 (latest)
48
+ print_info: file type = IQ4_NL - 4.5 bpw
49
+ print_info: file size = 19.43 GiB (4.62 BPW)
50
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
51
+ load: printing all EOG tokens:
52
+ load: - 2 ('<seed:eos>')
53
+ load: special tokens cache size = 128
54
+ load: token to piece cache size = 0.9296 MB
55
+ print_info: arch = seed_oss
56
+ print_info: vocab_only = 0
57
+ print_info: n_ctx_train = 524288
58
+ print_info: n_embd = 5120
59
+ print_info: n_embd_inp = 5120
60
+ print_info: n_layer = 64
61
+ print_info: n_head = 80
62
+ print_info: n_head_kv = 8
63
+ print_info: n_rot = 128
64
+ print_info: n_swa = 0
65
+ print_info: is_swa_any = 0
66
+ print_info: n_embd_head_k = 128
67
+ print_info: n_embd_head_v = 128
68
+ print_info: n_gqa = 10
69
+ print_info: n_embd_k_gqa = 1024
70
+ print_info: n_embd_v_gqa = 1024
71
+ print_info: f_norm_eps = 0.0e+00
72
+ print_info: f_norm_rms_eps = 1.0e-06
73
+ print_info: f_clamp_kqv = 0.0e+00
74
+ print_info: f_max_alibi_bias = 0.0e+00
75
+ print_info: f_logit_scale = 0.0e+00
76
+ print_info: f_attn_scale = 0.0e+00
77
+ print_info: n_ff = 27648
78
+ print_info: n_expert = 0
79
+ print_info: n_expert_used = 0
80
+ print_info: n_expert_groups = 0
81
+ print_info: n_group_used = 0
82
+ print_info: causal attn = 1
83
+ print_info: pooling type = 0
84
+ print_info: rope type = 2
85
+ print_info: rope scaling = linear
86
+ print_info: freq_base_train = 10000000.0
87
+ print_info: freq_scale_train = 1
88
+ print_info: n_ctx_orig_yarn = 524288
89
+ print_info: rope_finetuned = unknown
90
+ print_info: model type = 36B
91
+ print_info: model params = 36.15 B
92
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
93
+ print_info: vocab type = BPE
94
+ print_info: n_vocab = 155136
95
+ print_info: n_merges = 154737
96
+ print_info: BOS token = 0 '<seed:bos>'
97
+ print_info: EOS token = 2 '<seed:eos>'
98
+ print_info: PAD token = 1 '<seed:pad>'
99
+ print_info: LF token = 326 'Ċ'
100
+ print_info: EOG token = 2 '<seed:eos>'
101
+ print_info: max token length = 1024
102
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
103
+ load_tensors: offloading 20 repeating layers to GPU
104
+ load_tensors: offloaded 20/65 layers to GPU
105
+ load_tensors: CPU_Mapped model buffer size = 13971.93 MiB
106
+ load_tensors: CUDA0 model buffer size = 2664.21 MiB
107
+ load_tensors: CUDA1 model buffer size = 3256.26 MiB
108
+ ..................................................................................................
109
+ llama_context: constructing llama_context
110
+ llama_context: n_seq_max = 1
111
+ llama_context: n_ctx = 2048
112
+ llama_context: n_ctx_seq = 2048
113
+ llama_context: n_batch = 2048
114
+ llama_context: n_ubatch = 512
115
+ llama_context: causal_attn = 1
116
+ llama_context: flash_attn = auto
117
+ llama_context: kv_unified = false
118
+ llama_context: freq_base = 10000000.0
119
+ llama_context: freq_scale = 1
120
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
121
+ llama_context: CPU output buffer size = 0.59 MiB
122
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
123
+ llama_kv_cache: CUDA0 KV buffer size = 72.00 MiB
124
+ llama_kv_cache: CUDA1 KV buffer size = 88.00 MiB
125
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
126
+ llama_context: Flash Attention was auto, set to enabled
127
+ llama_context: CUDA0 compute buffer size = 833.78 MiB
128
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
129
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
130
+ llama_context: graph nodes = 2183
131
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
132
+ common_init_from_params: added <seed:eos> logit bias = -inf
133
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
134
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
135
+
136
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
137
+ perplexity: tokenizing the input ..
138
+ perplexity: tokenization took 47.332 ms
139
+ perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
140
+ perplexity: 6.93 seconds per pass - ETA 1.83 minutes
141
+ [1]2.7608,[2]2.9054,[3]3.3323,[4]3.5844,[5]4.0966,[6]4.3685,[7]4.5745,[8]4.7023,[9]4.8500,[10]5.0005,[11]5.0791,[12]5.1544,[13]5.2872,[14]5.3968,[15]5.4247,[16]5.4384,
142
+ Final estimate: PPL = 5.4384 +/- 0.11980
143
+
144
+ llama_perf_context_print: load time = 2673.19 ms
145
+ llama_perf_context_print: prompt eval time = 107032.67 ms / 32768 tokens ( 3.27 ms per token, 306.15 tokens per second)
146
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
147
+ llama_perf_context_print: total time = 107563.78 ms / 32769 tokens
148
+ llama_perf_context_print: graphs reused = 0
149
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
150
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 15331 + ( 3569 = 2664 + 72 + 833) + 5213 |
151
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 19844 + ( 3538 = 3256 + 88 + 194) + 741 |
152
+ llama_memory_breakdown_print: | - Host | 14337 = 13971 + 352 + 14 |
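Each perplexity_*.log above ends with a `Final estimate: PPL = ... +/- ...` line, and the `ppl` / `ppl_error` fields in the bench_metrics.json files match those values. A minimal parsing sketch (the regex and the helper name are assumptions, not part of the benchmark tooling):

```python
import re
from pathlib import Path

# Matches e.g. "Final estimate: PPL = 5.4384 +/- 0.11980"
FINAL_PPL = re.compile(r"Final estimate: PPL = ([0-9.]+) \+/- ([0-9.]+)")

def read_ppl(log_path: str) -> dict:
    """Extract the final perplexity and its error bar from a llama-perplexity log."""
    text = Path(log_path).read_text()
    m = FINAL_PPL.search(text)
    if m is None:
        raise ValueError(f"no final PPL line found in {log_path}")
    return {"ppl": float(m.group(1)), "ppl_error": float(m.group(2))}

# Hypothetical usage against the log above:
# read_ppl(".../perplexity_math.log")  -> {"ppl": 5.4384, "ppl_error": 0.1198}
```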
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16/bench_metrics.json ADDED
@@ -0,0 +1,44 @@
1
+ {
2
+ "raw_metrics": {
3
+ "llamabench": {
4
+ "backend": "CUDA",
5
+ "log_path": "Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16/llamabench.md",
6
+ "ngl": "35",
7
+ "raw_row": {
8
+ "backend": "CUDA",
9
+ "model": "seed_oss 36B IQ4_NL - 4.5 bpw",
10
+ "ngl": "35",
11
+ "params": "36.15 B",
12
+ "size": "67.34 GiB",
13
+ "t/s": "11.32 \u00b1 0.12",
14
+ "test": "pp8",
15
+ "tps_value": 11.32
16
+ },
17
+ "test": "pp8",
18
+ "tps": 11.32
19
+ },
20
+ "perplexity": {
21
+ "code": {
22
+ "log_path": "Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16/perplexity_code.log",
23
+ "ppl": 1.4128,
24
+ "ppl_error": 0.00952
25
+ },
26
+ "general": {
27
+ "log_path": "Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16/perplexity_general.log",
28
+ "ppl": 6.8872,
29
+ "ppl_error": 0.16794
30
+ },
31
+ "math": {
32
+ "log_path": "Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16/perplexity_math.log",
33
+ "ppl": 5.4442,
34
+ "ppl_error": 0.12088
35
+ }
36
+ }
37
+ },
38
+ "summary": {
39
+ "avg_prec_loss_pct": 0.0,
40
+ "bench_tps": 11.32,
41
+ "file_size_bytes": 72311397152,
42
+ "file_size_gb": 67.35
43
+ }
44
+ }
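This BF16 build reports `avg_prec_loss_pct: 0.0`, which suggests it is the reference point for the other quants' summaries. The 0.3254 reported for the attn_output_MXFP4 / lm_head_Q5_K variant above is reproduced by averaging the absolute per-domain PPL deviation from these BF16 values; a sketch under that assumed definition:

```python
# Assumed definition of avg_prec_loss_pct: mean absolute PPL deviation (in %)
# from the BF16 reference across the code / general / math perplexity runs.
BASELINE = {"code": 1.4128, "general": 6.8872, "math": 5.4442}  # BF16 reference above
QUANT    = {"code": 1.4176, "general": 6.8507, "math": 5.4384}  # attn_output_MXFP4 ... lm_head_Q5_K

def avg_prec_loss_pct(quant: dict, baseline: dict) -> float:
    losses = [abs(quant[k] - baseline[k]) / baseline[k] * 100 for k in baseline]
    return round(sum(losses) / len(losses), 4)

print(avg_prec_loss_pct(QUANT, BASELINE))  # 0.3254, matching that quant's summary
```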
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16/llamabench.md ADDED
@@ -0,0 +1,11 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ | model | size | params | backend | ngl | test | t/s |
7
+ | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
8
+ | seed_oss 36B IQ4_NL - 4.5 bpw | 67.34 GiB | 36.15 B | CUDA | 35 | pp8 | 11.32 ± 0.12 |
9
+ | seed_oss 36B IQ4_NL - 4.5 bpw | 67.34 GiB | 36.15 B | CUDA | 35 | tg128 | 1.53 ± 0.02 |
10
+
11
+ build: 92bb442ad (7040)
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16/perplexity_code.log ADDED
@@ -0,0 +1,151 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 19670 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23060 MiB free
9
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
16
+ llama_model_loader: - kv 5: general.size_label str = 36B
17
+ llama_model_loader: - kv 6: general.license str = apache-2.0
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
22
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
24
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
25
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
27
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
28
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
30
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
31
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
34
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
35
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
36
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
37
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
39
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
42
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
43
+ llama_model_loader: - kv 32: general.file_type u32 = 25
44
+ llama_model_loader: - type f32: 321 tensors
45
+ llama_model_loader: - type bf16: 450 tensors
46
+ print_info: file format = GGUF V3 (latest)
47
+ print_info: file type = IQ4_NL - 4.5 bpw
48
+ print_info: file size = 67.34 GiB (16.00 BPW)
49
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
50
+ load: printing all EOG tokens:
51
+ load: - 2 ('<seed:eos>')
52
+ load: special tokens cache size = 128
53
+ load: token to piece cache size = 0.9296 MB
54
+ print_info: arch = seed_oss
55
+ print_info: vocab_only = 0
56
+ print_info: n_ctx_train = 524288
57
+ print_info: n_embd = 5120
58
+ print_info: n_embd_inp = 5120
59
+ print_info: n_layer = 64
60
+ print_info: n_head = 80
61
+ print_info: n_head_kv = 8
62
+ print_info: n_rot = 128
63
+ print_info: n_swa = 0
64
+ print_info: is_swa_any = 0
65
+ print_info: n_embd_head_k = 128
66
+ print_info: n_embd_head_v = 128
67
+ print_info: n_gqa = 10
68
+ print_info: n_embd_k_gqa = 1024
69
+ print_info: n_embd_v_gqa = 1024
70
+ print_info: f_norm_eps = 0.0e+00
71
+ print_info: f_norm_rms_eps = 1.0e-06
72
+ print_info: f_clamp_kqv = 0.0e+00
73
+ print_info: f_max_alibi_bias = 0.0e+00
74
+ print_info: f_logit_scale = 0.0e+00
75
+ print_info: f_attn_scale = 0.0e+00
76
+ print_info: n_ff = 27648
77
+ print_info: n_expert = 0
78
+ print_info: n_expert_used = 0
79
+ print_info: n_expert_groups = 0
80
+ print_info: n_group_used = 0
81
+ print_info: causal attn = 1
82
+ print_info: pooling type = 0
83
+ print_info: rope type = 2
84
+ print_info: rope scaling = linear
85
+ print_info: freq_base_train = 10000000.0
86
+ print_info: freq_scale_train = 1
87
+ print_info: n_ctx_orig_yarn = 524288
88
+ print_info: rope_finetuned = unknown
89
+ print_info: model type = 36B
90
+ print_info: model params = 36.15 B
91
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
92
+ print_info: vocab type = BPE
93
+ print_info: n_vocab = 155136
94
+ print_info: n_merges = 154737
95
+ print_info: BOS token = 0 '<seed:bos>'
96
+ print_info: EOS token = 2 '<seed:eos>'
97
+ print_info: PAD token = 1 '<seed:pad>'
98
+ print_info: LF token = 326 'Ċ'
99
+ print_info: EOG token = 2 '<seed:eos>'
100
+ print_info: max token length = 1024
101
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
102
+ load_tensors: offloading 20 repeating layers to GPU
103
+ load_tensors: offloaded 20/65 layers to GPU
104
+ load_tensors: CPU_Mapped model buffer size = 48353.80 MiB
105
+ load_tensors: CUDA0 model buffer size = 10300.86 MiB
106
+ load_tensors: CUDA1 model buffer size = 10300.86 MiB
107
+ ..................................................................................................
108
+ llama_context: constructing llama_context
109
+ llama_context: n_seq_max = 1
110
+ llama_context: n_ctx = 2048
111
+ llama_context: n_ctx_seq = 2048
112
+ llama_context: n_batch = 2048
113
+ llama_context: n_ubatch = 512
114
+ llama_context: causal_attn = 1
115
+ llama_context: flash_attn = auto
116
+ llama_context: kv_unified = false
117
+ llama_context: freq_base = 10000000.0
118
+ llama_context: freq_scale = 1
119
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
120
+ llama_context: CPU output buffer size = 0.59 MiB
121
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
122
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
123
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
124
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
125
+ llama_context: Flash Attention was auto, set to enabled
126
+ llama_context: CUDA0 compute buffer size = 1828.00 MiB
127
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
128
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
129
+ llama_context: graph nodes = 2183
130
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
131
+ common_init_from_params: added <seed:eos> logit bias = -inf
132
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
133
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
134
+
135
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
136
+ perplexity: tokenizing the input ..
137
+ perplexity: tokenization took 112.237 ms
138
+ perplexity: calculating perplexity over 48 chunks, n_ctx=2048, batch_size=2048, n_seq=1
139
+ perplexity: 17.78 seconds per pass - ETA 14.22 minutes
140
+ [1]1.5107,[2]1.4416,[3]1.2762,[4]1.2238,[5]1.1809,[6]1.2685,[7]1.3738,[8]1.4318,[9]1.4155,[10]1.3932,[11]1.3715,[12]1.3774,[13]1.3779,[14]1.3640,[15]1.3454,[16]1.3621,[17]1.3633,[18]1.3450,[19]1.3424,[20]1.3583,[21]1.3485,[22]1.3382,[23]1.3488,[24]1.3431,[25]1.3473,[26]1.3431,[27]1.3609,[28]1.3662,[29]1.3668,[30]1.3675,[31]1.3649,[32]1.3754,[33]1.3757,[34]1.3681,[35]1.3643,[36]1.3595,[37]1.3672,[38]1.3761,[39]1.3676,[40]1.3894,[41]1.3983,[42]1.4012,[43]1.4096,[44]1.4109,[45]1.4046,[46]1.4078,[47]1.4116,[48]1.4128,
141
+ Final estimate: PPL = 1.4128 +/- 0.00952
142
+
143
+ llama_perf_context_print: load time = 7800.56 ms
144
+ llama_perf_context_print: prompt eval time = 840300.57 ms / 98304 tokens ( 8.55 ms per token, 116.99 tokens per second)
145
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
146
+ llama_perf_context_print: total time = 841852.62 ms / 98305 tokens
147
+ llama_perf_context_print: graphs reused = 0
148
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
149
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 6983 + (12208 = 10300 + 80 + 1828) + 4923 |
150
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 12270 + (10574 = 10300 + 80 + 194) + 1279 |
151
+ llama_memory_breakdown_print: | - Host | 48719 = 48353 + 352 + 14 |
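Each of these perplexity logs ends with a `Final estimate: PPL = ... +/- ...` line, which is the value that surfaces in the `bench_metrics.json` files. A minimal sketch of how that figure could be scraped from a log; the helper name and usage path are illustrative, not part of this repo:

```python
import re
from pathlib import Path

# Illustrative helper (not part of this repo): pull the final perplexity
# estimate and its error out of a llama-perplexity log like the ones above.
FINAL_RE = re.compile(r"Final estimate: PPL = ([0-9.]+) \+/- ([0-9.]+)")

def read_final_ppl(log_path: str) -> tuple[float, float]:
    text = Path(log_path).read_text(encoding="utf-8", errors="ignore")
    match = FINAL_RE.search(text)
    if match is None:
        raise ValueError(f"no 'Final estimate' line found in {log_path}")
    return float(match.group(1)), float(match.group(2))

# Hypothetical usage:
# read_final_ppl("perplexity_general.log")  # -> (6.8872, 0.16794)
```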
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16/perplexity_general.log ADDED
@@ -0,0 +1,151 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 19658 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23060 MiB free
9
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
16
+ llama_model_loader: - kv 5: general.size_label str = 36B
17
+ llama_model_loader: - kv 6: general.license str = apache-2.0
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
22
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
24
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
25
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
27
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
28
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
30
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
31
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
34
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
35
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
36
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
37
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
39
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
42
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
43
+ llama_model_loader: - kv 32: general.file_type u32 = 25
44
+ llama_model_loader: - type f32: 321 tensors
45
+ llama_model_loader: - type bf16: 450 tensors
46
+ print_info: file format = GGUF V3 (latest)
47
+ print_info: file type = IQ4_NL - 4.5 bpw
48
+ print_info: file size = 67.34 GiB (16.00 BPW)
49
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
50
+ load: printing all EOG tokens:
51
+ load: - 2 ('<seed:eos>')
52
+ load: special tokens cache size = 128
53
+ load: token to piece cache size = 0.9296 MB
54
+ print_info: arch = seed_oss
55
+ print_info: vocab_only = 0
56
+ print_info: n_ctx_train = 524288
57
+ print_info: n_embd = 5120
58
+ print_info: n_embd_inp = 5120
59
+ print_info: n_layer = 64
60
+ print_info: n_head = 80
61
+ print_info: n_head_kv = 8
62
+ print_info: n_rot = 128
63
+ print_info: n_swa = 0
64
+ print_info: is_swa_any = 0
65
+ print_info: n_embd_head_k = 128
66
+ print_info: n_embd_head_v = 128
67
+ print_info: n_gqa = 10
68
+ print_info: n_embd_k_gqa = 1024
69
+ print_info: n_embd_v_gqa = 1024
70
+ print_info: f_norm_eps = 0.0e+00
71
+ print_info: f_norm_rms_eps = 1.0e-06
72
+ print_info: f_clamp_kqv = 0.0e+00
73
+ print_info: f_max_alibi_bias = 0.0e+00
74
+ print_info: f_logit_scale = 0.0e+00
75
+ print_info: f_attn_scale = 0.0e+00
76
+ print_info: n_ff = 27648
77
+ print_info: n_expert = 0
78
+ print_info: n_expert_used = 0
79
+ print_info: n_expert_groups = 0
80
+ print_info: n_group_used = 0
81
+ print_info: causal attn = 1
82
+ print_info: pooling type = 0
83
+ print_info: rope type = 2
84
+ print_info: rope scaling = linear
85
+ print_info: freq_base_train = 10000000.0
86
+ print_info: freq_scale_train = 1
87
+ print_info: n_ctx_orig_yarn = 524288
88
+ print_info: rope_finetuned = unknown
89
+ print_info: model type = 36B
90
+ print_info: model params = 36.15 B
91
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
92
+ print_info: vocab type = BPE
93
+ print_info: n_vocab = 155136
94
+ print_info: n_merges = 154737
95
+ print_info: BOS token = 0 '<seed:bos>'
96
+ print_info: EOS token = 2 '<seed:eos>'
97
+ print_info: PAD token = 1 '<seed:pad>'
98
+ print_info: LF token = 326 'Ċ'
99
+ print_info: EOG token = 2 '<seed:eos>'
100
+ print_info: max token length = 1024
101
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
102
+ load_tensors: offloading 20 repeating layers to GPU
103
+ load_tensors: offloaded 20/65 layers to GPU
104
+ load_tensors: CPU_Mapped model buffer size = 48353.80 MiB
105
+ load_tensors: CUDA0 model buffer size = 10300.86 MiB
106
+ load_tensors: CUDA1 model buffer size = 10300.86 MiB
107
+ ..................................................................................................
108
+ llama_context: constructing llama_context
109
+ llama_context: n_seq_max = 1
110
+ llama_context: n_ctx = 2048
111
+ llama_context: n_ctx_seq = 2048
112
+ llama_context: n_batch = 2048
113
+ llama_context: n_ubatch = 512
114
+ llama_context: causal_attn = 1
115
+ llama_context: flash_attn = auto
116
+ llama_context: kv_unified = false
117
+ llama_context: freq_base = 10000000.0
118
+ llama_context: freq_scale = 1
119
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
120
+ llama_context: CPU output buffer size = 0.59 MiB
121
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
122
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
123
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
124
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
125
+ llama_context: Flash Attention was auto, set to enabled
126
+ llama_context: CUDA0 compute buffer size = 1828.00 MiB
127
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
128
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
129
+ llama_context: graph nodes = 2183
130
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
131
+ common_init_from_params: added <seed:eos> logit bias = -inf
132
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
133
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
134
+
135
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
136
+ perplexity: tokenizing the input ..
137
+ perplexity: tokenization took 49.672 ms
138
+ perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
139
+ perplexity: 17.80 seconds per pass - ETA 4.45 minutes
140
+ [1]7.1957,[2]8.1195,[3]8.4548,[4]8.2130,[5]8.0074,[6]6.7286,[7]5.9325,[8]5.9903,[9]6.2600,[10]6.3190,[11]6.4561,[12]6.7865,[13]6.8028,[14]6.8780,[15]6.8872,
141
+ Final estimate: PPL = 6.8872 +/- 0.16794
142
+
143
+ llama_perf_context_print: load time = 8091.77 ms
144
+ llama_perf_context_print: prompt eval time = 263005.39 ms / 30720 tokens ( 8.56 ms per token, 116.80 tokens per second)
145
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
146
+ llama_perf_context_print: total time = 263735.66 ms / 30721 tokens
147
+ llama_perf_context_print: graphs reused = 0
148
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
149
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 7242 + (12208 = 10300 + 80 + 1828) + 4663 |
150
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 12270 + (10574 = 10300 + 80 + 194) + 1279 |
151
+ llama_memory_breakdown_print: | - Host | 48719 = 48353 + 352 + 14 |
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16/perplexity_math.log ADDED
@@ -0,0 +1,151 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 19403 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23060 MiB free
9
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
16
+ llama_model_loader: - kv 5: general.size_label str = 36B
17
+ llama_model_loader: - kv 6: general.license str = apache-2.0
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
22
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
24
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
25
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
27
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
28
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
30
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
31
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
34
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
35
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
36
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
37
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
39
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
42
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
43
+ llama_model_loader: - kv 32: general.file_type u32 = 25
44
+ llama_model_loader: - type f32: 321 tensors
45
+ llama_model_loader: - type bf16: 450 tensors
46
+ print_info: file format = GGUF V3 (latest)
47
+ print_info: file type = IQ4_NL - 4.5 bpw
48
+ print_info: file size = 67.34 GiB (16.00 BPW)
49
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
50
+ load: printing all EOG tokens:
51
+ load: - 2 ('<seed:eos>')
52
+ load: special tokens cache size = 128
53
+ load: token to piece cache size = 0.9296 MB
54
+ print_info: arch = seed_oss
55
+ print_info: vocab_only = 0
56
+ print_info: n_ctx_train = 524288
57
+ print_info: n_embd = 5120
58
+ print_info: n_embd_inp = 5120
59
+ print_info: n_layer = 64
60
+ print_info: n_head = 80
61
+ print_info: n_head_kv = 8
62
+ print_info: n_rot = 128
63
+ print_info: n_swa = 0
64
+ print_info: is_swa_any = 0
65
+ print_info: n_embd_head_k = 128
66
+ print_info: n_embd_head_v = 128
67
+ print_info: n_gqa = 10
68
+ print_info: n_embd_k_gqa = 1024
69
+ print_info: n_embd_v_gqa = 1024
70
+ print_info: f_norm_eps = 0.0e+00
71
+ print_info: f_norm_rms_eps = 1.0e-06
72
+ print_info: f_clamp_kqv = 0.0e+00
73
+ print_info: f_max_alibi_bias = 0.0e+00
74
+ print_info: f_logit_scale = 0.0e+00
75
+ print_info: f_attn_scale = 0.0e+00
76
+ print_info: n_ff = 27648
77
+ print_info: n_expert = 0
78
+ print_info: n_expert_used = 0
79
+ print_info: n_expert_groups = 0
80
+ print_info: n_group_used = 0
81
+ print_info: causal attn = 1
82
+ print_info: pooling type = 0
83
+ print_info: rope type = 2
84
+ print_info: rope scaling = linear
85
+ print_info: freq_base_train = 10000000.0
86
+ print_info: freq_scale_train = 1
87
+ print_info: n_ctx_orig_yarn = 524288
88
+ print_info: rope_finetuned = unknown
89
+ print_info: model type = 36B
90
+ print_info: model params = 36.15 B
91
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
92
+ print_info: vocab type = BPE
93
+ print_info: n_vocab = 155136
94
+ print_info: n_merges = 154737
95
+ print_info: BOS token = 0 '<seed:bos>'
96
+ print_info: EOS token = 2 '<seed:eos>'
97
+ print_info: PAD token = 1 '<seed:pad>'
98
+ print_info: LF token = 326 'Ċ'
99
+ print_info: EOG token = 2 '<seed:eos>'
100
+ print_info: max token length = 1024
101
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
102
+ load_tensors: offloading 20 repeating layers to GPU
103
+ load_tensors: offloaded 20/65 layers to GPU
104
+ load_tensors: CPU_Mapped model buffer size = 48353.80 MiB
105
+ load_tensors: CUDA0 model buffer size = 10300.86 MiB
106
+ load_tensors: CUDA1 model buffer size = 10300.86 MiB
107
+ ..................................................................................................
108
+ llama_context: constructing llama_context
109
+ llama_context: n_seq_max = 1
110
+ llama_context: n_ctx = 2048
111
+ llama_context: n_ctx_seq = 2048
112
+ llama_context: n_batch = 2048
113
+ llama_context: n_ubatch = 512
114
+ llama_context: causal_attn = 1
115
+ llama_context: flash_attn = auto
116
+ llama_context: kv_unified = false
117
+ llama_context: freq_base = 10000000.0
118
+ llama_context: freq_scale = 1
119
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
120
+ llama_context: CPU output buffer size = 0.59 MiB
121
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
122
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
123
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
124
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
125
+ llama_context: Flash Attention was auto, set to enabled
126
+ llama_context: CUDA0 compute buffer size = 1828.00 MiB
127
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
128
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
129
+ llama_context: graph nodes = 2183
130
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
131
+ common_init_from_params: added <seed:eos> logit bias = -inf
132
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
133
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
134
+
135
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
136
+ perplexity: tokenizing the input ..
137
+ perplexity: tokenization took 46.673 ms
138
+ perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
139
+ perplexity: 17.81 seconds per pass - ETA 4.73 minutes
140
+ [1]2.6577,[2]2.8378,[3]3.2807,[4]3.5315,[5]4.0764,[6]4.3578,[7]4.5789,[8]4.7049,[9]4.8470,[10]5.0057,[11]5.0877,[12]5.1590,[13]5.2956,[14]5.4047,[15]5.4376,[16]5.4442,
141
+ Final estimate: PPL = 5.4442 +/- 0.12088
142
+
143
+ llama_perf_context_print: load time = 8172.12 ms
144
+ llama_perf_context_print: prompt eval time = 280924.96 ms / 32768 tokens ( 8.57 ms per token, 116.64 tokens per second)
145
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
146
+ llama_perf_context_print: total time = 281977.55 ms / 32769 tokens
147
+ llama_perf_context_print: graphs reused = 0
148
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
149
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 6968 + (12208 = 10300 + 80 + 1828) + 4937 |
150
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 12270 + (10574 = 10300 + 80 + 194) + 1279 |
151
+ llama_memory_breakdown_print: | - Host | 48719 = 48353 + 352 + 14 |
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/bench_metrics.json ADDED
@@ -0,0 +1,44 @@
1
+ {
2
+ "raw_metrics": {
3
+ "llamabench": {
4
+ "backend": "CUDA",
5
+ "log_path": "Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/llamabench.md",
6
+ "ngl": "35",
7
+ "raw_row": {
8
+ "backend": "CUDA",
9
+ "model": "seed_oss 36B IQ4_NL - 4.5 bpw",
10
+ "ngl": "35",
11
+ "params": "36.15 B",
12
+ "size": "18.94 GiB",
13
+ "t/s": "28.49 \u00b1 3.98",
14
+ "test": "pp8",
15
+ "tps_value": 28.49
16
+ },
17
+ "test": "pp8",
18
+ "tps": 28.49
19
+ },
20
+ "perplexity": {
21
+ "code": {
22
+ "log_path": "Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_code.log",
23
+ "ppl": 1.4162,
24
+ "ppl_error": 0.00948
25
+ },
26
+ "general": {
27
+ "log_path": "Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_general.log",
28
+ "ppl": 6.8712,
29
+ "ppl_error": 0.16544
30
+ },
31
+ "math": {
32
+ "log_path": "Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_math.log",
33
+ "ppl": 5.4627,
34
+ "ppl_error": 0.12011
35
+ }
36
+ }
37
+ },
38
+ "summary": {
39
+ "avg_prec_loss_pct": 0.2709,
40
+ "bench_tps": 28.49,
41
+ "file_size_bytes": 20346264352,
42
+ "file_size_gb": 18.95
43
+ }
44
+ }
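The `summary.avg_prec_loss_pct` value above (0.2709) appears consistent with the mean absolute percent change in perplexity across the three tracks relative to the BF16 reference logs earlier in this diff (code 1.4128, general 6.8872, math 5.4442). A sketch of that assumed calculation, labeled as an inference rather than the repo's documented formula:

```python
# Assumed reconstruction of summary.avg_prec_loss_pct: the mean absolute
# percent change in PPL versus the BF16 reference run. This formula is an
# inference from the numbers in this diff, not a documented part of the repo.
BASELINE = {"code": 1.4128, "general": 6.8872, "math": 5.4442}  # BF16 logs above
QUANT    = {"code": 1.4162, "general": 6.8712, "math": 5.4627}  # IQ4_NL values from bench_metrics.json

def avg_prec_loss_pct(baseline, quant):
    losses = [abs(quant[k] - baseline[k]) / baseline[k] * 100.0 for k in baseline]
    return round(sum(losses) / len(losses), 4)

print(avg_prec_loss_pct(BASELINE, QUANT))  # prints 0.2709
```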
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/llamabench.md ADDED
@@ -0,0 +1,11 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ | model | size | params | backend | ngl | test | t/s |
7
+ | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
8
+ | seed_oss 36B IQ4_NL - 4.5 bpw | 18.94 GiB | 36.15 B | CUDA | 35 | pp8 | 28.49 ± 3.98 |
9
+ | seed_oss 36B IQ4_NL - 4.5 bpw | 18.94 GiB | 36.15 B | CUDA | 35 | tg128 | 5.12 ± 0.01 |
10
+
11
+ build: 92bb442ad (7040)
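For completeness, a hypothetical sketch of how the llama-bench markdown rows above could be turned into the `raw_row` records stored in `bench_metrics.json`; the parsing helper is illustrative, not the repo's actual tooling:

```python
# Hypothetical helper (not the repo's actual tooling): parse a llama-bench
# markdown table into records resembling the "raw_row" entries in
# bench_metrics.json.
def parse_llamabench_rows(md_text: str) -> list[dict]:
    lines = [l for l in md_text.splitlines() if l.lstrip().startswith("|")]
    if len(lines) < 3:
        return []
    header = [c.strip() for c in lines[0].strip().strip("|").split("|")]
    rows = []
    for line in lines[2:]:  # skip the |---:| separator row
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        row = dict(zip(header, cells))
        # the "t/s" column looks like "28.49 ± 3.98"; also keep the mean as a float
        row["tps_value"] = float(row["t/s"].split("±")[0])
        rows.append(row)
    return rows
```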
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_code.log ADDED
@@ -0,0 +1,151 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20217 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23060 MiB free
9
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
16
+ llama_model_loader: - kv 5: general.size_label str = 36B
17
+ llama_model_loader: - kv 6: general.license str = apache-2.0
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
22
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
24
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
25
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
27
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
28
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
30
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
31
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
34
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
35
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
36
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
37
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
39
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
42
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
43
+ llama_model_loader: - kv 32: general.file_type u32 = 25
44
+ llama_model_loader: - type f32: 321 tensors
45
+ llama_model_loader: - type iq4_nl: 450 tensors
46
+ print_info: file format = GGUF V3 (latest)
47
+ print_info: file type = IQ4_NL - 4.5 bpw
48
+ print_info: file size = 18.94 GiB (4.50 BPW)
49
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
50
+ load: printing all EOG tokens:
51
+ load: - 2 ('<seed:eos>')
52
+ load: special tokens cache size = 128
53
+ load: token to piece cache size = 0.9296 MB
54
+ print_info: arch = seed_oss
55
+ print_info: vocab_only = 0
56
+ print_info: n_ctx_train = 524288
57
+ print_info: n_embd = 5120
58
+ print_info: n_embd_inp = 5120
59
+ print_info: n_layer = 64
60
+ print_info: n_head = 80
61
+ print_info: n_head_kv = 8
62
+ print_info: n_rot = 128
63
+ print_info: n_swa = 0
64
+ print_info: is_swa_any = 0
65
+ print_info: n_embd_head_k = 128
66
+ print_info: n_embd_head_v = 128
67
+ print_info: n_gqa = 10
68
+ print_info: n_embd_k_gqa = 1024
69
+ print_info: n_embd_v_gqa = 1024
70
+ print_info: f_norm_eps = 0.0e+00
71
+ print_info: f_norm_rms_eps = 1.0e-06
72
+ print_info: f_clamp_kqv = 0.0e+00
73
+ print_info: f_max_alibi_bias = 0.0e+00
74
+ print_info: f_logit_scale = 0.0e+00
75
+ print_info: f_attn_scale = 0.0e+00
76
+ print_info: n_ff = 27648
77
+ print_info: n_expert = 0
78
+ print_info: n_expert_used = 0
79
+ print_info: n_expert_groups = 0
80
+ print_info: n_group_used = 0
81
+ print_info: causal attn = 1
82
+ print_info: pooling type = 0
83
+ print_info: rope type = 2
84
+ print_info: rope scaling = linear
85
+ print_info: freq_base_train = 10000000.0
86
+ print_info: freq_scale_train = 1
87
+ print_info: n_ctx_orig_yarn = 524288
88
+ print_info: rope_finetuned = unknown
89
+ print_info: model type = 36B
90
+ print_info: model params = 36.15 B
91
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
92
+ print_info: vocab type = BPE
93
+ print_info: n_vocab = 155136
94
+ print_info: n_merges = 154737
95
+ print_info: BOS token = 0 '<seed:bos>'
96
+ print_info: EOS token = 2 '<seed:eos>'
97
+ print_info: PAD token = 1 '<seed:pad>'
98
+ print_info: LF token = 326 'Ċ'
99
+ print_info: EOG token = 2 '<seed:eos>'
100
+ print_info: max token length = 1024
101
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
102
+ load_tensors: offloading 20 repeating layers to GPU
103
+ load_tensors: offloaded 20/65 layers to GPU
104
+ load_tensors: CPU_Mapped model buffer size = 13602.24 MiB
105
+ load_tensors: CUDA0 model buffer size = 2897.73 MiB
106
+ load_tensors: CUDA1 model buffer size = 2897.73 MiB
107
+ ..................................................................................................
108
+ llama_context: constructing llama_context
109
+ llama_context: n_seq_max = 1
110
+ llama_context: n_ctx = 2048
111
+ llama_context: n_ctx_seq = 2048
112
+ llama_context: n_batch = 2048
113
+ llama_context: n_ubatch = 512
114
+ llama_context: causal_attn = 1
115
+ llama_context: flash_attn = auto
116
+ llama_context: kv_unified = false
117
+ llama_context: freq_base = 10000000.0
118
+ llama_context: freq_scale = 1
119
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
120
+ llama_context: CPU output buffer size = 0.59 MiB
121
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
122
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
123
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
124
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
125
+ llama_context: Flash Attention was auto, set to enabled
126
+ llama_context: CUDA0 compute buffer size = 739.09 MiB
127
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
128
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
129
+ llama_context: graph nodes = 2183
130
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
131
+ common_init_from_params: added <seed:eos> logit bias = -inf
132
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
133
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
134
+
135
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
136
+ perplexity: tokenizing the input ..
137
+ perplexity: tokenization took 109.985 ms
138
+ perplexity: calculating perplexity over 48 chunks, n_ctx=2048, batch_size=2048, n_seq=1
139
+ perplexity: 6.58 seconds per pass - ETA 5.27 minutes
140
+ [1]1.5682,[2]1.4703,[3]1.2931,[4]1.2377,[5]1.1929,[6]1.2806,[7]1.3861,[8]1.4448,[9]1.4269,[10]1.4035,[11]1.3807,[12]1.3858,[13]1.3863,[14]1.3715,[15]1.3526,[16]1.3677,[17]1.3694,[18]1.3506,[19]1.3482,[20]1.3638,[21]1.3540,[22]1.3438,[23]1.3542,[24]1.3488,[25]1.3517,[26]1.3476,[27]1.3647,[28]1.3700,[29]1.3704,[30]1.3714,[31]1.3686,[32]1.3796,[33]1.3802,[34]1.3726,[35]1.3683,[36]1.3633,[37]1.3711,[38]1.3799,[39]1.3712,[40]1.3935,[41]1.4025,[42]1.4055,[43]1.4140,[44]1.4151,[45]1.4084,[46]1.4113,[47]1.4149,[48]1.4162,
141
+ Final estimate: PPL = 1.4162 +/- 0.00948
142
+
143
+ llama_perf_context_print: load time = 2451.89 ms
144
+ llama_perf_context_print: prompt eval time = 306874.12 ms / 98304 tokens ( 3.12 ms per token, 320.34 tokens per second)
145
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
146
+ llama_perf_context_print: total time = 308445.15 ms / 98305 tokens
147
+ llama_perf_context_print: graphs reused = 0
148
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
149
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16303 + ( 3716 = 2897 + 80 + 739) + 4094 |
150
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 19690 + ( 3171 = 2897 + 80 + 194) + 1262 |
151
+ llama_memory_breakdown_print: | - Host | 13968 = 13602 + 352 + 14 |
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_general.log ADDED
@@ -0,0 +1,151 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20064 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23060 MiB free
9
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
16
+ llama_model_loader: - kv 5: general.size_label str = 36B
17
+ llama_model_loader: - kv 6: general.license str = apache-2.0
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
22
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
24
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
25
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
27
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
28
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
30
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
31
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
34
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
35
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
36
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
37
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
39
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
42
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
43
+ llama_model_loader: - kv 32: general.file_type u32 = 25
44
+ llama_model_loader: - type f32: 321 tensors
45
+ llama_model_loader: - type iq4_nl: 450 tensors
46
+ print_info: file format = GGUF V3 (latest)
47
+ print_info: file type = IQ4_NL - 4.5 bpw
48
+ print_info: file size = 18.94 GiB (4.50 BPW)
49
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
50
+ load: printing all EOG tokens:
51
+ load: - 2 ('<seed:eos>')
52
+ load: special tokens cache size = 128
53
+ load: token to piece cache size = 0.9296 MB
54
+ print_info: arch = seed_oss
55
+ print_info: vocab_only = 0
56
+ print_info: n_ctx_train = 524288
57
+ print_info: n_embd = 5120
58
+ print_info: n_embd_inp = 5120
59
+ print_info: n_layer = 64
60
+ print_info: n_head = 80
61
+ print_info: n_head_kv = 8
62
+ print_info: n_rot = 128
63
+ print_info: n_swa = 0
64
+ print_info: is_swa_any = 0
65
+ print_info: n_embd_head_k = 128
66
+ print_info: n_embd_head_v = 128
67
+ print_info: n_gqa = 10
68
+ print_info: n_embd_k_gqa = 1024
69
+ print_info: n_embd_v_gqa = 1024
70
+ print_info: f_norm_eps = 0.0e+00
71
+ print_info: f_norm_rms_eps = 1.0e-06
72
+ print_info: f_clamp_kqv = 0.0e+00
73
+ print_info: f_max_alibi_bias = 0.0e+00
74
+ print_info: f_logit_scale = 0.0e+00
75
+ print_info: f_attn_scale = 0.0e+00
76
+ print_info: n_ff = 27648
77
+ print_info: n_expert = 0
78
+ print_info: n_expert_used = 0
79
+ print_info: n_expert_groups = 0
80
+ print_info: n_group_used = 0
81
+ print_info: causal attn = 1
82
+ print_info: pooling type = 0
83
+ print_info: rope type = 2
84
+ print_info: rope scaling = linear
85
+ print_info: freq_base_train = 10000000.0
86
+ print_info: freq_scale_train = 1
87
+ print_info: n_ctx_orig_yarn = 524288
88
+ print_info: rope_finetuned = unknown
89
+ print_info: model type = 36B
90
+ print_info: model params = 36.15 B
91
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
92
+ print_info: vocab type = BPE
93
+ print_info: n_vocab = 155136
94
+ print_info: n_merges = 154737
95
+ print_info: BOS token = 0 '<seed:bos>'
96
+ print_info: EOS token = 2 '<seed:eos>'
97
+ print_info: PAD token = 1 '<seed:pad>'
98
+ print_info: LF token = 326 'Ċ'
99
+ print_info: EOG token = 2 '<seed:eos>'
100
+ print_info: max token length = 1024
101
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
102
+ load_tensors: offloading 20 repeating layers to GPU
103
+ load_tensors: offloaded 20/65 layers to GPU
104
+ load_tensors: CPU_Mapped model buffer size = 13602.24 MiB
105
+ load_tensors: CUDA0 model buffer size = 2897.73 MiB
106
+ load_tensors: CUDA1 model buffer size = 2897.73 MiB
107
+ ..................................................................................................
108
+ llama_context: constructing llama_context
109
+ llama_context: n_seq_max = 1
110
+ llama_context: n_ctx = 2048
111
+ llama_context: n_ctx_seq = 2048
112
+ llama_context: n_batch = 2048
113
+ llama_context: n_ubatch = 512
114
+ llama_context: causal_attn = 1
115
+ llama_context: flash_attn = auto
116
+ llama_context: kv_unified = false
117
+ llama_context: freq_base = 10000000.0
118
+ llama_context: freq_scale = 1
119
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
120
+ llama_context: CPU output buffer size = 0.59 MiB
121
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
122
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
123
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
124
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
125
+ llama_context: Flash Attention was auto, set to enabled
126
+ llama_context: CUDA0 compute buffer size = 739.09 MiB
127
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
128
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
129
+ llama_context: graph nodes = 2183
130
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
131
+ common_init_from_params: added <seed:eos> logit bias = -inf
132
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
133
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
134
+
135
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
136
+ perplexity: tokenizing the input ..
137
+ perplexity: tokenization took 50.71 ms
138
+ perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
139
+ perplexity: 6.57 seconds per pass - ETA 1.63 minutes
140
+ [1]7.0271,[2]8.1273,[3]8.5259,[4]8.2753,[5]8.0705,[6]6.7617,[7]5.9479,[8]6.0034,[9]6.2604,[10]6.3227,[11]6.4549,[12]6.7596,[13]6.7839,[14]6.8638,[15]6.8712,
141
+ Final estimate: PPL = 6.8712 +/- 0.16544
142
+
143
+ llama_perf_context_print: load time = 2474.98 ms
144
+ llama_perf_context_print: prompt eval time = 95401.62 ms / 30720 tokens ( 3.11 ms per token, 322.01 tokens per second)
145
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
146
+ llama_perf_context_print: total time = 95899.95 ms / 30721 tokens
147
+ llama_perf_context_print: graphs reused = 0
148
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
149
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16309 + ( 3716 = 2897 + 80 + 739) + 4089 |
150
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 19690 + ( 3171 = 2897 + 80 + 194) + 1262 |
151
+ llama_memory_breakdown_print: | - Host | 13968 = 13602 + 352 + 14 |